Machine Learning Workflow

Data is the foundation of any machine learning system. The quality of the data usually matters more than the complexity of the model.

Role of Data #

Data determines:

  • What patterns the model can learn
  • How well it generalises
  • Whether bias or noise is introduced

Bad data → bad model (even with perfect algorithms).

Data Preprocessing and Wrangling #

Raw data is rarely ready for training as-is.

Common preprocessing steps:

  • Handling missing values
  • Removing duplicates
  • Normalisation / standardisation
  • Encoding categorical variables
  • Feature selection

Convert raw data → usable features.
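The steps above can be sketched in plain Python. This is a minimal illustration, not a production pipeline: the column names (`age`, `income`, `city`), the mean-fill strategy, and min-max scaling are all assumptions chosen for the example.

```python
def preprocess(rows):
    """Fill missing ages, drop duplicates, scale income, one-hot encode city.

    `rows` is assumed to be a list of dicts with 'age', 'income', 'city' keys;
    at least one row must have a non-missing age and incomes must not all be equal.
    """
    # 1. Handle missing values: replace None ages with the column mean
    ages = [r["age"] for r in rows if r["age"] is not None]
    mean_age = sum(ages) / len(ages)
    for r in rows:
        if r["age"] is None:
            r["age"] = mean_age

    # 2. Remove exact duplicate rows (dicts are unhashable, so key on sorted items)
    seen, unique = set(), []
    for r in rows:
        key = tuple(sorted(r.items()))
        if key not in seen:
            seen.add(key)
            unique.append(r)

    # 3. Min-max normalisation of income to the range [0, 1]
    incomes = [r["income"] for r in unique]
    lo, hi = min(incomes), max(incomes)
    for r in unique:
        r["income"] = (r["income"] - lo) / (hi - lo)

    # 4. One-hot encode the categorical 'city' column
    cities = sorted({r["city"] for r in unique})
    for r in unique:
        city = r.pop("city")
        for c in cities:
            r[f"city_{c}"] = 1 if city == c else 0
    return unique
```

A real project would typically use pandas and scikit-learn transformers for this, but the logic is the same: raw records in, numeric features out.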

Data Skewness Removal (Sampling) #

Real-world data is often imbalanced.

Example:

  • 95% normal transactions
  • 5% fraud cases

If left untreated, the model becomes biased toward the majority class and may ignore the rare cases entirely.

Common techniques:

  • Oversampling minority class
  • Undersampling majority class
  • Stratified sampling

Goal: Ensure the model learns fairly from all classes.
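Random oversampling, the first technique above, can be sketched in a few lines of plain Python. The `label` key is an illustrative assumption; libraries such as imbalanced-learn provide more sophisticated variants (e.g. SMOTE).

```python
import random

def oversample(rows, label_key="label"):
    """Randomly duplicate minority-class rows until every class
    has as many rows as the largest class."""
    # Group rows by class label
    by_class = {}
    for r in rows:
        by_class.setdefault(r[label_key], []).append(r)

    target = max(len(group) for group in by_class.values())
    balanced = []
    for group in by_class.values():
        balanced.extend(group)
        # Draw extra copies with replacement to reach the target size
        balanced.extend(random.choices(group, k=target - len(group)))
    return balanced
```

With the 95/5 fraud example above, the 5 fraud rows would be resampled up to 95, giving 190 rows in total.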

Model Training #

This is where learning happens.

  • Choose an algorithm
  • Initialise parameters
  • Optimise a loss function
  • Iterate until performance stops improving

Training is iterative, not one-time.
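The four steps above map directly onto a training loop. A minimal sketch, assuming the simplest possible model (linear regression, y ≈ w·x + b) trained with gradient descent on a mean-squared-error loss:

```python
def train(xs, ys, lr=0.01, epochs=2000):
    """Fit y ~ w*x + b by minimising mean squared error with gradient descent."""
    w, b = 0.0, 0.0                       # initialise parameters
    n = len(xs)
    for _ in range(epochs):               # iterate: many passes over the data
        # Gradients of the loss L = (1/n) * sum((w*x + b - y)^2)
        grad_w = (2 / n) * sum((w * x + b - y) * x for x, y in zip(xs, ys))
        grad_b = (2 / n) * sum((w * x + b - y) for x, y in zip(xs, ys))
        w -= lr * grad_w                  # optimise the loss function
        b -= lr * grad_b
    return w, b
```

Real frameworks (scikit-learn, PyTorch, etc.) hide this loop behind a `fit` call, but the structure is the same: choose a model, initialise, compute the loss, update, repeat.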

Model Testing and Performance Metrics #

Training accuracy alone is misleading.

Models must be evaluated on unseen data.

Common metrics:

  • Accuracy
  • Precision
  • Recall
  • F1-score
  • ROC-AUC

This tells us: Does the model generalise beyond training data?
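The first four metrics listed above all derive from counts of true/false positives and negatives. A minimal sketch in plain Python (binary classification, with `positive=1` as an assumed convention):

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall, and F1 from true and predicted labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)

    precision = tp / (tp + fp) if tp + fp else 0.0   # of predicted positives, how many are real?
    recall = tp / (tp + fn) if tp + fn else 0.0      # of real positives, how many were found?
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}
```

On imbalanced data like the fraud example, accuracy alone is misleading (predicting "normal" for everything already scores 95%), which is why precision and recall on the minority class matter.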
