Machine Learning Pipeline: Preprocessing & Models #
This page gives a structured overview of data preprocessing and model development concepts.
A complete ML pipeline includes preprocessing, feature engineering, feature selection, and model training.
1. Data Preprocessing Overview #
Raw data is often:
- Noisy
- Incomplete
- Inconsistent
Preprocessing ensures data is suitable for machine learning.
2. Missing Values #
Why they occur
- Sensor errors
- Data collection issues
Methods
- Numerical → Median (robust to outliers)
- Categorical → Mode
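These two strategies can be sketched in a few lines of pure Python. This is a minimal illustration, not a production imputer; the `impute` helper and its example data are hypothetical.

```python
import statistics

def impute(values, strategy):
    """Fill missing entries (None) with the median (numerical) or mode (categorical)."""
    observed = [v for v in values if v is not None]
    fill = statistics.median(observed) if strategy == "median" else statistics.mode(observed)
    return [fill if v is None else v for v in values]

temps = [21.0, None, 19.5, 100.0, 20.5]
print(impute(temps, "median"))   # the median (20.75) is unaffected by the 100.0 outlier
sky = ["sunny", None, "rain", "sunny"]
print(impute(sky, "mode"))       # fills with the most frequent category, "sunny"
```

Note how the 100.0 outlier barely shifts the median fill value, which is exactly why the median is preferred over the mean here.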
3. Outlier Handling #
IQR Method
These formulas define IQR-based outlier detection:
\[ IQR = Q_3 - Q_1 \]
\[ \text{Lower} = Q_1 - 1.5 \times IQR \]
\[ \text{Upper} = Q_3 + 1.5 \times IQR \]
Where:
- \( Q_1 \): First quartile (25th percentile)
- \( Q_3 \): Third quartile (75th percentile)
- \( IQR \): Spread of the middle 50% of the data
Any value:
- below Lower bound → potential outlier
- above Upper bound → potential outlier
The method is robust because quartiles are barely affected by extreme values.
✔ Keeps all data
✔ Simple
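The bounds above can be computed directly. A minimal sketch (quartile conventions vary; this uses linear interpolation between closest ranks, as NumPy does by default):

```python
def iqr_bounds(values):
    """Return (lower, upper) outlier bounds from the IQR formulas."""
    s = sorted(values)
    def quantile(q):
        # linear interpolation between the two closest ranks
        pos = q * (len(s) - 1)
        lo, frac = int(pos), pos - int(pos)
        return s[lo] + frac * (s[min(lo + 1, len(s) - 1)] - s[lo])
    q1, q3 = quantile(0.25), quantile(0.75)
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

data = [10, 12, 11, 13, 12, 95]
lower, upper = iqr_bounds(data)
print([x for x in data if x < lower or x > upper])   # [95]
```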
Isolation Forest
- Detects anomalies using tree structures
- Works on multiple features
✔ Detects complex outliers
✖ Flagged rows are typically dropped, losing data
4. Encoding #
One-Hot Encoding
- Converts categories to binary columns
- No ordering assumption
Ordinal Encoding
- Converts categories to integers
- Can introduce false relationships
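The contrast between the two encodings is easy to see in a small sketch (toy functions and data; libraries such as scikit-learn provide `OneHotEncoder` and `OrdinalEncoder` for real use):

```python
def one_hot(values):
    """One binary column per category; no ordering implied."""
    categories = sorted(set(values))
    return [[int(v == c) for c in categories] for v in values]

def ordinal(values):
    """Map each category to an integer; implies an order that may be false."""
    mapping = {c: i for i, c in enumerate(sorted(set(values)))}
    return [mapping[v] for v in values]

weather = ["rain", "sun", "rain", "cloud"]
print(one_hot(weather))   # columns: cloud, rain, sun
print(ordinal(weather))   # cloud=0, rain=1, sun=2 — but "sun > rain" is meaningless
```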
5. Feature Scaling #
- StandardScaler: Centres data around mean (0) with unit variance. Sensitive to outliers.
- MinMaxScaler: Scales data to a fixed range (usually 0–1). Preserves shape but sensitive to extreme values.
- RobustScaler: Uses median and IQR, making it resistant to outliers — ideal for skewed distributions.
StandardScaler
\[ z = \frac{x - \mu}{\sigma} \]
MinMaxScaler
\[ x' = \frac{x - \min(x)}{\max(x) - \min(x)} \]
RobustScaler
\[ x' = \frac{x - \text{median}(x)}{IQR} \]
✔ Best for noisy data
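The three scaling formulas above, sketched in pure Python for a single feature (scikit-learn's `StandardScaler`, `MinMaxScaler`, and `RobustScaler` apply the same transforms column by column):

```python
import statistics

def standard_scale(xs):
    """z = (x - mean) / std; both statistics shift with every outlier."""
    mu, sigma = statistics.mean(xs), statistics.pstdev(xs)
    return [(x - mu) / sigma for x in xs]

def minmax_scale(xs):
    """x' = (x - min) / (max - min); one extreme value squeezes the rest."""
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]

def robust_scale(xs):
    """x' = (x - median) / IQR; quartiles barely move when outliers appear."""
    q1, _, q3 = statistics.quantiles(xs)   # default n=4 gives the quartiles
    return [(x - statistics.median(xs)) / (q3 - q1) for x in xs]

data = [1.0, 2.0, 3.0, 4.0, 5.0]
print(minmax_scale(data))   # [0.0, 0.25, 0.5, 0.75, 1.0]
```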
6. Feature Engineering #
Examples
- Discomfort Index = Temperature × Humidity
- Pressure × Visibility
- Seasonal encoding using sin/cos
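Seasonal sin/cos encoding deserves a quick illustration: raw month numbers put December and January 11 apart, while the circular encoding makes them neighbours. A minimal sketch (the month-based example is illustrative):

```python
import math

def cyclic_encode(month):
    """Map month 1-12 onto a circle so December and January end up adjacent."""
    angle = 2 * math.pi * (month - 1) / 12
    return (math.sin(angle), math.cos(angle))

# Distance Dec→Jan is small on the circle, unlike |12 - 1| = 11 on the raw scale.
print(math.dist(cyclic_encode(12), cyclic_encode(1)))   # ~0.52
print(math.dist(cyclic_encode(6), cyclic_encode(1)))    # ~1.93
```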
7. Feature Selection #
Mutual Information
- Detects non-linear relationships
Random Forest Importance
- Model-based ranking
8. Model Development #
8.1 k-Nearest Neighbours (k-NN) #
k-NN is a non-parametric, instance-based learning algorithm.
It does not learn explicit parameters but instead stores the training data.
Key Idea #
Prediction is made based on the majority class of the nearest neighbours.
Distance Metric (Euclidean) #
\[ d = \sqrt{\sum_{i=1}^{n}(x_i - y_i)^2} \]
- Measures similarity between data points
- Smaller distance → more similar
Prediction Rule #
\[ \hat{y} = \arg\max_{c} \sum_{i \in N_k(x)} \mathbf{1}(y_i = c) \]
Where:
- \( N_k(x) \): the k nearest neighbours of \( x \)
- \( \mathbf{1} \): indicator function
Key Characteristics #
- No training phase
- Sensitive to scaling → requires normalisation
- Computationally expensive at prediction time
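The distance metric and prediction rule above fit in a few lines. A minimal sketch with a hypothetical two-cluster dataset (real use would go through scikit-learn's `KNeighborsClassifier`):

```python
import math
from collections import Counter

def knn_predict(X_train, y_train, x, k=3):
    """Majority class among the k nearest training points (Euclidean distance)."""
    nearest = sorted(range(len(X_train)), key=lambda i: math.dist(X_train[i], x))[:k]
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]

X = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
y = ["cold", "cold", "cold", "hot", "hot", "hot"]
print(knn_predict(X, y, (2, 2)))   # "cold": its 3 nearest neighbours are the cold cluster
```

The sort over all training points at query time is exactly why prediction is expensive: there is no training phase to amortise the work.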
8.2 Support Vector Machine (SVM) #
SVM finds the optimal hyperplane that maximises the margin between classes.
Decision Boundary #
\[ w \cdot x + b = 0 \]
Margin Constraint #
\[ y_i(w \cdot x_i + b) \geq 1 \]
Optimisation Objective #
\[ \min_{w,b} \frac{1}{2} \|w\|^2 \]
With Hinge Loss #
\[ L = \max(0, 1 - y_i(w \cdot x_i + b)) \]
Key Characteristics #
- Effective in high-dimensional spaces
- Sensitive to feature scaling
- Can use kernels for non-linear boundaries
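The decision function and hinge loss can be evaluated directly. A minimal sketch with a hypothetical fixed weight vector (training the weights is the optimisation problem above, handled in practice by scikit-learn's `SVC`):

```python
def decision(w, b, x):
    """w · x + b: sign gives the class, magnitude relates to margin distance."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def hinge_loss(w, b, x, y):
    """max(0, 1 - y(w·x + b)): zero only when x is correctly outside the margin."""
    return max(0.0, 1.0 - y * decision(w, b, x))

w, b = [1.0, -1.0], 0.0
print(hinge_loss(w, b, [3.0, 1.0], +1))   # 0.0: correctly classified, beyond the margin
print(hinge_loss(w, b, [1.0, 1.0], +1))   # 1.0: sits exactly on the boundary
```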
8.3 Decision Tree #
A Decision Tree splits data recursively based on feature thresholds.
Gini Impurity #
\[ Gini = 1 - \sum_{i=1}^{C} p_i^2 \]
Split Quality #
\[ G_{split} = \frac{N_L}{N}Gini_L + \frac{N_R}{N}Gini_R \]
Information Gain #
\[ Gain = Gini_{parent} - G_{split} \]
Key Characteristics #
- Highly interpretable
- Captures non-linear relationships
- Prone to overfitting
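The three formulas above compose into a split-evaluation step, shown here as a pure-Python sketch with toy labels:

```python
def gini(labels):
    """Gini = 1 - sum of squared class proportions; 0 means a pure node."""
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def split_quality(left, right):
    """Weighted Gini of the two children (lower is better)."""
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

parent = ["rain"] * 4 + ["sun"] * 4
gain = gini(parent) - split_quality(["rain"] * 4, ["sun"] * 4)
print(gain)   # 0.5: a perfect split removes all impurity
```

The tree greedily picks the feature threshold that maximises this gain at every node, which is what makes single trees so easy to overfit.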
8.4 Naïve Bayes (Gaussian) #
Naïve Bayes is a probabilistic classifier based on Bayes’ Theorem.
Bayes Theorem #
\[ P(A \mid B) = \frac{P(B \mid A) P(A)}{P(B)} \]
Naïve Assumption #
Features are conditionally independent given the class.
Gaussian Likelihood #
\[ P(x_i \mid y=c) = \frac{1}{\sqrt{2\pi\sigma_c^2}} \exp\left(-\frac{(x_i - \mu_c)^2}{2\sigma_c^2}\right) \]
Final Prediction #
\[ \hat{y} = \arg\max_c \left( \log P(c) + \sum_{i=1}^{n} \log P(x_i \mid c) \right) \]
Key Characteristics #
- Fast and efficient
- Works well with high-dimensional data
- The independence assumption rarely holds in practice, yet the model often still performs well
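The log-space prediction rule above can be sketched directly. The per-class means, standard deviations, and priors below are hypothetical (fitting them from data is what scikit-learn's `GaussianNB` does):

```python
import math

def log_gaussian(x, mu, sigma):
    """Log of the Gaussian likelihood formula above."""
    return -0.5 * math.log(2 * math.pi * sigma ** 2) - (x - mu) ** 2 / (2 * sigma ** 2)

def predict(x, classes):
    """argmax over log prior + sum of per-feature log likelihoods."""
    def score(c):
        prior, params = classes[c]
        return math.log(prior) + sum(
            log_gaussian(xi, mu, sigma) for xi, (mu, sigma) in zip(x, params))
    return max(classes, key=score)

# hypothetical per-class (mean, std) for two features: temperature, humidity
classes = {
    "rain": (0.5, [(15.0, 3.0), (85.0, 5.0)]),
    "sun":  (0.5, [(28.0, 4.0), (40.0, 10.0)]),
}
print(predict([16.0, 80.0], classes))   # "rain"
```

Summing log likelihoods instead of multiplying raw probabilities avoids numerical underflow when there are many features.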
8.5 Random Forest (Ensemble Model) #
Random Forest is an ensemble of decision trees using:
- Bootstrap sampling
- Random feature selection
- Majority voting
Bootstrap Sampling #
Each tree is trained on a random sample of data with replacement.
Random Feature Selection #
At each split:
\[ m = \sqrt{d} \]
Where:
- \( d \): total number of features
- \( m \): features considered per split
Final Prediction (Voting) #
\[ \hat{y} = \operatorname{mode}(y_1, y_2, ..., y_T) \]
Key Characteristics #
- Reduces overfitting
- Handles non-linear interactions
- Robust to noise
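The three ensemble ingredients are individually simple; a minimal sketch of each (the tree-building step itself is omitted, and the example votes are hypothetical):

```python
import math
import random
from collections import Counter

def bootstrap(rows, rng):
    """Sample len(rows) rows with replacement; each tree trains on its own sample."""
    return [rng.choice(rows) for _ in rows]

def forest_vote(tree_predictions):
    """The ensemble's answer is the mode of the individual tree votes."""
    return Counter(tree_predictions).most_common(1)[0][0]

d = 16
m = round(math.sqrt(d))             # features considered at each split: 4
sample = bootstrap(list(range(10)), random.Random(42))
print(forest_vote(["rain", "sun", "rain"]))   # "rain"
```

Because each tree sees a different bootstrap sample and a different feature subset, the trees' errors are only weakly correlated, and voting averages them away.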
9. Model Comparison Summary #
| Model | Strengths | Weaknesses |
|---|---|---|
| k-NN | Simple, intuitive | Slow at prediction, sensitive to scaling |
| Naïve Bayes | Fast, probabilistic | Strong independence assumption |
| Decision Tree | Interpretable, flexible | Overfitting |
| SVM | Powerful, margin-based | Sensitive to tuning |
| Random Forest | Robust, accurate | Less interpretable |
Key Insight #
Different models capture different structures:
- k-NN → local similarity
- Naïve Bayes → probabilistic independence
- Decision Tree → rule-based splits
- SVM → margin optimisation
- Random Forest → ensemble robustness
Combining them gives a comprehensive understanding of the dataset.
10. Full Pipeline #
Raw Data
↓
Preprocessing
↓
Feature Engineering
↓
Feature Selection
↓
Model Training
↓
Evaluation
Key Takeaways #
- Preprocessing is critical for model success
- Outliers and scaling significantly impact performance
- Different models have different strengths
- Ensemble models often perform best