Machine Learning Pipeline: Preprocessing & Models #
This page gives a structured overview of data preprocessing and model development concepts.
A complete ML pipeline includes preprocessing, feature engineering, feature selection, and model training.
1. Data Preprocessing Overview #
Raw data is often:
- Noisy
- Incomplete
- Inconsistent
Preprocessing ensures data is suitable for machine learning.
2. Missing Values #
Why they occur
- Sensor errors
- Data collection issues
Methods
- Numerical → Median (robust to outliers)
- Categorical → Mode
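These two strategies can be sketched in a few lines of pure Python. This is a minimal illustration, not a production imputer; the `impute` helper and its example data are hypothetical.

```python
import statistics

def impute(values, strategy):
    """Fill missing entries (None) with the median (numerical) or mode (categorical)."""
    observed = [v for v in values if v is not None]
    fill = statistics.median(observed) if strategy == "median" else statistics.mode(observed)
    return [fill if v is None else v for v in values]

temps = [21.0, None, 19.5, 100.0, 20.5]
print(impute(temps, "median"))   # the median (20.75) is unaffected by the 100.0 outlier
sky = ["sunny", None, "rain", "sunny"]
print(impute(sky, "mode"))       # fills with the most frequent category, "sunny"
```

Note how the 100.0 outlier barely shifts the median fill value, which is exactly why the median is preferred over the mean here.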
3. Outlier Handling #
IQR Method
These formulas define IQR-based outlier detection:
\[ IQR = Q_3 - Q_1 \]
\[ \text{Lower} = Q_1 - 1.5 \times IQR \]
\[ \text{Upper} = Q_3 + 1.5 \times IQR \]
Where:
- \( Q_1 \): First quartile (25th percentile)
- \( Q_3 \): Third quartile (75th percentile)
- \( IQR \): Spread of the middle 50% of the data
Any value:
- below Lower bound → potential outlier
- above Upper bound → potential outlier
The method is robust because quartiles are barely affected by extreme values.
✔ Keeps all data
✔ Simple
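The bounds above can be computed directly. A minimal sketch (quartile conventions vary; this uses linear interpolation between closest ranks, as NumPy does by default):

```python
def iqr_bounds(values):
    """Return (lower, upper) outlier bounds from the IQR formulas."""
    s = sorted(values)
    def quantile(q):
        # linear interpolation between the two closest ranks
        pos = q * (len(s) - 1)
        lo, frac = int(pos), pos - int(pos)
        return s[lo] + frac * (s[min(lo + 1, len(s) - 1)] - s[lo])
    q1, q3 = quantile(0.25), quantile(0.75)
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

data = [10, 12, 11, 13, 12, 95]
lower, upper = iqr_bounds(data)
print([x for x in data if x < lower or x > upper])   # [95]
```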
Isolation Forest
- Detects anomalies using tree structures
- Works on multiple features
✔ Detects complex outliers
✖ Flagged rows are typically dropped, losing data
4. Encoding #
One-Hot Encoding
- Converts categories to binary columns
- No ordering assumption
Ordinal Encoding
- Converts categories to integers
- Can introduce false relationships
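The contrast between the two encodings is easy to see in a small sketch (toy functions and data; libraries such as scikit-learn provide `OneHotEncoder` and `OrdinalEncoder` for real use):

```python
def one_hot(values):
    """One binary column per category; no ordering implied."""
    categories = sorted(set(values))
    return [[int(v == c) for c in categories] for v in values]

def ordinal(values):
    """Map each category to an integer; implies an order that may be false."""
    mapping = {c: i for i, c in enumerate(sorted(set(values)))}
    return [mapping[v] for v in values]

weather = ["rain", "sun", "rain", "cloud"]
print(one_hot(weather))   # columns: cloud, rain, sun
print(ordinal(weather))   # cloud=0, rain=1, sun=2 — but "sun > rain" is meaningless
```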
5. Feature Scaling #
- StandardScaler: Centres data around mean (0) with unit variance. Sensitive to outliers.
- MinMaxScaler: Scales data to a fixed range (usually 0–1). Preserves shape but sensitive to extreme values.
- RobustScaler: Uses median and IQR, making it resistant to outliers — ideal for skewed distributions.
StandardScaler
\[ z = \frac{x - \mu}{\sigma} \]
MinMaxScaler
\[ x' = \frac{x - \min(x)}{\max(x) - \min(x)} \]
RobustScaler
\[ x' = \frac{x - \text{median}(x)}{IQR} \]
✔ Best for noisy data
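The three scaling formulas above, sketched in pure Python for a single feature (scikit-learn's `StandardScaler`, `MinMaxScaler`, and `RobustScaler` apply the same transforms column by column):

```python
import statistics

def standard_scale(xs):
    """z = (x - mean) / std; both statistics shift with every outlier."""
    mu, sigma = statistics.mean(xs), statistics.pstdev(xs)
    return [(x - mu) / sigma for x in xs]

def minmax_scale(xs):
    """x' = (x - min) / (max - min); one extreme value squeezes the rest."""
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]

def robust_scale(xs):
    """x' = (x - median) / IQR; quartiles barely move when outliers appear."""
    q1, _, q3 = statistics.quantiles(xs)   # default n=4 gives the quartiles
    return [(x - statistics.median(xs)) / (q3 - q1) for x in xs]

data = [1.0, 2.0, 3.0, 4.0, 5.0]
print(minmax_scale(data))   # [0.0, 0.25, 0.5, 0.75, 1.0]
```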
6. Feature Engineering #
Examples
- Discomfort Index = Temperature × Humidity
- Pressure × Visibility
- Seasonal encoding using sin/cos
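Seasonal sin/cos encoding deserves a quick illustration: raw month numbers put December and January 11 apart, while the circular encoding makes them neighbours. A minimal sketch (the month-based example is illustrative):

```python
import math

def cyclic_encode(month):
    """Map month 1-12 onto a circle so December and January end up adjacent."""
    angle = 2 * math.pi * (month - 1) / 12
    return (math.sin(angle), math.cos(angle))

# Distance Dec→Jan is small on the circle, unlike |12 - 1| = 11 on the raw scale.
print(math.dist(cyclic_encode(12), cyclic_encode(1)))   # ~0.52
print(math.dist(cyclic_encode(6), cyclic_encode(1)))    # ~1.93
```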
7. Feature Selection #
Mutual Information
- Detects non-linear relationships
Random Forest Importance
- Model-based ranking
8. Model Development #
8.1 k-Nearest Neighbours (k-NN) #
k-NN is a non-parametric, instance-based learning algorithm.
It does not learn explicit parameters but instead stores the training data.
Key Idea #
Prediction is made based on the majority class of the nearest neighbours.
Distance Metric (Euclidean) #
\[ d = \sqrt{\sum_{i=1}^{n}(x_i - y_i)^2} \]
- Measures similarity between data points
- Smaller distance → more similar
Prediction Rule #
\[ \hat{y} = \arg\max_{c} \sum_{i \in N_k(x)} \mathbf{1}(y_i = c) \]
Where:
- \( N_k(x) \): the k nearest neighbours of \( x \)
- \( \mathbf{1} \): indicator function
Key Characteristics #
- No training phase
- Sensitive to scaling → requires normalisation
- Computationally expensive at prediction time
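The distance metric and prediction rule above fit in a few lines. A minimal sketch with a hypothetical two-cluster dataset (real use would go through scikit-learn's `KNeighborsClassifier`):

```python
import math
from collections import Counter

def knn_predict(X_train, y_train, x, k=3):
    """Majority class among the k nearest training points (Euclidean distance)."""
    nearest = sorted(range(len(X_train)), key=lambda i: math.dist(X_train[i], x))[:k]
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]

X = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
y = ["cold", "cold", "cold", "hot", "hot", "hot"]
print(knn_predict(X, y, (2, 2)))   # "cold": its 3 nearest neighbours are the cold cluster
```

The sort over all training points at query time is exactly why prediction is expensive: there is no training phase to amortise the work.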
8.2 Support Vector Machine (SVM) #
SVM finds the optimal hyperplane that maximises the margin between classes.
Decision Boundary #
\[ w \cdot x + b = 0 \]
Margin Constraint #
\[ y_i(w \cdot x_i + b) \geq 1 \]
Optimisation Objective #
\[ \min_{w,b} \frac{1}{2} \|w\|^2 \]
With Hinge Loss #
\[ L = \max(0, 1 - y_i(w \cdot x_i + b)) \]
Key Characteristics #
- Effective in high-dimensional spaces
- Sensitive to feature scaling
- Can use kernels for non-linear boundaries
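The decision function and hinge loss can be evaluated directly. A minimal sketch with a hypothetical fixed weight vector (training the weights is the optimisation problem above, handled in practice by scikit-learn's `SVC`):

```python
def decision(w, b, x):
    """w · x + b: sign gives the class, magnitude relates to margin distance."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def hinge_loss(w, b, x, y):
    """max(0, 1 - y(w·x + b)): zero only when x is correctly outside the margin."""
    return max(0.0, 1.0 - y * decision(w, b, x))

w, b = [1.0, -1.0], 0.0
print(hinge_loss(w, b, [3.0, 1.0], +1))   # 0.0: correctly classified, beyond the margin
print(hinge_loss(w, b, [1.0, 1.0], +1))   # 1.0: sits exactly on the boundary
```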
8.3 Decision Tree #
A Decision Tree splits data recursively based on feature thresholds.
Gini Impurity #
\[ Gini = 1 - \sum_{i=1}^{C} p_i^2 \]
Split Quality #
\[ G_{split} = \frac{N_L}{N}Gini_L + \frac{N_R}{N}Gini_R \]
Information Gain #
\[ Gain = Gini_{parent} - G_{split} \]
Key Characteristics #
- Highly interpretable
- Captures non-linear relationships
- Prone to overfitting
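The three formulas above compose into a split-evaluation step, shown here as a pure-Python sketch with toy labels:

```python
def gini(labels):
    """Gini = 1 - sum of squared class proportions; 0 means a pure node."""
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def split_quality(left, right):
    """Weighted Gini of the two children (lower is better)."""
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

parent = ["rain"] * 4 + ["sun"] * 4
gain = gini(parent) - split_quality(["rain"] * 4, ["sun"] * 4)
print(gain)   # 0.5: a perfect split removes all impurity
```

The tree greedily picks the feature threshold that maximises this gain at every node, which is what makes single trees so easy to overfit.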
8.4 Naïve Bayes (Gaussian) #
Naïve Bayes is a probabilistic classifier based on Bayes’ Theorem.
Bayes Theorem #
\[ P(A \mid B) = \frac{P(B \mid A) P(A)}{P(B)} \]
Naïve Assumption #
Features are conditionally independent given the class.
Gaussian Likelihood #
\[ P(x_i \mid y=c) = \frac{1}{\sqrt{2\pi\sigma_c^2}} \exp\left(-\frac{(x_i - \mu_c)^2}{2\sigma_c^2}\right) \]
Final Prediction #
\[ \hat{y} = \arg\max_c \left( \log P(c) + \sum_{i=1}^{n} \log P(x_i \mid c) \right) \]
Key Characteristics #
- Fast and efficient
- Works well with high-dimensional data
- The independence assumption rarely holds in practice, yet the model often still performs well
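The log-space prediction rule above can be sketched directly. The per-class means, standard deviations, and priors below are hypothetical (fitting them from data is what scikit-learn's `GaussianNB` does):

```python
import math

def log_gaussian(x, mu, sigma):
    """Log of the Gaussian likelihood formula above."""
    return -0.5 * math.log(2 * math.pi * sigma ** 2) - (x - mu) ** 2 / (2 * sigma ** 2)

def predict(x, classes):
    """argmax over log prior + sum of per-feature log likelihoods."""
    def score(c):
        prior, params = classes[c]
        return math.log(prior) + sum(
            log_gaussian(xi, mu, sigma) for xi, (mu, sigma) in zip(x, params))
    return max(classes, key=score)

# hypothetical per-class (mean, std) for two features: temperature, humidity
classes = {
    "rain": (0.5, [(15.0, 3.0), (85.0, 5.0)]),
    "sun":  (0.5, [(28.0, 4.0), (40.0, 10.0)]),
}
print(predict([16.0, 80.0], classes))   # "rain"
```

Summing log likelihoods instead of multiplying raw probabilities avoids numerical underflow when there are many features.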
8.5 Random Forest (Ensemble Model) #
Random Forest is an ensemble of decision trees using:
- Bootstrap sampling
- Random feature selection
- Majority voting
Bootstrap Sampling #
Each tree is trained on a random sample of data with replacement.
Random Feature Selection #
At each split:
\[ m = \sqrt{d} \]
Where:
- \( d \): total number of features
- \( m \): features considered per split
Final Prediction (Voting) #
\[ \hat{y} = \operatorname{mode}(y_1, y_2, ..., y_T) \]
Key Characteristics #
- Reduces overfitting
- Handles non-linear interactions
- Robust to noise
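The three ensemble ingredients are individually simple; a minimal sketch of each (the tree-building step itself is omitted, and the example votes are hypothetical):

```python
import math
import random
from collections import Counter

def bootstrap(rows, rng):
    """Sample len(rows) rows with replacement; each tree trains on its own sample."""
    return [rng.choice(rows) for _ in rows]

def forest_vote(tree_predictions):
    """The ensemble's answer is the mode of the individual tree votes."""
    return Counter(tree_predictions).most_common(1)[0][0]

d = 16
m = round(math.sqrt(d))             # features considered at each split: 4
sample = bootstrap(list(range(10)), random.Random(42))
print(forest_vote(["rain", "sun", "rain"]))   # "rain"
```

Because each tree sees a different bootstrap sample and a different feature subset, the trees' errors are only weakly correlated, and voting averages them away.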
9. Model Comparison Summary #
| Model | Strengths | Weaknesses |
|---|---|---|
| k-NN | Simple, intuitive | Slow at prediction, sensitive to scaling |
| Naïve Bayes | Fast, probabilistic | Strong independence assumption |
| Decision Tree | Interpretable, flexible | Overfitting |
| SVM | Powerful, margin-based | Sensitive to tuning |
| Random Forest | Robust, accurate | Less interpretable |
Key Insight #
Different models capture different structures:
- k-NN → local similarity
- Naïve Bayes → probabilistic independence
- Decision Tree → rule-based splits
- SVM → margin optimisation
- Random Forest → ensemble robustness
Combining them gives a comprehensive understanding of the dataset.
10. Full Pipeline #
Raw Data
↓
Preprocessing
↓
Feature Engineering
↓
Feature Selection
↓
Model Training
↓
Evaluation
Key Takeaways #
- Preprocessing is critical for model success
- Outliers and scaling significantly impact performance
- Different models have different strengths
- Ensemble models often perform best