Ensemble Learning #

Ensemble Learning is a machine learning approach where we combine multiple models to produce a stronger final prediction.

Instead of depending on one model, an ensemble uses a group of models and combines their outputs.

The main idea is simple:

Many weak or moderately good models can work together to produce a better and more stable model.

Key takeaway:
Ensemble Learning improves prediction by combining several models.
It is especially useful when a single model is unstable, overfits, or does not generalise well.

Combining classifiers
Bagging
Random Forest
Boosting
AdaBoost
Gradient Boosting
XGBoost

Why Ensemble Learning Matters #

A single model can make mistakes because of:

noise in the data
overfitting
high variance
weak decision boundaries
limited training examples
complex relationships between features and target

Ensemble methods try to reduce these problems by combining several models.

For example, one decision tree may overfit the training data.

But a group of decision trees, if built with randomness and combined properly, can produce a more reliable prediction.

This is the central idea behind Random Forest.

Ensemble Learning Map #

flowchart LR
    A["Ensemble Learning"] --> B["Combining Classifiers"]
    A --> C["Bagging"]
    A --> D["Boosting"]

    C --> E["Random Forest"]

    D --> F["AdaBoost"]
    D --> G["Gradient Boosting"]
    D --> H["XGBoost"]

    style A fill:#E1F5FE,stroke:#5b7db1,color:#222
    style B fill:#C8E6C9,stroke:#5f8f6a,color:#222
    style C fill:#FFF9C4,stroke:#b59b3b,color:#222
    style D fill:#EDE7F6,stroke:#8a6fb3,color:#222
    style E fill:#FFF9C4,stroke:#b59b3b,color:#222
    style F fill:#EDE7F6,stroke:#8a6fb3,color:#222
    style G fill:#EDE7F6,stroke:#8a6fb3,color:#222
    style H fill:#EDE7F6,stroke:#8a6fb3,color:#222

1. Combining Classifiers ☆ #

Combining classifiers means using multiple classifiers and combining their predictions.

Each classifier gives an output.

The ensemble then decides the final output using a rule such as:

majority voting
weighted voting
averaging
weighted averaging

The purpose is to get a final prediction that is more reliable than the prediction from a single model.

Majority Voting #

In classification, each model votes for a class.

The class with the maximum number of votes becomes the final prediction.

Example:

Model	Prediction
Model 1	Yes
Model 2	No
Model 3	Yes
Model 4	Yes
Model 5	No

Final prediction:

Yes, because it receives the majority vote.

Majority Voting Formula #

Let there be \( M \) classifiers.

Each classifier predicts one class.

The ensemble selects the class that receives the highest number of votes.

\[ \hat{y} = \arg\max_{c} \sum_{m=1}^{M} \mathbf{1}(h_m(x)=c) \]

Where:

\( h_m(x) \) is the prediction of model \( m \)
\( c \) is a possible class
\( \mathbf{1}(h_m(x)=c) \) counts whether model \( m \) voted for class \( c \)

Averaging for Regression #

For regression, models output numerical values.

The final prediction is usually the average of all predictions.

\[ \hat{y} = \frac{1}{M}\sum_{m=1}^{M} h_m(x) \]

Where:

\( M \) is the number of models
\( h_m(x) \) is the prediction of model \( m \)

2. Bagging ☆ #

Bagging stands for Bootstrap Aggregating.

It is an ensemble method where we train multiple models independently on different random samples of the training data.

The models are trained in parallel.

The final output is obtained by voting or averaging.

Intuition Behind Bagging #

Suppose one decision tree overfits.

Instead of relying on one tree, we create many trees.

Each tree sees a slightly different version of the training data.

Because each tree learns a different pattern, their errors may not be identical.

When we combine them, the overall prediction becomes more stable.

Bootstrap Sampling #

In bagging, we create multiple training subsets using sampling with replacement.

This means:

some training examples may appear more than once
some training examples may not appear in a particular subset
each model gets a slightly different training set

\[ D_m \sim \text{Bootstrap}(D) \]

Where:

\( D \) is the original dataset
\( D_m \) is the bootstrap sample for model \( m \)

Bagging Workflow #

flowchart LR
    A["Original Dataset"] --> B["Bootstrap Sample 1"]
    A --> C["Bootstrap Sample 2"]
    A --> D["Bootstrap Sample 3"]

    B --> E["Model 1"]
    C --> F["Model 2"]
    D --> G["Model 3"]

    E --> H["Combine Predictions"]
    F --> H
    G --> H

    H --> I["Final Output"]

    style A fill:#E1F5FE,stroke:#5b7db1,color:#222
    style B fill:#FFF9C4,stroke:#b59b3b,color:#222
    style C fill:#FFF9C4,stroke:#b59b3b,color:#222
    style D fill:#FFF9C4,stroke:#b59b3b,color:#222
    style E fill:#C8E6C9,stroke:#5f8f6a,color:#222
    style F fill:#C8E6C9,stroke:#5f8f6a,color:#222
    style G fill:#C8E6C9,stroke:#5f8f6a,color:#222
    style H fill:#EDE7F6,stroke:#8a6fb3,color:#222
    style I fill:#E1F5FE,stroke:#5b7db1,color:#222

3. Random Forest ☆ #

Random Forest is one of the most important bagging-based ensemble algorithms.

It builds many decision trees and combines their predictions.

The word forest means a collection of trees.

The word random comes from two types of randomness:

Random selection of rows using bootstrap sampling
Random selection of features when splitting nodes

Why Random Forest Is Needed #

A single decision tree can easily overfit.

It may try to fit every detail in the training data, including noise.

Random Forest reduces this risk by combining many trees.

Each tree is trained on a slightly different dataset and considers different feature subsets.

This makes the trees less correlated.

When less-correlated trees vote together, the final prediction is usually more stable.

Random Forest for Classification #

For classification, each tree votes for a class.

The class with the most votes becomes the final prediction.

\[ \hat{y} = \text{mode}\left(h_1(x), h_2(x), \ldots, h_M(x)\right) \]

Random Forest for Regression #

For regression, each tree predicts a number.

The final prediction is the average of all tree predictions.

\[ \hat{y} = \frac{1}{M}\sum_{m=1}^{M} h_m(x) \]

Random Forest Process #

flowchart LR
    A["Training Data"] --> B["Random Rows"]
    A --> C["Random Rows"]
    A --> D["Random Rows"]

    B --> E["Random Feature Subset"]
    C --> F["Random Feature Subset"]
    D --> G["Random Feature Subset"]

    E --> H["Tree 1"]
    F --> I["Tree 2"]
    G --> J["Tree 3"]

    H --> K["Voting or Averaging"]
    I --> K
    J --> K

    K --> L["Random Forest Prediction"]

    style A fill:#E1F5FE,stroke:#5b7db1,color:#222
    style B fill:#FFF9C4,stroke:#b59b3b,color:#222
    style C fill:#FFF9C4,stroke:#b59b3b,color:#222
    style D fill:#FFF9C4,stroke:#b59b3b,color:#222
    style E fill:#C8E6C9,stroke:#5f8f6a,color:#222
    style F fill:#C8E6C9,stroke:#5f8f6a,color:#222
    style G fill:#C8E6C9,stroke:#5f8f6a,color:#222
    style H fill:#EDE7F6,stroke:#8a6fb3,color:#222
    style I fill:#EDE7F6,stroke:#8a6fb3,color:#222
    style J fill:#EDE7F6,stroke:#8a6fb3,color:#222
    style K fill:#E1F5FE,stroke:#5b7db1,color:#222
    style L fill:#C8E6C9,stroke:#5f8f6a,color:#222

Important Random Forest Parameters #

Parameter	Meaning
Number of trees	How many decision trees are built
Maximum depth	Maximum depth of each tree
Minimum samples split	Minimum samples required to split a node
Number of features	How many features are considered at each split
Bootstrap	Whether sampling with replacement is used

Advantages of Random Forest #

Reduces overfitting compared with a single decision tree
Works for classification and regression
Handles non-linear relationships
Works well with tabular data
Gives feature importance
Usually performs well without heavy tuning

Limitations of Random Forest #

Less interpretable than a single decision tree
Can be slower than one tree
May require more memory
Not ideal when real-time low-latency prediction is required

4. Boosting ☆ #

Boosting is another ensemble technique.

Unlike bagging, boosting trains models sequentially.

Each new model tries to correct the mistakes made by the previous models.

The main idea is:

Focus more on the examples that previous models got wrong.

Bagging vs Boosting #

Aspect	Bagging	Boosting
Training style	Parallel	Sequential
Main goal	Reduce variance	Reduce bias and improve weak learners
Data handling	Bootstrap samples	Reweights or focuses on errors
Base models	Usually strong or unstable models like trees	Usually weak learners
Example	Random Forest	AdaBoost, Gradient Boosting, XGBoost

Boosting Workflow #

flowchart LR
    A["Training Data"] --> B["Weak Learner 1"]
    B --> C["Find Mistakes"]
    C --> D["Increase Focus on Mistakes"]
    D --> E["Weak Learner 2"]
    E --> F["Find Remaining Mistakes"]
    F --> G["Weak Learner 3"]
    G --> H["Weighted Combination"]
    H --> I["Final Strong Model"]

    style A fill:#E1F5FE,stroke:#5b7db1,color:#222
    style B fill:#C8E6C9,stroke:#5f8f6a,color:#222
    style C fill:#FFF9C4,stroke:#b59b3b,color:#222
    style D fill:#FFF9C4,stroke:#b59b3b,color:#222
    style E fill:#C8E6C9,stroke:#5f8f6a,color:#222
    style F fill:#FFF9C4,stroke:#b59b3b,color:#222
    style G fill:#C8E6C9,stroke:#5f8f6a,color:#222
    style H fill:#EDE7F6,stroke:#8a6fb3,color:#222
    style I fill:#E1F5FE,stroke:#5b7db1,color:#222

5. AdaBoost ☆ #

AdaBoost stands for Adaptive Boosting.

It is called adaptive because it changes the importance of training examples after each round.

Misclassified examples receive higher weight.

Correctly classified examples receive lower weight.

The next weak learner then focuses more on the difficult examples.

Weak Learner #

A weak learner is a model that performs only slightly better than random guessing.

In AdaBoost, the common weak learner is a decision stump.

A decision stump is a decision tree with only one split.

AdaBoost Intuition #

AdaBoost works like this:

Start with equal weight for all training examples.
Train a weak learner.
Check which examples are misclassified.
Increase the weights of misclassified examples.
Train the next weak learner on the reweighted data.
Combine all weak learners using weighted voting.

Initial Weights #

If there are \( N \) training examples, each example initially receives equal weight.

\[ w_i = \frac{1}{N} \]

Weighted Error #

For learner \( t \) , the weighted error is:

\[ \epsilon_t = \sum_{i=1}^{N} w_i \mathbf{1}\left(y_i \neq h_t(x_i)\right) \]

Where:

\( w_i \) is the weight of example \( i \)
\( y_i \) is the actual label
\( h_t(x_i) \) is the prediction of weak learner \( t \)

Learner Weight #

A weak learner with lower error gets higher importance in the final model.

\[ \alpha_t = \frac{1}{2}\ln\left(\frac{1-\epsilon_t}{\epsilon_t}\right) \]

Interpretation:

If error is low, \( \alpha_t \) is high.
If error is high, \( \alpha_t \) is low.
If the learner is no better than random guessing, it is not useful.

Weight Update #

After each weak learner, the example weights are updated.

\[ w_i \leftarrow w_i \exp\left(-\alpha_t y_i h_t(x_i)\right) \]

Then the weights are normalised so that they sum to 1.

Final AdaBoost Classifier #

The final model is a weighted combination of weak learners.

\[ H(x) = \text{sign}\left(\sum_{t=1}^{T} \alpha_t h_t(x)\right) \]

Where:

\( T \) is the number of weak learners
\( \alpha_t \) is the importance of learner \( t \)
\( h_t(x) \) is the weak learner prediction

AdaBoost Notes #

AdaBoost is sequential.
It gives more attention to misclassified examples.
It combines weak learners into a strong learner.
It is sensitive to noisy data and outliers because difficult points receive higher weight.
The number of learners is a hyperparameter.

6. Gradient Boosting ☆ #

Gradient Boosting is another boosting method.

It combines:

boosting idea
gradient descent idea
loss function minimisation

AdaBoost focuses on changing the weights of misclassified examples.

Gradient Boosting focuses on fitting new models to the errors or residuals of previous models.

Main Idea of Gradient Boosting #

Suppose the current model makes predictions.

Some error remains.

The next model is trained to predict that remaining error.

Then this correction is added to the previous model.

This process continues step by step.

Additive Model #

Gradient Boosting builds the final model as a sum of smaller models.

\[ F_M(x) = F_0(x) + \sum_{m=1}^{M} \gamma_m h_m(x) \]

Where:

\( F_M(x) \) is the final model
\( F_0(x) \) is the initial model
\( h_m(x) \) is the weak learner at step \( m \)
\( \gamma_m \) is the step size or contribution of learner \( m \)

Residuals for Squared Error #

For regression with squared error loss, the residual is:

\[ r_i = y_i - F_{m-1}(x_i) \]

The next weak learner is trained to predict these residuals.

General Gradient Boosting Idea #

For a general loss function, Gradient Boosting uses the negative gradient of the loss.

\[ r_{im} = -\left[\frac{\partial L(y_i, F(x_i))}{\partial F(x_i)}\right]_{F=F_{m-1}} \]

This means the next model is trained in the direction that reduces the loss.

Gradient Boosting Workflow #

flowchart LR
    A["Initial Model"] --> B["Compute Residuals or Gradients"]
    B --> C["Train Weak Learner"]
    C --> D["Update Model"]
    D --> E["Repeat"]
    E --> F["Final Boosted Model"]

    style A fill:#E1F5FE,stroke:#5b7db1,color:#222
    style B fill:#FFF9C4,stroke:#b59b3b,color:#222
    style C fill:#C8E6C9,stroke:#5f8f6a,color:#222
    style D fill:#EDE7F6,stroke:#8a6fb3,color:#222
    style E fill:#FFF9C4,stroke:#b59b3b,color:#222
    style F fill:#E1F5FE,stroke:#5b7db1,color:#222

Important Gradient Boosting Parameters #

Parameter	Meaning
Number of estimators	Number of weak learners added sequentially
Learning rate	Controls how much each new learner contributes
Maximum tree depth	Controls complexity of each weak learner
Loss function	Defines what error the model tries to minimise
Subsample	Uses a fraction of data for each learner, if enabled

7. XGBoost ☆ #

XGBoost stands for Extreme Gradient Boosting.

It is an efficient and regularised implementation of Gradient Boosting.

It is widely used for structured or tabular data problems.

XGBoost became popular because it is fast, accurate, and includes strong regularisation techniques.

Why XGBoost Is Powerful #

XGBoost improves Gradient Boosting by adding:

regularisation
efficient tree construction
handling of missing values
parallel processing
shrinkage using learning rate
column subsampling
better control over overfitting

XGBoost Objective Function #

XGBoost minimises an objective function that contains two parts:

Training loss
Regularisation term

\[ \text{Objective} = \sum_{i=1}^{N} L(y_i, \hat{y}_i) + \sum_{k=1}^{K} \Omega(f_k) \]

Where:

\( L(y_i, \hat{y}_i) \) measures prediction error
\( \Omega(f_k) \) penalises model complexity
\( f_k \) is an individual tree

Regularisation Term #

A common form of tree regularisation is:

\[ \Omega(f) = \gamma T + \frac{1}{2}\lambda \sum_{j=1}^{T} w_j^2 \]

Where:

\( T \) is the number of leaves
\( w_j \) is the score on leaf \( j \)
\( \gamma \) controls penalty for adding leaves
\( \lambda \) controls weight regularisation

Bagging, Random Forest, Boosting, AdaBoost, Gradient Boosting and XGBoost #

Method	Core Idea	Training Style	Common Base Learner	Main Strength
Bagging	Train models on bootstrap samples	Parallel	Decision trees	Reduces variance
Random Forest	Bagging with random feature selection	Parallel	Decision trees	Reduces overfitting of trees
Boosting	Each model fixes previous mistakes	Sequential	Weak learners	Builds strong learner
AdaBoost	Reweights misclassified examples	Sequential	Decision stumps	Focuses on hard examples
Gradient Boosting	Fits new models to residuals or gradients	Sequential	Shallow trees	Minimises loss step by step
XGBoost	Regularised, efficient gradient boosting	Sequential with optimisations	Trees	Accuracy and regularisation

Notes #

Bagging Notes #

Bagging creates multiple datasets using bootstrap sampling.
Models are trained independently.
Final output is voting for classification or averaging for regression.
It mainly reduces variance.
Random Forest is a major bagging-based method.

Random Forest Notes #

Random Forest is a collection of decision trees.
It uses randomness in both rows and features.
Classification uses majority vote.
Regression uses average prediction.
It is less interpretable than a single tree but usually more robust.

Boosting Notes #

Boosting trains models sequentially.
Each model focuses on the mistakes of earlier models.
It can reduce bias and improve weak learners.
It can overfit if too many learners are used or the data has many outliers.

AdaBoost Notes #

AdaBoost increases weights of misclassified examples.
Weak learners with lower error receive higher importance.
Final prediction is a weighted combination of weak learners.
It is sensitive to noisy examples.

Gradient Boosting Notes #

Gradient Boosting fits new learners to residuals or negative gradients.
It directly minimises a chosen loss function.
Learning rate is important.
Smaller learning rate usually needs more learners.

XGBoost Notes #

XGBoost is an advanced version of gradient boosting.
It adds regularisation to control overfitting.
It is widely used for tabular ML tasks.
It often performs very well in competitions and practical ML problems.

Common Mistakes #

Mistake	Correction
Thinking Random Forest and Boosting are the same	Random Forest is bagging-based; Boosting is sequential
Thinking all trees in Random Forest use the same data	Each tree uses random rows and random feature subsets
Thinking AdaBoost trains all learners independently	AdaBoost trains learners sequentially
Ignoring outliers in AdaBoost	AdaBoost may give high weight to noisy points
Using too many boosting rounds without control	Use learning rate, validation, and regularisation
Assuming XGBoost is just ordinary Gradient Boosting	XGBoost includes regularisation and efficiency improvements

Practical Interpretation in ML #

Use Random Forest when:

you want a strong baseline for tabular data
decision trees overfit
interpretability is useful but not the only priority
you want feature importance

Use Boosting when:

you want high predictive accuracy
weak learners need to be improved step by step
you can tune hyperparameters carefully

Use XGBoost when:

the dataset is structured or tabular
performance is important
regularisation and tuning are needed
you want a strong practical ML model

Revision #

Ensemble Learning combines multiple models.
Classifier combination can use majority voting or weighted voting.
Regression ensembles usually use averaging.
Bagging trains models independently on bootstrap samples.
Random Forest is a bagging method using many decision trees.
Random Forest reduces overfitting by using random rows and random features.
Boosting trains models sequentially.
Boosting focuses on mistakes made by previous models.
AdaBoost increases weights of misclassified examples.
Gradient Boosting fits new learners to residuals or negative gradients.
XGBoost is regularised and efficient Gradient Boosting.

Summary #

Ensemble Learning improves model performance by combining multiple learners.
Bagging reduces variance by training models independently on random samples.
Random Forest is a bagging method based on many decision trees.
Boosting builds models sequentially, where each model tries to correct previous mistakes.
AdaBoost, Gradient Boosting, and XGBoost are important boosting methods.

Home | Machine Learning