AI

Attention Mechanism

Attention Mechanism #

Attention is a deep learning mechanism that allows a model to focus on the most relevant parts of an input sequence when producing an output.

Instead of compressing the whole input into one fixed vector, attention computes a weighted combination of the input representations, giving larger weights to the parts that are most relevant to the current output.

Key takeaway:
Attention answers a simple question:

For the current prediction, which input tokens should the model focus on most?
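As a rough illustration of "weighted combination", here is a minimal sketch of attention for a single query, using the scaled dot product as the similarity score. The function names `softmax` and `attend` and all numbers are made up for this example; real models work on batched tensors with learned projections.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Scaled dot-product attention for a single query vector."""
    d = len(query)
    # Similarity of the query to every key, scaled by sqrt(d).
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    weights = softmax(scores)
    # Weighted combination of the value vectors.
    dim = len(values[0])
    output = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)]
    return weights, output

# Toy input: three tokens with 2-dimensional keys and values (made-up numbers).
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]
query = [2.0, 0.0]

weights, output = attend(query, keys, values)
print(weights)  # the first and third tokens receive the largest weights
print(output)
```

The weights answer the question above directly: each one says how much the corresponding input token contributes to the output for this query.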

  • Queries, Keys, and Values
  • Attention Pooling by Similarity
  • Attention Pooling via Nadaraya–Watson Regression
  • Attention Scoring Functions
  • Dot Product Attention
  • Convenience Functions
  • Scaled Dot Product Attention
  • Additive Attention
  • Bahdanau Attention Mechanism
  • Multi-Head Attention
  • Self-Attention
  • Positional Encoding

Why Attention Is Needed ☆ #

Traditional encoder-decoder RNN models compress the full input sequence into one context vector.

Bayesian Learning

Bayesian Learning #

Bayesian Learning is a probabilistic approach to machine learning.

Instead of only asking, “Which output should the model predict?”, Bayesian Learning asks:

Given the data we have observed, how likely is each hypothesis, class, or parameter value?

This makes Bayesian Learning useful when uncertainty matters.

It is especially important in classification, probabilistic modelling, generative models, and situations where we want to combine prior knowledge with observed data.
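The idea of combining prior knowledge with observed data can be sketched with Bayes' rule over a small set of hypotheses. The function name `posterior`, the class names, and the probabilities below are hypothetical and chosen only for illustration.

```python
def posterior(priors, likelihoods):
    """Bayes' rule: P(h | data) is proportional to P(data | h) * P(h)."""
    unnormalised = {h: priors[h] * likelihoods[h] for h in priors}
    evidence = sum(unnormalised.values())  # P(data)
    return {h: p / evidence for h, p in unnormalised.items()}

# Hypothetical numbers: prior belief about two classes, and how likely
# the observed data is under each class.
priors = {"spam": 0.2, "ham": 0.8}
likelihoods = {"spam": 0.6, "ham": 0.1}

post = posterior(priors, likelihoods)
print(post)
```

Although "ham" starts with the higher prior, the data is much more likely under "spam", so the posterior for "spam" rises to 0.6.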

Ensemble Learning

Ensemble Learning #

Ensemble Learning is a machine learning approach where we combine multiple models to produce a stronger final prediction.

Instead of depending on one model, an ensemble uses a group of models and combines their outputs.

The main idea is simple:

Many weak or moderately good models can work together to produce a better and more stable model.

Key takeaway:
Ensemble Learning improves prediction by combining several models.
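One simple way to combine outputs is majority voting. The sketch below assumes three hypothetical threshold classifiers; `majority_vote` is an illustrative helper, not a library function, and other combination rules (averaging, weighting, stacking) are equally valid.

```python
from collections import Counter

def majority_vote(models, x):
    """Return the label predicted by the most models for input x."""
    votes = [model(x) for model in models]
    return Counter(votes).most_common(1)[0][0]

# Three hypothetical weak classifiers, each a simple threshold rule.
models = [
    lambda x: "positive" if x > 0.3 else "negative",
    lambda x: "positive" if x > 0.5 else "negative",
    lambda x: "positive" if x > 0.7 else "negative",
]

print(majority_vote(models, 0.6))  # two of the three models vote "positive"
```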

Transformer

Transformer #

A transformer is a neural network architecture that uses attention as its main mechanism for processing sequences.

Unlike RNNs, transformers do not process tokens one by one.

They process many tokens in parallel and use self-attention to learn relationships between tokens.

  • A transformer is a neural network architecture.

  • It is based on the multi-head attention mechanism.

  • Text is converted to numerical representations called tokens, and each token is converted into a vector via lookup from a word embedding table.
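The token-to-vector lookup can be sketched as follows. The vocabulary, the embedding values, and the whitespace tokenisation are all assumptions made for this example; real transformers typically use subword tokenisers and learned embedding tables.

```python
# Hypothetical vocabulary and embedding table (made-up values).
vocab = {"the": 0, "cat": 1, "sat": 2}
embedding_table = [
    [0.1, 0.3],   # vector for token id 0
    [0.7, -0.2],  # vector for token id 1
    [-0.4, 0.9],  # vector for token id 2
]

def embed(text):
    """Split text on whitespace, map words to token ids, then look up vectors."""
    token_ids = [vocab[word] for word in text.lower().split()]
    return token_ids, [embedding_table[i] for i in token_ids]

token_ids, vectors = embed("The cat sat")
print(token_ids)
print(vectors)
```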

Optimisation of Deep models

Optimisation of Deep models #

Optimisers are algorithms that update neural network parameters to reduce the loss function.

Deep networks are highly non-linear and usually have millions or billions of parameters, so the loss generally has no closed-form minimiser.

Instead, training uses iterative optimisation.

Key takeaway:
An optimiser decides how the model moves through the loss landscape towards lower loss.


  • Goal of Optimization
  • Optimization Challenges in Deep Learning
  • Gradient Descent
  • Stochastic Gradient Descent
  • Minibatch Stochastic Gradient Descent
  • Momentum
  • Adagrad and Algorithm
  • RMSProp and Algorithm
  • Adadelta and Algorithm
  • Adam and Algorithm
  • Code Implementation and comparison of algorithms (webinar)

flowchart TD
    A["Optimisers in DNN"] --> B["Gradient Descent Variants"]
    A --> C["Momentum-based Optimiser"]
    A --> D["Adaptive Methods"]
    A --> E["Learning Rate Schedules"]

    D --> D1["Parameter-specific learning rates"]

    E --> E1["Learning rate changes during training"]

    style A fill:#E1F5FE,stroke:#4A90E2,stroke-width:2px
    style B fill:#EDE7F6,stroke:#7E57C2
    style C fill:#C8E6C9,stroke:#43A047
    style D fill:#FFF9C4,stroke:#FBC02D
    style E fill:#F8BBD0,stroke:#D81B60

Goal of Optimisation ☆ #

The goal is to find parameters \( \theta \) that minimise the loss.
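As a compact statement of this goal, writing \( L(\theta) \) for the loss (the symbols \( L \) and \( \theta^{*} \) are notation assumed here, not taken from the text above):

\[
\theta^{*} = \arg\min_{\theta} L(\theta)
\]

In practice, training usually settles for parameters with low loss rather than the exact minimiser.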

Unsupervised Learning

Unsupervised Learning #

Unsupervised Learning is used when we have input data but no target labels.

The model is not told the correct answer. Instead, it tries to discover hidden structure in the data.

  • K-means Clustering and variants
  • Review of EM algorithm
  • GMM based Soft Clustering
  • Applications

Supervised vs Unsupervised Learning #

| Aspect | Supervised Learning | Unsupervised Learning |
|---|---|---|
| Data contains target label? | Yes | No |
| Learns from | Input-output pairs | Input features only |
| Main goal | Predict output | Discover structure |
| Example task | Classification, regression | Clustering |
| Example algorithm | Logistic regression, decision tree | K-means, GMM |

  • Works on unlabelled raw data.
  • The algorithm discovers hidden patterns without prior knowledge of outcomes.
  • Requires no human-provided labels during training.
  • Does not predict a target directly; it groups or organises the data instead.
  • Is harder to validate, because there is no ground truth to check the results against.
  • Common techniques include Clustering, Association, and Dimensionality Reduction.

The most common example is clustering, where similar records are grouped together.
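To make the clustering idea concrete, here is a minimal sketch of k-means on one-dimensional data. The function name `kmeans_1d`, the data points, and the starting centroids are assumptions for illustration; practical implementations handle many dimensions and choose initial centroids more carefully.

```python
def kmeans_1d(points, centroids, iterations=10):
    """Plain k-means on 1-D data: assign to nearest centroid, then re-average."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Keep the old centroid if a cluster ends up empty.
        centroids = [
            sum(c) / len(c) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters

# Made-up data with two visible groups, and two arbitrary starting centroids.
points = [1.0, 1.2, 0.8, 9.0, 9.5, 10.5]
centroids, clusters = kmeans_1d(points, centroids=[0.0, 5.0])
print(centroids)
print(clusters)
```

No labels are supplied anywhere: the grouping emerges only from distances between the records.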

Gradient Descent and Mini-Batch Gradient Descent

Optimisation: Gradient Descent and Mini-Batch Gradient Descent #

Gradient descent is the core optimisation idea behind neural network training. It updates the model parameters by moving in the opposite direction of the gradient of the loss.

Key takeaway:
Gradient descent uses the gradient to decide how to change the parameters. The learning rate controls how large each update step is.
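The following is a minimal sketch of mini-batch gradient descent, assuming a one-parameter model `y = w * x` with squared error. The function name `minibatch_sgd`, the data, and the hyperparameter values are illustrative choices, not values from the text.

```python
import random

def minibatch_sgd(data, w=0.0, learning_rate=0.05, batch_size=2, epochs=200, seed=0):
    """Fit y = w * x by mini-batch gradient descent on squared error."""
    rng = random.Random(seed)
    data = list(data)
    for _ in range(epochs):
        rng.shuffle(data)  # visit the examples in a new order each epoch
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]
            # Gradient of the mean of (w * x - y)^2 over the batch.
            grad = sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)
            # The learning rate scales how far each update moves w.
            w -= learning_rate * grad
    return w

# Noise-free toy data generated from y = 2 * x.
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
w = minibatch_sgd(data)
print(w)  # approaches 2.0
```

In this sketch, setting `batch_size=1` gives stochastic gradient descent and `batch_size=len(data)` gives batch gradient descent.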


flowchart TD
    A["Gradient Descent Variants"] --> B["Batch Gradient Descent"]
    A --> C["Stochastic Gradient Descent"]
    A --> D["Mini-batch Gradient Descent"]

    B --> B1["Uses full dataset"]
    B --> B2["One update per epoch"]
    B --> B3["Smooth but slow"]

    C --> C1["Uses one example at a time"]
    C --> C2["Frequent updates"]
    C --> C3["Fast but noisy"]

    D --> D1["Uses small batches"]
    D --> D2["Efficient on hardware"]
    D --> D3["Balanced and practical"]

    style A fill:#E1F5FE,stroke:#4A90E2,stroke-width:2px
    style B fill:#EDE7F6,stroke:#7E57C2
    style C fill:#C8E6C9,stroke:#43A047
    style D fill:#FFF9C4,stroke:#FBC02D

Gradient Descent Rule ☆ #

The gradient tells us the direction in which the loss increases fastest. To reduce the loss, we move in the opposite direction.
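A minimal sketch of this rule on a toy loss, assuming \( L(\theta) = (\theta - 3)^2 \); the function name `gradient_descent`, the learning rate, and the step count are illustrative.

```python
def gradient_descent(grad, theta, learning_rate=0.1, steps=50):
    """Repeatedly step against the gradient: theta <- theta - lr * grad(theta)."""
    for _ in range(steps):
        theta = theta - learning_rate * grad(theta)
    return theta

# Toy loss L(theta) = (theta - 3)^2, whose gradient is 2 * (theta - 3).
grad = lambda theta: 2.0 * (theta - 3.0)

theta = gradient_descent(grad, theta=0.0)
print(theta)  # close to the minimiser theta = 3
```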

Momentum Methods

Optimisation: Momentum Methods #

Momentum improves gradient descent by adding a memory of previous update directions. Instead of using only the current gradient, the optimiser accumulates velocity across iterations.

Key takeaway:
Momentum helps the optimiser move faster in consistent directions and reduces zigzag movement in directions where gradients oscillate.
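A minimal sketch of SGD with momentum on a toy quadratic loss. The name `sgd_momentum`, the coefficient `beta=0.9`, and the other values are assumptions for illustration; libraries differ slightly in how they write the velocity update.

```python
def sgd_momentum(grad, theta, learning_rate=0.1, beta=0.9, steps=300):
    """Gradient descent with a velocity term that accumulates past gradients."""
    velocity = 0.0
    for _ in range(steps):
        # The new velocity keeps a fraction beta of the previous update
        # and adds a step against the current gradient.
        velocity = beta * velocity - learning_rate * grad(theta)
        theta = theta + velocity
    return theta

# Toy loss L(theta) = (theta - 3)^2, whose gradient is 2 * (theta - 3).
grad = lambda theta: 2.0 * (theta - 3.0)

theta = sgd_momentum(grad, theta=0.0)
print(theta)  # close to the minimiser theta = 3
```

With `beta=0.0` the velocity carries no memory and the update reduces to plain gradient descent.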


flowchart TD
    A["Momentum-based Optimiser"] --> B["SGD with Momentum"]

    B --> B1["Adds velocity term"]
    B --> B2["Accumulates past gradients"]
    B --> B3["Reduces zig-zag movement"]
    B --> B4["Speeds up movement in useful direction"]
    B --> B5["Helps through shallow regions"]

    B1 --> C1["Current update depends on previous update"]
    B2 --> C2["Builds inertia"]
    B3 --> C3["Smoother path to minimum"]

    style A fill:#C8E6C9,stroke:#43A047,stroke-width:2px
    style B fill:#E1F5FE,stroke:#4A90E2
    style B1 fill:#EDE7F6,stroke:#7E57C2
    style B2 fill:#FFF9C4,stroke:#FBC02D
    style B3 fill:#F8BBD0,stroke:#D81B60
    style B4 fill:#EDE7F6,stroke:#7E57C2
    style B5 fill:#FFF9C4,stroke:#FBC02D

Physical Intuition ☆ #

Momentum is often explained using the analogy of a ball rolling down a hill.

Regularisation for Deep models

Regularisation for Deep models #

Regularisation means adding constraints or techniques that prevent a model from becoming too complex and memorising the training data.

The goal is not only low training error.

The goal is good performance on unseen data.

Key takeaway:
Regularisation helps the model generalise by controlling complexity, stabilising training, and reducing overfitting.
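As one example of controlling complexity, weight decay can be sketched as an L2 penalty added to the data loss. The helper `l2_regularised_loss`, the coefficient `weight_decay`, and the numbers are hypothetical; some formulations scale the penalty by one half.

```python
def l2_regularised_loss(data_loss, weights, weight_decay=0.01):
    """Add an L2 penalty so that large weights increase the total loss."""
    penalty = weight_decay * sum(w * w for w in weights)
    return data_loss + penalty

# Same data loss, but the second model uses much larger weights.
small = l2_regularised_loss(0.5, [0.1, -0.2, 0.3])
large = l2_regularised_loss(0.5, [3.0, -4.0, 5.0])
print(small, large)
```

Two models with the same data loss are no longer equal: the one with larger weights is penalised, which pushes training towards simpler solutions.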

  • Generalization for regression
  • Training Error and Generalization Error
  • Underfitting or Overfitting
  • Model Selection
  • Weight Decay and Norms
  • Generalization in Classification
  • Environment and Distribution Shift
  • Generalization in Deep Learning
  • Dropout
  • Batch Normalization
  • Layer Normalization

Underfitting, Good Fit, and Overfitting ☆ #

| Case | Model behaviour | Training error | Test error |
|---|---|---|---|
| Underfitting | too simple | high | high |
| Good fit | captures useful pattern | low | low |
| Overfitting | memorises training noise | very low | high |

flowchart LR
    A["Model Complexity"] --> B["Too Simple: Underfitting"]
    A --> C["Just Right: Good Fit"]
    A --> D["Too Complex: Overfitting"]

    style A fill:#E1F5FE,stroke:#4A90E2
    style B fill:#FFF9C4,stroke:#FBC02D
    style C fill:#C8E6C9,stroke:#43A047
    style D fill:#EDE7F6,stroke:#7E57C2

Training Error and Generalisation Error ☆ #

Training error measures performance on data used for learning.