AI

Attention Mechanism

AI, Deep Learning, DNN, Attention, Self-Attention, Multi-Head Attention

Attention Mechanism #

Attention is a deep learning mechanism that allows a model to focus on the most relevant parts of an input sequence when producing an output.

Instead of compressing the whole input into one fixed vector, attention computes a weighted combination of the input representations, with weights that depend on what the model is currently trying to predict.

Key takeaway:
Attention answers a simple question:
For the current prediction, which input tokens should the model focus on most?
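As a minimal sketch of this idea, the following assumes scaled dot-product attention (one of the scoring functions named in the outline below) applied to a single query. The function names and the toy vectors are illustrative and not taken from the source.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(query, keys, values):
    """Scaled dot-product attention for a single query vector."""
    d = len(query)
    # Similarity of the query to each key, scaled by sqrt(d).
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    weights = softmax(scores)
    # The output is the weighted combination of the value vectors.
    dim = len(values[0])
    output = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)]
    return output, weights

# Toy example (assumed data): three input tokens with 2-dimensional keys and values.
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]
query = [1.0, 1.0]

output, weights = attention(query, keys, values)
```

Here the third key is most similar to the query, so the third token receives the largest weight. The weights answer the question above: they say how much each input token contributes to the output.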

Queries, Keys, and Values
Attention Pooling by Similarity
Attention Pooling via Nadaraya–Watson Regression
Attention Scoring Functions
Dot Product Attention
Convenience Functions
Scaled Dot Product Attention
Additive Attention
Bahdanau Attention Mechanism
Multi-Head Attention
Self-Attention
Positional Encoding

Why Attention Is Needed ☆ #

Traditional encoder-decoder RNN models compress the full input sequence into one context vector.

Bayesian Learning

AI, ML

AI, ML, Bayesian Learning, MLE, MAP, Naive Bayes

Bayesian Learning #

Bayesian Learning is a probabilistic approach to machine learning.

Instead of only asking, “Which output should the model predict?”, Bayesian Learning asks:

Given the data we have observed, how likely is each hypothesis, class, or parameter value?

This makes Bayesian Learning useful when uncertainty matters.

It is especially important in classification, probabilistic modelling, generative models, and situations where we want to combine prior knowledge with observed data.
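The question above can be sketched with Bayes' rule on a toy problem. The two coin hypotheses, the priors, and the observed flips are assumptions made up for illustration.

```python
def coin_likelihood(p_heads, flips):
    """Probability of an observed flip sequence given P(heads)."""
    result = 1.0
    for flip in flips:
        result *= p_heads if flip == "H" else 1.0 - p_heads
    return result

def posterior(priors, likelihoods):
    """Bayes' rule: P(h | data) = P(data | h) * P(h) / P(data)."""
    unnormalised = {h: priors[h] * likelihoods[h] for h in priors}
    evidence = sum(unnormalised.values())  # P(data), the normalising constant
    return {h: p / evidence for h, p in unnormalised.items()}

# Assumed setup: a fair coin (P(heads) = 0.5) versus a biased coin (P(heads) = 0.8).
flips = ["H", "H", "T", "H"]
priors = {"fair": 0.5, "biased": 0.5}
likelihoods = {
    "fair": coin_likelihood(0.5, flips),
    "biased": coin_likelihood(0.8, flips),
}

post = posterior(priors, likelihoods)
map_hypothesis = max(post, key=post.get)  # most probable hypothesis after seeing the data
```

With equal priors the most probable hypothesis is simply the one with the highest likelihood. Changing the priors is how prior knowledge would enter the calculation.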

Ensemble Learning

AI, ML

AI, ML, Machine Learning, Ensemble Learning, Random Forest, Boosting

Ensemble Learning #

Ensemble Learning is a machine learning approach where we combine multiple models to produce a stronger final prediction.

Instead of depending on one model, an ensemble uses a group of models and combines their outputs.

The main idea is simple:

Many weak or moderately good models can work together to produce a better and more stable model.

Key takeaway:
Ensemble Learning improves prediction by combining several models, which typically reduces variance (as in bagging and random forests) or bias (as in boosting).
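One simple way to combine outputs is majority voting. The sketch below assumes three hypothetical threshold classifiers, hand-written for illustration rather than trained.

```python
from collections import Counter

def majority_vote(predictions):
    """Return the most common label among the individual model predictions."""
    return Counter(predictions).most_common(1)[0][0]

# Three hypothetical weak classifiers, each a simple rule on a 2-feature input.
def model_a(x):
    return "pos" if x[0] > 0.5 else "neg"

def model_b(x):
    return "pos" if x[1] > 0.5 else "neg"

def model_c(x):
    return "pos" if x[0] + x[1] > 1.0 else "neg"

def ensemble_predict(x):
    """Combine the three models by majority vote."""
    return majority_vote([model(x) for model in (model_a, model_b, model_c)])

first = ensemble_predict([0.9, 0.2])   # models vote pos, neg, pos
second = ensemble_predict([0.2, 0.6])  # models vote neg, pos, neg
```

Each model is wrong on some inputs, but the vote follows whatever two of the three agree on.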

Transformer

AI, Deep Learning

AI, Deep Learning, DNN, Transformer, Encoder, Decoder, Self-Attention

Transformer #

A transformer is a neural network architecture that uses attention as its main mechanism for processing sequences.

Unlike RNNs, transformers do not process tokens one by one.

They process many tokens in parallel and use self-attention to learn relationships between tokens.

The architecture is based on the multi-head attention mechanism.

Text is first split into tokens, each token is mapped to a numerical ID, and each ID is converted into a vector via lookup from a word embedding table.
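The token-to-vector lookup can be sketched as follows. The vocabulary, the whitespace tokeniser, and the hand-picked embedding values are all assumptions for illustration. Real models typically use subword tokenisers and learned embeddings.

```python
# Hypothetical toy vocabulary: token string -> integer ID.
vocab = {"the": 0, "cat": 1, "sat": 2, "<unk>": 3}

# Hypothetical embedding table: row i holds the vector for token ID i.
embedding_table = [
    [0.1, 0.3],
    [0.7, -0.2],
    [-0.4, 0.5],
    [0.0, 0.0],
]

def tokenize(text):
    """Whitespace tokenisation to integer IDs; unknown words map to <unk>."""
    return [vocab.get(word, vocab["<unk>"]) for word in text.lower().split()]

def embed(token_ids):
    """Look up one embedding vector per token ID."""
    return [embedding_table[i] for i in token_ids]

token_ids = tokenize("The cat sat")
vectors = embed(token_ids)
```

The resulting list of vectors is what the attention layers would operate on.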

Optimisation of Deep models

AI, Deep Learning

AI, Deep Learning, DNN, Optimisation, Optimizers

Optimisation of Deep models #

Optimizers are algorithms that update neural network parameters to reduce the loss function.

Deep networks usually have millions or billions of parameters and a non-linear, non-convex loss, so there is generally no closed-form solution for the best parameters.

Instead, training uses iterative optimisation.

Key takeaway:
An optimiser decides how the model moves through the loss landscape towards lower loss.

Goal of Optimization
Optimization Challenges in Deep Learning
Gradient Descent
Stochastic Gradient Descent
Minibatch Stochastic Gradient Descent
Momentum
Adagrad and Algorithm
RMSProp and Algorithm
Adadelta and Algorithm
Adam and Algorithm
Code Implementation and comparison of algorithms (webinar)

flowchart TD
    A["Optimisers in DNN"] --> B["Gradient Descent Variants"]
    A --> C["Momentum-based Optimiser"]
    A --> D["Adaptive Methods"]
    A --> E["Learning Rate Schedules"]

    D --> D1["Parameter-specific learning rates"]

    E --> E1["Learning rate changes during training"]

    style A fill:#E1F5FE,stroke:#4A90E2,stroke-width:2px
    style B fill:#EDE7F6,stroke:#7E57C2
    style C fill:#C8E6C9,stroke:#43A047
    style D fill:#FFF9C4,stroke:#FBC02D
    style E fill:#F8BBD0,stroke:#D81B60

Goal of Optimisation ☆ #

The goal is to find parameters \( \theta \) that minimise the loss.
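Written as a formula, and assuming a training set of \( n \) examples \( (x_i, y_i) \), a model \( f_\theta \), and a per-example loss \( \ell \) (symbols introduced here for illustration, not defined in the source), the goal could be stated as:

```latex
\[
\theta^{*} = \arg\min_{\theta} L(\theta),
\qquad
L(\theta) = \frac{1}{n} \sum_{i=1}^{n} \ell\big(f_{\theta}(x_i),\, y_i\big)
\]
```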

Unsupervised Learning

AI, ML

AI, ML, Unsupervised Learning, K-Means, EM, GMM, Clustering

Unsupervised Learning #

Unsupervised Learning is used when we have input data but no target labels.

The model is not told the correct answer. Instead, it tries to discover hidden structure in the data.

K-means Clustering and variants
Review of EM algorithm
GMM based Soft Clustering
Applications

Supervised vs Unsupervised Learning #

Aspect	Supervised Learning	Unsupervised Learning
Data contains target label?	Yes	No
Learns from	Input-output pairs	Input features only
Main goal	Predict output	Discover structure
Example task	Classification, regression	Clustering
Example algorithm	Logistic regression, decision tree	K-means, GMM

Works on unlabelled raw data.
The algorithm discovers hidden patterns without being given the correct outcomes.
Requires no human-provided labels during training.
Does not predict a target label directly; it groups or organises data instead.
Is harder to validate because there is no ground truth to check results against.
Common techniques include clustering, association rule mining, and dimensionality reduction.

The most common example is clustering, where similar records are grouped together.
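A minimal sketch of clustering, assuming plain k-means on one-dimensional data with hand-picked starting centroids. The data and the function name are illustrative.

```python
def kmeans_1d(points, centroids, iterations=10):
    """Plain k-means on 1-D data: assign points to the nearest centroid, then recompute means."""
    centroids = list(centroids)
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        # Assignment step: each point joins the cluster of its nearest centroid.
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        centroids = [
            sum(c) / len(c) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters

# Assumed toy data: two visible groups, around 1 and around 5. No labels are given.
points = [1.0, 1.2, 0.8, 5.0, 5.3, 4.7]
centroids, clusters = kmeans_1d(points, centroids=[0.0, 6.0])
```

The algorithm never sees which group a point belongs to. The grouping emerges only from distances between records.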

Gradient Descent and Mini-Batch Gradient Descent

AI, Deep Learning

AI, Deep Learning, DNN, Gradient Descent, Mini-Batch Gradient Descent

Optimisation: Gradient Descent and Mini-Batch Gradient Descent #

Gradient descent is the core optimisation idea behind neural network training. It updates the model parameters by moving in the opposite direction of the gradient of the loss.

Key takeaway:
Gradient descent uses the gradient to decide how to change the parameters. The learning rate controls how large each update step is.
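The effect of the learning rate can be sketched on a toy one-dimensional loss. The loss function, starting point, and learning rates below are assumptions chosen for illustration.

```python
def gradient_descent(grad, theta, learning_rate, steps):
    """Repeatedly step against the gradient: theta <- theta - learning_rate * grad(theta)."""
    for _ in range(steps):
        theta = theta - learning_rate * grad(theta)
    return theta

def grad(theta):
    """Gradient of the toy loss L(theta) = (theta - 3)^2, whose minimum is at theta = 3."""
    return 2.0 * (theta - 3.0)

# A small learning rate converges towards the minimum.
small_lr = gradient_descent(grad, theta=0.0, learning_rate=0.1, steps=50)

# A learning rate that is too large overshoots more on every step and diverges.
large_lr = gradient_descent(grad, theta=0.0, learning_rate=1.1, steps=50)
```

On this loss, each step with learning rate 0.1 shrinks the distance to the minimum by a factor of 0.8, while each step with learning rate 1.1 multiplies it by 1.2 in magnitude.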

flowchart TD
    A["Gradient Descent Variants"] --> B["Batch Gradient Descent"]
    A --> C["Stochastic Gradient Descent"]
    A --> D["Mini-batch Gradient Descent"]

    B --> B1["Uses full dataset"]
    B --> B2["One update per epoch"]
    B --> B3["Smooth but slow"]

    C --> C1["Uses one example at a time"]
    C --> C2["Frequent updates"]
    C --> C3["Fast but noisy"]

    D --> D1["Uses small batches"]
    D --> D2["Efficient on hardware"]
    D --> D3["Balanced and practical"]

    style A fill:#E1F5FE,stroke:#4A90E2,stroke-width:2px
    style B fill:#EDE7F6,stroke:#7E57C2
    style C fill:#C8E6C9,stroke:#43A047
    style D fill:#FFF9C4,stroke:#FBC02D

Gradient Descent Rule ☆ #

The gradient tells us the direction in which the loss increases fastest. To reduce the loss, we move in the opposite direction.
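In symbols, and assuming \( \eta \) denotes the learning rate, \( L \) the loss, and \( \theta_t \) the parameters at step \( t \) (notation chosen here for illustration), the rule is commonly written as:

```latex
\[
\theta_{t+1} = \theta_t - \eta \, \nabla_{\theta} L(\theta_t)
\]
```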

Momentum Methods

AI, Deep Learning

AI, Deep Learning, DNN, Momentum, SGD With Momentum

Optimisation: Momentum Methods #

Momentum improves gradient descent by adding a memory of previous update directions. Instead of using only the current gradient, the optimiser accumulates velocity across iterations.

Key takeaway:
Momentum helps the optimiser move faster in consistent directions and reduces zigzag movement in directions where gradients oscillate.
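A minimal sketch of the velocity idea, assuming one common formulation in which the velocity accumulates raw gradients (v = beta * v + g, then theta = theta - learning_rate * v). For simplicity it uses the exact gradient of a toy quadratic loss rather than stochastic mini-batches. All names and numbers are illustrative.

```python
def momentum_descent(grad, theta, learning_rate=0.02, beta=0.9, steps=200):
    """Gradient descent with a velocity term; beta = 0 recovers plain gradient descent."""
    velocity = [0.0] * len(theta)
    for _ in range(steps):
        g = grad(theta)
        # The velocity remembers a decaying sum of past gradients.
        velocity = [beta * v + gi for v, gi in zip(velocity, g)]
        theta = [t - learning_rate * v for t, v in zip(theta, velocity)]
    return theta

def grad(theta):
    """Gradient of the toy loss 0.5 * (x^2 + 25 * y^2): shallow in x, steep in y."""
    return [theta[0], 25.0 * theta[1]]

start = [10.0, 1.0]
plain = momentum_descent(grad, start, beta=0.0)
with_momentum = momentum_descent(grad, start, beta=0.9)
```

With the same learning rate and number of steps, the run with momentum ends much closer to the minimum at the origin along the shallow x direction, where plain gradient descent makes slow progress.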

flowchart TD
    A["Momentum-based Optimiser"] --> B["SGD with Momentum"]

    B --> B1["Adds velocity term"]
    B --> B2["Accumulates past gradients"]
    B --> B3["Reduces zig-zag movement"]
    B --> B4["Speeds up movement in useful direction"]
    B --> B5["Helps through shallow regions"]

    B1 --> C1["Current update depends on previous update"]
    B2 --> C2["Builds inertia"]
    B3 --> C3["Smoother path to minimum"]

    style A fill:#C8E6C9,stroke:#43A047,stroke-width:2px
    style B fill:#E1F5FE,stroke:#4A90E2
    style B1 fill:#EDE7F6,stroke:#7E57C2
    style B2 fill:#FFF9C4,stroke:#FBC02D
    style B3 fill:#F8BBD0,stroke:#D81B60
    style B4 fill:#EDE7F6,stroke:#7E57C2
    style B5 fill:#FFF9C4,stroke:#FBC02D

Physical Intuition ☆ #

Momentum is often explained using the analogy of a ball rolling down a hill.

Evaluation/Comparison

AI, ML

Machine Learning Model Evaluation/Comparison #

Comparing Machine Learning Models #

Emerging requirements e.g., bias, fairness, interpretability of ML models #


Regularisation for Deep models

AI, Deep Learning

AI, Deep Learning, DNN, Regularisation, Dropout, Batch Normalisation

Regularisation for Deep models #

Regularisation refers to constraints or techniques that discourage a model from becoming too complex and memorising the training data.

The goal is not only low training error.

The goal is good performance on unseen data.

Key takeaway:
Regularisation helps the model generalise by controlling complexity, stabilising training, and reducing overfitting.
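As one concrete instance of controlling complexity, the sketch below assumes weight decay (an L2 penalty, listed in the outline below) on a one-parameter linear model. The data and the penalty strength are made up for illustration.

```python
def fit_slope(xs, ys, weight_decay=0.0):
    """Fit the model y = w * x by least squares with an L2 penalty weight_decay * w^2.

    Minimising sum((y - w * x)^2) + weight_decay * w^2 over w gives
    w = sum(x * y) / (sum(x^2) + weight_decay).
    """
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + weight_decay)

# Assumed toy data, roughly following y = 2 * x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]

w_plain = fit_slope(xs, ys)                     # no regularisation
w_decay = fit_slope(xs, ys, weight_decay=10.0)  # penalised fit, weight pulled towards zero
```

The penalty trades a slightly worse fit on the training data for a smaller weight. In larger models the same pressure limits how closely the model can chase noise.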

Generalization for regression
Training Error and Generalization Error
Underfitting or Overfitting
Model Selection
Weight Decay and Norms
Generalization in Classification
Environment and Distribution Shift
Generalization in Deep Learning
Dropout
Batch Normalization
Layer Normalization

Underfitting, Good Fit, and Overfitting ☆ #

Case	Model behaviour	Training error	Test error
Underfitting	too simple	high	high
Good fit	captures useful pattern	low	low
Overfitting	memorises training noise	very low	high

flowchart LR
    A["Model Complexity"] --> B["Too Simple: Underfitting"]
    A --> C["Just Right: Good Fit"]
    A --> D["Too Complex: Overfitting"]

    style A fill:#E1F5FE,stroke:#4A90E2
    style B fill:#FFF9C4,stroke:#FBC02D
    style C fill:#C8E6C9,stroke:#43A047
    style D fill:#EDE7F6,stroke:#7E57C2

Training Error and Generalisation Error ☆ #

Training error measures performance on data used for learning.