Attention Mechanism
#
Attention is a deep learning mechanism that allows a model to focus on the most relevant parts of an input sequence when producing an output.
Instead of compressing the whole input into one fixed vector, attention computes a weighted combination of the input representations, giving larger weights to the more relevant positions.
Key takeaway:
Attention answers a simple question:
For the current prediction, which input tokens should the model focus on most?
- Queries, Keys, and Values
- Attention Pooling by Similarity
- Attention Pooling via Nadaraya–Watson Regression
- Attention Scoring Functions
- Dot Product Attention
- Convenience Functions
- Scaled Dot Product Attention
- Additive Attention
- Bahdanau Attention Mechanism
- Multi-Head Attention
- Self-Attention
- Positional Encoding
Why Attention Is Needed ☆
#
Traditional encoder-decoder RNN models compress the full input sequence into one context vector.
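To make the contrast with a single fixed context vector concrete, here is a minimal sketch of scaled dot-product attention for one query in plain Python. The function names `softmax` and `attend` and the toy keys, values, and query are assumptions chosen for illustration, not taken from the notes.

```python
import math

def softmax(scores):
    # Subtract the maximum before exponentiating for numerical stability.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Return (context, weights) for one query over a sequence of key/value pairs."""
    d = len(query)
    # Scaled dot-product score between the query and every key.
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    weights = softmax(scores)
    # The context is a weighted combination of the value vectors.
    dim_v = len(values[0])
    context = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim_v)]
    return context, weights

# Toy data (hypothetical): three input positions with 2-dimensional keys and values.
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]
query = [2.0, 0.0]

context, weights = attend(query, keys, values)
print(weights)  # positions whose keys align with the query get larger weights
print(context)
```

The context vector is recomputed for every query, so each prediction can draw on a different mix of input positions.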
Bayesian Learning
#
Bayesian Learning is a probabilistic approach to machine learning.
Instead of only asking, “Which output should the model predict?”, Bayesian Learning asks:
Given the data we have observed, how likely is each hypothesis, class, or parameter value?
This makes Bayesian Learning useful when uncertainty matters.
It is especially important in classification, probabilistic modelling, generative models, and situations where we want to combine prior knowledge with observed data.
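The question above is answered by Bayes' rule: the posterior is proportional to the prior times the likelihood. The sketch below applies it to a finite set of hypotheses. The spam example and its numbers are hypothetical, chosen only to show how prior knowledge and observed data combine.

```python
def posterior(prior, likelihood):
    """Bayes' rule over a finite set of hypotheses.

    prior:      dict mapping hypothesis -> P(h)
    likelihood: dict mapping hypothesis -> P(data | h)
    """
    unnormalised = {h: prior[h] * likelihood[h] for h in prior}
    evidence = sum(unnormalised.values())  # P(data)
    return {h: p / evidence for h, p in unnormalised.items()}

# Hypothetical numbers: 1% of emails are spam; the word "offer" appears
# in 60% of spam emails and in 5% of non-spam ("ham") emails.
prior = {"spam": 0.01, "ham": 0.99}
likelihood = {"spam": 0.60, "ham": 0.05}

post = posterior(prior, likelihood)
print(post)  # the probability of spam rises from the prior 0.01 once "offer" is observed
```

Even after seeing the word, spam stays the less likely hypothesis here because the prior is so small. This is the kind of uncertainty a single predicted label would hide.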
Ensemble Learning
#
Ensemble Learning is a machine learning approach where we combine multiple models to produce a stronger final prediction.
Instead of depending on one model, an ensemble uses a group of models and combines their outputs.
The main idea is simple:
Many weak or moderately good models can work together to produce a better and more stable model.
Key takeaway:
Ensemble Learning improves prediction by combining several models.
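One simple way to combine outputs is majority voting over class labels. The sketch below uses three hypothetical classifiers that each make one mistake on a different example. The labels and predictions are made up for illustration, and `majority_vote` is an illustrative helper, not a library function.

```python
from collections import Counter

def majority_vote(predictions):
    """Combine class labels from several models by taking the most common label per example."""
    n_examples = len(predictions[0])
    combined = []
    for i in range(n_examples):
        votes = Counter(model_preds[i] for model_preds in predictions)
        combined.append(votes.most_common(1)[0][0])
    return combined

def accuracy(pred, truth):
    return sum(p == t for p, t in zip(pred, truth)) / len(truth)

truth   = [1, 0, 1, 1, 0, 1]
model_a = [1, 0, 0, 1, 0, 1]  # wrong on index 2
model_b = [1, 1, 1, 1, 0, 1]  # wrong on index 1
model_c = [1, 0, 1, 0, 0, 1]  # wrong on index 3

ensemble = majority_vote([model_a, model_b, model_c])
for name, pred in [("a", model_a), ("b", model_b), ("c", model_c), ("ensemble", ensemble)]:
    print(name, accuracy(pred, truth))
```

The vote helps here because the three models err on different examples. If they all made the same mistakes, combining them would not improve the result.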
Transformers
#
A transformer is a neural network architecture that uses attention as its main mechanism for processing sequences.
Unlike RNNs, transformers do not process tokens one by one.
They process many tokens in parallel and use self-attention to learn relationships between tokens.
The architecture is based on the multi-head attention mechanism.
Text is converted to numerical representations called tokens, and each token is converted into a vector via lookup from a word embedding table.
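The text-to-token-to-vector pipeline can be sketched as follows. This is a toy illustration under stated assumptions: the whitespace tokeniser, the helper names `build_vocab` and `embed`, and the randomly initialised table are all hypothetical. Real systems typically use subword tokenisers and learn the embedding table during training.

```python
import random

def build_vocab(corpus):
    """Map each distinct whitespace-separated word to an integer id (a toy tokeniser)."""
    vocab = {}
    for sentence in corpus:
        for word in sentence.lower().split():
            vocab.setdefault(word, len(vocab))
    return vocab

def embed(sentence, vocab, table):
    """Convert text to token ids, then look up one vector per id."""
    token_ids = [vocab[w] for w in sentence.lower().split()]
    return token_ids, [table[i] for i in token_ids]

corpus = ["the cat sat", "the dog sat"]
vocab = build_vocab(corpus)

random.seed(0)
dim = 4
# One row per vocabulary entry. Random here; learned in a trained model.
table = [[random.gauss(0.0, 1.0) for _ in range(dim)] for _ in vocab]

token_ids, vectors = embed("the dog sat", vocab, table)
print(token_ids)     # integer ids, one per word
print(len(vectors))  # one embedding vector per token
```

The resulting sequence of vectors is what the attention layers operate on.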
Optimisation of Deep models
#
Optimisers are algorithms that update neural network parameters to reduce the loss function.
Deep networks often have millions or billions of parameters and a non-convex loss, so a closed-form solution is generally not available.
Instead, training uses iterative optimisation.
Key takeaway:
An optimiser decides how the model moves through the loss landscape towards lower loss.
- Goal of Optimization
- Optimization Challenges in Deep Learning
- Gradient Descent
- Stochastic Gradient Descent
- Minibatch Stochastic Gradient Descent
- Momentum
- Adagrad and Algorithm
- RMSProp and Algorithm
- Adadelta and Algorithm
- Adam and Algorithm
- Code Implementation and comparison of algorithms (webinar)
flowchart TD
A["Optimisers in DNN"] --> B["Gradient Descent Variants"]
A --> C["Momentum-based Optimiser"]
A --> D["Adaptive Methods"]
A --> E["Learning Rate Schedules"]
D --> D1["Parameter-specific learning rates"]
E --> E1["Learning rate changes during training"]
style A fill:#E1F5FE,stroke:#4A90E2,stroke-width:2px
style B fill:#EDE7F6,stroke:#7E57C2
style C fill:#C8E6C9,stroke:#43A047
style D fill:#FFF9C4,stroke:#FBC02D
style E fill:#F8BBD0,stroke:#D81B60
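As an illustration of the "parameter-specific learning rates" branch, here is a minimal sketch of the Adam update rule in plain Python. The function name `adam_minimise`, the test loss, and the hyperparameter values are assumptions for this example. The defaults for `beta1`, `beta2`, and `eps` follow commonly used values.

```python
import math

def adam_minimise(grad_fn, theta, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=500):
    """Minimise a function given its gradient, using the Adam update rule."""
    theta = list(theta)
    m = [0.0] * len(theta)  # running mean of gradients (first moment)
    v = [0.0] * len(theta)  # running mean of squared gradients (second moment)
    for t in range(1, steps + 1):
        g = grad_fn(theta)
        for i in range(len(theta)):
            m[i] = beta1 * m[i] + (1 - beta1) * g[i]
            v[i] = beta2 * v[i] + (1 - beta2) * g[i] ** 2
            # Bias correction compensates for m and v starting at zero.
            m_hat = m[i] / (1 - beta1 ** t)
            v_hat = v[i] / (1 - beta2 ** t)
            # Each parameter is scaled by its own gradient history.
            theta[i] -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta

# Hypothetical loss L(x, y) = (x - 3)^2 + 10 * (y + 1)^2, minimised at (3, -1).
def grad(theta):
    return [2 * (theta[0] - 3), 20 * (theta[1] + 1)]

solution = adam_minimise(grad, [0.0, 0.0])
print(solution)  # should end up close to [3, -1]
```

Dividing by the square root of `v_hat` is what gives each parameter its own effective step size, even though the two directions have very different curvature.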
Goal of Optimisation ☆
#
The goal is to find parameters \( \theta \) that minimise the loss.
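As a formula, and assuming the loss is written \( L(\theta) \) and the best parameters \( \theta^{*} \) (both symbols are introduced here for illustration), the goal reads:

\[ \theta^{*} = \arg\min_{\theta} L(\theta) \]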
Unsupervised Learning
#
Unsupervised Learning is used when we have input data but no target labels.
The model is not told the correct answer. Instead, it tries to discover hidden structure in the data.
- K-means Clustering and variants
- Review of EM algorithm
- GMM based Soft Clustering
- Applications
Supervised vs Unsupervised Learning
#
| Aspect | Supervised Learning | Unsupervised Learning |
|---|---|---|
| Data contains target label? | Yes | No |
| Learns from | Input-output pairs | Input features only |
| Main goal | Predict output | Discover structure |
| Example task | Classification, regression | Clustering |
| Example algorithm | Logistic regression, decision tree | K-means, GMM |
- Works on unlabelled raw data.
- The algorithm discovers hidden patterns without prior knowledge of outcomes.
- Requires no human-provided labels during training.
- Does not make direct predictions; it groups or organises data instead.
- Carries a higher risk of unreliable results because there is no ground truth to verify them against.
- Common techniques include Clustering, Association, and Dimensionality Reduction.
The most common example is clustering, where similar records are grouped together.
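Grouping similar records can be sketched with k-means, written here as Lloyd's algorithm in plain Python. The 2-D points and the choice of initial centroids are hypothetical. Practical implementations usually add careful initialisation and a convergence check.

```python
def kmeans(points, centroids, iterations=10):
    """Lloyd's algorithm for k-means on 2-D points, given initial centroids."""
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        labels = []
        for p in points:
            dists = [(p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2 for c in centroids]
            j = dists.index(min(dists))
            clusters[j].append(p)
            labels.append(j)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        centroids = [
            (sum(p[0] for p in cl) / len(cl), sum(p[1] for p in cl) / len(cl)) if cl else c
            for cl, c in zip(clusters, centroids)
        ]
    return centroids, labels

# Two visually separate groups, with no labels supplied.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, labels = kmeans(points, centroids=[points[0], points[3]])
print(labels)
print(centroids)
```

No target labels are used anywhere. The cluster indices come only from distances between the inputs.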
Optimisation: Gradient Descent and Mini-Batch Gradient Descent
#
Gradient descent is the core optimisation idea behind neural network training.
It updates the model parameters by moving in the opposite direction of the gradient of the loss.
Key takeaway:
Gradient descent uses the gradient to decide how to change the parameters.
The learning rate controls how large each update step is.
flowchart TD
A["Gradient Descent Variants"] --> B["Batch Gradient Descent"]
A --> C["Stochastic Gradient Descent"]
A --> D["Mini-batch Gradient Descent"]
B --> B1["Uses full dataset"]
B --> B2["One update per epoch"]
B --> B3["Smooth but slow"]
C --> C1["Uses one example at a time"]
C --> C2["Frequent updates"]
C --> C3["Fast but noisy"]
D --> D1["Uses small batches"]
D --> D2["Efficient on hardware"]
D --> D3["Balanced and practical"]
style A fill:#E1F5FE,stroke:#4A90E2,stroke-width:2px
style B fill:#EDE7F6,stroke:#7E57C2
style C fill:#C8E6C9,stroke:#43A047
style D fill:#FFF9C4,stroke:#FBC02D
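The three variants in the chart differ only in how many examples feed each update. The sketch below makes that explicit with a single `batch_size` argument on a toy linear regression. The function name `train_linear`, the data, and the hyperparameters are assumptions for illustration.

```python
import random

def train_linear(xs, ys, batch_size, lr=0.05, epochs=500, seed=0):
    """Fit y = w*x + b by gradient descent on mean squared error.

    batch_size = len(xs) gives batch gradient descent, 1 gives stochastic
    gradient descent, and anything in between gives mini-batch gradient descent.
    """
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            # Gradient of the mean squared error over the current batch only.
            grad_w = sum(2 * (w * xs[i] + b - ys[i]) * xs[i] for i in batch) / len(batch)
            grad_b = sum(2 * (w * xs[i] + b - ys[i]) for i in batch) / len(batch)
            # Step against the gradient.
            w -= lr * grad_w
            b -= lr * grad_b
    return w, b

xs = [0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5]
ys = [2 * x + 1 for x in xs]  # noiseless data from y = 2x + 1

w_batch, b_batch = train_linear(xs, ys, batch_size=len(xs))  # one update per epoch
w_mini, b_mini = train_linear(xs, ys, batch_size=2)          # four updates per epoch
w_sgd, b_sgd = train_linear(xs, ys, batch_size=1)            # eight updates per epoch
print(w_batch, b_batch)
print(w_mini, b_mini)
print(w_sgd, b_sgd)
```

On this noiseless toy problem all three settings should recover roughly the same line. On real data the smaller batches give noisier but more frequent updates.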
Gradient Descent Rule ☆
#
The gradient tells us the direction in which the loss increases fastest.
To reduce the loss, we move in the opposite direction.
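A minimal sketch of this rule on a one-dimensional loss follows. The quadratic loss, the starting point, and the two learning rates are hypothetical choices that show both a converging and a diverging run.

```python
def gradient_descent(grad, theta, lr, steps):
    """Repeatedly step against the gradient: theta <- theta - lr * grad(theta)."""
    history = [theta]
    for _ in range(steps):
        theta = theta - lr * grad(theta)
        history.append(theta)
    return history

# Hypothetical loss L(theta) = (theta - 3)^2, minimised at theta = 3.
def loss(theta):
    return (theta - 3) ** 2

def grad(theta):
    return 2 * (theta - 3)

path = gradient_descent(grad, theta=0.0, lr=0.1, steps=50)
print(path[-1])  # approaches 3

# With too large a step, each update overshoots the minimum by more than it gained.
diverging = gradient_descent(grad, theta=0.0, lr=1.1, steps=10)
print(diverging[-1])
```

With the smaller step the loss falls on every iteration. With the larger one the iterates move further from the minimum each time.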
Optimisation: Momentum Methods
#
Momentum improves gradient descent by adding a memory of previous update directions.
Instead of using only the current gradient, the optimiser accumulates velocity across iterations.
Key takeaway:
Momentum helps the optimiser move faster in consistent directions and reduces zigzag movement in directions where gradients oscillate.
flowchart TD
A["Momentum-based Optimiser"] --> B["SGD with Momentum"]
B --> B1["Adds velocity term"]
B --> B2["Accumulates past gradients"]
B --> B3["Reduces zig-zag movement"]
B --> B4["Speeds up movement in useful direction"]
B --> B5["Helps through shallow regions"]
B1 --> C1["Current update depends on previous update"]
B2 --> C2["Builds inertia"]
B3 --> C3["Smoother path to minimum"]
style A fill:#C8E6C9,stroke:#43A047,stroke-width:2px
style B fill:#E1F5FE,stroke:#4A90E2
style B1 fill:#EDE7F6,stroke:#7E57C2
style B2 fill:#FFF9C4,stroke:#FBC02D
style B3 fill:#F8BBD0,stroke:#D81B60
style B4 fill:#EDE7F6,stroke:#7E57C2
style B5 fill:#FFF9C4,stroke:#FBC02D
Physical Intuition ☆
#
Momentum is often explained using the analogy of a ball rolling down a hill.
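The rolling-ball picture can be sketched with a velocity term that accumulates past gradients. The update below is one common form of SGD with momentum. The elongated quadratic loss and the values of `lr` and `beta` are assumptions chosen so that plain gradient descent is slow along the shallow direction.

```python
def sgd_momentum(grad_fn, theta, lr, beta, steps):
    """Gradient descent with a velocity term: v <- beta * v + g, theta <- theta - lr * v."""
    theta = list(theta)
    velocity = [0.0] * len(theta)
    for _ in range(steps):
        g = grad_fn(theta)
        velocity = [beta * v + gi for v, gi in zip(velocity, g)]
        theta = [t - lr * v for t, v in zip(theta, velocity)]
    return theta

# Hypothetical elongated bowl: L(x, y) = 0.5 * (x^2 + 50 * y^2).
def loss(theta):
    return 0.5 * (theta[0] ** 2 + 50 * theta[1] ** 2)

def grad(theta):
    return [theta[0], 50 * theta[1]]

start = [10.0, 1.0]
plain = sgd_momentum(grad, start, lr=0.03, beta=0.0, steps=100)  # beta = 0 is plain gradient descent
with_momentum = sgd_momentum(grad, start, lr=0.03, beta=0.9, steps=100)
print(loss(plain), loss(with_momentum))
```

Along the shallow `x` direction the gradients keep the same sign, so the velocity builds up and the momentum run covers more ground in the same number of steps.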
Machine Learning Model Evaluation/Comparison
#
Comparing Machine Learning Models
#
Emerging requirements e.g., bias, fairness, interpretability of ML models
#
Regularisation for Deep models
#
Regularisation means adding constraints or techniques that prevent a model from becoming too complex and memorising the training data.
The goal is not only low training error.
The goal is good performance on unseen data.
Key takeaway:
Regularisation helps the model generalise by controlling complexity, stabilising training, and reducing overfitting.
- Generalization for regression
- Training Error and Generalization Error
- Underfitting or Overfitting
- Model Selection
- Weight Decay and Norms
- Generalization in Classification
- Environment and Distribution Shift
- Generalization in Deep Learning
- Dropout
- Batch Normalization
- Layer Normalization
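As one concrete instance of controlling complexity, the sketch below adds an L2 penalty (weight decay) to a small linear regression trained by gradient descent. The data, the penalty strength, and the helper name `fit` are hypothetical.

```python
def fit(xs, ys, weight_decay, lr=0.05, steps=2000):
    """Gradient descent on mean squared error + weight_decay * ||w||^2 (linear model, no bias)."""
    n, d = len(xs), len(xs[0])
    w = [0.0] * d
    for _ in range(steps):
        # Prediction errors under the current weights.
        errors = [sum(wj * xj for wj, xj in zip(w, x)) - y for x, y in zip(xs, ys)]
        for j in range(d):
            grad_j = 2 * sum(e * x[j] for e, x in zip(errors, xs)) / n
            grad_j += 2 * weight_decay * w[j]  # gradient of the L2 penalty
            w[j] -= lr * grad_j
    return w

def squared_norm(w):
    return sum(wj * wj for wj in w)

# Toy data generated from the weights (3, -2) with no noise.
xs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]]
ys = [3.0, -2.0, 1.0, 4.0]

w_plain = fit(xs, ys, weight_decay=0.0)
w_decay = fit(xs, ys, weight_decay=1.0)
print(w_plain, squared_norm(w_plain))
print(w_decay, squared_norm(w_decay))
```

The penalised fit has a smaller weight norm at the cost of a slightly worse fit to the training data. That is the trade-off regularisation makes on purpose.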
Underfitting, Good Fit, and Overfitting ☆
#
| Case | Model behaviour | Training error | Test error |
|---|---|---|---|
| Underfitting | too simple | high | high |
| Good fit | captures useful pattern | low | low |
| Overfitting | memorises training noise | very low | high |
flowchart LR
A["Model Complexity"] --> B["Too Simple: Underfitting"]
A --> C["Just Right: Good Fit"]
A --> D["Too Complex: Overfitting"]
style A fill:#E1F5FE,stroke:#4A90E2
style B fill:#FFF9C4,stroke:#FBC02D
style C fill:#C8E6C9,stroke:#43A047
style D fill:#EDE7F6,stroke:#7E57C2
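The three cases can be reproduced on a toy problem by comparing models of increasing flexibility on training and held-out data. Everything in this sketch is an assumption made for illustration: the linear trend, the fixed +/-2 noise patterns (used instead of random noise so the run is reproducible), and the three model choices.

```python
def mse(model, xs, ys):
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def fit_mean(xs, ys):
    """Too simple: always predict the average training target."""
    mean_y = sum(ys) / len(ys)
    return lambda x: mean_y

def fit_line(xs, ys):
    """Least-squares straight line, which matches the underlying linear trend."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    return lambda x: my + slope * (x - mx)

def fit_memoriser(xs, ys):
    """Too flexible: return the target of the nearest training point (1-nearest-neighbour)."""
    return lambda x: min(zip(xs, ys), key=lambda pair: abs(pair[0] - x))[1]

# Trend y = 2x plus a fixed +/-2 pattern standing in for noise.
train_x = [float(i) for i in range(10)]
train_y = [2 * x + (2 if i % 2 == 0 else -2) for i, x in enumerate(train_x)]
test_x = [i + 0.4 for i in range(10)]
test_y = [2 * x + (2 if i % 4 < 2 else -2) for i, x in enumerate(test_x)]

for name, fit in [("mean", fit_mean), ("line", fit_line), ("memoriser", fit_memoriser)]:
    model = fit(train_x, train_y)
    print(name, round(mse(model, train_x, train_y), 2), round(mse(model, test_x, test_y), 2))
```

The memoriser reproduces every training target exactly, including the noise, so its training error is zero while its held-out error is worse than the line's. The mean predictor is poor on both sets.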
Training Error and Generalisation Error ☆
#
Training error measures performance on data used for learning.