DNN on Arshad Siddiqui

Attention Mechanism


Attention is a deep learning mechanism that allows a model to focus on the most relevant parts of an input sequence when producing an output.

Instead of compressing the whole input into one fixed-length vector, attention computes a weighted combination of the input representations, giving larger weights to the positions most relevant to the current output.

Key takeaway:
Attention answers a simple question:

For the current prediction, which input tokens should the model focus on most?
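As a rough illustration of that question, here is a minimal NumPy sketch of scaled dot-product attention, one of the scoring functions listed below. The shapes, the random inputs, and the helper names `softmax` and `scaled_dot_product_attention` are assumptions made for this example, not code from the topic page.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row-wise max before exponentiating for numerical stability.
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def scaled_dot_product_attention(queries, keys, values):
    """queries: (n_q, d), keys: (n_k, d), values: (n_k, d_v)."""
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)   # (n_q, n_k) similarity scores
    weights = softmax(scores, axis=-1)       # each row is a distribution over inputs
    return weights @ values, weights         # weighted combination of the values

# Toy inputs (assumed sizes): 4 input tokens, 2 queries.
rng = np.random.default_rng(0)
keys = rng.normal(size=(4, 8))
values = rng.normal(size=(4, 3))
queries = rng.normal(size=(2, 8))

output, weights = scaled_dot_product_attention(queries, keys, values)
print(weights.round(3))   # one row of attention weights per query
print(output.shape)
```

Each row of `weights` says how much one query focuses on each input token; the output is the corresponding weighted average of the value vectors.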

Queries, Keys, and Values
Attention Pooling by Similarity
Attention Pooling via Nadaraya–Watson Regression
Attention Scoring Functions
Dot Product Attention
Convenience Functions
Scaled Dot Product Attention
Additive Attention
Bahdanau Attention Mechanism
Multi-Head Attention
Self-Attention
Positional Encoding

Why Attention Is Needed ☆

Traditional encoder-decoder RNN models compress the full input sequence into one context vector.

Transformer


A transformer is a neural network architecture that uses attention as its main mechanism for processing sequences.

Unlike RNNs, transformers do not process tokens one by one.

They process many tokens in parallel and use self-attention to learn relationships between tokens.

The architecture is based on the multi-head attention mechanism.

Text is converted to numerical representations called tokens, and each token is converted into a vector via lookup from a word embedding table.
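The token-to-vector lookup can be sketched in a few lines. The toy vocabulary, the whitespace `tokenise` helper, and the random embedding table below are hypothetical stand-ins chosen for illustration; real transformers use a learned table and a subword tokeniser.

```python
import numpy as np

# Hypothetical toy vocabulary: token string -> integer ID.
vocab = {"<unk>": 0, "the": 1, "cat": 2, "sat": 3}

def tokenise(text):
    # Whitespace splitting stands in for a real tokeniser.
    return [vocab.get(word, vocab["<unk>"]) for word in text.lower().split()]

rng = np.random.default_rng(0)
embedding_dim = 4
# One row per vocabulary entry; learned during training in a real model.
embedding_table = rng.normal(size=(len(vocab), embedding_dim))

token_ids = tokenise("The cat sat")
token_vectors = embedding_table[token_ids]   # row lookup, one vector per token

print(token_ids)             # [1, 2, 3]
print(token_vectors.shape)   # (3, 4)
```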

Optimisation of Deep models


Optimisers are algorithms that update neural network parameters to reduce the loss function.

Deep networks typically have millions or billions of parameters and a non-convex loss, so there is generally no closed-form solution for the best parameters.

Instead, training uses iterative optimisation.

Key takeaway:
An optimiser decides how the model moves through the loss landscape towards lower loss.

Goal of Optimization
Optimization Challenges in Deep Learning
Gradient Descent
Stochastic Gradient Descent
Minibatch Stochastic Gradient Descent
Momentum
Adagrad and Algorithm
RMSProp and Algorithm
Adadelta and Algorithm
Adam and Algorithm
Code Implementation and comparison of algorithms (webinar)

flowchart TD
 A["Optimisers in DNN"] --> B["Gradient Descent Variants"]
 A --> C["Momentum-based Optimiser"]
 A --> D["Adaptive Methods"]
 A --> E["Learning Rate Schedules"]

 D --> D1["Parameter-specific learning rates"]

 E --> E1["Learning rate changes during training"]

 style A fill:#E1F5FE,stroke:#4A90E2,stroke-width:2px
 style B fill:#EDE7F6,stroke:#7E57C2
 style C fill:#C8E6C9,stroke:#43A047
 style D fill:#FFF9C4,stroke:#FBC02D
 style E fill:#F8BBD0,stroke:#D81B60

Goal of Optimisation ☆

The goal is to find parameters \( \theta \) that minimise the loss.

Gradient Descent and Mini-Batch Gradient Descent

Optimisation: Gradient Descent and Mini-Batch Gradient Descent

Gradient descent is the core optimisation idea behind neural network training. It updates the model parameters by moving in the opposite direction of the gradient of the loss.

Key takeaway:
Gradient descent uses the gradient to decide how to change the parameters. The learning rate controls how large each update step is.

flowchart TD
 A["Gradient Descent Variants"] --> B["Batch Gradient Descent"]
 A --> C["Stochastic Gradient Descent"]
 A --> D["Mini-batch Gradient Descent"]

 B --> B1["Uses full dataset"]
 B --> B2["One update per epoch"]
 B --> B3["Smooth but slow"]

 C --> C1["Uses one example at a time"]
 C --> C2["Frequent updates"]
 C --> C3["Fast but noisy"]

 D --> D1["Uses small batches"]
 D --> D2["Efficient on hardware"]
 D --> D3["Balanced and practical"]

 style A fill:#E1F5FE,stroke:#4A90E2,stroke-width:2px
 style B fill:#EDE7F6,stroke:#7E57C2
 style C fill:#C8E6C9,stroke:#43A047
 style D fill:#FFF9C4,stroke:#FBC02D
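The three variants in the flowchart differ only in how many examples feed each update. A minimal sketch, assuming a linear model trained with mean squared error on synthetic data (the function `train_linear`, the data, and the hyperparameters are illustrative choices, not taken from the topic page):

```python
import numpy as np

def train_linear(X, y, batch_size, lr=0.05, epochs=50, seed=0):
    """Fit y ~ X @ w by gradient descent on mean squared error."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    updates = 0
    for _ in range(epochs):
        order = rng.permutation(n)                 # reshuffle every epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            residual = X[idx] @ w - y[idx]
            grad = 2.0 * X[idx].T @ residual / len(idx)
            w -= lr * grad                         # step against the gradient
            updates += 1
    return w, updates

# Synthetic regression problem (assumed for the example).
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
true_w = np.array([1.5, -2.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=200)

# batch_size = n is batch GD, 1 is SGD, anything in between is mini-batch.
for name, batch_size in [("batch", 200), ("stochastic", 1), ("mini-batch", 32)]:
    w, updates = train_linear(X, y, batch_size)
    print(f"{name:>10}: updates={updates:5d}  w={w.round(2)}")
```

With 200 examples and 50 epochs, the batch variant makes 50 updates, the stochastic variant 10,000, and the mini-batch variant 350, which is the trade-off between update frequency and gradient noise that the flowchart summarises.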

Gradient Descent Rule ☆

The gradient tells us the direction in which the loss increases fastest. To reduce the loss, we move in the opposite direction.
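A one-dimensional sketch makes the rule concrete. The quadratic loss, the starting point, and the learning rate below are assumptions picked so the behaviour is easy to follow; the update line itself is the standard rule "new parameter = old parameter minus learning rate times gradient".

```python
def loss(theta):
    return (theta - 3.0) ** 2            # toy loss with its minimum at theta = 3

def grad(theta):
    return 2.0 * (theta - 3.0)           # derivative of the toy loss

theta = 0.0
learning_rate = 0.1
history = [loss(theta)]

for step in range(25):
    theta = theta - learning_rate * grad(theta)   # move against the gradient
    history.append(loss(theta))

print(round(theta, 4))                   # close to 3 after 25 steps
print(history[0], history[-1])           # loss before and after
```

On this loss every step shrinks the distance to the minimum by a constant factor, so the recorded loss values decrease monotonically. A learning rate that is too large would overshoot instead.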

Momentum Methods

Optimisation: Momentum Methods

Momentum improves gradient descent by adding a memory of previous update directions. Instead of using only the current gradient, the optimiser accumulates velocity across iterations.

Key takeaway:
Momentum helps the optimiser move faster in consistent directions and reduces zigzag movement in directions where gradients oscillate.
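The velocity idea can be sketched on a small ill-conditioned quadratic. This uses one common formulation (velocity = beta × velocity + gradient, then a step along the velocity); the test function, `beta`, and the learning rate are assumptions for the example, and other texts scale the terms slightly differently.

```python
import numpy as np

def grad(theta):
    # Gradient of 0.5 * (theta[0]**2 + 25 * theta[1]**2): shallow in one
    # direction, steep in the other.
    return np.array([1.0, 25.0]) * theta

def run(beta, lr=0.03, steps=100):
    theta = np.array([10.0, 1.0])
    velocity = np.zeros_like(theta)
    for _ in range(steps):
        velocity = beta * velocity + grad(theta)   # accumulate past gradients
        theta = theta - lr * velocity              # step along the velocity
    return theta

plain = run(beta=0.0)        # beta = 0 reduces to plain gradient descent
momentum = run(beta=0.9)

# Distance from the minimum at the origin after the same number of steps.
print(np.linalg.norm(plain), np.linalg.norm(momentum))
```

In this setup the momentum run ends noticeably closer to the minimum, because the velocity keeps building along the shallow direction where plain gradient descent makes only small steps.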

flowchart TD
 A["Momentum-based Optimiser"] --> B["SGD with Momentum"]

 B --> B1["Adds velocity term"]
 B --> B2["Accumulates past gradients"]
 B --> B3["Reduces zigzag movement"]
 B --> B4["Speeds up movement in useful direction"]
 B --> B5["Helps through shallow regions"]

 B1 --> C1["Current update depends on previous update"]
 B2 --> C2["Builds inertia"]
 B3 --> C3["Smoother path to minimum"]

 style A fill:#C8E6C9,stroke:#43A047,stroke-width:2px
 style B fill:#E1F5FE,stroke:#4A90E2
 style B1 fill:#EDE7F6,stroke:#7E57C2
 style B2 fill:#FFF9C4,stroke:#FBC02D
 style B3 fill:#F8BBD0,stroke:#D81B60
 style B4 fill:#EDE7F6,stroke:#7E57C2
 style B5 fill:#FFF9C4,stroke:#FBC02D

Physical Intuition ☆

Momentum is often explained using the analogy of a ball rolling down a hill.

Regularisation for Deep models


Regularisation refers to constraints or techniques that keep a model from becoming too complex and memorising the training data.

The goal is not only low training error.

The goal is good performance on unseen data.

Key takeaway:
Regularisation helps the model generalise by controlling complexity, stabilising training, and reducing overfitting.

Generalization for regression
Training Error and Generalization Error
Underfitting or Overfitting
Model Selection
Weight Decay and Norms
Generalization in Classification
Environment and Distribution Shift
Generalization in Deep Learning
Dropout
Batch Normalization
Layer Normalization

Underfitting, Good Fit, and Overfitting ☆

Case	Model behaviour	Training error	Test error
Underfitting	too simple	high	high
Good fit	captures useful pattern	low	low
Overfitting	memorises training noise	very low	high

flowchart LR
 A["Model Complexity"] --> B["Too Simple: Underfitting"]
 A --> C["Just Right: Good Fit"]
 A --> D["Too Complex: Overfitting"]

 style A fill:#E1F5FE,stroke:#4A90E2
 style B fill:#FFF9C4,stroke:#FBC02D
 style C fill:#C8E6C9,stroke:#43A047
 style D fill:#EDE7F6,stroke:#7E57C2
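The three cases can be reproduced with a small polynomial-fitting experiment. This is only a sketch: the noisy sine target, the sample sizes, and the degrees 1, 3, and 10 are assumptions chosen to make the pattern visible, and exact numbers depend on the random seed.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n):
    x = rng.uniform(-1.0, 1.0, size=n)
    y = np.sin(np.pi * x) + 0.2 * rng.normal(size=n)   # noisy target
    return x, y

x_train, y_train = make_data(15)     # small training set
x_test, y_test = make_data(200)      # held-out data from the same distribution

def mse(coeffs, x, y):
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))

errors = {}
for degree in (1, 3, 10):            # model complexity = polynomial degree
    coeffs = np.polyfit(x_train, y_train, deg=degree)
    errors[degree] = (mse(coeffs, x_train, y_train), mse(coeffs, x_test, y_test))
    print(f"degree={degree:2d}  train={errors[degree][0]:.3f}  "
          f"test={errors[degree][1]:.3f}")
```

Training error can only go down as the degree grows, since each model contains the simpler ones. Test error typically falls and then rises again, which is the overfitting row of the table.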

Training Error and Generalisation Error ☆

Training error measures performance on data used for learning.

DNN Formula and Numerical Sheet


This page consolidates the most useful Deep Neural Networks formulas and numerical patterns for revision.

It is designed for preparation and should be used together with the topic pages.

Revision strategy:
Do not only memorise formulas.

For each formula, know:
what each symbol means
when to apply it
how to substitute values carefully
what the output shape or answer represents

1. Artificial Neuron

Weighted Sum ☆

\[ z = \sum_{i=1}^{n} w_i x_i + b \]
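A quick numerical check of the weighted sum, using made-up values for the inputs, weights, and bias (they are illustrative only and do not come from the sheet):

```python
import numpy as np

# Illustrative values, assumed for this example.
x = np.array([1.0, 2.0, 3.0])      # inputs x_i
w = np.array([0.5, -1.0, 2.0])     # weights w_i
b = 0.5                            # bias

# Literal form of the sum: add up w_i * x_i, then add the bias.
z_loop = float(sum(w_i * x_i for w_i, x_i in zip(w, x)) + b)

# Same value computed as a dot product.
z_dot = float(np.dot(w, x) + b)

print(z_loop, z_dot)   # both are 0.5 - 2.0 + 6.0 + 0.5 = 5.0
```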

Vector form: