Optimisation of Deep models

Optimisation of Deep models #

Optimizers are algorithms that update neural network parameters to reduce the loss function.

Deep networks usually have millions or billions of parameters, so there is usually no closed-form solution.

Instead, training uses iterative optimisation.

Key takeaway:
An optimiser decides how the model moves through the loss landscape towards lower loss.


  • Goal of Optimization
  • Optimization Challenges in Deep Learning
  • Gradient Descent
  • Stochastic Gradient Descent
  • Minibatch Stochastic Gradient Descent
  • Momentum
  • Adagrad and Algorithm
  • RMSProp and Algorithm
  • Adadelta and Algorithm
  • Adam and Algorithm
  • Code Implementation and comparison of algorithms (webinar)

flowchart TD
    A["Optimisers in DNN"] --> B["Gradient Descent Variants"]
    A --> C["Momentum-based Optimiser"]
    A --> D["Adaptive Methods"]
    A --> E["Learning Rate Schedules"]

    D --> D1["Parameter-specific learning rates"]

    E --> E1["Learning rate changes during training"]

    style A fill:#E1F5FE,stroke:#4A90E2,stroke-width:2px
    style B fill:#EDE7F6,stroke:#7E57C2
    style C fill:#C8E6C9,stroke:#43A047
    style D fill:#FFF9C4,stroke:#FBC02D
    style E fill:#F8BBD0,stroke:#D81B60

Goal of Optimisation ☆ #

The goal is to find parameters \( \theta \) that minimise the loss.

\[ \theta^* = \arg\min_{\theta} \mathcal{L}(\theta) \]

Here:

SymbolMeaning
\( \theta \)all model parameters
\( \mathcal{L}(\theta) \)loss function for the current parameters
\( \theta^* \)best parameters found by optimisation

Deep neural networks may contain millions or billions of parameters. Because of this, it is usually not possible to compute the best parameters directly. Instead, we use iterative algorithms such as gradient descent, mini-batch gradient descent, momentum, RMSProp, Adam, and learning rate schedules.

Without optimisation, the network keeps random weights, produces random predictions, and does not learn useful patterns.

Without optimisation:

  • weights remain random
  • predictions remain random
  • no learning happens

Gradient Descent Core Idea #

Gradient descent moves parameters in the opposite direction of the gradient.

\[ \theta_{t+1}=\theta_t-\eta \nabla_{\theta}\mathcal{L}(\theta_t) \]

Where:

  • \( \eta \) is the learning rate
  • \( \nabla_{\theta}\mathcal{L} \) is the gradient of loss

Gradient Descent Variants ☆ #

MethodData used per updateUpdates per epochMemoryBehaviour
Batch GDfull dataset1highsmooth but slow
SGDone examplemanylownoisy but fast
Mini-batch GDbatch of examplesmediummediumbalanced and practical

Mini-Batch Gradient Descent ☆ #

Mini-batch gradient descent is the practical default for deep learning.

For mini-batch \( B \) :

\[ \mathcal{L}_B(\theta)=\frac{1}{B}\sum_{i \in B}\ell_i(\theta) \] \[ g = \nabla_{\theta}\mathcal{L}_B(\theta)=\frac{1}{B}\sum_{i \in B}\nabla_{\theta}\ell_i(\theta) \] \[ \theta \leftarrow \theta - \eta g \]

Common batch sizes:

  • 32
  • 64
  • 128
  • 256

Learning Rate Effects ☆ #

Learning rateBehaviourResult
Too largejumps around, oscillatesmay diverge
Too smalltiny updatesvery slow learning
Just rightstable descentsmooth convergence
flowchart LR
    A["Learning Rate"] --> B["Too Large"]
    A --> C["Too Small"]
    A --> D["Good Value"]

    B --> B1["Oscillation or divergence"]
    C --> C1["Slow convergence"]
    D --> D1["Stable loss reduction"]

    style A fill:#E1F5FE,stroke:#4A90E2
    style B fill:#FFF9C4,stroke:#FBC02D
    style C fill:#EDE7F6,stroke:#7E57C2
    style D fill:#C8E6C9,stroke:#43A047

Loss Functions Review ☆ #

The loss function measures how wrong the model prediction is. Different tasks use different loss functions.

Loss FunctionFormulaGradient FormCommon Use
Mean Squared Error\( \frac{1}{2}(\hat{y} - y)^2 \)\( \hat{y} - y \)Regression
Mean Absolute Error\( \lvert \hat{y} - y \rvert \)\( \operatorname{sign}(\hat{y} - y) \)Regression with outliers
Binary Cross-Entropy\( -[y\log(\hat{y}) + (1-y)\log(1-\hat{y})] \)similar to prediction minus targetBinary classification
Categorical Cross-Entropy\( -\sum_k y_k \log(\hat{y}_k) \)similar to prediction minus targetMulti-class classification

Many useful gradients have the intuitive form:

prediction minus target

This is one reason why backpropagation can efficiently compute updates across many layers.


Optimisation Challenges #

Deep learning loss surfaces are difficult.

Important challenges:

ChallengeMeaning
Local minimapoint lower than nearby points but not globally best
Saddle pointgradient is zero, but not a minimum
Plateauflat region with very small gradient
Non-convex landscapemany valleys, ridges, and irregular paths
Exploding gradientgradient becomes extremely large
Vanishing gradientgradient becomes extremely small

Training deep networks is difficult because the loss surface can be complicated. The optimiser may face several problems.

Saddle Points #

A saddle point is a point where the gradient is close to zero, but the point is not a true minimum. In some directions, the loss may increase. In other directions, the loss may decrease.

\[ \nabla \mathcal{L}(\theta) \approx 0 \]

This can make the optimiser slow down even though a better solution exists nearby.

Plateaus #

A plateau is a flat region of the loss surface. Gradients become very small, so parameter updates become tiny. Learning then becomes very slow.

Non-Convex Landscapes #

Deep learning loss surfaces are usually non-convex. This means there can be many local minima and saddle points. There is usually no guarantee that training will find the global minimum.

flowchart LR
    A[High Loss] --> B[Local Minimum]
    B --> C[Saddle Point]
    C --> D[Plateau]
    D --> E[Lower Loss Region]
    E --> F[Global Minimum]

    style A fill:#E1F5FE
    style B fill:#FFF9C4
    style C fill:#EDE7F6
    style D fill:#C8E6C9
    style E fill:#E1F5FE
    style F fill:#C8E6C9

How to Identify Optimisation Problems ☆ #

During training, optimisation issues can be detected by looking at the loss curve, validation error, and gradient norms.

Common signs include:

  • The loss curve flattens but validation error is still high.
  • Gradient norms approach zero.
  • Training progress stagnates for many epochs.
  • Loss jumps around instead of decreasing smoothly.
  • Training loss decreases, but validation loss becomes worse.

A flat loss curve does not always mean the model has learned well. It may also mean that the optimiser is stuck on a plateau, near a saddle point, or using an unsuitable learning rate.


Momentum ☆ #

Momentum adds velocity to gradient descent.

It remembers previous update directions and smooths zig-zag movement.

\[ v_t = \beta v_{t-1} + g_t \] \[ \theta_{t+1}=\theta_t-\eta v_t \]

Where:

  • \( v_t \) is velocity
  • \( \beta \) is momentum coefficient, often around 0.9
  • \( g_t \) is current gradient

Momentum Intuition #

flowchart TD
    A["SGD Path"] --> B["Noisy zig-zag"]
    C["Momentum Path"] --> D["Smoother movement"]
    D --> E["Faster progress in useful direction"]
    D --> F["Less oscillation across ravines"]

    style A fill:#FFF9C4,stroke:#FBC02D
    style B fill:#EDE7F6,stroke:#7E57C2
    style C fill:#E1F5FE,stroke:#4A90E2
    style D fill:#C8E6C9,stroke:#43A047
    style E fill:#C8E6C9,stroke:#43A047
    style F fill:#C8E6C9,stroke:#43A047

Adaptive Optimisers ☆ #

Adaptive methods change the effective learning rate for each parameter.

They are useful when:

  • gradients vary strongly across dimensions
  • some parameters are sparse
  • some directions are steep
  • fixed learning rate causes zig-zag movement

Adagrad #

Adagrad accumulates squared gradients and gives smaller learning rates to frequently updated parameters.

\[ r_t = r_{t-1} + g_t \odot g_t \] \[ \theta_{t+1}=\theta_t-\frac{\eta}{\sqrt{r_t}+\epsilon}\odot g_t \]

Good for sparse features, but the learning rate may shrink too much over time.


RMSProp #

RMSProp uses an exponentially weighted average of squared gradients.

\[ r_t = \rho r_{t-1} + (1-\rho)g_t \odot g_t \] \[ \theta_{t+1}=\theta_t-\frac{\eta}{\sqrt{r_t}+\epsilon}\odot g_t \]

It avoids Adagrad’s continuously shrinking learning rate problem.


Adam ☆ #

Adam combines momentum and RMSProp-style adaptive scaling.

It tracks:

  • first moment: average gradient
  • second moment: average squared gradient
\[ m_t = \beta_1 m_{t-1} + (1-\beta_1)g_t \] \[ v_t = \beta_2 v_{t-1} + (1-\beta_2)g_t^2 \]

Bias correction:

\[ \hat{m}_t = \frac{m_t}{1-\beta_1^t}, \quad \hat{v}_t = \frac{v_t}{1-\beta_2^t} \]

Update:

\[ \theta_{t+1}=\theta_t-\eta\frac{\hat{m}_t}{\sqrt{\hat{v}_t}+\epsilon} \]

Common defaults:

  • \( \beta_1 = 0.9 \)
  • \( \beta_2 = 0.999 \)
  • \( \epsilon = 10^{-8} \)

Adadelta #

Adadelta is an adaptive method designed to reduce sensitivity to the initial learning rate.

It uses running averages of squared gradients and squared parameter updates.

For exams, remember it as an extension of adaptive learning-rate methods.


Learning Rate Schedules ☆ #

Learning rate schedules change the learning rate over training.

Common strategies:

ScheduleIdea
Step decayreduce learning rate at fixed epochs
Exponential decaygradually reduce by a constant factor
Cosine decaysmooth reduction using cosine curve
Warm-upstart small, then increase early
Reduce on plateaureduce when validation loss stops improving

Gradient Clipping #

Gradient clipping limits very large gradients.

It is especially useful for RNNs and unstable training.

\[ g \leftarrow g \cdot \frac{c}{\|g\|} \quad \text{if } \|g\| > c \]

Choosing an Optimiser #

SituationGood choice
Basic baselineMini-batch SGD
Noisy gradientsMomentum
Sparse featuresAdagrad or Adam
General deep learning defaultAdam
Need strong final generalisationSGD with momentum after tuning
Exploding gradientsoptimiser plus gradient clipping

Practical Interpretation #

Optimisers decide how the model moves through the loss landscape. A poor optimiser or a poor learning rate can make training slow, unstable, or completely unsuccessful. A good optimiser can speed up training and make deep networks easier to train.


Common Exam Mistakes ☆ #

  • saying optimiser changes the dataset
  • confusing learning rate with momentum
  • forgetting mini-batch is the practical default
  • assuming Adam always gives best generalisation
  • not explaining why too high learning rate diverges
  • forgetting that adaptive methods use parameter-wise scaling

Revision / Summary #

Remember these points:

  • The aim of optimisation is to minimise the loss function.
  • Deep networks usually need iterative optimisation because closed-form solutions are not available.
  • Gradients tell the optimiser which direction increases the loss most quickly.
  • Gradient descent moves in the opposite direction of the gradient.
  • Saddle points, plateaus, and non-convexity make optimisation difficult.
  • The choice of loss function depends on the task type.
  • The choice of optimiser affects convergence speed and stability.

Remember the optimiser progression:

Gradient Descent → Mini-batch GD → Momentum → Adaptive Methods → Adam → Learning Rate Scheduling

OptimiserMain idea
Batch GDfull dataset update
SGDone-example update
Mini-batch GDbalanced practical update
Momentumaccumulate velocity
Adagradscale by accumulated squared gradients
RMSPropmoving average of squared gradients
Adammomentum plus adaptive scaling
Adadeltaadaptive method with reduced LR sensitivity

Reference #

  • Dive into deep learning. Cambridge University Press.. (Ch12)

Home | Deep Learning