Regularisation for Deep models #

Regularisation means adding constraints or techniques that prevent a model from becoming too complex and memorising the training data.

The goal is not only low training error.

The goal is good performance on unseen data.

Key takeaway:
Regularisation helps the model generalise by controlling complexity, stabilising training, and reducing overfitting.

  • Generalization for regression
  • Training Error and Generalization Error
  • Underfitting or Overfitting
  • Model Selection
  • Weight Decay and Norms
  • Generalization in Classification
  • Environment and Distribution Shift
  • Generalization in Deep Learning
  • Dropout
  • Batch Normalization
  • Layer Normalization

Underfitting, Good Fit, and Overfitting ☆ #

| Case | Model behaviour | Training error | Test error |
| --- | --- | --- | --- |
| Underfitting | too simple | high | high |
| Good fit | captures useful pattern | low | low |
| Overfitting | memorises training noise | very low | high |

flowchart LR
    A["Model Complexity"] --> B["Too Simple: Underfitting"]
    A --> C["Just Right: Good Fit"]
    A --> D["Too Complex: Overfitting"]

    style A fill:#E1F5FE,stroke:#4A90E2
    style B fill:#FFF9C4,stroke:#FBC02D
    style C fill:#C8E6C9,stroke:#43A047
    style D fill:#EDE7F6,stroke:#7E57C2

Training Error and Generalisation Error ☆ #

Training error measures performance on data used for learning.

Generalisation error measures expected performance on unseen data.

A model can have excellent training performance and poor test performance.

That is overfitting.
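The gap between the two errors can be seen in a toy example. The sketch below is illustrative only: the data, the 1-nearest-neighbour "memoriser", and the helper names are all assumptions made for this example, and a held-out test set is used as a stand-in for generalisation error.

```python
def nearest_neighbour_predict(train_x, train_y, x):
    """Predict by copying the label of the closest training point (pure memorisation)."""
    idx = min(range(len(train_x)), key=lambda i: abs(train_x[i] - x))
    return train_y[idx]


def mse(preds, targets):
    """Mean squared error between two equal-length lists."""
    return sum((p - t) ** 2 for p, t in zip(preds, targets)) / len(targets)


# Hypothetical data: the true relationship is y = 2x; training labels carry noise.
train_x = [0.0, 1.0, 2.0, 3.0, 4.0]
train_y = [0.5, 1.5, 4.5, 5.5, 8.5]  # 2x with alternating +0.5 / -0.5 noise
test_x = [0.4, 1.4, 2.4, 3.4]        # unseen inputs
test_y = [2 * x for x in test_x]     # noise-free targets

train_preds = [nearest_neighbour_predict(train_x, train_y, x) for x in train_x]
test_preds = [nearest_neighbour_predict(train_x, train_y, x) for x in test_x]

train_error = mse(train_preds, train_y)  # zero: every training label is reproduced
test_error = mse(test_preds, test_y)     # larger: the memorised noise does not transfer
print(train_error, test_error)
```

The memoriser reproduces every training label, noise included, so its training error is zero while its error on the unseen points is clearly higher.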


Regularisation Taxonomy #

flowchart TD
    A["Regularisation"] --> B["Explicit Regularisation"]
    A --> C["Implicit Regularisation"]

    B --> B1["Weight penalties: L1 and L2"]
    B --> B2["Structural methods: Dropout"]

    C --> C1["Training procedures: Early stopping"]
    C --> C2["Normalisation: Batch Norm and Layer Norm"]
    C --> C3["Data augmentation"]

    style A fill:#E1F5FE,stroke:#4A90E2
    style B fill:#C8E6C9,stroke:#43A047
    style C fill:#FFF9C4,stroke:#FBC02D
    style B1 fill:#EDE7F6,stroke:#7E57C2
    style B2 fill:#EDE7F6,stroke:#7E57C2
    style C1 fill:#C8E6C9,stroke:#43A047
    style C2 fill:#C8E6C9,stroke:#43A047
    style C3 fill:#C8E6C9,stroke:#43A047

L2 Regularisation / Weight Decay ☆ #

L2 regularisation penalises large weights.

\[ J_{regularised}(\theta)=J(\theta)+\lambda \|\theta\|_2^2 \]

This encourages smoother models with smaller weights.

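Weight decay is closely related to L2 regularisation. With plain SGD the two are equivalent up to a rescaling of \( \lambda \); with adaptive optimisers such as Adam they are not, which is why decoupled weight decay (as in AdamW) is treated as a separate technique.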
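A minimal sketch of why the two coincide under plain SGD, assuming the penalty \( \lambda \|\theta\|_2^2 \) from the formula above (so its gradient is \( 2\lambda\theta \)). The function names and numbers are hypothetical and chosen only for illustration.

```python
def sgd_step_l2(theta, grad, lr, lam):
    """SGD step on J(theta) + lam * ||theta||_2^2.

    The penalty contributes 2 * lam * theta to the gradient.
    """
    return [t - lr * (g + 2.0 * lam * t) for t, g in zip(theta, grad)]


def sgd_step_weight_decay(theta, grad, lr, decay):
    """SGD step that shrinks each weight by (1 - lr * decay), then applies the data gradient."""
    return [t * (1.0 - lr * decay) - lr * g for t, g in zip(theta, grad)]


theta = [1.0, -2.0, 0.5]
grad = [0.1, -0.3, 0.2]  # hypothetical gradient of the unregularised loss J
lr, lam = 0.1, 0.01

with_l2 = sgd_step_l2(theta, grad, lr, lam)
with_decay = sgd_step_weight_decay(theta, grad, lr, decay=2.0 * lam)
print(with_l2)
print(with_decay)
```

Setting the decay coefficient to \( 2\lambda \) makes the two updates match up to floating-point rounding. This equivalence is specific to plain SGD and does not carry over to optimisers that rescale gradients per parameter.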


L1 Regularisation #

L1 regularisation encourages sparse weights.

\[ J_{regularised}(\theta)=J(\theta)+\lambda \|\theta\|_1 \]

It can drive some weights to exactly zero, which acts as a form of feature selection.
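One common way to see how L1 produces exact zeros is the soft-thresholding step used in proximal gradient methods. The sketch below assumes that setting; the function name and the threshold value (which would typically be the learning rate times \( \lambda \)) are illustrative choices, not something the notes prescribe.

```python
import math


def soft_threshold(weights, threshold):
    """Shrink every weight towards zero by `threshold`; weights smaller than that become zero."""
    return [math.copysign(max(abs(w) - threshold, 0.0), w) for w in weights]


weights = [0.05, -0.3, 1.0, -0.02]
shrunk = soft_threshold(weights, threshold=0.1)
print(shrunk)  # the two small weights become zero, the large ones shrink by 0.1
```

L2, by contrast, scales weights down proportionally, so small weights get smaller but rarely reach exactly zero.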


Dropout ☆ #

Dropout randomly disables some neurons during training.

This prevents the model from relying too heavily on any single neuron or feature.

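Dropout does not remove neurons because they are "bad", and it does not remove them permanently.

Units are dropped at random, and only during training, so that the network learns features that stay useful when other units are missing.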


Dropout Intuition #

flowchart LR
    A["Full Network"] --> B["Randomly Drop Some Units During Training"]
    B --> C["Different Sub-Network Each Batch"]
    C --> D["More Robust Features"]
    D --> E["Better Generalisation"]

    style A fill:#E1F5FE,stroke:#4A90E2
    style B fill:#FFF9C4,stroke:#FBC02D
    style C fill:#EDE7F6,stroke:#7E57C2
    style D fill:#C8E6C9,stroke:#43A047
    style E fill:#C8E6C9,stroke:#43A047

Dropout Formula #

During training, a binary mask is sampled.

\[ \tilde{h}=m \odot h \]

Where:

  • \( h \) is the activation
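  • \( m \) is a random binary mask whose entries are 1 with probability \( 1-p \) and 0 otherwise, where \( p \) is the drop probability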
  • \( \odot \) means element-wise multiplication
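The masking step can be sketched as below. Two things here are assumptions that go beyond the formula: the drop probability `p`, and the rescaling by \( 1/(1-p) \) ("inverted dropout"), which is a widely used variant that keeps the expected activation unchanged so that inference needs no extra scaling. The function name is hypothetical.

```python
import random


def dropout(h, p, training, rng):
    """Apply inverted dropout to a list of activations.

    h        : list of activations
    p        : probability of dropping each unit (assumed 0 <= p < 1)
    training : if False, activations pass through unchanged
    rng      : a random.Random instance, so results are reproducible
    """
    if not 0.0 <= p < 1.0:
        raise ValueError("p must be in [0, 1)")
    if not training or p == 0.0:
        return list(h)
    keep = 1.0 - p
    mask = [1.0 if rng.random() < keep else 0.0 for _ in h]
    # Element-wise mask, then rescale the surviving units by 1 / keep.
    return [m * x / keep for m, x in zip(mask, h)]


rng = random.Random(0)
h = [1.0, 2.0, 3.0, 4.0]
print(dropout(h, p=0.5, training=True, rng=rng))   # some units zeroed, others doubled
print(dropout(h, p=0.5, training=False, rng=rng))  # unchanged at inference
```

The `training` flag is the important design point: the mask is sampled only during training, and the full network is used at inference.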

Batch Normalisation ☆ #

Batch normalisation normalises activations using mini-batch statistics.

It helps stabilise training and often allows faster convergence.

\[ \mu_B = \frac{1}{m}\sum_{i=1}^{m}x_i \]

\[ \sigma_B^2 = \frac{1}{m}\sum_{i=1}^{m}(x_i-\mu_B)^2 \]

\[ \hat{x}_i = \frac{x_i-\mu_B}{\sqrt{\sigma_B^2+\epsilon}} \]

\[ y_i=\gamma \hat{x}_i+\beta \]
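The four equations translate almost line by line into code. The sketch below handles a single feature over one mini-batch; the function name and the default values for `gamma`, `beta`, and `eps` are assumptions for illustration, and the running statistics that frameworks typically use at inference are deliberately left out.

```python
import math


def batch_norm(xs, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalise one feature across a mini-batch, then scale and shift.

    xs is the list of values x_1..x_m that this feature takes over the batch.
    """
    m = len(xs)
    mu = sum(xs) / m                            # batch mean
    var = sum((x - mu) ** 2 for x in xs) / m    # batch variance (biased, divides by m)
    return [gamma * (x - mu) / math.sqrt(var + eps) + beta for x in xs]


batch = [1.0, 2.0, 3.0, 4.0]
print(batch_norm(batch))
```

With `gamma = 1` and `beta = 0` the output has mean close to 0 and variance close to 1; the learnable `gamma` and `beta` let the network undo the normalisation where that helps.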

Layer Normalisation ☆ #

Layer normalisation normalises across features within each example.

It is commonly used in transformers.

Batch normalisation depends on batch statistics.

Layer normalisation is more suitable for sequence models where batch statistics may be less stable.
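The difference from batch normalisation is only the axis over which statistics are computed. In the sketch below each example is normalised using its own features, so the result does not depend on what else is in the batch. The scalar `gamma` and `beta` are a simplification made here (they are usually per-feature vectors), and the function name is hypothetical.

```python
import math


def layer_norm(features, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalise one example across its own features, then scale and shift."""
    d = len(features)
    mu = sum(features) / d
    var = sum((f - mu) ** 2 for f in features) / d
    return [gamma * (f - mu) / math.sqrt(var + eps) + beta for f in features]


batch = [
    [1.0, 2.0, 3.0],     # example 1
    [10.0, 20.0, 60.0],  # example 2, on a very different scale
]
normalised = [layer_norm(example) for example in batch]
for row in normalised:
    print(row)
```

Because no statistic is shared between examples, the same computation applies with a batch size of one and behaves identically during training and inference.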


Early Stopping #

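Early stopping monitors validation performance and stops training once validation error has not improved for a chosen number of epochs (often called the patience). The weights from the best validation epoch are usually the ones kept.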

It prevents the model from continuing to memorise the training set.

flowchart LR
    A["Start Training"] --> B["Training Loss Falls"]
    B --> C["Validation Loss Improves"]
    C --> D{"Validation Loss Stops Improving?"}
    D -- "No" --> B
    D -- "Yes" --> E["Stop Training"]

    style A fill:#E1F5FE,stroke:#4A90E2
    style B fill:#C8E6C9,stroke:#43A047
    style C fill:#C8E6C9,stroke:#43A047
    style D fill:#FFF9C4,stroke:#FBC02D
    style E fill:#EDE7F6,stroke:#7E57C2
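The loop in the diagram can be sketched with a patience counter. This is a minimal sketch over a pre-recorded list of validation losses; the function name, the `patience` and `min_delta` parameters, and the loss values are assumptions for illustration. In a real training loop the loss would be computed each epoch and the best weights saved.

```python
def early_stopping_epoch(val_losses, patience=2, min_delta=0.0):
    """Return (best_epoch, stop_epoch) for a sequence of validation losses.

    Training stops once the loss has failed to improve by more than
    `min_delta` for `patience` consecutive epochs.
    """
    best_loss = float("inf")
    best_epoch = -1
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss - min_delta:
            best_loss = loss
            best_epoch = epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return best_epoch, epoch
    # Never triggered: training ran to the end.
    return best_epoch, len(val_losses) - 1


# Hypothetical validation curve: improves, then starts to rise.
val_losses = [1.0, 0.8, 0.7, 0.72, 0.75, 0.9]
print(early_stopping_epoch(val_losses, patience=2))  # (2, 4)
```

Here the best loss is at epoch 2 (zero-indexed) and training stops at epoch 4, after two epochs without improvement.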

Data Augmentation #

Data augmentation creates modified versions of training examples.

In computer vision, this may include:

  • rotation
  • cropping
  • flipping
  • brightness changes
  • zooming

It helps the model learn robust patterns instead of memorising exact images.
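Two of the listed transformations, flipping and cropping, are simple enough to sketch on an image stored as a nested list of pixel values. The representation, the function names, and the 50% flip probability are assumptions for illustration; in practice an image library or framework transform would be used.

```python
import random


def horizontal_flip(image):
    """Mirror each row of the image left to right."""
    return [row[::-1] for row in image]


def random_crop(image, crop_h, crop_w, rng):
    """Cut a crop_h x crop_w window from a random position inside the image."""
    top = rng.randint(0, len(image) - crop_h)       # randint is inclusive at both ends
    left = rng.randint(0, len(image[0]) - crop_w)
    return [row[left:left + crop_w] for row in image[top:top + crop_h]]


def augment(image, rng):
    """Randomly flip, then take a crop one pixel smaller in each dimension."""
    if rng.random() < 0.5:
        image = horizontal_flip(image)
    return random_crop(image, len(image) - 1, len(image[0]) - 1, rng)


rng = random.Random(0)
image = [
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9],
]
print(horizontal_flip(image))
print(augment(image, rng))
```

The label stays the same while the pixels change, so the model sees more variety without any new data being collected.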


Vanishing Gradient Problem ☆ #

Vanishing gradient means gradients become extremely small in earlier layers.

This makes early layers learn very slowly.

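\[ \frac{\partial \mathcal{L}}{\partial W^{(1)}} = \frac{\partial \mathcal{L}}{\partial h^{(L)}} \left( \prod_{l=2}^{L} \frac{\partial h^{(l)}}{\partial h^{(l-1)}} \right) \frac{\partial h^{(1)}}{\partial W^{(1)}} \]

Here \( \mathcal{L} \) is the loss and \( L \) is the number of layers. If many of the Jacobian factors in the product have norm less than one, the gradient shrinks roughly exponentially with depth and becomes very small by the time it reaches the early layers.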

Common in:

  • deep networks with sigmoid or tanh
  • long RNN sequences
  • poor weight initialisation
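The sigmoid case can be made concrete with a toy calculation. The sketch below assumes a chain of scalar layers that all share the same weight and pre-activation, which is a strong simplification of a real network; the function names are hypothetical. The sigmoid derivative is at most 0.25, so each layer multiplies the gradient by at most a quarter of the weight.

```python
import math


def sigmoid_derivative(z):
    """Derivative of the sigmoid at z; its maximum value is 0.25, reached at z = 0."""
    s = 1.0 / (1.0 + math.exp(-z))
    return s * (1.0 - s)


def gradient_scale(depth, weight=1.0, z=0.0):
    """Product of per-layer factors weight * sigmoid'(z) over `depth` layers."""
    scale = 1.0
    for _ in range(depth):
        scale *= weight * sigmoid_derivative(z)
    return scale


for depth in (1, 5, 10, 20):
    print(depth, gradient_scale(depth))
```

Even in this best case for the sigmoid (z = 0), ten layers shrink the gradient by a factor of about a million, which is why early layers learn so slowly.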

Exploding Gradient Problem ☆ #

Exploding gradient means gradients become extremely large, typically because many factors in the backpropagation product are greater than one.

Symptoms:

  • loss becomes NaN or infinity
  • huge weight updates
  • unstable training curves

Solutions:

  • gradient clipping
  • better initialisation
  • normalisation
  • residual connections
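Of these solutions, gradient clipping is the simplest to show in code. The sketch below clips by global norm: if the gradient vector is longer than `max_norm`, it is rescaled to that length while keeping its direction. The function name and threshold are assumptions for illustration.

```python
import math


def clip_by_global_norm(grads, max_norm):
    """Rescale the gradient vector so that its L2 norm does not exceed max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm or norm == 0.0:
        return list(grads)  # already small enough: leave unchanged
    scale = max_norm / norm
    return [g * scale for g in grads]


large = [3.0, 4.0]  # norm 5
small = [0.3, 0.4]  # norm 0.5
print(clip_by_global_norm(large, max_norm=1.0))  # rescaled to norm 1, same direction
print(clip_by_global_norm(small, max_norm=1.0))  # unchanged
```

Clipping limits the size of a single update but does not address the cause, so it is usually combined with the other items in the list.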

Weight Initialisation ☆ #

Good weight initialisation helps avoid vanishing or exploding gradients.

Bad initialisation can:

  • break learning
  • make neurons compute identical functions
  • slow convergence
  • cause saturation

Xavier / Glorot Initialisation #

Xavier initialisation keeps activation variance and gradient variance more stable across layers.

It is commonly used for sigmoid or tanh-style activations.

\[ W \sim U\left(-\sqrt{\frac{6}{n_{in}+n_{out}}}, \sqrt{\frac{6}{n_{in}+n_{out}}}\right) \]

Here \( n_{in} \) and \( n_{out} \) are the numbers of input and output units of the layer.
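A minimal sketch of sampling a weight matrix from this uniform distribution. The function name and the choice to store the matrix as `n_out` rows of `n_in` values are assumptions made for this example.

```python
import math
import random


def xavier_uniform(n_in, n_out, rng):
    """Sample an n_out x n_in weight matrix from U(-limit, limit), limit = sqrt(6 / (n_in + n_out))."""
    limit = math.sqrt(6.0 / (n_in + n_out))
    return [[rng.uniform(-limit, limit) for _ in range(n_in)] for _ in range(n_out)]


rng = random.Random(0)
weights = xavier_uniform(n_in=4, n_out=3, rng=rng)
for row in weights:
    print(row)
```

A uniform distribution on \( (-a, a) \) has variance \( a^2/3 \), so this choice of limit gives each weight a variance of \( 2/(n_{in}+n_{out}) \).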

He Initialisation #

He initialisation is commonly used with ReLU and Leaky ReLU.

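\[ W \sim N\left(0, \frac{2}{n_{in}}\right) \]

Here the second argument is the variance, so the standard deviation is \( \sqrt{2/n_{in}} \), and \( n_{in} \) is the number of inputs to the layer.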
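A matching sketch for He initialisation, again with a hypothetical function name and an assumed `n_out` by `n_in` layout. Note that `random.gauss` takes a standard deviation, so the variance \( 2/n_{in} \) has to be square-rooted first.

```python
import math
import random


def he_normal(n_in, n_out, rng):
    """Sample an n_out x n_in weight matrix from a normal with mean 0 and variance 2 / n_in."""
    std = math.sqrt(2.0 / n_in)  # gauss() expects the standard deviation, not the variance
    return [[rng.gauss(0.0, std) for _ in range(n_in)] for _ in range(n_out)]


rng = random.Random(0)
weights = he_normal(n_in=100, n_out=200, rng=rng)

# Check that the sample variance is close to the target 2 / n_in = 0.02.
values = [w for row in weights for w in row]
mean = sum(values) / len(values)
variance = sum((v - mean) ** 2 for v in values) / len(values)
print(variance)
```

The factor of 2 compensates for ReLU zeroing roughly half of its inputs, which would otherwise halve the activation variance at every layer.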

When to Apply Regularisation ☆ #

| Situation | Useful technique |
| --- | --- |
| Small dataset | data augmentation, L2, dropout |
| Deep network | batch norm, residual connections, dropout |
| Transformer | layer norm, dropout, weight decay |
| RNN / sequence model | recurrent dropout, gradient clipping |
| Overfitting | dropout, early stopping, L2 |
| Training instability | normalisation, better initialisation, gradient clipping |

Common Mistakes ☆ #

  • saying regularisation always improves training accuracy
  • confusing training error and test error
  • saying dropout permanently deletes neurons
  • forgetting dropout behaves differently during training and inference
  • using batch norm and layer norm interchangeably without context
  • ignoring weight initialisation in deep networks

Revision / Summary #

Regularisation is about finding the right balance:

enough model capacity to learn patterns, but enough constraint to avoid memorisation.

| Technique | Main purpose |
| --- | --- |
| L1 | sparsity |
| L2 / weight decay | smaller weights |
| Dropout | reduce co-adaptation |
| Batch norm | stabilise mini-batch activations |
| Layer norm | stabilise per-example feature activations |
| Early stopping | stop before overfitting worsens |
| Data augmentation | increase effective data variety |
| Gradient clipping | prevent exploding gradients |
| Xavier / He initialisation | support stable gradient flow |
