Optimisation: Gradient Descent and Mini-Batch Gradient Descent #
Gradient descent is the core optimisation idea behind neural network training. It updates the model parameters by moving in the opposite direction of the gradient of the loss.
Key takeaway:
Gradient descent uses the gradient to decide how to change the parameters. The learning rate controls how large each update step is.
flowchart TD
A["Gradient Descent Variants"] --> B["Batch Gradient Descent"]
A --> C["Stochastic Gradient Descent"]
A --> D["Mini-batch Gradient Descent"]
B --> B1["Uses full dataset"]
B --> B2["One update per epoch"]
B --> B3["Smooth but slow"]
C --> C1["Uses one example at a time"]
C --> C2["Frequent updates"]
C --> C3["Fast but noisy"]
D --> D1["Uses small batches"]
D --> D2["Efficient on hardware"]
D --> D3["Balanced and practical"]
style A fill:#E1F5FE,stroke:#4A90E2,stroke-width:2px
style B fill:#EDE7F6,stroke:#7E57C2
style C fill:#C8E6C9,stroke:#43A047
style D fill:#FFF9C4,stroke:#FBC02D
Gradient Descent Rule ☆ #
The gradient tells us the direction in which the loss increases fastest. To reduce the loss, we move in the opposite direction.
\[ \theta_{t+1} = \theta_t - \eta \nabla \mathcal{L}(\theta_t) \]

Where:
| Symbol | Meaning |
|---|---|
| \( \theta_t \) | parameters at iteration \( t \) |
| \( \eta \) | learning rate |
| \( \nabla \mathcal{L}(\theta_t) \) | gradient of the loss |
| \( \theta_{t+1} \) | updated parameters |
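As a minimal sketch in plain Python (the function name is illustrative, not from any library), one application of this rule looks like:

```python
def gradient_descent_step(theta, grad, lr):
    """One update of the rule: theta_{t+1} = theta_t - lr * grad."""
    return theta - lr * grad

# Minimise f(theta) = theta^2, whose gradient is 2 * theta.
theta = 5.0
for _ in range(100):
    theta = gradient_descent_step(theta, grad=2 * theta, lr=0.1)

print(theta)  # very close to the minimiser theta = 0
```

Each iteration multiplies `theta` by \( 1 - 2\eta = 0.8 \), so the parameter shrinks geometrically towards the minimum.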
One Iteration Numerical Example ☆ #
Consider the simple function:
\[ f(x,y) = x^2 + y^2 \]

The gradient is:

\[ \nabla f = [2x, 2y] \]

Start at:

\[ (x_0, y_0) = (3,4), \qquad \eta = 0.1 \]

Before update:
| Quantity | Value |
|---|---|
| Position | \( (3,4) \) |
| Loss | \( 3^2 + 4^2 = 25 \) |
| Gradient | \( [6,8] \) |
Update:
\[ x_1 = 3 - 0.1(6) = 2.4 \]
\[ y_1 = 4 - 0.1(8) = 3.2 \]

New loss:

\[ f(2.4, 3.2) = 2.4^2 + 3.2^2 = 16 \]

The loss decreases from \( 25 \) to \( 16 \), a \( 36\% \) reduction in a single step.
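The single step above can be checked in a few lines of Python (a sketch of the toy example from the text):

```python
def f(x, y):
    return x**2 + y**2

x, y, eta = 3.0, 4.0, 0.1
gx, gy = 2 * x, 2 * y              # gradient [6, 8]
x, y = x - eta * gx, y - eta * gy  # new position (2.4, 3.2)

print(f(x, y))  # ~16.0, down from 25.0
```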
Batch GD, SGD, and Mini-Batch GD ☆ #
Gradient descent can use different amounts of data per update.
| Method | Gradient Uses | Updates per Epoch | Memory | Convergence |
|---|---|---|---|---|
| Batch Gradient Descent | all examples | \( 1 \) | high | smooth but slow |
| Stochastic Gradient Descent | one example | \( N \) | low | noisy but fast |
| Mini-Batch Gradient Descent | small batch | \( N/B \) | medium | balanced |
For a dataset with \( N = 10000 \) samples and batch size \( B = 32 \):
| Method | Updates per Epoch |
|---|---|
| Batch GD | \( 1 \) |
| SGD | \( 10000 \) |
| Mini-batch GD | \( 10000 / 32 \approx 313 \) |
Mini-batch gradient descent is the common practical choice for deep learning. It balances GPU efficiency, convergence stability, and update frequency.
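A minimal mini-batch loop might look like the following sketch (NumPy, synthetic one-weight linear-regression data; all names and values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
N, B, eta = 10_000, 32, 0.1

# Synthetic 1-D regression data: y = 2x + noise; we fit a single weight w.
x = rng.normal(size=N)
y = 2.0 * x + 0.1 * rng.normal(size=N)

w = 0.0
for epoch in range(5):
    perm = rng.permutation(N)        # reshuffle each epoch
    for start in range(0, N, B):     # roughly N / B ≈ 313 updates per epoch
        idx = perm[start:start + B]
        xb, yb = x[idx], y[idx]
        # Gradient of the mini-batch mean squared error with respect to w.
        grad = 2.0 * np.mean((w * xb - yb) * xb)
        w -= eta * grad

print(w)  # close to the true weight 2.0
```

Each epoch performs several hundred small, cheap updates rather than one expensive full-dataset update, which is exactly the trade-off the table describes.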
Mini-Batch Gradient Descent Algorithm ☆ #
flowchart TD
A[Initialise parameters] --> B[Shuffle training data]
B --> C[Choose mini-batch]
C --> D[Compute mini-batch loss]
D --> E[Compute mini-batch gradient]
E --> F[Update parameters]
F --> G{Stopping criterion met?}
G -- No --> B
G -- Yes --> H[Return trained parameters]
style A fill:#E1F5FE
style B fill:#C8E6C9
style C fill:#FFF9C4
style D fill:#EDE7F6
style E fill:#E1F5FE
style F fill:#C8E6C9
style G fill:#FFF9C4
style H fill:#C8E6C9

For a mini-batch \( \mathcal{B} \) of size \( B \), the mini-batch loss is:
\[ \mathcal{L}(\theta) = \frac{1}{B}\sum_{i \in \mathcal{B}} \mathcal{L}_i(\theta) \]

The gradient is:

\[ g = \nabla_\theta \mathcal{L}(\theta) = \frac{1}{B}\sum_{i \in \mathcal{B}} \nabla \mathcal{L}_i(\theta) \]

The parameter update is:

\[ \theta \leftarrow \theta - \eta g \]

Toy Problem for Optimisation Examples ☆ #
A useful toy loss function is:
\[ \mathcal{L}(w_1,w_2) = w_1^2 + 4w_2^2 \]

The gradient is:

\[ \nabla \mathcal{L} = [2w_1, 8w_2] \]

This creates an elongated bowl: for the same coordinate value, the gradient component along \( w_2 \) is four times larger than along \( w_1 \). Ordinary gradient descent therefore takes overly aggressive steps in the steep \( w_2 \) direction while making slow progress in the flatter \( w_1 \) direction.
Mini-Batch GD Numerical Example ☆ #
Problem setup:
\[ \mathcal{L}(w_1,w_2) = w_1^2 + 4w_2^2, \qquad (w_1,w_2) = (4,2), \qquad \eta = 0.1 \]

| Iteration | Loss | Gradient \( [\nabla_{w_1}, \nabla_{w_2}] \) | New \( w_1 \) | New \( w_2 \) |
|---|---|---|---|---|
| 0 | 32.00 | \( [8,16] \) | \( 3.20 \) | \( 0.40 \) |
| 1 | 10.88 | \( [6.4,3.2] \) | \( 2.56 \) | \( 0.08 \) |
| 2 | 6.58 | \( [5.1,0.64] \) | \( 2.05 \) | \( 0.016 \) |
| 3 | 4.20 | \( [4.1,0.13] \) | \( 1.64 \) | \( 0.003 \) |
Observation:
The loss decreases steadily:

\[ 32 \rightarrow 10.88 \rightarrow 6.58 \rightarrow 4.20 \]

However, \( w_2 \) shrinks far more aggressively than \( w_1 \) because its gradient component is steeper: it is driven almost to zero within three steps while \( w_1 \) decays slowly. This imbalance is why momentum and adaptive methods can help.
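The iteration table can be reproduced with a short script (a sketch of the toy loss from the text; printed values match the table up to rounding):

```python
def loss(w1, w2):
    return w1**2 + 4 * w2**2

def grad(w1, w2):
    return 2 * w1, 8 * w2

w1, w2, eta = 4.0, 2.0, 0.1
for t in range(4):
    g1, g2 = grad(w1, w2)
    print(t, round(loss(w1, w2), 2), (round(g1, 2), round(g2, 2)))
    w1, w2 = w1 - eta * g1, w2 - eta * g2
```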
Learning Rate Effects ☆ #
The learning rate controls step size.
| Learning Rate | Behaviour |
|---|---|
| Too large | oscillation, unstable updates, possible divergence |
| Too small | very slow progress, may take too long to train |
| Just right | smooth and fast convergence |
flowchart LR
A[Too Large] --> B[Oscillation or Divergence]
C[Too Small] --> D[Very Slow Training]
E[Good Value] --> F[Smooth Descent]
style A fill:#FFCDD2
style B fill:#FFCDD2
style C fill:#FFF9C4
style D fill:#FFF9C4
style E fill:#C8E6C9
style F fill:#C8E6C9

A high learning rate can make the loss jump around or even diverge. A very low learning rate may look stable, but training may be too slow to be useful.
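All three regimes can be observed on the 1-D loss \( f(x) = x^2 \), where each step multiplies \( x \) by \( 1 - 2\eta \) (a sketch; the learning rates are illustrative):

```python
def run(eta, steps=50, x0=1.0):
    x = x0
    for _ in range(steps):
        x = x - eta * 2 * x   # gradient of f(x) = x^2 is 2x
    return x

print(abs(run(1.1)))    # diverges: |x| grows by a factor 1.2 each step
print(abs(run(0.001)))  # barely moves: still close to the start, 1.0
print(abs(run(0.4)))    # converges fast: shrinks by 0.2 each step
```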
Why Adaptive Techniques Are Needed ☆ #
A fixed learning rate applies the same step size to every parameter. This can be inefficient because:
- different parameters may need different step sizes;
- some directions in the loss landscape are steeper than others;
- sparse features may need larger updates;
- dense features may need smaller, more careful updates;
- elongated contours can cause zigzag movement.
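The zigzag effect can be seen directly on the elongated toy loss \( w_1^2 + 4w_2^2 \): with a single fixed learning rate (illustrative value) large enough to move \( w_1 \) at a reasonable pace, the steep \( w_2 \) axis overshoots the minimum every step.

```python
eta = 0.2
w1, w2 = 4.0, 2.0
for t in range(6):
    print(t, round(w1, 3), round(w2, 3))
    w1, w2 = w1 - eta * 2 * w1, w2 - eta * 8 * w2

# Each step multiplies w1 by 0.6 (steady decay) but w2 by -0.6:
# w2 overshoots and flips sign every iteration, tracing the zigzag
# that motivates momentum and per-parameter learning rates.
```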
Exam Notes ☆ #
Remember these points:
- Gradient descent moves opposite to the gradient.
- The learning rate controls the update size.
- Mini-batch GD is commonly used in deep learning.
- Batch GD is smooth but computationally heavy.
- SGD is noisy but gives many updates.
- Mini-batch GD is a practical compromise.
- A learning rate that is too large can cause divergence.
- A learning rate that is too small can make training very slow.
- Steep directions can cause unbalanced updates.