Gradient Descent

Gradient Descent for Linear Regression #

Gradient descent is an iterative optimisation method used to minimise the regression cost function by repeatedly updating parameters in the direction that reduces error.

  • Iterative method
  • Types: batch / stochastic / mini-batch

Key takeaway: Gradient descent starts with initial parameter values and repeatedly updates them using the gradient until the cost stops decreasing.

flowchart TD
GD["Gradient<br/>Descent"] -->|minimises| CF["Cost<br/>function"]
GD -->|updates| W["Parameters<br/>(weights)"]
GD -->|uses| GR["Gradient<br/>(slope)"]

GD --> H["Hyperparameters"]
H --> LR["Learning<br/>rate"]
H --> BS["Batch<br/>size"]
H --> EP["Epochs"]

style GD fill:#90CAF9,stroke:#1E88E5,color:#000

style CF fill:#CE93D8,stroke:#8E24AA,color:#000
style W fill:#CE93D8,stroke:#8E24AA,color:#000
style GR fill:#CE93D8,stroke:#8E24AA,color:#000
style H fill:#CE93D8,stroke:#8E24AA,color:#000
style LR fill:#CE93D8,stroke:#8E24AA,color:#000
style BS fill:#CE93D8,stroke:#8E24AA,color:#000
style EP fill:#CE93D8,stroke:#8E24AA,color:#000

Types of GD #

flowchart TD
T["Gradient Descent<br/>types"] --> BGD["Batch<br/>GD"]
T --> SGD["Stochastic<br/>GD"]
T --> MGD["Mini-batch<br/>GD"]

BGD --> ALL["All data<br/>per step"]
BGD --> STB["Smooth<br/>updates"]

SGD --> ONE["1 sample<br/>per step"]
SGD --> FAST["Quick<br/>progress"]
SGD --> NOISE["Noisy<br/>updates"]

MGD --> MB["Small batch<br/>per step"]
MGD --> PRACT["Practical<br/>default"]

style T fill:#90CAF9,stroke:#1E88E5,color:#000

style BGD fill:#C8E6C9,stroke:#2E7D32,color:#000
style SGD fill:#C8E6C9,stroke:#2E7D32,color:#000
style MGD fill:#C8E6C9,stroke:#2E7D32,color:#000

style ALL fill:#CE93D8,stroke:#8E24AA,color:#000
style STB fill:#CE93D8,stroke:#8E24AA,color:#000
style ONE fill:#CE93D8,stroke:#8E24AA,color:#000
style FAST fill:#CE93D8,stroke:#8E24AA,color:#000
style NOISE fill:#CE93D8,stroke:#8E24AA,color:#000
style MB fill:#CE93D8,stroke:#8E24AA,color:#000
style PRACT fill:#CE93D8,stroke:#8E24AA,color:#000

Batch #

  • computes the gradient over the full training set each step, so use it only if you have ample compute and a lot of time to train

SGD #

  • go-to solution

  • takes one random sample per step and updates the parameters

  • since samples are chosen at random → all updates are noisy

  • does not converge perfectly to the sharpest minima

  • settles near flat minima → better generalisation → helps prevent overfitting

  • if the learning rate is too high → updates overshoot and the cost can oscillate or diverge

  • if the learning rate is too low → convergence is very slow
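The per-sample updates above can be sketched in plain Python. This is a minimal illustration for a univariate linear model, assuming noiseless data and an illustrative function name `sgd_epoch`:

```python
import random

def sgd_epoch(w, b, xs, ys, alpha):
    """One SGD epoch: update (w, b) using one random sample per step."""
    order = list(range(len(xs)))
    random.shuffle(order)               # visit samples in random order
    for i in order:
        err = (w * xs[i] + b) - ys[i]   # prediction error on one sample
        w -= alpha * err * xs[i]        # noisy gradient step for w
        b -= alpha * err                # noisy gradient step for b
    return w, b

# Fit y = 2x + 1 from a few points
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
w, b = 0.0, 0.0
for _ in range(500):
    w, b = sgd_epoch(w, b, xs, ys, alpha=0.05)
print(w, b)
```

Each step only uses one sample's gradient, which is why consecutive updates jump around even though the overall trend moves toward the minimum.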

Mini-batch #

  • choose a batch size
  • pick mini-batches from training data
  • each iteration draws a random batch (e.g. 32 or 64 points), so an epoch covers the whole dataset across iterations
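The batching scheme above can be sketched as an index generator; a minimal version, assuming the illustrative name `minibatches`:

```python
import random

def minibatches(n, batch_size):
    """Yield lists of shuffled sample indices, batch_size at a time."""
    idx = list(range(n))
    random.shuffle(idx)                      # reshuffle once per epoch
    for start in range(0, n, batch_size):
        yield idx[start:start + batch_size]  # last batch may be smaller

batches = list(minibatches(n=10, batch_size=4))
print([len(batch) for batch in batches])
```

Together the batches cover every sample exactly once per epoch, in a random order.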

Terminology #

Epochs #

  • hyperparameter
  • one full pass of the entire training dataset through the learning algorithm
  • dictates how many times the model updates its internal parameters by seeing every sample
  • multiple epochs are generally needed for convergence
    • too few cause underfitting
    • too many can cause overfitting
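The epoch definition above amounts to an outer loop over the data. A minimal sketch, where `update` is a stand-in for any single optimisation step (illustrative names throughout):

```python
def train_epochs(params, data, epochs, update):
    """One epoch = one full pass; every sample is seen once per epoch."""
    for epoch in range(epochs):
        for sample in data:
            params = update(params, sample)
    return params

# Count how many updates 3 epochs over 4 samples perform
count = train_epochs(0, [None] * 4, epochs=3, update=lambda p, s: p + 1)
print(count)  # updates = epochs x samples
```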

Batch #

  • a subset of the training data processed together during one training step
  • used because the entire dataset is often too large to process at once

Batch size #

  • hyperparameter that defines the number of training examples in one batch
  • determines how much data is processed before the model updates its parameters
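The relationship between batch size and the number of update steps is simple arithmetic; a small worked example with illustrative numbers:

```python
import math

n_samples = 1000      # training set size (illustrative)
batch_size = 64       # hyperparameter
epochs = 5

# ceil because the last batch may be partial
iters_per_epoch = math.ceil(n_samples / batch_size)
total_iters = iters_per_epoch * epochs
print(iters_per_epoch, total_iters)
```

With these numbers the model performs 16 parameter updates per epoch.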

Iterations #

  • a single training step where the model processes one batch and updates its parameters (weights and biases)

Each iteration includes:

  • Forward pass: compute predictions and cost
  • Backward pass: compute gradients and update model’s parameters to reduce the loss

Gradients (Partial Derivatives) #

We minimise the squared-error cost $J(w,b)$ for the model $f_{w,b}(x) = wx + b$:

\[ J(w,b) = \frac{1}{2m}\sum_{i=0}^{m-1}\left(f_{w,b}\!\left(x^{(i)}\right)-y^{(i)}\right)^2 \]

The partial derivatives are:

\[ \frac{\partial J}{\partial w} = \frac{1}{m}\sum_{i=0}^{m-1}\left(f_{w,b}\!\left(x^{(i)}\right)-y^{(i)}\right)x^{(i)} \] \[ \frac{\partial J}{\partial b} = \frac{1}{m}\sum_{i=0}^{m-1}\left(f_{w,b}\!\left(x^{(i)}\right)-y^{(i)}\right) \]
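The two partial derivatives translate directly into code. A minimal sketch for univariate data, assuming the illustrative name `compute_gradient`:

```python
def compute_gradient(w, b, xs, ys):
    """Return (dJ/dw, dJ/db) for the squared-error cost of f(x) = w*x + b."""
    m = len(xs)
    dj_dw = dj_db = 0.0
    for x, y in zip(xs, ys):
        err = (w * x + b) - y    # f_wb(x^(i)) - y^(i)
        dj_dw += err * x         # error weighted by the input for dJ/dw
        dj_db += err             # plain error for dJ/db
    return dj_dw / m, dj_db / m

# At the true parameters of y = 2x + 1 both gradients are zero
xs = [0.0, 1.0, 2.0]
ys = [1.0, 3.0, 5.0]
print(compute_gradient(2.0, 1.0, xs, ys))
```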

Gradient meaning: It points in the direction of steepest increase of the cost, so stepping against it reduces the cost fastest.


Update Equations #

Starting from initial values, parameters are updated each iteration:

\[ w^{(t+1)} := w^{(t)} - \alpha \frac{\partial J}{\partial w} \] \[ b^{(t+1)} := b^{(t)} - \alpha \frac{\partial J}{\partial b} \]

Where:

  • $\alpha$ is the learning rate
  • $t$ is the iteration number

One iteration #

To do one iteration of gradient descent:

  1. Use current parameters to compute predictions.
  2. Compute the gradients at the current parameters.
  3. Apply the update:
\[ \theta \leftarrow \theta - \alpha \nabla J(\theta) \]

Then stop (if calculating for one iteration only).
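The three steps can be sketched as one function. A minimal version for univariate linear regression, assuming the illustrative name `gd_step`:

```python
def gd_step(w, b, xs, ys, alpha):
    """One gradient-descent iteration: predict, compute gradients, update."""
    m = len(xs)
    preds = [w * x + b for x in xs]                                  # 1. predictions
    dj_dw = sum((p - y) * x for p, y, x in zip(preds, ys, xs)) / m   # 2. gradients
    dj_db = sum(p - y for p, y in zip(preds, ys)) / m
    return w - alpha * dj_dw, b - alpha * dj_db                      # 3. update

w, b = gd_step(0.0, 0.0, [1.0, 2.0], [2.0, 4.0], alpha=0.1)
print(w, b)
```

Note the gradients are computed from the old parameters before either one is updated (a simultaneous update).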


Matrix/Vector Form (Least Squares + Gradient Descent) #

For an overdetermined system $Ap=b$, the squared error can be written as:

\[ SSE = (Ap-b)^T(Ap-b) \]

Its gradient:

\[ \nabla_p(SSE) = 2A^TAp - 2A^Tb \]

Gradient descent update:

\[ p^{(t+1)} = p^{(t)} - \alpha\left(2A^TAp^{(t)} - 2A^Tb\right) \]
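The matrix-form update can be run directly in NumPy. A minimal sketch, assuming the illustrative name `lstsq_gd` and a small step size chosen for this example:

```python
import numpy as np

def lstsq_gd(A, b, alpha=0.01, iters=5000):
    """Minimise (Ap - b)^T (Ap - b) by gradient descent."""
    p = np.zeros(A.shape[1])
    for _ in range(iters):
        grad = 2 * A.T @ (A @ p - b)   # 2 A^T A p - 2 A^T b
        p -= alpha * grad
    return p

# Overdetermined system: fit a line y = p0 + p1*x through 3 points
A = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 3.0, 5.0])
print(lstsq_gd(A, b))
```

For a well-conditioned $A^TA$ this converges to the same solution as the normal equations; for large or ill-conditioned systems the step size needs care.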

Choosing the learning rate $\alpha$ #

If $\alpha$ is too large:

  • you can overshoot the minimum
  • the cost may oscillate or diverge

If $\alpha$ is too small:

  • training is very slow
  • you may need many iterations

Practical habit: monitor the cost $J$ across iterations and ensure it decreases steadily.
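One way to follow that habit is to record $J$ every iteration and compare runs. A minimal sketch for univariate linear regression (illustrative names `cost` and `train`; the "bad" step size is deliberately too large):

```python
def cost(w, b, xs, ys):
    """Squared-error cost J(w, b)."""
    m = len(xs)
    return sum(((w * x + b) - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)

def train(alpha, xs, ys, iters=100):
    """Run gradient descent, recording J after each iteration."""
    w = b = 0.0
    m = len(xs)
    history = []
    for _ in range(iters):
        dj_dw = sum(((w * x + b) - y) * x for x, y in zip(xs, ys)) / m
        dj_db = sum(((w * x + b) - y) for x, y in zip(xs, ys)) / m
        w, b = w - alpha * dj_dw, b - alpha * dj_db
        history.append(cost(w, b, xs, ys))
    return history

xs, ys = [0.0, 1.0, 2.0], [1.0, 3.0, 5.0]
good = train(alpha=0.1, xs=xs, ys=ys)
bad = train(alpha=2.0, xs=xs, ys=ys)
print(good[-1] < good[0])   # cost decreases steadily
print(bad[-1] > bad[0])     # cost blows up: alpha too large
```

Plotting `history` against the iteration number gives the usual learning curve; a curve that rises or oscillates is the standard signal to lower $\alpha$.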


References #

  • Also see DNN - GD
  • /docs/ai/machine-learning/03-linear-models-regression/
  • /docs/ai/machine-learning/03-ordinary-least-squares/
  • /docs/ai/machine-learning/03-cost-function/
