Gradient Descent

Gradient Descent for Linear Regression #

Revision:
Gradient descent is the step-by-step method for reducing the cost function when a direct closed-form solution is not convenient.


Where Gradient Descent Fits in ML ☆ #

Gradient descent is used when we want the model to learn parameters by repeatedly improving them.

For linear regression, it adjusts the slope and intercept until the prediction error becomes small.

flowchart LR
    A["Initial Parameters"] --> B["Make Predictions"]
    B --> C["Compute Cost"]
    C --> D["Compute Gradient"]
    D --> E["Update Parameters"]
    E --> B

    style A fill:#E1F5FE,stroke:#5b7db1,color:#000
    style B fill:#C8E6C9,stroke:#5f8f6a,color:#000
    style C fill:#FFF9C4,stroke:#b59b3b,color:#000
    style D fill:#EDE7F6,stroke:#8a6fb3,color:#000
    style E fill:#C8E6C9,stroke:#5f8f6a,color:#000

Core Idea ☆ #

The gradient tells us the direction in which the cost increases fastest.

To reduce the cost, we move in the opposite direction.

\[ \text{new parameter} = \text{old parameter} - \text{learning rate} \times \text{gradient} \]

This is why the update rule subtracts the gradient.


Exam Recognition Cues ☆ #

Use gradient descent when the question gives:

  • initial parameter values
  • learning rate \( \alpha \) or \( \eta \)
  • number of iterations
  • wording such as “show the first iteration” or “perform two updates”

If a problem asks for only one iteration, do not try to find the final optimal line.
Use the current parameters, compute predictions, compute gradients, update once, and stop.

Gradient descent is an iterative optimisation method used to minimise the regression cost function by repeatedly updating parameters in the direction that reduces error.

  • Iterative method
  • Types: batch / stochastic / mini-batch

Key takeaway: Gradient descent starts with initial parameter values and repeatedly updates them using the gradient until the cost stops decreasing.

flowchart TD
GD["Gradient<br/>Descent"] -->|minimises| CF["Cost<br/>function"]
GD -->|updates| W["Parameters<br/>(weights)"]
GD -->|uses| GR["Gradient<br/>(slope)"]

GD --> H["Hyperparameters"]
H --> LR["Learning<br/>rate"]
H --> BS["Batch<br/>size"]
H --> EP["Epochs"]

style GD fill:#90CAF9,stroke:#1E88E5,color:#000

style CF fill:#CE93D8,stroke:#8E24AA,color:#000
style W fill:#CE93D8,stroke:#8E24AA,color:#000
style GR fill:#CE93D8,stroke:#8E24AA,color:#000
style H fill:#CE93D8,stroke:#8E24AA,color:#000
style LR fill:#CE93D8,stroke:#8E24AA,color:#000
style BS fill:#CE93D8,stroke:#8E24AA,color:#000
style EP fill:#CE93D8,stroke:#8E24AA,color:#000

Types of GD #

flowchart TD
T["Gradient Descent<br/>types"] --> BGD["Batch<br/>GD"]
T --> SGD["Stochastic<br/>GD"]
T --> MGD["Mini-batch<br/>GD"]

BGD --> ALL["All data<br/>per step"]
BGD --> STB["Smooth<br/>updates"]

SGD --> ONE["1 sample<br/>per step"]
SGD --> FAST["Quick<br/>progress"]
SGD --> NOISE["Noisy<br/>updates"]

MGD --> MB["Small batch<br/>per step"]
MGD --> PRACT["Practical<br/>default"]

style T fill:#90CAF9,stroke:#1E88E5,color:#000

style BGD fill:#C8E6C9,stroke:#2E7D32,color:#000
style SGD fill:#C8E6C9,stroke:#2E7D32,color:#000
style MGD fill:#C8E6C9,stroke:#2E7D32,color:#000

style ALL fill:#CE93D8,stroke:#8E24AA,color:#000
style STB fill:#CE93D8,stroke:#8E24AA,color:#000
style ONE fill:#CE93D8,stroke:#8E24AA,color:#000
style FAST fill:#CE93D8,stroke:#8E24AA,color:#000
style NOISE fill:#CE93D8,stroke:#8E24AA,color:#000
style MB fill:#CE93D8,stroke:#8E24AA,color:#000
style PRACT fill:#CE93D8,stroke:#8E24AA,color:#000

Batch #

  • Use only if you have huge compute and a lot of time to train

SGD #

  • go-to solution

  • takes random samples and updates parameters

  • since randon → all updates are noisy

  • does not converge perfectly to sharpest minima

  • settles near to flat minima → better generalisation → helps prevent over fitting

  • if learning rate is too high → can cause under fitting

  • if learning rate is too high → can cause over fitting

Mini-batch #

  • choose a batch size
  • pick mini-batches from training data
  • randomly take (32, 64, …) points in every epoch across iterations

Terminology #

Epochs #

  • hyperparameter
  • one full pass of the entire training dataset through the learning algorithm
  • dictates how many times the model updates its internal parameters by seeing every sample
  • multiple epochs are generally needed for convergence
    • too few cause underfitting
    • too many can cause overfitting

Batch #

  • a subset of the training data processed together during one training step
  • used because the entire dataset is often too large to process at once

Batch size #

  • hyperparameter that defines the number of training examples in one batch
  • determines how much data is processed before the model updates its parameters

Iterations #

  • a single training step where the model processes one batch and updates its parameters (weights and biases)

Each iteration includes:

  • Forward pass: compute predictions and cost
  • Backward pass: compute gradients and update model’s parameters to reduce the loss

Gradients (Partial Derivatives) #

We minimise the cost $J(w,b)$ defined here:

The partial derivatives are:

\[ \frac{\partial J}{\partial w} = \frac{1}{m}\sum_{i=0}^{m-1}\left(f_{w,b}\!\left(x^{(i)}\right)-y^{(i)}\right)x^{(i)} \] \[ \frac{\partial J}{\partial b} = \frac{1}{m}\sum_{i=0}^{m-1}\left(f_{w,b}\!\left(x^{(i)}\right)-y^{(i)}\right) \]

Gradient meaning: It tells you how to change parameters to reduce the cost fastest.


Update Equations #

Starting from initial values, parameters are updated each iteration:

\[ w^{(t+1)} := w^{(t)} - \alpha \frac{\partial J}{\partial w} \] \[ b^{(t+1)} := b^{(t)} - \alpha \frac{\partial J}{\partial b} \]

Where:

  • $\alpha$ is the learning rate
  • $t$ is the iteration number

One iteration #

To do one iteration of gradient descent:

  1. Use current parameters to compute predictions.
  2. Compute the gradients at the current parameters.
  3. Apply the update:
\[ \theta \leftarrow \theta - \alpha \nabla J(\theta) \]

Then stop (if claculating for one iteration only).


Matrix/Vector Form (Least Squares + Gradient Descent) #

For an overdetermined system $Ap=b$, the squared error can be written as:

\[ SSE = (Ap-b)^T(Ap-b) \]

Its gradient:

\[ \nabla_p(SSE) = 2A^TAp - 2A^Tb \]

Gradient descent update:

\[ p^{(t+1)} = p^{(t)} - \alpha\left(2A^TAp^{(t)} - 2A^Tb\right) \]

Choosing the learning rate $\alpha$ #

If $\alpha$ is too large:

  • you can overshoot the minimum
  • the cost may oscillate or diverge

If $\alpha$ is too small:

  • training is very slow
  • you may need many iterations

Practical habit: monitor the cost $J$ across iterations and ensure it decreases steadily.


Notes #

  • Gradient descent is iterative.
  • It needs a learning rate.
  • It updates parameters using gradients.
  • Batch GD uses all records per update.
  • Stochastic GD uses one record per update.
  • Mini-batch GD uses a small group of records per update.
  • A very large learning rate may overshoot or diverge.
  • A very small learning rate may converge very slowly.

Revision #

The main update pattern is:

\[ \theta \leftarrow \theta - \alpha \nabla J(\theta) \]

Memory line:

Predict → Compute error → Compute gradient → Update parameters → Repeat


Summary #

Gradient descent is used to minimise the cost function when training a model. It starts with initial parameter values and repeatedly adjusts them in the direction that reduces error. For linear regression, it can find the best-fit line without directly using the normal equation.

References #

  • Also see DNN - GD
  • /docs/ai/machine-learning/03-linear-models-regression/
  • /docs/ai/machine-learning/03-ordinary-least-squares/
  • /docs/ai/machine-learning/03-cost-function/

Home | Machine Learning