Gradient Descent for Linear Regression #
Gradient descent is an iterative optimisation method used to minimise the regression cost function by repeatedly updating parameters in the direction that reduces error.
- Iterative method
- Types: batch / stochastic / mini-batch
Key takeaway: Gradient descent starts with initial parameter values and repeatedly updates them using the gradient until the cost stops decreasing.
```mermaid
flowchart TD
    GD["Gradient<br/>Descent"] -->|minimises| CF["Cost<br/>function"]
    GD -->|updates| W["Parameters<br/>(weights)"]
    GD -->|uses| GR["Gradient<br/>(slope)"]
    GD --> H["Hyperparameters"]
    H --> LR["Learning<br/>rate"]
    H --> BS["Batch<br/>size"]
    H --> EP["Epochs"]
    style GD fill:#90CAF9,stroke:#1E88E5,color:#000
    style CF fill:#CE93D8,stroke:#8E24AA,color:#000
    style W fill:#CE93D8,stroke:#8E24AA,color:#000
    style GR fill:#CE93D8,stroke:#8E24AA,color:#000
    style H fill:#CE93D8,stroke:#8E24AA,color:#000
    style LR fill:#CE93D8,stroke:#8E24AA,color:#000
    style BS fill:#CE93D8,stroke:#8E24AA,color:#000
    style EP fill:#CE93D8,stroke:#8E24AA,color:#000
```
Types of GD #
```mermaid
flowchart TD
    T["Gradient Descent<br/>types"] --> BGD["Batch<br/>GD"]
    T --> SGD["Stochastic<br/>GD"]
    T --> MGD["Mini-batch<br/>GD"]
    BGD --> ALL["All data<br/>per step"]
    BGD --> STB["Smooth<br/>updates"]
    SGD --> ONE["1 sample<br/>per step"]
    SGD --> FAST["Quick<br/>progress"]
    SGD --> NOISE["Noisy<br/>updates"]
    MGD --> MB["Small batch<br/>per step"]
    MGD --> PRACT["Practical<br/>default"]
    style T fill:#90CAF9,stroke:#1E88E5,color:#000
    style BGD fill:#C8E6C9,stroke:#2E7D32,color:#000
    style SGD fill:#C8E6C9,stroke:#2E7D32,color:#000
    style MGD fill:#C8E6C9,stroke:#2E7D32,color:#000
    style ALL fill:#CE93D8,stroke:#8E24AA,color:#000
    style STB fill:#CE93D8,stroke:#8E24AA,color:#000
    style ONE fill:#CE93D8,stroke:#8E24AA,color:#000
    style FAST fill:#CE93D8,stroke:#8E24AA,color:#000
    style NOISE fill:#CE93D8,stroke:#8E24AA,color:#000
    style MB fill:#CE93D8,stroke:#8E24AA,color:#000
    style PRACT fill:#CE93D8,stroke:#8E24AA,color:#000
```
Batch #
- computes the gradient on the entire training set each step → smooth but expensive updates
- use only if you have huge compute and a lot of time to train
SGD #
- go-to solution in practice
- takes one random sample per step and updates the parameters
- since samples are random → all updates are noisy
- does not converge perfectly to the sharpest minima
- settles near flat minima → better generalisation → helps prevent overfitting
- if the learning rate is too high → can cause underfitting (updates overshoot and never settle)
- if the learning rate is too low → can cause overfitting (settles into sharp minima) and slow training
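A minimal sketch of the single-sample update described above, using hypothetical 1-D data and a model with one weight (no intercept); the data, seed, and learning rate are my assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.arange(1.0, 11.0)                 # hypothetical 1-D inputs
y = 3.0 * x + rng.normal(0.0, 0.5, x.size)  # true slope 3, plus noise

w = 0.0
alpha = 0.005
for epoch in range(20):
    for i in rng.permutation(x.size):    # one random sample per update
        err = w * x[i] - y[i]
        w -= alpha * err * x[i]          # noisy single-sample gradient step
```

Because each step uses one sample, `w` never settles exactly on a value; it fluctuates around the true slope, which is exactly the noisy behaviour noted above.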
Mini-batch #
- choose a batch size (commonly 32, 64, …)
- split the training data into random mini-batches every epoch
- each iteration processes one mini-batch and updates the parameters
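The mini-batch selection above can be sketched as follows; the function name `minibatches` and the sizes are illustrative, not from any library:

```python
import numpy as np

def minibatches(n_samples, batch_size, rng):
    """Yield index arrays for random mini-batches covering one epoch."""
    idx = rng.permutation(n_samples)     # shuffle once per epoch
    for start in range(0, n_samples, batch_size):
        yield idx[start:start + batch_size]

rng = np.random.default_rng(0)
batches = list(minibatches(100, 32, rng))
# 100 samples with batch size 32 -> 4 batches (the last one is smaller)
```

Shuffling once per epoch and slicing guarantees every sample is used exactly once per epoch, while keeping each batch random.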
Terminology #
Epochs #
- hyperparameter
- one full pass of the entire training dataset through the learning algorithm
- the number of epochs dictates how many times the learning algorithm sees every training sample
- multiple epochs are generally needed for convergence
- too few cause underfitting
- too many can cause overfitting
Batch #
- a subset of the training data processed together during one training step
- used because the entire dataset is often too large to process at once
Batch size #
- hyperparameter that defines the number of training examples in one batch
- determines how much data is processed before the model updates its parameters
Iterations #
- a single training step where the model processes one batch and updates its parameters (weights and biases)
Each iteration includes:
- Forward pass: compute predictions and cost
- Backward pass: compute gradients and update model’s parameters to reduce the loss
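The terms above relate arithmetically; a small sketch with hypothetical sizes:

```python
import math

n_samples = 1000   # training set size (hypothetical)
batch_size = 32
epochs = 10

# iterations per epoch: one per batch, rounding up for the final partial batch
iters_per_epoch = math.ceil(n_samples / batch_size)
total_iterations = iters_per_epoch * epochs  # total parameter updates
```

So here the model performs 32 updates per epoch and 320 updates in total.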
Gradients (Partial Derivatives) #
We minimise the cost $J(w,b)$ for the linear model $f_{w,b}(x) = wx + b$:

\[ J(w,b) = \frac{1}{2m}\sum_{i=0}^{m-1}\left(f_{w,b}\!\left(x^{(i)}\right)-y^{(i)}\right)^2 \]
The partial derivatives are:
\[ \frac{\partial J}{\partial w} = \frac{1}{m}\sum_{i=0}^{m-1}\left(f_{w,b}\!\left(x^{(i)}\right)-y^{(i)}\right)x^{(i)} \]

\[ \frac{\partial J}{\partial b} = \frac{1}{m}\sum_{i=0}^{m-1}\left(f_{w,b}\!\left(x^{(i)}\right)-y^{(i)}\right) \]

Gradient meaning: it tells you how to change the parameters to reduce the cost fastest.
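These partial derivatives translate directly into code; a sketch with tiny hypothetical data:

```python
import numpy as np

def compute_gradients(x, y, w, b):
    """Partial derivatives of J(w,b) for the linear model f(x) = w*x + b."""
    err = (w * x + b) - y          # f_{w,b}(x^(i)) - y^(i)
    dj_dw = np.mean(err * x)       # (1/m) * sum(err * x)
    dj_db = np.mean(err)           # (1/m) * sum(err)
    return dj_dw, dj_db

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0])      # perfectly fit by w=2, b=0
dj_dw, dj_db = compute_gradients(x, y, 2.0, 0.0)  # both vanish at the exact fit
```

At the minimum both gradients are zero, so the update equations below leave the parameters unchanged.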
Update Equations #
Starting from initial values, parameters are updated each iteration:
\[ w^{(t+1)} := w^{(t)} - \alpha \frac{\partial J}{\partial w} \]

\[ b^{(t+1)} := b^{(t)} - \alpha \frac{\partial J}{\partial b} \]

Where:
- $\alpha$ is the learning rate
- $t$ is the iteration number
One iteration #
To do one iteration of gradient descent:
- Use current parameters to compute predictions.
- Compute the gradients at the current parameters.
- Apply the update equations to get the new parameter values.
Then stop (if calculating for one iteration only).
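The steps above can be sketched as one function; the data and learning rate are hypothetical:

```python
import numpy as np

def gd_step(x, y, w, b, alpha):
    """One gradient-descent iteration: predictions -> gradients -> update."""
    y_hat = w * x + b               # forward pass: predictions
    err = y_hat - y
    dj_dw = np.mean(err * x)        # backward pass: gradients
    dj_db = np.mean(err)
    w -= alpha * dj_dw              # parameter update
    b -= alpha * dj_db
    return w, b

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0])
w, b = gd_step(x, y, 0.0, 0.0, 0.1)   # one iteration from w=0, b=0
```

Full training is just this step repeated until the cost stops decreasing.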
Matrix/Vector Form (Least Squares + Gradient Descent) #
For an overdetermined system $Ap=b$, the squared error can be written as:
\[ SSE = (Ap-b)^T(Ap-b) \]

Its gradient:

\[ \nabla_p(SSE) = 2A^TAp - 2A^Tb \]

Gradient descent update:

\[ p^{(t+1)} = p^{(t)} - \alpha\left(2A^TAp^{(t)} - 2A^Tb\right) \]

Choosing the learning rate $\alpha$ #
If $\alpha$ is too large:
- you can overshoot the minimum
- the cost may oscillate or diverge
If $\alpha$ is too small:
- training is very slow
- you may need many iterations
Practical habit: monitor the cost $J$ across iterations and ensure it decreases steadily.
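A sketch combining the matrix-form update with this habit, using a hypothetical overdetermined system (my own `A`, `b`, and `alpha`):

```python
import numpy as np

# Hypothetical overdetermined system A p = b (4 equations, 2 unknowns)
A = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0],
              [1.0, 4.0]])
b = np.array([3.1, 4.9, 7.2, 8.8])

p = np.zeros(2)
alpha = 0.01
sse_history = []
for _ in range(5000):
    grad = 2 * A.T @ (A @ p - b)    # gradient of SSE: 2*A^T*A*p - 2*A^T*b
    p -= alpha * grad
    sse_history.append(float((A @ p - b) @ (A @ p - b)))

# With a well-chosen alpha, SSE decreases steadily toward the least-squares
# optimum; if it grows or oscillates, alpha is too large.
p_exact = np.linalg.solve(A.T @ A, A.T @ b)   # normal-equations solution
```

Plotting or asserting on `sse_history` is a cheap sanity check that `alpha` is in a workable range before committing to a long training run.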
References #
- Also see DNN - GD
- /docs/ai/machine-learning/03-linear-models-regression/
- /docs/ai/machine-learning/03-ordinary-least-squares/
- /docs/ai/machine-learning/03-cost-function/