Gradient Descent Algorithm #
Gradient Descent Algorithm (GDA) is
- an optimisation method
- used to train models
- by repeatedly updating parameters (weights and biases) to reduce the loss
In deep learning, the default training approach is almost always mini-batch gradient descent, usually with Adam or SGD + momentum.
Gradient Descent is used in both regression and classification.
It’s not tied to the task type — it’s tied to the fact you have:
- a model with parameters (weights/bias), and
- a loss function you want to minimise.
In regression #
- You predict a number (e.g., house price).
- Common loss: Mean Squared Error (MSE).
- Gradient descent adjusts weights to reduce MSE.
In classification #
- You predict a class (e.g., spam vs not spam).
- Common loss: cross-entropy / log loss.
- Gradient descent adjusts weights to reduce classification loss.
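The two losses above can be sketched in a few lines of plain Python (illustrative helper names and made-up numbers, not library code):

```python
import math

def mse(y_true, y_pred):
    """Mean Squared Error: average of squared prediction errors."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def binary_cross_entropy(y_true, y_prob):
    """Log loss for binary labels (0/1) and predicted probabilities."""
    eps = 1e-12  # clip probabilities to avoid log(0)
    total = 0.0
    for t, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)
        total += -(t * math.log(p) + (1 - t) * math.log(1 - p))
    return total / len(y_true)

print(mse([3.0, 5.0], [2.5, 5.5]))               # 0.25
print(binary_cross_entropy([1, 0], [0.9, 0.2]))  # ~0.164
```

Gradient descent treats either loss the same way: compute its gradient with respect to the parameters, then step downhill.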
What gradient descent does #
You start with some parameters, measure the loss, and update parameters to reduce that loss.
Update rule (conceptual):
\[ \theta \leftarrow \theta - \eta \nabla_{\theta} J(\theta) \]

Where:
- \( \theta \) = parameters (weights/biases)
- \( J(\theta) \) = loss (objective)
- \( \nabla_{\theta} J(\theta) \) = gradient (direction of steepest increase)
- \( \eta \) = learning rate (step size)
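The update rule above can be sketched as a toy 1-D example (names and numbers are illustrative, not framework code):

```python
def gradient_descent(grad, theta0, lr=0.1, steps=100):
    """Repeatedly apply: theta <- theta - lr * grad(theta)."""
    theta = theta0
    for _ in range(steps):
        theta = theta - lr * grad(theta)
    return theta

# Minimise J(theta) = (theta - 3)^2; its gradient is 2 * (theta - 3).
theta_star = gradient_descent(lambda t: 2 * (t - 3), theta0=0.0)
print(theta_star)  # converges towards the minimiser, 3.0
```

The only thing that changes between the variants below is how much data is used to compute `grad` at each step.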
Types of Gradient Descent #
There are three common “data usage” types:
- Batch Gradient Descent: uses the entire dataset per update
- Stochastic Gradient Descent (SGD): uses one example per update
- Mini-batch Gradient Descent: uses a small batch per update (the deep learning standard)
```mermaid
flowchart TD
    A["Gradient Descent<br/>Algorithm (GDA)"] --> B["Batch GD<br/>(full dataset)"]
    A --> C["Stochastic GD<br/>(one sample)"]
    A --> D["Mini-batch GD<br/>(small batch)"]
    D --> E["Often paired with:<br/>Momentum / Adam"]

    %% Pastel colour scheme
    style A fill:#E3F2FD,stroke:#1E88E5,stroke-width:1px
    style B fill:#FFF3E0,stroke:#FB8C00,stroke-width:1px
    style C fill:#FCE4EC,stroke:#D81B60,stroke-width:1px
    style D fill:#E8F5E9,stroke:#43A047,stroke-width:1px
    style E fill:#F3E5F5,stroke:#8E24AA,stroke-width:1px
```
Batch Gradient Descent #
- Uses the entire dataset to compute one update.
- Stable updates, but can be slow on large datasets.
How it works: one update uses all \( N \) training examples.
Pros:
- stable, smooth loss curve
- good for small datasets
Cons:
- slow when \( N \) is large
- one update can be expensive
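A full-batch sketch for a toy linear-regression problem (function name and data are illustrative):

```python
def batch_gd(xs, ys, lr=0.05, epochs=500):
    """One parameter update per epoch, using ALL n examples (full-batch MSE gradient)."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # Gradients of MSE = (1/n) * sum((w*x + b - y)^2) over the whole dataset
        grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Data generated from y = 2x + 1
w, b = batch_gd([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])
print(w, b)  # approaches w=2, b=1
```

Note the cost: every single update touches every example, which is exactly why this becomes slow for large \( N \).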
Stochastic Gradient Descent (SGD) #
- Uses one training example at a time for each update.
- Very fast per update, but updates are noisy (loss may bounce around).
How it works: one update uses a single training example.
Pros:
- very fast updates
- the noise can “jitter” it out of shallow local minima
Cons:
- noisy updates (loss may bounce)
- can require more iterations to settle
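The same toy problem trained one example at a time (an illustrative sketch; the shuffle seed is arbitrary):

```python
import random

def sgd(xs, ys, lr=0.01, epochs=200, seed=0):
    """One parameter update per single training example (cheap but noisy)."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    idx = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(idx)  # visit examples in a fresh random order each epoch
        for i in idx:
            err = w * xs[i] + b - ys[i]
            w -= lr * 2 * err * xs[i]   # gradient from this ONE example only
            b -= lr * 2 * err
    return w, b

# Data generated from y = 2x + 1
w, b = sgd([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])
print(w, b)  # hovers near w=2, b=1
```

Each epoch now performs \( N \) cheap updates instead of one expensive one, at the price of a noisier path through parameter space.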
Mini-batch Gradient Descent #
Mini-batch GD is the most widely used training method in deep learning.
- Uses a small batch of size \( B \) for each update (common sizes: 32, 64, 128, 256, sometimes up to 512).
- This is the standard approach in deep learning.
Why it is popular:
- runs efficiently on GPUs (matrix operations)
- less noisy than SGD
- much cheaper per update than full-batch
- best practical balance of speed and stability
If you remember one thing: Mini-batch GD is the default for deep learning because it scales well with data and compute.
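The same toy problem with batches of size 2 (an illustrative sketch; real training uses far larger datasets and batches, and frameworks vectorise the batch gradient on GPUs):

```python
import random

def minibatch_gd(xs, ys, lr=0.05, epochs=500, batch_size=2, seed=0):
    """One parameter update per mini-batch of size B."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    idx = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(idx)  # re-partition the data into batches each epoch
        for start in range(0, len(idx), batch_size):
            batch = idx[start:start + batch_size]
            # Gradient of MSE averaged over just this batch
            grad_w = sum(2 * (w * xs[i] + b - ys[i]) * xs[i] for i in batch) / len(batch)
            grad_b = sum(2 * (w * xs[i] + b - ys[i]) for i in batch) / len(batch)
            w -= lr * grad_w
            b -= lr * grad_b
    return w, b

# Data generated from y = 2x + 1
w, b = minibatch_gd([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])
print(w, b)  # close to w=2, b=1
```

Setting `batch_size=len(xs)` recovers batch GD and `batch_size=1` recovers SGD, which is why mini-batch sits between the two.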
Mini-batch update (same idea, just different data) #
You compute the gradient using only the mini-batch:
\[ \theta \leftarrow \theta - \eta \nabla_{\theta} J_{\text{batch}}(\theta) \]

Popular “upgrades” used with mini-batches #
These are still gradient descent, but with smarter updates.
Momentum (very common with SGD) #
Momentum keeps a running “velocity” so updates build up in consistent directions.
- Adds “inertia” so updates keep moving in the same direction when that helps.
- Reduces zig-zagging and speeds up convergence.
\[ v_t \leftarrow \beta v_{t-1} + g_t, \qquad \theta \leftarrow \theta - \eta v_t \]

Where \( g_t \) is the mini-batch gradient and \( \beta \) (typically 0.9) controls smoothing.
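A minimal sketch of this velocity-based update, assuming the common form `v = beta*v + g` (some texts scale `g` by `1 - beta` instead):

```python
def momentum_step(theta, v, grad, lr=0.02, beta=0.9):
    """Heavy-ball momentum: v <- beta*v + g; theta <- theta - lr*v."""
    v = beta * v + grad
    theta = theta - lr * v
    return theta, v

# Minimise J(theta) = (theta - 3)^2 with momentum
theta, v = 0.0, 0.0
for _ in range(200):
    g = 2 * (theta - 3)           # gradient at the current point
    theta, v = momentum_step(theta, v, g)
print(theta)  # approaches 3.0
```

Because `v` accumulates gradients that point the same way, consistent directions speed up while oscillating directions partially cancel.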
Adam (most popular “default” optimiser) #
Adam adapts the learning rate per parameter using running averages of:
- gradients (first moment)
- squared gradients (second moment)
Core idea:
- parameters that consistently see large gradients get normalised
- parameters with small gradients can still move meaningfully
In practice, Adam is often the quickest way to get a model training well with minimal tuning, which is why it is so commonly used as a default.
(Implementation details vary by framework, but conceptually Adam = momentum + adaptive step sizes.)
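A minimal conceptual Adam sketch using the usual defaults (a toy 1-D example with an illustrative function name, not a framework implementation):

```python
import math

def adam_minimise(grad, theta0, lr=0.05, beta1=0.9, beta2=0.999, eps=1e-8, steps=500):
    """Adam = momentum (first moment m) + adaptive scaling (second moment v)."""
    theta = theta0
    m = 0.0  # running average of gradients
    v = 0.0  # running average of squared gradients
    for t in range(1, steps + 1):
        g = grad(theta)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        m_hat = m / (1 - beta1 ** t)      # bias correction for early steps
        v_hat = v / (1 - beta2 ** t)
        theta -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta

# Minimise J(theta) = (theta - 3)^2
theta = adam_minimise(lambda t: 2 * (t - 3), theta0=0.0)
print(theta)  # moves towards 3.0
```

Dividing by \( \sqrt{\hat{v}} \) is what normalises parameters with consistently large gradients while letting small-gradient parameters still take meaningful steps.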
Which one should you use? #
- Learning / small toy datasets: Batch GD is easy to understand.
- Deep learning / real training: Mini-batch GD is the standard.
- Default optimiser choice: Adam (or SGD + momentum when you want strong generalisation and can tune schedules).
Summary #
- GDA minimises loss by updating parameters using gradients.
- Types: Batch GD, SGD, Mini-batch GD.
- Most popular in deep learning: Mini-batch GD, usually with Adam or SGD + momentum.
- Learning rate and batch size strongly affect speed and stability.
Reference #
- DNN lecture notes: optimisation content across regression/classification training slides.
- Singh & Raj, *Deep Learning* (training and optimisation fundamentals).
- Zhang, Lipton, Li & Smola, *Dive into Deep Learning* (optimisation chapters).
- *Mastering Gradient Descent: A Comprehensive Guide with Real-World Applications*.