Gradient Descent Algorithm #
Gradient Descent Algorithm (GDA) is
- an optimisation method
- used to train models
- by repeatedly updating parameters (weights and biases) to reduce the loss
In deep learning, the default training approach is almost always mini-batch gradient descent, usually with Adam or SGD + momentum.
Gradient Descent is used in both regression and classification.
It’s not tied to the task type — it’s tied to the fact you have:
- a model with parameters (weights/bias), and
- a loss function you want to minimise.
In regression #
- You predict a number (e.g., house price).
- Common loss: Mean Squared Error (MSE).
- Gradient descent adjusts weights to reduce MSE.
In classification #
- You predict a class (e.g., spam vs not spam).
- Common loss: cross-entropy / log loss.
- Gradient descent adjusts weights to reduce classification loss.
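The two losses above can be sketched in a few lines of plain Python (illustrative helper names and made-up numbers, not library code):

```python
import math

def mse(y_true, y_pred):
    """Mean Squared Error: average of squared prediction errors."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def binary_cross_entropy(y_true, y_prob):
    """Log loss for binary labels (0/1) and predicted probabilities."""
    eps = 1e-12  # clip probabilities to avoid log(0)
    total = 0.0
    for t, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)
        total += -(t * math.log(p) + (1 - t) * math.log(1 - p))
    return total / len(y_true)

print(mse([3.0, 5.0], [2.5, 5.5]))               # 0.25
print(binary_cross_entropy([1, 0], [0.9, 0.2]))  # ~0.164
```

Gradient descent treats either loss the same way: compute its gradient with respect to the parameters, then step downhill.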
What gradient descent does #
You start with some parameters, measure the loss, and update parameters to reduce that loss.
Update rule (conceptual):
\[ \theta \leftarrow \theta - \eta \nabla_{\theta} J(\theta) \]

Where:
- \( \theta \) = parameters (weights/biases)
- \( J(\theta) \) = loss (objective)
- \( \nabla_{\theta} J(\theta) \) = gradient (direction of steepest increase)
- \( \eta \) = learning rate (step size)
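The update rule above can be sketched as a toy 1-D example (names and numbers are illustrative, not framework code):

```python
def gradient_descent(grad, theta0, lr=0.1, steps=100):
    """Repeatedly apply: theta <- theta - lr * grad(theta)."""
    theta = theta0
    for _ in range(steps):
        theta = theta - lr * grad(theta)
    return theta

# Minimise J(theta) = (theta - 3)^2; its gradient is 2 * (theta - 3).
theta_star = gradient_descent(lambda t: 2 * (t - 3), theta0=0.0)
print(theta_star)  # converges towards the minimiser, 3.0
```

The only thing that changes between the variants below is how much data is used to compute `grad` at each step.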
Types of Gradient Descent #
There are three common “data usage” types:
- Batch Gradient Descent: uses the entire dataset per update
- Stochastic Gradient Descent (SGD): uses one example per update
- Mini-batch Gradient Descent: uses a small batch per update (the deep learning standard)
```mermaid
flowchart TD
    A["Gradient Descent<br/>Algorithm (GDA)"] --> B["Batch GD<br/>(full dataset)"]
    A --> C["Stochastic GD<br/>(one sample)"]
    A --> D["Mini-batch GD<br/>(small batch)"]
    D --> E["Often paired with:<br/>Momentum / Adam"]

    %% Pastel colour scheme
    style A fill:#E3F2FD,stroke:#1E88E5,stroke-width:1px
    style B fill:#FFF3E0,stroke:#FB8C00,stroke-width:1px
    style C fill:#FCE4EC,stroke:#D81B60,stroke-width:1px
    style D fill:#E8F5E9,stroke:#43A047,stroke-width:1px
    style E fill:#F3E5F5,stroke:#8E24AA,stroke-width:1px
```
Batch Gradient Descent #
- Uses the entire dataset to compute one update.
- Stable updates, but can be slow on large datasets.
How it works: one update uses all \( N \) training examples.
Pros:
- stable, smooth loss curve
- good for small datasets
Cons:
- slow when \( N \) is large
- one update can be expensive
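A full-batch sketch for a toy linear-regression problem (function name and data are illustrative):

```python
def batch_gd(xs, ys, lr=0.05, epochs=500):
    """One parameter update per epoch, using ALL n examples (full-batch MSE gradient)."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # Gradients of MSE = (1/n) * sum((w*x + b - y)^2) over the whole dataset
        grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Data generated from y = 2x + 1
w, b = batch_gd([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])
print(w, b)  # approaches w=2, b=1
```

Note the cost: every single update touches every example, which is exactly why this becomes slow for large \( N \).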
Stochastic Gradient Descent (SGD) #
- Uses one training example at a time for each update.
- Very fast per update, but updates are noisy (loss may bounce around).
How it works: one update uses a single training example.
Pros:
- very fast updates
- the noise can “jitter” it out of shallow local minima
Cons:
- noisy updates (loss may bounce)
- can require more iterations to settle
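The same toy problem trained one example at a time (an illustrative sketch; the shuffle seed is arbitrary):

```python
import random

def sgd(xs, ys, lr=0.01, epochs=200, seed=0):
    """One parameter update per single training example (cheap but noisy)."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    idx = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(idx)  # visit examples in a fresh random order each epoch
        for i in idx:
            err = w * xs[i] + b - ys[i]
            w -= lr * 2 * err * xs[i]   # gradient from this ONE example only
            b -= lr * 2 * err
    return w, b

# Data generated from y = 2x + 1
w, b = sgd([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])
print(w, b)  # hovers near w=2, b=1
```

Each epoch now performs \( N \) cheap updates instead of one expensive one, at the price of a noisier path through parameter space.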
Mini-batch Gradient Descent #
Mini-batch GD is the most widely used training method in deep learning.
- Uses a small batch of size \( B \) for each update (common sizes: 32, 64, 128, 256, sometimes up to 512).
- This is the standard approach in deep learning.
Why it is popular:
- runs efficiently on GPUs (matrix operations)
- less noisy than SGD
- much cheaper per update than full-batch
- best practical balance of speed and stability
If you remember one thing: Mini-batch GD is the default for deep learning because it scales well with data and compute.
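The same toy problem with batches of size 2 (an illustrative sketch; real training uses far larger datasets and batches, and frameworks vectorise the batch gradient on GPUs):

```python
import random

def minibatch_gd(xs, ys, lr=0.05, epochs=500, batch_size=2, seed=0):
    """One parameter update per mini-batch of size B."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    idx = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(idx)  # re-partition the data into batches each epoch
        for start in range(0, len(idx), batch_size):
            batch = idx[start:start + batch_size]
            # Gradient of MSE averaged over just this batch
            grad_w = sum(2 * (w * xs[i] + b - ys[i]) * xs[i] for i in batch) / len(batch)
            grad_b = sum(2 * (w * xs[i] + b - ys[i]) for i in batch) / len(batch)
            w -= lr * grad_w
            b -= lr * grad_b
    return w, b

# Data generated from y = 2x + 1
w, b = minibatch_gd([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])
print(w, b)  # close to w=2, b=1
```

Setting `batch_size=len(xs)` recovers batch GD and `batch_size=1` recovers SGD, which is why mini-batch sits between the two.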
Mini-batch update (same idea, just different data) #
You compute the gradient using only the mini-batch:
\[ \theta \leftarrow \theta - \eta \nabla_{\theta} J_{\text{batch}}(\theta) \]

Popular “upgrades” used with mini-batches #
These are still gradient descent, but with smarter updates.
Momentum (very common with SGD) #
Momentum keeps a running “velocity” so updates build up in consistent directions.
- Adds “inertia” so updates keep moving in the same direction when that helps.
- Reduces zig-zagging and speeds up convergence.
\[ v_t \leftarrow \beta v_{t-1} + g_t, \qquad \theta \leftarrow \theta - \eta v_t \]

Where \( g_t \) is the mini-batch gradient and \( \beta \) (typically 0.9) controls smoothing.
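A minimal sketch of this velocity-based update, assuming the common form `v = beta*v + g` (some texts scale `g` by `1 - beta` instead):

```python
def momentum_step(theta, v, grad, lr=0.02, beta=0.9):
    """Heavy-ball momentum: v <- beta*v + g; theta <- theta - lr*v."""
    v = beta * v + grad
    theta = theta - lr * v
    return theta, v

# Minimise J(theta) = (theta - 3)^2 with momentum
theta, v = 0.0, 0.0
for _ in range(200):
    g = 2 * (theta - 3)           # gradient at the current point
    theta, v = momentum_step(theta, v, g)
print(theta)  # approaches 3.0
```

Because `v` accumulates gradients that point the same way, consistent directions speed up while oscillating directions partially cancel.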
Adam (most popular “default” optimiser) #
Adam adapts the learning rate per parameter using running averages of:
- gradients (first moment)
- squared gradients (second moment)
Core idea:
- parameters that consistently see large gradients get normalised
- parameters with small gradients can still move meaningfully
In practice, Adam is often the quickest way to get a model training well with minimal tuning, which is why it is so commonly used as a default.
(Implementation details vary by framework, but conceptually Adam = momentum + adaptive step sizes.)
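A minimal conceptual Adam sketch using the usual defaults (a toy 1-D example with an illustrative function name, not a framework implementation):

```python
import math

def adam_minimise(grad, theta0, lr=0.05, beta1=0.9, beta2=0.999, eps=1e-8, steps=500):
    """Adam = momentum (first moment m) + adaptive scaling (second moment v)."""
    theta = theta0
    m = 0.0  # running average of gradients
    v = 0.0  # running average of squared gradients
    for t in range(1, steps + 1):
        g = grad(theta)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        m_hat = m / (1 - beta1 ** t)      # bias correction for early steps
        v_hat = v / (1 - beta2 ** t)
        theta -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta

# Minimise J(theta) = (theta - 3)^2
theta = adam_minimise(lambda t: 2 * (t - 3), theta0=0.0)
print(theta)  # moves towards 3.0
```

Dividing by \( \sqrt{\hat{v}} \) is what normalises parameters with consistently large gradients while letting small-gradient parameters still take meaningful steps.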
Which one should you use? #
- Learning / small toy datasets: Batch GD is easy to understand.
- Deep learning / real training: Mini-batch GD is the standard.
- Default optimiser choice: Adam (or SGD + momentum when you want strong generalisation and can tune schedules).
Summary #
- GDA minimises loss by updating parameters using gradients.
- Types: Batch GD, SGD, Mini-batch GD.
- Most popular in deep learning: Mini-batch GD, usually with Adam or SGD + momentum.
- Learning rate and batch size strongly affect speed and stability.
Reference #
- DNN lecture notes: optimisation content across regression/classification training slides.
- Singh & Raj, *Deep Learning* (training and optimisation fundamentals).
- Zhang, Lipton, Li & Smola, *Dive into Deep Learning* (optimisation chapters).
- *Mastering Gradient Descent: A Comprehensive Guide with Real-World Applications*.