Gradient Descent Algorithm

Gradient Descent Algorithm (GDA) is

  • an optimisation method
  • used to train models
  • by repeatedly updating parameters (weights and biases) to reduce the loss

In deep learning, the default training approach is almost always mini-batch gradient descent, usually with Adam or SGD + momentum.

Gradient Descent is used in both regression and classification.

It’s not tied to the task type — it’s tied to the fact you have:

  • a model with parameters (weights/bias), and
  • a loss function you want to minimise.

In regression #

  • You predict a number (e.g., house price).
  • Common loss: Mean Squared Error (MSE).
  • Gradient descent adjusts weights to reduce MSE.

In classification #

  • You predict a class (e.g., spam vs not spam).
  • Common loss: cross-entropy / log loss.
  • Gradient descent adjusts weights to reduce classification loss.

What gradient descent does #

You start with some parameters, measure the loss, and update parameters to reduce that loss.

Update rule (conceptual):

\[ \theta \leftarrow \theta - \eta \nabla_{\theta} J(\theta) \]

Where:

  • \( \theta \) = parameters (weights/biases)
  • \( J(\theta) \) = loss (objective)
  • \( \nabla_{\theta} J(\theta) \) = gradient (direction of steepest increase)
  • \( \eta \) = learning rate (step size)
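
The update rule can be sketched in a few lines. This is a minimal illustration, assuming a hypothetical toy objective \( J(\theta) = (\theta - 3)^2 \) whose gradient is known in closed form:

```python
# Minimal sketch of the update rule: theta <- theta - eta * dJ/dtheta,
# assuming the toy objective J(theta) = (theta - 3)^2.
def grad(theta):
    return 2.0 * (theta - 3.0)  # dJ/dtheta

theta = 0.0   # initial parameter
eta = 0.1     # learning rate (step size)
for _ in range(100):
    theta = theta - eta * grad(theta)
# theta has moved toward the minimiser at theta = 3
```

Each step moves \( \theta \) against the gradient, so the loss shrinks as long as the learning rate is small enough.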

Types of Gradient Descent #

There are three common “data usage” types:

  • Batch Gradient Descent: uses the entire dataset per update
  • Stochastic Gradient Descent (SGD): uses one example per update
  • Mini-batch Gradient Descent: uses a small batch per update (the deep learning standard)

```mermaid
flowchart TD
  A["Gradient Descent<br/>Algorithm (GDA)"] --> B["Batch GD<br/>(full dataset)"]
  A --> C["Stochastic GD<br/>(one sample)"]
  A --> D["Mini-batch GD<br/>(small batch)"]

  D --> E["Often paired with:<br/>Momentum / Adam"]

  %% Pastel colour scheme
  style A fill:#E3F2FD,stroke:#1E88E5,stroke-width:1px
  style B fill:#FFF3E0,stroke:#FB8C00,stroke-width:1px
  style C fill:#FCE4EC,stroke:#D81B60,stroke-width:1px
  style D fill:#E8F5E9,stroke:#43A047,stroke-width:1px
  style E fill:#F3E5F5,stroke:#8E24AA,stroke-width:1px
```

Batch Gradient Descent #

  • Uses the entire dataset to compute one update.
  • Stable updates, but can be slow on large datasets.

How it works: one update uses all \( N \) training examples.

Pros:

  • stable, smooth loss curve
  • good for small datasets

Cons:

  • slow when \( N \) is large
  • one update can be expensive
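
A batch GD sketch for a hypothetical linear-regression problem (variable names and data are illustrative, not from the original): every update averages the MSE gradient over all \( N \) examples.

```python
import numpy as np

# Batch GD sketch: ONE update uses ALL N examples (hypothetical data).
rng = np.random.default_rng(0)
N = 100
X = rng.normal(size=N)
y = 2.0 * X + 1.0                  # true weight 2, bias 1, no noise

w, b = 0.0, 0.0
eta = 0.1
for _ in range(500):
    err = (w * X + b) - y          # residuals over the full dataset
    w -= eta * 2.0 * np.mean(err * X)  # dMSE/dw averaged over all N
    b -= eta * 2.0 * np.mean(err)      # dMSE/db averaged over all N
```

Because every update sees the whole dataset, the loss decreases smoothly, but each update costs a full pass over the data.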

Stochastic Gradient Descent (SGD) #

  • Uses one training example at a time for each update.
  • Very fast per update, but updates are noisy (loss may bounce around).

How it works: one update uses a single training example.

Pros:

  • very fast updates
  • the noise can help it escape shallow local minima

Cons:

  • noisy updates (loss may bounce)
  • can require more iterations to settle
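
An SGD sketch on the same kind of hypothetical linear-regression data: each update uses exactly one \((x, y)\) pair, and the data is reshuffled every epoch.

```python
import numpy as np

# SGD sketch: each update uses a SINGLE example, so updates are noisy
# (hypothetical data; same linear model as before).
rng = np.random.default_rng(1)
N = 200
X = rng.normal(size=N)
y = 2.0 * X + 1.0

w, b = 0.0, 0.0
eta = 0.05
for epoch in range(20):
    for i in rng.permutation(N):       # shuffle each epoch
        err = (w * X[i] + b) - y[i]
        w -= eta * 2.0 * err * X[i]    # gradient from ONE example
        b -= eta * 2.0 * err
```

The per-update cost is tiny, but consecutive updates can partially cancel each other, which is the "bouncing" behaviour mentioned above.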

Mini-batch Gradient Descent #

Mini-batch GD is the most widely used training method in deep learning.

  • Uses a small batch each update (common sizes: 32, 64, 128, 256).
  • This is the standard approach in deep learning.

How it works: each update uses a small batch of size \( B \) (often 32–512).

Why it is popular:

  • runs efficiently on GPUs (matrix operations)
  • less noisy than SGD
  • much cheaper per update than full-batch
  • best practical balance of speed and stability

If you remember one thing: Mini-batch GD is the default for deep learning because it scales well with data and compute.

Mini-batch update (same idea, just different data) #

You compute the gradient using only the mini-batch:

\[ \theta \leftarrow \theta - \eta \nabla_{\theta} J_{\text{batch}}(\theta) \]
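
A mini-batch sketch on the same hypothetical data: the gradient is averaged over a batch of \( B \) examples, which is the \( J_{\text{batch}} \) in the formula above.

```python
import numpy as np

# Mini-batch GD sketch: each update averages the gradient over B examples
# (hypothetical data; batch size B = 32).
rng = np.random.default_rng(2)
N, B = 512, 32
X = rng.normal(size=N)
y = 2.0 * X + 1.0

w, b = 0.0, 0.0
eta = 0.1
for epoch in range(30):
    idx = rng.permutation(N)               # shuffle each epoch
    for start in range(0, N, B):
        batch = idx[start:start + B]
        err = (w * X[batch] + b) - y[batch]
        w -= eta * 2.0 * np.mean(err * X[batch])  # grad of J_batch w.r.t. w
        b -= eta * 2.0 * np.mean(err)             # grad of J_batch w.r.t. b
```

The inner update is identical in shape to batch GD; only the slice of data feeding the gradient changes.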

The optimisers that follow (Momentum, Adam) are still gradient descent, just with smarter updates.

Momentum (very common with SGD) #

Momentum keeps a running “velocity” so updates build up in consistent directions.

  • Adds “inertia” so updates keep moving in the same direction when that helps.
  • Reduces zig-zagging and speeds up convergence.

\[ v_t = \beta v_{t-1} + (1-\beta) g_t \]
\[ \theta_{t+1} = \theta_t - \eta v_t \]

Where \( g_t \) is the mini-batch gradient and \( \beta \) controls smoothing.
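
The two momentum equations translate directly into code. A minimal sketch, assuming a hypothetical 1-D objective \( J(\theta) = \theta^2 \) so the gradient is simply \( g = 2\theta \):

```python
# Momentum sketch: v_t = beta * v_{t-1} + (1 - beta) * g_t,
# then theta_{t+1} = theta_t - eta * v_t.
# Toy objective J(theta) = theta^2, so g = 2 * theta.
theta = 5.0
v = 0.0                      # running "velocity"
eta, beta = 0.1, 0.9
for _ in range(300):
    g = 2.0 * theta          # current gradient
    v = beta * v + (1.0 - beta) * g
    theta = theta - eta * v
# theta is driven toward the minimiser at 0
```

Because \( v \) is a smoothed average of recent gradients, oscillating components cancel while consistent components accumulate.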

Adam #

Adam adapts the learning rate per parameter using running averages of:

  • gradients (first moment)
  • squared gradients (second moment)

Core idea:

  • parameters that consistently see large gradients get normalised
  • parameters with small gradients can still move meaningfully

In practice, Adam is often the quickest way to get a model training well with minimal tuning, which is why it is so commonly used as a default.

(Implementation details vary by framework, but conceptually Adam = momentum + adaptive step sizes.)
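
A conceptual sketch of that "momentum + adaptive step sizes" idea on the same hypothetical toy objective; the hyperparameters shown are the commonly cited defaults, and real framework implementations differ in details.

```python
import numpy as np

# Conceptual Adam sketch on the toy objective J(theta) = theta^2 (g = 2*theta).
# Hyperparameters are common defaults; frameworks vary in the details.
theta = 5.0
m, v = 0.0, 0.0
eta, beta1, beta2, eps = 0.1, 0.9, 0.999, 1e-8
for t in range(1, 501):
    g = 2.0 * theta
    m = beta1 * m + (1 - beta1) * g        # first moment (avg gradient)
    v = beta2 * v + (1 - beta2) * g * g    # second moment (avg squared grad)
    m_hat = m / (1 - beta1 ** t)           # bias correction
    v_hat = v / (1 - beta2 ** t)
    theta -= eta * m_hat / (np.sqrt(v_hat) + eps)
```

Dividing by \( \sqrt{\hat{v}} \) normalises the step per parameter: directions with consistently large gradients take smaller effective steps, and vice versa.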


Which one should you use? #

  • Learning / small toy datasets: Batch GD is easy to understand.
  • Deep learning / real training: Mini-batch GD is the standard.
  • Default optimiser choice: Adam (or SGD + momentum when you want strong generalisation and can tune schedules).

Summary #

  • GDA minimises loss by updating parameters using gradients.
  • Types: Batch GD, SGD, Mini-batch GD.
  • Most popular in deep learning: Mini-batch GD, usually with Adam or SGD + momentum.
  • Learning rate and batch size strongly affect speed and stability.
