ML

Gradient Descent

Gradient Descent for Linear Regression #

Gradient descent is an iterative optimisation method used to minimise the regression cost function by repeatedly updating parameters in the direction that reduces error.

  • Iterative method
  • Types: batch / stochastic / mini-batch

Key takeaway: Gradient descent starts with initial parameter values and repeatedly updates them using the gradient until the cost stops decreasing.
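To make the update rule concrete, here is a minimal NumPy sketch of batch gradient descent for linear regression (illustrative only, not from the lecture; the loss is mean squared error, and `lr` / `epochs` are the learning-rate and epoch hyperparameters):

```python
import numpy as np

def gradient_descent(X, y, lr=0.1, epochs=200):
    """Fit y ~ X @ w + b by minimising mean squared error."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        err = X @ w + b - y            # prediction error
        w -= lr * (2 / n) * X.T @ err  # gradient step for weights
        b -= lr * (2 / n) * err.sum()  # gradient step for bias
    return w, b

# Toy data generated from y = 3x + 1
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([1.0, 4.0, 7.0, 10.0])
w, b = gradient_descent(X, y)  # w close to [3], b close to 1
```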

flowchart TD
GD["Gradient<br/>Descent"] -->|minimises| CF["Cost<br/>function"]
GD -->|updates| W["Parameters<br/>(weights)"]
GD -->|uses| GR["Gradient<br/>(slope)"]

GD --> H["Hyperparameters"]
H --> LR["Learning<br/>rate"]
H --> BS["Batch<br/>size"]
H --> EP["Epochs"]

style GD fill:#90CAF9,stroke:#1E88E5,color:#000

style CF fill:#CE93D8,stroke:#8E24AA,color:#000
style W fill:#CE93D8,stroke:#8E24AA,color:#000
style GR fill:#CE93D8,stroke:#8E24AA,color:#000
style H fill:#CE93D8,stroke:#8E24AA,color:#000
style LR fill:#CE93D8,stroke:#8E24AA,color:#000
style BS fill:#CE93D8,stroke:#8E24AA,color:#000
style EP fill:#CE93D8,stroke:#8E24AA,color:#000

Types of GD #

flowchart TD
T["Gradient Descent<br/>types"] --> BGD["Batch<br/>GD"]
T --> SGD["Stochastic<br/>GD"]
T --> MGD["Mini-batch<br/>GD"]

BGD --> ALL["All data<br/>per step"]
BGD --> STB["Smooth<br/>updates"]

SGD --> ONE["1 sample<br/>per step"]
SGD --> FAST["Quick<br/>progress"]
SGD --> NOISE["Noisy<br/>updates"]

MGD --> MB["Small batch<br/>per step"]
MGD --> PRACT["Practical<br/>default"]

style T fill:#90CAF9,stroke:#1E88E5,color:#000

style BGD fill:#C8E6C9,stroke:#2E7D32,color:#000
style SGD fill:#C8E6C9,stroke:#2E7D32,color:#000
style MGD fill:#C8E6C9,stroke:#2E7D32,color:#000

style ALL fill:#CE93D8,stroke:#8E24AA,color:#000
style STB fill:#CE93D8,stroke:#8E24AA,color:#000
style ONE fill:#CE93D8,stroke:#8E24AA,color:#000
style FAST fill:#CE93D8,stroke:#8E24AA,color:#000
style NOISE fill:#CE93D8,stroke:#8E24AA,color:#000
style MB fill:#CE93D8,stroke:#8E24AA,color:#000
style PRACT fill:#CE93D8,stroke:#8E24AA,color:#000

Batch #

  • Computes the gradient on the entire training set for every update; use only when you have ample compute and training time is not a constraint

SGD #

  • The go-to choice in practice: cheap per-step updates that scale to large datasets, at the cost of noisier convergence
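A sketch (assuming plain NumPy and squared-error loss) showing that the three variants differ only in how many samples feed each update: `batch_size=1` gives SGD, `batch_size=len(X)` gives batch GD, anything in between is mini-batch.

```python
import numpy as np

rng = np.random.default_rng(0)

def minibatch_sgd(X, y, batch_size=2, lr=0.05, epochs=2000):
    """Mini-batch SGD for linear regression.
    batch_size=1 -> SGD; batch_size=len(X) -> batch GD."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        idx = rng.permutation(n)                   # reshuffle each epoch
        for start in range(0, n, batch_size):
            batch = idx[start:start + batch_size]  # current mini-batch
            err = X[batch] @ w + b - y[batch]
            w -= lr * (2 / len(batch)) * X[batch].T @ err
            b -= lr * (2 / len(batch)) * err.sum()
    return w, b

# Same toy data: y = 3x + 1
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = 3 * X[:, 0] + 1
w, b = minibatch_sgd(X, y)
```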

Classification (Linear Models)

Linear models for Classification #

  • Categorises data by finding a linear boundary (hyperplane) that separates the classes
  • Decides by computing a weighted sum of the input features plus a bias

flowchart TD
T["Linear<br/>classification<br/>models"] --> P["Perceptron"]
T --> LR["Logistic<br/>regression"]
T --> SVM["Linear<br/>SVM"]

P -->|uses| STEP["Step<br/>activation"]
LR -->|uses| SIG["Sigmoid<br/>+ log loss"]
SVM -->|uses| HNG["Hinge<br/>loss"]

style T fill:#90CAF9,stroke:#1E88E5,color:#000

style P fill:#C8E6C9,stroke:#2E7D32,color:#000
style LR fill:#C8E6C9,stroke:#2E7D32,color:#000
style SVM fill:#C8E6C9,stroke:#2E7D32,color:#000

style STEP fill:#CE93D8,stroke:#8E24AA,color:#000
style SIG fill:#CE93D8,stroke:#8E24AA,color:#000
style HNG fill:#CE93D8,stroke:#8E24AA,color:#000

Discriminant Functions #

Decision Theory #

Probabilistic Discriminative Classifiers #


Logistic Regression #

  • Supervised machine learning algorithm
  • Binary classification algorithm
  • assumes a linear decision boundary between the classes (a linear relationship between features and log-odds)
  • predicts the probability that an input belongs to a specific class
  • uses Sigmoid function to convert inputs into a probability value between 0 and 1

Key takeaway: Logistic regression predicts $P(y=1\mid x)$ using a sigmoid of a linear score $z=w\cdot x+b$, then learns $w,b$ by maximising likelihood (equivalently minimising log-loss).
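A minimal NumPy sketch of that takeaway (illustrative; full-batch gradient descent on the mean log-loss):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, lr=0.5, epochs=1000):
    """Learn w, b by gradient descent on mean log-loss; y in {0, 1}."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        p = sigmoid(X @ w + b)       # P(y=1 | x)
        w -= lr * X.T @ (p - y) / n  # gradient of mean log-loss wrt w
        b -= lr * (p - y).mean()     # gradient wrt b
    return w, b

# Toy 1-D data: class 1 for x above 2
X = np.array([[0.0], [1.0], [3.0], [4.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])
w, b = fit_logistic(X, y)
p = sigmoid(X @ w + b)  # predicted probabilities
```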

Foundation Models

Foundation Model #

AI models trained on massive datasets to perform a wide range of tasks with minimal fine-tuning.

  • Large deep learning neural networks containing millions or billions of parameters

  • Trained on massive, diverse datasets (text, images, audio, or multiple modalities)

  • Designed for general-purpose capability across a broad range of tasks, not a single task

  • Act as base models for building specialised AI applications

LLM

LLM – Large Language Model #

Large Language Models (LLMs) are advanced AI systems designed to process, understand, and generate human-like text.

They learn language by analysing massive amounts of text data, discovering patterns in:

  • grammar

  • meaning

  • context

  • relationships between words and sentences

  • Architecturally, they are:

    • Built on deep learning
    • Implemented using neural networks
    • Based on the Transformer architecture

  • Often combined with tools like:

    • Retrieval (RAG)
    • Agents
    • External APIs
    • Memory systems

What makes an LLM special? #

  • Built using deep neural networks
  • Trained on very large datasets (books, articles, code, web text)
  • Can perform many tasks without task-specific training
  • General-purpose language understanding, not single-task models

Foundation: Transformer Architecture #

LLMs are based on the Transformer Architecture, which allows models to understand context and long-range dependencies in text.

Decision Tree

Decision Tree #

A decision tree classifies an example by asking a sequence of questions about its attributes until it reaches a leaf (final decision).

Key takeaway: A decision tree grows by repeatedly splitting the training data into purer subsets using an impurity measure (Entropy / Gini / Classification Error).


Information Theory #

Decision trees need a way to measure: “How mixed are the class labels at a node?”
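Two of the standard measures (entropy and Gini impurity), sketched from their definitions:

```python
import numpy as np

def entropy(counts):
    """Shannon entropy (in bits) of the class counts at a node."""
    p = np.asarray(counts, dtype=float)
    p = p[p > 0] / p.sum()            # drop empty classes, normalise
    return float(-(p * np.log2(p)).sum())

def gini(counts):
    """Gini impurity: chance two random draws from the node disagree."""
    p = np.asarray(counts, dtype=float) / sum(counts)
    return float(1.0 - (p ** 2).sum())

print(entropy([5, 5]))        # maximally mixed node: 1.0
print(gini([5, 5]))           # maximally mixed node: 0.5
print(entropy([10, 0]) == 0)  # a pure node has zero entropy: True
```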

Instance-based Learning

Instance-based Learning #

Instance-based learning is a family of methods that do not build one explicit global model during training. Instead, they store training examples and delay most of the work until a new query arrives.

When a new point must be classified or predicted, the algorithm compares it with previously seen examples, finds the most relevant neighbours, and uses them to produce the answer.
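As a sketch of the best-known instance-based method, k-nearest neighbours (assuming Euclidean distance and majority vote):

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=3):
    """Classify x by majority vote among its k nearest training points."""
    dists = np.linalg.norm(X_train - x, axis=1)  # distance to every example
    nearest = np.argsort(dists)[:k]              # indices of the k closest
    return Counter(y_train[nearest]).most_common(1)[0][0]

# Two well-separated clusters; no model is built until query time
X_train = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]])
y_train = np.array(["a", "a", "a", "b", "b", "b"])
label = knn_predict(X_train, y_train, np.array([4.5, 5.0]))  # -> "b"
```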

Instance-based Learning covers three linked ideas:

Support Vector Machine

Support Vector Machine (SVM) #

A Support Vector Machine (SVM) is a supervised machine learning algorithm used for:

  • Classification (most common)
  • Regression (SVR – Support Vector Regression)

Goal: find the decision boundary that separates the classes with the maximum margin.

A Support Vector Machine is a supervised learning algorithm that finds an optimal hyperplane by maximising the margin between classes, using support vectors and kernel functions to handle non-linear data.
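A rough sketch of that objective (assumption: linear SVM trained by sub-gradient descent on hinge loss plus an L2 penalty; labels must be in {-1, +1}):

```python
import numpy as np

def linear_svm(X, y, lam=0.01, lr=0.1, epochs=500):
    """Minimise hinge loss + L2 penalty by sub-gradient descent."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        margins = y * (X @ w + b)
        viol = margins < 1                       # points inside the margin
        w -= lr * (lam * w - (y[viol] @ X[viol]) / n)
        b -= lr * (-y[viol].sum() / n)
    return w, b

# 1-D toy data separable at x = 2
X = np.array([[0.0], [1.0], [3.0], [4.0]])
y = np.array([-1.0, -1.0, 1.0, 1.0])
w, b = linear_svm(X, y)
pred = np.sign(X @ w + b)
```

Only the margin-violating points (the support vectors, once converged) contribute to the gradient, which is why the boundary depends on them alone.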

Attention Mechanism

Attention Mechanism #

  • Queries, Keys, and Values
  • Attention Pooling by Similarity
  • Attention Pooling via Nadaraya–Watson Regression
  • Attention Scoring Functions
  • Dot Product Attention
  • Convenience Functions
  • Scaled Dot Product Attention
  • Additive Attention
  • Bahdanau Attention Mechanism
  • Multi-Head Attention
  • Self-Attention
  • Positional Encoding
  • Code implementation (webinar)
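Of the items above, scaled dot-product attention is the core building block; a NumPy sketch (illustrative: single head, no masking):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)               # query-key similarities
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V, weights

Q = np.array([[1.0, 0.0]])               # one query
K = np.array([[1.0, 0.0], [0.0, 1.0]])   # two keys
V = np.array([[10.0], [20.0]])           # values attached to the keys
out, w = scaled_dot_product_attention(Q, K, V)
# the query matches the first key more, so out lies closer to 10
```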

Reference #

  • Dive into Deep Learning. Cambridge University Press (Ch. 7, Ch. 10)
