ML

Ordinary Least Squares

Direct solution method - Ordinary Least Squares and the Line of Best Fit #

Revision:
OLS is the direct method for linear regression. It finds the best-fit line by minimising the sum of squared residuals without iterative updates.


Direct Method vs Iterative Method ☆ #

Linear regression parameters can be found in two main ways.

| Method | Main idea | When used |
| --- | --- | --- |
| Ordinary Least Squares | Compute the best parameters directly | Small or moderate datasets |
| Gradient Descent | Start with parameters and update repeatedly | Large datasets or many features |

flowchart LR
    A["Linear Regression"] --> B["Direct Solution<br/>OLS"]
    A --> C["Iterative Solution<br/>Gradient Descent"]

    B --> B1["Normal Equation"]
    B --> B2["No learning rate"]
    B --> B3["One-shot solution"]

    C --> C1["Learning rate"]
    C --> C2["Repeated updates"]
    C --> C3["Stops after convergence"]

    style A fill:#E1F5FE,stroke:#5b7db1,color:#000
    style B fill:#C8E6C9,stroke:#5f8f6a,color:#000
    style C fill:#FFF9C4,stroke:#b59b3b,color:#000
    style B1 fill:#EDE7F6,stroke:#8a6fb3,color:#000
    style B2 fill:#EDE7F6,stroke:#8a6fb3,color:#000
    style B3 fill:#EDE7F6,stroke:#8a6fb3,color:#000
    style C1 fill:#EDE7F6,stroke:#8a6fb3,color:#000
    style C2 fill:#EDE7F6,stroke:#8a6fb3,color:#000
    style C3 fill:#EDE7F6,stroke:#8a6fb3,color:#000

Why It Is Called “Least Squares” ☆ #

OLS is called least squares because it chooses parameters that make the squared residual errors as small as possible.
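
A minimal sketch of the direct solution for a single feature, assuming the model is a straight line $y = \text{slope} \cdot x + \text{intercept}$. The function names and the sample data are illustrative, not part of the original notes. With one feature, the normal equation reduces to the covariance-over-variance formula used here.

```python
def fit_ols_line(xs, ys):
    """Return (slope, intercept) that minimise the sum of squared residuals."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # One-feature closed form: slope = cov(x, y) / var(x).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def sum_squared_residuals(xs, ys, slope, intercept):
    """The quantity that OLS makes as small as possible."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))


# Illustrative data lying exactly on y = 2x + 1.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]

slope, intercept = fit_ols_line(xs, ys)
best = sum_squared_residuals(xs, ys, slope, intercept)
worse = sum_squared_residuals(xs, ys, slope + 0.5, intercept)
print(slope, intercept, best, worse)
```

No learning rate and no loop over updates appear anywhere: the parameters come out of one calculation, and any other line (such as the perturbed one above) has a larger sum of squared residuals.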

Cost Function

Cost Function #

Revision:
A cost function converts model error into a single number. Training means changing the model parameters until this number becomes as small as possible.


Why Cost Function Matters in ML ☆ #

A machine learning model needs a way to decide whether one set of parameters is better than another.

For linear regression, every possible value of the parameters gives a different line. The cost function tells us which line is better by measuring how far the predictions are from the true values.
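
As a rough illustration, the sketch below assumes the cost is the mean squared error (MSE), a common choice for linear regression that the notes above do not name explicitly. The function name and the data are made up for the example.

```python
def mse_cost(xs, ys, slope, intercept):
    """Mean squared error of the line y = slope * x + intercept."""
    n = len(xs)
    return sum((slope * x + intercept - y) ** 2 for x, y in zip(xs, ys)) / n


# Illustrative data lying exactly on y = 2x.
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]

cost_a = mse_cost(xs, ys, slope=2.0, intercept=0.0)  # passes through every point
cost_b = mse_cost(xs, ys, slope=1.0, intercept=0.0)  # too flat
print(cost_a, cost_b)
```

Each pair of parameters maps to one number, so comparing two candidate lines reduces to comparing two costs: the line with the lower cost is the better fit.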

Gradient Descent

Gradient Descent for Linear Regression #

Revision:
Gradient descent is the step-by-step method for reducing the cost function when a direct closed-form solution is not convenient.


Where Gradient Descent Fits in ML ☆ #

Gradient descent is used when we want the model to learn parameters by repeatedly improving them.

For linear regression, it adjusts the slope and intercept until the prediction error becomes small.

flowchart LR
    A["Initial Parameters"] --> B["Make Predictions"]
    B --> C["Compute Cost"]
    C --> D["Compute Gradient"]
    D --> E["Update Parameters"]
    E --> B

    style A fill:#E1F5FE,stroke:#5b7db1,color:#000
    style B fill:#C8E6C9,stroke:#5f8f6a,color:#000
    style C fill:#FFF9C4,stroke:#b59b3b,color:#000
    style D fill:#EDE7F6,stroke:#8a6fb3,color:#000
    style E fill:#C8E6C9,stroke:#5f8f6a,color:#000

Core Idea ☆ #

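The gradient tells us the direction in which the cost increases fastest. Gradient descent therefore moves the parameters a small step in the opposite direction, where the cost decreases fastest.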
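
The loop in the flowchart can be sketched as follows. This assumes an MSE cost, a fixed learning rate of 0.05, and a fixed number of steps instead of a convergence check; all three are illustrative choices, as are the function name and the data.

```python
def gradient_descent(xs, ys, learning_rate=0.05, steps=2000):
    """Fit y = slope * x + intercept by repeatedly stepping against the gradient."""
    slope, intercept = 0.0, 0.0  # initial parameters
    n = len(xs)
    for _ in range(steps):
        # Make predictions and measure the error on each point.
        errors = [slope * x + intercept - y for x, y in zip(xs, ys)]
        # Partial derivatives of the MSE cost with respect to each parameter.
        grad_slope = (2.0 / n) * sum(e * x for e, x in zip(errors, xs))
        grad_intercept = (2.0 / n) * sum(errors)
        # Update: move against the gradient, where the cost decreases.
        slope -= learning_rate * grad_slope
        intercept -= learning_rate * grad_intercept
    return slope, intercept


# Illustrative data lying exactly on y = 2x + 1.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]

slope, intercept = gradient_descent(xs, ys)
print(slope, intercept)
```

A learning rate that is too large makes the updates overshoot and diverge, while one that is too small makes convergence slow, so in practice this value usually needs tuning.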

Classification (Linear)

Linear models for Classification #

  • categorise data by finding a linear boundary (hyperplane) that separates the classes
  • compute a weighted sum of the input features plus a bias, and assign the class according to which side of the boundary the result falls on

flowchart TD
T["Linear<br/>classification<br/>models"] --> P["Perceptron"]
T --> LR["Logistic<br/>regression"]
T --> SVM["Linear<br/>SVM"]

P -->|uses| STEP["Step<br/>activation"]
LR -->|uses| SIG["Sigmoid<br/>+ log loss"]
SVM -->|uses| HNG["Hinge<br/>loss"]

style T fill:#90CAF9,stroke:#1E88E5,color:#000

style P fill:#C8E6C9,stroke:#2E7D32,color:#000
style LR fill:#C8E6C9,stroke:#2E7D32,color:#000
style SVM fill:#C8E6C9,stroke:#2E7D32,color:#000

style STEP fill:#CE93D8,stroke:#8E24AA,color:#000
style SIG fill:#CE93D8,stroke:#8E24AA,color:#000
style HNG fill:#CE93D8,stroke:#8E24AA,color:#000
  • Discriminant Functions
  • Decision Theory
  • Probabilistic Discriminative Classifiers
  • Logistic Regression

Logistic Regression #

  • Supervised machine learning algorithm
  • Binary classification algorithm
  • Learns a linear decision boundary; it does not require the classes to be perfectly linearly separable
  • Predicts the probability that an input belongs to a specific class
  • Uses the sigmoid function to convert a linear score into a probability between 0 and 1

Key takeaway: Logistic regression predicts $P(y=1\mid x)$ using a sigmoid of a linear score $z=w\cdot x+b$, then learns $w,b$ by maximising likelihood (equivalently minimising log-loss).
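
A small sketch of the prediction and loss for one example, following the notation $z = w \cdot x + b$ from the takeaway above. The weights, bias, and input are hypothetical values chosen for illustration; no training step is shown.

```python
import math


def sigmoid(z):
    """Squash a real-valued score into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(weights, bias, features):
    """Estimate P(y = 1 | x) as the sigmoid of the linear score z = w . x + b."""
    z = sum(w * x for w, x in zip(weights, features)) + bias
    return sigmoid(z)


def log_loss(y_true, p):
    """Log-loss for a single example with label y_true in {0, 1}."""
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1 - p))


# Hypothetical parameters and input: z = 1.5 * 2.0 - 0.5 * 1.0 + 0.0 = 2.5
weights, bias = [1.5, -0.5], 0.0
p = predict_proba(weights, bias, [2.0, 1.0])

loss_if_positive = log_loss(1, p)  # small: the model agrees with the label
loss_if_negative = log_loss(0, p)  # large: the model is confidently wrong
print(p, loss_if_positive, loss_if_negative)
```

Training would adjust `weights` and `bias` to reduce the average log-loss over all examples, which is the same as maximising the likelihood.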

Hypothesis Testing

Hypothesis Testing #

Hypothesis testing is a statistical decision-making method used to decide whether sample evidence is strong enough to reject an initial assumption about a population.

It connects probability, sampling distributions, confidence intervals, significance levels, and decision rules.

Key takeaway:
Hypothesis testing is not about proving something with certainty.

It is about asking:

If the null hypothesis were true, how surprising would this sample result be?
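
To make the question concrete, here is a sketch using a hypothetical coin example: the null hypothesis is that the coin is fair, and the sample shows 9 heads in 10 flips. The one-sided exact binomial test, the significance level of 0.05, and the function name are all assumptions made for the example.

```python
from math import comb


def binomial_p_value_upper(n, k, p_null=0.5):
    """One-sided p-value: P(X >= k) when X ~ Binomial(n, p_null)."""
    return sum(
        comb(n, i) * p_null**i * (1 - p_null) ** (n - i)
        for i in range(k, n + 1)
    )


# How surprising are 9 or more heads in 10 flips if the coin is fair?
p_value = binomial_p_value_upper(n=10, k=9)

alpha = 0.05  # significance level, a conventional but arbitrary choice
reject_null = p_value < alpha
print(p_value, reject_null)
```

Here the p-value is 11/1024, about 0.011, so a result this extreme would be rare under the null hypothesis and the null is rejected at the 0.05 level. That is a decision under uncertainty, not proof that the coin is biased.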

Foundation Models

Foundation Model #

Foundation models are AI models trained on massive datasets to perform a wide range of tasks with minimal fine-tuning. They:

  • are large deep learning neural networks

  • are trained on massive and diverse datasets (text, images, audio, or multiple modalities)

  • contain millions or billions of parameters

  • are designed to be general-purpose, covering a broad range of tasks rather than a single task

  • act as base models for building specialised AI applications

LLM - Model

LLM – Large Language Model #

Large Language Models (LLMs) are advanced AI systems designed to process, understand, and generate human-like text.

They learn language by analysing massive amounts of text data, discovering patterns in:

  • grammar

  • meaning

  • context

  • relationships between words and sentences

LLMs are:

  • Built on Deep Learning

  • Implemented using Neural Networks

  • Based on Transformers

  • Often combined with tools like:

    • Retrieval (RAG)
    • Agents
    • External APIs
    • Memory systems

What makes an LLM special? #

  • Built using deep neural networks
  • Trained on very large datasets (books, articles, code, web text)
  • Can perform many tasks without task-specific training
  • General-purpose language understanding, not single-task models

Foundation: Transformer Architecture #

LLMs are based on the Transformer architecture, whose self-attention mechanism lets the model relate each token to every other token in the input. This is what allows it to capture context and long-range dependencies in text.

AI Agents

AI Agents #

Also referred to as Agentic AI.

AI agents are intelligent systems that can plan, make decisions, and take actions to achieve goals with minimal human intervention.

  • A common use case is task automation, for example booking travel based on a user’s request.

  • AI agents typically build on Generative AI and use Large Language Models (LLMs) as the reasoning core.

  • Agents often interact with tools (APIs, databases, calendars) to complete multi-step workflows.

Retrieval-Augmented Generation (RAG)

Retrieval-Augmented Generation (RAG) #

Retrieval-Augmented Generation (RAG) is a system design pattern that improves an LLM’s answers by:

  1. Retrieving relevant information from an external knowledge source, and then
  2. Augmenting the LLM prompt with that retrieved context before generating the final response.

RAG helps an LLM look things up first, then answer using evidence.
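
The two steps above can be sketched with a toy example. Everything here is illustrative: the documents are invented, the word-overlap score is a crude stand-in for the embedding similarity a real system would typically use, and the final call to an LLM is left out, so the sketch stops at the augmented prompt.

```python
import re
from collections import Counter

# Hypothetical knowledge source: a few invented policy snippets.
DOCUMENTS = [
    "Annual leave: employees receive 25 days of paid leave per year.",
    "Expenses must be submitted within 30 days of purchase.",
    "The office is closed on public holidays.",
]


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def overlap_score(query, document):
    """Count shared word occurrences (a stand-in for embedding similarity)."""
    q, d = Counter(tokenize(query)), Counter(tokenize(document))
    return sum((q & d).values())


def retrieve(query, documents, top_k=1):
    """Step 1: return the top_k documents most similar to the query."""
    ranked = sorted(documents, key=lambda doc: overlap_score(query, doc), reverse=True)
    return ranked[:top_k]


def build_prompt(query, context_docs):
    """Step 2: augment the prompt with the retrieved context."""
    context = "\n".join(f"- {doc}" for doc in context_docs)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}"
    )


query = "How many days of paid leave do employees get?"
retrieved = retrieve(query, DOCUMENTS)
prompt = build_prompt(query, retrieved)
print(prompt)
```

The prompt that reaches the model now contains the relevant policy text, so the answer can be grounded in it and traced back to its source.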


Why RAG is Useful #

RAG is commonly used when:

  • Your knowledge is in private documents (PDFs, policies, internal wiki)
  • You need up-to-date information (things not in the model’s training data)
  • You want fewer hallucinations by grounding answers in retrieved sources
  • You want traceability (show “where the answer came from”)

RAG does not change the model weights.
It changes what the model sees at inference time by adding retrieved context.

Mathematical Foundation

Mathematical Foundations for Machine Learning #

Machine Learning is built on mathematical principles that allow models to:

  • represent data
  • learn patterns
  • optimise performance
flowchart LR
    DATA[Data]
    MATH[Math Models]
    OPT[Optimisation]
    MODEL[Trained Model]

    DATA --> MATH
    MATH --> OPT
    OPT --> MODEL

Understanding how ML algorithms work internally requires a few core mathematical tools. Algebra deals with relationships between variables and quantities, while calculus focuses on change and optimisation.