<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>AI on Arshad Siddiqui</title><link>https://arshadhs.github.io/categories/ai/</link><description>Recent content in AI on Arshad Siddiqui</description><generator>Hugo</generator><language>en-us</language><lastBuildDate>Wed, 22 Apr 2026 00:00:00 +0000</lastBuildDate><atom:link href="https://arshadhs.github.io/categories/ai/index.xml" rel="self" type="application/rss+xml"/><item><title>Formula Sheet</title><link>https://arshadhs.github.io/docs/ai/statistics/00_formulas/</link><pubDate>Thu, 12 Mar 2026 00:00:00 +0000</pubDate><guid>https://arshadhs.github.io/docs/ai/statistics/00_formulas/</guid><description>&lt;h1 id="formula-sheet">
 Formula Sheet
 
 &lt;a class="anchor" href="#formula-sheet">#&lt;/a>
 
&lt;/h1>
&lt;p>This page is a quick reference of &lt;strong>definitions + formulas&lt;/strong>, grouped by module.&lt;/p>
&lt;hr>
&lt;h2 id="notation">
 Notation
 
 &lt;a class="anchor" href="#notation">#&lt;/a>
 
&lt;/h2>
&lt;ul>
&lt;li>Sample size: 
&lt;link rel="stylesheet" href="https://arshadhs.github.io/katex/katex.min.css" />
&lt;script defer src="https://arshadhs.github.io/katex/katex.min.js">&lt;/script>

 &lt;script defer src="https://arshadhs.github.io/katex/auto-render.min.js" onload="renderMathInElement(document.body, {
 &amp;#34;delimiters&amp;#34;: [
 {&amp;#34;left&amp;#34;: &amp;#34;$$&amp;#34;, &amp;#34;right&amp;#34;: &amp;#34;$$&amp;#34;, &amp;#34;display&amp;#34;: true},
 {&amp;#34;left&amp;#34;: &amp;#34;$&amp;#34;, &amp;#34;right&amp;#34;: &amp;#34;$&amp;#34;, &amp;#34;display&amp;#34;: false},
 {&amp;#34;left&amp;#34;: &amp;#34;\\(&amp;#34;, &amp;#34;right&amp;#34;: &amp;#34;\\)&amp;#34;, &amp;#34;display&amp;#34;: false},
 {&amp;#34;left&amp;#34;: &amp;#34;\\[&amp;#34;, &amp;#34;right&amp;#34;: &amp;#34;\\]&amp;#34;, &amp;#34;display&amp;#34;: true}
 ]
});">&lt;/script>

&lt;span>
 \( n \)
 &lt;/span>

 (sample), 
&lt;span>
 \( N \)
 &lt;/span>

 (population)&lt;/li>
&lt;li>Sample mean: 
&lt;span>
 \( \bar{x} \)
 &lt;/span>

, population mean: 
&lt;span>
 \( \mu \)
 &lt;/span>

&lt;/li>
&lt;li>Sample variance: 
&lt;span>
 \( s^2 \)
 &lt;/span>

, population variance: 
&lt;span>
 \( \sigma^2 \)
 &lt;/span>

&lt;/li>
&lt;li>Sample SD: 
&lt;span>
 \( s \)
 &lt;/span>

, population SD: 
&lt;span>
 \( \sigma \)
 &lt;/span>

&lt;/li>
&lt;li>Complement: 
&lt;span>
 \( A^c \)
 &lt;/span>

&lt;/li>
&lt;li>Intersection (“and”): 
&lt;span>
 \( A\cap B \)
 &lt;/span>

, union (“or”): 
&lt;span>
 \( A\cup B \)
 &lt;/span>

&lt;/li>
&lt;li>Conditional probability: 
&lt;span>
 \( P(A\mid B) \)
 &lt;/span>

&lt;/li>
&lt;/ul>
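&lt;p>A minimal Python sketch of the sample/population distinction, using the standard-library &lt;code>statistics&lt;/code> module on made-up data: the sample statistics divide by n-1, the population versions by N.&lt;/p>
&lt;pre>&lt;code class="language-python">import statistics as st

data = [2, 4, 4, 4, 5, 5, 7, 9]   # made-up data

print(st.mean(data))       # sample mean x-bar (mu if data is the whole population)
print(st.variance(data))   # s^2: sample variance, n-1 denominator
print(st.pvariance(data))  # sigma^2: population variance, N denominator
print(st.stdev(data), st.pstdev(data))  # s and sigma
&lt;/code>&lt;/pre>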
&lt;hr>
&lt;h1 id="1-basic-probability--statistics">
 1. Basic Probability &amp;amp; Statistics
 
 &lt;a class="anchor" href="#1-basic-probability--statistics">#&lt;/a>
 
&lt;/h1>
&lt;h2 id="11-measures-of-central-tendency">
 1.1 Measures of Central Tendency
 
 &lt;a class="anchor" href="#11-measures-of-central-tendency">#&lt;/a>
 
&lt;/h2>
&lt;h3 id="arithmetic-mean">
 Arithmetic mean
 
 &lt;a class="anchor" href="#arithmetic-mean">#&lt;/a>
 
&lt;/h3>
&lt;p>Sample mean (ungrouped):&lt;/p></description></item><item><title>Supervised Learning</title><link>https://arshadhs.github.io/docs/ai/machine-learning/ml-supervised/</link><pubDate>Sat, 03 Jan 2026 10:29:52 +0100</pubDate><guid>https://arshadhs.github.io/docs/ai/machine-learning/ml-supervised/</guid><description>&lt;h1 id="supervised-learning">
 Supervised Learning
 
 &lt;a class="anchor" href="#supervised-learning">#&lt;/a>
 
&lt;/h1>
&lt;p>Trained using &lt;strong>labelled data&lt;/strong>: each example in the training set includes the &lt;strong>correct output&lt;/strong>.&lt;br>
The algorithm learns to &lt;strong>generalise&lt;/strong> and make predictions on unseen data.&lt;br>
Requires &lt;strong>human intervention&lt;/strong> for labelling and setup.&lt;br>
Widely used because it is generally more &lt;strong>accurate and efficient&lt;/strong> than unsupervised methods, provided it is trained on good-quality labelled data.&lt;/p>
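&lt;p>A minimal sketch of this train-then-predict loop, assuming scikit-learn is available (the toy data and labels below are made up):&lt;/p>
&lt;pre>&lt;code class="language-python"># fit a classifier on labelled examples, then predict on unseen inputs
from sklearn.svm import SVC

X_train = [[1, 1], [2, 1], [8, 9], [9, 8]]   # features (made up)
y_train = [0, 0, 1, 1]                       # correct outputs (labels)

clf = SVC(kernel="linear")   # linear, margin-based classifier
clf.fit(X_train, y_train)    # learn from the labelled data

print(clf.predict([[1, 2], [9, 9]]))   # generalise to unseen data: [0 1]
&lt;/code>&lt;/pre>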
&lt;hr>
&lt;h2 id="classification">
 Classification
 
 &lt;a class="anchor" href="#classification">#&lt;/a>
 
&lt;/h2>
&lt;p>Output is &lt;strong>discrete&lt;/strong> (e.g. Yes/No, Spam/Not Spam).&lt;br>
Used for &lt;strong>categorising data&lt;/strong> into predefined classes.&lt;br>
Support Vector Machine (SVM) is a common classifier (a linear classifier with margin-based separation).&lt;/p></description></item><item><title>Differentiation of Univariate Functions</title><link>https://arshadhs.github.io/docs/ai/maths/020-calculus/04-vector-calculus/010-univariate-differentiation/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://arshadhs.github.io/docs/ai/maths/020-calculus/04-vector-calculus/010-univariate-differentiation/</guid><description>&lt;h1 id="differentiation-of-univariate-functions">
 Differentiation of Univariate Functions
 
 &lt;a class="anchor" href="#differentiation-of-univariate-functions">#&lt;/a>
 
&lt;/h1>
&lt;p>Differentiation measures the rate of change of a function. For a function f(x), the derivative is defined by the limit:&lt;/p>
&lt;span style="color: red;">
 $[
f'(x) = $lim_{h $to 0} $frac{f(x+h)-f(x)}{h}
$]
&lt;/span>
&lt;p>Interpretation:&lt;/p>
&lt;ul>
&lt;li>Slope of tangent&lt;/li>
&lt;li>Instantaneous rate of change&lt;/li>
&lt;/ul>
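&lt;p>A minimal Python sketch of the limit definition above, approximating the derivative with a small finite step h (the function and step size are made-up choices):&lt;/p>
&lt;pre>&lt;code class="language-python"># forward-difference approximation of f'(x)
def derivative(f, x, h=1e-6):
    return (f(x + h) - f(x)) / h

f = lambda x: x**2
print(derivative(f, 3.0))   # close to 6.0, the slope of the tangent at x = 3
&lt;/code>&lt;/pre>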
&lt;hr>
&lt;p>&lt;a href="https://arshadhs.github.io/">Home&lt;/a> | &lt;a href="https://arshadhs.github.io/docs/ai/maths/020-calculus/04-vector-calculus/">
 Vector Calculus
&lt;/a>&lt;/p></description></item><item><title>Artificial Intelligence</title><link>https://arshadhs.github.io/docs/ai/</link><pubDate>Thu, 04 Jul 2024 10:55:52 +0100</pubDate><guid>https://arshadhs.github.io/docs/ai/</guid><description>&lt;h1 id="my-ai-notes">
 My AI Notes
 
 &lt;a class="anchor" href="#my-ai-notes">#&lt;/a>
 
&lt;/h1>
&lt;p>Learning how machines learn! My working notes as I learn AI.&lt;/p>
&lt;hr>






&lt;pre class="mermaid">
flowchart LR
 AI[Artificial Intelligence]
 ML[Machine Learning]
 DL[Deep Learning]
 FM[Foundation Models]
 LLM[LLM Models]

 AI --&amp;gt; ML
 ML --&amp;gt; DL
 DL --&amp;gt; FM
 FM --&amp;gt; LLM

 style AI fill:#E1F5FE
 style ML fill:#C8E6C9
 style DL fill:#90CAF9
 style FM fill:#64B5F6
 style LLM fill:#FFCCBC
&lt;/pre>

&lt;hr>
&lt;ul>
&lt;li>Mathematical Foundations for Machine Learning&lt;/li>
&lt;li>Statistical Methods&lt;/li>
&lt;li>Machine Learning&lt;/li>
&lt;li>Deep Neural Networks&lt;/li>
&lt;/ul>
&lt;hr>




&lt;ul>
 &lt;li>&lt;a href="https://arshadhs.github.io/docs/ai/foundation/">AI Foundation&lt;/a>
 &lt;ul>
 &lt;li>&lt;a href="https://arshadhs.github.io/docs/ai/foundation/ai-stages/">AI Stages: ANI, AGI, ASI&lt;/a>&lt;/li>
 &lt;li>&lt;a href="https://arshadhs.github.io/docs/ai/foundation/ai-stack/">AI Stack&lt;/a>&lt;/li>
 &lt;li>&lt;a href="https://arshadhs.github.io/docs/ai/foundation/ai-pipeline/">AI Pipeline&lt;/a>&lt;/li>
 &lt;li>&lt;a href="https://arshadhs.github.io/docs/ai/foundation/ai-notes/">AI Learning Resources&lt;/a>&lt;/li>
 &lt;/ul>
 &lt;/li>
 &lt;li>&lt;a href="https://arshadhs.github.io/docs/ai/machine-learning/">Machine Learning&lt;/a>
 &lt;ul>
 &lt;li>&lt;a href="https://arshadhs.github.io/docs/ai/machine-learning/ml-supervised/">Supervised Learning&lt;/a>&lt;/li>
 &lt;li>&lt;a href="https://arshadhs.github.io/docs/ai/machine-learning/ml-unsupervised/">Unsupervised Learning&lt;/a>&lt;/li>
 &lt;li>&lt;a href="https://arshadhs.github.io/docs/ai/machine-learning/ml-semi-supervised/">Semi-Supervised Learning&lt;/a>&lt;/li>
 &lt;li>&lt;a href="https://arshadhs.github.io/docs/ai/machine-learning/ml-reinforcement/">Reinforcement Learning&lt;/a>&lt;/li>
 &lt;li>&lt;a href="https://arshadhs.github.io/docs/ai/machine-learning/02-ml-workflow/">ML Workflow&lt;/a>&lt;/li>
 &lt;li>&lt;a href="https://arshadhs.github.io/docs/ai/machine-learning/03-linear-models-regression/">Regression (Linear Models)&lt;/a>&lt;/li>
 &lt;li>&lt;a href="https://arshadhs.github.io/docs/ai/machine-learning/03-ordinary-least-squares/">Ordinary Least Squares&lt;/a>&lt;/li>
 &lt;li>&lt;a href="https://arshadhs.github.io/docs/ai/machine-learning/03-cost-function/">Cost Function&lt;/a>&lt;/li>
 &lt;li>&lt;a href="https://arshadhs.github.io/docs/ai/machine-learning/03-gradient-descent-linear-regression/">Gradient Descent&lt;/a>&lt;/li>
 &lt;li>&lt;a href="https://arshadhs.github.io/docs/ai/machine-learning/04-linear-models-classification/">Classification (Linear Models)&lt;/a>&lt;/li>
 &lt;li>&lt;a href="https://arshadhs.github.io/docs/ai/machine-learning/05-decision-tree/">Decision Tree&lt;/a>&lt;/li>
 &lt;li>&lt;a href="https://arshadhs.github.io/docs/ai/machine-learning/06-instance-based-learning/">Instance-based Learning&lt;/a>&lt;/li>
 &lt;li>&lt;a href="https://arshadhs.github.io/docs/ai/machine-learning/07-support-vector-machines/">Support Vector Machine&lt;/a>&lt;/li>
 &lt;li>&lt;a href="https://arshadhs.github.io/docs/ai/machine-learning/08-bayesian-learning/">Bayesian Learning&lt;/a>&lt;/li>
 &lt;li>&lt;a href="https://arshadhs.github.io/docs/ai/machine-learning/09-ensemble-learning/">Ensemble Learning&lt;/a>&lt;/li>
 &lt;li>&lt;a href="https://arshadhs.github.io/docs/ai/machine-learning/11-ml-model-evaluation-comparison/">Evaluation/Comparison&lt;/a>&lt;/li>
 &lt;li>&lt;a href="https://arshadhs.github.io/docs/ai/machine-learning/99-ml-pipeline-model/">ML Pipeline&lt;/a>&lt;/li>
 &lt;/ul>
 &lt;/li>
 &lt;li>&lt;a href="https://arshadhs.github.io/docs/ai/genai/">Generative AI&lt;/a>
 &lt;ul>
 &lt;li>&lt;a href="https://arshadhs.github.io/docs/ai/genai/foundation-model/">Foundation Models&lt;/a>&lt;/li>
 &lt;li>&lt;a href="https://arshadhs.github.io/docs/ai/genai/llm/">LLM - Model&lt;/a>&lt;/li>
 &lt;li>&lt;a href="https://arshadhs.github.io/docs/ai/genai/ai-agents/">AI Agents&lt;/a>&lt;/li>
 &lt;li>&lt;a href="https://arshadhs.github.io/docs/ai/genai/rag/">Retrieval-Augmented Generation (RAG)&lt;/a>&lt;/li>
 &lt;/ul>
 &lt;/li>
 &lt;li>&lt;a href="https://arshadhs.github.io/docs/ai/deep-learning/">Deep Learning&lt;/a>
 &lt;ul>
 &lt;li>&lt;a href="https://arshadhs.github.io/docs/ai/deep-learning/010-neural-network/">Neural Networks&lt;/a>&lt;/li>
 &lt;li>&lt;a href="https://arshadhs.github.io/docs/ai/deep-learning/020-perceptron/">Artificial Neuron and Perceptron&lt;/a>&lt;/li>
 &lt;li>&lt;a href="https://arshadhs.github.io/docs/ai/deep-learning/030-linear-neural-networks-for-regression/">LNN for Regression&lt;/a>&lt;/li>
 &lt;li>&lt;a href="https://arshadhs.github.io/docs/ai/deep-learning/035-gradient-descent-algorithm/">Gradient Descent Algorithm&lt;/a>&lt;/li>
 &lt;li>&lt;a href="https://arshadhs.github.io/docs/ai/deep-learning/040-linear-neural-networks-for-classification/">LNN for Classification&lt;/a>&lt;/li>
 &lt;li>&lt;a href="https://arshadhs.github.io/docs/ai/deep-learning/050-deep-feedforward/">Deep Feedforward Neural Networks (DFNN) for Classification&lt;/a>&lt;/li>
 &lt;li>&lt;a href="https://arshadhs.github.io/docs/ai/deep-learning/060-cnn-fundamentals/">Convolutional Neural Networks&lt;/a>&lt;/li>
 &lt;li>&lt;a href="https://arshadhs.github.io/docs/ai/deep-learning/065-deep-cnn-architectures/">Deep CNN Architectures&lt;/a>&lt;/li>
 &lt;li>&lt;a href="https://arshadhs.github.io/docs/ai/deep-learning/067-cnn-model/">CNN Pipeline&lt;/a>&lt;/li>
 &lt;li>&lt;a href="https://arshadhs.github.io/docs/ai/deep-learning/070-recurrent-nn/">Recurrent Neural Networks&lt;/a>&lt;/li>
 &lt;li>&lt;a href="https://arshadhs.github.io/docs/ai/deep-learning/075-recurrent-nn-deep/">Deep Recurrent Neural Networks&lt;/a>&lt;/li>
 &lt;li>&lt;a href="https://arshadhs.github.io/docs/ai/deep-learning/080-attention-mechanism/">Attention Mechanism&lt;/a>&lt;/li>
 &lt;li>&lt;a href="https://arshadhs.github.io/docs/ai/deep-learning/090-transformer/">Transformer&lt;/a>&lt;/li>
 &lt;li>&lt;a href="https://arshadhs.github.io/docs/ai/deep-learning/100-optimise-deep-models/">Optimisation of Deep models&lt;/a>&lt;/li>
 &lt;li>&lt;a href="https://arshadhs.github.io/docs/ai/deep-learning/110-regularisation-deep-models/">Regularisation for Deep models&lt;/a>&lt;/li>
 &lt;/ul>
 &lt;/li>
 &lt;li>&lt;a href="https://arshadhs.github.io/docs/ai/maths/">Mathematical Foundation&lt;/a>
 &lt;ul>
 &lt;li>&lt;a href="https://arshadhs.github.io/docs/ai/maths/010-linear-algebra/">Linear Algebra&lt;/a>
 &lt;ul>
 &lt;li>&lt;a href="https://arshadhs.github.io/docs/ai/maths/010-linear-algebra/01-linear-systems/">Linear Systems&lt;/a>
 &lt;ul>
 &lt;li>&lt;a href="https://arshadhs.github.io/docs/ai/maths/010-linear-algebra/01-linear-systems/010-systems-of-linear-equations/">Systems of Linear Equations&lt;/a>&lt;/li>
 &lt;li>&lt;a href="https://arshadhs.github.io/docs/ai/maths/010-linear-algebra/01-linear-systems/020-matrices/">Matrices&lt;/a>&lt;/li>
 &lt;li>&lt;a href="https://arshadhs.github.io/docs/ai/maths/010-linear-algebra/01-linear-systems/matrix-transposition/">Matrix Transposition&lt;/a>&lt;/li>
 &lt;li>&lt;a href="https://arshadhs.github.io/docs/ai/maths/010-linear-algebra/01-linear-systems/030-solving-linear-systems/">Solving Linear Systems&lt;/a>&lt;/li>
 &lt;li>&lt;a href="https://arshadhs.github.io/docs/ai/maths/010-linear-algebra/01-linear-systems/forward-backward/">Forward and Backward Substitution&lt;/a>&lt;/li>
 &lt;li>&lt;a href="https://arshadhs.github.io/docs/ai/maths/010-linear-algebra/01-linear-systems/inverse-matrix/">Inverse Matrix&lt;/a>&lt;/li>
 &lt;li>&lt;a href="https://arshadhs.github.io/docs/ai/maths/010-linear-algebra/01-linear-systems/convex/">Convex Combination&lt;/a>&lt;/li>
 &lt;/ul>
 &lt;/li>
 &lt;li>&lt;a href="https://arshadhs.github.io/docs/ai/maths/010-linear-algebra/02-vector-spaces/">Vector Spaces&lt;/a>
 &lt;ul>
 &lt;li>&lt;a href="https://arshadhs.github.io/docs/ai/maths/010-linear-algebra/02-vector-spaces/020-basis-and-rank/">Basis and Rank&lt;/a>&lt;/li>
 &lt;li>&lt;a href="https://arshadhs.github.io/docs/ai/maths/010-linear-algebra/02-vector-spaces/010-linear-independence/">Linear Independence&lt;/a>&lt;/li>
 &lt;li>&lt;a href="https://arshadhs.github.io/docs/ai/maths/010-linear-algebra/02-vector-spaces/030-norm/">Norm&lt;/a>&lt;/li>
 &lt;li>&lt;a href="https://arshadhs.github.io/docs/ai/maths/010-linear-algebra/02-vector-spaces/040-inner-products/">Inner Products and Dot Product&lt;/a>&lt;/li>
 &lt;li>&lt;a href="https://arshadhs.github.io/docs/ai/maths/010-linear-algebra/02-vector-spaces/050-lengths-and-distances/">Lengths and Distances&lt;/a>&lt;/li>
 &lt;li>&lt;a href="https://arshadhs.github.io/docs/ai/maths/010-linear-algebra/02-vector-spaces/060-angles-and-orthogonality/">Angles and Orthogonality&lt;/a>&lt;/li>
 &lt;li>&lt;a href="https://arshadhs.github.io/docs/ai/maths/010-linear-algebra/02-vector-spaces/070-orthonormal-basis/">Orthonormal Basis&lt;/a>&lt;/li>
 &lt;li>&lt;a href="https://arshadhs.github.io/docs/ai/maths/010-linear-algebra/02-vector-spaces/feature-space/">Feature Space&lt;/a>&lt;/li>
 &lt;li>&lt;a href="https://arshadhs.github.io/docs/ai/maths/010-linear-algebra/02-vector-spaces/cauchyschwarz/">Cauchy–Schwarz&lt;/a>&lt;/li>
 &lt;/ul>
 &lt;/li>
 &lt;li>&lt;a href="https://arshadhs.github.io/docs/ai/maths/010-linear-algebra/03-matrix-decomposition/">Matrix Decompositions&lt;/a>
 &lt;ul>
 &lt;li>&lt;a href="https://arshadhs.github.io/docs/ai/maths/010-linear-algebra/03-matrix-decomposition/special-matrices/">Special Matrices&lt;/a>&lt;/li>
 &lt;li>&lt;a href="https://arshadhs.github.io/docs/ai/maths/010-linear-algebra/03-matrix-decomposition/characteristic-polynomial/">Characteristic Polynomial&lt;/a>&lt;/li>
 &lt;li>&lt;a href="https://arshadhs.github.io/docs/ai/maths/010-linear-algebra/03-matrix-decomposition/010-determinant-and-trace/">Determinant and Trace&lt;/a>&lt;/li>
 &lt;li>&lt;a href="https://arshadhs.github.io/docs/ai/maths/010-linear-algebra/03-matrix-decomposition/020-eigenvalues-and-eigenvectors/">Eigenvalues and Eigenvectors&lt;/a>&lt;/li>
 &lt;li>&lt;a href="https://arshadhs.github.io/docs/ai/maths/010-linear-algebra/03-matrix-decomposition/030-cholesky-decomposition/">Cholesky Decomposition&lt;/a>&lt;/li>
 &lt;li>&lt;a href="https://arshadhs.github.io/docs/ai/maths/010-linear-algebra/03-matrix-decomposition/040-eigen-decomposition/">Eigen Decomposition&lt;/a>&lt;/li>
 &lt;li>&lt;a href="https://arshadhs.github.io/docs/ai/maths/010-linear-algebra/03-matrix-decomposition/diagonalization/">Diagonalization&lt;/a>&lt;/li>
 &lt;li>&lt;a href="https://arshadhs.github.io/docs/ai/maths/010-linear-algebra/03-matrix-decomposition/050-singular-value-decomposition/">Singular Value Decomposition (SVD)&lt;/a>&lt;/li>
 &lt;li>&lt;a href="https://arshadhs.github.io/docs/ai/maths/010-linear-algebra/03-matrix-decomposition/060-matrix-approximation/">Matrix Approximation&lt;/a>&lt;/li>
 &lt;/ul>
 &lt;/li>
 &lt;li>&lt;a href="https://arshadhs.github.io/docs/ai/maths/010-linear-algebra/07-dimensionality-reduction/">Dimensionality reduction and PCA&lt;/a>
 &lt;ul>
 &lt;li>&lt;a href="https://arshadhs.github.io/docs/ai/maths/010-linear-algebra/07-dimensionality-reduction/pca/">Principal Component Analysis (PCA)&lt;/a>&lt;/li>
 &lt;li>&lt;a href="https://arshadhs.github.io/docs/ai/maths/010-linear-algebra/07-dimensionality-reduction/pca-theory/">PCA Theory&lt;/a>&lt;/li>
 &lt;li>&lt;a href="https://arshadhs.github.io/docs/ai/maths/010-linear-algebra/07-dimensionality-reduction/pca-practice/">PCA in Practice&lt;/a>&lt;/li>
 &lt;li>&lt;a href="https://arshadhs.github.io/docs/ai/maths/010-linear-algebra/07-dimensionality-reduction/latent-variable-view/">Latent Variable Perspective&lt;/a>&lt;/li>
 &lt;li>&lt;a href="https://arshadhs.github.io/docs/ai/maths/010-linear-algebra/07-dimensionality-reduction/svm-mathematical-foundations/">Mathematical Preliminaries of SVM&lt;/a>&lt;/li>
 &lt;li>&lt;a href="https://arshadhs.github.io/docs/ai/maths/010-linear-algebra/07-dimensionality-reduction/kernels/">Nonlinear SVM and Kernels&lt;/a>&lt;/li>
 &lt;/ul>
 &lt;/li>
 &lt;/ul>
 &lt;/li>
 &lt;li>&lt;a href="https://arshadhs.github.io/docs/ai/maths/020-calculus/">Calculus&lt;/a>
 &lt;ul>
 &lt;li>&lt;a href="https://arshadhs.github.io/docs/ai/maths/020-calculus/04-vector-calculus/">Vector Calculus&lt;/a>
 &lt;ul>
 &lt;li>&lt;a href="https://arshadhs.github.io/docs/ai/maths/020-calculus/04-vector-calculus/010-univariate-differentiation/">Differentiation of Univariate Functions&lt;/a>&lt;/li>
 &lt;li>&lt;a href="https://arshadhs.github.io/docs/ai/maths/020-calculus/04-vector-calculus/020-partial-derivatives-and-gradients/">Partial Differentiation and Gradients&lt;/a>&lt;/li>
 &lt;li>&lt;a href="https://arshadhs.github.io/docs/ai/maths/020-calculus/04-vector-calculus/030-vector-and-matrix-gradients/">Gradients of Vector-Valued and Matrix Functions&lt;/a>&lt;/li>
 &lt;li>&lt;a href="https://arshadhs.github.io/docs/ai/maths/020-calculus/04-vector-calculus/050-gradient-identities/">Useful Gradient Identities&lt;/a>&lt;/li>
 &lt;li>&lt;a href="https://arshadhs.github.io/docs/ai/maths/020-calculus/04-vector-calculus/060-backpropagation/">Backpropagation and Automatic Differentiation&lt;/a>&lt;/li>
 &lt;li>&lt;a href="https://arshadhs.github.io/docs/ai/maths/020-calculus/04-vector-calculus/070-higher-order-derivatives/">Higher-order derivatives&lt;/a>&lt;/li>
 &lt;li>&lt;a href="https://arshadhs.github.io/docs/ai/maths/020-calculus/04-vector-calculus/080-taylors-series/">Taylor’s series&lt;/a>&lt;/li>
 &lt;li>&lt;a href="https://arshadhs.github.io/docs/ai/maths/020-calculus/04-vector-calculus/090-maxima-and-minima/">Maxima and Minima&lt;/a>&lt;/li>
 &lt;/ul>
 &lt;/li>
 &lt;li>&lt;a href="https://arshadhs.github.io/docs/ai/maths/020-calculus/05-optimisation/">Continuous Optimisation&lt;/a>
 &lt;ul>
 &lt;li>&lt;a href="https://arshadhs.github.io/docs/ai/maths/020-calculus/05-optimisation/gradient-descent/">Optimisation using Gradient Descent&lt;/a>&lt;/li>
 &lt;li>&lt;a href="https://arshadhs.github.io/docs/ai/maths/020-calculus/05-optimisation/constrained-optimisation/">Constrained Optimisation&lt;/a>&lt;/li>
 &lt;li>&lt;a href="https://arshadhs.github.io/docs/ai/maths/020-calculus/05-optimisation/lagrange-multipliers/">Lagrange Multipliers&lt;/a>&lt;/li>
 &lt;li>&lt;a href="https://arshadhs.github.io/docs/ai/maths/020-calculus/05-optimisation/convex-optimisation/">Convex Optimisation&lt;/a>&lt;/li>
 &lt;/ul>
 &lt;/li>
 &lt;li>&lt;a href="https://arshadhs.github.io/docs/ai/maths/020-calculus/06-nonlinear-optimisation/">Nonlinear Optimisation&lt;/a>
 &lt;ul>
 &lt;li>&lt;a href="https://arshadhs.github.io/docs/ai/maths/020-calculus/06-nonlinear-optimisation/optimisation-challenges/">Challenges in Gradient-Based Optimisation&lt;/a>&lt;/li>
 &lt;li>&lt;a href="https://arshadhs.github.io/docs/ai/maths/020-calculus/06-nonlinear-optimisation/stochastic-gradient-descent/">Stochastic Gradient Descent (SGD)&lt;/a>&lt;/li>
 &lt;li>&lt;a href="https://arshadhs.github.io/docs/ai/maths/020-calculus/06-nonlinear-optimisation/momentum-methods/">Momentum-Based Learning&lt;/a>&lt;/li>
 &lt;li>&lt;a href="https://arshadhs.github.io/docs/ai/maths/020-calculus/06-nonlinear-optimisation/adaptive-methods/">Adaptive Methods: AdaGrad, RMSProp, Adam&lt;/a>&lt;/li>
 &lt;li>&lt;a href="https://arshadhs.github.io/docs/ai/maths/020-calculus/06-nonlinear-optimisation/hyperparameter-tuning/">Tuning Hyperparameters and Preprocessing&lt;/a>&lt;/li>
 &lt;/ul>
 &lt;/li>
 &lt;/ul>
 &lt;/li>
 &lt;/ul>
 &lt;/li>
 &lt;li>&lt;a href="https://arshadhs.github.io/docs/ai/statistics/">Statistics&lt;/a>
 &lt;ul>
 &lt;li>&lt;a href="https://arshadhs.github.io/docs/ai/statistics/00_formulas/">Formula Sheet&lt;/a>&lt;/li>
 &lt;li>&lt;a href="https://arshadhs.github.io/docs/ai/statistics/ism-formula-sheet/">Stats Formula Sheet&lt;/a>&lt;/li>
 &lt;li>&lt;a href="https://arshadhs.github.io/docs/ai/statistics/01_basic_statistics/">Basic Statistics&lt;/a>&lt;/li>
 &lt;li>&lt;a href="https://arshadhs.github.io/docs/ai/statistics/01_basic_probability/">Basic Probability&lt;/a>&lt;/li>
 &lt;li>&lt;a href="https://arshadhs.github.io/docs/ai/statistics/04_hypothesis_testing/">Hypothesis Testing&lt;/a>&lt;/li>
 &lt;li>&lt;a href="https://arshadhs.github.io/docs/ai/statistics/05_prediction_n_forecasting/">Prediction &amp;amp; Forecasting&lt;/a>&lt;/li>
 &lt;li>&lt;a href="https://arshadhs.github.io/docs/ai/statistics/06_prediction_n_forecasting/">Gaussian Mixture model &amp;amp; Expectation Maximization&lt;/a>&lt;/li>
 &lt;li>&lt;a href="https://arshadhs.github.io/docs/ai/statistics/conditional-probability/">Conditional Probability &amp;amp; Bayes’ Theorem&lt;/a>
 &lt;ul>
 &lt;li>&lt;a href="https://arshadhs.github.io/docs/ai/statistics/conditional-probability/021_conditional_prob/">Conditional Probability&lt;/a>&lt;/li>
 &lt;li>&lt;a href="https://arshadhs.github.io/docs/ai/statistics/conditional-probability/022_bayes_theorem/">Bayes’ Theorem&lt;/a>&lt;/li>
 &lt;li>&lt;a href="https://arshadhs.github.io/docs/ai/statistics/conditional-probability/023_naive_bayes/">Naïve Bayes&lt;/a>&lt;/li>
 &lt;/ul>
 &lt;/li>
 &lt;li>&lt;a href="https://arshadhs.github.io/docs/ai/statistics/probability_distributions/">Probability Distributions&lt;/a>
 &lt;ul>
 &lt;li>&lt;a href="https://arshadhs.github.io/docs/ai/statistics/probability_distributions/random-variables/">Random Variables&lt;/a>&lt;/li>
 &lt;li>&lt;a href="https://arshadhs.github.io/docs/ai/statistics/probability_distributions/common-distributions/">Common Probability Distributions&lt;/a>&lt;/li>
 &lt;/ul>
 &lt;/li>
 &lt;/ul>
 &lt;/li>
&lt;/ul>


&lt;hr>
&lt;ul>
&lt;li>Machine Learning → The broad field where systems learn patterns from data to make predictions or decisions.&lt;/li>
&lt;li>Neural Networks → A subset of machine learning that uses interconnected artificial neurons to model complex relationships.&lt;/li>
&lt;li>Deep Learning → A subset of neural networks that uses many hidden layers to learn high-level features from large datasets.&lt;/li>
&lt;li>Foundation Models → Large deep learning models trained on massive datasets and reused across many tasks using transfer learning.&lt;/li>
&lt;li>LLMs (Large Language Models) → A specialised type of foundation model focused on understanding and generating human language.&lt;/li>
&lt;/ul>
&lt;hr>


&lt;pre class="mermaid">
flowchart TD
AI[&amp;#34;Artificial&amp;lt;br/&amp;gt;Intelligence&amp;#34;]
ML[&amp;#34;Machine&amp;lt;br/&amp;gt;Learning&amp;#34;]
NN[&amp;#34;Neural&amp;lt;br/&amp;gt;Networks&amp;#34;]
DL[&amp;#34;Deep&amp;lt;br/&amp;gt;Learning&amp;#34;]
FM[&amp;#34;Foundation&amp;lt;br/&amp;gt;Models&amp;#34;]
LLM[&amp;#34;LLM&amp;lt;br/&amp;gt;Models&amp;#34;]

AI --&amp;gt; ML
ML --&amp;gt; NN
NN --&amp;gt; DL
DL --&amp;gt; FM
FM --&amp;gt; LLM

LR[&amp;#34;Linear&amp;lt;br/&amp;gt;Regression&amp;#34;]
DT[&amp;#34;Decision&amp;lt;br/&amp;gt;Trees&amp;#34;]
ML --&amp;gt; LR
ML --&amp;gt; DT

MLP[&amp;#34;MLP&amp;#34;]
CNN[&amp;#34;CNN&amp;#34;]
NN --&amp;gt; MLP
NN --&amp;gt; CNN

CNNDL[&amp;#34;CNN&amp;lt;br/&amp;gt;(deep)&amp;#34;]
RNN[&amp;#34;RNN&amp;#34;]
DL --&amp;gt; CNNDL
DL --&amp;gt; RNN

BERT[&amp;#34;BERT&amp;#34;]
CLIP[&amp;#34;CLIP&amp;#34;]
FM --&amp;gt; BERT
FM --&amp;gt; CLIP

GPT[&amp;#34;GPT&amp;#34;]
LLAMA[&amp;#34;LLaMA&amp;#34;]
LLM --&amp;gt; GPT
LLM --&amp;gt; LLAMA

TEXT[&amp;#34;Text&amp;#34;]
IMAGE[&amp;#34;Images&amp;#34;]
AUDIO[&amp;#34;Audio&amp;#34;]
VIDEO[&amp;#34;Video&amp;#34;]
LLM --&amp;gt; TEXT
LLM --&amp;gt; IMAGE
LLM --&amp;gt; AUDIO
LLM --&amp;gt; VIDEO

style AI fill:#90CAF9,stroke:#1E88E5,color:#000
style ML fill:#90CAF9,stroke:#1E88E5,color:#000
style NN fill:#90CAF9,stroke:#1E88E5,color:#000

style DL fill:#CE93D8,stroke:#8E24AA,color:#000
style FM fill:#CE93D8,stroke:#8E24AA,color:#000

style LLM fill:#C8E6C9,stroke:#2E7D32,color:#000
style LR fill:#C8E6C9,stroke:#2E7D32,color:#000
style DT fill:#C8E6C9,stroke:#2E7D32,color:#000
style MLP fill:#C8E6C9,stroke:#2E7D32,color:#000
style CNN fill:#C8E6C9,stroke:#2E7D32,color:#000
style CNNDL fill:#C8E6C9,stroke:#2E7D32,color:#000
style RNN fill:#C8E6C9,stroke:#2E7D32,color:#000
style BERT fill:#C8E6C9,stroke:#2E7D32,color:#000
style CLIP fill:#C8E6C9,stroke:#2E7D32,color:#000
style GPT fill:#C8E6C9,stroke:#2E7D32,color:#000
style LLAMA fill:#C8E6C9,stroke:#2E7D32,color:#000
style TEXT fill:#C8E6C9,stroke:#2E7D32,color:#000
style IMAGE fill:#C8E6C9,stroke:#2E7D32,color:#000
style AUDIO fill:#C8E6C9,stroke:#2E7D32,color:#000
style VIDEO fill:#C8E6C9,stroke:#2E7D32,color:#000
&lt;/pre>

&lt;hr>
&lt;p>&lt;img src="https://arshadhs.github.io/images/ai/ai_ml_dl_ds_diagram.png" alt="AI, ML, DL, and Data Science Diagram" />&lt;/p></description></item><item><title>Stats Formula Sheet</title><link>https://arshadhs.github.io/docs/ai/statistics/ism-formula-sheet/</link><pubDate>Wed, 25 Feb 2026 00:00:00 +0000</pubDate><guid>https://arshadhs.github.io/docs/ai/statistics/ism-formula-sheet/</guid><description>&lt;h1 id="stats-formula-sheet">
 Stats Formula Sheet
 
 &lt;a class="anchor" href="#stats-formula-sheet">#&lt;/a>
 
&lt;/h1>
&lt;p>Keep this page as a quick reference of &lt;strong>definitions + formulas&lt;/strong>.&lt;/p>
&lt;hr>
&lt;h2 id="notation">
 Notation
 
 &lt;a class="anchor" href="#notation">#&lt;/a>
 
&lt;/h2>
&lt;ul>
&lt;li>Sample size: 
&lt;span>
 \( n \)
 &lt;/span>

 (sample), 
&lt;span>
 \( N \)
 &lt;/span>

 (population)&lt;/li>
&lt;li>Mean: 
&lt;span>
 \( \bar{x} \)
 &lt;/span>

 (sample), 
&lt;span>
 \( \mu \)
 &lt;/span>

 (population)&lt;/li>
&lt;li>Variance: 
&lt;span>
 \( s^2 \)
 &lt;/span>

 (sample), 
&lt;span>
 \( \sigma^2 \)
 &lt;/span>

 (population)&lt;/li>
&lt;li>Standard deviation: 
&lt;span>
 \( s \)
 &lt;/span>

 (sample), 
&lt;span>
 \( \sigma \)
 &lt;/span>

 (population)&lt;/li>
&lt;/ul>
&lt;hr>
&lt;h2 id="module-1-basic-statistics">
 Module 1: Basic Statistics
 
 &lt;a class="anchor" href="#module-1-basic-statistics">#&lt;/a>
 
&lt;/h2>
&lt;h3 id="measures-of-central-tendency">
 Measures of Central Tendency
 
 &lt;a class="anchor" href="#measures-of-central-tendency">#&lt;/a>
 
&lt;/h3>
&lt;p>&lt;strong>Sample mean (ungrouped):&lt;/strong>&lt;/p></description></item><item><title>Unsupervised Learning</title><link>https://arshadhs.github.io/docs/ai/machine-learning/ml-unsupervised/</link><pubDate>Sat, 03 Jan 2026 10:29:52 +0100</pubDate><guid>https://arshadhs.github.io/docs/ai/machine-learning/ml-unsupervised/</guid><description>&lt;h1 id="unsupervised-learning">
 Unsupervised Learning
 
 &lt;a class="anchor" href="#unsupervised-learning">#&lt;/a>
 
&lt;/h1>
&lt;ul>
&lt;li>Works on &lt;strong>unlabelled raw data&lt;/strong>.&lt;/li>
&lt;li>The algorithm &lt;strong>discovers hidden patterns&lt;/strong> without prior knowledge of outcomes.&lt;/li>
&lt;li>Requires &lt;strong>no human intervention&lt;/strong> during training.&lt;/li>
&lt;li>Does not make direct predictions — it &lt;strong>groups or organises data&lt;/strong> instead.&lt;/li>
&lt;li>Carries a &lt;strong>higher risk&lt;/strong> because there’s no ground truth to verify results.&lt;/li>
&lt;li>Common techniques include &lt;strong>Clustering&lt;/strong>, &lt;strong>Association&lt;/strong>, and &lt;strong>Dimensionality Reduction&lt;/strong>.&lt;/li>
&lt;/ul>
&lt;hr>






&lt;pre class="mermaid">
stateDiagram-v2

 %% ML maths-based colours (same palette as supervised)
 classDef probability fill:#d1fae5,stroke:#065f46,stroke-width:1px
 classDef geometry fill:#ffedd5,stroke:#9a3412,stroke-width:1px
 classDef category font-style:italic,font-weight:bold,fill:#f3f4f6,stroke:#374151

 %% Root
 USL: Unsupervised Learning

 %% Main branches
 USL --&amp;gt; CLU:::category
 CLU: Clustering

 USL --&amp;gt; DR:::category
 DR: Dimensionality Reduction

 %% Clustering algorithms
 CLU --&amp;gt; KM:::geometry
 KM: K-Means

 CLU --&amp;gt; HC:::geometry
 HC: Hierarchical Clustering

 CLU --&amp;gt; DB:::geometry
 DB: DBSCAN

 %% Probabilistic models
 USL --&amp;gt; PM:::category
 PM: Probabilistic Models

 PM --&amp;gt; GMM:::probability
 GMM: Gaussian Mixture Model

 PM --&amp;gt; HMM:::probability
 HMM: Hidden Markov Model
&lt;/pre>

&lt;hr>
&lt;h2 id="clustering">
 Clustering
 
 &lt;a class="anchor" href="#clustering">#&lt;/a>
 
&lt;/h2>
&lt;ul>
&lt;li>Groups &lt;strong>similar data points&lt;/strong> together based on shared features.&lt;/li>
&lt;li>Commonly used for &lt;strong>market segmentation&lt;/strong>, &lt;strong>image compression&lt;/strong>, and &lt;strong>anomaly detection&lt;/strong>.&lt;/li>
&lt;/ul>
&lt;h3 id="common-types-of-clustering">
 Common Types of Clustering
 
 &lt;a class="anchor" href="#common-types-of-clustering">#&lt;/a>
 
&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>K-Means Clustering&lt;/strong> – Divides data into &lt;em>K&lt;/em> groups based on similarity.&lt;/li>
&lt;li>&lt;strong>Hierarchical Clustering&lt;/strong> – Builds a hierarchy (tree) of clusters.&lt;/li>
&lt;li>&lt;strong>DBSCAN (Density-Based Spatial Clustering)&lt;/strong> – Groups points close in density; identifies noise/outliers.&lt;/li>
&lt;/ul>
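&lt;p>A minimal K-Means sketch, assuming scikit-learn is available and using made-up 2-D points with two obvious groups:&lt;/p>
&lt;pre>&lt;code class="language-python">from sklearn.cluster import KMeans

# unlabelled data: no ground truth is given to the algorithm
X = [[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]]

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)           # discovered cluster of each point
print(km.cluster_centers_)  # the two learned centroids
&lt;/code>&lt;/pre>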
&lt;hr>
&lt;h2 id="association">
 Association
 
 &lt;a class="anchor" href="#association">#&lt;/a>
 
&lt;/h2>
&lt;ul>
&lt;li>Identifies &lt;strong>relationships or correlations&lt;/strong> between variables in a dataset.&lt;/li>
&lt;li>Commonly used in &lt;strong>market basket analysis&lt;/strong> (e.g. &amp;ldquo;Customers who bought X also bought Y&amp;rdquo;).&lt;/li>
&lt;/ul>
&lt;h3 id="common-techniques">
 Common Techniques
 
 &lt;a class="anchor" href="#common-techniques">#&lt;/a>
 
&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Apriori Algorithm&lt;/strong> – Finds frequent itemsets and generates association rules.&lt;/li>
&lt;li>&lt;strong>Eclat Algorithm&lt;/strong> – Similar to Apriori but uses set intersections for faster computation.&lt;/li>
&lt;/ul>
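&lt;p>A minimal market-basket sketch in plain Python (the baskets are made up): count how often item pairs co-occur, which is the raw ingredient behind Apriori-style association rules.&lt;/p>
&lt;pre>&lt;code class="language-python">from itertools import combinations
from collections import Counter

baskets = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"beer", "bread"},
    {"butter", "milk"},
]

pair_counts = Counter()
for basket in baskets:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

# support of a pair = fraction of baskets in which it appears together
for pair, count in pair_counts.most_common(3):
    print(pair, count / len(baskets))
&lt;/code>&lt;/pre>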
&lt;hr>
&lt;h2 id="dimensionality-reduction">
 Dimensionality Reduction
 
 &lt;a class="anchor" href="#dimensionality-reduction">#&lt;/a>
 
&lt;/h2>
&lt;ul>
&lt;li>Reduces the &lt;strong>number of input variables&lt;/strong> to simplify data.&lt;/li>
&lt;li>Helps remove noise and redundancy.&lt;/li>
&lt;li>Commonly used in &lt;strong>data pre-processing&lt;/strong> and &lt;strong>visualisation&lt;/strong>.&lt;/li>
&lt;/ul>
&lt;h3 id="common-techniques-1">
 Common Techniques
 
 &lt;a class="anchor" href="#common-techniques-1">#&lt;/a>
 
&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Principal Component Analysis (PCA)&lt;/strong> – Projects data onto fewer dimensions while keeping most variance.&lt;/li>
&lt;li>&lt;strong>Linear Discriminant Analysis (LDA)&lt;/strong> – Focuses on class separation.&lt;/li>
&lt;li>&lt;strong>t-SNE (t-Distributed Stochastic Neighbour Embedding)&lt;/strong> – Used for visualising high-dimensional data.&lt;/li>
&lt;li>&lt;strong>Autoencoders&lt;/strong> – Neural networks that compress and reconstruct data.&lt;/li>
&lt;/ul>
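&lt;p>A minimal PCA sketch, assuming scikit-learn and numpy are available; the data is random, with one feature made nearly redundant on purpose:&lt;/p>
&lt;pre>&lt;code class="language-python">from sklearn.decomposition import PCA
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
X[:, 2] = X[:, 0] + 0.01 * X[:, 2]    # third feature is almost redundant

pca = PCA(n_components=2)
X2 = pca.fit_transform(X)             # reduced to 2 input variables
print(pca.explained_variance_ratio_)  # variance kept by each component
&lt;/code>&lt;/pre>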
&lt;hr>


&lt;pre class="mermaid">
mindmap
  root(Unsupervised Learning)
    Clustering
      K Means
      Hierarchical Clustering
      DBSCAN
    Dimensionality Reduction
      PCA
      t SNE
      Autoencoders
    Probabilistic Models
      Gaussian Mixture Model
      Hidden Markov Model
&lt;/pre>

&lt;hr>
&lt;p>&lt;a href="https://arshadhs.github.io/">Home&lt;/a> | &lt;a href="https://arshadhs.github.io/docs/ai/machine-learning/">
 Machine Learning
&lt;/a>&lt;/p></description></item><item><title>Partial Differentiation and Gradients</title><link>https://arshadhs.github.io/docs/ai/maths/020-calculus/04-vector-calculus/020-partial-derivatives-and-gradients/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://arshadhs.github.io/docs/ai/maths/020-calculus/04-vector-calculus/020-partial-derivatives-and-gradients/</guid><description>&lt;h1 id="partial-differentiation-and-gradients">
 Partial Differentiation and Gradients
 
 &lt;a class="anchor" href="#partial-differentiation-and-gradients">#&lt;/a>
 
&lt;/h1>
&lt;p>For f(x1, x2, &amp;hellip;, xn):&lt;/p>
&lt;span style="color: red;">
 [
\frac{\partial f}{\partial x_i}
]
&lt;/span>
&lt;p>Gradient vector:&lt;/p>
&lt;span style="color: red;">
 [
\nabla f =
\begin{bmatrix}
\frac{\partial f}{\partial x_1} \
\vdots \
\frac{\partial f}{\partial x_n}
\end{bmatrix}
]
&lt;/span>
&lt;p>The gradient points in the direction of steepest ascent.&lt;/p>
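&lt;p>A minimal numpy sketch (numpy assumed available): approximate each partial derivative with a finite difference and stack them into the gradient vector.&lt;/p>
&lt;pre>&lt;code class="language-python">import numpy as np

def grad(f, x, h=1e-6):
    g = np.zeros_like(x)
    for i in range(len(x)):
        e = np.zeros_like(x)
        e[i] = h
        g[i] = (f(x + e) - f(x)) / h   # partial derivative w.r.t. x_i
    return g

f = lambda x: x[0]**2 + 3 * x[1]       # made-up example function
print(grad(f, np.array([1.0, 2.0])))   # approximately [2, 3]
&lt;/code>&lt;/pre>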






&lt;pre class="mermaid">
flowchart LR
 Input --&amp;gt; Function
 Function --&amp;gt; Gradient
 Gradient --&amp;gt; Optimisation
&lt;/pre>

&lt;hr>
&lt;p>&lt;a href="https://arshadhs.github.io/">Home&lt;/a> | &lt;a href="https://arshadhs.github.io/docs/ai/maths/020-calculus/04-vector-calculus/">
 Vector Calculus
&lt;/a>&lt;/p></description></item><item><title>Linear Independence</title><link>https://arshadhs.github.io/docs/ai/maths/010-linear-algebra/02-vector-spaces/010-linear-independence/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://arshadhs.github.io/docs/ai/maths/010-linear-algebra/02-vector-spaces/010-linear-independence/</guid><description>&lt;h1 id="linear-independence">
 Linear Independence
 
 &lt;a class="anchor" href="#linear-independence">#&lt;/a>
 
&lt;/h1>
&lt;p>A set of vectors is &lt;strong>linearly independent&lt;/strong> if none of them can be written as a linear combination of the others.&lt;/p>

&lt;span style="color: green;">
 &lt;span>
 \[ 
c_1\mathbf{v}_1 + \cdots + c_k\mathbf{v}_k = \mathbf{0}
\;\Rightarrow\;
c_1=\cdots=c_k=0
 \]
 &lt;/span>

&lt;/span>
&lt;p>Independence means each vector adds &lt;strong>new information&lt;/strong>.&lt;/p>
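&lt;p>A minimal numpy check (numpy assumed; the vectors are made up): stack the vectors as columns; the set is independent exactly when the matrix has full column rank.&lt;/p>
&lt;pre>&lt;code class="language-python">import numpy as np

v1, v2, v3 = [1, 0, 0], [0, 1, 0], [1, 1, 0]   # v3 = v1 + v2: redundant
A = np.column_stack([v1, v2, v3])

print(np.linalg.matrix_rank(A))   # 2, not 3: the set is dependent
&lt;/code>&lt;/pre>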
&lt;h2 id="why-it-matters">
 Why it matters
 
 &lt;a class="anchor" href="#why-it-matters">#&lt;/a>
 
&lt;/h2>
&lt;ul>
&lt;li>Detects redundancy&lt;/li>
&lt;li>Connects to rank and basis&lt;/li>
&lt;/ul>
&lt;p>If one vector can already be formed using others, it does not add anything new.&lt;/p></description></item><item><title>Semi-Supervised Learning</title><link>https://arshadhs.github.io/docs/ai/machine-learning/ml-semi-supervised/</link><pubDate>Sat, 03 Jan 2026 10:29:52 +0100</pubDate><guid>https://arshadhs.github.io/docs/ai/machine-learning/ml-semi-supervised/</guid><description>&lt;h1 id="semi-supervised-learning">
 Semi-Supervised Learning
 
 &lt;a class="anchor" href="#semi-supervised-learning">#&lt;/a>
 
&lt;/h1>
&lt;ul>
&lt;li>A combination of &lt;strong>labelled&lt;/strong> and &lt;strong>unlabelled data&lt;/strong>.&lt;/li>
&lt;li>Useful when labelling large datasets is &lt;strong>expensive or time-consuming&lt;/strong>.&lt;/li>
&lt;li>Works well with &lt;strong>high-volume datasets&lt;/strong> (e.g. millions of images).&lt;/li>
&lt;li>Only a &lt;strong>small fraction of data&lt;/strong> is labelled (e.g. a few thousand).&lt;/li>
&lt;li>The algorithm learns from both labelled examples and structure in unlabelled data.&lt;/li>
&lt;li>&lt;strong>Ideal for medical imaging&lt;/strong> where labelled data is limited.&lt;/li>
&lt;li>For example, a &lt;strong>radiologist&lt;/strong> can label a small set of medical scans,&lt;br>
and the model uses that to learn from thousands of unlabelled scans.&lt;/li>
&lt;li>Helps improve &lt;strong>accuracy and generalisation&lt;/strong> with minimal manual effort.&lt;/li>
&lt;/ul>
&lt;hr>
&lt;p>&lt;a href="https://arshadhs.github.io/">Home&lt;/a> | &lt;a href="https://arshadhs.github.io/docs/ai/machine-learning/">
 Machine Learning
&lt;/a>&lt;/p></description></item><item><title>Gradients of Vector-Valued and Matrix Functions</title><link>https://arshadhs.github.io/docs/ai/maths/020-calculus/04-vector-calculus/030-vector-and-matrix-gradients/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://arshadhs.github.io/docs/ai/maths/020-calculus/04-vector-calculus/030-vector-and-matrix-gradients/</guid><description>&lt;h1 id="gradients-of-vector-valued-and-matrix-functions">
 Gradients of Vector-Valued and Matrix Functions
 
 &lt;a class="anchor" href="#gradients-of-vector-valued-and-matrix-functions">#&lt;/a>
 
&lt;/h1>
&lt;p>Covers gradients when outputs or parameters are vectors/matrices.&lt;/p>
&lt;p>If f: R^n -&amp;gt; R^m, the derivative is the Jacobian.&lt;/p>
&lt;span style="color: red;">
 [
J =
\begin{bmatrix}
\frac{\partial f_1}{\partial x_1} &amp;amp; \dots &amp;amp; \frac{\partial f_1}{\partial x_n} \
\vdots &amp;amp; \ddots &amp;amp; \vdots \
\frac{\partial f_m}{\partial x_1} &amp;amp; \dots &amp;amp; \frac{\partial f_m}{\partial x_n}
\end{bmatrix}
]
&lt;/span>
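&lt;p>A minimal numpy sketch (numpy assumed): build the Jacobian column by column with finite differences, for a made-up f from R^2 to R^2.&lt;/p>
&lt;pre>&lt;code class="language-python">import numpy as np

def jacobian(f, x, h=1e-6):
    y = f(x)
    J = np.zeros((len(y), len(x)))
    for i in range(len(x)):
        e = np.zeros_like(x)
        e[i] = h
        J[:, i] = (f(x + e) - y) / h   # column i: partials w.r.t. x_i
    return J

f = lambda x: np.array([x[0] * x[1], np.sin(x[0])])
print(jacobian(f, np.array([1.0, 2.0])))   # a 2x2 Jacobian
&lt;/code>&lt;/pre>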
&lt;p>For a scalar-valued f(x), the matrix of second derivatives is the Hessian:&lt;/p>
&lt;span style="color: red;">
 [
H = \nabla^2 f
]
&lt;/span>
&lt;p>Hessian captures curvature.&lt;/p></description></item><item><title>Reinforcement Learning</title><link>https://arshadhs.github.io/docs/ai/machine-learning/ml-reinforcement/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://arshadhs.github.io/docs/ai/machine-learning/ml-reinforcement/</guid><description>&lt;h1 id="reinforcement-learning-rl">
 Reinforcement Learning (RL)
 
 &lt;a class="anchor" href="#reinforcement-learning-rl">#&lt;/a>
 
&lt;/h1>
&lt;p>RL is learning by &lt;strong>trial and error&lt;/strong>.&lt;/p>
&lt;p>Reinforcement Learning (RL) is a type of machine learning where an &lt;strong>autonomous agent learns to make decisions by interacting with an environment&lt;/strong>.&lt;/p>
&lt;p>Instead of being told the correct answer, the agent:&lt;/p>
&lt;ul>
&lt;li>takes actions&lt;/li>
&lt;li>observes outcomes&lt;/li>
&lt;li>receives rewards or penalties&lt;/li>
&lt;li>gradually learns a strategy that maximises long-term reward&lt;/li>
&lt;/ul>
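&lt;p>A minimal sketch of this reward loop on a two-armed bandit (all numbers made up): the agent explores with probability epsilon, otherwise exploits its current value estimates.&lt;/p>
&lt;pre>&lt;code class="language-python">import random

true_reward_prob = [0.3, 0.7]   # hidden from the agent
estimates = [0.0, 0.0]          # agent's value estimate per action
counts = [0, 0]
epsilon = 0.1                   # exploration rate

random.seed(42)
for step in range(1000):
    if random.random() > epsilon:                       # exploit...
        action = 0 if estimates[0] > estimates[1] else 1
    else:                                               # ...or explore
        action = random.randrange(2)
    # environment returns a reward of 1 with the action's probability
    reward = 1 if random.random() > 1 - true_reward_prob[action] else 0
    counts[action] += 1
    estimates[action] += (reward - estimates[action]) / counts[action]

print(estimates)   # approaches [0.3, 0.7]; the agent learned to act
&lt;/code>&lt;/pre>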

&lt;blockquote class='book-hint '>
 &lt;p>&lt;strong>Reinforcement Learning teaches an agent how to act, not what to predict.&lt;/strong>&lt;/p></description></item><item><title>Useful Gradient Identities</title><link>https://arshadhs.github.io/docs/ai/maths/020-calculus/04-vector-calculus/050-gradient-identities/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://arshadhs.github.io/docs/ai/maths/020-calculus/04-vector-calculus/050-gradient-identities/</guid><description>&lt;h1 id="useful-gradient-identities">
 Useful Gradient Identities
 
 &lt;a class="anchor" href="#useful-gradient-identities">#&lt;/a>
 
&lt;/h1>
&lt;span style="color: red;">
 [
\nabla (a^T x) = a
]
&lt;/span>
&lt;span style="color: red;">
 [
\nabla (x^T A x) = (A + A^T)x
]
&lt;/span>
&lt;p>If A symmetric:&lt;/p>
&lt;span style="color: red;">
 [
\nabla (x^T A x) = 2Ax
]
&lt;/span>
&lt;p>These are heavily used in &lt;strong>optimisation&lt;/strong>.&lt;/p>
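&lt;p>A minimal numpy check of the quadratic-form identity (numpy assumed; A and x are made-up values): compare a finite-difference gradient against (A + A^T)x.&lt;/p>
&lt;pre>&lt;code class="language-python">import numpy as np

A = np.array([[2.0, 1.0], [0.0, 3.0]])   # deliberately non-symmetric
x = np.array([1.0, 2.0])

f = lambda x: x @ A @ x
h = 1e-6
num = np.array([(f(x + h * e) - f(x)) / h for e in np.eye(2)])

print(num)             # numerical gradient
print((A + A.T) @ x)   # matches the identity (A + A^T) x
&lt;/code>&lt;/pre>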
&lt;hr>
&lt;p>&lt;a href="https://arshadhs.github.io/">Home&lt;/a> | &lt;a href="https://arshadhs.github.io/docs/ai/maths/020-calculus/04-vector-calculus/">
 Vector Calculus
&lt;/a>&lt;/p></description></item><item><title>Inner Products and Dot Product</title><link>https://arshadhs.github.io/docs/ai/maths/010-linear-algebra/02-vector-spaces/040-inner-products/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://arshadhs.github.io/docs/ai/maths/010-linear-algebra/02-vector-spaces/040-inner-products/</guid><description>&lt;h1 id="inner-products-and-dot-product">
 Inner Products and Dot Product
 
 &lt;a class="anchor" href="#inner-products-and-dot-product">#&lt;/a>
 
&lt;/h1>
&lt;p>An &lt;strong>inner product&lt;/strong> maps two vectors to a &lt;strong>single scalar&lt;/strong>.&lt;/p>
&lt;p>It allows us to measure:&lt;/p>
&lt;ul>
&lt;li>similarity&lt;/li>
&lt;li>vector length&lt;/li>
&lt;li>projections&lt;/li>
&lt;li>orthogonality&lt;/li>
&lt;/ul>
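&lt;p>A minimal numpy sketch of those four quantities (numpy assumed; the vectors are made up):&lt;/p>
&lt;pre>&lt;code class="language-python">import numpy as np

a = np.array([1.0, 2.0, 2.0])
b = np.array([2.0, 0.0, 1.0])

print(a @ b)                  # similarity: the inner (dot) product
print(np.sqrt(a @ a))         # vector length of a (here 3.0)
print((a @ b) / (b @ b) * b)  # projection of a onto b
print(np.array([1.0, 0.0]) @ np.array([0.0, 1.0]))  # 0.0: orthogonal
&lt;/code>&lt;/pre>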






&lt;pre class="mermaid">
flowchart TD
T[&amp;#34;Inner&amp;lt;br/&amp;gt;products&amp;lt;br/&amp;gt;(types)&amp;#34;] --&amp;gt; DOT[&amp;#34;Euclidean&amp;lt;br/&amp;gt;Dot product&amp;#34;]
T --&amp;gt; WIP[&amp;#34;Weighted&amp;lt;br/&amp;gt;inner product&amp;#34;]
T --&amp;gt; FN[&amp;#34;Function-space&amp;lt;br/&amp;gt;(integral)&amp;#34;]
T --&amp;gt; HERM[&amp;#34;Complex&amp;lt;br/&amp;gt;Hermitian&amp;#34;]
T --&amp;gt; MAT[&amp;#34;Matrix&amp;lt;br/&amp;gt;inner product&amp;lt;br/&amp;gt;(Frobenius)&amp;#34;]

DOT --&amp;gt; Rn[&amp;#34;Vectors in&amp;lt;br/&amp;gt;
&amp;lt;span&amp;gt;
 \( \mathbb{R}^n \)
 &amp;lt;/span&amp;gt;

&amp;#34;]
WIP --&amp;gt; SPD[&amp;#34;SPD matrix&amp;lt;br/&amp;gt;W&amp;#34;]
FN --&amp;gt; L2[&amp;#34;L2 space&amp;lt;br/&amp;gt;functions&amp;#34;]
HERM --&amp;gt; Cn[&amp;#34;Vectors in&amp;lt;br/&amp;gt;C^n&amp;#34;]
MAT --&amp;gt; Mnm[&amp;#34;Matrices&amp;lt;br/&amp;gt;R^{m×n}&amp;#34;]

style T fill:#90CAF9,stroke:#1E88E5,color:#000

style DOT fill:#C8E6C9,stroke:#2E7D32,color:#000
style WIP fill:#C8E6C9,stroke:#2E7D32,color:#000
style FN fill:#C8E6C9,stroke:#2E7D32,color:#000
style HERM fill:#C8E6C9,stroke:#2E7D32,color:#000
style MAT fill:#C8E6C9,stroke:#2E7D32,color:#000

style Rn fill:#CE93D8,stroke:#8E24AA,color:#000
style SPD fill:#CE93D8,stroke:#8E24AA,color:#000
style L2 fill:#CE93D8,stroke:#8E24AA,color:#000
style Cn fill:#CE93D8,stroke:#8E24AA,color:#000
style Mnm fill:#CE93D8,stroke:#8E24AA,color:#000
&lt;/pre>

&lt;hr>
&lt;h2 id="definition">
 Definition
 
 &lt;a class="anchor" href="#definition">#&lt;/a>
 
&lt;/h2>
&lt;p>For vectors&lt;br>

&lt;span>
 \( \mathbf{a}, \mathbf{b} \in \mathbb{R}^n \)
 &lt;/span>

&lt;/p></description></item><item><title>Backpropagation and Automatic Differentiation</title><link>https://arshadhs.github.io/docs/ai/maths/020-calculus/04-vector-calculus/060-backpropagation/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://arshadhs.github.io/docs/ai/maths/020-calculus/04-vector-calculus/060-backpropagation/</guid><description>&lt;h1 id="backpropagation-and-automatic-differentiation">
 Backpropagation and Automatic Differentiation
 
 &lt;a class="anchor" href="#backpropagation-and-automatic-differentiation">#&lt;/a>
 
&lt;/h1>
&lt;p>Backpropagation applies the chain rule repeatedly and efficiently across a computational graph.&lt;/p>
&lt;p>Chain rule:&lt;/p>
&lt;span style="color: red;">
 [
\frac{dL}{dx} = \frac{dL}{dy} \cdot \frac{dy}{dx}
]
&lt;/span>






&lt;pre class="mermaid">
flowchart LR
 x --&amp;gt; y
 y --&amp;gt; L
&lt;/pre>

&lt;p>Automatic differentiation computes exact derivatives efficiently using computational graphs.&lt;/p>
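&lt;p>A minimal sketch of the chain rule on the graph above, with made-up functions y = x**2 and L = sin(y), applied by hand:&lt;/p>
&lt;pre>&lt;code class="language-python">import math

x = 1.5
y = x**2              # forward pass through the graph x -> y -> L
L = math.sin(y)

dL_dy = math.cos(y)   # local derivative at the last node
dy_dx = 2 * x         # local derivative at the first node
dL_dx = dL_dy * dy_dx # chain rule: the backpropagated gradient

print(dL_dx)
&lt;/code>&lt;/pre>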
&lt;hr>
&lt;p>&lt;a href="https://arshadhs.github.io/">Home&lt;/a> | &lt;a href="https://arshadhs.github.io/docs/ai/maths/020-calculus/04-vector-calculus/">
 Vector Calculus
&lt;/a>&lt;/p></description></item><item><title>Higher-order derivatives</title><link>https://arshadhs.github.io/docs/ai/maths/020-calculus/04-vector-calculus/070-higher-order-derivatives/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://arshadhs.github.io/docs/ai/maths/020-calculus/04-vector-calculus/070-higher-order-derivatives/</guid><description>&lt;h1 id="higher-order-derivatives">
 Higher-order derivatives
 
 &lt;a class="anchor" href="#higher-order-derivatives">#&lt;/a>
 
&lt;/h1>
&lt;hr>
&lt;p>&lt;a href="https://arshadhs.github.io/">Home&lt;/a> | &lt;a href="https://arshadhs.github.io/docs/ai/maths/020-calculus/04-vector-calculus/">
 Vector Calculus
&lt;/a>&lt;/p></description></item><item><title>Angles and Orthogonality</title><link>https://arshadhs.github.io/docs/ai/maths/010-linear-algebra/02-vector-spaces/060-angles-and-orthogonality/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://arshadhs.github.io/docs/ai/maths/010-linear-algebra/02-vector-spaces/060-angles-and-orthogonality/</guid><description>&lt;h1 id="angles-and-orthogonality">
 Angles and Orthogonality
 
 &lt;a class="anchor" href="#angles-and-orthogonality">#&lt;/a>
 
&lt;/h1>
&lt;p>Once we define an inner product, we can define the &lt;strong>angle between two vectors&lt;/strong>.&lt;/p>
&lt;p>Angles allow us to measure how aligned or different two vectors are in space.&lt;/p>
&lt;blockquote class="book-hint info">
&lt;p>Key idea:
The angle measures how similar two vectors are.
Orthogonality (a zero inner product) means no similarity at all.&lt;/p>
&lt;/blockquote>
&lt;h2 id="why-it-matters-in-machine-learning">
 Why It Matters in Machine Learning
 
 &lt;a class="anchor" href="#why-it-matters-in-machine-learning">#&lt;/a>
 
&lt;/h2>
&lt;ul>
&lt;li>PCA produces orthogonal components&lt;/li>
&lt;li>Orthogonal features reduce redundancy&lt;/li>
&lt;li>Gradient directions depend on angle&lt;/li>
&lt;/ul>
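&lt;p>A minimal numpy sketch (numpy assumed; the vectors are made up): recover the angle from the inner product via cos(theta) = (a . b) / (|a| |b|), then check an orthogonal pair.&lt;/p>
&lt;pre>&lt;code class="language-python">import numpy as np

a = np.array([1.0, 0.0])
b = np.array([1.0, 1.0])

cos_theta = (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))
print(np.degrees(np.arccos(cos_theta)))   # 45.0 degrees

print(a @ np.array([0.0, 2.0]))   # 0.0: orthogonal, no similarity
&lt;/code>&lt;/pre>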
&lt;hr>
&lt;h1 id="angle-formula">
 Angle Formula
 
 &lt;a class="anchor" href="#angle-formula">#&lt;/a>
 
&lt;/h1>
&lt;p>For vectors in n-dimensional space:&lt;/p></description></item><item><title>Taylor’s series</title><link>https://arshadhs.github.io/docs/ai/maths/020-calculus/04-vector-calculus/080-taylors-series/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://arshadhs.github.io/docs/ai/maths/020-calculus/04-vector-calculus/080-taylors-series/</guid><description>&lt;h1 id="linearization-and-multivariate-taylors-series">
 Linearization and multivariate Taylor’s series
 
 &lt;a class="anchor" href="#linearization-and-multivariate-taylors-series">#&lt;/a>
 
&lt;/h1>
&lt;hr>
&lt;p>&lt;a href="https://arshadhs.github.io/">Home&lt;/a> | &lt;a href="https://arshadhs.github.io/docs/ai/maths/020-calculus/04-vector-calculus/">
 Vector Calculus
&lt;/a>&lt;/p></description></item><item><title>Maxima and Minima</title><link>https://arshadhs.github.io/docs/ai/maths/020-calculus/04-vector-calculus/090-maxima-and-minima/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://arshadhs.github.io/docs/ai/maths/020-calculus/04-vector-calculus/090-maxima-and-minima/</guid><description>&lt;h1 id="computing-maxima-and-minima-for-unconstrained-optimization">
 Computing maxima and minima for unconstrained optimization
 
 &lt;a class="anchor" href="#computing-maxima-and-minima-for-unconstrained-optimization">#&lt;/a>
 
&lt;/h1>
&lt;hr>
&lt;p>&lt;a href="https://arshadhs.github.io/">Home&lt;/a> | &lt;a href="https://arshadhs.github.io/docs/ai/maths/020-calculus/04-vector-calculus/">
 Vector Calculus
&lt;/a>&lt;/p></description></item><item><title>AI Foundation</title><link>https://arshadhs.github.io/docs/ai/foundation/</link><pubDate>Mon, 26 Jan 2026 10:55:52 +0100</pubDate><guid>https://arshadhs.github.io/docs/ai/foundation/</guid><description>&lt;h1 id="ai">
 AI
 
 &lt;a class="anchor" href="#ai">#&lt;/a>
 
&lt;/h1>
&lt;p>A selection of notes that didn&amp;rsquo;t fit elsewhere or are still being worked on.&lt;/p>
&lt;hr>




&lt;ul>
 &lt;li>&lt;a href="https://arshadhs.github.io/docs/ai/foundation/ai-stages/">AI Stages: ANI, AGI, ASI&lt;/a>&lt;/li>
 &lt;li>&lt;a href="https://arshadhs.github.io/docs/ai/foundation/ai-stack/">AI Stack&lt;/a>&lt;/li>
 &lt;li>&lt;a href="https://arshadhs.github.io/docs/ai/foundation/ai-pipeline/">AI Pipeline&lt;/a>&lt;/li>
 &lt;li>&lt;a href="https://arshadhs.github.io/docs/ai/foundation/ai-notes/">AI Learning Resources&lt;/a>&lt;/li>
&lt;/ul>


&lt;hr>
&lt;a href="https://arshadhs.github.io/">Home&lt;/a></description></item><item><title>AI Stages: ANI, AGI, ASI</title><link>https://arshadhs.github.io/docs/ai/foundation/ai-stages/</link><pubDate>Thu, 04 Jul 2024 10:55:52 +0100</pubDate><guid>https://arshadhs.github.io/docs/ai/foundation/ai-stages/</guid><description>&lt;h1 id="ai-development-stages-ani--agi--asi">
 AI Development Stages: ANI → AGI → ASI
 
 &lt;a class="anchor" href="#ai-development-stages-ani--agi--asi">#&lt;/a>
 
&lt;/h1>
&lt;p>Artificial Intelligence is often described in &lt;strong>three stages&lt;/strong>, based on capability and scope:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>ANI:&lt;/strong> Task-specific intelligence (today’s AI)&lt;/li>
&lt;li>&lt;strong>AGI:&lt;/strong> Human-level general intelligence (future goal)&lt;/li>
&lt;li>&lt;strong>ASI:&lt;/strong> Beyond human intelligence (theoretical)&lt;/li>
&lt;/ul>
&lt;hr>
&lt;p>&lt;img src="https://arshadhs.github.io/images/ai/ai_stages.png" alt="AI Stages" />&lt;/p>
&lt;hr>
&lt;h2 id="ani--artificial-narrow-intelligence">
 ANI — Artificial Narrow Intelligence
 
 &lt;a class="anchor" href="#ani--artificial-narrow-intelligence">#&lt;/a>
 
&lt;/h2>
&lt;ul>
&lt;li>Also called &lt;strong>Weak AI&lt;/strong>&lt;/li>
&lt;li>Designed to perform &lt;strong>one specific task&lt;/strong>&lt;/li>
&lt;li>Operates within a &lt;strong>predefined environment&lt;/strong>&lt;/li>
&lt;li>Cannot generalise beyond its training&lt;/li>
&lt;li>&lt;strong>Most AI systems today are ANI&lt;/strong>&lt;/li>
&lt;/ul>
&lt;p>&lt;strong>examples&lt;/strong>&lt;/p></description></item><item><title>Basic Statistics</title><link>https://arshadhs.github.io/docs/ai/statistics/01_basic_statistics/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://arshadhs.github.io/docs/ai/statistics/01_basic_statistics/</guid><description>&lt;h1 id="basic-statistics">
 Basic Statistics
 
 &lt;a class="anchor" href="#basic-statistics">#&lt;/a>
 
&lt;/h1>
&lt;p>&lt;strong>Statistics&lt;/strong>: describes data (what you &lt;em>see&lt;/em>).&lt;br>
&lt;strong>Probability&lt;/strong>: models uncertainty (what you &lt;em>don’t know&lt;/em> yet).&lt;/p>
&lt;ul>
&lt;li>Summarise a dataset using central tendency and variability&lt;/li>
&lt;li>Explain core probability ideas using simple examples&lt;/li>
&lt;li>Apply the axioms of probability&lt;/li>
&lt;li>Distinguish mutually exclusive vs independent events&lt;/li>
&lt;/ul>
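&lt;p>A minimal Python sketch of these summaries, using the standard-library &lt;code>statistics&lt;/code> module on made-up scores:&lt;/p>
&lt;pre>&lt;code class="language-python">import statistics as st

scores = [4, 8, 6, 5, 3, 8]   # made-up data

print(st.mean(scores), st.median(scores), st.mode(scores))
print(max(scores) - min(scores))   # range
print(st.variance(scores))         # sample variance (n-1 denominator)
print(st.stdev(scores))            # sample standard deviation
&lt;/code>&lt;/pre>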
&lt;hr>






&lt;pre class="mermaid">
flowchart TD
 A[Dataset] --&amp;gt; B[Central Tendency]
 A --&amp;gt; C[Variability]
 B --&amp;gt; B1[Mean]
 B --&amp;gt; B2[Median]
 B --&amp;gt; B3[Mode]
 C --&amp;gt; C1[Range]
 C --&amp;gt; C2[Variance]
 C --&amp;gt; C3[Standard Deviation]
 C --&amp;gt; C4[IQR]
&lt;/pre>

&lt;hr>
&lt;h2 id="measures-of-central-tendency">
 Measures of Central Tendency
 
 &lt;a class="anchor" href="#measures-of-central-tendency">#&lt;/a>
 
&lt;/h2>
&lt;p>Central tendency tells you where the “middle” of the data is: it summarises a set of scores with a &lt;strong>single number&lt;/strong> that describes the performance of the group.&lt;/p></description></item><item><title>Basic Probability</title><link>https://arshadhs.github.io/docs/ai/statistics/01_basic_probability/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://arshadhs.github.io/docs/ai/statistics/01_basic_probability/</guid><description>&lt;h1 id="basic-probability">
 Basic Probability
 
 &lt;a class="anchor" href="#basic-probability">#&lt;/a>
 
&lt;/h1>
&lt;p>Probability models uncertainty:
what you &lt;em>don’t know&lt;/em> yet, but want to reason about.&lt;/p>
&lt;blockquote class="book-hint info">
&lt;p>Key takeaway:
Probability is a number between &lt;strong>0 and 1&lt;/strong> that measures how likely an event is.
The whole topic is about defining &lt;strong>events&lt;/strong> clearly and applying a few core rules consistently.&lt;/p>
&lt;/blockquote>
&lt;p>Probability quantifies uncertainty: a number between 0 and 1.&lt;/p>
&lt;ul>
&lt;li>0 means: impossible&lt;/li>
&lt;li>1 means: certain&lt;/li>
&lt;/ul>
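&lt;p>A minimal simulation sketch (a made-up experiment): estimate the probability of rolling at least one six in four dice rolls, and note the estimate lands between 0 and 1.&lt;/p>
&lt;pre>&lt;code class="language-python">import random

random.seed(0)
trials = 100_000
hits = sum(
    1 for _ in range(trials)
    if any(random.randint(1, 6) == 6 for _ in range(4))
)
print(hits / trials)   # about 0.518, i.e. 1 - (5/6)**4
&lt;/code>&lt;/pre>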
&lt;hr>
&lt;h2 id="terminology">
 Terminology
 
 &lt;a class="anchor" href="#terminology">#&lt;/a>
 
&lt;/h2>
&lt;h3 id="random-experiment">
 Random experiment
 
 &lt;a class="anchor" href="#random-experiment">#&lt;/a>
 
&lt;/h3>
&lt;p>A random experiment is an action whose outcome is not known in advance.&lt;/p></description></item><item><title>Neural Networks</title><link>https://arshadhs.github.io/docs/ai/deep-learning/010-neural-network/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://arshadhs.github.io/docs/ai/deep-learning/010-neural-network/</guid><description>&lt;h1 id="neural-networks">
 Neural Networks
 
 &lt;a class="anchor" href="#neural-networks">#&lt;/a>
 
&lt;/h1>
&lt;ul>
&lt;li>A &lt;strong>network of artificial neurons&lt;/strong> inspired by how neurons function in the &lt;strong>human brain&lt;/strong>.&lt;/li>
&lt;li>At its core - a &lt;strong>mathematical model&lt;/strong> designed to process and learn from data.&lt;/li>
&lt;li>Neural networks form the &lt;strong>foundation of &lt;a href="https://arshadhs.github.io/docs/ai/deep-learning/">Deep Learning&lt;/a>&lt;/strong> (involves training large and complex networks on vast amounts of data).&lt;/li>
&lt;/ul>
&lt;hr>


&lt;script src="https://arshadhs.github.io/mermaid.min.js">&lt;/script>

 &lt;script>mermaid.initialize({
 "flowchart": {
 "useMaxWidth":true
 },
 "theme": "default"
}
)&lt;/script>




&lt;pre class="mermaid">
flowchart LR
 subgraph subGraph0[&amp;#34;Input Layer&amp;#34;]
 I1((&amp;#34;Input 1&amp;#34;))
 I2((&amp;#34;Input 2&amp;#34;))
 I3((&amp;#34;Input 3&amp;#34;))
 end
 subgraph subGraph1[&amp;#34;Hidden Layer&amp;#34;]
 H1((&amp;#34;Hidden 1&amp;#34;))
 H2((&amp;#34;Hidden 2&amp;#34;))
 H3((&amp;#34;Hidden 3&amp;#34;))
 end
 subgraph subGraph2[&amp;#34;Output Layer&amp;#34;]
 O((&amp;#34;Output&amp;#34;))
 end
 I1 --&amp;gt; H1 &amp;amp; H2 &amp;amp; H3
 I2 --&amp;gt; H1 &amp;amp; H2 &amp;amp; H3
 I3 --&amp;gt; H1 &amp;amp; H2 &amp;amp; H3
 H1 --&amp;gt; O
 H2 --&amp;gt; O
 H3 --&amp;gt; O

 style I1 fill:#C8E6C9
 style I2 fill:#C8E6C9
 style I3 fill:#C8E6C9
 style H1 stroke:#2962FF,fill:#BBDEFB
 style H2 fill:#BBDEFB
 style H3 fill:#BBDEFB
 style O fill:#FFCDD2
 style subGraph0 stroke:none,fill:transparent
 style subGraph1 stroke:none,fill:transparent
 style subGraph2 stroke:none,fill:transparent
&lt;/pre>

&lt;hr>
&lt;h3 id="structure-of-a-neural-network">
 Structure of a Neural Network
 
 &lt;a class="anchor" href="#structure-of-a-neural-network">#&lt;/a>
 
&lt;/h3>
&lt;p>A typical neural network has &lt;strong>three main layers&lt;/strong>:&lt;/p></description></item><item><title>Conditional Probability &amp; Bayes’ Theorem</title><link>https://arshadhs.github.io/docs/ai/statistics/conditional-probability/</link><pubDate>Thu, 12 Mar 2026 00:00:00 +0000</pubDate><guid>https://arshadhs.github.io/docs/ai/statistics/conditional-probability/</guid><description>&lt;h1 id="conditional-probability--bayes-theorem">
 Conditional Probability &amp;amp; Bayes’ Theorem
 
 &lt;a class="anchor" href="#conditional-probability--bayes-theorem">#&lt;/a>
 
&lt;/h1>
&lt;p>Probability often changes when we &lt;strong>learn new information&lt;/strong>.&lt;/p>
&lt;p>Conditional probability and Bayes’ theorem give a structured way to &lt;strong>update beliefs&lt;/strong> using evidence.&lt;/p>
&lt;blockquote class="book-hint info">
&lt;p>Conditional probability updates probabilities after observing an event.&lt;/p>
&lt;p>Bayes’ theorem lets you estimate a hidden cause from observed evidence.&lt;/p>
&lt;p>Naïve Bayes turns Bayes’ theorem into a practical classifier by assuming conditional independence of features given the class.&lt;/p>
&lt;/blockquote>
&lt;hr>


&lt;script src="https://arshadhs.github.io/mermaid.min.js">&lt;/script>

 &lt;script>mermaid.initialize({
 "flowchart": {
 "useMaxWidth":true
 },
 "theme": "default"
}
)&lt;/script>




&lt;pre class="mermaid">
flowchart TD

A[Conditional&amp;lt;br/&amp;gt;probability] --&amp;gt;|foundation| B[Bayes&amp;lt;br/&amp;gt;theorem]
D[Independent&amp;lt;br/&amp;gt;events] --&amp;gt;|implies| C[Independence]
C --&amp;gt;|simplifies| A

E[Prior] --&amp;gt;|with likelihood| B
F[Likelihood] --&amp;gt;|updates| H[Posterior]
G[Evidence] --&amp;gt;|normalises| B
B --&amp;gt;|yields| H

I[Naïve&amp;lt;br/&amp;gt;Bayes] --&amp;gt;|uses| B
J[Naïve&amp;lt;br/&amp;gt;assumption] --&amp;gt;|assumes| C
K[Features] --&amp;gt;|given class| J
L[Class] --&amp;gt;|conditions| J
I --&amp;gt;|predicts| M[Classification]
M --&amp;gt;|selects| L

style A fill:#90CAF9,stroke:#1E88E5,color:#000
style B fill:#90CAF9,stroke:#1E88E5,color:#000
style C fill:#90CAF9,stroke:#1E88E5,color:#000

style D fill:#CE93D8,stroke:#8E24AA,color:#000
style E fill:#CE93D8,stroke:#8E24AA,color:#000
style F fill:#CE93D8,stroke:#8E24AA,color:#000
style G fill:#CE93D8,stroke:#8E24AA,color:#000
style J fill:#CE93D8,stroke:#8E24AA,color:#000
style K fill:#CE93D8,stroke:#8E24AA,color:#000
style L fill:#CE93D8,stroke:#8E24AA,color:#000

style H fill:#C8E6C9,stroke:#2E7D32,color:#000
style I fill:#C8E6C9,stroke:#2E7D32,color:#000
style M fill:#C8E6C9,stroke:#2E7D32,color:#000

&lt;/pre>

&lt;hr>
&lt;h2 id="quick-summary">
 Quick summary
 
 &lt;a class="anchor" href="#quick-summary">#&lt;/a>
 
&lt;/h2>
&lt;ul>
&lt;li>Conditional probability:
updates probability after an event is known.&lt;/li>
&lt;li>Multiplication rule:
computes joint probability from conditional parts.&lt;/li>
&lt;li>Independence:
tested using 
&lt;link rel="stylesheet" href="https://arshadhs.github.io/katex/katex.min.css" />
&lt;script defer src="https://arshadhs.github.io/katex/katex.min.js">&lt;/script>

 &lt;script defer src="https://arshadhs.github.io/katex/auto-render.min.js" onload="renderMathInElement(document.body, {
 &amp;#34;delimiters&amp;#34;: [
 {&amp;#34;left&amp;#34;: &amp;#34;$$&amp;#34;, &amp;#34;right&amp;#34;: &amp;#34;$$&amp;#34;, &amp;#34;display&amp;#34;: true},
 {&amp;#34;left&amp;#34;: &amp;#34;$&amp;#34;, &amp;#34;right&amp;#34;: &amp;#34;$&amp;#34;, &amp;#34;display&amp;#34;: false},
 {&amp;#34;left&amp;#34;: &amp;#34;\\(&amp;#34;, &amp;#34;right&amp;#34;: &amp;#34;\\)&amp;#34;, &amp;#34;display&amp;#34;: false},
 {&amp;#34;left&amp;#34;: &amp;#34;\\[&amp;#34;, &amp;#34;right&amp;#34;: &amp;#34;\\]&amp;#34;, &amp;#34;display&amp;#34;: true}
 ]
});">&lt;/script>

&lt;span>
 \( P(A\cap B)=P(A)P(B) \)
 &lt;/span>

.&lt;/li>
&lt;li>Total probability:
breaks a probability into weighted cases.&lt;/li>
&lt;li>Bayes’ theorem:
reverses conditioning to infer causes from evidence.&lt;/li>
&lt;/ul>
&lt;hr>
&lt;h2 id="whats-next">
 What’s next
 
 &lt;a class="anchor" href="#whats-next">#&lt;/a>
 
&lt;/h2>
&lt;p>Probability Distributions&lt;br>
Move from events to random variables and distributions.&lt;/p></description></item><item><title>Machine Learning</title><link>https://arshadhs.github.io/docs/ai/machine-learning/</link><pubDate>Tue, 06 Aug 2024 23:29:52 +0100</pubDate><guid>https://arshadhs.github.io/docs/ai/machine-learning/</guid><description>&lt;h1 id="machine-learning">
 Machine Learning
 
 &lt;a class="anchor" href="#machine-learning">#&lt;/a>
 
&lt;/h1>


&lt;script src="https://arshadhs.github.io/mermaid.min.js">&lt;/script>

 &lt;script>mermaid.initialize({
 "flowchart": {
 "useMaxWidth":true
 },
 "theme": "default"
}
)&lt;/script>




&lt;pre class="mermaid">
stateDiagram-v2

 %% ===== CLASS DEFINITIONS (Math-based colours) =====
 classDef algebra fill:#cfe8ff,stroke:#1e3a8a,stroke-width:1px
 classDef probability fill:#d1fae5,stroke:#065f46,stroke-width:1px
 classDef geometry fill:#ffedd5,stroke:#9a3412,stroke-width:1px
 classDef logic fill:#ede9fe,stroke:#5b21b6,stroke-width:1px
 classDef category font-style:italic,font-weight:bold,fill:#aaaaaa,stroke:#374151,stroke-width:3px

 %% ===== ROOT =====
 ML: Machine Learning

 %% ===== SUPERVISED =====
 ML --&amp;gt; SL:::category
 SL: Supervised Learning

 SL --&amp;gt; Regression
 Regression --&amp;gt; LR:::algebra
 LR: Linear Regression

 LR --&amp;gt; NN:::algebra
 NN: Neural Network

 NN --&amp;gt; DT:::logic
 DT: Decision Tree

 SL --&amp;gt; Classification
 Classification --&amp;gt; NB:::probability
 NB: Naive Bayes

 NB --&amp;gt; KNN:::geometry
 KNN: k-Nearest Neighbours

 KNN --&amp;gt; SVM:::algebra
 SVM: Support Vector Machine
 
 %% ===== UNSUPERVISED =====
 ML --&amp;gt; USL:::category
 USL: Unsupervised Learning

 USL --&amp;gt; Clustering
 Clustering --&amp;gt; KM:::geometry
 KM: K-Means

 KM --&amp;gt; GMM:::probability
 GMM: Gaussian Mixture Model

 GMM --&amp;gt; HMM:::probability
 HMM: Hidden Markov Model

 %% ===== REINFORCEMENT =====
 ML --&amp;gt; RL:::category
 RL: Reinforcement Learning

 RL --&amp;gt; DM:::logic
 DM: Decision Making
&lt;/pre>

&lt;hr>
&lt;details >&lt;summary>Mathematical Legend&lt;/summary>
 &lt;div class="markdown-inner">
&lt;h3 id="algebra--linear-algebra-blue">
 Algebra / Linear Algebra (Blue)
 
 &lt;a class="anchor" href="#algebra--linear-algebra-blue">#&lt;/a>
 
&lt;/h3>
&lt;p>Used heavily when models rely on:&lt;/p></description></item><item><title>AI Stack</title><link>https://arshadhs.github.io/docs/ai/foundation/ai-stack/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://arshadhs.github.io/docs/ai/foundation/ai-stack/</guid><description>&lt;h1 id="ai-stack">
 AI Stack
 
 &lt;a class="anchor" href="#ai-stack">#&lt;/a>
 
&lt;/h1>
&lt;p>The &lt;strong>AI Stack&lt;/strong> describes the &lt;strong>layers required to build an end-to-end AI system&lt;/strong>, from infrastructure at the bottom to user-facing applications at the top.&lt;/p>
&lt;p>Different organisations represent the AI stack differently; this is a simplified conceptual view for learning.&lt;/p>
&lt;p>Each layer depends on the one below it.&lt;/p>
&lt;hr>


&lt;script src="https://arshadhs.github.io/mermaid.min.js">&lt;/script>

 &lt;script>mermaid.initialize({
 "flowchart": {
 "useMaxWidth":true
 },
 "theme": "default"
}
)&lt;/script>




&lt;pre class="mermaid">
graph TB

 subgraph APP[&amp;#34;Applications&amp;#34;]
 A[User Interfaces &amp;amp; Integrations]
 end

 subgraph ORCH[&amp;#34;Orchestration&amp;#34;]
 O[Workflows • Agents • Control Logic]
 end

 subgraph DATA[&amp;#34;Data&amp;#34;]
 D[Data Sources • Pipelines • Vector DBs]
 end

 subgraph MODEL[&amp;#34;Models&amp;#34;]
 M[ML • DL • Foundation Models • LLMs]
 end

 subgraph INFRA[&amp;#34;Infrastructure&amp;#34;]
 I[Cloud • On-prem • GPUs • Storage]
 end

 %% Styling
 style APP fill:#FFCCBC
 style ORCH fill:#90CAF9
 style DATA fill:#BBDEFB
 style MODEL fill:#C8E6C9
 style INFRA fill:#E1F5FE

 style A fill:#FFE0B2
 style O fill:#B3E5FC
 style D fill:#E3F2FD
 style M fill:#DCEDC8
 style I fill:#E1F5FE
&lt;/pre>

&lt;hr>
&lt;h2 id="1-infrastructure">
 1. Infrastructure
 
 &lt;a class="anchor" href="#1-infrastructure">#&lt;/a>
 
&lt;/h2>
&lt;p>The foundation that provides &lt;strong>compute and storage&lt;/strong>.&lt;/p></description></item><item><title>Artificial Neuron and Perceptron</title><link>https://arshadhs.github.io/docs/ai/deep-learning/020-perceptron/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://arshadhs.github.io/docs/ai/deep-learning/020-perceptron/</guid><description>&lt;h1 id="artificial-neuron-and-perceptron">
 Artificial Neuron and Perceptron
 
 &lt;a class="anchor" href="#artificial-neuron-and-perceptron">#&lt;/a>
 
&lt;/h1>
&lt;blockquote class="book-hint info">
&lt;p>Knowledge in neural networks is stored in &lt;strong>connection weights&lt;/strong>, and learning means &lt;strong>modifying those weights&lt;/strong>.&lt;/p>
&lt;/blockquote>
&lt;hr>
&lt;h2 id="biological-neuron">
 Biological Neuron
 
 &lt;a class="anchor" href="#biological-neuron">#&lt;/a>
 
&lt;/h2>
&lt;p>A biological neuron is a specialised cell that processes and transmits information through electrical and chemical signals.&lt;/p>
&lt;p>Core components:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Dendrites&lt;/strong>: receive signals from other neurons&lt;/li>
&lt;li>&lt;strong>Cell body (soma)&lt;/strong>: processes incoming signals&lt;/li>
&lt;li>&lt;strong>Axon&lt;/strong>: transmits the output signal&lt;/li>
&lt;li>&lt;strong>Synapses&lt;/strong>: connection points between neurons&lt;/li>
&lt;/ul>
&lt;p>Biological intuition:&lt;/p>
&lt;ul>
&lt;li>many inputs arrive to one neuron&lt;/li>
&lt;li>one neuron can connect out to many neurons&lt;/li>
&lt;li>massive parallelism enables fast perception and recognition&lt;/li>
&lt;/ul>
&lt;hr>
&lt;h2 id="artificial-neuron">
 Artificial Neuron
 
 &lt;a class="anchor" href="#artificial-neuron">#&lt;/a>
 
&lt;/h2>
&lt;p>An artificial neuron is a simplified computational model inspired by biological neurons.&lt;/p></description></item><item><title>ML Workflow</title><link>https://arshadhs.github.io/docs/ai/machine-learning/02-ml-workflow/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://arshadhs.github.io/docs/ai/machine-learning/02-ml-workflow/</guid><description>&lt;h1 id="machine-learning-workflow">
 Machine learning Workflow
 
 &lt;a class="anchor" href="#machine-learning-workflow">#&lt;/a>
 
&lt;/h1>
&lt;p>Data is the foundation of any machine learning system.
Quality of data matters more than model complexity.&lt;/p>
&lt;h3 id="role-of-data">
 Role of Data
 
 &lt;a class="anchor" href="#role-of-data">#&lt;/a>
 
&lt;/h3>
&lt;p>Data determines:&lt;/p>
&lt;ul>
&lt;li>What patterns the model can learn&lt;/li>
&lt;li>How well it generalises&lt;/li>
&lt;li>Whether bias or noise is introduced&lt;/li>
&lt;/ul>
&lt;p>Bad data → bad model (even with perfect algorithms).&lt;/p>
&lt;hr>
&lt;h3 id="data-preprocessing-wrangling">
 Data Preprocessing, wrangling
 
 &lt;a class="anchor" href="#data-preprocessing-wrangling">#&lt;/a>
 
&lt;/h3>
&lt;p>Raw data is never ready for training.&lt;/p>
&lt;p>&lt;strong>Data Issues&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>Noise
&lt;ul>
&lt;li>For &lt;strong>objects&lt;/strong>, noise is an &lt;strong>extraneous object&lt;/strong>&lt;/li>
&lt;li>For &lt;strong>attributes&lt;/strong>, noise refers to &lt;strong>modification of original values&lt;/strong>&lt;/li>
&lt;li>Handle: apply a &lt;strong>log or Z-score transform&lt;/strong> to rescale the affected values&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Outliers
&lt;ul>
&lt;li>Data objects with characteristics that are considerably different from most of the other data objects in the data set&lt;/li>
&lt;li>Handle: Use &lt;strong>IQR&lt;/strong> method&lt;/li>
&lt;li>Find Lower and Upper Bound and &lt;strong>replace Outlier with Lower or Upper Bound&lt;/strong>&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Missing Values
&lt;ul>
&lt;li>Eliminate data objects or variables&lt;/li>
&lt;li>Handle: Estimate missing values
&lt;ul>
&lt;li>&lt;strong>Mean, Median or Mode&lt;/strong>&lt;/li>
&lt;li>Prefer the &lt;strong>Median&lt;/strong> if there are &lt;strong>outliers&lt;/strong>&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Ignore the missing value during analysis&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Duplicate Data
&lt;ul>
&lt;li>Major issue when merging data from heterogeneous sources&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Inconsistent Codes
&lt;ul>
&lt;li>Find all unique codes and map the inconsistent ones to a single consistent value&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;p>&lt;strong>Data Preprocessing techniques&lt;/strong>&lt;/p></description></item><item><title>Conditional Probability</title><link>https://arshadhs.github.io/docs/ai/statistics/conditional-probability/021_conditional_prob/</link><pubDate>Thu, 12 Mar 2026 00:00:00 +0000</pubDate><guid>https://arshadhs.github.io/docs/ai/statistics/conditional-probability/021_conditional_prob/</guid><description>&lt;h1 id="conditional-probability">
 Conditional Probability
 
 &lt;a class="anchor" href="#conditional-probability">#&lt;/a>
 
&lt;/h1>
&lt;p>Conditional probability updates the probability of an event when new information is available.&lt;/p>
&lt;p>It shows up whenever a question says:&lt;/p>
&lt;ul>
&lt;li>“given that…”&lt;/li>
&lt;li>“among those who…”&lt;/li>
&lt;li>“out of the items that…”&lt;/li>
&lt;li>“if it does not fail immediately…”&lt;/li>
&lt;/ul>
&lt;blockquote class="book-hint info">
&lt;p>Key takeaway:
Conditional probability is always:&lt;/p>
&lt;p>joint probability ÷ probability of the condition.&lt;/p>
&lt;p>The condition must not be an impossible event.&lt;/p>
&lt;/blockquote>
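&lt;p>A quick worked example (illustrative numbers, not from the original notes): if 
&lt;span>
 \( P(A\cap B)=0.2 \)
 &lt;/span>

 and 
&lt;span>
 \( P(B)=0.5 \)
 &lt;/span>

, then 
&lt;span>
 \( P(A\mid B)=\dfrac{0.2}{0.5}=0.4 \)
 &lt;/span>

.&lt;/p>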
&lt;hr>
&lt;h2 id="prior-vs-posterior">
 Prior vs posterior
 
 &lt;a class="anchor" href="#prior-vs-posterior">#&lt;/a>
 
&lt;/h2>
&lt;ul>
&lt;li>
&lt;p>Prior probability:
probability with no condition (before new information)&lt;/p></description></item><item><title>Bayes’ Theorem</title><link>https://arshadhs.github.io/docs/ai/statistics/conditional-probability/022_bayes_theorem/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://arshadhs.github.io/docs/ai/statistics/conditional-probability/022_bayes_theorem/</guid><description>&lt;h1 id="bayes-theorem">
 Bayes’ Theorem
 
 &lt;a class="anchor" href="#bayes-theorem">#&lt;/a>
 
&lt;/h1>
&lt;h3 id="21-total-probability-needed-for-bayes">
 2.1 Total probability (needed for Bayes)
 
 &lt;a class="anchor" href="#21-total-probability-needed-for-bayes">#&lt;/a>
 
&lt;/h3>
&lt;p>Often we split the world into cases 
&lt;span>
 \( E_1,E_2,\dots,E_k \)
 &lt;/span>

 that:&lt;/p>
&lt;ul>
&lt;li>are mutually exclusive&lt;/li>
&lt;li>cover the whole sample space&lt;/li>
&lt;/ul>
&lt;p>Then for any event 
&lt;span>
 \( A \)
 &lt;/span>

:&lt;/p>
&lt;span style="color: red;">
 &lt;span>
 \[ 
P(A)=\sum_{i=1}^{k} P(A\mid E_i)\,P(E_i)
 \]
 &lt;/span>
&lt;/span>
&lt;p>Tree intuition:&lt;/p>


&lt;pre class="mermaid">
flowchart TD
 S[Start] --&amp;gt; E1[Case E1]
 S --&amp;gt; E2[Case E2]
 S --&amp;gt; E3[Case E3]
 E1 --&amp;gt; A1[&amp;#34;A happens&amp;#34;]
 E2 --&amp;gt; A2[&amp;#34;A happens&amp;#34;]
 E3 --&amp;gt; A3[&amp;#34;A happens&amp;#34;]
&lt;/pre>
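&lt;p>A small worked example (illustrative numbers): with three equally likely cases, 
&lt;span>
 \( P(E_i)=\tfrac{1}{3} \)
 &lt;/span>

, and 
&lt;span>
 \( P(A\mid E_1)=0.9,\ P(A\mid E_2)=0.5,\ P(A\mid E_3)=0.1 \)
 &lt;/span>

, the formula gives 
&lt;span>
 \( P(A)=\tfrac{1}{3}(0.9+0.5+0.1)=0.5 \)
 &lt;/span>

.&lt;/p>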

&lt;hr>
&lt;h3 id="22-bayes-theorem-two-event-form">
 2.2 Bayes’ theorem (two-event form)
 
 &lt;a class="anchor" href="#22-bayes-theorem-two-event-form">#&lt;/a>
 
&lt;/h3>
&lt;p>Bayes&amp;rsquo; Theorem is a mathematical formula used to determine the &lt;strong>conditional probability of an event based on prior knowledge and new evidence&lt;/strong>.&lt;/p></description></item><item><title>Naïve Bayes</title><link>https://arshadhs.github.io/docs/ai/statistics/conditional-probability/023_naive_bayes/</link><pubDate>Thu, 12 Mar 2026 00:00:00 +0000</pubDate><guid>https://arshadhs.github.io/docs/ai/statistics/conditional-probability/023_naive_bayes/</guid><description>&lt;h1 id="naïve-bayes">
 Naïve Bayes
 
 &lt;a class="anchor" href="#na%c3%afve-bayes">#&lt;/a>
 
&lt;/h1>
&lt;p>Naïve Bayes is a &lt;strong>probabilistic classifier&lt;/strong>.&lt;/p>
&lt;ul>
&lt;li>A supervised learning problem&lt;/li>
&lt;li>Binary classification: the target variable takes one of two classes&lt;/li>
&lt;li>The hypothesis is the target class you want to assign&lt;/li>
&lt;li>The total (prior) probability of Yes and No is computed first from the training data&lt;/li>
&lt;li>The posterior probability is obtained once you start studying the observed data&lt;/li>
&lt;li>The given instance is classified into the class whose hypothesis has the maximum posterior probability&lt;/li>
&lt;/ul>
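&lt;p>A minimal sketch of this counting-based workflow, on hypothetical toy data (the feature values and labels below are invented for illustration):&lt;/p>
&lt;pre>&lt;code class="language-python">from collections import Counter

# Toy training data: (outlook, wind) features with a yes/no label.
data = [('sunny', 'strong', 'no'), ('sunny', 'weak', 'no'),
        ('rainy', 'weak', 'yes'), ('overcast', 'weak', 'yes'),
        ('rainy', 'strong', 'no'), ('overcast', 'strong', 'yes')]

labels = [row[-1] for row in data]
priors = {c: n / len(data) for c, n in Counter(labels).items()}

def likelihood(feature_index, value, label):
    # Fraction of rows of this class that show the given feature value.
    rows = [row for row in data if row[-1] == label]
    matches = sum(1 for row in rows if row[feature_index] == value)
    return matches / len(rows)

def posterior_scores(x):
    # Naive assumption: multiply per-feature likelihoods given the class.
    scores = {}
    for c in priors:
        score = priors[c]
        for i, value in enumerate(x):
            score = score * likelihood(i, value, c)
        scores[c] = score
    return scores

print(posterior_scores(('sunny', 'weak')))   # pick the class with the max score
&lt;/code>&lt;/pre>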
&lt;p>It predicts a class label by computing:&lt;/p></description></item><item><title>Probability Distributions</title><link>https://arshadhs.github.io/docs/ai/statistics/probability_distributions/</link><pubDate>Sun, 22 Feb 2026 00:00:00 +0000</pubDate><guid>https://arshadhs.github.io/docs/ai/statistics/probability_distributions/</guid><description>&lt;h1 id="probability-distributions">
 Probability Distributions
 
 &lt;a class="anchor" href="#probability-distributions">#&lt;/a>
 
&lt;/h1>
&lt;p>Probability distributions are the bridge between:
real-world randomness and mathematical modelling.&lt;/p>
&lt;p>A random experiment produces outcomes.
A random variable turns those outcomes into numbers.
A probability distribution tells you how likely each number (or range of numbers) is.&lt;/p>
&lt;blockquote class="book-hint info">
&lt;p>Key takeaway:
A distribution is a complete “story” about uncertainty:
what values are possible, how likely they are, and how we summarise them (mean, variance).&lt;/p>
&lt;/blockquote>
&lt;hr>


&lt;script src="https://arshadhs.github.io/mermaid.min.js">&lt;/script>

 &lt;script>mermaid.initialize({
 "flowchart": {
 "useMaxWidth":true
 },
 "theme": "default"
}
)&lt;/script>




&lt;pre class="mermaid">
flowchart TD
	PD[&amp;#34;Probability&amp;lt;br/&amp;gt;distributions&amp;#34;] --&amp;gt; RV[&amp;#34;Random&amp;lt;br/&amp;gt;variables&amp;#34;]
	PD[&amp;#34;Probability&amp;lt;br/&amp;gt;distributions&amp;#34;] --&amp;gt; DS[&amp;#34;Common&amp;lt;br/&amp;gt;distributions&amp;#34;]

	style PD fill:#90CAF9,stroke:#1E88E5,color:#000
	style RV fill:#90CAF9,stroke:#1E88E5,color:#000
	style DS fill:#90CAF9,stroke:#1E88E5,color:#000
&lt;/pre>

&lt;hr>
&lt;h2 id="aiml-connection">
 AI/ML Connection
 
 &lt;a class="anchor" href="#aiml-connection">#&lt;/a>
 
&lt;/h2>
&lt;ul>
&lt;li>Many ML models are probabilistic:
they assume data (or errors) follow a distribution.&lt;/li>
&lt;li>Loss functions often come from distribution assumptions:
squared loss aligns with Gaussian noise.&lt;/li>
&lt;li>Naïve Bayes (from the previous module) becomes practical once you can model:

&lt;link rel="stylesheet" href="https://arshadhs.github.io/katex/katex.min.css" />
&lt;script defer src="https://arshadhs.github.io/katex/katex.min.js">&lt;/script>

 &lt;script defer src="https://arshadhs.github.io/katex/auto-render.min.js" onload="renderMathInElement(document.body, {
 &amp;#34;delimiters&amp;#34;: [
 {&amp;#34;left&amp;#34;: &amp;#34;$$&amp;#34;, &amp;#34;right&amp;#34;: &amp;#34;$$&amp;#34;, &amp;#34;display&amp;#34;: true},
 {&amp;#34;left&amp;#34;: &amp;#34;$&amp;#34;, &amp;#34;right&amp;#34;: &amp;#34;$&amp;#34;, &amp;#34;display&amp;#34;: false},
 {&amp;#34;left&amp;#34;: &amp;#34;\\(&amp;#34;, &amp;#34;right&amp;#34;: &amp;#34;\\)&amp;#34;, &amp;#34;display&amp;#34;: false},
 {&amp;#34;left&amp;#34;: &amp;#34;\\[&amp;#34;, &amp;#34;right&amp;#34;: &amp;#34;\\]&amp;#34;, &amp;#34;display&amp;#34;: true}
 ]
});">&lt;/script>

&lt;span>
 \( P(X\mid Y) \)
 &lt;/span>

 using suitable distributions.&lt;/li>
&lt;/ul>
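&lt;p>One line makes the squared-loss example concrete: under Gaussian noise, the negative log-likelihood of an observation is, up to constants,&lt;/p>
&lt;span>
 \[ 
-\log p(y\mid x)=\frac{(y-\hat{y})^2}{2\sigma^2}+\text{const}
 \]
 &lt;/span>
&lt;p>so maximising the likelihood is the same as minimising squared error.&lt;/p>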
&lt;blockquote class="book-hint warning">
&lt;p>In practice:
choosing a distribution is a modelling decision.
It affects:
prediction, uncertainty estimates, and what “rare” or “typical” means in your data.&lt;/p></description></item><item><title>LNN for Regression</title><link>https://arshadhs.github.io/docs/ai/deep-learning/030-linear-neural-networks-for-regression/</link><pubDate>Sun, 15 Feb 2026 00:00:00 +0000</pubDate><guid>https://arshadhs.github.io/docs/ai/deep-learning/030-linear-neural-networks-for-regression/</guid><description>&lt;h1 id="linear-neural-networks-for-regression">
 Linear Neural Networks for Regression
 
 &lt;a class="anchor" href="#linear-neural-networks-for-regression">#&lt;/a>
 
&lt;/h1>
&lt;p>A &lt;strong>linear neural network for regression&lt;/strong> is a model that predicts a &lt;strong>continuous&lt;/strong> target by taking a weighted sum of input features and applying the &lt;strong>identity activation&lt;/strong> (so the output can be any real number).&lt;/p>
&lt;ul>
&lt;li>Single neuron for regression (predicting &lt;em>how much&lt;/em> / &lt;em>how many&lt;/em>)&lt;/li>
&lt;li>Data + linear model (single neuron, no hidden layers) + squared loss&lt;/li>
&lt;li>Training using the &lt;strong>batch gradient descent&lt;/strong> algorithm (see the sketch below)&lt;/li>
&lt;li>Prediction (inference)&lt;/li>
&lt;li>E.g. Auto MPG (UCI)-style prediction with a single neuron (from-scratch code)&lt;/li>
&lt;/ul>
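&lt;p>A minimal from-scratch sketch of this loop, on invented toy data rather than the Auto MPG set:&lt;/p>
&lt;pre>&lt;code class="language-python">import numpy as np

# Toy data: one feature, continuous target (roughly y = 2x).
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 4.3, 6.2, 8.1, 9.9])

w, b = 0.0, 0.0
lr = 0.01

for epoch in range(1000):
    y_hat = w * X + b                 # identity activation
    error = y_hat - y
    grad_w = 2 * np.mean(error * X)   # d(MSE)/dw over the full batch
    grad_b = 2 * np.mean(error)       # d(MSE)/db
    w -= lr * grad_w
    b -= lr * grad_b

print(w, b)          # parameters after training
print(w * 6.0 + b)   # inference for a new input x = 6.0
&lt;/code>&lt;/pre>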
&lt;hr>


&lt;script src="https://arshadhs.github.io/mermaid.min.js">&lt;/script>

 &lt;script>mermaid.initialize({
 "flowchart": {
 "useMaxWidth":true
 },
 "theme": "default"
}
)&lt;/script>




&lt;pre class="mermaid">
flowchart LR
 D[&amp;#34;Data&amp;lt;br/&amp;gt;X, y&amp;#34;] --&amp;gt; M[&amp;#34;Linear model&amp;lt;br/&amp;gt;w, b&amp;lt;br/&amp;gt;Single neuron&amp;#34;]
 M --&amp;gt; A[&amp;#34;Activation&amp;lt;br/&amp;gt;Identity&amp;#34;]
 A --&amp;gt; L[&amp;#34;Loss&amp;lt;br/&amp;gt;MSE (Squared error)&amp;#34;]
 L --&amp;gt; O[&amp;#34;Optimiser&amp;lt;br/&amp;gt;Batch GD / Mini-batch GD&amp;#34;]
 O --&amp;gt; P[&amp;#34;Parameters&amp;lt;br/&amp;gt;w, b&amp;#34;]
 P --&amp;gt; I[&amp;#34;Inference&amp;lt;br/&amp;gt;Predict ŷ (number) for new x&amp;#34;]

 %% Pastel colour scheme
 style D fill:#E3F2FD,stroke:#1E88E5,stroke-width:1px
 style M fill:#E8F5E9,stroke:#43A047,stroke-width:1px
 style A fill:#FFF3E0,stroke:#FB8C00,stroke-width:1px
 style L fill:#FCE4EC,stroke:#D81B60,stroke-width:1px
 style O fill:#F3E5F5,stroke:#8E24AA,stroke-width:1px
 style P fill:#E0F7FA,stroke:#00838F,stroke-width:1px
 style I fill:#F1F8E9,stroke:#558B2F,stroke-width:1px
&lt;/pre>

&lt;hr>
&lt;h2 id="regression">
 Regression
 
 &lt;a class="anchor" href="#regression">#&lt;/a>
 
&lt;/h2>
&lt;p>Regression is a supervised learning task that predicts a continuous-valued output based on input features.&lt;/p></description></item><item><title>Generative AI</title><link>https://arshadhs.github.io/docs/ai/genai/</link><pubDate>Mon, 15 Dec 2025 10:55:52 +0100</pubDate><guid>https://arshadhs.github.io/docs/ai/genai/</guid><description>&lt;h1 id="generative-ai">
 Generative AI
 
 &lt;a class="anchor" href="#generative-ai">#&lt;/a>
 
&lt;/h1>
&lt;p>&lt;strong>Generative Artificial Intelligence (GenAI)&lt;/strong> refers to a class of AI systems that can &lt;strong>generate new content&lt;/strong> such as text, images, audio, video, or code, rather than only making predictions or classifications.&lt;/p>
&lt;p>GenAI systems learn &lt;strong>patterns and representations from large datasets&lt;/strong> and use them to produce &lt;strong>novel outputs&lt;/strong> that resemble the data they were trained on.&lt;/p>
&lt;hr>
&lt;h2 id="how-generative-ai-differs-from-traditional-ai">
 How Generative AI Differs from Traditional AI
 
 &lt;a class="anchor" href="#how-generative-ai-differs-from-traditional-ai">#&lt;/a>
 
&lt;/h2>
&lt;table>
 &lt;thead>
 &lt;tr>
 &lt;th>Traditional AI&lt;/th>
 &lt;th>Generative AI&lt;/th>
 &lt;/tr>
 &lt;/thead>
 &lt;tbody>
 &lt;tr>
 &lt;td>Predicts or classifies&lt;/td>
 &lt;td>Generates new content&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Task-specific models&lt;/td>
 &lt;td>General-purpose models&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Fixed outputs&lt;/td>
 &lt;td>Open-ended outputs&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Often rule-based&lt;/td>
 &lt;td>Data-driven and probabilistic&lt;/td>
 &lt;/tr>
 &lt;/tbody>
&lt;/table>
&lt;hr>
&lt;h2 id="core-idea-of-generative-ai">
 Core Idea of Generative AI
 
 &lt;a class="anchor" href="#core-idea-of-generative-ai">#&lt;/a>
 
&lt;/h2>

&lt;blockquote class='book-hint '>
 &lt;p>&lt;strong>Instead of learning “what label to assign”, Generative AI learns “how data is structured” and then creates new data following that structure.&lt;/strong>&lt;/p></description></item><item><title>AI Pipeline</title><link>https://arshadhs.github.io/docs/ai/foundation/ai-pipeline/</link><pubDate>Thu, 04 Jul 2024 10:55:52 +0100</pubDate><guid>https://arshadhs.github.io/docs/ai/foundation/ai-pipeline/</guid><description>&lt;h1 id="ai-pipeline">
 AI Pipeline
 
 &lt;a class="anchor" href="#ai-pipeline">#&lt;/a>
 
&lt;/h1>
&lt;p>The AI pipeline is a continuous process where data is collected, prepared, used to train models, evaluated for performance, and continuously improved after deployment.&lt;/p>
&lt;div class="book-steps ">
&lt;ol>
&lt;li>
&lt;h2 id="collect-data">
 Collect Data
 
 &lt;a class="anchor" href="#collect-data">#&lt;/a>
 
&lt;/h2>
&lt;/li>
&lt;li>
&lt;h2 id="prepare-data">
 Prepare data
 
 &lt;a class="anchor" href="#prepare-data">#&lt;/a>
 
&lt;/h2>
&lt;/li>
&lt;li>
&lt;h2 id="train-model">
 Train Model
 
 &lt;a class="anchor" href="#train-model">#&lt;/a>
 
&lt;/h2>
&lt;ul>
&lt;li>Iterate until model is good enough&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;h2 id="deploy-model">
 Deploy Model
 
 &lt;a class="anchor" href="#deploy-model">#&lt;/a>
 
&lt;/h2>
&lt;ul>
&lt;li>Get data back&lt;/li>
&lt;li>Maintain &amp;amp; update model&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ol>
&lt;/div>


&lt;script src="https://arshadhs.github.io/mermaid.min.js">&lt;/script>

 &lt;script>mermaid.initialize({
 "flowchart": {
 "useMaxWidth":true
 },
 "theme": "default"
}
)&lt;/script>




&lt;pre class="mermaid">
timeline
 title AI Pipeline
 Collect Data : Data Ingestion
 : Data Understanding
 Prepare Data : Cleaning
 : Feature Engineering
 : Sampling
 Train Model : Model Training
 : Validation &amp;amp; Metrics
 Deploy Model : Deployment
 : Monitoring &amp;amp; Retraining
&lt;/pre>

&lt;hr>
&lt;p>&lt;a href="https://arshadhs.github.io/">Home&lt;/a> | &lt;a href="https://arshadhs.github.io/docs/ai/foundation/">
 AI Foundation
&lt;/a>&lt;/p></description></item><item><title>Regression(Linear Models)</title><link>https://arshadhs.github.io/docs/ai/machine-learning/03-linear-models-regression/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://arshadhs.github.io/docs/ai/machine-learning/03-linear-models-regression/</guid><description>&lt;h1 id="linear-regression">
 Linear Regression
 
 &lt;a class="anchor" href="#linear-regression">#&lt;/a>
 
&lt;/h1>
&lt;p>Linear Regression is a supervised 
&lt;span style="color: blue;">
 ML
&lt;/span> method used to predict a &lt;strong>numerical&lt;/strong> target by fitting a model that is &lt;strong>linear in its parameters&lt;/strong>.&lt;/p>
&lt;p>In 
&lt;span style="color: blue;">
 ML
&lt;/span>, linear models are a core baseline:
they’re fast, often surprisingly strong, and usually easy to interpret.&lt;/p>
&lt;blockquote class="book-hint info">
&lt;p>Key takeaway:
Linear Regression learns parameters by minimising a squared-error cost.
You can solve it directly (closed form) or iteratively (gradient descent),
and you can extend it using basis functions and regularisation.&lt;/p></description></item><item><title>Random Variables</title><link>https://arshadhs.github.io/docs/ai/statistics/probability_distributions/random-variables/</link><pubDate>Sun, 22 Feb 2026 00:00:00 +0000</pubDate><guid>https://arshadhs.github.io/docs/ai/statistics/probability_distributions/random-variables/</guid><description>&lt;h1 id="random-variables">
 Random Variables
 
 &lt;a class="anchor" href="#random-variables">#&lt;/a>
 
&lt;/h1>
&lt;p>A random variable is a way to attach numbers to outcomes of a random experiment.&lt;/p>
&lt;p>It lets us move from:
“what happened?”
to:
“what number should we analyse?”&lt;/p>
&lt;blockquote class="book-hint info">
&lt;p>Key takeaway:
A random variable is a &lt;em>function&lt;/em> from the sample space to real numbers.
Once you define the random variable clearly, the rest (pmf/pdf/cdf, mean, variance) becomes systematic.&lt;/p>
&lt;/blockquote>
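&lt;p>A tiny sketch (added for illustration) makes the “function from outcomes to numbers” idea concrete for two coin tosses:&lt;/p>
&lt;pre>&lt;code class="language-python">from itertools import product

# Sample space for two fair coin tosses; the random variable X counts heads.
outcomes = list(product('HT', repeat=2))        # HH, HT, TH, TT
X = {o: o.count('H') for o in outcomes}

# pmf of X: each outcome has probability 1/4.
pmf = {}
for o, value in X.items():
    pmf[value] = pmf.get(value, 0) + 1 / len(outcomes)

print(pmf)   # {2: 0.25, 1: 0.5, 0: 0.25}
&lt;/code>&lt;/pre>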
&lt;hr>


&lt;script src="https://arshadhs.github.io/mermaid.min.js">&lt;/script>

 &lt;script>mermaid.initialize({
 "flowchart": {
 "useMaxWidth":true
 },
 "theme": "default"
}
)&lt;/script>




&lt;pre class="mermaid">
flowchart TD
PD[&amp;#34;Probability&amp;lt;br/&amp;gt;distributions&amp;#34;] --&amp;gt; RV[&amp;#34;Random&amp;lt;br/&amp;gt;variables&amp;#34;]

RV --&amp;gt; T[&amp;#34;Types&amp;#34;]
T --&amp;gt; RV1[&amp;#34;Discrete&amp;lt;br/&amp;gt;RVs&amp;#34;]
T --&amp;gt; RV2[&amp;#34;Continuous&amp;lt;br/&amp;gt;RVs&amp;#34;]

RV --&amp;gt; F[&amp;#34;PMF / PDF / CDF&amp;#34;]
RV --&amp;gt; S[&amp;#34;Mean / Variance&amp;lt;br/&amp;gt;Covariance&amp;#34;]
RV --&amp;gt; J[&amp;#34;Joint &amp;amp; Marginal&amp;lt;br/&amp;gt;distributions&amp;#34;]
RV --&amp;gt; X[&amp;#34;Transformations&amp;#34;]

style PD fill:#90CAF9,stroke:#1E88E5,color:#000
style RV fill:#90CAF9,stroke:#1E88E5,color:#000

style T fill:#CE93D8,stroke:#8E24AA,color:#000
style F fill:#CE93D8,stroke:#8E24AA,color:#000
style S fill:#CE93D8,stroke:#8E24AA,color:#000
style J fill:#CE93D8,stroke:#8E24AA,color:#000
style X fill:#CE93D8,stroke:#8E24AA,color:#000
style RV1 fill:#CE93D8,stroke:#8E24AA,color:#000
style RV2 fill:#CE93D8,stroke:#8E24AA,color:#000
&lt;/pre>

&lt;hr>
&lt;h2 id="1-definition">
 1) Definition
 
 &lt;a class="anchor" href="#1-definition">#&lt;/a>
 
&lt;/h2>
&lt;p>Random variable:
a rule that assigns a number to each outcome.&lt;/p></description></item><item><title>Common Probability Distributions</title><link>https://arshadhs.github.io/docs/ai/statistics/probability_distributions/common-distributions/</link><pubDate>Sun, 22 Feb 2026 00:00:00 +0000</pubDate><guid>https://arshadhs.github.io/docs/ai/statistics/probability_distributions/common-distributions/</guid><description>&lt;h1 id="common-probability-distributions">
 Common Probability Distributions
 
 &lt;a class="anchor" href="#common-probability-distributions">#&lt;/a>
 
&lt;/h1>
&lt;p>Once you can describe a random variable using a pmf or pdf, the next step is to use
named distributions that appear repeatedly in real data and in ML models.&lt;/p>
&lt;blockquote class="book-hint info">
&lt;p>Key takeaway:
Named distributions give you ready-made probability models for common patterns:
binary outcomes, counts, and measurement noise.&lt;/p>
&lt;/blockquote>
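&lt;p>A minimal sketch, assuming scipy.stats as tooling (not part of the original notes), shows these named distributions as ready-made objects:&lt;/p>
&lt;pre>&lt;code class="language-python">from scipy.stats import bernoulli, binom, poisson, norm

print(bernoulli.pmf(1, p=0.3))            # one trial, success probability 0.3
print(binom.pmf(2, n=10, p=0.3))          # 2 successes in 10 trials
print(poisson.pmf(4, mu=2.5))             # 4 events when the rate is 2.5
print(norm.pdf(0.5, loc=0.0, scale=1.0))  # density of N(0, 1) at 0.5
&lt;/code>&lt;/pre>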
&lt;hr>


&lt;pre class="mermaid">
flowchart TD
PD[&amp;#34;Probability&amp;lt;br/&amp;gt;distributions&amp;#34;] --&amp;gt; DS[&amp;#34;Common&amp;lt;br/&amp;gt;distributions&amp;#34;]

DS --&amp;gt; DIS[&amp;#34;Discrete&amp;#34;]
DS --&amp;gt; CON[&amp;#34;Continuous&amp;#34;]

DIS --&amp;gt; D1[&amp;#34;Bernoulli&amp;#34;]
DIS --&amp;gt; D2[&amp;#34;Binomial&amp;#34;]
DIS --&amp;gt; D3[&amp;#34;Poisson&amp;#34;]

CON --&amp;gt; D4[&amp;#34;Normal&amp;lt;br/&amp;gt;(Gaussian)&amp;#34;]
CON --&amp;gt; D5[&amp;#34;t / Chi-square / F&amp;lt;br/&amp;gt;(intro)&amp;#34;]

style PD fill:#90CAF9,stroke:#1E88E5,color:#000
style DS fill:#90CAF9,stroke:#1E88E5,color:#000

style DIS fill:#CE93D8,stroke:#8E24AA,color:#000
style CON fill:#CE93D8,stroke:#8E24AA,color:#000

style D1 fill:#C8E6C9,stroke:#2E7D32,color:#000
style D2 fill:#C8E6C9,stroke:#2E7D32,color:#000
style D3 fill:#C8E6C9,stroke:#2E7D32,color:#000
style D4 fill:#C8E6C9,stroke:#2E7D32,color:#000
style D5 fill:#C8E6C9,stroke:#2E7D32,color:#000
&lt;/pre>

&lt;hr>
&lt;h2 id="1-bernoulli-distribution-binary">
 1) Bernoulli distribution (binary)
 
 &lt;a class="anchor" href="#1-bernoulli-distribution-binary">#&lt;/a>
 
&lt;/h2>
&lt;p>Use when:
one trial has two outcomes (success/failure).&lt;/p></description></item><item><title>Ordinary Least Squares</title><link>https://arshadhs.github.io/docs/ai/machine-learning/03-ordinary-least-squares/</link><pubDate>Sat, 21 Feb 2026 00:00:00 +0000</pubDate><guid>https://arshadhs.github.io/docs/ai/machine-learning/03-ordinary-least-squares/</guid><description>&lt;h1 id="direct-solution-method---ordinary-least-squares-and-the-line-of-best-fit">
 Direct solution method - Ordinary Least Squares and the Line of Best Fit
 
 &lt;a class="anchor" href="#direct-solution-method---ordinary-least-squares-and-the-line-of-best-fit">#&lt;/a>
 
&lt;/h1>
&lt;p>It is possible to compute the best parameters for linear regression &lt;strong>in one shot&lt;/strong> (closed-form),
instead of iteratively improving them step-by-step.&lt;/p>
&lt;p>For linear regression, the direct method is usually &lt;strong>Ordinary Least Squares (OLS)&lt;/strong>.&lt;/p>
&lt;p>Ordinary Least Squares (OLS) chooses the “best” line by &lt;strong>minimising squared prediction errors&lt;/strong>.&lt;/p>
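&lt;p>A minimal sketch of the one-shot solution via the normal equations, on invented data:&lt;/p>
&lt;pre>&lt;code class="language-python">import numpy as np

# Illustrative data (not from the original notes): y is roughly 3x + 1.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 3.9, 7.2, 10.1, 12.8])

# Design matrix with a bias column, then the normal equations in one shot:
# theta = (X^T X)^(-1) X^T y
X = np.column_stack([np.ones_like(x), x])
theta = np.linalg.solve(X.T @ X, X.T @ y)

print(theta)   # [intercept, slope]
&lt;/code>&lt;/pre>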
&lt;blockquote class="book-hint info">
&lt;p>Key takeaway:
OLS defines “best fit” as the line that minimises the total squared residual error across all data points.&lt;/p></description></item><item><title>Cost Function</title><link>https://arshadhs.github.io/docs/ai/machine-learning/03-cost-function/</link><pubDate>Sat, 21 Feb 2026 00:00:00 +0000</pubDate><guid>https://arshadhs.github.io/docs/ai/machine-learning/03-cost-function/</guid><description>&lt;h1 id="cost-function">
 Cost Function
 
 &lt;a class="anchor" href="#cost-function">#&lt;/a>
 
&lt;/h1>
&lt;ul>
&lt;li>
&lt;p>also known as an objective function&lt;/p>
&lt;/li>
&lt;li>
&lt;p>quantifies &lt;strong>how far the predicted values are from the actual ones&lt;/strong>&lt;/p>
&lt;/li>
&lt;li>
&lt;p>measures the model’s error on a group of datapoints&lt;/p>
&lt;/li>
&lt;li>
&lt;p>guides the choice of the best-fit line through the data&lt;/p>
&lt;/li>
&lt;li>
&lt;p>used to evaluate the accuracy of a model’s predictions&lt;/p></description></item><item><title>Gradient Descent Algorithm</title><link>https://arshadhs.github.io/docs/ai/deep-learning/035-gradient-descent-algorithm/</link><pubDate>Thu, 26 Feb 2026 00:00:00 +0000</pubDate><guid>https://arshadhs.github.io/docs/ai/deep-learning/035-gradient-descent-algorithm/</guid><description>&lt;h1 id="gradient-descent-algorithm">
 Gradient Descent Algorithm
 
 &lt;a class="anchor" href="#gradient-descent-algorithm">#&lt;/a>
 
&lt;/h1>
&lt;p>Gradient Descent Algorithm (GDA) is&lt;/p>
&lt;ul>
&lt;li>an &lt;strong>optimisation method&lt;/strong>&lt;/li>
&lt;li>used to &lt;strong>train models&lt;/strong>&lt;/li>
&lt;li>by repeatedly updating parameters (weights and biases) to &lt;strong>reduce the loss&lt;/strong>&lt;/li>
&lt;/ul>
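&lt;p>Written compactly (standard notation, added here for reference), each step moves the parameters 
&lt;span>
 \( \theta \)
 &lt;/span>

 against the gradient of the loss:&lt;/p>
&lt;span>
 \[ 
\theta \leftarrow \theta - \eta\,\nabla_{\theta} L(\theta)
 \]
 &lt;/span>
&lt;p>where 
&lt;span>
 \( \eta \)
 &lt;/span>

 is the learning rate.&lt;/p>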
&lt;blockquote class="book-hint info">
&lt;p>In deep learning, the default training approach is almost always &lt;strong>mini-batch gradient descent&lt;/strong>, usually with &lt;strong>Adam&lt;/strong> or &lt;strong>SGD + momentum&lt;/strong>.&lt;/p>
&lt;/blockquote>
&lt;p>Gradient Descent is &lt;strong>used in both regression and classification&lt;/strong>.&lt;/p>
&lt;p>It’s not tied to the task type — it’s tied to the fact you have:&lt;/p></description></item><item><title>Gradient Descent</title><link>https://arshadhs.github.io/docs/ai/machine-learning/03-gradient-descent-linear-regression/</link><pubDate>Sat, 21 Feb 2026 00:00:00 +0000</pubDate><guid>https://arshadhs.github.io/docs/ai/machine-learning/03-gradient-descent-linear-regression/</guid><description>&lt;h1 id="gradient-descent-for-linear-regression">
 Gradient Descent for Linear Regression
 
 &lt;a class="anchor" href="#gradient-descent-for-linear-regression">#&lt;/a>
 
&lt;/h1>
&lt;p>Gradient descent is an iterative optimisation method used to minimise the regression cost function by repeatedly updating parameters in the direction that reduces error.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Iterative method&lt;/strong>&lt;/li>
&lt;li>Types: batch / stochastic / mini-batch&lt;/li>
&lt;/ul>
&lt;blockquote class="book-hint info">
&lt;p>Key takeaway:
Gradient descent starts with initial parameter values and repeatedly updates them using the gradient until the cost stops decreasing.&lt;/p>
&lt;/blockquote>
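&lt;p>For a single-feature linear model with mean squared error, the gradients used in each update are the standard results (stated here for reference):&lt;/p>
&lt;span>
 \[ 
\frac{\partial J}{\partial w}=\frac{2}{n}\sum_{i=1}^{n}\left(\hat{y}_i-y_i\right)x_i,
\qquad
\frac{\partial J}{\partial b}=\frac{2}{n}\sum_{i=1}^{n}\left(\hat{y}_i-y_i\right)
 \]
 &lt;/span>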


&lt;script src="https://arshadhs.github.io/mermaid.min.js">&lt;/script>

 &lt;script>mermaid.initialize({
 "flowchart": {
 "useMaxWidth":true
 },
 "theme": "default"
}
)&lt;/script>




&lt;pre class="mermaid">
flowchart TD
GD[&amp;#34;Gradient&amp;lt;br/&amp;gt;Descent&amp;#34;] --&amp;gt;|minimises| CF[&amp;#34;Cost&amp;lt;br/&amp;gt;function&amp;#34;]
GD --&amp;gt;|updates| W[&amp;#34;Parameters&amp;lt;br/&amp;gt;(weights)&amp;#34;]
GD --&amp;gt;|uses| GR[&amp;#34;Gradient&amp;lt;br/&amp;gt;(slope)&amp;#34;]

GD --&amp;gt; H[&amp;#34;Hyperparameters&amp;#34;]
H --&amp;gt; LR[&amp;#34;Learning&amp;lt;br/&amp;gt;rate&amp;#34;]
H --&amp;gt; BS[&amp;#34;Batch&amp;lt;br/&amp;gt;size&amp;#34;]
H --&amp;gt; EP[&amp;#34;Epochs&amp;#34;]

style GD fill:#90CAF9,stroke:#1E88E5,color:#000

style CF fill:#CE93D8,stroke:#8E24AA,color:#000
style W fill:#CE93D8,stroke:#8E24AA,color:#000
style GR fill:#CE93D8,stroke:#8E24AA,color:#000
style H fill:#CE93D8,stroke:#8E24AA,color:#000
style LR fill:#CE93D8,stroke:#8E24AA,color:#000
style BS fill:#CE93D8,stroke:#8E24AA,color:#000
style EP fill:#CE93D8,stroke:#8E24AA,color:#000
&lt;/pre>

&lt;hr>
&lt;h2 id="types-of-gd">
 Types of GD
 
 &lt;a class="anchor" href="#types-of-gd">#&lt;/a>
 
&lt;/h2>


&lt;pre class="mermaid">
flowchart TD
T[&amp;#34;Gradient Descent&amp;lt;br/&amp;gt;types&amp;#34;] --&amp;gt; BGD[&amp;#34;Batch&amp;lt;br/&amp;gt;GD&amp;#34;]
T --&amp;gt; SGD[&amp;#34;Stochastic&amp;lt;br/&amp;gt;GD&amp;#34;]
T --&amp;gt; MGD[&amp;#34;Mini-batch&amp;lt;br/&amp;gt;GD&amp;#34;]

BGD --&amp;gt; ALL[&amp;#34;All data&amp;lt;br/&amp;gt;per step&amp;#34;]
BGD --&amp;gt; STB[&amp;#34;Smooth&amp;lt;br/&amp;gt;updates&amp;#34;]

SGD --&amp;gt; ONE[&amp;#34;1 sample&amp;lt;br/&amp;gt;per step&amp;#34;]
SGD --&amp;gt; FAST[&amp;#34;Quick&amp;lt;br/&amp;gt;progress&amp;#34;]
SGD --&amp;gt; NOISE[&amp;#34;Noisy&amp;lt;br/&amp;gt;updates&amp;#34;]

MGD --&amp;gt; MB[&amp;#34;Small batch&amp;lt;br/&amp;gt;per step&amp;#34;]
MGD --&amp;gt; PRACT[&amp;#34;Practical&amp;lt;br/&amp;gt;default&amp;#34;]

style T fill:#90CAF9,stroke:#1E88E5,color:#000

style BGD fill:#C8E6C9,stroke:#2E7D32,color:#000
style SGD fill:#C8E6C9,stroke:#2E7D32,color:#000
style MGD fill:#C8E6C9,stroke:#2E7D32,color:#000

style ALL fill:#CE93D8,stroke:#8E24AA,color:#000
style STB fill:#CE93D8,stroke:#8E24AA,color:#000
style ONE fill:#CE93D8,stroke:#8E24AA,color:#000
style FAST fill:#CE93D8,stroke:#8E24AA,color:#000
style NOISE fill:#CE93D8,stroke:#8E24AA,color:#000
style MB fill:#CE93D8,stroke:#8E24AA,color:#000
style PRACT fill:#CE93D8,stroke:#8E24AA,color:#000
&lt;/pre>

&lt;h3 id="batch">
 Batch
 
 &lt;a class="anchor" href="#batch">#&lt;/a>
 
&lt;/h3>
&lt;ul>
&lt;li>Use only if you have huge compute and a lot of time to train&lt;/li>
&lt;/ul>
&lt;h3 id="sgd">
 SGD
 
 &lt;a class="anchor" href="#sgd">#&lt;/a>
 
&lt;/h3>
&lt;ul>
&lt;li>
&lt;p>go-to solution&lt;/p></description></item><item><title>Hypothesis Testing</title><link>https://arshadhs.github.io/docs/ai/statistics/04_hypothesis_testing/</link><pubDate>Thu, 12 Mar 2026 00:00:00 +0000</pubDate><guid>https://arshadhs.github.io/docs/ai/statistics/04_hypothesis_testing/</guid><description>&lt;h1 id="hypothesis-testing">
 Hypothesis Testing
 
 &lt;a class="anchor" href="#hypothesis-testing">#&lt;/a>
 
&lt;/h1>
&lt;p>Hypothesis testing is a structured way to decide:&lt;/p>
&lt;p>Is what we see in a sample just random variation,
or is there evidence of a real effect in the population?&lt;/p>
&lt;p>Hypothesis testing sits inside &lt;strong>inferential statistics&lt;/strong>:
we use a &lt;strong>sample&lt;/strong> to make a statement about a &lt;strong>population&lt;/strong>.&lt;/p>
&lt;ul>
&lt;li>Sampling (random and stratified)&lt;/li>
&lt;li>Sampling distribution and Central Limit Theorem&lt;/li>
&lt;li>Estimation (confidence intervals and confidence level)&lt;/li>
&lt;li>Testing hypotheses (mean, proportion, ANOVA)&lt;/li>
&lt;li>Maximum likelihood (MLE)&lt;/li>
&lt;/ul>
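&lt;p>A minimal sketch of the estimation step (illustrative numbers; assumes a large-sample 95% z-interval):&lt;/p>
&lt;pre>&lt;code class="language-python">import math

# Sample summary (illustrative): n observations, sample mean and SD.
n, x_bar, s = 100, 52.3, 8.1

z = 1.96                         # approx. 95% confidence level
margin = z * s / math.sqrt(n)
print(x_bar - margin, x_bar + margin)   # 95% confidence interval for the mean
&lt;/code>&lt;/pre>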
&lt;blockquote class="book-hint info">
&lt;p>Key takeaway:
The logic is always the same:&lt;/p></description></item><item><title>LNN for Classification</title><link>https://arshadhs.github.io/docs/ai/deep-learning/040-linear-neural-networks-for-classification/</link><pubDate>Sun, 15 Feb 2026 00:00:00 +0000</pubDate><guid>https://arshadhs.github.io/docs/ai/deep-learning/040-linear-neural-networks-for-classification/</guid><description>&lt;h1 id="linear-nn-for-classification">
 Linear NN for Classification
 
 &lt;a class="anchor" href="#linear-nn-for-classification">#&lt;/a>
 
&lt;/h1>
&lt;p>A &lt;strong>Linear Neural Network (LNN) for classification&lt;/strong> uses &lt;strong>no hidden layers&lt;/strong>.&lt;br>
It learns a &lt;strong>linear decision boundary&lt;/strong> and outputs &lt;strong>class probabilities&lt;/strong>, then converts them into predicted classes.&lt;/p>
&lt;blockquote class="book-hint info">
&lt;p>Neural-network view:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Binary classification&lt;/strong> → logistic regression (single neuron + sigmoid)&lt;/li>
&lt;li>&lt;strong>Multi-class classification&lt;/strong> → softmax regression (K output neurons + softmax)&lt;/li>
&lt;/ul>
&lt;/blockquote>
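&lt;p>A minimal sketch (added for illustration) of the two activations named above:&lt;/p>
&lt;pre>&lt;code class="language-python">import numpy as np

def sigmoid(z):
    # Binary case: squashes a linear score into a probability.
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    # Multi-class case: turns K linear scores into K probabilities.
    e = np.exp(z - np.max(z))   # shift for numerical stability
    return e / e.sum()

scores = np.array([2.0, 0.5, -1.0])   # K = 3 output neurons
probs = softmax(scores)
print(probs, probs.argmax())          # probabilities and predicted class
print(sigmoid(0.8))                   # single-neuron binary probability
&lt;/code>&lt;/pre>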
&lt;hr>


&lt;script src="https://arshadhs.github.io/mermaid.min.js">&lt;/script>

 &lt;script>mermaid.initialize({
 "flowchart": {
 "useMaxWidth":true
 },
 "theme": "default"
}
)&lt;/script>




&lt;pre class="mermaid">
flowchart LR
 D[&amp;#34;Data&amp;lt;br/&amp;gt;X, y&amp;#34;] --&amp;gt; M[&amp;#34;Linear model&amp;lt;br/&amp;gt;w, b&amp;#34;]
 M --&amp;gt; A[&amp;#34;Activation&amp;lt;br/&amp;gt;Sigmoid / Softmax&amp;#34;]
 A --&amp;gt; L[&amp;#34;Loss&amp;lt;br/&amp;gt;Cross-entropy&amp;#34;]
 L --&amp;gt; O[&amp;#34;Optimiser&amp;lt;br/&amp;gt;Mini-batch GD / Adam&amp;#34;]
 O --&amp;gt; P[&amp;#34;Updated parameters&amp;lt;br/&amp;gt;w, b&amp;#34;]
 P --&amp;gt; I[&amp;#34;Inference&amp;lt;br/&amp;gt;Probabilities → class&amp;#34;]

 %% Pastel colour scheme
 style D fill:#E3F2FD,stroke:#1E88E5,stroke-width:1px
 style M fill:#E8F5E9,stroke:#43A047,stroke-width:1px
 style A fill:#FFF3E0,stroke:#FB8C00,stroke-width:1px
 style L fill:#FCE4EC,stroke:#D81B60,stroke-width:1px
 style O fill:#F3E5F5,stroke:#8E24AA,stroke-width:1px
 style P fill:#E0F7FA,stroke:#00838F,stroke-width:1px
 style I fill:#F1F8E9,stroke:#558B2F,stroke-width:1px
&lt;/pre>

&lt;hr>
&lt;h2 id="classification">
 Classification
 
 &lt;a class="anchor" href="#classification">#&lt;/a>
 
&lt;/h2>
&lt;p>Classification predicts a &lt;strong>discrete class label&lt;/strong>.&lt;br>
Common settings:&lt;/p></description></item><item><title>Classification(Linear Models)</title><link>https://arshadhs.github.io/docs/ai/machine-learning/04-linear-models-classification/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://arshadhs.github.io/docs/ai/machine-learning/04-linear-models-classification/</guid><description>&lt;h1 id="linear-models-for-classification">
 Linear models for Classification
 
 &lt;a class="anchor" href="#linear-models-for-classification">#&lt;/a>
 
&lt;/h1>
&lt;ul>
&lt;li>categorises data by finding a linear boundary (hyperplane) that separates classes&lt;/li>
&lt;li>scores each example by calculating a weighted sum of input features plus a bias&lt;/li>
&lt;/ul>


&lt;script src="https://arshadhs.github.io/mermaid.min.js">&lt;/script>

 &lt;script>mermaid.initialize({
 "flowchart": {
 "useMaxWidth":true
 },
 "theme": "default"
}
)&lt;/script>




&lt;pre class="mermaid">
flowchart TD
T[&amp;#34;Linear&amp;lt;br/&amp;gt;classification&amp;lt;br/&amp;gt;models&amp;#34;] --&amp;gt; P[&amp;#34;Perceptron&amp;#34;]
T --&amp;gt; LR[&amp;#34;Logistic&amp;lt;br/&amp;gt;regression&amp;#34;]
T --&amp;gt; SVM[&amp;#34;Linear&amp;lt;br/&amp;gt;SVM&amp;#34;]

P --&amp;gt;|uses| STEP[&amp;#34;Step&amp;lt;br/&amp;gt;activation&amp;#34;]
LR --&amp;gt;|uses| SIG[&amp;#34;Sigmoid&amp;lt;br/&amp;gt;+ log loss&amp;#34;]
SVM --&amp;gt;|uses| HNG[&amp;#34;Hinge&amp;lt;br/&amp;gt;loss&amp;#34;]

style T fill:#90CAF9,stroke:#1E88E5,color:#000

style P fill:#C8E6C9,stroke:#2E7D32,color:#000
style LR fill:#C8E6C9,stroke:#2E7D32,color:#000
style SVM fill:#C8E6C9,stroke:#2E7D32,color:#000

style STEP fill:#CE93D8,stroke:#8E24AA,color:#000
style SIG fill:#CE93D8,stroke:#8E24AA,color:#000
style HNG fill:#CE93D8,stroke:#8E24AA,color:#000
&lt;/pre>

&lt;h2 id="discriminant-functions">
 Discriminant Functions
 
 &lt;a class="anchor" href="#discriminant-functions">#&lt;/a>
 
&lt;/h2>
&lt;h2 id="decision-theory">
 Decision Theory
 
 &lt;a class="anchor" href="#decision-theory">#&lt;/a>
 
&lt;/h2>
&lt;h2 id="probabilistic-discriminative-classifiers">
 Probabilistic Discriminative Classifiers
 
 &lt;a class="anchor" href="#probabilistic-discriminative-classifiers">#&lt;/a>
 
&lt;/h2>
&lt;hr>
&lt;h2 id="logistic-regression">
 Logistic Regression
 
 &lt;a class="anchor" href="#logistic-regression">#&lt;/a>
 
&lt;/h2>
&lt;ul>
&lt;li>Supervised machine learning algorithm&lt;/li>
&lt;li>Binary &lt;strong>classification&lt;/strong> algorithm&lt;/li>
&lt;li>assumes a linear decision boundary (works best when classes are roughly linearly separable)&lt;/li>
&lt;li>predicts the probability that an input belongs to a specific class&lt;/li>
&lt;li>uses &lt;strong>Sigmoid function&lt;/strong> to convert inputs into a probability value between 0 and 1&lt;/li>
&lt;/ul>
&lt;blockquote class="book-hint info">
&lt;p>Key takeaway:
Logistic regression predicts $P(y=1\mid x)$ using a sigmoid of a linear score $z=w\cdot x+b$,
then learns $w,b$ by maximising likelihood (equivalently minimising log-loss).&lt;/p></description></item><item><title>Foundation Models</title><link>https://arshadhs.github.io/docs/ai/genai/foundation-model/</link><pubDate>Sun, 14 Dec 2025 00:00:00 +0000</pubDate><guid>https://arshadhs.github.io/docs/ai/genai/foundation-model/</guid><description>&lt;h1 id="foundation-model">
 Foundation Model
 
 &lt;a class="anchor" href="#foundation-model">#&lt;/a>
 
&lt;/h1>
&lt;p>AI models trained on massive datasets to perform a wide range of tasks with minimal fine-tuning.&lt;/p>
&lt;ul>
&lt;li>
&lt;p>are large deep learning neural networks&lt;/p>
&lt;/li>
&lt;li>
&lt;p>are large AI models trained on &lt;strong>massive and diverse datasets&lt;/strong> (text, images, audio, or multiple modalities).&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Contain &lt;strong>millions or billions of parameters&lt;/strong>.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>designed to perform a &lt;strong>broad range of general tasks&lt;/strong>&lt;/p>
&lt;/li>
&lt;li>
&lt;p>designed for &lt;strong>general-purpose intelligence&lt;/strong>, not a single task.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>act as &lt;strong>base models&lt;/strong> for building specialised AI applications&lt;/p></description></item><item><title>LLM - Model</title><link>https://arshadhs.github.io/docs/ai/genai/llm/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://arshadhs.github.io/docs/ai/genai/llm/</guid><description>&lt;h1 id="llm--large-language-model">
 LLM – Large Language Model
 
 &lt;a class="anchor" href="#llm--large-language-model">#&lt;/a>
 
&lt;/h1>
&lt;p>Large Language Models (LLMs) are &lt;strong>advanced AI systems&lt;/strong> designed to process, understand, and generate &lt;strong>human-like text&lt;/strong>.&lt;/p>
&lt;p>They learn language by analysing &lt;strong>massive amounts of text data&lt;/strong>, discovering patterns in:&lt;/p>
&lt;ul>
&lt;li>
&lt;p>grammar&lt;/p>
&lt;/li>
&lt;li>
&lt;p>meaning&lt;/p>
&lt;/li>
&lt;li>
&lt;p>context&lt;/p>
&lt;/li>
&lt;li>
&lt;p>relationships between words and sentences&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Built on &lt;strong>Deep Learning&lt;/strong>&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Implemented using &lt;strong>Neural Networks&lt;/strong>&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Based on &lt;strong>Transformers&lt;/strong>&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Often combined with tools like:&lt;/p>
&lt;ul>
&lt;li>Retrieval (RAG)&lt;/li>
&lt;li>Agents&lt;/li>
&lt;li>External APIs&lt;/li>
&lt;li>Memory systems&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;hr>
&lt;h2 id="what-makes-an-llm-special">
 What makes an LLM special?
 
 &lt;a class="anchor" href="#what-makes-an-llm-special">#&lt;/a>
 
&lt;/h2>
&lt;ul>
&lt;li>Built using &lt;strong>deep neural networks&lt;/strong>&lt;/li>
&lt;li>Trained on &lt;strong>very large datasets&lt;/strong> (books, articles, code, web text)&lt;/li>
&lt;li>Can perform many tasks &lt;strong>without task-specific training&lt;/strong>&lt;/li>
&lt;li>General-purpose language understanding, not single-task models&lt;/li>
&lt;/ul>
&lt;hr>
&lt;h2 id="foundation-transformer-architecture">
 Foundation: Transformer Architecture
 
 &lt;a class="anchor" href="#foundation-transformer-architecture">#&lt;/a>
 
&lt;/h2>
&lt;p>LLMs are based on the &lt;strong>&lt;a href="https://arshadhs.github.io/docs/ai/deep-learning/transformer/">Transformer Architecture&lt;/a>&lt;/strong>, which allows models to understand &lt;strong>context and long-range dependencies&lt;/strong> in text.&lt;/p></description></item><item><title>AI Agents</title><link>https://arshadhs.github.io/docs/ai/genai/ai-agents/</link><pubDate>Mon, 15 Dec 2025 10:55:52 +0100</pubDate><guid>https://arshadhs.github.io/docs/ai/genai/ai-agents/</guid><description>&lt;h1 id="ai-agents">
 AI Agents
 
 &lt;a class="anchor" href="#ai-agents">#&lt;/a>
 
&lt;/h1>
&lt;p>Also referred to as Agentic AI.&lt;/p>
&lt;p>AI agents are &lt;strong>intelligent systems&lt;/strong> that can &lt;strong>plan, make decisions, and take actions&lt;/strong> to achieve goals with &lt;strong>minimal human intervention&lt;/strong>.&lt;/p>
&lt;ul>
&lt;li>
&lt;p>A common use case is &lt;strong>task automation&lt;/strong>&lt;/p>
&lt;/li>
&lt;li>
&lt;p>for example booking travel based on a user’s request.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>AI agents typically build on &lt;strong>Generative AI&lt;/strong> and use &lt;strong>Large Language Models (LLMs)&lt;/strong> as the reasoning core.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Agents often interact with tools (APIs, databases, calendars) to complete multi-step workflows.&lt;/p></description></item><item><title>Retrieval-Augmented Generation (RAG)</title><link>https://arshadhs.github.io/docs/ai/genai/rag/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://arshadhs.github.io/docs/ai/genai/rag/</guid><description>&lt;h1 id="retrieval-augmented-generation-rag">
 Retrieval-Augmented Generation (RAG)
 
 &lt;a class="anchor" href="#retrieval-augmented-generation-rag">#&lt;/a>
 
&lt;/h1>
&lt;p>&lt;strong>Retrieval-Augmented Generation (RAG)&lt;/strong> is a system design pattern that improves an LLM’s answers by:&lt;/p>
&lt;ol>
&lt;li>&lt;strong>Retrieving&lt;/strong> relevant information from an external knowledge source, and then&lt;/li>
&lt;li>&lt;strong>Augmenting&lt;/strong> the LLM prompt with that retrieved context before generating the final response.&lt;/li>
&lt;/ol>
&lt;p>RAG helps an LLM &lt;strong>look things up first&lt;/strong>, then &lt;strong>answer using evidence&lt;/strong>.&lt;/p>
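&lt;p>A minimal sketch of the pattern; the retrieve and generate functions below are hypothetical placeholders, not a real library API:&lt;/p>
&lt;pre>&lt;code class="language-python">def retrieve(query, documents, top_k=2):
    # Hypothetical retriever: rank documents by naive word overlap.
    words = set(query.lower().split())
    scored = sorted(documents,
                    key=lambda d: len(words.intersection(d.lower().split())),
                    reverse=True)
    return scored[:top_k]

def generate(prompt):
    # Placeholder for an LLM call; a real system would query a model here.
    return 'LLM answer grounded in: ' + prompt

docs = ['Our refund policy allows returns within 30 days.',
        'Support hours are 9am to 5pm on weekdays.']

query = 'What is the refund policy?'
context = '\n'.join(retrieve(query, docs))
answer = generate('Context:\n' + context + '\n\nQuestion: ' + query)
print(answer)
&lt;/code>&lt;/pre>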
&lt;hr>
&lt;h2 id="why-rag-is-useful">
 Why RAG is Useful
 
 &lt;a class="anchor" href="#why-rag-is-useful">#&lt;/a>
 
&lt;/h2>
&lt;p>RAG is commonly used when:&lt;/p>
&lt;ul>
&lt;li>Your knowledge is in &lt;strong>private documents&lt;/strong> (PDFs, policies, internal wiki)&lt;/li>
&lt;li>You need &lt;strong>up-to-date information&lt;/strong> (things not in the model’s training data)&lt;/li>
&lt;li>You want fewer &lt;strong>hallucinations&lt;/strong> by grounding answers in retrieved sources&lt;/li>
&lt;li>You want &lt;strong>traceability&lt;/strong> (show “where the answer came from”)&lt;/li>
&lt;/ul>
&lt;blockquote class="book-hint info">
&lt;p>RAG does not change the model weights.&lt;br>
It changes what the model &lt;em>sees&lt;/em> at inference time by adding retrieved context.&lt;/p></description></item><item><title>Mathematical Foundation</title><link>https://arshadhs.github.io/docs/ai/maths/</link><pubDate>Wed, 18 Mar 2026 00:00:00 +0000</pubDate><guid>https://arshadhs.github.io/docs/ai/maths/</guid><description>&lt;h1 id="mathematical-foundations-for-machine-learning">
 Mathematical Foundations for Machine Learning
 
 &lt;a class="anchor" href="#mathematical-foundations-for-machine-learning">#&lt;/a>
 
&lt;/h1>
&lt;p>Machine Learning is built on &lt;strong>mathematical principles&lt;/strong> that allow models to:&lt;/p>
&lt;ul>
&lt;li>represent data&lt;/li>
&lt;li>learn patterns&lt;/li>
&lt;li>optimise performance&lt;/li>
&lt;/ul>


&lt;script src="https://arshadhs.github.io/mermaid.min.js">&lt;/script>

 &lt;script>mermaid.initialize({
 "flowchart": {
 "useMaxWidth":true
 },
 "theme": "default"
}
)&lt;/script>




&lt;pre class="mermaid">
flowchart LR
 DATA[Data]
 MATH[Math Models]
 OPT[Optimisation]
 MODEL[Trained Model]

 DATA --&amp;gt; MATH
 MATH --&amp;gt; OPT
 OPT --&amp;gt; MODEL
&lt;/pre>

&lt;p>ML requires &lt;strong>core mathematical tools&lt;/strong> to understand how ML algorithms work internally. Algebra deals with relationships between variables and quantities, while Calculus focuses on change and optimization.&lt;/p></description></item><item><title>Deep Feedforward Neural Networks (DFNN) for Classification</title><link>https://arshadhs.github.io/docs/ai/deep-learning/050-deep-feedforward/</link><pubDate>Thu, 26 Feb 2026 00:00:00 +0000</pubDate><guid>https://arshadhs.github.io/docs/ai/deep-learning/050-deep-feedforward/</guid><description>&lt;h1 id="deep-feedforward-neural-networks-dfnn-or-multi-layer-perceptrons-mlp-for-classification">
 Deep Feedforward Neural Networks (DFNN) or Multi Layer Perceptrons (MLP) for Classification
 
 &lt;a class="anchor" href="#deep-feedforward-neural-networks-dfnn-or-multi-layer-perceptrons-mlp-for-classification">#&lt;/a>
 
&lt;/h1>
&lt;p>A &lt;strong>Deep Feedforward Neural Network (DFNN)&lt;/strong>, also called a &lt;strong>Multi-Layer Perceptron (MLP)&lt;/strong>, is a neural network with one or more &lt;strong>hidden layers&lt;/strong> where information flows &lt;strong>forward only&lt;/strong> (no recurrence).&lt;br>
For classification, DFNNs learn &lt;strong>non-linear decision boundaries&lt;/strong> by combining hidden layers with &lt;strong>non-linear activation functions&lt;/strong>.&lt;/p>
&lt;blockquote class="book-hint info">
&lt;p>Core idea:&lt;/p>
&lt;ul>
&lt;li>A single neuron can only learn &lt;strong>linear&lt;/strong> boundaries.&lt;/li>
&lt;li>Adding &lt;strong>hidden layers + non-linearity&lt;/strong> allows DFNNs to solve problems like &lt;strong>XOR&lt;/strong>.&lt;/li>
&lt;/ul>
&lt;/blockquote>
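&lt;p>A tiny forward-pass sketch with hand-picked weights (chosen here for illustration) shows how one hidden layer solves XOR:&lt;/p>
&lt;pre>&lt;code class="language-python">import numpy as np

def step(z):
    # Threshold activation: 1 when z is positive, else 0.
    return np.heaviside(z, 0)

def xor_mlp(x1, x2):
    x = np.array([x1, x2])
    h1 = step(x.sum() - 0.5)        # fires for OR(x1, x2)
    h2 = step(x.sum() - 1.5)        # fires for AND(x1, x2)
    return int(step(h1 - h2 - 0.5)) # OR and not AND = XOR

for a in (0, 1):
    for b in (0, 1):
        print(a, b, xor_mlp(a, b))
&lt;/code>&lt;/pre>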
&lt;hr>
&lt;h2 id="mlp-as-solution-for-xor">
 MLP as solution for XOR
 
 &lt;a class="anchor" href="#mlp-as-solution-for-xor">#&lt;/a>
 
&lt;/h2>
&lt;p>A single perceptron fails on XOR because XOR is &lt;strong>not linearly separable&lt;/strong>.&lt;/p></description></item><item><title>Decision Tree</title><link>https://arshadhs.github.io/docs/ai/machine-learning/05-decision-tree/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://arshadhs.github.io/docs/ai/machine-learning/05-decision-tree/</guid><description>&lt;h1 id="decision-tree">
 Decision Tree
 
 &lt;a class="anchor" href="#decision-tree">#&lt;/a>
 
&lt;/h1>
&lt;p>A decision tree classifies an example by asking a sequence of questions about its attributes until it reaches a leaf (final decision).&lt;/p>
&lt;blockquote class="book-hint info">
&lt;p>Key takeaway:
A decision tree grows by repeatedly splitting the training data into &lt;strong>purer&lt;/strong> subsets using an impurity measure
(Entropy / Gini / Classification Error).&lt;/p>
&lt;/blockquote>
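&lt;p>As a hedged sketch, the two most common impurity measures can be computed directly from the class labels at a node (NumPy assumed):&lt;/p>
&lt;pre>&lt;code class="language-python">import numpy as np

def entropy(labels):
    # Shannon entropy (in bits) of the class distribution at a node.
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def gini(labels):
    # Gini impurity: chance of mislabelling a randomly drawn sample.
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

print(entropy([0, 0, 1, 1]))  # 1.0  (maximally mixed node)
print(gini([0, 0, 1, 1]))     # 0.5
&lt;/code>&lt;/pre>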
&lt;hr>
&lt;h2 id="information-theory">
 Information Theory
 
 &lt;a class="anchor" href="#information-theory">#&lt;/a>
 
&lt;/h2>
&lt;p>Decision trees need a way to measure:
“How mixed are the class labels at a node?”&lt;/p></description></item><item><title>Prediction &amp; Forecasting</title><link>https://arshadhs.github.io/docs/ai/statistics/05_prediction_n_forecasting/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://arshadhs.github.io/docs/ai/statistics/05_prediction_n_forecasting/</guid><description>&lt;h1 id="prediction--forecasting">
 Prediction &amp;amp; Forecasting
 
 &lt;a class="anchor" href="#prediction--forecasting">#&lt;/a>
 
&lt;/h1>
&lt;h2 id="correlation">
 Correlation
 
 &lt;a class="anchor" href="#correlation">#&lt;/a>
 
&lt;/h2>
&lt;h2 id="regression">
 Regression
 
 &lt;a class="anchor" href="#regression">#&lt;/a>
 
&lt;/h2>
&lt;h2 id="time-series-analysis">
 Time Series Analysis
 
 &lt;a class="anchor" href="#time-series-analysis">#&lt;/a>
 
&lt;/h2>
&lt;h3 id="introduction-components-of-time-series-data">
 Introduction, Components of time series data
 
 &lt;a class="anchor" href="#introduction-components-of-time-series-data">#&lt;/a>
 
&lt;/h3>
&lt;h3 id="ma-model--basic-and-weighted-ma-model">
 MA model – basic and weighted MA model
 
 &lt;a class="anchor" href="#ma-model--basic-and-weighted-ma-model">#&lt;/a>
 
&lt;/h3>
&lt;h3 id="time-series-models">
 Time series models
 
 &lt;a class="anchor" href="#time-series-models">#&lt;/a>
 
&lt;/h3>
&lt;ul>
&lt;li>AR Model&lt;/li>
&lt;li>ARIMA Model&lt;/li>
&lt;li>SARIMA, SARIMAX, VAR, VARMAX&lt;/li>
&lt;li>Simple exponential smoothing model&lt;/li>
&lt;/ul>
&lt;hr>
&lt;p>&lt;a href="https://arshadhs.github.io/">Home&lt;/a> | &lt;a href="https://arshadhs.github.io/docs/ai/statistics/">
 Statistics
&lt;/a>&lt;/p></description></item><item><title>Convolutional Neural Networks</title><link>https://arshadhs.github.io/docs/ai/deep-learning/060-cnn-fundamentals/</link><pubDate>Sun, 19 Apr 2026 00:00:00 +0000</pubDate><guid>https://arshadhs.github.io/docs/ai/deep-learning/060-cnn-fundamentals/</guid><description>&lt;h1 id="convolutional-neural-networks-cnn">
 Convolutional Neural Networks (CNN)
 
 &lt;a class="anchor" href="#convolutional-neural-networks-cnn">#&lt;/a>
 
&lt;/h1>
&lt;p>Convolutional Neural Networks (CNNs) are specialised neural networks designed for data with spatial structure, especially images. They became the standard model for computer vision because they preserve spatial locality, reuse the same pattern detector across the image, and build representations hierarchically. In practical terms, a CNN starts by learning simple features such as edges and corners, then combines them into textures, shapes, object parts, and finally full semantic categories.&lt;/p></description></item><item><title>Statistics</title><link>https://arshadhs.github.io/docs/ai/statistics/</link><pubDate>Thu, 12 Mar 2026 00:00:00 +0000</pubDate><guid>https://arshadhs.github.io/docs/ai/statistics/</guid><description>&lt;h1 id="statistics">
 Statistics
 
 &lt;a class="anchor" href="#statistics">#&lt;/a>
 
&lt;/h1>
&lt;p>&lt;strong>Statistical methods&lt;/strong> help you turn &lt;strong>raw data into reliable conclusions&lt;/strong>, while understanding &lt;strong>uncertainty, variability, and confidence&lt;/strong>.&lt;/p>
&lt;p>Statistics provides the &lt;strong>language and tools&lt;/strong> for reasoning about data, uncertainty, and inference.&lt;/p>
&lt;p>ML relies on statistics for &lt;strong>understanding data behaviour&lt;/strong>, drawing conclusions, and validating machine learning models.&lt;/p>
&lt;ul>
&lt;li>Collect Data&lt;/li>
&lt;li>Present &amp;amp; Organise Data (in a systematic manner)&lt;/li>
&lt;li>Analyse Data&lt;/li>
&lt;li>Infer about the Data&lt;/li>
&lt;li>Make Decisions from the Data&lt;/li>
&lt;/ul>
&lt;hr>




&lt;ul>
 
 
 
 
 &lt;li>
 &lt;a href="https://arshadhs.github.io/docs/ai/statistics/00_formulas/">Formula Sheet&lt;/a>
 &lt;/li>
 
 
 
 
 &lt;li>
 &lt;a href="https://arshadhs.github.io/docs/ai/statistics/ism-formula-sheet/">Stats Formula Sheet&lt;/a>
 &lt;/li>
 
 
 
 
 &lt;li>
 &lt;a href="https://arshadhs.github.io/docs/ai/statistics/01_basic_statistics/">Basic Statistics&lt;/a>
 &lt;/li>
 
 
 
 
 &lt;li>
 &lt;a href="https://arshadhs.github.io/docs/ai/statistics/01_basic_probability/">Basic Probability&lt;/a>
 &lt;/li>
 
 
 
 
 
 
 
 
 
 
 &lt;li>
 &lt;a href="https://arshadhs.github.io/docs/ai/statistics/04_hypothesis_testing/">Hypothesis Testing&lt;/a>
 &lt;/li>
 
 
 
 
 &lt;li>
 &lt;a href="https://arshadhs.github.io/docs/ai/statistics/05_prediction_n_forecasting/">Prediction &amp;amp; Forecasting&lt;/a>
 &lt;/li>
 
 
 
 
 &lt;li>
 &lt;a href="https://arshadhs.github.io/docs/ai/statistics/06_prediction_n_forecasting/">Gaussian Mixture model &amp;amp; Expectation Maximization&lt;/a>
 &lt;/li>
 
 

 
 
 &lt;li>
 &lt;a href="https://arshadhs.github.io/docs/ai/statistics/conditional-probability/">Conditional Probability &amp;amp; Bayes’ Theorem&lt;/a>

 
 



&lt;ul>
 
 
 
 
 &lt;li>
 &lt;a href="https://arshadhs.github.io/docs/ai/statistics/conditional-probability/021_conditional_prob/">Conditional Probability&lt;/a>
 &lt;/li>
 
 
 
 
 &lt;li>
 &lt;a href="https://arshadhs.github.io/docs/ai/statistics/conditional-probability/022_bayes_theorem/">Bayes’ Theorem&lt;/a>
 &lt;/li>
 
 
 
 
 &lt;li>
 &lt;a href="https://arshadhs.github.io/docs/ai/statistics/conditional-probability/023_naive_bayes/">Naïve Bayes&lt;/a>
 &lt;/li>
 
 

 
 
&lt;/ul>

 &lt;/li>
 
 &lt;li>
 &lt;a href="https://arshadhs.github.io/docs/ai/statistics/probability_distributions/">Probability Distributions&lt;/a>

 
 



&lt;ul>
 
 
 
 
 &lt;li>
 &lt;a href="https://arshadhs.github.io/docs/ai/statistics/probability_distributions/random-variables/">Random Variables&lt;/a>
 &lt;/li>
 
 
 
 
 &lt;li>
 &lt;a href="https://arshadhs.github.io/docs/ai/statistics/probability_distributions/common-distributions/">Common Probability Distributions&lt;/a>
 &lt;/li>
 
 

 
 
&lt;/ul>

 &lt;/li>
 
&lt;/ul>


&lt;hr>
&lt;table>
 &lt;thead>
 &lt;tr>
 &lt;th>Statistics Topic&lt;/th>
 &lt;th>What you learn (plain English)&lt;/th>
 &lt;th>ML Connection&lt;/th>
 &lt;/tr>
 &lt;/thead>
 &lt;tbody>
 &lt;tr>
 &lt;td>1. Basic Probability &amp;amp; Statistics&lt;/td>
 &lt;td>Summarise data;&lt;br>understand spread;&lt;br>basic probability rules&lt;/td>
 &lt;td>Data understanding (EDA), feature sanity checks,&lt;br>detecting outliers, interpreting “average behaviour”&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>2. Conditional Probability &amp;amp; Bayes&lt;/td>
 &lt;td>Update probability using new information;&lt;br>Bayes’ rule&lt;/td>
 &lt;td>Naïve Bayes, Bayesian thinking,&lt;br>posterior probabilities, probabilistic classification&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>3. Probability Distributions&lt;/td>
 &lt;td>Model randomness with distributions;&lt;br>expectation/variance/covariance&lt;/td>
 &lt;td>Likelihood models, noise assumptions (Gaussian), sampling,&lt;br>probabilistic modelling foundations&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>4. Hypothesis Testing&lt;/td>
 &lt;td>Sampling, CLT, confidence intervals,&lt;br>significance tests, ANOVA, MLE&lt;/td>
 &lt;td>A/B testing, evaluating model improvements,&lt;br>significance vs noise, parameter estimation (MLE)&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>5. Prediction &amp;amp; Forecasting&lt;/td>
 &lt;td>Correlation, regression,&lt;br>time series (AR/MA/ARIMA/SARIMA etc.)&lt;/td>
 &lt;td>Linear regression, forecasting, sequential data modelling, baseline predictive modelling&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>6. GMM &amp;amp; EM&lt;/td>
 &lt;td>Mixtures of Gaussians;&lt;br>iterative estimation with EM&lt;/td>
 &lt;td>Unsupervised learning (soft clustering),&lt;br>density estimation, latent-variable models&lt;/td>
 &lt;/tr>
 &lt;/tbody>
&lt;/table>
&lt;hr>


&lt;pre class="mermaid">
flowchart TD
 A[&amp;#34;Statistical Methods&amp;lt;br/&amp;gt;AIML ZC418&amp;#34;] --&amp;gt; B[&amp;#34;1. Basic Probability and Statistics&amp;#34;]
 A --&amp;gt; C[&amp;#34;2. Conditional Probability and Bayes&amp;#34;]
 A --&amp;gt; D[&amp;#34;3. Probability Distributions&amp;#34;]
 A --&amp;gt; E[&amp;#34;4. Hypothesis Testing&amp;#34;]
 A --&amp;gt; F[&amp;#34;5. Prediction and Forecasting&amp;#34;]
 A --&amp;gt; G[&amp;#34;6. Gaussian Mixture Model and EM&amp;#34;]

 B --&amp;gt; B1[&amp;#34;Central Tendency&amp;lt;br/&amp;gt;Mean - Median - Mode&amp;#34;]
 B --&amp;gt; B2[&amp;#34;Variability&amp;lt;br/&amp;gt;Range - Variance - SD - Quartiles&amp;#34;]
 B --&amp;gt; B3[&amp;#34;Basic Probability Concepts&amp;#34;]
 B3 --&amp;gt; B31[&amp;#34;Axioms of Probability&amp;#34;]
 B3 --&amp;gt; B32[&amp;#34;Definition of Probability&amp;#34;]
 B3 --&amp;gt; B33[&amp;#34;Mutually Exclusive vs Independent&amp;#34;]

 C --&amp;gt; C1[&amp;#34;Conditional Probability&amp;#34;]
 C --&amp;gt; C2[&amp;#34;Independence (conditional)&amp;#34;]
 C --&amp;gt; C3[&amp;#34;Bayes Theorem&amp;#34;]
 C --&amp;gt; C4[&amp;#34;Naive Bayes (intro)&amp;#34;]

 D --&amp;gt; D1[&amp;#34;Random Variables&amp;lt;br/&amp;gt;Discrete and Continuous&amp;#34;]
 D --&amp;gt; D2[&amp;#34;Expectation - Variance - Covariance&amp;#34;]
 D --&amp;gt; D3[&amp;#34;Transformations of RVs&amp;#34;]
 D --&amp;gt; D4[&amp;#34;Key Distributions&amp;#34;]
 D4 --&amp;gt; D41[&amp;#34;Bernoulli&amp;#34;]
 D4 --&amp;gt; D42[&amp;#34;Binomial&amp;#34;]
 D4 --&amp;gt; D43[&amp;#34;Poisson&amp;#34;]
 D4 --&amp;gt; D44[&amp;#34;Normal (Gaussian)&amp;#34;]
 D4 --&amp;gt; D45[&amp;#34;t - Chi-square - F (intro)&amp;#34;]

 E --&amp;gt; E1[&amp;#34;Sampling&amp;lt;br/&amp;gt;Random and Stratified&amp;#34;]
 E --&amp;gt; E2[&amp;#34;Sampling Distributions&amp;lt;br/&amp;gt;CLT&amp;#34;]
 E --&amp;gt; E3[&amp;#34;Estimation&amp;lt;br/&amp;gt;Confidence Intervals&amp;#34;]
 E --&amp;gt; E4[&amp;#34;Hypothesis Tests&amp;lt;br/&amp;gt;Means and Proportions&amp;#34;]
 E --&amp;gt; E5[&amp;#34;ANOVA&amp;lt;br/&amp;gt;Single and Dual factor&amp;#34;]
 E --&amp;gt; E6[&amp;#34;Maximum Likelihood&amp;#34;]

 F --&amp;gt; F1[&amp;#34;Correlation&amp;#34;]
 F --&amp;gt; F2[&amp;#34;Regression&amp;#34;]
 F --&amp;gt; F3[&amp;#34;Time Series Basics&amp;lt;br/&amp;gt;Components&amp;#34;]
 F --&amp;gt; F4[&amp;#34;Moving Averages&amp;lt;br/&amp;gt;Simple and Weighted&amp;#34;]
 F --&amp;gt; F5[&amp;#34;Time Series Models&amp;#34;]
 F5 --&amp;gt; F51[&amp;#34;AR&amp;#34;]
 F5 --&amp;gt; F52[&amp;#34;ARMA / ARIMA&amp;#34;]
 F5 --&amp;gt; F53[&amp;#34;SARIMA / SARIMAX&amp;#34;]
 F5 --&amp;gt; F54[&amp;#34;VAR / VARMAX&amp;#34;]
 F --&amp;gt; F6[&amp;#34;Exponential Smoothing&amp;#34;]

 G --&amp;gt; G1[&amp;#34;GMM&amp;lt;br/&amp;gt;Mixture of Gaussians&amp;#34;]
 G --&amp;gt; G2[&amp;#34;EM Algorithm&amp;lt;br/&amp;gt;E-step - M-step&amp;#34;]

 B -.-&amp;gt; C
 C -.-&amp;gt; D
 D -.-&amp;gt; E
 E -.-&amp;gt; F
 F -.-&amp;gt; G
&lt;/pre>

&lt;hr>
&lt;h2 id="data---types">
 Data - Types
 
 &lt;a class="anchor" href="#data---types">#&lt;/a>
 
&lt;/h2>


&lt;pre class="mermaid">
flowchart TD
	A[(Data)] --&amp;gt; B[&amp;#34;Categorical (Qualitative)&amp;#34;]
 A --&amp;gt; C[&amp;#34;Numerical (Quantitative)&amp;#34;]

 B --&amp;gt; B1[Nominal]
 B --&amp;gt; B2[Ordinal]

 C --&amp;gt; C1[Discrete]
 C --&amp;gt; C2[Continuous]

 C2 --&amp;gt; C21[Interval]
 C2 --&amp;gt; C22[Ratio]

 %% Styling
 style A fill:#E1F5FE,stroke:#333
 style B fill:#90CAF9,stroke:#333
 style B1 fill:#90CAF9,stroke:#333
 style B2 fill:#90CAF9,stroke:#333
 style C fill:#FFF9C4,stroke:#333
 style C1 fill:#FFF9C4,stroke:#333
 style C2 fill:#FFF9C4,stroke:#333
 style C21 fill:#FFF9C4,stroke:#333
 style C22 fill:#FFF9C4,stroke:#333
&lt;/pre>

&lt;div class="book-steps ">
&lt;ol>
&lt;li>
&lt;h2 id="categorical-qualitative">
 Categorical (Qualitative)
 
 &lt;a class="anchor" href="#categorical-qualitative">#&lt;/a>
 
&lt;/h2>
&lt;p>Expresses a qualitative attribute,
e.g. hair colour, eye colour&lt;/p></description></item><item><title>Gaussian Mixture model &amp; Expectation Maximization</title><link>https://arshadhs.github.io/docs/ai/statistics/06_prediction_n_forecasting/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://arshadhs.github.io/docs/ai/statistics/06_prediction_n_forecasting/</guid><description>&lt;h1 id="gaussian-mixture-model--expectation-maximization">
 Gaussian Mixture model &amp;amp; Expectation Maximization
 
 &lt;a class="anchor" href="#gaussian-mixture-model--expectation-maximization">#&lt;/a>
 
&lt;/h1>
&lt;hr>
&lt;h2 id="reference">
 Reference
 
 &lt;a class="anchor" href="#reference">#&lt;/a>
 
&lt;/h2>
&lt;p>&lt;a href="https://www.geeksforgeeks.org/machine-learning/gaussian-mixture-model/">Gaussian Mixture model&lt;/a>&lt;/p>
&lt;p>&lt;a href="https://www.geeksforgeeks.org/machine-learning/ml-expectation-maximization-algorithm/">Expectation Maximization&lt;/a>&lt;/p>
&lt;hr>
&lt;p>&lt;a href="https://arshadhs.github.io/">Home&lt;/a> | &lt;a href="https://arshadhs.github.io/docs/ai/statistics/">
 Statistics
&lt;/a>&lt;/p></description></item><item><title>Instance-based Learning</title><link>https://arshadhs.github.io/docs/ai/machine-learning/06-instance-based-learning/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://arshadhs.github.io/docs/ai/machine-learning/06-instance-based-learning/</guid><description>&lt;h1 id="instance-based-learning">
 Instance-based Learning
 
 &lt;a class="anchor" href="#instance-based-learning">#&lt;/a>
 
&lt;/h1>
&lt;p>Instance-based learning is a family of methods that &lt;strong>do not build one explicit global model during training&lt;/strong>. Instead, they &lt;strong>store training examples&lt;/strong> and delay most of the work until a new query arrives.&lt;/p>
&lt;p>When a new point must be classified or predicted, the algorithm compares it with previously seen examples, finds the most relevant neighbours, and uses them to produce the answer.&lt;/p>
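&lt;p>A minimal k-nearest-neighbours sketch of that query-time work (NumPy assumed; integer class labels):&lt;/p>
&lt;pre>&lt;code class="language-python">import numpy as np

def knn_predict(X_train, y_train, x_query, k=3):
    # No global model: store the examples, compare only when a query arrives.
    dists = np.linalg.norm(X_train - x_query, axis=1)  # distance to each stored example
    nearest = np.argsort(dists)[:k]                    # indices of the k closest neighbours
    return np.bincount(y_train[nearest]).argmax()      # majority vote among them
&lt;/code>&lt;/pre>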
&lt;p>Instance-based Learning covers three linked ideas:&lt;/p></description></item><item><title>Deep CNN Architectures</title><link>https://arshadhs.github.io/docs/ai/deep-learning/065-deep-cnn-architectures/</link><pubDate>Sun, 19 Apr 2026 00:00:00 +0000</pubDate><guid>https://arshadhs.github.io/docs/ai/deep-learning/065-deep-cnn-architectures/</guid><description>&lt;h1 id="deep-cnn-architectures">
 Deep CNN Architectures
 
 &lt;a class="anchor" href="#deep-cnn-architectures">#&lt;/a>
 
&lt;/h1>
&lt;p>Once the basic ideas of convolution, pooling, channels, and classifier heads are understood, the next step is to study how successful CNN architectures are designed in practice. The history of deep CNNs is not just a list of famous models. It is a progression of design ideas: smaller filters, more depth, better optimisation, bottlenecks, multi-scale processing, residual connections, and transfer learning.&lt;/p>
&lt;blockquote class="book-hint info">
&lt;p>&lt;strong>Key takeaway:&lt;/strong>&lt;br>
Deep CNN architectures evolved by solving specific problems one by one: &lt;strong>LeNet&lt;/strong> established the template, &lt;strong>AlexNet&lt;/strong> proved deep learning could dominate large-scale vision, &lt;strong>VGG&lt;/strong> simplified the design, &lt;strong>NiN&lt;/strong> introduced powerful &lt;code>1 × 1&lt;/code> ideas, &lt;strong>GoogLeNet&lt;/strong> made multi-scale processing efficient, and &lt;strong>ResNet&lt;/strong> solved the optimisation problem of very deep networks.&lt;/p></description></item><item><title>CNN Pipeline</title><link>https://arshadhs.github.io/docs/ai/deep-learning/067-cnn-model/</link><pubDate>Wed, 22 Apr 2026 00:00:00 +0000</pubDate><guid>https://arshadhs.github.io/docs/ai/deep-learning/067-cnn-model/</guid><description>&lt;h1 id="cnn-pipeline-preprocessing--models">
 CNN Pipeline: Preprocessing &amp;amp; Models
 
 &lt;a class="anchor" href="#cnn-pipeline-preprocessing--models">#&lt;/a>
 
&lt;/h1>
&lt;ul>
&lt;li>Understand CNN concepts deeply&lt;/li>
&lt;li>Build CNN models step-by-step&lt;/li>
&lt;li>Apply CNNs in assignments using Keras&lt;/li>
&lt;/ul>
&lt;blockquote class="book-hint info">
&lt;p>Think of CNN as a pipeline:
Image → Features → Patterns → Prediction&lt;/p>
&lt;/blockquote>
&lt;hr>
&lt;h1 id="1-image-representation">
 1. Image Representation
 
 &lt;a class="anchor" href="#1-image-representation">#&lt;/a>
 
&lt;/h1>
&lt;span style="color: green;">
 &lt;link rel="stylesheet" href="https://arshadhs.github.io/katex/katex.min.css" />
&lt;script defer src="https://arshadhs.github.io/katex/katex.min.js">&lt;/script>
 &lt;script defer src="https://arshadhs.github.io/katex/auto-render.min.js" onload="renderMathInElement(document.body, {
 &amp;#34;delimiters&amp;#34;: [
 {&amp;#34;left&amp;#34;: &amp;#34;$$&amp;#34;, &amp;#34;right&amp;#34;: &amp;#34;$$&amp;#34;, &amp;#34;display&amp;#34;: true},
 {&amp;#34;left&amp;#34;: &amp;#34;$&amp;#34;, &amp;#34;right&amp;#34;: &amp;#34;$&amp;#34;, &amp;#34;display&amp;#34;: false},
 {&amp;#34;left&amp;#34;: &amp;#34;\\(&amp;#34;, &amp;#34;right&amp;#34;: &amp;#34;\\)&amp;#34;, &amp;#34;display&amp;#34;: false},
 {&amp;#34;left&amp;#34;: &amp;#34;\\[&amp;#34;, &amp;#34;right&amp;#34;: &amp;#34;\\]&amp;#34;, &amp;#34;display&amp;#34;: true}
 ]
});">&lt;/script>
&lt;span>
 \[ 
X \in \mathbb{R}^{H \times W \times C}
 \]
 &lt;/span>
&lt;/span>
&lt;ul>
&lt;li>H = Height&lt;/li>
&lt;li>W = Width&lt;/li>
&lt;li>C = Channels&lt;/li>
&lt;/ul>
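&lt;p>For instance (shapes illustrative), a 64 × 64 RGB image is just a rank-3 array:&lt;/p>
&lt;pre>&lt;code class="language-python">import numpy as np

X = np.random.rand(64, 64, 3)  # H=64, W=64, C=3 (RGB channels)
print(X.shape)                 # (64, 64, 3)
&lt;/code>&lt;/pre>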
&lt;hr>
&lt;h1 id="2-convolution-operation">
 2. Convolution Operation
 
 &lt;a class="anchor" href="#2-convolution-operation">#&lt;/a>
 
&lt;/h1>
&lt;span style="color: green;">
 &lt;span>
 \[ 
Z(i,j) = \sum_{m,n} X(i+m, j+n) \cdot K(m,n)
 \]
 &lt;/span>
&lt;/span>
&lt;ul>
&lt;li>Sliding filter extracts features&lt;/li>
&lt;li>Produces feature maps&lt;/li>
&lt;/ul>
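&lt;p>A plain-NumPy sketch of this operation (valid cross-correlation, exactly as the formula above is written; not an optimised implementation):&lt;/p>
&lt;pre>&lt;code class="language-python">import numpy as np

def conv2d(X, K):
    # Z(i, j) = sum over m, n of X(i+m, j+n) * K(m, n)
    h, w = K.shape
    H, W = X.shape
    Z = np.zeros((H - h + 1, W - w + 1))
    for i in range(Z.shape[0]):
        for j in range(Z.shape[1]):
            Z[i, j] = np.sum(X[i:i + h, j:j + w] * K)  # sliding-window dot product
    return Z

edge = np.array([[1.0, -1.0]])  # a tiny horizontal edge detector
print(conv2d(np.eye(4), edge))  # feature map highlighting the diagonal edges
&lt;/code>&lt;/pre>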
&lt;hr>
&lt;h1 id="3-stride--padding">
 3. Stride &amp;amp; Padding
 
 &lt;a class="anchor" href="#3-stride--padding">#&lt;/a>
 
&lt;/h1>
&lt;span style="color: green;">
 &lt;span>
 \[ 
\text{Output} = \frac{N - F + 2P}{S} + 1
 \]
 &lt;/span>
&lt;/span>
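&lt;p>In code form (integer division plays the role of the floor when sizes do not divide evenly):&lt;/p>
&lt;pre>&lt;code class="language-python">def conv_output_size(N, F, P=0, S=1):
    # N: input size, F: filter size, P: padding, S: stride
    return (N - F + 2 * P) // S + 1

print(conv_output_size(N=64, F=3, P=1, S=1))  # 64: padding 1 preserves the size
print(conv_output_size(N=64, F=2, P=0, S=2))  # 32: stride 2 halves the map
&lt;/code>&lt;/pre>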
&lt;hr>
&lt;h1 id="4-activation-relu">
 4. Activation (ReLU)
 
 &lt;a class="anchor" href="#4-activation-relu">#&lt;/a>
 
&lt;/h1>
&lt;span style="color: green;">
 &lt;span>
 \[ 
\mathrm{ReLU}(x) = \max(0, x)
 \]
 &lt;/span>
&lt;/span>
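&lt;p>Element-wise in NumPy:&lt;/p>
&lt;pre>&lt;code class="language-python">import numpy as np

def relu(x):
    return np.maximum(0, x)  # negative activations are clipped to zero

print(relu(np.array([-2.0, 0.0, 3.0])))  # [0. 0. 3.]
&lt;/code>&lt;/pre>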
&lt;hr>
&lt;h1 id="5-pooling">
 5. Pooling
 
 &lt;a class="anchor" href="#5-pooling">#&lt;/a>
 
&lt;/h1>
&lt;ul>
&lt;li>Max Pooling → strongest feature&lt;/li>
&lt;li>Average Pooling → smooth&lt;/li>
&lt;/ul>
&lt;hr>
&lt;h1 id="6-global-average-pooling">
 6. Global Average Pooling
 
 &lt;a class="anchor" href="#6-global-average-pooling">#&lt;/a>
 
&lt;/h1>
&lt;span style="color: green;">
 &lt;span>
 \[ 
y_k = \frac{1}{HW} \sum_{i,j} x_{i,j,k}
 \]
 &lt;/span>
&lt;/span>
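&lt;p>Equivalently, averaging each channel over its spatial dimensions:&lt;/p>
&lt;pre>&lt;code class="language-python">import numpy as np

feature_maps = np.random.rand(8, 8, 32)  # H x W x C feature maps
gap = feature_maps.mean(axis=(0, 1))     # y_k = (1 / HW) * sum over i, j
print(gap.shape)                         # (32,): one value per channel
&lt;/code>&lt;/pre>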
&lt;hr>
&lt;h1 id="7-loss-function">
 7. Loss Function
 
 &lt;a class="anchor" href="#7-loss-function">#&lt;/a>
 
&lt;/h1>
&lt;span style="color: green;">
 &lt;span>
 \[ 
L = - \sum y \log(\hat{y})
 \]
 &lt;/span>
&lt;/span>
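&lt;p>A quick numeric check of the cross-entropy loss (values illustrative):&lt;/p>
&lt;pre>&lt;code class="language-python">import numpy as np

y = np.array([0.0, 1.0, 0.0])      # one-hot true label
y_hat = np.array([0.1, 0.7, 0.2])  # predicted class probabilities
loss = -np.sum(y * np.log(y_hat))  # only the true-class term survives
print(loss)                        # ~0.357 = -log(0.7)
&lt;/code>&lt;/pre>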
&lt;hr>
&lt;h1 id="8-cnn-architecture">
 8. CNN Architecture
 
 &lt;a class="anchor" href="#8-cnn-architecture">#&lt;/a>
 
&lt;/h1>


&lt;script src="https://arshadhs.github.io/mermaid.min.js">&lt;/script>

 &lt;script>mermaid.initialize({
 "flowchart": {
 "useMaxWidth":true
 },
 "theme": "default"
}
)&lt;/script>




&lt;pre class="mermaid">graph LR
A[Input Image] --&amp;gt; B[Conv]
B --&amp;gt; C[ReLU]
C --&amp;gt; D[Pooling]
D --&amp;gt; E[Conv Layers]
E --&amp;gt; F[Flatten / GAP]
F --&amp;gt; G[Dense]
G --&amp;gt; H[Output]&lt;/pre>
&lt;hr>
&lt;h1 id="9-training">
 9. Training
 
 &lt;a class="anchor" href="#9-training">#&lt;/a>
 
&lt;/h1>
&lt;ul>
&lt;li>Forward pass&lt;/li>
&lt;li>Loss computation&lt;/li>
&lt;li>Backpropagation&lt;/li>
&lt;li>Weight update&lt;/li>
&lt;/ul>
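&lt;p>These four steps can be made explicit with a low-level TensorFlow sketch (assuming &lt;code>model&lt;/code>, &lt;code>loss_fn&lt;/code> and &lt;code>optimizer&lt;/code> are already defined; &lt;code>model.fit&lt;/code> runs the same loop for you):&lt;/p>
&lt;pre>&lt;code class="language-python">import tensorflow as tf

def train_step(model, loss_fn, optimizer, x, y):
    with tf.GradientTape() as tape:
        y_pred = model(x, training=True)  # forward pass
        loss = loss_fn(y, y_pred)         # loss computation
    grads = tape.gradient(loss, model.trainable_variables)            # backpropagation
    optimizer.apply_gradients(zip(grads, model.trainable_variables))  # weight update
    return loss
&lt;/code>&lt;/pre>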
&lt;hr>
&lt;h1 id="10-keras-implementation">
 10. Keras Implementation
 
 &lt;a class="anchor" href="#10-keras-implementation">#&lt;/a>
 
&lt;/h1>
&lt;h2 id="model">
 Model
 
 &lt;a class="anchor" href="#model">#&lt;/a>
 
&lt;/h2>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-python" data-lang="python">&lt;span style="display:flex;">&lt;span>&lt;span style="color:#f92672">from&lt;/span> tensorflow.keras.models &lt;span style="color:#f92672">import&lt;/span> Sequential
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#f92672">from&lt;/span> tensorflow.keras.layers &lt;span style="color:#f92672">import&lt;/span> Conv2D, MaxPooling2D, Dense, Flatten
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>model &lt;span style="color:#f92672">=&lt;/span> Sequential()
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>model&lt;span style="color:#f92672">.&lt;/span>add(Conv2D(&lt;span style="color:#ae81ff">32&lt;/span>, (&lt;span style="color:#ae81ff">3&lt;/span>,&lt;span style="color:#ae81ff">3&lt;/span>), activation&lt;span style="color:#f92672">=&lt;/span>&lt;span style="color:#e6db74">&amp;#39;relu&amp;#39;&lt;/span>, input_shape&lt;span style="color:#f92672">=&lt;/span>(&lt;span style="color:#ae81ff">64&lt;/span>,&lt;span style="color:#ae81ff">64&lt;/span>,&lt;span style="color:#ae81ff">3&lt;/span>)))
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>model&lt;span style="color:#f92672">.&lt;/span>add(MaxPooling2D((&lt;span style="color:#ae81ff">2&lt;/span>,&lt;span style="color:#ae81ff">2&lt;/span>)))
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>model&lt;span style="color:#f92672">.&lt;/span>add(Conv2D(&lt;span style="color:#ae81ff">64&lt;/span>, (&lt;span style="color:#ae81ff">3&lt;/span>,&lt;span style="color:#ae81ff">3&lt;/span>), activation&lt;span style="color:#f92672">=&lt;/span>&lt;span style="color:#e6db74">&amp;#39;relu&amp;#39;&lt;/span>))
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>model&lt;span style="color:#f92672">.&lt;/span>add(MaxPooling2D((&lt;span style="color:#ae81ff">2&lt;/span>,&lt;span style="color:#ae81ff">2&lt;/span>)))
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>model&lt;span style="color:#f92672">.&lt;/span>add(Flatten())
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>model&lt;span style="color:#f92672">.&lt;/span>add(Dense(&lt;span style="color:#ae81ff">128&lt;/span>, activation&lt;span style="color:#f92672">=&lt;/span>&lt;span style="color:#e6db74">&amp;#39;relu&amp;#39;&lt;/span>))
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>model&lt;span style="color:#f92672">.&lt;/span>add(Dense(&lt;span style="color:#ae81ff">1&lt;/span>, activation&lt;span style="color:#f92672">=&lt;/span>&lt;span style="color:#e6db74">&amp;#39;sigmoid&amp;#39;&lt;/span>))
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;hr>
&lt;h2 id="compile">
 Compile
 
 &lt;a class="anchor" href="#compile">#&lt;/a>
 
&lt;/h2>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-python" data-lang="python">&lt;span style="display:flex;">&lt;span>model&lt;span style="color:#f92672">.&lt;/span>compile(optimizer&lt;span style="color:#f92672">=&lt;/span>&lt;span style="color:#e6db74">&amp;#39;adam&amp;#39;&lt;/span>, loss&lt;span style="color:#f92672">=&lt;/span>&lt;span style="color:#e6db74">&amp;#39;binary_crossentropy&amp;#39;&lt;/span>, metrics&lt;span style="color:#f92672">=&lt;/span>[&lt;span style="color:#e6db74">&amp;#39;accuracy&amp;#39;&lt;/span>])
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;hr>
&lt;h2 id="train">
 Train
 
 &lt;a class="anchor" href="#train">#&lt;/a>
 
&lt;/h2>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-python" data-lang="python">&lt;span style="display:flex;">&lt;span>model&lt;span style="color:#f92672">.&lt;/span>fit(X_train, y_train, epochs&lt;span style="color:#f92672">=&lt;/span>&lt;span style="color:#ae81ff">10&lt;/span>, batch_size&lt;span style="color:#f92672">=&lt;/span>&lt;span style="color:#ae81ff">32&lt;/span>)
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;hr>
&lt;h2 id="predict">
 Predict
 
 &lt;a class="anchor" href="#predict">#&lt;/a>
 
&lt;/h2>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-python" data-lang="python">&lt;span style="display:flex;">&lt;span>pred &lt;span style="color:#f92672">=&lt;/span> model&lt;span style="color:#f92672">.&lt;/span>predict(X_test)
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;hr>
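&lt;p>Since the network ends in a sigmoid, &lt;code>predict&lt;/code> returns probabilities; a common follow-up (the 0.5 threshold is a choice, not a rule) is:&lt;/p>
&lt;pre>&lt;code class="language-python">labels = (pred &amp;gt; 0.5).astype(int)  # probabilities to hard 0/1 class labels
&lt;/code>&lt;/pre>
&lt;hr>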
&lt;h1 id="11-tips">
 11. Tips
 
 &lt;a class="anchor" href="#11-tips">#&lt;/a>
 
&lt;/h1>
&lt;ul>
&lt;li>Normalize images&lt;/li>
&lt;li>Use small filters&lt;/li>
&lt;li>Avoid too many dense layers&lt;/li>
&lt;/ul>
&lt;hr>
&lt;h1 id="12-summary">
 12. Summary
 
 &lt;a class="anchor" href="#12-summary">#&lt;/a>
 
&lt;/h1>
&lt;blockquote class="book-hint info">
&lt;p>CNN = Automatic feature extractor + classifier&lt;/p></description></item><item><title>Recurrent Neural Networks</title><link>https://arshadhs.github.io/docs/ai/deep-learning/070-recurrent-nn/</link><pubDate>Sun, 19 Apr 2026 00:00:00 +0000</pubDate><guid>https://arshadhs.github.io/docs/ai/deep-learning/070-recurrent-nn/</guid><description>&lt;h1 id="recurrent-neural-networks">
 Recurrent Neural Networks
 
 &lt;a class="anchor" href="#recurrent-neural-networks">#&lt;/a>
 
&lt;/h1>
&lt;p>Recurrent Neural Networks (RNNs) are neural networks designed for &lt;strong>sequential data&lt;/strong>, where the order of inputs matters and the model must use information from earlier time steps to interpret later ones. Unlike a feedforward network, an RNN does not process each input in isolation. It carries a &lt;strong>hidden state&lt;/strong> from one time step to the next, so the network can build a running summary of what it has seen so far.&lt;/p></description></item><item><title>Support Vector Machine</title><link>https://arshadhs.github.io/docs/ai/machine-learning/07-support-vector-machines/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://arshadhs.github.io/docs/ai/machine-learning/07-support-vector-machines/</guid><description>&lt;h1 id="support-vector-machine-svm">
 Support Vector Machine (SVM)
 
 &lt;a class="anchor" href="#support-vector-machine-svm">#&lt;/a>
 
&lt;/h1>
&lt;p>A &lt;strong>Support Vector Machine (SVM)&lt;/strong> is a &lt;strong>supervised machine learning algorithm&lt;/strong> used for:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Classification&lt;/strong> (most common)&lt;/li>
&lt;li>&lt;strong>Regression&lt;/strong> (SVR – Support Vector Regression)&lt;/li>
&lt;/ul>

&lt;blockquote class='book-hint '>
 &lt;p>Find the decision boundary that separates classes with the &lt;strong>maximum margin&lt;/strong>.&lt;/p>
&lt;/blockquote>&lt;blockquote class="book-hint default">
&lt;p>A Support Vector Machine is a supervised learning algorithm that finds an optimal hyperplane by maximising the margin between classes, using support vectors and kernel functions to handle non-linear data.&lt;/p></description></item><item><title>Deep Recurrent Neural Networks</title><link>https://arshadhs.github.io/docs/ai/deep-learning/075-recurrent-nn-deep/</link><pubDate>Sun, 19 Apr 2026 00:00:00 +0000</pubDate><guid>https://arshadhs.github.io/docs/ai/deep-learning/075-recurrent-nn-deep/</guid><description>&lt;h1 id="deep-recurrent-neural-networks">
 Deep Recurrent Neural Networks
 
 &lt;a class="anchor" href="#deep-recurrent-neural-networks">#&lt;/a>
 
&lt;/h1>
&lt;p>Vanilla RNNs introduce the hidden-state idea, but they struggle on longer and more complex sequences because gradients can vanish across time. Deep recurrent models extend the RNN idea in two important ways:&lt;/p>
&lt;ol>
&lt;li>&lt;strong>make the recurrent architecture richer&lt;/strong>, for example by stacking multiple recurrent layers or using information from both directions,&lt;/li>
&lt;li>&lt;strong>use gates and memory cells&lt;/strong> to control what should be remembered, forgotten, updated, and exposed.&lt;/li>
&lt;/ol>
&lt;p>This is why practical recurrent modelling usually moves from a simple RNN to &lt;strong>stacked RNNs, bidirectional RNNs, GRUs, or LSTMs&lt;/strong>.&lt;/p></description></item><item><title>Attention Mechanism</title><link>https://arshadhs.github.io/docs/ai/deep-learning/080-attention-mechanism/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://arshadhs.github.io/docs/ai/deep-learning/080-attention-mechanism/</guid><description>&lt;h1 id="attention-mechanism">
 Attention Mechanism
 
 &lt;a class="anchor" href="#attention-mechanism">#&lt;/a>
 
&lt;/h1>
&lt;ul>
&lt;li>Queries, Keys, and Values&lt;/li>
&lt;li>Attention Pooling by Similarity&lt;/li>
&lt;li>Attention Pooling via Nadaraya–Watson Regression&lt;/li>
&lt;li>Attention Scoring Functions&lt;/li>
&lt;li>Dot Product Attention&lt;/li>
&lt;li>Convenience Functions&lt;/li>
&lt;li>Scaled Dot Product Attention&lt;/li>
&lt;li>Additive Attention&lt;/li>
&lt;li>Bahdanau Attention Mechanism&lt;/li>
&lt;li>Multi-Head Attention&lt;/li>
&lt;li>Self-Attention&lt;/li>
&lt;li>Positional Encoding&lt;/li>
&lt;li>Code implementation (webinar)&lt;/li>
&lt;/ul>
&lt;hr>
&lt;h2 id="reference">
 Reference
 
 &lt;a class="anchor" href="#reference">#&lt;/a>
 
&lt;/h2>
&lt;ul>
&lt;li>&lt;strong>Dive into Deep Learning&lt;/strong>, Cambridge University Press (&lt;a href="https://d2l.ai/chapter_builders-guide/model-construction.html">Ch 10&lt;/a>, &lt;a href="https://d2l.ai/chapter_convolutional-neural-networks/index.html">Ch 7&lt;/a>)&lt;/li>
&lt;/ul>
&lt;hr>
&lt;p>&lt;a href="https://arshadhs.github.io/">Home&lt;/a> | &lt;a href="https://arshadhs.github.io/docs/ai/deep-learning/">
 Deep Learning
&lt;/a>&lt;/p></description></item><item><title>Bayesian Learning</title><link>https://arshadhs.github.io/docs/ai/machine-learning/08-bayesian-learning/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://arshadhs.github.io/docs/ai/machine-learning/08-bayesian-learning/</guid><description>&lt;h1 id="bayesian-learning">
 Bayesian Learning
 
 &lt;a class="anchor" href="#bayesian-learning">#&lt;/a>
 
&lt;/h1>
&lt;h2 id="mle-hypothesis">
 MLE Hypothesis
 
 &lt;a class="anchor" href="#mle-hypothesis">#&lt;/a>
 
&lt;/h2>
&lt;h2 id="map-hypothesis">
 MAP Hypothesis
 
 &lt;a class="anchor" href="#map-hypothesis">#&lt;/a>
 
&lt;/h2>
&lt;h2 id="bayes-rule">
 Bayes Rule
 
 &lt;a class="anchor" href="#bayes-rule">#&lt;/a>
 
&lt;/h2>
&lt;h2 id="optimal-bayes-classifier">
 Optimal Bayes Classifier
 
 &lt;a class="anchor" href="#optimal-bayes-classifier">#&lt;/a>
 
&lt;/h2>
&lt;h2 id="naïve-bayes-classifier">
 Naïve Bayes Classifier
 
 &lt;a class="anchor" href="#na%c3%afve-bayes-classifier">#&lt;/a>
 
&lt;/h2>
&lt;h2 id="probabilistic-generative-classifiers">
 Probabilistic Generative Classifiers
 
 &lt;a class="anchor" href="#probabilistic-generative-classifiers">#&lt;/a>
 
&lt;/h2>
&lt;h2 id="bayesian-linear-regression">
 Bayesian Linear Regression
 
 &lt;a class="anchor" href="#bayesian-linear-regression">#&lt;/a>
 
&lt;/h2>
&lt;hr>
&lt;p>&lt;a href="https://arshadhs.github.io/">Home&lt;/a> | &lt;a href="https://arshadhs.github.io/docs/ai/machine-learning/">
 Machine Learning
&lt;/a>&lt;/p></description></item><item><title>Transformer</title><link>https://arshadhs.github.io/docs/ai/deep-learning/090-transformer/</link><pubDate>Mon, 15 Dec 2025 10:55:52 +0100</pubDate><guid>https://arshadhs.github.io/docs/ai/deep-learning/090-transformer/</guid><description>&lt;h1 id="transformer">
 Transformer
 
 &lt;a class="anchor" href="#transformer">#&lt;/a>
 
&lt;/h1>
&lt;ul>
&lt;li>
&lt;p>is a neural network architecture&lt;/p>
&lt;/li>
&lt;li>
&lt;p>based on the multi-head attention mechanism&lt;/p>
&lt;/li>
&lt;li>
&lt;p>text is converted to numerical representations called tokens, and each token is converted into a vector via lookup from a word embedding table&lt;/p>
&lt;/li>
&lt;li>
&lt;p>takes a text sequence as input and produces another text sequence as output&lt;/p>
&lt;/li>
&lt;li>
&lt;p>foundation for modern &lt;strong>&lt;a href="https://arshadhs.github.io/docs/ai/genai/llm/">Large Language Models (LLMs)&lt;/a>&lt;/strong> like ChatGPT and Gemini&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Transformer architecture&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Model, Positionwise Feed-Forward Networks, Residual Connection and Layer Normalization&lt;/p></description></item><item><title>Ensemble Learning</title><link>https://arshadhs.github.io/docs/ai/machine-learning/09-ensemble-learning/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://arshadhs.github.io/docs/ai/machine-learning/09-ensemble-learning/</guid><description>&lt;h1 id="ensemble-learning">
 Ensemble Learning
 
 &lt;a class="anchor" href="#ensemble-learning">#&lt;/a>
 
&lt;/h1>
&lt;h2 id="combining-classifiers">
 Combining Classifiers
 
 &lt;a class="anchor" href="#combining-classifiers">#&lt;/a>
 
&lt;/h2>
&lt;h2 id="bagging">
 Bagging
 
 &lt;a class="anchor" href="#bagging">#&lt;/a>
 
&lt;/h2>
&lt;h2 id="random-forest">
 Random Forest
 
 &lt;a class="anchor" href="#random-forest">#&lt;/a>
 
&lt;/h2>
&lt;h2 id="boosting">
 Boosting
 
 &lt;a class="anchor" href="#boosting">#&lt;/a>
 
&lt;/h2>
&lt;h3 id="adaboost">
 ADABoost
 
 &lt;a class="anchor" href="#adaboost">#&lt;/a>
 
&lt;/h3>
&lt;h3 id="gradient-boosting">
 Gradient Boosting
 
 &lt;a class="anchor" href="#gradient-boosting">#&lt;/a>
 
&lt;/h3>
&lt;h3 id="xgboost">
 XGBoost
 
 &lt;a class="anchor" href="#xgboost">#&lt;/a>
 
&lt;/h3>
&lt;hr>
&lt;p>&lt;a href="https://arshadhs.github.io/">Home&lt;/a> | &lt;a href="https://arshadhs.github.io/docs/ai/machine-learning/">
 Machine Learning
&lt;/a>&lt;/p></description></item><item><title>Optimisation of Deep models</title><link>https://arshadhs.github.io/docs/ai/deep-learning/100-optimise-deep-models/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://arshadhs.github.io/docs/ai/deep-learning/100-optimise-deep-models/</guid><description>&lt;h1 id="optimisation-of-deep-models">
 Optimisation of Deep models
 
 &lt;a class="anchor" href="#optimisation-of-deep-models">#&lt;/a>
 
&lt;/h1>
&lt;ul>
&lt;li>Goal of Optimization&lt;/li>
&lt;li>Optimization Challenges in Deep Learning&lt;/li>
&lt;li>Gradient Descent&lt;/li>
&lt;li>Stochastic Gradient Descent&lt;/li>
&lt;li>Minibatch Stochastic Gradient Descent&lt;/li>
&lt;li>Momentum&lt;/li>
&lt;li>Adagrad and Algorithm&lt;/li>
&lt;li>RMSProp and Algorithm&lt;/li>
&lt;li>Adadelta and Algorithm&lt;/li>
&lt;li>Adam and Algorithm&lt;/li>
&lt;li>Code Implementation and comparison of algorithms (webinar)&lt;/li>
&lt;/ul>
&lt;hr>
&lt;h2 id="reference">
 Reference
 
 &lt;a class="anchor" href="#reference">#&lt;/a>
 
&lt;/h2>
&lt;ul>
&lt;li>&lt;strong>Dive into Deep Learning&lt;/strong>, Cambridge University Press (Ch 12)&lt;/li>
&lt;/ul>
&lt;hr>
&lt;p>&lt;a href="https://arshadhs.github.io/">Home&lt;/a> | &lt;a href="https://arshadhs.github.io/docs/ai/deep-learning/">
 Deep Learning
&lt;/a>&lt;/p></description></item><item><title>Evaluation/Comparison</title><link>https://arshadhs.github.io/docs/ai/machine-learning/11-ml-model-evaluation-comparison/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://arshadhs.github.io/docs/ai/machine-learning/11-ml-model-evaluation-comparison/</guid><description>&lt;h1 id="machine-learning-model-evaluationcomparison">
 Machine Learning Model Evaluation/Comparison
 
 &lt;a class="anchor" href="#machine-learning-model-evaluationcomparison">#&lt;/a>
 
&lt;/h1>
&lt;h2 id="comparing-machine-learning-models">
 Comparing Machine Learning Models
 
 &lt;a class="anchor" href="#comparing-machine-learning-models">#&lt;/a>
 
&lt;/h2>
&lt;h2 id="emerging-requirements-eg-bias-fairness-interpretability-of-ml-models">
 Emerging requirements e.g., bias, fairness, interpretability of ML models
 
 &lt;a class="anchor" href="#emerging-requirements-eg-bias-fairness-interpretability-of-ml-models">#&lt;/a>
 
&lt;/h2>
&lt;hr>
&lt;p>&lt;a href="https://arshadhs.github.io/">Home&lt;/a> | &lt;a href="https://arshadhs.github.io/docs/ai/machine-learning/">
 Machine Learning
&lt;/a>&lt;/p></description></item><item><title>Regularisation for Deep models</title><link>https://arshadhs.github.io/docs/ai/deep-learning/110-regularisation-deep-models/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://arshadhs.github.io/docs/ai/deep-learning/110-regularisation-deep-models/</guid><description>&lt;h1 id="regularisation-for-deep-models">
 Regularisation for Deep models
 
 &lt;a class="anchor" href="#regularisation-for-deep-models">#&lt;/a>
 
&lt;/h1>
&lt;ul>
&lt;li>Generalization for regression&lt;/li>
&lt;li>Training Error and Generalization Error&lt;/li>
&lt;li>Underfitting or Overfitting&lt;/li>
&lt;li>Model Selection&lt;/li>
&lt;li>Weight Decay and Norms&lt;/li>
&lt;li>Generalization in Classification&lt;/li>
&lt;li>Environment and Distribution Shift&lt;/li>
&lt;li>Generalization in Deep Learning&lt;/li>
&lt;li>Dropout&lt;/li>
&lt;li>Batch Normalization&lt;/li>
&lt;li>Layer Normalization&lt;/li>
&lt;li>Code implementation (webinar)&lt;/li>
&lt;/ul>
&lt;hr>
&lt;h2 id="reference">
 Reference
 
 &lt;a class="anchor" href="#reference">#&lt;/a>
 
&lt;/h2>
&lt;ul>
&lt;li>&lt;strong>Dive into Deep Learning&lt;/strong>, Cambridge University Press (&lt;a href="https://d2l.ai/chapter_introduction/index.html">T1 – Ch 3.6, 3.7; Ch 4.6, 4.7; Ch 5.5, 5.6; Ch 8.5; Ch 11.7&lt;/a>)&lt;/li>
&lt;/ul>
&lt;hr>
&lt;p>&lt;a href="https://arshadhs.github.io/">Home&lt;/a> | &lt;a href="https://arshadhs.github.io/docs/ai/deep-learning/">
 Deep Learning
&lt;/a>&lt;/p></description></item><item><title>Linear Algebra</title><link>https://arshadhs.github.io/docs/ai/maths/010-linear-algebra/</link><pubDate>Wed, 18 Mar 2026 00:00:00 +0000</pubDate><guid>https://arshadhs.github.io/docs/ai/maths/010-linear-algebra/</guid><description>&lt;h1 id="linear-algebra">
 Linear Algebra
 
 &lt;a class="anchor" href="#linear-algebra">#&lt;/a>
 
&lt;/h1>
&lt;p>The &lt;strong>study of vectors and matrices&lt;/strong> is called Linear Algebra.&lt;/p>
&lt;p>Linear Algebra provides the &lt;strong>mathematical language&lt;/strong> used &lt;strong>to represent data, transformations, and structure&lt;/strong> in ML.&lt;/p>




&lt;ul>
 
 
 
 
 
 
 
 
 
 
 
 
 
 

 
 
 &lt;li>
 &lt;a href="https://arshadhs.github.io/docs/ai/maths/010-linear-algebra/01-linear-systems/">Linear Systems&lt;/a>

 
 



&lt;ul>
 
 
 
 
 &lt;li>
 &lt;a href="https://arshadhs.github.io/docs/ai/maths/010-linear-algebra/01-linear-systems/010-systems-of-linear-equations/">Systems of Linear Equations&lt;/a>
 &lt;/li>
 
 
 
 
 &lt;li>
 &lt;a href="https://arshadhs.github.io/docs/ai/maths/010-linear-algebra/01-linear-systems/020-matrices/">Matrices&lt;/a>
 &lt;/li>
 
 
 
 
 &lt;li>
 &lt;a href="https://arshadhs.github.io/docs/ai/maths/010-linear-algebra/01-linear-systems/matrix-transposition/">Matrix Transposition&lt;/a>
 &lt;/li>
 
 
 
 
 &lt;li>
 &lt;a href="https://arshadhs.github.io/docs/ai/maths/010-linear-algebra/01-linear-systems/030-solving-linear-systems/">Solving Linear Systems&lt;/a>
 &lt;/li>
 
 
 
 
 &lt;li>
 &lt;a href="https://arshadhs.github.io/docs/ai/maths/010-linear-algebra/01-linear-systems/forward-backward/">Forward and Backward Substitution&lt;/a>
 &lt;/li>
 
 
 
 
 &lt;li>
 &lt;a href="https://arshadhs.github.io/docs/ai/maths/010-linear-algebra/01-linear-systems/inverse-matrix/">Inverse Matrix&lt;/a>
 &lt;/li>
 
 
 
 
 &lt;li>
 &lt;a href="https://arshadhs.github.io/docs/ai/maths/010-linear-algebra/01-linear-systems/convex/">Convex Combination&lt;/a>
 &lt;/li>
 
 

 
 
&lt;/ul>

 &lt;/li>
 
 &lt;li>
 &lt;a href="https://arshadhs.github.io/docs/ai/maths/010-linear-algebra/02-vector-spaces/">Vector Spaces&lt;/a>

 
 



&lt;ul>
 
 
 
 
 &lt;li>
 &lt;a href="https://arshadhs.github.io/docs/ai/maths/010-linear-algebra/02-vector-spaces/020-basis-and-rank/">Basis and Rank&lt;/a>
 &lt;/li>
 
 
 
 
 &lt;li>
 &lt;a href="https://arshadhs.github.io/docs/ai/maths/010-linear-algebra/02-vector-spaces/010-linear-independence/">Linear Independence&lt;/a>
 &lt;/li>
 
 
 
 
 &lt;li>
 &lt;a href="https://arshadhs.github.io/docs/ai/maths/010-linear-algebra/02-vector-spaces/030-norm/">Norm&lt;/a>
 &lt;/li>
 
 
 
 
 &lt;li>
 &lt;a href="https://arshadhs.github.io/docs/ai/maths/010-linear-algebra/02-vector-spaces/040-inner-products/">Inner Products and Dot Product&lt;/a>
 &lt;/li>
 
 
 
 
 &lt;li>
 &lt;a href="https://arshadhs.github.io/docs/ai/maths/010-linear-algebra/02-vector-spaces/050-lengths-and-distances/">Lengths and Distances&lt;/a>
 &lt;/li>
 
 
 
 
 &lt;li>
 &lt;a href="https://arshadhs.github.io/docs/ai/maths/010-linear-algebra/02-vector-spaces/060-angles-and-orthogonality/">Angles and Orthogonality&lt;/a>
 &lt;/li>
 
 
 
 
 &lt;li>
 &lt;a href="https://arshadhs.github.io/docs/ai/maths/010-linear-algebra/02-vector-spaces/070-orthonormal-basis/">Orthonormal Basis&lt;/a>
 &lt;/li>
 
 
 
 
 &lt;li>
 &lt;a href="https://arshadhs.github.io/docs/ai/maths/010-linear-algebra/02-vector-spaces/feature-space/">Feature Space&lt;/a>
 &lt;/li>
 
 
 
 
 &lt;li>
 &lt;a href="https://arshadhs.github.io/docs/ai/maths/010-linear-algebra/02-vector-spaces/cauchyschwarz/">Cauchy–Schwarz&lt;/a>
 &lt;/li>
 
 

 
 
&lt;/ul>

 &lt;/li>
 
 &lt;li>
 &lt;a href="https://arshadhs.github.io/docs/ai/maths/010-linear-algebra/03-matrix-decomposition/">Matrix Decompositions&lt;/a>

 
 



&lt;ul>
 
 
 
 
 &lt;li>
 &lt;a href="https://arshadhs.github.io/docs/ai/maths/010-linear-algebra/03-matrix-decomposition/special-matrices/">Special Matrices&lt;/a>
 &lt;/li>
 
 
 
 
 &lt;li>
 &lt;a href="https://arshadhs.github.io/docs/ai/maths/010-linear-algebra/03-matrix-decomposition/characteristic-polynomial/">Characteristic Polynomial&lt;/a>
 &lt;/li>
 
 
 
 
 &lt;li>
 &lt;a href="https://arshadhs.github.io/docs/ai/maths/010-linear-algebra/03-matrix-decomposition/010-determinant-and-trace/">Determinant and Trace&lt;/a>
 &lt;/li>
 
 
 
 
 &lt;li>
 &lt;a href="https://arshadhs.github.io/docs/ai/maths/010-linear-algebra/03-matrix-decomposition/020-eigenvalues-and-eigenvectors/">Eigenvalues and Eigenvectors&lt;/a>
 &lt;/li>
 
 
 
 
 &lt;li>
 &lt;a href="https://arshadhs.github.io/docs/ai/maths/010-linear-algebra/03-matrix-decomposition/030-cholesky-decomposition/">Cholesky Decomposition&lt;/a>
 &lt;/li>
 
 
 
 
 &lt;li>
 &lt;a href="https://arshadhs.github.io/docs/ai/maths/010-linear-algebra/03-matrix-decomposition/040-eigen-decomposition/">Eigen Decomposition&lt;/a>
 &lt;/li>
 
 
 
 
 &lt;li>
 &lt;a href="https://arshadhs.github.io/docs/ai/maths/010-linear-algebra/03-matrix-decomposition/diagonalization/">Diagonalization&lt;/a>
 &lt;/li>
 
 
 
 
 &lt;li>
 &lt;a href="https://arshadhs.github.io/docs/ai/maths/010-linear-algebra/03-matrix-decomposition/050-singular-value-decomposition/">Singular Value Decomposition (SVD)&lt;/a>
 &lt;/li>
 
 
 
 
 &lt;li>
 &lt;a href="https://arshadhs.github.io/docs/ai/maths/010-linear-algebra/03-matrix-decomposition/060-matrix-approximation/">Matrix Approximation&lt;/a>
 &lt;/li>
 
 

 
 
&lt;/ul>

 &lt;/li>
 
 &lt;li>
 &lt;a href="https://arshadhs.github.io/docs/ai/maths/010-linear-algebra/07-dimensionality-reduction/">Dimensionality reduction and PCA&lt;/a>

 
 



&lt;ul>
 
 
 
 
 &lt;li>
 &lt;a href="https://arshadhs.github.io/docs/ai/maths/010-linear-algebra/07-dimensionality-reduction/pca/">Principal Component Analysis (PCA)&lt;/a>
 &lt;/li>
 
 
 
 
 &lt;li>
 &lt;a href="https://arshadhs.github.io/docs/ai/maths/010-linear-algebra/07-dimensionality-reduction/pca-theory/">PCA Theory&lt;/a>
 &lt;/li>
 
 
 
 
 &lt;li>
 &lt;a href="https://arshadhs.github.io/docs/ai/maths/010-linear-algebra/07-dimensionality-reduction/pca-practice/">PCA in Practice&lt;/a>
 &lt;/li>
 
 
 
 
 &lt;li>
 &lt;a href="https://arshadhs.github.io/docs/ai/maths/010-linear-algebra/07-dimensionality-reduction/latent-variable-view/">Latent Variable Perspective&lt;/a>
 &lt;/li>
 
 
 
 
 &lt;li>
 &lt;a href="https://arshadhs.github.io/docs/ai/maths/010-linear-algebra/07-dimensionality-reduction/svm-mathematical-foundations/">Mathematical Preliminaries of SVM&lt;/a>
 &lt;/li>
 
 
 
 
 &lt;li>
 &lt;a href="https://arshadhs.github.io/docs/ai/maths/010-linear-algebra/07-dimensionality-reduction/kernels/">Nonlinear SVM and Kernels&lt;/a>
 &lt;/li>
 
 

 
 
&lt;/ul>

 &lt;/li>
 
&lt;/ul>


&lt;hr>
&lt;h2 id="why-linear-algebra-matters-in-ml">
 Why Linear Algebra Matters in ML
 
 &lt;a class="anchor" href="#why-linear-algebra-matters-in-ml">#&lt;/a>
 
&lt;/h2>
&lt;ul>
&lt;li>Every machine learning model uses matrices&lt;/li>
&lt;li>All data in ML is represented using &lt;strong>vectors and matrices&lt;/strong>&lt;/li>
&lt;li>Neural networks are pipelines of matrix operations&lt;/li>
&lt;li>Models apply &lt;strong>matrix transformations&lt;/strong> to data&lt;/li>
&lt;li>Optimisation relies on linear algebra operations&lt;/li>
&lt;/ul>
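&lt;p>A minimal NumPy sketch of the &lt;em>pipelines of matrix operations&lt;/em> point above: a single dense layer is a matrix transformation of the data (shapes illustrative).&lt;/p>
&lt;pre>&lt;code class="language-python">import numpy as np

W = np.random.rand(4, 3)  # weights: map 3 input features to 4 units
b = np.random.rand(4)     # bias vector
x = np.random.rand(3)     # one sample with 3 features
y = W @ x + b             # matrix-vector product plus a shift
print(y.shape)            # (4,)
&lt;/code>&lt;/pre>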
&lt;hr>
&lt;h2 id="what-to-learn">
 What to Learn
 
 &lt;a class="anchor" href="#what-to-learn">#&lt;/a>
 
&lt;/h2>
&lt;ul>
&lt;li>Scalars, vectors, and matrices&lt;/li>
&lt;li>Vector operations (addition, dot product)&lt;/li>
&lt;li>Matrix multiplication &lt;em>(critical)&lt;/em>&lt;/li>
&lt;li>Identity matrices and transpose&lt;/li>
&lt;li>Eigenvalues and eigenvectors &lt;em>(conceptual understanding)&lt;/em>&lt;/li>
&lt;/ul>
&lt;hr>
&lt;ul>
&lt;li>&lt;strong>Scalar&lt;/strong> → a number&lt;/li>
&lt;li>&lt;strong>Vector&lt;/strong> → a directed point&lt;/li>
&lt;li>&lt;strong>Matrix&lt;/strong> → a space transformer&lt;/li>
&lt;li>&lt;strong>Linear transformation&lt;/strong> → structured mapping&lt;/li>
&lt;li>&lt;strong>Feature&lt;/strong> → one axis&lt;/li>
&lt;li>&lt;strong>Feature space&lt;/strong> → where data lives&lt;/li>
&lt;li>&lt;strong>Vector space&lt;/strong> → where vectors live&lt;/li>
&lt;/ul>
&lt;hr>
&lt;p>&lt;a href="https://arshadhs.github.io/">Home&lt;/a> | &lt;a href="https://arshadhs.github.io/docs/ai/maths/">
 Mathematical Foundation
&lt;/a>&lt;/p></description></item><item><title>Linear Systems</title><link>https://arshadhs.github.io/docs/ai/maths/010-linear-algebra/01-linear-systems/</link><pubDate>Thu, 29 Jan 2026 00:00:00 +0000</pubDate><guid>https://arshadhs.github.io/docs/ai/maths/010-linear-algebra/01-linear-systems/</guid><description>&lt;h1 id="linear-systems">
 Linear Systems
 
 &lt;a class="anchor" href="#linear-systems">#&lt;/a>
 
&lt;/h1>
&lt;p>How systems of linear equations are represented and solved using matrices.&lt;/p>
&lt;ul>
&lt;li>the study of vectors and the rules for manipulating them&lt;/li>
&lt;li>describe multiple linear equations solved simultaneously&lt;/li>
&lt;li>connect algebraic equations with matrix representations&lt;/li>
&lt;/ul>
&lt;hr>
&lt;p>&lt;img src="https://arshadhs.github.io/images/ai/matrix_vector_operations.png" alt="Matrix" />&lt;/p>
&lt;hr>
&lt;h2 id="idea-of-closure">
 Idea of Closure
 
 &lt;a class="anchor" href="#idea-of-closure">#&lt;/a>
 
&lt;/h2>
&lt;ul>
&lt;li>
&lt;p>performing a specific operation (like addition or multiplication) on members of a set always produces a result that belongs to the same set&lt;/p>
&lt;/li>
&lt;li>
&lt;p>idea of closure is fundamental to defining a &lt;strong>&lt;a href="https://arshadhs.github.io/docs/ai/linear-algebra/01-linear-systems">Vector space&lt;/a>&lt;/strong> because it ensures that performing arithmetic operations (addition and scalar multiplication) on vectors within a set does not produce a new element outside that set.&lt;/p></description></item><item><title>Systems of Linear Equations</title><link>https://arshadhs.github.io/docs/ai/maths/010-linear-algebra/01-linear-systems/010-systems-of-linear-equations/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://arshadhs.github.io/docs/ai/maths/010-linear-algebra/01-linear-systems/010-systems-of-linear-equations/</guid><description>&lt;h1 id="systems-of-linear-equations">
 Systems of Linear Equations
 
 &lt;a class="anchor" href="#systems-of-linear-equations">#&lt;/a>
 
&lt;/h1>
&lt;p>A system of linear equations can be written compactly as:&lt;/p>
&lt;blockquote class="book-hint danger">
&lt;link rel="stylesheet" href="https://arshadhs.github.io/katex/katex.min.css" />
&lt;script defer src="https://arshadhs.github.io/katex/katex.min.js">&lt;/script>
 &lt;script defer src="https://arshadhs.github.io/katex/auto-render.min.js" onload="renderMathInElement(document.body, {
 &amp;#34;delimiters&amp;#34;: [
 {&amp;#34;left&amp;#34;: &amp;#34;$$&amp;#34;, &amp;#34;right&amp;#34;: &amp;#34;$$&amp;#34;, &amp;#34;display&amp;#34;: true},
 {&amp;#34;left&amp;#34;: &amp;#34;$&amp;#34;, &amp;#34;right&amp;#34;: &amp;#34;$&amp;#34;, &amp;#34;display&amp;#34;: false},
 {&amp;#34;left&amp;#34;: &amp;#34;\\(&amp;#34;, &amp;#34;right&amp;#34;: &amp;#34;\\)&amp;#34;, &amp;#34;display&amp;#34;: false},
 {&amp;#34;left&amp;#34;: &amp;#34;\\[&amp;#34;, &amp;#34;right&amp;#34;: &amp;#34;\\]&amp;#34;, &amp;#34;display&amp;#34;: true}
 ]
});">&lt;/script>
&lt;span>
 \[ 
A\mathbf{x}=\mathbf{b}
 \]
 &lt;/span>
&lt;/blockquote>
&lt;p>This represents:&lt;/p>
&lt;ul>
&lt;li>a &lt;strong>linear transformation&lt;/strong> applied to an unknown vector \( \mathbf{x} \)&lt;/li>
&lt;li>producing an output vector \( \mathbf{b} \)&lt;/li>
&lt;/ul>
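&lt;p>A quick NumPy check of this compact form (values illustrative):&lt;/p>
&lt;pre>&lt;code class="language-python">import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 3.0]])    # coefficient matrix
b = np.array([3.0, 5.0])      # output vector
x = np.linalg.solve(A, b)     # unknown vector with A x = b
print(x)                      # [0.8 1.4]
print(np.allclose(A @ x, b))  # True: A maps x to b
&lt;/code>&lt;/pre>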
&lt;hr>
&lt;h2 id="key-components">
 Key components
 
 &lt;a class="anchor" href="#key-components">#&lt;/a>
 
&lt;/h2>
&lt;h3 id="coefficient-matrix-a">
 Coefficient matrix (A)
 
 &lt;a class="anchor" href="#coefficient-matrix-a">#&lt;/a>
 
&lt;/h3>
&lt;p>\( A \) contains the coefficients of the variables.&lt;/p></description></item><item><title>Calculus</title><link>https://arshadhs.github.io/docs/ai/maths/020-calculus/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://arshadhs.github.io/docs/ai/maths/020-calculus/</guid><description>&lt;h1 id="calculus">
 Calculus
 
 &lt;a class="anchor" href="#calculus">#&lt;/a>
 
&lt;/h1>
&lt;p>Calculus is:&lt;/p>
&lt;ul>
&lt;li>the mathematical framework for understanding and controlling how quantities change&lt;/li>
&lt;li>the mathematics of &lt;strong>change&lt;/strong> and &lt;strong>accumulation&lt;/strong>&lt;/li>
&lt;/ul>
&lt;p>It helps answer:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>How fast is something changing right now?&lt;/strong> → derivatives (differentiation)&lt;/li>
&lt;li>&lt;strong>What happens when inputs change slightly?&lt;/strong>&lt;/li>
&lt;li>&lt;strong>Where is something maximum or minimum?&lt;/strong>&lt;/li>
&lt;li>&lt;strong>How much has accumulated over an interval?&lt;/strong> → integrals (integration)&lt;/li>
&lt;/ul>
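&lt;p>As a tiny plain-Python sketch, the &lt;em>right now&lt;/em> question can be approximated numerically with a central difference:&lt;/p>
&lt;pre>&lt;code class="language-python">def derivative(f, x, h=1e-6):
    # Central difference: approximate instantaneous rate of change of f at x.
    return (f(x + h) - f(x - h)) / (2 * h)

print(derivative(lambda x: x ** 2, 3.0))  # ~6.0, since d/dx x^2 = 2x
&lt;/code>&lt;/pre>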
&lt;hr>


&lt;script src="https://arshadhs.github.io/mermaid.min.js">&lt;/script>

 &lt;script>mermaid.initialize({
 "flowchart": {
 "useMaxWidth":true
 },
 "theme": "default"
}
)&lt;/script>




&lt;pre class="mermaid">
flowchart TD
 A[Calculus] --&amp;gt; B[Limits]
 B --&amp;gt; C[Continuity]
 B --&amp;gt; D[Derivatives]
 B --&amp;gt; E[Integrals]
 D --&amp;gt; F[Optimisation: maxima/minima]
 D --&amp;gt; G[ML: gradients &amp;amp; learning]
 E --&amp;gt; H[Accumulation: area/total change]
&lt;/pre>

&lt;hr>




&lt;ul>
 
 
 
 
 
 
 
 
 
 
 

 
 
 &lt;li>
 &lt;a href="https://arshadhs.github.io/docs/ai/maths/020-calculus/04-vector-calculus/">Vector Calculus&lt;/a>

 
 



&lt;ul>
 
 
 
 
 &lt;li>
 &lt;a href="https://arshadhs.github.io/docs/ai/maths/020-calculus/04-vector-calculus/010-univariate-differentiation/">Differentiation of Univariate Functions&lt;/a>
 &lt;/li>
 
 
 
 
 &lt;li>
 &lt;a href="https://arshadhs.github.io/docs/ai/maths/020-calculus/04-vector-calculus/020-partial-derivatives-and-gradients/">Partial Differentiation and Gradients&lt;/a>
 &lt;/li>
 
 
 
 
 &lt;li>
 &lt;a href="https://arshadhs.github.io/docs/ai/maths/020-calculus/04-vector-calculus/030-vector-and-matrix-gradients/">Gradients of Vector-Valued and Matrix Functions&lt;/a>
 &lt;/li>
 
 
 
 
 &lt;li>
 &lt;a href="https://arshadhs.github.io/docs/ai/maths/020-calculus/04-vector-calculus/050-gradient-identities/">Useful Gradient Identities&lt;/a>
 &lt;/li>
 
 
 
 
 &lt;li>
 &lt;a href="https://arshadhs.github.io/docs/ai/maths/020-calculus/04-vector-calculus/060-backpropagation/">Backpropagation and Automatic Differentiation&lt;/a>
 &lt;/li>
 
 
 
 
 &lt;li>
 &lt;a href="https://arshadhs.github.io/docs/ai/maths/020-calculus/04-vector-calculus/070-higher-order-derivatives/">Higher-order derivatives&lt;/a>
 &lt;/li>
 
 
 
 
 &lt;li>
 &lt;a href="https://arshadhs.github.io/docs/ai/maths/020-calculus/04-vector-calculus/080-taylors-series/">Taylor’s series&lt;/a>
 &lt;/li>
 
 
 
 
 &lt;li>
 &lt;a href="https://arshadhs.github.io/docs/ai/maths/020-calculus/04-vector-calculus/090-maxima-and-minima/">Maxima and Minima&lt;/a>
 &lt;/li>
 
 

 
 
&lt;/ul>

 &lt;/li>
 
 &lt;li>
 &lt;a href="https://arshadhs.github.io/docs/ai/maths/020-calculus/05-optimisation/">Continuous Optimisation&lt;/a>

 
 



&lt;ul>
 
 
 
 
 &lt;li>
 &lt;a href="https://arshadhs.github.io/docs/ai/maths/020-calculus/05-optimisation/gradient-descent/">Optimisation using Gradient Descent&lt;/a>
 &lt;/li>
 
 
 
 
 &lt;li>
 &lt;a href="https://arshadhs.github.io/docs/ai/maths/020-calculus/05-optimisation/constrained-optimisation/">Constrained Optimisation&lt;/a>
 &lt;/li>
 
 
 
 
 &lt;li>
 &lt;a href="https://arshadhs.github.io/docs/ai/maths/020-calculus/05-optimisation/lagrange-multipliers/">Lagrange Multipliers&lt;/a>
 &lt;/li>
 
 
 
 
 &lt;li>
 &lt;a href="https://arshadhs.github.io/docs/ai/maths/020-calculus/05-optimisation/convex-optimisation/">Convex Optimisation&lt;/a>
 &lt;/li>
 
 

 
 
&lt;/ul>

 &lt;/li>
 
 &lt;li>
 &lt;a href="https://arshadhs.github.io/docs/ai/maths/020-calculus/06-nonlinear-optimisation/">Nonlinear Optimisation&lt;/a>

 
 



&lt;ul>
 
 
 
 
 &lt;li>
 &lt;a href="https://arshadhs.github.io/docs/ai/maths/020-calculus/06-nonlinear-optimisation/optimisation-challenges/">Challenges in Gradient-Based Optimisation&lt;/a>
 &lt;/li>
 
 
 
 
 &lt;li>
 &lt;a href="https://arshadhs.github.io/docs/ai/maths/020-calculus/06-nonlinear-optimisation/stochastic-gradient-descent/">Stochastic Gradient Descent (SGD)&lt;/a>
 &lt;/li>
 
 
 
 
 &lt;li>
 &lt;a href="https://arshadhs.github.io/docs/ai/maths/020-calculus/06-nonlinear-optimisation/momentum-methods/">Momentum-Based Learning&lt;/a>
 &lt;/li>
 
 
 
 
 &lt;li>
 &lt;a href="https://arshadhs.github.io/docs/ai/maths/020-calculus/06-nonlinear-optimisation/adaptive-methods/">Adaptive Methods: AdaGrad, RMSProp, Adam&lt;/a>
 &lt;/li>
 
 
 
 
 &lt;li>
 &lt;a href="https://arshadhs.github.io/docs/ai/maths/020-calculus/06-nonlinear-optimisation/hyperparameter-tuning/">Tuning Hyperparameters and Preprocessing&lt;/a>
 &lt;/li>
 
 

 
 
&lt;/ul>

 &lt;/li>
 
&lt;/ul>


&lt;hr>
&lt;div class="book-steps ">
&lt;ol>
&lt;li>
&lt;h2 id="differential-calculus-rates-of-change">
 Differential Calculus (Rates of Change)
 
 &lt;a class="anchor" href="#differential-calculus-rates-of-change">#&lt;/a>
 
&lt;/h2>
&lt;p>Studies &lt;strong>how things change&lt;/strong>.&lt;/p></description></item><item><title>Matrices</title><link>https://arshadhs.github.io/docs/ai/maths/010-linear-algebra/01-linear-systems/020-matrices/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://arshadhs.github.io/docs/ai/maths/010-linear-algebra/01-linear-systems/020-matrices/</guid><description>&lt;h1 id="matrices">
 Matrices
 
 &lt;a class="anchor" href="#matrices">#&lt;/a>
 
&lt;/h1>
&lt;p>Matrices are the &lt;strong>core data structure of linear algebra&lt;/strong> and the &lt;strong>workhorse of machine learning&lt;/strong>.&lt;br>
Almost every ML model can be described as a sequence of matrix operations.&lt;/p>
&lt;ul>
&lt;li>&lt;a href="https://arshadhs.github.io/docs/ai/maths/010-linear-algebra/03-matrix-decomposition/special-matrices/">Special Matrices&lt;/a>&lt;/li>
&lt;/ul>
&lt;hr>
&lt;h2 id="matrix">
 Matrix
 
 &lt;a class="anchor" href="#matrix">#&lt;/a>
 
&lt;/h2>
&lt;p>A &lt;strong>matrix&lt;/strong> is a rectangular array of numbers arranged in &lt;strong>rows and columns&lt;/strong>.&lt;/p>
&lt;blockquote class="book-hint danger">
&lt;span>
 \[ 
A \in \mathbb{R}^{m \times n}
 \]
 &lt;/span>
&lt;/blockquote>
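&lt;p>As a minimal illustration (a sketch assuming NumPy; the values are arbitrary):&lt;/p>
&lt;pre>&lt;code>import numpy as np

# A has 2 rows and 3 columns, i.e. a matrix in R^(2 x 3)
A = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])
print(A.shape)  # (2, 3)
&lt;/code>&lt;/pre>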
&lt;p>An \( m \times n \) matrix has:&lt;/p></description></item><item><title>Solving Linear Systems</title><link>https://arshadhs.github.io/docs/ai/maths/010-linear-algebra/01-linear-systems/030-solving-linear-systems/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://arshadhs.github.io/docs/ai/maths/010-linear-algebra/01-linear-systems/030-solving-linear-systems/</guid><description>&lt;h1 id="solving-linear-systems">
 Solving Linear Systems
 
 &lt;a class="anchor" href="#solving-linear-systems">#&lt;/a>
 
&lt;/h1>
&lt;p>Solve using:&lt;/p>
&lt;ul>
&lt;li>Substitution Method&lt;/li>
&lt;li>Elimination Method (Multiply &amp;amp; then Subtract)&lt;/li>
&lt;li>Cross Multiplication&lt;/li>
&lt;/ul>
&lt;p>A linear system can have exactly one of the following (a quick example of the unique-solution case follows the list):&lt;/p>
&lt;ul>
&lt;li>&lt;strong>no solution&lt;/strong>&lt;/li>
&lt;li>&lt;strong>a unique solution&lt;/strong>&lt;/li>
&lt;li>&lt;strong>infinitely many solutions&lt;/strong>&lt;/li>
&lt;/ul>
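&lt;p>For the unique-solution case, a minimal sketch assuming NumPy (the system is made up for illustration):&lt;/p>
&lt;pre>&lt;code>import numpy as np

# x + 2y = 5 and 3x + 4y = 11 have the unique solution x = 1, y = 2
A = np.array([[1.0, 2.0],
              [3.0, 4.0]])
b = np.array([5.0, 11.0])
print(np.linalg.solve(A, b))  # [1. 2.]
&lt;/code>&lt;/pre>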
&lt;h2 id="positive-definite-matrices">
 Positive Definite Matrices
 
 &lt;a class="anchor" href="#positive-definite-matrices">#&lt;/a>
 
&lt;/h2>
&lt;p>A square matrix is positive definite if pre-multiplying and post-multiplying it by the same non-zero vector always gives a positive number, no matter which vector we choose.&lt;/p>
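&lt;p>A minimal numerical check, as a sketch assuming NumPy (the matrix and test vector are illustrative; the eigenvalue printout anticipates the property below):&lt;/p>
&lt;pre>&lt;code>import numpy as np

A = np.array([[2.0, -1.0],
              [-1.0, 2.0]])   # symmetric positive definite

x = np.array([3.0, 1.0])      # any non-zero vector
print(x @ A @ x)              # quadratic form, gives 14.0 (positive)
print(np.linalg.eigvalsh(A))  # [1. 3.], all eigenvalues positive
&lt;/code>&lt;/pre>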
&lt;p>Positive definite symmetric matrices have the property that all their eigenvalues are positive.&lt;/p></description></item><item><title>Forward and Backward Substitution</title><link>https://arshadhs.github.io/docs/ai/maths/010-linear-algebra/01-linear-systems/forward-backward/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://arshadhs.github.io/docs/ai/maths/010-linear-algebra/01-linear-systems/forward-backward/</guid><description>&lt;h1 id="forward-and-backward-substitution">
 Forward and Backward Substitution
 
 &lt;a class="anchor" href="#forward-and-backward-substitution">#&lt;/a>
 
&lt;/h1>
&lt;p>Forward and backward substitution are efficient algorithms used to solve linear systems when the coefficient matrix is &lt;strong>triangular&lt;/strong>; a minimal sketch of the forward case follows the list below.&lt;/p>
&lt;p>They are typically used after:&lt;/p>
&lt;ul>
&lt;li>Gaussian elimination&lt;/li>
&lt;li>LU decomposition&lt;/li>
&lt;/ul>
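&lt;p>For the lower triangular case, a minimal sketch assuming NumPy (the helper name &lt;code>forward_substitution&lt;/code> is illustrative):&lt;/p>
&lt;pre>&lt;code>import numpy as np

def forward_substitution(L, b):
    """Solve L x = b for lower triangular L, one row at a time."""
    n = len(b)
    x = np.zeros(n)
    for i in range(n):
        # everything to the left of the diagonal is already known
        x[i] = (b[i] - L[i, :i] @ x[:i]) / L[i, i]
    return x

L = np.array([[2.0, 0.0],
              [1.0, 3.0]])
b = np.array([4.0, 8.0])
print(forward_substitution(L, b))  # [2. 2.]
&lt;/code>&lt;/pre>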
&lt;hr>
&lt;h1 id="1-forward-substitution-lower-triangular-systems">
 1. Forward Substitution (Lower Triangular Systems)
 
 &lt;a class="anchor" href="#1-forward-substitution-lower-triangular-systems">#&lt;/a>
 
&lt;/h1>
&lt;p>Used to solve:&lt;/p>
&lt;blockquote class="book-hint danger">
&lt;span>
 \[ 
L\mathbf{x} = \mathbf{b}
 \]
 &lt;/span>
&lt;/blockquote>
&lt;p>where \( L \) is a &lt;strong>lower triangular matrix&lt;/strong>:&lt;/p></description></item><item><title>Inverse Matrix</title><link>https://arshadhs.github.io/docs/ai/maths/010-linear-algebra/01-linear-systems/inverse-matrix/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://arshadhs.github.io/docs/ai/maths/010-linear-algebra/01-linear-systems/inverse-matrix/</guid><description>&lt;h1 id="inverse-matrix">
 Inverse Matrix
 
 &lt;a class="anchor" href="#inverse-matrix">#&lt;/a>
 
&lt;/h1>
&lt;p>The &lt;strong>inverse of a matrix&lt;/strong> is a matrix that, when multiplied with the original matrix, produces the &lt;strong>identity matrix&lt;/strong>.&lt;/p>
&lt;p>A square matrix \( A \) is &lt;strong>invertible&lt;/strong> if there exists a matrix \( A^{-1} \) such that:&lt;/p>
&lt;blockquote class="book-hint danger">
&lt;span>
 \[ 
AA^{-1} = A^{-1}A = I
 \]
 &lt;/span>
&lt;/blockquote>
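&lt;p>A minimal numerical sketch, assuming NumPy:&lt;/p>
&lt;pre>&lt;code>import numpy as np

A = np.array([[4.0, 7.0],
              [2.0, 6.0]])
A_inv = np.linalg.inv(A)
print(A @ A_inv)  # the identity matrix, up to floating-point rounding
&lt;/code>&lt;/pre>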
&lt;p>Here:&lt;/p></description></item><item><title>Convex Combination</title><link>https://arshadhs.github.io/docs/ai/maths/010-linear-algebra/01-linear-systems/convex/</link><pubDate>Thu, 29 Jan 2026 00:00:00 +0000</pubDate><guid>https://arshadhs.github.io/docs/ai/maths/010-linear-algebra/01-linear-systems/convex/</guid><description>&lt;h1 id="convex-combination-of-two-points">
 Convex Combination of Two Points
 
 &lt;a class="anchor" href="#convex-combination-of-two-points">#&lt;/a>
 
&lt;/h1>
&lt;p>A &lt;strong>convex combination&lt;/strong> describes how to form a point between two points using weighted averages; a small numerical sketch follows the list below.&lt;/p>
&lt;p>It is a fundamental building block in several advanced fields:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Linear Algebra &amp;amp; Geometry&lt;/strong>&lt;/li>
&lt;li>&lt;strong>Optimization Theory&lt;/strong>&lt;/li>
&lt;li>&lt;strong>Machine Learning&lt;/strong> (Specifically in SVMs, clustering, and data interpolation)&lt;/li>
&lt;/ul>
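&lt;p>A minimal sketch of such a weighted average, assuming NumPy (&lt;code>lam&lt;/code> stands for the weight \( \lambda \) in the definition below):&lt;/p>
&lt;pre>&lt;code>import numpy as np

x1 = np.array([0.0, 0.0])
x2 = np.array([4.0, 2.0])

lam = 0.25                     # weight between 0 and 1
x = lam * x1 + (1 - lam) * x2
print(x)  # [3.  1.5], a quarter of the way from x2 towards x1
&lt;/code>&lt;/pre>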
&lt;hr>
&lt;p>Given two points (or vectors) $\mathbf{x}_1, \mathbf{x}_2 \in \mathbb{R}^n$, a convex combination of these points is defined as:&lt;/p>
$$\mathbf{x} = \lambda \mathbf{x}_1 + (1 - \lambda)\mathbf{x}_2$$&lt;p>&lt;strong>Where:&lt;/strong>&lt;/p></description></item><item><title>Vector Spaces</title><link>https://arshadhs.github.io/docs/ai/maths/010-linear-algebra/02-vector-spaces/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://arshadhs.github.io/docs/ai/maths/010-linear-algebra/02-vector-spaces/</guid><description>&lt;h1 id="vector-spaces">
 Vector Spaces
 
 &lt;a class="anchor" href="#vector-spaces">#&lt;/a>
 
&lt;/h1>
&lt;p>A vector space is the mathematical “home” where vectors live and where addition and scaling are valid operations.&lt;/p>
&lt;ul>
&lt;li>
&lt;p>A vector space is a set closed under vector addition and scalar multiplication.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Machine learning operates in vector spaces.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>This part covers independence, bases, rank, and geometric tools like norms and inner products, which are used to measure length, distance, and angles.&lt;/p>
&lt;/li>
&lt;/ul>
&lt;p>A &lt;strong>vector space&lt;/strong> is a set of vectors that follows &lt;strong>ten axioms&lt;/strong>, defined under two operations:&lt;/p></description></item><item><title>Feature Space</title><link>https://arshadhs.github.io/docs/ai/maths/010-linear-algebra/02-vector-spaces/feature-space/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://arshadhs.github.io/docs/ai/maths/010-linear-algebra/02-vector-spaces/feature-space/</guid><description>&lt;h1 id="feature">
 Feature
 
 &lt;a class="anchor" href="#feature">#&lt;/a>
 
&lt;/h1>
&lt;p>A &lt;strong>feature&lt;/strong> is an individual measurable property or characteristic of a data point used as input to a machine learning model.&lt;/p>
&lt;p>Each feature corresponds to &lt;strong>one dimension&lt;/strong>.&lt;/p>
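&lt;p>For instance, a data point with three features lives in a 3-dimensional feature space (a minimal sketch assuming NumPy; the feature names are made up):&lt;/p>
&lt;pre>&lt;code>import numpy as np

# one data point: [height_cm, weight_kg, age_years], three features
x = np.array([170.0, 65.0, 30.0])
print(x.shape)  # (3,)
&lt;/code>&lt;/pre>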


&lt;span>
 \[ 
x_i \in \mathbb{R}
 \]
 &lt;/span>


&lt;p>A data point with \( d \) features is represented as:&lt;/p></description></item><item><title>Cauchy–Schwarz</title><link>https://arshadhs.github.io/docs/ai/maths/010-linear-algebra/02-vector-spaces/cauchyschwarz/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://arshadhs.github.io/docs/ai/maths/010-linear-algebra/02-vector-spaces/cauchyschwarz/</guid><description>&lt;h1 id="cauchyschwarz-inequality">
 Cauchy–Schwarz Inequality
 
 &lt;a class="anchor" href="#cauchyschwarz-inequality">#&lt;/a>
 
&lt;/h1>
&lt;p>The &lt;strong>Cauchy–Schwarz Inequality&lt;/strong> is one of the most important results in linear algebra.&lt;/p>
&lt;p>It places a fundamental bound on the inner product of two vectors.&lt;/p>
&lt;blockquote class="book-hint info">
&lt;p>If you see &lt;strong>angle&lt;/strong>, &lt;strong>cosine&lt;/strong>, &lt;strong>similarity&lt;/strong>, or &lt;strong>inner product bounds&lt;/strong>&lt;br>
→ think &lt;strong>Cauchy–Schwarz Inequality&lt;/strong>&lt;/p>
&lt;p>Key Idea:
The magnitude of the inner product (dot product) can never exceed the product of the vector magnitudes.
This ensures all geometric interpretations (angles, cosine) are valid.&lt;/p>
&lt;/blockquote>
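&lt;p>A quick numerical check of the bound, as a sketch assuming NumPy (the vectors are arbitrary):&lt;/p>
&lt;pre>&lt;code>import numpy as np

x = np.array([1.0, 2.0, 2.0])
y = np.array([3.0, 0.0, 4.0])

lhs = abs(x @ y)                             # magnitude of the inner product
rhs = np.linalg.norm(x) * np.linalg.norm(y)  # product of the norms
print(lhs, rhs)  # 11.0 15.0, so the inequality holds
&lt;/code>&lt;/pre>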
&lt;hr>
&lt;h2 id="statement-of-the-inequality">
 Statement of the Inequality
 
 &lt;a class="anchor" href="#statement-of-the-inequality">#&lt;/a>
 
&lt;/h2>
&lt;p>For any vectors:&lt;/p></description></item><item><title>Matrix Decompositions</title><link>https://arshadhs.github.io/docs/ai/maths/010-linear-algebra/03-matrix-decomposition/</link><pubDate>Wed, 18 Mar 2026 00:00:00 +0000</pubDate><guid>https://arshadhs.github.io/docs/ai/maths/010-linear-algebra/03-matrix-decomposition/</guid><description>&lt;h1 id="matrix-decompositions">
 Matrix Decompositions
 
 &lt;a class="anchor" href="#matrix-decompositions">#&lt;/a>
 
&lt;/h1>
&lt;p>Decompositions reveal structure in matrices and power algorithms like PCA.&lt;/p>
&lt;p>Matrix decompositions break complex matrices into simpler parts.&lt;/p>
&lt;p>From the lecture introduction, matrices are used to describe mappings and transformations of vectors.&lt;/p>
&lt;p>That is why decomposition is important:
it lets us understand a complicated transformation by rewriting it using simpler building blocks.&lt;/p>
&lt;p>In the slides, the topic is introduced as part of three closely connected goals:
how to summarise matrices,
how matrices can be decomposed,
and how the decompositions can be used for matrix approximations.&lt;/p></description></item><item><title>Characteristic Polynomial</title><link>https://arshadhs.github.io/docs/ai/maths/010-linear-algebra/03-matrix-decomposition/characteristic-polynomial/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://arshadhs.github.io/docs/ai/maths/010-linear-algebra/03-matrix-decomposition/characteristic-polynomial/</guid><description>&lt;h1 id="characteristic-polynomial">
 Characteristic Polynomial
 
 &lt;a class="anchor" href="#characteristic-polynomial">#&lt;/a>
 
&lt;/h1>
&lt;p>The &lt;strong>characteristic polynomial&lt;/strong> of a square matrix is the key tool used to compute &lt;strong>eigenvalues&lt;/strong>.&lt;/p>
&lt;p>It connects:&lt;/p>
&lt;ul>
&lt;li>Determinants&lt;/li>
&lt;li>Trace&lt;/li>
&lt;li>Eigenvalues&lt;/li>
&lt;li>Matrix structure&lt;/li>
&lt;/ul>
&lt;hr>
&lt;h2 id="definition">
 Definition
 
 &lt;a class="anchor" href="#definition">#&lt;/a>
 
&lt;/h2>
&lt;p>Let&lt;br>

&lt;span>
 \( A \in \mathbb{R}^{n \times n} \)
 &lt;/span>

&lt;br>
and 
&lt;span>
 \( \lambda \in \mathbb{R} \)
 &lt;/span>

.&lt;/p>
&lt;p>The &lt;strong>characteristic polynomial&lt;/strong> of \( A \) is defined as:&lt;/p>
&lt;blockquote class="book-hint danger">
&lt;span>
 \[ 
p_A(\lambda) = \det(A - \lambda I)
 \]
 &lt;/span>
&lt;/blockquote>
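&lt;p>A minimal sketch assuming NumPy, whose &lt;code>np.poly&lt;/code> returns the characteristic-polynomial coefficients of a square matrix:&lt;/p>
&lt;pre>&lt;code>import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])

# coefficients, highest degree first: lambda^2 - 4*lambda + 3,
# i.e. lambda^2 - trace(A)*lambda + det(A) for this 2x2 matrix
print(np.poly(A))  # [ 1. -4.  3.]
&lt;/code>&lt;/pre>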
&lt;p>It is a polynomial in 
&lt;span>
 \( \lambda \)
 &lt;/span>

 of degree \( n \).&lt;/p></description></item><item><title>Determinant and Trace</title><link>https://arshadhs.github.io/docs/ai/maths/010-linear-algebra/03-matrix-decomposition/010-determinant-and-trace/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://arshadhs.github.io/docs/ai/maths/010-linear-algebra/03-matrix-decomposition/010-determinant-and-trace/</guid><description>&lt;h1 id="determinant-and-trace">
 Determinant and Trace
 
 &lt;a class="anchor" href="#determinant-and-trace">#&lt;/a>
 
&lt;/h1>
&lt;hr>
&lt;h2 id="minor">
 Minor
 
 &lt;a class="anchor" href="#minor">#&lt;/a>
 
&lt;/h2>
&lt;p>The &lt;strong>minor&lt;/strong> of an element 
&lt;span>
 \( a_{ij} \)
 &lt;/span>

 is the determinant of the smaller square matrix formed by:&lt;/p>
&lt;ul>
&lt;li>removing &lt;strong>row&lt;/strong> 
&lt;span>
 \( i \)
 &lt;/span>

&lt;/li>
&lt;li>removing &lt;strong>column&lt;/strong> 
&lt;span>
 \( j \)
 &lt;/span>

&lt;/li>
&lt;/ul>
&lt;p>The minor is denoted 
&lt;span>
 \( M_{ij} \)
 &lt;/span>

.&lt;/p>
&lt;blockquote class="book-hint info">
&lt;p>Minors are used to compute &lt;strong>cofactors&lt;/strong>, which are used for determinants and inverses (via adjoint/adjugate).&lt;/p>
&lt;/blockquote>
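&lt;p>A minimal sketch of computing one minor, assuming NumPy (delete row \( i \), delete column \( j \), take the determinant):&lt;/p>
&lt;pre>&lt;code>import numpy as np

A = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0],
              [7.0, 8.0, 10.0]])

i, j = 0, 1  # the minor M_01
sub = np.delete(np.delete(A, i, axis=0), j, axis=1)
print(np.linalg.det(sub))  # det([[4, 6], [7, 10]]) = -2.0
&lt;/code>&lt;/pre>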
&lt;hr>
&lt;h2 id="cofactor">
 Cofactor
 
 &lt;a class="anchor" href="#cofactor">#&lt;/a>
 
&lt;/h2>
&lt;p>The &lt;strong>cofactor&lt;/strong> of 
&lt;span>
 \( a_{ij} \)
 &lt;/span>

, denoted 
&lt;span>
 \( C_{ij} \)
 &lt;/span>

, is:&lt;/p></description></item><item><title>Eigenvalues and Eigenvectors</title><link>https://arshadhs.github.io/docs/ai/maths/010-linear-algebra/03-matrix-decomposition/020-eigenvalues-and-eigenvectors/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://arshadhs.github.io/docs/ai/maths/010-linear-algebra/03-matrix-decomposition/020-eigenvalues-and-eigenvectors/</guid><description>&lt;h1 id="eigenvalues-and-eigenvectors">
 Eigenvalues and Eigenvectors
 
 &lt;a class="anchor" href="#eigenvalues-and-eigenvectors">#&lt;/a>
 
&lt;/h1>
&lt;ul>
&lt;li>Eigenvalues give scaling.&lt;/li>
&lt;li>Eigenvectors define invariant directions of transformation.&lt;/li>
&lt;/ul>
&lt;p>Eigenvalues and eigenvectors describe directions that remain unchanged under a linear transformation, except for scaling.&lt;/p>
&lt;p>From lectures:
matrix multiplication represents a transformation of space.&lt;br>
Most vectors change direction and magnitude.&lt;br>
Some special vectors only scale.&lt;br>
These are eigenvectors.&lt;/p>
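&lt;p>A minimal numerical sketch assuming NumPy, checking \( A\mathbf{v} = \lambda\mathbf{v} \) for one eigenpair:&lt;/p>
&lt;pre>&lt;code>import numpy as np

A = np.array([[2.0, 0.0],
              [0.0, 3.0]])

eigvals, eigvecs = np.linalg.eig(A)
v = eigvecs[:, 0]             # first eigenvector (a column)
print(A @ v, eigvals[0] * v)  # both sides agree: [2. 0.] [2. 0.]
&lt;/code>&lt;/pre>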
&lt;blockquote class="book-hint info">
&lt;p>Key Idea:
A matrix transformation stretches or compresses vectors.
Eigenvectors are directions that remain unchanged.
Eigenvalues tell how much scaling happens.&lt;/p></description></item><item><title>Cholesky Decomposition</title><link>https://arshadhs.github.io/docs/ai/maths/010-linear-algebra/03-matrix-decomposition/030-cholesky-decomposition/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://arshadhs.github.io/docs/ai/maths/010-linear-algebra/03-matrix-decomposition/030-cholesky-decomposition/</guid><description>&lt;h1 id="cholesky-decomposition">
 Cholesky Decomposition
 
 &lt;a class="anchor" href="#cholesky-decomposition">#&lt;/a>
 
&lt;/h1>
&lt;p>Cholesky decomposition is a special matrix factorisation used for symmetric positive definite matrices.&lt;/p>
&lt;p>From lecture discussions, this decomposition is powerful because it reduces a matrix into a triangular form, making computations easier and more stable.&lt;/p>
&lt;blockquote class="book-hint info">
&lt;p>Key Idea:
Cholesky decomposition expresses a matrix as a product of a lower triangular matrix and its transpose.
It is efficient and numerically stable.&lt;/p>
&lt;/blockquote>
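&lt;p>A minimal sketch assuming NumPy:&lt;/p>
&lt;pre>&lt;code>import numpy as np

A = np.array([[4.0, 2.0],
              [2.0, 3.0]])  # symmetric positive definite

L = np.linalg.cholesky(A)   # lower triangular factor
print(L)
print(L @ L.T)              # reconstructs A
&lt;/code>&lt;/pre>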
&lt;hr>
&lt;h2 id="definition">
 Definition
 
 &lt;a class="anchor" href="#definition">#&lt;/a>
 
&lt;/h2>
&lt;p>For a symmetric positive definite matrix:&lt;/p></description></item><item><title>Eigen Decomposition</title><link>https://arshadhs.github.io/docs/ai/maths/010-linear-algebra/03-matrix-decomposition/040-eigen-decomposition/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://arshadhs.github.io/docs/ai/maths/010-linear-algebra/03-matrix-decomposition/040-eigen-decomposition/</guid><description>&lt;h1 id="eigen-decomposition">
 Eigen Decomposition
 
 &lt;a class="anchor" href="#eigen-decomposition">#&lt;/a>
 
&lt;/h1>
&lt;p>Eigen decomposition expresses a matrix using its eigenvectors and eigenvalues.&lt;/p>
&lt;p>From lecture discussions, this is one of the most important ways to understand the internal structure of a matrix.&lt;/p>
&lt;p>Instead of treating the matrix as a black box, eigen decomposition reveals its fundamental directions and scaling behaviour.&lt;/p>
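&lt;p>A minimal sketch assuming NumPy, using the standard factorisation \( A = PDP^{-1} \) with eigenvectors as the columns of \( P \) (the notation is the usual one, not necessarily the lecture&amp;rsquo;s):&lt;/p>
&lt;pre>&lt;code>import numpy as np

A = np.array([[4.0, 1.0],
              [2.0, 3.0]])

eigvals, P = np.linalg.eig(A)    # columns of P are eigenvectors
D = np.diag(eigvals)             # eigenvalues on the diagonal
print(P @ D @ np.linalg.inv(P))  # reconstructs A
&lt;/code>&lt;/pre>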
&lt;blockquote class="book-hint info">
&lt;p>Key Idea:
Eigen decomposition rewrites a matrix in terms of directions (eigenvectors) and scaling factors (eigenvalues).
This makes complex transformations easier to understand and compute.&lt;/p></description></item><item><title>Diagonalization</title><link>https://arshadhs.github.io/docs/ai/maths/010-linear-algebra/03-matrix-decomposition/diagonalization/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://arshadhs.github.io/docs/ai/maths/010-linear-algebra/03-matrix-decomposition/diagonalization/</guid><description>&lt;h1 id="diagonalization">
 Diagonalization
 
 &lt;a class="anchor" href="#diagonalization">#&lt;/a>
 
&lt;/h1>
&lt;p>Diagonalisation expresses a matrix using its eigenvectors and eigenvalues when possible.&lt;/p>
&lt;p>From lecture explanation, diagonalisation is one of the most powerful tools because it converts a complicated matrix into a much simpler form.&lt;/p>
&lt;p>Instead of working with a full matrix, we work with a diagonal matrix, which is much easier to analyse and compute.&lt;/p>
&lt;blockquote class="book-hint info">
&lt;p>Key Idea:
If a matrix has enough independent eigenvectors, it can be rewritten as a diagonal matrix using a change of basis.
This simplifies matrix operations significantly.&lt;/p></description></item><item><title>Singular Value Decomposition (SVD)</title><link>https://arshadhs.github.io/docs/ai/maths/010-linear-algebra/03-matrix-decomposition/050-singular-value-decomposition/</link><pubDate>Wed, 18 Mar 2026 00:00:00 +0000</pubDate><guid>https://arshadhs.github.io/docs/ai/maths/010-linear-algebra/03-matrix-decomposition/050-singular-value-decomposition/</guid><description>&lt;h1 id="singular-value-decomposition-svd">
 Singular Value Decomposition (SVD)
 
 &lt;a class="anchor" href="#singular-value-decomposition-svd">#&lt;/a>
 
&lt;/h1>
&lt;p>Singular Value Decomposition (SVD) is one of the most important matrix decomposition techniques in linear algebra and machine learning.&lt;/p>
&lt;p>It factorises any matrix into three simpler matrices that reveal its structure.&lt;/p>
&lt;blockquote class="book-hint info">
&lt;p>Key Idea:
SVD decomposes a matrix into rotations + scaling.
It tells us how data is transformed along orthogonal directions.&lt;/p>
&lt;/blockquote>
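&lt;p>A minimal sketch assuming NumPy:&lt;/p>
&lt;pre>&lt;code>import numpy as np

A = np.array([[3.0, 1.0, 1.0],
              [-1.0, 3.0, 1.0]])

U, s, Vt = np.linalg.svd(A, full_matrices=False)
print(s)                    # singular values, largest first
print(U @ np.diag(s) @ Vt)  # reconstructs A
&lt;/code>&lt;/pre>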
&lt;hr>
&lt;h1 id="definition">
 Definition
 
 &lt;a class="anchor" href="#definition">#&lt;/a>
 
&lt;/h1>
&lt;p>For any matrix in real space:

&lt;span style="color: green;">
 &lt;span>
 \[ 
A \in \mathbb{R}^{m \times n}
 \]
 &lt;/span>

&lt;/span>&lt;/p></description></item><item><title>Matrix Approximation</title><link>https://arshadhs.github.io/docs/ai/maths/010-linear-algebra/03-matrix-decomposition/060-matrix-approximation/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://arshadhs.github.io/docs/ai/maths/010-linear-algebra/03-matrix-decomposition/060-matrix-approximation/</guid><description>&lt;h1 id="matrix-approximation">
 Matrix Approximation
 
 &lt;a class="anchor" href="#matrix-approximation">#&lt;/a>
 
&lt;/h1>
&lt;p>Low-rank approximation keeps the most important structure while reducing noise and computation.&lt;/p>
&lt;hr>
&lt;h2 id="low-rank-approximation">
 Low-Rank Approximation
 
 &lt;a class="anchor" href="#low-rank-approximation">#&lt;/a>
 
&lt;/h2>
&lt;p>Used for:&lt;/p>
&lt;ul>
&lt;li>Dimensionality reduction&lt;/li>
&lt;li>Noise removal&lt;/li>
&lt;li>Efficient computation&lt;/li>
&lt;/ul>
&lt;p>Forms the basis of &lt;strong>PCA&lt;/strong>.&lt;/p>
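&lt;p>A minimal sketch of a rank-\( k \) approximation via truncated SVD, assuming NumPy (the data is random, purely for illustration):&lt;/p>
&lt;pre>&lt;code>import numpy as np

A = np.random.default_rng(0).normal(size=(6, 5))

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2                                        # keep the 2 largest singular values
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
print(np.linalg.norm(A - A_k))               # remaining error (Frobenius norm)
&lt;/code>&lt;/pre>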
&lt;hr>
&lt;p>&lt;a href="https://arshadhs.github.io/">Home&lt;/a> | &lt;a href="https://arshadhs.github.io/docs/ai/maths/010-linear-algebra/03-matrix-decomposition/">
 Matrix Decompositions
&lt;/a>&lt;/p></description></item><item><title>Vector Calculus</title><link>https://arshadhs.github.io/docs/ai/maths/020-calculus/04-vector-calculus/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://arshadhs.github.io/docs/ai/maths/020-calculus/04-vector-calculus/</guid><description>&lt;h1 id="vector-calculus">
 Vector Calculus
 
 &lt;a class="anchor" href="#vector-calculus">#&lt;/a>
 
&lt;/h1>
&lt;p>Vector calculus extends differentiation to multivariate and vector-valued functions.&lt;/p>
&lt;p>Gradients power learning. This section builds differentiation skills needed for backpropagation.&lt;/p>
&lt;hr>




&lt;ul>
 
 
 
 
 &lt;li>
 &lt;a href="https://arshadhs.github.io/docs/ai/maths/020-calculus/04-vector-calculus/010-univariate-differentiation/">Differentiation of Univariate Functions&lt;/a>
 &lt;/li>
 
 
 
 
 &lt;li>
 &lt;a href="https://arshadhs.github.io/docs/ai/maths/020-calculus/04-vector-calculus/020-partial-derivatives-and-gradients/">Partial Differentiation and Gradients&lt;/a>
 &lt;/li>
 
 
 
 
 &lt;li>
 &lt;a href="https://arshadhs.github.io/docs/ai/maths/020-calculus/04-vector-calculus/030-vector-and-matrix-gradients/">Gradients of Vector-Valued and Matrix Functions&lt;/a>
 &lt;/li>
 
 
 
 
 &lt;li>
 &lt;a href="https://arshadhs.github.io/docs/ai/maths/020-calculus/04-vector-calculus/050-gradient-identities/">Useful Gradient Identities&lt;/a>
 &lt;/li>
 
 
 
 
 &lt;li>
 &lt;a href="https://arshadhs.github.io/docs/ai/maths/020-calculus/04-vector-calculus/060-backpropagation/">Backpropagation and Automatic Differentiation&lt;/a>
 &lt;/li>
 
 
 
 
 &lt;li>
 &lt;a href="https://arshadhs.github.io/docs/ai/maths/020-calculus/04-vector-calculus/070-higher-order-derivatives/">Higher-order derivatives&lt;/a>
 &lt;/li>
 
 
 
 
 &lt;li>
 &lt;a href="https://arshadhs.github.io/docs/ai/maths/020-calculus/04-vector-calculus/080-taylors-series/">Taylor’s series&lt;/a>
 &lt;/li>
 
 
 
 
 &lt;li>
 &lt;a href="https://arshadhs.github.io/docs/ai/maths/020-calculus/04-vector-calculus/090-maxima-and-minima/">Maxima and Minima&lt;/a>
 &lt;/li>
 
 

 
 
&lt;/ul>


&lt;hr>






&lt;pre class="mermaid">
flowchart TD

 %% Core Node
 PD[&amp;#34;Partial Derivatives&amp;#34;]

 %% Supporting Concepts
 DQ[&amp;#34;Difference Quotient&amp;#34;]
 JH[&amp;#34;Jacobian / Hessian&amp;#34;]
 TS[&amp;#34;Taylor Series&amp;#34;]

 %% Application Chapters
 CH6[&amp;#34;Chapter 6&amp;lt;br/&amp;gt;Probability&amp;#34;]
 CH7[&amp;#34;Chapter 7&amp;lt;br/&amp;gt;Optimization&amp;#34;]
 CH9[&amp;#34;Chapter 9&amp;lt;br/&amp;gt;Regression&amp;#34;]
 CH10[&amp;#34;Chapter 10&amp;lt;br/&amp;gt;Dimensionality Reduction&amp;#34;]
 CH11[&amp;#34;Chapter 11&amp;lt;br/&amp;gt;Density Estimation&amp;#34;]
 CH12[&amp;#34;Chapter 12&amp;lt;br/&amp;gt;Classification&amp;#34;]

 %% Relationships
 DQ --&amp;gt;|defines| PD
 PD --&amp;gt;|collected in| JH
 JH --&amp;gt;|used in| TS
 JH --&amp;gt;|used in| CH6
	
 PD --&amp;gt;|used in| CH7
 PD --&amp;gt;|used in| CH9
 PD --&amp;gt;|used in| CH10
 PD --&amp;gt;|used in| CH11
 PD --&amp;gt;|used in| CH12

 %% Styling (Your Soft Academic Palette)
 style PD fill:#90CAF9,stroke:#1E88E5,color:#000

 style DQ fill:#CE93D8,stroke:#8E24AA,color:#000
 style JH fill:#CE93D8,stroke:#8E24AA,color:#000
 style TS fill:#CE93D8,stroke:#8E24AA,color:#000
 style CH6 fill:#CE93D8,stroke:#8E24AA,color:#000
	
 style CH7 fill:#C8E6C9,stroke:#2E7D32,color:#000
 style CH9 fill:#C8E6C9,stroke:#2E7D32,color:#000
 style CH10 fill:#C8E6C9,stroke:#2E7D32,color:#000
 style CH11 fill:#C8E6C9,stroke:#2E7D32,color:#000
 style CH12 fill:#C8E6C9,stroke:#2E7D32,color:#000

&lt;/pre>

&lt;hr>
&lt;p>&lt;a href="https://arshadhs.github.io/">Home&lt;/a> | &lt;a href="https://arshadhs.github.io/docs/ai/maths/020-calculus/">
 Calculus
&lt;/a>&lt;/p></description></item><item><title>Continuous Optimisation</title><link>https://arshadhs.github.io/docs/ai/maths/020-calculus/05-optimisation/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://arshadhs.github.io/docs/ai/maths/020-calculus/05-optimisation/</guid><description>&lt;h1 id="continuous-optimisation">
 Continuous Optimisation
 
 &lt;a class="anchor" href="#continuous-optimisation">#&lt;/a>
 
&lt;/h1>
&lt;p>Optimisation finds parameters that minimise (or maximise) an objective function.&lt;/p>
&lt;hr>




&lt;ul>
 
 
 
 
 &lt;li>
 &lt;a href="https://arshadhs.github.io/docs/ai/maths/020-calculus/05-optimisation/gradient-descent/">Optimisation using Gradient Descent&lt;/a>
 &lt;/li>
 
 
 
 
 &lt;li>
 &lt;a href="https://arshadhs.github.io/docs/ai/maths/020-calculus/05-optimisation/constrained-optimisation/">Constrained Optimisation&lt;/a>
 &lt;/li>
 
 
 
 
 &lt;li>
 &lt;a href="https://arshadhs.github.io/docs/ai/maths/020-calculus/05-optimisation/lagrange-multipliers/">Lagrange Multipliers&lt;/a>
 &lt;/li>
 
 
 
 
 &lt;li>
 &lt;a href="https://arshadhs.github.io/docs/ai/maths/020-calculus/05-optimisation/convex-optimisation/">Convex Optimisation&lt;/a>
 &lt;/li>
 
 

 
 
&lt;/ul>


&lt;hr>
&lt;p>&lt;a href="https://arshadhs.github.io/">Home&lt;/a> | &lt;a href="https://arshadhs.github.io/docs/ai/maths/020-calculus/">
 Calculus
&lt;/a>&lt;/p></description></item><item><title>Optimisation using Gradient Descent</title><link>https://arshadhs.github.io/docs/ai/maths/020-calculus/05-optimisation/gradient-descent/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://arshadhs.github.io/docs/ai/maths/020-calculus/05-optimisation/gradient-descent/</guid><description>&lt;h1 id="optimisation-using-gradient-descent">
 Optimisation using Gradient Descent
 
 &lt;a class="anchor" href="#optimisation-using-gradient-descent">#&lt;/a>
 
&lt;/h1>
&lt;p>Gradient descent is an optimisation algorithm used to train ML models and neural networks.&lt;/p>
&lt;ul>
&lt;li>Gradient descent updates parameters by moving opposite the gradient.&lt;/li>
&lt;/ul>
&lt;p>It trains a model by minimising the error:&lt;/p>
&lt;ul>
&lt;li>between predicted and actual results&lt;/li>
&lt;li>by iteratively adjusting its parameters&lt;/li>
&lt;li>moving step‑by‑step in the direction of the steepest decrease in the loss function, which helps the model learn the weights that give the best predictions (a minimal sketch of the update follows the list of variants below)&lt;/li>
&lt;/ul>
&lt;hr>
&lt;h2 id="types-of-gradient-descent-learning-algorithms">
 Types of Gradient Descent learning algorithms
 
 &lt;a class="anchor" href="#types-of-gradient-descent-learning-algorithms">#&lt;/a>
 
&lt;/h2>
&lt;ol>
&lt;li>Batch gradient descent&lt;/li>
&lt;li>Stochastic gradient descent&lt;/li>
&lt;li>Mini-batch gradient descent&lt;/li>
&lt;/ol>
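&lt;p>A minimal sketch of the basic update on a toy one-parameter loss (plain Python; the loss is made up for illustration):&lt;/p>
&lt;pre>&lt;code># minimise L(theta) = (theta - 3)^2, whose gradient is 2 * (theta - 3)
theta, lr = 0.0, 0.1        # initial parameter and learning rate
for _ in range(100):
    grad = 2 * (theta - 3)
    theta -= lr * grad      # step opposite the gradient
print(theta)                # approximately 3.0, the minimiser
&lt;/code>&lt;/pre>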
&lt;hr>
&lt;p>&lt;a href="https://arshadhs.github.io/">Home&lt;/a> | &lt;a href="https://arshadhs.github.io/docs/ai/maths/020-calculus/05-optimisation/">
 Continuous Optimisation
&lt;/a>&lt;/p></description></item><item><title>Constrained Optimisation</title><link>https://arshadhs.github.io/docs/ai/maths/020-calculus/05-optimisation/constrained-optimisation/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://arshadhs.github.io/docs/ai/maths/020-calculus/05-optimisation/constrained-optimisation/</guid><description>&lt;h1 id="constrained-optimisation">
 Constrained Optimisation
 
 &lt;a class="anchor" href="#constrained-optimisation">#&lt;/a>
 
&lt;/h1>
&lt;p>Optimisation with constraints (equalities/inequalities).&lt;/p>
&lt;hr>
&lt;p>&lt;a href="https://arshadhs.github.io/">Home&lt;/a> | &lt;a href="https://arshadhs.github.io/docs/ai/maths/020-calculus/05-optimisation/">
 Continuous Optimisation
&lt;/a>&lt;/p></description></item><item><title>Lagrange Multipliers</title><link>https://arshadhs.github.io/docs/ai/maths/020-calculus/05-optimisation/lagrange-multipliers/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://arshadhs.github.io/docs/ai/maths/020-calculus/05-optimisation/lagrange-multipliers/</guid><description>&lt;h1 id="lagrange-multipliers">
 Lagrange Multipliers
 
 &lt;a class="anchor" href="#lagrange-multipliers">#&lt;/a>
 
&lt;/h1>
&lt;p>Transforms constrained problems into unconstrained ones using Lagrangians.&lt;/p>
&lt;hr>
&lt;p>&lt;a href="https://arshadhs.github.io/">Home&lt;/a> | &lt;a href="https://arshadhs.github.io/docs/ai/maths/020-calculus/05-optimisation/">
 Continuous Optimisation
&lt;/a>&lt;/p></description></item><item><title>Convex Optimisation</title><link>https://arshadhs.github.io/docs/ai/maths/020-calculus/05-optimisation/convex-optimisation/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://arshadhs.github.io/docs/ai/maths/020-calculus/05-optimisation/convex-optimisation/</guid><description>&lt;h1 id="convex-optimisation">
 Convex Optimisation
 
 &lt;a class="anchor" href="#convex-optimisation">#&lt;/a>
 
&lt;/h1>
&lt;p>For a convex objective every local minimum is a global minimum, which makes optimisation reliable.&lt;/p>
&lt;hr>
&lt;p>&lt;a href="https://arshadhs.github.io/">Home&lt;/a> | &lt;a href="https://arshadhs.github.io/docs/ai/maths/020-calculus/05-optimisation/">
 Continuous Optimisation
&lt;/a>&lt;/p></description></item><item><title>Nonlinear Optimisation</title><link>https://arshadhs.github.io/docs/ai/maths/020-calculus/06-nonlinear-optimisation/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://arshadhs.github.io/docs/ai/maths/020-calculus/06-nonlinear-optimisation/</guid><description>&lt;h1 id="nonlinear-optimisation-in-machine-learning">
 Nonlinear Optimisation in Machine Learning
 
 &lt;a class="anchor" href="#nonlinear-optimisation-in-machine-learning">#&lt;/a>
 
&lt;/h1>
&lt;p>Practical training challenges and modern optimisers used in ML.&lt;/p>
&lt;hr>




&lt;ul>
 
 
 
 
 &lt;li>
 &lt;a href="https://arshadhs.github.io/docs/ai/maths/020-calculus/06-nonlinear-optimisation/optimisation-challenges/">Challenges in Gradient-Based Optimisation&lt;/a>
 &lt;/li>
 
 
 
 
 &lt;li>
 &lt;a href="https://arshadhs.github.io/docs/ai/maths/020-calculus/06-nonlinear-optimisation/stochastic-gradient-descent/">Stochastic Gradient Descent (SGD)&lt;/a>
 &lt;/li>
 
 
 
 
 &lt;li>
 &lt;a href="https://arshadhs.github.io/docs/ai/maths/020-calculus/06-nonlinear-optimisation/momentum-methods/">Momentum-Based Learning&lt;/a>
 &lt;/li>
 
 
 
 
 &lt;li>
 &lt;a href="https://arshadhs.github.io/docs/ai/maths/020-calculus/06-nonlinear-optimisation/adaptive-methods/">Adaptive Methods: AdaGrad, RMSProp, Adam&lt;/a>
 &lt;/li>
 
 
 
 
 &lt;li>
 &lt;a href="https://arshadhs.github.io/docs/ai/maths/020-calculus/06-nonlinear-optimisation/hyperparameter-tuning/">Tuning Hyperparameters and Preprocessing&lt;/a>
 &lt;/li>
 
 

 
 
&lt;/ul>


&lt;hr>
&lt;p>&lt;a href="https://arshadhs.github.io/">Home&lt;/a> | &lt;a href="https://arshadhs.github.io/docs/ai/maths/020-calculus/">
 Calculus
&lt;/a>&lt;/p></description></item><item><title>Challenges in Gradient-Based Optimisation</title><link>https://arshadhs.github.io/docs/ai/maths/020-calculus/06-nonlinear-optimisation/optimisation-challenges/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://arshadhs.github.io/docs/ai/maths/020-calculus/06-nonlinear-optimisation/optimisation-challenges/</guid><description>&lt;h1 id="challenges-in-gradient-based-optimisation">
 Challenges in Gradient-Based Optimisation
 
 &lt;a class="anchor" href="#challenges-in-gradient-based-optimisation">#&lt;/a>
 
&lt;/h1>
&lt;ul>
&lt;li>Local optima and flat regions&lt;/li>
&lt;li>Differential curvature&lt;/li>
&lt;li>Difficult topologies (cliffs and valleys)&lt;/li>
&lt;/ul>
&lt;hr>
&lt;p>&lt;a href="https://arshadhs.github.io/">Home&lt;/a> | &lt;a href="https://arshadhs.github.io/docs/ai/maths/020-calculus/06-nonlinear-optimisation/">
 Nonlinear Optimisation
&lt;/a>&lt;/p></description></item><item><title>Stochastic Gradient Descent (SGD)</title><link>https://arshadhs.github.io/docs/ai/maths/020-calculus/06-nonlinear-optimisation/stochastic-gradient-descent/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://arshadhs.github.io/docs/ai/maths/020-calculus/06-nonlinear-optimisation/stochastic-gradient-descent/</guid><description>&lt;h1 id="stochastic-gradient-descent-sgd">
 Stochastic Gradient Descent (SGD)
 
 &lt;a class="anchor" href="#stochastic-gradient-descent-sgd">#&lt;/a>
 
&lt;/h1>
&lt;p>SGD uses mini-batches to trade exact gradients for speed and generalisation.&lt;/p>
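&lt;p>A minimal sketch of mini-batch SGD on a toy least-squares problem, assuming NumPy (data, batch size, and learning rate are illustrative):&lt;/p>
&lt;pre>&lt;code>import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true

w, lr, batch = np.zeros(3), 0.1, 10
for epoch in range(50):
    idx = rng.permutation(len(y))        # reshuffle each epoch
    for start in range(0, len(y), batch):
        b = idx[start:start + batch]     # one mini-batch
        grad = 2 * X[b].T @ (X[b] @ w - y[b]) / batch
        w -= lr * grad                   # noisy gradient step
print(w)  # close to [ 1.  -2.   0.5]
&lt;/code>&lt;/pre>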
&lt;hr>
&lt;p>&lt;a href="https://arshadhs.github.io/">Home&lt;/a> | &lt;a href="https://arshadhs.github.io/docs/ai/maths/020-calculus/06-nonlinear-optimisation/">
 Nonlinear Optimisation
&lt;/a>&lt;/p></description></item><item><title>Momentum-Based Learning</title><link>https://arshadhs.github.io/docs/ai/maths/020-calculus/06-nonlinear-optimisation/momentum-methods/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://arshadhs.github.io/docs/ai/maths/020-calculus/06-nonlinear-optimisation/momentum-methods/</guid><description>&lt;h1 id="momentum-based-learning">
 Momentum-Based Learning
 
 &lt;a class="anchor" href="#momentum-based-learning">#&lt;/a>
 
&lt;/h1>
&lt;p>Momentum smooths updates and helps traverse valleys efficiently.&lt;/p>
&lt;hr>
&lt;p>&lt;a href="https://arshadhs.github.io/">Home&lt;/a> | &lt;a href="https://arshadhs.github.io/docs/ai/maths/020-calculus/06-nonlinear-optimisation/">
 Nonlinear Optimisation
&lt;/a>&lt;/p></description></item><item><title>Adaptive Methods: AdaGrad, RMSProp, Adam</title><link>https://arshadhs.github.io/docs/ai/maths/020-calculus/06-nonlinear-optimisation/adaptive-methods/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://arshadhs.github.io/docs/ai/maths/020-calculus/06-nonlinear-optimisation/adaptive-methods/</guid><description>&lt;h1 id="adaptive-methods-adagrad-rmsprop-adam">
 Adaptive Methods: AdaGrad, RMSProp, Adam
 
 &lt;a class="anchor" href="#adaptive-methods-adagrad-rmsprop-adam">#&lt;/a>
 
&lt;/h1>
&lt;p>Adaptive methods adjust learning rates per-parameter.&lt;/p>
&lt;hr>
&lt;p>&lt;a href="https://arshadhs.github.io/">Home&lt;/a> | &lt;a href="https://arshadhs.github.io/docs/ai/maths/020-calculus/06-nonlinear-optimisation/">
 Nonlinear Optimisation
&lt;/a>&lt;/p></description></item><item><title>Tuning Hyperparameters and Preprocessing</title><link>https://arshadhs.github.io/docs/ai/maths/020-calculus/06-nonlinear-optimisation/hyperparameter-tuning/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://arshadhs.github.io/docs/ai/maths/020-calculus/06-nonlinear-optimisation/hyperparameter-tuning/</guid><description>&lt;h1 id="tuning-hyperparameters-and-preprocessing">
 Tuning Hyperparameters and Preprocessing
 
 &lt;a class="anchor" href="#tuning-hyperparameters-and-preprocessing">#&lt;/a>
 
&lt;/h1>
&lt;ul>
&lt;li>Learning rate schedules&lt;/li>
&lt;li>Initialisation&lt;/li>
&lt;li>Tuning hyperparameters&lt;/li>
&lt;li>Importance of feature preprocessing&lt;/li>
&lt;/ul>
&lt;hr>
&lt;p>&lt;a href="https://arshadhs.github.io/">Home&lt;/a> | &lt;a href="https://arshadhs.github.io/docs/ai/maths/020-calculus/06-nonlinear-optimisation/">
 Nonlinear Optimisation
&lt;/a>&lt;/p></description></item><item><title>Dimensionality reduction and PCA</title><link>https://arshadhs.github.io/docs/ai/maths/010-linear-algebra/07-dimensionality-reduction/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://arshadhs.github.io/docs/ai/maths/010-linear-algebra/07-dimensionality-reduction/</guid><description>&lt;h1 id="dimensionality-reduction-and-pca">
 Dimensionality reduction and PCA
 
 &lt;a class="anchor" href="#dimensionality-reduction-and-pca">#&lt;/a>
 
&lt;/h1>
&lt;p>PCA and SVM connect linear algebra, geometry, and optimisation.&lt;/p>
&lt;hr>




&lt;ul>
 
 
 
 
 &lt;li>
 &lt;a href="https://arshadhs.github.io/docs/ai/maths/010-linear-algebra/07-dimensionality-reduction/pca/">Principal Component Analysis (PCA)&lt;/a>
 &lt;/li>
 
 
 
 
 &lt;li>
 &lt;a href="https://arshadhs.github.io/docs/ai/maths/010-linear-algebra/07-dimensionality-reduction/pca-theory/">PCA Theory&lt;/a>
 &lt;/li>
 
 
 
 
 &lt;li>
 &lt;a href="https://arshadhs.github.io/docs/ai/maths/010-linear-algebra/07-dimensionality-reduction/pca-practice/">PCA in Practice&lt;/a>
 &lt;/li>
 
 
 
 
 &lt;li>
 &lt;a href="https://arshadhs.github.io/docs/ai/maths/010-linear-algebra/07-dimensionality-reduction/latent-variable-view/">Latent Variable Perspective&lt;/a>
 &lt;/li>
 
 
 
 
 &lt;li>
 &lt;a href="https://arshadhs.github.io/docs/ai/maths/010-linear-algebra/07-dimensionality-reduction/svm-mathematical-foundations/">Mathematical Preliminaries of SVM&lt;/a>
 &lt;/li>
 
 
 
 
 &lt;li>
 &lt;a href="https://arshadhs.github.io/docs/ai/maths/010-linear-algebra/07-dimensionality-reduction/kernels/">Nonlinear SVM and Kernels&lt;/a>
 &lt;/li>
 
 

 
 
&lt;/ul>


&lt;hr>
&lt;p>&lt;a href="https://arshadhs.github.io/">Home&lt;/a> | &lt;a href="https://arshadhs.github.io/docs/ai/maths/010-linear-algebra/">
 Linear Algebra
&lt;/a>&lt;/p></description></item><item><title>Principal Component Analysis (PCA)</title><link>https://arshadhs.github.io/docs/ai/maths/010-linear-algebra/07-dimensionality-reduction/pca/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://arshadhs.github.io/docs/ai/maths/010-linear-algebra/07-dimensionality-reduction/pca/</guid><description>&lt;h1 id="principal-component-analysis-pca">
 Principal Component Analysis (PCA)
 
 &lt;a class="anchor" href="#principal-component-analysis-pca">#&lt;/a>
 
&lt;/h1>
&lt;ul>
&lt;li>A &lt;strong>dimensionality reduction&lt;/strong> technique.&lt;/li>
&lt;li>Helps us &lt;strong>reduce the number of features&lt;/strong> in a dataset while keeping the most important information.&lt;/li>
&lt;li>Simplifies complex datasets by transforming correlated features into a smaller set of uncorrelated components.&lt;/li>
&lt;li>Uses &lt;strong>linear algebra&lt;/strong> to transform data into &lt;strong>new features&lt;/strong> called principal components.&lt;/li>
&lt;li>Finds these by calculating &lt;strong>eigenvectors (directions)&lt;/strong> and &lt;strong>eigenvalues (importance)&lt;/strong> from the &lt;strong>covariance matrix&lt;/strong>.&lt;/li>
&lt;li>&lt;strong>Selects the top components with the highest eigenvalues&lt;/strong> and &lt;strong>projects the data onto them to simplify the dataset&lt;/strong> (a sketch follows the list).&lt;/li>
&lt;/ul>
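&lt;p>A minimal sketch of these steps, assuming NumPy (random data, purely for illustration):&lt;/p>
&lt;pre>&lt;code>import numpy as np

X = np.random.default_rng(0).normal(size=(200, 5))  # 200 samples, 5 features

Xc = X - X.mean(axis=0)                # 1. centre the data
C = np.cov(Xc, rowvar=False)           # 2. covariance matrix (5 x 5)
eigvals, eigvecs = np.linalg.eigh(C)   # 3. eigenvalues in ascending order
top2 = eigvecs[:, -2:]                 # directions with largest eigenvalues
Z = Xc @ top2                          # 4. project onto 2 components
print(Z.shape)                         # (200, 2)
&lt;/code>&lt;/pre>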
&lt;blockquote class="book-hint default">
&lt;p>PCA prioritizes the directions where the data varies the most because more variation = more useful information.&lt;/p></description></item><item><title>PCA Theory</title><link>https://arshadhs.github.io/docs/ai/maths/010-linear-algebra/07-dimensionality-reduction/pca-theory/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://arshadhs.github.io/docs/ai/maths/010-linear-algebra/07-dimensionality-reduction/pca-theory/</guid><description>&lt;h1 id="pca-theory">
 PCA Theory
 
 &lt;a class="anchor" href="#pca-theory">#&lt;/a>
 
&lt;/h1>
&lt;ul>
&lt;li>Problem setting&lt;/li>
&lt;li>Maximum variance perspective&lt;/li>
&lt;li>Projection perspective&lt;/li>
&lt;li>Eigenvector and low-rank approximations&lt;/li>
&lt;/ul>
&lt;hr>
&lt;p>&lt;a href="https://arshadhs.github.io/">Home&lt;/a> | &lt;a href="https://arshadhs.github.io/docs/ai/maths/010-linear-algebra/07-dimensionality-reduction/">
 Dimensionality reduction and PCA
&lt;/a>&lt;/p></description></item><item><title>PCA in Practice</title><link>https://arshadhs.github.io/docs/ai/maths/010-linear-algebra/07-dimensionality-reduction/pca-practice/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://arshadhs.github.io/docs/ai/maths/010-linear-algebra/07-dimensionality-reduction/pca-practice/</guid><description>&lt;h1 id="pca-in-practice">
 PCA in Practice
 
 &lt;a class="anchor" href="#pca-in-practice">#&lt;/a>
 
&lt;/h1>
&lt;p>Key steps of PCA in practice, including considerations in high dimensions.&lt;/p>
&lt;hr>
&lt;p>&lt;a href="https://arshadhs.github.io/">Home&lt;/a> | &lt;a href="https://arshadhs.github.io/docs/ai/maths/010-linear-algebra/07-dimensionality-reduction/">
 Dimensionality reduction and PCA
&lt;/a>&lt;/p></description></item><item><title>Latent Variable Perspective</title><link>https://arshadhs.github.io/docs/ai/maths/010-linear-algebra/07-dimensionality-reduction/latent-variable-view/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://arshadhs.github.io/docs/ai/maths/010-linear-algebra/07-dimensionality-reduction/latent-variable-view/</guid><description>&lt;h1 id="latent-variable-perspective">
 Latent Variable Perspective
 
 &lt;a class="anchor" href="#latent-variable-perspective">#&lt;/a>
 
&lt;/h1>
&lt;p>PCA can be interpreted as modelling data using a smaller number of latent variables.&lt;/p>
&lt;hr>
&lt;p>&lt;a href="https://arshadhs.github.io/">Home&lt;/a> | &lt;a href="https://arshadhs.github.io/docs/ai/maths/010-linear-algebra/07-dimensionality-reduction/">
 Dimensionality reduction and PCA
&lt;/a>&lt;/p></description></item><item><title>Mathematical Preliminaries of SVM</title><link>https://arshadhs.github.io/docs/ai/maths/010-linear-algebra/07-dimensionality-reduction/svm-mathematical-foundations/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://arshadhs.github.io/docs/ai/maths/010-linear-algebra/07-dimensionality-reduction/svm-mathematical-foundations/</guid><description>&lt;h1 id="mathematical-preliminaries-of-svm">
 Mathematical Preliminaries of SVM
 
 &lt;a class="anchor" href="#mathematical-preliminaries-of-svm">#&lt;/a>
 
&lt;/h1>
&lt;ul>
&lt;li>Primal and dual perspectives&lt;/li>
&lt;li>Geometry of margins&lt;/li>
&lt;/ul>
&lt;hr>
&lt;p>&lt;a href="https://arshadhs.github.io/">Home&lt;/a> | &lt;a href="https://arshadhs.github.io/docs/ai/maths/010-linear-algebra/07-dimensionality-reduction/">
 Dimensionality reduction and PCA
&lt;/a>&lt;/p></description></item><item><title>Nonlinear SVM and Kernels</title><link>https://arshadhs.github.io/docs/ai/maths/010-linear-algebra/07-dimensionality-reduction/kernels/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://arshadhs.github.io/docs/ai/maths/010-linear-algebra/07-dimensionality-reduction/kernels/</guid><description>&lt;h1 id="nonlinear-svm-and-kernels">
 Nonlinear SVM and Kernels
 
 &lt;a class="anchor" href="#nonlinear-svm-and-kernels">#&lt;/a>
 
&lt;/h1>
&lt;p>Kernels allow inner products in high-dimensional feature spaces without explicit mapping.&lt;/p>
&lt;hr>
&lt;p>&lt;a href="https://arshadhs.github.io/">Home&lt;/a> | &lt;a href="https://arshadhs.github.io/docs/ai/maths/010-linear-algebra/07-dimensionality-reduction/">
 Dimensionality reduction and PCA
&lt;/a>&lt;/p></description></item><item><title>AI Learning Resources</title><link>https://arshadhs.github.io/docs/ai/foundation/ai-notes/</link><pubDate>Sat, 03 Jan 2026 12:00:00 +0100</pubDate><guid>https://arshadhs.github.io/docs/ai/foundation/ai-notes/</guid><description>&lt;h1 id="ai-learning-resources">
 AI Learning Resources
 
 &lt;a class="anchor" href="#ai-learning-resources">#&lt;/a>
 
&lt;/h1>
&lt;p>A curated list of &lt;strong>high-quality online courses&lt;/strong> to learn Artificial Intelligence, Machine Learning, and Deep Learning from reputable universities and organisations.&lt;/p>
&lt;hr>
&lt;h2 id="recommended-books--references">
 Recommended Books &amp;amp; References
 
 &lt;a class="anchor" href="#recommended-books--references">#&lt;/a>
 
&lt;/h2>
&lt;hr>
&lt;h3 id="deep-neural-networks-dnn">
 Deep Neural Networks (DNN)
 
 &lt;a class="anchor" href="#deep-neural-networks-dnn">#&lt;/a>
 
&lt;/h3>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>Deep Learning&lt;/strong>. MIT Press.&lt;br>
Goodfellow, I., Bengio, Y., &amp;amp; Courville, A. (2016).&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Introduction to Deep Learning&lt;/strong>. MIT Press.&lt;br>
Charniak, E. (2019).&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Deep Learning with Python&lt;/strong>. Simon &amp;amp; Schuster.&lt;br>
Chollet, F. (2021).&lt;/p></description></item><item><title>ML Pipeline</title><link>https://arshadhs.github.io/docs/ai/machine-learning/99-ml-pipeline-model/</link><pubDate>Tue, 21 Apr 2026 00:00:00 +0000</pubDate><guid>https://arshadhs.github.io/docs/ai/machine-learning/99-ml-pipeline-model/</guid><description>&lt;h1 id="machine-learning-pipeline-preprocessing--models">
 Machine Learning Pipeline: Preprocessing &amp;amp; Models
 
 &lt;a class="anchor" href="#machine-learning-pipeline-preprocessing--models">#&lt;/a>
 
&lt;/h1>
&lt;p>This page explains both &lt;strong>data preprocessing&lt;/strong> and &lt;strong>model development concepts&lt;/strong> in a clear, structured way to support understanding.&lt;/p>
&lt;blockquote class="book-hint info">
&lt;p>A complete ML pipeline includes preprocessing, feature engineering, feature selection, and model training.&lt;/p>
&lt;/blockquote>
&lt;hr>
&lt;h1 id="1-data-preprocessing-overview">
 1. Data Preprocessing Overview
 
 &lt;a class="anchor" href="#1-data-preprocessing-overview">#&lt;/a>
 
&lt;/h1>
&lt;p>Raw data is often:&lt;/p>
&lt;ul>
&lt;li>Noisy&lt;/li>
&lt;li>Incomplete&lt;/li>
&lt;li>Inconsistent&lt;/li>
&lt;/ul>
&lt;p>Preprocessing ensures data is suitable for machine learning.&lt;/p>
&lt;hr>
&lt;h1 id="2-missing-values">
 2. Missing Values
 
 &lt;a class="anchor" href="#2-missing-values">#&lt;/a>
 
&lt;/h1>
&lt;p>&lt;strong>Why they occur&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>Sensor errors&lt;/li>
&lt;li>Data collection issues&lt;/li>
&lt;/ul>
&lt;p>&lt;strong>Methods&lt;/strong>&lt;/p></description></item></channel></rss>