<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>ML on Arshad Siddiqui</title><link>https://arshadhs.github.io/categories/ml/</link><description>Recent content in ML on Arshad Siddiqui</description><generator>Hugo</generator><language>en-us</language><lastBuildDate>Wed, 18 Mar 2026 00:00:00 +0000</lastBuildDate><atom:link href="https://arshadhs.github.io/categories/ml/index.xml" rel="self" type="application/rss+xml"/><item><title>Supervised Learning</title><link>https://arshadhs.github.io/docs/ai/machine-learning/ml-supervised/</link><pubDate>Sat, 03 Jan 2026 10:29:52 +0100</pubDate><guid>https://arshadhs.github.io/docs/ai/machine-learning/ml-supervised/</guid><description>&lt;h1 id="supervised-learning">
 Supervised Learning
 
 &lt;a class="anchor" href="#supervised-learning">#&lt;/a>
 
&lt;/h1>
&lt;p>Trained using &lt;strong>labelled data&lt;/strong>.&lt;br>
Each example in the training set includes the &lt;strong>correct output&lt;/strong>.&lt;br>
The algorithm learns to &lt;strong>generalise&lt;/strong> and make predictions on unseen data.&lt;br>
Generally more &lt;strong>accurate&lt;/strong> than unsupervised methods.&lt;br>
Requires &lt;strong>human intervention&lt;/strong> for labelling and setup.&lt;br>
Widely used because, given good-quality labelled data, it delivers &lt;strong>accurate and efficient&lt;/strong> results.&lt;/p>
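&lt;p>A minimal hedged sketch of this workflow (an added illustration, not the article's own code): fit a model on labelled pairs, then predict an unseen input. The toy data and the choice of scikit-learn's LogisticRegression are assumptions.&lt;/p>
&lt;pre>&lt;code class="language-python"># Hedged sketch: supervised learning = labelled examples in, predictions out.
# Toy data; assumes scikit-learn is installed.
from sklearn.linear_model import LogisticRegression

X_train = [[1.0], [2.0], [3.0], [4.0]]   # features
y_train = [0, 0, 1, 1]                   # correct outputs (labels)

model = LogisticRegression()
model.fit(X_train, y_train)              # learn from labelled data
print(model.predict([[2.5]]))            # generalise to an unseen input
&lt;/code>&lt;/pre>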
&lt;hr>
&lt;h2 id="classification">
 Classification
 
 &lt;a class="anchor" href="#classification">#&lt;/a>
 
&lt;/h2>
&lt;p>Output is &lt;strong>discrete&lt;/strong> (e.g. Yes/No, Spam/Not Spam).&lt;br>
Used for &lt;strong>categorising data&lt;/strong> into predefined classes.&lt;br>
Support Vector Machine (SVM) is a common classifier (a linear classifier with margin-based separation).&lt;/p></description></item><item><title>Differentiation of Univariate Functions</title><link>https://arshadhs.github.io/docs/ai/maths/020-calculus/04-vector-calculus/010-univariate-differentiation/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://arshadhs.github.io/docs/ai/maths/020-calculus/04-vector-calculus/010-univariate-differentiation/</guid><description>&lt;h1 id="differentiation-of-univariate-functions">
 Differentiation of Univariate Functions
 
 &lt;a class="anchor" href="#differentiation-of-univariate-functions">#&lt;/a>
 
&lt;/h1>
&lt;p>Differentiation measures the rate of change. For a function f(x), the derivative f'(x) gives the instantaneous rate of change of f at x.&lt;/p>
&lt;span style="color: red;">
 $[
f'(x) = $lim_{h $to 0} $frac{f(x+h)-f(x)}{h}
$]
&lt;/span>
&lt;p>Interpretation:&lt;/p>
&lt;ul>
&lt;li>Slope of tangent&lt;/li>
&lt;li>Instantaneous rate of change&lt;/li>
&lt;/ul>
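&lt;p>A small numeric sketch of the limit definition (added illustration; the function x**2 is an assumption): shrink h and the difference quotient approaches the derivative.&lt;/p>
&lt;pre>&lt;code class="language-python"># Hedged sketch: approximate f'(x) from the limit definition by shrinking h.
def f(x):
    return x ** 2            # example function (assumed for illustration)

x = 3.0
for h in [0.1, 0.01, 0.001]:
    approx = (f(x + h) - f(x)) / h
    print(h, approx)         # values approach the true derivative 2x = 6
&lt;/code>&lt;/pre>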
&lt;hr>
&lt;p>&lt;a href="https://arshadhs.github.io/">Home&lt;/a> | &lt;a href="https://arshadhs.github.io/docs/ai/maths/020-calculus/04-vector-calculus/">
 Vector Calculus
&lt;/a>&lt;/p></description></item><item><title>Unsupervised Learning</title><link>https://arshadhs.github.io/docs/ai/machine-learning/ml-unsupervised/</link><pubDate>Sat, 03 Jan 2026 10:29:52 +0100</pubDate><guid>https://arshadhs.github.io/docs/ai/machine-learning/ml-unsupervised/</guid><description>&lt;h1 id="unsupervised-learning">
 Unsupervised Learning
 
 &lt;a class="anchor" href="#unsupervised-learning">#&lt;/a>
 
&lt;/h1>
&lt;ul>
&lt;li>Works on &lt;strong>unlabelled raw data&lt;/strong>.&lt;/li>
&lt;li>The algorithm &lt;strong>discovers hidden patterns&lt;/strong> without prior knowledge of outcomes.&lt;/li>
&lt;li>Requires &lt;strong>no human intervention&lt;/strong> during training.&lt;/li>
&lt;li>Does not make direct predictions — it &lt;strong>groups or organises data&lt;/strong> instead.&lt;/li>
&lt;li>Carries a &lt;strong>higher risk&lt;/strong> because there’s no ground truth to verify results.&lt;/li>
&lt;li>Common techniques include &lt;strong>Clustering&lt;/strong>, &lt;strong>Association&lt;/strong>, and &lt;strong>Dimensionality Reduction&lt;/strong>.&lt;/li>
&lt;/ul>
&lt;hr>


&lt;script src="https://arshadhs.github.io/mermaid.min.js">&lt;/script>

 &lt;script>mermaid.initialize({
 "flowchart": {
 "useMaxWidth":true
 },
 "theme": "default"
}
)&lt;/script>




&lt;pre class="mermaid">
stateDiagram-v2

 %% ML maths-based colours (same palette as supervised)
 classDef probability fill:#d1fae5,stroke:#065f46,stroke-width:1px
 classDef geometry fill:#ffedd5,stroke:#9a3412,stroke-width:1px
 classDef category font-style:italic,font-weight:bold,fill:#f3f4f6,stroke:#374151

 %% Root
 USL: Unsupervised Learning

 %% Main branches
 USL --&amp;gt; CLU:::category
 CLU: Clustering

 USL --&amp;gt; DR:::category
 DR: Dimensionality Reduction

 %% Clustering algorithms
 CLU --&amp;gt; KM:::geometry
 KM: K-Means

 CLU --&amp;gt; HC:::geometry
 HC: Hierarchical Clustering

 CLU --&amp;gt; DB:::geometry
 DB: DBSCAN

 %% Probabilistic models
 USL --&amp;gt; PM:::category
 PM: Probabilistic Models

 PM --&amp;gt; GMM:::probability
 GMM: Gaussian Mixture Model

 PM --&amp;gt; HMM:::probability
 HMM: Hidden Markov Model
&lt;/pre>

&lt;hr>
&lt;h2 id="clustering">
 Clustering
 
 &lt;a class="anchor" href="#clustering">#&lt;/a>
 
&lt;/h2>
&lt;ul>
&lt;li>Groups &lt;strong>similar data points&lt;/strong> together based on shared features.&lt;/li>
&lt;li>Commonly used for &lt;strong>market segmentation&lt;/strong>, &lt;strong>image compression&lt;/strong>, and &lt;strong>anomaly detection&lt;/strong>.&lt;/li>
&lt;/ul>
&lt;h3 id="common-types-of-clustering">
 Common Types of Clustering
 
 &lt;a class="anchor" href="#common-types-of-clustering">#&lt;/a>
 
&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>K-Means Clustering&lt;/strong> – Divides data into &lt;em>K&lt;/em> groups based on similarity.&lt;/li>
&lt;li>&lt;strong>Hierarchical Clustering&lt;/strong> – Builds a hierarchy (tree) of clusters.&lt;/li>
&lt;li>&lt;strong>DBSCAN (Density-Based Spatial Clustering)&lt;/strong> – Groups points close in density; identifies noise/outliers.&lt;/li>
&lt;/ul>
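&lt;p>A hedged K-Means sketch (added here for illustration; the 2-D points are made up and scikit-learn is assumed):&lt;/p>
&lt;pre>&lt;code class="language-python"># Hedged sketch: divide toy 2-D points into K = 2 groups by similarity.
from sklearn.cluster import KMeans

X = [[1, 2], [1, 4], [1, 0],
     [10, 2], [10, 4], [10, 0]]

km = KMeans(n_clusters=2, n_init=10).fit(X)
print(km.labels_)            # cluster index per point
print(km.cluster_centers_)   # one centroid per cluster
&lt;/code>&lt;/pre>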
&lt;hr>
&lt;h2 id="association">
 Association
 
 &lt;a class="anchor" href="#association">#&lt;/a>
 
&lt;/h2>
&lt;ul>
&lt;li>Identifies &lt;strong>relationships or correlations&lt;/strong> between variables in a dataset.&lt;/li>
&lt;li>Commonly used in &lt;strong>market basket analysis&lt;/strong> (e.g. &amp;ldquo;Customers who bought X also bought Y&amp;rdquo;).&lt;/li>
&lt;/ul>
&lt;h3 id="common-techniques">
 Common Techniques
 
 &lt;a class="anchor" href="#common-techniques">#&lt;/a>
 
&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Apriori Algorithm&lt;/strong> – Finds frequent itemsets and generates association rules.&lt;/li>
&lt;li>&lt;strong>Eclat Algorithm&lt;/strong> – Similar to Apriori but uses set intersections for faster computation.&lt;/li>
&lt;/ul>
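&lt;p>A hedged sketch of the counting step behind Apriori (illustration only; the baskets and support threshold are assumptions):&lt;/p>
&lt;pre>&lt;code class="language-python"># Hedged sketch: count item pairs across toy baskets and keep the
# "frequent" ones, the starting point of Apriori-style rule mining.
from itertools import combinations
from collections import Counter

baskets = [{"milk", "bread"}, {"milk", "eggs"}, {"milk", "bread", "eggs"}]
min_support = 2

pair_counts = Counter()
for b in baskets:
    for pair in combinations(sorted(b), 2):
        pair_counts[pair] += 1

frequent = {p: c for p, c in pair_counts.items() if c &amp;gt;= min_support}
print(frequent)   # ('bread', 'milk') and ('eggs', 'milk') appear twice
&lt;/code>&lt;/pre>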
&lt;hr>
&lt;h2 id="dimensionality-reduction">
 Dimensionality Reduction
 
 &lt;a class="anchor" href="#dimensionality-reduction">#&lt;/a>
 
&lt;/h2>
&lt;ul>
&lt;li>Reduces the &lt;strong>number of input variables&lt;/strong> to simplify data.&lt;/li>
&lt;li>Helps remove noise and redundancy.&lt;/li>
&lt;li>Commonly used in &lt;strong>data pre-processing&lt;/strong> and &lt;strong>visualisation&lt;/strong>.&lt;/li>
&lt;/ul>
&lt;h3 id="common-techniques-1">
 Common Techniques
 
 &lt;a class="anchor" href="#common-techniques-1">#&lt;/a>
 
&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Principal Component Analysis (PCA)&lt;/strong> – Projects data onto fewer dimensions while keeping most variance.&lt;/li>
&lt;li>&lt;strong>Linear Discriminant Analysis (LDA)&lt;/strong> – Focuses on class separation.&lt;/li>
&lt;li>&lt;strong>t-SNE (t-Distributed Stochastic Neighbour Embedding)&lt;/strong> – Used for visualising high-dimensional data.&lt;/li>
&lt;li>&lt;strong>Autoencoders&lt;/strong> – Neural networks that compress and reconstruct data.&lt;/li>
&lt;/ul>
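&lt;p>A hedged PCA sketch via the SVD (illustration only; numpy and the toy data are assumptions):&lt;/p>
&lt;pre>&lt;code class="language-python"># Hedged sketch: centre the data, take the top singular vector,
# and project from two dimensions down to one (most variance kept).
import numpy as np

X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9],
              [1.9, 2.2], [3.1, 3.0]])       # toy data

Xc = X - X.mean(axis=0)                      # centre
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
X1 = Xc @ Vt[:1].T                           # 1-D projection
print(X1.ravel())
&lt;/code>&lt;/pre>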
&lt;hr>


&lt;pre class="mermaid">
mindmap
  root(Unsupervised Learning)
    Clustering
      K Means
      Hierarchical Clustering
      DBSCAN
    Dimensionality Reduction
      PCA
      t SNE
      Autoencoders
    Probabilistic Models
      Gaussian Mixture Model
      Hidden Markov Model
&lt;/pre>

&lt;hr>
&lt;p>&lt;a href="https://arshadhs.github.io/">Home&lt;/a> | &lt;a href="https://arshadhs.github.io/docs/ai/machine-learning/">
 Machine Learning
&lt;/a>&lt;/p></description></item><item><title>Partial Differentiation and Gradients</title><link>https://arshadhs.github.io/docs/ai/maths/020-calculus/04-vector-calculus/020-partial-derivatives-and-gradients/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://arshadhs.github.io/docs/ai/maths/020-calculus/04-vector-calculus/020-partial-derivatives-and-gradients/</guid><description>&lt;h1 id="partial-differentiation-and-gradients">
 Partial Differentiation and Gradients
 
 &lt;a class="anchor" href="#partial-differentiation-and-gradients">#&lt;/a>
 
&lt;/h1>
&lt;p>For f(x1, x2, &amp;hellip;, xn):&lt;/p>
&lt;span style="color: red;">
 \[
\frac{\partial f}{\partial x_i}
\]
&lt;/span>
&lt;p>Gradient vector:&lt;/p>
&lt;span style="color: red;">
 \[
\nabla f =
\begin{bmatrix}
\frac{\partial f}{\partial x_1} \\
\vdots \\
\frac{\partial f}{\partial x_n}
\end{bmatrix}
\]
&lt;/span>
&lt;p>The gradient points in the direction of steepest ascent.&lt;/p>
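&lt;p>A hedged numeric sketch of partial derivatives and the gradient (illustration only; the test function and step size are assumptions):&lt;/p>
&lt;pre>&lt;code class="language-python"># Hedged sketch: central-difference gradient of f(x1, x2) = x1**2 + 3*x2.
def f(v):
    x1, x2 = v
    return x1 ** 2 + 3 * x2

def grad(f, v, h=1e-6):
    g = []
    for i in range(len(v)):
        up = list(v); up[i] += h          # nudge coordinate i up
        dn = list(v); dn[i] -= h          # and down
        g.append((f(up) - f(dn)) / (2 * h))
    return g

print(grad(f, [2.0, 5.0]))   # close to the analytic gradient [4, 3]
&lt;/code>&lt;/pre>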


&lt;script src="https://arshadhs.github.io/mermaid.min.js">&lt;/script>

 &lt;script>mermaid.initialize({
 "flowchart": {
 "useMaxWidth":true
 },
 "theme": "default"
}
)&lt;/script>




&lt;pre class="mermaid">
flowchart LR
 Input --&amp;gt; Function
 Function --&amp;gt; Gradient
 Gradient --&amp;gt; Optimisation
&lt;/pre>

&lt;hr>
&lt;p>&lt;a href="https://arshadhs.github.io/">Home&lt;/a> | &lt;a href="https://arshadhs.github.io/docs/ai/maths/020-calculus/04-vector-calculus/">
 Vector Calculus
&lt;/a>&lt;/p></description></item><item><title>Linear Independence</title><link>https://arshadhs.github.io/docs/ai/maths/010-linear-algebra/02-vector-spaces/010-linear-independence/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://arshadhs.github.io/docs/ai/maths/010-linear-algebra/02-vector-spaces/010-linear-independence/</guid><description>&lt;h1 id="linear-independence">
 Linear Independence
 
 &lt;a class="anchor" href="#linear-independence">#&lt;/a>
 
&lt;/h1>
&lt;p>A set of vectors is &lt;strong>linearly independent&lt;/strong> if none of them can be written as a linear combination of the others.&lt;/p>

&lt;span style="color: green;">
 &lt;span>
 \[ 
c_1\mathbf{v}_1 + \cdots + c_k\mathbf{v}_k = \mathbf{0}
\;\Rightarrow\;
c_1=\cdots=c_k=0
 \]
 &lt;/span>

&lt;/span>
&lt;p>Independence means each vector adds &lt;strong>new information&lt;/strong>.&lt;/p>
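&lt;p>A hedged numeric check (illustration only; numpy and the example vectors are assumptions): the matrix rank equals the number of vectors exactly when they are independent.&lt;/p>
&lt;pre>&lt;code class="language-python"># Hedged sketch: v3 = v1 + v2, so the set is linearly dependent.
import numpy as np

v1, v2, v3 = [1, 0, 0], [0, 1, 0], [1, 1, 0]
A = np.array([v1, v2, v3]).T          # vectors as columns

rank = np.linalg.matrix_rank(A)
print(rank == A.shape[1])             # False: v3 adds no new information
&lt;/code>&lt;/pre>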
&lt;h2 id="why-it-matters">
 Why it matters
 
 &lt;a class="anchor" href="#why-it-matters">#&lt;/a>
 
&lt;/h2>
&lt;ul>
&lt;li>Detects redundancy&lt;/li>
&lt;li>Connects to rank and basis&lt;/li>
&lt;/ul>
&lt;p>If one vector can already be formed using others, it does not add anything new.&lt;/p></description></item><item><title>Semi-Supervised Learning</title><link>https://arshadhs.github.io/docs/ai/machine-learning/ml-semi-supervised/</link><pubDate>Sat, 03 Jan 2026 10:29:52 +0100</pubDate><guid>https://arshadhs.github.io/docs/ai/machine-learning/ml-semi-supervised/</guid><description>&lt;h1 id="semi-supervised-learning">
 Semi-Supervised Learning
 
 &lt;a class="anchor" href="#semi-supervised-learning">#&lt;/a>
 
&lt;/h1>
&lt;ul>
&lt;li>A combination of &lt;strong>labelled&lt;/strong> and &lt;strong>unlabelled data&lt;/strong>.&lt;/li>
&lt;li>Useful when labelling large datasets is &lt;strong>expensive or time-consuming&lt;/strong>.&lt;/li>
&lt;li>Works well with &lt;strong>high-volume datasets&lt;/strong> (e.g. millions of images).&lt;/li>
&lt;li>Only a &lt;strong>small fraction of data&lt;/strong> is labelled (e.g. a few thousand).&lt;/li>
&lt;li>The algorithm learns from both labelled examples and structure in unlabelled data.&lt;/li>
&lt;li>&lt;strong>Ideal for medical imaging&lt;/strong> where labelled data is limited.&lt;/li>
&lt;li>For example, a &lt;strong>radiologist&lt;/strong> can label a small set of medical scans,&lt;br>
and the model uses that to learn from thousands of unlabelled scans.&lt;/li>
&lt;li>Helps improve &lt;strong>accuracy and generalisation&lt;/strong> with minimal manual effort.&lt;/li>
&lt;/ul>
&lt;hr>
&lt;p>&lt;a href="https://arshadhs.github.io/">Home&lt;/a> | &lt;a href="https://arshadhs.github.io/docs/ai/machine-learning/">
 Machine Learning
&lt;/a>&lt;/p></description></item><item><title>Gradients of Vector-Valued and Matrix Functions</title><link>https://arshadhs.github.io/docs/ai/maths/020-calculus/04-vector-calculus/030-vector-and-matrix-gradients/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://arshadhs.github.io/docs/ai/maths/020-calculus/04-vector-calculus/030-vector-and-matrix-gradients/</guid><description>&lt;h1 id="gradients-of-vector-valued-and-matrix-functions">
 Gradients of Vector-Valued and Matrix Functions
 
 &lt;a class="anchor" href="#gradients-of-vector-valued-and-matrix-functions">#&lt;/a>
 
&lt;/h1>
&lt;p>Covers gradients when outputs or parameters are vectors/matrices.&lt;/p>
&lt;p>If f: R^n -&amp;gt; R^m, the derivative is the Jacobian.&lt;/p>
&lt;span style="color: red;">
 \[
J =
\begin{bmatrix}
\frac{\partial f_1}{\partial x_1} &amp;amp; \dots &amp;amp; \frac{\partial f_1}{\partial x_n} \\
\vdots &amp;amp; \ddots &amp;amp; \vdots \\
\frac{\partial f_m}{\partial x_1} &amp;amp; \dots &amp;amp; \frac{\partial f_m}{\partial x_n}
\end{bmatrix}
\]
&lt;/span>
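&lt;p>A hedged numeric sketch of the Jacobian (illustration only; the example map and numpy are assumptions):&lt;/p>
&lt;pre>&lt;code class="language-python"># Hedged sketch: central-difference Jacobian of f(x, y) = (x*y, x + y).
import numpy as np

def f(v):
    x, y = v
    return np.array([x * y, x + y])

def jacobian(f, v, h=1e-6):
    v = np.asarray(v, dtype=float)
    cols = []
    for i in range(v.size):
        e = np.zeros_like(v); e[i] = h
        cols.append((f(v + e) - f(v - e)) / (2 * h))
    return np.stack(cols, axis=1)     # entry (i, j) is d f_i / d x_j

print(jacobian(f, [2.0, 3.0]))        # close to [[3, 2], [1, 1]]
&lt;/code>&lt;/pre>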
&lt;p>For scalar f(x):&lt;/p>
&lt;span style="color: red;">
 \[
H = \nabla^2 f
\]
&lt;/span>
&lt;p>Hessian captures curvature.&lt;/p></description></item><item><title>Reinforcement Learning</title><link>https://arshadhs.github.io/docs/ai/machine-learning/ml-reinforcement/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://arshadhs.github.io/docs/ai/machine-learning/ml-reinforcement/</guid><description>&lt;h1 id="reinforcement-learning-rl">
 Reinforcement Learning (RL)
 
 &lt;a class="anchor" href="#reinforcement-learning-rl">#&lt;/a>
 
&lt;/h1>
&lt;p>RL is learning by &lt;strong>trial and error&lt;/strong>.&lt;/p>
&lt;p>Reinforcement Learning (RL) is a type of machine learning where an &lt;strong>autonomous agent learns to make decisions by interacting with an environment&lt;/strong>.&lt;/p>
&lt;p>Instead of being told the correct answer, the agent:&lt;/p>
&lt;ul>
&lt;li>takes actions&lt;/li>
&lt;li>observes outcomes&lt;/li>
&lt;li>receives rewards or penalties&lt;/li>
&lt;li>gradually learns a strategy that maximises long-term reward&lt;/li>
&lt;/ul>
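&lt;p>A hedged sketch of that loop (illustration only; the two-armed bandit environment, its reward rates, and the exploration rate are all made-up assumptions):&lt;/p>
&lt;pre>&lt;code class="language-python"># Hedged sketch: act, observe a reward, update a value estimate.
import random

Q = [0.0, 0.0]                       # value estimate per action
counts = [0, 0]
true_reward = [0.2, 0.8]             # hidden from the agent

for step in range(500):
    if random.random() &amp;lt; 0.1:       # occasionally explore
        a = random.randrange(2)
    else:                            # otherwise exploit what is known
        a = max(range(2), key=lambda i: Q[i])
    r = 1.0 if random.random() &amp;lt; true_reward[a] else 0.0
    counts[a] += 1
    Q[a] += (r - Q[a]) / counts[a]   # running-average update

print(Q)   # roughly recovers the true reward rates
&lt;/code>&lt;/pre>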

&lt;blockquote class='book-hint '>
 &lt;p>&lt;strong>Reinforcement Learning teaches an agent how to act, not what to predict.&lt;/strong>&lt;/p></description></item><item><title>Useful Gradient Identities</title><link>https://arshadhs.github.io/docs/ai/maths/020-calculus/04-vector-calculus/050-gradient-identities/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://arshadhs.github.io/docs/ai/maths/020-calculus/04-vector-calculus/050-gradient-identities/</guid><description>&lt;h1 id="useful-gradient-identities">
 Useful Gradient Identities
 
 &lt;a class="anchor" href="#useful-gradient-identities">#&lt;/a>
 
&lt;/h1>
&lt;span style="color: red;">
 [
\nabla (a^T x) = a
]
&lt;/span>
&lt;span style="color: red;">
 [
\nabla (x^T A x) = (A + A^T)x
]
&lt;/span>
&lt;p>If A symmetric:&lt;/p>
&lt;span style="color: red;">
 [
\nabla (x^T A x) = 2Ax
]
&lt;/span>
&lt;p>These are heavily used in &lt;strong>optimisation&lt;/strong>.&lt;/p>
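&lt;p>A hedged numeric check of the second identity (illustration only; numpy and the random point are assumptions):&lt;/p>
&lt;pre>&lt;code class="language-python"># Hedged sketch: compare a finite-difference gradient of x^T A x
# with the closed form (A + A^T) x.
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(3, 3))
x = rng.normal(size=3)

def f(v):
    return v @ A @ v

h = 1e-6
num = np.array([(f(x + h * e) - f(x - h * e)) / (2 * h) for e in np.eye(3)])
print(np.allclose(num, (A + A.T) @ x))   # True
&lt;/code>&lt;/pre>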
&lt;hr>
&lt;p>&lt;a href="https://arshadhs.github.io/">Home&lt;/a> | &lt;a href="https://arshadhs.github.io/docs/ai/maths/020-calculus/04-vector-calculus/">
 Vector Calculus
&lt;/a>&lt;/p></description></item><item><title>Inner Products and Dot Product</title><link>https://arshadhs.github.io/docs/ai/maths/010-linear-algebra/02-vector-spaces/040-inner-products/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://arshadhs.github.io/docs/ai/maths/010-linear-algebra/02-vector-spaces/040-inner-products/</guid><description>&lt;h1 id="inner-products-and-dot-product">
 Inner Products and Dot Product
 
 &lt;a class="anchor" href="#inner-products-and-dot-product">#&lt;/a>
 
&lt;/h1>
&lt;p>An &lt;strong>inner product&lt;/strong> maps two vectors to a &lt;strong>single scalar&lt;/strong>.&lt;/p>
&lt;p>It allows us to measure:&lt;/p>
&lt;ul>
&lt;li>similarity&lt;/li>
&lt;li>vector length&lt;/li>
&lt;li>projections&lt;/li>
&lt;li>orthogonality&lt;/li>
&lt;/ul>
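&lt;p>A hedged dot-product sketch covering those four uses (illustration only; numpy and the vectors are assumptions):&lt;/p>
&lt;pre>&lt;code class="language-python"># Hedged sketch: similarity, length, projection, orthogonality.
import numpy as np

a = np.array([3.0, 4.0])
b = np.array([4.0, 3.0])

print(a @ b)                          # scalar similarity (24.0)
print(np.sqrt(a @ a))                 # length of a (5.0)
print((a @ b) / (b @ b) * b)          # projection of a onto b
print(np.isclose(a @ np.array([-4.0, 3.0]), 0))   # orthogonal pair
&lt;/code>&lt;/pre>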


&lt;script src="https://arshadhs.github.io/mermaid.min.js">&lt;/script>

 &lt;script>mermaid.initialize({
 "flowchart": {
 "useMaxWidth":true
 },
 "theme": "default"
}
)&lt;/script>




&lt;pre class="mermaid">
flowchart TD
T[&amp;#34;Inner&amp;lt;br/&amp;gt;products&amp;lt;br/&amp;gt;(types)&amp;#34;] --&amp;gt; DOT[&amp;#34;Euclidean&amp;lt;br/&amp;gt;Dot product&amp;#34;]
T --&amp;gt; WIP[&amp;#34;Weighted&amp;lt;br/&amp;gt;inner product&amp;#34;]
T --&amp;gt; FN[&amp;#34;Function-space&amp;lt;br/&amp;gt;(integral)&amp;#34;]
T --&amp;gt; HERM[&amp;#34;Complex&amp;lt;br/&amp;gt;Hermitian&amp;#34;]
T --&amp;gt; MAT[&amp;#34;Matrix&amp;lt;br/&amp;gt;inner product&amp;lt;br/&amp;gt;(Frobenius)&amp;#34;]

DOT --&amp;gt; Rn[&amp;#34;Vectors in&amp;lt;br/&amp;gt;
&amp;lt;span&amp;gt;
 \( \mathbb{R}^n \)
 &amp;lt;/span&amp;gt;

&amp;#34;]
WIP --&amp;gt; SPD[&amp;#34;SPD matrix&amp;lt;br/&amp;gt;W&amp;#34;]
FN --&amp;gt; L2[&amp;#34;L2 space&amp;lt;br/&amp;gt;functions&amp;#34;]
HERM --&amp;gt; Cn[&amp;#34;Vectors in&amp;lt;br/&amp;gt;C^n&amp;#34;]
MAT --&amp;gt; Mnm[&amp;#34;Matrices&amp;lt;br/&amp;gt;R^{m×n}&amp;#34;]

style T fill:#90CAF9,stroke:#1E88E5,color:#000

style DOT fill:#C8E6C9,stroke:#2E7D32,color:#000
style WIP fill:#C8E6C9,stroke:#2E7D32,color:#000
style FN fill:#C8E6C9,stroke:#2E7D32,color:#000
style HERM fill:#C8E6C9,stroke:#2E7D32,color:#000
style MAT fill:#C8E6C9,stroke:#2E7D32,color:#000

style Rn fill:#CE93D8,stroke:#8E24AA,color:#000
style SPD fill:#CE93D8,stroke:#8E24AA,color:#000
style L2 fill:#CE93D8,stroke:#8E24AA,color:#000
style Cn fill:#CE93D8,stroke:#8E24AA,color:#000
style Mnm fill:#CE93D8,stroke:#8E24AA,color:#000
&lt;/pre>

&lt;hr>
&lt;h2 id="definition">
 Definition
 
 &lt;a class="anchor" href="#definition">#&lt;/a>
 
&lt;/h2>
&lt;p>For vectors&lt;br>

&lt;span>
 \( \mathbf{a}, \mathbf{b} \in \mathbb{R}^n \)
 &lt;/span>

&lt;/p></description></item><item><title>Backpropagation and Automatic Differentiation</title><link>https://arshadhs.github.io/docs/ai/maths/020-calculus/04-vector-calculus/060-backpropagation/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://arshadhs.github.io/docs/ai/maths/020-calculus/04-vector-calculus/060-backpropagation/</guid><description>&lt;h1 id="backpropagation-and-automatic-differentiation">
 Backpropagation and Automatic Differentiation
 
 &lt;a class="anchor" href="#backpropagation-and-automatic-differentiation">#&lt;/a>
 
&lt;/h1>
&lt;p>Backpropagation applies the chain rule:&lt;/p>
&lt;ul>
&lt;li>efficiently across a computational graph.&lt;/li>
&lt;li>repeatedly.&lt;/li>
&lt;/ul>
&lt;p>Chain rule:&lt;/p>
&lt;span style="color: red;">
 \[
\frac{dL}{dx} = \frac{dL}{dy} \cdot \frac{dy}{dx}
\]
&lt;/span>


&lt;script src="https://arshadhs.github.io/mermaid.min.js">&lt;/script>

 &lt;script>mermaid.initialize({
 "flowchart": {
 "useMaxWidth":true
 },
 "theme": "default"
}
)&lt;/script>




&lt;pre class="mermaid">
flowchart LR
 x --&amp;gt; y
 y --&amp;gt; L
&lt;/pre>

&lt;p>Automatic differentiation computes exact derivatives efficiently using computational graphs.&lt;/p>
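&lt;p>A hedged hand-worked sketch of the x to y to L graph above (illustration only; the squared functions are assumptions):&lt;/p>
&lt;pre>&lt;code class="language-python"># Hedged sketch: forward pass, then the chain rule backwards.
x = 2.0

y = x ** 2               # forward: intermediate value
L = (y - 1.0) ** 2       # forward: loss

dL_dy = 2 * (y - 1.0)    # local derivative at the loss node
dy_dx = 2 * x            # local derivative at the first node
dL_dx = dL_dy * dy_dx    # chain rule: dL/dx = dL/dy * dy/dx

print(dL_dx)             # 2*(x**2 - 1) * 2*x = 24 at x = 2
&lt;/code>&lt;/pre>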
&lt;hr>
&lt;p>&lt;a href="https://arshadhs.github.io/">Home&lt;/a> | &lt;a href="https://arshadhs.github.io/docs/ai/maths/020-calculus/04-vector-calculus/">
 Vector Calculus
&lt;/a>&lt;/p></description></item><item><title>Higher-order derivatives</title><link>https://arshadhs.github.io/docs/ai/maths/020-calculus/04-vector-calculus/070-higher-order-derivatives/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://arshadhs.github.io/docs/ai/maths/020-calculus/04-vector-calculus/070-higher-order-derivatives/</guid><description>&lt;h1 id="higher-order-derivatives">
 Higher-order derivatives
 
 &lt;a class="anchor" href="#higher-order-derivatives">#&lt;/a>
 
&lt;/h1>
&lt;hr>
&lt;p>&lt;a href="https://arshadhs.github.io/">Home&lt;/a> | &lt;a href="https://arshadhs.github.io/docs/ai/maths/020-calculus/04-vector-calculus/">
 Vector Calculus
&lt;/a>&lt;/p></description></item><item><title>Angles and Orthogonality</title><link>https://arshadhs.github.io/docs/ai/maths/010-linear-algebra/02-vector-spaces/060-angles-and-orthogonality/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://arshadhs.github.io/docs/ai/maths/010-linear-algebra/02-vector-spaces/060-angles-and-orthogonality/</guid><description>&lt;h1 id="angles-and-orthogonality">
 Angles and Orthogonality
 
 &lt;a class="anchor" href="#angles-and-orthogonality">#&lt;/a>
 
&lt;/h1>
&lt;p>Once we define an inner product, we can define the &lt;strong>angle between two vectors&lt;/strong>.&lt;/p>
&lt;p>Angles allow us to measure how aligned or different two vectors are in space.&lt;/p>
&lt;blockquote class="book-hint info">
&lt;p>Key Idea:
Angle measures similarity between vectors.
Orthogonality (zero inner product) means no similarity at all.&lt;/p>
&lt;/blockquote>
&lt;h2 id="why-it-matters-in-machine-learning">
 Why It Matters in Machine Learning
 
 &lt;a class="anchor" href="#why-it-matters-in-machine-learning">#&lt;/a>
 
&lt;/h2>
&lt;ul>
&lt;li>PCA produces orthogonal components&lt;/li>
&lt;li>Orthogonal features reduce redundancy&lt;/li>
&lt;li>Gradient directions depend on angle&lt;/li>
&lt;/ul>
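&lt;p>A hedged numeric sketch (illustration only; numpy and the vectors are assumptions): the inner product gives the angle, and a zero inner product means orthogonality.&lt;/p>
&lt;pre>&lt;code class="language-python"># Hedged sketch: cos(theta) = a.b / (|a| |b|).
import numpy as np

a = np.array([1.0, 0.0])
b = np.array([1.0, 1.0])

cos_theta = (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))
print(np.degrees(np.arccos(cos_theta)))          # 45 degrees
print(np.isclose(a @ np.array([0.0, 2.0]), 0))   # orthogonal: no similarity
&lt;/code>&lt;/pre>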
&lt;hr>
&lt;h1 id="angle-formula">
 Angle Formula
 
 &lt;a class="anchor" href="#angle-formula">#&lt;/a>
 
&lt;/h1>
&lt;p>For vectors in n-dimensional space:&lt;/p></description></item><item><title>Taylor’s series</title><link>https://arshadhs.github.io/docs/ai/maths/020-calculus/04-vector-calculus/080-taylors-series/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://arshadhs.github.io/docs/ai/maths/020-calculus/04-vector-calculus/080-taylors-series/</guid><description>&lt;h1 id="linearization-and-multivariate-taylors-series">
 Linearization and multivariate Taylor’s series
 
 &lt;a class="anchor" href="#linearization-and-multivariate-taylors-series">#&lt;/a>
 
&lt;/h1>
&lt;hr>
&lt;p>&lt;a href="https://arshadhs.github.io/">Home&lt;/a> | &lt;a href="https://arshadhs.github.io/docs/ai/maths/020-calculus/04-vector-calculus/">
 Vector Calculus
&lt;/a>&lt;/p></description></item><item><title>Maxima and Minima</title><link>https://arshadhs.github.io/docs/ai/maths/020-calculus/04-vector-calculus/090-maxima-and-minima/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://arshadhs.github.io/docs/ai/maths/020-calculus/04-vector-calculus/090-maxima-and-minima/</guid><description>&lt;h1 id="computing-maxima-and-minima-for-unconstrained-optimization">
 Computing maxima and minima for unconstrained optimization
 
 &lt;a class="anchor" href="#computing-maxima-and-minima-for-unconstrained-optimization">#&lt;/a>
 
&lt;/h1>
&lt;hr>
&lt;p>&lt;a href="https://arshadhs.github.io/">Home&lt;/a> | &lt;a href="https://arshadhs.github.io/docs/ai/maths/020-calculus/04-vector-calculus/">
 Vector Calculus
&lt;/a>&lt;/p></description></item><item><title>AI Stages: ANI, AGI, ASI</title><link>https://arshadhs.github.io/docs/ai/foundation/ai-stages/</link><pubDate>Thu, 04 Jul 2024 10:55:52 +0100</pubDate><guid>https://arshadhs.github.io/docs/ai/foundation/ai-stages/</guid><description>&lt;h1 id="ai-development-stages-ani--agi--asi">
 AI Development Stages: ANI → AGI → ASI
 
 &lt;a class="anchor" href="#ai-development-stages-ani--agi--asi">#&lt;/a>
 
&lt;/h1>
&lt;p>Artificial Intelligence is often described in &lt;strong>three stages&lt;/strong>, based on capability and scope:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>ANI:&lt;/strong> Task-specific intelligence (today’s AI)&lt;/li>
&lt;li>&lt;strong>AGI:&lt;/strong> Human-level general intelligence (future goal)&lt;/li>
&lt;li>&lt;strong>ASI:&lt;/strong> Beyond human intelligence (theoretical)&lt;/li>
&lt;/ul>
&lt;hr>
&lt;p>&lt;img src="https://arshadhs.github.io/images/ai/ai_stages.png" alt="AI Stages" />&lt;/p>
&lt;hr>
&lt;h2 id="ani--artificial-narrow-intelligence">
 ANI — Artificial Narrow Intelligence
 
 &lt;a class="anchor" href="#ani--artificial-narrow-intelligence">#&lt;/a>
 
&lt;/h2>
&lt;ul>
&lt;li>Also called &lt;strong>Weak AI&lt;/strong>&lt;/li>
&lt;li>Designed to perform &lt;strong>one specific task&lt;/strong>&lt;/li>
&lt;li>Operates within a &lt;strong>predefined environment&lt;/strong>&lt;/li>
&lt;li>Cannot generalise beyond its training&lt;/li>
&lt;li>&lt;strong>Most AI systems today are ANI&lt;/strong>&lt;/li>
&lt;/ul>
&lt;p>&lt;strong>examples&lt;/strong>&lt;/p></description></item><item><title>Neural Networks</title><link>https://arshadhs.github.io/docs/ai/deep-learning/010-neural-network/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://arshadhs.github.io/docs/ai/deep-learning/010-neural-network/</guid><description>&lt;h1 id="neural-networks">
 Neural Networks
 
 &lt;a class="anchor" href="#neural-networks">#&lt;/a>
 
&lt;/h1>
&lt;ul>
&lt;li>A &lt;strong>network of artificial neurons&lt;/strong> inspired by how neurons function in the &lt;strong>human brain&lt;/strong>.&lt;/li>
&lt;li>At its core - a &lt;strong>mathematical model&lt;/strong> designed to process and learn from data.&lt;/li>
&lt;li>Neural networks form the &lt;strong>foundation of &lt;a href="https://arshadhs.github.io/docs/ai/deep-learning/">Deep Learning&lt;/a>&lt;/strong> (involves training large and complex networks on vast amounts of data).&lt;/li>
&lt;/ul>
&lt;hr>


&lt;script src="https://arshadhs.github.io/mermaid.min.js">&lt;/script>

 &lt;script>mermaid.initialize({
 "flowchart": {
 "useMaxWidth":true
 },
 "theme": "default"
}
)&lt;/script>




&lt;pre class="mermaid">
flowchart LR
 subgraph subGraph0[&amp;#34;Input Layer&amp;#34;]
 I1((&amp;#34;Input 1&amp;#34;))
 I2((&amp;#34;Input 2&amp;#34;))
 I3((&amp;#34;Input 3&amp;#34;))
 end
 subgraph subGraph1[&amp;#34;Hidden Layer&amp;#34;]
 H1((&amp;#34;Hidden 1&amp;#34;))
 H2((&amp;#34;Hidden 2&amp;#34;))
 H3((&amp;#34;Hidden 3&amp;#34;))
 end
 subgraph subGraph2[&amp;#34;Output Layer&amp;#34;]
 O((&amp;#34;Output&amp;#34;))
 end
 I1 --&amp;gt; H1 &amp;amp; H2 &amp;amp; H3
 I2 --&amp;gt; H1 &amp;amp; H2 &amp;amp; H3
 I3 --&amp;gt; H1 &amp;amp; H2 &amp;amp; H3
 H1 --&amp;gt; O
 H2 --&amp;gt; O
 H3 --&amp;gt; O

 style I1 fill:#C8E6C9
 style I2 fill:#C8E6C9
 style I3 fill:#C8E6C9
 style H1 stroke:#2962FF,fill:#BBDEFB
 style H2 fill:#BBDEFB
 style H3 fill:#BBDEFB
 style O fill:#FFCDD2
 style subGraph0 stroke:none,fill:transparent
 style subGraph1 stroke:none,fill:transparent
 style subGraph2 stroke:none,fill:transparent
&lt;/pre>
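&lt;p>A hedged forward-pass sketch of the 3-3-1 network drawn above (illustration only; the random weights and sigmoid activation are assumptions):&lt;/p>
&lt;pre>&lt;code class="language-python"># Hedged sketch: three inputs, three hidden neurons, one output.
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(3, 3)), np.zeros(3)   # input layer to hidden
W2, b2 = rng.normal(size=(1, 3)), np.zeros(1)   # hidden layer to output

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([0.5, -0.2, 0.8])       # Input 1..3
h = sigmoid(W1 @ x + b1)             # Hidden 1..3
out = sigmoid(W2 @ h + b2)           # Output
print(out)
&lt;/code>&lt;/pre>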

&lt;hr>
&lt;h3 id="structure-of-a-neural-network">
 Structure of a Neural Network
 
 &lt;a class="anchor" href="#structure-of-a-neural-network">#&lt;/a>
 
&lt;/h3>
&lt;p>A typical neural network has &lt;strong>three main layers&lt;/strong>:&lt;/p></description></item><item><title>Machine Learning</title><link>https://arshadhs.github.io/docs/ai/machine-learning/</link><pubDate>Tue, 06 Aug 2024 23:29:52 +0100</pubDate><guid>https://arshadhs.github.io/docs/ai/machine-learning/</guid><description>&lt;h1 id="machine-learning">
 Machine Learning
 
 &lt;a class="anchor" href="#machine-learning">#&lt;/a>
 
&lt;/h1>


&lt;script src="https://arshadhs.github.io/mermaid.min.js">&lt;/script>

 &lt;script>mermaid.initialize({
 "flowchart": {
 "useMaxWidth":true
 },
 "theme": "default"
}
)&lt;/script>




&lt;pre class="mermaid">
stateDiagram-v2

 %% ===== CLASS DEFINITIONS (Math-based colours) =====
 classDef algebra fill:#cfe8ff,stroke:#1e3a8a,stroke-width:1px
 classDef probability fill:#d1fae5,stroke:#065f46,stroke-width:1px
 classDef geometry fill:#ffedd5,stroke:#9a3412,stroke-width:1px
 classDef logic fill:#ede9fe,stroke:#5b21b6,stroke-width:1px
 classDef category font-style:italic,font-weight:bold,fill:#aaaaaa,stroke:#374151,stroke-width:3px

 %% ===== ROOT =====
 ML: Machine Learning

 %% ===== SUPERVISED =====
 ML --&amp;gt; SL:::category
 SL: Supervised Learning

 SL --&amp;gt; Regression
 Regression --&amp;gt; LR:::algebra
 LR: Linear Regression

 LR --&amp;gt; NN:::algebra
 NN: Neural Network

 NN --&amp;gt; DT:::logic
 DT: Decision Tree

 SL --&amp;gt; Classification
 Classification --&amp;gt; NB:::probability
 NB: Naive Bayes

 NB --&amp;gt; KNN:::geometry
 KNN: k-Nearest Neighbours

 KNN --&amp;gt; SVM:::algebra
 SVM: Support Vector Machine
 
 %% ===== UNSUPERVISED =====
 ML --&amp;gt; USL:::category
 USL: Unsupervised Learning

 USL --&amp;gt; Clustering
 Clustering --&amp;gt; KM:::geometry
 KM: K-Means

 KM --&amp;gt; GMM:::probability
 GMM: Gaussian Mixture Model

 GMM --&amp;gt; HMM:::probability
 HMM: Hidden Markov Model

 %% ===== REINFORCEMENT =====
 ML --&amp;gt; RL:::category
 RL: Reinforcement Learning

 RL --&amp;gt; DM:::logic
 DM: Decision Making
&lt;/pre>

&lt;hr>
&lt;details >&lt;summary>Mathematical Legend&lt;/summary>
 &lt;div class="markdown-inner">
&lt;h3 id="algebra--linear-algebra-blue">
 Algebra / Linear Algebra (Blue)
 
 &lt;a class="anchor" href="#algebra--linear-algebra-blue">#&lt;/a>
 
&lt;/h3>
&lt;p>Used heavily when models rely on:&lt;/p></description></item><item><title>Artificial Neuron and Perceptron</title><link>https://arshadhs.github.io/docs/ai/deep-learning/020-perceptron/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://arshadhs.github.io/docs/ai/deep-learning/020-perceptron/</guid><description>&lt;h1 id="artificial-neuron-and-perceptron">
 Artificial Neuron and Perceptron
 
 &lt;a class="anchor" href="#artificial-neuron-and-perceptron">#&lt;/a>
 
&lt;/h1>
&lt;blockquote class="book-hint info">
&lt;p>knowledge in neural networks is stored in &lt;strong>connection weights&lt;/strong>, and learning means &lt;strong>modifying those weights&lt;/strong>.&lt;/p>
&lt;/blockquote>
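&lt;p>A minimal hedged sketch of that idea (illustration only; the toy example, learning rate, and step activation are assumptions): one update nudges the connection weights.&lt;/p>
&lt;pre>&lt;code class="language-python"># Hedged sketch: learning = modifying connection weights.
w = [0.0, 0.0]           # connection weights
b = 0.0
lr = 0.1                 # learning rate

x, target = [1.0, 2.0], 1                   # one labelled example
z = w[0] * x[0] + w[1] * x[1] + b           # weighted sum of inputs
pred = 1 if z &amp;gt; 0 else 0                   # step activation

err = target - pred                         # learning signal
w = [w[0] + lr * err * x[0], w[1] + lr * err * x[1]]
b = b + lr * err                            # knowledge now lives in w and b
print(w, b)
&lt;/code>&lt;/pre>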
&lt;hr>
&lt;h2 id="biological-neuron">
 Biological Neuron
 
 &lt;a class="anchor" href="#biological-neuron">#&lt;/a>
 
&lt;/h2>
&lt;p>A biological neuron is a specialised cell that processes and transmits information through electrical and chemical signals.&lt;/p>
&lt;p>Core components:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Dendrites&lt;/strong>: receive signals from other neurons&lt;/li>
&lt;li>&lt;strong>Cell body (soma)&lt;/strong>: processes incoming signals&lt;/li>
&lt;li>&lt;strong>Axon&lt;/strong>: transmits the output signal&lt;/li>
&lt;li>&lt;strong>Synapses&lt;/strong>: connection points between neurons&lt;/li>
&lt;/ul>
&lt;p>Biological intuition:&lt;/p>
&lt;ul>
&lt;li>many inputs arrive to one neuron&lt;/li>
&lt;li>one neuron can connect out to many neurons&lt;/li>
&lt;li>massive parallelism enables fast perception and recognition&lt;/li>
&lt;/ul>
&lt;hr>
&lt;h2 id="artificial-neuron">
 Artificial Neuron
 
 &lt;a class="anchor" href="#artificial-neuron">#&lt;/a>
 
&lt;/h2>
&lt;p>An artificial neuron is a simplified computational model inspired by biological neurons.&lt;/p></description></item><item><title>ML Workflow</title><link>https://arshadhs.github.io/docs/ai/machine-learning/02-ml-workflow/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://arshadhs.github.io/docs/ai/machine-learning/02-ml-workflow/</guid><description>&lt;h1 id="machine-learning-workflow">
 Machine learning Workflow
 
 &lt;a class="anchor" href="#machine-learning-workflow">#&lt;/a>
 
&lt;/h1>
&lt;p>Data is the foundation of any machine learning system.
Quality of data matters more than model complexity.&lt;/p>
&lt;h3 id="role-of-data">
 Role of Data
 
 &lt;a class="anchor" href="#role-of-data">#&lt;/a>
 
&lt;/h3>
&lt;p>Data determines:&lt;/p>
&lt;ul>
&lt;li>What patterns the model can learn&lt;/li>
&lt;li>How well it generalises&lt;/li>
&lt;li>Whether bias or noise is introduced&lt;/li>
&lt;/ul>
&lt;p>Bad data → bad model (even with perfect algorithms).&lt;/p>
&lt;hr>
&lt;h3 id="data-preprocessing-wrangling">
 Data Preprocessing, wrangling
 
 &lt;a class="anchor" href="#data-preprocessing-wrangling">#&lt;/a>
 
&lt;/h3>
&lt;p>Raw data is never ready for training.&lt;/p>
&lt;p>&lt;strong>Data Issues&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>Noise
&lt;ul>
&lt;li>For &lt;strong>objects&lt;/strong>, noise is an &lt;strong>extraneous object&lt;/strong>&lt;/li>
&lt;li>For &lt;strong>attributes&lt;/strong>, noise refers to &lt;strong>modification of original values&lt;/strong>&lt;/li>
&lt;li>Handle: apply a &lt;strong>log or Z-score transformation&lt;/strong> to standardise values around the mean&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Outliers
&lt;ul>
&lt;li>Data objects with characteristics that are considerably different than most of the other data objects in the data set&lt;/li>
&lt;li>Handle: Use the &lt;strong>IQR&lt;/strong> method (see the sketch after this list)&lt;/li>
&lt;li>Find Lower and Upper Bound and &lt;strong>replace Outlier with Lower or Upper Bound&lt;/strong>&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Missing Values
&lt;ul>
&lt;li>Eliminate data objects or variables&lt;/li>
&lt;li>Handle: Estimate missing values
&lt;ul>
&lt;li>&lt;strong>Mean, Median or Mode&lt;/strong>&lt;/li>
&lt;li>Prefer the &lt;strong>Median&lt;/strong> when the data contains &lt;strong>outliers&lt;/strong>&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Ignore the missing value during analysis&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Duplicate Data
&lt;ul>
&lt;li>Major issue when merging data from heterogeneous sources&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Inconsistent Codes
&lt;ul>
&lt;li>Find all unique codes and map the inconsistent ones to a single standard code&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
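&lt;p>A hedged sketch of the IQR handling mentioned in the list above (illustration only; numpy and the toy values are assumptions):&lt;/p>
&lt;pre>&lt;code class="language-python"># Hedged sketch: replace outliers with the lower or upper bound.
import numpy as np

x = np.array([10, 12, 11, 13, 12, 95])   # 95 is an outlier
q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1

lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
print(np.clip(x, lower, upper))          # the outlier becomes the bound
&lt;/code>&lt;/pre>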
&lt;p>&lt;strong>Data Preprocessing techniques&lt;/strong>&lt;/p></description></item><item><title>Regression(Linear Models)</title><link>https://arshadhs.github.io/docs/ai/machine-learning/03-linear-models-regression/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://arshadhs.github.io/docs/ai/machine-learning/03-linear-models-regression/</guid><description>&lt;h1 id="linear-regression">
 Linear Regression
 
 &lt;a class="anchor" href="#linear-regression">#&lt;/a>
 
&lt;/h1>
&lt;p>Linear Regression is a supervised 
&lt;span style="color: blue;">
 ML
&lt;/span> method used to predict a &lt;strong>numerical&lt;/strong> target by fitting a model that is &lt;strong>linear in its parameters&lt;/strong>.&lt;/p>
&lt;p>In 
&lt;span style="color: blue;">
 ML
&lt;/span>, linear models are a core baseline:
they’re fast, often surprisingly strong, and usually easy to interpret.&lt;/p>
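&lt;p>A hedged baseline sketch (illustration only; numpy and the noisy toy data are assumptions): fit a model that is linear in its parameters by least squares.&lt;/p>
&lt;pre>&lt;code class="language-python"># Hedged sketch: recover slope and intercept from noisy samples.
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 20)
y = 2.0 * x + 1.0 + rng.normal(scale=0.05, size=x.size)

X = np.column_stack([np.ones_like(x), x])        # bias column + feature
params, *_ = np.linalg.lstsq(X, y, rcond=None)
print(params)                                    # close to [1.0, 2.0]
&lt;/code>&lt;/pre>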
&lt;blockquote class="book-hint info">
&lt;p>Key takeaway:
Linear Regression learns parameters by minimising a squared-error cost.
You can solve it directly (closed form) or iteratively (gradient descent),
and you can extend it using basis functions and regularisation.&lt;/p></description></item><item><title>Ordinary Least Squares</title><link>https://arshadhs.github.io/docs/ai/machine-learning/03-ordinary-least-squares/</link><pubDate>Sat, 21 Feb 2026 00:00:00 +0000</pubDate><guid>https://arshadhs.github.io/docs/ai/machine-learning/03-ordinary-least-squares/</guid><description>&lt;h1 id="direct-solution-method---ordinary-least-squares-and-the-line-of-best-fit">
 Direct solution method - Ordinary Least Squares and the Line of Best Fit
 
 &lt;a class="anchor" href="#direct-solution-method---ordinary-least-squares-and-the-line-of-best-fit">#&lt;/a>
 
&lt;/h1>
&lt;p>It is possible to compute the best parameters for linear regression &lt;strong>in one shot&lt;/strong> (closed-form),
instead of iteratively improving them step-by-step.&lt;/p>
&lt;p>For linear regression, the direct method is usually &lt;strong>Ordinary Least Squares (OLS)&lt;/strong>.&lt;/p>
&lt;p>Ordinary Least Squares (OLS) chooses the “best” line by &lt;strong>minimising squared prediction errors&lt;/strong>.&lt;/p>
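&lt;p>A hedged sketch of the one-shot solution via the normal equations (illustration only; numpy and the toy data are assumptions):&lt;/p>
&lt;pre>&lt;code class="language-python"># Hedged sketch: solve (X^T X) w = X^T y directly, no iteration.
import numpy as np

X = np.array([[1, 1], [1, 2], [1, 3], [1, 4]], dtype=float)  # bias + x
y = np.array([3.1, 4.9, 7.2, 8.8])

w = np.linalg.solve(X.T @ X, X.T @ y)   # solve rather than invert
print(w)                                # intercept and slope
&lt;/code>&lt;/pre>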
&lt;blockquote class="book-hint info">
&lt;p>Key takeaway:
OLS defines “best fit” as the line that minimises the total squared residual error across all data points.&lt;/p></description></item><item><title>Cost Function</title><link>https://arshadhs.github.io/docs/ai/machine-learning/03-cost-function/</link><pubDate>Sat, 21 Feb 2026 00:00:00 +0000</pubDate><guid>https://arshadhs.github.io/docs/ai/machine-learning/03-cost-function/</guid><description>&lt;h1 id="cost-function">
 Cost Function
 
 &lt;a class="anchor" href="#cost-function">#&lt;/a>
 
&lt;/h1>
&lt;ul>
&lt;li>
&lt;p>also known as an objective function&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>how far the predicted values are from the actual ones&lt;/strong>&lt;/p>
&lt;/li>
&lt;li>
&lt;p>measure of the difference between predicted values and actual values&lt;/p>
&lt;/li>
&lt;li>
&lt;p>quantifies the error between a model&amp;rsquo;s predicted values and actual values&lt;/p>
&lt;/li>
&lt;li>
&lt;p>measures the model’s error on a group of datapoints&lt;/p>
&lt;/li>
&lt;li>
&lt;p>guides fitting, e.g. choosing the best-fit line through the data&lt;/p>
&lt;/li>
&lt;li>
&lt;p>used to evaluate the accuracy of a model’s predictions&lt;/p></description></item><item><title>Gradient Descent</title><link>https://arshadhs.github.io/docs/ai/machine-learning/03-gradient-descent-linear-regression/</link><pubDate>Sat, 21 Feb 2026 00:00:00 +0000</pubDate><guid>https://arshadhs.github.io/docs/ai/machine-learning/03-gradient-descent-linear-regression/</guid><description>&lt;h1 id="gradient-descent-for-linear-regression">
 Gradient Descent for Linear Regression
 
 &lt;a class="anchor" href="#gradient-descent-for-linear-regression">#&lt;/a>
 
&lt;/h1>
&lt;p>Gradient descent is an iterative optimisation method used to minimise the regression cost function by repeatedly updating parameters in the direction that reduces error.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Iterative method&lt;/strong>&lt;/li>
&lt;li>Types: batch / stochastic / mini-batch&lt;/li>
&lt;/ul>
&lt;blockquote class="book-hint info">
&lt;p>Key takeaway:
Gradient descent starts with initial parameter values and repeatedly updates them using the gradient until the cost stops decreasing.&lt;/p>
&lt;/blockquote>
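&lt;p>A hedged sketch of that loop (illustration only; the toy data, learning rate, and epoch count are assumptions):&lt;/p>
&lt;pre>&lt;code class="language-python"># Hedged sketch: repeatedly step parameters against the gradient
# of a mean-squared-error cost until it stops decreasing.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.0, 5.0, 7.0, 9.0])    # underlying rule: y = 2x + 1

w, b, lr = 0.0, 0.0, 0.05             # initial values, learning rate
for epoch in range(500):
    pred = w * x + b
    grad_w = 2 * np.mean((pred - y) * x)   # d cost / d w
    grad_b = 2 * np.mean(pred - y)         # d cost / d b
    w -= lr * grad_w
    b -= lr * grad_b

print(w, b)   # approaches 2 and 1
&lt;/code>&lt;/pre>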


&lt;script src="https://arshadhs.github.io/mermaid.min.js">&lt;/script>

 &lt;script>mermaid.initialize({
 "flowchart": {
 "useMaxWidth":true
 },
 "theme": "default"
}
)&lt;/script>




&lt;pre class="mermaid">
flowchart TD
GD[&amp;#34;Gradient&amp;lt;br/&amp;gt;Descent&amp;#34;] --&amp;gt;|minimises| CF[&amp;#34;Cost&amp;lt;br/&amp;gt;function&amp;#34;]
GD --&amp;gt;|updates| W[&amp;#34;Parameters&amp;lt;br/&amp;gt;(weights)&amp;#34;]
GD --&amp;gt;|uses| GR[&amp;#34;Gradient&amp;lt;br/&amp;gt;(slope)&amp;#34;]

GD --&amp;gt; H[&amp;#34;Hyperparameters&amp;#34;]
H --&amp;gt; LR[&amp;#34;Learning&amp;lt;br/&amp;gt;rate&amp;#34;]
H --&amp;gt; BS[&amp;#34;Batch&amp;lt;br/&amp;gt;size&amp;#34;]
H --&amp;gt; EP[&amp;#34;Epochs&amp;#34;]

style GD fill:#90CAF9,stroke:#1E88E5,color:#000

style CF fill:#CE93D8,stroke:#8E24AA,color:#000
style W fill:#CE93D8,stroke:#8E24AA,color:#000
style GR fill:#CE93D8,stroke:#8E24AA,color:#000
style H fill:#CE93D8,stroke:#8E24AA,color:#000
style LR fill:#CE93D8,stroke:#8E24AA,color:#000
style BS fill:#CE93D8,stroke:#8E24AA,color:#000
style EP fill:#CE93D8,stroke:#8E24AA,color:#000
&lt;/pre>

&lt;hr>
&lt;h2 id="types-of-gd">
 Types of GD
 
 &lt;a class="anchor" href="#types-of-gd">#&lt;/a>
 
&lt;/h2>


&lt;pre class="mermaid">
flowchart TD
T[&amp;#34;Gradient Descent&amp;lt;br/&amp;gt;types&amp;#34;] --&amp;gt; BGD[&amp;#34;Batch&amp;lt;br/&amp;gt;GD&amp;#34;]
T --&amp;gt; SGD[&amp;#34;Stochastic&amp;lt;br/&amp;gt;GD&amp;#34;]
T --&amp;gt; MGD[&amp;#34;Mini-batch&amp;lt;br/&amp;gt;GD&amp;#34;]

BGD --&amp;gt; ALL[&amp;#34;All data&amp;lt;br/&amp;gt;per step&amp;#34;]
BGD --&amp;gt; STB[&amp;#34;Smooth&amp;lt;br/&amp;gt;updates&amp;#34;]

SGD --&amp;gt; ONE[&amp;#34;1 sample&amp;lt;br/&amp;gt;per step&amp;#34;]
SGD --&amp;gt; FAST[&amp;#34;Quick&amp;lt;br/&amp;gt;progress&amp;#34;]
SGD --&amp;gt; NOISE[&amp;#34;Noisy&amp;lt;br/&amp;gt;updates&amp;#34;]

MGD --&amp;gt; MB[&amp;#34;Small batch&amp;lt;br/&amp;gt;per step&amp;#34;]
MGD --&amp;gt; PRACT[&amp;#34;Practical&amp;lt;br/&amp;gt;default&amp;#34;]

style T fill:#90CAF9,stroke:#1E88E5,color:#000

style BGD fill:#C8E6C9,stroke:#2E7D32,color:#000
style SGD fill:#C8E6C9,stroke:#2E7D32,color:#000
style MGD fill:#C8E6C9,stroke:#2E7D32,color:#000

style ALL fill:#CE93D8,stroke:#8E24AA,color:#000
style STB fill:#CE93D8,stroke:#8E24AA,color:#000
style ONE fill:#CE93D8,stroke:#8E24AA,color:#000
style FAST fill:#CE93D8,stroke:#8E24AA,color:#000
style NOISE fill:#CE93D8,stroke:#8E24AA,color:#000
style MB fill:#CE93D8,stroke:#8E24AA,color:#000
style PRACT fill:#CE93D8,stroke:#8E24AA,color:#000
&lt;/pre>

&lt;h3 id="batch">
 Batch
 
 &lt;a class="anchor" href="#batch">#&lt;/a>
 
&lt;/h3>
&lt;ul>
&lt;li>Use only if you have huge compute and a lot of time to train&lt;/li>
&lt;/ul>
&lt;h3 id="sgd">
 SGD
 
 &lt;a class="anchor" href="#sgd">#&lt;/a>
 
&lt;/h3>
&lt;ul>
&lt;li>
&lt;p>go-to solution&lt;/p></description></item><item><title>Classification(Linear Models)</title><link>https://arshadhs.github.io/docs/ai/machine-learning/04-linear-models-classification/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://arshadhs.github.io/docs/ai/machine-learning/04-linear-models-classification/</guid><description>&lt;h1 id="linear-models-for-classification">
 Linear models for Classification
 
 &lt;a class="anchor" href="#linear-models-for-classification">#&lt;/a>
 
&lt;/h1>
&lt;ul>
&lt;li>categorises data by finding a linear boundary (hyperplane) that separates classes&lt;/li>
&lt;li>calculating a weighted sum of input features plus bias&lt;/li>
&lt;/ul>


&lt;script src="https://arshadhs.github.io/mermaid.min.js">&lt;/script>

 &lt;script>mermaid.initialize({
 "flowchart": {
 "useMaxWidth":true
 },
 "theme": "default"
}
)&lt;/script>




&lt;pre class="mermaid">
flowchart TD
T[&amp;#34;Linear&amp;lt;br/&amp;gt;classification&amp;lt;br/&amp;gt;models&amp;#34;] --&amp;gt; P[&amp;#34;Perceptron&amp;#34;]
T --&amp;gt; LR[&amp;#34;Logistic&amp;lt;br/&amp;gt;regression&amp;#34;]
T --&amp;gt; SVM[&amp;#34;Linear&amp;lt;br/&amp;gt;SVM&amp;#34;]

P --&amp;gt;|uses| STEP[&amp;#34;Step&amp;lt;br/&amp;gt;activation&amp;#34;]
LR --&amp;gt;|uses| SIG[&amp;#34;Sigmoid&amp;lt;br/&amp;gt;+ log loss&amp;#34;]
SVM --&amp;gt;|uses| HNG[&amp;#34;Hinge&amp;lt;br/&amp;gt;loss&amp;#34;]

style T fill:#90CAF9,stroke:#1E88E5,color:#000

style P fill:#C8E6C9,stroke:#2E7D32,color:#000
style LR fill:#C8E6C9,stroke:#2E7D32,color:#000
style SVM fill:#C8E6C9,stroke:#2E7D32,color:#000

style STEP fill:#CE93D8,stroke:#8E24AA,color:#000
style SIG fill:#CE93D8,stroke:#8E24AA,color:#000
style HNG fill:#CE93D8,stroke:#8E24AA,color:#000
&lt;/pre>

&lt;h2 id="discriminant-functions">
 Discriminant Functions
 
 &lt;a class="anchor" href="#discriminant-functions">#&lt;/a>
 
&lt;/h2>
&lt;h2 id="decision-theory">
 Decision Theory
 
 &lt;a class="anchor" href="#decision-theory">#&lt;/a>
 
&lt;/h2>
&lt;h2 id="probabilistic-discriminative-classifiers">
 Probabilistic Discriminative Classifiers
 
 &lt;a class="anchor" href="#probabilistic-discriminative-classifiers">#&lt;/a>
 
&lt;/h2>
&lt;hr>
&lt;h2 id="logistic-regression">
 Logistic Regression
 
 &lt;a class="anchor" href="#logistic-regression">#&lt;/a>
 
&lt;/h2>
&lt;ul>
&lt;li>Supervised machine learning algorithm&lt;/li>
&lt;li>Binary &lt;strong>classification&lt;/strong> algorithm&lt;/li>
&lt;li>learns a linear decision boundary, so it works best when classes are roughly linearly separable&lt;/li>
&lt;li>predicts the probability that an input belongs to a specific class&lt;/li>
&lt;li>uses &lt;strong>Sigmoid function&lt;/strong> to convert inputs into a probability value between 0 and 1&lt;/li>
&lt;/ul>
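&lt;p>A hedged sketch of those bullets (illustration only; the weights and input are made-up assumptions): a linear score pushed through the sigmoid becomes a probability.&lt;/p>
&lt;pre>&lt;code class="language-python"># Hedged sketch: P(y=1 | x) = sigmoid(w.x + b).
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w, b = np.array([1.5, -0.8]), 0.2     # assumed "learned" parameters
x = np.array([2.0, 1.0])

z = w @ x + b                         # weighted sum plus bias
p = sigmoid(z)                        # probability between 0 and 1
print(p, int(p &amp;gt;= 0.5))             # probability and class decision
&lt;/code>&lt;/pre>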
&lt;blockquote class="book-hint info">
&lt;p>Key takeaway:
Logistic regression predicts $P(y=1\mid x)$ using a sigmoid of a linear score $z=w\cdot x+b$,
then learns $w,b$ by maximising likelihood (equivalently minimising log-loss).&lt;/p></description></item><item><title>Foundation Models</title><link>https://arshadhs.github.io/docs/ai/genai/foundation-model/</link><pubDate>Sun, 14 Dec 2025 00:00:00 +0000</pubDate><guid>https://arshadhs.github.io/docs/ai/genai/foundation-model/</guid><description>&lt;h1 id="foundation-model">
 Foundation Model
 
 &lt;a class="anchor" href="#foundation-model">#&lt;/a>
 
&lt;/h1>
&lt;p>AI models trained on massive datasets to perform a wide range of tasks with minimal fine-tuning.&lt;/p>
&lt;ul>
&lt;li>
&lt;p>are large deep learning neural networks&lt;/p>
&lt;/li>
&lt;li>
&lt;p>are large AI models trained on &lt;strong>massive and diverse datasets&lt;/strong> (text, images, audio, or multiple modalities).&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Contain &lt;strong>millions or billions of parameters&lt;/strong>.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>designed to perform a &lt;strong>broad range of general tasks&lt;/strong>&lt;/p>
&lt;/li>
&lt;li>
&lt;p>designed for &lt;strong>general-purpose intelligence&lt;/strong>, not a single task.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>acts as &lt;strong>base models&lt;/strong> for building specialised AI applications&lt;/p></description></item><item><title>LLM - Model</title><link>https://arshadhs.github.io/docs/ai/genai/llm/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://arshadhs.github.io/docs/ai/genai/llm/</guid><description>&lt;h1 id="llm--large-language-model">
 LLM – Large Language Model
 
 &lt;a class="anchor" href="#llm--large-language-model">#&lt;/a>
 
&lt;/h1>
&lt;p>Large Language Models (LLMs) are &lt;strong>advanced AI systems&lt;/strong> designed to process, understand, and generate &lt;strong>human-like text&lt;/strong>.&lt;/p>
&lt;p>They learn language by analysing &lt;strong>massive amounts of text data&lt;/strong>, discovering patterns in:&lt;/p>
&lt;ul>
&lt;li>
&lt;p>grammar&lt;/p>
&lt;/li>
&lt;li>
&lt;p>meaning&lt;/p>
&lt;/li>
&lt;li>
&lt;p>context&lt;/p>
&lt;/li>
&lt;li>
&lt;p>relationships between words and sentences&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Built on &lt;strong>Deep Learning&lt;/strong>&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Implemented using &lt;strong>Neural Networks&lt;/strong>&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Based on &lt;strong>Transformers&lt;/strong>&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Often combined with tools like:&lt;/p>
&lt;ul>
&lt;li>Retrieval (RAG)&lt;/li>
&lt;li>Agents&lt;/li>
&lt;li>External APIs&lt;/li>
&lt;li>Memory systems&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;hr>
&lt;h2 id="what-makes-an-llm-special">
 What makes an LLM special?
 
 &lt;a class="anchor" href="#what-makes-an-llm-special">#&lt;/a>
 
&lt;/h2>
&lt;ul>
&lt;li>Built using &lt;strong>deep neural networks&lt;/strong>&lt;/li>
&lt;li>Trained on &lt;strong>very large datasets&lt;/strong> (books, articles, code, web text)&lt;/li>
&lt;li>Can perform many tasks &lt;strong>without task-specific training&lt;/strong>&lt;/li>
&lt;li>General-purpose language understanding, not single-task models&lt;/li>
&lt;/ul>
&lt;hr>
&lt;h2 id="foundation-transformer-architecture">
 Foundation: Transformer Architecture
 
 &lt;a class="anchor" href="#foundation-transformer-architecture">#&lt;/a>
 
&lt;/h2>
&lt;p>LLMs are based on the &lt;strong>&lt;a href="https://arshadhs.github.io/docs/ai/deep-learning/transformer/">Transformer Architecture&lt;/a>&lt;/strong>, which allows models to understand &lt;strong>context and long-range dependencies&lt;/strong> in text.&lt;/p></description></item><item><title>AI Agents</title><link>https://arshadhs.github.io/docs/ai/genai/ai-agents/</link><pubDate>Mon, 15 Dec 2025 10:55:52 +0100</pubDate><guid>https://arshadhs.github.io/docs/ai/genai/ai-agents/</guid><description>&lt;h1 id="ai-agents">
 AI Agents
 
 &lt;a class="anchor" href="#ai-agents">#&lt;/a>
 
&lt;/h1>
&lt;p>Also referred to as Agentic AI.&lt;/p>
&lt;p>AI agents are &lt;strong>intelligent systems&lt;/strong> that can &lt;strong>plan, make decisions, and take actions&lt;/strong> to achieve goals with &lt;strong>minimal human intervention&lt;/strong>.&lt;/p>
&lt;ul>
&lt;li>
&lt;p>A common use case is &lt;strong>task automation&lt;/strong>&lt;/p>
&lt;/li>
&lt;li>
&lt;p>for example booking travel based on a user’s request.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>AI agents typically build on &lt;strong>Generative AI&lt;/strong> and use &lt;strong>Large Language Models (LLMs)&lt;/strong> as the reasoning core.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Agents often interact with tools (APIs, databases, calendars) to complete multi-step workflows.&lt;/p></description></item><item><title>Retrieval-Augmented Generation (RAG)</title><link>https://arshadhs.github.io/docs/ai/genai/rag/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://arshadhs.github.io/docs/ai/genai/rag/</guid><description>&lt;h1 id="retrieval-augmented-generation-rag">
 Retrieval-Augmented Generation (RAG)
 
 &lt;a class="anchor" href="#retrieval-augmented-generation-rag">#&lt;/a>
 
&lt;/h1>
&lt;p>&lt;strong>Retrieval-Augmented Generation (RAG)&lt;/strong> is a system design pattern that improves an LLM’s answers by:&lt;/p>
&lt;ol>
&lt;li>&lt;strong>Retrieving&lt;/strong> relevant information from an external knowledge source, and then&lt;/li>
&lt;li>&lt;strong>Augmenting&lt;/strong> the LLM prompt with that retrieved context before generating the final response.&lt;/li>
&lt;/ol>
&lt;p>RAG helps an LLM &lt;strong>look things up first&lt;/strong>, then &lt;strong>answer using evidence&lt;/strong>.&lt;/p>
&lt;hr>
&lt;h2 id="why-rag-is-useful">
 Why RAG is Useful
 
 &lt;a class="anchor" href="#why-rag-is-useful">#&lt;/a>
 
&lt;/h2>
&lt;p>RAG is commonly used when:&lt;/p>
&lt;ul>
&lt;li>Your knowledge is in &lt;strong>private documents&lt;/strong> (PDFs, policies, internal wiki)&lt;/li>
&lt;li>You need &lt;strong>up-to-date information&lt;/strong> (things not in the model’s training data)&lt;/li>
&lt;li>You want fewer &lt;strong>hallucinations&lt;/strong> by grounding answers in retrieved sources&lt;/li>
&lt;li>You want &lt;strong>traceability&lt;/strong> (show “where the answer came from”)&lt;/li>
&lt;/ul>
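&lt;p>A hedged toy sketch of the retrieve-then-augment pattern (illustration only, not the article's implementation; the documents and the crude overlap scorer are assumptions):&lt;/p>
&lt;pre>&lt;code class="language-python"># Hedged sketch: pick the best-matching document, then build the
# augmented prompt the LLM would see at inference time.
docs = {
    "policy.pdf": "refunds are allowed within 30 days of purchase",
    "wiki.md": "the office wifi password rotates every month",
}
question = "how many days do customers have to request a refund"

def score(doc, query):
    return len(set(doc.split()).intersection(query.split()))

best = max(docs, key=lambda name: score(docs[name], question))   # retrieve
prompt = "Context: " + docs[best] + "\n\nQuestion: " + question  # augment
print(prompt)   # the model now answers using retrieved evidence
&lt;/code>&lt;/pre>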
&lt;blockquote class="book-hint info">
&lt;p>RAG does not change the model weights.&lt;br>
It changes what the model &lt;em>sees&lt;/em> at inference time by adding retrieved context.&lt;/p></description></item><item><title>Mathematical Foundation</title><link>https://arshadhs.github.io/docs/ai/maths/</link><pubDate>Wed, 18 Mar 2026 00:00:00 +0000</pubDate><guid>https://arshadhs.github.io/docs/ai/maths/</guid><description>&lt;h1 id="mathematical-foundations-for-machine-learning">
 Mathematical Foundations for Machine Learning
 
 &lt;a class="anchor" href="#mathematical-foundations-for-machine-learning">#&lt;/a>
 
&lt;/h1>
&lt;p>Machine Learning is built on &lt;strong>mathematical principles&lt;/strong> that allow models to:&lt;/p>
&lt;ul>
&lt;li>represent data&lt;/li>
&lt;li>learn patterns&lt;/li>
&lt;li>optimise performance&lt;/li>
&lt;/ul>


&lt;script src="https://arshadhs.github.io/mermaid.min.js">&lt;/script>

 &lt;script>mermaid.initialize({
 "flowchart": {
 "useMaxWidth":true
 },
 "theme": "default"
}
)&lt;/script>




&lt;pre class="mermaid">
flowchart LR
 DATA[Data]
 MATH[Math Models]
 OPT[Optimisation]
 MODEL[Trained Model]

 DATA --&amp;gt; MATH
 MATH --&amp;gt; OPT
 OPT --&amp;gt; MODEL
&lt;/pre>

&lt;p>ML requires &lt;strong>core mathematical tools&lt;/strong> to understand how ML algorithms work internally. Algebra deals with relationships between variables and quantities, while Calculus focuses on change and optimization.&lt;/p></description></item><item><title>Decision Tree</title><link>https://arshadhs.github.io/docs/ai/machine-learning/05-decision-tree/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://arshadhs.github.io/docs/ai/machine-learning/05-decision-tree/</guid><description>&lt;h1 id="decision-tree">
 Decision Tree
 
 &lt;a class="anchor" href="#decision-tree">#&lt;/a>
 
&lt;/h1>
&lt;p>A decision tree classifies an example by asking a sequence of questions about its attributes until it reaches a leaf (final decision).&lt;/p>
&lt;blockquote class="book-hint info">
&lt;p>Key takeaway:
A decision tree grows by repeatedly splitting the training data into &lt;strong>purer&lt;/strong> subsets using an impurity measure
(Entropy / Gini / Classification Error).&lt;/p>
&lt;/blockquote>
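&lt;p>A hedged sketch of the impurity measures named above (illustration only; the class proportions are assumptions):&lt;/p>
&lt;pre>&lt;code class="language-python"># Hedged sketch: entropy and Gini for the label mix at a node.
import math

def entropy(props):
    return 0.0 - sum(p * math.log2(p) for p in props if p)

def gini(props):
    return 1.0 - sum(p * p for p in props)

pure, mixed = [1.0, 0.0], [0.5, 0.5]
print(entropy(pure), entropy(mixed))   # 0.0 vs 1.0 (most mixed)
print(gini(pure), gini(mixed))         # 0.0 vs 0.5
&lt;/code>&lt;/pre>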
&lt;hr>
&lt;h2 id="information-theory">
 Information Theory
 
 &lt;a class="anchor" href="#information-theory">#&lt;/a>
 
&lt;/h2>
&lt;p>Decision trees need a way to measure:
“How mixed are the class labels at a node?”&lt;/p></description></item><item><title>Statistics</title><link>https://arshadhs.github.io/docs/ai/statistics/</link><pubDate>Thu, 12 Mar 2026 00:00:00 +0000</pubDate><guid>https://arshadhs.github.io/docs/ai/statistics/</guid><description>&lt;h1 id="statistics">
 Statistics
 
 &lt;a class="anchor" href="#statistics">#&lt;/a>
 
&lt;/h1>
&lt;p>&lt;strong>Statistical methods&lt;/strong> help you turn &lt;strong>raw data into reliable conclusions&lt;/strong>, while understanding &lt;strong>uncertainty, variability, and confidence&lt;/strong>.&lt;/p>
&lt;p>Statistics provides the &lt;strong>language and tools&lt;/strong> for reasoning about data, uncertainty, and inference.&lt;/p>
&lt;p>ML needs &lt;strong>understanding data behaviour&lt;/strong>, drawing conclusions, and validating machine learning models.&lt;/p>
&lt;ul>
&lt;li>Collect Data&lt;/li>
&lt;li>Present &amp;amp; Organise Data (in a systematic manner)&lt;/li>
&lt;li>Analyse Data&lt;/li>
&lt;li>Infer about the Data&lt;/li>
&lt;li>Take Decision from the Data&lt;/li>
&lt;/ul>
&lt;hr>




&lt;ul>
 &lt;li>&lt;a href="https://arshadhs.github.io/docs/ai/statistics/00_formulas/">Formula Sheet&lt;/a>&lt;/li>
 &lt;li>&lt;a href="https://arshadhs.github.io/docs/ai/statistics/ism-formula-sheet/">Stats Formula Sheet&lt;/a>&lt;/li>
 &lt;li>&lt;a href="https://arshadhs.github.io/docs/ai/statistics/01_basic_statistics/">Basic Statistics&lt;/a>&lt;/li>
 &lt;li>&lt;a href="https://arshadhs.github.io/docs/ai/statistics/01_basic_probability/">Basic Probability&lt;/a>&lt;/li>
 &lt;li>&lt;a href="https://arshadhs.github.io/docs/ai/statistics/04_hypothesis_testing/">Hypothesis Testing&lt;/a>&lt;/li>
 &lt;li>&lt;a href="https://arshadhs.github.io/docs/ai/statistics/05_prediction_n_forecasting/">Prediction &amp;amp; Forecasting&lt;/a>&lt;/li>
 &lt;li>&lt;a href="https://arshadhs.github.io/docs/ai/statistics/06_prediction_n_forecasting/">Gaussian Mixture model &amp;amp; Expectation Maximization&lt;/a>&lt;/li>
 &lt;li>&lt;a href="https://arshadhs.github.io/docs/ai/statistics/conditional-probability/">Conditional Probability &amp;amp; Bayes’ Theorem&lt;/a>
  &lt;ul>
   &lt;li>&lt;a href="https://arshadhs.github.io/docs/ai/statistics/conditional-probability/021_conditional_prob/">Conditional Probability&lt;/a>&lt;/li>
   &lt;li>&lt;a href="https://arshadhs.github.io/docs/ai/statistics/conditional-probability/022_bayes_theorem/">Bayes’ Theorem&lt;/a>&lt;/li>
   &lt;li>&lt;a href="https://arshadhs.github.io/docs/ai/statistics/conditional-probability/023_naive_bayes/">Naïve Bayes&lt;/a>&lt;/li>
  &lt;/ul>
 &lt;/li>
 &lt;li>&lt;a href="https://arshadhs.github.io/docs/ai/statistics/probability_distributions/">Probability Distributions&lt;/a>
  &lt;ul>
   &lt;li>&lt;a href="https://arshadhs.github.io/docs/ai/statistics/probability_distributions/random-variables/">Random Variables&lt;/a>&lt;/li>
   &lt;li>&lt;a href="https://arshadhs.github.io/docs/ai/statistics/probability_distributions/common-distributions/">Common Probability Distributions&lt;/a>&lt;/li>
  &lt;/ul>
 &lt;/li>
&lt;/ul>


&lt;hr>
&lt;table>
 &lt;thead>
 &lt;tr>
 &lt;th>Statistics Topic&lt;/th>
 &lt;th>What you learn (plain English)&lt;/th>
 &lt;th>ML Connection&lt;/th>
 &lt;/tr>
 &lt;/thead>
 &lt;tbody>
 &lt;tr>
 &lt;td>1. Basic Probability &amp;amp; Statistics&lt;/td>
 &lt;td>Summarise data;&lt;br>understand spread;&lt;br>basic probability rules&lt;/td>
 &lt;td>Data understanding (EDA), feature sanity checks,&lt;br>detecting outliers, interpreting “average behaviour”&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>2. Conditional Probability &amp;amp; Bayes&lt;/td>
 &lt;td>Update probability using new information;&lt;br>Bayes’ rule&lt;/td>
 &lt;td>Naïve Bayes, Bayesian thinking,&lt;br>posterior probabilities, probabilistic classification&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>3. Probability Distributions&lt;/td>
 &lt;td>Model randomness with distributions;&lt;br>expectation/variance/covariance&lt;/td>
 &lt;td>Likelihood models, noise assumptions (Gaussian), sampling,&lt;br>probabilistic modelling foundations&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>4. Hypothesis Testing&lt;/td>
 &lt;td>Sampling, CLT, confidence intervals,&lt;br>significance tests, ANOVA, MLE&lt;/td>
 &lt;td>A/B testing, evaluating model improvements,&lt;br>significance vs noise, parameter estimation (MLE)&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>5. Prediction &amp;amp; Forecasting&lt;/td>
 &lt;td>Correlation, regression,&lt;br>time series (AR/MA/ARIMA/SARIMA etc.)&lt;/td>
 &lt;td>Linear regression, forecasting, sequential data modelling, baseline predictive modelling&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>6. GMM &amp;amp; EM&lt;/td>
 &lt;td>Mixtures of Gaussians;&lt;br>iterative estimation with EM&lt;/td>
 &lt;td>Unsupervised learning (soft clustering),&lt;br>density estimation, latent-variable models&lt;/td>
 &lt;/tr>
 &lt;/tbody>
&lt;/table>
&lt;hr>


&lt;pre class="mermaid">
flowchart TD
 A[&amp;#34;Statistical Methods&amp;lt;br/&amp;gt;AIML ZC418&amp;#34;] --&amp;gt; B[&amp;#34;1. Basic Probability and Statistics&amp;#34;]
 A --&amp;gt; C[&amp;#34;2. Conditional Probability and Bayes&amp;#34;]
 A --&amp;gt; D[&amp;#34;3. Probability Distributions&amp;#34;]
 A --&amp;gt; E[&amp;#34;4. Hypothesis Testing&amp;#34;]
 A --&amp;gt; F[&amp;#34;5. Prediction and Forecasting&amp;#34;]
 A --&amp;gt; G[&amp;#34;6. Gaussian Mixture Model and EM&amp;#34;]

 B --&amp;gt; B1[&amp;#34;Central Tendency&amp;lt;br/&amp;gt;Mean - Median - Mode&amp;#34;]
 B --&amp;gt; B2[&amp;#34;Variability&amp;lt;br/&amp;gt;Range - Variance - SD - Quartiles&amp;#34;]
 B --&amp;gt; B3[&amp;#34;Basic Probability Concepts&amp;#34;]
 B3 --&amp;gt; B31[&amp;#34;Axioms of Probability&amp;#34;]
 B3 --&amp;gt; B32[&amp;#34;Definition of Probability&amp;#34;]
 B3 --&amp;gt; B33[&amp;#34;Mutually Exclusive vs Independent&amp;#34;]

 C --&amp;gt; C1[&amp;#34;Conditional Probability&amp;#34;]
 C --&amp;gt; C2[&amp;#34;Independence (conditional)&amp;#34;]
 C --&amp;gt; C3[&amp;#34;Bayes Theorem&amp;#34;]
 C --&amp;gt; C4[&amp;#34;Naive Bayes (intro)&amp;#34;]

 D --&amp;gt; D1[&amp;#34;Random Variables&amp;lt;br/&amp;gt;Discrete and Continuous&amp;#34;]
 D --&amp;gt; D2[&amp;#34;Expectation - Variance - Covariance&amp;#34;]
 D --&amp;gt; D3[&amp;#34;Transformations of RVs&amp;#34;]
 D --&amp;gt; D4[&amp;#34;Key Distributions&amp;#34;]
 D4 --&amp;gt; D41[&amp;#34;Bernoulli&amp;#34;]
 D4 --&amp;gt; D42[&amp;#34;Binomial&amp;#34;]
 D4 --&amp;gt; D43[&amp;#34;Poisson&amp;#34;]
 D4 --&amp;gt; D44[&amp;#34;Normal (Gaussian)&amp;#34;]
 D4 --&amp;gt; D45[&amp;#34;t - Chi-square - F (intro)&amp;#34;]

 E --&amp;gt; E1[&amp;#34;Sampling&amp;lt;br/&amp;gt;Random and Stratified&amp;#34;]
 E --&amp;gt; E2[&amp;#34;Sampling Distributions&amp;lt;br/&amp;gt;CLT&amp;#34;]
 E --&amp;gt; E3[&amp;#34;Estimation&amp;lt;br/&amp;gt;Confidence Intervals&amp;#34;]
 E --&amp;gt; E4[&amp;#34;Hypothesis Tests&amp;lt;br/&amp;gt;Means and Proportions&amp;#34;]
 E --&amp;gt; E5[&amp;#34;ANOVA&amp;lt;br/&amp;gt;Single and Dual factor&amp;#34;]
 E --&amp;gt; E6[&amp;#34;Maximum Likelihood&amp;#34;]

 F --&amp;gt; F1[&amp;#34;Correlation&amp;#34;]
 F --&amp;gt; F2[&amp;#34;Regression&amp;#34;]
 F --&amp;gt; F3[&amp;#34;Time Series Basics&amp;lt;br/&amp;gt;Components&amp;#34;]
 F --&amp;gt; F4[&amp;#34;Moving Averages&amp;lt;br/&amp;gt;Simple and Weighted&amp;#34;]
 F --&amp;gt; F5[&amp;#34;Time Series Models&amp;#34;]
 F5 --&amp;gt; F51[&amp;#34;AR&amp;#34;]
 F5 --&amp;gt; F52[&amp;#34;ARMA / ARIMA&amp;#34;]
 F5 --&amp;gt; F53[&amp;#34;SARIMA / SARIMAX&amp;#34;]
 F5 --&amp;gt; F54[&amp;#34;VAR / VARMAX&amp;#34;]
 F --&amp;gt; F6[&amp;#34;Exponential Smoothing&amp;#34;]

 G --&amp;gt; G1[&amp;#34;GMM&amp;lt;br/&amp;gt;Mixture of Gaussians&amp;#34;]
 G --&amp;gt; G2[&amp;#34;EM Algorithm&amp;lt;br/&amp;gt;E-step - M-step&amp;#34;]

 B -.-&amp;gt; C
 C -.-&amp;gt; D
 D -.-&amp;gt; E
 E -.-&amp;gt; F
 F -.-&amp;gt; G
&lt;/pre>

&lt;hr>
&lt;h2 id="data---types">
 Data - Types
 
 &lt;a class="anchor" href="#data---types">#&lt;/a>
 
&lt;/h2>


&lt;pre class="mermaid">
flowchart TD
	A[(Data)] --&amp;gt; B[&amp;#34;Categorical (Qualitative)&amp;#34;]
 A --&amp;gt; C[&amp;#34;Numerical (Quantitative)&amp;#34;]

 B --&amp;gt; B1[Nominal]
 B --&amp;gt; B2[Ordinal]

 C --&amp;gt; C1[Discrete]
 C --&amp;gt; C2[Continuous]

 C2 --&amp;gt; C21[Interval]
 C2 --&amp;gt; C22[Ratio]

 %% Styling
 style A fill:#E1F5FE,stroke:#333
 style B fill:#90CAF9,stroke:#333
 style B1 fill:#90CAF9,stroke:#333
 style B2 fill:#90CAF9,stroke:#333
 style C fill:#FFF9C4,stroke:#333
 style C1 fill:#FFF9C4,stroke:#333
 style C2 fill:#FFF9C4,stroke:#333
 style C21 fill:#FFF9C4,stroke:#333
 style C22 fill:#FFF9C4,stroke:#333
&lt;/pre>
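<p>One practical way to see this split is how dataframe libraries separate numeric from non-numeric columns. A small illustrative sketch (assuming pandas; the column names are invented):</p>
&lt;pre>&lt;code class="language-python">
import pandas as pd

df = pd.DataFrame({
    'eye_color': ['brown', 'blue', 'green'],  # categorical (nominal)
    'height_cm': [172.5, 181.0, 167.2],       # numerical (continuous)
    'children': [0, 2, 1],                    # numerical (discrete)
})

print(df.select_dtypes(include='number').columns.tolist())
print(df.select_dtypes(exclude='number').columns.tolist())
&lt;/code>&lt;/pre>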

&lt;div class="book-steps ">
&lt;ol>
&lt;li>
&lt;h2 id="categorical-qualitative">
 Categorical (Qualitative)
 
 &lt;a class="anchor" href="#categorical-qualitative">#&lt;/a>
 
&lt;/h2>
&lt;p>Expresses a qualitative attribute, e.g. hair color, eye color.&lt;/p></description></item><item><title>Instance-based Learning</title><link>https://arshadhs.github.io/docs/ai/machine-learning/06-instance-based-learning/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://arshadhs.github.io/docs/ai/machine-learning/06-instance-based-learning/</guid><description>&lt;h1 id="instance-based-learning">
 Instance-based Learning
 
 &lt;a class="anchor" href="#instance-based-learning">#&lt;/a>
 
&lt;/h1>
&lt;p>Instance-based learning is a family of methods that &lt;strong>do not build one explicit global model during training&lt;/strong>. Instead, they &lt;strong>store training examples&lt;/strong> and delay most of the work until a new query arrives.&lt;/p>
&lt;p>When a new point must be classified or predicted, the algorithm compares it with previously seen examples, finds the most relevant neighbours, and uses them to produce the answer.&lt;/p>
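&lt;p>For intuition, a minimal k-nearest-neighbours sketch of this query-time workflow (assuming numpy; the data and the helper name knn_predict are illustrative):&lt;/p>
&lt;pre>&lt;code class="language-python">
import numpy as np

def knn_predict(X_train, y_train, x_new, k=3):
    # Lazy learner: training just means storing the examples.
    dists = np.linalg.norm(X_train - x_new, axis=1)  # distance to every stored point
    nearest = np.argsort(dists)[:k]                  # indices of the k closest
    return np.bincount(y_train[nearest]).argmax()    # majority vote

X_train = np.array([[1.0, 1.0], [1.2, 0.8], [8.0, 8.0], [7.5, 8.2]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict(X_train, y_train, np.array([7.0, 7.0])))  # 1
&lt;/code>&lt;/pre>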
&lt;p>Instance-based Learning covers three linked ideas:&lt;/p></description></item><item><title>Support Vector Machine</title><link>https://arshadhs.github.io/docs/ai/machine-learning/07-support-vector-machines/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://arshadhs.github.io/docs/ai/machine-learning/07-support-vector-machines/</guid><description>&lt;h1 id="support-vector-machine-svm">
 Support Vector Machine (SVM)
 
 &lt;a class="anchor" href="#support-vector-machine-svm">#&lt;/a>
 
&lt;/h1>
&lt;p>A &lt;strong>Support Vector Machine (SVM)&lt;/strong> is a &lt;strong>supervised machine learning algorithm&lt;/strong> used for:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Classification&lt;/strong> (most common)&lt;/li>
&lt;li>&lt;strong>Regression&lt;/strong> (SVR – Support Vector Regression)&lt;/li>
&lt;/ul>
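&lt;p>A minimal usage sketch (assuming scikit-learn; the toy points are made up):&lt;/p>
&lt;pre>&lt;code class="language-python">
import numpy as np
from sklearn.svm import SVC

# Two linearly separable blobs.
X = np.array([[1.0, 1.0], [1.5, 0.5], [7.0, 7.0], [8.0, 6.5]])
y = np.array([0, 0, 1, 1])

clf = SVC(kernel='linear').fit(X, y)
print(clf.support_vectors_)       # the points that pin down the margin
print(clf.predict([[2.0, 2.0]]))  # [0]
&lt;/code>&lt;/pre>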

&lt;blockquote class='book-hint '>
 &lt;p>Find the decision boundary that separates classes with the &lt;strong>maximum margin&lt;/strong>.&lt;/p>
&lt;/blockquote>&lt;blockquote class="book-hint default">
&lt;p>A Support Vector Machine is a supervised learning algorithm that finds an optimal hyperplane by maximising the margin between classes, using support vectors and kernel functions to handle non-linear data.&lt;/p></description></item><item><title>Attention Mechanism</title><link>https://arshadhs.github.io/docs/ai/deep-learning/080-attention-mechanism/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://arshadhs.github.io/docs/ai/deep-learning/080-attention-mechanism/</guid><description>&lt;h1 id="attention-mechanism">
 Attention Mechanism
 
 &lt;a class="anchor" href="#attention-mechanism">#&lt;/a>
 
&lt;/h1>
&lt;ul>
&lt;li>Queries, Keys, and Values&lt;/li>
&lt;li>Attention Pooling by Similarity&lt;/li>
&lt;li>Attention Pooling via Nadaraya–Watson Regression&lt;/li>
&lt;li>Attention Scoring Functions&lt;/li>
&lt;li>Dot Product Attention&lt;/li>
&lt;li>Convenience Functions&lt;/li>
&lt;li>Scaled Dot Product Attention (see the sketch after this list)&lt;/li>
&lt;li>Additive Attention&lt;/li>
&lt;li>Bahdanau Attention Mechanism&lt;/li>
&lt;li>Multi-Head Attention&lt;/li>
&lt;li>Self-Attention&lt;/li>
&lt;li>Positional Encoding&lt;/li>
&lt;li>Code implementation (webinar)&lt;/li>
&lt;/ul>
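&lt;p>As a taste of the material, a minimal numpy sketch of scaled dot product attention (the shapes and values are illustrative):&lt;/p>
&lt;pre>&lt;code class="language-python">
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)  # query-key similarity, scaled
    return softmax(scores) @ V       # weighted average of the values

# Toy example: 2 queries, 3 key-value pairs, dimension 4.
rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(2, 4)), rng.normal(size=(3, 4)), rng.normal(size=(3, 4))
print(scaled_dot_product_attention(Q, K, V).shape)  # (2, 4)
&lt;/code>&lt;/pre>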
&lt;hr>
&lt;h2 id="reference">
 Reference
 
 &lt;a class="anchor" href="#reference">#&lt;/a>
 
&lt;/h2>
&lt;ul>
&lt;li>&lt;strong>Dive into Deep Learning&lt;/strong>, Cambridge University Press (&lt;a href="https://d2l.ai/chapter_builders-guide/model-construction.html">Ch 10&lt;/a>, &lt;a href="https://d2l.ai/chapter_convolutional-neural-networks/index.html">Ch 7&lt;/a>)&lt;/li>
&lt;/ul>
&lt;hr>
&lt;p>&lt;a href="https://arshadhs.github.io/">Home&lt;/a> | &lt;a href="https://arshadhs.github.io/docs/ai/deep-learning/">
 Deep Learning
&lt;/a>&lt;/p></description></item><item><title>Bayesian Learning</title><link>https://arshadhs.github.io/docs/ai/machine-learning/08-bayesian-learning/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://arshadhs.github.io/docs/ai/machine-learning/08-bayesian-learning/</guid><description>&lt;h1 id="bayesian-learning">
 Bayesian Learning
 
 &lt;a class="anchor" href="#bayesian-learning">#&lt;/a>
 
&lt;/h1>
&lt;h2 id="mle-hypothesis">
 MLE Hypothesis
 
 &lt;a class="anchor" href="#mle-hypothesis">#&lt;/a>
 
&lt;/h2>
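&lt;p>In the usual notation, the maximum likelihood (MLE) hypothesis is the one that makes the observed data &lt;span>\( D \)&lt;/span> most probable:&lt;/p>
&lt;span>
 \[ 
h_{\text{MLE}} = \arg\max_{h} P(D \mid h)
 \]
 &lt;/span>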
&lt;h2 id="map-hypothesis">
 MAP Hypothesis
 
 &lt;a class="anchor" href="#map-hypothesis">#&lt;/a>
 
&lt;/h2>
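&lt;p>The maximum a posteriori (MAP) hypothesis additionally weighs in the prior over hypotheses:&lt;/p>
&lt;span>
 \[ 
h_{\text{MAP}} = \arg\max_{h} P(D \mid h)\,P(h)
 \]
 &lt;/span>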
&lt;h2 id="bayes-rule">
 Bayes Rule
 
 &lt;a class="anchor" href="#bayes-rule">#&lt;/a>
 
&lt;/h2>
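&lt;p>Bayes’ rule ties the two together, relating the posterior over a hypothesis to its likelihood and prior:&lt;/p>
&lt;span>
 \[ 
P(h \mid D) = \frac{P(D \mid h)\,P(h)}{P(D)}
 \]
 &lt;/span>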
&lt;h2 id="optimal-bayes-classifier">
 Optimal Bayes Classifier
 
 &lt;a class="anchor" href="#optimal-bayes-classifier">#&lt;/a>
 
&lt;/h2>
&lt;h2 id="naïve-bayes-classifier">
 Naïve Bayes Classifier
 
 &lt;a class="anchor" href="#na%c3%afve-bayes-classifier">#&lt;/a>
 
&lt;/h2>
&lt;h2 id="probabilistic-generative-classifiers">
 Probabilistic Generative Classifiers
 
 &lt;a class="anchor" href="#probabilistic-generative-classifiers">#&lt;/a>
 
&lt;/h2>
&lt;h2 id="bayesian-linear-regression">
 Bayesian Linear Regression
 
 &lt;a class="anchor" href="#bayesian-linear-regression">#&lt;/a>
 
&lt;/h2>
&lt;hr>
&lt;p>&lt;a href="https://arshadhs.github.io/">Home&lt;/a> | &lt;a href="https://arshadhs.github.io/docs/ai/machine-learning/">
 Machine Learning
&lt;/a>&lt;/p></description></item><item><title>Transformer</title><link>https://arshadhs.github.io/docs/ai/deep-learning/090-transformer/</link><pubDate>Mon, 15 Dec 2025 10:55:52 +0100</pubDate><guid>https://arshadhs.github.io/docs/ai/deep-learning/090-transformer/</guid><description>&lt;h1 id="transformer">
 Transformer
 
 &lt;a class="anchor" href="#transformer">#&lt;/a>
 
&lt;/h1>
&lt;ul>
&lt;li>
&lt;p>is an architecture of neural networks&lt;/p>
&lt;/li>
&lt;li>
&lt;p>based on the multi-head attention mechanism&lt;/p>
&lt;/li>
&lt;li>
&lt;p>text is converted to numerical representations called tokens, and each token is converted into a vector via lookup from a word embedding table&lt;/p>
&lt;/li>
&lt;li>
&lt;p>takes a text sequence as input and produces another text sequence as output&lt;/p>
&lt;/li>
&lt;li>
&lt;p>foundation for modern &lt;strong>&lt;a href="https://arshadhs.github.io/docs/ai/genai/llm/">Large Language Models (LLMs)&lt;/a>&lt;/strong> like ChatGPT and Gemini&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Transformer architecture&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Model, Positionwise Feed-Forward Networks, Residual Connection and Layer Normalization&lt;/p></description></item><item><title>Ensemble Learning</title><link>https://arshadhs.github.io/docs/ai/machine-learning/09-ensemble-learning/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://arshadhs.github.io/docs/ai/machine-learning/09-ensemble-learning/</guid><description>&lt;h1 id="ensemble-learning">
 Ensemble Learning
 
 &lt;a class="anchor" href="#ensemble-learning">#&lt;/a>
 
&lt;/h1>
&lt;h2 id="combining-classifiers">
 Combining Classifiers
 
 &lt;a class="anchor" href="#combining-classifiers">#&lt;/a>
 
&lt;/h2>
&lt;h2 id="bagging">
 Bagging
 
 &lt;a class="anchor" href="#bagging">#&lt;/a>
 
&lt;/h2>
&lt;h2 id="random-forest">
 Random Forest
 
 &lt;a class="anchor" href="#random-forest">#&lt;/a>
 
&lt;/h2>
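&lt;p>A minimal sketch of bagged trees via a random forest (assuming scikit-learn; the toy dataset is generated on the fly):&lt;/p>
&lt;pre>&lt;code class="language-python">
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Toy data: 200 samples, 2 classes.
X, y = make_classification(n_samples=200, random_state=0)

# A random forest bags many decorrelated decision trees and
# averages their votes, which reduces variance.
model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
print(model.score(X, y))
&lt;/code>&lt;/pre>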
&lt;h2 id="boosting">
 Boosting
 
 &lt;a class="anchor" href="#boosting">#&lt;/a>
 
&lt;/h2>
&lt;h3 id="adaboost">
 ADABoost
 
 &lt;a class="anchor" href="#adaboost">#&lt;/a>
 
&lt;/h3>
&lt;h3 id="gradient-boosting">
 Gradient Boosting
 
 &lt;a class="anchor" href="#gradient-boosting">#&lt;/a>
 
&lt;/h3>
&lt;h3 id="xgboost">
 XGBoost
 
 &lt;a class="anchor" href="#xgboost">#&lt;/a>
 
&lt;/h3>
&lt;hr>
&lt;p>&lt;a href="https://arshadhs.github.io/">Home&lt;/a> | &lt;a href="https://arshadhs.github.io/docs/ai/machine-learning/">
 Machine Learning
&lt;/a>&lt;/p></description></item><item><title>Optimisation of Deep models</title><link>https://arshadhs.github.io/docs/ai/deep-learning/100-optimise-deep-models/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://arshadhs.github.io/docs/ai/deep-learning/100-optimise-deep-models/</guid><description>&lt;h1 id="optimisation-of-deep-models">
 Optimisation of Deep models
 
 &lt;a class="anchor" href="#optimisation-of-deep-models">#&lt;/a>
 
&lt;/h1>
&lt;ul>
&lt;li>Goal of Optimization&lt;/li>
&lt;li>Optimization Challenges in Deep Learning&lt;/li>
&lt;li>Gradient Descent (see the sketch after this list)&lt;/li>
&lt;li>Stochastic Gradient Descent&lt;/li>
&lt;li>Minibatch Stochastic Gradient Descent&lt;/li>
&lt;li>Momentum&lt;/li>
&lt;li>Adagrad and Algorithm&lt;/li>
&lt;li>RMSProp and Algorithm&lt;/li>
&lt;li>Adadelta and Algorithm&lt;/li>
&lt;li>Adam and Algorithm&lt;/li>
&lt;li>Code Implementation and comparison of algorithms (webinar)&lt;/li>
&lt;/ul>
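&lt;p>A minimal sketch of plain gradient descent on a one-dimensional objective (pure Python; the objective is made up):&lt;/p>
&lt;pre>&lt;code class="language-python">
# Minimise f(w) = (w - 3)^2, whose gradient is 2 * (w - 3).
def grad(w):
    return 2.0 * (w - 3.0)

w, lr = 0.0, 0.1
for _ in range(100):
    w -= lr * grad(w)  # step against the gradient
print(w)  # approaches the minimiser w = 3
&lt;/code>&lt;/pre>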
&lt;hr>
&lt;h2 id="reference">
 Reference
 
 &lt;a class="anchor" href="#reference">#&lt;/a>
 
&lt;/h2>
&lt;ul>
&lt;li>&lt;strong>Dive into Deep Learning&lt;/strong>, Cambridge University Press (Ch 12)&lt;/li>
&lt;/ul>
&lt;hr>
&lt;p>&lt;a href="https://arshadhs.github.io/">Home&lt;/a> | &lt;a href="https://arshadhs.github.io/docs/ai/deep-learning/">
 Deep Learning
&lt;/a>&lt;/p></description></item><item><title>Evaluation/Comparison</title><link>https://arshadhs.github.io/docs/ai/machine-learning/11-ml-model-evaluation-comparison/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://arshadhs.github.io/docs/ai/machine-learning/11-ml-model-evaluation-comparison/</guid><description>&lt;h1 id="machine-learning-model-evaluationcomparison">
 Machine Learning Model Evaluation/Comparison
 
 &lt;a class="anchor" href="#machine-learning-model-evaluationcomparison">#&lt;/a>
 
&lt;/h1>
&lt;h2 id="comparing-machine-learning-models">
 Comparing Machine Learning Models
 
 &lt;a class="anchor" href="#comparing-machine-learning-models">#&lt;/a>
 
&lt;/h2>
&lt;h2 id="emerging-requirements-eg-bias-fairness-interpretability-of-ml-models">
 Emerging requirements e.g., bias, fairness, interpretability of ML models
 
 &lt;a class="anchor" href="#emerging-requirements-eg-bias-fairness-interpretability-of-ml-models">#&lt;/a>
 
&lt;/h2>
&lt;hr>
&lt;p>&lt;a href="https://arshadhs.github.io/">Home&lt;/a> | &lt;a href="https://arshadhs.github.io/docs/ai/machine-learning/">
 Machine Learning
&lt;/a>&lt;/p></description></item><item><title>Regularisation for Deep models</title><link>https://arshadhs.github.io/docs/ai/deep-learning/110-regularisation-deep-models/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://arshadhs.github.io/docs/ai/deep-learning/110-regularisation-deep-models/</guid><description>&lt;h1 id="regularisation-for-deep-models">
 Regularisation for Deep models
 
 &lt;a class="anchor" href="#regularisation-for-deep-models">#&lt;/a>
 
&lt;/h1>
&lt;ul>
&lt;li>Generalization for regression&lt;/li>
&lt;li>Training Error and Generalization Error&lt;/li>
&lt;li>Underfitting or Overfitting&lt;/li>
&lt;li>Model Selection&lt;/li>
&lt;li>Weight Decay and Norms&lt;/li>
&lt;li>Generalization in Classification&lt;/li>
&lt;li>Environment and Distribution Shift&lt;/li>
&lt;li>Generalization in Deep Learning&lt;/li>
&lt;li>Dropout (see the sketch after this list)&lt;/li>
&lt;li>Batch Normalization&lt;/li>
&lt;li>Layer Normalization&lt;/li>
&lt;li>Code implementation (webinar)&lt;/li>
&lt;/ul>
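&lt;p>A minimal sketch of (inverted) dropout at the array level (assuming numpy; framework layers handle this internally):&lt;/p>
&lt;pre>&lt;code class="language-python">
import numpy as np

def dropout(x, p=0.5, training=True):
    # Zero each activation with probability p, then rescale the
    # survivors so the expected activation is unchanged.
    if not training or p == 0.0:
        return x
    mask = np.random.binomial(1, 1.0 - p, size=x.shape)
    return x * mask / (1.0 - p)

print(dropout(np.ones(8), p=0.5))
&lt;/code>&lt;/pre>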
&lt;hr>
&lt;h2 id="reference">
 Reference
 
 &lt;a class="anchor" href="#reference">#&lt;/a>
 
&lt;/h2>
&lt;ul>
&lt;li>&lt;strong>Dive into Deep Learning&lt;/strong>, Cambridge University Press (&lt;a href="https://d2l.ai/chapter_introduction/index.html">T1 – Ch 3.6, 3.7; Ch 4.6, 4.7; Ch 5.5, 5.6; Ch 8.5; Ch 11.7&lt;/a>)&lt;/li>
&lt;/ul>
&lt;hr>
&lt;p>&lt;a href="https://arshadhs.github.io/">Home&lt;/a> | &lt;a href="https://arshadhs.github.io/docs/ai/deep-learning/">
 Deep Learning
&lt;/a>&lt;/p></description></item><item><title>Linear Algebra</title><link>https://arshadhs.github.io/docs/ai/maths/010-linear-algebra/</link><pubDate>Wed, 18 Mar 2026 00:00:00 +0000</pubDate><guid>https://arshadhs.github.io/docs/ai/maths/010-linear-algebra/</guid><description>&lt;h1 id="linear-algebra">
 Linear Algebra
 
 &lt;a class="anchor" href="#linear-algebra">#&lt;/a>
 
&lt;/h1>
&lt;p>The &lt;strong>study of vectors and matrices&lt;/strong> is called Linear Algebra.&lt;/p>
&lt;p>Linear Algebra provides the &lt;strong>mathematical language&lt;/strong> used &lt;strong>to represent data, transformations, and structure&lt;/strong> in ML.&lt;/p>




&lt;ul>
 &lt;li>&lt;a href="https://arshadhs.github.io/docs/ai/maths/010-linear-algebra/01-linear-systems/">Linear Systems&lt;/a>
 &lt;ul>
 &lt;li>&lt;a href="https://arshadhs.github.io/docs/ai/maths/010-linear-algebra/01-linear-systems/010-systems-of-linear-equations/">Systems of Linear Equations&lt;/a>&lt;/li>
 &lt;li>&lt;a href="https://arshadhs.github.io/docs/ai/maths/010-linear-algebra/01-linear-systems/020-matrices/">Matrices&lt;/a>&lt;/li>
 &lt;li>&lt;a href="https://arshadhs.github.io/docs/ai/maths/010-linear-algebra/01-linear-systems/matrix-transposition/">Matrix Transposition&lt;/a>&lt;/li>
 &lt;li>&lt;a href="https://arshadhs.github.io/docs/ai/maths/010-linear-algebra/01-linear-systems/030-solving-linear-systems/">Solving Linear Systems&lt;/a>&lt;/li>
 &lt;li>&lt;a href="https://arshadhs.github.io/docs/ai/maths/010-linear-algebra/01-linear-systems/forward-backward/">Forward and Backward Substitution&lt;/a>&lt;/li>
 &lt;li>&lt;a href="https://arshadhs.github.io/docs/ai/maths/010-linear-algebra/01-linear-systems/inverse-matrix/">Inverse Matrix&lt;/a>&lt;/li>
 &lt;li>&lt;a href="https://arshadhs.github.io/docs/ai/maths/010-linear-algebra/01-linear-systems/convex/">Convex Combination&lt;/a>&lt;/li>
 &lt;/ul>
 &lt;/li>
 &lt;li>&lt;a href="https://arshadhs.github.io/docs/ai/maths/010-linear-algebra/02-vector-spaces/">Vector Spaces&lt;/a>
 &lt;ul>
 &lt;li>&lt;a href="https://arshadhs.github.io/docs/ai/maths/010-linear-algebra/02-vector-spaces/020-basis-and-rank/">Basis and Rank&lt;/a>&lt;/li>
 &lt;li>&lt;a href="https://arshadhs.github.io/docs/ai/maths/010-linear-algebra/02-vector-spaces/010-linear-independence/">Linear Independence&lt;/a>&lt;/li>
 &lt;li>&lt;a href="https://arshadhs.github.io/docs/ai/maths/010-linear-algebra/02-vector-spaces/030-norm/">Norm&lt;/a>&lt;/li>
 &lt;li>&lt;a href="https://arshadhs.github.io/docs/ai/maths/010-linear-algebra/02-vector-spaces/040-inner-products/">Inner Products and Dot Product&lt;/a>&lt;/li>
 &lt;li>&lt;a href="https://arshadhs.github.io/docs/ai/maths/010-linear-algebra/02-vector-spaces/050-lengths-and-distances/">Lengths and Distances&lt;/a>&lt;/li>
 &lt;li>&lt;a href="https://arshadhs.github.io/docs/ai/maths/010-linear-algebra/02-vector-spaces/060-angles-and-orthogonality/">Angles and Orthogonality&lt;/a>&lt;/li>
 &lt;li>&lt;a href="https://arshadhs.github.io/docs/ai/maths/010-linear-algebra/02-vector-spaces/070-orthonormal-basis/">Orthonormal Basis&lt;/a>&lt;/li>
 &lt;li>&lt;a href="https://arshadhs.github.io/docs/ai/maths/010-linear-algebra/02-vector-spaces/feature-space/">Feature Space&lt;/a>&lt;/li>
 &lt;li>&lt;a href="https://arshadhs.github.io/docs/ai/maths/010-linear-algebra/02-vector-spaces/cauchyschwarz/">Cauchy–Schwarz&lt;/a>&lt;/li>
 &lt;/ul>
 &lt;/li>
 &lt;li>&lt;a href="https://arshadhs.github.io/docs/ai/maths/010-linear-algebra/03-matrix-decomposition/">Matrix Decompositions&lt;/a>
 &lt;ul>
 &lt;li>&lt;a href="https://arshadhs.github.io/docs/ai/maths/010-linear-algebra/03-matrix-decomposition/special-matrices/">Special Matrices&lt;/a>&lt;/li>
 &lt;li>&lt;a href="https://arshadhs.github.io/docs/ai/maths/010-linear-algebra/03-matrix-decomposition/characteristic-polynomial/">Characteristic Polynomial&lt;/a>&lt;/li>
 &lt;li>&lt;a href="https://arshadhs.github.io/docs/ai/maths/010-linear-algebra/03-matrix-decomposition/010-determinant-and-trace/">Determinant and Trace&lt;/a>&lt;/li>
 &lt;li>&lt;a href="https://arshadhs.github.io/docs/ai/maths/010-linear-algebra/03-matrix-decomposition/020-eigenvalues-and-eigenvectors/">Eigenvalues and Eigenvectors&lt;/a>&lt;/li>
 &lt;li>&lt;a href="https://arshadhs.github.io/docs/ai/maths/010-linear-algebra/03-matrix-decomposition/030-cholesky-decomposition/">Cholesky Decomposition&lt;/a>&lt;/li>
 &lt;li>&lt;a href="https://arshadhs.github.io/docs/ai/maths/010-linear-algebra/03-matrix-decomposition/040-eigen-decomposition/">Eigen Decomposition&lt;/a>&lt;/li>
 &lt;li>&lt;a href="https://arshadhs.github.io/docs/ai/maths/010-linear-algebra/03-matrix-decomposition/diagonalization/">Diagonalization&lt;/a>&lt;/li>
 &lt;li>&lt;a href="https://arshadhs.github.io/docs/ai/maths/010-linear-algebra/03-matrix-decomposition/050-singular-value-decomposition/">Singular Value Decomposition (SVD)&lt;/a>&lt;/li>
 &lt;li>&lt;a href="https://arshadhs.github.io/docs/ai/maths/010-linear-algebra/03-matrix-decomposition/060-matrix-approximation/">Matrix Approximation&lt;/a>&lt;/li>
 &lt;/ul>
 &lt;/li>
 &lt;li>&lt;a href="https://arshadhs.github.io/docs/ai/maths/010-linear-algebra/07-dimensionality-reduction/">Dimensionality reduction and PCA&lt;/a>
 &lt;ul>
 &lt;li>&lt;a href="https://arshadhs.github.io/docs/ai/maths/010-linear-algebra/07-dimensionality-reduction/pca/">Principal Component Analysis (PCA)&lt;/a>&lt;/li>
 &lt;li>&lt;a href="https://arshadhs.github.io/docs/ai/maths/010-linear-algebra/07-dimensionality-reduction/pca-theory/">PCA Theory&lt;/a>&lt;/li>
 &lt;li>&lt;a href="https://arshadhs.github.io/docs/ai/maths/010-linear-algebra/07-dimensionality-reduction/pca-practice/">PCA in Practice&lt;/a>&lt;/li>
 &lt;li>&lt;a href="https://arshadhs.github.io/docs/ai/maths/010-linear-algebra/07-dimensionality-reduction/latent-variable-view/">Latent Variable Perspective&lt;/a>&lt;/li>
 &lt;li>&lt;a href="https://arshadhs.github.io/docs/ai/maths/010-linear-algebra/07-dimensionality-reduction/svm-mathematical-foundations/">Mathematical Preliminaries of SVM&lt;/a>&lt;/li>
 &lt;li>&lt;a href="https://arshadhs.github.io/docs/ai/maths/010-linear-algebra/07-dimensionality-reduction/kernels/">Nonlinear SVM and Kernels&lt;/a>&lt;/li>
 &lt;/ul>
 &lt;/li>
&lt;/ul>


&lt;hr>
&lt;h2 id="why-linear-algebra-matters-in-ml">
 Why Linear Algebra Matters in ML
 
 &lt;a class="anchor" href="#why-linear-algebra-matters-in-ml">#&lt;/a>
 
&lt;/h2>
&lt;ul>
&lt;li>Every machine learning model uses matrices&lt;/li>
&lt;li>All data in ML is represented using &lt;strong>vectors and matrices&lt;/strong>&lt;/li>
&lt;li>Neural networks are pipelines of matrix operations&lt;/li>
&lt;li>Models apply &lt;strong>matrix transformations&lt;/strong> to data&lt;/li>
&lt;li>Optimisation relies on linear algebra operations&lt;/li>
&lt;/ul>
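&lt;p>To make these points concrete, a tiny numpy sketch (the numbers are arbitrary):&lt;/p>
&lt;pre>&lt;code class="language-python">
import numpy as np

# Three data points with two features each, stored as rows of a matrix.
X = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])

# A linear model is just a matrix-vector product: one weight per feature.
w = np.array([0.5, -0.25])
print(X @ w)  # one prediction per row
&lt;/code>&lt;/pre>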
&lt;hr>
&lt;h2 id="what-to-learn">
 What to Learn
 
 &lt;a class="anchor" href="#what-to-learn">#&lt;/a>
 
&lt;/h2>
&lt;ul>
&lt;li>Scalars, vectors, and matrices&lt;/li>
&lt;li>Vector operations (addition, dot product)&lt;/li>
&lt;li>Matrix multiplication &lt;em>(critical)&lt;/em>&lt;/li>
&lt;li>Identity matrices and transpose&lt;/li>
&lt;li>Eigenvalues and eigenvectors &lt;em>(conceptual understanding)&lt;/em>&lt;/li>
&lt;/ul>
&lt;hr>
&lt;ul>
&lt;li>&lt;strong>Scalar&lt;/strong> → a number&lt;/li>
&lt;li>&lt;strong>Vector&lt;/strong> → a directed point&lt;/li>
&lt;li>&lt;strong>Matrix&lt;/strong> → a space transformer&lt;/li>
&lt;li>&lt;strong>Linear transformation&lt;/strong> → structured mapping&lt;/li>
&lt;li>&lt;strong>Feature&lt;/strong> → one axis&lt;/li>
&lt;li>&lt;strong>Feature space&lt;/strong> → where data lives&lt;/li>
&lt;li>&lt;strong>Vector space&lt;/strong> → where vectors live&lt;/li>
&lt;/ul>
&lt;hr>
&lt;p>&lt;a href="https://arshadhs.github.io/">Home&lt;/a> | &lt;a href="https://arshadhs.github.io/docs/ai/maths/">
 Mathematical Foundation
&lt;/a>&lt;/p></description></item><item><title>Linear Systems</title><link>https://arshadhs.github.io/docs/ai/maths/010-linear-algebra/01-linear-systems/</link><pubDate>Thu, 29 Jan 2026 00:00:00 +0000</pubDate><guid>https://arshadhs.github.io/docs/ai/maths/010-linear-algebra/01-linear-systems/</guid><description>&lt;h1 id="linear-systems">
 Linear Systems
 
 &lt;a class="anchor" href="#linear-systems">#&lt;/a>
 
&lt;/h1>
&lt;p>How systems of linear equations are represented and solved using matrices.&lt;/p>
&lt;ul>
&lt;li>the study of vectors and rules to manipulate vectors&lt;/li>
&lt;li>describe multiple linear equations solved simultaneously&lt;/li>
&lt;li>connect algebraic equations with matrix representations&lt;/li>
&lt;/ul>
&lt;hr>
&lt;p>&lt;img src="https://arshadhs.github.io/images/ai/matrix_vector_operations.png" alt="Matrix" />&lt;/p>
&lt;hr>
&lt;h2 id="idea-of-closure">
 Idea of Closure
 
 &lt;a class="anchor" href="#idea-of-closure">#&lt;/a>
 
&lt;/h2>
&lt;ul>
&lt;li>
&lt;p>performing a specific operation (like addition or multiplication) on members of a set always produces a result that belongs to the same set&lt;/p>
&lt;/li>
&lt;li>
&lt;p>The idea of closure is fundamental to defining a &lt;strong>&lt;a href="https://arshadhs.github.io/docs/ai/linear-algebra/01-linear-systems">Vector space&lt;/a>&lt;/strong> because it ensures that performing arithmetic operations (addition and scalar multiplication) on vectors within a set does not produce a new element outside that set.&lt;/p></description></item><item><title>Systems of Linear Equations</title><link>https://arshadhs.github.io/docs/ai/maths/010-linear-algebra/01-linear-systems/010-systems-of-linear-equations/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://arshadhs.github.io/docs/ai/maths/010-linear-algebra/01-linear-systems/010-systems-of-linear-equations/</guid><description>&lt;h1 id="systems-of-linear-equations">
 Systems of Linear Equations
 
 &lt;a class="anchor" href="#systems-of-linear-equations">#&lt;/a>
 
&lt;/h1>
&lt;p>A system of linear equations can be written compactly as:&lt;/p>
&lt;blockquote class="book-hint danger">
&lt;link rel="stylesheet" href="https://arshadhs.github.io/katex/katex.min.css" />
&lt;script defer src="https://arshadhs.github.io/katex/katex.min.js">&lt;/script>
 &lt;script defer src="https://arshadhs.github.io/katex/auto-render.min.js" onload="renderMathInElement(document.body, {
 &amp;#34;delimiters&amp;#34;: [
 {&amp;#34;left&amp;#34;: &amp;#34;$$&amp;#34;, &amp;#34;right&amp;#34;: &amp;#34;$$&amp;#34;, &amp;#34;display&amp;#34;: true},
 {&amp;#34;left&amp;#34;: &amp;#34;$&amp;#34;, &amp;#34;right&amp;#34;: &amp;#34;$&amp;#34;, &amp;#34;display&amp;#34;: false},
 {&amp;#34;left&amp;#34;: &amp;#34;\\(&amp;#34;, &amp;#34;right&amp;#34;: &amp;#34;\\)&amp;#34;, &amp;#34;display&amp;#34;: false},
 {&amp;#34;left&amp;#34;: &amp;#34;\\[&amp;#34;, &amp;#34;right&amp;#34;: &amp;#34;\\]&amp;#34;, &amp;#34;display&amp;#34;: true}
 ]
});">&lt;/script>
&lt;span>
 \[ 
A\mathbf{x}=\mathbf{b}
 \]
 &lt;/span>
&lt;/blockquote>
&lt;p>This represents:&lt;/p>
&lt;ul>
&lt;li>a &lt;strong>linear transformation&lt;/strong> applied to an unknown vector &lt;span>\( \mathbf{x} \)&lt;/span>&lt;/li>
&lt;li>producing an output vector &lt;span>\( \mathbf{b} \)&lt;/span>&lt;/li>
&lt;/ul>
&lt;hr>
&lt;h2 id="key-components">
 Key components
 
 &lt;a class="anchor" href="#key-components">#&lt;/a>
 
&lt;/h2>
&lt;h3 id="coefficient-matrix-a">
 Coefficient matrix (A)
 
 &lt;a class="anchor" href="#coefficient-matrix-a">#&lt;/a>
 
&lt;/h3>
&lt;p>&lt;span>\( A \)&lt;/span> contains the coefficients of the variables.&lt;/p></description></item><item><title>Calculus</title><link>https://arshadhs.github.io/docs/ai/maths/020-calculus/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://arshadhs.github.io/docs/ai/maths/020-calculus/</guid><description>&lt;h1 id="calculus">
 Calculus
 
 &lt;a class="anchor" href="#calculus">#&lt;/a>
 
&lt;/h1>
&lt;p>Calculus is:&lt;/p>
&lt;ul>
&lt;li>the mathematical framework for understanding and controlling how quantities change&lt;/li>
&lt;li>the mathematics of &lt;strong>change&lt;/strong> and &lt;strong>accumulation&lt;/strong>&lt;/li>
&lt;/ul>
&lt;p>It helps answer two big questions:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>How fast is something changing right now?&lt;/strong> What happens when inputs change &lt;strong>slightly&lt;/strong>? Where is something &lt;strong>maximum or minimum&lt;/strong>? → derivatives (differentiation)&lt;/li>
&lt;li>&lt;strong>How much has accumulated over an interval?&lt;/strong> → integrals (integration)&lt;/li>
&lt;/ul>
&lt;hr>


&lt;script src="https://arshadhs.github.io/mermaid.min.js">&lt;/script>

 &lt;script>mermaid.initialize({
 "flowchart": {
 "useMaxWidth":true
 },
 "theme": "default"
}
)&lt;/script>




&lt;pre class="mermaid">
flowchart TD
 A[Calculus] --&amp;gt; B[Limits]
 B --&amp;gt; C[Continuity]
 B --&amp;gt; D[Derivatives]
 B --&amp;gt; E[Integrals]
 D --&amp;gt; F[Optimisation: maxima/minima]
 D --&amp;gt; G[ML: gradients &amp;amp; learning]
 E --&amp;gt; H[Accumulation: area/total change]
&lt;/pre>

&lt;hr>




&lt;ul>
 &lt;li>&lt;a href="https://arshadhs.github.io/docs/ai/maths/020-calculus/04-vector-calculus/">Vector Calculus&lt;/a>
 &lt;ul>
 &lt;li>&lt;a href="https://arshadhs.github.io/docs/ai/maths/020-calculus/04-vector-calculus/010-univariate-differentiation/">Differentiation of Univariate Functions&lt;/a>&lt;/li>
 &lt;li>&lt;a href="https://arshadhs.github.io/docs/ai/maths/020-calculus/04-vector-calculus/020-partial-derivatives-and-gradients/">Partial Differentiation and Gradients&lt;/a>&lt;/li>
 &lt;li>&lt;a href="https://arshadhs.github.io/docs/ai/maths/020-calculus/04-vector-calculus/030-vector-and-matrix-gradients/">Gradients of Vector-Valued and Matrix Functions&lt;/a>&lt;/li>
 &lt;li>&lt;a href="https://arshadhs.github.io/docs/ai/maths/020-calculus/04-vector-calculus/050-gradient-identities/">Useful Gradient Identities&lt;/a>&lt;/li>
 &lt;li>&lt;a href="https://arshadhs.github.io/docs/ai/maths/020-calculus/04-vector-calculus/060-backpropagation/">Backpropagation and Automatic Differentiation&lt;/a>&lt;/li>
 &lt;li>&lt;a href="https://arshadhs.github.io/docs/ai/maths/020-calculus/04-vector-calculus/070-higher-order-derivatives/">Higher-order derivatives&lt;/a>&lt;/li>
 &lt;li>&lt;a href="https://arshadhs.github.io/docs/ai/maths/020-calculus/04-vector-calculus/080-taylors-series/">Taylor’s series&lt;/a>&lt;/li>
 &lt;li>&lt;a href="https://arshadhs.github.io/docs/ai/maths/020-calculus/04-vector-calculus/090-maxima-and-minima/">Maxima and Minima&lt;/a>&lt;/li>
 &lt;/ul>
 &lt;/li>
 &lt;li>&lt;a href="https://arshadhs.github.io/docs/ai/maths/020-calculus/05-optimisation/">Continuous Optimisation&lt;/a>
 &lt;ul>
 &lt;li>&lt;a href="https://arshadhs.github.io/docs/ai/maths/020-calculus/05-optimisation/gradient-descent/">Optimisation using Gradient Descent&lt;/a>&lt;/li>
 &lt;li>&lt;a href="https://arshadhs.github.io/docs/ai/maths/020-calculus/05-optimisation/constrained-optimisation/">Constrained Optimisation&lt;/a>&lt;/li>
 &lt;li>&lt;a href="https://arshadhs.github.io/docs/ai/maths/020-calculus/05-optimisation/lagrange-multipliers/">Lagrange Multipliers&lt;/a>&lt;/li>
 &lt;li>&lt;a href="https://arshadhs.github.io/docs/ai/maths/020-calculus/05-optimisation/convex-optimisation/">Convex Optimisation&lt;/a>&lt;/li>
 &lt;/ul>
 &lt;/li>
 &lt;li>&lt;a href="https://arshadhs.github.io/docs/ai/maths/020-calculus/06-nonlinear-optimisation/">Nonlinear Optimisation&lt;/a>
 &lt;ul>
 &lt;li>&lt;a href="https://arshadhs.github.io/docs/ai/maths/020-calculus/06-nonlinear-optimisation/optimisation-challenges/">Challenges in Gradient-Based Optimisation&lt;/a>&lt;/li>
 &lt;li>&lt;a href="https://arshadhs.github.io/docs/ai/maths/020-calculus/06-nonlinear-optimisation/stochastic-gradient-descent/">Stochastic Gradient Descent (SGD)&lt;/a>&lt;/li>
 &lt;li>&lt;a href="https://arshadhs.github.io/docs/ai/maths/020-calculus/06-nonlinear-optimisation/momentum-methods/">Momentum-Based Learning&lt;/a>&lt;/li>
 &lt;li>&lt;a href="https://arshadhs.github.io/docs/ai/maths/020-calculus/06-nonlinear-optimisation/adaptive-methods/">Adaptive Methods: AdaGrad, RMSProp, Adam&lt;/a>&lt;/li>
 &lt;li>&lt;a href="https://arshadhs.github.io/docs/ai/maths/020-calculus/06-nonlinear-optimisation/hyperparameter-tuning/">Tuning Hyperparameters and Preprocessing&lt;/a>&lt;/li>
 &lt;/ul>
 &lt;/li>
&lt;/ul>


&lt;hr>
&lt;div class="book-steps ">
&lt;ol>
&lt;li>
&lt;h2 id="differential-calculus-rates-of-change">
 Differential Calculus (Rates of Change)
 
 &lt;a class="anchor" href="#differential-calculus-rates-of-change">#&lt;/a>
 
&lt;/h2>
&lt;p>Studies &lt;strong>how things change&lt;/strong>.&lt;/p></description></item><item><title>Matrices</title><link>https://arshadhs.github.io/docs/ai/maths/010-linear-algebra/01-linear-systems/020-matrices/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://arshadhs.github.io/docs/ai/maths/010-linear-algebra/01-linear-systems/020-matrices/</guid><description>&lt;h1 id="matrices">
 Matrices
 
 &lt;a class="anchor" href="#matrices">#&lt;/a>
 
&lt;/h1>
&lt;p>Matrices are the &lt;strong>core data structure of linear algebra&lt;/strong> and the &lt;strong>workhorse of machine learning&lt;/strong>.&lt;br>
Almost every ML model can be described as a sequence of matrix operations.&lt;/p>
&lt;ul>
&lt;li>&lt;a href="https://arshadhs.github.io/docs/ai/maths/linear-algebra/03-matrix-decomposition/special-matrices/">Special Matrices&lt;/a>&lt;/li>
&lt;/ul>
&lt;hr>
&lt;h2 id="matrix">
 Matrix
 
 &lt;a class="anchor" href="#matrix">#&lt;/a>
 
&lt;/h2>
&lt;p>A &lt;strong>matrix&lt;/strong> is a rectangular array of numbers arranged in &lt;strong>rows and columns&lt;/strong>.&lt;/p>
&lt;blockquote class="book-hint danger">
&lt;link rel="stylesheet" href="https://arshadhs.github.io/katex/katex.min.css" />
&lt;script defer src="https://arshadhs.github.io/katex/katex.min.js">&lt;/script>
 &lt;script defer src="https://arshadhs.github.io/katex/auto-render.min.js" onload="renderMathInElement(document.body, {
 &amp;#34;delimiters&amp;#34;: [
 {&amp;#34;left&amp;#34;: &amp;#34;$$&amp;#34;, &amp;#34;right&amp;#34;: &amp;#34;$$&amp;#34;, &amp;#34;display&amp;#34;: true},
 {&amp;#34;left&amp;#34;: &amp;#34;$&amp;#34;, &amp;#34;right&amp;#34;: &amp;#34;$&amp;#34;, &amp;#34;display&amp;#34;: false},
 {&amp;#34;left&amp;#34;: &amp;#34;\\(&amp;#34;, &amp;#34;right&amp;#34;: &amp;#34;\\)&amp;#34;, &amp;#34;display&amp;#34;: false},
 {&amp;#34;left&amp;#34;: &amp;#34;\\[&amp;#34;, &amp;#34;right&amp;#34;: &amp;#34;\\]&amp;#34;, &amp;#34;display&amp;#34;: true}
 ]
});">&lt;/script>
&lt;span>
 \[ 
A \in \mathbb{R}^{m \times n}
 \]
 &lt;/span>
&lt;/blockquote>
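&lt;p>A quick numpy illustration of the row-by-column layout (values are arbitrary):&lt;/p>
&lt;pre>&lt;code class="language-python">
import numpy as np

# A 2 x 3 matrix: 2 rows, 3 columns.
A = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])
print(A.shape)    # (2, 3)
print(A.T.shape)  # (3, 2) after transposition
&lt;/code>&lt;/pre>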
&lt;p>An &lt;span>\( m \times n \)&lt;/span> matrix has:&lt;/p></description></item><item><title>Solving Linear Systems</title><link>https://arshadhs.github.io/docs/ai/maths/010-linear-algebra/01-linear-systems/030-solving-linear-systems/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://arshadhs.github.io/docs/ai/maths/010-linear-algebra/01-linear-systems/030-solving-linear-systems/</guid><description>&lt;h1 id="solving-linear-systems">
 Solving Linear Systems
 
 &lt;a class="anchor" href="#solving-linear-systems">#&lt;/a>
 
&lt;/h1>
&lt;p>Solve using:&lt;/p>
&lt;ul>
&lt;li>Substitution Method&lt;/li>
&lt;li>Elimination Method (Multiply &amp;amp; then Subtract)&lt;/li>
&lt;li>Cross Multiplication&lt;/li>
&lt;/ul>
&lt;p>Linear system can have:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>no solution&lt;/strong>&lt;/li>
&lt;li>&lt;strong>a unique solution&lt;/strong>&lt;/li>
&lt;li>&lt;strong>infinitely many solutions&lt;/strong>&lt;/li>
&lt;/ul>
&lt;h2 id="positive-definite-matrices">
 Positive Definite Matrices
 
 &lt;a class="anchor" href="#positive-definite-matrices">#&lt;/a>
 
&lt;/h2>
&lt;p>A square matrix is positive definite if pre-multiplying and post-multiplying it by the same non-zero vector always gives a positive number, no matter which vector we choose.&lt;/p>
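&lt;p>In symbols, &lt;span>\( A \)&lt;/span> is positive definite when:&lt;/p>
&lt;span>
 \[ 
\mathbf{x}^{\top} A \mathbf{x} > 0 \quad \text{for all } \mathbf{x} \neq \mathbf{0}
 \]
 &lt;/span>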
&lt;p>Positive definite symmetric matrices have the property that all their eigenvalues are positive.&lt;/p></description></item><item><title>Forward and Backward Substitution</title><link>https://arshadhs.github.io/docs/ai/maths/010-linear-algebra/01-linear-systems/forward-backward/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://arshadhs.github.io/docs/ai/maths/010-linear-algebra/01-linear-systems/forward-backward/</guid><description>&lt;h1 id="forward-and-backward-substitution">
 Forward and Backward Substitution
 
 &lt;a class="anchor" href="#forward-and-backward-substitution">#&lt;/a>
 
&lt;/h1>
&lt;p>Forward and backward substitution are efficient algorithms used to solve linear systems when the coefficient matrix is &lt;strong>triangular&lt;/strong>.&lt;/p>
&lt;p>They are typically used after:&lt;/p>
&lt;ul>
&lt;li>Gaussian elimination&lt;/li>
&lt;li>LU decomposition&lt;/li>
&lt;/ul>
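&lt;p>As a preview, a minimal numpy sketch of the forward pass (the matrix and vector are made up):&lt;/p>
&lt;pre>&lt;code class="language-python">
import numpy as np

def forward_substitution(L, b):
    # Solve L y = b for lower-triangular L, one row at a time.
    n = len(b)
    y = np.zeros(n)
    for i in range(n):
        y[i] = (b[i] - L[i, :i] @ y[:i]) / L[i, i]
    return y

L = np.array([[2.0, 0.0], [1.0, 3.0]])
b = np.array([4.0, 11.0])
print(forward_substitution(L, b))  # [2. 3.]
&lt;/code>&lt;/pre>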
&lt;hr>
&lt;h1 id="1-forward-substitution-lower-triangular-systems">
 1. Forward Substitution (Lower Triangular Systems)
 
 &lt;a class="anchor" href="#1-forward-substitution-lower-triangular-systems">#&lt;/a>
 
&lt;/h1>
&lt;p>Used to solve:&lt;/p>
&lt;blockquote class="book-hint danger">
&lt;link rel="stylesheet" href="https://arshadhs.github.io/katex/katex.min.css" />
&lt;script defer src="https://arshadhs.github.io/katex/katex.min.js">&lt;/script>
 &lt;script defer src="https://arshadhs.github.io/katex/auto-render.min.js" onload="renderMathInElement(document.body, {
 &amp;#34;delimiters&amp;#34;: [
 {&amp;#34;left&amp;#34;: &amp;#34;$$&amp;#34;, &amp;#34;right&amp;#34;: &amp;#34;$$&amp;#34;, &amp;#34;display&amp;#34;: true},
 {&amp;#34;left&amp;#34;: &amp;#34;$&amp;#34;, &amp;#34;right&amp;#34;: &amp;#34;$&amp;#34;, &amp;#34;display&amp;#34;: false},
 {&amp;#34;left&amp;#34;: &amp;#34;\\(&amp;#34;, &amp;#34;right&amp;#34;: &amp;#34;\\)&amp;#34;, &amp;#34;display&amp;#34;: false},
 {&amp;#34;left&amp;#34;: &amp;#34;\\[&amp;#34;, &amp;#34;right&amp;#34;: &amp;#34;\\]&amp;#34;, &amp;#34;display&amp;#34;: true}
 ]
});">&lt;/script>
&lt;span>
 \[ 
L\mathbf{x} = \mathbf{b}
 \]
 &lt;/span>
&lt;/blockquote>
&lt;p>where &lt;span>\( L \)&lt;/span> is a &lt;strong>lower triangular matrix&lt;/strong>:&lt;/p></description></item><item><title>Inverse Matrix</title><link>https://arshadhs.github.io/docs/ai/maths/010-linear-algebra/01-linear-systems/inverse-matrix/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://arshadhs.github.io/docs/ai/maths/010-linear-algebra/01-linear-systems/inverse-matrix/</guid><description>&lt;h1 id="inverse-matrix">
 Inverse Matrix
 
 &lt;a class="anchor" href="#inverse-matrix">#&lt;/a>
 
&lt;/h1>
&lt;p>The &lt;strong>inverse of a matrix&lt;/strong> is a matrix that, when multiplied with the original matrix, produces the &lt;strong>identity matrix&lt;/strong>.&lt;/p>
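&lt;p>A quick numeric check (assuming numpy; the matrix is arbitrary but invertible):&lt;/p>
&lt;pre>&lt;code class="language-python">
import numpy as np

A = np.array([[2.0, 1.0], [1.0, 3.0]])
A_inv = np.linalg.inv(A)

# Multiplying a matrix by its inverse recovers the identity.
print(np.round(A @ A_inv, 10))
&lt;/code>&lt;/pre>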
&lt;p>A square matrix &lt;span>\( A \)&lt;/span> is &lt;strong>invertible&lt;/strong> if there exists a matrix &lt;span>\( A^{-1} \)&lt;/span> such that:&lt;/p>
&lt;blockquote class="book-hint danger">
&lt;link rel="stylesheet" href="https://arshadhs.github.io/katex/katex.min.css" />
&lt;script defer src="https://arshadhs.github.io/katex/katex.min.js">&lt;/script>
 &lt;script defer src="https://arshadhs.github.io/katex/auto-render.min.js" onload="renderMathInElement(document.body, {
 &amp;#34;delimiters&amp;#34;: [
 {&amp;#34;left&amp;#34;: &amp;#34;$$&amp;#34;, &amp;#34;right&amp;#34;: &amp;#34;$$&amp;#34;, &amp;#34;display&amp;#34;: true},
 {&amp;#34;left&amp;#34;: &amp;#34;$&amp;#34;, &amp;#34;right&amp;#34;: &amp;#34;$&amp;#34;, &amp;#34;display&amp;#34;: false},
 {&amp;#34;left&amp;#34;: &amp;#34;\\(&amp;#34;, &amp;#34;right&amp;#34;: &amp;#34;\\)&amp;#34;, &amp;#34;display&amp;#34;: false},
 {&amp;#34;left&amp;#34;: &amp;#34;\\[&amp;#34;, &amp;#34;right&amp;#34;: &amp;#34;\\]&amp;#34;, &amp;#34;display&amp;#34;: true}
 ]
});">&lt;/script>
&lt;span>
 \[ 
AA^{-1} = A^{-1}A = I
 \]
 &lt;/span>
&lt;/blockquote>
&lt;p>Here:&lt;/p></description></item><item><title>Convex Combination</title><link>https://arshadhs.github.io/docs/ai/maths/010-linear-algebra/01-linear-systems/convex/</link><pubDate>Thu, 29 Jan 2026 00:00:00 +0000</pubDate><guid>https://arshadhs.github.io/docs/ai/maths/010-linear-algebra/01-linear-systems/convex/</guid><description>&lt;h1 id="convex-combination-of-two-points">
 Convex Combination of Two Points
 
 &lt;a class="anchor" href="#convex-combination-of-two-points">#&lt;/a>
 
&lt;/h1>
&lt;p>A &lt;strong>convex combination&lt;/strong> describes how to form a point between two points using weighted averages.&lt;/p>
&lt;p>It is a fundamental building block in several advanced fields:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Linear Algebra &amp;amp; Geometry&lt;/strong>&lt;/li>
&lt;li>&lt;strong>Optimization Theory&lt;/strong>&lt;/li>
&lt;li>&lt;strong>Machine Learning&lt;/strong> (Specifically in SVMs, clustering, and data interpolation)&lt;/li>
&lt;/ul>
&lt;hr>
&lt;p>Given two points (or vectors) $\mathbf{x}_1, \mathbf{x}_2 \in \mathbb{R}^n$, a convex combination of these points is defined as:&lt;/p>
$$\mathbf{x} = \lambda \mathbf{x}_1 + (1 - \lambda)\mathbf{x}_2$$&lt;p>&lt;strong>Where:&lt;/strong>&lt;/p></description></item><item><title>Vector Spaces</title><link>https://arshadhs.github.io/docs/ai/maths/010-linear-algebra/02-vector-spaces/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://arshadhs.github.io/docs/ai/maths/010-linear-algebra/02-vector-spaces/</guid><description>&lt;h1 id="vector-spaces">
 Vector Spaces
 
 &lt;a class="anchor" href="#vector-spaces">#&lt;/a>
 
&lt;/h1>
&lt;p>A vector space is the mathematical “home” where vectors live and where addition and scaling are valid operations.&lt;/p>
&lt;ul>
&lt;li>
&lt;p>A vector space is a set closed under vector addition and scalar multiplication.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Machine learning operates in vector spaces.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>covers independence, bases, rank, and geometric tools like norms and inner products that are used to measure length, distance, and angles.&lt;/p>
&lt;/li>
&lt;/ul>
&lt;p>A &lt;strong>vector space&lt;/strong> is a set of vectors that follows &lt;strong>ten axioms&lt;/strong>, defined under two operations:&lt;/p></description></item><item><title>Feature Space</title><link>https://arshadhs.github.io/docs/ai/maths/010-linear-algebra/02-vector-spaces/feature-space/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://arshadhs.github.io/docs/ai/maths/010-linear-algebra/02-vector-spaces/feature-space/</guid><description>&lt;h1 id="feature">
 Feature
 
 &lt;a class="anchor" href="#feature">#&lt;/a>
 
&lt;/h1>
&lt;p>A &lt;strong>feature&lt;/strong> is an individual measurable property or characteristic of a data point used as input to a machine learning model.&lt;/p>
&lt;p>Each feature corresponds to &lt;strong>one dimension&lt;/strong>.&lt;/p>

&lt;link rel="stylesheet" href="https://arshadhs.github.io/katex/katex.min.css" />
&lt;script defer src="https://arshadhs.github.io/katex/katex.min.js">&lt;/script>

 &lt;script defer src="https://arshadhs.github.io/katex/auto-render.min.js" onload="renderMathInElement(document.body, {
 &amp;#34;delimiters&amp;#34;: [
 {&amp;#34;left&amp;#34;: &amp;#34;$$&amp;#34;, &amp;#34;right&amp;#34;: &amp;#34;$$&amp;#34;, &amp;#34;display&amp;#34;: true},
 {&amp;#34;left&amp;#34;: &amp;#34;$&amp;#34;, &amp;#34;right&amp;#34;: &amp;#34;$&amp;#34;, &amp;#34;display&amp;#34;: false},
 {&amp;#34;left&amp;#34;: &amp;#34;\\(&amp;#34;, &amp;#34;right&amp;#34;: &amp;#34;\\)&amp;#34;, &amp;#34;display&amp;#34;: false},
 {&amp;#34;left&amp;#34;: &amp;#34;\\[&amp;#34;, &amp;#34;right&amp;#34;: &amp;#34;\\]&amp;#34;, &amp;#34;display&amp;#34;: true}
 ]
});">&lt;/script>

&lt;span>
 \[ 
x_i \in \mathbb{R}
 \]
 &lt;/span>


&lt;p>A data point with &lt;span>\( d \)&lt;/span> features is represented as:&lt;/p></description></item><item><title>Cauchy–Schwarz</title><link>https://arshadhs.github.io/docs/ai/maths/010-linear-algebra/02-vector-spaces/cauchyschwarz/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://arshadhs.github.io/docs/ai/maths/010-linear-algebra/02-vector-spaces/cauchyschwarz/</guid><description>&lt;h1 id="cauchyschwarz-inequality">
 Cauchy–Schwarz Inequality
 
 &lt;a class="anchor" href="#cauchyschwarz-inequality">#&lt;/a>
 
&lt;/h1>
&lt;p>The &lt;strong>Cauchy–Schwarz Inequality&lt;/strong> is one of the most important results in linear algebra.&lt;/p>
&lt;p>It places a fundamental bound on the inner product of two vectors.&lt;/p>
&lt;blockquote class="book-hint info">
&lt;p>If you see &lt;strong>angle&lt;/strong>, &lt;strong>cosine&lt;/strong>, &lt;strong>similarity&lt;/strong>, or &lt;strong>inner product bounds&lt;/strong>&lt;br>
→ think &lt;strong>Cauchy–Schwarz Inequality&lt;/strong>&lt;/p>
&lt;p>Key Idea:
The inner product (dot product) can never exceed the product of magnitudes.
This ensures all geometric interpretations (angles, cosine) are valid.&lt;/p>
&lt;/blockquote>
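&lt;p>A quick numeric check of the bound (assuming numpy; the vectors are arbitrary):&lt;/p>
&lt;pre>&lt;code class="language-python">
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 5.0, 6.0])

# |x . y| never exceeds ||x|| ||y||.
print(abs(x @ y))                             # 32.0
print(np.linalg.norm(x) * np.linalg.norm(y))  # about 32.83
&lt;/code>&lt;/pre>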
&lt;hr>
&lt;h2 id="statement-of-the-inequality">
 Statement of the Inequality
 
 &lt;a class="anchor" href="#statement-of-the-inequality">#&lt;/a>
 
&lt;/h2>
&lt;p>For any vectors:&lt;/p></description></item><item><title>Matrix Decompositions</title><link>https://arshadhs.github.io/docs/ai/maths/010-linear-algebra/03-matrix-decomposition/</link><pubDate>Wed, 18 Mar 2026 00:00:00 +0000</pubDate><guid>https://arshadhs.github.io/docs/ai/maths/010-linear-algebra/03-matrix-decomposition/</guid><description>&lt;h1 id="matrix-decompositions">
 Matrix Decompositions
 
 &lt;a class="anchor" href="#matrix-decompositions">#&lt;/a>
 
&lt;/h1>
&lt;p>Decompositions reveal structure in matrices and power algorithms like PCA.&lt;/p>
&lt;p>Matrix decompositions break complex matrices into simpler parts.&lt;/p>
&lt;p>From the lecture introduction, matrices are used to describe mappings and transformations of vectors.&lt;/p>
&lt;p>That is why decomposition is important:
it lets us understand a complicated transformation by rewriting it using simpler building blocks.&lt;/p>
&lt;p>In the slides, the topic is introduced as part of three closely connected goals:
how to summarise matrices,
how matrices can be decomposed,
and how the decompositions can be used for matrix approximations.&lt;/p></description></item><item><title>Characteristic Polynomial</title><link>https://arshadhs.github.io/docs/ai/maths/010-linear-algebra/03-matrix-decomposition/characteristic-polynomial/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://arshadhs.github.io/docs/ai/maths/010-linear-algebra/03-matrix-decomposition/characteristic-polynomial/</guid><description>&lt;h1 id="characteristic-polynomial">
 Characteristic Polynomial
 
 &lt;a class="anchor" href="#characteristic-polynomial">#&lt;/a>
 
&lt;/h1>
&lt;p>The &lt;strong>characteristic polynomial&lt;/strong> of a square matrix is the key tool used to compute &lt;strong>eigenvalues&lt;/strong>.&lt;/p>
&lt;p>It connects:&lt;/p>
&lt;ul>
&lt;li>Determinants&lt;/li>
&lt;li>Trace&lt;/li>
&lt;li>Eigenvalues&lt;/li>
&lt;li>Matrix structure&lt;/li>
&lt;/ul>
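&lt;p>Before the formal definition, a quick numeric sketch (assuming numpy; the matrix is arbitrary):&lt;/p>
&lt;pre>&lt;code class="language-python">
import numpy as np

A = np.array([[2.0, 1.0], [1.0, 2.0]])

coeffs = np.poly(A)      # coefficients of det(A - lambda I), highest power first
print(coeffs)            # [ 1. -4.  3.], i.e. lambda**2 - 4*lambda + 3
print(np.roots(coeffs))  # [3. 1.], the eigenvalues
&lt;/code>&lt;/pre>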
&lt;hr>
&lt;h2 id="definition">
 Definition
 
 &lt;a class="anchor" href="#definition">#&lt;/a>
 
&lt;/h2>
&lt;p>Let&lt;br>

&lt;span>
 \( A \in \mathbb{R}^{n \times n} \)
 &lt;/span>

&lt;br>
and 
&lt;span>
 \( \lambda \in \mathbb{R} \)
 &lt;/span>

.&lt;/p>
&lt;p>The &lt;strong>characteristic polynomial&lt;/strong> of &lt;span>\( A \)&lt;/span> is defined as:&lt;/p>
&lt;blockquote class="book-hint danger">
&lt;link rel="stylesheet" href="https://arshadhs.github.io/katex/katex.min.css" />
&lt;script defer src="https://arshadhs.github.io/katex/katex.min.js">&lt;/script>
 &lt;script defer src="https://arshadhs.github.io/katex/auto-render.min.js" onload="renderMathInElement(document.body, {
 &amp;#34;delimiters&amp;#34;: [
 {&amp;#34;left&amp;#34;: &amp;#34;$$&amp;#34;, &amp;#34;right&amp;#34;: &amp;#34;$$&amp;#34;, &amp;#34;display&amp;#34;: true},
 {&amp;#34;left&amp;#34;: &amp;#34;$&amp;#34;, &amp;#34;right&amp;#34;: &amp;#34;$&amp;#34;, &amp;#34;display&amp;#34;: false},
 {&amp;#34;left&amp;#34;: &amp;#34;\\(&amp;#34;, &amp;#34;right&amp;#34;: &amp;#34;\\)&amp;#34;, &amp;#34;display&amp;#34;: false},
 {&amp;#34;left&amp;#34;: &amp;#34;\\[&amp;#34;, &amp;#34;right&amp;#34;: &amp;#34;\\]&amp;#34;, &amp;#34;display&amp;#34;: true}
 ]
});">&lt;/script>
&lt;span>
 \[ 
p_A(\lambda) = \det(A - \lambda I)
 \]
 &lt;/span>
&lt;/blockquote>
&lt;p>It is a polynomial in &lt;span>\( \lambda \)&lt;/span> of degree &lt;span>\( n \)&lt;/span>.&lt;/p></description></item><item><title>Determinant and Trace</title><link>https://arshadhs.github.io/docs/ai/maths/010-linear-algebra/03-matrix-decomposition/010-determinant-and-trace/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://arshadhs.github.io/docs/ai/maths/010-linear-algebra/03-matrix-decomposition/010-determinant-and-trace/</guid><description>&lt;h1 id="determinant-and-trace">
 Determinant and Trace
 
 &lt;a class="anchor" href="#determinant-and-trace">#&lt;/a>
 
&lt;/h1>
&lt;hr>
&lt;h2 id="minor">
 Minor
 
 &lt;a class="anchor" href="#minor">#&lt;/a>
 
&lt;/h2>
&lt;p>The &lt;strong>minor&lt;/strong> of an element 
&lt;span>
 \( a_{ij} \)
 &lt;/span>

 is the determinant of the smaller square matrix formed by:&lt;/p>
&lt;ul>
&lt;li>removing &lt;strong>row&lt;/strong> 
&lt;span>
 \( i \)
 &lt;/span>

&lt;/li>
&lt;li>removing &lt;strong>column&lt;/strong> 
&lt;span>
 \( j \)
 &lt;/span>

&lt;/li>
&lt;/ul>
&lt;p>The minor is denoted 
&lt;span>
 \( M_{ij} \)
 &lt;/span>

.&lt;/p>
&lt;blockquote class="book-hint info">
&lt;p>Minors are used to compute &lt;strong>cofactors&lt;/strong>, which are used for determinants and inverses (via adjoint/adjugate).&lt;/p>
&lt;/blockquote>
&lt;hr>
&lt;h2 id="cofactor">
 Cofactor
 
 &lt;a class="anchor" href="#cofactor">#&lt;/a>
 
&lt;/h2>
&lt;p>The &lt;strong>cofactor&lt;/strong> of 
&lt;span>
 \( a_{ij} \)
 &lt;/span>

, denoted 
&lt;span>
 \( C_{ij} \)
 &lt;/span>

, is:&lt;/p></description></item><item><title>Eigenvalues and Eigenvectors</title><link>https://arshadhs.github.io/docs/ai/maths/010-linear-algebra/03-matrix-decomposition/020-eigenvalues-and-eigenvectors/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://arshadhs.github.io/docs/ai/maths/010-linear-algebra/03-matrix-decomposition/020-eigenvalues-and-eigenvectors/</guid><description>&lt;h1 id="eigenvalues-and-eigenvectors">
 Eigenvalues and Eigenvectors
 
 &lt;a class="anchor" href="#eigenvalues-and-eigenvectors">#&lt;/a>
 
&lt;/h1>
&lt;ul>
&lt;li>Eigenvalues give scaling.&lt;/li>
&lt;li>Eigenvectors define invariant directions of transformation.&lt;/li>
&lt;/ul>
&lt;p>Eigenvalues and eigenvectors describe directions that remain unchanged under a linear transformation, except for scaling.&lt;/p>
&lt;p>From lectures:
matrix multiplication represents a transformation of space.&lt;br>
Most vectors change direction and magnitude.&lt;br>
Some special vectors only scale.&lt;br>
These are eigenvectors.&lt;/p>
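&lt;p>A small numeric sketch (assuming numpy; the matrix is arbitrary, and the order of the returned eigenvalues may vary):&lt;/p>
&lt;pre>&lt;code class="language-python">
import numpy as np

A = np.array([[2.0, 1.0], [1.0, 2.0]])
values, vectors = np.linalg.eig(A)

print(values)         # [3. 1.]
print(vectors[:, 0])  # a direction A merely scales (by 3)
&lt;/code>&lt;/pre>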
&lt;blockquote class="book-hint info">
&lt;p>Key Idea:
A matrix transformation stretches or compresses vectors.
Eigenvectors are directions that remain unchanged.
Eigenvalues tell how much scaling happens.&lt;/p></description></item><item><title>Cholesky Decomposition</title><link>https://arshadhs.github.io/docs/ai/maths/010-linear-algebra/03-matrix-decomposition/030-cholesky-decomposition/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://arshadhs.github.io/docs/ai/maths/010-linear-algebra/03-matrix-decomposition/030-cholesky-decomposition/</guid><description>&lt;h1 id="cholesky-decomposition">
 Cholesky Decomposition
 
 &lt;a class="anchor" href="#cholesky-decomposition">#&lt;/a>
 
&lt;/h1>
&lt;p>Cholesky decomposition is a special matrix factorisation used for symmetric positive definite matrices.&lt;/p>
&lt;p>From lecture discussions, this decomposition is powerful because it reduces a matrix into a triangular form, making computations easier and more stable.&lt;/p>
&lt;blockquote class="book-hint info">
&lt;p>Key Idea:
Cholesky decomposition expresses a matrix as a product of a lower triangular matrix and its transpose.
It is efficient and numerically stable.&lt;/p>
&lt;/blockquote>
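&lt;p>A minimal numeric sketch (assuming numpy; the matrix is a made-up symmetric positive definite example):&lt;/p>
&lt;pre>&lt;code class="language-python">
import numpy as np

A = np.array([[4.0, 2.0], [2.0, 3.0]])  # symmetric positive definite
L = np.linalg.cholesky(A)               # lower triangular factor

# Rebuilding A from L confirms the factorisation A = L L^T.
print(np.allclose(L @ L.T, A))  # True
&lt;/code>&lt;/pre>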
&lt;hr>
&lt;h2 id="definition">
 Definition
 
 &lt;a class="anchor" href="#definition">#&lt;/a>
 
&lt;/h2>
&lt;p>For a symmetric positive definite matrix:&lt;/p></description></item><item><title>Eigen Decomposition</title><link>https://arshadhs.github.io/docs/ai/maths/010-linear-algebra/03-matrix-decomposition/040-eigen-decomposition/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://arshadhs.github.io/docs/ai/maths/010-linear-algebra/03-matrix-decomposition/040-eigen-decomposition/</guid><description>&lt;h1 id="eigen-decomposition">
 Eigen Decomposition
 
 &lt;a class="anchor" href="#eigen-decomposition">#&lt;/a>
 
&lt;/h1>
&lt;p>Eigen decomposition expresses a matrix using its eigenvectors and eigenvalues.&lt;/p>
&lt;p>From lecture discussions, this is one of the most important ways to understand the internal structure of a matrix.&lt;/p>
&lt;p>Instead of treating the matrix as a black box, eigen decomposition reveals its fundamental directions and scaling behaviour.&lt;/p>
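&lt;p>A minimal numeric sketch of the reconstruction (assuming numpy; the matrix is arbitrary):&lt;/p>
&lt;pre>&lt;code class="language-python">
import numpy as np

A = np.array([[2.0, 1.0], [1.0, 2.0]])
values, V = np.linalg.eig(A)

# Rebuild A from its directions (eigenvectors) and scalings (eigenvalues).
print(np.allclose(V @ np.diag(values) @ np.linalg.inv(V), A))  # True
&lt;/code>&lt;/pre>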
&lt;blockquote class="book-hint info">
&lt;p>Key Idea:
Eigen decomposition rewrites a matrix in terms of directions (eigenvectors) and scaling factors (eigenvalues).
This makes complex transformations easier to understand and compute.&lt;/p></description></item><item><title>Diagonalization</title><link>https://arshadhs.github.io/docs/ai/maths/010-linear-algebra/03-matrix-decomposition/diagonalization/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://arshadhs.github.io/docs/ai/maths/010-linear-algebra/03-matrix-decomposition/diagonalization/</guid><description>&lt;h1 id="diagonalization">
 Diagonalization
 
 &lt;a class="anchor" href="#diagonalization">#&lt;/a>
 
&lt;/h1>
&lt;p>Diagonalisation expresses a matrix using its eigenvectors and eigenvalues when possible.&lt;/p>
&lt;p>From lecture explanation, diagonalisation is one of the most powerful tools because it converts a complicated matrix into a much simpler form.&lt;/p>
&lt;p>Instead of working with a full matrix, we work with a diagonal matrix, which is much easier to analyse and compute.&lt;/p>
&lt;blockquote class="book-hint info">
&lt;p>Key Idea:
If a matrix has enough independent eigenvectors, it can be rewritten as a diagonal matrix using a change of basis.
This simplifies matrix operations significantly.&lt;/p></description></item><item><title>Singular Value Decomposition (SVD)</title><link>https://arshadhs.github.io/docs/ai/maths/010-linear-algebra/03-matrix-decomposition/050-singular-value-decomposition/</link><pubDate>Wed, 18 Mar 2026 00:00:00 +0000</pubDate><guid>https://arshadhs.github.io/docs/ai/maths/010-linear-algebra/03-matrix-decomposition/050-singular-value-decomposition/</guid><description>&lt;h1 id="singular-value-decomposition-svd">
 Singular Value Decomposition (SVD)
 
 &lt;a class="anchor" href="#singular-value-decomposition-svd">#&lt;/a>
 
&lt;/h1>
&lt;p>Singular Value Decomposition (SVD) is one of the most important matrix decomposition techniques in linear algebra and machine learning.&lt;/p>
&lt;p>It factorises any matrix into three simpler matrices that reveal its structure.&lt;/p>
&lt;blockquote class="book-hint info">
&lt;p>Key Idea:
SVD decomposes a matrix into rotations + scaling.
It tells us how data is transformed along orthogonal directions.&lt;/p>
&lt;/blockquote>
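&lt;p>A minimal NumPy sketch (an added illustration with a hypothetical rectangular matrix):&lt;/p>
&lt;pre>&lt;code class="language-python">import numpy as np

A = np.array([[3.0, 1.0, 1.0],
              [-1.0, 3.0, 1.0]])    # any m x n matrix has an SVD

U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Reconstruct A = U diag(s) V^T
print(np.allclose(A, U @ np.diag(s) @ Vt))   # True
print(s)   # singular values, in decreasing order
&lt;/code>&lt;/pre>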
&lt;hr>
&lt;h1 id="definition">
 Definition
 
 &lt;a class="anchor" href="#definition">#&lt;/a>
 
&lt;/h1>
&lt;p>For any real matrix:

&lt;span style="color: green;">
 &lt;span>
 \[ 
A \in \mathbb{R}^{m \times n}
 \]
 &lt;/span>

&lt;/span>&lt;/p></description></item><item><title>Matrix Approximation</title><link>https://arshadhs.github.io/docs/ai/maths/010-linear-algebra/03-matrix-decomposition/060-matrix-approximation/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://arshadhs.github.io/docs/ai/maths/010-linear-algebra/03-matrix-decomposition/060-matrix-approximation/</guid><description>&lt;h1 id="matrix-approximation">
 Matrix Approximation
 
 &lt;a class="anchor" href="#matrix-approximation">#&lt;/a>
 
&lt;/h1>
&lt;p>Low-rank approximation keeps the most important structure while reducing noise and computation.&lt;/p>
&lt;hr>
&lt;h2 id="low-rank-approximation">
 Low-Rank Approximation
 
 &lt;a class="anchor" href="#low-rank-approximation">#&lt;/a>
 
&lt;/h2>
&lt;p>Used for:&lt;/p>
&lt;ul>
&lt;li>Dimensionality reduction&lt;/li>
&lt;li>Noise removal&lt;/li>
&lt;li>Efficient computation&lt;/li>
&lt;/ul>
&lt;p>Forms the basis of &lt;strong>PCA&lt;/strong>.&lt;/p>
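&lt;p>A minimal sketch (an added NumPy illustration on hypothetical random data): truncating the SVD to the k largest singular values gives the best rank-k approximation in the Frobenius norm (Eckart-Young).&lt;/p>
&lt;pre>&lt;code class="language-python">import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((100, 50))   # hypothetical data matrix

U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 10                               # keep the k largest singular values
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# The Frobenius error equals the norm of the discarded singular values
err = np.linalg.norm(A - A_k, 'fro')
print(np.isclose(err, np.sqrt(np.sum(s[k:]**2))))   # True
&lt;/code>&lt;/pre>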
&lt;hr>
&lt;p>&lt;a href="https://arshadhs.github.io/">Home&lt;/a> | &lt;a href="https://arshadhs.github.io/docs/ai/maths/010-linear-algebra/03-matrix-decomposition/">
 Matrix Decompositions
&lt;/a>&lt;/p></description></item><item><title>Vector Calculus</title><link>https://arshadhs.github.io/docs/ai/maths/020-calculus/04-vector-calculus/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://arshadhs.github.io/docs/ai/maths/020-calculus/04-vector-calculus/</guid><description>&lt;h1 id="vector-calculus">
 Vector Calculus
 
 &lt;a class="anchor" href="#vector-calculus">#&lt;/a>
 
&lt;/h1>
&lt;p>Vector calculus extends differentiation to multivariate and vector-valued functions.&lt;/p>
&lt;p>Gradients power learning. This section builds differentiation skills needed for backpropagation.&lt;/p>
&lt;hr>




&lt;ul>
 
 
 
 
 &lt;li>
 &lt;a href="https://arshadhs.github.io/docs/ai/maths/020-calculus/04-vector-calculus/010-univariate-differentiation/">Differentiation of Univariate Functions&lt;/a>
 &lt;/li>
 
 
 
 
 &lt;li>
 &lt;a href="https://arshadhs.github.io/docs/ai/maths/020-calculus/04-vector-calculus/020-partial-derivatives-and-gradients/">Partial Differentiation and Gradients&lt;/a>
 &lt;/li>
 
 
 
 
 &lt;li>
 &lt;a href="https://arshadhs.github.io/docs/ai/maths/020-calculus/04-vector-calculus/030-vector-and-matrix-gradients/">Gradients of Vector-Valued and Matrix Functions&lt;/a>
 &lt;/li>
 
 
 
 
 &lt;li>
 &lt;a href="https://arshadhs.github.io/docs/ai/maths/020-calculus/04-vector-calculus/050-gradient-identities/">Useful Gradient Identities&lt;/a>
 &lt;/li>
 
 
 
 
 &lt;li>
 &lt;a href="https://arshadhs.github.io/docs/ai/maths/020-calculus/04-vector-calculus/060-backpropagation/">Backpropagation and Automatic Differentiation&lt;/a>
 &lt;/li>
 
 
 
 
 &lt;li>
 &lt;a href="https://arshadhs.github.io/docs/ai/maths/020-calculus/04-vector-calculus/070-higher-order-derivatives/">Higher-order derivatives&lt;/a>
 &lt;/li>
 
 
 
 
 &lt;li>
 &lt;a href="https://arshadhs.github.io/docs/ai/maths/020-calculus/04-vector-calculus/080-taylors-series/">Taylor’s series&lt;/a>
 &lt;/li>
 
 
 
 
 &lt;li>
 &lt;a href="https://arshadhs.github.io/docs/ai/maths/020-calculus/04-vector-calculus/090-maxima-and-minima/">Maxima and Minima&lt;/a>
 &lt;/li>
 
 

 
 
&lt;/ul>


&lt;hr>


&lt;script src="https://arshadhs.github.io/mermaid.min.js">&lt;/script>

 &lt;script>mermaid.initialize({
 "flowchart": {
 "useMaxWidth":true
 },
 "theme": "default"
}
)&lt;/script>




&lt;pre class="mermaid">
flowchart TD

 %% Core Node
 PD[&amp;#34;Partial Derivatives&amp;#34;]

 %% Supporting Concepts
 DQ[&amp;#34;Difference Quotient&amp;#34;]
 JH[&amp;#34;Jacobian / Hessian&amp;#34;]
 TS[&amp;#34;Taylor Series&amp;#34;]

 %% Application Chapters
 CH6[&amp;#34;&amp;lt;br/&amp;gt;Probability&amp;#34;]
 CH7[&amp;#34;&amp;lt;br/&amp;gt;Optimization&amp;#34;]
 CH9[&amp;#34;&amp;lt;br/&amp;gt;Regression&amp;#34;]
 CH10[&amp;#34;&amp;lt;br/&amp;gt;Dimensionality Reduction&amp;#34;]
 CH11[&amp;#34;&amp;lt;br/&amp;gt;Density Estimation&amp;#34;]
 CH12[&amp;#34;&amp;lt;br/&amp;gt;Classification&amp;#34;]

 %% Relationships
 DQ --&amp;gt;|defines| PD
 PD --&amp;gt;|collected in| JH
 JH --&amp;gt;|used in| TS
 JH --&amp;gt;|used in| CH6
	
 PD --&amp;gt;|used in| CH7
 PD --&amp;gt;|used in| CH9
 PD --&amp;gt;|used in| CH10
 PD --&amp;gt;|used in| CH11
 PD --&amp;gt;|used in| CH12

 %% Styling (Your Soft Academic Palette)
 style PD fill:#90CAF9,stroke:#1E88E5,color:#000

 style DQ fill:#CE93D8,stroke:#8E24AA,color:#000
 style JH fill:#CE93D8,stroke:#8E24AA,color:#000
 style TS fill:#CE93D8,stroke:#8E24AA,color:#000
 style CH6 fill:#CE93D8,stroke:#8E24AA,color:#000
	
 style CH7 fill:#C8E6C9,stroke:#2E7D32,color:#000
 style CH9 fill:#C8E6C9,stroke:#2E7D32,color:#000
 style CH10 fill:#C8E6C9,stroke:#2E7D32,color:#000
 style CH11 fill:#C8E6C9,stroke:#2E7D32,color:#000
 style CH12 fill:#C8E6C9,stroke:#2E7D32,color:#000

&lt;/pre>

&lt;hr>
&lt;p>&lt;a href="https://arshadhs.github.io/">Home&lt;/a> | &lt;a href="https://arshadhs.github.io/docs/ai/maths/020-calculus/">
 Calculus
&lt;/a>&lt;/p></description></item><item><title>Continuous Optimisation</title><link>https://arshadhs.github.io/docs/ai/maths/020-calculus/05-optimisation/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://arshadhs.github.io/docs/ai/maths/020-calculus/05-optimisation/</guid><description>&lt;h1 id="continuous-optimisation">
 Continuous Optimisation
 
 &lt;a class="anchor" href="#continuous-optimisation">#&lt;/a>
 
&lt;/h1>
&lt;p>Optimisation finds parameters that minimise (or maximise) an objective function.&lt;/p>
&lt;hr>




&lt;ul>
 
 
 
 
 &lt;li>
 &lt;a href="https://arshadhs.github.io/docs/ai/maths/020-calculus/05-optimisation/gradient-descent/">Optimisation using Gradient Descent&lt;/a>
 &lt;/li>
 
 
 
 
 &lt;li>
 &lt;a href="https://arshadhs.github.io/docs/ai/maths/020-calculus/05-optimisation/constrained-optimisation/">Constrained Optimisation&lt;/a>
 &lt;/li>
 
 
 
 
 &lt;li>
 &lt;a href="https://arshadhs.github.io/docs/ai/maths/020-calculus/05-optimisation/lagrange-multipliers/">Lagrange Multipliers&lt;/a>
 &lt;/li>
 
 
 
 
 &lt;li>
 &lt;a href="https://arshadhs.github.io/docs/ai/maths/020-calculus/05-optimisation/convex-optimisation/">Convex Optimisation&lt;/a>
 &lt;/li>
 
 

 
 
&lt;/ul>


&lt;hr>
&lt;p>&lt;a href="https://arshadhs.github.io/">Home&lt;/a> | &lt;a href="https://arshadhs.github.io/docs/ai/maths/020-calculus/">
 Calculus
&lt;/a>&lt;/p></description></item><item><title>Optimisation using Gradient Descent</title><link>https://arshadhs.github.io/docs/ai/maths/020-calculus/05-optimisation/gradient-descent/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://arshadhs.github.io/docs/ai/maths/020-calculus/05-optimisation/gradient-descent/</guid><description>&lt;h1 id="optimisation-using-gradient-descent">
 Optimisation using Gradient Descent
 
 &lt;a class="anchor" href="#optimisation-using-gradient-descent">#&lt;/a>
 
&lt;/h1>
&lt;p>Gradient descent is an optimisation algorithm used to train ML models and neural networks.&lt;/p>
&lt;ul>
&lt;li>Gradient descent updates parameters by moving opposite the gradient.&lt;/li>
&lt;/ul>
&lt;p>Trains ML models by minimising the error between predicted and actual results:&lt;/p>
&lt;ul>
&lt;li>iteratively adjusts the model parameters&lt;/li>
&lt;li>moves step-by-step in the direction of the steepest decrease in the loss function&lt;/li>
&lt;li>helps ML models learn the best possible weights for better predictions (a minimal sketch follows the list of variants below)&lt;/li>
&lt;/ul>
&lt;hr>
&lt;h2 id="types-of-gradient-gescent-learning-algorithms">
 Types of Gradient Gescent learning algorithms
 
 &lt;a class="anchor" href="#types-of-gradient-gescent-learning-algorithms">#&lt;/a>
 
&lt;/h2>
&lt;ol>
&lt;li>Batch gradient descent&lt;/li>
&lt;li>Stochastic gradient descent&lt;/li>
&lt;li>Mini-batch gradient descent&lt;/li>
&lt;/ol>
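&lt;p>A minimal sketch (an added illustration on a hypothetical one-parameter loss) of the basic update rule shared by all three variants:&lt;/p>
&lt;pre>&lt;code class="language-python"># Minimise f(w) = (w - 3)^2 with plain gradient descent
grad = lambda w: 2 * (w - 3)   # derivative of the loss

w, lr = 0.0, 0.1               # initial weight and learning rate
for _ in range(100):
    w -= lr * grad(w)          # step opposite the gradient

print(w)                       # converges towards the minimiser w = 3
&lt;/code>&lt;/pre>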
&lt;hr>
&lt;p>&lt;a href="https://arshadhs.github.io/">Home&lt;/a> | &lt;a href="https://arshadhs.github.io/docs/ai/maths/020-calculus/05-optimisation/">
 Continuous Optimisation
&lt;/a>&lt;/p></description></item><item><title>Constrained Optimisation</title><link>https://arshadhs.github.io/docs/ai/maths/020-calculus/05-optimisation/constrained-optimisation/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://arshadhs.github.io/docs/ai/maths/020-calculus/05-optimisation/constrained-optimisation/</guid><description>&lt;h1 id="constrained-optimisation">
 Constrained Optimisation
 
 &lt;a class="anchor" href="#constrained-optimisation">#&lt;/a>
 
&lt;/h1>
&lt;p>Optimisation with constraints (equalities/inequalities).&lt;/p>
&lt;hr>
&lt;p>&lt;a href="https://arshadhs.github.io/">Home&lt;/a> | &lt;a href="https://arshadhs.github.io/docs/ai/maths/020-calculus/05-optimisation/">
 Continuous Optimisation
&lt;/a>&lt;/p></description></item><item><title>Lagrange Multipliers</title><link>https://arshadhs.github.io/docs/ai/maths/020-calculus/05-optimisation/lagrange-multipliers/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://arshadhs.github.io/docs/ai/maths/020-calculus/05-optimisation/lagrange-multipliers/</guid><description>&lt;h1 id="lagrange-multipliers">
 Lagrange Multipliers
 
 &lt;a class="anchor" href="#lagrange-multipliers">#&lt;/a>
 
&lt;/h1>
&lt;p>Transforms constrained problems into unconstrained ones using Lagrangians.&lt;/p>
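&lt;p>A brief worked example (an addition to these notes): maximise f(x, y) = xy subject to x + y = 10.&lt;/p>
&lt;span style="color: green;">
 &lt;span>
 \[
\mathcal{L}(x, y, \lambda) = xy + \lambda(10 - x - y), \qquad
\frac{\partial \mathcal{L}}{\partial x} = y - \lambda = 0, \qquad
\frac{\partial \mathcal{L}}{\partial y} = x - \lambda = 0
\]
 &lt;/span>
&lt;/span>
&lt;p>Together with the constraint, these give x = y = 5, so the constrained maximum is f = 25.&lt;/p>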
&lt;hr>
&lt;p>&lt;a href="https://arshadhs.github.io/">Home&lt;/a> | &lt;a href="https://arshadhs.github.io/docs/ai/maths/020-calculus/05-optimisation/">
 Continuous Optimisation
&lt;/a>&lt;/p></description></item><item><title>Convex Optimisation</title><link>https://arshadhs.github.io/docs/ai/maths/020-calculus/05-optimisation/convex-optimisation/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://arshadhs.github.io/docs/ai/maths/020-calculus/05-optimisation/convex-optimisation/</guid><description>&lt;h1 id="convex-optimisation">
 Convex Optimisation
 
 &lt;a class="anchor" href="#convex-optimisation">#&lt;/a>
 
&lt;/h1>
&lt;p>Convex objectives have a single global minimum, making optimisation reliable.&lt;/p>
&lt;hr>
&lt;p>&lt;a href="https://arshadhs.github.io/">Home&lt;/a> | &lt;a href="https://arshadhs.github.io/docs/ai/maths/020-calculus/05-optimisation/">
 Continuous Optimisation
&lt;/a>&lt;/p></description></item><item><title>Nonlinear Optimisation</title><link>https://arshadhs.github.io/docs/ai/maths/020-calculus/06-nonlinear-optimisation/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://arshadhs.github.io/docs/ai/maths/020-calculus/06-nonlinear-optimisation/</guid><description>&lt;h1 id="nonlinear-optimisation-in-machine-learning">
 Nonlinear Optimisation in Machine Learning
 
 &lt;a class="anchor" href="#nonlinear-optimisation-in-machine-learning">#&lt;/a>
 
&lt;/h1>
&lt;p>Practical training challenges and modern optimisers used in ML.&lt;/p>
&lt;hr>




&lt;ul>
 
 
 
 
 &lt;li>
 &lt;a href="https://arshadhs.github.io/docs/ai/maths/020-calculus/06-nonlinear-optimisation/optimisation-challenges/">Challenges in Gradient-Based Optimisation&lt;/a>
 &lt;/li>
 
 
 
 
 &lt;li>
 &lt;a href="https://arshadhs.github.io/docs/ai/maths/020-calculus/06-nonlinear-optimisation/stochastic-gradient-descent/">Stochastic Gradient Descent (SGD)&lt;/a>
 &lt;/li>
 
 
 
 
 &lt;li>
 &lt;a href="https://arshadhs.github.io/docs/ai/maths/020-calculus/06-nonlinear-optimisation/momentum-methods/">Momentum-Based Learning&lt;/a>
 &lt;/li>
 
 
 
 
 &lt;li>
 &lt;a href="https://arshadhs.github.io/docs/ai/maths/020-calculus/06-nonlinear-optimisation/adaptive-methods/">Adaptive Methods: AdaGrad, RMSProp, Adam&lt;/a>
 &lt;/li>
 
 
 
 
 &lt;li>
 &lt;a href="https://arshadhs.github.io/docs/ai/maths/020-calculus/06-nonlinear-optimisation/hyperparameter-tuning/">Tuning Hyperparameters and Preprocessing&lt;/a>
 &lt;/li>
 
 

 
 
&lt;/ul>


&lt;hr>
&lt;p>&lt;a href="https://arshadhs.github.io/">Home&lt;/a> | &lt;a href="https://arshadhs.github.io/docs/ai/maths/020-calculus/">
 Calculus
&lt;/a>&lt;/p></description></item><item><title>Challenges in Gradient-Based Optimisation</title><link>https://arshadhs.github.io/docs/ai/maths/020-calculus/06-nonlinear-optimisation/optimisation-challenges/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://arshadhs.github.io/docs/ai/maths/020-calculus/06-nonlinear-optimisation/optimisation-challenges/</guid><description>&lt;h1 id="challenges-in-gradient-based-optimisation">
 Challenges in Gradient-Based Optimisation
 
 &lt;a class="anchor" href="#challenges-in-gradient-based-optimisation">#&lt;/a>
 
&lt;/h1>
&lt;ul>
&lt;li>Local optima and flat regions&lt;/li>
&lt;li>Differential curvature&lt;/li>
&lt;li>Difficult topologies (cliffs and valleys)&lt;/li>
&lt;/ul>
&lt;hr>
&lt;p>&lt;a href="https://arshadhs.github.io/">Home&lt;/a> | &lt;a href="https://arshadhs.github.io/docs/ai/maths/020-calculus/06-nonlinear-optimisation/">
 Nonlinear Optimisation
&lt;/a>&lt;/p></description></item><item><title>Stochastic Gradient Descent (SGD)</title><link>https://arshadhs.github.io/docs/ai/maths/020-calculus/06-nonlinear-optimisation/stochastic-gradient-descent/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://arshadhs.github.io/docs/ai/maths/020-calculus/06-nonlinear-optimisation/stochastic-gradient-descent/</guid><description>&lt;h1 id="stochastic-gradient-descent-sgd">
 Stochastic Gradient Descent (SGD)
 
 &lt;a class="anchor" href="#stochastic-gradient-descent-sgd">#&lt;/a>
 
&lt;/h1>
&lt;p>SGD uses mini-batches to trade exact gradients for speed and generalisation.&lt;/p>
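&lt;p>A minimal sketch (an added illustration on hypothetical linear-regression data) of the mini-batch update:&lt;/p>
&lt;pre>&lt;code class="language-python">import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=200)                  # hypothetical inputs
y = 2 * X + 1 + 0.1 * rng.standard_normal(200)    # targets from y = 2x + 1 plus noise

w, b, lr, batch = 0.0, 0.0, 0.1, 16
for _ in range(500):
    idx = rng.choice(200, size=batch, replace=False)  # sample a mini-batch
    err = w * X[idx] + b - y[idx]
    w -= lr * 2 * np.mean(err * X[idx])   # noisy estimate of the true gradient
    b -= lr * 2 * np.mean(err)

print(w, b)   # approaches (2, 1)
&lt;/code>&lt;/pre>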
&lt;hr>
&lt;p>&lt;a href="https://arshadhs.github.io/">Home&lt;/a> | &lt;a href="https://arshadhs.github.io/docs/ai/maths/020-calculus/06-nonlinear-optimisation/">
 Nonlinear Optimisation
&lt;/a>&lt;/p></description></item><item><title>Momentum-Based Learning</title><link>https://arshadhs.github.io/docs/ai/maths/020-calculus/06-nonlinear-optimisation/momentum-methods/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://arshadhs.github.io/docs/ai/maths/020-calculus/06-nonlinear-optimisation/momentum-methods/</guid><description>&lt;h1 id="momentum-based-learning">
 Momentum-Based Learning
 
 &lt;a class="anchor" href="#momentum-based-learning">#&lt;/a>
 
&lt;/h1>
&lt;p>Momentum smooths updates and helps traverse valleys efficiently.&lt;/p>
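&lt;p>A minimal sketch (an added illustration on a hypothetical quadratic loss) of the classical momentum update:&lt;/p>
&lt;pre>&lt;code class="language-python"># Minimise f(w) = (w - 3)^2 with momentum
grad = lambda w: 2 * (w - 3)

w, v = 0.0, 0.0
lr, beta = 0.05, 0.9          # step size and momentum coefficient
for _ in range(200):
    v = beta * v + grad(w)    # accumulate a running direction
    w -= lr * v               # step along the smoothed direction

print(w)                      # converges towards w = 3
&lt;/code>&lt;/pre>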
&lt;hr>
&lt;p>&lt;a href="https://arshadhs.github.io/">Home&lt;/a> | &lt;a href="https://arshadhs.github.io/docs/ai/maths/020-calculus/06-nonlinear-optimisation/">
 Nonlinear Optimisation
&lt;/a>&lt;/p></description></item><item><title>Adaptive Methods: AdaGrad, RMSProp, Adam</title><link>https://arshadhs.github.io/docs/ai/maths/020-calculus/06-nonlinear-optimisation/adaptive-methods/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://arshadhs.github.io/docs/ai/maths/020-calculus/06-nonlinear-optimisation/adaptive-methods/</guid><description>&lt;h1 id="adaptive-methods-adagrad-rmsprop-adam">
 Adaptive Methods: AdaGrad, RMSProp, Adam
 
 &lt;a class="anchor" href="#adaptive-methods-adagrad-rmsprop-adam">#&lt;/a>
 
&lt;/h1>
&lt;p>Adaptive methods adjust the learning rate individually for each parameter.&lt;/p>
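&lt;p>A minimal sketch (an added illustration) of the standard Adam update on a hypothetical one-parameter loss:&lt;/p>
&lt;pre>&lt;code class="language-python">import numpy as np

grad = lambda w: 2 * (w - 3)   # gradient of f(w) = (w - 3)^2

w, m, v = 0.0, 0.0, 0.0
lr, b1, b2, eps = 0.1, 0.9, 0.999, 1e-8
for t in range(1, 501):
    g = grad(w)
    m = b1 * m + (1 - b1) * g          # first-moment (mean) estimate
    v = b2 * v + (1 - b2) * g * g      # second-moment estimate
    m_hat = m / (1 - b1 ** t)          # bias corrections
    v_hat = v / (1 - b2 ** t)
    w -= lr * m_hat / (np.sqrt(v_hat) + eps)   # per-parameter step size

print(w)   # approaches w = 3
&lt;/code>&lt;/pre>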
&lt;hr>
&lt;p>&lt;a href="https://arshadhs.github.io/">Home&lt;/a> | &lt;a href="https://arshadhs.github.io/docs/ai/maths/020-calculus/06-nonlinear-optimisation/">
 Nonlinear Optimisation
&lt;/a>&lt;/p></description></item><item><title>Tuning Hyperparameters and Preprocessing</title><link>https://arshadhs.github.io/docs/ai/maths/020-calculus/06-nonlinear-optimisation/hyperparameter-tuning/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://arshadhs.github.io/docs/ai/maths/020-calculus/06-nonlinear-optimisation/hyperparameter-tuning/</guid><description>&lt;h1 id="tuning-hyperparameters-and-preprocessing">
 Tuning Hyperparameters and Preprocessing
 
 &lt;a class="anchor" href="#tuning-hyperparameters-and-preprocessing">#&lt;/a>
 
&lt;/h1>
&lt;ul>
&lt;li>Learning rate schedules&lt;/li>
&lt;li>Initialisation&lt;/li>
&lt;li>Tuning hyperparameters&lt;/li>
&lt;li>Importance of feature preprocessing&lt;/li>
&lt;/ul>
&lt;hr>
&lt;p>&lt;a href="https://arshadhs.github.io/">Home&lt;/a> | &lt;a href="https://arshadhs.github.io/docs/ai/maths/020-calculus/06-nonlinear-optimisation/">
 Nonlinear Optimisation
&lt;/a>&lt;/p></description></item><item><title>Dimensionality reduction and PCA</title><link>https://arshadhs.github.io/docs/ai/maths/010-linear-algebra/07-dimensionality-reduction/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://arshadhs.github.io/docs/ai/maths/010-linear-algebra/07-dimensionality-reduction/</guid><description>&lt;h1 id="dimensionality-reduction-and-pca">
 Dimensionality reduction and PCA
 
 &lt;a class="anchor" href="#dimensionality-reduction-and-pca">#&lt;/a>
 
&lt;/h1>
&lt;p>PCA and SVM connect linear algebra, geometry, and optimisation.&lt;/p>
&lt;hr>




&lt;ul>
 
 
 
 
 &lt;li>
 &lt;a href="https://arshadhs.github.io/docs/ai/maths/010-linear-algebra/07-dimensionality-reduction/pca/">Principal Component Analysis (PCA)&lt;/a>
 &lt;/li>
 
 
 
 
 &lt;li>
 &lt;a href="https://arshadhs.github.io/docs/ai/maths/010-linear-algebra/07-dimensionality-reduction/pca-theory/">PCA Theory&lt;/a>
 &lt;/li>
 
 
 
 
 &lt;li>
 &lt;a href="https://arshadhs.github.io/docs/ai/maths/010-linear-algebra/07-dimensionality-reduction/pca-practice/">PCA in Practice&lt;/a>
 &lt;/li>
 
 
 
 
 &lt;li>
 &lt;a href="https://arshadhs.github.io/docs/ai/maths/010-linear-algebra/07-dimensionality-reduction/latent-variable-view/">Latent Variable Perspective&lt;/a>
 &lt;/li>
 
 
 
 
 &lt;li>
 &lt;a href="https://arshadhs.github.io/docs/ai/maths/010-linear-algebra/07-dimensionality-reduction/svm-mathematical-foundations/">Mathematical Preliminaries of SVM&lt;/a>
 &lt;/li>
 
 
 
 
 &lt;li>
 &lt;a href="https://arshadhs.github.io/docs/ai/maths/010-linear-algebra/07-dimensionality-reduction/kernels/">Nonlinear SVM and Kernels&lt;/a>
 &lt;/li>
 
 

 
 
&lt;/ul>


&lt;hr>
&lt;p>&lt;a href="https://arshadhs.github.io/">Home&lt;/a> | &lt;a href="https://arshadhs.github.io/docs/ai/maths/010-linear-algebra/">
 Linear Algebra
&lt;/a>&lt;/p></description></item><item><title>Principal Component Analysis (PCA)</title><link>https://arshadhs.github.io/docs/ai/maths/010-linear-algebra/07-dimensionality-reduction/pca/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://arshadhs.github.io/docs/ai/maths/010-linear-algebra/07-dimensionality-reduction/pca/</guid><description>&lt;h1 id="principal-component-analysis-pca">
 Principal Component Analysis (PCA)
 
 &lt;a class="anchor" href="#principal-component-analysis-pca">#&lt;/a>
 
&lt;/h1>
&lt;ul>
&lt;li>dimensionality reduction technique&lt;/li>
&lt;li>helps us to &lt;strong>reduce the number of features&lt;/strong> in a dataset while keeping the most important information.&lt;/li>
&lt;li>simplifies complex datasets by transforming correlated features into a smaller set of uncorrelated components.&lt;/li>
&lt;li>uses &lt;strong>linear algebra&lt;/strong> to transform data into &lt;strong>new features&lt;/strong> called principal components.&lt;/li>
&lt;li>finds these by calculating &lt;strong>eigenvectors (directions)&lt;/strong> and &lt;strong>eigenvalues (importance)&lt;/strong> from the &lt;strong>covariance matrix&lt;/strong>.&lt;/li>
&lt;li>PCA &lt;strong>selects the top components with the highest eigenvalues&lt;/strong> and &lt;strong>projects the data onto them to simplify the dataset&lt;/strong> (a minimal sketch follows this list).&lt;/li>
&lt;/ul>
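&lt;p>A minimal sketch (an added NumPy illustration on hypothetical data) of these steps:&lt;/p>
&lt;pre>&lt;code class="language-python">import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 5))       # 200 samples, 5 features

Xc = X - X.mean(axis=0)                 # centre each feature
C = np.cov(Xc, rowvar=False)            # 5 x 5 covariance matrix

eigvals, eigvecs = np.linalg.eigh(C)    # eigh: for symmetric matrices, ascending order
order = np.argsort(eigvals)[::-1]       # largest eigenvalues first
components = eigvecs[:, order[:2]]      # top 2 principal directions

Z = Xc @ components                     # project the data onto the components
print(Z.shape)                          # (200, 2)
&lt;/code>&lt;/pre>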
&lt;blockquote class="book-hint default">
&lt;p>PCA prioritizes the directions where the data varies the most because more variation = more useful information.&lt;/p></description></item><item><title>PCA Theory</title><link>https://arshadhs.github.io/docs/ai/maths/010-linear-algebra/07-dimensionality-reduction/pca-theory/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://arshadhs.github.io/docs/ai/maths/010-linear-algebra/07-dimensionality-reduction/pca-theory/</guid><description>&lt;h1 id="pca-theory">
 PCA Theory
 
 &lt;a class="anchor" href="#pca-theory">#&lt;/a>
 
&lt;/h1>
&lt;ul>
&lt;li>Problem setting&lt;/li>
&lt;li>Maximum variance perspective&lt;/li>
&lt;li>Projection perspective&lt;/li>
&lt;li>Eigenvector and low-rank approximations&lt;/li>
&lt;/ul>
&lt;hr>
&lt;p>&lt;a href="https://arshadhs.github.io/">Home&lt;/a> | &lt;a href="https://arshadhs.github.io/docs/ai/maths/010-linear-algebra/07-dimensionality-reduction/">
 Dimensionality reduction and PCA
&lt;/a>&lt;/p></description></item><item><title>PCA in Practice</title><link>https://arshadhs.github.io/docs/ai/maths/010-linear-algebra/07-dimensionality-reduction/pca-practice/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://arshadhs.github.io/docs/ai/maths/010-linear-algebra/07-dimensionality-reduction/pca-practice/</guid><description>&lt;h1 id="pca-in-practice">
 PCA in Practice
 
 &lt;a class="anchor" href="#pca-in-practice">#&lt;/a>
 
&lt;/h1>
&lt;p>Key steps of PCA in practice, including considerations in high dimensions.&lt;/p>
&lt;hr>
&lt;p>&lt;a href="https://arshadhs.github.io/">Home&lt;/a> | &lt;a href="https://arshadhs.github.io/docs/ai/maths/010-linear-algebra/07-dimensionality-reduction/">
 Dimensionality reduction and PCA
&lt;/a>&lt;/p></description></item><item><title>Latent Variable Perspective</title><link>https://arshadhs.github.io/docs/ai/maths/010-linear-algebra/07-dimensionality-reduction/latent-variable-view/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://arshadhs.github.io/docs/ai/maths/010-linear-algebra/07-dimensionality-reduction/latent-variable-view/</guid><description>&lt;h1 id="latent-variable-perspective">
 Latent Variable Perspective
 
 &lt;a class="anchor" href="#latent-variable-perspective">#&lt;/a>
 
&lt;/h1>
&lt;p>PCA can be interpreted as modelling data using a smaller number of latent variables.&lt;/p>
&lt;hr>
&lt;p>&lt;a href="https://arshadhs.github.io/">Home&lt;/a> | &lt;a href="https://arshadhs.github.io/docs/ai/maths/010-linear-algebra/07-dimensionality-reduction/">
 Dimensionality reduction and PCA
&lt;/a>&lt;/p></description></item><item><title>Mathematical Preliminaries of SVM</title><link>https://arshadhs.github.io/docs/ai/maths/010-linear-algebra/07-dimensionality-reduction/svm-mathematical-foundations/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://arshadhs.github.io/docs/ai/maths/010-linear-algebra/07-dimensionality-reduction/svm-mathematical-foundations/</guid><description>&lt;h1 id="mathematical-preliminaries-of-svm">
 Mathematical Preliminaries of SVM
 
 &lt;a class="anchor" href="#mathematical-preliminaries-of-svm">#&lt;/a>
 
&lt;/h1>
&lt;ul>
&lt;li>Primal and dual perspectives&lt;/li>
&lt;li>Geometry of margins&lt;/li>
&lt;/ul>
&lt;hr>
&lt;p>&lt;a href="https://arshadhs.github.io/">Home&lt;/a> | &lt;a href="https://arshadhs.github.io/docs/ai/maths/010-linear-algebra/07-dimensionality-reduction/">
 Dimensionality reduction and PCA
&lt;/a>&lt;/p></description></item><item><title>Nonlinear SVM and Kernels</title><link>https://arshadhs.github.io/docs/ai/maths/010-linear-algebra/07-dimensionality-reduction/kernels/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://arshadhs.github.io/docs/ai/maths/010-linear-algebra/07-dimensionality-reduction/kernels/</guid><description>&lt;h1 id="nonlinear-svm-and-kernels">
 Nonlinear SVM and Kernels
 
 &lt;a class="anchor" href="#nonlinear-svm-and-kernels">#&lt;/a>
 
&lt;/h1>
&lt;p>Kernels allow inner products in high-dimensional feature spaces without explicit mapping.&lt;/p>
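&lt;p>A minimal sketch (an added illustration) of the Gaussian (RBF) kernel, which corresponds to an inner product in an infinite-dimensional feature space computed without ever constructing that space:&lt;/p>
&lt;pre>&lt;code class="language-python">import numpy as np

def rbf_kernel(x, z, gamma=1.0):
    # k(x, z) = exp(-gamma * ||x - z||^2)
    return np.exp(-gamma * np.sum((x - z) ** 2))

x = np.array([1.0, 2.0])
z = np.array([1.5, 1.0])
print(rbf_kernel(x, z))   # similarity in (0, 1]
&lt;/code>&lt;/pre>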
&lt;hr>
&lt;p>&lt;a href="https://arshadhs.github.io/">Home&lt;/a> | &lt;a href="https://arshadhs.github.io/docs/ai/maths/010-linear-algebra/07-dimensionality-reduction/">
 Dimensionality reduction and PCA
&lt;/a>&lt;/p></description></item></channel></rss>