<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>AI on Arshad Siddiqui</title><link>https://arshadhs.github.io/tags/ai/</link><description>Recent content in AI on Arshad Siddiqui</description><generator>Hugo</generator><language>en-us</language><lastBuildDate>Wed, 22 Apr 2026 00:00:00 +0000</lastBuildDate><atom:link href="https://arshadhs.github.io/tags/ai/index.xml" rel="self" type="application/rss+xml"/><item><title>Formula Sheet</title><link>https://arshadhs.github.io/docs/ai/statistics/00_formulas/</link><pubDate>Thu, 12 Mar 2026 00:00:00 +0000</pubDate><guid>https://arshadhs.github.io/docs/ai/statistics/00_formulas/</guid><description>&lt;h1 id="formula-sheet">
 Formula Sheet
 
 &lt;a class="anchor" href="#formula-sheet">#&lt;/a>
 
&lt;/h1>
&lt;p>This page is a quick reference of &lt;strong>definitions + formulas&lt;/strong>, grouped by module.&lt;/p>
&lt;hr>
&lt;h2 id="notation">
 Notation
 
 &lt;a class="anchor" href="#notation">#&lt;/a>
 
&lt;/h2>
&lt;ul>
&lt;li>Sample size: 
&lt;link rel="stylesheet" href="https://arshadhs.github.io/katex/katex.min.css" />
&lt;script defer src="https://arshadhs.github.io/katex/katex.min.js">&lt;/script>

 &lt;script defer src="https://arshadhs.github.io/katex/auto-render.min.js" onload="renderMathInElement(document.body, {
 &amp;#34;delimiters&amp;#34;: [
 {&amp;#34;left&amp;#34;: &amp;#34;$$&amp;#34;, &amp;#34;right&amp;#34;: &amp;#34;$$&amp;#34;, &amp;#34;display&amp;#34;: true},
 {&amp;#34;left&amp;#34;: &amp;#34;$&amp;#34;, &amp;#34;right&amp;#34;: &amp;#34;$&amp;#34;, &amp;#34;display&amp;#34;: false},
 {&amp;#34;left&amp;#34;: &amp;#34;\\(&amp;#34;, &amp;#34;right&amp;#34;: &amp;#34;\\)&amp;#34;, &amp;#34;display&amp;#34;: false},
 {&amp;#34;left&amp;#34;: &amp;#34;\\[&amp;#34;, &amp;#34;right&amp;#34;: &amp;#34;\\]&amp;#34;, &amp;#34;display&amp;#34;: true}
 ]
});">&lt;/script>

&lt;span>
 \( n \)
 &lt;/span>

 (sample), 
&lt;span>
 \( N \)
 &lt;/span>

 (population)&lt;/li>
&lt;li>Sample mean: 
&lt;span>
 \( \bar{x} \)
 &lt;/span>

, population mean: 
&lt;span>
 \( \mu \)
 &lt;/span>

&lt;/li>
&lt;li>Sample variance: 
&lt;span>
 \( s^2 \)
 &lt;/span>

, population variance: 
&lt;span>
 \( \sigma^2 \)
 &lt;/span>

&lt;/li>
&lt;li>Sample SD: 
&lt;span>
 \( s \)
 &lt;/span>

, population SD: 
&lt;span>
 \( \sigma \)
 &lt;/span>

&lt;/li>
&lt;li>Complement: 
&lt;span>
 \( A^c \)
 &lt;/span>

&lt;/li>
&lt;li>Intersection (“and”): 
&lt;span>
 \( A\cap B \)
 &lt;/span>

, union (“or”): 
&lt;span>
 \( A\cup B \)
 &lt;/span>

&lt;/li>
&lt;li>Conditional probability: 
&lt;span>
 \( P(A\mid B) \)
 &lt;/span>

&lt;/li>
&lt;/ul>
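&lt;p>As a worked reminder of how this notation combines, conditional probability is defined (assuming \( P(B) > 0 \)) by:&lt;/p>
&lt;p>
&lt;span>
 \( P(A\mid B) = \frac{P(A\cap B)}{P(B)} \)
 &lt;/span>
&lt;/p>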
&lt;hr>
&lt;h1 id="1-basic-probability--statistics">
 1. Basic Probability &amp;amp; Statistics
 
 &lt;a class="anchor" href="#1-basic-probability--statistics">#&lt;/a>
 
&lt;/h1>
&lt;h2 id="11-measures-of-central-tendency">
 1.1 Measures of Central Tendency
 
 &lt;a class="anchor" href="#11-measures-of-central-tendency">#&lt;/a>
 
&lt;/h2>
&lt;h3 id="arithmetic-mean">
 Arithmetic mean
 
 &lt;a class="anchor" href="#arithmetic-mean">#&lt;/a>
 
&lt;/h3>
&lt;p>Sample mean (ungrouped):&lt;/p></description></item><item><title>Supervised Learning</title><link>https://arshadhs.github.io/docs/ai/machine-learning/ml-supervised/</link><pubDate>Sat, 03 Jan 2026 10:29:52 +0100</pubDate><guid>https://arshadhs.github.io/docs/ai/machine-learning/ml-supervised/</guid><description>&lt;h1 id="supervised-learning">
 Supervised Learning
 
 &lt;a class="anchor" href="#supervised-learning">#&lt;/a>
 
&lt;/h1>
&lt;p>Trained using &lt;strong>labelled data&lt;/strong>.&lt;br>
Each example in the training set includes the &lt;strong>correct output&lt;/strong>.&lt;br>
The algorithm learns to &lt;strong>generalise&lt;/strong> and make predictions on unseen data.&lt;br>
Requires &lt;strong>human intervention&lt;/strong> for labelling and setup.&lt;br>
Widely used because it is generally more &lt;strong>accurate&lt;/strong> than unsupervised methods, producing &lt;strong>highly accurate results&lt;/strong> when trained on good-quality labelled data.&lt;/p>
&lt;hr>
&lt;h2 id="classification">
 Classification
 
 &lt;a class="anchor" href="#classification">#&lt;/a>
 
&lt;/h2>
&lt;p>Output is &lt;strong>discrete&lt;/strong> (e.g. Yes/No, Spam/Not Spam).&lt;br>
Used for &lt;strong>categorising data&lt;/strong> into predefined classes.&lt;br>
Support Vector Machine (SVM) is a common classifier (a linear classifier with margin-based separation).&lt;/p></description></item><item><title>Artificial Intelligence</title><link>https://arshadhs.github.io/docs/ai/</link><pubDate>Thu, 04 Jul 2024 10:55:52 +0100</pubDate><guid>https://arshadhs.github.io/docs/ai/</guid><description>&lt;h1 id="my-ai-notes">
 My AI Notes
 
 &lt;a class="anchor" href="#my-ai-notes">#&lt;/a>
 
&lt;/h1>
&lt;p>Learning how machines learn! My working notes as I learn AI.&lt;/p>
&lt;hr>


&lt;script src="https://arshadhs.github.io/mermaid.min.js">&lt;/script>

 &lt;script>mermaid.initialize({
 "flowchart": {
 "useMaxWidth":true
 },
 "theme": "default"
}
)&lt;/script>




&lt;pre class="mermaid">
flowchart LR
 AI[Artificial Intelligence]
 ML[Machine Learning]
 DL[Deep Learning]
 FM[Foundation Models]
 LLM[LLM Models]

 AI --&amp;gt; ML
 ML --&amp;gt; DL
 DL --&amp;gt; FM
 FM --&amp;gt; LLM

 style AI fill:#E1F5FE
 style ML fill:#C8E6C9
 style DL fill:#90CAF9
 style FM fill:#64B5F6
 style LLM fill:#FFCCBC
&lt;/pre>

&lt;hr>
&lt;ul>
&lt;li>Mathematical Foundations for Machine Learning&lt;/li>
&lt;li>Statistical Methods&lt;/li>
&lt;li>Machine Learning&lt;/li>
&lt;li>Deep Neural Networks&lt;/li>
&lt;/ul>
&lt;hr>




&lt;ul>
 &lt;li>
 &lt;a href="https://arshadhs.github.io/docs/ai/foundation/">AI Foundation&lt;/a>

 
 



&lt;ul>
 
 
 
 
 &lt;li>
 &lt;a href="https://arshadhs.github.io/docs/ai/foundation/ai-stages/">AI Stages: ANI, AGI, ASI&lt;/a>
 &lt;/li>
 
 
 
 
 &lt;li>
 &lt;a href="https://arshadhs.github.io/docs/ai/foundation/ai-stack/">AI Stack&lt;/a>
 &lt;/li>
 
 
 
 
 &lt;li>
 &lt;a href="https://arshadhs.github.io/docs/ai/foundation/ai-pipeline/">AI Pipeline&lt;/a>
 &lt;/li>
 
 
 
 
 &lt;li>
 &lt;a href="https://arshadhs.github.io/docs/ai/foundation/ai-notes/">AI Learning Resources&lt;/a>
 &lt;/li>
 
 

 
 
&lt;/ul>

 &lt;/li>
 
 &lt;li>
 &lt;a href="https://arshadhs.github.io/docs/ai/machine-learning/">Machine Learning&lt;/a>

 
 



&lt;ul>
 
 
 
 
 &lt;li>
 &lt;a href="https://arshadhs.github.io/docs/ai/machine-learning/ml-supervised/">Supervised Learning&lt;/a>
 &lt;/li>
 
 
 
 
 &lt;li>
 &lt;a href="https://arshadhs.github.io/docs/ai/machine-learning/ml-unsupervised/">Unsupervised Learning&lt;/a>
 &lt;/li>
 
 
 
 
 &lt;li>
 &lt;a href="https://arshadhs.github.io/docs/ai/machine-learning/ml-semi-supervised/">Semi-Supervised Learning&lt;/a>
 &lt;/li>
 
 
 
 
 &lt;li>
 &lt;a href="https://arshadhs.github.io/docs/ai/machine-learning/ml-reinforcement/">Reinforcement Learning&lt;/a>
 &lt;/li>
 
 
 
 
 &lt;li>
 &lt;a href="https://arshadhs.github.io/docs/ai/machine-learning/02-ml-workflow/">ML Workflow&lt;/a>
 &lt;/li>
 
 
 
 
 &lt;li>
 &lt;a href="https://arshadhs.github.io/docs/ai/machine-learning/03-linear-models-regression/">Regression(Linear Models)&lt;/a>
 &lt;/li>
 
 
 
 
 &lt;li>
 &lt;a href="https://arshadhs.github.io/docs/ai/machine-learning/03-ordinary-least-squares/">Ordinary Least Squares&lt;/a>
 &lt;/li>
 
 
 
 
 &lt;li>
 &lt;a href="https://arshadhs.github.io/docs/ai/machine-learning/03-cost-function/">Cost Function&lt;/a>
 &lt;/li>
 
 
 
 
 &lt;li>
 &lt;a href="https://arshadhs.github.io/docs/ai/machine-learning/03-gradient-descent-linear-regression/">Gradient Descent&lt;/a>
 &lt;/li>
 
 
 
 
 &lt;li>
 &lt;a href="https://arshadhs.github.io/docs/ai/machine-learning/04-linear-models-classification/">Classification(Linear Models)&lt;/a>
 &lt;/li>
 
 
 
 
 &lt;li>
 &lt;a href="https://arshadhs.github.io/docs/ai/machine-learning/05-decision-tree/">Decision Tree&lt;/a>
 &lt;/li>
 
 
 
 
 &lt;li>
 &lt;a href="https://arshadhs.github.io/docs/ai/machine-learning/06-instance-based-learning/">Instance-based Learning&lt;/a>
 &lt;/li>
 
 
 
 
 &lt;li>
 &lt;a href="https://arshadhs.github.io/docs/ai/machine-learning/07-support-vector-machines/">Support Vector Machine&lt;/a>
 &lt;/li>
 
 
 
 
 &lt;li>
 &lt;a href="https://arshadhs.github.io/docs/ai/machine-learning/08-bayesian-learning/">Bayesian Learning&lt;/a>
 &lt;/li>
 
 
 
 
 &lt;li>
 &lt;a href="https://arshadhs.github.io/docs/ai/machine-learning/09-ensemble-learning/">Ensemble Learning&lt;/a>
 &lt;/li>
 
 
 
 
 &lt;li>
 &lt;a href="https://arshadhs.github.io/docs/ai/machine-learning/11-ml-model-evaluation-comparison/">Evaluation/Comparison&lt;/a>
 &lt;/li>
 
 
 
 
 &lt;li>
 &lt;a href="https://arshadhs.github.io/docs/ai/machine-learning/99-ml-pipeline-model/">ML Pipeline&lt;/a>
 &lt;/li>
 
 

 
 
&lt;/ul>

 &lt;/li>
 
 &lt;li>
 &lt;a href="https://arshadhs.github.io/docs/ai/genai/">Generative AI&lt;/a>

 
 



&lt;ul>
 
 
 
 
 &lt;li>
 &lt;a href="https://arshadhs.github.io/docs/ai/genai/foundation-model/">Foundation Models&lt;/a>
 &lt;/li>
 
 
 
 
 &lt;li>
 &lt;a href="https://arshadhs.github.io/docs/ai/genai/llm/">LLM - Model&lt;/a>
 &lt;/li>
 
 
 
 
 &lt;li>
 &lt;a href="https://arshadhs.github.io/docs/ai/genai/ai-agents/">AI Agents&lt;/a>
 &lt;/li>
 
 
 
 
 &lt;li>
 &lt;a href="https://arshadhs.github.io/docs/ai/genai/rag/">Retrieval-Augmented Generation (RAG)&lt;/a>
 &lt;/li>
 
 

 
 
&lt;/ul>

 &lt;/li>
 
 &lt;li>
 &lt;a href="https://arshadhs.github.io/docs/ai/deep-learning/">Deep Learning&lt;/a>

 
 



&lt;ul>
 
 
 
 
 &lt;li>
 &lt;a href="https://arshadhs.github.io/docs/ai/deep-learning/010-neural-network/">Neural Networks&lt;/a>
 &lt;/li>
 
 
 
 
 &lt;li>
 &lt;a href="https://arshadhs.github.io/docs/ai/deep-learning/020-perceptron/">Artificial Neuron and Perceptron&lt;/a>
 &lt;/li>
 
 
 
 
 &lt;li>
 &lt;a href="https://arshadhs.github.io/docs/ai/deep-learning/030-linear-neural-networks-for-regression/">LNN for Regression&lt;/a>
 &lt;/li>
 
 
 
 
 &lt;li>
 &lt;a href="https://arshadhs.github.io/docs/ai/deep-learning/035-gradient-descent-algorithm/">Gradient Descent Algorithm&lt;/a>
 &lt;/li>
 
 
 
 
 &lt;li>
 &lt;a href="https://arshadhs.github.io/docs/ai/deep-learning/040-linear-neural-networks-for-classification/">LNN for Classification&lt;/a>
 &lt;/li>
 
 
 
 
 &lt;li>
 &lt;a href="https://arshadhs.github.io/docs/ai/deep-learning/050-deep-feedforward/">Deep Feedforward Neural Networks (DFNN) for Classification&lt;/a>
 &lt;/li>
 
 
 
 
 &lt;li>
 &lt;a href="https://arshadhs.github.io/docs/ai/deep-learning/060-cnn-fundamentals/">Convolutional Neural Networks&lt;/a>
 &lt;/li>
 
 
 
 
 &lt;li>
 &lt;a href="https://arshadhs.github.io/docs/ai/deep-learning/065-deep-cnn-architectures/">Deep CNN Architectures&lt;/a>
 &lt;/li>
 
 
 
 
 &lt;li>
 &lt;a href="https://arshadhs.github.io/docs/ai/deep-learning/067-cnn-model/">CNN Pipeline&lt;/a>
 &lt;/li>
 
 
 
 
 &lt;li>
 &lt;a href="https://arshadhs.github.io/docs/ai/deep-learning/070-recurrent-nn/">Recurrent Neural Networks&lt;/a>
 &lt;/li>
 
 
 
 
 &lt;li>
 &lt;a href="https://arshadhs.github.io/docs/ai/deep-learning/075-recurrent-nn-deep/">Deep Recurrent Neural Networks&lt;/a>
 &lt;/li>
 
 
 
 
 &lt;li>
 &lt;a href="https://arshadhs.github.io/docs/ai/deep-learning/080-attention-mechanism/">Attention Mechanism&lt;/a>
 &lt;/li>
 
 
 
 
 &lt;li>
 &lt;a href="https://arshadhs.github.io/docs/ai/deep-learning/090-transformer/">Transformer&lt;/a>
 &lt;/li>
 
 
 
 
 &lt;li>
 &lt;a href="https://arshadhs.github.io/docs/ai/deep-learning/100-optimise-deep-models/">Optimisation of Deep models&lt;/a>
 &lt;/li>
 
 
 
 
 &lt;li>
 &lt;a href="https://arshadhs.github.io/docs/ai/deep-learning/110-regularisation-deep-models/">Regularisation for Deep models&lt;/a>
 &lt;/li>
 
 

 
 
&lt;/ul>

 &lt;/li>
 
 &lt;li>
 &lt;a href="https://arshadhs.github.io/docs/ai/maths/">Mathematical Foundation&lt;/a>

 
 



&lt;ul>
 &lt;li>
 &lt;a href="https://arshadhs.github.io/docs/ai/maths/010-linear-algebra/">Linear Algebra&lt;/a>

 
 



&lt;ul>
 &lt;li>
 &lt;a href="https://arshadhs.github.io/docs/ai/maths/010-linear-algebra/01-linear-systems/">Linear Systems&lt;/a>

 
 



&lt;ul>
 
 
 
 
 &lt;li>
 &lt;a href="https://arshadhs.github.io/docs/ai/maths/010-linear-algebra/01-linear-systems/010-systems-of-linear-equations/">Systems of Linear Equations&lt;/a>
 &lt;/li>
 
 
 
 
 &lt;li>
 &lt;a href="https://arshadhs.github.io/docs/ai/maths/010-linear-algebra/01-linear-systems/020-matrices/">Matrices&lt;/a>
 &lt;/li>
 
 
 
 
 &lt;li>
 &lt;a href="https://arshadhs.github.io/docs/ai/maths/010-linear-algebra/01-linear-systems/matrix-transposition/">Matrix Transposition&lt;/a>
 &lt;/li>
 
 
 
 
 &lt;li>
 &lt;a href="https://arshadhs.github.io/docs/ai/maths/010-linear-algebra/01-linear-systems/030-solving-linear-systems/">Solving Linear Systems&lt;/a>
 &lt;/li>
 
 
 
 
 &lt;li>
 &lt;a href="https://arshadhs.github.io/docs/ai/maths/010-linear-algebra/01-linear-systems/forward-backward/">Forward and Backward Substitution&lt;/a>
 &lt;/li>
 
 
 
 
 &lt;li>
 &lt;a href="https://arshadhs.github.io/docs/ai/maths/010-linear-algebra/01-linear-systems/inverse-matrix/">Inverse Matrix&lt;/a>
 &lt;/li>
 
 
 
 
 &lt;li>
 &lt;a href="https://arshadhs.github.io/docs/ai/maths/010-linear-algebra/01-linear-systems/convex/">Convex Combination&lt;/a>
 &lt;/li>
 
 

 
 
&lt;/ul>

 &lt;/li>
 
 &lt;li>
 &lt;a href="https://arshadhs.github.io/docs/ai/maths/010-linear-algebra/02-vector-spaces/">Vector Spaces&lt;/a>

 
 



&lt;ul>
 
 
 
 
 &lt;li>
 &lt;a href="https://arshadhs.github.io/docs/ai/maths/010-linear-algebra/02-vector-spaces/020-basis-and-rank/">Basis and Rank&lt;/a>
 &lt;/li>
 
 
 
 
 &lt;li>
 &lt;a href="https://arshadhs.github.io/docs/ai/maths/010-linear-algebra/02-vector-spaces/010-linear-independence/">Linear Independence&lt;/a>
 &lt;/li>
 
 
 
 
 &lt;li>
 &lt;a href="https://arshadhs.github.io/docs/ai/maths/010-linear-algebra/02-vector-spaces/030-norm/">Norm&lt;/a>
 &lt;/li>
 
 
 
 
 &lt;li>
 &lt;a href="https://arshadhs.github.io/docs/ai/maths/010-linear-algebra/02-vector-spaces/040-inner-products/">Inner Products and Dot Product&lt;/a>
 &lt;/li>
 
 
 
 
 &lt;li>
 &lt;a href="https://arshadhs.github.io/docs/ai/maths/010-linear-algebra/02-vector-spaces/050-lengths-and-distances/">Lengths and Distances&lt;/a>
 &lt;/li>
 
 
 
 
 &lt;li>
 &lt;a href="https://arshadhs.github.io/docs/ai/maths/010-linear-algebra/02-vector-spaces/060-angles-and-orthogonality/">Angles and Orthogonality&lt;/a>
 &lt;/li>
 
 
 
 
 &lt;li>
 &lt;a href="https://arshadhs.github.io/docs/ai/maths/010-linear-algebra/02-vector-spaces/070-orthonormal-basis/">Orthonormal Basis&lt;/a>
 &lt;/li>
 
 
 
 
 &lt;li>
 &lt;a href="https://arshadhs.github.io/docs/ai/maths/010-linear-algebra/02-vector-spaces/feature-space/">Feature Space&lt;/a>
 &lt;/li>
 
 
 
 
 &lt;li>
 &lt;a href="https://arshadhs.github.io/docs/ai/maths/010-linear-algebra/02-vector-spaces/cauchyschwarz/">Cauchy–Schwarz&lt;/a>
 &lt;/li>
 
 

 
 
&lt;/ul>

 &lt;/li>
 
 &lt;li>
 &lt;a href="https://arshadhs.github.io/docs/ai/maths/010-linear-algebra/03-matrix-decomposition/">Matrix Decompositions&lt;/a>

 
 



&lt;ul>
 
 
 
 
 &lt;li>
 &lt;a href="https://arshadhs.github.io/docs/ai/maths/010-linear-algebra/03-matrix-decomposition/special-matrices/">Special Matrices&lt;/a>
 &lt;/li>
 
 
 
 
 &lt;li>
 &lt;a href="https://arshadhs.github.io/docs/ai/maths/010-linear-algebra/03-matrix-decomposition/characteristic-polynomial/">Characteristic Polynomial&lt;/a>
 &lt;/li>
 
 
 
 
 &lt;li>
 &lt;a href="https://arshadhs.github.io/docs/ai/maths/010-linear-algebra/03-matrix-decomposition/010-determinant-and-trace/">Determinant and Trace&lt;/a>
 &lt;/li>
 
 
 
 
 &lt;li>
 &lt;a href="https://arshadhs.github.io/docs/ai/maths/010-linear-algebra/03-matrix-decomposition/020-eigenvalues-and-eigenvectors/">Eigenvalues and Eigenvectors&lt;/a>
 &lt;/li>
 
 
 
 
 &lt;li>
 &lt;a href="https://arshadhs.github.io/docs/ai/maths/010-linear-algebra/03-matrix-decomposition/030-cholesky-decomposition/">Cholesky Decomposition&lt;/a>
 &lt;/li>
 
 
 
 
 &lt;li>
 &lt;a href="https://arshadhs.github.io/docs/ai/maths/010-linear-algebra/03-matrix-decomposition/040-eigen-decomposition/">Eigen Decomposition&lt;/a>
 &lt;/li>
 
 
 
 
 &lt;li>
 &lt;a href="https://arshadhs.github.io/docs/ai/maths/010-linear-algebra/03-matrix-decomposition/diagonalization/">Diagonalization&lt;/a>
 &lt;/li>
 
 
 
 
 &lt;li>
 &lt;a href="https://arshadhs.github.io/docs/ai/maths/010-linear-algebra/03-matrix-decomposition/050-singular-value-decomposition/">Singular Value Decomposition (SVD)&lt;/a>
 &lt;/li>
 
 
 
 
 &lt;li>
 &lt;a href="https://arshadhs.github.io/docs/ai/maths/010-linear-algebra/03-matrix-decomposition/060-matrix-approximation/">Matrix Approximation&lt;/a>
 &lt;/li>
 
 

 
 
&lt;/ul>

 &lt;/li>
 
 &lt;li>
 &lt;a href="https://arshadhs.github.io/docs/ai/maths/010-linear-algebra/07-dimensionality-reduction/">Dimensionality reduction and PCA&lt;/a>

 
 



&lt;ul>
 
 
 
 
 &lt;li>
 &lt;a href="https://arshadhs.github.io/docs/ai/maths/010-linear-algebra/07-dimensionality-reduction/pca/">Principal Component Analysis (PCA)&lt;/a>
 &lt;/li>
 
 
 
 
 &lt;li>
 &lt;a href="https://arshadhs.github.io/docs/ai/maths/010-linear-algebra/07-dimensionality-reduction/pca-theory/">PCA Theory&lt;/a>
 &lt;/li>
 
 
 
 
 &lt;li>
 &lt;a href="https://arshadhs.github.io/docs/ai/maths/010-linear-algebra/07-dimensionality-reduction/pca-practice/">PCA in Practice&lt;/a>
 &lt;/li>
 
 
 
 
 &lt;li>
 &lt;a href="https://arshadhs.github.io/docs/ai/maths/010-linear-algebra/07-dimensionality-reduction/latent-variable-view/">Latent Variable Perspective&lt;/a>
 &lt;/li>
 
 
 
 
 &lt;li>
 &lt;a href="https://arshadhs.github.io/docs/ai/maths/010-linear-algebra/07-dimensionality-reduction/svm-mathematical-foundations/">Mathematical Preliminaries of SVM&lt;/a>
 &lt;/li>
 
 
 
 
 &lt;li>
 &lt;a href="https://arshadhs.github.io/docs/ai/maths/010-linear-algebra/07-dimensionality-reduction/kernels/">Nonlinear SVM and Kernels&lt;/a>
 &lt;/li>
 
 

 
 
&lt;/ul>

 &lt;/li>
 
&lt;/ul>

 &lt;/li>
 
 &lt;li>
 &lt;a href="https://arshadhs.github.io/docs/ai/maths/020-calculus/">Calculus&lt;/a>

 
 



&lt;ul>
 &lt;li>
 &lt;a href="https://arshadhs.github.io/docs/ai/maths/020-calculus/04-vector-calculus/">Vector Calculus&lt;/a>

 
 



&lt;ul>
 
 
 
 
 &lt;li>
 &lt;a href="https://arshadhs.github.io/docs/ai/maths/020-calculus/04-vector-calculus/010-univariate-differentiation/">Differentiation of Univariate Functions&lt;/a>
 &lt;/li>
 
 
 
 
 &lt;li>
 &lt;a href="https://arshadhs.github.io/docs/ai/maths/020-calculus/04-vector-calculus/020-partial-derivatives-and-gradients/">Partial Differentiation and Gradients&lt;/a>
 &lt;/li>
 
 
 
 
 &lt;li>
 &lt;a href="https://arshadhs.github.io/docs/ai/maths/020-calculus/04-vector-calculus/030-vector-and-matrix-gradients/">Gradients of Vector-Valued and Matrix Functions&lt;/a>
 &lt;/li>
 
 
 
 
 &lt;li>
 &lt;a href="https://arshadhs.github.io/docs/ai/maths/020-calculus/04-vector-calculus/050-gradient-identities/">Useful Gradient Identities&lt;/a>
 &lt;/li>
 
 
 
 
 &lt;li>
 &lt;a href="https://arshadhs.github.io/docs/ai/maths/020-calculus/04-vector-calculus/060-backpropagation/">Backpropagation and Automatic Differentiation&lt;/a>
 &lt;/li>
 
 
 
 
 &lt;li>
 &lt;a href="https://arshadhs.github.io/docs/ai/maths/020-calculus/04-vector-calculus/070-higher-order-derivatives/">Higher-order derivatives&lt;/a>
 &lt;/li>
 
 
 
 
 &lt;li>
 &lt;a href="https://arshadhs.github.io/docs/ai/maths/020-calculus/04-vector-calculus/080-taylors-series/">Taylor’s series&lt;/a>
 &lt;/li>
 
 
 
 
 &lt;li>
 &lt;a href="https://arshadhs.github.io/docs/ai/maths/020-calculus/04-vector-calculus/090-maxima-and-minima/">Maxima and Minima&lt;/a>
 &lt;/li>
 
 

 
 
&lt;/ul>

 &lt;/li>
 
 &lt;li>
 &lt;a href="https://arshadhs.github.io/docs/ai/maths/020-calculus/05-optimisation/">Continuous Optimisation&lt;/a>

 
 



&lt;ul>
 
 
 
 
 &lt;li>
 &lt;a href="https://arshadhs.github.io/docs/ai/maths/020-calculus/05-optimisation/gradient-descent/">Optimisation using Gradient Descent&lt;/a>
 &lt;/li>
 
 
 
 
 &lt;li>
 &lt;a href="https://arshadhs.github.io/docs/ai/maths/020-calculus/05-optimisation/constrained-optimisation/">Constrained Optimisation&lt;/a>
 &lt;/li>
 
 
 
 
 &lt;li>
 &lt;a href="https://arshadhs.github.io/docs/ai/maths/020-calculus/05-optimisation/lagrange-multipliers/">Lagrange Multipliers&lt;/a>
 &lt;/li>
 
 
 
 
 &lt;li>
 &lt;a href="https://arshadhs.github.io/docs/ai/maths/020-calculus/05-optimisation/convex-optimisation/">Convex Optimisation&lt;/a>
 &lt;/li>
 
 

 
 
&lt;/ul>

 &lt;/li>
 
 &lt;li>
 &lt;a href="https://arshadhs.github.io/docs/ai/maths/020-calculus/06-nonlinear-optimisation/">Nonlinear Optimisation&lt;/a>

 
 



&lt;ul>
 
 
 
 
 &lt;li>
 &lt;a href="https://arshadhs.github.io/docs/ai/maths/020-calculus/06-nonlinear-optimisation/optimisation-challenges/">Challenges in Gradient-Based Optimisation&lt;/a>
 &lt;/li>
 
 
 
 
 &lt;li>
 &lt;a href="https://arshadhs.github.io/docs/ai/maths/020-calculus/06-nonlinear-optimisation/stochastic-gradient-descent/">Stochastic Gradient Descent (SGD)&lt;/a>
 &lt;/li>
 
 
 
 
 &lt;li>
 &lt;a href="https://arshadhs.github.io/docs/ai/maths/020-calculus/06-nonlinear-optimisation/momentum-methods/">Momentum-Based Learning&lt;/a>
 &lt;/li>
 
 
 
 
 &lt;li>
 &lt;a href="https://arshadhs.github.io/docs/ai/maths/020-calculus/06-nonlinear-optimisation/adaptive-methods/">Adaptive Methods: AdaGrad, RMSProp, Adam&lt;/a>
 &lt;/li>
 
 
 
 
 &lt;li>
 &lt;a href="https://arshadhs.github.io/docs/ai/maths/020-calculus/06-nonlinear-optimisation/hyperparameter-tuning/">Tuning Hyperparameters and Preprocessing&lt;/a>
 &lt;/li>
 
 

 
 
&lt;/ul>

 &lt;/li>
 
&lt;/ul>

 &lt;/li>
 
&lt;/ul>

 &lt;/li>
 
 &lt;li>
 &lt;a href="https://arshadhs.github.io/docs/ai/statistics/">Statistics&lt;/a>

 
 



&lt;ul>
 
 
 
 
 &lt;li>
 &lt;a href="https://arshadhs.github.io/docs/ai/statistics/00_formulas/">Formula Sheet&lt;/a>
 &lt;/li>
 
 
 
 
 &lt;li>
 &lt;a href="https://arshadhs.github.io/docs/ai/statistics/ism-formula-sheet/">Stats Formula Sheet&lt;/a>
 &lt;/li>
 
 
 
 
 &lt;li>
 &lt;a href="https://arshadhs.github.io/docs/ai/statistics/01_basic_statistics/">Basic Statistics&lt;/a>
 &lt;/li>
 
 
 
 
 &lt;li>
 &lt;a href="https://arshadhs.github.io/docs/ai/statistics/01_basic_probability/">Basic Probability&lt;/a>
 &lt;/li>
 &lt;li>
 &lt;a href="https://arshadhs.github.io/docs/ai/statistics/04_hypothesis_testing/">Hypothesis Testing&lt;/a>
 &lt;/li>
 
 
 
 
 &lt;li>
 &lt;a href="https://arshadhs.github.io/docs/ai/statistics/05_prediction_n_forecasting/">Prediction &amp;amp; Forecasting&lt;/a>
 &lt;/li>
 
 
 
 
 &lt;li>
 &lt;a href="https://arshadhs.github.io/docs/ai/statistics/06_prediction_n_forecasting/">Gaussian Mixture model &amp;amp; Expectation Maximization&lt;/a>
 &lt;/li>
 
 

 
 
 &lt;li>
 &lt;a href="https://arshadhs.github.io/docs/ai/statistics/conditional-probability/">Conditional Probability &amp;amp; Bayes’ Theorem&lt;/a>

 
 



&lt;ul>
 
 
 
 
 &lt;li>
 &lt;a href="https://arshadhs.github.io/docs/ai/statistics/conditional-probability/021_conditional_prob/">Conditional Probability&lt;/a>
 &lt;/li>
 
 
 
 
 &lt;li>
 &lt;a href="https://arshadhs.github.io/docs/ai/statistics/conditional-probability/022_bayes_theorem/">Bayes’ Theorem&lt;/a>
 &lt;/li>
 
 
 
 
 &lt;li>
 &lt;a href="https://arshadhs.github.io/docs/ai/statistics/conditional-probability/023_naive_bayes/">Naïve Bayes&lt;/a>
 &lt;/li>
 
 

 
 
&lt;/ul>

 &lt;/li>
 
 &lt;li>
 &lt;a href="https://arshadhs.github.io/docs/ai/statistics/probability_distributions/">Probability Distributions&lt;/a>

 
 



&lt;ul>
 
 
 
 
 &lt;li>
 &lt;a href="https://arshadhs.github.io/docs/ai/statistics/probability_distributions/random-variables/">Random Variables&lt;/a>
 &lt;/li>
 
 
 
 
 &lt;li>
 &lt;a href="https://arshadhs.github.io/docs/ai/statistics/probability_distributions/common-distributions/">Common Probability Distributions&lt;/a>
 &lt;/li>
 
 

 
 
&lt;/ul>

 &lt;/li>
 
&lt;/ul>

 &lt;/li>
 
&lt;/ul>


&lt;hr>
&lt;ul>
&lt;li>Machine Learning → The broad field where systems learn patterns from data to make predictions or decisions.&lt;/li>
&lt;li>Neural Networks → A subset of machine learning that uses interconnected artificial neurons to model complex relationships.&lt;/li>
&lt;li>Deep Learning → A subset of neural networks that uses many hidden layers to learn high-level features from large datasets.&lt;/li>
&lt;li>Foundation Models → Large deep learning models trained on massive datasets and reused across many tasks using transfer learning.&lt;/li>
&lt;li>LLMs (Large Language Models) → A specialised type of foundation model focused on understanding and generating human language.&lt;/li>
&lt;/ul>
&lt;hr>


&lt;pre class="mermaid">
flowchart TD
AI[&amp;#34;Artificial&amp;lt;br/&amp;gt;Intelligence&amp;#34;]
ML[&amp;#34;Machine&amp;lt;br/&amp;gt;Learning&amp;#34;]
NN[&amp;#34;Neural&amp;lt;br/&amp;gt;Networks&amp;#34;]
DL[&amp;#34;Deep&amp;lt;br/&amp;gt;Learning&amp;#34;]
FM[&amp;#34;Foundation&amp;lt;br/&amp;gt;Models&amp;#34;]
LLM[&amp;#34;LLM&amp;lt;br/&amp;gt;Models&amp;#34;]

AI --&amp;gt; ML
ML --&amp;gt; NN
NN --&amp;gt; DL
DL --&amp;gt; FM
FM --&amp;gt; LLM

LR[&amp;#34;Linear&amp;lt;br/&amp;gt;Regression&amp;#34;]
DT[&amp;#34;Decision&amp;lt;br/&amp;gt;Trees&amp;#34;]
ML --&amp;gt; LR
ML --&amp;gt; DT

MLP[&amp;#34;MLP&amp;#34;]
CNN[&amp;#34;CNN&amp;#34;]
NN --&amp;gt; MLP
NN --&amp;gt; CNN

CNNDL[&amp;#34;CNN&amp;lt;br/&amp;gt;(deep)&amp;#34;]
RNN[&amp;#34;RNN&amp;#34;]
DL --&amp;gt; CNNDL
DL --&amp;gt; RNN

BERT[&amp;#34;BERT&amp;#34;]
CLIP[&amp;#34;CLIP&amp;#34;]
FM --&amp;gt; BERT
FM --&amp;gt; CLIP

GPT[&amp;#34;GPT&amp;#34;]
LLAMA[&amp;#34;LLaMA&amp;#34;]
LLM --&amp;gt; GPT
LLM --&amp;gt; LLAMA

TEXT[&amp;#34;Text&amp;#34;]
IMAGE[&amp;#34;Images&amp;#34;]
AUDIO[&amp;#34;Audio&amp;#34;]
VIDEO[&amp;#34;Video&amp;#34;]
LLM --&amp;gt; TEXT
LLM --&amp;gt; IMAGE
LLM --&amp;gt; AUDIO
LLM --&amp;gt; VIDEO

style AI fill:#90CAF9,stroke:#1E88E5,color:#000
style ML fill:#90CAF9,stroke:#1E88E5,color:#000
style NN fill:#90CAF9,stroke:#1E88E5,color:#000

style DL fill:#CE93D8,stroke:#8E24AA,color:#000
style FM fill:#CE93D8,stroke:#8E24AA,color:#000

style LLM fill:#C8E6C9,stroke:#2E7D32,color:#000
style LR fill:#C8E6C9,stroke:#2E7D32,color:#000
style DT fill:#C8E6C9,stroke:#2E7D32,color:#000
style MLP fill:#C8E6C9,stroke:#2E7D32,color:#000
style CNN fill:#C8E6C9,stroke:#2E7D32,color:#000
style CNNDL fill:#C8E6C9,stroke:#2E7D32,color:#000
style RNN fill:#C8E6C9,stroke:#2E7D32,color:#000
style BERT fill:#C8E6C9,stroke:#2E7D32,color:#000
style CLIP fill:#C8E6C9,stroke:#2E7D32,color:#000
style GPT fill:#C8E6C9,stroke:#2E7D32,color:#000
style LLAMA fill:#C8E6C9,stroke:#2E7D32,color:#000
style TEXT fill:#C8E6C9,stroke:#2E7D32,color:#000
style IMAGE fill:#C8E6C9,stroke:#2E7D32,color:#000
style AUDIO fill:#C8E6C9,stroke:#2E7D32,color:#000
style VIDEO fill:#C8E6C9,stroke:#2E7D32,color:#000
&lt;/pre>

&lt;hr>
&lt;p>&lt;img src="https://arshadhs.github.io/images/ai/ai_ml_dl_ds_diagram.png" alt="AI, ML, DL, and Data Science Diagram" />&lt;/p></description></item><item><title>Stats Formula Sheet</title><link>https://arshadhs.github.io/docs/ai/statistics/ism-formula-sheet/</link><pubDate>Wed, 25 Feb 2026 00:00:00 +0000</pubDate><guid>https://arshadhs.github.io/docs/ai/statistics/ism-formula-sheet/</guid><description>&lt;h1 id="stats-formula-sheet">
 Stats Formula Sheet
 
 &lt;a class="anchor" href="#stats-formula-sheet">#&lt;/a>
 
&lt;/h1>
&lt;p>Keep this page as a quick reference of &lt;strong>definitions + formulas&lt;/strong>.&lt;/p>
&lt;hr>
&lt;h2 id="notation">
 Notation
 
 &lt;a class="anchor" href="#notation">#&lt;/a>
 
&lt;/h2>
&lt;ul>
&lt;li>Sample size: 
&lt;span>
 \( n \)
 &lt;/span>

 (sample), 
&lt;span>
 \( N \)
 &lt;/span>

 (population)&lt;/li>
&lt;li>Mean: 
&lt;span>
 \( \bar{x} \)
 &lt;/span>

 (sample), 
&lt;span>
 \( \mu \)
 &lt;/span>

 (population)&lt;/li>
&lt;li>Variance: 
&lt;span>
 \( s^2 \)
 &lt;/span>

 (sample), 
&lt;span>
 \( \sigma^2 \)
 &lt;/span>

 (population)&lt;/li>
&lt;li>Standard deviation: 
&lt;span>
 \( s \)
 &lt;/span>

 (sample), 
&lt;span>
 \( \sigma \)
 &lt;/span>

 (population)&lt;/li>
&lt;/ul>
&lt;hr>
&lt;h2 id="module-1-basic-statistics">
 Module 1: Basic Statistics
 
 &lt;a class="anchor" href="#module-1-basic-statistics">#&lt;/a>
 
&lt;/h2>
&lt;h3 id="measures-of-central-tendency">
 Measures of Central Tendency
 
 &lt;a class="anchor" href="#measures-of-central-tendency">#&lt;/a>
 
&lt;/h3>
&lt;p>&lt;strong>Sample mean (ungrouped):&lt;/strong>&lt;/p></description></item><item><title>Unsupervised Learning</title><link>https://arshadhs.github.io/docs/ai/machine-learning/ml-unsupervised/</link><pubDate>Sat, 03 Jan 2026 10:29:52 +0100</pubDate><guid>https://arshadhs.github.io/docs/ai/machine-learning/ml-unsupervised/</guid><description>&lt;h1 id="unsupervised-learning">
 Unsupervised Learning
 
 &lt;a class="anchor" href="#unsupervised-learning">#&lt;/a>
 
&lt;/h1>
&lt;ul>
&lt;li>Works on &lt;strong>unlabelled raw data&lt;/strong>.&lt;/li>
&lt;li>The algorithm &lt;strong>discovers hidden patterns&lt;/strong> without prior knowledge of outcomes.&lt;/li>
&lt;li>Requires &lt;strong>no human intervention&lt;/strong> during training.&lt;/li>
&lt;li>Does not predict labels directly; it &lt;strong>groups or organises data&lt;/strong> instead.&lt;/li>
&lt;li>Results are &lt;strong>harder to validate&lt;/strong> because there is no ground truth to compare against.&lt;/li>
&lt;li>Common techniques include &lt;strong>Clustering&lt;/strong>, &lt;strong>Association&lt;/strong>, and &lt;strong>Dimensionality Reduction&lt;/strong>.&lt;/li>
&lt;/ul>
&lt;hr>


&lt;script src="https://arshadhs.github.io/mermaid.min.js">&lt;/script>

 &lt;script>mermaid.initialize({
 "flowchart": {
 "useMaxWidth":true
 },
 "theme": "default"
}
)&lt;/script>




&lt;pre class="mermaid">
stateDiagram-v2

 %% ML maths-based colours (same palette as supervised)
 classDef probability fill:#d1fae5,stroke:#065f46,stroke-width:1px
 classDef geometry fill:#ffedd5,stroke:#9a3412,stroke-width:1px
 classDef category font-style:italic,font-weight:bold,fill:#f3f4f6,stroke:#374151

 %% Root
 USL: Unsupervised Learning

 %% Main branches
 USL --&amp;gt; CLU:::category
 CLU: Clustering

 USL --&amp;gt; DR:::category
 DR: Dimensionality Reduction

 %% Clustering algorithms
 CLU --&amp;gt; KM:::geometry
 KM: K-Means

 CLU --&amp;gt; HC:::geometry
 HC: Hierarchical Clustering

 CLU --&amp;gt; DB:::geometry
 DB: DBSCAN

 %% Probabilistic models
 USL --&amp;gt; PM:::category
 PM: Probabilistic Models

 PM --&amp;gt; GMM:::probability
 GMM: Gaussian Mixture Model

 PM --&amp;gt; HMM:::probability
 HMM: Hidden Markov Model
&lt;/pre>

&lt;hr>
&lt;h2 id="clustering">
 Clustering
 
 &lt;a class="anchor" href="#clustering">#&lt;/a>
 
&lt;/h2>
&lt;ul>
&lt;li>Groups &lt;strong>similar data points&lt;/strong> together based on shared features.&lt;/li>
&lt;li>Commonly used for &lt;strong>market segmentation&lt;/strong>, &lt;strong>image compression&lt;/strong>, and &lt;strong>anomaly detection&lt;/strong>.&lt;/li>
&lt;/ul>
&lt;h3 id="common-types-of-clustering">
 Common Types of Clustering
 
 &lt;a class="anchor" href="#common-types-of-clustering">#&lt;/a>
 
&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>K-Means Clustering&lt;/strong> – Divides data into &lt;em>K&lt;/em> groups based on similarity.&lt;/li>
&lt;li>&lt;strong>Hierarchical Clustering&lt;/strong> – Builds a hierarchy (tree) of clusters.&lt;/li>
&lt;li>&lt;strong>DBSCAN (Density-Based Spatial Clustering)&lt;/strong> – Groups points that lie in dense regions and flags sparse points as noise/outliers.&lt;/li>
&lt;/ul>
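&lt;p>As a concrete sketch of the assign-then-update loop at the heart of K-Means, here is a minimal pure-Python version (the points, the choice of k and the iteration count below are illustrative assumptions, not a real dataset or a library implementation):&lt;/p>

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Minimal K-Means sketch: assign each point to its nearest
    centroid, then move each centroid to the mean of its cluster."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            # nearest centroid by squared Euclidean distance
            i = min(range(k),
                    key=lambda c: sum((a - b) ** 2
                                      for a, b in zip(p, centroids[c])))
            clusters[i].append(p)
        for i, cl in enumerate(clusters):
            if cl:  # keep the old centroid if a cluster goes empty
                centroids[i] = tuple(sum(xs) / len(cl) for xs in zip(*cl))
    return centroids, clusters

# Two well-separated blobs: one centroid should settle near each.
pts = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1),
       (5.0, 5.0), (5.1, 4.9), (4.9, 5.2)]
centroids, clusters = kmeans(pts, k=2)
```

&lt;p>In practice a library implementation (with smarter initialisation and a convergence test) would be used; the sketch only shows the two alternating steps.&lt;/p>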
&lt;hr>
&lt;h2 id="association">
 Association
 
 &lt;a class="anchor" href="#association">#&lt;/a>
 
&lt;/h2>
&lt;ul>
&lt;li>Identifies &lt;strong>relationships or correlations&lt;/strong> between variables in a dataset.&lt;/li>
&lt;li>Commonly used in &lt;strong>market basket analysis&lt;/strong> (e.g. &amp;ldquo;Customers who bought X also bought Y&amp;rdquo;).&lt;/li>
&lt;/ul>
&lt;h3 id="common-techniques">
 Common Techniques
 
 &lt;a class="anchor" href="#common-techniques">#&lt;/a>
 
&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Apriori Algorithm&lt;/strong> – Finds frequent itemsets and generates association rules.&lt;/li>
&lt;li>&lt;strong>Eclat Algorithm&lt;/strong> – Similar to Apriori but uses set intersections for faster computation.&lt;/li>
&lt;/ul>
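&lt;p>The core Apriori idea – keep only frequent single items, then count itemsets built from them – can be sketched in a few lines (the baskets and support threshold are hypothetical, and this covers only the first two levels, not the full algorithm):&lt;/p>

```python
from itertools import combinations
from collections import Counter

# Hypothetical market-basket transactions.
baskets = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "bread"},
    {"beer", "crisps"},
    {"bread", "butter", "jam"},
]

min_support = 3  # an itemset is "frequent" if it appears in 3+ baskets

# Level 1: count single items and keep the frequent ones.
item_counts = Counter(item for b in baskets for item in b)
frequent_items = {i for i, n in item_counts.items() if n >= min_support}

# Level 2: count only pairs built from frequent items (the pruning step).
pair_counts = Counter()
for b in baskets:
    for pair in combinations(sorted(b.intersection(frequent_items)), 2):
        pair_counts[pair] += 1

frequent_pairs = {p: n for p, n in pair_counts.items() if n >= min_support}
```

&lt;p>A rule such as “customers who bought bread also bought butter” would then be derived from the surviving frequent pairs.&lt;/p>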
&lt;hr>
&lt;h2 id="dimensionality-reduction">
 Dimensionality Reduction
 
 &lt;a class="anchor" href="#dimensionality-reduction">#&lt;/a>
 
&lt;/h2>
&lt;ul>
&lt;li>Reduces the &lt;strong>number of input variables&lt;/strong> to simplify data.&lt;/li>
&lt;li>Helps remove noise and redundancy.&lt;/li>
&lt;li>Commonly used in &lt;strong>data pre-processing&lt;/strong> and &lt;strong>visualisation&lt;/strong>.&lt;/li>
&lt;/ul>
&lt;h3 id="common-techniques-1">
 Common Techniques
 
 &lt;a class="anchor" href="#common-techniques-1">#&lt;/a>
 
&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Principal Component Analysis (PCA)&lt;/strong> – Projects data onto fewer dimensions while keeping most variance.&lt;/li>
&lt;li>&lt;strong>Linear Discriminant Analysis (LDA)&lt;/strong> – Focuses on class separation.&lt;/li>
&lt;li>&lt;strong>t-SNE (t-Distributed Stochastic Neighbour Embedding)&lt;/strong> – Used for visualising high-dimensional data.&lt;/li>
&lt;li>&lt;strong>Autoencoders&lt;/strong> – Neural networks that compress and reconstruct data.&lt;/li>
&lt;/ul>
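&lt;p>A minimal PCA sketch via the SVD, on synthetic data whose variance lies mostly along one direction (all numbers below are assumptions for illustration):&lt;/p>

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical 2-D data stretched along one direction.
t = rng.normal(size=200)
X = np.column_stack([t, 0.5 * t + 0.1 * rng.normal(size=200)])

Xc = X - X.mean(axis=0)             # centre the data
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
explained = S**2 / np.sum(S**2)     # variance ratio per component
Z = Xc @ Vt[:1].T                   # project onto the first component
```

&lt;p>Here the first component should capture well over 90% of the variance, so the 2-D data is effectively 1-D – which is exactly the redundancy dimensionality reduction removes.&lt;/p>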
&lt;hr>


&lt;pre class="mermaid">
mindmap
 root(Unsupervised Learning)
 Clustering
 K Means
 Hierarchical Clustering
 DBSCAN
 Dimensionality Reduction
 PCA
 t SNE
 Autoencoders
 Probabilistic Models
 Gaussian Mixture Model
 Hidden Markov Model
&lt;/pre>

&lt;hr>
&lt;p>&lt;a href="https://arshadhs.github.io/">Home&lt;/a> | &lt;a href="https://arshadhs.github.io/docs/ai/machine-learning/">
 Machine Learning
&lt;/a>&lt;/p></description></item><item><title>Semi-Supervised Learning</title><link>https://arshadhs.github.io/docs/ai/machine-learning/ml-semi-supervised/</link><pubDate>Sat, 03 Jan 2026 10:29:52 +0100</pubDate><guid>https://arshadhs.github.io/docs/ai/machine-learning/ml-semi-supervised/</guid><description>&lt;h1 id="semi-supervised-learning">
 Semi-Supervised Learning
 
 &lt;a class="anchor" href="#semi-supervised-learning">#&lt;/a>
 
&lt;/h1>
&lt;ul>
&lt;li>A combination of &lt;strong>labelled&lt;/strong> and &lt;strong>unlabelled data&lt;/strong>.&lt;/li>
&lt;li>Useful when labelling large datasets is &lt;strong>expensive or time-consuming&lt;/strong>.&lt;/li>
&lt;li>Works well with &lt;strong>high-volume datasets&lt;/strong> (e.g. millions of images).&lt;/li>
&lt;li>Only a &lt;strong>small fraction of data&lt;/strong> is labelled (e.g. a few thousand).&lt;/li>
&lt;li>The algorithm learns from both labelled examples and structure in unlabelled data.&lt;/li>
&lt;li>&lt;strong>Ideal for medical imaging&lt;/strong> where labelled data is limited.&lt;/li>
&lt;li>For example, a &lt;strong>radiologist&lt;/strong> can label a small set of medical scans,&lt;br>
and the model uses that to learn from thousands of unlabelled scans.&lt;/li>
&lt;li>Helps improve &lt;strong>accuracy and generalisation&lt;/strong> with minimal manual effort.&lt;/li>
&lt;/ul>
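&lt;p>One common semi-supervised strategy is &lt;strong>self-training&lt;/strong> (pseudo-labelling): fit a simple model on the labelled points, label the most confident unlabelled point, and refit. A toy 1-D sketch with a nearest-centroid classifier (all data values are hypothetical):&lt;/p>

```python
# A few labelled points and a larger pool of unlabelled ones.
labelled = [(1.0, "A"), (1.2, "A"), (8.0, "B"), (8.3, "B")]
unlabelled = [0.9, 1.4, 2.0, 7.5, 8.1, 9.0]

def centroids(data):
    """Mean position of each class among the labelled points."""
    out = {}
    for cls in {c for _, c in data}:
        vals = [x for x, c in data if c == cls]
        out[cls] = sum(vals) / len(vals)
    return out

for _ in range(3):                  # a few self-training rounds
    cents = centroids(labelled)
    if not unlabelled:
        break
    # pseudo-label the unlabelled point closest to any class centroid
    x = min(unlabelled, key=lambda u: min(abs(u - m) for m in cents.values()))
    cls = min(cents, key=lambda c: abs(x - cents[c]))
    labelled.append((x, cls))
    unlabelled.remove(x)
```

&lt;p>Each round grows the labelled set with the most confident guess, which is the same principle a radiologist-seeded model uses at scale.&lt;/p>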
&lt;hr>
&lt;p>&lt;a href="https://arshadhs.github.io/">Home&lt;/a> | &lt;a href="https://arshadhs.github.io/docs/ai/machine-learning/">
 Machine Learning
&lt;/a>&lt;/p></description></item><item><title>Reinforcement Learning</title><link>https://arshadhs.github.io/docs/ai/machine-learning/ml-reinforcement/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://arshadhs.github.io/docs/ai/machine-learning/ml-reinforcement/</guid><description>&lt;h1 id="reinforcement-learning-rl">
 Reinforcement Learning (RL)
 
 &lt;a class="anchor" href="#reinforcement-learning-rl">#&lt;/a>
 
&lt;/h1>
&lt;p>RL is learning by &lt;strong>trial and error&lt;/strong>.&lt;/p>
&lt;p>Reinforcement Learning (RL) is a type of machine learning where an &lt;strong>autonomous agent learns to make decisions by interacting with an environment&lt;/strong>.&lt;/p>
&lt;p>Instead of being told the correct answer, the agent:&lt;/p>
&lt;ul>
&lt;li>takes actions&lt;/li>
&lt;li>observes outcomes&lt;/li>
&lt;li>receives rewards or penalties&lt;/li>
&lt;li>gradually learns a strategy that maximises long-term reward&lt;/li>
&lt;/ul>
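&lt;p>The act–observe–reward loop above can be sketched with an epsilon-greedy agent on a two-armed bandit, the simplest RL setting (the payout probabilities and exploration rate are hypothetical, and the agent never sees them):&lt;/p>

```python
import random

rng = random.Random(42)
true_p = [0.2, 0.8]                 # hidden payout probability of each arm
counts = [0, 0]                     # pulls per arm
values = [0.0, 0.0]                 # running estimate of each arm's reward

for step in range(2000):
    if rng.random() > 0.9:          # explore 10% of the time
        arm = rng.randrange(2)
    else:                           # otherwise exploit the best estimate
        arm = max(range(2), key=lambda a: values[a])
    # Bernoulli reward: pays 1 with probability true_p[arm]
    reward = 1.0 if rng.random() > 1 - true_p[arm] else 0.0
    counts[arm] += 1
    values[arm] += (reward - values[arm]) / counts[arm]  # incremental mean
```

&lt;p>By trial and error the agent's estimates converge towards the true payout rates, and it ends up pulling the better arm most of the time – maximising long-term reward without ever being told the answer.&lt;/p>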

&lt;blockquote class='book-hint '>
 &lt;p>&lt;strong>Reinforcement Learning teaches an agent how to act, not what to predict.&lt;/strong>&lt;/p></description></item><item><title>AI Foundation</title><link>https://arshadhs.github.io/docs/ai/foundation/</link><pubDate>Mon, 26 Jan 2026 10:55:52 +0100</pubDate><guid>https://arshadhs.github.io/docs/ai/foundation/</guid><description>&lt;h1 id="ai">
 AI
 
 &lt;a class="anchor" href="#ai">#&lt;/a>
 
&lt;/h1>
&lt;p>A selection of notes that didn&amp;rsquo;t fit elsewhere or are still being worked on!&lt;/p>
&lt;hr>




&lt;ul>
 
 
 
 
 &lt;li>
 &lt;a href="https://arshadhs.github.io/docs/ai/foundation/ai-stages/">AI Stages: ANI, AGI, ASI&lt;/a>
 &lt;/li>
 
 
 
 
 &lt;li>
 &lt;a href="https://arshadhs.github.io/docs/ai/foundation/ai-stack/">AI Stack&lt;/a>
 &lt;/li>
 
 
 
 
 &lt;li>
 &lt;a href="https://arshadhs.github.io/docs/ai/foundation/ai-pipeline/">AI Pipeline&lt;/a>
 &lt;/li>
 
 
 
 
 &lt;li>
 &lt;a href="https://arshadhs.github.io/docs/ai/foundation/ai-notes/">AI Learning Resources&lt;/a>
 &lt;/li>
 
 

 
 
&lt;/ul>


&lt;hr>
&lt;a href="https://arshadhs.github.io/">Home&lt;/a></description></item><item><title>AI Stages: ANI, AGI, ASI</title><link>https://arshadhs.github.io/docs/ai/foundation/ai-stages/</link><pubDate>Thu, 04 Jul 2024 10:55:52 +0100</pubDate><guid>https://arshadhs.github.io/docs/ai/foundation/ai-stages/</guid><description>&lt;h1 id="ai-development-stages-ani--agi--asi">
 AI Development Stages: ANI → AGI → ASI
 
 &lt;a class="anchor" href="#ai-development-stages-ani--agi--asi">#&lt;/a>
 
&lt;/h1>
&lt;p>Artificial Intelligence is often described in &lt;strong>three stages&lt;/strong>, based on capability and scope:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>ANI:&lt;/strong> Task-specific intelligence (today’s AI)&lt;/li>
&lt;li>&lt;strong>AGI:&lt;/strong> Human-level general intelligence (future goal)&lt;/li>
&lt;li>&lt;strong>ASI:&lt;/strong> Beyond human intelligence (theoretical)&lt;/li>
&lt;/ul>
&lt;hr>
&lt;p>&lt;img src="https://arshadhs.github.io/images/ai/ai_stages.png" alt="AI Stages" />&lt;/p>
&lt;hr>
&lt;h2 id="ani--artificial-narrow-intelligence">
 ANI — Artificial Narrow Intelligence
 
 &lt;a class="anchor" href="#ani--artificial-narrow-intelligence">#&lt;/a>
 
&lt;/h2>
&lt;ul>
&lt;li>Also called &lt;strong>Weak AI&lt;/strong>&lt;/li>
&lt;li>Designed to perform &lt;strong>one specific task&lt;/strong>&lt;/li>
&lt;li>Operates within a &lt;strong>predefined environment&lt;/strong>&lt;/li>
&lt;li>Cannot generalise beyond its training&lt;/li>
&lt;li>&lt;strong>Most AI systems today are ANI&lt;/strong>&lt;/li>
&lt;/ul>
&lt;p>&lt;strong>examples&lt;/strong>&lt;/p></description></item><item><title>Basic Statistics</title><link>https://arshadhs.github.io/docs/ai/statistics/01_basic_statistics/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://arshadhs.github.io/docs/ai/statistics/01_basic_statistics/</guid><description>&lt;h1 id="basic-statistics">
 Basic Statistics
 
 &lt;a class="anchor" href="#basic-statistics">#&lt;/a>
 
&lt;/h1>
&lt;p>&lt;strong>Statistics&lt;/strong>: describes data (what you &lt;em>see&lt;/em>).&lt;br>
&lt;strong>Probability&lt;/strong>: models uncertainty (what you &lt;em>don’t know&lt;/em> yet).&lt;/p>
&lt;ul>
&lt;li>Summarise a dataset using central tendency and variability&lt;/li>
&lt;li>Explain core probability ideas using simple examples&lt;/li>
&lt;li>Apply the axioms of probability&lt;/li>
&lt;li>Distinguish mutually exclusive vs independent events&lt;/li>
&lt;/ul>
&lt;hr>


&lt;script src="https://arshadhs.github.io/mermaid.min.js">&lt;/script>

 &lt;script>mermaid.initialize({
 "flowchart": {
 "useMaxWidth":true
 },
 "theme": "default"
}
)&lt;/script>




&lt;pre class="mermaid">
flowchart TD
 A[Dataset] --&amp;gt; B[Central Tendency]
 A --&amp;gt; C[Variability]
 B --&amp;gt; B1[Mean]
 B --&amp;gt; B2[Median]
 B --&amp;gt; B3[Mode]
 C --&amp;gt; C1[Range]
 C --&amp;gt; C2[Variance]
 C --&amp;gt; C3[Standard Deviation]
 C --&amp;gt; C4[IQR]
&lt;/pre>
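&lt;p>Every measure in the diagram above is available in Python's standard &lt;code>statistics&lt;/code> module; a quick sketch on a made-up set of scores:&lt;/p>

```python
import statistics as st

scores = [4, 8, 6, 5, 3, 8, 9, 5, 7, 5]   # hypothetical test scores

# Central tendency
mean = st.mean(scores)
median = st.median(scores)
mode = st.mode(scores)

# Variability
data_range = max(scores) - min(scores)
variance = st.pvariance(scores)           # population variance
std_dev = st.pstdev(scores)
q1, q2, q3 = st.quantiles(scores, n=4)    # quartiles; IQR = q3 - q1
```

&lt;p>For these scores the mean is 6.0 while the median is 5.5 – a first hint that the two can disagree whenever the data are not symmetric.&lt;/p>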

&lt;hr>
&lt;h2 id="measures-of-central-tendency">
 Measures of Central Tendency
 
 &lt;a class="anchor" href="#measures-of-central-tendency">#&lt;/a>
 
&lt;/h2>
&lt;p>Central tendency tells you where the “middle” of the data is.
It summarises a set of scores with a &lt;strong>single number&lt;/strong> that represents the &lt;strong>performance&lt;/strong> of the group.&lt;/p></description></item><item><title>Basic Probability</title><link>https://arshadhs.github.io/docs/ai/statistics/01_basic_probability/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://arshadhs.github.io/docs/ai/statistics/01_basic_probability/</guid><description>&lt;h1 id="basic-probability">
 Basic Probability
 
 &lt;a class="anchor" href="#basic-probability">#&lt;/a>
 
&lt;/h1>
&lt;p>Probability models uncertainty:
what you &lt;em>don’t know&lt;/em> yet, but want to reason about.&lt;/p>
&lt;blockquote class="book-hint info">
&lt;p>Key takeaway:
Probability is a number between &lt;strong>0 and 1&lt;/strong> that measures how likely an event is.
The whole topic is about defining &lt;strong>events&lt;/strong> clearly and applying a few core rules consistently.&lt;/p>
&lt;/blockquote>
&lt;p>Probability quantifies uncertainty: a number between 0 and 1.&lt;/p>
&lt;ul>
&lt;li>0 means: impossible&lt;/li>
&lt;li>1 means: certain&lt;/li>
&lt;/ul>
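&lt;p>A quick simulation makes the 0-to-1 scale concrete: estimate the chance of rolling a six and compare it with the theoretical 1/6 (the seed and trial count are arbitrary choices):&lt;/p>

```python
import random

rng = random.Random(1)
trials = 60_000
# Count how often a fair die shows a six.
sixes = sum(1 for _ in range(trials) if rng.randint(1, 6) == 6)
estimate = sixes / trials   # empirical probability, near 1/6
```

&lt;p>The empirical frequency always lands between 0 and 1, and with enough trials it hovers close to the true probability.&lt;/p>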
&lt;hr>
&lt;h2 id="terminology">
 Terminology
 
 &lt;a class="anchor" href="#terminology">#&lt;/a>
 
&lt;/h2>
&lt;h3 id="random-experiment">
 Random experiment
 
 &lt;a class="anchor" href="#random-experiment">#&lt;/a>
 
&lt;/h3>
&lt;p>A random experiment is an action whose outcome is not known in advance.&lt;/p></description></item><item><title>Neural Networks</title><link>https://arshadhs.github.io/docs/ai/deep-learning/010-neural-network/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://arshadhs.github.io/docs/ai/deep-learning/010-neural-network/</guid><description>&lt;h1 id="neural-networks">
 Neural Networks
 
 &lt;a class="anchor" href="#neural-networks">#&lt;/a>
 
&lt;/h1>
&lt;ul>
&lt;li>A &lt;strong>network of artificial neurons&lt;/strong> inspired by how neurons function in the &lt;strong>human brain&lt;/strong>.&lt;/li>
&lt;li>At its core - a &lt;strong>mathematical model&lt;/strong> designed to process and learn from data.&lt;/li>
&lt;li>Neural networks form the &lt;strong>foundation of &lt;a href="https://arshadhs.github.io/docs/ai/deep-learning/">Deep Learning&lt;/a>&lt;/strong> (involves training large and complex networks on vast amounts of data).&lt;/li>
&lt;/ul>
&lt;hr>


&lt;script src="https://arshadhs.github.io/mermaid.min.js">&lt;/script>

 &lt;script>mermaid.initialize({
 "flowchart": {
 "useMaxWidth":true
 },
 "theme": "default"
}
)&lt;/script>




&lt;pre class="mermaid">
flowchart LR
 subgraph subGraph0[&amp;#34;Input Layer&amp;#34;]
 I1((&amp;#34;Input 1&amp;#34;))
 I2((&amp;#34;Input 2&amp;#34;))
 I3((&amp;#34;Input 3&amp;#34;))
 end
 subgraph subGraph1[&amp;#34;Hidden Layer&amp;#34;]
 H1((&amp;#34;Hidden 1&amp;#34;))
 H2((&amp;#34;Hidden 2&amp;#34;))
 H3((&amp;#34;Hidden 3&amp;#34;))
 end
 subgraph subGraph2[&amp;#34;Output Layer&amp;#34;]
 O((&amp;#34;Output&amp;#34;))
 end
 I1 --&amp;gt; H1 &amp;amp; H2 &amp;amp; H3
 I2 --&amp;gt; H1 &amp;amp; H2 &amp;amp; H3
 I3 --&amp;gt; H1 &amp;amp; H2 &amp;amp; H3
 H1 --&amp;gt; O
 H2 --&amp;gt; O
 H3 --&amp;gt; O

 style I1 fill:#C8E6C9
 style I2 fill:#C8E6C9
 style I3 fill:#C8E6C9
 style H1 stroke:#2962FF,fill:#BBDEFB
 style H2 fill:#BBDEFB
 style H3 fill:#BBDEFB
 style O fill:#FFCDD2
 style subGraph0 stroke:none,fill:transparent
 style subGraph1 stroke:none,fill:transparent
 style subGraph2 stroke:none,fill:transparent
&lt;/pre>
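&lt;p>The 3-3-1 network in the diagram above can be traced numerically with a single forward pass; the input values, weights and biases below are arbitrary illustrative numbers, not trained parameters:&lt;/p>

```python
import math

def sigmoid(z):
    """Squash a weighted sum into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

x = [0.5, -1.0, 2.0]                 # Input layer (3 values)

W_hidden = [[0.2, -0.4, 0.1],        # one weight row per hidden neuron
            [0.7, 0.3, -0.2],
            [-0.5, 0.6, 0.9]]
b_hidden = [0.1, -0.1, 0.0]

w_out = [0.3, -0.6, 0.8]             # hidden-to-output weights
b_out = 0.05

# Each hidden neuron: weighted sum of all inputs, plus bias, then sigmoid.
hidden = [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b)
          for row, b in zip(W_hidden, b_hidden)]
# Output neuron: weighted sum of all hidden activations.
output = sigmoid(sum(w * h for w, h in zip(w_out, hidden)) + b_out)
```

&lt;p>Training would then consist of adjusting the weight matrices so that &lt;code>output&lt;/code> moves towards the desired target.&lt;/p>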

&lt;hr>
&lt;h3 id="structure-of-a-neural-network">
 Structure of a Neural Network
 
 &lt;a class="anchor" href="#structure-of-a-neural-network">#&lt;/a>
 
&lt;/h3>
&lt;p>A typical neural network has &lt;strong>three main layers&lt;/strong>:&lt;/p></description></item><item><title>Conditional Probability &amp; Bayes’ Theorem</title><link>https://arshadhs.github.io/docs/ai/statistics/conditional-probability/</link><pubDate>Thu, 12 Mar 2026 00:00:00 +0000</pubDate><guid>https://arshadhs.github.io/docs/ai/statistics/conditional-probability/</guid><description>&lt;h1 id="conditional-probability--bayes-theorem">
 Conditional Probability &amp;amp; Bayes’ Theorem
 
 &lt;a class="anchor" href="#conditional-probability--bayes-theorem">#&lt;/a>
 
&lt;/h1>
&lt;p>Probability often changes when we &lt;strong>learn new information&lt;/strong>.&lt;/p>
&lt;p>Conditional probability and Bayes’ theorem give a structured way to &lt;strong>update beliefs&lt;/strong> using evidence.&lt;/p>
&lt;blockquote class="book-hint info">
&lt;p>Conditional probability updates probabilities after observing an event.&lt;/p>
&lt;p>Bayes’ theorem lets you estimate a hidden cause from observed evidence.&lt;/p>
&lt;p>Naïve Bayes turns Bayes’ theorem into a practical classifier by assuming conditional independence of features given the class.&lt;/p>
&lt;/blockquote>
&lt;hr>


&lt;script src="https://arshadhs.github.io/mermaid.min.js">&lt;/script>

 &lt;script>mermaid.initialize({
 "flowchart": {
 "useMaxWidth":true
 },
 "theme": "default"
}
)&lt;/script>




&lt;pre class="mermaid">
flowchart TD

A[Conditional&amp;lt;br/&amp;gt;probability] --&amp;gt;|foundation| B[Bayes&amp;lt;br/&amp;gt;theorem]
D[Independent&amp;lt;br/&amp;gt;events] --&amp;gt;|implies| C[Independence]
C --&amp;gt;|simplifies| A

E[Prior] --&amp;gt;|with likelihood| B
F[Likelihood] --&amp;gt;|updates| H[Posterior]
G[Evidence] --&amp;gt;|normalises| B
B --&amp;gt;|yields| H

I[Naïve&amp;lt;br/&amp;gt;Bayes] --&amp;gt;|uses| B
J[Naïve&amp;lt;br/&amp;gt;assumption] --&amp;gt;|assumes| C
K[Features] --&amp;gt;|given class| J
L[Class] --&amp;gt;|conditions| J
I --&amp;gt;|predicts| M[Classification]
M --&amp;gt;|selects| L

style A fill:#90CAF9,stroke:#1E88E5,color:#000
style B fill:#90CAF9,stroke:#1E88E5,color:#000
style C fill:#90CAF9,stroke:#1E88E5,color:#000

style D fill:#CE93D8,stroke:#8E24AA,color:#000
style E fill:#CE93D8,stroke:#8E24AA,color:#000
style F fill:#CE93D8,stroke:#8E24AA,color:#000
style G fill:#CE93D8,stroke:#8E24AA,color:#000
style J fill:#CE93D8,stroke:#8E24AA,color:#000
style K fill:#CE93D8,stroke:#8E24AA,color:#000
style L fill:#CE93D8,stroke:#8E24AA,color:#000

style H fill:#C8E6C9,stroke:#2E7D32,color:#000
style I fill:#C8E6C9,stroke:#2E7D32,color:#000
style M fill:#C8E6C9,stroke:#2E7D32,color:#000

&lt;/pre>

&lt;hr>
&lt;h2 id="quick-summary">
 Quick summary
 
 &lt;a class="anchor" href="#quick-summary">#&lt;/a>
 
&lt;/h2>
&lt;ul>
&lt;li>Conditional probability:
updates probability after an event is known.&lt;/li>
&lt;li>Multiplication rule:
computes joint probability from conditional parts.&lt;/li>
&lt;li>Independence:
tested using 
&lt;link rel="stylesheet" href="https://arshadhs.github.io/katex/katex.min.css" />
&lt;script defer src="https://arshadhs.github.io/katex/katex.min.js">&lt;/script>

 &lt;script defer src="https://arshadhs.github.io/katex/auto-render.min.js" onload="renderMathInElement(document.body, {
 &amp;#34;delimiters&amp;#34;: [
 {&amp;#34;left&amp;#34;: &amp;#34;$$&amp;#34;, &amp;#34;right&amp;#34;: &amp;#34;$$&amp;#34;, &amp;#34;display&amp;#34;: true},
 {&amp;#34;left&amp;#34;: &amp;#34;$&amp;#34;, &amp;#34;right&amp;#34;: &amp;#34;$&amp;#34;, &amp;#34;display&amp;#34;: false},
 {&amp;#34;left&amp;#34;: &amp;#34;\\(&amp;#34;, &amp;#34;right&amp;#34;: &amp;#34;\\)&amp;#34;, &amp;#34;display&amp;#34;: false},
 {&amp;#34;left&amp;#34;: &amp;#34;\\[&amp;#34;, &amp;#34;right&amp;#34;: &amp;#34;\\]&amp;#34;, &amp;#34;display&amp;#34;: true}
 ]
});">&lt;/script>

&lt;span>
 \( P(A\cap B)=P(A)P(B) \)
 &lt;/span>

.&lt;/li>
&lt;li>Total probability:
breaks a probability into weighted cases.&lt;/li>
&lt;li>Bayes’ theorem:
reverses conditioning to infer causes from evidence.&lt;/li>
&lt;/ul>
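&lt;p>The independence test from the summary can be verified exactly by enumerating two fair dice (the choice of events A and B here is just an example):&lt;/p>

```python
from itertools import product

# All 36 equally likely outcomes of two dice.
outcomes = list(product(range(1, 7), repeat=2))
N = len(outcomes)

# A = "first die is even", B = "second die is 5 or 6".
A = [o for o in outcomes if o[0] % 2 == 0]
B = [o for o in outcomes if o[1] >= 5]
AB = [o for o in outcomes if o[0] % 2 == 0 and o[1] >= 5]

p_a, p_b, p_ab = len(A) / N, len(B) / N, len(AB) / N
# Independence holds: p_ab equals p_a * p_b (1/6 = 1/2 * 1/3).
```
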
&lt;hr>
&lt;h2 id="whats-next">
 What’s next
 
 &lt;a class="anchor" href="#whats-next">#&lt;/a>
 
&lt;/h2>
&lt;p>Probability Distributions&lt;br>
Move from events to random variables and distributions.&lt;/p></description></item><item><title>Machine Learning</title><link>https://arshadhs.github.io/docs/ai/machine-learning/</link><pubDate>Tue, 06 Aug 2024 23:29:52 +0100</pubDate><guid>https://arshadhs.github.io/docs/ai/machine-learning/</guid><description>&lt;h1 id="machine-learning">
 Machine Learning
 
 &lt;a class="anchor" href="#machine-learning">#&lt;/a>
 
&lt;/h1>


&lt;script src="https://arshadhs.github.io/mermaid.min.js">&lt;/script>

 &lt;script>mermaid.initialize({
 "flowchart": {
 "useMaxWidth":true
 },
 "theme": "default"
}
)&lt;/script>




&lt;pre class="mermaid">
stateDiagram-v2

 %% ===== CLASS DEFINITIONS (Math-based colours) =====
 classDef algebra fill:#cfe8ff,stroke:#1e3a8a,stroke-width:1px
 classDef probability fill:#d1fae5,stroke:#065f46,stroke-width:1px
 classDef geometry fill:#ffedd5,stroke:#9a3412,stroke-width:1px
 classDef logic fill:#ede9fe,stroke:#5b21b6,stroke-width:1px
 classDef category font-style:italic,font-weight:bold,fill:#aaaaaa,stroke:#374151,stroke-width:3px

 %% ===== ROOT =====
 ML: Machine Learning

 %% ===== SUPERVISED =====
 ML --&amp;gt; SL:::category
 SL: Supervised Learning

 SL --&amp;gt; Regression
 Regression --&amp;gt; LR:::algebra
 LR: Linear Regression

 LR --&amp;gt; NN:::algebra
 NN: Neural Network

 NN --&amp;gt; DT:::logic
 DT: Decision Tree

 SL --&amp;gt; Classification
 Classification --&amp;gt; NB:::probability
 NB: Naive Bayes

 NB --&amp;gt; KNN:::geometry
 KNN: k-Nearest Neighbours

 KNN --&amp;gt; SVM:::algebra
 SVM: Support Vector Machine
 
 %% ===== UNSUPERVISED =====
 ML --&amp;gt; USL:::category
 USL: Unsupervised Learning

 USL --&amp;gt; Clustering
 Clustering --&amp;gt; KM:::geometry
 KM: K-Means

 KM --&amp;gt; GMM:::probability
 GMM: Gaussian Mixture Model

 GMM --&amp;gt; HMM:::probability
 HMM: Hidden Markov Model

 %% ===== REINFORCEMENT =====
 ML --&amp;gt; RL:::category
 RL: Reinforcement Learning

 RL --&amp;gt; DM:::logic
 DM: Decision Making
&lt;/pre>

&lt;hr>
&lt;details >&lt;summary>Mathematical Legend&lt;/summary>
 &lt;div class="markdown-inner">
&lt;h3 id="algebra--linear-algebra-blue">
 Algebra / Linear Algebra (Blue)
 
 &lt;a class="anchor" href="#algebra--linear-algebra-blue">#&lt;/a>
 
&lt;/h3>
&lt;p>Used heavily when models rely on:&lt;/p></description></item><item><title>AI Stack</title><link>https://arshadhs.github.io/docs/ai/foundation/ai-stack/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://arshadhs.github.io/docs/ai/foundation/ai-stack/</guid><description>&lt;h1 id="ai-stack">
 AI Stack
 
 &lt;a class="anchor" href="#ai-stack">#&lt;/a>
 
&lt;/h1>
&lt;p>The &lt;strong>AI Stack&lt;/strong> describes the &lt;strong>layers required to build an end-to-end AI system&lt;/strong>, from infrastructure at the bottom to user-facing applications at the top.&lt;/p>
&lt;p>Different organisations represent the AI stack differently; this is a simplified conceptual view for learning.&lt;/p>
&lt;p>Each layer depends on the one below it.&lt;/p>
&lt;hr>


&lt;script src="https://arshadhs.github.io/mermaid.min.js">&lt;/script>

 &lt;script>mermaid.initialize({
 "flowchart": {
 "useMaxWidth":true
 },
 "theme": "default"
}
)&lt;/script>




&lt;pre class="mermaid">
graph TB

 subgraph APP[&amp;#34;Applications&amp;#34;]
 A[User Interfaces &amp;amp; Integrations]
 end

 subgraph ORCH[&amp;#34;Orchestration&amp;#34;]
 O[Workflows • Agents • Control Logic]
 end

 subgraph DATA[&amp;#34;Data&amp;#34;]
 D[Data Sources • Pipelines • Vector DBs]
 end

 subgraph MODEL[&amp;#34;Models&amp;#34;]
 M[ML • DL • Foundation Models • LLMs]
 end

 subgraph INFRA[&amp;#34;Infrastructure&amp;#34;]
 I[Cloud • On-prem • GPUs • Storage]
 end

 %% Styling
 style APP fill:#FFCCBC
 style ORCH fill:#90CAF9
 style DATA fill:#BBDEFB
 style MODEL fill:#C8E6C9
 style INFRA fill:#E1F5FE

 style A fill:#FFE0B2
 style O fill:#B3E5FC
 style D fill:#E3F2FD
 style M fill:#DCEDC8
 style I fill:#E1F5FE
&lt;/pre>

&lt;hr>
&lt;h2 id="1-infrastructure">
 1. Infrastructure
 
 &lt;a class="anchor" href="#1-infrastructure">#&lt;/a>
 
&lt;/h2>
&lt;p>The foundation that provides &lt;strong>compute and storage&lt;/strong>.&lt;/p></description></item><item><title>Artificial Neuron and Perceptron</title><link>https://arshadhs.github.io/docs/ai/deep-learning/020-perceptron/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://arshadhs.github.io/docs/ai/deep-learning/020-perceptron/</guid><description>&lt;h1 id="artificial-neuron-and-perceptron">
 Artificial Neuron and Perceptron
 
 &lt;a class="anchor" href="#artificial-neuron-and-perceptron">#&lt;/a>
 
&lt;/h1>
&lt;blockquote class="book-hint info">
&lt;p>Knowledge in neural networks is stored in &lt;strong>connection weights&lt;/strong>, and learning means &lt;strong>modifying those weights&lt;/strong>.&lt;/p>
&lt;/blockquote>
&lt;hr>
&lt;h2 id="biological-neuron">
 Biological Neuron
 
 &lt;a class="anchor" href="#biological-neuron">#&lt;/a>
 
&lt;/h2>
&lt;p>A biological neuron is a specialised cell that processes and transmits information through electrical and chemical signals.&lt;/p>
&lt;p>Core components:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Dendrites&lt;/strong>: receive signals from other neurons&lt;/li>
&lt;li>&lt;strong>Cell body (soma)&lt;/strong>: processes incoming signals&lt;/li>
&lt;li>&lt;strong>Axon&lt;/strong>: transmits the output signal&lt;/li>
&lt;li>&lt;strong>Synapses&lt;/strong>: connection points between neurons&lt;/li>
&lt;/ul>
&lt;p>Biological intuition:&lt;/p>
&lt;ul>
&lt;li>many inputs arrive at one neuron&lt;/li>
&lt;li>one neuron can connect out to many neurons&lt;/li>
&lt;li>massive parallelism enables fast perception and recognition&lt;/li>
&lt;/ul>
&lt;hr>
&lt;h2 id="artificial-neuron">
 Artificial Neuron
 
 &lt;a class="anchor" href="#artificial-neuron">#&lt;/a>
 
&lt;/h2>
&lt;p>An artificial neuron is a simplified computational model inspired by biological neurons.&lt;/p></description></item><item><title>ML Workflow</title><link>https://arshadhs.github.io/docs/ai/machine-learning/02-ml-workflow/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://arshadhs.github.io/docs/ai/machine-learning/02-ml-workflow/</guid><description>&lt;h1 id="machine-learning-workflow">
 Machine Learning Workflow
 
 &lt;a class="anchor" href="#machine-learning-workflow">#&lt;/a>
 
&lt;/h1>
&lt;p>Data is the foundation of any machine learning system.
Quality of data matters more than model complexity.&lt;/p>
&lt;h3 id="role-of-data">
 Role of Data
 
 &lt;a class="anchor" href="#role-of-data">#&lt;/a>
 
&lt;/h3>
&lt;p>Data determines:&lt;/p>
&lt;ul>
&lt;li>What patterns the model can learn&lt;/li>
&lt;li>How well it generalises&lt;/li>
&lt;li>Whether bias or noise is introduced&lt;/li>
&lt;/ul>
&lt;p>Bad data → bad model (even with perfect algorithms).&lt;/p>
&lt;hr>
&lt;h3 id="data-preprocessing-wrangling">
 Data Preprocessing, wrangling
 
 &lt;a class="anchor" href="#data-preprocessing-wrangling">#&lt;/a>
 
&lt;/h3>
&lt;p>Raw data is rarely ready for training as-is.&lt;/p>
&lt;p>&lt;strong>Data Issues&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>Noise
&lt;ul>
&lt;li>For &lt;strong>objects&lt;/strong>, noise is an &lt;strong>extraneous object&lt;/strong>&lt;/li>
&lt;li>For &lt;strong>attributes&lt;/strong>, noise refers to &lt;strong>modification of original values&lt;/strong>&lt;/li>
&lt;li>Handle: apply a log or z-score transformation to rescale values around the mean&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Outliers
&lt;ul>
&lt;li>Data objects with characteristics that are considerably different from most of the other data objects in the data set&lt;/li>
&lt;li>Handle: Use &lt;strong>IQR&lt;/strong> method&lt;/li>
&lt;li>Compute the lower and upper bounds and &lt;strong>replace each outlier with the nearer bound&lt;/strong>&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Missing Values
&lt;ul>
&lt;li>Eliminate data objects or variables&lt;/li>
&lt;li>Handle: Estimate missing values
&lt;ul>
&lt;li>&lt;strong>Mean, Median or Mode&lt;/strong>&lt;/li>
&lt;li>Prefer the &lt;strong>Median&lt;/strong> if the data contains &lt;strong>outliers&lt;/strong>&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Ignore the missing value during analysis&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Duplicate Data
&lt;ul>
&lt;li>Major issue when merging data from heterogeneous sources&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Inconsistent Codes
&lt;ul>
&lt;li>Handle: find all unique values and map inconsistent codes to a single consistent value&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
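&lt;p>Two of the fixes above – IQR-based outlier capping and median imputation – can be sketched together in plain Python (the raw values are hypothetical, with one obvious outlier and one missing value):&lt;/p>

```python
import statistics as st

raw = [12, 14, 13, 15, 14, 98, 13, None, 15]   # 98 is an outlier

present = [x for x in raw if x is not None]
q1, _, q3 = st.quantiles(present, n=4)          # quartiles
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr      # IQR bounds

def cap(x):
    """Clamp a value into [low, high], replacing outliers with a bound."""
    return max(low, min(high, x))

capped = [cap(x) if x is not None else None for x in raw]
# Fill the missing value with the median of the cleaned data.
median = st.median(x for x in capped if x is not None)
cleaned = [median if x is None else x for x in capped]
```

&lt;p>After cleaning, the outlier sits at the upper bound rather than distorting the mean, and the gap is filled with a robust central value.&lt;/p>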
&lt;p>&lt;strong>Data Preprocessing techniques&lt;/strong>&lt;/p></description></item><item><title>Conditional Probability</title><link>https://arshadhs.github.io/docs/ai/statistics/conditional-probability/021_conditional_prob/</link><pubDate>Thu, 12 Mar 2026 00:00:00 +0000</pubDate><guid>https://arshadhs.github.io/docs/ai/statistics/conditional-probability/021_conditional_prob/</guid><description>&lt;h1 id="conditional-probability">
 Conditional Probability
 
 &lt;a class="anchor" href="#conditional-probability">#&lt;/a>
 
&lt;/h1>
&lt;p>Conditional probability updates the probability of an event when new information is available.&lt;/p>
&lt;p>It shows up whenever a question says:&lt;/p>
&lt;ul>
&lt;li>“given that…”&lt;/li>
&lt;li>“among those who…”&lt;/li>
&lt;li>“out of the items that…”&lt;/li>
&lt;li>“if it does not fail immediately…”&lt;/li>
&lt;/ul>
&lt;blockquote class="book-hint info">
&lt;p>Key takeaway:
Conditional probability is always:&lt;/p>
&lt;p>joint probability ÷ probability of the condition.&lt;/p>
&lt;p>The condition must not be an impossible event.&lt;/p>
&lt;/blockquote>
&lt;hr>
&lt;h2 id="prior-vs-posterior">
 Prior vs posterior
 
 &lt;a class="anchor" href="#prior-vs-posterior">#&lt;/a>
 
&lt;/h2>
&lt;ul>
&lt;li>
&lt;p>Prior probability:
probability with no condition (before new information)&lt;/p></description></item><item><title>Bayes’ Theorem</title><link>https://arshadhs.github.io/docs/ai/statistics/conditional-probability/022_bayes_theorem/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://arshadhs.github.io/docs/ai/statistics/conditional-probability/022_bayes_theorem/</guid><description>&lt;h1 id="bayes-theorem">
 Bayes’ Theorem
 
 &lt;a class="anchor" href="#bayes-theorem">#&lt;/a>
 
&lt;/h1>
&lt;h3 id="21-total-probability-needed-for-bayes">
 2.1 Total probability (needed for Bayes)
 
 &lt;a class="anchor" href="#21-total-probability-needed-for-bayes">#&lt;/a>
 
&lt;/h3>
&lt;p>Often we split the world into cases 
&lt;span>
 \( E_1,E_2,\dots,E_k \)
 &lt;/span>

 that:&lt;/p>
&lt;ul>
&lt;li>are mutually exclusive&lt;/li>
&lt;li>cover the whole sample space&lt;/li>
&lt;/ul>
&lt;p>Then for any event 
&lt;span>
 \( A \)
 &lt;/span>

:&lt;/p>
&lt;span style="color: red;">
 &lt;span>
 \[ 
P(A)=\sum_{i=1}^{k} P(A\mid E_i)\,P(E_i)
 \]
 &lt;/span>
&lt;/span>
&lt;p>Tree intuition:&lt;/p>


&lt;pre class="mermaid">
flowchart TD
 S[Start] --&amp;gt; E1[Case E1]
 S --&amp;gt; E2[Case E2]
 S --&amp;gt; E3[Case E3]
 E1 --&amp;gt; A1[&amp;#34;A happens&amp;#34;]
 E2 --&amp;gt; A2[&amp;#34;A happens&amp;#34;]
 E3 --&amp;gt; A3[&amp;#34;A happens&amp;#34;]
&lt;/pre>
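&lt;p>A numeric check of the total probability formula above, with hypothetical case probabilities and conditionals (three mutually exclusive cases that cover the sample space):&lt;/p>

```python
# P(E_i): mutually exclusive cases that sum to 1.
p_E = [0.5, 0.3, 0.2]
# P(A | E_i): how likely A is within each case.
p_A_given_E = [0.9, 0.5, 0.1]

# P(A) = sum over i of P(A | E_i) * P(E_i)
p_A = sum(pa * pe for pa, pe in zip(p_A_given_E, p_E))
# 0.9*0.5 + 0.5*0.3 + 0.1*0.2 = 0.62
```
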

&lt;hr>
&lt;h3 id="22-bayes-theorem-two-event-form">
 2.2 Bayes’ theorem (two-event form)
 
 &lt;a class="anchor" href="#22-bayes-theorem-two-event-form">#&lt;/a>
 
&lt;/h3>
&lt;p>Bayes&amp;rsquo; Theorem is a mathematical formula used to determine the &lt;strong>conditional probability of an event based on prior knowledge and new evidence&lt;/strong>.&lt;/p></description></item><item><title>Naïve Bayes</title><link>https://arshadhs.github.io/docs/ai/statistics/conditional-probability/023_naive_bayes/</link><pubDate>Thu, 12 Mar 2026 00:00:00 +0000</pubDate><guid>https://arshadhs.github.io/docs/ai/statistics/conditional-probability/023_naive_bayes/</guid><description>&lt;h1 id="naïve-bayes">
 Naïve Bayes
 
 &lt;a class="anchor" href="#na%c3%afve-bayes">#&lt;/a>
 
&lt;/h1>
&lt;p>Naïve Bayes is a &lt;strong>probabilistic classifier&lt;/strong>.&lt;/p>
&lt;ul>
&lt;li>A &lt;strong>supervised learning&lt;/strong> problem&lt;/li>
&lt;li>Binary classification – the target variable takes one of two classes&lt;/li>
&lt;li>The hypothesis is the class label you want to assign&lt;/li>
&lt;li>The total probability (prior) of each class (e.g. Yes and No) is computed first&lt;/li>
&lt;li>The posterior is obtained once the evidence in the data is taken into account&lt;/li>
&lt;li>The instance is assigned to the hypothesis (class) with the maximum posterior probability&lt;/li>
&lt;/ul>
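&lt;p>A count-based sketch of the idea on a tiny hypothetical dataset: priors come from class counts, likelihoods from per-feature counts within each class, and the predicted class is the one with the larger posterior score (no smoothing, for clarity):&lt;/p>

```python
from collections import Counter, defaultdict

# Each row: ((outlook, wind), label) – a toy play/don't-play dataset.
rows = [
    (("sunny", "calm"), "Yes"),
    (("sunny", "windy"), "No"),
    (("rainy", "calm"), "Yes"),
    (("rainy", "windy"), "No"),
    (("sunny", "calm"), "Yes"),
    (("rainy", "windy"), "No"),
]

priors = Counter(label for _, label in rows)          # class counts
feature_counts = defaultdict(Counter)                 # (label, i) -> value counts
for feats, label in rows:
    for i, v in enumerate(feats):
        feature_counts[(label, i)][v] += 1

def posterior_score(feats, label):
    # P(label) times the product of P(feature_i | label); the naive
    # assumption lets us multiply per-feature likelihoods.
    score = priors[label] / len(rows)
    for i, v in enumerate(feats):
        score *= feature_counts[(label, i)][v] / priors[label]
    return score

def classify(feats):
    return max(priors, key=lambda label: posterior_score(feats, label))

prediction = classify(("sunny", "calm"))
```

&lt;p>A real implementation would add Laplace smoothing so an unseen feature value does not zero out the whole product.&lt;/p>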
&lt;p>It predicts a class label by computing:&lt;/p></description></item><item><title>Probability Distributions</title><link>https://arshadhs.github.io/docs/ai/statistics/probability_distributions/</link><pubDate>Sun, 22 Feb 2026 00:00:00 +0000</pubDate><guid>https://arshadhs.github.io/docs/ai/statistics/probability_distributions/</guid><description>&lt;h1 id="probability-distributions">
 Probability Distributions
 
 &lt;a class="anchor" href="#probability-distributions">#&lt;/a>
 
&lt;/h1>
&lt;p>Probability distributions are the bridge between:
real-world randomness and mathematical modelling.&lt;/p>
&lt;p>A random experiment produces outcomes.
A random variable turns those outcomes into numbers.
A probability distribution tells you how likely each number (or range of numbers) is.&lt;/p>
&lt;blockquote class="book-hint info">
&lt;p>Key takeaway:
A distribution is a complete “story” about uncertainty:
what values are possible, how likely they are, and how we summarise them (mean, variance).&lt;/p>
&lt;/blockquote>
&lt;hr>


&lt;script src="https://arshadhs.github.io/mermaid.min.js">&lt;/script>

 &lt;script>mermaid.initialize({
 "flowchart": {
 "useMaxWidth":true
 },
 "theme": "default"
}
)&lt;/script>




&lt;pre class="mermaid">
flowchart TD
	PD[&amp;#34;Probability&amp;lt;br/&amp;gt;distributions&amp;#34;] --&amp;gt; RV[&amp;#34;Random&amp;lt;br/&amp;gt;variables&amp;#34;]
	PD[&amp;#34;Probability&amp;lt;br/&amp;gt;distributions&amp;#34;] --&amp;gt; DS[&amp;#34;Common&amp;lt;br/&amp;gt;distributions&amp;#34;]

	style PD fill:#90CAF9,stroke:#1E88E5,color:#000
	style RV fill:#90CAF9,stroke:#1E88E5,color:#000
	style DS fill:#90CAF9,stroke:#1E88E5,color:#000
&lt;/pre>

&lt;hr>
&lt;h2 id="aiml-connection">
 AI/ML Connection
 
 &lt;a class="anchor" href="#aiml-connection">#&lt;/a>
 
&lt;/h2>
&lt;ul>
&lt;li>Many ML models are probabilistic:
they assume data (or errors) follow a distribution.&lt;/li>
&lt;li>Loss functions often come from distribution assumptions:
squared loss aligns with Gaussian noise.&lt;/li>
&lt;li>Naïve Bayes (from the previous module) becomes practical once you can model:

&lt;link rel="stylesheet" href="https://arshadhs.github.io/katex/katex.min.css" />
&lt;script defer src="https://arshadhs.github.io/katex/katex.min.js">&lt;/script>

 &lt;script defer src="https://arshadhs.github.io/katex/auto-render.min.js" onload="renderMathInElement(document.body, {
 &amp;#34;delimiters&amp;#34;: [
 {&amp;#34;left&amp;#34;: &amp;#34;$$&amp;#34;, &amp;#34;right&amp;#34;: &amp;#34;$$&amp;#34;, &amp;#34;display&amp;#34;: true},
 {&amp;#34;left&amp;#34;: &amp;#34;$&amp;#34;, &amp;#34;right&amp;#34;: &amp;#34;$&amp;#34;, &amp;#34;display&amp;#34;: false},
 {&amp;#34;left&amp;#34;: &amp;#34;\\(&amp;#34;, &amp;#34;right&amp;#34;: &amp;#34;\\)&amp;#34;, &amp;#34;display&amp;#34;: false},
 {&amp;#34;left&amp;#34;: &amp;#34;\\[&amp;#34;, &amp;#34;right&amp;#34;: &amp;#34;\\]&amp;#34;, &amp;#34;display&amp;#34;: true}
 ]
});">&lt;/script>

&lt;span>
 \( P(X\mid Y) \)
 &lt;/span>

 using suitable distributions.&lt;/li>
&lt;/ul>
&lt;blockquote class="book-hint warning">
&lt;p>In practice:
choosing a distribution is a modelling decision.
It affects:
prediction, uncertainty estimates, and what “rare” or “typical” means in your data.&lt;/p></description></item><item><title>Generative AI</title><link>https://arshadhs.github.io/docs/ai/genai/</link><pubDate>Mon, 15 Dec 2025 10:55:52 +0100</pubDate><guid>https://arshadhs.github.io/docs/ai/genai/</guid><description>&lt;h1 id="generative-ai">
 Generative AI
 
 &lt;a class="anchor" href="#generative-ai">#&lt;/a>
 
&lt;/h1>
&lt;p>&lt;strong>Generative Artificial Intelligence (GenAI)&lt;/strong> refers to a class of AI systems that can &lt;strong>generate new content&lt;/strong> such as text, images, audio, video, or code, rather than only making predictions or classifications.&lt;/p>
&lt;p>GenAI systems learn &lt;strong>patterns and representations from large datasets&lt;/strong> and use them to produce &lt;strong>novel outputs&lt;/strong> that resemble the data they were trained on.&lt;/p>
&lt;hr>
&lt;h2 id="how-generative-ai-differs-from-traditional-ai">
 How Generative AI Differs from Traditional AI
 
 &lt;a class="anchor" href="#how-generative-ai-differs-from-traditional-ai">#&lt;/a>
 
&lt;/h2>
&lt;table>
 &lt;thead>
 &lt;tr>
 &lt;th>Traditional AI&lt;/th>
 &lt;th>Generative AI&lt;/th>
 &lt;/tr>
 &lt;/thead>
 &lt;tbody>
 &lt;tr>
 &lt;td>Predicts or classifies&lt;/td>
 &lt;td>Generates new content&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Task-specific models&lt;/td>
 &lt;td>General-purpose models&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Fixed outputs&lt;/td>
 &lt;td>Open-ended outputs&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>Often rule-based&lt;/td>
 &lt;td>Data-driven and probabilistic&lt;/td>
 &lt;/tr>
 &lt;/tbody>
&lt;/table>
&lt;hr>
&lt;h2 id="core-idea-of-generative-ai">
 Core Idea of Generative AI
 
 &lt;a class="anchor" href="#core-idea-of-generative-ai">#&lt;/a>
 
&lt;/h2>

&lt;blockquote class='book-hint '>
 &lt;p>&lt;strong>Instead of learning “what label to assign”, Generative AI learns “how data is structured” and then creates new data following that structure.&lt;/strong>&lt;/p></description></item><item><title>AI Pipeline</title><link>https://arshadhs.github.io/docs/ai/foundation/ai-pipeline/</link><pubDate>Thu, 04 Jul 2024 10:55:52 +0100</pubDate><guid>https://arshadhs.github.io/docs/ai/foundation/ai-pipeline/</guid><description>&lt;h1 id="ai-pipeline">
 AI Pipeline
 
 &lt;a class="anchor" href="#ai-pipeline">#&lt;/a>
 
&lt;/h1>
&lt;p>The AI pipeline is a continuous process where data is collected, prepared, used to train models, evaluated for performance, and continuously improved after deployment.&lt;/p>
&lt;div class="book-steps ">
&lt;ol>
&lt;li>
&lt;h2 id="collect-data">
 Collect Data
 
 &lt;a class="anchor" href="#collect-data">#&lt;/a>
 
&lt;/h2>
&lt;/li>
&lt;li>
&lt;h2 id="prepare-data">
 Prepare data
 
 &lt;a class="anchor" href="#prepare-data">#&lt;/a>
 
&lt;/h2>
&lt;/li>
&lt;li>
&lt;h2 id="train-model">
 Train Model
 
 &lt;a class="anchor" href="#train-model">#&lt;/a>
 
&lt;/h2>
&lt;ul>
&lt;li>Iterate until model is good enough&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;h2 id="deploy-model">
 Deploy Model
 
 &lt;a class="anchor" href="#deploy-model">#&lt;/a>
 
&lt;/h2>
&lt;ul>
&lt;li>Get data back&lt;/li>
&lt;li>Maintain &amp;amp; update model&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ol>
&lt;/div>


&lt;script src="https://arshadhs.github.io/mermaid.min.js">&lt;/script>

 &lt;script>mermaid.initialize({
 "flowchart": {
 "useMaxWidth":true
 },
 "theme": "default"
}
)&lt;/script>




&lt;pre class="mermaid">
timeline
 title AI Pipeline
 Collect Data : Data Ingestion
 : Data Understanding
 Prepare Data : Cleaning
 : Feature Engineering
 : Sampling
 Train Model : Model Training
 : Validation &amp;amp; Metrics
 Deploy Model : Deployment
 : Monitoring &amp;amp; Retraining
&lt;/pre>

&lt;hr>
&lt;p>&lt;a href="https://arshadhs.github.io/">Home&lt;/a> | &lt;a href="https://arshadhs.github.io/docs/ai/foundation/">
 AI Foundation
&lt;/a>&lt;/p></description></item><item><title>Regression(Linear Models)</title><link>https://arshadhs.github.io/docs/ai/machine-learning/03-linear-models-regression/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://arshadhs.github.io/docs/ai/machine-learning/03-linear-models-regression/</guid><description>&lt;h1 id="linear-regression">
 Linear Regression
 
 &lt;a class="anchor" href="#linear-regression">#&lt;/a>
 
&lt;/h1>
&lt;p>Linear Regression is a supervised 
&lt;span style="color: blue;">
 ML
&lt;/span> method used to predict a &lt;strong>numerical&lt;/strong> target by fitting a model that is &lt;strong>linear in its parameters&lt;/strong>.&lt;/p>
&lt;p>In 
&lt;span style="color: blue;">
 ML
&lt;/span>, linear models are a core baseline:
they’re fast, often surprisingly strong, and usually easy to interpret.&lt;/p>
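&lt;p>A minimal sketch of that baseline, fitting a line to synthetic data (the true slope 2 and intercept 1 here are illustrative assumptions):&lt;/p>

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic data (illustrative): y = 2x + 1 plus Gaussian noise.
x = rng.uniform(0, 10, size=200)
y = 2.0 * x + 1.0 + rng.normal(0.0, 0.5, size=200)

# Fit a line by least squares; the learned parameters should be
# close to the true slope (2) and intercept (1).
slope, intercept = np.polyfit(x, y, deg=1)
```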
&lt;blockquote class="book-hint info">
&lt;p>Key takeaway:
Linear Regression learns parameters by minimising a squared-error cost.
You can solve it directly (closed form) or iteratively (gradient descent),
and you can extend it using basis functions and regularisation.&lt;/p></description></item><item><title>Random Variables</title><link>https://arshadhs.github.io/docs/ai/statistics/probability_distributions/random-variables/</link><pubDate>Sun, 22 Feb 2026 00:00:00 +0000</pubDate><guid>https://arshadhs.github.io/docs/ai/statistics/probability_distributions/random-variables/</guid><description>&lt;h1 id="random-variables">
 Random Variables
 
 &lt;a class="anchor" href="#random-variables">#&lt;/a>
 
&lt;/h1>
&lt;p>A random variable is a way to attach numbers to outcomes of a random experiment.&lt;/p>
&lt;p>It lets us move from:
“what happened?”
to:
“what number should we analyse?”&lt;/p>
&lt;blockquote class="book-hint info">
&lt;p>Key takeaway:
A random variable is a &lt;em>function&lt;/em> from the sample space to real numbers.
Once you define the random variable clearly, the rest (pmf/pdf/cdf, mean, variance) becomes systematic.&lt;/p>
&lt;/blockquote>
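&lt;p>A small NumPy sketch of the "random variable as a function" idea, using three coin tosses as an illustrative experiment:&lt;/p>

```python
import numpy as np

rng = np.random.default_rng(2)

# Random experiment: toss a fair coin 3 times (0 = tails, 1 = heads),
# repeated 50,000 times.
tosses = rng.integers(0, 2, size=(50_000, 3))

# Random variable X = number of heads: a *function* of each outcome.
X = tosses.sum(axis=1)

# Empirical pmf over {0, 1, 2, 3}; theory says 1/8, 3/8, 3/8, 1/8.
pmf = np.bincount(X, minlength=4) / X.size
```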
&lt;hr>


&lt;script src="https://arshadhs.github.io/mermaid.min.js">&lt;/script>

 &lt;script>mermaid.initialize({
 "flowchart": {
 "useMaxWidth":true
 },
 "theme": "default"
}
)&lt;/script>




&lt;pre class="mermaid">
flowchart TD
PD[&amp;#34;Probability&amp;lt;br/&amp;gt;distributions&amp;#34;] --&amp;gt; RV[&amp;#34;Random&amp;lt;br/&amp;gt;variables&amp;#34;]

RV --&amp;gt; T[&amp;#34;Types&amp;#34;]
T --&amp;gt; RV1[&amp;#34;Discrete&amp;lt;br/&amp;gt;RVs&amp;#34;]
T --&amp;gt; RV2[&amp;#34;Continuous&amp;lt;br/&amp;gt;RVs&amp;#34;]

RV --&amp;gt; F[&amp;#34;PMF / PDF / CDF&amp;#34;]
RV --&amp;gt; S[&amp;#34;Mean / Variance&amp;lt;br/&amp;gt;Covariance&amp;#34;]
RV --&amp;gt; J[&amp;#34;Joint &amp;amp; Marginal&amp;lt;br/&amp;gt;distributions&amp;#34;]
RV --&amp;gt; X[&amp;#34;Transformations&amp;#34;]

style PD fill:#90CAF9,stroke:#1E88E5,color:#000
style RV fill:#90CAF9,stroke:#1E88E5,color:#000

style T fill:#CE93D8,stroke:#8E24AA,color:#000
style F fill:#CE93D8,stroke:#8E24AA,color:#000
style S fill:#CE93D8,stroke:#8E24AA,color:#000
style J fill:#CE93D8,stroke:#8E24AA,color:#000
style X fill:#CE93D8,stroke:#8E24AA,color:#000
style RV1 fill:#CE93D8,stroke:#8E24AA,color:#000
style RV2 fill:#CE93D8,stroke:#8E24AA,color:#000
&lt;/pre>

&lt;hr>
&lt;h2 id="1-definition">
 1) Definition
 
 &lt;a class="anchor" href="#1-definition">#&lt;/a>
 
&lt;/h2>
&lt;p>Random variable:
a rule that assigns a number to each outcome.&lt;/p></description></item><item><title>Common Probability Distributions</title><link>https://arshadhs.github.io/docs/ai/statistics/probability_distributions/common-distributions/</link><pubDate>Sun, 22 Feb 2026 00:00:00 +0000</pubDate><guid>https://arshadhs.github.io/docs/ai/statistics/probability_distributions/common-distributions/</guid><description>&lt;h1 id="common-probability-distributions">
 Common Probability Distributions
 
 &lt;a class="anchor" href="#common-probability-distributions">#&lt;/a>
 
&lt;/h1>
&lt;p>Once you can describe a random variable using a pmf or pdf, the next step is to use
named distributions that appear repeatedly in real data and in ML models.&lt;/p>
&lt;blockquote class="book-hint info">
&lt;p>Key takeaway:
Named distributions give you ready-made probability models for common patterns:
binary outcomes, counts, and measurement noise.&lt;/p>
&lt;/blockquote>
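&lt;p>These named distributions are available directly in NumPy. A quick sketch (the parameter values are illustrative) that draws samples and checks that the sample means track the theoretical means:&lt;/p>

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200_000

bern = rng.binomial(1, 0.3, size=n)     # Bernoulli(p=0.3): one binary trial
binom = rng.binomial(10, 0.3, size=n)   # Binomial(n=10, p=0.3): success count
pois = rng.poisson(4.0, size=n)         # Poisson(lam=4): event counts
norm = rng.normal(0.0, 1.0, size=n)     # Normal(0, 1): measurement noise

# Sample means track the theoretical means: p, n*p, lam, mu.
means = [bern.mean(), binom.mean(), pois.mean(), norm.mean()]
```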
&lt;hr>


&lt;pre class="mermaid">
flowchart TD
PD[&amp;#34;Probability&amp;lt;br/&amp;gt;distributions&amp;#34;] --&amp;gt; DS[&amp;#34;Common&amp;lt;br/&amp;gt;distributions&amp;#34;]

DS --&amp;gt; DIS[&amp;#34;Discrete&amp;#34;]
DS --&amp;gt; CON[&amp;#34;Continuous&amp;#34;]

DIS --&amp;gt; D1[&amp;#34;Bernoulli&amp;#34;]
DIS --&amp;gt; D2[&amp;#34;Binomial&amp;#34;]
DIS --&amp;gt; D3[&amp;#34;Poisson&amp;#34;]

CON --&amp;gt; D4[&amp;#34;Normal&amp;lt;br/&amp;gt;(Gaussian)&amp;#34;]
CON --&amp;gt; D5[&amp;#34;t / Chi-square / F&amp;lt;br/&amp;gt;(intro)&amp;#34;]

style PD fill:#90CAF9,stroke:#1E88E5,color:#000
style DS fill:#90CAF9,stroke:#1E88E5,color:#000

style DIS fill:#CE93D8,stroke:#8E24AA,color:#000
style CON fill:#CE93D8,stroke:#8E24AA,color:#000

style D1 fill:#C8E6C9,stroke:#2E7D32,color:#000
style D2 fill:#C8E6C9,stroke:#2E7D32,color:#000
style D3 fill:#C8E6C9,stroke:#2E7D32,color:#000
style D4 fill:#C8E6C9,stroke:#2E7D32,color:#000
style D5 fill:#C8E6C9,stroke:#2E7D32,color:#000
&lt;/pre>

&lt;hr>
&lt;h2 id="1-bernoulli-distribution-binary">
 1) Bernoulli distribution (binary)
 
 &lt;a class="anchor" href="#1-bernoulli-distribution-binary">#&lt;/a>
 
&lt;/h2>
&lt;p>Use when:
one trial has two outcomes (success/failure).&lt;/p></description></item><item><title>Ordinary Least Squares</title><link>https://arshadhs.github.io/docs/ai/machine-learning/03-ordinary-least-squares/</link><pubDate>Sat, 21 Feb 2026 00:00:00 +0000</pubDate><guid>https://arshadhs.github.io/docs/ai/machine-learning/03-ordinary-least-squares/</guid><description>&lt;h1 id="direct-solution-method---ordinary-least-squares-and-the-line-of-best-fit">
 Direct solution method - Ordinary Least Squares and the Line of Best Fit
 
 &lt;a class="anchor" href="#direct-solution-method---ordinary-least-squares-and-the-line-of-best-fit">#&lt;/a>
 
&lt;/h1>
&lt;p>It is possible to compute the best parameters for linear regression &lt;strong>in one shot&lt;/strong> (closed-form),
instead of iteratively improving them step-by-step.&lt;/p>
&lt;p>For linear regression, the direct method is usually &lt;strong>Ordinary Least Squares (OLS)&lt;/strong>.&lt;/p>
&lt;p>Ordinary Least Squares (OLS) chooses the “best” line by &lt;strong>minimising squared prediction errors&lt;/strong>.&lt;/p>
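&lt;p>A minimal NumPy sketch of the one-shot solution via the normal equations (the synthetic data, with true slope 3 and intercept minus 2, is illustrative):&lt;/p>

```python
import numpy as np

rng = np.random.default_rng(4)

# Synthetic data (illustrative): y = 3x - 2 plus noise.
x = rng.uniform(-5, 5, size=300)
y = 3.0 * x - 2.0 + rng.normal(0.0, 0.4, size=300)

# Design matrix with a bias column, then the normal equations,
# solved in one shot with no iteration.
X = np.column_stack([np.ones_like(x), x])
w = np.linalg.solve(X.T @ X, X.T @ y)
intercept, slope = w
```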
&lt;blockquote class="book-hint info">
&lt;p>Key takeaway:
OLS defines “best fit” as the line that minimises the total squared residual error across all data points.&lt;/p></description></item><item><title>Cost Function</title><link>https://arshadhs.github.io/docs/ai/machine-learning/03-cost-function/</link><pubDate>Sat, 21 Feb 2026 00:00:00 +0000</pubDate><guid>https://arshadhs.github.io/docs/ai/machine-learning/03-cost-function/</guid><description>&lt;h1 id="cost-function">
 Cost Function
 
 &lt;a class="anchor" href="#cost-function">#&lt;/a>
 
&lt;/h1>
&lt;ul>
&lt;li>
&lt;p>also known as an objective function&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>how far the predicted values are from the actual ones&lt;/strong>&lt;/p>
&lt;/li>
&lt;li>
&lt;p>measure of the difference between predicted values and actual values&lt;/p>
&lt;/li>
&lt;li>
&lt;p>quantifies the error between a model&amp;rsquo;s predicted values and actual values&lt;/p>
&lt;/li>
&lt;li>
&lt;p>measures the model’s error on a group of datapoints&lt;/p>
&lt;/li>
&lt;li>
&lt;p>guides model fitting: the best-fit line is the one that minimises this error measure&lt;/p>
&lt;/li>
&lt;li>
&lt;p>used to evaluate the accuracy of a model’s predictions&lt;/p></description></item><item><title>Gradient Descent</title><link>https://arshadhs.github.io/docs/ai/machine-learning/03-gradient-descent-linear-regression/</link><pubDate>Sat, 21 Feb 2026 00:00:00 +0000</pubDate><guid>https://arshadhs.github.io/docs/ai/machine-learning/03-gradient-descent-linear-regression/</guid><description>&lt;h1 id="gradient-descent-for-linear-regression">
 Gradient Descent for Linear Regression
 
 &lt;a class="anchor" href="#gradient-descent-for-linear-regression">#&lt;/a>
 
&lt;/h1>
&lt;p>Gradient descent is an iterative optimisation method used to minimise the regression cost function by repeatedly updating parameters in the direction that reduces error.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Iterative method&lt;/strong>&lt;/li>
&lt;li>Types: batch / stochastic / mini-batch&lt;/li>
&lt;/ul>
&lt;blockquote class="book-hint info">
&lt;p>Key takeaway:
Gradient descent starts with initial parameter values and repeatedly updates them using the gradient until the cost stops decreasing.&lt;/p>
&lt;/blockquote>
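&lt;p>A minimal sketch of batch gradient descent for one-variable linear regression (the synthetic data, learning rate, and iteration count are illustrative choices):&lt;/p>

```python
import numpy as np

rng = np.random.default_rng(5)

# Synthetic data (illustrative): y = 2x + 1 plus a little noise.
x = rng.uniform(0.0, 1.0, size=100)
y = 2.0 * x + 1.0 + rng.normal(0.0, 0.05, size=100)

w, b = 0.0, 0.0    # initial parameter values
lr = 0.5           # learning rate (a hyperparameter)

# Batch gradient descent: every step uses all the data.
for _ in range(2000):
    error = (w * x + b) - y
    dw = 2.0 * np.mean(error * x)   # gradient of the MSE cost w.r.t. w
    db = 2.0 * np.mean(error)       # gradient of the MSE cost w.r.t. b
    w = w - lr * dw
    b = b - lr * db
```

&lt;p>Stochastic and mini-batch variants change only how much data each step sees, not the update rule itself.&lt;/p>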


&lt;script src="https://arshadhs.github.io/mermaid.min.js">&lt;/script>

 &lt;script>mermaid.initialize({
 "flowchart": {
 "useMaxWidth":true
 },
 "theme": "default"
}
)&lt;/script>




&lt;pre class="mermaid">
flowchart TD
GD[&amp;#34;Gradient&amp;lt;br/&amp;gt;Descent&amp;#34;] --&amp;gt;|minimises| CF[&amp;#34;Cost&amp;lt;br/&amp;gt;function&amp;#34;]
GD --&amp;gt;|updates| W[&amp;#34;Parameters&amp;lt;br/&amp;gt;(weights)&amp;#34;]
GD --&amp;gt;|uses| GR[&amp;#34;Gradient&amp;lt;br/&amp;gt;(slope)&amp;#34;]

GD --&amp;gt; H[&amp;#34;Hyperparameters&amp;#34;]
H --&amp;gt; LR[&amp;#34;Learning&amp;lt;br/&amp;gt;rate&amp;#34;]
H --&amp;gt; BS[&amp;#34;Batch&amp;lt;br/&amp;gt;size&amp;#34;]
H --&amp;gt; EP[&amp;#34;Epochs&amp;#34;]

style GD fill:#90CAF9,stroke:#1E88E5,color:#000

style CF fill:#CE93D8,stroke:#8E24AA,color:#000
style W fill:#CE93D8,stroke:#8E24AA,color:#000
style GR fill:#CE93D8,stroke:#8E24AA,color:#000
style H fill:#CE93D8,stroke:#8E24AA,color:#000
style LR fill:#CE93D8,stroke:#8E24AA,color:#000
style BS fill:#CE93D8,stroke:#8E24AA,color:#000
style EP fill:#CE93D8,stroke:#8E24AA,color:#000
&lt;/pre>

&lt;hr>
&lt;h2 id="types-of-gd">
 Types of GD
 
 &lt;a class="anchor" href="#types-of-gd">#&lt;/a>
 
&lt;/h2>


&lt;pre class="mermaid">
flowchart TD
T[&amp;#34;Gradient Descent&amp;lt;br/&amp;gt;types&amp;#34;] --&amp;gt; BGD[&amp;#34;Batch&amp;lt;br/&amp;gt;GD&amp;#34;]
T --&amp;gt; SGD[&amp;#34;Stochastic&amp;lt;br/&amp;gt;GD&amp;#34;]
T --&amp;gt; MGD[&amp;#34;Mini-batch&amp;lt;br/&amp;gt;GD&amp;#34;]

BGD --&amp;gt; ALL[&amp;#34;All data&amp;lt;br/&amp;gt;per step&amp;#34;]
BGD --&amp;gt; STB[&amp;#34;Smooth&amp;lt;br/&amp;gt;updates&amp;#34;]

SGD --&amp;gt; ONE[&amp;#34;1 sample&amp;lt;br/&amp;gt;per step&amp;#34;]
SGD --&amp;gt; FAST[&amp;#34;Quick&amp;lt;br/&amp;gt;progress&amp;#34;]
SGD --&amp;gt; NOISE[&amp;#34;Noisy&amp;lt;br/&amp;gt;updates&amp;#34;]

MGD --&amp;gt; MB[&amp;#34;Small batch&amp;lt;br/&amp;gt;per step&amp;#34;]
MGD --&amp;gt; PRACT[&amp;#34;Practical&amp;lt;br/&amp;gt;default&amp;#34;]

style T fill:#90CAF9,stroke:#1E88E5,color:#000

style BGD fill:#C8E6C9,stroke:#2E7D32,color:#000
style SGD fill:#C8E6C9,stroke:#2E7D32,color:#000
style MGD fill:#C8E6C9,stroke:#2E7D32,color:#000

style ALL fill:#CE93D8,stroke:#8E24AA,color:#000
style STB fill:#CE93D8,stroke:#8E24AA,color:#000
style ONE fill:#CE93D8,stroke:#8E24AA,color:#000
style FAST fill:#CE93D8,stroke:#8E24AA,color:#000
style NOISE fill:#CE93D8,stroke:#8E24AA,color:#000
style MB fill:#CE93D8,stroke:#8E24AA,color:#000
style PRACT fill:#CE93D8,stroke:#8E24AA,color:#000
&lt;/pre>

&lt;h3 id="batch">
 Batch
 
 &lt;a class="anchor" href="#batch">#&lt;/a>
 
&lt;/h3>
&lt;ul>
&lt;li>Use only if you have huge compute and a lot of time to train&lt;/li>
&lt;/ul>
&lt;h3 id="sgd">
 SGD
 
 &lt;a class="anchor" href="#sgd">#&lt;/a>
 
&lt;/h3>
&lt;ul>
&lt;li>
&lt;p>go-to solution&lt;/p></description></item><item><title>Deep Learning</title><link>https://arshadhs.github.io/docs/ai/deep-learning/</link><pubDate>Wed, 22 Apr 2026 00:00:00 +0000</pubDate><guid>https://arshadhs.github.io/docs/ai/deep-learning/</guid><description>&lt;h1 id="deep-learning">
 Deep Learning
 
 &lt;a class="anchor" href="#deep-learning">#&lt;/a>
 
&lt;/h1>
&lt;ul>
&lt;li>Subset of ML&lt;/li>
&lt;li>focuses on algorithms inspired by the structure and function of the brain, called &lt;strong>Artificial Neural Networks&lt;/strong>.&lt;/li>
&lt;li>A &lt;a href="https://arshadhs.github.io/docs/ai/neural-network/">neural network&lt;/a> with multiple hidden layers and multiple nodes in each hidden layer is known as a deep learning system or a deep neural network.&lt;/li>
&lt;li>Allows systems to &lt;strong>automatically learn hierarchical representations&lt;/strong> (features) from raw input, such as images, sound, or text.&lt;/li>
&lt;/ul>
&lt;hr>
&lt;h2 id="operational-steps-for-neural-architectures">
 Operational Steps for Neural Architectures
 
 &lt;a class="anchor" href="#operational-steps-for-neural-architectures">#&lt;/a>
 
&lt;/h2>
&lt;table>
 &lt;thead>
 &lt;tr>
 &lt;th>Step&lt;/th>
 &lt;th>Perceptron (Boolean/Logic)&lt;/th>
 &lt;th>Linear Regression Network&lt;/th>
 &lt;th>Binary Classification (Logistic)&lt;/th>
 &lt;th>DFNN / MLP (Classification)&lt;/th>
 &lt;/tr>
 &lt;/thead>
 &lt;tbody>
 &lt;tr>
 &lt;td>&lt;strong>1. Input&lt;/strong>&lt;/td>
 &lt;td>Take binary or discrete inputs 
&lt;span>
 \( x_1, \dots, x_n \)
 &lt;/span>

&lt;/td>
 &lt;td>Take numerical features 
&lt;span>
 \( x \)
 &lt;/span>

&lt;/td>
 &lt;td>Take numerical features 
&lt;span>
 \( x \)
 &lt;/span>

&lt;/td>
 &lt;td>Take high-dimensional numerical or categorical features&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>&lt;strong>2. Weighted Sum&lt;/strong>&lt;/td>
 &lt;td>Single calculation: 
&lt;span>
 \( z = \sum (w_i x_i) + b \)
 &lt;/span>

&lt;/td>
 &lt;td>Single calculation: 
&lt;span>
 \( \hat{y} = w_0 + w_1 x \)
 &lt;/span>

&lt;/td>
 &lt;td>Single calculation: 
&lt;span>
 \( z = W x + b \)
 &lt;/span>

&lt;/td>
 &lt;td>Multiple stages: 
&lt;span>
 \( z^{[l]} = W^{[l]} a^{[l-1]} + b^{[l]} \)
 &lt;/span>

 for each layer 
&lt;span>
 \( l \)
 &lt;/span>

&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>&lt;strong>3. Activation&lt;/strong>&lt;/td>
 &lt;td>Step Function: Output 1 if 
&lt;span>
 \( z \geq 0 \)
 &lt;/span>

, else 0&lt;/td>
 &lt;td>Identity: The output remains 
&lt;span>
 \( z \)
 &lt;/span>

 (no non-linear change)&lt;/td>
 &lt;td>Sigmoid: Maps 
&lt;span>
 \( z \)
 &lt;/span>

 to a probability between 0 and 1&lt;/td>
 &lt;td>ReLU for hidden layers; Softmax/Sigmoid for the output layer&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>&lt;strong>4. Loss / Error&lt;/strong>&lt;/td>
 &lt;td>Error = Target − Output&lt;/td>
 &lt;td>Mean Squared Error (MSE): 
&lt;span>
 \( J = \frac{1}{2N} \sum (Y - \hat{y})^2 \)
 &lt;/span>

&lt;/td>
 &lt;td>Binary Cross-Entropy (BCE): penalises based on probability distance&lt;/td>
 &lt;td>BCE or Categorical Cross-Entropy for multiple classes&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>&lt;strong>5. Optimisation&lt;/strong>&lt;/td>
 &lt;td>Update weights only on misclassification&lt;/td>
 &lt;td>Gradient Descent: compute gradients of the cost and update weights iteratively&lt;/td>
 &lt;td>Backpropagation: compute error signals 
&lt;span>
 \( \delta \)
 &lt;/span>

 and gradients 
&lt;span>
 \( dW \)
 &lt;/span>

&lt;/td>
 &lt;td>Backpropagation: recursive chain rule to update all hidden layer weights&lt;/td>
 &lt;/tr>
 &lt;tr>
 &lt;td>&lt;strong>6. Output&lt;/strong>&lt;/td>
 &lt;td>Discrete Boolean value (0 or 1)&lt;/td>
 &lt;td>Continuous numerical value (e.g., house prices)&lt;/td>
 &lt;td>Single probability score or class label&lt;/td>
 &lt;td>A vector of probabilities for multiple classes&lt;/td>
 &lt;/tr>
 &lt;/tbody>
&lt;/table>
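&lt;p>Steps 1 to 3 of the table can be sketched as a single forward pass in NumPy (the layer sizes and random weights here are illustrative, not trained values):&lt;/p>

```python
import numpy as np

rng = np.random.default_rng(6)

def relu(z):
    return np.maximum(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One forward pass: z[l] = W[l] a[l-1] + b[l], then an activation.
x = rng.normal(size=4)                            # 4 input features

W1, b1 = rng.normal(size=(3, 4)), np.zeros(3)     # hidden layer (ReLU)
W2, b2 = rng.normal(size=(1, 3)), np.zeros(1)     # output layer (sigmoid)

a1 = relu(W1 @ x + b1)            # hidden activations, all non-negative
y_hat = sigmoid(W2 @ a1 + b2)     # a probability strictly between 0 and 1
```

&lt;p>Training (steps 4 and 5) would then compare y_hat with a label and backpropagate the error.&lt;/p>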
&lt;hr>




&lt;ul>
 
 
 
 
 &lt;li>
 &lt;a href="https://arshadhs.github.io/docs/ai/deep-learning/010-neural-network/">Neural Networks&lt;/a>
 &lt;/li>
 
 
 
 
 &lt;li>
 &lt;a href="https://arshadhs.github.io/docs/ai/deep-learning/020-perceptron/">Artificial Neuron and Perceptron&lt;/a>
 &lt;/li>
 
 
 
 
 &lt;li>
 &lt;a href="https://arshadhs.github.io/docs/ai/deep-learning/030-linear-neural-networks-for-regression/">LNN for Regression&lt;/a>
 &lt;/li>
 
 
 
 
 &lt;li>
 &lt;a href="https://arshadhs.github.io/docs/ai/deep-learning/035-gradient-descent-algorithm/">Gradient Descent Algorithm&lt;/a>
 &lt;/li>
 
 
 
 
 &lt;li>
 &lt;a href="https://arshadhs.github.io/docs/ai/deep-learning/040-linear-neural-networks-for-classification/">LNN for Classification&lt;/a>
 &lt;/li>
 
 
 
 
 &lt;li>
 &lt;a href="https://arshadhs.github.io/docs/ai/deep-learning/050-deep-feedforward/">Deep Feedforward Neural Networks (DFNN) for Classification&lt;/a>
 &lt;/li>
 
 
 
 
 &lt;li>
 &lt;a href="https://arshadhs.github.io/docs/ai/deep-learning/060-cnn-fundamentals/">Convolutional Neural Networks&lt;/a>
 &lt;/li>
 
 
 
 
 &lt;li>
 &lt;a href="https://arshadhs.github.io/docs/ai/deep-learning/065-deep-cnn-architectures/">Deep CNN Architectures&lt;/a>
 &lt;/li>
 
 
 
 
 &lt;li>
 &lt;a href="https://arshadhs.github.io/docs/ai/deep-learning/067-cnn-model/">CNN Pipeline&lt;/a>
 &lt;/li>
 
 
 
 
 &lt;li>
 &lt;a href="https://arshadhs.github.io/docs/ai/deep-learning/070-recurrent-nn/">Recurrent Neural Networks&lt;/a>
 &lt;/li>
 
 
 
 
 &lt;li>
 &lt;a href="https://arshadhs.github.io/docs/ai/deep-learning/075-recurrent-nn-deep/">Deep Recurrent Neural Networks&lt;/a>
 &lt;/li>
 
 
 
 
 &lt;li>
 &lt;a href="https://arshadhs.github.io/docs/ai/deep-learning/080-attention-mechanism/">Attention Mechanism&lt;/a>
 &lt;/li>
 
 
 
 
 &lt;li>
 &lt;a href="https://arshadhs.github.io/docs/ai/deep-learning/090-transformer/">Transformer&lt;/a>
 &lt;/li>
 
 
 
 
 &lt;li>
 &lt;a href="https://arshadhs.github.io/docs/ai/deep-learning/100-optimise-deep-models/">Optimisation of Deep models&lt;/a>
 &lt;/li>
 
 
 
 
 &lt;li>
 &lt;a href="https://arshadhs.github.io/docs/ai/deep-learning/110-regularisation-deep-models/">Regularisation for Deep models&lt;/a>
 &lt;/li>
 
 

 
 
&lt;/ul>


&lt;hr>


&lt;pre class="mermaid">
flowchart LR
 %% Input Layer
 subgraph subGraph0[&amp;#34;Input Layer&amp;#34;]
 I1((&amp;#34;Input 1&amp;#34;))
 I2((&amp;#34;Input 2&amp;#34;))
 I3((&amp;#34;Input 3&amp;#34;))
 end

 %% Hidden Layers
 subgraph subGraph1[&amp;#34;Hidden Layer 1&amp;#34;]
 H1a((&amp;#34;H1-1&amp;#34;))
 H1b((&amp;#34;H1-2&amp;#34;))
 H1c((&amp;#34;H1-3&amp;#34;))
 end

 subgraph subGraph2[&amp;#34;Hidden Layer 2&amp;#34;]
 H2a((&amp;#34;H2-1&amp;#34;))
 H2b((&amp;#34;H2-2&amp;#34;))
 H2c((&amp;#34;H2-3&amp;#34;))
 end

 subgraph subGraph3[&amp;#34;Hidden Layer 3&amp;#34;]
 H3a((&amp;#34;H3-1&amp;#34;))
 H3b((&amp;#34;H3-2&amp;#34;))
 H3c((&amp;#34;H3-3&amp;#34;))
 end

 %% Output Layer
 subgraph subGraph4[&amp;#34;Output Layer&amp;#34;]
 O((&amp;#34;Output&amp;#34;))
 end

 %% Connections: Input to Hidden Layer 1
 I1 --&amp;gt; H1a &amp;amp; H1b &amp;amp; H1c
 I2 --&amp;gt; H1a &amp;amp; H1b &amp;amp; H1c
 I3 --&amp;gt; H1a &amp;amp; H1b &amp;amp; H1c

 %% Connections: Hidden Layer 1 to Hidden Layer 2
 H1a --&amp;gt; H2a &amp;amp; H2b &amp;amp; H2c
 H1b --&amp;gt; H2a &amp;amp; H2b &amp;amp; H2c
 H1c --&amp;gt; H2a &amp;amp; H2b &amp;amp; H2c

 %% Connections: Hidden Layer 2 to Hidden Layer 3
 H2a --&amp;gt; H3a &amp;amp; H3b &amp;amp; H3c
 H2b --&amp;gt; H3a &amp;amp; H3b &amp;amp; H3c
 H2c --&amp;gt; H3a &amp;amp; H3b &amp;amp; H3c

 %% Connections: Hidden Layer 3 to Output
 H3a --&amp;gt; O
 H3b --&amp;gt; O
 H3c --&amp;gt; O

 %% Styling
 style I1 fill:#C8E6C9
 style I2 fill:#C8E6C9
 style I3 fill:#C8E6C9
 style H1a fill:#BBDEFB
 style H1b fill:#BBDEFB
 style H1c fill:#BBDEFB
 style H2a fill:#90CAF9
 style H2b fill:#90CAF9
 style H2c fill:#90CAF9
 style H3a fill:#64B5F6
 style H3b fill:#64B5F6
 style H3c fill:#64B5F6
 style O fill:#FFCDD2
 style subGraph0 stroke:none,fill:transparent
 style subGraph1 stroke:none,fill:transparent
 style subGraph2 stroke:none,fill:transparent
 style subGraph3 stroke:none,fill:transparent
 style subGraph4 stroke:none,fill:transparent
&lt;/pre>

&lt;hr>
&lt;h2 id="types-of-neural-networks">
 Types of Neural Networks
 
 &lt;a class="anchor" href="#types-of-neural-networks">#&lt;/a>
 
&lt;/h2>
&lt;ul>
&lt;li>Standard NN - small, fully connected networks for smaller, simpler data (e.g. real-estate pricing)&lt;/li>
&lt;li>CNN - Convolutional - used for images (e.g. photo tagging, object detection)&lt;/li>
&lt;li>RNN - Recurrent - used for sequential data such as text and speech (e.g. speech recognition, translation)&lt;/li>
&lt;li>Hybrid NN - combinations of the above (e.g. autonomous driving)&lt;/li>
&lt;/ul>
&lt;hr>
&lt;h2 id="components-of-dl">
 Components of DL
 
 &lt;a class="anchor" href="#components-of-dl">#&lt;/a>
 
&lt;/h2>
&lt;ul>
&lt;li>Data&lt;/li>
&lt;li>Learning Algorithm : How to transform data&lt;/li>
&lt;li>&lt;strong>Loss Function&lt;/strong>: objective function that &lt;strong>quantifies how well the model is doing&lt;/strong>; the lower the loss, the better the model.&lt;/li>
&lt;li>Optimisation Algorithm: searches for the parameter values that &lt;strong>minimise the loss function&lt;/strong>. Popular optimisation algorithms for deep learning are based on an approach called &lt;strong>gradient descent&lt;/strong>.&lt;/li>
&lt;li>Model&lt;/li>
&lt;/ul>
&lt;hr>
&lt;h2 id="applications">
 Applications
 
 &lt;a class="anchor" href="#applications">#&lt;/a>
 
&lt;/h2>
&lt;ul>
&lt;li>Computer Vision (e.g., face detection, medical imaging)&lt;/li>
&lt;li>Natural Language Processing (e.g., ChatGPT, translation)&lt;/li>
&lt;li>Self Driving Cars&lt;/li>
&lt;li>Speech Assistants (e.g., Alexa, Siri)&lt;/li>
&lt;/ul>
&lt;hr>
&lt;h2 id="intution">
 Intuition
 
 &lt;a class="anchor" href="#intution">#&lt;/a>
 
&lt;/h2>
&lt;p>Deep Learning is the methodology, DNN is a model.&lt;/p></description></item><item><title>Hypothesis Testing</title><link>https://arshadhs.github.io/docs/ai/statistics/04_hypothesis_testing/</link><pubDate>Thu, 12 Mar 2026 00:00:00 +0000</pubDate><guid>https://arshadhs.github.io/docs/ai/statistics/04_hypothesis_testing/</guid><description>&lt;h1 id="hypothesis-testing">
 Hypothesis Testing
 
 &lt;a class="anchor" href="#hypothesis-testing">#&lt;/a>
 
&lt;/h1>
&lt;p>Hypothesis testing is a structured way to decide:&lt;/p>
&lt;p>Is what we see in a sample just random variation,
or is there evidence of a real effect in the population?&lt;/p>
&lt;p>Hypothesis testing sits inside &lt;strong>inferential statistics&lt;/strong>:
we use a &lt;strong>sample&lt;/strong> to make a statement about a &lt;strong>population&lt;/strong>.&lt;/p>
&lt;ul>
&lt;li>Sampling (random and stratified)&lt;/li>
&lt;li>Sampling distribution and Central Limit Theorem&lt;/li>
&lt;li>Estimation (confidence intervals and confidence level)&lt;/li>
&lt;li>Testing hypotheses (mean, proportion, ANOVA)&lt;/li>
&lt;li>Maximum likelihood (MLE)&lt;/li>
&lt;/ul>
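&lt;p>A minimal sketch of a test about a mean, using a large-sample normal approximation (the null value 50, the true mean 55, and the sample size are all illustrative assumptions):&lt;/p>

```python
import math
import numpy as np

rng = np.random.default_rng(7)

# H0: the population mean is 50. The sample actually comes from a
# population with mean 55, so we expect strong evidence against H0.
sample = rng.normal(loc=55.0, scale=10.0, size=400)

n = sample.size
se = sample.std(ddof=1) / math.sqrt(n)    # standard error of the mean
z = (sample.mean() - 50.0) / se           # test statistic under H0

# Two-sided p-value from the normal approximation (n is large).
p_value = 2.0 * (1.0 - 0.5 * (1.0 + math.erf(abs(z) / math.sqrt(2.0))))

alpha = 0.05
reject_h0 = bool(np.less(p_value, alpha))
```

&lt;p>For small samples the t distribution replaces the normal approximation, but the logic is the same.&lt;/p>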
&lt;blockquote class="book-hint info">
&lt;p>Key takeaway:
The logic is always the same:&lt;/p></description></item><item><title>Classification(Linear Models)</title><link>https://arshadhs.github.io/docs/ai/machine-learning/04-linear-models-classification/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://arshadhs.github.io/docs/ai/machine-learning/04-linear-models-classification/</guid><description>&lt;h1 id="linear-models-for-classification">
 Linear models for Classification
 
 &lt;a class="anchor" href="#linear-models-for-classification">#&lt;/a>
 
&lt;/h1>
&lt;ul>
&lt;li>categorises data by finding a linear boundary (hyperplane) that separates classes&lt;/li>
&lt;li>calculating a weighted sum of input features plus bias&lt;/li>
&lt;/ul>


&lt;script src="https://arshadhs.github.io/mermaid.min.js">&lt;/script>

 &lt;script>mermaid.initialize({
 "flowchart": {
 "useMaxWidth":true
 },
 "theme": "default"
}
)&lt;/script>




&lt;pre class="mermaid">
flowchart TD
T[&amp;#34;Linear&amp;lt;br/&amp;gt;classification&amp;lt;br/&amp;gt;models&amp;#34;] --&amp;gt; P[&amp;#34;Perceptron&amp;#34;]
T --&amp;gt; LR[&amp;#34;Logistic&amp;lt;br/&amp;gt;regression&amp;#34;]
T --&amp;gt; SVM[&amp;#34;Linear&amp;lt;br/&amp;gt;SVM&amp;#34;]

P --&amp;gt;|uses| STEP[&amp;#34;Step&amp;lt;br/&amp;gt;activation&amp;#34;]
LR --&amp;gt;|uses| SIG[&amp;#34;Sigmoid&amp;lt;br/&amp;gt;+ log loss&amp;#34;]
SVM --&amp;gt;|uses| HNG[&amp;#34;Hinge&amp;lt;br/&amp;gt;loss&amp;#34;]

style T fill:#90CAF9,stroke:#1E88E5,color:#000

style P fill:#C8E6C9,stroke:#2E7D32,color:#000
style LR fill:#C8E6C9,stroke:#2E7D32,color:#000
style SVM fill:#C8E6C9,stroke:#2E7D32,color:#000

style STEP fill:#CE93D8,stroke:#8E24AA,color:#000
style SIG fill:#CE93D8,stroke:#8E24AA,color:#000
style HNG fill:#CE93D8,stroke:#8E24AA,color:#000
&lt;/pre>

&lt;h2 id="discriminant-functions">
 Discriminant Functions
 
 &lt;a class="anchor" href="#discriminant-functions">#&lt;/a>
 
&lt;/h2>
&lt;h2 id="decision-theory">
 Decision Theory
 
 &lt;a class="anchor" href="#decision-theory">#&lt;/a>
 
&lt;/h2>
&lt;h2 id="probabilistic-discriminative-classifiers">
 Probabilistic Discriminative Classifiers
 
 &lt;a class="anchor" href="#probabilistic-discriminative-classifiers">#&lt;/a>
 
&lt;/h2>
&lt;hr>
&lt;h2 id="logistic-regression">
 Logistic Regression
 
 &lt;a class="anchor" href="#logistic-regression">#&lt;/a>
 
&lt;/h2>
&lt;ul>
&lt;li>Supervised machine learning algorithm&lt;/li>
&lt;li>Binary &lt;strong>classification&lt;/strong> algorithm&lt;/li>
&lt;li>assumes a linear decision boundary, but does not require the classes to be perfectly linearly separable (perfect separability actually makes the weights diverge)&lt;/li>
&lt;li>predicts the probability that an input belongs to a specific class&lt;/li>
&lt;li>uses the &lt;strong>sigmoid function&lt;/strong> to convert the linear score into a probability between 0 and 1&lt;/li>
&lt;/ul>
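&lt;p>A minimal sketch of these ideas in plain Python (the single-example gradient loop below is illustrative only, not a full training implementation):&lt;/p>

```python
import math

def sigmoid(z):
    # squashes any real score into (0, 1)
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(w, b, x):
    # P(y = 1 | x) = sigmoid(w . x + b)
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return sigmoid(z)

def log_loss(p, y):
    # y in {0, 1}; confident wrong predictions are penalised heavily
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def sgd_step(w, b, x, y, lr=0.1):
    # the gradient of log-loss w.r.t. the score z is simply (p - y)
    p = predict_proba(w, b, x)
    err = p - y
    w = [wi - lr * err * xi for wi, xi in zip(w, x)]
    return w, b - lr * err

w, b = [0.0, 0.0], 0.0
x, y = [1.0, 2.0], 1
before = log_loss(predict_proba(w, b, x), y)
for _ in range(50):
    w, b = sgd_step(w, b, x, y)
after = log_loss(predict_proba(w, b, x), y)
print(before > after)  # True: the loss falls as w, b fit the point
```

&lt;p>The simple form of the gradient, &lt;code>p - y&lt;/code>, is what makes minimising log-loss equivalent to maximising the likelihood.&lt;/p>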
&lt;blockquote class="book-hint info">
&lt;p>Key takeaway:
Logistic regression predicts $P(y=1\mid x)$ using a sigmoid of a linear score $z=w\cdot x+b$,
then learns $w,b$ by maximising likelihood (equivalently minimising log-loss).&lt;/p></description></item><item><title>Foundation Models</title><link>https://arshadhs.github.io/docs/ai/genai/foundation-model/</link><pubDate>Sun, 14 Dec 2025 00:00:00 +0000</pubDate><guid>https://arshadhs.github.io/docs/ai/genai/foundation-model/</guid><description>&lt;h1 id="foundation-model">
 Foundation Model
 
 &lt;a class="anchor" href="#foundation-model">#&lt;/a>
 
&lt;/h1>
&lt;p>AI models trained on massive datasets to perform a wide range of tasks with minimal fine-tuning.&lt;/p>
&lt;ul>
&lt;li>
&lt;p>are large deep learning neural networks&lt;/p>
&lt;/li>
&lt;li>
&lt;p>trained on &lt;strong>massive and diverse datasets&lt;/strong> (text, images, audio, or multiple modalities)&lt;/p>
&lt;/li>
&lt;li>
&lt;p>contain &lt;strong>millions or billions of parameters&lt;/strong>&lt;/p>
&lt;/li>
&lt;li>
&lt;p>designed for &lt;strong>general-purpose intelligence&lt;/strong>: a broad range of tasks, not a single task&lt;/p>
&lt;/li>
&lt;li>
&lt;p>acts as &lt;strong>base models&lt;/strong> for building specialised AI applications&lt;/p></description></item><item><title>LLM - Model</title><link>https://arshadhs.github.io/docs/ai/genai/llm/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://arshadhs.github.io/docs/ai/genai/llm/</guid><description>&lt;h1 id="llm--large-language-model">
 LLM – Large Language Model
 
 &lt;a class="anchor" href="#llm--large-language-model">#&lt;/a>
 
&lt;/h1>
&lt;p>Large Language Models (LLMs) are &lt;strong>advanced AI systems&lt;/strong> designed to process, understand, and generate &lt;strong>human-like text&lt;/strong>.&lt;/p>
&lt;p>They learn language by analysing &lt;strong>massive amounts of text data&lt;/strong>, discovering patterns in:&lt;/p>
&lt;ul>
&lt;li>
&lt;p>grammar&lt;/p>
&lt;/li>
&lt;li>
&lt;p>meaning&lt;/p>
&lt;/li>
&lt;li>
&lt;p>context&lt;/p>
&lt;/li>
&lt;li>
&lt;p>relationships between words and sentences&lt;/p>
&lt;/li>
&lt;/ul>
&lt;p>Key characteristics:&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Built on &lt;strong>Deep Learning&lt;/strong>&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Implemented using &lt;strong>Neural Networks&lt;/strong>&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Based on &lt;strong>Transformers&lt;/strong>&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Often combined with tools like:&lt;/p>
&lt;ul>
&lt;li>Retrieval (RAG)&lt;/li>
&lt;li>Agents&lt;/li>
&lt;li>External APIs&lt;/li>
&lt;li>Memory systems&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;hr>
&lt;h2 id="what-makes-an-llm-special">
 What makes an LLM special?
 
 &lt;a class="anchor" href="#what-makes-an-llm-special">#&lt;/a>
 
&lt;/h2>
&lt;ul>
&lt;li>Built using &lt;strong>deep neural networks&lt;/strong>&lt;/li>
&lt;li>Trained on &lt;strong>very large datasets&lt;/strong> (books, articles, code, web text)&lt;/li>
&lt;li>Can perform many tasks &lt;strong>without task-specific training&lt;/strong>&lt;/li>
&lt;li>General-purpose language understanding, not single-task models&lt;/li>
&lt;/ul>
&lt;hr>
&lt;h2 id="foundation-transformer-architecture">
 Foundation: Transformer Architecture
 
 &lt;a class="anchor" href="#foundation-transformer-architecture">#&lt;/a>
 
&lt;/h2>
&lt;p>LLMs are based on the &lt;strong>&lt;a href="https://arshadhs.github.io/docs/ai/deep-learning/transformer/">Transformer Architecture&lt;/a>&lt;/strong>, which allows models to understand &lt;strong>context and long-range dependencies&lt;/strong> in text.&lt;/p></description></item><item><title>AI Agents</title><link>https://arshadhs.github.io/docs/ai/genai/ai-agents/</link><pubDate>Mon, 15 Dec 2025 10:55:52 +0100</pubDate><guid>https://arshadhs.github.io/docs/ai/genai/ai-agents/</guid><description>&lt;h1 id="ai-agents">
 AI Agents
 
 &lt;a class="anchor" href="#ai-agents">#&lt;/a>
 
&lt;/h1>
&lt;p>Also referred to as Agentic AI.&lt;/p>
&lt;p>AI agents are &lt;strong>intelligent systems&lt;/strong> that can &lt;strong>plan, make decisions, and take actions&lt;/strong> to achieve goals with &lt;strong>minimal human intervention&lt;/strong>.&lt;/p>
&lt;ul>
&lt;li>
&lt;p>A common use case is &lt;strong>task automation&lt;/strong>, for example booking travel based on a user’s request.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>AI agents typically build on &lt;strong>Generative AI&lt;/strong> and use &lt;strong>Large Language Models (LLMs)&lt;/strong> as the reasoning core.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Agents often interact with tools (APIs, databases, calendars) to complete multi-step workflows.&lt;/p></description></item><item><title>Retrieval-Augmented Generation (RAG)</title><link>https://arshadhs.github.io/docs/ai/genai/rag/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://arshadhs.github.io/docs/ai/genai/rag/</guid><description>&lt;h1 id="retrieval-augmented-generation-rag">
 Retrieval-Augmented Generation (RAG)
 
 &lt;a class="anchor" href="#retrieval-augmented-generation-rag">#&lt;/a>
 
&lt;/h1>
&lt;p>&lt;strong>Retrieval-Augmented Generation (RAG)&lt;/strong> is a system design pattern that improves an LLM’s answers by:&lt;/p>
&lt;ol>
&lt;li>&lt;strong>Retrieving&lt;/strong> relevant information from an external knowledge source, and then&lt;/li>
&lt;li>&lt;strong>Augmenting&lt;/strong> the LLM prompt with that retrieved context before generating the final response.&lt;/li>
&lt;/ol>
&lt;p>RAG helps an LLM &lt;strong>look things up first&lt;/strong>, then &lt;strong>answer using evidence&lt;/strong>.&lt;/p>
&lt;hr>
&lt;h2 id="why-rag-is-useful">
 Why RAG is Useful
 
 &lt;a class="anchor" href="#why-rag-is-useful">#&lt;/a>
 
&lt;/h2>
&lt;p>RAG is commonly used when:&lt;/p>
&lt;ul>
&lt;li>Your knowledge is in &lt;strong>private documents&lt;/strong> (PDFs, policies, internal wiki)&lt;/li>
&lt;li>You need &lt;strong>up-to-date information&lt;/strong> (things not in the model’s training data)&lt;/li>
&lt;li>You want fewer &lt;strong>hallucinations&lt;/strong> by grounding answers in retrieved sources&lt;/li>
&lt;li>You want &lt;strong>traceability&lt;/strong> (show “where the answer came from”)&lt;/li>
&lt;/ul>
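&lt;p>A toy sketch of the retrieve-then-augment loop. The similarity here is naive word overlap and &lt;code>build_prompt&lt;/code> is a hypothetical helper; real systems use embeddings and a vector store, but the pattern is the same:&lt;/p>

```python
# Toy RAG loop: rank documents by (naive) word overlap with the query,
# then splice the top-k into the prompt before calling the LLM.
def similarity(query, doc):
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / len(q | d)  # Jaccard overlap of word sets

def retrieve(query, corpus, k=2):
    ranked = sorted(corpus, key=lambda d: similarity(query, d), reverse=True)
    return ranked[:k]

def build_prompt(query, contexts):
    ctx = "\n".join(f"- {c}" for c in contexts)
    return f"Answer using ONLY this context:\n{ctx}\n\nQuestion: {query}"

corpus = [
    "The refund policy allows returns within 30 days.",
    "Shipping takes 3 to 5 business days.",
    "Support is available via email and chat.",
]
query = "What is the refund policy?"
prompt = build_prompt(query, retrieve(query, corpus))
print(prompt)  # grounded prompt, ready to send to the model
```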
&lt;blockquote class="book-hint info">
&lt;p>RAG does not change the model weights.&lt;br>
It changes what the model &lt;em>sees&lt;/em> at inference time by adding retrieved context.&lt;/p></description></item><item><title>Decision Tree</title><link>https://arshadhs.github.io/docs/ai/machine-learning/05-decision-tree/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://arshadhs.github.io/docs/ai/machine-learning/05-decision-tree/</guid><description>&lt;h1 id="decision-tree">
 Decision Tree
 
 &lt;a class="anchor" href="#decision-tree">#&lt;/a>
 
&lt;/h1>
&lt;p>A decision tree classifies an example by asking a sequence of questions about its attributes until it reaches a leaf (final decision).&lt;/p>
&lt;blockquote class="book-hint info">
&lt;p>Key takeaway:
A decision tree grows by repeatedly splitting the training data into &lt;strong>purer&lt;/strong> subsets using an impurity measure
(Entropy / Gini / Classification Error).&lt;/p>
&lt;/blockquote>
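&lt;p>The impurity measures named above can be sketched in a few lines of plain Python (labels below are made-up examples):&lt;/p>

```python
import math
from collections import Counter

def proportions(labels):
    n = len(labels)
    return [c / n for c in Counter(labels).values()]

def entropy(labels):
    # H = -sum p*log2(p): 0 for a pure node, 1 for a 50/50 binary node
    return sum(-p * math.log2(p) for p in proportions(labels))

def gini(labels):
    # G = 1 - sum p^2: 0 for a pure node, 0.5 for a 50/50 binary node
    return 1.0 - sum(p * p for p in proportions(labels))

def information_gain(parent, splits):
    # impurity drop achieved by a split (measured here with entropy)
    n = len(parent)
    return entropy(parent) - sum(len(s) / n * entropy(s) for s in splits)

mixed = ["yes", "yes", "no", "no"]
print(entropy(mixed), gini(mixed))  # 1.0 0.5
print(information_gain(mixed, [["yes", "yes"], ["no", "no"]]))  # 1.0: a perfect split
```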
&lt;hr>
&lt;h2 id="information-theory">
 Information Theory
 
 &lt;a class="anchor" href="#information-theory">#&lt;/a>
 
&lt;/h2>
&lt;p>Decision trees need a way to measure:
“How mixed are the class labels at a node?”&lt;/p></description></item><item><title>Prediction &amp; Forecasting</title><link>https://arshadhs.github.io/docs/ai/statistics/05_prediction_n_forecasting/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://arshadhs.github.io/docs/ai/statistics/05_prediction_n_forecasting/</guid><description>&lt;h1 id="prediction--forecasting">
 Prediction &amp;amp; Forecasting
 
 &lt;a class="anchor" href="#prediction--forecasting">#&lt;/a>
 
&lt;/h1>
&lt;h2 id="correlation">
 Correlation
 
 &lt;a class="anchor" href="#correlation">#&lt;/a>
 
&lt;/h2>
&lt;h2 id="regression">
 Regression
 
 &lt;a class="anchor" href="#regression">#&lt;/a>
 
&lt;/h2>
&lt;h2 id="time-series-analysis">
 Time Series Analysis
 
 &lt;a class="anchor" href="#time-series-analysis">#&lt;/a>
 
&lt;/h2>
&lt;h3 id="introduction-components-of-time-series-data">
 Introduction, Components of time series data
 
 &lt;a class="anchor" href="#introduction-components-of-time-series-data">#&lt;/a>
 
&lt;/h3>
&lt;h3 id="ma-model--basic-and-weighted-ma-model">
 MA model – basic and weighted MA model
 
 &lt;a class="anchor" href="#ma-model--basic-and-weighted-ma-model">#&lt;/a>
 
&lt;/h3>
&lt;h3 id="time-series-models">
 Time series models
 
 &lt;a class="anchor" href="#time-series-models">#&lt;/a>
 
&lt;/h3>
&lt;ul>
&lt;li>AR Model&lt;/li>
&lt;li>ARIMA Model&lt;/li>
&lt;li>SARIMA, SARIMAX, VAR, VARMAX&lt;/li>
&lt;li>Simple exponential smoothing model&lt;/li>
&lt;/ul>
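&lt;p>The basic and weighted MA models mentioned above can be sketched as one-step forecasts (the sales series below is made-up data):&lt;/p>

```python
def simple_ma(series, window):
    """Forecast the next value as the mean of the last `window` observations."""
    return sum(series[-window:]) / window

def weighted_ma(series, weights):
    """Weights (most recent last) should sum to 1; recent points count more."""
    recent = series[-len(weights):]
    return sum(w * x for w, x in zip(weights, recent))

sales = [10, 12, 13, 12, 15, 16]
print(simple_ma(sales, 3))                  # (12 + 15 + 16) / 3
print(weighted_ma(sales, [0.2, 0.3, 0.5]))  # 0.2*12 + 0.3*15 + 0.5*16 = 14.9
```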
&lt;hr>
&lt;p>&lt;a href="https://arshadhs.github.io/">Home&lt;/a> | &lt;a href="https://arshadhs.github.io/docs/ai/statistics/">
 Statistics
&lt;/a>&lt;/p></description></item><item><title>Gaussian Mixture model &amp; Expectation Maximization</title><link>https://arshadhs.github.io/docs/ai/statistics/06_prediction_n_forecasting/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://arshadhs.github.io/docs/ai/statistics/06_prediction_n_forecasting/</guid><description>&lt;h1 id="gaussian-mixture-model--expectation-maximization">
 Gaussian Mixture model &amp;amp; Expectation Maximization
 
 &lt;a class="anchor" href="#gaussian-mixture-model--expectation-maximization">#&lt;/a>
 
&lt;/h1>
&lt;hr>
&lt;h2 id="reference">
 Reference
 
 &lt;a class="anchor" href="#reference">#&lt;/a>
 
&lt;/h2>
&lt;p>&lt;a href="https://www.geeksforgeeks.org/machine-learning/gaussian-mixture-model/">Gaussian Mixture model&lt;/a>&lt;/p>
&lt;p>&lt;a href="https://www.geeksforgeeks.org/machine-learning/ml-expectation-maximization-algorithm/">Expectation Maximization&lt;/a>&lt;/p>
&lt;hr>
&lt;p>&lt;a href="https://arshadhs.github.io/">Home&lt;/a> | &lt;a href="https://arshadhs.github.io/docs/ai/statistics/">
 Statistics
&lt;/a>&lt;/p></description></item><item><title>Instance-based Learning</title><link>https://arshadhs.github.io/docs/ai/machine-learning/06-instance-based-learning/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://arshadhs.github.io/docs/ai/machine-learning/06-instance-based-learning/</guid><description>&lt;h1 id="instance-based-learning">
 Instance-based Learning
 
 &lt;a class="anchor" href="#instance-based-learning">#&lt;/a>
 
&lt;/h1>
&lt;p>Instance-based learning is a family of methods that &lt;strong>do not build one explicit global model during training&lt;/strong>. Instead, they &lt;strong>store training examples&lt;/strong> and delay most of the work until a new query arrives.&lt;/p>
&lt;p>When a new point must be classified or predicted, the algorithm compares it with previously seen examples, finds the most relevant neighbours, and uses them to produce the answer.&lt;/p>
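&lt;p>This query-time behaviour is easiest to see in k-nearest neighbours, the canonical instance-based method. A minimal sketch (k = 3 and the training points are made up):&lt;/p>

```python
import math
from collections import Counter

def knn_classify(query, examples, k=3):
    """examples: list of (feature_vector, label) pairs.
    'Training' is just storing them; all work happens here, at query time."""
    nearest = sorted(examples, key=lambda ex: math.dist(query, ex[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

train = [([1.0, 1.0], "A"), ([1.2, 0.8], "A"),
         ([4.0, 4.0], "B"), ([4.2, 3.9], "B"), ([3.8, 4.1], "B")]
print(knn_classify([1.1, 0.9], train))  # "A": 2 of the 3 nearest are A
```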
&lt;p>Instance-based Learning covers three linked ideas:&lt;/p></description></item><item><title>Support Vector Machine</title><link>https://arshadhs.github.io/docs/ai/machine-learning/07-support-vector-machines/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://arshadhs.github.io/docs/ai/machine-learning/07-support-vector-machines/</guid><description>&lt;h1 id="support-vector-machine-svm">
 Support Vector Machine (SVM)
 
 &lt;a class="anchor" href="#support-vector-machine-svm">#&lt;/a>
 
&lt;/h1>
&lt;p>A &lt;strong>Support Vector Machine (SVM)&lt;/strong> is a &lt;strong>supervised machine learning algorithm&lt;/strong> used for:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Classification&lt;/strong> (most common)&lt;/li>
&lt;li>&lt;strong>Regression&lt;/strong> (SVR – Support Vector Regression)&lt;/li>
&lt;/ul>
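&lt;p>The margin geometry behind SVMs reduces to two formulas: a point’s distance to the hyperplane is |w·x + b| / ||w||, and for a canonical SVM (score ±1 at the support vectors) the margin width is 2 / ||w||. A small numeric sketch (w and b are illustrative, not learned):&lt;/p>

```python
import math

def distance_to_hyperplane(w, b, x):
    """Geometric distance |w.x + b| / ||w|| from point x to the boundary."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    norm = math.sqrt(sum(wi * wi for wi in w))
    return abs(z) / norm

def margin_width(w):
    """Canonical SVM: support vectors satisfy |w.x + b| = 1, so width = 2 / ||w||."""
    return 2.0 / math.sqrt(sum(wi * wi for wi in w))

w, b = [3.0, 4.0], -5.0                           # ||w|| = 5
print(distance_to_hyperplane(w, b, [2.0, 1.0]))   # |6 + 4 - 5| / 5 = 1.0
print(margin_width(w))                            # 2 / 5 = 0.4
```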

&lt;blockquote class="book-hint">
 &lt;p>Find the decision boundary that separates classes with the &lt;strong>maximum margin&lt;/strong>.&lt;/p>
&lt;/blockquote>&lt;blockquote class="book-hint default">
&lt;p>A Support Vector Machine is a supervised learning algorithm that finds an optimal hyperplane by maximising the margin between classes, using support vectors and kernel functions to handle non-linear data.&lt;/p></description></item><item><title>Attention Mechanism</title><link>https://arshadhs.github.io/docs/ai/deep-learning/080-attention-mechanism/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://arshadhs.github.io/docs/ai/deep-learning/080-attention-mechanism/</guid><description>&lt;h1 id="attention-mechanism">
 Attention Mechanism
 
 &lt;a class="anchor" href="#attention-mechanism">#&lt;/a>
 
&lt;/h1>
&lt;ul>
&lt;li>Queries, Keys, and Values&lt;/li>
&lt;li>Attention Pooling by Similarity&lt;/li>
&lt;li>Attention Pooling via Nadaraya–Watson Regression&lt;/li>
&lt;li>Attention Scoring Functions&lt;/li>
&lt;li>Dot Product Attention&lt;/li>
&lt;li>Convenience Functions&lt;/li>
&lt;li>Scaled Dot Product Attention&lt;/li>
&lt;li>Additive Attention&lt;/li>
&lt;li>Bahdanau Attention Mechanism&lt;/li>
&lt;li>Multi-Head Attention&lt;/li>
&lt;li>Self-Attention&lt;/li>
&lt;li>Positional Encoding&lt;/li>
&lt;li>Code implementation (webinar)&lt;/li>
&lt;/ul>
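&lt;p>Scaled dot-product attention, the core of the list above, is softmax(QKᵀ/√d&lt;sub>k&lt;/sub>)V. A pure-Python sketch on tiny hand-written matrices (no batching or masking, illustrative only):&lt;/p>

```python
import math

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def scaled_dot_product_attention(Q, K, V):
    """attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V; one row per query."""
    d_k = len(K[0])
    out = []
    for q in Q:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)  # how much each key/value pair matters to q
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out

Q = [[1.0, 0.0]]                      # one query
K = [[1.0, 0.0], [0.0, 1.0]]          # two keys
V = [[10.0, 0.0], [0.0, 10.0]]        # their values
print(scaled_dot_product_attention(Q, K, V))  # pulled toward the matching key's value
```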
&lt;hr>
&lt;h2 id="reference">
 Reference
 
 &lt;a class="anchor" href="#reference">#&lt;/a>
 
&lt;/h2>
&lt;ul>
&lt;li>&lt;strong>Dive into Deep Learning&lt;/strong>. Cambridge University Press. (&lt;a href="https://d2l.ai/chapter_builders-guide/model-construction.html">Ch 10&lt;/a>, &lt;a href="https://d2l.ai/chapter_convolutional-neural-networks/index.html">Ch 7&lt;/a>)&lt;/li>
&lt;/ul>
&lt;hr>
&lt;p>&lt;a href="https://arshadhs.github.io/">Home&lt;/a> | &lt;a href="https://arshadhs.github.io/docs/ai/deep-learning/">
 Deep Learning
&lt;/a>&lt;/p></description></item><item><title>Bayesian Learning</title><link>https://arshadhs.github.io/docs/ai/machine-learning/08-bayesian-learning/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://arshadhs.github.io/docs/ai/machine-learning/08-bayesian-learning/</guid><description>&lt;h1 id="bayesian-learning">
 Bayesian Learning
 
 &lt;a class="anchor" href="#bayesian-learning">#&lt;/a>
 
&lt;/h1>
&lt;h2 id="mle-hypothesis">
 MLE Hypothesis
 
 &lt;a class="anchor" href="#mle-hypothesis">#&lt;/a>
 
&lt;/h2>
&lt;h2 id="map-hypothesis">
 MAP Hypothesis
 
 &lt;a class="anchor" href="#map-hypothesis">#&lt;/a>
 
&lt;/h2>
&lt;h2 id="bayes-rule">
 Bayes Rule
 
 &lt;a class="anchor" href="#bayes-rule">#&lt;/a>
 
&lt;/h2>
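&lt;p>Bayes’ rule, P(H|D) = P(D|H)·P(H) / P(D), in a classic diagnostic-test calculation (the numbers below are made up for illustration):&lt;/p>

```python
def bayes(prior, likelihood, evidence):
    """Posterior P(H|D) = P(D|H) * P(H) / P(D)."""
    return likelihood * prior / evidence

# Assumed numbers: 1% prevalence, 95% sensitivity, 5% false-positive rate.
p_disease = 0.01
p_pos_given_disease = 0.95
# Total probability of testing positive (law of total probability):
p_pos = p_pos_given_disease * p_disease + 0.05 * (1 - p_disease)
posterior = bayes(p_disease, p_pos_given_disease, p_pos)
print(round(posterior, 3))  # 0.161: a positive test is far from conclusive
```

&lt;p>The low posterior despite a 95% sensitive test shows why the prior matters: most positives come from the much larger healthy population.&lt;/p>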
&lt;h2 id="optimal-bayes-classifier">
 Optimal Bayes Classifier
 
 &lt;a class="anchor" href="#optimal-bayes-classifier">#&lt;/a>
 
&lt;/h2>
&lt;h2 id="naïve-bayes-classifier">
 Naïve Bayes Classifier
 
 &lt;a class="anchor" href="#na%c3%afve-bayes-classifier">#&lt;/a>
 
&lt;/h2>
&lt;h2 id="probabilistic-generative-classifiers">
 Probabilistic Generative Classifiers
 
 &lt;a class="anchor" href="#probabilistic-generative-classifiers">#&lt;/a>
 
&lt;/h2>
&lt;h2 id="bayesian-linear-regression">
 Bayesian Linear Regression
 
 &lt;a class="anchor" href="#bayesian-linear-regression">#&lt;/a>
 
&lt;/h2>
&lt;hr>
&lt;p>&lt;a href="https://arshadhs.github.io/">Home&lt;/a> | &lt;a href="https://arshadhs.github.io/docs/ai/machine-learning/">
 Machine Learning
&lt;/a>&lt;/p></description></item><item><title>Transformer</title><link>https://arshadhs.github.io/docs/ai/deep-learning/090-transformer/</link><pubDate>Mon, 15 Dec 2025 10:55:52 +0100</pubDate><guid>https://arshadhs.github.io/docs/ai/deep-learning/090-transformer/</guid><description>&lt;h1 id="transformer">
 Transformer
 
 &lt;a class="anchor" href="#transformer">#&lt;/a>
 
&lt;/h1>
&lt;ul>
&lt;li>
&lt;p>is an architecture of neural networks&lt;/p>
&lt;/li>
&lt;li>
&lt;p>based on the multi-head attention mechanism&lt;/p>
&lt;/li>
&lt;li>
&lt;p>text is split into units called tokens; each token is mapped to a numerical ID and then converted into a vector via lookup in a word-embedding table&lt;/p>
&lt;/li>
&lt;li>
&lt;p>takes a text sequence as input and produces another text sequence as output&lt;/p>
&lt;/li>
&lt;li>
&lt;p>foundation for modern &lt;strong>&lt;a href="https://arshadhs.github.io/docs/ai/genai/llm/">Large Language Models (LLMs)&lt;/a>&lt;/strong> like ChatGPT and Gemini&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Transformer architecture&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Model, Positionwise Feed-Forward Networks, Residual Connection and Layer Normalization&lt;/p></description></item><item><title>Ensemble Learning</title><link>https://arshadhs.github.io/docs/ai/machine-learning/09-ensemble-learning/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://arshadhs.github.io/docs/ai/machine-learning/09-ensemble-learning/</guid><description>&lt;h1 id="ensemble-learning">
 Ensemble Learning
 
 &lt;a class="anchor" href="#ensemble-learning">#&lt;/a>
 
&lt;/h1>
&lt;h2 id="combining-classifiers">
 Combining Classifiers
 
 &lt;a class="anchor" href="#combining-classifiers">#&lt;/a>
 
&lt;/h2>
&lt;h2 id="bagging">
 Bagging
 
 &lt;a class="anchor" href="#bagging">#&lt;/a>
 
&lt;/h2>
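&lt;p>Bagging = &lt;em>b&lt;/em>ootstrap &lt;em>agg&lt;/em>regating: train each base learner on a bootstrap resample, then combine by majority vote. The two building blocks, sketched in plain Python (base learners themselves omitted):&lt;/p>

```python
import random
from collections import Counter

def bootstrap(data, rng):
    """Sample n examples WITH replacement: each base learner gets its own view."""
    return [rng.choice(data) for _ in data]

def majority_vote(predictions):
    """Combine the base learners' outputs: the most common label wins."""
    return Counter(predictions).most_common(1)[0][0]

rng = random.Random(0)
data = list(range(10))
print(len(bootstrap(data, rng)))       # 10: same size, duplicates likely
print(majority_vote(["A", "B", "A"]))  # A
```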
&lt;h2 id="random-forest">
 Random Forest
 
 &lt;a class="anchor" href="#random-forest">#&lt;/a>
 
&lt;/h2>
&lt;h2 id="boosting">
 Boosting
 
 &lt;a class="anchor" href="#boosting">#&lt;/a>
 
&lt;/h2>
&lt;h3 id="adaboost">
 ADABoost
 
 &lt;a class="anchor" href="#adaboost">#&lt;/a>
 
&lt;/h3>
&lt;h3 id="gradient-boosting">
 Gradient Boosting
 
 &lt;a class="anchor" href="#gradient-boosting">#&lt;/a>
 
&lt;/h3>
&lt;h3 id="xgboost">
 XGBoost
 
 &lt;a class="anchor" href="#xgboost">#&lt;/a>
 
&lt;/h3>
&lt;hr>
&lt;p>&lt;a href="https://arshadhs.github.io/">Home&lt;/a> | &lt;a href="https://arshadhs.github.io/docs/ai/machine-learning/">
 Machine Learning
&lt;/a>&lt;/p></description></item><item><title>Optimisation of Deep models</title><link>https://arshadhs.github.io/docs/ai/deep-learning/100-optimise-deep-models/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://arshadhs.github.io/docs/ai/deep-learning/100-optimise-deep-models/</guid><description>&lt;h1 id="optimisation-of-deep-models">
 Optimisation of Deep models
 
 &lt;a class="anchor" href="#optimisation-of-deep-models">#&lt;/a>
 
&lt;/h1>
&lt;ul>
&lt;li>Goal of Optimization&lt;/li>
&lt;li>Optimization Challenges in Deep Learning&lt;/li>
&lt;li>Gradient Descent&lt;/li>
&lt;li>Stochastic Gradient Descent&lt;/li>
&lt;li>Minibatch Stochastic Gradient Descent&lt;/li>
&lt;li>Momentum&lt;/li>
&lt;li>Adagrad and its algorithm&lt;/li>
&lt;li>RMSProp and its algorithm&lt;/li>
&lt;li>Adadelta and its algorithm&lt;/li>
&lt;li>Adam and its algorithm&lt;/li>
&lt;li>Code Implementation and comparison of algorithms (webinar)&lt;/li>
&lt;/ul>
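&lt;p>The momentum update from the list above, sketched on a toy one-dimensional quadratic loss (the loss, learning rate, and decay factor are illustrative choices):&lt;/p>

```python
def grad(w):
    # gradient of the toy loss f(w) = (w - 3)^2, minimised at w = 3
    return 2.0 * (w - 3.0)

def sgd_momentum(w, steps=100, lr=0.1, beta=0.9):
    v = 0.0
    for _ in range(steps):
        v = beta * v + grad(w)  # velocity: exponentially decayed sum of gradients
        w = w - lr * v          # step along the smoothed direction
    return w

print(sgd_momentum(0.0))  # close to the minimum at w = 3
```

&lt;p>The velocity term is what lets momentum keep moving through shallow, noisy gradients where plain SGD stalls or oscillates.&lt;/p>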
&lt;hr>
&lt;h2 id="reference">
 Reference
 
 &lt;a class="anchor" href="#reference">#&lt;/a>
 
&lt;/h2>
&lt;ul>
&lt;li>&lt;strong>Dive into Deep Learning&lt;/strong>. Cambridge University Press. (Ch 12)&lt;/li>
&lt;/ul>
&lt;hr>
&lt;p>&lt;a href="https://arshadhs.github.io/">Home&lt;/a> | &lt;a href="https://arshadhs.github.io/docs/ai/deep-learning/">
 Deep Learning
&lt;/a>&lt;/p></description></item><item><title>Evaluation/Comparison</title><link>https://arshadhs.github.io/docs/ai/machine-learning/11-ml-model-evaluation-comparison/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://arshadhs.github.io/docs/ai/machine-learning/11-ml-model-evaluation-comparison/</guid><description>&lt;h1 id="machine-learning-model-evaluationcomparison">
 Machine Learning Model Evaluation/Comparison
 
 &lt;a class="anchor" href="#machine-learning-model-evaluationcomparison">#&lt;/a>
 
&lt;/h1>
&lt;h2 id="comparing-machine-learning-models">
 Comparing Machine Learning Models
 
 &lt;a class="anchor" href="#comparing-machine-learning-models">#&lt;/a>
 
&lt;/h2>
&lt;h2 id="emerging-requirements-eg-bias-fairness-interpretability-of-ml-models">
 Emerging requirements, e.g. bias, fairness, and interpretability of ML models
 
 &lt;a class="anchor" href="#emerging-requirements-eg-bias-fairness-interpretability-of-ml-models">#&lt;/a>
 
&lt;/h2>
&lt;hr>
&lt;p>&lt;a href="https://arshadhs.github.io/">Home&lt;/a> | &lt;a href="https://arshadhs.github.io/docs/ai/machine-learning/">
 Machine Learning
&lt;/a>&lt;/p></description></item><item><title>Regularisation for Deep models</title><link>https://arshadhs.github.io/docs/ai/deep-learning/110-regularisation-deep-models/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://arshadhs.github.io/docs/ai/deep-learning/110-regularisation-deep-models/</guid><description>&lt;h1 id="regularisation-for-deep-models">
 Regularisation for Deep models
 
 &lt;a class="anchor" href="#regularisation-for-deep-models">#&lt;/a>
 
&lt;/h1>
&lt;ul>
&lt;li>Generalization for regression&lt;/li>
&lt;li>Training Error and Generalization Error&lt;/li>
&lt;li>Underfitting or Overfitting&lt;/li>
&lt;li>Model Selection&lt;/li>
&lt;li>Weight Decay and Norms&lt;/li>
&lt;li>Generalization in Classification&lt;/li>
&lt;li>Environment and Distribution Shift&lt;/li>
&lt;li>Generalization in Deep Learning&lt;/li>
&lt;li>Dropout&lt;/li>
&lt;li>Batch Normalization&lt;/li>
&lt;li>Layer Normalization&lt;/li>
&lt;li>Code implementation (webinar)&lt;/li>
&lt;/ul>
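&lt;p>Dropout, from the list above, in its usual “inverted” form: during training each unit is zeroed with probability p and the survivors are rescaled by 1/(1−p), so the expected activation is unchanged and nothing special is needed at test time. A minimal sketch (activation values are made up):&lt;/p>

```python
import random

def dropout(activations, p, rng, train=True):
    """Inverted dropout: zero each unit with probability p during training,
    rescale the rest by 1/(1-p); identity at inference time."""
    if not train or p == 0.0:
        return list(activations)
    keep = 1.0 - p
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

rng = random.Random(42)
h = [1.0, 2.0, 3.0, 4.0]
print(dropout(h, 0.5, rng))               # some units zeroed, survivors doubled
print(dropout(h, 0.5, rng, train=False))  # unchanged at inference
```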
&lt;hr>
&lt;h2 id="reference">
 Reference
 
 &lt;a class="anchor" href="#reference">#&lt;/a>
 
&lt;/h2>
&lt;ul>
&lt;li>&lt;strong>Dive into Deep Learning&lt;/strong>. Cambridge University Press. (&lt;a href="https://d2l.ai/chapter_introduction/index.html">T1 – Ch 3.6, 3.7; Ch 4.6, 4.7; Ch 5.5, 5.6; Ch 8.5; Ch 11.7&lt;/a>)&lt;/li>
&lt;/ul>
&lt;hr>
&lt;p>&lt;a href="https://arshadhs.github.io/">Home&lt;/a> | &lt;a href="https://arshadhs.github.io/docs/ai/deep-learning/">
 Deep Learning
&lt;/a>&lt;/p></description></item><item><title>AI Learning Resources</title><link>https://arshadhs.github.io/docs/ai/foundation/ai-notes/</link><pubDate>Sat, 03 Jan 2026 12:00:00 +0100</pubDate><guid>https://arshadhs.github.io/docs/ai/foundation/ai-notes/</guid><description>&lt;h1 id="ai-learning-resources">
 AI Learning Resources
 
 &lt;a class="anchor" href="#ai-learning-resources">#&lt;/a>
 
&lt;/h1>
&lt;p>A curated list of &lt;strong>high-quality online courses&lt;/strong> to learn Artificial Intelligence, Machine Learning, and Deep Learning from reputable universities and organisations.&lt;/p>
&lt;hr>
&lt;h2 id="recommended-books--references">
 Recommended Books &amp;amp; References
 
 &lt;a class="anchor" href="#recommended-books--references">#&lt;/a>
 
&lt;/h2>
&lt;hr>
&lt;h3 id="deep-neural-networks-dnn">
 Deep Neural Networks (DNN)
 
 &lt;a class="anchor" href="#deep-neural-networks-dnn">#&lt;/a>
 
&lt;/h3>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>Deep Learning&lt;/strong>. MIT Press.&lt;br>
Goodfellow, I., Bengio, Y., &amp;amp; Courville, A. (2016).&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Introduction to Deep Learning&lt;/strong>. MIT Press.&lt;br>
Charniak, E. (2019).&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Deep Learning with Python&lt;/strong>. Simon &amp;amp; Schuster.&lt;br>
Chollet, F. (2021).&lt;/p></description></item></channel></rss>