Probability distributions are the bridge between real-world randomness and mathematical modelling.
A random experiment produces outcomes.
A random variable turns those outcomes into numbers.
A probability distribution tells you how likely each number (or range of numbers) is.
Key takeaway:
A distribution is a complete “story” about uncertainty:
what values are possible, how likely they are, and how we summarise them (mean, variance).
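As a concrete illustration (a minimal sketch, assuming SciPy is available), the code below reads that full "story" off a Binomial(10, 0.3) distribution: its possible values, how likely each is, and its mean and variance.

```python
# A minimal sketch: a Binomial(n=10, p=0.3) distribution as a
# complete "story" about uncertainty.
import numpy as np
from scipy.stats import binom

n, p = 10, 0.3
k = np.arange(n + 1)               # possible values: 0..10 successes

pmf = binom.pmf(k, n, p)           # how likely each value is
print(dict(zip(k.tolist(), pmf.round(3).tolist())))

print("mean:", binom.mean(n, p))   # n * p = 3.0
print("variance:", binom.var(n, p))  # n * p * (1 - p) = 2.1
```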
Many ML models are probabilistic:
they assume data (or errors) follow a distribution.
Loss functions often come from distribution assumptions:
squared loss aligns with Gaussian noise (a short derivation follows below).
Naïve Bayes (from the previous module) becomes practical once you can model:
\( P(X\mid Y) \)
using suitable distributions.
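To make the squared-loss link concrete, here is the standard one-line derivation (a sketch, assuming additive Gaussian noise \( y = \hat{y} + \varepsilon \) with \( \varepsilon \sim \mathcal{N}(0, \sigma^2) \)):

\[
-\log p(y \mid \hat{y}) = \frac{(y - \hat{y})^2}{2\sigma^2} + \frac{1}{2}\log\left(2\pi\sigma^2\right),
\]

so maximising the Gaussian likelihood is the same as minimising squared error, up to constants that do not depend on the parameters.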
In practice:
choosing a distribution is a modelling decision.
It affects:
prediction, uncertainty estimates, and what “rare” or “typical” means in your data.
Generative Artificial Intelligence (GenAI) refers to a class of AI systems that can generate new content such as text, images, audio, video, or code, rather than only making predictions or classifications.
GenAI systems learn patterns and representations from large datasets and use them to produce novel outputs that resemble the data they were trained on.
The AI pipeline is a cyclical process: data is collected, prepared, and used to train models, which are then evaluated for performance and monitored and improved after deployment.
timeline
title AI Pipeline
Collect Data : Data Ingestion
: Data Understanding
Prepare Data : Cleaning
: Feature Engineering
: Sampling
Train Model : Model Training
: Validation & Metrics
Deploy Model : Deployment
: Monitoring & Retraining
Linear Regression is a supervised ML method used to predict a numerical target by fitting a model that is linear in its parameters.
In ML, linear models are a core baseline:
they’re fast, often surprisingly strong, and usually easy to interpret.
Key takeaway:
Linear Regression learns parameters by minimising a squared-error cost.
You can solve it directly (closed form) or iteratively (gradient descent),
and you can extend it using basis functions and regularisation.
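As a minimal sketch of those extensions (the sine target and all values here are illustrative, not from the notes): polynomial basis functions plus L2 (ridge) regularisation, solved in closed form as \( w = (\Phi^\top \Phi + \lambda I)^{-1} \Phi^\top y \).

```python
# A minimal sketch: linear regression extended with polynomial basis
# functions and L2 (ridge) regularisation, solved in closed form.
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=50)
y = np.sin(np.pi * x) + 0.1 * rng.standard_normal(50)  # noisy non-linear target

degree, lam = 5, 1e-3
Phi = np.vander(x, degree + 1, increasing=True)         # basis: 1, x, x^2, ...

# Regularised normal equations (ridge regression)
w = np.linalg.solve(Phi.T @ Phi + lam * np.eye(degree + 1), Phi.T @ y)
print("weights:", w.round(3))
```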
A random variable is a way to attach numbers to outcomes of a random experiment.
It lets us move from:
“what happened?”
to:
“what number should we analyse?”
Key takeaway:
A random variable is a function from the sample space to real numbers.
Once you define the random variable clearly, the rest (pmf/pdf/cdf, mean, variance) becomes systematic.
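A minimal sketch (assuming two tosses of a fair coin): define the random variable X = number of heads, then derive its pmf, mean, and variance systematically.

```python
# A minimal sketch: a random variable is a function from the sample
# space to real numbers.
from itertools import product
from collections import Counter

sample_space = list(product("HT", repeat=2))     # {HH, HT, TH, TT}

def X(outcome):                                  # RV: number of heads
    return outcome.count("H")

# Each outcome has probability 1/4; collect P(X = x) into a pmf.
counts = Counter(X(w) for w in sample_space)
pmf = {x: c / len(sample_space) for x, c in counts.items()}
print("pmf:", pmf)                               # {0: 0.25, 1: 0.5, 2: 0.25}

mean = sum(x * p for x, p in pmf.items())
var = sum((x - mean) ** 2 * p for x, p in pmf.items())
print("mean:", mean, "variance:", var)           # 1.0, 0.5
```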
flowchart TD
PD["Probability<br/>distributions"] --> RV["Random<br/>variables"]
RV --> T["Types"]
T --> RV1["Discrete<br/>RVs"]
T --> RV2["Continuous<br/>RVs"]
RV --> F["PMF / PDF / CDF"]
RV --> S["Mean / Variance<br/>Covariance"]
RV --> J["Joint & Marginal<br/>distributions"]
RV --> X["Transformations"]
style PD fill:#90CAF9,stroke:#1E88E5,color:#000
style RV fill:#90CAF9,stroke:#1E88E5,color:#000
style T fill:#CE93D8,stroke:#8E24AA,color:#000
style F fill:#CE93D8,stroke:#8E24AA,color:#000
style S fill:#CE93D8,stroke:#8E24AA,color:#000
style J fill:#CE93D8,stroke:#8E24AA,color:#000
style X fill:#CE93D8,stroke:#8E24AA,color:#000
style RV1 fill:#CE93D8,stroke:#8E24AA,color:#000
style RV2 fill:#CE93D8,stroke:#8E24AA,color:#000
Once you can describe a random variable using a pmf or pdf, the next step is to use
named distributions that appear repeatedly in real data and in ML models.
Key takeaway:
Named distributions give you ready-made probability models for common patterns:
binary outcomes, counts, and measurement noise.
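A minimal sketch (assuming SciPy) of three named distributions matched to those common patterns:

```python
# Named distributions for common patterns: binary outcomes, counts,
# and measurement noise. (Parameter values are illustrative.)
from scipy.stats import bernoulli, poisson, norm

print(bernoulli.pmf(1, p=0.7))         # P(success) for a binary outcome
print(poisson.pmf(3, mu=2.0))          # P(3 events) for count data
print(norm.pdf(0.5, loc=0, scale=1))   # density of Gaussian measurement noise
```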
Direct solution method - Ordinary Least Squares and the Line of Best Fit
It is possible to compute the best parameters for linear regression in one shot (closed-form),
instead of iteratively improving them step by step.
For linear regression, the direct method is usually Ordinary Least Squares (OLS).
Ordinary Least Squares (OLS) chooses the “best” line by minimising squared prediction errors.
Key takeaway:
OLS defines “best fit” as the line that minimises the total squared residual error across all data points.
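A minimal sketch of the "one shot" fit (not the official course code): OLS via the normal equations, \( w = (X^\top X)^{-1} X^\top y \), computed with a numerically stable solver.

```python
# OLS in closed form: solve the normal equations for intercept and slope.
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, size=100)
y = 2.0 + 0.5 * x + rng.normal(0, 1, size=100)   # true line + Gaussian noise

X = np.column_stack([np.ones_like(x), x])        # design matrix with intercept
w = np.linalg.solve(X.T @ X, X.T @ y)            # closed-form "one shot" fit
print("intercept, slope:", w.round(3))           # approx [2.0, 0.5]

residuals = y - X @ w
print("total squared error:", (residuals ** 2).sum().round(3))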
Gradient descent is an iterative optimisation method used to minimise the regression cost function by repeatedly updating parameters in the direction that reduces error.
Iterative method
Types: batch / stochastic / mini-batch
Key takeaway:
Gradient descent starts with initial parameter values and repeatedly updates them using the gradient until the cost stops decreasing.
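A minimal sketch of the batch variant (learning rate and epoch count chosen for illustration): start from initial parameters and repeatedly step against the gradient of the mean-squared-error cost.

```python
# Batch gradient descent on the MSE cost for linear regression.
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(0, 10, size=100)
y = 2.0 + 0.5 * x + rng.normal(0, 1, size=100)
X = np.column_stack([np.ones_like(x), x])

w = np.zeros(2)                    # initial parameter values
lr, epochs = 0.02, 2000            # hyperparameters: learning rate, epochs
for _ in range(epochs):
    grad = 2 / len(y) * X.T @ (X @ w - y)   # gradient of the MSE cost
    w -= lr * grad                 # step in the direction that reduces error
print("learned parameters:", w.round(3))    # approx [2.0, 0.5]
```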
flowchart TD
GD["Gradient<br/>Descent"] -->|minimises| CF["Cost<br/>function"]
GD -->|updates| W["Parameters<br/>(weights)"]
GD -->|uses| GR["Gradient<br/>(slope)"]
GD --> H["Hyperparameters"]
H --> LR["Learning<br/>rate"]
H --> BS["Batch<br/>size"]
H --> EP["Epochs"]
style GD fill:#90CAF9,stroke:#1E88E5,color:#000
style CF fill:#CE93D8,stroke:#8E24AA,color:#000
style W fill:#CE93D8,stroke:#8E24AA,color:#000
style GR fill:#CE93D8,stroke:#8E24AA,color:#000
style H fill:#CE93D8,stroke:#8E24AA,color:#000
style LR fill:#CE93D8,stroke:#8E24AA,color:#000
style BS fill:#CE93D8,stroke:#8E24AA,color:#000
style EP fill:#CE93D8,stroke:#8E24AA,color:#000
Loss function: an objective function that quantifies how well the model is doing; the lower the loss, the better the model. In other words, the loss turns “how well (or badly) is the model learning?” into a single number.
Optimisation algorithm: the learning algorithm adjusts the model’s parameters to reduce the loss, searching for the best possible parameter values that minimise it. Popular optimisation algorithms for deep learning are based on an approach called gradient descent.
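To make the two roles concrete, here is a minimal sketch (all values illustrative) pairing an MSE loss with a mini-batch stochastic gradient descent optimiser:

```python
# The loss quantifies how badly the model is doing; the optimiser
# (mini-batch SGD) searches for parameters that make it smaller.
import numpy as np

rng = np.random.default_rng(3)
x = rng.uniform(0, 1, size=200)
y = 3.0 * x + 1.0 + 0.05 * rng.standard_normal(200)
X = np.column_stack([np.ones_like(x), x])

def mse_loss(w):                                 # lower is better
    return np.mean((X @ w - y) ** 2)

w = np.zeros(2)
for epoch in range(200):
    idx = rng.permutation(len(y))
    for batch in np.array_split(idx, 10):        # mini-batches of ~20 points
        Xb, yb = X[batch], y[batch]
        grad = 2 / len(yb) * Xb.T @ (Xb @ w - yb)
        w -= 0.1 * grad                          # learning rate 0.1
print("final loss:", round(mse_loss(w), 4), "params:", w.round(2))
```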