
Probability Distributions #

Probability distributions are the bridge between real-world randomness and mathematical modelling.

A random experiment produces outcomes. A random variable turns those outcomes into numbers. A probability distribution tells you how likely each number (or range of numbers) is.

Key takeaway: A distribution is a complete “story” about uncertainty: what values are possible, how likely they are, and how we summarise them (mean, variance).
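
Below is a minimal sketch of this “story” in code, assuming only NumPy: the possible values of a fair die, how likely each is, and the summary quantities (mean, variance).

```python
# A distribution as a "story": possible values, their probabilities,
# and the summaries. The fair-die example is illustrative.
import numpy as np

values = np.arange(1, 7)          # possible outcomes of a fair die
probs = np.full(6, 1 / 6)         # probability of each outcome

mean = np.sum(values * probs)                           # E[X]
variance = np.sum((values - mean) ** 2 * probs)         # Var(X) = E[(X - E[X])^2]
print(f"mean = {mean:.2f}, variance = {variance:.2f}")  # 3.50, 2.92
```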


flowchart TD
	PD["Probability<br/>distributions"] --> RV["Random<br/>variables"]
	PD["Probability<br/>distributions"] --> DS["Common<br/>distributions"]

	style PD fill:#90CAF9,stroke:#1E88E5,color:#000
	style RV fill:#90CAF9,stroke:#1E88E5,color:#000
	style DS fill:#90CAF9,stroke:#1E88E5,color:#000

AI/ML Connection #

  • Many ML models are probabilistic: they assume data (or errors) follow a distribution.
  • Loss functions often come from distribution assumptions: squared loss aligns with Gaussian noise.
  • Naïve Bayes (from the previous module) becomes practical once you can model \( P(X\mid Y) \) using suitable distributions.

In practice: choosing a distribution is a modelling decision. It affects prediction, uncertainty estimates, and what “rare” or “typical” means in your data.
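
To make the squared-loss/Gaussian link above concrete, here is a small numerical check (a sketch, assuming NumPy and SciPy; the data is synthetic): under noise \( \varepsilon \sim \mathcal{N}(0, \sigma^2) \), the negative log-likelihood of the residuals is a constant plus the sum of squared errors divided by \( 2\sigma^2 \), so minimising squared loss is maximising Gaussian likelihood.

```python
# Check: Gaussian NLL = constant + SSE / (2 sigma^2) for fixed sigma.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, sigma = 100, 0.5
y_true = rng.normal(size=n)
y_pred = y_true + rng.normal(scale=sigma, size=n)   # predictions with Gaussian error
residuals = y_true - y_pred

nll = -stats.norm.logpdf(residuals, scale=sigma).sum()
sse_term = (residuals ** 2).sum() / (2 * sigma ** 2)
const = n * np.log(sigma * np.sqrt(2 * np.pi))
print(np.isclose(nll, sse_term + const))            # True
```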


Generative AI #

Generative Artificial Intelligence (GenAI) refers to a class of AI systems that can generate new content such as text, images, audio, video, or code, rather than only making predictions or classifications.

GenAI systems learn patterns and representations from large datasets and use them to produce novel outputs that resemble the data they were trained on.


How Generative AI Differs from Traditional AI #

| Traditional AI | Generative AI |
| --- | --- |
| Predicts or classifies | Generates new content |
| Task-specific models | General-purpose models |
| Fixed outputs | Open-ended outputs |
| Often rule-based | Data-driven and probabilistic |

Core Idea of Generative AI #

Instead of learning “what label to assign”, Generative AI learns “how data is structured” and then creates new data following that structure.
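
As a toy illustration of that idea (an assumption-level sketch, nothing like a production GenAI system): “learn” the structure of a 1-D dataset by estimating its mean and standard deviation, then “generate” new samples from the fitted distribution.

```python
# Toy generative model: fit a Gaussian to data, then sample novel points
# that resemble (but are not copies of) the training data.
import numpy as np

rng = np.random.default_rng(42)
data = rng.normal(loc=170, scale=8, size=1000)   # "training data", e.g. heights in cm

mu, sigma = data.mean(), data.std()              # learn the structure
new_samples = rng.normal(mu, sigma, size=5)      # generate new content
print(new_samples)
```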


AI Pipeline #

The AI pipeline is a continuous process where data is collected, prepared, used to train models, evaluated for performance, and continuously improved after deployment.

  1. Collect Data #

  2. Prepare Data #

  3. Train Model #

    • Iterate until the model is good enough
  4. Deploy Model #

    • Get data back
    • Maintain & update the model (see the code sketch after the timeline below)
timeline
    title AI Pipeline
    Collect Data : Data Ingestion
                 : Data Understanding
    Prepare Data : Cleaning
                 : Feature Engineering
                 : Sampling
    Train Model  : Model Training
                 : Validation & Metrics
    Deploy Model : Deployment
                 : Monitoring & Retraining
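
A minimal code sketch of the four stages, assuming scikit-learn is available (the dataset and steps are illustrative, not prescribed by the pipeline):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# 1. Collect data (synthetic stand-in for ingestion)
X, y = make_regression(n_samples=200, n_features=3, noise=10, random_state=0)

# 2. Prepare data: train/test split; scaling happens inside the pipeline
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# 3. Train model (in practice, iterate until it is good enough)
model = make_pipeline(StandardScaler(), LinearRegression())
model.fit(X_train, y_train)

# 4. "Deploy": evaluate on held-out data; in production, monitor and retrain
print("test MSE:", mean_squared_error(y_test, model.predict(X_test)))
```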


Linear Regression #

Linear Regression is a supervised ML method used to predict a numerical target by fitting a model that is linear in its parameters.

In ML, linear models are a core baseline: they’re fast, often surprisingly strong, and usually easy to interpret.

Key takeaway: Linear Regression learns parameters by minimising a squared-error cost. You can solve it directly (closed form) or iteratively (gradient descent), and you can extend it using basis functions and regularisation.
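
As a quick baseline sketch (assuming NumPy and a single feature; the data is synthetic), fit a line and read off its interpretable parameters:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, size=50)
y = 3.0 * x + 2.0 + rng.normal(scale=1.0, size=50)  # true slope 3, intercept 2

# Least-squares fit of a model linear in its parameters
slope, intercept = np.polyfit(x, y, deg=1)
print(f"slope ≈ {slope:.2f}, intercept ≈ {intercept:.2f}")
```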


Random Variables #

A random variable is a way to attach numbers to outcomes of a random experiment.

It lets us move from “what happened?” to “what number should we analyse?”

Key takeaway: A random variable is a function from the sample space to real numbers. Once you define the random variable clearly, the rest (pmf/pdf/cdf, mean, variance) becomes systematic.


flowchart TD
PD["Probability<br/>distributions"] --> RV["Random<br/>variables"]

RV --> T["Types"]
T --> RV1["Discrete<br/>RVs"]
T --> RV2["Continuous<br/>RVs"]

RV --> F["PMF / PDF / CDF"]
RV --> S["Mean / Variance<br/>Covariance"]
RV --> J["Joint & Marginal<br/>distributions"]
RV --> X["Transformations"]

style PD fill:#90CAF9,stroke:#1E88E5,color:#000
style RV fill:#90CAF9,stroke:#1E88E5,color:#000

style T fill:#CE93D8,stroke:#8E24AA,color:#000
style F fill:#CE93D8,stroke:#8E24AA,color:#000
style S fill:#CE93D8,stroke:#8E24AA,color:#000
style J fill:#CE93D8,stroke:#8E24AA,color:#000
style X fill:#CE93D8,stroke:#8E24AA,color:#000
style RV1 fill:#CE93D8,stroke:#8E24AA,color:#000
style RV2 fill:#CE93D8,stroke:#8E24AA,color:#000

1) Definition #

Random variable: a rule that assigns a number to each outcome.
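
A small sketch of the definition (plain Python only; the two-coin example is illustrative): X maps each outcome of two coin tosses to the number of heads, and the pmf, mean, and variance then follow systematically.

```python
from fractions import Fraction
from itertools import product

sample_space = list(product("HT", repeat=2))     # HH, HT, TH, TT
X = {outcome: outcome.count("H") for outcome in sample_space}  # RV as a function

# pmf P(X = k), assuming equally likely outcomes
pmf = {}
for outcome, k in X.items():
    pmf[k] = pmf.get(k, Fraction(0)) + Fraction(1, len(sample_space))

mean = sum(k * p for k, p in pmf.items())                    # E[X] = 1
variance = sum((k - mean) ** 2 * p for k, p in pmf.items())  # Var(X) = 1/2
print(pmf, mean, variance)
```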


Common Probability Distributions #

Once you can describe a random variable using a pmf or pdf, the next step is to use named distributions that appear repeatedly in real data and in ML models.

Key takeaway: Named distributions give you ready-made probability models for common patterns: binary outcomes, counts, and measurement noise.


flowchart TD
PD["Probability<br/>distributions"] --> DS["Common<br/>distributions"]

DS --> DIS["Discrete"]
DS --> CON["Continuous"]

DIS --> D1["Bernoulli"]
DIS --> D2["Binomial"]
DIS --> D3["Poisson"]

CON --> D4["Normal<br/>(Gaussian)"]
CON --> D5["t / Chi-square / F<br/>(intro)"]

style PD fill:#90CAF9,stroke:#1E88E5,color:#000
style DS fill:#90CAF9,stroke:#1E88E5,color:#000

style DIS fill:#CE93D8,stroke:#8E24AA,color:#000
style CON fill:#CE93D8,stroke:#8E24AA,color:#000

style D1 fill:#C8E6C9,stroke:#2E7D32,color:#000
style D2 fill:#C8E6C9,stroke:#2E7D32,color:#000
style D3 fill:#C8E6C9,stroke:#2E7D32,color:#000
style D4 fill:#C8E6C9,stroke:#2E7D32,color:#000
style D5 fill:#C8E6C9,stroke:#2E7D32,color:#000

1) Bernoulli distribution (binary) #

Use when: one trial has two outcomes (success/failure).
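
A minimal sketch with scipy.stats (assumed available; the success probability is illustrative):

```python
from scipy import stats

p = 0.3                                  # illustrative success probability
X = stats.bernoulli(p)

print(X.pmf(1), X.pmf(0))                # P(X=1) = 0.3, P(X=0) = 0.7
print(X.mean(), X.var())                 # mean = p, variance = p(1 - p)
print(X.rvs(size=10, random_state=0))    # ten simulated trials
```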


Direct solution method - Ordinary Least Squares and the Line of Best Fit #

It is possible to compute the best parameters for linear regression in one shot (closed form), instead of improving them iteratively step by step.

For linear regression, the direct method is usually Ordinary Least Squares (OLS).

Ordinary Least Squares (OLS) chooses the “best” line by minimising squared prediction errors.

Key takeaway: OLS defines “best fit” as the line that minimises the total squared residual error across all data points.
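
A minimal closed-form sketch (NumPy assumed; the data is synthetic): the OLS solution to the normal equations \( w = (X^\top X)^{-1} X^\top y \), computed with a least-squares solver rather than an explicit matrix inverse.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 5, size=100)
y = 4.0 * x + 1.0 + rng.normal(scale=0.5, size=100)

X = np.column_stack([np.ones_like(x), x])    # design matrix with intercept column
w, *_ = np.linalg.lstsq(X, y, rcond=None)    # one-shot least-squares solution
print(f"intercept ≈ {w[0]:.2f}, slope ≈ {w[1]:.2f}")
```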


Cost Function #

  • also known as an objective function

  • quantifies the error between a model’s predicted values and the actual values

  • measures the model’s error over a group of data points

  • used to evaluate how accurate a model’s predictions are

  • for linear regression, the best-fit line is the line whose cost (total squared error) is lowest (see the sketch below)
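
A minimal sketch of one common cost function, mean squared error (the function name is illustrative):

```python
import numpy as np

def mse(y_true, y_pred):
    """Average squared difference between actual and predicted values."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return np.mean((y_true - y_pred) ** 2)

print(mse([3.0, 5.0, 7.0], [2.5, 5.0, 8.0]))   # ~0.4167: lower is better
```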


Gradient Descent for Linear Regression #

Gradient descent is an iterative optimisation method used to minimise the regression cost function by repeatedly updating parameters in the direction that reduces error.

  • Iterative method
  • Types: batch / stochastic / mini-batch

Key takeaway: Gradient descent starts with initial parameter values and repeatedly updates them using the gradient until the cost stops decreasing.
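
A minimal batch gradient-descent sketch for 1-D linear regression (NumPy assumed; the learning rate and iteration count are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 2, size=100)
y = 3.0 * x + 1.0 + rng.normal(scale=0.1, size=100)

w, b, lr = 0.0, 0.0, 0.1                     # initial parameters, learning rate
for _ in range(500):
    y_hat = w * x + b
    dw = 2 * np.mean((y_hat - y) * x)        # gradient of MSE w.r.t. w
    db = 2 * np.mean(y_hat - y)              # gradient of MSE w.r.t. b
    w -= lr * dw                             # step against the gradient
    b -= lr * db

print(f"w ≈ {w:.2f}, b ≈ {b:.2f}")           # should approach 3 and 1
```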

flowchart TD
GD["Gradient<br/>Descent"] -->|minimises| CF["Cost<br/>function"]
GD -->|updates| W["Parameters<br/>(weights)"]
GD -->|uses| GR["Gradient<br/>(slope)"]

GD --> H["Hyperparameters"]
H --> LR["Learning<br/>rate"]
H --> BS["Batch<br/>size"]
H --> EP["Epochs"]

style GD fill:#90CAF9,stroke:#1E88E5,color:#000

style CF fill:#CE93D8,stroke:#8E24AA,color:#000
style W fill:#CE93D8,stroke:#8E24AA,color:#000
style GR fill:#CE93D8,stroke:#8E24AA,color:#000
style H fill:#CE93D8,stroke:#8E24AA,color:#000
style LR fill:#CE93D8,stroke:#8E24AA,color:#000
style BS fill:#CE93D8,stroke:#8E24AA,color:#000
style EP fill:#CE93D8,stroke:#8E24AA,color:#000

Types of GD #

flowchart TD
T["Gradient Descent<br/>types"] --> BGD["Batch<br/>GD"]
T --> SGD["Stochastic<br/>GD"]
T --> MGD["Mini-batch<br/>GD"]

BGD --> ALL["All data<br/>per step"]
BGD --> STB["Smooth<br/>updates"]

SGD --> ONE["1 sample<br/>per step"]
SGD --> FAST["Quick<br/>progress"]
SGD --> NOISE["Noisy<br/>updates"]

MGD --> MB["Small batch<br/>per step"]
MGD --> PRACT["Practical<br/>default"]

style T fill:#90CAF9,stroke:#1E88E5,color:#000

style BGD fill:#C8E6C9,stroke:#2E7D32,color:#000
style SGD fill:#C8E6C9,stroke:#2E7D32,color:#000
style MGD fill:#C8E6C9,stroke:#2E7D32,color:#000

style ALL fill:#CE93D8,stroke:#8E24AA,color:#000
style STB fill:#CE93D8,stroke:#8E24AA,color:#000
style ONE fill:#CE93D8,stroke:#8E24AA,color:#000
style FAST fill:#CE93D8,stroke:#8E24AA,color:#000
style NOISE fill:#CE93D8,stroke:#8E24AA,color:#000
style MB fill:#CE93D8,stroke:#8E24AA,color:#000
style PRACT fill:#CE93D8,stroke:#8E24AA,color:#000

Batch #

  • Use only if you have huge compute and a lot of time to train

SGD #

  • the go-to choice in practice, typically run with mini-batches (see the sketch below)
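
A mini-batch sketch (the practical default in the diagram above; batch size, learning rate, and epoch count are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 2, size=1000)
y = 3.0 * x + 1.0 + rng.normal(scale=0.1, size=1000)

w, b, lr, batch = 0.0, 0.0, 0.05, 32
for epoch in range(20):
    idx = rng.permutation(len(x))            # reshuffle each epoch
    for start in range(0, len(x), batch):
        sl = idx[start:start + batch]        # one small batch per update
        y_hat = w * x[sl] + b
        w -= lr * 2 * np.mean((y_hat - y[sl]) * x[sl])
        b -= lr * 2 * np.mean(y_hat - y[sl])

print(f"w ≈ {w:.2f}, b ≈ {b:.2f}")           # should approach 3 and 1
```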


Deep Learning #

  • Subset of ML
  • Focuses on algorithms inspired by the structure and function of the brain, called Artificial Neural Networks (ANNs).
  • A neural network with multiple hidden layers and multiple nodes in each hidden layer is known as a deep learning system or a deep neural network.
  • Allows systems to automatically learn hierarchical representations (features) from raw input, such as images, sound, or text.

Operational Steps for Neural Architectures #

| Step | Perceptron (Boolean/Logic) | Linear Regression Network | Binary Classification (Logistic) | DFNN / MLP (Classification) |
| --- | --- | --- | --- | --- |
| 1. Input | Take binary or discrete inputs \( x_1, \dots, x_n \) | Take numerical features \( x \) | Take numerical features \( x \) | Take high-dimensional numerical or categorical features |
| 2. Weighted Sum | Single calculation: \( z = \sum (w_i x_i) + b \) | Single calculation: \( \hat{y} = w_0 + w_1 x \) | Single calculation: \( z = W x + b \) | Multiple stages: \( z^{[l]} = W^{[l]} a^{[l-1]} + b^{[l]} \) for each layer \( l \) |
| 3. Activation | Step function: output 1 if \( z \geq 0 \), else 0 | Identity: the output remains \( z \) (no non-linear change) | Sigmoid: maps \( z \) to a probability between 0 and 1 | ReLU for hidden layers; Softmax/Sigmoid for the output layer |
| 4. Loss / Error | Error = Target − Output | Mean Squared Error (MSE): \( J = \frac{1}{2N} \sum (Y - \hat{y})^2 \) | Binary Cross-Entropy (BCE): penalises based on probability distance | BCE or Categorical Cross-Entropy for multiple classes |
| 5. Optimisation | Update weights only on misclassification | Gradient Descent: compute gradients and update weights each iteration | Backpropagation: compute error signals \( \delta \) and gradients \( dW \) | Backpropagation: recursive chain rule to update all hidden layer weights |
| 6. Output | Discrete Boolean value (0 or 1) | Continuous numerical value (e.g., house prices) | Single probability score or class label | A vector of probabilities for multiple classes |
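
A sketch of steps 2 and 3 from the table for a tiny network (NumPy only; shapes and random weights are illustrative): per-layer weighted sums, ReLU in the hidden layer, sigmoid at the output.

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

rng = np.random.default_rng(0)
x = rng.normal(size=3)                           # input features

W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)    # hidden layer: z = W x + b
W2, b2 = rng.normal(size=(1, 4)), np.zeros(1)    # output layer

a1 = relu(W1 @ x + b1)                           # hidden activations
y_hat = sigmoid(W2 @ a1 + b2)                    # probability for the positive class
print(y_hat)
```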


flowchart LR
    %% Input Layer
    subgraph subGraph0["Input Layer"]
        I1(("Input 1"))
        I2(("Input 2"))
        I3(("Input 3"))
    end

    %% Hidden Layers
    subgraph subGraph1["Hidden Layer 1"]
        H1a(("H1-1"))
        H1b(("H1-2"))
        H1c(("H1-3"))
    end

    subgraph subGraph2["Hidden Layer 2"]
        H2a(("H2-1"))
        H2b(("H2-2"))
        H2c(("H2-3"))
    end

    subgraph subGraph3["Hidden Layer 3"]
        H3a(("H3-1"))
        H3b(("H3-2"))
        H3c(("H3-3"))
    end

    %% Output Layer
    subgraph subGraph4["Output Layer"]
        O(("Output"))
    end

    %% Connections: Input to Hidden Layer 1
    I1 --> H1a & H1b & H1c
    I2 --> H1a & H1b & H1c
    I3 --> H1a & H1b & H1c

    %% Connections: Hidden Layer 1 to Hidden Layer 2
    H1a --> H2a & H2b & H2c
    H1b --> H2a & H2b & H2c
    H1c --> H2a & H2b & H2c

    %% Connections: Hidden Layer 2 to Hidden Layer 3
    H2a --> H3a & H3b & H3c
    H2b --> H3a & H3b & H3c
    H2c --> H3a & H3b & H3c

    %% Connections: Hidden Layer 3 to Output
    H3a --> O
    H3b --> O
    H3c --> O

    %% Styling
    style I1 fill:#C8E6C9
    style I2 fill:#C8E6C9
    style I3 fill:#C8E6C9
    style H1a fill:#BBDEFB
    style H1b fill:#BBDEFB
    style H1c fill:#BBDEFB
    style H2a fill:#90CAF9
    style H2b fill:#90CAF9
    style H2c fill:#90CAF9
    style H3a fill:#64B5F6
    style H3b fill:#64B5F6
    style H3c fill:#64B5F6
    style O fill:#FFCDD2
    style subGraph0 stroke:none,fill:transparent
    style subGraph1 stroke:none,fill:transparent
    style subGraph2 stroke:none,fill:transparent
    style subGraph3 stroke:none,fill:transparent
    style subGraph4 stroke:none,fill:transparent

Types of Neural Networks #

  • Standard NN - small networks for smaller, simpler data (e.g., real estate prices)
  • CNN - Convolutional - used for images (e.g., photo tagging, object detection)
  • RNN - Recurrent - used for sequential data such as text and audio (e.g., speech recognition, translation)
  • Hybrid NN - combinations of the above (e.g., autonomous driving)

Components of DL #

  • Data
  • Learning Algorithm: how to transform the data
  • Loss Function: an objective function that quantifies how well the model is doing; the lower the loss, the better the model. It quantifies how well or badly the model is learning.
  • Optimisation Algorithm: searches for the best possible parameters to minimise the loss function. Popular optimisation algorithms for deep learning are based on an approach called gradient descent.
  • Model (the sketch below ties these five components together)
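
A sketch mapping the five components onto one tiny training loop (NumPy only; everything here is an illustrative stand-in, not a real deep learning stack):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=200)                       # Data (one feature)
y = 2.0 * X + rng.normal(scale=0.1, size=200)

w, b = 0.0, 0.0                                # Model: a single linear unit
for _ in range(300):                           # Learning algorithm: iterate
    y_hat = w * X + b
    loss = np.mean((y_hat - y) ** 2)           # Loss function: MSE, lower is better
    dw = 2 * np.mean((y_hat - y) * X)          # Optimisation: gradient descent
    db = 2 * np.mean(y_hat - y)
    w, b = w - 0.1 * dw, b - 0.1 * db

print(f"final loss ≈ {loss:.4f}, w ≈ {w:.2f}")
```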


Applications #

  • Computer Vision (e.g., face detection, medical imaging)
  • Natural Language Processing (e.g., ChatGPT, translation)
  • Self-Driving Cars
  • Speech Assistants (e.g., Alexa, Siri)

Intuition #

Deep Learning is the methodology; a DNN is a model.