AI

Generative AI

Generative AI #

Generative Artificial Intelligence (GenAI) refers to a class of AI systems that can generate new content such as text, images, audio, video, or code, rather than only making predictions or classifications.

GenAI systems learn patterns and representations from large datasets and use them to produce novel outputs that resemble the data they were trained on.


How Generative AI Differs from Traditional AI #

| Traditional AI | Generative AI |
| --- | --- |
| Predicts or classifies | Generates new content |
| Task-specific models | General-purpose models |
| Fixed outputs | Open-ended outputs |
| Often rule-based | Data-driven and probabilistic |

Core Idea of Generative AI #

Instead of learning “what label to assign”, Generative AI learns “how data is structured” and then creates new data following that structure.
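
As a toy illustration of "learning how data is structured", the sketch below fits a character-level bigram model and samples new strings from it. It is a deliberately tiny stand-in, not how modern GenAI systems are built; the word list, the `^`/`$` boundary markers, and the function names are all assumptions made for this example.

```python
import random
from collections import defaultdict

def fit_bigram(words):
    """Record which character follows which; '^' marks start, '$' marks end."""
    follows = defaultdict(list)
    for word in words:
        chars = ["^"] + list(word) + ["$"]
        for prev, nxt in zip(chars, chars[1:]):
            follows[prev].append(nxt)
    return follows

def sample_word(follows, rng, max_len=20):
    """Generate a new string by repeatedly sampling the next character."""
    out, prev = [], "^"
    while len(out) < max_len:
        nxt = rng.choice(follows[prev])
        if nxt == "$":
            break
        out.append(nxt)
        prev = nxt
    return "".join(out)

training_words = ["data", "date", "gate", "late", "rate"]  # toy data (assumed)
model = fit_bigram(training_words)
rng = random.Random(0)  # fixed seed so runs are repeatable
samples = [sample_word(model, rng) for _ in range(5)]
print(samples)
```

The generated strings need not appear in the training list, but every character transition in them was seen during training. That is the "structure" this small model has learned.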

AI Pipeline

AI Pipeline #

The AI pipeline is a continuous process: data is collected and prepared, models are trained on it and evaluated for performance, and the deployed model is monitored and improved over time.

  1. Collect Data #

  2. Prepare Data #

  3. Train Model #

    • Iterate until model is good enough
  4. Deploy Model #

    • Get data back
    • Maintain & update model
timeline
    title AI Pipeline
    Collect Data : Data Ingestion
                 : Data Understanding
    Prepare Data : Cleaning
                 : Feature Engineering
                 : Sampling
    Train Model  : Model Training
                 : Validation & Metrics
    Deploy Model : Deployment
                 : Monitoring & Retraining
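
The stages in the timeline can be sketched end to end on synthetic data. This is a minimal sketch under stated assumptions: the data generator (noisy samples of y = 3x + 2 with some missing targets), the 80/20 split, and all function names are hypothetical and chosen only to mirror the stage names above.

```python
import random

def collect_data(n, rng):
    """Stand-in for data ingestion: noisy samples of y = 3x + 2, some targets missing."""
    rows = []
    for _ in range(n):
        x = rng.uniform(0, 10)
        y = 3 * x + 2 + rng.gauss(0, 1)
        rows.append((x, None if rng.random() < 0.1 else y))
    return rows

def prepare_data(rows):
    """Cleaning: drop rows with a missing target."""
    return [(x, y) for x, y in rows if y is not None]

def train_model(rows):
    """Fit y = w0 + w1 * x by least squares (closed form for one feature)."""
    n = len(rows)
    mx = sum(x for x, _ in rows) / n
    my = sum(y for _, y in rows) / n
    sxy = sum((x - mx) * (y - my) for x, y in rows)
    sxx = sum((x - mx) ** 2 for x, _ in rows)
    w1 = sxy / sxx
    return my - w1 * mx, w1

def evaluate(model, rows):
    """Validation metric: mean squared error on held-out rows."""
    w0, w1 = model
    return sum((y - (w0 + w1 * x)) ** 2 for x, y in rows) / len(rows)

rng = random.Random(42)
clean = prepare_data(collect_data(200, rng))
split = int(0.8 * len(clean))
train_rows, val_rows = clean[:split], clean[split:]
model = train_model(train_rows)
val_mse = evaluate(model, val_rows)
print(f"w0={model[0]:.2f}, w1={model[1]:.2f}, validation MSE={val_mse:.2f}")
```

In a real pipeline the validation metric would drive the "iterate until model is good enough" loop, and data returned after deployment would flow back into `collect_data`.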


Regression (Linear)

Linear Regression #

Linear Regression is a supervised ML method used to predict a numerical target by fitting a model that is linear in its parameters.

In ML, linear models are a core baseline: they’re fast, often surprisingly strong, and usually easy to interpret.

Key takeaway: Linear Regression learns parameters by minimising a squared-error cost. You can solve it directly (closed form) or iteratively (gradient descent), and you can extend it using basis functions and regularisation.
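
To make "linear in its parameters" concrete, here is a minimal sketch that fits a quadratic curve with polynomial basis functions and an optional L2 (ridge) penalty. The helper names, the toy data, and the choice to penalise the intercept along with the other weights are assumptions made to keep the example short.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x

def poly_features(x, degree):
    """Basis functions: [1, x, x^2, ..., x^degree]."""
    return [x ** d for d in range(degree + 1)]

def fit_ridge(xs, ys, degree, lam):
    """Minimise squared error + lam * ||w||^2 by solving (Phi^T Phi + lam I) w = Phi^T y.

    Simplification: the intercept is penalised too.
    """
    Phi = [poly_features(x, degree) for x in xs]
    k = degree + 1
    A = [[sum(row[i] * row[j] for row in Phi) + (lam if i == j else 0.0)
          for j in range(k)] for i in range(k)]
    b = [sum(row[i] * y for row, y in zip(Phi, ys)) for i in range(k)]
    return solve(A, b)

xs = [-2.0, -1.0, 0.0, 1.0, 2.0]
ys = [1 + 2 * x + 3 * x ** 2 for x in xs]        # exact quadratic, toy data
w_plain = fit_ridge(xs, ys, degree=2, lam=0.0)   # ordinary least squares
w_ridge = fit_ridge(xs, ys, degree=2, lam=1.0)   # weights shrunk towards zero
print(w_plain, w_ridge)
```

The model is non-linear in `x` but still linear in the weights, which is why the same closed-form machinery applies. Setting `lam` above zero trades a little fit for smaller weights.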

Random Variables

Random Variables #

A random variable is a way to attach numbers to outcomes of a random experiment.

It lets us move from: “what happened?” to: “what number should we analyse?”

Key takeaway: A random variable is a function from the sample space to real numbers. Once you define the random variable clearly, the rest (pmf/pdf/cdf, mean, variance) becomes systematic.


flowchart TD
PD["Probability<br/>distributions"] --> RV["Random<br/>variables"]

RV --> T["Types"]
T --> RV1["Discrete<br/>RVs"]
T --> RV2["Continuous<br/>RVs"]

RV --> F["PMF / PDF / CDF"]
RV --> S["Mean / Variance<br/>Covariance"]
RV --> J["Joint & Marginal<br/>distributions"]
RV --> X["Transformations"]

style PD fill:#90CAF9,stroke:#1E88E5,color:#000
style RV fill:#90CAF9,stroke:#1E88E5,color:#000

style T fill:#CE93D8,stroke:#8E24AA,color:#000
style F fill:#CE93D8,stroke:#8E24AA,color:#000
style S fill:#CE93D8,stroke:#8E24AA,color:#000
style J fill:#CE93D8,stroke:#8E24AA,color:#000
style X fill:#CE93D8,stroke:#8E24AA,color:#000
style RV1 fill:#CE93D8,stroke:#8E24AA,color:#000
style RV2 fill:#CE93D8,stroke:#8E24AA,color:#000

1) Definition #

Random variable: a rule that assigns a number to each outcome.
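
A small worked example may help. The sketch below takes two fair coin tosses as the random experiment and defines X as the number of heads; the experiment and the variable names are assumptions chosen for illustration.

```python
from collections import defaultdict
from fractions import Fraction
from itertools import product

# Sample space: two fair coin tosses, each outcome equally likely.
sample_space = list(product("HT", repeat=2))
prob = {outcome: Fraction(1, len(sample_space)) for outcome in sample_space}

def X(outcome):
    """The random variable: number of heads in the outcome."""
    return outcome.count("H")

# pmf: add up the probability of every outcome that maps to the same number.
pmf = defaultdict(Fraction)
for outcome, p in prob.items():
    pmf[X(outcome)] += p

mean = sum(x * p for x, p in pmf.items())
variance = sum((x - mean) ** 2 * p for x, p in pmf.items())
print(dict(pmf), mean, variance)
```

Once X is defined as a function on outcomes, the pmf, mean, and variance follow mechanically, which is the point of the key takeaway above.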

Common Probability Distributions

Common Probability Distributions #

Once you can describe a random variable using a pmf or pdf, the next step is to use named distributions that appear repeatedly in real data and in ML models.

Key takeaway: Named distributions give you ready-made probability models for common patterns: binary outcomes, counts, and measurement noise.


flowchart TD
PD["Probability<br/>distributions"] --> DS["Common<br/>distributions"]

DS --> DIS["Discrete"]
DS --> CON["Continuous"]

DIS --> D1["Bernoulli"]
DIS --> D2["Binomial"]
DIS --> D3["Poisson"]

CON --> D4["Normal<br/>(Gaussian)"]
CON --> D5["t / Chi-square / F<br/>(intro)"]

style PD fill:#90CAF9,stroke:#1E88E5,color:#000
style DS fill:#90CAF9,stroke:#1E88E5,color:#000

style DIS fill:#CE93D8,stroke:#8E24AA,color:#000
style CON fill:#CE93D8,stroke:#8E24AA,color:#000

style D1 fill:#C8E6C9,stroke:#2E7D32,color:#000
style D2 fill:#C8E6C9,stroke:#2E7D32,color:#000
style D3 fill:#C8E6C9,stroke:#2E7D32,color:#000
style D4 fill:#C8E6C9,stroke:#2E7D32,color:#000
style D5 fill:#C8E6C9,stroke:#2E7D32,color:#000

1) Bernoulli distribution (binary) #

Use when: one trial has two outcomes (success/failure).
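
As a quick check of the standard Bernoulli facts (mean p, variance p(1 − p)), the sketch below computes both from the pmf and compares the mean with simulated draws. The value p = 0.3 and the sample size are arbitrary choices for illustration.

```python
import random

def bernoulli_pmf(k, p):
    """P(X = k) for k in {0, 1} with success probability p."""
    if k not in (0, 1):
        return 0.0
    return p if k == 1 else 1.0 - p

p = 0.3  # assumed success probability
mean = sum(k * bernoulli_pmf(k, p) for k in (0, 1))                     # equals p
variance = sum((k - mean) ** 2 * bernoulli_pmf(k, p) for k in (0, 1))  # equals p * (1 - p)

rng = random.Random(0)
draws = [1 if rng.random() < p else 0 for _ in range(10_000)]
sample_mean = sum(draws) / len(draws)
print(mean, variance, sample_mean)
```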

Ordinary Least Squares

Direct solution method - Ordinary Least Squares and the Line of Best Fit #

Revision:
OLS is the direct method for linear regression. It finds the best-fit line by minimising the sum of squared residuals without iterative updates.


Direct Method vs Iterative Method ☆ #

Linear regression parameters can be found in two main ways.

| Method | Main idea | When used |
| --- | --- | --- |
| Ordinary Least Squares | Compute the best parameters directly | Small or moderate datasets |
| Gradient Descent | Start with parameters and update repeatedly | Large datasets or many features |

flowchart LR
    A["Linear Regression"] --> B["Direct Solution<br/>OLS"]
    A --> C["Iterative Solution<br/>Gradient Descent"]

    B --> B1["Normal Equation"]
    B --> B2["No learning rate"]
    B --> B3["One-shot solution"]

    C --> C1["Learning rate"]
    C --> C2["Repeated updates"]
    C --> C3["Stops after convergence"]

    style A fill:#E1F5FE,stroke:#5b7db1,color:#000
    style B fill:#C8E6C9,stroke:#5f8f6a,color:#000
    style C fill:#FFF9C4,stroke:#b59b3b,color:#000
    style B1 fill:#EDE7F6,stroke:#8a6fb3,color:#000
    style B2 fill:#EDE7F6,stroke:#8a6fb3,color:#000
    style B3 fill:#EDE7F6,stroke:#8a6fb3,color:#000
    style C1 fill:#EDE7F6,stroke:#8a6fb3,color:#000
    style C2 fill:#EDE7F6,stroke:#8a6fb3,color:#000
    style C3 fill:#EDE7F6,stroke:#8a6fb3,color:#000

Why It Is Called “Least Squares” ☆ #

OLS is called least squares because it chooses the parameters that make the sum of squared residuals as small as possible.
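
For a single feature, the direct solution can be written in a few lines. The sketch below uses the standard closed-form slope and intercept and then confirms that nudging either parameter does not reduce the sum of squared residuals. The data values and function names are assumptions for illustration.

```python
def ols_line(xs, ys):
    """Closed-form least-squares intercept and slope for y = w0 + w1 * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    w1 = sxy / sxx
    w0 = mean_y - w1 * mean_x
    return w0, w1

def sum_squared_residuals(w0, w1, xs, ys):
    """The quantity OLS minimises."""
    return sum((y - (w0 + w1 * x)) ** 2 for x, y in zip(xs, ys))

xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]  # toy data, roughly y = 2x
w0, w1 = ols_line(xs, ys)
best = sum_squared_residuals(w0, w1, xs, ys)
print(w0, w1, best)
```

There is no learning rate and no loop over updates: the parameters come out in one shot.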

Cost Function

Cost Function #

Revision:
A cost function converts model error into a single number. Training means changing the model parameters until this number becomes as small as possible.


Why Cost Function Matters in ML ☆ #

A machine learning model needs a way to decide whether one set of parameters is better than another.

For linear regression, every possible value of the parameters gives a different line. The cost function tells us which line is better by measuring how far the predictions are from the true values.
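
To see the cost function choosing between two candidate lines, here is a minimal sketch that assumes the half mean squared error \( J = \frac{1}{2N} \sum (y - \hat{y})^2 \). The data and the two candidate lines are made up for illustration.

```python
def cost(w0, w1, xs, ys):
    """J(w0, w1) = 1/(2N) * sum of squared prediction errors."""
    n = len(xs)
    return sum((y - (w0 + w1 * x)) ** 2 for x, y in zip(xs, ys)) / (2 * n)

xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]  # toy data lying exactly on y = 2x

cost_good = cost(0.0, 2.0, xs, ys)  # candidate line y = 2x
cost_bad = cost(0.0, 1.0, xs, ys)   # candidate line y = x
print(cost_good, cost_bad)
```

The line with the lower cost is the better fit. Training is the search for the parameters that push this number as low as it will go.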

Gradient Descent

Gradient Descent for Linear Regression #

Revision:
Gradient descent is the step-by-step method for reducing the cost function when a direct closed-form solution is not convenient.


Where Gradient Descent Fits in ML ☆ #

Gradient descent is used when we want the model to learn parameters by repeatedly improving them.

For linear regression, it adjusts the slope and intercept until the prediction error becomes small.

flowchart LR
    A["Initial Parameters"] --> B["Make Predictions"]
    B --> C["Compute Cost"]
    C --> D["Compute Gradient"]
    D --> E["Update Parameters"]
    E --> B

    style A fill:#E1F5FE,stroke:#5b7db1,color:#000
    style B fill:#C8E6C9,stroke:#5f8f6a,color:#000
    style C fill:#FFF9C4,stroke:#b59b3b,color:#000
    style D fill:#EDE7F6,stroke:#8a6fb3,color:#000
    style E fill:#C8E6C9,stroke:#5f8f6a,color:#000

Core Idea ☆ #

The gradient tells us the direction in which the cost increases fastest, so gradient descent steps in the opposite direction to reduce the cost.
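
The loop in the flowchart can be sketched for a one-feature model as follows. This assumes the cost \( J = \frac{1}{2N} \sum (\hat{y} - y)^2 \); the learning rate, step count, and toy data are illustrative choices, not recommended settings.

```python
def gradient(w0, w1, xs, ys):
    """Partial derivatives of J = 1/(2N) * sum (yhat - y)^2 w.r.t. w0 and w1."""
    n = len(xs)
    errors = [(w0 + w1 * x) - y for x, y in zip(xs, ys)]
    g0 = sum(errors) / n
    g1 = sum(e * x for e, x in zip(errors, xs)) / n
    return g0, g1

def gradient_descent(xs, ys, learning_rate=0.05, steps=5000):
    w0, w1 = 0.0, 0.0                    # initial parameters
    for _ in range(steps):
        g0, g1 = gradient(w0, w1, xs, ys)
        w0 -= learning_rate * g0         # move against the gradient
        w1 -= learning_rate * g1
    return w0, w1

xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]  # toy data on y = 1 + 2x
w0, w1 = gradient_descent(xs, ys)
print(w0, w1)
```

A fixed number of steps is used here for simplicity. A practical loop would usually stop once the cost or the gradient stops changing meaningfully.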

Deep Learning

Deep Learning #

  • Subset of ML
  • focuses on algorithms called Artificial Neural Networks, which are inspired by the structure and function of the brain.
  • A neural network with multiple hidden layers and multiple nodes in each hidden layer is known as a deep learning system or a deep neural network.
  • Allows systems to automatically learn hierarchical representations (features) from raw input, such as images, sound, or text.

Operational Steps for Neural Architectures #

| Step | Perceptron (Boolean/Logic) | Linear Regression Network | Binary Classification (Logistic) | DFNN / MLP (Classification) |
| --- | --- | --- | --- | --- |
| 1. Input | Take binary or discrete inputs \( x_1, \dots, x_n \) | Take numerical features \( x \) | Take numerical features \( x \) | Take high-dimensional numerical or categorical features |
| 2. Weighted Sum | Single calculation: \( z = \sum (w_i x_i) + b \) | Single calculation: \( \hat{y} = w_0 + w_1 x \) | Single calculation: \( z = W x + b \) | Multiple stages: \( z^{[l]} = W^{[l]} a^{[l-1]} + b^{[l]} \) for each layer \( l \) |
| 3. Activation | Step Function: Output 1 if \( z \geq 0 \), else 0 | Identity: the output is the weighted sum \( \hat{y} \) itself (no non-linear change) | Sigmoid: Maps \( z \) to a probability between 0 and 1 | ReLU for hidden layers; Softmax/Sigmoid for the output layer |
| 4. Loss / Error | Error = Target − Output | Mean Squared Error (MSE): \( J = \frac{1}{2N} \sum (Y - \hat{y})^2 \) | Binary Cross-Entropy (BCE): penalises predicted probabilities that are far from the true label | BCE or Categorical Cross-Entropy for multiple classes |
| 5. Optimisation | Update weights only on misclassification | Gradient Descent: compute gradients at each iteration and update weights | Backpropagation: compute error signals \( \delta \) and gradients \( dW \) | Backpropagation: recursive chain rule to update the weights of all layers |
| 6. Output | Discrete Boolean value (0 or 1) | Continuous numerical value (e.g., house prices) | Single probability score or class label | A vector of probabilities for multiple classes |


flowchart LR
    %% Input Layer
    subgraph subGraph0["Input Layer"]
        I1(("Input 1"))
        I2(("Input 2"))
        I3(("Input 3"))
    end

    %% Hidden Layers
    subgraph subGraph1["Hidden Layer 1"]
        H1a(("H1-1"))
        H1b(("H1-2"))
        H1c(("H1-3"))
    end

    subgraph subGraph2["Hidden Layer 2"]
        H2a(("H2-1"))
        H2b(("H2-2"))
        H2c(("H2-3"))
    end

    subgraph subGraph3["Hidden Layer 3"]
        H3a(("H3-1"))
        H3b(("H3-2"))
        H3c(("H3-3"))
    end

    %% Output Layer
    subgraph subGraph4["Output Layer"]
        O(("Output"))
    end

    %% Connections: Input to Hidden Layer 1
    I1 --> H1a & H1b & H1c
    I2 --> H1a & H1b & H1c
    I3 --> H1a & H1b & H1c

    %% Connections: Hidden Layer 1 to Hidden Layer 2
    H1a --> H2a & H2b & H2c
    H1b --> H2a & H2b & H2c
    H1c --> H2a & H2b & H2c

    %% Connections: Hidden Layer 2 to Hidden Layer 3
    H2a --> H3a & H3b & H3c
    H2b --> H3a & H3b & H3c
    H2c --> H3a & H3b & H3c

    %% Connections: Hidden Layer 3 to Output
    H3a --> O
    H3b --> O
    H3c --> O

    %% Styling
    style I1 fill:#C8E6C9
    style I2 fill:#C8E6C9
    style I3 fill:#C8E6C9
    style H1a fill:#BBDEFB
    style H1b fill:#BBDEFB
    style H1c fill:#BBDEFB
    style H2a fill:#90CAF9
    style H2b fill:#90CAF9
    style H2c fill:#90CAF9
    style H3a fill:#64B5F6
    style H3b fill:#64B5F6
    style H3c fill:#64B5F6
    style O fill:#FFCDD2
    style subGraph0 stroke:none,fill:transparent
    style subGraph1 stroke:none,fill:transparent
    style subGraph2 stroke:none,fill:transparent
    style subGraph3 stroke:none,fill:transparent
    style subGraph4 stroke:none,fill:transparent
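
A forward pass through a network with the same shape as the diagram (3 inputs, three hidden layers of 3 nodes, 1 output) might look like the sketch below. The random weights, the ReLU and sigmoid choices, and the input values are assumptions for illustration. Nothing is trained here.

```python
import math
import random

def relu(v):
    return [max(0.0, x) for x in v]

def sigmoid(v):
    return [1.0 / (1.0 + math.exp(-x)) for x in v]

def dense(W, b, a):
    """z = W a + b for one layer; W is a list of rows."""
    return [sum(w * x for w, x in zip(row, a)) + bi for row, bi in zip(W, b)]

def init_layer(n_out, n_in, rng):
    """Random weights and zero biases (illustrative initialisation only)."""
    W = [[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_out)]
    b = [0.0] * n_out
    return W, b

def forward(layers, x):
    a = x
    for W, b in layers[:-1]:
        a = relu(dense(W, b, a))      # hidden layers
    W, b = layers[-1]
    return sigmoid(dense(W, b, a))    # output layer

rng = random.Random(0)
sizes = [3, 3, 3, 3, 1]  # layer widths matching the diagram
layers = [init_layer(n_out, n_in, rng) for n_in, n_out in zip(sizes, sizes[1:])]
output = forward(layers, [0.5, -1.2, 2.0])
print(output)
```

Each arrow in the diagram corresponds to one weight in a `W` matrix. "Deep" simply means the loop over hidden layers runs more than once.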

Types of Neural Networks #

  • Standard NN - used for smaller, simpler data (e.g. Real Estate)
  • CNN - Convolutional - used for images (e.g. Photo Tagging, Object Detection)
  • RNN - Recurrent - used for sequence data such as text and audio (e.g. Speech Recognition, Translation)
  • Hybrid NN (e.g. Autonomous Driving)

Components of DL #

  • Data
  • Learning Algorithm: how to transform data
  • Loss Function: an objective function that quantifies how well or badly the model is doing. The lower the loss, the better the model.
  • Optimisation Algorithm: searches for the parameters that minimise the loss function. Popular optimisation algorithms for deep learning are based on an approach called gradient descent.
  • Model

Operational Steps for Neural Architectures #

| Step | Perceptron (Boolean/Logic) | Linear Regression Network | Binary Classification (Logistic) | DFNN / MLP (Classification) |
| --- | --- | --- | --- | --- |
| 1. Input | Binary/discrete inputs \( x_1, \dots, x_n \) | Numerical features \( x \) | Numerical features \( x \) | High-dimensional numerical or categorical features |
| 2. Weighted Sum | \( z = \sum (w_i x_i) + b \) | \( \hat{y} = w_0 + w_1 x \) | \( z = W x + b \) | \( z^{[l]} = W^{[l]} a^{[l-1]} + b^{[l]} \) |
| 3. Activation | Step: 1 if \( z \geq 0 \), else 0 | Identity: output = \( z \) | Sigmoid: maps \( z \) to probability | ReLU (hidden), Softmax/Sigmoid (output) |
| 4. Loss / Error | Error = Target − Output | \( J = \frac{1}{2N} \sum (Y - \hat{y})^2 \) | Binary Cross-Entropy (BCE) | BCE or Categorical Cross-Entropy |
| 5. Optimisation | Update on misclassification | Gradient Descent | Backpropagation (single layer) | Backpropagation (multi-layer chain rule) |
| 6. Output | Boolean (0 or 1) | Continuous value | Probability score | Probability vector (multi-class) |
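
The perceptron column can be sketched as a short training loop on the Boolean AND function. The learning rate, epoch count, and zero initial weights are assumptions. The update follows the classic perceptron rule, applied only when the prediction is wrong.

```python
def step(z):
    """Step activation: 1 if z >= 0, else 0."""
    return 1 if z >= 0 else 0

def train_perceptron(samples, learning_rate=1.0, epochs=20):
    """Perceptron rule: change weights only when the prediction is wrong."""
    w = [0.0, 0.0]
    b = 0.0
    for _ in range(epochs):
        for x, target in samples:
            output = step(sum(wi * xi for wi, xi in zip(w, x)) + b)
            error = target - output  # Error = Target - Output
            if error != 0:
                w = [wi + learning_rate * error * xi for wi, xi in zip(w, x)]
                b += learning_rate * error
    return w, b

and_gate = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
w, b = train_perceptron(and_gate)
predictions = [step(sum(wi * xi for wi, xi in zip(w, x)) + b) for x, _ in and_gate]
print(w, b, predictions)
```

AND is linearly separable, so the rule settles on weights that classify all four inputs correctly. The same loop would never settle on XOR, which is one motivation for the multi-layer column.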

Applications #

  • Computer Vision (e.g., face detection, medical imaging)
  • Natural Language Processing (e.g., ChatGPT, translation)
  • Self Driving Cars
  • Speech Assistants (e.g., Alexa, Siri)

Intuition #

Deep Learning is the methodology; a DNN is a model.

Classification (Linear)

Linear models for Classification #

  • categorises data by finding a linear boundary (hyperplane) that separates classes
  • does so by calculating a weighted sum of input features plus a bias
flowchart TD
T["Linear<br/>classification<br/>models"] --> P["Perceptron"]
T --> LR["Logistic<br/>regression"]
T --> SVM["Linear<br/>SVM"]

P -->|uses| STEP["Step<br/>activation"]
LR -->|uses| SIG["Sigmoid<br/>+ log loss"]
SVM -->|uses| HNG["Hinge<br/>loss"]

style T fill:#90CAF9,stroke:#1E88E5,color:#000

style P fill:#C8E6C9,stroke:#2E7D32,color:#000
style LR fill:#C8E6C9,stroke:#2E7D32,color:#000
style SVM fill:#C8E6C9,stroke:#2E7D32,color:#000

style STEP fill:#CE93D8,stroke:#8E24AA,color:#000
style SIG fill:#CE93D8,stroke:#8E24AA,color:#000
style HNG fill:#CE93D8,stroke:#8E24AA,color:#000
  • Discriminant Functions
  • Decision Theory
  • Probabilistic Discriminative Classifiers
  • Logistic Regression
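
The three models in the diagram share the same linear score and differ mainly in how they penalise it. The sketch below evaluates one example under each loss, assuming labels in {−1, +1}. The weights, the example, and the use of a 0/1 misclassification loss to stand in for the perceptron's step decision are assumptions for illustration.

```python
import math

def zero_one_loss(y, z):
    """Step-style decision: 1 if the sign of z disagrees with y, else 0."""
    return 0.0 if y * z > 0 else 1.0

def log_loss(y, z):
    """Logistic regression loss for labels y in {-1, +1}."""
    return math.log(1.0 + math.exp(-y * z))

def hinge_loss(y, z):
    """Linear SVM loss: zero only when the margin y * z is at least 1."""
    return max(0.0, 1.0 - y * z)

def score(w, b, x):
    """Weighted sum of input features plus bias."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

w, b = [1.0, -1.0], 0.5  # hypothetical weights and bias
x, y = [2.0, 1.0], +1    # one labelled example
z = score(w, b, x)
losses = {f.__name__: f(y, z) for f in (zero_one_loss, log_loss, hinge_loss)}
print(z, losses)
```

For a confidently correct example like this one, the 0/1 and hinge losses are already zero while the log loss is small but positive. This is why logistic regression keeps adjusting weights even on correctly classified points.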

Logistic Regression #

  • Supervised machine learning algorithm
  • Binary classification algorithm
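  • has a linear decision boundary, so it works best when classes are roughly linearly separable (perfect separability is not required)
  • predicts the probability that an input belongs to a specific class
  • uses the sigmoid function to convert the linear score into a probability value between 0 and 1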

Key takeaway: Logistic regression predicts $P(y=1\mid x)$ using a sigmoid of a linear score $z=w\cdot x+b$, then learns $w,b$ by maximising likelihood (equivalently minimising log-loss).
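
The takeaway can be sketched with plain batch gradient descent on the average log-loss. The toy data, learning rate, and step count are assumptions for illustration. Library implementations typically use more robust solvers and add regularisation.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(w, b, x):
    """P(y = 1 | x) = sigmoid(w . x + b)."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

def log_loss(w, b, X, y):
    """Average negative log-likelihood of the labels."""
    eps = 1e-12  # guards against log(0)
    total = 0.0
    for xi, yi in zip(X, y):
        p = min(max(predict_proba(w, b, xi), eps), 1 - eps)
        total += -(yi * math.log(p) + (1 - yi) * math.log(1 - p))
    return total / len(X)

def fit(X, y, learning_rate=0.5, steps=2000):
    """Batch gradient descent on the average log-loss."""
    w, b = [0.0] * len(X[0]), 0.0
    n = len(X)
    for _ in range(steps):
        errors = [predict_proba(w, b, xi) - yi for xi, yi in zip(X, y)]
        for j in range(len(w)):
            w[j] -= learning_rate * sum(e * xi[j] for e, xi in zip(errors, X)) / n
        b -= learning_rate * sum(errors) / n
    return w, b

X = [[0.5], [1.0], [1.5], [3.0], [3.5], [4.0]]  # toy one-feature data
y = [0, 0, 1, 0, 1, 1]                           # classes overlap on purpose
w, b = fit(X, y)
print(w, b, log_loss(w, b, X, y))
```

Because the two classes overlap, the fitted weights stay finite and the model outputs graded probabilities rather than hard 0/1 decisions.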