Bayesian Learning #

Bayesian Learning is a probabilistic approach to machine learning.

Instead of only asking, “Which output should the model predict?”, Bayesian Learning asks:

Given the data we have observed, how likely is each hypothesis, class, or parameter value?

This makes Bayesian Learning useful when uncertainty matters.

It is especially important in classification, probabilistic modelling, generative models, and situations where we want to combine prior knowledge with observed data.

Key takeaway:
Bayesian Learning updates belief using evidence.
Prior belief plus observed data gives posterior belief.

MLE Hypothesis
MAP Hypothesis
Bayes Rule
Optimal Bayes Classifier
Naive Bayes Classifier
Probabilistic Generative Classifiers
Bayesian Linear Regression

Big Picture #

Bayesian Learning is built around this idea:

Start with a prior belief.
Observe data.
Compute how likely the data is under each hypothesis.
Update the belief.
Choose the most probable class or hypothesis.

flowchart LR
    A["Prior Knowledge"] --> B["Observed Data"]
    B --> C["Likelihood"]
    C --> D["Posterior"]
    D --> E["Prediction / Classification"]

    style A fill:#E1F5FE,stroke:#5b7db1,color:#222
    style B fill:#C8E6C9,stroke:#5f8f6a,color:#222
    style C fill:#FFF9C4,stroke:#b59b3b,color:#222
    style D fill:#EDE7F6,stroke:#8a6fb3,color:#222
    style E fill:#FDE2E4,stroke:#b85c68,color:#222

Bayes Rule ☆ #

Bayes Rule tells us how to update probability when new evidence is observed.

In classification, we often want to find the probability of a class after seeing the input features.

\[ P(C \mid X) = \frac{P(X \mid C)P(C)}{P(X)} \]

Where:

Term	Meaning
\( P(C \mid X) \)	Posterior probability
\( P(X \mid C) \)	Likelihood
\( P(C) \)	Prior probability
\( P(X) \)	Evidence / normalising term

Bayes Rule in Hypothesis Form #

In Bayesian Learning, we may compare different hypotheses.

A hypothesis could be a model, a class, or a parameter setting.

\[ P(h \mid D) = \frac{P(D \mid h)P(h)}{P(D)} \]

Where:

Term	Meaning
\( h \)	Hypothesis
\( D \)	Observed training data
\( P(h) \)	Prior probability of hypothesis
\( P(D \mid h) \)	Likelihood of data under hypothesis
\( P(h \mid D) \)	Posterior probability of hypothesis after seeing data

The lecturer emphasised the interpretation:
prior: belief before seeing data
likelihood: probability of data under a hypothesis
posterior: belief after seeing the evidence

Prior, Likelihood, Evidence and Posterior ☆ #

Prior #

The prior is what we believe before seeing the current data.

Example:

Before seeing symptoms, we may know that a disease is rare.

\[ P(C) \]

Likelihood #

The likelihood tells us how likely the evidence is if a class or hypothesis is true.

Example:

If the patient has flu, how likely are fever and cough?

\[ P(X \mid C) \]

Evidence #

The evidence is the overall probability of observing the input, summed over every class that could have produced it.

In classification, it acts as a normalising denominator.

\[ P(X) = \sum_{C} P(X \mid C)P(C) \]

Posterior #

The posterior is the updated probability after seeing evidence.

\[ P(C \mid X) \]
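To make the four terms concrete, here is a minimal sketch of the rare-disease example. The numbers (1% prevalence, symptoms in 90% of patients with the disease and 10% of patients without it) and the function name `posterior` are hypothetical, chosen only for illustration.

```python
def posterior(prior: float, likelihood: float, likelihood_given_not: float) -> float:
    """Return P(C | X) for a binary class C using Bayes Rule."""
    # Evidence P(X): total probability of the input over C and not-C.
    evidence = likelihood * prior + likelihood_given_not * (1.0 - prior)
    return likelihood * prior / evidence


# Hypothetical values, not taken from real medical data.
p = posterior(prior=0.01, likelihood=0.90, likelihood_given_not=0.10)
print(f"P(disease | symptoms) = {p:.3f}")  # prints 0.083
```

Even with a high likelihood, the posterior stays small because the prior is small. This is the effect of combining prior belief with evidence.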

MLE Hypothesis ☆ #

MLE stands for Maximum Likelihood Estimation.

MLE chooses the hypothesis or parameter value that makes the observed data most likely.

It focuses only on the likelihood term.

\[ h_{MLE} = \arg\max_h P(D \mid h) \]

If the model has parameters \( \theta \) , we write:

\[ \theta_{MLE} = \arg\max_{\theta} P(D \mid \theta) \]

MLE Using Log-Likelihood ☆ #

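Products of many probabilities can become very small, small enough to underflow floating-point arithmetic.

Because the logarithm is monotonically increasing, maximising the log-likelihood gives the same maximiser as maximising the likelihood, so we usually maximise the log-likelihood instead.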

If the data points are independent and identically distributed given \( \theta \):

\[ P(D \mid \theta) = \prod_{i=1}^{n} P(x_i \mid \theta) \]

Taking log:

\[ \log P(D \mid \theta) = \sum_{i=1}^{n} \log P(x_i \mid \theta) \]

So:

\[ \theta_{MLE} = \arg\max_{\theta} \sum_{i=1}^{n} \log P(x_i \mid \theta) \]

MLE is data-driven.
It does not include prior belief. It only asks which parameter makes the observed data most probable.
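As an illustration of the log-likelihood form, the sketch below assumes a Bernoulli model (coin tosses) with hypothetical data. It finds the maximiser by a simple grid search and compares it with the known closed-form Bernoulli MLE, the sample mean. The last lines show why the log is used in practice.

```python
import math


def log_likelihood(theta, data):
    """Sum of log P(x_i | theta) for i.i.d. Bernoulli observations (0 or 1)."""
    return sum(math.log(theta if x == 1 else 1.0 - theta) for x in data)


# Hypothetical data: 7 heads (1) and 3 tails (0).
data = [1, 1, 1, 0, 1, 1, 0, 1, 0, 1]

# Candidate parameters, excluding 0 and 1 where log(0) would occur.
grid = [i / 1000 for i in range(1, 1000)]
theta_mle = max(grid, key=lambda t: log_likelihood(t, data))

print(theta_mle)              # 0.7, the grid maximiser
print(sum(data) / len(data))  # 0.7, the closed-form MLE (sample mean)

# A product of 2000 probabilities of 0.5 underflows to 0.0,
# while the sum of their logs stays finite (about -1386.3).
print(math.prod([0.5] * 2000))
print(sum(math.log(0.5) for _ in range(2000)))
```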

MAP Hypothesis ☆ #

MAP stands for Maximum A Posteriori.

MAP chooses the hypothesis with the highest posterior probability after observing the data.

\[ h_{MAP} = \arg\max_h P(h \mid D) \]

Using Bayes Rule:

\[ h_{MAP} = \arg\max_h \frac{P(D \mid h)P(h)}{P(D)} \]

Since \( P(D) \) is the same for all hypotheses, we can ignore it when comparing hypotheses.

\[ h_{MAP} = \arg\max_h P(D \mid h)P(h) \]

For parameters:

\[ \theta_{MAP} = \arg\max_{\theta} P(D \mid \theta)P(\theta) \]

MAP Using Log Form #

\[ \theta_{MAP} = \arg\max_{\theta} \left[ \log P(D \mid \theta) + \log P(\theta) \right] \]

This shows the difference clearly:

MLE uses only the data likelihood.
MAP uses likelihood plus prior.

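MAP can be interpreted as MLE with a prior-based preference: the \( \log P(\theta) \) term acts as a penalty on parameter values the prior considers unlikely.
This is why MAP is often connected to regularisation. For example, a zero-mean Gaussian prior on the weights contributes an L2 (ridge-style) penalty.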
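The difference between MLE and MAP is easiest to see with very little data. The sketch below assumes a Bernoulli model with a Beta(a, b) prior, a common textbook pairing that is not part of these notes. For that pairing the MAP estimate is the mode of the Beta posterior, \( (k + a - 1)/(n + a + b - 2) \) for \( k \) successes in \( n \) trials with \( a, b > 1 \).

```python
def bernoulli_mle(successes, n):
    """MLE for a Bernoulli parameter: the sample proportion."""
    return successes / n


def bernoulli_map(successes, n, a, b):
    """MAP estimate under an assumed Beta(a, b) prior (requires a, b > 1)."""
    return (successes + a - 1) / (n + a + b - 2)


# Hypothetical data: 3 heads in 3 tosses, with a Beta(2, 2) prior
# that mildly prefers values near 0.5.
print(bernoulli_mle(3, 3))            # 1.0
print(bernoulli_map(3, 3, 2, 2))      # 0.8

# With much more data the prior matters less.
print(bernoulli_map(300, 300, 2, 2))  # about 0.9967
```

With three tosses MLE commits to a probability of 1.0, while MAP is pulled back towards the prior. As the data grows, the two estimates converge.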

MLE vs MAP #

Aspect	MLE	MAP
Full form	Maximum Likelihood Estimation	Maximum A Posteriori
Uses data likelihood	Yes	Yes
Uses prior belief	No	Yes
Main expression	\( P(D \mid h) \)	\( P(D \mid h)P(h) \)
Useful when	Data is plentiful and no reliable prior is available	We want to include prior knowledge
Risk	May overfit with limited data	Depends on quality of prior

MAP Classification Rule ☆ #

In classification, the MAP rule chooses the class with the highest posterior probability.

\[ \hat{C} = \arg\max_C P(C \mid X) \]

Using Bayes Rule:

\[ \hat{C} = \arg\max_C P(X \mid C)P(C) \]

The denominator \( P(X) \) can be dropped because it is the same for every class, so it does not change which class attains the maximum.
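A small sketch of the rule, using hypothetical priors and likelihoods for three made-up classes. It computes the unnormalised scores \( P(X \mid C)P(C) \), then normalises them to show that dividing by \( P(X) \) leaves the chosen class unchanged.

```python
def unnormalised_scores(priors, likelihoods):
    """Return P(X | C) * P(C) for every class C."""
    return {c: likelihoods[c] * priors[c] for c in priors}


def normalise(scores):
    """Divide by the evidence P(X) so the values sum to 1."""
    evidence = sum(scores.values())
    return {c: s / evidence for c, s in scores.items()}


# Hypothetical values: P(C) and P(fever, cough | C).
priors = {"flu": 0.1, "cold": 0.3, "healthy": 0.6}
likelihoods = {"flu": 0.8, "cold": 0.4, "healthy": 0.05}

scores = unnormalised_scores(priors, likelihoods)
posteriors = normalise(scores)

map_from_scores = max(scores, key=scores.get)
map_from_posteriors = max(posteriors, key=posteriors.get)
print(map_from_scores, map_from_posteriors)  # cold cold
```

Note that "flu" has the highest likelihood but does not win, because its prior is low.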

Probabilistic Generative Classifiers ☆ #

A probabilistic generative classifier models how data is generated for each class.

It learns:

the prior probability of each class
the likelihood of features given each class

Then it applies Bayes Rule to predict the class.

\[ P(C \mid X) \propto P(X \mid C)P(C) \]

flowchart LR
    A["Training Data"] --> B["Estimate Class Prior P(C)"]
    A --> C["Estimate Likelihood P(X | C)"]
    B --> D["Apply Bayes Rule"]
    C --> D
    D --> E["Choose Class with Highest Posterior"]

    style A fill:#E1F5FE,stroke:#5b7db1,color:#222
    style B fill:#C8E6C9,stroke:#5f8f6a,color:#222
    style C fill:#FFF9C4,stroke:#b59b3b,color:#222
    style D fill:#EDE7F6,stroke:#8a6fb3,color:#222
    style E fill:#FDE2E4,stroke:#b85c68,color:#222
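The pipeline in the diagram can be sketched for a single continuous feature. This example assumes each class-conditional likelihood is Gaussian, which is one possible modelling choice rather than the only one. The data and the names `fit`, `log_gaussian`, and `predict` are hypothetical.

```python
import math
from collections import defaultdict
from statistics import mean, pvariance


def fit(xs, labels):
    """Estimate the class prior P(C) and a Gaussian P(x | C) per class."""
    by_class = defaultdict(list)
    for x, c in zip(xs, labels):
        by_class[c].append(x)
    n = len(xs)
    return {
        c: {"prior": len(v) / n, "mean": mean(v), "var": pvariance(v)}
        for c, v in by_class.items()
    }


def log_gaussian(x, mu, var):
    """Log density of a normal distribution with mean mu and variance var."""
    return -0.5 * math.log(2 * math.pi * var) - (x - mu) ** 2 / (2 * var)


def predict(model, x):
    """Choose the class maximising log P(x | C) + log P(C)."""
    return max(
        model,
        key=lambda c: math.log(model[c]["prior"])
        + log_gaussian(x, model[c]["mean"], model[c]["var"]),
    )


# Hypothetical body temperatures in degrees Celsius.
xs = [36.5, 36.8, 37.0, 36.6, 38.5, 39.0, 38.8]
labels = ["healthy"] * 4 + ["fever"] * 3

model = fit(xs, labels)
print(predict(model, 36.7), predict(model, 38.9))  # healthy fever
```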

Naive Bayes Classifier ☆ #

Naive Bayes is a probabilistic classifier based on Bayes Rule.

It is called naive because it assumes that features are conditionally independent given the class.

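This assumption is rarely exactly true in real data, but the method often still classifies well, because the prediction only needs the correct class to receive the highest score, not an accurate probability.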

Naive Bayes Formula #

For features \( X = (x_1, x_2, \ldots, x_n) \) :

\[ P(C \mid x_1, x_2, \ldots, x_n) = \frac{P(C)P(x_1, x_2, \ldots, x_n \mid C)}{P(x_1, x_2, \ldots, x_n)} \]

Using the naive independence assumption:

\[ P(x_1, x_2, \ldots, x_n \mid C) = \prod_{k=1}^{n} P(x_k \mid C) \]

So the classification rule becomes:

\[ \hat{C} = \arg\max_C P(C) \prod_{k=1}^{n} P(x_k \mid C) \]

Why Naive Bayes Is Useful #

Naive Bayes is commonly used for:

spam detection
sentiment classification
document classification
medical diagnosis support
text categorisation

It is fast because training only needs to estimate simple per-class, per-feature probabilities, which for discrete features reduces to counting.

It is especially useful when the number of features is large, such as in text data.

Types of Naive Bayes #

Type	Feature Type	Example Use
Gaussian Naive Bayes	Continuous numeric features	Medical measurements, sensor data
Multinomial Naive Bayes	Count features	Word counts in documents
Bernoulli Naive Bayes	Binary features	Word present / absent

Laplace Smoothing ☆ #

If a feature value never appears with a class in the training data, its estimated probability \( P(x_k \mid C) \) becomes zero.

That would make the whole product zero.

Laplace smoothing avoids this problem.

\[ P(x_k \mid C) = \frac{\text{count}(x_k, C) + 1}{\text{count}(C) + V} \]

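Where \( \text{count}(x_k, C) \) is the number of training examples of class \( C \) with feature value \( x_k \), \( \text{count}(C) \) is the number of training examples of class \( C \), and \( V \) is the number of possible values of that feature.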

Without smoothing, one unseen feature value can force a class probability to become zero.
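The smoothing formula above can be sketched for one categorical feature within one class. The feature values and the helper name `estimate` are hypothetical. The `alpha` parameter is an added convenience: `alpha=1` reproduces the formula above and `alpha=0` gives the unsmoothed estimate.

```python
from collections import Counter


def estimate(values, domain, alpha=1.0):
    """Estimate P(x_k = v | C) for every v in the feature's domain.

    values: feature values observed in training examples of class C
    domain: all V possible values of the feature
    alpha:  pseudo-count added to every value (1 = Laplace smoothing)
    """
    counts = Counter(values)
    denominator = len(values) + alpha * len(domain)
    return {v: (counts[v] + alpha) / denominator for v in domain}


# Hypothetical data: "blue" never occurs with this class.
domain = ["red", "green", "blue"]
seen_in_class = ["red", "red", "green", "red"]

unsmoothed = estimate(seen_in_class, domain, alpha=0.0)
smoothed = estimate(seen_in_class, domain, alpha=1.0)

print(unsmoothed["blue"])  # 0.0
print(smoothed["blue"])    # (0 + 1) / (4 + 3) = 1/7, about 0.143
```

The smoothed estimates still sum to 1 over the domain, so they remain a valid probability distribution.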

Optimal Bayes Classifier #

The Bayes optimal classifier chooses the class with the highest posterior probability.

If the true probability distributions are known, this gives the best possible classification decision: no other rule has a lower expected misclassification rate.

\[ \hat{C}_{Bayes} = \arg\max_C P(C \mid X) \]

In practice, the true distributions are unknown, so methods such as Naive Bayes estimate them from data.

Bayesian Linear Regression #

In ordinary linear regression, we estimate a single best set of weights.

In Bayesian Linear Regression, the weights are treated as random variables.

We place a prior over the weights and update it using observed data.

\[ y = w^T x + \epsilon \]

A common assumption is Gaussian noise:

\[ \epsilon \sim \mathcal{N}(0, \sigma^2) \]

The likelihood becomes:

\[ P(y \mid X, w) = \mathcal{N}(y \mid Xw, \sigma^2 I) \]

If we use a Gaussian prior over weights:

\[ P(w) = \mathcal{N}(w \mid 0, \alpha^{-1}I) \]

Then Bayes Rule gives the posterior over weights:

\[ P(w \mid X,y) \propto P(y \mid X,w)P(w) \]
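With a Gaussian likelihood and a Gaussian prior, the posterior over weights is also Gaussian and has a closed form. The sketch below covers only the simplest special case, one feature with no intercept, where the posterior precision is \( \alpha + \sum_i x_i^2 / \sigma^2 \) and the posterior mean is \( (\sum_i x_i y_i / \sigma^2) \) divided by that precision. The data and the values of \( \alpha \) and \( \sigma^2 \) are hypothetical.

```python
def posterior_weight(xs, ys, alpha, sigma2):
    """Posterior mean and variance of w for y = w * x + noise.

    Assumes prior w ~ N(0, 1 / alpha) and noise variance sigma2.
    """
    precision = alpha + sum(x * x for x in xs) / sigma2
    mean = (sum(x * y for x, y in zip(xs, ys)) / sigma2) / precision
    return mean, 1.0 / precision


# Hypothetical data, roughly following y = 2x.
xs = [1.0, 2.0, 3.0]
ys = [2.1, 3.9, 6.2]

w_mean, w_var = posterior_weight(xs, ys, alpha=1.0, sigma2=0.25)
w_least_squares = sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

print(round(w_mean, 4), round(w_var, 4))  # 2.0 0.0175
print(round(w_least_squares, 4))          # 2.0357

# Predictive variance at a new input: noise plus weight uncertainty.
x_new = 4.0
predictive_var = 0.25 + x_new ** 2 * w_var
```

The posterior mean is pulled slightly towards the prior mean of zero compared with the least-squares estimate, and the posterior variance quantifies how uncertain the weight remains.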

Bayesian Linear Regression Intuition #

Ordinary regression says:

These are the best weights.

Bayesian regression says:

These weights are more likely, but there is uncertainty.

This is useful because it can give uncertainty around predictions.

Worked Mini Example: MAP Classification #

Suppose we want to classify an email as spam or not spam.

Let the evidence be:

contains the word “offer”
contains the word “free”

We compare:

\[ P(Spam \mid X) \propto P(X \mid Spam)P(Spam) \]

\[ P(NotSpam \mid X) \propto P(X \mid NotSpam)P(NotSpam) \]

The predicted class is the one whose unnormalised posterior, the right-hand side above, is larger.
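To see the comparison with numbers, the sketch below plugs in hypothetical priors and word probabilities, none of which come from real email data. It also assumes the Naive Bayes factorisation, so \( P(X \mid C) \) is the product of the per-word probabilities.

```python
import math

# Hypothetical class priors and per-word likelihoods P(word | class).
priors = {"Spam": 0.4, "NotSpam": 0.6}
word_likelihoods = {
    "Spam": {"offer": 0.5, "free": 0.6},
    "NotSpam": {"offer": 0.1, "free": 0.2},
}
observed = ["offer", "free"]

# Unnormalised posterior: P(C) times the product of P(word | C).
scores = {
    c: priors[c] * math.prod(word_likelihoods[c][w] for w in observed)
    for c in priors
}

prediction = max(scores, key=scores.get)
posterior_spam = scores["Spam"] / sum(scores.values())

print(prediction)                # Spam
print(round(posterior_spam, 3))  # 0.909
```

Here the Spam score is 0.4 × 0.5 × 0.6 = 0.12 and the NotSpam score is 0.6 × 0.1 × 0.2 = 0.012, so the email is classified as spam even though the prior favours NotSpam.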

Notes ☆ #

You should be comfortable with:

Identifying prior, likelihood, evidence and posterior from Bayes Rule.
Explaining the difference between MLE and MAP.
Writing the MAP classification rule.
Applying Naive Bayes using class priors and conditional probabilities.
Explaining why Naive Bayes assumes conditional independence.
Explaining why Laplace smoothing is needed.
Comparing discriminative and generative classifiers.

Common Mistakes #

Mistake	Correction
Confusing likelihood and posterior	Likelihood is \( P(X \mid C) \) , posterior is \( P(C \mid X) \)
Forgetting the prior in MAP	MAP uses both likelihood and prior
Treating MLE and MAP as the same	MLE ignores the prior; MAP includes it
Forgetting Naive Bayes independence assumption	Naive Bayes multiplies individual feature likelihoods only because features are assumed conditionally independent given the class
Not using smoothing	Zero probabilities can destroy the whole product

Revision #

Concept	Core Idea	Formula
Bayes Rule	Update belief using evidence	\( P(C \mid X) = \frac{P(X \mid C)P(C)}{P(X)} \)
MLE	Choose hypothesis that makes data most likely	\( \arg\max_h P(D \mid h) \)
MAP	Choose hypothesis with highest posterior	\( \arg\max_h P(D \mid h)P(h) \)
Naive Bayes	Apply Bayes Rule with features assumed conditionally independent given the class	\( \arg\max_C P(C)\prod_k P(x_k \mid C) \)
Bayes Optimal	Choose class with highest true posterior	\( \arg\max_C P(C \mid X) \)
