Bayesian Learning #

Bayesian Learning is a probabilistic approach to machine learning.

Instead of only asking, “Which output should the model predict?”, Bayesian Learning asks:

Given the data we have observed, how likely is each hypothesis, class, or parameter value?

This makes Bayesian Learning useful when uncertainty matters.

It is especially important in classification, probabilistic modelling, generative models, and situations where we want to combine prior knowledge with observed data.

Key takeaway:
Bayesian Learning updates belief using evidence.

Prior belief plus observed data gives posterior belief.

  • MLE Hypothesis
  • MAP Hypothesis
  • Bayes Rule
  • Optimal Bayes Classifier
  • Naive Bayes Classifier
  • Probabilistic Generative Classifiers
  • Bayesian Linear Regression

Big Picture #

Bayesian Learning is built around this idea:

  1. Start with a prior belief.
  2. Observe data.
  3. Compute how likely the data is under each hypothesis.
  4. Update the belief.
  5. Choose the most probable class or hypothesis.

```mermaid
flowchart LR
    A["Prior Knowledge"] --> B["Observed Data"]
    B --> C["Likelihood"]
    C --> D["Posterior"]
    D --> E["Prediction / Classification"]

    style A fill:#E1F5FE,stroke:#5b7db1,color:#222
    style B fill:#C8E6C9,stroke:#5f8f6a,color:#222
    style C fill:#FFF9C4,stroke:#b59b3b,color:#222
    style D fill:#EDE7F6,stroke:#8a6fb3,color:#222
    style E fill:#FDE2E4,stroke:#b85c68,color:#222
```

Bayes Rule ☆ #

Bayes Rule tells us how to update probability when new evidence is observed.

In classification, we often want to find the probability of a class after seeing the input features.

\[ P(C \mid X) = \frac{P(X \mid C)P(C)}{P(X)} \]

Where:

| Term | Meaning |
| --- | --- |
| \( P(C \mid X) \) | Posterior probability |
| \( P(X \mid C) \) | Likelihood |
| \( P(C) \) | Prior probability |
| \( P(X) \) | Evidence / normalising term |

Bayes Rule in Hypothesis Form #

In Bayesian Learning, we may compare different hypotheses.

A hypothesis could be a model, a class, or a parameter setting.

\[ P(h \mid D) = \frac{P(D \mid h)P(h)}{P(D)} \]

Where:

| Term | Meaning |
| --- | --- |
| \( h \) | Hypothesis |
| \( D \) | Observed training data |
| \( P(h) \) | Prior probability of hypothesis |
| \( P(D \mid h) \) | Likelihood of data under hypothesis |
| \( P(h \mid D) \) | Posterior probability of hypothesis after seeing data |

The lecturer emphasised the interpretation:

  • prior: belief before seeing data
  • likelihood: probability of data under a hypothesis
  • posterior: belief after seeing the evidence
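As a rough illustration of the hypothesis form, the sketch below updates a belief over three candidate coin biases. The hypothesis set, the uniform prior, and the observed flips are all assumptions made up for this example, not part of the lecture material.

```python
def posterior(priors, likelihoods):
    """Return P(h | D) for each hypothesis h.

    priors:      dict mapping hypothesis -> P(h)
    likelihoods: dict mapping hypothesis -> P(D | h)
    """
    # Unnormalised posterior: likelihood times prior.
    unnormalised = {h: likelihoods[h] * priors[h] for h in priors}
    # Evidence P(D): sum of the unnormalised terms over all hypotheses.
    evidence = sum(unnormalised.values())
    return {h: value / evidence for h, value in unnormalised.items()}


# Hypothetical hypotheses: the coin's probability of heads is 0.3, 0.5 or 0.8,
# and all three are considered equally likely before seeing data.
priors = {0.3: 1 / 3, 0.5: 1 / 3, 0.8: 1 / 3}

# Hypothetical observed data D: 4 heads and 1 tail, assumed independent.
heads, tails = 4, 1
likelihoods = {p: p**heads * (1 - p) ** tails for p in priors}

post = posterior(priors, likelihoods)
best = max(post, key=post.get)
print(post)
print("Most probable hypothesis:", best)
```

Because the prior is uniform here, the most probable hypothesis is simply the one with the largest likelihood; a non-uniform prior could change the ranking.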

Prior, Likelihood, Evidence and Posterior ☆ #

Prior #

The prior is what we believe before seeing the current data.

Example:

Before seeing symptoms, we may know that a disease is rare.

\[ P(C) \]

Likelihood #

The likelihood tells us how likely the evidence is if a class or hypothesis is true.

Example:

If the patient has flu, how likely are fever and cough?

\[ P(X \mid C) \]

Evidence #

The evidence is the overall probability of observing the input.

In classification, it acts as a normalising denominator.

\[ P(X) \]

Posterior #

The posterior is the updated probability after seeing evidence.

\[ P(C \mid X) \]
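The four terms can be tied together with a small numeric sketch. The disease rate and symptom probabilities below are hypothetical values chosen only to make the arithmetic concrete.

```python
# Hypothetical numbers, chosen only for illustration.
prior = 0.01            # P(C): the disease is rare
likelihood = 0.90       # P(X | C): symptoms given the disease
likelihood_not = 0.10   # P(X | not C): symptoms without the disease

# Evidence P(X): total probability of the symptoms over both cases.
evidence = likelihood * prior + likelihood_not * (1 - prior)

# Posterior P(C | X) from Bayes Rule.
posterior_disease = likelihood * prior / evidence

print(f"P(X) = {evidence:.3f}")
print(f"P(C | X) = {posterior_disease:.3f}")
```

Even with a high likelihood, the posterior stays small in this sketch because the prior is so low.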

MLE Hypothesis ☆ #

MLE stands for Maximum Likelihood Estimation.

MLE chooses the hypothesis or parameter value that makes the observed data most likely.

It focuses only on the likelihood term.

\[ h_{MLE} = \arg\max_h P(D \mid h) \]

If the model has parameters \( \theta \) , we write:

\[ \theta_{MLE} = \arg\max_{\theta} P(D \mid \theta) \]

MLE Using Log-Likelihood ☆ #

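Products of many probabilities can become extremely small and underflow in floating-point arithmetic.

Because the logarithm is monotonically increasing, maximising the log-likelihood gives the same maximiser, so we usually maximise the log-likelihood instead.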

If the data points are independent:

\[ P(D \mid \theta) = \prod_{i=1}^{n} P(x_i \mid \theta) \]

Taking log:

\[ \log P(D \mid \theta) = \sum_{i=1}^{n} \log P(x_i \mid \theta) \]

So:

\[ \theta_{MLE} = \arg\max_{\theta} \sum_{i=1}^{n} \log P(x_i \mid \theta) \]

MLE is data-driven.

It does not include prior belief. It only asks which parameter makes the observed data most probable.
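A minimal sketch of this procedure, assuming a Bernoulli (coin-flip) model and a made-up data set: it maximises the summed log-likelihood over a grid of candidate values and compares the result with the known closed-form Bernoulli MLE, the fraction of ones.

```python
import math


def log_likelihood(theta, data):
    """Sum of log P(x_i | theta) for independent Bernoulli observations."""
    return sum(math.log(theta if x == 1 else 1 - theta) for x in data)


# Hypothetical data: 1 = heads, 0 = tails.
data = [1, 1, 0, 1, 0, 1, 1, 1, 0, 1]

# Grid search over theta in (0, 1); the endpoints are excluded to avoid log(0).
grid = [i / 1000 for i in range(1, 1000)]
theta_mle = max(grid, key=lambda t: log_likelihood(t, data))

closed_form = sum(data) / len(data)  # fraction of heads
print(theta_mle, closed_form)
```

The grid search is only for illustration; in practice the maximum is usually found analytically or with a numerical optimiser.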


MAP Hypothesis ☆ #

MAP stands for Maximum A Posteriori.

MAP chooses the hypothesis with the highest posterior probability after observing the data.

\[ h_{MAP} = \arg\max_h P(h \mid D) \]

Using Bayes Rule:

\[ h_{MAP} = \arg\max_h \frac{P(D \mid h)P(h)}{P(D)} \]

Since \( P(D) \) is the same for all hypotheses, we can ignore it when comparing hypotheses.

\[ h_{MAP} = \arg\max_h P(D \mid h)P(h) \]

For parameters:

\[ \theta_{MAP} = \arg\max_{\theta} P(D \mid \theta)P(\theta) \]

MAP Using Log Form #

\[ \theta_{MAP} = \arg\max_{\theta} \left[ \log P(D \mid \theta) + \log P(\theta) \right] \]

This shows the difference clearly:

  • MLE uses only the data likelihood.
  • MAP uses likelihood plus prior.

MAP can be interpreted as MLE with a prior-based preference.

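This is why MAP is often connected to regularisation: the \( \log P(\theta) \) term acts as a penalty on parameter values that the prior considers unlikely. For example, a zero-mean Gaussian prior on the weights corresponds to an L2 (ridge) penalty.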
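To make the contrast concrete, here is a small sketch for a coin-flip model. The Beta(a, b) prior and the data are assumptions introduced for this example; with that prior, maximising \( \log P(D \mid \theta) + \log P(\theta) \) has the closed form used below.

```python
def bernoulli_mle(heads, n):
    """MLE for the probability of heads: the observed fraction."""
    return heads / n


def bernoulli_map(heads, n, a, b):
    """MAP estimate under an assumed Beta(a, b) prior, valid for a, b > 1.

    Maximises heads*log(t) + (n-heads)*log(1-t) + (a-1)*log(t) + (b-1)*log(1-t).
    """
    return (heads + a - 1) / (n + a + b - 2)


# Hypothetical data: three flips, all heads.
heads, n = 3, 3

mle = bernoulli_mle(heads, n)
map_est = bernoulli_map(heads, n, a=2, b=2)  # prior mildly favours a fair coin

print("MLE:", mle)
print("MAP:", map_est)
```

With so little data, MLE concludes the coin always lands heads, while the prior pulls the MAP estimate back towards 0.5. As more data arrives, the two estimates move closer together.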


MLE vs MAP #

| Aspect | MLE | MAP |
| --- | --- | --- |
| Full form | Maximum Likelihood Estimation | Maximum A Posteriori |
| Uses data likelihood | Yes | Yes |
| Uses prior belief | No | Yes |
| Main expression | \( P(D \mid h) \) | \( P(D \mid h)P(h) \) |
| Useful when | We trust data strongly | We want to include prior knowledge |
| Risk | May overfit with limited data | Depends on quality of prior |

MAP Classification Rule ☆ #

In classification, the MAP rule chooses the class with the highest posterior probability.

\[ \hat{C} = \arg\max_C P(C \mid X) \]

Using Bayes Rule:

\[ \hat{C} = \arg\max_C P(X \mid C)P(C) \]

The denominator \( P(X) \) is ignored because it does not depend on the class, so it is the same for every candidate and does not change which class wins.


Probabilistic Generative Classifiers ☆ #

A probabilistic generative classifier models how data is generated for each class.

It learns:

  • the prior probability of each class
  • the likelihood of features given each class

Then it applies Bayes Rule to predict the class.

\[ P(C \mid X) \propto P(X \mid C)P(C) \]

```mermaid
flowchart LR
    A["Training Data"] --> B["Estimate Class Prior P(C)"]
    A --> C["Estimate Likelihood P(X | C)"]
    B --> D["Apply Bayes Rule"]
    C --> D
    D --> E["Choose Class with Highest Posterior"]

    style A fill:#E1F5FE,stroke:#5b7db1,color:#222
    style B fill:#C8E6C9,stroke:#5f8f6a,color:#222
    style C fill:#FFF9C4,stroke:#b59b3b,color:#222
    style D fill:#EDE7F6,stroke:#8a6fb3,color:#222
    style E fill:#FDE2E4,stroke:#b85c68,color:#222
```
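The pipeline above can be sketched for a single continuous feature. This is one possible instance, assuming each class-conditional likelihood \( P(X \mid C) \) is a 1-D Gaussian; the function names and training data are hypothetical.

```python
import math
from collections import defaultdict


def fit(samples):
    """Estimate P(C) and a 1-D Gaussian P(x | C) per class from (x, label) pairs."""
    by_class = defaultdict(list)
    for x, label in samples:
        by_class[label].append(x)
    model = {}
    for label, xs in by_class.items():
        prior = len(xs) / len(samples)
        mean = sum(xs) / len(xs)
        # Maximum-likelihood variance; assumes each class has some spread (var > 0).
        var = sum((x - mean) ** 2 for x in xs) / len(xs)
        model[label] = (prior, mean, var)
    return model


def log_gaussian(x, mean, var):
    """Log density of N(mean, var) evaluated at x."""
    return -0.5 * math.log(2 * math.pi * var) - (x - mean) ** 2 / (2 * var)


def predict(model, x):
    """Return the class maximising log P(x | C) + log P(C)."""
    def score(label):
        prior, mean, var = model[label]
        return math.log(prior) + log_gaussian(x, mean, var)
    return max(model, key=score)


# Hypothetical training data: (feature value, class label).
train = [(1.0, "A"), (1.2, "A"), (0.8, "A"), (3.0, "B"), (3.3, "B"), (2.7, "B")]
model = fit(train)
print(predict(model, 1.1), predict(model, 2.9))
```

Working with log probabilities keeps the comparison numerically stable and does not change which class has the highest posterior.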

Naive Bayes Classifier ☆ #

Naive Bayes is a probabilistic classifier based on Bayes Rule.

It is called naive because it assumes that features are conditionally independent given the class.

This assumption is often not perfectly true in real data, but the method can still work surprisingly well.


Naive Bayes Formula #

For features \( X = (x_1, x_2, \ldots, x_n) \) :

\[ P(C \mid x_1, x_2, \ldots, x_n) = \frac{P(C)P(x_1, x_2, \ldots, x_n \mid C)}{P(x_1, x_2, \ldots, x_n)} \]

Using the naive independence assumption:

\[ P(x_1, x_2, \ldots, x_n \mid C) = \prod_{k=1}^{n} P(x_k \mid C) \]

So the classification rule becomes:

\[ \hat{C} = \arg\max_C P(C) \prod_{k=1}^{n} P(x_k \mid C) \]
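A minimal sketch of this rule for categorical features, assuming the priors and conditional probabilities have already been estimated. The class names, feature values, and probabilities are invented for illustration. The product is evaluated as a sum of logs, which selects the same class.

```python
import math


def naive_bayes_predict(priors, cond_probs, features):
    """Return (argmax_C [log P(C) + sum_k log P(x_k | C)], all scores).

    priors:     dict class -> P(C)
    cond_probs: dict class -> list with one dict per feature, value -> P(x_k | C)
    features:   observed feature values (x_1, ..., x_n)
    """
    scores = {}
    for c, prior in priors.items():
        score = math.log(prior)
        for k, value in enumerate(features):
            score += math.log(cond_probs[c][k][value])
        scores[c] = score
    return max(scores, key=scores.get), scores


# Hypothetical probabilities for two categorical features (weather, wind).
priors = {"play": 0.6, "stay": 0.4}
cond_probs = {
    "play": [{"sunny": 0.7, "rainy": 0.3}, {"windy": 0.2, "calm": 0.8}],
    "stay": [{"sunny": 0.2, "rainy": 0.8}, {"windy": 0.6, "calm": 0.4}],
}

label, scores = naive_bayes_predict(priors, cond_probs, ["rainy", "windy"])
print(label)
```

Here the unnormalised scores are 0.6 × 0.3 × 0.2 = 0.036 for "play" and 0.4 × 0.8 × 0.6 = 0.192 for "stay", so "stay" is chosen even though "play" has the larger prior.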

Why Naive Bayes Is Useful #

Naive Bayes is commonly used for:

  • spam detection
  • sentiment classification
  • document classification
  • medical diagnosis support
  • text categorisation

It is fast because it only needs to estimate simple probabilities.

It is especially useful when the number of features is large, such as in text data.


Types of Naive Bayes #

| Type | Feature Type | Example Use |
| --- | --- | --- |
| Gaussian Naive Bayes | Continuous numeric features | Medical measurements, sensor data |
| Multinomial Naive Bayes | Count features | Word counts in documents |
| Bernoulli Naive Bayes | Binary features | Word present / absent |

Laplace Smoothing ☆ #

If a feature value never appears with a class in the training data, its probability becomes zero.

That would make the whole product zero.

Laplace smoothing avoids this problem.

\[ P(x_k \mid C) = \frac{\text{count}(x_k, C) + 1}{\text{count}(C) + V} \]

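Here \( \text{count}(x_k, C) \) is the number of training examples of class \( C \) with that feature value, \( \text{count}(C) \) is the number of training examples of class \( C \), and \( V \) is the number of distinct values the feature can take.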

Without smoothing, one unseen feature value can force a class probability to become zero.
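The smoothed estimate can be sketched as follows for a single categorical feature. The helper name and the toy data are assumptions made for this example.

```python
from collections import Counter


def smoothed_probs(values_in_class, vocabulary):
    """Estimate P(x_k = v | C) for every v in vocabulary with add-one smoothing.

    values_in_class: observed values of the feature among training examples of class C
    vocabulary:      all V possible values of the feature
    """
    counts = Counter(values_in_class)   # count(x_k, C) for each value
    total = len(values_in_class)        # count(C)
    V = len(vocabulary)
    return {v: (counts[v] + 1) / (total + V) for v in vocabulary}


# Hypothetical class in which "rainy" never occurs in the training data.
observed = ["sunny", "sunny", "cloudy", "sunny"]
probs = smoothed_probs(observed, ["sunny", "cloudy", "rainy"])
print(probs)
```

The unseen value "rainy" now gets probability 1/7 instead of 0, and the smoothed estimates still sum to 1 over the feature's possible values.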


Optimal Bayes Classifier #

The Bayes optimal classifier chooses the class with the highest posterior probability.

If the true probability distributions are known, it minimises the probability of misclassification, so no other classifier can achieve a lower expected error rate.

\[ \hat{C}_{Bayes} = \arg\max_C P(C \mid X) \]

In practice, the true distributions are unknown, so methods such as Naive Bayes estimate them from data.


Bayesian Linear Regression #

In ordinary linear regression, we estimate a single best set of weights.

In Bayesian Linear Regression, the weights are treated as random variables.

We place a prior over the weights and update it using observed data.

\[ y = w^T x + \epsilon \]

A common assumption is Gaussian noise:

\[ \epsilon \sim \mathcal{N}(0, \sigma^2) \]

The likelihood becomes:

\[ P(y \mid X, w) = \mathcal{N}(y \mid Xw, \sigma^2 I) \]

If we use a Gaussian prior over weights:

\[ P(w) = \mathcal{N}(w \mid 0, \alpha^{-1}I) \]

Then Bayes Rule gives the posterior over weights:

\[ P(w \mid X,y) \propto P(y \mid X,w)P(w) \]

Bayesian Linear Regression Intuition #

Ordinary regression says:

These are the best weights.

Bayesian regression says:

These weights are more likely, but there is uncertainty.

This is useful because it can give uncertainty around predictions.
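As a rough sketch of how that uncertainty arises, consider the simplest case of a single weight and no intercept, \( y = wx + \epsilon \), with the Gaussian noise and Gaussian prior described above. Under these assumptions the posterior over \( w \) is Gaussian with a closed form. The data and the values of \( \sigma^2 \) and \( \alpha \) below are hypothetical.

```python
def posterior_1d(xs, ys, sigma2, alpha):
    """Posterior mean and variance of w for y = w*x + noise.

    Assumes noise ~ N(0, sigma2) and prior w ~ N(0, 1/alpha).
    """
    precision = alpha + sum(x * x for x in xs) / sigma2
    variance = 1.0 / precision
    mean = variance * sum(x * y for x, y in zip(xs, ys)) / sigma2
    return mean, variance


def predict(x_new, mean, variance, sigma2):
    """Predictive mean and variance for a new input x_new."""
    return mean * x_new, sigma2 + x_new**2 * variance


# Hypothetical data roughly following y = 2x.
xs = [1.0, 2.0, 3.0]
ys = [2.1, 3.9, 6.2]

mean, variance = posterior_1d(xs, ys, sigma2=0.25, alpha=1.0)
pred_mean, pred_var = predict(4.0, mean, variance, sigma2=0.25)

least_squares = sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)
print(mean, variance)
print(pred_mean, pred_var)
print(least_squares)
```

The posterior mean is pulled slightly towards zero compared with the least-squares estimate, and the predictive variance grows with the size of the new input because uncertainty about the weight is scaled by the input.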


Worked Mini Example: MAP Classification #

Suppose we want to classify an email as spam or not spam.

Let the evidence be:

  • contains the word “offer”
  • contains the word “free”

We compare:

\[ P(\text{Spam} \mid X) \propto P(X \mid \text{Spam})P(\text{Spam}) \]

\[ P(\text{NotSpam} \mid X) \propto P(X \mid \text{NotSpam})P(\text{NotSpam}) \]

The predicted class is whichever value is larger.
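The comparison can be made concrete with numbers. Every probability below is hypothetical, and the likelihood \( P(X \mid C) \) is factored per word using the naive independence assumption.

```python
# All probabilities are hypothetical, chosen only to make the arithmetic concrete.
priors = {"Spam": 0.4, "NotSpam": 0.6}
word_probs = {
    "Spam": {"offer": 0.5, "free": 0.6},
    "NotSpam": {"offer": 0.1, "free": 0.2},
}
evidence_words = ["offer", "free"]

# Unnormalised posterior for each class: P(X | C) * P(C).
scores = {}
for c in priors:
    score = priors[c]
    for word in evidence_words:
        score *= word_probs[c][word]
    scores[c] = score

prediction = max(scores, key=scores.get)

# Dividing by the sum of the scores recovers the actual posterior probability.
posterior_spam = scores["Spam"] / sum(scores.values())

print(scores)
print(prediction, round(posterior_spam, 3))
```

With these numbers the scores are 0.12 for Spam and 0.012 for NotSpam, so the email is classified as spam.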


Notes ☆ #

You should be comfortable with:

  1. Identifying prior, likelihood, evidence and posterior from Bayes Rule.
  2. Explaining the difference between MLE and MAP.
  3. Writing the MAP classification rule.
  4. Applying Naive Bayes using class priors and conditional probabilities.
  5. Explaining why Naive Bayes assumes conditional independence.
  6. Explaining why Laplace smoothing is needed.
  7. Comparing discriminative and generative classifiers.

Common Mistakes #

| Mistake | Correction |
| --- | --- |
| Confusing likelihood and posterior | Likelihood is \( P(X \mid C) \), posterior is \( P(C \mid X) \) |
| Forgetting the prior in MAP | MAP uses both likelihood and prior |
| Treating MLE and MAP as the same | MLE ignores the prior; MAP includes it |
| Forgetting the Naive Bayes independence assumption | Naive Bayes multiplies individual feature likelihoods because it assumes features are conditionally independent given the class |
| Not using smoothing | Zero probabilities can destroy the whole product |

Revision #

| Concept | Core Idea | Formula |
| --- | --- | --- |
| Bayes Rule | Update belief using evidence | \( P(C \mid X) = \frac{P(X \mid C)P(C)}{P(X)} \) |
| MLE | Choose hypothesis that makes data most likely | \( \arg\max_h P(D \mid h) \) |
| MAP | Choose hypothesis with highest posterior | \( \arg\max_h P(D \mid h)P(h) \) |
| Naive Bayes | Apply Bayes Rule with conditionally independent features | \( \arg\max_C P(C)\prod_k P(x_k \mid C) \) |
| Bayes Optimal | Choose class with highest true posterior | \( \arg\max_C P(C \mid X) \) |
