Bayesian Learning #
Bayesian Learning is a probabilistic approach to machine learning.
Instead of only asking, “Which output should the model predict?”, Bayesian Learning asks:
Given the data we have observed, how likely is each hypothesis, class, or parameter value?
This makes Bayesian Learning useful when uncertainty matters.
It is especially important in classification, probabilistic modelling, generative models, and situations where we want to combine prior knowledge with observed data.
Key takeaway:
Bayesian Learning updates belief using evidence. Prior belief plus observed data gives posterior belief.
- MLE Hypothesis
- MAP Hypothesis
- Bayes Rule
- Optimal Bayes Classifier
- Naive Bayes Classifier
- Probabilistic Generative Classifiers
- Bayesian Linear Regression
Big Picture #
Bayesian Learning is built around this idea:
- Start with a prior belief.
- Observe data.
- Compute how likely the data is under each hypothesis.
- Update the belief.
- Choose the most probable class or hypothesis.
flowchart LR
A["Prior Knowledge"] --> B["Observed Data"]
B --> C["Likelihood"]
C --> D["Posterior"]
D --> E["Prediction / Classification"]
style A fill:#E1F5FE,stroke:#5b7db1,color:#222
style B fill:#C8E6C9,stroke:#5f8f6a,color:#222
style C fill:#FFF9C4,stroke:#b59b3b,color:#222
style D fill:#EDE7F6,stroke:#8a6fb3,color:#222
style E fill:#FDE2E4,stroke:#b85c68,color:#222
Bayes Rule ☆ #
Bayes Rule tells us how to update probability when new evidence is observed.
In classification, we often want to find the probability of a class after seeing the input features.
\[ P(C \mid X) = \frac{P(X \mid C)P(C)}{P(X)} \]
Where:
| Term | Meaning |
|---|---|
| \( P(C \mid X) \) | Posterior probability |
| \( P(X \mid C) \) | Likelihood |
| \( P(C) \) | Prior probability |
| \( P(X) \) | Evidence / normalising term |
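A quick numeric check of the rule can make the four terms concrete. The sketch below assumes hypothetical numbers for a rare disease and a diagnostic test (none of these values come from the notes), and the function name `posterior` is only illustrative.

```python
def posterior(prior, likelihood, likelihood_given_not):
    """Return P(C | X) for a binary class using Bayes Rule."""
    # Evidence P(X) via the law of total probability over C and not-C.
    evidence = likelihood * prior + likelihood_given_not * (1.0 - prior)
    return likelihood * prior / evidence


# Hypothetical values: P(disease) = 0.01,
# P(positive | disease) = 0.9, P(positive | no disease) = 0.05.
p = posterior(prior=0.01, likelihood=0.9, likelihood_given_not=0.05)
print(round(p, 3))  # 0.154
```

Even with a fairly accurate test, the posterior stays low because the prior is small, which is the effect the prior term is meant to capture.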
Bayes Rule in Hypothesis Form #
In Bayesian Learning, we may compare different hypotheses.
A hypothesis could be a model, a class, or a parameter setting.
\[ P(h \mid D) = \frac{P(D \mid h)P(h)}{P(D)} \]
Where:
| Term | Meaning |
|---|---|
| \( h \) | Hypothesis |
| \( D \) | Observed training data |
| \( P(h) \) | Prior probability of hypothesis |
| \( P(D \mid h) \) | Likelihood of data under hypothesis |
| \( P(h \mid D) \) | Posterior probability of hypothesis after seeing data |
The lecturer emphasised the interpretation:
- prior: belief before seeing data
- likelihood: probability of data under a hypothesis
- posterior: belief after seeing the evidence
Prior, Likelihood, Evidence and Posterior ☆ #
Prior #
The prior is what we believe before seeing the current data.
Example:
Before seeing symptoms, we may know that a disease is rare.
\[ P(C) \]
Likelihood #
The likelihood tells us how likely the evidence is if a class or hypothesis is true.
Example:
If the patient has flu, how likely are fever and cough?
\[ P(X \mid C) \]
Evidence #
The evidence is the overall probability of observing the input.
In classification, it acts as a normalising denominator.
\[ P(X) \]
Posterior #
The posterior is the updated probability after seeing evidence.
\[ P(C \mid X) \]
MLE Hypothesis ☆ #
MLE stands for Maximum Likelihood Estimation.
MLE chooses the hypothesis or parameter value that makes the observed data most likely.
It focuses only on the likelihood term.
\[ h_{MLE} = \arg\max_h P(D \mid h) \]
If the model has parameters \( \theta \), we write:
\[ \theta_{MLE} = \arg\max_{\theta} P(D \mid \theta) \]
MLE Using Log-Likelihood ☆ #
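A product of many probabilities quickly becomes too small to represent accurately in floating point.
Because the logarithm is monotonically increasing, maximising the log-likelihood gives the same answer, so we usually maximise that instead.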
If the data points are independent:
\[ P(D \mid \theta) = \prod_{i=1}^{n} P(x_i \mid \theta) \]
Taking the log:
\[ \log P(D \mid \theta) = \sum_{i=1}^{n} \log P(x_i \mid \theta) \]
So:
\[ \theta_{MLE} = \arg\max_{\theta} \sum_{i=1}^{n} \log P(x_i \mid \theta) \]
MLE is data-driven.
It does not include prior belief. It only asks which parameter makes the observed data most probable.
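As a small illustration of maximising the log-likelihood, the sketch below assumes a Bernoulli model (a coin with unknown heads probability \( \theta \)) and hypothetical coin-flip data. It searches a grid of candidate values and compares the result with the closed-form answer for this model, the sample proportion of heads.

```python
import math


def log_likelihood(theta, data):
    """Sum of log P(x_i | theta) for independent Bernoulli observations."""
    return sum(math.log(theta if x == 1 else 1.0 - theta) for x in data)


def mle_grid(data, steps=999):
    """Pick the theta on a grid inside (0, 1) that maximises the log-likelihood."""
    grid = [(i + 1) / (steps + 1) for i in range(steps)]
    return max(grid, key=lambda t: log_likelihood(t, data))


# Hypothetical coin flips: 7 heads (1) and 3 tails (0).
data = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]
theta_hat = mle_grid(data)
closed_form = sum(data) / len(data)  # sample proportion of heads
print(theta_hat, closed_form)
```

The grid search is only for illustration; in practice the maximiser is found analytically or with a numerical optimiser.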
MAP Hypothesis ☆ #
MAP stands for Maximum A Posteriori.
MAP chooses the hypothesis with the highest posterior probability after observing the data.
\[ h_{MAP} = \arg\max_h P(h \mid D) \]
Using Bayes Rule:
\[ h_{MAP} = \arg\max_h \frac{P(D \mid h)P(h)}{P(D)} \]
Since \( P(D) \) is the same for all hypotheses, we can ignore it when comparing hypotheses.
\[ h_{MAP} = \arg\max_h P(D \mid h)P(h) \]
For parameters:
\[ \theta_{MAP} = \arg\max_{\theta} P(D \mid \theta)P(\theta) \]
MAP Using Log Form #
\[ \theta_{MAP} = \arg\max_{\theta} \left[ \log P(D \mid \theta) + \log P(\theta) \right] \]
This shows the difference clearly:
- MLE uses only the data likelihood.
- MAP uses likelihood plus prior.
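MAP can be interpreted as MLE with a prior-based preference: the \( \log P(\theta) \) term lowers the score of parameter values that the prior considers unlikely.
This is why MAP is often connected to regularisation.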
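One concrete case of the regularisation link, worked under an assumed prior (a zero-mean Gaussian with precision \( \alpha \) over a weight vector \( w \); this choice is for illustration and is not required by MAP in general):

\[ P(w) = \mathcal{N}(w \mid 0, \alpha^{-1} I) \quad\Rightarrow\quad \log P(w) = -\frac{\alpha}{2} \lVert w \rVert^2 + \text{const} \]
\[ w_{MAP} = \arg\max_{w} \left[ \log P(D \mid w) - \frac{\alpha}{2} \lVert w \rVert^2 \right] \]

Under this assumption the log-prior acts as an L2 penalty on the weights, and a larger \( \alpha \) (a tighter prior) pulls the estimate more strongly towards zero.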
MLE vs MAP #
| Aspect | MLE | MAP |
|---|---|---|
| Full form | Maximum Likelihood Estimation | Maximum A Posteriori |
| Uses data likelihood | Yes | Yes |
| Uses prior belief | No | Yes |
| Main expression | \( P(D \mid h) \) | \( P(D \mid h)P(h) \) |
| Useful when | Data is plentiful and no reliable prior is available | Data is limited or we have prior knowledge to include |
| Risk | May overfit with limited data | Depends on quality of prior |
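The overfitting risk in the table can be seen with very little data. The sketch below assumes a Bernoulli model with a Beta(a, b) prior, for which the MAP estimate is the mode of the Beta posterior. The prior parameters `a = b = 2` and the counts are hypothetical.

```python
def bernoulli_mle(heads, n):
    """MLE of the heads probability: the observed proportion."""
    return heads / n


def bernoulli_map(heads, n, a=2.0, b=2.0):
    """Mode of the Beta(a + heads, b + n - heads) posterior (valid for a, b > 1)."""
    return (heads + a - 1.0) / (n + a + b - 2.0)


# Three flips, all heads: MLE claims tails is impossible, MAP does not.
print(bernoulli_mle(3, 3))  # 1.0
print(bernoulli_map(3, 3))  # 0.8

# With much more data, the two estimates move close together.
print(bernoulli_mle(300, 300), bernoulli_map(300, 300))
```

With three observations the prior dominates the difference; with three hundred the likelihood does, and the choice of prior matters much less.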
MAP Classification Rule ☆ #
In classification, the MAP rule chooses the class with the highest posterior probability.
\[ \hat{C} = \arg\max_C P(C \mid X) \]
Using Bayes Rule:
\[ \hat{C} = \arg\max_C P(X \mid C)P(C) \]
The denominator \( P(X) \) is dropped because it is the same for every class.
Probabilistic Generative Classifiers ☆ #
A probabilistic generative classifier models how data is generated for each class.
It learns:
- the prior probability of each class
- the likelihood of features given each class
Then it applies Bayes Rule to predict the class.
\[ P(C \mid X) \propto P(X \mid C)P(C) \]
flowchart LR
A["Training Data"] --> B["Estimate Class Prior P(C)"]
A --> C["Estimate Likelihood P(X | C)"]
B --> D["Apply Bayes Rule"]
C --> D
D --> E["Choose Class with Highest Posterior"]
style A fill:#E1F5FE,stroke:#5b7db1,color:#222
style B fill:#C8E6C9,stroke:#5f8f6a,color:#222
style C fill:#FFF9C4,stroke:#b59b3b,color:#222
style D fill:#EDE7F6,stroke:#8a6fb3,color:#222
style E fill:#FDE2E4,stroke:#b85c68,color:#222
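The pipeline in the flowchart can be sketched in a few lines. This is a minimal example that assumes a single continuous feature and a Gaussian likelihood per class. The function names and the temperature data are hypothetical, and other likelihood models would fit the same structure.

```python
import math
from collections import defaultdict
from statistics import mean, pvariance


def fit(xs, labels):
    """Estimate the prior P(C) and a Gaussian P(x | C) for each class."""
    by_class = defaultdict(list)
    for x, c in zip(xs, labels):
        by_class[c].append(x)
    n = len(xs)
    return {
        c: {"prior": len(v) / n, "mu": mean(v), "var": pvariance(v)}
        for c, v in by_class.items()
    }


def log_gaussian(x, mu, var):
    """Log density of a 1-D Gaussian with mean mu and variance var."""
    return -0.5 * math.log(2.0 * math.pi * var) - (x - mu) ** 2 / (2.0 * var)


def predict(model, x):
    """Return the class maximising log P(x | C) + log P(C)."""
    return max(
        model,
        key=lambda c: math.log(model[c]["prior"])
        + log_gaussian(x, model[c]["mu"], model[c]["var"]),
    )


# Hypothetical body temperatures in degrees Celsius.
xs = [36.5, 36.8, 37.0, 36.6, 38.5, 39.0, 38.8]
labels = ["healthy"] * 4 + ["flu"] * 3
model = fit(xs, labels)
print(predict(model, 36.7), predict(model, 38.9))
```

The sketch assumes every class has at least two distinct values, so that the estimated variance is not zero.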
Naive Bayes Classifier ☆ #
Naive Bayes is a probabilistic classifier based on Bayes Rule.
It is called naive because it assumes that features are conditionally independent given the class.
This assumption is often not perfectly true in real data, but the method can still work surprisingly well.
Naive Bayes Formula #
For features \( X = (x_1, x_2, \ldots, x_n) \):
\[ P(C \mid x_1, x_2, \ldots, x_n) = \frac{P(C)P(x_1, x_2, \ldots, x_n \mid C)}{P(x_1, x_2, \ldots, x_n)} \]
Using the naive independence assumption:
\[ P(x_1, x_2, \ldots, x_n \mid C) = \prod_{k=1}^{n} P(x_k \mid C) \]
So the classification rule becomes:
\[ \hat{C} = \arg\max_C P(C) \prod_{k=1}^{n} P(x_k \mid C) \]
Why Naive Bayes Is Useful #
Naive Bayes is commonly used for:
- spam detection
- sentiment classification
- document classification
- medical diagnosis support
- text categorisation
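It is fast because training reduces to counting (or, for Gaussian features, computing per-class means and variances), and prediction is a single pass over the features.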
It is especially useful when the number of features is large, such as in text data.
Types of Naive Bayes #
| Type | Feature Type | Example Use |
|---|---|---|
| Gaussian Naive Bayes | Continuous numeric features | Medical measurements, sensor data |
| Multinomial Naive Bayes | Count features | Word counts in documents |
| Bernoulli Naive Bayes | Binary features | Word present / absent |
Laplace Smoothing ☆ #
If a feature value never appears with a class in the training data, its probability becomes zero.
That would make the whole product zero.
Laplace smoothing avoids this problem.
\[ P(x_k \mid C) = \frac{\text{count}(x_k, C) + 1}{\text{count}(C) + V} \]
Where \( V \) is the number of distinct values the feature can take (for word features, the vocabulary size).
Without smoothing, one unseen feature value can force a class probability to become zero.
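The following sketch applies the smoothed estimate inside a small categorical Naive Bayes classifier. The training rows, the feature encoding (whether an email contains "free" and "offer") and the function names are hypothetical. It also assumes that \( V \) for each feature is the number of distinct values seen in training.

```python
import math
from collections import Counter, defaultdict


def train(rows, labels):
    """Count classes, (feature index, value, class) triples, and values per feature."""
    class_counts = Counter(labels)
    feature_counts = defaultdict(int)
    values = defaultdict(set)
    for row, c in zip(rows, labels):
        for k, v in enumerate(row):
            feature_counts[(k, v, c)] += 1
            values[k].add(v)
    return class_counts, feature_counts, values


def predict(model, row):
    """Return the best class and the log score log P(C) + sum_k log P(x_k | C)."""
    class_counts, feature_counts, values = model
    total = sum(class_counts.values())
    scores = {}
    for c, n_c in class_counts.items():
        score = math.log(n_c / total)  # log prior
        for k, v in enumerate(row):
            # Laplace smoothing: add 1 to the count and V to the denominator.
            V = len(values[k])
            score += math.log((feature_counts.get((k, v, c), 0) + 1) / (n_c + V))
        scores[c] = score
    return max(scores, key=scores.get), scores


# Features: (contains "free", contains "offer"); "free" never occurs with "ham".
rows = [("yes", "yes"), ("yes", "no"), ("yes", "yes"),
        ("no", "no"), ("no", "no"), ("no", "yes")]
labels = ["spam", "spam", "spam", "ham", "ham", "ham"]
model = train(rows, labels)
label, scores = predict(model, ("yes", "no"))
print(label, scores)
```

Without the added 1, the "ham" score for this input would involve \( \log 0 \). With smoothing it stays finite, so "ham" is only made less likely rather than ruled out.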
Optimal Bayes Classifier #
The Bayes optimal classifier chooses the class with the highest posterior probability.
If the true probability distributions are known, no other classifier achieves a lower probability of misclassification.
\[ \hat{C}_{Bayes} = \arg\max_C P(C \mid X) \]
In practice, the true distributions are unknown, so methods such as Naive Bayes estimate them from data.
Bayesian Linear Regression #
In ordinary linear regression, we estimate a single best set of weights.
In Bayesian Linear Regression, the weights are treated as random variables.
We place a prior over the weights and update it using observed data.
\[ y = w^T x + \epsilon \]
A common assumption is Gaussian noise:
\[ \epsilon \sim \mathcal{N}(0, \sigma^2) \]
Stacking all observations into a design matrix \( X \) and a target vector \( y \), the likelihood becomes:
\[ P(y \mid X, w) = \mathcal{N}(y \mid Xw, \sigma^2 I) \]
If we use a Gaussian prior over weights:
\[ P(w) = \mathcal{N}(w \mid 0, \alpha^{-1}I) \]
Then Bayes Rule gives the posterior over weights:
\[ P(w \mid X,y) \propto P(y \mid X,w)P(w) \]
Bayesian Linear Regression Intuition #
Ordinary regression says:
These are the best weights.
Bayesian regression says:
These weights are more likely, but there is uncertainty.
This is useful because it can give uncertainty around predictions.
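To make the uncertainty concrete, here is a minimal sketch for the simplest case: a single scalar weight, no intercept, known noise variance \( \sigma^2 \), and the zero-mean Gaussian prior with precision \( \alpha \). Under these assumptions the posterior over the weight is Gaussian with a closed form. The data and function names are hypothetical.

```python
def posterior_1d(xs, ys, alpha=1.0, sigma2=1.0):
    """Posterior mean and variance of w for y = w * x + noise."""
    beta = 1.0 / sigma2  # noise precision
    precision = alpha + beta * sum(x * x for x in xs)
    mean = beta * sum(x * y for x, y in zip(xs, ys)) / precision
    return mean, 1.0 / precision


def predictive(x_new, mean, var, sigma2=1.0):
    """Predictive mean and variance for a new input x_new."""
    # Predictive variance = noise variance + uncertainty about the weight.
    return mean * x_new, sigma2 + x_new * x_new * var


xs = [1.0, 2.0, 3.0]
ys = [2.1, 3.9, 6.2]
w_mean, w_var = posterior_1d(xs, ys)
pred_mean, pred_var = predictive(2.0, w_mean, w_var)
print(w_mean, w_var, pred_mean, pred_var)

# Seeing the same data twice shrinks the posterior variance.
_, w_var_more = posterior_1d(xs * 2, ys * 2)
print(w_var_more < w_var)
```

The predictive variance has two parts: the noise term, which never goes away, and the weight-uncertainty term, which shrinks as more data is observed.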
Worked Mini Example: MAP Classification #
Suppose we want to classify an email as spam or not spam.
Let the evidence be:
- contains the word “offer”
- contains the word “free”
We compare:
\[ P(\text{Spam} \mid X) \propto P(X \mid \text{Spam})P(\text{Spam}) \]
\[ P(\text{NotSpam} \mid X) \propto P(X \mid \text{NotSpam})P(\text{NotSpam}) \]
The predicted class is the one with the larger value.
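The comparison can be carried out with numbers. All probabilities below are hypothetical, chosen only to illustrate the arithmetic, and the likelihood \( P(X \mid C) \) is factorised per word using the Naive Bayes independence assumption.

```python
# Hypothetical class priors and per-word likelihoods P(word present | class).
priors = {"Spam": 0.4, "NotSpam": 0.6}
likelihoods = {
    "Spam": {"offer": 0.6, "free": 0.7},
    "NotSpam": {"offer": 0.1, "free": 0.2},
}
evidence_words = ["offer", "free"]

# Unnormalised posterior: P(C) * product of P(word | C).
scores = {}
for c in priors:
    score = priors[c]
    for word in evidence_words:
        score *= likelihoods[c][word]
    scores[c] = score

# Dividing by the sum recovers the normalised posteriors P(C | X).
total = sum(scores.values())
posteriors = {c: s / total for c, s in scores.items()}
prediction = max(scores, key=scores.get)
print(scores, posteriors, prediction)
```

Here the unnormalised scores are 0.168 for Spam and 0.012 for NotSpam, so the email is classified as Spam. Normalising changes the values but not the ranking.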
Notes ☆ #
You should be comfortable with:
- Identifying prior, likelihood, evidence and posterior from Bayes Rule.
- Explaining the difference between MLE and MAP.
- Writing the MAP classification rule.
- Applying Naive Bayes using class priors and conditional probabilities.
- Explaining why Naive Bayes assumes conditional independence.
- Explaining why Laplace smoothing is needed.
- Comparing discriminative and generative classifiers.
Common Mistakes #
| Mistake | Correction |
|---|---|
| Confusing likelihood and posterior | Likelihood is \( P(X \mid C) \) , posterior is \( P(C \mid X) \) |
| Forgetting the prior in MAP | MAP uses both likelihood and prior |
| Treating MLE and MAP as the same | MLE ignores the prior; MAP includes it |
| Forgetting Naive Bayes independence assumption | Naive Bayes multiplies individual feature likelihoods |
| Not using smoothing | Zero probabilities can destroy the whole product |
Revision #
| Concept | Core Idea | Formula |
|---|---|---|
| Bayes Rule | Update belief using evidence | \( P(C \mid X) = \frac{P(X \mid C)P(C)}{P(X)} \) |
| MLE | Choose hypothesis that makes data most likely | \( \arg\max_h P(D \mid h) \) |
| MAP | Choose hypothesis with highest posterior | \( \arg\max_h P(D \mid h)P(h) \) |
| Naive Bayes | Apply Bayes Rule assuming features are conditionally independent given the class | \( \arg\max_C P(C)\prod_k P(x_k \mid C) \) |
| Bayes Optimal | Choose class with highest true posterior | \( \arg\max_C P(C \mid X) \) |