
Naïve Bayes #

Naïve Bayes is a probabilistic classifier.

  • Supervised learning problem
  • Binary classification: the target variable takes one of two classes (e.g. Yes/No)
  • Each class is a hypothesis for the instance you want to classify
  • The total (prior) probability of each class is computed first
  • The posterior is the updated probability once you start studying the data
  • The instance is assigned to the class whose hypothesis has the maximum probability

It predicts a class label by computing:

  • a prior for each class
  • conditional probabilities of features given the class
  • a score for each class (multiply probabilities)
  • the class with the maximum score is chosen

Key takeaway: Naïve Bayes = Bayes + conditional independence.

It works by comparing:

\( P(\text{class}) \times \prod P(\text{feature}\mid \text{class}) \)
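
As a minimal sketch of this comparison, with made-up priors and conditionals (every number below is an assumption, not from a real dataset):

```python
# Hypothetical probabilities for a two-class problem (assumed values).
priors = {"Yes": 9 / 14, "No": 5 / 14}
conditionals = {
    "Yes": [2 / 9, 3 / 9],   # P(feature_i | Yes) for each observed feature
    "No":  [3 / 5, 2 / 5],   # P(feature_i | No)
}

def score(label):
    """Prior times the product of per-feature conditionals."""
    s = priors[label]
    for p in conditionals[label]:
        s *= p
    return s

prediction = max(priors, key=score)   # class with the larger score
```

With these particular numbers the "No" score is larger, so "No" is predicted.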

1) The classification setup #

We have:

  • feature vector: \( X=(X_1,X_2,\dots,X_n) \)
  • class label: \( Y \in \{c_1,c_2,\dots,c_K\} \)

Goal: given a new instance \( x \) , predict \( \hat{y} \) .

Examples:

  • Play = Yes/No given weather
  • Spam / Not spam given words
  • Pass / Fail given attributes

2) Bayes theorem for a class #

For a class \( Y=c_k \) and observed features \( X=x \) :

\[ P(Y=c_k\mid X=x)=\frac{P(X=x\mid Y=c_k)\,P(Y=c_k)}{P(X=x)} \]

Meaning:

  • \( P(Y=c_k) \) : prior probability of class \( c_k \)
  • \( P(X=x\mid Y=c_k) \) : likelihood of observing \( x \) if the class were \( c_k \)
  • \( P(Y=c_k\mid X=x) \) : posterior (updated probability after seeing the features)

3) MAP decision rule (choose the best class) #

Since \( P(X=x) \) is the same for all classes, we can drop it when comparing classes.

\[ \hat{y}=\arg\max_{c_k}\; P(Y=c_k\mid X=x) =\arg\max_{c_k}\; P(X=x\mid Y=c_k)\,P(Y=c_k) \]

This is a comparison rule: we only need relative scores.
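
A quick numeric check that dropping \( P(X=x) \) never changes the winner (the scores below are hypothetical):

```python
# Hypothetical unnormalised scores: prior × product of conditionals.
scores = {"spam": 0.012, "ham": 0.030}

p_x = sum(scores.values())                       # evidence P(X = x)
posteriors = {c: s / p_x for c, s in scores.items()}

# Dividing by the common P(X) rescales every score but keeps the ranking.
best_raw = max(scores, key=scores.get)
best_posterior = max(posteriors, key=posteriors.get)
```

Both argmax computations pick the same class, which is why the denominator can be skipped when only the decision matters.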


4) The “naïve” assumption (conditional independence) #

Naïve Bayes assumes: features are conditionally independent given the class.

\[ P(X_1,\dots,X_n\mid Y)=\prod_{i=1}^{n} P(X_i\mid Y) \]

Plain meaning: once the class is fixed, learning one feature does not change the probability of another feature.


5) The Naïve Bayes scoring formula #

Using conditional independence:

\[ P(Y=c_k\mid X=x)\propto P(Y=c_k)\prod_{i=1}^{n} P(X_i=x_i\mid Y=c_k) \]

So, for each class:

  1. compute the prior \( P(Y=c_k) \)
  2. compute each conditional probability \( P(X_i=x_i\mid Y=c_k) \)
  3. multiply to get a score
  4. choose the maximum score
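
The four steps above can be sketched end to end on a tiny hand-made dataset (the rows and feature names are invented for illustration):

```python
from collections import Counter, defaultdict

# Tiny invented training set: each row is ({feature: value}, label).
data = [
    ({"outlook": "sunny", "windy": "false"}, "yes"),
    ({"outlook": "sunny", "windy": "true"},  "no"),
    ({"outlook": "rainy", "windy": "false"}, "yes"),
    ({"outlook": "rainy", "windy": "true"},  "no"),
    ({"outlook": "sunny", "windy": "false"}, "yes"),
]

# Step 1: priors from class counts.
class_counts = Counter(label for _, label in data)
priors = {c: n / len(data) for c, n in class_counts.items()}

# Step 2: conditional probabilities P(feature=value | class) from counts.
cond_counts = defaultdict(Counter)
for features, label in data:
    for feat, val in features.items():
        cond_counts[label][(feat, val)] += 1

def conditional(feat, val, label):
    return cond_counts[label][(feat, val)] / class_counts[label]

# Steps 3-4: multiply prior by conditionals, choose the maximum score.
def predict(features):
    def score(label):
        s = priors[label]
        for feat, val in features.items():
            s *= conditional(feat, val, label)
        return s
    return max(priors, key=score)
```

For example, `predict({"outlook": "sunny", "windy": "false"})` returns `"yes"` on this toy data, because every no-labelled row is windy.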

6) Pipeline #

%%{init: {'theme':'base','themeVariables': {
  'fontFamily':'Inter, ui-sans-serif, system-ui',
  'primaryColor':'#EAF7F1',
  'primaryTextColor':'#1F2937',
  'primaryBorderColor':'#A7E3C8',
  'lineColor':'#94A3B8',
  'tertiaryColor':'#F8FAFC'
}}}%%
flowchart TD
  A["Training data<br/>features + labels"] --> B["Count frequencies per class"]
  B --> C["Convert counts to probabilities<br/>priors + conditional probs"]
  C --> D["Score a new instance x<br/>P(class) × product P(feature|class)"]
  D --> E{"Pick the max score"}
  E --> F["Predicted class"]

7) Worked example A: Play Tennis (multiple features) #

Feature vector:

\( X=(\text{Outlook},\text{Temp},\text{Humidity},\text{Windy}) \)

Example instance:

  • Outlook = Sunny
  • Temp = Hot
  • Humidity = Normal
  • Windy = False

Compute two class scores (Yes/No):

\[ \text{Score(Yes)}= P(\text{Yes})\, P(\text{Sunny}\mid \text{Yes})\, P(\text{Hot}\mid \text{Yes})\, P(\text{Normal}\mid \text{Yes})\, P(\text{False}\mid \text{Yes}) \]

\[ \text{Score(No)}= P(\text{No})\, P(\text{Sunny}\mid \text{No})\, P(\text{Hot}\mid \text{No})\, P(\text{Normal}\mid \text{No})\, P(\text{False}\mid \text{No}) \]

Decision:

  • if Score(Yes) > Score(No) → predict Yes
  • otherwise → predict No

Tip: You do not need to divide by \( P(X) \) to choose the class, because it is common to both classes.
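
These two scores can be evaluated numerically. The probabilities below are the commonly quoted counts for the classic 14-row Play Tennis dataset (9 Yes, 5 No); treat them as assumed values here.

```python
# Assumed priors and conditionals for the instance
# (Sunny, Hot, Normal, Windy=False) from the classic Play Tennis table.
p_yes = 9 / 14
p_no = 5 / 14

given_yes = [2 / 9, 2 / 9, 6 / 9, 6 / 9]   # P(value | Yes) per feature
given_no  = [3 / 5, 2 / 5, 1 / 5, 2 / 5]   # P(value | No)  per feature

score_yes = p_yes
for p in given_yes:
    score_yes *= p

score_no = p_no
for p in given_no:
    score_no *= p

prediction = "Yes" if score_yes > score_no else "No"
```

With these counts, Score(Yes) ≈ 0.0141 and Score(No) ≈ 0.0069, so the instance is classified as Yes.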


8) Worked example B: Binary attributes (Pass/Fail style) #

Suppose the class is: \( Y\in\{\text{Pass},\text{Fail}\} \)

Attributes:

  • Confident = Yes/No
  • Sick = Yes/No

For a new instance: Confident = Yes, Sick = No

Compute the score for each class:

\[ \text{Score(Pass)}= P(\text{Pass})\, P(\text{Confident=Yes}\mid \text{Pass})\, P(\text{Sick=No}\mid \text{Pass}) \]

\[ \text{Score(Fail)}= P(\text{Fail})\, P(\text{Confident=Yes}\mid \text{Fail})\, P(\text{Sick=No}\mid \text{Fail}) \]

Pick the larger score.


9) Practical issue: zero-frequency problem #

If any conditional probability becomes 0, the entire product becomes 0.

Zero-frequency problem: A single zero can kill the entire score. This is why smoothing is used.
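
A one-line demonstration of the problem, using hypothetical conditional probabilities where one feature value was never seen with this class:

```python
# Hypothetical conditionals; the third value never occurred with this class.
probs = [6 / 9, 4 / 9, 0 / 9, 2 / 9]

score = 1.0
for p in probs:
    score *= p
# The single zero wipes out the whole product, no matter how strong
# the other conditionals are.
```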


10) Laplace smoothing #

Laplace smoothing adds a constant \( k \) (commonly \( k=1 \)) to every count.

\[ P(w\mid c)=\frac{\operatorname{count}(w,c)+k}{\operatorname{count}(c)+k|V|} \]

Where:

  • \( |V| \) is vocabulary size
  • \( k=1 \) is common
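
A minimal sketch of the formula (`smoothed_conditional` is a hypothetical helper name):

```python
def smoothed_conditional(count_wc, count_c, vocab_size, k=1):
    """P(w | c) with add-k (Laplace) smoothing."""
    return (count_wc + k) / (count_c + k * vocab_size)

# An unseen word (count 0) now gets a small but non-zero probability:
# with 10 words in class c and a 20-word vocabulary, P = 1 / 30.
p_unseen = smoothed_conditional(0, 10, 20)
```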

11) Worked example C: text classification with smoothing #

Classes:

  • Sports
  • Not sports

Sentence: “A very close game”

Words are features: \( (\text{a},\text{very},\text{close},\text{game}) \)

Score each class:

\[ \text{Score(Sports)}= P(\text{Sports})\prod_{w\in\text{sentence}} P(w\mid \text{Sports}) \]

\[ \text{Score(Not sports)}= P(\text{Not sports})\prod_{w\in\text{sentence}} P(w\mid \text{Not sports}) \]
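
The whole pipeline can be sketched on an assumed toy corpus (the training sentences below are invented; only the test sentence comes from the text):

```python
from collections import Counter

# Invented toy corpus: documents per class.
docs = {
    "Sports":     ["a great game", "a very clean match", "a clean but forgettable game"],
    "Not sports": ["it was a close election", "the election was over"],
}

word_counts = {c: Counter(w for d in ds for w in d.split()) for c, ds in docs.items()}
totals = {c: sum(wc.values()) for c, wc in word_counts.items()}
vocab = {w for wc in word_counts.values() for w in wc}
n_docs = sum(len(ds) for ds in docs.values())

def score(sentence, c, k=1):
    """Prior × product of Laplace-smoothed word probabilities."""
    s = len(docs[c]) / n_docs
    for w in sentence.split():
        s *= (word_counts[c][w] + k) / (totals[c] + k * len(vocab))
    return s

sentence = "a very close game"
prediction = max(docs, key=lambda c: score(sentence, c))
```

Note that "close" never appears in the Sports documents, so without smoothing the Sports score would be exactly zero; with \( k=1 \), both classes get a usable score.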

12) Variants (choose by feature type) #

%%{init: {'theme':'base','themeVariables': {
  'fontFamily':'Inter, ui-sans-serif, system-ui',
  'primaryColor':'#FFF3E6',
  'primaryTextColor':'#1F2937',
  'primaryBorderColor':'#FFD6A5',
  'lineColor':'#94A3B8',
  'tertiaryColor':'#F8FAFC'
}}}%%
flowchart TD
  A["Naïve Bayes: choose by feature type"] --> B["Continuous values"]
  A --> C["Counts / frequencies"]
  A --> D["Binary (0/1)"]

  B --> B1["Gaussian NB<br/>Normal distribution per class"]
  C --> C1["Multinomial NB<br/>term counts / word frequencies"]
  D --> D1["Bernoulli NB<br/>presence vs absence"]

Gaussian Naïve Bayes #

Continuous features. Assumes a Normal distribution per class.

Multinomial Naïve Bayes #

Count features. Used for word counts / term frequencies.

Bernoulli Naïve Bayes #

Binary features (0/1). Used for presence vs absence.
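
For the Gaussian variant, a sketch with assumed per-class means and variances for a single continuous feature (all numbers hypothetical):

```python
import math

def gaussian_pdf(x, mean, var):
    """Likelihood of a continuous feature under a per-class Normal."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

# Hypothetical per-class (mean, variance) of an exam score, plus equal priors.
stats = {"Pass": (75.0, 25.0), "Fail": (50.0, 100.0)}
priors = {"Pass": 0.5, "Fail": 0.5}

def predict(x):
    # Same MAP rule as before: prior × likelihood, take the max.
    return max(stats, key=lambda c: priors[c] * gaussian_pdf(x, *stats[c]))
```

A score near the Pass mean (e.g. 72) is classified Pass, while one near the Fail mean (e.g. 45) is classified Fail.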


13) Advantages, disadvantages, applications #

Advantages:

  • easy to implement and computationally efficient
  • works well with many features (especially text)
  • performs well even with limited training data

Disadvantages:

  • conditional independence assumption may not be true
  • can be influenced by irrelevant attributes
  • without smoothing, unseen events can cause zero probabilities

Applications:

  • spam email filtering
  • sentiment analysis and document categorisation
  • medical diagnosis support
  • credit scoring

Mini-check (self-test) #

  1. What does “naïve” mean here?
  2. Why can we skip dividing by \( P(X) \) when classifying?
  3. What problem does Laplace smoothing fix?

Answers:

  1. Features are assumed conditionally independent given the class.
  2. \( P(X) \) is common to all classes, so it cancels when comparing.
  3. The zero-frequency problem (a single zero probability makes the whole score 0).


What’s next #

Probability Distributions
Move from events to random variables and distributions.

