
Naïve Bayes #

Naïve Bayes is a probabilistic classifier.

  • Supervised learning problem
  • Binary classification: the target variable takes one of two classes (e.g. Yes/No)
  • Each class is a hypothesis for the instance you want to classify
  • The total (prior) probability of each class is computed first
  • The posterior is the updated probability once you start studying the data
  • The instance is assigned to the class whose hypothesis has the maximum probability

It predicts a class label by computing:

  • a prior for each class
  • conditional probabilities of features given the class
  • a score for each class (multiply probabilities)
  • the class with the maximum score is chosen

Key takeaway: Naïve Bayes = Bayes + conditional independence.

It works by comparing:

\( P(\text{class}) \times \prod P(\text{feature}\mid \text{class}) \)
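
As a minimal sketch of this comparison, with made-up priors and conditionals (every number below is an assumption, not from a real dataset):

```python
# Hypothetical probabilities for a two-class problem (assumed values).
priors = {"Yes": 9 / 14, "No": 5 / 14}
conditionals = {
    "Yes": [2 / 9, 3 / 9],   # P(feature_i | Yes) for each observed feature
    "No":  [3 / 5, 2 / 5],   # P(feature_i | No)
}

def score(label):
    """Prior times the product of per-feature conditionals."""
    s = priors[label]
    for p in conditionals[label]:
        s *= p
    return s

prediction = max(priors, key=score)   # class with the larger score
```

With these particular numbers the "No" score is larger, so "No" is predicted.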

1) The classification setup #

We have:

  • feature vector: \( X=(X_1,X_2,\dots,X_n) \)
  • class label: \( Y \in \{c_1,c_2,\dots,c_K\} \)

Goal: given a new instance \( x \) , predict \( \hat{y} \) .

Examples:

  • Play = Yes/No given weather
  • Spam / Not spam given words
  • Pass / Fail given attributes

2) Bayes theorem for a class #

For a class \( Y=c_k \) and observed features \( X=x \) :

\[ P(Y=c_k\mid X=x)=\frac{P(X=x\mid Y=c_k)\,P(Y=c_k)}{P(X=x)} \]

Meaning:

  • \( P(Y=c_k) \) : prior probability of class \( c_k \)
  • \( P(X=x\mid Y=c_k) \) : likelihood of observing \( x \) if the class were \( c_k \)
  • \( P(Y=c_k\mid X=x) \) : posterior (updated probability after seeing the features)

3) MAP decision rule (choose the best class) #

Since \( P(X=x) \) is the same for all classes, we can drop it when comparing classes.

\[ \hat{y}=\arg\max_{c_k}\; P(Y=c_k\mid X=x) =\arg\max_{c_k}\; P(X=x\mid Y=c_k)\,P(Y=c_k) \]

This is a comparison rule: we only need relative scores.
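
A quick numeric check that dropping \( P(X=x) \) never changes the winner (the scores below are hypothetical):

```python
# Hypothetical unnormalised scores: prior × product of conditionals.
scores = {"spam": 0.012, "ham": 0.030}

p_x = sum(scores.values())                       # evidence P(X = x)
posteriors = {c: s / p_x for c, s in scores.items()}

# Dividing by the common P(X) rescales every score but keeps the ranking.
best_raw = max(scores, key=scores.get)
best_posterior = max(posteriors, key=posteriors.get)
```

Both argmax computations pick the same class, which is why the denominator can be skipped when only the decision matters.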


4) The “naïve” assumption (conditional independence) #

Naïve Bayes assumes: features are conditionally independent given the class.

\[ P(X_1,\dots,X_n\mid Y)=\prod_{i=1}^{n} P(X_i\mid Y) \]

Plain meaning: once the class is fixed, learning one feature does not change the probability of another feature.


5) The Naïve Bayes scoring formula #

Using conditional independence:

\[ P(Y=c_k\mid X=x)\propto P(Y=c_k)\prod_{i=1}^{n} P(X_i=x_i\mid Y=c_k) \]

So, for each class:

  1. compute the prior \( P(Y=c_k) \)
  2. compute each conditional probability \( P(X_i=x_i\mid Y=c_k) \)
  3. multiply to get a score
  4. choose the maximum score
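
The four steps above can be sketched end to end on a tiny hand-made dataset (the rows and feature names are invented for illustration):

```python
from collections import Counter, defaultdict

# Tiny invented training set: each row is ({feature: value}, label).
data = [
    ({"outlook": "sunny", "windy": "false"}, "yes"),
    ({"outlook": "sunny", "windy": "true"},  "no"),
    ({"outlook": "rainy", "windy": "false"}, "yes"),
    ({"outlook": "rainy", "windy": "true"},  "no"),
    ({"outlook": "sunny", "windy": "false"}, "yes"),
]

# Step 1: priors from class counts.
class_counts = Counter(label for _, label in data)
priors = {c: n / len(data) for c, n in class_counts.items()}

# Step 2: conditional probabilities P(feature=value | class) from counts.
cond_counts = defaultdict(Counter)
for features, label in data:
    for feat, val in features.items():
        cond_counts[label][(feat, val)] += 1

def conditional(feat, val, label):
    return cond_counts[label][(feat, val)] / class_counts[label]

# Steps 3-4: multiply prior by conditionals, choose the maximum score.
def predict(features):
    def score(label):
        s = priors[label]
        for feat, val in features.items():
            s *= conditional(feat, val, label)
        return s
    return max(priors, key=score)
```

For example, `predict({"outlook": "sunny", "windy": "false"})` returns `"yes"` on this toy data, because every no-labelled row is windy.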

6) Pipeline #

%%{init: {'theme':'base','themeVariables': {
  'fontFamily':'Inter, ui-sans-serif, system-ui',
  'primaryColor':'#EAF7F1',
  'primaryTextColor':'#1F2937',
  'primaryBorderColor':'#A7E3C8',
  'lineColor':'#94A3B8',
  'tertiaryColor':'#F8FAFC'
}}}%%
flowchart TD
  A["Training data<br/>features + labels"] --> B["Count frequencies per class"]
  B --> C["Convert counts to probabilities<br/>priors + conditional probs"]
  C --> D["Score a new instance x<br/>P(class) × product P(feature|class)"]
  D --> E{"Pick the max score"}
  E --> F["Predicted class"]

7) Worked example A: Play Tennis (multiple features) #

Feature vector:

\( X=(\text{Outlook},\text{Temp},\text{Humidity},\text{Windy}) \)

Example instance:

  • Outlook = Sunny
  • Temp = Hot
  • Humidity = Normal
  • Windy = False

Compute two class scores (Yes/No):

\[ \text{Score(Yes)}= P(\text{Yes})\, P(\text{Sunny}\mid \text{Yes})\, P(\text{Hot}\mid \text{Yes})\, P(\text{Normal}\mid \text{Yes})\, P(\text{False}\mid \text{Yes}) \]

\[ \text{Score(No)}= P(\text{No})\, P(\text{Sunny}\mid \text{No})\, P(\text{Hot}\mid \text{No})\, P(\text{Normal}\mid \text{No})\, P(\text{False}\mid \text{No}) \]

Decision:

  • if Score(Yes) > Score(No) → predict Yes
  • otherwise → predict No

Tip: You do not need to divide by \( P(X) \) to choose the class, because it is common to both classes.
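
These two scores can be evaluated numerically. The probabilities below are the commonly quoted counts for the classic 14-row Play Tennis dataset (9 Yes, 5 No); treat them as assumed values here.

```python
# Assumed priors and conditionals for the instance
# (Sunny, Hot, Normal, Windy=False) from the classic Play Tennis table.
p_yes = 9 / 14
p_no = 5 / 14

given_yes = [2 / 9, 2 / 9, 6 / 9, 6 / 9]   # P(value | Yes) per feature
given_no  = [3 / 5, 2 / 5, 1 / 5, 2 / 5]   # P(value | No)  per feature

score_yes = p_yes
for p in given_yes:
    score_yes *= p

score_no = p_no
for p in given_no:
    score_no *= p

prediction = "Yes" if score_yes > score_no else "No"
```

With these counts, Score(Yes) ≈ 0.0141 and Score(No) ≈ 0.0069, so the instance is classified as Yes.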


8) Worked example B: Binary attributes (Pass/Fail style) #

Suppose the class is: \( Y\in\{\text{Pass},\text{Fail}\} \)

Attributes:

  • Confident = Yes/No
  • Sick = Yes/No

For a new instance: Confident = Yes, Sick = No

Compute the score for each class:

\[ \text{Score(Pass)}= P(\text{Pass})\, P(\text{Confident=Yes}\mid \text{Pass})\, P(\text{Sick=No}\mid \text{Pass}) \]

\[ \text{Score(Fail)}= P(\text{Fail})\, P(\text{Confident=Yes}\mid \text{Fail})\, P(\text{Sick=No}\mid \text{Fail}) \]

Pick the larger score.


9) Practical issue: zero-frequency problem #

If any conditional probability becomes 0, the entire product becomes 0.

Zero-frequency problem: A single zero can kill the entire score. This is why smoothing is used.
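
A one-line demonstration of the problem, using hypothetical conditional probabilities where one feature value was never seen with this class:

```python
# Hypothetical conditionals; the third value never occurred with this class.
probs = [6 / 9, 4 / 9, 0 / 9, 2 / 9]

score = 1.0
for p in probs:
    score *= p
# The single zero wipes out the whole product, no matter how strong
# the other conditionals are.
```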


10) Laplace smoothing #

Laplace smoothing adds a constant \( k \) (commonly \( k=1 \)) to every count.

\[ P(w\mid c)=\frac{\operatorname{count}(w,c)+k}{\operatorname{count}(c)+k|V|} \]

Where:

  • \( |V| \) is vocabulary size
  • \( k=1 \) is common
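
A minimal sketch of the formula (`smoothed_conditional` is a hypothetical helper name):

```python
def smoothed_conditional(count_wc, count_c, vocab_size, k=1):
    """P(w | c) with add-k (Laplace) smoothing."""
    return (count_wc + k) / (count_c + k * vocab_size)

# An unseen word (count 0) now gets a small but non-zero probability:
# with 10 words in class c and a 20-word vocabulary, P = 1 / 30.
p_unseen = smoothed_conditional(0, 10, 20)
```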

11) Worked example C: text classification with smoothing #

Classes:

  • Sports
  • Not sports

Sentence: “A very close game”

Words are features: \( (\text{a},\text{very},\text{close},\text{game}) \)

Score each class:

\[ \text{Score(Sports)}= P(\text{Sports})\prod_{w\in\text{sentence}} P(w\mid \text{Sports}) \]

\[ \text{Score(Not sports)}= P(\text{Not sports})\prod_{w\in\text{sentence}} P(w\mid \text{Not sports}) \]
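
The whole pipeline can be sketched on an assumed toy corpus (the training sentences below are invented; only the test sentence comes from the text):

```python
from collections import Counter

# Invented toy corpus: documents per class.
docs = {
    "Sports":     ["a great game", "a very clean match", "a clean but forgettable game"],
    "Not sports": ["it was a close election", "the election was over"],
}

word_counts = {c: Counter(w for d in ds for w in d.split()) for c, ds in docs.items()}
totals = {c: sum(wc.values()) for c, wc in word_counts.items()}
vocab = {w for wc in word_counts.values() for w in wc}
n_docs = sum(len(ds) for ds in docs.values())

def score(sentence, c, k=1):
    """Prior × product of Laplace-smoothed word probabilities."""
    s = len(docs[c]) / n_docs
    for w in sentence.split():
        s *= (word_counts[c][w] + k) / (totals[c] + k * len(vocab))
    return s

sentence = "a very close game"
prediction = max(docs, key=lambda c: score(sentence, c))
```

Note that "close" never appears in the Sports documents, so without smoothing the Sports score would be exactly zero; with \( k=1 \), both classes get a usable score.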

12) Variants (choose by feature type) #

%%{init: {'theme':'base','themeVariables': {
  'fontFamily':'Inter, ui-sans-serif, system-ui',
  'primaryColor':'#FFF3E6',
  'primaryTextColor':'#1F2937',
  'primaryBorderColor':'#FFD6A5',
  'lineColor':'#94A3B8',
  'tertiaryColor':'#F8FAFC'
}}}%%
flowchart TD
  A["Naïve Bayes: choose by feature type"] --> B["Continuous values"]
  A --> C["Counts / frequencies"]
  A --> D["Binary (0/1)"]

  B --> B1["Gaussian NB<br/>Normal distribution per class"]
  C --> C1["Multinomial NB<br/>term counts / word frequencies"]
  D --> D1["Bernoulli NB<br/>presence vs absence"]

Gaussian Naïve Bayes #

Continuous features. Assumes a Normal distribution per class.

Multinomial Naïve Bayes #

Count features. Used for word counts / term frequencies.

Bernoulli Naïve Bayes #

Binary features (0/1). Used for presence vs absence.
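
For the Gaussian variant, a sketch with assumed per-class means and variances for a single continuous feature (all numbers hypothetical):

```python
import math

def gaussian_pdf(x, mean, var):
    """Likelihood of a continuous feature under a per-class Normal."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

# Hypothetical per-class (mean, variance) of an exam score, plus equal priors.
stats = {"Pass": (75.0, 25.0), "Fail": (50.0, 100.0)}
priors = {"Pass": 0.5, "Fail": 0.5}

def predict(x):
    # Same MAP rule as before: prior × likelihood, take the max.
    return max(stats, key=lambda c: priors[c] * gaussian_pdf(x, *stats[c]))
```

A score near the Pass mean (e.g. 72) is classified Pass, while one near the Fail mean (e.g. 45) is classified Fail.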


13) Advantages, disadvantages, applications #

Advantages:

  • easy to implement and computationally efficient
  • works well with many features (especially text)
  • performs well even with limited training data

Disadvantages:

  • conditional independence assumption may not be true
  • can be influenced by irrelevant attributes
  • without smoothing, unseen events can cause zero probabilities

Applications:

  • spam email filtering
  • sentiment analysis and document categorisation
  • medical diagnosis support
  • credit scoring

Mini-check (self-test) #

  1. What does “naïve” mean here?
  2. Why can we skip dividing by \( P(X) \) when classifying?
  3. What problem does Laplace smoothing fix?

Answers:

  1. Features are assumed conditionally independent given the class.
  2. \( P(X) \) is common to all classes, so it cancels when comparing.
  3. The zero-frequency problem (a single zero probability makes the whole score 0).


What’s next #

Probability Distributions
Move from events to random variables and distributions.

