Gaussian Mixture Model & Expectation Maximization #
A Gaussian Mixture Model represents data as a weighted combination of multiple Gaussian distributions.
It is commonly used for soft clustering and density estimation.
Key takeaway:
K-means gives hard cluster membership. GMM gives probabilities of belonging to each cluster.
- Gaussian Mixture Model
- soft clustering
- mixing coefficients
- latent variables
- likelihood and log-likelihood
- Expectation-Maximization algorithm
- E-step and M-step
- responsibilities
- convergence
Motivation ☆ #
Many real datasets are not described well by one Gaussian distribution.
For example, customer spending may contain:
- low spenders
- medium spenders
- high spenders
A single Gaussian may blur these groups.
A mixture of Gaussians can model them better.
Hard Clustering vs Soft Clustering ☆ #
| Method | Membership Type | Meaning |
|---|---|---|
| K-means | Hard clustering | Each point belongs to exactly one cluster |
| GMM | Soft clustering | Each point has probability of belonging to each cluster |
Example:
A customer may be:
- 0.80 likely to be a low spender
- 0.20 likely to be a high spender
This is more flexible than forcing a single hard label.
GMM Intuition #
```mermaid
flowchart LR
A[Observed Data] --> B[Hidden Gaussian Component 1]
A --> C[Hidden Gaussian Component 2]
A --> D[Hidden Gaussian Component K]
B --> E[Soft Cluster Probabilities]
C --> E
D --> E
E --> F[Updated Parameters]
F --> B
F --> C
F --> D
style A fill:#E1F5FE
style B fill:#C8E6C9
style C fill:#FFF9C4
style D fill:#EDE7F6
style E fill:#E1F5FE
style F fill:#C8E6C9
```
Mathematical Representation ☆ #
For a model with \( K \) Gaussian components, the GMM density is:
\[ p(x) = \sum_{k=1}^{K} \pi_k \mathcal{N}(x \mid \mu_k, \Sigma_k) \]
Where:
| Symbol | Meaning |
|---|---|
| \( K \) | number of Gaussian components |
| \( \pi_k \) | mixing coefficient of component \( k \) |
| \( \mu_k \) | mean of component \( k \) |
| \( \Sigma_k \) | covariance of component \( k \) |
| \( \mathcal{N}(x \mid \mu_k, \Sigma_k) \) | Gaussian density |
The mixing coefficients must satisfy:
\[ \sum_{k=1}^{K} \pi_k = 1 \]
\[ 0 \leq \pi_k \leq 1 \]
Gaussian Density ☆ #
For multivariate data:
\[ \mathcal{N}(x \mid \mu, \Sigma) = \frac{1}{(2\pi)^{d/2}|\Sigma|^{1/2}}\exp\left(-\frac{1}{2}(x-\mu)^T\Sigma^{-1}(x-\mu)\right) \]
For one-dimensional data:
\[ \mathcal{N}(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right) \]
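As a small illustration of these formulas, the one-dimensional mixture density can be evaluated directly with NumPy. This is a minimal sketch with made-up parameter values; the function names are ours, not part of any library.
```python
import numpy as np

def gaussian_pdf(x, mu, sigma2):
    """One-dimensional Gaussian density N(x | mu, sigma^2)."""
    return np.exp(-(x - mu) ** 2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)

def gmm_density(x, pis, mus, sigma2s):
    """Mixture density p(x) = sum_k pi_k * N(x | mu_k, sigma_k^2)."""
    return sum(pi * gaussian_pdf(x, mu, s2) for pi, mu, s2 in zip(pis, mus, sigma2s))

# Illustrative two-component mixture (values are invented for the example)
pis, mus, sigma2s = [0.6, 0.4], [3.0, 11.0], [1.0, 1.5]
print(gmm_density(5.0, pis, mus, sigma2s))
```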
Components of GMM ☆ #
Mean #
The mean controls the centre of each Gaussian component.
Covariance #
The covariance controls the spread, shape and orientation of each component.
Mixing Coefficient #
The mixing coefficient controls how much each component contributes to the overall mixture.
Number of Components #
The number of components decides how many Gaussian clusters are used.
Too few components may underfit.
Too many components may overfit.
Latent Variables ☆ #
In GMM, we assume each data point was generated by one hidden Gaussian component.
The component label is not directly observed.
This hidden label is called a latent variable.
A common representation is:
\[ z_{nk} = 1 \quad \text{if data point } n \text{ belongs to component } k, \quad \text{and } 0 \text{ otherwise} \]
But in soft clustering, we estimate probabilities rather than fixed labels.
Responsibility ☆ #
Responsibility is the posterior probability that component \( k \) generated data point \( x_n \).
\[ \gamma(z_{nk}) = P(z_{nk}=1 \mid x_n) \]
Using Bayes' theorem:
\[ \gamma(z_{nk}) = \frac{\pi_k \mathcal{N}(x_n \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K}\pi_j \mathcal{N}(x_n \mid \mu_j, \Sigma_j)} \]
Responsibility answers:
How responsible is component k for generating this point?
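A quick made-up numeric check of the formula: suppose \( \pi_1 = \pi_2 = 0.5 \), \( \mathcal{N}(x_n \mid \mu_1, \Sigma_1) = 0.20 \) and \( \mathcal{N}(x_n \mid \mu_2, \Sigma_2) = 0.05 \). Then:
\[ \gamma(z_{n1}) = \frac{0.5 \times 0.20}{0.5 \times 0.20 + 0.5 \times 0.05} = \frac{0.100}{0.125} = 0.8, \qquad \gamma(z_{n2}) = 0.2 \]
So component 1 is far more responsible for this point, but component 2 is not ruled out.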
Likelihood and Log-Likelihood ☆ #
For independent observations:
\[ p(X \mid \pi, \mu, \Sigma) = \prod_{n=1}^{N}\sum_{k=1}^{K}\pi_k\mathcal{N}(x_n \mid \mu_k, \Sigma_k) \]
The log-likelihood is:
\[ \ln p(X \mid \pi, \mu, \Sigma) = \sum_{n=1}^{N}\ln\left(\sum_{k=1}^{K}\pi_k\mathcal{N}(x_n \mid \mu_k, \Sigma_k)\right) \]
For a single Gaussian, maximum likelihood has a simple closed-form solution.
For a mixture of Gaussians, the summation inside the logarithm makes direct optimisation harder.
This is why EM is used.
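A minimal NumPy sketch of this log-likelihood for one-dimensional data; the function name and the toy parameter values are our own choices for illustration:
```python
import numpy as np

def log_likelihood(X, pis, mus, sigma2s):
    """ln p(X) = sum_n ln( sum_k pi_k * N(x_n | mu_k, sigma_k^2) ), for 1-D data."""
    X = np.asarray(X, dtype=float)[:, None]               # shape (N, 1)
    pis, mus, s2 = map(np.asarray, (pis, mus, sigma2s))   # each shape (K,)
    comp = pis * np.exp(-(X - mus) ** 2 / (2 * s2)) / np.sqrt(2 * np.pi * s2)
    return float(np.sum(np.log(comp.sum(axis=1))))        # log of the inner sum, summed over n

print(log_likelihood([2, 3, 4, 10, 11, 12], [0.5, 0.5], [3.0, 11.0], [1.0, 1.0]))
```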
Expectation-Maximization Algorithm ☆ #
EM is an iterative method for estimating GMM parameters when cluster labels are hidden.
It alternates between:
| Step | Name | What Happens |
|---|---|---|
| E-step | Expectation | Estimate responsibilities using current parameters |
| M-step | Maximization | Update parameters using those responsibilities |
```mermaid
flowchart TD
A[Initialise means covariances and mixing coefficients] --> B[E-step: compute responsibilities]
B --> C[M-step: update parameters]
C --> D[Compute log-likelihood]
D --> E{Converged?}
E -- No --> B
E -- Yes --> F[Final GMM]
style A fill:#E1F5FE
style B fill:#FFF9C4
style C fill:#C8E6C9
style D fill:#EDE7F6
style E fill:#FFF9C4
style F fill:#C8E6C9
```
E-Step Formula ☆ #
Use current parameters to calculate responsibilities:
\[ \gamma(z_{nk}) = \frac{\pi_k \mathcal{N}(x_n \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K}\pi_j \mathcal{N}(x_n \mid \mu_j, \Sigma_j)} \]
This gives a soft assignment of every data point to every component.
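A minimal 1-D NumPy sketch of the E-step, assuming the parameters are held in plain arrays (the names are illustrative):
```python
import numpy as np

def e_step(X, pis, mus, sigma2s):
    """Return the (N, K) responsibility matrix gamma(z_nk) for 1-D data."""
    X = np.asarray(X, dtype=float)[:, None]               # (N, 1)
    pis, mus, s2 = map(np.asarray, (pis, mus, sigma2s))   # (K,)
    weighted = pis * np.exp(-(X - mus) ** 2 / (2 * s2)) / np.sqrt(2 * np.pi * s2)
    return weighted / weighted.sum(axis=1, keepdims=True)  # normalise over components

gamma = e_step([2, 3, 4, 10, 11, 12], [0.5, 0.5], [3.0, 11.0], [1.0, 1.0])
```
Each row of `gamma` sums to 1, which is exactly the soft assignment described above.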
M-Step Formulas ☆ #
First compute the effective number of points assigned to component \( k \):
\[ N_k = \sum_{n=1}^{N}\gamma(z_{nk}) \]
Update the mean:
\[ \mu_k^{new} = \frac{1}{N_k}\sum_{n=1}^{N}\gamma(z_{nk})x_n \]
Update the covariance:
\[ \Sigma_k^{new} = \frac{1}{N_k}\sum_{n=1}^{N}\gamma(z_{nk})(x_n - \mu_k^{new})(x_n - \mu_k^{new})^T \]
Update the mixing coefficient:
\[ \pi_k^{new} = \frac{N_k}{N} \]
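Continuing the same 1-D sketch (again with illustrative names), the M-step turns a responsibility matrix into updated parameters:
```python
import numpy as np

def m_step(X, gamma):
    """Update 1-D GMM parameters (pis, mus, sigma2s) from the (N, K) responsibilities."""
    X = np.asarray(X, dtype=float)
    N_k = gamma.sum(axis=0)                                        # effective counts N_k
    mus = (gamma * X[:, None]).sum(axis=0) / N_k                   # weighted means
    sigma2s = (gamma * (X[:, None] - mus) ** 2).sum(axis=0) / N_k  # weighted variances
    pis = N_k / len(X)                                             # mixing coefficients
    return pis, mus, sigma2s
```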
EM Convergence ☆ #
Each EM iteration is guaranteed not to decrease the log-likelihood.
In practice, stop when (see the loop sketch after this list):
- change in log-likelihood is very small
- parameter updates become very small
- maximum number of iterations is reached
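Putting the pieces together, a rough driver loop using the hypothetical `e_step`, `m_step` and `log_likelihood` helpers sketched above might check convergence like this:
```python
import numpy as np

def fit_gmm(X, pis, mus, sigma2s, max_iter=100, tol=1e-6):
    """Alternate E- and M-steps until the log-likelihood stops improving."""
    prev_ll = -np.inf
    for _ in range(max_iter):
        gamma = e_step(X, pis, mus, sigma2s)          # E-step: soft assignments
        pis, mus, sigma2s = m_step(X, gamma)          # M-step: parameter updates
        ll = log_likelihood(X, pis, mus, sigma2s)     # monitor progress
        if ll - prev_ll < tol:                        # change in log-likelihood is tiny
            break
        prev_ll = ll
    return pis, mus, sigma2s
```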
One Iteration Intuition ☆ #
Suppose customer spending values are:
\[ X = \{2, 3, 4, 10, 11, 12\} \]
The data appears to contain two groups:
- low spenders around 3
- high spenders around 11
A GMM with two components starts with initial guesses for two Gaussians.
Then:
- E-step estimates how likely each point belongs to each group.
- M-step updates the two means, variances and mixing coefficients.
- The process repeats until the model stabilises (see the short scikit-learn example after this list).
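If scikit-learn is available, the same toy data can be fitted with its `GaussianMixture` class; the exact numbers depend on initialisation, but the means should land roughly near 3 and 11:
```python
import numpy as np
from sklearn.mixture import GaussianMixture

X = np.array([2, 3, 4, 10, 11, 12], dtype=float).reshape(-1, 1)
gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
print(gmm.weights_)          # mixing coefficients pi_k
print(gmm.means_.ravel())    # component means
print(gmm.predict_proba(X))  # responsibilities (soft assignments)
```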
GMM vs K-means ☆ #
| Aspect | K-means | GMM |
|---|---|---|
| Cluster assignment | Hard | Soft |
| Cluster shape | Mostly spherical | Can be elliptical |
| Output | Cluster labels | Cluster probabilities |
| Model type | Distance-based | Probabilistic |
| Uses covariance | No | Yes |
| Handles overlap | Poorly | Better |
Use GMM when clusters may overlap and probabilistic membership is useful.
Problems and Practical Issues ☆ #
Local Optimum #
EM may converge to a local optimum.
Use multiple initialisations.
Singular Covariance #
A component may collapse onto a single point, driving its covariance towards a singular matrix and making the likelihood blow up.
Adding a small regularisation term to the covariance (or enforcing a minimum variance) can help.
Choosing Number of Components #
Common methods (a BIC-based sketch follows this list):
- AIC
- BIC
- validation likelihood
- domain knowledge
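One common way to apply these criteria in practice (again assuming scikit-learn) is to fit several candidate values of \( K \) and keep the model with the lowest BIC; a rough sketch:
```python
import numpy as np
from sklearn.mixture import GaussianMixture

X = np.array([2, 3, 4, 10, 11, 12], dtype=float).reshape(-1, 1)
models = [GaussianMixture(n_components=k, random_state=0).fit(X) for k in (1, 2, 3)]
best = min(models, key=lambda m: m.bic(X))   # lower BIC is better
print(best.n_components, [round(m.bic(X), 1) for m in models])
```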
Exam Workflow ☆ #
| Question Clue | What to Write |
|---|---|
| “Mixture of Gaussians” | Write GMM density formula |
| “Soft clustering” | Mention responsibilities |
| “Hidden labels” | Mention latent variables |
| “How parameters are estimated” | Explain EM |
| “E-step” | Compute posterior probabilities |
| “M-step” | Update means, covariance and mixing coefficients |
| “Convergence” | Log-likelihood stops changing significantly |
Why It Matters in AI and ML #
GMM and EM are used in:
- clustering
- density estimation
- anomaly detection
- speech modelling
- image segmentation
- customer segmentation
- missing-data problems
GMM is a probabilistic clustering model.
EM is the optimisation method that helps estimate its hidden structure.
Revision Checklist ☆ #
- Can I write the GMM density formula?
- Can I explain mixing coefficients?
- Can I explain why GMM is soft clustering?
- Can I define responsibility?
- Can I write the E-step formula?
- Can I write the M-step update for mean?
- Can I explain why EM is iterative?
- Can I compare GMM with K-means?