Gaussian Mixture Model & Expectation Maximization #

A Gaussian Mixture Model represents data as a weighted combination of multiple Gaussian distributions.

It is commonly used for soft clustering and density estimation.

Key takeaway:
K-means gives hard cluster membership.

GMM gives probabilities of belonging to each cluster.

Key terms in this section:

  • Gaussian Mixture Model
  • soft clustering
  • mixing coefficients
  • latent variables
  • likelihood and log-likelihood
  • Expectation-Maximization algorithm
  • E-step and M-step
  • responsibilities
  • convergence

Motivation ☆ #

Many real datasets are not described well by one Gaussian distribution.

For example, customer spending may contain:

  • low spenders
  • medium spenders
  • high spenders

A single Gaussian may blur these groups.

A mixture of Gaussians can model them better.


Hard Clustering vs Soft Clustering ☆ #

Method  | Membership Type | Meaning
--------|-----------------|----------------------------------------------------------
K-means | Hard clustering | Each point belongs to exactly one cluster
GMM     | Soft clustering | Each point has probability of belonging to each cluster

Example:

A customer may be:

  • 0.80 likely to be a low spender
  • 0.20 likely to be a high spender

This is more flexible than forcing a single hard label.


GMM Intuition #

flowchart LR
    A[Observed Data] --> B[Hidden Gaussian Component 1]
    A --> C[Hidden Gaussian Component 2]
    A --> D[Hidden Gaussian Component K]
    B --> E[Soft Cluster Probabilities]
    C --> E
    D --> E
    E --> F[Updated Parameters]
    F --> B
    F --> C
    F --> D

    style A fill:#E1F5FE
    style B fill:#C8E6C9
    style C fill:#FFF9C4
    style D fill:#EDE7F6
    style E fill:#E1F5FE
    style F fill:#C8E6C9

Mathematical Representation ☆ #

For a dataset with \( K \) Gaussian components, the GMM density is:

\[ p(x) = \sum_{k=1}^{K} \pi_k \mathcal{N}(x \mid \mu_k, \Sigma_k) \]

Where:

Symbol | Meaning
-------|--------
\( K \) | number of Gaussian components
\( \pi_k \) | mixing coefficient of component \( k \)
\( \mu_k \) | mean of component \( k \)
\( \Sigma_k \) | covariance of component \( k \)
\( \mathcal{N}(x \mid \mu_k, \Sigma_k) \) | Gaussian density

The mixing coefficients must satisfy:

\[ \sum_{k=1}^{K} \pi_k = 1 \] \[ 0 \leq \pi_k \leq 1 \]

Gaussian Density ☆ #

For multivariate data:

\[ \mathcal{N}(x \mid \mu, \Sigma) = \frac{1}{(2\pi)^{d/2}|\Sigma|^{1/2}}\exp\left(-\frac{1}{2}(x-\mu)^T\Sigma^{-1}(x-\mu)\right) \]

where \( d \) is the dimensionality of \( x \).

For one-dimensional data:

\[ \mathcal{N}(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right) \]
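
As a rough illustration, the mixture density above can be evaluated directly with NumPy and SciPy. This is a minimal sketch, not a library API: the function name and all parameter values are made up for the example.

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_density(x, weights, means, covariances):
    """Evaluate p(x) = sum_k pi_k * N(x | mu_k, Sigma_k) at a single point x."""
    return sum(
        w * multivariate_normal.pdf(x, mean=mu, cov=cov)
        for w, mu, cov in zip(weights, means, covariances)
    )

# Two illustrative 2-D components
weights = [0.6, 0.4]                                   # pi_k, must sum to 1
means = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]   # mu_k
covariances = [np.eye(2), 0.5 * np.eye(2)]             # Sigma_k

print(gmm_density(np.array([1.0, 1.0]), weights, means, covariances))
```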

Components of GMM ☆ #

Mean #

The mean controls the centre of each Gaussian component.

Covariance #

The covariance controls the spread, shape and orientation of each component.

Mixing Coefficient #

The mixing coefficient controls how much each component contributes to the overall mixture.

Number of Components #

The number of components decides how many Gaussian clusters are used.

Too few components may underfit.

Too many components may overfit.


Latent Variables ☆ #

In GMM, we assume each data point was generated by one hidden Gaussian component.

The component label is not directly observed.

This hidden label is called a latent variable.

A common representation is:

\[ z_{nk} = 1 \quad \text{if data point } n \text{ belongs to component } k \]

But in soft clustering, we estimate probabilities rather than fixed labels.
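
A minimal NumPy sketch of this generative view, assuming illustrative 1-D parameters: the hidden label \( z \) is drawn first from the mixing coefficients, and only then is the observation drawn from that component's Gaussian.

```python
import numpy as np

rng = np.random.default_rng(0)

weights = np.array([0.6, 0.4])   # pi_k
means = np.array([3.0, 11.0])    # component means (1-D example)
stds = np.array([1.0, 1.0])      # component standard deviations

def sample_gmm(n):
    z = rng.choice(len(weights), size=n, p=weights)   # latent labels (unobserved)
    x = rng.normal(means[z], stds[z])                 # observed data
    return x, z

x, z = sample_gmm(5)
print(x)   # we observe x only; z stays hidden
```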


Responsibility ☆ #

Responsibility is the posterior probability that component \( k \) generated data point \( x_n \).

\[ \gamma(z_{nk}) = P(z_{nk}=1 \mid x_n) \]

Using Bayes theorem:

\[ \gamma(z_{nk}) = \frac{\pi_k \mathcal{N}(x_n \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K}\pi_j \mathcal{N}(x_n \mid \mu_j, \Sigma_j)} \]

Responsibility answers:

How responsible is component k for generating this point?
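
For instance, with hypothetical values \( \pi_1 = \pi_2 = 0.5 \), \( \mathcal{N}(x_n \mid \mu_1, \Sigma_1) = 0.30 \) and \( \mathcal{N}(x_n \mid \mu_2, \Sigma_2) = 0.10 \):

\[ \gamma(z_{n1}) = \frac{0.5 \times 0.30}{0.5 \times 0.30 + 0.5 \times 0.10} = \frac{0.15}{0.20} = 0.75, \qquad \gamma(z_{n2}) = 0.25 \]

so the point is assigned mostly, but not entirely, to component 1.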


Likelihood and Log-Likelihood ☆ #

For independent observations:

\[ p(X \mid \pi, \mu, \Sigma) = \prod_{n=1}^{N}\sum_{k=1}^{K}\pi_k\mathcal{N}(x_n \mid \mu_k, \Sigma_k) \]

The log-likelihood is:

\[ \ln p(X \mid \pi, \mu, \Sigma) = \sum_{n=1}^{N}\ln\left(\sum_{k=1}^{K}\pi_k\mathcal{N}(x_n \mid \mu_k, \Sigma_k)\right) \]

For a single Gaussian, maximum likelihood has a simple closed-form solution.

For a mixture of Gaussians, the summation inside the logarithm makes direct optimisation harder.

This is why EM is used.
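
A small sketch of evaluating this log-likelihood with SciPy, using logsumexp to keep the sum inside the logarithm numerically stable; the function and variable names are illustrative, and X is assumed to be an (N, d) array.

```python
import numpy as np
from scipy.special import logsumexp
from scipy.stats import multivariate_normal

def gmm_log_likelihood(X, weights, means, covariances):
    # log_probs[n, k] = log pi_k + log N(x_n | mu_k, Sigma_k)
    log_probs = np.column_stack([
        np.log(w) + multivariate_normal.logpdf(X, mean=mu, cov=cov)
        for w, mu, cov in zip(weights, means, covariances)
    ])
    # ln p(X) = sum_n ln sum_k exp(log_probs[n, k])
    return logsumexp(log_probs, axis=1).sum()
```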


Expectation-Maximization Algorithm ☆ #

EM is an iterative method for estimating GMM parameters when cluster labels are hidden.

It alternates between:

Step   | Name         | What Happens
-------|--------------|----------------------------------------------------
E-step | Expectation  | Estimate responsibilities using current parameters
M-step | Maximization | Update parameters using those responsibilities

flowchart TD
    A[Initialise means covariances and mixing coefficients] --> B[E-step: compute responsibilities]
    B --> C[M-step: update parameters]
    C --> D[Compute log-likelihood]
    D --> E{Converged?}
    E -- No --> B
    E -- Yes --> F[Final GMM]

    style A fill:#E1F5FE
    style B fill:#FFF9C4
    style C fill:#C8E6C9
    style D fill:#EDE7F6
    style E fill:#FFF9C4
    style F fill:#C8E6C9

E-Step Formula ☆ #

Use current parameters to calculate responsibilities:

\[ \gamma(z_{nk}) = \frac{\pi_k \mathcal{N}(x_n \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K}\pi_j \mathcal{N}(x_n \mid \mu_j, \Sigma_j)} \]

This gives a soft assignment of every data point to every component.
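
A minimal NumPy sketch of the E-step, assuming X is an (N, d) array and the parameter names from the formulas above; all names are illustrative.

```python
import numpy as np
from scipy.stats import multivariate_normal

def e_step(X, weights, means, covariances):
    # Unnormalised posterior: pi_k * N(x_n | mu_k, Sigma_k), shape (N, K)
    weighted = np.column_stack([
        w * multivariate_normal.pdf(X, mean=mu, cov=cov)
        for w, mu, cov in zip(weights, means, covariances)
    ])
    # Normalise each row so the responsibilities for a point sum to 1
    return weighted / weighted.sum(axis=1, keepdims=True)
```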


M-Step Formulas ☆ #

First compute the effective number of points assigned to component \( k \):

\[ N_k = \sum_{n=1}^{N}\gamma(z_{nk}) \]

Update the mean:

\[ \mu_k^{new} = \frac{1}{N_k}\sum_{n=1}^{N}\gamma(z_{nk})x_n \]

Update the covariance:

\[ \Sigma_k^{new} = \frac{1}{N_k}\sum_{n=1}^{N}\gamma(z_{nk})(x_n - \mu_k^{new})(x_n - \mu_k^{new})^T \]

Update the mixing coefficient:

\[ \pi_k^{new} = \frac{N_k}{N} \]
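
A matching NumPy sketch of these M-step updates, assuming gamma is the (N, K) responsibility matrix produced by the E-step; names are illustrative.

```python
import numpy as np

def m_step(X, gamma):
    N, d = X.shape
    Nk = gamma.sum(axis=0)                               # effective counts N_k
    means = (gamma.T @ X) / Nk[:, None]                  # mu_k update, shape (K, d)
    covariances = []
    for k in range(gamma.shape[1]):
        diff = X - means[k]                              # (x_n - mu_k)
        cov = (gamma[:, k, None] * diff).T @ diff / Nk[k]
        covariances.append(cov)                          # Sigma_k update
    weights = Nk / N                                     # pi_k update
    return weights, means, covariances
```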

EM Convergence ☆ #

Each EM iteration is guaranteed not to decrease the log-likelihood.

In practice, stop when:

  • change in log-likelihood is very small
  • parameter updates become very small
  • maximum number of iterations is reached

A common stopping rule checks the change in log-likelihood, as in the sketch below:

\[ |\ell^{new} - \ell^{old}| < \epsilon \]
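
This sketch of the outer EM loop applies that stopping rule, reusing the e_step, m_step and gmm_log_likelihood sketches above; it assumes the caller supplies initial parameter guesses.

```python
import numpy as np

def fit_gmm(X, weights, means, covariances, tol=1e-6, max_iter=100):
    old_ll = -np.inf
    for _ in range(max_iter):
        gamma = e_step(X, weights, means, covariances)               # E-step
        weights, means, covariances = m_step(X, gamma)               # M-step
        new_ll = gmm_log_likelihood(X, weights, means, covariances)
        if abs(new_ll - old_ll) < tol:   # log-likelihood stopped improving
            break
        old_ll = new_ll
    return weights, means, covariances
```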

One Iteration Intuition ☆ #

Suppose customer spending values are:

\[ X = \{2,3,4,10,11,12\} \]

The data appears to contain two groups:

  • low spenders around 3
  • high spenders around 11

A GMM with two components starts with initial guesses for two Gaussians.

Then:

  1. E-step estimates how likely each point belongs to each group.
  2. M-step updates the two means, variances and mixing coefficients.
  3. The process repeats until the model stabilises (see the runnable sketch below).
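
A runnable sketch of this example using scikit-learn's GaussianMixture; the exact fitted numbers depend on initialisation, but with this well-separated data the means settle near 3 and 11.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

X = np.array([2, 3, 4, 10, 11, 12], dtype=float).reshape(-1, 1)

gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
print(gmm.means_.ravel())      # roughly one mean near 3 and one near 11
print(gmm.weights_)            # mixing coefficients, about 0.5 each
print(gmm.predict_proba(X))    # soft assignments (responsibilities)
```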

GMM vs K-means ☆ #

Aspect | K-means | GMM
-------|---------|----
Cluster assignment | Hard | Soft
Cluster shape | Mostly spherical | Can be elliptical
Output | Cluster labels | Cluster probabilities
Model type | Distance-based | Probabilistic
Uses covariance | No | Yes
Handles overlap | Poorly | Better

Use GMM when clusters may overlap and probabilistic membership is useful.
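
A short sketch contrasting the two outputs with scikit-learn; the data, including the point near the boundary between the groups, is illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

X = np.array([2, 3, 4, 6.5, 10, 11, 12], dtype=float).reshape(-1, 1)

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
gmm = GaussianMixture(n_components=2, random_state=0).fit(X)

print(kmeans.labels_)          # one hard label per point
print(gmm.predict_proba(X))    # a probability per point per component
```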


Problems and Practical Issues ☆ #

Local Optimum #

EM may converge to a local optimum.

Use multiple initialisations.

Singular Covariance #

A component may collapse around a single point, causing covariance problems.

Regularisation can help.
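
A minimal sketch of the usual fix, adding a small constant to the covariance diagonal in a hand-rolled M-step; scikit-learn exposes the same idea through its reg_covar parameter.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

cov = np.array([[1e-12]])                  # a nearly singular covariance
cov_regularised = cov + 1e-6 * np.eye(1)   # keep it invertible

gmm = GaussianMixture(n_components=2, reg_covar=1e-6)  # same idea in sklearn
```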

Choosing Number of Components #

Common methods (a BIC-based selection sketch follows this list):

  • AIC
  • BIC
  • validation likelihood
  • domain knowledge
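
A sketch of BIC-based selection with scikit-learn: fit GMMs with several values of \( K \) and keep the one with the lowest BIC. The synthetic data below is illustrative; aic(X) can be used the same way.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Two synthetic groups, roughly around 3 and 11
X = np.concatenate([rng.normal(3, 1, 100), rng.normal(11, 1, 100)]).reshape(-1, 1)

bics = {k: GaussianMixture(n_components=k, random_state=0).fit(X).bic(X)
        for k in range(1, 6)}
best_k = min(bics, key=bics.get)
print(bics, best_k)   # BIC is typically lowest near the true number of groups
```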

Exam Workflow ☆ #

Question Clue | What to Write
--------------|---------------
“Mixture of Gaussians” | Write GMM density formula
“Soft clustering” | Mention responsibilities
“Hidden labels” | Mention latent variables
“How parameters are estimated” | Explain EM
“E-step” | Compute posterior probabilities
“M-step” | Update means, covariance and mixing coefficients
“Convergence” | Log-likelihood stops changing significantly

Why It Matters in AI and ML #

GMM and EM are used in:

  • clustering
  • density estimation
  • anomaly detection
  • speech modelling
  • image segmentation
  • customer segmentation
  • missing-data problems

GMM is a probabilistic clustering model.

EM is the optimisation method that helps estimate its hidden structure.


Revision Checklist ☆ #

  • Can I write the GMM density formula?
  • Can I explain mixing coefficients?
  • Can I explain why GMM is soft clustering?
  • Can I define responsibility?
  • Can I write the E-step formula?
  • Can I write the M-step update for mean?
  • Can I explain why EM is iterative?
  • Can I compare GMM with K-means?
