Gaussian Mixture Model & Expectation Maximization #
A Gaussian Mixture Model represents data as a weighted combination of multiple Gaussian distributions.
It is commonly used for soft clustering and density estimation.
Key takeaway:
K-means gives hard cluster membership. GMM gives probabilities of belonging to each cluster.
- Gaussian Mixture Model
- soft clustering
- mixing coefficients
- latent variables
- likelihood and log-likelihood
- Expectation-Maximization algorithm
- E-step and M-step
- responsibilities
- convergence
Motivation ☆ #
Many real datasets are not described well by one Gaussian distribution.
For example, customer spending may contain:
- low spenders
- medium spenders
- high spenders
A single Gaussian may blur these groups.
A mixture of Gaussians can model them better.
Hard Clustering vs Soft Clustering ☆ #
| Method | Membership Type | Meaning |
|---|---|---|
| K-means | Hard clustering | Each point belongs to exactly one cluster |
| GMM | Soft clustering | Each point has probability of belonging to each cluster |
Example:
A customer may be:
- 0.80 likely to be a low spender
- 0.20 likely to be a high spender
This is more flexible than forcing a single hard label.
GMM Intuition #
```mermaid
flowchart LR
A[Observed Data] --> B[Hidden Gaussian Component 1]
A --> C[Hidden Gaussian Component 2]
A --> D[Hidden Gaussian Component K]
B --> E[Soft Cluster Probabilities]
C --> E
D --> E
E --> F[Updated Parameters]
F --> B
F --> C
F --> D
style A fill:#E1F5FE
style B fill:#C8E6C9
style C fill:#FFF9C4
style D fill:#EDE7F6
style E fill:#E1F5FE
style F fill:#C8E6C9
```
Mathematical Representation ☆ #
For a model with \( K \) Gaussian components, the GMM density is:
\[ p(x) = \sum_{k=1}^{K} \pi_k \mathcal{N}(x \mid \mu_k, \Sigma_k) \]
Where:
| Symbol | Meaning |
|---|---|
| \( K \) | number of Gaussian components |
| \( \pi_k \) | mixing coefficient of component \( k \) |
| \( \mu_k \) | mean of component \( k \) |
| \( \Sigma_k \) | covariance of component \( k \) |
| \( \mathcal{N}(x \mid \mu_k, \Sigma_k) \) | Gaussian density |
The mixing coefficients must satisfy:
\[ \sum_{k=1}^{K} \pi_k = 1 \]
\[ 0 \leq \pi_k \leq 1 \]
Gaussian Density ☆ #
For multivariate data:
\[ \mathcal{N}(x \mid \mu, \Sigma) = \frac{1}{(2\pi)^{d/2}|\Sigma|^{1/2}}\exp\left(-\frac{1}{2}(x-\mu)^T\Sigma^{-1}(x-\mu)\right) \]
For one-dimensional data:
\[ \mathcal{N}(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right) \]
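As a small illustration of these formulas, the one-dimensional mixture density can be evaluated directly with NumPy. This is a minimal sketch with made-up parameter values; the function names are ours, not part of any library.
```python
import numpy as np

def gaussian_pdf(x, mu, sigma2):
    """One-dimensional Gaussian density N(x | mu, sigma^2)."""
    return np.exp(-(x - mu) ** 2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)

def gmm_density(x, pis, mus, sigma2s):
    """Mixture density p(x) = sum_k pi_k * N(x | mu_k, sigma_k^2)."""
    return sum(pi * gaussian_pdf(x, mu, s2) for pi, mu, s2 in zip(pis, mus, sigma2s))

# Illustrative two-component mixture (values are invented for the example)
pis, mus, sigma2s = [0.6, 0.4], [3.0, 11.0], [1.0, 1.5]
print(gmm_density(5.0, pis, mus, sigma2s))
```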
Components of GMM ☆ #
Mean #
The mean controls the centre of each Gaussian component.
Covariance #
The covariance controls the spread, shape and orientation of each component.
Mixing Coefficient #
The mixing coefficient controls how much each component contributes to the overall mixture.
Number of Components #
The number of components decides how many Gaussian clusters are used.
Too few components may underfit.
Too many components may overfit.
Latent Variables ☆ #
In GMM, we assume each data point was generated by one hidden Gaussian component.
The component label is not directly observed.
This hidden label is called a latent variable.
A common representation is:
\[ z_{nk} = 1 \quad \text{if data point } n \text{ belongs to component } k, \quad \text{and } 0 \text{ otherwise} \]
But in soft clustering, we estimate probabilities rather than fixed labels.
Responsibility ☆ #
Responsibility is the posterior probability that component \( k \) generated data point \( x_n \).
\[ \gamma(z_{nk}) = P(z_{nk}=1 \mid x_n) \]
Using Bayes' theorem:
\[ \gamma(z_{nk}) = \frac{\pi_k \mathcal{N}(x_n \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K}\pi_j \mathcal{N}(x_n \mid \mu_j, \Sigma_j)} \]
Responsibility answers:
How responsible is component k for generating this point?
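A quick made-up numeric check of the formula: suppose \( \pi_1 = \pi_2 = 0.5 \), \( \mathcal{N}(x_n \mid \mu_1, \Sigma_1) = 0.20 \) and \( \mathcal{N}(x_n \mid \mu_2, \Sigma_2) = 0.05 \). Then:
\[ \gamma(z_{n1}) = \frac{0.5 \times 0.20}{0.5 \times 0.20 + 0.5 \times 0.05} = \frac{0.100}{0.125} = 0.8, \qquad \gamma(z_{n2}) = 0.2 \]
So component 1 is far more responsible for this point, but component 2 is not ruled out.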
Likelihood and Log-Likelihood ☆ #
For independent observations:
\[ p(X \mid \pi, \mu, \Sigma) = \prod_{n=1}^{N}\sum_{k=1}^{K}\pi_k\mathcal{N}(x_n \mid \mu_k, \Sigma_k) \]
The log-likelihood is:
\[ \ln p(X \mid \pi, \mu, \Sigma) = \sum_{n=1}^{N}\ln\left(\sum_{k=1}^{K}\pi_k\mathcal{N}(x_n \mid \mu_k, \Sigma_k)\right) \]
For a single Gaussian, maximum likelihood has a simple closed-form solution.
For a mixture of Gaussians, the summation inside the logarithm makes direct optimisation harder.
This is why EM is used.
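A minimal NumPy sketch of this log-likelihood for one-dimensional data; the function name and the toy parameter values are our own choices for illustration:
```python
import numpy as np

def log_likelihood(X, pis, mus, sigma2s):
    """ln p(X) = sum_n ln( sum_k pi_k * N(x_n | mu_k, sigma_k^2) ), for 1-D data."""
    X = np.asarray(X, dtype=float)[:, None]               # shape (N, 1)
    pis, mus, s2 = map(np.asarray, (pis, mus, sigma2s))   # each shape (K,)
    comp = pis * np.exp(-(X - mus) ** 2 / (2 * s2)) / np.sqrt(2 * np.pi * s2)
    return float(np.sum(np.log(comp.sum(axis=1))))        # log of the inner sum, summed over n

print(log_likelihood([2, 3, 4, 10, 11, 12], [0.5, 0.5], [3.0, 11.0], [1.0, 1.0]))
```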
Expectation-Maximization Algorithm ☆ #
EM is an iterative method for estimating GMM parameters when cluster labels are hidden.
It alternates between:
| Step | Name | What Happens |
|---|---|---|
| E-step | Expectation | Estimate responsibilities using current parameters |
| M-step | Maximization | Update parameters using those responsibilities |
```mermaid
flowchart TD
A[Initialise means covariances and mixing coefficients] --> B[E-step: compute responsibilities]
B --> C[M-step: update parameters]
C --> D[Compute log-likelihood]
D --> E{Converged?}
E -- No --> B
E -- Yes --> F[Final GMM]
style A fill:#E1F5FE
style B fill:#FFF9C4
style C fill:#C8E6C9
style D fill:#EDE7F6
style E fill:#FFF9C4
style F fill:#C8E6C9
```
E-Step Formula ☆ #
Use current parameters to calculate responsibilities:
\[ \gamma(z_{nk}) = \frac{\pi_k \mathcal{N}(x_n \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K}\pi_j \mathcal{N}(x_n \mid \mu_j, \Sigma_j)} \]
This gives a soft assignment of every data point to every component.
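A minimal 1-D NumPy sketch of the E-step, assuming the parameters are held in plain arrays (the names are illustrative):
```python
import numpy as np

def e_step(X, pis, mus, sigma2s):
    """Return the (N, K) responsibility matrix gamma(z_nk) for 1-D data."""
    X = np.asarray(X, dtype=float)[:, None]               # (N, 1)
    pis, mus, s2 = map(np.asarray, (pis, mus, sigma2s))   # (K,)
    weighted = pis * np.exp(-(X - mus) ** 2 / (2 * s2)) / np.sqrt(2 * np.pi * s2)
    return weighted / weighted.sum(axis=1, keepdims=True)  # normalise over components

gamma = e_step([2, 3, 4, 10, 11, 12], [0.5, 0.5], [3.0, 11.0], [1.0, 1.0])
```
Each row of `gamma` sums to 1, which is exactly the soft assignment described above.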
M-Step Formulas ☆ #
First compute the effective number of points assigned to component \( k \):
\[ N_k = \sum_{n=1}^{N}\gamma(z_{nk}) \]
Update the mean:
\[ \mu_k^{new} = \frac{1}{N_k}\sum_{n=1}^{N}\gamma(z_{nk})x_n \]
Update the covariance:
\[ \Sigma_k^{new} = \frac{1}{N_k}\sum_{n=1}^{N}\gamma(z_{nk})(x_n - \mu_k^{new})(x_n - \mu_k^{new})^T \]
Update the mixing coefficient:
\[ \pi_k^{new} = \frac{N_k}{N} \]
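Continuing the same 1-D sketch (again with illustrative names), the M-step turns a responsibility matrix into updated parameters:
```python
import numpy as np

def m_step(X, gamma):
    """Update 1-D GMM parameters (pis, mus, sigma2s) from the (N, K) responsibilities."""
    X = np.asarray(X, dtype=float)
    N_k = gamma.sum(axis=0)                                        # effective counts N_k
    mus = (gamma * X[:, None]).sum(axis=0) / N_k                   # weighted means
    sigma2s = (gamma * (X[:, None] - mus) ** 2).sum(axis=0) / N_k  # weighted variances
    pis = N_k / len(X)                                             # mixing coefficients
    return pis, mus, sigma2s
```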
EM Convergence ☆ #
Each EM iteration is guaranteed not to decrease the log-likelihood.
In practice, stop when (see the loop sketch after this list):
- change in log-likelihood is very small
- parameter updates become very small
- maximum number of iterations is reached
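Putting the pieces together, a rough driver loop using the hypothetical `e_step`, `m_step` and `log_likelihood` helpers sketched above might check convergence like this:
```python
import numpy as np

def fit_gmm(X, pis, mus, sigma2s, max_iter=100, tol=1e-6):
    """Alternate E- and M-steps until the log-likelihood stops improving."""
    prev_ll = -np.inf
    for _ in range(max_iter):
        gamma = e_step(X, pis, mus, sigma2s)          # E-step: soft assignments
        pis, mus, sigma2s = m_step(X, gamma)          # M-step: parameter updates
        ll = log_likelihood(X, pis, mus, sigma2s)     # monitor progress
        if ll - prev_ll < tol:                        # change in log-likelihood is tiny
            break
        prev_ll = ll
    return pis, mus, sigma2s
```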
One Iteration Intuition ☆ #
Suppose customer spending values are:
\[ X = \{2, 3, 4, 10, 11, 12\} \]
The data appears to contain two groups:
- low spenders around 3
- high spenders around 11
A GMM with two components starts with initial guesses for two Gaussians.
Then:
- E-step estimates how likely each point belongs to each group.
- M-step updates the two means, variances and mixing coefficients.
- The process repeats until the model stabilises (see the short scikit-learn example after this list).
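If scikit-learn is available, the same toy data can be fitted with its `GaussianMixture` class; the exact numbers depend on initialisation, but the means should land roughly near 3 and 11:
```python
import numpy as np
from sklearn.mixture import GaussianMixture

X = np.array([2, 3, 4, 10, 11, 12], dtype=float).reshape(-1, 1)
gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
print(gmm.weights_)          # mixing coefficients pi_k
print(gmm.means_.ravel())    # component means
print(gmm.predict_proba(X))  # responsibilities (soft assignments)
```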
GMM vs K-means ☆ #
| Aspect | K-means | GMM |
|---|---|---|
| Cluster assignment | Hard | Soft |
| Cluster shape | Mostly spherical | Can be elliptical |
| Output | Cluster labels | Cluster probabilities |
| Model type | Distance-based | Probabilistic |
| Uses covariance | No | Yes |
| Handles overlap | Poorly | Better |
Use GMM when clusters may overlap and probabilistic membership is useful.
Problems and Practical Issues ☆ #
Local Optimum #
EM may converge to a local optimum.
Use multiple initialisations.
Singular Covariance #
A component may collapse onto a single point, driving its covariance towards a singular matrix and making the likelihood blow up.
Adding a small regularisation term to the covariance (or enforcing a minimum variance) can help.
Choosing Number of Components #
Common methods (a BIC-based sketch follows this list):
- AIC
- BIC
- validation likelihood
- domain knowledge
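One common way to apply these criteria in practice (again assuming scikit-learn) is to fit several candidate values of \( K \) and keep the model with the lowest BIC; a rough sketch:
```python
import numpy as np
from sklearn.mixture import GaussianMixture

X = np.array([2, 3, 4, 10, 11, 12], dtype=float).reshape(-1, 1)
models = [GaussianMixture(n_components=k, random_state=0).fit(X) for k in (1, 2, 3)]
best = min(models, key=lambda m: m.bic(X))   # lower BIC is better
print(best.n_components, [round(m.bic(X), 1) for m in models])
```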
Exam Workflow ☆ #
| Question Clue | What to Write |
|---|---|
| “Mixture of Gaussians” | Write GMM density formula |
| “Soft clustering” | Mention responsibilities |
| “Hidden labels” | Mention latent variables |
| “How parameters are estimated” | Explain EM |
| “E-step” | Compute posterior probabilities |
| “M-step” | Update means, covariance and mixing coefficients |
| “Convergence” | Log-likelihood stops changing significantly |
Why It Matters in AI and ML #
GMM and EM are used in:
- clustering
- density estimation
- anomaly detection
- speech modelling
- image segmentation
- customer segmentation
- missing-data problems
GMM is a probabilistic clustering model.
EM is the optimisation method that helps estimate its hidden structure.
Revision Checklist ☆ #
- Can I write the GMM density formula?
- Can I explain mixing coefficients?
- Can I explain why GMM is soft clustering?
- Can I define responsibility?
- Can I write the E-step formula?
- Can I write the M-step update for mean?
- Can I explain why EM is iterative?
- Can I compare GMM with K-means?