Unsupervised Learning #
Unsupervised Learning is used when we have input data but no target labels.
The model is not told the correct answer. Instead, it tries to discover hidden structure in the data.
- K-means clustering and variants
- Review of the EM algorithm
- GMM-based soft clustering
- Applications
Supervised vs Unsupervised Learning #
| Aspect | Supervised Learning | Unsupervised Learning |
|---|---|---|
| Data contains target label? | Yes | No |
| Learns from | Input-output pairs | Input features only |
| Main goal | Predict output | Discover structure |
| Example task | Classification, regression | Clustering |
| Example algorithm | Logistic regression, decision tree | K-means, GMM |
- Works on unlabelled raw data.
- The algorithm discovers hidden patterns without prior knowledge of outcomes.
- Requires no human intervention during training.
- Does not make direct predictions — it groups or organises data instead.
- Carries a higher risk because there’s no ground truth to verify results.
- Common techniques include Clustering, Association, and Dimensionality Reduction.
The most common example is clustering, where similar records are grouped together.
Key takeaway:
Unsupervised Learning finds patterns in data without labelled outputs. For this module, the main focus is clustering: K-means, mixture models, EM, and GMM-based soft clustering.
stateDiagram-v2
    %% ML maths-based colours (same palette as supervised)
    classDef probability fill:#d1fae5,stroke:#065f46,stroke-width:1px
    classDef geometry fill:#ffedd5,stroke:#9a3412,stroke-width:1px
    classDef category font-style:italic,font-weight:bold,fill:#f3f4f6,stroke:#374151

    %% Root
    USL: Unsupervised Learning

    %% Main branches
    USL --> CLU:::category
    CLU: Clustering
    USL --> DR:::category
    DR: Dimensionality Reduction

    %% Clustering algorithms
    CLU --> KM:::geometry
    KM: K-Means
    CLU --> HC:::geometry
    HC: Hierarchical Clustering
    CLU --> DB:::geometry
    DB: DBSCAN

    %% Probabilistic models
    USL --> PM:::category
    PM: Probabilistic Models
    PM --> GMM:::probability
    GMM: Gaussian Mixture Model
    PM --> HMM:::probability
    HMM: Hidden Markov Model
Clustering #
- Groups similar data points together based on shared features.
- Commonly used for market segmentation, image compression, and anomaly detection.
Association #
- Identifies relationships or correlations between variables in a dataset.
- Commonly used in market basket analysis (e.g. “Customers who bought X also bought Y”).
Common Techniques #
- Apriori Algorithm – Finds frequent itemsets and generates association rules.
- Eclat Algorithm – Similar to Apriori but uses set intersections for faster computation.
Dimensionality Reduction #
- Reduces the number of input variables to simplify data.
- Helps remove noise and redundancy.
- Commonly used in data pre-processing and visualisation.
Common Techniques #
- Principal Component Analysis (PCA) – Projects data onto fewer dimensions while keeping most variance.
- Linear Discriminant Analysis (LDA) – Focuses on class separation.
- t-SNE (t-Distributed Stochastic Neighbour Embedding) – Used for visualising high-dimensional data.
- Autoencoders – Neural networks that compress and reconstruct data.
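As a concrete illustration of dimensionality reduction, here is a minimal PCA sketch using scikit-learn; the toy data and the choice of two components are assumptions for the example:

```python
import numpy as np
from sklearn.decomposition import PCA

# Assumed toy data: 100 samples with 5 features, one of them redundant
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
X[:, 3] = X[:, 0] + 0.1 * rng.normal(size=100)   # near-duplicate of feature 0

# Project onto 2 principal components
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                  # (100, 2)
print(pca.explained_variance_ratio_)    # fraction of variance kept per component
```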
mindmap
  root(Unsupervised Learning)
    Clustering
      K Means
      Hierarchical Clustering
      DBSCAN
    Dimensionality Reduction
      PCA
      t SNE
      Autoencoders
    Probabilistic Models
      Gaussian Mixture Model
      Hidden Markov Model
Clustering ☆ #
Clustering means grouping data points so that:
- points inside the same cluster are similar
- points in different clusters are dissimilar
For example:
- grouping customers by buying behaviour
- grouping documents by topic
- grouping images by visual similarity
- grouping patients by medical measurements
flowchart LR
A["Unlabelled Data"] --> B["Similarity / Distance Measure"]
B --> C["Clustering Algorithm"]
C --> D["Groups / Clusters"]
D --> E["Interpret Hidden Structure"]
style A fill:#E1F5FE,stroke:#5b7db1,color:#222
style B fill:#C8E6C9,stroke:#5f8f6a,color:#222
style C fill:#FFF9C4,stroke:#b59b3b,color:#222
style D fill:#EDE7F6,stroke:#8a6fb3,color:#222
style E fill:#FDE2E4,stroke:#b85c68,color:#222
Common Types of Clustering #
- K-Means Clustering – Divides data into K groups based on similarity.
- Hierarchical Clustering – Builds a hierarchy (tree) of clusters.
- DBSCAN (Density-Based Spatial Clustering) – Groups points close in density; identifies noise/outliers.
K-means Clustering ☆ #
K-means is a clustering algorithm that divides data into \( K \) clusters.
Each cluster has a centre called a centroid.
The algorithm repeatedly:
- assigns each point to the nearest centroid
- updates each centroid using the mean of the assigned points
- repeats until the centroids stop changing significantly
K-means Objective Function ☆ #
The goal of K-means is to minimise the distance between points and their assigned cluster centroids.
\[ J = \sum_{k=1}^{K} \sum_{x_i \in C_k} \|x_i - \mu_k\|^2 \]

Where:
| Symbol | Meaning |
|---|---|
| \( K \) | Number of clusters |
| \( C_k \) | Cluster \( k \) |
| \( x_i \) | Data point |
| \( \mu_k \) | Centroid of cluster \( k \) |
| \( J \) | Within-cluster squared error |
K-means tries to make points close to their own centroid.
Lower objective value means tighter clusters.
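To make the objective concrete, here is a minimal NumPy sketch that evaluates \( J \) for a toy configuration (the points, assignments, and centroids are all assumed):

```python
import numpy as np

# Assumed toy example: 2-D points, their cluster assignments, and centroids
X = np.array([[1.0, 1.0], [1.5, 2.0], [8.0, 8.0], [9.0, 9.5]])
labels = np.array([0, 0, 1, 1])          # cluster index for each point
centroids = np.array([[1.25, 1.5], [8.5, 8.75]])

# J = sum over clusters of squared distances to the cluster centroid
J = sum(np.sum((X[labels == k] - centroids[k]) ** 2) for k in range(len(centroids)))
print(J)   # lower J means tighter clusters
```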
K-means Algorithm Steps ☆ #
Step 1: Choose K #
Decide the number of clusters.
Example: \( K = 3 \) means the algorithm should form three groups.
Step 2: Initialise Centroids #
Choose initial centroids randomly or using a better method such as K-means++.
The lecturer emphasised that the initial centroids can simply be chosen at random.
Step 3: Assignment Step #
Assign each data point to the nearest centroid.
\[ c_i = \arg\min_k \|x_i - \mu_k\|^2 \]

This means each point goes to the cluster whose centroid is closest.
Step 4: Update Step #
Recalculate each centroid as the mean of all points assigned to that cluster.
\[ \mu_k = \frac{1}{|C_k|} \sum_{x_i \in C_k} x_i \]

Step 5: Repeat Until Convergence #
Repeat assignment and update until:
- centroids stop changing
- cluster assignments stop changing
- maximum number of iterations is reached
K-means Workflow #
flowchart TD
A["Choose K"] --> B["Initialise Centroids"]
B --> C["Assign Points to Nearest Centroid"]
C --> D["Update Centroids"]
D --> E{"Centroids Stable?"}
E -- "No" --> C
E -- "Yes" --> F["Final Clusters"]
style A fill:#E1F5FE,stroke:#5b7db1,color:#222
style B fill:#C8E6C9,stroke:#5f8f6a,color:#222
style C fill:#FFF9C4,stroke:#b59b3b,color:#222
style D fill:#EDE7F6,stroke:#8a6fb3,color:#222
style E fill:#FDE2E4,stroke:#b85c68,color:#222
style F fill:#C8E6C9,stroke:#5f8f6a,color:#222
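The whole loop can be written directly in NumPy. This is a minimal sketch assuming random initialisation from the data points and a fixed iteration cap, not a production implementation:

```python
import numpy as np

def kmeans(X, K, max_iters=100, seed=0):
    """Basic K-means: returns (labels, centroids)."""
    rng = np.random.default_rng(seed)
    # Step 2: initialise centroids as K distinct random data points
    centroids = X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(max_iters):
        # Step 3 (assignment): squared distance from every point to every centroid
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Step 4 (update): each centroid becomes the mean of its assigned points
        new_centroids = np.array([
            X[labels == k].mean(axis=0) if np.any(labels == k) else centroids[k]
            for k in range(K)
        ])
        # Step 5: stop once the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Assumed toy data: two well-separated blobs
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(6, 1, (50, 2))])
labels, centroids = kmeans(X, K=2)
```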
Choosing the Value of K ☆ #
K-means requires the number of clusters \( K \) in advance.
Choosing \( K \) is important.
If \( K \) is too small, different groups may be forced together.
If \( K \) is too large, natural groups may be split unnecessarily.
Elbow Method #
The elbow method plots clustering error against different values of \( K \).
Usually, as \( K \) increases, the error decreases.
The “elbow” is the point where further increases in \( K \) bring only small improvements.
That point is often chosen as a reasonable \( K \).
\[ WCSS = \sum_{k=1}^{K} \sum_{x_i \in C_k} \|x_i - \mu_k\|^2 \]

WCSS means within-cluster sum of squares; it is the same quantity as the K-means objective \( J \) above.
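A minimal elbow-method sketch with scikit-learn, where `KMeans.inertia_` is exactly the WCSS above (the toy data and the range of \( K \) are assumptions):

```python
import numpy as np
from sklearn.cluster import KMeans

# Assumed toy data: three blobs, so the elbow should appear around K = 3
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(m, 0.5, (40, 2)) for m in (0, 4, 8)])

# WCSS (inertia) for K = 1..8; look for where the curve flattens
for k in range(1, 9):
    wcss = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
    print(k, round(wcss, 1))
```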
K-means Variants #
| Variant | Idea |
|---|---|
| Standard K-means | Randomly initialise centroids and iterate |
| K-means++ | Choose better initial centroids to improve stability |
| Mini-batch K-means | Use small batches for faster clustering on large datasets |
| Kernel K-means | Use kernel idea for more complex cluster shapes |
Strengths of K-means #
- Simple to understand
- Fast for many practical datasets
- Works well when clusters are compact and roughly spherical
- Easy to implement
- Useful as a baseline clustering method
Limitations of K-means ☆ #
| Limitation | Explanation |
|---|---|
| Needs K in advance | User must choose number of clusters |
| Sensitive to initial centroids | Different initial points may give different clusters |
| Sensitive to scale | Feature scaling is important |
| Assumes compact clusters | Does not work well for irregular cluster shapes |
| Hard assignment | Each point belongs to exactly one cluster |
| Sensitive to outliers | Outliers can pull centroids away |
Before applying K-means, scale numeric features.
Distance-based clustering can be dominated by features with larger ranges.
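For example, a minimal scaling step with scikit-learn's `StandardScaler` (the feature values are assumed):

```python
from sklearn.preprocessing import StandardScaler

# Assumed data: age in years vs income in dollars — very different ranges
X = [[25, 40_000], [32, 85_000], [47, 52_000]]

# Standardise each feature to mean 0, variance 1 so neither dominates distances
X_scaled = StandardScaler().fit_transform(X)
```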
Hard Clustering vs Soft Clustering ☆ #
K-means performs hard clustering.
Each point is assigned to exactly one cluster.
Gaussian Mixture Models perform soft clustering.
Each point receives a probability of belonging to each cluster.
| Aspect | Hard Clustering | Soft Clustering |
|---|---|---|
| Assignment | One cluster only | Probability for each cluster |
| Example | K-means | GMM |
| Output | Cluster label | Responsibility / probability |
| Useful when | Groups are clearly separated | Groups overlap |
Mixture Models #
A mixture model assumes the data comes from a mixture of several probability distributions.
For clustering, each distribution represents a cluster.
The data point is not simply assigned to one cluster immediately.
Instead, the model estimates how likely it is that the point came from each cluster.
Gaussian Mixture Model ☆ #
A Gaussian Mixture Model assumes each cluster is represented by a Gaussian distribution.
The overall data distribution is a weighted combination of Gaussians.
\[ p(x) = \sum_{k=1}^{K} \pi_k \mathcal{N}(x \mid \mu_k, \Sigma_k) \]

Where:
| Symbol | Meaning |
|---|---|
| \( K \) | Number of Gaussian components |
| \( \pi_k \) | Mixing coefficient for component \( k \) |
| \( \mu_k \) | Mean of component \( k \) |
| \( \Sigma_k \) | Covariance of component \( k \) |
| \( \mathcal{N}(x \mid \mu_k, \Sigma_k) \) | Gaussian density |
The mixing coefficients must sum to one.
\[ \sum_{k=1}^{K} \pi_k = 1 \]

GMM Responsibility ☆ #
The probability that data point \( x_i \) belongs to component \( k \) is called its responsibility.
\[ \gamma_{ik} = P(z_i = k \mid x_i) \]

Using Bayes’ rule:
\[ \gamma_{ik} = \frac{\pi_k \mathcal{N}(x_i \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \mathcal{N}(x_i \mid \mu_j, \Sigma_j)} \]

This is the soft-clustering equivalent of assigning a point to a cluster.
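In scikit-learn, `GaussianMixture.predict_proba` returns exactly these responsibilities. A minimal sketch on assumed toy data:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Assumed toy data: two overlapping blobs
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(3, 1, (50, 2))])

gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
resp = gmm.predict_proba(X)   # gamma_ik: one row per point, one column per component

print(resp[0], resp[0].sum())   # each row of responsibilities sums to 1
```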
K-means vs GMM #
| Aspect | K-means | GMM |
|---|---|---|
| Type | Hard clustering | Soft clustering |
| Cluster representation | Centroid | Gaussian distribution |
| Assignment | Nearest centroid | Probability of each component |
| Cluster shape | Roughly spherical | Can model ellipses via covariance |
| Output | Cluster label | Cluster probabilities |
| Method | Distance minimisation | Probability maximisation |
EM Algorithm ☆ #
EM stands for Expectation-Maximisation.
It is used when there are hidden or latent variables.
In GMM clustering, the hidden variable is the unknown cluster membership.
We observe data points, but we do not know which Gaussian component generated each point.
EM Algorithm Idea #
EM alternates between two steps:
- E-step: estimate the hidden cluster responsibilities.
- M-step: update model parameters using those responsibilities.
flowchart LR
A["Initial Parameters"] --> B["E-step: Estimate Responsibilities"]
B --> C["M-step: Update Parameters"]
C --> D{"Converged?"}
D -- "No" --> B
D -- "Yes" --> E["Final GMM"]
style A fill:#E1F5FE,stroke:#5b7db1,color:#222
style B fill:#C8E6C9,stroke:#5f8f6a,color:#222
style C fill:#FFF9C4,stroke:#b59b3b,color:#222
style D fill:#EDE7F6,stroke:#8a6fb3,color:#222
style E fill:#FDE2E4,stroke:#b85c68,color:#222
E-step #
The E-step computes responsibilities using the current parameters.
\[ \gamma_{ik} = \frac{\pi_k \mathcal{N}(x_i \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \mathcal{N}(x_i \mid \mu_j, \Sigma_j)} \]

Interpretation:
Given the current Gaussians, how much responsibility should component \( k \) take for point \( x_i \)?
M-step #
The M-step updates the GMM parameters using the responsibilities from the E-step.
First compute the effective number of points assigned to cluster \( k \):

\[ N_k = \sum_{i=1}^{n} \gamma_{ik} \]

Update the mean:
\[ \mu_k = \frac{1}{N_k} \sum_{i=1}^{n} \gamma_{ik} x_i \]

Update the covariance:
\[ \Sigma_k = \frac{1}{N_k} \sum_{i=1}^{n} \gamma_{ik}(x_i - \mu_k)(x_i - \mu_k)^T \]

Update the mixing coefficient:
\[ \pi_k = \frac{N_k}{n} \]

EM Objective #
EM tries to maximise the likelihood of the observed data.
For GMM:
\[ \log p(X) = \sum_{i=1}^{n} \log \left( \sum_{k=1}^{K} \pi_k \mathcal{N}(x_i \mid \mu_k, \Sigma_k) \right) \]

Each EM iteration is guaranteed not to decrease this likelihood.
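Putting the two steps together, here is a minimal NumPy sketch of one EM pass for a GMM, using the update formulas above. The toy data and initial parameters are assumptions; a real run would iterate until the log-likelihood stabilises:

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_step(X, pi, mu, Sigma):
    """One EM iteration for a GMM. Returns updated (pi, mu, Sigma, log_lik)."""
    n, K = len(X), len(pi)
    # E-step: weighted densities pi_k * N(x_i | mu_k, Sigma_k), shape (n, K)
    dens = np.column_stack([
        pi[k] * multivariate_normal.pdf(X, mean=mu[k], cov=Sigma[k]) for k in range(K)
    ])
    gamma = dens / dens.sum(axis=1, keepdims=True)          # responsibilities
    # M-step: effective counts, then means, covariances, mixing weights
    N = gamma.sum(axis=0)                                   # N_k
    mu = (gamma.T @ X) / N[:, None]                         # mu_k
    Sigma = np.array([
        (gamma[:, k, None] * (X - mu[k])).T @ (X - mu[k]) / N[k] for k in range(K)
    ])
    pi = N / n                                              # pi_k
    log_lik = np.log(dens.sum(axis=1)).sum()                # log p(X) at old params
    return pi, mu, Sigma, log_lik

# Assumed initialisation for a toy 2-component, 2-D problem
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(4, 1, (50, 2))])
pi = np.array([0.5, 0.5])
mu = X[rng.choice(len(X), 2, replace=False)]
Sigma = np.array([np.eye(2), np.eye(2)])

for _ in range(50):
    pi, mu, Sigma, log_lik = em_step(X, pi, mu, Sigma)
```

Monitoring `log_lik` across iterations is a useful sanity check: it should never decrease.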
K-means and EM Connection #
K-means can be viewed as a simpler hard-assignment version of mixture modelling.
| Step | K-means | GMM with EM |
|---|---|---|
| Assignment | Assign point to nearest centroid | Compute probability for each Gaussian |
| Update | Recompute centroids | Recompute means, covariances, mixing weights |
| Cluster membership | Hard | Soft |
| Objective | Minimise squared distance | Maximise likelihood |
Applications of Unsupervised Learning #
Unsupervised Learning is used in:
- customer segmentation
- anomaly detection
- image segmentation
- document/topic grouping
- market basket analysis
- biological data analysis
- recommender systems
- exploratory data analysis
Practical ML Workflow #
- Prepare data.
- Scale numerical features.
- Choose clustering method.
- Choose number of clusters or components.
- Fit the model.
- Analyse clusters.
- Validate using domain knowledge and metrics.
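A compact sketch of this workflow with scikit-learn; the toy data and the choices of K-means and \( K = 3 \) are assumptions for the example:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# Assumed toy data: three blobs
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(m, 1.0, (50, 2)) for m in (0, 5, 10)])

# Scale features, then fit the clustering model
model = make_pipeline(StandardScaler(), KMeans(n_clusters=3, n_init=10, random_state=0))
labels = model.fit_predict(X)

# Analyse clusters, e.g. how many points landed in each
print(np.bincount(labels))
```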
Evaluation of Clustering #
Since there are no target labels, clustering evaluation is harder than in supervised learning.
Common approaches include:
| Method | Meaning |
|---|---|
| Inertia / WCSS | Lower within-cluster distance is better |
| Silhouette score | Measures separation and compactness |
| Visual inspection | Useful for low-dimensional data |
| Domain interpretation | Clusters should make practical sense |
| Stability check | Similar clusters should appear across runs |
Silhouette Score #
For a data point:
- \( a \) = average distance to points in its own cluster
- \( b \) = average distance to points in the nearest other cluster

The silhouette value for that point is

\[ s = \frac{b - a}{\max(a, b)} \]

which lies between \(-1\) and \(1\). A higher value usually indicates better clustering.
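In scikit-learn, `silhouette_score` averages this value over all points; a minimal sketch on assumed toy data:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Assumed toy data: two well-separated blobs
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(6, 1, (50, 2))])

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(silhouette_score(X, labels))   # closer to 1 = tighter, better-separated clusters
```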
Notes ☆ #
You should be comfortable with:
- Explaining why unsupervised learning does not use target labels.
- Describing the K-means algorithm step by step.
- Computing or explaining centroid updates.
- Explaining why K-means is hard clustering.
- Explaining why GMM is soft clustering.
- Writing the GMM probability density formula.
- Explaining the E-step and M-step of EM.
- Comparing K-means and GMM.
- Identifying suitable applications of clustering.
Common Mistakes #
| Mistake | Correction |
|---|---|
| Thinking clustering uses labels | Clustering works without target labels |
| Forgetting to scale data | Distance-based methods need scaling |
| Choosing K blindly | Use elbow method, domain knowledge, or validation |
| Treating GMM like K-means | GMM gives probabilities, not just labels |
| Confusing E-step and M-step | E-step estimates responsibilities; M-step updates parameters |
| Ignoring covariance in GMM | Covariance controls cluster shape |
Revision #
| Concept | Main Idea | Key Formula |
|---|---|---|
| K-means | Hard clustering using centroids | \( \sum_k \sum_{x_i \in C_k} \|x_i - \mu_k\|^2 \) |
| Centroid update | Mean of assigned points | \( \mu_k = \frac{1}{|C_k|}\sum_{x_i \in C_k}x_i \) |
| GMM | Soft clustering using Gaussians | \( p(x)=\sum_k \pi_k \mathcal{N}(x\mid\mu_k,\Sigma_k) \) |
| Responsibility | Probability of component for a point | \( \gamma_{ik}=P(z_i=k\mid x_i) \) |
| EM | Alternates E-step and M-step | Estimate responsibilities, then update parameters |