Unsupervised Learning #
Unsupervised Learning is used when we have input data but no target labels.
The model is not told the correct answer. Instead, it tries to discover hidden structure in the data.
- K-means clustering and variants
- Review of the EM algorithm
- GMM-based soft clustering
- Applications
Supervised vs Unsupervised Learning #
| Aspect | Supervised Learning | Unsupervised Learning |
|---|---|---|
| Data contains target label? | Yes | No |
| Learns from | Input-output pairs | Input features only |
| Main goal | Predict output | Discover structure |
| Example task | Classification, regression | Clustering |
| Example algorithm | Logistic regression, decision tree | K-means, GMM |
- Works on unlabelled raw data.
- The algorithm discovers hidden patterns without prior knowledge of outcomes.
- Requires no human intervention during training.
- Does not make direct predictions — it groups or organises data instead.
- Carries a higher risk because there’s no ground truth to verify results.
- Common techniques include Clustering, Association, and Dimensionality Reduction.
The most common example is clustering, where similar records are grouped together.
Key takeaway:
Unsupervised Learning finds patterns in data without labelled outputs. For this module, the main focus is clustering: K-means, mixture models, EM, and GMM-based soft clustering.
stateDiagram-v2
    %% ML maths-based colours (same palette as supervised)
    classDef probability fill:#d1fae5,stroke:#065f46,stroke-width:1px
    classDef geometry fill:#ffedd5,stroke:#9a3412,stroke-width:1px
    classDef category font-style:italic,font-weight:bold,fill:#f3f4f6,stroke:#374151

    %% Root
    USL: Unsupervised Learning

    %% Main branches
    USL --> CLU:::category
    CLU: Clustering
    USL --> DR:::category
    DR: Dimensionality Reduction

    %% Clustering algorithms
    CLU --> KM:::geometry
    KM: K-Means
    CLU --> HC:::geometry
    HC: Hierarchical Clustering
    CLU --> DB:::geometry
    DB: DBSCAN

    %% Probabilistic models
    USL --> PM:::category
    PM: Probabilistic Models
    PM --> GMM:::probability
    GMM: Gaussian Mixture Model
    PM --> HMM:::probability
    HMM: Hidden Markov Model
Clustering #
- Groups similar data points together based on shared features.
- Commonly used for market segmentation, image compression, and anomaly detection.
Association #
- Identifies relationships or correlations between variables in a dataset.
- Commonly used in market basket analysis (e.g. “Customers who bought X also bought Y”).
Common Techniques #
- Apriori Algorithm – Finds frequent itemsets and generates association rules.
- Eclat Algorithm – Similar to Apriori but uses set intersections for faster computation.
Dimensionality Reduction #
- Reduces the number of input variables to simplify data.
- Helps remove noise and redundancy.
- Commonly used in data pre-processing and visualisation.
Common Techniques #
- Principal Component Analysis (PCA) – Projects data onto fewer dimensions while keeping most variance.
- Linear Discriminant Analysis (LDA) – Focuses on class separation.
- t-SNE (t-Distributed Stochastic Neighbour Embedding) – Used for visualising high-dimensional data.
- Autoencoders – Neural networks that compress and reconstruct data.
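As a concrete illustration of dimensionality reduction, here is a minimal PCA sketch using scikit-learn; the toy data and the choice of two components are assumptions for the example:

```python
import numpy as np
from sklearn.decomposition import PCA

# Assumed toy data: 100 samples with 5 features, one of them redundant
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
X[:, 3] = X[:, 0] + 0.1 * rng.normal(size=100)   # near-duplicate of feature 0

# Project onto 2 principal components
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                  # (100, 2)
print(pca.explained_variance_ratio_)    # fraction of variance kept per component
```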
mindmap
  root(Unsupervised Learning)
    Clustering
      K Means
      Hierarchical Clustering
      DBSCAN
    Dimensionality Reduction
      PCA
      t SNE
      Autoencoders
    Probabilistic Models
      Gaussian Mixture Model
      Hidden Markov Model
Clustering ☆ #
Clustering means grouping data points so that:
- points inside the same cluster are similar
- points in different clusters are dissimilar
For example:
- grouping customers by buying behaviour
- grouping documents by topic
- grouping images by visual similarity
- grouping patients by medical measurements
flowchart LR
A["Unlabelled Data"] --> B["Similarity / Distance Measure"]
B --> C["Clustering Algorithm"]
C --> D["Groups / Clusters"]
D --> E["Interpret Hidden Structure"]
style A fill:#E1F5FE,stroke:#5b7db1,color:#222
style B fill:#C8E6C9,stroke:#5f8f6a,color:#222
style C fill:#FFF9C4,stroke:#b59b3b,color:#222
style D fill:#EDE7F6,stroke:#8a6fb3,color:#222
style E fill:#FDE2E4,stroke:#b85c68,color:#222
Common Types of Clustering #
- K-Means Clustering – Divides data into K groups based on similarity.
- Hierarchical Clustering – Builds a hierarchy (tree) of clusters.
- DBSCAN (Density-Based Spatial Clustering) – Groups points close in density; identifies noise/outliers.
K-means Clustering ☆ #
K-means is a clustering algorithm that divides data into \( K \) clusters.
Each cluster has a centre called a centroid.
The algorithm repeatedly:
- assigns each point to the nearest centroid
- updates each centroid using the mean of the assigned points
- repeats until the centroids stop changing significantly
K-means Objective Function ☆ #
The goal of K-means is to minimise the distance between points and their assigned cluster centroids.
\[ J = \sum_{k=1}^{K} \sum_{x_i \in C_k} \|x_i - \mu_k\|^2 \]

Where:
| Symbol | Meaning |
|---|---|
| \( K \) | Number of clusters |
| \( C_k \) | Cluster \( k \) |
| \( x_i \) | Data point |
| \( \mu_k \) | Centroid of cluster \( k \) |
| \( J \) | Within-cluster squared error |
K-means tries to make points close to their own centroid.
Lower objective value means tighter clusters.
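To make the objective concrete, here is a minimal NumPy sketch that evaluates \( J \) for a toy configuration (the points, assignments, and centroids are all assumed):

```python
import numpy as np

# Assumed toy example: 2-D points, their cluster assignments, and centroids
X = np.array([[1.0, 1.0], [1.5, 2.0], [8.0, 8.0], [9.0, 9.5]])
labels = np.array([0, 0, 1, 1])          # cluster index for each point
centroids = np.array([[1.25, 1.5], [8.5, 8.75]])

# J = sum over clusters of squared distances to the cluster centroid
J = sum(np.sum((X[labels == k] - centroids[k]) ** 2) for k in range(len(centroids)))
print(J)   # lower J means tighter clusters
```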
K-means Algorithm Steps ☆ #
Step 1: Choose K #
Decide the number of clusters.
Example: \( K = 3 \) means the algorithm should form three groups.
Step 2: Initialise Centroids #
Choose initial centroids randomly or using a better method such as K-means++.
The lecturer emphasised that the initial centroids can simply be chosen at random.
Step 3: Assignment Step #
Assign each data point to the nearest centroid.
\[ c_i = \arg\min_k \|x_i - \mu_k\|^2 \]

This means each point goes to the cluster whose centroid is closest.
Step 4: Update Step #
Recalculate each centroid as the mean of all points assigned to that cluster.
\[ \mu_k = \frac{1}{|C_k|} \sum_{x_i \in C_k} x_i \]

Step 5: Repeat Until Convergence #
Repeat assignment and update until:
- centroids stop changing
- cluster assignments stop changing
- maximum number of iterations is reached
K-means Workflow #
flowchart TD
A["Choose K"] --> B["Initialise Centroids"]
B --> C["Assign Points to Nearest Centroid"]
C --> D["Update Centroids"]
D --> E{"Centroids Stable?"}
E -- "No" --> C
E -- "Yes" --> F["Final Clusters"]
style A fill:#E1F5FE,stroke:#5b7db1,color:#222
style B fill:#C8E6C9,stroke:#5f8f6a,color:#222
style C fill:#FFF9C4,stroke:#b59b3b,color:#222
style D fill:#EDE7F6,stroke:#8a6fb3,color:#222
style E fill:#FDE2E4,stroke:#b85c68,color:#222
style F fill:#C8E6C9,stroke:#5f8f6a,color:#222
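The whole loop can be written directly in NumPy. This is a minimal sketch assuming random initialisation from the data points and a fixed iteration cap, not a production implementation:

```python
import numpy as np

def kmeans(X, K, max_iters=100, seed=0):
    """Basic K-means: returns (labels, centroids)."""
    rng = np.random.default_rng(seed)
    # Step 2: initialise centroids as K distinct random data points
    centroids = X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(max_iters):
        # Step 3 (assignment): squared distance from every point to every centroid
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Step 4 (update): each centroid becomes the mean of its assigned points
        new_centroids = np.array([
            X[labels == k].mean(axis=0) if np.any(labels == k) else centroids[k]
            for k in range(K)
        ])
        # Step 5: stop once the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Assumed toy data: two well-separated blobs
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(6, 1, (50, 2))])
labels, centroids = kmeans(X, K=2)
```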
Choosing the Value of K ☆ #
K-means requires the number of clusters \( K \) in advance.
Choosing \( K \) is important.
If \( K \) is too small, different groups may be forced together.
If \( K \) is too large, natural groups may be split unnecessarily.
Elbow Method #
The elbow method plots clustering error against different values of \( K \).
Usually, as \( K \) increases, the error decreases.
The “elbow” is the point where further increases in \( K \) bring only small improvements.
That point is often chosen as a reasonable \( K \).
\[ WCSS = \sum_{k=1}^{K} \sum_{x_i \in C_k} \|x_i - \mu_k\|^2 \]

WCSS means within-cluster sum of squares; it is the same quantity as the K-means objective \( J \) above.
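A minimal elbow-method sketch with scikit-learn, where `KMeans.inertia_` is exactly the WCSS above (the toy data and the range of \( K \) are assumptions):

```python
import numpy as np
from sklearn.cluster import KMeans

# Assumed toy data: three blobs, so the elbow should appear around K = 3
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(m, 0.5, (40, 2)) for m in (0, 4, 8)])

# WCSS (inertia) for K = 1..8; look for where the curve flattens
for k in range(1, 9):
    wcss = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
    print(k, round(wcss, 1))
```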
K-means Variants #
| Variant | Idea |
|---|---|
| Standard K-means | Randomly initialise centroids and iterate |
| K-means++ | Choose better initial centroids to improve stability |
| Mini-batch K-means | Use small batches for faster clustering on large datasets |
| Kernel K-means | Use kernel idea for more complex cluster shapes |
Strengths of K-means #
- Simple to understand
- Fast for many practical datasets
- Works well when clusters are compact and roughly spherical
- Easy to implement
- Useful as a baseline clustering method
Limitations of K-means ☆ #
| Limitation | Explanation |
|---|---|
| Needs K in advance | User must choose number of clusters |
| Sensitive to initial centroids | Different initial points may give different clusters |
| Sensitive to scale | Feature scaling is important |
| Assumes compact clusters | Does not work well for irregular cluster shapes |
| Hard assignment | Each point belongs to exactly one cluster |
| Sensitive to outliers | Outliers can pull centroids away |
Before applying K-means, scale numeric features.
Distance-based clustering can be dominated by features with larger ranges.
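For example, a minimal scaling step with scikit-learn's `StandardScaler` (the feature values are assumed):

```python
from sklearn.preprocessing import StandardScaler

# Assumed data: age in years vs income in dollars — very different ranges
X = [[25, 40_000], [32, 85_000], [47, 52_000]]

# Standardise each feature to mean 0, variance 1 so neither dominates distances
X_scaled = StandardScaler().fit_transform(X)
```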
Hard Clustering vs Soft Clustering ☆ #
K-means performs hard clustering.
Each point is assigned to exactly one cluster.
Gaussian Mixture Models perform soft clustering.
Each point receives a probability of belonging to each cluster.
| Aspect | Hard Clustering | Soft Clustering |
|---|---|---|
| Assignment | One cluster only | Probability for each cluster |
| Example | K-means | GMM |
| Output | Cluster label | Responsibility / probability |
| Useful when | Groups are clearly separated | Groups overlap |
Mixture Models #
A mixture model assumes the data comes from a mixture of several probability distributions.
For clustering, each distribution represents a cluster.
The data point is not simply assigned to one cluster immediately.
Instead, the model estimates how likely it is that the point came from each cluster.
Gaussian Mixture Model ☆ #
A Gaussian Mixture Model assumes each cluster is represented by a Gaussian distribution.
The overall data distribution is a weighted combination of Gaussians.
\[ p(x) = \sum_{k=1}^{K} \pi_k \mathcal{N}(x \mid \mu_k, \Sigma_k) \]

Where:
| Symbol | Meaning |
|---|---|
| \( K \) | Number of Gaussian components |
| \( \pi_k \) | Mixing coefficient for component \( k \) |
| \( \mu_k \) | Mean of component \( k \) |
| \( \Sigma_k \) | Covariance of component \( k \) |
| \( \mathcal{N}(x \mid \mu_k, \Sigma_k) \) | Gaussian density |
The mixing coefficients must sum to one.
\[ \sum_{k=1}^{K} \pi_k = 1 \]

GMM Responsibility ☆ #
The probability that data point \( x_i \) belongs to component \( k \) is called its responsibility.
\[ \gamma_{ik} = P(z_i = k \mid x_i) \]

Using Bayes’ rule:
\[ \gamma_{ik} = \frac{\pi_k \mathcal{N}(x_i \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \mathcal{N}(x_i \mid \mu_j, \Sigma_j)} \]

This is the soft-clustering equivalent of assigning a point to a cluster.
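In scikit-learn, `GaussianMixture.predict_proba` returns exactly these responsibilities. A minimal sketch on assumed toy data:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Assumed toy data: two overlapping blobs
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(3, 1, (50, 2))])

gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
resp = gmm.predict_proba(X)   # gamma_ik: one row per point, one column per component

print(resp[0], resp[0].sum())   # each row of responsibilities sums to 1
```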
K-means vs GMM #
| Aspect | K-means | GMM |
|---|---|---|
| Type | Hard clustering | Soft clustering |
| Cluster representation | Centroid | Gaussian distribution |
| Assignment | Nearest centroid | Probability of each component |
| Cluster shape | Roughly spherical | Can model ellipses via covariance |
| Output | Cluster label | Cluster probabilities |
| Method | Distance minimisation | Probability maximisation |
EM Algorithm ☆ #
EM stands for Expectation-Maximisation.
It is used when there are hidden or latent variables.
In GMM clustering, the hidden variable is the unknown cluster membership.
We observe data points, but we do not know which Gaussian component generated each point.
EM Algorithm Idea #
EM alternates between two steps:
- E-step: estimate the hidden cluster responsibilities.
- M-step: update model parameters using those responsibilities.
flowchart LR
A["Initial Parameters"] --> B["E-step: Estimate Responsibilities"]
B --> C["M-step: Update Parameters"]
C --> D{"Converged?"}
D -- "No" --> B
D -- "Yes" --> E["Final GMM"]
style A fill:#E1F5FE,stroke:#5b7db1,color:#222
style B fill:#C8E6C9,stroke:#5f8f6a,color:#222
style C fill:#FFF9C4,stroke:#b59b3b,color:#222
style D fill:#EDE7F6,stroke:#8a6fb3,color:#222
style E fill:#FDE2E4,stroke:#b85c68,color:#222
E-step #
The E-step computes responsibilities using the current parameters.
\[ \gamma_{ik} = \frac{\pi_k \mathcal{N}(x_i \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \mathcal{N}(x_i \mid \mu_j, \Sigma_j)} \]

Interpretation:
Given the current Gaussians, how much responsibility should component \( k \) take for point \( x_i \)?
M-step #
The M-step updates the GMM parameters using the responsibilities from the E-step.
First compute the effective number of points assigned to cluster \( k \):

\[ N_k = \sum_{i=1}^{n} \gamma_{ik} \]

Update the mean:
\[ \mu_k = \frac{1}{N_k} \sum_{i=1}^{n} \gamma_{ik} x_i \]

Update the covariance:
\[ \Sigma_k = \frac{1}{N_k} \sum_{i=1}^{n} \gamma_{ik}(x_i - \mu_k)(x_i - \mu_k)^T \]

Update the mixing coefficient:
\[ \pi_k = \frac{N_k}{n} \]

EM Objective #
EM tries to maximise the likelihood of the observed data.
For GMM:
\[ \log p(X) = \sum_{i=1}^{n} \log \left( \sum_{k=1}^{K} \pi_k \mathcal{N}(x_i \mid \mu_k, \Sigma_k) \right) \]

Each EM iteration is guaranteed not to decrease this likelihood.
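Putting the two steps together, here is a minimal NumPy sketch of one EM pass for a GMM, using the update formulas above. The toy data and initial parameters are assumptions; a real run would iterate until the log-likelihood stabilises:

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_step(X, pi, mu, Sigma):
    """One EM iteration for a GMM. Returns updated (pi, mu, Sigma, log_lik)."""
    n, K = len(X), len(pi)
    # E-step: weighted densities pi_k * N(x_i | mu_k, Sigma_k), shape (n, K)
    dens = np.column_stack([
        pi[k] * multivariate_normal.pdf(X, mean=mu[k], cov=Sigma[k]) for k in range(K)
    ])
    gamma = dens / dens.sum(axis=1, keepdims=True)          # responsibilities
    # M-step: effective counts, then means, covariances, mixing weights
    N = gamma.sum(axis=0)                                   # N_k
    mu = (gamma.T @ X) / N[:, None]                         # mu_k
    Sigma = np.array([
        (gamma[:, k, None] * (X - mu[k])).T @ (X - mu[k]) / N[k] for k in range(K)
    ])
    pi = N / n                                              # pi_k
    log_lik = np.log(dens.sum(axis=1)).sum()                # log p(X) at old params
    return pi, mu, Sigma, log_lik

# Assumed initialisation for a toy 2-component, 2-D problem
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(4, 1, (50, 2))])
pi = np.array([0.5, 0.5])
mu = X[rng.choice(len(X), 2, replace=False)]
Sigma = np.array([np.eye(2), np.eye(2)])

for _ in range(50):
    pi, mu, Sigma, log_lik = em_step(X, pi, mu, Sigma)
```

Monitoring `log_lik` across iterations is a useful sanity check: it should never decrease.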
K-means and EM Connection #
K-means can be viewed as a simpler hard-assignment version of mixture modelling.
| Step | K-means | GMM with EM |
|---|---|---|
| Assignment | Assign point to nearest centroid | Compute probability for each Gaussian |
| Update | Recompute centroids | Recompute means, covariances, mixing weights |
| Cluster membership | Hard | Soft |
| Objective | Minimise squared distance | Maximise likelihood |
Applications of Unsupervised Learning #
Unsupervised Learning is used in:
- customer segmentation
- anomaly detection
- image segmentation
- document/topic grouping
- market basket analysis
- biological data analysis
- recommender systems
- exploratory data analysis
Practical ML Workflow #
- Prepare data.
- Scale numerical features.
- Choose clustering method.
- Choose number of clusters or components.
- Fit the model.
- Analyse clusters.
- Validate using domain knowledge and metrics.
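A compact sketch of this workflow with scikit-learn; the toy data and the choices of K-means and \( K = 3 \) are assumptions for the example:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# Assumed toy data: three blobs
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(m, 1.0, (50, 2)) for m in (0, 5, 10)])

# Scale features, then fit the clustering model
model = make_pipeline(StandardScaler(), KMeans(n_clusters=3, n_init=10, random_state=0))
labels = model.fit_predict(X)

# Analyse clusters, e.g. how many points landed in each
print(np.bincount(labels))
```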
Evaluation of Clustering #
Since there are no target labels, clustering evaluation is harder than in supervised learning.
Common approaches include:
| Method | Meaning |
|---|---|
| Inertia / WCSS | Lower within-cluster distance is better |
| Silhouette score | Measures separation and compactness |
| Visual inspection | Useful for low-dimensional data |
| Domain interpretation | Clusters should make practical sense |
| Stability check | Similar clusters should appear across runs |
Silhouette Score #
For a data point:
- \( a \) = average distance to points in its own cluster
- \( b \) = average distance to points in the nearest other cluster

The silhouette value for that point is

\[ s = \frac{b - a}{\max(a, b)} \]

which lies between \(-1\) and \(1\). A higher value usually indicates better clustering.
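In scikit-learn, `silhouette_score` averages this value over all points; a minimal sketch on assumed toy data:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Assumed toy data: two well-separated blobs
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(6, 1, (50, 2))])

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(silhouette_score(X, labels))   # closer to 1 = tighter, better-separated clusters
```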
Notes ☆ #
You should be comfortable with:
- Explaining why unsupervised learning does not use target labels.
- Describing the K-means algorithm step by step.
- Computing or explaining centroid updates.
- Explaining why K-means is hard clustering.
- Explaining why GMM is soft clustering.
- Writing the GMM probability density formula.
- Explaining the E-step and M-step of EM.
- Comparing K-means and GMM.
- Identifying suitable applications of clustering.
Common Mistakes #
| Mistake | Correction |
|---|---|
| Thinking clustering uses labels | Clustering works without target labels |
| Forgetting to scale data | Distance-based methods need scaling |
| Choosing K blindly | Use elbow method, domain knowledge, or validation |
| Treating GMM like K-means | GMM gives probabilities, not just labels |
| Confusing E-step and M-step | E-step estimates responsibilities; M-step updates parameters |
| Ignoring covariance in GMM | Covariance controls cluster shape |
Revision #
| Concept | Main Idea | Key Formula |
|---|---|---|
| K-means | Hard clustering using centroids | \( \sum_k \sum_{x_i \in C_k} \|x_i - \mu_k\|^2 \) |
| Centroid update | Mean of assigned points | \( \mu_k = \frac{1}{|C_k|}\sum_{x_i \in C_k}x_i \) |
| GMM | Soft clustering using Gaussians | \( p(x)=\sum_k \pi_k \mathcal{N}(x\mid\mu_k,\Sigma_k) \) |
| Responsibility | Probability of component for a point | \( \gamma_{ik}=P(z_i=k\mid x_i) \) |
| EM | Alternates E-step and M-step | Estimate responsibilities, then update parameters |