Dimensionality reduction and PCA #
PCA and SVM connect linear algebra, geometry, and optimisation.
Dimensionality reduction means representing high-dimensional data using fewer dimensions while trying to preserve the important structure of the data.
Principal Components Analysis, or PCA, is a linear dimensionality reduction method. It finds directions in the data along which the variance is maximum, and projects the data onto those directions.
Key takeaway: PCA chooses the eigenvectors of the covariance matrix corresponding to the largest eigenvalues. These eigenvectors form the principal subspace. The largest eigenvalues represent the directions that preserve the most variance.
PCA at a Glance #
flowchart TD
A[High-dimensional Data] --> B[Centre / Standardise]
B --> C[Covariance Matrix]
C --> D[Eigenvalues and Eigenvectors]
D --> E[Choose Top M Components]
E --> F[Low-dimensional Representation]
E --> G[Reconstruction]
style A fill:#E1F5FE,stroke:#78909C,stroke-width:1px,color:#263238
style B fill:#C8E6C9,stroke:#78909C,stroke-width:1px,color:#263238
style C fill:#FFF9C4,stroke:#78909C,stroke-width:1px,color:#263238
style D fill:#EDE7F6,stroke:#78909C,stroke-width:1px,color:#263238
style E fill:#E1F5FE,stroke:#78909C,stroke-width:1px,color:#263238
style F fill:#C8E6C9,stroke:#78909C,stroke-width:1px,color:#263238
style G fill:#FFF9C4,stroke:#78909C,stroke-width:1px,color:#263238
Why Dimensionality Reduction? #
High-dimensional data can be difficult to:
- visualise
- interpret
- store efficiently
- process computationally
- use in machine learning without noise or redundancy
Often, high-dimensional data is overcomplete. This means some dimensions are redundant or correlated with others.
For example, if two features are highly correlated, most of the useful information may lie close to a lower-dimensional line or plane.
Principal Component Analysis (PCA) #
- dimensionality reduction technique
- helps us to reduce the number of features in a dataset while keeping the most important information.
- changes complex datasets by transforming correlated features into a smaller set of uncorrelated components.
- uses linear algebra to transform data into new features called principal components.
- finds these by calculating eigenvectors (directions) and eigenvalues (importance) from the covariance matrix.
- PCA selects the top components with the highest eigenvalues and projects the data onto them simplify the dataset.
PCA prioritizes the directions where the data varies the most because more variation = more useful information.
- Standardize the Data
- Calculate Covariance Matrix
- Find the Principal Components
- Pick the Top Directions & Transform Data
PCA Problem Setting #
Assume we have centred data:
\[ x_1, x_2, \ldots, x_N \in \mathbb{R}^D \]The data has mean zero:
\[ \mathbb{E}[x] = 0 \]The covariance matrix is:
\[ S = \frac{1}{N}\sum_{n=1}^{N}x_nx_n^T \]If the data matrix is:
\[ X = [x_1, x_2, \ldots, x_N] \in \mathbb{R}^{D \times N} \]then:
\[ S = \frac{1}{N}XX^T \]Low-Dimensional Representation #
PCA looks for a lower-dimensional code:
\[ z_n \in \mathbb{R}^M \]where:
\[ M < D \]Let:
\[ B = [b_1, b_2, \ldots, b_M] \in \mathbb{R}^{D \times M} \]where the columns of \( B \) are orthonormal basis vectors.
The compressed representation is:
\[ z = B^Tx \]The reconstructed data is:
\[ \tilde{x} = Bz \]Substituting \( z = B^Tx \) :
\[ \tilde{x} = BB^Tx \]Bottleneck View of PCA #
PCA can be understood as a bottleneck:
Original data x → compressed code z → reconstructed data x̃
R^D R^M R^D
The low-dimensional code \( z \) controls how much information flows from the original data to the reconstructed data.
If \( M \) is small, compression is strong. If \( M \) is large, more information is preserved.
Centring the Data #
Before PCA, we usually subtract the mean.
If the mean of the data is \( \mu \) , then the centred data is:
\[ x_c = x - \mu \]This is important because PCA is based on variance around the centre of the data.
The slides emphasise that for centred data:
\[ \mathbb{E}[x] = 0 \]and:
\[ S = \frac{1}{N}\sum_{n=1}^{N}x_nx_n^T \]Standardisation #
In practice, PCA is often applied after standardisation.
For each dimension \( d \) :
\[ x_*^{(d)} = \frac{x_*^{(d)} - \mu_d}{\sigma_d} \]where:
- \( \mu_d \) is the training-data mean for dimension \( d \)
- \( \sigma_d \) is the training-data standard deviation for dimension \( d \)
Standardisation makes the data unit-free and gives each feature comparable scale.
For exam numericals, do not use the test point’s own mean and standard deviation. Use the training data mean and standard deviation.
Maximum Variance Perspective #
The key PCA idea is:
choose the direction where the projected data has maximum variance.
For the first principal component, let \( b_1 \) be a unit vector.
The projection of \( x_n \) onto \( b_1 \) is:
\[ z_{1n} = b_1^Tx_n \]The variance of the projected data is:
\[ V_1 = \frac{1}{N}\sum_{n=1}^{N}z_{1n}^2 \]Substitute \( z_{1n} = b_1^Tx_n \) :
\[ V_1 = \frac{1}{N}\sum_{n=1}^{N}(b_1^Tx_n)^2 \]This becomes:
\[ V_1 = b_1^TSb_1 \]So PCA solves:
\[ \max_{b_1} b_1^TSb_1 \]subject to:
\[ \|b_1\| = 1 \]The constraint is needed because otherwise we could increase the magnitude of \( b_1 \) and artificially increase the variance.
Lagrangian Derivation of the First Principal Component #
Set up the constrained optimisation problem:
\[ \max b_1^TSb_1 \]subject to:
\[ b_1^Tb_1 = 1 \]The Lagrangian is:
\[ \mathcal{L}(b_1,\lambda)=b_1^TSb_1+\lambda(1-b_1^Tb_1) \]Now set the derivatives to zero:
\[ \frac{\partial \mathcal{L}}{\partial \lambda}=1-b_1^Tb_1=0 \]and:
\[ \frac{\partial \mathcal{L}}{\partial b_1}=0 \]This gives:
\[ Sb_1=\lambda b_1 \]So \( b_1 \) is an eigenvector of the covariance matrix \( S \) , and \( \lambda \) is the corresponding eigenvalue.
Why the Largest Eigenvalue? #
The variance in direction \( b_1 \) is:
\[ V_1 = b_1^TSb_1 \]Using:
\[ Sb_1=\lambda b_1 \]we get:
\[ V_1 = b_1^T\lambda b_1 \]Since \( b_1^Tb_1=1 \) :
\[ V_1 = \lambda \]Therefore, to maximise variance, choose the eigenvector with the largest eigenvalue.
This is the first principal component.
More Than One Principal Component #
For an \( M \) -dimensional PCA subspace, choose:
\[ B = [b_1,b_2,\ldots,b_M] \]where \( b_1,\ldots,b_M \) are the eigenvectors of \( S \) corresponding to the \( M \) largest eigenvalues.
The projected coordinates are:
\[ z = B^Tx \]The reconstruction is:
\[ \tilde{x}=Bz=BB^Tx \]Variance Explained #
If the eigenvalues are:
\[ \lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_D \]then the variance retained by the first \( M \) components is:
\[ \frac{\lambda_1+\lambda_2+\cdots+\lambda_M}{\lambda_1+\lambda_2+\cdots+\lambda_D} \]This is often called the explained variance ratio.
Two Views of PCA #
flowchart LR
A[PCA] --> B[Maximum Variance View]
A --> C[Reconstruction View]
B --> D[Choose directions with largest variance]
C --> E[Minimise reconstruction error]
style A fill:#E1F5FE,stroke:#78909C,stroke-width:1px,color:#263238
style B fill:#C8E6C9,stroke:#78909C,stroke-width:1px,color:#263238
style C fill:#FFF9C4,stroke:#78909C,stroke-width:1px,color:#263238
style D fill:#EDE7F6,stroke:#78909C,stroke-width:1px,color:#263238
style E fill:#E1F5FE,stroke:#78909C,stroke-width:1px,color:#263238
Projection Perspective #
PCA can also be derived as a reconstruction problem.
Instead of asking:
Which directions maximise variance?
we ask:
Which lower-dimensional subspace minimises reconstruction error?
The reconstruction error is:
\[ \sum_{n=1}^{N}\|x_n-\tilde{x}_n\|^2 \]PCA gives the subspace that minimises this average reconstruction error.
So PCA has two equivalent views:
| Perspective | Goal |
|---|---|
| Maximum variance | Preserve maximum spread/information |
| Projection/reconstruction | Minimise reconstruction error |
PCA and SVD #
From the slides:
\[ S = \frac{1}{N}XX^T \]Assume the singular value decomposition of \( X \) is:
\[ X = U\Sigma V^T \]Then:
\[ S = \frac{1}{N}XX^T = \frac{1}{N}U\Sigma \Sigma^T U^T \]The columns of \( U \) are the eigenvectors of \( S \) .
The eigenvalues of \( S \) are related to the singular values of \( X \) :
\[ \lambda_d = \frac{\sigma_d^2}{N} \]This connects PCA with SVD.
PCA as Low-Rank Approximation #
PCA can also be understood using the Eckart–Young theorem.
The best rank- \( M \) approximation to \( X \) is obtained by truncating the SVD:
\[ \tilde{X}_M = U_M\Sigma_MV_M^T \]where:
- \( U_M \) contains the top \( M \) left singular vectors
- \( \Sigma_M \) contains the top \( M \) singular values
- \( V_M \) contains the corresponding right singular vectors
This gives the best low-rank approximation to \( X \) .
PCA in High Dimensions #
In high dimensions, the covariance matrix is:
\[ S \in \mathbb{R}^{D \times D} \]Computing eigenvectors of a \( D \times D \) matrix can be expensive.
This is especially difficult when:
\[ D \gg N \]A useful trick is to use:
\[ X^TX \in \mathbb{R}^{N \times N} \]instead of:
\[ XX^T \in \mathbb{R}^{D \times D} \]The non-zero eigenvalues of \( XX^T \) and \( X^TX \) are the same.
If \( c_m \) is an eigenvector of \( X^TX \) , then:
\[ Xc_m \]can be used to recover the corresponding eigenvector of \( XX^T \) .
Key Steps of PCA in Practice #
The lecture gives the practical PCA pipeline as:
- Mean subtraction
- Standardisation
- Eigendecomposition of the covariance matrix
- Projection
Step 1: Mean Subtraction #
\[ x_c = x - \mu \]Step 2: Standardisation #
\[ x_s^{(d)}=\frac{x^{(d)}-\mu_d}{\sigma_d} \]Step 3: Covariance Matrix #
\[ S = \frac{1}{N}XX^T \]Step 4: Eigenvectors and Eigenvalues #
Solve:
\[ Sb_i=\lambda_i b_i \]Choose the eigenvectors with the largest eigenvalues.
Step 5: Projection #
\[ z = B^Tx \]Step 6: Reconstruction #
\[ \tilde{x}=Bz \]Exam Method for PCA Numericals #
Use this template.
Given Data Matrix #
Write the data matrix \( X \) .
Find the Mean #
\[ \mu = \frac{1}{N}\sum_{n=1}^{N}x_n \]Centre the Data #
\[ X_c = X - \mu \]Compute Covariance #
\[ S = \frac{1}{N}X_cX_c^T \]Find Eigenvalues and Eigenvectors #
Solve:
\[ \det(S-\lambda I)=0 \]Then find eigenvectors.
Choose Principal Components #
Choose eigenvectors corresponding to largest eigenvalues.
Project Data #
\[ z = B^TX_c \]Reconstruct If Asked #
\[ \tilde{X}=Bz+\mu \]Common Exam Mistakes #
| Mistake | Correction |
|---|---|
| Forgetting to centre the data | Always subtract the mean first |
| Choosing smallest eigenvalues | PCA chooses largest eigenvalues |
| Confusing \( B^Tx \) and \( Bx \) | \( z=B^Tx \) , reconstruction is \( \tilde{x}=Bz \) |
| Forgetting to add mean back after reconstruction | Use \( \tilde{x}=Bz+\mu \) if returning to original space |
| Using test-data statistics | Standardise test points using training mean and standard deviation |
Quick Memory Line #
Centre → Covariance → Eigenvectors → Largest Eigenvalues → Project → Reconstruct
Source Focus #
This page is based mainly on:
- Lecture 12: PCA problem setting, maximum variance perspective, Lagrangian derivation, first principal component, projection perspective
- Lecture 13: SVD connection, low-rank approximation, high-dimensional PCA, practical PCA steps
- Course handout: Module 7 and sessions 12–13