Nonlinear SVM #

A linear SVM works well when the data can be separated by a straight line or hyperplane. When the data is not linearly separable in the original input space, nonlinear SVM maps the data to a higher-dimensional feature space where a linear separator may exist.

Key takeaway: Nonlinear SVM uses the kernel trick. Instead of explicitly mapping
\( x \)
to ( \phi(x) )
, we compute inner products in the feature space using a kernel:
\[ K(x_i,x_j)=\phi(x_i)^T\phi(x_j) \]

Nonlinear SVM Idea #

flowchart TD
    A[Nonlinear Data] --> B[Feature Map or Kernel]
    B --> C[Higher-dimensional Feature Space]
    C --> D[Linear Separation]
    D --> E[Classifier in Original Space]
    style A fill:#E1F5FE,stroke:#78909C,stroke-width:1px,color:#263238
    style B fill:#C8E6C9,stroke:#78909C,stroke-width:1px,color:#263238
    style C fill:#FFF9C4,stroke:#78909C,stroke-width:1px,color:#263238
    style D fill:#EDE7F6,stroke:#78909C,stroke-width:1px,color:#263238
    style E fill:#E1F5FE,stroke:#78909C,stroke-width:1px,color:#263238

Why Nonlinear SVM? #

Some datasets cannot be separated by a straight line in the original feature space.

Soft-margin SVM helps when the data is almost separable with some noise.

But if the pattern is fundamentally nonlinear, soft margin may not be enough.

Example:

Original 1D/2D space: not linearly separable
Higher-dimensional feature space: linearly separable

The idea is:

\[ x \mapsto \phi(x) \]

where \( \phi(x) \) maps the input into a higher-dimensional feature space.

Feature Space View #

A nonlinear SVM applies a transformation:

\[ \phi:\mathbb{R}^d \rightarrow \mathbb{R}^D \]

where often:

\[ D>d \]

Then a linear SVM is trained in the transformed feature space.

The separating hyperplane becomes:

\[ w^T\phi(x)+b=0 \]

Prediction becomes:

\[ \hat{y}= \operatorname{sign}(w^T\phi(x)+b) \]

Problem with Explicit Feature Maps #

Explicitly computing \( \phi(x) \) can be expensive.

For example, a polynomial feature map can greatly increase the number of dimensions.

This may lead to:

high computational cost
high memory use
difficult feature construction
risk of overfitting

The kernel trick avoids explicit computation of \( \phi(x) \) .

Kernel Trick #

flowchart LR
    A[Dot product x_i dot x_j] --> B[Replace with K x_i x_j]
    B --> C[No explicit phi x needed]
    C --> D[Efficient nonlinear SVM]
    style A fill:#E1F5FE,stroke:#78909C,stroke-width:1px,color:#263238
    style B fill:#C8E6C9,stroke:#78909C,stroke-width:1px,color:#263238
    style C fill:#FFF9C4,stroke:#78909C,stroke-width:1px,color:#263238
    style D fill:#EDE7F6,stroke:#78909C,stroke-width:1px,color:#263238

The linear SVM dual depends on inner products:

\[ x_i^Tx_j \]

In feature space, these become:

\[ \phi(x_i)^T\phi(x_j) \]

A kernel function computes this directly:

\[ K(x_i,x_j)=\phi(x_i)^T\phi(x_j) \]

So we replace:

\[ x_i^Tx_j \]

with:

\[ K(x_i,x_j) \]

This is called the kernel trick.

Kernelised SVM Dual #

The hard-margin dual for linear SVM is:

\[ \max_{\alpha} \sum_i\alpha_i - \frac{1}{2} \sum_i\sum_j \alpha_i\alpha_jy_iy_jx_i^Tx_j \]

For nonlinear SVM, replace \( x_i^Tx_j \) with \( K(x_i,x_j) \) :

\[ \max_{\alpha} \sum_i\alpha_i - \frac{1}{2} \sum_i\sum_j \alpha_i\alpha_jy_iy_jK(x_i,x_j) \]

subject to:

\[ \alpha_i\ge0 \]

and:

\[ \sum_i\alpha_i y_i=0 \]

Kernelised Classifier #

The classifier becomes:

\[ \hat{y} = \operatorname{sign} \left( \sum_i\alpha_i y_iK(x_i,x)+b \right) \]

Only support vectors have non-zero \( \alpha_i \) , so practically:

\[ \hat{y} = \operatorname{sign} \left( \sum_{i\in SV}\alpha_i y_iK(x_i,x)+b \right) \]

where \( SV \) is the set of support vectors.

Kernel Matrix #

For training points \( x_1,\ldots,x_n \) , the kernel matrix is:

\[ K_{ij}=K(x_i,x_j) \]

This matrix stores all pairwise kernel values.

The lecture steps for nonlinear SVM are:

Select a kernel function
Compute pairwise kernel values between labelled examples
Use the kernel matrix to solve for support vectors and \( \alpha \) weights
Classify a new point using kernel values with the support vectors

Properties of Kernel Functions #

A kernel function:

\[ K(x,y) \]

takes two inputs and returns a real value.

It is symmetric:

\[ K(x,y)=K(y,x) \]

It corresponds to an inner product in some feature space:

\[ K(x_i,x_j)=\phi(x_i)^T\phi(x_j) \]

According to Mercer’s theorem, every positive-semidefinite symmetric function is a valid kernel.

Common Kernels #

Kernels allow inner products in high-dimensional feature spaces without explicit mapping.

Linear Kernel #

\[ K(x_i,x_j)=x_i^Tx_j \]

This gives ordinary linear SVM.

Polynomial Kernel #

\[ K(x_i,x_j)=(1+x_i^Tx_j)^p \]

where \( p \) is the polynomial degree.

This kernel can model polynomial decision boundaries.

Sigmoid Kernel #

\[ K(x_i,x_j)=\tanh(\beta_0x_i^Tx_j+\beta_1) \]

This resembles the activation function used in neural networks.

Exponential Distance Kernel #

The slides mention kernels of the form:

\[ k(x,x')=\exp(-d(x,x')) \]

where \( d(x,x') \) is a distance function.

A common related form is the Gaussian or RBF kernel:

\[ K(x_i,x_j)= \exp \left( -\frac{\|x_i-x_j\|^2}{2\sigma^2} \right) \]

Nonlinear SVM Workflow #

flowchart LR
    A[Choose Kernel] --> B[Compute Kernel Matrix]
    B --> C[Solve Dual]
    C --> D[Find Support Vectors]
    D --> E[Classify New Point]
    style A fill:#E1F5FE,stroke:#78909C,stroke-width:1px,color:#263238
    style B fill:#C8E6C9,stroke:#78909C,stroke-width:1px,color:#263238
    style C fill:#FFF9C4,stroke:#78909C,stroke-width:1px,color:#263238
    style D fill:#EDE7F6,stroke:#78909C,stroke-width:1px,color:#263238
    style E fill:#E1F5FE,stroke:#78909C,stroke-width:1px,color:#263238

XOR Example #

The XOR pattern is a classic nonlinear classification problem.

A typical XOR dataset is:

Point	\( x_1 \)	\( x_2 \)	Label
\( x_1 \)	-1	-1	-1
\( x_2 \)	-1	+1	+1
\( x_3 \)	+1	-1	+1
\( x_4 \)	+1	+1	-1

This cannot be separated by a single straight line in the original 2D space.

However, using a nonlinear transformation or kernel, it can become separable.

Feature Map for XOR Intuition #

A useful feature for XOR is the product:

\[ z=x_1x_2 \]

For the XOR-style labels above:

\( x_1 \)	\( x_2 \)	\( x_1x_2 \)	Label
-1	-1	+1	-1
-1	+1	-1	+1
+1	-1	-1	+1
+1	+1	+1	-1

Now the data can be separated based on the transformed feature \( x_1x_2 \) .

This shows why nonlinear feature spaces help.

Linear vs Nonlinear SVM #

Aspect	Linear SVM	Nonlinear SVM
Decision boundary	Straight hyperplane	Curved in original space
Uses kernel?	Usually no, or linear kernel	Yes
Good for	Linearly separable data	Nonlinear patterns
Main formula	\( x_i^Tx_j \)	\( K(x_i,x_j) \)
Example	simple two-class separation	XOR, circular patterns

Soft Margin with Nonlinear SVM #

Nonlinear SVM can also use soft margins.

The soft-margin kernelised dual is:

\[ \max_{\alpha} \sum_i\alpha_i - \frac{1}{2} \sum_i\sum_j \alpha_i\alpha_jy_iy_jK(x_i,x_j) \]

subject to:

\[ 0\le\alpha_i\le C \]

and:

\[ \sum_i\alpha_i y_i=0 \]

The upper bound \( C \) comes from soft-margin slack variables.

How to Classify a New Example #

Suppose \( x_* \) is a new point.

Step 1: Compute Kernel Values #

For each support vector \( x_i \) :

\[ K(x_i,x_*) \]

Step 2: Compute Decision Score #

\[ s= \sum_{i\in SV}\alpha_i y_iK(x_i,x_*)+b \]

Step 3: Predict Class #

\[ \hat{y}=\operatorname{sign}(s) \]

If \( s>0 \) , predict \( +1 \) . If \( s<0 \) , predict \( -1 \) .

Exam Template: Kernel SVM Classifier #

If the question gives \( \alpha_i \) , \( y_i \) , support vectors, kernel values and \( b \) , use:

\[ f(x)= \sum_{i\in SV}\alpha_i y_iK(x_i,x)+b \]

Then:

\[ \hat{y}=\operatorname{sign}(f(x)) \]

Do not calculate \( w \) explicitly unless the question asks for it.

Exam Template: Kernel Matrix #

If asked to compute a kernel matrix:

Step 1: List Points #

\[ x_1,x_2,\ldots,x_n \]

Step 2: Choose Kernel #

For example, polynomial:

\[ K(x_i,x_j)=(1+x_i^Tx_j)^p \]

Step 3: Fill Matrix #

\[ K= \begin{bmatrix} K(x_1,x_1) & K(x_1,x_2) & \cdots \\ K(x_2,x_1) & K(x_2,x_2) & \cdots \\ \vdots & \vdots & \ddots \end{bmatrix} \]

The matrix is symmetric.

Common Exam Mistakes #

Mistake	Correction
Explicitly computing \( \phi(x) \) when not needed	Use the kernel directly
Forgetting support vectors only	Sum over support vectors with \( \alpha_i>0 \)
Replacing \( x_i^Tx_j \) incorrectly	Replace every dot product with \( K(x_i,x_j) \)
Confusing linear and polynomial kernels	Linear: \( x_i^Tx_j \) , polynomial: \( (1+x_i^Tx_j)^p \)
Forgetting \( b \) in prediction	Always add \( b \) before taking sign

Quick Memory Line #

Nonlinear data → feature map φ → kernel K → dual uses K → classify by weighted support-vector kernels

Source Focus #

This page is based mainly on:

Lecture 16: nonlinear SVM, kernel trick, feature spaces, XOR example and nonlinear SVM steps
Lecture 14: kernel function definition, Mercer theorem and kernel examples
Lecture 15: soft-margin SVM and hinge loss background
Course handout: session 16 and Module 7 nonlinear SVM kernels

Home | Dimensionality reduction and PCA