Nonlinear SVM

Nonlinear SVM #

A linear SVM works well when the data can be separated by a straight line or hyperplane. When the data is not linearly separable in the original input space, nonlinear SVM maps the data to a higher-dimensional feature space where a linear separator may exist.

Key takeaway: Nonlinear SVM uses the kernel trick. Instead of explicitly mapping (x) to (\phi(x)), we compute inner products in the feature space using a kernel: [ K(x_i,x_j)=\phi(x_i)^T\phi(x_j) ]


Nonlinear SVM Idea #

flowchart TD
    A[Nonlinear Data] --> B[Feature Map or Kernel]
    B --> C[Higher-dimensional Feature Space]
    C --> D[Linear Separation]
    D --> E[Classifier in Original Space]

Why Nonlinear SVM? #

Some datasets cannot be separated by a straight line in the original feature space.

Soft-margin SVM helps when the data is almost separable with some noise.

But if the pattern is fundamentally nonlinear, soft margin may not be enough.

Example:

Original 1D/2D space: not linearly separable
Higher-dimensional feature space: linearly separable

The idea is:

[ x \mapsto \phi(x) ]

where (\phi(x)) maps the input into a higher-dimensional feature space.


Feature Space View #

A nonlinear SVM applies a transformation:

[ \phi:\mathbb{R}^d \rightarrow \mathbb{R}^D ]

where often:

[ D>d ]

Then a linear SVM is trained in the transformed feature space.

The separating hyperplane becomes:

[ w^T\phi(x)+b=0 ]

Prediction becomes:

[ \hat{y}= \operatorname{sign}(w^T\phi(x)+b) ]

Problem with Explicit Feature Maps #

Explicitly computing (\phi(x)) can be expensive.

For example, a polynomial feature map can greatly increase the number of dimensions.

This may lead to:

  • high computational cost
  • high memory use
  • difficult feature construction
  • risk of overfitting

The kernel trick avoids explicit computation of (\phi(x)).


Kernel Trick #

flowchart LR
    A[Dot product x_i dot x_j] --> B[Replace with K x_i x_j]
    B --> C[No explicit phi x needed]
    C --> D[Efficient nonlinear SVM]

The linear SVM dual depends on inner products:

[ x_i^Tx_j ]

In feature space, these become:

[ \phi(x_i)^T\phi(x_j) ]

A kernel function computes this directly:

[ K(x_i,x_j)=\phi(x_i)^T\phi(x_j) ]

So we replace:

[ x_i^Tx_j ]

with:

[ K(x_i,x_j) ]

This is called the kernel trick.


Kernelised SVM Dual #

The hard-margin dual for linear SVM is:

[ \max_{\alpha} \sum_i\alpha_i

#

\frac{1}{2} \sum_i\sum_j \alpha_i\alpha_jy_iy_jx_i^Tx_j ]

For nonlinear SVM, replace (x_i^Tx_j) with (K(x_i,x_j)):

[ \max_{\alpha} \sum_i\alpha_i

#

\frac{1}{2} \sum_i\sum_j \alpha_i\alpha_jy_iy_jK(x_i,x_j) ]

subject to:

[ \alpha_i\ge0 ]

and:

[ \sum_i\alpha_i y_i=0 ]

Kernelised Classifier #

The classifier becomes:

[ \hat{y}

#

\operatorname{sign} \left( \sum_i\alpha_i y_iK(x_i,x)+b \right) ]

Only support vectors have non-zero (\alpha_i), so practically:

[ \hat{y}

#

\operatorname{sign} \left( \sum_{i\in SV}\alpha_i y_iK(x_i,x)+b \right) ]

where (SV) is the set of support vectors.


Kernel Matrix #

For training points (x_1,\ldots,x_n), the kernel matrix is:

[ K_{ij}=K(x_i,x_j) ]

This matrix stores all pairwise kernel values.

The lecture steps for nonlinear SVM are:

  1. Select a kernel function
  2. Compute pairwise kernel values between labelled examples
  3. Use the kernel matrix to solve for support vectors and (\alpha) weights
  4. Classify a new point using kernel values with the support vectors

Properties of Kernel Functions #

A kernel function:

[ K(x,y) ]

takes two inputs and returns a real value.

It is symmetric:

[ K(x,y)=K(y,x) ]

It corresponds to an inner product in some feature space:

[ K(x_i,x_j)=\phi(x_i)^T\phi(x_j) ]

According to Mercer’s theorem, every positive-semidefinite symmetric function is a valid kernel.


Common Kernels #

Kernels allow inner products in high-dimensional feature spaces without explicit mapping.

Linear Kernel #

[ K(x_i,x_j)=x_i^Tx_j ]

This gives ordinary linear SVM.


Polynomial Kernel #

[ K(x_i,x_j)=(1+x_i^Tx_j)^p ]

where (p) is the polynomial degree.

This kernel can model polynomial decision boundaries.


Sigmoid Kernel #

[ K(x_i,x_j)=\tanh(\beta_0x_i^Tx_j+\beta_1) ]

This resembles the activation function used in neural networks.


Exponential Distance Kernel #

The slides mention kernels of the form:

[ k(x,x’)=\exp(-d(x,x’)) ]

where (d(x,x’)) is a distance function.

A common related form is the Gaussian or RBF kernel:

[ K(x_i,x_j)= \exp \left( -\frac{|x_i-x_j|^2}{2\sigma^2} \right) ]

Nonlinear SVM Workflow #

flowchart LR
    A[Choose Kernel] --> B[Compute Kernel Matrix]
    B --> C[Solve Dual]
    C --> D[Find Support Vectors]
    D --> E[Classify New Point]

XOR Example #

The XOR pattern is a classic nonlinear classification problem.

A typical XOR dataset is:

Point(x_1)(x_2)Label
(x_1)-1-1-1
(x_2)-1+1+1
(x_3)+1-1+1
(x_4)+1+1-1

This cannot be separated by a single straight line in the original 2D space.

However, using a nonlinear transformation or kernel, it can become separable.


Feature Map for XOR Intuition #

A useful feature for XOR is the product:

[ z=x_1x_2 ]

For the XOR-style labels above:

(x_1)(x_2)(x_1x_2)Label
-1-1+1-1
-1+1-1+1
+1-1-1+1
+1+1+1-1

Now the data can be separated based on the transformed feature (x_1x_2).

This shows why nonlinear feature spaces help.


Linear vs Nonlinear SVM #

AspectLinear SVMNonlinear SVM
Decision boundaryStraight hyperplaneCurved in original space
Uses kernel?Usually no, or linear kernelYes
Good forLinearly separable dataNonlinear patterns
Main formula(x_i^Tx_j)(K(x_i,x_j))
Examplesimple two-class separationXOR, circular patterns

Soft Margin with Nonlinear SVM #

Nonlinear SVM can also use soft margins.

The soft-margin kernelised dual is:

[ \max_{\alpha} \sum_i\alpha_i

#

\frac{1}{2} \sum_i\sum_j \alpha_i\alpha_jy_iy_jK(x_i,x_j) ]

subject to:

[ 0\le\alpha_i\le C ]

and:

[ \sum_i\alpha_i y_i=0 ]

The upper bound (C) comes from soft-margin slack variables.


How to Classify a New Example #

Suppose (x_*) is a new point.

Step 1: Compute Kernel Values #

For each support vector (x_i):

[ K(x_i,x_*) ]

Step 2: Compute Decision Score #

[ s= \sum_{i\in SV}\alpha_i y_iK(x_i,x_*)+b ]

Step 3: Predict Class #

[ \hat{y}=\operatorname{sign}(s) ]

If (s>0), predict (+1). If (s<0), predict (-1).


Exam Template: Kernel SVM Classifier #

If the question gives (\alpha_i), (y_i), support vectors, kernel values and (b), use:

[ f(x)= \sum_{i\in SV}\alpha_i y_iK(x_i,x)+b ]

Then:

[ \hat{y}=\operatorname{sign}(f(x)) ]

Do not calculate (w) explicitly unless the question asks for it.


Exam Template: Kernel Matrix #

If asked to compute a kernel matrix:

Step 1: List Points #

[ x_1,x_2,\ldots,x_n ]

Step 2: Choose Kernel #

For example, polynomial:

[ K(x_i,x_j)=(1+x_i^Tx_j)^p ]

Step 3: Fill Matrix #

[ K= \begin{bmatrix} K(x_1,x_1) & K(x_1,x_2) & \cdots \ K(x_2,x_1) & K(x_2,x_2) & \cdots \ \vdots & \vdots & \ddots \end{bmatrix} ]

The matrix is symmetric.


Common Exam Mistakes #

MistakeCorrection
Explicitly computing (\phi(x)) when not neededUse the kernel directly
Forgetting support vectors onlySum over support vectors with (\alpha_i>0)
Replacing (x_i^Tx_j) incorrectlyReplace every dot product with (K(x_i,x_j))
Confusing linear and polynomial kernelsLinear: (x_i^Tx_j), polynomial: ((1+x_i^Tx_j)^p)
Forgetting (b) in predictionAlways add (b) before taking sign

Quick Memory Line #

Nonlinear data → feature map φ → kernel K → dual uses K → classify by weighted support-vector kernels

Source Focus #

This page is based mainly on:

  • Lecture 16: nonlinear SVM, kernel trick, feature spaces, XOR example and nonlinear SVM steps
  • Lecture 14: kernel function definition, Mercer theorem and kernel examples
  • Lecture 15: soft-margin SVM and hinge loss background
  • Course handout: session 16 and Module 7 nonlinear SVM kernels

Home | Dimensionality reduction and PCA