Mathematical Preliminaries for SVM #

Support Vector Machines use optimisation, geometry and kernels. Before deriving SVM, we need constrained optimisation, Lagrange multipliers, primal and dual problems, KKT conditions, hyperplanes and kernel functions.

Key takeaway: SVM is built on constrained optimisation. The hard-margin SVM primal problem is a quadratic optimisation problem with linear inequality constraints. The dual problem uses Lagrange multipliers and leads naturally to support vectors and kernels.

Primal and dual perspectives
Geometry of margins

SVM Preliminaries Map #

flowchart TD
    A[SVM Preliminaries] --> B[Constrained Optimisation]
    A --> C[Primal and Dual]
    A --> D[KKT Conditions]
    A --> E[Hyperplanes]
    A --> F[Kernel Functions]
    style A fill:#E1F5FE,stroke:#78909C,stroke-width:1px,color:#263238
    style B fill:#C8E6C9,stroke:#78909C,stroke-width:1px,color:#263238
    style C fill:#FFF9C4,stroke:#78909C,stroke-width:1px,color:#263238
    style D fill:#EDE7F6,stroke:#78909C,stroke-width:1px,color:#263238
    style E fill:#E1F5FE,stroke:#78909C,stroke-width:1px,color:#263238
    style F fill:#C8E6C9,stroke:#78909C,stroke-width:1px,color:#263238

Why These Preliminaries Matter #

The SVM problem asks for a separating hyperplane with maximum margin.

This becomes an optimisation problem:

minimise a quadratic function
subject to inequality constraints
solve using Lagrange multipliers
identify active constraints using KKT conditions
use kernels when data is not linearly separable

General Constrained Optimisation Problem #

The standard form is:

\[ \min_x f(x) \]

subject to:

\[ g_i(x) \le 0,\quad i=1,2,\ldots,m \]

and:

\[ h_j(x)=0,\quad j=1,2,\ldots,p \]

Here:

\( f(x) \) is the objective function
\( g_i(x) \) are inequality constraints
\( h_j(x) \) are equality constraints

Lagrangian #

The Lagrangian combines the objective and constraints into one expression.

\[ \mathcal{L}(x,\lambda,\nu) = f(x)+\sum_{i=1}^{m}\lambda_i g_i(x)+\sum_{j=1}^{p}\nu_j h_j(x) \]

where:

\( \lambda_i \) are Lagrange multipliers for inequality constraints
\( \nu_j \) are Lagrange multipliers for equality constraints

For inequality constraints:

\[ \lambda_i \ge 0 \]

For equality constraints, \( \nu_j \) is unrestricted.

Inequality Constraints and Sign Convention #

If the constraint is written as:

\[ g_i(x)\le 0 \]

then the multiplier satisfies:

\[ \lambda_i \ge 0 \]

This is the convention used in convex optimisation.

In SVM derivations, constraints may be written in different equivalent forms. Always check the sign before writing the Lagrangian.

Primal Problem #

The primal problem is the original constrained optimisation problem.

\[ \min_x f(x) \]

subject to:

\[ g_i(x)\le0 \]

The optimisation is performed over the original variables \( x \) . These are called primal variables.

Dual Problem #

The dual function is:

\[ D(\lambda)=\min_x \mathcal{L}(x,\lambda) \]

The dual problem is:

\[ \max_{\lambda} D(\lambda) \]

subject to:

\[ \lambda \ge 0 \]

The variables \( \lambda \) are called dual variables.

Weak Duality #

Weak duality says:

\[ \text{dual optimum} \le \text{primal optimum} \]

For minimisation problems, the dual gives a lower bound on the primal optimum.

In the lecture slides, this appears through the minimax inequality:

\[ \max_y \min_x \phi(x,y) \le \min_x \max_y \phi(x,y) \]

Weak duality always holds under broad conditions.

Strong Duality #

Strong duality says:

\[ \text{dual optimum} = \text{primal optimum} \]

This means we can solve the dual problem and get the same optimal value as the primal problem.

This is useful in SVM because:

the dual depends on inner products between data points
inner products can be replaced by kernels
the solution depends only on support vectors

Slater’s Condition #

Slater’s condition gives a common case where strong duality holds.

For a convex optimisation problem, Slater’s condition holds if:

the objective function \( f \) is convex
the inequality constraints \( g_i \) are convex
the equality constraints \( h_j \) are affine or linear
there exists a strictly feasible point \( \bar{x} \) such that:

\[ g_i(\bar{x}) < 0 \]

and:

\[ h_j(\bar{x}) = 0 \]

If Slater’s condition holds, then strong duality holds.

KKT Conditions #

The Karush-Kuhn-Tucker conditions are the main optimality conditions for constrained optimisation.

For:

\[ \min f(x) \]

subject to:

\[ g_i(x)\le0,\quad h_j(x)=0 \]

the KKT conditions are:

1. Primal Feasibility #

\[ g_i(x^*)\le0 \]

and:

\[ h_j(x^*)=0 \]

2. Dual Feasibility #

\[ \lambda_i^*\ge0 \]

3. Complementary Slackness #

\[ \lambda_i^*g_i(x^*)=0 \]

4. Stationarity #

\[ \nabla f(x^*)+\sum_{i=1}^{m}\lambda_i^*\nabla g_i(x^*)+\sum_{j=1}^{p}\nu_j^*\nabla h_j(x^*)=0 \]

Complementary Slackness Intuition #

Complementary slackness is very important for SVM.

\[ \lambda_i g_i(x)=0 \]

This means:

Case	Meaning
\( \lambda_i=0 \)	constraint is not active
\( g_i(x)=0 \)	constraint is active
\( \lambda_i>0 \)	point lies exactly on the active boundary

In SVM, points with non-zero Lagrange multipliers become support vectors.

Quadratic Programming #

SVM optimisation is a quadratic programming problem.

A standard quadratic programme is:

\[ \min_{x\in\mathbb{R}^d} \frac{1}{2}x^TQx+c^Tx \]

subject to:

\[ Ax\le b \]

The SVM primal has this structure because it minimises:

\[ \frac{1}{2}\|w\|^2 \]

subject to linear constraints involving training points.

Hyperplane #

A hyperplane is a generalised line or plane.

In two dimensions, it is a line. In three dimensions, it is a plane. In \( d \) dimensions, it is a \( (d-1) \) -dimensional hyperplane.

The equation is:

\[ w^Tx+b=0 \]

where:

\( w \) is the normal vector
\( b \) is the bias or intercept
\( x \) is the input vector

Why \( w \) Is Orthogonal to the Hyperplane #

Let \( x_a \) and \( x_b \) lie on the hyperplane.

Then:

\[ w^Tx_a+b=0 \]

and:

\[ w^Tx_b+b=0 \]

Subtract:

\[ w^T(x_a-x_b)=0 \]

So \( w \) is orthogonal to any vector lying inside the hyperplane.

This is why \( w \) determines the orientation of the separating hyperplane.

Distance from a Point to a Hyperplane #

For the hyperplane:

\[ w^Tx+b=0 \]

the distance of a point \( x_0 \) from the hyperplane is:

\[ D=\frac{|w^Tx_0+b|}{\|w\|} \]

This formula is used to derive the SVM margin.

Linear Classifier #

A linear classifier predicts using:

\[ f(x)=w^Tx+b \]

The predicted class is:

\[ \hat{y}=\operatorname{sign}(w^Tx+b) \]

If:

\[ w^Tx+b>0 \]

predict positive class.

If:

\[ w^Tx+b<0 \]

predict negative class.

Kernel Function #

A kernel is a function:

\[ K(x,y) \]

that takes two inputs and returns a real number.

The lecture defines kernels as continuous functions that are symmetric:

\[ K(x,y)=K(y,x) \]

A kernel corresponds to an inner product in some feature space:

\[ K(x_i,x_j)=\phi(x_i)^T\phi(x_j) \]

where \( \phi(x) \) maps the original data to a higher-dimensional feature space.

Why Kernels Matter for SVM #

Linear SVM relies on dot products:

\[ x_i^Tx_j \]

If data is mapped into a higher-dimensional space:

\[ x \mapsto \phi(x) \]

then dot products become:

\[ \phi(x_i)^T\phi(x_j) \]

The kernel trick replaces this with:

\[ K(x_i,x_j) \]

So we do not need to explicitly compute \( \phi(x) \) .

Mercer’s Theorem #

The slides state:

Every positive-semidefinite symmetric function is a kernel function.

This means that if a function is symmetric and positive semidefinite, it can be used as a valid kernel.

Common Kernel Examples #

Linear Kernel #

\[ K(x_i,x_j)=x_i^Tx_j \]

Polynomial Kernel #

\[ K(x_i,x_j)=(1+x_i^Tx_j)^p \]

Sigmoid Kernel #

\[ K(x_i,x_j)=\tanh(\beta_0x_i^Tx_j+\beta_1) \]

RBF-like Distance Kernel #

The slides also mention kernels built from distances:

\[ k(x,x')=\exp(-d(x,x')) \]

where \( d(x,x') \) is a distance function.

Exam Method: KKT Setup #

When asked to write KKT conditions, use this template.

Given:

\[ \min f(x) \]

subject to:

\[ g(x)\le0 \]

write:

Lagrangian #

\[ \mathcal{L}(x,\lambda)=f(x)+\lambda g(x) \]

KKT Conditions #

\[ \nabla_x\mathcal{L}(x,\lambda)=0 \] \[ g(x)\le0 \] \[ \lambda\ge0 \] \[ \lambda g(x)=0 \]

This often gives marks even if the full solution is difficult.

Quick Memory Line #

Constrained optimisation → Lagrangian → Dual → KKT → Hyperplane → Kernel

Source Focus #

This page is based mainly on:

Lecture 11: constrained optimisation, Lagrange multipliers, primal and dual problems, weak duality, convex optimisation, quadratic programming
Lecture 14: KKT conditions, hyperplanes, kernel functions and linear classifiers
Course handout: session 14 and Module 7 SVM preliminaries

Home | Dimensionality reduction and PCA 15-primal