Mathematical Preliminaries for SVM

Mathematical Preliminaries for SVM #

Support Vector Machines use optimisation, geometry and kernels. Before deriving SVM, we need constrained optimisation, Lagrange multipliers, primal and dual problems, KKT conditions, hyperplanes and kernel functions.

Key takeaway: SVM is built on constrained optimisation. The hard-margin SVM primal problem is a quadratic optimisation problem with linear inequality constraints. The dual problem uses Lagrange multipliers and leads naturally to support vectors and kernels.

  • Primal and dual perspectives
  • Geometry of margins

SVM Preliminaries Map #

flowchart TD
    A[SVM Preliminaries] --> B[Constrained Optimisation]
    A --> C[Primal and Dual]
    A --> D[KKT Conditions]
    A --> E[Hyperplanes]
    A --> F[Kernel Functions]

Why These Preliminaries Matter #

The SVM problem asks for a separating hyperplane with maximum margin.

This becomes an optimisation problem:

  • minimise a quadratic function
  • subject to inequality constraints
  • solve using Lagrange multipliers
  • identify active constraints using KKT conditions
  • use kernels when data is not linearly separable

General Constrained Optimisation Problem #

The standard form is:

[ \min_x f(x) ]

subject to:

[ g_i(x) \le 0,\quad i=1,2,\ldots,m ]

and:

[ h_j(x)=0,\quad j=1,2,\ldots,p ]

Here:

  • (f(x)) is the objective function
  • (g_i(x)) are inequality constraints
  • (h_j(x)) are equality constraints

Lagrangian #

The Lagrangian combines the objective and constraints into one expression.

[ \mathcal{L}(x,\lambda,\nu)

#

f(x)+\sum_{i=1}^{m}\lambda_i g_i(x)+\sum_{j=1}^{p}\nu_j h_j(x) ]

where:

  • (\lambda_i) are Lagrange multipliers for inequality constraints
  • (\nu_j) are Lagrange multipliers for equality constraints

For inequality constraints:

[ \lambda_i \ge 0 ]

For equality constraints, (\nu_j) is unrestricted.


Inequality Constraints and Sign Convention #

If the constraint is written as:

[ g_i(x)\le 0 ]

then the multiplier satisfies:

[ \lambda_i \ge 0 ]

This is the convention used in convex optimisation.

In SVM derivations, constraints may be written in different equivalent forms. Always check the sign before writing the Lagrangian.


Primal Problem #

The primal problem is the original constrained optimisation problem.

[ \min_x f(x) ]

subject to:

[ g_i(x)\le0 ]

The optimisation is performed over the original variables (x). These are called primal variables.


Dual Problem #

The dual function is:

[ D(\lambda)=\min_x \mathcal{L}(x,\lambda) ]

The dual problem is:

[ \max_{\lambda} D(\lambda) ]

subject to:

[ \lambda \ge 0 ]

The variables (\lambda) are called dual variables.


Weak Duality #

Weak duality says:

[ \text{dual optimum} \le \text{primal optimum} ]

For minimisation problems, the dual gives a lower bound on the primal optimum.

In the lecture slides, this appears through the minimax inequality:

[ \max_y \min_x \phi(x,y) \le \min_x \max_y \phi(x,y) ]

Weak duality always holds under broad conditions.


Strong Duality #

Strong duality says:

[ \text{dual optimum} = \text{primal optimum} ]

This means we can solve the dual problem and get the same optimal value as the primal problem.

This is useful in SVM because:

  • the dual depends on inner products between data points
  • inner products can be replaced by kernels
  • the solution depends only on support vectors

Slater’s Condition #

Slater’s condition gives a common case where strong duality holds.

For a convex optimisation problem, Slater’s condition holds if:

  1. the objective function (f) is convex
  2. the inequality constraints (g_i) are convex
  3. the equality constraints (h_j) are affine or linear
  4. there exists a strictly feasible point (\bar{x}) such that:
[ g_i(\bar{x}) < 0 ]

and:

[ h_j(\bar{x}) = 0 ]

If Slater’s condition holds, then strong duality holds.


KKT Conditions #

The Karush-Kuhn-Tucker conditions are the main optimality conditions for constrained optimisation.

For:

[ \min f(x) ]

subject to:

[ g_i(x)\le0,\quad h_j(x)=0 ]

the KKT conditions are:

1. Primal Feasibility #

[ g_i(x^*)\le0 ]

and:

[ h_j(x^*)=0 ]

2. Dual Feasibility #

[ \lambda_i^*\ge0 ]

3. Complementary Slackness #

[ \lambda_i^g_i(x^)=0 ]

4. Stationarity #

[ \nabla f(x^)+\sum_{i=1}^{m}\lambda_i^\nabla g_i(x^)+\sum_{j=1}^{p}\nu_j^\nabla h_j(x^*)=0 ]

Complementary Slackness Intuition #

Complementary slackness is very important for SVM.

[ \lambda_i g_i(x)=0 ]

This means:

CaseMeaning
(\lambda_i=0)constraint is not active
(g_i(x)=0)constraint is active
(\lambda_i>0)point lies exactly on the active boundary

In SVM, points with non-zero Lagrange multipliers become support vectors.


Quadratic Programming #

SVM optimisation is a quadratic programming problem.

A standard quadratic programme is:

[ \min_{x\in\mathbb{R}^d} \frac{1}{2}x^TQx+c^Tx ]

subject to:

[ Ax\le b ]

The SVM primal has this structure because it minimises:

[ \frac{1}{2}|w|^2 ]

subject to linear constraints involving training points.


Hyperplane #

A hyperplane is a generalised line or plane.

In two dimensions, it is a line. In three dimensions, it is a plane. In (d) dimensions, it is a ((d-1))-dimensional hyperplane.

The equation is:

[ w^Tx+b=0 ]

where:

  • (w) is the normal vector
  • (b) is the bias or intercept
  • (x) is the input vector

Why (w) Is Orthogonal to the Hyperplane #

Let (x_a) and (x_b) lie on the hyperplane.

Then:

[ w^Tx_a+b=0 ]

and:

[ w^Tx_b+b=0 ]

Subtract:

[ w^T(x_a-x_b)=0 ]

So (w) is orthogonal to any vector lying inside the hyperplane.

This is why (w) determines the orientation of the separating hyperplane.


Distance from a Point to a Hyperplane #

For the hyperplane:

[ w^Tx+b=0 ]

the distance of a point (x_0) from the hyperplane is:

[ D=\frac{|w^Tx_0+b|}{|w|} ]

This formula is used to derive the SVM margin.


Linear Classifier #

A linear classifier predicts using:

[ f(x)=w^Tx+b ]

The predicted class is:

[ \hat{y}=\operatorname{sign}(w^Tx+b) ]

If:

[ w^Tx+b>0 ]

predict positive class.

If:

[ w^Tx+b<0 ]

predict negative class.


Kernel Function #

A kernel is a function:

[ K(x,y) ]

that takes two inputs and returns a real number.

The lecture defines kernels as continuous functions that are symmetric:

[ K(x,y)=K(y,x) ]

A kernel corresponds to an inner product in some feature space:

[ K(x_i,x_j)=\phi(x_i)^T\phi(x_j) ]

where (\phi(x)) maps the original data to a higher-dimensional feature space.


Why Kernels Matter for SVM #

Linear SVM relies on dot products:

[ x_i^Tx_j ]

If data is mapped into a higher-dimensional space:

[ x \mapsto \phi(x) ]

then dot products become:

[ \phi(x_i)^T\phi(x_j) ]

The kernel trick replaces this with:

[ K(x_i,x_j) ]

So we do not need to explicitly compute (\phi(x)).


Mercer’s Theorem #

The slides state:

Every positive-semidefinite symmetric function is a kernel function.

This means that if a function is symmetric and positive semidefinite, it can be used as a valid kernel.


Common Kernel Examples #

Linear Kernel #

[ K(x_i,x_j)=x_i^Tx_j ]

Polynomial Kernel #

[ K(x_i,x_j)=(1+x_i^Tx_j)^p ]

Sigmoid Kernel #

[ K(x_i,x_j)=\tanh(\beta_0x_i^Tx_j+\beta_1) ]

RBF-like Distance Kernel #

The slides also mention kernels built from distances:

[ k(x,x’)=\exp(-d(x,x’)) ]

where (d(x,x’)) is a distance function.


Exam Method: KKT Setup #

When asked to write KKT conditions, use this template.

Given:

[ \min f(x) ]

subject to:

[ g(x)\le0 ]

write:

Lagrangian #

[ \mathcal{L}(x,\lambda)=f(x)+\lambda g(x) ]

KKT Conditions #

[ \nabla_x\mathcal{L}(x,\lambda)=0 ] [ g(x)\le0 ] [ \lambda\ge0 ] [ \lambda g(x)=0 ]

This often gives marks even if the full solution is difficult.


Quick Memory Line #

Constrained optimisation → Lagrangian → Dual → KKT → Hyperplane → Kernel

Source Focus #

This page is based mainly on:

  • Lecture 11: constrained optimisation, Lagrange multipliers, primal and dual problems, weak duality, convex optimisation, quadratic programming
  • Lecture 14: KKT conditions, hyperplanes, kernel functions and linear classifiers
  • Course handout: session 14 and Module 7 SVM preliminaries

Home | Dimensionality reduction and PCA 15-primal