Mathematical Preliminaries for SVM #
Support Vector Machines use optimisation, geometry and kernels. Before deriving SVM, we need constrained optimisation, Lagrange multipliers, primal and dual problems, KKT conditions, hyperplanes and kernel functions.
Key takeaway: SVM is built on constrained optimisation. The hard-margin SVM primal problem is a quadratic optimisation problem with linear inequality constraints. The dual problem uses Lagrange multipliers and leads naturally to support vectors and kernels.
- Primal and dual perspectives
- Geometry of margins
SVM Preliminaries Map #
flowchart TD
A[SVM Preliminaries] --> B[Constrained Optimisation]
A --> C[Primal and Dual]
A --> D[KKT Conditions]
A --> E[Hyperplanes]
A --> F[Kernel Functions]
Why These Preliminaries Matter #
The SVM problem asks for a separating hyperplane with maximum margin.
This becomes an optimisation problem:
- minimise a quadratic function
- subject to inequality constraints
- solve using Lagrange multipliers
- identify active constraints using KKT conditions
- use kernels when data is not linearly separable
General Constrained Optimisation Problem #
The standard form is:
[ \min_x f(x) ]subject to:
[ g_i(x) \le 0,\quad i=1,2,\ldots,m ]and:
[ h_j(x)=0,\quad j=1,2,\ldots,p ]Here:
- (f(x)) is the objective function
- (g_i(x)) are inequality constraints
- (h_j(x)) are equality constraints
Lagrangian #
The Lagrangian combines the objective and constraints into one expression.
[ \mathcal{L}(x,\lambda,\nu)
f(x)+\sum_{i=1}^{m}\lambda_i g_i(x)+\sum_{j=1}^{p}\nu_j h_j(x) ]
where:
- (\lambda_i) are Lagrange multipliers for inequality constraints
- (\nu_j) are Lagrange multipliers for equality constraints
For inequality constraints:
[ \lambda_i \ge 0 ]For equality constraints, (\nu_j) is unrestricted.
Inequality Constraints and Sign Convention #
If the constraint is written as:
[ g_i(x)\le 0 ]then the multiplier satisfies:
[ \lambda_i \ge 0 ]This is the convention used in convex optimisation.
In SVM derivations, constraints may be written in different equivalent forms. Always check the sign before writing the Lagrangian.
Primal Problem #
The primal problem is the original constrained optimisation problem.
[ \min_x f(x) ]subject to:
[ g_i(x)\le0 ]The optimisation is performed over the original variables (x). These are called primal variables.
Dual Problem #
The dual function is:
[ D(\lambda)=\min_x \mathcal{L}(x,\lambda) ]The dual problem is:
[ \max_{\lambda} D(\lambda) ]subject to:
[ \lambda \ge 0 ]The variables (\lambda) are called dual variables.
Weak Duality #
Weak duality says:
[ \text{dual optimum} \le \text{primal optimum} ]For minimisation problems, the dual gives a lower bound on the primal optimum.
In the lecture slides, this appears through the minimax inequality:
[ \max_y \min_x \phi(x,y) \le \min_x \max_y \phi(x,y) ]Weak duality always holds under broad conditions.
Strong Duality #
Strong duality says:
[ \text{dual optimum} = \text{primal optimum} ]This means we can solve the dual problem and get the same optimal value as the primal problem.
This is useful in SVM because:
- the dual depends on inner products between data points
- inner products can be replaced by kernels
- the solution depends only on support vectors
Slater’s Condition #
Slater’s condition gives a common case where strong duality holds.
For a convex optimisation problem, Slater’s condition holds if:
- the objective function (f) is convex
- the inequality constraints (g_i) are convex
- the equality constraints (h_j) are affine or linear
- there exists a strictly feasible point (\bar{x}) such that:
and:
[ h_j(\bar{x}) = 0 ]If Slater’s condition holds, then strong duality holds.
KKT Conditions #
The Karush-Kuhn-Tucker conditions are the main optimality conditions for constrained optimisation.
For:
[ \min f(x) ]subject to:
[ g_i(x)\le0,\quad h_j(x)=0 ]the KKT conditions are:
1. Primal Feasibility #
[ g_i(x^*)\le0 ]and:
[ h_j(x^*)=0 ]2. Dual Feasibility #
[ \lambda_i^*\ge0 ]3. Complementary Slackness #
[ \lambda_i^g_i(x^)=0 ]4. Stationarity #
[ \nabla f(x^)+\sum_{i=1}^{m}\lambda_i^\nabla g_i(x^)+\sum_{j=1}^{p}\nu_j^\nabla h_j(x^*)=0 ]Complementary Slackness Intuition #
Complementary slackness is very important for SVM.
[ \lambda_i g_i(x)=0 ]This means:
| Case | Meaning |
|---|---|
| (\lambda_i=0) | constraint is not active |
| (g_i(x)=0) | constraint is active |
| (\lambda_i>0) | point lies exactly on the active boundary |
In SVM, points with non-zero Lagrange multipliers become support vectors.
Quadratic Programming #
SVM optimisation is a quadratic programming problem.
A standard quadratic programme is:
[ \min_{x\in\mathbb{R}^d} \frac{1}{2}x^TQx+c^Tx ]subject to:
[ Ax\le b ]The SVM primal has this structure because it minimises:
[ \frac{1}{2}|w|^2 ]subject to linear constraints involving training points.
Hyperplane #
A hyperplane is a generalised line or plane.
In two dimensions, it is a line. In three dimensions, it is a plane. In (d) dimensions, it is a ((d-1))-dimensional hyperplane.
The equation is:
[ w^Tx+b=0 ]where:
- (w) is the normal vector
- (b) is the bias or intercept
- (x) is the input vector
Why (w) Is Orthogonal to the Hyperplane #
Let (x_a) and (x_b) lie on the hyperplane.
Then:
[ w^Tx_a+b=0 ]and:
[ w^Tx_b+b=0 ]Subtract:
[ w^T(x_a-x_b)=0 ]So (w) is orthogonal to any vector lying inside the hyperplane.
This is why (w) determines the orientation of the separating hyperplane.
Distance from a Point to a Hyperplane #
For the hyperplane:
[ w^Tx+b=0 ]the distance of a point (x_0) from the hyperplane is:
[ D=\frac{|w^Tx_0+b|}{|w|} ]This formula is used to derive the SVM margin.
Linear Classifier #
A linear classifier predicts using:
[ f(x)=w^Tx+b ]The predicted class is:
[ \hat{y}=\operatorname{sign}(w^Tx+b) ]If:
[ w^Tx+b>0 ]predict positive class.
If:
[ w^Tx+b<0 ]predict negative class.
Kernel Function #
A kernel is a function:
[ K(x,y) ]that takes two inputs and returns a real number.
The lecture defines kernels as continuous functions that are symmetric:
[ K(x,y)=K(y,x) ]A kernel corresponds to an inner product in some feature space:
[ K(x_i,x_j)=\phi(x_i)^T\phi(x_j) ]where (\phi(x)) maps the original data to a higher-dimensional feature space.
Why Kernels Matter for SVM #
Linear SVM relies on dot products:
[ x_i^Tx_j ]If data is mapped into a higher-dimensional space:
[ x \mapsto \phi(x) ]then dot products become:
[ \phi(x_i)^T\phi(x_j) ]The kernel trick replaces this with:
[ K(x_i,x_j) ]So we do not need to explicitly compute (\phi(x)).
Mercer’s Theorem #
The slides state:
Every positive-semidefinite symmetric function is a kernel function.
This means that if a function is symmetric and positive semidefinite, it can be used as a valid kernel.
Common Kernel Examples #
Linear Kernel #
[ K(x_i,x_j)=x_i^Tx_j ]Polynomial Kernel #
[ K(x_i,x_j)=(1+x_i^Tx_j)^p ]Sigmoid Kernel #
[ K(x_i,x_j)=\tanh(\beta_0x_i^Tx_j+\beta_1) ]RBF-like Distance Kernel #
The slides also mention kernels built from distances:
[ k(x,x’)=\exp(-d(x,x’)) ]where (d(x,x’)) is a distance function.
Exam Method: KKT Setup #
When asked to write KKT conditions, use this template.
Given:
[ \min f(x) ]subject to:
[ g(x)\le0 ]write:
Lagrangian #
[ \mathcal{L}(x,\lambda)=f(x)+\lambda g(x) ]KKT Conditions #
[ \nabla_x\mathcal{L}(x,\lambda)=0 ] [ g(x)\le0 ] [ \lambda\ge0 ] [ \lambda g(x)=0 ]This often gives marks even if the full solution is difficult.
Quick Memory Line #
Constrained optimisation → Lagrangian → Dual → KKT → Hyperplane → Kernel
Source Focus #
This page is based mainly on:
- Lecture 11: constrained optimisation, Lagrange multipliers, primal and dual problems, weak duality, convex optimisation, quadratic programming
- Lecture 14: KKT conditions, hyperplanes, kernel functions and linear classifiers
- Course handout: session 14 and Module 7 SVM preliminaries
Home | Dimensionality reduction and PCA 15-primal