Primal and Dual Perspective for Linear SVM #

A linear Support Vector Machine finds a hyperplane that separates two classes with the maximum possible margin.

The primal view gives the direct geometric optimisation problem. The dual view rewrites the problem using Lagrange multipliers and reveals why only support vectors matter.

Key takeaway: Linear SVM maximises the margin by minimising
\( \frac{1}{2}\|w\|^2 \)
subject to correct-classification constraints. The dual solution expresses ( w )
as a weighted sum of support vectors:
\[ w=\sum_i\alpha_i y_i x_i \]

Linear SVM Flow #

flowchart LR
    A[Training Data] --> B[Primal Problem]
    B --> C[Lagrangian]
    C --> D[Dual Problem]
    D --> E[Support Vectors]
    E --> F[Classifier]
    style A fill:#E1F5FE,stroke:#78909C,stroke-width:1px,color:#263238
    style B fill:#C8E6C9,stroke:#78909C,stroke-width:1px,color:#263238
    style C fill:#FFF9C4,stroke:#78909C,stroke-width:1px,color:#263238
    style D fill:#EDE7F6,stroke:#78909C,stroke-width:1px,color:#263238
    style E fill:#E1F5FE,stroke:#78909C,stroke-width:1px,color:#263238
    style F fill:#C8E6C9,stroke:#78909C,stroke-width:1px,color:#263238

Binary Classification Setup #

Training data:

\[ (x_1,y_1),(x_2,y_2),\ldots,(x_n,y_n) \]

where:

\[ x_i\in\mathbb{R}^d \]

and labels are:

\[ y_i\in\{-1,+1\} \]

A linear classifier uses:

\[ f(x)=w^Tx+b \]

Prediction:

\[ \hat{y}=\operatorname{sign}(w^Tx+b) \]

Separating Hyperplane #

The decision boundary is:

\[ w^Tx+b=0 \]

The two margin boundaries are:

\[ w^Tx+b=+1 \]

and:

\[ w^Tx+b=-1 \]

The margin width is:

\[ \frac{2}{\|w\|} \]

To maximise the margin, we minimise \( \|w\| \) , equivalently:

\[ \frac{1}{2}\|w\|^2 \]

Hard-Margin SVM Primal Problem #

For linearly separable data, the hard-margin SVM primal problem is:

\[ \min_{w,b}\frac{1}{2}\|w\|^2 \]

subject to:

\[ y_i(w^Tx_i+b)\ge1,\quad i=1,\ldots,n \]

This single constraint covers both classes:

For \( y_i=+1 \) :

\[ w^Tx_i+b\ge1 \]

For \( y_i=-1 \) :

\[ w^Tx_i+b\le-1 \]

Why This Is a Quadratic Optimisation Problem #

The objective is quadratic:

\[ \frac{1}{2}\|w\|^2 \]

The constraints are linear in \( w \) and \( b \) :

\[ y_i(w^Tx_i+b)\ge1 \]

Therefore, linear SVM is a quadratic programming problem with linear inequality constraints.

SVM Lagrangian #

Rewrite the constraints as:

\[ 1-y_i(w^Tx_i+b)\le0 \]

Introduce Lagrange multipliers:

\[ \alpha_i\ge0 \]

The Lagrangian is:

\[ \mathcal{L}(w,b,\alpha) = \frac{1}{2}\|w\|^2 -\sum_{i=1}^{n}\alpha_i[y_i(w^Tx_i+b)-1] \]

Equivalent expanded form:

\[ \mathcal{L}(w,b,\alpha) = \frac{1}{2}\|w\|^2 -\sum_i\alpha_i y_i w^Tx_i -\sum_i\alpha_i y_i b +\sum_i\alpha_i \]

Stationarity with Respect to \( w \) #

Set:

\[ \frac{\partial \mathcal{L}}{\partial w}=0 \]

This gives:

\[ w-\sum_i\alpha_i y_i x_i=0 \]

Therefore:

\[ w=\sum_i\alpha_i y_i x_i \]

This is a crucial SVM result.

The weight vector is a weighted sum of training examples.

Stationarity with Respect to \( b \) #

Set:

\[ \frac{\partial \mathcal{L}}{\partial b}=0 \]

This gives:

\[ \sum_i\alpha_i y_i=0 \]

This becomes one of the dual constraints.

Dual Objective #

Substitute:

\[ w=\sum_i\alpha_i y_i x_i \]

and:

\[ \sum_i\alpha_i y_i=0 \]

into the Lagrangian.

The dual problem becomes:

\[ \max_{\alpha} \sum_i\alpha_i - \frac{1}{2}\sum_i\sum_j\alpha_i\alpha_jy_iy_jx_i^Tx_j \]

subject to:

\[ \alpha_i\ge0 \]

and:

\[ \sum_i\alpha_i y_i=0 \]

Important Observation #

The dual problem depends on the data only through inner products:

\[ x_i^Tx_j \]

This is why kernels can be used later.

In nonlinear SVM, we replace:

\[ x_i^Tx_j \]

with:

\[ K(x_i,x_j) \]

KKT Complementary Slackness for SVM #

The SVM complementary slackness condition is:

\[ \alpha_i[y_i(w^Tx_i+b)-1]=0 \]

This means:

Case	Meaning
\( \alpha_i=0 \)	point is not a support vector
\( \alpha_i>0 \)	point lies on the margin
\( y_i(w^Tx_i+b)=1 \)	constraint is active

Support vectors are the points for which:

\[ y_i(w^Tx_i+b)=1 \]

Support Vectors #

Support vectors are the training points closest to the separating hyperplane.

They lie on:

\[ w^Tx+b=+1 \]

or:

\[ w^Tx+b=-1 \]

They determine the maximum-margin hyperplane.

If a non-support vector is removed, the hyperplane usually does not change. If a support vector is removed, the hyperplane may change.

Computing \( b \) #

For any support vector:

\[ y_i(w^Tx_i+b)=1 \]

So:

\[ w^Tx_i+b=y_i \]

because \( y_i\in\{-1,+1\} \) .

Therefore:

\[ b=y_i-w^Tx_i \]

In practice, if there are multiple support vectors, average the computed \( b \) values.

Classification Function #

Using:

\[ w=\sum_i\alpha_i y_i x_i \]

the classifier becomes:

\[ f(x)=\operatorname{sign}(w^Tx+b) \]

Substitute \( w \) :

\[ f(x)= \operatorname{sign} \left( \sum_i\alpha_i y_i x_i^Tx+b \right) \]

Only support vectors have \( \alpha_i>0 \) , so only support vectors contribute to the classifier.

Exam Template: Linear SVM from Support Vectors #

If a question gives support vectors and labels, do this:

Step 1: Write Support Vector Equations #

For positive support vector:

\[ w^Tx_i+b=1 \]

For negative support vector:

\[ w^Tx_i+b=-1 \]

Step 2: Solve for \( w \) and \( b \) #

If \( w=[w_1,w_2]^T \) , then:

\[ w^Tx_i+b=w_1x_{i1}+w_2x_{i2}+b \]

Use simultaneous equations.

Step 3: Write Decision Boundary #

\[ w^Tx+b=0 \]

Step 4: Classify New Point #

\[ \hat{y}=\operatorname{sign}(w^Tx+b) \]

Hard vs Soft Margin #

flowchart LR
    A[SVM] --> B[Hard Margin]
    A --> C[Soft Margin]
    B --> D[No misclassification allowed]
    C --> E[Slack variables xi_i]
    C --> F[Penalty parameter C]
    style A fill:#E1F5FE,stroke:#78909C,stroke-width:1px,color:#263238
    style B fill:#C8E6C9,stroke:#78909C,stroke-width:1px,color:#263238
    style C fill:#FFF9C4,stroke:#78909C,stroke-width:1px,color:#263238
    style D fill:#EDE7F6,stroke:#78909C,stroke-width:1px,color:#263238
    style E fill:#E1F5FE,stroke:#78909C,stroke-width:1px,color:#263238
    style F fill:#C8E6C9,stroke:#78909C,stroke-width:1px,color:#263238

In formulas, the slack variable is written as ( \xi_i )
. In the diagram above, it is shown as xi_i to keep Mermaid rendering simple.

Soft-Margin SVM #

Hard-margin SVM requires all training points to be correctly classified.

For noisy data, this may be impossible or undesirable.

Soft-margin SVM introduces slack variables:

\[ \xi_i\ge0 \]

The soft-margin primal problem is:

\[ \min_{w,b,\xi} \frac{1}{2}\|w\|^2+C\sum_{i=1}^{n}\xi_i \]

subject to:

\[ y_i(w^Tx_i+b)\ge1-\xi_i \]

and:

\[ \xi_i\ge0 \]

Meaning of \( C \) #

The parameter \( C \) controls the penalty for margin violations.

\( C \) value	Effect
Large \( C \)	stronger penalty for misclassification; smaller margin may be chosen
Small \( C \)	more tolerance for misclassification; larger margin may be chosen

So \( C \) controls the trade-off between:

large margin
training error penalty

Hinge Loss #

The hinge loss is:

\[ L_i=\max(0,1-y_i(w^Tx_i+b)) \]

If:

\[ y_i(w^Tx_i+b)\ge1 \]

then:

\[ L_i=0 \]

The point is correctly classified with enough margin.

If:

\[ y_i(w^Tx_i+b)<1 \]

then the point has positive hinge loss.

Soft-Margin Objective Using Hinge Loss #

The objective can be written as:

\[ J(w,b)= \frac{1}{2}\|w\|^2+ C\sum_{i=1}^{n} \max(0,1-y_i(w^Tx_i+b)) \]

This is the form often used in gradient-based SVM training.

Hinge Loss Numerical Method #

For each example:

Step 1: Compute Score #

\[ s=w^Tx+b \]

Step 2: Multiply by Label #

\[ ys=y(w^Tx+b) \]

Step 3: Apply Hinge Loss #

\[ L=\max(0,1-ys) \]

Quick Example #

Suppose:

\[ y=+1,\quad w^Tx+b=0.5 \]

Then:

\[ L=\max(0,1-(1)(0.5))=0.5 \]

Suppose:

\[ y=-1,\quad w^Tx+b=0.5 \]

Then:

\[ L=\max(0,1-(-1)(0.5))=1.5 \]

Exam Method: Full Linear SVM Derivation #

If asked to derive the dual, write the following sequence.

1. Primal #

\[ \min_{w,b}\frac{1}{2}\|w\|^2 \]

subject to:

\[ y_i(w^Tx_i+b)\ge1 \]

2. Lagrangian #

\[ \mathcal{L} = \frac{1}{2}\|w\|^2 - \sum_i\alpha_i[y_i(w^Tx_i+b)-1] \]

3. Stationarity #

\[ \frac{\partial \mathcal{L}}{\partial w}=0 \Rightarrow w=\sum_i\alpha_i y_i x_i \] \[ \frac{\partial \mathcal{L}}{\partial b}=0 \Rightarrow \sum_i\alpha_i y_i=0 \]

4. Dual #

\[ \max_{\alpha} \sum_i\alpha_i - \frac{1}{2} \sum_i\sum_j \alpha_i\alpha_jy_iy_jx_i^Tx_j \]

subject to:

\[ \alpha_i\ge0,\quad \sum_i\alpha_i y_i=0 \]

5. Classifier #

\[ \hat{y} = \operatorname{sign} \left( \sum_i\alpha_i y_i x_i^Tx+b \right) \]

Common Exam Mistakes #

Mistake	Correction
Maximising \( \\|w\\| \)	Maximise margin \( 2/\\|w\\| \) , so minimise \( \frac{1}{2}\\|w\\|^2 \)
Forgetting \( y_i \) in constraint	Use \( y_i(w^Tx_i+b)\ge1 \)
Forgetting \( \sum_i\alpha_i y_i=0 \)	This comes from derivative with respect to \( b \)
Thinking all points are support vectors	Only points with \( \alpha_i>0 \) are support vectors
Using \( w^Tx+b=0 \) for support vectors	Support vectors lie on \( w^Tx+b=\pm1 \) , not the centre line

Quick Memory Line #

Max margin → minimise ||w||² → constraints → Lagrangian → α → support vectors → classifier

Source Focus #

This page is based mainly on:

Lecture 15: solving the SVM optimisation problem, Lagrangian, support vectors, linear SVM numerical method, hinge loss and soft margin
Lecture 16: maximum margin geometry and hard-margin linear SVM formulation
Lecture 14: KKT and hyperplane preliminaries
Course handout: session 15 and SVM practical lab outcomes

Home | Dimensionality reduction and PCA