Support Vector Machine

Support Vector Machine (SVM) #

Support Vector Machine (SVM) is a supervised machine learning algorithm used for:

  • Classification (most common)
  • Regression (SVR – Support Vector Regression)

It connects many earlier ideas:

  • classification and decision boundaries
  • linear classifiers
  • margins
  • optimisation
  • constrained optimisation
  • kernels for non-linear data

SVM is a discriminative classifier.

That means it does not try to model how each class is generated.

Instead, it tries to find the best separating boundary between classes.

Key takeaway:
SVM tries to find a decision boundary that separates the classes with the largest possible margin.

The training points that decide this margin are called support vectors.

  • Linearly separable data
  • Non-linearly separable data
  • Kernel Trick (Mercer)
  • Applications to both structured and unstructured data

Where SVM Fits in Machine Learning ☆ #

In supervised learning, we normally work with input-output pairs.

For classification, the output is categorical.

For example:

Input featuresTarget
CGPA, entrance scoreadmission: yes / no
tumour size, cell shapebenign / malignant
email wordsspam / not spam

SVM is commonly introduced for binary classification, where the target has two classes.

The two classes are often represented as:

\[ y \in \{-1, +1\} \]

For multiclass classification, SVM can be extended using strategies such as:

  • one-vs-rest
  • one-vs-one
flowchart LR
    A[Training Data] --> B[Find Separating Boundary]
    B --> C[Maximise Margin]
    C --> D[Identify Support Vectors]
    D --> E[Classify New Point]

    style A fill:#E1F5FE
    style B fill:#FFF9C4
    style C fill:#C8E6C9
    style D fill:#EDE7F6
    style E fill:#C8E6C9

The central question in SVM is:

If many boundaries can separate the classes, which one should we choose?

SVM answers:

Choose the boundary with the maximum margin.


Decision Boundary ☆ #

A linear classifier separates classes using a line in 2D, a plane in 3D, or a hyperplane in higher dimensions.

The decision boundary is written as:

\[ w^T x + b = 0 \]

Where:

SymbolMeaning
\( x \)input feature vector
\( w \)weight vector, perpendicular to the boundary
\( b \)bias or intercept
\( w^T x + b \)score used for classification

The class is predicted using the sign of the score:

\[ f(x) = \operatorname{sign}(w^T x + b) \]

If the score is positive, classify the point as +1.

If the score is negative, classify the point as −1.


Why Not Just Any Separating Line? #

For linearly separable data, there may be many possible decision boundaries.

Some boundaries separate the classes but pass very close to the training points.

That is risky because a small change in the input may cause misclassification.

SVM prefers the boundary that is furthest away from the nearest points of both classes.

This gives better generalisation.

flowchart TB
    A[Many possible separating lines] --> B[Choose the one with largest margin]
    B --> C[Nearest points define the boundary]
    C --> D[Nearest points become support vectors]

    style A fill:#E1F5FE
    style B fill:#C8E6C9
    style C fill:#FFF9C4
    style D fill:#EDE7F6

Margin ☆ #

The margin is the distance between the decision boundary and the nearest training points.

SVM uses two margin boundaries:

\[ w^T x + b = +1 \] \[ w^T x + b = -1 \]

The decision boundary lies in the middle:

\[ w^T x + b = 0 \]

The margin width is:

\[ \text{Margin Width} = \frac{2}{\|w\|} \]

Therefore, maximising the margin is equivalent to minimising the size of \( w \) .


Support Vectors ☆ #

Support vectors are the training points closest to the decision boundary.

They lie on or near the margin boundaries.

They are called support vectors because they support or define the final hyperplane.

If a non-support-vector point is moved slightly, the boundary may not change.

If a support vector is moved, the boundary can change significantly.

In exams, remember this line:

Support vectors are the critical training points that determine the maximum-margin hyperplane.


Hard Margin SVM ☆ #

A hard margin SVM assumes the data is perfectly linearly separable.

No data point is allowed inside the margin or on the wrong side of the boundary.

The condition is:

\[ y_i(w^T x_i + b) \ge 1 \]

This means:

  • positive points must be on the positive side
  • negative points must be on the negative side
  • all points must be outside the margin

The optimisation problem is:

\[ \min_{w,b} \frac{1}{2}\|w\|^2 \]

subject to:

\[ y_i(w^T x_i + b) \ge 1 \]

Why minimise \( \frac{1}{2}\|w\|^2 \) ?

Because maximising the margin \( \frac{2}{\|w\|} \) is equivalent to minimising \( \|w\| \) .

The square and the factor \( \frac{1}{2} \) make the mathematics easier during differentiation.


Hard Margin Intuition #

SituationHard Margin Behaviour
Classes are perfectly separableWorks well
No outliersGood choice
Classes overlapFails or gives poor boundary
Data has noiseToo strict

Hard margin is useful for understanding the core mathematics of SVM.

In real data, soft margin SVM is usually more practical.


Lagrange Multiplier View ☆ #

SVM is a constrained optimisation problem.

To solve it, we can use Lagrange multipliers.

The important result is that the weight vector can be written as:

\[ w = \sum_i \alpha_i y_i x_i \]

Where:

SymbolMeaning
\( \alpha_i \)Lagrange multiplier for training point \( i \)
\( y_i \)class label, usually +1 or −1
\( x_i \)training data point

A very important interpretation is:

Only training points with non-zero ( \alpha_i )

become support vectors.

Most other training points have ( \alpha_i = 0 )

, so they do not directly affect the final decision boundary.

Once \( w \) is known, the bias can be calculated using any support vector:

\[ b = y_i - w^T x_i \]

The classifier becomes:

\[ f(x) = \operatorname{sign}\left(\sum_i \alpha_i y_i (x_i^T x) + b\right) \]

The term \( x_i^T x \) is a dot product between a training point and a new test point.

This dot product becomes important when we discuss kernels.


Soft Margin SVM ☆ #

Real-world data is often noisy.

The classes may overlap.

A strict hard margin may not be possible.

Soft margin SVM allows some violations using slack variables.

The soft margin constraint is:

\[ y_i(w^T x_i + b) \ge 1 - \xi_i \]

Where \( \xi_i \) is the slack variable.

It measures how much a point violates the margin.

The soft margin objective is:

\[ \min_{w,b} \frac{1}{2}\|w\|^2 + C\sum_i \xi_i \]

Where \( C \) controls the penalty for margin violations.


Role of C in Soft Margin SVM ☆ #

The hyperparameter \( C \) controls the trade-off between:

  • large margin
  • low training error
Value of \( C \)BehaviourInterpretation
Small \( C \)Allows more violationsWider margin, more tolerance
Large \( C \)Penalises violations stronglyNarrower margin, tries to classify training data correctly

Low ( C )

means the model is more tolerant of misclassification.

High ( C )

means the model tries harder to avoid training errors.

A very large \( C \) can overfit noisy data.

A very small \( C \) can underfit.


Hinge Loss #

Soft margin SVM is closely related to hinge loss.

For one training example, hinge loss is:

\[ \max(0, 1 - y_i(w^T x_i + b)) \]

Interpretation:

ConditionHinge Loss
Correct and outside margin0
Correct but inside marginPositive loss
MisclassifiedLarger positive loss

SVM does not only care whether a point is classified correctly.

It also cares whether the point is classified with enough margin.


Linearly Separable Data ☆ #

Data is linearly separable when a straight line or hyperplane can separate the classes.

For example, in two dimensions, a single line can separate positive and negative points.

A linear SVM is suitable when this is possible.

\[ w^T x + b = 0 \]

The aim is to choose the line with the maximum margin.


Non-Linearly Separable Data ☆ #

Data is non-linearly separable when no straight line can separate the classes properly.

For example, one class may surround another class in a circular pattern.

In the original input space, a linear boundary fails.

The key idea is:

Map the data into a higher-dimensional feature space where it may become linearly separable.

Example transformation:

\[ x \mapsto (x, x^2) \]

In the original space, the boundary may be curved.

In the transformed space, the boundary can become linear.

flowchart LR
    A[Original Space: Not Linearly Separable] --> B[Feature Transformation]
    B --> C[Higher-Dimensional Space]
    C --> D[Linear Separation Possible]

    style A fill:#FFF9C4
    style B fill:#E1F5FE
    style C fill:#EDE7F6
    style D fill:#C8E6C9

Kernel Trick ☆ #

Mapping data explicitly into a high-dimensional space can be computationally expensive.

The kernel trick avoids computing the transformed features directly.

Instead, it computes the dot product in the transformed space using a kernel function.

If \( \phi(x) \) is a transformation into a higher-dimensional space, then:

\[ K(x_i, x_j) = \phi(x_i)^T\phi(x_j) \]

This means SVM can behave as if it is working in a higher-dimensional space without explicitly constructing all the transformed features.

The classifier becomes:

\[ f(x) = \operatorname{sign}\left(\sum_i \alpha_i y_i K(x_i, x) + b\right) \]

Common Kernel Functions ☆ #

KernelFormulaWhen Useful
Linear\( K(x_i,x_j)=x_i^T x_j \)linearly separable or high-dimensional sparse data
Polynomial\( K(x_i,x_j)=(x_i^T x_j+c)^d \)curved boundaries with polynomial structure
RBF / Gaussian\( K(x_i,x_j)=\exp(-\gamma\|x_i-x_j\|^2) \)complex non-linear boundaries
Sigmoid\( K(x_i,x_j)=\tanh(\alpha x_i^T x_j+c) \)neural-network-like similarity behaviour

The RBF kernel is popular because it can model very flexible non-linear decision boundaries.


Mercer Condition #

Not every similarity function is a valid SVM kernel.

A kernel should correspond to an inner product in some feature space.

Mercer’s condition gives the mathematical requirement for a valid kernel.

A practical way to remember it:

A valid kernel must produce a kernel matrix that behaves like a proper inner-product matrix.

In simple terms, the kernel matrix should be symmetric and positive semi-definite.

For a set of training points, the kernel matrix is:

\[ K_{ij} = K(x_i, x_j) \]

A valid kernel should satisfy:

\[ K(x_i, x_j) = K(x_j, x_i) \]

and should be positive semi-definite.


Linear SVM vs Kernel SVM #

AspectLinear SVMKernel SVM
Boundarystraight line / hyperplanecurved boundary in original space
Feature spaceoriginal featuresimplicit transformed features
Speedfasterusually slower
Interpretabilityeasierharder
Best forlinearly separable or high-dimensional sparse datanon-linear class patterns

Structured and Unstructured Data Applications #

SVM can be used with both structured and unstructured data.

Structured Data #

Structured data is usually tabular.

Examples:

  • medical diagnosis from patient attributes
  • loan approval using income, credit score and account history
  • customer churn prediction
  • admission prediction using scores and profile attributes

For structured data, features are already in columns.

SVM can directly use numerical feature vectors after preprocessing and scaling.

Unstructured Data #

Unstructured data includes text, images, audio and documents.

SVM cannot directly process raw text or raw images.

They must first be converted into feature vectors.

Examples:

Data TypeFeature RepresentationSVM Use
Textbag-of-words, TF-IDF, embeddingsspam detection, sentiment classification
ImagesHOG, SIFT, CNN embeddingsobject recognition, face detection
DocumentsTF-IDF, topic vectorsdocument classification
AudioMFCC featuresspeaker or sound classification

SVM often performs well when the number of features is large, especially in text classification.


Why Feature Scaling Matters ☆ #

SVM depends on distances, dot products and margins.

If one feature has a much larger scale than another, it can dominate the decision boundary.

For example:

FeatureRange
Age18 to 80
Income10,000 to 500,000

Income may dominate the classifier unless scaling is applied.

Common scaling methods:

  • standardisation
  • min-max normalisation

For SVM, feature scaling is usually essential.

Always fit the scaler on the training data only, then transform both training and test data.


Worked Mini Example ☆ #

Suppose an SVM has learned:

\[ w = (1, -1), \quad b = -0.5 \]

For a new point:

\[ x = (2, 1) \]

Compute the score:

\[ w^T x + b = (1)(2) + (-1)(1) - 0.5 = 0.5 \]

Since the score is positive:

\[ f(x) = +1 \]

So the new point is classified as the positive class.


SVM Compared with Logistic Regression #

FeatureLogistic RegressionSVM
Main outputprobabilityclass boundary / score
Main ideaestimate probability using sigmoidmaximise margin
Boundaryusually linear unless features are transformedlinear or kernel-based non-linear
Losslog losshinge loss
Outlier sensitivitycan be affectedsoft margin controls effect through \( C \)
Interpretabilityprobability is easier to explainmargin-based explanation

Logistic regression is useful when probability interpretation matters.

SVM is useful when a strong separating boundary is more important.


SVM Compared with Decision Trees #

FeatureDecision TreeSVM
Decision boundaryaxis-aligned splitsmaximum-margin hyperplane
Interpretabilityhigh for small treesmoderate to low
Feature scalingusually not requiredimportant
Non-linear datahandled by tree splitshandled by kernels
Overfitting controlpruning, depth, minimum samples\( C \) , kernel choice, \( \gamma \)

Practical Scikit-Learn Example #

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, classification_report
from sklearn.pipeline import Pipeline

# X = input features, y = class labels
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.2,
    random_state=42,
    stratify=y
)

model = Pipeline([
    ("scaler", StandardScaler()),
    ("svm", SVC(kernel="rbf", C=1.0, gamma="scale"))
])

model.fit(X_train, y_train)
y_pred = model.predict(X_test)

print("Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))

The pipeline is important because scaling must be learned only from the training data.


Common Exam Questions ☆ #

1. Define SVM #

SVM is a supervised discriminative classifier that finds the maximum-margin decision boundary between classes.

2. What are support vectors? #

Support vectors are the training points closest to the decision boundary.

They determine the final hyperplane and margin.

3. What is the difference between hard margin and soft margin? #

Hard margin does not allow any violation.

Soft margin allows some violations using slack variables and controls the penalty through \( C \) .

4. What is the kernel trick? #

The kernel trick computes dot products in a transformed feature space without explicitly computing the transformation.

5. Why is SVM useful for non-linear data? #

Kernel SVM can map data implicitly into a higher-dimensional feature space where a linear separator may exist.


Common Mistakes #

MistakeCorrection
Thinking SVM only draws any separating lineSVM chooses the maximum-margin separator
Forgetting support vectorsSupport vectors define the margin and boundary
Using hard margin for noisy dataUse soft margin with suitable \( C \)
Forgetting scalingScale features before SVM training
Thinking kernels explicitly create featuresKernel trick avoids explicit transformation
Treating SVM output as probability by defaultSVM gives a decision score unless probability calibration is enabled

Quick Revision Summary #

  • SVM is mainly a supervised classification algorithm.
  • It is a discriminative classifier.
  • It finds the best decision boundary by maximising the margin.
  • Support vectors are the closest points to the boundary.
  • Hard margin assumes perfect separability.
  • Soft margin allows violations using slack variables.
  • \( C \) controls the trade-off between margin width and training error.
  • Kernel trick helps SVM handle non-linear data.
  • Linear, polynomial and RBF are common kernels.
  • Feature scaling is very important for SVM.

References #

  • Course handout: Module 7, Support Vector Machines.
  • Lecture transcript: Support Vector Machine sessions by Prof. Indumathi Prabakeran.
  • Course slides: Instance-based Learning and SVM transition material.
  • C. J. C. Burges, A Tutorial on Support Vector Machines for Pattern Recognition.

Home | Machine Learning