Support Vector Machine (SVM) #
Support Vector Machine (SVM) is a supervised machine learning algorithm used for:
- Classification (most common)
- Regression (SVR – Support Vector Regression)
It connects many earlier ideas:
- classification and decision boundaries
- linear classifiers
- margins
- optimisation
- constrained optimisation
- kernels for non-linear data
SVM is a discriminative classifier.
That means it does not try to model how each class is generated.
Instead, it tries to find the best separating boundary between classes.
Key takeaway:
SVM tries to find a decision boundary that separates the classes with the largest possible margin.The training points that decide this margin are called support vectors.
- Linearly separable data
- Non-linearly separable data
- Kernel Trick (Mercer)
- Applications to both structured and unstructured data
Where SVM Fits in Machine Learning ☆ #
In supervised learning, we normally work with input-output pairs.
For classification, the output is categorical.
For example:
| Input features | Target |
|---|---|
| CGPA, entrance score | admission: yes / no |
| tumour size, cell shape | benign / malignant |
| email words | spam / not spam |
SVM is commonly introduced for binary classification, where the target has two classes.
The two classes are often represented as:
\[ y \in \{-1, +1\} \]For multiclass classification, SVM can be extended using strategies such as:
- one-vs-rest
- one-vs-one
flowchart LR
A[Training Data] --> B[Find Separating Boundary]
B --> C[Maximise Margin]
C --> D[Identify Support Vectors]
D --> E[Classify New Point]
style A fill:#E1F5FE
style B fill:#FFF9C4
style C fill:#C8E6C9
style D fill:#EDE7F6
style E fill:#C8E6C9The central question in SVM is:
If many boundaries can separate the classes, which one should we choose?
SVM answers:
Choose the boundary with the maximum margin.
Decision Boundary ☆ #
A linear classifier separates classes using a line in 2D, a plane in 3D, or a hyperplane in higher dimensions.
The decision boundary is written as:
\[ w^T x + b = 0 \]Where:
| Symbol | Meaning |
|---|---|
| \( x \) | input feature vector |
| \( w \) | weight vector, perpendicular to the boundary |
| \( b \) | bias or intercept |
| \( w^T x + b \) | score used for classification |
The class is predicted using the sign of the score:
\[ f(x) = \operatorname{sign}(w^T x + b) \]If the score is positive, classify the point as +1.
If the score is negative, classify the point as −1.
Why Not Just Any Separating Line? #
For linearly separable data, there may be many possible decision boundaries.
Some boundaries separate the classes but pass very close to the training points.
That is risky because a small change in the input may cause misclassification.
SVM prefers the boundary that is furthest away from the nearest points of both classes.
This gives better generalisation.
flowchart TB
A[Many possible separating lines] --> B[Choose the one with largest margin]
B --> C[Nearest points define the boundary]
C --> D[Nearest points become support vectors]
style A fill:#E1F5FE
style B fill:#C8E6C9
style C fill:#FFF9C4
style D fill:#EDE7F6Margin ☆ #
The margin is the distance between the decision boundary and the nearest training points.
SVM uses two margin boundaries:
\[ w^T x + b = +1 \] \[ w^T x + b = -1 \]The decision boundary lies in the middle:
\[ w^T x + b = 0 \]The margin width is:
\[ \text{Margin Width} = \frac{2}{\|w\|} \]Therefore, maximising the margin is equivalent to minimising the size of \( w \) .
Support Vectors ☆ #
Support vectors are the training points closest to the decision boundary.
They lie on or near the margin boundaries.
They are called support vectors because they support or define the final hyperplane.
If a non-support-vector point is moved slightly, the boundary may not change.
If a support vector is moved, the boundary can change significantly.
In exams, remember this line:
Support vectors are the critical training points that determine the maximum-margin hyperplane.
Hard Margin SVM ☆ #
A hard margin SVM assumes the data is perfectly linearly separable.
No data point is allowed inside the margin or on the wrong side of the boundary.
The condition is:
\[ y_i(w^T x_i + b) \ge 1 \]This means:
- positive points must be on the positive side
- negative points must be on the negative side
- all points must be outside the margin
The optimisation problem is:
\[ \min_{w,b} \frac{1}{2}\|w\|^2 \]subject to:
\[ y_i(w^T x_i + b) \ge 1 \]Why minimise \( \frac{1}{2}\|w\|^2 \) ?
Because maximising the margin \( \frac{2}{\|w\|} \) is equivalent to minimising \( \|w\| \) .
The square and the factor \( \frac{1}{2} \) make the mathematics easier during differentiation.
Hard Margin Intuition #
| Situation | Hard Margin Behaviour |
|---|---|
| Classes are perfectly separable | Works well |
| No outliers | Good choice |
| Classes overlap | Fails or gives poor boundary |
| Data has noise | Too strict |
Hard margin is useful for understanding the core mathematics of SVM.
In real data, soft margin SVM is usually more practical.
Lagrange Multiplier View ☆ #
SVM is a constrained optimisation problem.
To solve it, we can use Lagrange multipliers.
The important result is that the weight vector can be written as:
\[ w = \sum_i \alpha_i y_i x_i \]Where:
| Symbol | Meaning |
|---|---|
| \( \alpha_i \) | Lagrange multiplier for training point \( i \) |
| \( y_i \) | class label, usually +1 or −1 |
| \( x_i \) | training data point |
A very important interpretation is:
Only training points with non-zero ( \alpha_i )
become support vectors.
Most other training points have ( \alpha_i = 0 )
, so they do not directly affect the final decision boundary.
Once \( w \) is known, the bias can be calculated using any support vector:
\[ b = y_i - w^T x_i \]The classifier becomes:
\[ f(x) = \operatorname{sign}\left(\sum_i \alpha_i y_i (x_i^T x) + b\right) \]The term \( x_i^T x \) is a dot product between a training point and a new test point.
This dot product becomes important when we discuss kernels.
Soft Margin SVM ☆ #
Real-world data is often noisy.
The classes may overlap.
A strict hard margin may not be possible.
Soft margin SVM allows some violations using slack variables.
The soft margin constraint is:
\[ y_i(w^T x_i + b) \ge 1 - \xi_i \]Where \( \xi_i \) is the slack variable.
It measures how much a point violates the margin.
The soft margin objective is:
\[ \min_{w,b} \frac{1}{2}\|w\|^2 + C\sum_i \xi_i \]Where \( C \) controls the penalty for margin violations.
Role of C in Soft Margin SVM ☆ #
The hyperparameter \( C \) controls the trade-off between:
- large margin
- low training error
| Value of \( C \) | Behaviour | Interpretation |
|---|---|---|
| Small \( C \) | Allows more violations | Wider margin, more tolerance |
| Large \( C \) | Penalises violations strongly | Narrower margin, tries to classify training data correctly |
Low ( C )
means the model is more tolerant of misclassification.
High ( C )
means the model tries harder to avoid training errors.
A very large \( C \) can overfit noisy data.
A very small \( C \) can underfit.
Hinge Loss #
Soft margin SVM is closely related to hinge loss.
For one training example, hinge loss is:
\[ \max(0, 1 - y_i(w^T x_i + b)) \]Interpretation:
| Condition | Hinge Loss |
|---|---|
| Correct and outside margin | 0 |
| Correct but inside margin | Positive loss |
| Misclassified | Larger positive loss |
SVM does not only care whether a point is classified correctly.
It also cares whether the point is classified with enough margin.
Linearly Separable Data ☆ #
Data is linearly separable when a straight line or hyperplane can separate the classes.
For example, in two dimensions, a single line can separate positive and negative points.
A linear SVM is suitable when this is possible.
\[ w^T x + b = 0 \]The aim is to choose the line with the maximum margin.
Non-Linearly Separable Data ☆ #
Data is non-linearly separable when no straight line can separate the classes properly.
For example, one class may surround another class in a circular pattern.
In the original input space, a linear boundary fails.
The key idea is:
Map the data into a higher-dimensional feature space where it may become linearly separable.
Example transformation:
\[ x \mapsto (x, x^2) \]In the original space, the boundary may be curved.
In the transformed space, the boundary can become linear.
flowchart LR
A[Original Space: Not Linearly Separable] --> B[Feature Transformation]
B --> C[Higher-Dimensional Space]
C --> D[Linear Separation Possible]
style A fill:#FFF9C4
style B fill:#E1F5FE
style C fill:#EDE7F6
style D fill:#C8E6C9Kernel Trick ☆ #
Mapping data explicitly into a high-dimensional space can be computationally expensive.
The kernel trick avoids computing the transformed features directly.
Instead, it computes the dot product in the transformed space using a kernel function.
If \( \phi(x) \) is a transformation into a higher-dimensional space, then:
\[ K(x_i, x_j) = \phi(x_i)^T\phi(x_j) \]This means SVM can behave as if it is working in a higher-dimensional space without explicitly constructing all the transformed features.
The classifier becomes:
\[ f(x) = \operatorname{sign}\left(\sum_i \alpha_i y_i K(x_i, x) + b\right) \]Common Kernel Functions ☆ #
| Kernel | Formula | When Useful |
|---|---|---|
| Linear | \( K(x_i,x_j)=x_i^T x_j \) | linearly separable or high-dimensional sparse data |
| Polynomial | \( K(x_i,x_j)=(x_i^T x_j+c)^d \) | curved boundaries with polynomial structure |
| RBF / Gaussian | \( K(x_i,x_j)=\exp(-\gamma\|x_i-x_j\|^2) \) | complex non-linear boundaries |
| Sigmoid | \( K(x_i,x_j)=\tanh(\alpha x_i^T x_j+c) \) | neural-network-like similarity behaviour |
The RBF kernel is popular because it can model very flexible non-linear decision boundaries.
Mercer Condition #
Not every similarity function is a valid SVM kernel.
A kernel should correspond to an inner product in some feature space.
Mercer’s condition gives the mathematical requirement for a valid kernel.
A practical way to remember it:
A valid kernel must produce a kernel matrix that behaves like a proper inner-product matrix.
In simple terms, the kernel matrix should be symmetric and positive semi-definite.
For a set of training points, the kernel matrix is:
\[ K_{ij} = K(x_i, x_j) \]A valid kernel should satisfy:
\[ K(x_i, x_j) = K(x_j, x_i) \]and should be positive semi-definite.
Linear SVM vs Kernel SVM #
| Aspect | Linear SVM | Kernel SVM |
|---|---|---|
| Boundary | straight line / hyperplane | curved boundary in original space |
| Feature space | original features | implicit transformed features |
| Speed | faster | usually slower |
| Interpretability | easier | harder |
| Best for | linearly separable or high-dimensional sparse data | non-linear class patterns |
Structured and Unstructured Data Applications #
SVM can be used with both structured and unstructured data.
Structured Data #
Structured data is usually tabular.
Examples:
- medical diagnosis from patient attributes
- loan approval using income, credit score and account history
- customer churn prediction
- admission prediction using scores and profile attributes
For structured data, features are already in columns.
SVM can directly use numerical feature vectors after preprocessing and scaling.
Unstructured Data #
Unstructured data includes text, images, audio and documents.
SVM cannot directly process raw text or raw images.
They must first be converted into feature vectors.
Examples:
| Data Type | Feature Representation | SVM Use |
|---|---|---|
| Text | bag-of-words, TF-IDF, embeddings | spam detection, sentiment classification |
| Images | HOG, SIFT, CNN embeddings | object recognition, face detection |
| Documents | TF-IDF, topic vectors | document classification |
| Audio | MFCC features | speaker or sound classification |
SVM often performs well when the number of features is large, especially in text classification.
Why Feature Scaling Matters ☆ #
SVM depends on distances, dot products and margins.
If one feature has a much larger scale than another, it can dominate the decision boundary.
For example:
| Feature | Range |
|---|---|
| Age | 18 to 80 |
| Income | 10,000 to 500,000 |
Income may dominate the classifier unless scaling is applied.
Common scaling methods:
- standardisation
- min-max normalisation
For SVM, feature scaling is usually essential.
Always fit the scaler on the training data only, then transform both training and test data.
Worked Mini Example ☆ #
Suppose an SVM has learned:
\[ w = (1, -1), \quad b = -0.5 \]For a new point:
\[ x = (2, 1) \]Compute the score:
\[ w^T x + b = (1)(2) + (-1)(1) - 0.5 = 0.5 \]Since the score is positive:
\[ f(x) = +1 \]So the new point is classified as the positive class.
SVM Compared with Logistic Regression #
| Feature | Logistic Regression | SVM |
|---|---|---|
| Main output | probability | class boundary / score |
| Main idea | estimate probability using sigmoid | maximise margin |
| Boundary | usually linear unless features are transformed | linear or kernel-based non-linear |
| Loss | log loss | hinge loss |
| Outlier sensitivity | can be affected | soft margin controls effect through \( C \) |
| Interpretability | probability is easier to explain | margin-based explanation |
Logistic regression is useful when probability interpretation matters.
SVM is useful when a strong separating boundary is more important.
SVM Compared with Decision Trees #
| Feature | Decision Tree | SVM |
|---|---|---|
| Decision boundary | axis-aligned splits | maximum-margin hyperplane |
| Interpretability | high for small trees | moderate to low |
| Feature scaling | usually not required | important |
| Non-linear data | handled by tree splits | handled by kernels |
| Overfitting control | pruning, depth, minimum samples | \( C \) , kernel choice, \( \gamma \) |
Practical Scikit-Learn Example #
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, classification_report
from sklearn.pipeline import Pipeline
# X = input features, y = class labels
X_train, X_test, y_train, y_test = train_test_split(
X,
y,
test_size=0.2,
random_state=42,
stratify=y
)
model = Pipeline([
("scaler", StandardScaler()),
("svm", SVC(kernel="rbf", C=1.0, gamma="scale"))
])
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))
The pipeline is important because scaling must be learned only from the training data.
Common Exam Questions ☆ #
1. Define SVM #
SVM is a supervised discriminative classifier that finds the maximum-margin decision boundary between classes.
2. What are support vectors? #
Support vectors are the training points closest to the decision boundary.
They determine the final hyperplane and margin.
3. What is the difference between hard margin and soft margin? #
Hard margin does not allow any violation.
Soft margin allows some violations using slack variables and controls the penalty through \( C \) .
4. What is the kernel trick? #
The kernel trick computes dot products in a transformed feature space without explicitly computing the transformation.
5. Why is SVM useful for non-linear data? #
Kernel SVM can map data implicitly into a higher-dimensional feature space where a linear separator may exist.
Common Mistakes #
| Mistake | Correction |
|---|---|
| Thinking SVM only draws any separating line | SVM chooses the maximum-margin separator |
| Forgetting support vectors | Support vectors define the margin and boundary |
| Using hard margin for noisy data | Use soft margin with suitable \( C \) |
| Forgetting scaling | Scale features before SVM training |
| Thinking kernels explicitly create features | Kernel trick avoids explicit transformation |
| Treating SVM output as probability by default | SVM gives a decision score unless probability calibration is enabled |
Quick Revision Summary #
- SVM is mainly a supervised classification algorithm.
- It is a discriminative classifier.
- It finds the best decision boundary by maximising the margin.
- Support vectors are the closest points to the boundary.
- Hard margin assumes perfect separability.
- Soft margin allows violations using slack variables.
- \( C \) controls the trade-off between margin width and training error.
- Kernel trick helps SVM handle non-linear data.
- Linear, polynomial and RBF are common kernels.
- Feature scaling is very important for SVM.
References #
- Course handout: Module 7, Support Vector Machines.
- Lecture transcript: Support Vector Machine sessions by Prof. Indumathi Prabakeran.
- Course slides: Instance-based Learning and SVM transition material.
- C. J. C. Burges, A Tutorial on Support Vector Machines for Pattern Recognition.