Linear models for Classification #

categorises data by finding a linear boundary (hyperplane) that separates classes
calculating a weighted sum of input features plus bias

flowchart TD
T["Linear<br/>classification<br/>models"] --> P["Perceptron"]
T --> LR["Logistic<br/>regression"]
T --> SVM["Linear<br/>SVM"]

P -->|uses| STEP["Step<br/>activation"]
LR -->|uses| SIG["Sigmoid<br/>+ log loss"]
SVM -->|uses| HNG["Hinge<br/>loss"]

style T fill:#90CAF9,stroke:#1E88E5,color:#000

style P fill:#C8E6C9,stroke:#2E7D32,color:#000
style LR fill:#C8E6C9,stroke:#2E7D32,color:#000
style SVM fill:#C8E6C9,stroke:#2E7D32,color:#000

style STEP fill:#CE93D8,stroke:#8E24AA,color:#000
style SIG fill:#CE93D8,stroke:#8E24AA,color:#000
style HNG fill:#CE93D8,stroke:#8E24AA,color:#000

Discriminant Functions #

Decision Theory #

Probabilistic Discriminative Classifiers #

Logistic Regression #

Supervised machine learning algorithm
Binary classification algorithm
requires data to be linearly separable
predicts the probability that an input belongs to a specific class
uses Sigmoid function to convert inputs into a probability value between 0 and 1

Key takeaway: Logistic regression predicts $P(y=1\mid x)$ using a sigmoid of a linear score $z=w\cdot x+b$, then learns $w,b$ by maximising likelihood (equivalently minimising log-loss).

Types of Logistic Regression #

Logistic regression can be classified into three main types based on the nature of the dependent variable.

flowchart TD
T["Logistic Regression<br/>Types"] --> BIN["Binomial<br/>(Binary)"]
T --> MULTI["Multinomial"]
T --> ORD["Ordinal"]

BIN --> BIN1["Two classes<br/>0/1, Yes/No"]
MULTI --> MULTI1["3+ classes<br/>no natural order"]
ORD --> ORD1["3+ classes<br/>ordered categories"]

style T fill:#90CAF9,stroke:#1E88E5,color:#000
style BIN fill:#C8E6C9,stroke:#2E7D32,color:#000
style MULTI fill:#C8E6C9,stroke:#2E7D32,color:#000
style ORD fill:#C8E6C9,stroke:#2E7D32,color:#000
style BIN1 fill:#CE93D8,stroke:#8E24AA,color:#000
style MULTI1 fill:#CE93D8,stroke:#8E24AA,color:#000
style ORD1 fill:#CE93D8,stroke:#8E24AA,color:#000

Binomial Logistic Regression #

dependent variable has two possible categories (e.g. Yes/No, Pass/Fail, 0/1)
most common form
used for binary classification problems

Multinomial Logistic Regression #

dependent variable has three or more categories that are not ordered
e.g. “cat”, “dog”, “sheep”
extends binary logistic regression to multiple classes

Ordinal Logistic Regression #

dependent variable has three or more categories with a natural order
e.g. “low”, “medium”, “high”
takes ordering into account when modelling

Multiclass strategies (OvR / OvO) #

Logistic regression is naturally a binary classifier. To handle more than two classes, use strategies like One-vs-Rest or One-vs-One.

One-vs-Rest (OvR) #

Train $K$ binary logistic regression models (one per class).
For class $k$, set labels:
- $y=1$ if the example belongs to class $k$
- $y=0$ otherwise
At prediction time, choose the class with the highest predicted probability.

\[ \hat{k}=\arg\max_{k\in\{1,\dots,K\}} p_k(x) \]

When to use:

Works well when $K$ is moderate.
Simple to implement.

One-vs-One (OvO) #

Train a binary classifier for every pair of classes.
Total classifiers: $K(K-1)/2$.
At prediction time, each classifier votes; the class with most votes wins.

When to use:

Useful when $K$ is small.
Training can be heavier if $K$ is large.

Assumptions of Logistic Regression #

Independent observations: each data point is assumed independent (no dependence between samples)
Dependent variable type: binary for standard logistic regression (for more than two categories, multinomial/softmax is used)
Linearity in the log-odds: predictors affect the log-odds in a linear way
No extreme outliers: outliers can distort coefficient estimates
Large sample size: needs enough data for stable, reliable estimates

Understanding the Sigmoid Function #

The sigmoid function converts a real-valued input to a probability between 0 and 1.

\[ \sigma(z)=\frac{1}{1+e^{-z}} \]

Properties:

$\sigma(z)\to 1$ as $z\to \infty$
$\sigma(z)\to 0$ as $z\to -\infty$
$\sigma(z)$ is always in $(0,1)$

Classification using a threshold (commonly $0.5$):

if $\sigma(z)\ge 0.5$: predict Class 1
if $\sigma(z)<0.5$: predict Class 0

Link function vs sigmoid (common confusion) #

Logistic Regression can be described in two linked ways:

Linear score: $z=w\cdot x+b$
Probability: $p=\sigma(z)$

The logit is the link between probability and score:

\[ \log\left(\frac{p}{1-p}\right)=w\cdot x+b \]

logit: often called the link function in GLM language
sigmoid: the inverse link that maps $z$ to a probability

How Logistic Regression Works #

Input features and labels #

Input features (matrix):

\[ X= \begin{bmatrix} x_{11} & \dots & x_{1m}\\ x_{21} & \dots & x_{2m}\\ \vdots & \ddots & \vdots\\ x_{n1} & \dots & x_{nm} \end{bmatrix} \]

Dependent variable (binary):

\[ y \in \{0,1\} \]

Linear score #

Compute a linear score:

\[ z=\left(\sum_{i=1}^{n}w_i x_i\right)+b \]

Vector form:

\[ z=w\cdot x+b \]

Here:

$w$ are weights / coefficients
$b$ is the bias (intercept)

Probability prediction #

Convert $z$ to a probability:

\[ p(x)=P(y=1\mid x)=\sigma(z) \]

So:

$P(y=1)=p(x)$
$P(y=0)=1-p(x)$

Odds, Log-Odds (Logit), and the Logistic Equation #

Odds of the event:

\[ \frac{p(x)}{1-p(x)} = e^{z} \]

Log-odds (logit):

\[ \log\left(\frac{p(x)}{1-p(x)}\right)=z=w\cdot x+b \]

Solving for $p(x)$ gives the logistic regression probability:

\[ p(x)=\frac{e^{w\cdot x+b}}{1+e^{w\cdot x+b}} = \frac{1}{1+e^{-(w\cdot x+b)}} \]

This is the probability that the input belongs to Class 1.

Likelihood and Log-Likelihood #

The goal is to find $w$ and $b$ that maximise the likelihood of the observed data.

For each data point $i$:

if $y_i=1$: probability is $p(x_i)$
if $y_i=0$: probability is $1-p(x_i)$

Likelihood:

\[ L(b,w)=\prod_{i=1}^{n} p(x_i)^{y_i}\left(1-p(x_i)\right)^{1-y_i} \]

Log-likelihood:

\[ \log L(b,w)=\sum_{i=1}^{n}\left[y_i\log p(x_i) + (1-y_i)\log\left(1-p(x_i)\right)\right] \]

Gradient of the Log-Likelihood #

To find the best $w$ and $b$ we can use gradient ascent on $\log L$.

For weight component $w_j$:

\[ \frac{\partial \log L}{\partial w_j} = \sum_{i=1}^{n}\left(y_i - p(x_i)\right)x_{ij} \]

Evaluation metrics for classification #

To evaluate Logistic Regression (and other classifiers), we often use a confusion matrix and derived metrics.

Confusion Matrix #

For binary classification (positive class = 1):

True Positive (TP): actual 1, predicted 1
False Positive (FP): actual 0, predicted 1
False Negative (FN): actual 1, predicted 0
True Negative (TN): actual 0, predicted 0

\[ \begin{array}{c|cc} & \text{Pred }1 & \text{Pred }0\\ \hline \text{Actual }1 & TP & FN\\ \text{Actual }0 & FP & TN \end{array} \]

Accuracy #

Accuracy measures the overall fraction of correct predictions.

\[ \text{Accuracy}=\frac{TP+TN}{TP+FP+FN+TN} \]

When to use:

When classes are fairly balanced.
When FP and FN have similar cost.

Precision #

Precision tells you: “When the model predicts 1, how often is it correct?”

\[ \text{Precision}=\frac{TP}{TP+FP} \]

When to use:

When false positives are costly.
Example: flagging a legitimate customer as fraud.

Recall (Sensitivity / True Positive Rate) #

Recall tells you: “Out of all actual 1s, how many did we catch?”

\[ \text{Recall}=\frac{TP}{TP+FN} \]

When to use:

When false negatives are costly.
Example: missing a fraud case / missing a disease.

F1 score #

\[ F1=\frac{2\cdot \text{Precision}\cdot \text{Recall}}{\text{Precision}+\text{Recall}} \]

Log loss (Cross-entropy) #

Log loss evaluates probabilities, not just hard class labels. It penalises confident wrong predictions heavily.

Per-example:

\[ \mathrm{Cost}\!\left(p,y\right) = -y\log(p)-(1-y)\log(1-p) \]

Over the full dataset:

\[ J=-\frac{1}{m}\sum_{i=1}^{m}\left[y^{(i)}\log(p^{(i)})+(1-y^{(i)})\log(1-p^{(i)})\right] \]

When to use:

When you care about well-calibrated probabilities.
When comparing probabilistic classifiers (like logistic regression).

ROC curve and AUC #

ROC curve plots:

True Positive Rate (TPR) vs False Positive Rate (FPR) as you vary the classification threshold.

\[ TPR=\frac{TP}{TP+FN} \] \[ FPR=\frac{FP}{FP+TN} \]

AUC: Area under the ROC curve.

When to use:

When you want a threshold-independent performance summary.
Often useful when classes are imbalanced and you want ranking quality.

Decision Tree #

A Decision Tree is a non-linear classification model that splits the feature space into regions using if-then rules.

How it works:

choose a feature and split threshold
split the data into two (or more) groups
repeat until a stopping rule is met

Common split criteria:

Gini impurity
Entropy / Information Gain

When to use:

when relationships are non-linear
when you want interpretable rule-based decisions
when feature interactions matter

Typical risks:

can overfit without pruning / depth limits
can be unstable (small data changes can change the tree)

Evaluation: Use the same metrics as logistic regression: confusion matrix, accuracy, precision, recall, ROC/AUC.

Terminology #

Independent variables: input features / predictors
Dependent variable: target variable (categorical)
Logistic function: sigmoid mapping to probability
Odds: $\frac{p}{1-p}$
Log-odds (logit): $\log\left(\frac{p}{1-p}\right)$
Coefficients: the weights $w$
Intercept: the bias $b$
Maximum Likelihood Estimation (MLE): estimate $w,b$ by maximising likelihood of observed data

References #

Maximum Likelihood Estimation (MLE) and Maximum A Posteriori (MAP)

Home | Machine Learning