Linear NN for Classification #
A Linear Neural Network (LNN) for classification uses no hidden layers.
It learns a linear decision boundary and outputs class probabilities, then converts them into predicted classes.
Neural-network view:
- Binary classification → logistic regression (single neuron + sigmoid)
- Multi-class classification → softmax regression (K output neurons + softmax)
flowchart LR D["Data<br/>X, y"] --> M["Linear model<br/>w, b"] M --> A["Activation<br/>Sigmoid / Softmax"] A --> L["Loss<br/>Cross-entropy"] L --> O["Optimiser<br/>Mini-batch GD / Adam"] O --> P["Updated parameters<br/>w, b"] P --> I["Inference<br/>Probabilities → class"] %% Pastel colour scheme style D fill:#E3F2FD,stroke:#1E88E5,stroke-width:1px style M fill:#E8F5E9,stroke:#43A047,stroke-width:1px style A fill:#FFF3E0,stroke:#FB8C00,stroke-width:1px style L fill:#FCE4EC,stroke:#D81B60,stroke-width:1px style O fill:#F3E5F5,stroke:#8E24AA,stroke-width:1px style P fill:#E0F7FA,stroke:#00838F,stroke-width:1px style I fill:#F1F8E9,stroke:#558B2F,stroke-width:1px
Classification #
Classification predicts a discrete class label.
Common settings:
- Binary classification: \( y \in \{0,1\} \)
- Multi-class classification: \( y \in \{1,2,\dots,K\} \) (exactly one class)
- Multi-label classification: one example can belong to multiple classes (handled with one independent sigmoid output per class rather than softmax)
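To make the label formats concrete, here is a small NumPy sketch (the label values are made up) showing 0/1 labels for the binary case and a one-hot encoding for the multi-class case:

```python
import numpy as np

# Binary: a vector of 0/1 labels, one per example (made-up data).
y_binary = np.array([0, 1, 1, 0])

# Multi-class: integer class indices in {0, ..., K-1} (made-up data, K = 3).
y_multiclass = np.array([2, 0, 1, 1])
K = 3

# One-hot encoding: row i has a 1 in column y_multiclass[i], 0 elsewhere.
y_one_hot = np.eye(K)[y_multiclass]
print(y_one_hot)
# [[0. 0. 1.]
#  [1. 0. 0.]
#  [0. 1. 0.]
#  [0. 1. 0.]]
```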
The Four Components #
A complete ML system for linear classification (single neuron) has four components:
- Data: input feature matrix \( X \in \mathbb{R}^{N \times d} \) and binary targets \( y \in \{0,1\}^N \)
- Model: single neuron implementing a linear function followed by a sigmoid activation (maps inputs to probabilities)
- Objective: binary cross-entropy loss (measures prediction error)
- Learning algorithm: Stochastic gradient descent (SGD) to optimise weights
This same template extends directly to multi-class classification and deep neural networks.
Only the model becomes multi-output (softmax) or deeper (MLP), and backpropagation computes gradients efficiently.
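As one concrete (but not prescribed) mapping of the four components onto code, a PyTorch sketch might look like this; the shapes, data, and hyperparameters are placeholders:

```python
import torch
from torch import nn

N, d = 100, 5                                    # placeholder dataset size and feature count
X = torch.randn(N, d)                            # Data: feature matrix
y = torch.randint(0, 2, (N, 1)).float()          # Data: binary targets in {0, 1}

model = nn.Linear(d, 1)                          # Model: single neuron, z = w^T x + b
loss_fn = nn.BCEWithLogitsLoss()                 # Objective: sigmoid + binary cross-entropy (fused)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)  # Learning algorithm: SGD

loss = loss_fn(model(X), y)                      # forward pass on the logits
optimizer.zero_grad()
loss.backward()                                  # backpropagation computes gradients
optimizer.step()                                 # one parameter update
```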
Binary Classification: Logistic Regression as a Single Neuron #
```mermaid
flowchart LR
    %% Inputs (features + bias)
    B["1 (bias input)"] -->|w0| SUM["Weighted sum<br/>z = w^T x + b"]
    X1["x1"] -->|w1| SUM
    X2["x2"] -->|w2| SUM
    XD["x_d"] -->|w_d| SUM

    %% Activation + output
    SUM --> SIG["Sigmoid<br/>ŷ = σ(z)"]
    SIG --> OUT["Output probability<br/>ŷ = P(y=1|x) ∈ [0,1]"]

    %% Pastel colour scheme (same as the regression diagram)
    style B fill:#E3F2FD,stroke:#1E88E5,stroke-width:1px
    style X1 fill:#E3F2FD,stroke:#1E88E5,stroke-width:1px
    style X2 fill:#E3F2FD,stroke:#1E88E5,stroke-width:1px
    style XD fill:#E3F2FD,stroke:#1E88E5,stroke-width:1px
    style SUM fill:#FFF3E0,stroke:#FB8C00,stroke-width:1px
    style SIG fill:#E8F5E9,stroke:#43A047,stroke-width:1px
    style OUT fill:#FCE4EC,stroke:#D81B60,stroke-width:1px
```
Model (logit → probability) #
\[ z = w^T x + b \]
\[ \hat{p} = \sigma(z) = \frac{1}{1 + e^{-z}} \]
- \( \hat{p} \) is interpreted as \( P(y=1 \mid x) \).
- A simple decision rule uses a threshold:
- predict class 1 if \( \hat{p} \ge 0.5 \)
- else class 0
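A minimal NumPy sketch of this forward pass and threshold rule, using made-up weights and a single example:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w = np.array([0.8, -0.4])        # illustrative weights
b = 0.1                          # illustrative bias
x = np.array([1.5, 2.0])         # one example with d = 2 features

z = w @ x + b                    # logit: z = w^T x + b
p_hat = sigmoid(z)               # probability of class 1
y_pred = int(p_hat >= 0.5)       # threshold at 0.5 to get the class
print(z, p_hat, y_pred)          # z = 0.5, p_hat ≈ 0.62, predicted class 1
```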
Why not use linear regression for classification? #
Linear regression outputs any real number, which is not a probability and does not naturally map to classes.
Sigmoid fixes this by mapping the logit to a valid probability in \( (0,1) \).
Sigmoid Function (quick properties) #
- monotonic increasing (larger \( z \) → larger probability)
- smooth and differentiable
- gives confident outputs when \( z \) has large magnitude
Derivative (useful in learning):
\[ \sigma'(z) = \sigma(z)\big(1-\sigma(z)\big) \]
Binary Loss: Cross-Entropy #
Binary cross-entropy compares predicted probabilities to true labels.
\[ J(w,b) = -\frac{1}{N}\sum_{i=1}^{N}\left[y^{(i)}\log(\hat{p}^{(i)}) + (1-y^{(i)})\log(1-\hat{p}^{(i)})\right] \]
Why cross-entropy (intuition):
- probabilistic interpretation (maximum likelihood)
- heavily penalises confident wrong predictions
- well-behaved gradients for sigmoid-based learning
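A small NumPy sketch of the loss on a toy batch (labels and probabilities are made up); the clipping is a common numerical-stability guard rather than part of the formula:

```python
import numpy as np

def binary_cross_entropy(y, p_hat, eps=1e-12):
    # Clip to avoid log(0) when a prediction saturates at exactly 0 or 1.
    p_hat = np.clip(p_hat, eps, 1 - eps)
    return -np.mean(y * np.log(p_hat) + (1 - y) * np.log(1 - p_hat))

y     = np.array([1, 0, 1, 0])           # true labels (toy example)
p_hat = np.array([0.9, 0.2, 0.6, 0.4])   # predicted probabilities (toy example)

print(binary_cross_entropy(y, p_hat))
# A confident wrong prediction (p_hat = 0.99 when y = 0) blows the loss up.
print(binary_cross_entropy(np.array([0]), np.array([0.99])))
```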
Training: Gradient Descent (usually SGD / Mini-batch) #
Parameter update (generic):
\[ \theta \leftarrow \theta - \eta \nabla_{\theta} J(\theta) \]
In practice for deep learning, you usually train with:
- mini-batch gradient descent (standard)
- often with Adam or SGD + momentum
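A compact NumPy sketch of mini-batch gradient descent for the sigmoid + binary cross-entropy model, on synthetic data with arbitrary hyperparameters; it uses the standard gradients \( \nabla_w J = \frac{1}{B} X_B^\top(\hat{p} - y_B) \) and \( \nabla_b J = \frac{1}{B}\sum(\hat{p} - y_B) \) for this activation/loss pairing:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic, roughly linearly separable data (for illustration only).
N, d = 200, 3
X = rng.normal(size=(N, d))
true_w = np.array([2.0, -1.0, 0.5])
y = (X @ true_w + 0.1 * rng.normal(size=N) > 0).astype(float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w, b = np.zeros(d), 0.0
lr, batch_size, epochs = 0.1, 32, 50          # arbitrary hyperparameters

for epoch in range(epochs):
    idx = rng.permutation(N)                  # shuffle examples each epoch
    for start in range(0, N, batch_size):
        batch = idx[start:start + batch_size]
        Xb, yb = X[batch], y[batch]
        p = sigmoid(Xb @ w + b)               # forward pass on the mini-batch
        grad_w = Xb.T @ (p - yb) / len(batch) # dJ/dw for sigmoid + binary cross-entropy
        grad_b = np.mean(p - yb)              # dJ/db
        w -= lr * grad_w                      # gradient descent update
        b -= lr * grad_b

accuracy = np.mean((sigmoid(X @ w + b) >= 0.5) == y)
print("training accuracy:", accuracy)
```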
Binary Classification Metrics #
Use metrics that match the cost of mistakes.
- Accuracy: good when classes are balanced
- Precision: important when false positives are costly (e.g., spam flagging)
- Recall: important when false negatives are costly (e.g., disease screening)
- F1 score: balances precision and recall
- Confusion matrix: shows TP/FP/FN/TN counts
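If scikit-learn is available, these metrics can be computed directly from true and predicted labels (toy values below):

```python
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, confusion_matrix)

y_true = [1, 0, 1, 1, 0, 1, 0, 0]   # toy ground-truth labels
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]   # toy model predictions

print("accuracy :", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))
print("recall   :", recall_score(y_true, y_pred))
print("F1       :", f1_score(y_true, y_pred))
print(confusion_matrix(y_true, y_pred))   # rows: true class, columns: predicted class
```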
Multi-class Classification: Softmax Regression #
When there are \( K \) classes, we use \( K \) output neurons (one per class).
The model outputs logits, then softmax converts them into a probability distribution.
Model #
For each class \( k \):
\[ z_k = w_k^T x + b_k \]
Softmax:
\[ \hat{p}_k = \frac{e^{z_k}}{\sum_{j=1}^{K} e^{z_j}} \]
Properties:
- \( \hat{p}_k \in [0,1] \)
- \( \sum_{k=1}^{K}\hat{p}_k = 1 \)
- prediction: choose the class with the largest probability
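A NumPy sketch of softmax over one row of logits (arbitrary values); subtracting the maximum logit before exponentiating is the usual numerical-stability trick and leaves the result unchanged:

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)            # stability: shift logits so the largest is 0
    exp_z = np.exp(z)
    return exp_z / exp_z.sum()

logits = np.array([2.0, 0.5, -1.0])   # arbitrary logits for K = 3 classes
p = softmax(logits)

print(p)             # each entry lies in [0, 1]
print(p.sum())       # entries sum to 1
print(np.argmax(p))  # predicted class: index of the largest probability
```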
Multi-class Loss: Categorical Cross-Entropy #
With one-hot labels \( y_k^{(i)} \):
\[ J(W,b) = -\frac{1}{N}\sum_{i=1}^{N}\sum_{k=1}^{K} y_k^{(i)}\log(\hat{p}_k^{(i)}) \]
Making Predictions (Inference) #
Binary:
- compute \( \hat{p} \)
- output probability, then threshold to get class
Multi-class:
- compute \( \hat{p}_1,\dots,\hat{p}_K \)
- choose \( \arg\max_k \hat{p}_k \)
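The two decision rules side by side in NumPy, with made-up model outputs:

```python
import numpy as np

# Binary: one probability per example; threshold at 0.5.
p_binary = np.array([0.83, 0.41, 0.55])          # made-up sigmoid outputs
classes_binary = (p_binary >= 0.5).astype(int)   # -> [1, 0, 1]

# Multi-class: one probability distribution (length K) per example; take argmax.
p_multi = np.array([[0.1, 0.7, 0.2],             # made-up softmax outputs, K = 3
                    [0.5, 0.3, 0.2]])
classes_multi = np.argmax(p_multi, axis=1)       # -> [1, 0]

print(classes_binary, classes_multi)
```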
Implementation tips and debugging #
Implementation tips:
- scale features (helps convergence and stability)
- check dimensions carefully (\( XW + b \) shape issues are common)
- start with a tiny dataset and overfit on purpose (sanity check)
- monitor loss curve (should generally trend downward)
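For the first tip, a common choice is standardisation; a minimal sketch, assuming a train/test split already exists and fitting the statistics on the training set only:

```python
import numpy as np

def standardise(X_train, X_test, eps=1e-8):
    # Fit mean/std on the training set only, then apply to both splits.
    mu = X_train.mean(axis=0)
    sigma = X_train.std(axis=0) + eps   # eps avoids division by zero for constant features
    return (X_train - mu) / sigma, (X_test - mu) / sigma

rng = np.random.default_rng(0)
X_train = rng.normal(5.0, 2.0, size=(100, 3))   # made-up raw features
X_test  = rng.normal(5.0, 2.0, size=(20, 3))
X_train_s, X_test_s = standardise(X_train, X_test)
print(X_train_s.mean(axis=0).round(3), X_train_s.std(axis=0).round(3))  # ~0 and ~1 per feature
```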
Debugging checklist:
- does loss decrease for a small batch?
- are labels encoded correctly (0/1 for binary, one-hot for multi-class)?
- are you using the right activation + loss pairing?
- sigmoid + binary cross-entropy
- softmax + categorical cross-entropy
- is learning rate too large (divergence) or too small (no learning)?
Summary #
- LNN classification uses no hidden layers and learns a linear boundary.
- Binary: sigmoid outputs a probability; train with binary cross-entropy.
- Multi-class: softmax outputs a probability distribution; train with categorical cross-entropy.
- Train with mini-batch GD (often Adam / SGD+momentum).
- Evaluate with the right metrics (accuracy vs precision/recall/F1 based on the problem).
Reference #
Course slides: DNN_M4_Linear NN Classification.
Zhang, Lipton, Li, Smola — Dive into Deep Learning (Linear Neural Networks for Classification: softmax + cross-entropy).
Singh & Raj — Deep Learning notes (classification and learning fundamentals).
DNN Module #3 — Linear Neural Networks for Regression. (T1 – Ch 4, T1 - Ch 12)