Deep Feedforward Neural Networks (DFNN) or Multi-Layer Perceptrons (MLP) for Classification #

A Deep Feedforward Neural Network (DFNN), also called a Multi-Layer Perceptron (MLP), is a neural network with one or more hidden layers in which information flows in one direction only, from input to output (no recurrent connections).
For classification, DFNNs learn non-linear decision boundaries by combining hidden layers with non-linear activation functions.

Core idea:

  • A single neuron can only learn linear boundaries.
  • Adding hidden layers + non-linearity allows DFNNs to solve problems like XOR.

MLP as a solution for XOR #

A single perceptron fails on XOR because XOR is not linearly separable.

An MLP solves XOR by:

  • using at least one hidden layer
  • using a non-linear activation in the hidden layer
  • combining hidden neurons to create a non-linear decision boundary

XOR is the simplest “proof by example” that you need hidden layers for non-linear separation.
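
To make this concrete, here is a hand-built 2-2-1 MLP with hard-threshold (step) activations that computes XOR. This is only a sketch (it assumes NumPy and uses hand-picked, not learned, weights): one hidden unit computes OR, the other computes AND, and the output fires for "OR but not AND".

import numpy as np

def step(z):
    # Heaviside step used as a hard threshold activation
    return (z >= 0).astype(float)

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)

# Hand-picked weights: hidden unit 1 computes OR, hidden unit 2 computes AND
W1 = np.array([[1.0, 1.0],
               [1.0, 1.0]])
b1 = np.array([-0.5, -1.5])
H = step(X @ W1 + b1)          # columns: OR(x1,x2), AND(x1,x2)

# Output fires when OR is on but AND is off: exactly XOR
w2 = np.array([1.0, -2.0])
b2 = -0.5
print(step(H @ w2 + b2))       # [0. 1. 1. 0.]

Training with gradient descent needs a differentiable activation instead of the step function; the from-scratch code at the end of this page does exactly that with sigmoids.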


Hidden layers and non-linearity #

Hidden layers apply:

  1. a linear transform (weights + bias)
  2. a non-linear activation (ReLU / sigmoid / tanh)

Without non-linearity, stacking layers collapses to a single linear function.
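
A quick numerical check of this collapse (a sketch assuming NumPy): two stacked linear layers with no activation in between are exactly one linear layer with merged weights.

import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(3, 4)), rng.normal(size=4)
W2, b2 = rng.normal(size=(4, 2)), rng.normal(size=2)
X = rng.normal(size=(5, 3))

two_linear = (X @ W1 + b1) @ W2 + b2          # two layers, no activation
merged     = X @ (W1 @ W2) + (b1 @ W2 + b2)   # one equivalent linear layer
print(np.allclose(two_linear, merged))        # True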


Forward Propagation (vectorised) #

Assume:

  • input dimension \( d \)
  • output classes \( K \)
  • number of layers \( L \) (including output layer)
  • batch size \( m \)

Batch matrix:

  • \( X \in \mathbb{R}^{m \times d} \)
  • \( A^{[0]} = X \)

For layer \( l \):

\[ Z^{[l]} = A^{[l-1]} W^{[l]} + \mathbf{1} \, (b^{[l]})^T \] \[ A^{[l]} = g^{[l]}(Z^{[l]}) \]

Where:

  • \( W^{[l]} \) is the weight matrix
  • \( b^{[l]} \) is the bias vector
  • \( g^{[l]} \) is the activation function
  • \( \mathbf{1} \) is an all-ones column vector (bias broadcast)

Output layer:

  • Binary: sigmoid
  • Multi-class: softmax
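
A sketch of the two output activations (assuming NumPy). Subtracting the row-wise max before exponentiating is a standard numerical-stability trick and does not change the softmax result:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(Z):
    Z = Z - Z.max(axis=1, keepdims=True)      # stability: shift logits before exp
    e = np.exp(Z)
    return e / e.sum(axis=1, keepdims=True)

print(sigmoid(np.array([0.0, 2.0])))          # [0.5, 0.8808...]
print(softmax(np.array([[2.0, -1.0, 0.5]])))  # one row that sums to 1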

Forward pass algorithm (batch) #

Input: X, {W[l], b[l]} for l=1..L
A[0] = X
for l = 1..L-1:
    Z[l] = A[l-1] W[l] + b[l]
    A[l] = g(Z[l])                 # ReLU / tanh / sigmoid
Z[L] = A[L-1] W[L] + b[L]
A[L] = output_activation(Z[L])     # sigmoid or softmax
Return A[L] (predictions)
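
The same algorithm in vectorised NumPy, as a minimal sketch. ReLU hidden activations and a softmax output are assumed, and the cached (A, Z) pairs are exactly what backpropagation will consume later:

import numpy as np

def relu(Z):
    return np.maximum(Z, 0.0)

def softmax(Z):
    Z = Z - Z.max(axis=1, keepdims=True)
    e = np.exp(Z)
    return e / e.sum(axis=1, keepdims=True)

def forward(X, Ws, bs):
    # Ws[l]: (n_{l-1} x n_l) weights, bs[l]: (n_l,) biases. Broadcasting adds
    # each bias to every row of Z, matching Z = A W + 1 b^T above.
    A, caches = X, []
    for W, b in zip(Ws[:-1], bs[:-1]):
        Z = A @ W + b
        caches.append((A, Z))      # cache (A_prev, Z) for backprop
        A = relu(Z)
    Z = A @ Ws[-1] + bs[-1]
    caches.append((A, Z))
    return softmax(Z), caches

# tiny usage example: a 2 -> 4 -> 3 network on a batch of 5
rng = np.random.default_rng(0)
Ws = [0.1 * rng.normal(size=(2, 4)), 0.1 * rng.normal(size=(4, 3))]
bs = [np.zeros(4), np.zeros(3)]
P, _ = forward(rng.normal(size=(5, 2)), Ws, bs)
print(P.shape, P.sum(axis=1))      # (5, 3), each row sums to 1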

Define loss functions and compute the error #

Binary cross-entropy (sigmoid output) #

\[ J = -\frac{1}{m}\sum_{i=1}^{m}\left[y^{(i)}\log(\hat{p}^{(i)}) + (1-y^{(i)})\log(1-\hat{p}^{(i)})\right] \]

Where \( \hat{p} \) is the predicted probability for class 1.
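
A direct sketch of this loss in NumPy; the clipping guards against log(0) and is an implementation detail rather than part of the formula:

import numpy as np

def binary_cross_entropy(p_hat, y, eps=1e-12):
    # p_hat, y: (m,) arrays of predicted probabilities and 0/1 labels
    p_hat = np.clip(p_hat, eps, 1.0 - eps)
    return -np.mean(y * np.log(p_hat) + (1 - y) * np.log(1 - p_hat))

print(binary_cross_entropy(np.array([0.9, 0.2, 0.8]),
                           np.array([1.0, 0.0, 1.0])))   # ≈ 0.184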

Multi-class cross-entropy (softmax output) #

\[ J = -\frac{1}{m}\sum_{i=1}^{m}\sum_{k=1}^{K} y_k^{(i)} \log(\hat{p}_k^{(i)}) \]

Where:

  • \( y_k^{(i)} \) is a one-hot label
  • \( \hat{p}_k^{(i)} \) is predicted probability for class \( k \)
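
And the multi-class version, assuming one-hot labels as above (with integer labels one would index the predicted rows instead):

import numpy as np

def cross_entropy(P, Y, eps=1e-12):
    # P: (m, K) predicted probabilities, Y: (m, K) one-hot labels
    return -np.mean(np.sum(Y * np.log(P + eps), axis=1))

P = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.8, 0.1]])
Y = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0]])
print(cross_entropy(P, Y))   # -(log 0.7 + log 0.8) / 2 ≈ 0.290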

Backward Propagation (vectorised) #

Backprop computes gradients using the chain rule layer by layer from output to input.

Let:

  • \( dZ^{[l]} = \frac{\partial J}{\partial Z^{[l]}} \)
  • \( dW^{[l]} = \frac{\partial J}{\partial W^{[l]}} \)
  • \( db^{[l]} = \frac{\partial J}{\partial b^{[l]}} \)
  • \( dA^{[l]} = \frac{\partial J}{\partial A^{[l]}} \)

Output layer gradients (softmax + cross-entropy shortcut) #

\[ dZ^{[L]} = A^{[L]} - Y \]

This compact form arises because the softmax Jacobian and the cross-entropy derivative cancel; the same formula holds for a sigmoid output with binary cross-entropy.

Then:

\[ dW^{[L]} = \frac{1}{m}(A^{[L-1]})^T dZ^{[L]} \] \[ db^{[L]} = \frac{1}{m}\sum_{i=1}^{m} dZ^{[L]}_{(i,:)} \]

Propagate backward:

\[ dA^{[L-1]} = dZ^{[L]} (W^{[L]})^T \]

Hidden layer gradients #

For \( l = L-1, \dots, 1 \):

\[ dZ^{[l]} = dA^{[l]} \odot g'^{[l]}(Z^{[l]}) \] \[ dW^{[l]} = \frac{1}{m}(A^{[l-1]})^T dZ^{[l]} \] \[ db^{[l]} = \frac{1}{m}\sum_{i=1}^{m} dZ^{[l]}_{(i,:)} \] \[ dA^{[l-1]} = dZ^{[l]} (W^{[l]})^T \]

Where \( \odot \) is element-wise multiplication.


Backprop algorithm (batch) #

Input: cached {A[l], Z[l]} from forward pass, labels Y
Compute dZ[L] = A[L] - Y                      # softmax + CE shortcut
Compute dW[L] = (1/m) A[L-1]^T dZ[L]
Compute db[L] = (1/m) sum_rows(dZ[L])
dA[L-1] = dZ[L] W[L]^T

for l = L-1 down to 1:
    dZ[l] = dA[l] ⊙ g'(Z[l])
    dW[l] = (1/m) A[l-1]^T dZ[l]
    db[l] = (1/m) sum_rows(dZ[l])
    dA[l-1] = dZ[l] W[l]^T
Return {dW[l], db[l]}
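
The same backward pass in vectorised NumPy, as a sketch that matches the forward() function above (so ReLU hidden layers and a softmax + cross-entropy output are assumed; caches[l] holds the (A_prev, Z) pair saved during the forward pass):

import numpy as np

def backward(P, Y, caches, Ws):
    # returns per-layer (dW, db), ordered from first layer to last
    m = Y.shape[0]
    grads = [None] * len(Ws)
    dZ = P - Y                                   # softmax + CE shortcut
    for l in reversed(range(len(Ws))):
        A_prev, Z = caches[l]
        grads[l] = (A_prev.T @ dZ / m,           # dW[l]
                    dZ.sum(axis=0) / m)          # db[l]
        if l > 0:
            dA = dZ @ Ws[l].T                    # propagate to layer below
            dZ = dA * (caches[l - 1][1] > 0.0)   # ReLU derivative on Z[l-1]
    return grads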

Weight update using gradients (vectorised) #

Using learning rate \( \eta \):

\[ W^{[l]} \leftarrow W^{[l]} - \eta \, dW^{[l]} \] \[ b^{[l]} \leftarrow b^{[l]} - \eta \, db^{[l]} \]
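
Putting the pieces together, a minimal full-batch training loop under the same assumptions, reusing forward() and backward() from the sketches above (the dataset here is an arbitrary XOR-like toy problem):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
labels = (X[:, 0] * X[:, 1] > 0).astype(int)   # XOR-like sign pattern
Y = np.eye(2)[labels]                          # one-hot targets

Ws = [0.5 * rng.normal(size=(2, 8)), 0.5 * rng.normal(size=(8, 2))]
bs = [np.zeros(8), np.zeros(2)]
eta = 0.5

for epoch in range(2000):
    P, caches = forward(X, Ws, bs)             # forward pass
    grads = backward(P, Y, caches, Ws)         # gradients
    for l, (dW, db) in enumerate(grads):
        Ws[l] -= eta * dW                      # W <- W - eta dW
        bs[l] -= eta * db                      # b <- b - eta db

P, _ = forward(X, Ws, bs)
print("train accuracy:", (P.argmax(axis=1) == labels).mean())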

Impact of depth and width in DFNN #

Depth (more layers) #

  • Can represent complex functions more efficiently than shallow networks
  • Encourages hierarchical feature learning (simple → complex)
  • Can be harder to train without good initialisation/normalisation (vanishing/exploding gradients)

Width (more neurons per layer) #

  • Increases capacity within a layer
  • With limited data, higher width can increase overfitting risk

Rule of thumb:

  • depth helps expressivity and feature hierarchy
  • width helps capacity inside a layer
  • both require good optimisation and regularisation
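
Capacity can be made concrete by counting parameters. A quick sketch (the layer sizes are arbitrary examples) comparing a deeper-narrower and a shallower-wider layout:

def n_params(layer_sizes):
    # weights + biases of a fully connected net with the given layer sizes
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))

print(n_params([64, 32, 32, 32, 10]))   # deeper and narrower: 4522
print(n_params([64, 128, 10]))          # shallower and wider: 9610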

Code implementation from scratch (webinar) #

Below is a minimal “from scratch” XOR MLP (2–2–1) showing:

  • vectorised forward pass
  • backprop gradients
  • gradient descent updates

# XOR with a tiny MLP (2-2-1) from scratch (educational)
# - forward pass
# - backprop
# - gradient descent
#
# No ML libraries used.

import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dsigmoid(a):
    # derivative wrt z when a = sigmoid(z)
    return a * (1.0 - a)

def matmul(A, B):
    m, n = len(A), len(A[0])
    n2, p = len(B), len(B[0])
    assert n == n2
    out = [[0.0]*p for _ in range(m)]
    for i in range(m):
        for k in range(n):
            aik = A[i][k]
            for j in range(p):
                out[i][j] += aik * B[k][j]
    return out

def add_bias(M, b):
    return [[M[i][j] + b[j] for j in range(len(b))] for i in range(len(M))]

def transpose(A):
    return [list(row) for row in zip(*A)]

def elemwise(A, B):
    return [[A[i][j]*B[i][j] for j in range(len(A[0]))] for i in range(len(A))]

def sub(A, B):
    return [[A[i][j]-B[i][j] for j in range(len(A[0]))] for i in range(len(A))]

def scale(A, s):
    return [[A[i][j]*s for j in range(len(A[0]))] for i in range(len(A))]

def sum_rows(A):
    p = len(A[0])
    out = [0.0]*p
    for i in range(len(A)):
        for j in range(p):
            out[j] += A[i][j]
    return out

def sigmoid_mat(Z):
    return [[sigmoid(Z[i][j]) for j in range(len(Z[0]))] for i in range(len(Z))]

def dsigmoid_mat(A):
    return [[dsigmoid(A[i][j]) for j in range(len(A[0]))] for i in range(len(A))]

# XOR data
X = [[0.0,0.0],[0.0,1.0],[1.0,0.0],[1.0,1.0]]   # 4 x 2
Y = [[0.0],[1.0],[1.0],[0.0]]                   # 4 x 1

m = len(X)

random.seed(0)  # fixed seed so the demo is reproducible

def randmat(r, c, spread=1.0):
    # uniform init in (-spread/2, spread/2); 'spread' avoids shadowing scale()
    return [[(random.random()-0.5)*spread for _ in range(c)] for _ in range(r)]

W1 = randmat(2, 2)              # 2 x 2
b1 = [0.0, 0.0]                 # 2
W2 = randmat(2, 1)              # 2 x 1
b2 = [0.0]                      # 1

lr = 0.5
epochs = 5000

for _ in range(epochs):
    # forward
    Z1 = add_bias(matmul(X, W1), b1)     # 4 x 2
    A1 = sigmoid_mat(Z1)                # 4 x 2
    Z2 = add_bias(matmul(A1, W2), b2)   # 4 x 1
    A2 = sigmoid_mat(Z2)                # 4 x 1

    # MSE loss for demo: J = (1/m) * sum((a2 - y)^2), so dA2 = (2/m)(A2 - Y).
    # The 1/m factor is folded into dA2 here, so dW/db below are not rescaled.
    dA2 = scale(sub(A2, Y), 2.0/m)

    # backprop through output layer
    dZ2 = elemwise(dA2, dsigmoid_mat(A2))          # 4 x 1
    dW2 = matmul(transpose(A1), dZ2)               # 2 x 1
    db2 = sum_rows(dZ2)                            # 1

    # backprop through hidden layer
    dA1 = matmul(dZ2, transpose(W2))               # 4 x 2
    dZ1 = elemwise(dA1, dsigmoid_mat(A1))          # 4 x 2
    dW1 = matmul(transpose(X), dZ1)                # 2 x 2
    db1 = sum_rows(dZ1)                            # 2

    # update
    W2 = sub(W2, scale(dW2, lr))
    b2 = [b2[j] - lr*db2[j] for j in range(len(b2))]
    W1 = sub(W1, scale(dW1, lr))
    b1 = [b1[j] - lr*db1[j] for j in range(len(b1))]

def predict(x):
    z1 = [x[0]*W1[0][j] + x[1]*W1[1][j] + b1[j] for j in range(2)]
    a1 = [sigmoid(z) for z in z1]
    z2 = a1[0]*W2[0][0] + a1[1]*W2[1][0] + b2[0]
    a2 = sigmoid(z2)
    return 1 if a2 >= 0.5 else 0, a2

for x, y in zip(X, Y):
    cls, prob = predict(x)
    print(x, "pred:", cls, "prob:", round(prob, 3), "true:", int(y[0]))

This code is intentionally minimal for learning. In practice, DFNN classification typically uses cross-entropy loss, mini-batches, and more stable initialisation/normalisation.


Summary #

  • DFNN/MLP adds hidden layers + non-linear activations to learn non-linear decision boundaries.
  • This makes it a solution to XOR (which a single perceptron cannot solve).
  • Training uses forward propagation to compute predictions, then backpropagation to compute gradients.
  • Weights are updated using gradients and a learning rate.
  • Depth and width both increase capacity, but affect optimisation and overfitting differently.

