DNN Formula and Numerical Sheet #

This page consolidates the most useful Deep Neural Networks formulas and numerical patterns for revision.

It is designed for exam preparation and should be used together with the topic pages.

Revision strategy:
Do not only memorise formulas.
For each formula, know:
what each symbol means
when to apply it
how to substitute values carefully
what the output shape or answer represents

1. Artificial Neuron #

Weighted Sum ☆ #

\[ z = \sum_{i=1}^{n} w_i x_i + b \]

Vector form:

\[ z = w^T x + b \]

Activation #

\[ \hat{y}=f(z) \]
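As a quick numerical check of the two formulas above, the sketch below computes the weighted sum and then applies an activation. The function name `neuron`, the sample values, and the choice of ReLU for \( f \) are illustrative assumptions, not part of the sheet.

```python
def relu(z):
    """ReLU activation, used here only as one possible choice of f."""
    return max(0.0, z)

def neuron(x, w, b, activation=relu):
    """Return (z, y_hat) with z = sum_i w_i * x_i + b and y_hat = f(z)."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return z, activation(z)

# Illustrative inputs (assumed values)
x = [1.0, 2.0, 3.0]
w = [0.5, -0.25, 0.5]
b = 0.5

z, y_hat = neuron(x, w, b)
print(z, y_hat)  # 2.0 2.0
```

Substituting by hand gives the same result: 0.5 - 0.5 + 1.5 + 0.5 = 2.0, and ReLU leaves a positive value unchanged.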

2. Regression with Single Neuron #

Linear Prediction ☆ #

\[ \hat{y}=w^T x \]

Mean Squared Error #

\[ J(w)=\frac{1}{N}\sum_{i=1}^{N}\frac{1}{2}(w^T x^{(i)}-y^{(i)})^2 \]

Gradient of Squared Error #

\[ \nabla J(w)=\frac{1}{N}X^T(Xw-y) \]

Gradient Descent Update #

\[ w^{(t+1)}=w^{(t)}-\eta \nabla J(w^{(t)}) \]
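The cost, gradient, and update formulas above can be traced on a tiny dataset. This is a minimal sketch in plain Python; the helper names, the data, and the learning rate are assumptions chosen for illustration, and the bias is folded into \( w \) by giving every input a leading 1.

```python
def predict(X, w):
    """Row-wise w^T x for every example in X."""
    return [sum(wj * xj for wj, xj in zip(w, row)) for row in X]

def cost(X, y, w):
    """J(w) = (1/N) * sum of 0.5 * (w^T x - y)^2."""
    n = len(X)
    return sum(0.5 * (p - t) ** 2 for p, t in zip(predict(X, w), y)) / n

def gradient(X, y, w):
    """(1/N) * X^T (Xw - y), computed column by column."""
    n = len(X)
    residuals = [p - t for p, t in zip(predict(X, w), y)]
    return [sum(r * row[j] for r, row in zip(residuals, X)) / n
            for j in range(len(w))]

def gd_step(X, y, w, eta):
    """One update: w <- w - eta * grad J(w)."""
    g = gradient(X, y, w)
    return [wj - eta * gj for wj, gj in zip(w, g)]

# Assumed toy data generated by y = 1 + x; first column is the constant 1
X = [[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [2.0, 3.0, 4.0]
w0 = [0.0, 0.0]

g0 = gradient(X, y, w0)          # approximately [-3.0, -6.667]
w1 = gd_step(X, y, w0, eta=0.1)  # approximately [0.3, 0.667]
```

With \( w = 0 \) the residuals are \( (-2, -3, -4) \), so the gradient is \( \frac{1}{3}(-9, -20) \) and one step with \( \eta = 0.1 \) lowers the cost.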

3. Binary Classification #

Sigmoid ☆ #

\[ \sigma(z)=\frac{1}{1+e^{-z}} \]

Sigmoid Derivative #

\[ \sigma'(z)=\sigma(z)(1-\sigma(z)) \]

Prediction Rule #

\[ \hat{y} = \begin{cases}1, & \sigma(z) \ge 0.5 \\ 0, & \sigma(z) < 0.5\end{cases} \]

Binary Cross-Entropy #

\[ \ell(\hat{y},y)= -y\log(\hat{y})-(1-y)\log(1-\hat{y}) \]
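The sigmoid, its derivative, the prediction rule, and the binary cross-entropy can be checked numerically with a few lines. This is a sketch, not a reference implementation: the function names are hypothetical, and the branch on the sign of `z` and the `eps` clipping are added assumptions for numerical safety rather than part of the formulas.

```python
import math

def sigmoid(z):
    """sigma(z) = 1 / (1 + exp(-z)), written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def sigmoid_derivative(z):
    """sigma'(z) = sigma(z) * (1 - sigma(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)

def predict_label(z):
    """Prediction rule: 1 if sigma(z) >= 0.5, else 0."""
    return 1 if sigmoid(z) >= 0.5 else 0

def binary_cross_entropy(y_hat, y, eps=1e-12):
    """-y*log(y_hat) - (1-y)*log(1-y_hat); eps keeps log() away from 0."""
    y_hat = min(max(y_hat, eps), 1.0 - eps)
    return -y * math.log(y_hat) - (1 - y) * math.log(1.0 - y_hat)

print(sigmoid(0.0))             # 0.5
print(sigmoid_derivative(0.0))  # 0.25
print(predict_label(-2.0))      # 0
```

Because \( \sigma(0) = 0.5 \), the prediction rule is equivalent to checking whether \( z \ge 0 \).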

4. Multi-Class Classification #

Softmax ☆ #

\[ \hat{y}_k = \frac{e^{z_k}}{\sum_{j=1}^{K}e^{z_j}} \]

Categorical Cross-Entropy #

\[ \ell(\hat{y},y)= -\sum_{k=1}^{K} y_k \log(\hat{y}_k) \]

Useful Shortcut #

For softmax with cross-entropy:

\[ \frac{\partial \ell}{\partial z_k}=\hat{y}_k-y_k \]
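A small numerical example ties the three formulas together. The sketch below uses assumed logits and a one-hot target; subtracting the maximum logit before exponentiating is a common stability trick that leaves the softmax output unchanged, and it is an addition to the formula as written.

```python
import math

def softmax(logits):
    """y_hat_k = exp(z_k) / sum_j exp(z_j), shifted by max(z) for stability."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def categorical_cross_entropy(y_hat, y):
    """-sum_k y_k * log(y_hat_k); terms with y_k = 0 contribute nothing."""
    return -sum(t * math.log(p) for p, t in zip(y_hat, y) if t > 0)

def softmax_ce_gradient(y_hat, y):
    """Shortcut d(loss)/dz_k = y_hat_k - y_k."""
    return [p - t for p, t in zip(y_hat, y)]

# Assumed example: three classes, true class is the first one
logits = [2.0, 1.0, 0.1]
y = [1, 0, 0]

probs = softmax(logits)                     # roughly [0.659, 0.242, 0.099]
loss = categorical_cross_entropy(probs, y)  # roughly 0.417
grad = softmax_ce_gradient(probs, y)        # roughly [-0.341, 0.242, 0.099]
predicted_class = probs.index(max(probs))   # 0
```

The gradient entries always sum to zero because both \( \hat{y} \) and a one-hot \( y \) sum to one.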

5. DFNN / MLP Shapes #

Layer-wise Computation ☆ #

For layer \( l \):

\[ z^{(l)} = h^{(l-1)}W^{(l)} + b^{(l)} \] \[ h^{(l)} = \sigma^{(l)}(z^{(l)}) \]

Parameter Count for Dense Layer ☆ #

If input units are \( n_{in} \) and output units are \( n_{out} \):

\[ \text{Parameters}=n_{in}n_{out}+n_{out} \]
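With the row-vector convention used above, \( h^{(l-1)} \) has shape (batch, \( n_{in} \)), \( W^{(l)} \) has shape (\( n_{in} \), \( n_{out} \)), and \( b^{(l)} \) has shape (\( n_{out} \)). The sketch below applies the parameter formula layer by layer; the helper names and the 784-128-10 network are assumed examples.

```python
def dense_params(n_in, n_out):
    """Weights (n_in * n_out) plus one bias per output unit."""
    return n_in * n_out + n_out

def mlp_params(layer_sizes):
    """Total parameters of a fully connected network given its layer widths."""
    return sum(dense_params(a, b)
               for a, b in zip(layer_sizes, layer_sizes[1:]))

# Assumed example: 784 inputs -> 128 hidden units -> 10 outputs
print(dense_params(784, 128))      # 100480
print(dense_params(128, 10))       # 1290
print(mlp_params([784, 128, 10]))  # 101770
```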

6. CNN Output Size #

Convolution Output Size ☆ #

For input size \( N \), kernel size \( K \), padding \( P \), and stride \( S \):

\[ O = \left\lfloor \frac{N-K+2P}{S} \right\rfloor + 1 \]

For height and width separately:

\[ H_{out}=\left\lfloor \frac{H-K_h+2P_h}{S_h} \right\rfloor + 1 \] \[ W_{out}=\left\lfloor \frac{W-K_w+2P_w}{S_w} \right\rfloor + 1 \]

Same Padding for Odd Kernel #

For stride 1 and odd kernel size:

\[ P = \frac{K-1}{2} \]
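The output-size formula maps directly onto integer floor division. The following sketch uses hypothetical helper names and assumed example sizes; apply it once for height and once for width when they differ.

```python
def conv_output_size(n, k, p=0, s=1):
    """O = floor((N - K + 2P) / S) + 1 along one spatial dimension."""
    return (n - k + 2 * p) // s + 1

def same_padding(k):
    """P = (K - 1) / 2, valid for stride 1 and odd kernel size."""
    if k % 2 == 0:
        raise ValueError("same_padding assumes an odd kernel size")
    return (k - 1) // 2

print(conv_output_size(32, 5))                       # 28
print(conv_output_size(224, 7, p=3, s=2))            # 112
print(conv_output_size(32, 3, p=same_padding(3)))    # 32
```

In the second call the division is not exact (223 / 2), so the floor matters: 111 + 1 = 112.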

7. CNN Parameter Count #

Convolution Layer Parameters ☆ #

For kernel size \( K_h \times K_w \), input channels \( C_{in} \), and output channels \( C_{out} \):

\[ \text{Parameters}=(K_hK_wC_{in}+1)C_{out} \]

The +1 represents one bias per filter.

1 by 1 Convolution Parameters #

\[ \text{Parameters}=(1 \times 1 \times C_{in}+1)C_{out} \]
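Both parameter formulas are the same expression with different kernel sizes, as this short sketch shows. The function name and the layer sizes are assumed for illustration.

```python
def conv_params(k_h, k_w, c_in, c_out):
    """(K_h * K_w * C_in + 1) * C_out, with the +1 being one bias per filter."""
    return (k_h * k_w * c_in + 1) * c_out

# Assumed examples
print(conv_params(3, 3, 3, 64))    # 1792  (3x3 kernels on an RGB input, 64 filters)
print(conv_params(1, 1, 256, 64))  # 16448 (1x1 convolution reducing 256 channels to 64)
```

Note that the count does not depend on the spatial size of the input, which is the effect of parameter sharing.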

8. Pooling Output Size #

Pooling uses the same spatial size formula as convolution.

\[ O = \left\lfloor \frac{N-K+2P}{S} \right\rfloor + 1 \]

Pooling usually has no learnable parameters.
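A short sketch of the pooling size calculation follows. Defaulting the stride to the window size is a common convention assumed here, not something the formula requires; the function name and sizes are illustrative.

```python
def pool_output_size(n, k, s=None, p=0):
    """Same formula as convolution: floor((N - K + 2P) / S) + 1."""
    if s is None:
        s = k  # assumed default: non-overlapping windows
    return (n - k + 2 * p) // s + 1

print(pool_output_size(28, 2))       # 14 (2x2 window, stride 2)
print(pool_output_size(55, 3, s=2))  # 27 (overlapping 3x3 window, stride 2)
print(pool_output_size(7, 2))        # 3  (the last row/column is dropped by the floor)
```

Pooling acts on each channel separately, so the number of channels is unchanged.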

9. Dilated Convolution #

Effective Kernel Size ☆ #

For kernel size \( K \) and dilation \( D \):

\[ K_{eff}=K+(K-1)(D-1) \]

Then use \( K_{eff} \) in the output-size formula.
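The two-step procedure (effective kernel first, then output size) can be sketched as below. Helper names and the example sizes are assumptions.

```python
def effective_kernel(k, d):
    """K_eff = K + (K - 1) * (D - 1)."""
    return k + (k - 1) * (d - 1)

def dilated_conv_output_size(n, k, d=1, p=0, s=1):
    """Output size with K replaced by K_eff in the usual formula."""
    k_eff = effective_kernel(k, d)
    return (n - k_eff + 2 * p) // s + 1

print(effective_kernel(3, 1))                # 3 (no dilation)
print(effective_kernel(3, 2))                # 5
print(dilated_conv_output_size(32, 3, d=2))  # 28
```

Dilation changes the receptive field but not the parameter count: a 3x3 kernel with dilation 2 still has 9 weights per input channel.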

10. RNN #

Hidden State Update ☆ #

\[ h_t=\phi(W_{hh}h_{t-1}+W_{xh}x_t+b_h) \]

Output #

\[ o_t=W_{ho}h_t+b_o \]
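To see how the same weights are reused at every time step, the sketch below unrolls the hidden-state update over a two-step sequence. The choice \( \phi = \tanh \), the tiny weight matrices, and all helper names are assumptions made for illustration.

```python
import math

def matvec(W, v):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(wij * vj for wij, vj in zip(row, v)) for row in W]

def rnn_step(x_t, h_prev, W_xh, W_hh, b_h, phi=math.tanh):
    """h_t = phi(W_hh h_{t-1} + W_xh x_t + b_h)."""
    pre = [a + c + b for a, c, b in
           zip(matvec(W_hh, h_prev), matvec(W_xh, x_t), b_h)]
    return [phi(v) for v in pre]

def rnn_output(h_t, W_ho, b_o):
    """o_t = W_ho h_t + b_o."""
    return [a + b for a, b in zip(matvec(W_ho, h_t), b_o)]

# Assumed sizes: input dimension 1, hidden dimension 2, output dimension 1
W_xh = [[1.0], [0.5]]
W_hh = [[0.5, 0.0], [0.0, 0.5]]
b_h = [0.0, 0.0]
W_ho = [[1.0, 1.0]]
b_o = [0.0]

xs = [[1.0], [0.0]]  # a sequence of two inputs
h = [0.0, 0.0]       # initial hidden state
states = []
for x_t in xs:
    h = rnn_step(x_t, h, W_xh, W_hh, b_h)
    states.append(h)
o_last = rnn_output(h, W_ho, b_o)
```

Even though the second input is zero, the second hidden state is non-zero because it depends on the first hidden state through \( W_{hh} \).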

11. LSTM / GRU Concepts #

For LSTM, remember:

Gate	Purpose
Forget gate	what to remove from memory
Input gate	what new information to store
Candidate memory	proposed new content
Output gate	what to expose as hidden state

For GRU, remember:

Gate	Purpose
Reset gate	how much past to forget while forming candidate
Update gate	how much old hidden state to keep
Candidate hidden state	proposed new hidden representation

12. Attention #

Attention Weighted Sum ☆ #

\[ \text{Attention}(q,K,V)=\sum_{i=1}^{n}\alpha_i v_i \]

Attention Weights #

\[ \alpha_i = \frac{\exp(\text{score}(q,k_i))}{\sum_{j=1}^{n}\exp(\text{score}(q,k_j))} \]

Scaled Dot-Product Attention ☆ #

\[ \text{Attention}(Q,K,V)=\text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V \]
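For a single query, scaled dot-product attention reduces to the weighted-sum and attention-weight formulas with \( \text{score}(q,k_i) = q \cdot k_i / \sqrt{d_k} \). The sketch below implements that single-query case in plain Python; the function names and the toy vectors are assumptions.

```python
import math

def softmax(scores):
    """Softmax over a list of scores, shifted by the max for stability."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(q, keys, values):
    """Return (output, weights) for one query against n key/value pairs."""
    d_k = len(q)
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in keys]
    alphas = softmax(scores)
    dim = len(values[0])
    output = [sum(a * v[j] for a, v in zip(alphas, values)) for j in range(dim)]
    return output, alphas

# Identical keys give equal weights, so the output is the mean of the values
out_uniform, w_uniform = attention([1.0, 0.0],
                                   [[1.0, 0.0], [1.0, 0.0]],
                                   [[2.0, 0.0], [0.0, 2.0]])

# A key aligned with the query dominates the weights
out_peaked, w_peaked = attention([1.0, 0.0],
                                 [[10.0, 0.0], [0.0, 10.0]],
                                 [[2.0, 0.0], [0.0, 2.0]])
```

The matrix form applies the same computation to every row of \( Q \) at once.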

13. Transformer #

Positional Encoding ☆ #

\[ PE(pos,2i)=\sin\left(\frac{pos}{10000^{2i/d_{model}}}\right) \] \[ PE(pos,2i+1)=\cos\left(\frac{pos}{10000^{2i/d_{model}}}\right) \]
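The sinusoidal encoding fills even dimensions with sines and odd dimensions with cosines, pairing each sine with the cosine of the same angle. A minimal sketch, with a hypothetical function name and a small assumed \( d_{model} \):

```python
import math

def positional_encoding(pos, d_model):
    """Return the d_model-dimensional encoding for one position."""
    pe = [0.0] * d_model
    for two_i in range(0, d_model, 2):  # two_i plays the role of 2i
        angle = pos / (10000 ** (two_i / d_model))
        pe[two_i] = math.sin(angle)
        if two_i + 1 < d_model:
            pe[two_i + 1] = math.cos(angle)
    return pe

print(positional_encoding(0, 4))  # [0.0, 1.0, 0.0, 1.0]
pe1 = positional_encoding(1, 4)   # [sin(1), cos(1), sin(0.01), cos(0.01)]
```

For \( d_{model} = 4 \) the second pair uses the divisor \( 10000^{2/4} = 100 \), so its angle changes much more slowly with position than the first pair.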

Transformer Input #

\[ X = \text{Embedding}(tokens) + PE(positions) \]

Position-wise Feed Forward Network #

\[ FFN(x)=\max(0,xW_1+b_1)W_2+b_2 \]

Add and Norm #

\[ \text{Output}=\text{LayerNorm}(x+\text{Sublayer}(x)) \]

14. Optimisers #

Gradient Descent ☆ #

\[ \theta_{t+1}=\theta_t-\eta \nabla_{\theta}\mathcal{L}(\theta_t) \]

Mini-Batch Gradient #

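\[ g=\frac{1}{|\mathcal{B}|}\sum_{i \in \mathcal{B}}\nabla_{\theta}\ell_i(\theta) \]

Here \( \mathcal{B} \) is the set of examples in the mini-batch and \( |\mathcal{B}| \) is the batch size.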

Momentum #

\[ v_t=\beta v_{t-1}+g_t \] \[ \theta_{t+1}=\theta_t-\eta v_t \]

Adam #

\[ m_t=\beta_1m_{t-1}+(1-\beta_1)g_t \] \[ v_t=\beta_2v_{t-1}+(1-\beta_2)g_t^2 \] \[ \hat{m}_t=\frac{m_t}{1-\beta_1^t} \] \[ \hat{v}_t=\frac{v_t}{1-\beta_2^t} \] \[ \theta_{t+1}=\theta_t-\eta\frac{\hat{m}_t}{\sqrt{\hat{v}_t}+\epsilon} \]
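The three update rules can be compared on a single scalar parameter. In this sketch the hatted Adam quantities are computed as the standard bias-corrected moments, \( \hat{m}_t = m_t/(1-\beta_1^t) \) and \( \hat{v}_t = v_t/(1-\beta_2^t) \); the function names, default hyperparameters, and the constant gradient are assumptions for illustration.

```python
import math

def sgd_step(theta, g, eta):
    """theta <- theta - eta * g."""
    return theta - eta * g

def momentum_step(theta, v, g, eta, beta=0.9):
    """Update the velocity first, then the parameter."""
    v = beta * v + g
    return theta - eta * v, v

def adam_step(theta, m, v, g, t, eta=0.001,
              beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update at step t (t starts at 1)."""
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g * g
    m_hat = m / (1 - beta1 ** t)  # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)  # bias-corrected second moment
    theta = theta - eta * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v

# Assumed setup: theta = 1.0 and a constant gradient g = 2.0
theta_sgd = sgd_step(1.0, 2.0, eta=0.1)            # 0.8

theta_m, vel = momentum_step(1.0, 0.0, 2.0, eta=0.1)      # theta 0.8, velocity 2.0
theta_m, vel = momentum_step(theta_m, vel, 2.0, eta=0.1)  # theta about 0.42, velocity 3.8

theta_a, m1, v1 = adam_step(1.0, 0.0, 0.0, 2.0, t=1)      # theta about 0.999
```

On the first Adam step the bias correction gives \( \hat{m}_1 = g \) and \( \hat{v}_1 = g^2 \), so the parameter moves by almost exactly \( \eta \) regardless of the gradient's magnitude.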

15. Regularisation #

L2 / Weight Decay ☆ #

\[ J_{regularised}(\theta)=J(\theta)+\lambda\|\theta\|_2^2 \]

L1 #

\[ J_{regularised}(\theta)=J(\theta)+\lambda\|\theta\|_1 \]
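The two penalties differ only in the norm applied to the parameters. A short sketch with assumed values for the base loss, \( \lambda \), and \( \theta \):

```python
def l2_regularised(base_loss, theta, lam):
    """J + lambda * ||theta||_2^2 (sum of squared parameters)."""
    return base_loss + lam * sum(t * t for t in theta)

def l1_regularised(base_loss, theta, lam):
    """J + lambda * ||theta||_1 (sum of absolute parameters)."""
    return base_loss + lam * sum(abs(t) for t in theta)

# Assumed example values
theta = [3.0, -4.0]
j_l2 = l2_regularised(1.0, theta, lam=0.01)  # 1.0 + 0.01 * 25 = 1.25
j_l1 = l1_regularised(1.0, theta, lam=0.01)  # 1.0 + 0.01 * 7  = 1.07
```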

Batch Normalisation #

\[ \hat{x}_i=\frac{x_i-\mu_B}{\sqrt{\sigma_B^2+\epsilon}} \] \[ y_i=\gamma\hat{x}_i+\beta \]
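The normalise-then-scale steps can be sketched for a single feature across one mini-batch. This covers only the training-time computation with batch statistics; the function name, the sample values, and the default \( \epsilon \) are assumptions, and \( \sigma_B^2 \) is taken as the population variance of the batch (division by the batch size).

```python
import math

def batch_norm(xs, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalise one feature over a mini-batch, then scale and shift."""
    n = len(xs)
    mu = sum(xs) / n
    var = sum((x - mu) ** 2 for x in xs) / n
    x_hat = [(x - mu) / math.sqrt(var + eps) for x in xs]
    return [gamma * xh + beta for xh in x_hat]

# Assumed mini-batch of four activations
ys = batch_norm([1.0, 2.0, 3.0, 4.0])
mean_y = sum(ys) / len(ys)                             # approximately 0
var_y = sum((v - mean_y) ** 2 for v in ys) / len(ys)   # approximately 1
```

With \( \gamma = 1 \) and \( \beta = 0 \) the output has roughly zero mean and unit variance; learned values of \( \gamma \) and \( \beta \) let the network undo the normalisation where that helps.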

Xavier Initialisation #

\[ W \sim U\left(-\sqrt{\frac{6}{n_{in}+n_{out}}},\sqrt{\frac{6}{n_{in}+n_{out}}}\right) \]

He Initialisation #

\[ W \sim N\left(0,\frac{2}{n_{in}}\right) \]

Here \( \frac{2}{n_{in}} \) is the variance, so the standard deviation is \( \sqrt{2/n_{in}} \).
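Both initialisers can be sketched with the standard `random` module. The function names and layer sizes are hypothetical; the weight matrix is stored with shape (\( n_{in} \), \( n_{out} \)), and the second argument of the normal distribution is treated as a variance, so the sampler receives its square root.

```python
import math
import random

def xavier_uniform(n_in, n_out, rng):
    """Sample W ~ U(-limit, limit) with limit = sqrt(6 / (n_in + n_out))."""
    limit = math.sqrt(6.0 / (n_in + n_out))
    return [[rng.uniform(-limit, limit) for _ in range(n_out)]
            for _ in range(n_in)]

def he_normal(n_in, n_out, rng):
    """Sample W ~ N(0, 2 / n_in), i.e. standard deviation sqrt(2 / n_in)."""
    std = math.sqrt(2.0 / n_in)
    return [[rng.gauss(0.0, std) for _ in range(n_out)]
            for _ in range(n_in)]

rng = random.Random(0)  # fixed seed so the sketch is repeatable
W_xavier = xavier_uniform(300, 300, rng)  # limit = sqrt(6 / 600) = 0.1
W_he = he_normal(200, 50, rng)            # std   = sqrt(2 / 200) = 0.1
```

Xavier initialisation is usually paired with sigmoid or tanh activations and He initialisation with ReLU.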

16. Numerical Checklist ☆ #

CNN Output Size Questions #

Steps:

Identify input size \( N \)
Identify kernel size \( K \)
Identify padding \( P \)
Identify stride \( S \)
Substitute into output formula
Repeat separately for height and width if needed
Output channels equal number of filters

CNN Parameter Count Questions #

Steps:

Identify kernel height and width
Identify input channels
Identify number of filters
Add one bias per filter
Use \( (K_hK_wC_{in}+1)C_{out} \)

Softmax Questions #

Steps:

Exponentiate each logit
Add all exponentials
Divide each exponential by the total
Choose class with highest probability

Optimiser Numerical Questions #

Steps:

Compute gradient
Multiply by learning rate
Subtract from old parameter
For momentum, update velocity first
For Adam, update first and second moments

17. Quick Comparison Table #

Topic	Must remember
Regression	identity activation, squared loss
Binary classification	sigmoid, binary cross-entropy
Multi-class classification	softmax, categorical cross-entropy
DFNN	hidden layers with non-linear activations model non-linear functions
CNN	local connectivity, parameter sharing
Pooling	spatial reduction, usually no parameters
RNN	hidden state stores sequence history
LSTM	gates control memory
GRU	simpler gated RNN
Attention	query-key-value weighted retrieval
Transformer	self-attention plus positional encoding
Optimisers	update parameters to minimise loss
Regularisation	reduce overfitting and improve generalisation

18. Final Memory Lines #

Neuron: weighted sum → activation → prediction
Classification: score → sigmoid or softmax → probability → loss
CNN: image → convolution → feature map → activation → pooling → classifier
RNN: current input plus previous hidden state → new hidden state
Attention: query compares with keys → softmax weights → weighted values
Transformer: embedding plus position → attention → feed-forward → output
Optimiser: gradient tells direction, learning rate controls step size
Regularisation: reduce memorisation, improve generalisation
