DNN Formula and Numerical Sheet #
This page consolidates the most useful deep neural network (DNN) formulas and numerical patterns for revision.
It is designed for preparation and should be used together with the topic pages.
Revision strategy:
Do not only memorise formulas. For each formula, know:
- what each symbol means
- when to apply it
- how to substitute values carefully
- what the output shape or answer represents
1. Artificial Neuron #
Weighted Sum ☆ #
\[ z = \sum_{i=1}^{n} w_i x_i + b \]
Vector form:
\[ z = w^T x + b \]
Activation #
\[ \hat{y}=f(z) \]
2. Regression with Single Neuron #
Linear Prediction ☆ #
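\[ \hat{y}=w^T x \]
Mean Squared Error #
\[ J(w)=\frac{1}{N}\sum_{i=1}^{N}\frac{1}{2}(w^T x^{(i)}-y^{(i)})^2 \]
Gradient of Squared Error #
Here \( X \) stacks the inputs \( x^{(i)} \) as rows and \( y \) stacks the targets \( y^{(i)} \); the factor \( \frac{1}{2} \) cancels on differentiation:
\[ \nabla J(w)=\frac{1}{N}X^T(Xw-y) \]
Gradient Descent Update #
\[ w^{(t+1)}=w^{(t)}-\eta \nabla J(w^{(t)}) \]
3. Binary Classification #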
Sigmoid ☆ #
\[ \sigma(z)=\frac{1}{1+e^{-z}} \]
Sigmoid Derivative #
\[ \sigma'(z)=\sigma(z)(1-\sigma(z)) \]
Prediction Rule #
\[ \hat{y} = \begin{cases}1, & \sigma(z) \ge 0.5 \\ 0, & \sigma(z) < 0.5\end{cases} \]
Binary Cross-Entropy #
\[ \ell(\hat{y},y)= -y\log(\hat{y})-(1-y)\log(1-\hat{y}) \]
4. Multi-Class Classification #
Softmax ☆ #
\[ \hat{y}_k = \frac{e^{z_k}}{\sum_{j=1}^{K}e^{z_j}} \]
Categorical Cross-Entropy #
\[ \ell(\hat{y},y)= -\sum_{k=1}^{K} y_k \log(\hat{y}_k) \]
Useful Shortcut #
For softmax with cross-entropy:
\[ \frac{\partial \ell}{\partial z_k}=\hat{y}_k-y_k \]
5. DFNN / MLP Shapes #
Layer-wise Computation ☆ #
For layer \( l \) :
\[ z^{(l)} = h^{(l-1)}W^{(l)} + b^{(l)} \]
\[ h^{(l)} = \sigma^{(l)}(z^{(l)}) \]
Parameter Count for Dense Layer ☆ #
If input units are \( n_{in} \) and output units are \( n_{out} \) :
\[ \text{Parameters}=n_{in}n_{out}+n_{out} \]
6. CNN Output Size #
Convolution Output Size ☆ #
For input size \( N \) , kernel size \( K \) , padding \( P \) , and stride \( S \) :
\[ O = \left\lfloor \frac{N-K+2P}{S} \right\rfloor + 1 \]
For height and width separately:
\[ H_{out}=\left\lfloor \frac{H-K_h+2P_h}{S_h} \right\rfloor + 1 \]
\[ W_{out}=\left\lfloor \frac{W-K_w+2P_w}{S_w} \right\rfloor + 1 \]
Same Padding for Odd Kernel #
For stride 1 and odd kernel size:
\[ P = \frac{K-1}{2} \]
7. CNN Parameter Count #
Convolution Layer Parameters ☆ #
For kernel size \( K_h \times K_w \) , input channels \( C_{in} \) , and output channels \( C_{out} \) :
\[ \text{Parameters}=(K_hK_wC_{in}+1)C_{out} \]
The +1 represents one bias per filter.
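As a quick worked check of the formula, take illustrative values that are not from the original sheet: a \( 5 \times 5 \) kernel, \( C_{in}=3 \) input channels, and \( C_{out}=16 \) filters.
\[ \text{Parameters}=(5 \cdot 5 \cdot 3+1)\cdot 16 = 76 \cdot 16 = 1216 \]
Of these, \( 75 \cdot 16 = 1200 \) are weights and 16 are biases.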
1 by 1 Convolution Parameters #
\[ \text{Parameters}=(1 \times 1 \times C_{in}+1)C_{out} \]
8. Pooling Output Size #
Pooling uses the same spatial size formula as convolution.
\[ O = \left\lfloor \frac{N-K+2P}{S} \right\rfloor + 1 \]
Pooling usually has no learnable parameters.
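Two worked substitutions may help; the numbers are illustrative assumptions, not taken from the original sheet. For a \( 2 \times 2 \) pool with stride 2 and no padding on a size-28 input:
\[ O = \left\lfloor \frac{28-2+0}{2} \right\rfloor + 1 = 13 + 1 = 14 \]
The floor matters when the window does not tile the input exactly. The same pool on a size-7 input gives:
\[ O = \left\lfloor \frac{7-2+0}{2} \right\rfloor + 1 = \lfloor 2.5 \rfloor + 1 = 3 \]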
9. Dilated Convolution #
Effective Kernel Size ☆ #
For kernel size \( K \) and dilation \( D \) :
\[ K_{eff}=K+(K-1)(D-1) \]
Then use \( K_{eff} \) in the output-size formula.
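The two-step calculation (effective kernel first, then output size) can be sketched in a few lines of Python. The function names `effective_kernel` and `output_size` and the example numbers are illustrative assumptions, not part of the original sheet.

```python
def effective_kernel(k: int, d: int) -> int:
    """Effective kernel size: K_eff = K + (K - 1)(D - 1)."""
    return k + (k - 1) * (d - 1)


def output_size(n: int, k: int, p: int = 0, s: int = 1) -> int:
    """Spatial output size: floor((N - K + 2P) / S) + 1."""
    return (n - k + 2 * p) // s + 1


# Illustrative case: 3-wide kernel, dilation 2, input size 32, no padding, stride 1.
k_eff = effective_kernel(k=3, d=2)           # 3 + 2 * 1 = 5
out = output_size(n=32, k=k_eff, p=0, s=1)   # (32 - 5) // 1 + 1 = 28
print(k_eff, out)
```

With dilation 1 the effective kernel equals the original kernel, so the same helper also covers ordinary convolution.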
10. RNN #
Hidden State Update ☆ #
\[ h_t=\phi(W_{hh}h_{t-1}+W_{xh}x_t+b_h) \]
Output #
\[ o_t=W_{ho}h_t+b_o \]
11. LSTM / GRU Concepts #
For LSTM, remember:
| Gate | Purpose |
|---|---|
| Forget gate | what to remove from memory |
| Input gate | what new information to store |
| Candidate memory | proposed new content |
| Output gate | what to expose as hidden state |
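To connect the gate names to a calculation, here is a minimal sketch of one LSTM step for a single hidden unit and a single scalar input. It assumes the standard LSTM formulation, which this sheet does not write out; the function `lstm_step` and the parameter layout (a dict mapping each gate name to `(w_x, w_h, b)`) are hypothetical choices for illustration.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(x, h_prev, c_prev, p):
    """One scalar LSTM step. p maps gate name -> (w_x, w_h, b) (assumed layout)."""
    def pre(name):
        w_x, w_h, b = p[name]
        return w_x * x + w_h * h_prev + b

    f = sigmoid(pre("forget"))       # what to remove from memory
    i = sigmoid(pre("input"))        # what new information to store
    g = math.tanh(pre("candidate"))  # proposed new content
    o = sigmoid(pre("output"))       # what to expose as hidden state
    c = f * c_prev + i * g           # updated cell memory
    h = o * math.tanh(c)             # new hidden state
    return h, c


# Made-up parameters purely for demonstration.
params = {
    "forget": (0.5, 0.1, 0.0),
    "input": (0.4, 0.2, 0.0),
    "candidate": (0.3, 0.3, 0.0),
    "output": (0.2, 0.4, 0.0),
}
h, c = lstm_step(x=1.0, h_prev=0.0, c_prev=0.0, p=params)
print(h, c)
```

A full LSTM layer replaces the scalars with weight matrices and vectors, but the order of operations is the same.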
For GRU, remember:
| Gate | Purpose |
|---|---|
| Reset gate | how much past to forget while forming candidate |
| Update gate | how much old hidden state to keep |
| Candidate hidden state | proposed new hidden representation |
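The GRU table can be turned into a similar scalar sketch. It assumes the convention in which the update gate weights the old hidden state, matching "how much old hidden state to keep" in the table; some references swap the roles of \( z \) and \( 1-z \). The function `gru_step` and the parameter layout are hypothetical.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def gru_step(x, h_prev, p):
    """One scalar GRU step. p maps gate name -> (w_x, w_h, b) (assumed layout)."""
    def pre(name, h):
        w_x, w_h, b = p[name]
        return w_x * x + w_h * h + b

    r = sigmoid(pre("reset", h_prev))                  # how much past to use in the candidate
    z = sigmoid(pre("update", h_prev))                 # how much old hidden state to keep
    h_tilde = math.tanh(pre("candidate", r * h_prev))  # proposed new hidden representation
    return z * h_prev + (1.0 - z) * h_tilde


# Made-up parameters purely for demonstration.
params = {
    "reset": (0.5, 0.1, 0.0),
    "update": (0.4, 0.2, 0.0),
    "candidate": (0.3, 0.3, 0.0),
}
h = gru_step(x=1.0, h_prev=0.5, p=params)
print(h)
```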
12. Attention #
Attention Weighted Sum ☆ #
\[ \text{Attention}(q,K,V)=\sum_{i=1}^{n}\alpha_i v_i \]
Attention Weights #
\[ \alpha_i = \frac{\exp(\text{score}(q,k_i))}{\sum_{j=1}^{n}\exp(\text{score}(q,k_j))} \]
Scaled Dot-Product Attention ☆ #
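A minimal pure-Python sketch of scaled dot-product attention: each query is scored against every key with a dot product divided by \( \sqrt{d_k} \), the scores pass through softmax, and the resulting weights average the values. Matrices are plain lists of rows; the function names and the example numbers are illustrative assumptions.

```python
import math


def softmax(row):
    m = max(row)  # shift by the max for numerical stability; the result is unchanged
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(Q, K, V):
    """Q: n_q x d_k, K: n x d_k, V: n x d_v, all as lists of rows."""
    d_k = len(K[0])
    outputs = []
    for q in Q:
        # score(q, k_i) = (q . k_i) / sqrt(d_k)
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)  # attention weights alpha_i
        # weighted sum of value rows
        outputs.append([sum(w * v[j] for w, v in zip(weights, V))
                        for j in range(len(V[0]))])
    return outputs


Q = [[1.0, 0.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[10.0, 0.0], [0.0, 10.0]]
print(scaled_dot_product_attention(Q, K, V))
```

The query aligns with the first key, so the output leans toward the first value row.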
\[ \text{Attention}(Q,K,V)=\text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V \]
13. Transformer #
Positional Encoding ☆ #
\[ PE(pos,2i)=\sin\left(\frac{pos}{10000^{2i/d_{model}}}\right) \]
\[ PE(pos,2i+1)=\cos\left(\frac{pos}{10000^{2i/d_{model}}}\right) \]
Transformer Input #
\[ X = \text{Embedding}(tokens) + PE(positions) \]
Position-wise Feed Forward Network #
\[ FFN(x)=\max(0,xW_1+b_1)W_2+b_2 \]
Add and Norm #
\[ \text{Output}=\text{LayerNorm}(x+\text{Sublayer}(x)) \]
14. Optimisers #
Gradient Descent ☆ #
\[ \theta_{t+1}=\theta_t-\eta \nabla_{\theta}\mathcal{L}(\theta_t) \]
Mini-Batch Gradient #
\[ g=\frac{1}{B}\sum_{i \in B}\nabla_{\theta}\ell_i(\theta) \]
Momentum #
\[ v_t=\beta v_{t-1}+g_t \]
\[ \theta_{t+1}=\theta_t-\eta v_t \]
Adam #
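\[ m_t=\beta_1m_{t-1}+(1-\beta_1)g_t \]
\[ v_t=\beta_2v_{t-1}+(1-\beta_2)g_t^2 \]
Bias-corrected moments:
\[ \hat{m}_t=\frac{m_t}{1-\beta_1^t}, \qquad \hat{v}_t=\frac{v_t}{1-\beta_2^t} \]
\[ \theta_{t+1}=\theta_t-\eta\frac{\hat{m}_t}{\sqrt{\hat{v}_t}+\epsilon} \]
15. Regularisation #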
L2 / Weight Decay ☆ #
\[ J_{regularised}(\theta)=J(\theta)+\lambda\|\theta\|_2^2 \]
L1 #
\[ J_{regularised}(\theta)=J(\theta)+\lambda\|\theta\|_1 \]
Batch Normalisation #
\[ \hat{x}_i=\frac{x_i-\mu_B}{\sqrt{\sigma_B^2+\epsilon}} \]
\[ y_i=\gamma\hat{x}_i+\beta \]
Xavier Initialisation #
\[ W \sim U\left(-\sqrt{\frac{6}{n_{in}+n_{out}}},\sqrt{\frac{6}{n_{in}+n_{out}}}\right) \]
He Initialisation #
Here the second argument is the variance, so the standard deviation is \( \sqrt{2/n_{in}} \):
\[ W \sim N\left(0,\frac{2}{n_{in}}\right) \]
16. Numerical Checklist ☆ #
CNN Output Size Questions #
Steps:
- Identify input size \( N \)
- Identify kernel size \( K \)
- Identify padding \( P \)
- Identify stride \( S \)
- Substitute into output formula
- Repeat separately for height and width if needed
- Output channels equal number of filters
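The steps above can be sketched as a small helper that applies the output-size formula to height and width separately and appends the number of filters as the channel count. The function name `conv_output_shape`, its argument layout, and the example layers are illustrative assumptions.

```python
def conv_output_shape(h, w, kernel, padding=(0, 0), stride=(1, 1), num_filters=1):
    """Return (H_out, W_out, C_out) for a convolution layer."""
    k_h, k_w = kernel
    p_h, p_w = padding
    s_h, s_w = stride
    h_out = (h - k_h + 2 * p_h) // s_h + 1  # floor((H - K_h + 2P_h) / S_h) + 1
    w_out = (w - k_w + 2 * p_w) // s_w + 1  # floor((W - K_w + 2P_w) / S_w) + 1
    return h_out, w_out, num_filters        # output channels = number of filters


# 32x32 input, 5x5 kernel, no padding, stride 1, 6 filters.
print(conv_output_shape(32, 32, kernel=(5, 5), num_filters=6))  # (28, 28, 6)

# 224x224 input, 7x7 kernel, padding 3, stride 2, 64 filters.
print(conv_output_shape(224, 224, kernel=(7, 7), padding=(3, 3),
                        stride=(2, 2), num_filters=64))         # (112, 112, 64)
```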
CNN Parameter Count Questions #
Steps:
- Identify kernel height and width
- Identify input channels
- Identify number of filters
- Add one bias per filter
- Use \( (K_hK_wC_{in}+1)C_{out} \)
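A short sketch of the counting steps, with the dense-layer count from the MLP section alongside for comparison. The function names and the example layer sizes are illustrative assumptions.

```python
def conv_params(k_h: int, k_w: int, c_in: int, c_out: int) -> int:
    """(K_h * K_w * C_in + 1) * C_out, with one bias per filter."""
    return (k_h * k_w * c_in + 1) * c_out


def dense_params(n_in: int, n_out: int) -> int:
    """n_in * n_out weights plus n_out biases."""
    return n_in * n_out + n_out


print(conv_params(3, 3, 3, 64))    # (27 + 1) * 64 = 1792
print(conv_params(1, 1, 64, 32))   # (64 + 1) * 32 = 2080, a 1 by 1 convolution
print(dense_params(128, 10))       # 1280 + 10 = 1290
```

Note that the convolution count does not depend on the input height or width, which is the effect of parameter sharing.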
Softmax Questions #
Steps:
- Exponentiate each logit
- Add all exponentials
- Divide each exponential by the total
- Choose class with highest probability
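These four steps translate directly into Python. The sketch subtracts the largest logit before exponentiating; this extra step is added for numerical stability and does not change the probabilities. The logits are made-up example values.

```python
import math


def softmax(logits):
    m = max(logits)                            # stability shift (extra step, result unchanged)
    exps = [math.exp(z - m) for z in logits]   # step 1: exponentiate each logit
    total = sum(exps)                          # step 2: add all exponentials
    return [e / total for e in exps]           # step 3: divide each by the total


logits = [2.0, 1.0, 0.1]
probs = softmax(logits)
predicted = max(range(len(probs)), key=probs.__getitem__)  # step 4: highest probability
print([round(p, 3) for p in probs], predicted)  # [0.659, 0.242, 0.099] 0
```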
Optimiser Numerical Questions #
Steps:
- Compute gradient
- Multiply by learning rate
- Subtract from old parameter
- For momentum, update velocity first
- For Adam, update first and second moments
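The checklist can be sketched for a single scalar parameter as follows. The Adam step assumes the standard bias-corrected moments \( \hat{m}_t = m_t/(1-\beta_1^t) \) and \( \hat{v}_t = v_t/(1-\beta_2^t) \); the function names, default hyperparameters, and numbers are illustrative assumptions.

```python
import math


def sgd_step(theta, grad, lr):
    return theta - lr * grad


def momentum_step(theta, v, grad, lr, beta=0.9):
    v = beta * v + grad             # update velocity first
    return theta - lr * v, v


def adam_step(theta, m, v, grad, t, lr, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * grad         # first moment
    v = beta2 * v + (1 - beta2) * grad ** 2    # second moment
    m_hat = m / (1 - beta1 ** t)               # bias correction (assumed standard form)
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v


theta, grad, lr = 1.0, 0.5, 0.1
theta_sgd = sgd_step(theta, grad, lr)                          # 1.0 - 0.1 * 0.5
theta_mom, vel = momentum_step(theta, 0.0, grad, lr)           # velocity becomes 0.5
theta_adam, m1, v1 = adam_step(theta, 0.0, 0.0, grad, t=1, lr=lr)
print(theta_sgd, theta_mom, theta_adam)
```

On the first step Adam moves by roughly the full learning rate regardless of the gradient's magnitude, because the bias-corrected ratio \( \hat{m}_1/\sqrt{\hat{v}_1} \) is close to the sign of the gradient.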
17. Quick Comparison Table #
| Topic | Must remember |
|---|---|
| Regression | identity activation, squared loss |
| Binary classification | sigmoid, binary cross-entropy |
| Multi-class classification | softmax, categorical cross-entropy |
| DFNN | hidden layers with non-linear activations create non-linearity |
| CNN | local connectivity, parameter sharing |
| Pooling | spatial reduction, usually no parameters |
| RNN | hidden state stores sequence history |
| LSTM | gates control memory |
| GRU | simpler gated RNN |
| Attention | query-key-value weighted retrieval |
| Transformer | self-attention plus positional encoding |
| Optimisers | update parameters to minimise loss |
| Regularisation | reduce overfitting and improve generalisation |
18. Final Memory Lines #
Neuron: weighted sum → activation → prediction
Classification: score → sigmoid or softmax → probability → loss
CNN: image → convolution → feature map → activation → pooling → classifier
RNN: current input plus previous hidden state → new hidden state
Attention: query compares with keys → softmax weights → weighted values
Transformer: embedding plus position → attention → feed-forward → output
Optimiser: gradient tells direction, learning rate controls step size
Regularisation: reduce memorisation, improve generalisation