DNN Formula and Numerical Sheet

This page consolidates the most useful deep neural network (DNN) formulas and numerical patterns for revision.

It is intended as a quick reference and should be used together with the topic pages.

Revision strategy:
Do not just memorise the formulas.

For each formula, know:

  1. what each symbol means
  2. when to apply it
  3. how to substitute values carefully
  4. what the output shape or answer represents

1. Artificial Neuron #

Weighted Sum ☆ #

\[ z = \sum_{i=1}^{n} w_i x_i + b \]

Vector form:

\[ z = w^T x + b \]

Activation #

\[ \hat{y}=f(z) \]
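As a quick check of the two formulas above, here is a minimal sketch of a single neuron in plain Python. The function name `neuron`, the example inputs, and the choice of sigmoid as \( f \) are illustrative assumptions, not part of the original sheet.

```python
import math

def sigmoid(z):
    """Logistic function, used here as one possible choice of f."""
    return 1.0 / (1.0 + math.exp(-z))

def neuron(x, w, b, activation=lambda z: z):
    """Return (z, y_hat): the weighted sum z = w.x + b and the activation f(z)."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return z, activation(z)

# Hypothetical values: two inputs, two weights, one bias.
z, y_hat = neuron(x=[1.0, 2.0], w=[0.5, -0.25], b=0.1, activation=sigmoid)
print(z)      # 0.1  (0.5*1 - 0.25*2 + 0.1)
print(y_hat)  # about 0.525
```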

2. Regression with Single Neuron #

Linear Prediction ☆ #

\[ \hat{y}=w^T x \]

Mean Squared Error #

\[ J(w)=\frac{1}{N}\sum_{i=1}^{N}\frac{1}{2}(w^T x^{(i)}-y^{(i)})^2 \]

Gradient of Squared Error #

\[ \nabla J(w)=\frac{1}{N}X^T(Xw-y) \]

Gradient Descent Update #

\[ w^{(t+1)}=w^{(t)}-\eta \nabla J(w^{(t)}) \]
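The gradient and update formulas above can be traced by hand on a tiny dataset. The sketch below assumes a hypothetical design matrix with a leading column of ones (so the bias is folded into \( w \)), targets following \( y = 1 + x \), and a learning rate of 0.1; none of these values come from the original sheet.

```python
def gradient(X, y, w):
    """Compute (1/N) * X^T (Xw - y) using plain lists."""
    n = len(X)
    residuals = [sum(wj * xj for wj, xj in zip(w, row)) - yi
                 for row, yi in zip(X, y)]
    return [sum(r * row[j] for r, row in zip(residuals, X)) / n
            for j in range(len(w))]

def gd_step(X, y, w, eta):
    """One gradient descent update: w <- w - eta * grad J(w)."""
    g = gradient(X, y, w)
    return [wj - eta * gj for wj, gj in zip(w, g)]

# Hypothetical data: first column is the constant 1 for the bias term.
X = [[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [2.0, 3.0, 4.0]
w0 = [0.0, 0.0]

g0 = gradient(X, y, w0)        # residuals are [-2, -3, -4], so g0 = [-3, -20/3]
w1 = gd_step(X, y, w0, eta=0.1)
print(g0, w1)                  # w1 is roughly [0.3, 0.667]
```

Repeating `gd_step` many times should move `w` towards `[1, 1]`, the weights that generated this toy data.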

3. Binary Classification #

Sigmoid ☆ #

\[ \sigma(z)=\frac{1}{1+e^{-z}} \]

Sigmoid Derivative #

\[ \sigma'(z)=\sigma(z)(1-\sigma(z)) \]

Prediction Rule #

\[ \hat{y} = \begin{cases}1, & \sigma(z) \ge 0.5 \\ 0, & \sigma(z) < 0.5\end{cases} \]

Binary Cross-Entropy #

\[ \ell(\hat{y},y)= -y\log(\hat{y})-(1-y)\log(1-\hat{y}) \]
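The four binary-classification formulas above fit in a few lines of Python. This is a minimal sketch; the function names and the test value \( z = 0 \) are illustrative assumptions.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_derivative(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def predict(z, threshold=0.5):
    """Class 1 if sigma(z) >= threshold, otherwise class 0."""
    return 1 if sigmoid(z) >= threshold else 0

def binary_cross_entropy(y_hat, y):
    """Loss for one example; y is 0 or 1 and 0 < y_hat < 1."""
    return -y * math.log(y_hat) - (1 - y) * math.log(1.0 - y_hat)

p = sigmoid(0.0)                      # 0.5
d = sigmoid_derivative(0.0)           # 0.25, the largest value the derivative takes
label = predict(0.0)                  # 1, because 0.5 >= 0.5
loss = binary_cross_entropy(p, 1)     # ln 2, about 0.693
print(p, d, label, loss)
```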

4. Multi-Class Classification #

Softmax ☆ #

\[ \hat{y}_k = \frac{e^{z_k}}{\sum_{j=1}^{K}e^{z_j}} \]

Categorical Cross-Entropy #

\[ \ell(\hat{y},y)= -\sum_{k=1}^{K} y_k \log(\hat{y}_k) \]

Useful Shortcut #

For softmax with cross-entropy:

\[ \frac{\partial \ell}{\partial z_k}=\hat{y}_k-y_k \]
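A short numerical example helps confirm the softmax, cross-entropy, and gradient shortcut together. The sketch below uses hypothetical logits `[2.0, 1.0, 0.1]` and a one-hot target for class 0. Subtracting the maximum logit before exponentiating is a common numerical-stability step that is assumed here; it does not change the result.

```python
import math

def softmax(logits):
    m = max(logits)                          # stability shift, cancels in the ratio
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(y_hat, y):
    """Categorical cross-entropy for a one-hot (or soft) target y."""
    return -sum(yk * math.log(pk) for yk, pk in zip(y, y_hat) if yk > 0)

logits = [2.0, 1.0, 0.1]                     # hypothetical scores
y = [1.0, 0.0, 0.0]                          # true class is index 0

probs = softmax(logits)                      # roughly [0.659, 0.242, 0.099]
loss = cross_entropy(probs, y)               # roughly 0.417
grad = [pk - yk for pk, yk in zip(probs, y)] # shortcut: y_hat - y
print(probs, loss, grad)
```

The gradient entries always sum to zero, because both the probabilities and the one-hot target sum to one.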

5. DFNN / MLP Shapes #

Layer-wise Computation ☆ #

For layer \( l \) :

\[ z^{(l)} = h^{(l-1)}W^{(l)} + b^{(l)} \] \[ h^{(l)} = \sigma^{(l)}(z^{(l)}) \]

Parameter Count for Dense Layer ☆ #

If input units are \( n_{in} \) and output units are \( n_{out} \) :

\[ \text{Parameters}=n_{in}n_{out}+n_{out} \]
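The dense-layer count can be applied layer by layer to a whole MLP. The sketch below assumes a hypothetical 784-128-10 network; the layer sizes are an illustration only.

```python
def dense_params(n_in, n_out):
    """Weights (n_in * n_out) plus one bias per output unit."""
    return n_in * n_out + n_out

def mlp_params(layer_sizes):
    """Total parameters for consecutive dense layers, e.g. [784, 128, 10]."""
    return sum(dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:]))

print(dense_params(784, 128))      # 100480
print(dense_params(128, 10))       # 1290
print(mlp_params([784, 128, 10]))  # 101770
```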

6. CNN Output Size #

Convolution Output Size ☆ #

For input size \( N \) , kernel size \( K \) , padding \( P \) , and stride \( S \) :

\[ O = \left\lfloor \frac{N-K+2P}{S} \right\rfloor + 1 \]

For height and width separately:

\[ H_{out}=\left\lfloor \frac{H-K_h+2P_h}{S_h} \right\rfloor + 1 \] \[ W_{out}=\left\lfloor \frac{W-K_w+2P_w}{S_w} \right\rfloor + 1 \]

Same Padding for Odd Kernel #

For stride 1 and odd kernel size:

\[ P = \frac{K-1}{2} \]
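The output-size and same-padding formulas are easy to get wrong under time pressure, so a small helper is useful for checking answers. This is a sketch with hypothetical input sizes; Python's `//` performs the floor for the non-negative values used here.

```python
def conv_output_size(n, k, p=0, s=1):
    """floor((N - K + 2P) / S) + 1 for one spatial dimension."""
    return (n - k + 2 * p) // s + 1

def same_padding(k):
    """Padding that preserves size for stride 1 and an odd kernel."""
    assert k % 2 == 1, "formula assumes an odd kernel size"
    return (k - 1) // 2

print(conv_output_size(32, 5))                      # 28  (no padding, stride 1)
print(conv_output_size(32, 3, p=same_padding(3)))   # 32  (same padding)
print(conv_output_size(224, 7, p=3, s=2))           # 112 (floor of 223/2 is 111, plus 1)
```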

7. CNN Parameter Count #

Convolution Layer Parameters ☆ #

For kernel size \( K_h \times K_w \) , input channels \( C_{in} \) , and output channels \( C_{out} \) :

\[ \text{Parameters}=(K_hK_wC_{in}+1)C_{out} \]

The +1 represents one bias per filter.

1 by 1 Convolution Parameters #

\[ \text{Parameters}=(1 \times 1 \times C_{in}+1)C_{out} \]
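Both parameter formulas above are the same expression with different kernel sizes, as this minimal sketch shows. The channel counts are hypothetical examples.

```python
def conv_params(k_h, k_w, c_in, c_out):
    """(K_h * K_w * C_in + 1) * C_out, with one bias per filter."""
    return (k_h * k_w * c_in + 1) * c_out

print(conv_params(3, 3, 3, 64))    # 1792: (27 + 1) * 64
print(conv_params(1, 1, 64, 32))   # 2080: (64 + 1) * 32, a 1 by 1 convolution
```

Note that the count does not depend on the input height or width, only on the kernel size and channel counts.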

8. Pooling Output Size #

Pooling uses the same spatial size formula as convolution.

\[ O = \left\lfloor \frac{N-K+2P}{S} \right\rfloor + 1 \]

Pooling usually has no learnable parameters.
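Because pooling shares the convolution size formula, a whole stack of layers can be traced with one helper. The sketch below assumes a hypothetical LeNet-style stack on a 32 by 32 input (5 by 5 convolutions without padding, 2 by 2 pooling with stride 2).

```python
def out_size(n, k, p=0, s=1):
    """Spatial output size for either a convolution or a pooling layer."""
    return (n - k + 2 * p) // s + 1

# (name, kernel, padding, stride) for each layer; values are illustrative.
layers = [("conv5", 5, 0, 1), ("pool2", 2, 0, 2),
          ("conv5", 5, 0, 1), ("pool2", 2, 0, 2)]

size = 32
trace = [size]
for name, k, p, s in layers:
    size = out_size(size, k, p, s)
    trace.append(size)

print(trace)  # [32, 28, 14, 10, 5]
```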


9. Dilated Convolution #

Effective Kernel Size ☆ #

For kernel size \( K \) and dilation \( D \) :

\[ K_{eff}=K+(K-1)(D-1) \]

Then use \( K_{eff} \) in the output-size formula.
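The two-step procedure (effective kernel first, then output size) can be sketched as follows. The input size of 32 and the dilation values are hypothetical.

```python
def effective_kernel(k, d):
    """K_eff = K + (K - 1)(D - 1); D = 1 means no dilation."""
    return k + (k - 1) * (d - 1)

def dilated_output_size(n, k, d, p=0, s=1):
    """Substitute K_eff into the standard output-size formula."""
    return (n - effective_kernel(k, d) + 2 * p) // s + 1

print(effective_kernel(3, 1))          # 3
print(effective_kernel(3, 2))          # 5
print(effective_kernel(3, 4))          # 9
print(dilated_output_size(32, 3, 2))   # 28, same as an undilated 5x5 kernel
```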


10. RNN #

Hidden State Update ☆ #

\[ h_t=\phi(W_{hh}h_{t-1}+W_{xh}x_t+b_h) \]

Output #

\[ o_t=W_{ho}h_t+b_o \]
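One time step of the recurrence can be written with plain lists. This sketch assumes \( \phi = \tanh \), a hidden size of 2, an input size of 1, and small hand-picked weights; all of these are illustrative choices rather than values from the sheet.

```python
import math

def matvec(M, v):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def rnn_step(h_prev, x_t, W_hh, W_xh, b_h):
    """h_t = tanh(W_hh h_{t-1} + W_xh x_t + b_h)."""
    pre = [a + b + c for a, b, c in
           zip(matvec(W_hh, h_prev), matvec(W_xh, x_t), b_h)]
    return [math.tanh(p) for p in pre]

def rnn_output(h_t, W_ho, b_o):
    """o_t = W_ho h_t + b_o."""
    return [a + b for a, b in zip(matvec(W_ho, h_t), b_o)]

# Hypothetical parameters.
W_hh = [[0.5, 0.0], [0.0, 0.5]]
W_xh = [[1.0], [-1.0]]
b_h = [0.0, 0.0]
W_ho = [[1.0, 1.0]]
b_o = [0.0]

h1 = rnn_step([0.0, 0.0], [1.0], W_hh, W_xh, b_h)   # [tanh(1), tanh(-1)]
h2 = rnn_step(h1, [1.0], W_hh, W_xh, b_h)           # reuses h1 as the history
o1 = rnn_output(h1, W_ho, b_o)
print(h1, h2, o1)
```

The same weights are reused at every time step; only the hidden state changes.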

11. LSTM / GRU Concepts #

For LSTM, remember:

| Gate | Purpose |
| --- | --- |
| Forget gate | what to remove from memory |
| Input gate | what new information to store |
| Candidate memory | proposed new content |
| Output gate | what to expose as hidden state |

For GRU, remember:

| Gate | Purpose |
| --- | --- |
| Reset gate | how much past to forget while forming candidate |
| Update gate | how much old hidden state to keep |
| Candidate hidden state | proposed new hidden representation |

12. Attention #

Attention Weighted Sum ☆ #

\[ \text{Attention}(q,K,V)=\sum_{i=1}^{n}\alpha_i v_i \]

Attention Weights #

\[ \alpha_i = \frac{\exp(\text{score}(q,k_i))}{\sum_{j=1}^{n}\exp(\text{score}(q,k_j))} \]

Scaled Dot-Product Attention ☆ #

\[ \text{Attention}(Q,K,V)=\text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V \]
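The matrix formula above can be unpacked row by row: each query is scored against every key, the scores are scaled and passed through softmax, and the weights mix the value vectors. The sketch below is a plain-Python illustration with hypothetical 2-dimensional queries, keys, and values.

```python
import math

def softmax(row):
    m = max(row)
    exps = [math.exp(s - m) for s in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Return (outputs, weights) for softmax(Q K^T / sqrt(d_k)) V."""
    d_k = len(K[0])
    outputs, all_weights = [], []
    for q in Q:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in K]
        weights = softmax(scores)
        all_weights.append(weights)
        outputs.append([sum(w * v[j] for w, v in zip(weights, V))
                        for j in range(len(V[0]))])
    return outputs, all_weights

Q = [[1.0, 0.0]]                    # one query
K = [[1.0, 0.0], [0.0, 1.0]]        # two keys
V = [[10.0, 0.0], [0.0, 10.0]]      # two values

out, weights = attention(Q, K, V)
print(weights)  # roughly [[0.67, 0.33]]: the query matches the first key better
print(out)      # roughly [[6.7, 3.3]]
```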

13. Transformer #

Positional Encoding ☆ #

\[ PE(pos,2i)=\sin\left(\frac{pos}{10000^{2i/d_{model}}}\right) \] \[ PE(pos,2i+1)=\cos\left(\frac{pos}{10000^{2i/d_{model}}}\right) \]
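For numerical questions it helps to see how the even and odd dimensions pair up. This minimal sketch computes the encoding for one position and assumes an even \( d_{model} \); the value \( d_{model} = 4 \) is a hypothetical small example.

```python
import math

def positional_encoding(pos, d_model):
    """Sinusoidal encoding for one position; dims 2i use sin, dims 2i+1 use cos."""
    pe = []
    for i in range(d_model // 2):
        angle = pos / (10000 ** (2 * i / d_model))
        pe.extend([math.sin(angle), math.cos(angle)])
    return pe

print(positional_encoding(0, 4))  # [0.0, 1.0, 0.0, 1.0]
print(positional_encoding(1, 4))  # roughly [0.841, 0.540, 0.010, 1.000]
```

For position 1 with \( d_{model} = 4 \), the second pair uses the divisor \( 10000^{2/4} = 100 \), so its angle is 0.01.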

Transformer Input #

\[ X = \text{Embedding}(tokens) + PE(positions) \]

Position-wise Feed Forward Network #

\[ FFN(x)=\max(0,xW_1+b_1)W_2+b_2 \]

Add and Norm #

\[ \text{Output}=\text{LayerNorm}(x+\text{Sublayer}(x)) \]

14. Optimisers #

Gradient Descent ☆ #

\[ \theta_{t+1}=\theta_t-\eta \nabla_{\theta}\mathcal{L}(\theta_t) \]

Mini-Batch Gradient #

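\[ g=\frac{1}{|\mathcal{B}|}\sum_{i \in \mathcal{B}}\nabla_{\theta}\ell_i(\theta) \]

Here \( \mathcal{B} \) is the set of example indices in the mini-batch and \( |\mathcal{B}| \) is the batch size.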

Momentum #

\[ v_t=\beta v_{t-1}+g_t \] \[ \theta_{t+1}=\theta_t-\eta v_t \]
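Two momentum steps worked by hand show why the velocity must be updated before the parameter. The sketch assumes a hypothetical scalar loss \( \mathcal{L}(\theta)=\theta^2/2 \) (so the gradient is simply \( \theta \)), with \( \beta = 0.9 \), \( \eta = 0.1 \), and zero initial velocity.

```python
def momentum_step(theta, v, grad, eta=0.1, beta=0.9):
    """v_t = beta * v_{t-1} + g_t, then theta <- theta - eta * v_t."""
    v_new = beta * v + grad
    theta_new = theta - eta * v_new
    return theta_new, v_new

def grad_fn(theta):
    """Gradient of the illustrative loss theta^2 / 2."""
    return theta

theta, v = 1.0, 0.0
history = []
for _ in range(2):
    theta, v = momentum_step(theta, v, grad_fn(theta))
    history.append((theta, v))

# Step 1: v = 1.0,              theta = 1.0 - 0.1 * 1.0 = 0.9
# Step 2: v = 0.9 * 1.0 + 0.9,  theta = 0.9 - 0.1 * 1.8 = 0.72
print(history)
```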

Adam #

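\[ m_t=\beta_1m_{t-1}+(1-\beta_1)g_t \] \[ v_t=\beta_2v_{t-1}+(1-\beta_2)g_t^2 \]

Bias-corrected moments:

\[ \hat{m}_t=\frac{m_t}{1-\beta_1^t} \] \[ \hat{v}_t=\frac{v_t}{1-\beta_2^t} \]

Parameter update:

\[ \theta_{t+1}=\theta_t-\eta\frac{\hat{m}_t}{\sqrt{\hat{v}_t}+\epsilon} \]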
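A single Adam step on a scalar makes the role of each quantity concrete. The sketch assumes the standard bias correction \( \hat{m}_t = m_t/(1-\beta_1^t) \) and \( \hat{v}_t = v_t/(1-\beta_2^t) \), the commonly used defaults \( \beta_1 = 0.9 \), \( \beta_2 = 0.999 \), \( \epsilon = 10^{-8} \), and a hypothetical gradient of 1.0 at \( \theta = 1.0 \).

```python
import math

def adam_step(theta, m, v, grad, t, eta=0.001,
              beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter; t is the 1-based step count."""
    m = beta1 * m + (1 - beta1) * grad          # first moment
    v = beta2 * v + (1 - beta2) * grad ** 2     # second moment
    m_hat = m / (1 - beta1 ** t)                # bias correction
    v_hat = v / (1 - beta2 ** t)
    theta = theta - eta * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v

theta, m, v = adam_step(theta=1.0, m=0.0, v=0.0, grad=1.0, t=1)
# m is about 0.1 and v about 0.001, but after correction m_hat = v_hat = 1,
# so the first step has size close to eta regardless of the gradient scale.
print(theta, m, v)   # theta is about 0.999
```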

15. Regularisation #

L2 / Weight Decay ☆ #

\[ J_{regularised}(\theta)=J(\theta)+\lambda\|\theta\|_2^2 \]

L1 #

\[ J_{regularised}(\theta)=J(\theta)+\lambda\|\theta\|_1 \]

Batch Normalisation #

\[ \hat{x}_i=\frac{x_i-\mu_B}{\sqrt{\sigma_B^2+\epsilon}} \] \[ y_i=\gamma\hat{x}_i+\beta \]
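The normalisation can be verified on a small batch of scalars. The sketch below assumes the batch variance is the biased (divide by batch size) estimate, which is the usual training-time convention, and uses hypothetical values with \( \gamma = 1 \), \( \beta = 0 \).

```python
import math

def batch_norm(xs, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalise a batch of scalars, then scale by gamma and shift by beta."""
    n = len(xs)
    mu = sum(xs) / n
    var = sum((x - mu) ** 2 for x in xs) / n     # biased batch variance
    x_hat = [(x - mu) / math.sqrt(var + eps) for x in xs]
    return [gamma * xh + beta for xh in x_hat]

batch = [1.0, 2.0, 3.0, 4.0, 5.0]                # mean 3, variance 2
y = batch_norm(batch)
print(y)   # roughly [-1.414, -0.707, 0.0, 0.707, 1.414]

y_shifted = batch_norm(batch, gamma=2.0, beta=1.0)
print(y_shifted)   # same shape, now with mean 1 and standard deviation close to 2
```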

Xavier Initialisation #

\[ W \sim U\left(-\sqrt{\frac{6}{n_{in}+n_{out}}},\sqrt{\frac{6}{n_{in}+n_{out}}}\right) \]

He Initialisation #

\[ W \sim N\left(0,\frac{2}{n_{in}}\right) \]
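For numerical questions on initialisation, the quantity usually asked for is the Xavier bound or the He standard deviation. The sketch below computes both for a hypothetical 256-to-256 layer and draws a few samples with Python's `random` module. It reads the second argument of \( N(\cdot,\cdot) \) as a variance, so the standard deviation is \( \sqrt{2/n_{in}} \).

```python
import math
import random

def xavier_limit(n_in, n_out):
    """Bound of the uniform range: sqrt(6 / (n_in + n_out))."""
    return math.sqrt(6.0 / (n_in + n_out))

def he_std(n_in):
    """Standard deviation for He initialisation: sqrt(2 / n_in)."""
    return math.sqrt(2.0 / n_in)

rng = random.Random(0)                       # seeded for repeatability
limit = xavier_limit(256, 256)               # about 0.108
std = he_std(256)                            # about 0.088

xavier_w = [rng.uniform(-limit, limit) for _ in range(5)]
he_w = [rng.gauss(0.0, std) for _ in range(5)]
print(limit, std)
print(xavier_w, he_w)
```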

16. Numerical Checklist ☆ #

CNN Output Size Questions #

Steps:

  1. Identify input size \( N \)
  2. Identify kernel size \( K \)
  3. Identify padding \( P \)
  4. Identify stride \( S \)
  5. Substitute into output formula
  6. Repeat separately for height and width if needed
  7. Output channels equal number of filters

CNN Parameter Count Questions #

Steps:

  1. Identify kernel height and width
  2. Identify input channels
  3. Identify number of filters
  4. Add one bias per filter
  5. Use \( (K_hK_wC_{in}+1)C_{out} \)

Softmax Questions #

Steps:

  1. Exponentiate each logit
  2. Add all exponentials
  3. Divide each exponential by the total
  4. Choose class with highest probability

Optimiser Numerical Questions #

Steps:

  1. Compute gradient
  2. Multiply by learning rate
  3. Subtract from old parameter
  4. For momentum, update velocity first
  5. For Adam, update first and second moments

17. Quick Comparison Table #

| Topic | Must remember |
| --- | --- |
| Regression | identity activation, squared loss |
| Binary classification | sigmoid, binary cross-entropy |
| Multi-class classification | softmax, categorical cross-entropy |
| DFNN | hidden layers with non-linear activations create non-linearity |
| CNN | local connectivity, parameter sharing |
| Pooling | spatial reduction, usually no parameters |
| RNN | hidden state stores sequence history |
| LSTM | gates control memory |
| GRU | simpler gated RNN |
| Attention | query-key-value weighted retrieval |
| Transformer | self-attention plus positional encoding |
| Optimisers | update parameters to minimise loss |
| Regularisation | reduce overfitting and improve generalisation |

18. Final Memory Lines #

Neuron: weighted sum → activation → prediction
Classification: score → sigmoid or softmax → probability → loss
CNN: image → convolution → feature map → activation → pooling → classifier
RNN: current input plus previous hidden state → new hidden state
Attention: query compares with keys → softmax weights → weighted values
Transformer: embedding plus position → attention → feed-forward → output
Optimiser: gradient tells direction, learning rate controls step size
Regularisation: reduce memorisation, improve generalisation