Attention Mechanism #

Attention is a deep learning mechanism that allows a model to focus on the most relevant parts of an input sequence when producing an output.

Instead of compressing the whole input into one fixed vector, attention computes a weighted combination of the input representations, placing larger weights on the more relevant positions.

Key takeaway:
Attention answers a simple question:

For the current prediction, which input tokens should the model focus on most?

  • Queries, Keys, and Values
  • Attention Pooling by Similarity
  • Attention Pooling via Nadaraya–Watson Regression
  • Attention Scoring Functions
  • Dot Product Attention
  • Convenience Functions
  • Scaled Dot Product Attention
  • Additive Attention
  • Bahdanau Attention Mechanism
  • Multi-Head Attention
  • Self-Attention
  • Positional Encoding

Why Attention Is Needed ☆ #

Traditional encoder-decoder RNN models compress the full input sequence into one context vector.

This creates a bottleneck, especially for long sequences.

Problems include:

  • early tokens may be forgotten
  • all information must pass through a fixed-size vector
  • long-range dependencies are difficult to preserve
  • the decoder cannot dynamically choose which input positions matter

Attention addresses this by letting the decoder compute a different context vector at each output step.


Encoder-Decoder Without Attention #

flowchart LR
    A["Input Sequence"] --> B["Encoder"]
    B --> C["Single Fixed Context Vector"]
    C --> D["Decoder"]
    D --> E["Output Sequence"]

    style A fill:#E1F5FE,stroke:#4A90E2
    style B fill:#C8E6C9,stroke:#43A047
    style C fill:#FFF9C4,stroke:#FBC02D
    style D fill:#EDE7F6,stroke:#7E57C2
    style E fill:#E1F5FE,stroke:#4A90E2

A single context vector is like asking the model to summarise a whole paragraph into one small note and then answer every question using only that note.


Encoder-Decoder With Attention #

flowchart LR
    A["Input Tokens"] --> B["Encoder Hidden States"]
    B --> C["Attention Weights"]
    C --> D["Dynamic Context Vector"]
    D --> E["Decoder Step"]
    E --> F["Output Token"]

    E --> C

    style A fill:#E1F5FE,stroke:#4A90E2
    style B fill:#C8E6C9,stroke:#43A047
    style C fill:#FFF9C4,stroke:#FBC02D
    style D fill:#EDE7F6,stroke:#7E57C2
    style E fill:#C8E6C9,stroke:#43A047
    style F fill:#E1F5FE,stroke:#4A90E2

Now the decoder can selectively attend to different input positions at different output steps.


Query, Key, and Value Framework ☆ #

Attention is usually described using three objects:

| Term | Meaning | Intuition |
| --- | --- | --- |
| Query | What the model is looking for | Current decoding need |
| Key | What each input position offers for matching | Search label |
| Value | The actual information to retrieve | Content |

A simple analogy:

  • query = question
  • key = index entry
  • value = answer content

Attention Formula ☆ #

For a query \( q \), keys \( k_i \), and values \( v_i \), attention computes a weighted sum of values.

\[ \text{Attention}(q, K, V) = \sum_{i=1}^{n} \alpha_i v_i \]

The attention weights are usually produced using softmax.

\[ \alpha_i = \frac{\exp(\text{score}(q, k_i))}{\sum_{j=1}^{n} \exp(\text{score}(q, k_j))} \]

Because softmax is used, the attention weights are non-negative and sum to one, so they behave like a probability distribution over the input positions.

\[ \sum_{i=1}^{n} \alpha_i = 1 \]
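
The two formulas above can be sketched for a single query as follows. This is a minimal illustration, not a reference implementation: the helper names `softmax` and `attend`, the toy vectors, and the choice of a plain dot product as the score function are all assumptions made for the example.

```python
import numpy as np

def softmax(scores):
    """Numerically stable softmax over a 1-D array of scores."""
    shifted = scores - np.max(scores)   # subtracting the max does not change the result
    exp = np.exp(shifted)
    return exp / exp.sum()

def attend(q, keys, values, score_fn):
    """Return (context, weights) for one query.

    keys:     (n, d_k) array, one key per row
    values:   (n, d_v) array, one value per row
    score_fn: any function score(q, k) -> float
    """
    scores = np.array([score_fn(q, k) for k in keys])
    weights = softmax(scores)       # alpha_i
    context = weights @ values      # sum_i alpha_i * v_i
    return context, weights

# Toy data, made up for illustration.
q = np.array([1.0, 0.0])
keys = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])
values = np.array([[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]])

context, weights = attend(q, keys, values, score_fn=np.dot)
print(weights)   # largest weight on the first key, which matches q best
print(context)   # weighted average of the rows of `values`
```

Passing the score function as an argument keeps the weighting step separate from the choice of scoring function, which is what the later sections vary.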

Scaled Dot-Product Attention ☆ #

In transformers, the standard attention formula is scaled dot-product attention.

\[ \text{Attention}(Q,K,V)=\text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V \]

Where:

  • \( Q \) is the query matrix
  • \( K \) is the key matrix
  • \( V \) is the value matrix
  • \( d_k \) is the key dimension
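
A direct NumPy transcription of the matrix formula might look like the sketch below. The function names and the random test shapes are assumptions for illustration; batching and masking, which practical implementations usually need, are left out.

```python
import numpy as np

def softmax_rows(x):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Q: (m, d_k), K: (n, d_k), V: (n, d_v) -> output (m, d_v), weights (m, n)."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)   # one score per (query, key) pair
    weights = softmax_rows(scores)    # each row sums to 1
    return weights @ V, weights

# Random inputs, only to check shapes.
rng = np.random.default_rng(0)
Q = rng.normal(size=(2, 64))   # 2 queries
K = rng.normal(size=(5, 64))   # 5 keys
V = rng.normal(size=(5, 8))    # 5 values
output, weights = scaled_dot_product_attention(Q, K, V)
print(output.shape, weights.shape)   # (2, 8) (2, 5)
```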

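The division by \( \sqrt{d_k} \) keeps the dot products from growing with the key dimension. If the components of a query and a key are independent with zero mean and unit variance, their dot product has variance \( d_k \). Without scaling, such large scores push softmax into a saturated region where one weight dominates and gradients become very small.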
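
The effect of the scaling term can be checked numerically. The sketch below assumes query and key components drawn independently from a standard normal distribution, which is an idealisation and not a property of trained models.

```python
import numpy as np

rng = np.random.default_rng(0)
d_k = 64

# Assumption: independent components with mean 0 and variance 1.
Q = rng.normal(size=(200, d_k))
K = rng.normal(size=(200, d_k))

raw = Q @ K.T                  # unscaled dot products
scaled = raw / np.sqrt(d_k)    # scaled dot products

raw_std = raw.std()            # roughly sqrt(d_k) = 8
scaled_std = scaled.std()      # roughly 1

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Largest attention weight for the first query, with and without scaling.
peak_raw = softmax(raw[0]).max()
peak_scaled = softmax(scaled[0]).max()
print(raw_std, scaled_std)
print(peak_raw, peak_scaled)   # the unscaled distribution is more peaked
```

Dividing the scores by a constant greater than one always flattens the softmax output, so `peak_raw` is never smaller than `peak_scaled`.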


Dot Product Attention #

Dot product attention uses similarity between query and key.

\[ \text{score}(q,k_i)=q^T k_i \]

High dot product means high similarity.

Low dot product means the key is less relevant to the query.
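
A small worked example, with vectors chosen by hand for illustration, shows how the score reacts to the direction of each key. Note that the dot product also grows with vector length, so it measures alignment and magnitude together.

```python
import numpy as np

q = np.array([1.0, 2.0])
keys = np.array([
    [2.0, 4.0],     # same direction as q
    [-2.0, 1.0],    # orthogonal to q
    [-1.0, -2.0],   # opposite direction to q
])

scores = keys @ q   # one dot product per key
print(scores)       # [10.  0. -5.]
```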


Additive Attention #

Additive attention learns a scoring function using a small neural network.

\[ \text{score}(q,k_i)=v^T \tanh(W_q q + W_k k_i) \]

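Here \( W_q \), \( W_k \), and \( v \) are learned parameters; this \( v \) is a weight vector, not one of the values \( v_i \). Additive attention was used in early sequence-to-sequence attention models.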
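
The scoring function above can be sketched as follows. The dimensions and the random parameters are placeholders for values that would normally be learned; the function name `additive_score` is an assumption for this example.

```python
import numpy as np

def additive_score(q, k, W_q, W_k, v):
    """score(q, k) = v^T tanh(W_q q + W_k k)."""
    return v @ np.tanh(W_q @ q + W_k @ k)

rng = np.random.default_rng(0)
d_q, d_k, d_h = 4, 6, 8                # query, key, and hidden sizes (illustrative)

# Random stand-ins for learned parameters.
W_q = rng.normal(size=(d_h, d_q))
W_k = rng.normal(size=(d_h, d_k))
v = rng.normal(size=d_h)

q = rng.normal(size=d_q)
keys = rng.normal(size=(3, d_k))

scores = np.array([additive_score(q, k, W_q, W_k, v) for k in keys])
print(scores)   # one unnormalised score per key
```

Because the query and key are projected into a shared hidden space first, they do not need to have the same dimension, unlike in dot product attention.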


Bahdanau Attention #

Bahdanau attention is an additive attention mechanism used in encoder-decoder models.

It allows the decoder to look at all encoder hidden states instead of relying only on the final encoder state.

flowchart TD
    A["Encoder Hidden States"] --> B["Score each state with decoder state"]
    B --> C["Softmax attention weights"]
    C --> D["Weighted context vector"]
    D --> E["Decoder prediction"]

    style A fill:#E1F5FE,stroke:#4A90E2
    style B fill:#C8E6C9,stroke:#43A047
    style C fill:#FFF9C4,stroke:#FBC02D
    style D fill:#EDE7F6,stroke:#7E57C2
    style E fill:#E1F5FE,stroke:#4A90E2
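
The flow in the diagram can be sketched for a single decoder step as below. This is a simplified illustration under stated assumptions: the function name `bahdanau_context`, all sizes, and the random parameters are hypothetical, and the recurrent decoder itself is left out.

```python
import numpy as np

def bahdanau_context(dec_state, enc_states, W_q, W_k, v):
    """One attention step: score every encoder state against the decoder state.

    dec_state:  (d_s,)     current decoder hidden state (the query)
    enc_states: (n, d_e)   encoder hidden states (keys and values)
    """
    hidden = np.tanh(dec_state @ W_q.T + enc_states @ W_k.T)   # (n, d_h)
    scores = hidden @ v                                         # (n,)
    scores = scores - scores.max()                              # stability shift
    weights = np.exp(scores) / np.exp(scores).sum()             # softmax
    context = weights @ enc_states                              # (d_e,)
    return context, weights

rng = np.random.default_rng(0)
n, d_e, d_s, d_h = 6, 5, 4, 8          # illustrative sizes
enc_states = rng.normal(size=(n, d_e))
W_q = rng.normal(size=(d_h, d_s))      # random stand-ins for learned parameters
W_k = rng.normal(size=(d_h, d_e))
v = rng.normal(size=d_h)

# Two different decoder states give two different context vectors.
ctx_1, w_1 = bahdanau_context(rng.normal(size=d_s), enc_states, W_q, W_k, v)
ctx_2, w_2 = bahdanau_context(rng.normal(size=d_s), enc_states, W_q, W_k, v)
print(w_1.round(3))
print(w_2.round(3))
```

The encoder states are computed once and reused; only the decoder state changes between steps, which is what makes the context vector dynamic.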

Self-Attention ☆ #

Self-attention is attention applied within the same sequence.

The query, key, and value all come from the same input sequence.

This allows each token to communicate with every other token.

Example:

In the sentence:

The animal did not cross the road because it was tired.

Self-attention helps the model link the pronoun "it" to the earlier word "animal" rather than to "road".
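
A minimal sketch of self-attention over this sentence is shown below. The embeddings and projection matrices are random, so the resulting weights are arbitrary; only a trained model would place a high weight from "it" onto "animal". The function name and sizes are assumptions for illustration.

```python
import numpy as np

def self_attention(X, W_Q, W_K, W_V):
    """Queries, keys, and values are all projections of the same input X."""
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

tokens = "The animal did not cross the road because it was tired".split()

rng = np.random.default_rng(0)
d_model, d_k = 16, 8
X = rng.normal(size=(len(tokens), d_model))   # random stand-in embeddings
W_Q, W_K, W_V = (rng.normal(size=(d_model, d_k)) for _ in range(3))

output, weights = self_attention(X, W_Q, W_K, W_V)

# Row for "it": how much it attends to every token in the same sentence.
it_row = weights[tokens.index("it")]
print(dict(zip(tokens, it_row.round(2))))
```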


Multi-Head Attention ☆ #

Multi-head attention runs attention several times in parallel, each time with its own learned projections of the queries, keys, and values.

Each head can learn a different type of relationship.

For example:

  • one head may focus on nearby words
  • one head may focus on subject-object relationships
  • one head may focus on long-distance dependencies
  • one head may focus on positional structure

\[ \text{head}_i = \text{Attention}(QW_i^Q, KW_i^K, VW_i^V) \]

\[ \text{MultiHead}(Q,K,V)=\text{Concat}(\text{head}_1,\ldots,\text{head}_h)W^O \]
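
The two equations for multi-head attention can be transcribed almost literally. The sketch below loops over the heads for readability; practical implementations typically compute all heads in one batched tensor operation. The sizes and random parameters are assumptions for illustration.

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention, returning only the output."""
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V

def multi_head_attention(Q, K, V, head_params, W_O):
    """head_params: list of (W_q, W_k, W_v) tuples, one per head."""
    heads = [attention(Q @ W_q, K @ W_k, V @ W_v)
             for (W_q, W_k, W_v) in head_params]
    return np.concatenate(heads, axis=-1) @ W_O   # Concat(head_1..head_h) W^O

rng = np.random.default_rng(0)
n, d_model, h = 5, 16, 4
d_head = d_model // h                  # common choice: split d_model across heads

# Random stand-ins for the learned projections of each head.
head_params = [
    tuple(rng.normal(size=(d_model, d_head)) for _ in range(3))
    for _ in range(h)
]
W_O = rng.normal(size=(h * d_head, d_model))

X = rng.normal(size=(n, d_model))
out = multi_head_attention(X, X, X, head_params, W_O)   # self-attention usage
print(out.shape)   # (5, 16)
```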

Positional Encoding ☆ #

Self-attention has no built-in notion of word order: if the input tokens are reordered, the outputs are simply reordered in the same way.

So position information must be added to token embeddings.
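
The lack of order information can be verified directly: shuffling the input rows shuffles the output rows in the same way and changes nothing else. The sketch below uses random inputs and parameters as stand-ins; the function name is an assumption for this example.

```python
import numpy as np

def self_attention(X, W_Q, W_K, W_V):
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V

rng = np.random.default_rng(0)
n, d = 5, 8
X = rng.normal(size=(n, d))                       # token embeddings, no positions
W_Q, W_K, W_V = (rng.normal(size=(d, d)) for _ in range(3))

perm = rng.permutation(n)                         # a random reordering of the tokens
out = self_attention(X, W_Q, W_K, W_V)
out_perm = self_attention(X[perm], W_Q, W_K, W_V)

# Each token gets the same output vector regardless of where it sits.
same_up_to_reordering = np.allclose(out_perm, out[perm])
print(same_up_to_reordering)   # True
```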

\[ PE(pos,2i)=\sin\left(\frac{pos}{10000^{2i/d_{model}}}\right) \]

\[ PE(pos,2i+1)=\cos\left(\frac{pos}{10000^{2i/d_{model}}}\right) \]
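
The sinusoidal encoding can be computed for all positions at once. The sketch below assumes an even \( d_{model} \); the function name and the chosen sizes are illustrative.

```python
import numpy as np

def sinusoidal_positional_encoding(max_len, d_model):
    """Return a (max_len, d_model) matrix; row `pos` is the encoding of that position."""
    assert d_model % 2 == 0, "this sketch assumes an even d_model"
    pos = np.arange(max_len)[:, None]              # (max_len, 1)
    two_i = np.arange(0, d_model, 2)[None, :]      # (1, d_model/2), the values 2i
    angles = pos / np.power(10000.0, two_i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                   # even columns: PE(pos, 2i)
    pe[:, 1::2] = np.cos(angles)                   # odd columns:  PE(pos, 2i+1)
    return pe

pe = sinusoidal_positional_encoding(max_len=50, d_model=16)
print(pe.shape)   # (50, 16)

# Typical usage: add to token embeddings of the same shape.
embeddings = np.zeros((50, 16))                    # placeholder embeddings
inputs_with_position = embeddings + pe
```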

Attention in Applications #

| Application | What attention focuses on |
| --- | --- |
| Translation | relevant source words for each target word |
| Sentiment analysis | emotionally important words |
| Question answering | facts related to the question |
| Time series forecasting | important previous time steps |
| Vision transformer | important image patches |

Common Mistakes ☆ #

Do not confuse attention weights with model weights.

Attention weights are dynamic and depend on the current input.

Model weights are learned parameters stored in matrices such as \( W^Q \), \( W^K \), and \( W^V \).
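
The distinction can be made concrete with a short sketch: the projection matrices stay fixed, while the attention weights are recomputed for every input. The matrices here are random stand-ins for trained parameters, and the helper name is an assumption.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_k = 8, 4

# Model weights: fixed once training is finished (random stand-ins here).
W_Q = rng.normal(size=(d_model, d_k))
W_K = rng.normal(size=(d_model, d_k))

def attention_weights(X):
    """Attention weights are computed from the input, not stored."""
    scores = (X @ W_Q) @ (X @ W_K).T / np.sqrt(d_k)
    scores = scores - scores.max(axis=-1, keepdims=True)
    w = np.exp(scores)
    return w / w.sum(axis=-1, keepdims=True)

X_a = rng.normal(size=(3, d_model))   # one input sequence
X_b = rng.normal(size=(3, d_model))   # a different input sequence

A_a = attention_weights(X_a)
A_b = attention_weights(X_b)
print(np.allclose(A_a, A_b))          # False: same parameters, different attention
```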

Other mistakes:

  • forgetting softmax in attention
  • forgetting the scaling term \( \sqrt{d_k} \)
  • confusing self-attention with cross-attention
  • saying attention removes the need for embeddings
  • saying positional encoding is optional in transformers
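
The difference between self-attention and cross-attention is only where the queries, keys, and values come from. The sketch below omits the learned projections for brevity, which is a simplification, and uses random arrays in place of real encoder and decoder states.

```python
import numpy as np

def attention(Q, K, V):
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

rng = np.random.default_rng(0)
d = 8
encoder_out = rng.normal(size=(6, d))   # 6 source positions
decoder_in = rng.normal(size=(4, d))    # 4 target positions

# Self-attention: queries, keys, and values all come from the same sequence.
_, self_w = attention(decoder_in, decoder_in, decoder_in)

# Cross-attention: queries from the decoder, keys and values from the encoder.
_, cross_w = attention(decoder_in, encoder_out, encoder_out)

print(self_w.shape)    # (4, 4): target positions attend to target positions
print(cross_w.shape)   # (4, 6): target positions attend to source positions
```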

Summary #

| Concept | One-line meaning |
| --- | --- |
| Attention | weighted focus on relevant inputs |
| Query | what we are looking for |
| Key | what each input offers for matching |
| Value | actual content retrieved |
| Self-attention | tokens attend to tokens in the same sequence |
| Cross-attention | decoder attends to encoder output |
| Multi-head attention | several attention mechanisms in parallel |
| Positional encoding | adds order information to tokens |

For exams, remember the flow:

Query → score against Keys → softmax → attention weights → weighted sum of Values → context vector


Reference #

  • Dive into Deep Learning. Cambridge University Press. (Ch. 10, Ch. 7)
