Attention Mechanism and Transformers: The Math Behind Every LLM

Introduction

Before Transformers, sequence models like RNNs and LSTMs processed tokens one at a time, left to right. To understand the relationship between a word at position 1 and a word at position 100, the signal had to travel through 98 intermediate hidden states. Long-range dependencies were hard to learn because gradients vanished or exploded over that path.

The Transformer, introduced in "Attention Is All You Need" (Vaswani et al., 2017), discarded recurrence entirely. Instead, every position attends directly to every other position in a single operation. A word at position 1 and a word at position 100 are equally "close" from the model's perspective. The cost of this global attention is quadratic in sequence length, but because all positions can be processed in parallel on modern GPUs, the trade-off paid off for the sequence lengths used in practice.

GPT, BERT, LLaMA, Claude, and nearly every other major LLM are Transformer variants. Understanding the attention mechanism therefore means understanding the core computation of modern language models.


1. What Is Attention?

Attention is a soft lookup mechanism. Imagine a library where each book (value) has a title (key). When you arrive with a query, you do not retrieve one book with a hard index; instead you compute how well your query matches every title, and receive a weighted blend of all books — with the most relevant ones weighted most heavily.

In self-attention, each token in the sequence simultaneously acts as a query (what am I looking for?), a key (what do I contain?), and a value (what do I contribute?). The output for each token is a weighted sum of all token values, where the weights are determined by query–key similarity.
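The library analogy can be made concrete with a single query. The following is a minimal sketch in plain Python, not a production implementation; the function name `soft_lookup` and the toy keys and values are assumptions chosen for illustration.

```python
import math

def soft_lookup(query, keys, values):
    """Blend all values, weighting each by how well its key matches the query."""
    # Similarity of the query to every key (plain dot product).
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    # Softmax turns the scores into positive weights that sum to 1.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Weighted blend of the value vectors.
    dim = len(values[0])
    blended = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)]
    return weights, blended

keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # hypothetical "titles"
values = [[10.0], [20.0], [30.0]]             # hypothetical "books"
weights, blended = soft_lookup([0.0, 2.0], keys, values)
print(weights, blended)
```

The query matches the second and third keys equally well, so those two values dominate the blend, while the first value still contributes a small amount. Nothing is retrieved by a hard index.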


2. Scaled Dot-Product Attention

Given an input sequence of \(n\) tokens, each represented as a \(d_{\text{model}}\)-dimensional vector, we produce three matrices by projecting the input \(X \in \mathbb{R}^{n \times d_{\text{model}}}\) through learned weight matrices:

$$ Q = X W^Q, \quad K = X W^K, \quad V = X W^V $$

where \(W^Q, W^K \in \mathbb{R}^{d_{\text{model}} \times d_k}\) and \(W^V \in \mathbb{R}^{d_{\text{model}} \times d_v}\).

The attention output is:

$$ \text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{Q K^\top}{\sqrt{d_k}}\right) V $$

Step by step:

  1. \(QK^\top \in \mathbb{R}^{n \times n}\) — each entry \([i,j]\) is the dot product between the query of token \(i\) and the key of token \(j\). High values mean token \(i\) attends strongly to token \(j\).
  2. Divide by \(\sqrt{d_k}\) — the scaling factor. Without it, dot products in high-dimensional spaces push the softmax into regions of near-zero gradient, making training unstable.
  3. Apply softmax row-wise — converts raw scores into a probability distribution over positions. Each row sums to 1.
  4. Multiply by \(V\) — the output for each token is a weighted sum of value vectors, with weights given by the attention probabilities.
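The four steps above can be sketched in a few lines of plain Python. This is an illustrative version that works on lists of row vectors and assumes \(Q\), \(K\), and \(V\) have already been projected; the function names and the toy inputs are assumptions, and a real implementation would use a tensor library.

```python
import math

def softmax(row):
    m = max(row)                          # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention on lists of row vectors."""
    d_k = len(K[0])
    # Steps 1-2: scaled scores, one row per query and one column per key.
    scores = [[sum(q * k for q, k in zip(q_row, k_row)) / math.sqrt(d_k)
               for k_row in K] for q_row in Q]
    # Step 3: row-wise softmax, so every row sums to 1.
    weights = [softmax(row) for row in scores]
    # Step 4: each output row is a weighted sum of the rows of V.
    d_v = len(V[0])
    output = [[sum(w * v[j] for w, v in zip(w_row, V)) for j in range(d_v)]
              for w_row in weights]
    return weights, output

Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
weights, output = attention(Q, K, V)
print(weights)
print(output)
```

Returning the weights alongside the output is a choice made here for inspection; only the output is passed on to the rest of the network.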

Why \(\sqrt{d_k}\)?

Assume each element of \(Q\) and \(K\) is drawn i.i.d. with mean 0 and variance 1. Then each element of \(QK^\top\) has mean 0 and variance \(d_k\) (sum of \(d_k\) products of unit-variance variables). The standard deviation grows as \(\sqrt{d_k}\). Dividing by \(\sqrt{d_k}\) normalises the variance back to 1, keeping the softmax in a well-behaved gradient regime.
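The variance argument can be checked empirically. The sketch below draws standard-normal query and key vectors (one distribution that satisfies the assumption above) and estimates the variance of their dot product before and after scaling; the function name, the sample count, and the seed are arbitrary choices for this illustration.

```python
import random
import statistics

def dot_product_variance(d_k, trials=5000, seed=0):
    """Estimate Var(q . k) for i.i.d. standard-normal q and k of dimension d_k."""
    rng = random.Random(seed)
    dots = []
    for _ in range(trials):
        q = [rng.gauss(0.0, 1.0) for _ in range(d_k)]
        k = [rng.gauss(0.0, 1.0) for _ in range(d_k)]
        dots.append(sum(a * b for a, b in zip(q, k)))
    raw = statistics.pvariance(dots)
    scaled = statistics.pvariance([x / d_k ** 0.5 for x in dots])
    return raw, scaled

raw_var, scaled_var = dot_product_variance(d_k=64)
print(raw_var, scaled_var)   # raw variance should be near 64, scaled near 1
```

Being sample estimates, the two numbers will only be close to \(d_k\) and 1, not exactly equal to them.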


3. A Numerical Example

Let \(d_k = 2\) for simplicity, with 3 tokens. Suppose after projection:

$$ Q = \begin{pmatrix}1 & 0\\ 0 & 1\\ 1 & 1\end{pmatrix}, \quad K = \begin{pmatrix}1 & 0\\ 0 & 1\\ 1 & 1\end{pmatrix}, \quad V = \begin{pmatrix}v_1\\ v_2\\ v_3\end{pmatrix} $$

Scaled scores \(QK^\top / \sqrt{2}\):

$$ \frac{QK^\top}{\sqrt{2}} = \frac{1}{\sqrt{2}}\begin{pmatrix}1 & 0 & 1\\ 0 & 1 & 1\\ 1 & 1 & 2\end{pmatrix} \approx \begin{pmatrix}0.71 & 0 & 0.71\\ 0 & 0.71 & 0.71\\ 0.71 & 0.71 & 1.41\end{pmatrix} $$

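After row-wise softmax, token 3 (with the highest self-score of 1.41) attends most strongly to itself: its weights are approximately (0.25, 0.25, 0.50). Tokens 1 and 2 each put about 0.40 on themselves and 0.40 on token 3, and the remaining 0.20 on the other token, so the rows are approximately (0.40, 0.20, 0.40) and (0.20, 0.40, 0.40). The output for each token is the corresponding weighted sum of \(v_1\), \(v_2\), and \(v_3\).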
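The softmax weights for this example can be reproduced with a short script. It is a sketch in plain Python that hard-codes the \(Q\) and \(K\) matrices from above; the helper name `softmax` is an assumption of this illustration.

```python
import math

def softmax(row):
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

Q = [[1, 0], [0, 1], [1, 1]]
K = [[1, 0], [0, 1], [1, 1]]
scale = math.sqrt(2)                      # sqrt(d_k) with d_k = 2

# Scaled scores: entry [i][j] is (query i . key j) / sqrt(d_k).
scores = [[sum(q * k for q, k in zip(q_row, k_row)) / scale for k_row in K]
          for q_row in Q]
weights = [softmax(row) for row in scores]

for row in weights:
    print([round(w, 2) for w in row])
# Rows are approximately (0.40, 0.20, 0.40), (0.20, 0.40, 0.40), (0.25, 0.25, 0.50).
```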


4. Multi-Head Attention

A single attention head produces one weighted mixture of values — one "perspective" on the sequence. Multi-head attention runs \(h\) independent attention heads in parallel, each with its own projection matrices, and concatenates their outputs:

$$ \text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h) W^O $$

where \(\text{head}_i = \text{Attention}(Q W^Q_i, K W^K_i, V W^V_i)\) and \(W^O \in \mathbb{R}^{h d_v \times d_{\text{model}}}\).

Each head learns to attend to different aspects of the input. In a sentence like "The animal didn't cross the street because it was too tired", one head may learn to resolve the coreference of "it" (attending "it" → "animal"), another may capture syntactic structure (attending verbs to their subjects), and another may track semantic roles. This division of labour is one reason Transformers are so expressive.

In practice, \(d_k = d_v = d_{\text{model}} / h\), so the total computation is similar to that of a single head with full dimensionality, while allowing \(h\) distinct attention patterns to be learned.
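The claim that the cost does not grow with the number of heads can be checked with simple arithmetic. The sketch below counts projection weights (biases ignored) and the multiply-adds needed for the score matrices, assuming \(d_k = d_v = d_{\text{model}} / h\); the function names and the example sizes are hypothetical.

```python
def attention_param_count(d_model, h):
    """Projection weights for h heads plus the output projection W^O."""
    d_k = d_v = d_model // h
    per_head = 2 * d_model * d_k + d_model * d_v   # W^Q_i, W^K_i, W^V_i
    output_proj = h * d_v * d_model                # W^O
    return h * per_head + output_proj

def score_multiply_adds(n, d_model, h):
    """Multiply-adds to form all h score matrices of shape n x n."""
    d_k = d_model // h
    return h * n * n * d_k

counts = {h: attention_param_count(512, h) for h in (1, 8, 16)}
costs = {h: score_multiply_adds(128, 512, h) for h in (1, 8, 16)}
print(counts)
print(costs)
```

Both dictionaries hold the same value for every \(h\): splitting \(d_{\text{model}}\) across heads redistributes the work rather than adding to it.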


5. The Transformer Block

A Transformer layer stacks multi-head attention with a position-wise feed-forward network, using residual connections and layer normalization at each step. The block described below uses Pre-LN (layer norm applied before each sublayer), which is used in modern LLMs such as GPT-2 and GPT-3, LLaMA, and PaLM. The original Vaswani et al. (2017) paper used Post-LN (norm applied after the residual: LayerNorm(x + Sublayer(x))); Pre-LN became the standard because it yields more stable gradients at large scale.

Pre-LN Transformer Block — two sublayers applied in sequence:
  1. Attention sublayer — LayerNorm is applied to the input, passed through Multi-Head Attention, then the original input is added back as a residual: x₁ = x + Attention(LayerNorm(x))
  2. FFN sublayer — LayerNorm is applied to x₁, passed through the Feed-Forward Network, then x₁ is added back as a residual: output = x₁ + FFN(LayerNorm(x₁))
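The two residual updates can be sketched as follows. This is a structural illustration only: `layer_norm` omits the learned scale and shift, and `mean_mix`, `relu_ffn`, and `zero_sublayer` are hypothetical stand-ins for real attention and feed-forward sublayers.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize one token vector to zero mean and unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def norm_all(xs):
    return [layer_norm(x) for x in xs]

def add_all(a, b):
    return [[p + q for p, q in zip(u, v)] for u, v in zip(a, b)]

def pre_ln_block(xs, attention, ffn):
    x1 = add_all(xs, attention(norm_all(xs)))   # x1 = x + Attention(LayerNorm(x))
    return add_all(x1, ffn(norm_all(x1)))       # output = x1 + FFN(LayerNorm(x1))

def mean_mix(xs):
    """Stand-in for attention: every token receives the average of all tokens."""
    n, d = len(xs), len(xs[0])
    avg = [sum(x[j] for x in xs) / n for j in range(d)]
    return [list(avg) for _ in xs]

def relu_ffn(xs):
    """Stand-in for the FFN: a per-position ReLU with no weights."""
    return [[max(0.0, v) for v in x] for x in xs]

def zero_sublayer(xs):
    return [[0.0] * len(x) for x in xs]

xs = [[1.0, 2.0, 3.0], [4.0, 6.0, 8.0]]
out = pre_ln_block(xs, mean_mix, relu_ffn)
identity_out = pre_ln_block(xs, zero_sublayer, zero_sublayer)
print(out)
print(identity_out)   # equals xs: the residual path is untouched by LayerNorm
```

With sublayers that return zeros, the block returns its input unchanged. This shows that in Pre-LN the residual stream itself is never normalized, which is one intuition for its more stable gradients.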

The feed-forward network (FFN) is applied independently to each position. The original Transformer used a ReLU activation:

$$ \text{FFN}(x) = \max(0, x W_1 + b_1) W_2 + b_2 $$

Modern LLMs have largely replaced ReLU with smoother activations: BERT and GPT-2 use GELU; LLaMA, PaLM, and Mistral use SwiGLU (Shazeer, 2020), which introduces a learned gate:

$$ \text{FFN}_{\text{SwiGLU}}(x) = \bigl(\text{SiLU}(xW_g) \odot xW_u\bigr)\, W_d, \quad \text{SiLU}(z) = z \cdot \sigma(z) $$

The gate projection \(W_g\) and up projection \(W_u\) are applied separately; the element-wise product with the SiLU gate selectively suppresses features before the final projection \(W_d\). The inner dimension is typically 4× the model dimension for ReLU/GELU, and ≈ 8/3× for SwiGLU (to maintain the same parameter count with three weight matrices instead of two). Interpretability work suggests that much of a model's factual knowledge is stored in the FFN weights. As a rough intuition, attention moves information between positions, while the FFN transforms the information at each position.
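A minimal sketch of the SwiGLU feed-forward computation for a single position follows, together with the parameter-count arithmetic behind the 8/3 factor. The tiny weight matrices and `d_model = 768` are hypothetical values chosen so the numbers are easy to follow, and biases are omitted.

```python
import math

def silu(z):
    return z / (1.0 + math.exp(-z))        # z * sigmoid(z)

def matvec(x, W):
    """Row vector x (length m) times matrix W (m rows, n columns)."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(len(W[0]))]

def swiglu_ffn(x, W_g, W_u, W_d):
    gate = [silu(g) for g in matvec(x, W_g)]      # SiLU(x W_g)
    up = matvec(x, W_u)                           # x W_u
    hidden = [g * u for g, u in zip(gate, up)]    # element-wise product
    return matvec(hidden, W_d)                    # project back to d_model

x = [1.0, 2.0]
W_g = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
W_u = [[1.0, 1.0, 1.0], [1.0, 1.0, 1.0]]
W_d = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
out = swiglu_ffn(x, W_g, W_u, W_d)
print(out)

# Parameter counts for an assumed d_model of 768.
d_model = 768
relu_params = 2 * d_model * (4 * d_model)           # W_1 and W_2
swiglu_params = 3 * d_model * (8 * d_model // 3)    # W_g, W_u and W_d
print(relu_params, swiglu_params)
```

The third hidden unit has a gate pre-activation of 0, so SiLU closes the gate and that unit contributes nothing, regardless of its up-projection value.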


6. Positional Encoding

Self-attention is permutation-equivariant: shuffling the input tokens produces the same outputs, shuffled in the same way. Without position information, "The cat sat" and "sat the cat" would yield the same set of token representations. Positional encoding fixes this by adding position information to each token embedding before it enters the Transformer.

The original Transformer used fixed sinusoidal encodings:

$$ \text{PE}(pos, 2i) = \sin\!\left(\frac{pos}{10000^{2i / d_{\text{model}}}}\right) $$

$$ \text{PE}(pos, 2i+1) = \cos\!\left(\frac{pos}{10000^{2i / d_{\text{model}}}}\right) $$

Each position gets a unique vector of alternating sines and cosines at different frequencies. The model can then learn to use these signals to infer relative positions. Many modern LLMs instead use Rotary Position Embeddings (RoPE), which encode relative position directly into the query–key dot product rather than adding absolute position to the token embedding. This can improve generalization to sequence lengths longer than those seen during training.
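The sinusoidal formulas translate directly into code. The sketch below computes the encoding for one position, assuming an even \(d_{\text{model}}\); the function name and the small table size are choices made for this illustration.

```python
import math

def sinusoidal_encoding(pos, d_model):
    """Sinusoidal position vector for one position (d_model assumed even)."""
    pe = []
    for i in range(d_model // 2):
        angle = pos / (10000 ** (2 * i / d_model))
        pe.append(math.sin(angle))   # dimension 2i
        pe.append(math.cos(angle))   # dimension 2i + 1
    return pe

# One row per position, for positions 0..3 and d_model = 8.
table = [sinusoidal_encoding(pos, 8) for pos in range(4)]
for row in table:
    print([round(v, 3) for v in row])
```

Position 0 always maps to alternating 0 and 1, and later dimensions change more slowly with position because their wavelengths are longer.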


7. Causal (Masked) Self-Attention

Encoder models like BERT use bidirectional attention — each token attends to all others. Decoder models like GPT use causal attention: token \(i\) can only attend to positions \(\leq i\). This is implemented by masking the upper triangle of the attention score matrix with \(-\infty\) before the softmax, so those positions receive zero weight after exponentiation:

$$ \text{Mask}[i, j] = \begin{cases} 0 & j \leq i \\ -\infty & j > i \end{cases} $$

Causal masking enables autoregressive generation: the model predicts one token at a time, left to right, and each prediction can only see tokens that have already been generated.
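A minimal sketch of the masking step, in plain Python: scores above the diagonal are replaced by \(-\infty\) before the row-wise softmax. The function name and the all-zero score matrix are assumptions chosen so the resulting weights are easy to read.

```python
import math

def causal_softmax(scores):
    """Row-wise softmax in which position i only sees positions j <= i."""
    weights = []
    for i, row in enumerate(scores):
        masked = [s if j <= i else -math.inf for j, s in enumerate(row)]
        m = max(masked)                    # position i is never masked, so m is finite
        exps = [math.exp(s - m) for s in masked]   # exp(-inf) evaluates to 0.0
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

scores = [[0.0, 0.0, 0.0] for _ in range(3)]   # equal scores everywhere
weights = causal_softmax(scores)
for row in weights:
    print(row)
```

With equal scores, token 1 attends only to itself, token 2 splits its attention over the first two positions, and token 3 spreads it over all three. Future positions always receive exactly zero weight.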


8. Complexity and the KV Cache

The compute cost of attention is \(O(n^2 d)\) in the sequence length \(n\), and a naive implementation also stores an \(n \times n\) attention matrix per head. For a 100-token sequence, that matrix has 10,000 entries; for 10,000 tokens, it has 100 million. This quadratic scaling is why long-context models are expensive to run and why techniques like FlashAttention (which reorders the computation to avoid materializing the full matrix) and sparse attention patterns have become important.

During inference, the keys and values of already-processed tokens do not need to be recomputed. The KV cache stores them between generation steps, which reduces the attention compute for each new token from \(O(n^2)\) to \(O(n)\), at the cost of memory proportional to \(n \cdot d\) per layer. The initial prefill pass over the prompt still costs \(O(n^2)\).
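The idea behind the cache can be sketched for a single head in a single layer. In this illustration the class name `KVCache`, the helper `attend`, and the toy query, key, and value vectors are all assumptions; the vectors stand for projections that a real model would compute from each new token.

```python
import math

def attend(q, keys, values):
    """Attention output for one query over the given keys and values."""
    d_k = len(q)
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in keys]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [sum((e / total) * v[j] for e, v in zip(exps, values))
            for j in range(len(values[0]))]

class KVCache:
    def __init__(self):
        self.keys = []
        self.values = []

    def step(self, q, k, v):
        # Only the new token's key and value are appended; older ones are reused.
        self.keys.append(k)
        self.values.append(v)
        return attend(q, self.keys, self.values)   # O(n) work for this token

qs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
ks = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
vs = [[1.0], [2.0], [3.0]]

cache = KVCache()
cached_outputs = [cache.step(q, k, v) for q, k, v in zip(qs, ks, vs)]
# Reference: attention for token t recomputed from scratch over positions 0..t.
recomputed = [attend(qs[t], ks[:t + 1], vs[:t + 1]) for t in range(len(qs))]
print(cached_outputs)
print(recomputed)
```

Both lists agree, because attending over the cached keys and values is the same computation as causal attention over the prefix. The cache grows by one key and one value per generated token, which is the memory cost described above.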


9. Key Architectural Variants

| Model family | Attention type | Position encoding | Notable representatives |
|---|---|---|---|
| Encoder-only | Bidirectional | Learned absolute | BERT, RoBERTa, DeBERTa |
| Decoder-only | Causal (masked) | RoPE / ALiBi | GPT-4, LLaMA 3, Claude, Gemini |
| Encoder-decoder | Cross-attention between encoder and decoder | Relative / sinusoidal | T5, BART, Whisper |

Frequently Asked Questions

What is the difference between self-attention and cross-attention?

In self-attention, queries, keys, and values all come from the same sequence — each token attends to every other token in the same input. In cross-attention, queries come from one sequence (e.g. the decoder) and keys and values come from another (e.g. the encoder output). Cross-attention is how the decoder in a seq-to-seq Transformer reads the encoded input when generating output.

Why is attention scaled by the square root of the key dimension?

Without scaling, dot products grow in magnitude as the key dimension \(d_k\) increases, pushing the softmax into regions with very small gradients. Dividing by \(\sqrt{d_k}\) keeps the dot products in a range where softmax gradients remain healthy during training. The value \(\sqrt{d_k}\) is the standard deviation of a dot product between two \(d_k\)-dimensional vectors whose components are independent with zero mean and unit variance.

What is multi-head attention and why does it help?

Multi-head attention runs several attention operations in parallel, each with its own learned projection matrices. Each head can specialize in a different type of relationship: one head might track syntactic dependencies, another coreference, another positional proximity. The outputs are concatenated and projected back down. Using \(h\) heads with \(d_{\text{model}}/h\) dimensions each costs roughly the same computation as single-head attention at full dimension.

How does positional encoding inject order into the Transformer?

Self-attention is permutation-equivariant by design: shuffling the input tokens simply shuffles the outputs in the same way, so word order carries no information on its own. Positional encoding adds a position-dependent signal so that order matters. The original Transformer added fixed sinusoidal encodings to the token embeddings; BERT and GPT-2 use learned absolute position embeddings, and many recent LLMs use rotary position embeddings (RoPE), which encode relative position directly in the attention score.


Key Takeaways

  • Scaled dot-product attention computes a weighted sum of values, where weights are softmax-normalized dot products of queries and keys, scaled by \(1/\sqrt{d_k}\) to prevent gradient saturation.
  • Multi-head attention runs \(h\) independent attention operations in parallel, each learning to capture a different relational pattern in the sequence.
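  • Positional information must be injected because self-attention is inherently order-agnostic. The original Transformer adds it to the token embeddings before the first layer; many modern models use RoPE, which encodes relative rather than absolute position inside the attention computation.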
  • Causal masking restricts each token to attend only to prior positions, enabling left-to-right autoregressive generation in decoder models.
  • Attention complexity is \(O(n^2)\) in sequence length, which is why long-context inference is expensive and why techniques like FlashAttention and the KV cache matter.

References

  • Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., & Polosukhin, I. (2017). Attention Is All You Need. Advances in Neural Information Processing Systems, 30.
  • Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. NAACL-HLT 2019.
  • Su, J., Lu, Y., Pan, S., Wen, B., & Liu, Y. (2024). RoFormer: Enhanced Transformer with Rotary Position Embedding. Neurocomputing, 568.
  • Dao, T., Fu, D. Y., Ermon, S., Rudra, A., & Ré, C. (2022). FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness. Advances in Neural Information Processing Systems, 35.