Attention Is All You Need
The Transformer (Vaswani et al., 2017) replaces recurrence and convolution with self-attention, making sequence modeling highly parallelizable, scalable, and better at capturing long-range dependencies.
Core ideas (quick read)
Attention Is All You Need introduces the Transformer: a sequence model built entirely on attention. It avoids the step-by-step recurrence of RNNs and the stacked local convolutions of CNNs, enabling efficient parallel training and strong results on tasks like machine translation.
Key contributions at a glance
- Self-Attention: each token can directly attend to any other token in the sequence. This shortens the interaction path and makes long-range information flow easier.
- Multi-Head Attention: learn multiple alignment patterns in parallel (semantic, syntactic, positional, etc.), then concatenate and mix them.
- Positional Encoding: inject order information without recurrence or convolution (a sinusoidal sketch follows this list).
- Stacked encoder/decoder: compose attention, feed-forward networks, residual connections, and LayerNorm into a scalable deep architecture.
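A minimal sketch of the sinusoidal positional encoding, to make that point concrete (NumPy; the helper name `sinusoidal_positions` and the toy sizes are my own, not the paper's):

```python
import numpy as np

def sinusoidal_positions(seq_len: int, d_model: int) -> np.ndarray:
    """Sinusoidal positional encodings from the paper:
    PE[pos, 2i] = sin(pos / 10000^(2i/d_model)), PE[pos, 2i+1] = cos(...).
    Assumes d_model is even; returns an array of shape (seq_len, d_model)."""
    positions = np.arange(seq_len)[:, None]                 # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]                # (1, d_model/2), i.e. 2i
    angle_rates = 1.0 / np.power(10000.0, dims / d_model)   # one frequency per dimension pair
    angles = positions * angle_rates                        # (seq_len, d_model/2)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)   # even dimensions get sine
    pe[:, 1::2] = np.cos(angles)   # odd dimensions get cosine
    return pe

# Example: encodings for a 10-token sequence with model width 16,
# added to the token embeddings before the first layer.
print(sinusoidal_positions(10, 16).shape)  # (10, 16)
```

The paper's stated motivation for sinusoids is that for any fixed offset k, PE(pos+k) is a linear function of PE(pos), so relative positions are easy for the model to express.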
Figures (not reproduced here): Transformer architecture (simplified) · How self-attention is computed (flow) · Multi-Head Attention (split → parallel → concat) · Attention heatmap (example visualization)
Attention (simplified equation)
Attention(Q, K, V) = softmax(QKᵀ / √dₖ) · V
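A minimal NumPy sketch of this equation (the function name, the optional mask argument, and the toy shapes are illustrative choices, not from the paper):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V, mask=None):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V.
    Q: (n_q, d_k), K: (n_k, d_k), V: (n_k, d_v); mask is (n_q, n_k) with True = blocked."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)            # similarity of each query with each key
    if mask is not None:
        scores = np.where(mask, -1e9, scores)  # blocked positions get ~zero weight
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V                         # weighted sum of the values

# Toy example: 3 queries attending over 4 key/value pairs.
rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(3, 8)), rng.normal(size=(4, 8)), rng.normal(size=(4, 16))
print(scaled_dot_product_attention(Q, K, V).shape)  # (3, 16)
```

The 1/√dₖ scaling matters because large dot products push the softmax into saturated regions with tiny gradients; dividing by √dₖ keeps the scores in a workable range.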
Why it scales: parallelism and long-range interaction
| Approach | Interaction | Training parallelism | Long-range dependencies |
|---|---|---|---|
| RNN/LSTM | Step-by-step recurrence over time | Weak (inherently sequential) | Long path; signal can decay |
| CNN | Stacked local convolutions expand receptive field | Strong (parallel within a layer) | Needs more layers to cover long context |
| Self-Attention | Global token-to-token interaction + weighted aggregation | Strong (matrix-friendly) | Direct interaction; short path |
Details to watch for while reading
- Masking: decoder self-attention masks future tokens to preserve autoregressive causality.
- Cross-Attention: the decoder attends over encoder outputs to selectively read the source sequence.
- Head split + concat: split the hidden dimension into heads, run attention per head, then concatenate and apply a linear projection (see the sketch after this list).
- Residual + LayerNorm: crucial for stable training when stacking many layers.
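A minimal NumPy sketch tying the mask and the head split/concat points together in a single causal multi-head self-attention step (the function name, weight shapes, and toy run are illustrative assumptions, not the paper's reference implementation):

```python
import numpy as np

def multi_head_self_attention(X, Wq, Wk, Wv, Wo, num_heads, causal=True):
    """X: (seq_len, d_model); Wq/Wk/Wv/Wo: (d_model, d_model).
    Splits the model dimension into heads, attends per head, concatenates, projects."""
    seq_len, d_model = X.shape
    d_head = d_model // num_heads

    # Project once, then reshape to (num_heads, seq_len, d_head).
    def split(W):
        return (X @ W).reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)

    Q, K, V = split(Wq), split(Wk), split(Wv)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)      # (heads, seq, seq)
    if causal:
        future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
        scores = np.where(future, -1e9, scores)              # block attention to future tokens
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)           # per-head softmax
    heads = weights @ V                                      # (heads, seq, d_head)
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)  # concat heads
    return concat @ Wo                                       # final linear mix

# Toy run: 5 tokens, model width 16, 4 heads.
rng = np.random.default_rng(1)
X = rng.normal(size=(5, 16))
W = [rng.normal(size=(16, 16)) * 0.1 for _ in range(4)]
print(multi_head_self_attention(X, *W, num_heads=4).shape)  # (5, 16)
```

Note how the mask is applied to the scores before the softmax, so future positions receive effectively zero weight, and how the per-head outputs are concatenated back to d_model before the output projection.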
Related links: arXiv:1706.03762 · PDF