
Attention Is All You Need

The Transformer (Vaswani et al., 2017) replaces recurrence and convolution with self-attention, making sequence modeling highly parallel, scalable, and better at capturing long-range dependencies.

Core ideas (quick read)

Attention Is All You Need introduces the Transformer: a sequence model built entirely on attention. It avoids the step-by-step recurrence of RNNs and the stacked local convolutions of CNNs, enabling efficient parallel training and strong results on tasks like machine translation.

Key contributions at a glance

  • Self-Attention: each token can directly attend to any other token in the sequence. This shortens the interaction path and makes long-range information flow easier.
  • Multi-Head Attention: learn multiple alignment patterns in parallel (semantic, syntactic, positional, etc.), then concatenate and mix them.
  • Positional Encoding: inject order information without recurrence or convolution (a minimal sketch follows this list).
  • Stacked encoder/decoder: compose attention, feed-forward networks, residual connections, and LayerNorm into a scalable deep architecture.
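
To make the positional-encoding bullet concrete, here is a minimal NumPy sketch of the paper's sinusoidal encoding; the function name is ours and the code assumes an even d_model, purely for illustration.

    import numpy as np

    def sinusoidal_positional_encoding(seq_len: int, d_model: int) -> np.ndarray:
        """Return a (seq_len, d_model) matrix of sinusoidal position encodings."""
        positions = np.arange(seq_len)[:, None]                 # (seq_len, 1)
        dims = np.arange(0, d_model, 2)[None, :]                # even dimensions 2i
        angle_rates = 1.0 / np.power(10000.0, dims / d_model)   # 1 / 10000^(2i/d_model)
        angles = positions * angle_rates                        # (seq_len, d_model/2)
        pe = np.zeros((seq_len, d_model))
        pe[:, 0::2] = np.sin(angles)    # even indices: sine
        pe[:, 1::2] = np.cos(angles)    # odd indices: cosine
        return pe

    # Added to the token embeddings before the first layer:
    # x = token_embeddings + sinusoidal_positional_encoding(seq_len, d_model)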

Transformer architecture (simplified)

Fig. 1 — Encoder/decoder stack (schematic; not to scale)
[Figure: Encoder stack (N×): input tokens → embedding + positional encoding → self-attention (Q, K, V from the same sequence) → feed-forward network, each sublayer followed by residual + LayerNorm. Decoder stack (N×): output tokens (shifted right) → embedding + positional encoding → masked (causal) self-attention → cross-attention to the encoder output (memory) → feed-forward network → linear + softmax for the next token.]
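
To make Fig. 1 more concrete, below is a minimal sketch of one encoder layer, assuming PyTorch and treating multi-head attention as a black box; the class name is ours, while the default sizes match the paper's base model.

    import torch
    import torch.nn as nn

    class EncoderLayer(nn.Module):
        """One encoder block: self-attention + FFN, each with residual + LayerNorm."""
        def __init__(self, d_model: int = 512, n_heads: int = 8, d_ff: int = 2048):
            super().__init__()
            self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
            self.ffn = nn.Sequential(
                nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model)
            )
            self.norm1 = nn.LayerNorm(d_model)
            self.norm2 = nn.LayerNorm(d_model)

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            # Self-attention: Q, K, V all come from the same sequence x.
            attn_out, _ = self.self_attn(x, x, x)
            x = self.norm1(x + attn_out)       # residual + LayerNorm
            x = self.norm2(x + self.ffn(x))    # residual + LayerNorm
            return x

    # x has shape (batch, seq_len, d_model); stack N such layers for the full encoder.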

How self-attention is computed (flow)

Fig. 2 — Main steps from inputs to attention output
[Figure: input vectors X → linear projections Wq, Wk, Wv → Q, K, V → scores QKᵀ / √dₖ → softmax → attention weights → output = weights · V (weighted sum of values).]
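
Written out in code, the flow in Fig. 2 is a handful of matrix operations; a minimal NumPy sketch, with random matrices standing in for the learned projections Wq, Wk, Wv:

    import numpy as np

    def softmax(scores: np.ndarray) -> np.ndarray:
        exp = np.exp(scores - scores.max(axis=-1, keepdims=True))
        return exp / exp.sum(axis=-1, keepdims=True)

    rng = np.random.default_rng(0)
    seq_len, d_model, d_k = 6, 16, 16

    X = rng.normal(size=(seq_len, d_model))    # input vectors
    Wq = rng.normal(size=(d_model, d_k))       # learned in practice
    Wk = rng.normal(size=(d_model, d_k))
    Wv = rng.normal(size=(d_model, d_k))

    Q, K, V = X @ Wq, X @ Wk, X @ Wv           # linear projections
    scores = Q @ K.T / np.sqrt(d_k)            # QKᵀ / √dₖ
    weights = softmax(scores)                  # each row sums to 1
    output = weights @ V                       # weighted sum of value vectors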

Multi-Head Attention (split → parallel → concat)

Fig. 3 — Split into heads, run attention in parallel, then concatenate and project
[Figure: X → linear projections Wq, Wk, Wv → split into heads (head 1, head 2, head 3, …) → attention run in parallel per head → concat → output projection Wo.]
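
The split → parallel → concat pattern is mostly reshaping; a minimal NumPy sketch, with variable names ours and sizes chosen to match the paper's base configuration:

    import numpy as np

    def softmax(s):
        e = np.exp(s - s.max(axis=-1, keepdims=True))
        return e / e.sum(axis=-1, keepdims=True)

    rng = np.random.default_rng(0)
    seq_len, d_model, n_heads = 6, 512, 8
    d_head = d_model // n_heads                 # 64 dimensions per head

    X = rng.normal(size=(seq_len, d_model))
    Wq, Wk, Wv, Wo = (rng.normal(size=(d_model, d_model)) * 0.02 for _ in range(4))

    def split_heads(M):
        # (seq_len, d_model) -> (n_heads, seq_len, d_head)
        return M.reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)

    Q, K, V = split_heads(X @ Wq), split_heads(X @ Wk), split_heads(X @ Wv)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)   # one score matrix per head
    heads = softmax(scores) @ V                            # (n_heads, seq_len, d_head)

    # Concatenate the heads and mix them with the output projection Wo.
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    output = concat @ Wo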

Attention heatmap (example visualization)

[Figure: a small example attention weight matrix over the tokens <s>, The, cat, sat, on, mat; keys along the columns, queries along the rows, with a color scale from low to high.]
Fig. 4 — Example attention weight matrix (illustrative; not from the paper). Reading tip: rows are query tokens, columns are key tokens; darker cells mean higher weight.
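
For readers who want to reproduce a plot like Fig. 4, a minimal matplotlib sketch; the weight matrix here is random and purely illustrative, not taken from a trained model:

    import numpy as np
    import matplotlib.pyplot as plt

    tokens = ["<s>", "The", "cat", "sat", "on", "mat"]
    rng = np.random.default_rng(0)

    # Random scores, softmax-normalized so each query row sums to 1.
    scores = rng.normal(size=(len(tokens), len(tokens)))
    weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)

    fig, ax = plt.subplots()
    ax.imshow(weights, cmap="Greys")          # darker = higher weight
    ax.set_xticks(range(len(tokens)))
    ax.set_xticklabels(tokens)                # keys (columns)
    ax.set_yticks(range(len(tokens)))
    ax.set_yticklabels(tokens)                # queries (rows)
    ax.set_xlabel("Keys")
    ax.set_ylabel("Queries")
    plt.show()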

Attention (simplified equation)

Attention(Q, K, V) = softmax(QKᵀ / √dₖ) · V
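
The equation maps directly onto code; a minimal NumPy sketch with a tiny toy Q, K, V whose values are arbitrary, chosen only to show the shapes:

    import numpy as np

    def attention(Q, K, V):
        """softmax(Q Kᵀ / √d_k) V, with the softmax taken over the key dimension."""
        d_k = Q.shape[-1]
        scores = Q @ K.T / np.sqrt(d_k)
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)   # each row sums to 1
        return weights @ V

    # Toy example: 3 tokens, d_k = 2.
    Q = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
    K = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
    V = np.array([[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]])
    print(attention(Q, K, V))   # each output row is a weighted mix of the rows of V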

Why it scales: parallelism and long-range interaction

Approach       | Interaction                                               | Training parallelism             | Long-range dependencies
RNN/LSTM       | Step-by-step recurrence over time                         | Weak (inherently sequential)     | Long path; signal can decay
CNN            | Stacked local convolutions expand the receptive field     | Strong (parallel within a layer) | Needs more layers to cover long context
Self-Attention | Global token-to-token interaction + weighted aggregation  | Strong (matrix-friendly)         | Direct interaction; short path

Details to watch for while reading

  • Masking: decoder self-attention masks future tokens to preserve autoregressive causality (a small mask sketch follows this list).
  • Cross-Attention: the decoder attends over encoder outputs to selectively read the source sequence.
  • Head split + concat: split the hidden dimension into heads, run attention per head, then concatenate and apply a linear projection.
  • Residual + LayerNorm: crucial for stable training when stacking many layers.
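
On the masking bullet above, a minimal NumPy sketch of a causal (look-ahead) mask: entries for future positions are set to -inf before the softmax, so each query attends only to itself and earlier tokens.

    import numpy as np

    seq_len = 5
    # Upper-triangular mask: True wherever a query would look at a future key.
    future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)

    scores = np.zeros((seq_len, seq_len))     # stand-in for QKᵀ / √dₖ
    scores[future] = -np.inf                  # block attention to future positions

    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    print(np.round(weights, 2))   # lower-triangular: row i spreads weight over tokens 0..i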

Related links: arXiv:1706.03762 · PDF

© 2025 Kael AI · Layout inspired by arXiv.org.