An interactive guide to the architecture powering modern language models
Breaking text into subword units the model can process
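Subword tokenization can be sketched with a toy greedy longest-match splitter. The vocabulary below is invented for illustration; real tokenizers (e.g. BPE) learn their merge rules from data.

```python
# Toy vocabulary for illustration only; real vocabularies hold tens of
# thousands of learned subword pieces.
TOY_VOCAB = {"un", "break", "able"}

def tokenize(word, vocab=TOY_VOCAB):
    """Split a word into the longest matching vocabulary pieces, left to right."""
    tokens = []
    i = 0
    while i < len(word):
        # Try the longest remaining substring first.
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                tokens.append(word[i:j])
                i = j
                break
        else:
            tokens.append(word[i])  # fall back to a single character
            i += 1
    return tokens

print(tokenize("unbreakable"))  # → ['un', 'break', 'able']
```

Even a word the model has never seen whole can be processed as familiar pieces.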
Mapping tokens to numerical vectors that encode meaning
Similar words occupy nearby regions in embedding space.
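"Nearby in embedding space" is usually measured with cosine similarity. The 3-dimensional vectors below are invented for illustration; real models learn embeddings with hundreds to thousands of dimensions.

```python
import math

# Invented toy embeddings; real embeddings are learned during training.
EMB = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.85, 0.82, 0.15],
    "apple": [0.1, 0.2, 0.9],
}

def cosine(u, v):
    """Cosine similarity: 1.0 for identical directions, lower for unrelated ones."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

print(cosine(EMB["king"], EMB["queen"]))  # close to 1: related words
print(cosine(EMB["king"], EMB["apple"]))  # noticeably smaller: unrelated words
```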
The building blocks that make it all work
Each transformer layer repeats this pattern. Modern LLMs stack 32–120+ layers.
Click a token to see its attention pattern. Arc thickness indicates attention weight.
Multiple heads attend to different aspects simultaneously.
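The attention weights visualized above come from scaled dot-product attention. This is a minimal single-head sketch with random toy matrices; a multi-head layer runs several of these in parallel on different learned projections of the same input.

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention for one head. Q, K, V: (seq_len, d_k)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # each query scored against each key
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax: each row sums to 1
    return weights @ V, weights                      # output is a weighted mix of values

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(4, 8)) for _ in range(3))
out, w = attention(Q, K, V)
print(w.sum(axis=-1))  # every token's attention weights sum to 1
```

Row `i` of `w` is exactly what the arc thicknesses depict: how much token `i` attends to every other token.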
Each layer has a feed-forward network. Residual connections carry the original signal forward, preventing information loss in deep networks.
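The feed-forward sub-block with its residual connection can be sketched as follows. The sizes and the 4x hidden expansion are typical choices, not universal, and the weights here are random stand-ins for learned parameters.

```python
import numpy as np

d_model, d_ff = 8, 32  # typical pattern: hidden layer ~4x wider than the model dim
rng = np.random.default_rng(0)
W1 = rng.normal(scale=0.1, size=(d_model, d_ff))
W2 = rng.normal(scale=0.1, size=(d_ff, d_model))

def ffn_block(x):
    """Two-layer feed-forward network with a residual (skip) connection."""
    hidden = np.maximum(0, x @ W1)  # ReLU here; modern models often use GELU/SwiGLU
    return x + hidden @ W2          # residual: the original signal is carried forward

x = rng.normal(size=(4, d_model))
y = ffn_block(x)
print(y.shape)  # (4, 8) — output shape matches input, so blocks stack cleanly
```

Because the output keeps the input's shape and always includes the input itself, dozens of these blocks can be stacked without the original signal washing out.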
Predicting the next word, one token at a time
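One decoding step can be sketched like this: the model emits a logit per vocabulary token, softmax turns the logits into probabilities, and a decoding strategy picks the next token. The vocabulary and logits below are invented for illustration.

```python
import numpy as np

vocab = ["the", "cat", "sat", "mat"]          # toy vocabulary
logits = np.array([2.0, 0.5, 1.0, 0.1])      # invented model outputs for one step

def softmax(z, temperature=1.0):
    """Convert logits to a probability distribution; temperature flattens/sharpens it."""
    z = z / temperature
    e = np.exp(z - z.max())
    return e / e.sum()

probs = softmax(logits)
print(vocab[int(np.argmax(probs))])  # → the  (greedy decoding: highest probability)

rng = np.random.default_rng(0)
choice = rng.choice(len(vocab), p=softmax(logits, temperature=1.5))
print(vocab[choice])                 # sampling with temperature adds diversity
```

The chosen token is appended to the input and the whole process repeats, which is why generation is inherently sequential.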
Two strategies for scaling model capacity
| | Dense | Mixture of Experts |
|---|---|---|
| Active per token | 100% of parameters | ~12–25% |
| Compute cost | Higher | Lower per token |
| Total parameters | Moderate | Much larger |
| Examples | GPT-4o, Claude, Llama | Mixtral, DeepSeek, Grok |
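The "active per token" row comes from top-k routing, sketched below: a router scores every expert for each token, but only the k best actually run, so per-token compute stays low while total parameters grow. All sizes and weights here are toy values.

```python
import numpy as np

n_experts, k, d_model = 8, 2, 16  # toy sizes: 2 of 8 experts active = 25% per token
rng = np.random.default_rng(0)
router_W = rng.normal(scale=0.1, size=(d_model, n_experts))
experts = [rng.normal(scale=0.1, size=(d_model, d_model)) for _ in range(n_experts)]

def moe_layer(x):
    """Route one token's hidden vector to its top-k experts and mix their outputs."""
    scores = x @ router_W                # router logits, one per expert
    top = np.argsort(scores)[-k:]        # only k experts run for this token
    gates = np.exp(scores[top])
    gates /= gates.sum()                 # softmax over the selected experts
    return sum(g * (x @ experts[i]) for g, i in zip(gates, top))

x = rng.normal(size=d_model)
print(moe_layer(x).shape)  # (16,) — full-width output from a fraction of the experts
```

Different tokens select different experts, which is how an MoE model spends a large total parameter count without paying for all of it on every token.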
The staggering resources behind frontier LLMs
Loss decreases as the model trains, with diminishing returns over time.
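The diminishing-returns shape is often well approximated by a power law plus an irreducible floor. The constants below are invented purely to illustrate the curve, not fitted to any real training run.

```python
# Illustrative loss curve: power-law decay toward a floor (constants invented).
def loss(step, a=10.0, b=0.3, floor=1.5):
    return floor + a * step ** -b

for step in [1_000, 10_000, 100_000]:
    print(step, round(loss(step), 2))
# Each 10x increase in training steps buys a smaller absolute drop in loss.
```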