How Transformer LLMs Work

An interactive guide to the architecture powering modern language models

Tokenization

Breaking text into subword units the model can process

Example

"unbelievable" → un · believ · able

Interactive Tokenizer

A simplified tokenizer using ~200 common subword units. Real tokenizers (BPE, WordPiece, SentencePiece) learn vocabularies of 30k–100k+ tokens from training data.
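The core idea can be sketched as greedy longest-match tokenization. The tiny vocabulary below is hypothetical and hand-written purely for illustration; real BPE vocabularies are learned from a corpus:

```python
# Greedy longest-match subword tokenizer (simplified sketch).
# VOCAB is a made-up toy set; real tokenizers learn 30k-100k+ units.
VOCAB = {"un", "believ", "able", "the", "weather", "to", "day", "is"}

def tokenize(word, vocab=VOCAB):
    """Split a word into the longest vocabulary pieces, left to right."""
    tokens, i = [], 0
    while i < len(word):
        # Try the longest remaining substring first.
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                tokens.append(word[i:j])
                i = j
                break
        else:
            tokens.append(word[i])  # fall back to a single character
            i += 1
    return tokens

print(tokenize("unbelievable"))  # → ['un', 'believ', 'able']
```

Real BPE works differently under the hood (it applies learned merge rules rather than longest-match lookup), but the output shape is the same: whole words become sequences of frequent subword pieces.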

Embeddings

Mapping tokens to numerical vectors that encode meaning

Embedding Vectors

Select a word to reveal its vector representation…

Semantic Clustering

Similar words occupy nearby regions in embedding space.
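"Nearby regions" can be made concrete with cosine similarity between embedding vectors. The 4-dimensional vectors below are invented for illustration; real models learn embeddings with hundreds to thousands of dimensions:

```python
import math

# Toy 4-d embeddings (hypothetical values; real ones are learned).
EMB = {
    "cat": [0.9, 0.8, 0.1, 0.0],
    "dog": [0.8, 0.9, 0.2, 0.1],
    "car": [0.1, 0.0, 0.9, 0.8],
}

def cosine(u, v):
    """Cosine similarity: 1.0 for same direction, ~0 for unrelated."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

print(cosine(EMB["cat"], EMB["dog"]))  # high: cat and dog cluster together
print(cosine(EMB["cat"], EMB["car"]))  # low: car sits in a different region
```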

Transformer Architecture

The building blocks that make it all work

Input Embeddings
Multi-Head Self-Attention
Add & Layer Norm
Feed-Forward Network
Add & Layer Norm
Output Probabilities

Each transformer layer repeats this pattern. Modern LLMs stack 32–120+ layers.
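The block sequence above can be written as pseudocode; names like `multi_head_attention` and `feed_forward` are placeholders for the components covered in the sections that follow:

```
# pseudocode: one transformer layer, repeated n_layers times
def transformer_layer(x):
    x = layer_norm(x + multi_head_attention(x))   # Add & Layer Norm
    x = layer_norm(x + feed_forward(x))           # Add & Layer Norm
    return x

def forward(tokens, n_layers):
    x = embed(tokens)                  # input embeddings
    for _ in range(n_layers):          # modern LLMs stack 32-120+ layers
        x = transformer_layer(x)
    return softmax(unembed(x))         # output probabilities
```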

Self-Attention

Click a token to see its attention pattern. Arc thickness indicates attention weight.

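Behind the visualization, each token's query vector is scored against every key, the scores are softmaxed into weights (the arc thicknesses), and the weights mix the value vectors. A minimal sketch with made-up 2-d vectors:

```python
import math

def softmax(xs):
    """Turn raw scores into weights that sum to 1."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention. Q, K, V: one vector per token."""
    d = len(K[0])
    out = []
    for q in Q:
        # Similarity of this query with every key, scaled by sqrt(d).
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in K]
        weights = softmax(scores)  # the attention weights (arc thickness)
        # Each output is a weighted mix of the value vectors.
        out.append([sum(w * v[i] for w, v in zip(weights, V))
                    for i in range(len(V[0]))])
    return out

# Three toy tokens with hypothetical 2-d vectors:
Q = K = V = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
print(attention(Q, K, V))
```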

Multi-Head Attention

Multiple heads attend to different aspects simultaneously.

Head 1 · syntax
Head 2 · semantics
Head 3 · position
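Mechanically, each head operates on its own slice of the model dimension: token vectors are split into per-head slices, attention runs independently in each slice, and the per-head outputs are concatenated back. A sketch of just the split and merge, with made-up numbers (the attention step itself is shown in the previous section):

```python
def split_heads(x, n_heads):
    """Split each token vector into n_heads smaller vectors."""
    d = len(x[0]) // n_heads
    return [[tok[h * d:(h + 1) * d] for tok in x] for h in range(n_heads)]

def merge_heads(heads):
    """Concatenate per-head outputs back into full token vectors."""
    n_tokens = len(heads[0])
    return [[v for head in heads for v in head[t]] for t in range(n_tokens)]

x = [[1, 2, 3, 4], [5, 6, 7, 8]]    # 2 tokens, d_model = 4
heads = split_heads(x, n_heads=2)   # each head sees 2 of the 4 dims
# heads[0] == [[1, 2], [5, 6]], heads[1] == [[3, 4], [7, 8]]
assert merge_heads(heads) == x      # merging restores the full vectors
```

Because the heads see different subspaces (and have different learned projections in a real model), they are free to specialize, e.g. in syntax, semantics, or position as in the demo above.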

Feed-Forward & Residual Connections

Each layer contains a position-wise feed-forward network applied to every token independently. Residual (skip) connections carry the original signal forward, preventing information loss in deep networks.

Input
Linear + GELU
Linear
Add & Norm
skip connection
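The diagram above can be sketched in plain Python. The weight matrices here are tiny and hypothetical (real layers use learned weights, and the hidden size is typically ~4× the model dimension); the LayerNorm step is omitted for brevity:

```python
import math

def gelu(x):
    """GELU activation (tanh approximation)."""
    return 0.5 * x * (1 + math.tanh(math.sqrt(2 / math.pi) * (x + 0.044715 * x ** 3)))

def feed_forward(x, W1, b1, W2, b2):
    """Linear -> GELU -> Linear, applied to one token vector."""
    h = [gelu(sum(xi * w for xi, w in zip(x, col)) + b)
         for col, b in zip(W1, b1)]
    return [sum(hi * w for hi, w in zip(h, col)) + b
            for col, b in zip(W2, b2)]

def ffn_block(x, W1, b1, W2, b2):
    """Feed-forward sub-block with its residual connection."""
    y = feed_forward(x, W1, b1, W2, b2)
    return [a + b for a, b in zip(x, y)]  # Add: the skip connection

# Made-up weights: d_model = 2, hidden = 4.
W1 = [[0.5, -0.5], [1.0, 0.0], [0.0, 1.0], [-1.0, 1.0]]
b1 = [0.0, 0.0, 0.0, 0.0]
W2 = [[0.25, 0.25, 0.25, 0.25], [0.1, -0.1, 0.1, -0.1]]
b2 = [0.0, 0.0]

print(ffn_block([1.0, -1.0], W1, b1, W2, b2))
```

The final addition is the point of the residual path: even if `feed_forward` outputs something unhelpful early in training, the original `x` still flows through unchanged.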

Token Generation

Predicting the next word, one token at a time

Example: "The weather today is" → top next-token probability 0.7
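Sampling turns a probability distribution into one concrete token. The distribution below is made up for illustration (only the 0.7 top probability comes from the example above; the candidate words are hypothetical):

```python
import random

# Hypothetical next-token distribution after "The weather today is":
probs = {"sunny": 0.7, "cold": 0.2, "purple": 0.1}

def sample(probs, rng=random.random):
    """Sample one token in proportion to its probability."""
    r, acc = rng(), 0.0
    for token, p in probs.items():
        acc += p
        if r < acc:
            return token
    return token  # safety net against floating-point rounding

print(sample(probs))  # usually "sunny", occasionally "cold" or "purple"
```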

Autoregressive Loop

Input tokens
Transformer
Probabilities
Sample & append
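The loop above, sketched with a stand-in "model": here a lookup table conditioned only on the previous token, whereas a real transformer conditions on the entire sequence so far. All tokens and probabilities are invented:

```python
import random

# Toy stand-in for the transformer: previous token -> next-token probs.
TOY_MODEL = {
    "is":    {"sunny": 0.7, "cold": 0.3},
    "sunny": {"and": 0.6, ".": 0.4},
    "cold":  {"and": 0.5, ".": 0.5},
    "and":   {"clear": 1.0},
    "clear": {".": 1.0},
}

def generate(tokens, max_new=5, seed=0):
    """Autoregressive loop: predict, sample, append, repeat."""
    rng = random.Random(seed)
    for _ in range(max_new):
        probs = TOY_MODEL.get(tokens[-1])
        if probs is None:
            break
        # Sample the next token and append it to the context.
        next_tok = rng.choices(list(probs), weights=list(probs.values()))[0]
        tokens = tokens + [next_tok]
        if next_tok == ".":
            break
    return tokens

print(generate(["The", "weather", "today", "is"]))
```

Each new token is fed back in as input for the next step; that feedback is what "autoregressive" means.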

Dense vs Mixture of Experts

Two strategies for scaling model capacity

Dense

Every parameter is active for every token.

Mixture of Experts

A router selects which experts process each token; only the chosen experts run.
|                  | Dense                 | Mixture of Experts      |
|------------------|-----------------------|-------------------------|
| Active per token | 100% of parameters    | ~12–25%                 |
| Compute cost     | Higher                | Lower per token         |
| Total parameters | Moderate              | Much larger             |
| Examples         | GPT-4o, Claude, Llama | Mixtral, DeepSeek, Grok |
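The router can be sketched as top-k gating: score every expert for the current token, keep the k best, and renormalize their scores into mixing weights. The gate scores below are made up; real routers compute them with a small learned layer:

```python
import math

def top_k_route(scores, k=2):
    """Return the k highest-scoring expert indices with softmax weights."""
    chosen = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    exps = [math.exp(scores[i]) for i in chosen]
    total = sum(exps)
    # Only these k experts run; their outputs are mixed by these weights.
    return [(i, e / total) for i, e in zip(chosen, exps)]

# 8 experts; this token is routed to the 2 with the highest gate scores:
print(top_k_route([0.1, 2.0, -1.0, 0.5, 1.5, 0.0, -0.5, 0.3], k=2))
```

With 8 experts and k=2, only a quarter of the expert parameters touch any given token, which is how MoE models keep per-token compute low while growing total capacity.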

Scale & Training

The staggering resources behind frontier LLMs

[Animated counters: Parameters · Training Tokens · GPUs]

Training Pipeline

Data Collection
Pre-training
Fine-tuning / RLHF
Deployment

Training Loss

Loss decreases as the model trains, with diminishing returns over time.
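The diminishing returns are often summarized with a power-law curve of the form L(t) ≈ a·t^(-b) + c. The constants below are invented purely to illustrate the shape, not fitted to any real run:

```python
def loss(step, a=5.0, b=0.3, c=1.5):
    """Illustrative power-law loss curve (made-up constants)."""
    return a * step ** -b + c

# The same 10x increase in steps buys a smaller loss drop later on:
drop_early = loss(100) - loss(1_000)
drop_late = loss(10_000) - loss(100_000)
print(drop_early, drop_late)
```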

[Chart: training loss vs. training steps]