An interactive guide to the architecture powering modern language models
Breaking text into subword units the model can process
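Subword tokenization can be sketched with a toy greedy longest-match splitter. The vocabulary below is invented for illustration; real tokenizers (e.g. BPE) learn their merge rules from data.

```python
# Toy vocabulary for illustration only; real vocabularies hold tens of
# thousands of learned subword pieces.
TOY_VOCAB = {"un", "break", "able"}

def tokenize(word, vocab=TOY_VOCAB):
    """Split a word into the longest matching vocabulary pieces, left to right."""
    tokens = []
    i = 0
    while i < len(word):
        # Try the longest remaining substring first.
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                tokens.append(word[i:j])
                i = j
                break
        else:
            tokens.append(word[i])  # fall back to a single character
            i += 1
    return tokens

print(tokenize("unbreakable"))  # → ['un', 'break', 'able']
```

Even a word the model has never seen whole can be processed as familiar pieces.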
Mapping tokens to numerical vectors that encode meaning
Similar words occupy nearby regions in embedding space.
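"Nearby in embedding space" is usually measured with cosine similarity. The 3-dimensional vectors below are invented for illustration; real models learn embeddings with hundreds to thousands of dimensions.

```python
import math

# Invented toy embeddings; real embeddings are learned during training.
EMB = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.85, 0.82, 0.15],
    "apple": [0.1, 0.2, 0.9],
}

def cosine(u, v):
    """Cosine similarity: 1.0 for identical directions, lower for unrelated ones."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

print(cosine(EMB["king"], EMB["queen"]))  # close to 1: related words
print(cosine(EMB["king"], EMB["apple"]))  # noticeably smaller: unrelated words
```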
The building blocks that make it all work
Each transformer layer repeats this pattern. Modern LLMs stack 32–120+ layers.
Click a token to see its attention pattern. Arc thickness indicates attention weight.
Multiple heads attend to different aspects simultaneously.
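The attention weights visualized above come from scaled dot-product attention. This is a minimal single-head sketch with random toy matrices; a multi-head layer runs several of these in parallel on different learned projections of the same input.

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention for one head. Q, K, V: (seq_len, d_k)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # each query scored against each key
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax: each row sums to 1
    return weights @ V, weights                      # output is a weighted mix of values

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(4, 8)) for _ in range(3))
out, w = attention(Q, K, V)
print(w.sum(axis=-1))  # every token's attention weights sum to 1
```

Row `i` of `w` is exactly what the arc thicknesses depict: how much token `i` attends to every other token.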
Each layer has a feed-forward network. Residual connections carry the original signal forward, preventing information loss in deep networks.
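The feed-forward sub-block with its residual connection can be sketched as follows. The sizes and the 4x hidden expansion are typical choices, not universal, and the weights here are random stand-ins for learned parameters.

```python
import numpy as np

d_model, d_ff = 8, 32  # typical pattern: hidden layer ~4x wider than the model dim
rng = np.random.default_rng(0)
W1 = rng.normal(scale=0.1, size=(d_model, d_ff))
W2 = rng.normal(scale=0.1, size=(d_ff, d_model))

def ffn_block(x):
    """Two-layer feed-forward network with a residual (skip) connection."""
    hidden = np.maximum(0, x @ W1)  # ReLU here; modern models often use GELU/SwiGLU
    return x + hidden @ W2          # residual: the original signal is carried forward

x = rng.normal(size=(4, d_model))
y = ffn_block(x)
print(y.shape)  # (4, 8) — output shape matches input, so blocks stack cleanly
```

Because the output keeps the input's shape and always includes the input itself, dozens of these blocks can be stacked without the original signal washing out.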
Predicting the next word, one token at a time
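One decoding step can be sketched like this: the model emits a logit per vocabulary token, softmax turns the logits into probabilities, and a decoding strategy picks the next token. The vocabulary and logits below are invented for illustration.

```python
import numpy as np

vocab = ["the", "cat", "sat", "mat"]          # toy vocabulary
logits = np.array([2.0, 0.5, 1.0, 0.1])      # invented model outputs for one step

def softmax(z, temperature=1.0):
    """Convert logits to a probability distribution; temperature flattens/sharpens it."""
    z = z / temperature
    e = np.exp(z - z.max())
    return e / e.sum()

probs = softmax(logits)
print(vocab[int(np.argmax(probs))])  # → the  (greedy decoding: highest probability)

rng = np.random.default_rng(0)
choice = rng.choice(len(vocab), p=softmax(logits, temperature=1.5))
print(vocab[choice])                 # sampling with temperature adds diversity
```

The chosen token is appended to the input and the whole process repeats, which is why generation is inherently sequential.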
Two strategies for scaling model capacity
| | Dense | Mixture of Experts |
|---|---|---|
| Active per token | 100% of parameters | ~12–25% |
| Compute cost | Higher | Lower per token |
| Total parameters | Moderate | Much larger |
| Examples | GPT-4o, Claude, Llama | Mixtral, DeepSeek, Grok |
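The "active per token" row comes from top-k routing, sketched below: a router scores every expert for each token, but only the k best actually run, so per-token compute stays low while total parameters grow. All sizes and weights here are toy values.

```python
import numpy as np

n_experts, k, d_model = 8, 2, 16  # toy sizes: 2 of 8 experts active = 25% per token
rng = np.random.default_rng(0)
router_W = rng.normal(scale=0.1, size=(d_model, n_experts))
experts = [rng.normal(scale=0.1, size=(d_model, d_model)) for _ in range(n_experts)]

def moe_layer(x):
    """Route one token's hidden vector to its top-k experts and mix their outputs."""
    scores = x @ router_W                # router logits, one per expert
    top = np.argsort(scores)[-k:]        # only k experts run for this token
    gates = np.exp(scores[top])
    gates /= gates.sum()                 # softmax over the selected experts
    return sum(g * (x @ experts[i]) for g, i in zip(gates, top))

x = rng.normal(size=d_model)
print(moe_layer(x).shape)  # (16,) — full-width output from a fraction of the experts
```

Different tokens select different experts, which is how an MoE model spends a large total parameter count without paying for all of it on every token.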
The staggering resources behind frontier LLMs
Loss decreases as the model trains, with diminishing returns over time.
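The diminishing-returns shape is often well approximated by a power law plus an irreducible floor. The constants below are invented purely to illustrate the curve, not fitted to any real training run.

```python
# Illustrative loss curve: power-law decay toward a floor (constants invented).
def loss(step, a=10.0, b=0.3, floor=1.5):
    return floor + a * step ** -b

for step in [1_000, 10_000, 100_000]:
    print(step, round(loss(step), 2))
# Each 10x increase in training steps buys a smaller absolute drop in loss.
```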