What actually happens inside a GPT?

GPT stands for Generative Pre-trained Transformer: a model that generates text, is trained ahead of time on lots of examples, and uses a design called a transformer. Every GPT works the same way: it reads everything written so far, predicts the next token (a single letter in our case), appends that guess, and repeats. A whole word is just that one step repeated until the model predicts a stop. This lab runs a real (but tiny) character-level transformer. It belongs to the same core architecture family as GPTs, and it learns to do this from 1,727 actual dinosaur names. Real GPTs use subword tokens (pieces of words) rather than single letters, but the underlying ideas are the same.
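The read, predict, append, repeat loop can be sketched in a few lines of JavaScript. This is a minimal sketch, not the lab's code: `toyPredict`, `toyTable`, and `END` are hypothetical stand-ins, and the lookup table replaces what would really be a transformer producing probabilities.

```javascript
// Hypothetical stand-in for the model: maps the text so far to the next token.
// A real GPT would compute probabilities here; this table is only for illustration.
const END = "<END>";
const toyTable = new Map([["", "r"], ["r", "e"], ["re", "x"], ["rex", END]]);

function toyPredict(context) {
  return toyTable.get(context) ?? END;
}

// Read everything so far, predict one token, append it, repeat until END.
function generate(predict, maxLen = 16) {
  let text = "";
  while (text.length < maxLen) {
    const next = predict(text);
    if (next === END) break;
    text += next;
  }
  return text;
}

console.log(generate(toyPredict)); // prints "rex"
```

The `maxLen` cap plays the role of a position limit: generation stops there even if the predictor never emits a stop.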

Building the name "rex" one prediction at a time. Each row feeds in everything written so far, and the model predicts the next character. START marks the empty beginning; an END prediction finishes the name.
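The rows described here can be listed programmatically. The sketch below is illustrative only: `predictionRows` is a hypothetical helper, and START and END are display labels (the lab encodes both with a single [BOS] token).

```javascript
// List every (context, target) row for one name.
// START and END are display labels; the lab uses one [BOS] token for both roles.
function predictionRows(name) {
  const rows = [];
  for (let i = 0; i <= name.length; i++) {
    const seen = name.slice(0, i).split("");
    rows.push({
      context: ["START", ...seen].join(" "),
      target: i < name.length ? name[i] : "END",
    });
  }
  return rows;
}

for (const row of predictionRows("rex")) {
  console.log(`${row.context} -> ${row.target}`);
}
// START -> r
// START r -> e
// START r e -> x
// START r e x -> END
```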

Minimal: just the training loop, the live predictions, and the attention map. Everything else is hidden.
Focused: key concepts only. Switch to Advanced for the full forward pass and deeper notes.
Advanced: full lab with the forward pass diagram and deeper notes on every step.

Overview

The Whole Journey

Before the details, here is the entire pipeline. The lab walks through these five steps in order, each one bringing a pile of raw dinosaur names a little closer to a model that can invent brand-new ones.

Input · 1,727 dinosaur names

1 · Patterns. Spot the structure hidden in the names: which letters start them, how long they tend to run.

2 · Tokens & Embeddings. Turn each letter into a vector of numbers the model can actually compute with.

3 · Training. Nudge the weights over and over so the next-letter guesses keep getting less wrong.

4 · Forward Pass (Advanced). Push one input through every weight matrix in the network, in order.

5 · Predictions & Attention. Let each letter weigh the ones before it, then score every possible next letter and sample one to spell a new name.

Output · Velthysaurus

Step 1 · Patterns in the data

What the model has to figure out on its own

The model gets no rules. It just sees 1,727 real dinosaur names and tries to predict the next letter.

Watch the bars below. Some letters start names much more often than others, and most names are 8 to 14 letters long. The model has to figure all of this out on its own.

The model gets no rules. No Greek roots, no naming conventions. It simply sees 1,727 real dinosaur names and tries to predict the next letter.

Right away, two patterns emerge:

Some letters (like A or T) start names much more often than others.
Most names are 8 to 14 letters long.

Watch the glowing bars as names cycle. They show where the current name fits.
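Both bar charts are simple counts. The sketch below shows one way to compute them; `sampleNames` is a small illustrative list, not the lab's 1,727-name dataset, and the function names are hypothetical.

```javascript
// A tiny illustrative sample; the lab's full dataset is not reproduced here.
const sampleNames = [
  "tyrannosaurus", "triceratops", "allosaurus",
  "ankylosaurus", "velociraptor", "stegosaurus",
];

// Count how often each letter starts a name.
function firstLetterCounts(names) {
  const counts = {};
  for (const name of names) {
    const first = name[0];
    counts[first] = (counts[first] || 0) + 1;
  }
  return counts;
}

// Count how many names have each length (in characters).
function lengthCounts(names) {
  const counts = {};
  for (const name of names) {
    counts[name.length] = (counts[name.length] || 0) + 1;
  }
  return counts;
}

console.log(firstLetterCounts(sampleNames));
console.log(lengthCounts(sampleNames));
```

The model never sees these tallies directly. They are only a human-readable summary of regularities it has to pick up from next-letter prediction.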

…

How often each letter starts a name

How long the names are (characters)

Step 2 · Tokens & Embeddings

Computers process numbers, not letters

Computers only understand numbers. So the model first turns every letter into a number. This is called tokenization.

Then it learns a short "flavor profile" (called an embedding) for each letter, and another one for each position in the name. The tables below (wte for letters, wpe for positions) hold these learned profiles. Watch them change as training runs. You are watching the model learn what each letter means.

Computers don't understand letters. Only numbers. So the model first turns every letter into a number. This is called tokenization.

Here's how it works in this lab:

Every letter a-z gets its own number (0 to 25).
One special number called [BOS] marks the beginning and end of each name.

That's it. The model now sees only numbers.
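As a rough sketch of this mapping (a-z to 0-25, with [BOS] as 26 at both ends), the hypothetical helpers `encode` and `decode` below are not the lab's own functions, just one straightforward way to write them.

```javascript
const BOS = 26; // the one special token; a-z take 0-25

// Turn a lowercase name into token IDs, bookended by BOS.
function encode(name) {
  const ids = [BOS];
  for (const ch of name) {
    const id = ch.charCodeAt(0) - 97; // 97 is the char code of "a"
    if (id < 0 || id > 25) throw new Error(`unsupported character: ${ch}`);
    ids.push(id);
  }
  ids.push(BOS);
  return ids;
}

// Turn token IDs back into text, dropping the BOS markers.
function decode(ids) {
  return ids
    .filter((id) => id !== BOS)
    .map((id) => String.fromCharCode(97 + id))
    .join("");
}

console.log(encode("rex")); // [ 26, 17, 4, 23, 26 ]
console.log(decode(encode("rex"))); // rex
```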

Numbers alone mean nothing. So the model learns a short "flavor profile" for each letter (technically called an embedding, 32 numbers long). These profiles live in a table called wte. Letters that behave similarly in dinosaur names end up with similar profiles, just like similar foods end up with similar taste profiles.

The model also learns a second table called wpe that tells it where each letter sits in the name. Every position (0 through 17) gets its own profile of 32 numbers. The model adds the letter profile to the position profile, so the input becomes a mix of identity and location.

The display below shows the real wte and wpe tables side by side. Amber bars are positive values, red bars are negative. The bright rows are the letters and positions in the current dinosaur. Watch both tables change as training runs. You are literally watching the model learn what each letter means and what each position contributes.

Note: wte (input token embeddings) and lm_head (final output projection) are learned separately in this lab. Many small GPTs tie them (share the same matrix), which avoids storing a second vocab × 32 table and usually costs little or nothing in quality.

Tokenization. Before any learning happens, every character is swapped for a fixed ID: a–z become 0–25, and one extra marker [BOS] = 26 bookends each name. From here on the model sees only these 27 numbers, never letters.

Embedding. The token ID picks a row of wte (what the letter is); the position picks a row of wpe (where it sits). Adding them yields one vector carrying both identity and location. Amber bars are positive numbers, red are negative. These tables start as random noise and are shaped entirely by training.
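The lookup-and-add step might look like the sketch below. The table sizes (27 tokens, 18 positions, 32 dimensions) come from the text; the random initialization scale and the names `randomTable` and `embed` are assumptions for illustration.

```javascript
const VOCAB = 27, BLOCK = 18, DIM = 32; // sizes taken from the text

// Build a rows x cols table of small random numbers (the init scale is an assumption).
function randomTable(rows, cols, scale = 0.02) {
  return Array.from({ length: rows }, () =>
    Array.from({ length: cols }, () => (Math.random() * 2 - 1) * scale));
}

const wte = randomTable(VOCAB, DIM); // one row per token
const wpe = randomTable(BLOCK, DIM); // one row per position

// Look up both rows and add them element by element.
function embed(tokenId, position) {
  return wte[tokenId].map((v, i) => v + wpe[position][i]);
}

const vector = embed(17, 1); // the letter "r" at position 1
console.log(vector.length); // 32
```

Because the two rows are added rather than concatenated, the result stays 32 numbers long no matter which letter or position is chosen.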

Tokens, wte, and wpe (each vector is 32-dim) for the current name (amber = positive, red = negative)

wte · token embedding table (vocab × 32)

Each row = one character (a token). Each cell = one of the 32 learned numbers in that character’s embedding. The brightened rows belong to letters in the current dinosaur.

wpe · position embedding table (block × 32)

Each row = one position in the sequence (0 to 17). Each cell = one of the 32 learned numbers for that position. The brightened rows are the positions the current name occupies.

Step 3 · Training

The model improves by correcting its errors

The model guesses the next letter before it sees it. Then it checks how wrong the guess was. That difference is the loss (error). The model nudges its numbers to make the error smaller. Repeat thousands of times and patterns appear out of random noise.

Training works like this. The model guesses the next letter before it sees it. The difference between the guess and the real letter is called loss (error). The model then slightly adjusts every number (weight) to make that error smaller next time. Do this thousands of times and patterns emerge from random noise.

Press start training and watch the loss curve fall. Hover the curve to see the average loss in each 100-step window. Anything below the dashed line beats pure random guessing.
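The random-guessing baseline has a simple value if the loss is the usual cross-entropy measured in nats. That is an assumption here, since the page does not name its loss function. Under that assumption, guessing uniformly among 27 tokens costs ln 27, about 3.30.

```javascript
// Cross-entropy loss for one prediction: the negative log of the probability
// the model assigned to the correct token. (Assumed loss; not confirmed by the page.)
function crossEntropy(probs, targetId) {
  return -Math.log(probs[targetId]);
}

const VOCAB = 27;
const uniform = new Array(VOCAB).fill(1 / VOCAB); // pure random guessing
const baseline = crossEntropy(uniform, 4);        // same value for any target

console.log(baseline.toFixed(3)); // 3.296
```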

The training loop. The model guesses, measures how wrong it was (the loss), traces that error backward to find which weights were responsible, and nudges each one slightly to do better next time. No rules are ever written down. The loss curve below plots the second of those stages, the measured loss, over thousands of repetitions.
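A deliberately tiny sketch of guess, measure, trace, nudge follows. It treats three raw scores (logits) as the only "weights" and assumes a softmax with cross-entropy loss; the lab's real model backpropagates through every matrix, which this does not attempt to show. The names `trainStep` and `loss` are hypothetical.

```javascript
// Softmax: turn raw scores (logits) into probabilities that sum to 1.
function softmax(logits) {
  const max = Math.max(...logits); // subtract the max for numerical stability
  const exps = logits.map((z) => Math.exp(z - max));
  const sum = exps.reduce((a, b) => a + b, 0);
  return exps.map((e) => e / sum);
}

// Cross-entropy loss for one target token.
function loss(logits, target) {
  return -Math.log(softmax(logits)[target]);
}

// One training step on the logits themselves.
// For softmax + cross-entropy, d(loss)/d(logit_i) = prob_i - (i === target ? 1 : 0).
function trainStep(logits, target, learningRate = 0.5) {
  const probs = softmax(logits);
  return logits.map((z, i) => z - learningRate * (probs[i] - (i === target ? 1 : 0)));
}

let logits = [0, 0, 0]; // three possible tokens, no preference yet
const before = loss(logits, 2);
for (let step = 0; step < 20; step++) logits = trainStep(logits, 2);
const after = loss(logits, 2);
console.log(before > after); // true
```

Each step raises the score of the correct token and lowers the others in proportion to how much probability they wrongly received.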

Training controls

Jump to step:

Untrained. Press start to teach it.

Each point is the loss for one name. The curve is noisy because some names are easier than others. Hover to see the 100-step rolling average.

Training Comparison

How training transforms gibberish into names

Capture the model at two different training points. Then generate fresh names from each. Try the temperature buttons to make the names safer or wilder, and watch creativity emerge from random noise.

Capture the model's weights at two different points during training, then sample fresh names from each. Each click samples with a small amount of randomness, so the same model produces different names every time. That randomness is exactly how a GPT shows creativity.
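Sampling "with a small amount of randomness" usually means drawing from the predicted probabilities instead of always taking the top one. Below is a minimal sketch; `sampleIndex` is a hypothetical helper, and the random source is passed in so the example is repeatable.

```javascript
// Pick an index at random, where index i is chosen with probability probs[i].
// `rand` is injectable so the example can be made repeatable.
function sampleIndex(probs, rand = Math.random) {
  const r = rand();
  let cumulative = 0;
  for (let i = 0; i < probs.length; i++) {
    cumulative += probs[i];
    if (r < cumulative) return i;
  }
  return probs.length - 1; // guard against floating-point rounding
}

const probs = [0.7, 0.2, 0.1];
console.log(sampleIndex(probs, () => 0.5));  // 0
console.log(sampleIndex(probs, () => 0.75)); // 1
console.log(sampleIndex(probs, () => 0.95)); // 2
```

With the default `Math.random`, the same probabilities yield different picks on different calls, which is why one snapshot can produce many different names.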

Snapshot A

Source: untrained random weights

Click Generate to produce a name.

Snapshot B

Source: untrained random weights

Click Generate to produce a name.

Temperature

Low plays it safe. High takes wild risks.

Each click generates one new name per snapshot at the chosen temperature. Pick a source for each side: an untrained baseline, a step-jump checkpoint (if loaded), or capture current to snapshot the model right now.

Step 4 · Forward Pass

Every weight matrix the model uses, in order

This diagram shows every step the model takes to predict the next letter. The last letter of the current dinosaur enters at the top. Follow the arrows downward. Each amber box is one set of learned numbers (a weight matrix). The shapes are real: wide boxes really are wide, tall boxes really are tall. Don't worry about understanding every box yet. Just watch how the patterns inside them become clearer as training runs.

Two layers repeat the same RMSNorm → attention → residual → RMSNorm → feed-forward → residual block. Real GPTs add more layers; the block is identical. Note: this lab's RMSNorm is parameter-free (no learnable gain γ) for simplicity; most real implementations include one.
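The block wiring can be sketched as follows. The parameter-free RMSNorm matches the note above; the epsilon value is an assumption, and `attention` and `feedForward` are placeholders passed in by the caller. For brevity the sketch handles one position's vector, whereas real attention also mixes information across positions.

```javascript
// Parameter-free RMSNorm: rescale the vector so its root-mean-square is about 1.
// The small epsilon (an assumed value) avoids division by zero.
function rmsnorm(x, eps = 1e-5) {
  const meanSquare = x.reduce((sum, v) => sum + v * v, 0) / x.length;
  const scale = 1 / Math.sqrt(meanSquare + eps);
  return x.map((v) => v * scale);
}

// Add two vectors element by element (the residual connection).
function add(a, b) {
  return a.map((v, i) => v + b[i]);
}

// One block: RMSNorm -> attention -> residual -> RMSNorm -> feed-forward -> residual.
// `attention` and `feedForward` are placeholders supplied by the caller.
function block(x, attention, feedForward) {
  x = add(x, attention(rmsnorm(x)));
  x = add(x, feedForward(rmsnorm(x)));
  return x;
}

// Usage with stand-in sublayers that return zeros, so the block leaves x unchanged.
const zeros = (v) => v.map(() => 0);
console.log(block([3, 4], zeros, zeros)); // [ 3, 4 ]
```

The residual additions mean each sublayer only has to learn a correction to its input, not a full replacement for it.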

Forward pass · current input shown live

Step 5 · Predictions and Attention

What the trained model actually does with an input

The model produces two things you can actually see:

Next-letter probabilities. Which letter the model thinks should come next. Tallest bar is the top guess.
Attention map. Which earlier letters the model relied on most. Brighter means paid more attention.

Type a name below to watch both update in real time. The attention map uses 4 heads, which are 4 separate looks back at the input. Each head can focus on different earlier letters.

Type a name to feed in (drives both panels below)

Using cycled name: …

Lowercase a–z only (up to 16 chars, the model's position limit). Leave blank to keep using the auto-cycled dinosaur from Steps 1 and 2.

Temperature 1.00 low = confident · high = adventurous

Temperature reshapes the probability bars on the left before a letter is picked: divide every score by it, then re-normalize. Below 1.0 sharpens toward the single top guess; above 1.0 flattens the field so unlikely letters get a real chance. It does not touch the attention map. This is the knob that tunes a GPT between safe and creative.
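The divide-then-renormalize step can be sketched like this, assuming the "scores" are the raw pre-softmax outputs (logits). The function name `applyTemperature` and the example scores are made up for illustration.

```javascript
// Softmax: turn raw scores into probabilities that sum to 1.
function softmax(logits) {
  const max = Math.max(...logits); // subtract the max for numerical stability
  const exps = logits.map((z) => Math.exp(z - max));
  const sum = exps.reduce((a, b) => a + b, 0);
  return exps.map((e) => e / sum);
}

// Divide every score by the temperature, then re-normalize with softmax.
function applyTemperature(logits, temperature) {
  return softmax(logits.map((z) => z / temperature));
}

const logits = [2.0, 1.0, 0.0]; // made-up scores for three letters
const cold = applyTemperature(logits, 0.5);
const neutral = applyTemperature(logits, 1.0);
const hot = applyTemperature(logits, 2.0);

// The top guess gets more probability when cold and less when hot.
console.log(cold[0] > neutral[0] && neutral[0] > hot[0]); // true
```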

Next-character probabilities (train first)

Attention map (averaged across both layers)

Train the model first, then this map fills in for the name fed in above.

Multi-head attention. The same input is examined by 4 heads at once, each free to focus on different earlier letters (one might track the previous letter, another the first). Their results are combined into a single output. The tabs above the live map let you inspect each head on its own.
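A rough sketch of what one head computes is shown below. It assumes the standard scaled dot-product form and an equal split of the 32 dimensions into 4 heads of 8, neither of which the page states explicitly. All function names are hypothetical.

```javascript
function softmax(scores) {
  const max = Math.max(...scores); // subtract the max for numerical stability
  const exps = scores.map((s) => Math.exp(s - max));
  const sum = exps.reduce((a, b) => a + b, 0);
  return exps.map((e) => e / sum);
}

function dot(a, b) {
  return a.reduce((sum, v, i) => sum + v * b[i], 0);
}

// Attention weights for ONE head at the current position.
// `keys` holds one key vector per position seen so far, so no future letter is visible.
function attentionWeights(query, keys) {
  const scale = Math.sqrt(query.length);
  return softmax(keys.map((k) => dot(query, k) / scale));
}

// The head's output: a weighted average of the value vectors.
function attend(weights, values) {
  return values[0].map((_, d) =>
    weights.reduce((sum, w, t) => sum + w * values[t][d], 0));
}

// Split a vector into equal heads, e.g. 32 dims / 4 heads = 8 dims per head.
function splitHeads(vector, numHeads) {
  const size = vector.length / numHeads;
  return Array.from({ length: numHeads }, (_, h) =>
    vector.slice(h * size, (h + 1) * size));
}

const keys = [[1, 0], [0, 1], [1, 1]];
const weights = attentionWeights([1, 1], keys);
console.log(weights.indexOf(Math.max(...weights))); // 2
```

One row of the attention map corresponds to one `weights` array: a set of non-negative numbers over earlier positions that sums to 1.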

Step 6 · Head Specialization

Head Specialization Analysis

The map above shows what one head does on one name. But each head learns a job, and that job only shows up across many names. Here the four heads are measured over 80 dinosaur names at once, so you can see what each one actually specialized in.

Nobody told the heads to divide the work. They were never assigned roles. Any split you see below is something the model invented on its own to lower its error. Train more and the split usually sharpens.

Run the analysis

Each head is averaged across both layers, matching the map above.

Train the model first, then press Analyze 80 names. The table fills in below.

How each number is measured

Every one of the 80 names is fed through the trained model one letter at a time. At each prediction the four heads each produce an attention distribution over the letters seen so far. The values below are averages of those distributions, with each head pooled across both layers.

Positional (BOS). Average attention landing on position 0, the opening BOS marker, measured only when there is at least one real letter to compete with it. High means the head anchors to the start of the name.
Previous token. Average attention on the letter immediately before the current one. High means the head tracks local letter-to-letter transitions.
Vowel bias. Of all the letter-to-letter attention, the share that lands on a vowel. Compared against the plain vowel rate in the data to decide whether the head leans vowel, consonant, or neither.
Entropy. Average spread of the attention, in bits. Lower is sharper (the head looks at fewer letters). Higher means it spreads attention around.
Copy score. When the next letter has already appeared earlier in the name, how much attention the head puts on that earlier copy. High suggests copy or repetition behavior.
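Two of these measurements are sketched below: the BOS share and the entropy in bits. The sketch assumes each attention row is an array over positions 0..t with BOS at index 0; `headMetrics` and `entropyBits` are hypothetical names, and the remaining metrics would follow the same averaging pattern.

```javascript
// Shannon entropy of one attention distribution, in bits.
function entropyBits(dist) {
  let h = 0;
  for (const p of dist) {
    if (p > 0) h -= p * Math.log2(p);
  }
  return h;
}

const mean = (values) => values.reduce((s, v) => s + v, 0) / values.length;

// Average metrics over many attention rows from one head.
// Assumes index 0 of every row is the BOS position.
function headMetrics(rows) {
  // BOS attention is only counted when at least one real letter competes with it.
  const withLetters = rows.filter((row) => row.length >= 2);
  return {
    bos: mean(withLetters.map((row) => row[0])),
    entropy: mean(rows.map((row) => entropyBits(row))),
  };
}

const rows = [[1], [0.5, 0.5], [0.25, 0.25, 0.5]]; // made-up attention rows
const metrics = headMetrics(rows);
console.log(metrics.bos);                // 0.375
console.log(metrics.entropy.toFixed(3)); // 0.833
```

A row that puts everything on one position has 0 bits of entropy, while a row spread evenly over four positions has 2 bits.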

The math here mirrors the model's own forward pass exactly; only the autograd bookkeeping is dropped, since no training happens during analysis.

Reality check

How much code is this, really?

GPTs are not huge programs. The math is small. The forward pass in this lab is roughly 50 lines of JavaScript. But a real GPT scales the same simple idea up by enormous amounts:

It's tempting to say "GPTs are just a few hundred lines of code." That's misleading. The math is compact. The forward pass above is roughly 50 lines of JavaScript, and the autograd engine that backpropagates errors is about 30 more. But a usable lab needs an optimizer, a tokenizer, a training loop, weight inspection, attention visualization, dataset handling, and a UI. This page is around 2,000 lines total, and it still has only ~27,000 parameters. A real GPT scales the same architecture by orders of magnitude:

This lab

~27k params

2 layers, 32-dim embeddings, vocab of 27. Trains in a browser tab. Output: invented dinosaur names.
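One way to arrive at roughly 27k is to add up the weight matrices. The sketch below assumes no bias terms and a feed-forward layer that expands to 4× the embedding width; both are assumptions, not stated on the page. The 18 positions, the untied lm_head, and the parameter-free RMSNorm come from the earlier steps.

```javascript
// Rough parameter count. Assumes no bias terms and a 4x feed-forward expansion;
// both are assumptions for illustration, not confirmed by the page.
function countParams({ vocab, block, dim, layers, mlpMult = 4 }) {
  const wte = vocab * dim;                        // token embeddings
  const wpe = block * dim;                        // position embeddings
  const attention = 4 * dim * dim;                // query, key, value, output projections
  const feedForward = 2 * dim * (mlpMult * dim);  // up and down projections
  const lmHead = vocab * dim;                     // separate (untied) output projection
  return wte + wpe + layers * (attention + feedForward) + lmHead;
}

console.log(countParams({ vocab: 27, block: 18, dim: 32, layers: 2 })); // 26880
```

Under these assumptions the two transformer blocks account for 24,576 of the 26,880 weights, so the embedding tables are a small fraction of the model.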

GPT-2 small

~124M params

12 layers, 768-dim embeddings, vocab of ~50k subword tokens. Same architecture, ~4,600× more weights.

GPT-4 class

~1.8T params (est.)

Many more layers, mixture-of-experts variants, trillions of training tokens (exact figures are unpublished estimates). Same loss function. Same forward pass.

The core algorithm has changed little since the transformer was introduced in 2017. Scale, data quality, and engineering have changed enormously. Every component on this page exists in the largest models too.

The big picture

Three ideas worth taking with you

Next-token prediction is the only job

The model's single task is to predict the next letter. Every impressive thing a GPT can do (naming dinosaurs, writing essays, coding) grows out of getting better at this one simple task.

Attention is how context becomes selective

A GPT doesn't treat all previous tokens equally. It learns to weight which earlier ones matter for the current prediction. That dynamic, learned relevance is what sets transformers apart from most earlier sequence models.

Knowledge is just adjusted numbers

The model starts as random noise. It predicts, measures error, traces blame backward through every parameter, and nudges each one slightly. Repeat millions of times and structured knowledge appears. No programming, no rules, no hand-crafted logic. Just prediction, error, correction.
