GPT stands for Generative Pre-trained Transformer: a model that generates text, is trained ahead of time on lots of examples, and uses a design called a transformer. Every GPT works the same way: it reads everything written so far, predicts the next token (a single letter in our case), appends that guess, and repeats. A whole word is just that one step over and over until the model predicts a stop. In this lab, a real (but tiny) GPT learns to do this on 1,727 actual dinosaur names.
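The predict, append, repeat loop described above can be sketched in a few lines of JavaScript. This is only an illustration, not the lab's code: `toyNextLetter` is a hypothetical stand-in for the trained model (it follows a fixed lookup table instead of computing probabilities), and using `"."` as the stop token is an assumption.

```javascript
// Hypothetical stand-in for the model: maps the text so far to a next letter.
// A real GPT would return probabilities; this toy just follows a fixed table.
const TOY_TABLE = { "": "r", r: "e", re: "x", rex: "." };

function toyNextLetter(textSoFar) {
  // Fall back to the stop token for any context the table does not cover.
  return TOY_TABLE[textSoFar] ?? ".";
}

// Generate one name: predict, append, repeat until the stop token or a length cap.
function generate(nextLetter, maxLength = 18) {
  let text = "";
  while (text.length < maxLength) {
    const letter = nextLetter(text);
    if (letter === ".") break; // the model predicted "stop"
    text += letter;
  }
  return text;
}

console.log(generate(toyNextLetter)); // prints "rex"
```

Swapping `toyNextLetter` for a function that samples from a real model's probabilities would leave the loop itself unchanged.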
Minimal: just the training loop, the live predictions, and the attention map. Everything else is hidden.
Focused: key concepts only. Switch to Advanced for the full forward pass and deeper notes.
Advanced: the full lab, with the forward pass diagram and deeper notes on every step.
Before the details, here is the entire pipeline. The lab walks through these five steps in order, each one bringing a pile of raw dinosaur names a little closer to a model that can invent brand-new ones.
The model gets no rules. It just sees 1,727 real dinosaur names and tries to predict the next letter.
Watch the bars below. Some letters start names much more often than others, and most names are 8 to 14 letters long. The model has to figure all of this out on its own.
The model gets no rules. No Greek roots, no naming conventions. It simply sees 1,727 real dinosaur names and tries to predict the next letter.
Right away, two patterns emerge:
Watch the glowing bars as names cycle. They show where the current name fits.
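The two kinds of bars (first letters and name lengths) are simple counts. Here is a minimal sketch of how such counts could be computed. The five names are a tiny illustrative sample, not the lab's 1,727-name dataset, and `countBy` is a helper made up for this example.

```javascript
// A tiny illustrative sample; the lab's real dataset has 1,727 names.
const sampleNames = ["tyrannosaurus", "triceratops", "stegosaurus", "velociraptor", "allosaurus"];

// Count how often each value produced by keyFn appears.
function countBy(items, keyFn) {
  const counts = new Map();
  for (const item of items) {
    const key = keyFn(item);
    counts.set(key, (counts.get(key) ?? 0) + 1);
  }
  return counts;
}

const firstLetterCounts = countBy(sampleNames, (name) => name[0]);
const lengthCounts = countBy(sampleNames, (name) => name.length);

console.log(firstLetterCounts); // how many names start with each letter
console.log(lengthCounts);      // how many names have each length
```

The model never sees these counts directly. It has to pick up the same patterns purely from predicting next letters.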
Computers only understand numbers. So the model first turns every letter into a number. This is called tokenization.
Then it learns a short "flavor profile" (called an embedding) for each letter, and another one for each position in the name. The tables below (wte for letters, wpe for positions) hold these learned profiles. Watch them change as training runs. You are watching the model learn what each letter means.
Computers don't understand letters. Only numbers. So the model first turns every letter into a number. This is called tokenization.
Here's how it works in this lab:
That's it. The model now sees only numbers.
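A minimal sketch of letter-level tokenization follows. The exact numbering is an assumption: here `a` through `z` map to 0 through 25, and one extra boundary token (id 26) marks the start and end of a name, which would give the 27-token vocabulary this lab uses. The lab's actual ids may be assigned differently.

```javascript
// Assumed scheme: "a".."z" -> 0..25, plus one boundary token (id 26)
// that marks the start and end of a name. The lab's actual ids may differ.
const BOUNDARY = 26;
const A_CODE = "a".charCodeAt(0);

// Turn a name into a list of token ids, wrapped in boundary tokens.
function encode(name) {
  const ids = [...name.toLowerCase()].map((ch) => ch.charCodeAt(0) - A_CODE);
  return [BOUNDARY, ...ids, BOUNDARY];
}

// Turn token ids back into letters, dropping the boundary tokens.
function decode(ids) {
  return ids
    .filter((id) => id !== BOUNDARY)
    .map((id) => String.fromCharCode(A_CODE + id))
    .join("");
}

const ids = encode("rex");
console.log(ids);         // [26, 17, 4, 23, 26]
console.log(decode(ids)); // "rex"
```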
Numbers alone mean nothing. So the model learns a short "flavor profile" for each letter (technically called an embedding, 32 numbers long). These profiles live in a table called wte. Letters that behave similarly in dinosaur names end up with similar profiles, just like similar foods end up with similar taste profiles.
The model also learns a second table called wpe that tells it where each letter sits in the name. Every position (0 through 17) gets its own profile of 32 numbers. The model adds the letter profile to the position profile, so the input becomes a mix of identity and location.
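Adding the letter profile to the position profile is a lookup in each table followed by an element-wise sum. The sketch below uses the sizes stated on this page (27 tokens, 18 positions, 32 numbers per profile) but fills both tables with small random values, so it shows untrained weights rather than the lab's learned ones. `randomTable` and `embed` are names chosen for this example.

```javascript
const VOCAB_SIZE = 27; // from the lab: 27 tokens
const BLOCK_SIZE = 18; // positions 0 through 17
const EMBED_DIM = 32;  // each profile is 32 numbers long

// Build a rows x cols table of small random numbers (untrained weights).
function randomTable(rows, cols) {
  return Array.from({ length: rows }, () =>
    Array.from({ length: cols }, () => (Math.random() - 0.5) * 0.1)
  );
}

const wte = randomTable(VOCAB_SIZE, EMBED_DIM); // one row per token
const wpe = randomTable(BLOCK_SIZE, EMBED_DIM); // one row per position

// Input vector for one token at one position: letter profile + position profile.
function embed(tokenId, position) {
  return wte[tokenId].map((value, i) => value + wpe[position][i]);
}

const x = embed(17, 0); // token id 17 at position 0
console.log(x.length);  // 32
```

Because the two profiles are summed, the same letter at two different positions reaches the rest of the model as two different vectors.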
The display below shows the real wte and wpe tables side by side. Amber bars are positive values, red bars are negative. The bright rows are the letters and positions in the current dinosaur. Watch both tables change as training runs. You are literally watching the model learn what each letter means and what each position contributes.
Note: wte (input token embeddings) and lm_head (final output projection) are learned separately in this lab. Many GPTs, including GPT-2, tie them (share the same matrix). Tying removes one of the two vocab-by-embedding matrices from the parameter count, usually with little or no loss in quality.
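A quick count shows what tying would change at this lab's sizes (vocab of 27, 32-dim embeddings). This sketch assumes lm_head has no bias term, which the page does not state. Tying removes one vocab-by-embedding matrix and leaves every other weight in the model untouched.

```javascript
const vocabSize = 27;
const embedDim = 32;

const wteParams = vocabSize * embedDim;    // input embedding table
const lmHeadParams = embedDim * vocabSize; // output projection (no bias assumed)

const untied = wteParams + lmHeadParams; // two separate matrices
const tied = wteParams;                  // one shared matrix used for both

console.log({ untied, tied, saved: untied - tied });
// { untied: 1728, tied: 864, saved: 864 }
```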
The model guesses the next letter before it sees it. Then it checks how wrong the guess was. That difference is the loss (error). The model nudges its numbers to make the error smaller. Repeat thousands of times and patterns appear out of random noise.
Training works like this. The model guesses the next letter before it sees it. The difference between the guess and the real letter is called loss (error). The model then slightly adjusts every number (weight) to make that error smaller next time. Do this thousands of times and patterns emerge from random noise.
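Here is a minimal sketch of "guess, measure the error, nudge". It assumes the loss is cross-entropy (the negative log of the probability the model gave to the correct letter), which is the standard choice for GPTs but is not spelled out on this page. To stay short, it nudges the raw output scores directly. A real model nudges the weights that produce those scores.

```javascript
// Turn raw scores (logits) into probabilities that sum to 1.
function softmax(logits) {
  const max = Math.max(...logits); // subtract the max for numerical stability
  const exps = logits.map((z) => Math.exp(z - max));
  const sum = exps.reduce((a, b) => a + b, 0);
  return exps.map((e) => e / sum);
}

// Cross-entropy loss: negative log of the probability given to the true letter.
function crossEntropy(logits, targetId) {
  return -Math.log(softmax(logits)[targetId]);
}

// One gradient-descent step directly on the logits.
// For softmax + cross-entropy, the gradient is (probability - oneHot(target)).
function step(logits, targetId, learningRate = 0.5) {
  const probs = softmax(logits);
  return logits.map((z, i) => z - learningRate * (probs[i] - (i === targetId ? 1 : 0)));
}

let logits = [0, 0, 0]; // three possible "letters", no preference yet
const target = 2;
const before = crossEntropy(logits, target);
logits = step(logits, target);
const after = crossEntropy(logits, target);
console.log(before.toFixed(3), after.toFixed(3)); // the loss goes down after the step
```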
Each point is the loss for one name. The curve is noisy because some names are easier to predict than others. Hover to see the 100-step rolling average.
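A rolling average smooths the noisy per-name losses so the overall trend is visible. The sketch below shows one common way to compute it. How the lab handles the first few steps, before a full window is available, is not stated, so averaging over however many values exist so far is an assumption.

```javascript
// Average of the last `windowSize` values at each step (fewer at the start).
function rollingAverage(values, windowSize = 100) {
  const out = [];
  let sum = 0;
  for (let i = 0; i < values.length; i++) {
    sum += values[i];
    if (i >= windowSize) sum -= values[i - windowSize]; // drop the value that left the window
    out.push(sum / Math.min(i + 1, windowSize));
  }
  return out;
}

console.log(rollingAverage([4, 2, 6, 0], 2)); // [4, 3, 4, 3]
```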
Capture the model at two different training points. Then generate fresh names from each. Try the temperature buttons to make the names safer or wilder, and watch creativity emerge from random noise.
Capture the model's weights at two different points during training, then sample fresh names from each. Each click samples with a small amount of randomness, so the same model produces different names every time. That randomness is exactly how a GPT shows creativity.
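Temperature works by dividing the model's raw scores before they are turned into probabilities. A low temperature sharpens the distribution (safer names) and a high one flattens it (wilder names). The sketch below is one standard way to do this, not the lab's exact code. The `rand` parameter is added here only so the example can be tested with a fixed draw.

```javascript
// Sample an index from logits after dividing by a temperature.
// `rand` is injectable so the example can be tested deterministically.
function sampleWithTemperature(logits, temperature = 1.0, rand = Math.random) {
  const scaled = logits.map((z) => z / temperature);
  const max = Math.max(...scaled);
  const exps = scaled.map((z) => Math.exp(z - max));
  const total = exps.reduce((a, b) => a + b, 0);

  // Walk the cumulative distribution until we pass the random draw.
  let threshold = rand() * total;
  for (let i = 0; i < exps.length; i++) {
    threshold -= exps[i];
    if (threshold < 0) return i;
  }
  return exps.length - 1; // guard against floating-point round-off
}

const logits = [2.0, 1.0, 0.1];
console.log(sampleWithTemperature(logits, 0.5)); // usually 0: low temperature plays it safe
console.log(sampleWithTemperature(logits, 2.0)); // more varied: high temperature is wilder
```

With the same random draw, the two temperatures can pick different letters, because the higher temperature gives the less likely letters a bigger share of the probability.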
This diagram shows every step the model takes to predict the next letter. The last letter of the current dinosaur enters at the top. Follow the arrows downward. Each amber box is one set of learned numbers (a weight matrix). The shapes are real: wide boxes really are wide, tall boxes really are tall. Don't worry about understanding every box yet. Just watch how the patterns inside them become clearer as training runs.
The model produces two things you can actually see:
Type a name below to watch both update in real time.
The attention map uses 4 heads, which are 4 separate looks back at the input. Each head can focus on different earlier letters.
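Each row of an attention map is a set of weights over the letters seen so far. The sketch below shows how one head could compute one such row using standard scaled dot-product attention. The 2-dimensional vectors are made up for illustration. In the real model, each head derives its own query and key vectors from the embeddings through learned weight matrices.

```javascript
// Dot product of two equal-length vectors.
const dot = (a, b) => a.reduce((sum, value, i) => sum + value * b[i], 0);

// Turn raw scores into weights that sum to 1.
function softmax(scores) {
  const max = Math.max(...scores);
  const exps = scores.map((s) => Math.exp(s - max));
  const total = exps.reduce((a, b) => a + b, 0);
  return exps.map((e) => e / total);
}

// Attention weights for the newest position over itself and all earlier positions.
// `query` belongs to the current letter; `keys` has one vector per letter seen so far.
function attentionWeights(query, keys) {
  const scale = Math.sqrt(query.length); // standard scaled dot-product attention
  return softmax(keys.map((key) => dot(query, key) / scale));
}

// Made-up 2-dimensional vectors for a three-letter prefix.
const keys = [[1, 0], [0, 1], [1, 1]];
const query = [2, 0];
console.log(attentionWeights(query, keys)); // three weights that sum to 1
```

Letters whose keys line up with the query get larger weights. Here the first and third letters score equally and both beat the second.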
The map above shows what one head does on one name. But each head learns a job, and that job only shows up across many names. Here the four heads are measured over 80 dinosaur names at once, so you can see what each one actually specialized in.
Nobody told the heads to divide the work. They were never assigned roles. Any split you see below is something the model invented on its own to lower its error. Train more and the split usually sharpens.
Every one of the 80 names is fed through the trained model one letter at a time. At each prediction the four heads each produce an attention distribution over the letters seen so far. The values below are averages of those distributions, with each head pooled across both layers.
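Averaging attention distributions takes a little care, because each prediction looks back over a different number of letters. One possible way to pool them, shown here purely as an assumption about how such a measurement could work, is to average by look-back distance: how much weight goes to the current letter, the previous letter, two letters back, and so on. The lab may summarize its heads differently.

```javascript
// Average attention by look-back distance: 0 = the current letter,
// 1 = the previous letter, and so on. Each input row is one attention
// distribution over the letters seen so far (oldest first).
function averageByDistance(distributions) {
  const sums = [];
  const counts = [];
  for (const weights of distributions) {
    const last = weights.length - 1;
    weights.forEach((w, position) => {
      const distance = last - position;
      sums[distance] = (sums[distance] ?? 0) + w;
      counts[distance] = (counts[distance] ?? 0) + 1;
    });
  }
  return sums.map((sum, d) => sum / counts[d]);
}

// Made-up distributions from one head at three prediction steps.
const headDistributions = [[1.0], [0.25, 0.75], [0.25, 0.25, 0.5]];
console.log(averageByDistance(headDistributions)); // [0.75, 0.25, 0.25]
```

A head that mostly attends to the current letter would show a tall first value, while a head that looks further back would spread its weight across later entries.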
The math here mirrors the model's own forward pass exactly; only the autograd bookkeeping is dropped, since no training happens during analysis.
GPTs are not huge programs. The math is small. The forward pass in this lab is roughly 50 lines of JavaScript. But a real GPT scales the same simple idea up by enormous amounts:
It's tempting to say "GPTs are just a few hundred lines of code." That's misleading. The math is compact. The forward pass above is roughly 50 lines of JavaScript, and the autograd engine that backpropagates errors is about 30 more. But a usable lab needs an optimizer, a tokenizer, a training loop, weight inspection, attention visualization, dataset handling, and a UI. This page is around 2,000 lines total, and the model it trains still has only ~5,000 parameters. A real GPT scales the same architecture by orders of magnitude:
This lab: 2 layers, 32-dim embeddings, vocab of 27. Trains in a browser tab. Output: invented dinosaur names.
GPT-2 small: 12 layers, 768-dim embeddings, vocab of ~50k subword tokens. Same architecture, ~25,000× more weights.
Today's largest models: around a hundred layers or more in published designs, often with mixture-of-experts variants, and trillions of training tokens. Same loss function. Same core forward pass.
The core algorithm has barely changed since the transformer was introduced in 2017. Scale, data quality, and engineering have. Every component on this page exists in the largest models too.
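The "~25,000× more weights" figure can be sanity-checked with a parameter count. The sketch below assumes the 12-layer, 768-dim model is GPT-2 small and uses its published vocabulary size (50,257 tokens) and context length (1,024), neither of which appears on this page. It also assumes the GPT-2 layout: tied embeddings, biases on every linear layer, and an MLP four times as wide as the embedding. `gpt2StyleParams` is a helper written for this example.

```javascript
// Parameter count for a GPT-2-style model with tied input/output embeddings.
// Assumes biases on every linear layer, two LayerNorms per block plus a final one,
// and an MLP hidden size of 4x the embedding size (the GPT-2 layout).
function gpt2StyleParams({ layers, dim, vocab, context }) {
  const embeddings = vocab * dim + context * dim;
  const layerNorm = 2 * dim;                         // gain + bias
  const attention = dim * 3 * dim + 3 * dim          // fused query/key/value projection
                  + dim * dim + dim;                 // output projection
  const mlp = dim * 4 * dim + 4 * dim                // expand
            + 4 * dim * dim + dim;                   // project back
  const perLayer = 2 * layerNorm + attention + mlp;
  return embeddings + layers * perLayer + layerNorm; // + final LayerNorm
}

const gpt2Small = gpt2StyleParams({ layers: 12, dim: 768, vocab: 50257, context: 1024 });
console.log(gpt2Small);                    // 124439808
console.log(Math.round(gpt2Small / 5000)); // 24888, roughly 25,000x the lab's ~5,000
```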
The model's single task is to predict the next letter. Every impressive thing a GPT can do (naming dinosaurs, writing essays, coding) grows out of getting better at this one simple task.
A GPT doesn't treat all previous tokens equally. It learns to weight which earlier ones matter for the current prediction. That dynamic, learned relevance is called attention, and it is the heart of the transformer design.
The model starts as random noise. It predicts, measures the error, traces blame backward to every parameter, and nudges each one slightly. Repeat millions of times and structured knowledge appears. No programming, no rules, no hand-crafted logic. Just prediction, error, correction.