babyLLM
babyLLM is a hands-on experiment in building a tiny language model from scratch. It walks through vocabulary encoding, tokenization, training data generation, and a simple neural network—showing how even a small model can learn to answer a problem like 4+4 without ever seeing that exact example during training.
Get the Code
Explore the Python notebook/scripts, tweak hyperparameters, or run the training locally.
Step 1 — Vocabulary & Encoding
Define a minimal vocabulary of single-digit integers, operators, a space, and equals. Map each character to a token ID and keep an inverse map for decoding.
The inverse map keys on the token ID rather than the character, so predicted IDs can be decoded back into text.
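A minimal sketch of such a vocabulary (not the repository's exact code). The ID ordering below (space, "+", digits 0–9, "=") is an assumption, chosen so that "1+1=2 " encodes to the same IDs used in the Step 3 example.

```python
# Character-level vocabulary: each symbol gets a token ID, plus the inverse map.
chars = [" ", "+"] + [str(d) for d in range(10)] + ["="]
char_to_id = {ch: i for i, ch in enumerate(chars)}    # character -> token ID
id_to_char = {i: ch for ch, i in char_to_id.items()}  # token ID -> character

def encode(text: str) -> list[int]:
    return [char_to_id[ch] for ch in text]

def decode(ids: list[int]) -> str:
    return "".join(id_to_char[i] for i in ids)

print(encode("1+1=2 "))         # [3, 1, 3, 12, 4, 0]
print(decode(encode("4+4=8")))  # "4+4=8"
```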
Step 2 — Generate Training Data (Excluding the Target Case)
Programmatically create equations like "1+2=3", avoiding the specific example you want to test (4+4). This tests generalization.
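One way to sketch this step, assuming single-digit addition with single-digit answers and a hard-coded "4+4" hold-out; the repository's exact format and exclusion logic may differ.

```python
# Enumerate all single-digit additions, skipping the held-out test case.
def generate_equations(exclude=("4+4",)):
    equations = []
    for a in range(10):
        for b in range(10):
            if a + b > 9:        # keep answers single-digit so every sequence has the same length
                continue
            expr = f"{a}+{b}"
            if expr in exclude:  # deliberately withhold 4+4 to probe generalization
                continue
            equations.append(f"{expr}={a + b} ")
    return equations

data = generate_equations()
print(len(data), data[:3])   # 54 ['0+0=0 ', '0+1=1 ', '0+2=2 ']
```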
Step 3 — Prepare Sequences (Inputs & Targets)
Tokenize each equation into IDs, then build (context, target) pairs: the context is the fixed-length prefix, and the target is the very next token. For "1+1=2 ", with IDs [3,1,3,12,4,0], a 4-token context [3,1,3,12] ("1+1=") predicts target 4 ("2"). Batch pairs into arrays for contexts and targets for training.
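A sketch of the pairing step, reusing the encode() helper and equation list from the earlier sketches; using NumPy arrays for the batches is an assumption.

```python
import numpy as np

CONTEXT_LEN = 4   # "a+b=" is always four tokens

def build_pairs(equations):
    contexts, targets = [], []
    for eq in equations:
        ids = encode(eq)                    # "1+1=2 " -> [3, 1, 3, 12, 4, 0]
        contexts.append(ids[:CONTEXT_LEN])  # "1+1="   -> [3, 1, 3, 12]
        targets.append(ids[CONTEXT_LEN])    # "2"      -> 4
    return np.array(contexts), np.array(targets)

X, y = build_pairs(data)
print(X.shape, y.shape)   # (num_examples, 4) (num_examples,)
```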
Step 4 — Hyperparameters
- embedding_dim — vector size per token (e.g., 16).
- hidden_dim — neurons in the hidden layer (capacity).
- epochs — passes over the dataset (too many → overfitting).
- lr — learning rate for weight updates (stability vs. speed).
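Illustrative values for these hyperparameters are sketched below; apart from embedding_dim = 16, which the text mentions, the numbers are placeholders rather than the repository's defaults.

```python
embedding_dim = 16               # vector size per token
hidden_dim = 64                  # neurons in the hidden layer
epochs = 500                     # passes over the dataset
lr = 0.05                        # learning rate
vocab_size = len(char_to_id)     # 13 with the vocabulary sketched in Step 1
context_len = CONTEXT_LEN        # 4 tokens of context
```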
Step 5 — Initialize Weights
Seed randomness for reproducibility.
- Trainable embedding matrix (lookup table for token vectors).
- W1, b1 — from flattened embeddings → hidden layer.
- W2, b2 — from hidden layer → logits over vocab.
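A sketch of the initialization under the shapes implied above, continuing the variables from the earlier sketches; the seed value and the 0.1 scale are assumptions.

```python
rng = np.random.default_rng(42)   # fixed seed for reproducibility

embeddings = rng.normal(0, 0.1, (vocab_size, embedding_dim))        # trainable token lookup table
W1 = rng.normal(0, 0.1, (context_len * embedding_dim, hidden_dim))  # flattened embeddings -> hidden
b1 = np.zeros(hidden_dim)
W2 = rng.normal(0, 0.1, (hidden_dim, vocab_size))                   # hidden -> logits over vocab
b2 = np.zeros(vocab_size)
```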
Step 6 — Training Loop
For each epoch, embed the context, run a forward pass, compute loss (cross-entropy), backpropagate gradients through output → hidden → embeddings, and update weights.
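A plain-NumPy sketch of such a loop, continuing the variables defined above; the tanh activation and the logging interval are assumptions, since the text does not specify them.

```python
def softmax(z):
    z = z - z.max(axis=1, keepdims=True)   # subtract row max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

for epoch in range(epochs):
    # Forward pass: embed the context, flatten, hidden layer, logits over the vocab.
    emb = embeddings[X]                     # (N, context_len, embedding_dim)
    x = emb.reshape(len(X), -1)             # (N, context_len * embedding_dim)
    h = np.tanh(x @ W1 + b1)                # (N, hidden_dim)
    probs = softmax(h @ W2 + b2)            # (N, vocab_size)

    # Cross-entropy loss on the true next tokens.
    loss = -np.log(probs[np.arange(len(y)), y]).mean()

    # Backward pass: output -> hidden -> embeddings.
    dlogits = probs.copy()
    dlogits[np.arange(len(y)), y] -= 1
    dlogits /= len(y)
    dW2 = h.T @ dlogits
    db2 = dlogits.sum(axis=0)
    dpre = (dlogits @ W2.T) * (1 - h ** 2)  # backprop through tanh
    dW1 = x.T @ dpre
    db1 = dpre.sum(axis=0)
    demb = (dpre @ W1.T).reshape(emb.shape)

    # Gradient-descent updates.
    W2 -= lr * dW2; b2 -= lr * db2
    W1 -= lr * dW1; b1 -= lr * db1
    np.add.at(embeddings, X, -lr * demb)    # accumulate gradients into the embedding rows used

    if epoch % 100 == 0:
        print(f"epoch {epoch}: loss {loss:.4f}")
```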
After training, the model can predict the next token with high probability—even for 4+4, which was deliberately excluded from training.
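As a sketch of that final check, reusing the forward pass above; the helper name predict_next is illustrative, not from the repository.

```python
def predict_next(prompt: str) -> str:
    ids = np.array([encode(prompt)])            # (1, context_len)
    x = embeddings[ids].reshape(1, -1)
    h = np.tanh(x @ W1 + b1)
    probs = softmax(h @ W2 + b2)
    return id_to_char[int(probs.argmax())]      # most probable next token

print(predict_next("4+4="))   # ideally "8", even though "4+4" never appeared in training
```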
Notes & Scope
- Currently limited to single-digit operands; double-digit support is intentionally deferred.
- Focus is educational—clarity over scale or performance.
- Great for experimenting with embeddings, context windows, and training dynamics.