
babyLLM

babyLLM is a hands-on experiment in building a tiny language model from scratch. It walks through vocabulary encoding, tokenization, training-data generation, and a simple neural network, showing how even a small model can learn to answer 4+4 without ever having seen that exact problem during training.

Get the Code

Explore the Python notebook/scripts, tweak hyperparameters, or run the training locally.


Step 1 — Vocabulary & Encoding

Define a minimal vocabulary of single-digit integers, operators, a space, and equals. Map each character to a token ID and keep an inverse map for decoding.

{' ':0,'+':1,'0':2,'1':3,'2':4,'3':5,'4':6,'5':7,'6':8,'7':9,'8':10,'9':11,'=':12}

Then we invert the map so that the token ID, rather than the character, is the key, which lets us decode IDs back into characters:

{0:' ',1:'+',2:'0',3:'1',4:'2',5:'3',6:'4',7:'5',8:'6',9:'7',10:'8',11:'9',12:'='}
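
A minimal sketch of the two lookups (the names char_to_id, id_to_char, encode, and decode are ours; the mappings are the ones shown above):

char_to_id = {' ': 0, '+': 1, '0': 2, '1': 3, '2': 4, '3': 5, '4': 6,
              '5': 7, '6': 8, '7': 9, '8': 10, '9': 11, '=': 12}
id_to_char = {i: c for c, i in char_to_id.items()}

def encode(text):
    # "1+1=" -> [3, 1, 3, 12]
    return [char_to_id[c] for c in text]

def decode(ids):
    # [3, 1, 3, 12] -> "1+1="
    return ''.join(id_to_char[i] for i in ids)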

Step 2 — Generate Training Data (Excluding the Target Case)

Programmatically create equations like "1+2=3", avoiding the specific example you want to test (4+4). This tests generalization.
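
One way to sketch the generator (the helper name make_equations is ours, and keeping sums single-digit is an assumption so every equation has the same length):

def make_equations(skip=(4, 4)):
    equations = []
    for a in range(10):
        for b in range(10):
            if a + b > 9:         # assumption: keep sums single-digit so every equation is 6 tokens
                continue
            if (a, b) == skip:    # hold out the test case 4+4
                continue
            equations.append(f"{a}+{b}={a + b} ")
    return equations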

Step 3 — Prepare Sequences (Inputs & Targets)

Tokenize each equation into IDs, then build (context, target) pairs: the context is the fixed-length prefix, and the target is the very next token. For "1+1=2 ", with IDs [3,1,3,12,4,0], a 4-token context [3,1,3,12] ("1+1=") predicts target 4 ("2"). Batch pairs into arrays for contexts and targets for training.
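
A sketch of this step, reusing encode and make_equations from the sketches above (packing the pairs into NumPy arrays is an assumption):

import numpy as np

contexts, targets = [], []
for eq in make_equations():
    ids = encode(eq)              # "1+1=2 " -> [3, 1, 3, 12, 4, 0]
    contexts.append(ids[:4])      # 4-token context "1+1=" -> [3, 1, 3, 12]
    targets.append(ids[4])        # next token "2" -> 4
X = np.array(contexts)            # shape (num_examples, 4)
y = np.array(targets)             # shape (num_examples,)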

Step 4 — Hyperparameters

  • embedding_dim — vector size per token (e.g., 16).
  • hidden_dim — neurons in the hidden layer (capacity).
  • epochs — passes over the dataset (too many → overfitting).
  • lr — learning rate for weight updates (stability vs. speed).
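
Collected in one place for the sketches that follow (only embedding_dim = 16 comes from the text above; the other values are illustrative, not tuned):

vocab_size    = 13    # 10 digits + '+', '=', ' '
context_len   = 4     # tokens of context, e.g. "1+1="
embedding_dim = 16    # vector size per token
hidden_dim    = 64    # assumed value
epochs        = 500   # assumed value
lr            = 0.05  # assumed value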

Step 5 — Initialize Weights

Seed randomness for reproducibility.

  • Trainable embedding matrix (lookup table for token vectors).
  • W1, b1 — from flattened embeddings → hidden layer.
  • W2, b2 — from hidden layer → logits over vocab.
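
A sketch of the initialization (the shapes follow the list above; the NumPy generator and the 0.1 scale are assumptions):

rng = np.random.default_rng(42)   # seed randomness for reproducibility

embeddings = rng.normal(0, 0.1, (vocab_size, embedding_dim))        # token lookup table
W1 = rng.normal(0, 0.1, (context_len * embedding_dim, hidden_dim))  # flattened embeddings -> hidden
b1 = np.zeros(hidden_dim)
W2 = rng.normal(0, 0.1, (hidden_dim, vocab_size))                   # hidden -> logits over vocab
b2 = np.zeros(vocab_size)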

Step 6 — Training Loop

For each epoch, embed the context, run a forward pass, compute loss (cross-entropy), backpropagate gradients through output → hidden → embeddings, and update weights.
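
A compact sketch of one possible loop in plain NumPy. The tanh activation and full-batch gradient descent are assumptions; the description above only fixes cross-entropy loss and the backward path output → hidden → embeddings:

for epoch in range(epochs):
    # forward pass
    emb = embeddings[X]                          # (N, context_len, embedding_dim)
    h_in = emb.reshape(len(X), -1)               # flatten the context embeddings
    h = np.tanh(h_in @ W1 + b1)                  # hidden layer
    logits = h @ W2 + b2                         # (N, vocab_size)

    # softmax + cross-entropy loss
    logits -= logits.max(axis=1, keepdims=True)
    probs = np.exp(logits)
    probs /= probs.sum(axis=1, keepdims=True)
    loss = -np.log(probs[np.arange(len(X)), y]).mean()

    # backward pass: output -> hidden -> embeddings
    dlogits = probs.copy()
    dlogits[np.arange(len(X)), y] -= 1
    dlogits /= len(X)
    dW2 = h.T @ dlogits
    db2 = dlogits.sum(axis=0)
    dh = dlogits @ W2.T * (1 - h ** 2)           # tanh derivative
    dW1 = h_in.T @ dh
    db1 = dh.sum(axis=0)
    demb = (dh @ W1.T).reshape(emb.shape)

    # gradient-descent update
    W2 -= lr * dW2; b2 -= lr * db2
    W1 -= lr * dW1; b1 -= lr * db1
    np.add.at(embeddings, X, -lr * demb)         # scatter gradients back into the embedding table

    if epoch % 100 == 0:
        print(f"epoch {epoch}: loss {loss:.4f}")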

After training, the model can predict the next token with high probability, even for the held-out prompt 4+4=, which was deliberately excluded from the training data.
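
A greedy prediction on that prompt, sketched on top of the pieces above (the helper name predict_next is ours):

def predict_next(prompt):
    ids = np.array([encode(prompt)])             # "4+4=" -> [[6, 1, 6, 12]]
    h = np.tanh(embeddings[ids].reshape(1, -1) @ W1 + b1)
    logits = h @ W2 + b2
    return id_to_char[int(logits.argmax())]

print(predict_next("4+4="))   # "8" if training generalized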

Notes & Scope

  • Currently limited to single-digit operands; double-digit support is intentionally deferred.
  • Focus is educational—clarity over scale or performance.
  • Great for experimenting with embeddings, context windows, and training dynamics.