A working primer for practitioners who understand the architecture but want to read the papers.
A large and growing population of engineers, researchers, and technical leaders can describe transformer architecture with genuine clarity (attention mechanisms, embedding spaces, next-token prediction) but lose the thread the moment a paper shifts into mathematical notation. The existing educational landscape does not serve them well. Formal textbooks assume a mathematics degree; informal explainers simplify past the point of usefulness, leaving readers with intuitions that collapse under any real technical weight. Neither approach bridges the gap between conceptual understanding and the ability to read a research paper on its own terms.
This primer attempts that bridge. Each chapter introduces a mathematical tool because it does specific, visible work inside a language model. No symbol appears without prior explanation, and no concept earns its place unless it unlocks something concrete about how these systems function. The sequence is deliberate: every chapter depends only on what came before it. A reader who starts at the beginning and works through the exercises should arrive at the attention mechanism, the mathematical core of the transformer, with every piece of notation fully in hand.
Words as ordered lists of numbers, and why direction carries meaning.
Every large language model begins with the same operation: it converts a word (or, more precisely, a token) into a list of numbers. That list is a vector, and nearly everything that follows in the transformer architecture (attention, prediction, loss computation) is arithmetic performed on vectors. Understanding what vectors are, how they relate to each other, and what it means to operate on them is the single most important mathematical foundation for reading LLM research.
A vector is an ordered list of numbers. That is the entire definition. The vector [3, 7] has two components and lives in two-dimensional space. The vector [0.12, −0.85, 0.33, 0.71] has four components and lives in four-dimensional space. GPT-4's token embeddings are vectors with thousands of components, each one a coordinate in a space far too large to visualize directly but governed by exactly the same rules as two-dimensional arrows on a page.
Two properties of a vector matter immediately. The first is its magnitude, how long the arrow is, calculated as the square root of the sum of its squared components. The second is its direction, where the arrow points. In the context of language models, direction turns out to carry the meaning. Two word embeddings that point in similar directions represent words with similar roles in language, regardless of how long the vectors happen to be.
Why does this representation work? Consider the alternative: a language model could assign each word a single number (word #1, word #2, word #38042). But a single number cannot capture the relationships between words, the fact that “king” and “queen” share something that “king” and “carburetor” do not. A vector with hundreds of dimensions can encode these relationships structurally. Words with similar meanings end up near each other, and the geometric relationships between vectors reflect semantic relationships between concepts.
Figure 1 shows a simplified picture: words plotted as points in two dimensions. In a real model the space has 768 or 4,096 or more dimensions, but the principle holds at any scale. Words used in similar contexts cluster together, and the distance between points reflects semantic distance. The model stores and retrieves meaning through exactly this geometric structure.
Suppose a toy language model represents the word “king” as the vector [3, 4]. What is the magnitude of this embedding?
Apply the formula directly:

||king|| = √(3² + 4²) = √(9 + 16) = √25 = 5
This is a Pythagorean triple, so the result is a clean integer, a convenience that rarely occurs with real embeddings, which typically have magnitudes like 11.38 or 0.97 depending on the model’s normalization scheme.
The magnitude formula should look familiar: for a two-dimensional vector it is the Pythagorean theorem. The generalization to n dimensions works identically; add more squared terms under the radical. This is a recurring pattern throughout this primer: the mathematical tools that govern high-dimensional LLM computations are, at their core, straightforward extensions of geometry the reader already knows from two and three dimensions.
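A minimal Python sketch of the same computation (NumPy assumed; illustrative only, not any model's actual code):

```python
import numpy as np

def magnitude(v):
    """Square root of the sum of squared components."""
    return np.sqrt(np.sum(np.asarray(v, dtype=float) ** 2))

print(magnitude([3, 4]))                      # 5.0 -- the "king" example above
print(magnitude([0.12, -0.85, 0.33, 0.71]))   # the same formula in four dimensions
```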
Measuring similarity between vectors with element-wise multiplication and a sum.
Chapter I established that words become vectors. The immediate follow-up question: how do you compare two vectors? If “king” is [3, 7] and “queen” is [4, 6.5], how similar are they, and how would you quantify that? The dot product is the operation that answers this question, and it is the single most frequently occurring computation in the entire transformer architecture.
The dot product of two vectors is computed by multiplying corresponding elements and summing the results. Given vectors a = [a₁, a₂, …, aₙ] and b = [b₁, b₂, …, bₙ]:

a · b = a₁b₁ + a₂b₂ + … + aₙbₙ
The operation requires that both vectors have the same number of dimensions. You cannot take the dot product of a 3-dimensional vector and a 5-dimensional one. This constraint shows up constantly in transformer engineering as the requirement that “the shapes must match,” and it is the most common source of implementation errors when building or modifying model architectures.
The geometric interpretation is what makes the dot product powerful. The dot product of two vectors equals the product of their magnitudes multiplied by the cosine of the angle between them:

a · b = ||a|| ||b|| cos(θ)
This means the dot product encodes two things at once: how long the vectors are, and how much they point in the same direction. When we want to isolate just the directional component (the similarity of meaning, stripped of magnitude) we divide out the lengths. The result is cosine similarity:

cos(θ) = (a · b) / (||a|| ||b||)
Cosine similarity ranges from −1 (vectors pointing in exactly opposite directions) through 0 (perpendicular, no relationship) to +1 (identical direction). In practice, word embeddings trained on real data rarely produce negative cosine similarities. Most pairs land somewhere between 0 and 0.5, with semantically related words clustering above 0.7.
| Range | Interpretation | Example |
|---|---|---|
| 0.90 – 1.00 | Near-identical | “happy” / “joyful” |
| 0.70 – 0.90 | Strongly related | “king” / “queen” |
| 0.40 – 0.70 | Some relationship | “king” / “castle” |
| 0.00 – 0.40 | Weak or no relation | “king” / “bicycle” |
| < 0.00 | Opposing (rare) | — |
A toy model represents “cat” as [2, 5, 1] and “dog” as [3, 4, 2]. Compute the dot product and cosine similarity.
Dot product: (2)(3) + (5)(4) + (1)(2) = 6 + 20 + 2 = 28

Magnitudes: ||cat|| = √(2² + 5² + 1²) = √30 ≈ 5.48 and ||dog|| = √(3² + 4² + 2²) = √29 ≈ 5.39

Cosine similarity: 28 / (5.48 × 5.39) ≈ 28 / 29.5 ≈ 0.95
A cosine similarity of 0.95 indicates these vectors point in nearly the same direction, consistent with “cat” and “dog” being semantically close.
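A minimal NumPy sketch of both operations, verifying the worked example (illustrative only):

```python
import numpy as np

def cosine_similarity(a, b):
    """Dot product divided by the product of the two magnitudes."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

cat, dog = [2, 5, 1], [3, 4, 2]
print(np.dot(cat, dog))                        # 28
print(round(cosine_similarity(cat, dog), 2))   # 0.95
```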
Work each problem on paper before revealing the solution. The goal is fluency with the operation, the point at which these computations feel mechanical rather than effortful.
1. [6, 2, 3].
2. a = [1, 3, 5] and b = [2, 0, 4]. What is their dot product?
3. a = [1, 2] and b = [2, −1], compute their cosine similarity.
4. a = [3, 4] and b = [4, 3]. Compute the cosine similarity.

How weight matrices reshape vectors between layers.
A matrix is a rectangular grid of numbers arranged in rows and columns. If a vector is a list, a matrix is a table. Where a vector represents a single point or direction in space, a matrix represents a transformation: a machine that takes a vector as input and produces a different vector as output. Every layer of a neural network is, at its computational core, a matrix multiplication followed by a nonlinear function. Understanding what matrices do geometrically is therefore understanding what neural networks do to the data that passes through them.
A matrix is described by its shape: the number of rows and columns. A matrix with 3 rows and 2 columns is a “3-by-2 matrix,” written (3×2). The weight matrix in a transformer layer that projects a 768-dimensional embedding into a 3072-dimensional intermediate representation is a (3072×768) matrix: 3,072 rows, 768 columns, containing over 2.3 million individual numbers, each one a learnable parameter.
The central operation is matrix-vector multiplication. Given a matrix W with shape (m×n) and a vector v with n components, the product Wv is a new vector with m components. Each element of the output is the dot product of one row of the matrix with the input vector. This is where Chapter II pays off: every row of the matrix is itself a vector, and the output is a list of dot products measuring how much the input aligns with each of those row vectors.
The shape constraint is critical and worth internalizing: the number of columns in the matrix must equal the number of components in the vector. The output vector has as many components as the matrix has rows. This rule (inner dimensions must match, outer dimensions give the result shape) governs all matrix operations and is the source of most shape mismatch errors in deep learning code.
When two matrices are multiplied together, the same shape rule applies: an (m×n) matrix times an (n×p) matrix yields an (m×p) matrix. Composing two transformations (first project from 768 to 3072 dimensions, then from 3072 back to 768) is the same as multiplying the two matrices together. This is how transformer layers are stacked: each layer’s output vector becomes the next layer’s input, passing through matrix after matrix, each one reshaping the representation.
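A short NumPy sketch of the shape rule and the composition claim, using the 768 → 3072 → 768 projection described above (the random matrices are stand-ins for learned weights, not real parameters):

```python
import numpy as np

rng = np.random.default_rng(0)
W_up = rng.standard_normal((3072, 768))    # projects 768 -> 3072
W_down = rng.standard_normal((768, 3072))  # projects 3072 -> 768
v = rng.standard_normal(768)

# Apply the two transformations one after the other...
step_by_step = W_down @ (W_up @ v)
# ...or compose them into a single (768 x 768) matrix first.
composed = (W_down @ W_up) @ v

print(step_by_step.shape, composed.shape)   # (768,) (768,)
print(np.allclose(step_by_step, composed))  # True: composition is matrix multiplication
```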
A (2×3) weight matrix and a 3-dimensional input vector:
Compute Wv:
The output is the 2-dimensional vector [11, 7]. A 3D input has been projected into 2D space, a dimension reduction, which is exactly what happens when a transformer compresses representations between layers.
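The specific entries of W and v are not reproduced above, so the sketch below substitutes hypothetical values, chosen only so that the product matches the stated output [11, 7]:

```python
import numpy as np

# Hypothetical (2x3) matrix and 3-vector -- not the original example's values,
# just one pair consistent with the stated output [11, 7].
W = np.array([[1, 2, 3],
              [2, 1, 1]])
v = np.array([2, 0, 3])

out = W @ v   # each output element is one row of W dotted with v
print(out)    # [11  7] -- a 3-dimensional input projected into 2 dimensions
```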
Work each problem on paper before revealing the solution. The goal is fluency with the operation, the point at which these computations feel mechanical rather than effortful.
1. Multiply the matrix [[2, 1], [0, 3]] by the vector [4, 2].

The two inverse functions behind softmax and loss computation.
Two mathematical functions appear so frequently in machine learning that a reader who does not recognize them on sight will struggle with nearly every technical paper: the exponential function and the logarithm. They are inverses of each other (what one does, the other undoes) and together they form the mathematical backbone of both the softmax function (Chapter VI) and the cross-entropy loss (Chapter VII). Building reliable intuition for their behavior now will pay dividends for the rest of this primer.
The exponential function, written eˣ or exp(x), takes any real number and returns a positive number. The constant e ≈ 2.718 arises naturally in growth and decay processes. The critical properties: exp(0) = 1, the function is always positive, it grows explosively for positive inputs, and it decays toward zero for negative inputs. Crucially, exp(a + b) = exp(a) × exp(b), meaning the exponential converts addition into multiplication.
The logarithm, written ln(x) or log(x), is the inverse: if exp(a) = b, then ln(b) = a. It is defined only for positive numbers, maps 1 to 0, and grows slowly, logarithmically, toward infinity. Its essential property mirrors the exponential’s: ln(a × b) = ln(a) + ln(b). The logarithm converts multiplication into addition, which is why it appears everywhere in probability calculations where we need to multiply many small numbers together without numerical underflow.
Why do these functions matter for LLMs specifically? Two reasons dominate. First, the softmax function (Chapter VI) uses exp() to convert arbitrary real-valued scores into positive numbers that can be normalized into a probability distribution. Second, the cross-entropy loss function (Chapter VII) uses ln() to measure how surprised the model is by the correct answer, specifically the negative log of the predicted probability. Every training step of every language model computes both of these functions billions of times.
A language model predicts a sequence of three tokens with individual probabilities 0.8, 0.6, and 0.9. The joint probability of the full sequence is their product:

0.8 × 0.6 × 0.9 = 0.432
In log space, the same calculation becomes addition:

ln(0.8) + ln(0.6) + ln(0.9) ≈ −0.223 + (−0.511) + (−0.105) = −0.839
And we can verify: exp(−0.839) ≈ 0.432. With only three tokens the product is still manageable, but for sequences of hundreds of tokens, each probability less than 1, the product approaches zero rapidly and risks numerical underflow. This is why LLM training operates on log-probabilities rather than raw probabilities.
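A short NumPy check of the same arithmetic (illustrative only):

```python
import numpy as np

probs = np.array([0.8, 0.6, 0.9])
joint = np.prod(probs)             # 0.432
log_joint = np.sum(np.log(probs))  # about -0.839
print(joint, log_joint, np.exp(log_joint))  # exp() recovers the product
```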
Work each problem on paper before revealing the solution. The goal is fluency with the operation, the point at which these computations feel mechanical rather than effortful.
1. Evaluate exp(0), exp(1), and exp(−1). Which is largest? Which is smallest?
2. Simplify ln(exp(5)) + ln(exp(3)).

Distributions, conditional probability, and the chain rule for sequences.
A language model is, at its mathematical core, a machine that outputs a probability distribution over the next token given all previous tokens. Every concept in this chapter (probability distributions, conditional probability, the chain rule of probability) exists to make that sentence precise. By the end, the reader should be able to parse the statement “the model computes P(xₜ | x₁, x₂, …, xₜ₋₁)” and understand exactly what each symbol means.
A probability distribution is a list of numbers that satisfy two rules: every number is non-negative (zero or greater), and the numbers sum to exactly 1. A language model with a vocabulary of 50,000 tokens must output 50,000 numbers, one per possible next token, that are all non-negative and sum to 1. The number assigned to each token represents the model’s confidence that this token comes next. The token with the highest probability is the model’s best guess, but the full distribution captures the model’s uncertainty about all alternatives.
Conditional probability is the probability of an event given that some other event has already occurred. The notation P(A | B) reads “the probability of A given B.” For a language model, the relevant conditional probability is P(next token | all previous tokens). The entire context window, every token the model has seen so far, is the condition. The output distribution changes with every new token because the condition changes.
The chain rule of probability (not to be confused with the calculus chain rule in Chapter VIII) decomposes the probability of a sequence into a product of conditional probabilities:

P(x₁, x₂, …, xₙ) = P(x₁) × P(x₂ | x₁) × P(x₃ | x₁, x₂) × … × P(xₙ | x₁, …, xₙ₋₁)
This is exactly how an autoregressive language model generates text. It produces one token at a time, left to right, with each token’s probability conditioned on all tokens generated so far. The probability of the entire generated sequence is the product of all the per-token conditional probabilities, which, per Chapter IV, is computed in practice as a sum of log-probabilities.
Consider the sentence “The cat sat.” A language model assigns: P(“The”) = 0.05, P(“cat” | “The”) = 0.12, P(“sat” | “The cat”) = 0.08, P(“.” | “The cat sat”) = 0.70.

The probability of the full sentence:

0.05 × 0.12 × 0.08 × 0.70 = 0.000336
This number is small, as it should be, since any specific four-token sequence is one possibility among an enormous space of alternatives. In log space: ln(P) = ln(0.05) + ln(0.12) + ln(0.08) + ln(0.70) ≈ −3.00 + (−2.12) + (−2.53) + (−0.36) = −8.00.
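The same computation as a NumPy sketch, using the per-token probabilities assigned above (illustrative only):

```python
import numpy as np

# Conditional probabilities for the tokens "The", "cat", "sat", "."
cond_probs = np.array([0.05, 0.12, 0.08, 0.70])

p_sequence = np.prod(cond_probs)     # 0.000336
log_p = np.sum(np.log(cond_probs))   # about -8.00
print(p_sequence, log_p, np.exp(log_p))
```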
Work each problem on paper before revealing the solution. The goal is fluency with the operation, the point at which these computations feel mechanical rather than effortful.
Converting logits into a valid probability distribution.
The internal computations of a neural network produce raw numerical scores called logits that can be any real number: positive, negative, or zero. These scores are not probabilities. They do not sum to 1, and they can be negative. The softmax function solves this problem by converting a vector of arbitrary logits into a valid probability distribution, using the exponential function from Chapter IV.
Given a vector of logits z = [z₁, z₂, …, zₙ], softmax computes:

softmax(zᵢ) = exp(zᵢ) / Σⱼ exp(zⱼ)
The exponential ensures every output is positive (since exp(x) > 0 for all x). Dividing by the sum of all exponentials ensures the outputs sum to 1. The result is a valid probability distribution. Larger logits produce larger probabilities, and the relative differences between logits are preserved but amplified, because the exponential function is nonlinear and grows more steeply for larger inputs.
A temperature parameter controls how “peaked” or “flat” the output distribution is. The modified softmax divides each logit by the temperature T before applying the exponential: softmax(zᵢ/T). When T = 1, the standard softmax applies. As T approaches 0, the distribution collapses toward a single spike on the highest logit (deterministic output). As T increases, the distribution flattens toward uniform (maximum randomness). This is the “temperature” slider in LLM interfaces.
A practical concern: for large logit values, exp(z) can overflow to infinity. The standard fix, universally applied in implementations, is to subtract the maximum logit from all values before exponentiating. Since softmax(z) = softmax(z − c) for any constant c (the subtraction cancels in the ratio), this produces identical results while keeping the numbers in a manageable range.
Logits: [2.0, 1.0, 0.5]

exp(2.0) ≈ 7.389, exp(1.0) ≈ 2.718, and exp(0.5) ≈ 1.649, which sum to 11.756. Dividing each by that sum gives the probability distribution [0.629, 0.231, 0.140].
The first logit (2.0) was only 1.0 larger than the second, but its probability is nearly three times higher because the exponential amplifies differences. Verify: 0.629 + 0.231 + 0.140 = 1.000.
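A minimal softmax implementation (Python with NumPy, illustrative only) that includes both the max-subtraction trick and the temperature parameter described above, and reproduces the worked distribution:

```python
import numpy as np

def softmax(logits, temperature=1.0):
    """Numerically stable softmax: subtract the max logit before exponentiating."""
    z = np.asarray(logits, dtype=float) / temperature
    z = z - np.max(z)                 # softmax(z) == softmax(z - c)
    exp_z = np.exp(z)
    return exp_z / np.sum(exp_z)

print(softmax([2.0, 1.0, 0.5]).round(3))                    # [0.629 0.231 0.14]
print(softmax([2.0, 1.0, 0.5], temperature=0.1).round(3))   # collapses toward the largest logit
print(softmax([2.0, 1.0, 0.5], temperature=10.0).round(3))  # flattens toward uniform
```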
Work each problem on paper before revealing the solution. The goal is fluency with the operation, the point at which these computations feel mechanical rather than effortful.
1. Apply softmax to the logits [0, 0, 0]. What distribution do you get, and why?
2. Apply softmax to the logits [10, 0, 0]. What happens when one logit dominates?

Quantifying prediction error with surprise and perplexity.
Training a language model requires a way to measure how wrong its predictions are. The model outputs a probability distribution over the next token (Chapter V, via softmax in Chapter VI), and we know which token actually appeared. The cross-entropy loss quantifies the gap between the predicted distribution and reality, using the logarithm from Chapter IV. It is the single number that training seeks to minimize.
The intuition begins with surprise. If a model assigns probability 0.9 to the correct next token, it is not very surprised when that token appears. If it assigns probability 0.01, it is extremely surprised. The natural measure of surprise for an event with probability p is −ln(p). This measure is zero when p = 1 (no surprise at all), infinite when p approaches 0 (maximum surprise), and increases smoothly between these extremes. The negative sign makes the value positive, since ln(p) is negative for p between 0 and 1.
Cross-entropy generalizes this across an entire vocabulary. Given the model’s predicted distribution q and the true distribution p (which for next-token prediction is simply 1 for the correct token and 0 for everything else), the cross-entropy is:

H(p, q) = −Σᵢ p(i) ln(q(i))
In the language modeling case, since p is 1 for the correct token and 0 elsewhere, this simplifies to −ln(q(correct token)), exactly the surprise formula above. The average cross-entropy across many predictions is the training loss that appears on loss curves. Perplexity, the metric most commonly reported for language models, is simply the exponential of the average cross-entropy: perplexity = exp(H). A perplexity of 20 means the model is, on average, as uncertain as if it were choosing uniformly among 20 equally likely options.
A model predicts the distribution [0.7, 0.2, 0.1] over tokens [“the”, “a”, “an”]. The correct token is “the”:

Loss = −ln(0.7) ≈ 0.357

If instead the correct token were “an” (predicted probability 0.1):

Loss = −ln(0.1) ≈ 2.303
The loss is much higher when the model assigns low probability to the correct answer. Training adjusts the model’s parameters to reduce this loss, which means pushing the model to assign higher probability to tokens that actually appear.
| Predicted P(correct) | Cross-Entropy Loss | Perplexity |
|---|---|---|
| 0.95 | 0.05 | 1.05 |
| 0.50 | 0.69 | 2.00 |
| 0.10 | 2.30 | 10.0 |
| 0.01 | 4.61 | 100 |
| 0.001 | 6.91 | 1,000 |
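A short NumPy sketch reproducing the worked example and the pattern in the table (illustrative only):

```python
import numpy as np

def cross_entropy(predicted_dist, correct_index):
    """Surprise at the correct token: minus the log of its predicted probability."""
    return -np.log(predicted_dist[correct_index])

q = np.array([0.7, 0.2, 0.1])          # distribution over ["the", "a", "an"]
print(cross_entropy(q, 0))             # about 0.357 (correct token: "the")
print(cross_entropy(q, 2))             # about 2.303 (correct token: "an")

for p in [0.95, 0.50, 0.10, 0.01, 0.001]:
    loss = -np.log(p)
    print(p, round(loss, 2), round(np.exp(loss), 2))   # perplexity = exp(loss)
```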
Work each problem on paper before revealing the solution. The goal is fluency with the operation, the point at which these computations feel mechanical rather than effortful.
Partial derivatives and the gradient vector.
Chapter VII defined the loss function, the number the model wants to minimize. The question now is: how does the model know which direction to adjust its parameters? The answer is the derivative, which measures how a function’s output changes when its input changes by a tiny amount. In one dimension this is the slope of a curve at a point. In the many-dimensional space of a neural network’s parameters, the collection of all partial derivatives is called the gradient, and it points in the direction of steepest increase.
The derivative of a function f(x) at a point x, written df/dx or f′(x), answers the question: if I nudge x by a tiny amount, how much does f(x) change? A positive derivative means the function is increasing; a negative derivative means it is decreasing; a zero derivative means the function is flat, potentially at a minimum or maximum.
When a function has multiple inputs (as a neural network’s loss function does, with millions of parameters) we take partial derivatives, one for each input, while holding all others constant. The notation ∂L/∂w reads “the partial derivative of the loss L with respect to the parameter w.” It answers: if I change just this one weight slightly, how much does the loss change?
The gradient is the vector of all partial derivatives. If the loss function has n parameters, the gradient is an n-dimensional vector:

∇L = [∂L/∂w₁, ∂L/∂w₂, …, ∂L/∂wₙ]
The gradient points in the direction of steepest ascent, the direction that would increase the loss most. To decrease the loss, the model moves in the opposite direction: the negative gradient. This is the central idea behind all of neural network training, and Chapter IX makes it operational.
Consider the function f(x, y) = x² + 3xy. Compute the partial derivatives.

∂f/∂x = 2x + 3y and ∂f/∂y = 3x

At the point (x=2, y=1): ∂f/∂x = 2(2) + 3(1) = 7 and ∂f/∂y = 3(2) = 6.
The gradient [7, 6] says: increasing x has slightly more impact on the function than increasing y at this particular point. In training terms, the parameter corresponding to x should receive a larger update.
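A short sketch that computes the same partials and checks them numerically by nudging each input, which is exactly the “tiny amount” definition of the derivative (Python, illustrative only):

```python
def f(x, y):
    return x**2 + 3 * x * y

x, y = 2.0, 1.0
analytic = [2 * x + 3 * y, 3 * x]      # [df/dx, df/dy] = [7.0, 6.0]

h = 1e-6                               # the "tiny nudge"
numeric = [(f(x + h, y) - f(x, y)) / h,
           (f(x, y + h) - f(x, y)) / h]

print(analytic)
print([round(g, 4) for g in numeric])  # approximately [7.0, 6.0]
```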
Work each problem on paper before revealing the solution. The goal is fluency with the operation, the point at which these computations feel mechanical rather than effortful.
Iterative parameter updates via SGD, Adam, and backpropagation.
Gradient descent is the algorithm that uses the gradient (Chapter VIII) to iteratively improve the model’s parameters. The idea is straightforward: compute the gradient of the loss with respect to every parameter, then take a small step in the opposite direction, repeating for every batch of training data. Each step reduces the loss slightly, and over millions of steps, the model’s predictions improve from random noise to coherent language.
The update rule is a single equation:

w_new = w_old − α × ∂L/∂w
The symbol α is the learning rate, a small positive number (typically between 0.0001 and 0.01) that controls the step size. Too large, and the model overshoots the minimum, bouncing around without converging. Too small, and training takes prohibitively long. The learning rate is perhaps the most important hyperparameter in all of deep learning, and modern training runs often vary it on a schedule: starting higher and gradually decreasing as training progresses.
In practice, computing the gradient over the entire training dataset for each update would be prohibitively expensive. Stochastic gradient descent (SGD) computes the gradient on a small random subset of the data, a mini-batch, and uses that estimate instead. The gradient estimate is noisy (it may not point exactly toward the true minimum), but averaged over many steps the noise cancels out. Mini-batch sizes for LLM training are typically in the thousands to millions of tokens.
Most modern LLMs use a refined variant called Adam (Adaptive Moment Estimation), which maintains running averages of both the gradient and its square for each parameter, effectively giving each parameter its own adaptive learning rate. The mathematical details of Adam are beyond the scope of this primer, but conceptually it is still gradient descent (compute the gradient, step in the opposite direction) with per-parameter step sizes that adapt based on the history of recent gradients.
Backpropagation is the algorithm that computes the gradient efficiently. It works backward through the network, applying the chain rule (Chapter VIII) layer by layer, computing ∂L/∂w for every weight in the network in a single backward pass. The cost of the backward pass is roughly twice the cost of the forward pass, a remarkably efficient ratio that makes training networks with billions of parameters feasible.
A model has a single weight w = 3.0 and the computed gradient is ∂L/∂w = 4.0. The learning rate α = 0.1.

w_new = 3.0 − 0.1 × 4.0 = 3.0 − 0.4 = 2.6
The weight moved from 3.0 to 2.6, a step in the direction that reduces the loss. After many such updates across all parameters, the model converges toward parameters that produce lower loss (better predictions).
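The example does not say which loss function produced that gradient; the sketch below assumes L(w) = (w − 1)², whose gradient 2(w − 1) equals 4.0 at w = 3.0, so the first update reproduces the step from 3.0 to 2.6 (Python, illustrative only):

```python
w = 3.0       # initial weight, as in the example
alpha = 0.1   # learning rate

for step in range(5):
    grad = 2 * (w - 1)        # dL/dw for the assumed loss L(w) = (w - 1)**2
    w = w - alpha * grad      # step in the direction opposite the gradient
    print(step, round(w, 4))  # 2.6, 2.28, 2.024, ... converging toward w = 1
```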
Work each problem on paper before revealing the solution. The goal is fluency with the operation, the point at which these computations feel mechanical rather than effortful.
Queries, keys, values, and the scaled dot-product mechanism.
The attention mechanism is the mathematical core of the transformer architecture, and every concept from the preceding nine chapters converges here. Vectors represent tokens (Chapter I). Dot products measure relevance between tokens (Chapter II). Matrices project embeddings into queries, keys, and values (Chapter III). The exponential function and softmax convert raw scores into attention weights (Chapters IV and VI). Cross-entropy provides the loss signal (Chapter VII), and gradient descent adjusts the weight matrices to improve predictions (Chapters VIII and IX). The reader who has worked through this primer now possesses every tool needed to read the attention equation from the original “Attention Is All You Need” paper and parse every symbol.
The mechanism begins with three linear projections. Given an input matrix X (one row per token, each row a vector), the model computes:

Q = XW_Q, K = XW_K, V = XW_V
Q is the query matrix, what each token is “looking for.” K is the key matrix, what each token “advertises” about itself. V is the value matrix, the actual information each token carries. The weight matrices W_Q, W_K, and W_V are learned parameters, adjusted through gradient descent during training. Each of these matrix multiplications is exactly the operation defined in Chapter III.
Next, the model computes attention scores by taking the dot product of every query with every key. This produces a matrix where entry (i, j) measures how much token i should attend to token j. The scores are scaled by dividing by √d_k (the square root of the key dimension), which prevents the dot products from growing too large in high-dimensional spaces, a scaling factor that keeps the softmax from saturating.
Softmax (Chapter VI) converts each row of the score matrix into a probability distribution, a set of attention weights that sum to 1. Each token now has a distribution over all other tokens indicating how much attention it pays to each.
Finally, these attention weights multiply the value matrix. Each token’s output is a weighted sum of all other tokens’ values, where the weights are determined by the attention scores. The full equation:

Attention(Q, K, V) = softmax(QKᵀ / √d_k) V
Multi-head attention runs this entire process multiple times in parallel, with different learned projection matrices for each “head.” If a model uses 12 attention heads, it learns 12 different sets of (W_Q, W_K, W_V) matrices, computes 12 separate attention outputs, concatenates them, and projects the result back to the model’s embedding dimension through a final weight matrix. Each head can learn to attend to different aspects of the input; one head might track syntactic relationships while another tracks semantic similarity.
Every symbol in the attention equation has now been defined through the preceding chapters. Q, K, and V are matrices produced by the matrix multiplications of Chapter III. QKᵀ is a matrix of dot products from Chapter II. The division by √d_k uses the square root from Chapter I’s magnitude formula. Softmax applies the exponential function from Chapter IV to produce probability distributions from Chapter V. The final multiplication by V is another matrix operation. The loss function (Chapter VII) measures how well the model’s predictions match reality, and gradient descent (Chapter IX) adjusts all the weight matrices to improve the next prediction.
Consider three tokens with 2-dimensional embeddings, and a single attention head with d_k = 2:
Step 1: Compute QKᵀ
Step 2: Scale by √d_k = √2 ≈ 1.414
Step 3: Apply softmax row-wise (each row sums to 1)
Step 4: Multiply by V (each row is a weighted sum of the value vectors)
Token 1 attends equally to tokens 1 and 3 (weights 0.40 each), with less attention to token 2 (weight 0.20). Its output is a blend of those tokens’ values, weighted accordingly.
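The specific Q, K, and V values in this example are not reproduced above, so the sketch below substitutes hypothetical matrices, chosen so that token 1’s attention weights come out near the stated [0.40, 0.20, 0.40] (Python with NumPy, illustrative only):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                          # every query dotted with every key, scaled
    scores = scores - scores.max(axis=-1, keepdims=True)     # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V, weights                              # weighted sums of the value vectors

# Hypothetical 2-dimensional Q, K, V for three tokens (not the original example's values).
Q = np.array([[1.0, 1.0], [0.5, 0.0], [0.0, 1.0]])
K = np.array([[1.0, 1.0], [1.0, 0.0], [2.0, 0.0]])
V = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])

output, weights = scaled_dot_product_attention(Q, K, V)
print(weights.round(2))   # first row is approximately [0.40 0.20 0.40]
print(output.round(2))    # each row blends the value vectors by those weights
```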
Work each problem on paper before revealing the solution. The goal is fluency with the operation, the point at which these computations feel mechanical rather than effortful.
| Symbol | Meaning | Introduced |
|---|---|---|
| v, a, b | Vectors | Ch. I |
| \|\|v\|\| | Magnitude (length) of vector v | Ch. I |
| a · b | Dot product | Ch. II |
| cos(θ) | Cosine similarity | Ch. II |
| W | Matrix (weight matrix) | Ch. III |
| Wv | Matrix-vector multiplication | Ch. III |
| exp(x), eˣ | Exponential function | Ch. IV |
| ln(x), log(x) | Natural logarithm | Ch. IV |
| P(A \| B) | Conditional probability | Ch. V |
| softmax(z) | Softmax function | Ch. VI |
| H(p, q) | Cross-entropy | Ch. VII |
| ∂L/∂w | Partial derivative of loss w.r.t. weight | Ch. VIII |
| ∇L | Gradient vector | Ch. VIII |
| α | Learning rate | Ch. IX |
| Q, K, V | Query, key, value matrices | Ch. X |
| d_k | Dimension per attention head | Ch. X |
Fluency with these ten topics changes what a practitioner can do with the primary literature. Architecture papers become readable rather than skimmable. Ablation studies make sense because the operations being ablated have concrete definitions. Hyperparameter choices (learning rate schedules, attention head counts, embedding dimensions) connect to the mechanics they govern rather than appearing as arbitrary numbers to copy from a reference implementation. The goal is not to become a mathematician but to stop being blocked by the notation.