The Mathematics of Large Language Models

A working primer for practitioners who understand the architecture but want to read the papers.

A large and growing population of engineers, researchers, and technical leaders can describe transformer architecture with genuine clarity (attention mechanisms, embedding spaces, next-token prediction) but lose the thread the moment a paper shifts into mathematical notation. The existing educational landscape does not serve them well. Formal textbooks assume a mathematics degree; informal explainers simplify past the point of usefulness, leaving readers with intuitions that collapse under any real technical weight. Neither approach bridges the gap between conceptual understanding and the ability to read a research paper on its own terms.

This primer attempts that bridge. Each chapter introduces a mathematical tool because it does specific, visible work inside a language model. No symbol appears without prior explanation, and no concept earns its place unless it unlocks something concrete about how these systems function. The sequence is deliberate: every chapter depends only on what came before it. A reader who starts at the beginning and works through the exercises should arrive at the attention mechanism, the mathematical core of the transformer, with every piece of notation fully in hand.

Contents
I · Vectors · Words as ordered lists of numbers
II · The Dot Product & Similarity · Measuring similarity between vectors
III · Matrices · Weight matrices and vector transformations
IV · Exponentials & Logarithms · The inverse functions behind softmax and loss
V · Probability · Distributions, conditionals, and the chain rule
VI · Softmax · Converting logits into probabilities
VII · Cross-Entropy · Quantifying prediction error
VIII · Derivatives & Gradients · Partial derivatives and the gradient vector
IX · Gradient Descent · SGD, Adam, and backpropagation
X · Attention · Queries, keys, values, and scaled dot-product
Chapter I

Vectors

Words as ordered lists of numbers, and why direction carries meaning.

Every large language model begins with the same operation: it converts a word (or, more precisely, a token) into a list of numbers. That list is a vector, and nearly everything that follows in the transformer architecture (attention, prediction, loss computation) is arithmetic performed on vectors. Understanding what vectors are, how they relate to each other, and what it means to operate on them is the single most important mathematical foundation for reading LLM research.

A vector is an ordered list of numbers. That is the entire definition. The vector [3, 7] has two components and lives in two-dimensional space. The vector [0.12, −0.85, 0.33, 0.71] has four components and lives in four-dimensional space. GPT-4's token embeddings are vectors with thousands of components, each one a coordinate in a space far too large to visualize directly but governed by exactly the same rules as two-dimensional arrows on a page.

v = [v₁, v₂, v₃, …, vₙ]   where n = number of dimensions
definition

Two properties of a vector matter immediately. The first is its magnitude, how long the arrow is, calculated as the square root of the sum of its squared components. The second is its direction, where the arrow points. In the context of language models, direction turns out to carry the meaning. Two word embeddings that point in similar directions represent words with similar roles in language, regardless of how long the vectors happen to be.

||v|| = √(v₁² + v₂² + … + vₙ²)
magnitude

Why does this representation work? Consider the alternative: a language model could assign each word a single number (word #1, word #2, word #38042). But a single number cannot capture the relationships between words, the fact that “king” and “queen” share something that “king” and “carburetor” do not. A vector with hundreds of dimensions can encode these relationships structurally. Words with similar meanings end up near each other, and the geometric relationships between vectors reflect semantic relationships between concepts.

Figure 1 · Words as points in two-dimensional space (simplified)
(Two clusters are shown: royalty/people terms such as king, queen, prince, throne, crown; and vehicle/object terms such as car, truck, wheel, engine, road, bridge. Points within a cluster sit close together; points in different clusters sit far apart.)

Figure 1 shows a simplified picture: words plotted as points in two dimensions. In a real model the space has 768 or 4,096 or more dimensions, but the principle holds at any scale. Words used in similar contexts cluster together, and the distance between points reflects semantic distance. The model stores and retrieves meaning through exactly this geometric structure.

Example 1 · Computing the magnitude of an embedding vector

Suppose a toy language model represents the word “king” as the vector [3, 4]. What is the magnitude of this embedding?

Apply the formula directly:

||v|| = √(3² + 4²) = √(9 + 16) = √25 = 5

This is a Pythagorean triple, so the result is a clean integer, a convenience that rarely occurs with real embeddings, which typically have magnitudes like 11.38 or 0.97 depending on the model’s normalization scheme.

The magnitude formula should look familiar: for a two-dimensional vector it is the Pythagorean theorem. The generalization to n dimensions works identically; add more squared terms under the radical. This is a recurring pattern throughout this primer: the mathematical tools that govern high-dimensional LLM computations are, at their core, straightforward extensions of geometry the reader already knows from two and three dimensions.
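
For readers who want to check the arithmetic programmatically, here is a minimal Python sketch of the magnitude formula (the function name and toy vectors are illustrative, not drawn from any particular library):

import math

def magnitude(v):
    # ||v|| = sqrt(v1^2 + v2^2 + ... + vn^2), the square root of the sum of squares
    return math.sqrt(sum(x * x for x in v))

print(magnitude([3, 4]))      # 5.0, the "king" vector from Example 1
print(magnitude([6, 2, 3]))   # 7.0, Exercise 1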

Chapter II

The Dot Product & Similarity

Measuring similarity between vectors with element-wise multiplication.

Chapter I established that words become vectors. The immediate follow-up question: how do you compare two vectors? If “king” is [3, 7] and “queen” is [4, 6.5], how similar are they, and how would you quantify that? The dot product is the operation that answers this question, and it is the single most frequently occurring computation in the entire transformer architecture.

The dot product of two vectors is computed by multiplying corresponding elements and summing the results. Given vectors a = [a₁, a₂, …, aₙ] and b = [b₁, b₂, …, bₙ]:

a · b = a₁b₁ + a₂b₂ + … + aₙbₙ
dot product

The operation requires that both vectors have the same number of dimensions. You cannot take the dot product of a 3-dimensional vector and a 5-dimensional one. This constraint shows up constantly in transformer engineering as the requirement that “the shapes must match,” and it is the most common source of implementation errors when building or modifying model architectures.

Figure 2 · The dot product as element-wise multiplication and sum
(a = [2, 5, 1] and b = [3, 1, 4]; the element-wise products 6, 5, and 4 sum to a dot product of 15.)

The geometric interpretation is what makes the dot product powerful. The dot product of two vectors equals the product of their magnitudes multiplied by the cosine of the angle between them:

a · b = ||a|| × ||b|| × cos(θ)
geometric form

This means the dot product encodes two things at once: how long the vectors are, and how much they point in the same direction. When we want to isolate just the directional component (the similarity of meaning, stripped of magnitude) we divide out the lengths. The result is cosine similarity:

cos(θ) = (a · b) / (||a|| × ||b||)
cosine similarity

Cosine similarity ranges from −1 (vectors pointing in exactly opposite directions) through 0 (perpendicular, no relationship) to +1 (identical direction). In practice, word embeddings trained on real data rarely produce negative cosine similarities. Most pairs land somewhere between 0 and 0.5, with semantically related words clustering above 0.7.

Figure 3 · Cosine similarity measures direction, not magnitude
(A small angle between vectors gives cos θ ≈ 0.99; a large angle gives cos θ ≈ 0.15. Direction matters more than length: cosine measures the angle, not the magnitude.)
Reference · Cosine similarity interpretation

Range          Interpretation         Example
0.90 – 1.00    Near-identical         “happy” / “joyful”
0.70 – 0.90    Strongly related       “king” / “queen”
0.40 – 0.70    Some relationship      “king” / “castle”
0.00 – 0.40    Weak or no relation    “king” / “bicycle”
< 0.00         Opposing (rare)
Example 2 · Dot product and cosine similarity of two word vectors

A toy model represents “cat” as [2, 5, 1] and “dog” as [3, 4, 2]. Compute the dot product and cosine similarity.

Dot product:

(2)(3) + (5)(4) + (1)(2) = 6 + 20 + 2 = 28

Magnitudes:

||cat|| = √(4 + 25 + 1) = √30 ≈ 5.477
||dog|| = √(9 + 16 + 4) = √29 ≈ 5.385

Cosine similarity:

28 / (5.477 × 5.385) ≈ 28 / 29.50 ≈ 0.949

A cosine similarity of 0.95 indicates these vectors point in nearly the same direction, consistent with “cat” and “dog” being semantically close.
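
The same computation takes a few lines of Python using only the standard library (the helper names are illustrative):

import math

def dot(a, b):
    # Requires equal lengths: the "shapes must match" constraint.
    assert len(a) == len(b)
    return sum(x * y for x, y in zip(a, b))

def cosine_similarity(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

cat, dog = [2, 5, 1], [3, 4, 2]
print(dot(cat, dog))                           # 28
print(round(cosine_similarity(cat, dog), 3))   # 0.949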

Exercises

Work each problem on paper before revealing the solution. The goal is fluency with the operation, the point at which these computations feel mechanical rather than effortful.

1. Compute the magnitude of the vector [6, 2, 3].
Solution: ||v|| = √(36 + 4 + 9) = √49 = 7. One of the rare cases where the sum of squares produces a perfect square.
2. Vectors a = [1, 3, 5] and b = [2, 0, 4]. What is their dot product?
Solution: (1)(2) + (3)(0) + (5)(4) = 2 + 0 + 20 = 22.
3. Given a = [1, 2] and b = [2, −1], compute their cosine similarity.
Solution: Dot product: (1)(2) + (2)(−1) = 0. When the dot product is zero, cosine similarity is 0 regardless of magnitudes, meaning the vectors are perpendicular. In an embedding space, perpendicularity means the model treats the two concepts as entirely unrelated.
4. Vectors a = [3, 4] and b = [4, 3]. Compute the cosine similarity.
Solution: Dot product: 12 + 12 = 24. Both magnitudes: 5. Cosine: 24/25 = 0.96. Swapping the components preserves most of the directional information.
Chapter III

Matrices

How weight matrices reshape vectors between layers.

A matrix is a rectangular grid of numbers arranged in rows and columns. If a vector is a list, a matrix is a table. Where a vector represents a single point or direction in space, a matrix represents a transformation: a machine that takes a vector as input and produces a different vector as output. Every layer of a neural network is, at its computational core, a matrix multiplication followed by a nonlinear function. Understanding what matrices do geometrically is therefore understanding what neural networks do to the data that passes through them.

A matrix is described by its shape: the number of rows and columns. A matrix with 3 rows and 2 columns is a “3-by-2 matrix,” written (3×2). The weight matrix in a transformer layer that projects a 768-dimensional embedding into a 3072-dimensional intermediate representation is a (3072×768) matrix: 3,072 rows, 768 columns, containing over 2.3 million individual numbers, each one a learnable parameter.

W is a matrix with shape (m × n): m rows, n columns
notation

The central operation is matrix-vector multiplication. Given a matrix W with shape (m×n) and a vector v with n components, the product Wv is a new vector with m components. Each element of the output is the dot product of one row of the matrix with the input vector. This is where Chapter II pays off: every row of the matrix is itself a vector, and the output is a list of dot products measuring how much the input aligns with each of those row vectors.

Figure 4 · Matrix-vector multiplication: a (3×2) matrix transforms a 2D vector into a 3D vector
(W = [[2, 0], [0, 3], [1, 1]] times v = [4, 2] gives Wv = [8, 6, 6]: 2×4 + 0×2 = 8, 0×4 + 3×2 = 6, 1×4 + 1×2 = 6. Inner dimensions must match: the 2 in (3×2) and (2×1) agree.)

The shape constraint is critical and worth internalizing: the number of columns in the matrix must equal the number of components in the vector. The output vector has as many components as the matrix has rows. This rule (inner dimensions must match, outer dimensions give the result shape) governs all matrix operations and is the source of most shape mismatch errors in deep learning code.

(m × n) matrix × (n × 1) vector = (m × 1) vector
shape rule

When two matrices are multiplied together, the same shape rule applies: an (m×n) matrix times an (n×p) matrix yields an (m×p) matrix, where each entry of the product is the dot product of a row of the first matrix with a column of the second. Composing two transformations (first project from 768 to 3072 dimensions, then from 3072 back to 768) is the same as multiplying the two matrices together. This is how transformer layers are stacked: each layer’s output vector becomes the next layer’s input, passing through matrix after matrix, each one reshaping the representation.

Example 3 · Matrix-vector multiplication as projection

A (2×3) weight matrix and a 3-dimensional input vector:

W = [[1, 0, 2],     v = [3, 1, 4]
     [0, 3, 1]]

Compute Wv:

Row 1 · v = (1)(3) + (0)(1) + (2)(4) = 3 + 0 + 8 = 11
Row 2 · v = (0)(3) + (3)(1) + (1)(4) = 0 + 3 + 4 = 7

The output is the 2-dimensional vector [11, 7]. A 3D input has been projected into 2D space, a dimension reduction, which is exactly what happens when a transformer compresses representations between layers.
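
A minimal pure-Python sketch of matrix-vector multiplication, reusing the dot-product idea from Chapter II (the function name is illustrative):

def matvec(W, v):
    # Each output component is the dot product of one row of W with v.
    assert all(len(row) == len(v) for row in W), "inner dimensions must match"
    return [sum(w * x for w, x in zip(row, v)) for row in W]

W = [[1, 0, 2],
     [0, 3, 1]]
v = [3, 1, 4]
print(matvec(W, v))   # [11, 7]: a (2×3) matrix maps a 3D vector to a 2D vector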

Exercises

Work each problem on paper before revealing the solution. The goal is fluency with the operation, the point at which these computations feel mechanical rather than effortful.

5. Multiply the matrix [[2, 1], [0, 3]] by the vector [4, 2].
Solution: Row 1: (2)(4) + (1)(2) = 8 + 2 = 10. Row 2: (0)(4) + (3)(2) = 0 + 6 = 6. Result: [10, 6].
6. A matrix has shape (512 × 768). What shape must the input vector have? What shape is the output?
Solution: The input must be a 768-dimensional vector (matching the number of columns). The output is a 512-dimensional vector (matching the number of rows). This is a dimension reduction from 768 to 512.
7. If matrix A has shape (3 × 4) and matrix B has shape (4 × 2), what is the shape of the product AB? Can you compute BA?
Solution: AB has shape (3 × 2) because the inner dimensions (4 and 4) match, and the outer dimensions give the result. BA would require B’s columns (2) to match A’s rows (3), which fails, so BA is undefined. Matrix multiplication is not commutative.
Chapter IV

Exponentials & Logarithms

The two inverse functions behind softmax and loss computation.

Two mathematical functions appear so frequently in machine learning that a reader who does not recognize them on sight will struggle with nearly every technical paper: the exponential function and the logarithm. They are inverses of each other (what one does, the other undoes) and together they form the mathematical backbone of both the softmax function (Chapter VI) and the cross-entropy loss (Chapter VII). Building reliable intuition for their behavior now will pay dividends for the rest of this primer.

The exponential function, written eˣ or exp(x), takes any real number and returns a positive number. Its base e ≈ 2.718 is a constant that arises naturally in growth and decay processes. The critical properties: exp(0) = 1, the function is always positive, it grows explosively for positive inputs, and it decays toward zero for negative inputs. Crucially, exp(a + b) = exp(a) × exp(b), meaning the exponential converts addition into multiplication.

exp(0) = 1     exp(a+b) = exp(a) · exp(b)     exp(x) > 0 for all x
key properties

The logarithm, written ln(x) or log(x), is the inverse: if exp(a) = b, then ln(b) = a. It is defined only for positive numbers, maps 1 to 0, and grows slowly, logarithmically, toward infinity. Its essential property mirrors the exponential’s: ln(a × b) = ln(a) + ln(b). The logarithm converts multiplication into addition, which is why it appears everywhere in probability calculations where we need to multiply many small numbers together without numerical underflow.

ln(1) = 0     ln(ab) = ln(a) + ln(b)     ln(eˣ) = x     exp(ln(x)) = x
key properties

Why do these functions matter for LLMs specifically? Two reasons dominate. First, the softmax function (Chapter VI) uses exp() to convert arbitrary real-valued scores into positive numbers that can be normalized into a probability distribution. Second, the cross-entropy loss function (Chapter VII) uses ln() to measure how surprised the model is by the correct answer, specifically the negative log of the predicted probability. Every training step of every language model computes both of these functions billions of times.

Example 4 · Logarithms turn products into sums

A language model predicts a sequence of three tokens with individual probabilities 0.8, 0.6, and 0.9. The joint probability of the full sequence is their product:

P(sequence) = 0.8 × 0.6 × 0.9 = 0.432

In log space, the same calculation becomes addition:

ln(P) = ln(0.8) + ln(0.6) + ln(0.9)
     = −0.223 + (−0.511) + (−0.105) = −0.839

And we can verify: exp(−0.839) ≈ 0.432. With only three tokens the product is still manageable, but for sequences of hundreds of tokens, each probability less than 1, the product approaches zero rapidly and risks numerical underflow. This is why LLM training operates on log-probabilities rather than raw probabilities.
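
A short Python sketch of the underflow problem and the log-space fix (the 500 identical probabilities are made up purely for illustration):

import math

probs = [0.1] * 500          # 500 tokens, each with probability 0.1

product = 1.0
for p in probs:
    product *= p
print(product)               # 0.0: the true value 1e-500 underflows a 64-bit float

log_prob = sum(math.log(p) for p in probs)
print(round(log_prob, 2))    # -1151.29: finite, and easy to add or compare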

Exercises

Work each problem on paper before revealing the solution. The goal is fluency with the operation, the point at which these computations feel mechanical rather than effortful.

8. Compute exp(0), exp(1), and exp(−1). Which is largest? Which is smallest?
Solution: exp(0) = 1, exp(1) ≈ 2.718, exp(−1) ≈ 0.368. Largest: exp(1). Smallest: exp(−1). The exponential is always positive and strictly increasing.
9. If ln(x) = 3, what is x?
Solution: x = exp(3) ≈ 20.09. The logarithm is the inverse of the exponential, so if ln(x) = 3, then x = e³.
10. Simplify: ln(exp(5)) + ln(exp(3))
Solution: ln(exp(5)) = 5 and ln(exp(3)) = 3, because ln and exp cancel. The sum is 8. Equivalently, ln(exp(5) × exp(3)) = ln(exp(8)) = 8.
Chapter V

Probability

Distributions, conditional probability, and the chain rule for sequences.

A language model is, at its mathematical core, a machine that outputs a probability distribution over the next token given all previous tokens. Every concept in this chapter (probability distributions, conditional probability, the chain rule of probability) exists to make that sentence precise. By the end, the reader should be able to parse the statement “the model computes P(xₜ | x₁, x₂, …, xₜ₋₁)” and understand exactly what each symbol means.

A probability distribution is a list of numbers that satisfy two rules: every number is non-negative (zero or greater), and the numbers sum to exactly 1. A language model with a vocabulary of 50,000 tokens must output 50,000 numbers, one per possible next token, that are all non-negative and sum to 1. The number assigned to each token represents the model’s confidence that this token comes next. The token with the highest probability is the model’s best guess, but the full distribution captures the model’s uncertainty about all alternatives.

P(x₁) + P(x₂) + … + P(xₙ) = 1     where each P(xᵢ) ≥ 0
probability distribution

Conditional probability is the probability of an event given that some other event has already occurred. The notation P(A | B) reads “the probability of A given B.” For a language model, the relevant conditional probability is P(next token | all previous tokens). The entire context window, every token the model has seen so far, is the condition. The output distribution changes with every new token because the condition changes.

The chain rule of probability (not to be confused with the calculus chain rule in Chapter VIII) decomposes the probability of a sequence into a product of conditional probabilities:

P(x₁, x₂, …, xₙ) = P(x₁) × P(x₂|x₁) × P(x₃|x₁,x₂) × … × P(xₙ|x₁,…,xₙ₋₁)
chain rule

This is exactly how an autoregressive language model generates text. It produces one token at a time, left to right, with each token’s probability conditioned on all tokens generated so far. The probability of the entire generated sequence is the product of all the per-token conditional probabilities, which, per Chapter IV, is computed in practice as a sum of log-probabilities.

Example 5 · Decomposing a sentence probability

Consider the sentence “The cat sat.” A language model assigns:

P(“The”) = 0.05
P(“cat” | “The”) = 0.12
P(“sat” | “The cat”) = 0.08
P(“.” | “The cat sat”) = 0.70

The probability of the full sentence:

P = 0.05 × 0.12 × 0.08 × 0.70 = 0.000336

This number is small, as it should be, since any specific four-token sequence is one possibility among an enormous space of alternatives. In log space: ln(P) = ln(0.05) + ln(0.12) + ln(0.08) + ln(0.70) ≈ −3.00 + (−2.12) + (−2.53) + (−0.36) = −8.00.
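
The chain-rule decomposition from Example 5, as a small Python sketch (the per-token conditional probabilities are the made-up values above):

import math

# P(token | all previous tokens) for "The", "cat", "sat", "."
conditionals = [0.05, 0.12, 0.08, 0.70]

p_sequence = math.prod(conditionals)
log_p = sum(math.log(p) for p in conditionals)

print(round(p_sequence, 6))   # 0.000336
print(round(log_p, 2))        # -8.0, and exp(-8.0) recovers ≈ 0.000336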

Exercises

Work each problem on paper before revealing the solution. The goal is fluency with the operation, the point at which these computations feel mechanical rather than effortful.

11. A model produces the distribution [0.4, 0.35, 0.2, 0.05] over four tokens. Is this a valid probability distribution? What if the last value were 0.1 instead?
Solution: Sum = 0.4 + 0.35 + 0.2 + 0.05 = 1.0 and all values ≥ 0, so yes, it is valid. If the last value were 0.1, the sum would be 1.05, which violates the requirement that probabilities sum to exactly 1.
12. A model generates a three-token sequence with per-token conditional probabilities 0.3, 0.5, and 0.9. What is the probability of the full sequence? What is the log-probability?
Solution: P = 0.3 × 0.5 × 0.9 = 0.135. Log-probability: ln(0.3) + ln(0.5) + ln(0.9) ≈ −1.204 + (−0.693) + (−0.105) = −2.002.
Chapter VI

Softmax

Converting logits into a valid probability distribution.

The internal computations of a neural network produce raw numerical scores called logits that can be any real number: positive, negative, or zero. These scores are not probabilities. They do not sum to 1, and they can be negative. The softmax function solves this problem by converting a vector of arbitrary logits into a valid probability distribution, using the exponential function from Chapter IV.

Given a vector of logits z = [z₁, z₂, …, zₙ], softmax computes:

softmax(zᵢ) = exp(zᵢ) / (exp(z₁) + exp(z₂) + … + exp(zₙ))
softmax

The exponential ensures every output is positive (since exp(x) > 0 for all x). Dividing by the sum of all exponentials ensures the outputs sum to 1. The result is a valid probability distribution. Larger logits produce larger probabilities, and the relative differences between logits are preserved but amplified, because the exponential function is nonlinear and grows more steeply for larger inputs.

Figure 5 · Softmax transforms logits into a probability distribution
(Raw logits [2.0, 1.0, 0.5] become probabilities [0.629, 0.231, 0.140], summing to 1.0, after applying exp() and normalizing; the largest logit becomes the most likely token.)

A temperature parameter controls how “peaked” or “flat” the output distribution is. The modified softmax divides each logit by the temperature T before applying the exponential: softmax(zᵢ/T). When T = 1, the standard softmax applies. As T approaches 0, the distribution collapses toward a single spike on the highest logit (deterministic output). As T increases, the distribution flattens toward uniform (maximum randomness). This is the “temperature” slider in LLM interfaces.

A practical concern: for large logit values, exp(z) can overflow to infinity. The standard fix, universally applied in implementations, is to subtract the maximum logit from all values before exponentiating. Since softmax(z) = softmax(z − c) for any constant c (the subtraction cancels in the ratio), this produces identical results while keeping the numbers in a manageable range.

Example 6 · Computing softmax by hand

Logits: [2.0, 1.0, 0.5]

exp(2.0) = 7.389
exp(1.0) = 2.718
exp(0.5) = 1.649
Sum = 11.756
softmax = [7.389/11.756, 2.718/11.756, 1.649/11.756]
       = [0.629, 0.231, 0.140]

The first logit (2.0) was only 1.0 larger than the second, but its probability is nearly three times higher because the exponential amplifies differences. Verify: 0.629 + 0.231 + 0.140 = 1.000.
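
A numerically stable softmax with an optional temperature, sketched in plain Python (the max-subtraction trick is the one described above; the function name is illustrative):

import math

def softmax(logits, temperature=1.0):
    scaled = [z / temperature for z in logits]
    m = max(scaled)                            # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

print([round(p, 3) for p in softmax([2.0, 1.0, 0.5])])         # [0.629, 0.231, 0.14]
print([round(p, 3) for p in softmax([2.0, 1.0, 0.5], 10.0)])   # high T: flatter, near-uniform
print([round(p, 3) for p in softmax([2.0, 1.0, 0.5], 0.1)])    # low T: nearly all mass on 2.0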

Exercises

Work each problem on paper before revealing the solution. The goal is fluency with the operation, the point at which these computations feel mechanical rather than effortful.

13. Apply softmax to the logits [0, 0, 0]. What distribution do you get, and why?
Solution: exp(0) = 1 for each. Sum = 3. Softmax = [1/3, 1/3, 1/3] = [0.333, 0.333, 0.333]. When all logits are equal, softmax produces a uniform distribution, meaning the model has no preference among the options.
14. Apply softmax to [10, 0, 0]. What happens when one logit dominates?
Solution: exp(10) ≈ 22026, exp(0) = 1. Sum ≈ 22028. Softmax ≈ [0.9999, 0.00005, 0.00005]. Nearly all probability mass concentrates on the first element. This is what a confident model prediction looks like: one logit far above the rest.
Chapter VII

Cross-Entropy

Quantifying prediction error with surprise and perplexity.

Training a language model requires a way to measure how wrong its predictions are. The model outputs a probability distribution over the next token (Chapter V, via softmax in Chapter VI), and we know which token actually appeared. The cross-entropy loss quantifies the gap between the predicted distribution and reality, using the logarithm from Chapter IV. It is the single number that training seeks to minimize.

The intuition begins with surprise. If a model assigns probability 0.9 to the correct next token, it is not very surprised when that token appears. If it assigns probability 0.01, it is extremely surprised. The natural measure of surprise for an event with probability p is −ln(p). This measure is zero when p = 1 (no surprise at all), infinite when p approaches 0 (maximum surprise), and increases smoothly between these extremes. The negative sign makes the value positive, since ln(p) is negative for p between 0 and 1.

surprise = −ln(P(correct token))
surprise

Cross-entropy generalizes this across an entire vocabulary. Given the model’s predicted distribution q and the true distribution p (which for next-token prediction is simply 1 for the correct token and 0 for everything else), the cross-entropy is:

H(p, q) = −Σ p(xᵢ) · ln(q(xᵢ))
cross-entropy

In the language modeling case, since p is 1 for the correct token and 0 elsewhere, this simplifies to −ln(q(correct token)), exactly the surprise formula above. The average cross-entropy across many predictions is the training loss that appears on loss curves. Perplexity, the metric most commonly reported for language models, is simply the exponential of the average cross-entropy: perplexity = exp(H). A perplexity of 20 means the model is, on average, as uncertain as if it were choosing uniformly among 20 equally likely options.

Example 7 · Computing cross-entropy loss for a single prediction

A model predicts the distribution [0.7, 0.2, 0.1] over tokens [“the”, “a”, “an”]. The correct token is “the”.

Loss = −ln(0.7) = −(−0.357) = 0.357

If instead the correct token were “an” (predicted probability 0.1):

Loss = −ln(0.1) = −(−2.303) = 2.303

The loss is much higher when the model assigns low probability to the correct answer. Training adjusts the model’s parameters to reduce this loss, which means pushing the model to assign higher probability to tokens that actually appear.
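
A minimal Python sketch of the single-prediction loss and its corresponding perplexity (the distribution and token index are the toy values from Example 7):

import math

def cross_entropy(predicted_probs, correct_index):
    # For a one-hot target this reduces to the surprise of the correct token.
    return -math.log(predicted_probs[correct_index])

probs = [0.7, 0.2, 0.1]           # model's distribution over ["the", "a", "an"]
loss = cross_entropy(probs, 0)    # the correct token is "the"
print(round(loss, 3))             # 0.357
print(round(math.exp(loss), 3))   # 1.429: perplexity for this single prediction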

Reference · Loss and perplexity values

Predicted P(correct)    Cross-Entropy Loss    Perplexity
0.95                    0.05                  1.05
0.50                    0.69                  2.00
0.10                    2.30                  10.0
0.01                    4.61                  100
0.001                   6.91                  1,000
Exercises

Work each problem on paper before revealing the solution. The goal is fluency with the operation, the point at which these computations feel mechanical rather than effortful.

15. A model assigns probability 0.3 to the correct next token. What is the cross-entropy loss? What is the perplexity?
Solution: Loss = −ln(0.3) ≈ 1.204. Perplexity = exp(1.204) ≈ 3.33. The model is about as confused as if it were choosing among 3–4 equally likely options.
16. If a model achieves an average cross-entropy loss of 2.0 across a test set, what is its perplexity? Is this good?
Solution: Perplexity = exp(2.0) ≈ 7.39. Whether this is “good” depends on the task and vocabulary. For typical English language modeling with a 50,000-token vocabulary, a perplexity of 7 means the model has narrowed its uncertainty to roughly 7 plausible tokens at each step, a strong result given 50,000 alternatives.
Chapter VIII

Derivatives & Gradients

Partial derivatives and the gradient vector.

Chapter VII defined the loss function, the number the model wants to minimize. The question now is: how does the model know which direction to adjust its parameters? The answer is the derivative, which measures how a function’s output changes when its input changes by a tiny amount. In one dimension this is the slope of a curve at a point. In the many-dimensional space of a neural network’s parameters, the collection of all partial derivatives is called the gradient, and it points in the direction of steepest increase.

The derivative of a function f(x) at a point x, written df/dx or f′(x), answers the question: if I nudge x by a tiny amount, how much does f(x) change? A positive derivative means the function is increasing; a negative derivative means it is decreasing; a zero derivative means the function is flat, potentially at a minimum or maximum.

df/dx ≈ [f(x + Δx) − f(x)] / Δx     as Δx → 0
derivative

When a function has multiple inputs (as a neural network’s loss function does, with millions of parameters) we take partial derivatives, one for each input, while holding all others constant. The notation ∂L/∂w reads “the partial derivative of the loss L with respect to the parameter w.” It answers: if I change just this one weight slightly, how much does the loss change?

The gradient is the vector of all partial derivatives. If the loss function has n parameters, the gradient is an n-dimensional vector:

∇L = [∂L/∂w₁, ∂L/∂w₂, …, ∂L/∂wₙ]
gradient

The gradient points in the direction of steepest ascent, the direction that would increase the loss most. To decrease the loss, the model moves in the opposite direction: the negative gradient. This is the central idea behind all of neural network training, and Chapter IX makes it operational.

Example 8 · Partial derivatives of a simple function

Consider the function f(x, y) = x² + 3xy. Compute the partial derivatives.

∂f/∂x = 2x + 3y     (treat y as constant, differentiate with respect to x)
∂f/∂y = 3x          (treat x as constant, differentiate with respect to y)

At the point (x=2, y=1):

∂f/∂x = 2(2) + 3(1) = 7
∂f/∂y = 3(2) = 6
∇f = [7, 6]

The gradient [7, 6] says: increasing x has slightly more impact on the function than increasing y at this particular point. In training terms, the parameter corresponding to x should receive a larger update.
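
A quick numerical check of Example 8 using finite differences, the approximation in the derivative definition above with a small but nonzero step (a Python sketch; the step size 1e-6 is an arbitrary choice):

def f(x, y):
    return x**2 + 3*x*y

def numerical_gradient(f, x, y, h=1e-6):
    # Nudge one input at a time while holding the other fixed.
    df_dx = (f(x + h, y) - f(x, y)) / h
    df_dy = (f(x, y + h) - f(x, y)) / h
    return [df_dx, df_dy]

print(numerical_gradient(f, 2.0, 1.0))   # ≈ [7.0, 6.0], matching the analytic gradient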

Exercises

Work each problem on paper before revealing the solution. The goal is fluency with the operation, the point at which these computations feel mechanical rather than effortful.

17. For f(x) = x³, what is f′(x)? Evaluate at x = 2.
Solution: f′(x) = 3x². At x = 2: f′(2) = 3(4) = 12. The function is increasing steeply at x = 2.
18. For f(x, y) = 2x² − y² + xy, compute both partial derivatives. Evaluate the gradient at (1, 3).
Solution: ∂f/∂x = 4x + y. ∂f/∂y = −2y + x. At (1, 3): ∂f/∂x = 4 + 3 = 7, ∂f/∂y = −6 + 1 = −5. Gradient: [7, −5]. To decrease f, move in the direction [−7, 5], decreasing x and increasing y.
Chapter IX

Gradient Descent

Iterative parameter updates via SGD, Adam, and backpropagation.

Gradient descent is the algorithm that uses the gradient (Chapter VIII) to iteratively improve the model’s parameters. The idea is straightforward: compute the gradient of the loss with respect to every parameter, then take a small step in the opposite direction, repeating for every batch of training data. Each step reduces the loss slightly, and over millions of steps, the model’s predictions improve from random noise to coherent language.

The update rule is a single equation:

w ← w − α · ∂L/∂w
update rule

The symbol α is the learning rate, a small positive number (typically between 0.0001 and 0.01) that controls the step size. Too large, and the model overshoots the minimum, bouncing around without converging. Too small, and training takes prohibitively long. The learning rate is perhaps the most important hyperparameter in all of deep learning, and modern training runs often vary it on a schedule: starting higher and gradually decreasing as training progresses.

Figure 6 · Gradient descent: following the negative gradient toward a loss minimum
(A loss landscape simplified to one dimension: loss on the vertical axis, parameter value on the horizontal axis, descending from the starting point toward the minimum.)

In practice, computing the gradient over the entire training dataset for each update would be prohibitively expensive. Stochastic gradient descent (SGD) computes the gradient on a small random subset of the data, a mini-batch, and uses that estimate instead. The gradient estimate is noisy (it may not point exactly toward the true minimum), but averaged over many steps the noise cancels out. Mini-batch sizes for LLM training are typically in the thousands to millions of tokens.

Most modern LLMs use a refined variant called Adam (Adaptive Moment Estimation), which maintains running averages of both the gradient and its square for each parameter, effectively giving each parameter its own adaptive learning rate. The mathematical details of Adam are beyond the scope of this primer, but conceptually it is still gradient descent (compute the gradient, step in the opposite direction) with per-parameter step sizes that adapt based on the history of recent gradients.

Backpropagation is the algorithm that computes the gradient efficiently. It works backward through the network, applying the chain rule (Chapter VIII) layer by layer, computing ∂L/∂w for every weight in the network in a single backward pass. The cost of the backward pass is roughly twice the cost of the forward pass, a remarkably efficient ratio that makes training networks with billions of parameters feasible.

Example 9 · One step of gradient descent

A model has a single weight w = 3.0 and the computed gradient is ∂L/∂w = 4.0. The learning rate α = 0.1.

w_new = w − α · ∂L/∂w
      = 3.0 − (0.1)(4.0)
      = 3.0 − 0.4 = 2.6

The weight moved from 3.0 to 2.6, a step in the direction that reduces the loss. After many such updates across all parameters, the model converges toward parameters that produce lower loss (better predictions).
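
A toy gradient descent loop in Python, minimizing the one-parameter loss L(w) = (w − 2)², whose analytic gradient is 2(w − 2) (the loss function and starting point are invented for illustration):

w = 3.0            # starting weight
alpha = 0.1        # learning rate

for step in range(25):
    grad = 2 * (w - 2.0)    # dL/dw for L(w) = (w - 2)^2
    w = w - alpha * grad    # the update rule: step against the gradient

print(round(w, 4))          # ≈ 2.0038: w has descended close to the minimum at 2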

Exercises

Work each problem on paper before revealing the solution. The goal is fluency with the operation, the point at which these computations feel mechanical rather than effortful.

19. A weight w = 5.0 has gradient ∂L/∂w = −2.0 and the learning rate is α = 0.05. What is the new value after one gradient descent step?
Solution: w_new = 5.0 − (0.05)(−2.0) = 5.0 + 0.1 = 5.1. Note: the negative gradient means the loss was decreasing as w increased, so the update correctly increases w to reduce the loss further.
20. If a model has 7 billion parameters and we compute the gradient of the loss with respect to each one, how many numbers are in the gradient vector?
Solution: 7 billion. The gradient has one component per parameter. In a training step, all 7 billion values are computed via backpropagation and used to update all 7 billion weights simultaneously. This is why LLM training requires substantial compute.
Chapter X

Attention

Queries, keys, values, and the scaled dot-product mechanism.

The attention mechanism is the mathematical core of the transformer architecture, and every concept from the preceding nine chapters converges here. Vectors represent tokens (Chapter I). Dot products measure relevance between tokens (Chapter II). Matrices project embeddings into queries, keys, and values (Chapter III). The exponential function and softmax convert raw scores into attention weights (Chapters IV and VI). Cross-entropy provides the loss signal (Chapter VII), and gradient descent adjusts the weight matrices to improve predictions (Chapters VIII and IX). The reader who has worked through this primer now possesses every tool needed to read the attention equation from the original “Attention Is All You Need” paper and parse every symbol.

The mechanism begins with three linear projections. Given an input matrix X (one row per token, each row a vector), the model computes:

Q = XW_Q     K = XW_K     V = XW_V
projections

Q is the query matrix, what each token is “looking for.” K is the key matrix, what each token “advertises” about itself. V is the value matrix, the actual information each token carries. The weight matrices W_Q, W_K, and W_V are learned parameters, adjusted through gradient descent during training. Each of these matrix multiplications is exactly the operation defined in Chapter III.

Next, the model computes attention scores by taking the dot product of every query with every key. This produces a matrix where entry (i, j) measures how much token i should attend to token j. The scores are scaled by dividing by √d_k (the square root of the key dimension), which prevents the dot products from growing too large in high-dimensional spaces, a scaling factor that keeps the softmax from saturating.

scores = QKᵀ / √d_k
attention scores

Softmax (Chapter VI) converts each row of the score matrix into a probability distribution, a set of attention weights that sum to 1. Each token now has a distribution over all other tokens indicating how much attention it pays to each.

Finally, these attention weights multiply the value matrix. Each token’s output is a weighted sum of all other tokens’ values, where the weights are determined by the attention scores. The full equation:

Attention(Q, K, V) = softmax(QKᵀ / √d_k) · V
scaled dot-product attention
Figure 7 · The attention mechanism: from input to output
(Pipeline: input X → projections Q = XW_Q, K = XW_K, V = XW_V → scores QKᵀ / √d_k → softmax → weighted sum with V → Attention(Q, K, V).)

Multi-head attention runs this entire process multiple times in parallel, with different learned projection matrices for each “head.” If a model uses 12 attention heads, it learns 12 different sets of (W_Q, W_K, W_V) matrices, computes 12 separate attention outputs, concatenates them, and projects the result back to the model’s embedding dimension through a final weight matrix. Each head can learn to attend to different aspects of the input; one head might track syntactic relationships while another tracks semantic similarity.
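
A shape-focused NumPy sketch of the multi-head idea just described (random matrices stand in for the learned projections; this illustrates the head split and concatenation, not a production implementation, and assumes NumPy is available):

import numpy as np

def multi_head_attention(X, n_heads):
    n_tokens, d_model = X.shape
    d_k = d_model // n_heads
    head_outputs = []
    for _ in range(n_heads):
        # Stand-ins for one head's learned W_Q, W_K, W_V matrices.
        W_Q, W_K, W_V = (np.random.randn(d_model, d_k) * 0.02 for _ in range(3))
        Q, K, V = X @ W_Q, X @ W_K, X @ W_V
        scores = Q @ K.T / np.sqrt(d_k)
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)      # row-wise softmax
        head_outputs.append(weights @ V)                    # (n_tokens, d_k)
    concat = np.concatenate(head_outputs, axis=-1)          # back to (n_tokens, d_model)
    W_O = np.random.randn(d_model, d_model) * 0.02          # final output projection
    return concat @ W_O

X = np.random.randn(10, 64)                       # 10 tokens, d_model = 64
print(multi_head_attention(X, n_heads=8).shape)   # (10, 64)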

Every symbol in the attention equation has now been defined through the preceding chapters. Q, K, and V are matrices produced by the matrix multiplications of Chapter III. QKᵀ is a matrix of dot products from Chapter II. The division by √d_k uses the square root from Chapter I’s magnitude formula. Softmax applies the exponential function from Chapter IV to produce probability distributions from Chapter V. The final multiplication by V is another matrix operation. The loss function (Chapter VII) measures how well the model’s predictions match reality, and gradient descent (Chapter IX) adjusts all the weight matrices to improve the next prediction.

Example 10 · Tracing attention for a three-token sequence

Consider three tokens with 2-dimensional embeddings, and a single attention head with d_k = 2:

Q = [[1, 0],    K = [[1, 1],    V = [[10, 0],
     [0, 1],        [0, 1],        [0, 10],
     [1, 1]]        [1, 0]]        [5, 5]]

Step 1: Compute QKᵀ

QKᵀ = [[1·1+0·1, 1·0+0·1, 1·1+0·0],   = [[1, 0, 1],
       [0·1+1·1, 0·0+1·1, 0·1+1·0],      [1, 1, 0],
       [1·1+1·1, 1·0+1·1, 1·1+1·0]]      [2, 1, 1]]

Step 2: Scale by √d_k = √2 ≈ 1.414

scaled = [[0.71, 0.00, 0.71],
         [0.71, 0.71, 0.00],
         [1.41, 0.71, 0.71]]

Step 3: Apply softmax row-wise (each row sums to 1)

weights ≈ [[0.40, 0.20, 0.40],
           [0.40, 0.40, 0.20],
           [0.50, 0.25, 0.25]]

Step 4: Multiply by V (each row is a weighted sum of the value vectors)

output row 1 = 0.40·[10,0] + 0.20·[0,10] + 0.40·[5,5]
             = [4.0, 0] + [0, 2.0] + [2.0, 2.0]
             = [6.0, 4.0]

Token 1 attends equally to tokens 1 and 3 (weights 0.40 each), with less attention to token 2 (weight 0.20). Its output is a blend of those tokens’ values, weighted accordingly.
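
The same trace, written as a NumPy sketch of scaled dot-product attention (the matrices are the toy values above; a real implementation would also handle masking and batching):

import numpy as np

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # every query dotted with every key
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V                               # weighted sum of the value vectors

Q = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
K = np.array([[1.0, 1.0], [0.0, 1.0], [1.0, 0.0]])
V = np.array([[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]])

print(scaled_dot_product_attention(Q, K, V).round(1))
# First row ≈ [6.0, 4.0], matching the hand computation above.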

Exercises

Work each problem on paper before revealing the solution. The goal is fluency with the operation, the point at which these computations feel mechanical rather than effortful.

21. In a transformer with d_model = 768 and 12 attention heads, what is d_k (the dimension per head)? What shape are the W_Q, W_K, and W_V matrices for a single head?
Solution: d_k = 768 / 12 = 64. Each projection matrix maps from 768 dimensions to 64 dimensions, so W_Q, W_K, and W_V each have shape (768 × 64) per head. Across all 12 heads, the total parameter count for Q projections alone is 12 × 768 × 64 = 589,824.
22. Why does the attention formula divide by √d_k rather than d_k or some other scaling factor?
Solution: When two random vectors of dimension d_k are dot-producted, the expected magnitude of the result scales proportionally to √d_k. Dividing by √d_k normalizes the scores so their variance remains approximately 1, regardless of the dimension. Without this scaling, high-dimensional dot products would produce very large values, pushing softmax into its saturation regions where gradients vanish and training stalls.
23. A model has a sequence of 100 tokens and uses 8 attention heads. How many separate dot products are computed in the attention layer?
Solution: Each head computes a QKᵀ matrix with shape (100 × 100) = 10,000 dot products. With 8 heads: 8 × 10,000 = 80,000 dot products. This is why attention is often described as quadratic in sequence length: doubling the sequence length quadruples the number of dot products.
Notation Reference
Symbols used in this primer
Symbol           Meaning                                       Introduced
v, a, b          Vectors                                       Ch. I
||v||            Magnitude (length) of vector v                Ch. I
a · b            Dot product                                   Ch. II
cos(θ)           Cosine similarity                             Ch. II
W                Matrix (weight matrix)                        Ch. III
Wv               Matrix-vector multiplication                  Ch. III
exp(x), eˣ       Exponential function                          Ch. IV
ln(x), log(x)    Natural logarithm                             Ch. IV
P(A | B)         Conditional probability                       Ch. V
softmax(z)       Softmax function                              Ch. VI
H(p, q)          Cross-entropy                                 Ch. VII
∂L/∂w            Partial derivative of loss w.r.t. weight      Ch. VIII
∇L               Gradient vector                               Ch. VIII
α                Learning rate                                 Ch. IX
Q, K, V          Query, key, value matrices                    Ch. X
d_k              Dimension per attention head                  Ch. X

Fluency with these ten topics changes what a practitioner can do with the primary literature. Architecture papers become readable rather than skimmable. Ablation studies make sense because the operations being ablated have concrete definitions. Hyperparameter choices (learning rate schedules, attention head counts, embedding dimensions) connect to the mechanics they govern rather than appearing as arbitrary numbers to copy from a reference implementation. The goal is not to become a mathematician but to stop being blocked by the notation.

thinkwright
Experiments & Research by Brandon Huey