How ChatGPT Was Built, Part 4: The Math Behind the Magic

Four equations that explain why AI capabilities appear from nowhere — and what they tell us about where this is all heading
Tech Reader Magazine  ·  How ChatGPT Was Built  ·  Part 4
Longform Essay

Part 3 told the story of emergence — capabilities that appeared without being designed, at scales nobody had fully predicted. This final installment goes underneath that story and asks the harder question: is there mathematics that explains why? The answer is yes. And it is more illuminating than the mystery it replaces.

There is a certain comfort in mystery. The story of emergent AI capabilities — capabilities that appeared without being designed, at thresholds nobody fully predicted — has a quality that invites wonder and resists explanation. It is tempting to leave it there. To say that something remarkable happened, that the models surprised their creators, and that the full account of why remains beyond reach.

That temptation should be resisted. Because there is mathematics for this. Not a single grand unified theory, not a complete predictive framework that would have let researchers forecast GPT-3's capabilities before it was trained — but four distinct mathematical frameworks, each illuminating a different facet of emergence, each built from serious research in physics, information theory, statistics, and geometry. Together they tell a story that is more interesting than the mystery they replace.

This article presents those four frameworks. Each one comes with a single equation. Each equation is explained in plain English before and after it appears. You do not need a mathematics degree to follow this. You need the same thing you needed for Parts 1, 2, and 3 of this series: willingness to follow an argument carefully.

That is the only prerequisite.

· · ·

Framework One: Scaling Laws — The Equation That Made Scaling a Science

The first framework is the most empirical. It does not explain why emergence happens. It describes, with remarkable precision, how performance improves as models grow — and it was this description that gave OpenAI the confidence to keep scaling when most of the research community expected diminishing returns.

In 2020, Jared Kaplan and colleagues at OpenAI published a paper demonstrating that the relationship between model size and performance followed a remarkably clean mathematical pattern. Not a rough trend. Not a general tendency. A precise, quantifiable relationship called a power law — the same kind of relationship that describes earthquake magnitudes, city populations, and the frequency of words in natural language.

The equation at the center of their findings was this:

Equation 1  ·  The Neural Scaling Law
\[ L \approx A \cdot N^{-\alpha} \]

Where L is the model's loss (a measure of how badly it predicts the next token, with lower being better), N is the number of parameters, A is a constant fitted from the data, and α (alpha) is the scaling exponent that describes how quickly the loss falls as the model grows.
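
A short worked step shows what the exponent means in practice. It assumes α is roughly 0.076, close to the parameter-scaling exponent Kaplan and colleagues reported; treat the exact figure as illustrative. Taking logarithms turns the power law into a straight line, and the constant A cancels when two model sizes are compared:

\[ \log L \approx \log A - \alpha \log N \]
\[ \frac{L(10N)}{L(N)} \approx \frac{A \cdot (10N)^{-\alpha}}{A \cdot N^{-\alpha}} = 10^{-\alpha} \approx 10^{-0.076} \approx 0.84 \]

Under that assumption, each tenfold increase in parameters trims the loss by about 16 percent. It is a slow gain, but it is the same gain at every scale.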

In plain English: as you make the model larger — as N increases — the loss L decreases. The model gets better. What made this finding extraordinary was not the direction of the relationship, which anyone would have expected, but its consistency. The curve fit real experimental data with a precision that surprised even the researchers running the experiments. You could plug in a model size you had never trained, and the equation would tell you, with reasonable accuracy, how well it would perform.
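
The forecasting idea can be sketched in a few lines of Python. This is a minimal illustration, not the procedure the researchers used: the constants `TRUE_A` and `TRUE_ALPHA`, the model sizes, and the function names are all assumptions chosen for the example, and the "observed" losses are synthetic, generated from the equation itself rather than from real training runs.

```python
import math

def fit_power_law(sizes, losses):
    """Least-squares fit of log L = log A - alpha * log N. Returns (A, alpha)."""
    xs = [math.log(n) for n in sizes]
    ys = [math.log(loss) for loss in losses]
    count = len(xs)
    mean_x = sum(xs) / count
    mean_y = sum(ys) / count
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope

def predict_loss(a, alpha, n_params):
    """Equation 1: L is approximately A * N^(-alpha)."""
    return a * n_params ** (-alpha)

# Assumed values for illustration only; the losses below are synthetic.
TRUE_A, TRUE_ALPHA = 12.0, 0.076
small_models = [1e6, 1e7, 1e8, 1e9]
observed = [predict_loss(TRUE_A, TRUE_ALPHA, n) for n in small_models]

# Fit on the small models, then forecast a size that was never "trained".
a_hat, alpha_hat = fit_power_law(small_models, observed)
forecast = predict_loss(a_hat, alpha_hat, 1e11)
print(f"fitted alpha = {alpha_hat:.3f}, forecast loss at 1e11 params = {forecast:.3f}")
```

With real measurements the points would scatter around the line, and the quality of the forecast would depend on how far past the fitted range it reaches.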

This transformed AI development from art into engineering. Before scaling laws, training a larger model was a gamble. You spent the compute and hoped the results justified the investment. After scaling laws, you could forecast. You could calculate the optimal ratio of model size to dataset size to compute budget. You could estimate when a capability would become economically achievable. You could plan.

But the scaling law also pointed toward something deeper. The loss, averaged over all text, fell smoothly. Performance on specific tasks did not always follow it. On some benchmarks the rate of improvement changed character at certain thresholds of scale: the curve had kinks in it, places where the model seemed to cross into a qualitatively different regime. Those kinks were the first mathematical fingerprint of emergence. Something was happening at those thresholds that the power law alone could not explain.

· · ·

Framework Two: Phase Transitions — The Equation That Explains the Kinks

To understand what was happening at those threshold kinks in the scaling curve, researchers reached for a framework borrowed from statistical physics — the mathematical description of phase transitions. The same mathematics that describes water turning to ice, or a magnet losing its magnetism above a critical temperature, turned out to provide a powerful conceptual vocabulary for emergence in language models.

The key mathematical concept is the order parameter — a quantity that measures the degree of organization in a system. In a magnet, the order parameter is magnetization: how aligned the atomic spins are. Below a critical temperature, the spins align and magnetization is high. Above that temperature, thermal noise destroys the alignment and magnetization drops to zero. The transition between those two states is sharp and precise. A tiny change in temperature produces a complete change in behavior.

The mathematical description of that transition is expressed as a threshold function:

Equation 2  ·  The Phase Transition Threshold
\[ \Phi(N) \begin{cases} = 0 & \text{if } N < N_c \\ > 0 & \text{if } N \geq N_c \end{cases} \]

Where Φ (Phi) is the capability being measured, N is model scale, and N_c is the critical scale: the threshold at which the capability transitions from absent to present.

In plain English: below a certain scale, the capability is zero. Above that scale, it exists. The function does not care about the smooth continuous improvement happening underneath. It only registers whether the system has crossed the threshold. This is the graph shape that researchers reported for some tasks in GPT scaling experiments. Nothing, nothing, nothing, then something. Not a gradual rise. A transition.

What makes this framework powerful is what it implies about the underlying process. A phase transition does not require anything special to happen at the threshold. The underlying physics changes continuously. What changes discretely is the macroscopic behavior — the observable property of the system. Applied to language models, this suggests that the internal representations of the model may be improving continuously throughout training and scaling, while the observable capability only becomes measurable once it crosses a threshold of reliable usefulness.
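
A toy model makes the "continuous underneath, discrete on the surface" idea concrete. The sketch below is an assumption-laden illustration, not a description of any real model: `per_token_accuracy` is a made-up smooth curve, the 30-token answer length is arbitrary, and the 0.5 cutoff in `capability_present` is a hypothetical choice of what counts as "reliably useful". The point is only that a task requiring many tokens to all be correct can look absent for a long time while the underlying accuracy climbs steadily.

```python
def per_token_accuracy(n_params):
    """Hypothetical smooth curve: per-token error shrinks as a power law.
    Only meaningful for n_params >= 1e6 in this toy."""
    return 1.0 - 0.5 * (n_params / 1e6) ** -0.3

def exact_match_rate(n_params, answer_length=30):
    """Chance that every token of a multi-token answer is correct,
    assuming (for illustration) that tokens succeed independently."""
    return per_token_accuracy(n_params) ** answer_length

def capability_present(n_params, threshold=0.5):
    """Equation 2 as an observer would apply it: present or absent."""
    return exact_match_rate(n_params) >= threshold

scales = [1e6, 1e7, 1e8, 1e9, 1e10, 1e11, 1e12]
for n in scales:
    print(f"N = {n:.0e}  per-token = {per_token_accuracy(n):.3f}  "
          f"exact match = {exact_match_rate(n):.3f}  "
          f"present = {capability_present(n)}")
```

In this toy the per-token curve never jumps, yet the exact-match column sits near zero across the smaller scales and the present/absent flag flips only between the two largest ones.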

The child learning to read did not become a reader at a single moment. The brain was building toward reading for months. The phase transition framework is the mathematics of that building — continuous underneath, discrete on the surface.

2020: the year Kaplan et al. published the neural scaling laws paper at OpenAI, the finding that turned AI scaling from a gamble into a forecastable engineering discipline.

Framework Three: The Information Bottleneck — The Equation That Explains the Mechanism

The first two frameworks describe the shape of emergence. The third framework goes deeper and asks about the mechanism — what is actually happening inside the model as it scales that produces new capabilities at threshold crossings?

The answer comes from information theory, the branch of mathematics developed by Claude Shannon in the late 1940s to describe the fundamental limits of communication and compression. A large language model, viewed through the lens of information theory, is not primarily a prediction machine. It is a compression machine. Its training objective — predict the next token — forces it to find the most compact, efficient representation of the statistical structure of human language. And compression, pushed far enough, produces something unexpected: abstraction. It is worth being precise here: the information bottleneck is a theoretical framework that researchers use to understand what is happening inside a model, not a literal objective that OpenAI explicitly coded into the training process. The model does not know it is solving an information bottleneck problem. But the mathematics describes what it is doing nonetheless.

The mathematical framework that captures this is called the information bottleneck, and its core equation describes the trade-off a model must navigate between compressing its input and retaining the information needed to make accurate predictions:

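Equation 3  ·  The Information Bottleneck
\[ \min_{\theta} \; \mathcal{L}(\theta) + \beta \cdot I(\theta;\, X) \]

Where L(θ) is the model's prediction loss, I(θ; X) is the mutual information between the model's parameters and the training data (essentially, how much of the raw data the model is retaining), and β (beta) is a coefficient that controls the trade-off between compression and accuracy. This is a loss-plus-penalty way of writing the idea; the original formulation by Tishby, Pereira, and Bialek states the same trade-off in terms of a learned representation rather than the parameters.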

In plain English: the model is simultaneously trying to minimize its errors and minimize how much raw data it holds onto. Those two goals are in tension. Holding onto everything would make predictions accurate on the training data but produce no generalization: the model would simply memorize. Compressing too aggressively would throw away the information that predictions depend on and destroy accuracy. The training process navigates that tension, finding representations that are compact enough to generalize and accurate enough to be useful.
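
The trade-off can be computed exactly on a tiny example. The sketch below makes several simplifying assumptions that are not in the equation as written: it measures compression on a representation (the output of a hypothetical `encoder` function) as a stand-in for the parameters θ, it uses the remaining uncertainty about the label as the loss, and the four inputs, the parity label, and the three encoders are all invented for illustration.

```python
import math
from collections import defaultdict

def mutual_information(joint):
    """I(A;B) in bits, from a dict {(a, b): probability}."""
    p_a, p_b = defaultdict(float), defaultdict(float)
    for (a, b), p in joint.items():
        p_a[a] += p
        p_b[b] += p
    return sum(p * math.log2(p / (p_a[a] * p_b[b]))
               for (a, b), p in joint.items() if p > 0)

INPUTS = [0, 1, 2, 3]          # four equally likely inputs (toy data)

def label(x):
    return x % 2               # the thing to predict: parity

def bottleneck_objective(encoder, beta):
    """Returns (objective, loss, retained_bits) for one encoder."""
    p_x = 1.0 / len(INPUTS)
    joint_xt, joint_ty, p_y = defaultdict(float), defaultdict(float), defaultdict(float)
    for x in INPUTS:
        t = encoder(x)
        joint_xt[(x, t)] += p_x
        joint_ty[(t, label(x))] += p_x
        p_y[label(x)] += p_x
    retained = mutual_information(joint_xt)        # bits of input kept
    label_entropy = -sum(p * math.log2(p) for p in p_y.values())
    loss = label_entropy - mutual_information(joint_ty)  # uncertainty left about the label
    return loss + beta * retained, loss, retained

encoders = {
    "memorize": lambda x: x,       # keep every input distinct
    "abstract": lambda x: x % 2,   # keep only what the label needs
    "discard":  lambda x: 0,       # keep nothing
}
scores = {name: bottleneck_objective(enc, beta=0.5) for name, enc in encoders.items()}
for name, (objective, loss, retained) in scores.items():
    print(f"{name:9s} loss = {loss:.2f}  retained = {retained:.2f} bits  objective = {objective:.2f}")
```

In this toy the memorizing encoder and the abstracting encoder predict equally well, but the abstracting one holds half as many bits, so it scores best. The encoder that keeps nothing pays no compression cost and loses all predictive power.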

Here is where emergence enters. Theoretical work on the information bottleneck has shown that as the pressure to compress increases past certain thresholds, the best available representations do not just become smaller versions of what they were. They change discontinuously: phase transitions in the structure of what is learned. At those transitions, the model stops representing surface patterns and starts representing the deeper structure that generates those patterns. It stops memorizing and starts understanding, in the only sense that word can apply to a mathematical system.
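
The abrupt switch can be seen even in a three-option toy. Everything in the sketch below is hypothetical: the three candidate representations, their loss values, and the number of bits each retains are invented numbers, chosen only so that memorizing predicts slightly better than abstracting. The sketch sweeps the trade-off coefficient β from Equation 3 and reports which candidate minimizes the objective.

```python
# Hypothetical candidates: (prediction loss, bits of raw data retained).
CANDIDATES = {
    "memorize": (0.0, 2.0),
    "abstract": (0.1, 1.0),
    "discard": (1.0, 0.0),
}

def objective(name, beta):
    """Equation 3 evaluated for one candidate: loss + beta * retained bits."""
    loss, retained = CANDIDATES[name]
    return loss + beta * retained

def best_representation(beta):
    """The candidate with the lowest objective at this compression pressure."""
    return min(CANDIDATES, key=lambda name: objective(name, beta))

for beta in [0.02, 0.05, 0.08, 0.12, 0.5, 0.85, 0.95, 2.0]:
    print(f"beta = {beta:<5} -> {best_representation(beta)}")
```

Each objective is a straight line in β, so nothing in the arithmetic jumps. The winner still changes all at once where two lines cross: here, from memorizing to abstracting near β = 0.1, and from abstracting to discarding near β = 0.9.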

This is the mechanism underneath the spooky behavior. A model that has compressed human language hard enough does not just store patterns. It uncovers the latent structure that produces patterns. And from that latent structure, capabilities that nobody explicitly trained emerge naturally — because they are implicit in the structure itself.


Framework Four: Representation Geometry — The Equation That Makes It Visible

The fourth framework is the most modern and, in some ways, the most concrete. It asks not about the abstract mathematical properties of what the model is doing, but about the geometric structure of what the model has learned — and it provides a precise, measurable condition for when a capability has emerged.

Every large language model, as it processes text, converts tokens into vectors — lists of numbers that represent meaning in a high-dimensional mathematical space called the latent space or embedding space. The geometry of that space — how the vectors are arranged, how they cluster, how they relate to each other — encodes everything the model knows. A model that has learned to distinguish formal from informal language has vectors for formal and informal text that are geometrically separated in its latent space. A model that has not learned that distinction has vectors that are geometrically mixed.

The mathematical condition for a capability to exist — for a model to be able to reliably perform a task — is that the relevant representations must be linearly separable in the latent space. There must exist a direction in that space that cleanly divides the vectors associated with correct performance from those associated with incorrect performance. The emergence of a capability is, geometrically, the emergence of that separability:

Equation 4  ·  The Geometric Separability Condition
\[ \exists \; \mathbf{w} : \quad \mathbf{w}^\top h(x) > 0 \;\; \forall x \in \mathcal{C}, \qquad \mathbf{w}^\top h(x) < 0 \;\; \forall x \notin \mathcal{C} \]

Where w is a weight vector (a direction in the latent space), h(x) is the model's internal representation of input x, and C is the set of inputs belonging to the capability class. The equation says: there exists a direction in the model's internal space that places every input in the class on one side of a boundary and every other input on the other side.

In plain English: a capability exists when the model's internal geometry has organized itself so that a clean boundary can be drawn between inputs where the capability applies and inputs where it does not. Before the capability emerges, that boundary does not exist — the relevant vectors are scattered and mixed. After the capability emerges, the boundary is there. The model has organized its internal space in a way that makes the capability accessible.
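
The condition can be tested directly on small sets of vectors. The sketch below uses the classic perceptron update to search for a separating direction; this is one simple way to look for such a direction, offered as an assumption for illustration rather than as the method researchers use to probe real models. The two-dimensional "representations" and the function names are invented for the example.

```python
def dot(w, h):
    return sum(wi * hi for wi, hi in zip(w, h))

def find_separating_direction(in_class, out_class, max_epochs=100):
    """Perceptron search for w with w.h > 0 on in_class and w.h < 0 on out_class.
    Returns w, or None if no such direction was found within max_epochs
    (which suggests, but does not by itself prove, that none exists)."""
    examples = [(h, 1) for h in in_class] + [(h, -1) for h in out_class]
    w = [0.0] * len(examples[0][0])
    for _ in range(max_epochs):
        mistakes = 0
        for h, sign in examples:
            if sign * dot(w, h) <= 0:            # on the wrong side (or on the boundary)
                w = [wi + sign * hi for wi, hi in zip(w, h)]
                mistakes += 1
        if mistakes == 0:                        # every example on its correct side
            return w
    return None

# "After emergence": the two groups occupy different regions of the space.
organized_in = [(2.0, 1.0), (1.0, 2.0), (3.0, 3.0)]
organized_out = [(-1.0, -2.0), (-2.0, -1.0), (-3.0, -1.0)]

# "Before emergence": the groups are interleaved, so no single direction works.
mixed_in = [(1.0, 1.0), (-1.0, -1.0)]
mixed_out = [(1.0, -1.0), (-1.0, 1.0)]

w_organized = find_separating_direction(organized_in, organized_out)
w_mixed = find_separating_direction(mixed_in, mixed_out)
print("organized:", w_organized)
print("mixed:", w_mixed)
```

For the organized vectors the search returns a direction that satisfies the condition. For the mixed vectors it gives up, because the in-class points sit on opposite sides of the origin and no single direction can score both of them positive.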

As models scale, their latent spaces undergo topological changes — clusters merge, manifolds smooth out, new directions of organization appear. Each of those geometric changes corresponds to a new capability becoming separable, a new boundary becoming drawable. The emergent abilities that surprised researchers watching GPT-3's benchmark scores were, geometrically, the moments when the model's internal space reorganized itself enough to support a new separating boundary.

The spooky behavior has a geometry. It is not magic. It is topology.

· · ·

What the Four Frameworks Tell Us Together

Taken individually, each framework illuminates one piece of the emergence story. Taken together, they give a coherent account: incomplete in its details, as all frontier science is, but consistent in its structure.

The scaling law tells us that performance improves predictably with scale, and that the rate of improvement can shift at certain thresholds. The phase transition framework tells us why those threshold shifts produce abrupt capability changes rather than gradual ones — the underlying change is continuous, but the observable behavior is discrete. The information bottleneck tells us the mechanism: compression past critical thresholds forces the model to uncover latent structure rather than memorize surface patterns, and that structural knowledge is what new capabilities are built from. The geometric separability condition tells us what emergence looks like in the model's internal space: a new direction becomes available, a new boundary becomes drawable, a new capability becomes accessible.

None of these frameworks could have predicted, before GPT-3 was trained, exactly which capabilities would appear at exactly which scales. That remains beyond current mathematical reach. But together they explain why capabilities appear at all — why scaling a next-token predictor produces a system that can write code, follow instructions, perform arithmetic, and translate rare languages, none of which it was explicitly taught to do.

The answer is compression, geometry, and phase transitions. The answer is mathematics.

· · ·

What the Math Does Not Yet Explain

Intellectual honesty requires acknowledging what remains open. The four frameworks presented here are genuine research-grade tools, not speculation. But they do not constitute a complete theory of emergence in large language models, and the researchers who developed them would be the first to say so.

We cannot yet predict exactly when a specific capability will appear. We cannot fully explain why some abilities require enormous scale while others emerge relatively early. We cannot unify the four frameworks into a single coherent theory that derives all their results from first principles. We do not have a complete account of why chain-of-thought reasoning works, why in-context learning is possible, or why models can sometimes self-correct.

These are open questions. They are being worked on. The mathematics is getting closer every year. But the honest position in 2026 is that we understand the shape of emergence better than we understand its causes, and we understand its causes better than we can predict its future instances.

That is the frontier. It is genuinely interesting, genuinely unresolved, and genuinely worth following.

Aaron's Take

What strikes me about these four frameworks, having spent several weeks now living inside this series, is how they reframe the entire story of ChatGPT's development. Parts 1 and 2 told the story of a conviction and a construction — OpenAI recognized the transformer's potential and built an industrial system to realize it. Part 3 told the story of surprise — capabilities appearing that nobody had designed. This final installment tells us that the surprise, while genuine, was not arbitrary. There was structure underneath it. There was mathematics.

The scaling law told OpenAI that investing in scale was rational, not reckless. The phase transition framework gave researchers a way to read the abrupt capability jumps they were seeing as physically meaningful rather than as mere measurement artifacts. The information bottleneck told theorists that compression was the engine of abstraction. The geometric separability condition gave researchers a precise, testable definition of what it means for a capability to exist inside a model.

None of this diminishes the achievement. If anything it deepens it. OpenAI built a system that, when scaled past certain thresholds, underwent mathematical phase transitions in its internal geometry, forcing its representations through information bottlenecks that uncovered latent structure in human language and knowledge. That is what ChatGPT is, underneath the chat window. That is what the engineers built, even if they did not have all of this language for it at the time.

The math came later. The build came first. And the build changed the world.

This series is complete
How ChatGPT Was Built — All Four Parts

Part 1: The Idea That Started Everything  ·  Part 2: The Engineering Challenge That Made It Real  ·  Part 3: The Capabilities Nobody Designed  ·  Part 4: The Math Behind the Magic. All four parts are available at Tech Reader Magazine.