How ChatGPT Was Built, Part 3: The Capabilities Nobody Designed

The engineers scaled the models. The researchers watched the benchmarks. And then, at certain thresholds, capabilities appeared that nobody had programmed, planned, or predicted. This is the story of emergence — and why it remains one of the most honest open questions in all of science.
Tech Reader Magazine  ·  How ChatGPT Was Built  ·  Part 3
Longform Essay


OpenAI built the infrastructure. Then the model started doing things nobody asked it to do.

There is a moment in the history of GPT-3 that researchers who were there still describe with a particular quality of unease. The model had been trained. The infrastructure had held. The scaling had worked exactly as the math suggested it would. And then someone sat down to evaluate what the system could do — and found capabilities that nobody had put there.

Not approximations of capabilities. Not hints. Actual, functional abilities that the model had not been trained to develop. It could write syntactically valid code in languages it had never been explicitly taught. It could perform arithmetic it had never been drilled on. It could follow instructions in a conversational register that nobody had designed into it. It could translate between rare language pairs that appeared infrequently in its training data.

The researchers had built a next-token predictor. They had ended up with something that behaved, in certain respects, like a general reasoner.

Nobody had a complete explanation for why.

· · ·

Nothing. Nothing. Nothing. Then Something.

The pattern that emerged from the GPT scaling experiments was consistent enough to have its own graph shape, and that shape was deeply strange. Researchers would evaluate a capability across models of increasing size. The small model would score near zero. The slightly larger model would score near zero. The larger model after that — still near zero. And then, at some threshold of scale, the capability would appear. Not gradually. Abruptly.

Plot it on a chart and you got a flat line, then a sharp vertical rise. Nothing, nothing, nothing — then something. The phenomenon acquired a name: emergent abilities. The name captured the shape of what researchers were seeing without fully explaining it.

What made this genuinely unsettling was not just the abruptness. It was the absence of deliberate design. When a software engineer adds a feature to a product, there is a commit, a pull request, a decision. Someone chose to build that capability and someone built it. The emergent abilities in large language models had none of that provenance. Nobody had written a function for arithmetic reasoning. Nobody had inserted a translation module. Nobody had designed a code-writing component. The capabilities appeared because the model had been made larger and trained on more data. That was the entire explanation available at the time.


Steam Engines Before Thermodynamics

It is worth pausing here to note that this situation — engineering outrunning theory — is not unprecedented in the history of science. People built working steam engines before thermodynamics existed as a formal discipline. The engines worked. The theory of why they worked came later. People flew airplanes before aerodynamics was fully understood. The Wright brothers were not waiting for a complete mathematical account of lift before they attempted flight at Kitty Hawk.

Large language models followed the same pattern. The systems were advancing faster than the theories explaining them. Researchers were making empirical observations — this model can do this, that model cannot, this larger model can do both — before they had a principled account of what was happening underneath. The engineering arrived before the explanation. That is one reason the early GPT era felt simultaneously exciting and disorienting to the people inside it.

It also means that anyone who tells you the emergent abilities of large language models were predicted and planned is telling you a story that is tidier than the truth. The scaling laws predicted that performance would improve. They did not predict which specific capabilities would appear, at what scale, or why. Those were discoveries, not designs.

· · ·

Water Does Not Gradually Become Ice

The best physical analogy for what researchers were observing comes from a phenomenon that everyone has encountered and almost nobody thinks about carefully: the freezing of water.

Just above 32 degrees Fahrenheit, water is liquid. Just below, it is ice. The difference in temperature is a fraction of a degree. The difference in behavior is total. The temperature, and with it the average motion of the molecules, changes smoothly as the water cools, yet the macroscopic behavior of the substance undergoes a complete transformation at a precise threshold. This is what physicists call a phase transition.

Researchers looking at the emergence of capabilities in large language models began to recognize something structurally similar. The training loss — the measure of how well the model was predicting text — changed smoothly and continuously as the model scaled. There was no sudden jump in the loss curves. But the observable capabilities did jump. Below a certain scale, a model could not perform multi-step arithmetic. Above that scale, it could. The underlying change was continuous. The behavioral change was discrete.
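
A toy calculation makes the smooth-versus-abrupt contrast concrete. The sketch below is illustrative only: the power-law loss curve, its constants, and the ten-token answer length are assumptions chosen for the example, not measurements from any GPT model. It treats the chance of getting one answer token right as exp(-loss), and the chance of getting a whole answer right as that value raised to the answer length.

```python
import math

# All constants are hypothetical, chosen only to make the shape visible.
CRITICAL_PARAMS = 9e8   # toy scale constant for the loss curve
LOSS_EXPONENT = 0.5     # toy exponent, not a fitted value
ANSWER_TOKENS = 10      # tokens that must all be right for a correct answer


def per_token_loss(params: float) -> float:
    """Smooth power-law loss (in nats) as a function of parameter count."""
    return (CRITICAL_PARAMS / params) ** LOSS_EXPONENT


def exact_match_rate(params: float, answer_tokens: int = ANSWER_TOKENS) -> float:
    """Chance that every token of the answer is right.

    Assumes each token is independently correct with probability exp(-loss).
    """
    return math.exp(-answer_tokens * per_token_loss(params))


def scaling_table(sizes):
    """Return (params, loss, exact_match) rows for the given model sizes."""
    return [(n, per_token_loss(n), exact_match_rate(n)) for n in sizes]


if __name__ == "__main__":
    for n, loss, acc in scaling_table([1e8, 1e9, 1e10, 1e11, 1e12]):
        print(f"{n:8.0e} params  loss={loss:6.3f}  exact match={acc:8.4f}")
```

With these made-up constants the loss shrinks by the same factor at every tenfold increase in size, while the exact-match rate stays near zero for the smallest models and then climbs steeply. Nothing in the loss curve marks the point where the task starts to work.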

This framing — emergence as phase transition — gave researchers a mathematical vocabulary for describing what they were seeing, borrowed from statistical physics. It did not fully explain why specific capabilities appeared at specific scales. But it provided a framework for thinking about the phenomenon that was more rigorous than pointing at a benchmark chart and calling it spooky.

~100B+ parameters: the approximate scale at which researchers began consistently observing emergent abilities in language models, meaning capabilities that were absent at smaller scales and appeared without explicit training.

The Child Who Could Not Read, Then Could

There is a second analogy that gets at the human experience of witnessing emergence, and it comes not from physics but from childhood development.

A child learning to read spends months in apparent confusion. Letters seem arbitrary. Words seem arbitrary. The relationship between the marks on a page and the sounds they represent seems arbitrary. A parent watching this process sees no progress, week after week. Then one day, something shifts. The child reads a paragraph. Not slowly, not haltingly — reads it. The capability appears to arrive fully formed, as if switched on.

But the brain was not idle during those months of apparent confusion. Thousands of small adjustments were accumulating in the neural architecture — connections forming, patterns consolidating, representations building toward a threshold. The progress was continuous and invisible. The capability became visible only when enough of that invisible work had accumulated to cross a threshold of functional use.

Large language models may be doing something structurally similar. The loss curves show continuous improvement during training. The benchmark evaluations show apparent stasis, then sudden capability. The disconnect between those two observations suggests that something is building underneath the surface of what the benchmarks measure — and that capabilities become visible only when they cross the threshold of being reliably useful, not when they first begin to form.

· · ·

The Uncomfortable Debate: Was It Real?

Science being science, not everyone accepted the emergence narrative uncritically. Beginning around 2023, a series of papers argued that emergent abilities in large language models might be, at least in part, an artifact of how researchers were measuring them rather than a genuine property of the models themselves.

The argument was pointed and worth taking seriously. Many capability benchmarks grade each answer as pass or fail. A model either produces the exact right answer or it does not, and a ten-digit sum with nine digits right scores the same as one with none right. If the underlying skill is improving continuously but the benchmark records only that binary outcome, gradual improvement can look like sudden emergence. A model whose per-digit accuracy climbs steadily will sit near zero on such a benchmark for a long time, then shoot upward once it gets nearly every digit right. To the benchmark, it looks like a capability appeared overnight. To the model, it was one more incremental step in a continuous process.

When researchers switched to benchmarks that could capture partial credit and continuous improvement, some emergent abilities became smoother. The sharp vertical rise flattened into a curve. The spookiness diminished.
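
The effect of the scoring rule can be sketched with a small simulation. The code below is a hedged illustration, not a reconstruction of any published experiment: the answer length, the number of benchmark items, and the assumption that each token is right independently with the same probability are all invented for the example. It grades the same simulated answers twice, once all-or-nothing and once with partial credit.

```python
import random

ANSWER_TOKENS = 10   # hypothetical answer length
N_ANSWERS = 2000     # hypothetical number of benchmark items


def simulate_answers(token_accuracy, rng):
    """Each answer is a list of booleans: True where the token was right."""
    return [
        [rng.random() < token_accuracy for _ in range(ANSWER_TOKENS)]
        for _ in range(N_ANSWERS)
    ]


def exact_match_score(answers):
    """All-or-nothing grading: an answer counts only if every token is right."""
    return sum(all(tokens) for tokens in answers) / len(answers)


def partial_credit_score(answers):
    """Partial-credit grading: average fraction of tokens that are right."""
    return sum(sum(tokens) / len(tokens) for tokens in answers) / len(answers)


def compare_metrics(token_accuracies, seed=0):
    """Grade the same simulated answers under both scoring rules."""
    rng = random.Random(seed)
    rows = []
    for p in token_accuracies:
        answers = simulate_answers(p, rng)
        rows.append((p, exact_match_score(answers), partial_credit_score(answers)))
    return rows


if __name__ == "__main__":
    for p, exact, partial in compare_metrics([0.5, 0.6, 0.7, 0.8, 0.9, 0.99]):
        print(f"token accuracy={p:.2f}  exact match={exact:.3f}  partial credit={partial:.3f}")
```

Under these assumptions the partial-credit score tracks token accuracy almost linearly, while the exact-match score stays low until token accuracy is high and then rises quickly. The simulated model is the same in both columns; only the grading differs.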

This does not resolve the question entirely. Some emergent behaviors remained genuinely abrupt even under more granular measurement. And the practical reality — that capabilities useful enough to matter appeared at specific scales and not before — was real regardless of what the underlying curves looked like. But the debate introduced an important epistemic caution: what looks like a phase transition in nature may sometimes be a phase transition in our measurement instruments.


What the Engineers Actually Learned

Whatever the ultimate theoretical account of emergence turns out to be, the practical lesson that researchers drew from the GPT scaling experiments was clear and consequential: intelligence-like capabilities often improve continuously as scale increases, and they do not hit the ceilings that most experts had predicted.

Before GPT-3, the dominant expectation in the research community was that scaling would eventually plateau. You could make models larger, but at some point larger would stop meaning better. There would be diminishing returns, then no returns. The empirical evidence from GPT-2 to GPT-3 and beyond challenged that expectation directly. The capabilities kept unlocking. The ceilings that experts had confidently predicted kept failing to appear.

That finding reshaped the industry not because anyone had proved a grand theory of intelligence, but because the empirical evidence was too consistent to ignore. More compute produced better representations. Better representations produced better performance. Better performance, at certain thresholds, produced new capabilities. The cycle repeated. The lesson was: keep scaling, keep watching, and do not assume you know what the next threshold will reveal.

It was a lesson learned not from theory but from observation. From researchers sitting down with a newly trained model and discovering what it could do. From benchmark charts that showed nothing, nothing, nothing — and then something.

· · ·

Why This Changes the Story

The emergence of undesigned capabilities is not just an interesting scientific curiosity. It changes the fundamental character of what large language models are and how we should think about them.

A system that can only do what it was explicitly trained to do is a tool in the traditional sense — a sophisticated tool, but a tool. Its capabilities are bounded by the intentions of its designers. A system that develops capabilities its designers did not intend and cannot fully explain is something different. Not necessarily more dangerous, not necessarily more powerful, but categorically different in kind. It is a system whose behavior is partly a product of its training and partly a product of processes that are not yet fully understood.

This is why alignment research — the work of ensuring that AI systems behave in ways that are safe and beneficial — became so urgent so quickly. It is relatively straightforward to align a tool that only does what you tell it to do. It is considerably more challenging to align a system that may develop new capabilities at the next threshold of scale, capabilities that nobody predicted and nobody designed, emerging from the same process of gradient descent and data that produced everything else the system knows.

The engineers built the infrastructure. They scaled the models. They crossed the thresholds. And the systems started doing things nobody had asked them to do. Understanding why that happens, and what it means for where this technology is heading, requires the mathematics, which is where Part 4 of this series begins.

Aaron's Take

The thing that stays with me about emergence is how thoroughly it unsettles the standard narrative of technology development. We are accustomed to thinking of software as intentional — someone decided what it would do, someone built what they decided, and the result does what was built. The emergent abilities of large language models break that chain of intentionality. Nobody decided that GPT-3 would write code. Nobody built a code-writing module. The capability appeared because the model was large enough and had seen enough text that something clicked into place that nobody had planned.

That should give us pause — not alarm, but genuine intellectual humility. We are building systems whose full capability space we do not know in advance. We discover what they can do by training them and watching. The benchmarks are our instruments of discovery. And as the debate over emergence measurement reminds us, our instruments are imperfect. We may be seeing phase transitions in the models, or we may be seeing phase transitions in our ability to measure them. The honest answer, in 2026, is that we are still not entirely sure which.

What we are sure of is that the capabilities are real, they were not all designed, and they keep appearing. That is enough to demand the serious mathematical treatment that Part 4 will provide.

Next in this series
How ChatGPT Was Built — Part 4: The Math Behind the Magic

Phase transitions. Scaling laws. Information theory. Representation geometry. Researchers have spent years developing mathematical frameworks to explain why capabilities emerge when they do. Part 4 goes into the mathematics — in plain English, with the full Feynman treatment — and asks what the equations actually tell us about where this is all heading.

 
