How ChatGPT Was Built, Part 1: The Idea That Started Everything
The Idea That Started Everything
In the summer of 2017, a research paper appeared on arXiv, the open-access server where scientists post findings before formal publication, with a title that sounded almost like a provocation: Attention Is All You Need. Eight researchers working at Google had written it over five months of experiments at the company's campus in Mountain View, California. The paper was fifteen pages. It included equations, architecture diagrams, and a link to the TensorFlow code used to train the models. It was presented at NeurIPS, the major annual machine learning conference, later that year. It did not make the front page of any newspaper.
Inside OpenAI, a young San Francisco organization founded just two years earlier with the stated mission of building artificial intelligence that would benefit humanity, the paper landed with the force of a recognition. Ilya Sutskever, OpenAI's chief scientist and one of the most respected researchers in the field, read it and grasped its significance immediately. His reaction, recounted in interviews, was close to: oh my God, this is the thing. OpenAI's researchers began building GPT-1 shortly after.
That moment — the paper, the recognition, the decision to build — is where this story begins.
The Question OpenAI Was Already Asking
To understand why the transformer paper landed the way it did, you have to understand what OpenAI was already thinking about when it arrived.
From its earliest days, the organization had been circling a deceptively simple idea: that predicting the next thing might be all you need. Not a narrow, task-specific model trained to translate languages or classify images. A general model. One that could learn from raw, unlabeled text by doing one thing repeatedly — guess what word comes next — and develop something closer to general understanding in the process.
This idea connected to a concept that was, in 2017, still considered the holy grail of machine learning: unsupervised learning. Most AI systems at the time were supervised. You showed them thousands of labeled examples — this image is a cat, this sentence is positive, this email is spam — and they learned to recognize patterns in those labels. It worked, but it required enormous amounts of human effort to produce the labels, and it generated models that were narrow. Good at one thing. Useless at everything else.
Unsupervised learning was the dream of something different: a model that could learn from raw data without labels and develop broadly applicable understanding. Nobody had cracked it. The neural networks of the time were too limited, too slow, too difficult to scale. The direction felt right. The tool to pursue it didn't yet exist.
Then the transformer arrived.
What Came Before — And Why It Wasn't Enough
Before Attention Is All You Need, the dominant approach to teaching computers to work with language relied on a class of architectures called recurrent neural networks, or RNNs, and their more sophisticated variants, Long Short-Term Memory networks — LSTMs. The idea behind them was intuitive: process language the way a person reads a sentence, one word at a time, left to right, carrying a memory of what came before into each new step.
The problem was that this sequential processing was slow and fragile. The further back in a sentence a relevant word appeared, the harder it was for the network to retain it. Long sentences degraded. Context faded. It was as if the machine could remember the last few words clearly and everything before that only dimly.
There was a deeper problem as well. Because each step depended on the result of the step before it, training could not be parallelized across the positions in a sequence: the computation for the tenth word had to wait for the ninth. Modern graphics processing units, the powerful parallel computing chips that had transformed gaming and visual effects, were designed to do many things at once. RNNs could not take full advantage of that. They were fundamentally one-thing-at-a-time machines running on hardware built for doing everything simultaneously. Training was slow. Scaling was harder than it needed to be.
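To make the bottleneck concrete, here is a minimal sketch of a recurrent step in plain Python. It is a toy, not any production RNN: the one-unit network and the weights `w_in` and `w_rec` are made up for illustration. The thing to notice is the loop, where each hidden state needs the previous one, and how the trace of the first input fades as the sequence goes on.

```python
import math

def rnn_forward(inputs, w_in=0.5, w_rec=0.9):
    """Run a toy one-unit recurrent network over a sequence of numbers.

    w_in and w_rec are illustrative scalar weights, not values from a real model.
    """
    hidden = 0.0
    states = []
    for x in inputs:
        # Step t needs the hidden state from step t-1, so the steps
        # of a single sequence cannot be computed at the same time.
        hidden = math.tanh(w_in * x + w_rec * hidden)
        states.append(hidden)
    return states

# A signal at the first position, then nothing: watch the memory decay.
states = rnn_forward([1.0, 0.0, 0.0, 0.0, 0.0])
for position, value in enumerate(states):
    print(position, round(value, 3))
```

Real LSTMs add gates that slow this decay, but the one-step-at-a-time loop remains.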
The Google Brain researchers were not the first to notice these problems. They were the ones who found a way through.
The transformer didn't just process language faster. It processed language differently — and that difference turned out to matter more than anyone anticipated.
The Mechanism: Self-Attention
The core insight of the transformer paper was both simple to state and radical in execution: drop the sequential processing entirely. Instead of reading left to right and accumulating memory step by step, have the model look at the entire input at once and compute the relationships between every word and every other word simultaneously.
The mechanism that made this possible was called self-attention. For every word in a sequence, the model asked: which other words in this text are most relevant to understanding this one? It built a map of those relationships — an attention score across the entire input — and used that map to understand context. The word "bank" near "river" attended differently to its surrounding words than "bank" near "money." The model could make that distinction across any distance in the text, without losing earlier context.
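For readers who want to see the arithmetic, the sketch below implements scaled dot-product self-attention in plain Python. It makes some loud simplifications: the two-dimensional "embeddings" are invented for illustration, and each vector serves as its own query, key, and value, whereas a real transformer first passes them through learned projection matrices and uses many attention heads.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(vectors):
    """Return (weights, outputs) for a list of equal-length vectors.

    Simplification: each vector is used directly as query, key, and value.
    """
    dim = len(vectors[0])
    weights, outputs = [], []
    for query in vectors:
        # Score this word against every word in the input, itself included.
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in vectors
        ]
        row = softmax(scores)
        weights.append(row)
        # The new representation is a weighted mix of all the value vectors.
        outputs.append(
            [sum(w * value[i] for w, value in zip(row, vectors)) for i in range(dim)]
        )
    return weights, outputs

tokens = ["river", "bank", "money"]
embeddings = [[1.0, 0.0], [0.8, 0.6], [0.0, 1.0]]  # hypothetical toy vectors
weights, outputs = self_attention(embeddings)
for token, row in zip(tokens, weights):
    print(token, [round(w, 3) for w in row])
```

In this toy input the vector for "bank" leans toward "river", so its attention row puts more weight on "river" than on "money". No row depends on any other row, which is why real implementations compute all of them at once as a single matrix multiplication.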
Because all of this happened simultaneously rather than one step at a time, the transformer could be distributed across thousands of processors at once. Training was dramatically faster. The architecture could scale to larger datasets. And critically, as models grew larger, they did not just get marginally better — they got qualitatively different in ways that nobody had fully predicted.
The authors released their code as open source, in Google's TensorFlow-based Tensor2Tensor library. A blueprint and a starting point, available to anyone who could read it.
OpenAI Makes Its Bet — 2017
The decision OpenAI made after reading the transformer paper was not obvious at the time. It required a specific kind of intellectual conviction that most of the research community did not share.
Sutskever championed what became OpenAI's foundational wager: that massive unsupervised pre-training on text data, using the transformer architecture, would unlock powerful general capabilities. The approach was to take the transformer, adapt it for next-token prediction — not translation, not classification, just predict the next word — train it on an enormous amount of text, and see what emerged.
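The appeal of next-token prediction is that raw text supplies its own training examples. The sketch below shows that idea in miniature, under a big assumption: a simple word-pair counter stands in for the transformer, and the helper names are hypothetical. A GPT model learns its probabilities by gradient descent over billions of tokens, but the quantity it is trained to shrink is the same kind of loss computed on the last line.

```python
import math
from collections import Counter, defaultdict

def training_pairs(tokens):
    """Every position yields a (context, next_token) example; no labels needed."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

def bigram_counts(tokens):
    """Count which word follows which. A stand-in for a trained model."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts

def next_token_probability(counts, prev, nxt):
    """Estimated probability that nxt follows prev."""
    total = sum(counts[prev].values())
    return counts[prev][nxt] / total if total else 0.0

text = "the cat sat on the mat".split()
pairs = training_pairs(text)
counts = bigram_counts(text)

p = next_token_probability(counts, "the", "cat")
loss = -math.log(p)  # cross-entropy for this one prediction
print(pairs[0], p, round(loss, 3))
```

Six words of text produce five training examples with no human labeling, which is the property that made the approach attractive at scale.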
Most researchers thought this would hit diminishing returns quickly. Make the model bigger, yes, but at some point bigger stops meaning better. That was the conventional wisdom.
Sutskever disagreed. His intuition, informed by years of work on scaling neural networks — including his involvement with AlexNet, the model that had ignited the deep learning revolution in 2012 — was that scale would not hit a wall. It would reveal new capabilities. The reported position he held among colleagues: just make it bigger. The capabilities will emerge. It was a contrarian stance in 2017. It looks obvious now only because it turned out to be correct.
OpenAI built its own implementation of the transformer, adapted it as a decoder-only model for language generation, and trained what would become GPT-1, published in 2018. The model was small by later standards, at 117 million parameters, but it worked. More importantly, it validated the direction. A general-purpose language model pre-trained on raw text with no task-specific labels could be adapted to a surprising range of tasks with only light fine-tuning.
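"Decoder-only" has a concrete meaning: when predicting the next word, each position may look only at itself and the words before it, never ahead. A minimal sketch of that rule, usually called a causal mask, looks like this. The function name is illustrative and not taken from any OpenAI code.

```python
def causal_mask(length):
    """mask[i][j] is True when position i may attend to position j."""
    return [[j <= i for j in range(length)] for i in range(length)]

# Each row is one position; "x" marks the positions it is allowed to see.
for row in causal_mask(4):
    print("".join("x" if allowed else "." for allowed in row))
```

The printed triangle shows the first word seeing only itself and the last word seeing everything before it. That restriction is what lets one pass over a sentence train the model on every next-word prediction at once without letting it peek at the answers.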
GPT-2 and the First Warning Sign — 2019
GPT-2, released in 2019, was a significant step up in scale: 1.5 billion parameters compared to GPT-1's 117 million. But its most interesting feature was not its size. It was the quality of what it produced.
GPT-2 could write coherent, sustained prose. It could continue a news article convincingly, finish a short story, answer questions in a plausible register. It was good enough that OpenAI initially withheld the full model from public release, citing concerns about potential misuse — a decision that generated significant debate in the research community and considerable press coverage. Whether the concern was warranted is still argued. What was not arguable was the quality of the outputs.
Looking back, GPT-2 was the first public signal that something different was happening. This was not a model that had been explicitly trained to write. It had been trained to predict text. The writing ability was an emergent property — something that arose from scale and exposure to vast amounts of human-generated language, without anyone having designed it in directly. The pattern was becoming clear. The transformer wanted to scale. And scaling kept producing surprises.
The Moment the Ground Shifted — 2020
GPT-3 arrived in 2020 with 175 billion parameters — more than a hundred times larger than GPT-2. The jump in scale was staggering. The jump in capability was stranger still.
GPT-3 could write code. It could perform simple arithmetic. It could answer factual questions, draft emails, translate languages, and summarize documents, none of which it had been explicitly trained to do. These were the emergent abilities that Sutskever had predicted and that the broader research community had doubted. They appeared not because anyone had engineered them in, but because the model had processed enough language to absorb the underlying patterns of all of it.
Researchers and developers who accessed GPT-3 through OpenAI's API found themselves doing something unusual: discovering capabilities by accident. You would prompt the model to do one thing and find it could do three others. The experience was disorienting in a specific way. You were interacting with a system that seemed to know more than it had been told.
GPT-3 was proof. The transformer, scaled aggressively and trained on enough data, produced something qualitatively different from anything that had come before.
But GPT-3 was not a finished product. It was a raw language model — a text-in, text-out engine with no conversational wrapper, no safety guardrails, no sense of what a user actually wanted. To use it, you had to learn how to prompt it. You had to speak its grammar. For researchers and developers, that was manageable. For the general public, it was not.
The bet OpenAI had placed in 2017 had paid off. Now the question was harder: how do you turn this into something the world can actually use?
Teaching the Model to Behave — 2021 to 2022
A powerful language model is not the same thing as a useful product. GPT-3 could do remarkable things, but it had no sense of what a user wanted, no guardrails against harmful outputs, and no conversational instinct. Closing that gap required two innovations that were less about architecture and more about behavior.
The first was instruction tuning. OpenAI trained a version of GPT-3 on a large dataset of tasks phrased as instructions — not raw text to complete, but requests with clear intent. Summarize this article. Translate this sentence. Answer this question. The model learned to interpret instructions rather than simply continue text. It became, for the first time, something closer to a cooperative tool.
The second was RLHF — Reinforcement Learning from Human Feedback. Human annotators reviewed model outputs and ranked them by quality. A separate model — called a reward model — learned those preferences. The base language model was then fine-tuned using reinforcement learning to maximize the reward: to produce outputs that human reviewers judged as helpful, accurate, and appropriate.
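How does a reward model "learn preferences" from rankings? One widely used formulation, and the one described in the InstructGPT paper, is a pairwise loss: for each pair of outputs a human compared, the reward model is penalized when it scores the rejected output above the preferred one. The sketch below shows only that loss on hand-picked scores. The numbers are invented, and a real reward model is a large neural network whose scores are adjusted to drive this loss down.

```python
import math

def preference_loss(reward_preferred, reward_rejected):
    """Pairwise ranking loss: small when the preferred output scores higher.

    Mathematically this is -log(sigmoid(reward_preferred - reward_rejected)).
    """
    margin = reward_preferred - reward_rejected
    return math.log(1.0 + math.exp(-margin))

# Hypothetical reward scores for two answers a human annotator compared.
agrees = preference_loss(2.0, -1.0)     # model ranks them the way the human did
disagrees = preference_loss(-1.0, 2.0)  # model ranks them the wrong way round
print(round(agrees, 3), round(disagrees, 3))
```

The loss is near zero when the model agrees with the annotator and grows the more confidently it disagrees, so minimizing it over many comparisons pulls the model's scores toward human judgment.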
RLHF sounds straightforward described this way. In practice it was a significant engineering and research challenge, carried out on a model that was enormously expensive to train. Every flaw in the reward model propagated directly into the behavior of the final system. Getting it right required iteration, careful data collection, and a kind of patience that is hard to sustain when the stakes are high and the feedback loops are slow. But it worked. The resulting model, called InstructGPT and published in early 2022, was dramatically more useful and substantially safer than the raw GPT-3 it was built on. It was the direct technical ancestor of ChatGPT.
ChatGPT and the Day Everything Changed — November 2022
ChatGPT launched on November 30, 2022. In one sense it was a simple product: a conversational interface layered over a fine-tuned version of GPT-3.5. A chat window. A text box. A model that responded in a conversational register rather than as a raw completion engine.
It reached one million users in five days.
Nothing in the history of consumer technology had grown that fast. Not Facebook. Not Instagram. Not TikTok. The world had apparently been waiting for exactly this: a system that could hold a coherent conversation, answer complex questions, write and edit and explain, and do all of it in plain English without requiring any technical knowledge to operate. Educators debated what it meant for classrooms. Lawyers asked what it meant for contracts. Programmers discovered it could write code. Students found it could explain anything.
Roughly five years after Ilya Sutskever read a fifteen-page research paper and understood that the transformer was the thing they had been looking for, the product built on that conviction was on its way to an estimated hundred million users, a mark it reportedly reached within about two months of launch.
Why This Story Matters
The story of how OpenAI got from the 2017 transformer paper to ChatGPT is not a story about a single breakthrough. It is a story about a sustained, deliberate, expensive, and often unglamorous commitment to a direction that most people thought was wrong or at least premature.
The transformer architecture was not invented at OpenAI. The idea of training on raw text was not new. Unsupervised learning had been a research goal for decades. What OpenAI contributed was the conviction that scale would unlock capabilities no one had explicitly designed — and the willingness to build the infrastructure required to test that conviction at a size that had never been attempted before.
That infrastructure — the data pipelines, the training systems, the distributed computing, the RLHF machinery — is the subject of Part 2 of this series. Because if the story of Part 1 is about recognizing an idea, the story of Part 2 is about what it actually takes to build one. And that story is considerably more complicated.
Aaron's Take
What stays with me about this story is how thoroughly it defies the way we usually talk about invention. There is no lone genius here. There is no garage moment, no single eureka, no one person who saw what everyone else had missed. There is a research paper from Google, a team at OpenAI that was already asking the right question, and a conviction — held against the grain of expert opinion — that scale would produce something nobody had explicitly designed.
Sutskever's instinct turned out to be right. But it was an instinct, not a certainty. The researchers building GPT-1 and GPT-2 did not know that GPT-3 would write code it had never been taught to write. They did not know that emergent abilities were coming. They believed they were headed in the right direction, and they kept building.
That is the part of this story I think gets lost in the mythology — the sustained uncertainty of it. A bet that lasted five years, on a direction most experts thought would plateau, producing something that changed computing. Not because it was inevitable. Because someone believed it and built it anyway.
The idea was settled. Now came the hard part: building the data pipelines, training infrastructure, and systems engineering required to turn a research conviction into a product used by a hundred million people. Coming soon at Tech Reader Magazine.