The Paper

In June 2017, eight researchers at Google sat down to solve a narrow problem in machine translation. They were not trying to change the world. They were trying to make computers better at translating sentences. What they built instead was the architecture that every major AI system in the world now runs on.
Tech Reader Magazine  ·  The AI Era
Investigative Essay

Inside the 2017 Research Paper That Started the AI Era

The title was a Beatles reference. The eight researchers who wrote it thought that was funny — a playful nod to "All You Need Is Love" tucked into the name of a machine learning paper that almost nobody outside of their field would read. They were not wrong about the obscurity. They were very wrong about everything else.

Attention Is All You Need, published on the arXiv preprint server on June 12, 2017, introduced an architecture called the Transformer. It was presented later that year at NeurIPS (then still known as NIPS), the major annual conference for machine learning researchers. It did not make the front page of any newspaper. It was not discussed on television. The technology industry was largely focused, in 2017, on other things: autonomous vehicles, the ongoing dominance of the smartphone, and a great deal of noise about blockchain.

The paper has since been cited more than 173,000 times. It is among the most cited scientific papers of the twenty-first century. Every large language model in widespread use today — GPT, Claude, Gemini, LLaMA, Grok — is built on the architecture it describes. The AI era, in a very direct and technical sense, began with those eight researchers, a lunch conversation, and five months of experiments at a campus in Mountain View, California.

· · ·

The Problem They Were Trying to Solve

To understand why the Transformer mattered, it helps to understand what came before it. In 2017, the dominant approach to natural language processing — teaching computers to work with human language — relied on a class of architectures called recurrent neural networks, or RNNs, and their more sophisticated descendants, Long Short-Term Memory networks, known as LSTMs.

These systems worked by processing language the way a person might read a sentence aloud: one word at a time, left to right, carrying a kind of running memory from each word to the next. They were effective at short sequences. They struggled badly with long ones. The further you got from the beginning of a sentence, the more the earlier context faded. It was as if the machine could remember the last few words clearly and everything before that only dimly.

There was also a deeper problem: the sequential nature of the processing meant that training these models was slow. Modern graphics processing units — GPUs, the powerful parallel computing chips that had transformed visual effects and gaming — were designed to do many things at once. RNNs could not take advantage of that. They were fundamentally one-thing-at-a-time machines running on hardware built for doing everything simultaneously.
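The two problems described above, fading memory and one-step-at-a-time processing, can be seen in a toy sketch. This is not the code of any real RNN or LSTM: the function names and the scalar weights `w_h` and `w_x` are illustrative assumptions, and real networks use learned weight matrices rather than single numbers.

```python
import math

def rnn_step(h_prev, x, w_h, w_x):
    """One recurrent step: blend the previous state with the current input."""
    return [math.tanh(w_h * h + w_x * xi) for h, xi in zip(h_prev, x)]

def run_rnn(inputs, w_h=0.5, w_x=1.0):
    """Process a sequence strictly left to right, one position at a time."""
    h = [0.0] * len(inputs[0])  # the running memory starts empty
    states = []
    for x in inputs:
        # Step t needs the state from step t-1, so this loop cannot be
        # spread across parallel hardware.
        h = rnn_step(h, x, w_h, w_x)
        states.append(h)
    return states

# A single "word" of signal followed by five empty positions.
sequence = [[1.0]] + [[0.0]] * 5
trace = [round(state[0], 3) for state in run_rnn(sequence)]
print(trace)
```

With these assumed weights, the trace shrinks at every step: the signal from the first position is still present at the end, but only dimly. And because each pass through the loop depends on the one before it, no amount of parallel hardware can run the steps side by side.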

The researchers at Google Brain were not the first to notice these problems. But they were the ones who found a way through.

The machine could remember the last few words clearly and everything before that only dimly. The Transformer changed what it meant to pay attention.

The Idea: Attention Without Recurrence

The core insight of the Transformer paper was both simple to state and radical to execute: what if you dropped the sequential processing entirely? What if, instead of reading left to right and accumulating memory step by step, the model looked at the entire sentence at once and figured out the relationships between every word and every other word simultaneously?

The mechanism that made this possible was called self-attention. Rather than processing words in order, the Transformer asked, for every word in a sequence: which other words in this sentence are most relevant to understanding this one? It built a map of those relationships — a kind of attention score across the entire input — and used that map to understand context. The word "bank" near "river" would attend differently to its neighbors than "bank" near "money." The model could tell the difference, across any distance in the text, without forgetting.

And because all of this happened in parallel rather than sequentially, it could take full advantage of the GPU hardware that was becoming the standard infrastructure of machine learning research. Training was dramatically faster. The models could scale to larger datasets. The results, on the translation benchmarks the team was using to evaluate their work, were better than anything that had come before.
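For readers who want to see the mechanism rather than take it on faith, here is a minimal sketch of scaled dot-product self-attention in plain Python. It simplifies heavily: a real Transformer first passes each word vector through learned projections to produce separate queries, keys, and values, and runs several attention "heads" at once. The two-number word vectors for "river," "money," and "bank" are made up for illustration and come from no real model.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    """Dot product: larger when two vectors point the same way."""
    return sum(x * y for x, y in zip(a, b))

def self_attention(vectors):
    """Scaled dot-product self-attention with queries = keys = values.

    Assumption: the learned projection matrices are omitted for brevity.
    """
    scale = math.sqrt(len(vectors[0]))
    weights, outputs = [], []
    # Each word's row depends only on the inputs, never on another row's
    # result, which is why real implementations compute all rows at once.
    for query in vectors:
        row = softmax([dot(query, key) / scale for key in vectors])
        mixed = [sum(w * v[i] for w, v in zip(row, vectors))
                 for i in range(len(query))]
        weights.append(row)
        outputs.append(mixed)
    return weights, outputs

# Hypothetical toy embeddings, chosen by hand.
river, money, bank = [1.0, 0.0], [0.0, 1.0], [0.5, 0.5]

_, near_river = self_attention([river, bank])
_, near_money = self_attention([money, bank])
print(near_river[1], near_money[1])
```

In this toy setup, the vector for "bank" comes out as [0.75, 0.25] next to "river" and [0.25, 0.75] next to "money": the same word ends up with a different representation depending on its neighbors.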

The paper itself was careful in its claims. The researchers described their results precisely, noted the limitations, and suggested directions for future work. They did not announce a revolution. They published findings. The revolution was something that happened afterward, when the rest of the field read what they had done and understood what it meant.

· · ·

The Eight People in the Room

The authors listed on the paper are presented in randomized order — the paper itself notes that all eight contributed equally. That is an unusual and deliberate choice, and it reflects something true about how the work was done: different people contributed different pieces, across months of collaboration, and no single person owns the idea. A brief look at who each of them was, and what the paper credits them with, tells the story of how a research breakthrough actually gets made.

Jakob Uszkoreit

According to the paper's own notes, Uszkoreit proposed the original idea of replacing recurrent networks with self-attention and started the effort to evaluate it. He is, in this sense, the person who asked the question the paper answers. After leaving Google, he founded Inceptive, a company applying AI to the design of RNA molecules for medical therapeutics — one of the most unexpected pivots in the diaspora.

Ashish Vaswani

Vaswani, along with Illia Polosukhin, designed and implemented the first working Transformer models. The paper describes him as "crucially involved in every aspect of this work." He left Google in 2021 to co-found Adept AI Labs, an enterprise-focused AI research company, alongside fellow co-author Niki Parmar.

Noam Shazeer

A veteran Google engineer who had been with the company for more than two decades, Shazeer proposed scaled dot-product attention and multi-head attention — two of the Transformer's defining technical contributions. The paper describes him as "the other person involved in nearly every detail." He left Google in 2021 after the company declined to launch a conversational AI he had built internally. He co-founded Character.AI, which grew to a valuation of over $1 billion before Google paid $2.7 billion to license its technology and bring Shazeer back into the fold at DeepMind in 2024.

Niki Parmar

The only woman on the team, Parmar designed, implemented, and evaluated model variants across the original codebase. She co-founded Adept with Vaswani, later departing to pursue other work.

Llion Jones

Jones was responsible for the initial codebase, experimented with model variants, and worked on efficient inference and visualizations. He was the last of the eight to leave Google, departing in 2023. His exit meant that, for the first time, none of the paper's authors remained at the company that had published it.

Aidan Gomez

At twenty years old, Gomez was a summer intern at Google Brain when the paper was written. He went on to pursue a PhD at Oxford, which he completed in 2024. In 2019, while still a graduate student, he co-founded Cohere, an AI company focused on enterprise language model applications. Cohere grew to a valuation of $6.8 billion, backed by AMD, Nvidia, and Salesforce, counting Spotify and Oracle among its clients. Gomez is its CEO.

Łukasz Kaiser

Co-creator of TensorFlow, Google's foundational open-source machine learning platform, Kaiser left Google in 2021 to join OpenAI as a researcher — moving from the company that invented the Transformer to the company most associated with deploying it at scale.

Illia Polosukhin

A Ukrainian-born computer scientist, Polosukhin left Google the same year the paper was published. His path after the Transformer was the most unexpected of the eight: rather than founding an AI company, he co-founded NEAR Protocol, a high-performance blockchain platform, taking the computational thinking behind the Transformer into a completely different domain.

· · ·

The Open Publication Question

One of the most consequential decisions surrounding the paper barely registered as a decision at the time: Google published it openly on arXiv and presented it at a public conference. This was standard practice in academic machine learning research, a field with deep traditions of open sharing that predated the current commercial AI boom. At the time, it seemed unremarkable.

In retrospect, it was one of the most significant acts of open publication in the history of technology. Within months of the paper's release, research teams around the world were building on the Transformer architecture. OpenAI used it as the foundation for the GPT series. Google itself used it for BERT, an influential language model that improved search quality. Meta used it for LLaMA. Every major AI lab in the world — including the ones that would eventually compete with Google directly — built their systems on the architecture that Google's researchers had given away for free.

It would be too simple to call this a mistake. The culture of open publication in machine learning had real value: it accelerated progress, enabled peer review, and helped build the field that made Google's AI products possible in the first place. The argument that Google should have kept the Transformer proprietary assumes that secrecy was a realistic option in a research community where talent moves fluidly and ideas travel fast. It probably was not.

But the irony is genuine and documented. As one account put it: Google invented the technology revolutionizing AI, published it openly, and then watched its own researchers leave to build the competing companies that would come to challenge it. Every other AI lab recognized the potential. Google saw incremental improvement where others saw revolution.

Cited 173,000+ times, "Attention Is All You Need" is one of the most referenced scientific papers of the twenty-first century, and the 7th most cited paper across all fields.

The Diaspora

The story of what happened after the paper is, in a very real sense, the story of the AI era itself. The eight authors scattered. The idea did not.

Vaswani and Parmar built Adept. Gomez built Cohere. Shazeer built Character.AI before returning to Google at extraordinary cost. Uszkoreit went into computational biology. Kaiser went to OpenAI. Polosukhin went into blockchain. Jones, the last to leave, built a new startup. The paper's idea, meanwhile, had already propagated far beyond any of them.

The comparison that emerged in the press, and that has some genuine historical resonance, was to the "traitorous eight" of the 1950s: the group of engineers who left the pioneering Shockley Semiconductor Laboratory to found Fairchild Semiconductor, from which two of them went on to found Intel. Those eight people did not invent the transistor. But they scattered the knowledge of how to build with it across an industry, and the industry was never the same.

The Transformer eight did not invent neural networks, or attention mechanisms, or even the specific idea of applying attention to language. The attention mechanism itself had been proposed years earlier, in 2014, by a group of researchers including Yoshua Bengio. What the 2017 paper did was show that attention was not just a useful addition to existing architectures — it was sufficient on its own. You did not need the recurrence. Attention was all you needed.

That insight, published openly and freely, became the common inheritance of an entire industry. Every company that would eventually compete with Google for AI supremacy — OpenAI, Anthropic, Meta, Mistral, Cohere, DeepSeek — built on the same foundation. The Transformer paper did not just start the AI era. It made the AI era a collective project, open to anyone with the computing resources and the talent to participate.

What Happened Next, and Why It Matters

The path from the 2017 paper to ChatGPT is not a straight line, but it is a traceable one. OpenAI began building on the Transformer architecture almost immediately, releasing GPT-1 in 2018, GPT-2 in 2019, and GPT-3 in 2020. Each generation was larger and more capable than the last. The key insight that OpenAI added — and that the 2017 paper had not fully explored — was that scaling up the size of Transformer-based models produced qualitative improvements in capability, not just quantitative ones. Bigger models did not just know more facts. They reasoned differently. They generalized in ways that smaller models could not.

That discovery about scaling, combined with the Transformer architecture, produced the models that would eventually find their way into ChatGPT and everything that followed. The architecture from 2017 was the engine. The scaling insight was the fuel. The product decision to release a conversational interface to the public, in November 2022, was the match.

By the time the public noticed what was happening, the researchers who had written the paper were already five years into building the companies and products the paper had made possible. The lunch conversation in Mountain View had become an industry. The industry had become a global race. The race was already well underway before most people knew it had started.

· · ·

Aaron's Take

What stays with me, reading back through the history of this paper, is how thoroughly it defies the way we usually tell stories about invention. There is no lone genius here. There is no garage moment, no eureka, no single person who saw what everyone else had missed. There are eight people at lunch, a five-month collaboration, a paper that nobody outside the field read, and a decision to publish openly that probably felt like standard procedure at the time.

And then, slowly and then quickly, everything changed.

The other thing I keep returning to is the diaspora. Those eight researchers did not stay together. They did not build one company that captured the value of what they had created. They scattered — into enterprise AI, into consumer chatbots, into blockchain, into RNA medicine, into open-source research. The idea they had created was bigger than any of them, and maybe that is the only way an idea of that size could propagate: by attaching itself to many different people with many different visions of what it could become.

That is not a tidy story. But it is an honest one. The AI era did not arrive from a single direction. It arrived from all of them at once.
