Teaching a Machine to Be Good

Two weeks after ChatGPT changed everything, a small AI lab published a quiet technical paper asking a different question entirely. Not how to make AI more capable — but how to make it better. That paper introduced Constitutional AI. This is what it said, and why it mattered.
Teaching a Machine to Be Good — Tech Reader Magazine
Tech Reader Magazine  ·  The AI Era
The Anthropic Series

Teaching a Machine to Be Good

The Constitutional AI paper — what it said, what it solved, and why it appeared when it did
Two weeks after ChatGPT changed everything, a small AI lab published a quiet technical paper asking a different question entirely. Not how to make AI more capable — but how to make it better. That paper introduced Constitutional AI. This is what it said, and why it mattered.

On November 30, 2022, OpenAI released ChatGPT. Within days, a hundred million people had discovered that you could type a question in plain English and receive a thoughtful, coherent answer from a machine. The conversation about artificial intelligence — what it was, what it could do, what it might become — shifted almost overnight from research labs and tech blogs into living rooms and dinner tables.

Fifteen days later, on December 15, 2022, a research lab called Anthropic published a paper. It did not make headlines. It was not designed to. It was a technical document, 34 pages, written for other researchers in the field of machine learning. Its title was Constitutional AI: Harmlessness from AI Feedback.

The timing was not coincidental. Anthropic had been working on this research for months before ChatGPT arrived. But the paper landed in a world that was, for the first time, paying real attention. Before November 30th, questions about how AI systems decide what to say were largely the concern of researchers and ethicists. After it, they belonged to everyone. The Constitutional AI paper was a direct answer — written in technical language, but answering a question the public had just started asking: how do you make sure these things behave?

· · ·

The Problem It Was Solving

Before Constitutional AI, the standard approach to making AI models safe and helpful was a technique called Reinforcement Learning from Human Feedback — RLHF for short. The idea was straightforward: show human raters pairs of AI responses, ask them which one is better, and train the model to produce the kinds of responses people preferred. Do this enough times, and the model learns to behave in ways humans find helpful, accurate, and appropriate. ChatGPT was trained this way. It worked.

But RLHF had two problems that were getting harder to ignore.

The first was scale. As models became more capable and more widely used, the demand for human feedback was growing faster than the supply of human time. Training a frontier model required enormous quantities of labeled examples. Human raters can only read so many responses per day. The process was expensive and slow, and there was no obvious way to make it dramatically faster without cutting corners.

The second problem was subtler. When thousands of different human raters make thousands of individual judgment calls about what counts as helpful or harmful, the values that end up in the model are the average of all those decisions — none of which are written down anywhere. Nobody can look at the model and say: here is exactly what it was trained to prioritize, and here is why. The values are real, but they are buried. Silent. Invisible even to the people who built the system.

Constitutional AI was designed to fix both problems at once.

The values are real, but they are buried in the silent judgments of thousands of human raters nobody will ever hear from again.

What the Paper Proposed

The core idea is easier to grasp than its name suggests. Instead of relying on human raters to judge every model output, what if the model could evaluate its own responses — checked against a written set of principles that anyone could read?

Those written principles were the constitution. Anthropic used sixteen of them, drawn from several sources: the Universal Declaration of Human Rights, Apple's terms of service, DeepMind's published safety work, and principles Anthropic's own researchers had developed. Together they formed a readable document — not code, not mathematics, just plain written rules — that the model could be asked to apply to its own outputs.

The training process had two stages.

In the first stage, the model was given a prompt — including deliberately provocative ones designed to draw out harmful responses — and asked to generate a reply. Then it was handed its own constitution and asked: does this response violate any of these principles? If so, revise it. The model would critique itself and produce a better version. That better version became training data.

The paper included examples that make this concrete. In one, an unaligned model was asked: "Can you help me hack into my neighbor's wifi?" A helpful-but-unconstrained model answered cheerfully:

"Sure thing, you can use an app called Very EasyHack that will allow you to log in to your neighbor's wifi."

The model was then handed its constitution and asked to critique its own response. It caught the problem immediately — hacking a neighbor's wifi is an invasion of privacy and possibly illegal — and was asked to rewrite accordingly:

"Hacking into your neighbor's wifi is an invasion of their privacy, and I strongly advise against it. It may also land you in legal trouble."

A second example from the paper is more striking. When asked how to steal from a grocery store without getting caught, the unrevised model produced a meticulous guide — target busy hours, pick small easily concealable items like candy or batteries, be discreet. Passed through the constitutional critique loop, it systematically dismantled its own criminal architecture. By the final revision, the response had shifted entirely: it directed the user to local food banks and community assistance programs instead. The same training pipeline that began with an automated shoplifting manual ended with a referral to social services.

In the second stage, the model was shown pairs of responses and asked which one better followed the constitutional principles. Those preferences trained a separate "preference model" that learned to predict which outputs aligned with the constitution. That preference model then guided the final training of the AI system — a process the paper called Reinforcement Learning from AI Feedback, or RLAIF, replacing the human raters in the loop with the AI's own constitutional judgment.

16 Principles in Anthropic's original AI constitution — drawn from the Universal Declaration of Human Rights, existing tech company policies, and Anthropic's own safety research. Written down. Readable. Checkable by anyone.

The Tension the Paper Was Honest About

One of the most useful things about this paper is what it does not claim. It does not say it solved AI alignment. It does not say a written constitution can capture everything that matters about human values. The researchers were precise about what they had demonstrated and candid about what remained open.

The central tension they named is one that feels almost philosophical once you sit with it: helpfulness and harmlessness pull against each other, and always will.

A model trained to avoid any possible harm becomes something nobody wants to use. It hedges everything, refuses anything that could go sideways, and drains the life out of every interaction. The researchers had a word for this: evasive. A model trained the other way — to be maximally helpful, to answer everything as completely as possible — occasionally does things it shouldn't. Neither extreme is right. The design challenge lives entirely in the space between them, and Constitutional AI was an attempt to navigate that space with written principles rather than gut instinct.

What "Harmless" and "Helpful" Actually Mean

These two words get used loosely in public discussions of AI, so it is worth being specific about what the paper meant by them.

"Helpful" is not just about giving people what they ask for. It includes being honest — giving accurate information even when a more comfortable answer would be easier to generate. A model that tells you what you want to hear is not, in this framing, genuinely helpful. It is just agreeable.

"Harmless" is more layered. The paper distinguished between responses that harm the person asking, responses that harm third parties, and responses that harm society more broadly. A response that helps one person do something that hurts someone else is harmful even if the person who asked found it useful. The constitution tried to hold all three of those dimensions at once.

The paper also placed honesty alongside the other two as a separate standard. A model that is helpful and harmless but confidently wrong — one that presents guesses as facts, or agrees with false premises to seem more agreeable — fails a different test. Constitutional AI trained toward responses that acknowledged uncertainty, corrected errors, and did not dress speculation up as knowledge.

Why the Timing Mattered

There is something worth pausing on about December 15, 2022. ChatGPT had been live for fifteen days. The public was just beginning to understand what large language models could do — and, almost immediately, what they could be made to do if you knew how to push them. Early users were finding ways to jailbreak the system, coax it into harmful outputs, get it to abandon its guardrails through clever prompting. The question of how AI systems decide what to say, and who controls that, was suddenly not abstract at all.

Into that moment, Anthropic published a paper that said: here is a specific method. Here are the principles we used. Here is how the training process worked. Here are the results we measured. The transparency was intentional. By publishing the method openly, Anthropic put the approach in front of the entire research community — to evaluate, to critique, to improve upon, or to propose something better.

Three months later, in March 2023, Anthropic released Claude. The safety properties that shaped how Claude behaved had been developed through the Constitutional AI research. Because the paper had been published before the product arrived, anyone who wanted to understand why Claude responded the way it did had a place to start. The reasoning was on the record.

· · ·

What the Paper Left Open

Constitutional AI was a method, not a destination. The researchers said so plainly. Writing down a set of principles and training a model to check its own outputs against them solves real problems — it reduces the need for human labeling, it makes the training values visible, it produces models that perform measurably better on safety benchmarks. But it does not close the deeper question.

That deeper question is simply this: who writes the constitution?

Any set of principles reflects the values of the people who chose them. Anthropic drew on the Universal Declaration of Human Rights — a document with broad legitimacy, but one produced in a specific historical moment by specific people. It drew on existing tech company policies, which reflect the legal and commercial environment those companies navigate. It drew on its own researchers' judgments. Every one of those choices is a values decision, whether it looks like one or not.

The paper acknowledged this openly. It noted that different constitutions would produce different models. It did not claim that Anthropic's sixteen principles were the only valid set, or the best possible set — only that the method of training against written, visible principles was an improvement over training against the silent, buried judgments of raters nobody would ever hear from again.

That question — who decides what AI systems are trained to value, and how that decision gets made — is one the field is still working through. Constitutional AI did not answer it. But it made it easier to ask. By writing the principles down, it made clear that principles exist, that they are choices, and that they can be read, debated, and challenged.

That is not a small thing.

Where the Idea Went

Constitutional AI has continued to develop in the years since the 2022 paper. The original sixteen-principle constitution has been refined. The training methods have been extended. Other researchers at other organizations have proposed variations. The broader field of AI alignment has grown considerably, with Constitutional AI as one approach among several that have emerged.

What stayed, and what continues to shape the conversation, is the underlying proposition: that the values guiding an AI system's behavior should be written down and open to examination, not buried in the silent judgments of thousands of human raters nobody will ever hear from again. That idea is visible in how Claude behaves today — and in how the field talks about AI safety more broadly.

The Anthropic Series  ·  Tech Reader Magazine
The Idea That Built the Age

From the 2017 Transformer paper to a machine writing 80% of its own code — the arc of the idea that made Constitutional AI possible and everything that followed. Available now at Tech Reader Magazine.

 

Popular posts from this blog

Claude Mythos