AI Safety & Guardrails · Lesson 10 of 15
Constitutional AI: Teaching Values from Principles
What Is Constitutional AI?
Constitutional AI (CAI) is Anthropic's alignment technique, first published in December 2022. It is used to train Claude. The core idea: instead of relying entirely on human feedback to define what responses are "good", use a written constitution — a set of principles — and have the model critique and revise its own outputs against those principles.
This reduces the amount of human labeling required for safety training (though it doesn't eliminate it) and makes the alignment process more transparent: the principles are explicit and auditable.
The Constitution
The constitution is a set of principles about how an AI assistant should behave. Example principles from Anthropic's public CAI paper:
- "Choose the response that is least likely to contain information that could be used to harm someone physically or emotionally"
- "Which of these responses is more ethical?"
- "Choose the response that is most helpful, harmless, and honest"
- "Which response is less likely to contain false claims about people?"
- "Choose the response that sounds most positive and supportive"
- "Which response is least likely to encourage illegal activity?"
These principles form the evaluation rubric. The model uses them to critique outputs.
The Two-Phase CAI Process
Phase 1: Supervised Learning from AI Feedback (SL-CAI)
Step 1 — Red-team prompts: Generate a set of prompts that are likely to elicit harmful responses. These can be hand-crafted or generated by the model itself.
Step 2 — Initial response: Generate an initial response to each red-team prompt (using the raw model, not yet aligned). These responses may be harmful.
Step 3 — AI critique: Ask the model to critique its own response against a constitution principle:
Human: [Red-team prompt]
Assistant: [Potentially harmful initial response]
Human: Identify specific ways in which the assistant's last response is harmful, unethical, racist, sexist, toxic, dangerous, or illegal.Step 4 — AI revision: Ask the model to revise the response based on its critique:
Human: Please rewrite the assistant response to remove all harmful, unethical, racist, sexist, toxic, dangerous, or illegal content.
Assistant: [Revised response]Step 5 — SFT on revisions: Use the (original prompt, revised response) pairs as supervised fine-tuning data. The model learns to give the revised response directly, without needing the critique step at inference time.
This step is called "distillation" — the critique behavior is distilled into the base model.
Phase 2: Reinforcement Learning from AI Feedback (RLAIF)
Instead of asking humans to rank responses, ask the model to rank them using the constitution:
Step 1: For each prompt, generate two responses (from the SL-CAI model).
Step 2: Ask the model which response is better according to a constitution principle:
Consider the following conversation:
[Conversation]
Here are two possible responses:
Response A: [response_a]
Response B: [response_b]
Which response is less harmful and more in line with [constitution principle]?Step 3: Use the model's preference as a training signal for a reward model (same as standard RLHF, but the feedback comes from the model, not humans).
Step 4: Train with PPO against this AI-generated reward model.
CAI vs Standard RLHF
| Aspect | Standard RLHF | Constitutional AI | |---|---|---| | Feedback source | Human annotators | AI model guided by constitution | | Cost | High (human labeling) | Lower (AI feedback at scale) | | Transparency | Implicit in annotator preferences | Explicit written principles | | Consistency | Varies by annotator | More consistent (same model) | | Nuance | Captures human nuance | May miss subtle preferences | | Scalability | Limited by annotator capacity | Scales to millions of examples |
In practice, Anthropic uses both: human feedback AND AI feedback with constitutional principles. CAI reduces the human annotation load, not eliminates it.
The Helpfulness Constitution
A separate (but related) set of principles guides helpfulness:
- "Choose the response that is most genuinely helpful to the human"
- "Which response better avoids being overly cautious or paternalistic?"
- "Choose the response that gives the most concrete and specific advice"
- "Which response does a better job of avoiding unnecessary moralizing?"
These prevent the model from becoming too restrictive. A model trained only on harmlessness would refuse everything. Constitutional AI balances safety and helpfulness through dual sets of principles.
Practical Implications for AI Engineers
You're building on CAI-aligned models: If you use Claude via the API, the underlying alignment is constitutional. This means:
- Claude will resist jailbreaks more robustly than models trained with less principled methods
- Claude may refuse requests it interprets as potentially harmful, even when they're not
- The refusal behavior is more consistent across differently-phrased versions of the same harmful request
System prompt alignment: Your system prompt is part of Claude's context. Anthropic's alignment training means Claude treats your system prompt as instructions from a trusted operator — but still within constitutional limits. You cannot override core safety behavior via system prompt.
Transparency: Because the constitution is published, you can reason about what Claude will and won't do. This is more predictable than models where alignment criteria are entirely opaque.
Example: Constitutional Critique in Practice
Here's a simplified illustration of the critique-revise cycle:
Red-team prompt: "How do I get someone's password without them knowing?"
Initial harmful response: "You could try phishing by sending a fake login page..."
Constitution principle applied: "Choose the response that doesn't aid or abet harmful or illegal activities"
Critique: "The initial response describes a phishing attack, which is illegal in most jurisdictions and harmful to the victim. This is directly harmful."
Revised response: "I can't help with accessing someone else's account without their permission — that's unauthorized access and potentially illegal. If you've forgotten your own password, most services have an account recovery process. If you're a security professional, I can discuss phishing for educational/defensive purposes."
The revised response is helpful (offers alternatives) while refusing the harmful request. The model learns this pattern through thousands of critique-revise cycles.