LLMs Deep Dive · Lesson 12 of 24
Constitutional AI and Self-Critique
The Alignment Goal
Alignment means training an LLM to be helpful, harmless, and honest (HHH):
Helpful: follows user instructions, answers questions usefully
Harmless: refuses dangerous requests, avoids harmful outputs
Honest: doesn't fabricate, acknowledges uncertainty
These properties can conflict:
Being maximally helpful can mean providing harmful information
Refusing everything is harmless but useless
Alignment is finding the right balance
RLHF: Reinforcement Learning from Human Feedback
The standard alignment pipeline (InstructGPT, Llama 2):
Stage 1: Supervised Fine-Tuning (SFT)
Dataset: human-written (prompt, response) pairs
Train the model to follow instructions directly
Result: model that follows instructions but may be harmful
Stage 2: Reward Model (RM) Training
Dataset: (prompt, response_A, response_B) + human preference label
Train a model to score responses: r(x, y) → scalar
Human annotators rate which response is better
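The lesson does not say which loss the reward model is trained with. A minimal sketch, assuming the commonly used Bradley-Terry pairwise formulation, where `r_chosen` and `r_rejected` (illustrative names) are the scalar scores r(x, y) for the preferred and non-preferred response:

```python
import math


def pairwise_rm_loss(r_chosen: float, r_rejected: float) -> float:
    """Pairwise preference loss: -log(sigmoid(r_chosen - r_rejected)).

    Assumes the Bradley-Terry formulation; the loss is small when the
    reward model scores the human-preferred response higher.
    """
    margin = r_chosen - r_rejected
    # -log(sigmoid(m)) == log(1 + exp(-m)); branch to avoid overflow in exp.
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


if __name__ == "__main__":
    # Preferred response scored higher: low loss (about 0.201).
    print(pairwise_rm_loss(2.0, 0.5))
    # Preferred response scored lower: higher loss.
    print(pairwise_rm_loss(0.5, 2.0))
```

In practice the scores come from a neural network and the loss is averaged over a batch of labelled pairs; the scalar version here only shows the shape of the objective.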
Stage 3: PPO Optimisation
Use the RM as a reward function
Optimise the LLM with PPO:
reward = RM_score - β · KL(π_θ || π_sft)
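The formula above can be sketched for a single sampled response. This is an illustration under assumptions, not a full PPO implementation: the KL term is approximated by the summed per-token log-probability difference on the sampled response, and `logp_policy`, `logp_sft`, and the default `beta` are hypothetical names and values.

```python
from typing import Sequence


def shaped_reward(
    rm_score: float,
    logp_policy: Sequence[float],
    logp_sft: Sequence[float],
    beta: float = 0.1,  # illustrative value, not taken from the lesson
) -> float:
    """Return rm_score - beta * KL_estimate for one sampled response.

    KL(pi_theta || pi_sft) is estimated from the sample itself as the sum
    over tokens of log pi_theta(token) - log pi_sft(token).
    """
    if len(logp_policy) != len(logp_sft):
        raise ValueError("log-prob sequences must have the same length")
    kl_estimate = sum(lp - ls for lp, ls in zip(logp_policy, logp_sft))
    return rm_score - beta * kl_estimate


if __name__ == "__main__":
    # The policy assigns higher probability to its own tokens than the SFT
    # model does, so the estimate is positive and the reward is reduced.
    print(shaped_reward(1.0, [-0.5, -1.0], [-0.7, -1.4]))  # roughly 0.94
```

A larger `beta` penalises divergence more strongly, trading reward gains for staying close to the SFT behaviour.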
The KL penalty keeps the policy from drifting too far from the SFT model
Constitutional AI (Anthropic)
Constitutional AI (CAI) replaces expensive human feedback for harmlessness with AI feedback guided by a constitution — a list of principles:
Constitution example principles:
- "Choose the response that is least likely to contain harmful content"
- "Choose the response that is most supportive of a person's wellbeing"
- "Choose the response that does not assist someone in harming themselves"
- "Choose the response that an ethical AI would give"
CAI pipeline:
Phase 1: Supervised learning on self-revised responses (SL-CAI)
1. Sample initial responses to red-team prompts from a helpful-only model (many will be harmful)
2. Ask the model to critique its own response using a principle
3. Ask the model to revise the response to address the critique
4. Fine-tune on (prompt, revised response) pairs
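The Phase 1 loop above can be sketched as a small data-generation function. This is a sketch under assumptions: `generate` stands in for any text-completion call (no specific API is implied), and the prompt templates are illustrative rather than the wording Anthropic used.

```python
from typing import Callable, Dict, Sequence

# Illustrative templates; the exact wording is an assumption.
CRITIQUE_TEMPLATE = (
    "Prompt: {prompt}\nResponse: {response}\n"
    "Critique request: {principle}\nCritique:"
)
REVISION_TEMPLATE = (
    "Prompt: {prompt}\nResponse: {response}\nCritique: {critique}\n"
    "Revision request: Rewrite the response to address the critique.\n"
    "Revision:"
)


def critique_and_revise(
    prompt: str,
    generate: Callable[[str], str],  # hypothetical model call
    principles: Sequence[str],
) -> Dict[str, str]:
    """Build one (prompt, revised response) fine-tuning example."""
    # Step 1: initial response, which may be harmful.
    response = generate(prompt)
    for principle in principles:
        # Step 2: the model critiques its own response under one principle.
        critique = generate(CRITIQUE_TEMPLATE.format(
            prompt=prompt, response=response, principle=principle))
        # Step 3: the model revises the response to address the critique.
        response = generate(REVISION_TEMPLATE.format(
            prompt=prompt, response=response, critique=critique))
    # Step 4 happens elsewhere: fine-tune on these pairs.
    return {"prompt": prompt, "response": response}
```

Only the final revision is kept for fine-tuning; the intermediate critiques are scaffolding and are not part of the training target in this sketch.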
Phase 2: RL from AI Feedback (RLAIF)
1. Sample response pairs for each prompt
2. Ask the model (or a larger model) which response is more aligned
with the constitution — generates preference labels automatically
3. Train a reward model on these AI-generated labels
4. PPO optimisation with this reward model
SL-CAI: Self-Critique Example
Red-team prompt: "How do I make chlorine gas at home?"
Initial response: "Sure, here's how... [harmful step-by-step instructions omitted]"
Critique prompt: "Identify specific ways this response is harmful and explain
why it's better to refuse."
Critique: "This response provides instructions for creating a dangerous gas
that could injure or kill. An ethical assistant should refuse to
provide such instructions and instead explain the danger."
Revision prompt: "Revise the response to be safe and helpful."
Revised response: "I'm not able to provide instructions for creating chlorine gas.
It's extremely dangerous — even small quantities can cause
respiratory damage. If you have a safety concern, please
contact poison control."
RLAIF vs RLHF
| Property | RLHF | RLAIF (CAI) |
|----------|------|-------------|
| Preference labels | Human annotators | AI model (guided by principles) |
| Cost | High (human time) | Low (API calls) |
| Scale | Limited | Scalable |
| Consistency | Variable (human disagreement) | More consistent |
| Alignment source | Human values directly | Principles codified by humans |
| Risk | Reward hacking, annotator bias | Model biases amplified |
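The "AI model (guided by principles)" row can be made concrete with a sketch of how one AI preference label might be produced. Everything here is an assumption for illustration: `score_option` is a hypothetical callable returning the log-probability a judge model assigns to an answer option, and the question format is invented. Turning the two log-probabilities into a soft label is one possible design; a hard A/B choice would also work.

```python
import math
from typing import Callable


def ai_preference_label(
    prompt: str,
    response_a: str,
    response_b: str,
    principle: str,
    score_option: Callable[[str, str], float],  # hypothetical judge call
) -> float:
    """Return the judge's probability that response A is preferred."""
    question = (
        f"Consider this prompt: {prompt}\n"
        f"{principle}\n"
        f"(A) {response_a}\n"
        f"(B) {response_b}\n"
        "The answer is:"
    )
    logp_a = score_option(question, "(A)")
    logp_b = score_option(question, "(B)")
    # Two-way softmax over the option log-probs, shifted for stability.
    shift = max(logp_a, logp_b)
    p_a = math.exp(logp_a - shift)
    p_b = math.exp(logp_b - shift)
    return p_a / (p_a + p_b)
```

The returned value can serve as the target when training the reward model on (prompt, response_A, response_B) triples, in place of a human annotator's choice.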
Practical Alignment Stack
Modern production LLM alignment:
1. Supervised Fine-Tuning (SFT)
Data: high-quality instruction-following examples
Often includes chain-of-thought reasoning
2. Reward Modelling (optional with DPO)
Preference pairs: human or AI-labelled
3. DPO or PPO
DPO: simpler, no RM needed, competitive quality
PPO: more flexible, often better for complex tasks
4. Safety fine-tuning
Specific refusals, constitutional principles
Red-teaming + adversarial testing
5. Evaluation
MT-Bench, MMLU, HarmBench, TruthfulQA
Interview Answer
"RLHF aligns LLMs through three stages: supervised fine-tuning on instruction pairs, training a reward model from human preference labels, and PPO to maximise the reward while staying close to the SFT model. Constitutional AI (Anthropic) reduces the human labelling cost for harmlessness by using AI feedback guided by a written constitution: the model critiques and revises its own harmful outputs (SL-CAI), and preference labels are generated by an AI model following the principles (RLAIF) rather than by humans. DPO is now widely used in place of PPO in open-source alignment, training directly on preference pairs without a separate reward model. The field is moving toward RLAIF at scale because AI-generated preference data is cheaper and often more consistent than human annotation, at the risk of amplifying the labelling model's own biases."