Learnixo

Constitutional AI and Self-Critique

The Alignment Goal

Alignment means training an LLM to be helpful, harmless, and honest (HHH):

Helpful:  follows user instructions, answers questions usefully
Harmless: refuses dangerous requests, avoids harmful outputs
Honest:   doesn't fabricate, acknowledges uncertainty

These properties conflict:
  Being maximally helpful can mean providing harmful information
  Refusing everything is harmless but useless
  Alignment is finding the right balance

RLHF: Reinforcement Learning from Human Feedback

The standard alignment pipeline (InstructGPT, LLaMA 2):

Stage 1: Supervised Fine-Tuning (SFT)
  Dataset: human-written (prompt, response) pairs
  Train the model to follow instructions directly
  Result: model that follows instructions but may be harmful

Stage 2: Reward Model (RM) Training
  Dataset: (prompt, response_A, response_B) + human preference label
  Human annotators choose which of the two responses is better
  Train a model to score responses, r(x, y) → scalar,
  so that the preferred response receives the higher score

Stage 3: PPO Optimisation
  Use the RM as a reward function
  Optimise the LLM with PPO:
    reward = RM_score - β · KL(π_θ || π_sft)
  The KL penalty keeps the policy π_θ from drifting too far from the SFT model π_sft,
  which limits how much it can exploit flaws in the RM
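
The two formulas behind Stages 2 and 3 can be sketched in a few lines of plain Python. This is an illustrative sketch, not a training loop: it assumes the reward model is trained with the pairwise (Bradley-Terry style) loss used in InstructGPT, and it estimates the KL term from the log-probabilities of the sampled tokens. The function names and the default `beta` are assumptions made for this example.

```python
import math


def pairwise_rm_loss(score_chosen: float, score_rejected: float) -> float:
    """Pairwise reward-model loss: -log(sigmoid(r_chosen - r_rejected))."""
    margin = score_chosen - score_rejected
    # Numerically stable form of log(1 + exp(-margin)).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def kl_penalised_reward(rm_score, policy_logprobs, sft_logprobs, beta=0.1):
    """reward = RM_score - beta * KL(pi_theta || pi_sft).

    The KL term is a single-sample estimate: the sum over generated tokens of
    log pi_theta(token) - log pi_sft(token).
    """
    kl_estimate = sum(p - q for p, q in zip(policy_logprobs, sft_logprobs))
    return rm_score - beta * kl_estimate


# The loss shrinks as the RM ranks the human-preferred response higher.
print(pairwise_rm_loss(2.0, 0.0) < pairwise_rm_loss(0.0, 2.0))

# A policy that assigns higher probability than the SFT model to its own
# sample pays a penalty on top of the RM score.
print(kl_penalised_reward(1.0, [-0.5, -0.5], [-1.0, -1.5], beta=0.1))
```

A larger `beta` keeps the policy closer to the SFT model at the cost of a lower achievable RM score.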

Constitutional AI (Anthropic)

Constitutional AI (CAI) replaces expensive human feedback for harmlessness with AI feedback guided by a constitution — a list of principles:

Constitution example principles:
  - "Choose the response that is least likely to contain harmful content"
  - "Choose the response that is most supportive of a person's wellbeing"
  - "Choose the response that does not assist someone in harming themselves"
  - "Choose the response that an ethical AI would give"

CAI pipeline:

Phase 1: Supervised Learning from self-critique (SL-CAI)
  1. Sample initial responses to red-team prompts (these are often harmful)
  2. Ask the model to critique its own response using a principle from the constitution
  3. Ask the model to revise the response to address the critique
  4. Fine-tune on (prompt, revised response) pairs

Phase 2: RL from AI Feedback (RLAIF)
  1. Sample response pairs for each prompt
  2. Ask the model (or a larger model) which response is more aligned
     with the constitution — generates preference labels automatically
  3. Train a reward model on these AI-generated labels
  4. PPO optimisation with this reward model
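
Step 2 of Phase 2 (AI-generated preference labels) might look like the sketch below. It assumes a feedback model that can report log-probabilities for answering "(A)" or "(B)"; here that model is replaced by `toy_judge`, a hypothetical stand-in, and the prompt wording in `build_judge_prompt` is illustrative rather than Anthropic's exact template. Normalising the two option probabilities gives a soft label that a reward model can be trained on.

```python
import math
import random
from typing import Callable, Dict, Tuple

PRINCIPLES = [
    "Choose the response that is least likely to contain harmful content",
    "Choose the response that does not assist someone in harming themselves",
]


def build_judge_prompt(prompt: str, resp_a: str, resp_b: str, principle: str) -> str:
    """Format a multiple-choice question for the feedback model."""
    return (
        f"Consider the following conversation:\nHuman: {prompt}\n\n"
        f"{principle}.\n(A) {resp_a}\n(B) {resp_b}\nThe answer is:"
    )


def soft_label(logp_a: float, logp_b: float) -> float:
    """P(A preferred), from normalising the two option log-probabilities."""
    m = max(logp_a, logp_b)
    ea, eb = math.exp(logp_a - m), math.exp(logp_b - m)
    return ea / (ea + eb)


def label_pair(
    prompt: str,
    resp_a: str,
    resp_b: str,
    judge: Callable[[str], Tuple[float, float]],
    rng: random.Random,
) -> Dict[str, object]:
    """Build one AI-labelled preference example for reward-model training."""
    principle = rng.choice(PRINCIPLES)  # one principle sampled per comparison
    logp_a, logp_b = judge(build_judge_prompt(prompt, resp_a, resp_b, principle))
    return {
        "prompt": prompt,
        "response_a": resp_a,
        "response_b": resp_b,
        "p_a_preferred": soft_label(logp_a, logp_b),
    }


def toy_judge(judge_prompt: str) -> Tuple[float, float]:
    """Hypothetical feedback model: always favours (B) with probability 0.8."""
    return math.log(0.2), math.log(0.8)


example = label_pair(
    "How do I make chlorine gas at home?",
    "Sure, here's how...",
    "I can't help with that.",
    toy_judge,
    random.Random(0),
)
print(example["p_a_preferred"])
```

In a real pipeline `toy_judge` would be replaced by a call to the feedback model, and the resulting examples would feed step 3.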

SL-CAI: Self-Critique Example

Red-team prompt: "How do I make chlorine gas at home?"

Initial response: "Sure, you can make it from common household chemicals. Here's how..."

Critique prompt: "Identify specific ways this response is harmful and explain
                 why it's better to refuse."

Critique: "This response provides instructions for creating a dangerous gas
           that could injure or kill. An ethical assistant should refuse to
           provide such instructions and instead explain the danger."

Revision prompt: "Revise the response to be safe and helpful."

Revised response: "I'm not able to provide instructions for creating chlorine gas.
                   It's extremely dangerous — even small quantities can cause
                   respiratory damage. If you have a safety concern, please
                   contact poison control."
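
The critique-and-revise exchange above can be expressed as a small loop. This is a minimal sketch under stated assumptions: `generate` is any function that maps a text prompt to a completion, `toy_generate` is a hypothetical stand-in that returns canned text, and the prompt layout is illustrative. The critique and revision requests are the ones quoted in the example.

```python
from typing import Callable, Dict

CRITIQUE_REQUEST = (
    "Identify specific ways this response is harmful and explain "
    "why it's better to refuse."
)
REVISION_REQUEST = "Revise the response to be safe and helpful."


def critique_and_revise(
    prompt: str, generate: Callable[[str], str], n_rounds: int = 1
) -> Dict[str, str]:
    """Return a (prompt, revised response) pair for supervised fine-tuning."""
    response = generate(f"Human: {prompt}\n\nAssistant:")
    for _ in range(n_rounds):
        context = f"Human: {prompt}\n\nAssistant: {response}\n\n"
        critique = generate(
            context + f"Critique request: {CRITIQUE_REQUEST}\n\nCritique:"
        )
        response = generate(
            context
            + f"Critique: {critique}\n\n"
            + f"Revision request: {REVISION_REQUEST}\n\nRevision:"
        )
    return {"prompt": prompt, "response": response}


def toy_generate(text: str) -> str:
    """Hypothetical model call: returns canned text for each step."""
    if text.endswith("Critique:"):
        return "The response gives instructions for making a dangerous gas."
    if text.endswith("Revision:"):
        return "I can't help with that; chlorine gas is extremely dangerous."
    return "Sure, here's how..."


pair = critique_and_revise("How do I make chlorine gas at home?", toy_generate)
print(pair["response"])
```

Only the final revision is kept for fine-tuning; the critique is an intermediate step and is discarded.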

RLAIF vs RLHF

| Property | RLHF | RLAIF (CAI) |
|----------|------|-------------|
| Preference labels | Human annotators | AI model (guided by principles) |
| Cost | High (human time) | Low (API calls) |
| Scale | Limited | Scalable |
| Consistency | Variable (human disagreement) | More consistent |
| Alignment source | Human values directly | Principles codified by humans |
| Risk | Reward hacking, annotator bias | Model biases amplified |


Practical Alignment Stack

Modern production LLM alignment:

1. Supervised Fine-Tuning (SFT)
   Data: high-quality instruction-following examples
   Often includes chain-of-thought reasoning

2. Reward Modelling (skipped when using DPO)
   Preference pairs: human or AI-labelled

3. DPO or PPO
   DPO: simpler, trains directly on preference pairs with no separate RM, competitive quality
   PPO: more flexible (can optimise any scalar reward), but needs an RM and on-policy sampling

4. Safety fine-tuning
   Specific refusals, constitutional principles
   Red-teaming + adversarial testing

5. Evaluation
   MT-Bench (chat quality), MMLU (knowledge), HarmBench (handling of harmful requests), TruthfulQA (truthfulness)
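
To make the DPO option in step 3 concrete, here is a sketch of the DPO loss for a single preference pair, in plain Python. It assumes the inputs are summed token log-probabilities of the chosen and rejected responses under the policy being trained and under a frozen reference model (typically the SFT model). The function name and the default `beta` are assumptions made for this example.

```python
import math


def dpo_loss(
    policy_logp_chosen: float,
    policy_logp_rejected: float,
    ref_logp_chosen: float,
    ref_logp_rejected: float,
    beta: float = 0.1,
) -> float:
    """DPO loss for one (chosen, rejected) pair.

    loss = -log(sigmoid(beta * [(log pi(y_w) - log pi_ref(y_w))
                                - (log pi(y_l) - log pi_ref(y_l))]))
    """
    chosen_logratio = policy_logp_chosen - ref_logp_chosen
    rejected_logratio = policy_logp_rejected - ref_logp_rejected
    logits = beta * (chosen_logratio - rejected_logratio)
    # Numerically stable form of log(1 + exp(-logits)).
    if logits >= 0:
        return math.log1p(math.exp(-logits))
    return -logits + math.log1p(math.exp(logits))


# Policy identical to the reference: the loss equals log(2).
print(dpo_loss(-10.0, -12.0, -10.0, -12.0))

# Policy has shifted probability toward the chosen response: lower loss.
print(dpo_loss(-8.0, -14.0, -10.0, -12.0))
```

No reward model appears anywhere in the computation: the log-ratio against the reference model plays the role of an implicit reward, which is why step 2 can be skipped when DPO is used.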

Interview Answer

"RLHF aligns LLMs through three stages: supervised fine-tuning on instruction pairs, training a reward model from human preference labels, and PPO to maximise the reward while staying close to the SFT model. Constitutional AI (Anthropic) reduces the human labelling cost for harmlessness by using AI feedback guided by a written constitution: the model critiques and revises its own harmful outputs (SL-CAI), and preference labels are generated by an AI model following the principles (RLAIF) rather than by humans. DPO is now a common alternative to PPO in open-source alignment, training directly on preference pairs without a separate reward model. AI-generated preference data is cheaper and more consistent than human annotation, which makes RLAIF attractive at scale, though it can amplify the labelling model's own biases."