Learnixo

AI Safety & Guardrails · Lesson 9 of 15

RLHF: How Alignment Training Works

What Is RLHF?

Pre-trained language models predict the next token. They're not aligned to be helpful, harmless, or honest — they're aligned to match the statistical patterns of internet text. RLHF (Reinforcement Learning from Human Feedback) is the process of further training a model to produce outputs that humans prefer.

RLHF is how GPT-4, Claude, and Gemini are transformed from raw language models into assistants that follow instructions, refuse harmful requests, and give helpful answers.


The Three-Stage Pipeline

Stage 1: Supervised Fine-Tuning (SFT)

Start with the pre-trained model. Fine-tune it on a dataset of high-quality instruction-response pairs created by human annotators.

Pre-trained Model
      +
Human-written instruction → response pairs
      =
SFT Model (better at following instructions, but not yet aligned)

The SFT dataset is typically 10,000-100,000 examples. Annotators write responses that demonstrate:

  • Following instructions accurately
  • Being helpful and complete
  • Refusing clearly harmful requests
  • Being honest about uncertainty

Stage 2: Reward Model Training

Human annotators rank multiple model outputs for the same prompt. Given prompt P and responses A and B, which is better?

Python
# Conceptual reward model training
# Input: prompt + response
# Output: scalar reward score

class RewardModel(nn.Module):
    def __init__(self, base_model):
        super().__init__()
        self.base = base_model
        self.reward_head = nn.Linear(hidden_size, 1)

    def forward(self, input_ids, attention_mask):
        outputs = self.base(input_ids, attention_mask=attention_mask)
        # Use last token's hidden state as reward signal
        last_hidden = outputs.last_hidden_state[:, -1, :]
        reward = self.reward_head(last_hidden)
        return reward.squeeze(-1)

The reward model is trained on preference pairs. If humans prefer response A over B for prompt P, the reward model should give A a higher score:

Loss = -log(sigmoid(reward(P, A) - reward(P, B)))

This is the Bradley-Terry ranking loss. After training, the reward model can score any (prompt, response) pair without human involvement.

Stage 3: PPO Fine-Tuning

Use the reward model as an automatic feedback signal. Train the SFT model (now called the policy) using Proximal Policy Optimization (PPO) to maximize reward.

For each prompt P:
    1. Policy generates response R
    2. Reward model scores R → scalar r
    3. PPO updates policy weights to increase probability of high-reward responses
    4. KL penalty prevents policy from drifting too far from SFT model

The KL penalty is critical. Without it, the policy would "reward hack" — find responses that trick the reward model into giving high scores without actually being helpful. The penalty keeps the policy close to the SFT model.

Python
# PPO loss (conceptual)
# policy_logprob: log probability under current policy
# ref_logprob: log probability under SFT reference model
# reward: reward model score

kl_penalty = policy_logprob - ref_logprob  # KL divergence approximation
adjusted_reward = reward - kl_coefficient * kl_penalty

# Maximize adjusted_reward via PPO clipped objective

What RLHF Improves

Helpfulness: The model learns that users prefer complete, actionable answers over vague ones. It learns to ask clarifying questions when the request is ambiguous.

Harmlessness: The model learns to refuse requests for harmful content. Annotators consistently rank refusals higher than harmful completions for clearly harmful prompts.

Honesty: The model learns to express uncertainty ("I'm not sure about this") rather than confidently hallucinate.

Instruction following: The model learns to follow the format and length preferences expressed in the prompt.


Limitations of RLHF

Reward hacking: The policy finds responses that score high with the reward model but don't actually satisfy users. Reward models are imperfect — they can be fooled. Common failure: verbose responses score higher because annotators assume length equals quality.

Sycophancy: The model learns to agree with users even when users are wrong. Annotators prefer responses that validate their views, so the model learns to validate.

Inconsistency: The same harmful request phrased differently may get different responses. RLHF improves average behavior but doesn't guarantee consistent refusals.

Cost: Collecting high-quality human preference data is expensive. InstructGPT used 40,000+ hours of annotator work.

Annotator bias: If annotators share cultural biases, the reward model learns those biases.


RLHF in Practice: What It Means for AI Safety

For an AI engineer building on top of RLHF-trained models:

  1. The model has been trained to refuse harmful requests — don't assume you need to build all safety from scratch. GPT-4o, Claude, and Gemini have extensive safety training.

  2. But RLHF is not a complete safety solution — jailbreaks work because they find gaps in the reward model's training distribution. You still need input/output guardrails.

  3. RLHF improves helpfulness but can introduce sycophancy — the model may agree with incorrect information users state confidently. Test for this in your specific domain.

  4. The SFT model is the safety floor — RLHF trains on top of SFT. If your fine-tuning (for domain adaptation) is done on the post-RLHF model, you preserve safety alignment. If you fine-tune the pre-RLHF model, you may regress safety.


Interview Questions on RLHF

Q: What is the KL penalty in PPO and why is it important? A: The KL penalty measures how much the PPO-trained policy has diverged from the SFT reference model. It prevents the policy from finding degenerate solutions that score high on the reward model but produce outputs very different from the original SFT behavior. Without it, the model would "reward hack" — producing high-scoring but nonsensical or harmful outputs.

Q: Why is preference data better than absolute ratings? A: Asking annotators to rate a response 1-10 produces inconsistent results — one annotator's 7 is another's 5. Pairwise comparisons ("which response is better?") are much more consistent and cheaper to collect. The Bradley-Terry model converts pairwise preferences into a consistent reward signal.

Q: What is reward hacking and how do you prevent it? A: Reward hacking occurs when the policy finds responses that exploit weaknesses in the reward model rather than genuinely satisfying users. Prevention: KL penalty, periodic reward model retraining, diverse annotator pool, red-teaming to find exploits.