LLMs Deep Dive · Lesson 10 of 24
RLHF: Reward Model + PPO Explained
The Alignment Problem RLHF Solves
After SFT, a model follows instructions but doesn't reliably produce responses humans prefer. SFT maximizes likelihood of demonstration data — it doesn't optimize for actual human preference signals. Two demonstrations look equally good in the SFT loss even if humans strongly prefer one over the other.
RLHF introduces explicit human preference signals and optimizes directly for what humans consider good responses.
The RLHF pipeline:
- Supervised Fine-Tuning — create a competent base (described in the SFT article)
- Reward Model Training — learn a scalar preference score from human comparisons
- PPO Training — optimize the LLM using the reward model as a proxy for human judgment
Step 1: Collecting Human Preference Data
Humans are shown pairs of responses to the same prompt and select the preferred one:
from dataclasses import dataclass

@dataclass
class PreferencePair:
    prompt: str
    chosen: str    # Human-preferred response
    rejected: str  # Less preferred response

# Example pairs from a pharmaceutical assistant
EXAMPLE_PAIRS = [
    PreferencePair(
        prompt="What is the interaction between warfarin and aspirin?",
        chosen="""Warfarin and aspirin have a major pharmacodynamic interaction.
Aspirin inhibits platelet aggregation (COX-1 pathway) while warfarin inhibits
clotting factor synthesis. Together, they significantly increase bleeding risk.
Management: Avoid combination if possible. If required (e.g., mechanical heart
valve with AFib), use the lowest effective aspirin dose (75-100mg) and monitor
INR closely. Discuss bleeding risk with patient.""",
        rejected="""These two drugs can interact. Taking them together may increase
bleeding risk. You should talk to your doctor about this interaction.""",
    )
]
# The chosen response is more detailed, clinically specific, and actionable.

Scale required: OpenAI reportedly collected ~300,000 comparison pairs for InstructGPT. This is the most expensive part of the RLHF pipeline.
Step 2: Reward Model Training
The reward model learns to predict which response humans prefer:
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class RewardModel(nn.Module):
    """
    Reward model: takes prompt+response → scalar reward score.
    Built on top of a pretrained LM backbone.
    """
    def __init__(self, base_model_name: str):
        super().__init__()
        # Use pretrained LM backbone (usually the SFT model).
        # AutoModel loads the transformer without its LM head.
        self.backbone = AutoModel.from_pretrained(
            base_model_name,
            torch_dtype=torch.bfloat16,
        )
        # Single-value regression head in place of the LM head,
        # cast to the backbone's dtype so the matmul dtypes match
        d_model = self.backbone.config.hidden_size
        self.reward_head = nn.Linear(d_model, 1, bias=False).to(self.backbone.dtype)

    def forward(self, input_ids, attention_mask):
        outputs = self.backbone(
            input_ids=input_ids,
            attention_mask=attention_mask,
        )
        hidden = outputs.last_hidden_state  # (B, T, d_model)
        # Use the hidden state of each sequence's last non-padding token
        # (assumes right padding; in a causal LM this position has seen the full sequence)
        last_index = attention_mask.sum(dim=1) - 1  # (B,)
        batch_index = torch.arange(hidden.size(0), device=hidden.device)
        last_hidden = hidden[batch_index, last_index]  # (B, d_model)
        reward = self.reward_head(last_hidden)  # (B, 1)
        return reward.squeeze(-1)  # (B,)

def compute_reward_loss(
    reward_model: RewardModel,
    chosen_ids: torch.Tensor,
    chosen_mask: torch.Tensor,
    rejected_ids: torch.Tensor,
    rejected_mask: torch.Tensor,
) -> tuple[torch.Tensor, torch.Tensor]:
    """
    Bradley-Terry loss: reward_chosen should be > reward_rejected.
    Minimizing -log(sigmoid(r_chosen - r_rejected)).
    Returns (loss, pairwise accuracy).
    """
    r_chosen = reward_model(chosen_ids, chosen_mask)
    r_rejected = reward_model(rejected_ids, rejected_mask)
    # Negative log probability that chosen is preferred.
    # logsigmoid is numerically more stable than log(sigmoid(...)) for large margins.
    margin = (r_chosen - r_rejected).float()
    loss = -nn.functional.logsigmoid(margin).mean()
    accuracy = (r_chosen > r_rejected).float().mean()
    return loss, accuracy
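
To make the Bradley-Terry loss concrete, here is a small sketch that evaluates it for a few reward margins using only the standard library. The function names `bradley_terry_loss` and `preference_probability` are illustrative and not part of the lesson's code, and the reward values are made up.

```python
import math

def preference_probability(r_chosen: float, r_rejected: float) -> float:
    """Bradley-Terry probability that the chosen response is preferred."""
    return 1.0 / (1.0 + math.exp(-(r_chosen - r_rejected)))

def bradley_terry_loss(r_chosen: float, r_rejected: float) -> float:
    """-log(sigmoid(r_chosen - r_rejected)) for a single pair."""
    margin = r_chosen - r_rejected
    # log(1 + exp(-margin)) is algebraically equal to -log(sigmoid(margin))
    return math.log1p(math.exp(-margin))

# Made-up reward pairs: no margin, correct ordering, wrong ordering
for r_c, r_r in [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0)]:
    p = preference_probability(r_c, r_r)
    loss = bradley_terry_loss(r_c, r_r)
    print(f"margin={r_c - r_r:+.1f}  P(chosen preferred)={p:.3f}  loss={loss:.3f}")
```

With equal rewards the loss is log 2 (about 0.693). It shrinks as the chosen response's score pulls ahead and grows roughly linearly when the ordering is wrong, so badly misranked pairs produce the strongest gradient signal.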

def train_reward_model(
    reward_model: RewardModel,
    preference_pairs: list[PreferencePair],
    tokenizer,
    epochs: int = 1,
) -> RewardModel:
    optimizer = torch.optim.AdamW(reward_model.parameters(), lr=1e-5)
    reward_model.train()
    for epoch in range(epochs):
        for pair in preference_pairs:
            # Tokenize chosen and rejected responses
            chosen = tokenizer(
                pair.prompt + pair.chosen,
                return_tensors="pt",
                max_length=1024,
                truncation=True,
            )
            rejected = tokenizer(
                pair.prompt + pair.rejected,
                return_tensors="pt",
                max_length=1024,
                truncation=True,
            )
            loss, acc = compute_reward_loss(
                reward_model,
                chosen["input_ids"],
                chosen["attention_mask"],
                rejected["input_ids"],
                rejected["attention_mask"],
            )
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return reward_model

Step 3: PPO Training
Proximal Policy Optimization (PPO) optimizes the LLM to maximize reward while staying close to the SFT model:
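import torch
from transformers import AutoTokenizer
from trl import PPOTrainer, PPOConfig, AutoModelForCausalLMWithValueHead

# Note: written against the older TRL PPOTrainer interface (generate/step).
# More recent TRL releases restructured PPO training, so check the installed version.

tokenizer = AutoTokenizer.from_pretrained("path/to/sft-model")

# PPO requires a model with a value head (estimates expected future reward)
model = AutoModelForCausalLMWithValueHead.from_pretrained(
    "path/to/sft-model",
    torch_dtype=torch.bfloat16,
)
# Reference model (frozen SFT model for KL penalty)
ref_model = AutoModelForCausalLMWithValueHead.from_pretrained(
    "path/to/sft-model",
    torch_dtype=torch.bfloat16,
)
ppo_config = PPOConfig(
    learning_rate=1.41e-5,
    batch_size=128,
    mini_batch_size=32,
    gradient_accumulation_steps=1,
    optimize_cuda_cache=True,
    early_stopping=True,
    target_kl=6.0,           # Stop if KL divergence exceeds this
    kl_penalty="kl",         # Penalty type
    seed=42,
    use_score_scaling=True,  # Scale rewards by a running standard deviation
    use_score_norm=True,     # Also subtract the running mean (needs use_score_scaling)
)
ppo_trainer = PPOTrainer(
    config=ppo_config,
    model=model,
    ref_model=ref_model,
    tokenizer=tokenizer,
)

# Training loop
# prompt_dataset: iterable of batches of tokenized prompts (assumed defined elsewhere)
# reward_model: the RewardModel trained in Step 2, on the same device as the policy
for batch in prompt_dataset:
    # 1. Generate responses using current policy
    query_tensors = list(batch["input_ids"])  # step() expects a list of 1-D tensors
    response_tensors = ppo_trainer.generate(
        query_tensors,
        return_prompt=False,
        max_new_tokens=256,
    )
    # 2. Score each prompt+response with the reward model
    rewards = []
    for query, response in zip(query_tensors, response_tensors):
        input_ids = torch.cat([query, response]).unsqueeze(0)  # (1, T)
        attention_mask = torch.ones_like(input_ids)
        with torch.no_grad():
            rewards.append(reward_model(input_ids, attention_mask)[0].float())
    # 3. Run PPO step (updates policy with reward signal + KL penalty)
    stats = ppo_trainer.step(query_tensors, response_tensors, rewards)

The KL Penalty: Why It's Critical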
Without the KL penalty, the policy collapses — it learns to exploit the reward model rather than actually improving:
def compute_rlhf_objective(
    policy_log_probs: torch.Tensor,  # (B, T) log probs of sampled tokens under the current model
    ref_log_probs: torch.Tensor,     # (B, T) log probs of the same tokens under the frozen SFT model
    rewards: torch.Tensor,           # (B,) reward model scores
    kl_coef: float = 0.1,            # How much to penalize KL divergence
) -> torch.Tensor:
    """
    Full RLHF objective: reward - β·KL(π||π_ref)
    The KL term prevents the model from gaming the reward model by
    drifting far from the SFT initialization (which loses language quality).
    """
    # Sample-based KL estimate: how much the policy diverges from the reference
    kl = (policy_log_probs - ref_log_probs).sum(dim=-1)  # Per-sequence KL
    # Final objective: maximize reward, minimize KL divergence
    objective = rewards - kl_coef * kl
    return objective.mean()

Why reward hacking happens: The reward model is imperfect. Without the KL penalty, the policy finds "adversarial" responses that the reward model scores highly but that are not real quality improvements. Classic examples: very long responses that score well only because the RM associates length with quality, or responses that repeat phrases the RM has learned to associate with human approval.
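
A small numeric sketch can show how the penalty changes which response wins. The log probabilities and reward scores below are invented for illustration, and `sequence_kl` and `penalized_reward` are hypothetical helper names that mirror the objective above in plain Python.

```python
def sequence_kl(policy_log_probs: list[float], ref_log_probs: list[float]) -> float:
    """Sample-based KL estimate: sum over tokens of log pi(token) - log pi_ref(token)."""
    return sum(p - r for p, r in zip(policy_log_probs, ref_log_probs))

def penalized_reward(
    reward: float,
    policy_log_probs: list[float],
    ref_log_probs: list[float],
    kl_coef: float = 0.1,
) -> float:
    """Reward model score minus the KL penalty, as in reward - beta * KL."""
    return reward - kl_coef * sequence_kl(policy_log_probs, ref_log_probs)

# Response A (made-up numbers): modest reward, tokens the SFT model also finds likely
a_policy = [-1.0, -1.2, -0.8, -1.1]
a_ref = [-1.1, -1.3, -1.0, -1.2]

# Response B (made-up numbers): higher RM score, but built from tokens the
# SFT model considers very unlikely, which is typical of reward hacking
b_policy = [-0.2, -0.1, -0.3, -0.2]
b_ref = [-6.0, -7.5, -5.0, -6.3]

score_a = penalized_reward(1.2, a_policy, a_ref)
score_b = penalized_reward(2.0, b_policy, b_ref)
print(f"A: KL={sequence_kl(a_policy, a_ref):.1f}  penalized reward={score_a:.2f}")
print(f"B: KL={sequence_kl(b_policy, b_ref):.1f}  penalized reward={score_b:.2f}")
```

Response B has the higher raw reward, but its large divergence from the reference model costs more than the extra reward is worth, so A ends up with the better penalized score. Lowering `kl_coef` far enough reverses that ordering, which is why the coefficient needs careful tuning.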
PPO vs Alternatives: When to Use What
| Method | Complexity | Data Required | Stability | Human Feedback |
|---|---|---|---|---|
| SFT only | Low | Demonstrations | High | Implicit (in data quality) |
| RLHF/PPO | Very high | Comparison pairs + RL | Low | Explicit |
| DPO | Medium | Comparison pairs | High | Explicit |
| RLAIF | High | AI comparisons (no humans) | Medium | AI proxy |
| Best-of-N | Low | None | N/A | Reward model |
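
Best-of-N is the simplest row in the table to illustrate because it involves no training: sample N candidates, score each with the reward model, and return the top one. The sketch below assumes hypothetical `generate` and `score` callables. The toy stand-ins only make the example runnable and are not a real model or reward model.

```python
from itertools import cycle
from typing import Callable

def best_of_n(
    prompt: str,
    generate: Callable[[str], str],
    score: Callable[[str, str], float],
    n: int = 4,
) -> str:
    """Sample n responses and return the one the scorer rates highest."""
    if n < 1:
        raise ValueError("n must be at least 1")
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=lambda response: score(prompt, response))

# Toy stand-in for sampling from a policy: cycles through canned responses
_canned = cycle([
    "Talk to your doctor.",
    "The combination raises bleeding risk; monitor INR closely.",
    "These drugs can interact.",
])

def toy_generate(prompt: str) -> str:
    return next(_canned)

# Toy stand-in for a reward model: counts clinically specific keywords
def toy_score(prompt: str, response: str) -> float:
    keywords = ("bleeding", "INR", "monitor")
    return float(sum(word in response for word in keywords))

best = best_of_n(
    "What is the interaction between warfarin and aspirin?",
    toy_generate,
    toy_score,
    n=3,
)
print(best)
```

The cost moves from training time to inference time, since every request pays for N generations plus N reward model calls. The result also inherits the reward model's blind spots.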
RLHF/PPO is expensive because it requires:
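- Running four networks at once (policy, reference, reward model, and value function; the value function may be a head on the policy, as with AutoModelForCausalLMWithValueHead, or a separate model)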
- Online generation (slow, can't parallelize as easily as offline training)
- Careful hyperparameter tuning (KL coefficient, clip ratio, value loss coefficient)
- Large memory footprint per GPU
This is why DPO (covered in the next article) became popular: it trains directly on comparison pairs and aims for similar alignment quality without a separate reward model or an online RL loop.
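
Of the hyperparameters listed above, the clip ratio is the one that gives PPO its "proximal" behavior. The sketch below shows the standard clipped surrogate objective for a single token. The name `clipped_surrogate` and the numbers are illustrative assumptions, not values from the lesson or from TRL.

```python
import math

def clipped_surrogate(
    new_log_prob: float,
    old_log_prob: float,
    advantage: float,
    clip_ratio: float = 0.2,
) -> float:
    """PPO clipped objective for one token: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    ratio = math.exp(new_log_prob - old_log_prob)  # pi_new(token) / pi_old(token)
    clipped = max(1.0 - clip_ratio, min(ratio, 1.0 + clip_ratio))
    return min(ratio * advantage, clipped * advantage)

old = 0.0  # log prob under the policy that generated the sample

# Policy moved a lot toward a good token: the gain is capped at (1 + eps) * A
print(clipped_surrogate(math.log(1.5), old, advantage=1.0))

# Policy moved a lot toward a bad token: the penalty is not capped
print(clipped_surrogate(math.log(1.5), old, advantage=-1.0))

# Small move within the trust region: the objective is the unclipped ratio * A
print(clipped_surrogate(math.log(1.1), old, advantage=1.0))
```

Once the probability ratio leaves the interval [1 - clip_ratio, 1 + clip_ratio] in the direction the advantage favors, the objective goes flat and the gradient vanishes. That limits how far a single batch can push the policy.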
RLHF in Practice: What Actually Changes
Before RLHF (SFT model):
- Follows instructions but may be sycophantic
- Gives long-winded responses when brevity is better
- Doesn't reliably calibrate uncertainty
- May generate plausible-sounding but incorrect content
After RLHF (InstructGPT findings):
- Truthfulness improves (model expresses uncertainty instead of confabulating)
- Appropriate response length (more concise when brevity is preferred)
- Better instruction following on complex prompts
- Reduced toxic outputs
The labeler agreement problem: Human labelers disagree ~25-30% of the time on which response is better. The reward model learns the average preference signal, which may not represent any individual human's preferences well. This is one driver of Constitutional AI and RLAIF approaches.
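
One way to quantify this is to measure pairwise agreement on items labeled by more than one person. The sketch below uses made-up labels from three hypothetical annotators, and `pairwise_agreement` is an illustrative helper name rather than a standard function.

```python
from itertools import combinations

def pairwise_agreement(labels_by_annotator: dict[str, list[str]]) -> float:
    """Fraction of (annotator pair, item) combinations where both picked the same response."""
    agree = 0
    total = 0
    for labels_a, labels_b in combinations(labels_by_annotator.values(), 2):
        for choice_a, choice_b in zip(labels_a, labels_b):
            agree += int(choice_a == choice_b)
            total += 1
    if total == 0:
        raise ValueError("need at least two annotators with at least one shared item")
    return agree / total

# Made-up data: each list holds one annotator's preferred response ("A" or "B")
# for the same four comparison items, in the same order
labels = {
    "annotator_1": ["A", "A", "B", "A"],
    "annotator_2": ["A", "B", "B", "A"],
    "annotator_3": ["A", "A", "B", "B"],
}

rate = pairwise_agreement(labels)
print(f"pairwise agreement: {rate:.2f}, disagreement: {1 - rate:.2f}")
```

Items with low agreement are the ones where a single "chosen" label hides a real split in preferences. A reward model trained on them learns something closer to an average than to any one labeler's view.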