RLHF: Reward Model + PPO Explained — LLMs Deep Dive | Learnixo

The Alignment Problem RLHF Solves

After SFT, a model follows instructions but doesn't reliably produce responses humans prefer. SFT maximizes likelihood of demonstration data — it doesn't optimize for actual human preference signals. Two demonstrations look equally good in the SFT loss even if humans strongly prefer one over the other.

RLHF introduces explicit human preference signals and optimizes directly for what humans consider good responses.

The RLHF pipeline:

Supervised Fine-Tuning — create a competent base (described in the SFT article)
Reward Model Training — learn a scalar preference score from human comparisons
PPO Training — optimize the LLM using the reward model as a proxy for human judgment

Step 1: Collecting Human Preference Data

Humans are shown pairs of responses to the same prompt and select the preferred one:

Python

from dataclasses import dataclass

@dataclass
class PreferencePair:
    prompt: str
    chosen: str      # Human-preferred response
    rejected: str    # Less preferred response
    
# Example pairs from a pharmaceutical assistant
EXAMPLE_PAIRS = [
    PreferencePair(
        prompt="What is the interaction between warfarin and aspirin?",
        chosen="""Warfarin and aspirin have a major pharmacodynamic interaction. 
Aspirin inhibits platelet aggregation (COX-1 pathway) while warfarin inhibits 
clotting factor synthesis. Together, they significantly increase bleeding risk.

Management: Avoid combination if possible. If required (e.g., mechanical heart 
valve with AFib), use the lowest effective aspirin dose (75-100mg) and monitor 
INR closely. Discuss bleeding risk with patient.""",
        rejected="""These two drugs can interact. Taking them together may increase 
bleeding risk. You should talk to your doctor about this interaction.""",
    )
]
# The chosen response is more detailed, clinically specific, and actionable.

Scale required: OpenAI reportedly collected ~300,000 comparison pairs for InstructGPT. This is the most expensive part of the RLHF pipeline.
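
Comparison counts grow quickly when labelers rank several responses to one prompt instead of comparing only two, because a ranking of K responses expands into K(K-1)/2 pairs. The sketch below illustrates that expansion under stated assumptions: `pairs_from_ranking` is a hypothetical helper, not part of any labeling tool, and the `PreferencePair` dataclass is repeated only so the snippet runs on its own.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass
class PreferencePair:
    prompt: str
    chosen: str
    rejected: str


def pairs_from_ranking(prompt: str, ranked: list[str]) -> list[PreferencePair]:
    """Expand a best-to-worst ranking into all K*(K-1)/2 preference pairs."""
    # combinations() preserves input order, so `better` always outranks `worse`.
    return [
        PreferencePair(prompt=prompt, chosen=better, rejected=worse)
        for better, worse in combinations(ranked, 2)
    ]


# Hypothetical ranking, best response first
ranking = ["response A", "response B", "response C", "response D"]
pairs = pairs_from_ranking("example prompt", ranking)
print(len(pairs))  # 4 ranked responses -> 6 pairs
```

Pairs derived from the same ranking are correlated, so it may be safer to keep them together in one batch or one data split rather than treating them as independent examples.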

Step 2: Reward Model Training

The reward model learns to predict which response humans prefer:

Python

import torch
import torch.nn as nn
from transformers import AutoModelForCausalLM, AutoTokenizer

class RewardModel(nn.Module):
    """
    Reward model: takes prompt+response → scalar reward score.
    Built on top of a pretrained LM backbone.
    """
    def __init__(self, base_model_name: str):
        super().__init__()
        # Use pretrained LM backbone (usually the SFT model)
        self.backbone = AutoModelForCausalLM.from_pretrained(
            base_model_name,
            torch_dtype=torch.bfloat16,
        )
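        # Replace LM head with a single-value regression head
        # (created in the backbone's dtype to avoid a bf16/fp32 mismatch in the matmul)
        d_model = self.backbone.config.hidden_size
        self.backbone.lm_head = nn.Linear(d_model, 1, bias=False, dtype=torch.bfloat16)

    def forward(self, input_ids, attention_mask):
        outputs = self.backbone(
            input_ids=input_ids,
            attention_mask=attention_mask,
            output_hidden_states=True,
        )
        hidden = outputs.hidden_states[-1]                  # (B, T, d_model)
        # Read the reward at each sequence's last real token. With right padding,
        # position -1 would be a pad token for the shorter sequences in a batch.
        last_idx = attention_mask.sum(dim=1) - 1            # (B,)
        batch_idx = torch.arange(hidden.size(0), device=hidden.device)
        last_hidden = hidden[batch_idx, last_idx]           # (B, d_model)
        reward = self.backbone.lm_head(last_hidden)         # (B, 1)
        return reward.squeeze(-1)                           # (B,)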


def compute_reward_loss(
    reward_model: RewardModel,
    chosen_ids: torch.Tensor,
    chosen_mask: torch.Tensor,
    rejected_ids: torch.Tensor,
    rejected_mask: torch.Tensor,
) -> tuple[torch.Tensor, torch.Tensor]:
    """
    Bradley-Terry loss: reward_chosen should be > reward_rejected.
    Minimizes -log(sigmoid(r_chosen - r_rejected)).
    Returns (loss, pairwise accuracy).
    """
    r_chosen = reward_model(chosen_ids, chosen_mask)
    r_rejected = reward_model(rejected_ids, rejected_mask)

    # Negative log probability that chosen is preferred.
    # logsigmoid is more numerically stable than log(sigmoid(x)) for large negative margins.
    loss = -nn.functional.logsigmoid(r_chosen - r_rejected).mean()
    accuracy = (r_chosen > r_rejected).float().mean()

    return loss, accuracy


def train_reward_model(
    reward_model: RewardModel,
    preference_pairs: list[PreferencePair],
    tokenizer,
    epochs: int = 1,
) -> RewardModel:
    optimizer = torch.optim.AdamW(reward_model.parameters(), lr=1e-5)
    reward_model.train()

    for epoch in range(epochs):
        for pair in preference_pairs:
            # Tokenize chosen and rejected responses
            chosen = tokenizer(
                pair.prompt + pair.chosen,
                return_tensors="pt",
                max_length=1024,
                truncation=True,
            )
            rejected = tokenizer(
                pair.prompt + pair.rejected,
                return_tensors="pt",
                max_length=1024,
                truncation=True,
            )

            loss, acc = compute_reward_loss(
                reward_model,
                chosen["input_ids"],
                chosen["attention_mask"],
                rejected["input_ids"],
                rejected["attention_mask"],
            )

            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

    return reward_model
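
To make the Bradley-Terry loss concrete, the following sketch evaluates it on a few hand-picked reward values using only the standard library. The function names and the numbers are illustrative assumptions, not part of the training code above.

```python
import math


def preference_probability(r_chosen: float, r_rejected: float) -> float:
    """Bradley-Terry probability that the chosen response is preferred."""
    return 1.0 / (1.0 + math.exp(-(r_chosen - r_rejected)))


def bradley_terry_loss(r_chosen: float, r_rejected: float) -> float:
    """-log(sigmoid(r_chosen - r_rejected)), computed in a stable form."""
    margin = r_chosen - r_rejected
    # -log(sigmoid(m)) = log(1 + exp(-m)); rewrite for m < 0 to avoid a large exp()
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


print(round(bradley_terry_loss(0.0, 0.0), 3))  # 0.693: no margin, loss is ln 2
print(round(bradley_terry_loss(2.0, 0.0), 3))  # 0.127: correct ordering, small loss
print(round(bradley_terry_loss(0.0, 2.0), 3))  # 2.127: wrong ordering, large loss
```

Only the difference between the two rewards enters the loss, so the reward model's absolute scale is arbitrary. This is one reason reward scores are often normalized before PPO.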

Step 3: PPO Training

Proximal Policy Optimization (PPO) optimizes the LLM to maximize reward while staying close to the SFT model:

Python

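import torch
from transformers import AutoTokenizer
from trl import PPOTrainer, PPOConfig, AutoModelForCausalLMWithValueHead

# Note: this uses the legacy TRL PPOTrainer API; recent TRL releases replaced this interface.
tokenizer = AutoTokenizer.from_pretrained("path/to/sft-model")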

# PPO requires a model with a value head (estimates expected future reward)
model = AutoModelForCausalLMWithValueHead.from_pretrained(
    "path/to/sft-model",
    torch_dtype=torch.bfloat16,
)

# Reference model (frozen SFT model for KL penalty)
ref_model = AutoModelForCausalLMWithValueHead.from_pretrained(
    "path/to/sft-model",
    torch_dtype=torch.bfloat16,
)

ppo_config = PPOConfig(
    learning_rate=1.41e-5,
    batch_size=128,
    mini_batch_size=32,
    gradient_accumulation_steps=1,
    optimize_cuda_cache=True,
    early_stopping=True,
    target_kl=6.0,              # Stop if KL divergence exceeds this
    kl_penalty="kl",            # Penalty type
    seed=42,
    use_score_scaling=True,     # Divide rewards by their running standard deviation
    use_score_norm=True,        # Also subtract the running mean (requires use_score_scaling)
)

ppo_trainer = PPOTrainer(
    config=ppo_config,
    model=model,
    ref_model=ref_model,
    tokenizer=tokenizer,
)

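# Training loop
for batch in prompt_dataset:
    # 1. Generate responses using current policy
    query_tensors = list(batch["input_ids"])  # generate()/step() expect lists of 1-D tensors
    response_tensors = ppo_trainer.generate(
        query_tensors,
        return_prompt=False,
        max_new_tokens=256,
    )

    # 2. Score prompt + response with the reward model from Step 2
    #    (assumes it shares the policy's tokenizer and device)
    rewards = []
    for query, response in zip(query_tensors, response_tensors):
        input_ids = torch.cat([query, response]).unsqueeze(0)   # (1, T)
        attention_mask = torch.ones_like(input_ids)
        with torch.no_grad():
            score = reward_model(input_ids, attention_mask)      # (1,)
        rewards.append(score[0].float())                         # scalar tensor

    # 3. Run PPO step (updates policy with reward signal + KL penalty)
    stats = ppo_trainer.step(query_tensors, response_tensors, rewards)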
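
The policy update in a PPO step is built around a clipped surrogate objective, which limits how far a single update can move the policy away from the one that generated the samples. The sketch below shows that objective for a single token in plain Python. It is a simplified illustration, not TRL's implementation (which also includes a value loss and other terms), and `clip_range=0.2` is an assumed illustrative value.

```python
import math


def clipped_surrogate(
    new_logp: float,
    old_logp: float,
    advantage: float,
    clip_range: float = 0.2,
) -> float:
    """PPO clipped objective for one token (a quantity to be maximized)."""
    ratio = math.exp(new_logp - old_logp)  # pi_new(token) / pi_old(token)
    clipped_ratio = max(1.0 - clip_range, min(ratio, 1.0 + clip_range))
    # Taking the minimum removes any gain from pushing the ratio past the clip range.
    return min(ratio * advantage, clipped_ratio * advantage)


# Policy unchanged: the objective equals the advantage.
print(clipped_surrogate(-1.0, -1.0, advantage=2.0))
# Ratio 1.5 with a positive advantage: credit is capped at 1.2 * advantage.
print(clipped_surrogate(math.log(1.5), 0.0, advantage=2.0))
# Ratio 1.5 with a negative advantage: the penalty is not clipped.
print(clipped_surrogate(math.log(1.5), 0.0, advantage=-2.0))
```

The asymmetry in the last two cases is deliberate: clipping caps the benefit of moving further in a helpful direction but never hides the cost of moving in a harmful one.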

The KL Penalty: Why It's Critical

Without the KL penalty, the policy collapses — it learns to exploit the reward model rather than actually improving:

Python

def compute_rlhf_objective(
    policy_log_probs: torch.Tensor,      # Log probs of current model
    ref_log_probs: torch.Tensor,         # Log probs of frozen SFT model
    rewards: torch.Tensor,               # Reward model scores
    kl_coef: float = 0.1,               # How much to penalize KL divergence
) -> torch.Tensor:
    """
    Full RLHF objective: reward - β·KL(π||π_ref)
    
    The KL term prevents the model from gaming the reward model by
    drifting far from the SFT initialization (which loses language quality).
    """
    # Per-sequence KL estimate: log-ratio of the sampled tokens, summed over the sequence
    kl = (policy_log_probs - ref_log_probs).sum(dim=-1)

    # Final objective: maximize reward, minimize KL divergence
    objective = rewards - kl_coef * kl

    return objective.mean()

Why reward hacking happens: The reward model is an imperfect proxy for human judgment. Without the KL penalty, the policy finds "adversarial" outputs that the reward model scores highly but that don't represent real quality improvements. Classic examples: very long responses that score high only because the RM associates length with quality, or responses that repeat phrases the RM has learned to associate with human approval.
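
A toy calculation shows how the KL term changes which response the objective favors. All numbers below are made up for illustration: one response stays close to the reference model, the other earns a higher reward-model score but consists of tokens the reference model finds very unlikely.

```python
def rlhf_objective(
    reward: float,
    policy_logps: list[float],
    ref_logps: list[float],
    kl_coef: float,
) -> float:
    """reward - kl_coef * sum_t (log pi(y_t) - log pi_ref(y_t)) for one response."""
    kl_estimate = sum(p - r for p, r in zip(policy_logps, ref_logps))
    return reward - kl_coef * kl_estimate


# Hypothetical per-token log probs and reward scores
CANDIDATES = {
    "on_distribution": dict(
        reward=1.0,
        policy_logps=[-1.0, -1.2, -0.9],
        ref_logps=[-1.1, -1.3, -1.0],
    ),
    "reward_hacked": dict(
        reward=1.5,
        policy_logps=[-0.2, -0.1, -0.3],
        ref_logps=[-3.0, -4.0, -3.5],
    ),
}


def best_candidate(kl_coef: float) -> str:
    """Name of the candidate with the highest KL-penalized objective."""
    scores = {
        name: rlhf_objective(kl_coef=kl_coef, **fields)
        for name, fields in CANDIDATES.items()
    }
    return max(scores, key=scores.get)


print(best_candidate(0.0))  # reward_hacked: raw reward alone decides
print(best_candidate(0.1))  # on_distribution: the drift now costs about 0.99
```

With `kl_coef=0.1` the hacked response scores roughly 1.5 - 0.99 = 0.51 against 0.97 for the on-distribution one, so the extra reward no longer pays for the drift.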

PPO vs Alternatives: When to Use What

| Method | Complexity | Data Required | Stability | Human Feedback |
|---|---|---|---|---|
| SFT only | Low | Demonstrations | High | Implicit (in data quality) |
| RLHF/PPO | Very high | Comparison pairs + RL | Low | Explicit |
| DPO | Medium | Comparison pairs | High | Explicit |
| RLAIF | High | AI comparisons (no humans) | Medium | AI proxy |
| Best-of-N | Low | None | N/A | Reward model |

RLHF/PPO is expensive because it requires:

Running 4 models simultaneously (policy, reference, reward, value head)
Online generation (slow, can't parallelize as easily as offline training)
Careful hyperparameter tuning (KL coefficient, clip ratio, value loss coefficient)
Large memory footprint per GPU
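
A rough back-of-the-envelope estimate gives a feel for the memory cost. The sketch below counts model weights only, under loudly stated assumptions: a hypothetical 7B-parameter model, bf16 storage (2 bytes per parameter), and four full-size copies. It ignores gradients, optimizer state, activations, and generation caches, all of which add to the total.

```python
def weights_gb(n_params: float, bytes_per_param: int = 2) -> float:
    """Memory for model weights alone, in GB (10^9 bytes); bf16 uses 2 bytes each."""
    return n_params * bytes_per_param / 1e9


N_PARAMS = 7e9  # hypothetical 7B-parameter model
ROLES = ["policy", "reference", "reward", "value"]

per_model = weights_gb(N_PARAMS)
total = per_model * len(ROLES)
print(f"{per_model:.0f} GB per model, {total:.0f} GB for {len(ROLES)} models")
# 14 GB per model, 56 GB for 4 models
```

In setups where the value function is a small head on top of the policy backbone, as with the value-head model class used in the PPO example earlier, the value role does not need a full extra copy, so the real figure depends heavily on the implementation.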

This is why DPO (covered in the next article) became popular — it achieves similar alignment without the RL complexity.

RLHF in Practice: What Actually Changes

Before RLHF (SFT model):

Follows instructions but may be sycophantic
Gives long-winded responses when brevity is better
Doesn't reliably calibrate uncertainty
May generate plausible-sounding but incorrect content

After RLHF (InstructGPT findings):

Truthfulness improves (model expresses uncertainty instead of confabulating)
Appropriate response length (more concise when brevity is preferred)
Better instruction following on complex prompts
Reduced toxic outputs

The labeler agreement problem: Human labelers disagree ~25-30% of the time on which response is better. The reward model learns the average preference signal, which may not represent any individual human's preferences well. This is one driver of Constitutional AI and RLAIF approaches.
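
The "average preference" effect can be made precise under the Bradley-Terry loss from Step 2. If a fraction p of labelers prefers response A over response B, the expected loss is minimized when sigmoid(r_A - r_B) = p, so the learned margin encodes the population split rather than any single labeler's view. The sketch below checks this numerically; the function names and the 70/30 split are illustrative assumptions, and it idealizes a reward model that can fit this one pair exactly.

```python
import math


def expected_bt_loss(margin: float, p_a: float) -> float:
    """Expected Bradley-Terry loss when a fraction p_a of labelers prefers A."""
    loss_if_a_chosen = math.log1p(math.exp(-margin))  # -log sigmoid(margin)
    loss_if_b_chosen = math.log1p(math.exp(margin))   # -log sigmoid(-margin)
    return p_a * loss_if_a_chosen + (1.0 - p_a) * loss_if_b_chosen


def optimal_margin(p_a: float) -> float:
    """Margin r_A - r_B that minimizes the expected loss: the log-odds of p_a."""
    return math.log(p_a / (1.0 - p_a))


best = optimal_margin(0.7)
print(round(best, 3))  # 0.847

# Nearby margins all give a higher expected loss.
for margin in (0.0, 0.5, best, 1.5, 3.0):
    print(round(margin, 3), round(expected_bt_loss(margin, 0.7), 4))
```

A 70/30 split therefore yields only a modest reward gap, even though 30% of labelers hold the opposite preference outright.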