RLHF and DPO: Beyond Supervised Fine-Tuning

Why SFT Alone Isn't Enough

Supervised fine-tuning (SFT) teaches a model what to say. It doesn't teach the model what's better when multiple responses are possible.

Problems SFT alone struggles to fix:

Model gives technically correct but unhelpful responses
Model is overly verbose or overly terse
Model is sycophantic (agrees with wrong user claims)
Model gives dangerous advice that sounds plausible

Alignment techniques such as RLHF and DPO train on preference data: pairs of responses to the same prompt where humans indicate which one is better. This optimizes directly for human preference rather than only for next-token prediction.

The Preference Dataset

Both RLHF and DPO require the same data format: triplets of (prompt, chosen, rejected):

Python

preference_data = [
    {
        "prompt": "A patient is taking warfarin and asks about taking ibuprofen for headaches. What do you advise?",
        "chosen": "Ibuprofen (an NSAID) significantly increases bleeding risk when combined with warfarin through two mechanisms: it inhibits platelet aggregation via COX-1, and it can irritate the gastric mucosa. It may also displace warfarin from protein binding sites, increasing free drug levels. I'd recommend acetaminophen (up to 2g/day) as a safer alternative. If ibuprofen is necessary, INR monitoring should be intensified.",
        "rejected": "You should be careful with ibuprofen and warfarin because they can interact. It's probably fine in small doses, but check with your doctor.",
    },
    {
        "prompt": "What is the mechanism of action of metformin?",
        "chosen": "Metformin works primarily by inhibiting hepatic gluconeogenesis through activation of AMP-activated protein kinase (AMPK). AMPK activation reduces the expression of gluconeogenic enzymes (PEPCK, G6Pase), decreasing glucose output from the liver. Secondary mechanisms include improved peripheral insulin sensitivity and reduced intestinal glucose absorption. The result is lower fasting blood glucose without causing hypoglycemia.",
        "rejected": "Metformin lowers blood sugar. It affects the liver and helps your body use insulin better. It's used for type 2 diabetes.",
    },
]

The chosen response is what you want the model to learn. The rejected response is what you want the model to move away from. Both are responses the model might plausibly generate.
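Before training, it can help to check that every record really follows the (prompt, chosen, rejected) shape. The sketch below is one minimal way to do that; the function name `validate_preference_data` and the specific checks are illustrative assumptions, not part of any library.

```python
REQUIRED_KEYS = ("prompt", "chosen", "rejected")


def validate_preference_data(records):
    """Return (index, problem) tuples; an empty list means no issues were found."""
    problems = []
    for i, record in enumerate(records):
        # Every field must be a non-empty string.
        for key in REQUIRED_KEYS:
            value = record.get(key)
            if not isinstance(value, str) or not value.strip():
                problems.append((i, f"missing or empty '{key}'"))
        # Identical responses carry no preference signal.
        chosen = record.get("chosen")
        if chosen is not None and chosen == record.get("rejected"):
            problems.append((i, "chosen and rejected are identical"))
    return problems


sample = [
    {"prompt": "What is metformin used for?", "chosen": "Detailed answer.", "rejected": "Vague answer."},
    {"prompt": "What is warfarin?", "chosen": "Same text.", "rejected": "Same text."},
]
for index, problem in validate_preference_data(sample):
    print(f"record {index}: {problem}")
```

Running a check like this on `preference_data` before building the dataset catches malformed rows early, when they are cheap to fix.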

DPO: Direct Preference Optimization

DPO (Rafailov et al., 2023) has become a widely used, practical method for alignment fine-tuning. It removes the need for a separate reward model and a reinforcement learning loop: the policy is trained directly on preference pairs with a classification-style loss.

The DPO loss function:

L_DPO = -log σ( β · [ log(π_θ(y_w|x) / π_ref(y_w|x)) - log(π_θ(y_l|x) / π_ref(y_l|x)) ] )

Where:

y_w = chosen (winning) response
y_l = rejected (losing) response
π_θ = policy being trained
π_ref = reference policy (original SFT model, frozen)
β = coefficient controlling how far the policy may drift from the reference (higher β keeps it closer)

In plain terms: increase probability of chosen responses relative to the reference model, decrease probability of rejected responses.
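To make the loss concrete, here is a small worked example for a single preference pair. It is a sketch, assuming each input is the summed token log-probability of the whole response under the policy or the reference model; the function name `dpo_loss` and the numbers are made up for illustration.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one (prompt, chosen, rejected) example."""
    # Log-ratio of policy to reference for each response.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(margin)) equals softplus(-margin); this form avoids overflow.
    x = -margin
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


# Policy identical to the reference: the margin is 0, so the loss is log(2).
print(dpo_loss(-10.0, -12.0, -10.0, -12.0))

# Policy raises the chosen response and lowers the rejected one: the loss drops.
print(dpo_loss(-8.0, -14.0, -10.0, -12.0))
```

The loss depends only on how the policy's log-probabilities move relative to the reference, which is why a frozen reference model is needed during training.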

DPO Training with TRL

Python

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model, TaskType
from trl import DPOTrainer, DPOConfig
from datasets import Dataset

# Load SFT model (this is your starting point for DPO)
sft_model = AutoModelForCausalLM.from_pretrained("./sft-fine-tuned-model")
tokenizer = AutoTokenizer.from_pretrained("./sft-fine-tuned-model")
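# Many causal LMs ship without a pad token; reuse EOS only when none is set
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token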

# Add LoRA for DPO training (optional — can train full SFT model too)
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],
)
dpo_model = get_peft_model(sft_model, lora_config)

# Preference dataset
dpo_dataset = Dataset.from_list(preference_data)

dpo_config = DPOConfig(
    beta=0.1,                       # KL penalty strength (higher = closer to reference)
    max_prompt_length=512,
    max_length=1024,
    per_device_train_batch_size=2,
    num_train_epochs=3,
    learning_rate=5e-5,
    output_dir="./dpo-aligned-model",
    logging_steps=10,
    report_to="none",
)

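trainer = DPOTrainer(
    model=dpo_model,
    ref_model=None,  # None with a PEFT model: reference log-probs come from the base model with the adapter disabled
    args=dpo_config,
    train_dataset=dpo_dataset,
    tokenizer=tokenizer,  # newer TRL releases rename this argument to processing_class
)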

trainer.train()
trainer.save_model("./dpo-aligned-model")
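One way to sanity-check a DPO run is to look at the implicit reward, β times the log-ratio between policy and reference, and count how often the chosen response scores higher than the rejected one. The sketch below assumes you have already computed sequence log-probabilities for a held-out set; the dictionary keys and function names are hypothetical.

```python
def implicit_reward(policy_logp, ref_logp, beta=0.1):
    """Implicit DPO reward: beta * (log pi_theta - log pi_ref)."""
    return beta * (policy_logp - ref_logp)


def reward_accuracy(examples, beta=0.1):
    """Fraction of pairs where the chosen response gets the higher implicit reward."""
    if not examples:
        raise ValueError("need at least one example")
    wins = 0
    for ex in examples:
        chosen = implicit_reward(ex["policy_chosen_logp"], ex["ref_chosen_logp"], beta)
        rejected = implicit_reward(ex["policy_rejected_logp"], ex["ref_rejected_logp"], beta)
        if chosen > rejected:
            wins += 1
    return wins / len(examples)


held_out = [
    {"policy_chosen_logp": -8.0, "ref_chosen_logp": -10.0,
     "policy_rejected_logp": -14.0, "ref_rejected_logp": -12.0},
    {"policy_chosen_logp": -11.0, "ref_chosen_logp": -10.0,
     "policy_rejected_logp": -11.0, "ref_rejected_logp": -12.0},
]
print(reward_accuracy(held_out))
```

An accuracy near 0.5 on held-out pairs suggests the model has not learned the preference; values that climb toward 1.0 on training data but stay flat on held-out data point to overfitting.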

RLHF vs DPO

| Aspect | RLHF | DPO |
|---|---|---|
| Requires reward model | Yes | No |
| Training stability | Harder (PPO) | More stable |
| Compute cost | High | Lower |
| Implementation complexity | High | Moderate |
| Performance | State of the art | Near-equivalent |
| Common use | Frontier model training | Production fine-tuning |

For most practitioners fine-tuning an open-source model, DPO is the right choice. RLHF with PPO is primarily used by AI labs training frontier models.

Collecting Preference Data

If you don't have preference data, generate it:

Method 1: Human ranking. Show domain experts two model responses to the same prompt. They pick the better one.
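Expert judgments then need to be converted into the (prompt, chosen, rejected) format. The sketch below assumes a hypothetical annotation record with `response_a`, `response_b`, and a `preferred` field set to "a" or "b"; any other value is treated as a tie or a skip and dropped.

```python
def annotations_to_pairs(annotations):
    """Convert A/B expert rankings into DPO-style preference records."""
    pairs = []
    for ann in annotations:
        preferred = ann.get("preferred")
        if preferred == "a":
            chosen, rejected = ann["response_a"], ann["response_b"]
        elif preferred == "b":
            chosen, rejected = ann["response_b"], ann["response_a"]
        else:
            # Ties and skipped items carry no usable preference.
            continue
        pairs.append({"prompt": ann["prompt"], "chosen": chosen, "rejected": rejected})
    return pairs


annotations = [
    {"prompt": "Explain INR.", "response_a": "Short.", "response_b": "Thorough.", "preferred": "b"},
    {"prompt": "Explain HbA1c.", "response_a": "Clear.", "response_b": "Muddled.", "preferred": "tie"},
]
print(annotations_to_pairs(annotations))
```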

Method 2: LLM-generated preferences. Have GPT-4o generate both a high-quality and a deliberately weaker response, then label accordingly:

Python

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def generate_preference_pair(prompt: str, domain_system: str) -> dict:
    """Generate a (chosen, rejected) pair for DPO training."""

    # Generate the chosen (good) response
    chosen_resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": domain_system},
            {"role": "user", "content": prompt},
        ],
        temperature=0.3,
    )

    # Generate the rejected (worse) response with degraded instructions
    degraded_system = domain_system + "\n\nIMPORTANT: Keep your response very brief and non-specific. Do not include mechanisms or evidence."
    rejected_resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": degraded_system},
            {"role": "user", "content": prompt},
        ],
        temperature=0.7,
    )

    return {
        "prompt": prompt,
        "chosen": chosen_resp.choices[0].message.content,
        "rejected": rejected_resp.choices[0].message.content,
    }

Method 3: RLAIF (AI Feedback). Following Anthropic's Constitutional AI approach, have one LLM judge which of two responses better satisfies a set of principles.
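A minimal sketch of the judging step is shown below. The judge is passed in as a plain callable that takes prompt text and returns "A" or "B", so any LLM client can be wrapped to fit; the prompt wording, the example principles, and the function names are all assumptions made for illustration. Response order is randomized because LLM judges can favor whichever answer is shown first.

```python
import random

PRINCIPLES = [
    "Prefer the response that is more accurate and specific.",
    "Prefer the response that flags safety risks clearly.",
]


def build_judge_prompt(prompt, response_a, response_b, principles):
    """Assemble the text shown to the judge model."""
    rules = "\n".join(f"- {p}" for p in principles)
    return (
        "Judge which response better satisfies these principles:\n"
        f"{rules}\n\n"
        f"Prompt: {prompt}\n\n"
        f"Response A: {response_a}\n\n"
        f"Response B: {response_b}\n\n"
        "Answer with a single letter: A or B."
    )


def label_with_judge(prompt, response_1, response_2, judge, principles=PRINCIPLES, rng=None):
    """Return a preference record, or None if the verdict cannot be parsed."""
    rng = rng or random.Random()
    # Randomize which response appears as A to reduce position bias.
    if rng.random() < 0.5:
        a, b = response_2, response_1
    else:
        a, b = response_1, response_2
    verdict = judge(build_judge_prompt(prompt, a, b, principles)).strip().upper()
    if verdict not in ("A", "B"):
        return None
    chosen, rejected = (a, b) if verdict == "A" else (b, a)
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}
```

In practice `judge` would wrap a call to whichever LLM you use for feedback, and records that come back as `None` are simply discarded.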

What DPO Can Fix

DPO is effective at correcting:

Verbosity / over-explaining
Sycophancy (agreeing with incorrect claims)
Refusal over-triggering (refusing benign medical questions)
Format inconsistency
Tone (too casual / too clinical)

DPO is not a reliable way to inject new factual knowledge; that requires more SFT data or RAG. Use DPO for behavioral alignment, not knowledge improvement.