Instruction Tuning: From Predictor to Assistant

The Gap Between Pretraining and Assistance

A pretrained model is optimized for next-token prediction on web-scale text. Given the prompt "What is warfarin?", it produces a likely continuation of that text, which is not necessarily an answer. Depending on the context it infers, the continuation might be another question (if the training data contained Q&A pages), a Wikipedia-style article, or advertising copy.

Instruction tuning (also called SFT: Supervised Fine-Tuning) teaches the model a specific behavior pattern:

Input: an instruction (what to do)
Output: a helpful, accurate, direct response

This is done by fine-tuning on a dataset of (instruction, response) pairs.

The Data Format

Within a single model, instruction-tuning data follows one consistent template, but the template differs between model families:

Alpaca format:

### Instruction:
Explain the mechanism of action of warfarin in 3 sentences.

### Response:
Warfarin inhibits vitamin K epoxide reductase (VKOR), an enzyme responsible for recycling vitamin K to its active form. Without active vitamin K, the liver cannot synthesize functional clotting factors II, VII, IX, and X. This results in reduced clotting ability and increased anticoagulation.

Chat format (ChatML):

<|im_start|>system
You are a helpful clinical pharmacology assistant.<|im_end|>
<|im_start|>user
Explain the mechanism of action of warfarin in 3 sentences.<|im_end|>
<|im_start|>assistant
Warfarin inhibits vitamin K epoxide reductase (VKOR)...<|im_end|>

LLaMA-3 format:

<|begin_of_text|><|start_header_id|>system<|end_header_id|>

You are a helpful assistant.<|eot_id|><|start_header_id|>user<|end_header_id|>

Explain warfarin's mechanism.<|eot_id|><|start_header_id|>assistant<|end_header_id|>

Warfarin inhibits VKOR...<|eot_id|>
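
To make the template idea concrete, here is a minimal sketch that renders one (instruction, response) pair in the Alpaca-style and ChatML layouts shown above. The function names `to_alpaca` and `to_chatml` are hypothetical helpers introduced for illustration. For a real model, prefer the tokenizer's own chat template, since a hand-written template can drift from what the model expects.

```python
from typing import Optional


def to_alpaca(instruction: str, response: str) -> str:
    """Render one pair in the Alpaca-style layout shown above."""
    return f"### Instruction:\n{instruction}\n\n### Response:\n{response}"


def to_chatml(instruction: str, response: str, system: Optional[str] = None) -> str:
    """Render one pair as ChatML turns, with an optional system turn."""
    turns = []
    if system is not None:
        turns.append(("system", system))
    turns.append(("user", instruction))
    turns.append(("assistant", response))
    # Each turn is wrapped in <|im_start|>role ... <|im_end|>
    return "\n".join(
        f"<|im_start|>{role}\n{content}<|im_end|>" for role, content in turns
    )


pair = {
    "instruction": "Explain warfarin's mechanism.",
    "response": "Warfarin inhibits VKOR...",
}
alpaca_text = to_alpaca(pair["instruction"], pair["response"])
chatml_text = to_chatml(
    pair["instruction"], pair["response"], system="You are a helpful assistant."
)
print(alpaca_text)
print(chatml_text)
```

The same pair produces two different training strings, which is why a model fine-tuned on one template should be prompted with that same template at inference time.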

Training Setup

Python

from transformers import AutoTokenizer, AutoModelForCausalLM, TrainingArguments
from trl import SFTTrainer, DataCollatorForCompletionOnlyLM
from datasets import Dataset
from peft import LoraConfig, get_peft_model

# Format data using the model's chat template
MODEL_ID = "meta-llama/Meta-Llama-3-8B"  # Hugging Face Hub id of the Llama 3 8B base model
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# The tokenizer may ship without a pad token; the collator needs one to pad batches
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

def format_example(example: dict) -> str:
    """Format a single instruction-response pair."""
    messages = [
        {"role": "system", "content": "You are a helpful clinical pharmacology assistant."},
        {"role": "user", "content": example["instruction"]},
        {"role": "assistant", "content": example["response"]},
    ]
    # Assumes the tokenizer defines a chat template; otherwise set tokenizer.chat_template first
    return tokenizer.apply_chat_template(messages, tokenize=False)

# Load your instruction dataset
raw_data = [
    {
        "instruction": "What is the therapeutic range for warfarin?",
        "response": "The therapeutic INR range for warfarin is typically 2.0-3.0 for most indications, such as atrial fibrillation and DVT. For mechanical heart valves, a higher range of 2.5-3.5 is often targeted.",
    },
    # ... thousands more
]

dataset = Dataset.from_list(raw_data)
formatted_dataset = dataset.map(lambda x: {"text": format_example(x)})

# Load base model (same id as the tokenizer)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype="auto",
    device_map="auto",
)

# Add LoRA adapters (fine-tune efficiently)
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)

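# Response-only masking: only compute loss on the assistant's response.
# The template must match the assistant header produced by the chat template exactly;
# if it is not found in an example, that example contributes no loss.
response_template = "<|start_header_id|>assistant<|end_header_id|>\n\n"
collator = DataCollatorForCompletionOnlyLM(response_template, tokenizer=tokenizer)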

# Train
training_args = TrainingArguments(
    output_dir="./sft-pharmacology",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,
    bf16=True,
    logging_steps=50,
    save_strategy="epoch",
    warmup_ratio=0.05,
)

trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=formatted_dataset,
    data_collator=collator,
    tokenizer=tokenizer,
)

trainer.train()
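
Two numbers are worth checking before launching a run like this: how many parameters the LoRA adapters actually train, and how many examples go into each optimizer step. The sketch below works both out by hand. The layer shapes are an assumption about Llama 3 8B (hidden size 4096, 8 key/value heads of dimension 128, 32 layers) and should be verified against the model's `config.json`. The single-GPU setting is also an assumption.

```python
def lora_params_per_layer(r: int, shapes: dict[str, tuple[int, int]]) -> int:
    """Trainable LoRA parameters for one transformer layer.

    Each adapted weight of shape (d_out, d_in) gains two matrices,
    A (r x d_in) and B (d_out x r), i.e. r * (d_in + d_out) parameters.
    """
    return sum(r * (d_in + d_out) for d_in, d_out in shapes.values())


# ASSUMPTION: Llama 3 8B attention shapes; check config.json before relying on them.
HIDDEN, KV_DIM, NUM_LAYERS = 4096, 8 * 128, 32
attention_shapes = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
}

r, lora_alpha = 16, 32
total_lora_params = NUM_LAYERS * lora_params_per_layer(r, attention_shapes)
lora_scaling = lora_alpha / r  # multiplier applied to the low-rank update

# Examples per optimizer step, assuming a single GPU.
per_device_batch, grad_accum, num_gpus = 4, 4, 1
effective_batch = per_device_batch * grad_accum * num_gpus

print(total_lora_params, lora_scaling, effective_batch)
```

Under these assumptions the adapters add roughly 13.6 million trainable parameters, a small fraction of the 8 billion in the base model, and each optimizer step sees 16 examples.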

Why Response-Only Loss Matters

Without masking, the loss is computed on both the instruction (user turn) and the response (assistant turn). This causes two problems:

The model is trained to predict instruction tokens, which it never has to generate at inference time
Gradient from the instruction tokens dilutes the signal from the response: the goal is to follow instructions, not to reproduce them

Response-only masking sets the label to -100 for every token outside the response. PyTorch's CrossEntropyLoss uses ignore_index=-100 by default, so those positions contribute nothing to the loss or its gradient:

Python

# What the labels tensor looks like with response-only masking:
# Input:  [sys] [user_tokens] [assistant_tokens] [eos]
# Labels: [-100] [-100 × n]  [assistant_tokens]  [eos]
#                 ↑ ignored     ↑ these contribute to loss
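
The masking step itself is small enough to sketch in plain Python. This is an illustrative re-implementation of the idea, not the library's code: `mask_prompt_labels` and `masked_mean_nll` are hypothetical names, the token ids are made up, and the per-token losses are invented numbers, simplified so that `token_nll[i]` is the loss for predicting `input_ids[i]`.

```python
IGNORE_INDEX = -100  # label value that the loss skips


def find_subsequence(sequence: list[int], pattern: list[int]) -> int:
    """Return the index where `pattern` first occurs in `sequence`, or -1."""
    for start in range(len(sequence) - len(pattern) + 1):
        if sequence[start:start + len(pattern)] == pattern:
            return start
    return -1


def mask_prompt_labels(input_ids: list[int], response_template_ids: list[int]) -> list[int]:
    """Copy input_ids to labels, masking everything up to and including the template."""
    start = find_subsequence(input_ids, response_template_ids)
    if start == -1:
        # Template not found: mask the whole example so it adds no loss.
        return [IGNORE_INDEX] * len(input_ids)
    response_start = start + len(response_template_ids)
    return [IGNORE_INDEX] * response_start + input_ids[response_start:]


def masked_mean_nll(token_nll: list[float], labels: list[int]) -> float:
    """Average per-token loss over positions whose label is not IGNORE_INDEX."""
    kept = [nll for nll, label in zip(token_nll, labels) if label != IGNORE_INDEX]
    return sum(kept) / len(kept) if kept else 0.0


# Toy ids (hypothetical): 90, 91 stand in for the assistant header tokens.
input_ids = [1, 10, 11, 12, 90, 91, 20, 21, 2]
labels = mask_prompt_labels(input_ids, response_template_ids=[90, 91])

# Made-up per-token losses; the prompt tokens are deliberately hard to predict.
token_nll = [0.0, 3.0, 3.0, 3.0, 1.0, 1.0, 0.5, 0.5, 0.2]
print(labels)
print(masked_mean_nll(token_nll, labels))
```

In this toy case only the last three positions count toward the loss, so the high losses on the prompt tokens no longer influence the update.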

Data Quality vs Quantity

For instruction tuning, quality dominates quantity at small scale:

| Dataset size | What to optimize |
|---|---|
| Under 1,000 examples | Focus entirely on response quality; every example must be excellent |
| 1,000–10,000 | Balance quality and coverage of different instruction types |
| 10,000–100,000 | Start deduplication, balance task distribution |
| 100k+ | Quality filtering and deduplication are critical |

The LIMA paper (2023) showed 1,000 carefully curated examples can produce a model competitive with models trained on much larger datasets. The key: each example must demonstrate the exact behavior pattern you want to instill.
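
Deduplication is the easiest of these steps to automate. The following is a minimal sketch of exact deduplication after light normalization, assuming examples are dictionaries with `instruction` and `response` keys as in the training setup. The helper names are hypothetical, and near-duplicate detection (paraphrases, reworded questions) would need a fuzzier method than this.

```python
import hashlib
import re


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants hash identically."""
    return re.sub(r"\s+", " ", text.strip().lower())


def dedupe_pairs(examples: list[dict]) -> list[dict]:
    """Keep the first occurrence of each normalized (instruction, response) pair."""
    seen: set[str] = set()
    unique = []
    for example in examples:
        instruction = normalize(example["instruction"])
        response = normalize(example["response"])
        if not instruction or not response:
            continue  # drop examples with an empty field
        # \x1f (unit separator) keeps the two fields from running together
        key = hashlib.sha256(f"{instruction}\x1f{response}".encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(example)
    return unique


examples = [
    {"instruction": "What is warfarin?", "response": "An oral anticoagulant."},
    {"instruction": "  what is   WARFARIN? ", "response": "An oral anticoagulant."},
    {"instruction": "What is metformin?", "response": "A first-line oral drug for type 2 diabetes."},
    {"instruction": "What is heparin?", "response": "   "},
]
deduped = dedupe_pairs(examples)
print(len(examples), "->", len(deduped))
```

Hashing the normalized pair rather than the instruction alone keeps examples that ask the same question but demonstrate different responses, which may or may not be what a given dataset needs.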

What SFT Teaches vs What It Doesn't

SFT teaches:

Format: respond to questions directly, use structured output when asked
Tone: be helpful, clear, and appropriately concise
Safety behaviors partially: avoid obviously harmful responses (if training data includes refusals)
Domain skills: if fine-tuned on domain-specific Q&A

SFT does NOT reliably teach:

Honest calibration of uncertainty — the model may confidently generate false answers in the training data's style
Consistent refusal of harmful requests — without explicit refusal examples, models may comply
Preference alignment — which of two responses is better quality

These gaps are usually addressed with a preference-based stage after SFT, such as RLHF or DPO.
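
Writing the two objectives side by side shows why SFT alone cannot express a preference. The notation is introduced here for illustration and is not taken from the text above: π_θ is the model being trained, π_ref a frozen reference model (typically the SFT checkpoint), y_w and y_l the preferred and rejected responses to a prompt x, σ the logistic function, and β a scaling hyperparameter. The DPO form follows the loss as it is commonly stated in the DPO paper.

```latex
\mathcal{L}_{\text{SFT}}(\theta)
  = -\,\mathbb{E}_{(x,\,y)\sim\mathcal{D}}
    \sum_{t=1}^{|y|} \log \pi_\theta\!\left(y_t \mid x,\, y_{<t}\right)

\mathcal{L}_{\text{DPO}}(\theta)
  = -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}
    \log \sigma\!\left(
      \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)}
      - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}
    \right)
```

The SFT loss only raises the likelihood of the single demonstrated response, so it carries no information about which of two plausible responses is better. The DPO loss is defined on a pair and is minimized by widening the gap between the preferred and rejected response relative to the reference model.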

Multi-Task Instruction Tuning

Training on diverse instruction types generalizes better than single-task fine-tuning:

Python

# Example of diverse instruction types
instruction_types = {
    "question_answering": [
        "What is the mechanism of action of metformin?",
        "When was penicillin discovered?",
    ],
    "summarization": [
        "Summarize the following clinical study in 3 bullet points: ...",
    ],
    "classification": [
        "Classify the following drug interaction as major, moderate, or minor: ...",
    ],
    "extraction": [
        "Extract all drug names mentioned in the following text: ...",
    ],
    "generation": [
        "Write a patient information leaflet for warfarin.",
    ],
    "reasoning": [
        "A patient on warfarin starts taking ibuprofen. What is the clinical concern?",
    ],
    "code": [
        "Write a Python function to calculate creatinine clearance using the Cockcroft-Gault formula.",
    ],
}

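FLAN (Finetuned Language Net) showed that fine-tuning on a large collection of tasks phrased as instructions improves zero-shot performance on unseen task types, and later work in the same line scaled the collection to more than a thousand tasks. The training mixtures of proprietary assistants such as GPT-4 are not publicly documented, so the FLAN results are the better-supported evidence here.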
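
When task sizes are very uneven, the largest task can dominate the mixture. One simple way to balance it is to cap the number of examples drawn from each task before shuffling. The sketch below assumes that approach; `mix_tasks`, the cap of 50, and the toy task sizes are all illustrative choices, not values from the text above.

```python
import random


def mix_tasks(
    tasks: dict[str, list[str]], cap_per_task: int, seed: int = 0
) -> list[tuple[str, str]]:
    """Sample up to `cap_per_task` examples from each task, then shuffle the mixture."""
    rng = random.Random(seed)  # fixed seed keeps the mixture reproducible
    mixture = []
    for task_name, examples in tasks.items():
        if len(examples) <= cap_per_task:
            chosen = examples  # small tasks are kept whole
        else:
            chosen = rng.sample(examples, cap_per_task)  # large tasks are subsampled
        mixture.extend((task_name, text) for text in chosen)
    rng.shuffle(mixture)
    return mixture


# Toy data with deliberately uneven task sizes.
toy_tasks = {
    "question_answering": [f"QA example {i}" for i in range(1000)],
    "summarization": [f"Summarization example {i}" for i in range(40)],
    "classification": [f"Classification example {i}" for i in range(5)],
}
mixture = mix_tasks(toy_tasks, cap_per_task=50)
counts = {name: sum(1 for task, _ in mixture if task == name) for name in toy_tasks}
print(counts)
```

With the cap applied, question answering drops from 1,000 examples to 50 while the two smaller tasks are kept in full, so no single task accounts for most of the training signal.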

Evaluating SFT Quality

Python

from openai import OpenAI

def evaluate_sft_model(model, tokenizer, eval_examples: list[dict]) -> dict:
    """Evaluate instruction-following quality on held-out examples with an LLM judge."""
    judge = OpenAI()  # reads OPENAI_API_KEY from the environment

    scores = []
    for example in eval_examples:
        # Build the prompt. add_generation_prompt=True appends the assistant header,
        # so the model answers instead of continuing the user turn.
        messages = [{"role": "user", "content": example["instruction"]}]
        inputs = tokenizer.apply_chat_template(
            messages,
            add_generation_prompt=True,
            return_tensors="pt",
            return_dict=True,
        ).to(model.device)
        outputs = model.generate(**inputs, max_new_tokens=512)

        # Decode only the newly generated tokens, not the prompt
        prompt_length = inputs["input_ids"].shape[1]
        generated = tokenizer.decode(outputs[0][prompt_length:], skip_special_tokens=True)

        # Judge with GPT-4o
        judge_prompt = f"""Rate this response on a scale of 1-5:
Instruction: {example['instruction']}
Reference response: {example['response']}
Generated response: {generated}

Score criteria:
1: Incorrect or completely unhelpful
3: Partially correct, acceptable format
5: Accurate, well-formatted, appropriately concise

Return only the integer score."""

        judge_response = judge.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": judge_prompt}],
            temperature=0,  # keep scoring as repeatable as possible
        )
        # Assumes the judge returns a bare integer, as instructed in the prompt
        score = int(judge_response.choices[0].message.content.strip())
        scores.append(score)

    return {
        "mean_score": sum(scores) / len(scores),
        "min_score": min(scores),
        "score_distribution": {i: scores.count(i) for i in range(1, 6)},
    }
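
Judge models do not always return a bare integer, even when asked to, and a single reply such as "Score: 4/5" will crash a plain `int()` conversion. The following is a sketch of a more tolerant parser. `parse_judge_score` and `summarize_scores` are hypothetical helpers, and the choice to reject decimals such as "3.5" rather than round them is an assumption about what a strict 1-5 rubric should accept.

```python
import re
from typing import Optional

# An integer that is not part of a longer number or a decimal such as "3.5".
_INTEGER = re.compile(r"(?<!\d)(?<!\d\.)\d+(?!\.?\d)")


def parse_judge_score(raw: str, low: int = 1, high: int = 5) -> Optional[int]:
    """Return the first standalone integer in [low, high] found in `raw`, else None."""
    for match in _INTEGER.finditer(raw):
        value = int(match.group())
        if low <= value <= high:
            return value
    return None


def summarize_scores(parsed: list[Optional[int]]) -> dict:
    """Aggregate valid scores and report how many replies could not be parsed."""
    valid = [score for score in parsed if score is not None]
    return {
        "mean_score": sum(valid) / len(valid) if valid else None,
        "num_scored": len(valid),
        "num_unparsed": len(parsed) - len(valid),
    }


replies = ["4", " 5\n", "Score: 4/5", "3.5", "I would give this a 7.", "no score"]
parsed = [parse_judge_score(reply) for reply in replies]
summary = summarize_scores(parsed)
print(parsed)
print(summary)
```

Reporting the number of unparsed replies alongside the mean makes it visible when the judge is ignoring the output format, which would otherwise silently bias the average.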

SFT vs Pretraining: Key Differences

| | Pretraining | SFT |
|---|---|---|
| Data | Trillions of raw tokens | Thousands to millions of (instruction, response) pairs |
| Epochs | About 1 (most tokens seen once) | 1–5 |
| Learning rate | 1e-4 to 3e-4 | 1e-5 to 2e-4 (upper end typical with LoRA) |
| Duration | Weeks on hundreds of GPUs | Hours to days on 1–8 GPUs |
| Loss computation | All tokens | Response tokens only |
| Goal | Learn the distribution of language | Learn instruction-following behavior |
| LoRA | Not typically used | Standard for efficient adaptation |
