Instruction Tuning: From Predictor to Assistant
How supervised fine-tuning (SFT) on instruction-response pairs transforms a pretrained language model into an assistant that follows directions and completes tasks.
The Gap Between Pretraining and Assistance
A pretrained model is optimized for next-token prediction on web text. Given the prompt "What is warfarin?", it produces a statistically likely continuation, which might be another question (if the training data contained Q&A pages), a Wikipedia-style article, or advertisement copy, depending on what context it infers.
Instruction tuning (also called SFT: Supervised Fine-Tuning) teaches the model a specific behavior pattern:
- Input: an instruction (what to do)
- Output: a helpful, accurate, direct response
This is done by fine-tuning on a dataset of (instruction, response) pairs.
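As a rough illustration of what one such pair looks like once it is serialized for training, the sketch below renders a pair into a single string. The `render_pair` helper is hypothetical, and its Alpaca-style layout is only one possible choice; the next section shows the templates real models use.

```python
def render_pair(instruction: str, response: str) -> str:
    """Serialize one (instruction, response) pair into a single training string.

    Hypothetical helper: the layout mimics an Alpaca-style template.
    """
    return (
        "### Instruction:\n"
        f"{instruction.strip()}\n"
        "### Response:\n"
        f"{response.strip()}"
    )


# Illustrative pair, not taken from any real dataset
example_text = render_pair(
    "What is warfarin?",
    "Warfarin is an oral anticoagulant used to prevent blood clots.",
)
print(example_text)
```

During fine-tuning, every pair in the dataset is rendered through the same template, so the model sees one consistent pattern separating instruction from response.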
The Data Format
Instruction-tuning data follows a consistent template. Different models use different templates:
Alpaca format:
### Instruction:
Explain the mechanism of action of warfarin in 3 sentences.
### Response:
Warfarin inhibits vitamin K epoxide reductase (VKOR), an enzyme responsible for recycling vitamin K to its active form. Without active vitamin K, the liver cannot synthesize functional clotting factors II, VII, IX, and X. This results in reduced clotting ability and increased anticoagulation.
Chat format (ChatML):
<|im_start|>system
You are a helpful clinical pharmacology assistant.<|im_end|>
<|im_start|>user
Explain the mechanism of action of warfarin in 3 sentences.<|im_end|>
<|im_start|>assistant
Warfarin inhibits vitamin K epoxide reductase (VKOR)...<|im_end|>
LLaMA-3 format:
<|begin_of_text|><|start_header_id|>system<|end_header_id|>
You are a helpful assistant.<|eot_id|>
<|start_header_id|>user<|end_header_id|>
Explain warfarin's mechanism.<|eot_id|>
<|start_header_id|>assistant<|end_header_id|>
Warfarin inhibits VKOR...<|eot_id|>
Training Setup
from transformers import AutoTokenizer, AutoModelForCausalLM, TrainingArguments
from trl import SFTTrainer, DataCollatorForCompletionOnlyLM
from datasets import Dataset
from peft import LoraConfig, get_peft_model

# Format data using the model's chat template
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3-8B")
# Some tokenizers (Llama 3 included) define no pad token; the collator needs one to pad batches
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token


def format_example(example: dict) -> str:
    """Format a single instruction-response pair."""
    messages = [
        {"role": "system", "content": "You are a helpful clinical pharmacology assistant."},
        {"role": "user", "content": example["instruction"]},
        {"role": "assistant", "content": example["response"]},
    ]
    return tokenizer.apply_chat_template(messages, tokenize=False)


# Load your instruction dataset
raw_data = [
    {
        "instruction": "What is the therapeutic range for warfarin?",
        "response": "The therapeutic INR range for warfarin is typically 2.0-3.0 for most indications, such as atrial fibrillation and DVT. For mechanical heart valves, a higher range of 2.5-3.5 is often targeted.",
    },
    # ... thousands more
]
dataset = Dataset.from_list(raw_data)
formatted_dataset = dataset.map(lambda x: {"text": format_example(x)})

# Load base model
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3-8B",
    torch_dtype="auto",
    device_map="auto",
)

# Add LoRA adapters (fine-tune efficiently)
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)

# Response-only masking: only compute loss on the assistant's response
response_template = "<|start_header_id|>assistant<|end_header_id|>\n\n"
collator = DataCollatorForCompletionOnlyLM(response_template, tokenizer=tokenizer)

# Train
training_args = TrainingArguments(
    output_dir="./sft-pharmacology",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,  # effective batch size of 16 per device
    learning_rate=2e-4,
    bf16=True,
    logging_steps=50,
    save_strategy="epoch",
    warmup_ratio=0.05,
)
trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=formatted_dataset,
    data_collator=collator,
    tokenizer=tokenizer,
)
trainer.train()

Why Response-Only Loss Matters
Without masking, the model computes loss on both the instruction (user turn) and the response (assistant turn). This causes two problems:
- The model is penalized for not being able to predict the instruction, which it shouldn't be generating
- The gradient signal from the instruction tokens is noise — the model should learn to follow instructions, not memorize them
Response-only masking sets labels = -100 for instruction tokens. The CrossEntropyLoss with ignore_index=-100 skips those positions:
# What the labels tensor looks like with response-only masking:
# Input:  [sys]  [user_tokens] [assistant_tokens] [eos]
# Labels: [-100] [-100 × n]    [assistant_tokens] [eos]
#         ↑ ignored            ↑ these contribute to loss

Data Quality vs Quantity
For instruction tuning, quality dominates quantity at small scale:
| Dataset size | What to optimize |
|---|---|
| Under 1,000 examples | Focus entirely on response quality: every example must be excellent |
| 1,000–10,000 | Balance quality and coverage of different instruction types |
| 10,000–100,000 | Start deduplication, balance task distribution |
| 100k+ | Quality filtering and deduplication are critical |
The LIMA paper (2023) showed 1,000 carefully curated examples can produce a model competitive with models trained on much larger datasets. The key: each example must demonstrate the exact behavior pattern you want to instill.
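The deduplication step in the table can start out very simply. The sketch below is a minimal version that assumes exact-match deduplication on a normalized instruction string; the `normalize` and `deduplicate` helpers are hypothetical names, and larger datasets often need fuzzier near-duplicate detection than this.

```python
import hashlib
import re


def normalize(text: str) -> str:
    """Lowercase, drop punctuation, and collapse whitespace so trivial variants match."""
    text = re.sub(r"[^\w\s]", "", text.lower())
    return re.sub(r"\s+", " ", text).strip()


def deduplicate(examples: list[dict]) -> list[dict]:
    """Keep the first example for each normalized instruction; drop later repeats."""
    seen = set()
    unique = []
    for example in examples:
        key = hashlib.sha256(normalize(example["instruction"]).encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(example)
    return unique


# Placeholder examples for illustration only
examples = [
    {"instruction": "What is warfarin?", "response": "A"},
    {"instruction": "what is  Warfarin", "response": "B"},
    {"instruction": "What is metformin?", "response": "C"},
]
print(len(deduplicate(examples)))  # 2
```

Keeping the first occurrence is an arbitrary choice here; when duplicates differ in response quality, it may be better to keep the best-rated response instead.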
What SFT Teaches vs What It Doesn't
SFT teaches:
- Format: respond to questions directly, use structured output when asked
- Tone: be helpful, clear, and appropriately concise
- Safety behaviors partially: avoid obviously harmful responses (if training data includes refusals)
- Domain skills: if fine-tuned on domain-specific Q&A
SFT does NOT reliably teach:
- Honest calibration of uncertainty — the model may confidently generate false answers in the training data's style
- Consistent refusal of harmful requests — without explicit refusal examples, models may comply
- Preference alignment — which of two responses is better quality
These are typically addressed with preference optimization, such as RLHF or DPO, applied after SFT.
Multi-Task Instruction Tuning
Training on diverse instruction types generalizes better than single-task fine-tuning:
# Example of diverse instruction types
instruction_types = {
    "question_answering": [
        "What is the mechanism of action of metformin?",
        "When was penicillin discovered?",
    ],
    "summarization": [
        "Summarize the following clinical study in 3 bullet points: ...",
    ],
    "classification": [
        "Classify the following drug interaction as major, moderate, or minor: ...",
    ],
    "extraction": [
        "Extract all drug names mentioned in the following text: ...",
    ],
    "generation": [
        "Write a patient information leaflet for warfarin.",
    ],
    "reasoning": [
        "A patient on warfarin starts taking ibuprofen. What is the clinical concern?",
    ],
    "code": [
        "Write a Python function to calculate creatinine clearance using the Cockcroft-Gault formula.",
    ],
}

FLAN (Finetuned Language Net) and later instruction-tuned models train on hundreds to thousands of task types, which improves zero-shot generalization to new task formats.
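In practice, task types rarely come in equal sizes, and the largest one can dominate training. One simple way to keep the mix balanced is to cap the number of examples drawn from each task. The sketch below assumes that approach; `balanced_mix` and its `cap_per_task` parameter are hypothetical names, and the task contents are synthetic placeholders.

```python
import random


def balanced_mix(
    tasks: dict[str, list[str]], cap_per_task: int, seed: int = 0
) -> list[tuple[str, str]]:
    """Sample at most `cap_per_task` instructions from each task, then shuffle."""
    rng = random.Random(seed)
    mixed = []
    for task_name, instructions in tasks.items():
        k = min(cap_per_task, len(instructions))
        for instruction in rng.sample(instructions, k):
            mixed.append((task_name, instruction))
    rng.shuffle(mixed)  # interleave tasks so batches are not single-task
    return mixed


# Synthetic task pools of very different sizes
tasks = {
    "question_answering": [f"qa-{i}" for i in range(1000)],
    "summarization": [f"sum-{i}" for i in range(50)],
    "extraction": [f"ext-{i}" for i in range(10)],
}
mix = balanced_mix(tasks, cap_per_task=50)
counts = {name: sum(1 for t, _ in mix if t == name) for name in tasks}
print(counts)  # {'question_answering': 50, 'summarization': 50, 'extraction': 10}
```

The cap trades away some data from large tasks in exchange for coverage; a softer alternative is to sample each task with probability proportional to a capped size rather than discarding examples outright.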
Evaluating SFT Quality
import re

from openai import OpenAI


def evaluate_sft_model(model, tokenizer, eval_examples: list[dict]) -> dict:
    """Evaluate instruction-following quality on held-out examples."""
    judge = OpenAI()  # reads OPENAI_API_KEY from the environment
    scores = []
    for example in eval_examples:
        # Generate response from fine-tuned model
        messages = [{"role": "user", "content": example["instruction"]}]
        inputs = tokenizer.apply_chat_template(
            messages,
            add_generation_prompt=True,  # append the assistant header so the model answers
            return_tensors="pt",
        ).to(model.device)
        outputs = model.generate(inputs, max_new_tokens=512)
        # Decode only the newly generated tokens, not the prompt
        generated = tokenizer.decode(outputs[0][inputs.shape[1]:], skip_special_tokens=True)

        # Judge with GPT-4o
        judge_prompt = f"""Rate this response on a scale of 1-5:
Instruction: {example['instruction']}
Reference response: {example['response']}
Generated response: {generated}
Score criteria:
1: Incorrect or completely unhelpful
3: Partially correct, acceptable format
5: Accurate, well-formatted, appropriately concise
Return only the integer score."""
        judge_response = judge.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": judge_prompt}],
        )
        # Take the first digit 1-5 in case the judge adds extra text
        match = re.search(r"[1-5]", judge_response.choices[0].message.content)
        if match:
            scores.append(int(match.group()))

    if not scores:
        raise ValueError("No parseable judge scores were returned.")
    return {
        "mean_score": sum(scores) / len(scores),
        "min_score": min(scores),
        "score_distribution": {i: scores.count(i) for i in range(1, 6)},
    }

SFT vs Pretraining: Key Differences
| | Pretraining | SFT |
|---|---|---|
| Data | Trillions of raw tokens | Thousands to millions of (instruction, response) pairs |
| Epochs | Less than 1 (each token seen once) | 1–5 epochs |
| Learning rate | 1e-4 to 3e-4 | 1e-5 to 2e-4 |
| Duration | Weeks on hundreds of GPUs | Hours to days on 1–8 GPUs |
| Loss computation | All tokens | Response tokens only |
| Goal | Learn the distribution of language | Learn instruction-following behavior |
| LoRA | Not typically used | Standard for efficient adaptation |
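The "Loss computation" row can be made concrete with a small numeric sketch. It uses plain Python rather than a training framework, the helper names `build_labels` and `masked_nll` are hypothetical, and the token IDs and probabilities are invented for illustration. It also ignores the one-position shift between inputs and labels that causal-LM training applies.

```python
import math

IGNORE_INDEX = -100  # sentinel label value for positions excluded from the loss


def build_labels(prompt_ids: list[int], response_ids: list[int]) -> list[int]:
    """Mask prompt positions so that only response tokens are scored."""
    return [IGNORE_INDEX] * len(prompt_ids) + list(response_ids)


def masked_nll(token_probs: list[float], labels: list[int]) -> float:
    """Mean negative log-likelihood over positions whose label is not IGNORE_INDEX.

    token_probs[i] is the probability the model assigned to the correct token at position i.
    """
    kept = [-math.log(p) for p, label in zip(token_probs, labels) if label != IGNORE_INDEX]
    return sum(kept) / len(kept)


# Invented token IDs: three prompt tokens followed by two response tokens
prompt_ids = [11, 12, 13]
response_ids = [21, 22]
labels = build_labels(prompt_ids, response_ids)
print(labels)  # [-100, -100, -100, 21, 22]

# Invented probabilities: prompt tokens are hard to predict, response tokens less so
token_probs = [0.01, 0.02, 0.05, 0.5, 0.25]
all_token_loss = sum(-math.log(p) for p in token_probs) / len(token_probs)
response_only_loss = masked_nll(token_probs, labels)
print(all_token_loss > response_only_loss)  # True
```

In this toy case the unmasked loss is dominated by the hard-to-predict prompt tokens, while the masked loss reflects only how well the response was modeled.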