Training Data Formats for Fine-Tuning
Format training data correctly for instruction fine-tuning and chat fine-tuning. Understand prompt templates, chat templates, and how to structure JSONL datasets.
Why Format Matters
The format of your training data must match the format the model will see at inference time and, for a model that is already instruction-tuned, the template it was originally tuned with. A mismatch between training format and inference format degrades performance, even with an otherwise perfect dataset.
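As a minimal sketch of what "matching" means, assume a hypothetical two-part template. The names `PROMPT_PREFIX`, `build_training_text`, and `build_inference_prompt` are illustrative and do not come from any library. The point is that the inference prompt should be an exact prefix of the training text, so the model continues from the same position it saw during fine-tuning.

```python
# Hypothetical template, used only to illustrate the idea.
PROMPT_PREFIX = "### Instruction:\n{instruction}\n\n### Response:\n"


def build_training_text(instruction: str, output: str) -> str:
    """Full sequence seen during fine-tuning: prompt followed by the target."""
    return PROMPT_PREFIX.format(instruction=instruction) + output


def build_inference_prompt(instruction: str) -> str:
    """Prompt sent at inference: the same prefix, with the response left open."""
    return PROMPT_PREFIX.format(instruction=instruction)


train_text = build_training_text(
    "Define half-life.",
    "The time taken for a drug's plasma concentration to fall by half.",
)
prompt = build_inference_prompt("Define half-life.")

# The inference prompt must be an exact prefix of the training text.
print(train_text.startswith(prompt))
```

If the two functions drifted apart (for example, a different header or a missing newline), the check above would fail, which is the kind of mismatch this section warns about.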
Two main formats exist:
- Instruction format: {"instruction": "...", "input": "...", "output": "..."}
- Chat format: [{"role": "user", "content": "..."}, {"role": "assistant", "content": "..."}]
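The two layouts carry the same information, so an instruction-format record can be mapped onto a chat-format record. The sketch below is one possible mapping, not a standard one: the function name `alpaca_to_messages`, the optional `system_prompt` parameter, and the choice to append `input` to the user turn after a blank line are all assumptions made for illustration.

```python
from typing import Optional


def alpaca_to_messages(example: dict, system_prompt: Optional[str] = None) -> dict:
    """Convert one instruction-format record into a chat-format record."""
    user_content = example["instruction"]
    # Append the optional input field to the user turn when it is non-empty.
    if example.get("input"):
        user_content = f"{user_content}\n\n{example['input']}"

    messages = []
    if system_prompt:
        messages.append({"role": "system", "content": system_prompt})
    messages.append({"role": "user", "content": user_content})
    messages.append({"role": "assistant", "content": example["output"]})
    return {"messages": messages}


record = {
    "instruction": "Summarize the following text",
    "input": "Some text.",
    "output": "A summary.",
}
converted = alpaca_to_messages(record, system_prompt="You are a helpful assistant.")
for message in converted["messages"]:
    print(message["role"], "->", message["content"])
```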
Instruction Format (Alpaca-style)
The Alpaca format is common for instruction-following fine-tuning:
{"instruction": "Explain the mechanism of action of Metformin", "input": "", "output": "Metformin works primarily by inhibiting hepatic gluconeogenesis through activation of AMP-activated protein kinase (AMPK). This reduces glucose production in the liver, lowering fasting blood glucose levels."}
{"instruction": "What are the contraindications for Metformin?", "input": "", "output": "Metformin is contraindicated in: (1) eGFR below 30 mL/min/1.73m2 (risk of lactic acidosis), (2) acute or chronic metabolic acidosis, (3) patients undergoing iodinated contrast imaging, (4) hypersensitivity to metformin."}
{"instruction": "Summarize the following drug label section", "input": "WARNINGS: Lactic Acidosis: Postmarketing cases of metformin-associated lactic acidosis have resulted in death...", "output": "Metformin carries a risk of lactic acidosis, a rare but serious complication. Risk increases with renal impairment, hepatic impairment, and conditions causing hypoxia."}

Prompt template at training time:
ALPACA_TEMPLATE = """Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
{instruction}

### Input:
{input}

### Response:
{output}"""


def format_alpaca(example: dict) -> str:
    """Render one instruction-format record as a single training string."""
    if example.get("input"):
        return ALPACA_TEMPLATE.format(
            instruction=example["instruction"],
            input=example["input"],
            output=example["output"],
        )
    else:
        # No input: drop the "### Input:" section. This relies on the blank
        # lines in the template above matching the string being replaced.
        return ALPACA_TEMPLATE.replace("\n\n### Input:\n{input}", "").format(
            instruction=example["instruction"],
            output=example["output"],
        )

Chat Format (OpenAI / Hugging Face)
Modern instruction-tuned chat models expect conversations rendered through a chat template. Each training example is a conversation:
{"messages": [{"role": "system", "content": "You are a clinical pharmacology assistant. Provide accurate, evidence-based drug information."}, {"role": "user", "content": "What is the mechanism of action of warfarin?"}, {"role": "assistant", "content": "Warfarin inhibits vitamin K epoxide reductase (VKOR), preventing the recycling of vitamin K. This depletes active vitamin K, which is required for the synthesis of clotting factors II, VII, IX, and X. The result is anticoagulation with a delayed onset of 2-5 days."}]}
{"messages": [{"role": "system", "content": "You are a clinical pharmacology assistant."}, {"role": "user", "content": "Patient is on warfarin and starts ibuprofen. What interaction should I be aware of?"}, {"role": "assistant", "content": "Warfarin + ibuprofen is a major interaction. NSAIDs like ibuprofen inhibit platelet aggregation and can cause gastric mucosal damage, increasing bleeding risk. Ibuprofen may also displace warfarin from protein binding sites, elevating free warfarin levels and increasing anticoagulation. Recommend using acetaminophen as the analgesic if possible, or intensifying INR monitoring if ibuprofen is necessary."}]}

Applying chat templates:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")

messages = [
    {"role": "system", "content": "You are a clinical pharmacology assistant."},
    {"role": "user", "content": "What is warfarin's mechanism of action?"},
    {"role": "assistant", "content": "Warfarin inhibits VKOR..."},
]

# Apply the model's native chat template
formatted = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=False,  # False for training (includes assistant response)
)
print(formatted)
# Output begins with: <|begin_of_text|><|start_header_id|>system<|end_header_id|> ...

Always use the model's native chat template. Different models (Llama, Mistral, Gemma) have different special tokens and formats.
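To make the difference between templates concrete, the sketch below renders the same conversation in two layouts using plain string formatting. Both functions are hand-written simplifications with hypothetical names (`render_chatml`, `render_llama3_like`); real templates include details these omit, so they are meant only to show why text formatted for one model family is wrong for another. In practice, `tokenizer.apply_chat_template` remains the right tool.

```python
def render_chatml(messages: list[dict]) -> str:
    """ChatML-style layout: each turn is wrapped in <|im_start|> ... <|im_end|>."""
    return "".join(
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages
    )


def render_llama3_like(messages: list[dict]) -> str:
    """Simplified Llama-3-style layout using header and end-of-turn tokens."""
    body = "".join(
        f"<|start_header_id|>{m['role']}<|end_header_id|>\n\n{m['content']}<|eot_id|>"
        for m in messages
    )
    return "<|begin_of_text|>" + body


demo = [
    {"role": "user", "content": "Hello"},
    {"role": "assistant", "content": "Hi there"},
]

print(render_chatml(demo))
print(render_llama3_like(demo))
```

The content of the conversation is identical in both outputs; only the special tokens and spacing differ, and those are exactly what the model learned to key on.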
Building a Dataset for TRL SFTTrainer
from datasets import Dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
tokenizer.pad_token = tokenizer.eos_token

# Raw training data
raw_data = [
    {
        "messages": [
            {"role": "system", "content": "You are a clinical pharmacology expert."},
            {"role": "user", "content": "Explain the drug interaction between warfarin and aspirin."},
            {"role": "assistant", "content": "Warfarin and aspirin have two overlapping mechanisms that increase bleeding risk. Aspirin inhibits COX-1, reducing thromboxane A2-mediated platelet aggregation, while warfarin inhibits vitamin K-dependent clotting factors. Together, they significantly increase bleeding risk, particularly GI bleeding. Aspirin also displaces warfarin from albumin binding sites, potentially elevating free warfarin levels."},
        ]
    },
    # ... more examples
]

dataset = Dataset.from_list(raw_data)


def format_for_training(example):
    """Apply chat template and prepare for training."""
    formatted = tokenizer.apply_chat_template(
        example["messages"],
        tokenize=False,
        add_generation_prompt=False,
    )
    return {"text": formatted}


dataset = dataset.map(format_for_training)
print(dataset[0]["text"][:300])

Training Only on Responses (Response Masking)
During training, you usually want to compute the loss only on the assistant's response tokens, not on the prompt tokens. This is called response masking:

# Note: this API is version-dependent. DataCollatorForCompletionOnlyLM and the
# SFTTrainer keyword arguments below come from older TRL releases; check the
# TRL documentation for the version you have installed.
from trl import DataCollatorForCompletionOnlyLM

# For Llama-3 Instruct format
response_template = "<|start_header_id|>assistant<|end_header_id|>"
collator = DataCollatorForCompletionOnlyLM(
    response_template=response_template,
    tokenizer=tokenizer,
)

# Use in SFTTrainer
from trl import SFTTrainer

trainer = SFTTrainer(
    model=peft_model,  # assumed to be a PEFT-wrapped model created elsewhere
    train_dataset=dataset,
    dataset_text_field="text",
    data_collator=collator,  # Masks prompt tokens from loss
    max_seq_length=2048,
    # ...
)

Without response masking, the loss is computed on both the prompt and the response, so the model also learns to reproduce prompt text. This can cause it to overfit to prompt patterns and generate poorer responses.
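The idea behind response masking can be sketched without any library: labels for prompt tokens are set to -100, the value that PyTorch's cross-entropy loss ignores by default, so only response tokens contribute to the loss. The function below is a simplified illustration with a hypothetical name (`mask_prompt_tokens`) and toy token ids; it is not TRL's implementation, and falling back to masking the whole example when the template is missing is an assumption of this sketch.

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss by default


def mask_prompt_tokens(input_ids: list[int], response_template_ids: list[int]) -> list[int]:
    """Return labels where everything up to and including the response template is ignored."""
    n = len(response_template_ids)
    # Find the last occurrence of the response template in the sequence.
    start = -1
    for i in range(len(input_ids) - n + 1):
        if input_ids[i : i + n] == response_template_ids:
            start = i
    if start == -1:
        # Template not found: ignore the whole example rather than train on the prompt.
        return [IGNORE_INDEX] * len(input_ids)
    response_start = start + n
    return [IGNORE_INDEX] * response_start + input_ids[response_start:]


# Toy token ids: 1-3 are the prompt, [90, 91] stands in for the assistant header,
# and 7-9 are the response tokens.
input_ids = [1, 2, 3, 90, 91, 7, 8, 9]
labels = mask_prompt_tokens(input_ids, [90, 91])
print(labels)  # [-100, -100, -100, -100, -100, 7, 8, 9]
```

This also shows why the response template must tokenize exactly as it appears inside the formatted text: if the ids do not match, nothing is found and the example contributes no training signal.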
JSONL File Format
Training datasets are typically stored as JSONL (one JSON object per line):
import json


def save_jsonl(data: list[dict], path: str):
    with open(path, "w", encoding="utf-8") as f:
        for item in data:
            f.write(json.dumps(item, ensure_ascii=False) + "\n")


def load_jsonl(path: str) -> list[dict]:
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Save
save_jsonl(raw_data, "drug_interactions_train.jsonl")

# Load into HuggingFace Dataset
from datasets import load_dataset

dataset = load_dataset("json", data_files={"train": "drug_interactions_train.jsonl"})

Data Validation Before Training
Always validate format before starting a training run:
def validate_chat_dataset(dataset, tokenizer, max_seq_length=2048):
    issues = []
    for i, example in enumerate(dataset):
        messages = example.get("messages", [])

        # Check structure
        if not messages:
            issues.append(f"Example {i}: empty messages")
            continue

        # Check roles
        roles = [m["role"] for m in messages]
        if roles[-1] != "assistant":
            issues.append(f"Example {i}: last message is not from assistant")

        # Check token length
        formatted = tokenizer.apply_chat_template(messages, tokenize=True)
        if len(formatted) > max_seq_length:
            issues.append(f"Example {i}: {len(formatted)} tokens exceeds limit of {max_seq_length}")

    return issues


issues = validate_chat_dataset(raw_data, tokenizer)
print(f"Found {len(issues)} issues")
for issue in issues[:10]:
    print(f"  {issue}")

Fix all validation errors before training: a malformed dataset will either crash training or silently degrade the model.
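Some structural problems can be caught before a tokenizer is even loaded. The sketch below is one possible extra check, under stated assumptions: the function name `check_message_structure` is hypothetical, only the roles `system`, `user`, and `assistant` are allowed, a system message may appear only first, and the remaining turns alternate strictly between user and assistant. Datasets with tool calls or other roles would need different rules.

```python
ALLOWED_ROLES = {"system", "user", "assistant"}


def check_message_structure(messages: list[dict]) -> list[str]:
    """Return a list of structural problems found in one conversation."""
    problems = []
    for j, message in enumerate(messages):
        role = message.get("role")
        content = message.get("content")
        if role not in ALLOWED_ROLES:
            problems.append(f"message {j}: unknown role {role!r}")
        if not isinstance(content, str) or not content.strip():
            problems.append(f"message {j}: empty or non-string content")
        if role == "system" and j != 0:
            problems.append(f"message {j}: system message is not first")

    # Ignoring system messages, turns should alternate user/assistant.
    turns = [m.get("role") for m in messages if m.get("role") != "system"]
    for j, role in enumerate(turns):
        expected = "user" if j % 2 == 0 else "assistant"
        if role != expected:
            problems.append(f"turn {j}: expected {expected}, got {role!r}")
            break  # later turns are likely shifted, so report only the first
    return problems


good = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello"},
    {"role": "assistant", "content": "Hi there"},
]
bad = [
    {"role": "assistant", "content": "Hi there"},
    {"role": "user", "content": "   "},
]

print(check_message_structure(good))
print(check_message_structure(bad))
```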