AI Systems · Intermediate

Training Data Formats for Fine-Tuning

Format training data correctly for instruction fine-tuning and chat fine-tuning. Understand prompt templates, chat templates, and how to structure JSONL datasets.

Asma Hafeez Khan · May 16, 2026 · 5 min read
Tags: Fine-Tuning, Data Preparation, Instruction Tuning, Python

Why Format Matters

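The format of your training data must match the prompt format the model will see at inference time and, when you start from an instruction-tuned checkpoint, the template that checkpoint was already trained with. A mismatch between training format and inference format degrades performance, even with a perfect dataset.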

Two main formats exist:

  • Instruction format: {"instruction": "...", "input": "...", "output": "..."}
  • Chat format: [{"role": "user", "content": "..."}, {"role": "assistant", "content": "..."}]
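
The two layouts carry the same information, so an instruction record can be mapped onto a chat record mechanically. A minimal sketch, assuming that a non-empty `input` is appended to the user turn after a blank line; that joining rule and the name `instruction_to_chat` are illustrative choices, not a standard:

```python
from typing import Optional


def instruction_to_chat(example: dict, system_prompt: Optional[str] = None) -> dict:
    """Convert an Alpaca-style record into a chat-style record."""
    user_content = example["instruction"]
    # Assumption: extra context goes into the same user turn, after a blank line.
    if example.get("input"):
        user_content = f"{user_content}\n\n{example['input']}"

    messages = []
    if system_prompt:
        messages.append({"role": "system", "content": system_prompt})
    messages.append({"role": "user", "content": user_content})
    messages.append({"role": "assistant", "content": example["output"]})
    return {"messages": messages}
```

Going the other way is lossy: a multi-turn conversation has no natural single `instruction`/`output` pair.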

Instruction Format (Alpaca-style)

The Alpaca format is common for instruction-following fine-tuning:

JSONL
{"instruction": "Explain the mechanism of action of Metformin", "input": "", "output": "Metformin works primarily by inhibiting hepatic gluconeogenesis through activation of AMP-activated protein kinase (AMPK). This reduces glucose production in the liver, lowering fasting blood glucose levels."}
{"instruction": "What are the contraindications for Metformin?", "input": "", "output": "Metformin is contraindicated in: (1) eGFR below 30 mL/min/1.73m2 (risk of lactic acidosis), (2) acute or chronic metabolic acidosis, (3) patients undergoing iodinated contrast imaging, (4) hypersensitivity to metformin."}
{"instruction": "Summarize the following drug label section", "input": "WARNINGS: Lactic Acidosis: Postmarketing cases of metformin-associated lactic acidosis have resulted in death...", "output": "Metformin carries a risk of lactic acidosis, a rare but serious complication. Risk increases with renal impairment, hepatic impairment, and conditions causing hypoxia."}

Prompt template at training time:

Python
ALPACA_TEMPLATE = """Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
{instruction}

### Input:
{input}

### Response:
{output}"""

def format_alpaca(example: dict) -> str:
    if example.get("input"):
        return ALPACA_TEMPLATE.format(
            instruction=example["instruction"],
            input=example["input"],
            output=example["output"],
        )
    else:
        return ALPACA_TEMPLATE.replace("\n\n### Input:\n{input}", "").format(
            instruction=example["instruction"],
            output=example["output"],
        )
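
At inference time the prompt should be the same template cut off right after the response header, so the model continues from the point where the training examples placed the answer. A sketch under that assumption; `build_alpaca_prompt` is a hypothetical helper, and the header text is copied from the training template:

```python
PROMPT_HEADER = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request."
)


def build_alpaca_prompt(instruction: str, input_text: str = "") -> str:
    """Build the inference prompt: the training template minus the output."""
    parts = [PROMPT_HEADER, f"### Instruction:\n{instruction}"]
    if input_text:
        parts.append(f"### Input:\n{input_text}")
    # End on the response header so generation starts where the answer
    # started during training.
    parts.append("### Response:\n")
    return "\n\n".join(parts)


print(build_alpaca_prompt("What are the contraindications for Metformin?"))
```

Building both the training text and the inference prompt from one shared definition is the simplest way to keep the two from drifting apart.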

Chat Format (OpenAI / Hugging Face)

Modern instruction-tuned chat models expect a chat template. Each training example is a conversation stored as a list of role-tagged messages:

JSONL
{"messages": [{"role": "system", "content": "You are a clinical pharmacology assistant. Provide accurate, evidence-based drug information."}, {"role": "user", "content": "What is the mechanism of action of warfarin?"}, {"role": "assistant", "content": "Warfarin inhibits vitamin K epoxide reductase (VKOR), preventing the recycling of vitamin K. This depletes active vitamin K, which is required for the synthesis of clotting factors II, VII, IX, and X. The result is anticoagulation with a delayed onset of 2-5 days."}]}
{"messages": [{"role": "system", "content": "You are a clinical pharmacology assistant."}, {"role": "user", "content": "Patient is on warfarin and starts ibuprofen. What interaction should I be aware of?"}, {"role": "assistant", "content": "Warfarin + ibuprofen is a major interaction. NSAIDs like ibuprofen inhibit platelet aggregation and can cause gastric mucosal damage, increasing bleeding risk. Ibuprofen may also displace warfarin from protein binding sites, elevating free warfarin levels and increasing anticoagulation. Recommend using acetaminophen as the analgesic if possible, or intensifying INR monitoring if ibuprofen is necessary."}]}

Applying chat templates:

Python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")

messages = [
    {"role": "system", "content": "You are a clinical pharmacology assistant."},
    {"role": "user", "content": "What is warfarin's mechanism of action?"},
    {"role": "assistant", "content": "Warfarin inhibits VKOR..."},
]

# Apply the model's native chat template
formatted = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=False,  # False for training (includes assistant response)
)
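print(formatted)
# The output begins with the template's special tokens, for example:
# <|begin_of_text|><|start_header_id|>system<|end_header_id|> ... You are a clinical...
# (the Llama 3.1 template also inserts date lines into the system block, so the exact text varies)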

Always use the model's native chat template. Different models (Llama, Mistral, Gemma) have different special tokens and formats.
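
To see why a template cannot be swapped between models, the sketch below renders one conversation in two layouts by hand: a ChatML-style layout and a Llama-3-style layout. Both functions are simplified illustrations written for this comparison (real templates add details such as a beginning-of-text token or default system text), so use `tokenizer.apply_chat_template` for actual training data.

```python
def render_chatml(messages: list) -> str:
    """Simplified ChatML-style layout: each turn is wrapped in im_start / im_end markers."""
    return "".join(
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages
    )


def render_llama3(messages: list) -> str:
    """Simplified Llama-3-style layout: a role header, the content, then an end-of-turn token."""
    return "".join(
        f"<|start_header_id|>{m['role']}<|end_header_id|>\n\n{m['content']}<|eot_id|>"
        for m in messages
    )


messages = [
    {"role": "user", "content": "What is warfarin's mechanism of action?"},
    {"role": "assistant", "content": "Warfarin inhibits VKOR..."},
]

chatml_text = render_chatml(messages)
llama3_text = render_llama3(messages)
print(chatml_text)
print(llama3_text)
```

The content is identical, but the markers around each turn share nothing, so a model tuned on one layout sees unfamiliar tokens when prompted with the other.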


Building a Dataset for TRL SFTTrainer

Python
from datasets import Dataset
from transformers import AutoTokenizer

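tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
# This tokenizer defines no pad token. Reusing EOS is a common shortcut, but
# collators that mask padding in the labels will then also mask the real
# end-of-turn token, so a dedicated pad token is safer if one is available.
tokenizer.pad_token = tokenizer.eos_token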

# Raw training data
raw_data = [
    {
        "messages": [
            {"role": "system", "content": "You are a clinical pharmacology expert."},
            {"role": "user", "content": "Explain the drug interaction between warfarin and aspirin."},
            {"role": "assistant", "content": "Warfarin and aspirin have two overlapping mechanisms that increase bleeding risk. Aspirin inhibits COX-1, reducing thromboxane A2-mediated platelet aggregation, while warfarin inhibits vitamin K-dependent clotting factors. Together, they significantly increase bleeding risk, particularly GI bleeding. Aspirin also displaces warfarin from albumin binding sites, potentially elevating free warfarin levels."},
        ]
    },
    # ... more examples
]

dataset = Dataset.from_list(raw_data)

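def format_for_training(example):
    """Apply chat template and prepare for training."""
    # The Llama 3 template already prepends <|begin_of_text|>. If the trainer later
    # tokenizes this text with special tokens enabled, that token can appear twice.
    formatted = tokenizer.apply_chat_template(
        example["messages"],
        tokenize=False,
        add_generation_prompt=False,
    )
    return {"text": formatted}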

dataset = dataset.map(format_for_training)
print(dataset[0]["text"][:300])

Training Only on Responses (Response Masking)

During training, you usually want to compute the loss only on the assistant's response tokens, not on the prompt. This is called response masking:

Python
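from trl import DataCollatorForCompletionOnlyLM, SFTTrainer

# Note: this follows the API of older TRL releases. Newer releases move
# dataset_text_field and the sequence-length limit into SFTConfig and may no
# longer ship DataCollatorForCompletionOnlyLM, so check the docs for your version.

# For Llama-3 Instruct format: everything up to and including this header is masked
response_template = "<|start_header_id|>assistant<|end_header_id|>"

collator = DataCollatorForCompletionOnlyLM(
    response_template=response_template,
    tokenizer=tokenizer,
)

# peft_model is assumed to be a PEFT-wrapped model created earlier
trainer = SFTTrainer(
    model=peft_model,
    train_dataset=dataset,
    dataset_text_field="text",
    data_collator=collator,  # Masks prompt tokens from loss
    max_seq_length=2048,
    # ...
)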

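Without response masking, the loss is computed over both prompt and response tokens, so part of the training signal goes into reproducing prompts (including any fixed system prompt) rather than into producing good answers. This can cause the model to overfit to prompt patterns and generate poorer responses.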
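
Conceptually, masking means locating the response template in the token sequence and setting the label of every token up to that point to -100, the index that PyTorch's cross-entropy loss ignores by default. A minimal single-turn sketch using plain lists and made-up token IDs; `mask_prompt_tokens` is a hypothetical helper, not the TRL implementation:

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def mask_prompt_tokens(input_ids: list, response_template_ids: list) -> list:
    """Return labels where everything up to and including the response template is ignored."""
    n = len(response_template_ids)
    for start in range(len(input_ids) - n + 1):
        if input_ids[start:start + n] == response_template_ids:
            response_start = start + n
            return [IGNORE_INDEX] * response_start + input_ids[response_start:]
    # Template not found: ignore the whole example rather than train on the prompt.
    return [IGNORE_INDEX] * len(input_ids)


# Toy IDs: 1-3 are prompt tokens, [90, 91] stands in for the response template,
# 7-8 are response tokens and 99 is an end-of-turn token.
labels = mask_prompt_tokens([1, 2, 3, 90, 91, 7, 8, 99], [90, 91])
print(labels)  # [-100, -100, -100, -100, -100, 7, 8, 99]
```

The end-of-turn token stays unmasked on purpose: the model has to learn when to stop.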


JSONL File Format

Training datasets are typically stored as JSONL (one JSON object per line):

Python
import json

def save_jsonl(data: list[dict], path: str):
    with open(path, "w", encoding="utf-8") as f:
        for item in data:
            f.write(json.dumps(item, ensure_ascii=False) + "\n")

def load_jsonl(path: str) -> list[dict]:
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

# Save
save_jsonl(raw_data, "drug_interactions_train.jsonl")

# Load into a Hugging Face Dataset (split="train" returns a Dataset rather than a DatasetDict)
from datasets import load_dataset
dataset = load_dataset("json", data_files={"train": "drug_interactions_train.jsonl"}, split="train")
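
Because each line must parse on its own, a single stray newline or trailing comma breaks the file at a specific line. A small sketch that reports which lines fail to parse or are not JSON objects; `find_bad_jsonl_lines` is a hypothetical helper name:

```python
import json


def find_bad_jsonl_lines(path: str) -> list:
    """Return (line_number, reason) for every non-blank line that is not a JSON object."""
    problems = []
    with open(path, encoding="utf-8") as f:
        for line_number, line in enumerate(f, start=1):
            if not line.strip():
                continue  # blank lines are skipped, as in load_jsonl
            try:
                record = json.loads(line)
            except json.JSONDecodeError as exc:
                problems.append((line_number, f"invalid JSON: {exc.msg}"))
                continue
            if not isinstance(record, dict):
                problems.append((line_number, "not a JSON object"))
    return problems
```

Running this before `load_dataset` gives a line number to jump to instead of a parser error from deep inside the loader.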

Data Validation Before Training

Always validate format before starting a training run:

Python
def validate_chat_dataset(dataset, tokenizer, max_seq_length=2048):
    issues = []
    for i, example in enumerate(dataset):
        messages = example.get("messages", [])

        # Check structure
        if not messages:
            issues.append(f"Example {i}: empty messages")
            continue

        # Check roles (.get so a message without a "role" key is reported, not raised)
        roles = [m.get("role") for m in messages]
        if roles[-1] != "assistant":
            issues.append(f"Example {i}: last message is not from assistant")

        # Check token length
        formatted = tokenizer.apply_chat_template(messages, tokenize=True)
        if len(formatted) > max_seq_length:
            issues.append(f"Example {i}: {len(formatted)} tokens exceeds limit of {max_seq_length}")

    return issues

issues = validate_chat_dataset(raw_data, tokenizer)
print(f"Found {len(issues)} issues")
for issue in issues[:10]:
    print(f"  {issue}")

Fix all validation errors before training: a malformed dataset will either crash the run or silently degrade the model.
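
The validation function covers the last role and the token count. Role order and empty content are also worth catching. The sketch below assumes an optional system message in first position followed by strictly alternating user and assistant turns, which is a common convention but not something every chat template requires; `check_turn_structure` is a hypothetical helper.

```python
VALID_ROLES = {"system", "user", "assistant"}


def check_turn_structure(messages: list) -> list:
    """Return a list of structural problems found in one conversation."""
    problems = []
    for i, m in enumerate(messages):
        role = m.get("role")
        if role not in VALID_ROLES:
            problems.append(f"message {i}: unknown role {role!r}")
        content = m.get("content")
        if not isinstance(content, str) or not content.strip():
            problems.append(f"message {i}: empty or missing content")
        if role == "system" and i != 0:
            problems.append(f"message {i}: system message is not first")

    # With system messages set aside, turns should go user, assistant, user, ...
    turns = [m.get("role") for m in messages if m.get("role") != "system"]
    for i, role in enumerate(turns):
        expected = "user" if i % 2 == 0 else "assistant"
        if role != expected:
            problems.append(f"turn {i}: expected {expected}, got {role!r}")
            break  # one report per conversation is enough
    return problems
```

It can be called per example inside the validation loop, appending each returned string to `issues` with the example index.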
