Data Quality for Fine-Tuning

Why Data Quality Outweighs Quantity

The most common fine-tuning mistake is choosing a large, low-quality dataset over a small, high-quality one. LLMs absorb patterns from their training data, so bad data teaches bad patterns.

The rule: 1,000 carefully curated examples almost always outperform 100,000 scraped examples.

A high-quality training example for an LLM:

Has a clear, well-formed prompt
Has a response that is factually correct
Has a response that demonstrates the target behavior (tone, format, depth)
Does not contradict other examples in the dataset

Dimensions of Data Quality

1. Factual Accuracy

For domain-specific fine-tuning, incorrect facts in the training data are the most damaging defect: the model learns the wrong information and then states it confidently.

Python

# Bad example — wrong drug information
{
    "messages": [
        {"role": "user", "content": "What is warfarin's mechanism?"},
        {"role": "assistant", "content": "Warfarin works by inhibiting COX-2 enzymes."}  # WRONG
    ]
}

# Good example
{
    "messages": [
        {"role": "user", "content": "What is warfarin's mechanism?"},
        {"role": "assistant", "content": "Warfarin inhibits vitamin K epoxide reductase (VKOR), preventing the recycling of vitamin K, which is required for the synthesis of clotting factors II, VII, IX, and X."}
    ]
}

Fix: Have domain experts review a random sample of 100–200 examples. Calculate the error rate. If it exceeds 5%, improve the data generation process before scaling up.
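
A sample of 100–200 examples gives a noisy estimate, so it can help to put a confidence interval around the measured error rate before comparing it with the 5% threshold. The sketch below is one way to do that, using only the standard library. The function names, the default sample size of 150, and the choice of a Wilson score interval are illustrative assumptions, not part of the original workflow.

```python
import math
import random

def sample_for_review(data, sample_size=150, seed=0):
    """Draw a reproducible random sample of examples for expert review."""
    rng = random.Random(seed)
    k = min(sample_size, len(data))
    return rng.sample(data, k)

def wilson_interval(errors, n, z=1.96):
    """Wilson score interval for an error proportion (z=1.96 is roughly 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = errors / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return max(0.0, center - half), min(1.0, center + half)

# Hypothetical review result: experts flagged 9 of 150 sampled examples
low, high = wilson_interval(errors=9, n=150)
print(f"Observed error rate: {9 / 150:.1%}, interval: {low:.1%} to {high:.1%}")
```

If the 5% threshold falls inside the interval, the sample is too small to decide either way, and reviewing more examples is usually cheaper than scaling up a flawed generation process.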

2. Response Consistency

Contradictory examples in the dataset confuse the model:

Python

# Example 1: says X
{"user": "Is metformin safe in pregnancy?", "assistant": "Metformin is generally considered safe in pregnancy and is used for gestational diabetes."}

# Example 2: says opposite of X (contradiction!)
{"user": "Can metformin be used in pregnant patients?", "assistant": "Metformin should be avoided in pregnancy as it crosses the placenta."}

Fix: Cluster semantically similar questions and audit for contradictions. Use embedding similarity to find near-duplicate prompts with conflicting answers.
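
The audit described above can be sketched as a pairwise scan over precomputed embeddings: flag pairs whose prompts are very similar but whose answers are not. This is a minimal sketch under stated assumptions. The function names and both thresholds are hypothetical, and the vectors are expected to come from whatever embedding model you already use. Low answer similarity does not prove a contradiction (and two contradictory answers can embed close together), so the output is a list of candidates for human review, not a verdict.

```python
import math
from itertools import combinations

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def conflict_candidates(prompt_vecs, answer_vecs, prompt_min=0.9, answer_max=0.6):
    """Return (i, j) pairs with similar prompts but dissimilar answers.

    prompt_vecs[i] and answer_vecs[i] are the embeddings of example i.
    Both thresholds are assumptions and should be tuned on your own data.
    """
    pairs = []
    for i, j in combinations(range(len(prompt_vecs)), 2):
        if (cosine(prompt_vecs[i], prompt_vecs[j]) >= prompt_min
                and cosine(answer_vecs[i], answer_vecs[j]) <= answer_max):
            pairs.append((i, j))
    return pairs

# Toy 2-D vectors standing in for real embeddings
prompts = [[1.0, 0.0], [0.99, 0.1], [0.0, 1.0]]
answers = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
print(conflict_candidates(prompts, answers))
```

The scan is quadratic in the number of examples, so for large datasets it is worth clustering prompts first and comparing only within clusters.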

3. Response Quality

Responses should demonstrate the target behavior — not just be technically correct:

| Quality | Example |
|---|---|
| Poor (vague) | "Warfarin has many interactions." |
| Moderate (correct) | "Warfarin interacts with NSAIDs, increasing bleeding risk." |
| High (complete) | "Warfarin + NSAIDs is a major interaction. NSAIDs inhibit platelet aggregation (via COX-1) and can cause GI mucosal damage. Combined with warfarin's anticoagulant effect, this significantly elevates GI bleeding risk. Recommend acetaminophen as an alternative analgesic, or intensify INR monitoring if NSAID use is unavoidable." |

4. Format Consistency

Every response should follow the same format conventions — markdown structure, response length, use of bullet points vs prose — if that's the behavior you want the model to learn.
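
One way to check format consistency is to profile a few surface features across all assistant responses and look for mixed conventions. The sketch below is illustrative only: the function name, the regular expressions for bullets and headings, and the choice of features are assumptions to adapt to your own target format.

```python
import re
from statistics import mean, pstdev

# Lines that start with a bullet marker or a numbered-list marker
BULLET = re.compile(r"^\s*(?:[-*•]|\d+\.)\s+", re.MULTILINE)
# Lines that start with a markdown heading
HEADING = re.compile(r"^#{1,6}\s+", re.MULTILINE)

def format_profile(responses):
    """Summarize simple format features across a list of response strings."""
    n = len(responses)
    if n == 0:
        return {}
    lengths = [len(r.split()) for r in responses]
    return {
        "bullet_share": sum(bool(BULLET.search(r)) for r in responses) / n,
        "heading_share": sum(bool(HEADING.search(r)) for r in responses) / n,
        "mean_words": mean(lengths),
        "stdev_words": pstdev(lengths),
    }

profile = format_profile([
    "- a\n- b",
    "plain prose here",
    "1. first\n2. second",
    "## Title\ntext",
])
print(profile)
```

A share close to 0 or 1 suggests a consistent convention. A share near the middle, or a word-count spread that is large relative to the mean, suggests the dataset mixes formats the model will then reproduce unpredictably.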

Data Quality Pipeline

Python

import json
from typing import NamedTuple

class QualityScore(NamedTuple):
    example_id: int
    length_score: float    # 0-1, penalize too short/long
    has_structure: bool    # Has expected format markers
    passes_filters: bool   # Didn't hit any keyword filters

def score_example(example: dict, min_length=50, max_length=800) -> QualityScore:
    messages = example.get("messages", [])
    assistant_messages = [m for m in messages if m["role"] == "assistant"]
    if not assistant_messages:
        return QualityScore(0, 0.0, False, False)

    response = assistant_messages[-1]["content"]
    length = len(response.split())

    # Length scoring
    if length < min_length:
        length_score = length / min_length
    elif length > max_length:
        length_score = max_length / length
    else:
        length_score = 1.0

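    # Structural markers (for medical Q&A). This is a crude heuristic: ":" and
    # "-" occur in most prose, so tighten the check to match your target format.
    has_structure = any(marker in response for marker in [":", "-", "1.", "•"])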

    # Filter out clearly bad examples
    bad_phrases = ["I don't know", "I cannot provide", "As an AI"]
    passes_filters = not any(phrase.lower() in response.lower() for phrase in bad_phrases)

    return QualityScore(
        example_id=id(example),
        length_score=length_score,
        has_structure=has_structure,
        passes_filters=passes_filters,
    )

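def filter_dataset(data: list[dict], min_quality=0.7) -> list[dict]:
    filtered = []
    rejected = []

    for example in data:
        score = score_example(example)
        composite = (score.length_score * 0.4 +
                     float(score.has_structure) * 0.3 +
                     float(score.passes_filters) * 0.3)

        # Treat the keyword filters as a hard gate. Without this, a response
        # with full length and structure scores reaches 0.7 and passes the
        # default threshold even though it hit a filter.
        if score.passes_filters and composite >= min_quality:
            filtered.append(example)
        else:
            rejected.append((example, composite))

    # Guard against division by zero on an empty dataset
    kept_pct = 100 * len(filtered) // len(data) if data else 0
    print(f"Kept: {len(filtered)} / {len(data)} ({kept_pct}%)")
    print(f"Rejected: {len(rejected)}")
    return filtered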

Deduplication

Near-duplicate examples waste training budget and can cause overfitting to duplicated content:

Python

from sentence_transformers import SentenceTransformer
import numpy as np

def deduplicate_dataset(data: list[dict], threshold=0.95) -> list[dict]:
    """Remove near-duplicate examples using embedding similarity."""
    model = SentenceTransformer("all-MiniLM-L6-v2")

    # Extract user prompts for comparison
    prompts = []
    for example in data:
        user_msgs = [m["content"] for m in example["messages"] if m["role"] == "user"]
        prompts.append(user_msgs[0] if user_msgs else "")

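    # Unit-length embeddings make the dot product equal to cosine similarity
    embeddings = model.encode(
        prompts, batch_size=64, show_progress_bar=True, normalize_embeddings=True
    )

    # Greedy scan: keep the first example of each near-duplicate group.
    # This is O(n^2) in comparisons, which is fine for a few thousand examples.
    to_remove = set()
    for i in range(len(embeddings)):
        if i in to_remove:
            continue
        # Similarity of example i to every later example, in one vectorized call
        similarities = embeddings[i + 1:] @ embeddings[i]
        for offset in np.flatnonzero(similarities > threshold):
            to_remove.add(i + 1 + int(offset))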

    deduplicated = [ex for i, ex in enumerate(data) if i not in to_remove]
    print(f"Removed {len(to_remove)} duplicates. Kept: {len(deduplicated)}")
    return deduplicated
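
Before running embedding-based deduplication, a cheap first pass can drop examples that are identical after light text normalization. The sketch below uses hashing for this, which is a simpler technique than the embedding approach above and only catches exact matches after normalization. The function names and the normalization rules (lowercasing, stripping punctuation, collapsing whitespace) are assumptions, not part of the original pipeline.

```python
import hashlib
import re

def normalize(text):
    """Lowercase, drop punctuation, and collapse whitespace."""
    text = re.sub(r"[^\w\s]", "", text.lower())
    return re.sub(r"\s+", " ", text).strip()

def drop_exact_duplicates(data):
    """Keep the first example for each normalized conversation."""
    seen = set()
    kept = []
    for example in data:
        messages = example.get("messages", [])
        # Build one key from every message so that examples with the same
        # prompt but different answers are NOT treated as duplicates.
        key_text = "\x1f".join(
            f"{m.get('role', '')}:{normalize(m.get('content', ''))}"
            for m in messages
        )
        digest = hashlib.sha256(key_text.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(example)
    return kept
```

Because the key covers both prompt and response, this pass leaves same-prompt, different-answer pairs in place, so they remain available for a contradiction audit.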

Quality vs Quantity Trade-off

A rough guide to dataset size by training goal, and to what low-quality data costs you in each case:

| Training goal | Min examples (high quality) | Effect of low quality |
|---|---|---|
| Style / format adaptation | 200–500 | Minor format inconsistency |
| Domain vocabulary / tone | 500–2,000 | Poor domain coverage |
| New task learning | 1,000–5,000 | Task failure |
| Factual knowledge injection | 5,000–50,000 | Hallucination |
| Full behavioral alignment | 50,000+ | Misaligned behavior |

Note: these are minimums for high-quality data. Increase by 5–10x if using auto-generated data without expert review.

Labeling Protocol for Human-Curated Data

When having humans write or verify training examples:

Annotator guidelines — Write a detailed rubric: what makes a response good, what to avoid, expected length and format. Include 20+ annotated examples.
Inter-annotator agreement — Have 10% of examples labeled by two annotators. Calculate Cohen's kappa. Below 0.6 means your guidelines are ambiguous.
Calibration sessions — Regular group review of borderline cases keeps annotators aligned.
Specialist review — For domain-specific content (medical, legal), have subject matter experts review a random 10% sample.
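
For the inter-annotator agreement step, Cohen's kappa compares the observed agreement between two annotators with the agreement expected by chance from each annotator's label frequencies. The sketch below is a minimal standard-library version for two annotators and categorical labels. The function name and the example labels are illustrative.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and the same length")
    n = len(labels_a)
    # Observed agreement: share of items where both annotators agree
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's per-label frequencies
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        # Both annotators used a single identical label for every item
        return 1.0
    return (observed - expected) / (1 - expected)

# Hypothetical double-labeled subset
annotator_1 = ["good", "good", "bad", "bad"]
annotator_2 = ["good", "bad", "bad", "bad"]
print(cohens_kappa(annotator_1, annotator_2))  # 0.5
```

Here the annotators agree on 3 of 4 items (0.75), chance agreement is 0.5, and kappa is (0.75 - 0.5) / (1 - 0.5) = 0.5, which would fall below the 0.6 threshold mentioned above.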

The labeling protocol is as important as the data itself. Bad guidelines produce inconsistent data regardless of annotator effort.
