Data Quality for Fine-Tuning
What makes fine-tuning data high quality, and how to audit, clean, and score training examples to maximize model improvement per training example.
Why Data Quality Outweighs Quantity
A common fine-tuning mistake is using a large low-quality dataset instead of a small high-quality one. LLMs absorb patterns from their training data, so bad data teaches bad patterns.
The rule of thumb: 1,000 carefully curated examples usually outperform 100,000 scraped examples.
A high-quality training example for an LLM:
- Has a clear, well-formed prompt
- Has a response that is factually correct
- Has a response that demonstrates the target behavior (tone, format, depth)
- Does not contradict other examples in the dataset
Dimensions of Data Quality
1. Factual Accuracy
For domain-specific fine-tuning, incorrect facts in the training data are the most damaging defect: the model learns wrong information and states it confidently.
```python
# Bad example: wrong drug information
{
    "messages": [
        {"role": "user", "content": "What is warfarin's mechanism?"},
        {"role": "assistant", "content": "Warfarin works by inhibiting COX-2 enzymes."}  # WRONG
    ]
}

# Good example
{
    "messages": [
        {"role": "user", "content": "What is warfarin's mechanism?"},
        {"role": "assistant", "content": "Warfarin inhibits vitamin K epoxide reductase (VKOR), preventing the recycling of vitamin K, which is required for the synthesis of clotting factors II, VII, IX, and X."}
    ]
}
```

Fix: Have domain experts review a random sample of 100–200 examples. Calculate the error rate. If it exceeds 5%, improve the data generation process before scaling up.
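The review step can be sketched roughly as follows. This is a minimal sketch, not a prescribed workflow: `draw_review_sample`, `error_rate`, and `wilson_interval` are hypothetical helper names, and it assumes reviewers record one boolean verdict per sampled example. The Wilson interval is an addition here, included because an error rate measured on 100–200 examples carries noticeable uncertainty around a 5% cutoff.

```python
import math
import random


def draw_review_sample(data, sample_size=150, seed=0):
    """Pick a reproducible random sample of examples for expert review."""
    rng = random.Random(seed)
    return rng.sample(data, min(sample_size, len(data)))


def error_rate(verdicts):
    """Fraction of reviewed examples that experts marked incorrect.

    `verdicts` is a list of booleans: True means the example is correct.
    """
    if not verdicts:
        raise ValueError("no verdicts to score")
    return sum(1 for ok in verdicts if not ok) / len(verdicts)


def wilson_interval(errors, n, z=1.96):
    """Approximate 95% Wilson score interval for the true error rate."""
    p = errors / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, center - half), min(1.0, center + half)


# Hypothetical review outcome: 9 of 150 sampled examples were wrong.
verdicts = [False] * 9 + [True] * 141
rate = error_rate(verdicts)
low, high = wilson_interval(9, len(verdicts))
print(f"error rate: {rate:.1%} (95% CI {low:.1%} to {high:.1%})")
if rate > 0.05:
    print("Above the 5% threshold: fix the generation process first.")
```

If the interval straddles 5%, reviewing a larger sample before deciding is a reasonable next step.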
2. Response Consistency
Contradictory examples in the dataset confuse the model:
```python
# Example 1: says X
{"user": "Is metformin safe in pregnancy?", "assistant": "Metformin is generally considered safe in pregnancy and is used for gestational diabetes."}

# Example 2: says opposite of X (contradiction!)
{"user": "Can metformin be used in pregnant patients?", "assistant": "Metformin should be avoided in pregnancy as it crosses the placenta."}
```

Fix: Cluster semantically similar questions and audit for contradictions. Use embedding similarity to find near-duplicate prompts with conflicting answers.
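One way to shortlist pairs for that audit is sketched below. It assumes you already have one embedding vector per prompt from whatever sentence-embedding model you use; the toy vectors, the function name `find_audit_candidates`, and the 0.85 threshold are all illustrative assumptions. The code only flags similar prompts whose answers differ textually. Deciding whether two answers actually contradict each other still needs a human reviewer.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def find_audit_candidates(examples, embeddings, threshold=0.85):
    """Return (i, j, similarity) for prompts that are near-duplicates but
    whose answers differ, so a reviewer can check them for contradictions.

    `embeddings[i]` is the prompt embedding for `examples[i]`.
    """
    pairs = []
    for i in range(len(examples)):
        for j in range(i + 1, len(examples)):
            sim = cosine(embeddings[i], embeddings[j])
            same_answer = (
                examples[i]["assistant"].strip() == examples[j]["assistant"].strip()
            )
            if sim >= threshold and not same_answer:
                pairs.append((i, j, sim))
    # Most similar prompts first: these are the likeliest true conflicts.
    return sorted(pairs, key=lambda p: p[2], reverse=True)


# Toy vectors standing in for real sentence embeddings (assumption).
examples = [
    {"user": "Is metformin safe in pregnancy?", "assistant": "Generally considered safe."},
    {"user": "Can metformin be used in pregnant patients?", "assistant": "Should be avoided."},
    {"user": "What is warfarin's mechanism?", "assistant": "Inhibits VKOR."},
]
embeddings = [[0.9, 0.1, 0.0], [0.85, 0.2, 0.0], [0.0, 0.1, 0.95]]

candidates = find_audit_candidates(examples, embeddings)
for i, j, sim in candidates:
    print(f"review pair ({i}, {j}), prompt similarity {sim:.2f}")
```

The pairwise loop is quadratic in dataset size, so for large datasets you would likely swap in an approximate nearest-neighbor index.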
3. Response Quality
Responses should demonstrate the target behavior, not just be technically correct:

| Quality | Example |
|---|---|
| Poor (vague) | "Warfarin has many interactions." |
| Moderate (correct) | "Warfarin interacts with NSAIDs, increasing bleeding risk." |
| High (complete) | "Warfarin + NSAIDs is a major interaction. NSAIDs inhibit platelet aggregation (via COX-1) and can cause GI mucosal damage. Combined with warfarin's anticoagulant effect, this significantly elevates GI bleeding risk. Recommend acetaminophen as an alternative analgesic, or intensify INR monitoring if NSAID use is unavoidable." |
4. Format Consistency
Every response should follow the same format conventions (markdown structure, response length, use of bullet points vs prose) if that's the behavior you want the model to learn.
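A quick way to see whether a dataset is consistent on these points is to profile the responses. The sketch below is an illustration under assumptions: `format_profile` and `format_report` are hypothetical names, and the bullet and heading patterns are simple heuristics that you would adapt to your own conventions.

```python
import re
import statistics


def format_profile(response):
    """Describe the surface format of one response."""
    lines = response.splitlines()
    return {
        "words": len(response.split()),
        # A line starting with -, *, a bullet character, or "1." counts as a list item.
        "has_bullets": any(re.match(r"\s*([-*•]|\d+\.)\s+", ln) for ln in lines),
        "has_headings": any(ln.lstrip().startswith("#") for ln in lines),
    }


def format_report(responses):
    """Summarize how uniform the formatting is across a dataset."""
    if not responses:
        raise ValueError("no responses to profile")
    profiles = [format_profile(r) for r in responses]
    n = len(profiles)
    words = [p["words"] for p in profiles]
    return {
        "bullet_share": sum(p["has_bullets"] for p in profiles) / n,
        "heading_share": sum(p["has_headings"] for p in profiles) / n,
        "median_words": statistics.median(words),
        "min_words": min(words),
        "max_words": max(words),
    }


responses = [
    "Key points:\n- Inhibits VKOR\n- Monitor INR",
    "Key points:\n- Crosses the placenta\n- Discuss with obstetrics",
    "Warfarin has many interactions.",
]
report = format_report(responses)
print(report)
```

Shares close to 0 or 1 suggest a consistent convention. A share somewhere in the middle, as in this toy data, means the dataset mixes formats and the model may learn to mix them too.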
Data Quality Pipeline
```python
from typing import NamedTuple


class QualityScore(NamedTuple):
    example_id: int
    length_score: float    # 0-1, penalizes responses that are too short or too long
    has_structure: bool    # Has expected format markers
    passes_filters: bool   # Didn't hit any keyword filters


def score_example(example: dict, min_length: int = 50, max_length: int = 800,
                  example_id: int = 0) -> QualityScore:
    """Score one chat-format example on length, structure, and keyword filters."""
    messages = example.get("messages", [])
    assistant_messages = [m for m in messages if m.get("role") == "assistant"]
    if not assistant_messages:
        return QualityScore(example_id, 0.0, False, False)

    response = assistant_messages[-1].get("content", "")
    length = len(response.split())  # length in words, not characters

    # Length scoring: 1.0 inside [min_length, max_length], scaled down outside it
    if length < min_length:
        length_score = length / min_length
    elif length > max_length:
        length_score = max_length / length
    else:
        length_score = 1.0

    # Structural markers (for medical Q&A); a crude check, since any colon or hyphen matches
    has_structure = any(marker in response for marker in [":", "-", "1.", "•"])

    # Flag clearly bad examples (refusals and assistant boilerplate)
    bad_phrases = ["I don't know", "I cannot provide", "As an AI"]
    passes_filters = not any(phrase.lower() in response.lower() for phrase in bad_phrases)

    return QualityScore(
        example_id=example_id,
        length_score=length_score,
        has_structure=has_structure,
        passes_filters=passes_filters,
    )


def filter_dataset(data: list[dict], min_quality: float = 0.7) -> list[dict]:
    """Keep examples that pass the keyword filters and meet the composite score."""
    filtered = []
    rejected = []
    for i, example in enumerate(data):
        score = score_example(example, example_id=i)
        composite = (score.length_score * 0.4 +
                     float(score.has_structure) * 0.3 +
                     float(score.passes_filters) * 0.3)
        # Reject keyword-filter failures outright: with these weights, a response with
        # good length and structure would otherwise still reach 0.7 and be kept.
        if score.passes_filters and composite >= min_quality:
            filtered.append(example)
        else:
            rejected.append((example, composite))

    kept_pct = 100 * len(filtered) // len(data) if data else 0  # avoid dividing by zero
    print(f"Kept: {len(filtered)} / {len(data)} ({kept_pct}%)")
    print(f"Rejected: {len(rejected)}")
    return filtered
```

Deduplication
Near-duplicate examples waste training budget and can cause overfitting to duplicated content:
```python
from sentence_transformers import SentenceTransformer
import numpy as np


def deduplicate_dataset(data: list[dict], threshold: float = 0.95) -> list[dict]:
    """Remove near-duplicate examples using embedding similarity."""
    model = SentenceTransformer("all-MiniLM-L6-v2")

    # Extract the first user prompt of each example for comparison
    prompts = []
    for example in data:
        user_msgs = [m["content"] for m in example["messages"] if m["role"] == "user"]
        prompts.append(user_msgs[0] if user_msgs else "")

    embeddings = model.encode(prompts, batch_size=64, show_progress_bar=True)

    # Find duplicates by pairwise cosine similarity (O(n^2) comparisons);
    # the first example in each near-duplicate group is kept
    to_remove = set()
    for i in range(len(embeddings)):
        if i in to_remove:
            continue
        for j in range(i + 1, len(embeddings)):
            if j in to_remove:
                continue
            similarity = np.dot(embeddings[i], embeddings[j]) / (
                np.linalg.norm(embeddings[i]) * np.linalg.norm(embeddings[j])
            )
            if similarity > threshold:
                to_remove.add(j)

    deduplicated = [ex for i, ex in enumerate(data) if i not in to_remove]
    print(f"Removed {len(to_remove)} duplicates. Kept: {len(deduplicated)}")
    return deduplicated
```

Quality vs Quantity Trade-off
A practical guide to dataset size and quality requirements:
| Training goal | Min examples (high quality) | Effect of low quality |
|---|---|---|
| Style / format adaptation | 200–500 | Minor format inconsistency |
| Domain vocabulary / tone | 500–2,000 | Poor domain coverage |
| New task learning | 1,000–5,000 | Task failure |
| Factual knowledge injection | 5,000–50,000 | Hallucination |
| Full behavioral alignment | 50,000+ | Misaligned behavior |

Note: these are minimums for high-quality data. Increase by 5–10x if using auto-generated data without expert review.
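To make the scaling note concrete, here is a small worked example. The counts are copied from the table; the helper name `target_range` and the goal keys are hypothetical, and reading "5–10x" as 5x on the low end and 10x on the high end is one interpretation, not the only one. The open-ended "50,000+" row is left out because it has no upper bound.

```python
# Minimum high-quality example counts, copied from the table above.
MIN_EXAMPLES = {
    "style_format": (200, 500),
    "domain_tone": (500, 2_000),
    "new_task": (1_000, 5_000),
    "factual_knowledge": (5_000, 50_000),
}


def target_range(goal, expert_reviewed=True):
    """Return (low, high) example counts for a training goal.

    For auto-generated data without expert review, the low end is
    scaled by 5x and the high end by 10x (an assumed reading of the note).
    """
    low, high = MIN_EXAMPLES[goal]
    if expert_reviewed:
        return low, high
    return low * 5, high * 10


print(target_range("style_format"))
print(target_range("style_format", expert_reviewed=False))
```

Under this reading, style adaptation with unreviewed synthetic data needs roughly 1,000 to 5,000 examples instead of 200 to 500.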
Labeling Protocol for Human-Curated Data
When having humans write or verify training examples:
- Annotator guidelines: Write a detailed rubric covering what makes a response good, what to avoid, and the expected length and format. Include 20+ annotated examples.
- Inter-annotator agreement: Have 10% of examples labeled by two annotators. Calculate Cohen's kappa. A value below 0.6 means your guidelines are ambiguous.
- Calibration sessions: Regular group review of borderline cases keeps annotators aligned.
- Specialist review: For domain-specific content (medical, legal), have subject matter experts review a random 10% sample.
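The inter-annotator agreement check can be computed in a few lines. This is a minimal sketch of Cohen's kappa for two annotators, assuming each gives one categorical label per example. The function name and the sample labels are made up for illustration.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    # Agreement expected by chance, from each annotator's label frequencies.
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label
    return (observed - expected) / (1 - expected)


# Hypothetical accept/reject verdicts from two annotators on 10 examples.
annotator_1 = ["ok", "ok", "ok", "bad", "ok", "bad", "ok", "ok", "bad", "ok"]
annotator_2 = ["ok", "ok", "bad", "bad", "ok", "bad", "ok", "ok", "ok", "ok"]
kappa = cohens_kappa(annotator_1, annotator_2)
print(f"kappa = {kappa:.2f}")
```

Here the annotators agree on 8 of 10 items, but chance agreement is already 0.58, so kappa comes out near 0.52. That is below the 0.6 cutoff, which would be a signal to revisit the guidelines.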
The labeling protocol is as important as the data itself. Bad guidelines produce inconsistent data regardless of annotator effort.