Fine-Tuning LLMs · Lesson 12 of 16

Synthetic Data: Generating Training Data with LLMs

Why Synthetic Data?

Collecting high-quality human-labeled training data is expensive and slow. Synthetic data, generated by a stronger LLM, can be produced cheaply at scale and tailored to a target domain.

The key insight: a frontier model (GPT-4o, Claude Opus) can generate training examples that teach a smaller model (Llama 8B, Mistral 7B) to behave similarly on a target task.

This is the foundation of approaches like Stanford's Alpaca (which fine-tuned LLaMA 7B on 52K instruction-following examples generated with OpenAI's text-davinci-003, a GPT-3.5-series model) and many production fine-tuning pipelines.


Basic Synthetic Generation Pattern

Python
from openai import OpenAI
import json

client = OpenAI()

def generate_training_example(
    topic: str,
    question_seed: str,
    system_prompt: str,
) -> dict | None:
    """Generate one training example using GPT-4o."""

    generation_prompt = f"""Generate a training example for a clinical pharmacology assistant.

Topic: {topic}
Seed question: {question_seed}

Create a realistic user question about this topic (vary from the seed), then write an ideal expert response.

Return JSON only:
{{
  "user_question": "...",
  "expert_response": "..."
}}

Requirements for the response:
- Medically accurate and evidence-based
- 100-300 words
- Include specific mechanisms, not vague generalities
- Clinical in tone"""

    try:
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": generation_prompt}],
            response_format={"type": "json_object"},
            temperature=0.8,
        )
        data = json.loads(response.choices[0].message.content)
        return {
            "messages": [
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": data["user_question"]},
                {"role": "assistant", "content": data["expert_response"]},
            ]
        }
    except Exception as e:
        print(f"Generation error: {e}")
        return None

# Generate examples
system = "You are a clinical pharmacology expert. Provide accurate, evidence-based drug information."

seeds = [
    ("drug interactions", "What happens when warfarin and aspirin are taken together?"),
    ("drug interactions", "Is it safe to take metformin with ibuprofen?"),
    ("mechanisms", "How does metformin lower blood glucose?"),
    ("mechanisms", "What is the mechanism of action of beta blockers?"),
    ("dosing", "What dose adjustments are needed for renal impairment?"),
]

examples = []
for topic, seed in seeds:
    for _ in range(10):  # 10 variations per seed
        example = generate_training_example(topic, seed, system)
        if example:
            examples.append(example)

print(f"Generated {len(examples)} training examples")

Self-Instruct: Bootstrapping from Seeds

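Self-Instruct (Wang et al., 2022) bootstraps a large, diverse instruction set from a small pool of seed tasks: the model is shown a few instructions sampled from the pool and asked to write new ones, which are then added back to the pool for the next round: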

Python
import random

SEED_TASKS = [
    "Explain the mechanism of action of warfarin",
    "What are the contraindications for metformin?",
    "Describe the pharmacokinetics of amoxicillin",
    "What drug interactions should I be aware of with clopidogrel?",
    "How should I adjust the dose of vancomycin in a patient with renal failure?",
]

def generate_new_instructions(seeds: list[str], n_new: int = 5) -> list[str]:
    """Generate new instructions inspired by seeds."""
    sample = random.sample(seeds, min(3, len(seeds)))

    prompt = f"""Here are some example questions a clinical pharmacology assistant might receive:

{chr(10).join(f'- {s}' for s in sample)}

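Generate {n_new} new, diverse questions that are similar in style but cover different drugs, mechanisms, or clinical scenarios. Return a JSON object of the form {{"questions": ["...", "..."]}}."""

    try:
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": prompt}],
            # JSON mode returns a single JSON object, so the prompt asks for a "questions" key
            response_format={"type": "json_object"},
            temperature=0.9,
        )
        data = json.loads(response.choices[0].message.content)
        questions = data.get("questions", [])
        # Keep only non-empty strings in case the model returns malformed items
        return [q.strip() for q in questions if isinstance(q, str) and q.strip()]
    except Exception as e:
        print(f"Instruction generation error: {e}")
        return []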

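# Bootstrap: start with seeds, expand iteratively
all_instructions = list(SEED_TASKS)
seen = {s.lower() for s in all_instructions}  # case-insensitive exact-match dedup
for round_num in range(5):  # 5 rounds of expansion
    new_instructions = generate_new_instructions(all_instructions, n_new=10)
    for instruction in new_instructions:
        if instruction.lower() not in seen:
            seen.add(instruction.lower())
            all_instructions.append(instruction)
    print(f"Round {round_num + 1}: {len(all_instructions)} total instructions")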
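
The loop above can accumulate near-duplicate questions across rounds. As described in the Self-Instruct paper, a new instruction is added to the pool only when its ROUGE-L similarity to every existing instruction is below 0.7. The following is a minimal sketch of that idea, not the paper's implementation: it assumes lowercase whitespace tokenization, and the helper names (`lcs_length`, `rouge_l_f1`, `is_novel`) are hypothetical.

```python
def lcs_length(a: list[str], b: list[str]) -> int:
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for tok_a in a:
        curr = [0]
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def rouge_l_f1(candidate: str, reference: str) -> float:
    """ROUGE-L F1 on lowercase whitespace tokens (a simplification)."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)

def is_novel(candidate: str, pool: list[str], max_similarity: float = 0.7) -> bool:
    """True if the candidate is below the cutoff against every pooled instruction."""
    return all(rouge_l_f1(candidate, existing) < max_similarity for existing in pool)

pool = ["Explain the mechanism of action of warfarin"]
# Differs from the pooled instruction by a single word, so it is rejected
print(is_novel("Explain the mechanism of action of heparin", pool))
# Shares almost no token sequence with the pool, so it is accepted
print(is_novel("What are common side effects of statins?", pool))
```

One way to use this would be to call `is_novel(instruction, all_instructions)` before appending each generated instruction. A token-overlap cutoff like this is cheap but only catches surface-level rewording.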

Quality Filtering for Synthetic Data

Not all generated examples are good. Filter aggressively:

Python
def quality_filter_synthetic(
    example: dict,
    min_response_words: int = 50,
    max_response_words: int = 600,
) -> tuple[bool, str]:
    """Returns (passes, rejection_reason)."""
    messages = example.get("messages", [])
    assistant_msgs = [m for m in messages if m["role"] == "assistant"]

    if not assistant_msgs:
        return False, "no assistant message"

    response = assistant_msgs[-1]["content"]
    word_count = len(response.split())

    if word_count < min_response_words:
        return False, f"too short ({word_count} words)"

    if word_count > max_response_words:
        return False, f"too long ({word_count} words)"

    # Reject refusals and evasions
    refusal_phrases = [
        "I cannot provide medical advice",
        "Please consult a healthcare professional",
        "I'm not able to",
        "As an AI language model",
    ]
    response_lower = response.lower()
    for phrase in refusal_phrases:
        if phrase.lower() in response_lower:
            return False, f"refusal: {phrase}"

    # Require some domain vocabulary (crude substring heuristic, not proof of specificity)
    domain_terms = ["mg", "dose", "receptor", "enzyme", "inhibit", "mechanism", "clinical"]
    has_domain = any(term in response_lower for term in domain_terms)
    if not has_domain:
        return False, "lacks domain specificity"

    return True, ""

# Filter
filtered = []
rejected = []
for example in examples:
    passes, reason = quality_filter_synthetic(example)
    if passes:
        filtered.append(example)
    else:
        rejected.append((reason, example))

print(f"Passed: {len(filtered)}, Rejected: {len(rejected)}")
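
Before tuning the thresholds, it can help to see which rule rejects the most examples. The sketch below tallies rejection reasons from a list shaped like `rejected` above. The `reason_bucket` helper and the sample data are hypothetical, and the bucketing assumes reasons follow the string formats produced by `quality_filter_synthetic`.

```python
from collections import Counter

def reason_bucket(reason: str) -> str:
    """Collapse detailed reasons such as 'too short (12 words)' into a short label."""
    return reason.split(" (")[0].split(":")[0]

# Hypothetical rejected list in the same (reason, example) shape used above
sample_rejected = [
    ("too short (12 words)", {}),
    ("too short (31 words)", {}),
    ("refusal: As an AI language model", {}),
    ("lacks domain specificity", {}),
]

reason_counts = Counter(reason_bucket(reason) for reason, _ in sample_rejected)
for label, count in reason_counts.most_common():
    print(f"{label}: {count}")
```

If a single bucket dominates, that usually points to the generation prompt rather than the filter: for example, many "too short" rejections may mean the word-count requirement in the prompt is being ignored.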

Diversity Filtering

Avoid semantic duplicates that waste training budget:

Python
from sentence_transformers import SentenceTransformer
import numpy as np

def diversity_filter(examples: list[dict], similarity_threshold: float = 0.85) -> list[dict]:
    """Remove examples with very similar user prompts."""
    encoder = SentenceTransformer("all-MiniLM-L6-v2")

    prompts = []
    for ex in examples:
        user_msgs = [m["content"] for m in ex["messages"] if m["role"] == "user"]
        prompts.append(user_msgs[0] if user_msgs else "")

    embeddings = encoder.encode(prompts, batch_size=64, show_progress_bar=True)
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    embeddings_normalized = embeddings / norms

    keep = []
    kept_embeddings = []

    for example, emb in zip(examples, embeddings_normalized):
        if not kept_embeddings:
            keep.append(example)
            kept_embeddings.append(emb)
            continue

        similarities = np.array(kept_embeddings) @ emb
        if similarities.max() < similarity_threshold:
            keep.append(example)
            kept_embeddings.append(emb)

    print(f"After diversity filter: {len(keep)} / {len(examples)}")
    return keep
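
The greedy rule in `diversity_filter` keeps whichever member of a near-duplicate group appears first, so the result depends on input order. The toy sketch below illustrates this with hand-written 2-D vectors standing in for embeddings. It uses only the standard library, and the names `cosine` and `greedy_keep_indices` are hypothetical.

```python
import math

def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def greedy_keep_indices(vectors: list[list[float]], threshold: float = 0.85) -> list[int]:
    """Keep index i only if vector i is below `threshold` similarity to every kept vector."""
    kept: list[int] = []
    for i, vec in enumerate(vectors):
        if all(cosine(vec, vectors[j]) < threshold for j in kept):
            kept.append(i)
    return kept

# Hypothetical "embeddings": the first two point in nearly the same direction
toy_vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]

kept = greedy_keep_indices(toy_vectors)
print(kept)  # the second vector is dropped as a near-duplicate of the first

# Reversing the input keeps the other member of the near-duplicate pair
kept_reversed = greedy_keep_indices(toy_vectors[::-1])
print(kept_reversed)
```

If some examples are more trustworthy than others (for instance, ones that passed expert review), sorting them to the front before filtering means they are the ones that survive.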

Limitations of Synthetic Data

Bias inheritance: Synthetic data inherits biases from the generator model. If GPT-4o has misconceptions about a drug, those propagate to training data.

Hallucination risk: The generator can produce plausible-sounding but incorrect medical information. Always have a domain expert review a sample before using synthetic medical data.

Distribution shift: The generator produces examples based on its training distribution — it may not cover the edge cases your users actually encounter.

Practical approach:

  1. Generate synthetic data for common, well-covered queries
  2. Collect human-labeled data for critical edge cases and rare scenarios
  3. Have experts audit a random 5–10% of synthetic examples
  4. Use synthetic data to reach scale and human-labeled data to anchor quality
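
The expert audit in step 3 can be sketched as a reproducible random draw. This is a minimal illustration under stated assumptions: the function name `draw_audit_sample`, the 5% default, and the fixed seed are all hypothetical choices, and the dataset here is a stand-in.

```python
import random

def draw_audit_sample(examples: list[dict], fraction: float = 0.05, seed: int = 0) -> list[dict]:
    """Return a reproducible random subset of examples for expert review."""
    if not examples:
        return []
    # Always audit at least one example, even for very small datasets
    n_audit = max(1, round(len(examples) * fraction))
    rng = random.Random(seed)  # fixed seed so the same sample can be re-drawn later
    return rng.sample(examples, n_audit)

# Hypothetical stand-in for the filtered synthetic dataset
synthetic_examples = [{"id": i} for i in range(200)]
audit_batch = draw_audit_sample(synthetic_examples, fraction=0.05)
print(f"Auditing {len(audit_batch)} of {len(synthetic_examples)} examples")
```

Sampling uniformly at random, rather than reviewing the first N examples, avoids auditing only the topics that happened to be generated first.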

Synthetic data works best as a volume booster on top of a smaller human-labeled core.