When to Fine-Tune vs Prompt Engineer

The single most expensive mistake in LLM projects is fine-tuning when prompting would have worked — and the second most expensive is prompting when only fine-tuning can deliver the needed consistency. This lesson gives you a repeatable framework for making the right call.

The Decision Matrix

Before committing to fine-tuning, evaluate your project across five dimensions.

| Dimension | Favour Prompting / RAG | Favour Fine-Tuning |
|---|---|---|
| Consistency requirement | Occasional variation acceptable | Every output must match a strict schema |
| Training data availability | Fewer than 200 high-quality examples | 500 or more curated examples |
| Domain vocabulary | General English | Highly specialized terminology |
| Latency budget | 200 ms or more acceptable | System prompt tokens add unacceptable latency |
| Task stability | Task definition changes weekly | Task definition is stable for months |

Score each dimension. If three or more dimensions point to fine-tuning, it is probably the right choice.
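The scoring rule can be made explicit with a small tally. This is a minimal sketch: the dimension keys and the `score_decision_matrix` helper are hypothetical names chosen for illustration, and the three-of-five threshold is the rule stated above.

```python
# Hypothetical tally of the five decision-matrix dimensions.
# True means the dimension points toward fine-tuning.
DIMENSIONS = (
    "consistency_requirement",
    "training_data_availability",
    "domain_vocabulary",
    "latency_budget",
    "task_stability",
)

def score_decision_matrix(favours_fine_tuning: dict[str, bool], threshold: int = 3) -> str:
    """Count dimensions that favour fine-tuning and apply the three-or-more rule."""
    missing = [d for d in DIMENSIONS if d not in favours_fine_tuning]
    if missing:
        raise ValueError(f"Unscored dimensions: {missing}")
    votes = sum(bool(favours_fine_tuning[d]) for d in DIMENSIONS)
    if votes >= threshold:
        return f"{votes}/5 dimensions favour fine-tuning: probably fine-tune"
    return f"{votes}/5 dimensions favour fine-tuning: start with prompting or RAG"

example = {
    "consistency_requirement": True,
    "training_data_availability": True,
    "domain_vocabulary": True,
    "latency_budget": False,
    "task_stability": True,
}
print(score_decision_matrix(example))
```

Raising on unscored dimensions is a deliberate choice here: a missing answer should prompt more investigation rather than silently count as a vote for prompting.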

Signals That Fine-Tuning Is the Right Move

Signal 1: Prompting Fails to Produce Consistent Format

Python

# The problem: model ignores format instructions on messy real-world input
from openai import OpenAI
import json

client = OpenAI()

SYSTEM = """Extract drug information and return ONLY valid JSON:
{
  "drug_name": "string",
  "drug_class": "string",
  "indication": "string",
  "max_daily_dose_mg": number
}"""

# Works fine on clean input:
clean_input = "Metformin is a biguanide used for type 2 diabetes, max dose 2000 mg/day."

# Breaks on messy clinical note:
messy_input = (
    "Pt started on met 500 bid for T2DM, titrating to max per guidelines. "
    "Also on lisinopril 10mg for HTN - unrelated to today's visit. "
    "Family history: father had DM too, started insulin eventually."
)

def extract_drug_info(text: str) -> dict | None:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": SYSTEM},
            {"role": "user", "content": text}
        ],
        temperature=0.0
    )
    # message.content can be None (e.g. on a refusal), so fall back to an empty string
    content = (response.choices[0].message.content or "").strip()
    try:
        return json.loads(content)
    except json.JSONDecodeError:
        return None  # Model added explanation text around the JSON

# With prompting: ~85% parse success rate on messy notes
# With a fine-tuned model trained on 1,000 clinical notes: ~99.2% parse success rate

result = extract_drug_info(messy_input)
print(f"Parsed: {result is not None}")

Rule: If you run 100 test examples through your prompt and fewer than 95% produce correctly structured output, fine-tuning is likely worth it.
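The 95% rule can be checked offline once you have collected raw model outputs. A minimal sketch, assuming the outputs are already saved as strings; `structured_output_rate` and `REQUIRED_KEYS` are illustrative names, and the key set mirrors the schema in the system prompt above.

```python
import json

REQUIRED_KEYS = {"drug_name", "drug_class", "indication", "max_daily_dose_mg"}

def structured_output_rate(raw_outputs: list[str]) -> float:
    """Fraction of outputs that parse as a JSON object containing every required key."""
    if not raw_outputs:
        raise ValueError("Need at least one output to score")
    ok = 0
    for raw in raw_outputs:
        try:
            parsed = json.loads(raw)
        except json.JSONDecodeError:
            continue  # e.g. the model wrapped the JSON in explanatory text
        if isinstance(parsed, dict) and REQUIRED_KEYS.issubset(parsed):
            ok += 1
    return ok / len(raw_outputs)

sample_outputs = [
    '{"drug_name": "metformin", "drug_class": "biguanide", '
    '"indication": "type 2 diabetes", "max_daily_dose_mg": 2000}',
    'Here is the JSON: {"drug_name": "metformin"}',  # fails to parse
    '{"drug_name": "lisinopril"}',                    # parses, but keys are missing
]
rate = structured_output_rate(sample_outputs)
print(f"Structured output rate: {rate:.0%}")
print("Below 95%: consider fine-tuning" if rate < 0.95 else "Prompting is holding up")
```

In practice you would feed this the outputs from your 100 test examples rather than the three hand-written strings shown here.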

Signal 2: Domain Vocabulary Is Highly Specialized

Medical, legal, and financial domains have terminology that base models handle inconsistently. Fine-tuning teaches the model the local vocabulary reliably.

Python

# Vocabulary test: does the base model know your domain's abbreviations?
from openai import OpenAI

client = OpenAI()

domain_abbreviations = [
    ("HTN", "hypertension"),
    ("eGFR", "estimated glomerular filtration rate"),
    ("T2DM", "type 2 diabetes mellitus"),
    ("MSSA", "methicillin-susceptible Staphylococcus aureus"),
    ("LVEF", "left ventricular ejection fraction"),
    ("PRN", "pro re nata"),
]

def test_abbreviation_knowledge(abbrev: str, expected: str) -> bool:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": f"What does the medical abbreviation '{abbrev}' stand for? Give only the expansion, nothing else."
        }],
        temperature=0.0
    )
    # message.content can be None, so fall back to an empty string before normalizing
    answer = (response.choices[0].message.content or "").strip().lower()
    return expected.lower() in answer

results = {abbrev: test_abbreviation_knowledge(abbrev, expansion)
           for abbrev, expansion in domain_abbreviations}

score = sum(results.values()) / len(results)
print(f"Domain vocabulary accuracy: {score:.0%}")
print(results)

# If score is under 80%, fine-tuning on domain text will help significantly

Signal 3: You Are Paying Too Much for System Prompts

Python

# Cost analysis: system prompt token cost over time
def calculate_prompt_vs_finetune_cost(
    daily_requests: int,
    system_prompt_tokens: int,
    avg_user_tokens: int,
    avg_output_tokens: int,
    input_price_per_million: float,   # e.g., 0.15 for gpt-4o-mini
    output_price_per_million: float,  # e.g., 0.60 for gpt-4o-mini
    finetune_cost_usd: float,         # one-time cost
    days: int = 365
) -> dict:
    """Compare cumulative prompting cost vs fine-tuning cost."""

    # Prompting cost: system prompt + user message every request
    total_input_tokens_per_day = daily_requests * (system_prompt_tokens + avg_user_tokens)
    total_output_tokens_per_day = daily_requests * avg_output_tokens

    daily_input_cost = (total_input_tokens_per_day / 1_000_000) * input_price_per_million
    daily_output_cost = (total_output_tokens_per_day / 1_000_000) * output_price_per_million
    daily_prompt_cost = daily_input_cost + daily_output_cost

    cumulative_prompt_cost = daily_prompt_cost * days

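    # Fine-tuned model: no system prompt, same user message + output.
    # Simplification: this assumes the fine-tuned model is billed at the same per-token
    # price as the base model; adjust the prices if your provider charges more.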
    ft_input_per_day = daily_requests * avg_user_tokens
    ft_output_per_day = daily_requests * avg_output_tokens
    daily_ft_cost = (
        (ft_input_per_day / 1_000_000) * input_price_per_million +
        (ft_output_per_day / 1_000_000) * output_price_per_million
    )

    cumulative_ft_cost = finetune_cost_usd + (daily_ft_cost * days)

    # Break-even
    savings_per_day = daily_prompt_cost - daily_ft_cost
    break_even_days = finetune_cost_usd / savings_per_day if savings_per_day > 0 else float("inf")

    return {
        "daily_prompt_cost_usd": round(daily_prompt_cost, 4),
        "daily_ft_inference_cost_usd": round(daily_ft_cost, 4),
        f"cumulative_prompt_cost_{days}d_usd": round(cumulative_prompt_cost, 2),
        f"cumulative_ft_cost_{days}d_usd": round(cumulative_ft_cost, 2),
        "break_even_days": round(break_even_days, 1),
        "savings_per_day_usd": round(savings_per_day, 4),
    }

# Scenario: 10,000 drug queries/day, 800-token system prompt
result = calculate_prompt_vs_finetune_cost(
    daily_requests=10_000,
    system_prompt_tokens=800,
    avg_user_tokens=80,
    avg_output_tokens=200,
    input_price_per_million=0.15,
    output_price_per_million=0.60,
    finetune_cost_usd=500,
    days=365
)
for k, v in result.items():
    print(f"  {k}: {v}")

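# For this scenario break_even_days is about 417: dropping the 800-token prompt saves
# $1.20/day against a $500 training cost. Break-even falls under 30 days only at much
# higher volume or with much longer prompts.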
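The break-even formula can also be inverted to ask how much traffic is needed before prompt-token savings alone repay the training cost. A sketch under the same simplifying assumption as the calculator above (identical per-token prices for the base and fine-tuned models, which hosted providers do not always offer); `requests_needed_for_break_even` is an illustrative helper, not part of any library.

```python
def requests_needed_for_break_even(
    finetune_cost_usd: float,
    system_prompt_tokens: int,
    input_price_per_million: float,
    target_days: float,
) -> float:
    """Daily requests at which removed system-prompt tokens repay training in target_days."""
    saving_per_request = system_prompt_tokens / 1_000_000 * input_price_per_million
    if saving_per_request <= 0 or target_days <= 0:
        raise ValueError("Need a positive per-request saving and a positive target")
    return finetune_cost_usd / (target_days * saving_per_request)

needed = requests_needed_for_break_even(
    finetune_cost_usd=500,
    system_prompt_tokens=800,
    input_price_per_million=0.15,
    target_days=30,
)
print(f"Requests/day for a 30-day break-even: {needed:,.0f}")
```

With the scenario's prices this comes to roughly 139,000 requests per day, about fourteen times the 10,000 used in the example, so at that volume the case for fine-tuning has to rest on consistency gains rather than token savings.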

When NOT to Fine-Tune

Do Not Fine-Tune With Fewer Than 100 Examples

With fewer than 100 examples, the model is likely to overfit. You will typically see:

Training loss drops to near zero
Validation loss stays high or increases
The model memorizes training examples verbatim

Python

# Detecting overfitting early in training
import matplotlib.pyplot as plt

def plot_training_curves(train_losses: list[float], val_losses: list[float]):
    """Visualize overfitting pattern."""
    steps = list(range(len(train_losses)))

    plt.figure(figsize=(10, 5))
    plt.plot(steps, train_losses, label="Training Loss", color="blue")
    plt.plot(steps, val_losses, label="Validation Loss", color="red")
    plt.xlabel("Training Steps")
    plt.ylabel("Loss")
    plt.title("Training vs Validation Loss")
    plt.legend()

    # Overfitting diagnostic (the 0.5 threshold is a rough heuristic; tune it for your loss scale)
    final_gap = val_losses[-1] - train_losses[-1]
    if final_gap > 0.5:
        print(f"WARNING: Large train/val gap ({final_gap:.2f}). "
              "Likely overfitting. Consider more data, a lower learning rate, "
              "fewer epochs, or dropout.")
    else:
        print(f"Train/val gap: {final_gap:.2f} — looks healthy.")

    plt.tight_layout()
    plt.savefig("training_curves.png")

# Example: overfitting with only 50 examples
# train_losses = [2.3, 1.8, 1.2, 0.6, 0.2, 0.05]
# val_losses   = [2.3, 2.1, 2.2, 2.4, 2.7, 3.1]   ← classic overfit
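One common guard against the pattern in the commented example is early stopping: keep the checkpoint with the lowest validation loss and stop once it has not improved for a few evaluations. A minimal sketch; `early_stop_step` and the `patience` default are illustrative choices, not settings from any specific trainer.

```python
def early_stop_step(val_losses: list[float], patience: int = 2) -> tuple[int, int]:
    """Return (best_step, stop_step) for simple patience-based early stopping.

    best_step is the index of the lowest validation loss seen before stopping;
    stop_step is the index at which training would halt.
    """
    if not val_losses:
        raise ValueError("val_losses must not be empty")
    best_step, best_loss = 0, val_losses[0]
    for step, loss in enumerate(val_losses[1:], start=1):
        if loss < best_loss:
            best_step, best_loss = step, loss
        elif step - best_step >= patience:
            return best_step, step  # no improvement for `patience` evaluations
    return best_step, len(val_losses) - 1

val_losses = [2.3, 2.1, 2.2, 2.4, 2.7, 3.1]  # the overfitting example above
best, stop = early_stop_step(val_losses)
print(f"Keep checkpoint from step {best}; stop at step {stop}")
```

On the 50-example curve this keeps the step-1 checkpoint and halts at step 3, before validation loss climbs further. Early stopping limits the damage but does not replace collecting more data.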

Do Not Fine-Tune When the Task Changes Frequently

If your output schema or task definition changes every few weeks, you would need to retrain after every change. With prompting, a task change is a prompt edit plus a rerun of your test set, with no retraining cost.

Python

# Task stability assessment
from datetime import date, timedelta

def assess_task_stability(
    task_definition_changes: list[date],
    horizon_days: int = 90
) -> str:
    """Estimate how stable a task is over the planning horizon."""
    if not task_definition_changes:
        return "stable — good fine-tuning candidate"

    recent_changes = [
        d for d in task_definition_changes
        if d >= date.today() - timedelta(days=horizon_days)
    ]

    change_rate = len(recent_changes) / (horizon_days / 30)  # changes per month

    if change_rate > 1.5:
        return (f"unstable ({change_rate:.1f} changes/month) — "
                f"use prompting; fine-tuning ROI is low")
    elif change_rate > 0.5:
        return (f"moderately stable ({change_rate:.1f} changes/month) — "
                f"fine-tune if consistency gains justify retraining cost")
    else:
        return (f"stable ({change_rate:.1f} changes/month) — "
                f"strong fine-tuning candidate")

# Example: task that has changed 5 times in the last 90 days
changes = [
    date(2026, 2, 15),
    date(2026, 3, 1),
    date(2026, 3, 20),
    date(2026, 4, 10),
    date(2026, 5, 1),
]
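print(assess_task_stability(changes))
# Reports "unstable (1.7 changes/month)" when run on or before 2026-05-16,
# while all five dates still fall inside the 90-day window; later runs count fewer changes.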

Case Study: Pharmaceutical Drug Labeling

The Problem

A pharmaceutical company needs to extract structured information from unstructured drug label PDFs and output FDA-compliant JSON for regulatory submission. Each label contains:

Drug name, manufacturer, NDC code
Indications and usage (free text)
Dosing instructions by patient population
Contraindications
Adverse reactions table

Requirement: 99.5% structured JSON output rate. Any malformed output requires human remediation at $45/hour.
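The remediation rate turns any failure rate into a monthly dollar figure. A minimal sketch, assuming 10,000 labels per month and 15 minutes of human time per malformed output (the figures used in the option comparisons that follow); `monthly_remediation_cost` is an illustrative helper.

```python
def monthly_remediation_cost(
    failure_rate: float,
    labels_per_month: int = 10_000,
    minutes_per_fix: float = 15,
    hourly_rate_usd: float = 45.0,
) -> float:
    """Expected monthly cost of manually fixing malformed outputs."""
    if not 0.0 <= failure_rate <= 1.0:
        raise ValueError("failure_rate must be between 0 and 1")
    failed_labels = failure_rate * labels_per_month
    hours = failed_labels * minutes_per_fix / 60
    return round(hours * hourly_rate_usd, 2)

# Budget implied by the 99.5% requirement (0.5% of outputs malformed)
print(f"${monthly_remediation_cost(0.005):,.2f} per month")
```

Passing each option's measured failure rate to the same function gives a like-for-like comparison of remediation costs.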

Option A: Prompting Only

Python

# Approach: very detailed system prompt with examples
DRUG_LABEL_PROMPT = """You are an FDA regulatory information extractor.
Extract drug label information and return ONLY valid JSON matching this exact schema:

{
  "drug_name": "string",
  "manufacturer": "string",
  "ndc_code": "string (format: XXXXX-XXXX-XX)",
  "indications": ["string"],
  "dosing": {
    "adult": "string",
    "pediatric": "string or null",
    "renal_impairment": "string or null"
  },
  "contraindications": ["string"],
  "adverse_reactions": [{"reaction": "string", "frequency": "string"}]
}

Rules:
- Return ONLY the JSON object, no explanation
- Use null for missing fields
- Extract verbatim from the label, do not paraphrase

Example input: [300 tokens of example]
Example output: [200 tokens of example JSON]

Now extract from the following label:"""

# Result after testing on 500 labels:
# - JSON parse success rate: 91.2%
# - Correct schema rate: 87.4%
# - Error cases: model adds "Here is the extracted JSON:" prefix, truncates long arrays
# - System prompt cost: ~600 tokens × 10,000 labels/month = 6M tokens = $0.90/month extra
# - Remediation cost: 0.126 failure rate × 10,000 labels × 0.25 hr × $45/hr = $14,175/month
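The "correct schema rate" above implies a check beyond `json.loads`. A sketch of one such check, assuming the schema in `DRUG_LABEL_PROMPT`; `validate_label` and the NDC pattern (which follows the prompt's XXXXX-XXXX-XX layout and assumes digits only) are illustrative, and a production pipeline might prefer a dedicated schema library.

```python
import re

# Assumption: digits only, in the 5-4-2 layout shown in the prompt
NDC_PATTERN = re.compile(r"^\d{5}-\d{4}-\d{2}$")
TOP_LEVEL_KEYS = {
    "drug_name", "manufacturer", "ndc_code", "indications",
    "dosing", "contraindications", "adverse_reactions",
}
DOSING_KEYS = {"adult", "pediatric", "renal_impairment"}

def validate_label(label: dict) -> list[str]:
    """Return a list of schema problems; an empty list means the label passes."""
    problems = []
    missing = TOP_LEVEL_KEYS.difference(label)
    if missing:
        problems.append(f"missing keys: {sorted(missing)}")
    ndc = label.get("ndc_code")
    if not isinstance(ndc, str) or not NDC_PATTERN.match(ndc):
        problems.append("ndc_code does not match XXXXX-XXXX-XX")
    dosing = label.get("dosing")
    if not isinstance(dosing, dict) or not DOSING_KEYS.issubset(dosing):
        problems.append("dosing must contain adult, pediatric, renal_impairment")
    for key in ("indications", "contraindications", "adverse_reactions"):
        if key in label and not isinstance(label[key], list):
            problems.append(f"{key} must be a list")
    return problems

example_label = {  # hypothetical values for illustration only
    "drug_name": "ExampleDrug",
    "manufacturer": "Example Pharma",
    "ndc_code": "12345-6789-01",
    "indications": ["type 2 diabetes mellitus"],
    "dosing": {"adult": "500 mg twice daily", "pediatric": None, "renal_impairment": None},
    "contraindications": [],
    "adverse_reactions": [{"reaction": "nausea", "frequency": "common"}],
}
print(validate_label(example_label))
```

Returning a list of problems instead of a boolean makes it easier to see which failure modes dominate, such as truncated arrays versus missing fields.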

Option B: RAG

RAG does not apply here — the input text IS the knowledge source. The model needs to extract from it, not retrieve from elsewhere. RAG solves knowledge gaps, not structured extraction.

Option C: Fine-Tuned Model

Python

# Fine-tuned on 2,000 labeled drug documents (human-verified JSON pairs)
# Training: QLoRA on Llama 3.1 8B, rank 16, 3 epochs, 4 hours on A100

# Results after fine-tuning:
fine_tuning_results = {
    "json_parse_success_rate": 0.998,      # vs 0.912 for prompting
    "correct_schema_rate": 0.996,           # vs 0.874 for prompting
    "avg_tokens_per_request": 180,          # vs 780 for prompting (600 system prompt)
    "monthly_training_cost_amortized": 42,  # $500 training ÷ 12 months
    "monthly_inference_savings": 0.90,      # no system prompt tokens
    "monthly_remediation_cost": 450.00,     # 0.004 × 10,000 × 0.25 hr × $45/hr
}

# Monthly cost comparison:
# Prompting:     0.126 × 10,000 × 0.25 hr × $45/hr = $14,175 (remediation) + inference cost
# Fine-tuned:    $450 (remediation) + $42 (amortized training) = $492/month

The Verdict

Fine-tuning wins decisively for pharmaceutical drug labeling because:

The output format is rigid and legally important
The vocabulary is specialized (dosing schedules, drug classes, NDC codes)
Volume is high enough that training amortizes quickly
The task definition (FDA schema) changes at most once a year

Summary Decision Tree

Is your output format flexible?
├─ Yes → Try prompting first. Fine-tune only if quality is insufficient.
└─ No → Is your dataset larger than 200 examples?
         ├─ No → Collect more data before fine-tuning.
         └─ Yes → Does the task change more than once a month?
                  ├─ Yes → Use prompting (fine-tuning ROI is low).
                  └─ No → Fine-tune. Calculate break-even point first.
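The tree above can be sketched as a function. This is an illustrative encoding, not a definitive rule: `recommend_approach` and its parameter names are hypothetical, and the thresholds (200 examples, one change per month) are taken from the tree.

```python
def recommend_approach(
    format_is_flexible: bool,
    num_examples: int,
    task_changes_per_month: float,
) -> str:
    """Walk the summary decision tree and return its recommendation."""
    if format_is_flexible:
        return "Try prompting first; fine-tune only if quality is insufficient."
    if num_examples <= 200:
        return "Collect more data before fine-tuning."
    if task_changes_per_month > 1:
        return "Use prompting (fine-tuning ROI is low)."
    return "Fine-tune, after calculating the break-even point."

# The drug-labeling case study: rigid format, 2,000 examples, schema changes at most yearly
print(recommend_approach(False, 2_000, 1 / 12))
```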

The framework is simple: fine-tuning is a capital investment. Like all capital investments, it pays off when volume is high, the task is stable, and consistency requirements are strict.
