Fine-Tuning LLMs · Lesson 2 of 16
When Should You Fine-Tune vs Use RAG?
When to Fine-Tune vs Prompt Engineer
The single most expensive mistake in LLM projects is fine-tuning when prompting would have worked — and the second most expensive is prompting when only fine-tuning can deliver the needed consistency. This lesson gives you a repeatable framework for making the right call.
The Decision Matrix
Before committing to fine-tuning, evaluate your project across five dimensions.
| Dimension | Favour Prompting / RAG | Favour Fine-Tuning |
|---|---|---|
| Consistency requirement | Occasional variation acceptable | Every output must match a strict schema |
| Training data availability | Fewer than 200 high-quality examples | 500 or more curated examples |
| Domain vocabulary | General English | Highly specialized terminology |
| Latency budget | 200ms or more acceptable | System prompt tokens add unacceptable latency |
| Task stability | Task definition changes weekly | Task definition is stable for months |
Score each dimension. If three or more dimensions point to fine-tuning, it is probably the right choice.
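As a rough sketch, the scoring rule can be written as a small function. The dimension keys and the `score_decision_matrix` name are illustrative and not part of any library; only the five dimensions and the three-or-more threshold come from the matrix. The example project values are hypothetical.

```python
# Illustrative keys, one per row of the decision matrix.
DIMENSIONS = (
    "consistency_requirement",
    "training_data_availability",
    "domain_vocabulary",
    "latency_budget",
    "task_stability",
)


def score_decision_matrix(favours_fine_tuning: dict[str, bool], threshold: int = 3) -> dict:
    """Count the dimensions that point to fine-tuning and apply the threshold rule."""
    missing = [d for d in DIMENSIONS if d not in favours_fine_tuning]
    if missing:
        raise ValueError(f"Missing dimensions: {missing}")
    votes = sum(bool(favours_fine_tuning[d]) for d in DIMENSIONS)
    recommendation = "fine-tune" if votes >= threshold else "prompting / RAG"
    return {"votes_for_fine_tuning": votes, "recommendation": recommendation}


# Hypothetical project: strict schema, enough data, specialized vocabulary,
# relaxed latency budget, stable task definition.
example_project = {
    "consistency_requirement": True,
    "training_data_availability": True,
    "domain_vocabulary": True,
    "latency_budget": False,
    "task_stability": True,
}
print(score_decision_matrix(example_project))
```

Treat the result as a starting point for discussion: the matrix says fine-tuning is "probably" right at three or more votes, not that it is guaranteed.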
Signals That Fine-Tuning Is the Right Move
Signal 1: Prompting Fails to Produce Consistent Format
# The problem: model ignores format instructions under adversarial input
from openai import OpenAI
import json
client = OpenAI()
SYSTEM = """Extract drug information and return ONLY valid JSON:
{
  "drug_name": "string",
  "drug_class": "string",
  "indication": "string",
  "max_daily_dose_mg": number
}"""
# Works fine on clean input:
clean_input = "Metformin is a biguanide used for type 2 diabetes, max dose 2000 mg/day."
# Breaks on messy clinical note:
messy_input = (
    "Pt started on met 500 bid for T2DM, titrating to max per guidelines. "
    "Also on lisinopril 10mg for HTN - unrelated to today's visit. "
    "Family history: father had DM too, started insulin eventually."
)

def extract_drug_info(text: str) -> dict | None:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": SYSTEM},
            {"role": "user", "content": text}
        ],
        temperature=0.0
    )
    content = response.choices[0].message.content.strip()
    try:
        return json.loads(content)
    except json.JSONDecodeError:
        return None  # Model added explanation text around the JSON

# With prompting: ~85% parse success rate on messy notes
# With a fine-tuned model trained on 1,000 clinical notes: ~99.2% parse success rate
result = extract_drug_info(messy_input)
print(f"Parsed: {result is not None}")

Rule: If you run 100 test examples through your prompt and fewer than 95% produce correctly structured output, fine-tuning is likely worth it.
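One way to apply this rule is to collect the raw model outputs for your test set and measure how many parse as JSON and contain every required key. The sketch below assumes the outputs are already gathered as strings. The `structured_output_rate` helper and the sample outputs are illustrative, and the required keys mirror the schema in the system prompt.

```python
import json


def structured_output_rate(raw_outputs: list[str], required_keys: set[str]) -> float:
    """Fraction of outputs that are JSON objects containing every required key."""
    if not raw_outputs:
        raise ValueError("Need at least one output to compute a rate")
    ok = 0
    for raw in raw_outputs:
        try:
            parsed = json.loads(raw)
        except json.JSONDecodeError:
            continue  # extra prose around the JSON, truncation, etc.
        if isinstance(parsed, dict) and required_keys.issubset(parsed):
            ok += 1
    return ok / len(raw_outputs)


REQUIRED_KEYS = {"drug_name", "drug_class", "indication", "max_daily_dose_mg"}

# Hypothetical raw outputs standing in for a real 100-example test run.
sample_outputs = [
    '{"drug_name": "metformin", "drug_class": "biguanide", '
    '"indication": "type 2 diabetes", "max_daily_dose_mg": 2000}',
    'Here is the JSON: {"drug_name": "metformin"}',  # fails to parse
    '{"drug_name": "lisinopril"}',                   # parses, but keys are missing
]

rate = structured_output_rate(sample_outputs, REQUIRED_KEYS)
print(f"Structured output rate: {rate:.1%}")
print("Below 95%: consider fine-tuning" if rate < 0.95 else "Prompting is holding up")
```

Checking required keys as well as parse success matters because an output can be valid JSON and still violate the schema.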
Signal 2: Domain Vocabulary Is Highly Specialized
Medical, legal, and financial domains have terminology that base models handle inconsistently. Fine-tuning teaches the model the local vocabulary reliably.
# Vocabulary test: does the base model know your domain's abbreviations?
from openai import OpenAI
client = OpenAI()
domain_abbreviations = [
    ("HTN", "hypertension"),
    ("eGFR", "estimated glomerular filtration rate"),
    ("T2DM", "type 2 diabetes mellitus"),
    ("MSSA", "methicillin-susceptible Staphylococcus aureus"),
    ("LVEF", "left ventricular ejection fraction"),
    # Latin for "as needed". The check below is a substring match, so keep
    # the expected text to the expansion the model is likely to return.
    ("PRN", "pro re nata"),
]

def test_abbreviation_knowledge(abbrev: str, expected: str) -> bool:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": f"What does the medical abbreviation '{abbrev}' stand for? Give only the expansion, nothing else."
        }],
        temperature=0.0
    )
    answer = response.choices[0].message.content.strip().lower()
    return expected.lower() in answer

results = {abbrev: test_abbreviation_knowledge(abbrev, expansion)
           for abbrev, expansion in domain_abbreviations}
score = sum(results.values()) / len(results)
print(f"Domain vocabulary accuracy: {score:.0%}")
print(results)
# If score is under 80%, fine-tuning on domain text is likely to help

Signal 3: You Are Paying Too Much for System Prompts
# Cost analysis: system prompt token cost over time
def calculate_prompt_vs_finetune_cost(
    daily_requests: int,
    system_prompt_tokens: int,
    avg_user_tokens: int,
    avg_output_tokens: int,
    input_price_per_million: float,   # e.g., 0.15 for gpt-4o-mini
    output_price_per_million: float,  # e.g., 0.60 for gpt-4o-mini
    finetune_cost_usd: float,         # one-time cost
    days: int = 365
) -> dict:
    """Compare cumulative prompting cost vs fine-tuning cost.

    Assumes the fine-tuned model is billed at the same per-token prices as
    the base model. Check your provider's fine-tuned model pricing, which
    may be higher and would lengthen the break-even period.
    """
    # Prompting cost: system prompt + user message every request
    total_input_tokens_per_day = daily_requests * (system_prompt_tokens + avg_user_tokens)
    total_output_tokens_per_day = daily_requests * avg_output_tokens
    daily_input_cost = (total_input_tokens_per_day / 1_000_000) * input_price_per_million
    daily_output_cost = (total_output_tokens_per_day / 1_000_000) * output_price_per_million
    daily_prompt_cost = daily_input_cost + daily_output_cost
    cumulative_prompt_cost = daily_prompt_cost * days

    # Fine-tuned model: no system prompt, same user message + output
    ft_input_per_day = daily_requests * avg_user_tokens
    ft_output_per_day = daily_requests * avg_output_tokens
    daily_ft_cost = (
        (ft_input_per_day / 1_000_000) * input_price_per_million +
        (ft_output_per_day / 1_000_000) * output_price_per_million
    )
    cumulative_ft_cost = finetune_cost_usd + (daily_ft_cost * days)

    # Break-even
    savings_per_day = daily_prompt_cost - daily_ft_cost
    break_even_days = finetune_cost_usd / savings_per_day if savings_per_day > 0 else float("inf")
    return {
        "daily_prompt_cost_usd": round(daily_prompt_cost, 4),
        "daily_ft_inference_cost_usd": round(daily_ft_cost, 4),
        f"cumulative_prompt_cost_{days}d_usd": round(cumulative_prompt_cost, 2),
        f"cumulative_ft_cost_{days}d_usd": round(cumulative_ft_cost, 2),
        "break_even_days": round(break_even_days, 1),
        "savings_per_day_usd": round(savings_per_day, 4),
    }

# Scenario: 10,000 drug queries/day, 800-token system prompt
result = calculate_prompt_vs_finetune_cost(
    daily_requests=10_000,
    system_prompt_tokens=800,
    avg_user_tokens=80,
    avg_output_tokens=200,
    input_price_per_million=0.15,
    output_price_per_million=0.60,
    finetune_cost_usd=500,
    days=365
)
for k, v in result.items():
    print(f"  {k}: {v}")
# This scenario saves $1.20/day (8M system-prompt tokens × $0.15 per million),
# so break-even takes about 417 days. Break-even drops under 30 days only with
# much higher volume, a longer system prompt, or a more expensive base model.

When NOT to Fine-Tune
Do Not Fine-Tune With Fewer Than 100 Examples
With fewer than 100 examples, the model is very likely to overfit. The typical symptoms are:
- Training loss drops to near zero
- Validation loss stays high or increases
- The model memorizes training examples verbatim
# Detecting overfitting early in training
import matplotlib.pyplot as plt
def plot_training_curves(train_losses: list[float], val_losses: list[float]):
    """Visualize overfitting pattern."""
    steps = list(range(len(train_losses)))
    plt.figure(figsize=(10, 5))
    plt.plot(steps, train_losses, label="Training Loss", color="blue")
    plt.plot(steps, val_losses, label="Validation Loss", color="red")
    plt.xlabel("Training Steps")
    plt.ylabel("Loss")
    plt.title("Training vs Validation Loss")
    plt.legend()

    # Overfitting diagnostic
    final_gap = val_losses[-1] - train_losses[-1]
    if final_gap > 0.5:
        print(f"WARNING: Large train/val gap ({final_gap:.2f}). "
              f"Likely overfitting. Consider: more data, lower learning rate, "
              f"fewer epochs, or add dropout.")
    else:
        print(f"Train/val gap: {final_gap:.2f}, looks healthy.")
    plt.tight_layout()
    plt.savefig("training_curves.png")

# Example: overfitting with only 50 examples
# train_losses = [2.3, 1.8, 1.2, 0.6, 0.2, 0.05]
# val_losses   = [2.3, 2.1, 2.2, 2.4, 2.7, 3.1]  ← classic overfit

Do Not Fine-Tune When the Task Changes Frequently
If your output schema or task definition changes every few weeks, you would need to retrain constantly. Prompting handles task evolution at zero cost.
# Task stability assessment
from datetime import date, timedelta
def assess_task_stability(
    task_definition_changes: list[date],
    horizon_days: int = 90,
    today: date | None = None,  # pass a fixed date for reproducible results
) -> str:
    """Estimate how stable a task is over the planning horizon."""
    if not task_definition_changes:
        return "stable: good fine-tuning candidate"
    today = today or date.today()
    window_start = today - timedelta(days=horizon_days)
    # Count only changes inside the window; ignore older and future-dated entries
    recent_changes = [
        d for d in task_definition_changes
        if window_start <= d <= today
    ]
    change_rate = len(recent_changes) / (horizon_days / 30)  # changes per month
    if change_rate > 1.5:
        return (f"unstable ({change_rate:.1f} changes/month): "
                f"use prompting; fine-tuning ROI is low")
    elif change_rate > 0.5:
        return (f"moderately stable ({change_rate:.1f} changes/month): "
                f"fine-tune if consistency gains justify retraining cost")
    else:
        return (f"stable ({change_rate:.1f} changes/month): "
                f"strong fine-tuning candidate")

# Example: task that has changed 5 times in the 90 days before 2026-05-10
changes = [
    date(2026, 2, 15),
    date(2026, 3, 1),
    date(2026, 3, 20),
    date(2026, 4, 10),
    date(2026, 5, 1),
]
print(assess_task_stability(changes, today=date(2026, 5, 10)))
# "unstable (1.7 changes/month): use prompting; fine-tuning ROI is low"

Case Study: Pharmaceutical Drug Labeling
The Problem
A pharmaceutical company needs to extract structured information from unstructured drug label PDFs and output FDA-compliant JSON for regulatory submission. Each label contains:
- Drug name, manufacturer, NDC code
- Indications and usage (free text)
- Dosing instructions by patient population
- Contraindications
- Adverse reactions table
Requirement: 99.5% structured JSON output rate. Any malformed output requires human remediation at $45/hour.
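To see what this requirement means in money terms, the remediation cost can be estimated from the failure rate. This is a minimal sketch: the $45/hour rate comes from the requirement, while the volume of 10,000 labels per month and 15 minutes per fix are the figures assumed in the option comparison in this case study. The `monthly_remediation_cost` helper is illustrative.

```python
def monthly_remediation_cost(
    failure_rate: float,
    documents_per_month: int,
    minutes_per_fix: float,
    hourly_rate_usd: float,
) -> float:
    """Expected monthly cost of fixing malformed outputs by hand."""
    failures = failure_rate * documents_per_month
    hours = failures * minutes_per_fix / 60
    return hours * hourly_rate_usd


# Cost at exactly the maximum failure rate the requirement allows (0.5%).
at_requirement = monthly_remediation_cost(
    failure_rate=0.005,
    documents_per_month=10_000,
    minutes_per_fix=15,
    hourly_rate_usd=45,
)
print(f"${at_requirement:,.2f}")  # → $562.50
```

Each malformed label costs $11.25 under these assumptions, so every percentage point of failure rate adds roughly $1,125 per month at this volume.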
Option A: Prompting Only
# Approach: very detailed system prompt with examples
DRUG_LABEL_PROMPT = """You are an FDA regulatory information extractor.
Extract drug label information and return ONLY valid JSON matching this exact schema:
{
  "drug_name": "string",
  "manufacturer": "string",
  "ndc_code": "string (format: XXXXX-XXXX-XX)",
  "indications": ["string"],
  "dosing": {
    "adult": "string",
    "pediatric": "string or null",
    "renal_impairment": "string or null"
  },
  "contraindications": ["string"],
  "adverse_reactions": [{"reaction": "string", "frequency": "string"}]
}
Rules:
- Return ONLY the JSON object, no explanation
- Use null for missing fields
- Extract verbatim from the label, do not paraphrase
Example input: [300 tokens of example]
Example output: [200 tokens of example JSON]
Now extract from the following label:"""
# Result after testing on 500 labels:
# - JSON parse success rate: 91.2%
# - Correct schema rate: 87.4%
# - Error cases: model adds "Here is the extracted JSON:" prefix, truncates long arrays
# - System prompt cost: ~600 tokens × 10,000 labels/month = 6M tokens = $0.90/month extra
# - Remediation cost: 0.126 failure rate × 10,000 × 15 min × $45/hr = $14,175/month

Option B: RAG
RAG does not apply here: the input text IS the knowledge source. The model needs to extract from it, not retrieve from elsewhere. RAG solves knowledge gaps, not structured extraction.
Option C: Fine-Tuned Model
# Fine-tuned on 2,000 labeled drug documents (human-verified JSON pairs)
# Training: QLoRA on Llama 3.1 8B, rank 16, 3 epochs, 4 hours on A100
# Results after fine-tuning:
fine_tuning_results = {
    "json_parse_success_rate": 0.998,        # vs 0.912 for prompting
    "correct_schema_rate": 0.996,            # vs 0.874 for prompting
    "avg_tokens_per_request": 180,           # vs 780 for prompting (600 system prompt)
    "monthly_training_cost_amortized": 42,   # $500 training ÷ 12 months
    "monthly_inference_savings": 0.90,       # no system prompt tokens
    "monthly_remediation_cost": 450.00,      # 0.004 × 10,000 × 15 min × $45/hr
}
# Monthly cost comparison:
# Prompting:  $14,175 (remediation) + inference cost
# Fine-tuned: $450 (remediation) + $42 (amortized training) = $492/month + inference cost

The Verdict
Fine-tuning wins decisively for pharmaceutical drug labeling because:
- The output format is rigid and legally important
- The vocabulary is specialized (dosing schedules, drug classes, NDC codes)
- Volume is high enough that training amortizes quickly
- The task definition (FDA schema) changes at most once a year
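The "correct schema rate" used in this comparison has to be measured with something stricter than `json.loads`. A minimal validator for the drug label schema might look like the sketch below. It assumes the `X` placeholders in the NDC format stand for digits, and the `schema_errors` helper and the sample record are hypothetical.

```python
import re

# Assumption: each X in "XXXXX-XXXX-XX" is a digit.
NDC_PATTERN = re.compile(r"^\d{5}-\d{4}-\d{2}$")
REQUIRED_TOP_LEVEL = {
    "drug_name", "manufacturer", "ndc_code", "indications",
    "dosing", "contraindications", "adverse_reactions",
}
DOSING_KEYS = {"adult", "pediatric", "renal_impairment"}


def schema_errors(record: dict) -> list[str]:
    """Return a list of schema violations; an empty list means the record passes."""
    errors = []
    missing = REQUIRED_TOP_LEVEL - set(record)
    if missing:
        errors.append(f"missing keys: {sorted(missing)}")
    ndc = record.get("ndc_code")
    if not (isinstance(ndc, str) and NDC_PATTERN.match(ndc)):
        errors.append("ndc_code does not match XXXXX-XXXX-XX")
    for key in ("indications", "contraindications"):
        value = record.get(key)
        if not (isinstance(value, list) and all(isinstance(v, str) for v in value)):
            errors.append(f"{key} must be a list of strings")
    dosing = record.get("dosing")
    if not (isinstance(dosing, dict) and DOSING_KEYS <= set(dosing)
            and isinstance(dosing.get("adult"), str)):
        errors.append("dosing must have adult (string), pediatric, renal_impairment")
    reactions = record.get("adverse_reactions")
    if not (isinstance(reactions, list) and all(
            isinstance(r, dict) and {"reaction", "frequency"} <= set(r)
            for r in reactions)):
        errors.append("adverse_reactions must be a list of {reaction, frequency}")
    return errors


# Hypothetical record for illustration only.
sample = {
    "drug_name": "ExampleDrug",
    "manufacturer": "Example Pharma",
    "ndc_code": "12345-6789-01",
    "indications": ["type 2 diabetes mellitus"],
    "dosing": {"adult": "500 mg twice daily", "pediatric": None, "renal_impairment": None},
    "contraindications": ["severe renal impairment"],
    "adverse_reactions": [{"reaction": "nausea", "frequency": "common"}],
}
print(schema_errors(sample))
print(schema_errors({**sample, "ndc_code": "1234-567"}))
```

Running a validator like this over a held-out set gives the schema rate for either approach, so the prompting and fine-tuned numbers are measured the same way.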
Summary Decision Tree
Is your output format flexible?
├─ Yes → Try prompting first. Fine-tune only if quality is insufficient.
└─ No → Is your dataset larger than 200 examples?
   ├─ No → Collect more data before fine-tuning.
   └─ Yes → Does the task change more than once a month?
      ├─ Yes → Use prompting (fine-tuning ROI is low).
      └─ No → Fine-tune. Calculate break-even point first.

The framework is simple: fine-tuning is a capital investment. Like all capital investments, it pays off when volume is high, the task is stable, and consistency requirements are strict.
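For teams that want to apply the summary decision tree inside a planning script, it translates directly into a function. This is a sketch: the `recommend_approach` name and its parameters are illustrative, and the thresholds (200 examples, one change per month) are the ones from the tree.

```python
def recommend_approach(
    format_is_flexible: bool,
    num_examples: int,
    task_changes_per_month: float,
) -> str:
    """Walk the summary decision tree from top to bottom."""
    if format_is_flexible:
        return "Try prompting first; fine-tune only if quality is insufficient."
    if num_examples <= 200:
        return "Collect more data before fine-tuning."
    if task_changes_per_month > 1:
        return "Use prompting (fine-tuning ROI is low)."
    return "Fine-tune, after calculating the break-even point."


# The drug labeling case study: rigid format, 2,000 examples,
# schema changes at most once a year.
print(recommend_approach(False, 2_000, 1 / 12))
```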