Learnixo
Back to blog
AI Systemsbeginner

RAG vs Fine-Tuning

When to choose RAG over fine-tuning and vice versa — the decision framework based on knowledge type, update frequency, cost, and latency requirements.

Asma Hafeez KhanMay 16, 20264 min read
RAGFine-TuningLLMsArchitecture DecisionInterview
Share:𝕏

The Core Distinction

Fine-tuning encodes knowledge into MODEL WEIGHTS:
  Advantage: fast inference (no retrieval step)
  Advantage: captures style, format, domain-specific patterns
  Disadvantage: expensive to update (retrain when knowledge changes)
  Disadvantage: knowledge may degrade ("catastrophic forgetting")
  Disadvantage: can't cite sources — knowledge is baked in opaquely

RAG stores knowledge in an EXTERNAL KNOWLEDGE BASE:
  Advantage: update instantly (add/remove/edit documents)
  Advantage: traceable — can cite the source document
  Advantage: scales to very large knowledge bases
  Disadvantage: retrieval latency and cost at each query
  Disadvantage: retrieval quality limits answer quality

Decision Framework

Choose RAG when:
  ✓ Knowledge updates frequently (drug formulary, clinical guidelines, prices)
  ✓ You need source traceability ("According to NICE NG196...")
  ✓ Knowledge base is large (thousands of documents)
  ✓ Low volume of unique query types (wide knowledge base, diverse queries)
  ✓ You don't have enough labelled examples to fine-tune well
  ✓ Hallucination is unacceptable (clinical, legal, financial)

Choose fine-tuning when:
  ✓ Knowledge is stable and doesn't change often
  ✓ Task requires specific FORMAT, TONE, or STYLE (not just facts)
  ✓ High query volume with tight latency SLA (no retrieval step)
  ✓ Domain vocabulary is very specialised and poorly handled by base model
  ✓ Task is CLASSIFICATION or structured extraction (not open-ended Q&A)
  ✓ You have enough high-quality labelled examples (>1000)

Use BOTH when:
  Fine-tune for style/format + RAG for factual grounding
  Example: fine-tune on clinical note style, RAG for patient-specific facts

Knowledge Type Matters Most

Factual, citable knowledge → RAG
  Clinical guidelines (updated regularly)
  Drug information (dosing, interactions, contraindications)
  Company policies and procedures
  Legal regulations

Behavioural patterns → Fine-tuning
  "Always use formal clinical language"
  "Format output as HL7 FHIR JSON"
  "Classify this ICD-10 code as level 1, 2, or 3"
  "Write in the style of a physician discharge summary"

Both:
  Medical assistant: fine-tuned for clinical response style + RAG for current guidelines

Cost Comparison

RAG costs:
  Indexing: one-time embedding cost (cheap: ~$0.002/1M tokens for text-embedding-3-small)
  Storage: vector database (Chroma: free; Azure AI Search: $100-1000/month)
  Retrieval: embedding query + vector search (fast, cheap)
  LLM: larger context window due to injected documents (costs more per query)
  
Fine-tuning costs:
  Training: GPU compute × hours (GPT-4o fine-tuning: ~$0.025/1K tokens input)
  Retraining: every time knowledge changes
  Hosting: dedicated fine-tuned model endpoint
  
Rough comparison:
  For a knowledge base that changes monthly:
    RAG: $50-500/month (storage + retrieval overhead)
    Fine-tuning: $500-5000/month (monthly retraining + hosting)
  
  For stable knowledge, very high query volume (10M+/month):
    Fine-tuning may be cheaper (no per-query context overhead)

Latency Comparison

RAG at 100ms budget:
  Embed query:  10-30ms
  Vector search: 10-50ms
  LLM call:     varies (dominant factor)
  Total retrieval overhead: 20-80ms — significant for sub-200ms SLAs

Fine-tuning at 100ms budget:
  No retrieval step
  Shorter context (no documents injected) → faster LLM call
  Total: faster by 30-50ms for the same model

For interactive real-time applications with < 200ms SLA:
  Fine-tuning or streaming RAG (show retrieved context progressively)

Hybrid: Fine-Tune + RAG

The most capable production systems combine both:

Layer 1: Fine-tuned base model
  Trained on domain examples (clinical note format, medical reasoning style)
  Handles: tone, format, domain vocabulary, reasoning patterns

Layer 2: RAG at inference
  Retrieves current guidelines, patient-specific data, formulary
  Handles: up-to-date facts, specific clinical questions, citations

Example: ClinicalBERT fine-tuned on discharge summary style + RAG on hospital formulary
  → Writes in the right clinical style, with current drug information

Interview Answer

"RAG and fine-tuning address different problems: RAG injects external knowledge at inference time — best for frequently-updated information that must be citable (clinical guidelines, drug formularies). Fine-tuning encodes knowledge into model weights — best for stable domain-specific FORMAT, STYLE, or classification behaviour where retrieval latency is unacceptable. Decision framework: if knowledge changes monthly → RAG; if you need source citations → RAG; if you have thousands of labelled classification examples → fine-tuning; if the task is stylistic rather than factual → fine-tuning. Production systems often combine both: fine-tune for clinical language style, RAG for current formulary and guidelines."

Enjoyed this article?

Explore the AI Systems learning path for more.

Found this helpful?

Share:𝕏

Leave a comment

Have a question, correction, or just found this helpful? Leave a note below.