What Is RAG?
What Retrieval-Augmented Generation is, why it exists, and how it solves the hallucination and knowledge cutoff problems of standalone LLMs.
The Problem RAG Solves
An LLM's knowledge is fixed: it knows only what was in its training data. This leads to two fundamental limitations:
1. Knowledge cutoff:
GPT-4 (original release) training data: up to ~September 2021
Your hospital's formulary was updated last month
New NICE guidelines were published last week
→ The LLM doesn't know any of this
2. Hallucination:
LLMs confidently generate plausible-sounding text
"Warfarin dose for CYP2C9 poor metabolisers is 2.5mg daily"
→ Might be correct, might be wrong — the LLM can't tell you which
→ No traceability: you can't verify where this came from
RAG solves both by retrieving relevant, verified documents before generating the answer.
What RAG Does
Without RAG:
User: "What is the current NICE guidance on Warfarin monitoring?"
LLM: [generates from training data, possibly outdated, possibly wrong]
With RAG:
1. Retrieve: search your knowledge base for "NICE Warfarin monitoring"
→ returns 5 relevant document chunks from your indexed guidelines
2. Augment: inject the retrieved chunks into the prompt:
"Answer based on this context: [NICE guideline excerpt]..."
3. Generate: LLM answers based on the retrieved context
→ answer is grounded in the actual guideline
→ can cite the source: "According to NICE NG196, Section 3.2..."
The RAG Architecture
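As a rough illustration of the retrieve, augment, and generate steps, here is a minimal sketch in plain Python. Everything in it is an assumption made for the example: the function names are hypothetical, the document names and texts are invented, a word-count vector stands in for a real embedding model, an in-memory list stands in for a vector store, and `llm` is a placeholder for whatever model client you actually use.

```python
import math
import re
from collections import Counter

# Invented toy documents; a real knowledge base would hold your guidelines and protocols.
DOCUMENTS = {
    "anticoagulation-protocol": (
        "Warfarin monitoring requires regular INR blood tests. "
        "Adjust the warfarin dose when the INR is outside the target range."
    ),
    "pain-protocol": "Paracetamol is the first choice for mild pain in adults.",
}


def chunk_words(text: str, size: int = 12, overlap: int = 3) -> list[str]:
    """Indexing: split text into windows of `size` words that share `overlap` words."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    words = text.split()
    chunks = []
    for start in range(0, len(words), size - overlap):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # this window already reached the end of the text
    return chunks


def embed(text: str) -> Counter:
    """Stand-in embedding: lowercase word counts (assumption; real systems use a neural model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def build_index(documents: dict[str, str]) -> list[dict]:
    """Indexing: chunk each document, embed each chunk, and store it with its source."""
    index = []
    for source, text in documents.items():
        for position, chunk in enumerate(chunk_words(text)):
            index.append({"source": source, "position": position,
                          "text": chunk, "vector": embed(chunk)})
    return index


def retrieve(query: str, index: list[dict], k: int = 2) -> list[dict]:
    """Step 1: embed the query and return the k most similar chunks."""
    q = embed(query)
    return sorted(index, key=lambda c: cosine(q, c["vector"]), reverse=True)[:k]


def augment(query: str, retrieved: list[dict]) -> str:
    """Step 2: inject the retrieved chunks, labelled with their sources, into the prompt."""
    context = "\n".join(f"[{c['source']}#{c['position']}] {c['text']}" for c in retrieved)
    return ("Answer using only the context below and cite the source labels.\n\n"
            f"Context:\n{context}\n\nQuestion: {query}")


def generate(prompt: str, llm=None) -> str:
    """Step 3: pass the augmented prompt to a model; `llm` is any callable str -> str."""
    return llm(prompt) if llm is not None else prompt  # no model wired in: return the prompt


index = build_index(DOCUMENTS)
query = "What monitoring does warfarin need?"
print(generate(augment(query, retrieve(query, index, k=2))))
```

With no `llm` supplied, the script prints the augmented prompt itself, which makes it easy to inspect which chunks were retrieved and how they are labelled for citation. Swapping in a real embedding model and vector store changes `embed` and `build_index` but leaves the three-step flow the same.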
                    ┌──────────────────────────┐
Documents ──────→   │ Document Processing      │
(guidelines,        │ - Chunking               │
protocols,          │ - Embedding              │
notes)              │ - Vector Store Index     │
                    └──────────────┬───────────┘
                                   │
User Query ──────────────────────→ │
                                   ▼
                    ┌──────────────────────────┐
                    │ Retrieval                │
                    │ - Embed query            │
                    │ - Search vector store    │
                    │ - Return top-k chunks    │
                    └──────────────┬───────────┘
                                   │
                    ┌──────────────▼───────────┐
                    │ Augmented Prompt         │
                    │ System + Context +       │
                    │ User Query               │
                    └──────────────┬───────────┘
                                   │
                    ┌──────────────▼───────────┐
                    │ LLM Generation           │
                    │ → Grounded Answer        │
                    └──────────────────────────┘
What RAG Is Not
RAG is NOT fine-tuning:
Fine-tuning changes the model weights: it is expensive and requires curated training data
RAG changes only the model's context at inference time: the weights stay the same
RAG is NOT a database query:
A database returns exact records that match a query
RAG retrieves semantically similar documents — approximate, not exact
RAG is NOT a guarantee of accuracy:
The retrieved document might be outdated
The model might not faithfully follow the retrieved context
The relevant document might not exist in your knowledge base
When to Use RAG
Use RAG when:
Your knowledge base updates frequently (guidelines, protocols, drug info)
Answers must be traceable to specific sources
Domain knowledge is highly specialised and not well-represented in LLMs
Hallucination risk is unacceptable (clinical, legal, financial)
You need to restrict the model to a specific corpus
Don't use RAG when:
The query only requires reasoning over general knowledge (use the LLM directly)
You need real-time data (RAG is limited to your indexed corpus)
Latency is extremely tight (RAG adds retrieval time)
Your corpus is tiny (just put it all in the context window)
Interview Answer
"RAG (Retrieval-Augmented Generation) extends LLMs with a retrieval step: before generating, the system searches a knowledge base for relevant documents and injects them into the prompt as context. This solves two core LLM limitations: knowledge cutoff (your knowledge base is up-to-date regardless of the LLM's training date) and hallucination (the model answers based on retrieved, verified documents rather than from parametric knowledge, enabling source citations). The pipeline has three stages: index the knowledge base (chunk, embed, store), retrieve top-k relevant chunks at query time, then generate a grounded answer from the LLM with the retrieved context injected."