What Is RAG?

What Retrieval-Augmented Generation is, why it exists, and how it addresses the hallucination and knowledge cutoff problems of standalone LLMs.

Asma Hafeez Khan | May 16, 2026 | 3 min read
Tags: RAG, LLMs, Retrieval, Grounding, Interview

The Problem RAG Solves

LLMs have fixed knowledge — what was in their training data. They have two fundamental limitations:

1. Knowledge cutoff:
   GPT-4 (original release) training data: up to ~September 2021
   Your hospital's formulary was updated last month
   New NICE guidelines were published last week
   → The LLM doesn't know any of this

2. Hallucination:
   LLMs confidently generate plausible-sounding text
   "Warfarin dose for CYP2C9 poor metabolisers is 2.5mg daily"
   → Might be correct, might be wrong — the LLM can't tell you which
   → No traceability: you can't verify where this came from

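RAG addresses both by retrieving relevant documents from a knowledge base you control and passing them to the model before it generates the answer. The answer can then be checked against its sources.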


What RAG Does

Without RAG:
  User: "What is the current NICE guidance on Warfarin monitoring?"
  LLM: [generates from training data, possibly outdated, possibly wrong]

With RAG:
  1. Retrieve: search your knowledge base for "NICE Warfarin monitoring"
     → returns 5 relevant document chunks from your indexed guidelines
  
  2. Augment: inject the retrieved chunks into the prompt:
     "Answer based on this context: [NICE guideline excerpt]..."
  
  3. Generate: LLM answers based on the retrieved context
     → answer is grounded in the actual guideline
     → can cite the source: "According to NICE NG196, Section 3.2..."
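
The three steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a production pipeline: the knowledge base entries, the source labels, the keyword-overlap `retrieve` function, and the `fake_llm` stub are all hypothetical stand-ins. A real system would typically use embedding-based search and an actual model client.

```python
from typing import Callable

# Hypothetical in-memory knowledge base. The texts are placeholders,
# not real guideline content.
KNOWLEDGE_BASE = [
    {"source": "doc-warfarin-monitoring",
     "text": "Placeholder text about warfarin monitoring and INR checks."},
    {"source": "doc-formulary",
     "text": "Placeholder text about the hospital formulary update."},
    {"source": "doc-hand-hygiene",
     "text": "Placeholder text about hand hygiene on wards."},
]


def retrieve(query: str, k: int = 2) -> list[dict]:
    """Step 1: rank chunks by how many query words they share (toy keyword overlap)."""
    query_words = set(query.lower().split())

    def overlap(chunk: dict) -> int:
        chunk_words = set(chunk["text"].lower().replace(".", "").split())
        return len(query_words & chunk_words)

    ranked = sorted(KNOWLEDGE_BASE, key=overlap, reverse=True)
    return ranked[:k]


def augment(query: str, chunks: list[dict]) -> str:
    """Step 2: inject the retrieved chunks, labelled by source, into the prompt."""
    context = "\n".join(f"[{c['source']}] {c['text']}" for c in chunks)
    return (
        "Answer using only the context below and cite the source in brackets.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )


def generate(prompt: str, llm: Callable[[str], str]) -> str:
    """Step 3: call whatever LLM client you use; `llm` is a stand-in callable."""
    return llm(prompt)


def fake_llm(prompt: str) -> str:
    # Stub so the sketch runs without a model; it does not answer anything.
    return f"(model output for a prompt of {len(prompt)} characters)"


query = "warfarin monitoring guidance"
top_chunks = retrieve(query)
prompt = augment(query, top_chunks)
answer = generate(prompt, fake_llm)
```

Labelling each chunk with its source inside the prompt is what makes the citation in step 3 possible: the model can only cite what it was shown.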

The RAG Architecture

                    ┌──────────────────────────┐
Documents ──────→   │  Document Processing      │
(guidelines,        │  - Chunking               │
 protocols,         │  - Embedding              │
 notes)             │  - Vector Store Index     │
                    └──────────────┬───────────┘
                                   │
User Query ──────────────────────→ │
                                   ▼
                    ┌──────────────────────────┐
                    │  Retrieval               │
                    │  - Embed query           │
                    │  - Search vector store   │
                    │  - Return top-k chunks   │
                    └──────────────┬───────────┘
                                   │
                    ┌──────────────▼───────────┐
                    │  Augmented Prompt        │
                    │  System + Context +      │
                    │  User Query              │
                    └──────────────┬───────────┘
                                   │
                    ┌──────────────▼───────────┐
                    │  LLM Generation          │
                    │  → Grounded Answer       │
                    └──────────────────────────┘
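
The indexing and retrieval boxes in the diagram can be sketched as follows. This is a toy version under loud assumptions: `embed` uses plain word counts instead of a learned embedding model, `VectorStore` is a hypothetical in-memory list rather than a real vector database, and the chunk size, overlap, and sample text are arbitrary illustrative values.

```python
import math
import re
from collections import Counter


def chunk_words(text: str, size: int = 8, overlap: int = 2) -> list[str]:
    """Split text into windows of `size` words; consecutive windows share `overlap` words.

    Assumes size > overlap so that each step moves forward.
    """
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last window already reached the end of the text
    return chunks


def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts. A real system would call an embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class VectorStore:
    """Hypothetical in-memory index: a list of (vector, chunk text) pairs."""

    def __init__(self) -> None:
        self.entries: list[tuple[Counter, str]] = []

    def add(self, text: str) -> None:
        self.entries.append((embed(text), text))

    def search(self, query: str, k: int = 3) -> list[tuple[float, str]]:
        """Embed the query, score every stored chunk, return the top-k (score, text) pairs."""
        query_vec = embed(query)
        scored = [(cosine(query_vec, vec), text) for vec, text in self.entries]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return scored[:k]


# Indexing time: chunk, embed, store. The document text is a placeholder.
document = (
    "Placeholder protocol text. Anticoagulant patients need regular blood tests. "
    "Ward staff must record every result in the chart."
)
store = VectorStore()
chunks = chunk_words(document)
for chunk in chunks:
    store.add(chunk)

# Query time: embed the query and return the top-k chunks.
results = store.search("blood tests", k=2)
```

The overlap between chunks is a common design choice: a sentence that straddles a chunk boundary still appears whole in at least one chunk.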

What RAG Is Not

RAG is NOT fine-tuning:
  Fine-tuning changes the model weights — expensive, requires data
  RAG changes the model's context at inference time — no weight change
  
RAG is NOT a database query:
  A database returns exact records that match a query
  RAG retrieves semantically similar documents — approximate, not exact
  
RAG is NOT a guarantee of accuracy:
  The retrieved document might be outdated
  The model might not faithfully follow the retrieved context
  The relevant document might not exist in your knowledge base
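
One of these failure modes, the relevant document not being in the knowledge base, can be partly guarded against by refusing to answer when nothing retrieved is similar enough to the query. The sketch below assumes the retriever returns `(score, text)` pairs with scores between 0 and 1; the function names, the `echo_llm` stub, and the 0.3 threshold are hypothetical, and a real threshold would need tuning on your own data. It does nothing for outdated documents or for a model that ignores its context.

```python
from typing import Callable

NO_ANSWER = "I could not find this in the indexed documents."


def select_context(scored_chunks: list[tuple[float, str]],
                   min_score: float = 0.3) -> list[str]:
    """Keep only chunks whose similarity score clears the (assumed) threshold."""
    return [text for score, text in scored_chunks if score >= min_score]


def answer_or_abstain(query: str,
                      scored_chunks: list[tuple[float, str]],
                      llm: Callable[[str], str],
                      min_score: float = 0.3) -> str:
    """Call the LLM only if at least one chunk is relevant enough; otherwise abstain."""
    context = select_context(scored_chunks, min_score)
    if not context:
        return NO_ANSWER
    prompt = (
        "Answer only from the context below. "
        "If it does not contain the answer, say so.\n\n"
        "Context:\n" + "\n".join(context) + f"\n\nQuestion: {query}"
    )
    return llm(prompt)


def echo_llm(prompt: str) -> str:
    # Stub model: echoes the last line of the prompt so the sketch runs offline.
    return "LLM called with: " + prompt.splitlines()[-1]


weak_hits = [(0.12, "Unrelated chunk"), (0.05, "Another unrelated chunk")]
strong_hits = [(0.81, "Relevant chunk"), (0.10, "Noise")]

abstained = answer_or_abstain("example question", weak_hits, echo_llm)
answered = answer_or_abstain("example question", strong_hits, echo_llm)
```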

When to Use RAG

Use RAG when:
  Your knowledge base updates frequently (guidelines, protocols, drug info)
  Answers must be traceable to specific sources
  Domain knowledge is highly specialised and not well-represented in LLMs
  Hallucination risk is unacceptable (clinical, legal, financial)
  You need to restrict the model to a specific corpus

Don't use RAG when:
  The query only needs reasoning over general knowledge (use the LLM directly)
  You need real-time data (RAG is limited to your indexed corpus)
  Latency is extremely tight (RAG adds retrieval time)
  Your corpus is tiny (just put it all in the context window)
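
The "tiny corpus" case can be checked with a rough size estimate before building any retrieval pipeline. The sketch below is an assumption-heavy heuristic: it uses about four characters per token, a common rule of thumb for English text that varies by tokenizer, and the `reserved` budget for instructions and the answer is an arbitrary illustrative number. For a real decision, count tokens with your model's own tokenizer.

```python
import math


def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Very rough token estimate; real counts depend on the model's tokenizer."""
    return math.ceil(len(text) / chars_per_token)


def fits_in_context(documents: list[str],
                    context_window: int,
                    reserved: int = 2000) -> bool:
    """True if all documents, plus a reserve for instructions and the answer, fit."""
    total = sum(estimate_tokens(doc) for doc in documents)
    return total + reserved <= context_window


# Placeholder corpus: two documents of 4,000 and 8,000 characters.
corpus = ["a" * 4000, "b" * 8000]

small_enough = fits_in_context(corpus, context_window=8000)
too_large = fits_in_context(corpus, context_window=4000)
```

If the whole corpus fits with room to spare, putting it directly in the prompt avoids retrieval errors entirely; once it stops fitting, retrieval becomes the way to select which parts the model sees.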

Interview Answer

"RAG (Retrieval-Augmented Generation) extends LLMs with a retrieval step: before generating, the system searches a knowledge base for relevant documents and injects them into the prompt as context. This addresses two core LLM limitations: knowledge cutoff (the knowledge base can be kept up to date regardless of the LLM's training date) and hallucination (the model answers from retrieved documents rather than from parametric knowledge alone, which makes source citations possible). It reduces hallucination but does not guarantee accuracy: the retrieved document can be outdated or missing, and the model may not follow the context faithfully. The pipeline has three stages: index the knowledge base (chunk, embed, store), retrieve the top-k relevant chunks at query time, then have the LLM generate a grounded answer with the retrieved context injected."
