What Is RAG?

Retrieval-Augmented Generation: fetch relevant docs, inject into LLM context, reduce hallucination, keep knowledge current, and cite sources.

Asma Hafeez Khan · May 15, 2026 · 6 min read
Tags: RAG, Vector Search, LLM, Retrieval, Hallucination, Knowledge

Retrieval-Augmented Generation (RAG) is an architecture that grounds a Large Language Model's answers in real, retrieved documents rather than relying solely on knowledge baked into its weights. The core idea is simple: before asking the LLM to answer, you retrieve the most relevant documents from an external store and inject them into the prompt as context.

Why LLMs Hallucinate Without RAG

LLMs are frozen snapshots of the world at training time. Ask GPT-4 about a regulation published last month, an internal policy document, or a customer's account details, and it will either refuse or confidently fabricate an answer. This is the hallucination problem: the model's probability distribution over tokens has no anchor to your specific facts.

RAG solves this by providing the anchor at inference time.

The Three Core Promises of RAG

1. Reduce hallucination: the model is instructed to answer only from the provided context. If the answer isn't there, it is told to say so rather than guess.

2. Keep knowledge current: update your document store without retraining the model. When new product specs, updated policies, or fresh research papers arrive, add them to the index and they're immediately available.

3. Cite sources: because you know exactly which chunks were retrieved, you can tell the model to cite them, giving users a traceable path back to the original document.
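
The citation promise can be made concrete with a small prompt builder. The sketch below is illustrative only: the function name `build_cited_prompt`, the `id` and `text` chunk fields, and the bracketed-marker convention are assumptions made for this example, not a fixed API.

```python
def build_cited_prompt(question: str, chunks: list[dict]) -> tuple[str, dict[str, str]]:
    """Number each retrieved chunk and return (prompt, marker -> source id map).

    Assumption: each chunk is a dict with hypothetical "id" and "text" keys.
    """
    lines = []
    sources = {}
    for n, chunk in enumerate(chunks, start=1):
        marker = f"[{n}]"
        lines.append(f"{marker} {chunk['text']}")
        sources[marker] = chunk["id"]  # lets the app map a cited marker back to a document

    context = "\n".join(lines)
    prompt = (
        "Answer ONLY from the context below and cite the markers you used, e.g. [1].\n"
        "If the answer is not in the context, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )
    return prompt, sources


chunks = [
    {"id": "doc1", "text": "Our refund policy allows returns within 30 days of purchase."},
    {"id": "doc3", "text": "We offer a lifetime warranty on all hardware products."},
]
prompt, sources = build_cited_prompt("Can I return an item after 3 weeks?", chunks)
print(prompt)
print(sources)
```

Keeping the marker-to-id map outside the prompt means the application, not the model, resolves a cited `[1]` back to its source document.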

The End-to-End RAG Flow

User Query
    │
    ▼
┌─────────────────────┐
│  Query Encoder      │  ← embed the query with the same model used at ingestion
└────────┬────────────┘
         │  query vector
         ▼
┌─────────────────────┐
│  Vector Store       │  ← ANN search: top-k most similar chunks
│ (Pinecone, pgvector,│
│   Qdrant, FAISS…)   │
└────────┬────────────┘
         │  retrieved chunks (text + metadata)
         ▼
┌─────────────────────┐
│  Prompt Builder     │  ← assemble: system prompt + chunks + user question
└────────┬────────────┘
         │  full prompt
         ▼
┌─────────────────────┐
│  LLM                │  ← GPT-4o, Claude 3.5, Mistral, Llama 3…
└────────┬────────────┘
         │  grounded answer (with citations)
         ▼
      Response

Ingestion Flow (Offline)

Before you can retrieve, you must build the index:

Raw Documents (PDFs, DOCX, HTML, DB rows…)
    │
    ▼
┌─────────────────────┐
│  Parser / Loader    │  ← extract plain text, preserve metadata
└────────┬────────────┘
         │
         ▼
┌─────────────────────┐
│  Chunker            │  ← split into passages of ~512 tokens
└────────┬────────────┘
         │
         ▼
┌─────────────────────┐
│  Embedding Model    │  ← text-embedding-3-small, E5-large, BGE-M3…
└────────┬────────────┘
         │  float32 vectors
         ▼
┌─────────────────────┐
│  Vector Store Upsert│  ← store vector + metadata + original text
└─────────────────────┘
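
The chunker step in the diagram above can be sketched in a few lines. This is a minimal illustration under stated assumptions: whitespace-separated words stand in for model tokens, the 64-word overlap between neighbouring chunks is a choice made for this example (the diagram does not specify one), and `chunk_words` is a hypothetical helper name.

```python
def chunk_words(text: str, chunk_size: int = 512, overlap: int = 64) -> list[str]:
    """Split text into windows of at most chunk_size words that share `overlap` words.

    Assumption: words approximate tokens. A real pipeline would count tokens
    with the tokenizer that matches its embedding model.
    """
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")

    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks


sample = " ".join(f"w{i}" for i in range(10))
pieces = chunk_words(sample, chunk_size=4, overlap=1)
print(pieces)
```

With ten words, a window of four, and an overlap of one, the sketch yields three chunks, each starting on the last word of the previous one. The overlap is there so that a sentence cut at a chunk boundary still appears intact in at least one chunk.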

Minimal Python RAG in 50 Lines

Python
import os
from openai import OpenAI
import numpy as np

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

# ── 1. Ingestion ──────────────────────────────────────────────────────────────

DOCS = [
    {"id": "doc1", "text": "Our refund policy allows returns within 30 days of purchase."},
    {"id": "doc2", "text": "Shipping takes 3–5 business days for standard delivery."},
    {"id": "doc3", "text": "We offer a lifetime warranty on all hardware products."},
    {"id": "doc4", "text": "Customer support is available Monday to Friday, 9 AM to 6 PM EST."},
]

def embed(texts: list[str]) -> np.ndarray:
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=texts,
    )
    return np.array([e.embedding for e in response.data], dtype=np.float32)

# Build in-memory index
doc_texts = [d["text"] for d in DOCS]
doc_vectors = embed(doc_texts)  # shape: (4, 1536)

# ── 2. Retrieval ──────────────────────────────────────────────────────────────

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    a_norm = a / np.linalg.norm(a)
    b_norm = b / np.linalg.norm(b, axis=1, keepdims=True)
    return b_norm @ a_norm

def retrieve(query: str, top_k: int = 2) -> list[str]:
    q_vec = embed([query])[0]
    scores = cosine_similarity(q_vec, doc_vectors)
    top_indices = np.argsort(scores)[::-1][:top_k]
    return [DOCS[i]["text"] for i in top_indices]

# ── 3. Generation ─────────────────────────────────────────────────────────────

def rag_answer(question: str) -> str:
    chunks = retrieve(question)
    context = "\n\n".join(f"[{i+1}] {c}" for i, c in enumerate(chunks))

    system = (
        "You are a helpful support assistant. "
        "Answer ONLY using the provided context. "
        "If the answer is not in the context, say 'I don't have that information.'"
    )
    user = f"Context:\n{context}\n\nQuestion: {question}"

    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": system},
            {"role": "user", "content": user},
        ],
        temperature=0,
    )
    return response.choices[0].message.content

# ── 4. Test ───────────────────────────────────────────────────────────────────

print(rag_answer("Can I return something I bought 3 weeks ago?"))
print(rag_answer("Do you sell groceries?"))
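
The ranking step can also be checked by hand, without an API key, using toy vectors. The sketch below is a pure-Python stand-in for the retrieval function above: the three-dimensional vectors are made up for illustration (real embeddings have hundreds or thousands of dimensions), and `cosine` and `top_k` are hypothetical helper names. Unlike `retrieve`, it returns document ids and scores, which is what a citation step would need.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec: list[float], index: dict[str, list[float]], k: int = 2) -> list[tuple[str, float]]:
    """Score every document against the query and keep the k best (id, score) pairs."""
    scored = [(doc_id, cosine(query_vec, vec)) for doc_id, vec in index.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy "embeddings", invented for this example only.
toy_index = {
    "doc1": [1.0, 0.0, 0.0],  # same direction as the query
    "doc2": [0.0, 1.0, 0.0],  # orthogonal to the query
    "doc3": [1.0, 1.0, 0.0],  # 45 degrees from the query
}
results = top_k([1.0, 0.0, 0.0], toy_index, k=2)
for doc_id, score in results:
    print(f"{doc_id}: {score:.3f}")
```

This exhaustive scan is exact and fine for a handful of documents. The vector stores named in the flow diagram replace it with approximate nearest neighbor search so that lookups stay fast on large indexes.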

RAG vs Fine-Tuning vs Prompting

| Dimension | Prompting Only | RAG | Fine-Tuning |
|---|---|---|---|
| Knowledge source | Model weights | External store | Updated weights |
| Update cost | None | Low (re-index) | High (retrain) |
| Cites sources | No | Yes | No |
| Hallucination risk | High | Low | Medium |
| Latency | Low | Medium | Low |
| Best for | General tasks | Domain knowledge, live data | Style, tone, task format |

When to use RAG: you have a body of domain documents (manuals, policies, research papers) that change frequently and that the base model does not know about.

When to fine-tune instead: you want the model to behave differently (more concise, different format, specialized vocabulary) and the knowledge is stable.

When to combine them: fine-tune for style and behavior, use RAG for facts. This is increasingly the production pattern.

The Hallucination Triangle

          Accurate
            /\
           /  \
          /    \
  Current ──── Citable

Without RAG, a model can be at most one or two of these. RAG enables all three simultaneously for your specific domain.

Key Terminology Reference

| Term | Definition |
|---|---|
| Chunk | A passage of text, typically 256–1024 tokens |
| Embedding | A dense float vector representing semantic meaning |
| ANN | Approximate Nearest Neighbor search |
| Top-k | The k chunks with highest similarity to the query |
| Context window | Maximum tokens the LLM can process at once |
| Grounding | Constraining LLM output to provided source material |
| Faithfulness | Whether the answer is supported by retrieved chunks |

Common Misconceptions

"RAG is just search." No. RAG combines retrieval with generation: the LLM synthesizes, reasons over, and rewrites the retrieved content into a coherent answer.

"Bigger embedding model always wins." Embedding quality matters, but so do chunking strategy, retrieval depth, and prompt design. A well-chunked index with a small embedding model often outperforms a poorly chunked index with a large one.

"Just put the whole document in the prompt." With 128K+ context windows, this is tempting. But it's expensive and slow, and LLMs suffer from the "lost in the middle" effect: they attend poorly to information in the center of a long context.

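The cost side of the last misconception is simple arithmetic. The sketch below compares context sizes only; the top-k value of 4 is an assumption chosen for illustration, while the 128K and 512 figures come from the text and the ingestion diagram. Hosted LLM APIs typically bill per input token, so the ratio is a rough proxy for relative input cost per request.

```python
FULL_DOCUMENT_TOKENS = 128_000  # a document that fills a 128K context window
CHUNK_TOKENS = 512              # passage size from the ingestion diagram
TOP_K = 4                       # assumption: number of chunks retrieved per query

rag_context_tokens = TOP_K * CHUNK_TOKENS
ratio = FULL_DOCUMENT_TOKENS / rag_context_tokens

print(f"RAG context: {rag_context_tokens} tokens")
print(f"Full-document context is {ratio:.1f}x larger")
```

Under these assumptions the retrieved context is 2,048 tokens, so stuffing the whole document into the prompt sends 62.5 times as much context on every request.
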
What Comes Next

The rest of this course walks through each component in depth:

  • Naive RAG: the basic pipeline end to end
  • Chunking strategies: the biggest lever on retrieval quality
  • Embedding models: how to pick the right one
  • Vector stores: architecture and tradeoffs
  • Hybrid search: combining dense and sparse retrieval
  • Advanced patterns: HyDE, reranking, self-RAG, graph RAG
  • Production concerns: cost, caching, security, evaluation

Let's build.
