What Is RAG?
Retrieval-Augmented Generation: fetch relevant docs, inject into LLM context, reduce hallucination, keep knowledge current, and cite sources.
Retrieval-Augmented Generation (RAG) is an architecture that grounds a Large Language Model's answers in real, retrieved documents rather than relying solely on knowledge baked into its weights. The core idea is simple: before asking the LLM to answer, you retrieve the most relevant documents from an external store and inject them into the prompt as context.
Why LLMs Hallucinate Without RAG
LLMs are frozen snapshots of the world at training time. Ask GPT-4 about a regulation published last month, an internal policy document, or a customer's account details, and it will either refuse or fabricate an answer with confidence. This is the hallucination problem: the model's probability distribution over tokens has no anchor to your specific facts.
RAG solves this by providing the anchor at inference time.
The Three Core Promises of RAG
1. Reduce hallucination: the model is instructed to answer only from the provided context and to say so when the answer isn't there.
2. Keep knowledge current: update your document store without retraining the model. Add new product specs, updated policies, or fresh research papers to the index and they are available to the very next query.
3. Cite sources: because you know exactly which chunks were retrieved, you can tell the model to cite them, giving users a traceable path back to the original document.
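Much of what these three promises deliver comes down to how the prompt is assembled. As a minimal sketch (the `Chunk` type, the `build_messages` helper, and the exact instruction wording are illustrative assumptions, not part of any library), retrieved chunks can be numbered and tagged with their source so the model has something concrete to cite:

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    """A retrieved passage plus the metadata needed to cite it (hypothetical type)."""
    source: str  # e.g. a file name or URL kept at ingestion time
    text: str


def build_messages(question: str, chunks: list[Chunk]) -> list[dict[str, str]]:
    """Assemble a grounded, citation-ready chat prompt from retrieved chunks."""
    # Number each chunk so the model can refer to it as [1], [2], ...
    context = "\n".join(
        f"[{i}] ({c.source}) {c.text}" for i, c in enumerate(chunks, start=1)
    )
    system = (
        "Answer ONLY from the numbered context below. "
        "Cite the chunks you used as [n]. "
        "If the context does not contain the answer, say "
        "'I don't have that information.'"
    )
    user = f"Context:\n{context}\n\nQuestion: {question}"
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": user},
    ]


if __name__ == "__main__":
    demo = [
        Chunk("refunds.md", "Returns are accepted within 30 days of purchase."),
        Chunk("shipping.md", "Standard delivery takes 3-5 business days."),
    ]
    for message in build_messages("Can I return an item after 3 weeks?", demo):
        print(f"--- {message['role']} ---\n{message['content']}")
```

The returned list has the shape chat-style LLM APIs generally expect, so it can be passed to whichever client you use. The instructions only steer the model; they do not guarantee that it stays within the context.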
The End-to-End RAG Flow
User Query
         │
         ▼
┌─────────────────────┐
│   Query Encoder     │  ← embed the query with the same model used at ingestion
└────────┬────────────┘
         │ query vector
         ▼
┌─────────────────────┐
│   Vector Store      │  ← ANN search: top-k most similar chunks
│ (Pinecone, pgvector,│
│  Qdrant, FAISS…)    │
└────────┬────────────┘
         │ retrieved chunks (text + metadata)
         ▼
┌─────────────────────┐
│   Prompt Builder    │  ← assemble: system prompt + chunks + user question
└────────┬────────────┘
         │ full prompt
         ▼
┌─────────────────────┐
│        LLM          │  ← GPT-4o, Claude 3.5, Mistral, Llama 3…
└────────┬────────────┘
         │ grounded answer (with citations)
         ▼
      Response

Ingestion Flow (Offline)
Before you can retrieve, you must build the index:
Raw Documents (PDFs, DOCX, HTML, DB rows…)
         │
         ▼
┌─────────────────────┐
│   Parser / Loader   │  ← extract plain text, preserve metadata
└────────┬────────────┘
         │
         ▼
┌─────────────────────┐
│      Chunker        │  ← split into passages of ~512 tokens
└────────┬────────────┘
         │
         ▼
┌─────────────────────┐
│   Embedding Model   │  ← text-embedding-3-small, E5-large, BGE-M3…
└────────┬────────────┘
         │ float32 vectors
         ▼
┌─────────────────────┐
│ Vector Store Upsert │  ← store vector + metadata + original text
└─────────────────────┘

Minimal Python RAG in 50 Lines
import os

import numpy as np
from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

# ── 1. Ingestion ──────────────────────────────────────────────────────────────
DOCS = [
    {"id": "doc1", "text": "Our refund policy allows returns within 30 days of purchase."},
    {"id": "doc2", "text": "Shipping takes 3–5 business days for standard delivery."},
    {"id": "doc3", "text": "We offer a lifetime warranty on all hardware products."},
    {"id": "doc4", "text": "Customer support is available Monday to Friday, 9 AM to 6 PM EST."},
]

def embed(texts: list[str]) -> np.ndarray:
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=texts,
    )
    return np.array([e.embedding for e in response.data], dtype=np.float32)

# Build in-memory index
doc_texts = [d["text"] for d in DOCS]
doc_vectors = embed(doc_texts)  # shape: (4, 1536)

# ── 2. Retrieval ──────────────────────────────────────────────────────────────
def cosine_similarity(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    a_norm = a / np.linalg.norm(a)
    b_norm = b / np.linalg.norm(b, axis=1, keepdims=True)
    return b_norm @ a_norm

def retrieve(query: str, top_k: int = 2) -> list[str]:
    q_vec = embed([query])[0]
    scores = cosine_similarity(q_vec, doc_vectors)
    top_indices = np.argsort(scores)[::-1][:top_k]
    return [DOCS[i]["text"] for i in top_indices]

# ── 3. Generation ─────────────────────────────────────────────────────────────
def rag_answer(question: str) -> str:
    chunks = retrieve(question)
    context = "\n\n".join(f"[{i+1}] {c}" for i, c in enumerate(chunks))
    system = (
        "You are a helpful support assistant. "
        "Answer ONLY using the provided context. "
        "If the answer is not in the context, say 'I don't have that information.'"
    )
    user = f"Context:\n{context}\n\nQuestion: {question}"
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": system},
            {"role": "user", "content": user},
        ],
        temperature=0,
    )
    return response.choices[0].message.content

# ── 4. Test ───────────────────────────────────────────────────────────────────
print(rag_answer("Can I return something I bought 3 weeks ago?"))
print(rag_answer("Do you sell groceries?"))

RAG vs Fine-Tuning vs Prompting
| Dimension | Prompting Only | RAG | Fine-Tuning |
|---|---|---|---|
| Knowledge source | Model weights | External store | Updated weights |
| Update cost | None | Low (re-index) | High (retrain) |
| Cites sources | No | Yes | No |
| Hallucination risk | High | Low | Medium |
| Latency | Low | Medium | Low |
| Best for | General tasks | Domain knowledge, live data | Style, tone, task format |
When to use RAG: you have a body of domain documents (manuals, policies, research papers) that change frequently and that the base model does not know about.
When to fine-tune instead: you want the model to behave differently (more concise, different format, specialized vocabulary) and the knowledge is stable.
When to combine them: fine-tune for style and behavior, use RAG for facts. This is increasingly the production pattern.
The Hallucination Triangle
      Accurate
         /\
        /  \
       /    \
Current ──── Citable

Without RAG, a model can be at most one or two of these. RAG enables all three simultaneously for your specific domain.
Key Terminology Reference
| Term | Definition |
|---|---|
| Chunk | A passage of text, typically 256–1024 tokens |
| Embedding | A dense float vector representing semantic meaning |
| ANN | Approximate Nearest Neighbor search |
| Top-k | The k chunks with highest similarity to the query |
| Context window | Maximum tokens the LLM can process at once |
| Grounding | Constraining LLM output to provided source material |
| Faithfulness | Whether the answer is supported by retrieved chunks |
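To make the "Chunk" entry concrete, here is a minimal sketch of a fixed-size chunker. It rests on assumptions that are not from the pipeline above: whitespace-separated words stand in for real tokenizer tokens, and the `chunk_words` helper and its `overlap` parameter are hypothetical names chosen for illustration. A production chunker would usually count tokens with the embedding model's own tokenizer and respect sentence or section boundaries.

```python
def chunk_words(text: str, chunk_size: int = 512, overlap: int = 64) -> list[str]:
    """Split text into overlapping windows of at most `chunk_size` words.

    Words are a rough stand-in for tokens (an assumption of this sketch).
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks: list[str] = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks


if __name__ == "__main__":
    sample = " ".join(f"w{i}" for i in range(10))
    for piece in chunk_words(sample, chunk_size=4, overlap=1):
        print(piece)
```

The overlap repeats the tail of one chunk at the head of the next, so a sentence that straddles a boundary still appears whole in at least one chunk, at the cost of a slightly larger index.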
Common Misconceptions
"RAG is just search." No. RAG combines retrieval with generation: the LLM synthesizes, reasons over, and rewrites the retrieved content into a coherent answer.
"Bigger embedding model always wins." Embedding quality matters, but so do chunking strategy, retrieval depth, and prompt design. A well-chunked index with a small embedding model often outperforms a poorly chunked index with a large one.
"Just put the whole document in the prompt." With 128K+ context windows, this is tempting. But it is expensive and slow, and LLMs can suffer from the "lost in the middle" effect: they tend to attend poorly to information in the center of a long context.
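The cost side of the last point is simple arithmetic on input tokens. The sketch below compares the prompt size of pasting in a whole document against sending only the top-k chunks. Every number in it is hypothetical and chosen only to show the calculation, and the two helper functions are illustrative names. System-prompt tokens are ignored for simplicity.

```python
def full_document_tokens(doc_tokens: int, question_tokens: int) -> int:
    """Prompt size when the whole document is pasted into the context."""
    return doc_tokens + question_tokens


def rag_tokens(top_k: int, chunk_tokens: int, question_tokens: int) -> int:
    """Prompt size when only the top-k retrieved chunks are included."""
    return top_k * chunk_tokens + question_tokens


if __name__ == "__main__":
    # All figures below are hypothetical, chosen only to show the arithmetic.
    doc, question = 100_000, 50
    full = full_document_tokens(doc, question)
    rag = rag_tokens(top_k=4, chunk_tokens=512, question_tokens=question)
    print(f"full document: {full:,} input tokens per query")
    print(f"top-4 chunks:  {rag:,} input tokens per query")
    print(f"ratio:         {full / rag:.1f}x")
```

Because input tokens are billed and processed on every query, a gap of this size repeats with each request, which is why retrieval tends to stay worthwhile even when the document would technically fit in the context window.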
What Comes Next
The rest of this course walks through each component in depth:
- Naive RAG: the basic pipeline end to end
- Chunking strategies: the biggest lever on retrieval quality
- Embedding models: how to pick the right one
- Vector stores: architecture and tradeoffs
- Hybrid search: combining dense and sparse retrieval
- Advanced patterns: HyDE, reranking, self-RAG, graph RAG
- Production concerns: cost, caching, security, evaluation
Let's build.