AI Systems Engineering · Lesson 1 of 2
Production RAG Pipeline
Why RAG Matters
Large Language Models are powerful, but they hallucinate. They don't know your data. Retrieval-Augmented Generation (RAG) solves this by grounding LLM responses in your actual documents.
But most RAG tutorials stop at "call an embedding API and stuff it into a prompt." Production RAG is much harder.
The Architecture
A production RAG pipeline has four stages:
- Ingestion — Parse and chunk documents
- Embedding — Convert chunks to vectors
- Retrieval — Find relevant chunks for a query
- Generation — Feed context to the LLM
Let's build each one.
Stage 1: Document Ingestion
The first mistake people make is treating all documents the same. A PDF is not a markdown file. A table is not a paragraph.
```python
from dataclasses import dataclass


@dataclass
class DocumentChunk:
    content: str      # the chunk text itself
    metadata: dict    # page number, section headers, etc.
    source: str       # originating document
    chunk_index: int  # position within the source


def chunk_document(text: str, chunk_size: int = 512, overlap: int = 50) -> list[str]:
    """Split text into overlapping chunks.

    Note: this splits by characters for simplicity. Token-accurate chunking
    needs a tokenizer (e.g. tiktoken for OpenAI models).
    """
    chunks = []
    start = 0
    while start < len(text):
        end = start + chunk_size
        chunks.append(text[start:end])
        start = end - overlap
    return chunks
```

Key decisions:
- Chunk size: 512 tokens is a good starting point. Too small = no context. Too large = noise.
- Overlap: 50-100 tokens prevents cutting sentences in half.
- Metadata: Always store source, page number, and section headers with each chunk.
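Tying those pieces together, here's a minimal sketch of an ingestion step that wraps raw chunks in `DocumentChunk` records with per-chunk metadata. The `ingest` helper and its `section` parameter are illustrative, not part of the pipeline above; the splitter and dataclass mirror the earlier definitions:

```python
from dataclasses import dataclass


@dataclass
class DocumentChunk:
    content: str
    metadata: dict
    source: str
    chunk_index: int


def chunk_document(text: str, chunk_size: int = 512, overlap: int = 50) -> list[str]:
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks


def ingest(text: str, source: str, section: str) -> list[DocumentChunk]:
    """Chunk a document and attach source/section metadata to every chunk."""
    return [
        DocumentChunk(content=c, metadata={"section": section},
                      source=source, chunk_index=i)
        for i, c in enumerate(chunk_document(text))
    ]


docs = ingest("word " * 300, source="handbook.md", section="Onboarding")
```

Because the metadata rides along with each chunk, every retrieved passage can later be traced back to its source and section.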
Stage 2: Embedding
Use a model that matches your domain. For most cases, OpenAI's text-embedding-3-small is a great default.
```python
from openai import OpenAI

client = OpenAI()


def embed_chunks(chunks: list[str]) -> list[list[float]]:
    """Embed a batch of text chunks in a single API call."""
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=chunks,
    )
    return [item.embedding for item in response.data]
```

Production considerations:
- Batch your requests — don't embed one chunk at a time
- Cache embeddings — recomputing is expensive
- Track your model version — switching models invalidates all stored embeddings
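One way to act on the caching and model-version points together: key the cache by a hash of the model name plus the chunk content, so unchanged chunks are never re-embedded and switching models naturally invalidates every entry. A sketch under those assumptions, with `embed_fn` standing in for the batched OpenAI call above:

```python
import hashlib


class EmbeddingCache:
    """Cache embeddings by (model, content-hash) so unchanged chunks skip the API."""

    def __init__(self, embed_fn, model: str):
        self.embed_fn = embed_fn  # batch function: list[str] -> list[list[float]]
        self.model = model
        self.store: dict[str, list[float]] = {}

    def _key(self, text: str) -> str:
        # Model name is part of the key: switching models invalidates all entries.
        return hashlib.sha256(f"{self.model}:{text}".encode()).hexdigest()

    def embed(self, chunks: list[str]) -> list[list[float]]:
        missing = [c for c in chunks if self._key(c) not in self.store]
        if missing:  # one batched call covers everything not yet cached
            for c, emb in zip(missing, self.embed_fn(missing)):
                self.store[self._key(c)] = emb
        return [self.store[self._key(c)] for c in chunks]
```

In a real deployment the `store` dict would be backed by Redis or a database table, but the keying scheme is the part that matters.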
Stage 3: Retrieval
This is where most RAG systems fail silently. Bad retrieval = bad answers, no matter how good your LLM is.
```python
import numpy as np


def cosine_similarity(a: list[float], b: list[float]) -> float:
    a, b = np.array(a), np.array(b)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))


def retrieve(query_embedding: list[float], stored_embeddings: list, top_k: int = 5):
    """Return the (index, score) pairs of the top-k most similar chunks."""
    scores = [
        (i, cosine_similarity(query_embedding, emb))
        for i, emb in enumerate(stored_embeddings)
    ]
    scores.sort(key=lambda x: x[1], reverse=True)
    return scores[:top_k]
```

In production, use a vector database (Pinecone, Weaviate, pgvector) instead of brute-force search.
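Even before reaching for a vector database, the brute-force loop above can be vectorized: stack the stored embeddings into a matrix, L2-normalize each row once at index time, and then a single matrix-vector product yields every cosine score at once. A sketch of that idea (the function names here are illustrative):

```python
import numpy as np


def build_index(embeddings: list[list[float]]) -> np.ndarray:
    """Stack embeddings into a matrix and L2-normalize each row, once, at index time."""
    matrix = np.array(embeddings, dtype=np.float32)
    return matrix / np.linalg.norm(matrix, axis=1, keepdims=True)


def retrieve_vectorized(query_embedding: list[float], index: np.ndarray, top_k: int = 5):
    """Cosine similarity against every stored chunk via one matrix-vector product."""
    q = np.array(query_embedding, dtype=np.float32)
    q /= np.linalg.norm(q)
    scores = index @ q                       # cosine scores for all chunks at once
    top = np.argsort(scores)[::-1][:top_k]   # indices of the top-k scores, descending
    return [(int(i), float(scores[i])) for i in top]
```

Normalizing at index time means cosine similarity reduces to a dot product at query time, which is the same trick most vector databases use internally.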
Stage 4: Generation
Now feed the retrieved context to your LLM:
```python
def generate_answer(query: str, context_chunks: list[str]) -> str:
    context = "\n\n---\n\n".join(context_chunks)
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {
                "role": "system",
                "content": (
                    "Answer the user's question based on the provided context. "
                    "If the context doesn't contain the answer, say so."
                ),
            },
            {
                "role": "user",
                "content": f"Context:\n{context}\n\nQuestion: {query}",
            },
        ],
    )
    return response.choices[0].message.content
```

What Breaks in Production
- Stale data — Your documents change. You need an update pipeline, not just an ingestion pipeline.
- Bad chunking — Tables, code blocks, and lists need special handling.
- Retrieval quality — Monitor what's being retrieved. Log it. Measure relevance.
- Cost — Embedding and LLM calls add up. Cache aggressively.
- Latency — Users expect fast answers. Optimize your vector search and consider streaming.
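The retrieval-quality point is the easiest to act on today: log every query alongside the chunks and scores it retrieved, so relevance can be audited offline. A minimal sketch, appending one JSON line per query (the log format is illustrative, not a standard):

```python
import json
import time


def log_retrieval(query: str, results: list[tuple[int, float]], path: str) -> None:
    """Append one JSON line: timestamp, query text, and (chunk_id, score) pairs."""
    record = {
        "ts": time.time(),
        "query": query,
        "results": [{"chunk": i, "score": round(s, 4)} for i, s in results],
    }
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")
```

JSON-lines logs like this can be loaded straight into pandas or shipped to any log aggregator, which makes "measure relevance" a query rather than a project.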
Next Steps
This is the foundation. In the next article, we'll add:
- Hybrid search (keyword + semantic)
- Re-ranking with cross-encoders
- Evaluation and monitoring
- Multi-document support
The gap between a demo RAG and a production RAG is enormous. Start with this architecture, measure everything, and iterate.