AI Systems Engineering · Lesson 1 of 2
Production RAG Pipeline
Why RAG Matters
Large Language Models are powerful, but they hallucinate. They don't know your data. Retrieval-Augmented Generation (RAG) solves this by grounding LLM responses in your actual documents.
But most RAG tutorials stop at "call an embedding API and stuff it into a prompt." Production RAG is much harder.
The Architecture
A production RAG pipeline has four stages:
- Ingestion — Parse and chunk documents
- Embedding — Convert chunks to vectors
- Retrieval — Find relevant chunks for a query
- Generation — Feed context to the LLM
Let's build each one.
Stage 1: Document Ingestion
The first mistake people make is treating all documents the same. A PDF is not a markdown file. A table is not a paragraph.
```python
from dataclasses import dataclass


@dataclass
class DocumentChunk:
    content: str      # the chunk text itself
    metadata: dict    # page number, section headers, etc.
    source: str       # originating document
    chunk_index: int  # position within the source


def chunk_document(text: str, chunk_size: int = 512, overlap: int = 50) -> list[str]:
    """Split text into overlapping chunks.

    Note: this splits by characters for simplicity. Token-accurate chunking
    needs a tokenizer (e.g. tiktoken for OpenAI models).
    """
    chunks = []
    start = 0
    while start < len(text):
        end = start + chunk_size
        chunks.append(text[start:end])
        start = end - overlap
    return chunks
```

Key decisions:
- Chunk size: 512 tokens is a good starting point. Too small = no context. Too large = noise.
- Overlap: 50-100 tokens prevents cutting sentences in half.
- Metadata: Always store source, page number, and section headers with each chunk.
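Tying those pieces together, here's a minimal sketch of an ingestion step that wraps raw chunks in `DocumentChunk` records with per-chunk metadata. The `ingest` helper and its `section` parameter are illustrative, not part of the pipeline above; the splitter and dataclass mirror the earlier definitions:

```python
from dataclasses import dataclass


@dataclass
class DocumentChunk:
    content: str
    metadata: dict
    source: str
    chunk_index: int


def chunk_document(text: str, chunk_size: int = 512, overlap: int = 50) -> list[str]:
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks


def ingest(text: str, source: str, section: str) -> list[DocumentChunk]:
    """Chunk a document and attach source/section metadata to every chunk."""
    return [
        DocumentChunk(content=c, metadata={"section": section},
                      source=source, chunk_index=i)
        for i, c in enumerate(chunk_document(text))
    ]


docs = ingest("word " * 300, source="handbook.md", section="Onboarding")
```

Because the metadata rides along with each chunk, every retrieved passage can later be traced back to its source and section.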
Stage 2: Embedding
Use a model that matches your domain. For most cases, OpenAI's text-embedding-3-small is a great default.
```python
from openai import OpenAI

client = OpenAI()


def embed_chunks(chunks: list[str]) -> list[list[float]]:
    """Embed a batch of text chunks in a single API call."""
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=chunks,
    )
    return [item.embedding for item in response.data]
```

Production considerations:
- Batch your requests — don't embed one chunk at a time
- Cache embeddings — recomputing is expensive
- Track your model version — switching models invalidates all stored embeddings
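One way to act on the caching and model-version points together: key the cache by a hash of the model name plus the chunk content, so unchanged chunks are never re-embedded and switching models naturally invalidates every entry. A sketch under those assumptions, with `embed_fn` standing in for the batched OpenAI call above:

```python
import hashlib


class EmbeddingCache:
    """Cache embeddings by (model, content-hash) so unchanged chunks skip the API."""

    def __init__(self, embed_fn, model: str):
        self.embed_fn = embed_fn  # batch function: list[str] -> list[list[float]]
        self.model = model
        self.store: dict[str, list[float]] = {}

    def _key(self, text: str) -> str:
        # Model name is part of the key: switching models invalidates all entries.
        return hashlib.sha256(f"{self.model}:{text}".encode()).hexdigest()

    def embed(self, chunks: list[str]) -> list[list[float]]:
        missing = [c for c in chunks if self._key(c) not in self.store]
        if missing:  # one batched call covers everything not yet cached
            for c, emb in zip(missing, self.embed_fn(missing)):
                self.store[self._key(c)] = emb
        return [self.store[self._key(c)] for c in chunks]
```

In a real deployment the `store` dict would be backed by Redis or a database table, but the keying scheme is the part that matters.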
Stage 3: Retrieval
This is where most RAG systems fail silently. Bad retrieval = bad answers, no matter how good your LLM is.
```python
import numpy as np


def cosine_similarity(a: list[float], b: list[float]) -> float:
    a, b = np.array(a), np.array(b)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))


def retrieve(query_embedding: list[float], stored_embeddings: list, top_k: int = 5):
    """Return the (index, score) pairs of the top-k most similar chunks."""
    scores = [
        (i, cosine_similarity(query_embedding, emb))
        for i, emb in enumerate(stored_embeddings)
    ]
    scores.sort(key=lambda x: x[1], reverse=True)
    return scores[:top_k]
```

In production, use a vector database (Pinecone, Weaviate, pgvector) instead of brute-force search.
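Even before reaching for a vector database, the brute-force loop above can be vectorized: stack the stored embeddings into a matrix, L2-normalize each row once at index time, and then a single matrix-vector product yields every cosine score at once. A sketch of that idea (the function names here are illustrative):

```python
import numpy as np


def build_index(embeddings: list[list[float]]) -> np.ndarray:
    """Stack embeddings into a matrix and L2-normalize each row, once, at index time."""
    matrix = np.array(embeddings, dtype=np.float32)
    return matrix / np.linalg.norm(matrix, axis=1, keepdims=True)


def retrieve_vectorized(query_embedding: list[float], index: np.ndarray, top_k: int = 5):
    """Cosine similarity against every stored chunk via one matrix-vector product."""
    q = np.array(query_embedding, dtype=np.float32)
    q /= np.linalg.norm(q)
    scores = index @ q                       # cosine scores for all chunks at once
    top = np.argsort(scores)[::-1][:top_k]   # indices of the top-k scores, descending
    return [(int(i), float(scores[i])) for i in top]
```

Normalizing at index time means cosine similarity reduces to a dot product at query time, which is the same trick most vector databases use internally.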
Stage 4: Generation
Now feed the retrieved context to your LLM:
```python
def generate_answer(query: str, context_chunks: list[str]) -> str:
    context = "\n\n---\n\n".join(context_chunks)
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {
                "role": "system",
                "content": (
                    "Answer the user's question based on the provided context. "
                    "If the context doesn't contain the answer, say so."
                ),
            },
            {
                "role": "user",
                "content": f"Context:\n{context}\n\nQuestion: {query}",
            },
        ],
    )
    return response.choices[0].message.content
```

What Breaks in Production
- Stale data — Your documents change. You need an update pipeline, not just an ingestion pipeline.
- Bad chunking — Tables, code blocks, and lists need special handling.
- Retrieval quality — Monitor what's being retrieved. Log it. Measure relevance.
- Cost — Embedding and LLM calls add up. Cache aggressively.
- Latency — Users expect fast answers. Optimize your vector search and consider streaming.
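The retrieval-quality point is the easiest to act on today: log every query alongside the chunks and scores it retrieved, so relevance can be audited offline. A minimal sketch, appending one JSON line per query (the log format is illustrative, not a standard):

```python
import json
import time


def log_retrieval(query: str, results: list[tuple[int, float]], path: str) -> None:
    """Append one JSON line: timestamp, query text, and (chunk_id, score) pairs."""
    record = {
        "ts": time.time(),
        "query": query,
        "results": [{"chunk": i, "score": round(s, 4)} for i, s in results],
    }
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")
```

JSON-lines logs like this can be loaded straight into pandas or shipped to any log aggregator, which makes "measure relevance" a query rather than a project.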
Next Steps
This is the foundation. In the next article, we'll add:
- Hybrid search (keyword + semantic)
- Re-ranking with cross-encoders
- Evaluation and monitoring
- Multi-document support
The gap between a demo RAG and a production RAG is enormous. Start with this architecture, measure everything, and iterate.