AI Systems · Beginner

RAG — Retrieval-Augmented Generation Architecture

Understand how RAG works: chunk documents, generate embeddings, store in a vector database, retrieve relevant context, and augment LLM prompts to ground answers in your own data.

Asma Hafeez · April 17, 2026 · 5 min read
Tags: ai, rag, embeddings, vector-search, dotnet, llm

RAG — Retrieval-Augmented Generation

LLMs have a knowledge cutoff and don't know your data. RAG solves this: retrieve relevant chunks from your own documents and inject them into the prompt. The model then answers based on your data, not just its training data.


The RAG Pipeline

Indexing (one time):
  Document → Chunk → Embed → Store in vector DB

Querying (each request):
  User Question → Embed → Search vector DB → Retrieve top-K chunks
                → Inject into prompt → LLM → Answer

Why RAG Works

An embedding is a vector (list of numbers) that captures semantic meaning. Semantically similar text has similar vectors. The vector DB finds the chunks most relevant to the question — then the LLM uses those chunks to answer.

Question: "What is the refund policy?"
→ Embeds to vector [0.12, -0.45, 0.88, ...]

Stored chunk: "Refunds are processed within 5-7 business days..."
→ Similar vector → high cosine similarity → retrieved

Unrelated chunk: "Our office is open 9am-5pm..."
→ Distant vector → low similarity → not retrieved
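
The vector database computes this similarity at scale; the core of it is just a normalized dot product. A minimal sketch:

C#
// Cosine similarity: ~1.0 = very similar meaning, ~0.0 = unrelated.
// Vector DBs rank chunks by this (or by the distance, 1 - similarity).
public static double CosineSimilarity(float[] a, float[] b)
{
    double dot = 0, normA = 0, normB = 0;
    for (int i = 0; i < a.Length; i++)
    {
        dot   += a[i] * b[i];
        normA += a[i] * a[i];
        normB += b[i] * b[i];
    }
    return dot / (Math.Sqrt(normA) * Math.Sqrt(normB));
}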

Step 1 — Chunking Documents

C#
public static List<string> ChunkText(string text, int maxChars = 500, int overlap = 50)
{
    var chunks = new List<string>();
    int start = 0;

    while (start < text.Length)
    {
        int end = Math.Min(start + maxChars, text.Length);

        // Try to break at a sentence boundary within the last ~100 characters
        if (end < text.Length)
        {
            var breakPoint = text.LastIndexOfAny(['.', '!', '?'], end, Math.Min(100, end - start));
            if (breakPoint > start) end = breakPoint + 1;
        }

        chunks.Add(text[start..end].Trim());

        // Stop once the final chunk is emitted; without this check the
        // overlap step below would re-read the tail of the text forever
        if (end == text.Length) break;

        // Step back by the overlap, but always make forward progress
        start = Math.Max(start + 1, end - overlap);
    }

    return chunks;
}
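
A quick usage sketch (the file name here is hypothetical):

C#
// Split a document into overlapping ~500-character chunks
var text   = await File.ReadAllTextAsync("policy.txt");
var chunks = ChunkText(text);
Console.WriteLine($"Split into {chunks.Count} chunks");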

Chunk size matters:

  • Too small: chunks lack context for the model
  • Too large: irrelevant content dilutes the relevant parts
  • 300–600 characters with 10–20% overlap is a common starting point

Step 2 — Generating Embeddings

C#
public class EmbeddingService(OpenAIClient openai)
{
    private readonly EmbeddingClient _client =
        openai.GetEmbeddingClient("text-embedding-3-small");

    public async Task<float[]> EmbedAsync(string text)
    {
        var result = await _client.GenerateEmbeddingAsync(text);
        return result.Value.ToFloats().ToArray();
    }

    public async Task<List<float[]>> EmbedBatchAsync(IEnumerable<string> texts)
    {
        var results = await _client.GenerateEmbeddingsAsync(texts.ToList());
        return results.Value.Select(e => e.ToFloats().ToArray()).ToList();
    }
}
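
Wiring this up might look like the following (assumes the official OpenAI .NET SDK and an API key in the environment):

C#
// Hypothetical setup; the variable names are illustrative
var openai   = new OpenAIClient(Environment.GetEnvironmentVariable("OPENAI_API_KEY")!);
var embedder = new EmbeddingService(openai);

float[] vector = await embedder.EmbedAsync("What is the refund policy?");
Console.WriteLine(vector.Length); // 1536 for text-embedding-3-small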

Step 3 — Storing in a Vector Database

Using pgvector (PostgreSQL extension):

SQL
-- Enable the extension
CREATE EXTENSION IF NOT EXISTS vector;

-- Table with embedding column
CREATE TABLE document_chunks (
    id          SERIAL PRIMARY KEY,
    source      TEXT NOT NULL,
    chunk_text  TEXT NOT NULL,
    embedding   vector(1536)  -- 1536 dimensions for text-embedding-3-small
);

-- Index for fast approximate nearest-neighbor search
-- (build after loading data; tune `lists` for the recall/speed trade-off)
CREATE INDEX ON document_chunks USING ivfflat (embedding vector_cosine_ops)
    WITH (lists = 100);

C#
// Insert a chunk (db is assumed to be an open connection with Dapper-style extensions)
public async Task StoreChunkAsync(string source, string text, float[] embedding)
{
    await db.ExecuteAsync(
        "INSERT INTO document_chunks (source, chunk_text, embedding) VALUES (@s, @t, @e::vector)",
        new { s = source, t = text, e = "[" + string.Join(",", embedding) + "]" }
    );
}

Alternatives: Azure AI Search, Qdrant, Weaviate, Pinecone, Chroma.


Step 4 — Retrieving Relevant Chunks

C#
public async Task<List<string>> SearchAsync(float[] queryEmbedding, int topK = 5)
{
    var vectorLiteral = "[" + string.Join(",", queryEmbedding) + "]";

    return (await db.QueryAsync<string>(
        """
        SELECT chunk_text
        FROM document_chunks
        ORDER BY embedding <=> @embedding::vector   -- cosine distance
        LIMIT @k
        """,
        new { embedding = vectorLiteral, k = topK }
    )).ToList();
}

Step 5 — Augmenting the Prompt

C#
public class RagService(EmbeddingService embedder, VectorStore store, OpenAIClient openai)
{
    private readonly ChatClient _chat = openai.GetChatClient("gpt-4o");

    public async Task<string> AskAsync(string question)
    {
        // Retrieve
        var queryVector = await embedder.EmbedAsync(question);
        var chunks      = await store.SearchAsync(queryVector, topK: 5);
        var context     = string.Join("\n\n---\n\n", chunks);

        // Augment
        var messages = new List<ChatMessage>
        {
            new SystemChatMessage("""
                Answer the question using ONLY the provided context.
                If the answer is not in the context, say "I don't have that information."
                Do not make up information.
                """),
            new UserChatMessage($"""
                Context:
                {context}

                Question: {question}
                """)
        };

        // CompleteChatAsync returns ClientResult<ChatCompletion>; the explicit
        // type uses its implicit conversion to ChatCompletion
        ChatCompletion completion = await _chat.CompleteChatAsync(messages);
        return completion.Content[0].Text;
    }
}
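
Putting it together (the variable names follow the earlier snippets):

C#
var rag    = new RagService(embedder, store, openai);
var answer = await rag.AskAsync("What is the refund policy?");
Console.WriteLine(answer);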

Indexing Documents

C#
public class DocumentIndexer(EmbeddingService embedder, VectorStore store)
{
    public async Task IndexAsync(string filePath)
    {
        var text   = await File.ReadAllTextAsync(filePath);
        var chunks = ChunkText(text);
        var embeddings = await embedder.EmbedBatchAsync(chunks);

        for (int i = 0; i < chunks.Count; i++)
            await store.StoreChunkAsync(filePath, chunks[i], embeddings[i]);

        Console.WriteLine($"Indexed {chunks.Count} chunks from {Path.GetFileName(filePath)}");
    }
}

// Index all text files in a folder
var indexer = new DocumentIndexer(embedder, store);
foreach (var file in Directory.GetFiles("docs", "*.txt"))
    await indexer.IndexAsync(file);

RAG Quality — Common Improvements

Chunking:
  • Sentence-aware splitting — don't cut in the middle of sentences
  • Hierarchical: chunk parents (paragraphs) + children (sentences)

Retrieval:
  • Hybrid search: vector + keyword (BM25) combined (see the fusion sketch after this list)
  • Reranking: use a cross-encoder to re-score top-K results
  • Metadata filtering: filter by document type, date, category before vector search
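
Hybrid search needs a way to merge the vector-ranked and keyword-ranked result lists. Reciprocal rank fusion is a common choice; a minimal sketch (60 is the conventional smoothing constant):

C#
// Reciprocal rank fusion: a chunk ranked high in either list gets a high fused score
public static List<string> FuseRanks(List<string> vectorHits, List<string> keywordHits, int k = 60)
{
    var scores = new Dictionary<string, double>();

    void Accumulate(List<string> hits)
    {
        for (int rank = 0; rank < hits.Count; rank++)
            scores[hits[rank]] = scores.GetValueOrDefault(hits[rank]) + 1.0 / (k + rank + 1);
    }

    Accumulate(vectorHits);
    Accumulate(keywordHits);

    return scores.OrderByDescending(kv => kv.Value).Select(kv => kv.Key).ToList();
}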

Prompt:
  • Include source attribution: "According to [policy.pdf]..."
  • Ask the model to cite which context it used
  • Add negative instruction: "Do not use information outside the context"
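
For source attribution, store each chunk's source alongside its text and tag the context blocks. A hedged sketch of such a system prompt:

C#
// Hypothetical citation-aware prompt; assumes each context block is
// prefixed with its source, e.g. "[policy.pdf] Refunds are processed..."
var systemPrompt = """
    Answer the question using ONLY the provided context.
    Each context block is tagged with its source file, e.g. [policy.pdf].
    Cite the source tag after every claim you make.
    If the answer is not in the context, say "I don't have that information."
    """;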

Key Takeaways

  1. RAG = chunk → embed → store → retrieve → augment — five clear steps
  2. Chunk size is a tuning parameter — start at 300–600 chars with overlap
  3. Vector similarity (cosine distance) finds semantically relevant chunks, not just keyword matches
  4. The prompt must constrain the model to the retrieved context — otherwise it hallucinates
  5. RAG is the right choice for large, changing document sets where fine-tuning would be too expensive

Enjoyed this article?

Explore the AI Systems learning path for more.
