RAG — Retrieval-Augmented Generation Architecture
Understand how RAG works: chunk documents, generate embeddings, store in a vector database, retrieve relevant context, and augment LLM prompts to ground answers in your own data.
RAG — Retrieval-Augmented Generation
LLMs have a knowledge cutoff and don't know your data. RAG solves this: retrieve relevant chunks from your own documents and inject them into the prompt. The model answers based on your data, not just training data.
The RAG Pipeline
Indexing (one time):
Document → Chunk → Embed → Store in vector DB
Querying (each request):
User Question → Embed → Search vector DB → Retrieve top-K chunks
→ Inject into prompt → LLM → Answer
Why RAG Works
An embedding is a vector (list of numbers) that captures semantic meaning. Semantically similar text has similar vectors. The vector DB finds the chunks most relevant to the question — then the LLM uses those chunks to answer.
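Retrieval ranks chunks by cosine similarity between vectors. As a minimal sketch (the vectors below are toy values for illustration, not real 1536-dimensional embeddings), the comparison boils down to a normalized dot product:

```csharp
// Cosine similarity: dot(a, b) / (|a| * |b|).
// 1.0 = same direction (semantically close); near zero or negative = unrelated.
static double CosineSimilarity(float[] a, float[] b)
{
    double dot = 0, normA = 0, normB = 0;
    for (int i = 0; i < a.Length; i++)
    {
        dot += a[i] * b[i];
        normA += a[i] * a[i];
        normB += b[i] * b[i];
    }
    return dot / (Math.Sqrt(normA) * Math.Sqrt(normB));
}

float[] question    = [0.12f, -0.45f, 0.88f];  // toy "question" vector
float[] refundChunk = [0.10f, -0.40f, 0.90f];  // points the same way
float[] officeChunk = [-0.70f, 0.60f, 0.05f];  // points elsewhere

Console.WriteLine(CosineSimilarity(question, refundChunk)); // close to 1
Console.WriteLine(CosineSimilarity(question, officeChunk)); // much lower
```

Note that pgvector's `<=>` operator (used later in this article) returns cosine *distance*, i.e. 1 minus similarity, so there lower is better.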
Question: "What is the refund policy?"
→ Embeds to vector [0.12, -0.45, 0.88, ...]
Stored chunk: "Refunds are processed within 5-7 business days..."
→ Similar vector → high cosine similarity → retrieved
Unrelated chunk: "Our office is open 9am-5pm..."
→ Distant vector → low similarity → not retrieved
Step 1 — Chunking Documents
public static List<string> ChunkText(string text, int maxChars = 500, int overlap = 50)
{
    var chunks = new List<string>();
    int start = 0;
    while (start < text.Length)
    {
        int end = Math.Min(start + maxChars, text.Length);
        // Try to break at a sentence boundary
        if (end < text.Length)
        {
            var breakPoint = text.LastIndexOfAny(['.', '!', '?'], end, Math.Min(100, end - start));
            if (breakPoint > start) end = breakPoint + 1;
        }
        chunks.Add(text[start..end].Trim());
        if (end >= text.Length) break;              // last chunk reached; stop
        start = Math.Max(end - overlap, start + 1); // step back for overlap, but always advance
    }
    return chunks;
}
Chunk size matters:
- Too small: chunks lack context for the model
- Too large: irrelevant content dilutes the relevant parts
- 300–600 characters with 10–20% overlap is a common starting point
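Overlap also affects how many chunks, and therefore how many embedding calls, a document produces: each chunk advances only maxChars minus overlap characters. A rough estimate of that arithmetic, ignoring the sentence-boundary adjustment (EstimateChunkCount is a helper invented here, not part of the pipeline):

```csharp
// Approximate chunk count for a document: each chunk covers maxChars characters
// but only advances (maxChars - overlap), so overlap inflates the total slightly.
static int EstimateChunkCount(int docLength, int maxChars = 500, int overlap = 50)
{
    int stride = maxChars - overlap; // net progress per chunk
    return Math.Max(1, (int)Math.Ceiling((docLength - overlap) / (double)stride));
}

Console.WriteLine(EstimateChunkCount(10_000));                               // 23 at the defaults
Console.WriteLine(EstimateChunkCount(10_000, maxChars: 1000, overlap: 100)); // 11
```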
Step 2 — Generating Embeddings
public class EmbeddingService(OpenAIClient openai)
{
private readonly EmbeddingClient _client =
openai.GetEmbeddingClient("text-embedding-3-small");
public async Task<float[]> EmbedAsync(string text)
{
var result = await _client.GenerateEmbeddingAsync(text);
return result.Value.ToFloats().ToArray();
}
public async Task<List<float[]>> EmbedBatchAsync(IEnumerable<string> texts)
{
var results = await _client.GenerateEmbeddingsAsync(texts.ToList());
return results.Value.Select(e => e.ToFloats().ToArray()).ToList();
}
}
Step 3 — Storing in a Vector Database
Using pgvector (PostgreSQL extension):
-- Enable the extension
CREATE EXTENSION IF NOT EXISTS vector;
-- Table with embedding column
CREATE TABLE document_chunks (
id SERIAL PRIMARY KEY,
source TEXT NOT NULL,
chunk_text TEXT NOT NULL,
embedding vector(1536) -- 1536 dimensions for text-embedding-3-small
);
-- Index for fast approximate nearest-neighbor search
CREATE INDEX ON document_chunks USING ivfflat (embedding vector_cosine_ops);
// Insert a chunk
public async Task StoreChunkAsync(string source, string text, float[] embedding)
{
await db.ExecuteAsync(
"INSERT INTO document_chunks (source, chunk_text, embedding) VALUES (@s, @t, @e::vector)",
new { s = source, t = text, e = "[" + string.Join(",", embedding.Select(f => f.ToString(CultureInfo.InvariantCulture))) + "]" } // invariant culture keeps '.' as the decimal separator
);
}
Alternatives: Azure AI Search, Qdrant, Weaviate, Pinecone, Chroma.
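One formatting detail matters when building the vector literal: format the floats with `CultureInfo.InvariantCulture`, because some locales render the decimal separator as a comma, which would corrupt the pgvector text literal. A standalone sketch of the conversion:

```csharp
using System.Globalization;

// pgvector accepts vectors as text literals like [0.12,-0.45,0.88].
// InvariantCulture guarantees '.' as the decimal separator in every locale.
float[] embedding = [0.12f, -0.45f, 0.88f];
var literal = "[" + string.Join(",",
    embedding.Select(f => f.ToString(CultureInfo.InvariantCulture))) + "]";

Console.WriteLine(literal); // [0.12,-0.45,0.88]
```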
Step 4 — Retrieving Relevant Chunks
public async Task<List<string>> SearchAsync(float[] queryEmbedding, int topK = 5)
{
var vectorLiteral = "[" + string.Join(",", queryEmbedding.Select(f => f.ToString(CultureInfo.InvariantCulture))) + "]"; // invariant culture keeps '.' as the decimal separator
return (await db.QueryAsync<string>(
"""
SELECT chunk_text
FROM document_chunks
ORDER BY embedding <=> @embedding::vector -- cosine distance
LIMIT @k
""",
new { embedding = vectorLiteral, k = topK }
)).ToList();
}
Step 5 — Augmenting the Prompt
public class RagService(EmbeddingService embedder, VectorStore store, OpenAIClient openai)
{
private readonly ChatClient _chat = openai.GetChatClient("gpt-4o");
public async Task<string> AskAsync(string question)
{
// Retrieve
var queryVector = await embedder.EmbedAsync(question);
var chunks = await store.SearchAsync(queryVector, topK: 5);
var context = string.Join("\n\n---\n\n", chunks);
// Augment
var messages = new List<ChatMessage>
{
new SystemChatMessage("""
Answer the question using ONLY the provided context.
If the answer is not in the context, say "I don't have that information."
Do not make up information.
"""),
new UserChatMessage($"""
Context:
{context}
Question: {question}
""")
};
var completion = await _chat.CompleteChatAsync(messages);
return completion.Value.Content[0].Text; // CompleteChatAsync returns a ClientResult<ChatCompletion>
}
}
Indexing Documents
public class DocumentIndexer(EmbeddingService embedder, VectorStore store)
{
public async Task IndexAsync(string filePath)
{
var text = await File.ReadAllTextAsync(filePath);
var chunks = ChunkText(text);
var embeddings = await embedder.EmbedBatchAsync(chunks);
for (int i = 0; i < chunks.Count; i++)
await store.StoreChunkAsync(filePath, chunks[i], embeddings[i]);
Console.WriteLine($"Indexed {chunks.Count} chunks from {Path.GetFileName(filePath)}");
}
}
// Index all .txt files in a folder
var indexer = new DocumentIndexer(embedder, store);
foreach (var file in Directory.GetFiles("docs", "*.txt"))
await indexer.IndexAsync(file);
RAG Quality — Common Improvements
Chunking:
• Sentence-aware splitting — don't cut in the middle of sentences
• Hierarchical: chunk parents (paragraphs) + children (sentences)
Retrieval:
• Hybrid search: vector + keyword (BM25) combined
• Reranking: use a cross-encoder to re-score top-K results
• Metadata filtering: filter by document type, date, category before vector search
Prompt:
• Include source attribution: "According to [policy.pdf]..."
• Ask the model to cite which context it used
• Add negative instruction: "Do not use information outside the context"
Key Takeaways
- RAG = chunk → embed → store → retrieve → augment — five clear steps
- Chunk size is a tuning parameter — start at 300-600 chars with overlap
- Vector similarity (cosine distance) finds semantically relevant chunks, not just keyword matches
- The prompt must constrain the model to the retrieved context — otherwise it hallucinates
- RAG is the right choice for large, changing document sets where fine-tuning would be too expensive
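Putting it together, a hypothetical composition root wiring up the pieces from this article. It assumes the EmbeddingService, RagService, and DocumentIndexer classes defined above, plus a VectorStore class exposing the StoreChunkAsync and SearchAsync methods from Steps 3 and 4; the connection string and environment variable are placeholders to adapt:

```csharp
using OpenAI;

// Wire up the services (VectorStore's constructor is assumed to take a
// PostgreSQL connection string; adjust to your implementation).
var openai   = new OpenAIClient(Environment.GetEnvironmentVariable("OPENAI_API_KEY"));
var store    = new VectorStore("Host=localhost;Database=rag;Username=app");
var embedder = new EmbeddingService(openai);

// Index once (or on a schedule, or when documents change)
var indexer = new DocumentIndexer(embedder, store);
foreach (var file in Directory.GetFiles("docs", "*.txt"))
    await indexer.IndexAsync(file);

// Query per request
var rag = new RagService(embedder, store, openai);
Console.WriteLine(await rag.AskAsync("What is the refund policy?"));
```

Indexing and querying are deliberately decoupled: the index is built once and reused, so per-request cost is one embedding call, one vector search, and one chat completion.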