
Case Study: Our RAG System Was Hallucinating 30% of the Time

A production RAG system was returning plausible-sounding but wrong answers on 30% of queries. Full investigation: how we measured it, the five root causes, and the fixes that got hallucination below 4%.

Asma Hafeez Khan | May 25, 2026 | 9 min read
Tags: .NET, C#, AI, RAG, hallucination, evaluation, production, postmortem

System: Internal knowledge base chatbot — 1,200 employees, HR + policy documents
Stack: ASP.NET Core 9, Microsoft.Extensions.AI, OpenAI gpt-4o, pgvector
Discovery: User survey (month 3 post-launch)
Measured hallucination rate (before): 31%
Measured hallucination rate (after): 3.8%
Time to fix: 3 weeks


How We Discovered It

Month 3 post-launch — anonymous employee survey:
  "How would you rate the accuracy of the policy chatbot?"
  Very accurate:       12%
  Mostly accurate:     38%
  Sometimes wrong:     35%
  Often wrong:         15%

Sample complaints:
  "It told me parental leave is 16 weeks — it's 20 weeks."
  "It said I need manager approval for training under £1,000 — I don't."
  "It gave me last year's travel reimbursement rates."

We had no automated measurement in place.
We only found out from a survey.

Step 1: Build a Measurement Framework

Before fixing anything, we needed to know the baseline precisely.

C#
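// Faithfulness evaluator: does the answer stay within the retrieved context?
// Note: CompleteAsync and response.Message match preview builds of Microsoft.Extensions.AI;
// later releases renamed these to GetResponseAsync and response.Text.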
public class FaithfulnessEvaluator(IChatClient judge)
{
    public async Task<EvalScore> EvaluateAsync(
        string question,
        string context,
        string answer,
        CancellationToken ct = default)
    {
        var response = await judge.CompleteAsync([
            new(ChatRole.System, """
                You are an evaluation assistant. Your job is to determine whether
                the given ANSWER is fully supported by the CONTEXT.

                Respond with JSON only:
                {"score": 0-10, "reason": "one sentence explanation"}

                Score 10 = every claim in the answer is directly stated in the context.
                Score 0  = the answer makes claims not in the context at all.
                """),
            new(ChatRole.User, $"""
                QUESTION: {question}

                CONTEXT:
                {context}

                ANSWER: {answer}
                """)
        ], new ChatOptions { ResponseFormat = ChatResponseFormat.Json }, ct);

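        // The judge emits lowercase keys ("score", "reason"). System.Text.Json matches
        // names case-sensitively by default, which would leave Score at 0 for every
        // sample, so deserialize with the case-insensitive web defaults.
        return JsonSerializer.Deserialize<EvalScore>(response.Message.Text!, JsonOptions)
            ?? throw new JsonException("Judge returned an empty evaluation.");
    }

    private static readonly JsonSerializerOptions JsonOptions = new(JsonSerializerDefaults.Web);
}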

public record EvalScore(int Score, string Reason);
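
An LLM judge can return malformed, incomplete, or out-of-range output even when asked for JSON only. A minimal sketch of defensive parsing, assuming a failed parse should be skipped or retried rather than counted as a score; `JudgeScore` and `JudgeScoreParser` are illustrative names, not part of the original code:

```csharp
using System.Text.Json;

// Hypothetical stand-in for EvalScore; Score is nullable so a missing field is detectable.
public record JudgeScore(int? Score, string? Reason);

public static class JudgeScoreParser
{
    // Web defaults match "score"/"reason" to Score/Reason case-insensitively.
    private static readonly JsonSerializerOptions Options = new(JsonSerializerDefaults.Web);

    // Returns false for empty input, invalid JSON, a missing score, or a score outside 0-10.
    public static bool TryParse(string? json, out JudgeScore? score)
    {
        score = null;
        if (string.IsNullOrWhiteSpace(json)) return false;

        try
        {
            var parsed = JsonSerializer.Deserialize<JudgeScore>(json, Options);
            var value = parsed?.Score;
            if (value is null || value < 0 || value > 10) return false;

            score = parsed;
            return true;
        }
        catch (JsonException)
        {
            return false;
        }
    }
}
```

A caller could log and skip samples where `TryParse` returns false, so a bad judge response does not silently show up as a faithfulness score of 0.
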
C#
// Run eval against 200 question/answer pairs sampled from production logs
public class RagEvalRunner(
    FaithfulnessEvaluator faithfulness,
    AppDbContext db)
{
    public async Task<EvalReport> RunAsync(CancellationToken ct)
    {
        // Load 200 random production queries from the last 30 days
        var samples = await db.QueryLogs
            .Where(q => q.CreatedAt > DateTime.UtcNow.AddDays(-30))
            .OrderBy(_ => EF.Functions.Random())
            .Take(200)
            .ToListAsync(ct);

        var scores = new List<int>();
        var failures = new List<EvalFailure>();

        foreach (var sample in samples)
        {
            var score = await faithfulness.EvaluateAsync(
                sample.Question,
                sample.RetrievedContext,
                sample.Answer,
                ct);

            scores.Add(score.Score);

            // Score below 7 = hallucination (answer makes unsupported claims)
            if (score.Score < 7)
                failures.Add(new EvalFailure(sample.Question, sample.Answer, score.Reason));
        }

        return new EvalReport(
            Total:            samples.Count,
            HallucinationRate: (double)failures.Count / samples.Count,
            MeanFaithfulness:  scores.Average(),
            Failures:          failures);
    }
}
Baseline results (200 samples):
  Mean faithfulness score: 5.8 / 10
  Hallucination rate:      31%   (score below 7)
  Worst category:          Policy documents (41% hallucination rate)
  Best category:           FAQ documents (18% hallucination rate)
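
The per-category figures imply grouping the scored samples by document category. A minimal sketch of that aggregation, assuming each scored sample carries a category label and reusing the cutoff of 7 from the runner; `ScoredSample`, `CategoryStats`, and `CategoryBreakdown` are hypothetical names:

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical shape: one judged sample plus the category of its source document.
public record ScoredSample(string Category, int Score);

public record CategoryStats(string Category, int Total, double HallucinationRate, double MeanScore);

public static class CategoryBreakdown
{
    // Same cutoff as the runner: a score below 7 counts as a hallucination.
    public const int FaithfulnessCutoff = 7;

    public static IReadOnlyList<CategoryStats> Compute(IEnumerable<ScoredSample> samples) =>
        samples
            .GroupBy(s => s.Category)
            .Select(g => new CategoryStats(
                Category: g.Key,
                Total: g.Count(),
                HallucinationRate: (double)g.Count(s => s.Score < FaithfulnessCutoff) / g.Count(),
                MeanScore: g.Average(s => s.Score)))
            // Worst category first, so the report leads with the biggest problem.
            .OrderByDescending(c => c.HallucinationRate)
            .ToList();
}
```

Sorting by rate puts the worst category at index 0, which is how a "worst category" line like the one above could be produced.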

Root Cause Investigation

We categorised the 62 hallucinated answers and found five root causes.

Root Cause 1: Stale Documents (40% of hallucinations)

Parental leave updated from 16 to 20 weeks in January.
The knowledge base was last ingested in October.

The document in the vector store still said 16 weeks.
The model answered correctly — from the wrong data.

This is not a hallucination in the LLM sense.
The model answered faithfully to the context.
But the context was wrong.
C#
// Fix: Track document modification dates; re-ingest on change
public class DocumentChangeDetector(
    IDocumentRepository docs,
    ISharePointClient sharePoint)
    : BackgroundService
{
    protected override async Task ExecuteAsync(CancellationToken ct)
    {
        while (!ct.IsCancellationRequested)
        {
            var storedDocs = await docs.GetAllAsync(ct);

            foreach (var doc in storedDocs)
            {
                var remote = await sharePoint.GetMetadataAsync(doc.SourcePath, ct);

                if (remote.LastModified > doc.LastIngested)
                {
                    // Document changed — re-ingest
                    await docs.MarkForReingestionAsync(doc.Id, ct);
                }
            }

            await System.Threading.Tasks.Task.Delay(TimeSpan.FromHours(1), ct);
        }
    }
}
C#
// Add document date to every chunk — surfaced in the answer
var chunkText = $"""
    [Source: {doc.Title} | Last updated: {doc.LastModified:yyyy-MM-dd}]

    {chunk.Text}
    """;

// Prompt instructs the model to cite the date
var systemPrompt = """
    You are an HR policy assistant.
    Always cite the document name and last-updated date in your answer.
    If a document is more than 6 months old, add a note:
    "Note: this policy document may be outdated. Please verify with HR."
    """;

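The 6-month rule in the prompt depends on the model knowing today's date, which it only does if the date is supplied. One option, sketched here under that assumption, is to decide staleness in code when the chunk header is built and put the flag into the header itself. `ChunkHeader`, `Build`, and the `today` parameter are illustrative names, not from the original system:

```csharp
using System;
using System.Globalization;

public static class ChunkHeader
{
    // Builds the "[Source: ... | Last updated: ...]" header and appends a staleness
    // flag when the document is more than six months old as of `today`.
    public static string Build(string title, DateTime lastModified, DateTime today)
    {
        var date = lastModified.ToString("yyyy-MM-dd", CultureInfo.InvariantCulture);
        var header = $"[Source: {title} | Last updated: {date}]";

        var isStale = lastModified.AddMonths(6) < today;
        return isStale ? header + " [STALE: older than 6 months]" : header;
    }
}
```

Passing `today` in explicitly keeps the function deterministic and easy to test; the system prompt can then tell the model to repeat the warning whenever a header carries the flag.
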
Root Cause 2: Poor Chunk Boundaries (25% of hallucinations)

Policy document structure:
  Section 3.1: Training Budget
  "Employees may claim up to £1,000 for external training..."

  Section 3.2: Manager Approval
  "All training requests require line manager approval."

Chunk 1 ended after "up to £1,000 for external training"
Chunk 2 started with "All training requests require line manager approval"

Query: "Do I need manager approval for training under £1,000?"
Retrieved: Chunk 2 (about manager approval, similarity: 0.82)
Answer: "Yes, all training requests require manager approval."

Correct answer: "No, training under £1,000 does not require approval."
(The exception was in Chunk 1 — not retrieved.)

The chunk boundary split a conditional statement from its condition.
C#
// Fix 1: Larger chunks with overlap
// Before: 300 tokens, 30-token overlap
// After:  500 tokens, 100-token overlap
var chunkOptions = new TextChunkingOptions
{
    MaxTokensPerChunk = 500,
    OverlapTokens     = 100,
};

// Fix 2: Semantic chunking — split on paragraph boundaries, not token count
public class SemanticChunker
{
    public List<string> Chunk(string text)
    {
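        // Normalise line endings, then split on blank lines (paragraph boundaries)
        var paragraphs = text.ReplaceLineEndings("\n")
                             .Split("\n\n", StringSplitOptions.RemoveEmptyEntries);
        var chunks     = new List<string>();
        var current    = new System.Text.StringBuilder();

        foreach (var para in paragraphs)
        {
            // Rough heuristic: about 4 characters per token for English text
            var estimatedTokens = (current.Length + para.Length) / 4;

            // Flush before the chunk would exceed ~500 tokens; a single paragraph
            // longer than that still becomes one oversized chunk
            if (estimatedTokens > 500 && current.Length > 0)
            {
                chunks.Add(current.ToString().Trim());
                current.Clear();
            }

            // Keep a blank line between paragraphs inside the chunk
            current.AppendLine(para).AppendLine();
        }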
        }

        if (current.Length > 0)
            chunks.Add(current.ToString().Trim());

        return chunks;
    }
}

// Fix 3: Retrieve top-8 instead of top-3; filter by threshold
// Before: top-3 chunks, no threshold
// After:  top-8 chunks, filter below 0.65 cosine similarity
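
Fix 1 gives the sizes but not the windowing itself. A minimal sketch of fixed-size chunking with overlap, assuming whitespace-separated words stand in for tokens (a real pipeline would count tokens with the embedding model's tokenizer); `OverlapChunker` is an illustrative name:

```csharp
using System;
using System.Collections.Generic;

public static class OverlapChunker
{
    private static readonly char[] Whitespace = { ' ', '\t', '\r', '\n' };

    // Slides a window of maxTokens words over the text, stepping by
    // (maxTokens - overlapTokens) so consecutive chunks share overlapTokens words.
    public static List<string> Chunk(string text, int maxTokens = 500, int overlapTokens = 100)
    {
        if (maxTokens <= 0) throw new ArgumentOutOfRangeException(nameof(maxTokens));
        if (overlapTokens < 0 || overlapTokens >= maxTokens)
            throw new ArgumentOutOfRangeException(nameof(overlapTokens));

        var words = text.Split(Whitespace, StringSplitOptions.RemoveEmptyEntries);
        var chunks = new List<string>();
        var step = maxTokens - overlapTokens;

        for (var start = 0; start < words.Length; start += step)
        {
            var length = Math.Min(maxTokens, words.Length - start);
            chunks.Add(string.Join(' ', words, start, length));

            // Stop once a window reaches the end, to avoid a trailing chunk
            // that contains nothing but overlap.
            if (start + length >= words.Length) break;
        }

        return chunks;
    }
}
```

Overlap only helps when a condition and its exception sit within `overlapTokens` of the boundary, which is why it complements splitting on paragraph or section boundaries and does not replace it.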

Root Cause 3: Missing "I Don't Know" Behaviour (20% of hallucinations)

Query: "What is the company's policy on crypto salary payments?"
Retrieved context: general compensation policy (similarity: 0.61)
Answer: "The company does not currently offer cryptocurrency salary options,
         though this may be reviewed in future."

There was no such policy. The model invented a plausible-sounding answer
because the prompt said "always answer the question."
C#
// Fix: retrieval threshold gate + explicit "I don't know" instruction
public async Task<RagAnswer> AnswerAsync(string question, CancellationToken ct)
{
    var queryEmbedding = await embedder.GenerateAsync([question], cancellationToken: ct);
    var chunks         = await vectorStore.SearchAsync(queryEmbedding[0].Vector, topK: 8, ct);

    // Gate: if the best chunk is below threshold, no answer is possible
    var relevantChunks = chunks.Where(c => c.Similarity > 0.65).ToList();

    if (relevantChunks.Count == 0)
    {
        return new RagAnswer(
            Answer:   "I don't have information about this in the policy documents. "
                    + "Please contact HR directly at hr@company.com.",
            Grounded: false,
            Sources:  []);
    }

    var context = string.Join("\n\n---\n\n", relevantChunks.Select(c => c.Text));

    var response = await chatClient.CompleteAsync([
        new(ChatRole.System, """
            You are an HR policy assistant.
            Answer ONLY from the provided context.
            If the context does not contain enough information to answer,
            say exactly: "I don't have enough information about this in our policies."
            Do not infer, speculate, or add information not in the context.
            """),
        new(ChatRole.User, $"Context:\n{context}\n\nQuestion: {question}")
    ], cancellationToken: ct);

    return new RagAnswer(
        Answer:   response.Message.Text!,
        Grounded: true,
        Sources:  relevantChunks.Select(c => c.Source).ToList());
}
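
Pulling the threshold gate out of `AnswerAsync` into a pure function makes it unit-testable without calling the embedder or the model. A sketch, with `ScoredChunk` and `RetrievalGate` as illustrative names; the 0.65 threshold, the top-8 limit, and the strict `>` comparison mirror the code above:

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical shape of a search hit: chunk text, its source, and cosine similarity.
public record ScoredChunk(string Text, string Source, double Similarity);

public static class RetrievalGate
{
    // Keeps chunks above the similarity threshold, best first, capped at topK.
    // An empty result is the signal to return the "I don't know" fallback.
    public static List<ScoredChunk> SelectRelevant(
        IEnumerable<ScoredChunk> candidates,
        double minSimilarity = 0.65,
        int topK = 8) =>
        candidates
            .Where(c => c.Similarity > minSimilarity)
            .OrderByDescending(c => c.Similarity)
            .Take(topK)
            .ToList();
}
```

With the crypto example, a single candidate at 0.61 produces an empty list, so the request never reaches the model.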

Root Cause 4: Context Window Overflow (10% of hallucinations)

8 chunks × 500 tokens = 4,000 context tokens
System prompt:           800 tokens
Question:                50 tokens
Total:                   4,850 tokens

gpt-4o context: 128K — no overflow.
But gpt-4o's attention degraded on the middle chunks.
("Lost in the middle" problem.)

Answer used info from chunk 1 and chunk 8.
Chunk 4 (the most relevant) was ignored.
C#
// Fix: re-rank chunks by relevance, put the most relevant first and last
public List<DocumentChunk> ReorderForAttention(List<DocumentChunk> chunks)
{
    if (chunks.Count <= 2) return chunks;

    // Sort by similarity descending
    var sorted = chunks.OrderByDescending(c => c.Similarity).ToList();

    // Put top chunk first, second-best chunk last, rest in the middle
    var result = new List<DocumentChunk> { sorted[0] };
    result.AddRange(sorted.Skip(2));     // middle
    result.Add(sorted[1]);               // last position = high attention
    return result;
}
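
A variant of the same idea, sketched here as an alternative and not what the system above does: alternate chunks between the front and the back of the list, so relevance falls off toward the middle from both ends and the weakest chunk sits in the centre. `RankedChunk`, `AttentionReorder`, and `Interleave` are illustrative names:

```csharp
using System.Collections.Generic;
using System.Linq;

public record RankedChunk(string Text, double Similarity);

public static class AttentionReorder
{
    public static List<RankedChunk> Interleave(IEnumerable<RankedChunk> chunks)
    {
        var sorted = chunks.OrderByDescending(c => c.Similarity).ToList();
        var front = new List<RankedChunk>();
        var back = new List<RankedChunk>();

        // Ranks 1, 3, 5, ... fill the front; ranks 2, 4, 6, ... fill the back.
        for (var i = 0; i < sorted.Count; i++)
        {
            if (i % 2 == 0) front.Add(sorted[i]);
            else back.Add(sorted[i]);
        }

        // Reverse the back half so rank 2 lands in the final position.
        back.Reverse();
        front.AddRange(back);
        return front;
    }
}
```

For five chunks with similarities 0.9, 0.8, 0.7, 0.6, 0.5 this yields the order 0.9, 0.7, 0.5, 0.6, 0.8. Whether it beats the simpler first-and-last placement is something the eval run would have to show.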

Root Cause 5: System Prompt Too Permissive (5% of hallucinations)

Original system prompt:
  "You are a helpful HR assistant. Answer employee questions about company policy."

This gave the model freedom to be "helpful" — which meant filling gaps
with plausible-sounding information when the context was incomplete.
C#
// Fix: explicit constraints in the system prompt
var systemPrompt = """
    You are an HR policy assistant for [Company Name].

    STRICT RULES:
    1. Answer ONLY from the provided policy context — never from general knowledge.
    2. If you cannot find the answer in the context, say:
       "I don't have this information. Please contact HR at hr@company.com."
    3. Quote the exact policy text when possible.
    4. Always state the document name and last-updated date.
    5. Never add your own interpretation or infer unstated rules.
    6. If a policy has exceptions or conditions, state all of them.
    """;

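Because rule 2 fixes the refusal wording, the application can detect refusals with a plain string check, which is one way to track an "I don't know" rate. A sketch under that assumption; `RefusalDetector` is an illustrative name, and the marker phrases are copied from the prompts and fallback message in this article:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class RefusalDetector
{
    private static readonly string[] Markers =
    {
        "I don't have this information",
        "I don't have enough information about this in our policies",
        "I don't have information about this in the policy documents",
    };

    public static bool IsRefusal(string answer)
    {
        // Models sometimes emit a typographic apostrophe; normalise it before matching.
        var normalised = answer.Replace('\u2019', '\'');
        return Markers.Any(m => normalised.Contains(m, StringComparison.OrdinalIgnoreCase));
    }

    // Share of answers that are refusals; 0 for an empty collection.
    public static double RefusalRate(IReadOnlyCollection<string> answers) =>
        answers.Count == 0
            ? 0.0
            : (double)answers.Count(a => IsRefusal(a)) / answers.Count;
}
```

A rising refusal rate alongside a flat hallucination rate would suggest the threshold or prompt has become too strict, so the two numbers are worth reading together.
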
Results After All Five Fixes

                          Before    After
Mean faithfulness score    5.8       8.9 / 10
Hallucination rate         31%       3.8%
"I don't know" rate         0%        8%   (appropriate — better than wrong answers)
User satisfaction          38%       81%   (follow-up survey, 6 weeks after fixes)

By category:
  Policy documents:  41% → 4.2%
  FAQ documents:     18% → 3.1%
  Benefits:          35% → 4.5%

The Eval CI Pipeline That Now Exists

YAML
# .github/workflows/rag-eval.yml
name: RAG Eval

on:
  schedule:
    - cron: '0 6 * * 1'   # every Monday 06:00 UTC
  push:
    paths:
      - 'src/KnowledgeBase/**'

jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - uses: actions/setup-dotnet@v4
        with:
          dotnet-version: '9.0.x'

      - name: Run RAG eval
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
          CONNECTION_STRING: ${{ secrets.DB_CONNECTION }}
        run: dotnet test tests/RagEval/ --logger "console;verbosity=normal"

      - name: Check hallucination threshold
        # Assumes the eval run writes eval-results.json to the workspace root
        run: |
          RATE=$(jq '.hallucinationRate' eval-results.json)
          if (( $(echo "$RATE > 0.08" | bc -l) )); then
            echo "Hallucination rate $RATE exceeds 8% threshold"
            exit 1
          fi
C#
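// xUnit eval test: fails CI if quality drops
// Plain xUnit only injects fixtures into test-class constructors, so this assumes
// RagEvalRunner is supplied by a fixture or a DI extension for xUnit.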
public class FaithfulnessEvalTests(RagEvalRunner runner)
{
    [Fact]
    public async Task Faithfulness_MeetsThreshold()
    {
        var report = await runner.RunAsync(CancellationToken.None);

        report.HallucinationRate.Should().BeLessThan(0.08,
            $"hallucination rate {report.HallucinationRate:P1} exceeds 8% threshold. " +
            $"Mean faithfulness: {report.MeanFaithfulness:F1}/10");
    }
}
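
The workflow's threshold step reads `eval-results.json`, which the test shown here does not produce. One way to bridge that gap, sketched under the assumption that the eval run is responsible for writing the file: serialize a summary with camelCase keys so that `jq '.hallucinationRate'` finds the field. `EvalSummary` and `EvalResultsWriter` are illustrative names:

```csharp
using System.IO;
using System.Text.Json;

// Hypothetical summary shape; field names chosen to match the jq query in the workflow.
public record EvalSummary(int Total, double HallucinationRate, double MeanFaithfulness);

public static class EvalResultsWriter
{
    // Web defaults serialize property names in camelCase:
    // HallucinationRate becomes "hallucinationRate".
    private static readonly JsonSerializerOptions Options = new(JsonSerializerDefaults.Web)
    {
        WriteIndented = true
    };

    public static string ToJson(EvalSummary summary) =>
        JsonSerializer.Serialize(summary, Options);

    public static void Write(EvalSummary summary, string path = "eval-results.json") =>
        File.WriteAllText(path, ToJson(summary));
}
```

A relative path resolves against the test process's working directory, which under `dotnet test` is typically the test assembly's output folder and not the repository root, so CI would likely need an absolute path (for example one passed in through an environment variable).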

Lessons

1. "It works" is not the same as "it's accurate."
   Ship evals on day one. Don't wait for a user survey to tell you quality is poor.

2. Stale data was the largest single cause of our "hallucinations" (40%).
   The model isn't making things up — it's answering faithfully from outdated documents.
   Automated re-ingestion on document change is not optional.

3. Chunk boundaries matter more than chunk size.
   A conditional statement split across two chunks is unrecoverable at retrieval time.
   Use semantic boundaries (paragraphs, sections) not fixed token counts.

4. A confident "I don't know" is better than a confident wrong answer.
   Gate on retrieval similarity threshold. If no relevant context exists, say so.
   Employees trust the system more when it admits uncertainty.

5. The system prompt is a safety constraint, not a personality description.
   "Helpful HR assistant" → the model is helpful by filling gaps.
   Explicit rules ("Answer ONLY from context", "Never infer") reduce this.
