Learnixo

Scenario: Your RAG System Is Hallucinating — Debug It

The Scenario

Your team shipped a RAG chatbot backed by a product documentation knowledge base. Within a week, support tickets start arriving: users say the bot is giving wrong answers. One user asked "what is the maximum file upload size?" and got "50MB" — the actual limit is 10MB. Another asked about pricing and received a number that does not exist anywhere in your docs.

The knowledge base exists. The LLM is smart. So why is it hallucinating?

This is one of the most common production AI incidents. Let us work through it systematically.

Taxonomy of RAG Hallucination

Hallucination in a RAG system is almost never the LLM's fault alone. It falls into three categories:

Category 1: Retrieval failure — The right chunk never made it into the context window. The LLM had no choice but to generate from its pretrained weights (i.e., make something up).

Category 2: Context confusion — Multiple chunks were retrieved, some contradictory. The LLM picked the wrong one or merged them incorrectly.

Category 3: Instruction failure — The LLM was not told to stay grounded. No instruction like "only answer from the provided context." It freely extrapolated.

Your diagnosis strategy depends on which category you are dealing with.
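
To make the three categories concrete, here is a minimal triage sketch for a failure where you already know the correct fact. The names `FailureCategory` and `triage_failure` are illustrative assumptions, and the substring matching is deliberately crude: treat the result as a first guess, not a verdict.

```python
from enum import Enum
from typing import List


class FailureCategory(Enum):
    RETRIEVAL_FAILURE = "retrieval_failure"
    CONTEXT_CONFUSION = "context_confusion"
    INSTRUCTION_FAILURE = "instruction_failure"


def triage_failure(
    correct_fact: str,
    conflicting_facts: List[str],
    retrieved_texts: List[str],
) -> FailureCategory:
    """Guess the category of a known-wrong answer from the retrieved text."""
    haystack = [text.lower() for text in retrieved_texts]

    # Category 1: the correct fact never reached the context window
    if not any(correct_fact.lower() in text for text in haystack):
        return FailureCategory.RETRIEVAL_FAILURE

    # Category 2: the correct fact is there, but so is a competing value
    for fact in conflicting_facts:
        if any(fact.lower() in text for text in haystack):
            return FailureCategory.CONTEXT_CONFUSION

    # Category 3: clean context, so the model most likely ignored it
    return FailureCategory.INSTRUCTION_FAILURE
```

For the upload-size incident, `correct_fact` would be "10MB" and `conflicting_facts` would be `["50MB"]`.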

Step 1: Log the Retrieved Context

You cannot debug what you cannot see. The first instrument to add is a context logger that records exactly what was passed to the LLM for every query.

Python
import json
import logging
from datetime import datetime, timezone
from dataclasses import dataclass, asdict
from typing import List

logger = logging.getLogger("rag.context_audit")

@dataclass
class RetrievedChunk:
    chunk_id: str
    source_document: str
    page_number: int
    content: str
    similarity_score: float

@dataclass
class RAGQueryLog:
    query_id: str
    user_query: str
    retrieved_chunks: List[RetrievedChunk]
    llm_response: str
    timestamp: str
    answer_found_in_context: bool  # set this after post-hoc check

def log_rag_query(
    query_id: str,
    user_query: str,
    chunks: List[RetrievedChunk],
    llm_response: str,
) -> RAGQueryLog:
    entry = RAGQueryLog(
        query_id=query_id,
        user_query=user_query,
        retrieved_chunks=chunks,
        llm_response=llm_response,
        # datetime.utcnow() is deprecated since Python 3.12; use an aware UTC timestamp
        timestamp=datetime.now(timezone.utc).isoformat(),
        answer_found_in_context=False,  # filled by verifier
    )
    logger.info(json.dumps(asdict(entry)))
    return entry

Once you have logs, run a spot-check script that checks whether the answer is actually present in the retrieved chunks:

Python
def check_answer_in_context(
    answer_fragment: str,
    chunks: List[RetrievedChunk],
    case_sensitive: bool = False,
) -> bool:
    """
    Simple heuristic: does the key fact from the LLM response
    appear verbatim in any retrieved chunk?
    """
    if not case_sensitive:
        answer_fragment = answer_fragment.lower()

    for chunk in chunks:
        text = chunk.content if case_sensitive else chunk.content.lower()
        if answer_fragment in text:
            return True
    return False

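# Example usage during offline evaluation.
# load_recent_logs and extract_key_claim are placeholders for your own helpers:
# the first returns a list of RAGQueryLog entries, the second pulls a short
# key phrase (for example "10MB") out of the response text.
logs = load_recent_logs(hours=24)
for log in logs:
    key_phrase = extract_key_claim(log.llm_response)
    log.answer_found_in_context = check_answer_in_context(
        key_phrase, log.retrieved_chunks
    )

# Guard against an empty log window before dividing
if logs:
    hallucination_rate = sum(
        1 for log in logs if not log.answer_found_in_context
    ) / len(logs)
    print(f"Estimated hallucination rate: {hallucination_rate:.1%}")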

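Treat this number as a rough signal: verbatim matching misses paraphrased answers, so it tends to overestimate the true rate. Even so, a rate above 5% is a red flag that demands investigation.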
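
The usage example calls `extract_key_claim` without defining it. One possible version, sketched under the assumption that the claims you care about are numeric facts such as sizes, prices, and percentages, takes the first number (with an optional unit) from the response. The regex and the unit list are illustrative choices, not part of the original pipeline.

```python
import re
from typing import Optional

# A number with an optional currency sign or unit: "50MB", "10 MB", "$49", "99.9%"
_CLAIM_PATTERN = re.compile(
    r"\$?\d+(?:[.,]\d+)*\s?(?:%|[KMGT]B|ms|seconds?|minutes?|hours?|days?)?",
    re.IGNORECASE,
)


def extract_key_claim(response: str) -> Optional[str]:
    """Return the first numeric claim in the response, or None if there is none."""
    match = _CLAIM_PATTERN.search(response)
    if match is None:
        return None
    # Drop the trailing space the optional-unit part of the pattern may capture
    return match.group(0).strip()
```

Because this can return `None`, skip those log entries (or handle them separately) before passing the phrase to `check_answer_in_context`.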

Step 2: Diagnose Chunk Quality

Run your failing queries against the vector store and print the top-5 results with scores:

Python
from openai import AzureOpenAI

# The API key is read from the AZURE_OPENAI_API_KEY environment variable
# unless you pass api_key=... explicitly.
client = AzureOpenAI(
    azure_endpoint="https://your-resource.openai.azure.com",
    api_version="2024-02-01",
)

def diagnose_retrieval(query: str, vector_store, top_k: int = 5):
    """Print retrieved chunks and scores for a failing query."""
    # On Azure, `model` is the name of your embedding deployment
    embedding = client.embeddings.create(
        model="text-embedding-3-large",
        input=query,
    ).data[0].embedding

    # Assumes the store accepts a query embedding and returns (chunk, score)
    # pairs where a higher score means more similar. Some stores expect the
    # query text or return a distance instead; adapt this call to your store.
    results = vector_store.similarity_search_with_score(
        embedding, k=top_k
    )

    print(f"\n=== RETRIEVAL DIAGNOSIS for: '{query}' ===")
    for i, (chunk, score) in enumerate(results, 1):
        print(f"\n--- Chunk {i} (score: {score:.4f}) ---")
        print(f"Source: {chunk.metadata.get('source', 'unknown')}")
        print(f"Content: {chunk.page_content[:300]}...")
        print()

    # The 0.75 cutoff is a starting point; calibrate it for your embedding model
    top_score = results[0][1] if results else 0
    if top_score < 0.75:
        print("WARNING: Top score below 0.75; retrieval is unreliable")
    return results

# Run against your failing queries
failing_queries = [
    "what is the maximum file upload size",
    "enterprise pricing tier",
]
for q in failing_queries:
    diagnose_retrieval(q, vector_store)
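
The 0.75 check only makes sense when the score is a cosine similarity, where higher means closer. Some vector stores report cosine distance instead, where lower means closer. The sketch below shows both quantities in plain Python so you can check which one your store returns; the function names are illustrative.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # similarity is undefined for a zero vector; treat as no match
    return dot / (norm_a * norm_b)


def cosine_distance_to_similarity(distance: float) -> float:
    """Convert a cosine distance (0 = identical) back to a similarity."""
    return 1.0 - distance
```

If your store returns distances, convert them before comparing against the threshold, or flip the comparison.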

Root Cause: Chunks Too Small

A common mistake is chunking at 256 tokens to "keep context tight." But short chunks lose the surrounding context that gives a sentence meaning. "The limit is 10MB" without the preceding sentence "For file uploads, the platform enforces the following restriction:" can be ambiguous or unretrievable.

Python
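from langchain_text_splitters import RecursiveCharacterTextSplitter

# Note: the plain constructor measures chunk_size in characters, not tokens.
# from_tiktoken_encoder counts tokens instead (requires the tiktoken package).

# BAD: 256-token chunks with no overlap
text_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    encoding_name="cl100k_base",
    chunk_size=256,
    chunk_overlap=0,
)

# GOOD: 512-token chunks with roughly 10% overlap to preserve context boundaries
text_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    encoding_name="cl100k_base",
    chunk_size=512,
    chunk_overlap=50,
    separators=["\n\n", "\n", ". ", " ", ""],
)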

Root Cause: Chunks Too Large

The opposite problem also exists. If a chunk is 2,000 tokens covering five different topics, the embedding is an average of all five topics and matches nothing precisely. Aim for single-concept chunks.
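
A quick audit of the existing index can show which of the two problems you have. The sketch below is a rough aid rather than a precise measurement: it approximates tokens by whitespace-separated words, and the 256 and 800 bounds are assumed defaults you should tune.

```python
from typing import Dict, List


def audit_chunk_sizes(
    chunks: List[str],
    min_tokens: int = 256,
    max_tokens: int = 800,
) -> Dict[str, List[int]]:
    """Bucket chunk indices by approximate size."""
    report: Dict[str, List[int]] = {"too_small": [], "ok": [], "too_large": []}
    for index, text in enumerate(chunks):
        # Word count is a crude proxy; real tokenizers usually produce more tokens
        approx_tokens = len(text.split())
        if approx_tokens < min_tokens:
            report["too_small"].append(index)
        elif approx_tokens > max_tokens:
            report["too_large"].append(index)
        else:
            report["ok"].append(index)
    return report
```

Reading a handful of chunks from the `too_small` and `too_large` buckets is usually enough to tell whether re-chunking is worth the re-indexing cost.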

Fix 1: Add a Cross-Encoder Reranker

Bi-encoder embeddings (the kind used in vector search) are fast but approximate. A cross-encoder takes the query and each candidate chunk together and produces a precise relevance score. It is slower but much more accurate.

Python
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def retrieve_and_rerank(
    query: str,
    vector_store,
    initial_k: int = 20,
    final_k: int = 5,
) -> list:  # returns the store's document objects (with .page_content)
    # Step 1: broad retrieval with bi-encoder
    candidates = vector_store.similarity_search(query, k=initial_k)

    # Step 2: precise reranking with cross-encoder
    pairs = [(query, chunk.page_content) for chunk in candidates]
    scores = reranker.predict(pairs)

    # Step 3: sort by reranker score, keep top final_k
    ranked = sorted(
        zip(candidates, scores), key=lambda x: x[1], reverse=True
    )
    top_chunks = [chunk for chunk, _ in ranked[:final_k]]
    return top_chunks

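Reranking can cut the hallucination rate substantially (figures of 30-50% are often quoted for production systems), but the gain depends on your corpus, so measure it on your own failing queries.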
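
One way to measure the effect is a hit rate at k: the share of test queries whose known-correct chunk appears in the top k results. The sketch below assumes you have a small labeled set that maps each query to the ID of the chunk containing its answer; the function and argument names are illustrative.

```python
from typing import Dict, List


def hit_rate_at_k(
    retrieved_ids: Dict[str, List[str]],
    gold_ids: Dict[str, str],
    k: int = 5,
) -> float:
    """Fraction of queries whose gold chunk ID is among the top-k retrieved IDs."""
    if not gold_ids:
        return 0.0
    hits = sum(
        1
        for query, gold in gold_ids.items()
        if gold in retrieved_ids.get(query, [])[:k]
    )
    return hits / len(gold_ids)
```

Run it once on the plain vector-search ranking and once on the reranked ranking with the same `k`; if the number does not move, the reranker is not fixing your failures.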

Fix 2: Hybrid Search (BM25 + Vector)

Pure semantic search misses exact-match queries. If a user asks "10MB limit," BM25 (keyword search) will find the chunk containing "10MB" even if the embedding similarity is mediocre.

Python
import re
from typing import List

from rank_bm25 import BM25Okapi

def tokenize(text: str) -> List[str]:
    # Split on word characters so "10MB." and "10MB" produce the same token
    return re.findall(r"\w+", text.lower())

class HybridRetriever:
    def __init__(self, chunks: List[str], embeddings):
        self.chunks = chunks
        # `embeddings` is assumed to expose similarity(query) -> NumPy array
        # with one cosine score per chunk, in the same order as `chunks`
        self.embeddings = embeddings
        self.bm25 = BM25Okapi([tokenize(c) for c in chunks])

    def retrieve(self, query: str, k: int = 5, alpha: float = 0.5):
        """
        alpha=0.5 means equal weight to BM25 and vector scores.
        Increase alpha toward 1.0 for more keyword weighting.
        """
        # BM25 scores, scaled by the best-scoring chunk
        bm25_scores = self.bm25.get_scores(tokenize(query))
        bm25_norm = bm25_scores / (bm25_scores.max() + 1e-9)

        # Vector scores, scaled the same way
        vector_scores = self.embeddings.similarity(query)
        vector_norm = vector_scores / (vector_scores.max() + 1e-9)

        # Fused score
        fused = alpha * bm25_norm + (1 - alpha) * vector_norm
        top_indices = fused.argsort()[-k:][::-1]
        return [self.chunks[i] for i in top_indices]
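
Weighted score fusion needs the two score scales to be comparable, which max-normalization only approximates. A commonly used alternative, not part of the retriever above, is reciprocal rank fusion (RRF), which combines rank positions and ignores raw scores. The sketch below assumes each retriever has already produced an ordered list of chunk IDs; `rrf_k=60` is the conventional default constant.

```python
from typing import Dict, List


def reciprocal_rank_fusion(
    rankings: List[List[str]],
    rrf_k: int = 60,
    top_n: int = 5,
) -> List[str]:
    """Fuse several ranked lists of chunk IDs into one list."""
    scores: Dict[str, float] = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            # Each list contributes 1 / (rrf_k + rank) for every chunk it contains
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (rrf_k + rank)
    ordered = sorted(scores, key=lambda chunk_id: scores[chunk_id], reverse=True)
    return ordered[:top_n]
```

Pass the BM25 ranking and the vector ranking as the two lists; a chunk ranked well by both rises to the top without any score tuning.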

Fix 3: Citation Enforcement in the System Prompt

Even with perfect retrieval, the LLM can ignore the context. The system prompt must be explicit:

Python
SYSTEM_PROMPT = """You are a documentation assistant. You MUST answer ONLY using
the provided context passages. Each passage is labeled [SOURCE_N].

Rules:
1. Every claim in your answer MUST be followed by its source label, e.g., [SOURCE_1].
2. If the answer is NOT present in the provided context, respond with:
   "I don't have information about that in the current documentation."
3. Never infer, extrapolate, or use your general knowledge to fill gaps.
4. If context passages contradict each other, say so and cite both.

Context:
{context}

User question: {question}"""
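
The template above bundles the rules, the context, and the question into one string. A possible variation, sketched here as an assumption rather than a requirement, keeps the rules in the system role and sends the labeled context plus the question as the user message. `GROUNDING_RULES` and `build_grounded_messages` are hypothetical names, and the rules text is a condensed version of the prompt above.

```python
from typing import Dict, List

# Hypothetical condensed rules; adapt the wording to your own prompt
GROUNDING_RULES = (
    "You are a documentation assistant. Answer ONLY from the passages "
    "labeled [SOURCE_N] and cite the label after every claim. If the answer "
    "is not in the passages, reply: \"I don't have information about that "
    "in the current documentation.\""
)


def build_grounded_messages(
    chunk_texts: List[str],
    question: str,
) -> List[Dict[str, str]]:
    """Build chat messages with numbered sources, starting at SOURCE_1."""
    context = "\n\n".join(
        f"[SOURCE_{i}]\n{text}" for i, text in enumerate(chunk_texts, start=1)
    )
    return [
        {"role": "system", "content": GROUNDING_RULES},
        {"role": "user", "content": f"Context:\n{context}\n\nUser question: {question}"},
    ]
```

The returned list can be passed as the `messages` argument of a chat completion call.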

Fix 4: Post-Response Citation Verifier

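Add an automatic check after the LLM responds to verify that every citation label in the answer points to a chunk that was actually retrieved. This catches invented source labels; it does not confirm that the cited chunk supports the claim.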

Python
import re

def verify_citations(
    response: str,
    source_map: dict,  # {"SOURCE_1": chunk_text, ...}
) -> dict:
    """
    Checks that every [SOURCE_N] citation in the response
    corresponds to a real retrieved chunk.
    Returns a report of valid and invalid citations.
    """
    cited = re.findall(r"\[SOURCE_(\d+)\]", response)
    # "uncited_claims" is a placeholder: this function does not populate it
    results = {"valid": [], "invalid": [], "uncited_claims": []}

    for num in cited:
        key = f"SOURCE_{num}"
        if key in source_map:
            results["valid"].append(key)
        else:
            results["invalid"].append(key)
            # Flag for human review or automatic retry

    if results["invalid"]:
        # Log for monitoring and optionally retry with stricter prompt
        logger.warning(f"Invalid citations found: {results['invalid']}")

    return results
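
The check above validates labels only. A possible next step, sketched here under the assumption that the riskiest claims are numeric (sizes, prices, percentages), is to confirm that each number in a cited passage of the answer also appears in the source it cites. The function name and regexes are illustrative, and the check will miss paraphrased or converted values.

```python
import re
from typing import Dict, List

# Claim text followed by one or more labels, e.g. "... is 50MB [SOURCE_1]."
_CITED_SPAN = re.compile(r"(.*?)((?:\[SOURCE_\d+\][\s.,;]*)+)", re.DOTALL)
_NUMBER = re.compile(r"\d+(?:[.,]\d+)*\s?(?:%|[KMGT]B)?", re.IGNORECASE)


def find_unsupported_numbers(
    response: str,
    source_map: Dict[str, str],
) -> List[dict]:
    """Report numbers in the answer that do not occur in their cited sources."""
    problems = []
    for span in _CITED_SPAN.finditer(response):
        claim_text, label_text = span.group(1), span.group(2)
        labels = re.findall(r"SOURCE_\d+", label_text)
        cited = " ".join(source_map.get(label, "") for label in labels)
        # Compare without whitespace so "2 GB" and "2GB" count as the same value
        cited_compact = re.sub(r"\s+", "", cited).lower()
        for match in _NUMBER.finditer(claim_text):
            value = re.sub(r"\s+", "", match.group(0)).lower()
            # Digit guards stop "10mb" from matching inside "110mb"
            pattern = r"(?<!\d)" + re.escape(value) + r"(?!\d)"
            if not re.search(pattern, cited_compact):
                problems.append(
                    {"value": value, "claim": claim_text.strip(), "cited": labels}
                )
    return problems
```

A non-empty result is a good trigger for a retry or for routing the answer to human review.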

Putting It Together: A Hardened RAG Pipeline

Python
async def hardened_rag_query(
    user_query: str,
    query_id: str,
    vector_store,
    llm_client,
) -> dict:
    # 1. Retrieve with reranking
    chunks = retrieve_and_rerank(
        query=user_query,
        vector_store=vector_store,
        initial_k=20,
        final_k=5,
    )

    # 2. Build source map and context string
    source_map = {f"SOURCE_{i+1}": c.page_content for i, c in enumerate(chunks)}
    context = "\n\n".join(
        f"[SOURCE_{i+1}]\n{c.page_content}" for i, c in enumerate(chunks)
    )

    # 3. Construct prompt
    prompt = SYSTEM_PROMPT.format(context=context, question=user_query)

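    # 4. Call LLM. `llm_client` is assumed to be an async client
    # (for example AsyncAzureOpenAI), so the call must be awaited.
    response = await llm_client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # more deterministic output; does not by itself prevent hallucination
    )
    answer = response.choices[0].message.content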

    # 5. Verify citations
    citation_report = verify_citations(answer, source_map)

    # 6. Log everything for monitoring
    log_rag_query(
        query_id=query_id,
        user_query=user_query,
        chunks=[
            RetrievedChunk(
                chunk_id=str(i),
                source_document=c.metadata.get("source", ""),
                page_number=c.metadata.get("page", 0),
                content=c.page_content,
                similarity_score=0.0,  # populated by reranker in production
            )
            for i, c in enumerate(chunks)
        ],
        llm_response=answer,
    )

    return {
        "answer": answer,
        "sources": list(source_map.keys()),
        "citation_validity": citation_report,
        "has_invalid_citations": bool(citation_report["invalid"]),
    }
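
The pipeline reports `has_invalid_citations` but leaves the decision to the caller. One possible policy, sketched here as an assumption, is to fall back to the refusal message from the system prompt whenever the answer cites a source that was never retrieved or cites nothing at all. `finalize_answer` is a hypothetical helper that expects the dictionary returned above.

```python
FALLBACK_ANSWER = "I don't have information about that in the current documentation."


def finalize_answer(result: dict) -> str:
    """Return the answer only if it carries valid citations and no invalid ones."""
    report = result["citation_validity"]
    if result["has_invalid_citations"] or not report["valid"]:
        return FALLBACK_ANSWER
    return result["answer"]
```

A stricter setup might retry once with a narrower prompt before falling back; a looser one might show the answer with a warning.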

Summary: The Hallucination Checklist

When your RAG system hallucinates, run through this checklist in order:

  1. Log retrieved chunks — Can you see what the LLM actually received?
  2. Check similarity scores — Is the top score above 0.75?
  3. Check chunk size — Are chunks between 256 and 800 tokens with overlap?
  4. Add reranking — Does a cross-encoder improve top-5 quality?
  5. Try hybrid search — Does adding BM25 recover exact-match failures?
  6. Harden the system prompt — Is the LLM explicitly told to stay grounded?
  7. Enforce citations — Is every claim traceable to a source?
  8. Set temperature=0 — High temperature increases creative (wrong) completions.

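Each step is independently valuable. Together they typically reduce hallucination rates substantially, from 15-25% for naive RAG to under 3% for a production-hardened pipeline. Treat these figures as rough guides and track your own rate with the logs from Step 1.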