Scenario: Retrieval Returns Irrelevant Chunks — Scenario Based Questions | Learnixo

The Scenario

Your team built a RAG system for internal HR policy documents. The vector database has 10,000 chunks from 500 policy PDFs. But users report that when they ask "how many vacation days do I get?", the system surfaces chunks about expense reimbursement. When they ask about parental leave, they get results about remote work policies.

The embedding model is running. The vector store is populated. So why is retrieval consistently off-target?

This scenario is more common than hallucination and harder to diagnose because the failures are silent — you retrieve something, you just retrieve the wrong thing.

Diagnosing Retrieval Quality

Before fixing anything, quantify the problem. Build a small test set of 20-30 query/expected-document pairs, then measure how often an expected document appears in the top k results and at what rank.

Python

from dataclasses import dataclass
from typing import List, Tuple

import numpy as np

@dataclass
class RetrievalTestCase:
    query: str
    expected_doc_ids: List[str]  # at least one of these should appear in top-k

def evaluate_retrieval(
    test_cases: List[RetrievalTestCase],
    retriever,
    k: int = 5,
) -> dict:
    hits = 0
    mrr_scores = []

    for case in test_cases:
        results = retriever.retrieve(case.query, k=k)
        result_ids = [r.metadata["doc_id"] for r in results]

        # Hit@k: did any expected doc appear in top-k?
        hit = any(doc_id in result_ids for doc_id in case.expected_doc_ids)
        hits += int(hit)

        # MRR: where did the first relevant result appear?
        for rank, doc_id in enumerate(result_ids, 1):
            if doc_id in case.expected_doc_ids:
                mrr_scores.append(1 / rank)
                break
        else:
            mrr_scores.append(0)

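    # Note: despite the key name, this is hit rate@k: the share of queries
    # with at least one expected document in the top k.
    precision_at_k = hits / len(test_cases)
    mean_rr = float(np.mean(mrr_scores))

    print(f"Precision@{k}: {precision_at_k:.2%}")
    print(f"MRR:          {mean_rr:.4f}")
    return {"precision_at_k": precision_at_k, "mrr": mean_rr}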

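As a rule of thumb, a Precision@5 below 60% (measured here as the share of queries with at least one expected document in the top 5) means your retrieval layer has a serious problem. Below 40% means users cannot rely on the system at all.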
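The evaluate_retrieval function above counts a query as a success when any expected document lands in the top k, which is usually called hit rate. Strict precision@k instead asks what share of the k slots hold relevant documents. The sketch below scores a single ranked list both ways so the two numbers are not confused. The function name, the document IDs, and the ranking are made up for illustration.

```python
from typing import Dict, List, Set


def score_ranking(result_ids: List[str], relevant_ids: Set[str], k: int = 5) -> Dict[str, float]:
    """Score one ranked result list against the set of relevant doc IDs."""
    top_k = result_ids[:k]
    relevant_in_top_k = [doc_id for doc_id in top_k if doc_id in relevant_ids]

    # Hit@k: 1.0 if at least one relevant document made it into the top k.
    hit = 1.0 if relevant_in_top_k else 0.0

    # Precision@k: share of the k slots filled with relevant documents.
    precision = len(relevant_in_top_k) / k

    # Reciprocal rank: 1 / position of the first relevant document, 0 if none.
    reciprocal_rank = 0.0
    for rank, doc_id in enumerate(top_k, start=1):
        if doc_id in relevant_ids:
            reciprocal_rank = 1.0 / rank
            break

    return {"hit": hit, "precision": precision, "reciprocal_rank": reciprocal_rank}


# Hypothetical ranking: the only relevant document sits at position 3.
scores = score_ranking(
    ["expenses-01", "remote-02", "leave-07", "travel-03", "it-09"],
    {"leave-07"},
    k=5,
)
print(scores)
```

Here the query counts as a hit even though only one of five results is relevant, so a high hit rate can coexist with noisy context being passed to the generator.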

Root Cause 1: Wrong Embedding Model

A common root cause is using a general-purpose embedding model for a domain-specific corpus. text-embedding-ada-002 is a general-purpose model. If your documents use specialized vocabulary (medical, legal, financial, pharmaceutical), the embeddings may not capture the semantic relationships correctly.

Test this by checking similarity scores for known relevant pairs:

Python

from typing import List, Tuple

import numpy as np
from openai import AzureOpenAI
from sklearn.metrics.pairwise import cosine_similarity

# The API key is read from the AZURE_OPENAI_API_KEY environment variable.
client = AzureOpenAI(
    azure_endpoint="https://your-resource.openai.azure.com",
    api_version="2024-02-01",
)

def test_embedding_model(
    model: str,
    relevant_pairs: List[Tuple[str, str]],
    irrelevant_pairs: List[Tuple[str, str]],
):
    """
    Relevant pairs should score much higher than irrelevant pairs.
    As a rough rule of thumb, look for a gap of at least 0.2 between the means.
    """
    def embed(text):
        # On Azure OpenAI, `model` is the name of your embedding deployment.
        return client.embeddings.create(model=model, input=text).data[0].embedding

    relevant_scores = []
    for q, doc in relevant_pairs:
        eq, ed = np.array(embed(q)), np.array(embed(doc))
        score = cosine_similarity([eq], [ed])[0][0]
        relevant_scores.append(score)

    irrelevant_scores = []
    for q, doc in irrelevant_pairs:
        eq, ed = np.array(embed(q)), np.array(embed(doc))
        score = cosine_similarity([eq], [ed])[0][0]
        irrelevant_scores.append(score)

    print(f"Model: {model}")
    print(f"  Relevant pairs avg score:   {np.mean(relevant_scores):.4f}")
    print(f"  Irrelevant pairs avg score: {np.mean(irrelevant_scores):.4f}")
    print(f"  Gap: {np.mean(relevant_scores) - np.mean(irrelevant_scores):.4f}")
    print()

# Compare models
relevant = [
    ("annual leave entitlement", "Employees are entitled to 25 days of annual leave per year."),
    ("parental leave policy", "Primary caregivers receive 16 weeks of paid parental leave."),
]
irrelevant = [
    ("annual leave entitlement", "All expense claims must be submitted within 30 days."),
    ("parental leave policy", "Remote work requests require manager approval."),
]

test_embedding_model("text-embedding-ada-002", relevant, irrelevant)
test_embedding_model("text-embedding-3-large", relevant, irrelevant)

Root Cause 2: Embedding Dimension Mismatch

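If you switched embedding models without re-indexing, stored vectors and query vectors come from different embedding spaces. Old vectors might be 1536-dimensional (ada-002) while new query embeddings are 3072-dimensional (3-large). Cosine similarity is undefined between vectors of different lengths, so many vector stores reject such queries outright. The quieter failure is a switch between two models with the same dimension (for example ada-002 and text-embedding-3-small, both 1536): queries run without error, but the scores are meaningless.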

Python

def check_dimension_consistency(vector_store, sample_query: str):
    """
    Compares the dimension of stored vectors vs. the dimension
    produced by the current embedding model.
    """
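    # Get dimension of a stored vector.
    # get_sample_vector() is a placeholder: use whatever call your vector
    # store provides for fetching one stored embedding.
    stored_vector = vector_store.get_sample_vector()
    stored_dim = len(stored_vector)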

    # Get dimension of a freshly produced embedding
    fresh_embedding = client.embeddings.create(
        model="text-embedding-3-large",
        input=sample_query,
    ).data[0].embedding
    query_dim = len(fresh_embedding)

    print(f"Stored vector dimension:  {stored_dim}")
    print(f"Query embedding dimension: {query_dim}")

    if stored_dim != query_dim:
        print("CRITICAL: Dimension mismatch! Re-index all documents.")
        return False
    return True

Fix: Re-embed all documents with the new model and rebuild the index. There is no shortcut.
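A dimension check cannot catch a swap between two models that happen to produce vectors of the same length. One way to guard against that is to record which model built the index and compare it at query time. The sketch below assumes you persist a small manifest next to the index at ingestion; EmbeddingManifest and check_compatibility are hypothetical names, not part of any vector store API.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class EmbeddingManifest:
    """Hypothetical record stored next to the index at ingestion time."""
    model: str
    dimensions: int


def check_compatibility(index_manifest: EmbeddingManifest, query_manifest: EmbeddingManifest) -> List[str]:
    """Return a list of human-readable problems; an empty list means compatible."""
    problems = []
    if index_manifest.dimensions != query_manifest.dimensions:
        problems.append(
            f"dimension mismatch: index={index_manifest.dimensions}, "
            f"query={query_manifest.dimensions}"
        )
    if index_manifest.model != query_manifest.model:
        problems.append(
            f"model mismatch: index={index_manifest.model}, query={query_manifest.model}"
        )
    return problems


# Same dimension, different model: only the model check fires.
indexed_with = EmbeddingManifest(model="text-embedding-ada-002", dimensions=1536)
querying_with = EmbeddingManifest(model="text-embedding-3-small", dimensions=1536)
problems = check_compatibility(indexed_with, querying_with)
print(problems)
```

Running this check at service start-up, before any query is served, turns a silent quality problem into a loud configuration error.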

Root Cause 3: Poor Chunking Loses Context

Consider this document structure:

Section 4: Time Off Policies

4.1 Annual Leave
Employees working full-time are entitled to 25 days per year.

4.2 Sick Leave
...

If the chunker splits at section boundaries and drops the heading, or a size-based split separates the heading from the body, the chunk for 4.1 starts with "Employees working full-time are entitled to 25 days per year." with no mention of "annual leave" or "vacation." The embedding then has little to associate this chunk with a query about vacation days.

Fix this by including the section header in the chunk content:

Python

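from typing import List

# Recent LangChain releases ship the splitters in the langchain-text-splitters
# package; older releases expose the same class as langchain.text_splitter.
from langchain_text_splitters import RecursiveCharacterTextSplitter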

def chunk_with_headers(document_text: str, headers: List[str]) -> List[str]:
    """
    Prepend the nearest section header to each chunk so context
    is preserved even when the chunk is retrieved in isolation.
    """
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=512,
        chunk_overlap=64,
    )
    raw_chunks = splitter.split_text(document_text)
    enriched = []

    current_header = ""
    for chunk in raw_chunks:
        # Update current header if chunk starts with a known header pattern
        for header in headers:
            if header.lower() in chunk[:100].lower():
                current_header = header
                break
        # Prepend header to chunk for better embedding
        enriched.append(f"{current_header}\n\n{chunk}" if current_header else chunk)

    return enriched
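When headings follow a predictable pattern, an alternative is to split on the headings themselves so every chunk begins with its own heading. The sketch below uses only the standard library and assumes headings are lines that start with a dotted section number such as "4.1 "; the pattern and the function name are illustrative and would need adapting to your documents. Long sections would still need a second, size-based split.

```python
import re
from typing import List

# Assumed heading format: a line starting with a dotted section number, e.g. "4.1 ".
HEADING_PATTERN = re.compile(r"^\d+(\.\d+)+\s+\S.*$")


def split_by_heading(document_text: str) -> List[str]:
    """Return one chunk per numbered subsection, each starting with its heading."""
    chunks: List[str] = []
    current_heading = ""
    body_lines: List[str] = []

    def flush() -> None:
        # Emit the text collected so far, prefixed with the heading it belongs to.
        body = "\n".join(body_lines).strip()
        if body:
            chunks.append(f"{current_heading}\n\n{body}" if current_heading else body)

    for line in document_text.splitlines():
        if HEADING_PATTERN.match(line.strip()):
            flush()
            current_heading = line.strip()
            body_lines.clear()
        else:
            body_lines.append(line)
    flush()
    return chunks


sample = """Section 4: Time Off Policies

4.1 Annual Leave
Employees working full-time are entitled to 25 days per year.

4.2 Sick Leave
...
"""
chunks = split_by_heading(sample)
for chunk in chunks:
    print(repr(chunk))
```

With this input the annual leave chunk carries the words "Annual Leave" in its own text, so the embedding has something to match against a vacation query.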

Fix: Hybrid Search in Azure AI Search

Azure AI Search supports hybrid retrieval (BM25 + vector) out of the box. This is the most practical fix for most production teams because it requires no additional infrastructure.

Python

from azure.search.documents import SearchClient
from azure.search.documents.models import VectorizableTextQuery
from azure.core.credentials import AzureKeyCredential

search_client = SearchClient(
    endpoint="https://your-search-service.search.windows.net",
    index_name="hr-policies",
    credential=AzureKeyCredential("YOUR_API_KEY"),
)

def hybrid_search_with_filter(
    query: str,
    department: str,
    top_k: int = 5,
) -> List[dict]:
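    """
    Combines keyword (BM25) and vector search, filtered by department.
    This catches both semantic matches and exact keyword matches.

    Assumes the index has a vectorizer configured for content_vector
    (required by VectorizableTextQuery) and a semantic configuration
    named "hr-semantic-config".
    """
    vector_query = VectorizableTextQuery(
        text=query,
        k_nearest_neighbors=top_k * 4,  # retrieve more, then fuse
        fields="content_vector",
        exhaustive=True,                # exact kNN; slower than HNSW on large indexes
    )

    # Double any single quotes so the value is a valid OData string literal.
    safe_department = department.replace("'", "''")

    results = search_client.search(
        search_text=query,               # BM25 component
        vector_queries=[vector_query],   # vector component
        filter=f"department eq '{safe_department}'",  # metadata filter
        select=["id", "content", "source", "section"],
        top=top_k,
        query_type="semantic",           # semantic ranking pass on top
        semantic_configuration_name="hr-semantic-config",
    )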

    return [
        {
            "id": r["id"],
            "content": r["content"],
            "source": r["source"],
            "section": r["section"],
            "score": r["@search.score"],
            "reranker_score": r.get("@search.reranker_score"),
        }
        for r in results
    ]
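The "fuse" step in a hybrid query merges the keyword ranking and the vector ranking into one list. Azure AI Search documents Reciprocal Rank Fusion (RRF) as its method for this. The sketch below is a plain-Python illustration of the RRF idea, not the service's implementation; the constant k=60 is a commonly used default and the document IDs are made up.

```python
from collections import defaultdict
from typing import Dict, List


def reciprocal_rank_fusion(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Fuse several ranked ID lists; each doc scores sum(1 / (k + rank))."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by fused score, highest first; ties broken by ID for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical rankings from the keyword and vector components.
bm25_ranking = ["leave-07", "expenses-01", "remote-02"]
vector_ranking = ["remote-02", "leave-07", "travel-03"]
fused = reciprocal_rank_fusion([bm25_ranking, vector_ranking])
print(fused)
```

A document that ranks well in both lists beats one that tops only a single list, which is why hybrid search tends to recover chunks that either component alone would bury.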

Fix: Metadata Filtering to Narrow the Search Space

When retrieval is broad and inconsistent, adding metadata filters dramatically improves precision. Index metadata fields at ingestion time:

Python

from azure.search.documents import SearchIndexingBufferedSender
from azure.search.documents.indexes import SearchIndexClient
from azure.search.documents.indexes.models import (
    SearchIndex,
    SearchField,
    SearchFieldDataType,
    VectorSearch,
    HnswAlgorithmConfiguration,
    HnswParameters,
    VectorSearchProfile,
)

def create_hr_search_index(index_client: SearchIndexClient):
    fields = [
        SearchField(name="id", type=SearchFieldDataType.String, key=True),
        SearchField(name="content", type=SearchFieldDataType.String, searchable=True),
        # source and section are returned by the hybrid search example's select list
        SearchField(name="source", type=SearchFieldDataType.String, filterable=True),
        SearchField(name="section", type=SearchFieldDataType.String, filterable=True),
        SearchField(name="department", type=SearchFieldDataType.String, filterable=True, facetable=True),
        SearchField(name="policy_category", type=SearchFieldDataType.String, filterable=True),
        SearchField(name="effective_date", type=SearchFieldDataType.DateTimeOffset, filterable=True, sortable=True),
        SearchField(
            name="content_vector",
            type=SearchFieldDataType.Collection(SearchFieldDataType.Single),
            searchable=True,
            vector_search_dimensions=3072,  # must match the embedding model (3072 for text-embedding-3-large)
            vector_search_profile_name="hnsw-profile",
        ),
    ]

    vector_search = VectorSearch(
        algorithms=[
            HnswAlgorithmConfiguration(
                name="hnsw-algo",
                parameters=HnswParameters(m=4, ef_construction=400),
            )
        ],
        profiles=[VectorSearchProfile(name="hnsw-profile", algorithm_configuration_name="hnsw-algo")],
    )

    # The semantic configuration and vectorizer used by the hybrid search
    # example are not defined here and must be added separately.
    index = SearchIndex(name="hr-policies", fields=fields, vector_search=vector_search)
    index_client.create_or_update_index(index)

def ingest_document(chunk: dict, sender: SearchIndexingBufferedSender):
    sender.upload_documents([
        {
            "id": chunk["id"],
            "content": chunk["content"],
            "department": chunk["metadata"]["department"],
            "policy_category": chunk["metadata"]["category"],
            "effective_date": chunk["metadata"]["effective_date"],
            "content_vector": chunk["embedding"],
        }
    ])
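Once these fields are filterable, the query side needs to turn user-supplied values into an OData filter expression. The sketch below builds such an expression for the department and policy_category fields defined above, doubling single quotes so that a value such as "People's Operations" cannot break out of the string literal. The helper names are hypothetical; pass the resulting string as the filter argument of the search call.

```python
from typing import Optional


def odata_string(value: str) -> str:
    """Quote a value as an OData string literal, doubling embedded single quotes."""
    return "'" + value.replace("'", "''") + "'"


def build_policy_filter(department: str, policy_category: Optional[str] = None) -> str:
    """Combine the available metadata constraints with 'and'."""
    clauses = [f"department eq {odata_string(department)}"]
    if policy_category is not None:
        clauses.append(f"policy_category eq {odata_string(policy_category)}")
    return " and ".join(clauses)


print(build_policy_filter("HR", "leave"))
print(build_policy_filter("People's Operations"))
```

Keeping filter construction in one place also makes it easy to log the exact filter that was sent, which helps when a query unexpectedly returns nothing.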

Putting It Together: Retrieval Quality Monitoring

Once you fix retrieval, keep it healthy with continuous evaluation:

Python

import schedule
import time

def run_retrieval_health_check():
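    """
    Runs a fixed golden test set every hour and alerts if precision drops.

    load_golden_test_cases, build_retriever, send_alert, and log_metric are
    application-specific helpers that are not shown here.
    """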
    test_cases = load_golden_test_cases()  # 50 hand-labeled query/doc pairs
    metrics = evaluate_retrieval(test_cases, retriever=build_retriever(), k=5)

    if metrics["precision_at_k"] < 0.70:
        send_alert(
            title="Retrieval quality degraded",
            message=f"Precision@5 dropped to {metrics['precision_at_k']:.1%}",
            severity="high",
        )

    log_metric("retrieval.precision_at_5", metrics["precision_at_k"])
    log_metric("retrieval.mrr", metrics["mrr"])

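schedule.every(1).hours.do(run_retrieval_health_check)

# schedule only registers the job; a loop has to run the pending jobs.
while True:
    schedule.run_pending()
    time.sleep(60)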

Summary: Retrieval Irrelevance Diagnosis Checklist

| Symptom | Likely Cause | Fix |
|---|---|---|
| Score below 0.6 for relevant queries | Wrong or undertrained embedding model | Switch to text-embedding-3-large or fine-tune |
| Inconsistent results across identical queries | Dimension mismatch after model change | Re-embed and re-index everything |
| Keyword queries fail but semantic works | No BM25 component | Add hybrid search |
| Cross-department contamination | No metadata filter | Add filterable fields and filter at query time |
| Headers/sections missing from results | Poor chunking loses header context | Prepend section headers to chunks |

Start with the evaluation script. Numbers tell you which category you are in. Do not guess — measure first, then fix.