Scenario: Retrieval Returns Irrelevant Chunks
The Scenario
Your team built a RAG system for internal HR policy documents. The vector database has 10,000 chunks from 500 policy PDFs. But users report that when they ask "how many vacation days do I get?", the system surfaces chunks about expense reimbursement. When they ask about parental leave, they get results about remote work policies.
The embedding model is running. The vector store is populated. So why is retrieval consistently off-target?
This scenario is more common than hallucination and harder to diagnose because the failures are silent — you retrieve something, you just retrieve the wrong thing.
Diagnosing Retrieval Quality
Before fixing anything, quantify the problem. Build a small test set of 20-30 query/expected-document pairs and measure retrieval precision.
from dataclasses import dataclass
from typing import List, Tuple

import numpy as np


@dataclass
class RetrievalTestCase:
    query: str
    expected_doc_ids: List[str]  # at least one of these should appear in top-k


def evaluate_retrieval(
    test_cases: List[RetrievalTestCase],
    retriever,
    k: int = 5,
) -> dict:
    hits = 0
    mrr_scores = []
    for case in test_cases:
        results = retriever.retrieve(case.query, k=k)
        result_ids = [r.metadata["doc_id"] for r in results]

        # Hit@k: did any expected doc appear in top-k?
        hit = any(doc_id in result_ids for doc_id in case.expected_doc_ids)
        hits += int(hit)

        # MRR: where did the first relevant result appear?
        for rank, doc_id in enumerate(result_ids, 1):
            if doc_id in case.expected_doc_ids:
                mrr_scores.append(1 / rank)
                break
        else:
            # for/else: runs only when no relevant doc was found in top-k
            mrr_scores.append(0)

    # Reported as Precision@k in this lesson; strictly it is the hit rate,
    # i.e. the share of queries with at least one relevant doc in top-k.
    precision_at_k = hits / len(test_cases)
    mean_rr = float(np.mean(mrr_scores))
    print(f"Precision@{k}: {precision_at_k:.2%}")
    print(f"MRR: {mean_rr:.4f}")
    return {"precision_at_k": precision_at_k, "mrr": mean_rr}

A Precision@5 below 60% means your retrieval layer has a serious problem. Below 40% means users cannot rely on the system at all.
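To make the two metrics concrete, here is a minimal sketch that scores three made-up queries by hand. The document IDs and rankings are invented for illustration, and the helper names (`hit_at_k`, `reciprocal_rank`) are hypothetical, not part of any library.

```python
from typing import List


def hit_at_k(result_ids: List[str], expected_ids: List[str], k: int) -> bool:
    """True if any expected document appears in the first k results."""
    return any(doc_id in expected_ids for doc_id in result_ids[:k])


def reciprocal_rank(result_ids: List[str], expected_ids: List[str]) -> float:
    """1/rank of the first relevant result, or 0.0 if none was retrieved."""
    for rank, doc_id in enumerate(result_ids, 1):
        if doc_id in expected_ids:
            return 1.0 / rank
    return 0.0


# Hypothetical (retrieved ranking, expected IDs) pairs for three queries.
runs = [
    (["leave-01", "exp-07", "remote-02"], ["leave-01"]),   # relevant at rank 1
    (["exp-07", "remote-02", "leave-03"], ["leave-03"]),   # relevant at rank 3
    (["exp-07", "remote-02", "exp-09"], ["parental-01"]),  # never retrieved
]

hit_rate = sum(hit_at_k(r, e, k=3) for r, e in runs) / len(runs)
mrr = sum(reciprocal_rank(r, e) for r, e in runs) / len(runs)

print(f"Hit@3: {hit_rate:.2%}")  # 2 of 3 queries have a relevant doc in the top 3
print(f"MRR:   {mrr:.4f}")       # (1 + 1/3 + 0) / 3
```

The second query counts as a hit but contributes only 1/3 to MRR, which is why the two numbers are worth tracking together: the hit rate says whether the right document shows up at all, MRR says how far down the list it sits.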
Root Cause 1: Wrong Embedding Model
The most common root cause is using a general-purpose embedding model for a domain-specific corpus. text-embedding-ada-002 is a general-purpose model. If your documents use specialized vocabulary (medical, legal, financial, pharmaceutical, or company-specific HR terminology as in this scenario), the embeddings may not capture the semantic relationships correctly.
Test this by checking similarity scores for known relevant pairs:
from typing import List, Tuple

import numpy as np
from openai import AzureOpenAI
from sklearn.metrics.pairwise import cosine_similarity

# The API key is read from the AZURE_OPENAI_API_KEY environment variable.
client = AzureOpenAI(
    azure_endpoint="https://your-resource.openai.azure.com",
    api_version="2024-02-01",
)


def test_embedding_model(
    model: str,
    relevant_pairs: List[Tuple[str, str]],
    irrelevant_pairs: List[Tuple[str, str]],
):
    """
    Relevant pairs should score much higher than irrelevant pairs.
    A good model shows a gap of at least 0.2 between the means.
    """

    def embed(text):
        # On Azure OpenAI, `model` is the name of your embedding deployment.
        return client.embeddings.create(model=model, input=text).data[0].embedding

    relevant_scores = []
    for q, doc in relevant_pairs:
        eq, ed = np.array(embed(q)), np.array(embed(doc))
        score = cosine_similarity([eq], [ed])[0][0]
        relevant_scores.append(score)

    irrelevant_scores = []
    for q, doc in irrelevant_pairs:
        eq, ed = np.array(embed(q)), np.array(embed(doc))
        score = cosine_similarity([eq], [ed])[0][0]
        irrelevant_scores.append(score)

    print(f"Model: {model}")
    print(f"  Relevant pairs avg score:   {np.mean(relevant_scores):.4f}")
    print(f"  Irrelevant pairs avg score: {np.mean(irrelevant_scores):.4f}")
    print(f"  Gap: {np.mean(relevant_scores) - np.mean(irrelevant_scores):.4f}")
    print()


# Compare models
relevant = [
    ("annual leave entitlement", "Employees are entitled to 25 days of annual leave per year."),
    ("parental leave policy", "Primary caregivers receive 16 weeks of paid parental leave."),
]
irrelevant = [
    ("annual leave entitlement", "All expense claims must be submitted within 30 days."),
    ("parental leave policy", "Remote work requests require manager approval."),
]
test_embedding_model("text-embedding-ada-002", relevant, irrelevant)
test_embedding_model("text-embedding-3-large", relevant, irrelevant)

Root Cause 2: Embedding Dimension Mismatch
If you switched embedding models without re-indexing, your stored vectors and your query vectors no longer come from the same model. Old vectors might be 1536-dimensional (ada-002) while new query embeddings are 3072-dimensional (3-large). Cosine similarity is not defined between vectors of different lengths, so any score produced across that boundary is meaningless.
def check_dimension_consistency(vector_store, sample_query: str):
    """
    Compares the dimension of stored vectors vs. the dimension
    produced by the current embedding model.
    """
    # Get dimension of a stored vector.
    # get_sample_vector() is a placeholder: use whatever call your
    # vector store offers to fetch one stored embedding.
    stored_vector = vector_store.get_sample_vector()
    stored_dim = len(stored_vector)

    # Get dimension of a freshly produced embedding
    # (`client` is the AzureOpenAI client created earlier).
    fresh_embedding = client.embeddings.create(
        model="text-embedding-3-large",
        input=sample_query,
    ).data[0].embedding
    query_dim = len(fresh_embedding)

    print(f"Stored vector dimension: {stored_dim}")
    print(f"Query embedding dimension: {query_dim}")
    if stored_dim != query_dim:
        print("CRITICAL: Dimension mismatch! Re-index all documents.")
        return False
    return True

Fix: Re-embed all documents with the new model and rebuild the index. There is no shortcut.
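Re-embedding is mostly a batching loop with a sanity check on the output dimension. The following is a minimal sketch under stated assumptions: chunks are dicts with `id` and `content` keys, `embed_batch` is whatever function calls your new embedding model, and `reembed_chunks` and `fake_embed_batch` are hypothetical names. The fake embedder exists only so the sketch runs offline.

```python
from typing import Callable, Iterator, List, Sequence


def batched(items: Sequence[dict], size: int) -> Iterator[Sequence[dict]]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def reembed_chunks(
    chunks: Sequence[dict],
    embed_batch: Callable[[List[str]], List[List[float]]],
    expected_dim: int,
    batch_size: int = 16,
) -> List[dict]:
    """Return copies of the chunks with a fresh 'embedding' from the new model."""
    updated = []
    for batch in batched(chunks, batch_size):
        vectors = embed_batch([c["content"] for c in batch])
        for chunk, vector in zip(batch, vectors):
            # Fail fast if the model does not produce the dimension the index expects.
            if len(vector) != expected_dim:
                raise ValueError(
                    f"chunk {chunk['id']}: got {len(vector)} dims, expected {expected_dim}"
                )
            updated.append({**chunk, "embedding": vector})
    return updated


def fake_embed_batch(texts: List[str]) -> List[List[float]]:
    """Stand-in embedder (assumption): replace with a call to your real model."""
    return [[float(len(t))] * 8 for t in texts]


chunks = [{"id": str(i), "content": f"policy text {i}"} for i in range(5)]
reindexed = reembed_chunks(chunks, fake_embed_batch, expected_dim=8, batch_size=2)
print(len(reindexed), len(reindexed[0]["embedding"]))
```

Writing the new vectors into a fresh index and switching queries over only once it is fully populated avoids a window in which old and new vectors are mixed.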
Root Cause 3: Poor Chunking Loses Context
Consider this document structure:
Section 4: Time Off Policies
  4.1 Annual Leave
    Employees working full-time are entitled to 25 days per year.
  4.2 Sick Leave
    ...

If chunked at section boundaries, chunk 4.1 starts with "Employees working full-time are entitled to 25 days per year." with no mention of "annual leave" or "vacation." The embedding will not strongly associate this chunk with a query about vacation days.
Fix this by including the section header in the chunk content:
from typing import List

from langchain.text_splitter import RecursiveCharacterTextSplitter


def chunk_with_headers(document_text: str, headers: List[str]) -> List[str]:
    """
    Prepend the nearest section header to each chunk so context
    is preserved even when the chunk is retrieved in isolation.
    """
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=512,
        chunk_overlap=64,
    )
    raw_chunks = splitter.split_text(document_text)

    enriched = []
    current_header = ""
    for chunk in raw_chunks:
        # Update current header if a known header appears near the start of the chunk
        for header in headers:
            if header.lower() in chunk[:100].lower():
                current_header = header
                break
        # Prepend header to chunk for better embedding
        enriched.append(f"{current_header}\n\n{chunk}" if current_header else chunk)
    return enriched

Fix: Hybrid Search in Azure AI Search
Azure AI Search supports hybrid retrieval (BM25 + vector) out of the box. This is the most practical fix for most production teams because it requires no additional infrastructure.
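Hybrid retrieval has to merge two ranked lists whose scores are not comparable, since BM25 scores and cosine similarities live on different scales. Azure AI Search documents Reciprocal Rank Fusion (RRF) as its merging method. The sketch below illustrates the RRF idea only; it is not the service's implementation. The function name `rrf_fuse`, the document IDs, and the constant `k=60` (the value commonly used for RRF) are assumptions for illustration.

```python
from collections import defaultdict
from typing import Dict, List


def rrf_fuse(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Fuse several ranked ID lists: each list adds 1 / (k + rank) per document."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, 1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties keep first-seen order (sorted is stable).
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical top-3 lists from the keyword and vector retrievers.
bm25_ranking = ["leave-01", "leave-03", "exp-07"]
vector_ranking = ["remote-02", "leave-01", "leave-03"]

fused = rrf_fuse([bm25_ranking, vector_ranking])
print(fused)
```

Documents that both retrievers agree on (`leave-01`, `leave-03`) rise above documents that only one retriever ranked highly, which is exactly the behavior that rescues queries where pure vector search drifts off-topic. With Azure AI Search the service performs this fusion for you: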
from typing import List

from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient
from azure.search.documents.models import VectorizableTextQuery

search_client = SearchClient(
    endpoint="https://your-search-service.search.windows.net",
    index_name="hr-policies",
    credential=AzureKeyCredential("YOUR_API_KEY"),
)


def hybrid_search_with_filter(
    query: str,
    department: str,
    top_k: int = 5,
) -> List[dict]:
    """
    Combines keyword (BM25) and vector search, filtered by department.
    This catches both semantic matches and exact keyword matches.
    """
    # VectorizableTextQuery lets the service embed the query text, which
    # requires a vectorizer configured on the index's vector profile.
    vector_query = VectorizableTextQuery(
        text=query,
        k_nearest_neighbors=top_k * 4,  # retrieve more, then fuse
        fields="content_vector",
        exhaustive=True,  # exact k-NN scan instead of approximate HNSW search
    )
    # Escape single quotes so the value cannot break out of the OData string literal.
    safe_department = department.replace("'", "''")
    results = search_client.search(
        search_text=query,                            # BM25 component
        vector_queries=[vector_query],                # vector component
        filter=f"department eq '{safe_department}'",  # metadata filter
        select=["id", "content", "source", "section"],  # fields must exist in the index
        top=top_k,
        query_type="semantic",                        # semantic ranking pass on top
        semantic_configuration_name="hr-semantic-config",  # must be defined on the index
    )
    return [
        {
            "id": r["id"],
            "content": r["content"],
            "source": r["source"],
            "section": r["section"],
            "score": r["@search.score"],
            "reranker_score": r.get("@search.reranker_score"),
        }
        for r in results
    ]

Fix: Metadata Filtering to Narrow the Search Space
When retrieval is broad and inconsistent, adding metadata filters dramatically improves precision. Index metadata fields at ingestion time:
from azure.search.documents import SearchIndexingBufferedSender
from azure.search.documents.indexes import SearchIndexClient
from azure.search.documents.indexes.models import (
    HnswAlgorithmConfiguration,
    HnswParameters,
    SearchField,
    SearchFieldDataType,
    SearchIndex,
    VectorSearch,
    VectorSearchProfile,
)


def create_hr_search_index(index_client: SearchIndexClient):
    fields = [
        SearchField(name="id", type=SearchFieldDataType.String, key=True),
        SearchField(name="content", type=SearchFieldDataType.String, searchable=True),
        SearchField(name="department", type=SearchFieldDataType.String, filterable=True, facetable=True),
        SearchField(name="policy_category", type=SearchFieldDataType.String, filterable=True),
        SearchField(name="effective_date", type=SearchFieldDataType.DateTimeOffset, filterable=True, sortable=True),
        SearchField(
            name="content_vector",
            type=SearchFieldDataType.Collection(SearchFieldDataType.Single),
            searchable=True,
            vector_search_dimensions=3072,  # must match the embedding model (3072 for text-embedding-3-large)
            vector_search_profile_name="hnsw-profile",
        ),
    ]
    vector_search = VectorSearch(
        algorithms=[
            HnswAlgorithmConfiguration(
                name="hnsw-algo",
                parameters=HnswParameters(m=4, ef_construction=400),
            )
        ],
        profiles=[VectorSearchProfile(name="hnsw-profile", algorithm_configuration_name="hnsw-algo")],
    )
    index = SearchIndex(name="hr-policies", fields=fields, vector_search=vector_search)
    index_client.create_or_update_index(index)


def ingest_document(chunk: dict, sender: SearchIndexingBufferedSender):
    sender.upload_documents([
        {
            "id": chunk["id"],
            "content": chunk["content"],
            "department": chunk["metadata"]["department"],
            "policy_category": chunk["metadata"]["category"],
            # DateTimeOffset fields expect an ISO 8601 value, e.g. "2024-01-01T00:00:00Z"
            "effective_date": chunk["metadata"]["effective_date"],
            "content_vector": chunk["embedding"],
        }
    ])

Putting It Together: Retrieval Quality Monitoring
Once you fix retrieval, keep it healthy with continuous evaluation:
import time

import schedule


def run_retrieval_health_check():
    """
    Runs a fixed golden test set every hour and alerts if precision drops.
    """
    # load_golden_test_cases, build_retriever, send_alert and log_metric are
    # placeholders for your own data loading, retriever, alerting and metrics code.
    test_cases = load_golden_test_cases()  # 50 hand-labeled query/doc pairs
    metrics = evaluate_retrieval(test_cases, retriever=build_retriever(), k=5)
    if metrics["precision_at_k"] < 0.70:
        send_alert(
            title="Retrieval quality degraded",
            message=f"Precision@5 dropped to {metrics['precision_at_k']:.1%}",
            severity="high",
        )
    log_metric("retrieval.precision_at_5", metrics["precision_at_k"])
    log_metric("retrieval.mrr", metrics["mrr"])


schedule.every(1).hours.do(run_retrieval_health_check)

# schedule only registers the job; a loop has to trigger pending runs.
while True:
    schedule.run_pending()
    time.sleep(60)

Summary: Retrieval Irrelevance Diagnosis Checklist
| Symptom | Likely Cause | Fix |
|---|---|---|
| Score below 0.6 for relevant queries | Wrong or undertrained embedding model | Switch to text-embedding-3-large or fine-tune |
| Inconsistent results across identical queries | Dimension mismatch after model change | Re-embed and re-index everything |
| Keyword queries fail but semantic works | No BM25 component | Add hybrid search |
| Cross-department contamination | No metadata filter | Add filterable fields and filter at query time |
| Headers/sections missing from results | Poor chunking loses header context | Prepend section headers to chunks |
Start with the evaluation script. Numbers tell you which category you are in. Do not guess — measure first, then fix.