RAG Systems · Lesson 13 of 24
Chunk Overlap: Why 10% Overlap Improves Recall
Why Overlap Exists
Fixed-size chunking cuts text at a token count rather than at sentence ends, so some sentences straddle a chunk boundary. Without overlap, such a sentence is split and appears incomplete in both adjacent chunks:
Chunk 1 (no overlap):
"...INR monitoring is required to prevent both bleeding and
clotting events. The optimal INR range for"
Chunk 2 (no overlap):
"patients with AF on warfarin is 2.0–3.0. Values above 3.0..."
Query: "What is the INR target for AF patients?"
Chunk 1 retrieved: ends mid-sentence, incomplete answer
Chunk 2 retrieved: starts with no context about what 2.0–3.0 refers to
With overlap (40 tokens):
Chunk 1: "...INR monitoring is required. The optimal INR range for patients with AF on warfarin is 2.0–3.0."
Chunk 2: "The optimal INR range for patients with AF on warfarin is 2.0–3.0. Values above 3.0 increase bleeding risk..."
Either chunk retrieved gives a complete answer.
Overlap Arithmetic
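The split in the warfarin example can be reproduced with a toy word-level chunker. This is a minimal sketch rather than a tokenizer-based implementation: `chunk_words` and `contains_phrase` are hypothetical helpers, words stand in for tokens, and the 16-word window with 8 words of overlap is deliberately exaggerated so the effect shows up on three short sentences.

```python
def chunk_words(words: list[str], chunk_size: int, overlap: int = 0) -> list[list[str]]:
    """Split a word list into windows of chunk_size that share `overlap` words."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    stride = chunk_size - overlap  # words advanced per window
    chunks = []
    for start in range(0, len(words), stride):
        chunks.append(words[start:start + chunk_size])
        if start + chunk_size >= len(words):
            break  # this window already reaches the end of the text
    return chunks


def contains_phrase(chunk: list[str], phrase: list[str]) -> bool:
    """True if `phrase` appears as a contiguous run of words inside `chunk`."""
    n = len(phrase)
    return any(chunk[i:i + n] == phrase for i in range(len(chunk) - n + 1))


text = (
    "INR monitoring is required to prevent both bleeding and clotting events. "
    "The optimal INR range for patients with AF on warfarin is 2.0–3.0. "
    "Values above 3.0 increase bleeding risk."
)
words = text.split()
target = "The optimal INR range for patients with AF on warfarin is 2.0–3.0.".split()

no_overlap = chunk_words(words, chunk_size=16, overlap=0)
with_overlap = chunk_words(words, chunk_size=16, overlap=8)

print(any(contains_phrase(c, target) for c in no_overlap))    # False
print(any(contains_phrase(c, target) for c in with_overlap))  # True
```

With `overlap=0` the first window ends at "The optimal INR range for", the same cut as in the example. With overlap, the middle window holds the whole sentence.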
chunk_size = 512 tokens
overlap = 64 tokens
Number of chunks for a 5000-token document:
stride = chunk_size - overlap = 448 tokens
chunks = ceil(5000 / 448) = ceil(11.16) = 12 chunks
Storage overhead from overlap:
Extra tokens = (chunks - 1) × overlap = 11 × 64 = 704 extra tokens
Overhead = 704 / 5000 = 14%
Memory in vector store (embeddings only, assuming 768-dimensional float32 vectors):
Without overlap: ceil(5000 / 512) = 10 chunks × 768 floats × 4 bytes = 30 KB
With overlap: 12 chunks × 768 floats × 4 bytes = 36 KB
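The figures above can be checked with a few lines of Python. This is a sketch under the same assumptions as the text (5000-token document, 768-dimensional float32 embeddings); `chunk_count` and `overlap_report` are hypothetical helper names. The count uses the exact number of sliding windows, which agrees with ceil(5000 / 448) for these inputs.

```python
import math


def chunk_count(doc_tokens: int, chunk_size: int, overlap: int) -> int:
    """Number of sliding windows needed to cover doc_tokens tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    if doc_tokens <= chunk_size:
        return 1
    stride = chunk_size - overlap
    # Window k starts at k * stride; the last window must reach the document end.
    return math.ceil((doc_tokens - chunk_size) / stride) + 1


def overlap_report(doc_tokens=5000, chunk_size=512, overlap=64,
                   dim=768, bytes_per_float=4) -> dict:
    """Reproduce the stride, chunk-count, and storage arithmetic."""
    n_overlap = chunk_count(doc_tokens, chunk_size, overlap)
    n_plain = chunk_count(doc_tokens, chunk_size, 0)
    extra_tokens = (n_overlap - 1) * overlap  # tokens embedded a second time
    return {
        "stride": chunk_size - overlap,
        "chunks_with_overlap": n_overlap,
        "chunks_without_overlap": n_plain,
        "extra_tokens": extra_tokens,
        "token_overhead_pct": round(100 * extra_tokens / doc_tokens, 1),
        "bytes_with_overlap": n_overlap * dim * bytes_per_float,
        "bytes_without_overlap": n_plain * dim * bytes_per_float,
    }


for key, value in overlap_report().items():
    print(f"{key}: {value}")
```

For the default inputs this gives 12 chunks versus 10, 704 extra tokens, and 36,864 versus 30,720 bytes of embeddings.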
20% more vector storage (12 vectors instead of 10) in exchange for complete sentences at chunk boundaries
How Much Overlap
Too little (0–5%):
Sentences at boundaries may be split
Answers spanning boundaries may not be retrieved
Right amount (10–20%):
Most boundary sentences appear complete in at least one chunk
Manageable storage overhead
Too much (50%+):
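Near-duplicate chunks fill the top-k results and crowd out other relevant passages
The same content is retrieved several times, so the LLM sees repetitive context and spends tokens on repeats
Longer prompts make the "lost in the middle" effect more likely, where content in the middle of the context is used less reliably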
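One rough way to quantify the "too much" case is to ask how many chunks each token ends up in. The sketch below assumes a long document and ignores edge effects at its start and end; `replication_factor` is a hypothetical helper, not part of any library.

```python
def replication_factor(chunk_size: int, overlap: int) -> float:
    """Average number of chunks containing a given token, ignoring document edges."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    stride = chunk_size - overlap
    # A new window starts every `stride` tokens and spans `chunk_size` tokens.
    return chunk_size / stride


chunk_size = 512
for pct in (0, 12.5, 25, 50, 75):
    overlap = int(chunk_size * pct / 100)
    factor = replication_factor(chunk_size, overlap)
    print(f"overlap {pct:>4}% ({overlap:>3} tokens): "
          f"each token is embedded ~{factor:.2f} times")
```

At 12.5% overlap each token is embedded about 1.14 times on average. At 50% it is embedded twice and at 75% four times, which is where near-duplicate results start to dominate.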
Recommended defaults:
chunk_size 256, overlap 32 (12.5%): for tight embedding model limits
chunk_size 512, overlap 64 (12.5%): general purpose
chunk_size 1024, overlap 128 (12.5%): for large context models
Deduplication After Retrieval
When top-k retrieval returns overlapping chunks, deduplicate before sending to the LLM:
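from difflib import SequenceMatcher


def deduplicate_chunks(
    chunks: list[str],
    similarity_threshold: float = 0.85,
) -> list[str]:
    """Remove chunks that are highly similar to an earlier chunk.

    ratio() scores whole-string similarity, so a 0.85 threshold only drops
    near-identical chunks. Adjacent chunks that share 10 to 20% of their
    text score far lower and are kept.
    """
    unique: list[str] = []
    for candidate in chunks:
        # Compare against every chunk already kept (quadratic in len(chunks)).
        is_duplicate = any(
            SequenceMatcher(None, candidate, existing).ratio() > similarity_threshold
            for existing in unique
        )
        if not is_duplicate:
            unique.append(candidate)
    return unique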
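It helps to see what a 0.85 threshold on `ratio()` actually catches. The sketch below uses made-up placeholder tokens instead of real text, and `window` and `similarity` are hypothetical helpers. The strings are kept under 200 characters so that `SequenceMatcher`'s automatic junk heuristic does not come into play.

```python
from difflib import SequenceMatcher

# 30 distinct placeholder tokens stand in for real text (hypothetical data).
tokens = [f"tok{i:02d}" for i in range(30)]


def window(start: int, size: int = 16) -> str:
    """Return `size` consecutive tokens starting at `start`, joined by spaces."""
    return " ".join(tokens[start:start + size])


def similarity(a: str, b: str) -> float:
    """Whole-string similarity in [0, 1], as used by the deduplication check."""
    return SequenceMatcher(None, a, b).ratio()


base = window(0)        # tok00 .. tok15
near_copy = window(1)   # shares 15 of 16 tokens with base
neighbour = window(14)  # shares 2 of 16 tokens with base (12.5% overlap)

print(f"near copy: {similarity(base, near_copy):.2f}")
print(f"neighbour: {similarity(base, neighbour):.2f}")
```

In this sketch only the near copy crosses 0.85. The 12.5% neighbour scores far below the threshold, so a text-similarity filter like this removes repeated passages but leaves ordinary adjacent chunks in place.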
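
def retrieve_and_deduplicate(
    query: str,
    collection,
    top_k: int = 10,
    final_k: int = 5,
    similarity_threshold: float = 0.85,
) -> list[dict]:
    """Over-fetch top_k chunks, then keep up to final_k that are not near-duplicates."""
    # embed() is assumed to be defined elsewhere and to return the query vector;
    # collection is assumed to expose a Chroma-style query() method.
    results = collection.query(
        query_embeddings=[embed(query)],
        n_results=top_k,
        include=["documents", "metadatas", "distances"],
    )
    # Results are nested per query; index 0 selects the single query sent above.
    docs = zip(
        results["documents"][0],
        results["metadatas"][0],
        results["distances"][0],
    )
    seen_texts: list[str] = []
    deduped: list[dict] = []
    for text, meta, dist in docs:
        # Skip a chunk that is nearly identical to one already kept.
        is_dup = any(
            SequenceMatcher(None, text, seen).ratio() > similarity_threshold
            for seen in seen_texts
        )
        if not is_dup:
            deduped.append({"content": text, "metadata": meta, "distance": dist})
            seen_texts.append(text)
        if len(deduped) >= final_k:
            break
    return deduped

Overlap and Metadata Continuity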
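When indexing with overlap, attach the same document metadata to every chunk, along with the chunk's index and token span, so each retrieved chunk can be traced back to its place in the source: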
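def chunk_with_metadata(
    text: str,
    doc_id: str,
    source: str,
    chunk_size: int = 512,
    overlap: int = 64,
) -> list[dict]:
    """Split text into overlapping token windows, each tagged with its origin."""
    # A stride of zero or less would fail or silently produce no chunks.
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    # tokenizer is assumed to be a Hugging Face-style tokenizer defined elsewhere.
    token_ids = tokenizer.encode(text, add_special_tokens=False)
    chunks = []
    stride = chunk_size - overlap  # tokens advanced per chunk
    for i, start in enumerate(range(0, len(token_ids), stride)):
        end = min(start + chunk_size, len(token_ids))
        chunk_text = tokenizer.decode(token_ids[start:end])
        chunks.append({
            "id": f"{doc_id}_chunk_{i}",
            "text": chunk_text,
            "metadata": {
                "doc_id": doc_id,
                "source": source,
                "chunk_index": i,
                "token_start": start,
                "token_end": end,  # exclusive end index
                "is_first_chunk": (i == 0),
            }
        })
        if end >= len(token_ids):
            break  # this window reached the end; skip a redundant tail chunk
    return chunks

Interview Answer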
"Chunk overlap prevents sentences at chunk boundaries from appearing incomplete in both adjacent chunks. A query about information that straddles a boundary will still find a complete answer in an overlapping chunk. The recommended overlap is 10–20% of chunk size — enough to cover boundary sentences without creating near-duplicate chunks that pollute retrieval. When top-k retrieval returns overlapping chunks (common with high overlap), deduplicate before sending to the LLM to avoid wasting context tokens on repeated content."