Chunk Overlap and Boundary Handling
Why chunk overlap exists, how much to use, how it affects storage and retrieval, and strategies for handling boundaries in clinical RAG.
Why Overlap Exists
Fixed-size chunking places boundaries at token counts, not at sentence ends, so some sentences straddle a boundary. Without overlap, such a sentence is cut in two and appears incomplete in both adjacent chunks:
Chunk 1 (no overlap):
"...INR monitoring is required to prevent both bleeding and
clotting events. The optimal INR range for"
Chunk 2 (no overlap):
"patients with AF on warfarin is 2.0–3.0. Values above 3.0..."
Query: "What is the INR target for AF patients?"
Chunk 1 retrieved: ends mid-sentence, incomplete answer
Chunk 2 retrieved: starts with no context about what 2.0–3.0 refers to
With overlap (40 tokens):
Chunk 1: "...INR monitoring is required. The optimal INR range for patients with AF on warfarin is 2.0–3.0."
Chunk 2: "The optimal INR range for patients with AF on warfarin is 2.0–3.0. Values above 3.0 increase bleeding risk..."
Either chunk, if retrieved, gives a complete answer.
Overlap Arithmetic
chunk_size = 512 tokens
overlap = 64 tokens
Number of chunks for a 5000-token document:
stride = chunk_size - overlap = 448 tokens
chunks = ceil((5000 - overlap) / stride) = ceil(4936 / 448) = 12 chunks
Storage overhead from overlap:
Extra tokens = (chunks - 1) × overlap = 11 × 64 = 704 extra tokens
Overhead = 704 / 5000 = 14%
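The arithmetic above can be reproduced with a small helper. This is a minimal sketch, not a library function: the name `overlap_stats` and its return keys are made up for illustration, and it assumes a plain sliding window that stops as soon as a chunk reaches the end of the document, which gives a chunk count of ceil((N - overlap) / stride).

```python
import math


def overlap_stats(doc_tokens: int, chunk_size: int, overlap: int) -> dict:
    """Chunk count and token overhead for a fixed-size sliding window."""
    if doc_tokens < 0:
        raise ValueError("doc_tokens must be non-negative")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    stride = chunk_size - overlap
    if doc_tokens == 0:
        chunks = 0
    else:
        # The last window only has to reach the end of the document.
        chunks = max(1, math.ceil((doc_tokens - overlap) / stride))

    # Every chunk after the first repeats `overlap` tokens of its predecessor.
    extra_tokens = max(0, chunks - 1) * overlap
    overhead = extra_tokens / doc_tokens if doc_tokens else 0.0
    return {
        "stride": stride,
        "chunks": chunks,
        "extra_tokens": extra_tokens,
        "overhead": overhead,
    }


stats = overlap_stats(doc_tokens=5000, chunk_size=512, overlap=64)
print(stats)
```

For the 5000-token example this gives a stride of 448, 12 chunks, 704 extra tokens, and an overhead of about 14%.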
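Memory in vector store (768-dimensional float32 embeddings, vectors only):
Without overlap: ceil(5000 / 512) = 10 chunks × 768 floats × 4 bytes = 30 KB
With overlap: 12 chunks × 768 floats × 4 bytes = 36 KB
About 20% more embedding storage for much better boundary coverage. The vector count grows faster than the 14% token overhead because the final chunk is only partly full in both cases.
How Much Overlap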
Too little (0–5%):
Sentences at boundaries may be split
Answers spanning boundaries may not be retrieved
Right amount (10–20%):
Most boundary sentences appear complete in at least one chunk
Manageable storage overhead
Too much (50%+):
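Near-duplicate chunks crowd distinct passages out of the top-k results
The same content is retrieved multiple times, wasting context tokens
The LLM sees a longer, repetitive context, which can aggravate the "lost in the middle" effect (weaker use of information placed mid-context)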
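To make these bands concrete, the sketch below computes the overlap ratio and a rough replication factor (how many chunks each token lands in on average for a long document, approximately chunk_size / stride). The function names and the "borderline" label for the gaps between the listed bands are assumptions for illustration, not part of any library.

```python
def overlap_ratio(chunk_size: int, overlap: int) -> float:
    """Overlap as a fraction of chunk size."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    return overlap / chunk_size


def replication_factor(chunk_size: int, overlap: int) -> float:
    """Approximate number of chunks each token appears in (long documents)."""
    overlap_ratio(chunk_size, overlap)  # reuse the validation above
    return chunk_size / (chunk_size - overlap)


def classify_overlap(chunk_size: int, overlap: int) -> str:
    """Map an overlap setting onto the bands listed above."""
    ratio = overlap_ratio(chunk_size, overlap)
    if ratio <= 0.05:
        return "too_little"
    if 0.10 <= ratio <= 0.20:
        return "recommended"
    if ratio >= 0.50:
        return "too_much"
    return "borderline"  # hypothetical label for the unlisted gaps


for size, ov in [(512, 0), (512, 64), (512, 256)]:
    print(size, ov, classify_overlap(size, ov),
          round(replication_factor(size, ov), 2))
```

At 50% overlap the replication factor reaches 2.0, meaning every passage is indexed roughly twice, which is why near-duplicates start to dominate the top-k results.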
Recommended defaults:
chunk_size 256, overlap 32 (12.5%): for tight embedding model limits
chunk_size 512, overlap 64 (12.5%): general purpose
chunk_size 1024, overlap 128 (12.5%): for large context models
Deduplication After Retrieval
When top-k retrieval returns chunks whose text is mostly the same (common with high overlap), drop the near-duplicates before sending context to the LLM:
from difflib import SequenceMatcher


def deduplicate_chunks(
    chunks: list[str],
    similarity_threshold: float = 0.85,
) -> list[str]:
    """Remove chunks that are highly similar to an earlier chunk."""
    unique = []
    for candidate in chunks:
        # autojunk=False turns off difflib's "popular element" heuristic,
        # which can distort ratios on strings of 200+ characters.
        is_duplicate = any(
            SequenceMatcher(None, candidate, existing, autojunk=False).ratio()
            > similarity_threshold
            for existing in unique
        )
        if not is_duplicate:
            unique.append(candidate)
    return unique
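It helps to see what a 0.85 threshold does and does not catch. The sketch below is an illustration under stated assumptions: it compares lists of made-up tokens rather than raw strings so that the numbers are exact, whereas the function above compares characters. For two equal-length chunks that share only an edge, the ratio is roughly the shared fraction.

```python
from difflib import SequenceMatcher


def similarity(a: list[str], b: list[str]) -> float:
    """difflib's ratio: 2 * matched elements / (len(a) + len(b))."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


# Hypothetical document of 200 distinct "tokens".
tokens = [f"tok{i}" for i in range(200)]

chunk_a = tokens[0:100]
chunk_low = tokens[88:188]   # shares 12 of 100 tokens with chunk_a
chunk_high = tokens[5:105]   # shares 95 of 100 tokens with chunk_a

ratio_low = similarity(chunk_a, chunk_low)
ratio_high = similarity(chunk_a, chunk_high)

print(f"12% overlap: ratio={ratio_low:.2f}, duplicate={ratio_low > 0.85}")
print(f"95% overlap: ratio={ratio_high:.2f}, duplicate={ratio_high > 0.85}")
```

Under these assumptions, text-similarity deduplication mainly removes near-identical chunks. Neighbours with a moderate overlap such as 12.5% both survive, so the LLM may still see the shared sentences twice.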

def retrieve_and_deduplicate(
    query: str,
    collection,
    top_k: int = 10,
    final_k: int = 5,
) -> list[dict]:
    # Assumptions: `collection` exposes a Chroma-style query() method and
    # `embed` is a user-supplied function returning the query embedding.
    results = collection.query(
        query_embeddings=[embed(query)],
        n_results=top_k,
        include=["documents", "metadatas", "distances"],
    )

    # Pair docs with metadata and distance (results are ranked best-first)
    docs = list(zip(
        results["documents"][0],
        results["metadatas"][0],
        results["distances"][0],
    ))

    # Deduplicate: keep a chunk only if it is not a near-copy of a kept one
    seen_texts = []
    deduped = []
    for text, meta, dist in docs:
        is_dup = any(
            SequenceMatcher(None, text, seen, autojunk=False).ratio() > 0.85
            for seen in seen_texts
        )
        if not is_dup:
            deduped.append({"content": text, "metadata": meta, "distance": dist})
            seen_texts.append(text)
        if len(deduped) >= final_k:
            break
    return deduped
Overlap and Metadata Continuity
When indexing with overlap, carry metadata across chunks to preserve document context:
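def chunk_with_metadata(
    text: str,
    doc_id: str,
    source: str,
    chunk_size: int = 512,
    overlap: int = 64,
) -> list[dict]:
    # A non-positive stride would make range() fail or never advance.
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    # Assumption: `tokenizer` is a module-level Hugging Face-style tokenizer.
    token_ids = tokenizer.encode(text, add_special_tokens=False)
    chunks = []
    stride = chunk_size - overlap
    for i, start in enumerate(range(0, len(token_ids), stride)):
        end = min(start + chunk_size, len(token_ids))
        chunk_text = tokenizer.decode(token_ids[start:end])
        chunks.append({
            "id": f"{doc_id}_chunk_{i}",
            "text": chunk_text,
            "metadata": {
                "doc_id": doc_id,
                "source": source,
                "chunk_index": i,
                "token_start": start,  # inclusive offset into the document
                "token_end": end,      # exclusive offset into the document
                "is_first_chunk": (i == 0),
            }
        })
        # Stop once a chunk reaches the end, so no overlap-only tail is emitted.
        if end >= len(token_ids):
            break
    return chunks
Interview Answer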
"Chunk overlap prevents sentences at chunk boundaries from appearing incomplete in both adjacent chunks. A query about information that straddles a boundary will still find a complete answer in an overlapping chunk. The recommended overlap is 10–20% of chunk size — enough to cover boundary sentences without creating near-duplicate chunks that pollute retrieval. When top-k retrieval returns overlapping chunks (common with high overlap), deduplicate before sending to the LLM to avoid wasting context tokens on repeated content."