Chunk Overlap: Why 10% Overlap Improves Recall

Why Overlap Exists

Fixed-size chunking cuts wherever the token budget runs out, so chunk boundaries often fall mid-sentence. Without overlap, such a sentence is split and appears incomplete in both adjacent chunks:

Chunk 1 (no overlap):
  "...INR monitoring is required to prevent both bleeding and
  clotting events. The optimal INR range for"

Chunk 2 (no overlap):
  "patients with AF on warfarin is 2.0–3.0. Values above 3.0..."

Query: "What is the INR target for AF patients?"
  Chunk 1 retrieved: ends mid-sentence, incomplete answer
  Chunk 2 retrieved: starts with no context about what 2.0–3.0 refers to

With overlap (40 tokens):
  Chunk 1: "...INR monitoring is required. The optimal INR range for patients with AF on warfarin is 2.0–3.0."
  Chunk 2: "The optimal INR range for patients with AF on warfarin is 2.0–3.0. Values above 3.0 increase bleeding risk..."
  
  Either chunk retrieved gives a complete answer.
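
The example above can be reproduced with a minimal sliding-window sketch. It splits on whitespace rather than model tokens, and the helper names `chunk_words` and `contains_whole` are illustrative assumptions, not part of any library. The sizes are tiny word counts chosen so the boundary falls mid-sentence; they are not recommended settings.

```python
def chunk_words(words: list[str], chunk_size: int, overlap: int) -> list[list[str]]:
    """Split a word list into windows of chunk_size that share `overlap` words."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    stride = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), stride):
        chunks.append(words[start:start + chunk_size])
        if start + chunk_size >= len(words):
            break  # this window already reaches the end of the text
    return chunks


def contains_whole(chunks: list[list[str]], sentence: str) -> bool:
    """True if at least one chunk contains the sentence in full."""
    return any(sentence in " ".join(chunk) for chunk in chunks)


text = (
    "INR monitoring is required to prevent both bleeding and clotting events. "
    "The optimal INR range for patients with AF on warfarin is 2.0–3.0. "
    "Values above 3.0 increase bleeding risk."
)
target = "The optimal INR range for patients with AF on warfarin is 2.0–3.0."
words = text.split()

no_overlap = chunk_words(words, chunk_size=16, overlap=0)
with_overlap = chunk_words(words, chunk_size=16, overlap=5)

print(contains_whole(no_overlap, target))    # sentence is cut at the boundary
print(contains_whole(with_overlap, target))  # second window holds it in full
```

Without overlap the first window ends at "The optimal INR range for", exactly as in the example. With a five-word overlap the second window starts at "The optimal" and carries the whole sentence.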

Overlap Arithmetic

chunk_size = 512 tokens
overlap = 64 tokens

Number of chunks for a 5000-token document:
  stride = chunk_size - overlap = 448 tokens
  chunks = ceil((5000 - overlap) / stride) = ceil(4936 / 448) = 12 chunks

Storage overhead from overlap:
  Extra tokens = (chunks - 1) × overlap = 11 × 64 = 704 extra tokens
  Overhead = 704 / 5000 ≈ 14%

Memory in vector store (768-dimensional float32 embeddings):
  Without overlap: 10 chunks × 768 floats × 4 bytes = 30 KB
  With overlap:    12 chunks × 768 floats × 4 bytes = 36 KB
  20% more vector storage (two extra chunks) in exchange for much better boundary coverage
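
The same arithmetic can be wrapped in a small helper for trying other settings. This is a sketch under stated assumptions: the function name `overlap_stats` is hypothetical, the chunker is assumed to stop as soon as a window reaches the end of the document, and embeddings are assumed to be 768 float32 values as in the figures above.

```python
import math


def overlap_stats(
    doc_tokens: int,
    chunk_size: int,
    overlap: int,
    dim: int = 768,            # assumed embedding dimension
    bytes_per_float: int = 4,  # float32
) -> dict:
    """Chunk count, duplicated tokens, and vector storage for one document."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    stride = chunk_size - overlap
    if doc_tokens <= 0:
        chunks = 0
    else:
        # Each window adds `stride` new tokens after the first `overlap` ones.
        chunks = max(1, math.ceil((doc_tokens - overlap) / stride))
    extra_tokens = max(chunks - 1, 0) * overlap
    return {
        "stride": stride,
        "chunks": chunks,
        "extra_tokens": extra_tokens,
        "token_overhead": extra_tokens / doc_tokens if doc_tokens > 0 else 0.0,
        "vector_bytes": chunks * dim * bytes_per_float,
    }


baseline = overlap_stats(5000, chunk_size=512, overlap=0)
overlapped = overlap_stats(5000, chunk_size=512, overlap=64)
print(baseline["chunks"], baseline["vector_bytes"])
print(overlapped["chunks"], overlapped["extra_tokens"], overlapped["vector_bytes"])
```

Token overhead (about 14%) and vector overhead (20%) differ because vector storage grows per chunk, regardless of how many tokens the final, shorter chunk holds.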

How Much Overlap

Too little (0–5%):
  Sentences at boundaries may be split
  Answers spanning boundaries may not be retrieved

Right amount (10–20%):
  Most boundary sentences appear complete in at least one chunk
  Manageable storage overhead

Too much (50%+):
  Near-duplicate chunks pollute retrieval results
  The same content fills several top-k slots, crowding out other relevant passages
  LLM sees repetitive context and context tokens are wasted

Recommended defaults:
  chunk_size 256, overlap 32  (12.5%)  — for tight embedding model limits
  chunk_size 512, overlap 64  (12.5%)  — general purpose
  chunk_size 1024, overlap 128 (12.5%) — for large context models
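
One way to check a chosen overlap against a corpus is to measure how many sentences fit whole inside at least one chunk. The sketch below does this on synthetic data, assuming sentence lengths in tokens are already known; `window_spans` and `sentence_coverage` are hypothetical helper names, and the 25-token sentences are made up for illustration.

```python
def window_spans(n_tokens: int, chunk_size: int, overlap: int) -> list[tuple[int, int]]:
    """Half-open (start, end) token ranges produced by a sliding window."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    stride = chunk_size - overlap
    spans = []
    for start in range(0, n_tokens, stride):
        end = min(start + chunk_size, n_tokens)
        spans.append((start, end))
        if end >= n_tokens:
            break
    return spans


def sentence_coverage(sentence_lengths: list[int], chunk_size: int, overlap: int) -> float:
    """Fraction of sentences that lie entirely inside at least one chunk."""
    sentences = []
    pos = 0
    for length in sentence_lengths:
        sentences.append((pos, pos + length))
        pos += length
    windows = window_spans(pos, chunk_size, overlap)
    whole = sum(
        1
        for s_start, s_end in sentences
        if any(w_start <= s_start and s_end <= w_end for w_start, w_end in windows)
    )
    return whole / len(sentences)


# Synthetic document: 200 sentences of 25 tokens each (5000 tokens total).
lengths = [25] * 200
for overlap in (0, 64):
    print(overlap, sentence_coverage(lengths, chunk_size=512, overlap=overlap))
```

On this synthetic input, 9 of the 200 sentences are split when overlap is 0, and none are split at overlap 64. In general, an overlap at least as long as the longest sentence guarantees that every sentence lands whole in some chunk, which is one practical way to sanity-check the 10–20% range on real data.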

Deduplication After Retrieval

When top-k retrieval returns overlapping chunks, deduplicate before sending to the LLM:

Python
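from difflib import SequenceMatcher


def text_similarity(a: str, b: str) -> float:
    """Word-level similarity in [0, 1] between two chunk texts."""
    # autojunk=False: the default heuristic ignores frequent items in
    # sequences of 200+ elements, which distorts ratios for chunk-sized input.
    return SequenceMatcher(None, a.split(), b.split(), autojunk=False).ratio()


def deduplicate_chunks(
    chunks: list[str],
    similarity_threshold: float = 0.85,
) -> list[str]:
    """Remove chunks that are highly similar to an earlier chunk."""
    unique = []
    for candidate in chunks:
        is_duplicate = any(
            text_similarity(candidate, existing) > similarity_threshold
            for existing in unique
        )
        if not is_duplicate:
            unique.append(candidate)
    return unique


def retrieve_and_deduplicate(
    query: str,
    collection,
    top_k: int = 10,
    final_k: int = 5,
    similarity_threshold: float = 0.85,
) -> list[dict]:
    """Fetch top_k candidates, then keep up to final_k non-duplicate chunks."""
    # `embed` is assumed to be defined elsewhere and to return the query's
    # embedding vector; `collection` is assumed to be a Chroma-style collection.
    results = collection.query(
        query_embeddings=[embed(query)],
        n_results=top_k,
        include=["documents", "metadatas", "distances"],
    )

    # Pair each document with its metadata and distance (results are per query,
    # so index 0 selects the single query issued above)
    docs = list(zip(
        results["documents"][0],
        results["metadatas"][0],
        results["distances"][0],
    ))

    # Walk results in rank order, skipping near-duplicates of anything kept
    seen_texts = []
    deduped = []
    for text, meta, dist in docs:
        is_dup = any(
            text_similarity(text, seen) > similarity_threshold
            for seen in seen_texts
        )
        if not is_dup:
            deduped.append({"content": text, "metadata": meta, "distance": dist})
            seen_texts.append(text)
        if len(deduped) >= final_k:
            break

    return deduped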
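
It helps to know how similar two adjacent chunks actually look to a `SequenceMatcher`-based filter. The sketch below builds two consecutive windows over unique placeholder tokens and reports their ratio; `adjacent_similarity` is a hypothetical helper, and real text with repeated words will score somewhat higher than this idealized case.

```python
from difflib import SequenceMatcher


def adjacent_similarity(chunk_size: int, overlap: int) -> float:
    """SequenceMatcher ratio between two consecutive windows of unique tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    stride = chunk_size - overlap
    tokens = [f"t{i}" for i in range(chunk_size + stride)]
    first = tokens[:chunk_size]
    second = tokens[stride:stride + chunk_size]
    # ratio() is 2 * matched / (len(first) + len(second))
    return SequenceMatcher(None, first, second, autojunk=False).ratio()


print(adjacent_similarity(512, 64))  # 12.5% overlap
print(adjacent_similarity(100, 90))  # 90% overlap
```

For equal-sized windows the ratio equals the overlap fraction, so neighbours at 12.5% overlap score far below a 0.85 threshold and are both kept. A similarity filter of this kind mainly removes near-identical chunks, such as those produced by very high overlap or by the same passage indexed twice.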

Overlap and Metadata Continuity

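When indexing with overlap, attach the same document-level metadata to every chunk and record each chunk's token range, so that overlapping chunks can be traced back to their position in the source document: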

Python
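def chunk_with_metadata(
    text: str,
    doc_id: str,
    source: str,
    chunk_size: int = 512,
    overlap: int = 64,
) -> list[dict]:
    """Split text into overlapping token windows tagged with source position."""
    # `tokenizer` is assumed to be defined elsewhere (for example, a Hugging
    # Face tokenizer matching the embedding model).
    if not 0 <= overlap < chunk_size:
        # A stride of zero or less would make range() fail or never advance
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    token_ids = tokenizer.encode(text, add_special_tokens=False)
    chunks = []
    stride = chunk_size - overlap  # tokens advanced per chunk

    for i, start in enumerate(range(0, len(token_ids), stride)):
        end = min(start + chunk_size, len(token_ids))
        chunk_text = tokenizer.decode(token_ids[start:end])

        chunks.append({
            "id": f"{doc_id}_chunk_{i}",
            "text": chunk_text,
            "metadata": {
                "doc_id": doc_id,
                "source": source,
                "chunk_index": i,
                "token_start": start,  # inclusive
                "token_end": end,      # exclusive
                "is_first_chunk": (i == 0),
            }
        })

        # Stop once a window reaches the end, avoiding a tail chunk that
        # would contain only already-covered overlap tokens
        if end >= len(token_ids):
            break

    return chunks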
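
The `token_start` and `token_end` fields also allow overlapping hits from the same document to be merged into one continuous span instead of being sent to the LLM twice. The following is a minimal sketch, assuming each retrieved hit is a dict with a `metadata` entry shaped like the one above; `merge_overlapping_spans` is a hypothetical helper, and re-decoding the merged span back into text is left to the caller.

```python
def merge_overlapping_spans(hits: list[dict]) -> list[dict]:
    """Collapse hits from the same document whose token ranges overlap or touch."""
    by_doc: dict[str, list[tuple[int, int]]] = {}
    for hit in hits:
        meta = hit["metadata"]
        by_doc.setdefault(meta["doc_id"], []).append(
            (meta["token_start"], meta["token_end"])
        )

    merged = []
    for doc_id, spans in by_doc.items():
        spans.sort()
        cur_start, cur_end = spans[0]
        for start, end in spans[1:]:
            if start <= cur_end:
                # Ranges overlap or are adjacent: extend the current span
                cur_end = max(cur_end, end)
            else:
                merged.append(
                    {"doc_id": doc_id, "token_start": cur_start, "token_end": cur_end}
                )
                cur_start, cur_end = start, end
        merged.append(
            {"doc_id": doc_id, "token_start": cur_start, "token_end": cur_end}
        )
    return merged


hits = [
    {"metadata": {"doc_id": "a", "token_start": 448, "token_end": 960}},
    {"metadata": {"doc_id": "b", "token_start": 0, "token_end": 512}},
    {"metadata": {"doc_id": "a", "token_start": 0, "token_end": 512}},
    {"metadata": {"doc_id": "a", "token_start": 1792, "token_end": 2304}},
]
for span in merge_overlapping_spans(hits):
    print(span)
```

Here the two neighbouring chunks of document `a` collapse into a single 0–960 span, while the distant chunk and the chunk from document `b` stay separate. Merging by position complements text-similarity deduplication, which does not catch neighbours that share only a small overlap.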

Interview Answer

"Chunk overlap prevents sentences at chunk boundaries from appearing incomplete in both adjacent chunks. A query about information that straddles a boundary will still find a complete answer in an overlapping chunk. The recommended overlap is 10–20% of chunk size — enough to cover boundary sentences without creating near-duplicate chunks that pollute retrieval. When top-k retrieval returns overlapping chunks (common with high overlap), deduplicate before sending to the LLM to avoid wasting context tokens on repeated content."