Chunk Overlap and Boundary Handling

Why Overlap Exists

Sentences often straddle chunk boundaries. Without overlap, such a sentence is cut in two and neither adjacent chunk contains it in full:

Chunk 1 (no overlap):
  "...INR monitoring is required to prevent both bleeding and
  clotting events. The optimal INR range for"

Chunk 2 (no overlap):
  "patients with AF on warfarin is 2.0–3.0. Values above 3.0..."

Query: "What is the INR target for AF patients?"
  Chunk 1 retrieved: ends mid-sentence, incomplete answer
  Chunk 2 retrieved: starts with no context about what 2.0–3.0 refers to

With overlap (40 tokens):
  Chunk 1: "...INR monitoring is required. The optimal INR range for patients with AF on warfarin is 2.0–3.0."
  Chunk 2: "The optimal INR range for patients with AF on warfarin is 2.0–3.0. Values above 3.0 increase bleeding risk..."
  
  Either chunk retrieved gives a complete answer.
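
A minimal sketch of the sliding window behind this example, assuming the text is already tokenized. The name chunk_with_overlap and the whitespace split in the usage lines are illustrative assumptions; a real pipeline would use the embedding model's own tokenizer.

```python
def chunk_with_overlap(
    tokens: list[str],
    chunk_size: int = 512,
    overlap: int = 64,
) -> list[list[str]]:
    """Split tokens into windows of chunk_size that share `overlap` tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    stride = chunk_size - overlap  # how far each window advances
    chunks = []
    for start in range(0, len(tokens), stride):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the document
    return chunks


# Usage: whitespace "tokens" stand in for real tokenizer output.
words = "a b c d e f g h i j".split()
for chunk in chunk_with_overlap(words, chunk_size=4, overlap=1):
    print(chunk)
```

Each window repeats the last `overlap` tokens of the previous one, so a sentence shorter than the overlap that is cut by a boundary appears whole in the next chunk.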

Overlap Arithmetic

chunk_size = 512 tokens
overlap = 64 tokens

Number of chunks for a 5000-token document:
  stride = chunk_size - overlap = 448 tokens
  chunks = ceil(5000 / 448) = ceil(11.16) = 12 chunks

Storage overhead from overlap:
  Extra tokens = (chunks - 1) × overlap = 11 × 64 = 704 extra tokens
  Overhead = 704 / 5000 = 14%

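Memory in vector store (one 768-dimensional float32 vector per chunk):
  Without overlap: ceil(5000 / 512) = 10 chunks × 768 floats × 4 bytes = 30 KB
  With overlap:    12 chunks × 768 floats × 4 bytes = 36 KB
  20% more vector storage for much better boundary coverage
  (vector storage grows with chunk count, so it differs from the 14% token overhead)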
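
The arithmetic above can be reproduced with a small helper. This is a sketch: the name overlap_stats and its parameters are assumptions, not part of any library. It counts chunks as ceil((n - overlap) / stride), which gives the same 12 chunks for this document and avoids counting a trailing window that would hold only tokens already covered by the previous chunk.

```python
import math


def overlap_stats(
    doc_tokens: int,
    chunk_size: int = 512,
    overlap: int = 64,
    dim: int = 768,
    bytes_per_float: int = 4,
) -> dict:
    """Chunk count, token overhead, and vector storage for one document."""
    if doc_tokens <= 0:
        raise ValueError("doc_tokens must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    stride = chunk_size - overlap
    # Exact window count for a sliding window that stops at the document end.
    chunks = max(1, math.ceil((doc_tokens - overlap) / stride))
    baseline_chunks = math.ceil(doc_tokens / chunk_size)  # no overlap
    extra_tokens = (chunks - 1) * overlap  # tokens stored twice

    return {
        "stride": stride,
        "chunks": chunks,
        "extra_tokens": extra_tokens,
        "token_overhead": extra_tokens / doc_tokens,
        "vector_bytes": chunks * dim * bytes_per_float,
        "baseline_vector_bytes": baseline_chunks * dim * bytes_per_float,
    }


stats = overlap_stats(5000)
print(stats["chunks"], stats["extra_tokens"])                 # 12 704
print(stats["baseline_vector_bytes"], stats["vector_bytes"])  # 30720 36864
```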

How Much Overlap

Too little (0–5%):
  Sentences at boundaries may be split
  Answers spanning boundaries may not be retrieved

Right amount (10–20%):
  Most boundary sentences appear complete in at least one chunk
  Manageable storage overhead

Too much (50%+):
  Near-duplicate chunks pollute retrieval results
  The same content is retrieved multiple times, crowding distinct passages out of the top-k
  LLM sees repetitive context; the longer prompt can also aggravate the "lost in the middle" effect

Recommended defaults:
  chunk_size 256, overlap 32  (12.5%)  — for tight embedding model limits
  chunk_size 512, overlap 64  (12.5%)  — general purpose
  chunk_size 1024, overlap 128 (12.5%) — for large context models

Deduplication After Retrieval

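When top-k retrieval returns near-identical chunks (from very high overlap or duplicated source text), deduplicate before sending to the LLM. The similarity check below removes only near-duplicates: adjacent chunks that share a 10–20% overlap score well under the 0.85 threshold and are kept.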

Python

from difflib import SequenceMatcher

def deduplicate_chunks(
    chunks: list[str],
    similarity_threshold: float = 0.85,
) -> list[str]:
    """Remove chunks that are highly similar to an earlier chunk."""
    unique = []
    for candidate in chunks:
        is_duplicate = any(
            SequenceMatcher(None, candidate, existing).ratio() > similarity_threshold
            for existing in unique
        )
        if not is_duplicate:
            unique.append(candidate)
    return unique


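def retrieve_and_deduplicate(
    query: str,
    collection,
    top_k: int = 10,
    final_k: int = 5,
    similarity_threshold: float = 0.85,
) -> list[dict]:
    """Over-fetch top_k results, then keep the first final_k non-duplicates."""
    # `collection` is assumed to expose a Chroma-style query() method, and
    # `embed` is assumed to be an embedding function defined elsewhere.
    results = collection.query(
        query_embeddings=[embed(query)],
        n_results=top_k,
        include=["documents", "metadatas", "distances"],
    )

    # Pair each document with its metadata and distance. Results arrive
    # ranked, so the first copy of any duplicate is the best-scoring one.
    docs = zip(
        results["documents"][0],
        results["metadatas"][0],
        results["distances"][0],
    )

    deduped = []
    for text, meta, dist in docs:
        is_dup = any(
            SequenceMatcher(None, text, kept["content"]).ratio() > similarity_threshold
            for kept in deduped
        )
        if not is_dup:
            deduped.append({"content": text, "metadata": meta, "distance": dist})
        if len(deduped) >= final_k:
            break  # enough unique chunks collected

    return deduped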
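
A similarity threshold of 0.85 leaves ordinary neighbours untouched, because two adjacent chunks share only their overlap region. If both neighbours are retrieved, one option is to stitch them so the shared tokens reach the LLM once. The sketch below is an alternative technique, not part of the code above; the names shared_boundary and stitch are hypothetical, and it assumes the two chunks are known to be adjacent (for example from a chunk index stored in metadata) and are available as token lists.

```python
def shared_boundary(left: list[str], right: list[str], max_overlap: int) -> int:
    """Length of the longest suffix of `left` that equals a prefix of `right`."""
    limit = min(max_overlap, len(left), len(right))
    for size in range(limit, 0, -1):  # try the longest candidate first
        if left[-size:] == right[:size]:
            return size
    return 0


def stitch(left: list[str], right: list[str], max_overlap: int = 128) -> list[str]:
    """Join two adjacent chunks, keeping the overlapping tokens only once."""
    size = shared_boundary(left, right, max_overlap)
    return left + right[size:]


# Usage with whitespace tokens standing in for real tokenizer output.
chunk_1 = "The optimal INR range for patients with AF".split()
chunk_2 = "patients with AF on warfarin is 2.0–3.0.".split()
print(" ".join(stitch(chunk_1, chunk_2)))
```

Set max_overlap to at least the overlap used at indexing time; a smaller value would miss the shared region and leave the repeated tokens in place.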

Overlap and Metadata Continuity

When indexing with overlap, carry metadata across chunks to preserve document context: