Learnixo
AI Systems · beginner

Chunk Overlap and Boundary Handling

Why chunk overlap exists, how much to use, how it affects storage and retrieval, and strategies for handling boundaries in clinical RAG.

Asma Hafeez Khan · May 21, 2026 · 4 min read
RAG, Chunking, Overlap, Document Processing

Why Overlap Exists

Fixed-size chunking cuts wherever the token budget runs out, so sentences often straddle a chunk boundary. Without overlap, such a sentence is split in two and appears incomplete in both adjacent chunks:

Chunk 1 (no overlap):
  "...INR monitoring is required to prevent both bleeding and
  clotting events. The optimal INR range for"

Chunk 2 (no overlap):
  "patients with AF on warfarin is 2.0–3.0. Values above 3.0..."

Query: "What is the INR target for AF patients?"
  Chunk 1 retrieved: ends mid-sentence, incomplete answer
  Chunk 2 retrieved: starts with no context about what 2.0–3.0 refers to

With overlap (40 tokens):
  Chunk 1: "...INR monitoring is required. The optimal INR range for patients with AF on warfarin is 2.0–3.0."
  Chunk 2: "The optimal INR range for patients with AF on warfarin is 2.0–3.0. Values above 3.0 increase bleeding risk..."
  
  Either chunk retrieved gives a complete answer.
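
To make the boundary effect concrete, here is a minimal sketch of a sliding-window splitter. It is a toy under stated assumptions: it splits on whitespace words rather than model tokens, the function name `sliding_window` is hypothetical, and the sizes (16 words, 4 words of overlap) are chosen only so the example sentence straddles a boundary.

```python
def sliding_window(tokens: list[str], chunk_size: int, overlap: int) -> list[list[str]]:
    """Split tokens into windows of chunk_size that share `overlap` tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    stride = chunk_size - overlap
    windows = []
    for start in range(0, len(tokens), stride):
        windows.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reaches the end of the text
    return windows


# Toy text: whitespace-separated words stand in for tokens (an assumption).
text = (
    "Regular INR monitoring is required to prevent both bleeding and "
    "clotting events. The optimal INR range for patients with AF on "
    "warfarin is 2.0–3.0. Values above 3.0 increase bleeding risk."
)
words = text.split()

for overlap in (0, 4):
    print(f"overlap={overlap}")
    for window in sliding_window(words, chunk_size=16, overlap=overlap):
        print("  ", " ".join(window))
```

With `overlap=0` the INR-target sentence is cut after "range"; with `overlap=4` the second window starts four words earlier and contains the whole sentence.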

Overlap Arithmetic

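chunk_size = 512 tokens
overlap = 64 tokens

Number of chunks for a 5000-token document:
  stride = chunk_size - overlap = 448 tokens
  chunks = ceil((5000 - overlap) / stride) = ceil(4936 / 448) = 12 chunks

Storage overhead from overlap (chunk text):
  Extra tokens = (chunks - 1) × overlap = 11 × 64 = 704 extra tokens
  Overhead = 704 / 5000 ≈ 14%

Memory in vector store (assuming 768-dimensional float32 embeddings, one vector per chunk):
  Without overlap: ceil(5000 / 512) = 10 chunks × 768 floats × 4 bytes = 30 KB
  With overlap:    12 chunks × 768 floats × 4 bytes = 36 KB
  20% more vector storage for much better boundary coverage
  (vector storage scales with chunk count, not token count, which is why it exceeds the 14% text overhead)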
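
The same arithmetic can be wrapped in a small helper for trying other sizes. This is a sketch, not a library function: `overlap_stats` and its parameter names are hypothetical, the 768-dimensional float32 embedding is the same assumption used in the figures above, and the chunk count uses `ceil((n_tokens - overlap) / stride)`, which matches a sliding window that stops once a chunk reaches the end of the document.

```python
import math


def overlap_stats(
    n_tokens: int,
    chunk_size: int,
    overlap: int,
    dim: int = 768,            # assumed embedding dimension
    bytes_per_float: int = 4,  # assumed float32 storage
) -> dict:
    """Chunk count and storage overhead for fixed-size chunking with overlap."""
    if n_tokens <= 0:
        raise ValueError("n_tokens must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")

    stride = chunk_size - overlap
    n_chunks = max(1, math.ceil((n_tokens - overlap) / stride))
    baseline_chunks = math.ceil(n_tokens / chunk_size)  # no overlap
    extra_tokens = (n_chunks - 1) * overlap              # tokens stored twice

    return {
        "stride": stride,
        "chunks": n_chunks,
        "extra_tokens": extra_tokens,
        "token_overhead": extra_tokens / n_tokens,
        "vector_bytes": n_chunks * dim * bytes_per_float,
        "vector_overhead": n_chunks / baseline_chunks - 1,
    }


stats = overlap_stats(n_tokens=5000, chunk_size=512, overlap=64)
print(stats)
```

Changing `overlap` to 256 in the call above shows how quickly both overheads grow once overlap approaches half the chunk size.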

How Much Overlap

Too little (0–5%):
  Sentences at boundaries may be split
  Answers spanning boundaries may not be retrieved

Right amount (10–20%):
  Most boundary sentences appear complete in at least one chunk
  Manageable storage overhead

Too much (50%+):
  Near-duplicate chunks pollute retrieval results
  The same content is retrieved multiple times and crowds distinct passages out of the top-k
  LLM sees repetitive context and spends context tokens on repeats

Recommended defaults:
  chunk_size 256, overlap 32  (12.5%)  — for tight embedding model limits
  chunk_size 512, overlap 64  (12.5%)  — general purpose
  chunk_size 1024, overlap 128 (12.5%) — for large context models
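
One way to see why 50% overlap produces near-duplicates is to count how many chunks each token ends up in. The sketch below is illustrative only: `copies_per_token` is a hypothetical helper, and it simulates token positions rather than running a real tokenizer.

```python
from collections import Counter


def copies_per_token(n_tokens: int, chunk_size: int, overlap: int) -> Counter:
    """Map 'number of chunks containing a token' to 'how many tokens'."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    stride = chunk_size - overlap
    counts = [0] * n_tokens
    for start in range(0, n_tokens, stride):
        end = min(start + chunk_size, n_tokens)
        for pos in range(start, end):
            counts[pos] += 1
        if end >= n_tokens:
            break  # last chunk reached the end of the document
    return Counter(counts)


for overlap in (64, 256):
    histogram = copies_per_token(n_tokens=5000, chunk_size=512, overlap=overlap)
    duplicated = sum(n for copies, n in histogram.items() if copies > 1)
    print(f"overlap={overlap}: {duplicated / 5000:.0%} of tokens stored more than once")
```

For a 5000-token document this gives roughly 14% of tokens stored twice at 12.5% overlap, against roughly 92% at 50% overlap, where almost every passage exists in two chunks.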

Deduplication After Retrieval

When top-k retrieval returns overlapping chunks, deduplicate before sending to the LLM:

Python
from difflib import SequenceMatcher

def deduplicate_chunks(
    chunks: list[str],
    similarity_threshold: float = 0.85,
) -> list[str]:
    """Remove chunks that are highly similar to an earlier chunk."""
    unique = []
    for candidate in chunks:
        is_duplicate = any(
            SequenceMatcher(None, candidate, existing).ratio() > similarity_threshold
            for existing in unique
        )
        if not is_duplicate:
            unique.append(candidate)
    return unique


def retrieve_and_deduplicate(
    query: str,
    collection,
    top_k: int = 10,
    final_k: int = 5,
    similarity_threshold: float = 0.85,
) -> list[dict]:
    """Over-fetch top_k results, then keep up to final_k distinct chunks.

    Assumes `collection` exposes a Chroma-style `query` method and that
    `embed` is an embedding function defined elsewhere.
    """
    results = collection.query(
        query_embeddings=[embed(query)],
        n_results=top_k,
        include=["documents", "metadatas", "distances"],
    )

    # Pair docs with metadata and distance (results arrive best match first)
    docs = list(zip(
        results["documents"][0],
        results["metadatas"][0],
        results["distances"][0],
    ))

    # Deduplicate, keeping the best-ranked copy of each near-duplicate
    seen_texts = []
    deduped = []
    for text, meta, dist in docs:
        is_dup = any(
            SequenceMatcher(None, text, seen).ratio() > similarity_threshold
            for seen in seen_texts
        )
        if not is_dup:
            deduped.append({"content": text, "metadata": meta, "distance": dist})
            seen_texts.append(text)
        if len(deduped) >= final_k:
            break

    return deduped
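
It helps to know what a 0.85 threshold actually catches. The sketch below estimates the `SequenceMatcher` ratio between two neighbouring chunks under a simplifying assumption: every token is unique, so the only match is the shared overlap region. Real text compared character by character will usually score somewhat higher, so treat the numbers as a rough lower bound. `neighbor_similarity` is a hypothetical helper name.

```python
from difflib import SequenceMatcher


def neighbor_similarity(chunk_size: int, overlap: int) -> float:
    """SequenceMatcher ratio of two adjacent chunks made of unique token ids."""
    stride = chunk_size - overlap
    first = list(range(0, chunk_size))
    second = list(range(stride, stride + chunk_size))
    # autojunk=False keeps the comparison exact for long sequences
    return SequenceMatcher(None, first, second, autojunk=False).ratio()


for overlap in (64, 256, 460):
    ratio = neighbor_similarity(chunk_size=512, overlap=overlap)
    print(f"overlap={overlap}: ratio={ratio:.3f}")
```

Under this assumption the ratio is simply `overlap / chunk_size`, which suggests that neighbours at 10–20% overlap stay well below 0.85 and are both kept. The threshold mainly removes near-identical passages, such as chunks built with very high overlap or the same text indexed twice.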

Overlap and Metadata Continuity

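When indexing with overlap, store each chunk's position (chunk index plus token_start and an exclusive token_end) alongside the shared document metadata, so overlapping neighbours can be identified and reassembled later: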

Python
def chunk_with_metadata(
    text: str,
    doc_id: str,
    source: str,
    chunk_size: int = 512,
    overlap: int = 64,
) -> list[dict]:
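    # `tokenizer` is assumed to be a Hugging Face-style tokenizer loaded elsewhere.
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    token_ids = tokenizer.encode(text, add_special_tokens=False)
    chunks = []
    stride = chunk_size - overlap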
    
    for i, start in enumerate(range(0, len(token_ids), stride)):
        end = min(start + chunk_size, len(token_ids))
        chunk_text = tokenizer.decode(token_ids[start:end])
        
        chunks.append({
            "id": f"{doc_id}_chunk_{i}",
            "text": chunk_text,
            "metadata": {
                "doc_id": doc_id,
                "source": source,
                "chunk_index": i,
                "token_start": start,
                "token_end": end,
                "is_first_chunk": (i == 0),
            }
        })
        
        if end >= len(token_ids):
            break
    
    return chunks
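
Once token offsets are stored, overlapping hits from the same document can be merged by position instead of by text similarity. The following is a minimal sketch under assumptions: each retrieved hit is a dict with a `metadata` entry carrying `doc_id`, `token_start`, and an exclusive `token_end`, as produced above, and `merge_overlapping_spans` is a hypothetical helper name.

```python
from collections import defaultdict


def merge_overlapping_spans(hits: list[dict]) -> list[dict]:
    """Merge retrieved chunks whose token ranges overlap or touch, per document."""
    spans_by_doc = defaultdict(list)
    for hit in hits:
        meta = hit["metadata"]
        spans_by_doc[meta["doc_id"]].append((meta["token_start"], meta["token_end"]))

    merged = []
    for doc_id, spans in spans_by_doc.items():
        spans.sort()
        cur_start, cur_end = spans[0]
        for start, end in spans[1:]:
            if start <= cur_end:
                # Shares tokens with (or directly follows) the current span
                cur_end = max(cur_end, end)
            else:
                merged.append({"doc_id": doc_id, "token_start": cur_start, "token_end": cur_end})
                cur_start, cur_end = start, end
        merged.append({"doc_id": doc_id, "token_start": cur_start, "token_end": cur_end})
    return merged


hits = [
    {"metadata": {"doc_id": "A", "token_start": 448, "token_end": 960}},
    {"metadata": {"doc_id": "A", "token_start": 0, "token_end": 512}},
    {"metadata": {"doc_id": "A", "token_start": 1792, "token_end": 2304}},
    {"metadata": {"doc_id": "B", "token_start": 0, "token_end": 512}},
]
print(merge_overlapping_spans(hits))
```

Each merged span can then be decoded once from the document's token ids, if those are kept, so the LLM receives one continuous passage rather than two chunks that repeat the overlap region.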

Interview Answer

"Chunk overlap prevents sentences at chunk boundaries from appearing incomplete in both adjacent chunks. A query about information that straddles a boundary will still find a complete answer in an overlapping chunk. The recommended overlap is 10–20% of chunk size — enough to cover boundary sentences without creating near-duplicate chunks that pollute retrieval. When top-k retrieval returns overlapping chunks (common with high overlap), deduplicate before sending to the LLM to avoid wasting context tokens on repeated content."
