Document Chunking Strategies

Master chunking: fixed-size, sentence, paragraph, recursive, and document-aware strategies. Learn how chunk size, overlap, and boundaries drive retrieval quality.

Asma Hafeez Khan | May 15, 2026 | 7 min read
Tags: RAG, Chunking, Text Splitting, Retrieval, Document Processing

Chunking is one of the highest-leverage decisions in RAG. A retrieval system with a great embedding model and poor chunking will often underperform one with a mediocre embedding model and great chunking. The reason is that an embedding can only encode what its chunk contains. If the chunk is too large, the embedding averages over too many topics and loses specificity. If it is too small, it lacks the context needed to be understood or matched.

Why Chunking Matters

Bad chunk (too large, 1500 tokens):
"Our refund policy allows 30-day returns. Shipping takes 3–5 days.
 We offer premium support. Products come with a 1-year warranty.
 Our headquarters is in Austin, Texas. We were founded in 2019..."

Query: "What is the refund period?"
Embedding similarity: 0.62  ← noisy, many topics dilute the signal

Good chunk (focused, 80 tokens):
"Our refund policy allows returns within 30 days of purchase,
 provided the item is in original condition with all packaging."

Query: "What is the refund period?"
Embedding similarity: 0.91  ← clean signal, focused content
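
The dilution effect can be reproduced without an embedding model. The sketch below is a toy, assuming bag-of-words vectors and a small hand-picked stopword list as a stand-in for real embeddings; the function names are made up for illustration, and the scores it prints are not the ones quoted above.

```python
import math
import re
from collections import Counter

# Hypothetical stopword list, just enough for this example.
STOPWORDS = {"what", "is", "the", "a", "of", "in", "our", "we"}

def bow_vector(text: str) -> Counter:
    """Lowercased word counts without stopwords: a crude stand-in for an embedding."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return Counter(t for t in tokens if t not in STOPWORDS)

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

query = "What is the refund period?"
focused = "Our refund policy allows returns within 30 days of purchase."
diluted = (
    focused
    + " Shipping takes 3-5 days. We offer premium support."
    + " Products come with a 1-year warranty. Our headquarters is in Austin, Texas."
)

q = bow_vector(query)
focused_score = cosine(q, bow_vector(focused))
diluted_score = cosine(q, bow_vector(diluted))
print(f"focused={focused_score:.2f}, diluted={diluted_score:.2f}")
```

The matching word is the same in both chunks; only the amount of unrelated text changes, and that alone lowers the score.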

Strategy 1: Fixed-Size Character Chunking

The simplest approach: split every N characters with M characters of overlap.

Python
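def fixed_char_chunk(text: str, size: int = 1000, overlap: int = 100) -> list[str]:
    # overlap must be smaller than size, otherwise start never advances
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    chunks = []
    start = 0
    while start < len(text):
        end = start + size
        chunks.append(text[start:end])
        if end >= len(text):
            break  # last window reached the end; avoid a redundant tail chunk
        start = end - overlap  # consecutive chunks share `overlap` characters
    return chunks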

# Example
text = "A" * 3000
chunks = fixed_char_chunk(text, size=1000, overlap=100)
print(f"{len(chunks)} chunks")  # 4 chunks: [0..1000], [900..1900], [1800..2800], [2700..3000]

Pros: simple, predictable size, easy to implement. Cons: splits mid-sentence and even mid-word, and ignores document structure such as headings and lists, so it is a poor fit for structured documents.
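
One cheap mitigation for mid-word splits is to pull each chunk's end back to the nearest preceding space. This is a minimal sketch under that assumption; `fixed_char_chunk_words` is a hypothetical helper, and with a non-zero overlap the start of a chunk can still land mid-word.

```python
def fixed_char_chunk_words(text: str, size: int = 1000, overlap: int = 100) -> list[str]:
    """Fixed-size chunks whose end is pulled back to the last space in the window."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + size, len(text))
        if end < len(text):
            # Prefer the last space inside the window; keep the hard cut if there is none.
            cut = text.rfind(" ", start, end)
            if cut > start:
                end = cut
        chunks.append(text[start:end].strip())
        if end >= len(text):
            break
        # Step forward, always making progress even if the window shrank.
        start = max(end - overlap, start + 1)
    return chunks

sample = "alpha beta gamma delta epsilon zeta"
word_chunks = fixed_char_chunk_words(sample, size=12, overlap=0)
print(word_chunks)
```

Chunks become slightly shorter than `size`, which is usually an acceptable trade for never cutting a word in half.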

Strategy 2: Token-Based Chunking

Count tokens with the model's actual tokenizer rather than counting characters. This is more accurate for staying within embedding-model and LLM context limits.

Python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # GPT-4 encoding

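def token_chunk(text: str, max_tokens: int = 512, overlap_tokens: int = 50) -> list[str]:
    # overlap must be smaller than the window, otherwise the loop never advances
    if not 0 <= overlap_tokens < max_tokens:
        raise ValueError("overlap_tokens must satisfy 0 <= overlap_tokens < max_tokens")
    tokens = enc.encode(text)
    chunks = []
    start = 0
    while start < len(tokens):
        end = start + max_tokens
        chunk_tokens = tokens[start:end]
        chunks.append(enc.decode(chunk_tokens))
        if end >= len(tokens):
            break  # the last window reached the end of the document
        start = end - overlap_tokens
    return chunks

# Verify chunk sizes
with open("large_document.txt", encoding="utf-8") as f:
    text = f.read()
chunks = token_chunk(text, max_tokens=512)
# Re-encoding a decoded chunk can differ by a token or two at the boundaries
sizes = [len(enc.encode(c)) for c in chunks]
print(f"Min: {min(sizes)}, Max: {max(sizes)}, Avg: {sum(sizes)/len(sizes):.0f} tokens")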

Strategy 3: Sentence Chunking

Split on sentence boundaries, then group into chunks of N sentences.

Python
import spacy
nlp = spacy.load("en_core_web_sm")

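def sentence_chunk(text: str, sentences_per_chunk: int = 5, overlap: int = 1) -> list[str]:
    # overlap is counted in sentences and must leave a positive step
    if not 0 <= overlap < sentences_per_chunk:
        raise ValueError("overlap must satisfy 0 <= overlap < sentences_per_chunk")

    doc = nlp(text)
    sentences = [sent.text.strip() for sent in doc.sents if sent.text.strip()]

    chunks = []
    step = sentences_per_chunk - overlap
    for i in range(0, len(sentences), step):
        group = sentences[i : i + sentences_per_chunk]
        chunks.append(" ".join(group))
        if i + sentences_per_chunk >= len(sentences):
            break  # the final window already covers the remaining sentences

    return chunks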

# Alternative: use NLTK for a lighter dependency
import nltk
# Newer NLTK releases (3.9+) expect the "punkt_tab" resource instead of "punkt"
nltk.download("punkt", quiet=True)
from nltk.tokenize import sent_tokenize

def sentence_chunk_nltk(text: str, sentences_per_chunk: int = 5) -> list[str]:
    sentences = sent_tokenize(text)
    return [
        " ".join(sentences[i : i + sentences_per_chunk])
        for i in range(0, len(sentences), sentences_per_chunk)
    ]

Pros: preserves complete thoughts, no mid-sentence splits. Cons: uneven chunk sizes; a document with very long or very short sentences produces wildly different chunk sizes.
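
One way to tame the uneven sizes is to pack whole sentences greedily up to a size budget instead of counting sentences. The sketch below assumes a naive regex sentence splitter and a character budget to stay dependency-free; both helper names are hypothetical, and a token-based length function could be swapped in.

```python
import re

def split_sentences(text: str) -> list[str]:
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def sentence_chunk_budget(text: str, max_chars: int = 1000) -> list[str]:
    """Greedily pack whole sentences into chunks of at most max_chars characters.

    A single sentence longer than max_chars is kept whole as its own chunk.
    """
    chunks: list[str] = []
    current: list[str] = []
    current_len = 0
    for sentence in split_sentences(text):
        # +1 accounts for the joining space when the chunk is non-empty
        added = len(sentence) + (1 if current else 0)
        if current and current_len + added > max_chars:
            chunks.append(" ".join(current))
            current, current_len = [], 0
            added = len(sentence)
        current.append(sentence)
        current_len += added
    if current:
        chunks.append(" ".join(current))
    return chunks

budget_chunks = sentence_chunk_budget("One. Two two. Three three three. Four.", max_chars=15)
print(budget_chunks)
```

Short sentences get grouped together while a long sentence stands alone, so chunk sizes cluster near the budget rather than varying with sentence length.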

Strategy 4: Recursive Character Text Splitting

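LangChain's RecursiveCharacterTextSplitter tries to split on natural boundaries in order of preference, moving to the next separator only when a piece is still too large. The library's built-in default list is shorter (paragraph break, line break, space, character); the example here passes sentence and clause separators explicitly: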

1. \n\n  (paragraph break)
2. \n    (line break)
3. .     (sentence end)
4. ,     (clause)
5. " "   (word boundary)
6. ""    (character boundary, last resort)

Python
from langchain_text_splitters import RecursiveCharacterTextSplitter
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,
    chunk_overlap=64,
    length_function=lambda text: len(enc.encode(text)),  # token count, not char count
    separators=["\n\n", "\n", ". ", ", ", " ", ""],
    is_separator_regex=False,
)

with open("document.txt") as f:
    text = f.read()

chunks = splitter.split_text(text)
print(f"Created {len(chunks)} chunks")
for i, c in enumerate(chunks[:3]):
    print(f"\nChunk {i+1} ({len(enc.encode(c))} tokens):\n{c[:100]}...")

This is a strong default for mixed-format text documents, because paragraphs and sentences stay intact whenever they fit within the size budget.
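
To make the recursion concrete, here is a simplified pure-Python sketch of the idea. It is not LangChain's implementation: it measures length in characters, applies no overlap, and drops the separator at chunk boundaries. The function name and defaults are assumptions for illustration.

```python
def recursive_split(text: str, max_len: int,
                    separators: tuple[str, ...] = ("\n\n", "\n", ". ", " ")) -> list[str]:
    """Split text into pieces of at most max_len characters, coarsest separator first."""
    if len(text) <= max_len:
        return [text] if text else []
    if not separators:
        # Last resort: hard character cuts.
        return [text[i : i + max_len] for i in range(0, len(text), max_len)]

    sep, finer = separators[0], separators[1:]
    pieces = text.split(sep)
    if len(pieces) == 1:
        # This separator does not occur; try the next, finer one.
        return recursive_split(text, max_len, finer)

    chunks: list[str] = []
    current = ""
    for piece in pieces:
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= max_len:
            current = candidate  # keep merging small pieces
            continue
        if current:
            chunks.append(current)
        if len(piece) <= max_len:
            current = piece
        else:
            # The piece alone is too large: recurse with finer separators.
            chunks.extend(recursive_split(piece, max_len, finer))
            current = ""
    if current:
        chunks.append(current)
    return chunks

demo = "Para one is short.\n\nPara two has a much longer sentence. It keeps going."
demo_chunks = recursive_split(demo, max_len=30)
print(demo_chunks)
```

The short paragraph survives intact, while the long one is broken first at the sentence boundary and then at word boundaries, which is the behavior the separator ordering is meant to produce.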

Strategy 5: Paragraph-Based Chunking

Respect paragraph boundaries entirely — don't split within paragraphs.

Python
import re

import tiktoken  # needed for the token counts below

def paragraph_chunk(text: str, max_tokens: int = 512) -> list[str]:
    enc = tiktoken.get_encoding("cl100k_base")
    paragraphs = re.split(r"\n\s*\n", text.strip())
    paragraphs = [p.strip() for p in paragraphs if p.strip()]

    chunks = []
    current = []
    current_tokens = 0

    for para in paragraphs:
        para_tokens = len(enc.encode(para))

        if para_tokens > max_tokens:
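            # Single paragraph exceeds the limit, so it must be split on its own
            if current:
                chunks.append("\n\n".join(current))
                current, current_tokens = [], 0
            # Sub-split the large paragraph by sentence (sentence_chunk_nltk is defined
            # in Strategy 3); groups of 5 sentences are not guaranteed to fit max_tokens
            sub_chunks = sentence_chunk_nltk(para, sentences_per_chunk=5)
            chunks.extend(sub_chunks)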
        elif current_tokens + para_tokens > max_tokens:
            chunks.append("\n\n".join(current))
            current = [para]
            current_tokens = para_tokens
        else:
            current.append(para)
            current_tokens += para_tokens

    if current:
        chunks.append("\n\n".join(current))

    return chunks

Strategy 6: Document-Aware Chunking (Markdown Headers)

For structured documents, use headers as natural chunk boundaries.

Python
from langchain_text_splitters import MarkdownHeaderTextSplitter

headers_to_split_on = [
    ("#", "h1"),
    ("##", "h2"),
    ("###", "h3"),
]

md_splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=headers_to_split_on,
    strip_headers=False,
)

md_text = """
# Product Manual

## Installation

Follow these steps to install the product...

### Windows Installation

On Windows, run the installer from...

## Configuration

After installation, configure the settings...
"""

chunks = md_splitter.split_text(md_text)
for chunk in chunks:
    print(f"Metadata: {chunk.metadata}")
    print(f"Content: {chunk.page_content[:80]}\n")

# Then apply token-size limit on top
from langchain_text_splitters import RecursiveCharacterTextSplitter

token_splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,
    chunk_overlap=64,
    length_function=lambda t: len(enc.encode(t)),
)
final_chunks = token_splitter.split_documents(chunks)
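
For readers who want to see what header-aware splitting does under the hood, the following is a rough dependency-free sketch. It assumes ATX-style headers (`#`, `##`, `###`) and does not special-case `#` lines inside fenced code blocks; `markdown_sections` is a hypothetical name, not part of LangChain.

```python
import re

HEADER_RE = re.compile(r"^(#{1,3})\s+(.*)$")

def markdown_sections(md_text: str) -> list[dict]:
    """Split Markdown on #, ##, ### headers, tracking the header path of each section."""
    sections: list[dict] = []
    path: dict[int, str] = {}  # header level -> current title at that level
    body: list[str] = []

    def flush() -> None:
        content = "\n".join(body).strip()
        if content:
            sections.append({"headers": dict(path), "content": content})
        body.clear()

    for line in md_text.splitlines():
        match = HEADER_RE.match(line)
        if match:
            flush()
            level = len(match.group(1))
            # A new header replaces any header at the same or a deeper level.
            for stale in [k for k in path if k >= level]:
                del path[stale]
            path[level] = match.group(2).strip()
        else:
            body.append(line)
    flush()
    return sections

md_demo = "# Manual\n\n## Install\n\nRun it.\n\n### Windows\n\nUse exe.\n\n## Config\n\nEdit file.\n"
md_sections = markdown_sections(md_demo)
for section in md_sections:
    print(section["headers"], "->", section["content"])
```

Storing the header path as metadata lets you prepend it to the chunk text or filter on it at query time, so a short section still carries its place in the document.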

Strategy 7: Code-Aware Chunking

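For source code, and for technical documentation that embeds it, split at syntactic boundaries such as class and function definitions rather than at arbitrary character offsets. LangChain's language-aware splitter prefers those boundaries, although a single function longer than chunk_size can still be split.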

Python
from langchain_text_splitters import Language, RecursiveCharacterTextSplitter

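# Python-aware splitter: tries class and def boundaries before falling back to lines.
# chunk_size is measured in characters here because no length_function is passed.
python_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.PYTHON,
    chunk_size=512,
    chunk_overlap=64,
)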

code = '''
def process_order(order_id: str) -> dict:
    """Process a customer order."""
    order = db.get_order(order_id)
    if not order:
        raise ValueError(f"Order {order_id} not found")
    result = payment_service.charge(order)
    return {"status": "processed", "order_id": order_id}

def cancel_order(order_id: str) -> bool:
    """Cancel an existing order."""
    order = db.get_order(order_id)
    if order.status == "shipped":
        return False
    db.update_order(order_id, status="cancelled")
    return True
'''

chunks = python_splitter.split_text(code)
for c in chunks:
    print(c)
    print("---")
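
When the input is known to be valid Python, the standard library's `ast` module offers a stricter alternative: parse the file and emit one chunk per top-level definition. This is a sketch of that idea, not what LangChain does; `python_function_chunks` is a hypothetical helper, and it ignores module-level statements outside functions and classes.

```python
import ast

def python_function_chunks(source: str) -> list[str]:
    """Return one chunk per top-level function or class, using the parsed syntax tree."""
    tree = ast.parse(source)
    lines = source.splitlines()
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            # Decorators sit above node.lineno, so start at the earliest of them.
            first = min([node.lineno] + [d.lineno for d in node.decorator_list])
            chunks.append("\n".join(lines[first - 1 : node.end_lineno]))
    return chunks

demo_source = (
    "import os\n"
    "\n"
    "def a():\n"
    "    return 1\n"
    "\n"
    "@staticmethod\n"
    "def b():\n"
    "    return 2\n"
)
ast_chunks = python_function_chunks(demo_source)
for chunk in ast_chunks:
    print(chunk)
    print("---")
```

Because the boundaries come from the parser, a function body is never cut in half, at the cost of chunks whose size depends entirely on how the code is written.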

Small-to-Big Retrieval

A powerful pattern: embed small chunks for precision, but retrieve larger parent chunks for context.

Python
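# Import paths vary across LangChain releases; adjust them to your installed version.
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain.retrievers import ParentDocumentRetriever
from langchain_core.stores import InMemoryStore
from langchain_community.vectorstores import Qdrant
from langchain_openai import OpenAIEmbeddings
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams

# Parent splitter: large chunks (for context); sizes are in characters by default
parent_splitter = RecursiveCharacterTextSplitter(chunk_size=2000, chunk_overlap=0)

# Child splitter: small chunks (for precise retrieval)
child_splitter = RecursiveCharacterTextSplitter(chunk_size=400, chunk_overlap=40)

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

# Create the empty collection explicitly: building the store from an empty
# document list leaves no sample from which to infer the vector size.
# 1536 is the default dimension of text-embedding-3-small.
client = QdrantClient(location=":memory:")
client.create_collection(
    collection_name="children",
    vectors_config=VectorParams(size=1536, distance=Distance.COSINE),
)
vectorstore = Qdrant(client=client, collection_name="children", embeddings=embeddings)
docstore = InMemoryStore()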

retriever = ParentDocumentRetriever(
    vectorstore=vectorstore,
    docstore=docstore,
    child_splitter=child_splitter,
    parent_splitter=parent_splitter,
)

# Add documents: stores parents in docstore, children in vectorstore
from langchain_community.document_loaders import TextLoader
docs = TextLoader("document.txt").load()
retriever.add_documents(docs)

# At query time: searches child chunks, returns parent chunks
results = retriever.invoke("What is the refund policy?")  # invoke() supersedes the deprecated get_relevant_documents()
print(f"Retrieved {len(results)} parent documents")
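
The mechanics behind the pattern are small enough to sketch without any framework. The version below assumes fixed-size character children and a toy word-overlap score in place of vector search; all names are hypothetical.

```python
import re

def words(text: str) -> set[str]:
    """Lowercased word set used by the toy relevance score."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def build_children(parents: list[str], child_size: int = 80) -> list[tuple[int, str]]:
    """Cut each parent into fixed-size children, remembering the parent index."""
    children = []
    for parent_id, parent in enumerate(parents):
        for start in range(0, len(parent), child_size):
            children.append((parent_id, parent[start : start + child_size]))
    return children

def retrieve_parents(query: str, parents: list[str],
                     children: list[tuple[int, str]], top_k: int = 2) -> list[str]:
    """Rank children by word overlap with the query, then return their parents once each."""
    q = words(query)
    ranked = sorted(children, key=lambda child: len(q & words(child[1])), reverse=True)
    seen: list[int] = []
    for parent_id, _ in ranked[:top_k]:
        if parent_id not in seen:
            seen.append(parent_id)
    return [parents[i] for i in seen]

parent_docs = [
    "Refunds are accepted within 30 days of purchase. Items must be unused and in original packaging.",
    "Shipping takes 3 to 5 business days. Express delivery is available at checkout.",
]
child_index = build_children(parent_docs, child_size=40)
hits = retrieve_parents("refund within 30 days", parent_docs, child_index, top_k=1)
print(hits)
```

The match happens on a 40-character child, but the caller receives the full parent, including the packaging condition that the child alone would have cut off.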

Chunk Size vs Overlap Tradeoffs

| Chunk Size (tokens) | Retrieval Precision | Context Quality | Best For |
|---|---|---|---|
| 64–128 | Very high | Low (no context) | Fact lookup, keyword-heavy |
| 256–512 | High | Medium | General Q&A (recommended default) |
| 512–1024 | Medium | High | Summarization, reasoning |
| 1024–2048 | Low | Very high | Whole-section analysis |

Overlap guidelines:

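  • 10–15% of chunk size is a common starting point (64 tokens on a 512-token chunk is 12.5%)
  • Higher overlap means more redundancy and a larger index, but fewer answers lost at chunk boundaries
  • Zero overlap is mainly appropriate with small-to-big retrieval, where the larger parent chunk supplies the surrounding context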
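
The cost of overlap is easy to estimate before building anything. The sketch below assumes a sliding window that stops once it reaches the end of the text; `chunk_stats` is a hypothetical helper, and the 100,000-token document is an arbitrary example size.

```python
import math

def chunk_stats(total_tokens: int, chunk_size: int, overlap: int) -> dict:
    """Estimate chunk count and duplicated tokens for a sliding window."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    if total_tokens <= 0:
        return {"num_chunks": 0, "indexed_tokens": 0, "redundancy": 0.0}
    if total_tokens <= chunk_size:
        return {"num_chunks": 1, "indexed_tokens": total_tokens, "redundancy": 0.0}
    step = chunk_size - overlap
    num_chunks = math.ceil((total_tokens - overlap) / step)
    # Every boundary between two consecutive chunks repeats `overlap` tokens.
    indexed = total_tokens + (num_chunks - 1) * overlap
    return {
        "num_chunks": num_chunks,
        "indexed_tokens": indexed,
        "redundancy": indexed / total_tokens - 1,
    }

for overlap in (0, 64, 128):
    stats = chunk_stats(100_000, 512, overlap)
    print(f"overlap={overlap}: {stats['num_chunks']} chunks, "
          f"{stats['redundancy']:.1%} extra tokens indexed")
```

The extra index size grows slightly faster than the overlap percentage itself, because a smaller step also produces more chunks.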

Benchmarking Your Chunking Strategy

Python
from typing import Callable

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # used for the avg_chunk_tokens statistic

def benchmark_chunking(
    documents: list[str],
    qa_pairs: list[dict],
    chunking_fns: dict[str, Callable],
    embed_fn,
    store_fn,
    retrieve_fn,
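) -> dict:
    """Compare chunking functions by keyword hit rate at ranks 1 and 4.

    Each qa_pairs item needs "question" and "expected_keyword" keys, and
    retrieve_fn must return a list of dicts that each have a "text" key.
    """
    results = {}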

    for name, chunk_fn in chunking_fns.items():
        print(f"\nBenchmarking: {name}")

        # Chunk all documents
        all_chunks = []
        for doc in documents:
            all_chunks.extend(chunk_fn(doc))

        # Build index
        store = store_fn(all_chunks, embed_fn)

        # Evaluate retrieval
        hit_at_1 = 0
        hit_at_4 = 0
        for qa in qa_pairs:
            retrieved = retrieve_fn(store, qa["question"], top_k=4)
            expected_keyword = qa["expected_keyword"]
            if any(expected_keyword in r["text"] for r in retrieved[:1]):
                hit_at_1 += 1
            if any(expected_keyword in r["text"] for r in retrieved):
                hit_at_4 += 1

        n = len(qa_pairs)
        results[name] = {
            "hit@1": hit_at_1 / n,
            "hit@4": hit_at_4 / n,
            "num_chunks": len(all_chunks),
            "avg_chunk_tokens": sum(len(enc.encode(c)) for c in all_chunks) / len(all_chunks),
        }
        print(f"  hit@1={results[name]['hit@1']:.2%}, hit@4={results[name]['hit@4']:.2%}")

    return results

# Strategies to compare
strategies = {
    # 2000 characters is roughly 512 tokens of English text (about 4 characters per token)
    "fixed_512": lambda doc: fixed_char_chunk(doc, size=2000, overlap=200),
    "token_512": lambda doc: token_chunk(doc, max_tokens=512, overlap_tokens=64),
    "sentence_5": lambda doc: sentence_chunk_nltk(doc, sentences_per_chunk=5),
    "paragraph": lambda doc: paragraph_chunk(doc, max_tokens=512),
}
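
The benchmark leaves `embed_fn`, `store_fn`, and `retrieve_fn` to the caller. As a starting point for a dry run, here is a toy in-memory trio that matches the call shapes used above. It assumes bag-of-words vectors instead of a real embedding model, and all three names are hypothetical; swap in your actual embedder and vector store for meaningful numbers.

```python
import math
import re
from collections import Counter

def toy_embed(text: str) -> Counter:
    """Bag-of-words counts standing in for an embedding vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def toy_store(chunks: list[str], embed_fn) -> dict:
    """Embed every chunk once and keep the embedder for query time."""
    return {
        "embed_fn": embed_fn,
        "items": [{"text": c, "vector": embed_fn(c)} for c in chunks],
    }

def _cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def toy_retrieve(store: dict, question: str, top_k: int = 4) -> list[dict]:
    """Return the top_k stored items ranked by cosine similarity to the question."""
    q = store["embed_fn"](question)
    ranked = sorted(store["items"], key=lambda item: _cosine(q, item["vector"]), reverse=True)
    return ranked[:top_k]

toy_index = toy_store(["Refunds within 30 days.", "Shipping takes 5 days."], toy_embed)
top_hit = toy_retrieve(toy_index, "refunds policy", top_k=1)
print(top_hit[0]["text"])

# With the definitions from this article in scope, a dry run would look like:
# benchmark_chunking(documents, qa_pairs, strategies, toy_embed, toy_store, toy_retrieve)
```

A dry run like this checks that the plumbing works and that chunk counts look sane before you spend money on embedding calls.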

Practical Recommendations

Start with RecursiveCharacterTextSplitter at 512 tokens with 64-token overlap. This is the best default.

Use MarkdownHeaderTextSplitter when your documents are Markdown or have clear hierarchical structure.

Use small-to-big retrieval when questions require understanding context beyond a single passage.

Avoid character-based chunking in production — token-aware length functions are almost always worth the extra dependency.

Always inspect chunks visually before building your index. Print 20 random chunks and ask: does this chunk make sense on its own? Would this chunk answer questions about its topic?
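
A tiny helper makes that inspection repeatable. This is a minimal sketch; `sample_chunks` and `print_sample` are hypothetical names, and the fixed seed is only there so reruns show the same chunks.

```python
import random

def sample_chunks(chunks: list[str], k: int = 20, seed: int = 0) -> list[str]:
    """Pick up to k distinct chunks at random, reproducibly."""
    rng = random.Random(seed)
    return rng.sample(chunks, min(k, len(chunks)))

def print_sample(chunks: list[str], k: int = 20, width: int = 200) -> None:
    """Print a random sample of chunks, truncated to `width` characters each."""
    for i, chunk in enumerate(sample_chunks(chunks, k), start=1):
        preview = chunk[:width] + ("..." if len(chunk) > width else "")
        print(f"[{i}] ({len(chunk)} chars)\n{preview}\n")

example_chunks = [f"chunk number {i}" for i in range(50)]
picked = sample_chunks(example_chunks, k=20)
print_sample(example_chunks, k=3)
```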
