Document Chunking Strategies
Master chunking: fixed-size, sentence, paragraph, recursive, and document-aware strategies. Learn how chunk size, overlap, and boundaries drive retrieval quality.
Chunking is the single highest-leverage decision in RAG. A retrieval system with a great embedding model and poor chunking will often underperform one with a mediocre embedding model and great chunking. The reason: an embedding can only encode what its chunk contains. If the chunk is too large, the embedding averages over too many topics and loses specificity. If it is too small, it lacks the context needed to be understood on its own.
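The dilution effect can be illustrated without any model. The sketch below is a toy, not a real embedding: it assumes a bag-of-words vector with a small hand-picked stopword list (`STOPWORDS`, `bow_vector`, and `cosine` are hypothetical helpers), so the absolute scores mean nothing. Only the ordering matters: the same matching word counts for less once unrelated sentences are packed into the chunk.

```python
import math
import re
from collections import Counter

# Hand-picked stopwords for this toy example only (an assumption, not a standard list)
STOPWORDS = {"what", "is", "the", "our", "we", "a", "in", "of", "with", "were"}


def bow_vector(text: str) -> Counter:
    """Toy 'embedding': counts of lowercase words, minus stopwords."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return Counter(t for t in tokens if t not in STOPWORDS)


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


query = bow_vector("What is the refund period?")
focused = bow_vector("Our refund policy allows returns within 30 days of purchase.")
diluted = bow_vector(
    "Our refund policy allows 30-day returns. Shipping takes 3-5 days. "
    "We offer premium support. Products come with a 1-year warranty. "
    "Our headquarters is in Austin, Texas. We were founded in 2019."
)

focused_score = cosine(query, focused)
diluted_score = cosine(query, diluted)
print(f"focused chunk: {focused_score:.2f}")
print(f"diluted chunk: {diluted_score:.2f}")
```

Real embedding models are far more robust than word counts, but the same pressure applies: every extra topic in a chunk competes for the same fixed-size vector.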
Why Chunking Matters
Bad chunk (too large, 1500 tokens):
"Our refund policy allows 30-day returns. Shipping takes 3–5 days.
We offer premium support. Products come with a 1-year warranty.
Our headquarters is in Austin, Texas. We were founded in 2019..."
Query: "What is the refund period?"
Embedding similarity: 0.62 ← noisy, many topics dilute the signal
Good chunk (focused, 80 tokens):
"Our refund policy allows returns within 30 days of purchase,
provided the item is in original condition with all packaging."
Query: "What is the refund period?"
Embedding similarity: 0.91 ← clean signal, focused content

Strategy 1: Fixed-Size Character Chunking
The simplest approach: split every N characters with M characters of overlap.
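def fixed_char_chunk(text: str, size: int = 1000, overlap: int = 100) -> list[str]:
    """Split text into chunks of `size` characters that share `overlap` characters."""
    if not 0 <= overlap < size:
        # overlap >= size would never advance and loop forever
        raise ValueError("overlap must be non-negative and smaller than size")
    chunks = []
    start = 0
    while start < len(text):
        end = start + size
        chunks.append(text[start:end])
        if end >= len(text):
            break  # final chunk; avoids a trailing chunk made only of overlap
        start = end - overlap  # advance by size - overlap
    return chunks

# Example
text = "A" * 3000
chunks = fixed_char_chunk(text, size=1000, overlap=100)
print(f"{len(chunks)} chunks")  # 4 chunks: [0:1000], [900:1900], [1800:2800], [2700:3000]

Pros: simple, predictable size, easy to implement.
Cons: splits mid-sentence and even mid-word; a poor fit for structured documents.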
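To make the mid-word problem concrete, here is a small self-contained sketch. It uses its own copy of a fixed-size chunker (a variant that stops once the end of the text is reached) and deliberately tiny sizes so the cuts are visible; the sample sentence and the sizes are chosen only for illustration.

```python
def fixed_char_chunk(text: str, size: int, overlap: int) -> list[str]:
    """Fixed-size character chunks; stops once a chunk reaches the end of the text."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    chunks, start = [], 0
    while start < len(text):
        end = start + size
        chunks.append(text[start:end])
        if end >= len(text):
            break
        start = end - overlap
    return chunks


text = "Our refund policy allows returns within 30 days of purchase."
chunks = fixed_char_chunk(text, size=20, overlap=5)
for chunk in chunks:
    print(repr(chunk))  # repr() makes the cut points and spaces visible
```

Each boundary falls wherever the character count says it should, so words such as "allows" and "within" are cut in two. The overlap repeats the last five characters of one chunk at the start of the next, which softens the problem but does not remove it.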
Strategy 2: Token-Based Chunking
Use the actual tokenizer to count tokens rather than characters. This is more accurate for LLM context management, because model and embedding limits are expressed in tokens.
import tiktoken
enc = tiktoken.get_encoding("cl100k_base") # GPT-4 encoding
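def token_chunk(text: str, max_tokens: int = 512, overlap_tokens: int = 50) -> list[str]:
    """Split text into windows of at most max_tokens tokens that overlap by overlap_tokens."""
    if not 0 <= overlap_tokens < max_tokens:
        raise ValueError("overlap_tokens must be non-negative and smaller than max_tokens")
    tokens = enc.encode(text)
    chunks = []
    start = 0
    while start < len(tokens):
        end = start + max_tokens
        chunks.append(enc.decode(tokens[start:end]))
        if end >= len(tokens):
            break  # final window; avoids a trailing chunk made only of overlap
        start = end - overlap_tokens
    return chunks

# Verify chunk sizes
with open("large_document.txt", encoding="utf-8") as f:
    text = f.read()
chunks = token_chunk(text, max_tokens=512)
# Re-encoding decoded text can shift counts slightly at chunk boundaries
sizes = [len(enc.encode(c)) for c in chunks]
print(f"Min: {min(sizes)}, Max: {max(sizes)}, Avg: {sum(sizes)/len(sizes):.0f} tokens")

Strategy 3: Sentence Chunking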
Split on sentence boundaries, then group into chunks of N sentences.
import spacy
nlp = spacy.load("en_core_web_sm")
def sentence_chunk(text: str, sentences_per_chunk: int = 5, overlap: int = 1) -> list[str]:
    if not 0 <= overlap < sentences_per_chunk:
        raise ValueError("overlap must be non-negative and smaller than sentences_per_chunk")
    doc = nlp(text)
    sentences = [sent.text.strip() for sent in doc.sents if sent.text.strip()]
    chunks = []
    step = sentences_per_chunk - overlap  # sentences to advance per chunk
    for i in range(0, len(sentences), step):
        chunks.append(" ".join(sentences[i : i + sentences_per_chunk]))
        if i + sentences_per_chunk >= len(sentences):
            break  # remaining sentences are already covered by this chunk
    return chunks

# Alternative: use NLTK for a lighter dependency
import nltk
from nltk.tokenize import sent_tokenize

nltk.download("punkt", quiet=True)
nltk.download("punkt_tab", quiet=True)  # newer NLTK releases load this resource instead

def sentence_chunk_nltk(text: str, sentences_per_chunk: int = 5) -> list[str]:
    sentences = sent_tokenize(text)
    return [
        " ".join(sentences[i : i + sentences_per_chunk])
        for i in range(0, len(sentences), sentences_per_chunk)
    ]

Pros: preserves complete thoughts, no mid-sentence splits.
Cons: uneven chunk sizes; a document that mixes very long and very short sentences produces wildly different chunk sizes.
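The size imbalance is easy to reproduce. The sketch below avoids the spaCy and NLTK dependencies by using a naive regex splitter (`naive_sentences` is a hypothetical stand-in that mishandles abbreviations such as "e.g."), then groups three sentences per chunk and compares the resulting lengths. The sample text is invented for the demonstration.

```python
import re


def naive_sentences(text: str) -> list[str]:
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def group_sentences(sentences: list[str], per_chunk: int) -> list[str]:
    """Join consecutive sentences into chunks of per_chunk sentences (no overlap)."""
    return [
        " ".join(sentences[i : i + per_chunk])
        for i in range(0, len(sentences), per_chunk)
    ]


text = (
    "Yes. No. Maybe. "
    "Refunds are issued to the original payment method within five to ten "
    "business days after the returned item has been inspected. "
    "Items must be unused, in their original packaging, and accompanied by "
    "a receipt or other proof of purchase. "
    "Gift cards are final sale."
)

chunks = group_sentences(naive_sentences(text), per_chunk=3)
lengths = [len(c) for c in chunks]
for chunk, length in zip(chunks, lengths):
    print(f"{length:4d} chars: {chunk[:50]}")
```

Both chunks hold exactly three sentences, yet one is more than ten times longer than the other. If downstream components assume a size budget, a token cap on top of sentence grouping is usually needed.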
Strategy 4: Recursive Character Text Splitting
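This strategy, popularized by LangChain's RecursiveCharacterTextSplitter, tries separators in order of preference and falls back to the next one only when a piece is still too large. The library's default separator list is ["\n\n", "\n", " ", ""]; the example below passes a custom list that also tries sentence and clause boundaries:
1. "\n\n" (paragraph break)
2. "\n" (line break)
3. ". " (sentence end)
4. ", " (clause)
5. " " (word boundary)
6. "" (character boundary, last resort)
from langchain_text_splitters import RecursiveCharacterTextSplitter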
import tiktoken
enc = tiktoken.get_encoding("cl100k_base")
splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,
    chunk_overlap=64,
    length_function=lambda text: len(enc.encode(text)),  # token count, not char count
    separators=["\n\n", "\n", ". ", ", ", " ", ""],
    is_separator_regex=False,
)

with open("document.txt") as f:
    text = f.read()

chunks = splitter.split_text(text)
print(f"Created {len(chunks)} chunks")
for i, c in enumerate(chunks[:3]):
    print(f"\nChunk {i+1} ({len(enc.encode(c))} tokens):\n{c[:100]}...")

This is the best default choice for mixed-format text documents.
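To show the idea behind recursive splitting, here is a minimal dependency-free sketch. It is a simplification written for illustration and not LangChain's implementation: it measures length in characters, has no overlap, and the function name `recursive_split` and its parameters are assumptions. Pieces are merged greedily at the coarsest separator, and only a piece that is still too long is re-split with the finer separators.

```python
def recursive_split(text: str, max_len: int, separators: list[str]) -> list[str]:
    """Split on the first separator; recurse with finer ones for oversized pieces."""
    if len(text) <= max_len:
        return [text] if text.strip() else []
    if not separators or separators[0] == "":
        # Last resort: hard cut at character boundaries
        return [text[i : i + max_len] for i in range(0, len(text), max_len)]
    sep, finer = separators[0], separators[1:]
    if sep not in text:
        return recursive_split(text, max_len, finer)

    parts = text.split(sep)
    # Re-attach the separator so sentence punctuation is not lost at a cut
    pieces = [p + sep for p in parts[:-1]] + parts[-1:]

    chunks, current = [], ""
    for piece in pieces:
        if len(current) + len(piece) <= max_len:
            current += piece  # greedily merge small neighbours
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= max_len:
            current = piece
        else:
            chunks.extend(recursive_split(piece, max_len, finer))
    if current:
        chunks.append(current)
    return [c.strip() for c in chunks if c.strip()]


text = (
    "Refunds are accepted within 30 days.\n\n"
    "Shipping takes 3 to 5 business days. "
    "Express shipping is available at checkout.\n\n"
    "Contact support by email."
)
chunks = recursive_split(text, max_len=50, separators=["\n\n", "\n", ". ", " ", ""])
for chunk in chunks:
    print(f"{len(chunk):3d} chars: {chunk}")
```

In this example the first and last paragraphs fit within the limit and stay whole, while the middle paragraph is too long and falls through to the sentence separator. No chunk is cut mid-word because the finer separators were never needed.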
Strategy 5: Paragraph-Based Chunking
Respect paragraph boundaries: keep each paragraph intact, and split inside one only when it exceeds the token limit on its own.
import re
import tiktoken

def paragraph_chunk(text: str, max_tokens: int = 512) -> list[str]:
    enc = tiktoken.get_encoding("cl100k_base")
    paragraphs = re.split(r"\n\s*\n", text.strip())
    paragraphs = [p.strip() for p in paragraphs if p.strip()]
    chunks = []
    current = []  # paragraphs collected for the chunk being built
    current_tokens = 0
    for para in paragraphs:
        para_tokens = len(enc.encode(para))
        if para_tokens > max_tokens:
            # Single paragraph exceeds the limit: flush what we have, then split it
            if current:
                chunks.append("\n\n".join(current))
                current, current_tokens = [], 0
            # Sub-split the large paragraph by sentence (sentence_chunk_nltk is
            # defined under Strategy 3); these sub-chunks are not re-checked
            # against max_tokens
            sub_chunks = sentence_chunk_nltk(para, sentences_per_chunk=5)
            chunks.extend(sub_chunks)
        elif current_tokens + para_tokens > max_tokens:
            # Adding this paragraph would overflow: close the chunk, start a new one
            chunks.append("\n\n".join(current))
            current = [para]
            current_tokens = para_tokens
        else:
            current.append(para)
            current_tokens += para_tokens
    if current:
        chunks.append("\n\n".join(current))
    return chunks

Strategy 6: Document-Aware Chunking (Markdown Headers)
For structured documents, use headers as natural chunk boundaries.
from langchain_text_splitters import MarkdownHeaderTextSplitter
headers_to_split_on = [
    ("#", "h1"),
    ("##", "h2"),
    ("###", "h3"),
]
md_splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=headers_to_split_on,
    strip_headers=False,
)
md_text = """
# Product Manual
## Installation
Follow these steps to install the product...
### Windows Installation
On Windows, run the installer from...
## Configuration
After installation, configure the settings...
"""
chunks = md_splitter.split_text(md_text)
for chunk in chunks:
    print(f"Metadata: {chunk.metadata}")
    print(f"Content: {chunk.page_content[:80]}\n")

# Then apply a token-size limit on top
# (enc is the tiktoken encoding created in the earlier examples)
from langchain_text_splitters import RecursiveCharacterTextSplitter

token_splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,
    chunk_overlap=64,
    length_function=lambda t: len(enc.encode(t)),
)
final_chunks = token_splitter.split_documents(chunks)

Strategy 7: Code-Aware Chunking
For source code and code-heavy technical documentation, avoid splitting in the middle of a function or code block. Language-aware separators try class and function boundaries first and fall back to finer separators only when a unit is too large.
from langchain_text_splitters import Language, RecursiveCharacterTextSplitter

# Python-aware splitter (chunk_size is measured in characters here,
# since no length_function is passed)
python_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.PYTHON,
    chunk_size=512,
    chunk_overlap=64,
)

code = '''
def process_order(order_id: str) -> dict:
    """Process a customer order."""
    order = db.get_order(order_id)
    if not order:
        raise ValueError(f"Order {order_id} not found")
    result = payment_service.charge(order)
    return {"status": "processed", "order_id": order_id}

def cancel_order(order_id: str) -> bool:
    """Cancel an existing order."""
    order = db.get_order(order_id)
    if order.status == "shipped":
        return False
    db.update_order(order_id, status="cancelled")
    return True
'''

chunks = python_splitter.split_text(code)
for c in chunks:
    print(c)
    print("---")

Small-to-Big Retrieval
A powerful pattern: embed small chunks for precision, but retrieve larger parent chunks for context.
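Stripped of any framework, the pattern is a mapping from each small chunk to the larger chunk it came from. The sketch below is an illustration under simplifying assumptions: parents are paragraphs, children are sentences found with a naive regex, and word overlap stands in for vector search. The names `build_index` and `retrieve_parent` are hypothetical.

```python
import re


def words(text: str) -> set[str]:
    """Lowercase word set, used by the toy scoring function."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def build_index(document: str) -> tuple[list[str], list[tuple[str, int]]]:
    """Parents are paragraphs; children are (sentence, parent_id) pairs."""
    parents = [p.strip() for p in document.split("\n\n") if p.strip()]
    children = []
    for parent_id, parent in enumerate(parents):
        for sentence in re.split(r"(?<=[.!?])\s+", parent):
            if sentence:
                children.append((sentence, parent_id))
    return parents, children


def retrieve_parent(query: str, parents: list[str], children: list[tuple[str, int]]) -> str:
    """Match against the small chunks, but return the large one."""
    query_words = words(query)
    # Toy relevance score: number of shared words (stand-in for embedding similarity)
    _, best_parent_id = max(children, key=lambda c: len(query_words & words(c[0])))
    return parents[best_parent_id]


document = (
    "Returns are accepted within 30 days of purchase. "
    "Items must be unused and in original packaging. "
    "Refunds go back to the original payment method.\n\n"
    "Standard shipping takes 3 to 5 business days. "
    "Express shipping is available at checkout."
)
parents, children = build_index(document)
result = retrieve_parent("How many days do I have for returns?", parents, children)
print(result)
```

The query matches only the first sentence, but the returned parent also carries the packaging condition and the refund method, which a generator would need for a complete answer. The LangChain example that follows applies the same idea with a vector store for the children and a document store for the parents.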
from langchain.retrievers import ParentDocumentRetriever
from langchain_community.document_loaders import TextLoader
from langchain_community.vectorstores import Qdrant
from langchain_core.stores import InMemoryStore
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams

# Parent splitter: large chunks (for context); sizes here are in characters
parent_splitter = RecursiveCharacterTextSplitter(chunk_size=2000, chunk_overlap=0)
# Child splitter: small chunks (for precise retrieval)
child_splitter = RecursiveCharacterTextSplitter(chunk_size=400, chunk_overlap=40)

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

# Create the empty collection explicitly: building the store from an empty
# document list leaves no text from which to infer the vector size.
# text-embedding-3-small returns 1536-dimensional vectors by default.
client = QdrantClient(location=":memory:")
client.create_collection(
    collection_name="children",
    vectors_config=VectorParams(size=1536, distance=Distance.COSINE),
)
vectorstore = Qdrant(client=client, collection_name="children", embeddings=embeddings)
docstore = InMemoryStore()

retriever = ParentDocumentRetriever(
    vectorstore=vectorstore,
    docstore=docstore,
    child_splitter=child_splitter,
    parent_splitter=parent_splitter,
)

# Add documents: stores parents in docstore, children in vectorstore
docs = TextLoader("document.txt").load()
retriever.add_documents(docs)

# At query time: searches child chunks, returns parent chunks
results = retriever.invoke("What is the refund policy?")
print(f"Retrieved {len(results)} parent documents")

Chunk Size vs Overlap Tradeoffs
| Chunk Size (tokens) | Retrieval Precision | Context Quality | Best For |
|---|---|---|---|
| 64–128 | Very high | Low (no context) | Fact lookup, keyword-heavy |
| 256–512 | High | Medium | General Q&A (recommended default) |
| 512–1024 | Medium | High | Summarization, reasoning |
| 1024–2048 | Low | Very high | Whole-section analysis |
Overlap guidelines:
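- 10–15% of chunk size is a common starting point (for example, 64 tokens of overlap on 512-token chunks)
- Higher overlap means more redundancy and a larger index, but fewer answers lost at chunk boundaries
- Zero overlap is generally appropriate only with small-to-big retrieval, where the parent chunk supplies the surrounding context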
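To put rough numbers on the redundancy point above, the sketch below counts how many sliding-window chunks are needed to cover a corpus at different overlaps. The helper `chunk_count` and the 100,000-token corpus are assumptions for illustration; the arithmetic is simply that each chunk after the first advances by `chunk_size - overlap` tokens.

```python
import math


def chunk_count(total_tokens: int, chunk_size: int, overlap: int) -> int:
    """Number of sliding-window chunks needed to cover total_tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    if total_tokens <= 0:
        return 0
    if total_tokens <= chunk_size:
        return 1
    stride = chunk_size - overlap  # new tokens contributed by each later chunk
    return 1 + math.ceil((total_tokens - chunk_size) / stride)


corpus_tokens = 100_000  # hypothetical corpus size
baseline = chunk_count(corpus_tokens, chunk_size=512, overlap=0)
for overlap in (0, 64, 128, 256):
    n = chunk_count(corpus_tokens, chunk_size=512, overlap=overlap)
    print(f"overlap={overlap:3d}: {n:4d} chunks ({n / baseline:.2f}x the zero-overlap index)")
```

For large corpora the growth factor approaches `chunk_size / (chunk_size - overlap)`, so a 64-token overlap on 512-token chunks costs about 14% more vectors, while a 50% overlap doubles the index.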
Benchmarking Your Chunking Strategy
from typing import Callable

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # used only to report average chunk size

def benchmark_chunking(
    documents: list[str],
    qa_pairs: list[dict],
    chunking_fns: dict[str, Callable],
    embed_fn,
    store_fn,
    retrieve_fn,
) -> dict:
    """Compare chunking strategies by keyword hit rate at rank 1 and within the top 4.

    store_fn and retrieve_fn are supplied by the caller; retrieve_fn must
    return a list of dicts with a "text" key.
    """
    if not qa_pairs:
        raise ValueError("qa_pairs must not be empty")
    results = {}
    for name, chunk_fn in chunking_fns.items():
        print(f"\nBenchmarking: {name}")
        # Chunk all documents
        all_chunks = []
        for doc in documents:
            all_chunks.extend(chunk_fn(doc))
        # Build index
        store = store_fn(all_chunks, embed_fn)
        # Evaluate retrieval
        hit_at_1 = 0
        hit_at_4 = 0
        for qa in qa_pairs:
            retrieved = retrieve_fn(store, qa["question"], top_k=4)
            expected_keyword = qa["expected_keyword"]
            if any(expected_keyword in r["text"] for r in retrieved[:1]):
                hit_at_1 += 1
            if any(expected_keyword in r["text"] for r in retrieved):
                hit_at_4 += 1
        n = len(qa_pairs)
        results[name] = {
            "hit@1": hit_at_1 / n,
            "hit@4": hit_at_4 / n,
            "num_chunks": len(all_chunks),
            "avg_chunk_tokens": sum(len(enc.encode(c)) for c in all_chunks) / len(all_chunks),
        }
        print(f"  hit@1={results[name]['hit@1']:.2%}, hit@4={results[name]['hit@4']:.2%}")
    return results

# Strategies to compare
strategies = {
    # 2000 characters is roughly 512 tokens, assuming about 4 characters per token
    "fixed_512": lambda doc: fixed_char_chunk(doc, size=2000, overlap=200),
    "token_512": lambda doc: token_chunk(doc, max_tokens=512, overlap_tokens=64),
    "sentence_5": lambda doc: sentence_chunk_nltk(doc, sentences_per_chunk=5),
    "paragraph": lambda doc: paragraph_chunk(doc, max_tokens=512),
}

Practical Recommendations
Start with RecursiveCharacterTextSplitter at 512 tokens with 64-token overlap. This is the best default.
Use MarkdownHeaderTextSplitter when your documents are Markdown or have clear hierarchical structure.
Use small-to-big retrieval when questions require understanding context beyond a single passage.
Avoid character-based chunking in production — token-aware length functions are almost always worth the extra dependency.
Always inspect chunks visually before building your index. Print 20 random chunks and ask: does this chunk make sense on its own? Would this chunk answer questions about its topic?
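A small helper makes the inspection step routine. The sketch below assumes chunks are plain strings; `sample_chunks` and `show_chunks` are hypothetical names, and the placeholder list stands in for the output of whichever chunker is being evaluated.

```python
import random
from typing import Optional


def sample_chunks(chunks: list[str], k: int = 20, seed: Optional[int] = None) -> list[str]:
    """Return up to k distinct chunks chosen at random (a seed makes it repeatable)."""
    rng = random.Random(seed)
    return rng.sample(chunks, min(k, len(chunks)))


def show_chunks(chunks: list[str], preview_chars: int = 300) -> None:
    """Print each chunk with its length so it can be judged on its own."""
    for i, chunk in enumerate(chunks, start=1):
        print(f"--- sample {i} ({len(chunk)} chars) ---")
        print(chunk[:preview_chars])


# Placeholder data; replace with the output of a real chunking function
all_chunks = [f"Example chunk number {i}." for i in range(50)]
picked = sample_chunks(all_chunks, k=20, seed=0)
show_chunks(picked)
```

Fixing the seed while comparing strategies keeps the sample positions stable, which makes it easier to see how the same region of a document is chunked under each approach.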