Text Splitters: Chunking Documents for RAG
Chunk documents effectively for retrieval. Compare recursive, semantic, token-based, and code splitters. Tune chunk size and overlap for your use case.
Why Chunking Matters
You cannot embed a 200-page PDF as one unit: a single vector for the whole document loses specificity. Chunks that are too small have the opposite problem and lose context. Chunking is the step between loading and embedding: split documents into pieces that are small enough to embed meaningfully and large enough to answer a question.
Document (200 pages)
↓ chunk
[chunk_1] [chunk_2] [chunk_3] ... [chunk_N]
↓ embed
[vec_1] [vec_2] [vec_3] ... [vec_N]
↓ store in vector DB
↓ retrieve top-k at query time
Two parameters control the size-based splitters:
- chunk_size: target size in characters (or tokens)
- chunk_overlap: how many characters (or tokens) the next chunk shares with the previous one, which reduces the chance that a thought is cut off at a chunk boundary
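To see how the two parameters interact, here is a minimal sketch of a fixed-size sliding window in plain Python. It is not how any LangChain splitter is implemented; the function name `sliding_window_chunks` is hypothetical and the sketch ignores sentence and paragraph boundaries entirely.

```python
def sliding_window_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into fixed-size character windows.

    Each window starts (chunk_size - chunk_overlap) characters after the
    previous one, so consecutive windows share chunk_overlap characters.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
        start += step
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
windows = sliding_window_chunks(sample, chunk_size=10, chunk_overlap=3)
for w in windows:
    print(w)
```

With a larger overlap the step shrinks, so the same text yields more chunks and more embedding work. That is the cost side of raising chunk_overlap.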
RecursiveCharacterTextSplitter (Default Choice)
Splits on "\n\n" → "\n" → " " → "" in order, moving to the next separator only for pieces that are still larger than chunk_size. Preserves paragraph structure where possible.
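The fallback order is easier to follow in code. The following is a simplified sketch of the idea, assuming no overlap and a plain character count. It is not LangChain's implementation, and `recursive_split` is a hypothetical name.

```python
def recursive_split(
    text: str,
    chunk_size: int,
    separators: tuple[str, ...] = ("\n\n", "\n", " ", ""),
) -> list[str]:
    """Split on the coarsest separator first; use finer ones only for pieces that are still too long."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators:
        return [text]  # nothing finer left to try
    sep, finer = separators[0], separators[1:]
    if sep == "":
        # Last resort: hard cut every chunk_size characters
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    chunks: list[str] = []
    current = ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate  # keep merging small neighbouring pieces
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # This piece alone is too long: retry it with the finer separators
            chunks.extend(recursive_split(piece, chunk_size, finer))
    if current:
        chunks.append(current)
    return chunks


doc = "Para one is short.\n\nPara two is a bit longer and will not fit in a single chunk.\n\nEnd."
result = recursive_split(doc, chunk_size=40)
for c in result:
    print(repr(c))
```

The short paragraphs survive intact, and only the long one is broken at word boundaries. That is the behaviour the separator order is designed to produce.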
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_core.documents import Document
text = """
Warfarin mechanism of action
Warfarin inhibits vitamin K epoxide reductase (VKORC1), the enzyme responsible for
regenerating active vitamin K. Without active vitamin K, the liver cannot synthesize
clotting factors II, VII, IX, and X.
Clinical implications
The anticoagulant effect of warfarin is delayed 2-3 days because existing clotting
factors must be cleared before INR rises. This is why heparin bridging is used
when rapid anticoagulation is needed.
"""
splitter = RecursiveCharacterTextSplitter(
    chunk_size=300,        # Target: 300 characters per chunk
    chunk_overlap=50,      # 50-char overlap between consecutive chunks
    length_function=len,   # Count characters (default)
    add_start_index=True,  # Track where chunk starts in original text
)
chunks = splitter.create_documents([text])
print(f"Chunks: {len(chunks)}")
for i, chunk in enumerate(chunks):
    print(f"\n[Chunk {i}] start={chunk.metadata.get('start_index')}")
    print(chunk.page_content)
# Split existing Documents (preserves their metadata)
from langchain_community.document_loaders import PyPDFLoader
loader = PyPDFLoader("warfarin_guidelines.pdf")
docs = loader.load()
splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    add_start_index=True,  # Needed for 'start_index' to appear in the metadata
)
chunks = splitter.split_documents(docs)
print(f"Pages: {len(docs)} → Chunks: {len(chunks)}")
# Metadata is preserved: each chunk keeps the source and page number from the original doc
print(chunks[0].metadata)
# e.g. {'source': 'warfarin_guidelines.pdf', 'page': 0, 'start_index': 0}
Token-Based Splitting
Use when your embedding model or LLM has a token limit (not a character limit):
from langchain_text_splitters import TokenTextSplitter
# Split by token count — matches how LLMs actually measure length
token_splitter = TokenTextSplitter(
    encoding_name="cl100k_base",  # OpenAI's tokenizer (GPT-4, text-embedding-3)
    chunk_size=256,               # 256 tokens per chunk
    chunk_overlap=32,             # 32-token overlap
)
chunks = token_splitter.split_documents(docs)
# Verify actual token counts
import tiktoken
enc = tiktoken.get_encoding("cl100k_base")
token_counts = [len(enc.encode(c.page_content)) for c in chunks]
print(f"Token range: {min(token_counts)}-{max(token_counts)}")
# Should be: 1-256 tokens per chunk
# Use this when: embedding with text-embedding-3-small/large (8191 token limit)
# Use character splitter when: embedding model limit is in chars, or you're unsure
Semantic Splitter
Splits based on meaning change rather than fixed size. Compares consecutive sentence embeddings — when similarity drops, that's a chunk boundary.
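The boundary rule can be sketched without any embedding API. The example below uses hand-made 2-D vectors as stand-ins for real sentence embeddings and a fixed distance threshold of 0.5. Both are assumptions for illustration, and the helper names are hypothetical. SemanticChunker derives its threshold from the distribution of distances rather than using a fixed value.

```python
import math


def cosine_distance(a: tuple[float, ...], b: tuple[float, ...]) -> float:
    """1 - cosine similarity; larger means the two sentences are less alike."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def find_breakpoints(vectors: list[tuple[float, ...]], threshold: float) -> list[int]:
    """Indices i where a new chunk starts, i.e. sentence i differs sharply from sentence i-1."""
    return [
        i for i in range(1, len(vectors))
        if cosine_distance(vectors[i - 1], vectors[i]) > threshold
    ]


def group_sentences(sentences: list[str], breakpoints: list[int]) -> list[str]:
    """Join consecutive sentences into chunks, cutting at each breakpoint."""
    grouped, start = [], 0
    for bp in breakpoints + [len(sentences)]:
        grouped.append(" ".join(sentences[start:bp]))
        start = bp
    return grouped


sentences = [
    "Warfarin inhibits VKORC1.",
    "This blocks vitamin K recycling.",
    "Start at 5mg daily.",
    "Adjust based on INR.",
]
# Toy vectors (assumption): the first two point one way, the last two another
toy_vectors = [(1.0, 0.0), (0.9, 0.1), (0.1, 0.9), (0.0, 1.0)]

breakpoints = find_breakpoints(toy_vectors, threshold=0.5)
semantic_chunks = group_sentences(sentences, breakpoints)
for c in semantic_chunks:
    print(c)
```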
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
semantic_splitter = SemanticChunker(
    embeddings,
    breakpoint_threshold_type="percentile",  # Threshold is a percentile of the sentence-to-sentence distances
    breakpoint_threshold_amount=95,          # Split only at the largest ~5% of distance jumps (fewer, larger chunks)
)
# Alternative: "standard_deviation" splits where the distance exceeds mean + N std devs
semantic_splitter_std = SemanticChunker(
    embeddings,
    breakpoint_threshold_type="standard_deviation",
    breakpoint_threshold_amount=1.25,  # N = 1.25
)
chunks = semantic_splitter.split_documents(docs)
# Chunks vary in size — they end at topic boundaries
for chunk in chunks[:3]:
    print(f"[{len(chunk.page_content)} chars] {chunk.page_content[:80]}...")
When to use semantic splitting:
- Documents with distinct topic sections (clinical guidelines, research papers)
- You can afford extra embedding API calls at ingestion time
- Chunk quality matters more than cost
When to skip it:
- High ingestion volume (each split requires embedding API calls)
- Documents are already uniform in structure (one topic per section)
Code Splitter
Splits code at language-aware boundaries (class and function definitions) before falling back to generic separators, rather than cutting at arbitrary character positions:
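As a rough illustration of what a language-aware boundary means, the sketch below cuts Python source before every top-level class or def line using a regular expression. This is a standalone approximation with a hypothetical function name, not the library's method. As far as I know, from_language works by passing language-specific separators such as "\nclass " and "\ndef " to the recursive splitter.

```python
import re


def split_at_top_level_defs(source: str) -> list[str]:
    """Split Python source before each line that starts at column 0 with 'class ' or 'def '."""
    # Zero-width match at the start of such lines; indented methods are not matched
    parts = re.split(r"^(?=(?:class|def)\s)", source, flags=re.MULTILINE)
    return [p.strip("\n") for p in parts if p.strip()]


source = '''import json

class Checker:
    def check(self, a, b):
        return (a, b)

def dose(x):
    return x * 2
'''

blocks = split_at_top_level_defs(source)
for block in blocks:
    print(f"---\n{block}")
```

Methods stay attached to their class because only unindented definitions count as boundaries. A size limit would still be needed on top of this for very long classes.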
from langchain_text_splitters import Language, RecursiveCharacterTextSplitter
# Python: splits at class/function boundaries
python_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.PYTHON,
    chunk_size=1000,
    chunk_overlap=100,
)
code = """
class DrugInteractionChecker:
    def __init__(self, db_path: str):
        self.db_path = db_path
        self.interactions = self._load_db()

    def _load_db(self) -> dict:
        with open(self.db_path) as f:
            return json.load(f)

    def check(self, drug_a: str, drug_b: str) -> str:
        key = tuple(sorted([drug_a.lower(), drug_b.lower()]))
        return self.interactions.get(key, "No interaction found")

def calculate_inr_adjusted_dose(current_inr: float, target_inr: float, current_dose: float) -> float:
    ratio = target_inr / current_inr
    return round(current_dose * ratio, 1)
"""
python_chunks = python_splitter.create_documents([code])
for chunk in python_chunks:
    print(f"---\n{chunk.page_content}\n")
# Other supported languages include:
# Language.MARKDOWN, Language.JS, Language.TS, Language.GO,
# Language.JAVA, Language.RUST, Language.CPP, Language.HTML
Markdown Splitter
Splits at heading boundaries — preserves document structure:
from langchain_text_splitters import MarkdownHeaderTextSplitter
markdown_splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=[
        ("#", "h1"),
        ("##", "h2"),
        ("###", "h3"),
    ],
    strip_headers=False,  # Keep headers in the chunk text
)
markdown_text = """
# Warfarin Clinical Guide
## Mechanism of Action
Warfarin inhibits VKORC1, blocking vitamin K recycling.
## Dosing
Start at 5mg daily. Adjust based on INR.
### Renal Adjustment
No dose adjustment required for renal impairment.
## Drug Interactions
### Major Interactions
Aspirin: additive bleeding risk. Avoid if possible.
"""
chunks = markdown_splitter.split_text(markdown_text)
for chunk in chunks:
    print(f"Metadata: {chunk.metadata}")
    print(f"Content: {chunk.page_content[:100]}\n")
# First chunk:
# Metadata: {'h1': 'Warfarin Clinical Guide', 'h2': 'Mechanism of Action'}
# Content: the heading lines (kept because strip_headers=False), then "Warfarin inhibits VKORC1..."
The heading hierarchy becomes metadata, so you can filter retrieval by section.
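Filtering by section could look something like the sketch below. It uses a small dataclass as a hypothetical stand-in for a LangChain Document so it runs without any dependencies, and the `filter_by_section` helper is an illustration only. In a real pipeline the vector store's own metadata filter would usually do this at query time.

```python
from dataclasses import dataclass, field


@dataclass
class Chunk:
    """Hypothetical stand-in for a Document: text plus heading metadata."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def filter_by_section(chunks: list[Chunk], **wanted: str) -> list[Chunk]:
    """Keep chunks whose metadata matches every key=value pair in `wanted`."""
    return [
        c for c in chunks
        if all(c.metadata.get(key) == value for key, value in wanted.items())
    ]


guide = "Warfarin Clinical Guide"
section_chunks = [
    Chunk("Warfarin inhibits VKORC1, blocking vitamin K recycling.",
          {"h1": guide, "h2": "Mechanism of Action"}),
    Chunk("Start at 5mg daily. Adjust based on INR.",
          {"h1": guide, "h2": "Dosing"}),
    Chunk("No dose adjustment required for renal impairment.",
          {"h1": guide, "h2": "Dosing", "h3": "Renal Adjustment"}),
    Chunk("Aspirin: additive bleeding risk. Avoid if possible.",
          {"h1": guide, "h2": "Drug Interactions", "h3": "Major Interactions"}),
]

# Restrict candidates to the Dosing section before (or instead of) similarity search
dosing = filter_by_section(section_chunks, h2="Dosing")
for c in dosing:
    print(c.metadata.get("h3"), "|", c.page_content)
```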
Choosing Chunk Size
# Rule of thumb by use case:
CHUNK_CONFIGS = {
    # Question answering (short, precise answers)
    "qa": {"chunk_size": 500, "chunk_overlap": 100},
    # Summarization (needs more context per chunk)
    "summarization": {"chunk_size": 2000, "chunk_overlap": 200},
    # Code search (function-sized chunks)
    "code": {"chunk_size": 1000, "chunk_overlap": 100},
    # Clinical guidelines (paragraph-sized, preserve context)
    "clinical": {"chunk_size": 800, "chunk_overlap": 150},
}
def evaluate_chunk_quality(chunks: list[Document], sample_size: int = 10) -> dict:
    """Quick sanity check on chunk quality."""
    import random
    if not chunks:
        return {"total_chunks": 0, "warning": "No chunks produced"}
    sample = random.sample(chunks, min(sample_size, len(chunks)))
    sizes = [len(c.page_content) for c in chunks]
    tiny = [c for c in chunks if len(c.page_content) < 100]
    # Rough heuristic: sampled chunks ending in "..." may have been cut off
    truncated = [c for c in sample if c.page_content.endswith("...")]
    return {
        "total_chunks": len(chunks),
        "avg_chars": round(sum(sizes) / len(sizes)),
        "min_chars": min(sizes),
        "max_chars": max(sizes),
        "tiny_chunks_pct": round(len(tiny) / len(chunks) * 100, 1),
        "truncated_in_sample": len(truncated),
        "warning": (
            "Many tiny chunks: increase chunk_size or filter by a minimum length"
            if len(tiny) / len(chunks) > 0.1
            else "OK"
        ),
    }
# After splitting, always inspect:
splitter = RecursiveCharacterTextSplitter(chunk_size=800, chunk_overlap=150)
chunks = splitter.split_documents(docs)
print(evaluate_chunk_quality(chunks))
Complete Ingestion Pipeline
from langchain_community.document_loaders import PyPDFLoader
from langchain_core.documents import Document
from langchain_text_splitters import RecursiveCharacterTextSplitter
def build_chunks_from_pdfs(
    pdf_paths: list[str],
    chunk_size: int = 800,
    chunk_overlap: int = 150,
    min_chunk_chars: int = 100,
) -> list[Document]:
    """Load PDFs, split, filter, and return clean chunks."""
    # Load
    all_docs = []
    for path in pdf_paths:
        try:
            loader = PyPDFLoader(path)
            all_docs.extend(loader.load())
        except Exception as e:
            print(f"Skipping {path}: {e}")
    print(f"Loaded {len(all_docs)} pages from {len(pdf_paths)} PDFs")
    # Split
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=chunk_overlap,
        add_start_index=True,
    )
    chunks = splitter.split_documents(all_docs)
    # Filter low-quality chunks
    good_chunks = [c for c in chunks if len(c.page_content.strip()) >= min_chunk_chars]
    # Add chunk index for debugging
    for i, chunk in enumerate(good_chunks):
        chunk.metadata["chunk_id"] = i
    print(f"Chunks: {len(chunks)} → {len(good_chunks)} after filtering")
    print(evaluate_chunk_quality(good_chunks))
    return good_chunks
chunks = build_chunks_from_pdfs(
    ["warfarin_guidelines.pdf", "metformin_monograph.pdf"],
    chunk_size=800,
    chunk_overlap=150,
)
Splitter Comparison
| Splitter | Split Logic | Best For | Cost |
|---|---|---|---|
| RecursiveCharacterTextSplitter | Paragraph → sentence → word | General text, PDFs | Free |
| TokenTextSplitter | Token boundaries | LLM-aware chunking | Free (local tokenizer) |
| SemanticChunker | Embedding similarity | Topic-structured docs | API calls per ingestion |
| RecursiveCharacterTextSplitter.from_language(PYTHON) | Function/class syntax | Source code | Free |
| MarkdownHeaderTextSplitter | Heading hierarchy | Docs with headers | Free |