Recursive Chunking
How recursive character text splitting respects document structure by cascading through separators, and when it outperforms fixed-size chunking.
The Problem with Fixed-Size Splitting
Fixed-size chunking splits text at arbitrary character or token positions, potentially cutting:
- In the middle of a sentence
- Between a question and its answer
- Across a markdown section heading and its content
Recursive chunking tries natural boundaries first and falls back to arbitrary cut points only when nothing coarser fits.
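To make the failure mode concrete, here is a minimal sketch of naive fixed-size splitting. The helper name `fixed_size_split` and the sample text are illustrative assumptions, not part of any library.

```python
def fixed_size_split(text: str, chunk_size: int) -> list[str]:
    """Slice text every chunk_size characters, ignoring all structure."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


# Made-up sample: a question/answer pair followed by a markdown heading
sample = (
    "Q: What is the starting dose?\n"
    "A: Start low and adjust after monitoring.\n\n"
    "## Follow-up\n"
    "Review the patient within one week."
)

chunks = fixed_size_split(sample, chunk_size=40)
for n, chunk in enumerate(chunks):
    print(n, repr(chunk))
```

With this sample the first chunk ends partway through the answer, and the second chunk ends in the middle of the heading, so neither the question/answer pair nor the heading and its content stay together.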
How Recursive Chunking Works
Try separators in order of preference, falling back to the next one whenever a chunk is still too large.
Separators (in priority order):
1. "\n\n": paragraph breaks
2. "\n": line breaks
3. ". ": sentence boundaries
4. " ": word boundaries
5. "": character (last resort)
Algorithm:
1. Split on "\n\n" (paragraphs)
2. If a paragraph > chunk_size: split that paragraph on "\n"
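3. If a line is still > chunk_size: split it on ". "
4. Continue recursing through the remaining separators until all chunks ≤ chunk_size
Pieces smaller than chunk_size are merged back together up to the limit, so chunks are as semantically complete as the size limit allows.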
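The fallback order can be illustrated in isolation before looking at a full splitter. This is a small sketch only; `first_usable_separator` is a hypothetical helper name, and the sample strings are made up.

```python
SEPARATORS = ["\n\n", "\n", ". ", " ", ""]


def first_usable_separator(text: str, separators: list[str]) -> str:
    """Return the highest-priority separator that occurs in text.

    The empty string always matches and means "split into characters".
    """
    for sep in separators:
        if sep == "" or sep in text:
            return sep
    return ""


examples = {
    "two paragraphs": "Intro text.\n\nDetails follow.",
    "single paragraph": "First sentence. Second sentence.",
    "single word": "anticoagulation",
}
for label, text in examples.items():
    print(label, "->", repr(first_usable_separator(text, SEPARATORS)))
```

Text with a blank line is split on paragraphs, a single paragraph falls through to sentence boundaries, and a single long word leaves only the character-level fallback.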
Implementation
from typing import Optional


def recursive_split(
    text: str,
    chunk_size: int = 512,
    chunk_overlap: int = 50,
    separators: Optional[list[str]] = None,
) -> list[str]:
    if separators is None:
        separators = ["\n\n", "\n", ". ", " ", ""]

    # Find the first separator that splits the text meaningfully;
    # "" always matches and means "split into single characters"
    separator = ""
    new_separators = []
    for i, sep in enumerate(separators):
        if sep == "" or sep in text:
            separator = sep
            new_separators = separators[i + 1:]
            break

    splits = text.split(separator) if separator else list(text)

    # Merge small splits back together up to chunk_size, with overlap
    chunks = []
    current = []
    current_len = 0  # length of the pieces in current, one separator counted per piece
    for split in splits:
        split_len = len(split)
        if current_len + split_len + len(separator) > chunk_size:
            if current:
                chunks.append(separator.join(current))
                # Keep last overlap portion: drop pieces from the front
                while current and current_len > chunk_overlap:
                    removed = current.pop(0)
                    current_len -= len(removed) + len(separator)
        current.append(split)
        current_len += split_len + len(separator)
    if current:
        chunks.append(separator.join(current))

    # Recursively split any chunk that's still too large
    final_chunks = []
    for chunk in chunks:
        if len(chunk) > chunk_size and new_separators:
            final_chunks.extend(
                recursive_split(chunk, chunk_size, chunk_overlap, new_separators)
            )
        elif chunk.strip():
            final_chunks.append(chunk.strip())
    return final_chunks

LangChain RecursiveCharacterTextSplitter
# Newer LangChain releases ship this class in the separate langchain-text-splitters
# package: from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,
    chunk_overlap=50,
    length_function=len,  # char-based; use token counter for token-based
    separators=["\n\n", "\n", ". ", " ", ""],
)
# clinical_document is the raw document text (a str) loaded elsewhere
chunks = splitter.split_text(clinical_document)

# Token-based (more accurate for embedding models)
from transformers import AutoTokenizer

# The Hugging Face Hub id needs the organisation prefix
tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")


def token_len(text: str) -> int:
    return len(tokenizer.encode(text, add_special_tokens=False))


splitter_token = RecursiveCharacterTextSplitter(
    chunk_size=256,  # now measured in tokens, not characters
    chunk_overlap=32,
    length_function=token_len,
    separators=["\n\n", "\n", ". ", " ", ""],
)

Domain-Specific Separators
# Markdown documents (clinical guidelines often in markdown)
markdown_splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,
    chunk_overlap=50,
    separators=[
        "\n## ",    # H2 headings: strongest boundary
        "\n### ",   # H3 headings
        "\n#### ",  # H4 headings
        "\n\n",     # paragraph breaks
        "\n",       # line breaks
        ". ",       # sentences
        " ",        # words
        "",
    ],
)

# Python code
code_splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,
    chunk_overlap=50,
    separators=[
        "\nclass ",
        "\ndef ",
        "\n\n",
        "\n",
        " ",
        "",
    ],
)

Comparison: Fixed vs Recursive
Document: Clinical guideline with sections and bullet points
Fixed chunking (512 chars):
Chunk 3: "...reduce clotting. ## Dosing\n\nInitial dose: 5-10mg"
(splits mid-section, mixes conclusion of one section with start of next)
Recursive chunking (512 chars, markdown heading separators):
Chunk 3: "## Dosing\n\nInitial dose: 5-10mg. Adjust based on INR..."
(respects section boundary, chunk starts at a meaningful heading)
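The contrast can be reproduced on a toy document. This is a sketch under stated assumptions: the guideline text is a made-up stand-in, the chunk size is 60 characters rather than 512 because the sample is short, and `heading_split` applies only the first level of the cascade (splitting before "\n## "), not the full recursive splitter.

```python
def fixed_split(text: str, size: int) -> list[str]:
    """Slice every `size` characters, ignoring structure."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def heading_split(text: str, heading: str = "\n## ") -> list[str]:
    """Split before each H2 heading, keeping the heading with its section."""
    parts = text.split(heading)
    # Re-attach the heading marker (without its leading newline) to later parts
    return [parts[0]] + [heading.lstrip("\n") + p for p in parts[1:]]


# Made-up two-section guideline used only for illustration
guideline = (
    "## Mechanism\n\n"
    "The drug slows clot formation to reduce clotting.\n"
    "## Dosing\n\n"
    "Initial dose: 5-10mg. Adjust based on INR."
)

needle = "Initial dose"
fixed_hit = next((c for c in fixed_split(guideline, 60) if needle in c), "")
heading_hit = next((c for c in heading_split(guideline) if needle in c), "")

print("fixed:  ", repr(fixed_hit))
print("heading:", repr(heading_hit))
```

Both chunks contain the dosing sentence, but the fixed-size chunk begins with the tail of the previous section, while the heading-aware chunk begins at "## Dosing".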
Retrieval impact: "What is the warfarin initial dose?"
  Fixed: may retrieve a chunk whose context is split across sections, confusing the answer
  Recursive: retrieves the Dosing section cleanly

Interview Answer
"Recursive chunking tries separators in priority order (paragraph breaks, then line breaks, then sentence boundaries, then words), cascading only when a chunk still exceeds the size limit. This preserves semantic coherence: chunks start at natural boundaries rather than mid-sentence. Compared to fixed-size chunking, recursive chunking improves retrieval quality for structured documents (guidelines with sections, markdown, policy documents). LangChain's RecursiveCharacterTextSplitter is a widely used implementation. For domain documents, customise the separators list to match the document structure: placing markdown headers before paragraph breaks gives the best section-level splits."