Recursive Chunking

The Problem with Fixed-Size Splitting

Fixed-size chunking splits text at arbitrary token positions, so a cut can land:

In the middle of a sentence
Between a question and its answer
Across a markdown section heading and its content

Recursive chunking tries natural boundaries first and falls back to finer-grained cuts only when a chunk is still too large.
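To make the failure mode concrete, here is a minimal sketch of fixed-size splitting by characters. The function name `fixed_size_split` and the sample text are illustrative, not taken from any library.

```python
def fixed_size_split(text: str, chunk_size: int) -> list[str]:
    """Cut text every chunk_size characters, ignoring any structure."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


sample = "Q: What is the initial dose? A: Start low. Adjust based on INR."
chunks = fixed_size_split(sample, chunk_size=25)

for chunk in chunks:
    # The first chunk stops in the middle of the word "dose",
    # and the question is separated from its answer.
    print(repr(chunk))
```

Nothing is lost (joining the chunks restores the text), but no chunk on its own contains both the complete question and its answer.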

How Recursive Chunking Works

Try separators in order of preference, falling back to the next one whenever a chunk is still too large:

Separators (in priority order):
  1. "\n\n"   — paragraph breaks
  2. "\n"     — line breaks
  3. ". "     — sentence boundaries
  4. " "      — word boundaries
  5. ""       — character (last resort)

Algorithm:
  1. Split on "\n\n" (paragraphs)
  2. If a paragraph > chunk_size: split that paragraph on "\n"
  3. If a line > chunk_size: split that line on ". "
  4. Continue down the list (" ", then "") until every piece ≤ chunk_size
  5. Merge adjacent small pieces back together, up to chunk_size

This keeps each chunk as semantically complete as the size limit allows.
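The separator-selection step can be sketched in isolation. This is a minimal illustration with a hypothetical helper name (`pick_separator`): it returns the highest-priority separator that occurs in the text, together with the fallbacks to use if a resulting piece is still too large.

```python
SEPARATORS = ["\n\n", "\n", ". ", " ", ""]


def pick_separator(text: str, separators: list[str]) -> tuple[str, list[str]]:
    """Return the first separator found in text, plus the lower-priority ones after it."""
    for i, sep in enumerate(separators):
        # The empty string always qualifies: it means "split into characters".
        if sep == "" or sep in text:
            return sep, separators[i + 1:]
    return "", []


# Text with paragraph breaks: the top-priority separator is used.
sep_a, rest_a = pick_separator("Para one.\n\nPara two.", SEPARATORS)

# A single paragraph on one line: the cascade falls through to ". ".
sep_b, rest_b = pick_separator("One sentence. Another sentence.", SEPARATORS)
```

Returning the remaining separators alongside the chosen one is what lets the recursion continue from the next priority level rather than restarting from the top.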

Implementation

Python

from typing import Optional

def recursive_split(
    text: str,
    chunk_size: int = 512,
    chunk_overlap: int = 50,
    separators: Optional[list[str]] = None,
) -> list[str]:
    if separators is None:
        separators = ["\n\n", "\n", ". ", " ", ""]
    
    # Pick the highest-priority separator that actually occurs in the text
    separator = ""
    new_separators = []
    for i, sep in enumerate(separators):
        if sep == "" or sep in text:
            separator = sep
            new_separators = separators[i + 1:]
            break
    
    splits = text.split(separator) if separator else list(text)
    
    # Merge small splits back together up to chunk_size, with overlap
    chunks = []
    current = []
    current_len = 0
    
    for split in splits:
        split_len = len(split)
        if current_len + split_len + len(separator) > chunk_size:
            if current:
                chunks.append(separator.join(current))
                # Drop pieces from the front until only the overlap remains
                # and the incoming split still fits within chunk_size
                while current and (
                    current_len > chunk_overlap
                    or current_len + split_len + len(separator) > chunk_size
                ):
                    removed = current.pop(0)
                    current_len -= len(removed) + len(separator)
        current.append(split)
        current_len += split_len + len(separator)
    
    if current:
        chunks.append(separator.join(current))
    
    # Recursively split any chunk that's still too large
    final_chunks = []
    for chunk in chunks:
        if len(chunk) > chunk_size and new_separators:
            final_chunks.extend(
                recursive_split(chunk, chunk_size, chunk_overlap, new_separators)
            )
        else:
            if chunk.strip():
                final_chunks.append(chunk.strip())
    
    return final_chunks
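A quick way to sanity-check the output of a splitter such as `recursive_split` is to look at chunk lengths and at how much text consecutive chunks share. The helpers below (`shared_overlap`, `report`) are hypothetical names for this sketch, and the example chunks are hand-written rather than produced by the function above.

```python
def shared_overlap(left: str, right: str) -> int:
    """Length of the longest suffix of left that is also a prefix of right."""
    for size in range(min(len(left), len(right)), 0, -1):
        if left.endswith(right[:size]):
            return size
    return 0


def report(chunks: list[str], chunk_size: int) -> list[tuple[int, bool, int]]:
    """For each chunk: (length, within_limit, overlap_with_previous_chunk)."""
    rows = []
    for i, chunk in enumerate(chunks):
        overlap = shared_overlap(chunks[i - 1], chunk) if i > 0 else 0
        rows.append((len(chunk), len(chunk) <= chunk_size, overlap))
    return rows


example_chunks = ["alpha beta gamma", "gamma delta", "epsilon zeta eta theta"]
rows = report(example_chunks, chunk_size=20)
for row in rows:
    print(row)
```

Because overlap is built from whole splits, the measured overlap is usually smaller than `chunk_overlap` and can be zero when the trailing split alone is longer than the overlap budget.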

LangChain RecursiveCharacterTextSplitter

Python

# Newer LangChain releases ship this class in the langchain_text_splitters package
from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,
    chunk_overlap=50,
    length_function=len,  # char-based; use token counter for token-based
    separators=["\n\n", "\n", ". ", " ", ""],
)

# clinical_document is assumed to be the document text as a str, loaded elsewhere
chunks = splitter.split_text(clinical_document)

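# Token-based (matches how embedding models measure their input limit)
from transformers import AutoTokenizer

# The Hugging Face Hub id includes the organisation prefix
tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")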

def token_len(text: str) -> int:
    return len(tokenizer.encode(text, add_special_tokens=False))

splitter_token = RecursiveCharacterTextSplitter(
    chunk_size=256,
    chunk_overlap=32,
    length_function=token_len,
    separators=["\n\n", "\n", ". ", " ", ""],
)
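The `chunk_size` value is measured in whatever unit `length_function` returns, so 512 characters and 256 tokens are not directly comparable. The sketch below illustrates the gap with a deliberately crude, dependency-free stand-in for a tokenizer. `approx_token_len` is a hypothetical helper and will not match the counts of a real model tokenizer.

```python
import re


def approx_token_len(text: str) -> int:
    """Rough token count: runs of word characters plus single punctuation marks.

    This is NOT a real tokenizer; it only illustrates the difference in units.
    """
    return len(re.findall(r"\w+|[^\w\s]", text))


text = "Initial dose: 5-10mg. Adjust based on INR."
char_count = len(text)                # unit used when length_function=len
token_count = approx_token_len(text)  # unit used with a token counter

print(char_count, token_count)
```

With a real tokenizer the ratio varies by language and vocabulary, which is why measuring the limit in tokens is the safer choice when the embedding model's input cap is expressed in tokens.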

Domain-Specific Separators

Python

# Markdown documents (clinical guidelines are often written in markdown)
markdown_splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,
    chunk_overlap=50,
    separators=[
        "\n## ",   # H2 headings — strongest boundary
        "\n### ",  # H3 headings
        "\n#### ", # H4 headings
        "\n\n",    # paragraph breaks
        "\n",      # line breaks
        ". ",      # sentences
        " ",       # words
        "",
    ],
)

# Python code
code_splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,
    chunk_overlap=50,
    separators=[
        "\nclass ",
        "\ndef ",
        "\n\n",
        "\n",
        " ",
        "",
    ],
)
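Heading separators are only useful if the heading stays attached to the text that follows it. The standard-library sketch below shows one way to split before a marker without discarding it. The helper name `split_before` and the sample document are illustrative, and LangChain's own handling is controlled by its `keep_separator` option, whose default is worth checking in the version you use.

```python
import re


def split_before(text: str, marker: str) -> list[str]:
    """Split text so that every piece after the first begins with marker."""
    # A zero-width lookahead splits *before* the marker without consuming it.
    pieces = re.split(f"(?={re.escape(marker)})", text)
    return [piece for piece in pieces if piece.strip()]


doc = (
    "# Warfarin\nIntro text.\n"
    "## Indications\nAF, DVT.\n"
    "## Dosing\nInitial dose: 5-10mg."
)
sections = split_before(doc, "\n## ")

for section in sections:
    print(repr(section.strip()))
```

Each resulting section carries its own heading, so a query about dosing can match a chunk that begins with "## Dosing".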

Comparison: Fixed vs Recursive

Document: Clinical guideline with sections and bullet points

Fixed chunking (512 chars):
  Chunk 3: "...reduce clotting. ## Dosing\n\nInitial dose: 5-10mg"
  (splits mid-section, mixes conclusion of one section with start of next)

Recursive chunking (512 chars):
  Chunk 3: "## Dosing\n\nInitial dose: 5-10mg. Adjust based on INR..."
  (respects section boundary, chunk starts at a meaningful heading)

Retrieval impact: "What is the warfarin initial dose?"
  Fixed: may retrieve a chunk that mixes two sections, diluting the answer
  Recursive: retrieves the Dosing section as a self-contained chunk

Interview Answer

"Recursive chunking tries separators in priority order (paragraph breaks, then line breaks, then sentence boundaries, then words) and cascades to the next one only when a chunk still exceeds the size limit. This preserves semantic coherence: chunks start at natural boundaries rather than mid-sentence. Compared to fixed-size chunking, it tends to improve retrieval quality for structured documents (guidelines with sections, markdown, policy documents). LangChain's RecursiveCharacterTextSplitter is a widely used implementation. For domain documents, customise the separators list to match the document structure: putting markdown headers before paragraph breaks yields section-level splits."
