Implement a Text Chunker

Why Chunking Matters

LLM context windows are finite. To index a large document for RAG, you split it into chunks small enough to embed and retrieve. The chunking strategy directly affects retrieval quality.

A chunk that splits mid-sentence loses context. A chunk that's too large retrieves lots of irrelevant content. A chunk with no overlap may miss relationships that span a boundary.

Fixed-Size Character Chunker

The simplest approach — split every N characters:

Python

def chunk_fixed_size(
    text: str,
    chunk_size: int = 500,
    overlap: int = 50,
) -> list[str]:
    """
    Split text into fixed-size chunks with overlap.
    Overlap preserves context at chunk boundaries.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")

    chunks = []
    start = 0

    while start < len(text):
        end = min(start + chunk_size, len(text))
        chunk = text[start:end]

        if chunk.strip():  # Skip empty chunks
            chunks.append(chunk)

        start += chunk_size - overlap

    return chunks

# Test
text = "Metformin is a biguanide antidiabetic drug. " * 10
chunks = chunk_fixed_size(text, chunk_size=100, overlap=20)
print(f"Number of chunks: {len(chunks)}")
for i, chunk in enumerate(chunks[:3]):
    print(f"Chunk {i}: '{chunk[:60]}...'")

Problem: Fixed character splitting cuts mid-word and mid-sentence.

Word-Boundary Chunker

Respect word boundaries when splitting:

Python

def chunk_by_words(
    text: str,
    chunk_size: int = 100,  # In words
    overlap: int = 10,       # In words
) -> list[str]:
    """Split text into chunks at word boundaries."""
    words = text.split()
    chunks = []
    start = 0

    while start < len(words):
        end = min(start + chunk_size, len(words))
        chunk = " ".join(words[start:end])
        if chunk.strip():
            chunks.append(chunk)
        start += chunk_size - overlap

    return chunks

words_chunks = chunk_by_words(text, chunk_size=20, overlap=5)
print(f"Word-based chunks: {len(words_chunks)}")

Sentence-Aware Chunker

Better: split at sentence boundaries to preserve semantic units:

Python

import re

def split_into_sentences(text: str) -> list[str]:
    """Split text into sentences using regex."""
    # Split on sentence-ending punctuation followed by whitespace
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    return [s.strip() for s in sentences if s.strip()]

def chunk_by_sentences(
    text: str,
    max_chunk_chars: int = 500,
    overlap_sentences: int = 1,
) -> list[str]:
    """
    Group sentences into chunks without exceeding max_chunk_chars.
    overlap_sentences: number of sentences to repeat between chunks.
    """
    sentences = split_into_sentences(text)
    chunks = []
    current_chunk = []
    current_length = 0

    i = 0
    while i < len(sentences):
        sentence = sentences[i]
        sentence_length = len(sentence) + 1  # +1 for space

        if current_length + sentence_length > max_chunk_chars and current_chunk:
            # Flush current chunk
            chunks.append(" ".join(current_chunk))

            # Overlap: keep last N sentences in the next chunk
            if overlap_sentences > 0:
                current_chunk = current_chunk[-overlap_sentences:]
                current_length = sum(len(s) + 1 for s in current_chunk)
            else:
                current_chunk = []
                current_length = 0
        else:
            current_chunk.append(sentence)
            current_length += sentence_length
            i += 1

    if current_chunk:
        chunks.append(" ".join(current_chunk))

    return chunks

drug_text = """
Warfarin is an anticoagulant medication. It works by inhibiting vitamin K epoxide reductase,
preventing the recycling of vitamin K. This depletes active vitamin K, which is needed for
the synthesis of clotting factors II, VII, IX, and X. Warfarin has a narrow therapeutic index.
The target INR range is typically 2.0 to 3.0 for most indications. Regular monitoring is required.
Drug interactions are common because warfarin is extensively protein-bound and metabolized by CYP2C9.
NSAIDs, aspirin, and many antibiotics can significantly increase warfarin's effect.
"""

chunks = chunk_by_sentences(drug_text, max_chunk_chars=200, overlap_sentences=1)
for i, chunk in enumerate(chunks):
    print(f"\nChunk {i+1} ({len(chunk)} chars):")
    print(f"  {chunk[:100]}...")

Recursive Chunker (LangChain-style)

The most robust approach: try to split at semantic boundaries, falling back to smaller separators:

Python

class RecursiveChunker:
    """
    Splits text using a hierarchy of separators.
    Tries the first separator; if chunks are still too large,
    applies the next separator recursively.
    """

    def __init__(
        self,
        chunk_size: int = 500,
        chunk_overlap: int = 50,
        separators: list[str] | None = None,
    ):
        self.chunk_size = chunk_size
        self.chunk_overlap = chunk_overlap
        # Try separators in order: paragraph → sentence → word → character
        self.separators = separators or ["\n\n", "\n", ". ", " ", ""]

    def _split_text(self, text: str, separators: list[str]) -> list[str]:
        """Recursively split text using the separator hierarchy."""
        if not separators:
            return [text]

        separator = separators[0]
        remaining_seps = separators[1:]

        if separator == "":
            # Last resort: character-level split
            return [text[i:i+self.chunk_size] for i in range(0, len(text), self.chunk_size - self.chunk_overlap)]

        splits = text.split(separator)
        chunks = []
        current = []
        current_length = 0

        for split in splits:
            split_with_sep = split + separator if separator != "" else split
            split_length = len(split_with_sep)

            if current_length + split_length > self.chunk_size and current:
                chunk_text = separator.join(current)
                if len(chunk_text) > self.chunk_size and remaining_seps:
                    # Chunk still too large — apply next separator
                    chunks.extend(self._split_text(chunk_text, remaining_seps))
                else:
                    chunks.append(chunk_text)

                # Overlap: keep last part
                overlap_start = max(0, len(current) - 2)
                current = current[overlap_start:]
                current_length = sum(len(s) + len(separator) for s in current)

            current.append(split)
            current_length += split_length

        if current:
            chunk_text = separator.join(current)
            if len(chunk_text) > self.chunk_size and remaining_seps:
                chunks.extend(self._split_text(chunk_text, remaining_seps))
            else:
                chunks.append(chunk_text)

        return [c for c in chunks if c.strip()]

    def split(self, text: str) -> list[str]:
        return self._split_text(text, self.separators)

# Test
long_doc = """
## Drug Information: Warfarin

Warfarin is an oral anticoagulant. It inhibits vitamin K epoxide reductase.

### Mechanism
Warfarin prevents the regeneration of vitamin K1 from vitamin K epoxide. This depletes
active vitamin K, required for gamma-carboxylation of clotting factors II, VII, IX, X.

### Monitoring
The INR (International Normalized Ratio) measures anticoagulation intensity. Target INR
is 2.0-3.0 for most indications. Levels above 4.0 indicate excessive anticoagulation.

### Interactions
Warfarin has many drug interactions. NSAIDs increase bleeding risk. Rifampin induces CYP2C9,
reducing warfarin levels. Fluconazole inhibits CYP2C9, increasing warfarin levels.
"""

chunker = RecursiveChunker(chunk_size=200, chunk_overlap=30)
chunks = chunker.split(long_doc)
print(f"Total chunks: {len(chunks)}")
for i, chunk in enumerate(chunks):
    print(f"\nChunk {i+1} ({len(chunk)} chars): {chunk[:80]}...")

Chunk Size Guidelines

| Content type | Recommended chunk size | |---|---| | Dense technical docs | 200–400 chars | | General articles | 400–800 chars | | Legal/medical documents | 800–1,500 chars | | Code files | 500–1,000 chars (split by function) |

Overlap: Typically 10–20% of chunk size. Too little → context lost at boundaries. Too much → redundant retrieval.

Practical advice: Start at 500 chars / 50 overlap. Measure RAGAS context recall on your test set. If recall is low, increase chunk size. If precision is low, decrease chunk size.

Implement a Text Chunker

Why Chunking Matters

Fixed-Size Character Chunker

Word-Boundary Chunker

Sentence-Aware Chunker

Recursive Chunker (LangChain-style)

Chunk Size Guidelines

Enjoyed this article?

Leave a comment