Learnixo
Back to blog
AI Systemsintermediate

Implement a Text Chunker

Build a recursive text chunker for RAG pipelines. Implement fixed-size, sentence-aware, and recursive chunking with overlap to preserve context at chunk boundaries.

Asma Hafeez KhanMay 16, 20266 min read
Live CodingRAGChunkingPython
Share:š•

Why Chunking Matters

LLM context windows are finite. To index a large document for RAG, you split it into chunks small enough to embed and retrieve. The chunking strategy directly affects retrieval quality.

A chunk that splits mid-sentence loses context. A chunk that's too large retrieves lots of irrelevant content. A chunk with no overlap may miss relationships that span a boundary.


Fixed-Size Character Chunker

The simplest approach — split every N characters:

Python
def chunk_fixed_size(
    text: str,
    chunk_size: int = 500,
    overlap: int = 50,
) -> list[str]:
    """
    Split text into fixed-size chunks with overlap.
    Overlap preserves context at chunk boundaries.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")

    chunks = []
    start = 0

    while start < len(text):
        end = min(start + chunk_size, len(text))
        chunk = text[start:end]

        if chunk.strip():  # Skip empty chunks
            chunks.append(chunk)

        start += chunk_size - overlap

    return chunks

# Test
text = "Metformin is a biguanide antidiabetic drug. " * 10
chunks = chunk_fixed_size(text, chunk_size=100, overlap=20)
print(f"Number of chunks: {len(chunks)}")
for i, chunk in enumerate(chunks[:3]):
    print(f"Chunk {i}: '{chunk[:60]}...'")

Problem: Fixed character splitting cuts mid-word and mid-sentence.


Word-Boundary Chunker

Respect word boundaries when splitting:

Python
def chunk_by_words(
    text: str,
    chunk_size: int = 100,  # In words
    overlap: int = 10,       # In words
) -> list[str]:
    """Split text into chunks at word boundaries."""
    words = text.split()
    chunks = []
    start = 0

    while start < len(words):
        end = min(start + chunk_size, len(words))
        chunk = " ".join(words[start:end])
        if chunk.strip():
            chunks.append(chunk)
        start += chunk_size - overlap

    return chunks

words_chunks = chunk_by_words(text, chunk_size=20, overlap=5)
print(f"Word-based chunks: {len(words_chunks)}")

Sentence-Aware Chunker

Better: split at sentence boundaries to preserve semantic units:

Python
import re

def split_into_sentences(text: str) -> list[str]:
    """Split text into sentences using regex."""
    # Split on sentence-ending punctuation followed by whitespace
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    return [s.strip() for s in sentences if s.strip()]

def chunk_by_sentences(
    text: str,
    max_chunk_chars: int = 500,
    overlap_sentences: int = 1,
) -> list[str]:
    """
    Group sentences into chunks without exceeding max_chunk_chars.
    overlap_sentences: number of sentences to repeat between chunks.
    """
    sentences = split_into_sentences(text)
    chunks = []
    current_chunk = []
    current_length = 0

    i = 0
    while i < len(sentences):
        sentence = sentences[i]
        sentence_length = len(sentence) + 1  # +1 for space

        if current_length + sentence_length > max_chunk_chars and current_chunk:
            # Flush current chunk
            chunks.append(" ".join(current_chunk))

            # Overlap: keep last N sentences in the next chunk
            if overlap_sentences > 0:
                current_chunk = current_chunk[-overlap_sentences:]
                current_length = sum(len(s) + 1 for s in current_chunk)
            else:
                current_chunk = []
                current_length = 0
        else:
            current_chunk.append(sentence)
            current_length += sentence_length
            i += 1

    if current_chunk:
        chunks.append(" ".join(current_chunk))

    return chunks

drug_text = """
Warfarin is an anticoagulant medication. It works by inhibiting vitamin K epoxide reductase,
preventing the recycling of vitamin K. This depletes active vitamin K, which is needed for
the synthesis of clotting factors II, VII, IX, and X. Warfarin has a narrow therapeutic index.
The target INR range is typically 2.0 to 3.0 for most indications. Regular monitoring is required.
Drug interactions are common because warfarin is extensively protein-bound and metabolized by CYP2C9.
NSAIDs, aspirin, and many antibiotics can significantly increase warfarin's effect.
"""

chunks = chunk_by_sentences(drug_text, max_chunk_chars=200, overlap_sentences=1)
for i, chunk in enumerate(chunks):
    print(f"\nChunk {i+1} ({len(chunk)} chars):")
    print(f"  {chunk[:100]}...")

Recursive Chunker (LangChain-style)

The most robust approach: try to split at semantic boundaries, falling back to smaller separators:

Python
class RecursiveChunker:
    """
    Splits text using a hierarchy of separators.
    Tries the first separator; if chunks are still too large,
    applies the next separator recursively.
    """

    def __init__(
        self,
        chunk_size: int = 500,
        chunk_overlap: int = 50,
        separators: list[str] | None = None,
    ):
        self.chunk_size = chunk_size
        self.chunk_overlap = chunk_overlap
        # Try separators in order: paragraph → sentence → word → character
        self.separators = separators or ["\n\n", "\n", ". ", " ", ""]

    def _split_text(self, text: str, separators: list[str]) -> list[str]:
        """Recursively split text using the separator hierarchy."""
        if not separators:
            return [text]

        separator = separators[0]
        remaining_seps = separators[1:]

        if separator == "":
            # Last resort: character-level split
            return [text[i:i+self.chunk_size] for i in range(0, len(text), self.chunk_size - self.chunk_overlap)]

        splits = text.split(separator)
        chunks = []
        current = []
        current_length = 0

        for split in splits:
            split_with_sep = split + separator if separator != "" else split
            split_length = len(split_with_sep)

            if current_length + split_length > self.chunk_size and current:
                chunk_text = separator.join(current)
                if len(chunk_text) > self.chunk_size and remaining_seps:
                    # Chunk still too large — apply next separator
                    chunks.extend(self._split_text(chunk_text, remaining_seps))
                else:
                    chunks.append(chunk_text)

                # Overlap: keep last part
                overlap_start = max(0, len(current) - 2)
                current = current[overlap_start:]
                current_length = sum(len(s) + len(separator) for s in current)

            current.append(split)
            current_length += split_length

        if current:
            chunk_text = separator.join(current)
            if len(chunk_text) > self.chunk_size and remaining_seps:
                chunks.extend(self._split_text(chunk_text, remaining_seps))
            else:
                chunks.append(chunk_text)

        return [c for c in chunks if c.strip()]

    def split(self, text: str) -> list[str]:
        return self._split_text(text, self.separators)

# Test
long_doc = """
## Drug Information: Warfarin

Warfarin is an oral anticoagulant. It inhibits vitamin K epoxide reductase.

### Mechanism
Warfarin prevents the regeneration of vitamin K1 from vitamin K epoxide. This depletes
active vitamin K, required for gamma-carboxylation of clotting factors II, VII, IX, X.

### Monitoring
The INR (International Normalized Ratio) measures anticoagulation intensity. Target INR
is 2.0-3.0 for most indications. Levels above 4.0 indicate excessive anticoagulation.

### Interactions
Warfarin has many drug interactions. NSAIDs increase bleeding risk. Rifampin induces CYP2C9,
reducing warfarin levels. Fluconazole inhibits CYP2C9, increasing warfarin levels.
"""

chunker = RecursiveChunker(chunk_size=200, chunk_overlap=30)
chunks = chunker.split(long_doc)
print(f"Total chunks: {len(chunks)}")
for i, chunk in enumerate(chunks):
    print(f"\nChunk {i+1} ({len(chunk)} chars): {chunk[:80]}...")

Chunk Size Guidelines

| Content type | Recommended chunk size | |---|---| | Dense technical docs | 200–400 chars | | General articles | 400–800 chars | | Legal/medical documents | 800–1,500 chars | | Code files | 500–1,000 chars (split by function) |

Overlap: Typically 10–20% of chunk size. Too little → context lost at boundaries. Too much → redundant retrieval.

Practical advice: Start at 500 chars / 50 overlap. Measure RAGAS context recall on your test set. If recall is low, increase chunk size. If precision is low, decrease chunk size.

Enjoyed this article?

Explore the AI Systems learning path for more.

Found this helpful?

Share:š•

Leave a comment

Have a question, correction, or just found this helpful? Leave a note below.