Live Coding Interview Prep · Lesson 12 of 16
Implement a Text Chunker with Overlap
Why Chunking Matters
LLM context windows are finite. To index a large document for RAG, you split it into chunks small enough to embed and retrieve. The chunking strategy directly affects retrieval quality.
A chunk that splits mid-sentence loses context. A chunk that's too large retrieves lots of irrelevant content. A chunk with no overlap may miss relationships that span a boundary.
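To make the boundary problem concrete, here is a minimal sketch. The helper `naive_chunks` is hypothetical and exists only for this illustration; it slices fixed-size character windows and shows a term that is cut in half without overlap but survives intact once adjacent chunks share a few characters.

```python
def naive_chunks(text: str, size: int, overlap: int = 0) -> list[str]:
    """Hypothetical helper: fixed-size character windows with optional overlap."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]


text = "Warfarin interacts with aspirin. Monitor INR closely."
fact = "aspirin"

no_overlap = naive_chunks(text, size=28)
with_overlap = naive_chunks(text, size=28, overlap=10)

# Without overlap the boundary falls inside "aspirin" ("...aspi" | "rin. ...")
print(any(fact in c for c in no_overlap))    # False
# With overlap, the second window starts earlier and contains the whole word
print(any(fact in c for c in with_overlap))  # True
```

A retriever searching for "aspirin" would find no exact match in the first set of chunks, even though the source text contains the word.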
Fixed-Size Character Chunker
The simplest approach — split every N characters:
def chunk_fixed_size(
    text: str,
    chunk_size: int = 500,
    overlap: int = 50,
) -> list[str]:
    """
    Split text into fixed-size chunks with overlap.
    Overlap preserves context at chunk boundaries.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    chunks = []
    start = 0
    step = chunk_size - overlap  # How far the window advances each iteration
    while start < len(text):
        end = min(start + chunk_size, len(text))
        chunk = text[start:end]
        if chunk.strip():  # Skip whitespace-only chunks
            chunks.append(chunk)
        if end == len(text):
            break  # Last chunk reached; another pass would only repeat its tail
        start += step
    return chunks
# Test
text = "Metformin is a biguanide antidiabetic drug. " * 10
chunks = chunk_fixed_size(text, chunk_size=100, overlap=20)
print(f"Number of chunks: {len(chunks)}")
for i, chunk in enumerate(chunks[:3]):
    print(f"Chunk {i}: '{chunk[:60]}...'")
Problem: Fixed character splitting cuts mid-word and mid-sentence.
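One way to see how often this happens is to count the chunk boundaries that land inside a word. The sketch below uses a hypothetical helper, `count_midword_cuts`, and assumes plain slicing with no overlap; it is meant as an illustration rather than a standard metric.

```python
def count_midword_cuts(text: str, chunk_size: int) -> int:
    """Hypothetical helper: count fixed-size boundaries that fall inside a word."""
    cuts = 0
    for boundary in range(chunk_size, len(text), chunk_size):
        # A cut is mid-word when the characters on both sides are non-whitespace
        if not text[boundary - 1].isspace() and not text[boundary].isspace():
            cuts += 1
    return cuts


sample = "Metformin is a biguanide antidiabetic drug. " * 10
total_boundaries = len(range(100, len(sample), 100))
cuts = count_midword_cuts(sample, chunk_size=100)
print(f"{cuts} of {total_boundaries} boundaries split a word")
```

On this sample, half of the boundaries cut a word in two (for example inside "antidiabetic"), which is the motivation for the word-boundary approach that follows.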
Word-Boundary Chunker
Respect word boundaries when splitting:
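def chunk_by_words(
    text: str,
    chunk_size: int = 100,  # In words
    overlap: int = 10,  # In words
) -> list[str]:
    """Split text into chunks at word boundaries."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        # overlap >= chunk_size would stop the window from advancing
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    words = text.split()
    chunks = []
    start = 0
    while start < len(words):
        end = min(start + chunk_size, len(words))
        chunks.append(" ".join(words[start:end]))
        if end == len(words):
            break  # Last chunk reached; another pass would only repeat its tail
        start += chunk_size - overlap
    return chunks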
words_chunks = chunk_by_words(text, chunk_size=20, overlap=5)
print(f"Word-based chunks: {len(words_chunks)}")
Sentence-Aware Chunker
Better: split at sentence boundaries to preserve semantic units:
import re
def split_into_sentences(text: str) -> list[str]:
    """Split text into sentences using regex."""
    # Split on sentence-ending punctuation followed by whitespace.
    # Note: this simple rule also splits after abbreviations such as "Dr." or "e.g."
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    return [s.strip() for s in sentences if s.strip()]
def chunk_by_sentences(
    text: str,
    max_chunk_chars: int = 500,
    overlap_sentences: int = 1,
) -> list[str]:
    """
    Group sentences into chunks without exceeding max_chunk_chars.
    overlap_sentences: number of sentences to repeat between chunks.
    A single sentence longer than max_chunk_chars becomes its own oversized chunk.
    """
    sentences = split_into_sentences(text)
    chunks = []
    current_chunk = []
    current_length = 0
    for sentence in sentences:
        sentence_length = len(sentence) + 1  # +1 for the joining space
        if current_chunk and current_length + sentence_length > max_chunk_chars:
            # Flush current chunk
            chunks.append(" ".join(current_chunk))
            # Overlap: keep last N sentences in the next chunk
            carried = current_chunk[-overlap_sentences:] if overlap_sentences > 0 else []
            carried_length = sum(len(s) + 1 for s in carried)
            if carried_length + sentence_length > max_chunk_chars:
                # The overlap leaves no room for the new sentence; drop it so the
                # loop always makes progress instead of flushing the same chunk forever
                carried, carried_length = [], 0
            current_chunk, current_length = carried, carried_length
        current_chunk.append(sentence)
        current_length += sentence_length
    if current_chunk:
        chunks.append(" ".join(current_chunk))
    return chunks
drug_text = """
Warfarin is an anticoagulant medication. It works by inhibiting vitamin K epoxide reductase,
preventing the recycling of vitamin K. This depletes active vitamin K, which is needed for
the synthesis of clotting factors II, VII, IX, and X. Warfarin has a narrow therapeutic index.
The target INR range is typically 2.0 to 3.0 for most indications. Regular monitoring is required.
Drug interactions are common because warfarin is extensively protein-bound and metabolized by CYP2C9.
NSAIDs, aspirin, and many antibiotics can significantly increase warfarin's effect.
"""
chunks = chunk_by_sentences(drug_text, max_chunk_chars=200, overlap_sentences=1)
for i, chunk in enumerate(chunks):
    print(f"\nChunk {i+1} ({len(chunk)} chars):")
    print(f" {chunk[:100]}...")
Recursive Chunker (LangChain-style)
The most robust approach: try to split at semantic boundaries, falling back to smaller separators:
class RecursiveChunker:
    """
    Splits text using a hierarchy of separators.
    Tries the first separator; if chunks are still too large,
    applies the next separator recursively.
    """
    def __init__(
        self,
        chunk_size: int = 500,
        chunk_overlap: int = 50,
        separators: list[str] | None = None,
    ):
        if chunk_size <= 0:
            raise ValueError("chunk_size must be positive")
        if not 0 <= chunk_overlap < chunk_size:
            raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
        self.chunk_size = chunk_size
        self.chunk_overlap = chunk_overlap
        # Try separators in order: paragraph → line → sentence → word → character
        self.separators = separators or ["\n\n", "\n", ". ", " ", ""]

    def _split_text(self, text: str, separators: list[str]) -> list[str]:
        """Recursively split text using the separator hierarchy."""
        if not separators:
            return [text]
        separator = separators[0]
        remaining_seps = separators[1:]
        if separator == "":
            # Last resort: character-level split. Stop once a window reaches the
            # end of the text so the final chunk is not a pure repeat of the tail.
            step = self.chunk_size - self.chunk_overlap
            last_start = max(len(text) - self.chunk_overlap, 1)
            return [text[i:i + self.chunk_size] for i in range(0, last_start, step)]
        splits = text.split(separator)
        chunks = []
        current = []
        current_length = 0
        for split in splits:
            split_length = len(split) + len(separator)
            if current_length + split_length > self.chunk_size and current:
                chunk_text = separator.join(current)
                if len(chunk_text) > self.chunk_size and remaining_seps:
                    # Chunk still too large: apply next separator
                    chunks.extend(self._split_text(chunk_text, remaining_seps))
                else:
                    chunks.append(chunk_text)
                # Overlap: carry over trailing pieces that fit within chunk_overlap
                kept = []
                kept_length = 0
                for piece in reversed(current):
                    piece_length = len(piece) + len(separator)
                    if kept_length + piece_length > self.chunk_overlap:
                        break
                    kept.insert(0, piece)
                    kept_length += piece_length
                current = kept
                current_length = kept_length
            current.append(split)
            current_length += split_length
        if current:
            chunk_text = separator.join(current)
            if len(chunk_text) > self.chunk_size and remaining_seps:
                chunks.extend(self._split_text(chunk_text, remaining_seps))
            else:
                chunks.append(chunk_text)
        return [c for c in chunks if c.strip()]

    def split(self, text: str) -> list[str]:
        return self._split_text(text, self.separators)
# Test
long_doc = """
## Drug Information: Warfarin
Warfarin is an oral anticoagulant. It inhibits vitamin K epoxide reductase.
### Mechanism
Warfarin prevents the regeneration of vitamin K1 from vitamin K epoxide. This depletes
active vitamin K, required for gamma-carboxylation of clotting factors II, VII, IX, X.
### Monitoring
The INR (International Normalized Ratio) measures anticoagulation intensity. Target INR
is 2.0-3.0 for most indications. Levels above 4.0 indicate excessive anticoagulation.
### Interactions
Warfarin has many drug interactions. NSAIDs increase bleeding risk. Rifampin induces CYP2C9,
reducing warfarin levels. Fluconazole inhibits CYP2C9, increasing warfarin levels.
"""
chunker = RecursiveChunker(chunk_size=200, chunk_overlap=30)
chunks = chunker.split(long_doc)
print(f"Total chunks: {len(chunks)}")
for i, chunk in enumerate(chunks):
    print(f"\nChunk {i+1} ({len(chunk)} chars): {chunk[:80]}...")
Chunk Size Guidelines
| Content type | Recommended chunk size |
|---|---|
| Dense technical docs | 200–400 chars |
| General articles | 400–800 chars |
| Legal/medical documents | 800–1,500 chars |
| Code files | 500–1,000 chars (split by function) |
Overlap: Typically 10–20% of chunk size. Too little → context lost at boundaries. Too much → redundant retrieval.
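The cost of overlap can be put in numbers. The sketch below assumes a fixed-size sliding window that advances by `chunk_size - overlap` characters and stops once a chunk reaches the end of the text; `overlap_cost` is a hypothetical helper introduced only for this estimate.

```python
import math


def overlap_cost(doc_chars: int, chunk_size: int, overlap: int) -> tuple[int, float]:
    """Hypothetical helper: return (chunk count, fraction of stored text that is repeated).

    Assumes a sliding window with step = chunk_size - overlap.
    """
    if not 0 <= overlap < chunk_size:
        raise ValueError("need 0 <= overlap < chunk_size")
    if doc_chars <= chunk_size:
        return 1, 0.0
    step = chunk_size - overlap
    n_chunks = math.ceil((doc_chars - overlap) / step)
    # Each pair of neighbouring chunks shares exactly `overlap` characters
    stored = doc_chars + (n_chunks - 1) * overlap
    return n_chunks, (stored - doc_chars) / stored


for overlap in (0, 50, 100, 250):
    n, dup = overlap_cost(10_000, 500, overlap)
    print(f"overlap={overlap:>3}: {n} chunks, {dup:.0%} duplicated")
```

For a 10,000-character document with 500-character chunks, this gives 23 chunks with about 10% duplicated text at an overlap of 50, versus 39 chunks with about 49% duplicated text at an overlap of 250.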
Practical advice: Start at 500 chars / 50 overlap. Measure RAGAS context recall on your test set. If recall is low, increase chunk size. If precision is low, decrease chunk size.
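Before setting up a full evaluation, a cheap sanity check can narrow the range of settings worth testing. The sketch below is a stand-in under stated assumptions, not the RAGAS metric: `sliding_chunks` and `span_survival` are hypothetical helpers, and the score only checks whether hand-picked answer spans survive chunking intact, with no retrieval step involved.

```python
def sliding_chunks(text: str, chunk_size: int, overlap: int) -> list[str]:
    """Stand-in chunker for this sketch: fixed-size windows with overlap."""
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]


def span_survival(chunks: list[str], gold_spans: list[str]) -> float:
    """Fraction of gold spans found intact inside at least one chunk.

    A crude proxy only; it is not the RAGAS context recall metric.
    """
    if not gold_spans:
        return 0.0
    hits = sum(any(span in chunk for chunk in chunks) for span in gold_spans)
    return hits / len(gold_spans)


doc = (
    "Warfarin has a narrow therapeutic index. "
    "The target INR range is typically 2.0 to 3.0 for most indications. "
    "NSAIDs increase bleeding risk."
)
# Assumed answer spans that a test question would need to retrieve
gold = ["narrow therapeutic index", "2.0 to 3.0", "NSAIDs increase bleeding risk"]

for size, overlap in [(30, 0), (30, 15), (60, 30)]:
    score = span_survival(sliding_chunks(doc, size, overlap), gold)
    print(f"chunk_size={size:>3} overlap={overlap:>2} -> survival {score:.2f}")
```

Settings that score poorly here are unlikely to do well on recall once retrieval is added, so they can be dropped before the more expensive evaluation.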