RAG Systems · Lesson 10 of 24
Fixed-Size Chunking: Simple but Effective
What Fixed-Size Chunking Is
Split a document into segments of N tokens (or characters), with optional overlap:
Document: "Warfarin is an anticoagulant used to prevent blood clots.
It works by blocking vitamin K-dependent clotting factors.
INR monitoring is required. Target INR 2.0–3.0 for AF."
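chunk_size=9, overlap=3 (counted in whitespace-separated words for readability; a real tokenizer would place the boundaries differently):
Chunk 1: "Warfarin is an anticoagulant used to prevent blood clots."
Chunk 2: "prevent blood clots. It works by blocking vitamin K-dependent"
Chunk 3: "blocking vitamin K-dependent clotting factors. INR monitoring is required."
Chunk 4: "monitoring is required. Target INR 2.0–3.0 for AF."
Overlap makes it more likely that a sentence crossing a chunk boundary appears whole in at least one chunk. That is guaranteed only for spans no longer than the overlap: here the 8-word second sentence is still cut in both chunks that contain part of it.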
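To see how far overlap actually protects sentences, the check below slides a window over the same Warfarin text and reports which sentences end up whole in at least one chunk. It is a word-level sketch, assuming a 9-word window with a 3-word overlap chosen purely for illustration; the function names are hypothetical and a real pipeline would count tokens rather than words.

```python
def chunk_words(words: list[str], chunk_size: int, overlap: int) -> list[tuple[int, int]]:
    """Return (start, end) word-index windows, end exclusive."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    spans, start = [], 0
    while start < len(words):
        end = min(start + chunk_size, len(words))
        spans.append((start, end))
        if end == len(words):
            break
        start += chunk_size - overlap  # advance by the stride
    return spans


def whole_sentences(sentences: list[str], chunk_size: int, overlap: int) -> list[bool]:
    """For each sentence, report whether some chunk contains it entirely."""
    words, bounds = [], []
    for sentence in sentences:
        tokens = sentence.split()
        bounds.append((len(words), len(words) + len(tokens)))
        words.extend(tokens)
    spans = chunk_words(words, chunk_size, overlap)
    return [any(s <= lo and hi <= e for s, e in spans) for lo, hi in bounds]


sentences = [
    "Warfarin is an anticoagulant used to prevent blood clots.",
    "It works by blocking vitamin K-dependent clotting factors.",
    "INR monitoring is required.",
    "Target INR 2.0–3.0 for AF.",
]
coverage = whole_sentences(sentences, chunk_size=9, overlap=3)
print(coverage)
```

With these settings the second sentence is the only one not contained whole in any chunk: it is 8 words long, and a 3-word overlap only guarantees coverage for much shorter spans.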
Implementation
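from transformers import AutoTokenizer

# Count tokens with the embedding model's own tokenizer so sizes match its limit
tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")


def fixed_chunk_by_tokens(
    text: str,
    chunk_size: int = 256,
    overlap: int = 32,
) -> list[str]:
    # overlap >= chunk_size would make the stride zero or negative (infinite loop)
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    token_ids = tokenizer.encode(text, add_special_tokens=False)
    chunks = []
    start = 0
    while start < len(token_ids):
        end = start + chunk_size
        chunk_ids = token_ids[start:end]
        # decode() returns the tokenizer's normalized text (lowercased for this
        # uncased model), not a verbatim slice of the input
        chunk_text = tokenizer.decode(chunk_ids, skip_special_tokens=True)
        chunks.append(chunk_text)
        if end >= len(token_ids):
            break
        start += chunk_size - overlap  # advance by the stride
    return chunks


def fixed_chunk_by_chars(
    text: str,
    chunk_size: int = 1000,
    overlap: int = 100,
) -> list[str]:
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        chunks.append(text[start:end])
        if end == len(text):
            break
        start += chunk_size - overlap
    return chunks
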
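# Usage (clinical_document is assumed to hold the document text as a str)
chunks = fixed_chunk_by_tokens(
    text=clinical_document,
    # all-MiniLM's 256-token limit includes [CLS] and [SEP]; 254 would leave room for them
    chunk_size=256,
    overlap=32,  # 12.5% overlap
)

Parameter Guidelines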
chunk_size (in tokens):
128: very small — precise but loses context
256: good for question-answering over dense text (clinical guidelines)
512: standard — matches many embedding model context windows
1024: large — better for summaries, risks diluting relevance
Match to embedding model's context window:
all-MiniLM-L6-v2: max 256 tokens (sequence limit)
text-embedding-3-small: max 8191 tokens
MedCPT: max 512 tokens
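BERT-style embedders such as all-MiniLM and MedCPT wrap each input in special tokens, and those count toward the sequence limit. A minimal sketch of the resulting budget, assuming two special tokens per input ([CLS] and [SEP]); the helper name and its default are illustrative and not part of any library:

```python
def content_token_budget(model_max_tokens: int, special_tokens: int = 2) -> int:
    """Tokens left for chunk text once the model's special tokens are added."""
    if model_max_tokens <= special_tokens:
        raise ValueError("model limit leaves no room for content")
    return model_max_tokens - special_tokens


# Limits as listed above; special_tokens=2 assumes [CLS] + [SEP]
for name, limit in {"all-MiniLM-L6-v2": 256, "MedCPT": 512}.items():
    print(f"{name}: up to {content_token_budget(limit)} content tokens per chunk")
```

Chunks cut at exactly the model limit are typically truncated by a token or two at embedding time, so sizing slightly below the limit is the safer choice.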
overlap (as % of chunk_size):
0%: chunks are disjoint — boundary sentences may be cut
10%: minimal overlap — fast, low storage overhead
20%: recommended — good boundary coverage
50%: high overlap — 2× storage, rarely necessary
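The storage figures above follow from the stride: each new chunk advances by chunk_size - overlap tokens, so for a long document the index holds roughly 1 / (1 - overlap fraction) times the original token count. A quick check of that arithmetic (the function name is illustrative):

```python
def storage_overhead(overlap_fraction: float) -> float:
    """Approximate ratio of indexed tokens to source tokens for a long document."""
    if not 0 <= overlap_fraction < 1:
        raise ValueError("overlap_fraction must be in [0, 1)")
    return 1 / (1 - overlap_fraction)


for pct in (0, 10, 20, 50):
    print(f"{pct:>2}% overlap -> {storage_overhead(pct / 100):.2f}x storage")
```

By this estimate 10% overlap costs about 1.11× storage, 20% about 1.25×, and 50% the full 2×.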
Rule of thumb: overlap = 10–20% of chunk_size
Pros and Cons
Advantages:
Simple to implement and debug
Predictable chunk count and storage size
Fast — no semantic processing during chunking
Consistent retrieval granularity
Works well when text is uniformly dense (e.g., clinical guidelines)
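Chunk count is predictable because it depends only on the document's token count and the two parameters. A sketch of the closed form, checked against a loop with the same structure as fixed_chunk_by_tokens (both function names here are illustrative):

```python
import math


def expected_chunk_count(n_tokens: int, chunk_size: int, overlap: int) -> int:
    """Number of chunks a sliding window with the given overlap produces."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    if n_tokens <= 0:
        return 0
    if n_tokens <= chunk_size:
        return 1
    stride = chunk_size - overlap
    return math.ceil((n_tokens - chunk_size) / stride) + 1


def count_by_loop(n_tokens: int, chunk_size: int, overlap: int) -> int:
    """Count windows by simulating the chunking loop directly."""
    count, start = 0, 0
    while start < n_tokens:
        count += 1
        if start + chunk_size >= n_tokens:
            break
        start += chunk_size - overlap
    return count


# A 10,000-token document with the defaults used in this lesson
print(expected_chunk_count(10_000, chunk_size=256, overlap=32))
```

Multiplying the result by the embedding dimension and bytes per value gives a storage estimate before any indexing is run.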
Disadvantages:
Ignores document structure (splits mid-sentence, mid-paragraph)
An answer that spans two chunks may not be retrieved well
Uniform size is suboptimal for mixed-length documents
Character-based chunking can cut words in half and gives no control over token count (prefer token-based)
When to Use Fixed-Size Chunking
Use fixed-size when:
✓ Getting started — it's the right default
✓ Documents are uniformly structured (guidelines, policies, textbooks)
✓ Embedding model has a hard context limit you must respect
✓ Speed of indexing is a priority
✓ You lack the compute for semantic chunking
Consider alternatives when:
✗ Documents have clear section headers — use recursive/semantic chunking
✗ Answers frequently span paragraphs — use parent document retrieval
✗ Documents are code + prose mixed — use language-aware splitting
✗ Chunks must respect table boundaries — use structural splitting
Clinical Example
# embed_batch and collection are assumed to be defined elsewhere: a batch
# embedding helper returning a NumPy array, and a vector-store collection
def index_nice_guideline(guideline_text: str, metadata: dict) -> int:
    chunks = fixed_chunk_by_tokens(
        text=guideline_text,
        # Well under MedCPT's 512-token limit. Tokens are counted with the
        # tokenizer loaded above; for exact limits, load MedCPT's tokenizer instead.
        chunk_size=256,
        overlap=40,  # ~16% overlap, captures cross-sentence info
    )
    embeddings = embed_batch(chunks)  # batch embedding
    collection.add(
        ids=[f"{metadata['doc_id']}_chunk_{i}" for i in range(len(chunks))],
        documents=chunks,
        embeddings=embeddings.tolist(),
        metadatas=[
            {**metadata, "chunk_index": i, "chunk_total": len(chunks)}
            for i in range(len(chunks))
        ],
    )
    return len(chunks)

Interview Answer
"Fixed-size chunking splits documents into N-token segments with optional overlap (typically 10–20% of chunk size). It's the right default: simple, fast, and predictable. Key parameters are chunk_size (keep it within the embedding model's context window: 256 tokens for MiniLM, 512 for MedCPT) and overlap (reduces the chance that a sentence at a chunk boundary is cut in every chunk that contains it). The main weakness is that it ignores document structure, so it may split mid-sentence or mid-paragraph. For documents with clear section structure, recursive or semantic chunking usually retrieves better, but fixed-size is a sensible starting point for most RAG systems."