Learnixo

RAG Systems · Lesson 10 of 24

Fixed-Size Chunking: Simple but Effective

What Fixed-Size Chunking Is

Split a document into segments of N tokens (or characters), with optional overlap:

Document: "Warfarin is an anticoagulant used to prevent blood clots.
           It works by blocking vitamin K-dependent clotting factors.
           INR monitoring is required. Target INR 2.0–3.0 for AF."

chunk_size=9 words, overlap=3 words (words stand in for tokens to keep the example readable):
  Chunk 1: "Warfarin is an anticoagulant used to prevent blood clots."
  Chunk 2: "prevent blood clots. It works by blocking vitamin K-dependent"
  Chunk 3: "blocking vitamin K-dependent clotting factors. INR monitoring is required."
  Chunk 4: "monitoring is required. Target INR 2.0–3.0 for AF."

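Overlap makes it more likely that a sentence crossing a chunk boundary appears whole in at least one chunk. This is guaranteed only for sentences no longer than the overlap; a longer sentence can still be split.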
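How much overlap helps depends on sentence length relative to the overlap. The following is a minimal word-level sketch (words instead of tokens, and the helper names `chunk_spans` and `whole_sentences` are illustrative assumptions, not part of the lesson's code) that reports which sentences of the warfarin example land whole in some chunk:

```python
def chunk_spans(n_items: int, chunk_size: int, overlap: int) -> list[tuple[int, int]]:
    """Return (start, end) index pairs for fixed-size chunks over n_items."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    spans = []
    start = 0
    while start < n_items:
        end = min(start + chunk_size, n_items)
        spans.append((start, end))
        if end == n_items:
            break
        start += step
    return spans


def whole_sentences(sentences: list[str], chunk_size: int, overlap: int) -> list[bool]:
    """For each sentence, report whether some chunk contains all of its words."""
    sentence_spans = []
    position = 0
    for sentence in sentences:
        n_words = len(sentence.split())
        sentence_spans.append((position, position + n_words))
        position += n_words
    chunks = chunk_spans(position, chunk_size, overlap)
    return [
        any(c_start <= s_start and s_end <= c_end for c_start, c_end in chunks)
        for s_start, s_end in sentence_spans
    ]


sentences = [
    "Warfarin is an anticoagulant used to prevent blood clots.",
    "It works by blocking vitamin K-dependent clotting factors.",
    "INR monitoring is required.",
    "Target INR 2.0–3.0 for AF.",
]

# Overlap of 3 words: the 8-word second sentence is longer than the overlap
print(whole_sentences(sentences, chunk_size=9, overlap=3))   # [True, False, True, True]

# Overlap of 8 words: every sentence is at most 8 words, so all are kept whole
print(whole_sentences(sentences, chunk_size=12, overlap=8))  # [True, True, True, True]
```

In this sketch the second sentence is cut in the first configuration because it is longer than the overlap, which is why overlap should be read as a mitigation rather than a guarantee.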


Implementation

Python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")

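def fixed_chunk_by_tokens(
    text: str,
    chunk_size: int = 256,
    overlap: int = 32,
) -> list[str]:
    """Split text into chunks of at most chunk_size tokens, sharing overlap tokens."""
    if not 0 <= overlap < chunk_size:
        # overlap >= chunk_size would never advance `start` (infinite loop)
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    # Tokenize once, without special tokens; the embedder adds those later
    token_ids = tokenizer.encode(text, add_special_tokens=False)
    chunks = []
    start = 0
    while start < len(token_ids):
        end = start + chunk_size
        chunk_ids = token_ids[start:end]
        # Decoding may not reproduce the original text exactly
        # (this uncased tokenizer lowercases it)
        chunk_text = tokenizer.decode(chunk_ids, skip_special_tokens=True)
        chunks.append(chunk_text)
        if end >= len(token_ids):
            break
        start += chunk_size - overlap  # step forward, keeping `overlap` tokens
    return chunks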


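def fixed_chunk_by_chars(
    text: str,
    chunk_size: int = 1000,
    overlap: int = 100,
) -> list[str]:
    """Split text into chunks of at most chunk_size characters, sharing overlap characters."""
    if not 0 <= overlap < chunk_size:
        # overlap >= chunk_size would never advance `start` (infinite loop)
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        chunks.append(text[start:end])
        if end == len(text):
            break
        start += chunk_size - overlap  # step forward, keeping `overlap` characters
    return chunks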


# Usage
chunks = fixed_chunk_by_tokens(
    text=clinical_document,
    chunk_size=256,    # tokens; at the all-MiniLM-L6-v2 limit (use 254 to leave room for [CLS]/[SEP])
    overlap=32,        # 12.5% overlap
)
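A quick way to sanity-check a character-based chunker is to stitch its output back together and compare it with the input. The sketch below assumes the chunks were produced with a fixed step of chunk_size minus overlap, as in fixed_chunk_by_chars; the helper name `stitch_char_chunks` is hypothetical. It does not apply to token-based chunks, because decoding tokens may not reproduce the original text exactly.

```python
def stitch_char_chunks(chunks: list[str], overlap: int) -> str:
    """Rebuild the original text by dropping the repeated prefix of each later chunk."""
    if not chunks:
        return ""
    # Every chunk after the first starts with `overlap` characters
    # that already appeared at the end of the previous chunk.
    return chunks[0] + "".join(chunk[overlap:] for chunk in chunks[1:])


# Chunks of "abcdefghij" with chunk_size=4 and overlap=1, written out by hand
chunks = ["abcd", "defg", "ghij"]
print(stitch_char_chunks(chunks, overlap=1))  # abcdefghij
```

If the stitched text differs from the input, the chunker is either dropping characters or using a different step than expected.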

Parameter Guidelines

chunk_size (in tokens):
  128:  very small — precise but loses context
  256:  good for question-answering over dense text (clinical guidelines)
  512:  standard — matches many embedding model context windows
  1024: large — better for summaries, risks diluting relevance

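  Keep chunk_size within the embedding model's context window (the limit counts special tokens such as [CLS] and [SEP], so leave a little headroom):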
    all-MiniLM-L6-v2: max 256 tokens (sequence limit)
    text-embedding-3-small: max 8191 tokens
    MedCPT: max 512 tokens

overlap (as % of chunk_size):
  0%:   chunks are disjoint — boundary sentences may be cut
  10%:  minimal overlap — fast, low storage overhead
  20%:  recommended — good boundary coverage
  50%:  high overlap — 2× storage, rarely necessary

Rule of thumb: overlap = 10–20% of chunk_size
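
The storage cost of overlap follows from the step size (chunk_size minus overlap). As a rough sketch, assuming the same loop logic as the chunkers in this lesson, the chunk count and the approximate storage multiplier can be estimated before indexing; `expected_chunks` and `storage_factor` are illustrative names, not library functions:

```python
import math


def expected_chunks(n_tokens: int, chunk_size: int, overlap: int) -> int:
    """Number of chunks a fixed-size chunker yields for a document of n_tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    if n_tokens <= 0:
        return 0
    step = chunk_size - overlap
    # Each chunk after the first contributes `step` new tokens
    return max(1, math.ceil((n_tokens - overlap) / step))


def storage_factor(chunk_size: int, overlap: int) -> float:
    """Approximate storage multiplier relative to disjoint chunks (long documents)."""
    return chunk_size / (chunk_size - overlap)


print(expected_chunks(10_000, chunk_size=256, overlap=0))    # 40
print(expected_chunks(10_000, chunk_size=256, overlap=32))   # 45
print(expected_chunks(10_000, chunk_size=256, overlap=128))  # 78
print(storage_factor(256, 128))                              # 2.0
```

For a 10,000-token document, moving from 0% to 12.5% overlap adds 5 chunks, while 50% overlap roughly doubles the count, which matches the 2× storage figure above.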

Pros and Cons

Advantages:
  Simple to implement and debug
  Predictable chunk count and storage size
  Fast — no semantic processing during chunking
  Consistent retrieval granularity
  Works well when text is uniformly dense (e.g., clinical guidelines)

Disadvantages:
  Ignores document structure (splits mid-sentence, mid-paragraph)
  An answer that spans two chunks may be retrieved only in part
  Uniform size is suboptimal for mixed-length documents
  Character-based chunking can break mid-word and gives no guarantee on token count (use token-based when the model has a token limit)

When to Use Fixed-Size Chunking

Use fixed-size when:
  ✓ Getting started — it's the right default
  ✓ Documents are uniformly structured (guidelines, policies, textbooks)
  ✓ Embedding model has a hard context limit you must respect
  ✓ Speed of indexing is a priority
  ✓ You lack the compute for semantic chunking

Consider alternatives when:
  ✗ Documents have clear section headers — use recursive/semantic chunking
  ✗ Answers frequently span paragraphs — use parent document retrieval
  ✗ Documents are code + prose mixed — use language-aware splitting
  ✗ Chunks must respect table boundaries — use structural splitting

Clinical Example

Python
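def index_nice_guideline(guideline_text: str, metadata: dict) -> int:
    """Chunk, embed, and store one guideline; returns the number of chunks indexed."""
    chunks = fixed_chunk_by_tokens(
        text=guideline_text,
        chunk_size=256,   # all-MiniLM-L6-v2 limit; MedCPT allows up to 512
        overlap=40,       # ~16% overlap, repeats boundary text in adjacent chunks
    )
    if not chunks:
        return 0          # empty document: nothing to index

    # embed_batch is assumed to be defined elsewhere and to return a NumPy array
    embeddings = embed_batch(chunks)   # batch embedding

    # collection is assumed to be an existing vector store collection;
    # metadata must contain a "doc_id" key
    collection.add(
        ids=[f"{metadata['doc_id']}_chunk_{i}" for i in range(len(chunks))],
        documents=chunks,
        embeddings=embeddings.tolist(),
        metadatas=[{**metadata, "chunk_index": i, "chunk_total": len(chunks)}
                   for i in range(len(chunks))]
    )
    return len(chunks)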

Interview Answer

"Fixed-size chunking splits documents into N-token segments with optional overlap (typically 10–20% of chunk size). It's the right default: simple, fast, and predictable. Key parameters are chunk_size (keep it within the embedding model's context window: 256 tokens for MiniLM, 512 for MedCPT) and overlap (repeats boundary text in adjacent chunks, so a sentence that straddles a boundary is more likely to appear whole in one of them). The main weakness is that it ignores document structure, so it may split mid-sentence or mid-paragraph. For documents with clear section structure, recursive or semantic chunking usually produces better retrieval, but fixed-size is a sensible starting point for most RAG systems."