RAG Chunking Strategy — Interview Q&A

Chunking is the step most RAG tutorials gloss over and the step most production RAG failures trace back to. These questions probe whether a candidate understands the trade-offs, not just the syntax.

Q1. How do you choose chunk size?

Chunk size is driven by three constraints: the embedding model's context window, the granularity of expected queries, and storage budget. The embedding model is the hard ceiling — all-MiniLM-L6-v2 has a 256-token limit; text-embedding-3-small handles 8K. For the retrieval task, chunk size should match the typical answer length: if users ask short factual questions, 256 tokens gives precise retrieval; if they ask for explanations of multi-step processes, 512–1024 tokens may be needed. I start at 512 tokens, measure retrieval quality with context recall on a labelled eval set, then tune. The wrong chunk size is detectable: too small gives incomplete answers; too large includes irrelevant content that dilutes the embedding signal.

Q2. Why use overlap at all? What is the downside?

Overlap keeps a sentence or idea that straddles a chunk boundary intact in at least one chunk. Without overlap, a query about information spanning a boundary might retrieve two incomplete fragments, neither of which contains the full answer. The downside is storage and compute: 20% overlap creates roughly 25% more chunks (the stride drops to 80% of the chunk size, and 1 / 0.8 = 1.25), which increases indexing time, vector store size, and query cost. There is also a retrieval quality risk: high overlap (50% or more) creates near-duplicate chunks that may both rank highly, wasting context window space with repeated content. Mitigate by deduplicating retrieved chunks before sending them to the LLM. Content hashes catch exact duplicates only; near-duplicates need a similarity measure (for example Jaccard overlap on word shingles, or cosine similarity between chunk embeddings), dropping any chunk that is more than 90% similar to one already kept.
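The overlap arithmetic and the query-time deduplication step can be sketched as follows. This is a minimal illustration, not a production implementation: the function names are hypothetical, tokens are approximated by whitespace-separated words, and Jaccard similarity on 3-word shingles is one assumed choice of near-duplicate measure among several.

```python
import math

def chunk_count(total_tokens: int, chunk_size: int, overlap: int) -> int:
    """Number of sliding-window chunks needed to cover total_tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    if total_tokens <= chunk_size:
        return 1
    stride = chunk_size - overlap  # how far the window advances each step
    return 1 + math.ceil((total_tokens - chunk_size) / stride)

def shingles(text: str, n: int = 3) -> set:
    """Set of n-word shingles; words are a crude stand-in for tokens."""
    words = text.lower().split()
    if len(words) < n:
        return {tuple(words)}
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def jaccard(a: set, b: set) -> float:
    union = a | b
    return len(a & b) / len(union) if union else 1.0

def dedupe(chunks: list[str], threshold: float = 0.9) -> list[str]:
    """Keep a chunk only if it is below the similarity threshold against
    every chunk already kept. Input is assumed sorted best-first."""
    kept, kept_shingles = [], []
    for chunk in chunks:
        s = shingles(chunk)
        if all(jaccard(s, k) < threshold for k in kept_shingles):
            kept.append(chunk)
            kept_shingles.append(s)
    return kept
```

For a 100,000-token corpus with 500-token chunks, `chunk_count` gives 200 chunks with no overlap and 250 chunks with a 100-token (20%) overlap, which is the 25% increase described above.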

Q3. When would you use semantic chunking over recursive chunking?

Semantic chunking embeds every sentence during indexing to detect topic shifts, which makes it far slower than recursive chunking (roughly 10 to 50 times, depending on the embedding model and hardware). I use it for high-value document sets where retrieval quality justifies the cost: clinical guidelines, regulatory documents, long academic papers with varied section densities. For high-volume pipelines (thousands of documents updated daily) or short documents, recursive chunking with well-tuned separators gives, as a rule of thumb, about 90% of the quality at about 5% of the cost. The decision comes down to two questions: how much does retrieval quality matter, and how many documents are being indexed?
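To make the cost concrete, here is a minimal sketch of the idea behind semantic chunking: embed each sentence and start a new chunk when similarity to the previous sentence drops. The `bow_embed` function is a toy bag-of-words stand-in for a real embedding model, and the 0.2 threshold is an arbitrary assumption. In practice `embed` would be a model call per sentence, which is where the indexing cost comes from.

```python
import math
from collections import Counter
from typing import Callable

def bow_embed(sentence: str) -> Counter:
    """Toy embedding (assumption): bag-of-words counts, not a real model."""
    return Counter(sentence.lower().replace(".", "").split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def semantic_chunks(
    sentences: list[str],
    embed: Callable[[str], Counter] = bow_embed,
    threshold: float = 0.2,
) -> list[str]:
    """Start a new chunk whenever adjacent sentences fall below the threshold."""
    if not sentences:
        return []
    chunks = [[sentences[0]]]
    prev = embed(sentences[0])
    for sentence in sentences[1:]:
        cur = embed(sentence)  # one embedding call per sentence
        if cosine(prev, cur) < threshold:
            chunks.append([sentence])  # topic shift detected
        else:
            chunks[-1].append(sentence)
        prev = cur
    return [" ".join(group) for group in chunks]
```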

Q4. A user says retrieved chunks are often incomplete — the answer spans two chunks. How do you fix it?

Three options in escalating complexity. First, increase overlap: if the answers are short and consistently span boundaries, 20% overlap should capture them in at least one chunk, provided the answer is no longer than the overlap region. Second, parent document retrieval: embed small child chunks for precise matching, but return the full parent paragraph or section. This gives precise retrieval with complete context. Third, small-to-big: retrieve small chunks, then expand each hit to a window of two sentences on either side at retrieval time. Which I choose depends on whether the documents have clear paragraph structure (parent document), whether the answer pattern is short and specific (overlap), or whether queries vary widely in answer length (small-to-big).
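The second and third options are both lookups performed after the vector search returns a hit. A minimal sketch, assuming the index stores each sentence's position and a child-to-parent mapping (the function and parameter names here are hypothetical):

```python
def expand_window(sentences: list[str], hit_index: int, window: int = 2) -> str:
    """Small-to-big: return the matched sentence plus `window` sentences
    on each side, clipped at the document boundaries."""
    start = max(0, hit_index - window)
    end = min(len(sentences), hit_index + window + 1)
    return " ".join(sentences[start:end])

def parent_for(
    child_id: str,
    child_to_parent: dict[str, str],
    parents: dict[str, str],
) -> str:
    """Parent document retrieval: the child chunk is what gets matched,
    the parent section is what gets sent to the LLM."""
    return parents[child_to_parent[child_id]]
```

Both helpers leave the embedding index untouched, so they can be trialled on an existing index before committing to a re-chunk.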

Q5. How do you handle tables, lists, and code blocks in documents?

Structured content breaks almost every chunking strategy. My approach by content type:

Tables: extract as-is into a single chunk with a text prefix describing what the table contains. Never split a table across chunks.

Bullet lists: keep the list heading with the bullets. A list split mid-item loses its meaning.

Code blocks: split only at function or class boundaries, never mid-function.

In practice I use a pre-processing pass that detects markdown tables (via pipe patterns) and code fences, marks them as atomic units, then routes non-structured text through recursive chunking. LangChain's MarkdownTextSplitter handles this partially but often needs custom post-processing for complex table formats.
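A minimal sketch of such a pre-processing pass is shown below. It assumes markdown input, treats any line that starts and ends with a pipe as a table row, and does not handle nested fences or tables without outer pipes. The `split_atomic` name is hypothetical.

```python
import re

FENCE = "`" * 3  # markdown code-fence marker
TABLE_ROW = re.compile(r"^\s*\|.*\|\s*$")

def split_atomic(markdown: str) -> list[tuple[str, str]]:
    """Return (kind, text) blocks where kind is 'code', 'table' or 'text'.
    Code and table blocks are meant to be kept whole; only 'text' blocks
    should be passed on to a recursive chunker."""
    blocks: list[tuple[str, list[str]]] = []
    in_code = False
    for line in markdown.splitlines():
        if line.strip().startswith(FENCE):
            if in_code:
                blocks[-1][1].append(line)  # closing fence ends the block
                in_code = False
            else:
                blocks.append(("code", [line]))  # opening fence starts one
                in_code = True
            continue
        if in_code:
            kind = "code"
        elif TABLE_ROW.match(line):
            kind = "table"
        else:
            kind = "text"
        if blocks and blocks[-1][0] == kind:
            blocks[-1][1].append(line)
        else:
            blocks.append((kind, [line]))
    joined = [(kind, "\n".join(lines)) for kind, lines in blocks]
    return [(kind, text) for kind, text in joined if text.strip()]
```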

Q6. How would you design the chunking pipeline for a 10,000-page clinical knowledge base that updates monthly?

The pipeline has four stages. First, document classification: route PDFs, Word documents, and HTML guidelines to appropriate parsers (pdfplumber, python-docx, html2text). Second, structural extraction: identify sections, tables, and lists before chunking. Third, chunking by document type: guidelines get recursive chunking on section headers (512 tokens, 64 overlap); drug monographs get fixed chunking on sections with tables kept atomic; patient notes get semantic chunking if used at all, given PHI concerns. Fourth, differential re-indexing on monthly updates: compare document hashes, re-embed only changed sections, and mark superseded chunks with an expired flag rather than deleting them — this preserves the audit trail. For PHI documents, partition the vector store by patient_id and enforce metadata filters on every retrieval query.
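The fourth stage, differential re-indexing with soft delete, can be sketched as below. This is an illustration under assumptions: the index is modelled as an in-memory list, sections are keyed by a hypothetical `section_id`, and the actual re-embedding call is left to the caller.

```python
import hashlib
from dataclasses import dataclass

@dataclass
class Chunk:
    section_id: str
    text: str
    content_hash: str
    expired: bool = False  # soft-delete flag; rows are never removed

def sha256(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def reindex(index: list[Chunk], new_sections: dict[str, str]) -> list[str]:
    """Expire changed or removed sections, append new versions, and
    return the section ids that need re-embedding."""
    live = {c.section_id: c for c in index if not c.expired}
    to_embed = []
    for section_id, text in new_sections.items():
        digest = sha256(text)
        current = live.get(section_id)
        if current is not None and current.content_hash == digest:
            continue  # unchanged: keep the existing embedding
        if current is not None:
            current.expired = True  # superseded, kept for the audit trail
        index.append(Chunk(section_id, text, digest))
        to_embed.append(section_id)
    for section_id, chunk in live.items():
        if section_id not in new_sections:
            chunk.expired = True  # section was removed from the source
    return to_embed
```

Retrieval queries would then filter on `expired == False`, while audits can still read the superseded rows.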

Q7. What is the difference between fixed-size, recursive, and document-aware chunking?

Fixed-size chunking splits text at exactly N tokens regardless of content boundaries. Simple to implement, but frequently cuts sentences and logical units in half. Good for homogeneous text like transcripts or log files.

Recursive chunking tries to split on natural separators in priority order: paragraph breaks, then sentence breaks, then word breaks. LangChain's RecursiveCharacterTextSplitter implements this. It respects sentence structure in most cases and is the default choice for general text.

Document-aware chunking understands the document's structure before splitting: it extracts sections, headings, tables, and lists first, then applies chunking within each structural unit. More complex to build but dramatically better for structured documents like product manuals, regulations, or textbooks.
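The difference between the first two strategies is easiest to see in code. The sketch below is a simplified illustration, not LangChain's implementation: it measures size in characters or whitespace-separated words rather than model tokens, and when a split leaves two pieces in different chunks the separator between them is dropped.

```python
def fixed_chunks(words: list[str], size: int) -> list[str]:
    """Fixed-size: cut every `size` words, ignoring content boundaries."""
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def recursive_chunks(
    text: str,
    max_chars: int,
    separators: tuple[str, ...] = ("\n\n", ". ", " "),
) -> list[str]:
    """Recursive: split on the coarsest separator present, greedily merge
    pieces up to max_chars, and recurse with finer separators on any
    piece that is still too long."""
    text = text.strip()
    if len(text) <= max_chars:
        return [text] if text else []
    for i, sep in enumerate(separators):
        if sep not in text:
            continue
        chunks, current = [], ""
        for part in text.split(sep):
            candidate = part if not current else current + sep + part
            if len(candidate) <= max_chars:
                current = candidate
                continue
            if current:
                chunks.append(current)
            current = ""
            if len(part) <= max_chars:
                current = part
            else:
                chunks.extend(recursive_chunks(part, max_chars, separators[i + 1:]))
        if current:
            chunks.append(current)
        return chunks
    # No separator left to try: fall back to a hard character cut.
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]
```

Document-aware chunking would sit one level above this: extract sections, tables, and lists first, then call a splitter like `recursive_chunks` within each structural unit.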

Q8. How do you evaluate whether your chunking strategy is correct?

Build a labelled eval set: 50 to 100 question/ground-truth pairs where you know which source section contains the answer. Then measure context recall — what fraction of questions have the correct source chunk in the top-K retrieved results. If context recall is below 70%, the chunking strategy is the first place to investigate. Common signals:

Context recall is low but semantic similarity scores are high: chunks are too large, diluting the embedding signal.

Context recall is low and similarity scores are low: chunks are too small and lack sufficient context for the embedding to capture meaning.

Context recall is high but faithfulness is low: retrieval is working but the generation model is ignoring the context — a prompt engineering issue, not a chunking issue.
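Context recall as defined in this answer (the fraction of questions whose known source chunk appears in the top-K results) takes only a few lines to compute. In this sketch `retrieve` is a hypothetical callable wrapping whatever retriever is under test, and chunks are identified by id.

```python
from typing import Callable

def context_recall_at_k(
    eval_set: list[tuple[str, str]],
    retrieve: Callable[[str, int], list[str]],
    k: int = 5,
) -> float:
    """eval_set holds (question, gold_chunk_id) pairs; retrieve(question, k)
    returns the ids of the top-k chunks for that question."""
    if not eval_set:
        raise ValueError("eval_set must not be empty")
    hits = sum(1 for question, gold in eval_set if gold in retrieve(question, k))
    return hits / len(eval_set)
```

Running this once per chunking configuration on the same eval set gives a like-for-like comparison before any generation-side metrics are involved.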

Q9. A colleague suggests just using sentence-level chunks (one sentence per chunk). What would you say?

Sentence-level chunks create two problems. First, most sentences lack sufficient context for the embedding model to understand their topic. "This applies to all employees" embedded alone has an almost meaningless vector — it could relate to anything. The embedding model needs surrounding context to produce a useful vector. Second, the number of chunks explodes: a 100-page document with 50 sentences per page becomes 5,000 chunks. Retrieval becomes noisier as the top-K results scatter across unrelated sentences that happen to share similar phrasing. The right granularity is the minimum text unit that independently answers a question — usually a paragraph or section, not a sentence.

Q10. How do you chunk multi-document corpora where documents have very different structures?

Use a routing pattern. During ingestion, classify each document into a type (FAQ, regulatory guidance, data sheet, article). Each document type gets its own chunking configuration: FAQs get split per Q&A pair; regulatory documents get recursive chunking on numbered sections; data sheets get table-aware parsing. Store the document type in chunk metadata. At retrieval time, you can optionally filter by document type if the query intent is known (a question about a specific product routes to data sheets first). This is more operationally complex but avoids the compromise of a single chunking strategy that works poorly for every document type.
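A minimal sketch of the routing pattern: a registry maps each document type to its own splitter, and the type is written into chunk metadata. The splitters here are deliberately naive assumptions (FAQ questions start a line with "Q:", regulatory sections start a line with a section number), and classification is assumed to have happened upstream.

```python
import re
from typing import Callable

def split_faq(text: str) -> list[str]:
    """One chunk per Q&A pair; assumes each question starts a line with 'Q:'."""
    parts = re.split(r"(?m)^(?=Q:)", text)
    return [p.strip() for p in parts if p.strip()]

def split_numbered_sections(text: str) -> list[str]:
    """One chunk per numbered section; assumes headings like '2' or '2.1'
    at the start of a line, followed by whitespace."""
    parts = re.split(r"(?m)^(?=\d+(?:\.\d+)*\s)", text)
    return [p.strip() for p in parts if p.strip()]

def split_paragraphs(text: str) -> list[str]:
    """Fallback for unclassified documents: split on blank lines."""
    return [p.strip() for p in text.split("\n\n") if p.strip()]

CHUNKERS: dict[str, Callable[[str], list[str]]] = {
    "faq": split_faq,
    "regulatory": split_numbered_sections,
}

def chunk_document(text: str, doc_type: str) -> list[dict]:
    """Route to the splitter for doc_type and tag every chunk with it."""
    chunker = CHUNKERS.get(doc_type, split_paragraphs)
    return [{"text": chunk, "doc_type": doc_type} for chunk in chunker(text)]
```

Adding a new document type then means registering one more function in `CHUNKERS` rather than re-tuning a shared configuration.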

Q11. What metadata should you store with each chunk?

At minimum: document_id, source_title, page_number or section_heading, chunk_index, chunk_total, created_at, content_hash. For production systems also store: document_version (to support differential re-indexing), expiry_date (for time-sensitive content like guidelines with review cycles), topic_tags (for pre-filtering before vector search), and language (for multilingual corpora). Rich metadata enables filtered search (filter on metadata first, then rank the remaining chunks by similarity), which is significantly more precise than pure vector search on large corpora. Note that "hybrid search" usually refers to combining keyword and vector ranking; metadata filtering is a separate technique that works alongside it.

Q12. What is the "lost in the middle" problem and how does chunking affect it?

LLMs attend most strongly to content at the beginning and end of their context window. Content in the middle gets less attention. In RAG, if you retrieve 8 chunks and inject them in order of relevance, the mid-ranked chunks land in the middle positions and may be largely ignored by the model, while the privileged final position is spent on the least relevant chunk. The fix is reordering at retrieval time: put the highest-relevance chunk first and the second-highest last, then keep alternating inward so the weakest chunks sit in the middle. This is sometimes called the "reverse pyramid" ordering. Smaller chunks help too: they reduce the total context length, shrinking the "middle" region where attention degrades.
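The reordering step is a pure list operation on the ranked results. A minimal sketch, assuming the input is sorted from most to least relevant (the function name is hypothetical):

```python
def reorder_for_long_context(ranked: list[str]) -> list[str]:
    """Alternate ranked chunks between the front and the back of the prompt
    so the strongest sit at the edges and the weakest in the middle."""
    front, back = [], []
    for i, chunk in enumerate(ranked):
        (front if i % 2 == 0 else back).append(chunk)
    return front + back[::-1]
```

With eight chunks ranked 1 to 8, the resulting order is 1, 3, 5, 7, 8, 6, 4, 2: rank 1 comes first, rank 2 comes last, and ranks 7 and 8 occupy the middle.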

Interview Answer Summary

Chunking decisions follow this hierarchy: pick a chunk size that fits within the embedding model's context window and matches query granularity; pick a splitting method (fixed for homogeneous text, recursive for general prose, document-aware for structured documents, semantic for high-value varied documents); pick overlap of 10 to 20% to handle boundaries; add metadata for filtering; deduplicate retrieved chunks at query time. Measure with context recall on a labelled eval set: below 70% means the chunking strategy needs tuning. For clinical and regulatory RAG: treat tables and lists as atomic units, apply document-aware chunking per document type, and use differential re-indexing with soft-delete on updates.
