Parent Document Retrieval

The Chunk Size Dilemma

RAG relies on chunking documents before embedding. Chunk size involves a fundamental trade-off:

Small chunks (100-200 tokens):
  ✓ Precise retrieval — small chunks match queries tightly
  ✓ Less noise per chunk
  ✗ Lost context — a sentence is meaningless without its surrounding paragraph
  ✗ Fragmented answers — the retrieved chunk doesn't contain full reasoning

Large chunks (800-1500 tokens):
  ✓ More context per chunk — the model has fuller information
  ✗ Lower retrieval precision — large chunks may contain the answer but
     also lots of irrelevant text
  ✗ Diluted embedding — hard to match a very specific query to a long passage

Parent document retrieval resolves this trade-off by searching over small chunks but returning the large chunks that contain them.
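
The dilution effect described above can be seen in a toy setting. The sketch below is only an illustration: it uses bag-of-words cosine similarity as a crude stand-in for a dense embedding model, and the helper names (`bag_of_words`, `cosine`) and the placeholder filler text are assumptions made for this example. Real embedding models behave differently in detail, but the tendency is similar: the same matching sentence scores lower once it is buried in unrelated text.

```python
import math
import re
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Lowercased word counts; a crude stand-in for a dense embedding."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


query = "warfarin pregnancy"

# Small chunk: just the sentence that answers the query
small_chunk = "Warfarin is contraindicated in pregnancy."

# Large chunk: the same sentence plus placeholder filler unrelated to the query
large_chunk = (
    small_chunk
    + " This placeholder text stands for the rest of a long section."
    + " It adds many words that have nothing to do with the query."
)

small_score = cosine(bag_of_words(query), bag_of_words(small_chunk))
large_score = cosine(bag_of_words(query), bag_of_words(large_chunk))

print(f"small chunk: {small_score:.3f}, large chunk: {large_score:.3f}")
```

The large chunk still contains the answer, yet its score drops because the extra words enlarge the vector without adding any overlap with the query.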

Parent Document Architecture

Indexing time:
  1. Split documents into PARENT chunks (large, ~1000 tokens)
  2. Split each parent into CHILD chunks (small, ~100-200 tokens)
  3. Embed the CHILD chunks — they're indexed for retrieval
  4. Store child→parent mapping

Retrieval time:
  1. Embed the query
  2. Search child chunks (precise match due to small size)
  3. Look up the PARENT chunk for each matched child
  4. Return the parent chunks to the LLM (full context)

Result: fine-grained search, full-context synthesis

Implementation

Python

import uuid
from dataclasses import dataclass, field
from sentence_transformers import SentenceTransformer
import numpy as np

@dataclass
class ParentChunk:
    id: str
    content: str  # large full-context chunk

@dataclass
class ChildChunk:
    id: str
    parent_id: str
    content: str   # small, precisely embeddable chunk
    embedding: np.ndarray | None = None

class ParentDocumentRetriever:
    def __init__(self, embedder: SentenceTransformer, parent_chunk_size: int = 1000,
                 child_chunk_size: int = 150):
        self.embedder = embedder
        self.parent_chunk_size = parent_chunk_size
        self.child_chunk_size = child_chunk_size
        self.parents: dict[str, ParentChunk] = {}
        self.children: list[ChildChunk] = []
        self.child_embeddings: np.ndarray | None = None

    def _split(self, text: str, chunk_size: int) -> list[str]:
        # Whitespace split: chunk_size counts words, a rough proxy for tokens
        words = text.split()
        return [
            " ".join(words[i:i + chunk_size])
            for i in range(0, len(words), chunk_size)
        ]

    def add_document(self, text: str) -> None:
        new_children: list[ChildChunk] = []
        for parent_text in self._split(text, self.parent_chunk_size):
            parent_id = str(uuid.uuid4())
            self.parents[parent_id] = ParentChunk(id=parent_id, content=parent_text)

            # Each child records the id of the parent it was cut from
            for child_text in self._split(parent_text, self.child_chunk_size):
                new_children.append(
                    ChildChunk(id=str(uuid.uuid4()), parent_id=parent_id, content=child_text)
                )

        if not new_children:
            return  # empty document: nothing to embed

        # Embed only the new children; embeddings from earlier documents are reused
        embeddings = self.embedder.encode(
            [c.content for c in new_children], show_progress_bar=False
        )
        for child, emb in zip(new_children, embeddings):
            child.embedding = emb
        self.children.extend(new_children)
        self.child_embeddings = (
            embeddings
            if self.child_embeddings is None
            else np.vstack([self.child_embeddings, embeddings])
        )

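    def retrieve(self, query: str, top_k: int = 5) -> list[ParentChunk]:
        if self.child_embeddings is None:
            return []  # nothing has been indexed yet
        query_emb = self.embedder.encode([query])[0]
        # Cosine similarity between the query and every child embedding
        similarities = self.child_embeddings @ query_emb / (
            np.linalg.norm(self.child_embeddings, axis=1) * np.linalg.norm(query_emb) + 1e-9
        )
        # Over-fetch children (3x top_k) because several may share one parent
        top_child_indices = np.argsort(similarities)[::-1][:top_k * 3]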

        # Deduplicate by parent
        seen_parent_ids: set[str] = set()
        parents: list[ParentChunk] = []
        for idx in top_child_indices:
            parent_id = self.children[idx].parent_id
            if parent_id not in seen_parent_ids:
                seen_parent_ids.add(parent_id)
                parents.append(self.parents[parent_id])
            if len(parents) >= top_k:
                break

        return parents

LangChain Parent Document Retriever

Python

from langchain.retrievers import ParentDocumentRetriever
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.storage import InMemoryStore
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma

# Child splitter (small, for embedding and search).
# chunk_size counts characters by default, not tokens.
child_splitter = RecursiveCharacterTextSplitter(chunk_size=200, chunk_overlap=20)

# Parent splitter (large, for context)
parent_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)

# Vector store for child chunks
vectorstore = Chroma(embedding_function=OpenAIEmbeddings())

# Document store for parent chunks
docstore = InMemoryStore()

retriever = ParentDocumentRetriever(
    vectorstore=vectorstore,
    docstore=docstore,
    child_splitter=child_splitter,
    parent_splitter=parent_splitter,
)

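# `documents` is assumed to be a list of LangChain Document objects loaded elsewhere
retriever.add_documents(documents)  # indexes parents and embeds children

# Retrieval returns full parent chunks.
# invoke() replaces get_relevant_documents(), which is deprecated in recent LangChain releases.
results = retriever.invoke("Warfarin dose adjustment CYP2C9")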

When to Use Parent Document Retrieval

Use when:
  Documents have meaningful structure (section → paragraph → sentence)
  Answers require surrounding context to be accurate
  Clinical notes, guidelines, research papers with multi-paragraph arguments

Don't use when:
  Documents are already short (each chunk IS the parent)
  Each sentence is standalone and meaningful without context
  Storage/memory is very constrained (the text is stored twice: once as parents, once as children)
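
The storage cost can be estimated with some rough arithmetic. The sketch below is a back-of-the-envelope calculation, not a measurement: the function name `estimate_index_bytes` and every default (about 4 bytes of text per token, a 384-dimensional float32 embedding, 150-token children, no chunk overlap) are assumptions chosen for illustration.

```python
import math


def estimate_index_bytes(
    total_tokens: int,
    child_tokens: int = 150,    # assumed child chunk size
    embedding_dim: int = 384,   # assumed embedding width
    bytes_per_token: int = 4,   # rough figure for English text (assumption)
    bytes_per_float: int = 4,   # float32
) -> dict[str, int]:
    """Compare a child-only index with a parent + child index."""
    n_children = math.ceil(total_tokens / child_tokens)
    text_bytes = total_tokens * bytes_per_token
    embedding_bytes = n_children * embedding_dim * bytes_per_float

    # Plain small-chunk RAG: child text plus child embeddings
    child_only = text_bytes + embedding_bytes
    # Parent document retrieval: the same, plus a second copy of the text as parents
    parent_child = child_only + text_bytes

    return {
        "n_children": n_children,
        "child_only": child_only,
        "parent_child": parent_child,
    }


estimate = estimate_index_bytes(1_000_000)
print(estimate)
```

Under these assumptions the extra cost is one more copy of the text. Only the children are embedded, so the embedding storage is the same as in a child-only index.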

Medical use case:
  Clinical guideline (20 pages): parent chunks = sections, child chunks = sentences
  Query: "Warfarin in pregnancy"
  Child match: "Warfarin is contraindicated in pregnancy" (sentence)
  Parent returned: full contraindication section with clinical context and alternatives
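
The medical use case above can be sketched end to end with sections as parents and sentences as children. This is a minimal sketch under stated assumptions: the section bodies are placeholders rather than real guideline text, word-overlap (Jaccard) scoring stands in for embedding similarity, and the names `sections`, `words`, and `retrieve_sections` are hypothetical.

```python
import re

# Parents are whole sections. Apart from the first sentence (taken from the
# example above), the bodies are placeholders, not guideline text.
sections = {
    "contraindications": (
        "Warfarin is contraindicated in pregnancy. "
        "Placeholder sentence standing for the clinical context. "
        "Placeholder sentence standing for the alternatives."
    ),
    "monitoring": (
        "Placeholder sentence standing for monitoring guidance. "
        "Placeholder sentence standing for follow-up scheduling."
    ),
}


def words(text: str) -> set[str]:
    """Lowercased word set used for overlap scoring."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


# Indexing: one child per sentence, each tagged with its parent section id
children: list[tuple[str, str]] = []
for section_id, body in sections.items():
    for sentence in re.split(r"(?<=[.!?])\s+", body.strip()):
        children.append((sentence, section_id))


def retrieve_sections(query: str, top_k: int = 1) -> list[str]:
    """Score sentences against the query, then return their parent sections."""
    q = words(query)
    ranked = sorted(
        children,
        key=lambda child: len(q & words(child[0])) / len(q | words(child[0])),
        reverse=True,
    )
    seen: list[str] = []
    for _, section_id in ranked:
        if section_id not in seen:  # deduplicate by parent
            seen.append(section_id)
        if len(seen) >= top_k:
            break
    return [sections[section_id] for section_id in seen]


result = retrieve_sections("Warfarin in pregnancy")
print(result[0])
```

The match happens on a single sentence, but the caller receives the whole section, including the sentences that would never have matched the query on their own.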

Interview Answer

"Parent document retrieval decouples search precision from context size. During indexing, each document is split into large parent chunks (~1000 tokens), and each parent is split again into small child chunks (~150 tokens). Only the child chunks are embedded and indexed. At search time, the query is matched against child chunks (precise), but the retriever returns the full parent chunk (context-rich). This gives fine-grained query matching without the lost-context problem of small chunks. It works best for structured documents like clinical guidelines, where a single sentence makes little sense without its surrounding paragraph. The trade-off is storing both parent and child chunks and the added complexity of the mapping layer."