Semantic Chunking: Split on Meaning

The Idea

Instead of splitting on character counts or separators, detect where the topic changes by measuring embedding similarity between adjacent sentences:

Sentence 1: "Warfarin inhibits vitamin K epoxide reductase."
    sim(1, 2) = 0.89
Sentence 2: "This blocks synthesis of clotting factors II, VII, IX, X."
    sim(2, 3) = 0.91
Sentence 3: "INR is monitored to keep anticoagulation within range."
    sim(3, 4) = 0.31  ← SPLIT
Sentence 4: "Atrial fibrillation is the most common cardiac arrhythmia."
    sim(4, 5) = 0.85
Sentence 5: "AF affects approximately 2% of the general population."

Split point: between sentences 3 and 4, where the similarity (illustrative values) drops below the threshold.

A topic shift shows up as a valley in adjacent-sentence similarity; split there.
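
The thresholding step on its own can be sketched without any embedding model. The function name `find_split_boundaries` and the similarity values are illustrative assumptions: one made-up similarity per adjacent pair, with the valley placed between sentences 3 and 4 as in the example.

```python
def find_split_boundaries(pair_similarities, threshold=0.5):
    """Return 1-based sentence numbers after which to split.

    pair_similarities[i] is the similarity between sentence i+1 and sentence i+2.
    """
    return [i + 1 for i, sim in enumerate(pair_similarities) if sim < threshold]


# Illustrative similarities for the pairs (1,2), (2,3), (3,4), (4,5)
pair_sims = [0.89, 0.91, 0.31, 0.85]
splits = find_split_boundaries(pair_sims, threshold=0.5)
print(splits)  # [3] -> split after sentence 3
```

Only the pair that straddles the topic change falls below the threshold, so the five sentences become two chunks.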


Algorithm

Python
import re

import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")


def semantic_chunk(
    text: str,
    similarity_threshold: float = 0.5,
    buffer_size: int = 1,          # sentences averaged on each side of a boundary
    min_chunk_size: int = 100,     # chars; chunks shorter than this get merged
) -> list[str]:
    # Naive sentence split: breaks after ., ! or ? followed by whitespace.
    # It also splits on abbreviations such as "e.g. ".
    sentences = [s for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s]

    if len(sentences) <= 1:
        return [text]

    # Embed all sentences at once (batched for efficiency)
    embeddings = model.encode(sentences, batch_size=64, normalize_embeddings=True)

    # Score every boundary b, which sits between sentences b-1 and b
    split_points = [0]
    for b in range(1, len(sentences)):
        # Left context: mean of up to buffer_size sentences before the boundary
        left = embeddings[max(0, b - buffer_size):b].mean(axis=0)
        # Right context: mean of up to buffer_size sentences after the boundary
        right = embeddings[b:b + buffer_size].mean(axis=0)

        # A mean of unit vectors is not unit length, so normalise explicitly
        sim = float(np.dot(left, right) / (np.linalg.norm(left) * np.linalg.norm(right)))
        if sim < similarity_threshold:
            split_points.append(b)

    split_points.append(len(sentences))

    # Build chunks from split points, folding tiny chunks into a neighbour
    chunks = []
    carry = ""  # tiny leading text waiting to be merged into the next chunk
    for start, end in zip(split_points, split_points[1:]):
        chunk = (carry + " " + " ".join(sentences[start:end])).strip()
        carry = ""
        if len(chunk) >= min_chunk_size:
            chunks.append(chunk)
        elif chunks:
            chunks[-1] += " " + chunk  # merge tiny chunk with previous
        else:
            carry = chunk  # no previous chunk yet, so merge forward instead
    if carry:
        chunks.append(carry)  # whole text was shorter than min_chunk_size

    return chunks
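
To see what the buffer does without downloading a model, the boundary scoring can be run on toy vectors. This is a minimal pure-Python sketch, not the function above: the helper names and the 2-D "embeddings" are made up for illustration, and it assumes each boundary is scored by comparing the mean of the vectors on its left with the mean of those on its right.

```python
import math


def mean_vector(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def cosine(a, b):
    """Cosine similarity; normalises explicitly because means are not unit length."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def boundary_similarities(embeddings, buffer_size=1):
    """Similarity across each boundary b, which sits between items b-1 and b."""
    sims = []
    for b in range(1, len(embeddings)):
        left = mean_vector(embeddings[max(0, b - buffer_size):b])
        right = mean_vector(embeddings[b:b + buffer_size])
        sims.append(cosine(left, right))
    return sims


# Toy 2-D "embeddings": three vectors near the x-axis, then two near the y-axis
toy = [(1.0, 0.0), (0.98, 0.2), (0.95, 0.3), (0.1, 0.99), (0.0, 1.0)]
sims = boundary_similarities(toy, buffer_size=2)
lowest = min(range(len(sims)), key=sims.__getitem__) + 1
print([round(s, 2) for s in sims], "lowest at boundary", lowest)
```

The lowest score lands on boundary 3, between the third and fourth vectors. With `buffer_size=2` the neighbouring boundaries also dip somewhat, because their windows already contain a vector from the other topic: larger buffers smooth out noise but smear the valley.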

Tuning the Threshold

Python
def find_split_threshold(
    sample_texts: list[str],
    target_chunk_count: int,
) -> float:
    """Calibrate a threshold that yields about target_chunk_count chunks per sample text.

    Assumes buffer_size=1 and ignores the later merging of tiny chunks.
    """
    import re

    import numpy as np
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("all-MiniLM-L6-v2")
    all_similarities = []
    docs_used = 0

    for text in sample_texts:
        sentences = [s for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s]
        if len(sentences) < 2:
            continue
        docs_used += 1
        embeddings = model.encode(sentences, normalize_embeddings=True)
        for i in range(len(embeddings) - 1):
            sim = float(np.dot(embeddings[i], embeddings[i + 1]))
            all_similarities.append(sim)

    if not all_similarities:
        raise ValueError("sample_texts must contain at least one multi-sentence text")

    # N chunks per document need N - 1 splits, so that many of the lowest
    # similarities across the sample must fall below the threshold.
    all_similarities.sort()
    target_splits = docs_used * (target_chunk_count - 1)
    idx = min(max(target_splits, 0), len(all_similarities) - 1)
    return all_similarities[idx]

# Example calibration
# sample_guidelines: a list[str] of representative documents, defined elsewhere
threshold = find_split_threshold(sample_guidelines, target_chunk_count=10)
print(f"Calibrated threshold: {threshold:.3f}")
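
The calibration rests on a simple property of sorted similarities: taking the k-th smallest value as the threshold puts exactly k values strictly below it, which gives k splits. A small sketch with made-up similarities, where `threshold_for_splits` is a hypothetical helper and the values are assumed to be distinct:

```python
def threshold_for_splits(similarities, n_splits):
    """Return a threshold with exactly n_splits values strictly below it.

    Assumes the similarity values are distinct.
    """
    ordered = sorted(similarities)
    if not 0 <= n_splits < len(ordered):
        raise ValueError("n_splits must be between 0 and len(similarities) - 1")
    return ordered[n_splits]


sims = [0.89, 0.91, 0.31, 0.85, 0.78, 0.42, 0.88]
threshold = threshold_for_splits(sims, n_splits=2)
below = [s for s in sims if s < threshold]
print(threshold, below)  # 0.78 [0.31, 0.42]
```

Two splits over seven boundaries would produce three chunks from these eight sentences, before any merging of undersized chunks.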

LangChain SemanticChunker

Python
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

chunker = SemanticChunker(
    embeddings=embeddings,
    breakpoint_threshold_type="percentile",  # "percentile", "standard_deviation", "interquartile"
    breakpoint_threshold_amount=95,          # split where the distance between adjacent sentence groups exceeds its 95th percentile
)

chunks = chunker.split_text(clinical_document)  # clinical_document: a str loaded elsewhere
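
The threshold types differ only in how they turn the list of adjacent distances into a cut-off. This is a rough pure-Python sketch of the idea behind the percentile and standard-deviation rules, not LangChain's source. It assumes the rules operate on cosine distance (1 - similarity) and that the standard-deviation rule is mean plus `amount` standard deviations. The function names and similarity values are made up.

```python
import statistics


def percentile(values, pct):
    """Linear-interpolation percentile (the default method of numpy.percentile)."""
    ordered = sorted(values)
    rank = (len(ordered) - 1) * pct / 100
    lower = int(rank)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (rank - lower)


def breakpoints(similarities, rule="percentile", amount=95):
    """Return indices of adjacent pairs whose distance exceeds the rule's cut-off."""
    distances = [1 - s for s in similarities]
    if rule == "percentile":
        cutoff = percentile(distances, amount)
    elif rule == "standard_deviation":
        cutoff = statistics.mean(distances) + amount * statistics.pstdev(distances)
    else:
        raise ValueError(f"unsupported rule: {rule}")
    return [i for i, d in enumerate(distances) if d > cutoff]


sims = [0.89, 0.91, 0.31, 0.85, 0.88, 0.90, 0.87, 0.42, 0.86, 0.92]
print(breakpoints(sims, "percentile", 80))          # [2, 7]
print(breakpoints(sims, "standard_deviation", 1))   # [2, 7]
```

A percentile cut-off adapts to each document, so the fraction of boundaries that become splits stays roughly fixed regardless of how similar the sentences are overall, whereas a fixed similarity threshold can yield very different chunk counts from one document to the next.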

Semantic vs Fixed vs Recursive

Method         | Chunk size  | Preserves topic boundaries | Speed   | Retrieval quality
---------------|-------------|----------------------------|---------|------------------
Fixed          | Uniform     | No                         | Fast    | Baseline
Recursive      | Variable    | Moderate                   | Fast    | Good
Semantic       | Variable    | Yes                        | Slow    | Often best

Latency cost of semantic chunking (per 10K-word document; rough figures that vary with model and hardware):
  Embedding ~500 sentences: ~2 seconds on GPU, ~15 seconds on CPU
  Acceptable at index time, where chunking runs; never acceptable on the query path
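
These figures scale roughly linearly with sentence count, so a back-of-the-envelope estimate for a whole corpus is simple arithmetic. The sketch below derives throughput from the numbers above and assumes 20 words per sentence (10K words over 500 sentences). The function name and the 1,000-document corpus are illustrative.

```python
def estimate_embedding_seconds(word_count, sentences_per_second, words_per_sentence=20.0):
    """Estimate embedding time for one document from its word count."""
    sentences = word_count / words_per_sentence
    return sentences / sentences_per_second


GPU_RATE = 500 / 2    # 250 sentences/s, from the ~2 s figure
CPU_RATE = 500 / 15   # about 33 sentences/s, from the ~15 s figure

doc_gpu = estimate_embedding_seconds(10_000, GPU_RATE)
doc_cpu = estimate_embedding_seconds(10_000, CPU_RATE)
corpus_cpu_hours = 1_000 * doc_cpu / 3600  # 1,000 such documents on CPU

print(f"{doc_gpu:.1f} s on GPU, {doc_cpu:.1f} s on CPU, "
      f"{corpus_cpu_hours:.1f} h for 1,000 documents on CPU")
```

At these rates a thousand long documents cost about four CPU hours, which is fine for a one-off index build but adds up quickly for pipelines that re-index often.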

When semantic chunking wins:
  Documents with varied topic density (intro + detail + summary sections)
  Long documents where fixed-size would create many mid-topic splits
  High-value documents where retrieval quality justifies extra compute

When fixed/recursive is fine:
  High-volume ingestion pipelines (millions of documents)
  Short documents (< 500 words) — minimal gain over recursive
  Time-constrained indexing pipelines
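
The two lists can be folded into a routing rule at ingestion time. The function below is a hypothetical sketch: `choose_chunker`, its parameters, and the 5,000-word "long document" cut-off are assumptions for illustration. Only the 500-word and millions-of-documents figures come from the lists above.

```python
def choose_chunker(word_count, high_value, corpus_size,
                   time_constrained=False, long_doc_words=5_000):
    """Pick a chunking strategy name for one document."""
    if word_count < 500:
        return "recursive"   # short document: minimal gain from semantic
    if time_constrained or corpus_size >= 1_000_000:
        return "recursive"   # throughput matters more than per-chunk quality
    if high_value or word_count >= long_doc_words:
        return "semantic"    # quality justifies the extra embedding cost
    return "recursive"


print(choose_chunker(10_000, high_value=True, corpus_size=5_000))      # semantic
print(choose_chunker(300, high_value=True, corpus_size=5_000))         # recursive
print(choose_chunker(10_000, high_value=True, corpus_size=2_000_000))  # recursive
```

Checking the cheap disqualifiers first keeps the rule conservative: a document only gets semantic chunking when nothing about the pipeline rules it out.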

Interview Answer

"Semantic chunking uses embedding similarity between adjacent sentences to detect topic shifts: splits happen where similarity drops below a threshold rather than at arbitrary character counts. This preserves topical coherence within each chunk, which improves retrieval because the embedding of a coherent chunk represents a single topic better than a random slice. The trade-off is cost: every sentence must be embedded at index time, which is orders of magnitude slower than fixed-size splitting (seconds per long document rather than a near-instant string slice). Use semantic chunking for high-value documents where retrieval quality matters most (clinical guidelines, regulatory documents) and fixed or recursive chunking for high-volume ingestion."