RAG Systems · Lesson 12 of 24
Semantic Chunking: Split on Meaning
The Idea
Instead of splitting on character counts or separators, detect where the topic changes by measuring embedding similarity between adjacent sentences:
Sentence 1: "Warfarin inhibits vitamin K epoxide reductase." sim=0.89
Sentence 2: "This blocks synthesis of clotting factors II, VII, IX, X." sim=0.91
Sentence 3: "INR is monitored to keep anticoagulation within range." sim=0.85
Sentence 4: "Atrial fibrillation is the most common cardiac arrhythmia." sim=0.31 ← SPLIT
Sentence 5: "AF affects approximately 2% of the general population."
Split point: between sentences 3 and 4 (similarity drops below threshold).
Topic shifts produce a similarity valley; split there.
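The valley-detection step can be sketched without an embedding model. The snippet below is a minimal illustration, assuming toy 2-D vectors stand in for sentence embeddings; `cosine`, `find_split_points`, and the vector values are hypothetical, and the 0.5 threshold is an arbitrary choice for the example.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def find_split_points(vectors: list[list[float]], threshold: float) -> list[int]:
    """Return every index i where vectors[i] should start a new chunk."""
    splits = []
    for i in range(1, len(vectors)):
        # Compare each vector with the one immediately before it
        if cosine(vectors[i - 1], vectors[i]) < threshold:
            splits.append(i)
    return splits


# Toy "embeddings" (assumed values): three vectors pointing one way,
# then two pointing in a clearly different direction.
toy_vectors = [
    [1.0, 0.1],
    [0.9, 0.2],
    [1.0, 0.3],
    [0.1, 1.0],
    [0.2, 0.9],
]

split_points = find_split_points(toy_vectors, threshold=0.5)
print(split_points)  # [3]
```

Only the boundary before the fourth vector (index 3) falls below the threshold, which mirrors the single topic shift in the example above.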
Algorithm
import re

import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")


def _unit_mean(vectors: np.ndarray) -> np.ndarray:
    """Average a window of embeddings and rescale the result to unit length."""
    mean = vectors.mean(axis=0)
    return mean / np.linalg.norm(mean)


def semantic_chunk(
    text: str,
    similarity_threshold: float = 0.5,
    buffer_size: int = 1,        # sentences averaged on each side of a boundary
    min_chunk_size: int = 100,   # chars; avoid emitting very small chunks
) -> list[str]:
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    # Note: fragments of 10 characters or fewer are discarded, not kept
    sentences = [s for s in sentences if len(s) > 10]
    if len(sentences) <= 1:
        return [text]

    # Embed all sentences at once (batched for efficiency)
    embeddings = model.encode(sentences, batch_size=64, normalize_embeddings=True)

    # Score every boundary between sentence i-1 and sentence i
    split_points = [0]
    for i in range(1, len(sentences)):
        # Left context: up to buffer_size sentences ending at i-1
        left = _unit_mean(embeddings[max(0, i - buffer_size):i])
        # Right context: up to buffer_size sentences starting at i
        right = _unit_mean(embeddings[i:i + buffer_size])
        sim = float(np.dot(left, right))  # cosine similarity of unit vectors
        if sim < similarity_threshold:
            split_points.append(i)
    split_points.append(len(sentences))

    # Build chunks from split points
    chunks = []
    pending = ""  # text too short to stand alone and with no previous chunk yet
    for start, end in zip(split_points, split_points[1:]):
        chunk = (pending + " " + " ".join(sentences[start:end])).strip()
        pending = ""
        if len(chunk) >= min_chunk_size:
            chunks.append(chunk)
        elif chunks:
            chunks[-1] += " " + chunk  # merge tiny chunk into the previous one
        else:
            pending = chunk  # carry forward instead of dropping the text
    if pending:
        chunks.append(pending)  # everything was shorter than min_chunk_size
    return chunks
Tuning the Threshold
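Calibration comes down to choosing a percentile of the adjacent-sentence similarities. As a minimal, model-free sketch of that idea, assume a hand-written list of similarity values (`sims`, invented for illustration) and a hypothetical helper `threshold_for_split_rate`:

```python
def threshold_for_split_rate(similarities: list[float], split_rate: float) -> float:
    """Pick the similarity value below which roughly `split_rate` of boundaries fall."""
    if not similarities:
        raise ValueError("need at least one similarity value")
    if not 0.0 < split_rate < 1.0:
        raise ValueError("split_rate must be strictly between 0 and 1")
    ordered = sorted(similarities)
    # Clamp the index so it always points at a real element
    idx = min(int(len(ordered) * split_rate), len(ordered) - 1)
    return ordered[idx]


# Assumed similarities for the ten boundaries of an eleven-sentence passage;
# sims[i] is the similarity between sentence i and sentence i + 1.
sims = [0.91, 0.88, 0.35, 0.86, 0.90, 0.41, 0.87, 0.93, 0.89, 0.84]

threshold = threshold_for_split_rate(sims, split_rate=0.2)
# A boundary whose similarity is below the threshold starts a new chunk
splits = [i + 1 for i, s in enumerate(sims) if s < threshold]

print(threshold)  # 0.84
print(splits)     # [3, 6]
```

With a 20% split rate, the two lowest-similarity boundaries fall below the threshold, so new chunks would start at sentences 3 and 6 (0-indexed).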
import re

import numpy as np
from sentence_transformers import SentenceTransformer


def find_split_threshold(
    sample_texts: list[str],
    target_sentences_per_chunk: int,
) -> float:
    """Calibrate the threshold so chunks average roughly the target number of sentences."""
    model = SentenceTransformer("all-MiniLM-L6-v2")
    all_similarities = []
    for text in sample_texts:
        sentences = re.split(r'(?<=[.!?])\s+', text.strip())
        sentences = [s for s in sentences if len(s) > 10]  # same filter as semantic_chunk
        if len(sentences) < 2:
            continue
        embeddings = model.encode(sentences, normalize_embeddings=True)
        for i in range(len(embeddings) - 1):
            sim = float(np.dot(embeddings[i], embeddings[i + 1]))
            all_similarities.append(sim)
    if not all_similarities:
        raise ValueError("sample_texts needs at least one text with two or more sentences")

    # Splitting at 1 boundary in N yields chunks of about N sentences,
    # so the threshold is the 1/N quantile of the observed similarities
    all_similarities.sort()
    split_rate = 1 / target_sentences_per_chunk
    idx = min(int(len(all_similarities) * split_rate), len(all_similarities) - 1)
    return all_similarities[idx]


# Example calibration (sample_guidelines: a list of representative document strings)
threshold = find_split_threshold(sample_guidelines, target_sentences_per_chunk=10)
print(f"Calibrated threshold: {threshold:.3f}")
LangChain SemanticChunker
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
chunker = SemanticChunker(
    embeddings=embeddings,
    breakpoint_threshold_type="percentile",  # also "standard_deviation", "interquartile"
    # Split wherever the distance between adjacent sentence groups
    # exceeds the 95th percentile of all such distances in the document
    breakpoint_threshold_amount=95,
)
chunks = chunker.split_text(clinical_document)
Semantic vs Fixed vs Recursive
Method | Chunk size | Preserves structure | Speed | Quality
---------------|-------------|---------------------|---------|--------
Fixed | Uniform | No | Fast | Baseline
Recursive | Variable | Moderate | Fast | Good
Semantic | Variable | Yes | Slow | Best
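To make the "Preserves structure: No" entry for fixed-size splitting concrete, here is a minimal sketch of a character-count splitter. `fixed_size_chunks` is a hypothetical helper, and the 60-character size is chosen only to force a cut inside the second sentence.

```python
def fixed_size_chunks(text: str, size: int) -> list[str]:
    """Split text into consecutive slices of at most `size` characters."""
    if size <= 0:
        raise ValueError("size must be positive")
    return [text[i:i + size] for i in range(0, len(text), size)]


text = (
    "Warfarin inhibits vitamin K epoxide reductase. "
    "Atrial fibrillation is the most common cardiac arrhythmia."
)
chunks = fixed_size_chunks(text, size=60)
for chunk in chunks:
    print(repr(chunk))
```

The first chunk ends partway through "fibrillation", so it mixes the complete warfarin sentence with a fragment of the unrelated atrial fibrillation sentence. That is the kind of mid-topic split semantic chunking is meant to avoid.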
Latency cost of semantic chunking (per 10K-word document):
Embedding 500 sentences: ~2 seconds on GPU, ~15 seconds on CPU
Acceptable at index time; never acceptable at query time
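The per-document figures above can be scaled into a rough indexing budget. This is a back-of-the-envelope sketch that assumes cost grows linearly with document count and that every document resembles the 10K-word example; `estimate_embedding_seconds` and the 100,000-document corpus size are hypothetical.

```python
def estimate_embedding_seconds(num_documents: int, seconds_per_document: float) -> float:
    """Total embedding time, assuming cost scales linearly with document count."""
    return num_documents * seconds_per_document


# Per-document figures quoted above for a 10K-word (about 500-sentence) document
SECONDS_PER_DOC = {"GPU": 2.0, "CPU": 15.0}

corpus_size = 100_000  # hypothetical corpus size, for illustration only
estimates = {
    device: estimate_embedding_seconds(corpus_size, per_doc)
    for device, per_doc in SECONDS_PER_DOC.items()
}
for device, seconds in estimates.items():
    print(f"{device}: {seconds / 3600:.1f} hours for {corpus_size:,} documents")
```

Under these assumptions the corpus needs about 56 hours on a single GPU and about 417 hours on CPU, which is why high-volume pipelines often fall back to fixed or recursive splitting.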
When semantic chunking wins:
Documents with varied topic density (intro + detail + summary sections)
Long documents where fixed-size would create many mid-topic splits
High-value documents where retrieval quality justifies extra compute
When fixed/recursive is fine:
High-volume ingestion pipelines (millions of documents)
Short documents (< 500 words) — minimal gain over recursive
Time-constrained indexing pipelines
Interview Answer
"Semantic chunking uses embedding similarity between adjacent sentences to detect topic shifts — splits happen where similarity drops below a threshold rather than at arbitrary character counts. This preserves topical coherence within each chunk, which improves retrieval because the embedding of a coherent chunk represents a single topic better than a random slice. The trade-off is cost: embedding every sentence at index time is 10-100× slower than fixed-size splitting. Use semantic chunking for high-value documents where retrieval quality matters most (clinical guidelines, regulatory documents) and fixed or recursive chunking for high-volume ingestion."