Semantic Chunking
How semantic chunking uses embedding similarity to find natural topic boundaries, when it outperforms structural chunking, and its computational cost.
The Idea
Instead of splitting on character counts or separators, detect where the topic changes by measuring embedding similarity between adjacent sentences:
Sentence 1: "Warfarin inhibits vitamin K epoxide reductase." sim=0.89
Sentence 2: "This blocks synthesis of clotting factors II, VII, IX, X." sim=0.91
Sentence 3: "INR is monitored to keep anticoagulation within range." sim=0.85
Sentence 4: "Atrial fibrillation is the most common cardiac arrhythmia." sim=0.31 ← SPLIT
Sentence 5: "AF affects approximately 2% of the general population."
Split point: between sentences 3 and 4 (similarity drops below the threshold).
Topic shifts produce a similarity valley; split there.
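To make the valley idea concrete, here is a minimal sketch that uses hand-made 2-D vectors in place of real sentence embeddings. The vectors and the `find_valleys` helper are illustrative assumptions, not output from any model. It computes cosine similarity for each adjacent pair and reports the gaps that fall below the threshold.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def find_valleys(vectors: list[list[float]], threshold: float = 0.5) -> list[int]:
    """Return each gap index i where sim(vectors[i], vectors[i + 1]) < threshold."""
    return [
        i
        for i in range(len(vectors) - 1)
        if cosine(vectors[i], vectors[i + 1]) < threshold
    ]


# Toy stand-ins for embeddings: three vectors pointing one way (the warfarin
# sentences) and two pointing another way (the atrial fibrillation sentences).
toy_vectors = [
    [1.0, 0.1],
    [0.9, 0.2],
    [1.0, 0.3],
    [0.1, 1.0],
    [0.2, 0.9],
]

valleys = find_valleys(toy_vectors, threshold=0.5)
print(valleys)  # gap index 2 is the boundary between sentence 3 and sentence 4
```

Real embeddings have hundreds of dimensions and noisier similarities, so the valley is rarely this clean, but the comparison is the same.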
Algorithm
import re

import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")


def semantic_chunk(
    text: str,
    similarity_threshold: float = 0.5,
    buffer_size: int = 1,       # sentences averaged on each side of a boundary
    min_chunk_size: int = 100,  # chars; avoid emitting very small chunks
) -> list[str]:
    # Naive sentence split on ., ! or ? followed by whitespace
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    # Note: fragments of 10 characters or fewer are discarded, not merged
    sentences = [s for s in sentences if len(s) > 10]
    if len(sentences) <= 1:
        return [text]

    # Embed all sentences at once (batched for efficiency)
    embeddings = model.encode(sentences, batch_size=64, normalize_embeddings=True)

    # Compare the window ending at sentence i with the window starting at i + 1
    split_points = [0]
    for i in range(len(sentences) - 1):
        # Left context: average of up to buffer_size sentences ending at i
        left = embeddings[max(0, i - buffer_size + 1):i + 1].mean(axis=0)
        # Right context: average of up to buffer_size sentences starting at i + 1
        right = embeddings[i + 1:i + 1 + buffer_size].mean(axis=0)
        # Averaged unit vectors are not unit length, so renormalise for cosine sim
        sim = float(np.dot(left, right) / (np.linalg.norm(left) * np.linalg.norm(right)))
        if sim < similarity_threshold:
            split_points.append(i + 1)
    split_points.append(len(sentences))

    # Build chunks from split points, merging undersized chunks into a neighbour
    chunks = []
    for start, end in zip(split_points, split_points[1:]):
        chunk = " ".join(sentences[start:end])
        if chunks and (len(chunk) < min_chunk_size or len(chunks[-1]) < min_chunk_size):
            chunks[-1] += " " + chunk  # merge with the previous chunk
        else:
            chunks.append(chunk)
    return chunks
Tuning the Threshold
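The threshold directly controls how many boundaries fire: every adjacent pair whose similarity falls below it becomes a split. As a quick illustration, the sketch below counts the chunks one document would produce at several thresholds. The similarity values and the `count_chunks` helper are made up for illustration, and the count ignores any later merging of undersized chunks.

```python
def count_chunks(similarities: list[float], threshold: float) -> int:
    """Chunks produced if every gap with similarity < threshold becomes a split."""
    splits = sum(1 for s in similarities if s < threshold)
    return splits + 1


# Hypothetical adjacent-sentence similarities for one document (made-up values).
sims = [0.89, 0.91, 0.31, 0.85, 0.62, 0.44, 0.78, 0.55]

# Sweep a few thresholds to see how chunk count responds.
sweep = {t: count_chunks(sims, t) for t in (0.3, 0.5, 0.7, 0.9)}
for t, n in sweep.items():
    print(f"threshold={t:.1f} -> {n} chunks")
```

Raising the threshold can only add splits, so chunk count is non-decreasing in the threshold. That monotonicity is what makes calibration against a target feasible.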
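import re

import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # load once, not on every call


def find_split_threshold(
    sample_texts: list[str],
    target_chunk_count: int,
) -> float:
    """Calibrate a threshold that yields about target_chunk_count chunks per sample text.

    Approximate: ignores sentence filtering and min_chunk_size merging in semantic_chunk.
    """
    all_similarities = []
    docs_used = 0
    for text in sample_texts:
        sentences = re.split(r'(?<=[.!?])\s+', text.strip())
        if len(sentences) < 2:
            continue
        embeddings = model.encode(sentences, normalize_embeddings=True)
        # Cosine similarity of each adjacent pair (embeddings are unit length)
        for i in range(len(embeddings) - 1):
            all_similarities.append(float(np.dot(embeddings[i], embeddings[i + 1])))
        docs_used += 1
    if not all_similarities:
        raise ValueError("sample_texts must contain at least one multi-sentence text")
    # N chunks need N - 1 splits per document, so pick the value that leaves
    # that many of the lowest similarities strictly below the threshold
    all_similarities.sort()
    target_splits = docs_used * (target_chunk_count - 1)
    idx = min(max(target_splits, 0), len(all_similarities) - 1)
    return all_similarities[idx]


# Example calibration (sample_guidelines: a list[str] of representative documents, defined elsewhere)
threshold = find_split_threshold(sample_guidelines, target_chunk_count=10)
print(f"Calibrated threshold: {threshold:.3f}")
LangChain SemanticChunker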
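As background for the `percentile` setting in the snippet that follows: the general idea is to convert similarities to distances and split wherever a distance exceeds a chosen percentile of all distances in the document. The sketch below illustrates that idea in plain Python with made-up distances. It is a simplified assumption about the approach, not LangChain's actual implementation, and `percentile` and `breakpoints` are hypothetical helpers.

```python
def percentile(values: list[float], q: float) -> float:
    """q-th percentile (0-100) using linear interpolation between ranks."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def breakpoints(distances: list[float], q: float) -> list[int]:
    """Indices of gaps whose distance exceeds the q-th percentile."""
    cutoff = percentile(distances, q)
    return [i for i, d in enumerate(distances) if d > cutoff]


# Made-up cosine distances (1 - similarity) between adjacent sentences.
distances = [0.11, 0.09, 0.69, 0.15, 0.38, 0.56, 0.22, 0.45, 0.12, 0.10, 0.14]

print(breakpoints(distances, 95))  # only the single largest gap
print(breakpoints(distances, 80))  # the two largest gaps
```

Because the cutoff is relative to each document's own distribution, a percentile rule adapts to documents with uniformly high or low similarity, whereas a fixed threshold does not.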
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
chunker = SemanticChunker(
    embeddings=embeddings,
    breakpoint_threshold_type="percentile",  # also: "standard_deviation", "interquartile"
    breakpoint_threshold_amount=95,  # split where the gap exceeds the 95th percentile of distances
)
# clinical_document: the raw text to split (str), defined elsewhere
chunks = chunker.split_text(clinical_document)
Semantic vs Fixed vs Recursive
Method    | Chunk size | Preserves structure | Indexing speed | Retrieval quality
----------|------------|---------------------|----------------|------------------
Fixed     | Uniform    | No                  | Fast           | Baseline
Recursive | Variable   | Moderate            | Fast           | Good
Semantic  | Variable   | Yes                 | Slow           | Often best
Latency cost of semantic chunking (per 10K-word document):
Embedding 500 sentences: ~2 seconds on GPU, ~15 seconds on CPU
Acceptable at index time; never acceptable at query time
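To see why this cost matters mainly at scale, the back-of-the-envelope sketch below extrapolates the per-document figures above to a large corpus. It treats the ~2 s and ~15 s figures as given and assumes perfectly parallel workers. Real throughput depends on the model, batch size, and hardware, and `indexing_days` is a hypothetical helper.

```python
SECONDS_PER_DAY = 86_400


def indexing_days(num_docs: int, seconds_per_doc: float, workers: int = 1) -> float:
    """Estimated wall-clock days to embed a corpus, assuming perfect parallelism."""
    return num_docs * seconds_per_doc / workers / SECONDS_PER_DAY


gpu_days = indexing_days(1_000_000, seconds_per_doc=2)
cpu_days = indexing_days(1_000_000, seconds_per_doc=15)
print(f"1M docs on one GPU: ~{gpu_days:.0f} days")
print(f"1M docs on one CPU: ~{cpu_days:.0f} days")
```

Under these assumptions a million 10K-word documents take weeks on a single GPU and months on a single CPU, which is why high-volume pipelines often fall back to fixed or recursive splitting.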
When semantic chunking wins:
Documents with varied topic density (intro + detail + summary sections)
Long documents where fixed-size would create many mid-topic splits
High-value documents where retrieval quality justifies extra compute
When fixed/recursive is fine:
High-volume ingestion pipelines (millions of documents)
Short documents (< 500 words): minimal gain over recursive
Time-constrained indexing pipelines
Interview Answer
"Semantic chunking uses embedding similarity between adjacent sentences to detect topic shifts: splits happen where similarity drops below a threshold rather than at arbitrary character counts. This preserves topical coherence within each chunk, which improves retrieval because the embedding of a coherent chunk represents a single topic better than a random slice. The trade-off is cost: embedding every sentence at index time is 10-100× slower than fixed-size splitting. Use semantic chunking for high-value documents where retrieval quality matters most (clinical guidelines, regulatory documents) and fixed or recursive chunking for high-volume ingestion."