Learnixo
AI Systems · Intermediate

Semantic Chunking

How semantic chunking uses embedding similarity to find natural topic boundaries, when it outperforms structural chunking, and its computational cost.

Asma Hafeez Khan · May 21, 2026 · 4 min read
Tags: RAG, Chunking, Semantic, Embeddings, Document Processing

The Idea

Instead of splitting on character counts or separators, detect where the topic changes by measuring embedding similarity between adjacent sentences:

Each score is the similarity to the previous sentence (illustrative values):

Sentence 1: "Warfarin inhibits vitamin K epoxide reductase."
Sentence 2: "This blocks synthesis of clotting factors II, VII, IX, X."    sim=0.89
Sentence 3: "INR is monitored to keep anticoagulation within range."      sim=0.91
Sentence 4: "Atrial fibrillation is the most common cardiac arrhythmia."  sim=0.31  ← SPLIT
Sentence 5: "AF affects approximately 2% of the general population."      sim=0.85

Split point: between sentences 3 and 4 (similarity drops below the threshold)

Topic shifts produce a similarity valley — split there.
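
The valley detection can be tried without an embedding model. Below is a minimal sketch, assuming hand-written 2-D vectors as stand-ins for real sentence embeddings. The vectors, the 0.5 threshold, and the helper names `cosine` and `find_valleys` are illustrative and do not come from a real model or library.

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def find_valleys(vectors: list[list[float]], threshold: float) -> list[int]:
    """Return each index i where sentence i starts a new chunk, i.e. where
    the similarity between sentence i-1 and sentence i is below threshold."""
    return [
        i for i in range(1, len(vectors))
        if cosine(vectors[i - 1], vectors[i]) < threshold
    ]

# Toy "embeddings": three sentences pointing one way, two pointing another.
toy_vectors = [
    [1.0, 0.1],   # warfarin mechanism
    [0.9, 0.2],   # clotting factors
    [1.0, 0.3],   # INR monitoring
    [0.1, 1.0],   # atrial fibrillation
    [0.2, 0.9],   # AF prevalence
]

boundaries = find_valleys(toy_vectors, threshold=0.5)
print(boundaries)  # [3]: the fourth sentence (index 3) starts a new chunk
```

Real embeddings have hundreds of dimensions and noisier scores, but the comparison is the same.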


Algorithm

Python
import re

import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

def semantic_chunk(
    text: str,
    similarity_threshold: float = 0.5,
    buffer_size: int = 1,          # sentences averaged on each side of a boundary (>= 1)
    min_chunk_size: int = 100,     # chars; smaller chunks are merged with a neighbour
) -> list[str]:
    # Naive sentence split on ., ! or ? followed by whitespace.
    # Short sentences are kept so that no source text is dropped from the chunks.
    sentences = [s for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s]

    if len(sentences) <= 1:
        return [text]

    # Embed all sentences at once (batched for efficiency)
    embeddings = model.encode(sentences, batch_size=64, normalize_embeddings=True)

    # Score every boundary; i is the index of the first sentence after it
    split_points = [0]
    for i in range(1, len(sentences)):
        # Left context: mean of up to buffer_size sentences before the boundary
        left = embeddings[max(0, i - buffer_size):i].mean(axis=0)
        # Right context: mean of up to buffer_size sentences from sentence i onward
        right = embeddings[i:i + buffer_size].mean(axis=0)

        # Averaged vectors are no longer unit length, so divide by the norms
        sim = float(np.dot(left, right) / (np.linalg.norm(left) * np.linalg.norm(right)))
        if sim < similarity_threshold:
            split_points.append(i)

    split_points.append(len(sentences))

    # Build chunks from split points
    chunks = []
    carry = ""  # undersized leading text waiting for a chunk to attach to
    for start, end in zip(split_points, split_points[1:]):
        chunk = (carry + " " + " ".join(sentences[start:end])).strip()
        carry = ""
        if len(chunk) >= min_chunk_size:
            chunks.append(chunk)
        elif chunks:
            chunks[-1] += " " + chunk  # merge tiny chunks with previous
        else:
            carry = chunk  # no previous chunk yet: prepend to the next one
    if carry:
        chunks.append(carry)  # whole text was shorter than min_chunk_size

    return chunks
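
One detail to watch when `buffer_size` is greater than 1: the mean of several unit vectors is generally shorter than unit length, so a plain dot product between two averaged contexts understates their cosine similarity. Here is a small check using hand-made unit vectors rather than real embeddings. All values and helper names are illustrative.

```python
import math

def mean_vector(vectors: list[list[float]]) -> list[float]:
    """Component-wise mean of equally sized vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]

def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity that does not assume unit-length inputs."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

# Two unit vectors 90 degrees apart stand in for two sentence embeddings.
window = [[1.0, 0.0], [0.0, 1.0]]
context = mean_vector(window)          # [0.5, 0.5]

norm = math.sqrt(dot(context, context))
raw_dot = dot(context, context)        # what a bare dot product reports
true_cos = cosine(context, context)    # a vector compared with itself

print(round(norm, 3), round(raw_dot, 3), round(true_cos, 3))
# 0.707 0.5 1.0
```

A context compared with itself should score 1.0, yet the bare dot product reports 0.5. With a threshold near 0.5, that gap alone could trigger spurious splits.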

Tuning the Threshold

Python
import re

import numpy as np
from sentence_transformers import SentenceTransformer

def find_split_threshold(
    sample_texts: list[str],
    target_sentences_per_chunk: int,
) -> float:
    """Calibrate a threshold so that roughly 1 in `target_sentences_per_chunk`
    adjacent-sentence boundaries falls below it (corresponds to buffer_size=1)."""
    model = SentenceTransformer("all-MiniLM-L6-v2")
    all_similarities = []

    for text in sample_texts:
        sentences = re.split(r'(?<=[.!?])\s+', text.strip())
        if len(sentences) < 2:
            continue
        embeddings = model.encode(sentences, normalize_embeddings=True)
        for i in range(len(embeddings) - 1):
            sim = float(np.dot(embeddings[i], embeddings[i + 1]))
            all_similarities.append(sim)

    if not all_similarities:
        raise ValueError("sample_texts needs at least one text with two or more sentences")

    # Choose the percentile that gives the target split rate
    all_similarities.sort()
    split_rate = 1 / target_sentences_per_chunk
    # Clamp the index so a split rate of 1.0 cannot run past the end of the list
    idx = min(int(len(all_similarities) * split_rate), len(all_similarities) - 1)
    return all_similarities[idx]

# Example calibration; sample_guidelines is a list of representative document strings
threshold = find_split_threshold(sample_guidelines, target_sentences_per_chunk=10)
print(f"Calibrated threshold: {threshold:.3f}")
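
If the goal is an exact number of chunks for a single document rather than a split rate across a sample, one option is to cut at the lowest-scoring boundaries. The sketch below works under that assumption. `threshold_for_chunk_count` is a hypothetical helper, and the similarity list is made up for illustration.

```python
def threshold_for_chunk_count(similarities: list[float], n_chunks: int) -> float:
    """Return a threshold such that `sim < threshold` holds for the
    (n_chunks - 1) lowest adjacent similarities, assuming no ties at the cut."""
    n_splits = n_chunks - 1
    if not 0 <= n_splits <= len(similarities):
        raise ValueError("n_chunks must be between 1 and len(similarities) + 1")
    ordered = sorted(similarities)
    if n_splits == len(ordered):
        return float("inf")      # split at every boundary
    return ordered[n_splits]     # the lowest similarity that should NOT split

# Adjacent-sentence similarities for a 6-sentence document (illustrative).
sims = [0.89, 0.91, 0.31, 0.85, 0.42]
threshold = threshold_for_chunk_count(sims, n_chunks=3)

# Boundary i sits between sentence i and sentence i + 1.
splits = [i + 1 for i, s in enumerate(sims) if s < threshold]
print(threshold, splits)  # 0.85 [3, 5]
```

The two weakest boundaries (0.31 and 0.42) become the cuts, giving three chunks: sentences 0-2, 3-4, and 5.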

LangChain SemanticChunker

Python
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings

# OpenAIEmbeddings reads the API key from the OPENAI_API_KEY environment variable
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

chunker = SemanticChunker(
    embeddings=embeddings,
    breakpoint_threshold_type="percentile",  # "percentile", "standard_deviation", "interquartile"
    breakpoint_threshold_amount=95,          # split where the distance between adjacent sentences exceeds the 95th percentile
)

# clinical_document is the raw text to split (a str), defined elsewhere
chunks = chunker.split_text(clinical_document)
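
The three threshold types differ in how they turn a list of adjacent-sentence distances (1 minus cosine similarity) into a cut-off. The sketch below shows one plausible reading of each rule using only the standard library. It illustrates the statistics involved and is not LangChain's source code. `breakpoint_threshold` is a hypothetical helper, and the distances are made up.

```python
import statistics

def breakpoint_threshold(distances: list[float], kind: str, amount: float) -> float:
    """Distance above which a boundary counts as a breakpoint."""
    if kind == "percentile":
        # amount is a whole-number percentile between 1 and 99, e.g. 95
        cuts = statistics.quantiles(distances, n=100, method="inclusive")
        return cuts[int(amount) - 1]
    if kind == "standard_deviation":
        # amount is a number of standard deviations above the mean
        return statistics.fmean(distances) + amount * statistics.pstdev(distances)
    if kind == "interquartile":
        # amount scales the interquartile range that is added to the mean
        q1, _, q3 = statistics.quantiles(distances, n=4, method="inclusive")
        return statistics.fmean(distances) + amount * (q3 - q1)
    raise ValueError(f"unknown kind: {kind}")

# Distances between 11 consecutive sentences (illustrative values).
distances = [0.11, 0.09, 0.15, 0.69, 0.12, 0.10, 0.14, 0.08, 0.13, 0.58]

cutoff = breakpoint_threshold(distances, "percentile", 95)
# Distance i sits between sentence i and sentence i + 1.
breaks = [i + 1 for i, d in enumerate(distances) if d > cutoff]
print(breaks)  # [4]
```

A percentile rule always produces some breakpoints, however uniform the document. The other two rules, which are measured from the mean, can produce none when the distances are tightly clustered.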

Semantic vs Fixed vs Recursive

Method         | Chunk size  | Preserves structure | Speed   | Quality
---------------|-------------|---------------------|---------|--------
Fixed          | Uniform     | No                  | Fast    | Baseline
Recursive      | Variable    | Moderate            | Fast    | Good
Semantic       | Variable    | Yes                 | Slow    | Often best

Latency cost of semantic chunking (per 10K-word document; rough figures that vary with model and hardware):
  Embedding ~500 sentences: ~2 seconds on GPU, ~15 seconds on CPU
  Acceptable at index time; never acceptable at query time

When semantic chunking wins:
  Documents with varied topic density (intro + detail + summary sections)
  Long documents where fixed-size would create many mid-topic splits
  High-value documents where retrieval quality justifies extra compute

When fixed/recursive is fine:
  High-volume ingestion pipelines (millions of documents)
  Short documents (< 500 words) — minimal gain over recursive
  Time-constrained indexing pipelines
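
For contrast, a fixed-size baseline is only a few lines and needs no model, which is why it stays attractive for high-volume pipelines. This is a minimal sketch, assuming character-based windows with overlap. `fixed_size_chunk` and its defaults are illustrative and not taken from a library.

```python
def fixed_size_chunk(text: str, chunk_size: int = 60, overlap: int = 10) -> list[str]:
    """Split text into windows of chunk_size characters, each sharing
    `overlap` characters with the previous window."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    if not text:
        return []
    step = chunk_size - overlap
    # Stop early enough that no window lies entirely inside the previous one
    last_start = max(len(text) - overlap, 1)
    return [text[start:start + chunk_size] for start in range(0, last_start, step)]

text = (
    "Warfarin inhibits vitamin K epoxide reductase. "
    "Atrial fibrillation is the most common cardiac arrhythmia."
)
chunks = fixed_size_chunk(text)
for chunk in chunks:
    print(repr(chunk))
```

The first window ends partway through the word "fibrillation", so one chunk mixes two topics and cuts a word in half. This is the mid-topic split that semantic chunking is meant to avoid.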

Interview Answer

"Semantic chunking uses embedding similarity between adjacent sentences to detect topic shifts: splits happen where similarity drops below a threshold rather than at arbitrary character counts. This preserves topical coherence within each chunk, which improves retrieval because the embedding of a coherent chunk represents a single topic better than a random slice. The trade-off is cost: embedding every sentence at index time is 10-100× slower than fixed-size splitting. Use semantic chunking for high-value documents where retrieval quality matters most (clinical guidelines, regulatory documents) and fixed or recursive chunking for high-volume ingestion."
