RAG Systems · Lesson 15 of 24
Similarity Search: Top-K Retrieval
What Similarity Search Does
Given a query embedding, find the k stored embeddings most similar to it:
Query: "What is the INR target for AF patients?"
Query embedding: [0.12, -0.34, 0.89, ...]
Stored chunks:
Chunk A: [0.11, -0.31, 0.88, ...] similarity=0.97 ← closest
Chunk B: [0.08, -0.29, 0.85, ...] similarity=0.94
Chunk C: [-0.45, 0.82, -0.22, ...] similarity=0.21 ← unrelated
Returns: top-k={A, B, ...} above similarity threshold
Exact vs Approximate Search
Exact (brute-force, IndexFlatL2 / IndexFlatIP in FAISS):
Computes similarity against every stored vector
Guaranteed to find the true top-k
Costs O(n × d) per query, so latency grows linearly with corpus size; usually too slow for interactive use once n exceeds roughly 100K
Use for: small corpora, offline evaluation
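To make the exact path concrete, here is a minimal pure-Python sketch of brute-force top-k by cosine similarity, together with a recall@k helper of the kind used in offline evaluation to score an approximate index against exact results. The function names (`cosine_similarity`, `exact_top_k`, `recall_at_k`) and the three-dimensional toy vectors are illustrative assumptions, not part of FAISS or Chroma. Because the vectors are truncated to three dimensions, the scores will not match the similarities quoted in the lesson's example.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def exact_top_k(
    query: list[float], stored: dict[str, list[float]], k: int
) -> list[tuple[str, float]]:
    """Score every stored vector (O(n x d) work) and return the k best."""
    scored = [(cid, cosine_similarity(query, vec)) for cid, vec in stored.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


def recall_at_k(approx_ids: list[str], exact_ids: list[str]) -> float:
    """Fraction of the exact top-k that an approximate search also returned."""
    if not exact_ids:
        return 1.0
    return len(set(approx_ids) & set(exact_ids)) / len(exact_ids)


# Toy vectors: the first three components from the lesson's example
query = [0.12, -0.34, 0.89]
stored = {
    "A": [0.11, -0.31, 0.88],
    "B": [0.08, -0.29, 0.85],
    "C": [-0.45, 0.82, -0.22],
}

exact_ids = [cid for cid, _ in exact_top_k(query, stored, k=2)]

# Pretend an approximate index returned A and C instead of A and B
print(exact_ids, recall_at_k(["A", "C"], exact_ids))
```

In an offline evaluation, `exact_ids` would come from a flat index over the full corpus and `approx_ids` from the production index, averaged over a set of representative queries.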
Approximate (HNSW, IVF, ScaNN):
Trades a small accuracy loss for large speed gain
HNSW: graph-based, typically ~0.95–0.99 recall with sub-millisecond queries at 1M vectors, depending on M, efSearch, and hardware
IVF: clusters vectors, searches only nearby clusters
Use for: production RAG with > 100K chunks
Basic Similarity Search with Chroma
import chromadb
import numpy as np

client = chromadb.PersistentClient(path="./chroma_db")
collection = client.get_or_create_collection(
    name="clinical_docs",
    metadata={"hnsw:space": "cosine"},
)

def similarity_search(
    query_embedding: list[float],
    top_k: int = 5,
    min_similarity: float = 0.5,
    metadata_filter: dict | None = None,
) -> list[dict]:
    results = collection.query(
        query_embeddings=[query_embedding],
        n_results=top_k,
        where=metadata_filter,  # e.g. {"topic": "anticoagulation"}
        include=["documents", "metadatas", "distances"],
    )
    retrieved = []
    for doc, meta, dist in zip(
        results["documents"][0],
        results["metadatas"][0],
        results["distances"][0],
    ):
        similarity = 1 - dist  # cosine distance → cosine similarity
        if similarity >= min_similarity:
            retrieved.append({
                "content": doc,
                "metadata": meta,
                "similarity": round(similarity, 4),
            })
    return retrieved

Similarity Search with FAISS
import json

import faiss
import numpy as np

d = 768  # embedding dimension

# Build index (at index time)
# IndexHNSWFlat defaults to L2 distance, so request inner product explicitly;
# otherwise the scores returned below would be squared L2 distances, not cosine.
index = faiss.IndexHNSWFlat(d, 16, faiss.METRIC_INNER_PRODUCT)  # HNSW with M=16
index.hnsw.efConstruction = 200

# Add normalised vectors (cosine similarity via inner product)
# all_embeddings, all_docs and all_metas are assumed to come from the indexing step
embeddings = np.array(all_embeddings, dtype=np.float32)
faiss.normalize_L2(embeddings)
index.add(embeddings)

# Save the document texts separately (FAISS stores only vectors)
with open("doc_store.json", "w") as f:
    json.dump({"docs": all_docs, "metas": all_metas}, f)

# Query time
def faiss_search(
    query_embedding: list[float],
    top_k: int = 5,
) -> list[dict]:
    query_vec = np.array([query_embedding], dtype=np.float32)
    faiss.normalize_L2(query_vec)
    scores, indices = index.search(query_vec, top_k)
    results = []
    for score, idx in zip(scores[0], indices[0]):
        if idx == -1:
            continue  # HNSW returns -1 for unfilled results
        results.append({
            "content": all_docs[idx],
            "metadata": all_metas[idx],
            "similarity": float(score),  # inner product on normalised = cosine
        })
    return results

Metadata Filtering
Filter by metadata attributes before or after similarity search:
# Chroma: pre-filter (retrieves only from matching subset)
results = collection.query(
    query_embeddings=[query_embedding],
    n_results=5,
    where={
        "$and": [
            {"topic": {"$eq": "anticoagulation"}},
            {"year": {"$gte": 2020}},
        ]
    },
)
# FAISS: no native metadata filtering, so post-filter after retrieval
def faiss_search_with_filter(
    query_embedding: list[float],
    top_k: int = 5,
    filter_fn=None,  # callable(metadata) -> bool
    fetch_multiplier: int = 5,
) -> list[dict]:
    # Over-fetch only when a filter will discard some candidates
    fetch_k = top_k * fetch_multiplier if filter_fn else top_k
    raw = faiss_search(query_embedding, fetch_k)
    if filter_fn:
        raw = [r for r in raw if filter_fn(r["metadata"])]
    # May return fewer than top_k if the filter rejects most candidates
    return raw[:top_k]

# Usage: only return NICE guidelines from 2021+
results = faiss_search_with_filter(
    query_embedding=embed(query),
    top_k=5,
    filter_fn=lambda m: m.get("source", "").startswith("NICE") and m.get("year", 0) >= 2021,
)

Similarity Threshold
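Don't return low-similarity results; they add noise to the LLM's context. The bands below are a rough guide for cosine similarity, and the exact cut-offs depend on the embedding model, so calibrate them on your own queries: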
0.90+: excellent match — high confidence
0.75–0.90: good match
0.60–0.75: moderate — include but note lower confidence
0.50–0.60: weak — borderline, consider excluding
< 0.50: unrelated — exclude
Clinical RAG threshold: 0.60 minimum
Below this, the retrieved chunk is unlikely to be about the query topic
If nothing exceeds threshold: respond with "not found in knowledge base"
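
The bands listed above can be turned into a small lookup so that each retrieved chunk carries a confidence label, which makes it easy to flag "moderate" matches downstream. This is a sketch only: `BANDS`, `confidence_band`, and `annotate` are hypothetical names, and the cut-offs are copied from the list above rather than calibrated for any particular embedding model.

```python
# Lower bound of each band, highest first (values from the bands listed above)
BANDS = [
    (0.90, "excellent"),
    (0.75, "good"),
    (0.60, "moderate"),
    (0.50, "weak"),
]


def confidence_band(similarity: float) -> str:
    """Return the label of the first band whose lower bound is met."""
    for lower_bound, label in BANDS:
        if similarity >= lower_bound:
            return label
    return "unrelated"


def annotate(results: list[dict]) -> list[dict]:
    """Copy each result dict and add a 'confidence' label based on its similarity."""
    return [{**r, "confidence": confidence_band(r["similarity"])} for r in results]
```
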
def retrieve_with_threshold(query, min_similarity=0.60):
    # Pass the threshold through; otherwise similarity_search applies its own
    # 0.5 default first and a min_similarity below 0.5 would have no effect.
    results = similarity_search(embed(query), top_k=10, min_similarity=min_similarity)
    if not results:
        return None  # signal: no relevant content found
    return results[:5]

Interview Answer
"Vector similarity search ranks stored embeddings by cosine or dot product similarity to the query embedding. For production RAG, approximate nearest neighbour indexes like HNSW give sub-millisecond retrieval at millions of vectors with ~0.95–0.99 recall. Two practical considerations: metadata filtering (Chroma supports pre-filtering natively; FAISS requires post-filtering with a fetch multiplier) and similarity thresholds (set a minimum, typically 0.60, below which chunks are too unrelated to be useful — returning a 'not found' is better than returning irrelevant context that confuses the LLM)."