Learnixo

RAG Systems · Lesson 15 of 24

Similarity Search: Top-K Retrieval

What Similarity Search Does

Given a query embedding, find the k stored embeddings most similar to it:

Query: "What is the INR target for AF patients?"
Query embedding: [0.12, -0.34, 0.89, ...]

Stored chunks:
  Chunk A: [0.11, -0.31, 0.88, ...]  similarity=0.97  ← closest
  Chunk B: [0.08, -0.29, 0.85, ...]  similarity=0.94
  Chunk C: [-0.45, 0.82, -0.22, ...] similarity=0.21  ← unrelated

Returns: top-k={A, B, ...} above similarity threshold
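
The ranking above can be reproduced by hand. The following is a minimal sketch in plain Python, assuming only the first three components of each vector shown above (so the scores it produces will not match the illustrative 0.97 / 0.94 / 0.21 values). The helper names cosine_similarity and top_k are hypothetical and not part of any library.

```python
import heapq
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between a and b; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def top_k(query, chunks, k=2, min_similarity=0.5):
    """Return up to k (chunk_id, similarity) pairs, best first, at or above the threshold."""
    scored = ((cid, cosine_similarity(query, vec)) for cid, vec in chunks.items())
    kept = [(cid, s) for cid, s in scored if s >= min_similarity]
    return heapq.nlargest(k, kept, key=lambda pair: pair[1])

# Three-dimensional prefixes of the vectors above (illustration only).
query = [0.12, -0.34, 0.89]
chunks = {
    "A": [0.11, -0.31, 0.88],
    "B": [0.08, -0.29, 0.85],
    "C": [-0.45, 0.82, -0.22],
}
for chunk_id, score in top_k(query, chunks):
    print(chunk_id, round(score, 4))
```

Chunk C points away from the query, so it falls below the threshold and is dropped even when k would allow a third result.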

Exact vs Approximate Search

Exact (brute-force, IndexFlatL2 / IndexFlatIP in FAISS):
  Computes similarity against every stored vector
  Guaranteed to find the true top-k
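  Scales as O(n × d) per query, so latency grows linearly with corpus size; as a rule of thumb, too slow for interactive use beyond roughly 100K vectors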
  Use for: small corpora, offline evaluation

Approximate (HNSW, IVF, ScaNN):
  Trades a small accuracy loss for large speed gain
  HNSW: graph-based; typically ~0.95–0.99 recall and sub-millisecond to low-millisecond queries at 1M vectors, depending on M, efSearch, and hardware
  IVF: clusters vectors, searches only nearby clusters
  Use for: production RAG with > 100K chunks
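
The accuracy loss of an approximate index is usually reported as recall@k: the fraction of the true top-k that the index returns. The sketch below illustrates the idea with a toy IVF-style index in plain Python. It is a simplification under stated assumptions: centroids are random samples rather than k-means centres, the data is random, and all function names are hypothetical. It is not how FAISS implements IVF.

```python
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def normalise(v):
    n = sum(x * x for x in v) ** 0.5
    return [x / n for x in v]

def exact_top_k(query, vectors, k):
    """Brute force: score every vector and keep the ids of the k best."""
    order = sorted(range(len(vectors)), key=lambda i: dot(query, vectors[i]), reverse=True)
    return order[:k]

def build_ivf(vectors, n_lists, rng):
    """Toy IVF: sample centroids, then assign each vector to its best-scoring centroid."""
    centroids = rng.sample(vectors, n_lists)
    lists = [[] for _ in range(n_lists)]
    for i, v in enumerate(vectors):
        best = max(range(n_lists), key=lambda c: dot(v, centroids[c]))
        lists[best].append(i)
    return centroids, lists

def ivf_top_k(query, vectors, centroids, lists, k, n_probe):
    """Search only the n_probe lists whose centroids score highest against the query."""
    ranked = sorted(range(len(centroids)), key=lambda c: dot(query, centroids[c]), reverse=True)
    candidates = [i for c in ranked[:n_probe] for i in lists[c]]
    candidates.sort(key=lambda i: dot(query, vectors[i]), reverse=True)
    return candidates[:k]

def recall_at_k(approx_ids, exact_ids):
    return len(set(approx_ids) & set(exact_ids)) / len(exact_ids)

rng = random.Random(0)
vectors = [normalise([rng.gauss(0, 1) for _ in range(16)]) for _ in range(2000)]
centroids, lists = build_ivf(vectors, n_lists=20, rng=rng)
query = normalise([rng.gauss(0, 1) for _ in range(16)])
truth = exact_top_k(query, vectors, k=10)

for n_probe in (1, 5, 20):
    approx = ivf_top_k(query, vectors, centroids, lists, k=10, n_probe=n_probe)
    print(f"n_probe={n_probe:2d}  recall@10={recall_at_k(approx, truth):.2f}")
```

Probing more lists scans more candidates and can only raise recall; probing all of them is equivalent to exact search. Measuring recall this way against a brute-force baseline is the offline evaluation use mentioned for exact search above.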

Basic Similarity Search with Chroma

Python
import chromadb

client = chromadb.PersistentClient(path="./chroma_db")
collection = client.get_or_create_collection(
    name="clinical_docs",
    metadata={"hnsw:space": "cosine"},
)

def similarity_search(
    query_embedding: list[float],
    top_k: int = 5,
    min_similarity: float = 0.5,
    metadata_filter: dict | None = None,
) -> list[dict]:
    results = collection.query(
        query_embeddings=[query_embedding],
        n_results=top_k,
        where=metadata_filter,          # e.g. {"topic": "anticoagulation"}
        include=["documents", "metadatas", "distances"],
    )
    
    retrieved = []
    for doc, meta, dist in zip(
        results["documents"][0],
        results["metadatas"][0],
        results["distances"][0],
    ):
        similarity = 1 - dist           # cosine distance -> cosine similarity
        if similarity >= min_similarity:
            retrieved.append({
                "content": doc,
                "metadata": meta,
                "similarity": round(similarity, 4),
            })
    
    return retrieved
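
The `1 - dist` conversion above is only valid because the collection was created with "hnsw:space": "cosine". The sketch below shows how the conversion would differ for the other spaces, assuming Chroma's documented distance definitions (l2 is squared Euclidean distance, ip is 1 minus the inner product, cosine is 1 minus cosine similarity) and, for l2 and ip, unit-length embeddings. The helper name is hypothetical.

```python
def distance_to_cosine_similarity(distance: float, space: str) -> float:
    """Convert a Chroma distance to cosine similarity.

    For "l2" and "ip" the result equals cosine similarity only when both
    vectors have unit length (an assumption of this sketch).
    """
    if space == "cosine":
        return 1.0 - distance
    if space == "ip":
        return 1.0 - distance           # distance = 1 - <a, b>
    if space == "l2":
        return 1.0 - distance / 2.0     # ||a - b||^2 = 2 - 2 * cos for unit vectors
    raise ValueError(f"unknown space: {space!r}")

# Worked check with two unit vectors whose cosine similarity is 0.6.
a = [1.0, 0.0]
b = [0.6, 0.8]
squared_l2 = sum((x - y) ** 2 for x, y in zip(a, b))
print(round(distance_to_cosine_similarity(squared_l2, "l2"), 4))
```

If the collection is left on its default space, applying `1 - dist` directly would produce values that are not cosine similarities, and the min_similarity threshold would silently mean something else.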

Similarity Search with FAISS

Python
import faiss
import numpy as np

d = 768    # embedding dimension

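# Build index (at index time)
# Pass METRIC_INNER_PRODUCT explicitly: the default metric is L2, which would
# make index.search() return squared distances instead of cosine similarities.
index = faiss.IndexHNSWFlat(d, 16, faiss.METRIC_INNER_PRODUCT)     # HNSW with M=16
index.hnsw.efConstruction = 200

# Add normalised vectors (cosine similarity via inner product).
# all_embeddings, all_docs and all_metas are assumed to be parallel lists
# produced by the ingestion step.
embeddings = np.array(all_embeddings, dtype=np.float32)
faiss.normalize_L2(embeddings)
index.add(embeddings)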

# Save the document texts separately (FAISS stores only vectors)
import json
with open("doc_store.json", "w") as f:
    json.dump({"docs": all_docs, "metas": all_metas}, f)

# Query time
def faiss_search(
    query_embedding: list[float],
    top_k: int = 5,
) -> list[dict]:
    query_vec = np.array([query_embedding], dtype=np.float32)
    faiss.normalize_L2(query_vec)
    
    distances, indices = index.search(query_vec, top_k)
    
    results = []
    for dist, idx in zip(distances[0], indices[0]):
        if idx == -1:
            continue  # HNSW returns -1 for unfilled results
        results.append({
            "content": all_docs[idx],
            "metadata": all_metas[idx],
            "similarity": float(dist),  # inner product on normalised = cosine
        })
    
    return results
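
The query function above reads all_docs and all_metas from memory, so a separate serving process has to reload them from doc_store.json, and their order must still match the ids FAISS returns. The following is a minimal sketch of that round trip. The function names are hypothetical; expected_count is meant to be compared against index.ntotal, and the index itself can be persisted separately with faiss.write_index and faiss.read_index.

```python
import json
from pathlib import Path

def save_doc_store(path, docs, metas):
    """Write documents and metadata as parallel lists, in FAISS insertion order."""
    if len(docs) != len(metas):
        raise ValueError("docs and metas must have the same length")
    Path(path).write_text(json.dumps({"docs": docs, "metas": metas}), encoding="utf-8")

def load_doc_store(path, expected_count=None):
    """Reload the parallel lists; optionally check them against the index size."""
    payload = json.loads(Path(path).read_text(encoding="utf-8"))
    docs, metas = payload["docs"], payload["metas"]
    if len(docs) != len(metas):
        raise ValueError("doc store is corrupt: docs and metas differ in length")
    if expected_count is not None and len(docs) != expected_count:
        raise ValueError(f"doc store has {len(docs)} entries, index has {expected_count}")
    return docs, metas

def hydrate(indices, scores, docs, metas):
    """Turn one row of index.search() output into result dicts, skipping -1 ids."""
    return [
        {"content": docs[int(i)], "metadata": metas[int(i)], "similarity": float(s)}
        for i, s in zip(indices, scores)
        if int(i) != -1
    ]
```

Checking the count at load time catches the most common failure mode, where the index and the document store were written by different ingestion runs and the ids no longer line up.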

Metadata Filtering

Filter by metadata attributes before or after similarity search:

Python
# Chroma: pre-filter (retrieves only from matching subset)
results = collection.query(
    query_embeddings=[query_embedding],
    n_results=5,
    where={
        "$and": [
            {"topic": {"$eq": "anticoagulation"}},
            {"year": {"$gte": 2020}},
        ]
    },
)

# FAISS: no native metadata filtering -> post-filter after retrieval
def faiss_search_with_filter(
    query_embedding: list[float],
    top_k: int = 5,
    filter_fn=None,   # callable(metadata) -> bool
    fetch_multiplier: int = 5,
) -> list[dict]:
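    # Fetch more candidates than needed, then filter. This can still return
    # fewer than top_k results if few of the candidates pass the filter.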
    raw = faiss_search(query_embedding, top_k * fetch_multiplier)
    if filter_fn:
        raw = [r for r in raw if filter_fn(r["metadata"])]
    return raw[:top_k]

# Usage: only return NICE guidelines from 2021+
# (embed() stands for whatever function produces your query embedding)
results = faiss_search_with_filter(
    query_embedding=embed(query),
    top_k=5,
    filter_fn=lambda m: m.get("source", "").startswith("NICE") and m.get("year", 0) >= 2021,
)
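
A fixed fetch multiplier can come up short when the filter is selective. One possible variation, sketched below, keeps doubling the fetch size until enough results pass the filter or the index has nothing more to return. The function name and its parameters are hypothetical; search_fn is assumed to behave like faiss_search above, taking an embedding and a result count and returning result dicts in best-first order.

```python
from typing import Callable

SearchFn = Callable[[list[float], int], list[dict]]

def search_until_filled(
    query_embedding: list[float],
    search_fn: SearchFn,
    filter_fn: Callable[[dict], bool],
    top_k: int = 5,
    initial_fetch: int = 25,
    max_fetch: int = 1000,
) -> list[dict]:
    """Post-filter with a growing fetch size instead of a fixed multiplier."""
    fetch = max(initial_fetch, top_k)
    while True:
        raw = search_fn(query_embedding, fetch)
        kept = [r for r in raw if filter_fn(r["metadata"])]
        exhausted = len(raw) < fetch      # the index returned everything it has
        if len(kept) >= top_k or exhausted or fetch >= max_fetch:
            return kept[:top_k]
        fetch = min(fetch * 2, max_fetch)
```

The max_fetch cap bounds the worst case: each retry repeats the whole search, so a very selective filter is usually better served by a store with native pre-filtering.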

Similarity Threshold

Don't return low-similarity results; they introduce noise. As a rough guide for cosine similarity (the bands shift with the embedding model, so calibrate on your own data):

0.90+:      excellent match, high confidence
0.75–0.90:  good match
0.60–0.75:  moderate; include but note lower confidence
0.50–0.60:  weak; borderline, consider excluding
< 0.50:     unrelated; exclude

Clinical RAG threshold: 0.60 minimum
  Below this, the retrieved chunk is unlikely to be about the query topic
  If nothing exceeds threshold: respond with "not found in knowledge base"

def retrieve_with_threshold(query, min_similarity=0.60):
    # similarity_search drops anything below min_similarity and returns
    # results best-first, so no second filtering pass is needed.
    results = similarity_search(embed(query), top_k=10, min_similarity=min_similarity)
    if not results:
        return None  # signal: no relevant content found
    return results[:5]
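
A minimum such as 0.60 is best checked against labelled examples rather than taken on faith. The sketch below picks the threshold that maximises F1 over a handful of (similarity, is_relevant) pairs. The pairs are made up for illustration, and the function names are hypothetical; in practice the pairs would come from judged query and chunk combinations produced by your own embedding model.

```python
def f1_at_threshold(scored, threshold):
    """F1 of the rule 'keep the chunk if similarity >= threshold'."""
    tp = sum(1 for s, relevant in scored if s >= threshold and relevant)
    fp = sum(1 for s, relevant in scored if s >= threshold and not relevant)
    fn = sum(1 for s, relevant in scored if s < threshold and relevant)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def best_threshold(scored, candidates):
    """Candidate with the highest F1; ties go to the stricter (higher) threshold."""
    return max(candidates, key=lambda t: (f1_at_threshold(scored, t), t))

# Made-up labelled pairs: (cosine similarity, was the chunk actually relevant?)
labelled = [
    (0.91, True), (0.84, True), (0.72, True), (0.66, False),
    (0.63, True), (0.58, False), (0.52, False), (0.41, False),
]
candidates = [0.50, 0.55, 0.60, 0.65, 0.70, 0.75]

for t in candidates:
    print(f"threshold={t:.2f}  F1={f1_at_threshold(labelled, t):.3f}")
print("selected:", best_threshold(labelled, candidates))
```

F1 weights missed chunks and noisy chunks equally. A clinical deployment might prefer to weight precision more heavily, since irrelevant context is the costlier error there.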

Interview Answer

"Vector similarity search ranks stored embeddings by cosine or dot product similarity to the query embedding. For production RAG, approximate nearest neighbour indexes like HNSW give sub-millisecond retrieval at millions of vectors with ~0.95–0.99 recall. Two practical considerations: metadata filtering (Chroma supports pre-filtering natively; FAISS requires post-filtering with a fetch multiplier) and similarity thresholds (set a minimum, typically 0.60, below which chunks are too unrelated to be useful — returning a 'not found' is better than returning irrelevant context that confuses the LLM)."