RAG Systems · Lesson 8 of 24
HNSW Index: How Approximate Nearest Neighbour Works
Why You Need an Index
Naive similarity search scans every vector and computes cosine similarity. At 1M documents this is too slow:
Brute-force at 1M chunks:
768-dim vectors × 1M = 768M multiplications per query
On CPU: ~100-500ms per query
With GPU: ~10ms — usable but expensive
HNSW at 1M chunks:
~1ms per query on CPU
Trades a small accuracy loss for a ~100-500× speedupHNSW (Hierarchical Navigable Small World) is the dominant vector index algorithm used by Chroma, Qdrant, Weaviate, pgvector, and FAISS.
How HNSW Works
HNSW is a multi-layer graph where:
- Bottom layer contains all vectors, densely connected
- Upper layers are progressively sparser — "express lanes"
- Search enters at the top, greedily descends to the query's neighbourhood
Layer 2 (sparse): A ---- E
Layer 1 (medium): A -- C -- E -- G
Layer 0 (dense): A - B - C - D - E - F - G - H
^
query lands here
Search: enter at top layer → follow edges greedily →
descend to next layer → repeat → at layer 0,
return ef_search nearest neighboursEach node is assigned a random maximum layer via geometric distribution. This prevents pathological worst-cases and gives the "small world" property — any node reachable in O(log n) hops.
Key HNSW Parameters
M (connections per node, default 16):
Controls graph connectivity at each layer
Higher M → better recall, more memory, slower build
M=8: fast build, lower recall (~0.95)
M=16: good balance (recommended starting point)
M=32: high recall (~0.99), 2× memory and build time
ef_construction (build-time search width, default 200):
How many candidates the builder considers when inserting each node
Higher → better graph quality, slower indexing
ef_construction=100: fast, acceptable quality
ef_construction=200: recommended
ef_construction=500: high quality, 2.5× slower build
ef_search (query-time search width, default 50):
How many candidates to track during search
Higher → better recall, higher latency
ef_search=50: fast (~1ms), recall ~0.95
ef_search=100: slower (~2ms), recall ~0.98
ef_search=200: precise (~4ms), recall ~0.995HNSW in Practice with Chroma
import chromadb
from chromadb.config import Settings
# Chroma uses HNSW by default via hnswlib
client = chromadb.PersistentClient(path="./chroma_db")
# Collection with custom HNSW parameters
collection = client.get_or_create_collection(
name="clinical_docs",
metadata={
"hnsw:space": "cosine", # distance metric
"hnsw:M": 16, # connections per node
"hnsw:ef_construction": 200, # build-time ef
"hnsw:ef_search": 100, # query-time ef (can be tuned per-query)
}
)
# At query time, ef_search determines recall vs latency trade-off
results = collection.query(
query_embeddings=[query_embedding],
n_results=10,
# ef_search can be overridden per-query in newer Chroma versions
)HNSW with FAISS
import faiss
import numpy as np
d = 768 # embedding dimension
M = 16 # connections per node
# Build HNSW index
index = faiss.IndexHNSWFlat(d, M)
index.hnsw.efConstruction = 200
# Add vectors (must be float32)
embeddings = np.array(all_embeddings, dtype=np.float32)
index.add(embeddings)
# Search
index.hnsw.efSearch = 100
query_vec = np.array([query_embedding], dtype=np.float32)
distances, indices = index.search(query_vec, k=10)
# Save/load
faiss.write_index(index, "clinical.index")
index = faiss.read_index("clinical.index")Recall vs Latency Trade-off
ef_search | Recall@10 | Query latency (1M vecs)
----------|-----------|------------------------
20 | 0.88 | 0.5ms
50 | 0.95 | 1.0ms
100 | 0.98 | 2.0ms
200 | 0.99 | 4.0ms
500 | 0.999 | 8.0ms
For RAG, recall@10 of 0.95 is usually fine.
Missing 5% of results matters less than latency SLA.
Use ef_search=100 as a starting point, tune with offline eval.When HNSW Falls Short
Memory: HNSW stores the graph structure in RAM
1M vectors × 768 dims × 4 bytes = 3GB vectors
+ HNSW graph overhead (~M × 8 bytes/node) ≈ 128MB
Total ≈ 3.1GB for 1M vectors — manageable
Disk: HNSW requires the index in RAM — can't stream from disk
For very large corpora (100M+ vectors), use IVF (inverted file) index
or disk-ANN (DiskANN, used by Azure AI Search)
Filtering + HNSW: filtering AFTER retrieval hurts recall
If 90% of results are filtered by metadata, need ef_search 10× higher
Better: pre-filter with metadata, then HNSW on the subset
Qdrant and Weaviate support filtered HNSW nativelyInterview Answer
"HNSW is a multi-layer graph index for approximate nearest neighbour search. It enables sub-millisecond vector retrieval at millions of vectors by building a hierarchy of graphs — sparse upper layers act as express lanes, dense lower layers for precision. Key parameters: M (graph connectivity, default 16), ef_construction (build quality, default 200), and ef_search (query recall vs latency trade-off, tuned per use case). For RAG, ef_search=100 giving ~0.98 recall is a good starting point. The main limitation is memory — HNSW lives in RAM, so very large corpora need DiskANN or IVF-based indexes instead."