HNSW: The Vector Index Powering RAG
How Hierarchical Navigable Small World graphs enable fast approximate nearest neighbour search in vector databases, and the parameters that matter for RAG.
Why You Need an Index
Naive similarity search scans every vector and computes cosine similarity. At 1M documents this is too slow:
Brute-force at 1M chunks:
768-dim vectors × 1M = 768M multiplications per query
On CPU: ~100-500ms per query
With GPU: ~10ms — usable but expensive
HNSW at 1M chunks:
~1ms per query on CPU
Trades a small accuracy loss for a ~100-500× speedupHNSW (Hierarchical Navigable Small World) is the dominant vector index algorithm used by Chroma, Qdrant, Weaviate, pgvector, and FAISS.
How HNSW Works
HNSW is a multi-layer graph where:
- Bottom layer contains all vectors, densely connected
- Upper layers are progressively sparser — "express lanes"
- Search enters at the top, greedily descends to the query's neighbourhood
Layer 2 (sparse): A ---- E
Layer 1 (medium): A -- C -- E -- G
Layer 0 (dense): A - B - C - D - E - F - G - H
^
query lands here
Search: enter at top layer → follow edges greedily →
descend to next layer → repeat → at layer 0,
return ef_search nearest neighboursEach node is assigned a random maximum layer via geometric distribution. This prevents pathological worst-cases and gives the "small world" property — any node reachable in O(log n) hops.
Key HNSW Parameters
M (connections per node, default 16):
Controls graph connectivity at each layer
Higher M → better recall, more memory, slower build
M=8: fast build, lower recall (~0.95)
M=16: good balance (recommended starting point)
M=32: high recall (~0.99), 2× memory and build time
ef_construction (build-time search width, default 200):
How many candidates the builder considers when inserting each node
Higher → better graph quality, slower indexing
ef_construction=100: fast, acceptable quality
ef_construction=200: recommended
ef_construction=500: high quality, 2.5× slower build
ef_search (query-time search width, default 50):
How many candidates to track during search
Higher → better recall, higher latency
ef_search=50: fast (~1ms), recall ~0.95
ef_search=100: slower (~2ms), recall ~0.98
ef_search=200: precise (~4ms), recall ~0.995HNSW in Practice with Chroma
import chromadb
from chromadb.config import Settings
# Chroma uses HNSW by default via hnswlib
client = chromadb.PersistentClient(path="./chroma_db")
# Collection with custom HNSW parameters
collection = client.get_or_create_collection(
name="clinical_docs",
metadata={
"hnsw:space": "cosine", # distance metric
"hnsw:M": 16, # connections per node
"hnsw:ef_construction": 200, # build-time ef
"hnsw:ef_search": 100, # query-time ef (can be tuned per-query)
}
)
# At query time, ef_search determines recall vs latency trade-off
results = collection.query(
query_embeddings=[query_embedding],
n_results=10,
# ef_search can be overridden per-query in newer Chroma versions
)HNSW with FAISS
import faiss
import numpy as np
d = 768 # embedding dimension
M = 16 # connections per node
# Build HNSW index
index = faiss.IndexHNSWFlat(d, M)
index.hnsw.efConstruction = 200
# Add vectors (must be float32)
embeddings = np.array(all_embeddings, dtype=np.float32)
index.add(embeddings)
# Search
index.hnsw.efSearch = 100
query_vec = np.array([query_embedding], dtype=np.float32)
distances, indices = index.search(query_vec, k=10)
# Save/load
faiss.write_index(index, "clinical.index")
index = faiss.read_index("clinical.index")Recall vs Latency Trade-off
ef_search | Recall@10 | Query latency (1M vecs)
----------|-----------|------------------------
20 | 0.88 | 0.5ms
50 | 0.95 | 1.0ms
100 | 0.98 | 2.0ms
200 | 0.99 | 4.0ms
500 | 0.999 | 8.0ms
For RAG, recall@10 of 0.95 is usually fine.
Missing 5% of results matters less than latency SLA.
Use ef_search=100 as a starting point, tune with offline eval.When HNSW Falls Short
Memory: HNSW stores the graph structure in RAM
1M vectors × 768 dims × 4 bytes = 3GB vectors
+ HNSW graph overhead (~M × 8 bytes/node) ≈ 128MB
Total ≈ 3.1GB for 1M vectors — manageable
Disk: HNSW requires the index in RAM — can't stream from disk
For very large corpora (100M+ vectors), use IVF (inverted file) index
or disk-ANN (DiskANN, used by Azure AI Search)
Filtering + HNSW: filtering AFTER retrieval hurts recall
If 90% of results are filtered by metadata, need ef_search 10× higher
Better: pre-filter with metadata, then HNSW on the subset
Qdrant and Weaviate support filtered HNSW nativelyInterview Answer
"HNSW is a multi-layer graph index for approximate nearest neighbour search. It enables sub-millisecond vector retrieval at millions of vectors by building a hierarchy of graphs — sparse upper layers act as express lanes, dense lower layers for precision. Key parameters: M (graph connectivity, default 16), ef_construction (build quality, default 200), and ef_search (query recall vs latency trade-off, tuned per use case). For RAG, ef_search=100 giving ~0.98 recall is a good starting point. The main limitation is memory — HNSW lives in RAM, so very large corpora need DiskANN or IVF-based indexes instead."
Found this helpful?
Leave a comment
Have a question, correction, or just found this helpful? Leave a note below.