HNSW: The Vector Index Powering RAG

Why You Need an Index

Naive similarity search scans every vector and computes cosine similarity. At 1M documents this is too slow:

Brute-force at 1M chunks:
  768-dim vectors × 1M = 768M multiplications per query
  On CPU: ~100-500ms per query
  With GPU: ~10ms — usable but expensive
  
HNSW at 1M chunks:
  ~1ms per query on CPU
  Trades a small accuracy loss for a ~100-500× speedup

HNSW (Hierarchical Navigable Small World) is the dominant vector index algorithm used by Chroma, Qdrant, Weaviate, pgvector, and FAISS.

How HNSW Works

HNSW is a multi-layer graph where:

Bottom layer contains all vectors, densely connected
Upper layers are progressively sparser — "express lanes"
Search enters at the top, greedily descends to the query's neighbourhood

Layer 2 (sparse):   A ---- E
Layer 1 (medium):   A -- C -- E -- G
Layer 0 (dense):    A - B - C - D - E - F - G - H
                              ^
                         query lands here

Search: enter at top layer → follow edges greedily →
        descend to next layer → repeat → at layer 0,
        return ef_search nearest neighbours

Each node is assigned a random maximum layer via geometric distribution. This prevents pathological worst-cases and gives the "small world" property — any node reachable in O(log n) hops.

Key HNSW Parameters

M (connections per node, default 16):
  Controls graph connectivity at each layer
  Higher M → better recall, more memory, slower build
  M=8:  fast build, lower recall (~0.95)
  M=16: good balance (recommended starting point)
  M=32: high recall (~0.99), 2× memory and build time

ef_construction (build-time search width, default 200):
  How many candidates the builder considers when inserting each node
  Higher → better graph quality, slower indexing
  ef_construction=100: fast, acceptable quality
  ef_construction=200: recommended
  ef_construction=500: high quality, 2.5× slower build

ef_search (query-time search width, default 50):
  How many candidates to track during search
  Higher → better recall, higher latency
  ef_search=50:  fast (~1ms), recall ~0.95
  ef_search=100: slower (~2ms), recall ~0.98
  ef_search=200: precise (~4ms), recall ~0.995

HNSW in Practice with Chroma

Python

import chromadb
from chromadb.config import Settings

# Chroma uses HNSW by default via hnswlib
client = chromadb.PersistentClient(path="./chroma_db")

# Collection with custom HNSW parameters
collection = client.get_or_create_collection(
    name="clinical_docs",
    metadata={
        "hnsw:space": "cosine",      # distance metric
        "hnsw:M": 16,                # connections per node
        "hnsw:ef_construction": 200, # build-time ef
        "hnsw:ef_search": 100,       # query-time ef (can be tuned per-query)
    }
)

# At query time, ef_search determines recall vs latency trade-off
results = collection.query(
    query_embeddings=[query_embedding],
    n_results=10,
    # ef_search can be overridden per-query in newer Chroma versions
)

HNSW with FAISS

Python

import faiss
import numpy as np

d = 768    # embedding dimension
M = 16     # connections per node

# Build HNSW index
index = faiss.IndexHNSWFlat(d, M)
index.hnsw.efConstruction = 200

# Add vectors (must be float32)
embeddings = np.array(all_embeddings, dtype=np.float32)
index.add(embeddings)

# Search
index.hnsw.efSearch = 100
query_vec = np.array([query_embedding], dtype=np.float32)
distances, indices = index.search(query_vec, k=10)

# Save/load
faiss.write_index(index, "clinical.index")
index = faiss.read_index("clinical.index")

Recall vs Latency Trade-off

ef_search | Recall@10 | Query latency (1M vecs)
----------|-----------|------------------------
       20 |     0.88  |  0.5ms
       50 |     0.95  |  1.0ms
      100 |     0.98  |  2.0ms
      200 |     0.99  |  4.0ms
      500 |     0.999 |  8.0ms

For RAG, recall@10 of 0.95 is usually fine.
Missing 5% of results matters less than latency SLA.
Use ef_search=100 as a starting point, tune with offline eval.

When HNSW Falls Short

Memory: HNSW stores the graph structure in RAM
  1M vectors × 768 dims × 4 bytes = 3GB vectors
  + HNSW graph overhead (~M × 8 bytes/node) ≈ 128MB
  Total ≈ 3.1GB for 1M vectors — manageable

Disk: HNSW requires the index in RAM — can't stream from disk
  For very large corpora (100M+ vectors), use IVF (inverted file) index
  or disk-ANN (DiskANN, used by Azure AI Search)

Filtering + HNSW: filtering AFTER retrieval hurts recall
  If 90% of results are filtered by metadata, need ef_search 10× higher
  Better: pre-filter with metadata, then HNSW on the subset
  Qdrant and Weaviate support filtered HNSW natively

Interview Answer

"HNSW is a multi-layer graph index for approximate nearest neighbour search. It enables sub-millisecond vector retrieval at millions of vectors by building a hierarchy of graphs — sparse upper layers act as express lanes, dense lower layers for precision. Key parameters: M (graph connectivity, default 16), ef_construction (build quality, default 200), and ef_search (query recall vs latency trade-off, tuned per use case). For RAG, ef_search=100 giving ~0.98 recall is a good starting point. The main limitation is memory — HNSW lives in RAM, so very large corpora need DiskANN or IVF-based indexes instead."