Learnixo
Back to blog
AI Systemsintermediate

HNSW: The Vector Index Powering RAG

How Hierarchical Navigable Small World graphs enable fast approximate nearest neighbour search in vector databases, and the parameters that matter for RAG.

Asma Hafeez KhanMay 21, 20264 min read
RAGHNSWVector IndexANNInterview
Share:𝕏

Why You Need an Index

Naive similarity search scans every vector and computes cosine similarity. At 1M documents this is too slow:

Brute-force at 1M chunks:
  768-dim vectors × 1M = 768M multiplications per query
  On CPU: ~100-500ms per query
  With GPU: ~10ms — usable but expensive
  
HNSW at 1M chunks:
  ~1ms per query on CPU
  Trades a small accuracy loss for a ~100-500× speedup

HNSW (Hierarchical Navigable Small World) is the dominant vector index algorithm used by Chroma, Qdrant, Weaviate, pgvector, and FAISS.


How HNSW Works

HNSW is a multi-layer graph where:

  • Bottom layer contains all vectors, densely connected
  • Upper layers are progressively sparser — "express lanes"
  • Search enters at the top, greedily descends to the query's neighbourhood
Layer 2 (sparse):   A ---- E
Layer 1 (medium):   A -- C -- E -- G
Layer 0 (dense):    A - B - C - D - E - F - G - H
                              ^
                         query lands here

Search: enter at top layer → follow edges greedily →
        descend to next layer → repeat → at layer 0,
        return ef_search nearest neighbours

Each node is assigned a random maximum layer via geometric distribution. This prevents pathological worst-cases and gives the "small world" property — any node reachable in O(log n) hops.


Key HNSW Parameters

M (connections per node, default 16):
  Controls graph connectivity at each layer
  Higher M → better recall, more memory, slower build
  M=8:  fast build, lower recall (~0.95)
  M=16: good balance (recommended starting point)
  M=32: high recall (~0.99), 2× memory and build time

ef_construction (build-time search width, default 200):
  How many candidates the builder considers when inserting each node
  Higher → better graph quality, slower indexing
  ef_construction=100: fast, acceptable quality
  ef_construction=200: recommended
  ef_construction=500: high quality, 2.5× slower build

ef_search (query-time search width, default 50):
  How many candidates to track during search
  Higher → better recall, higher latency
  ef_search=50:  fast (~1ms), recall ~0.95
  ef_search=100: slower (~2ms), recall ~0.98
  ef_search=200: precise (~4ms), recall ~0.995

HNSW in Practice with Chroma

Python
import chromadb
from chromadb.config import Settings

# Chroma uses HNSW by default via hnswlib
client = chromadb.PersistentClient(path="./chroma_db")

# Collection with custom HNSW parameters
collection = client.get_or_create_collection(
    name="clinical_docs",
    metadata={
        "hnsw:space": "cosine",      # distance metric
        "hnsw:M": 16,                # connections per node
        "hnsw:ef_construction": 200, # build-time ef
        "hnsw:ef_search": 100,       # query-time ef (can be tuned per-query)
    }
)

# At query time, ef_search determines recall vs latency trade-off
results = collection.query(
    query_embeddings=[query_embedding],
    n_results=10,
    # ef_search can be overridden per-query in newer Chroma versions
)

HNSW with FAISS

Python
import faiss
import numpy as np

d = 768    # embedding dimension
M = 16     # connections per node

# Build HNSW index
index = faiss.IndexHNSWFlat(d, M)
index.hnsw.efConstruction = 200

# Add vectors (must be float32)
embeddings = np.array(all_embeddings, dtype=np.float32)
index.add(embeddings)

# Search
index.hnsw.efSearch = 100
query_vec = np.array([query_embedding], dtype=np.float32)
distances, indices = index.search(query_vec, k=10)

# Save/load
faiss.write_index(index, "clinical.index")
index = faiss.read_index("clinical.index")

Recall vs Latency Trade-off

ef_search | Recall@10 | Query latency (1M vecs)
----------|-----------|------------------------
       20 |     0.88  |  0.5ms
       50 |     0.95  |  1.0ms
      100 |     0.98  |  2.0ms
      200 |     0.99  |  4.0ms
      500 |     0.999 |  8.0ms

For RAG, recall@10 of 0.95 is usually fine.
Missing 5% of results matters less than latency SLA.
Use ef_search=100 as a starting point, tune with offline eval.

When HNSW Falls Short

Memory: HNSW stores the graph structure in RAM
  1M vectors × 768 dims × 4 bytes = 3GB vectors
  + HNSW graph overhead (~M × 8 bytes/node) ≈ 128MB
  Total ≈ 3.1GB for 1M vectors — manageable

Disk: HNSW requires the index in RAM — can't stream from disk
  For very large corpora (100M+ vectors), use IVF (inverted file) index
  or disk-ANN (DiskANN, used by Azure AI Search)

Filtering + HNSW: filtering AFTER retrieval hurts recall
  If 90% of results are filtered by metadata, need ef_search 10× higher
  Better: pre-filter with metadata, then HNSW on the subset
  Qdrant and Weaviate support filtered HNSW natively

Interview Answer

"HNSW is a multi-layer graph index for approximate nearest neighbour search. It enables sub-millisecond vector retrieval at millions of vectors by building a hierarchy of graphs — sparse upper layers act as express lanes, dense lower layers for precision. Key parameters: M (graph connectivity, default 16), ef_construction (build quality, default 200), and ef_search (query recall vs latency trade-off, tuned per use case). For RAG, ef_search=100 giving ~0.98 recall is a good starting point. The main limitation is memory — HNSW lives in RAM, so very large corpora need DiskANN or IVF-based indexes instead."

Enjoyed this article?

Explore the AI Systems learning path for more.

Found this helpful?

Share:𝕏

Leave a comment

Have a question, correction, or just found this helpful? Leave a note below.