
RAG Systems · Lesson 9 of 24

Cosine Similarity vs Dot Product vs Euclidean

The Two Common Metrics

Cosine similarity:
  cos(A, B) = (A · B) / (‖A‖ × ‖B‖)
  Range: [-1, 1]  (typical text embeddings mostly score in [0, 1])
  Measures: ANGLE between vectors — ignores magnitude

Dot product:
  A · B = Σ aᵢ × bᵢ
  Range: unbounded
  Measures: magnitude AND angle combined

For unit-normalised vectors, they are identical: If ‖A‖ = ‖B‖ = 1, then cos(A, B) = A · B.
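
The identity above is easy to check numerically. This is a minimal sketch using two made-up 2-D vectors (illustrative values, not real embeddings): the raw dot product and the cosine differ, but once both vectors are scaled to unit length the dot product equals the cosine.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical example vectors with different magnitudes
a = np.array([3.0, 4.0])   # norm 5
b = np.array([1.0, 2.0])   # norm sqrt(5)

# Scale each vector to unit length
a_unit = a / np.linalg.norm(a)
b_unit = b / np.linalg.norm(b)

raw_dot = float(np.dot(a, b))             # depends on magnitude
unit_dot = float(np.dot(a_unit, b_unit))  # magnitude removed
cos = cosine_similarity(a, b)

print(f"raw dot: {raw_dot:.4f}, unit dot: {unit_dot:.4f}, cosine: {cos:.4f}")
```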


When They Differ

Cosine is magnitude-invariant:
  embed("cat") = [0.5, 0.3, ...]  (small magnitude)
  embed("The cat sat on the mat. It was a large cat.") = [1.0, 0.6, ...]
  
  Cosine similarity between them: high (same direction, same topic)
  Dot product: scales with both magnitudes, so the low-magnitude
  vector pulls the score down even though the direction matches

  Use cosine when: magnitude should not affect ranking
  Use cosine for: most RAG applications

Dot product rewards high-magnitude vectors:
  If a model is trained with dot product and long documents
  deliberately get higher magnitude, dot product captures
  "importance" (document is highly about this topic)

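  Use dot product when: the embedding model was explicitly
  trained with a dot product objective and its outputs are not
  normalised. For models that return unit vectors (e.g., OpenAI's
  text-embedding-3 models), dot product and cosine give the same scores.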
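
To make the difference concrete, here is a small sketch with made-up 2-D vectors (hypothetical values chosen for illustration). One document is closely aligned with the query but has a small magnitude, and the other is less aligned but has a large magnitude. The two metrics pick different winners.

```python
import numpy as np

query = np.array([1.0, 0.0])
docs = np.array([
    [0.9, 0.1],   # doc 0: closely aligned with the query, small magnitude
    [2.0, 2.0],   # doc 1: less aligned (45 degrees), large magnitude
])

# Dot product: magnitude and angle combined
dot_scores = docs @ query

# Cosine: divide out both magnitudes, leaving only the angle
cos_scores = dot_scores / (np.linalg.norm(docs, axis=1) * np.linalg.norm(query))

best_by_dot = int(np.argmax(dot_scores))
best_by_cos = int(np.argmax(cos_scores))

print("dot scores:", dot_scores, "best:", best_by_dot)
print("cos scores:", cos_scores, "best:", best_by_cos)
```

Dot product ranks the large-magnitude document first, while cosine ranks the better-aligned one first.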

Which Metric for Which Model

Model                       | Recommended metric | Why
----------------------------|--------------------|---------------------------------
text-embedding-3-small      | cosine or dot      | normalised by default
text-embedding-3-large      | cosine or dot      | normalised by default
all-MiniLM-L6-v2            | cosine             | trained with cosine objective
all-mpnet-base-v2           | cosine             | normalised embeddings
MedCPT-Query-Encoder        | cosine             | FAISS cosine during training
BGE models (BAAI)           | dot product        | trained with inner product
E5 models                   | cosine             | normalised embeddings

Rule: check the model card. Using the wrong metric degrades retrieval quality.
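
When a model card is unclear about normalisation, one quick sanity check is to measure the norms of a few embeddings yourself. The sketch below assumes you already have a batch of embeddings as a NumPy array. The helper name `is_unit_normalised` and the tolerance are illustrative choices, not part of any library.

```python
import numpy as np

def is_unit_normalised(embeddings: np.ndarray, tol: float = 1e-3) -> bool:
    """Return True if every row has an L2 norm within `tol` of 1."""
    norms = np.linalg.norm(embeddings, axis=1)
    return bool(np.all(np.abs(norms - 1.0) < tol))

# Hypothetical stand-ins for model output
already_unit = np.array([[0.6, 0.8], [1.0, 0.0]])   # both rows have norm 1
not_unit = np.array([[3.0, 4.0], [1.0, 0.0]])       # first row has norm 5

print(is_unit_normalised(already_unit))
print(is_unit_normalised(not_unit))
```

If the check passes, cosine and dot product produce identical scores for that model. If it fails, the choice of metric affects ranking and the model card decides.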

Implementation

Python
import numpy as np
from numpy.linalg import norm

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return np.dot(a, b) / (norm(a) * norm(b))

def dot_product(a: np.ndarray, b: np.ndarray) -> float:
    return np.dot(a, b)

# Batch cosine (normalise then dot product)
def cosine_similarity_batch(
    query: np.ndarray,          # shape: (d,)
    docs: np.ndarray,           # shape: (n, d)
) -> np.ndarray:
    query_norm = query / norm(query)
    docs_norm = docs / norm(docs, axis=1, keepdims=True)
    return docs_norm @ query_norm   # shape: (n,)

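# Pre-normalise at index time (recommended)
def normalise(embeddings: np.ndarray) -> np.ndarray:
    norms = norm(embeddings, axis=1, keepdims=True)
    # Guard against zero vectors, which would otherwise produce NaN
    return embeddings / np.clip(norms, 1e-12, None)

# Illustrative stand-ins for real model output: 100 docs, 384 dimensions
rng = np.random.default_rng(0)
doc_embeddings_normalised = normalise(rng.normal(size=(100, 384)))
query_embedding = rng.normal(size=384)

# Then at query time: dot product == cosine (faster)
query_norm = normalise(query_embedding.reshape(1, -1))[0]
scores = doc_embeddings_normalised @ query_norm   # shape: (100,)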

Chroma / FAISS Configuration

Python
import chromadb
import faiss
import numpy as np

# Chroma: specify space in collection metadata
client = chromadb.Client()  # in-memory client
collection = client.get_or_create_collection(
    name="docs",
    metadata={"hnsw:space": "cosine"}  # or "ip" (inner product) or "l2"
)

# FAISS: choose index type
d = 768
# Illustrative stand-in for real embeddings; FAISS expects float32
embeddings = np.random.default_rng(0).normal(size=(1000, d)).astype("float32")

# Cosine: normalise vectors, use inner product index
index_cosine = faiss.IndexFlatIP(d)  # inner product on normalised = cosine
# Normalise a copy before adding, so the raw array stays intact:
normalised = embeddings.copy()
faiss.normalize_L2(normalised)       # in-place normalisation
index_cosine.add(normalised)

# Pure dot product (no normalisation)
index_ip = faiss.IndexFlatIP(d)
index_ip.add(embeddings)             # raw embeddings

# L2 (Euclidean): rarely used for text
index_l2 = faiss.IndexFlatL2(d)

Distance vs Similarity

Vector databases often return distance, not similarity:

Cosine distance = 1 - cosine_similarity
  cosine_sim = 0.9 → distance = 0.1  (very similar)
  cosine_sim = 0.5 → distance = 0.5  (moderately similar)
  cosine_sim = 0.0 → distance = 1.0  (unrelated)

L2 distance: Euclidean distance — not the same as cosine distance
  For normalised vectors: L2² = 2 × (1 - cosine_sim)
  So L2 ranking == cosine ranking on normalised vectors
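
The identity follows from expanding the squared norm: ‖a − b‖² = ‖a‖² + ‖b‖² − 2(a · b), which equals 2 − 2 × cosine_sim when both norms are 1. A short numerical check, using random unit vectors as stand-ins for real embeddings:

```python
import numpy as np

rng = np.random.default_rng(0)

# Random vectors scaled to unit length (illustrative, not real embeddings)
docs = rng.normal(size=(5, 8))
docs /= np.linalg.norm(docs, axis=1, keepdims=True)
query = rng.normal(size=8)
query /= np.linalg.norm(query)

cos = docs @ query                           # cosine similarity (unit vectors)
l2_sq = np.sum((docs - query) ** 2, axis=1)  # squared Euclidean distance

identity_holds = bool(np.allclose(l2_sq, 2 * (1 - cos)))

# Highest cosine first should match lowest L2 first
same_ranking = bool(np.array_equal(np.argsort(-cos), np.argsort(l2_sq)))

print(identity_holds, same_ranking)
```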

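Chroma returns "distances": lower is more similar. With hnsw:space set to
"cosine" these are cosine distances, so convert with similarity = 1 - distance.
Chroma's default space is "l2", where this conversion does not apply.

Python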
def retrieve_with_similarity(query_embedding, collection, top_k=5):
    results = collection.query(
        query_embeddings=[query_embedding],
        n_results=top_k,
        include=["documents", "metadatas", "distances"]
    )
    
    return [
        {
            "content": doc,
            "metadata": meta,
            "similarity": 1 - dist,   # convert distance to similarity
        }
        for doc, meta, dist in zip(
            results["documents"][0],
            results["metadatas"][0],
            results["distances"][0],
        )
    ]
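
The `1 - dist` conversion above is only right for a cosine collection. A hedged sketch of a more general helper follows. It assumes Chroma's distance definitions as I understand them from its documentation (cosine: 1 − cosine similarity; ip: 1 − inner product; l2: squared Euclidean distance), so verify these against the version you run. The function name `distance_to_similarity` is hypothetical, and the l2 branch is only meaningful for unit-normalised embeddings.

```python
def distance_to_similarity(distance: float, space: str) -> float:
    """Convert a returned distance to a similarity score.

    Assumes (check your Chroma version):
      cosine -> distance = 1 - cosine similarity
      ip     -> distance = 1 - inner product
      l2     -> distance = squared Euclidean distance
    """
    if space in ("cosine", "ip"):
        return 1.0 - distance
    if space == "l2":
        # For unit vectors: L2^2 = 2 * (1 - cos), so cos = 1 - L2^2 / 2
        return 1.0 - distance / 2.0
    raise ValueError(f"unknown space: {space!r}")

print(distance_to_similarity(0.1, "cosine"))
print(distance_to_similarity(0.2, "l2"))
```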

Interview Answer

"Cosine similarity measures the angle between vectors (magnitude-invariant), while dot product measures both angle and magnitude. For normalised unit vectors they are identical. In RAG, cosine is the default choice because it's robust to varying text lengths. The important thing is to match the metric to the embedding model's training objective — BGE models use dot product; MiniLM and E5 use cosine. Mismatching degrades retrieval quality. Practically, pre-normalising embeddings at index time and then using dot product is faster than computing cosine at query time."