Live Coding Interview Prep · Lesson 8 of 16
Implement Cosine Similarity in NumPy
Why Cosine Similarity?
In LLM applications, text is represented as dense vectors (embeddings). To find similar texts, you measure the angle between their vectors — not the distance.
Why angle, not distance? A long document and a short document on the same topic should be similar. Euclidean distance would say they're far apart (different magnitudes). Cosine similarity normalizes for magnitude, measuring only direction.
cos(θ) = (A · B) / (||A|| × ||B||)
Range: -1 to 1 (1 = identical direction, 0 = orthogonal, -1 = opposite)
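To make the contrast with distance concrete, here is a minimal sketch comparing the two measures when one vector is a scaled copy of the other. The vectors `short_doc` and `long_doc` are made-up toy values, not real embeddings.

```python
import math

# Hypothetical toy vectors: long_doc points the same way as short_doc
# but is 10x longer (illustrative values, not real embeddings).
short_doc = [1.0, 2.0, 3.0]
long_doc = [10.0, 20.0, 30.0]

def euclidean(a, b):
    """Straight-line distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine(a, b):
    """Direct transcription of cos(theta) = (A . B) / (||A|| * ||B||)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

dist = euclidean(short_doc, long_doc)
sim = cosine(short_doc, long_doc)
print(f"Euclidean distance: {dist:.3f}")  # large, because the magnitudes differ
print(f"Cosine similarity:  {sim:.3f}")   # 1.000, because the directions match
```

Scaling a vector by any positive constant changes its Euclidean distance to other vectors but leaves its cosine similarity unchanged.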
Naive Implementation
import math

def dot_product(a: list[float], b: list[float]) -> float:
    """Sum of element-wise products."""
    return sum(x * y for x, y in zip(a, b))

def magnitude(v: list[float]) -> float:
    """Euclidean norm (L2 norm) of a vector."""
    return math.sqrt(sum(x**2 for x in v))

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors. Returns value in [-1, 1]."""
    if len(a) != len(b):
        raise ValueError(f"Vector dimensions must match: {len(a)} vs {len(b)}")
    mag_a = magnitude(a)
    mag_b = magnitude(b)
    if mag_a == 0.0 or mag_b == 0.0:
        return 0.0  # Zero vector has no direction
    return dot_product(a, b) / (mag_a * mag_b)

# Test
a = [1.0, 0.0, 1.0]    # Points in x-z direction
b = [1.0, 0.0, 1.0]    # Identical
c = [0.0, 1.0, 0.0]    # Points in y direction (orthogonal to a)
d = [-1.0, 0.0, -1.0]  # Opposite direction
print(cosine_similarity(a, b))  # 1.0 (identical)
print(cosine_similarity(a, c))  # 0.0 (orthogonal)
print(cosine_similarity(a, d))  # -1.0 (opposite)
Efficient Implementation with NumPy
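The naive version is O(d) per pair, and so is any exact method, but the pure-Python loop pays interpreter overhead on every element. That is fine for single comparisons and slow for large batches. NumPy runs the same arithmetic as vectorized, compiled operations: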
import numpy as np

def cosine_similarity_np(a: np.ndarray, b: np.ndarray) -> float:
    """Single pair, NumPy version."""
    dot = np.dot(a, b)
    norm = np.linalg.norm(a) * np.linalg.norm(b)
    return float(dot / norm) if norm > 0 else 0.0

def cosine_similarity_matrix(query: np.ndarray, corpus: np.ndarray) -> np.ndarray:
    """
    Compute cosine similarity between one query and all corpus vectors.
    query: (d,) vector
    corpus: (n, d) matrix of n vectors
    Returns: (n,) array of similarities
    """
    # Normalize query (the small epsilon avoids division by zero)
    query_norm = query / (np.linalg.norm(query) + 1e-10)
    # Normalize corpus rows
    norms = np.linalg.norm(corpus, axis=1, keepdims=True)
    corpus_normalized = corpus / (norms + 1e-10)
    # Dot product = cosine similarity for normalized vectors
    return corpus_normalized @ query_norm

# Example: search embeddings
n_docs = 1000
embedding_dim = 1536  # OpenAI text-embedding-3-small dimension
# Simulate document embeddings
corpus_embeddings = np.random.randn(n_docs, embedding_dim).astype(np.float32)
query_embedding = np.random.randn(embedding_dim).astype(np.float32)
similarities = cosine_similarity_matrix(query_embedding, corpus_embeddings)
# Top-5 most similar documents
top_k = 5
top_indices = np.argsort(similarities)[-top_k:][::-1]
for idx in top_indices:
    print(f"Doc {idx}: similarity = {similarities[idx]:.4f}")
Batch Query vs Corpus
For multiple queries against a corpus:
def batch_cosine_similarity(
    queries: np.ndarray,  # (q, d)
    corpus: np.ndarray,   # (n, d)
) -> np.ndarray:          # Returns (q, n) similarity matrix
    """All queries vs all corpus documents."""
    # Normalize along embedding dimension (axis=1)
    q_norm = queries / (np.linalg.norm(queries, axis=1, keepdims=True) + 1e-10)
    c_norm = corpus / (np.linalg.norm(corpus, axis=1, keepdims=True) + 1e-10)
    # Matrix multiplication: (q, d) @ (d, n) = (q, n)
    return q_norm @ c_norm.T

# Test
q = np.random.randn(5, 128)     # 5 queries, 128-dim embeddings
c = np.random.randn(1000, 128)  # 1000 documents
sim_matrix = batch_cosine_similarity(q, c)
print(f"Shape: {sim_matrix.shape}")  # (5, 1000)
# For each query, find the top-3 results and print all three
for i in range(len(q)):
    top3 = np.argsort(sim_matrix[i])[-3:][::-1]
    matches = ", ".join(f"doc {j} ({sim_matrix[i, j]:.3f})" for j in top3)
    print(f"Query {i} top matches: {matches}")
Cosine Distance vs Cosine Similarity
Many vector libraries use cosine distance = 1 - cosine_similarity. This converts similarity (higher = more similar) to distance (lower = more similar), making it compatible with k-NN search algorithms.
def cosine_distance(a: np.ndarray, b: np.ndarray) -> float:
    return 1.0 - cosine_similarity_np(a, b)

# Distance = 0: identical, Distance = 1: orthogonal, Distance = 2: opposite
When to Use Cosine Similarity vs Dot Product
| Situation | Use |
|---|---|
| Vectors already normalized (unit length) | Dot product (same result, faster) |
| Vectors have meaningful magnitudes | Euclidean distance |
| Embeddings from OpenAI, Cohere, SBERT | Cosine similarity |
| Training a retrieval model | Often dot product (for speed) |
OpenAI's embedding API returns normalized (unit-length) vectors, and many other embedding models do the same, though it is worth checking the norms before relying on this. For unit-length vectors, cosine similarity equals the dot product, so skipping the normalization step is a useful optimization.
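As a rough illustration of skipping per-query normalization, the sketch below normalizes a corpus once up front and then scores a query with a plain matrix-vector product. The helper name `normalize_rows` and the random data are assumptions made for this example; real embeddings would come from your model.

```python
import numpy as np

def normalize_rows(x: np.ndarray) -> np.ndarray:
    """Scale each row to unit L2 norm (hypothetical helper for this sketch)."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.maximum(norms, 1e-10)  # guard against all-zero rows

rng = np.random.default_rng(0)
corpus = rng.standard_normal((100, 64))  # simulated embeddings
query = rng.standard_normal(64)

# Pay the normalization cost once, e.g. at indexing time.
corpus_unit = normalize_rows(corpus)
query_unit = query / np.linalg.norm(query)

# For unit vectors, the dot product is the cosine similarity.
dot_scores = corpus_unit @ query_unit

# Reference: the full cosine formula applied to the raw vectors.
cos_scores = (corpus @ query) / (
    np.linalg.norm(corpus, axis=1) * np.linalg.norm(query)
)

print(np.allclose(dot_scores, cos_scores))  # True
```

If the vectors might not be unit length, a quick check such as `np.allclose(np.linalg.norm(corpus, axis=1), 1.0)` confirms the assumption before you drop the normalization.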
Interview Questions
Q: When would cosine similarity give misleading results?
When vector magnitude carries meaningful information. Cosine similarity discards magnitude, so if a long detailed document and a short summary are compared, it may call them "similar" even though they represent very different levels of detail. A magnitude-sensitive measure such as the dot product can separate them, but only if the embedding model actually encodes that difference in the vector norm.
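A small sketch of how the two measures can rank the same documents differently when magnitudes differ. The vectors are hand-picked toy values (an assumption for illustration), chosen so the arithmetic is easy to check by hand.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

query = [1.0, 0.0]
docs = {
    "aligned_short": [1.0, 0.0],  # same direction as the query, small norm
    "offaxis_long": [3.0, 3.0],   # 45 degrees off the query, large norm
}

# Pick the best match under each measure.
by_cosine = max(docs, key=lambda name: cosine(query, docs[name]))
by_dot = max(docs, key=lambda name: dot(query, docs[name]))

print(by_cosine)  # aligned_short (cosine 1.0 vs ~0.707)
print(by_dot)     # offaxis_long (dot product 3.0 vs 1.0)
```

Which ranking is "right" depends on whether the norm means anything for your embeddings; the sketch only shows that the choice of measure can change the result.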
Q: How does cosine similarity relate to vector databases?
Vector DBs (Pinecone, Weaviate, pgvector) store embeddings and support approximate nearest-neighbor (ANN) search using algorithms like HNSW or IVF. These indexes find high-cosine-similarity vectors without comparing the query against every stored vector, which would cost O(n×d) per query. ANN trades exact results for speed, which is usually acceptable because embedding similarity is itself only an approximation of semantic relevance.
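For contrast with ANN indexes, the exact search they approximate can be sketched as a brute-force scan. This is an illustrative baseline only, not how any particular vector database is implemented; `exact_top_k` is a hypothetical helper, and it assumes the query and the corpus rows are already unit length.

```python
import numpy as np

def exact_top_k(query_unit: np.ndarray, corpus_unit: np.ndarray, k: int) -> np.ndarray:
    """Indices of the k most similar rows, best first (assumes k >= 1)."""
    scores = corpus_unit @ query_unit  # O(n*d): touches every stored vector
    k = min(k, scores.shape[0])
    # argpartition selects the k largest scores without sorting all n of them
    candidates = np.argpartition(scores, -k)[-k:]
    # sort only those k candidates, highest score first
    return candidates[np.argsort(scores[candidates])[::-1]]

rng = np.random.default_rng(0)
corpus = rng.standard_normal((1000, 32))  # simulated embeddings
corpus_unit = corpus / np.linalg.norm(corpus, axis=1, keepdims=True)
query_unit = corpus_unit[42]              # query identical to document 42

top = exact_top_k(query_unit, corpus_unit, k=5)
print(top[0])  # 42: the document identical to the query ranks first
```

A scan like this is a handy ground truth for measuring the recall of an ANN index on a sample of queries.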