Implement Cosine Similarity
Implement cosine similarity from scratch. Understand why it measures semantic closeness, how it relates to vector search, and how to use it efficiently with NumPy.
Why Cosine Similarity?
In LLM applications, text is represented as dense vectors (embeddings). To find similar texts, you measure the angle between their vectors, not the distance.
Why angle, not distance? A long document and a short document on the same topic should be similar. Euclidean distance would say they're far apart (different magnitudes). Cosine similarity normalizes for magnitude, measuring only direction.
cos(θ) = (A · B) / (||A|| × ||B||)
Range: -1 to 1 (1 = identical direction, 0 = orthogonal, -1 = opposite)
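To make the magnitude argument concrete, here is a small sketch. The vectors and the helper names `euclidean` and `cosine` are made up for illustration. Scaling a vector moves it far away in Euclidean terms but leaves its direction, and therefore its cosine similarity, unchanged.

```python
import math

def euclidean(a, b):
    """Straight-line distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

short_doc = [1.0, 2.0, 3.0]
long_doc = [10.0, 20.0, 30.0]  # same direction, 10x the magnitude

dist = euclidean(short_doc, long_doc)
sim = cosine(short_doc, long_doc)
print(f"Euclidean distance: {dist:.3f}")  # large, driven by the scale gap
print(f"Cosine similarity:  {sim:.3f}")   # 1.000, direction is identical
```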
Naive Implementation
import math


def dot_product(a: list[float], b: list[float]) -> float:
    """Sum of element-wise products."""
    return sum(x * y for x, y in zip(a, b))


def magnitude(v: list[float]) -> float:
    """Euclidean norm (L2 norm) of a vector."""
    return math.sqrt(sum(x**2 for x in v))


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors. Returns value in [-1, 1]."""
    if len(a) != len(b):
        raise ValueError(f"Vector dimensions must match: {len(a)} vs {len(b)}")
    mag_a = magnitude(a)
    mag_b = magnitude(b)
    if mag_a == 0.0 or mag_b == 0.0:
        return 0.0  # Zero vector has no direction
    return dot_product(a, b) / (mag_a * mag_b)


# Test
a = [1.0, 0.0, 1.0]  # Points in x-z direction
b = [1.0, 0.0, 1.0]  # Identical
c = [0.0, 1.0, 0.0]  # Points in y direction (orthogonal to a)
d = [-1.0, 0.0, -1.0]  # Opposite direction
# Floating-point rounding means the first and last results print as
# 0.9999999999999998 and -0.9999999999999998 rather than exactly 1.0 and -1.0.
print(cosine_similarity(a, b))  # ~1.0 (identical)
print(cosine_similarity(a, c))  # 0.0 (orthogonal)
print(cosine_similarity(a, d))  # ~-1.0 (opposite)
Efficient Implementation with NumPy
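The naive version is O(d) per pair and runs every multiplication in the Python interpreter: fine for single comparisons but slow for large batches. NumPy has the same asymptotic cost but performs the arithmetic in vectorized, compiled loops: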
import numpy as np


def cosine_similarity_np(a: np.ndarray, b: np.ndarray) -> float:
    """Single pair, NumPy version."""
    dot = np.dot(a, b)
    norm = np.linalg.norm(a) * np.linalg.norm(b)
    return float(dot / norm) if norm > 0 else 0.0


def cosine_similarity_matrix(query: np.ndarray, corpus: np.ndarray) -> np.ndarray:
    """
    Compute cosine similarity between one query and all corpus vectors.

    query: (d,) vector
    corpus: (n, d) matrix of n vectors
    Returns: (n,) array of similarities
    """
    # Normalize query
    query_norm = query / (np.linalg.norm(query) + 1e-10)
    # Normalize corpus rows
    norms = np.linalg.norm(corpus, axis=1, keepdims=True)
    corpus_normalized = corpus / (norms + 1e-10)
    # Dot product = cosine similarity for normalized vectors
    return corpus_normalized @ query_norm


# Example: search embeddings
n_docs = 1000
embedding_dim = 1536  # OpenAI text-embedding-3-small dimension
# Simulate document embeddings
corpus_embeddings = np.random.randn(n_docs, embedding_dim).astype(np.float32)
query_embedding = np.random.randn(embedding_dim).astype(np.float32)
similarities = cosine_similarity_matrix(query_embedding, corpus_embeddings)
# Top-5 most similar documents
top_k = 5
top_indices = np.argsort(similarities)[-top_k:][::-1]
for idx in top_indices:
    print(f"Doc {idx}: similarity = {similarities[idx]:.4f}")
Batch Query vs Corpus
For multiple queries against a corpus:
def batch_cosine_similarity(
    queries: np.ndarray,  # (q, d)
    corpus: np.ndarray,  # (n, d)
) -> np.ndarray:  # Returns (q, n) similarity matrix
    """All queries vs all corpus documents."""
    # Normalize along embedding dimension (axis=1)
    q_norm = queries / (np.linalg.norm(queries, axis=1, keepdims=True) + 1e-10)
    c_norm = corpus / (np.linalg.norm(corpus, axis=1, keepdims=True) + 1e-10)
    # Matrix multiplication: (q, d) @ (d, n) = (q, n)
    return q_norm @ c_norm.T


# Test
q = np.random.randn(5, 128)  # 5 queries, 128-dim embeddings
c = np.random.randn(1000, 128)  # 1000 documents
sim_matrix = batch_cosine_similarity(q, c)
print(f"Shape: {sim_matrix.shape}")  # (5, 1000)
# For each query, find the top-3 results
for i in range(len(q)):
    top3 = np.argsort(sim_matrix[i])[-3:][::-1]
    matches = ", ".join(f"doc {j} ({sim_matrix[i][j]:.3f})" for j in top3)
    print(f"Query {i} top matches: {matches}")
Cosine Distance vs Cosine Similarity
Many vector libraries use cosine distance = 1 - cosine_similarity. This converts similarity (higher = more similar) to distance (lower = more similar), making it compatible with k-NN search algorithms.
def cosine_distance(a: np.ndarray, b: np.ndarray) -> float:
    return 1.0 - cosine_similarity_np(a, b)


# Distance = 0: identical, Distance = 1: orthogonal, Distance = 2: opposite
When to Use Cosine Similarity vs Dot Product
| Situation | Use |
|---|---|
| Vectors already normalized (unit length) | Dot product (same result, faster) |
| Vectors have meaningful magnitudes | Euclidean distance |
| Embeddings from OpenAI, Cohere, SBERT | Cosine similarity |
| Training a retrieval model | Often dot product (for speed) |
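OpenAI documents its embeddings as normalized to unit length, and many other embedding APIs do the same or offer it as an option. For normalized vectors, cosine similarity equals the dot product, so skipping the normalization step is a useful optimization. If you are unsure whether your vectors are normalized, check their norms before relying on this.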
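A minimal sketch of that optimization, assuming the corpus can be normalized once at index time. The helper `normalize_rows` and the random data are illustrative and do not come from any library.

```python
import numpy as np

def normalize_rows(x: np.ndarray, eps: float = 1e-10) -> np.ndarray:
    """Scale each row to unit L2 length (eps guards against zero rows)."""
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)

rng = np.random.default_rng(0)
corpus = rng.standard_normal((1000, 128))
query = rng.standard_normal(128)

# Index time: normalize once and store the result.
corpus_unit = normalize_rows(corpus)
query_unit = query / np.linalg.norm(query)

# Query time: a plain dot product, no norms recomputed.
fast_scores = corpus_unit @ query_unit

# Reference: the full cosine formula on the raw vectors.
full_scores = (corpus @ query) / (
    np.linalg.norm(corpus, axis=1) * np.linalg.norm(query)
)

print(np.allclose(fast_scores, full_scores))  # True
```

The saving comes from paying for the norms once per document instead of once per query.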
Interview Questions
Q: When would cosine similarity give misleading results?
When document length carries meaningful information. If a long detailed document and a short summary are compared, cosine similarity may call them "similar" even though they represent very different levels of detail. A magnitude-sensitive measure such as the dot product can reflect that difference, provided the embedding model actually encodes it in vector length.
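A toy illustration of the arithmetic, using hand-picked 2-D vectors rather than real embeddings. Both documents point the same way as the query, so cosine similarity cannot tell them apart, while the dot product grows with the longer vector. Whether vector length really tracks level of detail depends on the embedding model, so treat this as a sketch only.

```python
import numpy as np

query = np.array([1.0, 1.0])
summary = np.array([2.0, 2.0])   # shorter vector
detailed = np.array([6.0, 6.0])  # same direction, 3x the length

def cos_sim(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity for two non-zero vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

cos_summary = cos_sim(query, summary)
cos_detailed = cos_sim(query, detailed)
dot_summary = float(query @ summary)
dot_detailed = float(query @ detailed)

print(f"cosine: {cos_summary:.3f} vs {cos_detailed:.3f}")  # 1.000 vs 1.000
print(f"dot:    {dot_summary:.1f} vs {dot_detailed:.1f}")  # 4.0 vs 12.0
```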
Q: How does cosine similarity relate to vector databases?
Vector DBs (Pinecone, Weaviate, pgvector) store embeddings and support approximate nearest-neighbor (ANN) search using algorithms like HNSW or IVF. These efficiently find high-cosine-similarity vectors without comparing against all stored vectors (which would be O(n×d) per query). ANN trades exact results for speed, which is usually acceptable because embeddings are themselves only an approximate representation of meaning.
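To give a feel for that trade-off, here is a toy IVF-style index in NumPy. It is a sketch under loud assumptions: the centroids are randomly sampled documents rather than trained k-means centers, and `build_ivf`, `search_ivf`, `n_lists`, and `n_probe` are hypothetical names, not the API of any vector database. Documents are bucketed by their most similar centroid, and a query scans only the `n_probe` best-matching buckets.

```python
import numpy as np

def unit(x: np.ndarray) -> np.ndarray:
    """L2-normalize along the last axis."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + 1e-10)

def build_ivf(corpus_unit: np.ndarray, n_lists: int, rng: np.random.Generator):
    """Pick random docs as centroids and bucket every doc by its best centroid."""
    centroid_ids = rng.choice(len(corpus_unit), size=n_lists, replace=False)
    centroids = corpus_unit[centroid_ids]
    assignment = np.argmax(corpus_unit @ centroids.T, axis=1)
    buckets = [np.flatnonzero(assignment == i) for i in range(n_lists)]
    return centroids, buckets

def search_ivf(query_unit, corpus_unit, centroids, buckets, n_probe, top_k):
    """Scan only the n_probe buckets whose centroids best match the query."""
    probe = np.argsort(centroids @ query_unit)[-n_probe:]
    candidates = np.concatenate([buckets[i] for i in probe])
    scores = corpus_unit[candidates] @ query_unit
    order = np.argsort(scores)[-top_k:][::-1]
    return candidates[order]

rng = np.random.default_rng(0)
corpus_unit = unit(rng.standard_normal((2000, 64)))
query_unit = unit(rng.standard_normal(64))

centroids, buckets = build_ivf(corpus_unit, n_lists=20, rng=rng)

# Exact search: compare the query against every document.
exact = np.argsort(corpus_unit @ query_unit)[-5:][::-1]
# Approximate search: scan 4 of the 20 buckets.
approx = search_ivf(query_unit, corpus_unit, centroids, buckets, n_probe=4, top_k=5)
# Probing every bucket degenerates to exact search.
full = search_ivf(query_unit, corpus_unit, centroids, buckets, n_probe=20, top_k=5)

print("exact :", exact)
print("approx:", approx)  # may miss some of the exact hits
print("recall:", len(set(exact.tolist()) & set(approx.tolist())) / 5)
```

Raising `n_probe` scans more candidates and recovers more of the exact results, which is the speed-for-accuracy dial the answer above describes. Production indexes add trained centroids, compression, and graph structures that this sketch omits.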