Learnixo

Implement Cosine Similarity in NumPy

Why Cosine Similarity?

In LLM applications, text is represented as dense vectors (embeddings). To find similar texts, you measure the angle between their vectors — not the distance.

Why angle, not distance? A long document and a short document on the same topic should count as similar. If their vectors point the same way but differ in magnitude, Euclidean distance places them far apart. Cosine similarity normalizes for magnitude and measures only direction.

cos(θ) = (A · B) / (||A|| × ||B||)

Range: -1 to 1 (1 = identical direction, 0 = orthogonal, -1 = opposite)
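
To make the magnitude argument concrete, here is a minimal sketch comparing a vector with a scaled copy of itself. The vectors `short_doc` and `long_doc` are made-up toy values, not real embeddings; the scaled copy stands in for a longer document on the same topic.

```python
import numpy as np

# Hypothetical toy vectors: long_doc is short_doc scaled by 10,
# so both point in exactly the same direction.
short_doc = np.array([1.0, 2.0, 3.0])
long_doc = 10.0 * short_doc

# Euclidean distance grows with the difference in magnitude
euclidean = np.linalg.norm(long_doc - short_doc)

# Cosine similarity divides the magnitudes out
cosine = np.dot(short_doc, long_doc) / (
    np.linalg.norm(short_doc) * np.linalg.norm(long_doc)
)

print(f"Euclidean distance: {euclidean:.3f}")
print(f"Cosine similarity:  {cosine:.3f}")
```

The Euclidean distance is large even though the direction is unchanged, while the cosine similarity stays at 1 up to floating-point rounding.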


Naive Implementation

Python
import math

def dot_product(a: list[float], b: list[float]) -> float:
    """Sum of element-wise products."""
    return sum(x * y for x, y in zip(a, b))

def magnitude(v: list[float]) -> float:
    """Euclidean norm (L2 norm) of a vector."""
    return math.sqrt(sum(x**2 for x in v))

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors. Returns value in [-1, 1]."""
    if len(a) != len(b):
        raise ValueError(f"Vector dimensions must match: {len(a)} vs {len(b)}")

    mag_a = magnitude(a)
    mag_b = magnitude(b)

    if mag_a == 0.0 or mag_b == 0.0:
        return 0.0  # Zero vector has no direction

    return dot_product(a, b) / (mag_a * mag_b)

# Test
a = [1.0, 0.0, 1.0]  # Points in x-z direction
b = [1.0, 0.0, 1.0]  # Identical
c = [0.0, 1.0, 0.0]  # Points in y direction (orthogonal to a)
d = [-1.0, 0.0, -1.0] # Opposite direction

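# Floating-point rounding means the results are close to, not exactly, ±1.0
print(cosine_similarity(a, b))  # ~1.0 (identical); may print 0.9999999999999998
print(cosine_similarity(a, c))  # 0.0 (orthogonal)
print(cosine_similarity(a, d))  # ~-1.0 (opposite); may print -0.9999999999999998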

Efficient Implementation with NumPy

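The naive version runs an interpreted Python loop over all d dimensions for every pair. That is fine for single comparisons but slow for large batches. NumPy does the same O(d) work per pair in vectorized, compiled code and can score a whole corpus in one call: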

Python
import numpy as np

def cosine_similarity_np(a: np.ndarray, b: np.ndarray) -> float:
    """Single pair, NumPy version."""
    dot = np.dot(a, b)
    norm = np.linalg.norm(a) * np.linalg.norm(b)
    return float(dot / norm) if norm > 0 else 0.0

def cosine_similarity_matrix(query: np.ndarray, corpus: np.ndarray) -> np.ndarray:
    """
    Compute cosine similarity between one query and all corpus vectors.
    query: (d,) vector
    corpus: (n, d) matrix of n vectors
    Returns: (n,) array of similarities
    """
    # Normalize query
    query_norm = query / (np.linalg.norm(query) + 1e-10)

    # Normalize corpus rows
    norms = np.linalg.norm(corpus, axis=1, keepdims=True)
    corpus_normalized = corpus / (norms + 1e-10)

    # Dot product = cosine similarity for normalized vectors
    return corpus_normalized @ query_norm

# Example: search embeddings
n_docs = 1000
embedding_dim = 1536  # OpenAI text-embedding-3-small dimension

# Simulate document embeddings
corpus_embeddings = np.random.randn(n_docs, embedding_dim).astype(np.float32)
query_embedding = np.random.randn(embedding_dim).astype(np.float32)

similarities = cosine_similarity_matrix(query_embedding, corpus_embeddings)

# Top-5 most similar documents
top_k = 5
top_indices = np.argsort(similarities)[-top_k:][::-1]
for idx in top_indices:
    print(f"Doc {idx}: similarity = {similarities[idx]:.4f}")
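
The full `np.argsort` above sorts all n scores even though only k are needed. As a hedged alternative for large corpora, the sketch below uses `np.argpartition` to select the k largest scores first and then sorts only those. The helper name `top_k_indices` is hypothetical, and the scores are random stand-ins for real similarities.

```python
import numpy as np

def top_k_indices(scores: np.ndarray, k: int) -> np.ndarray:
    """Indices of the k largest scores, highest first (hypothetical helper)."""
    k = min(k, scores.shape[0])
    if k <= 0:
        return np.empty(0, dtype=np.intp)
    # argpartition moves the k largest entries into the last k slots, unordered
    candidates = np.argpartition(scores, -k)[-k:]
    # Sort only those k candidates by descending score
    return candidates[np.argsort(scores[candidates])[::-1]]

rng = np.random.default_rng(0)
scores = rng.standard_normal(1000)  # stand-in for a similarity array

print(top_k_indices(scores, 5))
```

Partial selection is linear in n on average, so the gain only matters when n is much larger than k; for a few thousand documents the plain `argsort` is usually fine.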

Batch Query vs Corpus

For multiple queries against a corpus:

Python
def batch_cosine_similarity(
    queries: np.ndarray,   # (q, d)
    corpus: np.ndarray,    # (n, d)
) -> np.ndarray:           # Returns (q, n) similarity matrix
    """All queries vs all corpus documents."""
    # Normalize along embedding dimension (axis=1)
    q_norm = queries / (np.linalg.norm(queries, axis=1, keepdims=True) + 1e-10)
    c_norm = corpus / (np.linalg.norm(corpus, axis=1, keepdims=True) + 1e-10)

    # Matrix multiplication: (q, d) @ (d, n) = (q, n)
    return q_norm @ c_norm.T

# Test
q = np.random.randn(5, 128)    # 5 queries, 128-dim embeddings
c = np.random.randn(1000, 128) # 1000 documents

sim_matrix = batch_cosine_similarity(q, c)
print(f"Shape: {sim_matrix.shape}")  # (5, 1000)

# For each query, find the top-3 results
for i in range(len(q)):
    top3 = np.argsort(sim_matrix[i])[-3:][::-1]
    # Report all three matches, best first
    matches = ", ".join(f"doc {j} ({sim_matrix[i, j]:.3f})" for j in top3)
    print(f"Query {i} top matches: {matches}")
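
In an interview it helps to show that the vectorized version agrees with the simple one. The sketch below is one way to check that, assuming a pairwise function and a batch function shaped like the ones in this lesson; the names `pairwise_cosine` and `batch_cosine` are redefined here only so the block is self-contained.

```python
import numpy as np

def pairwise_cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Reference implementation for a single pair."""
    norm = np.linalg.norm(a) * np.linalg.norm(b)
    return float(np.dot(a, b) / norm) if norm > 0 else 0.0

def batch_cosine(queries: np.ndarray, corpus: np.ndarray, eps: float = 1e-10) -> np.ndarray:
    """Vectorized (q, d) x (n, d) -> (q, n) similarity matrix."""
    q = queries / (np.linalg.norm(queries, axis=1, keepdims=True) + eps)
    c = corpus / (np.linalg.norm(corpus, axis=1, keepdims=True) + eps)
    return q @ c.T

rng = np.random.default_rng(42)
queries = rng.standard_normal((3, 16))
corpus = rng.standard_normal((10, 16))

batch = batch_cosine(queries, corpus)
# Slow double loop, used only as ground truth for the check
loop = np.array([[pairwise_cosine(q, c) for c in corpus] for q in queries])

max_diff = float(np.max(np.abs(batch - loop)))
print(f"Max absolute difference: {max_diff:.2e}")
```

The two results differ only by floating-point noise and the small epsilon added to the norms.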

Cosine Distance vs Cosine Similarity

Many vector libraries use cosine distance = 1 - cosine_similarity. This converts similarity (higher = more similar) to distance (lower = more similar), making it compatible with k-NN search algorithms.

Python
def cosine_distance(a: np.ndarray, b: np.ndarray) -> float:
    return 1.0 - cosine_similarity_np(a, b)

# Distance = 0: identical, Distance = 1: orthogonal, Distance = 2: opposite
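
One way to see why cosine distance fits distance-based search: for unit-length vectors, squared Euclidean distance equals 2 × cosine distance, because ||a - b||² = 2 - 2(a · b). Both therefore rank neighbors in the same order. The sketch below checks that identity numerically on random vectors, which are stand-ins for real embeddings.

```python
import numpy as np

rng = np.random.default_rng(7)
a = rng.standard_normal(64)
b = rng.standard_normal(64)

# Scale both vectors to unit length
a_unit = a / np.linalg.norm(a)
b_unit = b / np.linalg.norm(b)

cos_dist = 1.0 - float(np.dot(a_unit, b_unit))
sq_euclid = float(np.sum((a_unit - b_unit) ** 2))

# For unit vectors: ||a - b||^2 = 2 * (1 - cos(theta))
print(np.isclose(sq_euclid, 2.0 * cos_dist))
```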

When to Use Cosine Similarity vs Dot Product

| Situation | Use |
|---|---|
| Vectors already normalized (unit length) | Dot product (same result, faster) |
| Vectors have meaningful magnitudes | Euclidean distance |
| Embeddings from OpenAI, Cohere, SBERT | Cosine similarity |
| Training a retrieval model | Often dot product (for speed) |

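OpenAI's embedding models return vectors normalized to unit length, and many other embedding APIs do the same. For normalized vectors, cosine similarity equals the dot product, so skipping the normalization step is a useful optimization. Check the norms of a sample before relying on this, since not every model normalizes its output.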
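
A minimal sketch of that optimization, assuming the corpus is normalized once up front (for example at indexing time) so that each query needs only a matrix-vector product. The helper `normalize_rows` and the random data are illustrative, not part of any library.

```python
import numpy as np

def normalize_rows(x: np.ndarray, eps: float = 1e-10) -> np.ndarray:
    """Scale each row to unit length (hypothetical helper)."""
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)

rng = np.random.default_rng(1)
corpus = rng.standard_normal((100, 32))
corpus_unit = normalize_rows(corpus)  # done once, not per query

query = rng.standard_normal(32)
query_unit = query / np.linalg.norm(query)

# With unit vectors, the dot product is the cosine similarity
scores = corpus_unit @ query_unit

# Sanity check before trusting the shortcut: are the rows really unit length?
is_unit = bool(np.allclose(np.linalg.norm(corpus_unit, axis=1), 1.0, atol=1e-6))
print(f"Rows are unit length: {is_unit}")
```

The same norm check can be run on vectors returned by an embedding API to confirm they are already normalized.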


Interview Questions

Q: When would cosine similarity give misleading results?

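When vector magnitude carries meaningful information. Cosine similarity discards magnitude, so if a long detailed document and a short summary produce embeddings that point in the same direction, they score as "similar" even though they represent very different levels of detail. A dot product keeps the magnitude signal, but that only helps if the embedding model actually encodes something useful in vector length. With unit-normalized embeddings, the two measures are identical.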

Q: How does cosine similarity relate to vector databases?

Vector DBs (Pinecone, Weaviate, pgvector) store embeddings and support approximate nearest-neighbor (ANN) search using algorithms like HNSW or IVF. These efficiently find high-cosine-similarity vectors without comparing against all stored vectors (which would be O(n×d) per query). ANN trades exact results for speed. That trade is usually acceptable because embedding similarity is itself only an approximation of relevance.