Implement Cosine Similarity

Why Cosine Similarity?

In LLM applications, text is represented as dense vectors (embeddings). To find similar texts, you measure the angle between their vectors — not the distance.

Why angle, not distance? A long document and a short document on the same topic should be similar. Euclidean distance would say they're far apart (different magnitudes). Cosine similarity normalizes for magnitude, measuring only direction.

cos(θ) = (A · B) / (||A|| × ||B||)

Range: -1 to 1 (1 = identical direction, 0 = orthogonal, -1 = opposite)
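
To make the magnitude argument concrete, here is a minimal sketch with made-up vectors (`short_doc` and `long_doc` are hypothetical toy values, not real embeddings). The second vector is the first scaled by 10, so the direction is the same while the length differs.

```python
import math

# Hypothetical toy vectors: long_doc is short_doc scaled by 10,
# so both point in the same direction but differ in magnitude.
short_doc = [1.0, 2.0, 2.0]
long_doc = [10.0, 20.0, 20.0]

dot = sum(x * y for x, y in zip(short_doc, long_doc))
norm_short = math.sqrt(sum(x * x for x in short_doc))
norm_long = math.sqrt(sum(x * x for x in long_doc))

euclidean = math.dist(short_doc, long_doc)  # sensitive to magnitude
cosine = dot / (norm_short * norm_long)     # depends on direction only

print(f"Euclidean distance: {euclidean:.1f}")  # 27.0
print(f"Cosine similarity:  {cosine:.1f}")     # 1.0
```

Euclidean distance reports the two vectors as far apart, while cosine similarity treats them as pointing the same way.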

Naive Implementation

Python

import math

def dot_product(a: list[float], b: list[float]) -> float:
    """Sum of element-wise products."""
    return sum(x * y for x, y in zip(a, b))

def magnitude(v: list[float]) -> float:
    """Euclidean norm (L2 norm) of a vector."""
    return math.sqrt(sum(x**2 for x in v))

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors. Returns value in [-1, 1]."""
    if len(a) != len(b):
        raise ValueError(f"Vector dimensions must match: {len(a)} vs {len(b)}")

    mag_a = magnitude(a)
    mag_b = magnitude(b)

    if mag_a == 0.0 or mag_b == 0.0:
        return 0.0  # Zero vector has no direction

    return dot_product(a, b) / (mag_a * mag_b)

# Test
a = [1.0, 0.0, 1.0]  # Points in x-z direction
b = [1.0, 0.0, 1.0]  # Identical
c = [0.0, 1.0, 0.0]  # Points in y direction (orthogonal to a)
d = [-1.0, 0.0, -1.0] # Opposite direction

print(cosine_similarity(a, b))  # approximately 1.0 (identical; rounding may print 0.9999999999999998)
print(cosine_similarity(a, c))  # 0.0 (orthogonal)
print(cosine_similarity(a, d))  # approximately -1.0 (opposite; same rounding caveat)
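
Floating-point rounding can leave the result just short of ±1, or in some cases slightly outside [-1, 1], which matters if the value is passed to `math.acos` to recover the angle. A minimal sketch, assuming a hypothetical helper `angle_between` (not part of the code above) that clamps before taking the arccosine:

```python
import math

def angle_between(a: list[float], b: list[float]) -> float:
    """Angle in degrees between two non-zero vectors (hypothetical helper)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    cos_theta = dot / (norm_a * norm_b)
    # Clamp to [-1, 1] so rounding error cannot make acos raise ValueError.
    cos_theta = max(-1.0, min(1.0, cos_theta))
    return math.degrees(math.acos(cos_theta))

print(angle_between([1.0, 0.0], [0.0, 1.0]))  # orthogonal: about 90 degrees
print(angle_between([1.0, 0.0], [1.0, 1.0]))  # about 45 degrees
```

The helper assumes neither input is the zero vector; a caller would need to handle that case separately, as `cosine_similarity` does.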

Efficient Implementation with NumPy

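The naive version does O(d) work per pair, and so does NumPy. The difference is the constant factor: the pure-Python loop runs in the interpreter, which is fine for single comparisons but slow for large batches. NumPy performs the same arithmetic in vectorized, compiled code: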

Python

import numpy as np

def cosine_similarity_np(a: np.ndarray, b: np.ndarray) -> float:
    """Single pair, NumPy version."""
    dot = np.dot(a, b)
    norm = np.linalg.norm(a) * np.linalg.norm(b)
    return float(dot / norm) if norm > 0 else 0.0

def cosine_similarity_matrix(query: np.ndarray, corpus: np.ndarray) -> np.ndarray:
    """
    Compute cosine similarity between one query and all corpus vectors.
    query: (d,) vector
    corpus: (n, d) matrix of n vectors
    Returns: (n,) array of similarities
    """
    # Normalize query
    query_norm = query / (np.linalg.norm(query) + 1e-10)

    # Normalize corpus rows
    norms = np.linalg.norm(corpus, axis=1, keepdims=True)
    corpus_normalized = corpus / (norms + 1e-10)

    # Dot product = cosine similarity for normalized vectors
    return corpus_normalized @ query_norm

# Example: search embeddings
n_docs = 1000
embedding_dim = 1536  # OpenAI text-embedding-3-small dimension

# Simulate document embeddings
corpus_embeddings = np.random.randn(n_docs, embedding_dim).astype(np.float32)
query_embedding = np.random.randn(embedding_dim).astype(np.float32)

similarities = cosine_similarity_matrix(query_embedding, corpus_embeddings)

# Top-5 most similar documents
top_k = 5
top_indices = np.argsort(similarities)[-top_k:][::-1]
for idx in top_indices:
    print(f"Doc {idx}: similarity = {similarities[idx]:.4f}")
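
`np.argsort` sorts all n scores, which is more work than needed when only a few results are wanted. One common alternative is `np.argpartition`, which selects the k largest entries without fully sorting the rest. A sketch, assuming a hypothetical helper `top_k_indices` and random scores standing in for real similarities:

```python
import numpy as np

def top_k_indices(scores: np.ndarray, k: int) -> np.ndarray:
    """Indices of the k largest scores, best first (hypothetical helper)."""
    k = min(k, scores.shape[0])
    if k <= 0:
        return np.empty(0, dtype=np.intp)
    # argpartition puts the k largest entries in the last k slots, unordered.
    candidates = np.argpartition(scores, -k)[-k:]
    # Sort only those k candidates by descending score.
    order = np.argsort(scores[candidates])[::-1]
    return candidates[order]

rng = np.random.default_rng(0)
scores = rng.standard_normal(1000)  # simulated similarity scores
top5 = top_k_indices(scores, 5)
for idx in top5:
    print(f"Doc {idx}: similarity = {scores[idx]:.4f}")
```

For a corpus of a thousand documents the difference is negligible; the partial selection is more likely to pay off when n is large and k is small.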

Batch Query vs Corpus

For multiple queries against a corpus:

Python

def batch_cosine_similarity(
    queries: np.ndarray,   # (q, d)
    corpus: np.ndarray,    # (n, d)
) -> np.ndarray:           # Returns (q, n) similarity matrix
    """All queries vs all corpus documents."""
    # Normalize along embedding dimension (axis=1)
    q_norm = queries / (np.linalg.norm(queries, axis=1, keepdims=True) + 1e-10)
    c_norm = corpus / (np.linalg.norm(corpus, axis=1, keepdims=True) + 1e-10)

    # Matrix multiplication: (q, d) @ (d, n) = (q, n)
    return q_norm @ c_norm.T

# Test
q = np.random.randn(5, 128)    # 5 queries, 128-dim embeddings
c = np.random.randn(1000, 128) # 1000 documents

sim_matrix = batch_cosine_similarity(q, c)
print(f"Shape: {sim_matrix.shape}")  # (5, 1000)

# For each query, find and print the top-3 results
for i in range(len(q)):
    top3 = np.argsort(sim_matrix[i])[-3:][::-1]
    matches = ", ".join(f"doc {j} ({sim_matrix[i, j]:.3f})" for j in top3)
    print(f"Query {i} top matches: {matches}")

Cosine Distance vs Cosine Similarity

Many vector libraries use cosine distance = 1 - cosine_similarity. This converts similarity (higher = more similar) to distance (lower = more similar), making it compatible with k-NN search algorithms.

Python

def cosine_distance(a: np.ndarray, b: np.ndarray) -> float:
    return 1.0 - cosine_similarity_np(a, b)

# Distance = 0: identical, Distance = 1: orthogonal, Distance = 2: opposite
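
Because the conversion only flips and shifts the scale, ranking by ascending distance gives the same order as ranking by descending similarity. A small sketch with made-up similarity scores (illustrative values only):

```python
import numpy as np

# Made-up similarity scores for four documents (illustrative only).
similarities = np.array([0.92, -0.10, 0.35, 0.78])
distances = 1.0 - similarities

by_similarity = np.argsort(-similarities)  # highest similarity first
by_distance = np.argsort(distances)        # lowest distance first

print(by_similarity)  # [0 3 2 1]
print(by_distance)    # [0 3 2 1]
```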

When to Use Cosine Similarity vs Dot Product

| Situation | Use |
|---|---|
| Vectors already normalized (unit length) | Dot product (same result, faster) |
| Vectors have meaningful magnitudes | Euclidean distance |
| Embeddings from OpenAI, Cohere, SBERT | Cosine similarity |
| Training a retrieval model | Often dot product (for speed) |

OpenAI's embedding API returns vectors normalized to unit length, and many other embedding APIs do the same, though not all of them, so check your provider's documentation. For normalized vectors, cosine similarity equals the dot product, so skipping the normalization step is a useful optimization.
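
Before dropping the normalization step, it is worth confirming that the vectors really are unit length. A minimal sketch, assuming a hypothetical helper `is_unit_normalized` and a tolerance of 1e-3 chosen arbitrarily to absorb float32 rounding; the data is simulated, not real embeddings:

```python
import numpy as np

def is_unit_normalized(vectors: np.ndarray, tol: float = 1e-3) -> bool:
    """True if every row has L2 norm within tol of 1 (hypothetical helper)."""
    norms = np.linalg.norm(vectors, axis=1)
    return bool(np.all(np.abs(norms - 1.0) <= tol))

rng = np.random.default_rng(0)
raw = rng.standard_normal((100, 64)).astype(np.float32)  # simulated vectors
unit = raw / np.linalg.norm(raw, axis=1, keepdims=True)  # normalize once, up front

print(is_unit_normalized(raw))   # False
print(is_unit_normalized(unit))  # True

# For unit-length rows, a plain dot product already equals cosine similarity.
query = unit[0]
scores = unit @ query
```

Normalizing the corpus once at indexing time and storing the result means each query needs only a matrix-vector product.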

Interview Questions

Q: When would cosine similarity give misleading results?

When magnitude carries meaningful information. If a long detailed document and a short summary are compared, cosine similarity may call them "similar" even though they represent very different levels of detail, because it discards vector length entirely. A magnitude-sensitive measure such as the dot product can reflect that difference, but only if the embedding model actually encodes it in the vector's length.

Q: How does cosine similarity relate to vector databases?

Vector DBs (Pinecone, Weaviate, pgvector) store embeddings and support approximate nearest-neighbor (ANN) search using algorithms like HNSW or IVF. These efficiently find high-cosine-similarity vectors without comparing against all stored vectors, which would cost O(n×d) per query. ANN trades exact results for speed, and occasionally missing a true nearest neighbor is usually acceptable because embedding similarity is itself only an approximate measure of relevance.
