AI Systems · Intermediate

Implement Cosine Similarity

Implement cosine similarity from scratch. Understand why it measures semantic closeness, how it relates to vector search, and how to use it efficiently with NumPy.

Asma Hafeez Khan · May 16, 2026 · 5 min read
Live Coding · Cosine Similarity · Vector Search · Python

Why Cosine Similarity?

In LLM applications, text is represented as dense vectors (embeddings). To find similar texts, you measure the angle between their vectors, not the distance between them.

Why angle, not distance? A long document and a short document on the same topic should be similar. If their vectors point the same way but differ in magnitude, Euclidean distance reports them as far apart. Cosine similarity normalizes for magnitude and measures only direction.

cos(θ) = (A · B) / (||A|| × ||B||)

Range: -1 to 1 (1 = identical direction, 0 = orthogonal, -1 = opposite)
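
A quick numerical check makes the formula concrete. The sketch below uses two made-up 3-dimensional vectors (not real embeddings), where `long_doc` is simply `short_doc` scaled by 10, and compares Euclidean distance with cosine similarity:

```python
import numpy as np

# Hypothetical toy vectors: same direction, different magnitude.
short_doc = np.array([1.0, 2.0, 3.0])
long_doc = 10.0 * short_doc

# Straight-line distance between the two points
euclidean = np.linalg.norm(long_doc - short_doc)

# cos(theta) = (A . B) / (||A|| * ||B||)
cosine = np.dot(short_doc, long_doc) / (
    np.linalg.norm(short_doc) * np.linalg.norm(long_doc)
)

print(f"Euclidean distance: {euclidean:.3f}")  # large: the magnitudes differ
print(f"Cosine similarity:  {cosine:.3f}")     # close to 1: same direction
```

Scaling a vector changes its distance to everything else but leaves every angle unchanged, which is the property the formula is built to capture.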


Naive Implementation

Python
import math

def dot_product(a: list[float], b: list[float]) -> float:
    """Sum of element-wise products."""
    return sum(x * y for x, y in zip(a, b))

def magnitude(v: list[float]) -> float:
    """Euclidean norm (L2 norm) of a vector."""
    return math.sqrt(sum(x**2 for x in v))

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors. Returns value in [-1, 1]."""
    if len(a) != len(b):
        raise ValueError(f"Vector dimensions must match: {len(a)} vs {len(b)}")

    mag_a = magnitude(a)
    mag_b = magnitude(b)

    if mag_a == 0.0 or mag_b == 0.0:
        return 0.0  # Zero vector has no direction

    return dot_product(a, b) / (mag_a * mag_b)

# Test
a = [1.0, 0.0, 1.0]  # Points in x-z direction
b = [1.0, 0.0, 1.0]  # Identical
c = [0.0, 1.0, 0.0]  # Points in y direction (orthogonal to a)
d = [-1.0, 0.0, -1.0] # Opposite direction

print(cosine_similarity(a, b))  # ≈ 1.0 (identical direction; float rounding may print 0.9999999999999998)
print(cosine_similarity(a, c))  # 0.0 (orthogonal)
print(cosine_similarity(a, d))  # ≈ -1.0 (opposite direction, same rounding caveat)
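
Because the square roots are computed in floating point, the result can land a hair away from the exact value, and in principle slightly outside [-1, 1]. That matters if you pass it to `math.acos` to recover the angle. One possible way to guard against it is sketched below; the helper names `cosine_similarity_clamped` and `angle_degrees` are illustrative and not part of the implementation above.

```python
import math

def cosine_similarity_clamped(a: list[float], b: list[float]) -> float:
    """Cosine similarity, clamped to [-1, 1] to absorb rounding error."""
    if len(a) != len(b):
        raise ValueError("Vector dimensions must match")
    dot = sum(x * y for x, y in zip(a, b))
    mag_a = math.sqrt(sum(x * x for x in a))
    mag_b = math.sqrt(sum(y * y for y in b))
    if mag_a == 0.0 or mag_b == 0.0:
        return 0.0  # same zero-vector convention as the naive version
    return max(-1.0, min(1.0, dot / (mag_a * mag_b)))

def angle_degrees(a: list[float], b: list[float]) -> float:
    """Angle between two vectors; math.acos needs its input inside [-1, 1]."""
    return math.degrees(math.acos(cosine_similarity_clamped(a, b)))

a = [1.0, 0.0, 1.0]
# Compare with a tolerance instead of testing for exact equality with 1.0
print(math.isclose(cosine_similarity_clamped(a, a), 1.0))
print(angle_degrees(a, [0.0, 1.0, 0.0]))  # angle between orthogonal vectors
```

When testing similarity code, comparing with `math.isclose` rather than `==` avoids spurious failures from this kind of rounding.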

Efficient Implementation with NumPy

The naive version is O(d) per pair, which is fine for single comparisons, but pure-Python loops become slow when you compare a query against thousands of vectors. NumPy performs the same arithmetic in vectorized, compiled code:

Python
import numpy as np

def cosine_similarity_np(a: np.ndarray, b: np.ndarray) -> float:
    """Single pair, NumPy version."""
    dot = np.dot(a, b)
    norm = np.linalg.norm(a) * np.linalg.norm(b)
    return float(dot / norm) if norm > 0 else 0.0

def cosine_similarity_matrix(query: np.ndarray, corpus: np.ndarray) -> np.ndarray:
    """
    Compute cosine similarity between one query and all corpus vectors.
    query: (d,) vector
    corpus: (n, d) matrix of n vectors
    Returns: (n,) array of similarities
    """
    # Normalize query
    query_norm = query / (np.linalg.norm(query) + 1e-10)

    # Normalize corpus rows
    norms = np.linalg.norm(corpus, axis=1, keepdims=True)
    corpus_normalized = corpus / (norms + 1e-10)

    # Dot product = cosine similarity for normalized vectors
    return corpus_normalized @ query_norm

# Example: search embeddings
n_docs = 1000
embedding_dim = 1536  # OpenAI text-embedding-3-small dimension

# Simulate document embeddings
corpus_embeddings = np.random.randn(n_docs, embedding_dim).astype(np.float32)
query_embedding = np.random.randn(embedding_dim).astype(np.float32)

similarities = cosine_similarity_matrix(query_embedding, corpus_embeddings)

# Top-5 most similar documents
top_k = 5
top_indices = np.argsort(similarities)[-top_k:][::-1]
for idx in top_indices:
    print(f"Doc {idx}: similarity = {similarities[idx]:.4f}")
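
Sorting every score with `np.argsort` costs O(n log n) even though only a handful of results are needed. A possible alternative, sketched here with a hypothetical helper `top_k_indices`, is to use `np.argpartition` to isolate the k largest scores first and then sort only those:

```python
import numpy as np

def top_k_indices(scores: np.ndarray, k: int) -> np.ndarray:
    """Indices of the k largest scores, ordered from highest to lowest."""
    k = min(k, scores.shape[0])
    if k <= 0:
        return np.empty(0, dtype=np.intp)
    # argpartition places the k largest scores in the last k slots, unordered
    candidates = np.argpartition(scores, -k)[-k:]
    # Sort only those k candidates by descending score
    return candidates[np.argsort(scores[candidates])[::-1]]

# Random scores standing in for real similarities
rng = np.random.default_rng(0)
scores = rng.standard_normal(10_000)
print(top_k_indices(scores, 5))
```

For a corpus of a thousand documents the difference is negligible; it is more likely to matter when n reaches the millions and k stays small.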

Batch Query vs Corpus

For multiple queries against a corpus:

Python
def batch_cosine_similarity(
    queries: np.ndarray,   # (q, d)
    corpus: np.ndarray,    # (n, d)
) -> np.ndarray:           # Returns (q, n) similarity matrix
    """All queries vs all corpus documents."""
    # Normalize along embedding dimension (axis=1)
    q_norm = queries / (np.linalg.norm(queries, axis=1, keepdims=True) + 1e-10)
    c_norm = corpus / (np.linalg.norm(corpus, axis=1, keepdims=True) + 1e-10)

    # Matrix multiplication: (q, d) @ (d, n) = (q, n)
    return q_norm @ c_norm.T

# Test
q = np.random.randn(5, 128)    # 5 queries, 128-dim embeddings
c = np.random.randn(1000, 128) # 1000 documents

sim_matrix = batch_cosine_similarity(q, c)
print(f"Shape: {sim_matrix.shape}")  # (5, 1000)

# For each query, find the top-3 results
for i in range(len(q)):
    top3 = np.argsort(sim_matrix[i])[-3:][::-1]
    matches = ", ".join(f"doc {j} ({sim_matrix[i][j]:.3f})" for j in top3)
    print(f"Query {i} top matches: {matches}")
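
When vectorizing, it helps to confirm that the matrix version agrees with a slow, obviously correct reference. The check below is a minimal sketch on small random data; `batch_cosine` repeats the normalize-then-multiply approach so the block runs on its own, and `pairwise_reference` is a hypothetical helper written only for this comparison.

```python
import numpy as np

def batch_cosine(queries: np.ndarray, corpus: np.ndarray) -> np.ndarray:
    """Normalize rows, then one matrix product: (q, d) @ (d, n) = (q, n)."""
    q = queries / (np.linalg.norm(queries, axis=1, keepdims=True) + 1e-10)
    c = corpus / (np.linalg.norm(corpus, axis=1, keepdims=True) + 1e-10)
    return q @ c.T

def pairwise_reference(queries: np.ndarray, corpus: np.ndarray) -> np.ndarray:
    """Slow reference: one cosine similarity per (query, document) pair."""
    out = np.zeros((len(queries), len(corpus)))
    for i, qv in enumerate(queries):
        for j, cv in enumerate(corpus):
            out[i, j] = np.dot(qv, cv) / (np.linalg.norm(qv) * np.linalg.norm(cv))
    return out

rng = np.random.default_rng(42)
queries = rng.standard_normal((3, 16))
corpus = rng.standard_normal((20, 16))

fast = batch_cosine(queries, corpus)
slow = pairwise_reference(queries, corpus)

# The two should agree up to the small epsilon added to the norms
print(np.allclose(fast, slow, atol=1e-8))
```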

Cosine Distance vs Cosine Similarity

Many vector libraries use cosine distance = 1 - cosine_similarity. This converts similarity (higher = more similar) to distance (lower = more similar), making it compatible with k-NN search algorithms.

Python
def cosine_distance(a: np.ndarray, b: np.ndarray) -> float:
    return 1.0 - cosine_similarity_np(a, b)

# Distance = 0: identical, Distance = 1: orthogonal, Distance = 2: opposite
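
Cosine distance also connects directly to Euclidean distance. For unit-length vectors, the squared Euclidean distance equals 2 × cosine distance, so a nearest-neighbor search by Euclidean distance over normalized vectors returns the same ranking as a search by cosine similarity. A small check on random vectors (illustrative data only):

```python
import numpy as np

rng = np.random.default_rng(1)
a = rng.standard_normal(64)
b = rng.standard_normal(64)

# Scale both vectors to unit length
a_unit = a / np.linalg.norm(a)
b_unit = b / np.linalg.norm(b)

cos_dist = 1.0 - float(np.dot(a_unit, b_unit))
sq_euclidean = float(np.sum((a_unit - b_unit) ** 2))

# For unit vectors: ||a - b||^2 = 2 - 2 * cos(theta) = 2 * cosine distance
print(np.isclose(sq_euclidean, 2.0 * cos_dist))
```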

When to Use Cosine Similarity vs Dot Product

| Situation | Use |
|---|---|
| Vectors already normalized (unit length) | Dot product (same result, faster) |
| Vectors have meaningful magnitudes | Euclidean distance |
| Embeddings from OpenAI, Cohere, SBERT | Cosine similarity |
| Training a retrieval model | Often dot product (for speed) |

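OpenAI's embedding models return vectors normalized to unit length, and many other embedding APIs either do the same or offer it as an option, so check the provider's documentation. For normalized vectors, cosine similarity equals the dot product, so skipping the normalization step is a useful optimization.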
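
If the vectors are not already unit length, the same saving is available by normalizing the corpus once and reusing it for every query. The sketch below assumes random stand-in data and a hypothetical helper `normalize_rows`:

```python
import numpy as np

def normalize_rows(x: np.ndarray) -> np.ndarray:
    """Scale each row to unit length; all-zero rows are left as zeros."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.where(norms == 0.0, 1.0, norms)

rng = np.random.default_rng(7)
corpus = rng.standard_normal((1000, 128))
query = rng.standard_normal(128)

# Done once, for example when the corpus is indexed
corpus_unit = normalize_rows(corpus)
query_unit = query / np.linalg.norm(query)

# Per query: a single matrix-vector product, no corpus norms recomputed
scores = corpus_unit @ query_unit

# Check against the full cosine formula
expected = (corpus @ query) / (np.linalg.norm(corpus, axis=1) * np.linalg.norm(query))
print(np.allclose(scores, expected))
```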


Interview Questions

Q: When would cosine similarity give misleading results?

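When vector magnitude carries meaningful information. Cosine similarity discards magnitude, so if a long detailed document and a short summary have embeddings that point the same way, it calls them "similar" even though they represent very different levels of detail. If the embedding model encodes that difference in magnitude, dot product preserves it. For unit-length embeddings the two measures give the same score, so the distinction disappears.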

Q: How does cosine similarity relate to vector databases?

Vector databases and extensions (Pinecone, Weaviate, pgvector) store embeddings and support approximate nearest-neighbor (ANN) search using algorithms like HNSW or IVF. These find high-cosine-similarity vectors without comparing the query against every stored vector, which would cost O(n × d) per query. ANN trades exact results for speed, which is usually acceptable because embedding similarity is itself only an approximation of relevance.
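
To make the trade-off tangible, here is a toy sketch of the inverted-file (IVF) idea in plain NumPy. It is an illustration only and does not reflect how any of the products above are implemented: the class name `ToyIVF` and its parameters are hypothetical, and it picks random corpus rows as centroids where a real IVF index would train them with k-means.

```python
import numpy as np

def normalize_rows(x: np.ndarray) -> np.ndarray:
    """Scale each row to unit length (assumes no all-zero rows)."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)

class ToyIVF:
    """Toy inverted-file index, for illustration only."""

    def __init__(self, corpus: np.ndarray, n_lists: int, seed: int = 0):
        rng = np.random.default_rng(seed)
        self.corpus = normalize_rows(corpus)
        # Assumption: random corpus rows act as centroids (real IVF uses k-means)
        picks = rng.choice(len(corpus), size=n_lists, replace=False)
        self.centroids = self.corpus[picks]
        # Assign every vector to its most similar centroid
        assignment = np.argmax(self.corpus @ self.centroids.T, axis=1)
        self.lists = [np.flatnonzero(assignment == i) for i in range(n_lists)]

    def search(self, query: np.ndarray, k: int, n_probe: int) -> np.ndarray:
        q = query / np.linalg.norm(query)
        # Visit only the n_probe lists whose centroids are closest to the query
        probe = np.argsort(self.centroids @ q)[-n_probe:]
        candidates = np.concatenate([self.lists[i] for i in probe])
        scores = self.corpus[candidates] @ q
        return candidates[np.argsort(scores)[::-1][:k]]

rng = np.random.default_rng(3)
corpus = rng.standard_normal((2000, 32))
query = rng.standard_normal(32)

index = ToyIVF(corpus, n_lists=20)

# Exact search scores all 2000 vectors; the index scores only 3 of 20 lists
exact = np.argsort(normalize_rows(corpus) @ (query / np.linalg.norm(query)))[::-1][:5]
approx = index.search(query, k=5, n_probe=3)

print("exact: ", exact)
print("approx:", approx)
print("overlap:", len(set(exact) & set(approx)), "of 5")
```

Raising `n_probe` scans more lists and recovers more of the exact results at higher cost; probing every list reduces to brute-force search.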
