Time and Space Complexity for AI Engineers
Big O notation review with AI-specific examples: O(n) embedding lookup, O(n²) attention, chunking pipelines, and when complexity actually matters in production RAG systems.
Why Complexity Matters More in AI Systems
When you're building a REST API, a quadratic algorithm on a 100-row dataset is irrelevant. In AI systems, the numbers are different. A RAG pipeline might embed 500,000 document chunks. A similarity search runs against millions of vectors. An attention mechanism processes sequences of thousands of tokens. Getting complexity wrong here means your system either can't scale or costs 100x what it should.
This lesson reviews Big O through the lens of AI engineering.
Big O — The Mental Model
Big O describes how runtime or memory grows as input size n grows. The key insight: we drop constants and lower-order terms.
# O(1) — constant time, doesn't grow with n
def get_embedding_by_id(store: dict, doc_id: str) -> list[float]:
return store[doc_id] # hash map lookup
# O(n) — linear, scales with number of documents
def find_keyword_in_chunks(chunks: list[str], keyword: str) -> list[str]:
return [c for c in chunks if keyword in c]
# O(n log n) — typical sorting
def sort_chunks_by_relevance(chunks: list[tuple[str, float]]) -> list[tuple[str, float]]:
return sorted(chunks, key=lambda x: x[1], reverse=True)
# O(n^2) — quadratic, dangerous at scale
def pairwise_similarity_naive(embeddings: list[list[float]]) -> list[list[float]]:
n = len(embeddings)
matrix = [[0.0] * n for _ in range(n)]
for i in range(n):
for j in range(n):
matrix[i][j] = dot_product(embeddings[i], embeddings[j])
return matrixAI-Specific Complexity Examples
O(n) — Embedding Lookup
A dictionary-based embedding store gives you O(1) per lookup, O(n) to build.
from typing import Optional
class EmbeddingStore:
def __init__(self):
self._store: dict[str, list[float]] = {}
def add(self, doc_id: str, embedding: list[float]) -> None:
# O(1) amortized — hash map insertion
self._store[doc_id] = embedding
def get(self, doc_id: str) -> Optional[list[float]]:
# O(1) — hash map lookup
return self._store.get(doc_id)
def build_from_corpus(self, corpus: dict[str, list[float]]) -> None:
# O(n) — iterate all n documents
for doc_id, emb in corpus.items():
self.add(doc_id, emb)Interview point: When an interviewer asks "how does your embedding store scale?", say: "Lookup is O(1), but linear scan for similarity is O(n × d) where d is embedding dimension. For large n, we'd swap to an approximate index like FAISS."
O(n²) — Naive Attention
The attention mechanism in transformers is the most famous quadratic operation in AI.
For a sequence of n tokens, attention computes:
- Q × K^T: shape (n, n) — every token attends to every other token
- This is O(n²) time and O(n²) space
import numpy as np
def scaled_dot_product_attention(
Q: np.ndarray, # (n, d_k)
K: np.ndarray, # (n, d_k)
V: np.ndarray, # (n, d_v)
) -> np.ndarray:
"""
Standard attention: O(n^2 * d) time, O(n^2) space for the attention matrix.
n = sequence length, d = head dimension
"""
d_k = Q.shape[-1]
# (n, n) matrix — this is the O(n^2) step
scores = Q @ K.T / np.sqrt(d_k)
# softmax over last dim
scores = scores - scores.max(axis=-1, keepdims=True) # numerical stability
weights = np.exp(scores)
weights = weights / weights.sum(axis=-1, keepdims=True)
# (n, d_v)
return weights @ V
# Demonstration: attention matrix size grows quadratically
for seq_len in [128, 512, 2048, 8192]:
attn_matrix_elements = seq_len * seq_len
attn_matrix_mb = attn_matrix_elements * 4 / (1024 * 1024) # float32
print(f"seq_len={seq_len:5d} → attention matrix = {attn_matrix_mb:8.2f} MB")
# seq_len= 128 → attention matrix = 0.06 MB
# seq_len= 512 → attention matrix = 1.00 MB
# seq_len= 2048 → attention matrix = 16.00 MB
# seq_len= 8192 → attention matrix = 256.00 MBThis is why context window length is expensive. Doubling the context length quadruples the memory for the attention layer.
O(n log n) — Sorting Retrieved Chunks
After retrieving candidate chunks, you sort by similarity score before returning top-k.
def rank_retrieved_chunks(
query_embedding: list[float],
candidate_chunks: list[tuple[str, list[float]]], # (text, embedding)
top_k: int = 5,
) -> list[tuple[str, float]]:
"""
Compute similarity for all candidates: O(n * d)
Sort by score: O(n log n)
Return top k: O(k)
Overall: O(n * d + n log n) = O(n * d) for typical d >> log n
"""
scored = []
for text, emb in candidate_chunks:
score = cosine_similarity(query_embedding, emb)
scored.append((text, score))
# O(n log n)
scored.sort(key=lambda x: x[1], reverse=True)
return scored[:top_k]
def cosine_similarity(a: list[float], b: list[float]) -> float:
dot = sum(x * y for x, y in zip(a, b))
mag_a = sum(x ** 2 for x in a) ** 0.5
mag_b = sum(x ** 2 for x in b) ** 0.5
if mag_a == 0 or mag_b == 0:
return 0.0
return dot / (mag_a * mag_b)Interview tip: Sorting all n chunks is usually fine. If n is large (millions), you should use a min-heap to get top-k in O(n log k) — see the heaps lesson.
Space Complexity: Storing Embeddings
An embedding vector for a document is typically 1536 floats (OpenAI ada-002) or 768 floats (many open models). Each float is 4 bytes.
def estimate_embedding_storage(
n_documents: int,
embedding_dim: int = 1536,
bytes_per_float: int = 4,
) -> dict:
"""
Space complexity of storing all embeddings: O(n * d)
where n = number of documents, d = embedding dimension
"""
total_bytes = n_documents * embedding_dim * bytes_per_float
return {
"n_documents": n_documents,
"embedding_dim": embedding_dim,
"total_mb": total_bytes / (1024 ** 2),
"total_gb": total_bytes / (1024 ** 3),
}
# Real-world examples
examples = [
(10_000, 1536), # small knowledge base
(100_000, 1536), # medium corpus
(1_000_000, 1536), # large document store
(1_000_000, 768), # open model, smaller dim
]
for n, d in examples:
result = estimate_embedding_storage(n, d)
print(f"n={n:>10,} d={d:>5} → {result['total_gb']:.2f} GB")
# n= 10,000 d= 1536 → 0.06 GB
# n= 100,000 d= 1536 → 0.57 GB
# n= 1,000,000 d= 1536 → 5.72 GB
# n= 1,000,000 d= 768 → 2.86 GBAt 1 million documents with 1536-dim embeddings you need roughly 6 GB just for the raw vectors — before indexing overhead. This is why vector databases like Qdrant and Weaviate use quantization (int8, binary) to shrink storage.
RAG Pipeline Bottleneck Analysis
Let's analyze the complexity of a full RAG pipeline step by step.
from dataclasses import dataclass
from typing import Callable
import time
@dataclass
class PipelineStep:
name: str
time_complexity: str
space_complexity: str
typical_bottleneck: bool
def analyze_rag_pipeline() -> list[PipelineStep]:
return [
PipelineStep(
name="Tokenize query",
time_complexity="O(q) where q = query length",
space_complexity="O(q)",
typical_bottleneck=False,
),
PipelineStep(
name="Embed query (model forward pass)",
time_complexity="O(q^2 * d_model) — attention over query tokens",
space_complexity="O(q^2) — attention matrix",
typical_bottleneck=False, # q is small (under 512 tokens typically)
),
PipelineStep(
name="Nearest neighbor search over n docs",
time_complexity="O(n * d) brute force, O(log n) approximate",
space_complexity="O(n * d) to store all embeddings",
typical_bottleneck=True, # main bottleneck at scale
),
PipelineStep(
name="Fetch top-k chunk texts",
time_complexity="O(k)",
space_complexity="O(k * chunk_size)",
typical_bottleneck=False,
),
PipelineStep(
name="Build prompt with context",
time_complexity="O(k * chunk_size)",
space_complexity="O(k * chunk_size)",
typical_bottleneck=False,
),
PipelineStep(
name="LLM generation (n_ctx context tokens)",
time_complexity="O(n_ctx^2) — attention is quadratic in context",
space_complexity="O(n_ctx^2)",
typical_bottleneck=True, # large context = slow and expensive
),
]
pipeline = analyze_rag_pipeline()
print(f"{'Step':<40} {'Time':<45} Bottleneck?")
print("-" * 100)
for step in pipeline:
flag = "YES" if step.typical_bottleneck else "no"
print(f"{step.name:<40} {step.time_complexity:<45} {flag}")Practice Problem: Analyze a Tokenizer Loop
Problem: Given the following tokenizer implementation, determine the time and space complexity. Then identify and fix the bottleneck.
# Version 1 — naive implementation
def tokenize_corpus_v1(documents: list[str]) -> list[list[str]]:
"""Tokenize a list of documents into word lists."""
all_tokens = []
for doc in documents:
tokens = []
for word in doc.split():
# Normalize: lowercase and strip punctuation
normalized = ""
for char in word:
if char.isalnum():
normalized += char # BUG: string concatenation in loop!
if normalized:
tokens.append(normalized.lower())
all_tokens.append(tokens)
return all_tokens
# Analysis of v1:
# - Outer loop: O(n) documents
# - Split: O(len(doc))
# - Inner char loop: O(len(word))
# - String concatenation `normalized += char`: O(len(normalized)) each time!
# → O(len(word)^2) per word due to string immutability
# Total: O(n * max_doc_len * max_word_len^2) — terrible for long words
# Version 2 — fixed with list join
def tokenize_corpus_v2(documents: list[str]) -> list[list[str]]:
"""Fixed version: O(n * total_chars)"""
all_tokens = []
for doc in documents:
tokens = []
for word in doc.split():
# Use list comprehension + join: O(len(word)) not O(len(word)^2)
normalized = "".join(c for c in word if c.isalnum())
if normalized:
tokens.append(normalized.lower())
all_tokens.append(tokens)
return all_tokens
# Version 3 — fastest with regex
import re
_WORD_PATTERN = re.compile(r"[a-zA-Z0-9]+")
def tokenize_corpus_v3(documents: list[str]) -> list[list[str]]:
"""
Regex-based: O(n * total_chars) with very small constant.
Pattern compiled once outside the loop — O(1) compile cost amortized.
"""
return [_WORD_PATTERN.findall(doc.lower()) for doc in documents]
# Benchmark
import time
docs = ["Hello, world! This is a test." * 10] * 10_000
for version, func in [("v1", tokenize_corpus_v1), ("v2", tokenize_corpus_v2), ("v3", tokenize_corpus_v3)]:
start = time.perf_counter()
result = func(docs)
elapsed = time.perf_counter() - start
print(f"{version}: {elapsed:.3f}s, first doc tokens: {result[0][:5]}")Expected interview answer: v1 has hidden O(k²) due to string concatenation; v2 fixes it; v3 is the idiomatic Python approach. All three are O(n × total_chars) in terms of documents, but constants differ significantly.
Summary Table
| Operation | Time Complexity | Space Complexity | Notes | |-----------|----------------|------------------|-------| | Dict embedding lookup | O(1) | O(n × d) | Hash map | | Brute-force kNN | O(n × d) | O(1) extra | Per query | | Sorting n chunks | O(n log n) | O(n) | For reranking | | Top-k with heap | O(n log k) | O(k) | Preferred at scale | | Self-attention | O(n²) | O(n²) | n = sequence length | | BM25 indexing | O(n × avg_len) | O(vocab × n) | Inverted index build | | Regex tokenize | O(total_chars) | O(total_tokens) | One pass |
Key Interview Takeaways
- Always state your complexity before coding, not after. Interviewers want to see that you reason before you type.
- In AI systems, d (embedding dimension) often matters as much as n (number of documents). An O(n × d) algorithm on 1M documents with d=1536 is very different from d=64.
- Attention is O(n²) in context length — this is why there's so much research into linear attention and sparse attention variants.
- String concatenation in a loop is a classic O(n²) trap in Python. Always use
"".join(list). - Approximate algorithms are the correct answer at scale. Exact kNN at 10M vectors is impractical; HNSW with 95% recall at 10ms latency is production-ready.
Found this helpful?
Leave a comment
Have a question, correction, or just found this helpful? Leave a note below.