Time and Space Complexity for AI Engineers

Why Complexity Matters More in AI Systems

When you're building a REST API, a quadratic algorithm on a 100-row dataset is irrelevant. In AI systems, the numbers are different. A RAG pipeline might embed 500,000 document chunks. A similarity search runs against millions of vectors. An attention mechanism processes sequences of thousands of tokens. Getting complexity wrong here means your system either can't scale or costs 100x what it should.

This lesson reviews Big O through the lens of AI engineering.

Big O — The Mental Model

Big O describes how runtime or memory grows as input size n grows. The key insight: we drop constants and lower-order terms.

Python

# O(1) — constant time, doesn't grow with n
def get_embedding_by_id(store: dict, doc_id: str) -> list[float]:
    return store[doc_id]  # hash map lookup

# O(n) — linear, scales with number of documents
def find_keyword_in_chunks(chunks: list[str], keyword: str) -> list[str]:
    return [c for c in chunks if keyword in c]

# O(n log n) — typical sorting
def sort_chunks_by_relevance(chunks: list[tuple[str, float]]) -> list[tuple[str, float]]:
    return sorted(chunks, key=lambda x: x[1], reverse=True)

# O(n^2) — quadratic, dangerous at scale
def pairwise_similarity_naive(embeddings: list[list[float]]) -> list[list[float]]:
    n = len(embeddings)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            matrix[i][j] = dot_product(embeddings[i], embeddings[j])
    return matrix

AI-Specific Complexity Examples

O(n) — Embedding Lookup

A dictionary-based embedding store gives you O(1) per lookup, O(n) to build.

Python

from typing import Optional

class EmbeddingStore:
    def __init__(self):
        self._store: dict[str, list[float]] = {}

    def add(self, doc_id: str, embedding: list[float]) -> None:
        # O(1) amortized — hash map insertion
        self._store[doc_id] = embedding

    def get(self, doc_id: str) -> Optional[list[float]]:
        # O(1) — hash map lookup
        return self._store.get(doc_id)

    def build_from_corpus(self, corpus: dict[str, list[float]]) -> None:
        # O(n) — iterate all n documents
        for doc_id, emb in corpus.items():
            self.add(doc_id, emb)

Interview point: When an interviewer asks "how does your embedding store scale?", say: "Lookup is O(1), but linear scan for similarity is O(n × d) where d is embedding dimension. For large n, we'd swap to an approximate index like FAISS."

O(n²) — Naive Attention

The attention mechanism in transformers is the most famous quadratic operation in AI.

For a sequence of n tokens, attention computes:

Q × K^T: shape (n, n) — every token attends to every other token
This is O(n²) time and O(n²) space

Python

import numpy as np

def scaled_dot_product_attention(
    Q: np.ndarray,  # (n, d_k)
    K: np.ndarray,  # (n, d_k)
    V: np.ndarray,  # (n, d_v)
) -> np.ndarray:
    """
    Standard attention: O(n^2 * d) time, O(n^2) space for the attention matrix.
    n = sequence length, d = head dimension
    """
    d_k = Q.shape[-1]
    
    # (n, n) matrix — this is the O(n^2) step
    scores = Q @ K.T / np.sqrt(d_k)
    
    # softmax over last dim
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    
    # (n, d_v)
    return weights @ V

# Demonstration: attention matrix size grows quadratically
for seq_len in [128, 512, 2048, 8192]:
    attn_matrix_elements = seq_len * seq_len
    attn_matrix_mb = attn_matrix_elements * 4 / (1024 * 1024)  # float32
    print(f"seq_len={seq_len:5d} → attention matrix = {attn_matrix_mb:8.2f} MB")

# seq_len=  128 → attention matrix =     0.06 MB
# seq_len=  512 → attention matrix =     1.00 MB
# seq_len= 2048 → attention matrix =    16.00 MB
# seq_len= 8192 → attention matrix =   256.00 MB

This is why context window length is expensive. Doubling the context length quadruples the memory for the attention layer.

O(n log n) — Sorting Retrieved Chunks

After retrieving candidate chunks, you sort by similarity score before returning top-k.

Python

def rank_retrieved_chunks(
    query_embedding: list[float],
    candidate_chunks: list[tuple[str, list[float]]],  # (text, embedding)
    top_k: int = 5,
) -> list[tuple[str, float]]:
    """
    Compute similarity for all candidates: O(n * d)
    Sort by score: O(n log n)
    Return top k: O(k)
    
    Overall: O(n * d + n log n) = O(n * d) for typical d >> log n
    """
    scored = []
    for text, emb in candidate_chunks:
        score = cosine_similarity(query_embedding, emb)
        scored.append((text, score))
    
    # O(n log n)
    scored.sort(key=lambda x: x[1], reverse=True)
    
    return scored[:top_k]


def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    mag_a = sum(x ** 2 for x in a) ** 0.5
    mag_b = sum(x ** 2 for x in b) ** 0.5
    if mag_a == 0 or mag_b == 0:
        return 0.0
    return dot / (mag_a * mag_b)

Interview tip: Sorting all n chunks is usually fine. If n is large (millions), you should use a min-heap to get top-k in O(n log k) — see the heaps lesson.

Space Complexity: Storing Embeddings

An embedding vector for a document is typically 1536 floats (OpenAI ada-002) or 768 floats (many open models). Each float is 4 bytes.

Python

def estimate_embedding_storage(
    n_documents: int,
    embedding_dim: int = 1536,
    bytes_per_float: int = 4,
) -> dict:
    """
    Space complexity of storing all embeddings: O(n * d)
    where n = number of documents, d = embedding dimension
    """
    total_bytes = n_documents * embedding_dim * bytes_per_float
    return {
        "n_documents": n_documents,
        "embedding_dim": embedding_dim,
        "total_mb": total_bytes / (1024 ** 2),
        "total_gb": total_bytes / (1024 ** 3),
    }

# Real-world examples
examples = [
    (10_000, 1536),    # small knowledge base
    (100_000, 1536),   # medium corpus
    (1_000_000, 1536), # large document store
    (1_000_000, 768),  # open model, smaller dim
]

for n, d in examples:
    result = estimate_embedding_storage(n, d)
    print(f"n={n:>10,} d={d:>5} → {result['total_gb']:.2f} GB")

# n=    10,000 d= 1536 →  0.06 GB
# n=   100,000 d= 1536 →  0.57 GB
# n= 1,000,000 d= 1536 →  5.72 GB
# n= 1,000,000 d=  768 →  2.86 GB

At 1 million documents with 1536-dim embeddings you need roughly 6 GB just for the raw vectors — before indexing overhead. This is why vector databases like Qdrant and Weaviate use quantization (int8, binary) to shrink storage.

RAG Pipeline Bottleneck Analysis

Let's analyze the complexity of a full RAG pipeline step by step.

Python

from dataclasses import dataclass
from typing import Callable
import time

@dataclass
class PipelineStep:
    name: str
    time_complexity: str
    space_complexity: str
    typical_bottleneck: bool

def analyze_rag_pipeline() -> list[PipelineStep]:
    return [
        PipelineStep(
            name="Tokenize query",
            time_complexity="O(q) where q = query length",
            space_complexity="O(q)",
            typical_bottleneck=False,
        ),
        PipelineStep(
            name="Embed query (model forward pass)",
            time_complexity="O(q^2 * d_model) — attention over query tokens",
            space_complexity="O(q^2) — attention matrix",
            typical_bottleneck=False,  # q is small (under 512 tokens typically)
        ),
        PipelineStep(
            name="Nearest neighbor search over n docs",
            time_complexity="O(n * d) brute force, O(log n) approximate",
            space_complexity="O(n * d) to store all embeddings",
            typical_bottleneck=True,  # main bottleneck at scale
        ),
        PipelineStep(
            name="Fetch top-k chunk texts",
            time_complexity="O(k)",
            space_complexity="O(k * chunk_size)",
            typical_bottleneck=False,
        ),
        PipelineStep(
            name="Build prompt with context",
            time_complexity="O(k * chunk_size)",
            space_complexity="O(k * chunk_size)",
            typical_bottleneck=False,
        ),
        PipelineStep(
            name="LLM generation (n_ctx context tokens)",
            time_complexity="O(n_ctx^2) — attention is quadratic in context",
            space_complexity="O(n_ctx^2)",
            typical_bottleneck=True,  # large context = slow and expensive
        ),
    ]

pipeline = analyze_rag_pipeline()
print(f"{'Step':<40} {'Time':<45} Bottleneck?")
print("-" * 100)
for step in pipeline:
    flag = "YES" if step.typical_bottleneck else "no"
    print(f"{step.name:<40} {step.time_complexity:<45} {flag}")

Practice Problem: Analyze a Tokenizer Loop

Problem: Given the following tokenizer implementation, determine the time and space complexity. Then identify and fix the bottleneck.

Python

# Version 1 — naive implementation
def tokenize_corpus_v1(documents: list[str]) -> list[list[str]]:
    """Tokenize a list of documents into word lists."""
    all_tokens = []
    for doc in documents:
        tokens = []
        for word in doc.split():
            # Normalize: lowercase and strip punctuation
            normalized = ""
            for char in word:
                if char.isalnum():
                    normalized += char  # BUG: string concatenation in loop!
            if normalized:
                tokens.append(normalized.lower())
        all_tokens.append(tokens)
    return all_tokens

# Analysis of v1:
# - Outer loop: O(n) documents
# - Split: O(len(doc))
# - Inner char loop: O(len(word))
# - String concatenation `normalized += char`: O(len(normalized)) each time!
#   → O(len(word)^2) per word due to string immutability
# Total: O(n * max_doc_len * max_word_len^2) — terrible for long words


# Version 2 — fixed with list join
def tokenize_corpus_v2(documents: list[str]) -> list[list[str]]:
    """Fixed version: O(n * total_chars)"""
    all_tokens = []
    for doc in documents:
        tokens = []
        for word in doc.split():
            # Use list comprehension + join: O(len(word)) not O(len(word)^2)
            normalized = "".join(c for c in word if c.isalnum())
            if normalized:
                tokens.append(normalized.lower())
        all_tokens.append(tokens)
    return all_tokens


# Version 3 — fastest with regex
import re

_WORD_PATTERN = re.compile(r"[a-zA-Z0-9]+")

def tokenize_corpus_v3(documents: list[str]) -> list[list[str]]:
    """
    Regex-based: O(n * total_chars) with very small constant.
    Pattern compiled once outside the loop — O(1) compile cost amortized.
    """
    return [_WORD_PATTERN.findall(doc.lower()) for doc in documents]


# Benchmark
import time

docs = ["Hello, world! This is a test." * 10] * 10_000

for version, func in [("v1", tokenize_corpus_v1), ("v2", tokenize_corpus_v2), ("v3", tokenize_corpus_v3)]:
    start = time.perf_counter()
    result = func(docs)
    elapsed = time.perf_counter() - start
    print(f"{version}: {elapsed:.3f}s, first doc tokens: {result[0][:5]}")

Expected interview answer: v1 has hidden O(k²) due to string concatenation; v2 fixes it; v3 is the idiomatic Python approach. All three are O(n × total_chars) in terms of documents, but constants differ significantly.

Summary Table

| Operation | Time Complexity | Space Complexity | Notes | |-----------|----------------|------------------|-------| | Dict embedding lookup | O(1) | O(n × d) | Hash map | | Brute-force kNN | O(n × d) | O(1) extra | Per query | | Sorting n chunks | O(n log n) | O(n) | For reranking | | Top-k with heap | O(n log k) | O(k) | Preferred at scale | | Self-attention | O(n²) | O(n²) | n = sequence length | | BM25 indexing | O(n × avg_len) | O(vocab × n) | Inverted index build | | Regex tokenize | O(total_chars) | O(total_tokens) | One pass |

Key Interview Takeaways

Always state your complexity before coding, not after. Interviewers want to see that you reason before you type.
In AI systems, d (embedding dimension) often matters as much as n (number of documents). An O(n × d) algorithm on 1M documents with d=1536 is very different from d=64.
Attention is O(n²) in context length — this is why there's so much research into linear attention and sparse attention variants.
String concatenation in a loop is a classic O(n²) trap in Python. Always use "".join(list).
Approximate algorithms are the correct answer at scale. Exact kNN at 10M vectors is impractical; HNSW with 95% recall at 10ms latency is production-ready.

Time and Space Complexity for AI Engineers

Why Complexity Matters More in AI Systems

Big O — The Mental Model

AI-Specific Complexity Examples

O(n) — Embedding Lookup

O(n²) — Naive Attention

O(n log n) — Sorting Retrieved Chunks

Space Complexity: Storing Embeddings

RAG Pipeline Bottleneck Analysis

Practice Problem: Analyze a Tokenizer Loop

Summary Table

Key Interview Takeaways

Enjoyed this article?

Leave a comment