Embeddings for RAG

What Embeddings Are

A text embedding is a dense vector of floating-point numbers that encodes the semantic meaning of a text:

"Warfarin is an anticoagulant" → [0.12, -0.34, 0.89, ...]  (768 or 1536 numbers)
"Blood thinners reduce clotting" → [0.11, -0.31, 0.88, ...]  (semantically similar → similar vector)
"The stock market rose today" → [-0.45, 0.82, -0.22, ...]   (different meaning → different vector)

Texts with similar meaning have vectors that are close together in the embedding space, measured by cosine similarity.
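
To make "close together" concrete, here is a minimal sketch of cosine similarity in NumPy. It uses only the first three components shown in the example vectors above, so the numbers are illustrative, not real model output.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two vectors: dot(a, b) / (|a| * |b|)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 3-d vectors (truncated from the examples above; not real embeddings)
warfarin = np.array([0.12, -0.34, 0.89])
thinners = np.array([0.11, -0.31, 0.88])
stocks = np.array([-0.45, 0.82, -0.22])

sim_related = cosine_similarity(warfarin, thinners)    # close to 1
sim_unrelated = cosine_similarity(warfarin, stocks)    # negative
print(f"related: {sim_related:.3f}, unrelated: {sim_unrelated:.3f}")
```

Cosine similarity ignores vector length and compares direction only, which is why it is a common default for comparing embeddings.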

Why Embeddings Enable Semantic Search

Traditional keyword search matches exact terms. Embedding search matches meaning:

Query: "heart attack"

Keyword search (BM25):
  Returns: documents containing "heart attack"
  Misses: "myocardial infarction", "acute MI", "NSTEMI"

Embedding search:
  Returns: documents semantically close to "heart attack"
  Includes: "myocardial infarction", "cardiac event", "acute coronary syndrome"

The embedding model has learned that these terms are closely related in meaning, so their vectors land near each other even though they share no words.
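
The retrieval step itself can be sketched as a ranking over document vectors. The vectors below are hand-made, hypothetical stand-ins for real embeddings; in practice they would come from an embedding model.

```python
import numpy as np

def top_k(query_vec: np.ndarray, doc_matrix: np.ndarray, k: int = 2) -> list[tuple[int, float]]:
    """Return (doc_index, cosine_score) for the k documents closest to the query."""
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_matrix / np.linalg.norm(doc_matrix, axis=1, keepdims=True)
    scores = d @ q                      # cosine similarity of every document to the query
    order = np.argsort(-scores)[:k]     # indices of the highest scores first
    return [(int(i), float(scores[i])) for i in order]

docs = ["myocardial infarction", "acute coronary syndrome", "quarterly earnings report"]
# Hypothetical 3-d embeddings, chosen by hand for illustration only
doc_vecs = np.array([
    [0.90, 0.10, 0.00],
    [0.80, 0.30, 0.10],
    [0.00, 0.10, 0.95],
])
query_vec = np.array([0.85, 0.20, 0.05])  # stand-in for embed("heart attack")

results = top_k(query_vec, doc_vecs, k=2)
for idx, score in results:
    print(f"{docs[idx]}: {score:.3f}")
```

A brute-force scan like this is fine for small corpora; larger ones typically use an approximate nearest-neighbour index instead.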

Embedding Models for RAG

General-purpose (text-embedding-3-small, OpenAI):
  Dimensions: 1536
  Cost: $0.02/1M tokens
  Quality: excellent for general text
  Use for: general knowledge bases

General-purpose (all-MiniLM-L6-v2, local):
  Dimensions: 384
  Cost: free (runs locally)
  Speed: very fast (on the order of 10K short texts/second on a GPU, hardware-dependent)
  Use for: high-volume, cost-sensitive applications

Medical domain (MedCPT-Query-Encoder):
  Trained on PubMed queries and articles
  Better for biomedical retrieval than general models
  Use for: clinical guideline search, medical QA

Large (text-embedding-3-large, OpenAI):
  Dimensions: 3072
  Higher quality than small, about 6.5× more expensive ($0.13/1M tokens)
  Use for: where retrieval quality is paramount

Embedding Dimensions Trade-off

More dimensions:
  ✓ More expressive — can encode finer-grained distinctions
  ✗ Larger storage (1536 floats × 4 bytes = 6KB per chunk)
  ✗ Slower similarity search (more dimensions to compute)
  ✗ More memory in the vector index
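
The storage arithmetic above scales linearly with corpus size. A small sketch, assuming float32 values (4 bytes each) and ignoring any index overhead or compression:

```python
def index_size_bytes(num_chunks: int, dims: int, bytes_per_value: int = 4) -> int:
    """Raw vector storage: one value per dimension per chunk (float32 by default)."""
    return num_chunks * dims * bytes_per_value

# Compare the dimensions of the models discussed in this section at 10M chunks
for dims in (384, 1536, 3072):
    gb = index_size_bytes(10_000_000, dims) / 1e9
    print(f"{dims:>4} dims: {gb:.1f} GB for 10M chunks")
```

At 10M chunks this works out to roughly 15 GB, 61 GB, and 123 GB of raw vectors, which is where dimension starts to drive hardware cost.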

text-embedding-3-small (Matryoshka-style training):
  Can be shortened to 512 or 256 dimensions via the API's dimensions parameter
  256d: 6× less storage than the full 1536d, ~2× faster search, modest quality drop
  Use for very large corpora where storage/speed matters
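
If full-size vectors are already stored, shortening can also be done client-side. The sketch below keeps the leading dimensions and re-normalizes each row to unit length so cosine and dot-product scores stay comparable. It assumes the model was trained so that leading dimensions carry most of the signal; on other models, plain truncation may hurt quality much more. The random matrix is a stand-in for real embeddings.

```python
import numpy as np

def truncate_embeddings(embeddings: np.ndarray, dims: int) -> np.ndarray:
    """Keep the first `dims` columns and re-normalize each row to unit length."""
    truncated = embeddings[:, :dims]
    norms = np.linalg.norm(truncated, axis=1, keepdims=True)
    return truncated / np.maximum(norms, 1e-12)  # guard against all-zero rows

# Stand-in for real 1536-d embeddings (random values, illustration only)
rng = np.random.default_rng(0)
full = rng.normal(size=(5, 1536))

small = truncate_embeddings(full, 256)
print(small.shape)
```

Queries must be truncated to the same length as the stored documents before searching.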

Rule of thumb:
  Small corpus (< 100K chunks): use the best model regardless of dimension
  Large corpus (> 10M chunks): consider dimension reduction

Generating Embeddings

Python

from openai import OpenAI
from sentence_transformers import SentenceTransformer
import numpy as np

# OpenAI embeddings (API-based, higher quality)
openai_client = OpenAI()

def embed_openai(texts: list[str], model: str = "text-embedding-3-small") -> np.ndarray:
    response = openai_client.embeddings.create(
        input=texts,
        model=model
    )
    return np.array([item.embedding for item in response.data])

# Local embedding model (free, fast)
local_model = SentenceTransformer("all-MiniLM-L6-v2")

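def embed_local(texts: list[str], show_progress_bar: bool = True) -> np.ndarray:
    # encode() returns a NumPy array of shape (len(texts), 384) for this model
    return local_model.encode(texts, show_progress_bar=show_progress_bar)

# Batching for large corpora (bounds the memory used per call):
def embed_batch(texts: list[str], batch_size: int = 256) -> np.ndarray:
    if not texts:
        # np.vstack raises ValueError on an empty list, so return an empty (0, dim) array
        return np.empty((0, local_model.get_sentence_embedding_dimension()))
    all_embeddings = []
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i + batch_size]
        # Suppress the per-call progress bar; otherwise one bar prints per batch
        embs = embed_local(batch, show_progress_bar=False)
        all_embeddings.append(embs)
    return np.vstack(all_embeddings)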

Embedding Asymmetry

Queries and documents do not have to be embedded by the same encoder:

Standard: embed(query) and embed(document) use the SAME model
  Works well when query and document vocabulary are similar

Asymmetric models (dual encoders trained on query-document pairs, often with hard negatives):
  the query encoder and the document encoder can be separate networks
  query: "What causes heart attacks?"
  document: "Myocardial infarction occurs when..."
  
  The models are trained to map queries and their matching documents
  close together — even when they use different vocabulary

MedCPT: separate query and article encoders
  MedCPT-Query-Encoder for queries
  MedCPT-Article-Encoder for documents
  Don't use the article encoder for queries — quality degrades
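
One way to avoid mixing up the two encoders is to bind each to its role in a small wrapper. This is a sketch, not MedCPT's actual API: `DualEncoderIndex` and both toy encoders are hypothetical, and inner-product scoring is an assumption. With a real asymmetric model, the two callables would wrap its query encoder and its document encoder.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional

import numpy as np

Encoder = Callable[[list[str]], np.ndarray]  # texts -> array of shape (n, d)

@dataclass
class DualEncoderIndex:
    """Binds each encoder to its role so the two cannot be swapped by accident."""
    encode_query: Encoder
    encode_document: Encoder
    documents: list[str] = field(default_factory=list)
    doc_vectors: Optional[np.ndarray] = None

    def add(self, docs: list[str]) -> None:
        vecs = self.encode_document(docs)  # documents always use the document encoder
        self.documents.extend(docs)
        self.doc_vectors = vecs if self.doc_vectors is None else np.vstack([self.doc_vectors, vecs])

    def search(self, query: str, k: int = 1) -> list[tuple[str, float]]:
        if self.doc_vectors is None:
            return []
        q = self.encode_query([query])[0]  # queries always use the query encoder
        scores = self.doc_vectors @ q      # inner-product scoring (assumed)
        top = np.argsort(-scores)[:k]
        return [(self.documents[int(i)], float(scores[int(i)])) for i in top]

# Toy stand-ins: each encoder maps its own vocabulary into a shared 2-d space
def toy_query_encoder(texts: list[str]) -> np.ndarray:
    return np.array([[float("heart" in t.lower()), float("stock" in t.lower())] for t in texts])

def toy_document_encoder(texts: list[str]) -> np.ndarray:
    return np.array([[float("myocardial" in t.lower()), float("market" in t.lower())] for t in texts])

index = DualEncoderIndex(toy_query_encoder, toy_document_encoder)
index.add(["Myocardial infarction occurs when...", "The market rose today"])
best_doc, best_score = index.search("What causes heart attacks?", k=1)[0]
print(best_doc, best_score)
```

The toy encoders show the core idea: the query side and the document side see different vocabulary but are mapped into the same space, so matching pairs score highly.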

Interview Answer

"Text embeddings encode semantic meaning as dense vectors — texts with similar meaning have similar vectors, measured by cosine similarity. This enables semantic search that matches meaning rather than exact keywords, capturing 'heart attack' ↔ 'myocardial infarction' equivalences that BM25 misses. For RAG embedding model choice: general-purpose models (text-embedding-3-small for quality, all-MiniLM-L6-v2 for free local inference) work well for most applications. For clinical RAG, domain-specific models (MedCPT) improve retrieval on biomedical text. Some bi-encoder models use separate query and document encoders — always use the query encoder for queries, not the document encoder."