RAG Systems · Lesson 5 of 24
What Are Embeddings?
What Embeddings Are
A text embedding is a dense vector of floating-point numbers that encodes the semantic meaning of a text:
"Warfarin is an anticoagulant" → [0.12, -0.34, 0.89, ...] (768 or 1536 numbers)
"Blood thinners reduce clotting" → [0.11, -0.31, 0.88, ...] (semantically similar → similar vector)
"The stock market rose today" → [-0.45, 0.82, -0.22, ...] (different meaning → different vector)
Texts with similar meaning have vectors that are close together in the embedding space, measured by cosine similarity.
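Cosine similarity is the dot product of two vectors divided by the product of their lengths, so it ranges from -1 (opposite directions) to 1 (same direction). A minimal sketch with NumPy: the three-dimensional vectors are only the leading components shown in the examples above, used as toy stand-ins for real embeddings, and `cosine_similarity` is an illustrative helper name.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Dot product of a and b divided by the product of their L2 norms."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 3-d vectors: the leading components from the examples above,
# not real model output.
warfarin = np.array([0.12, -0.34, 0.89])
thinners = np.array([0.11, -0.31, 0.88])
stocks = np.array([-0.45, 0.82, -0.22])

sim_related = cosine_similarity(warfarin, thinners)
sim_unrelated = cosine_similarity(warfarin, stocks)
print(f"warfarin vs thinners: {sim_related:.3f}")
print(f"warfarin vs stocks:   {sim_unrelated:.3f}")
```

The related pair scores close to 1, while the unrelated pair scores below zero. With real embeddings the gap is usually less dramatic, but the ordering is what retrieval relies on.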
Why Embeddings Enable Semantic Search
Traditional keyword search matches exact text. Embeddings match meaning:
Query: "heart attack"
Keyword search (BM25):
Returns: documents containing "heart attack"
Misses: "myocardial infarction", "acute MI", "NSTEMI"
Embedding search:
Returns: documents semantically close to "heart attack"
Includes: "myocardial infarction", "cardiac event", "acute coronary syndrome"
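The contrast above can be sketched with a toy corpus. The document vectors below are hand-made stand-ins (assumptions, not model output), chosen so that the cardiac texts point in a similar direction. `keyword_search` is a plain phrase match rather than real BM25, and `embedding_search` ranks by cosine similarity; both names are illustrative.

```python
import numpy as np

docs = [
    "Management of acute myocardial infarction",
    "NSTEMI treatment guidelines",
    "Heart attack warning signs",
    "Quarterly stock market report",
]

# Hand-made 3-d vectors standing in for real embeddings (illustrative only).
doc_vecs = np.array([
    [0.90, 0.10, 0.05],
    [0.85, 0.20, 0.10],
    [0.95, 0.05, 0.00],
    [0.05, 0.10, 0.95],
])
query_vec = np.array([0.92, 0.08, 0.02])  # stand-in for embed("heart attack")

def keyword_search(query: str, texts: list[str]) -> list[str]:
    """Exact-phrase match (a simplification of BM25)."""
    return [t for t in texts if query.lower() in t.lower()]

def embedding_search(q: np.ndarray, vecs: np.ndarray,
                     texts: list[str], k: int = 3) -> list[str]:
    """Rank texts by cosine similarity to the query vector; return the top k."""
    q_norm = q / np.linalg.norm(q)
    v_norm = vecs / np.linalg.norm(vecs, axis=1, keepdims=True)
    scores = v_norm @ q_norm
    top = np.argsort(-scores)[:k]
    return [texts[i] for i in top]

keyword_hits = keyword_search("heart attack", docs)
semantic_hits = embedding_search(query_vec, doc_vecs, docs, k=3)
print(keyword_hits)
print(semantic_hits)
```

The keyword search finds only the document that literally contains "heart attack". The vector search returns all three cardiac documents and leaves out the stock report.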
The embedding model has learned that these terms refer to the same or closely related concepts, so their vectors lie near each other.
Embedding Models for RAG
General-purpose (text-embedding-3-small, OpenAI):
Dimensions: 1536
Cost: about $0.02/1M tokens
Quality: excellent for general text
Use for: general knowledge bases
General-purpose (all-MiniLM-L6-v2, local):
Dimensions: 384
Cost: free (runs locally)
Speed: very fast (on the order of 10K+ short texts/second on a single GPU, depending on hardware and text length)
Use for: high-volume, cost-sensitive applications
Medical domain (MedCPT-Query-Encoder):
Trained on PubMed queries and articles
Better for biomedical retrieval than general models
Use for: clinical guideline search, medical QA
Large (text-embedding-3-large, OpenAI):
Dimensions: 3072
Higher quality than small, roughly 6.5× more expensive per token
Use for: where retrieval quality is paramount
Embedding Dimensions Trade-off
More dimensions:
✓ More expressive — can encode finer-grained distinctions
✗ Larger storage (1536 floats × 4 bytes = 6KB per chunk)
✗ Slower similarity search (more dimensions to compute)
✗ More memory in the vector index
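Raw vector storage grows linearly with both corpus size and dimension. A back-of-the-envelope helper, assuming float32 vectors and ignoring index overhead and metadata (`index_size_gb` is an illustrative name):

```python
def index_size_gb(num_chunks: int, dims: int, bytes_per_float: int = 4) -> float:
    """Raw vector storage in GB (10^9 bytes), excluding index overhead."""
    return num_chunks * dims * bytes_per_float / 1e9

for dims in (384, 1536, 3072):
    size = index_size_gb(10_000_000, dims)
    print(f"{dims:>4} dims, 10M chunks: {size:.1f} GB")
```

At 10 million chunks this works out to about 15 GB at 384 dimensions, 61 GB at 1536, and 123 GB at 3072, before any index structures are added.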
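text-embedding-3-small with Matryoshka-style truncation:
Can be shortened to 512 or 256 dimensions (the model is trained so that the leading dimensions carry the most information)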
256d: 6× less storage, ~2× faster search, modest quality drop
Use for very large corpora where storage/speed matters
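Truncating a Matryoshka-style embedding means keeping the leading components and re-normalizing to unit length, so that cosine and dot-product scores stay comparable. A sketch with NumPy, using random vectors as stand-ins for real embeddings; `truncate_embeddings` is an illustrative helper, not a library function, and the approach only makes sense for models trained to support truncation.

```python
import numpy as np

def truncate_embeddings(embs: np.ndarray, dims: int) -> np.ndarray:
    """Keep the first `dims` components of each row, then L2-normalize."""
    cut = embs[:, :dims]
    norms = np.linalg.norm(cut, axis=1, keepdims=True)
    return cut / np.clip(norms, 1e-12, None)  # guard against zero vectors

rng = np.random.default_rng(0)
# Random stand-in for four 1536-d embeddings (not real model output).
full = rng.normal(size=(4, 1536)).astype(np.float32)
small = truncate_embeddings(full, 256)

print(small.shape)
print(full.shape[1] // small.shape[1], "x fewer numbers per vector")
```

The OpenAI embeddings endpoint also accepts a `dimensions` argument for the text-embedding-3 models, which returns shortened vectors directly. Manual truncation is mainly useful when full-size vectors are already stored.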
Rule of thumb:
Small corpus (< 100K chunks): use the best model regardless of dimension
Large corpus (> 10M chunks): consider dimension reduction
Generating Embeddings
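from collections.abc import Callable

import numpy as np
from openai import OpenAI
from sentence_transformers import SentenceTransformer

# OpenAI embeddings (API-based, higher quality)
openai_client = OpenAI()  # reads OPENAI_API_KEY from the environment

def embed_openai(texts: list[str], model: str = "text-embedding-3-small") -> np.ndarray:
    """Embed texts in a single API call; returns shape (len(texts), dims)."""
    # The API limits how many inputs one request may carry,
    # so send large lists through embed_batch below.
    response = openai_client.embeddings.create(
        input=texts,
        model=model,
    )
    return np.array([item.embedding for item in response.data])

# Local embedding model (free, fast)
local_model = SentenceTransformer("all-MiniLM-L6-v2")

def embed_local(texts: list[str]) -> np.ndarray:
    """Embed texts locally; encode() returns a NumPy array by default."""
    # Progress bar is off because embed_batch calls this once per batch.
    return local_model.encode(texts, show_progress_bar=False)

# Batching for large corpora: bounds memory use and per-request size
def embed_batch(
    texts: list[str],
    batch_size: int = 256,
    embed_fn: Callable[[list[str]], np.ndarray] = embed_local,
) -> np.ndarray:
    """Embed texts in chunks of batch_size and stack the results row-wise."""
    if not texts:
        raise ValueError("texts must not be empty")
    all_embeddings = []
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i + batch_size]
        all_embeddings.append(embed_fn(batch))
    return np.vstack(all_embeddings)

Embedding Asymmetry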
Queries and documents do not have to be embedded by the same encoder:
Symmetric (standard): embed(query) and embed(document) use the SAME model
Works well when query and document vocabulary are similar
Asymmetric models (dual encoders trained on query-document pairs, often with hard negatives):
the query encoder and the document encoder may be separate models
query: "What causes heart attacks?"
document: "Myocardial infarction occurs when..."
The models are trained to map queries and their matching documents
close together — even when they use different vocabulary
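One way to keep two encoders from being mixed up is to wrap them in a small index class that applies the document encoder at indexing time and the query encoder at search time. The sketch below is hypothetical: `DualEncoderIndex` and the two toy encoder functions are illustrative stand-ins, and in practice the callables would wrap a real query/document model pair.

```python
from collections.abc import Callable

import numpy as np

Encoder = Callable[[list[str]], np.ndarray]

class DualEncoderIndex:
    """Toy in-memory index with separate encoders for documents and queries."""

    def __init__(self, encode_query: Encoder, encode_doc: Encoder):
        self.encode_query = encode_query
        self.encode_doc = encode_doc
        self.docs = []
        self.vecs = None

    def add(self, docs: list[str]) -> None:
        # Documents always go through the document encoder.
        new = self.encode_doc(docs)
        self.vecs = new if self.vecs is None else np.vstack([self.vecs, new])
        self.docs.extend(docs)

    def search(self, query: str, k: int = 1) -> list[str]:
        if self.vecs is None:
            return []
        # Queries always go through the query encoder.
        q = self.encode_query([query])[0]
        scores = self.vecs @ q  # dot-product scoring
        return [self.docs[i] for i in np.argsort(-scores)[:k]]

# Toy stand-in encoders (not real models): each maps its own vocabulary
# onto a shared 2-d space whose first axis means "cardiac".
def toy_query_encoder(texts: list[str]) -> np.ndarray:
    return np.array([[1.0, 0.0] if "heart attack" in t.lower() else [0.0, 1.0]
                     for t in texts])

def toy_doc_encoder(texts: list[str]) -> np.ndarray:
    return np.array([[1.0, 0.0] if "infarction" in t.lower() else [0.0, 1.0]
                     for t in texts])

index = DualEncoderIndex(toy_query_encoder, toy_doc_encoder)
index.add([
    "Acute myocardial infarction: diagnosis and management",
    "The stock market rose today",
])
top = index.search("What causes heart attacks?", k=1)
print(top)
```

Because the encoder choice lives inside `add` and `search`, calling code cannot accidentally embed a query with the document encoder.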
MedCPT: separate query and article encoders
MedCPT-Query-Encoder for queries
MedCPT-Article-Encoder for documents
Don't use the article encoder for queries: retrieval quality degrades
Interview Answer
"Text embeddings encode semantic meaning as dense vectors — texts with similar meaning have similar vectors, measured by cosine similarity. This enables semantic search that matches meaning rather than exact keywords, capturing 'heart attack' ↔ 'myocardial infarction' equivalences that BM25 misses. For RAG embedding model choice: general-purpose models (text-embedding-3-small for quality, all-MiniLM-L6-v2 for free local inference) work well for most applications. For clinical RAG, domain-specific models (MedCPT) improve retrieval on biomedical text. Some bi-encoder models use separate query and document encoders — always use the query encoder for queries, not the document encoder."