Advanced RAG · Lesson 2 of 14
Cross-Encoder Re-Ranking: Precision at the Top
The Problem Reranking Solves
First-stage retrieval (vector search + BM25) is optimised for recall: it retrieves a large set of potentially relevant documents. Precision within the top-k can still be low:
Query: "Warfarin dose adjustment for CYP2C9 poor metabolisers"
Vector search top-5:
1. "Warfarin pharmacology overview" — relevant but general
2. "CYP2C9 enzyme in drug metabolism" — relevant but not dose-specific
3. "Warfarin 5mg daily dose protocol" — less relevant to the question
4. "CYP2C9*2 dosing algorithm" — HIGHLY relevant — but ranked 4th!
5. "Anticoagulation reversal agents" — not relevant
The most relevant document is buried. Reranking reorders them.
Bi-encoder vs Cross-encoder
Bi-encoder (used in first-stage retrieval):
Encodes query and document INDEPENDENTLY
Similarity = cosine(embed(query), embed(doc))
Fast: doc embeddings precomputed, only query is encoded at search time
Limitation: query and document never "see" each other during encoding
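The bi-encoder scoring rule above can be sketched in a few lines. This is a minimal illustration, not a real retriever: the three-dimensional vectors are made-up stand-ins for model embeddings, and `cosine_scores` is a hypothetical helper name.

```python
import numpy as np


def cosine_scores(query_vec: np.ndarray, doc_matrix: np.ndarray) -> np.ndarray:
    """Cosine similarity between one query vector and each row of doc_matrix."""
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_matrix / np.linalg.norm(doc_matrix, axis=1, keepdims=True)
    return d @ q


# Toy "embeddings" invented for illustration (not output of any model).
# In a real system these rows are computed once at indexing time and stored.
doc_embeddings = np.array([
    [1.0, 0.0, 0.0],
    [0.6, 0.8, 0.0],
    [0.0, 0.0, 1.0],
])

# Only the query is embedded at search time.
query_embedding = np.array([0.8, 0.6, 0.0])

scores = cosine_scores(query_embedding, doc_embeddings)
ranking = np.argsort(-scores)  # document indices, best match first
```

Because each document vector is fixed before the query arrives, the score can only reflect how close two independently computed summaries are, which is the limitation noted above.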
Cross-encoder (reranker):
Encodes query AND document TOGETHER
Input: [CLS] query [SEP] document [SEP]
Output: relevance score from the final [CLS] representation
Slow: must encode query+document pair at search time (can't precompute)
Strength: full attention between every query token and document token
→ much higher quality relevance scores
Reranking Architecture
Stage 1: Candidate Retrieval (fast, high recall)
Query → vector search + BM25 → top-50 candidates
Takes: 10-50ms
Stage 2: Reranking (slow, high precision)
Query + top-50 candidates → cross-encoder → relevance scores
Re-order candidates by relevance score
Return top-5
Takes: 100-500ms (50 cross-encoder forward passes)
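The two stages can be sketched as one function with pluggable scorers. This is a sketch under stated assumptions: `two_stage_search`, `term_count`, and `term_coverage` are hypothetical names, and the two toy scorers only stand in for vector search + BM25 and for a cross-encoder so that the example runs without any model.

```python
from typing import Callable, Sequence


def two_stage_search(
    query: str,
    corpus: Sequence[str],
    first_stage_score: Callable[[str, str], float],
    rerank_score: Callable[[str, str], float],
    candidates_k: int = 50,
    top_n: int = 5,
) -> list[str]:
    """Retrieve candidates_k documents cheaply, then rerank them and keep top_n."""
    # Stage 1: cheap scorer over the whole corpus, keep a wide candidate set.
    candidates = sorted(
        corpus, key=lambda doc: first_stage_score(query, doc), reverse=True
    )[:candidates_k]
    # Stage 2: expensive scorer, applied to the candidates only.
    reranked = sorted(
        candidates, key=lambda doc: rerank_score(query, doc), reverse=True
    )
    return reranked[:top_n]


def term_count(query: str, doc: str) -> float:
    """Toy stage-1 scorer: number of document tokens that are query terms."""
    terms = set(query.lower().split())
    return float(sum(tok in terms for tok in doc.lower().split()))


def term_coverage(query: str, doc: str) -> float:
    """Toy stage-2 scorer: fraction of distinct query terms found in the document."""
    terms = set(query.lower().split())
    return len(terms & set(doc.lower().split())) / len(terms)


corpus = [
    "warfarin dose protocol and warfarin dose forms",
    "dose adjustment for warfarin",
    "anticoagulation reversal agents",
]
results = two_stage_search(
    "warfarin dose adjustment", corpus, term_count, term_coverage,
    candidates_k=2, top_n=1,
)
```

In this toy run the first stage ranks the repetitive protocol document first, and the second stage promotes the document that covers all three query terms.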
Result: high recall from stage 1, high precision from stage 2
Implementation with Cohere Rerank
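import cohere
from typing import NamedTuple

co = cohere.Client("your-api-key")  # placeholder; load the real key from configuration


class RankedDocument(NamedTuple):
    content: str
    relevance_score: float
    original_rank: int  # 0-based position in the first-stage candidate list


def rerank_documents(
    query: str,
    documents: list[str],
    top_n: int = 5,
    model: str = "rerank-english-v3.0",
) -> list[RankedDocument]:
    """Rerank documents with Cohere Rerank and return the top_n results."""
    response = co.rerank(
        query=query,
        documents=documents,
        top_n=top_n,
        model=model,
    )
    # result.index points back into `documents`, so the text is looked up
    # locally rather than relying on the shape of the returned document field,
    # which differs between SDK versions.
    return [
        RankedDocument(
            content=documents[result.index],
            relevance_score=result.relevance_score,
            original_rank=result.index,
        )
        for result in response.results
    ]


# Usage (assumes `retriever` and `query` come from the first-stage retrieval code)
candidates = retriever.search(query, top_k=50)
doc_texts = [c["content"] for c in candidates]
reranked = rerank_documents(query, doc_texts, top_n=5)

Implementation with sentence-transformers (local)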
from sentence_transformers import CrossEncoder


class LocalReranker:
    def __init__(self, model_name: str = "cross-encoder/ms-marco-MiniLM-L-6-v2"):
        self.model = CrossEncoder(model_name)

    def rerank(
        self,
        query: str,
        documents: list[dict],
        top_n: int = 5,
    ) -> list[dict]:
        # Create query-document pairs
        pairs = [(query, doc["content"]) for doc in documents]
        # Score all pairs (batch for efficiency)
        scores = self.model.predict(pairs, batch_size=32)
        # Sort by score descending
        ranked = sorted(
            zip(scores, documents),
            key=lambda x: x[0],
            reverse=True,
        )
        # Copy each document and attach its score as a plain Python float
        return [
            {**doc, "rerank_score": float(score)}
            for score, doc in ranked[:top_n]
        ]

Medical-Specific Rerankers
For clinical applications, domain-specific rerankers often outperform general-purpose ones:
MedCPT-Cross-Encoder (NLM):
Pretrained on PubMed query-article pairs
Better at biomedical relevance than MS-MARCO models
BioLinkBERT / PubMedBERT as cross-encoder:
Fine-tuned on clinical NLI (natural language inference)
Better calibration for clinical text relevance
Cohere Rerank with medical prompting:
The general model is still competitive with domain models
Especially for structured clinical notes (EHR-style)
Cost and Latency Trade-off
Cohere Rerank:
Cost: ~$0.001 per 1000 token-pairs (estimate)
Latency: 100-300ms for 50 documents
Quality: excellent, no infrastructure overhead
Local cross-encoder (CPU):
Cost: infrastructure only
Latency: 500ms-2s for 50 documents on CPU
Quality: slightly below Cohere on general text, better when fine-tuned on the target domain
Local cross-encoder (GPU T4):
Latency: 50-150ms for 50 documents
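Latency figures like the ones above depend heavily on hardware, model size, and document length, so it is worth measuring them on your own setup. The helper below is a minimal sketch: `measure_latency_ms` and `dummy_rerank` are hypothetical names, and the dummy reranker is only a placeholder so the example runs without a model or API key.

```python
import statistics
import time
from typing import Callable, Sequence


def measure_latency_ms(
    rerank_fn: Callable[[str, Sequence[str]], object],
    query: str,
    documents: Sequence[str],
    runs: int = 5,
) -> dict[str, float]:
    """Call rerank_fn several times and report latency statistics in milliseconds."""
    rerank_fn(query, documents)  # warm-up call, excluded from the timings
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        rerank_fn(query, documents)
        timings.append((time.perf_counter() - start) * 1000.0)
    return {"median_ms": statistics.median(timings), "max_ms": max(timings)}


def dummy_rerank(query: str, documents: Sequence[str]) -> list[str]:
    """Placeholder reranker; swap in a real one to get meaningful numbers."""
    return sorted(documents, key=len)


stats = measure_latency_ms(dummy_rerank, "warfarin dose", ["doc"] * 50)
```

To time a real reranker, pass a callable that wraps it with the same `(query, documents)` signature, and use a candidate list of the size you expect in production (50 in the figures above).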
Practical recommendation:
Use Cohere Rerank for prototyping and production with moderate volume
Move to local GPU inference for high-volume production (>1M requests/day)
Interview Answer
"Reranking adds a second stage to retrieval: first-stage retrieval (vector search + BM25) produces 20-100 candidate documents for recall; a cross-encoder reranker then scores each (query, document) pair jointly — with full attention between query and document tokens — and reorders by relevance. Cross-encoders are much more accurate than bi-encoders because they can compare query and document tokens directly, but they can't precompute document representations, so they can only be used on a small candidate set. In production, I use Cohere Rerank as a managed API or sentence-transformers' CrossEncoder locally. For clinical AI, domain-specific rerankers (MedCPT, PubMedBERT-based) improve precision on biomedical queries."