Advanced RAG · Lesson 2 of 14

Cross-Encoder Re-Ranking: Precision at the Top

The Problem Reranking Solves

First-stage retrieval (vector search + BM25) is optimised for recall: it retrieves a large set of potentially relevant documents. Precision within the top k can still be low, because the best document is not guaranteed to rank first:

Query: "Warfarin dose adjustment for CYP2C9 poor metabolisers"

Vector search top-5:
  1. "Warfarin pharmacology overview"     — relevant but general
  2. "CYP2C9 enzyme in drug metabolism"  — relevant but not dose-specific
  3. "Warfarin 5mg daily dose protocol"  — less relevant to the question
  4. "CYP2C9*2 dosing algorithm"         — HIGHLY relevant — but ranked 4th!
  5. "Anticoagulation reversal agents"   — not relevant

The most relevant document is buried at rank 4. Reranking rescores the candidates so that it moves to the top.
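One way to quantify "buried" is a rank metric. The sketch below computes reciprocal rank and NDCG for the five results above. The numeric relevance grades are illustrative assumptions mapped from the annotations in the list, not measured judgments, and the reranked order assumes a perfect reranker.

```python
import math

# Assumed relevance grades for the five results above, in ranked order:
# 3 = highly relevant, 2 = relevant, 1 = marginal, 0 = not relevant.
vector_search_grades = [2, 2, 1, 3, 0]


def reciprocal_rank(grades: list[int], target_grade: int = 3) -> float:
    """Return 1/rank of the first result graded >= target_grade, or 0.0."""
    for rank, grade in enumerate(grades, start=1):
        if grade >= target_grade:
            return 1.0 / rank
    return 0.0


def dcg(grades: list[int]) -> float:
    """Discounted cumulative gain with linear gain and a log2 rank discount."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(grades, start=1))


def ndcg(grades: list[int]) -> float:
    """DCG normalised by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(grades, reverse=True))
    return dcg(grades) / ideal if ideal > 0 else 0.0


# A perfect reranker would sort the same five documents by grade.
reranked_grades = sorted(vector_search_grades, reverse=True)

print(reciprocal_rank(vector_search_grades))  # 0.25: best document sits at rank 4
print(reciprocal_rank(reranked_grades))       # 1.0: best document is now first
print(ndcg(vector_search_grades) < ndcg(reranked_grades))  # True
```

Reciprocal rank only looks at where the single best document lands, while NDCG rewards getting the whole ordering right, so the two are often reported together.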

Bi-encoder vs Cross-encoder

Bi-encoder (used in first-stage retrieval):
  Encodes query and document INDEPENDENTLY
  Similarity = cosine(embed(query), embed(doc))
  Fast: doc embeddings precomputed, only query is encoded at search time
  Limitation: query and document never "see" each other during encoding

Cross-encoder (reranker):
  Encodes query AND document TOGETHER
  Input: [CLS] query [SEP] document [SEP]
  Output: relevance score from the final [CLS] representation
  Slow: must encode query+document pair at search time (can't precompute)
  Strength: full attention between every query token and document token
            → much higher quality relevance scores
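The bi-encoder pattern can be sketched without a neural model. In the example below, `embed` is a toy bag-of-words stand-in for a real encoder (an assumption for illustration only), and the documents are made up. What carries over to real systems is the shape of the computation: documents are embedded once, offline, and only the query is embedded at search time.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for a neural encoder: a bag-of-words term-count vector."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


docs = [
    "warfarin pharmacology overview",
    "cyp2c9 dosing algorithm for warfarin",
    "anticoagulation reversal agents",
]

# Offline: every document is embedded once, without seeing any query.
doc_embeddings = [embed(d) for d in docs]

# Online: only the query is embedded; scoring is a cheap vector comparison.
query_embedding = embed("warfarin dose cyp2c9")
scores = [cosine(query_embedding, e) for e in doc_embeddings]
best = max(range(len(docs)), key=scores.__getitem__)
print(docs[best])
```

A cross-encoder replaces `cosine(embed(query), embed(doc))` with a single model call on the (query, document) pair, which is why nothing on the document side can be computed ahead of time.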

Reranking Architecture

Stage 1: Candidate Retrieval (fast, high recall)
  Query → vector search + BM25 → top-50 candidates
  Takes: 10-50ms

Stage 2: Reranking (slow, high precision)
  Query + top-50 candidates → cross-encoder → relevance scores
  Re-order candidates by relevance score
  Return top-5
  Takes: 100-500ms (50 query-document pairs to score, usually in batches)

Result: high recall from stage 1, high precision from stage 2
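The two stages can be sketched as a single function that accepts any pair of scorers. Everything here is a hypothetical illustration: `Scorer`, `two_stage_search`, `term_overlap`, and `dosing_bonus` are names invented for this sketch, and the toy scorers stand in for a real retriever and a real cross-encoder. A production first stage would query a vector index or BM25 index instead of scoring the whole corpus in a loop.

```python
from typing import Callable

# Hypothetical scorer signature: (query, document_text) -> relevance score.
Scorer = Callable[[str, str], float]


def two_stage_search(
    query: str,
    corpus: list[str],
    first_stage: Scorer,
    reranker: Scorer,
    num_candidates: int = 50,
    top_n: int = 5,
) -> list[tuple[str, float]]:
    """Retrieve candidates with a cheap scorer, then rerank them with an expensive one."""
    # Stage 1: score cheaply and keep the best num_candidates documents.
    candidates = sorted(corpus, key=lambda d: first_stage(query, d), reverse=True)
    candidates = candidates[:num_candidates]

    # Stage 2: the expensive scorer only ever sees the candidate set.
    rescored = [(doc, reranker(query, doc)) for doc in candidates]
    rescored.sort(key=lambda pair: pair[1], reverse=True)
    return rescored[:top_n]


def term_overlap(query: str, doc: str) -> float:
    """Toy first-stage scorer: number of shared lowercase terms."""
    return float(len(set(query.lower().split()) & set(doc.lower().split())))


def dosing_bonus(query: str, doc: str) -> float:
    """Toy reranker: term overlap plus a bonus when the document covers dosing."""
    return term_overlap(query, doc) + (2.0 if "dosing" in doc.lower() else 0.0)


corpus = [
    "Warfarin pharmacology overview",
    "CYP2C9 enzyme in drug metabolism",
    "CYP2C9 warfarin dosing algorithm",
    "Anticoagulation reversal agents",
]
results = two_stage_search(
    "warfarin cyp2c9 dose", corpus, term_overlap, dosing_bonus,
    num_candidates=3, top_n=2,
)
print(results[0][0])
```

Keeping the scorers as parameters makes it easy to swap the reranker (hosted API, local model, or none) without touching the retrieval code.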

Implementation with Cohere Rerank

Python
import cohere
from typing import NamedTuple

co = cohere.Client("your-api-key")

class RankedDocument(NamedTuple):
    content: str
    relevance_score: float
    original_rank: int

def rerank_documents(
    query: str,
    documents: list[str],
    top_n: int = 5,
    model: str = "rerank-english-v3.0"
) -> list[RankedDocument]:
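    # Hosted reranker call; results come back sorted by relevance, best first.
    response = co.rerank(
        query=query,
        documents=documents,
        top_n=top_n,
        model=model,
    )

    return [
        RankedDocument(
            # result.index is the 0-based position in the input list, so the
            # text is looked up locally rather than read from the response,
            # whose document field differs between SDK versions.
            content=documents[result.index],
            relevance_score=result.relevance_score,
            original_rank=result.index,
        )
        for result in response.results
    ]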

# Usage
candidates = retriever.search(query, top_k=50)
doc_texts = [c["content"] for c in candidates]
reranked = rerank_documents(query, doc_texts, top_n=5)
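Because `rerank_documents` takes plain strings, any metadata on the candidates (IDs, sources) is dropped along the way. Since `original_rank` holds each result's index in the input list, it can be used to join the scores back to the original candidate dicts. The sketch below is one way to do that; `attach_metadata`, the `source` field, and the `rank_change` key are hypothetical names, and the sample scores are invented for illustration.

```python
from typing import NamedTuple


class RankedDocument(NamedTuple):
    content: str
    relevance_score: float
    original_rank: int  # 0-based position in the list sent to the reranker


def attach_metadata(candidates: list[dict], ranked: list[RankedDocument]) -> list[dict]:
    """Join reranked results back to their source candidates via original_rank."""
    merged = []
    for new_rank, item in enumerate(ranked):
        source = candidates[item.original_rank]
        merged.append({
            **source,
            "rerank_score": item.relevance_score,
            # Positive values mean the document moved up after reranking.
            "rank_change": item.original_rank - new_rank,
        })
    return merged


# Hypothetical candidates and reranker output, for illustration only.
candidates = [
    {"content": "Warfarin pharmacology overview", "source": "doc-a"},
    {"content": "CYP2C9*2 dosing algorithm", "source": "doc-b"},
]
ranked = [
    RankedDocument("CYP2C9*2 dosing algorithm", 0.92, 1),
    RankedDocument("Warfarin pharmacology overview", 0.31, 0),
]
merged = attach_metadata(candidates, ranked)
print(merged[0]["source"], merged[0]["rank_change"])
```

Logging `rank_change` is also a cheap way to see how much the reranker actually reorders the first-stage results on real traffic.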

Implementation with sentence-transformers (local)

Python
from sentence_transformers import CrossEncoder

class LocalReranker:
    def __init__(self, model_name: str = "cross-encoder/ms-marco-MiniLM-L-6-v2"):
        self.model = CrossEncoder(model_name)

    def rerank(
        self,
        query: str,
        documents: list[dict],
        top_n: int = 5
    ) -> list[dict]:
        # Create query-document pairs
        pairs = [(query, doc["content"]) for doc in documents]

        # Score all pairs (batch for efficiency)
        scores = self.model.predict(pairs, batch_size=32)

        # Sort by score descending
        ranked = sorted(
            zip(scores, documents),
            key=lambda x: x[0],
            reverse=True
        )

        return [
            {**doc, "rerank_score": float(score)}
            for score, doc in ranked[:top_n]
        ]

Medical-Specific Rerankers

For clinical applications, domain-specific rerankers can outperform general ones on biomedical queries, although a strong general model is sometimes competitive:

MedCPT-Cross-Encoder (NLM):
  Pretrained on PubMed query-article pairs
  Better at biomedical relevance than MS-MARCO models

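BioLinkBERT / PubMedBERT as cross-encoder:
  Biomedical base models, not rerankers out of the box
  Need fine-tuning on relevance data or clinical NLI (natural language inference) first
  Once fine-tuned, can give better-calibrated relevance scores on clinical text

Cohere Rerank (general-purpose):
  Takes only a query and documents, so domain adaptation comes from how the query is phrased
  The general model is still competitive with domain models
  Especially for structured clinical notes (EHR-style)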

Cost and Latency Trade-off

Cohere Rerank:
  Cost: billed per search request rather than per token pair; check Cohere's pricing page for current rates
  Latency: 100-300ms for 50 documents
  Quality: excellent, no infrastructure overhead

Local cross-encoder (CPU):
  Cost: infrastructure only
  Latency: 500ms-2s for 50 documents on CPU
  Quality: slightly below Cohere on general text; can be better once fine-tuned on domain data

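Local cross-encoder (GPU T4):
  Cost: infrastructure only (GPU instance)
  Latency: 50-150ms for 50 documents
  Quality: same as the CPU option (same model, faster hardware)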
  
Practical recommendation:
  Use Cohere Rerank for prototyping and production with moderate volume
  Move to local GPU inference for high-volume production (>1M requests/day)
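To put the 1M requests/day threshold in perspective, the arithmetic below converts it to requests per second and estimates how many GPUs the T4 latency range above would imply. This is a rough sketch under simplifying assumptions: traffic is spread evenly across the day, each GPU serves one rerank request at a time, and there is no batching across requests. `peak_factor` is a hypothetical knob for provisioning above the average.

```python
import math

SECONDS_PER_DAY = 24 * 60 * 60


def average_qps(requests_per_day: int) -> float:
    """Average requests per second, assuming evenly spread traffic."""
    return requests_per_day / SECONDS_PER_DAY


def gpus_needed(requests_per_day: int, latency_s: float, peak_factor: float = 1.0) -> int:
    """GPUs required if each GPU serves one rerank request at a time."""
    per_gpu_qps = 1.0 / latency_s
    return math.ceil(average_qps(requests_per_day) * peak_factor / per_gpu_qps)


daily = 1_000_000
print(round(average_qps(daily), 1))         # 11.6 requests/second on average
print(gpus_needed(daily, latency_s=0.05))   # 1 GPU at the fast end of the range
print(gpus_needed(daily, latency_s=0.15))   # 2 GPUs at the slow end of the range
```

Real traffic is bursty, so a peak factor well above 1 is usually needed before these numbers can guide provisioning.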

Interview Answer

"Reranking adds a second stage to retrieval: first-stage retrieval (vector search + BM25) produces 20-100 candidate documents for recall; a cross-encoder reranker then scores each (query, document) pair jointly — with full attention between query and document tokens — and reorders by relevance. Cross-encoders are much more accurate than bi-encoders because they can compare query and document tokens directly, but they can't precompute document representations, so they can only be used on a small candidate set. In production, I use Cohere Rerank as a managed API or sentence-transformers' CrossEncoder locally. For clinical AI, domain-specific rerankers (MedCPT, PubMedBERT-based) improve precision on biomedical queries."