BERTScore: Semantic Similarity for Text Evaluation

What is BERTScore?

BERTScore computes the semantic similarity between a generated text and a reference text using contextual token embeddings from a pre-trained language model (BERT, RoBERTa, DeBERTa).

Unlike BLEU or ROUGE, which count exact n-gram overlaps, BERTScore matches tokens by their meaning in context. "Ibuprofen inhibits COX enzymes" and "NSAIDs block cyclooxygenase" share no words, so they score near zero on BLEU, but they typically score much higher on BERTScore because the two sentences make closely related claims.
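To make the overlap problem concrete, here is a minimal sketch of clipped unigram precision, the simplest building block of BLEU (no brevity penalty, no higher-order n-grams). The `unigram_precision` helper is illustrative and not part of any library; tokenization is a plain whitespace split, which is an assumption.

```python
from collections import Counter


def unigram_precision(candidate: str, reference: str) -> float:
    """Share of candidate tokens that also appear in the reference (clipped)."""
    cand_counts = Counter(candidate.lower().split())
    ref_counts = Counter(reference.lower().split())
    # A candidate token is credited at most as often as it occurs in the reference.
    overlap = sum(min(n, ref_counts[tok]) for tok, n in cand_counts.items())
    total = sum(cand_counts.values())
    return overlap / total if total else 0.0


paraphrase = unigram_precision(
    "Ibuprofen inhibits COX enzymes", "NSAIDs block cyclooxygenase"
)
identical = unigram_precision(
    "Ibuprofen inhibits COX enzymes", "Ibuprofen inhibits COX enzymes"
)
print(f"paraphrase: {paraphrase:.2f}, identical: {identical:.2f}")
```

The paraphrase pair gets no credit at all because no surface token is shared, which is the gap an embedding-based metric is meant to close.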

How BERTScore Works

Given candidate text C and reference text R:

1. Encode both with a contextual model (e.g., DeBERTa-xlarge-mnli)
2. For each token in C, find the most similar token in R (cosine similarity)
3. For each token in R, find the most similar token in C
4. Compute:
- Precision: average max similarity for candidate tokens
- Recall: average max similarity for reference tokens
- F1: harmonic mean of precision and recall

Precision_BERT = (1/|C|) Σ_c max_r cos_sim(embed(c), embed(r))
Recall_BERT    = (1/|R|) Σ_r max_c cos_sim(embed(r), embed(c))
F1_BERT        = 2 × (Precision × Recall) / (Precision + Recall)
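The formulas above can be traced by hand on toy vectors. The sketch below is a pure-Python illustration of the greedy matching step only: the 2-D "embeddings" are made up, whereas the real metric uses contextual embeddings from the chosen model. The function name `greedy_bertscore` is an assumption for this example.

```python
import math


def cos_sim(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def greedy_bertscore(cand_vecs, ref_vecs):
    """Precision, recall, and F1 from greedy max-similarity matching."""
    # Precision: each candidate token takes its best match in the reference.
    precision = sum(
        max(cos_sim(c, r) for r in ref_vecs) for c in cand_vecs
    ) / len(cand_vecs)
    # Recall: each reference token takes its best match in the candidate.
    recall = sum(
        max(cos_sim(r, c) for c in cand_vecs) for r in ref_vecs
    ) / len(ref_vecs)
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Made-up token vectors: the reference has one token the candidate lacks.
cand = [[1.0, 0.0], [0.0, 1.0]]
ref = [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]]

p, r, f = greedy_bertscore(cand, ref)
print(f"P={p:.4f} R={r:.4f} F1={f:.4f}")
```

Precision is 1.0 here because every candidate vector has an exact counterpart in the reference, while recall is lower because the reference vector [1, 1] is only partly matched by either candidate vector.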

Using BERTScore in Python

Python

from bert_score import score

# Example: drug information generation evaluation
candidates = [
    "Metformin works by inhibiting hepatic glucose production through AMPK activation, reducing fasting blood glucose levels.",
    "Warfarin prevents blood clots by reducing vitamin K-dependent clotting factors.",
]

references = [
    "Metformin's primary mechanism is inhibition of hepatic gluconeogenesis via AMP-activated protein kinase, lowering fasting glucose.",
    "Warfarin is an anticoagulant that inhibits vitamin K epoxide reductase, depleting active vitamin K needed for clotting factors II, VII, IX, and X.",
]

# Compute BERTScore
precision, recall, f1 = score(
    cands=candidates,
    refs=references,
    model_type="microsoft/deberta-xlarge-mnli",  # High-quality model for scoring
    lang="en",
    verbose=True,
)

for i, (p, r, f) in enumerate(zip(precision, recall, f1)):
    print(f"Example {i+1}:")
    print(f"  Precision: {p:.4f}")
    print(f"  Recall:    {r:.4f}")
    print(f"  F1:        {f:.4f}")

Batch Evaluation on a Test Set

Python

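import json

import torch
from bert_score import BERTScorer

# Initialize scorer once (avoid reloading model for each batch)
scorer = BERTScorer(
    model_type="microsoft/deberta-xlarge-mnli",
    lang="en",
    # Rescales against an empirical baseline so scores spread over a wider range;
    # the result usually falls within [0, 1] but can be negative
    rescale_with_baseline=True,
    # Fall back to CPU when no GPU is available
    device="cuda" if torch.cuda.is_available() else "cpu",
)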

def evaluate_with_bertscore(
    generated_outputs: list[str],
    reference_outputs: list[str],
    batch_size: int = 32,
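) -> dict:
    """Evaluate a set of generated texts against references."""
    # Guard against misaligned pairs and division by zero in the means below
    if len(generated_outputs) != len(reference_outputs):
        raise ValueError("generated_outputs and reference_outputs must have the same length")
    if not generated_outputs:
        raise ValueError("at least one example is required")
    all_p, all_r, all_f1 = [], [], []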

    for i in range(0, len(generated_outputs), batch_size):
        batch_cands = generated_outputs[i:i+batch_size]
        batch_refs = reference_outputs[i:i+batch_size]

        p, r, f1 = scorer.score(batch_cands, batch_refs)
        all_p.extend(p.tolist())
        all_r.extend(r.tolist())
        all_f1.extend(f1.tolist())

    return {
        "mean_precision": sum(all_p) / len(all_p),
        "mean_recall": sum(all_r) / len(all_r),
        "mean_f1": sum(all_f1) / len(all_f1),
        "min_f1": min(all_f1),
        "max_f1": max(all_f1),
        "n_examples": len(all_f1),
    }

# Example usage
generated = ["Metformin activates AMPK, reducing hepatic glucose output."]
references = ["Metformin inhibits gluconeogenesis by activating AMP-activated protein kinase."]

results = evaluate_with_bertscore(generated, references)
print(json.dumps(results, indent=2))

BERTScore vs BLEU vs ROUGE

| Metric | Matching | Semantic | Handles Paraphrase | Correlates with Human Judgment |
|---|---|---|---|---|
| BLEU | n-gram overlap | No | No | Low |
| ROUGE | n-gram recall | No | No | Low-Medium |
| BERTScore | Contextual embedding | Yes | Yes | High |
| LLM-as-judge | Holistic | Yes | Yes | Highest |

BERTScore generally correlates better with human judgments than BLEU or ROUGE. For text generation evaluation, it is among the strongest automatic metrics that do not require an LLM call.

Model Selection for BERTScore

The scoring model matters: stronger models generally give more accurate semantic matching, at the cost of speed.

| Model | Speed | Quality | Recommended for |
|---|---|---|---|
| bert-base-uncased | Fast | Low | Development testing |
| roberta-large | Medium | Good | General use |
| microsoft/deberta-xlarge-mnli | Slow | Excellent | Production evaluation |

For pharmaceutical/medical content, consider using a domain-specific model if one is available for your language.
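One way to keep the choice explicit in an evaluation pipeline is a small lookup keyed by stage. This is only a sketch: the model names come from the table above, while the stage labels and the `pick_scoring_model` helper are hypothetical.

```python
# Model names are taken from the table; the stage keys are illustrative assumptions.
SCORING_MODELS = {
    "dev": "bert-base-uncased",
    "general": "roberta-large",
    "production": "microsoft/deberta-xlarge-mnli",
}


def pick_scoring_model(stage: str) -> str:
    """Return the scoring model name for a pipeline stage."""
    if stage not in SCORING_MODELS:
        known = ", ".join(sorted(SCORING_MODELS))
        raise ValueError(f"unknown stage {stage!r}; expected one of: {known}")
    return SCORING_MODELS[stage]


print(pick_scoring_model("production"))
```

The returned string can be passed as `model_type` when constructing the scorer, so that scores from different stages are never compared by accident.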

Interpreting BERTScore Values

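With rescale_with_baseline=True, scores are linearly rescaled against an empirical baseline so that they spread over a wider, more interpretable range. As rough rules of thumb (the exact cut-offs depend on the scoring model and the domain, so calibrate them against a sample of human-rated examples):

- F1 around 0.90+: Very strong semantic match
- F1 around 0.80–0.90: Good match with some differences
- F1 around 0.70–0.80: Related content, notable differences
- F1 below 0.70: Significant divergence in content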

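Without rescaling, raw BERTScore F1 values tend to cluster in a narrow high band (roughly 0.85–0.95) even for poor matches, because contextual embeddings of unrelated tokens still have high cosine similarity. Use rescale_with_baseline=True when you need interpretable values, and note that rescaled scores are not strictly bounded to [0, 1]: very poor matches can come out negative.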
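For reports, the bands above can be turned into labels with a small helper. This is a sketch that assumes rescaled F1 values and treats each lower bound as inclusive, which the bands themselves do not specify; `describe_rescaled_f1` is a hypothetical name.

```python
def describe_rescaled_f1(f1: float) -> str:
    """Map a rescaled BERTScore F1 to the rough interpretation bands."""
    # Lower bounds are treated as inclusive (an assumption).
    if f1 >= 0.90:
        return "very strong semantic match"
    if f1 >= 0.80:
        return "good match with some differences"
    if f1 >= 0.70:
        return "related content, notable differences"
    return "significant divergence in content"


for value in (0.93, 0.84, 0.72, 0.41):
    print(f"{value:.2f}: {describe_rescaled_f1(value)}")
```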

When BERTScore Falls Short

BERTScore is strong but has limitations:

- Doesn't evaluate factual accuracy: "Ibuprofen is an anticoagulant" would likely score high against "Warfarin is an anticoagulant", because it is the same type of claim about the wrong drug
- Doesn't catch hallucinations: if the reference doesn't mention a fact, a hallucinated fact in the candidate can't be flagged as false; at most it lowers precision
- Reference quality dependent: if your reference answers are poor, BERTScore rewards candidates that match poor references

For applications where factual accuracy is critical (medical, legal), combine BERTScore with LLM-as-judge evaluation that explicitly checks for factual errors.
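A combined gate might look like the sketch below. It assumes a separate LLM-as-judge step (not shown) that returns a list of factual error descriptions; `combine_checks`, `EvalVerdict`, and the 0.80 default threshold are all hypothetical choices for illustration, not part of bert_score.

```python
from dataclasses import dataclass


@dataclass
class EvalVerdict:
    passed: bool
    reason: str


def combine_checks(
    bertscore_f1: float,
    factual_errors: list[str],
    f1_threshold: float = 0.80,  # assumed cut-off; tune per model and domain
) -> EvalVerdict:
    """Pass only if the judge found no factual errors and similarity is high enough."""
    # Factual errors veto the example regardless of how similar the text is.
    if factual_errors:
        return EvalVerdict(False, "factual errors: " + "; ".join(factual_errors))
    if bertscore_f1 < f1_threshold:
        return EvalVerdict(False, f"low semantic similarity (F1={bertscore_f1:.2f})")
    return EvalVerdict(True, "semantically close, no factual errors flagged")


# High similarity does not rescue a factually wrong answer.
verdict = combine_checks(0.95, ["ibuprofen is not an anticoagulant"])
print(verdict)
```

Checking the judge's findings first means a fluent but wrong answer fails with a reason that points at the factual problem rather than at the similarity score.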