BERTScore: Semantic Similarity for Text Evaluation
Use BERTScore to measure semantic similarity between generated and reference text. Understand how contextual embeddings improve on surface-level metrics like BLEU.
What is BERTScore?
BERTScore computes the semantic similarity between a generated text and a reference text using contextual token embeddings from a pre-trained language model (BERT, RoBERTa, DeBERTa).
Unlike BLEU or ROUGE, which count overlapping n-grams, BERTScore matches tokens based on their meaning in context. "Ibuprofen inhibits COX enzymes" and "NSAIDs block cyclooxygenase" would score low on BLEU (no shared words) but high on BERTScore (closely related meaning).
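To see why surface metrics struggle with this pair, here is a toy unigram-overlap check. It is a simplified stand-in for the n-gram precision inside BLEU, not BLEU itself, and the helper name `unigram_precision` is made up for this illustration.

```python
def unigram_precision(candidate: str, reference: str) -> float:
    """Fraction of candidate words that also appear in the reference."""
    cand_tokens = candidate.lower().split()
    ref_tokens = set(reference.lower().split())
    if not cand_tokens:
        return 0.0
    # Count candidate words that have an exact match in the reference
    matches = sum(tok in ref_tokens for tok in cand_tokens)
    return matches / len(cand_tokens)


overlap = unigram_precision(
    "NSAIDs block cyclooxygenase",
    "Ibuprofen inhibits COX enzymes",
)
print(overlap)  # 0.0, because the two sentences share no words
```

An exact-match metric has nothing to work with here, which is the gap contextual embeddings are meant to close.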
How BERTScore Works
Given candidate text C and reference text R:
- Encode both with a contextual model (e.g., DeBERTa-xlarge-mnli)
- For each token in C, find the most similar token in R (cosine similarity)
- For each token in R, find the most similar token in C
- Compute:
  - Precision: average max similarity over candidate tokens
  - Recall: average max similarity over reference tokens
  - F1: harmonic mean of precision and recall
Precision_BERT = (1/|C|) Σ_c max_r cos_sim(embed(c), embed(r))
Recall_BERT = (1/|R|) Σ_r max_c cos_sim(embed(r), embed(c))
F1_BERT = 2 × (Precision_BERT × Recall_BERT) / (Precision_BERT + Recall_BERT)
Using BERTScore in Python
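Before calling the library, it can help to see the greedy matching from the formulas above in plain Python. This is a minimal sketch, not the library's implementation: the 2-D "embeddings" are made-up numbers standing in for real model outputs, the function names are hypothetical, and it leaves out extras such as the optional IDF weighting.

```python
import math


def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def greedy_bertscore(cand_emb, ref_emb):
    """Precision, recall, and F1 from per-token embedding lists."""
    # Precision: each candidate token is matched to its best reference token
    precision = sum(max(cosine(c, r) for r in ref_emb) for c in cand_emb) / len(cand_emb)
    # Recall: each reference token is matched to its best candidate token
    recall = sum(max(cosine(r, c) for c in cand_emb) for r in ref_emb) / len(ref_emb)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Made-up embeddings: 2 candidate tokens, 3 reference tokens
cand_tokens = [[1.0, 0.0], [0.0, 1.0]]
ref_tokens = [[1.0, 0.0], [1.0, 1.0], [0.0, 2.0]]

p, r, f = greedy_bertscore(cand_tokens, ref_tokens)
print(f"Precision: {p:.4f}  Recall: {r:.4f}  F1: {f:.4f}")
```

Every candidate token has a perfect match in the reference, so precision is 1.0. Recall is lower because the middle reference token only partially aligns with either candidate token.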
```python
from bert_score import score

# Example: drug information generation evaluation
candidates = [
    "Metformin works by inhibiting hepatic glucose production through AMPK activation, reducing fasting blood glucose levels.",
    "Warfarin prevents blood clots by reducing vitamin K-dependent clotting factors.",
]
references = [
    "Metformin's primary mechanism is inhibition of hepatic gluconeogenesis via AMP-activated protein kinase, lowering fasting glucose.",
    "Warfarin is an anticoagulant that inhibits vitamin K epoxide reductase, depleting active vitamin K needed for clotting factors II, VII, IX, and X.",
]

# Compute BERTScore
precision, recall, f1 = score(
    cands=candidates,
    refs=references,
    model_type="microsoft/deberta-xlarge-mnli",  # High-quality model for scoring
    lang="en",
    verbose=True,
)

for i, (p, r, f) in enumerate(zip(precision, recall, f1)):
    print(f"Example {i+1}:")
    print(f"  Precision: {p:.4f}")
    print(f"  Recall: {r:.4f}")
    print(f"  F1: {f:.4f}")
```
Batch Evaluation on a Test Set
```python
import json

import torch
from bert_score import BERTScorer

# Initialize scorer once (avoid reloading model for each batch)
scorer = BERTScorer(
    model_type="microsoft/deberta-xlarge-mnli",
    lang="en",
    rescale_with_baseline=True,  # Spreads scores out for readability; not bounded to [0, 1]
    device="cuda" if torch.cuda.is_available() else "cpu",  # Fall back to CPU without a GPU
)


def evaluate_with_bertscore(
    generated_outputs: list[str],
    reference_outputs: list[str],
    batch_size: int = 32,
) -> dict:
    """Evaluate a set of generated texts against references."""
    if len(generated_outputs) != len(reference_outputs):
        raise ValueError("generated_outputs and reference_outputs must have the same length")
    if not generated_outputs:
        raise ValueError("at least one example is required")

    all_p, all_r, all_f1 = [], [], []
    for i in range(0, len(generated_outputs), batch_size):
        batch_cands = generated_outputs[i:i+batch_size]
        batch_refs = reference_outputs[i:i+batch_size]
        p, r, f1 = scorer.score(batch_cands, batch_refs)
        all_p.extend(p.tolist())
        all_r.extend(r.tolist())
        all_f1.extend(f1.tolist())

    return {
        "mean_precision": sum(all_p) / len(all_p),
        "mean_recall": sum(all_r) / len(all_r),
        "mean_f1": sum(all_f1) / len(all_f1),
        "min_f1": min(all_f1),
        "max_f1": max(all_f1),
        "n_examples": len(all_f1),
    }


# Example usage
generated = ["Metformin activates AMPK, reducing hepatic glucose output."]
references = ["Metformin inhibits gluconeogenesis by activating AMP-activated protein kinase."]
results = evaluate_with_bertscore(generated, references)
print(json.dumps(results, indent=2))
```
BERTScore vs BLEU vs ROUGE
| Metric | Matching | Semantic | Handles paraphrase | Correlation with human judgment |
|---|---|---|---|---|
| BLEU | n-gram overlap | No | No | Low |
| ROUGE | n-gram recall | No | No | Low-Medium |
| BERTScore | Contextual embedding | Yes | Yes | High |
| LLM-as-judge | Holistic | Yes | Yes | Highest |
BERTScore generally correlates better with human judgments than BLEU or ROUGE. For text generation evaluation, it is one of the strongest automatic metrics that doesn't require an LLM call.
Model Selection for BERTScore
The scoring model matters: stronger models generally give more accurate semantic matching, at the cost of speed.
| Model | Speed | Quality | Recommended for |
|---|---|---|---|
| bert-base-uncased | Fast | Low | Development testing |
| roberta-large | Medium | Good | General use |
| microsoft/deberta-xlarge-mnli | Slow | Excellent | Production evaluation |
For pharmaceutical/medical content, consider using a domain-specific model if one is available for your language.
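One way to keep the model choice in a single place is a small lookup keyed by use case. This is only a sketch: the model names come from the table above, while the tier labels and the `pick_bertscore_model` helper are invented for illustration.

```python
# Model names copied from the table above; the tier labels are illustrative
BERTSCORE_MODELS = {
    "dev": "bert-base-uncased",
    "general": "roberta-large",
    "production": "microsoft/deberta-xlarge-mnli",
}


def pick_bertscore_model(tier: str) -> str:
    """Return the scoring model name for a given (hypothetical) tier."""
    if tier not in BERTSCORE_MODELS:
        known = ", ".join(sorted(BERTSCORE_MODELS))
        raise ValueError(f"unknown tier {tier!r}; expected one of: {known}")
    return BERTSCORE_MODELS[tier]


model_name = pick_bertscore_model("general")
print(model_name)  # roberta-large
```

The returned string can be passed as `model_type` when scoring. Scores from different models are not directly comparable, so keep the tier fixed within one evaluation run.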
Interpreting BERTScore Values
With rescale_with_baseline=True, scores are spread over a wider range and become easier to read. As a rough guide (the exact cut-offs depend on the scoring model and your data):
- F1 around 0.90+: Very strong semantic match
- F1 around 0.80–0.90: Good match with some differences
- F1 around 0.70–0.80: Related content, notable differences
- F1 below 0.70: Significant divergence in content
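If you want these bands in a report, a small helper can label each score. This sketch hard-codes the thresholds from the list above; treat them as rough guidance to calibrate on your own data, and note that `describe_f1` is a made-up name.

```python
def describe_f1(f1: float) -> str:
    """Map a rescaled BERTScore F1 to the rough bands listed above."""
    if f1 >= 0.90:
        return "very strong semantic match"
    if f1 >= 0.80:
        return "good match with some differences"
    if f1 >= 0.70:
        return "related content, notable differences"
    return "significant divergence in content"


for value in (0.93, 0.84, 0.72, 0.41):
    print(f"{value:.2f}: {describe_f1(value)}")
```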
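Without rescaling, raw BERTScore F1 values tend to cluster in a narrow, high range (often around 0.85–0.95) even for poor matches, because contextual embeddings of unrelated tokens still have fairly high cosine similarity. Use rescale_with_baseline=True when you need readable values. The rescaling is linear against a precomputed baseline, so it changes the numbers but not the ranking of candidates. Rescaled scores are not strictly bounded to [0, 1] and can drop below zero for very poor matches.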
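The effect of rescaling is easy to see with a little arithmetic. The sketch below assumes the linear form (raw - baseline) / (1 - baseline) described in the bert_score documentation; the baseline of 0.83 is a made-up number, since real baselines are precomputed per model and language.

```python
def rescale(raw_score: float, baseline: float) -> float:
    """Linearly rescale a raw score so that `baseline` maps to 0 and 1.0 stays 1.0."""
    return (raw_score - baseline) / (1.0 - baseline)


ASSUMED_BASELINE = 0.83  # hypothetical value, for illustration only

for raw in (0.95, 0.88, 0.80):
    print(f"raw={raw:.2f} -> rescaled={rescale(raw, ASSUMED_BASELINE):.2f}")
```

Raw scores that sit close together get spread apart, and a raw score below the baseline comes out negative, which is why rescaled values should not be read as a strict 0-to-1 scale.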
When BERTScore Falls Short
BERTScore is strong but has limitations:
- Doesn't evaluate factual accuracy: "Ibuprofen is an anticoagulant" would score high against "Warfarin is an anticoagulant" because the sentences differ by a single token (same type of claim, wrong drug)
- Doesn't reliably catch hallucinations: If the reference doesn't mention a fact, a hallucinated fact in the candidate is at most weakly penalized and can't be identified as an error
- Reference quality dependent: If your reference answers are poor, BERTScore rewards candidates that match poor references
For applications where factual accuracy is critical (medical, legal), combine BERTScore with LLM-as-judge evaluation that explicitly checks for factual errors.
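A simple way to combine the two signals is to require both before accepting an output. The sketch below is illustrative only: it assumes you already have a BERTScore F1 and a boolean verdict from a separate fact-checking judge, and the names `EvalResult`, `passes_evaluation`, and the 0.80 threshold are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class EvalResult:
    bertscore_f1: float       # rescaled BERTScore F1 for the candidate
    factually_correct: bool   # verdict from a separate judge (hypothetical)


def passes_evaluation(result: EvalResult, min_f1: float = 0.80) -> bool:
    """Accept only outputs that are both semantically close and judged factual."""
    return result.factually_correct and result.bertscore_f1 >= min_f1


# High similarity but a factual error (e.g. the wrong drug) is still rejected
wrong_drug = EvalResult(bertscore_f1=0.95, factually_correct=False)
good_answer = EvalResult(bertscore_f1=0.88, factually_correct=True)

print(passes_evaluation(wrong_drug))   # False
print(passes_evaluation(good_answer))  # True
```

Treating the factual check as a hard gate, rather than averaging it with the similarity score, keeps a high BERTScore from masking a wrong claim.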