BLEU and ROUGE
How BLEU and ROUGE scores work, what they measure, their formulas, implementation, and why they fall short for evaluating modern LLM outputs.
BLEU: Bilingual Evaluation Understudy
BLEU measures how much of the model's output matches the reference text using n-gram precision:
BLEU = BP × exp(Σₙ wₙ log pₙ)
where:
  pₙ = n-gram precision (fraction of n-grams in hypothesis found in reference)
  wₙ = weight for n-gram order (typically 1/N for N-gram BLEU)
  BP = brevity penalty (penalises outputs shorter than reference)
BLEU-4 (most common):
  Uses 1-gram, 2-gram, 3-gram, 4-gram precision
  Weights: w₁ = w₂ = w₃ = w₄ = 0.25
BLEU Computation
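Two pieces of the formula are easy to check in isolation: the clipping applied when counting matches, and the brevity penalty. The following is a minimal sketch for illustration only. The helper names `clipped_unigram_precision` and `brevity_penalty` are not part of the original article, and returning 0.0 for an empty hypothesis is an assumption made here.

```python
import math
from collections import Counter


def clipped_unigram_precision(hypothesis, reference):
    """Fraction of hypothesis tokens matched in the reference, with clipping."""
    hyp_counts = Counter(hypothesis)
    ref_counts = Counter(reference)
    # Each token is credited at most as often as it occurs in the reference.
    clipped = sum(min(count, ref_counts[tok]) for tok, count in hyp_counts.items())
    return clipped / max(len(hypothesis), 1)


def brevity_penalty(hyp_len, ref_len):
    """1.0 if the hypothesis is at least as long as the reference, else exp(1 - r/c)."""
    if hyp_len == 0:
        return 0.0  # assumption: an empty hypothesis gets no credit
    return min(1.0, math.exp(1 - ref_len / hyp_len))


# Repeating a matching word does not inflate precision: 1 of 4 tokens is credited.
print(clipped_unigram_precision(["the", "the", "the", "the"], ["the", "cat"]))  # 0.25

# A 3-token hypothesis against a 6-token reference: exp(1 - 6/3) = exp(-1)
print(round(brevity_penalty(3, 6), 3))  # 0.368
print(brevity_penalty(6, 6))            # 1.0
```

Without clipping, the first example would score a perfect 4/4, which is why BLEU counts clipped matches rather than raw ones.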
from collections import Counter
import math

def count_ngrams(tokens, n):
    """Count every contiguous n-gram in a token list."""
    return Counter(tuple(tokens[i:i+n]) for i in range(len(tokens)-n+1))

def bleu_score(hypothesis: list, reference: list, max_n: int = 4) -> float:
    # Brevity penalty: exp(1 - r/c) when the hypothesis is shorter than the reference, else 1
    bp = min(1.0, math.exp(1 - len(reference)/max(len(hypothesis), 1)))
    log_bleu = 0.0
    for n in range(1, max_n + 1):
        hyp_ngrams = count_ngrams(hypothesis, n)
        ref_ngrams = count_ngrams(reference, n)
        # Clipped precision: an n-gram earns credit at most as often as it appears in the reference
        clipped = sum(min(hyp_ngrams[g], ref_ngrams[g]) for g in hyp_ngrams)
        total = max(sum(hyp_ngrams.values()), 1)
        precision = clipped / total
        if precision > 0:
            log_bleu += (1/max_n) * math.log(precision)
        else:
            # Without smoothing, one n-gram order with no matches zeroes the whole score
            return 0.0
    return bp * math.exp(log_bleu)

# Example:
hyp = ["the", "cat", "sat", "on", "the", "mat"]
ref = ["the", "cat", "is", "on", "the", "mat"]
print(f"BLEU-4: {bleu_score(hyp, ref):.3f}")  # 0.000 (no 4-gram matches)
print(f"BLEU-2: {bleu_score(hyp, ref, max_n=2):.3f}")  # 0.707

ROUGE: Recall-Oriented Understudy for Gisting Evaluation
ROUGE measures recall (how much of the reference appears in the output), primarily for summarisation:
ROUGE-N = (matched n-grams) / (n-grams in reference)
ROUGE-1: unigram recall
ROUGE-2: bigram recall
ROUGE-L: longest common subsequence F1
Common reporting format:
ROUGE-1, ROUGE-2, ROUGE-L as F1 scores (combining precision and recall)
ROUGE Implementation
def rouge_n(hypothesis: list, reference: list, n: int) -> dict:
    # Reuses count_ngrams from the BLEU code above
    hyp_ngrams = count_ngrams(hypothesis, n)
    ref_ngrams = count_ngrams(reference, n)
    matched = sum(min(hyp_ngrams[g], ref_ngrams[g]) for g in hyp_ngrams)
    precision = matched / max(sum(hyp_ngrams.values()), 1)
    recall = matched / max(sum(ref_ngrams.values()), 1)
    f1 = (2 * precision * recall) / max(precision + recall, 1e-9)
    return {"precision": precision, "recall": recall, "f1": f1}
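ROUGE-L is listed among the variants but is not covered by an n-gram counter, since it depends on the longest common subsequence (LCS). A minimal sketch of a sentence-level version follows. The function names `lcs_length` and `rouge_l` are illustrative, and the use of a plain F1 (rather than a weighted F-measure) is an assumption chosen to match the reporting format described above.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    # prev[j] holds the LCS length of the processed prefix of a and b[:j].
    prev = [0] * (len(b) + 1)
    for tok_a in a:
        curr = [0]
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(hypothesis, reference):
    """Sentence-level ROUGE-L precision, recall and F1 from the LCS length."""
    lcs = lcs_length(hypothesis, reference)
    precision = lcs / max(len(hypothesis), 1)
    recall = lcs / max(len(reference), 1)
    f1 = (2 * precision * recall) / max(precision + recall, 1e-9)
    return {"precision": precision, "recall": recall, "f1": f1}


hyp = ["the", "cat", "sat", "on", "the", "mat"]
ref = ["the", "cat", "is", "on", "the", "mat"]
# The LCS is "the cat on the mat" (5 tokens), so precision, recall and F1 are all 5/6.
print(rouge_l(hyp, ref))
```

Unlike ROUGE-2, the subsequence does not need to be contiguous, so a single substituted word in the middle of a sentence costs only one token of credit.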
# HuggingFace implementation (requires the evaluate and rouge_score packages):
from evaluate import load

rouge = load("rouge")
result = rouge.compute(
    predictions=["The patient takes Warfarin."],
    references=["The patient is prescribed Warfarin."]
)
print(result)
# Approximately {'rouge1': 0.667, 'rouge2': 0.286, 'rougeL': 0.667, 'rougeLsum': 0.667} (F1 values)

Typical Score Ranges
Machine translation (BLEU, on a 0-100 scale, i.e. the 0-1 score × 100):
  < 10: almost unusable
  10-19: poor
  20-29: understandable
  30-40: good
  40-50: high quality
  > 50: approaching human
Summarisation (ROUGE-1 F1, on a 0-100 scale):
  Typical neural models: 40-55
  Human summaries vs each other: ~60-70
These ranges are domain- and task-specific; do not compare BLEU across different datasets or language pairs.
Why BLEU and ROUGE Fall Short
1. Reference dependence:
   "The medication reduces clotting." vs "Warfarin prevents blood clots."
   BLEU = 0 (no shared n-grams) even though the two sentences say nearly the same thing.
2. No semantic understanding:
   Rewording with synonyms scores near 0 even if the meaning is preserved.
   "The cat sat on the mat" vs "The feline rested on the rug" → low BLEU
3. No measure of factual accuracy:
   A fluent hallucination scores high if it matches the reference wording.
4. A reference is required:
   For open-ended generation, there may be no reference.
   When several answers are valid, any that differ from the reference are penalised.
5. Length artefacts:
   BLEU penalises short outputs; ROUGE recall rewards long ones, and ROUGE F1 penalises them through precision.
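Points 1 and 3 can be reproduced with nothing more than unigram overlap. The sketch below is illustrative: `unigram_f1` is a simplified ROUGE-1-style helper written for this example (whitespace tokens, lowercased, trailing full stop stripped), and the sentence "Warfarin causes blood clots." is a made-up factual error, not taken from the article.

```python
from collections import Counter


def unigram_f1(hypothesis, reference):
    """ROUGE-1-style F1 over whitespace tokens of two sentences."""
    hyp = Counter(hypothesis.lower().rstrip(".").split())
    ref = Counter(reference.lower().rstrip(".").split())
    matched = sum((hyp & ref).values())  # multiset intersection = clipped matches
    if matched == 0:
        return 0.0
    precision = matched / sum(hyp.values())
    recall = matched / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


reference = "Warfarin prevents blood clots."

# Point 1: a faithful paraphrase shares no tokens with the reference.
paraphrase = "The medication reduces clotting."
print(unigram_f1(paraphrase, reference))  # 0.0

# Point 3: one changed word reverses the meaning but keeps most of the overlap.
wrong = "Warfarin causes blood clots."
print(round(unigram_f1(wrong, reference), 2))  # 0.75
```

The overlap metric ranks the dangerous sentence far above the correct paraphrase, which is the opposite of what a human evaluator would do.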
Modern evaluation uses BERTScore, LLM-as-judge, or task-specific metrics.
BERTScore: Semantic Alternative
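from evaluate import load

bertscore = load("bertscore")
result = bertscore.compute(
    predictions=["The feline rested on the rug"],
    references=["The cat sat on the mat"],
    lang="en"
)
print(result)
# e.g. {'precision': [0.93], 'recall': [0.92], 'f1': [0.92], ...}
# (illustrative values; exact numbers depend on the underlying model and library version)
# Much higher than BLEU-4 ≈ 0.0 for this paraphrase

BERTScore computes the cosine similarity between the contextual embeddings of every prediction token and every reference token. Each token is then greedily matched to its most similar counterpart on the other side: averaging those maxima over the prediction tokens gives precision, averaging over the reference tokens gives recall, and F1 combines the two.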
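The greedy matching step can be sketched without any model. In the example below the 2-d vectors are made up purely for illustration (a real run would use contextual embeddings from a model such as BERT), the function names are hypothetical, and optional refinements such as idf weighting are left out.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def greedy_match_scores(pred_vecs, ref_vecs):
    """BERTScore-style precision, recall and F1 from per-token vectors."""
    # Precision: each prediction token takes its best match in the reference.
    precision = sum(max(cosine(p, r) for r in ref_vecs) for p in pred_vecs) / len(pred_vecs)
    # Recall: each reference token takes its best match in the prediction.
    recall = sum(max(cosine(r, p) for p in pred_vecs) for r in ref_vecs) / len(ref_vecs)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy vectors (assumed, not real embeddings): synonyms point in similar directions.
pred = {"feline": [0.9, 0.1], "rested": [0.2, 0.8]}
ref = {"cat": [1.0, 0.0], "sat": [0.0, 1.0]}

p, r, f1 = greedy_match_scores(list(pred.values()), list(ref.values()))
print(f"precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
# precision=0.982 recall=0.982 f1=0.982
```

Because "feline" and "cat" never match as strings, an n-gram metric gives them no credit, while the embedding comparison treats them as near-identical.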
Interview Answer
"BLEU measures n-gram precision of the model's output against a reference, with a brevity penalty for short outputs. ROUGE measures recall, i.e. how much of the reference appears in the output, and is standard for summarisation. Both require a reference, don't capture semantic similarity (a perfect paraphrase can score near 0), and don't measure factual accuracy. Modern LLM evaluation has moved toward BERTScore (embedding-based matching), ROUGE-L for extractive tasks, and LLM-as-judge for open-ended generation where no single reference exists."