BLEU Score for Text Generation
Learn how BLEU score works, what it measures, when to use it, and why it fails for many modern NLP tasks.
BLEU (Bilingual Evaluation Understudy) was introduced in 2002 for machine translation evaluation. It remains one of the most cited metrics in NLP research — and one of the most misapplied. This lesson explains how it works and when it actually makes sense to use it.
What BLEU Measures
BLEU measures the n-gram overlap between a model-generated text (the hypothesis) and one or more reference texts. The core intuition: a hypothesis that shares many words and phrases with a human reference translation is probably a good translation itself.
This works reasonably well for translation. It works poorly for tasks where the space of valid responses is large.
Building BLEU from Scratch
Understanding the formula is essential for knowing when BLEU will mislead you.
Step 1: N-gram Precision
For a given n, compute precision = (number of hypothesis n-grams that appear in reference) / (total hypothesis n-grams).
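To see why the raw ratio needs clipping, here is a minimal sketch comparing naive and clipped unigram precision on a degenerate hypothesis. The helper name `unigram_precision` and the toy sentences are assumptions made for illustration only.

```python
from collections import Counter

def unigram_precision(hypothesis: list[str], reference: list[str], clip: bool) -> float:
    """Unigram precision with or without clipping (illustrative helper)."""
    hyp_counts = Counter(hypothesis)
    ref_counts = Counter(reference)
    matched = 0
    for word, count in hyp_counts.items():
        if clip:
            # A word can match at most as often as it occurs in the reference
            matched += min(count, ref_counts[word])
        elif word in ref_counts:
            # Naive: every occurrence counts if the word appears anywhere
            matched += count
    return matched / len(hypothesis)

hyp = "the the the the".split()
ref = "the cat sat".split()
print(unigram_precision(hyp, ref, clip=False))  # 1.0
print(unigram_precision(hyp, ref, clip=True))   # 0.25
```

Without clipping, repeating one reference word scores perfectly. With clipping, only one of the four copies of "the" is credited.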
from collections import Counter
from typing import Optional

def ngrams(tokens: list[str], n: int) -> list[tuple]:
    """Extract all n-grams from a token list."""
    return [tuple(tokens[i:i+n]) for i in range(len(tokens) - n + 1)]

def clipped_precision(
    hypothesis: list[str],
    references: list[list[str]],
    n: int,
) -> tuple[int, int]:
    """
    Compute clipped n-gram precision counts.
    Returns (numerator, denominator) to allow accumulation across sentences.
    The "clipped" part: each hypothesis n-gram can only match a reference n-gram
    as many times as that n-gram appears in the reference. This prevents
    "the the the the" from scoring perfectly against "the cat sat".
    """
    hyp_ngrams = Counter(ngrams(hypothesis, n))
    # For each n-gram, keep the highest count seen in any single reference
    max_ref_counts = Counter()
    for ref in references:
        ref_ngrams = Counter(ngrams(ref, n))
        for gram, count in ref_ngrams.items():
            max_ref_counts[gram] = max(max_ref_counts[gram], count)
    # Clip: each hypothesis n-gram can match at most max_ref_counts[gram] times
    numerator = 0
    for gram, hyp_count in hyp_ngrams.items():
        numerator += min(hyp_count, max_ref_counts.get(gram, 0))
    denominator = sum(hyp_ngrams.values())
    return numerator, denominator

# Example
hypothesis = "the cat sat on the mat".split()
reference = "the cat is sitting on the mat".split()
# Pass the hypothesis tokens and a list of reference token lists
for n in [1, 2, 3, 4]:
    num, denom = clipped_precision(hypothesis, [reference], n)
    precision = num / denom if denom > 0 else 0.0
    print(f"{n}-gram precision: {num}/{denom} = {precision:.3f}")
# 1-gram precision: 5/6 = 0.833
# 2-gram precision: 3/5 = 0.600
# 3-gram precision: 1/4 = 0.250
# 4-gram precision: 0/3 = 0.000

Step 2: Brevity Penalty
Without a penalty, a model could maximize precision by generating just one word — the most common word in all references.
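The exploit can be made concrete with a small sketch. The function name `clipped_unigram_precision` and the example sentences are illustrative assumptions; the point is only that precision alone prefers the one-word output.

```python
from collections import Counter

def clipped_unigram_precision(hypothesis: list[str], reference: list[str]) -> float:
    """Clipped unigram precision against a single reference (illustrative helper)."""
    hyp_counts, ref_counts = Counter(hypothesis), Counter(reference)
    matched = sum(min(count, ref_counts[word]) for word, count in hyp_counts.items())
    return matched / len(hypothesis)

reference = "the cat is sitting on the mat".split()
one_word = ["the"]
full_sentence = "the cat sat on the mat".split()

print(clipped_unigram_precision(one_word, reference))                 # 1.0
print(round(clipped_unigram_precision(full_sentence, reference), 3))  # 0.833
```

The single word "the" gets a perfect precision of 1.0, while the sensible six-word translation gets 0.833. A length-based penalty is what restores the expected ranking.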
import math

def brevity_penalty(
    hypothesis_length: int,
    reference_length: int,
) -> float:
    """
    Penalizes outputs that are shorter than the reference.
    If hypothesis is at least as long as reference: penalty = 1 (no penalty)
    If hypothesis is shorter: penalty = exp(1 - reference_len/hypothesis_len)
    """
    if hypothesis_length == 0:
        return 0.0  # Empty output: avoid dividing by zero
    if hypothesis_length >= reference_length:
        return 1.0
    return math.exp(1 - reference_length / hypothesis_length)

# Demonstration
for hyp_len in [2, 5, 7, 10, 15]:
    bp = brevity_penalty(hyp_len, reference_length=10)
    print(f"  hyp_len={hyp_len:3d}, ref_len=10: BP={bp:.3f}")
# At hyp_len=10: BP=1.000 (no penalty)
# At hyp_len=7:  BP=0.651
# At hyp_len=5:  BP=0.368
# At hyp_len=2:  BP=0.018

Step 3: Final BLEU Score
BLEU combines n-gram precisions (1 through 4) using a geometric mean, then applies the brevity penalty.
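One consequence of the geometric mean is worth seeing in isolation: a single zero precision zeroes the whole score. The sketch below assumes uniform weights, and the name `geometric_mean` and the precision values are made up for illustration.

```python
import math

def geometric_mean(precisions: list[float]) -> float:
    """Uniform-weight geometric mean, computed in log space."""
    if any(p == 0.0 for p in precisions):
        # log(0) is undefined, so one zero precision zeroes the result
        return 0.0
    return math.exp(sum(math.log(p) for p in precisions) / len(precisions))

print(geometric_mean([0.8, 0.6, 0.4, 0.2]))  # below the arithmetic mean of 0.5
print(geometric_mean([0.8, 0.6, 0.4, 0.0]))  # 0.0
```

This is why a short sentence with no matching 4-gram scores 0 on BLEU-4 regardless of its lower-order precisions, and why BLEU is usually aggregated over a corpus or smoothed rather than averaged over single sentences.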
def bleu_score(
    hypothesis: str,
    references: list[str],
    max_n: int = 4,
    weights: Optional[list[float]] = None,
) -> float:
    """
    Compute sentence-level BLEU for one hypothesis against its references.
    Args:
        hypothesis: Generated text
        references: List of reference texts
        max_n: Maximum n-gram order (default 4 for BLEU-4)
        weights: Per-order weights (default: uniform)
    Returns:
        BLEU score in [0, 1]
    """
    if weights is None:
        weights = [1.0 / max_n] * max_n
    hyp_tokens = hypothesis.lower().split()
    ref_tokens_list = [r.lower().split() for r in references]
    # Compute clipped precision for each n
    log_avg_precision = 0.0
    for n in range(1, max_n + 1):
        num, denom = clipped_precision(hyp_tokens, ref_tokens_list, n)
        if denom == 0 or num == 0:
            return 0.0  # No smoothing: any zero precision zeroes the score
        precision = num / denom
        log_avg_precision += weights[n-1] * math.log(precision)
    # Brevity penalty: use the reference length closest to the hypothesis
    # length (ties go to the shorter reference), as in standard BLEU
    hyp_len = len(hyp_tokens)
    closest_ref_len = min(
        (len(r) for r in ref_tokens_list),
        key=lambda ref_len: (abs(ref_len - hyp_len), ref_len),
    )
    bp = brevity_penalty(hyp_len, closest_ref_len)
    return bp * math.exp(log_avg_precision)

# Test
hyp = "the cat sat on the mat"
refs = ["the cat is sitting on the mat", "a cat sat on the mat"]
score = bleu_score(hyp, refs)
print(f"BLEU-4: {score:.4f}")
# BLEU-4: 0.8409

Using sacrebleu (the Standard Implementation)
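For reproducible, publication-quality BLEU, use sacrebleu. It applies the same documented tokenization to hypotheses and references, so scores computed by different people on the same data are comparable.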
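A small sketch of why tokenization matters: the same hypothesis and reference give different match counts under two tokenizers. Both tokenizers here are illustrative assumptions (the regex is not sacrebleu's tokenizer), as is the helper name `unigram_matches`.

```python
import re

def whitespace_tokenize(text: str) -> list[str]:
    """Split on whitespace only, so punctuation stays attached to words."""
    return text.split()

def punctuation_tokenize(text: str) -> list[str]:
    """Split words and punctuation into separate tokens (illustrative rule only)."""
    return re.findall(r"\w+|[^\w\s]", text)

def unigram_matches(hyp: str, ref: str, tokenize) -> tuple[int, int]:
    """Return (matching hypothesis tokens, total hypothesis tokens)."""
    hyp_tokens, ref_tokens = tokenize(hyp), tokenize(ref)
    matched = sum(1 for token in hyp_tokens if token in ref_tokens)
    return matched, len(hyp_tokens)

hyp = "The cat sat on the mat."
ref = "The cat sat on the mat !"

print(unigram_matches(hyp, ref, whitespace_tokenize))   # (5, 6)
print(unigram_matches(hyp, ref, punctuation_tokenize))  # (6, 7)
```

Under whitespace splitting, "mat." fails to match "mat"; once punctuation is split off, it matches. Two papers using different tokenizers would report different BLEU for the same system output.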
# pip install sacrebleu
import sacrebleu

def compute_sacrebleu(
    hypotheses: list[str],
    references: list[list[str]],
) -> dict:
    """
    Compute BLEU with sacrebleu (standard tokenization).
    Args:
        hypotheses: List of generated texts
        references: List of reference lists. Each inner list is all references
            for the corresponding hypothesis.
    Returns:
        BLEU score dict
    """
    # sacrebleu expects references transposed: list of lists where
    # outer list is over reference sets, inner list is over examples.
    # zip(*...) assumes every example has the same number of references.
    refs_transposed = list(zip(*references))
    bleu = sacrebleu.corpus_bleu(
        hypotheses,
        [list(r) for r in refs_transposed],
    )
    return {
        "bleu": round(bleu.score, 2),  # 0-100 scale
        "bleu_normalized": round(bleu.score / 100, 4),  # 0-1 scale
        "bp": round(bleu.bp, 4),
        "precisions": [round(p, 2) for p in bleu.precisions],  # 1-gram through 4-gram precision
        "n_examples": len(hypotheses),
    }

# Example: machine translation evaluation
hypotheses = [
    "The cat sat on the mat.",
    "She went to the market to buy vegetables.",
    "The hospital is located on Main Street.",
]
references_per_example = [
    ["The cat is sitting on the mat.", "A cat sat upon the mat."],
    ["She went to the market to purchase vegetables.", "She visited the market for vegetables."],
    ["The hospital is on Main Street.", "Main Street is where the hospital is."],
]
result = compute_sacrebleu(hypotheses, references_per_example)
print(result)

BLEU Variants and Related Metrics
| Variant | N-gram Order | Weights | Use |
|---------|--------------|---------|-----|
| BLEU-1 | Unigram only | [1.0] | Vocabulary coverage |
| BLEU-2 | 1- and 2-gram | [0.5, 0.5] | Short phrase matching |
| BLEU-4 | 1- to 4-gram | [0.25 each] | Standard for translation |
| chrF | Character n-gram | n/a | Better for morphologically rich languages |
| TER | Edit distance | n/a | Translation edit rate |
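To illustrate why character n-grams help with inflected word forms, here is a rough sketch of an F1 over character trigrams. It is not the chrF formula, which averages several n-gram orders and weights recall more heavily; the function names and the word pair are assumptions for illustration.

```python
from collections import Counter

def char_ngrams(text: str, n: int) -> Counter:
    """Count character n-grams in a string."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def char_ngram_f1(hypothesis: str, reference: str, n: int = 3) -> float:
    """Plain F1 over character n-grams of a single order (simplified, not chrF)."""
    hyp, ref = char_ngrams(hypothesis, n), char_ngrams(reference, n)
    overlap = sum((hyp & ref).values())  # Clipped overlap via Counter intersection
    if overlap == 0:
        return 0.0
    precision = overlap / sum(hyp.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

hyp, ref = "translations", "translation"
print(int(hyp == ref))                    # 0: a word-level match fails outright
print(round(char_ngram_f1(hyp, ref), 3))  # 0.947
```

A word-level metric treats the two forms as entirely different tokens, while the character-level score gives partial credit for the shared stem.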
# chrF: character-level F-score, often better than word-level BLEU
chrf = sacrebleu.corpus_chrf(
    hypotheses,
    [list(r) for r in zip(*references_per_example)],
)
print(f"chrF score: {chrf.score:.2f}")

# TER (Translation Edit Rate): lower is better
ter = sacrebleu.corpus_ter(
    hypotheses,
    [list(r) for r in zip(*references_per_example)],
)
print(f"TER score: {ter.score:.2f}")

Weaknesses of BLEU
1. Penalizes Valid Paraphrases
reference = "The patient should take the medication with food."
# Valid paraphrase — different wording, same meaning
paraphrase = "The medication should be taken at mealtime."
# Near-copy — slightly reworded
near_copy = "The patient should take medication with food."
score_paraphrase = bleu_score(paraphrase, [reference])
score_near_copy = bleu_score(near_copy, [reference])
print(f"Valid paraphrase BLEU: {score_paraphrase:.4f}") # Low! Penalized.
print(f"Near-copy BLEU: {score_near_copy:.4f}") # High.
# BLEU rewards near-copies over valid paraphrases

2. Ignores Semantic Correctness
# BLEU does not detect factual errors
correct = "Ibuprofen reduces inflammation by inhibiting COX enzymes."
wrong_but_fluent = "Ibuprofen reduces inflammation by inhibiting ATP enzymes."
# High word overlap, but medically wrong
score = bleu_score(wrong_but_fluent, [correct])
print(f"BLEU for factually wrong output: {score:.4f}")
# This may still score reasonably well: BLEU cannot detect the error

3. No Document-Level Coherence
BLEU computes n-gram statistics locally. It cannot evaluate whether a long document is logically coherent.
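A small sketch of this locality, under the assumption that a whole document is scored as one token stream against one reference: reordering the sentences so the events no longer make sense changes only the n-grams that span sentence boundaries. The helper name and the toy document are illustrative.

```python
from collections import Counter

def clipped_precision_single_ref(hyp: list[str], ref: list[str], n: int) -> float:
    """Clipped n-gram precision against one reference (illustrative helper)."""
    hyp_counts = Counter(zip(*(hyp[i:] for i in range(n))))
    ref_counts = Counter(zip(*(ref[i:] for i in range(n))))
    matched = sum(min(count, ref_counts[gram]) for gram, count in hyp_counts.items())
    total = sum(hyp_counts.values())
    return matched / total if total else 0.0

# Reference: admitted, then treated, then discharged
coherent = "the patient was admitted she was treated she was discharged".split()
# Hypothesis: discharged before being admitted
scrambled = "she was discharged the patient was admitted she was treated".split()

print(clipped_precision_single_ref(scrambled, coherent, 1))            # 1.0
print(round(clipped_precision_single_ref(scrambled, coherent, 2), 3))  # 0.889
```

The scrambled document keeps perfect unigram precision and loses only one bigram out of nine, even though the order of events is now wrong.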
When BLEU Is Appropriate
Good fit:
- Machine translation (its original purpose)
- Constrained paraphrase tasks where outputs should stay close to the source
- Academic comparison of translation models (community standard)
Poor fit:
- Open-ended question answering
- Chatbot evaluation
- Summarization where diverse phrasing is acceptable
- Any task where there is no single reference answer
# Task suitability classifier
BLEU_SUITABLE_TASKS = {
    "machine_translation": True,
    "constrained_paraphrase": True,
    "fill_in_the_blank": True,
    "open_qa": False,
    "conversation": False,
    "summarization": False,  # Use ROUGE instead
    "code_generation": False,  # Use pass@k instead
    "medical_qa": False,  # Use human eval or LLM judge
}

def is_bleu_appropriate(task: str) -> bool:
    suitable = BLEU_SUITABLE_TASKS.get(task, False)
    if not suitable:
        print(f"Warning: BLEU is not appropriate for '{task}' tasks.")
        print("Consider ROUGE-L, BERTScore, or LLM-as-judge instead.")
    return suitable

Full Evaluation Pipeline Example
from dataclasses import dataclass
from typing import Callable

import sacrebleu

@dataclass
class TranslationEvalResult:
    model_name: str
    n_examples: int
    bleu4: float
    chrf: float
    mean_length_ratio: float

def eval_translation_model(
    model_name: str,
    source_texts: list[str],
    reference_translations: list[list[str]],
    translate_fn: Callable[[str], str],
) -> TranslationEvalResult:
    hypotheses = [translate_fn(src) for src in source_texts]
    # Transpose to sacrebleu's layout: one list per reference set
    refs_transposed = [list(r) for r in zip(*reference_translations)]
    bleu = sacrebleu.corpus_bleu(hypotheses, refs_transposed)
    chrf = sacrebleu.corpus_chrf(hypotheses, refs_transposed)
    # Length ratio against the first reference of each example
    hyp_lengths = [len(h.split()) for h in hypotheses]
    ref_lengths = [len(refs[0].split()) for refs in reference_translations]
    ratios = [h / r for h, r in zip(hyp_lengths, ref_lengths) if r > 0]
    mean_length_ratio = sum(ratios) / len(ratios) if ratios else 0.0
    return TranslationEvalResult(
        model_name=model_name,
        n_examples=len(hypotheses),
        bleu4=round(bleu.score, 2),
        chrf=round(chrf.score, 2),
        mean_length_ratio=round(mean_length_ratio, 3),
    )

Key Takeaways
- BLEU measures n-gram overlap between hypothesis and one or more reference translations.
- BLEU-4 is standard: combines unigram through 4-gram precision with a brevity penalty.
- Use sacrebleu for reproducible, publication-comparable scores.
- BLEU penalizes valid paraphrases and cannot detect factual errors.
- Appropriate for machine translation and constrained paraphrase tasks.
- Not appropriate for chatbots, open QA, summarization, or medical/legal output.
What's Next
In eval-rouge.mdx, you will learn about ROUGE — the recall-oriented complement to BLEU, designed for summarization evaluation.