BLEU Score for Text Generation

BLEU (Bilingual Evaluation Understudy) was introduced in 2002 for machine translation evaluation. It remains one of the most cited metrics in NLP research — and one of the most misapplied. This lesson explains how it works and when it actually makes sense to use it.

What BLEU Measures

BLEU measures the n-gram overlap between a model-generated text (the hypothesis) and one or more reference texts. The core intuition: if a hypothesis shares many words and phrases with a good reference translation, the hypothesis is probably a good translation too.

This works reasonably well for translation. It works poorly for tasks where the space of valid responses is large.

Building BLEU from Scratch

Understanding the formula is essential for knowing when BLEU will mislead you.

Step 1: N-gram Precision

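For a given n, compute precision = (number of hypothesis n-grams that also appear in a reference, with each n-gram's count clipped to the most times it appears in any single reference) / (total number of hypothesis n-grams).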

Python

from collections import Counter
from typing import Optional

def ngrams(tokens: list[str], n: int) -> list[tuple]:
    """Extract all n-grams from a token list."""
    return [tuple(tokens[i:i+n]) for i in range(len(tokens) - n + 1)]


def clipped_precision(
    hypothesis: list[str],
    references: list[list[str]],
    n: int,
) -> tuple[int, int]:
    """
    Compute clipped n-gram precision counts.
    
    Returns (numerator, denominator) to allow accumulation across sentences.
    
    The "clipped" part: each hypothesis n-gram can only match a reference n-gram
    as many times as that n-gram appears in the reference. This prevents
    "the the the the" from scoring perfectly against "the cat sat".
    """
    hyp_ngrams = Counter(ngrams(hypothesis, n))
    
    # For each reference, count available n-gram matches
    max_ref_counts = Counter()
    for ref in references:
        ref_ngrams = Counter(ngrams(ref, n))
        for gram, count in ref_ngrams.items():
            max_ref_counts[gram] = max(max_ref_counts[gram], count)
    
    # Clip: each hypothesis n-gram can match at most max_ref_counts[gram] times
    numerator = 0
    for gram, hyp_count in hyp_ngrams.items():
        numerator += min(hyp_count, max_ref_counts.get(gram, 0))
    
    denominator = sum(hyp_ngrams.values())
    return numerator, denominator


# Example
hypothesis = "the cat sat on the mat".split()
reference = "the cat is sitting on the mat".split()

for n in [1, 2, 3, 4]:
    # hypothesis is a flat token list; references is a list of token lists
    num, denom = clipped_precision(hypothesis, [reference], n)
    precision = num / denom if denom > 0 else 0.0
    print(f"{n}-gram precision: {num}/{denom} = {precision:.3f}")
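To make the effect of clipping concrete, the following minimal sketch compares unigram precision with and without clipping on the degenerate case mentioned in the docstring. The helper names `ngram_counts` and `unigram_precision` are illustrative and not part of the lesson's code.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Count every n-gram (as a tuple) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def unigram_precision(hypothesis, reference, clip):
    """Unigram precision against a single reference, with or without clipping."""
    hyp = ngram_counts(hypothesis, 1)
    ref = ngram_counts(reference, 1)
    if clip:
        # Each word is credited at most as often as it occurs in the reference
        matched = sum(min(count, ref[gram]) for gram, count in hyp.items())
    else:
        # Every occurrence is credited as long as the word occurs in the reference at all
        matched = sum(count for gram, count in hyp.items() if gram in ref)
    return matched / sum(hyp.values())


degenerate = "the the the the".split()
reference = "the cat sat".split()

unclipped = unigram_precision(degenerate, reference, clip=False)
clipped = unigram_precision(degenerate, reference, clip=True)
print(f"unclipped: {unclipped:.2f}")  # 1.00
print(f"clipped:   {clipped:.2f}")    # 0.25
```

Without clipping, all four copies of "the" count as matches. With clipping, only one does, because the reference contains "the" once.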

Step 2: Brevity Penalty

Without a penalty, a model could maximize precision by generating just one word — the most common word in all references.

Python

import math

def brevity_penalty(
    hypothesis_length: int,
    reference_length: int,
) -> float:
    """
    Penalizes outputs that are shorter than the reference.
    
    If hypothesis is at least as long as reference: penalty = 1 (no penalty)
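    If hypothesis is shorter: penalty = exp(1 - reference_len/hypothesis_len)
    An empty hypothesis scored against a non-empty reference gets penalty 0,
    which also avoids dividing by zero.
    """
    if hypothesis_length >= reference_length:
        return 1.0
    if hypothesis_length == 0:
        return 0.0
    return math.exp(1 - reference_length / hypothesis_length)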


# Demonstration
for hyp_len in [2, 5, 7, 10, 15]:
    bp = brevity_penalty(hyp_len, reference_length=10)
    print(f"  hyp_len={hyp_len:3d}, ref_len=10: BP={bp:.3f}")
# At hyp_len=10: BP=1.000 (no penalty)
# At hyp_len=5:  BP=0.368  (exp(1 - 10/5) = exp(-1))
# At hyp_len=2:  BP=0.018  (exp(1 - 10/2) = exp(-4))
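As a rough illustration of why the penalty is needed, the sketch below scores a one-word output against a seven-word reference. The example sentence and variable names are assumptions chosen for this demonstration. The point is only that precision alone rates the output perfectly, while the brevity penalty removes almost all of that credit.

```python
import math


def brevity_penalty(hyp_len: int, ref_len: int) -> float:
    """BP = 1 if the hypothesis is at least as long as the reference, else exp(1 - r/c)."""
    if hyp_len >= ref_len:
        return 1.0
    if hyp_len == 0:
        return 0.0
    return math.exp(1 - ref_len / hyp_len)


reference = "the cat is sitting on the mat".split()
hypothesis = ["the"]  # a one-word output that trivially matches the reference

# Fraction of hypothesis words found in the reference (no clipping needed for one word)
unigram_precision = sum(tok in reference for tok in hypothesis) / len(hypothesis)
bp = brevity_penalty(len(hypothesis), len(reference))

print(f"unigram precision: {unigram_precision:.3f}")       # 1.000
print(f"brevity penalty:   {bp:.4f}")                      # 0.0025
print(f"penalized score:   {bp * unigram_precision:.4f}")  # 0.0025
```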

Step 3: Final BLEU Score

BLEU combines n-gram precisions (1 through 4) using a geometric mean, then applies the brevity penalty.
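Written as a formula, the combination looks like this. The symbols are introduced here for convenience: p_n is the clipped n-gram precision from Step 1, w_n is the weight for order n (1/N each by default, with N = 4), c is the hypothesis length, and r is the reference length.

```latex
\mathrm{BLEU} = \mathrm{BP} \cdot \exp\!\left( \sum_{n=1}^{N} w_n \log p_n \right),
\qquad
\mathrm{BP} =
\begin{cases}
1 & \text{if } c \ge r, \\
e^{\,1 - r/c} & \text{if } c < r.
\end{cases}
```

With uniform weights, the exponential term is exactly the geometric mean of p_1 through p_N. Because of the logarithm, a single precision of zero drives the whole score to zero, which is why short sentences with no matching 4-gram score 0 unless some smoothing is applied.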

Python

def bleu_score(
    hypothesis: str,
    references: list[str],
    max_n: int = 4,
    weights: Optional[list[float]] = None,
) -> float:
    """
    Compute sentence-level BLEU for one hypothesis against its references.
    
    Args:
        hypothesis: Generated text
        references: List of reference texts
        max_n: Maximum n-gram order (default 4 for BLEU-4)
        weights: Per-order weights (default: uniform)
    
    Returns:
        BLEU score in [0, 1]
    """
    if weights is None:
        weights = [1.0 / max_n] * max_n
    
    hyp_tokens = hypothesis.lower().split()
    ref_tokens_list = [r.lower().split() for r in references]
    
    # Compute clipped precision for each n
    log_avg_precision = 0.0
    for n in range(1, max_n + 1):
        # Pass the flat token list and the list of reference token lists
        num, denom = clipped_precision(hyp_tokens, ref_tokens_list, n)
        if denom == 0 or num == 0:
            return 0.0
        precision = num / denom
        log_avg_precision += weights[n-1] * math.log(precision)
    
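    # Brevity penalty: use the reference length closest to the hypothesis
    # length (ties go to the shorter reference)
    hyp_len = len(hyp_tokens)
    best_ref_len = min(
        (len(r) for r in ref_tokens_list),
        key=lambda ref_len: (abs(ref_len - hyp_len), ref_len),
    )
    bp = brevity_penalty(hyp_len, best_ref_len)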
    
    return bp * math.exp(log_avg_precision)


# Test
hyp = "the cat sat on the mat"
refs = ["the cat is sitting on the mat", "a cat sat on the mat"]

score = bleu_score(hyp, refs)
print(f"BLEU-4: {score:.4f}")
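The function above scores a single sentence. BLEU was defined as a corpus-level metric: matched and total n-gram counts are summed over all sentences before dividing, which is why `clipped_precision` returns a numerator and denominator rather than a ratio. The following is a minimal sketch of that pooling, assuming pre-tokenized input. The name `corpus_bleu_sketch` and the example sentences are illustrative, and this is not a replacement for a tested library.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count every n-gram (as a tuple) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu_sketch(hypotheses, references, max_n=4):
    """
    hypotheses: list of token lists.
    references: for each hypothesis, a list of reference token lists.
    """
    matched = [0] * max_n
    total = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, refs in zip(hypotheses, references):
        hyp_len += len(hyp)
        # Closest reference length for this sentence; ties go to the shorter one
        ref_len += min((len(r) for r in refs),
                       key=lambda length: (abs(length - len(hyp)), length))
        for n in range(1, max_n + 1):
            hyp_counts = ngram_counts(hyp, n)
            max_ref = Counter()
            for ref in refs:
                max_ref |= ngram_counts(ref, n)  # union keeps the max count per n-gram
            matched[n - 1] += sum((hyp_counts & max_ref).values())  # intersection clips
            total[n - 1] += sum(hyp_counts.values())
    if hyp_len == 0 or min(matched) == 0:
        return 0.0
    log_precision = sum(math.log(m / t) for m, t in zip(matched, total)) / max_n
    bp = 1.0 if hyp_len >= ref_len else math.exp(1 - ref_len / hyp_len)
    return bp * math.exp(log_precision)


long_hyp = "the cat sat on the mat".split()
long_ref = "the cat sat on the mat".split()
short_hyp = "she went home".split()
short_ref = "she went back home".split()

# Alone, the short sentence has no matching trigram and no 4-grams at all
short_alone = corpus_bleu_sketch([short_hyp], [[short_ref]])
# Pooled with the longer sentence, its counts still contribute to a nonzero score
corpus_score = corpus_bleu_sketch([long_hyp, short_hyp], [[long_ref], [short_ref]])

print(f"short sentence alone: {short_alone:.4f}")
print(f"pooled corpus score:  {corpus_score:.4f}")
```

Pooling is the reason corpus-level BLEU is not the average of sentence-level scores, and the reason sentence-level BLEU is noisy on short outputs.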

Using sacrebleu (the Standard Library)

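For reproducible, publication-quality BLEU, use sacrebleu. It takes raw, detokenized text and applies its own standard tokenization (13a by default), so scores do not depend on each author's preprocessing, and it can report a signature string that records the exact settings used.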

Python

# pip install sacrebleu
import sacrebleu

def compute_sacrebleu(
    hypotheses: list[str],
    references: list[list[str]],
) -> dict:
    """
    Compute BLEU with sacrebleu (standard tokenization).
    
    Args:
        hypotheses: List of generated texts
        references: List of reference lists. Each inner list is all references
                   for the corresponding hypothesis.
    
    Returns:
        BLEU score dict
    """
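    # sacrebleu expects references transposed: the outer list is over
    # reference streams, the inner list is over examples. zip() would silently
    # drop references if examples had different counts, so check first.
    if len({len(refs) for refs in references}) > 1:
        raise ValueError("every hypothesis needs the same number of references")
    refs_transposed = [list(stream) for stream in zip(*references)]
    
    bleu = sacrebleu.corpus_bleu(hypotheses, refs_transposed)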
    
    return {
        "bleu": round(bleu.score, 2),  # 0-100 scale
        "bleu_normalized": round(bleu.score / 100, 4),  # 0-1 scale
        "bp": round(bleu.bp, 4),
        "precisions": [round(p, 2) for p in bleu.precisions],  # 1-gram through 4-gram precisions, 0-100 scale
        "n_examples": len(hypotheses),
    }


# Example: machine translation evaluation
hypotheses = [
    "The cat sat on the mat.",
    "She went to the market to buy vegetables.",
    "The hospital is located on Main Street.",
]

references_per_example = [
    ["The cat is sitting on the mat.", "A cat sat upon the mat."],
    ["She went to the market to purchase vegetables.", "She visited the market for vegetables."],
    ["The hospital is on Main Street.", "Main Street is where the hospital is."],
]

result = compute_sacrebleu(hypotheses, references_per_example)
print(result)
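To illustrate why consistent tokenization matters, the sketch below scores the same sentence pair under two tokenizers. The `split_punct` function is a hypothetical toy tokenizer written for this example; it is not sacrebleu's tokenizer. It only shows how a trailing period can change the match counts.

```python
import re
from collections import Counter


def unigram_precision(hyp_tokens, ref_tokens):
    """Clipped unigram precision against a single reference."""
    hyp, ref = Counter(hyp_tokens), Counter(ref_tokens)
    return sum((hyp & ref).values()) / sum(hyp.values())


def split_punct(text):
    """Toy tokenizer (an assumption for this demo): lowercase, split off punctuation."""
    return re.findall(r"\w+|[^\w\s]", text.lower())


hypothesis = "The cat sat on the mat"
reference = "The cat sat on the mat."

# Whitespace splitting keeps "mat." as one token, so "mat" fails to match
whitespace_p1 = unigram_precision(hypothesis.lower().split(), reference.lower().split())
# Separating punctuation lets every hypothesis word match
punct_p1 = unigram_precision(split_punct(hypothesis), split_punct(reference))

print(f"whitespace tokens:  {whitespace_p1:.3f}")  # 0.833
print(f"punctuation split:  {punct_p1:.3f}")       # 1.000
```

Two papers that tokenize differently can therefore report different BLEU for identical system output, which is the problem a fixed, documented tokenizer avoids.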

BLEU Variants

| Variant | N-gram Order | Weights | Use |
|---------|--------------|---------|-----|
| BLEU-1 | Unigram only | [1.0] | Vocabulary coverage |
| BLEU-2 | 1+2-gram | [0.5, 0.5] | Short phrase matching |
| BLEU-4 | 1-4-gram | [0.25 each] | Standard for translation |
| chrF | Character n-gram | n/a | Better for morphologically rich languages |
| TER | Edit distance | n/a | Translation edit rate |

Python

# chrF: character-level F-score, often better than word-level BLEU
chrf = sacrebleu.corpus_chrf(
    hypotheses,
    [list(r) for r in zip(*references_per_example)],
)
print(f"chrF score: {chrf.score:.2f}")

# TER (Translation Edit Rate): lower is better
ter = sacrebleu.corpus_ter(
    hypotheses,
    [list(r) for r in zip(*references_per_example)],
)
print(f"TER score: {ter.score:.2f}")
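The BLEU-1, BLEU-2, and BLEU-4 rows of the table differ only in the maximum n-gram order and the matching uniform weights. As a rough illustration, the sketch below uses a simplified single-reference scorer written for this example (the name `bleu_n` is an assumption, and this is not sacrebleu's implementation) to show how the score changes with the order.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count every n-gram (as a tuple) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu_n(hypothesis, reference, max_n):
    """Single-reference BLEU with uniform weights over orders 1..max_n, no smoothing."""
    log_sum = 0.0
    for n in range(1, max_n + 1):
        hyp, ref = ngram_counts(hypothesis, n), ngram_counts(reference, n)
        matched = sum((hyp & ref).values())  # clipped match count
        if matched == 0:
            return 0.0
        log_sum += math.log(matched / sum(hyp.values())) / max_n
    c, r = len(hypothesis), len(reference)
    bp = 1.0 if c >= r else math.exp(1 - r / c)
    return bp * math.exp(log_sum)


hyp = "the cat sat on the mat".split()
ref = "the cat is sitting on the mat".split()

scores = {n: bleu_n(hyp, ref, n) for n in (1, 2, 4)}
for n, value in scores.items():
    print(f"BLEU-{n}: {value:.4f}")
# BLEU-1: 0.7054
# BLEU-2: 0.5986
# BLEU-4: 0.0000
```

BLEU-4 is zero here because the hypothesis shares no 4-gram with the reference, which is common for single short sentences and one reason lower orders are sometimes reported alongside it.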

Weaknesses of BLEU

1. Penalizes Valid Paraphrases

Python

reference = "The patient should take the medication with food."

# Valid paraphrase — different wording, same meaning
paraphrase = "The medication should be taken at mealtime."

# Near-copy — slightly reworded
near_copy = "The patient should take medication with food."

score_paraphrase = bleu_score(paraphrase, [reference])
score_near_copy = bleu_score(near_copy, [reference])

print(f"Valid paraphrase BLEU: {score_paraphrase:.4f}")  # 0.0000: no trigram matches, so the score collapses
print(f"Near-copy BLEU: {score_near_copy:.4f}")  # Much higher (about 0.52)
# BLEU rewards near-copies over valid paraphrases

2. Ignores Semantic Correctness

Python

# BLEU does not detect factual errors
correct = "Ibuprofen reduces inflammation by inhibiting COX enzymes."
wrong_but_fluent = "Ibuprofen reduces inflammation by inhibiting ATP enzymes."

# High word overlap, but medically wrong
score = bleu_score(wrong_but_fluent, [correct])
print(f"BLEU for factually wrong output: {score:.4f}")
# Scores about 0.64: one swapped word out of seven barely dents the overlap,
# so BLEU cannot flag the factual error

3. No Document-Level Coherence

BLEU computes n-gram statistics locally. It cannot evaluate whether a long document is logically coherent.
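A small sketch can make this concrete. It reverses the sentence order of a three-sentence text (invented here for illustration) and measures clipped unigram and bigram precision against the original order. The helper is a simplified single-reference version written for this demo.

```python
from collections import Counter


def clipped_precision(hyp_tokens, ref_tokens, n):
    """Clipped n-gram precision against a single reference."""
    def counts(tokens):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    hyp, ref = counts(hyp_tokens), counts(ref_tokens)
    return sum((hyp & ref).values()) / sum(hyp.values())


sentences = [
    "the patient was admitted with chest pain .",
    "tests confirmed a mild heart attack .",
    "she was discharged after three days .",
]
reference = " ".join(sentences).split()
# Same sentences in reverse order: the narrative no longer makes sense
shuffled = " ".join(reversed(sentences)).split()

p1 = clipped_precision(shuffled, reference, 1)
p2 = clipped_precision(shuffled, reference, 2)
print(f"unigram precision: {p1:.3f}")  # 1.000
print(f"bigram precision:  {p2:.3f}")  # 0.952
```

Only the few n-grams that span sentence boundaries change, so the reordered text scores almost as well as the original even though the sequence of events is now wrong.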

When BLEU Is Appropriate

Good fit:

Machine translation (its original purpose)
Constrained paraphrase tasks where outputs should stay close to the source
Academic comparison of translation models (community standard)

Poor fit:

Open-ended question answering
Chatbot evaluation
Summarization where diverse phrasing is acceptable
Any task where there is no single reference answer

Python

# Task suitability lookup (a hand-written table, not a learned classifier)
BLEU_SUITABLE_TASKS = {
    "machine_translation": True,
    "constrained_paraphrase": True,
    "fill_in_the_blank": True,
    "open_qa": False,
    "conversation": False,
    "summarization": False,  # Use ROUGE instead
    "code_generation": False,  # Use pass@k instead
    "medical_qa": False,  # Use human eval or LLM judge
}

def is_bleu_appropriate(task: str) -> bool:
    suitable = BLEU_SUITABLE_TASKS.get(task, False)
    if not suitable:
        print(f"Warning: BLEU is not appropriate for '{task}' tasks.")
        print("Consider ROUGE-L, BERTScore, or LLM-as-judge instead.")
    return suitable

Full Evaluation Pipeline Example

Python

from dataclasses import dataclass

@dataclass
class TranslationEvalResult:
    model_name: str
    n_examples: int
    bleu4: float
    chrf: float
    mean_length_ratio: float


def eval_translation_model(
    model_name: str,
    source_texts: list[str],
    reference_translations: list[list[str]],
    translate_fn,  # (str) -> str
) -> TranslationEvalResult:
    hypotheses = [translate_fn(src) for src in source_texts]
    
    # Transpose to sacrebleu's layout (one list per reference stream);
    # assumes every example has the same number of references
    refs_transposed = [list(r) for r in zip(*reference_translations)]
    
    bleu = sacrebleu.corpus_bleu(hypotheses, refs_transposed)
    chrf = sacrebleu.corpus_chrf(hypotheses, refs_transposed)
    
    # Length ratio against the first reference of each example; examples with
    # an empty first reference are skipped to avoid dividing by zero
    ratios = [
        len(hyp.split()) / len(refs[0].split())
        for hyp, refs in zip(hypotheses, reference_translations)
        if refs[0].split()
    ]
    mean_length_ratio = sum(ratios) / len(ratios) if ratios else 0.0
    
    return TranslationEvalResult(
        model_name=model_name,
        n_examples=len(hypotheses),
        bleu4=round(bleu.score, 2),
        chrf=round(chrf.score, 2),
        mean_length_ratio=round(mean_length_ratio, 3),
    )

Key Takeaways

BLEU measures n-gram overlap between hypothesis and one or more reference translations.
BLEU-4 is standard: combines unigram through 4-gram precision with a brevity penalty.
Use sacrebleu for reproducible, publication-comparable scores.
BLEU penalizes valid paraphrases and cannot detect factual errors.
Appropriate for machine translation and constrained paraphrase tasks.
Not appropriate for chatbots, open QA, summarization, or medical/legal output.

What's Next

In eval-rouge.mdx, you will learn about ROUGE — the recall-oriented complement to BLEU, designed for summarization evaluation.
