BLEU and ROUGE for Generation Quality — LLMs Deep Dive | Learnixo

BLEU: Bilingual Evaluation Understudy

BLEU measures how much of the model's output matches the reference text using n-gram precision:

BLEU = BP × exp(Σ wₙ log pₙ)

where:
  pₙ = clipped n-gram precision (fraction of hypothesis n-grams found in the reference, each counted at most as often as it occurs there)
  wₙ = weight for n-gram order n (typically 1/N when orders 1 to N are used)
  BP = brevity penalty (penalises outputs shorter than the reference)

BLEU-4 (most common):
  Uses 1-gram, 2-gram, 3-gram, 4-gram precision
  Weights: w₁=w₂=w₃=w₄=0.25
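
To see what the brevity penalty does in isolation, here is a minimal sketch of BP as it is usually defined: 1 when the hypothesis is longer than the reference, exp(1 - r/c) otherwise, where c and r are the hypothesis and reference lengths. The function name `brevity_penalty` and the choice to return 0 for an empty hypothesis are assumptions made for illustration.

```python
import math

def brevity_penalty(hyp_len: int, ref_len: int) -> float:
    """BP = 1 if hyp_len > ref_len, else exp(1 - ref_len / hyp_len)."""
    if hyp_len == 0:
        return 0.0  # assumption: an empty hypothesis earns no credit
    if hyp_len > ref_len:
        return 1.0
    return math.exp(1 - ref_len / hyp_len)

# Reference length fixed at 10 tokens; shrink the hypothesis
for hyp_len in (10, 8, 5):
    print(hyp_len, round(brevity_penalty(hyp_len, 10), 3))
# 10 1.0
# 8 0.779
# 5 0.368
```

A hypothesis half the length of the reference keeps only about 37% of its precision score, which is what stops a model from gaming precision with very short outputs.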

BLEU Computation

Python

from collections import Counter
import math

def count_ngrams(tokens, n):
    return Counter(tuple(tokens[i:i+n]) for i in range(len(tokens)-n+1))

def bleu_score(hypothesis: list, reference: list, max_n: int = 4) -> float:
    bp = min(1.0, math.exp(1 - len(reference)/max(len(hypothesis), 1)))

    log_bleu = 0.0
    for n in range(1, max_n + 1):
        hyp_ngrams = count_ngrams(hypothesis, n)
        ref_ngrams = count_ngrams(reference, n)

        # Clipped precision: an n-gram earns credit at most as many times as it occurs in the reference
        clipped = sum(min(hyp_ngrams[g], ref_ngrams[g]) for g in hyp_ngrams)
        total = max(sum(hyp_ngrams.values()), 1)
        precision = clipped / total

        if precision > 0:
            log_bleu += (1/max_n) * math.log(precision)
        else:
            return 0.0

    return bp * math.exp(log_bleu)

# Example:
hyp = ["the", "cat", "sat", "on", "the", "mat"]
ref = ["the", "cat", "is", "on", "the", "mat"]
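print(f"BLEU-4: {bleu_score(hyp, ref):.3f}")  # 0.000: no 4-gram matches, so unsmoothed BLEU-4 is 0
print(f"BLEU-2: {bleu_score(hyp, ref, max_n=2):.3f}")  # 0.707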
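
Because a single zero precision forces unsmoothed BLEU to 0, it helps to look at the per-order precisions for this sentence pair. The sketch below recomputes the clipped counts on their own; the helper names `ngram_counts` and `clipped_precisions` are illustrative, not part of any library.

```python
from collections import Counter

def ngram_counts(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def clipped_precisions(hypothesis, reference, max_n=4):
    """Return (clipped matches, total hypothesis n-grams) for n = 1..max_n."""
    results = []
    for n in range(1, max_n + 1):
        hyp = ngram_counts(hypothesis, n)
        ref = ngram_counts(reference, n)
        # Each hypothesis n-gram is credited at most as often as it occurs in the reference
        matched = sum(min(count, ref[gram]) for gram, count in hyp.items())
        results.append((matched, sum(hyp.values())))
    return results

hyp = "the cat sat on the mat".split()
ref = "the cat is on the mat".split()
for n, (matched, total) in enumerate(clipped_precisions(hyp, ref), start=1):
    print(f"p{n} = {matched}/{total}")
# p1 = 5/6
# p2 = 3/5
# p3 = 1/4
# p4 = 0/3
```

The single substituted word ("sat" vs "is") breaks every 4-gram in a six-token sentence, so sentence-level BLEU-4 on short texts is usually computed with some form of smoothing or reported at corpus level.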

ROUGE: Recall-Oriented Understudy for Gisting Evaluation

ROUGE measures recall (how much of the reference appears in the output) and is used primarily for summarisation:

ROUGE-N = (matched n-grams) / (n-grams in reference)

ROUGE-1: unigram recall
ROUGE-2: bigram recall
ROUGE-L: longest common subsequence F1

Common reporting format:
  ROUGE-1, ROUGE-2, ROUGE-L as F1 scores (combining precision and recall)
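
ROUGE-L is the one variant above that is not a plain n-gram count, so a small sketch may help. It computes the longest common subsequence (LCS) with dynamic programming and derives precision, recall and a balanced F1 from it. The function names are illustrative, and the balanced F1 is an assumption: some ROUGE implementations weight recall more heavily.

```python
def lcs_length(a: list, b: list) -> int:
    """Length of the longest common subsequence via dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            # Extend the diagonal on a match, otherwise carry the best neighbour
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def rouge_l(hypothesis: list, reference: list) -> dict:
    lcs = lcs_length(hypothesis, reference)
    precision = lcs / max(len(hypothesis), 1)
    recall = lcs / max(len(reference), 1)
    f1 = 2 * precision * recall / (precision + recall) if lcs else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}

hyp = "the cat sat on the mat".split()
ref = "the cat is on the mat".split()
print(lcs_length(hyp, ref))  # 5 ("the cat on the mat")
print(rouge_l(hyp, ref))
```

Here the LCS covers 5 of the 6 tokens on each side, so precision, recall and F1 all equal 5/6. Unlike ROUGE-2, the matched words do not need to be adjacent, only in the same order.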

ROUGE Implementation

Python

def rouge_n(hypothesis: list, reference: list, n: int) -> dict:
    hyp_ngrams = count_ngrams(hypothesis, n)
    ref_ngrams = count_ngrams(reference, n)

    matched = sum(min(hyp_ngrams[g], ref_ngrams[g]) for g in hyp_ngrams)
    precision = matched / max(sum(hyp_ngrams.values()), 1)
    recall    = matched / max(sum(ref_ngrams.values()), 1)
    f1 = (2 * precision * recall) / max(precision + recall, 1e-9)

    return {"precision": precision, "recall": recall, "f1": f1}

# Hugging Face implementation (needs the `evaluate` and `rouge_score` packages):
from evaluate import load
rouge = load("rouge")
result = rouge.compute(
    predictions=["The patient takes Warfarin."],
    references=["The patient is prescribed Warfarin."]
)
print(result)
# Scores are F1 values, approximately:
# {'rouge1': 0.667, 'rouge2': 0.286, 'rougeL': 0.667, 'rougeLsum': 0.667}
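
As a sanity check on the library call, the same sentence pair can be scored by hand. The sketch below assumes a simple tokeniser (lowercase, alphanumeric runs only, no stemming), which is close to but not guaranteed identical to what a given ROUGE package does; `tokenize` and `ngram_f1` are illustrative names.

```python
import re
from collections import Counter

def tokenize(text: str) -> list:
    # Assumption: lowercase and keep alphanumeric runs only (punctuation dropped)
    return re.findall(r"[a-z0-9]+", text.lower())

def ngram_f1(prediction: str, reference: str, n: int) -> float:
    def grams(tokens):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    pred, ref = grams(tokenize(prediction)), grams(tokenize(reference))
    matched = sum((pred & ref).values())  # multiset intersection = clipped matches
    if matched == 0:
        return 0.0
    precision = matched / sum(pred.values())
    recall = matched / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

pred = "The patient takes Warfarin."
ref = "The patient is prescribed Warfarin."
print(round(ngram_f1(pred, ref, 1), 3))  # 0.667 (P = 3/4, R = 3/5)
print(round(ngram_f1(pred, ref, 2), 3))  # 0.286 (P = 1/3, R = 1/4)
```

Three of the four prediction unigrams appear in the five-token reference, and only the bigram "the patient" is shared, which is where the two F1 values come from.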

Typical Score Ranges

Machine translation (BLEU on the 0-100 scale, i.e. 100 × the 0-1 value returned by bleu_score):
  < 10:    almost unusable
  10-19:   poor
  20-29:   understandable
  30-40:   good
  40-50:   high quality
  > 50:    approaching human

Summarisation (ROUGE-1 F1, also on the 0-100 scale):
  Typical neural models: 40-55
  Human summaries vs each other: ~60-70

These ranges are domain- and task-specific. Do not compare BLEU
across different datasets, language pairs, or tokenisations.

Why BLEU and ROUGE Fall Short

1. Reference dependence:
   "The medication reduces clotting." vs "Warfarin prevents blood clots."
   BLEU = 0 (no shared n-grams) even though the two sentences convey nearly the same meaning

2. No semantic understanding:
   Rewording with synonyms scores near 0 even if the meaning is preserved.
   "The cat sat on the mat" vs "The feline rested on the rug" → only "The", "on", "the" and the bigram "on the" overlap, so unsmoothed BLEU-4 is 0

3. Don't measure factual accuracy:
   A fluent hallucination scores high if it matches reference wording.

4. Require a reference:
   For open-ended generation, there may be no reference.
   When several answers are valid, any answer worded differently from the single reference is penalised.

5. Length artefacts:
   BLEU's brevity penalty penalises short outputs; ROUGE recall rewards long ones, and only the precision term in ROUGE F1 counters that.
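
Points 2 and 3 can be made concrete with a toy comparison. The sketch below scores two candidates against one reference using unigram F1 (a simplified stand-in for ROUGE-1): a correct paraphrase and a fluent answer that drops a negation. The sentences are invented for illustration only, and `unigram_f1` is a hypothetical helper.

```python
from collections import Counter

def unigram_f1(hypothesis: list, reference: list) -> float:
    overlap = sum((Counter(hypothesis) & Counter(reference)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(hypothesis)
    recall = overlap / len(reference)
    return 2 * precision * recall / (precision + recall)

# Invented example sentences, not medical guidance
reference  = "the patient should not take warfarin with aspirin".split()
paraphrase = "avoid combining warfarin and aspirin".split()
wrong      = "the patient should take warfarin with aspirin".split()

print(round(unigram_f1(paraphrase, reference), 3))  # 0.308
print(round(unigram_f1(wrong, reference), 3))       # 0.933
```

The answer that reverses the meaning scores three times higher than the faithful paraphrase, because overlap metrics reward shared wording and have no notion of which word carries the meaning.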

Modern evaluation uses BERTScore, LLM-as-judge, or task-specific metrics.

BERTScore: Semantic Alternative

Python

from evaluate import load
bertscore = load("bertscore")

result = bertscore.compute(
    predictions=["The feline rested on the rug"],
    references=["The cat sat on the mat"],
    lang="en"
)
print(result)
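# Illustrative output (exact values depend on the underlying model; the result also includes a 'hashcode' entry):
# {'precision': [0.93], 'recall': [0.92], 'f1': [0.92]}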
# Much higher than BLEU-4 ≈ 0.0 for this paraphrase

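BERTScore computes the cosine similarity between the contextual embeddings of every prediction token and every reference token, then matches greedily: each token is paired with its most similar counterpart on the other side. Averaging the best matches over prediction tokens gives precision, averaging over reference tokens gives recall, and F1 is their harmonic mean.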
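
The greedy matching step can be sketched without any model. The example below uses hypothetical two-dimensional vectors in place of real contextual embeddings and omits refinements such as IDF weighting and baseline rescaling; `greedy_match_scores` is an illustrative name, not a function of the `bert_score` package.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def greedy_match_scores(pred_vecs, ref_vecs):
    """BERTScore-style precision, recall and F1 from per-token embeddings."""
    sim = [[cosine(p, r) for r in ref_vecs] for p in pred_vecs]
    # Precision: each prediction token takes its best-matching reference token
    precision = sum(max(row) for row in sim) / len(pred_vecs)
    # Recall: each reference token takes its best-matching prediction token
    recall = sum(max(col) for col in zip(*sim)) / len(ref_vecs)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical 2-d "embeddings"; real ones come from a model such as BERT
pred_vecs = [[1.0, 0.0], [0.0, 1.0]]
ref_vecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
precision, recall, f1 = greedy_match_scores(pred_vecs, ref_vecs)
print(round(precision, 3), round(recall, 3), round(f1, 3))
```

Both prediction tokens find an exact match, so precision is 1.0, while the third reference token only finds a partial match (cosine of about 0.707), which pulls recall below 1. A token pair does not need to be identical to earn credit, which is the property that lets synonyms score well.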

Interview Answer

"BLEU measures n-gram precision of the model's output against a reference, with a brevity penalty for short outputs. ROUGE measures recall — how much of the reference appears in the output — and is standard for summarisation. Both require a reference, don't capture semantic similarity (a perfect paraphrase scores near 0), and don't measure factual accuracy. Modern LLM evaluation has moved toward BERTScore (embedding-based matching), ROUGE-L for extractive tasks, and LLM-as-judge for open-ended generation where no single reference exists."