BLEU and ROUGE

How BLEU and ROUGE scores work, what they measure, their formulas, implementation, and why they fall short for evaluating modern LLM outputs.

Asma Hafeez Khan | May 16, 2026 | 4 min read
Tags: LLMs, Evaluation, BLEU, ROUGE, Interview

BLEU: Bilingual Evaluation Understudy

BLEU measures how much of the model's output matches the reference text using n-gram precision:

BLEU = BP × exp(Σₙ wₙ log pₙ)

where:
  pₙ = n-gram precision (fraction of n-grams in hypothesis found in reference)
  wₙ = weight for n-gram order (typically 1/N for N-gram BLEU)
  BP = brevity penalty (penalises outputs shorter than reference)

BLEU-4 (most common):
  Uses 1-gram, 2-gram, 3-gram, 4-gram precision
  Weights: w₁ = w₂ = w₃ = w₄ = 0.25
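To make the formula concrete, here is a minimal sketch that plugs numbers into it. The precisions are made-up illustrative values, and the helper names `brevity_penalty` and `combine` are hypothetical, not taken from any library.

```python
import math

def brevity_penalty(hyp_len: int, ref_len: int) -> float:
    # 1.0 when the hypothesis is at least as long as the reference, else exp(1 - r/c)
    if hyp_len == 0:
        return 0.0  # assumption: treat an empty hypothesis as scoring zero
    return min(1.0, math.exp(1 - ref_len / hyp_len))

def combine(precisions: list, hyp_len: int, ref_len: int) -> float:
    # Uniform weights w_n = 1/N; any zero precision drives the geometric mean to 0
    if min(precisions) == 0:
        return 0.0
    log_mean = sum(math.log(p) for p in precisions) / len(precisions)
    return brevity_penalty(hyp_len, ref_len) * math.exp(log_mean)

illustrative = [5/6, 3/5, 1/4]  # made-up p1, p2, p3
same_length = combine(illustrative, hyp_len=6, ref_len=6)
shorter = combine(illustrative, hyp_len=4, ref_len=6)
print(f"{same_length:.3f}")  # 0.500 (BP = 1, geometric mean of the precisions)
print(f"{shorter:.3f}")      # 0.303 (same precisions, scaled by BP = exp(-0.5))
```

The geometric mean is why a single zero precision collapses the whole score, and the second call shows the brevity penalty scaling an otherwise identical result.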

BLEU Computation

Python
from collections import Counter
import math

def count_ngrams(tokens, n):
    return Counter(tuple(tokens[i:i+n]) for i in range(len(tokens)-n+1))

def bleu_score(hypothesis: list, reference: list, max_n: int = 4) -> float:
    bp = min(1.0, math.exp(1 - len(reference)/max(len(hypothesis), 1)))

    log_bleu = 0.0
    for n in range(1, max_n + 1):
        hyp_ngrams = count_ngrams(hypothesis, n)
        ref_ngrams = count_ngrams(reference, n)

        # Clipped precision: cap each n-gram's count at its count in the reference,
        # so repeating a matching n-gram earns no extra credit
        clipped = sum(min(hyp_ngrams[g], ref_ngrams[g]) for g in hyp_ngrams)
        total = max(sum(hyp_ngrams.values()), 1)
        precision = clipped / total

        if precision > 0:
            log_bleu += (1/max_n) * math.log(precision)
        else:
            return 0.0

    return bp * math.exp(log_bleu)

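# Example:
hyp = ["the", "cat", "sat", "on", "the", "mat"]
ref = ["the", "cat", "is", "on", "the", "mat"]
print(f"BLEU-4: {bleu_score(hyp, ref):.3f}")  # 0.000: no 4-gram of hyp occurs in ref
print(f"BLEU-2: {bleu_score(hyp, ref, max_n=2):.3f}")  # 0.707 (p1 = 5/6, p2 = 3/5)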

ROUGE: Recall-Oriented Understudy for Gisting Evaluation

ROUGE measures recall (how much of the reference appears in the output), primarily for summarisation:

ROUGE-N = (matched n-grams) / (n-grams in reference)

ROUGE-1: unigram recall
ROUGE-2: bigram recall
ROUGE-L: longest common subsequence F1

Common reporting format:
  ROUGE-1, ROUGE-2, ROUGE-L as F1 scores (combining precision and recall)

ROUGE Implementation

Python
def rouge_n(hypothesis: list, reference: list, n: int) -> dict:
    hyp_ngrams = count_ngrams(hypothesis, n)
    ref_ngrams = count_ngrams(reference, n)

    matched = sum(min(hyp_ngrams[g], ref_ngrams[g]) for g in hyp_ngrams)
    precision = matched / max(sum(hyp_ngrams.values()), 1)
    recall    = matched / max(sum(ref_ngrams.values()), 1)
    f1 = (2 * precision * recall) / max(precision + recall, 1e-9)

    return {"precision": precision, "recall": recall, "f1": f1}

# HuggingFace implementation:
from evaluate import load
rouge = load("rouge")
result = rouge.compute(
    predictions=["The patient takes Warfarin."],
    references=["The patient is prescribed Warfarin."]
)
print(result)
# Approximately (rounded):
# {'rouge1': 0.667, 'rouge2': 0.286, 'rougeL': 0.667, 'rougeLsum': 0.667}
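ROUGE-L, listed earlier as longest-common-subsequence F1, is not covered by `rouge_n`. The following is a minimal sketch of how it could be computed on pre-tokenised input. The names `lcs_length` and `rouge_l` are hypothetical, and real toolkits add tokenisation and normalisation steps that are skipped here.

```python
def lcs_length(a: list, b: list) -> int:
    # Row-by-row dynamic programme: prev[j] is the LCS length of the
    # tokens of `a` seen so far and the first j tokens of `b`
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def rouge_l(hypothesis: list, reference: list) -> dict:
    lcs = lcs_length(hypothesis, reference)
    precision = lcs / max(len(hypothesis), 1)
    recall = lcs / max(len(reference), 1)
    f1 = (2 * precision * recall) / max(precision + recall, 1e-9)
    return {"precision": precision, "recall": recall, "f1": f1}

hyp_l = ["the", "patient", "takes", "warfarin"]
ref_l = ["the", "patient", "is", "prescribed", "warfarin"]
scores_l = rouge_l(hyp_l, ref_l)
print(scores_l)  # precision 0.75, recall 0.6, f1 about 0.667
```

Unlike ROUGE-2, the subsequence does not need to be contiguous: "the patient ... warfarin" counts as a match of length 3 even though the words in between differ.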

Typical Score Ranges

Machine translation (BLEU):
  < 10:    almost unusable
  10-19:   poor
  20-29:   understandable
  30-40:   good
  40-50:   high quality
  > 50:    approaching human

Summarisation (ROUGE-1 F1):
  Typical neural models: 40-55
  Human summaries vs each other: ~60-70

These ranges use the 0-100 scale (the raw 0-1 score multiplied by 100) and are
domain- and task-specific: do not compare BLEU across different datasets or
language pairs.

Why BLEU and ROUGE Fall Short

1. Reference dependence:
   "The medication reduces clotting." vs "Warfarin prevents blood clots."
   BLEU = 0 (no shared n-grams) even though the two sentences say nearly the same thing

2. No semantic understanding:
   Rewording with synonyms scores near 0 even if the meaning is preserved.
   "The cat sat on the mat" vs "The feline rested on the rug" → low BLEU

3. Don't measure factual accuracy:
   A fluent hallucination scores high if it matches reference wording.

4. Require a reference:
   For open-ended generation, there may be no reference.
   A valid answer worded differently from the reference is penalised as if it were wrong.

5. Length artefacts:
   BLEU's brevity penalty punishes short outputs; ROUGE recall rewards long ones,
   and only the precision term in ROUGE F1 pushes back against padding.

Modern evaluation uses BERTScore, LLM-as-judge, or task-specific metrics.
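Points 2 and 3 can be illustrated with a small, self-contained sketch. It uses a simplified unigram-overlap F1 (whitespace tokenisation, hypothetical function name `unigram_f1`) as a stand-in for ROUGE-1, and the example sentences are invented for illustration.

```python
from collections import Counter

def unigram_f1(hypothesis: str, reference: str) -> float:
    # Whitespace tokenisation is a simplification; real toolkits normalise more carefully
    hyp = Counter(hypothesis.lower().split())
    ref = Counter(reference.lower().split())
    matched = sum((hyp & ref).values())  # multiset intersection = clipped matches
    if matched == 0:
        return 0.0
    precision = matched / sum(hyp.values())
    recall = matched / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

reference = "the patient takes 5 mg of warfarin daily"
wrong_dose = "the patient takes 50 mg of warfarin daily"                 # factual error
paraphrase = "each day this person receives five milligrams of warfarin"  # same meaning

f1_wrong = unigram_f1(wrong_dose, reference)
f1_para = unigram_f1(paraphrase, reference)
print(f"wrong dose: {f1_wrong:.3f}")  # 0.875
print(f"paraphrase: {f1_para:.3f}")   # 0.235
```

The tenfold dosing error outscores the faithful paraphrase because the metric only counts shared surface tokens.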

BERTScore: Semantic Alternative

Python
from evaluate import load
bertscore = load("bertscore")

result = bertscore.compute(
    predictions=["The feline rested on the rug"],
    references=["The cat sat on the mat"],
    lang="en"
)
print(result)
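# e.g. {'precision': [0.93], 'recall': [0.92], 'f1': [0.92]} plus a 'hashcode' entry
# (illustrative values; exact numbers depend on the underlying model and library version)
# Much higher than BLEU-4 ≈ 0.0 for this paraphrase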

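BERTScore computes the cosine similarity between the contextual embeddings of every token in the prediction and every token in the reference. Each token is then greedily matched to its most similar counterpart: averaging these maxima over the prediction tokens gives precision, averaging them over the reference tokens gives recall, and F1 is their harmonic mean.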
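The greedy matching step can be sketched without any model at all. The example below uses hand-picked 2-D vectors as hypothetical stand-ins for contextual embeddings, and the function name `greedy_match_scores` is an assumption for illustration. It is not the BERTScore library's code and leaves out refinements such as optional IDF weighting.

```python
import math

def cosine(u, v):
    # Assumes both vectors are non-zero
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def greedy_match_scores(pred_vecs, ref_vecs):
    # Pairwise similarity matrix: rows are prediction tokens, columns are reference tokens
    sim = [[cosine(p, r) for r in ref_vecs] for p in pred_vecs]
    # Precision: best reference match for each prediction token, averaged
    precision = sum(max(row) for row in sim) / len(pred_vecs)
    # Recall: best prediction match for each reference token, averaged
    recall = sum(
        max(sim[i][j] for i in range(len(pred_vecs)))
        for j in range(len(ref_vecs))
    ) / len(ref_vecs)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical embeddings chosen so that synonyms point in similar directions
toy_pred = [[1.0, 0.1], [0.1, 1.0]]  # think "feline", "rug"
toy_ref = [[1.0, 0.0], [0.0, 1.0]]   # think "cat", "mat"
p, r, f = greedy_match_scores(toy_pred, toy_ref)
print(f"P={p:.3f} R={r:.3f} F1={f:.3f}")  # P=0.995 R=0.995 F1=0.995
```

No tokens are shared between the two toy sentences, yet the score is high because matching happens in embedding space, not on surface strings.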


Interview Answer

"BLEU measures n-gram precision of the model's output against a reference, with a brevity penalty for short outputs. ROUGE measures recall — how much of the reference appears in the output — and is standard for summarisation. Both require a reference, don't capture semantic similarity (a perfect paraphrase scores near 0), and don't measure factual accuracy. Modern LLM evaluation has moved toward BERTScore (embedding-based matching), ROUGE-L for extractive tasks, and LLM-as-judge for open-ended generation where no single reference exists."
