ROUGE Score for Summarization

ROUGE (Recall-Oriented Understudy for Gisting Evaluation) was introduced in 2004 specifically for evaluating automatic summarization. Where BLEU focuses on precision (how much of the hypothesis appears in the reference), ROUGE focuses on recall (how much of the reference appears in the hypothesis).

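For summarization, recall matters: a good summary should cover the key content of the source document. ROUGE approximates this by checking how much of a human-written reference summary the generated summary recovers.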
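
To make the precision/recall distinction concrete, here is a minimal sketch on unigrams. The helper name `unigram_precision_recall`, the example sentences, and the lowercase whitespace tokenization are illustrative assumptions, not part of any library.

```python
from collections import Counter


def unigram_precision_recall(hypothesis: str, reference: str) -> tuple[float, float]:
    """Return (precision, recall) of clipped unigram overlap."""
    hyp = Counter(hypothesis.lower().split())
    ref = Counter(reference.lower().split())
    # Counter intersection keeps the minimum count of each shared word
    overlap = sum((hyp & ref).values())
    precision = overlap / sum(hyp.values()) if hyp else 0.0
    recall = overlap / sum(ref.values()) if ref else 0.0
    return precision, recall


reference = "ibuprofen reduces fever and inflammation"
hypothesis = "ibuprofen reduces fever"

precision, recall = unigram_precision_recall(hypothesis, reference)
print(f"precision={precision:.2f} recall={recall:.2f}")
# Every hypothesis word is in the reference (precision 1.0),
# but only 3 of the 5 reference words are covered (recall 0.6).
```

A short, safe summary can look perfect on precision while leaving out much of the reference, which is the gap a recall-oriented metric is meant to expose.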

The ROUGE Family

There are four main variants:

| Variant | What It Measures |
|---------|-----------------|
| ROUGE-1 | Unigram (word) recall overlap |
| ROUGE-2 | Bigram recall overlap |
| ROUGE-L | Longest common subsequence (LCS) |
| ROUGE-S | Skip-bigram co-occurrence |

In practice, ROUGE-1, ROUGE-2, and ROUGE-L are reported most often. ROUGE-S is less common.

ROUGE-N: N-gram Recall

ROUGE-N computes the overlap of n-grams between the hypothesis (generated summary) and the reference (gold summary), reported as recall, precision, and F1.
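
Written out, the three numbers can be sketched as follows. The notation is introduced here for illustration: $G_n(\cdot)$ is the set of distinct n-grams in a text, and $c_{\text{hyp}}(g)$, $c_{\text{ref}}(g)$ are the number of times n-gram $g$ occurs in the hypothesis and the reference.

```latex
\[
\text{Recall}_n
  = \frac{\sum_{g \in G_n(\text{ref})} \min\bigl(c_{\text{hyp}}(g),\, c_{\text{ref}}(g)\bigr)}
         {\sum_{g \in G_n(\text{ref})} c_{\text{ref}}(g)},
\qquad
\text{Precision}_n
  = \frac{\sum_{g \in G_n(\text{ref})} \min\bigl(c_{\text{hyp}}(g),\, c_{\text{ref}}(g)\bigr)}
         {\sum_{g \in G_n(\text{hyp})} c_{\text{hyp}}(g)}
\]
\[
F_1 = \frac{2 \cdot \text{Precision}_n \cdot \text{Recall}_n}{\text{Precision}_n + \text{Recall}_n}
\]
```

The `min` clips each n-gram's credit, so repeating a matching phrase in the hypothesis cannot earn more matches than the reference contains.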

Python

from collections import Counter

def ngrams(tokens: list[str], n: int) -> list[tuple]:
    return [tuple(tokens[i:i+n]) for i in range(len(tokens) - n + 1)]


def rouge_n(
    hypothesis: str,
    reference: str,
    n: int,
) -> dict[str, float]:
    """
    Compute ROUGE-N between a hypothesis and reference.
    
    Returns recall, precision, and F1.
    """
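    # Simplified tokenization: lowercase, then split on whitespace.
    # Punctuation stays attached, so "tablets." and "tablets" do not match.
    hyp_tokens = hypothesis.lower().split()
    ref_tokens = reference.lower().split()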
    
    hyp_ngrams = Counter(ngrams(hyp_tokens, n))
    ref_ngrams = Counter(ngrams(ref_tokens, n))
    
    # Overlap: for each n-gram, min of hyp count and ref count
    overlap = 0
    for gram, count in ref_ngrams.items():
        overlap += min(count, hyp_ngrams.get(gram, 0))
    
    total_ref = sum(ref_ngrams.values())
    total_hyp = sum(hyp_ngrams.values())
    
    recall = overlap / total_ref if total_ref > 0 else 0.0
    precision = overlap / total_hyp if total_hyp > 0 else 0.0
    f1 = (2 * precision * recall) / (precision + recall) if (precision + recall) > 0 else 0.0
    
    return {
        "recall": round(recall, 4),
        "precision": round(precision, 4),
        "f1": round(f1, 4),
    }


# Example: summarization evaluation
reference = (
    "The drug ibuprofen reduces fever and inflammation by blocking "
    "COX-1 and COX-2 enzymes. It is widely used for pain relief "
    "and comes in 200 mg and 400 mg tablet forms."
)

hypothesis_good = (
    "Ibuprofen blocks COX enzymes to reduce fever and inflammation. "
    "It is available in 200 mg and 400 mg tablets."
)

hypothesis_bad = (
    "Aspirin is a common pain reliever used by many people worldwide."
)

for n in [1, 2]:
    print(f"\nROUGE-{n}:")
    print(f"  Good summary: {rouge_n(hypothesis_good, reference, n)}")
    print(f"  Bad summary:  {rouge_n(hypothesis_bad, reference, n)}")

ROUGE-L: Longest Common Subsequence

ROUGE-L scores the longest common subsequence (LCS) between hypothesis and reference. Unlike ROUGE-N, the matched words do not have to be adjacent, only in the same relative order, so inserted or dropped words are tolerated but reordered phrases are not.
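
In symbols, a sketch of the scores for the sentence-level case, where $|\text{ref}|$ and $|\text{hyp}|$ are token counts (notation introduced here for illustration):

```latex
\[
R_{\text{lcs}} = \frac{\text{LCS}(\text{hyp}, \text{ref})}{|\text{ref}|},
\qquad
P_{\text{lcs}} = \frac{\text{LCS}(\text{hyp}, \text{ref})}{|\text{hyp}|},
\qquad
F_1 = \frac{2 \, P_{\text{lcs}} \, R_{\text{lcs}}}{P_{\text{lcs}} + R_{\text{lcs}}}
\]
```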

Python

def lcs_length(seq_a: list, seq_b: list) -> int:
    """Compute the length of the longest common subsequence."""
    m, n = len(seq_a), len(seq_b)
    
    # dp[i][j] = LCS length for seq_a[:i] and seq_b[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if seq_a[i-1] == seq_b[j-1]:
                dp[i][j] = dp[i-1][j-1] + 1
            else:
                dp[i][j] = max(dp[i-1][j], dp[i][j-1])
    
    return dp[m][n]


def rouge_l(hypothesis: str, reference: str) -> dict[str, float]:
    """Compute ROUGE-L between hypothesis and reference."""
    hyp_tokens = hypothesis.lower().split()
    ref_tokens = reference.lower().split()
    
    lcs = lcs_length(hyp_tokens, ref_tokens)
    
    recall = lcs / len(ref_tokens) if ref_tokens else 0.0
    precision = lcs / len(hyp_tokens) if hyp_tokens else 0.0
    f1 = (2 * precision * recall) / (precision + recall) if (precision + recall) > 0 else 0.0
    
    return {
        "recall": round(recall, 4),
        "precision": round(precision, 4),
        "f1": round(f1, 4),
        "lcs_length": lcs,
    }


# ROUGE-L can still give partial credit where ROUGE-2 finds nothing
reference = "The patient should take medication after meals."
paraphrase = "Medication should be taken following a meal by the patient."

rouge2_result = rouge_n(paraphrase, reference, n=2)
rougeL_result = rouge_l(paraphrase, reference)

print(f"ROUGE-2 F1: {rouge2_result['f1']}")  # 0.0: no bigram is shared
print(f"ROUGE-L F1: {rougeL_result['f1']}")  # Low but nonzero: the reordering leaves only a short LCS
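
The difference between gaps and reordering is easier to see on a contrived pair. The sketch below is self-contained: `lcs_len` is a compact stand-in for the LCS routine (using a rolling row instead of the full table), and the sentences are made up for illustration.

```python
def lcs_len(a: list[str], b: list[str]) -> int:
    """Length of the longest common subsequence, keeping only one table row."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            # Extend the diagonal on a match, otherwise carry the best neighbor
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


reference = "the patient should take medication after meals".split()
# Same words in the same order, with extra words inserted
with_gaps = "the patient should always take the medication after all meals".split()
# Same words, different order
reordered = "after meals medication should take the patient".split()

gap_lcs = lcs_len(with_gaps, reference)
reordered_lcs = lcs_len(reordered, reference)
print(gap_lcs, reordered_lcs)  # 7 2
```

Inserted words leave the full seven-token subsequence intact, while shuffling the same seven words drops the LCS to two.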

ROUGE-S: Skip-Bigram Co-occurrence

ROUGE-S counts skip-bigrams: ordered pairs of words from the same text with any number of words in between. "The cat sat" and "The dog sat" share the skip-bigram ("The", "sat") even though "cat" and "dog" differ.

Python

from collections import Counter

def skip_bigrams(tokens: list[str]) -> Counter:
    """Extract all skip-bigrams (pairs of words, order preserved, gaps allowed)."""
    pairs = Counter()
    for i in range(len(tokens)):
        for j in range(i+1, len(tokens)):
            pairs[(tokens[i], tokens[j])] += 1
    return pairs


def rouge_s(hypothesis: str, reference: str) -> dict[str, float]:
    """Compute ROUGE-S (skip-bigram F1)."""
    hyp_tokens = hypothesis.lower().split()
    ref_tokens = reference.lower().split()
    
    hyp_sbg = skip_bigrams(hyp_tokens)
    ref_sbg = skip_bigrams(ref_tokens)
    
    overlap = 0
    for pair, count in ref_sbg.items():
        overlap += min(count, hyp_sbg.get(pair, 0))
    
    total_ref = sum(ref_sbg.values())
    total_hyp = sum(hyp_sbg.values())
    
    recall = overlap / total_ref if total_ref > 0 else 0.0
    precision = overlap / total_hyp if total_hyp > 0 else 0.0
    f1 = (2 * precision * recall) / (precision + recall) if (precision + recall) > 0 else 0.0
    
    return {"recall": round(recall, 4), "precision": round(precision, 4), "f1": round(f1, 4)}
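
Counting every ordered pair grows quadratically with text length and pairs up words that are far apart. ROUGE-S is therefore often run with a maximum skip distance (ROUGE-S4, for example, allows at most four words in between). The sketch below shows one way to add such a cap; the function name `skip_bigrams_within` and the `max_skip` parameter are hypothetical, not taken from any library.

```python
from collections import Counter
from itertools import combinations
from typing import Optional


def skip_bigrams_within(tokens: list[str], max_skip: Optional[int] = None) -> Counter:
    """Ordered word pairs; max_skip caps how many words may sit between them."""
    pairs = Counter()
    for (i, left), (j, right) in combinations(enumerate(tokens), 2):
        if max_skip is None or j - i - 1 <= max_skip:
            pairs[(left, right)] += 1
    return pairs


a = skip_bigrams_within("the cat sat".split())
b = skip_bigrams_within("the dog sat".split())
shared = a & b  # Counter intersection gives the clipped overlap
print(sorted(shared))  # [('the', 'sat')]

# With max_skip=0 only adjacent pairs remain, i.e. ordinary bigrams
adjacent = skip_bigrams_within("the cat sat".split(), max_skip=0)
print(sorted(adjacent))  # [('cat', 'sat'), ('the', 'cat')]
```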

Using rouge-score Library

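The rouge-score package (from Google Research) provides a widely used implementation of ROUGE-1, ROUGE-2, and ROUGE-L with optional stemming. Note the argument order of scorer.score: the reference (target) comes first, the hypothesis (prediction) second.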

Python

# pip install rouge-score
from rouge_score import rouge_scorer

def compute_all_rouge(
    hypothesis: str,
    reference: str,
    use_stemmer: bool = True,
) -> dict:
    """
    Compute ROUGE-1, ROUGE-2, and ROUGE-L with stemming.
    
    Stemming normalizes words: "running" and "run" become the same token.
    This increases recall for morphologically varied text.
    """
    scorer = rouge_scorer.RougeScorer(
        ["rouge1", "rouge2", "rougeL"],
        use_stemmer=use_stemmer,
    )
    scores = scorer.score(reference, hypothesis)
    
    return {
        "rouge1_recall": round(scores["rouge1"].recall, 4),
        "rouge1_precision": round(scores["rouge1"].precision, 4),
        "rouge1_f1": round(scores["rouge1"].fmeasure, 4),
        "rouge2_f1": round(scores["rouge2"].fmeasure, 4),
        "rougeL_f1": round(scores["rougeL"].fmeasure, 4),
    }


# Batch evaluation across a golden dataset
def batch_rouge_eval(
    hypotheses: list[str],
    references: list[str],
) -> dict:
    import numpy as np
    
    scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
    
    rouge1_f, rouge2_f, rougeL_f = [], [], []
    
    for hyp, ref in zip(hypotheses, references):
        scores = scorer.score(ref, hyp)
        rouge1_f.append(scores["rouge1"].fmeasure)
        rouge2_f.append(scores["rouge2"].fmeasure)
        rougeL_f.append(scores["rougeL"].fmeasure)
    
    return {
        "rouge1_f1": round(float(np.mean(rouge1_f)), 4),
        "rouge2_f1": round(float(np.mean(rouge2_f)), 4),
        "rougeL_f1": round(float(np.mean(rougeL_f)), 4),
        "n_examples": len(hypotheses),
    }

Handling Multiple References

If each example has multiple valid summaries, score the hypothesis against each one and keep the scores from the best-matching reference (selected here by ROUGE-L F1):

Python

def rouge_best_of_references(
    hypothesis: str,
    references: list[str],
) -> dict:
    """Score against multiple references, take the best F1."""
    scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
    
    best_rougeL = -1.0  # below any real score, so a best match of 0.0 is still returned
    best_scores = None
    
    for ref in references:
        scores = scorer.score(ref, hypothesis)
        rougeL_f1 = scores["rougeL"].fmeasure
        if rougeL_f1 > best_rougeL:
            best_rougeL = rougeL_f1
            best_scores = scores
    
    if best_scores is None:
        return {}
    
    return {
        "rouge1_f1": round(best_scores["rouge1"].fmeasure, 4),
        "rouge2_f1": round(best_scores["rouge2"].fmeasure, 4),
        "rougeL_f1": round(best_scores["rougeL"].fmeasure, 4),
    }
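
The selection logic does not depend on the scoring library. Here is a minimal standard-library sketch of the same idea, using a stand-in unigram F1 in place of ROUGE-L; the names `unigram_f1` and `best_reference` and the example texts are assumptions for illustration.

```python
from collections import Counter


def unigram_f1(hypothesis: str, reference: str) -> float:
    """Stand-in metric: F1 over clipped unigram overlap."""
    hyp = Counter(hypothesis.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((hyp & ref).values())
    total = sum(hyp.values()) + sum(ref.values())
    return 2 * overlap / total if total else 0.0


def best_reference(hypothesis: str, references: list[str]) -> tuple[str, float]:
    """Return the highest-scoring reference, even when every score is 0.0."""
    if not references:
        raise ValueError("at least one reference is required")
    scored = [(ref, unigram_f1(hypothesis, ref)) for ref in references]
    return max(scored, key=lambda pair: pair[1])


hypothesis = "ibuprofen reduces fever"
references = [
    "aspirin thins the blood",
    "ibuprofen reduces fever and inflammation",
]
best_ref, best_f1 = best_reference(hypothesis, references)
print(best_ref, round(best_f1, 4))  # second reference, 0.75
```

Taking the maximum rewards a hypothesis for matching any one acceptable summary. Averaging over references is a stricter alternative; whichever is chosen should be stated alongside the results.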

Full Summarization Eval Pipeline

Python

import json
from dataclasses import dataclass

import numpy as np
from rouge_score import rouge_scorer

@dataclass
class SummarizationEvalConfig:
    model_name: str
    dataset_path: str
    use_stemmer: bool = True
    report_per_example: bool = False


@dataclass
class SummarizationEvalResult:
    model_name: str
    n_examples: int
    rouge1_f1: float
    rouge2_f1: float
    rougeL_f1: float
    mean_hypothesis_length: float
    mean_reference_length: float


def run_summarization_eval(
    config: SummarizationEvalConfig,
    summarize_fn,  # (str) -> str
) -> SummarizationEvalResult:
    examples = []
    with open(config.dataset_path) as f:
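        for line in f:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines in the JSONL file
            obj = json.loads(line)
            # Skip metadata records; keep only real examples
            if not obj.get("_metadata"):
                examples.append(obj)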
    
    scorer = rouge_scorer.RougeScorer(
        ["rouge1", "rouge2", "rougeL"],
        use_stemmer=config.use_stemmer,
    )
    
    rouge1_f, rouge2_f, rougeL_f = [], [], []
    hyp_lengths, ref_lengths = [], []
    
    for ex in examples:
        hypothesis = summarize_fn(ex["document"])
        reference = ex["summary"]
        
        scores = scorer.score(reference, hypothesis)
        rouge1_f.append(scores["rouge1"].fmeasure)
        rouge2_f.append(scores["rouge2"].fmeasure)
        rougeL_f.append(scores["rougeL"].fmeasure)
        hyp_lengths.append(len(hypothesis.split()))
        ref_lengths.append(len(reference.split()))
    
    import numpy as np
    return SummarizationEvalResult(
        model_name=config.model_name,
        n_examples=len(examples),
        rouge1_f1=round(float(np.mean(rouge1_f)), 4),
        rouge2_f1=round(float(np.mean(rouge2_f)), 4),
        rougeL_f1=round(float(np.mean(rougeL_f)), 4),
        mean_hypothesis_length=round(float(np.mean(hyp_lengths)), 1),
        mean_reference_length=round(float(np.mean(ref_lengths)), 1),
    )
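
The loader above implies a JSON Lines dataset in which each example has "document" and "summary" keys and any record carrying a truthy "_metadata" key is skipped. The sketch below writes a tiny file in that shape and reads it back with the same filter; the file name and record contents are made up for illustration.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical records: one metadata header plus two examples
records = [
    {"_metadata": {"version": "example"}},
    {"document": "Long source text one.", "summary": "Summary one."},
    {"document": "Long source text two.", "summary": "Summary two."},
]

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "golden.jsonl"
    path.write_text("\n".join(json.dumps(r) for r in records) + "\n")

    # Same filtering rule as the pipeline: drop records with a truthy "_metadata"
    examples = []
    with open(path) as f:
        for line in f:
            obj = json.loads(line)
            if not obj.get("_metadata"):
                examples.append(obj)

print(len(examples), sorted(examples[0]))  # 2 ['document', 'summary']
```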

Limitations of ROUGE

1. Lexical overlap only

ROUGE misses semantically equivalent paraphrases. "The medicine lowers fever" and "The drug reduces temperature" share only the word "the" but mean the same thing.
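
A quick set intersection on the two example sentences shows how little surface overlap a valid paraphrase can have. This is a sketch with lowercase whitespace tokens, not a full ROUGE computation.

```python
medicine = set("the medicine lowers fever".split())
drug = set("the drug reduces temperature".split())

# Words that appear in both sentences
shared = medicine & drug
print(shared)  # {'the'}
```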

2. Cannot detect hallucinations

A summary that adds fabricated information can still score well on ROUGE if it also includes real information from the reference.
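
The effect is easy to reproduce with a recall-only score. In this sketch the helper `unigram_recall` and the sentences are illustrative assumptions; the fabricated claim is appended to otherwise correct content.

```python
from collections import Counter


def unigram_recall(hypothesis: str, reference: str) -> float:
    """Fraction of reference words (with clipped counts) found in the hypothesis."""
    hyp = Counter(hypothesis.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((hyp & ref).values())
    return overlap / sum(ref.values()) if ref else 0.0


reference = "ibuprofen reduces fever and inflammation"
faithful = "ibuprofen reduces fever and inflammation"
# Fabricated claim appended to otherwise correct content
hallucinated = faithful + " and cures bacterial infections"

faithful_recall = unigram_recall(faithful, reference)
hallucinated_recall = unigram_recall(hallucinated, reference)
print(faithful_recall, hallucinated_recall)  # 1.0 1.0
```

Recall is identical for both summaries. Precision would drop for the second one, but it drops the same way for any extra words, true or false, so it cannot single out fabrication.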

3. Length bias

Longer summaries tend to have higher recall simply by including more words. Always report both length and ROUGE scores together.

Python

# Length bias demonstration
short_summary = "Ibuprofen blocks COX enzymes."
long_summary = "Ibuprofen, a nonsteroidal anti-inflammatory drug, blocks COX-1 and COX-2 enzymes, reducing fever and inflammation."

reference = "Ibuprofen reduces fever and inflammation by blocking COX enzymes."

short_score = compute_all_rouge(short_summary, reference)
long_score = compute_all_rouge(long_summary, reference)

print(f"Short summary ROUGE-1 recall: {short_score['rouge1_recall']}")
print(f"Long summary ROUGE-1 recall:  {long_score['rouge1_recall']}")
# The long summary reaches higher recall by including more words.
# Its precision is lower, so compare recall, precision, and length together.
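
The arithmetic behind this can be sketched in one line. Let $o$ be the clipped overlap count and $|\text{ref}|$, $|\text{hyp}|$ the token counts (notation introduced here for illustration):

```latex
\[
R = \frac{o}{|\text{ref}|},
\qquad
P = \frac{o}{|\text{hyp}|},
\qquad
F_1 = \frac{2PR}{P + R} = \frac{2\,o}{|\text{ref}| + |\text{hyp}|}
\]
```

Recall has only the reference length in its denominator, so extra hypothesis words can never lower it and any that happen to match raise it. F1 includes the hypothesis length, which is why it is the safer headline number when summary lengths vary.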

When to Use ROUGE

| Task | ROUGE Appropriate? | Better Alternative |
|------|---------------------|-------------------|
| Extractive summarization | Yes (ROUGE-1, ROUGE-2) | - |
| Abstractive summarization | Partial (ROUGE-L) | + BERTScore |
| Document QA | Partial | BERTScore + human |
| Translation | No | BLEU, chrF |
| Chatbot evaluation | No | LLM-as-judge |

Key Takeaways

ROUGE is recall-oriented: it measures how much of the reference the hypothesis recovers, and is usually reported together with precision and F1.
ROUGE-1 and ROUGE-2 measure unigram and bigram overlap; ROUGE-L uses longest common subsequence.
Use stemming (use_stemmer=True) to handle morphological variation.
When multiple references exist, score against each and take the best.
ROUGE cannot detect hallucinations or semantic paraphrases.
Combine ROUGE with BERTScore for stronger summarization evaluation.

What's Next

In eval-bertscore.mdx, you will learn how BERTScore uses contextual embeddings to capture semantic similarity that BLEU and ROUGE miss.
