ROUGE Score for Summarization
Learn how ROUGE-N, ROUGE-L, and ROUGE-S work, when to use them, and how to implement summarization evaluation with the rouge-score library.
ROUGE (Recall-Oriented Understudy for Gisting Evaluation) was introduced in 2004 specifically for evaluating automatic summarization. Where BLEU focuses on precision (how much of the hypothesis appears in the reference), ROUGE focuses on recall (how much of the reference appears in the hypothesis).
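For summarization, recall matters: a good summary should cover the key content of the source document, and ROUGE approximates this by checking how much of a reference summary the generated summary recovers.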
The ROUGE Family
There are four main variants:
| Variant | What It Measures |
|---------|------------------|
| ROUGE-1 | Unigram (word) recall overlap |
| ROUGE-2 | Bigram recall overlap |
| ROUGE-L | Longest common subsequence (LCS) |
| ROUGE-S | Skip-bigram co-occurrence |
In practice, ROUGE-1, ROUGE-2, and ROUGE-L are reported most often. ROUGE-S is less common.
ROUGE-N: N-gram Recall
ROUGE-N computes the overlap of n-grams between the hypothesis (generated summary) and the reference (gold summary), reported as recall, precision, and F1.
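Written out, the three quantities look like this. The notation is introduced here for illustration only: $H$ is the hypothesis, $R$ the reference, $G_n(\cdot)$ the set of distinct n-grams in a text, and $\mathrm{Count}_X(g)$ the number of times n-gram $g$ occurs in text $X$.

$$
\mathrm{overlap}_n = \sum_{g \in G_n(R)} \min\bigl(\mathrm{Count}_H(g),\ \mathrm{Count}_R(g)\bigr)
$$

$$
\mathrm{Recall}_n = \frac{\mathrm{overlap}_n}{\sum_{g \in G_n(R)} \mathrm{Count}_R(g)},
\qquad
\mathrm{Precision}_n = \frac{\mathrm{overlap}_n}{\sum_{g \in G_n(H)} \mathrm{Count}_H(g)},
\qquad
F_1 = \frac{2 \cdot \mathrm{Precision}_n \cdot \mathrm{Recall}_n}{\mathrm{Precision}_n + \mathrm{Recall}_n}
$$

The `min` clips repeated n-grams, so a hypothesis cannot earn extra credit by repeating a matching word more often than the reference contains it.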
from collections import Counter


def ngrams(tokens: list[str], n: int) -> list[tuple]:
    return [tuple(tokens[i:i+n]) for i in range(len(tokens) - n + 1)]


def rouge_n(
    hypothesis: str,
    reference: str,
    n: int,
) -> dict[str, float]:
    """
    Compute ROUGE-N between a hypothesis and reference.
    Returns recall, precision, and F1.
    """
    hyp_tokens = hypothesis.lower().split()
    ref_tokens = reference.lower().split()

    hyp_ngrams = Counter(ngrams(hyp_tokens, n))
    ref_ngrams = Counter(ngrams(ref_tokens, n))

    # Overlap: for each n-gram, min of hyp count and ref count
    overlap = 0
    for gram, count in ref_ngrams.items():
        overlap += min(count, hyp_ngrams.get(gram, 0))

    total_ref = sum(ref_ngrams.values())
    total_hyp = sum(hyp_ngrams.values())

    recall = overlap / total_ref if total_ref > 0 else 0.0
    precision = overlap / total_hyp if total_hyp > 0 else 0.0
    f1 = (2 * precision * recall) / (precision + recall) if (precision + recall) > 0 else 0.0

    return {
        "recall": round(recall, 4),
        "precision": round(precision, 4),
        "f1": round(f1, 4),
    }


# Example: summarization evaluation
reference = (
    "The drug ibuprofen reduces fever and inflammation by blocking "
    "COX-1 and COX-2 enzymes. It is widely used for pain relief "
    "and comes in 200 mg and 400 mg tablet forms."
)
hypothesis_good = (
    "Ibuprofen blocks COX enzymes to reduce fever and inflammation. "
    "It is available in 200 mg and 400 mg tablets."
)
hypothesis_bad = (
    "Aspirin is a common pain reliever used by many people worldwide."
)

for n in [1, 2]:
    print(f"\nROUGE-{n}:")
    print(f"  Good summary: {rouge_n(hypothesis_good, reference, n)}")
    print(f"  Bad summary: {rouge_n(hypothesis_bad, reference, n)}")

ROUGE-L: Longest Common Subsequence
ROUGE-L uses the longest common subsequence (LCS) between hypothesis and reference. Unlike ROUGE-N, it does not require matches to be consecutive: matched words must appear in the same relative order, but other words may sit between them.
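To make the "gaps are fine, reordering is not" behavior concrete, here is a minimal sketch. The helper `lcs_len` and the three toy sentences are made up for this illustration and are not part of any library.

```python
from functools import lru_cache


def lcs_len(a: tuple[str, ...], b: tuple[str, ...]) -> int:
    """Length of the longest common subsequence of two token tuples."""

    @lru_cache(maxsize=None)
    def go(i: int, j: int) -> int:
        # LCS length of the suffixes a[i:] and b[j:]
        if i == len(a) or j == len(b):
            return 0
        if a[i] == b[j]:
            return 1 + go(i + 1, j + 1)
        return max(go(i + 1, j), go(i, j + 1))

    return go(0, 0)


ref = tuple("the cat sat on the mat".split())
gapped = tuple("the cat quietly sat on the old mat".split())
reordered = tuple("on the mat the cat sat".split())

lcs_gapped = lcs_len(ref, gapped)        # inserted words do not break the match
lcs_reordered = lcs_len(ref, reordered)  # same words, different order
print(lcs_gapped, lcs_reordered)         # 6 3
```

All six reference words survive the inserted words, but swapping the two halves of the sentence cuts the LCS to three even though every word is still present.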
def lcs_length(seq_a: list, seq_b: list) -> int:
    """Compute the length of the longest common subsequence."""
    m, n = len(seq_a), len(seq_b)
    # dp[i][j] = LCS length for seq_a[:i] and seq_b[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]

    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if seq_a[i-1] == seq_b[j-1]:
                dp[i][j] = dp[i-1][j-1] + 1
            else:
                dp[i][j] = max(dp[i-1][j], dp[i][j-1])

    return dp[m][n]


def rouge_l(hypothesis: str, reference: str) -> dict[str, float]:
    """Compute ROUGE-L between hypothesis and reference."""
    hyp_tokens = hypothesis.lower().split()
    ref_tokens = reference.lower().split()

    lcs = lcs_length(hyp_tokens, ref_tokens)

    recall = lcs / len(ref_tokens) if ref_tokens else 0.0
    precision = lcs / len(hyp_tokens) if hyp_tokens else 0.0
    f1 = (2 * precision * recall) / (precision + recall) if (precision + recall) > 0 else 0.0

    return {
        "recall": round(recall, 4),
        "precision": round(precision, 4),
        "f1": round(f1, 4),
        "lcs_length": lcs,
    }

# ROUGE-L tolerates gaps between matched words; ROUGE-2 needs adjacent pairs
reference = "The patient should take medication after meals."
paraphrase = "Medication should be taken following a meal by the patient."

rouge2_result = rouge_n(paraphrase, reference, n=2)
rougeL_result = rouge_l(paraphrase, reference)
print(f"ROUGE-2 F1: {rouge2_result['f1']}")  # 0.0: no bigram is shared
print(f"ROUGE-L F1: {rougeL_result['f1']}")  # Nonzero but low: the shared words appear in a different order, so the LCS is short

ROUGE-S: Skip-Bigram Co-occurrence
ROUGE-S counts skip-bigrams: ordered pairs of words from the same text, with any number of words between them. "The cat sat" and "The dog sat" share the skip-bigram ("The", "sat") even though "cat" and "dog" differ.
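The cat/dog example can be checked in a few lines. This is a simplified sketch: the helper name `skip_bigram_set` is made up here, and it uses a set, so it ignores how often a pair repeats.

```python
from itertools import combinations


def skip_bigram_set(text: str) -> set[tuple[str, str]]:
    """All ordered word pairs (i < j) in the text, lowercased, duplicates dropped."""
    tokens = text.lower().split()
    # combinations() yields pairs in positional order, so word order is preserved
    return set(combinations(tokens, 2))


a = skip_bigram_set("The cat sat")
b = skip_bigram_set("The dog sat")
shared = a & b

print(sorted(a))  # [('cat', 'sat'), ('the', 'cat'), ('the', 'sat')]
print(shared)     # {('the', 'sat')}
```

A text of n tokens produces n(n-1)/2 skip-bigrams, so the number of pairs grows quadratically with summary length.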
def skip_bigrams(tokens: list[str]) -> Counter:
    """Extract all skip-bigrams (pairs of words, order preserved, gaps allowed)."""
    pairs = Counter()
    for i in range(len(tokens)):
        for j in range(i+1, len(tokens)):
            pairs[(tokens[i], tokens[j])] += 1
    return pairs


def rouge_s(hypothesis: str, reference: str) -> dict[str, float]:
    """Compute ROUGE-S (skip-bigram F1)."""
    hyp_tokens = hypothesis.lower().split()
    ref_tokens = reference.lower().split()

    hyp_sbg = skip_bigrams(hyp_tokens)
    ref_sbg = skip_bigrams(ref_tokens)

    overlap = 0
    for pair, count in ref_sbg.items():
        overlap += min(count, hyp_sbg.get(pair, 0))

    total_ref = sum(ref_sbg.values())
    total_hyp = sum(hyp_sbg.values())

    recall = overlap / total_ref if total_ref > 0 else 0.0
    precision = overlap / total_hyp if total_hyp > 0 else 0.0
    f1 = (2 * precision * recall) / (precision + recall) if (precision + recall) > 0 else 0.0

    return {"recall": round(recall, 4), "precision": round(precision, 4), "f1": round(f1, 4)}

Using rouge-score Library
The rouge-score library (Google) provides an efficient, production-quality implementation:
# pip install rouge-score
import numpy as np
from rouge_score import rouge_scorer


def compute_all_rouge(
    hypothesis: str,
    reference: str,
    use_stemmer: bool = True,
) -> dict:
    """
    Compute ROUGE-1, ROUGE-2, and ROUGE-L with stemming.

    Stemming normalizes words: "running" and "run" become the same token.
    This increases recall for morphologically varied text.
    """
    scorer = rouge_scorer.RougeScorer(
        ["rouge1", "rouge2", "rougeL"],
        use_stemmer=use_stemmer,
    )
    # score() takes the reference (target) first, then the hypothesis (prediction)
    scores = scorer.score(reference, hypothesis)

    return {
        "rouge1_recall": round(scores["rouge1"].recall, 4),
        "rouge1_precision": round(scores["rouge1"].precision, 4),
        "rouge1_f1": round(scores["rouge1"].fmeasure, 4),
        "rouge2_f1": round(scores["rouge2"].fmeasure, 4),
        "rougeL_f1": round(scores["rougeL"].fmeasure, 4),
    }


# Batch evaluation across a golden dataset
def batch_rouge_eval(
    hypotheses: list[str],
    references: list[str],
) -> dict:
    scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)

    rouge1_f, rouge2_f, rougeL_f = [], [], []

    for hyp, ref in zip(hypotheses, references):
        scores = scorer.score(ref, hyp)
        rouge1_f.append(scores["rouge1"].fmeasure)
        rouge2_f.append(scores["rouge2"].fmeasure)
        rougeL_f.append(scores["rougeL"].fmeasure)

    return {
        "rouge1_f1": round(float(np.mean(rouge1_f)), 4),
        "rouge2_f1": round(float(np.mean(rouge2_f)), 4),
        "rougeL_f1": round(float(np.mean(rougeL_f)), 4),
        "n_examples": len(hypotheses),
    }

Handling Multiple References
If each example has multiple valid summaries, use the best-matching reference:
def rouge_best_of_references(
    hypothesis: str,
    references: list[str],
) -> dict:
    """Score against multiple references; keep the one with the best ROUGE-L F1."""
    scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)

    # Start below zero so a reference that scores exactly 0.0 is still recorded
    best_rougeL = -1.0
    best_scores = None

    for ref in references:
        scores = scorer.score(ref, hypothesis)
        rougeL_f1 = scores["rougeL"].fmeasure
        if rougeL_f1 > best_rougeL:
            best_rougeL = rougeL_f1
            best_scores = scores

    # Only reached when the references list is empty
    if best_scores is None:
        return {}

    return {
        "rouge1_f1": round(best_scores["rouge1"].fmeasure, 4),
        "rouge2_f1": round(best_scores["rouge2"].fmeasure, 4),
        "rougeL_f1": round(best_scores["rougeL"].fmeasure, 4),
    }

Full Summarization Eval Pipeline
from dataclasses import dataclass
import json

import numpy as np
from rouge_score import rouge_scorer


@dataclass
class SummarizationEvalConfig:
    model_name: str
    dataset_path: str
    use_stemmer: bool = True
    report_per_example: bool = False


@dataclass
class SummarizationEvalResult:
    model_name: str
    n_examples: int
    rouge1_f1: float
    rouge2_f1: float
    rougeL_f1: float
    mean_hypothesis_length: float
    mean_reference_length: float


def run_summarization_eval(
    config: SummarizationEvalConfig,
    summarize_fn,  # (str) -> str
) -> SummarizationEvalResult:
    # Load a JSONL dataset, skipping any metadata records
    examples = []
    with open(config.dataset_path) as f:
        for line in f:
            obj = json.loads(line)
            if not obj.get("_metadata"):
                examples.append(obj)

    scorer = rouge_scorer.RougeScorer(
        ["rouge1", "rouge2", "rougeL"],
        use_stemmer=config.use_stemmer,
    )

    rouge1_f, rouge2_f, rougeL_f = [], [], []
    hyp_lengths, ref_lengths = [], []

    for ex in examples:
        hypothesis = summarize_fn(ex["document"])
        reference = ex["summary"]

        scores = scorer.score(reference, hypothesis)
        rouge1_f.append(scores["rouge1"].fmeasure)
        rouge2_f.append(scores["rouge2"].fmeasure)
        rougeL_f.append(scores["rougeL"].fmeasure)
        # Track lengths (in whitespace tokens) so scores can be read alongside them
        hyp_lengths.append(len(hypothesis.split()))
        ref_lengths.append(len(reference.split()))

    return SummarizationEvalResult(
        model_name=config.model_name,
        n_examples=len(examples),
        rouge1_f1=round(float(np.mean(rouge1_f)), 4),
        rouge2_f1=round(float(np.mean(rouge2_f)), 4),
        rougeL_f1=round(float(np.mean(rougeL_f)), 4),
        mean_hypothesis_length=round(float(np.mean(hyp_lengths)), 1),
        mean_reference_length=round(float(np.mean(ref_lengths)), 1),
    )

Limitations of ROUGE
1. Lexical overlap only
ROUGE misses semantically equivalent paraphrases. "The medicine lowers fever" and "The drug reduces temperature" share only the word "The" but mean the same thing.
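A rough way to see this is to score the two sentences against each other with a bare unigram F1. The helper `unigram_f1` below is a simplified stand-in written for this illustration (whitespace tokens, no stemming), not the rouge-score implementation.

```python
from collections import Counter


def unigram_f1(hypothesis: str, reference: str) -> float:
    """Unigram-overlap F1 on lowercased whitespace tokens (simplified ROUGE-1)."""
    hyp = Counter(hypothesis.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((hyp & ref).values())  # Counter & Counter keeps min counts
    if overlap == 0:
        return 0.0
    precision = overlap / sum(hyp.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


paraphrase_f1 = unigram_f1("The drug reduces temperature", "The medicine lowers fever")
copy_f1 = unigram_f1("The medicine lowers fever", "The medicine lowers fever")

print(paraphrase_f1)  # 0.25
print(copy_f1)        # 1.0
```

Apart from the article "the", no token matches, so the paraphrase scores far below a verbatim copy even though both convey the same claim.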
2. Cannot detect hallucinations
A summary that adds fabricated information can still score well on ROUGE if it also includes real information from the reference.
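The sketch below illustrates the point with a deliberately false claim appended to an otherwise faithful summary. The helper `unigram_recall_precision` and the toy sentences are assumptions made for this example; it uses plain whitespace tokens rather than the rouge-score tokenizer.

```python
from collections import Counter


def unigram_recall_precision(hypothesis: str, reference: str) -> tuple[float, float]:
    """Return (recall, precision) of clipped unigram overlap."""
    hyp = Counter(hypothesis.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((hyp & ref).values())
    return overlap / sum(ref.values()), overlap / sum(hyp.values())


reference = "ibuprofen reduces fever and inflammation"
faithful = "ibuprofen reduces fever and inflammation"
# Hypothetical fabricated claim, added only to show how the metric reacts
fabricated = faithful + " it also cures diabetes"

faithful_recall, faithful_precision = unigram_recall_precision(faithful, reference)
fabricated_recall, fabricated_precision = unigram_recall_precision(fabricated, reference)

print(faithful_recall, fabricated_recall)        # 1.0 1.0
print(round(fabricated_precision, 4))            # 0.5556
```

Recall is identical for both summaries. Precision drops, but only because the summary got longer: harmless extra words would lower it by the same amount, so the score cannot tell a fabrication from padding.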
3. Length bias
Longer summaries tend to have higher recall simply by including more words. Always report both length and ROUGE scores together.
# Length bias demonstration
short_summary = "Ibuprofen blocks COX enzymes."
long_summary = "Ibuprofen, a nonsteroidal anti-inflammatory drug, blocks COX-1 and COX-2 enzymes, reducing fever and inflammation."
reference = "Ibuprofen reduces fever and inflammation by blocking COX enzymes."

short_score = compute_all_rouge(short_summary, reference)
long_score = compute_all_rouge(long_summary, reference)

print(f"Short summary ROUGE-1 recall: {short_score['rouge1_recall']}")
print(f"Long summary ROUGE-1 recall: {long_score['rouge1_recall']}")
print(f"Short summary ROUGE-1 F1: {short_score['rouge1_f1']}")
print(f"Long summary ROUGE-1 F1: {long_score['rouge1_f1']}")
# The long summary earns higher recall mainly by covering more reference words.
# F1 offsets part of that gain because the extra words lower its precision.

When to Use ROUGE
| Task | ROUGE Appropriate? | Better Alternative |
|------|--------------------|--------------------|
| Extractive summarization | Yes (ROUGE-1, ROUGE-2) | - |
| Abstractive summarization | Partial (ROUGE-L) | + BERTScore |
| Document QA | Partial | BERTScore + human |
| Translation | No | BLEU, chrF |
| Chatbot evaluation | No | LLM-as-judge |
Key Takeaways
- ROUGE measures recall of n-gram overlap between hypothesis and reference.
- ROUGE-1 and ROUGE-2 measure unigram and bigram overlap; ROUGE-L uses longest common subsequence.
- Use stemming (use_stemmer=True) to handle morphological variation.
- When multiple references exist, score against each and take the best.
- ROUGE cannot detect hallucinations or semantic paraphrases.
- Combine ROUGE with BERTScore for stronger summarization evaluation.
What's Next
In eval-bertscore.mdx, you will learn how BERTScore uses contextual embeddings to capture semantic similarity that BLEU and ROUGE miss.