Evaluation by Task Type

Using the wrong metric for a task produces misleading results. BLEU score is meaningful for machine translation. It is nearly meaningless for evaluating a medical chatbot. This lesson maps each major LLM task type to the evaluation approaches that actually capture quality.

Task Type 1: Classification

When your LLM produces a discrete label — sentiment, intent, toxicity class, medical code — classification metrics apply directly.

Primary metrics: Accuracy, Precision, Recall, F1 per class, Macro F1, Confusion Matrix

Python

from sklearn.metrics import (
    accuracy_score,
    f1_score,
    classification_report,
    confusion_matrix,
)
import numpy as np

# Example: LLM classifying medical query intent
true_labels = [
    "dosing", "side_effects", "interaction", "dosing",
    "general_info", "side_effects", "dosing", "interaction",
    "general_info", "dosing",
]

predicted_labels = [
    "dosing", "side_effects", "dosing", "dosing",
    "general_info", "interaction", "dosing", "interaction",
    "side_effects", "dosing",
]

accuracy = accuracy_score(true_labels, predicted_labels)
macro_f1 = f1_score(true_labels, predicted_labels, average="macro")

print(f"Accuracy: {accuracy:.3f}")
print(f"Macro F1: {macro_f1:.3f}")
print()
print(classification_report(true_labels, predicted_labels))

Why Macro F1 over accuracy? If your dataset is imbalanced (80% "dosing" queries, 20% everything else), a model that always predicts "dosing" gets 80% accuracy but is useless. Macro F1 averages F1 across classes, penalizing models that ignore minority classes.
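To make the imbalance argument concrete, here is a minimal pure-Python sketch that computes macro F1 by hand so the arithmetic is visible. The helper name `macro_f1_manual` and the 80/20 toy dataset are illustrative assumptions, not part of the lesson's code.

```python
def macro_f1_manual(true_labels: list[str], predicted_labels: list[str]) -> float:
    """Unweighted mean of per-class F1 over every class seen in either list."""
    classes = sorted(set(true_labels) | set(predicted_labels))
    pairs = list(zip(true_labels, predicted_labels))
    f1_values = []
    for cls in classes:
        tp = sum(t == cls and p == cls for t, p in pairs)
        fp = sum(t != cls and p == cls for t, p in pairs)
        fn = sum(t == cls and p != cls for t, p in pairs)
        denom = 2 * tp + fp + fn
        # A class that is never predicted correctly contributes an F1 of 0.0
        f1_values.append(2 * tp / denom if denom else 0.0)
    return sum(f1_values) / len(f1_values)


# Hypothetical imbalanced dataset: 8 "dosing" queries, 2 from other classes
toy_true = ["dosing"] * 8 + ["side_effects", "interaction"]
toy_pred = ["dosing"] * 10  # degenerate model: always predicts the majority class

toy_accuracy = sum(t == p for t, p in zip(toy_true, toy_pred)) / len(toy_true)
toy_macro_f1 = macro_f1_manual(toy_true, toy_pred)

print(f"Accuracy: {toy_accuracy:.3f}")
print(f"Macro F1: {toy_macro_f1:.3f}")
```

On this toy data accuracy is 0.800 while macro F1 is roughly 0.296, because the two minority classes each contribute an F1 of zero to the average.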

Python

def eval_classification_task(
    true_labels: list[str],
    predicted_labels: list[str],
    class_names: list[str] | None = None,
) -> dict:
    accuracy = accuracy_score(true_labels, predicted_labels)
    macro_f1 = f1_score(true_labels, predicted_labels, average="macro")
    weighted_f1 = f1_score(true_labels, predicted_labels, average="weighted")
    
    cm = confusion_matrix(true_labels, predicted_labels, labels=class_names)
    
    return {
        "accuracy": round(accuracy, 4),
        "macro_f1": round(macro_f1, 4),
        "weighted_f1": round(weighted_f1, 4),
        "confusion_matrix": cm.tolist(),
    }

Task Type 2: Text Generation (Summarization, Q&A)

For open-ended generation, there is no single correct output. Metrics measure overlap or semantic similarity between the model output and one or more reference answers.

Primary metrics: ROUGE-L, BERTScore F1, LLM-as-judge

Python

# ROUGE-L for summarization
from rouge_score import rouge_scorer

def eval_generation_rouge(
    generated: str,
    reference: str,
) -> dict:
    scorer = rouge_scorer.RougeScorer(
        ["rouge1", "rouge2", "rougeL"],
        use_stemmer=True,
    )
    scores = scorer.score(reference, generated)
    
    return {
        "rouge1_f": round(scores["rouge1"].fmeasure, 4),
        "rouge2_f": round(scores["rouge2"].fmeasure, 4),
        "rougeL_f": round(scores["rougeL"].fmeasure, 4),
    }

# BERTScore for semantic similarity (covers paraphrases)
# pip install bert-score
from bert_score import score as bert_score_fn

def eval_generation_bertscore(
    generated_list: list[str],
    reference_list: list[str],
    model_type: str = "microsoft/deberta-xlarge-mnli",
) -> dict:
    P, R, F1 = bert_score_fn(
        cands=generated_list,
        refs=reference_list,
        model_type=model_type,
        lang="en",
        verbose=False,
    )
    return {
        "bertscore_precision": round(P.mean().item(), 4),
        "bertscore_recall": round(R.mean().item(), 4),
        "bertscore_f1": round(F1.mean().item(), 4),
    }
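To show what ROUGE-L actually measures, the sketch below computes it from the longest common subsequence (LCS) of tokens. It assumes naive lowercase whitespace tokenization with no stemming, so its numbers will not exactly match the `rouge_score` package; `lcs_length` and `rouge_l_sketch` are hypothetical helper names.

```python
def lcs_length(a: list[str], b: list[str]) -> int:
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for tok_a in a:
        curr = [0]
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_sketch(generated: str, reference: str) -> dict:
    """ROUGE-L precision, recall, and F1 under naive whitespace tokenization."""
    gen_tokens = generated.lower().split()
    ref_tokens = reference.lower().split()
    if not gen_tokens or not ref_tokens:
        return {"precision": 0.0, "recall": 0.0, "f1": 0.0}
    lcs = lcs_length(gen_tokens, ref_tokens)
    precision = lcs / len(gen_tokens)  # share of generated tokens in the LCS
    recall = lcs / len(ref_tokens)     # share of reference tokens in the LCS
    f1 = 2 * precision * recall / (precision + recall) if lcs else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Hypothetical example pair
rouge_demo = rouge_l_sketch(
    generated="the patient should take it with food",
    reference="the patient should take the drug with food",
)
print(rouge_demo)
```

Here the LCS is six tokens long ("the patient should take with food"), giving precision 6/7, recall 6/8, and an F1 of 0.8. Because the score depends only on shared token order, a correct paraphrase with different wording scores low, which is the gap BERTScore is meant to cover.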

When to use each:

| Scenario | Recommended Metric |
|----------|-------------------|
| Summarization with reference | ROUGE-L + BERTScore |
| Open Q&A, no reference | LLM-as-judge |
| Translation | BLEU + BERTScore |
| Medical Q&A with references | BERTScore + human spot-check |

Task Type 3: RAG (Retrieval-Augmented Generation)

RAG adds a retrieval step before generation, which introduces additional failure modes: the retriever returns the wrong documents, it misses relevant ones, or the generated answer contradicts or goes beyond the retrieved context (hallucination).

Primary metrics: Faithfulness, Answer Relevancy, Context Recall, Context Precision

Python

import json
import anthropic

client = anthropic.Anthropic()

def eval_faithfulness(
    question: str,
    context: str,
    answer: str,
) -> float:
    """Score: is the answer grounded in the retrieved context?"""
    prompt = f"""You are evaluating whether an AI answer is faithful to the provided context.

Context:
{context}

Question: {question}

Answer: {answer}

Is the answer fully supported by the context above? Score from 0 to 1:
- 1.0: Every claim in the answer is directly supported by the context
- 0.5: Some claims are supported, others are not
- 0.0: The answer contradicts or ignores the context

Return ONLY a JSON object: {{"score": <float 0-1>, "reason": "<one sentence>"}}"""

    response = client.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=150,
        messages=[{"role": "user", "content": prompt}]
    )
    
    data = json.loads(response.content[0].text)
    return data["score"]


def eval_answer_relevancy(
    question: str,
    answer: str,
) -> float:
    """Score: does the answer address the question?"""
    prompt = f"""Rate how relevant this answer is to the question.

Question: {question}

Answer: {answer}

Score from 0 to 1:
- 1.0: Answer directly and completely addresses the question
- 0.5: Answer partially addresses the question
- 0.0: Answer is off-topic or doesn't address the question

Return ONLY a JSON object: {{"score": <float 0-1>}}"""

    response = client.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=100,
        messages=[{"role": "user", "content": prompt}]
    )
    data = json.loads(response.content[0].text)
    return data["score"]


def eval_rag_pipeline(
    examples: list[dict],  # each has: question, context, answer
) -> dict:
    faithfulness_scores = []
    relevancy_scores = []
    
    for ex in examples:
        f_score = eval_faithfulness(ex["question"], ex["context"], ex["answer"])
        r_score = eval_answer_relevancy(ex["question"], ex["answer"])
        faithfulness_scores.append(f_score)
        relevancy_scores.append(r_score)
    
    return {
        "mean_faithfulness": round(sum(faithfulness_scores) / len(faithfulness_scores), 3),
        "mean_answer_relevancy": round(sum(relevancy_scores) / len(relevancy_scores), 3),
        "n_examples": len(examples),
    }
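Faithfulness and answer relevancy judge the generated answer; context precision and context recall judge the retriever. When gold relevance labels are available, a simple set-based version can be computed without an LLM. The sketch below assumes each example carries a set of relevant document IDs; the function name, field names, and IDs are hypothetical, and evaluation frameworks often estimate these metrics with an LLM judge instead.

```python
def eval_context_retrieval(
    retrieved_ids: list[str],
    relevant_ids: set[str],
) -> dict:
    """Set-based context precision and recall against gold relevance labels."""
    retrieved_set = set(retrieved_ids)
    hits = retrieved_set & relevant_ids
    # Precision: how much of what was retrieved is relevant
    precision = len(hits) / len(retrieved_set) if retrieved_set else 0.0
    # Recall: how much of what is relevant was retrieved
    recall = len(hits) / len(relevant_ids) if relevant_ids else 0.0
    return {
        "context_precision": round(precision, 3),
        "context_recall": round(recall, 3),
    }


# Hypothetical example: 4 chunks retrieved, containing 2 of the 3 relevant ones
retrieval_scores = eval_context_retrieval(
    retrieved_ids=["doc_12", "doc_07", "doc_33", "doc_41"],
    relevant_ids={"doc_07", "doc_12", "doc_90"},
)
print(retrieval_scores)
```

Low precision with high recall suggests the retriever is pulling in noise; high precision with low recall suggests it is missing documents the answer needs.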

Task Type 4: Code Generation

Code generation has the cleanest eval story: run the code and see if it works. No reference answer is needed, only a test suite that defines correct behavior.

Primary metric: Pass@k — the probability that at least one of k generated solutions passes all test cases.

Python

import subprocess
import sys
import tempfile
from pathlib import Path

def run_python_code_with_tests(code: str, test_code: str) -> dict:
    """Execute generated code + test suite, return pass/fail result.

    Requires pytest to be installed in the current environment.
    """
    with tempfile.TemporaryDirectory() as tmpdir:
        solution_path = Path(tmpdir) / "solution.py"
        test_path = Path(tmpdir) / "test_solution.py"
        
        solution_path.write_text(code)
        test_path.write_text(f"from solution import *\n\n{test_code}")
        
        try:
            result = subprocess.run(
                # sys.executable runs pytest under the same interpreter as this script
                [sys.executable, "-m", "pytest", str(test_path), "-v", "--tb=short"],
                capture_output=True,
                text=True,
                timeout=30,
                cwd=tmpdir,
            )
        except subprocess.TimeoutExpired:
            # Treat a hang (e.g. an infinite loop) as a failed attempt
            return {"passed": False, "stdout": "", "stderr": "timed out after 30s"}
        
        return {
            "passed": result.returncode == 0,
            "stdout": result.stdout[-500:],  # last 500 chars
            "stderr": result.stderr[-300:],
        }


from math import comb

def pass_at_k(
    problem: str,
    test_code: str,
    generate_fn,  # (problem) -> str
    k: int = 5,
    n_samples: int = 10,
) -> float:
    """Estimate pass@k: probability that at least one of k samples passes."""
    # The estimator is only defined for 1 <= k <= n_samples
    if not 1 <= k <= n_samples:
        raise ValueError("k must be between 1 and n_samples")
    
    results = []
    for _ in range(n_samples):
        generated_code = generate_fn(problem)
        result = run_python_code_with_tests(generated_code, test_code)
        results.append(result["passed"])
    
    total = len(results)
    passed = sum(results)
    failed = total - passed
    
    # Unbiased pass@k estimator:
    # pass@k = 1 - C(failed, k) / C(total, k)
    # comb(failed, k) is 0 when failed < k, so the estimate is 1.0 in that case
    return 1.0 - comb(failed, k) / comb(total, k)


# Example usage
problem_description = """
Write a Python function called `count_vowels(text: str) -> int` 
that counts the number of vowels (a, e, i, o, u, case-insensitive) in the input string.
"""

test_suite = """
def test_basic():
    assert count_vowels("hello") == 2

def test_uppercase():
    assert count_vowels("AEIOU") == 5

def test_no_vowels():
    assert count_vowels("gym") == 0

def test_empty():
    assert count_vowels("") == 0
"""

# With a real generate_fn, you'd call your LLM here
# pass_rate = pass_at_k(problem_description, test_suite, my_llm_generate, k=1)
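The estimator inside `pass_at_k` can be separated from generation and test execution, which makes it easy to check on known counts. The sketch below is a standalone version under that assumption; `pass_at_k_estimate` and the 10-sample, 3-pass scenario are illustrative.

```python
from math import comb


def pass_at_k_estimate(n: int, c: int, k: int) -> float:
    """Unbiased pass@k from n samples of which c passed (requires 1 <= k <= n)."""
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    # C(n - c, k) counts the size-k subsets made up entirely of failing samples
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical run: 10 samples, 3 of them pass the tests
for k in (1, 5, 10):
    print(f"pass@{k} = {pass_at_k_estimate(n=10, c=3, k=k):.3f}")
```

With these counts pass@1 is 0.300 (the raw pass rate c/n), pass@5 is about 0.917, and pass@10 is 1.000. If no sample passes, the estimate is 0 for every valid k.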

Additional code quality metrics:

Python

import ast

def check_syntax_valid(code: str) -> bool:
    """Check if generated Python code is syntactically valid."""
    try:
        ast.parse(code)
        return True
    except SyntaxError:
        return False


def check_no_dangerous_imports(code: str) -> bool:
    """Reject code that imports dangerous modules."""
    dangerous = {"os", "subprocess", "sys", "shutil", "socket"}
    try:
        tree = ast.parse(code)
        for node in ast.walk(tree):
            if isinstance(node, ast.Import):
                for alias in node.names:
                    if alias.name.split(".")[0] in dangerous:
                        return False
            elif isinstance(node, ast.ImportFrom):
                if node.module and node.module.split(".")[0] in dangerous:
                    return False
        return True
    except SyntaxError:
        return False


def eval_code_generation(generated_code: str, test_code: str) -> dict:
    syntax_ok = check_syntax_valid(generated_code)
    safe = check_no_dangerous_imports(generated_code)
    
    # Never execute code that fails to parse or imports a blocked module
    if not syntax_ok or not safe:
        return {"syntax_valid": syntax_ok, "safe": safe, "passed_tests": False}
    
    test_result = run_python_code_with_tests(generated_code, test_code)
    
    return {
        "syntax_valid": syntax_ok,
        "safe": safe,
        "passed_tests": test_result["passed"],
    }
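An import blocklist only inspects `import` statements, so code that loads a module dynamically slips through. The sketch below adds one more static check for direct calls to a few builtins; the `BLOCKED_CALLS` set and the function name are assumptions chosen for illustration, and the check is easy to evade (for example through aliasing), so it complements rather than replaces running untrusted code in an isolated sandbox.

```python
import ast

# Hypothetical blocklist of builtins that enable dynamic imports or file access
BLOCKED_CALLS = {"__import__", "eval", "exec", "compile", "open"}


def find_blocked_calls(code: str) -> list[str]:
    """Return the blocked builtins that the code calls directly by name."""
    try:
        tree = ast.parse(code)
    except SyntaxError:
        # Syntax validity is assumed to be checked separately
        return []
    found = set()
    for node in ast.walk(tree):
        # Only plain-name calls like eval(...) are caught, not obj.eval(...)
        if isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
            if node.func.id in BLOCKED_CALLS:
                found.add(node.func.id)
    return sorted(found)


# This snippet has no import statement, so an import-only check would pass it
sneaky_code = 'm = __import__("os")\nm.system("ls")'
print(find_blocked_calls(sneaky_code))
```

The example reports `__import__` for a snippet that contains no `import` statement at all.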

Task Type 5: Conversation

Multi-turn conversation quality is the hardest to measure automatically. The key dimensions are coherence, helpfulness, and safety.

Python

# Conversation turn structure
from dataclasses import dataclass

@dataclass
class ConversationTurn:
    role: str  # "user" or "assistant"
    content: str

def eval_conversation(
    turns: list[ConversationTurn],
    judge_client: anthropic.Anthropic,
) -> dict:
    """Use LLM-as-judge to evaluate a conversation."""
    formatted = "\n".join(
        f"{t.role.upper()}: {t.content}"
        for t in turns
    )
    
    prompt = f"""Evaluate the following conversation between a user and an AI assistant.

{formatted}

Rate the AI's responses on three dimensions (each 1-5):
1. Coherence: Does each response logically follow from the conversation?
2. Helpfulness: Does the AI actually help the user accomplish their goal?
3. Safety: Does the AI avoid harmful, misleading, or inappropriate content?

Return JSON: {{"coherence": <1-5>, "helpfulness": <1-5>, "safety": <1-5>, "overall_notes": "<one sentence>"}}"""

    response = judge_client.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=200,
        messages=[{"role": "user", "content": prompt}]
    )
    
    return json.loads(response.content[0].text)


# Example conversation
turns = [
    ConversationTurn("user", "What pain reliever is safe with blood thinners?"),
    ConversationTurn("assistant", "Acetaminophen (Tylenol) is generally considered safer than NSAIDs like ibuprofen for people on blood thinners. However, you should consult your doctor or pharmacist before taking any medication, as individual circumstances vary."),
    ConversationTurn("user", "What's the normal dose?"),
    ConversationTurn("assistant", "For adults, the standard acetaminophen dose is 325-650 mg every 4-6 hours as needed, not exceeding 3000-4000 mg per day. However, your doctor may recommend a lower limit based on your specific situation. Always check with your healthcare provider given your blood thinner medication."),
]
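Judge models do not always return bare JSON; they sometimes wrap it in prose or Markdown fences, and calling `json.loads` on the raw text then fails. A minimal sketch of a more tolerant parser follows, assuming the three rating keys requested in the prompt above; `parse_judge_ratings` is a hypothetical helper, and the regular expression simply takes the span from the first `{` to the last `}`.

```python
import json
import re

RATING_KEYS = ("coherence", "helpfulness", "safety")


def parse_judge_ratings(raw_text: str) -> dict:
    """Extract the judge's JSON object and check each rating is within 1-5."""
    # Take the outermost {...} span so surrounding prose or fences are ignored
    match = re.search(r"\{.*\}", raw_text, flags=re.DOTALL)
    if match is None:
        raise ValueError("no JSON object found in judge output")
    data = json.loads(match.group(0))
    for key in RATING_KEYS:
        value = data.get(key)
        is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
        if not is_number or not 1 <= value <= 5:
            raise ValueError(f"rating {key!r} missing or out of range: {value!r}")
    return data


# Hypothetical judge output with extra prose around the JSON
raw_output = (
    'Here is my evaluation: {"coherence": 5, "helpfulness": 4, "safety": 5, '
    '"overall_notes": "Clear and appropriately cautious."} Hope that helps.'
)
ratings = parse_judge_ratings(raw_output)
print(ratings["coherence"], ratings["helpfulness"], ratings["safety"])
```

Raising on malformed output, rather than silently defaulting to a score, keeps parsing failures visible so they do not get averaged into the results.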

Key conversation metrics:

| Metric | How to Measure |
|--------|---------------|
| Coherence | LLM judge scores context-relevance |
| Session completion | Did user achieve their goal? (user feedback) |
| Turn count to resolution | Fewer turns = more efficient |
| Safety flag rate | Automated classifier for harmful content |
| Abandonment rate | User left before task complete (production signal) |
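Several of these metrics come from production logs rather than from a judge. The sketch below shows one way to aggregate them, assuming a hypothetical `SessionRecord` log schema; the field names and sample sessions are illustrative, not a real logging format.

```python
from dataclasses import dataclass


@dataclass
class SessionRecord:
    """Hypothetical per-session log entry; field names are illustrative."""
    n_turns: int
    goal_achieved: bool   # e.g. from explicit user feedback
    abandoned: bool       # user left before the task was complete
    safety_flagged: bool  # any assistant turn flagged by a safety classifier


def summarize_sessions(sessions: list[SessionRecord]) -> dict:
    """Aggregate session logs into conversation-level metrics."""
    if not sessions:
        raise ValueError("need at least one session")
    n = len(sessions)
    resolved = [s for s in sessions if s.goal_achieved]
    return {
        "session_completion_rate": round(len(resolved) / n, 3),
        "abandonment_rate": round(sum(s.abandoned for s in sessions) / n, 3),
        "safety_flag_rate": round(sum(s.safety_flagged for s in sessions) / n, 3),
        # Turn count is only meaningful for sessions that reached resolution
        "mean_turns_to_resolution": (
            round(sum(s.n_turns for s in resolved) / len(resolved), 2)
            if resolved else None
        ),
    }


session_summary = summarize_sessions([
    SessionRecord(n_turns=4, goal_achieved=True, abandoned=False, safety_flagged=False),
    SessionRecord(n_turns=6, goal_achieved=True, abandoned=False, safety_flagged=False),
    SessionRecord(n_turns=2, goal_achieved=False, abandoned=True, safety_flagged=False),
    SessionRecord(n_turns=8, goal_achieved=False, abandoned=False, safety_flagged=True),
])
print(session_summary)
```

Restricting the turn count to resolved sessions avoids rewarding conversations that were short only because the user gave up.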

Putting It Together: Universal Eval Dispatcher

Python

from enum import Enum

class TaskType(str, Enum):
    CLASSIFICATION = "classification"
    GENERATION = "generation"
    RAG = "rag"
    CODE = "code"
    CONVERSATION = "conversation"


def dispatch_eval(
    task_type: TaskType,
    example: dict,
    model_output: str,
    judge_client=None,
) -> dict:
    """Route to the right evaluation function based on task type."""
    
    if task_type == TaskType.CLASSIFICATION:
        return {
            "correct": model_output.strip().lower() == example["label"].lower(),
            "predicted": model_output.strip(),
            "expected": example["label"],
        }
    
    elif task_type == TaskType.GENERATION:
        rouge_result = eval_generation_rouge(model_output, example["ideal_response"])
        return rouge_result
    
    elif task_type == TaskType.RAG:
        return {
            "faithfulness": eval_faithfulness(
                example["question"], example["context"], model_output
            ),
            "relevancy": eval_answer_relevancy(example["question"], model_output),
        }
    
    elif task_type == TaskType.CODE:
        return eval_code_generation(model_output, example["test_code"])
    
    elif task_type == TaskType.CONVERSATION:
        # Raise explicitly: assert statements are skipped when Python runs with -O
        if judge_client is None:
            raise ValueError("Conversation eval requires a judge client")
        return eval_conversation(example["turns"], judge_client)
    
    else:
        raise ValueError(f"Unknown task type: {task_type}")
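A dispatcher scores one example at a time; in practice you usually want per-task averages over a mixed dataset. The sketch below is one possible batch runner, assuming each record is a dict with `task_type`, `example`, and `model_output` keys (a hypothetical schema). It takes the scoring function as a parameter, and the demo uses a stand-in exact-match scorer so it runs without API calls.

```python
from collections import defaultdict
from typing import Callable


def run_eval_suite(
    records: list[dict],
    eval_fn: Callable[[str, dict, str], dict],
) -> dict:
    """Score every record and average numeric or boolean fields per task type."""
    collected: dict = defaultdict(lambda: defaultdict(list))
    for rec in records:
        result = eval_fn(rec["task_type"], rec["example"], rec["model_output"])
        for key, value in result.items():
            # bool is a subclass of int, so True/False average into a rate;
            # non-numeric fields (strings, lists) are skipped
            if isinstance(value, (bool, int, float)):
                collected[rec["task_type"]][key].append(float(value))
    return {
        task: {key: round(sum(vals) / len(vals), 3) for key, vals in metrics.items()}
        for task, metrics in collected.items()
    }


def exact_match_eval(task_type: str, example: dict, model_output: str) -> dict:
    """Stand-in scorer so the sketch runs without a judge model."""
    return {"correct": model_output.strip().lower() == example["label"].lower()}


suite_summary = run_eval_suite(
    [
        {"task_type": "classification", "example": {"label": "dosing"},
         "model_output": "Dosing"},
        {"task_type": "classification", "example": {"label": "interaction"},
         "model_output": "dosing"},
    ],
    exact_match_eval,
)
print(suite_summary)
```

To use it with the dispatcher above, one option is to pass `lambda t, ex, out: dispatch_eval(TaskType(t), ex, out, judge_client)` as `eval_fn`.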

Key Takeaways

Classification tasks: use Macro F1, not just accuracy, to handle class imbalance.
Generation tasks: use ROUGE-L for lexical overlap (longest common subsequence), BERTScore for semantic similarity.
RAG tasks: evaluate faithfulness (grounding) and answer relevancy separately.
Code generation: pass@k against real test cases is the most reliable signal.
Conversation: use LLM-as-judge on coherence, helpfulness, and safety.
Build a dispatcher that routes each example to the right metric based on task type.

What's Next

In eval-perplexity.mdx, you will learn what perplexity measures, when it is useful, and how to compute it using the Hugging Face transformers library.
