The F1 Score

What F1 Measures

F1 combines precision and recall into a single number. It's the go-to metric for imbalanced binary classification when you care about both error types.

F1 = 2 × (Precision × Recall) / (Precision + Recall)

Properties:
  - F1 = 1.0: perfect precision AND perfect recall
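  - F1 = 0.0: either precision or recall is 0 (by convention also when both are 0 and the formula is undefined)
  - min(precision, recall) ≤ F1 ≤ arithmetic mean: the harmonic mean sits between the smaller value and the plain average, with equality only when precision = recall
  - F1 ≠ accuracy: focuses on the positive class, not overall correctness (true negatives never enter the formula)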
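
As a minimal sketch, the definition above can be written out from raw confusion-matrix counts. The helper name `f1_from_counts` and the example counts are illustrative assumptions, not part of any library. The count form 2·TP / (2·TP + FP + FN) is algebraically the same as the precision/recall form.

```python
def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    """F1 from confusion-matrix counts (hypothetical helper for illustration)."""
    if tp == 0:
        # Precision or recall is 0 (or undefined), so F1 is taken as 0 by convention
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Illustrative counts: 75 true positives, 30 false positives, 25 false negatives
tp, fp, fn = 75, 30, 25
f1 = f1_from_counts(tp, fp, fn)
f1_counts_form = 2 * tp / (2 * tp + fp + fn)  # same value, no intermediate P and R

print(f"F1 via precision/recall: {f1:.3f}")
print(f"F1 via counts:           {f1_counts_form:.3f}")
```

True negatives do not appear in either form, which is why F1 says nothing about how well the model handles the negative class.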

Why Harmonic Mean?

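The harmonic mean is dominated by the smaller of its two inputs, so a large gap between precision and recall drags F1 down far more than it would drag down the arithmetic mean. In fact, F1 can never exceed twice the smaller of the two values.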

Python

def harmonic_mean(a, b):
    # Guard against 0/0 when both inputs are zero
    return 2 * a * b / (a + b) if (a + b) > 0 else 0.0

def arithmetic_mean(a, b):
    return (a + b) / 2

# Case 1: balanced
p, r = 0.8, 0.8
print(f"Balanced   — Arithmetic: {arithmetic_mean(p,r):.3f}, Harmonic: {harmonic_mean(p,r):.3f}")

# Case 2: one very low
p, r = 0.9, 0.1
print(f"Imbalanced — Arithmetic: {arithmetic_mean(p,r):.3f}, Harmonic: {harmonic_mean(p,r):.3f}")
# Arithmetic: 0.5 (sounds acceptable)
# Harmonic:   0.18 (correctly signals the model is almost useless for one error type)

# This is why F1 is better than averaging precision and recall directly
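
One way to see the penalty, as a rough sketch: hold recall at a low value and sweep precision upward. The specific values below are illustrative assumptions. Because F1 = 2PR / (P + R) and P / (P + R) is always below 1, F1 stays below twice the recall no matter how high precision climbs.

```python
# Hold recall at 0.1 and raise precision: F1 barely moves
recall = 0.1
for precision in (0.1, 0.3, 0.5, 0.7, 0.9, 1.0):
    f1 = 2 * precision * recall / (precision + recall)
    arithmetic = (precision + recall) / 2
    print(f"P={precision:.1f}  R={recall:.1f}  arithmetic={arithmetic:.3f}  F1={f1:.3f}")

# Even with perfect precision, F1 stays below 2 * recall
f1_at_perfect_precision = 2 * 1.0 * recall / (1.0 + recall)
print(f"F1 at P=1.0: {f1_at_perfect_precision:.3f} (upper bound 2*R = {2 * recall:.3f})")
```

The arithmetic mean keeps rising with precision, while F1 stays pinned near the weak side.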

Computing F1

Python

from sklearn.metrics import f1_score, precision_score, recall_score, classification_report
import numpy as np

# Readmission model
y_true = np.array([0]*900 + [1]*100)
y_pred = np.array([0]*870 + [1]*30 + [0]*25 + [1]*75)
#                  TN=870   FP=30   FN=25    TP=75

precision = precision_score(y_true, y_pred)   # 75/(75+30) = 0.714
recall    = recall_score(y_true, y_pred)       # 75/(75+25) = 0.750
f1        = f1_score(y_true, y_pred)           # 0.732

print(f"Precision: {precision:.3f}")
print(f"Recall:    {recall:.3f}")
print(f"F1:        {f1:.3f}")

# Per-class report
print(classification_report(y_true, y_pred, target_names=["no_readmit", "readmit"]))
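
The "F1 ≠ accuracy" point can be checked on the same readmission labels. The sketch below compares the model with a hypothetical baseline that never predicts a readmission; the baseline is an assumption added for illustration.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# Same readmission labels as above: 900 negatives, 100 positives
y_true = np.array([0] * 900 + [1] * 100)
y_pred = np.array([0] * 870 + [1] * 30 + [0] * 25 + [1] * 75)

# Hypothetical baseline that never predicts a readmission
y_baseline = np.zeros_like(y_true)

model_acc = accuracy_score(y_true, y_pred)
model_f1 = f1_score(y_true, y_pred)
base_acc = accuracy_score(y_true, y_baseline)
# zero_division=0 returns 0.0 quietly when there are no predicted positives
base_f1 = f1_score(y_true, y_baseline, zero_division=0)

print(f"Model    accuracy: {model_acc:.3f}  F1: {model_f1:.3f}")
print(f"Baseline accuracy: {base_acc:.3f}  F1: {base_f1:.3f}")
```

The baseline reaches 0.900 accuracy while finding no readmissions at all, and its F1 of 0 makes that failure visible.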

F-Beta: Weighting Precision vs Recall

When false negatives are more costly than false positives (or vice versa), use F-beta.

Python

import numpy as np
from sklearn.metrics import fbeta_score

# F-beta = (1 + β²) × (precision × recall) / (β² × precision + recall)
# β < 1: weights precision more  (false positives are more costly)
# β = 1: equal weight (standard F1)
# β > 1: weights recall more   (false negatives are more costly)

y_true = np.array([0]*900 + [1]*100)
y_pred = np.array([0]*870 + [1]*30 + [0]*25 + [1]*75)

# Sepsis detection: missing sepsis (FN) is worse than a false alarm (FP)
f2 = fbeta_score(y_true, y_pred, beta=2)    # Recall counts twice as much as precision
print(f"F2 (recall-weighted): {f2:.3f}")

# Drug alert: alert fatigue matters, false alarms are expensive
f05 = fbeta_score(y_true, y_pred, beta=0.5)  # Precision counts twice as much as recall
print(f"F0.5 (precision-weighted): {f05:.3f}")
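
As a sanity check, the F-beta formula can be evaluated by hand and compared with `fbeta_score`. This is a sketch under the same readmission data; `fbeta_manual` is a hypothetical helper, not a scikit-learn function.

```python
import numpy as np
from sklearn.metrics import fbeta_score, precision_score, recall_score

y_true = np.array([0] * 900 + [1] * 100)
y_pred = np.array([0] * 870 + [1] * 30 + [0] * 25 + [1] * 75)

def fbeta_manual(precision: float, recall: float, beta: float) -> float:
    """F-beta from precision and recall (hypothetical helper for illustration)."""
    b2 = beta ** 2
    denom = b2 * precision + recall
    return (1 + b2) * precision * recall / denom if denom > 0 else 0.0

p = precision_score(y_true, y_pred)  # 75 / 105
r = recall_score(y_true, y_pred)     # 75 / 100

for beta in (0.5, 1.0, 2.0):
    manual = fbeta_manual(p, r, beta)
    library = fbeta_score(y_true, y_pred, beta=beta)
    print(f"beta={beta}: manual={manual:.3f}  sklearn={library:.3f}")
```

Because recall (0.750) is higher than precision (0.714) for this model, the recall-weighted F2 comes out above F1, and the precision-weighted F0.5 comes out below it.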

Multi-Class F1: Averaging Strategies

Python

from sklearn.metrics import f1_score, classification_report

# 4-class drug category prediction
y_true = [0, 1, 2, 3, 0, 1, 2, 3, 0, 0, 1, 2]
y_pred = [0, 1, 2, 2, 0, 1, 3, 3, 1, 0, 1, 2]

# Per-class F1 — most informative
print("Per-class F1:")
f1_per_class = f1_score(y_true, y_pred, average=None)
for class_id, score in enumerate(f1_per_class):
    print(f"  Class {class_id}: {score:.3f}")

# Macro: unweighted mean across classes
# → treats all classes equally regardless of frequency
# → low macro F1 means at least one class performs poorly
f1_macro = f1_score(y_true, y_pred, average="macro")
print(f"\nMacro F1:    {f1_macro:.3f}")

# Weighted: mean weighted by class support
# → better when class imbalance exists and frequent classes matter more
f1_weighted = f1_score(y_true, y_pred, average="weighted")
print(f"Weighted F1: {f1_weighted:.3f}")

# Micro: compute globally from TP/FP/FN totals
# → equals accuracy for single-label multi-class problems, regardless of class balance
f1_micro = f1_score(y_true, y_pred, average="micro")
print(f"Micro F1:    {f1_micro:.3f}")
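
The three averages can be rebuilt by hand from the per-class scores and the pooled counts. This is a sketch on the same 12-sample example; the variable names are illustrative.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

y_true = np.array([0, 1, 2, 3, 0, 1, 2, 3, 0, 0, 1, 2])
y_pred = np.array([0, 1, 2, 2, 0, 1, 3, 3, 1, 0, 1, 2])

per_class = f1_score(y_true, y_pred, average=None)
support = np.bincount(y_true)  # number of true samples per class

# Macro: plain mean of per-class F1; weighted: mean weighted by support
macro_manual = per_class.mean()
weighted_manual = np.average(per_class, weights=support)

# Micro: pool TP/FP/FN over all classes before computing F1
classes = np.unique(y_true)
tp = sum(np.sum((y_pred == c) & (y_true == c)) for c in classes)
fp = sum(np.sum((y_pred == c) & (y_true != c)) for c in classes)
fn = sum(np.sum((y_pred != c) & (y_true == c)) for c in classes)
micro_manual = 2 * tp / (2 * tp + fp + fn)

print(f"Macro:    {macro_manual:.3f}")
print(f"Weighted: {weighted_manual:.3f}")
print(f"Micro:    {micro_manual:.3f}  (accuracy: {accuracy_score(y_true, y_pred):.3f})")
```

In single-label problems every misclassification counts once as a false positive for one class and once as a false negative for another, so pooled FP equals pooled FN and micro F1 collapses to accuracy.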

F1 vs Other Metrics — When to Use What

| Metric | Best When |
|---|---|
| Accuracy | Classes are balanced; all errors equally costly |
| Precision | False positives are expensive (alert fatigue, unnecessary treatment) |
| Recall | False negatives are expensive (missed diagnosis, undetected safety issue) |
| F1 | Both error types matter; imbalanced classes |
| F2 | Recall matters more (clinical screening, safety detection) |
| F0.5 | Precision matters more (recommendation systems, alert systems) |
| AUC-ROC | Threshold-independent; classes roughly balanced |
| AUC-PR | Threshold-independent; severe class imbalance |
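
For the two threshold-independent rows, a rough sketch on synthetic data shows how both can be computed with scikit-learn. The class sizes and score distributions are assumptions made up for illustration, not results from a real model.

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

rng = np.random.default_rng(0)

# Synthetic, severely imbalanced problem: 950 negatives, 50 positives (assumed values)
y_true = np.array([0] * 950 + [1] * 50)
# Hypothetical model scores: positives score higher on average, with overlap
y_score = np.concatenate([rng.normal(0.0, 1.0, 950), rng.normal(1.5, 1.0, 50)])

roc_auc = roc_auc_score(y_true, y_score)
auc_pr = average_precision_score(y_true, y_score)
prevalence = y_true.mean()  # a random ranker scores roughly this on AUC-PR

print(f"AUC-ROC: {roc_auc:.3f} (random baseline 0.500)")
print(f"AUC-PR:  {auc_pr:.3f} (random baseline about {prevalence:.3f})")
```

The two numbers should be read against different baselines: 0.5 for AUC-ROC, but only about the positive rate for AUC-PR.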

Common F1 Interpretation Mistakes

Python

# Mistake 1: Comparing F1 across datasets with different class balances
# F1 of 0.80 on a 50/50 dataset ≠ F1 of 0.80 on a 90/10 dataset
# Always report class distribution alongside F1

# Mistake 2: Only reporting macro F1 without per-class F1
# A model can have decent macro F1 while completely failing on a rare class
print(classification_report(y_true, y_pred))   # Always include per-class

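# Mistake 3: Reporting F1 only at the default threshold (0.5)
# The threshold that maximizes F1 is often not 0.5, especially with imbalanced classes
from sklearn.metrics import precision_recall_curve
import numpy as np

# `model`, `X_test`, and `y_test` stand for your fitted classifier and held-out data
# (ideally tune the threshold on a validation split, not the final test set)
y_proba = model.predict_proba(X_test)[:, 1]
precisions, recalls, thresholds = precision_recall_curve(y_test, y_proba)
# precisions/recalls have one more entry than thresholds; drop the final (P=1, R=0) point
f1_scores = 2 * precisions[:-1] * recalls[:-1] / (precisions[:-1] + recalls[:-1] + 1e-9)
best_idx = f1_scores.argmax()
print(f"Best threshold: {thresholds[best_idx]:.3f} → F1: {f1_scores[best_idx]:.3f}")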
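
A self-contained version of the threshold search might look like the sketch below. The synthetic dataset, the logistic regression model, and the 50/50 train/validation split are all assumptions chosen so the example runs on its own.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, precision_recall_curve
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data (about 10% positives); stands in for a real dataset
X, y = make_classification(n_samples=2000, n_features=10, weights=[0.9, 0.1], random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.5, stratify=y, random_state=0
)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
val_proba = model.predict_proba(X_val)[:, 1]

precisions, recalls, thresholds = precision_recall_curve(y_val, val_proba)
# Drop the final (precision=1, recall=0) point, which has no matching threshold
p, r = precisions[:-1], recalls[:-1]
f1_scores = 2 * p * r / (p + r + 1e-9)
best_threshold = thresholds[int(f1_scores.argmax())]

# Compare F1 at the default cut-off with F1 at the tuned one
f1_default = f1_score(y_val, (val_proba >= 0.5).astype(int), zero_division=0)
f1_tuned = f1_score(y_val, (val_proba >= best_threshold).astype(int), zero_division=0)

print(f"F1 at 0.5:              {f1_default:.3f}")
print(f"F1 at {best_threshold:.3f} (tuned): {f1_tuned:.3f}")
```

The tuned F1 here is measured on the same split used to pick the threshold, so it is optimistic. A separate test set is needed for an honest final number.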

F1 in LLM Evaluation

Python

# F1 is used for token-level evaluation in extractive QA (SQuAD-style)
# Measures token overlap between predicted and reference answers

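from collections import Counter

def compute_token_f1(prediction: str, reference: str) -> float:
    # Whitespace tokenization after lowercasing; repeated tokens are counted (multiset overlap)
    pred_tokens = prediction.lower().split()
    ref_tokens  = reference.lower().split()

    if not pred_tokens or not ref_tokens:
        return 0.0

    # Counter & Counter keeps the minimum count of each shared token
    num_common = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if num_common == 0:
        return 0.0

    precision = num_common / len(pred_tokens)
    recall    = num_common / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)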

# Clinical note extraction
prediction = "The patient was given metformin 1000mg for type 2 diabetes"
reference  = "Patient received metformin 1000 mg for T2DM treatment"

print(f"Token F1: {compute_token_f1(prediction, reference):.3f}")
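
SQuAD-style scoring usually normalizes both strings before comparing tokens and takes the best score over several acceptable references. The sketch below follows that idea. The function names and the second reference answer are hypothetical, and the normalization steps (lowercasing, stripping punctuation, dropping English articles) are an assumption about what a given benchmark applies.

```python
import re
import string
from collections import Counter

def normalize_answer(text: str) -> list[str]:
    """Lowercase, drop punctuation and English articles, then split on whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return text.split()

def token_f1(prediction: str, reference: str) -> float:
    pred_tokens = normalize_answer(prediction)
    ref_tokens = normalize_answer(reference)
    if not pred_tokens or not ref_tokens:
        # Assumption: score 1.0 only when both are empty, otherwise 0.0
        return float(pred_tokens == ref_tokens)
    num_common = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if num_common == 0:
        return 0.0
    precision = num_common / len(pred_tokens)
    recall = num_common / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

def best_token_f1(prediction: str, references: list[str]) -> float:
    """Take the best score over several acceptable reference answers."""
    return max(token_f1(prediction, ref) for ref in references)

prediction = "The patient was given metformin 1000mg for type 2 diabetes"
references = [
    "Patient received metformin 1000 mg for T2DM treatment",
    "metformin 1000mg for type 2 diabetes",  # hypothetical second reference
]

for ref in references:
    print(f"{token_f1(prediction, ref):.3f}  vs  {ref!r}")
print(f"Best token F1: {best_token_f1(prediction, references):.3f}")
```

Token overlap is purely lexical: "1000mg" does not match "1000 mg", and "type 2 diabetes" does not match "T2DM", so clinically equivalent answers can still score low against a single reference.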

Interview Answer Template

Q: What is F1 score and when do you use it?

F1 is the harmonic mean of precision and recall: 2 × (P × R) / (P + R). It's the right metric when classes are imbalanced and both false positives and false negatives matter. The harmonic mean is the key property: if either precision or recall is very low, F1 is low even when the other is high, so a model that sacrifices one for the other is penalized. For multi-class problems, I report per-class F1 plus macro F1 (the unweighted average across classes, which treats rare and common classes equally). When the two error types aren't equally costly, I use F-beta: F2 treats recall as twice as important as precision (miss fewer positives), and F0.5 treats precision as twice as important as recall (fewer false alarms). For imbalanced binary classification, AUC-PR (average precision) is complementary: it summarizes performance across all thresholds without requiring you to pick one upfront.
