The F1 Score
F1 score explained: formula, why harmonic mean penalizes imbalance, F-beta for asymmetric costs, macro vs micro vs weighted averaging, and when to use F1 vs other metrics.
What F1 Measures
F1 combines precision and recall into a single number. It's the go-to metric for imbalanced binary classification when you care about both error types.
F1 = 2 × (Precision × Recall) / (Precision + Recall)
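Substituting Precision = TP / (TP + FP) and Recall = TP / (TP + FN) gives an equivalent form in raw counts, F1 = 2·TP / (2·TP + FP + FN), which shows that true negatives never enter the score. A minimal sketch follows; the function names and the example counts are illustrative and not taken from any library.

```python
def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    """F1 from confusion-matrix counts; true negatives are not needed."""
    denom = 2 * tp + fp + fn
    # Assumed convention: return 0.0 when there are no positives of any kind
    return 2 * tp / denom if denom else 0.0


def f1_from_precision_recall(tp: int, fp: int, fn: int) -> float:
    """Same score via the precision/recall form, for comparison."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Illustrative counts: 75 true positives, 30 false positives, 25 false negatives
tp, fp, fn = 75, 30, 25
print(f"Count form:            {f1_from_counts(tp, fp, fn):.3f}")
print(f"Precision/recall form: {f1_from_precision_recall(tp, fp, fn):.3f}")
```

Both lines print the same value, and adding any number of true negatives would not change it, which is why F1 and accuracy can disagree sharply on imbalanced data.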
Properties:
- F1 = 1.0: perfect precision AND perfect recall
- F1 = 0.0: either precision or recall is 0
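- min(precision, recall) ≤ F1 ≤ arithmetic mean of precision and recall: the harmonic mean never exceeds the arithmetic mean, and all three coincide only when precision equals recall
- F1 ≠ accuracy: focuses on the positive class, not overall correctness
Why Harmonic Mean?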
The harmonic mean penalizes extreme imbalance between precision and recall more than the arithmetic mean would.
def harmonic_mean(a, b):
    # Defined as 0 when both inputs are 0, to avoid division by zero
    return 2 * a * b / (a + b) if (a + b) else 0.0

def arithmetic_mean(a, b):
    return (a + b) / 2
# Case 1: balanced
p, r = 0.8, 0.8
print(f"Balanced — Arithmetic: {arithmetic_mean(p,r):.3f}, Harmonic: {harmonic_mean(p,r):.3f}")
# Case 2: one very low
p, r = 0.9, 0.1
print(f"Imbalanced — Arithmetic: {arithmetic_mean(p,r):.3f}, Harmonic: {harmonic_mean(p,r):.3f}")
# Arithmetic: 0.5 (sounds acceptable)
# Harmonic: 0.18 (correctly signals the model is almost useless for one error type)
# This is why F1 is better than averaging precision and recall directly
Computing F1
from sklearn.metrics import f1_score, precision_score, recall_score, classification_report
import numpy as np
# Readmission model
y_true = np.array([0]*900 + [1]*100)
y_pred = np.array([0]*870 + [1]*30 + [0]*25 + [1]*75)
# TN=870 FP=30 FN=25 TP=75
precision = precision_score(y_true, y_pred) # 75/(75+30) = 0.714
recall = recall_score(y_true, y_pred) # 75/(75+25) = 0.750
f1 = f1_score(y_true, y_pred) # 0.732
print(f"Precision: {precision:.3f}")
print(f"Recall: {recall:.3f}")
print(f"F1: {f1:.3f}")
# Per-class report
print(classification_report(y_true, y_pred, target_names=["no_readmit", "readmit"]))
F-Beta: Weighting Precision vs Recall
When false negatives are more costly than false positives (or vice versa), use F-beta.
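As a rough sketch of how β shifts the balance, the function below implements the standard F-beta formula directly from precision and recall. The name `fbeta_from_pr` and the example operating point are illustrative; in practice `sklearn.metrics.fbeta_score` computes the same quantity from labels.

```python
def fbeta_from_pr(precision: float, recall: float, beta: float) -> float:
    """F-beta = (1 + beta^2) * P * R / (beta^2 * P + R)."""
    b2 = beta ** 2
    denom = b2 * precision + recall
    # Assumed convention: the score is 0.0 when precision and recall are both 0
    return (1 + b2) * precision * recall / denom if denom else 0.0


# Illustrative operating point: precision 0.60, recall 0.90
p, r = 0.60, 0.90
for beta in (0.5, 1.0, 2.0):
    print(f"beta={beta}: F-beta = {fbeta_from_pr(p, r, beta):.3f}")
```

Because recall exceeds precision at this operating point, the score rises as β grows; with the two values swapped it would fall.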
from sklearn.metrics import fbeta_score
# F-beta = (1 + β²) × (precision × recall) / (β² × precision + recall)
# β < 1: weights precision more (false positives are more costly)
# β = 1: equal weight (standard F1)
# β > 1: weights recall more (false negatives are more costly)
y_true = np.array([0]*900 + [1]*100)
y_pred = np.array([0]*870 + [1]*30 + [0]*25 + [1]*75)
# Sepsis detection: missing sepsis (FN) is worse than a false alarm (FP)
f2 = fbeta_score(y_true, y_pred, beta=2) # Recall counts twice as much as precision
print(f"F2 (recall-weighted): {f2:.3f}")
# Drug alert: alert fatigue matters, false alarms are expensive
f05 = fbeta_score(y_true, y_pred, beta=0.5) # Precision counts twice as much as recall
print(f"F0.5 (precision-weighted): {f05:.3f}")
Multi-Class F1: Averaging Strategies
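A sketch of what the three averaging modes compute, written in plain Python so the arithmetic is visible. The helper names (`per_class_counts`, `f1`, `averaged_f1`) and the tiny label lists are made up for illustration; scikit-learn's `f1_score` with the `average` argument is the practical route.

```python
from collections import Counter


def per_class_counts(y_true, y_pred):
    """Return {class: (tp, fp, fn)} for single-label predictions."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # the predicted class gains a false positive
            fn[t] += 1  # the true class gains a false negative
    classes = sorted(set(y_true) | set(y_pred))
    return {c: (tp[c], fp[c], fn[c]) for c in classes}


def f1(tp, fp, fn):
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def averaged_f1(y_true, y_pred):
    counts = per_class_counts(y_true, y_pred)
    support = Counter(y_true)
    scores = {c: f1(*counts[c]) for c in counts}
    macro = sum(scores.values()) / len(scores)
    weighted = sum(scores[c] * support[c] for c in scores) / len(y_true)
    # Micro: pool the counts across classes first, then compute a single F1
    micro = f1(*(sum(col) for col in zip(*counts.values())))
    return macro, weighted, micro


# Hypothetical 3-class example in which the rare class (2) is never predicted
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 2]
y_pred = [0, 0, 0, 0, 0, 1, 1, 1, 0, 0]
macro, weighted, micro = averaged_f1(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(f"macro={macro:.3f} weighted={weighted:.3f} micro={micro:.3f} accuracy={accuracy:.3f}")
```

In this example the rare class is never predicted, so macro F1 drops well below weighted F1, while micro F1 coincides with accuracy, as it does whenever every sample has exactly one true and one predicted label.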
from sklearn.metrics import f1_score, classification_report
# 4-class drug category prediction
y_true = [0, 1, 2, 3, 0, 1, 2, 3, 0, 0, 1, 2]
y_pred = [0, 1, 2, 2, 0, 1, 3, 3, 1, 0, 1, 2]
# Per-class F1 — most informative
print("Per-class F1:")
f1_per_class = f1_score(y_true, y_pred, average=None)
for class_id, score in enumerate(f1_per_class):
    print(f" Class {class_id}: {score:.3f}")
# Macro: unweighted mean across classes
# → treats all classes equally regardless of frequency
# → low macro F1 means at least one class performs poorly
f1_macro = f1_score(y_true, y_pred, average="macro")
print(f"\nMacro F1: {f1_macro:.3f}")
# Weighted: mean weighted by class support
# → better when class imbalance exists and frequent classes matter more
f1_weighted = f1_score(y_true, y_pred, average="weighted")
print(f"Weighted F1: {f1_weighted:.3f}")
# Micro: compute globally from TP/FP/FN totals
# → equals accuracy in single-label multi-class problems, balanced or not
#   (every misclassification counts as one FP and one FN)
f1_micro = f1_score(y_true, y_pred, average="micro")
print(f"Micro F1: {f1_micro:.3f}")
F1 vs Other Metrics: When to Use What
| Metric | Best When |
|---|---|
| Accuracy | Classes are balanced; all errors equally costly |
| Precision | False positives are expensive (alert fatigue, unnecessary treatment) |
| Recall | False negatives are expensive (missed diagnosis, undetected safety issue) |
| F1 | Both error types matter; imbalanced classes |
| F2 | Recall matters more (clinical screening, safety detection) |
| F0.5 | Precision matters more (recommendation systems, alert systems) |
| AUC-ROC | Threshold-independent; classes roughly balanced |
| AUC-PR | Threshold-independent; severe class imbalance |
Common F1 Interpretation Mistakes
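The first mistake below is easy to check numerically. The sketch assumes a hypothetical classifier whose recall (0.80) and false positive rate (0.10) stay fixed while only the share of positives changes; the function name `f1_at_prevalence` and all numbers are illustrative.

```python
def f1_at_prevalence(prevalence: float, recall: float, fpr: float, n: int = 10_000) -> float:
    """Expected F1 for a classifier with fixed recall and false positive rate."""
    positives = n * prevalence
    negatives = n - positives
    tp = recall * positives
    fn = positives - tp
    fp = fpr * negatives
    return 2 * tp / (2 * tp + fp + fn)


for prevalence in (0.5, 0.1, 0.01):
    score = f1_at_prevalence(prevalence, recall=0.80, fpr=0.10)
    print(f"positives = {prevalence:>4.0%} of data -> F1 = {score:.3f}")
```

The same classifier scores very differently as positives become rarer, because false positives drawn from the growing negative pool erode precision.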
# Mistake 1: Comparing F1 across datasets with different class balances
# F1 of 0.80 on a 50/50 dataset ≠ F1 of 0.80 on a 90/10 dataset
# Always report class distribution alongside F1
# Mistake 2: Only reporting macro F1 without per-class F1
# A model can have decent macro F1 while completely failing on a rare class
print(classification_report(y_true, y_pred)) # Always include per-class
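# Mistake 3: Reporting F1 only at the default threshold (0.5)
# The threshold that maximizes F1 may not be 0.5
from sklearn.metrics import precision_recall_curve
# Assumes a fitted binary classifier `model` and held-out data `X_test`, `y_test`.
# In practice, choose the threshold on a validation split, not the final test set.
y_proba = model.predict_proba(X_test)[:, 1]
precisions, recalls, thresholds = precision_recall_curve(y_test, y_proba)
# precisions and recalls have one more entry than thresholds;
# drop the final (precision=1, recall=0) point so indices line up
f1_scores = 2 * precisions[:-1] * recalls[:-1] / (precisions[:-1] + recalls[:-1] + 1e-9)
best_idx = f1_scores.argmax()
print(f"Best threshold: {thresholds[best_idx]:.3f} → F1: {f1_scores[best_idx]:.3f}")
F1 in LLM Evaluation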
# F1 is used for token-level evaluation in extractive QA (SQuAD-style)
# Measures token overlap between predicted and reference answers
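from collections import Counter

def compute_token_f1(prediction: str, reference: str) -> float:
    # Simplified tokenization: lowercase and split on whitespace.
    # The official SQuAD script also strips punctuation and articles.
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    # Multiset intersection, so repeated tokens are counted rather than collapsed
    common = Counter(pred_tokens) & Counter(ref_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)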
# Clinical note extraction
prediction = "The patient was given metformin 1000mg for type 2 diabetes"
reference = "Patient received metformin 1000 mg for T2DM treatment"
print(f"Token F1: {compute_token_f1(prediction, reference):.3f}")
Interview Answer Template
Q: What is F1 score and when do you use it?
F1 is the harmonic mean of precision and recall: 2 × (P × R) / (P + R). It's the right metric when classes are imbalanced and both false positives and false negatives matter. The harmonic mean is key — if either precision or recall is very low, F1 is low, even if the other is high. This correctly penalizes a model that sacrifices one for the other. For multi-class problems, I report per-class F1 plus macro F1 (unweighted average across classes — treats rare and common classes equally). When the two error types aren't equally costly, I use F-beta: F2 weights recall double (miss fewer positives), F0.5 weights precision double (fewer false alarms). For imbalanced binary classification, AUC-PR (average precision) is complementary — it summarizes performance across all thresholds without requiring you to pick one upfront.