Classification Threshold Tuning
Classification threshold explained: why 0.5 is rarely optimal, how to move the threshold to trade off precision and recall, and how to pick the right threshold for clinical and safety-critical ML.
The Default Threshold Is Rarely Optimal
Most classifiers output a probability score between 0 and 1. The default threshold of 0.5 ("predict positive if probability is above 50%") is a convention, not a calibrated choice.
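The snippets in this article assume that `X`, `y`, `X_train`, `X_test`, `y_train`, and `y_test` already exist. As a minimal sketch, assuming a synthetic imbalanced dataset is an acceptable stand-in for real clinical data, they could be created as follows. The `make_classification` settings (5,000 samples, roughly 10% positives) are illustrative assumptions, not values from the original article.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic stand-in for an imbalanced clinical dataset
# (assumption: roughly 10% of samples are positive).
X, y = make_classification(
    n_samples=5000,
    n_features=20,
    n_informative=5,
    weights=[0.9, 0.1],
    random_state=42,
)

# Stratify so the train and test sets keep a similar positive rate.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

print(f"Train size: {len(y_train)}, test size: {len(y_test)}")
print(f"Positive rate (train): {y_train.mean():.3f}")
print(f"Positive rate (test):  {y_test.mean():.3f}")
```

With a positive rate this low, a model can reach high accuracy at the 0.5 threshold while still missing many positives, which is why the threshold deserves a deliberate choice.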
from sklearn.linear_model import LogisticRegression
import numpy as np
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
y_proba = model.predict_proba(X_test)[:, 1]
# Default behavior: threshold = 0.5
y_pred_default = model.predict(X_test)  # for 0/1 labels, same as (y_proba > 0.5).astype(int)
# Custom threshold
threshold = 0.3
y_pred_low = (y_proba >= threshold).astype(int)
threshold = 0.7
y_pred_high = (y_proba >= threshold).astype(int)
from sklearn.metrics import precision_score, recall_score
for name, preds in [("default (0.5)", y_pred_default), ("low (0.3)", y_pred_low), ("high (0.7)", y_pred_high)]:
    p = precision_score(y_test, preds, zero_division=0)
    r = recall_score(y_test, preds, zero_division=0)
    print(f"Threshold {name:>15}: precision={p:.3f}, recall={r:.3f}")
What Moving the Threshold Does
Threshold ↓ (lower):
  More samples classified as positive
  → TP increases (catch more real positives)
  → FP increases (more false alarms)
  → Recall ↑, Precision usually ↓
Threshold ↑ (higher):
  Fewer samples classified as positive
  → FP decreases (fewer false alarms)
  → FN increases (miss more real positives)
  → Precision usually ↑, Recall ↓
import numpy as np
from sklearn.metrics import precision_score, recall_score, f1_score, confusion_matrix
y_proba = model.predict_proba(X_test)[:, 1]
print(f"{'Threshold':>10} {'Predicted Pos':>14} {'Precision':>10} {'Recall':>8} {'F1':>8}")
print("-" * 58)
for threshold in np.arange(0.1, 0.95, 0.1):
    y_pred_t = (y_proba >= threshold).astype(int)
    n_pos = y_pred_t.sum()
    p = precision_score(y_test, y_pred_t, zero_division=0)
    r = recall_score(y_test, y_pred_t, zero_division=0)
    f1 = f1_score(y_test, y_pred_t, zero_division=0)
    print(f"{threshold:>10.1f} {n_pos:>14} {p:>10.3f} {r:>8.3f} {f1:>8.3f}")
Finding the Threshold That Maximizes F1
from sklearn.metrics import precision_recall_curve, f1_score
import numpy as np
precisions, recalls, thresholds = precision_recall_curve(y_test, y_proba)
# F1 at each threshold
# precisions and recalls have one more element than thresholds
f1_scores = 2 * precisions[:-1] * recalls[:-1] / (precisions[:-1] + recalls[:-1] + 1e-9)
best_idx = np.argmax(f1_scores)
print(f"Best threshold for F1: {thresholds[best_idx]:.3f}")
print(f"Precision at best: {precisions[best_idx]:.3f}")
print(f"Recall at best: {recalls[best_idx]:.3f}")
print(f"F1 at best: {f1_scores[best_idx]:.3f}")
Setting a Recall Target (Clinical Safety)
# In clinical settings, the stakeholder often specifies a minimum recall
# "We need to catch at least 90% of high-risk patients"
target_recall = 0.90
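# thresholds is sorted ascending, so recall falls as the threshold rises.
# Walk from the highest threshold down: the first hit is the highest threshold
# that still meets the recall target, which keeps false alarms as low as possible.
for threshold, precision, recall in zip(thresholds[::-1], precisions[-2::-1], recalls[-2::-1]):
    if recall >= target_recall:
        print(f"Highest threshold achieving recall >= {target_recall}:")
        print(f"  Threshold: {threshold:.3f}")
        print(f"  Precision: {precision:.3f}")
        print(f"  Recall: {recall:.3f}")
        if precision > 0:  # guard against division by zero
            print(f"  About 1 in every {1/precision:.1f} alerts is a real positive")
        break
Setting a Precision Target (Alert Fatigue)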
# In alert-heavy systems, precision is constrained
# "Physicians won't tolerate more than 1 false alarm per real case (precision >= 0.50)"
target_precision = 0.50
# Walk up from the lowest threshold: the first hit is the lowest threshold that
# meets the precision target, which keeps recall as high as possible.
# (Precision usually rises with the threshold, but not strictly monotonically.)
for threshold, precision, recall in zip(thresholds, precisions[:-1], recalls[:-1]):
    if precision >= target_precision:
        print(f"Lowest threshold achieving precision >= {target_precision}:")
        print(f"  Threshold: {threshold:.3f}")
        print(f"  Precision: {precision:.3f}")
        print(f"  Recall: {recall:.3f}")
        break
Threshold Tuning Should Use Validation Data
from sklearn.model_selection import train_test_split
X_train, X_val_test, y_train, y_val_test = train_test_split(X, y, test_size=0.3, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_val_test, y_val_test, test_size=0.5, random_state=42)
# Step 1: Train model
model.fit(X_train, y_train)
# Step 2: Find threshold on validation set
y_val_proba = model.predict_proba(X_val)[:, 1]
precisions, recalls, thresholds = precision_recall_curve(y_val, y_val_proba)
best_idx = np.argmax(2 * precisions[:-1] * recalls[:-1] / (precisions[:-1] + recalls[:-1] + 1e-9))
best_threshold = thresholds[best_idx]
print(f"Tuned threshold (from val): {best_threshold:.3f}")
# Step 3: Apply threshold to test set
y_test_proba = model.predict_proba(X_test)[:, 1]
y_test_pred = (y_test_proba >= best_threshold).astype(int)
from sklearn.metrics import classification_report
print("\nTest set performance at tuned threshold:")
print(classification_report(y_test, y_test_pred))
Threshold vs Probability Calibration
# Threshold tuning changes the decision boundary
# It does NOT fix a poorly calibrated model
# Example: if model.predict_proba(X) returns about 0.3 for 60% of true positives,
# → the model may be systematically underestimating risk (a calibration problem)
# → lowering the threshold to 0.2 catches those cases but doesn't fix the root cause
# Calibration check
from sklearn.calibration import calibration_curve
fraction_of_positives, mean_predicted_value = calibration_curve(
    y_test, y_proba, n_bins=10
)
print("Calibration curve (diagonal = perfect):")
for pred, actual in zip(mean_predicted_value, fraction_of_positives):
    print(f"  Predicted prob: {pred:.2f} → Actual positive rate: {actual:.2f}")
# If actual >> predicted: the model underestimates risk (a lower threshold compensates)
# If actual << predicted: the model overestimates risk (a higher threshold compensates)
Interview Answer Template
Q: How do you choose the classification threshold?
The default threshold of 0.5 is almost never the right choice for imbalanced clinical datasets: it's a convention, not a calibrated decision. The right threshold depends on the relative cost of false positives and false negatives. For a clinical screening model where missing a case is dangerous, I find the highest threshold that still achieves a target recall (e.g., 0.90) using the validation precision-recall curve, then evaluate the resulting precision to assess whether the false alarm rate is operationally sustainable. For an alert system where alert fatigue is a concern, I find the lowest threshold that keeps precision above a minimum acceptable level, then report what recall that yields. I always tune the threshold on the validation set, never on the training set, and evaluate the final model on a held-out test set using the tuned threshold. The key insight: the precision-recall curve shows you all possible operating points at once; choosing the threshold is choosing the operating point.
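The claim that the right threshold depends on the relative cost of false positives and false negatives can be made concrete. The sketch below is one possible approach, not a method from the original article: the helpers `total_cost` and `pick_threshold_by_cost` are hypothetical names, the 1:10 cost ratio is an assumed example, and the data is synthetic. For a perfectly calibrated model, minimizing expected cost means predicting positive when the probability is at least `cost_fp / (cost_fp + cost_fn)`, which is about 0.09 for these costs; the empirical sweep on validation data is printed alongside for comparison.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split


def total_cost(y_true, y_proba, threshold, cost_fp, cost_fn):
    """Total misclassification cost when predicting positive at proba >= threshold."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_proba) >= threshold
    fp = np.sum(y_pred & (y_true == 0))   # false alarms
    fn = np.sum(~y_pred & (y_true == 1))  # missed positives
    return cost_fp * fp + cost_fn * fn


def pick_threshold_by_cost(y_true, y_proba, cost_fp, cost_fn):
    """Hypothetical helper: try every observed score as a threshold, keep the cheapest."""
    candidates = np.unique(y_proba)
    costs = np.array(
        [total_cost(y_true, y_proba, t, cost_fp, cost_fn) for t in candidates]
    )
    best = int(np.argmin(costs))
    return float(candidates[best]), float(costs[best])


# Synthetic, imbalanced stand-in data (assumption: roughly 10% positives).
X, y = make_classification(
    n_samples=4000, n_features=20, n_informative=5,
    weights=[0.9, 0.1], random_state=0,
)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_val_proba = model.predict_proba(X_val)[:, 1]

# Assumed costs: a missed case is 10x as costly as a false alarm.
cost_fp, cost_fn = 1.0, 10.0
best_threshold, best_cost = pick_threshold_by_cost(y_val, y_val_proba, cost_fp, cost_fn)
default_cost = total_cost(y_val, y_val_proba, 0.5, cost_fp, cost_fn)

print(f"Theoretical threshold (if calibrated): {cost_fp / (cost_fp + cost_fn):.3f}")
print(f"Empirical best threshold (validation): {best_threshold:.3f}")
print(f"Cost at tuned threshold: {best_cost:.0f}")
print(f"Cost at default 0.5:     {default_cost:.0f}")
```

As with the F1-based tuning, the threshold is chosen on validation data; the cost it yields should then be confirmed once on the held-out test set.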