Classification Threshold Tuning

The Default Threshold Is Rarely Optimal

Most classifiers output a probability score between 0 and 1. The default threshold of 0.5 — "predict positive if probability is above 50%" — is a convention, not a calibrated choice.

Python

from sklearn.linear_model import LogisticRegression
import numpy as np

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

y_proba = model.predict_proba(X_test)[:, 1]

# Default behavior: threshold = 0.5
y_pred_default = model.predict(X_test)   # for this binary model with 0/1 labels: (y_proba > 0.5).astype(int)

# Custom threshold
threshold = 0.3
y_pred_low = (y_proba >= threshold).astype(int)

threshold = 0.7
y_pred_high = (y_proba >= threshold).astype(int)

from sklearn.metrics import precision_score, recall_score

for name, preds in [("default (0.5)", y_pred_default), ("low (0.3)", y_pred_low), ("high (0.7)", y_pred_high)]:
    p = precision_score(y_test, preds, zero_division=0)
    r = recall_score(y_test, preds, zero_division=0)
    print(f"Threshold {name:>15}: precision={p:.3f}, recall={r:.3f}")

What Moving the Threshold Does

Threshold ↓ (lower):
  More samples classified as positive
  → TP increases (catch more real positives)
  → FP increases (more false alarms)
  → Recall ↑, Precision ↓

Threshold ↑ (higher):
  Fewer samples classified as positive
  → FP decreases (fewer false alarms)
  → FN increases (miss more real positives)
  → Precision ↑, Recall ↓

Python

import numpy as np
from sklearn.metrics import precision_score, recall_score, f1_score, confusion_matrix

y_proba = model.predict_proba(X_test)[:, 1]

print(f"{'Threshold':>10}  {'Predicted Pos':>14}  {'Precision':>10}  {'Recall':>8}  {'F1':>8}")
print("-" * 58)

for threshold in np.arange(0.1, 0.95, 0.1):
    y_pred_t = (y_proba >= threshold).astype(int)
    n_pos = y_pred_t.sum()
    p = precision_score(y_test, y_pred_t, zero_division=0)
    r = recall_score(y_test, y_pred_t, zero_division=0)
    f1 = f1_score(y_test, y_pred_t, zero_division=0)
    print(f"{threshold:>10.1f}  {n_pos:>14}  {p:>10.3f}  {r:>8.3f}  {f1:>8.3f}")
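To see the raw counts behind those metrics, here is a small self-contained sketch. The labels and scores are made up for illustration (they do not come from any model in this article), and it only assumes NumPy and scikit-learn are installed.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical labels and scores, invented for illustration only
y_true = np.array([0, 0, 0, 0, 1, 0, 1, 1, 0, 1])
y_score = np.array([0.05, 0.15, 0.25, 0.35, 0.40, 0.55, 0.60, 0.75, 0.80, 0.95])

counts = {}
for threshold in (0.3, 0.5, 0.7):
    # Same rule as above: positive when the score reaches the threshold
    y_pred = (y_score >= threshold).astype(int)
    # With labels=[0, 1], ravel() yields TN, FP, FN, TP in that order
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    counts[threshold] = {"tp": int(tp), "fp": int(fp), "fn": int(fn), "tn": int(tn)}
    print(f"threshold={threshold:.1f}  TP={tp}  FP={fp}  FN={fn}  TN={tn}")
```

On this toy data, raising the threshold from 0.3 to 0.7 cuts false positives from 3 to 1 but raises false negatives from 0 to 2, which is the trade-off described above.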

Finding the Threshold That Maximizes F1

Python

from sklearn.metrics import precision_recall_curve, f1_score
import numpy as np

precisions, recalls, thresholds = precision_recall_curve(y_test, y_proba)

# F1 at each threshold
# precisions and recalls have one more element than thresholds
f1_scores = 2 * precisions[:-1] * recalls[:-1] / (precisions[:-1] + recalls[:-1] + 1e-9)
best_idx = np.argmax(f1_scores)

print(f"Best threshold for F1:  {thresholds[best_idx]:.3f}")
print(f"Precision at best:      {precisions[best_idx]:.3f}")
print(f"Recall at best:         {recalls[best_idx]:.3f}")
print(f"F1 at best:             {f1_scores[best_idx]:.3f}")

Setting a Recall Target (Clinical Safety)

Python

# In clinical settings, the stakeholder often specifies a minimum recall
# "We need to catch at least 90% of high-risk patients"

target_recall = 0.90

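# recalls is non-increasing as the threshold rises, and the lowest threshold
# already has recall = 1.0. We want the HIGHEST threshold that still meets the
# target, because that keeps precision as high as possible.
meets_target = np.where(recalls[:-1] >= target_recall)[0]
idx = meets_target[-1]
threshold, precision, recall = thresholds[idx], precisions[idx], recalls[idx]

print(f"Highest threshold achieving recall >= {target_recall}:")
print(f"  Threshold:  {threshold:.3f}")
print(f"  Precision:  {precision:.3f}")
print(f"  Recall:     {recall:.3f}")
if precision > 0:
    print(f"  About 1 in every {1/precision:.1f} alerts is a real positive")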

Setting a Precision Target (Alert Fatigue)

Python

# In alert-heavy systems, precision is constrained
# "Physicians won't tolerate more than 1 false alarm per real case (precision >= 0.50)"

target_precision = 0.50

# Precision is not strictly monotonic in the threshold, so check every threshold
# and take the lowest one that meets the target: recall is highest there.
meets_target = np.where(precisions[:-1] >= target_precision)[0]
if meets_target.size == 0:
    print(f"No threshold achieves precision >= {target_precision}")
else:
    idx = meets_target[0]
    print(f"Lowest threshold achieving precision >= {target_precision}:")
    print(f"  Threshold:  {thresholds[idx]:.3f}")
    print(f"  Precision:  {precisions[idx]:.3f}")
    print(f"  Recall:     {recalls[idx]:.3f}")

Threshold Tuning Should Use Validation Data

Python

from sklearn.model_selection import train_test_split

X_train, X_val_test, y_train, y_val_test = train_test_split(X, y, test_size=0.3, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_val_test, y_val_test, test_size=0.5, random_state=42)

# Step 1: Train model
model.fit(X_train, y_train)

# Step 2: Find threshold on validation set
y_val_proba = model.predict_proba(X_val)[:, 1]
precisions, recalls, thresholds = precision_recall_curve(y_val, y_val_proba)
best_idx = np.argmax(2 * precisions[:-1] * recalls[:-1] / (precisions[:-1] + recalls[:-1] + 1e-9))
best_threshold = thresholds[best_idx]
print(f"Tuned threshold (from val): {best_threshold:.3f}")

# Step 3: Apply threshold to test set
y_test_proba = model.predict_proba(X_test)[:, 1]
y_test_pred  = (y_test_proba >= best_threshold).astype(int)

from sklearn.metrics import classification_report
print("\nTest set performance at tuned threshold:")
print(classification_report(y_test, y_test_pred))

Threshold vs Probability Calibration

Python

# Threshold tuning changes the decision boundary
# It does NOT fix a poorly calibrated model

# If samples scored around 0.3 turn out to be positive 60% of the time:
# → the model under-predicts risk in that range (miscalibrated)
# → lowering the threshold to 0.2 catches those cases, but the probabilities are still wrong

# Calibration check
from sklearn.calibration import calibration_curve

fraction_of_positives, mean_predicted_value = calibration_curve(
    y_test, y_proba, n_bins=10
)

print("Calibration curve (diagonal = perfect):")
for pred, actual in zip(mean_predicted_value, fraction_of_positives):
    print(f"  Predicted prob: {pred:.2f} → Actual positive rate: {actual:.2f}")
    # If actual >> predicted: the model under-predicts risk in this bin
    # If actual << predicted: the model over-predicts risk in this bin
    # Either way, recalibrate the model; moving the threshold only works around it
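A short sketch of why these are separate concerns. It assumes made-up labels and scores and uses squaring as an arbitrary stand-in for a distortion that keeps the ranking of samples but changes the probability values. The precision-recall operating points stay the same; only the threshold values that reach them move.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# Hypothetical labels and scores, invented for illustration only
y_true = np.array([0, 0, 1, 0, 1, 0, 1, 1])
y_score = np.array([0.10, 0.20, 0.30, 0.40, 0.50, 0.60, 0.70, 0.90])

# Squaring is strictly increasing on [0, 1]: the probability values change
# (and so does the calibration), but the ranking of samples is preserved.
y_score_squashed = y_score ** 2

p_a, r_a, t_a = precision_recall_curve(y_true, y_score)
p_b, r_b, t_b = precision_recall_curve(y_true, y_score_squashed)

same_curve = bool(np.allclose(p_a, p_b) and np.allclose(r_a, r_b))
print("Same precision/recall operating points:", same_curve)
print("Same threshold values:", bool(np.allclose(t_a, t_b)))
```

Read the other way around, this is also why a tuned threshold does not carry over if the model is later recalibrated: the operating point has to be found again on the new scores.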

Interview Answer Template

Q: How do you choose the classification threshold?

The default threshold of 0.5 is almost never the right choice for imbalanced clinical datasets; it's a convention, not a calibrated decision. The right threshold depends on the relative cost of false positives and false negatives. For a clinical screening model where missing a case is dangerous, I find the highest threshold that still achieves a target recall (e.g., 0.90) using the validation precision-recall curve, then evaluate the resulting precision to assess whether the false alarm rate is operationally sustainable. For an alert system where alert fatigue is a concern, I find the lowest threshold that keeps precision above a minimum acceptable level, then report what recall that yields. I always tune the threshold on the validation set, never on the training or test set, and evaluate the final model on a held-out test set using the tuned threshold. The key insight: the precision-recall curve shows you all possible operating points at once; choosing the threshold is choosing the operating point.
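When the relative cost of the two error types can be put into numbers, the threshold can be chosen by minimizing total cost on the validation set. The following is a minimal sketch under stated assumptions: `pick_threshold_by_cost` is a hypothetical helper (not a scikit-learn function), and the data and cost values are invented for illustration.

```python
import numpy as np

def pick_threshold_by_cost(y_true, y_score, cost_fp, cost_fn):
    """Hypothetical helper: return (threshold, total_cost) minimizing
    cost_fp * FP + cost_fn * FN over the observed score values."""
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score)
    best_threshold, best_cost = None, np.inf
    for t in np.unique(y_score):          # candidate thresholds, ascending
        y_pred = y_score >= t
        fp = np.sum(y_pred & (y_true == 0))    # false alarms
        fn = np.sum(~y_pred & (y_true == 1))   # missed positives
        cost = cost_fp * fp + cost_fn * fn
        if cost < best_cost:
            best_threshold, best_cost = float(t), float(cost)
    return best_threshold, best_cost

# Hypothetical validation labels and scores
y_val = np.array([0, 0, 0, 0, 1, 0, 1, 1, 0, 1])
y_val_score = np.array([0.05, 0.15, 0.25, 0.35, 0.40, 0.55, 0.60, 0.75, 0.80, 0.95])

# Assumed costs: a missed case is 10x worse than a false alarm, then the reverse
t_fn_heavy, cost_fn_heavy = pick_threshold_by_cost(y_val, y_val_score, cost_fp=1, cost_fn=10)
t_fp_heavy, cost_fp_heavy = pick_threshold_by_cost(y_val, y_val_score, cost_fp=10, cost_fn=1)

print(f"Misses costly:       threshold={t_fn_heavy:.2f}, total cost={cost_fn_heavy:.0f}")
print(f"False alarms costly: threshold={t_fp_heavy:.2f}, total cost={cost_fp_heavy:.0f}")
```

On this toy data the chosen threshold is lower when misses are expensive and higher when false alarms are expensive, matching the recall-target and precision-target cases described in the answer.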
