What is a Decision Boundary?

A decision boundary is the surface in feature space where a model's prediction switches from one class to another. It separates the regions the model assigns to each class.

For a 2-class problem:
  - All points on one side → class 0
  - All points on the other side → class 1
  - On the boundary → exactly 50% probability (for logistic regression)

Linear Decision Boundary

Logistic regression and linear SVM draw a straight line (in 2D) or flat hyperplane (in higher dimensions).

Python

import numpy as np
from sklearn.linear_model import LogisticRegression
import matplotlib.pyplot as plt

# 2D example: predict drug interaction (1) vs no interaction (0)
np.random.seed(42)
X_class0 = np.random.randn(50, 2) + [0, 0]   # Centered at origin
X_class1 = np.random.randn(50, 2) + [3, 3]   # Shifted up-right
X = np.vstack([X_class0, X_class1])
y = np.array([0]*50 + [1]*50)

model = LogisticRegression()
model.fit(X, y)

# The decision boundary is where probability = 0.5
# For logistic regression: w·x + b = 0
w = model.coef_[0]
b = model.intercept_[0]

# x2 = -(w[0]*x1 + b) / w[1]
x1_range = np.linspace(-3, 6, 100)
x2_boundary = -(w[0] * x1_range + b) / w[1]
# Plotting: plt.plot(x1_range, x2_boundary, 'r-', label='Decision boundary')
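
As a quick sanity check on the formula above, the sketch below rebuilds a similar two-cluster dataset, computes a few points on the line w·x + b = 0, and asks the fitted model for their probabilities. The names `rng` and `boundary_points`, and the use of `np.random.default_rng`, are choices made for this illustration rather than part of the original example. Each probability should come out at approximately 0.5.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Two Gaussian clusters, similar in spirit to the example above
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(3, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)

model = LogisticRegression().fit(X, y)
w, b = model.coef_[0], model.intercept_[0]

# Solve w[0]*x1 + w[1]*x2 + b = 0 for x2 at a few x1 values
x1 = np.linspace(-3, 6, 5)
x2 = -(w[0] * x1 + b) / w[1]
boundary_points = np.column_stack([x1, x2])

# Probability of class 1 for points lying on the boundary
probs = model.predict_proba(boundary_points)[:, 1]
print(probs)
```

Points moved slightly off this line toward either cluster would get probabilities above or below 0.5, which is what makes the line the boundary.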

Limitation: cannot separate classes that aren't linearly separable (e.g., XOR problem, concentric circles).
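
To make the limitation concrete, here is a minimal sketch using scikit-learn's `make_circles` to generate concentric circles. The dataset parameters (`n_samples`, `noise`, `factor`) and variable names are assumptions chosen for illustration. It compares the training accuracy of logistic regression with that of an RBF-kernel SVM, a non-linear model covered later in this section.

```python
from sklearn.datasets import make_circles
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

# Inner circle = one class, outer circle = the other (illustrative parameters)
X_circ, y_circ = make_circles(n_samples=200, noise=0.05, factor=0.5, random_state=0)

# A straight line cannot enclose the inner circle
linear_acc = LogisticRegression().fit(X_circ, y_circ).score(X_circ, y_circ)

# An RBF kernel can draw a closed curve around it
rbf_acc = SVC(kernel="rbf", gamma="scale").fit(X_circ, y_circ).score(X_circ, y_circ)

print(f"Logistic regression accuracy: {linear_acc:.2f}")
print(f"RBF SVM accuracy:             {rbf_acc:.2f}")
```

The linear model should land near chance level on this data, while the RBF SVM should separate the two rings almost perfectly.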

Non-Linear Decision Boundaries

Decision Trees — Axis-Aligned Rectangles

Python

from sklearn.tree import DecisionTreeClassifier

tree = DecisionTreeClassifier(max_depth=3)
tree.fit(X, y)

# A decision tree splits on one feature at a time, for example
# (illustrative thresholds, not the fitted values):
# if feature_1 > 1.5:
#     if feature_2 > 2.0: → class 1
#     else:               → class 0
# else:                   → class 0
# Result: rectangular regions in feature space
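
Rather than reading the splits off by hand, the fitted thresholds can be printed with scikit-learn's `export_text`. The sketch below is one way to do that; the dataset is regenerated so the block runs on its own, and the feature names `feature_1` and `feature_2` are labels chosen here for readability.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Two Gaussian clusters, similar in spirit to the earlier example
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(3, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)

tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

# Each printed rule is a threshold on a single feature,
# i.e. a horizontal or vertical cut in the 2D plane
rules = export_text(tree, feature_names=["feature_1", "feature_2"])
print(rules)
```

Every path from the root to a leaf describes one axis-aligned rectangle and the class predicted inside it.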

Random Forest — Complex Piecewise Boundaries

A random forest averages the predictions of many decision trees, each trained on a bootstrap sample of the data and therefore cutting feature space into different rectangles. The combined boundary is piecewise and can approximate non-linear shapes.

Python

from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=100, max_depth=5)
rf.fit(X, y)
# Flexible boundary: still built from axis-aligned pieces, but fine-grained enough to follow curved shapes
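
The averaging can be checked directly. The sketch below (dataset regenerated so it runs on its own; `per_tree` and `averaged` are names introduced here) takes the class probabilities from each tree in `rf.estimators_`, averages them, and compares the result with the forest's own `predict_proba`.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Two Gaussian clusters, similar in spirit to the earlier example
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(3, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)

rf = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=0).fit(X, y)

# Class probabilities from every individual tree: shape (n_trees, n_samples, n_classes)
per_tree = np.stack([t.predict_proba(X) for t in rf.estimators_])

# Averaging over trees should reproduce the forest's probabilities
averaged = per_tree.mean(axis=0)
matches = np.allclose(averaged, rf.predict_proba(X))
print("Average of trees equals forest output:", matches)
```

Because each tree contributes a slightly different set of rectangles, the averaged probability changes in many small steps, which is why the forest's boundary looks smoother than a single tree's.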

RBF Kernel SVM — Smooth Non-Linear Boundary

Python

from sklearn.svm import SVC

svm = SVC(kernel="rbf", C=1.0, gamma="scale")
svm.fit(X, y)
# "Radial Basis Function": boundary can be circular, elliptical, or complex
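# C trades margin width against training errors (larger C → narrower margin, fewer violations)
# gamma sets how far each training point's influence reaches (larger gamma → boundary hugs the training points)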
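
To see the effect of `gamma`, the following sketch fits the same RBF SVM with three values on a noisy two-moons dataset. The dataset, the split, and the particular gamma values are assumptions picked for illustration; exact numbers will vary with the data.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Noisy, non-linearly separable data (illustrative parameters)
X_m, y_m = make_moons(n_samples=300, noise=0.3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_m, y_m, test_size=0.3, random_state=0)

results = {}
for gamma in (0.01, 1.0, 100.0):
    clf = SVC(kernel="rbf", C=1.0, gamma=gamma).fit(X_tr, y_tr)
    results[gamma] = (
        clf.score(X_tr, y_tr),       # training accuracy
        clf.score(X_te, y_te),       # held-out accuracy
        int(clf.n_support_.sum()),   # number of support vectors
    )
    train_acc, test_acc, n_sv = results[gamma]
    print(f"gamma={gamma:>6}: train={train_acc:.2f} test={test_acc:.2f} support vectors={n_sv}")
```

A very small gamma gives a nearly linear boundary, while a very large gamma lets the boundary wrap around individual training points, so training accuracy rises without any guarantee that held-out accuracy follows.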

Neural Network — Arbitrary Boundary

Python

from sklearn.neural_network import MLPClassifier

mlp = MLPClassifier(hidden_layer_sizes=(16, 8), max_iter=1000)
mlp.fit(X, y)
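# With enough hidden units, an MLP can approximate any continuous function to arbitrary accuracy
# (the Universal Approximation Theorem). The theorem says such a network exists;
# it does not guarantee that training will find it.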

The XOR Problem — Why Linear Boundaries Fail

Python

# XOR: linearly inseparable — no straight line can separate these classes
X_xor = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y_xor = np.array([0, 1, 1, 0])   # XOR: different → 1, same → 0

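lr = LogisticRegression()
lr.fit(X_xor, y_xor)
print(lr.score(X_xor, y_xor))   # typically 0.5; no straight line gets more than 3 of the 4 XOR points right

# solver="lbfgs" tends to suit very small datasets; random_state fixes the initial weights
mlp = MLPClassifier(hidden_layer_sizes=(4,), solver="lbfgs", max_iter=2000, random_state=0)
mlp.fit(X_xor, y_xor)
print(mlp.score(X_xor, y_xor))  # usually 1.0 (a non-linear boundary), but a network this small can get stuck for some seeds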
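
A neural network is not the only way out. As a hedged illustration, the sketch below keeps logistic regression but adds a hand-made interaction feature `x1 * x2`, which makes XOR linearly separable in the expanded space. The large `C` (weak regularization) and the name `X_aug` are choices made for this example.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

X_xor = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y_xor = np.array([0, 1, 1, 0])

# Append the product x1*x2 as a third feature
X_aug = np.column_stack([X_xor, X_xor[:, 0] * X_xor[:, 1]])

# Weak regularization so the model can fit all four points
lr_aug = LogisticRegression(C=1e6, max_iter=1000).fit(X_aug, y_xor)
aug_score = lr_aug.score(X_aug, y_xor)
print(aug_score)
```

The boundary is still a flat plane in the three-dimensional feature space; projected back onto the original two features it becomes a curve. For instance, the plane x1 + x2 - 2·x1·x2 = 0.5 separates the four points correctly.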

Overfitting and Decision Boundaries

A model with too much capacity memorizes training data — its decision boundary becomes overly complex, fitting every training point exactly but generalizing poorly.

High bias (underfitting): boundary too simple for the data
  → both training and test accuracy are poor

High variance (overfitting): boundary too complex, follows every noise point
  → training accuracy is great, test accuracy is poor
  
Just right (good generalization): boundary captures the true pattern
  → both training and test accuracy are good
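
These three regimes can be reproduced by varying a single capacity knob. The sketch below uses decision tree depth on a noisy two-moons dataset; the dataset, the split, and the depths are assumptions chosen for illustration, and the exact accuracies depend on them.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Noisy data so that a perfect training fit must include noise
X_m, y_m = make_moons(n_samples=400, noise=0.35, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_m, y_m, test_size=0.3, random_state=0)

scores = {}
for depth in (1, 4, None):   # None lets the tree grow until every leaf is pure
    clf = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_tr, y_tr)
    scores[depth] = (clf.score(X_tr, y_tr), clf.score(X_te, y_te))
    print(f"max_depth={depth}: train={scores[depth][0]:.2f} test={scores[depth][1]:.2f}")
```

A depth of 1 allows only a single cut (underfitting), while the unrestricted tree carves out a region for every noisy point and reaches perfect training accuracy. Comparing the train and test columns shows which regime each setting falls into.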

Soft vs Hard Decision Boundaries

Python

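# Reuses `model`, the logistic regression fitted earlier.
# Assumption: X_test holds new samples; these five points are made up for illustration.
X_test = np.array([[0.2, -0.5], [3.1, 2.7], [2.0, 2.2], [0.8, 0.9], [3.5, 3.9]])

# Hard: predict a single class label
y_pred_hard = model.predict(X_test)   # array of 0/1 labels, e.g. [0, 1, 1, 0, 1]

# Soft: predict class probability (more information than the label alone)
y_pred_soft = model.predict_proba(X_test)[:, 1]   # P(class 1), e.g. [0.12, 0.89, 0.76, 0.23, 0.94]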

# Near the boundary (probability ~0.5): model is uncertain
# Far from the boundary (probability near 0 or 1): model is confident

# In clinical AI: flag uncertain predictions for human review
for i, prob in enumerate(y_pred_soft):
    if 0.3 <= prob <= 0.7:
        print(f"Sample {i}: UNCERTAIN (prob={prob:.2f}) → escalate to clinician")

Decision Boundary Summary

| Algorithm | Boundary Type | Flexibility |
|---|---|---|
| Logistic Regression | Linear hyperplane | Low |
| Linear SVM | Linear hyperplane | Low |
| Decision Tree | Axis-aligned rectangles | Medium |
| Random Forest | Complex piecewise | High |
| RBF Kernel SVM | Smooth non-linear | High |
| Neural Network | Arbitrary smooth | Very high |

Interview Answer Template

Q: What is a decision boundary?

A decision boundary is the surface in feature space where the model transitions from predicting one class to another. For logistic regression, it's the hyperplane where probability equals exactly 0.5 — a straight line in 2D. Linear models can only draw straight boundaries, which fails when classes aren't linearly separable (like the XOR problem). Non-linear models like decision trees create axis-aligned rectangular boundaries, SVMs with RBF kernels create smooth curved boundaries, and neural networks can approximate arbitrarily complex boundaries — which is both their strength (flexibility) and weakness (overfitting risk). In practice, I'd start with a linear model as a baseline, then add complexity only if the linear model clearly underperforms.
