What is Underfitting?
Understand underfitting in machine learning: high bias, why models fail to learn, how to detect it, and the fixes — more complexity, better features, less regularization, more training.
Underfitting occurs when a model is too simple to capture the patterns in the data. It performs poorly on both training data and new data — it hasn't learned enough.
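As a minimal sketch of this definition, the snippet below fits a straight line and a parabola to data that is quadratic by construction. The synthetic data, the holdout split, and the `mse` helper are assumptions made for illustration only. The straight line should show a high error on both splits, while the parabola should fit both well.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (assumed for illustration): y depends on x quadratically, plus noise.
x = rng.uniform(-3, 3, size=200)
y = x**2 + rng.normal(scale=0.5, size=200)

# Simple holdout split: first 150 points for training, last 50 for validation.
x_train, x_val = x[:150], x[150:]
y_train, y_val = y[:150], y[150:]


def mse(coeffs, xs, ys):
    """Mean squared error of the polynomial with the given coefficients."""
    return float(np.mean((np.polyval(coeffs, xs) - ys) ** 2))


line = np.polyfit(x_train, y_train, deg=1)      # too simple for a parabola
parabola = np.polyfit(x_train, y_train, deg=2)  # matches the true shape

line_train, line_val = mse(line, x_train, y_train), mse(line, x_val, y_val)
quad_train, quad_val = mse(parabola, x_train, y_train), mse(parabola, x_val, y_val)

print(f"degree 1: train MSE={line_train:.2f}, val MSE={line_val:.2f}")
print(f"degree 2: train MSE={quad_train:.2f}, val MSE={quad_val:.2f}")
```

The degree-1 model is bad on data it has already seen, which is the signature of underfitting rather than a generalization problem.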
Underfitting model:
Training loss: high (can't even fit training data)
Validation loss: high (no patterns learned)
Gap: small (both are bad)
Compare to overfitting:
Training loss: very low (memorized training)
Validation loss: high (can't generalize)
Gap: large
A Concrete Example
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
import numpy as np
np.random.seed(42)
# Create a clearly non-linear pattern
X = np.random.randn(500, 2)
y = ((X[:, 0]**2 + X[:, 1]**2) < 1).astype(int) # Circular boundary
from sklearn.model_selection import train_test_split
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.25)
# Underfitting: linear model for a circular boundary
linear = LogisticRegression()
linear.fit(X_train, y_train)
print(f"Logistic (linear) — Train: {accuracy_score(y_train, linear.predict(X_train)):.2%}")
print(f"Logistic (linear) — Val: {accuracy_score(y_val, linear.predict(X_val)):.2%}")
# Both scores land near the majority-class rate (about 61% for this data),
# with a small gap between them: underfitting
# Better fit: flexible model
rf = RandomForestClassifier(n_estimators=100)
rf.fit(X_train, y_train)
print(f"RandomForest — Train: {accuracy_score(y_train, rf.predict(X_train)):.2%}")
print(f"RandomForest — Val: {accuracy_score(y_val, rf.predict(X_val)):.2%}")
# Train: ~98%, Val: ~95% (much better)
Causes of Underfitting
| Cause | Explanation |
|---|---|
| Model too simple | Linear model for non-linear data, too few neurons |
| Too little training | Not enough epochs for the model to converge |
| Too much regularization | Regularization penalizes the model so strongly it can't learn |
| Missing features | The model doesn't have access to the information it needs |
| Bad features | Features don't encode the relevant signal for the task |
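To make the "too much regularization" row concrete, here is a small sketch that sweeps the `alpha` penalty of scikit-learn's `Ridge` on synthetic linear data. The data, the `true_w` weights, and the chosen `alpha` values are assumptions for illustration. As the penalty grows, the coefficients shrink toward zero and the R² score should fall on both splits.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic linear regression problem (assumed for illustration).
X_reg = rng.normal(size=(400, 5))
true_w = np.array([3.0, -2.0, 1.5, 0.0, 4.0])
y_reg = X_reg @ true_w + rng.normal(scale=0.5, size=400)

X_tr, X_va, y_tr, y_va = train_test_split(
    X_reg, y_reg, test_size=0.25, random_state=0
)

ridge_scores = {}
for alpha in [0.01, 1.0, 100.0, 100_000.0]:
    ridge = Ridge(alpha=alpha).fit(X_tr, y_tr)
    # .score() returns R^2 for regressors
    ridge_scores[alpha] = (ridge.score(X_tr, y_tr), ridge.score(X_va, y_va))
    print(
        f"alpha={alpha:>9}: train R2={ridge_scores[alpha][0]:.3f}, "
        f"val R2={ridge_scores[alpha][1]:.3f}"
    )
```

With the largest penalty, the training score itself collapses, so collecting more data would not help; the penalty has to come down.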
Diagnosing Underfitting
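A diagnosis is only meaningful relative to a naive baseline. One way to get that baseline, sketched here under the assumption that accuracy is the metric, is scikit-learn's `DummyClassifier` with `strategy="most_frequent"`, which always predicts the majority class. The data reuses the circular rule from the earlier example; the variable names are hypothetical.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)

# Same circular decision rule as the earlier example (regenerated here so the
# snippet is self-contained).
X_demo = rng.normal(size=(500, 2))
y_demo = ((X_demo[:, 0] ** 2 + X_demo[:, 1] ** 2) < 1).astype(int)

X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0
)

# Always predicts the most frequent training label, ignoring the features.
dummy = DummyClassifier(strategy="most_frequent").fit(X_tr, y_tr)
baseline = dummy.score(X_va, y_va)

# The dummy's training accuracy equals the majority-class rate by construction.
majority_rate = float(max(np.mean(y_tr), 1 - np.mean(y_tr)))

print(f"Majority-class rate (train): {majority_rate:.2%}")
print(f"Baseline accuracy (val):     {baseline:.2%}")
```

A model whose training accuracy sits close to `baseline` has learned essentially nothing from the features, however respectable the raw number looks.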
def diagnose_fit(train_score: float, val_score: float, baseline: float) -> str:
    """
    baseline: chance level (e.g., 0.5 for balanced binary)
              or majority class accuracy for imbalanced data
    """
    # The thresholds below are illustrative; tune them for your metric and task.
    gap = train_score - val_score
    if train_score <= baseline + 0.05:
        return "UNDERFITTING: model is no better than a naive baseline"
    elif train_score < 0.75 and val_score < 0.70:
        return "UNDERFITTING: both train and val scores are too low"
    elif gap < 0.05 and train_score > 0.85:
        return "GOOD FIT"
    elif gap > 0.10:
        return "OVERFITTING: large train/val gap"
    else:
        return "MILD OVERFITTING: acceptable if val score is good enough"

print(diagnose_fit(0.71, 0.69, baseline=0.50))  # UNDERFITTING
print(diagnose_fit(1.00, 0.62, baseline=0.50))  # OVERFITTING
print(diagnose_fit(0.91, 0.89, baseline=0.50))  # GOOD FIT
Fixes for Underfitting
1. Use a More Complex Model
# Linear → tree → ensemble
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import GradientBoostingClassifier
# Try progressively more complex models
models = [
    ("Logistic Regression", LogisticRegression()),
    ("Decision Tree (d=3)", DecisionTreeClassifier(max_depth=3)),
    ("Decision Tree (d=10)", DecisionTreeClassifier(max_depth=10)),
    ("Gradient Boosting", GradientBoostingClassifier()),
]
for name, model in models:
    model.fit(X_train, y_train)
    print(f"{name}: val={accuracy_score(y_val, model.predict(X_val)):.2%}")
2. Train Longer
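# Neural networks (Keras-style API, where `model` is a neural network): increase epochs
model.fit(X_train, y_train, epochs=200) # Was 20
# Boosted ensembles in scikit-learn: more boosting rounds reduce bias
# (adding trees to a random forest mainly reduces variance, so it rarely fixes underfitting)
GradientBoostingClassifier(n_estimators=500) # Was 50
# Iterative solvers: raise max_iter if training stops before convergence
LogisticRegression(max_iter=1000) # Was 100 (default)
3. Reduce Regularization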
from sklearn.linear_model import Ridge
# C in LogisticRegression is INVERSE of regularization strength
# Smaller C = more regularization = more underfitting risk
LogisticRegression(C=100) # Weak regularization (was C=0.01)
# alpha in Ridge/Lasso: larger alpha = more regularization
Ridge(alpha=0.0001) # Weak regularization (was alpha=10.0)
4. Add Better Features
from sklearn.preprocessing import PolynomialFeatures
# Transform features to capture non-linear relationships
poly = PolynomialFeatures(degree=2, include_bias=False)
X_train_poly = poly.fit_transform(X_train) # Adds x₁², x₂², x₁x₂ features
X_val_poly = poly.transform(X_val)
linear = LogisticRegression()
linear.fit(X_train_poly, y_train)
print(f"Poly features — Val: {accuracy_score(y_val, linear.predict(X_val_poly)):.2%}")
# Often dramatically improves performance on non-linear problems
Underfitting vs Overfitting at a Glance
| | Underfitting | Overfitting |
|---|---|---|
| Also called | High bias | High variance |
| Training performance | Poor | Excellent |
| Validation performance | Poor | Poor |
| Train/val gap | Small | Large |
| Cause | Model too simple | Model too complex |
| Fix | Add complexity, better features, less regularization | Simplify, regularize, more data |
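Both columns of the table can be seen in a single sweep over model complexity. The sketch below uses scikit-learn's `validation_curve` to vary `max_depth` of a decision tree on noisy circular data. The data, the 10% label noise, and the depth grid are assumptions chosen for illustration. Shallow trees should show low scores with a small gap, and deep trees a high training score with a large gap.

```python
import numpy as np
from sklearn.model_selection import validation_curve
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Circular decision rule with about 10% of labels flipped (assumed for
# illustration) so that very deep trees have noise to memorize.
X_circ = rng.normal(size=(600, 2))
y_circ = ((X_circ[:, 0] ** 2 + X_circ[:, 1] ** 2) < 1).astype(int)
flip = rng.random(600) < 0.10
y_circ = np.where(flip, 1 - y_circ, y_circ)

depths = [1, 2, 4, 8, 16]
train_scores, val_scores = validation_curve(
    DecisionTreeClassifier(random_state=0),
    X_circ,
    y_circ,
    param_name="max_depth",
    param_range=depths,
    cv=5,
    scoring="accuracy",
)

# Each row holds one depth; average over the 5 cross-validation folds.
train_mean = train_scores.mean(axis=1)
val_mean = val_scores.mean(axis=1)

for depth, tr, va in zip(depths, train_mean, val_mean):
    print(f"max_depth={depth:>2}: train={tr:.2%}, val={va:.2%}, gap={tr - va:.2%}")
```

Reading the output from top to bottom moves from the underfitting column toward the overfitting column; a reasonable depth is usually where the validation score peaks.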
Interview Answer Template
Q: What is underfitting and how do you distinguish it from overfitting?
Underfitting (also called high bias) occurs when a model is too simple to capture the patterns in the data — it performs poorly on both training and validation. The key diagnostic difference from overfitting is the train/val gap: underfitting shows both scores are low with a small gap (the model fails everywhere), while overfitting shows a high training score with a low validation score (large gap). Fixes for underfitting include using a more complex model, training longer, reducing regularization that's too strong, or engineering better features. The bias-variance tradeoff means these two problems sit at opposite ends: reducing bias (adding complexity) increases variance risk, and vice versa.