What is Underfitting?

Underfitting occurs when a model is too simple to capture the patterns in the data. It performs poorly on both training data and new data — it hasn't learned enough.

Underfitting model:
  Training loss:   high  (can't even fit training data)
  Validation loss: high  (no patterns learned)
  Gap: small (both are bad)

Compare to overfitting:
  Training loss:   very low  (memorized training)
  Validation loss: high      (can't generalize)
  Gap: large

A Concrete Example

Python

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

np.random.seed(42)

# Create a clearly non-linear pattern
X = np.random.randn(500, 2)
y = ((X[:, 0]**2 + X[:, 1]**2) < 1).astype(int)   # Circular boundary

# Hold out 25% of the data for validation
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.25)

# Underfitting: linear model for a circular boundary
linear = LogisticRegression()
linear.fit(X_train, y_train)
print(f"Logistic (linear) — Train: {accuracy_score(y_train, linear.predict(X_train)):.2%}")
print(f"Logistic (linear) — Val:   {accuracy_score(y_val,   linear.predict(X_val)):.2%}")
# Expect both scores near the majority-class rate (roughly 60% here, since about 39%
# of points fall inside the circle): both bad, small gap (underfitting)

# Better fit: flexible model
rf = RandomForestClassifier(n_estimators=100)
rf.fit(X_train, y_train)
print(f"RandomForest       — Train: {accuracy_score(y_train, rf.predict(X_train)):.2%}")
print(f"RandomForest       — Val:   {accuracy_score(y_val,   rf.predict(X_val)):.2%}")
# Typically train near 100% and val in the mid-to-high 90s: much better
# (exact values vary from run to run)

Causes of Underfitting

| Cause | Explanation |
|---|---|
| Model too simple | Linear model for non-linear data, too few neurons |
| Too little training | Not enough epochs for the model to converge |
| Too much regularization | Regularization penalizes the model so strongly it can't learn |
| Missing features | The model doesn't have access to the information it needs |
| Bad features | Features don't encode the relevant signal for the task |

Diagnosing Underfitting

Python

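def diagnose_fit(train_score: float, val_score: float, baseline: float) -> str:
    """
    Classify a train/val score pair using rule-of-thumb thresholds.

    baseline: chance level (e.g., 0.5 for balanced binary)
              or majority class accuracy for imbalanced data.
    The cutoffs below are illustrative; tune them for your task and metric.
    """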
    gap = train_score - val_score

    if train_score <= baseline + 0.05:
        return "UNDERFITTING — model is no better than a naive baseline"
    elif train_score < 0.75 and val_score < 0.70:
        return "UNDERFITTING — both train and val scores are too low"
    elif gap < 0.05 and train_score > 0.85:
        return "GOOD FIT"
    elif gap > 0.10:
        return "OVERFITTING — large train/val gap"
    else:
        return "MILD OVERFITTING — acceptable if val score is good enough"

print(diagnose_fit(0.71, 0.69, baseline=0.50))   # UNDERFITTING
print(diagnose_fit(1.00, 0.62, baseline=0.50))   # OVERFITTING
print(diagnose_fit(0.91, 0.89, baseline=0.50))   # GOOD FIT
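
A single train/val pair can mislead, so it may help to look at how both scores change as the training set grows. The sketch below uses scikit-learn's `learning_curve` on data shaped like the concrete example. The data is regenerated here with `np.random.default_rng(42)` so the block is self-contained, which is an assumption of this sketch and will not reproduce the article's exact split. For an underfitting model, both curves are expected to level off at a low score with a small gap, and adding data does not help.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

# Data with a circular class boundary, as in the concrete example.
rng = np.random.default_rng(42)
X = rng.standard_normal((500, 2))
y = ((X[:, 0] ** 2 + X[:, 1] ** 2) < 1).astype(int)

# Score a linear model on growing training subsets with 5-fold CV.
sizes, train_scores, val_scores = learning_curve(
    LogisticRegression(),
    X,
    y,
    train_sizes=np.linspace(0.2, 1.0, 5),
    cv=5,
    scoring="accuracy",
)

# Average over the CV folds for each training-set size.
train_mean = train_scores.mean(axis=1)
val_mean = val_scores.mean(axis=1)

for n, tr, va in zip(sizes, train_mean, val_mean):
    print(f"n={n:3d}  train={tr:.2%}  val={va:.2%}  gap={tr - va:+.2%}")
```

If both columns stay low and close together at every size, more data alone is unlikely to fix the problem; the model or the features need to change.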

Fixes for Underfitting

1. Use a More Complex Model

Python

# Linear → tree → ensemble
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import GradientBoostingClassifier

# Try progressively more complex models
models = [
    ("Logistic Regression", LogisticRegression()),
    ("Decision Tree (d=3)",  DecisionTreeClassifier(max_depth=3)),
    ("Decision Tree (d=10)", DecisionTreeClassifier(max_depth=10)),
    ("Gradient Boosting",    GradientBoostingClassifier()),
]

for name, model in models:
    model.fit(X_train, y_train)
    print(f"{name}: val={accuracy_score(y_val, model.predict(X_val)):.2%}")

2. Train Longer

Python

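# Neural networks (Keras-style call, shown as a comment; keras_model is a placeholder name)
# keras_model.fit(X_train, y_train, epochs=200)   # Was 20

# Boosted ensembles in scikit-learn: more boosting rounds via n_estimators
# (extra trees in a random forest mainly reduce variance, not bias)
GradientBoostingClassifier(n_estimators=500)  # Was 50

# Iterative solvers: raise max_iter so the optimizer can converge
LogisticRegression(max_iter=1000)         # Was 100 (default)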
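
To see the effect of training length without refitting many models, one option is to fit a boosted ensemble once and score it after each boosting round. The sketch below does this with `GradientBoostingClassifier.staged_predict`. The data is regenerated with `np.random.default_rng(42)` and split with `random_state=0`, both of which are choices made for this illustration rather than part of the original example.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Data with a circular class boundary, as in the concrete example.
rng = np.random.default_rng(42)
X = rng.standard_normal((500, 2))
y = ((X[:, 0] ** 2 + X[:, 1] ** 2) < 1).astype(int)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=0
)

gb = GradientBoostingClassifier(n_estimators=200, random_state=0)
gb.fit(X_train, y_train)

# staged_predict yields predictions after 1, 2, ..., n_estimators rounds.
train_by_round = [accuracy_score(y_train, p) for p in gb.staged_predict(X_train)]
val_by_round = [accuracy_score(y_val, p) for p in gb.staged_predict(X_val)]

for rounds in (1, 5, 20, 200):
    tr = train_by_round[rounds - 1]
    va = val_by_round[rounds - 1]
    print(f"rounds={rounds:3d}  train={tr:.2%}  val={va:.2%}")
```

A model stopped after only a few rounds shows the underfitting signature (both scores low); the scores should climb as rounds are added.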

3. Reduce Regularization

Python

from sklearn.linear_model import LogisticRegression, Ridge

# C in LogisticRegression is INVERSE of regularization strength
# Smaller C = more regularization = more underfitting risk
LogisticRegression(C=100)   # Weak regularization (was C=0.01)

# alpha in Ridge/Lasso: larger alpha = more regularization
Ridge(alpha=0.0001)          # Weak regularization (was alpha=10.0)
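
The effect of `C` is easiest to see with a sweep. The following is a minimal sketch, assuming the same circular data (regenerated here with `np.random.default_rng(42)`) and a degree-2 polynomial pipeline so that the model is capable of fitting the boundary once regularization is relaxed. The particular `C` values and `random_state=0` are illustrative choices.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Data with a circular class boundary, as in the concrete example.
rng = np.random.default_rng(42)
X = rng.standard_normal((500, 2))
y = ((X[:, 0] ** 2 + X[:, 1] ** 2) < 1).astype(int)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Sweep from very strong regularization (tiny C) to very weak (large C).
scores = {}
for C in (1e-4, 1e-2, 1.0, 100.0):
    clf = make_pipeline(
        PolynomialFeatures(degree=2, include_bias=False),
        LogisticRegression(C=C, max_iter=1000),
    )
    clf.fit(X_train, y_train)
    scores[C] = (clf.score(X_train, y_train), clf.score(X_val, y_val))
    print(f"C={C:<8g} train={scores[C][0]:.2%}  val={scores[C][1]:.2%}")
```

With the smallest `C`, the penalty should push the coefficients toward zero and leave the model close to predicting the majority class, even though the features are sufficient.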

4. Add Better Features

Python

from sklearn.preprocessing import PolynomialFeatures

# Transform features to capture non-linear relationships
poly = PolynomialFeatures(degree=2, include_bias=False)
X_train_poly = poly.fit_transform(X_train)   # Adds x₁², x₂², x₁x₂ features
X_val_poly   = poly.transform(X_val)

linear = LogisticRegression()
linear.fit(X_train_poly, y_train)
print(f"Poly features — Val: {accuracy_score(y_val, linear.predict(X_val_poly)):.2%}")
# The boundary x₁² + x₂² < 1 is linear in the squared features, so accuracy
# should rise sharply here; gains on other non-linear problems vary
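
Generic polynomial expansion is one route; a single hand-crafted feature can work as well when domain knowledge suggests it. The sketch below adds a squared-radius column, a feature chosen for this illustration because the labels were generated from a circle. The data is regenerated with `np.random.default_rng(42)` and split with `random_state=0`, which are assumptions of the sketch.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Data with a circular class boundary, as in the concrete example.
rng = np.random.default_rng(42)
X = rng.standard_normal((500, 2))
y = ((X[:, 0] ** 2 + X[:, 1] ** 2) < 1).astype(int)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=0
)


def add_radius_squared(features: np.ndarray) -> np.ndarray:
    """Append r^2 = x1^2 + x2^2 as an extra column."""
    r2 = (features ** 2).sum(axis=1)
    return np.column_stack([features, r2])


# Baseline: the raw two features only.
raw = LogisticRegression().fit(X_train, y_train)
acc_raw = accuracy_score(y_val, raw.predict(X_val))

# Same model, with the hand-crafted feature added.
eng = LogisticRegression().fit(add_radius_squared(X_train), y_train)
acc_r2 = accuracy_score(y_val, eng.predict(add_radius_squared(X_val)))

print(f"Raw features   val: {acc_raw:.2%}")
print(f"With r^2 added val: {acc_r2:.2%}")
```

The model class is unchanged between the two fits; only the information available to it differs, which corresponds to the "missing features" cause.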

Underfitting vs Overfitting at a Glance

| | Underfitting | Overfitting |
|---|---|---|
| Also called | High bias | High variance |
| Training performance | Poor | Excellent |
| Validation performance | Poor | Poor |
| Train/val gap | Small | Large |
| Cause | Model too simple | Model too complex |
| Fix | Add complexity, better features, less regularization | Simplify, regularize, more data |

Interview Answer Template

Q: What is underfitting and how do you distinguish it from overfitting?

Underfitting (also called high bias) occurs when a model is too simple to capture the patterns in the data — it performs poorly on both training and validation. The key diagnostic difference from overfitting is the train/val gap: underfitting shows both scores are low with a small gap (the model fails everywhere), while overfitting shows a high training score with a low validation score (large gap). Fixes for underfitting include using a more complex model, training longer, reducing regularization that's too strong, or engineering better features. The bias-variance tradeoff means these two problems sit at opposite ends: reducing bias (adding complexity) increases variance risk, and vice versa.
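
The two ends of the tradeoff can be shown in one sweep over model complexity. The sketch below uses scikit-learn's `validation_curve` with decision trees of increasing depth. To make overfitting visible, it flips about 15% of the labels at random; that label noise, the depth grid, and the seed are assumptions added for this illustration and are not part of the original example.

```python
import numpy as np
from sklearn.model_selection import validation_curve
from sklearn.tree import DecisionTreeClassifier

# Circular boundary plus label noise (assumption: 15% of labels flipped).
rng = np.random.default_rng(42)
X = rng.standard_normal((500, 2))
y_clean = ((X[:, 0] ** 2 + X[:, 1] ** 2) < 1).astype(int)
flip = rng.random(500) < 0.15
y = np.where(flip, 1 - y_clean, y_clean)

# Sweep tree depth from very simple to very flexible, with 5-fold CV.
depths = [1, 2, 4, 8, 16]
train_scores, val_scores = validation_curve(
    DecisionTreeClassifier(random_state=0),
    X,
    y,
    param_name="max_depth",
    param_range=depths,
    cv=5,
    scoring="accuracy",
)

train_mean = train_scores.mean(axis=1)
val_mean = val_scores.mean(axis=1)
gaps = train_mean - val_mean

for d, tr, va, gap in zip(depths, train_mean, val_mean, gaps):
    print(f"max_depth={d:2d}  train={tr:.2%}  val={va:.2%}  gap={gap:+.2%}")
```

Shallow trees should show low scores with a small gap (high bias), while the deepest trees should show a high training score and a wide gap (high variance).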
