Learnixo
Back to blog
AI Systemsintermediate

Universal Approximation Theorem

What the universal approximation theorem says, what it doesn't say, and why it matters (and doesn't matter) for practical deep learning.

Asma Hafeez KhanMay 22, 20265 min read
Deep LearningUniversal ApproximationTheoryNeural NetworksInterview
Share:𝕏

The Theorem

Universal Approximation Theorem (Cybenko 1989, Hornik 1991):

A feedforward network with:
  - one hidden layer
  - a sufficient (possibly very large) number of neurons
  - a non-polynomial activation function (sigmoid, ReLU, tanh, ...)
  
can approximate any continuous function on a compact subset of Rⁿ
to arbitrary precision.

In plain English:
  A single hidden layer MLP can approximate any reasonable function —
  given enough neurons and a non-linear activation.

What It Proves (and Doesn't)

✓ What UAT proves:
  1. Neural networks are, in principle, universal function approximators
  2. Non-linearity is necessary — without activation functions, you get a linear model
  3. One hidden layer is theoretically sufficient for any function

✗ What UAT doesn't prove:
  1. How many neurons are needed (may be exponentially many)
  2. That gradient descent will find the approximation
  3. That the approximation generalises to unseen data
  4. That depth is unnecessary (it isn't — depth is far more efficient)
  5. That finite training data is sufficient

The "constructive gap":
  UAT is an existence proof, not a construction recipe.
  It says "a solution exists", not "here's how to find it".

Demonstrating Non-Linearity is Essential

Python
import torch
import torch.nn as nn
import numpy as np

# ── Linear vs Non-Linear on XOR ──
# XOR is not linearly separable  a linear model cannot learn it

X_xor = torch.tensor([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=torch.float)
y_xor = torch.tensor([0, 1, 1, 0], dtype=torch.float)

# Linear model: will fail (XOR is not linearly separable)
linear_model = nn.Linear(2, 1)
criterion = nn.BCEWithLogitsLoss()
opt = torch.optim.Adam(linear_model.parameters(), lr=0.1)

for _ in range(500):
    opt.zero_grad()
    loss = criterion(linear_model(X_xor).squeeze(), y_xor)
    loss.backward()
    opt.step()

with torch.no_grad():
    preds = torch.sigmoid(linear_model(X_xor).squeeze())
    print("Linear model predictions:", preds.numpy().round(2))
    print("Linear accuracy:", ((preds > 0.5).float() == y_xor).float().mean().item())

# Non-linear model: easily learns XOR
nonlinear_model = nn.Sequential(
    nn.Linear(2, 4),
    nn.ReLU(),
    nn.Linear(4, 1),
)
opt2 = torch.optim.Adam(nonlinear_model.parameters(), lr=0.1)

for _ in range(1000):
    opt2.zero_grad()
    loss = criterion(nonlinear_model(X_xor).squeeze(), y_xor)
    loss.backward()
    opt2.step()

with torch.no_grad():
    preds2 = torch.sigmoid(nonlinear_model(X_xor).squeeze())
    print("\nNonlinear model predictions:", preds2.numpy().round(2))
    print("Nonlinear accuracy:", ((preds2 > 0.5).float() == y_xor).float().mean().item())

Approximating a 1D Function

Python
import torch
import torch.nn as nn
import numpy as np

def approximate_function(
    fn,
    n_hidden: int = 50,
    n_samples: int = 1000,
    n_epochs: int = 5000,
) -> nn.Module:
    """Train a 1-hidden-layer network to approximate fn on [-π, π]."""
    # Training data
    x = torch.linspace(-np.pi, np.pi, n_samples).unsqueeze(1)  # (1000, 1)
    y = fn(x)
    
    model = nn.Sequential(
        nn.Linear(1, n_hidden),
        nn.Tanh(),              # UAT requires non-polynomial activation
        nn.Linear(n_hidden, 1),
    )
    
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    criterion = nn.MSELoss()
    
    for epoch in range(n_epochs):
        optimizer.zero_grad()
        loss = criterion(model(x), y)
        loss.backward()
        optimizer.step()
        
        if epoch % 1000 == 0:
            print(f"Epoch {epoch}: loss = {loss.item():.6f}")
    
    return model

import math

# Approximate sin(x)  non-trivial function
model_sin = approximate_function(fn=torch.sin, n_hidden=20)

# Test
x_test = torch.tensor([[0.0], [np.pi/2], [np.pi], [-np.pi/2]])
with torch.no_grad():
    preds = model_sin(x_test)
    true  = torch.sin(x_test)
    for xi, pi, ti in zip(x_test, preds, true):
        print(f"x={xi.item():.3f}: pred={pi.item():.3f}, true={ti.item():.3f}")

Width vs Depth for Approximation Efficiency

Python
import torch
import torch.nn as nn

def count_params(model: nn.Module) -> int:
    return sum(p.numel() for p in model.parameters())

# Function to approximate: sin(x₁) + cos(x₂) + x₁x₂ (2D)

# Wide, shallow (UAT-style)
wide_shallow = nn.Sequential(
    nn.Linear(2, 1000),     # very wide single hidden layer
    nn.Tanh(),
    nn.Linear(1000, 1),
)

# Deep, narrow (practical DL)
deep_narrow = nn.Sequential(
    nn.Linear(2, 32), nn.Tanh(),
    nn.Linear(32, 32), nn.Tanh(),
    nn.Linear(32, 32), nn.Tanh(),
    nn.Linear(32, 32), nn.Tanh(),
    nn.Linear(32, 1),
)

print(f"Wide-shallow: {count_params(wide_shallow):,} params")
print(f"Deep-narrow:  {count_params(deep_narrow):,} params")

# Deep networks can represent exponentially more functions than shallow ones
# for the same parameter count  this is why depth works in practice
# (Requires 2^n width to match n-layer deep network for some function classes)

Practical Implications

What UAT means for practitioners:

1. Don't be too shallow: one layer can do it in theory, but in practice
   it requires enormous width and doesn't generalise.
   → Use 2–4 layers for most problems.

2. Activation functions are mandatory: a linear stack of layers is just one linear layer.
   → Always use non-linear activations in hidden layers.

3. Capacity is bounded by architecture:
   If your model is consistently underfitting, increase capacity
   (more neurons or more layers) before doing other things.

4. UAT doesn't solve optimisation:
   A big enough network can approximate anything, but gradient descent
   may not find that approximation. Architecture, initialisation,
   and training dynamics all matter.

5. UAT doesn't solve generalisation:
   A network that perfectly approximates the training data
   may not generalise. Regularisation (Dropout, BatchNorm, weight decay)
   is necessary for practical deployment.

Interview Answer

"The Universal Approximation Theorem states that a single-hidden-layer feedforward network with enough neurons and a non-polynomial activation function can approximate any continuous function to arbitrary precision. It's an existence theorem, not a constructive one — it proves a solution exists but doesn't say how many neurons are needed (potentially exponentially many), whether gradient descent will find it, or whether it generalises to unseen data. Its main practical implication: non-linearity is essential (a linear MLP is just one linear transformation regardless of depth); a network with zero hidden layers cannot learn non-linear patterns like XOR. Depth is more efficient than width for complex functions — deeper networks represent exponentially more function classes per parameter. In practice, universal approximation is too loose to guide architecture design; empirical results, inductive biases (CNNs for images, transformers for sequences), and ablation studies matter far more."

Enjoyed this article?

Explore the AI Systems learning path for more.

Found this helpful?

Share:𝕏

Leave a comment

Have a question, correction, or just found this helpful? Leave a note below.