AI Systems · Intermediate

Multi-Stage Docker Builds for AI Apps

Understand multi-stage Docker builds and how they dramatically reduce AI API image sizes — from 2.1 GB down to 480 MB — while keeping your runtime image clean, secure, and free of compilers.

Asma Hafeez Khan · May 15, 2026 · 10 min read
LLMOps · Docker · Multi-Stage Builds · AI · DevOps · PyTorch

The Problem: AI Images Are Huge

Write a naive Dockerfile for an AI service that uses torch, transformers, and sentence-transformers, and the resulting image will be somewhere between 2 and 5 GB. That's not an exaggeration — torch alone is over 700 MB for the CUDA variant.

Why does this matter in production?

  • Cold start latency: A container orchestrator pulling a 2.1 GB image takes 60–90 seconds on a warm node and several minutes on a cold node.
  • Attack surface: Every library, compiler, and header file in the image is a potential vulnerability. The builder stage needs gcc and make. The runtime does not.
  • Registry costs: Storing and transferring 2.1 GB images across regions adds up.
  • Deploy frequency: Slower builds and pulls discourage frequent deployments. Smaller images make continuous deployment practical.

Multi-stage builds solve all of this by separating the build environment from the runtime environment in a single Dockerfile.


How Multi-Stage Builds Work

A multi-stage Dockerfile has multiple FROM instructions. Each FROM starts a new stage. Earlier stages can be referenced by later stages using COPY --from=<stage>.

DOCKERFILE
# Stage 1: builder (has compilers, build tools)
FROM python:3.11 AS builder
# ... install build deps, compile wheels ...

# Stage 2: runtime (tiny, no compilers)
FROM python:3.11-slim AS runtime
# COPY only the built artifacts from builder
COPY --from=builder /app/wheels /wheels
# ... run pip install from local wheels, start app ...

Docker builds both stages, but the final image contains only the last stage's layers. All the compiler tools, source files, and intermediate build artefacts from the builder stage are discarded. They never appear in the final image — not even in the layer history.


Image Size Comparison

Here's what you're working with for a pharmabot-style service that uses sentence-transformers for local embeddings and calls Azure OpenAI for generation:

| Build approach | Image size | Notes |
|---|---|---|
| Naive python:3.11 | 2.1 GB | Full Debian + compilers stay in the image |
| Single-stage python:3.11-slim | 820 MB | No compilers, but torch wheels need build deps |
| Multi-stage (CPU-only torch) | 480 MB | Runtime has no compilers; CPU torch variant |
| Multi-stage + no local inference | 180 MB | If you only call Azure OpenAI (no local models) |

The jump from 2.1 GB to 480 MB is the result of:

  1. Switching to the CPU-only PyTorch variant (saves ~1 GB)
  2. Discarding build tools from the runtime stage (saves ~400 MB)
  3. Using python:3.11-slim as the runtime base (saves ~300 MB)

The Full Multi-Stage Dockerfile

DOCKERFILE
# ════════════════════════════════════════════════════════════════════════════
# Stage 1: builder
# Purpose: Install dependencies, compile any C extensions, produce wheels.
# This stage is DISCARDED from the final image.
# ════════════════════════════════════════════════════════════════════════════
FROM python:3.11 AS builder

# Prevent pyc files and unbuffered output
ENV PYTHONDONTWRITEBYTECODE=1
ENV PYTHONUNBUFFERED=1

WORKDIR /build

# Install build dependencies for C extensions
# (gcc needed for psycopg2 from source, cryptography, etc.)
RUN apt-get update && apt-get install -y --no-install-recommends \
        gcc \
        g++ \
        libpq-dev \
        libffi-dev \
        libssl-dev \
    && rm -rf /var/lib/apt/lists/*

# Upgrade pip and install wheel builder
RUN pip install --upgrade pip wheel

# Copy requirements and build wheels into /wheels
# Building wheels means the runtime stage can install without any compilers.
COPY requirements.txt .
RUN pip wheel --no-cache-dir --wheel-dir /wheels -r requirements.txt


# ════════════════════════════════════════════════════════════════════════════
# Stage 2: runtime
# Purpose: The actual production image. Tiny, no compilers, no source files.
# ════════════════════════════════════════════════════════════════════════════
FROM python:3.11-slim AS runtime

ENV PYTHONDONTWRITEBYTECODE=1
ENV PYTHONUNBUFFERED=1

WORKDIR /app

# Create non-root user
RUN groupadd --gid 1001 appgroup && \
    useradd --uid 1001 --gid appgroup --no-create-home appuser

# Install only the runtime OS libraries needed (not build-time headers)
# libpq5  = PostgreSQL client runtime (psycopg2 needs this)
# libffi8 = Foreign function interface runtime (cryptography, cffi)
RUN apt-get update && apt-get install -y --no-install-recommends \
        libpq5 \
        libffi8 \
        curl \
    && rm -rf /var/lib/apt/lists/*

# ── The key step: copy pre-built wheels from the builder stage ────────────
COPY --from=builder /wheels /wheels

# Install from local wheels — no compiler needed, no internet needed
RUN pip install --upgrade pip && \
    pip install --no-cache-dir --no-index --find-links=/wheels /wheels/*.whl && \
    rm -rf /wheels

# Copy application code
COPY app/ ./app/

RUN chown -R appuser:appgroup /app

USER appuser

EXPOSE 8000

ENV WORKERS="4"
ENV TIMEOUT="120"

CMD ["sh", "-c", \
     "gunicorn app.main:app \
      --workers $WORKERS \
      --worker-class uvicorn.workers.UvicornWorker \
      --bind 0.0.0.0:8000 \
      --timeout $TIMEOUT \
      --access-logfile -"]

Handling PyTorch: CPU-Only vs GPU Variants

PyTorch is the largest single dependency in most AI images. The default torch package on PyPI includes full CUDA support and weighs over 700 MB compressed. For a service that only calls Azure OpenAI (no local inference), you don't need PyTorch at all.

Depending on whether your service runs local models (embedding generation, re-ranking, intent classification) and on what hardware, you have three options:

Option A: CPU-only PyTorch (recommended for API services)

The CPU-only wheel is dramatically smaller. Use a custom requirements file:

# requirements-torch-cpu.txt
--extra-index-url https://download.pytorch.org/whl/cpu
torch==2.3.0+cpu
torchvision==0.18.0+cpu

In the Dockerfile builder stage:

DOCKERFILE
COPY requirements-base.txt requirements-torch-cpu.txt ./

RUN pip wheel --no-cache-dir --wheel-dir /wheels \
    -r requirements-base.txt \
    -r requirements-torch-cpu.txt

This reduces torch from 700 MB to approximately 170 MB.

Option B: CUDA PyTorch (for GPU inference containers)

Use NVIDIA's CUDA base image for the runtime stage only:

DOCKERFILE
# Builder still uses standard Python (has all build tools)
FROM python:3.11 AS builder
# ... same wheel-building logic as before ...

# Runtime uses NVIDIA CUDA image
FROM nvidia/cuda:12.1.0-runtime-ubuntu22.04 AS runtime

# Install Python 3.11 on the CUDA runtime base
RUN apt-get update && apt-get install -y --no-install-recommends python3.11 python3-pip && \
    rm -rf /var/lib/apt/lists/*

# Copy wheels from builder and install them with Python 3.11's pip,
# so the cp311 wheels match the interpreter (not the distro default python3)
COPY --from=builder /wheels /wheels
RUN python3.11 -m pip install --no-cache-dir --no-index --find-links=/wheels /wheels/*.whl

Option C: No PyTorch (pure API proxy)

If your service purely calls Azure OpenAI and does no local ML, exclude torch entirely:

# requirements.txt — no torch, no transformers
fastapi==0.111.0
uvicorn[standard]==0.29.0
gunicorn==22.0.0
openai==1.30.1
pydantic-settings==2.2.1
redis==5.0.4
httpx==0.27.0
psycopg2-binary==2.9.9
structlog==24.1.0

Image size drops to approximately 180 MB. This is the right choice for most Azure OpenAI proxy services.


The COPY --from=builder Pattern in Depth

COPY --from=<stage_name> is the core mechanism that makes multi-stage builds work. You can use it in several ways:

DOCKERFILE
# Copy a specific directory
COPY --from=builder /wheels /wheels

# Copy a specific file
COPY --from=builder /build/compiled_extension.so /app/

# Copy an installed Python environment (alternative to wheels)
COPY --from=builder /usr/local/lib/python3.11/site-packages /usr/local/lib/python3.11/site-packages

# You can also reference external images (not just stages in the same file)
COPY --from=nginx:latest /etc/nginx/nginx.conf /app/nginx.conf

The "copy site-packages" approach (copying the entire installed Python library directory) is simpler than the wheel approach but copies more files. The wheel approach is cleaner because the runtime stage runs its own pip install, which registers packages properly in pip's package database.


Build Arguments for Variant Selection

Use ARG to select between CPU and GPU variants at build time without maintaining separate Dockerfiles:

DOCKERFILE
FROM python:3.11 AS builder

# Default to CPU; override with: docker build --build-arg TORCH_VARIANT=cu121
ARG TORCH_VARIANT=cpu

WORKDIR /build
COPY requirements-base.txt .

# Conditionally copy the right torch requirements
COPY requirements-torch-${TORCH_VARIANT}.txt ./requirements-torch.txt

RUN pip wheel --no-cache-dir --wheel-dir /wheels \
    -r requirements-base.txt \
    -r requirements-torch.txt

Build for CPU:

Bash
docker build -t pharmabot:cpu .

Build for CUDA 12.1:

Bash
docker build --build-arg TORCH_VARIANT=cu121 -t pharmabot:gpu .

In your CI pipeline, you can build both variants from the same Dockerfile and push them as separate tags to your registry.
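
A rough sketch of that CI step — the registry and tag names below are placeholders, substitute your own:

Bash
# Hypothetical registry/tags — adjust to your setup
docker build --build-arg TORCH_VARIANT=cpu   -t myregistry.azurecr.io/pharmabot:1.4.0-cpu .
docker build --build-arg TORCH_VARIANT=cu121 -t myregistry.azurecr.io/pharmabot:1.4.0-gpu .
docker push myregistry.azurecr.io/pharmabot:1.4.0-cpu
docker push myregistry.azurecr.io/pharmabot:1.4.0-gpu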


Verifying the Size Reduction

After building both variants, compare:

Bash
# Build naive single-stage (for comparison)
docker build -f Dockerfile.naive -t pharmabot:naive .

# Build multi-stage
docker build -t pharmabot:multistage .

# Compare sizes
docker images | grep pharmabot

Expected output:

REPOSITORY   TAG          IMAGE ID       SIZE
pharmabot    naive        a1b2c3d4e5f6   2.13GB
pharmabot    multistage   b2c3d4e5f6a1   482MB

Inspect what's in each image:

Bash
# Check if gcc is in the final image (it should NOT be)
docker run --rm pharmabot:multistage which gcc
# (no output: gcc is not present)

# Confirm Python packages are installed correctly
docker run --rm pharmabot:multistage pip list | grep fastapi
# fastapi    0.111.0
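
To see where the remaining megabytes live, docker history breaks an image down layer by layer:

Bash
# Show the size of every layer in the final image
docker history pharmabot:multistage

# Same for the naive build — the compiler and CUDA torch layers stand out immediately
docker history pharmabot:naive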

Layer Caching Strategy for Multi-Stage Builds

Multi-stage builds have independent cache layers per stage. Structure your Dockerfile to maximise cache hits:

DOCKERFILE
FROM python:3.11 AS builder

# Layer 1: OS packages (rarely changes)
RUN apt-get update && apt-get install -y gcc libpq-dev

# Layer 2: requirements (changes when you add/upgrade packages)
COPY requirements.txt .
RUN pip wheel --no-cache-dir --wheel-dir /wheels -r requirements.txt

# Layer 3: application code (changes most often)
# (not needed in the builder stage: we only build wheels here)
DOCKERFILE
FROM python:3.11-slim AS runtime

# Layer 1: OS packages (rarely changes, cached independently from builder)
RUN apt-get update && apt-get install -y libpq5

# Layer 2: wheels from builder (cache key includes builder layer 2)
COPY --from=builder /wheels /wheels
RUN pip install --no-index --find-links=/wheels /wheels/*.whl

# Layer 3: app code (changes most often — always at the end)
COPY app/ ./app/

When only app/ changes, layers 1 and 2 of both stages are served from cache. The build completes in seconds instead of minutes.
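
If you build with BuildKit, you can go one step further and mount pip's download cache into the wheel-building step, so unchanged packages aren't re-downloaded even when requirements.txt changes. This is a sketch of the builder stage only; it assumes BuildKit is enabled and isn't required for the multi-stage pattern itself:

DOCKERFILE
# syntax=docker/dockerfile:1
FROM python:3.11 AS builder
WORKDIR /build
COPY requirements.txt .
# BuildKit persists /root/.cache/pip between builds; only new or changed
# packages are fetched (note: no --no-cache-dir here, or the mount is pointless)
RUN --mount=type=cache,target=/root/.cache/pip \
    pip wheel --wheel-dir /wheels -r requirements.txt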


Multi-Stage Build for Development vs Production

You can also use multi-stage builds to create development and production variants in one file:

DOCKERFILE
FROM python:3.11-slim AS base
WORKDIR /app
COPY requirements.txt .
RUN pip install -r requirements.txt

# Development stage: adds dev tools, mounts for hot reload
FROM base AS development
COPY requirements-dev.txt .
RUN pip install -r requirements-dev.txt
CMD ["uvicorn", "app.main:app", "--reload", "--host", "0.0.0.0", "--port", "8000"]

# Production stage: non-root user, gunicorn
FROM base AS production
RUN groupadd --gid 1001 appgroup && \
    useradd --uid 1001 --gid appgroup --no-create-home appuser
COPY app/ ./app/
RUN chown -R appuser:appgroup /app
USER appuser
CMD ["gunicorn", "app.main:app", "--workers", "4", \
     "--worker-class", "uvicorn.workers.UvicornWorker", \
     "--bind", "0.0.0.0:8000"]

Build the stage you need:

Bash
# Development (with hot reload)
docker build --target development -t pharmabot:dev .

# Production (hardened, non-root)
docker build --target production -t pharmabot:prod .

Your docker-compose.yml references the development target; your CI pipeline builds the production target.
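
For example, a minimal compose service entry (service and path names are illustrative) that builds the development target and bind-mounts the code for hot reload:

YAML
# docker-compose.yml (sketch)
services:
  api:
    build:
      context: .
      target: development
    volumes:
      - ./app:/app/app      # --reload picks up local edits
    ports:
      - "8000:8000"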


Common Mistakes with Multi-Stage Builds

Mistake 1: Copying unnecessary files into the builder stage

DOCKERFILE
# Wrong: copies everything, including tests and docs, into the builder
COPY . .

# Correct: only copy what pip needs to build wheels
COPY requirements.txt .
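
A .dockerignore file keeps stray files out of the build context in the first place. A typical starting point (the entries are suggestions, not an exhaustive list):

# .dockerignore
.git
.venv
__pycache__/
*.pyc
tests/
docs/
.env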

Mistake 2: Forgetting runtime OS libraries

The builder has libpq-dev (development headers). The runtime needs libpq5 (the runtime shared library). Missing runtime libs cause import errors that only appear after you've deployed.

Bash
# Error you'll see if libpq5 is missing in runtime:
# ImportError: libpq.so.5: cannot open shared object file
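
A quick way to catch this before deploying is to import the affected modules inside the freshly built image; the module names here match the example requirements:

Bash
# Fails fast in CI if a runtime shared library is missing
docker run --rm pharmabot:multistage python -c "import psycopg2, fastapi; print('imports ok')"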

Mistake 3: Installing from PyPI in the runtime stage

DOCKERFILE
# Wrong: hits PyPI during the runtime-stage build — slow and non-deterministic
FROM python:3.11-slim AS runtime
COPY requirements.txt .
RUN pip install -r requirements.txt   # hits PyPI

# Correct: install from pre-built local wheels
COPY --from=builder /wheels /wheels
RUN pip install --no-index --find-links=/wheels /wheels/*.whl

Summary

Multi-stage Docker builds are not optional for production AI services — they're a fundamental practice. The core pattern:

  1. Builder stage: full python:3.11 with compilers, builds all wheels into /wheels
  2. Runtime stage: python:3.11-slim, copies only the wheels, installs without compilers
  3. Result: 60–75% smaller image, zero build tools in production, faster cold starts

For PyTorch specifically:

  • Pure Azure OpenAI proxy services: don't install PyTorch at all
  • Embedding/re-ranking services: use the CPU-only wheel variant
  • Local GPU inference: use NVIDIA CUDA runtime as the base for the runtime stage

The size savings translate directly into faster deployments, lower registry storage costs, and smaller attack surfaces — all of which matter when you're running a production LLM service.
