AI Systems · Intermediate

Dockerising an AI API: Best Practices

Learn why AI APIs have unique Docker considerations — model weights, GPU drivers, large images — and build a production-grade Dockerfile for a FastAPI + Azure OpenAI service from scratch.

Asma Hafeez Khan · May 15, 2026 · 12 min read
LLMOps · Docker · FastAPI · AI · DevOps · Azure OpenAI

Why AI APIs Are Different to Dockerise

Containerising a standard REST API is routine. Containerising an AI API introduces several complications that bite you in production if you don't plan for them upfront.

Model weights are enormous. A quantised 7B model can be 4–8 GB. A full GPT-4-class local model can be 70+ GB. Even if you're calling a hosted API (Azure OpenAI, OpenAI), the Python dependency tree — torch, transformers, sentence-transformers — adds hundreds of megabytes to your image.

GPU driver dependencies. If your service runs inference locally, the CUDA runtime in the image must be supported by the NVIDIA driver on the host. Mismatches cause silent runtime failures that are difficult to debug.

Large base images cascade costs. A 4 GB image means 4 GB pulled on every cold start in your container orchestrator. At scale, this balloons startup times and egress costs.

Secret sprawl. API keys, Azure endpoint URLs, and database connection strings must never be baked into the image — yet it's easy to accidentally commit them when you're moving fast.

This lesson gives you a production Dockerfile for a FastAPI + Azure OpenAI service with every one of these concerns handled correctly.


The Application We're Containerising

We have a simple pharmabot service:

pharmabot/
├── app/
│   ├── __init__.py
│   ├── main.py
│   ├── routers/
│   │   └── chat.py
│   ├── services/
│   │   └── llm_service.py
│   └── config.py
├── requirements.txt
├── Dockerfile
├── .dockerignore
└── .env.local        ← never committed, never copied into image

main.py — the FastAPI entry point:

Python
from fastapi import FastAPI
from app.routers import chat
from app.config import settings

app = FastAPI(
    title="PharmaBot API",
    version="1.0.0",
    docs_url="/docs" if settings.debug else None,
)

app.include_router(chat.router, prefix="/chat", tags=["chat"])

@app.get("/health")
async def health():
    return {"status": "ok", "service": "pharmabot"}

config.py — reads from environment variables:

Python
from pydantic_settings import BaseSettings, SettingsConfigDict

class Settings(BaseSettings):
    model_config = SettingsConfigDict(env_file=".env.local")

    azure_openai_endpoint: str
    azure_openai_api_key: str
    azure_openai_deployment: str = "gpt-4o"
    azure_openai_api_version: str = "2024-02-01"
    redis_url: str = "redis://localhost:6379"
    debug: bool = False

settings = Settings()

Notice that Settings reads from env vars first, then falls back to .env.local. In production, you inject env vars at runtime — the .env.local file is only for local development and is never inside the container.
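
A quick way to see that precedence on a dev machine (a throwaway check, not part of the service; it assumes .env.local already holds the required endpoint and key):

Bash
# pydantic-settings gives real environment variables priority over .env.local
echo 'AZURE_OPENAI_DEPLOYMENT="from-dotenv"' >> .env.local
export AZURE_OPENAI_DEPLOYMENT="from-env"
python -c "from app.config import settings; print(settings.azure_openai_deployment)"
# from-env   (remember to remove the test line from .env.local afterwards)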


Base Image Choice: python:3.11-slim Not python:3.11

This is the single highest-impact decision in your Dockerfile.

| Base image | Compressed size | Includes |
|---|---|---|
| python:3.11 | ~330 MB | Full Debian, gcc, make, all dev tools |
| python:3.11-slim | ~45 MB | Minimal Debian, no compilers |
| python:3.11-alpine | ~18 MB | musl libc, not glibc — breaks some C extensions |

Use python:3.11-slim. Alpine breaks packages like uvloop, cryptography, and some PyTorch wheels that expect glibc. The full python:3.11 image is 7x larger with no runtime benefit — the compilers it includes are only needed at build time.
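
You can reproduce the comparison yourself. One caveat: docker images reports uncompressed sizes, which run noticeably larger than the registry's compressed figures above:

Bash
docker pull python:3.11 && docker pull python:3.11-slim
docker images --format '{{.Repository}}:{{.Tag}}\t{{.Size}}' | grep '^python:3.11'
# python:3.11        ~1 GB uncompressed
# python:3.11-slim   ~130 MB uncompressed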

If you need compilers to build C extensions (e.g., psycopg2 from source), install them, build the wheels, then discard them. That's what multi-stage builds are for (covered in the next lesson).


The Production Dockerfile

DOCKERFILE
# ── Stage: single-stage production build ─────────────────────────────────────
FROM python:3.11-slim

# 1. Metadata
LABEL maintainer="asmasikkerhetservice@gmail.com"
LABEL service="pharmabot"
LABEL version="1.0.0"

# 2. Prevent Python from writing .pyc files and buffering stdout/stderr
ENV PYTHONDONTWRITEBYTECODE=1
ENV PYTHONUNBUFFERED=1

# 3. Set a working directory inside the container
WORKDIR /app

# 4. Create a non-root user BEFORE copying files
#    Running as root inside a container is a security risk.
#    If your app is compromised, root in the container = root escape potential.
RUN groupadd --gid 1001 appgroup && \
    useradd --uid 1001 --gid appgroup --no-create-home appuser

# 5. Install OS-level dependencies needed at runtime (not build time)
#    libpq5 is the PostgreSQL client library used by psycopg2-binary at runtime.
#    curl is used by our health check scripts.
RUN apt-get update && apt-get install -y --no-install-recommends \
        libpq5 \
        curl \
    && rm -rf /var/lib/apt/lists/*
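
# 5b. Optional sketch (not in the original file): a container-level health
#     check using the curl installed above. Orchestrators usually define
#     their own probes, so treat this as a default for plain `docker run`.
HEALTHCHECK --interval=30s --timeout=5s --start-period=15s --retries=3 \
    CMD curl -fsS http://localhost:8000/health || exit 1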

# 6. COPY requirements FIRST — before app code.
#    Docker caches each layer. If requirements.txt hasn't changed,
#    Docker reuses the cached pip install layer even when app code changes.
#    This turns a 3-minute build into a 20-second build during development.
COPY requirements.txt .

RUN pip install --upgrade pip && \
    pip install --no-cache-dir -r requirements.txt

# 7. NOW copy the application code
COPY app/ ./app/

# 8. Transfer ownership of the app directory to the non-root user
RUN chown -R appuser:appgroup /app

# 9. Switch to the non-root user for all subsequent commands
USER appuser

# 10. Expose the port (documentation only — does not publish it)
EXPOSE 8000

# 11. Runtime environment variables — non-secret configuration
#     Secrets (API keys, connection strings) are injected at runtime
#     by the container orchestrator or docker run -e flags.
ENV AZURE_OPENAI_DEPLOYMENT="gpt-4o"
ENV AZURE_OPENAI_API_VERSION="2024-02-01"
ENV WORKERS="2"
ENV WORKER_CLASS="uvicorn.workers.UvicornWorker"
ENV TIMEOUT="120"

# 12. Production-grade startup: Gunicorn manages worker processes,
#     each worker runs UvicornWorker (async I/O, handles SSE/streaming).
#     - workers: 2x(CPU cores) + 1 is a common starting point.
#     - timeout: 120s — LLM responses can be slow.
#     - access-logfile -: sends access logs to stdout (captured by Docker).
CMD ["sh", "-c", \
     "gunicorn app.main:app \
      --workers $WORKERS \
      --worker-class $WORKER_CLASS \
      --bind 0.0.0.0:8000 \
      --timeout $TIMEOUT \
      --access-logfile - \
      --error-logfile -"]
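
To watch the caching from step 6 pay off, rebuild after touching only application code (a minimal demonstration; BuildKit marks reused layers as CACHED):

Bash
docker build -t pharmabot:latest .   # first build: every layer runs
touch app/main.py
docker build -t pharmabot:latest .   # pip install layer is CACHED; only COPY app/ reruns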

The .dockerignore File

Your .dockerignore controls what gets sent to the Docker build context. A bloated build context slows every build. More critically, secrets in your project root will be sent to the daemon even if you never COPY them — unless they're ignored.

# Python cache
__pycache__/
*.pyc
*.pyo
*.pyd
.Python
*.egg-info/
dist/
build/
.eggs/

# Virtual environments
.venv/
venv/
env/
ENV/

# Secrets and local config — NEVER in the image
.env
.env.local
.env.*.local
*.key
*.pem
secrets/

# Development tools
.git/
.gitignore
.github/
.pre-commit-config.yaml

# Test artefacts
.pytest_cache/
htmlcov/
.coverage
coverage.xml
tests/

# Documentation
*.md
docs/

# IDE files
.vscode/
.idea/
*.swp

# Docker files (no need to copy these into the image)
Dockerfile
.dockerignore
docker-compose*.yml

Key points:

  • .env and .env.local are explicitly listed. This is your last line of defence against accidentally shipping API keys.
  • tests/ is excluded. Your test suite belongs in CI, not in production images.
  • .git/ is excluded. The git history can be many megabytes and reveals sensitive commit history.
  • __pycache__/ is excluded. The image rebuild will regenerate bytecode, and you don't want host-compiled bytecode polluting the container.
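
With BuildKit you can confirm the ignore rules are doing their job: the early build output reports how much context was shipped to the daemon (figures here are illustrative):

Bash
docker build -t pharmabot:latest . 2>&1 | grep 'transferring context'
# => => transferring context: 1.2MB  (small, because .venv/, .git/ and tests/ never leave the host)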

Why You Must Never Bake Secrets Into the Image

An image baked with secrets is permanently compromised. Here's why:

Images are stored in registries — ACR, ECR, Docker Hub. Anyone with pull access gets the secret. Even if you delete the tag, the layer is still accessible by digest for a time.

Docker layer history is readable by anyone with the image. Run docker history --no-trunc pharmabot:latest and you'll see every command used to build the image. If your secret was set via ENV AZURE_OPENAI_API_KEY=sk-..., it appears in plain text in the history.

Container runtimes expose env vars. Kubernetes object specs, docker inspect output, and cloud provider diagnostics can surface environment variables regardless of how they were set. The difference is scope: a secret baked into the image is additionally exposed to everyone who can pull it, while a secret injected at runtime from a secrets manager or Key Vault never appears in the image layers.
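
You can watch the leak happen on any image that sets a secret via ENV (the value below is a hypothetical placeholder):

Bash
docker history --no-trunc pharmabot:latest | grep AZURE_OPENAI_API_KEY
# ENV AZURE_OPENAI_API_KEY=sk-abc123...   <- visible to anyone who can pull the image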

The correct pattern:

Bash
# Wrong (Dockerfile ENV): the secret is baked into the image layer history
ENV AZURE_OPENAI_API_KEY="sk-abc123..."

# Correct: the secret is injected at runtime by the orchestrator
# In docker run:
docker run -e AZURE_OPENAI_API_KEY=$MY_SECRET_FROM_VAULT pharmabot:latest

# In Azure Container Apps:
az containerapp update \
  --name pharmabot \
  --resource-group rg-pharmabot \
  --set-env-vars AZURE_OPENAI_API_KEY=secretref:openai-key

requirements.txt for This Service

# Web framework
fastapi==0.111.0
uvicorn[standard]==0.29.0
gunicorn==22.0.0

# Azure OpenAI
openai==1.30.1

# Configuration
pydantic-settings==2.2.1

# Redis for session caching
redis==5.0.4

# HTTP client
httpx==0.27.0

# PostgreSQL (binary wheels — no build deps needed)
psycopg2-binary==2.9.9

# Observability
structlog==24.1.0

Pin exact versions in requirements.txt. Floating versions (fastapi>=0.100) cause non-reproducible builds — a dependency update on a Friday evening can break a Monday deployment.
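
If maintaining pins by hand becomes tedious, one option (a suggestion, not something this project uses) is pip-tools: keep loose ranges in a requirements.in and let pip-compile generate the fully pinned file:

Bash
pip install pip-tools
pip-compile requirements.in -o requirements.txt   # pins every transitive dependency
pip-compile --upgrade-package fastapi             # upgrades become deliberate, reviewable diffs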


Environment Variable Strategy

Here is the complete env var taxonomy for this service:

| Variable | Source | Secret? | Example |
|---|---|---|---|
| AZURE_OPENAI_ENDPOINT | Key Vault / runtime | Yes | https://pharmabot.openai.azure.com/ |
| AZURE_OPENAI_API_KEY | Key Vault / runtime | Yes | abc123... |
| AZURE_OPENAI_DEPLOYMENT | Dockerfile ENV | No | gpt-4o |
| REDIS_URL | Runtime env | Partial | redis://cache:6379 |
| DATABASE_URL | Key Vault / runtime | Yes | postgresql://... |
| DEBUG | Runtime env | No | false |
| WORKERS | Dockerfile ENV | No | 2 |

Variables marked "Dockerfile ENV" are safe defaults that can be overridden at runtime. Variables marked "Key Vault / runtime" must never appear in the Dockerfile or in version control.


Gunicorn + Uvicorn Workers Explained

FastAPI is an async framework built on ASGI. Uvicorn is the ASGI server that runs it. But Uvicorn on its own runs a single process by default — if it crashes, your service is down.

Gunicorn is a battle-tested pre-fork server that doubles as a process manager; with the right worker class it serves ASGI apps too. It:

  • Starts and monitors N worker processes
  • Restarts crashed workers automatically
  • Handles graceful shutdown (SIGTERM → drain in-flight requests → exit)
  • Responds to management signals (e.g. TTIN/TTOU to scale workers up or down)

The UvicornWorker class is Uvicorn running inside a Gunicorn worker — you get both the async I/O of Uvicorn and the process management of Gunicorn.

Worker count rule of thumb: (2 x CPU_CORES) + 1. For a 2-core container, use 5 workers. However, for LLM services that spend most time waiting on network I/O to Azure OpenAI, you can go higher — async workers don't consume CPU while waiting, so 10-20 workers on 2 cores is reasonable.
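
The rule of thumb translated to shell (a sketch; note that nproc can report the host's cores rather than the container's CPU limit, so treat the result as a starting point):

Bash
WORKERS=$(( 2 * $(nproc) + 1 ))   # (2 x CPU cores) + 1
echo "$WORKERS"                   # 5 on a 2-core machine
docker run -e WORKERS="$WORKERS" -p 8000:8000 pharmabot:latest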

DOCKERFILE
# In the CMD, make workers configurable via env var
CMD ["sh", "-c", "gunicorn app.main:app \
  --workers ${WORKERS:-4} \
  --worker-class uvicorn.workers.UvicornWorker \
  --bind 0.0.0.0:8000 \
  --timeout ${TIMEOUT:-120} \
  --graceful-timeout 30 \
  --keep-alive 5 \
  --access-logfile - \
  --error-logfile -"]

--graceful-timeout 30: Gunicorn gives workers 30 seconds to finish in-flight requests before forcefully killing them. Essential for streaming LLM responses.

--keep-alive 5: Keep HTTP connections alive for 5 seconds. Reduces connection overhead for clients making multiple requests.
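
One way to sanity-check graceful shutdown (a sketch: docker stop sends SIGTERM, then SIGKILL once its own timeout expires, so give it longer than --graceful-timeout):

Bash
# Start a long streaming request in another terminal first, then:
docker stop --time 40 $(docker ps -q -f ancestor=pharmabot:latest)
# Gunicorn logs "Handling signal: term" and drains in-flight requests before exiting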


Build and Run Checkpoint

Build the image:

Bash
docker build -t pharmabot:latest .

Inspect the build output. You'll see layer caching in action — the pip install layer is only rebuilt when requirements.txt changes.

Check the image size:

Bash
docker images pharmabot
# REPOSITORY   TAG       IMAGE ID       SIZE
# pharmabot    latest    a3b4c5d6e7f8   620MB

620 MB is reasonable for a FastAPI + OpenAI service. The bulk is the OpenAI SDK and its transitive dependencies. The next lesson (multi-stage builds) shows how to reduce this further.

Run locally with secrets injected at runtime:

Bash
docker run \
  -p 8000:8000 \
  -e AZURE_OPENAI_ENDPOINT="https://your-resource.openai.azure.com/" \
  -e AZURE_OPENAI_API_KEY="your-key-here" \
  -e AZURE_OPENAI_DEPLOYMENT="gpt-4o" \
  -e DEBUG="true" \
  pharmabot:latest

Verify the health endpoint:

Bash
curl http://localhost:8000/health
# {"status":"ok","service":"pharmabot"}

Check the running process is non-root:

Bash
docker exec $(docker ps -q -f ancestor=pharmabot:latest) whoami
# appuser

Inspect what's NOT in the image (confirm .env is excluded):

Bash
docker run --rm pharmabot:latest ls -la /app
# Should show: app/ requirements.txt
# Should NOT show: .env, .env.local, .git, tests/

What to Do About GPU Dependencies

If your service runs local inference (not a hosted API), you need CUDA. The correct approach is to use NVIDIA's official CUDA base images:

DOCKERFILE
# For GPU inference (PyTorch)
FROM nvidia/cuda:12.1.0-runtime-ubuntu22.04

# Install Python on top
RUN apt-get update && apt-get install -y --no-install-recommends \
        python3.11 python3-pip \
    && rm -rf /var/lib/apt/lists/*

Key rules:

  • Use -runtime not -devel. The devel image includes compilers; runtime has only what you need to execute CUDA code.
  • Pin the CUDA version to match your host driver. nvidia-smi on the host tells you the maximum CUDA version your driver supports (see the checks after this list).
  • For CPU-only inference (quantised GGUF models via llama-cpp-python), stay on python:3.11-slim — no CUDA needed.
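
Two quick host-side checks before committing to a CUDA tag (they assume the NVIDIA driver and, for the second, the NVIDIA Container Toolkit are installed):

Bash
# 1. Maximum CUDA version the host driver supports
nvidia-smi | grep -o 'CUDA Version: [0-9.]*'
# CUDA Version: 12.2  -> a 12.1 runtime image is safe; a 12.4 one is not

# 2. Confirm containers can actually see the GPU
docker run --rm --gpus all nvidia/cuda:12.1.0-runtime-ubuntu22.04 nvidia-smi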

Common Mistakes Checklist

| Mistake | Consequence | Fix |
|---|---|---|
| Using python:3.11 (full) as base | 330 MB extra, no benefit | Use python:3.11-slim |
| COPY . . before pip install | No layer caching — every code change re-installs all deps | COPY requirements.txt first |
| Running as root | Security vulnerability | Add non-root user before CMD |
| Baking API keys into ENV | Secret leaked in image history | Inject at runtime |
| No .dockerignore | .env, .git, tests/ all in build context | Always write .dockerignore |
| CMD with uvicorn only (no gunicorn) | Single process, no restart on crash | Use gunicorn + UvicornWorker |
| Floating dep versions | Non-reproducible builds | Pin exact versions |


Summary

A production Dockerfile for an AI API is not just a recipe to run Python — it's a security boundary, a caching strategy, and a process management configuration. The key decisions:

  1. python:3.11-slim as the base image
  2. Non-root user created and switched to before CMD
  3. requirements.txt copied before application code (layer caching)
  4. .dockerignore that explicitly excludes .env files and secrets
  5. All secrets injected at runtime, never baked into ENV
  6. Gunicorn + UvicornWorker for multi-process async serving

In the next lesson, we'll use multi-stage builds to cut this image down from 620 MB to under 480 MB while also separating build-time and runtime dependencies cleanly.
