Docker for Data Engineers: Fundamentals
Master Docker from first principles — containers vs VMs, the Docker daemon, images, volumes, multi-stage builds, and a complete production-ready Dockerfile for a Python data pipeline.
Why Every Data Engineer Needs Docker
Your Spark job runs perfectly locally. It crashes in production. The culprit? Python 3.9 vs 3.11, a missing system library, or pandas 1.5 vs 2.0. Docker eliminates this entire class of problem by packaging your code, runtime, and dependencies into a single portable unit that runs identically everywhere.
This lesson builds a production mental model — not just "how to run containers" but how Docker actually works under the hood, and how to build images that are fast to build, small to ship, and safe to run.
Containers vs Virtual Machines
This distinction matters for performance and architecture decisions.
Virtual Machine Container
┌─────────────────────────┐ ┌─────────────────────────┐
│ App A │ App B │ │ App A │ App B │
├─────────┴──────────┐ │ ├─────────┴──────────┐ │
│ Guest OS (full) │ │ │ Container Runtime │ │
├────────────────────┤ │ ├────────────────────┤ │
│ Hypervisor │ │ │ Host OS Kernel │ │
├────────────────────┴────┤ ├─────────────────────┘ │
│ Host OS │ │ Host OS │
└─────────────────────────┘ └──────────────────────────┘

| Property | VM | Container |
|---|---|---|
| Startup time | 30–60 seconds | < 1 second |
| Size | GBs | MBs |
| OS isolation | Full (own kernel) | Shared kernel, isolated namespaces |
| Resource overhead | High (duplicate OS) | Near zero |
| Use case | Full OS isolation, Windows/Linux mix | Microservices, pipelines, CI jobs |
The key insight: containers are not tiny VMs. They are isolated processes on the host. Linux kernel namespaces isolate the process tree, filesystem, and network. cgroups limit CPU and memory. Docker is a layer of tooling on top of these kernel features.
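You can verify this on a native Linux host: the container's main process shows up in the host process table like any other (a quick sketch; the container name and sleep duration are arbitrary, and on Docker Desktop the process lives inside Docker's VM rather than on your machine):
# Start a long-running container
docker run -d --name ns-demo python:3.11-slim sleep 300
# The "containerised" process is just a namespaced process on the host
ps aux | grep "sleep 300"
# Docker will tell you its host PID directly
docker inspect --format '{{.State.Pid}}' ns-demo
# Clean up
docker rm -f ns-demo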
Docker Architecture
Understanding the architecture prevents debugging confusion.
┌──────────────────────────────────────────────────────┐
│ Docker Client (CLI) │
│ docker run / build / pull / push / ps / logs ... │
└──────────────────────┬───────────────────────────────┘
│ REST API (Unix socket / TCP)
┌──────────────────────▼───────────────────────────────┐
│ Docker Daemon (dockerd) │
│ - Manages images, containers, networks, volumes │
│ - Calls containerd for actual container execution │
└──────────┬────────────────────────────────┬──────────┘
│ │
┌──────────▼──────────┐ ┌─────────────▼──────────┐
│ Image Registry │ │ containerd + runc │
│ Docker Hub / ECR │ │ (OCI-compatible) │
│ ghcr.io / ACR │ │ runs containers │
└─────────────────────┘ └────────────────────────┘

Three moving parts:
- Docker client: the CLI you type into. It speaks REST to the daemon.
- Docker daemon (dockerd): the server. Manages all objects. Runs as a system service.
- Registry: remote storage for images. Docker Hub is the default. You also use AWS ECR, Azure ACR, or GitHub Container Registry in production.
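The client/daemon split is easy to see, and it means the same CLI can drive a daemon on a different machine (a small sketch; the remote hostname is purely illustrative):
# docker version reports both halves of the API conversation
docker version --format 'client={{.Client.Version}} server={{.Server.Version}}'
# Point the local CLI at a remote daemon over SSH
DOCKER_HOST=ssh://deploy@build-server.internal docker ps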
Images vs Containers
This mental model will save you hours of confusion:
| Concept | Analogy | Mutable? |
|---|---|---|
| Image | Class definition / cookie cutter | No — read-only layers |
| Container | Running instance / cookie | Yes — writable layer on top |
An image is a stack of read-only layers. When you start a container, Docker adds a thin writable layer on top. Changes inside the container (writing files, installing packages) only exist in that writable layer. Stop and remove the container — the image is untouched.
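A quick way to watch the writable layer in action (a sketch; the container name is arbitrary):
docker run --name layer-demo python:3.11-slim touch /tmp/demo.txt   # write into the writable layer
docker diff layer-demo                    # lists what this container changed on top of the image
docker rm layer-demo                      # removing the container discards that layer
docker run --rm python:3.11-slim ls /tmp  # a fresh container from the same image: the file is gone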
# An image is not a container
docker pull python:3.11-slim # fetch image from registry
docker images # list local images
docker image inspect python:3.11-slim # see layers, config, entrypoint
# A container is a running (or stopped) instance
docker run python:3.11-slim python --version # run and exit
docker ps # list running containers
docker ps -a                             # list all (including stopped)

Essential Docker Commands
Pulling and Running Images
# Pull an image (downloads to local cache)
docker pull postgres:16
# Run a container interactively
docker run -it python:3.11-slim bash
# Run detached (background)
docker run -d --name my-postgres postgres:16
# Run with a specific command
docker run --rm python:3.11-slim python -c "import sys; print(sys.version)"
# --rm removes the container automatically after it exits

Inspecting Running Containers
# List running containers
docker ps
# List all containers (including stopped)
docker ps -a
# See container logs (stdout/stderr)
docker logs my-postgres
# Follow logs in real time
docker logs -f my-postgres
# See last 50 lines
docker logs --tail 50 my-postgres
# Execute a command inside a running container
docker exec -it my-postgres psql -U postgres
# Open a shell in a running container
docker exec -it my-postgres bash

Stopping and Removing
# Graceful stop (sends SIGTERM, waits, then SIGKILL)
docker stop my-postgres
# Immediate kill
docker kill my-postgres
# Remove a stopped container
docker rm my-postgres
# Remove a running container (force)
docker rm -f my-postgres
# Remove all stopped containers
docker container prune
# Remove an image
docker rmi postgres:16
# Nuclear option: remove everything unused
docker system prune -a

Port Mapping
Containers have their own network namespace. You must explicitly publish ports to reach them from the host.
# Syntax: -p <host_port>:<container_port>
docker run -d \
--name airflow-web \
-p 8080:8080 \
apache/airflow:2.8.0
# Now http://localhost:8080 routes to port 8080 inside the container
# Bind to a specific interface (security best practice)
docker run -d \
-p 127.0.0.1:5432:5432 \
--name local-pg \
postgres:16
# Only accessible from localhost, not from other machines on the network
# Multiple port mappings
docker run -d \
-p 8080:8080 \
-p 8443:8443 \
--name my-app \
my-app:latest

Environment Variables
The primary mechanism for runtime configuration. Never bake secrets into images.
# Pass a single variable
docker run -e POSTGRES_PASSWORD=secret postgres:16
# Pass multiple variables
docker run \
-e POSTGRES_DB=pipelines \
-e POSTGRES_USER=pipeline_user \
-e POSTGRES_PASSWORD=secret \
postgres:16
# Pass from host environment (no value = pass through from shell)
export POSTGRES_PASSWORD=mysecret
docker run -e POSTGRES_PASSWORD postgres:16
# Use an env file (preferred for multiple variables)
docker run --env-file .env postgres:16

Example .env file — keep this out of git:
# .env
POSTGRES_DB=pipelines
POSTGRES_USER=pipeline_user
POSTGRES_PASSWORD=changeme_in_production
AIRFLOW__CORE__SQL_ALCHEMY_CONN=postgresql+psycopg2://pipeline_user:changeme_in_production@postgres/pipelines
AIRFLOW__CORE__FERNET_KEY=your_fernet_key_here

Add to .gitignore:
.env
*.env
.env.*
!.env.example

Volume Mounts
Containers are ephemeral. When you remove a container, its writable layer is gone. Volumes persist data.
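A minimal demonstration of the difference (a sketch; the volume name and paths are illustrative):
# Without a volume: the file exists only in that container's writable layer
docker run --rm python:3.11-slim sh -c "echo hello > /data.txt"
docker run --rm python:3.11-slim cat /data.txt        # error: a new container starts from the clean image
# With a named volume: the data outlives any single container
docker volume create demo-data
docker run --rm -v demo-data:/data python:3.11-slim sh -c "echo hello > /data/out.txt"
docker run --rm -v demo-data:/data python:3.11-slim cat /data/out.txt   # prints: hello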
Bind Mounts
Maps a host directory directly into the container. Ideal for development.
# Syntax: -v <host_path>:<container_path>
docker run -d \
--name postgres-dev \
-v /home/user/postgres-data:/var/lib/postgresql/data \
-e POSTGRES_PASSWORD=secret \
postgres:16
# Mount current directory (common for dev)
docker run --rm \
-v $(pwd):/workspace \
-w /workspace \
python:3.11-slim \
python pipeline.py

Named Volumes
Docker-managed volumes. Preferred for production — no host path dependency.
# Create a named volume
docker volume create postgres-data
# Use it
docker run -d \
--name postgres-prod \
-v postgres-data:/var/lib/postgresql/data \
-e POSTGRES_PASSWORD=secret \
postgres:16
# Inspect volume (see where Docker stores it on the host)
docker volume inspect postgres-data
# List all volumes
docker volume ls
# Remove unused volumes
docker volume prune

Named Volumes vs Bind Mounts
| | Named Volume | Bind Mount |
|---|---|---|
| Path | Docker manages it | You specify host path |
| Portability | Works on any host | Depends on host directory |
| Performance | Optimised by Docker | Direct host I/O |
| Use case | Databases in prod | Dev: live-reload code |
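One variant worth knowing for either type: append :ro to make a mount read-only inside the container, a sensible default for config and source code a job should never mutate (a sketch; paths and names are illustrative):
# config is a read-only bind mount; pipeline-output is a writable named volume
docker run --rm \
  -v $(pwd)/config:/app/config:ro \
  -v pipeline-output:/app/output \
  -w /app \
  python:3.11-slim \
  python pipeline.py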
Writing Dockerfiles
Basic Structure
# Base image
FROM python:3.11-slim
# Set working directory inside the container
WORKDIR /app
# Copy dependency files first (cache optimization)
COPY requirements.txt .
# Install dependencies (this layer gets cached if requirements.txt unchanged)
RUN pip install --no-cache-dir -r requirements.txt
# Copy application code
COPY . .
# Expose port (documentation — does not actually publish)
EXPOSE 8000
# Default command
CMD ["python", "main.py"]

# Build: -t tags the image, . is the build context
docker build -t my-pipeline:latest .
# Build with a specific Dockerfile
docker build -f Dockerfile.prod -t my-pipeline:prod .
# Build with build arguments
docker build --build-arg PYTHON_VERSION=3.11 -t my-pipeline .

Layer Caching
Docker caches each instruction. When a layer changes, all subsequent layers rebuild. Order your instructions from least to most frequently changing:
# GOOD: requirements.txt rarely changes, code changes often
COPY requirements.txt .
RUN pip install -r requirements.txt
COPY . .
# BAD: code changes invalidate the pip install cache
COPY . .
RUN pip install -r requirements.txt

.dockerignore
Like .gitignore but for the Docker build context. Without it, Docker sends your entire directory (including .git, __pycache__, venv, etc.) to the daemon.
# .dockerignore
.git
.gitignore
__pycache__
*.pyc
*.pyo
.pytest_cache
.coverage
htmlcov/
.venv
venv/
env/
*.egg-info/
dist/
build/
.env
.env.*
!.env.example
*.log
*.tmp
.DS_Store
Thumbs.db
tests/
docs/
*.md

COPY vs ADD
Both copy files into the image. Use COPY by default.
# COPY: explicit, predictable
COPY requirements.txt /app/
COPY src/ /app/src/
# ADD: has extra magic — can extract tar files and fetch URLs
# Avoid ADD unless you specifically need these features
ADD archive.tar.gz /app/ # extracts the archive
ADD https://example.com/file /app/file   # fetches remote URL

Rule: always use COPY. Use ADD only when you need tar extraction and document why.
CMD vs ENTRYPOINT
Both define what runs when the container starts. They compose differently.
# CMD: default command, fully replaceable at runtime
CMD ["python", "pipeline.py"]
# docker run my-image python other_script.py <- replaces CMD
# ENTRYPOINT: the executable, arguments appended
ENTRYPOINT ["python"]
CMD ["pipeline.py"]
# docker run my-image other_script.py <- runs: python other_script.py

Production pattern: use ENTRYPOINT for the executable, CMD for default arguments:
ENTRYPOINT ["python", "-m", "uvicorn"]
CMD ["app.main:app", "--host", "0.0.0.0", "--port", "8000"]
# Override at runtime:
# docker run my-api app.main:app --host 0.0.0.0 --port 9000 --reload

For shell scripts used as entrypoints:
COPY entrypoint.sh /entrypoint.sh
RUN chmod +x /entrypoint.sh
ENTRYPOINT ["/entrypoint.sh"]   # use exec form (JSON array) — not shell form

Multi-Stage Builds
Single-stage images include build tools (gcc, pip, headers) in the final image — adding hundreds of MBs of attack surface. Multi-stage builds separate build from runtime.
The Pattern
# Stage 1: Build — install everything needed to compile/install
FROM python:3.11 AS builder
WORKDIR /build
# Install build dependencies
RUN pip install --upgrade pip
COPY requirements.txt .
RUN pip install --no-cache-dir --prefix=/install -r requirements.txt
# Stage 2: Runtime — copy only the installed packages
FROM python:3.11-slim AS runtime
# Copy installed packages from build stage
COPY --from=builder /install /usr/local
WORKDIR /app
COPY src/ /app/src/
COPY pipeline.py /app/
CMD ["python", "pipeline.py"]

The FROM python:3.11 (full) stage has compilers, headers, pip. The python:3.11-slim runtime stage does not. The final image is just slim + your installed packages.
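To confirm the saving, you can build the builder stage on its own and compare it with the final image (a sketch; tags are illustrative):
# Build only the first stage
docker build --target builder -t my-pipeline:builder .
# Build the full multi-stage image (final stage)
docker build -t my-pipeline:runtime .
# Compare sizes: the runtime image should be dramatically smaller
docker images | grep my-pipeline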
Complete Production Dockerfile: Python Data Pipeline
This is a real-world multi-stage Dockerfile for a Python pipeline that reads from S3, transforms data with pandas, and writes to PostgreSQL.
# ─── Stage 1: Dependency builder ─────────────────────────────────────────────
FROM python:3.11 AS builder
# Avoid prompts during package install
ENV DEBIAN_FRONTEND=noninteractive
WORKDIR /build
# Upgrade pip first
RUN pip install --upgrade pip==24.0
# Copy only dependency manifests (cache-friendly)
COPY requirements.txt requirements-pipeline.txt ./
# Install into a prefix directory so we can copy cleanly
RUN pip install \
--no-cache-dir \
--prefix=/install \
-r requirements.txt \
-r requirements-pipeline.txt
# ─── Stage 2: Runtime image ───────────────────────────────────────────────────
FROM python:3.11-slim AS runtime
# Security: run as non-root user
RUN groupadd --gid 1001 pipeline && \
useradd --uid 1001 --gid pipeline --shell /bin/bash --create-home pipeline
# Install only OS-level runtime deps (not build tools)
RUN apt-get update && apt-get install -y --no-install-recommends \
libpq5 \
curl \
&& rm -rf /var/lib/apt/lists/*
# Copy installed Python packages from builder
COPY --from=builder /install /usr/local
WORKDIR /app
# Copy application source
COPY --chown=pipeline:pipeline src/ /app/src/
COPY --chown=pipeline:pipeline pipeline.py /app/
COPY --chown=pipeline:pipeline entrypoint.sh /app/
RUN chmod +x /app/entrypoint.sh
# Drop to non-root
USER pipeline
# Healthcheck: verify Python can import our module
HEALTHCHECK --interval=30s --timeout=10s --start-period=5s --retries=3 \
CMD python -c "import src.pipeline; print('ok')" || exit 1
ENTRYPOINT ["/app/entrypoint.sh"]
CMD ["--mode", "incremental"]

entrypoint.sh:
#!/bin/bash
set -euo pipefail
echo "[$(date -u +%Y-%m-%dT%H:%M:%SZ)] Starting pipeline"
echo " Mode: ${1:-incremental}"
echo " DB: ${DB_HOST:-not set}"
echo " Bucket: ${S3_BUCKET:-not set}"
# Validate required env vars
: "${DB_HOST:?DB_HOST is required}"
: "${DB_PASSWORD:?DB_PASSWORD is required}"
: "${S3_BUCKET:?S3_BUCKET is required}"
exec python pipeline.py "$@"

requirements.txt:
pandas==2.2.2
psycopg2-binary==2.9.9
boto3==1.34.0
sqlalchemy==2.0.29
pyarrow==16.0.0
pydantic==2.7.0

Image Tagging and Pushing
Tagging Conventions
# Semantic versioning
docker tag my-pipeline:latest my-pipeline:1.4.2
docker tag my-pipeline:latest my-pipeline:1.4
docker tag my-pipeline:latest my-pipeline:1
# Git SHA tags (common in CI/CD)
GIT_SHA=$(git rev-parse --short HEAD)
docker tag my-pipeline:latest my-pipeline:${GIT_SHA}
# Environment tags
docker tag my-pipeline:latest my-pipeline:prod
docker tag my-pipeline:latest my-pipeline:staging

Pushing to Docker Hub
# Login (use token, not password)
docker login -u yourusername
# Tag with registry prefix: <registry>/<namespace>/<image>:<tag>
docker tag my-pipeline:latest yourusername/my-pipeline:1.4.2
docker tag my-pipeline:latest yourusername/my-pipeline:latest
# Push
docker push yourusername/my-pipeline:1.4.2
docker push yourusername/my-pipeline:latest

Pushing to AWS ECR
# Authenticate (requires AWS CLI configured)
AWS_ACCOUNT_ID=$(aws sts get-caller-identity --query Account --output text)
AWS_REGION=us-east-1
aws ecr get-login-password --region ${AWS_REGION} | \
docker login --username AWS --password-stdin \
${AWS_ACCOUNT_ID}.dkr.ecr.${AWS_REGION}.amazonaws.com
# Create repository (one-time)
aws ecr create-repository \
--repository-name my-pipeline \
--region ${AWS_REGION} \
--image-scanning-configuration scanOnPush=true
# Tag and push
ECR_REPO=${AWS_ACCOUNT_ID}.dkr.ecr.${AWS_REGION}.amazonaws.com/my-pipeline
docker tag my-pipeline:latest ${ECR_REPO}:latest
docker tag my-pipeline:latest ${ECR_REPO}:1.4.2
docker push ${ECR_REPO}:latest
docker push ${ECR_REPO}:1.4.2

Build Optimisation Checklist
Before shipping any Docker image, verify:
# Check image size
docker images my-pipeline
# Inspect layers (see what each instruction added)
docker history my-pipeline:latest
# Dive tool for interactive layer explorer
docker run --rm -it \
-v /var/run/docker.sock:/var/run/docker.sock \
wagoodman/dive:latest my-pipeline:latest

| Practice | Why |
|---|---|
| Use -slim or -alpine base | Smaller attack surface, faster pulls |
| Multi-stage builds | Remove build tools from runtime image |
| .dockerignore | Faster builds, no secrets leaked into context |
| Copy requirements.txt before COPY . | Cache pip install between code changes |
| --no-cache-dir in pip | Saves 20–40 MB |
| rm -rf /var/lib/apt/lists/* after apt | Removes package index from layer |
| Non-root user | Security — most CVEs run as root |
| Pin base image versions | Reproducibility — never use latest in production |
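Two quick checks worth adding to CI before a push (a sketch; the image name is illustrative):
# Confirm the image is configured with a non-root user
docker inspect --format '{{.Config.User}}' my-pipeline:latest   # expect a value other than empty or root
# Confirm the process actually runs as that user (UID should not be 0)
docker run --rm --entrypoint id my-pipeline:latest -u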
What's Next
You now understand Docker at the level needed to build and ship images confidently. The next lesson covers Docker Compose — running Airflow, PostgreSQL, Redis, Kafka, and Schema Registry together in a local data engineering environment with a single docker compose up.