Docker for Data Engineers: Fundamentals
Master Docker from first principles — containers vs VMs, the Docker daemon, images, volumes, multi-stage builds, and a complete production-ready Dockerfile for a Python data pipeline.
Why Every Data Engineer Needs Docker
Your Spark job runs perfectly locally. It crashes in production. The culprit? Python 3.9 vs 3.11, a missing system library, or pandas 1.5 vs 2.0. Docker eliminates this entire class of problem by packaging your code, runtime, and dependencies into a single portable unit that runs identically everywhere.
This lesson builds a production mental model — not just "how to run containers" but how Docker actually works under the hood, and how to build images that are fast to build, small to ship, and safe to run.
Containers vs Virtual Machines
This distinction matters for performance and architecture decisions.
Virtual Machine Container
┌─────────────────────────┐ ┌─────────────────────────┐
│ App A │ App B │ │ App A │ App B │
├─────────┴──────────┐ │ ├─────────┴──────────┐ │
│ Guest OS (full) │ │ │ Container Runtime │ │
├────────────────────┤ │ ├────────────────────┤ │
│ Hypervisor │ │ │ Host OS Kernel │ │
├────────────────────┴────┤ ├─────────────────────┘ │
│ Host OS │ │ Host OS │
└─────────────────────────┘ └──────────────────────────┘

| Property | VM | Container |
|---|---|---|
| Startup time | 30–60 seconds | < 1 second |
| Size | GBs | MBs |
| OS isolation | Full (own kernel) | Shared kernel, isolated namespaces |
| Resource overhead | High (duplicate OS) | Near zero |
| Use case | Full OS isolation, Windows/Linux mix | Microservices, pipelines, CI jobs |
The key insight: containers are not tiny VMs. They are isolated processes on the host. Linux kernel namespaces isolate the process tree, filesystem, and network. cgroups limit CPU and memory. Docker is a layer of tooling on top of these kernel features.
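You can verify this on a native Linux host: the container's main process shows up in the host process table like any other (a quick sketch; the container name and sleep duration are arbitrary, and on Docker Desktop the process lives inside Docker's VM rather than on your machine):
# Start a long-running container
docker run -d --name ns-demo python:3.11-slim sleep 300
# The "containerised" process is just a namespaced process on the host
ps aux | grep "sleep 300"
# Docker will tell you its host PID directly
docker inspect --format '{{.State.Pid}}' ns-demo
# Clean up
docker rm -f ns-demo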
Docker Architecture
Understanding the architecture prevents debugging confusion.
┌──────────────────────────────────────────────────────┐
│ Docker Client (CLI) │
│ docker run / build / pull / push / ps / logs ... │
└──────────────────────┬───────────────────────────────┘
│ REST API (Unix socket / TCP)
┌──────────────────────▼───────────────────────────────┐
│ Docker Daemon (dockerd) │
│ - Manages images, containers, networks, volumes │
│ - Calls containerd for actual container execution │
└──────────┬────────────────────────────────┬──────────┘
│ │
┌──────────▼──────────┐ ┌─────────────▼──────────┐
│ Image Registry │ │ containerd + runc │
│ Docker Hub / ECR │ │ (OCI-compatible) │
│ ghcr.io / ACR │ │ runs containers │
└─────────────────────┘ └────────────────────────┘

Three moving parts:
- Docker client: the CLI you type into. It speaks REST to the daemon.
- Docker daemon (dockerd): the server. Manages all objects. Runs as a system service.
- Registry: remote storage for images. Docker Hub is the default. You also use AWS ECR, Azure ACR, or GitHub Container Registry in production.
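The client/daemon split is easy to see, and it means the same CLI can drive a daemon on a different machine (a small sketch; the remote hostname is purely illustrative):
# docker version reports both halves of the API conversation
docker version --format 'client={{.Client.Version}} server={{.Server.Version}}'
# Point the local CLI at a remote daemon over SSH
DOCKER_HOST=ssh://deploy@build-server.internal docker ps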
Images vs Containers
This mental model will save you hours of confusion:
| Concept | Analogy | Mutable? |
|---|---|---|
| Image | Class definition / cookie cutter | No — read-only layers |
| Container | Running instance / cookie | Yes — writable layer on top |
An image is a stack of read-only layers. When you start a container, Docker adds a thin writable layer on top. Changes inside the container (writing files, installing packages) only exist in that writable layer. Stop and remove the container — the image is untouched.
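A quick way to watch the writable layer in action (a sketch; the container name is arbitrary):
docker run --name layer-demo python:3.11-slim touch /tmp/demo.txt   # write into the writable layer
docker diff layer-demo                    # lists what this container changed on top of the image
docker rm layer-demo                      # removing the container discards that layer
docker run --rm python:3.11-slim ls /tmp  # a fresh container from the same image: the file is gone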
# An image is not a container
docker pull python:3.11-slim # fetch image from registry
docker images # list local images
docker image inspect python:3.11-slim # see layers, config, entrypoint
# A container is a running (or stopped) instance
docker run python:3.11-slim python --version # run and exit
docker ps # list running containers
docker ps -a                             # list all (including stopped)

Essential Docker Commands
Pulling and Running Images
# Pull an image (downloads to local cache)
docker pull postgres:16
# Run a container interactively
docker run -it python:3.11-slim bash
# Run detached (background)
docker run -d --name my-postgres postgres:16
# Run with a specific command
docker run --rm python:3.11-slim python -c "import sys; print(sys.version)"
# --rm removes the container automatically after it exits

Inspecting Running Containers
# List running containers
docker ps
# List all containers (including stopped)
docker ps -a
# See container logs (stdout/stderr)
docker logs my-postgres
# Follow logs in real time
docker logs -f my-postgres
# See last 50 lines
docker logs --tail 50 my-postgres
# Execute a command inside a running container
docker exec -it my-postgres psql -U postgres
# Open a shell in a running container
docker exec -it my-postgres bash

Stopping and Removing
# Graceful stop (sends SIGTERM, waits, then SIGKILL)
docker stop my-postgres
# Immediate kill
docker kill my-postgres
# Remove a stopped container
docker rm my-postgres
# Remove a running container (force)
docker rm -f my-postgres
# Remove all stopped containers
docker container prune
# Remove an image
docker rmi postgres:16
# Nuclear option: remove everything unused
docker system prune -a

Port Mapping
Containers have their own network namespace. You must explicitly publish ports to reach them from the host.
# Syntax: -p <host_port>:<container_port>
docker run -d \
--name airflow-web \
-p 8080:8080 \
apache/airflow:2.8.0
# Now http://localhost:8080 routes to port 8080 inside the container
# Bind to a specific interface (security best practice)
docker run -d \
-p 127.0.0.1:5432:5432 \
--name local-pg \
postgres:16
# Only accessible from localhost, not from other machines on the network
# Multiple port mappings
docker run -d \
-p 8080:8080 \
-p 8443:8443 \
--name my-app \
my-app:latest

Environment Variables
The primary mechanism for runtime configuration. Never bake secrets into images.
# Pass a single variable
docker run -e POSTGRES_PASSWORD=secret postgres:16
# Pass multiple variables
docker run \
-e POSTGRES_DB=pipelines \
-e POSTGRES_USER=pipeline_user \
-e POSTGRES_PASSWORD=secret \
postgres:16
# Pass from host environment (no value = pass through from shell)
export POSTGRES_PASSWORD=mysecret
docker run -e POSTGRES_PASSWORD postgres:16
# Use an env file (preferred for multiple variables)
docker run --env-file .env postgres:16

Example .env file — keep this out of git:
# .env
POSTGRES_DB=pipelines
POSTGRES_USER=pipeline_user
POSTGRES_PASSWORD=changeme_in_production
AIRFLOW__CORE__SQL_ALCHEMY_CONN=postgresql+psycopg2://pipeline_user:changeme_in_production@postgres/pipelines
AIRFLOW__CORE__FERNET_KEY=your_fernet_key_here

Add to .gitignore:
.env
*.env
.env.*
!.env.example

Volume Mounts
Containers are ephemeral. When you remove a container, its writable layer is gone. Volumes persist data.
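A minimal demonstration of the difference (a sketch; the volume name and paths are illustrative):
# Without a volume: the file exists only in that container's writable layer
docker run --rm python:3.11-slim sh -c "echo hello > /data.txt"
docker run --rm python:3.11-slim cat /data.txt        # error: a new container starts from the clean image
# With a named volume: the data outlives any single container
docker volume create demo-data
docker run --rm -v demo-data:/data python:3.11-slim sh -c "echo hello > /data/out.txt"
docker run --rm -v demo-data:/data python:3.11-slim cat /data/out.txt   # prints: hello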
Bind Mounts
Maps a host directory directly into the container. Ideal for development.
# Syntax: -v <host_path>:<container_path>
docker run -d \
--name postgres-dev \
-v /home/user/postgres-data:/var/lib/postgresql/data \
-e POSTGRES_PASSWORD=secret \
postgres:16
# Mount current directory (common for dev)
docker run --rm \
-v $(pwd):/workspace \
-w /workspace \
python:3.11-slim \
python pipeline.py

Named Volumes
Docker-managed volumes. Preferred for production — no host path dependency.
# Create a named volume
docker volume create postgres-data
# Use it
docker run -d \
--name postgres-prod \
-v postgres-data:/var/lib/postgresql/data \
-e POSTGRES_PASSWORD=secret \
postgres:16
# Inspect volume (see where Docker stores it on the host)
docker volume inspect postgres-data
# List all volumes
docker volume ls
# Remove unused volumes
docker volume prune

Named Volumes vs Bind Mounts
| | Named Volume | Bind Mount |
|---|---|---|
| Path | Docker manages it | You specify host path |
| Portability | Works on any host | Depends on host directory |
| Performance | Optimised by Docker | Direct host I/O |
| Use case | Databases in prod | Dev: live-reload code |
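One variant worth knowing for either type: append :ro to make a mount read-only inside the container, a sensible default for config and source code a job should never mutate (a sketch; paths and names are illustrative):
# config is a read-only bind mount; pipeline-output is a writable named volume
docker run --rm \
  -v $(pwd)/config:/app/config:ro \
  -v pipeline-output:/app/output \
  -w /app \
  python:3.11-slim \
  python pipeline.py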
Writing Dockerfiles
Basic Structure
# Base image
FROM python:3.11-slim
# Set working directory inside the container
WORKDIR /app
# Copy dependency files first (cache optimization)
COPY requirements.txt .
# Install dependencies (this layer gets cached if requirements.txt unchanged)
RUN pip install --no-cache-dir -r requirements.txt
# Copy application code
COPY . .
# Expose port (documentation — does not actually publish)
EXPOSE 8000
# Default command
CMD ["python", "main.py"]

# Build: -t tags the image, . is the build context
docker build -t my-pipeline:latest .
# Build with a specific Dockerfile
docker build -f Dockerfile.prod -t my-pipeline:prod .
# Build with build arguments
docker build --build-arg PYTHON_VERSION=3.11 -t my-pipeline .

Layer Caching
Docker caches each instruction. When a layer changes, all subsequent layers rebuild. Order your instructions from least to most frequently changing:
# GOOD: requirements.txt rarely changes, code changes often
COPY requirements.txt .
RUN pip install -r requirements.txt
COPY . .
# BAD: code changes invalidate the pip install cache
COPY . .
RUN pip install -r requirements.txt

.dockerignore
Like .gitignore but for the Docker build context. Without it, Docker sends your entire directory (including .git, __pycache__, venv, etc.) to the daemon.
# .dockerignore
.git
.gitignore
__pycache__
*.pyc
*.pyo
.pytest_cache
.coverage
htmlcov/
.venv
venv/
env/
*.egg-info/
dist/
build/
.env
.env.*
!.env.example
*.log
*.tmp
.DS_Store
Thumbs.db
tests/
docs/
*.md

COPY vs ADD
Both copy files into the image. Use COPY by default.
# COPY: explicit, predictable
COPY requirements.txt /app/
COPY src/ /app/src/
# ADD: has extra magic — can extract tar files and fetch URLs
# Avoid ADD unless you specifically need these features
ADD archive.tar.gz /app/ # extracts the archive
ADD https://example.com/file /app/file   # fetches remote URL

Rule: always use COPY. Use ADD only when you need tar extraction and document why.
CMD vs ENTRYPOINT
Both define what runs when the container starts. They compose differently.
# CMD: default command, fully replaceable at runtime
CMD ["python", "pipeline.py"]
# docker run my-image python other_script.py <- replaces CMD
# ENTRYPOINT: the executable, arguments appended
ENTRYPOINT ["python"]
CMD ["pipeline.py"]
# docker run my-image other_script.py <- runs: python other_script.py

Production pattern: use ENTRYPOINT for the executable, CMD for default arguments:
ENTRYPOINT ["python", "-m", "uvicorn"]
CMD ["app.main:app", "--host", "0.0.0.0", "--port", "8000"]
# Override at runtime:
# docker run my-api app.main:app --host 0.0.0.0 --port 9000 --reload

For shell scripts used as entrypoints:
COPY entrypoint.sh /entrypoint.sh
RUN chmod +x /entrypoint.sh
ENTRYPOINT ["/entrypoint.sh"]   # use exec form (JSON array) — not shell form

Multi-Stage Builds
Single-stage images include build tools (gcc, pip, headers) in the final image — adding hundreds of MBs of attack surface. Multi-stage builds separate build from runtime.
The Pattern
# Stage 1: Build — install everything needed to compile/install
FROM python:3.11 AS builder
WORKDIR /build
# Install build dependencies
RUN pip install --upgrade pip
COPY requirements.txt .
RUN pip install --no-cache-dir --prefix=/install -r requirements.txt
# Stage 2: Runtime — copy only the installed packages
FROM python:3.11-slim AS runtime
# Copy installed packages from build stage
COPY --from=builder /install /usr/local
WORKDIR /app
COPY src/ /app/src/
COPY pipeline.py /app/
CMD ["python", "pipeline.py"]

The FROM python:3.11 (full) stage has compilers, headers, pip. The python:3.11-slim runtime stage does not. The final image is just slim + your installed packages.
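To confirm the saving, you can build the builder stage on its own and compare it with the final image (a sketch; tags are illustrative):
# Build only the first stage
docker build --target builder -t my-pipeline:builder .
# Build the full multi-stage image (final stage)
docker build -t my-pipeline:runtime .
# Compare sizes: the runtime image should be dramatically smaller
docker images | grep my-pipeline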
Complete Production Dockerfile: Python Data Pipeline
This is a real-world multi-stage Dockerfile for a Python pipeline that reads from S3, transforms data with pandas, and writes to PostgreSQL.
# ─── Stage 1: Dependency builder ─────────────────────────────────────────────
FROM python:3.11 AS builder
# Avoid prompts during package install
ENV DEBIAN_FRONTEND=noninteractive
WORKDIR /build
# Upgrade pip first
RUN pip install --upgrade pip==24.0
# Copy only dependency manifests (cache-friendly)
COPY requirements.txt requirements-pipeline.txt ./
# Install into a prefix directory so we can copy cleanly
RUN pip install \
--no-cache-dir \
--prefix=/install \
-r requirements.txt \
-r requirements-pipeline.txt
# ─── Stage 2: Runtime image ───────────────────────────────────────────────────
FROM python:3.11-slim AS runtime
# Security: run as non-root user
RUN groupadd --gid 1001 pipeline && \
useradd --uid 1001 --gid pipeline --shell /bin/bash --create-home pipeline
# Install only OS-level runtime deps (not build tools)
RUN apt-get update && apt-get install -y --no-install-recommends \
libpq5 \
curl \
&& rm -rf /var/lib/apt/lists/*
# Copy installed Python packages from builder
COPY --from=builder /install /usr/local
WORKDIR /app
# Copy application source
COPY --chown=pipeline:pipeline src/ /app/src/
COPY --chown=pipeline:pipeline pipeline.py /app/
COPY --chown=pipeline:pipeline entrypoint.sh /app/
RUN chmod +x /app/entrypoint.sh
# Drop to non-root
USER pipeline
# Healthcheck: verify Python can import our module
HEALTHCHECK --interval=30s --timeout=10s --start-period=5s --retries=3 \
CMD python -c "import src.pipeline; print('ok')" || exit 1
ENTRYPOINT ["/app/entrypoint.sh"]
CMD ["--mode", "incremental"]

entrypoint.sh:
#!/bin/bash
set -euo pipefail
echo "[$(date -u +%Y-%m-%dT%H:%M:%SZ)] Starting pipeline"
echo " Mode: ${1:-incremental}"
echo " DB: ${DB_HOST:-not set}"
echo " Bucket: ${S3_BUCKET:-not set}"
# Validate required env vars
: "${DB_HOST:?DB_HOST is required}"
: "${DB_PASSWORD:?DB_PASSWORD is required}"
: "${S3_BUCKET:?S3_BUCKET is required}"
exec python pipeline.py "$@"

requirements.txt:
pandas==2.2.2
psycopg2-binary==2.9.9
boto3==1.34.0
sqlalchemy==2.0.29
pyarrow==16.0.0
pydantic==2.7.0

Image Tagging and Pushing
Tagging Conventions
# Semantic versioning
docker tag my-pipeline:latest my-pipeline:1.4.2
docker tag my-pipeline:latest my-pipeline:1.4
docker tag my-pipeline:latest my-pipeline:1
# Git SHA tags (common in CI/CD)
GIT_SHA=$(git rev-parse --short HEAD)
docker tag my-pipeline:latest my-pipeline:${GIT_SHA}
# Environment tags
docker tag my-pipeline:latest my-pipeline:prod
docker tag my-pipeline:latest my-pipeline:staging

Pushing to Docker Hub
# Login (use token, not password)
docker login -u yourusername
# Tag with registry prefix: <registry>/<namespace>/<image>:<tag>
docker tag my-pipeline:latest yourusername/my-pipeline:1.4.2
docker tag my-pipeline:latest yourusername/my-pipeline:latest
# Push
docker push yourusername/my-pipeline:1.4.2
docker push yourusername/my-pipeline:latest

Pushing to AWS ECR
# Authenticate (requires AWS CLI configured)
AWS_ACCOUNT_ID=$(aws sts get-caller-identity --query Account --output text)
AWS_REGION=us-east-1
aws ecr get-login-password --region ${AWS_REGION} | \
docker login --username AWS --password-stdin \
${AWS_ACCOUNT_ID}.dkr.ecr.${AWS_REGION}.amazonaws.com
# Create repository (one-time)
aws ecr create-repository \
--repository-name my-pipeline \
--region ${AWS_REGION} \
--image-scanning-configuration scanOnPush=true
# Tag and push
ECR_REPO=${AWS_ACCOUNT_ID}.dkr.ecr.${AWS_REGION}.amazonaws.com/my-pipeline
docker tag my-pipeline:latest ${ECR_REPO}:latest
docker tag my-pipeline:latest ${ECR_REPO}:1.4.2
docker push ${ECR_REPO}:latest
docker push ${ECR_REPO}:1.4.2

Build Optimisation Checklist
Before shipping any Docker image, verify:
# Check image size
docker images my-pipeline
# Inspect layers (see what each instruction added)
docker history my-pipeline:latest
# Dive tool for interactive layer explorer
docker run --rm -it \
-v /var/run/docker.sock:/var/run/docker.sock \
wagoodman/dive:latest my-pipeline:latest

| Practice | Why |
|---|---|
| Use -slim or -alpine base | Smaller attack surface, faster pulls |
| Multi-stage builds | Remove build tools from runtime image |
| .dockerignore | Faster builds, no secrets leaked into context |
| Copy requirements.txt before COPY . | Cache pip install between code changes |
| --no-cache-dir in pip | Saves 20–40 MB |
| rm -rf /var/lib/apt/lists/* after apt | Removes package index from layer |
| Non-root user | Security — most CVEs run as root |
| Pin base image versions | Reproducibility — never use latest in production |
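Two quick checks worth adding to CI before a push (a sketch; the image name is illustrative):
# Confirm the image is configured with a non-root user
docker inspect --format '{{.Config.User}}' my-pipeline:latest   # expect a value other than empty or root
# Confirm the process actually runs as that user (UID should not be 0)
docker run --rm --entrypoint id my-pipeline:latest -u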
What's Next
You now understand Docker at the level needed to build and ship images confidently. The next lesson covers Docker Compose — running Airflow, PostgreSQL, Redis, Kafka, and Schema Registry together in a local data engineering environment with a single docker compose up.