architecture · intermediate

AWS vs Azure: Real-World Examples for Every Service — When, Why, and What Breaks

Every AWS and Azure service mapped to real production scenarios. When Uber uses SQS, when Netflix uses DynamoDB, when you need Cosmos DB vs PostgreSQL — concrete decisions, not theory.

SystemForge · April 20, 2026 · 24 min read
AWS · Azure · Cloud Architecture · DynamoDB · Cosmos DB · PostgreSQL · Lambda · Azure Functions · System Design · Real World

How to Read This Article

Every service comparison includes:

  • What problem it solves — in one sentence
  • Real company using it — who actually uses this and why
  • When YOU use it — concrete decision criteria
  • When you pick the other — the trade-off
  • What breaks if you choose wrong — the failure mode

This is not a features list. It is a decision guide.


DATABASE DECISIONS — The Most Important Choices

This section comes first because database choice is the closest thing to an irreversible decision. Getting it wrong means a painful migration 18 months later.


Cosmos DB vs PostgreSQL — The Core Question

Every project starts here. Most get it wrong.

Use Cosmos DB (AWS DynamoDB equivalent) when:


Scenario 1: Uber's Trip Tracking

Uber has 5 million concurrent trips. Each trip updates location every 4 seconds. That is 1.25 million writes per second at steady state, with peaks well above that.

  • Each trip is independent — no JOIN needed between trips
  • Access pattern is always the same: get trip by trip_id
  • Write volume is enormous and unpredictable (New Year's Eve vs Tuesday morning)
  • Latency must be under 10ms — a driver's app must update instantly
Cosmos DB / DynamoDB wins because:
  - Scales to millions of writes per second without re-architecting
  - Single-item lookups by trip_id: O(1), consistent single-digit-millisecond latency
  - No schema migrations when you add a field to the trip record
  - Partition key = trip_id → perfect distribution, no hot partitions

PostgreSQL would fail because:
  - Cannot handle millions of writes per second on a single instance
  - Sharding PostgreSQL at this scale requires massive engineering effort
  - Schema migrations on a table with billions of rows = hours of downtime
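
Here is what that access pattern looks like in code, as a minimal boto3 sketch; the table and attribute names are illustrative, not Uber's actual schema:

Python
import boto3

dynamodb = boto3.resource("dynamodb")
trips = dynamodb.Table("trips")  # hypothetical table, partition key = trip_id

# Write path: each location update overwrites the trip's current state
trips.put_item(Item={
    "trip_id": "trip_8f3a",   # partition key, evenly distributed
    "updated_at": 1745000000,
    "lat": "51.5074",
    "lon": "-0.1278",
    "status": "in_progress",
})

# Read path: always a single-item lookup by trip_id (no scan, no join)
trip = trips.get_item(Key={"trip_id": "trip_8f3a"}).get("Item")

Adding a new field to the trip record needs no migration: the next put_item simply includes it.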

Scenario 2: Netflix's User Watch History

500 million users. Each user has a list of everything they have watched. Netflix needs: "what has user X watched?" — one query, one user, instant response.

  • Access pattern: always by user_id
  • No complex queries — never "show me all users who watched both Breaking Bad and Ozark"
  • Data volume: 500M users × avg 500 items each = 250 billion records
  • Write pattern: append-only (user watches something → add to list)
DynamoDB wins because:
  - PK = user_id, SK = watched_timestamp — perfect for "get all items for user X"
  - 250 billion rows in DynamoDB: the same single-digit-millisecond latency as 1,000 rows
  - Automatic scaling — no capacity planning needed
  - Pay per request: cheaper than provisioning for peak

PostgreSQL would fail because:
  - 250 billion rows is technically possible with partitioning,
    but the operational complexity is enormous
  - Any table-wide query (aggregate analytics) would be catastrophically slow
  - Vertical scaling eventually hits a hard ceiling
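
Here is the one query this table is designed for, sketched with boto3 and a hypothetical watch_history table (PK = user_id, SK = watched_at):

Python
import boto3
from boto3.dynamodb.conditions import Key

dynamodb = boto3.resource("dynamodb")
history = dynamodb.Table("watch_history")  # hypothetical table name

# Append-only write: user watched something, add one item
history.put_item(Item={
    "user_id": "user_42",      # partition key
    "watched_at": 1745000000,  # sort key
    "title_id": "breaking_bad_s01e01",
})

# "What has user X watched?" is the same single-digit-millisecond query
# whether the table holds 1,000 rows or 250 billion
response = history.query(
    KeyConditionExpression=Key("user_id").eq("user_42"),
    ScanIndexForward=False,  # newest first
    Limit=50,
)
recently_watched = response["Items"]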

Scenario 3: IoT Device Telemetry — Laerdal Medical Fleet

10,000 medical devices, each sending status every 30 seconds. That is 333 writes per second — not huge. But access patterns are very specific:

  • "What is the current status of device X?" — always by device_id
  • "Show me all devices assigned to Hospital Y today" — by hospital + date
  • "Which devices are overdue for calibration?" — by calibration_date
  • Data retention: 2 years of telemetry per device = billions of rows
DynamoDB wins because:
  - All three access patterns are known upfront — design GSIs for each
  - Time-series data (telemetry) is append-only — no updates, just inserts
  - 2 years × 10,000 devices × 2,880 readings/day ≈ 21 billion records
    DynamoDB handles this without operational overhead
  - DynamoDB TTL automatically expires old telemetry records

PostgreSQL would fail because:
  - You would need TimescaleDB (the time-series extension) just to cope at this scale
  - Plain PostgreSQL table scans across ~21 billion rows take minutes per query
  - Storage for ~21 billion rows in PostgreSQL is expensive and hard to operate
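
Here is the telemetry write as a boto3 sketch, assuming a hypothetical device_telemetry table with TTL enabled on an expires_at attribute and a GSI on hospital_id + date:

Python
import time
import boto3
from boto3.dynamodb.conditions import Key

dynamodb = boto3.resource("dynamodb")
telemetry = dynamodb.Table("device_telemetry")  # hypothetical table name

now = int(time.time())
TWO_YEARS = 2 * 365 * 24 * 3600

# Append-only insert; DynamoDB deletes the item itself once expires_at passes
telemetry.put_item(Item={
    "device_id": "dev_0042",        # partition key
    "recorded_at": now,             # sort key
    "hospital_id": "hosp_Y",        # GSI partition key
    "date": "2026-04-20",           # GSI sort key
    "status": "ok",
    "expires_at": now + TWO_YEARS,  # TTL attribute: no cleanup jobs needed
})

# Second access pattern, served by the GSI: all devices at Hospital Y today
response = telemetry.query(
    IndexName="hospital-date-index",  # hypothetical GSI name
    KeyConditionExpression=Key("hospital_id").eq("hosp_Y")
                           & Key("date").eq("2026-04-20"),
)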

Use PostgreSQL (Azure SQL / RDS) when:


Scenario 4: Shopify's Order Management

An order has: customer, shipping address, billing address, line items, discount codes, tax calculations, payment method, fulfilment status, and refund history. These entities are deeply related.

SQL
-- This query is impossible in DynamoDB, trivial in PostgreSQL.
-- Refunds are counted in a subquery so the line_items join cannot
-- fan out and inflate the subtotal or the refund count.
SELECT
  o.order_number,
  c.email,
  SUM(li.quantity * li.unit_price) AS subtotal,
  d.discount_percentage,
  p.payment_status,
  (SELECT COUNT(*) FROM refunds r WHERE r.order_id = o.id) AS refund_count
FROM orders o
JOIN customers c ON o.customer_id = c.id
JOIN line_items li ON li.order_id = o.id
LEFT JOIN discount_codes d ON d.id = o.discount_code_id
JOIN payments p ON p.order_id = o.id
WHERE o.created_at > NOW() - INTERVAL '30 days'
  AND o.status = 'unfulfilled'
GROUP BY o.id, c.email, d.discount_percentage, p.payment_status
HAVING SUM(li.quantity * li.unit_price) > 500
ORDER BY subtotal DESC;
PostgreSQL wins because:
  - Complex JOINs and correlated subqueries across six related tables — not possible in DynamoDB
  - Business rules enforced by the database: foreign keys prevent orphan records
  - Transactions: an order refund must atomically update orders, payments,
    inventory, and customer credit — all or nothing
  - Ad-hoc queries: business analysts write SQL directly, no pre-defined access patterns
  - ACID compliance: financial data must never have partial writes

DynamoDB would fail because:
  - No JOINs — you would denormalise everything into one massive item
  - The "ad-hoc reporting" requirement is the killer — 
    DynamoDB cannot answer questions you did not design for at the start
  - Transaction support exists but is capped at 100 items per transaction — 
    complex financial operations with many related records hit this limit

Scenario 5: Healthcare Analytics — MyBCAT Practice Reporting

A practice manager asks: "Show me the call answer rate by agent, broken down by hour, for the last 30 days, compared to the practice average, with the 5 agents who improved most this month."

SQL
-- This is a reporting query. Impossible to design this in DynamoDB upfront.
WITH agent_hourly AS (
  SELECT
    agent_id,
    DATE_TRUNC('hour', call_start) as hour,
    COUNT(*) FILTER (WHERE status = 'answered') as answered,
    COUNT(*) as total,
    ROUND(100.0 * COUNT(*) FILTER (WHERE status = 'answered') / COUNT(*), 2) as answer_rate
  FROM call_logs
  WHERE call_start > NOW() - INTERVAL '30 days'
    AND practice_id = 'p001'
  GROUP BY agent_id, hour
),
practice_avg AS (
  SELECT ROUND(AVG(answer_rate), 2) as avg_rate FROM agent_hourly
),
improvement AS (
  SELECT agent_id,
    AVG(answer_rate) FILTER (WHERE hour > NOW() - INTERVAL '7 days') as recent_rate,
    AVG(answer_rate) FILTER (WHERE hour < NOW() - INTERVAL '7 days') as older_rate
  FROM agent_hourly
  GROUP BY agent_id
)
SELECT i.agent_id, i.recent_rate, i.older_rate,
       i.recent_rate - i.older_rate as improvement,
       p.avg_rate
FROM improvement i
CROSS JOIN practice_avg p
ORDER BY improvement DESC
LIMIT 5;
PostgreSQL wins because:
  - Window functions, CTEs, aggregations, FILTER clauses — all native
  - Business analysts and practice managers write these queries ad-hoc
  - The question was not known when the system was designed
  - Data fits in PostgreSQL at MyBCAT's scale (30 practices, millions of rows)

DynamoDB would fail because:
  - You would need to pre-build a separate aggregation pipeline for every
    possible report — weeks of work per report type
  - DynamoDB cannot GROUP BY across a scan of millions of items efficiently
  - Every new question from the practice manager requires an engineering sprint

Scenario 6: Banking — Account Balances and Transfers

A bank transfer: debit Account A by £500, credit Account B by £500. Both must succeed or neither succeeds.

SQL
BEGIN;
  -- The debit updates zero rows if it would overdraw the account;
  -- the application checks the affected row count and issues ROLLBACK
  -- instead of COMMIT when it is zero.
  UPDATE accounts SET balance = balance - 500
  WHERE account_id = 'A' AND balance >= 500;

  UPDATE accounts SET balance = balance + 500
  WHERE account_id = 'B';

  INSERT INTO transaction_log (from_account, to_account, amount, timestamp)
  VALUES ('A', 'B', 500, NOW());
COMMIT;
PostgreSQL wins because:
  - ACID transactions: debit and credit are atomic — no partial transfer possible
  - Row-level locking: two transfers from Account A simultaneously 
    cannot both see "balance = £1000" and both proceed
  - Check constraints: balance >= 0 enforced at database level
  - Full audit trail with transaction_log JOIN — complex financial queries

Cosmos DB / DynamoDB:
  - DynamoDB transactions exist but are limited and less expressive
  - For true financial systems: default to an ACID-compliant relational database
  - Cosmos DB supports ACID transactions, but only within a single logical
    partition — workable for financial workloads if related records share a
    partition and you use Strong consistency

The Decision Framework — One Question to Ask

"Do I know every query this data will ever need to answer?"

YES, and the data volume is massive, and latency must be milliseconds
  → Cosmos DB / DynamoDB

NO, or users will write ad-hoc queries, or data is financial
  → PostgreSQL / Azure SQL

MESSAGING SERVICES — Real Scenarios


Azure Service Bus Queues (= AWS SQS) — When to Use


Scenario: E-commerce Order Processing

User clicks "Buy Now" on Amazon. The order must:

  1. Reserve inventory
  2. Charge the card
  3. Notify the warehouse
  4. Send confirmation email
  5. Update analytics

If the analytics service is slow, should the customer wait? No. If the warehouse notification fails, should the card charge reverse? No — retry the warehouse notification.

"Buy Now" clicked
      ↓
Order record saved (synchronous — must happen before response)
      ↓
Response: "Order confirmed! #12345"
      ↓
Message: "Process order #12345" → Service Bus Queue
      ↓
Each step picks up the message independently:
  Payment processor: deducts card
  Warehouse system: allocates stock
  Email service: sends confirmation
  Analytics: records the sale

If email service is down:
  Message stays in queue, retries after 30 seconds
  Customer already has their confirmation — they do not care
  Email arrives 2 minutes late — acceptable
  No cascade failure

Real company: Deliveroo

When a restaurant confirms an order, Deliveroo puts a "find a rider" message in a queue. The rider dispatch system processes it. If the dispatch system has a 10-second backlog, orders still process in order — the queue absorbs the burst. Without the queue, a busy Friday night would crash the dispatch service.

In Azure:

C#
// Sender — Deliveroo order service
await using var sender = serviceBusClient.CreateSender("order-dispatch");
await sender.SendMessageAsync(new ServiceBusMessage(JsonSerializer.Serialize(new {
    OrderId = "ord_789",
    RestaurantLocation = "51.5074,-0.1278",
    CustomerLocation = "51.5155,-0.0922"
})));

// Receiver — rider dispatch service
await using var processor = serviceBusClient.CreateProcessor("order-dispatch");
processor.ProcessMessageAsync += async args => {
    var order = JsonSerializer.Deserialize<Order>(args.Message.Body.ToString());
    await DispatchNearestRider(order);
    await args.CompleteMessageAsync(args.Message); // remove from queue on success
};
processor.ProcessErrorAsync += args => {
    // required by the SDK: log it; the message retries and eventually dead-letters
    Console.Error.WriteLine(args.Exception);
    return Task.CompletedTask;
};
await processor.StartProcessingAsync();

Azure Service Bus Topics + Subscriptions (= AWS SNS) — When to Use


Scenario: Hospital Patient Admission

A patient is admitted to a hospital. Seven different systems need to know:

  1. Ward management system — allocate a bed
  2. Pharmacy system — prepare medications
  3. Catering system — add to meal schedule
  4. Billing system — open an account
  5. Lab system — set up for test orders
  6. Notification system — SMS to family member
  7. Audit system — log the admission for compliance
"Patient admitted" event published to Service Bus Topic
          ↓
┌─────────┬─────────┬─────────┬─────────┬─────────┬─────────┬─────────┐
↓         ↓         ↓         ↓         ↓         ↓         ↓
Ward    Pharmacy  Catering  Billing    Lab    Notification  Audit
(each has its own subscription — processes independently)

If the catering system is down for maintenance, only the catering subscription's messages back up. The other six systems process normally. The catering system catches up when it recovers.
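
In code, the pattern looks like this with the azure-servicebus Python SDK; the topic, subscription, and handler names are illustrative:

Python
from azure.servicebus import ServiceBusClient, ServiceBusMessage

conn_str = "<connection string>"  # in production: Managed Identity, not a literal

with ServiceBusClient.from_connection_string(conn_str) as client:
    # Publisher: the admission system sends ONE event to the topic
    with client.get_topic_sender(topic_name="patient-admitted") as sender:
        sender.send_messages(ServiceBusMessage('{"patient_id": "p123", "ward": "4B"}'))

    # Each downstream system reads its OWN subscription, independently
    with client.get_subscription_receiver(
        topic_name="patient-admitted", subscription_name="catering"
    ) as receiver:
        for msg in receiver.receive_messages(max_wait_time=5):
            schedule_meals(msg)             # hypothetical handler
            receiver.complete_message(msg)  # removes it from THIS subscription only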

Real company: NHS England

The NHS Spine (national healthcare data network) uses publish-subscribe for patient events. When a GP surgery updates a patient record, dozens of downstream systems — hospitals, pharmacies, out-of-hours services — receive the update through a publish-subscribe bus. No system polls another; they all react to the event.


Azure Event Grid (= AWS EventBridge) — When to Use


Scenario: GitHub Webhook → Deployment Pipeline

When you push code to a branch, many things need to happen based on which branch and what changed:

  • Push to feature/* → run unit tests only
  • Push to main → run full test suite → deploy to staging
  • Push to main AND file changed in /infrastructure/ → also run Terraform plan
  • Any push → notify Slack
  • Push to main with tag v* → deploy to production
Code pushed to GitHub
      ↓
GitHub webhook → Azure Event Grid custom topic
      ↓
Event Grid routing rules:
  Rule 1: branch matches "main" → Event Grid Subscription → Azure DevOps pipeline (full tests + staging deploy)
  Rule 2: branch matches "feature/*" → Event Grid Subscription → Azure DevOps pipeline (unit tests only)
  Rule 3: changed files contains "infrastructure/" → Event Grid Subscription → Terraform Lambda
  Rule 4: ALL events → Event Grid Subscription → Slack notification function
  Rule 5: tag matches "v*" → Event Grid Subscription → Production deployment pipeline

Event Grid's filter rules do the routing. You do not need a single complex Lambda that has if/elif/else for every case — each rule is independent, testable, and maintainable separately.
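
Here is the publishing side as a sketch with the azure-eventgrid SDK; the endpoint, event type, and payload shape are illustrative, and the routing rules above live in each subscription's filters, configured separately:

Python
from azure.core.credentials import AzureKeyCredential
from azure.eventgrid import EventGridPublisherClient, EventGridEvent

client = EventGridPublisherClient(
    "https://ci-events.uksouth-1.eventgrid.azure.net/api/events",  # hypothetical topic
    AzureKeyCredential("<topic access key>"),
)

# One event; Event Grid fans it out to every subscription whose filter matches
client.send(EventGridEvent(
    subject="branches/main",       # rules filter on this, e.g. subjectBeginsWith
    event_type="Repo.CodePushed",  # hypothetical custom event type
    data={"branch": "main", "changed_files": ["infrastructure/main.tf"]},
    data_version="1.0",
))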

Real company: Microsoft's own Azure platform

When you create a new Virtual Machine in Azure, Event Grid publishes Microsoft.Resources.ResourceWriteSuccess. Dozens of Azure services react: Security Center scans the VM, Azure Monitor starts collecting metrics, Azure Policy checks compliance, your custom Terraform audit function logs the creation. You did not have to wire up VM creation to each of these — they subscribed to the event.


Azure Event Hubs (= AWS Kinesis) — When to Use


Scenario: Tesla Fleet Telemetry

Tesla has 4 million connected vehicles. Each sends 1,000 data points per second: speed, battery level, GPS, motor temperature, door state, brake pressure. That is 4 billion data points per second globally.

Multiple teams need this data simultaneously:

  • Real-time dashboard team: display your car's data in the Tesla app
  • Safety team: detect abnormal brake pressure across the fleet instantly
  • ML team: train the Autopilot model on driving patterns
  • Analytics team: compute fleet-wide energy efficiency statistics
4 million cars → Event Hubs (partitioned stream, 30-day retention)
                      ↓
        ┌─────────────┼────────────────┐
        ↓             ↓                ↓
  App Dashboard   Safety Monitor    ML Pipeline
  Consumer Group  Consumer Group    Consumer Group
  
  (each reads independently, at their own speed, from their own offset)
  
Safety Monitor reads in real time — triggers alert if brake anomaly
ML Pipeline reads 6 hours behind — batch processing for training
Analytics reads once per day — aggregates the full day's data

This is not a queue. A queue deletes messages after consumption. Event Hubs is a log — the data stays for 30 days. Every consumer reads it independently. Adding a new consumer does not affect existing consumers.
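
A sketch of one consumer group with the azure-eventhub SDK; group, hub, and handler names are illustrative:

Python
from azure.eventhub import EventHubConsumerClient

conn_str = "<connection string>"

def on_telemetry(partition_context, event):
    check_brake_pressure(event.body_as_json())  # hypothetical handler
    # with a checkpoint store configured (e.g. Blob Storage), you would also call
    # partition_context.update_checkpoint(event) to record this group's position

# The safety team's consumer group, reading in real time
safety = EventHubConsumerClient.from_connection_string(
    conn_str, consumer_group="safety-monitor", eventhub_name="fleet-telemetry"
)
with safety:
    safety.receive(on_event=on_telemetry, starting_position="-1")  # from the start

# The ML team constructs a second client with consumer_group="ml-pipeline" and
# reads the same retained stream at its own pace; neither affects the other.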

Real company: Siemens Wind Turbines

Siemens Gamesa has 100,000 wind turbines globally, each streaming sensor readings continuously. Event Hubs ingests on the order of 100 billion readings per day, roughly a dozen readings per turbine per second. Three consumers: real-time anomaly detection (prevents turbine damage), preventive maintenance scheduling (analyses wear patterns), and energy grid balancing (feeds into national grid management systems). All three read independently. The anomaly detector reads in real time. The maintenance system reads in 1-hour batches.


COMPUTE — Real Scenarios


Azure Functions (= AWS Lambda) — When to Use


Scenario 1: Stripe Webhook Processing

Every time a payment succeeds or fails, Stripe sends a webhook to your endpoint. You need to:

  • Update the subscription status in your database
  • Send a receipt email
  • Update HubSpot CRM

This happens maybe 1,000 times per day normally, 50,000 times on Black Friday.

Stripe webhook → Azure Function (HTTP trigger)
  → Update database
  → Queue email job
  → Queue CRM update
  
Function runs for ~200ms per webhook
Scales from 1 to 5,000 concurrent invocations automatically
Cost: 1 million invocations × 200ms × 256MB ≈ $1/month at list price
      (and within the monthly free grant, effectively $0)
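
A minimal sketch of the handler in the Azure Functions Python v2 programming model; the queue name is illustrative and the Stripe signature check is omitted:

Python
import json
import azure.functions as func

app = func.FunctionApp()

@app.route(route="stripe-webhook", methods=["POST"])
@app.queue_output(arg_name="emails", queue_name="receipt-emails",
                  connection="AzureWebJobsStorage")
def stripe_webhook(req: func.HttpRequest, emails: func.Out[str]) -> func.HttpResponse:
    event = req.get_json()
    # 1. verify the Stripe signature (omitted)
    # 2. update the subscription status in the database (omitted)
    # 3. hand the slow work to a queue so the webhook returns in milliseconds
    emails.set(json.dumps({"event_type": event["type"], "event_id": event["id"]}))
    return func.HttpResponse(status_code=200)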

Why not an always-on server?

An always-on server for 1,000 webhooks/day costs $30-50/month to keep running. The Function costs roughly a dollar, and usually nothing once the free grant applies. And it handles the Black Friday spike automatically — no capacity planning.

Scenario 2: Scheduled Reports — MyBCAT Practice Reports

Every Monday at 7am ET, generate a weekly summary for each of 30 practices: calls answered, no-show rate, top agents, insurance verification backlog.

Azure Logic App timer (or EventBridge Scheduler) → triggers Lambda/Function
  → query PostgreSQL for last 7 days of data per practice
  → generate PDF report (pdfkit)
  → upload to S3/Blob Storage
  → send email via SendGrid
  
Runs: once per week, 30 practices, ~5 minutes total
Cost: essentially free (well within free tier for both Lambda and Functions)

Why not a cron job on a server?

A server running 24/7 to execute a 5-minute weekly job is 99.9% idle. Lambda/Functions eliminate that waste.
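
Sketched as a plain Lambda handler; the host, bucket, and render_weekly_pdf helper are placeholders:

Python
import boto3
import psycopg2  # packaged as a Lambda layer

s3 = boto3.client("s3")

def handler(event, context):
    conn = psycopg2.connect(host="<rds-endpoint>", dbname="mybcat",
                            user="report_reader",
                            password=get_db_password())  # see the secrets section
    with conn, conn.cursor() as cur:
        cur.execute("""
            SELECT practice_id,
                   COUNT(*) FILTER (WHERE status = 'answered') AS answered,
                   COUNT(*) AS total
            FROM call_logs
            WHERE call_start > NOW() - INTERVAL '7 days'
            GROUP BY practice_id
        """)
        for practice_id, answered, total in cur.fetchall():
            pdf = render_weekly_pdf(practice_id, answered, total)  # hypothetical
            s3.put_object(Bucket="mybcat-reports",
                          Key=f"weekly/{practice_id}.pdf", Body=pdf)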


When NOT to use Lambda/Functions:

Scenario: Real-time video transcoding

A user uploads a 2-hour 4K video. Transcoding takes 45 minutes.

AWS Lambda: maximum timeout = 15 minutes → cannot finish
Azure Functions Consumption: maximum timeout = 10 minutes → cannot finish

Solution:
  Lambda/Function receives the upload notification → puts job in queue
  EC2 Spot Instance / Azure VM (Spot) picks up the job → runs for 45 minutes
  Sends completion notification → Lambda/Function updates the database

Use Lambda for the coordination (cheap, event-driven) and a VM for the heavy lifting (billed per minute, terminated when done).
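
The coordination half, sketched as the function that turns an upload notification into a queued job; the queue URL is hypothetical:

Python
import json
import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.eu-west-1.amazonaws.com/123456789012/transcode-jobs"  # hypothetical

def handler(event, context):
    # S3 upload notification: one job message per object. The 45-minute
    # transcode itself runs on a Spot VM that polls this queue.
    for record in event["Records"]:
        sqs.send_message(
            QueueUrl=QUEUE_URL,
            MessageBody=json.dumps({
                "bucket": record["s3"]["bucket"]["name"],
                "key": record["s3"]["object"]["key"],
                "profile": "4k-h264",
            }),
        )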


STORAGE — Real Scenarios


Azure Blob Storage (= AWS S3) — When to Use


Scenario 1: Spotify Audio Files

Spotify has 100 million songs. Each song is a 5-10MB audio file. Total: ~700TB of audio.

  • Access pattern: get a specific song by song_id — always
  • Write pattern: upload once, read billions of times
  • Latency: acceptable to have 200-500ms for initial buffering
  • Scale: unlimited, cannot predict how many new songs will be added
S3 / Azure Blob Storage wins:
  - Unlimited storage — add 1 million songs tomorrow, no provisioning
  - 99.999999999% (11 nines) durability — a song is never lost
  - CloudFront/Azure CDN in front: song file served from edge node near the listener
  - Presigned URL / SAS token: Spotify's backend generates a 15-minute access URL
    Your browser downloads directly from S3/Blob — Spotify's servers not involved
  
PostgreSQL BLOB storage would fail:
  - A 700TB table in PostgreSQL is technically possible but operationally a nightmare
  - Binary blobs cannot be indexed or queried by content; the database adds cost
    without adding query power, and bloats every backup and replica
  - No CDN integration — every download goes through your server
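
The presigned-URL flow as a boto3 sketch; the bucket and key are illustrative:

Python
import boto3

s3 = boto3.client("s3")

# Backend: authenticate the listener, then mint a short-lived URL for ONE object
url = s3.generate_presigned_url(
    "get_object",
    Params={"Bucket": "audio-catalogue", "Key": "songs/song_12345.ogg"},
    ExpiresIn=900,  # 15 minutes, then the link stops working
)

# The client downloads directly from S3 (or the CDN in front of it) using `url`;
# the application servers never proxy the multi-megabyte file.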

Scenario 2: Call Recordings — MyBCAT

10,000 calls per day, each a 3-minute audio file (~3MB). Recordings stay hot for 30 days for quality review, then remain in archival storage for the 7-year HIPAA/legal retention period.

S3 / Azure Blob workflow:
  1. Call ends on Amazon Connect
  2. Connect uploads recording to S3 automatically
  3. S3 key: recordings/{practiceId}/{date}/{callId}.mp3
  4. Metadata: { practiceId, agentId, patientPhone, duration }
  5. KMS encryption key: per-practice CMK
  6. Object Lock: GOVERNANCE mode, 7-year retention (HIPAA)
  7. Lifecycle rule: 
     - Standard tier (first 30 days — frequently accessed for review)
     - S3 Infrequent Access (30 days to 1 year)
     - S3 Glacier (1-7 years — legal archive, rarely accessed)
  
Cost optimisation:
  10,000 calls × 3MB = 30GB of new recordings per day
  Standard: $0.023/GB-month → each day's 30GB costs $0.69 per month to keep
  Glacier:  $0.004/GB-month → the same 30GB costs $0.12 per month
  Lifecycle to Glacier after 30 days cuts storage cost by ~83%
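
The lifecycle rule from step 7 as a boto3 sketch; the bucket name is illustrative:

Python
import boto3

s3 = boto3.client("s3")

s3.put_bucket_lifecycle_configuration(
    Bucket="mybcat-call-recordings",
    LifecycleConfiguration={"Rules": [{
        "ID": "recordings-tiering",
        "Filter": {"Prefix": "recordings/"},
        "Status": "Enabled",
        "Transitions": [
            {"Days": 30,  "StorageClass": "STANDARD_IA"},  # review window over
            {"Days": 365, "StorageClass": "GLACIER"},      # legal archive
        ],
        "Expiration": {"Days": 2555},  # ~7 years, once retention allows deletion
    }]},
)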

IDENTITY — Real Scenarios


Azure Entra External ID / Cognito — When to Use


Scenario: SaaS Platform with Multiple Business Customers

MyBCAT has 30 optometry practices. Each practice has:

  • Practice managers (can see all data, configure settings)
  • Scheduling agents (can book appointments, view call logs)
  • Read-only users (analysts who view reports)

Each user belongs to exactly one practice and can only see that practice's data.

Cognito / Entra External ID setup:
  
  User signs in → enters email + password + MFA code
  Cognito validates credentials
  Issues JWT with custom claims:
  {
    "sub": "user-uuid-123",
    "email": "sarah@besteyecare.com",
    "custom:practice_id": "practice_p001",
    "custom:role": "scheduling_agent",
    "exp": 1745000000
  }
  
  Every Lambda/Function:
    1. Validate JWT signature (Cognito public key)
    2. Read practice_id from claims
    3. Use practice_id as DynamoDB partition key prefix
    4. Agent from Practice A CANNOT access Practice B's data
       — physically cannot form a valid query
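
Here is the per-request check as a PyJWT sketch; the pool URL, audience, and helper name are illustrative:

Python
import jwt  # PyJWT

JWKS_URL = ("https://cognito-idp.eu-west-1.amazonaws.com/"
            "eu-west-1_EXAMPLE/.well-known/jwks.json")  # hypothetical user pool
jwks_client = jwt.PyJWKClient(JWKS_URL)

def practice_scope(token: str) -> str:
    signing_key = jwks_client.get_signing_key_from_jwt(token)
    claims = jwt.decode(token, signing_key.key, algorithms=["RS256"],
                        audience="my-app-client-id")  # hypothetical app client id
    # Tenant isolation: the partition key prefix comes from the TOKEN,
    # never from user input, so Practice A cannot query Practice B's data
    return claims["custom:practice_id"]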

Why not build your own auth?

Auth is hard. Getting password hashing wrong means user passwords leak. Implementing MFA correctly requires deep cryptography knowledge. TOTP (Google Authenticator) has timing attack vulnerabilities if implemented naively. Secure session management with refresh token rotation is a week's work minimum.

Cognito and Entra External ID solve all of this — correctly, out of the box, HIPAA/SOC2 compliant — for ~$0.0055 per monthly active user.


SECRETS MANAGEMENT — Real Scenarios


Azure Key Vault (= AWS Secrets Manager + KMS) — When to Use


Scenario: Database Credentials That Must Rotate

A production PostgreSQL database has a password. This password:

  • Must not be in source code (obvious)
  • Must not be in environment variables (visible in console, can appear in logs)
  • Must rotate every 90 days (security best practice, some compliance requirements)
  • Must be accessible by multiple services (API Lambda, reporting Lambda, ETL jobs)
Wrong approach (what most teams do initially):
  DB_PASSWORD = "super_secret_123" in .env file
  → Committed to git accidentally
  → Visible to everyone with AWS console access
  → Never rotated because manual rotation requires updating everywhere

Right approach (Key Vault / Secrets Manager):
  
  Terraform creates the secret:
    aws_secretsmanager_secret "db_password"
    
  Each Lambda's IAM role allows:
    secretsmanager:GetSecretValue on this specific secret ARN only
    
  Lambda code:
    from functools import lru_cache

    @lru_cache  # fetch once per cold start, not per invocation
    def get_db_password():
        return secrets_manager.get_secret_value(SecretId='mybcat/prod/db')['SecretString']
    
  Secrets Manager auto-rotation:
    Every 90 days → generates new password → updates RDS → updates secret value
    Lambda fetches new password on next cold start
    Zero manual work, zero downtime
    
  Audit trail:
    CloudTrail logs every GetSecretValue call:
    "Lambda function mybcat-reports-prod fetched db_password at 14:23:01"
    If someone unexpected fetches the secret → alert fires

Real company impact: Capital One's 2019 breach exposed about 100 million customer records. The attacker exploited a misconfigured WAF to pull over-privileged IAM role credentials from an EC2 instance's metadata service. Secrets Manager with tightly IAM-scoped access limits the blast radius of exactly this kind of compromise: a stolen role yields only the specific secrets it is allowed to read.


MONITORING — Real Scenarios


Azure Monitor + Application Insights (= AWS CloudWatch + X-Ray) — When to Use


Scenario: Debugging a Slow Patient Booking

A practice manager reports: "Booking appointments feels slow lately — maybe 8 seconds sometimes."

Without distributed tracing: You know something is slow. You do not know what.

With Application Insights / X-Ray:

Request trace for POST /api/appointments (total: 8,247ms)
│
├── Cognito JWT validation: 12ms
├── Lambda cold start: 2,100ms ← problem #1
│
└── Lambda execution: 6,135ms
    ├── DynamoDB GetItem (slot check): 8ms
    ├── Insurance verification API call: 5,912ms ← problem #2
    └── DynamoDB TransactWrite (booking): 215ms

Root causes identified:
  1. Lambda had no provisioned concurrency — cold start on first request of the day
  2. Insurance verification API (Availity) was degraded — their status page confirms

Fix:
  1. Add provisioned concurrency = 5 for booking Lambda
  2. Add 3-second timeout on insurance API call + circuit breaker
     If Availity times out: use last-known cached eligibility, flag for re-verification
  3. Add CloudWatch alarm: p99 booking latency > 2,000ms → PagerDuty alert

Without tracing you would spend hours guessing. With tracing, root cause identification takes 5 minutes.
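
Fix #3 as a boto3 sketch; the function name and SNS topic ARN are illustrative:

Python
import boto3

cloudwatch = boto3.client("cloudwatch")

cloudwatch.put_metric_alarm(
    AlarmName="booking-p99-latency",
    Namespace="AWS/Lambda",
    MetricName="Duration",
    Dimensions=[{"Name": "FunctionName", "Value": "mybcat-booking-prod"}],
    ExtendedStatistic="p99",  # alert on the percentile; averages hide spikes
    Period=60,
    EvaluationPeriods=5,
    Threshold=2000,           # milliseconds
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:eu-west-1:123456789012:pagerduty-alerts"],  # hypothetical
)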


Scenario: Detecting Suspicious Access Patterns

A disgruntled ex-employee still has their account active (someone forgot to offboard them). They are accessing patient records at 3am.

Without monitoring: you never find out until a HIPAA audit

With Azure Monitor + CloudTrail:
  Alert rule:
    DynamoDB:GetItem called by user "john.doe@mybcat.com"
    at time between 00:00 and 06:00
    → Trigger PagerDuty → Security team investigates
  
  Also:
    CloudTrail shows every API call with source IP
    "john.doe accessed 847 patient records from IP 203.45.67.89
     between 02:14 and 03:58 on 2026-04-15"
  
  HIPAA response:
    - Immediately revoke Cognito user
    - Document the access in breach assessment
    - Determine if 847 records constitute a reportable breach
    - Notify patients if threshold met

Real company: Anthem Health Insurance (2015)

79 million records exposed. The attackers were inside the network for weeks before detection. Microsoft Sentinel (SIEM) or GuardDuty with proper alerting on unusual access patterns would have flagged the anomalous queries within hours, not weeks.


INFRASTRUCTURE AS CODE — Real Scenarios


Terraform — When It Saves You, When It Burns You


Scenario: The Manual Configuration Disaster

A startup builds their infrastructure by clicking around the AWS console. Six months later:

  • A new engineer cannot reproduce the setup in a new region
  • Someone deletes a security group thinking it is unused — API Gateway stops working
  • The company needs to prove to SOC2 auditors what their infrastructure looked like 6 months ago — no record exists
  • Staging environment drifts from production — bugs only appear in production
With Terraform:
  All infrastructure is code in git:
  mybcat-infra/
    modules/
      lambda/       ← reusable Lambda module
      dynamodb/     ← reusable DynamoDB module with HIPAA defaults
      api-gateway/
    environments/
      dev/
        main.tf     ← uses modules, dev-specific values
      prod/
        main.tf     ← same modules, prod-specific values (more memory, provisioned concurrency)

  New engineer spins up a complete dev environment:
    terraform workspace new dev-john
    terraform apply
    ← 10 minutes later: complete copy of the architecture, ready to use

  Accidental deletion:
    terraform plan detects drift → shows "aws_security_group will be created"
    terraform apply → security group restored exactly as it was
    
  SOC2 audit:
    git log shows every infrastructure change with author, timestamp, reason
    "2026-03-15 — added KMS encryption to recordings bucket — PR #245 — HIPAA requirement"

Real company: Airbnb

Airbnb manages hundreds of services across AWS. Their Terraform configuration is in a monorepo — every infrastructure change goes through code review like application code. When a security misconfiguration is found, it is patched in Terraform and applied across every environment simultaneously. Manual fixes in the console are prohibited.


THE COMPLETE REAL-WORLD DECISION FRAMEWORK

When you face a design decision, ask these questions in order:

STORAGE DECISION:
  ┌─ Is this structured data with relationships? ─────────── PostgreSQL
  ├─ Is this high-volume with known access patterns? ──────── DynamoDB / Cosmos DB
  ├─ Is this binary data (files, images, audio)? ─────────── S3 / Blob Storage
  ├─ Is this time-series data (telemetry, metrics)? ──────── TimescaleDB or DynamoDB
  └─ Is this session data or cache? ─────────────────────── Redis (ElastiCache)

MESSAGING DECISION:
  ┌─ Must exactly one consumer process each message? ───────── SQS / Service Bus Queue
  ├─ Must many services react to this event? ──────────────── SNS / Service Bus Topic
  ├─ Must I route events based on their content? ──────────── EventBridge / Event Grid
  └─ Must many consumers read the same stream independently? ─ Kinesis / Event Hubs

COMPUTE DECISION:
  ┌─ Is this event-driven and short-lived? (<15 min) ──────── Lambda / Azure Functions
  ├─ Is this a long-running process? ──────────────────────── EC2 / Azure VM
  ├─ Is this a containerised service? ────────────────────── ECS/Fargate / Container Apps
  └─ Is this a complex multi-step workflow? ───────────────── Step Functions / Durable Functions

AUTHENTICATION DECISION:
  ┌─ Is this a customer-facing app with user accounts? ─────── Cognito / Entra External ID
  ├─ Is this an internal enterprise app? ───────────────────── Cognito / Entra ID (corporate)
  └─ Does a service need to call another service? ──────────── IAM Role / Managed Identity

Summary: The Companies and Their Choices

| Company | Service | Why |
|---|---|---|
| Uber (trip tracking) | DynamoDB equivalent | Millions of writes/sec, single access pattern, unknown scale |
| Netflix (watch history) | DynamoDB | 250B records, always query by user_id, never joins |
| Shopify (orders) | PostgreSQL | Complex joins, financial ACID, ad-hoc queries |
| Spotify (audio files) | S3/Blob | Binary files, CDN integration, unlimited scale |
| Deliveroo (order dispatch) | SQS/Service Bus Queue | Exactly-once processing, retry on failure |
| NHS (patient events) | SNS/Service Bus Topics | One event → many independent systems |
| Tesla (telemetry) | Kinesis/Event Hubs | Billions of events, multiple consumers, replay |
| Airbnb (infrastructure) | Terraform | Multi-cloud, reproducible, auditable |
| Capital One (post-breach) | Secrets Manager | No credentials in code or environment variables |
| GitHub (deployments) | EventBridge/Event Grid | Content-based routing, decoupled pipeline stages |

Enjoyed this article?

Explore the learning path for more.

Have a question, correction, or just found this helpful? Leave a note below.