system-design · advanced · 18 min read

System Design Interview

Design a Healthcare Patient Operations Platform (AWS Serverless)

Answering every patient call, booking into 50+ EHRs, running recall campaigns — all on Lambda, DynamoDB, API Gateway, and S3. Here's how to design it and where the architecture breaks.

Key outcome: 40+ recovered appointments per practice per month
System Design · AWS · Serverless · Lambda · DynamoDB · Healthcare · HIPAA · EHR

The Interview Question

"Design a healthcare patient operations platform that answers inbound patient calls 24/7, books appointments directly into practice EHR systems, and runs proactive recall campaigns across 70+ multi-location healthcare practices. The stack runs entirely on AWS — Lambda, DynamoDB, API Gateway, S3, and Terraform. Walk through the architecture, identify where this stack works well, where it breaks down at scale, and what improvements you would prioritise."

This question combines serverless architecture, event-driven design, healthcare compliance (HIPAA), and multi-tenant SaaS. It's particularly interesting because the AWS serverless stack has specific failure modes at healthcare scale that a senior engineer needs to recognise and fix.


Step 1: What the Platform Actually Does

Three core workflows, each with different architectural characteristics:

Workflow 1: Inbound Call Handling (real-time, latency-critical)
  Patient calls practice → routed to platform → agent books appointment
  directly in practice EHR → no callback required
  Latency budget: < 2 seconds to pull up patient record

Workflow 2: Recall Campaigns (async, batch, high-volume)
  Platform identifies dormant patients → sends SMS/email/voicemail
  → tracks responses → books reactivated patients
  Volume: thousands of outreach touches per campaign

Workflow 3: No-Show Recovery (near-real-time, trigger-based)
  EHR sends cancellation event → platform checks waitlist → 
  fills slot with next patient → confirms via SMS
  Latency budget: fill slot within 10 minutes of cancellation

Each workflow has completely different latency, throughput, and consistency requirements — which is exactly why the architecture gets interesting.


Step 2: The Current Stack — Lambda, DynamoDB, API Gateway, S3

Why Serverless Makes Sense Here

Patient call volume is highly variable:

Monday morning: 50 concurrent calls (post-weekend backlog)
Tuesday 2pm:     5 concurrent calls
Saturday night:  0 calls (but recall campaigns running)

With EC2 / containers:
  → Must provision for peak (50 concurrent) → paying for idle capacity 90% of the time
  → Auto-scaling lag of 2-3 minutes → too slow for inbound calls

With Lambda:
  → Scales to 50 concurrent instantly → scales to 0 when idle
  → Pay per millisecond of execution → near-zero cost for quiet periods
  → Ideal for spiky, unpredictable healthcare call volume

Architecture Overview

┌──────────────────────────────────────────────────────────────────────┐
│  Inbound Channels                                                     │
│  Phone (Twilio/Connect) · SMS · Email · Patient Portal               │
└──────────────┬───────────────────────────────────────────────────────┘
               │
┌──────────────▼───────────────────────────────────────────────────────┐
│  Amazon API Gateway                                                    │
│  REST APIs + WebSocket (real-time agent dashboard)                    │
└──────┬──────────────┬──────────────────────┬──────────────────────────┘
       │              │                      │
┌──────▼──────┐  ┌────▼────────────┐  ┌─────▼──────────────────────┐
│  Call       │  │  Recall /       │  │  EHR Integration           │
│  Handler    │  │  Campaign       │  │  Lambda functions          │
│  Lambda     │  │  Lambda         │  │  (one per EHR type)        │
└──────┬──────┘  └────┬────────────┘  └─────┬──────────────────────┘
       │              │                      │
┌──────▼──────────────▼──────────────────────▼──────────────────────┐
│                        DynamoDB                                     │
│  practices · patients · appointments · campaigns · call_logs        │
└────────────────────────────────────────────────────────────────────┘
               │                      │
┌──────────────▼──────┐    ┌──────────▼────────────────────────────┐
│  S3                 │    │  Amazon SQS                           │
│  Call recordings    │    │  Campaign queue · No-show queue       │
│  Signed consent     │    │  EHR sync queue                      │
│  Exported reports   │    └───────────────────────────────────────┘
└─────────────────────┘
               │
┌──────────────▼──────────────────────────────────────────────────┐
│  AWS EventBridge (workflow orchestration)                        │
│  Scheduled rules for recall campaigns, nightly summaries        │
└─────────────────────────────────────────────────────────────────┘

DynamoDB Access Pattern Design

DynamoDB is schema-less but access-pattern driven. You design your tables around the queries you need, not the other way around.

The platform has these core queries:

1. Get all patients for a practice              → PK: practice_id
2. Get patient by phone number (caller lookup)  → GSI: phone_number
3. Get today's appointments for a practice      → PK: practice_id, SK: appointment_date
4. Get patient's appointment history            → PK: patient_id, SK: appointment_date
5. Get active campaigns for a practice          → GSI: practice_id + status

Table: Patients (single-table design)

PK                     SK                      Attributes
──────────────────────────────────────────────────────────────
PRACTICE#prac_001      PATIENT#pat_abc123      name, dob, phone, last_visit...
PRACTICE#prac_001      PATIENT#pat_def456      ...
PRACTICE#prac_001      APPT#2026-04-20#09:00   patient_id, type, status, ehr_appt_id
PRACTICE#prac_001      APPT#2026-04-20#10:30   ...
PATIENT#pat_abc123     APPT#2026-03-15         practice_id, type, outcome
CAMPAIGN#cmp_789       TOUCH#pat_abc123        sent_at, channel, response, status

GSI-1 (phone lookup):
  PK: phone_number_hash   SK: PRACTICE#prac_001
  → Used when a patient calls: look up by phone in < 5ms

Why single-table design? Multiple related entities (patient, appointment, campaign touch) in one table eliminates joins. DynamoDB has no joins — if you use separate tables, you make multiple round-trips. Single-table gives you related data in one query.
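
To make that concrete, here is a minimal TypeScript sketch of two of those queries using the AWS SDK for JavaScript v3. The table and index names are assumptions for illustration, not the platform's actual code:

import { DynamoDBClient } from "@aws-sdk/client-dynamodb";
import { DynamoDBDocumentClient, QueryCommand } from "@aws-sdk/lib-dynamodb";

const doc = DynamoDBDocumentClient.from(new DynamoDBClient({}));
const TABLE = "patient-ops-prod"; // assumed table name

// Query 3: today's appointments for a practice. One round-trip, no joins:
// appointments live under the same partition key as the practice itself.
export async function todaysAppointments(practiceId: string, isoDate: string) {
  const { Items } = await doc.send(new QueryCommand({
    TableName: TABLE,
    KeyConditionExpression: "PK = :pk AND begins_with(SK, :sk)",
    ExpressionAttributeValues: {
      ":pk": `PRACTICE#${practiceId}`,
      ":sk": `APPT#${isoDate}`, // e.g. "APPT#2026-04-20"
    },
  }));
  return Items ?? [];
}

// Query 2: caller lookup via the phone GSI. Runs when a call arrives.
export async function patientByPhone(phoneHash: string) {
  const { Items } = await doc.send(new QueryCommand({
    TableName: TABLE,
    IndexName: "GSI1", // assumed index name for "GSI-1 (phone lookup)"
    KeyConditionExpression: "phone_number_hash = :p",
    ExpressionAttributeValues: { ":p": phoneHash },
  }));
  return Items?.[0];
}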


Step 3: The EHR Integration Problem — The Hardest Part

This platform integrates with 50+ different EHR systems. Each has a different API, authentication scheme, data model, and rate limit. This is the most complex engineering challenge in the entire system.

EHR Landscape:
  Eyefinity    → REST API, OAuth 2.0, 100 req/min
  Dentrix      → SOAP XML, API key, 30 req/min
  Open Dental  → Custom binary protocol, 50 req/min
  ModMed       → REST API, FHIR-compatible, 200 req/min
  Curve Dental → REST API, JWT, 60 req/min
  ... 45 more

Integration pattern: Adapter + Queue

Every EHR gets its own Lambda adapter:

  EhrAdapterEyefinity   → speaks Eyefinity's API
  EhrAdapterDentrix     → speaks Dentrix's SOAP
  EhrAdapterOpenDental  → speaks Open Dental's protocol
  ...

Common interface all adapters implement:
  searchPatient(phone, dob) → Patient
  getAvailableSlots(date, provider) → Slot[]
  bookAppointment(patient, slot, type) → Confirmation
  cancelAppointment(appt_id) → void

Agent-facing Lambda calls the common interface.
The EHR adapter translates to/from each EHR's native format.

Call arrives → Agent Lambda:
  1. Identify practice (from called number → DynamoDB lookup)
  2. Look up EHR type for this practice: "eyefinity"
  3. Invoke EhrAdapterEyefinity.searchPatient(caller_phone)
  4. Patient found → fetch available slots
  5. Agent books → EhrAdapterEyefinity.bookAppointment(...)
  6. Confirmation stored in DynamoDB + returned to agent
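
In code, the common interface and adapter registry might look like this TypeScript sketch (names are illustrative; real adapters would also handle auth, pagination, and retries):

interface Patient { id: string; name: string; dob: string; phone: string }
interface Slot { provider: string; start: string; end: string }
interface Confirmation { ehrApptId: string; slot: Slot }

interface EhrAdapter {
  searchPatient(phone: string, dob?: string): Promise<Patient | null>;
  getAvailableSlots(date: string, provider?: string): Promise<Slot[]>;
  bookAppointment(patient: Patient, slot: Slot, type: string): Promise<Confirmation>;
  cancelAppointment(ehrApptId: string): Promise<void>;
}

// Each EHR type registers one implementation behind the same interface;
// the agent-facing Lambda never sees a vendor-specific API.
const adapters: Record<string, () => EhrAdapter> = {
  // eyefinity: () => new EyefinityAdapter(), // REST + OAuth 2.0
  // dentrix:   () => new DentrixAdapter(),   // SOAP XML + API key
};

export function adapterFor(ehrType: string): EhrAdapter {
  const factory = adapters[ehrType];
  if (!factory) throw new Error(`No adapter registered for EHR type: ${ehrType}`);
  return factory();
}

The payoff: onboarding EHR number 51 means writing one new adapter class; nothing above the interface changes.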

Rate limit handling per EHR:

SQS queue per EHR type:
  ehr-queue-eyefinity   (Eyefinity budget: 100 req/min)
  ehr-queue-dentrix     (Dentrix budget: 30 req/min)

Lambda trigger: reserved concurrency sized to the EHR's rate limit
  → Throttles adapter calls to the EHR's allowed rate
  → Backed-up requests queue in SQS, processed in order
  → No 429s from EHR APIs
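
One subtlety worth spelling out: reserved concurrency caps parallel executions, not requests per minute, so it has to be sized against how long an adapter call takes. A back-of-envelope helper (an illustration with assumed call durations, not platform code):

// C concurrent workers, each taking ~S seconds per call, make
// C * (60 / S) requests/min. To stay under R requests/min: C = R * S / 60.
export function reservedConcurrencyFor(rateLimitPerMin: number, avgCallSeconds: number): number {
  return Math.max(1, Math.floor((rateLimitPerMin * avgCallSeconds) / 60));
}

// Example: Dentrix at 30 req/min with ~4s SOAP calls → concurrency 2;
// Eyefinity at 100 req/min with ~1.2s REST calls → concurrency 2.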

Step 4: HIPAA Compliance on AWS

Healthcare data (PHI — Protected Health Information) has specific AWS requirements.

Data at rest:
  DynamoDB: encryption enabled (AWS KMS customer-managed key)
  S3: SSE-KMS for all buckets (call recordings, documents)
  Lambda env vars: encrypted via KMS

Data in transit:
  All API Gateway endpoints: TLS 1.2+ enforced
  VPC endpoints for DynamoDB and S3 (traffic stays within AWS network)
  No PHI in Lambda environment variables (keys only, data in DynamoDB)

Access control:
  IAM roles with least privilege — each Lambda has its own role
  EhrAdapterEyefinity role: can only read/write PRACTICE#eyefinity_* keys
  No cross-practice data access possible via IAM policy

Audit trail:
  CloudTrail: all API calls logged
  DynamoDB Streams → Lambda → S3: every data mutation logged immutably
  Call recordings: S3 Object Lock (WORM) — cannot be modified or deleted
    for 6 years (HIPAA retention requirement)
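
As an illustration of the WORM write, a TypeScript sketch (the bucket name is an assumption, and the bucket must be created with Object Lock enabled):

import { S3Client, PutObjectCommand } from "@aws-sdk/client-s3";

const s3 = new S3Client({});

export async function writeAuditRecord(recordId: string, record: object) {
  const retainUntil = new Date();
  retainUntil.setFullYear(retainUntil.getFullYear() + 6); // HIPAA retention window

  await s3.send(new PutObjectCommand({
    Bucket: "patient-ops-audit",      // assumed bucket, Object Lock enabled
    Key: `audit/${recordId}.json`,
    Body: JSON.stringify(record),
    ObjectLockMode: "COMPLIANCE",     // WORM: no one can modify or delete it
    ObjectLockRetainUntilDate: retainUntil,
  }));
}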

BAA: AWS signs a Business Associate Agreement for covered services
  (DynamoDB, S3, Lambda, API Gateway, CloudWatch — all covered)

Encryption key hierarchy:

AWS KMS customer managed key (per-practice)
  ↓ encrypts
Data Encryption Key (rotated annually)
  ↓ encrypts
PHI stored in DynamoDB + S3

Per-practice KMS keys:
  If one practice's key is compromised → only that practice's data at risk
  Multi-tenant isolation at the encryption layer
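
DynamoDB and S3 apply this envelope pattern transparently once pointed at a KMS key, but the same flow in application code looks roughly like the sketch below (the per-practice alias scheme is an assumption):

import { KMSClient, GenerateDataKeyCommand } from "@aws-sdk/client-kms";
import { createCipheriv, randomBytes } from "node:crypto";

const kms = new KMSClient({});

export async function encryptPhi(practiceId: string, plaintext: Buffer) {
  // 1. Ask KMS for a fresh data key under this practice's key.
  const { Plaintext, CiphertextBlob } = await kms.send(new GenerateDataKeyCommand({
    KeyId: `alias/practice-${practiceId}`, // assumed per-practice alias
    KeySpec: "AES_256",
  }));

  // 2. Encrypt the PHI locally with the plaintext data key.
  const iv = randomBytes(12);
  const cipher = createCipheriv("aes-256-gcm", Buffer.from(Plaintext!), iv);
  const ciphertext = Buffer.concat([cipher.update(plaintext), cipher.final()]);

  // 3. Persist ciphertext + the KMS-encrypted data key; the plaintext key
  //    is never stored, so compromising storage alone reveals nothing.
  return { ciphertext, iv, authTag: cipher.getAuthTag(), encryptedKey: CiphertextBlob };
}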

Step 5: Where the Current Architecture Breaks — And How to Fix It

This is the part that separates a senior engineer from a junior one. Knowing the stack is not enough — you have to know its failure modes.

Problem 1: Lambda Cold Starts on Patient Calls

Current issue:
  Lambda cold start = 500ms – 2s (JVM runtimes worse, Node/Python better)
  A patient calls → Lambda starts cold → 1.5s delay before agent sees data
  → Patients hear silence → hang up

Why it happens:
  Lambda containers are recycled after ~15 minutes of inactivity
  Call volume drops at night → containers recycled
  First morning call → cold start

Fix: Provisioned Concurrency for call-handling Lambdas
  aws lambda put-provisioned-concurrency-config \
    --function-name call-handler \
    --qualifier prod \
    --provisioned-concurrent-executions 5

  → 5 warm instances always ready → zero cold starts for the first 5 concurrent calls
  → Cost: a few dollars per month per provisioned instance at 512 MB, cheap insurance for patient experience

Cheaper alternative: EventBridge scheduled ping every 10 minutes
  → Keeps a container warm at near-zero cost, but one ping only warms one container
  → Acceptable for non-mission-critical Lambdas; keep provisioned concurrency for the call path
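
If you take the scheduled-ping route, guard the handler so a ping returns immediately instead of doing real work; a minimal sketch (the warmup flag is an assumed convention):

// The EventBridge rule sends { "warmup": true } as the event payload.
export const handler = async (event: { warmup?: boolean }) => {
  if (event.warmup) return { statusCode: 200, body: "warm" }; // container stays alive
  // ... real call-handling logic ...
  return { statusCode: 200, body: "handled" };
};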

Problem 2: DynamoDB Hot Partitions on High-Volume Practices

Current issue:
  Large practice (300 locations, 10,000 appointments/day) writes to
  PK: PRACTICE#prac_megacorp
  All writes land on one DynamoDB partition → throughput limited to 1,000 WCU/partition
  → Throttling during morning rush → appointment booking failures

Fix: Write sharding
  Instead of: PK = PRACTICE#prac_megacorp
  Use:        PK = PRACTICE#prac_megacorp#SHARD#{random 1-10}

  Spreads writes across 10 partitions → ~10,000 WCU of write headroom
  Read query: fan-out to all 10 shards, merge results
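
A sketch of the sharding in application code (shard count, table name, and client setup are assumptions):

import { DynamoDBClient } from "@aws-sdk/client-dynamodb";
import { DynamoDBDocumentClient, QueryCommand } from "@aws-sdk/lib-dynamodb";

const doc = DynamoDBDocumentClient.from(new DynamoDBClient({}));
const TABLE = "patient-ops-prod"; // assumed
const SHARD_COUNT = 10;

// Writes: pick a random shard so a hot practice's writes spread evenly.
export function shardedPk(practiceId: string): string {
  const shard = 1 + Math.floor(Math.random() * SHARD_COUNT);
  return `PRACTICE#${practiceId}#SHARD#${shard}`;
}

// Reads: fan out one Query per shard in parallel, then merge the results.
export async function queryAllShards(practiceId: string, skPrefix: string) {
  const queries = Array.from({ length: SHARD_COUNT }, (_, i) =>
    doc.send(new QueryCommand({
      TableName: TABLE,
      KeyConditionExpression: "PK = :pk AND begins_with(SK, :sk)",
      ExpressionAttributeValues: {
        ":pk": `PRACTICE#${practiceId}#SHARD#${i + 1}`,
        ":sk": skPrefix,
      },
    })),
  );
  const results = await Promise.all(queries);
  return results.flatMap((r) => r.Items ?? []);
}

The trade-off is explicit: every read now costs a 10x fan-out, which is why sharding is reserved for the handful of very large practices.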

Alternative fix for large practices: switch to Aurora Serverless v2
  Relational model handles complex queries better
  Auto-scales from 0.5 to 128 ACUs — still serverless
  Better for practices running complex reporting queries

Problem 3: EHR Integration Failures Silently Drop Appointments

Current issue:
  EHR API returns 503 → Lambda retries 3x → gives up → no appointment booked
  Agent told "booking confirmed" but EHR never received it
  Patient arrives → no appointment in system → bad experience

Fix: Outbox pattern on appointment bookings
  Step 1: Write appointment to DynamoDB with status = "PENDING"
  Step 2: Lambda sends to EHR
  Step 3a: Success → update status = "CONFIRMED", store ehr_appt_id
  Step 3b: Failure → status stays "PENDING"

  DynamoDB Streams trigger the initial EHR send; a scheduled sweeper Lambda
  re-queries PENDING appointments older than 30 seconds
  → Retries the EHR booking → alerts the operations team if 3+ retries fail
  → Never silently drop a booking again
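
A minimal TypeScript sketch of steps 1-3 (attribute names and the injected EHR call are illustrative, not the platform's code):

import { DynamoDBClient } from "@aws-sdk/client-dynamodb";
import { DynamoDBDocumentClient, PutCommand, UpdateCommand } from "@aws-sdk/lib-dynamodb";

const doc = DynamoDBDocumentClient.from(new DynamoDBClient({}));
const TABLE = "patient-ops-prod"; // assumed

export async function bookWithOutbox(
  practiceId: string,
  appt: { sk: string; patientId: string },
  bookInEhr: () => Promise<{ ehrApptId: string }>, // the EHR adapter call
) {
  // Step 1: durably record intent BEFORE talking to the EHR.
  await doc.send(new PutCommand({
    TableName: TABLE,
    Item: {
      PK: `PRACTICE#${practiceId}`, SK: appt.sk,
      patient_id: appt.patientId, status: "PENDING", created_at: Date.now(),
    },
  }));

  try {
    // Steps 2 + 3a: attempt the booking; on success, flip to CONFIRMED.
    const { ehrApptId } = await bookInEhr();
    await doc.send(new UpdateCommand({
      TableName: TABLE,
      Key: { PK: `PRACTICE#${practiceId}`, SK: appt.sk },
      UpdateExpression: "SET #s = :s, ehr_appt_id = :id",
      ExpressionAttributeNames: { "#s": "status" }, // "status" is a DynamoDB reserved word
      ExpressionAttributeValues: { ":s": "CONFIRMED", ":id": ehrApptId },
    }));
  } catch {
    // Step 3b: leave status = PENDING; the sweeper retries and alerts.
  }
}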

Problem 4: Recall Campaigns Blast DynamoDB

Current issue:
  Recall campaign: "Contact all patients not seen in 18 months"
  1,000 patients per practice × 50 practices = 50,000 DynamoDB reads + writes
  All triggered at the same time (scheduled Lambda at 9am)
  → DynamoDB read capacity spike → throttling → campaign sends partial

Fix: Rate-limited SQS fan-out
  Campaign Lambda:
    1. Query all eligible patients → write patient IDs to SQS
    2. Lambda triggered by SQS with reserved concurrency = 10
    3. Each Lambda processes one patient, sends outreach, updates status
    4. 10 concurrent = ~100 patients/minute → smooth, no spikes
    5. Campaign progress visible in real time via DynamoDB status updates
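
The fan-out step itself is small; a sketch (the queue URL env var is an assumption):

import { SQSClient, SendMessageBatchCommand } from "@aws-sdk/client-sqs";

const sqs = new SQSClient({});
const QUEUE_URL = process.env.RECALL_QUEUE_URL!; // assumed, e.g. recall-send-queue

// Enqueue eligible patients in batches of 10 (the SQS batch maximum).
// The consumer Lambda's reserved concurrency of 10 drains it smoothly.
export async function fanOutCampaign(campaignId: string, patientIds: string[]) {
  for (let i = 0; i < patientIds.length; i += 10) {
    const batch = patientIds.slice(i, i + 10);
    await sqs.send(new SendMessageBatchCommand({
      QueueUrl: QUEUE_URL,
      Entries: batch.map((patientId, j) => ({
        Id: String(j), // must be unique only within this batch
        MessageBody: JSON.stringify({ campaignId, patientId }),
      })),
    }));
  }
}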

Problem 5: No Observability on What's Actually Happening

Current issue (common in Lambda-first systems):
  Lambda logs go to CloudWatch Logs — fragmented across functions
  No way to trace a single patient call across 4 Lambda invocations
  No alerting when EHR error rate spikes
  Operations team learns about booking failures when patients complain

Fix: AWS X-Ray distributed tracing
  Add X-Ray SDK to all Lambda functions
  Every call gets a trace ID (passed via HTTP header through all downstream calls)
  → See entire call flow: API Gateway → Call Lambda → EHR Adapter → DynamoDB
  → Identify which EHR is slow, which practices have booking failures

Fix: CloudWatch Metrics dashboard
  Custom metrics:
    calls_answered_total (per practice)
    appointments_booked_total (per practice, per EHR)
    ehr_error_rate (per EHR type)
    recall_campaign_response_rate
    cold_start_duration_p99

  Alarms:
    ehr_error_rate > 5% → PagerDuty alert
    cold_start_duration_p99 > 1000ms → investigate Lambda config
    appointments_pending_ehr > 50 → outbox backlog building
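
Emitting one of those custom metrics from a Lambda might look like this sketch (namespace and dimension names are assumptions):

import { CloudWatchClient, PutMetricDataCommand } from "@aws-sdk/client-cloudwatch";

const cw = new CloudWatchClient({});

// Count one booked appointment, broken down by practice and EHR type.
export async function recordBooking(practiceId: string, ehrType: string) {
  await cw.send(new PutMetricDataCommand({
    Namespace: "PatientOps", // assumed namespace
    MetricData: [{
      MetricName: "appointments_booked_total",
      Value: 1,
      Unit: "Count",
      Dimensions: [
        { Name: "practice", Value: practiceId },
        { Name: "ehr_type", Value: ehrType },
      ],
    }],
  }));
}

At higher volumes, the CloudWatch Embedded Metric Format (structured JSON written to stdout) emits the same metrics without one API call per invocation.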

Step 6: The Improved Architecture

┌──────────────────────────────────────────────────────────────────────────┐
│  Inbound: Phone (Amazon Connect) · SMS · Email · Web                     │
└─────────────────────────┬────────────────────────────────────────────────┘
                          │
┌─────────────────────────▼────────────────────────────────────────────────┐
│  API Gateway (REST + WebSocket)                                           │
│  WAF: rate limiting, IP blocking, OWASP rules                            │
└──────┬──────────────┬──────────────────────┬──────────────────────────────┘
       │              │                      │
┌──────▼──────────────▼──────────────────────▼──────────────────────────┐
│  Lambda Functions (Node.js, ARM64 Graviton — 20% cheaper, faster)     │
│  call-handler (provisioned concurrency: 5)                             │
│  recall-campaign-trigger                                               │
│  no-show-recovery                                                      │
│  ehr-adapter-{type} (one per EHR, reserved concurrency = rate limit)  │
└──────┬──────────────────────────────┬──────────────────────────────────┘
       │                              │
┌──────▼──────────────┐    ┌──────────▼────────────────────────────────┐
│  DynamoDB           │    │  SQS Queues                               │
│  (sharded PKs for   │    │  ehr-booking-queue (outbox pattern)       │
│   large practices)  │    │  recall-send-queue (rate-limited)         │
│  + DynamoDB Streams │    │  no-show-fill-queue                       │
└──────┬──────────────┘    └──────────┬────────────────────────────────┘
       │                              │
┌──────▼──────────────────────────────▼──────────────────────────────┐
│  Observability                                                       │
│  X-Ray tracing across all Lambdas                                   │
│  CloudWatch custom metrics + alarms                                 │
│  CloudWatch Logs Insights (cross-function log queries)              │
└──────┬──────────────────────────────────────────────────────────────┘
       │
┌──────▼──────────────────────────────────────────────────────────────┐
│  S3 (KMS encrypted, Object Lock for call recordings)                │
│  Aurora Serverless v2 (for large-practice analytics queries)        │
│  Secrets Manager (EHR credentials, rotating automatically)         │
└─────────────────────────────────────────────────────────────────────┘

Step 7: Infrastructure as Code — Terraform Patterns

The team uses Terraform for all infrastructure. A few patterns that matter at this scale:

# Lambda with provisioned concurrency for call handler
resource "aws_lambda_function" "call_handler" {
  function_name = "call-handler-${var.env}"
  role          = aws_iam_role.call_handler.arn  # per-function least-privilege role
  handler       = "index.handler"                # deployment package config omitted
  runtime       = "nodejs20.x"
  architectures = ["arm64"]           # Graviton2: 20% cheaper, faster
  memory_size   = 512
  timeout       = 10

  environment {
    variables = {
      DYNAMODB_TABLE    = aws_dynamodb_table.main.name
      EHR_ADAPTER_ARN   = aws_lambda_function.ehr_adapter.arn
      # No secrets here; use Secrets Manager
    }
  }

  tracing_config { mode = "Active" }  # X-Ray
}

resource "aws_lambda_provisioned_concurrency_config" "call_handler" {
  function_name                  = aws_lambda_function.call_handler.function_name
  qualifier                      = aws_lambda_alias.call_handler_prod.name
  provisioned_concurrent_executions = 5  # Always warm
}

# DynamoDB with customer-managed KMS encryption (per-practice keys come from the onboarding module below)
resource "aws_dynamodb_table" "main" {
  name         = "patient-ops-${var.env}"
  billing_mode = "PAY_PER_REQUEST"  # On-demand; no capacity planning

  hash_key  = "PK"
  range_key = "SK"

  point_in_time_recovery { enabled = true }  # HIPAA: restore to any second (35-day window)
  server_side_encryption {
    enabled     = true
    kms_key_arn = aws_kms_key.dynamodb.arn
  }

  stream_enabled   = true
  stream_view_type = "NEW_AND_OLD_IMAGES"  # Full audit trail via Streams
}

Terraform workspace strategy:

terraform workspace list:
  dev     → isolated DynamoDB tables, EHR sandbox APIs
  staging → production-like, uses EHR test environments
  prod    → production, real EHR APIs, HIPAA controls enabled

Per-practice infrastructure:
  Each new practice onboarded → Terraform module creates:
    - KMS key (practice-specific encryption)
    - IAM role (practice-scoped DynamoDB access)
    - Secrets Manager entries (EHR credentials)
  → Fully automated onboarding, no manual AWS console work

What the Interviewer Is Actually Testing

  • Do you understand why serverless (Lambda) makes sense for spiky healthcare call volume — and what it costs vs always-on compute?
  • Can you design DynamoDB access patterns (single-table, GSIs, shard keys) rather than just saying "use NoSQL"?
  • Do you recognise the EHR integration adapter pattern and how to handle 50+ different APIs with rate limiting via SQS reserved concurrency?
  • Do you identify Lambda cold start as a patient experience problem and solve it with provisioned concurrency — not just mention it as a footnote?
  • Do you apply the Outbox pattern to prevent silent booking failures at the EHR integration boundary?
  • Do you know HIPAA controls on AWS — KMS per-tenant keys, S3 Object Lock for retention, CloudTrail, BAA coverage?
  • Do you recognise the DynamoDB hot partition problem for large multi-location practices and know how to shard?
  • Do you propose X-Ray + CloudWatch custom metrics as the observability fix, not just "add logging"?
