System Design Interview
Design a Healthcare Patient Operations Platform (AWS Serverless)
Answering every patient call, booking into 50+ EHRs, running recall campaigns — all on Lambda, DynamoDB, API Gateway, and S3. Here's how to design it and where the architecture breaks.
The Interview Question
"Design a healthcare patient operations platform that answers inbound patient calls 24/7, books appointments directly into practice EHR systems, and runs proactive recall campaigns across 70+ multi-location healthcare practices. The stack runs entirely on AWS — Lambda, DynamoDB, API Gateway, S3, and Terraform. Walk through the architecture, identify where this stack works well, where it breaks down at scale, and what improvements you would prioritise."
This question combines serverless architecture, event-driven design, healthcare compliance (HIPAA), and multi-tenant SaaS. It's particularly interesting because the AWS serverless stack has specific failure modes at healthcare scale that a senior engineer needs to recognise and fix.
Step 1: What the Platform Actually Does
Three core workflows, each with different architectural characteristics:
Workflow 1: Inbound Call Handling (real-time, latency-critical)
Patient calls practice → routed to platform → agent books appointment
directly in practice EHR → no callback required
Latency budget: < 2 seconds to pull up patient record
Workflow 2: Recall Campaigns (async, batch, high-volume)
Platform identifies dormant patients → sends SMS/email/voicemail
→ tracks responses → books reactivated patients
Volume: thousands of outreach touches per campaign
Workflow 3: No-Show Recovery (near-real-time, trigger-based)
EHR sends cancellation event → platform checks waitlist →
fills slot with next patient → confirms via SMS
Latency budget: fill slot within 10 minutes of cancellation

Each workflow has completely different latency, throughput, and consistency requirements — which is exactly why the architecture gets interesting.
Step 2: The Current Stack — Lambda, DynamoDB, API Gateway, S3
Why Serverless Makes Sense Here
Patient call volume is highly variable:
Monday morning: 50 concurrent calls (post-weekend backlog)
Tuesday 2pm: 5 concurrent calls
Saturday night: 0 calls (but recall campaigns running)
With EC2 / containers:
→ Must provision for peak (50 concurrent) → paying for idle capacity 90% of time
→ Auto-scaling lag: takes 2-3 minutes → bad for inbound calls
With Lambda:
→ Scales to 50 concurrent instantly → scales to 0 when idle
→ Billed per millisecond of execution → near-zero cost for quiet periods
→ Ideal for spiky, unpredictable healthcare call volume
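To make that cost argument concrete, here is a back-of-the-envelope sketch in TypeScript. The pricing constants are approximate us-east-1 list prices and the volume figures are assumptions, not numbers from the platform:

// Rough Lambda cost model for the call-handling path.
// Pricing constants are approximate and will drift; treat as assumptions.
const GB_SECOND_PRICE = 0.0000166667;   // $ per GB-second (x86, us-east-1)
const REQUEST_PRICE = 0.20 / 1_000_000; // $ per invocation

const callsPerMonth = 30_000;     // assumed platform-wide call volume
const invocationsPerCall = 10;    // lookups, slot fetches, booking, logging
const avgInvocationSeconds = 0.3; // typical API-style invocation
const memoryGb = 0.5;             // 512 MB call-handler Lambda

const invocations = callsPerMonth * invocationsPerCall;
const computeCost = invocations * avgInvocationSeconds * memoryGb * GB_SECOND_PRICE;
const requestCost = invocations * REQUEST_PRICE;

// Under $1/month of compute at this volume, vs. paying around the clock
// for instances sized to the 50-concurrent Monday-morning peak.
console.log(`Lambda: ~$${(computeCost + requestCost).toFixed(2)}/month`);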
Architecture Overview
┌──────────────────────────────────────────────────────────────────────┐
│ Inbound Channels │
│ Phone (Twilio/Connect) · SMS · Email · Patient Portal │
└──────────────┬───────────────────────────────────────────────────────┘
│
┌──────────────▼───────────────────────────────────────────────────────┐
│ Amazon API Gateway │
│ REST APIs + WebSocket (real-time agent dashboard) │
└──────┬──────────────┬──────────────────────┬──────────────────────────┘
│ │ │
┌──────▼──────┐ ┌────▼────────────┐ ┌─────▼──────────────────────┐
│ Call │ │ Recall / │ │ EHR Integration │
│ Handler │ │ Campaign │ │ Lambda functions │
│ Lambda │ │ Lambda │ │ (one per EHR type) │
└──────┬──────┘ └────┬────────────┘ └─────┬──────────────────────┘
│ │ │
┌──────▼──────────────▼──────────────────────▼──────────────────────┐
│ DynamoDB │
│ practices · patients · appointments · campaigns · call_logs │
└────────────────────────────────────────────────────────────────────┘
│ │
┌──────────────▼──────┐ ┌──────────▼────────────────────────────┐
│ S3 │ │ Amazon SQS │
│ Call recordings │ │ Campaign queue · No-show queue │
│ Signed consent │ │ EHR sync queue │
│ Exported reports │ └───────────────────────────────────────┘
└─────────────────────┘
│
┌──────────────▼──────────────────────────────────────────────────┐
│ AWS EventBridge (workflow orchestration) │
│ Scheduled rules for recall campaigns, nightly summaries │
└─────────────────────────────────────────────────────────────────┘

DynamoDB Access Pattern Design
DynamoDB is schema-less but access-pattern driven. You design your tables around the queries you need, not the other way around.
The platform has these core queries:
1. Get all patients for a practice → PK: practice_id
2. Get patient by phone number (caller lookup) → GSI: phone_number
3. Get today's appointments for a practice → PK: practice_id, SK: appointment_date
4. Get patient's appointment history → PK: patient_id, SK: appointment_date
5. Get active campaigns for a practice → GSI: practice_id + status

Table: Patients (single-table design)
PK SK Attributes
──────────────────────────────────────────────────────────────
PRACTICE#prac_001 PATIENT#pat_abc123 name, dob, phone, last_visit...
PRACTICE#prac_001 PATIENT#pat_def456 ...
PRACTICE#prac_001 APPT#2026-04-20#09:00 patient_id, type, status, ehr_appt_id
PRACTICE#prac_001 APPT#2026-04-20#10:30 ...
PATIENT#pat_abc123 APPT#2026-03-15 practice_id, type, outcome
CAMPAIGN#cmp_789 TOUCH#pat_abc123 sent_at, channel, response, status
GSI-1 (phone lookup):
PK: phone_number_hash SK: PRACTICE#prac_001
→ Used when a patient calls: look up by phone in < 5ms

Why single-table design? Multiple related entities (patient, appointment, campaign touch) in one table eliminates joins. DynamoDB has no joins — if you use separate tables, you make multiple round-trips. Single-table gives you related data in one query.
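As a concrete sketch of access pattern 3 (today's appointments for a practice), using the AWS SDK v3 document client: one Query, no joins. The table name is illustrative.

import { DynamoDBClient } from "@aws-sdk/client-dynamodb";
import { DynamoDBDocumentClient, QueryCommand } from "@aws-sdk/lib-dynamodb";

const ddb = DynamoDBDocumentClient.from(new DynamoDBClient({}));

// Access pattern 3: today's appointments for one practice.
// PK pins the practice; begins_with() on SK pins the day.
export async function todaysAppointments(practiceId: string, isoDate: string) {
  const result = await ddb.send(new QueryCommand({
    TableName: "patient-ops", // illustrative
    KeyConditionExpression: "PK = :pk AND begins_with(SK, :day)",
    ExpressionAttributeValues: {
      ":pk": `PRACTICE#${practiceId}`,
      ":day": `APPT#${isoDate}`, // e.g. "APPT#2026-04-20"
    },
  }));
  return result.Items ?? [];
}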
Step 3: The EHR Integration Problem — The Hardest Part
This platform integrates with 50+ different EHR systems. Each has a different API, authentication scheme, data model, and rate limit. This is the most complex engineering challenge in the entire system.
EHR Landscape:
Eyefinity → REST API, OAuth 2.0, 100 req/min
Dentrix → SOAP XML, API key, 30 req/min
Open Dental → Custom binary protocol, 50 req/min
ModMed → REST API, FHIR-compatible, 200 req/min
Curve Dental → REST API, JWT, 60 req/min
... 45 more

Integration pattern: Adapter + Queue
Every EHR gets its own Lambda adapter:
EhrAdapterEyefinity → speaks Eyefinity's API
EhrAdapterDentrix → speaks Dentrix's SOAP
EhrAdapterOpenDental → speaks Open Dental's protocol
...
Common interface all adapters implement:
searchPatient(phone, dob) → Patient
getAvailableSlots(date, provider) → Slot[]
bookAppointment(patient, slot, type) → Confirmation
cancelAppointment(appt_id) → void
Agent-facing Lambda calls the common interface.
The EHR adapter translates to/from each EHR's native format.
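A sketch of that common interface in TypeScript. The type shapes and the registry are illustrative, and the concrete adapter classes are omitted:

// Common contract every EHR adapter implements (names illustrative).
interface Patient { id: string; name: string; dob: string; phone: string }
interface Slot { providerId: string; start: string; end: string }
interface Confirmation { ehrApptId: string; status: "BOOKED" }

export interface EhrAdapter {
  searchPatient(phone: string, dob?: string): Promise<Patient | null>;
  getAvailableSlots(date: string, providerId: string): Promise<Slot[]>;
  bookAppointment(patient: Patient, slot: Slot, type: string): Promise<Confirmation>;
  cancelAppointment(ehrApptId: string): Promise<void>;
}

// Registry keyed on the practice's configured EHR type. The agent-facing
// Lambda only ever sees EhrAdapter; vendor quirks stay inside the adapters.
const adapters: Record<string, EhrAdapter> = {
  // eyefinity: new EyefinityAdapter(), // REST, OAuth 2.0 (class omitted)
  // dentrix:   new DentrixAdapter(),   // SOAP XML, API key (class omitted)
};

export function adapterFor(ehrType: string): EhrAdapter {
  const adapter = adapters[ehrType];
  if (!adapter) throw new Error(`No adapter registered for EHR type: ${ehrType}`);
  return adapter;
}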
Call arrives → Agent Lambda:
1. Identify practice (from called number → DynamoDB lookup)
2. Look up EHR type for this practice: "eyefinity"
3. Invoke EhrAdapterEyefinity.searchPatient(caller_phone)
4. Patient found → fetch available slots
5. Agent books → EhrAdapterEyefinity.bookAppointment(...)
6. Confirmation stored in DynamoDB + returned to agent

Rate limit handling per EHR:
SQS queue per EHR type:
ehr-queue-eyefinity (drained at ≤ 100 requests/min)
ehr-queue-dentrix (drained at ≤ 30 requests/min)
Lambda trigger: reserved concurrency sized from the EHR's rate limit
(throughput ≈ concurrency ÷ average call duration, so cap concurrency accordingly)
→ Auto-throttles to the EHR's allowed rate
→ Backed-up requests queue in SQS, processed in order
→ No 429s from EHR APIs
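A sketch of one per-EHR queue consumer, assuming the standard SQSHandler signature from @types/aws-lambda and an assumed message shape. Reserved concurrency is configured on the function itself:

import type { SQSHandler } from "aws-lambda";

// Consumer for ehr-queue-dentrix. With ~2 s per EHR call and a 30 req/min
// limit, a reserved concurrency of 1 keeps throughput under the cap
// (throughput ≈ concurrency ÷ average call duration).
export const handler: SQSHandler = async (event) => {
  for (const record of event.Records) {
    const request = JSON.parse(record.body); // assumed shape: { action, payload }
    // e.g. await adapterFor("dentrix").bookAppointment(...) -- see sketch above
    console.log("processing", request.action);
    // An uncaught error here returns the message to the queue for retry,
    // and to a dead-letter queue once retries are exhausted.
  }
};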
Step 4: HIPAA Compliance on AWS
Healthcare data (PHI — Protected Health Information) has specific AWS requirements.
Data at rest:
DynamoDB: encryption enabled (AWS KMS customer-managed key)
S3: SSE-KMS for all buckets (call recordings, documents)
Lambda env vars: encrypted via KMS
Data in transit:
All API Gateway endpoints: TLS 1.2+ enforced
VPC endpoints for DynamoDB and S3 (traffic stays within AWS network)
No PHI in Lambda environment variables (keys only, data in DynamoDB)
Access control:
IAM roles with least privilege — each Lambda has its own role
EhrAdapterEyefinity role: can only read/write partition keys of Eyefinity practices (enforced via the dynamodb:LeadingKeys IAM condition)
No cross-practice data access possible via IAM policy
Audit trail:
CloudTrail: all API calls logged
DynamoDB Streams → Lambda → S3: every data mutation logged immutably
Call recordings: S3 Object Lock (WORM) — cannot be modified or deleted
for 6 years (HIPAA retention requirement)
BAA: AWS signs a Business Associate Agreement for covered services
(DynamoDB, S3, Lambda, API Gateway, CloudWatch — all covered)

Encryption key hierarchy:
AWS KMS customer-managed key (per-practice)
↓ encrypts
Data Encryption Key (rotated annually)
↓ encrypts
PHI stored in DynamoDB + S3
Per-practice KMS keys:
If one practice's key is compromised → only that practice's data at risk
Multi-tenant isolation at the encryption layer
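A sketch of that envelope-encryption flow with the AWS SDK v3 KMS client; the per-practice key alias is illustrative:

import { KMSClient, GenerateDataKeyCommand } from "@aws-sdk/client-kms";
import { createCipheriv, randomBytes } from "node:crypto";

const kms = new KMSClient({});

// KMS generates a data key under the practice's customer-managed key.
// PHI is encrypted locally; only the *encrypted* data key is persisted.
export async function encryptPhi(practiceId: string, plaintext: Buffer) {
  const { Plaintext, CiphertextBlob } = await kms.send(
    new GenerateDataKeyCommand({
      KeyId: `alias/practice-${practiceId}`, // illustrative per-practice alias
      KeySpec: "AES_256",
    })
  );

  const iv = randomBytes(12); // standard 96-bit GCM nonce
  const cipher = createCipheriv("aes-256-gcm", Buffer.from(Plaintext!), iv);
  const ciphertext = Buffer.concat([cipher.update(plaintext), cipher.final()]);

  // Store all four alongside the item; decrypt later via kms:Decrypt.
  return {
    ciphertext,
    iv,
    authTag: cipher.getAuthTag(),
    encryptedDataKey: Buffer.from(CiphertextBlob!),
  };
}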
Step 5: Where the Current Architecture Breaks — And How to Fix It
This is the part that separates a senior engineer from a junior one. Knowing the stack is not enough — you have to know its failure modes.
Problem 1: Lambda Cold Starts on Patient Calls
Current issue:
Lambda cold start = 500ms – 2s (JVM runtimes worse, Node/Python better)
A patient calls → Lambda starts cold → 1.5s delay before agent sees data
→ Patients hear silence → hang up
Why it happens:
Lambda containers are recycled after ~15 minutes of inactivity
Call volume drops at night → containers recycled
First morning call → cold start
Fix: Provisioned Concurrency for call-handling Lambdas
aws lambda put-provisioned-concurrency-config \
--function-name call-handler \
--qualifier prod \
--provisioned-concurrent-executions 5
→ 5 warm instances always ready → zero cold start for first 5 concurrent calls
→ Cost: a few dollars per month per 512 MB provisioned instance — cheap insurance for patient experience
Cheaper alternative: EventBridge scheduled ping every 10 minutes
→ Keeps a single container warm at near-zero cost (does not help under concurrent load)
→ Acceptable for non-mission-critical Lambdas

Problem 2: DynamoDB Hot Partitions on High-Volume Practices
Current issue:
Large practice (300 locations, 10,000 appointments/day) writes to
PK: PRACTICE#prac_megacorp
All writes land on one DynamoDB partition → throughput limited to 1,000 WCU/partition
→ Throttling during morning rush → appointment booking failures
Fix: Write sharding
Instead of: PK = PRACTICE#prac_megacorp
Use: PK = PRACTICE#prac_megacorp#SHARD#{random 1-10}
Spreads writes across 10 partitions → ~10,000 WCU of aggregate write capacity
Read query: fan-out to all 10 shards, merge results
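A sketch of the sharded write and fan-out read, assuming the key scheme above (table name illustrative):

import { DynamoDBClient } from "@aws-sdk/client-dynamodb";
import { DynamoDBDocumentClient, PutCommand, QueryCommand } from "@aws-sdk/lib-dynamodb";

const ddb = DynamoDBDocumentClient.from(new DynamoDBClient({}));
const SHARDS = 10;
const TABLE = "patient-ops"; // illustrative

// Write side: a random shard suffix spreads writes across 10 partitions.
export async function putItem(practiceId: string, sk: string, attrs: Record<string, unknown>) {
  const shard = 1 + Math.floor(Math.random() * SHARDS);
  await ddb.send(new PutCommand({
    TableName: TABLE,
    Item: { PK: `PRACTICE#${practiceId}#SHARD#${shard}`, SK: sk, ...attrs },
  }));
}

// Read side: one Query per shard, issued in parallel, results merged.
export async function queryAllShards(practiceId: string, skPrefix: string) {
  const pages = await Promise.all(
    Array.from({ length: SHARDS }, (_, i) =>
      ddb.send(new QueryCommand({
        TableName: TABLE,
        KeyConditionExpression: "PK = :pk AND begins_with(SK, :sk)",
        ExpressionAttributeValues: {
          ":pk": `PRACTICE#${practiceId}#SHARD#${i + 1}`,
          ":sk": skPrefix,
        },
      }))
    )
  );
  return pages.flatMap((p) => p.Items ?? []);
}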
Alternative fix for large practices: switch to Aurora Serverless v2
Relational model handles complex queries better
Auto-scales from 0.5 to 128 ACUs — still serverless
Better for practices running complex reporting queries

Problem 3: EHR Integration Failures Silently Drop Appointments
Current issue:
EHR API returns 503 → Lambda retries 3x → gives up → no appointment booked
Agent told "booking confirmed" but EHR never received it
Patient arrives → no appointment in system → bad experience
Fix: Outbox pattern on appointment bookings
Step 1: Write appointment to DynamoDB with status = "PENDING"
Step 2: Lambda sends to EHR
Step 3a: Success → update status = "CONFIRMED", store ehr_appt_id
Step 3b: Failure → status stays "PENDING"
A scheduled sweeper Lambda (EventBridge rule) picks up appointments stuck in PENDING for 30+ seconds
→ Retry EHR booking → alert operations team if 3+ retries fail
→ Never silently drop a booking again
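A sketch of the sweeper, assuming a GSI keyed on booking status with created_at as its sort key; the index name, attribute names, and alerting hook are illustrative:

import { DynamoDBClient } from "@aws-sdk/client-dynamodb";
import { DynamoDBDocumentClient, QueryCommand, UpdateCommand } from "@aws-sdk/lib-dynamodb";

const ddb = DynamoDBDocumentClient.from(new DynamoDBClient({}));
const TABLE = "patient-ops"; // illustrative

// Scheduled sweeper: find bookings stuck in PENDING for 30+ seconds,
// bump their retry count, and re-attempt the EHR call.
export async function sweepPendingBookings() {
  const cutoff = new Date(Date.now() - 30_000).toISOString();
  const { Items = [] } = await ddb.send(new QueryCommand({
    TableName: TABLE,
    IndexName: "status-index", // assumed GSI: status (PK) + created_at (SK)
    KeyConditionExpression: "#s = :pending AND created_at < :cutoff",
    ExpressionAttributeNames: { "#s": "status" }, // "status" is a reserved word
    ExpressionAttributeValues: { ":pending": "PENDING", ":cutoff": cutoff },
  }));

  for (const booking of Items) {
    if ((booking.retries ?? 0) >= 3) {
      // Alert the operations team (e.g. SNS → PagerDuty): never drop silently.
      continue;
    }
    // Retry against the EHR here, then set status = "CONFIRMED" on success.
    await ddb.send(new UpdateCommand({
      TableName: TABLE,
      Key: { PK: booking.PK, SK: booking.SK },
      UpdateExpression: "SET retries = if_not_exists(retries, :zero) + :one",
      ExpressionAttributeValues: { ":zero": 0, ":one": 1 },
    }));
  }
}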
Problem 4: Recall Campaigns Blast DynamoDB
Current issue:
Recall campaign: "Contact all patients not seen in 18 months"
1,000 patients per practice × 50 practices = 50,000 DynamoDB reads + writes
All triggered at the same time (scheduled Lambda at 9am)
→ DynamoDB read capacity spike → throttling → campaign sends partial
Fix: Rate-limited SQS fan-out
Campaign Lambda:
1. Query all eligible patients → write patient IDs to SQS
2. Lambda triggered by SQS with reserved concurrency = 10
3. Each Lambda processes one patient, sends outreach, updates status
4. 10 concurrent = ~100 patients/minute → smooth, no spikes
5. Campaign progress visible in real time via DynamoDB status updates
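A sketch of step 1, batching patient IDs onto the send queue. The queue URL env var is an assumption, and SQS caps batches at 10 messages:

import { SQSClient, SendMessageBatchCommand } from "@aws-sdk/client-sqs";

const sqs = new SQSClient({});
const QUEUE_URL = process.env.RECALL_SEND_QUEUE_URL!; // assumed env var

// Enqueue eligible patients in batches of 10 (the SQS batch maximum).
// The consumer's reserved concurrency of 10 then drains the queue at a
// smooth ~100 patients/minute instead of one 9am spike.
export async function enqueueCampaign(campaignId: string, patientIds: string[]) {
  for (let i = 0; i < patientIds.length; i += 10) {
    const batch = patientIds.slice(i, i + 10);
    await sqs.send(new SendMessageBatchCommand({
      QueueUrl: QUEUE_URL,
      Entries: batch.map((patientId) => ({
        Id: patientId, // must be unique within the batch
        MessageBody: JSON.stringify({ campaignId, patientId }),
      })),
    }));
  }
}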
Problem 5: No Observability on What's Actually Happening
Current issue (common in Lambda-first systems):
Lambda logs go to CloudWatch Logs — fragmented across functions
No way to trace a single patient call across 4 Lambda invocations
No alerting when EHR error rate spikes
Operations team learns about booking failures when patients complain
Fix: AWS X-Ray distributed tracing
Add X-Ray SDK to all Lambda functions
Every call gets a trace ID (passed via HTTP header through all downstream calls)
→ See entire call flow: API Gateway → Call Lambda → EHR Adapter → DynamoDB
→ Identify which EHR is slow, which practices have booking failures
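A sketch of the instrumentation with aws-xray-sdk-core. Wrapping the SDK client surfaces every AWS call as a subsegment; the custom subsegment around the EHR hop is an illustrative pattern, not the library's only option:

import AWSXRay from "aws-xray-sdk-core";
import { DynamoDBClient } from "@aws-sdk/client-dynamodb";

// Wrap the client once at cold start: every DynamoDB call then shows up
// as a subsegment of the invocation's trace, linked by the trace ID that
// API Gateway propagates in the X-Amzn-Trace-Id header.
const ddb = AWSXRay.captureAWSv3Client(new DynamoDBClient({}));

// Custom subsegments around the EHR hop make slow vendors visible per call.
export async function tracedEhrCall<T>(ehrType: string, fn: () => Promise<T>): Promise<T> {
  const subsegment = AWSXRay.getSegment()?.addNewSubsegment(`ehr:${ehrType}`);
  try {
    return await fn();
  } finally {
    subsegment?.close();
  }
}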
Fix: CloudWatch Metrics dashboard
Custom metrics:
calls_answered_total (per practice)
appointments_booked_total (per practice, per EHR)
ehr_error_rate (per EHR type)
recall_campaign_response_rate
cold_start_duration_p99
Alarms:
ehr_error_rate > 5% → PagerDuty alert
cold_start_duration_p99 > 1000ms → investigate Lambda config
appointments_pending_ehr > 50 → outbox backlog building
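A sketch of emitting one of these custom metrics via PutMetricData. The namespace and dimension names are illustrative; at high volume, the CloudWatch Embedded Metric Format avoids the extra API call per data point:

import { CloudWatchClient, PutMetricDataCommand } from "@aws-sdk/client-cloudwatch";

const cw = new CloudWatchClient({});

// Raw error counter per EHR and practice; the 5% ehr_error_rate alarm is
// then computed in CloudWatch with metric math over errors / attempts.
export async function recordEhrError(ehrType: string, practiceId: string) {
  await cw.send(new PutMetricDataCommand({
    Namespace: "PatientOps", // illustrative
    MetricData: [{
      MetricName: "ehr_errors_total",
      Dimensions: [
        { Name: "EhrType", Value: ehrType },
        { Name: "PracticeId", Value: practiceId },
      ],
      Unit: "Count",
      Value: 1,
    }],
  }));
}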
Step 6: The Improved Architecture
┌──────────────────────────────────────────────────────────────────────────┐
│ Inbound: Phone (Amazon Connect) · SMS · Email · Web │
└─────────────────────────┬────────────────────────────────────────────────┘
│
┌─────────────────────────▼────────────────────────────────────────────────┐
│ API Gateway (REST + WebSocket) │
│ WAF: rate limiting, IP blocking, OWASP rules │
└──────┬──────────────┬──────────────────────┬──────────────────────────────┘
│ │ │
┌──────▼──────────────▼──────────────────────▼──────────────────────────┐
│ Lambda Functions (Node.js, ARM64 Graviton — 20% cheaper, faster) │
│ call-handler (provisioned concurrency: 5) │
│ recall-campaign-trigger │
│ no-show-recovery │
│ ehr-adapter-{type} (one per EHR, reserved concurrency = rate limit) │
└──────┬──────────────────────────────┬──────────────────────────────────┘
│ │
┌──────▼──────────────┐ ┌──────────▼────────────────────────────────┐
│ DynamoDB │ │ SQS Queues │
│ (sharded PKs for │ │ ehr-booking-queue (outbox pattern) │
│ large practices) │ │ recall-send-queue (rate-limited) │
│ + DynamoDB Streams │ │ no-show-fill-queue │
└──────┬──────────────┘ └──────────┬────────────────────────────────┘
│ │
┌──────▼──────────────────────────────▼──────────────────────────────┐
│ Observability │
│ X-Ray tracing across all Lambdas │
│ CloudWatch custom metrics + alarms │
│ CloudWatch Logs Insights (cross-function log queries) │
└──────┬──────────────────────────────────────────────────────────────┘
│
┌──────▼──────────────────────────────────────────────────────────────┐
│ S3 (KMS encrypted, Object Lock for call recordings) │
│ Aurora Serverless v2 (for large-practice analytics queries) │
│ Secrets Manager (EHR credentials, rotating automatically) │
└─────────────────────────────────────────────────────────────────────┘

Step 7: Infrastructure as Code — Terraform Patterns
The team uses Terraform for all infrastructure. A few patterns that matter at this scale:
# Lambda with provisioned concurrency for call handler
resource "aws_lambda_function" "call_handler" {
function_name = "call-handler-${var.env}"
runtime = "nodejs20.x"
architectures = ["arm64"] # Graviton2: 20% cheaper, faster
memory_size = 512
timeout = 10
environment {
variables = {
DYNAMODB_TABLE = aws_dynamodb_table.main.name
EHR_ADAPTER_ARN = aws_lambda_function.ehr_adapter.arn
# No secrets here — use Secrets Manager
}
}
tracing_config { mode = "Active" } # X-Ray
}
resource "aws_lambda_provisioned_concurrency_config" "call_handler" {
function_name = aws_lambda_function.call_handler.function_name
qualifier = aws_lambda_alias.call_handler_prod.name
provisioned_concurrent_executions = 5 # Always warm
}
# DynamoDB table: customer-managed KMS encryption at the table level
# (per-practice keys apply at the application layer via envelope encryption)
resource "aws_dynamodb_table" "main" {
name = "patient-ops-${var.env}"
billing_mode = "PAY_PER_REQUEST" # On-demand — no capacity planning
hash_key = "PK"
range_key = "SK"
point_in_time_recovery { enabled = true } # HIPAA: restore to any second
server_side_encryption {
enabled = true
kms_key_arn = aws_kms_key.dynamodb.arn
}
stream_enabled = true
stream_view_type = "NEW_AND_OLD_IMAGES" # Full audit trail via Streams
}Terraform workspace strategy:
terraform workspace list:
dev → isolated DynamoDB tables, EHR sandbox APIs
staging → production-like, uses EHR test environments
prod → production, real EHR APIs, HIPAA controls enabled
Per-practice infrastructure:
Each new practice onboarded → Terraform module creates:
- KMS key (practice-specific encryption)
- IAM role (practice-scoped DynamoDB access)
- Secrets Manager entries (EHR credentials)
→ Fully automated onboarding, no manual AWS console work

What the Interviewer Is Actually Testing
- Do you understand why serverless (Lambda) makes sense for spiky healthcare call volume — and what it costs vs always-on compute?
- Can you design DynamoDB access patterns (single-table, GSIs, shard keys) rather than just saying "use NoSQL"?
- Do you recognise the EHR integration adapter pattern and how to handle 50+ different APIs with per-EHR SQS queues plus Lambda reserved concurrency for rate limiting?
- Do you identify Lambda cold start as a patient experience problem and solve it with provisioned concurrency — not just mention it as a footnote?
- Do you apply the Outbox pattern to prevent silent booking failures at the EHR integration boundary?
- Do you know HIPAA controls on AWS — KMS per-tenant keys, S3 Object Lock for retention, CloudTrail, BAA coverage?
- Do you recognise the DynamoDB hot partition problem for large multi-location practices and know how to shard?
- Do you propose X-Ray + CloudWatch custom metrics as the observability fix, not just "add logging"?