System Design Interview
Design a Real-Time Bed Logistics System (Municipal Healthcare)
Hundreds of care homes, thousands of beds, one dashboard — give municipalities live capacity visibility without overwriting clinical data
The Interview Question
"Design a bed logistics system for a regional healthcare authority. Hundreds of municipal care homes each manage their own bed occupancy. Admissions coordinators at the regional level need a live dashboard showing which facilities have available beds, by ward type and care level. Bed status must update within minutes of a patient being admitted or discharged. The system must not replace or interfere with the clinical records at each care home."
This question tests event-driven architecture, data aggregation across many independent sources, conflict resolution on concurrent updates, and the pragmatics of building for healthcare environments where connectivity and legacy systems are real constraints.
Step 1: Requirements
Functional
- Care home staff update bed status: occupied, available, reserved, under maintenance
- Regional coordinators see a live aggregated view: available beds per facility, per ward type, per care level
- Admissions staff can filter by location, care level (light / medium / intensive), and availability
- Bed status history is auditable — every change logged with timestamp and user
- Alerts when a facility's available beds drop below a configurable threshold
Non-functional
- 200 care homes in the region, each with 20–200 beds (total: ~15,000 beds)
- Status updates must reflect in the regional dashboard within 2 minutes
- System must degrade gracefully when a care home loses internet (without losing its updates)
- Read-heavy for coordinators: ~500 concurrent users reading the dashboard during business hours
- Write volume: ~2,000 status changes per day (admits/discharges across all facilities)
- Healthcare data — PII adjacent; access must be role-restricted and fully audited
Step 2: The Core Architecture Question — Push or Pull?
The central design decision: how does the regional system learn about bed changes at each care home?
Option A: Pull (polling)
Regional system calls each care home's API every N minutes
✓ Simple to implement
✗ 200 facilities polled every 2 minutes = a constant 100 requests/minute, most returning no changes (~2,000 actual changes/day)
✗ Care home systems must expose an API (many won't)
✗ Latency = up to 2 minutes per poll cycle regardless of activity
Option B: Push via webhook
Care home system calls the regional endpoint when status changes
✓ Near-real-time updates
✗ Requires care home software to integrate — many use legacy systems
✗ Care home goes offline → updates are lost
Option C: Event queue with offline buffering (recommended)
Lightweight agent at each care home publishes to a durable queue
Regional service consumes events asynchronously
✓ Survives connectivity gaps — agent retries when connection returns
✓ Decoupled — care home software just writes to the local agent
✓ Auditable event log

Recommended: event-driven with a local buffering agent at each facility.
Care Home Side: Regional Side:
┌────────────────────────┐ ┌────────────────────────────────┐
│ Care Home Software │ │ Bed Status Service │
│ (EHR / manual UI) │ │ - Consumes bed events │
│ ↓ │ │ - Updates aggregated view │
│ Local Agent │ ──────► │ - Sends threshold alerts │
│ (buffer + retry) │ HTTPS │ - Serves dashboard API │
│ ↓ │ events └────────────────────────────────┘
│ Local SQLite queue │
│ (survives restarts) │
└────────────────────────┘

The local agent is a small .NET worker that:
- Watches for bed status changes (via DB trigger, file drop, or direct integration)
- Writes events to a local SQLite queue
- Ships events to the regional API over HTTPS
- Marks events as delivered on 200 OK; retries on failure
This pattern is called the Transactional Outbox — events are committed locally before being sent, so no status change is ever lost even if the network drops mid-send.
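A minimal sketch of that ship loop, in Python rather than .NET for brevity (the outbox table, endpoint URL, and retry interval are assumptions): undelivered events are sent oldest first and only marked delivered on a 200 OK, so a dropped connection simply means the same event is retried on the next pass.

import json
import sqlite3
import time

import requests  # assumed HTTP client; any client that reports status codes works

REGIONAL_ENDPOINT = "https://region.example/api/events"  # hypothetical URL
POLL_INTERVAL_SECONDS = 15  # assumed ship/retry cadence

def ship_pending_events(db: sqlite3.Connection) -> None:
    """Send undelivered events oldest first; mark delivered only on 200 OK."""
    rows = db.execute(
        "SELECT id, payload FROM outbox WHERE delivered = 0 ORDER BY id"
    ).fetchall()
    for row_id, payload in rows:
        try:
            resp = requests.post(REGIONAL_ENDPOINT, json=json.loads(payload), timeout=10)
        except requests.RequestException:
            return  # network is down: stop and retry the same event next pass
        if resp.status_code != 200:
            return  # server rejected or erred: keep the event and retry later
        db.execute("UPDATE outbox SET delivered = 1 WHERE id = ?", (row_id,))
        db.commit()

if __name__ == "__main__":
    conn = sqlite3.connect("outbox.db")  # the local queue that survives restarts
    conn.execute(
        "CREATE TABLE IF NOT EXISTS outbox ("
        "id INTEGER PRIMARY KEY AUTOINCREMENT, payload TEXT, delivered INTEGER DEFAULT 0)"
    )
    while True:
        ship_pending_events(conn)
        time.sleep(POLL_INTERVAL_SECONDS)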
Step 3: The Bed Event Model
Every change to a bed produces a structured event:
{
"event_id": "evt_01HX9...",
"facility_id": "fac_oslo_nordre",
"bed_id": "bed_2b_room_14",
"ward": "dementia",
"care_level": "intensive",
"previous_status": "available",
"new_status": "occupied",
"changed_by": "user_nurse_hansen",
"changed_at": "2026-04-20T09:14:22Z",
"reason": "patient_admitted",
"patient_ref": null
}

patient_ref is intentionally null — the bed logistics system does not store patient identity. It tracks bed states, not clinical records. This keeps the system outside the most sensitive GDPR tier and prevents scope creep.
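A minimal validation sketch for this event shape (function name and error format are assumptions), of the kind the ingest service would run before applying anything:

ALLOWED_STATUSES = {"available", "occupied", "reserved", "maintenance"}
REQUIRED_FIELDS = (
    "event_id", "facility_id", "bed_id", "ward", "care_level",
    "previous_status", "new_status", "changed_by", "changed_at", "reason",
)

def validate_bed_event(event: dict) -> list:
    """Return a list of problems; an empty list means the event can be applied."""
    problems = [f"missing field: {f}" for f in REQUIRED_FIELDS if f not in event]
    for field in ("previous_status", "new_status"):
        if event.get(field) not in ALLOWED_STATUSES:
            problems.append(f"{field} must be one of {sorted(ALLOWED_STATUSES)}")
    # patient_ref may be present but must stay empty: bed states only, no identity.
    if event.get("patient_ref") is not None:
        problems.append("patient_ref must be null")
    return problems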
Why include previous_status?
Concurrent updates. If two nurses update the same bed simultaneously (network lag, offline sync), the server can detect the conflict:
Server holds: bed_2b_room_14 = available (version 5)
Event A arrives: available → occupied (previous_status = available) ✓ apply
Event B arrives: available → reserved (previous_status = available)
But server now has version 6 (occupied) — previous_status mismatch
→ Conflict detected → log conflict, alert facility coordinator

This is optimistic concurrency applied to event sourcing. Conflicts are rare (one bed rarely changes twice in seconds) but must be handled explicitly in a healthcare context.
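A sketch of that check in isolation, replaying the exact scenario above (the Bed shape and function name are assumptions):

from dataclasses import dataclass

@dataclass
class Bed:
    status: str
    version: int

def apply_event(bed: Bed, event: dict) -> str:
    """Apply a status event with an optimistic concurrency check.

    Returns "applied" or "conflict"; a conflict is logged and alerted,
    never silently overwritten.
    """
    if event["previous_status"] != bed.status:
        # The bed changed since the event was created (e.g. an offline replay
        # racing a live update): flag it for the facility coordinator.
        return "conflict"
    bed.status = event["new_status"]
    bed.version += 1
    return "applied"

# The scenario above: version 5 is available, event A applies, event B conflicts.
bed = Bed(status="available", version=5)
assert apply_event(bed, {"previous_status": "available", "new_status": "occupied"}) == "applied"
assert apply_event(bed, {"previous_status": "available", "new_status": "reserved"}) == "conflict"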
Step 4: The Regional Aggregation Model
The dashboard doesn't query 200 individual facility tables. It reads a pre-aggregated view that is updated on every incoming event.
BED_STATUS table (row per bed, authoritative):
bed_id UUID PRIMARY KEY
facility_id UUID
ward TEXT
care_level ENUM (light, medium, intensive)
status ENUM (available, occupied, reserved, maintenance)
last_updated TIMESTAMPTZ
version INT
FACILITY_AVAILABILITY (materialised view, refreshed on every update):
facility_id UUID
ward TEXT
care_level ENUM
total_beds INT
available_beds INT
reserved_beds INT
occupied_beds INT
maintenance_beds INT
as_of TIMESTAMPTZ

On every inbound event (sketched just below):
- Update the BED_STATUS row (with a version check for concurrency)
- Refresh the FACILITY_AVAILABILITY row for that facility + ward + care_level
- Publish updated availability to Redis (dashboard reads from Redis, not Postgres)
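A sketch of those three steps for one inbound event, assuming redis-py as the cache client, an open psycopg2 connection, and the table and key names shown in this step; the reserved, occupied and maintenance counters would follow the same pattern as available_beds:

import json

import redis  # assumed cache client (redis-py)

AVAIL_TTL_SECONDS = 600  # the 10-minute TTL described just below

def handle_bed_event(pg, cache: redis.Redis, event: dict) -> bool:
    """Apply one validated bed event: update, re-aggregate, publish.

    pg is an open psycopg2 connection (assumed client).
    """
    with pg.cursor() as cur:
        # 1. Optimistic update of the authoritative BED_STATUS row: the WHERE
        #    clause only matches if previous_status still holds (a version
        #    column check works the same way).
        cur.execute(
            """UPDATE bed_status
                  SET status = %s, version = version + 1, last_updated = %s
                WHERE bed_id = %s AND status = %s""",
            (event["new_status"], event["changed_at"],
             event["bed_id"], event["previous_status"]),
        )
        if cur.rowcount == 0:
            pg.rollback()
            return False  # the conflict case from Step 3: route to the conflict path

        # 2. Recompute the aggregate for this facility + ward + care level and
        #    refresh the FACILITY_AVAILABILITY row.
        cur.execute(
            """SELECT count(*) FILTER (WHERE status = 'available'), count(*)
                 FROM bed_status
                WHERE facility_id = %s AND ward = %s AND care_level = %s""",
            (event["facility_id"], event["ward"], event["care_level"]),
        )
        available, total = cur.fetchone()
        cur.execute(
            """INSERT INTO facility_availability
                       (facility_id, ward, care_level, total_beds, available_beds, as_of)
               VALUES (%s, %s, %s, %s, %s, %s)
               ON CONFLICT (facility_id, ward, care_level)
               DO UPDATE SET total_beds = EXCLUDED.total_beds,
                             available_beds = EXCLUDED.available_beds,
                             as_of = EXCLUDED.as_of""",
            (event["facility_id"], event["ward"], event["care_level"],
             total, available, event["changed_at"]),
        )
    pg.commit()

    # 3. Publish the fresh numbers where the dashboard reads them.
    key = f"avail:{event['facility_id']}:{event['ward']}:{event['care_level']}"
    cache.set(
        key,
        json.dumps({"available": available, "total": total, "as_of": event["changed_at"]}),
        ex=AVAIL_TTL_SECONDS,
    )
    return True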
Redis key: avail:{facility_id}:{ward}:{care_level}
Value: { available: 3, total: 12, as_of: "2026-04-20T09:14:22Z" }
TTL: 10 minutes (fallback: read from Postgres if cache misses)
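On the read side, the dashboard request fans one pipelined MGET over the known facility keys for the requested ward and care level, falling back to Postgres when an entry has expired. A sketch, with a hypothetical load_from_postgres helper standing in for that fallback:

import json
from typing import List

import redis  # assumed cache client (redis-py)

def load_from_postgres(facility_id: str, ward: str, care_level: str) -> dict:
    """Hypothetical fallback: read the FACILITY_AVAILABILITY row on a cache miss."""
    raise NotImplementedError

def region_availability(cache: redis.Redis, facility_ids: List[str],
                        ward: str, care_level: str) -> List[dict]:
    """Aggregate cached availability for every facility in a region."""
    keys = [f"avail:{fid}:{ward}:{care_level}" for fid in facility_ids]
    results = []
    for fid, raw in zip(facility_ids, cache.mget(keys)):  # one pipelined round trip
        entry = json.loads(raw) if raw is not None else load_from_postgres(fid, ward, care_level)
        entry["facility_id"] = fid
        results.append(entry)
    # Most available beds first; sorting by distance would need facility coordinates.
    return sorted(results, key=lambda e: e.get("available", 0), reverse=True)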
The dashboard API aggregates across facilities:

GET /api/availability?region=oslo&care_level=intensive&ward=dementia
→ MGET avail:{facility_id}:dementia:intensive for each known facility in the region (Redis pipeline, ~2ms)
→ Return sorted by distance or available count

Step 5: Architecture
┌──────────────────────────────────────────────────────────────────────┐
│ Care Home A Care Home B Care Home C │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ │
│ │Local │ │Local │ │Local │ │
│ │Agent │ │Agent │ │Agent │ │
│ │(outbox) │ │(outbox) │ │(outbox) │ │
└──┴────┬─────┴─────────────┴────┬─────┴─────────────┴────┬─────┴─────┘
│ HTTPS POST /events │ │
└────────────────────────┼─────────────────────────┘
│
┌──────────────────▼───────────────────────┐
│ API Gateway │
│ (auth, rate limit, facility validation) │
└──────────────────┬───────────────────────┘
│
┌──────────────────▼───────────────────────┐
│ Bed Ingest Service │
│ - Validates event schema │
│ - Applies optimistic concurrency check │
│ - Updates BED_STATUS (PostgreSQL) │
│ - Refreshes FACILITY_AVAILABILITY │
│ - Publishes to Redis cache │
│ - Emits to Kafka for audit + alerts │
└──────────────┬───────────────────────────┘
│
┌──────────────────┼───────────────────────────┐
│ │ │
┌─────────▼──────┐ ┌────────▼────────┐ ┌────────────▼────────┐
│ PostgreSQL │ │ Redis │ │ Kafka │
│ BED_STATUS │ │ Availability │ │ bed.status.changed │
│ AUDIT_LOG │ │ Cache │ │ (audit, alerts) │
└────────────────┘ └────────────────┘ └────────────┬────────┘
│
┌─────────────▼────────┐
│ Alert Service │
│ - Low bed threshold │
│ - Facility offline │
│ - Conflict detected │
└──────────────────────┘
┌──────────────────────────────────────┐
│ Dashboard API (read path) │
│ Reads Redis → Postgres on miss │
└────────────────┬─────────────────────┘
│
┌────────────────▼─────────────────────┐
│ Regional Coordinator Dashboard │
│ Admissions Staff UI │
└──────────────────────────────────────┘

Step 6: Handling Offline Facilities
A care home's internet drops for 3 hours. During that time, 4 patients are admitted and 2 discharged. The local agent has queued all 6 events in SQLite.
When the connection restores:
- Agent replays queued events in order (oldest first)
- Each event carries the original changed_at timestamp — not the replay time
- Server applies events using the original timestamps to reconstruct history accurately
- FACILITY_AVAILABILITY is recomputed from the full BED_STATUS table for that facility
Dashboard during offline period: The regional dashboard shows the facility's last known state with a staleness indicator:
{
"facility_id": "fac_oslo_nordre",
"available_beds": 3,
"as_of": "2026-04-20T06:14:00Z",
"data_age_minutes": 187,
"status": "stale"
}

Coordinators see the warning and know not to rely on that facility's numbers until it reconnects.
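A small sketch of how the staleness flag could be derived when the dashboard payload is built (the 10-minute cutoff is an assumption, chosen to match the cache TTL from Step 4):

from datetime import datetime, timezone
from typing import Optional

STALE_AFTER_MINUTES = 10  # assumed cutoff, matching the cache TTL from Step 4

def annotate_staleness(entry: dict, now: Optional[datetime] = None) -> dict:
    """Add data_age_minutes and a stale/fresh flag to an availability entry."""
    now = now or datetime.now(timezone.utc)
    as_of = datetime.fromisoformat(entry["as_of"].replace("Z", "+00:00"))
    age_minutes = int((now - as_of).total_seconds() // 60)
    entry["data_age_minutes"] = age_minutes
    entry["status"] = "stale" if age_minutes > STALE_AFTER_MINUTES else "fresh"
    return entry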
Step 7: Access Control and Audit
Healthcare contexts require strict access control. Four roles:
FACILITY_STAFF
Can: update bed status for their own facility only
Cannot: see other facilities, see patient data, access reports
FACILITY_MANAGER
Can: everything above + view their own facility's historical report
Cannot: see other facilities
REGIONAL_COORDINATOR
Can: read all facilities in their region (read-only)
Cannot: update bed status, see patient data
SYSTEM_ADMIN
Can: manage facility configurations, view audit log, set thresholds
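A sketch of how the facility-scoping rules above might be enforced at the API layer (role names follow the table; the user object shape is an assumption):

WRITE_ROLES = {"FACILITY_STAFF", "FACILITY_MANAGER"}

def can_update_bed(user: dict, facility_id: str) -> bool:
    """Bed status writes: only staff and managers of that same facility."""
    return user["role"] in WRITE_ROLES and user["facility_id"] == facility_id

def can_read_availability(user: dict, facility_id: str, region_id: str) -> bool:
    """Coordinators read any facility in their region; facility roles read their own."""
    if user["role"] == "REGIONAL_COORDINATOR":
        return user["region_id"] == region_id
    if user["role"] in WRITE_ROLES:
        return user["facility_id"] == facility_id
    return False  # SYSTEM_ADMIN manages configuration and audit, not ward-level reads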
Every bed status change is written to an immutable audit log:

BED_AUDIT_LOG
id UUID PRIMARY KEY
bed_id UUID
facility_id UUID
changed_by UUID (user_id)
previous_status ENUM
new_status ENUM
changed_at TIMESTAMPTZ -- original clinical time
received_at TIMESTAMPTZ -- when our system processed it
event_id UUID -- idempotency key from agent
source_ip INET

changed_at vs received_at allows auditors to distinguish real-time updates from offline replays.
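Because the agent may resend an event whose acknowledgement was lost, the audit write can treat event_id as an idempotency key. A sketch, assuming an open psycopg2 connection and a unique index on event_id:

def record_audit(pg, event: dict, source_ip: str) -> None:
    """Append to BED_AUDIT_LOG; a redelivered event becomes a no-op.

    pg is an open psycopg2 connection (assumed client). gen_random_uuid()
    needs Postgres 13+ or pgcrypto; a unique index on event_id is assumed.
    """
    with pg.cursor() as cur:
        cur.execute(
            """INSERT INTO bed_audit_log
                     (id, bed_id, facility_id, changed_by, previous_status,
                      new_status, changed_at, received_at, event_id, source_ip)
               VALUES (gen_random_uuid(), %s, %s, %s, %s, %s, %s, now(), %s, %s)
               ON CONFLICT (event_id) DO NOTHING""",
            (event["bed_id"], event["facility_id"], event["changed_by"],
             event["previous_status"], event["new_status"],
             event["changed_at"], event["event_id"], source_ip),
        )
    pg.commit()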
Step 8: Alerts and Thresholds
The Alert Service subscribes to the Kafka bed.status.changed topic and evaluates threshold rules in near-real-time:
Rules (configurable per facility, per ward):
Low availability: available_beds < threshold (e.g., 2)
Critical: available_beds = 0
Offline: no events from facility for > 30 minutes
On trigger:
→ Send push notification to subscribed coordinators
→ Log to ALERT_HISTORY table
→ Rate limit: max 1 alert per rule per facility per 15 minutes
(prevents alert storm when a facility reconnects with 50 queued events)
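A sketch of that evaluation plus the rate limit (rule names and the in-memory window are assumptions; a production consumer would likely keep the window in Redis so it survives restarts):

import time
from typing import Callable, Dict, List, Tuple

RATE_LIMIT_SECONDS = 15 * 60  # one alert per rule per facility per 15 minutes
_last_sent: Dict[Tuple[str, str], float] = {}  # (facility_id, rule) -> last send time

def triggered_rules(available_beds: int, low_threshold: int = 2) -> List[str]:
    """Which availability rules fire for the latest numbers."""
    if available_beds == 0:
        return ["critical_no_beds"]
    if available_beds < low_threshold:
        return ["low_availability"]
    return []

def maybe_alert(facility_id: str, rule: str, send: Callable[[str, str], None]) -> bool:
    """Send at most one alert per rule per facility per window.

    This is what keeps a reconnecting facility's replay of dozens of queued
    events from turning into an alert storm.
    """
    now = time.monotonic()
    last = _last_sent.get((facility_id, rule))
    if last is not None and now - last < RATE_LIMIT_SECONDS:
        return False
    _last_sent[(facility_id, rule)] = now
    send(facility_id, rule)  # e.g. push notification plus an ALERT_HISTORY row
    return True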
Step 9: Database Schema Summary

FACILITIES
id UUID PRIMARY KEY
name TEXT
region_id UUID
address JSONB
timezone TEXT
ward_config JSONB -- { "dementia": true, "rehabilitation": false, ... }
offline_since TIMESTAMPTZ (null if online)
BEDS
id UUID PRIMARY KEY
facility_id UUID REFERENCES facilities(id)
ward TEXT
care_level ENUM (light, medium, intensive)
room_number TEXT
status ENUM (available, occupied, reserved, maintenance)
version INT DEFAULT 0
last_updated TIMESTAMPTZ
FACILITY_AVAILABILITY (materialised, refreshed per event)
facility_id UUID
ward TEXT
care_level ENUM
total_beds INT
available_beds INT
as_of TIMESTAMPTZ
PRIMARY KEY (facility_id, ward, care_level)

What the Interviewer Is Actually Testing
- Do you choose event-driven with local buffering over naive polling, and explain why for an unreliable network environment?
- Do you apply the Transactional Outbox pattern to guarantee no updates are lost during connectivity gaps?
- Do you use optimistic concurrency (version or previous_status) to handle concurrent bed updates?
- Do you separate the write path (ingest service) from the read path (Redis-cached aggregations) to handle 500 concurrent dashboard users?
- Do you design for offline graceful degradation — staleness indicators rather than errors?
- Do you keep patient data out of the bed system — tracking bed states only, not clinical records?
- Do you include an immutable audit log with both changed_at and received_at timestamps?
- Do you rate-limit alerts during offline replay storms?