
System Design Interview

Design a Real-Time Bed Logistics System (Municipal Healthcare)

Hundreds of care homes, thousands of beds, one dashboard — give municipalities live capacity visibility without overwriting clinical data

Key outcome: Live bed availability across 200+ facilities
System Design · Healthcare · Event-Driven · Real-Time · PostgreSQL · Aggregation

The Interview Question

"Design a bed logistics system for a regional healthcare authority. Hundreds of municipal care homes each manage their own bed occupancy. Admissions coordinators at the regional level need a live dashboard showing which facilities have available beds, by ward type and care level. Bed status must update within minutes of a patient being admitted or discharged. The system must not replace or interfere with the clinical records at each care home."

This question tests event-driven architecture, data aggregation across many independent sources, conflict resolution on concurrent updates, and the pragmatics of building for healthcare environments where connectivity and legacy systems are real constraints.


Step 1: Requirements

Functional

  • Care home staff update bed status: occupied, available, reserved, under maintenance
  • Regional coordinators see a live aggregated view: available beds per facility, per ward type, per care level
  • Admissions staff can filter by location, care level (light / medium / intensive), and availability
  • Bed status history is auditable — every change logged with timestamp and user
  • Alerts when a facility's available beds drop below a configurable threshold

Non-functional

  • 200 care homes in the region, each with 20–200 beds (total: ~15,000 beds)
  • Status updates must reflect in the regional dashboard within 2 minutes
  • System must degrade gracefully when a care home loses internet connectivity (queued updates must not be lost)
  • Read-heavy for coordinators: ~500 concurrent users reading the dashboard during business hours
  • Write volume: ~2,000 status changes per day (admits/discharges across all facilities)
  • Healthcare data — PII adjacent; access must be role-restricted and fully audited

Step 2: The Core Architecture Question — Push or Pull?

The central design decision: how does the regional system learn about bed changes at each care home?

Option A: Pull (polling)
  Regional system calls each care home's API every N minutes
  ✓ Simple to implement
  ✗ 200 facilities polled every 2 minutes = 100 requests/minute, most returning no change
  ✗ Care home systems must expose an API (many won't)
  ✗ Latency = up to 2 minutes per poll cycle regardless of activity

Option B: Push via webhook
  Care home system calls the regional endpoint when status changes
  ✓ Near-real-time updates
  ✗ Requires care home software to integrate — many use legacy systems
  ✗ Care home goes offline → updates are lost

Option C: Event queue with offline buffering (recommended)
  Lightweight agent at each care home publishes to a durable queue
  Regional service consumes events asynchronously
  ✓ Survives connectivity gaps — agent retries when connection returns
  ✓ Decoupled — care home software just writes to the local agent
  ✓ Auditable event log

Recommended: event-driven with a local buffering agent at each facility.

Care Home Side:                    Regional Side:
┌────────────────────────┐         ┌────────────────────────────────┐
│  Care Home Software    │         │  Bed Status Service            │
│  (EHR / manual UI)     │         │  - Consumes bed events         │
│        ↓               │         │  - Updates aggregated view     │
│  Local Agent           │ ──────► │  - Sends threshold alerts      │
│  (buffer + retry)      │  HTTPS  │  - Serves dashboard API        │
│        ↓               │ events  └────────────────────────────────┘
│  Local SQLite queue    │
│  (survives restarts)   │
└────────────────────────┘

The local agent is a small .NET worker that:

  1. Watches for bed status changes (via DB trigger, file drop, or direct integration)
  2. Writes events to a local SQLite queue
  3. Ships events to the regional API over HTTPS
  4. Marks events as delivered on 200 OK; retries on failure

This pattern is called the Transactional Outbox — events are committed locally before being sent, so no status change is ever lost even if the network drops mid-send.
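The queue-and-ship loop above can be sketched as follows (a minimal Python illustration standing in for the .NET worker; `OutboxQueue` is a hypothetical name, and the `send` callback abstracts the HTTPS POST, returning True on a 200 OK):

```python
import json
import sqlite3

class OutboxQueue:
    """Local durable queue: events are committed before any send attempt."""

    def __init__(self, path=":memory:"):
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS outbox ("
            "  seq INTEGER PRIMARY KEY AUTOINCREMENT,"
            "  event_json TEXT NOT NULL,"
            "  delivered INTEGER NOT NULL DEFAULT 0)"
        )

    def enqueue(self, event: dict) -> None:
        # Commit locally first: the event survives restarts and network loss
        self.db.execute("INSERT INTO outbox (event_json) VALUES (?)",
                        (json.dumps(event),))
        self.db.commit()

    def ship_pending(self, send) -> int:
        """Deliver pending events oldest-first; stop at the first failure
        so ordering is preserved and the remainder is retried later."""
        rows = self.db.execute(
            "SELECT seq, event_json FROM outbox "
            "WHERE delivered = 0 ORDER BY seq").fetchall()
        shipped = 0
        for seq, payload in rows:
            if not send(json.loads(payload)):   # e.g. HTTPS POST; True on 200 OK
                break
            self.db.execute(
                "UPDATE outbox SET delivered = 1 WHERE seq = ?", (seq,))
            self.db.commit()
            shipped += 1
        return shipped
```

If the network drops mid-send, already-delivered events stay marked and the rest are replayed in order on the next `ship_pending` call.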


Step 3: The Bed Event Model

Every change to a bed produces a structured event:

JSON
{
  "event_id": "evt_01HX9...",
  "facility_id": "fac_oslo_nordre",
  "bed_id": "bed_2b_room_14",
  "ward": "dementia",
  "care_level": "intensive",
  "previous_status": "available",
  "new_status": "occupied",
  "changed_by": "user_nurse_hansen",
  "changed_at": "2026-04-20T09:14:22Z",
  "reason": "patient_admitted",
  "patient_ref": null
}

patient_ref is intentionally null — the bed logistics system does not store patient identity. It tracks bed states, not clinical records. This keeps the system outside the most sensitive GDPR tier and prevents scope creep.

Why include previous_status? Concurrent updates. If two nurses update the same bed simultaneously (network lag, offline sync), the server can detect the conflict:

Server holds: bed_2b_room_14 = available (version 5)
Event A arrives: available → occupied (previous_status = available) ✓ apply
Event B arrives: available → reserved  (previous_status = available)
  But server now has version 6 (occupied) — previous_status mismatch
  → Conflict detected → log conflict, alert facility coordinator

This is optimistic concurrency applied to the event stream. Conflicts are rare (a single bed seldom changes twice within seconds) but in a healthcare context they must be handled explicitly rather than resolved silently by last-write-wins.
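The server-side check can be sketched as follows (illustrative Python, not the actual service code; `apply_event` and the dict-based bed row are assumed names):

```python
def apply_event(bed, event):
    """Optimistic-concurrency check: apply the event only if its view of the
    bed (previous_status) matches the server's current state; otherwise
    reject it and surface a conflict record.

    `bed` is the server-side row as a dict; returns (applied, conflict)."""
    if event["previous_status"] != bed["status"]:
        # Another event won the race: do not overwrite, report the conflict
        return False, {
            "bed_id": bed["bed_id"],
            "server_status": bed["status"],
            "event_id": event["event_id"],
        }
    bed["status"] = event["new_status"]
    bed["version"] += 1
    return True, None
```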


Step 4: The Regional Aggregation Model

The dashboard doesn't query 200 individual facility tables. It reads a pre-aggregated view that is updated on every incoming event.

BED_STATUS table (row per bed, authoritative):
  bed_id          UUID  PRIMARY KEY
  facility_id     UUID
  ward            TEXT
  care_level      ENUM (light, medium, intensive)
  status          ENUM (available, occupied, reserved, maintenance)
  last_updated    TIMESTAMPTZ
  version         INT

FACILITY_AVAILABILITY (application-maintained aggregate table, updated on every event):
  facility_id     UUID
  ward            TEXT
  care_level      ENUM
  total_beds      INT
  available_beds  INT
  reserved_beds   INT
  occupied_beds   INT
  maintenance_beds INT
  as_of           TIMESTAMPTZ

On every inbound event:

  1. Update BED_STATUS row (with version check for concurrency)
  2. Refresh FACILITY_AVAILABILITY row for that facility + ward + care_level
  3. Publish updated availability to Redis (dashboard reads from Redis, not Postgres)

     Redis key: avail:{facility_id}:{ward}:{care_level}
     Value:     { available: 3, total: 12, as_of: "2026-04-20T09:14:22Z" }
     TTL:       10 minutes (fallback: read from Postgres on a cache miss)

The dashboard API aggregates across facilities:

GET /api/availability?region=oslo&care_level=intensive&ward=dementia

→ Resolve region → facility IDs, then MGET the matching avail: keys in one pipeline (~2ms; Redis keys cannot be wildcarded in MGET, so facility IDs come from a region lookup)
→ Return sorted by distance or available count
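The read path can be sketched as follows (illustrative Python; a plain dict stands in for the Redis cache, and `db_fallback` represents the Postgres read on a cache miss; all names are assumptions):

```python
import json

def read_availability(cache, facility_ids, ward, care_level, db_fallback):
    """Dashboard read path: batch-read per-facility availability from the
    cache (dict standing in for a Redis MGET over enumerated keys),
    falling back to db_fallback(facility_id) on a miss."""
    results = []
    for fid in facility_ids:               # keys are enumerated, not globbed
        key = f"avail:{fid}:{ward}:{care_level}"
        raw = cache.get(key)
        entry = json.loads(raw) if raw else db_fallback(fid)
        results.append({"facility_id": fid, **entry})
    # Most available beds first, as a coordinator dashboard might sort
    return sorted(results, key=lambda r: r["available"], reverse=True)
```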

Step 5: Architecture

┌──────────────────────────────────────────────────────────────────────┐
│  Care Home A              Care Home B              Care Home C       │
│  ┌──────────┐             ┌──────────┐             ┌──────────┐      │
│  │Local     │             │Local     │             │Local     │      │
│  │Agent     │             │Agent     │             │Agent     │      │
│  │(outbox)  │             │(outbox)  │             │(outbox)  │      │
└──┴────┬─────┴─────────────┴────┬─────┴─────────────┴────┬─────┴─────┘
        │  HTTPS POST /events    │                         │
        └────────────────────────┼─────────────────────────┘
                                 │
              ┌──────────────────▼───────────────────────┐
              │            API Gateway                    │
              │  (auth, rate limit, facility validation)  │
              └──────────────────┬───────────────────────┘
                                 │
              ┌──────────────────▼───────────────────────┐
              │         Bed Ingest Service                │
              │  - Validates event schema                 │
              │  - Applies optimistic concurrency check   │
              │  - Updates BED_STATUS (PostgreSQL)        │
              │  - Refreshes FACILITY_AVAILABILITY        │
              │  - Publishes to Redis cache               │
              │  - Emits to Kafka for audit + alerts      │
              └──────────────┬───────────────────────────┘
                             │
          ┌──────────────────┼───────────────────────────┐
          │                  │                           │
 ┌─────────▼──────┐  ┌────────▼───────┐      ┌────────────▼────────┐
 │  PostgreSQL    │  │  Redis         │      │  Kafka              │
 │  BED_STATUS    │  │  Availability  │      │  bed.status.changed │
 │  AUDIT_LOG     │  │  Cache         │      │  (audit, alerts)    │
 └────────────────┘  └────────────────┘      └────────────┬────────┘
                                                          │
                                            ┌─────────────▼─────────┐
                                            │  Alert Service        │
                                            │  - Low bed threshold  │
                                            │  - Facility offline   │
                                            │  - Conflict detected  │
                                            └───────────────────────┘

              ┌──────────────────────────────────────┐
               │    Dashboard API (read path)         │
               │  Reads Redis → Postgres on miss      │
              └────────────────┬─────────────────────┘
                               │
              ┌────────────────▼─────────────────────┐
               │    Regional Coordinator Dashboard    │
               │    Admissions Staff UI               │
              └──────────────────────────────────────┘

Step 6: Handling Offline Facilities

A care home's internet drops for 3 hours. During that time, 4 patients are admitted and 2 discharged. The local agent has queued all 6 events in SQLite.

When the connection restores:

  1. Agent replays queued events in order (oldest first)
  2. Each event carries the original changed_at timestamp — not the replay time
  3. Server applies events using the original timestamps to reconstruct history accurately
  4. FACILITY_AVAILABILITY is recomputed from the full BED_STATUS table for that facility
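The replay-and-recompute step can be sketched as follows (illustrative Python; `beds` stands in for the facility's BED_STATUS rows, and the simple status counter stands in for the FACILITY_AVAILABILITY refresh):

```python
def replay_and_recompute(beds, queued_events):
    """Replay buffered events oldest-first using their original changed_at
    timestamps, then recompute availability counts from the full bed table.

    `beds` maps bed_id -> row dict; returns a status -> count summary."""
    for ev in sorted(queued_events, key=lambda e: e["changed_at"]):
        bed = beds[ev["bed_id"]]
        bed["status"] = ev["new_status"]
        bed["last_updated"] = ev["changed_at"]  # clinical time, not replay time
    counts = {}
    for bed in beds.values():
        counts[bed["status"]] = counts.get(bed["status"], 0) + 1
    return counts
```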

Dashboard during offline period: The regional dashboard shows the facility's last known state with a staleness indicator:

JSON
{
  "facility_id": "fac_oslo_nordre",
  "available_beds": 3,
  "as_of": "2026-04-20T06:14:00Z",
  "data_age_minutes": 187,
  "status": "stale"
}

Coordinators see the warning and know not to rely on that facility's numbers until it reconnects.
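Computing the staleness payload is straightforward (a sketch; the 10-minute freshness window is an assumed default, not taken from the requirements):

```python
from datetime import datetime, timezone

def staleness(as_of_iso, now=None, stale_after_minutes=10):
    """Label a facility's snapshot stale once its age exceeds the freshness
    window (threshold assumed configurable per deployment)."""
    now = now or datetime.now(timezone.utc)
    as_of = datetime.fromisoformat(as_of_iso.replace("Z", "+00:00"))
    age = int((now - as_of).total_seconds() // 60)
    return {"data_age_minutes": age,
            "status": "stale" if age > stale_after_minutes else "live"}
```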


Step 7: Access Control and Audit

Healthcare contexts require strict access control. Four roles:

FACILITY_STAFF
  Can: update bed status for their own facility only
  Cannot: see other facilities, see patient data, access reports

FACILITY_MANAGER
  Can: everything above + view their own facility's historical report
  Cannot: see other facilities

REGIONAL_COORDINATOR
  Can: read all facilities in their region (read-only)
  Cannot: update bed status, see patient data

SYSTEM_ADMIN
  Can: manage facility configurations, view audit log, set thresholds
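A deny-by-default permission check over these roles might look like this (illustrative Python; the `PERMISSIONS` table and action names are assumptions, not a spec):

```python
# Role -> permitted actions; facility scoping is enforced separately below
PERMISSIONS = {
    "FACILITY_STAFF":       {"update_bed_status"},
    "FACILITY_MANAGER":     {"update_bed_status", "view_facility_report"},
    "REGIONAL_COORDINATOR": {"read_availability"},
    "SYSTEM_ADMIN":         {"manage_facilities", "view_audit_log",
                             "set_thresholds"},
}

def is_allowed(user, action, facility_id=None):
    """Deny by default: the action must be in the role's set, and
    facility-scoped roles may only touch their own facility."""
    if action not in PERMISSIONS.get(user["role"], set()):
        return False
    if user["role"] in ("FACILITY_STAFF", "FACILITY_MANAGER"):
        return facility_id == user["facility_id"]
    return True
```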

Every bed status change is written to an immutable audit log:

SQL
BED_AUDIT_LOG
  id              UUID  PRIMARY KEY
  bed_id          UUID
  facility_id     UUID
  changed_by      UUID  (user_id)
  previous_status ENUM
  new_status      ENUM
  changed_at      TIMESTAMPTZ  -- original clinical time
  received_at     TIMESTAMPTZ  -- when our system processed it
  event_id        UUID         -- idempotency key from agent
  source_ip       INET

changed_at vs received_at allows auditors to distinguish real-time updates from offline replays.


Step 8: Alerts and Thresholds

The Alert Service subscribes to the Kafka bed.status.changed topic and evaluates threshold rules in near-real-time:

Rules (configurable per facility, per ward):
  Low availability:   available_beds < threshold (e.g., 2)
  Critical:           available_beds = 0
  Offline:            no events from facility for > 30 minutes

On trigger:
  → Send push notification to subscribed coordinators
  → Log to ALERT_HISTORY table
  → Rate limit: max 1 alert per rule per facility per 15 minutes
     (prevents alert storm when facility reconnects with 50 queued events)
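The rate-limit rule can be sketched as follows (illustrative Python; timestamps are plain seconds for brevity, and the class name is an assumption):

```python
class AlertRateLimiter:
    """At most one alert per (rule, facility) per window, absorbing the
    burst a reconnecting facility's replayed events would otherwise fire."""

    def __init__(self, window_seconds=900):   # 15-minute window
        self.window = window_seconds
        self.last_fired = {}                  # (rule, facility_id) -> time

    def should_fire(self, rule, facility_id, now):
        key = (rule, facility_id)
        last = self.last_fired.get(key)
        if last is not None and now - last < self.window:
            return False                      # suppressed within the window
        self.last_fired[key] = now
        return True
```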

Step 9: Database Schema Summary

SQL
FACILITIES
  id              UUID  PRIMARY KEY
  name            TEXT
  region_id       UUID
  address         JSONB
  timezone        TEXT
  ward_config     JSONB   -- { "dementia": true, "rehabilitation": false, ... }
  offline_since   TIMESTAMPTZ  (null if online)

BEDS
  id              UUID  PRIMARY KEY
  facility_id     UUID  REFERENCES facilities(id)
  ward            TEXT
  care_level      ENUM  (light, medium, intensive)
  room_number     TEXT
  status          ENUM  (available, occupied, reserved, maintenance)
  version         INT   DEFAULT 0
  last_updated    TIMESTAMPTZ

FACILITY_AVAILABILITY   (application-maintained aggregate, updated per event)
  facility_id     UUID
  ward            TEXT
  care_level      ENUM
  total_beds      INT
  available_beds  INT
  as_of           TIMESTAMPTZ
  PRIMARY KEY (facility_id, ward, care_level)

What the Interviewer Is Actually Testing

  • Do you choose event-driven with local buffering over naive polling, and explain why for an unreliable network environment?
  • Do you apply the Transactional Outbox pattern to guarantee no updates are lost during connectivity gaps?
  • Do you use optimistic concurrency (version or previous_status) to handle concurrent bed updates?
  • Do you separate the write path (ingest service) from the read path (Redis-cached aggregations) to handle 500 concurrent dashboard users?
  • Do you design for offline graceful degradation — staleness indicators rather than errors?
  • Do you keep patient data out of the bed system — tracking bed states only, not clinical records?
  • Do you include an immutable audit log with both changed_at and received_at timestamps?
  • Do you rate-limit alerts during offline replay storms?

Explore Courses