AI Systems · Advanced

MedScribe-AI: Every Phase of a Healthcare AI System — Architecture, Failures, and Fixes

A complete engineering walkthrough of a real AI-powered clinical documentation system — agent workflows, hallucination detection, state machines, RAG, and the specific failure modes we encountered and designed around.

Learnixo · May 7, 2026 · 20 min read
LLMs · RAG · Agentic Workflows · System Design · Python · PostgreSQL · Architecture

Building AI for healthcare is different from building AI for everything else. A hallucinated product description wastes a customer's time. A hallucinated medication dosage can harm a patient.

This is a complete engineering walkthrough of a real clinical documentation system — every phase, every design decision, and every failure mode we hit and had to design around. The system transcribes doctor-patient consultations, structures them into clinical notes, and orchestrates post-consultation agent workflows — all with a human doctor in the approval loop at every meaningful step.


What the System Does

A GP finishes a consultation. Instead of spending 15–20 minutes writing up the notes afterwards, they:

  1. Record the conversation (or dictate)
  2. AI transcribes and structures it into a SOAP-format clinical note
  3. Doctor reviews, approves, or corrects
  4. AI agents suggest post-consultation actions: ICD-10 codes, follow-up tasks, referral drafts, care plan updates
  5. Doctor approves each action individually before anything is written to the EPJ (Electronic Patient Journal)

The hard part isn't any single step. It's making the entire pipeline safe enough for clinical use.


System Overview: The Full Pipeline

Audio Recording
      ↓
  STT (Whisper)  ←─ speaker diarization (doctor vs patient)
      ↓
Norwegian STT Corrections  ←─ fixes medical term transcription errors
      ↓
LLM Structuring  ←─ raw transcript → structured SOAP note
      ↓
Safety Guardrails  ←─ checks input + output quality
      ↓
Post-Processing  ←─ fixes hallucination markers, repetitions, terminology
      ↓
Quality Evaluation  ←─ completeness, source fidelity, consistency scores
      ↓
Workflow State Machine  ←─ enforces valid states: TRANSCRIBED → STRUCTURED → REVIEW → APPROVED
      ↓
Human Review  ←─ NOTHING goes to EPJ without doctor approval
      ↓
Agent Orchestrator  ←─ suggests post-consultation actions
      ↓
Each Agent: PREVIEW → Doctor approves → EXECUTE
      ↓
EPJ Integration

Phase 1: Speech-to-Text with Speaker Diarization

What It Does

Raw audio comes in — either live recording or uploaded file. Whisper transcribes it. But a raw transcript without speaker separation is ambiguous. "The patient has a headache" — did the doctor say that, or did the patient?

Speaker diarization separates the audio into segments by speaker, labels them as [Lege] (doctor) and [Pasient] (patient), and combines them into a transcript that preserves the conversation structure.
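Conceptually, the merge is a time-overlap join: every transcription segment gets the label of the diarization turn it overlaps most. A minimal sketch of that step, assuming simple dict shapes for segments and turns and a mapping from diarizer speaker IDs to roles (the real pipeline types differ):

Python
# Sketch only: attach a speaker label to each transcription segment by time overlap.
# The segment/turn dict shapes and the speaker_names mapping are illustrative assumptions.
def label_segments(whisper_segments, speaker_turns, speaker_names=None):
    speaker_names = speaker_names or {}  # e.g. {"SPEAKER_00": "Lege", "SPEAKER_01": "Pasient"}
    lines = []
    for seg in whisper_segments:          # {"start": float, "end": float, "text": str}
        best_speaker, best_overlap = None, 0.0
        for turn in speaker_turns:        # {"start": float, "end": float, "speaker": str}
            overlap = min(seg["end"], turn["end"]) - max(seg["start"], turn["start"])
            if overlap > best_overlap:
                best_speaker, best_overlap = turn["speaker"], overlap
        label = speaker_names.get(best_speaker, best_speaker or "Speaker 0")
        lines.append(f"[{label}] {seg['text'].strip()}")
    return "\n".join(lines)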

Why It Matters for Structuring

The structuring prompt tells the LLM:

- [Lege] = Doctor's words (questions, findings, assessment)
- [Pasient] = Patient's words (symptoms, history, concerns)

Use speaker labels to place information correctly:
- Patient's reported symptoms → Chief Complaint, History
- Doctor's observations → Examination, Assessment
- Agreed plan → Plan, Follow-up

Without speaker labels, the LLM puts everything in the wrong sections. Patient-reported symptoms get classified as physician observations. The doctor's assessment bleeds into the history section. This was the first structural failure we encountered — and diarization was the fix.

The Failure Mode: Unknown Speakers

When the audio quality was low or the doctor and patient spoke over each other, diarization produced [Speaker 0] and [Speaker 1] instead of [Lege] and [Pasient]. The structuring prompt fell back to SPEAKER_INFO_SINGLE: "This is a single-speaker transcript." It still worked, just without the precision benefit.
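Prompt selection for this fallback is a small deterministic guard: use the two-speaker instruction block only when the labels are actually present. A sketch, where SPEAKER_INFO_TWO is an assumed counterpart to the SPEAKER_INFO_SINGLE constant mentioned above:

Python
# Sketch: choose the speaker-info block for the structuring prompt.
# SPEAKER_INFO_TWO is an assumed name; only SPEAKER_INFO_SINGLE appears in the text above.
SPEAKER_INFO_TWO = "[Lege] = doctor's words, [Pasient] = patient's words. Use the labels to place information correctly."
SPEAKER_INFO_SINGLE = "This is a single-speaker transcript."

def select_speaker_info(transcript: str) -> str:
    if "[Lege]" in transcript and "[Pasient]" in transcript:
        return SPEAKER_INFO_TWO
    return SPEAKER_INFO_SINGLE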


Phase 2: Norwegian STT Corrections

The Problem

Whisper is not optimised for Norwegian medical language. It consistently transcribed:

  • "paracet" → "Paracet" (correct brand, wrong case — should be "paracetamol")
  • "ibux" → "Ibux" (should be "ibuprofen")
  • "vondt i hodet" → transcribed correctly, but should be "hodepine" in clinical notes

These aren't transcription errors — Whisper heard correctly. They're terminology errors. The words spoken in a consultation are informal Norwegian. Clinical notes must use standardised medical Norwegian.

The Fix: Pre-Structuring Correction Layer

Before the transcript reaches the LLM, it passes through apply_stt_corrections() — a dictionary-based correction layer:

Python
MEDICAL_TERM_CORRECTIONS = {
    "vondt i hodet": "hodepine",          # "pain in the head" → headache
    "vondt i magen": "magesmerter",       # "pain in the stomach" → abdominal pain
    "puster tungt": "dyspné",             # "breathing heavily" → dyspnoea
    "kaster opp": "oppkast",              # "throwing up" → vomiting
    "paracet": "paracetamol",             # brand name → generic
    "ibux": "ibuprofen",                  # brand name → generic
    "sukkersyke": "diabetes mellitus",    # colloquial "sugar sickness" → diabetes
    "høyt blodtrykk": "hypertensjon",     # "high blood pressure" → hypertension
    # ... more mappings
}

This runs before the LLM. It converts the transcript from everyday Norwegian to clinical Norwegian so the LLM receives correct input — rather than expecting the LLM to know the right terminology (which it sometimes does, sometimes doesn't, depending on model size).
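The correction function itself can be a straightforward case-insensitive, word-boundary replace over the dictionary, longest phrases first so that multi-word phrases win. A sketch of apply_stt_corrections() under those assumptions (the real implementation may differ in detail):

Python
import re

def apply_stt_corrections(transcript: str) -> str:
    # Sketch: replace informal phrases with clinical terms, longest phrase first,
    # case-insensitive, on word boundaries. Assumes MEDICAL_TERM_CORRECTIONS above.
    for informal, clinical in sorted(MEDICAL_TERM_CORRECTIONS.items(),
                                     key=lambda kv: len(kv[0]), reverse=True):
        pattern = re.compile(rf"\b{re.escape(informal)}\b", re.IGNORECASE)
        transcript = pattern.sub(clinical, transcript)
    return transcript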

Why Not Just Ask the LLM to Fix It?

We tried. With small local models (1B–3B parameter, running on CPU for privacy), the LLM was inconsistent. It would correctly translate "vondt i ryggen" to "ryggsmerter" in one note and use "ryggplager" in the next. Clinical documentation requires consistent terminology. A deterministic pre-processing dictionary is more reliable than a probabilistic LLM for this specific task.

The rule is: use determinism where you need consistency, use LLMs where you need intelligence.


Phase 3: LLM Structuring — Prompt Engineering for Medicine

The Goal

Take a raw conversation transcript and produce a structured SOAP-format clinical note:

JSON
{
  "chief_complaint": "Hodepine siste 3 dager, intensiverer om kvelden.",
  "history": "Ingen tidligere migrene. Tar paracetamol uten effekt.",
  "examination": "Normotensiv. Ingen meningisme. Normalt nevrologisk status.",
  "assessment": "Spenningshodepine.",
  "plan": "NSAIDs. Kontroll om 2 uker ved forverring.",
  "medications": "Ibuprofen 400mg ved behov.",
  "follow_up": "Kontrolltime om 2 uker ved manglende bedring."
}

The System Prompt

You are a medical documentation assistant.
1. Extract information ONLY from the provided transcript. Never invent or assume.
2. If a section has no relevant information, write "Not documented."
3. Use medical terminology appropriate for clinical records.
4. Flag any content you are uncertain about with [VERIFY].

Rule 1 is the most important rule in the entire system. Extract only. Never invent. This is the instruction that fights hallucination at the source.

Rule 4 is the self-reporting mechanism. We explicitly instruct the LLM to mark its own uncertainty. This was the decision with the highest ROI — instead of hoping the LLM would be confident when it should be and uncertain when it should be, we made uncertainty explicit and machine-readable.

The Small Model Problem

The full structuring prompt with speaker info, template instructions, and metadata is 600+ tokens. For large models (GPT-4, Llama 70B) this works well. For small local models (1B–3B running on CPU), we discovered two things:

Problem 1: Context window confusion. At 1B parameters, a 600-token system prompt plus a 500-word transcript caused the model to lose coherence. It would start echoing the prompt back in the output, or confuse the section instructions with the section content.

Fix: A minimal prompt for small models:

Python
STRUCTURING_SIMPLE_PROMPT = """Du er en norsk medisinsk dokumentasjonsassistent.
Skriv profesjonelt medisinsk norsk. Rett skrivefeil og bruk korrekt terminologi.

Konsultasjon:
{transcript}

Fyll ut feltene basert KUN på teksten over. Skriv kort og presist.

{"chief_complaint":"...", "history":"...", ...}
JSON:"""

Two-thirds shorter. (In English, roughly: "You are a Norwegian medical documentation assistant. Write professional medical Norwegian. Correct spelling errors and use correct terminology. Fill in the fields based ONLY on the text above. Keep it short and precise.") Much higher success rate on small models.

Problem 2: Transcript length. Long consultations (30+ minute sessions) produced transcripts of 3,000+ words. Small models degrade significantly past ~500 tokens of input. The fix was a hard truncation at 500 characters:

Python
MAX_TRANSCRIPT_CHARS = 500
if len(transcript_text) > MAX_TRANSCRIPT_CHARS:
    transcript_text = transcript_text[:MAX_TRANSCRIPT_CHARS]

This is a deliberate compromise. We truncate the tail of the transcript rather than serving degraded output from the full transcript. The comment in the code says it plainly: "Truncate to 500 chars — keeps LLM fast on CPU while preserving key info."


Phase 4: The First Hallucination Failure

What the AI Got Wrong

This was the failure mode that required the most design work.

During testing, a 1B local model produced this output for a hypertension follow-up consultation:

JSON
{
  "chief_complaint": "Blodtrykk kontroll",
  "medications": "Pasienten tar Losartan 50mg daglig, Amlodipin 5mg, og Metoprolol 25mg.",
  "examination": "BP: 142/88. Kontaktinfo: lege@klinikken.no, tlf: 22 33 44 55"
}

The patient was only on one medication. The model added two more from its training data — common hypertension medications that often appear together in medical texts. It hallucinated a clinically plausible but completely wrong medication list.

It also fabricated a clinic phone number and email address in the examination section — data that was not in the transcript at all.

Why This Happens

LLMs learn statistical associations. In medical training data, losartan and amlodipine frequently appear together. The model completed the pattern from its training rather than from the transcript. This is the fundamental risk in domain-specific AI: the model knows enough to produce plausible hallucinations.

The Five-Layer Hallucination Defence

We didn't fix this with one change. We built five independent layers:

Layer 1 — The Prompt Instruction

Extract information ONLY from the provided transcript. Never invent or assume.

Necessary but not sufficient. The model sometimes ignores it.

Layer 2 — Self-Tagging for Uncertainty

Flag any content you are uncertain about with [VERIFY].

When the model wasn't sure, it would output "Losartan 50mg [VERIFY]". This became machine-readable:

Python
# In SafetyGuardrails.check_note():
for section, content in note.sections.items():
    if "[VERIFY]" in content:
        result.add_flag(SafetyFlag(
            severity="warning",
            category="llm_uncertainty",
            message=f"LLM flagged uncertainty in section: {section.value}",
        ))

Any [VERIFY] tag surfaced as a visible warning in the doctor's review UI.

Layer 3 — Source Fidelity Scoring

The AIEvaluator._score_source_fidelity() method checks whether significant words from the transcript appear in the generated note:

Python
# Extract significant words from the transcript
transcript_words = set(w.lower() for w in transcript.split() if len(w) > 4)

# Check how many appear in the note
note_text = " ".join(sections.values()).lower()
found = sum(1 for w in transcript_words if w in note_text)
fidelity = min(1.0, found / max(len(transcript_words) * 0.3, 1))

If the note mentions "amlodipin" but the transcript never said it, the fidelity score drops. A low fidelity score (below 0.5) triggers a quality warning.

Layer 4 — Hallucination Pattern Detection

The guardrail checks for data patterns that are unlikely to come from a consultation transcript:

Python
def _check_hallucination_patterns(self, note, result):
    all_text = " ".join(note.sections.values())
    
    # Phone numbers: not mentioned in the transcript, the model fabricated them
    phone_pattern = re.compile(r"\+?\d[\d\s\-]{8,}")
    if phone_pattern.search(all_text):
        result.add_flag(SafetyFlag(
            severity="warning",
            category="hallucination_risk",
            message="Note contains phone number — verify this came from transcript",
        ))
    
    # Email addresses: same issue
    email_pattern = re.compile(r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}")
    if email_pattern.search(all_text):
        result.add_flag(SafetyFlag(severity="warning", ...))

Layer 5 — LLM Self-Reference Removal

A separate failure mode: sometimes a model would output disclaimers inside the clinical note:

JSON
{
  "assessment": "As an AI, I cannot make a clinical diagnosis. However, based on the symptoms described..."
}

This is a model breaking character. It's not hallucination in the medical sense, but it's not a clinical note. The post-processor removes these sentences:

Python
HALLUCINATION_MARKERS = [
    "as an ai",
    "i cannot",
    "i don't have",
    "sorry",
    "please note",
    "this is not medical advice",
]

def remove_hallucinations(text: str) -> str:
    for marker in HALLUCINATION_MARKERS:
        if marker in text.lower():
            sentences = text.split('. ')
            sentences = [s for s in sentences if marker not in s.lower()]
            text = '. '.join(sentences)
    return text

Phase 5: Post-Processing — Never Show Raw LLM Output

The Principle

The rule in production clinical AI is: raw LLM output never reaches the clinician. Every output passes through a quality pipeline first.

The post-processing layer runs four transformations in sequence:

Python
def post_process_note(sections):
    for key, text in sections.items():
        text = fix_medical_terms(text)        # informal → clinical Norwegian
        text = remove_repetitions(text)       # deduplication
        text = remove_hallucinations(text)    # strip AI disclaimers
        text = clean_formatting(text)         # normalise whitespace, punctuation
        sections[key] = text
    return sections

The Repetition Problem

Small local models repeat themselves. This was the second major failure mode discovered in testing. A 1B model would produce:

"Pasienten har hodepine. Pasienten har hodepine. Hodepine hodepine."

This comes from beam search getting stuck in high-probability loops: when the context becomes confused, repeating its own recent output is the statistically safest continuation, so the model keeps choosing it.

The deduplication fix:

Python
def remove_repetitions(text):
    # Remove duplicate consecutive sentences
    sentences = text.split('. ')
    seen = []
    for s in sentences:
        if s.strip() not in [x.strip() for x in seen]:
            seen.append(s)
    result = '. '.join(seen)
    
    # Remove word-level repetition (3+ consecutive repeats)
    result = re.sub(r'\b(\w+)(\s+\1){2,}\b', r'\1', result, flags=re.IGNORECASE)
    return result

The Nested Dict Problem

The third failure mode: small models sometimes returned nested JSON objects instead of strings for section values:

JSON
{
  "chief_complaint": {
    "main_complaint": "hodepine",
    "duration": "3 dager"
  }
}

This is valid JSON but wrong schema. The _flatten_value() function handles this by recursively extracting string values from any nested structure:

Python
def _flatten_value(value):
    if isinstance(value, str):
        return value
    if isinstance(value, list):
        return "\n".join(f"- {_flatten_value(item)}" for item in value)
    if isinstance(value, dict):
        parts = [_flatten_value(v) for v in value.values() if _flatten_value(v) != "Not documented."]
        return "\n".join(parts) if parts else "Not documented."
    return str(value) if value is not None else "Not documented."

Phase 6: The Workflow State Machine

Why a State Machine?

A clinical note has a defined lifecycle:

CREATED → RECORDING → TRANSCRIBING → TRANSCRIBED
       → STRUCTURING → STRUCTURED → REVIEW → APPROVED
                                           ↕
                                        FAILED (→ retry from CREATED)

Without enforced state transitions, the API could receive an approval request for a note that hasn't been structured yet. Or a structuring request for a note already approved. These are bugs that would silently corrupt the audit trail.

The workflow engine enforces every transition:

Python
TRANSITIONS = {
    VisitStatus.CREATED:      {VisitStatus.RECORDING, VisitStatus.FAILED},
    VisitStatus.RECORDING:    {VisitStatus.TRANSCRIBING, VisitStatus.FAILED},
    VisitStatus.TRANSCRIBING: {VisitStatus.TRANSCRIBED, VisitStatus.FAILED},
    VisitStatus.TRANSCRIBED:  {VisitStatus.STRUCTURING, VisitStatus.FAILED},
    VisitStatus.STRUCTURING:  {VisitStatus.STRUCTURED, VisitStatus.FAILED},
    VisitStatus.STRUCTURED:   {VisitStatus.REVIEW, VisitStatus.FAILED},
    VisitStatus.REVIEW:       {VisitStatus.APPROVED, VisitStatus.STRUCTURED},
    VisitStatus.APPROVED:     set(),   # terminal: nothing leaves approved
    VisitStatus.FAILED:       {VisitStatus.CREATED},  # retry allowed
}

Any attempt to jump states raises InvalidTransitionError. The engine is pure logic — it never touches the database. The API layer calls engine.transition(), gets back (updated_visit, audit_entry), and persists both atomically.
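A sketch of what that enforcement looks like in practice: validate against TRANSITIONS, raise on anything else, and hand back the updated visit plus the audit entry for the API layer to persist. The visit and audit field names below are assumptions:

Python
from datetime import datetime, timezone

class InvalidTransitionError(Exception):
    pass

def transition(visit, new_status, actor: str):
    # Sketch only: pure state-machine step, no database access.
    allowed = TRANSITIONS.get(visit.status, set())
    if new_status not in allowed:
        raise InvalidTransitionError(f"Cannot transition {visit.status} → {new_status}")
    audit_entry = {
        "visit_id": visit.id,               # assumed field names
        "from_status": visit.status,
        "to_status": new_status,
        "actor": actor,
        "timestamp": datetime.now(timezone.utc),
    }
    visit.status = new_status
    return visit, audit_entry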

REVIEW → STRUCTURED: The Rejection Path

This transition is the most important one. When a doctor rejects a structured note, it doesn't go to FAILED — it goes back to STRUCTURED. The note still exists. The doctor can edit it directly, or request re-structuring with corrections. FAILED is reserved for system failures (Whisper timeout, LLM crash), not for clinical rejection.

Audit Trail — Every Transition Logged

Every state change produces an immutable audit entry:

Python
TRANSITION_AUDIT_MAP = {
    (VisitStatus.REVIEW, VisitStatus.APPROVED):  AuditAction.NOTE_APPROVED,
    (VisitStatus.REVIEW, VisitStatus.STRUCTURED): AuditAction.NOTE_REJECTED,
    ...
}

In healthcare, auditability is not optional. Every change must be traceable to a person, at a time, with a reason. The audit log answers: who approved this note, when, and what was the state of the note at approval.


Phase 7: Multi-Agent Orchestration

The Design Philosophy

After the doctor approves the clinical note, the system can help with what comes next. But "can help" is the key phrase. The orchestrator:

  • Suggests actions
  • Generates previews of what each action would produce
  • Waits for the doctor to approve each one individually
  • Executes only approved actions
  • Never makes clinical decisions

This is the human-in-the-loop principle implemented as an architecture.

The Agent Plan: Preview Before Execute

Python
# 1. Doctor finishes consultation
plan = await orchestrator.plan_post_consultation(visit_id, note_text)
# Returns a plan with each action in PREVIEW status

# 2. Doctor sees the plan: each action shows what it would do
# coding agent: "Suggested codes: J12.9 (Viral pneumonia, unspecified), confidence: high"
# follow_up agent: "Suggested tasks: [lab tests, return visit in 2 weeks]"
# referral agent: "Draft referral to pulmonology: [letter text]"

# 3. Doctor approves or skips each one
await orchestrator.approve_action(plan, coding_action_id)
await orchestrator.skip_action(plan, referral_action_id)  # no referral needed

# 4. Execute only approved actions
await orchestrator.execute_action(plan, coding_action_id, actor="dr.smith")

The key technical invariant: execute_action() only runs if status is APPROVED or PREVIEW. There is no code path that bypasses this.
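A sketch of that guard, with the status check at the top of the method so there is nothing to bypass. The ActionStatus values and plan helpers below are assumptions that follow the description above:

Python
from enum import Enum

class ActionStatus(str, Enum):
    PREVIEW = "preview"
    APPROVED = "approved"
    SKIPPED = "skipped"
    EXECUTED = "executed"

class ActionNotApprovedError(Exception):
    pass

async def execute_action(self, plan, action_id: str, actor: str):
    # Sketch only: the guard is the first thing the method does.
    action = plan.get_action(action_id)        # assumed lookup helper
    if action.status not in (ActionStatus.APPROVED, ActionStatus.PREVIEW):
        raise ActionNotApprovedError(
            f"Action {action_id} has status {action.status}; refusing to execute."
        )
    result = await action.agent.execute(action.context)   # assumed agent interface
    action.status = ActionStatus.EXECUTED
    plan.record_execution(action_id=action_id, actor=actor, result=result)  # assumed audit hook
    return result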

The Five Agents

| Agent | Task | Risk | Why This Risk Level |
|-------|------|------|---------------------|
| CodingAgent | Suggest ICD-10 / ICPC-2 codes | LOW | Just suggestions, never auto-applied. Wrong code is corrected at billing, not at care |
| FollowUpAgent | Create follow-up tasks | MEDIUM | Missing a follow-up has clinical consequences |
| ReferralDraftAgent | Draft referral letter | MEDIUM | Letter goes to specialist — content matters |
| CarePlanAgent | Update care plan | MEDIUM | Care plan changes affect ongoing treatment |
| LetterDraftAgent | Epikrise / innkalling / sykemelding (discharge summary / appointment letter / sick note) | MEDIUM | Patient-facing or legal documents |

Risk level determines the UI treatment: LOW actions can be shown with a single "apply" button. MEDIUM actions require the doctor to read the preview. HIGH actions (not yet implemented) would require explicit approval plus a reason.

ICD-10 Code Hallucination: The Coding Agent Failure

The CodingAgent had its own hallucination failure mode.

When given a vague consultation note — "patient feels unwell, fatigue" — the LLM would suggest 4–6 highly specific ICD-10 codes with "confidence": "high". It was pattern-matching on common code combinations from training data, not reasoning about the clinical evidence.

The fix was a dual-source architecture:

Python
async def preview(self, context):
    # Source 1: Deterministic keyword matching against known ICD-10 terms
    from medscribe.services.norwegian import suggest_icd10
    keyword_suggestions = suggest_icd10(note_text)
    
    # Source 2: LLM suggestions with structured confidence
    llm_codes = await self._llm.generate("""
        Return JSON: [{"code": "ICD-10", "description": "...", "confidence": "high|medium|low"}]
        Return ONLY JSON.
    """)
    
    return {
        "suggested_codes": llm_codes,      # LLM suggestions
        "keyword_matches": keyword_suggestions,  # deterministic matches
    }

The UI shows both sources separately. Keyword matches have a "deterministic" badge — the doctor knows these came from a lookup, not an LLM. LLM suggestions have confidence levels. A "low" confidence LLM suggestion that doesn't overlap with keyword matches is a signal to be skeptical.


Phase 8: RAG — Patient Context Without Hallucination

The Problem with General Chatbots in Healthcare

A general LLM answering "What medications was this patient on last visit?" draws on its training data for context. It might say "patients with this diagnosis are commonly on metformin" — which is statistically true but not factually true for this patient.

Patient-specific Q&A needs to be grounded in the patient's actual records, not statistical norms.

The RAG Architecture

Python
async def ask(self, question: str, patient_id: str) -> dict:
    # 1. Retrieve actual patient visit notes from DB
    context_chunks, sources = await self._retrieve_patient_context(patient_id)
    context = "\n\n".join(context_chunks)
    
    # 2. Inject only retrieved data as context
    result = await self._llm.generate(
        prompt=f"""Svar på spørsmålet basert KUN på pasientens journalnotater nedenfor.
        
Spørsmål: {question}

Pasientens notater:
{context}

Regler:
1. Svar KUN basert på informasjonen i notatene.
2. Hvis informasjonen ikke finnes, si "Ikke funnet i tilgjengelige notater."
3. Referer til hvilken dato informasjonen kommer fra."""
    )
    
    # 3. Return answer with source citations
    return {
        "answer": result.text,
        "sources": sources,  # which visit notes were used
    }

The critical line is Rule 2: "If information is not found, say so." Without this, the model fills the gap from training data. With it, the model explicitly admits ignorance — which is the correct clinical behaviour.

Every answer returns sources — the specific visit IDs and dates the answer was drawn from. The doctor can click through to verify.
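The retrieval step itself does not need embeddings to be useful: pulling the patient's recent notes from the database and returning them with their visit metadata already gives both the grounding and the citations. A sketch of _retrieve_patient_context(), assuming a repository method like the one below exists:

Python
async def _retrieve_patient_context(self, patient_id: str):
    # Sketch: fetch recent notes as grounding context plus source metadata.
    # self._repo.list_notes_for_patient() is an assumed repository method.
    notes = await self._repo.list_notes_for_patient(patient_id, limit=10)
    context_chunks, sources = [], []
    for note in notes:
        context_chunks.append(f"[{note.visit_date}] {note.text}")
        sources.append({"visit_id": note.visit_id, "date": str(note.visit_date)})
    return context_chunks, sources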


Phase 9: Quality Evaluation and Drift Detection

Automated Quality Scoring

Every structured note is scored on three dimensions:

Python
overall_score = (
    completeness  * 0.3 +   # fraction of sections filled
    source_fidelity * 0.4 + # words from transcript found in note
    consistency   * 0.2 +   # structural quality (no JSON fragments, no empty sections)
    safety_pass   * 0.1     # passed guardrails
)

Source fidelity has the highest weight (0.4) because it's the most direct measure of hallucination. A note that uses words not in the transcript is a note that added information from somewhere other than the conversation.
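For completeness, the two lighter-weighted scores are simple structural checks. A sketch of what they measure, based on the descriptions in the formula above (the real AIEvaluator methods may differ):

Python
def _score_completeness(sections: dict) -> float:
    # Fraction of sections that contain real content (sketch).
    filled = [v for v in sections.values()
              if v.strip() and v.strip() != "Not documented."]
    return len(filled) / max(len(sections), 1)

def _score_consistency(sections: dict) -> float:
    # Structural quality: penalise empty sections and leftover JSON fragments (sketch).
    score = 1.0
    for text in sections.values():
        if not text.strip():
            score -= 0.1
        if "{" in text or '":' in text:
            score -= 0.2
    return max(score, 0.0)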

Drift Detection

The QualityMonitor tracks scores over time and compares the first half of recent results to the second half:

Python
def get_trend(self, last_n=20):
    scores = [r.overall_score for r in self._history[-last_n:]]
    first_half_avg = mean(scores[:len(scores)//2])
    second_half_avg = mean(scores[len(scores)//2:])
    
    if second_half_avg < first_half_avg - 0.1:
        trend = "declining"

A "declining" trend means: the model's output quality is getting worse over time. This happens when model context changes (temperature changes, system updates, prompt changes) or when input distribution shifts (new patient demographics, new clinical areas).


Phase 10: Verification Service — Optimistic Locking in Healthcare

The Pattern

The verification service manages the human approval workflow for clinical documentation. Two reviewers might try to approve the same document simultaneously — a race condition.

The fix is optimistic locking with version numbers:

Python
def _assert_version(current: int, expected: int) -> None:
    """Raises 409 if another writer updated first."""
    if current != expected:
        raise HTTPException(
            status_code=409,
            detail=f"Version conflict: expected {expected}, got {current}. "
                   "Record was modified by another request. Please refresh and retry."
        )

async def approve(self, verification_id, reviewer, expected_version=None):
    v = await self._get_or_404(verification_id)
    _assert_transition(v.status, VerificationStatus.APPROVED)
    
    if expected_version is not None:
        _assert_version(v.version, expected_version)  # conflict detection
    
    v.status = VerificationStatus.APPROVED
    v.version += 1  # increment on every write

The UI sends expected_version with every approval request. If two reviewers approve simultaneously, the second gets a 409 and is told to refresh. No silent duplicate approvals.
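On the caller's side the pattern is: read the version you are approving, send it with the request, and treat a 409 as "refresh and decide again" rather than an error to retry blindly. A usage sketch against the approve() method shown above (service.get() is an assumed read method):

Python
from fastapi import HTTPException

async def approve_with_conflict_handling(service, verification_id: str, reviewer: str):
    # Sketch of the caller-side optimistic-locking flow.
    current = await service.get(verification_id)    # assumed read method
    try:
        await service.approve(verification_id, reviewer,
                              expected_version=current.version)
    except HTTPException as exc:
        if exc.status_code == 409:
            # Someone else wrote first: surface the conflict and show the fresh record.
            return {"conflict": True, "record": await service.get(verification_id)}
        raise
    return {"conflict": False}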

This is the same pattern as the Nordea banking race condition — but in a healthcare context where a double-approval could mean a patient's record shows contradictory clinical assessments.


The Meta-Lessons

1. Defence in Depth for Hallucination

No single guardrail catches everything. We ended up with five independent layers:

  • Prompt instruction ("extract only")
  • Self-tagging ([VERIFY])
  • Source fidelity scoring
  • Pattern detection (phone, email, address)
  • LLM self-reference removal

Each layer catches a class of failures the others miss. All five together still don't catch everything — the doctor review is the final defence. The goal of the automated layers is not to eliminate the need for review. It's to make review faster and more focused.

2. Determinism vs. Probability

Use deterministic logic where you need consistency (terminology correction, hallucination markers, state transitions). Use LLMs where you need intelligence (structuring, drafting, reasoning).

The biggest early mistake was asking the LLM to handle things that should have been deterministic. A lookup table for ICD-10 keyword matching is faster, cheaper, and more consistent than asking the LLM to assign codes on its own.

3. Small Models Need Different Prompts

A prompt that works beautifully on GPT-4 may completely confuse a 1B local model. Small models lose coherence with long context. They repeat themselves. They return wrong data types. They misunderstand complex instructions.

For clinical AI running locally (privacy requirement), you need a separate prompt engineering strategy for small models — shorter prompts, simpler structures, explicit output format examples.

4. Human-in-the-Loop Is Not a UI Feature

The approval requirement isn't something bolted on for compliance. It's enforced at the architecture level — execute_action() checks status before running. There's no API call that writes to the EPJ without passing through an approval state transition. The system cannot accidentally skip review.

5. Auditability Is a First-Class Requirement

Every state transition, every agent execution, every human approval is audit-logged with actor, timestamp, and detail. Not because we wanted to — because healthcare demands it. But the design benefit was substantial: debugging production issues became tracing audit logs rather than reading application logs.


Checklist: Healthcare AI System Design

□ Is every AI output verified by a human before clinical action?
□ Does the structuring prompt explicitly say "extract only, never invent"?
□ Does the LLM self-tag uncertainty with a machine-readable marker?
□ Is there source fidelity measurement (output vs. input overlap)?
□ Are AI disclaimers removed before showing output to clinicians?
□ Are there separate prompts for small vs. large models?
□ Is the workflow a state machine with enforced transitions?
□ Is every state transition audit-logged with actor + timestamp?
□ Does the RAG system return source citations with every answer?
□ Is there a quality drift monitor comparing score trends over time?
□ Does concurrent approval use optimistic locking?
□ Can the system fail gracefully when AI is unavailable?

The answer to every question above should be yes before a clinical AI system handles real patient data.

Enjoyed this article?

Explore the AI Systems learning path for more.
