MedScribe-AI: Every Phase of a Healthcare AI System — Architecture, Failures, and Fixes
A complete engineering walkthrough of a real AI-powered clinical documentation system — agent workflows, hallucination detection, state machines, RAG, and the specific failure modes we encountered and designed around.
Building AI for healthcare is different from building AI for everything else. A hallucinated product description wastes a customer's time. A hallucinated medication dosage can harm a patient.
This is a complete engineering walkthrough of a real clinical documentation system — every phase, every design decision, and every failure mode we hit and had to design around. The system transcribes doctor-patient consultations, structures them into clinical notes, and orchestrates post-consultation agent workflows — all with a human doctor in the approval loop at every meaningful step.
What the System Does
A GP finishes a consultation. Instead of spending 15–20 minutes writing up the notes afterwards, they:
- Record the conversation (or dictate)
- AI transcribes and structures it into a SOAP-format clinical note
- Doctor reviews, approves, or corrects
- AI agents suggest post-consultation actions: ICD-10 codes, follow-up tasks, referral drafts, care plan updates
- Doctor approves each action individually before anything is written to the EPJ (Electronic Patient Journal)
The hard part isn't any single step. It's making the entire pipeline safe enough for clinical use.
System Overview: The Full Pipeline
Audio Recording
↓
STT (Whisper) ←─ speaker diarization (doctor vs patient)
↓
Norwegian STT Corrections ←─ fixes medical term transcription errors
↓
LLM Structuring ←─ raw transcript → structured SOAP note
↓
Safety Guardrails ←─ checks input + output quality
↓
Post-Processing ←─ fixes hallucination markers, repetitions, terminology
↓
Quality Evaluation ←─ completeness, source fidelity, consistency scores
↓
Workflow State Machine ←─ enforces valid states: TRANSCRIBED → STRUCTURED → REVIEW → APPROVED
↓
Human Review ←─ NOTHING goes to EPJ without doctor approval
↓
Agent Orchestrator ←─ suggests post-consultation actions
↓
Each Agent: PREVIEW → Doctor approves → EXECUTE
↓
EPJ Integration
Phase 1: Speech-to-Text with Speaker Diarization
What It Does
Raw audio comes in — either live recording or uploaded file. Whisper transcribes it. But a raw transcript without speaker separation is ambiguous. "The patient has a headache" — did the doctor say that, or did the patient?
Speaker diarization separates the audio into segments by speaker, labels them as [Lege] (doctor) and [Pasient] (patient), and combines them into a transcript that preserves the conversation structure.
Why It Matters for Structuring
The structuring prompt tells the LLM:
- [Lege] = Doctor's words (questions, findings, assessment)
- [Pasient] = Patient's words (symptoms, history, concerns)
Use speaker labels to place information correctly:
- Patient's reported symptoms → Chief Complaint, History
- Doctor's observations → Examination, Assessment
- Agreed plan → Plan, Follow-up
Without speaker labels, the LLM puts everything in the wrong sections. Patient-reported symptoms get classified as physician observations. The doctor's assessment bleeds into the history section. This was the first structural failure we encountered — and diarization was the fix.
The Failure Mode: Unknown Speakers
When the audio quality was low or the doctor and patient spoke over each other, diarization produced [Speaker 0] and [Speaker 1] instead of [Lege] and [Pasient]. The structuring prompt fell back to SPEAKER_INFO_SINGLE: "This is a single-speaker transcript." It still worked, just without the precision benefit.
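That fallback decision is a simple deterministic check. A minimal sketch, with a hypothetical helper name and segment shape (the real pipeline's interfaces may differ):

```python
def select_speaker_prompt(segments: list[dict]) -> str:
    """Pick the structuring-prompt variant based on diarization output.

    Role labels ([Lege]/[Pasient]) mean we can use the speaker-aware
    prompt; generic labels like [Speaker 0] mean roles are unknown,
    so fall back to the single-speaker prompt.
    """
    labels = {seg["speaker"] for seg in segments}
    if labels == {"[Lege]", "[Pasient]"}:
        return "SPEAKER_INFO_DIARIZED"
    return "SPEAKER_INFO_SINGLE"
```

The point of keeping this as a pure function is that the fallback is testable in isolation, without audio in the loop.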
Phase 2: Norwegian STT Corrections
The Problem
Whisper is not optimised for Norwegian medical language. It consistently transcribed:
- "paracet" → "Paracet" (heard correctly, but it's a brand name — clinical notes should use the generic "paracetamol")
- "ibux" → "Ibux" (likewise a brand name — should be "ibuprofen")
- "vondt i hodet" → transcribed correctly, but should be "hodepine" in clinical notes
These aren't transcription errors — Whisper heard correctly. They're terminology errors. The words spoken in a consultation are informal Norwegian. Clinical notes must use standardised medical Norwegian.
The Fix: Pre-Structuring Correction Layer
Before the transcript reaches the LLM, it passes through apply_stt_corrections() — a dictionary-based correction layer:
MEDICAL_TERM_CORRECTIONS = {
"vondt i hodet": "hodepine",
"vondt i magen": "magesmerter",
"puster tungt": "dyspné",
"kaster opp": "oppkast",
"paracet": "paracetamol",
"ibux": "ibuprofen",
"sukkersyke": "diabetes mellitus",
"høyt blodtrykk": "hypertensjon",
...
}
This runs before the LLM. It converts the transcript from everyday Norwegian to clinical Norwegian so the LLM receives correct input — rather than expecting the LLM to know the right terminology (which it sometimes does, sometimes doesn't, depending on model size).
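The application pass itself can be a case-insensitive, longest-phrase-first replacement. A sketch of what apply_stt_corrections() might look like (the dictionary entries here are a subset; the ordering detail is an assumption about the implementation):

```python
import re

MEDICAL_TERM_CORRECTIONS = {
    "vondt i hodet": "hodepine",
    "paracet": "paracetamol",
    "ibux": "ibuprofen",
}

def apply_stt_corrections(text: str) -> str:
    # Replace longest phrases first so a multi-word entry wins over
    # any shorter entry it might contain.
    for informal in sorted(MEDICAL_TERM_CORRECTIONS, key=len, reverse=True):
        clinical = MEDICAL_TERM_CORRECTIONS[informal]
        # Word boundaries keep "paracet" from re-matching inside the
        # already-correct "paracetamol".
        text = re.sub(rf"\b{re.escape(informal)}\b", clinical, text,
                      flags=re.IGNORECASE)
    return text
```

Word boundaries matter here: without them, correcting "paracet" would mangle a transcript that already says "paracetamol".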
Why Not Just Ask the LLM to Fix It?
We tried. With small local models (1B–3B parameters, running on CPU for privacy), the LLM was inconsistent. It would correctly translate "vondt i ryggen" to "ryggsmerter" in one note and use "ryggplager" in the next. Clinical documentation requires consistent terminology. A deterministic pre-processing dictionary is more reliable than a probabilistic LLM for this specific task.
The rule is: use determinism where you need consistency, use LLMs where you need intelligence.
Phase 3: LLM Structuring — Prompt Engineering for Medicine
The Goal
Take a raw conversation transcript and produce a structured SOAP-format clinical note:
{
"chief_complaint": "Hodepine siste 3 dager, intensiverer om kvelden.",
"history": "Ingen tidligere migrene. Tar paracetamol uten effekt.",
"examination": "Normotensiv. Ingen meningisme. Normalt nevrologisk status.",
"assessment": "Spenningshodepine.",
"plan": "NSAIDs. Kontroll om 2 uker ved forverring.",
"medications": "Ibuprofen 400mg ved behov.",
"follow_up": "Kontrolltime om 2 uker ved manglende bedring."
}
The System Prompt
You are a medical documentation assistant.
1. Extract information ONLY from the provided transcript. Never invent or assume.
2. If a section has no relevant information, write "Not documented."
3. Use medical terminology appropriate for clinical records.
4. Flag any content you are uncertain about with [VERIFY].
Rule 1 is the most important rule in the entire system. Extract only. Never invent. This is the instruction that fights hallucination at the source.
Rule 4 is the self-reporting mechanism. We explicitly instruct the LLM to mark its own uncertainty. This was the decision with the highest ROI — instead of hoping the model's confidence would track its actual reliability, we made uncertainty explicit and machine-readable.
The Small Model Problem
The full structuring prompt with speaker info, template instructions, and metadata is 600+ tokens. For large models (GPT-4, Llama 70B) this works well. For small local models (1B–3B running on CPU), we discovered two things:
Problem 1: Context window confusion. At 1B parameters, a 600-token system prompt plus a 500-word transcript caused the model to lose coherence. It would start echoing the prompt back in the output, or confuse the section instructions with the section content.
Fix: A minimal prompt for small models:
STRUCTURING_SIMPLE_PROMPT = """Du er en norsk medisinsk dokumentasjonsassistent.
Skriv profesjonelt medisinsk norsk. Rett skrivefeil og bruk korrekt terminologi.
Konsultasjon:
{transcript}
Fyll ut feltene basert KUN på teksten over. Skriv kort og presist.
{"chief_complaint":"...", "history":"...", ...}
JSON:"""
Two-thirds shorter. Much higher success rate on small models.
Problem 2: Transcript length. Long consultations (30+ minute sessions) produced transcripts of 3,000+ words. Small models degrade significantly past ~500 tokens of input. The fix was a hard truncation at 500 characters:
MAX_TRANSCRIPT_CHARS = 500
if len(transcript_text) > MAX_TRANSCRIPT_CHARS:
    transcript_text = transcript_text[:MAX_TRANSCRIPT_CHARS]
This is a deliberate compromise. We truncate the tail of the transcript rather than serving degraded output from the full transcript. The comment in the code says it plainly: "Truncate to 500 chars — keeps LLM fast on CPU while preserving key info."
Phase 4: The First Hallucination Failure
What the AI Got Wrong
This was the failure mode that required the most design work.
During testing, a 1B local model produced this output for a hypertension follow-up consultation:
{
"chief_complaint": "Blodtrykk kontroll",
"medications": "Pasienten tar Losartan 50mg daglig, Amlodipin 5mg, og Metoprolol 25mg.",
"examination": "BP: 142/88. Kontaktinfo: lege@klinikken.no, tlf: 22 33 44 55"
}
The patient was only on one medication. The model added two more from its training data — common hypertension medications that often appear together in medical texts. It hallucinated a clinically plausible but completely wrong medication list.
It also fabricated a clinic phone number and email address in the examination section — data that was not in the transcript at all.
Why This Happens
LLMs learn statistical associations. In medical training data, losartan and amlodipine frequently appear together. The model completed the pattern from its training rather than from the transcript. This is the fundamental risk in domain-specific AI: the model knows enough to produce plausible hallucinations.
The Five-Layer Hallucination Defence
We didn't fix this with one change. We built five independent layers:
Layer 1 — The Prompt Instruction
Extract information ONLY from the provided transcript. Never invent or assume.
Necessary but not sufficient. The model sometimes ignores it.
Layer 2 — Self-Tagging for Uncertainty
Flag any content you are uncertain about with [VERIFY].
When the model wasn't sure, it would output "Losartan 50mg [VERIFY]". This became machine-readable:
# In SafetyGuardrails.check_note():
for section, content in note.sections.items():
if "[VERIFY]" in content:
result.add_flag(SafetyFlag(
severity="warning",
category="llm_uncertainty",
message=f"LLM flagged uncertainty in section: {section.value}",
        ))
Any [VERIFY] tag surfaced as a visible warning in the doctor's review UI.
Layer 3 — Source Fidelity Scoring
The AIEvaluator._score_source_fidelity() method checks whether significant words from the transcript appear in the generated note:
# Extract significant words from the transcript
transcript_words = set(w.lower() for w in transcript.split() if len(w) > 4)
# Check how many appear in the note
note_text = " ".join(sections.values()).lower()
found = sum(1 for w in transcript_words if w in note_text)
fidelity = min(1.0, found / max(len(transcript_words) * 0.3, 1))
If the note mentions "amlodipin" but the transcript never said it, the fidelity score drops. A low fidelity score (below 0.5) triggers a quality warning.
Layer 4 — Hallucination Pattern Detection
The guardrail checks for data patterns that are unlikely to come from a consultation transcript:
def _check_hallucination_patterns(self, note, result):
all_text = " ".join(note.sections.values())
# Phone numbers — not mentioned in transcript, model fabricated them
phone_pattern = re.compile(r"\+?\d[\d\s\-]{8,}")
if phone_pattern.search(all_text):
result.add_flag(SafetyFlag(
severity="warning",
category="hallucination_risk",
message="Note contains phone number — verify this came from transcript",
))
# Email addresses — same issue
email_pattern = re.compile(r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}")
if email_pattern.search(all_text):
        result.add_flag(SafetyFlag(severity="warning", ...))
Layer 5 — LLM Self-Reference Removal
A separate failure mode: sometimes a model would output disclaimers inside the clinical note:
{
"assessment": "As an AI, I cannot make a clinical diagnosis. However, based on the symptoms described..."
}
This is a model breaking character. It's not hallucination in the medical sense, but it's not a clinical note. The post-processor removes these sentences:
HALLUCINATION_MARKERS = [
"as an ai",
"i cannot",
"i don't have",
"sorry",
"please note",
"this is not medical advice",
]
def remove_hallucinations(text: str) -> str:
for marker in HALLUCINATION_MARKERS:
if marker in text.lower():
sentences = text.split('. ')
sentences = [s for s in sentences if marker not in s.lower()]
text = '. '.join(sentences)
    return text
Phase 5: Post-Processing — Never Show Raw LLM Output
The Principle
The rule in production clinical AI is: raw LLM output never reaches the clinician. Every output passes through a quality pipeline first.
The post-processing layer runs four transformations in sequence:
def post_process_note(sections):
    for key, text in sections.items():
        text = fix_medical_terms(text)       # informal → clinical Norwegian
        text = remove_repetitions(text)      # deduplication
        text = remove_hallucinations(text)   # strip AI disclaimers
        text = clean_formatting(text)        # normalise whitespace, punctuation
        sections[key] = text                 # write the cleaned text back
    return sections
The Repetition Problem
Small local models repeat themselves. This was the second major failure mode discovered in testing. A 1B model would produce:
"Pasienten har hodepine. Pasienten har hodepine. Hodepine hodepine."
This comes from beam search getting stuck in high-probability loops — the model picks the word it's most likely to repeat because repetition is common in training data when context becomes confused.
The deduplication fix:
def remove_repetitions(text):
# Remove duplicate consecutive sentences
sentences = text.split('. ')
seen = []
for s in sentences:
if s.strip() not in [x.strip() for x in seen]:
seen.append(s)
result = '. '.join(seen)
# Remove word-level repetition (3+ consecutive repeats)
result = re.sub(r'\b(\w+)(\s+\1){2,}\b', r'\1', result, flags=re.IGNORECASE)
    return result
The Nested Dict Problem
The third failure mode: small models sometimes returned nested JSON objects instead of strings for section values:
{
"chief_complaint": {
"main_complaint": "hodepine",
"duration": "3 dager"
}
}
This is valid JSON but wrong schema. The _flatten_value() function handles this by recursively extracting string values from any nested structure:
def _flatten_value(value):
if isinstance(value, str):
return value
if isinstance(value, list):
return "\n".join(f"- {_flatten_value(item)}" for item in value)
if isinstance(value, dict):
parts = [_flatten_value(v) for v in value.values() if _flatten_value(v) != "Not documented."]
return "\n".join(parts) if parts else "Not documented."
    return str(value) if value is not None else "Not documented."
Phase 6: The Workflow State Machine
Why a State Machine?
A clinical note has a defined lifecycle:
CREATED → RECORDING → TRANSCRIBING → TRANSCRIBED
→ STRUCTURING → STRUCTURED → REVIEW → APPROVED
↕
FAILED (→ retry from CREATED)Without enforced state transitions, the API could receive an approval request for a note that hasn't been structured yet. Or a structuring request for a note already approved. These are bugs that would silently corrupt the audit trail.
The workflow engine enforces every transition:
TRANSITIONS = {
VisitStatus.CREATED: {VisitStatus.RECORDING, VisitStatus.FAILED},
VisitStatus.RECORDING: {VisitStatus.TRANSCRIBING, VisitStatus.FAILED},
VisitStatus.TRANSCRIBING: {VisitStatus.TRANSCRIBED, VisitStatus.FAILED},
VisitStatus.TRANSCRIBED: {VisitStatus.STRUCTURING, VisitStatus.FAILED},
VisitStatus.STRUCTURING: {VisitStatus.STRUCTURED, VisitStatus.FAILED},
VisitStatus.STRUCTURED: {VisitStatus.REVIEW, VisitStatus.FAILED},
VisitStatus.REVIEW: {VisitStatus.APPROVED, VisitStatus.STRUCTURED},
VisitStatus.APPROVED: set(), # terminal — nothing leaves approved
VisitStatus.FAILED: {VisitStatus.CREATED}, # retry allowed
}
Any attempt to jump states raises InvalidTransitionError. The engine is pure logic — it never touches the database. The API layer calls engine.transition(), gets back (updated_visit, audit_entry), and persists both atomically.
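A sketch of such a pure transition function, using plain strings in place of the VisitStatus enum and a trimmed transition table (illustrative only, not the engine's actual code):

```python
class InvalidTransitionError(Exception):
    pass

# Trimmed transition table; the real one covers every VisitStatus.
TRANSITIONS = {
    "STRUCTURED": {"REVIEW", "FAILED"},
    "REVIEW": {"APPROVED", "STRUCTURED"},
    "APPROVED": set(),  # terminal — nothing leaves approved
}

def transition(visit: dict, new_status: str, actor: str):
    """Validate and apply a state change without touching storage.

    Returns (updated_visit, audit_entry); the caller persists both
    atomically.
    """
    allowed = TRANSITIONS.get(visit["status"], set())
    if new_status not in allowed:
        raise InvalidTransitionError(
            f"{visit['status']} -> {new_status} is not a valid transition"
        )
    updated = {**visit, "status": new_status}
    audit = {"from": visit["status"], "to": new_status, "actor": actor}
    return updated, audit
```

Because the function never mutates its input and never does I/O, every transition rule can be unit-tested without a database.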
REVIEW → STRUCTURED: The Rejection Path
This transition is the most important one. When a doctor rejects a structured note, it doesn't go to FAILED — it goes back to STRUCTURED. The note still exists. The doctor can edit it directly, or request re-structuring with corrections. FAILED is reserved for system failures (Whisper timeout, LLM crash), not for clinical rejection.
Audit Trail — Every Transition Logged
Every state change produces an immutable audit entry:
TRANSITION_AUDIT_MAP = {
(VisitStatus.REVIEW, VisitStatus.APPROVED): AuditAction.NOTE_APPROVED,
(VisitStatus.REVIEW, VisitStatus.STRUCTURED): AuditAction.NOTE_REJECTED,
...
}
In healthcare, auditability is not optional. Every change must be traceable to a person, at a time, with a reason. The audit log answers: who approved this note, when, and what was the state of the note at approval.
Phase 7: Multi-Agent Orchestration
The Design Philosophy
After the doctor approves the clinical note, the system can help with what comes next. But "can help" is the key phrase. The orchestrator:
- Suggests actions
- Generates previews of what each action would produce
- Waits for the doctor to approve each one individually
- Executes only approved actions
- Never makes clinical decisions
This is the human-in-the-loop principle implemented as an architecture.
The Agent Plan: Preview Before Execute
# 1. Doctor finishes consultation
plan = await orchestrator.plan_post_consultation(visit_id, note_text)
# Returns a plan with each action in PREVIEW status
# 2. Doctor sees the plan — each action shows what it would do
# coding agent: "Suggested codes: J11.1 (Viral pneumonia, unspecified), confidence: high"
# follow_up agent: "Suggested tasks: [lab tests, return visit in 2 weeks]"
# referral agent: "Draft referral to pulmonology: [letter text]"
# 3. Doctor approves or skips each one
await orchestrator.approve_action(plan, coding_action_id)
await orchestrator.skip_action(plan, referral_action_id) # no referral needed
# 4. Execute only approved actions
await orchestrator.execute_action(plan, coding_action_id, actor="dr.smith")
The key technical invariant: execute_action() only runs if status is APPROVED or PREVIEW. There is no code path that bypasses this.
The Five Agents
| Agent | Task | Risk | Why This Risk Level |
|-------|------|------|---------------------|
| CodingAgent | Suggest ICD-10 / ICPC-2 codes | LOW | Just suggestions, never auto-applied. Wrong code is corrected at billing, not at care |
| FollowUpAgent | Create follow-up tasks | MEDIUM | Missing a follow-up has clinical consequences |
| ReferralDraftAgent | Draft referral letter | MEDIUM | Letter goes to specialist — content matters |
| CarePlanAgent | Update care plan | MEDIUM | Care plan changes affect ongoing treatment |
| LetterDraftAgent | Draft patient letters: discharge summary (epikrise), appointment letter (innkalling), sick note (sykemelding) | MEDIUM | Patient-facing or legal documents |
Risk level determines the UI treatment: LOW actions can be shown with a single "apply" button. MEDIUM actions require the doctor to read the preview. HIGH actions (not yet implemented) would require explicit approval plus a reason.
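One way to keep that policy from scattering into UI conditionals is a single lookup table. A hypothetical sketch — the names here are illustrative, not the system's actual identifiers:

```python
from enum import Enum

class Risk(Enum):
    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"

# One place that encodes what each risk level demands of the reviewer.
UI_REQUIREMENT = {
    Risk.LOW: "single_apply_button",
    Risk.MEDIUM: "preview_must_be_opened",
    Risk.HIGH: "explicit_approval_with_reason",  # not yet implemented
}

def review_requirement(risk: Risk) -> str:
    return UI_REQUIREMENT[risk]
```

Adding a new agent then means assigning it a risk level, not writing new review logic.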
ICD-10 Code Hallucination: The Coding Agent Failure
The CodingAgent had its own hallucination failure mode.
When given a vague consultation note — "patient feels unwell, fatigue" — the LLM would suggest 4–6 highly specific ICD-10 codes with "confidence": "high". It was pattern-matching on common code combinations from training data, not reasoning about the clinical evidence.
The fix was a dual-source architecture:
async def preview(self, context):
# Source 1: Deterministic keyword matching against known ICD-10 terms
from medscribe.services.norwegian import suggest_icd10
keyword_suggestions = suggest_icd10(note_text)
# Source 2: LLM suggestions with structured confidence
llm_codes = await self._llm.generate("""
Return JSON: [{"code": "ICD-10", "description": "...", "confidence": "high|medium|low"}]
Return ONLY JSON.
""")
return {
"suggested_codes": llm_codes, # LLM suggestions
"keyword_matches": keyword_suggestions, # deterministic matches
}
The UI shows both sources separately. Keyword matches have a "deterministic" badge — the doctor knows these came from a lookup, not an LLM. LLM suggestions have confidence levels. A "low" confidence LLM suggestion that doesn't overlap with keyword matches is a signal to be skeptical.
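That skepticism rule is cheap to compute deterministically. A sketch, assuming both sources return code entries shaped like the JSON above (the helper name is hypothetical):

```python
def flag_suspect_codes(llm_codes: list[dict],
                       keyword_matches: list[dict]) -> list[str]:
    """Return LLM-suggested codes deserving extra scrutiny:
    low confidence AND no overlap with the deterministic matches."""
    deterministic = {m["code"] for m in keyword_matches}
    return [
        c["code"]
        for c in llm_codes
        if c["confidence"] == "low" and c["code"] not in deterministic
    ]
```

The UI could render these with a warning badge so the doctor's attention lands where the evidence is weakest.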
Phase 8: RAG — Patient Context Without Hallucination
The Problem with General Chatbots in Healthcare
A general LLM answering "What medications was this patient on last visit?" draws on its training data for context. It might say "patients with this diagnosis are commonly on metformin" — which is statistically true but not factually true for this patient.
Patient-specific Q&A needs to be grounded in the patient's actual records, not statistical norms.
The RAG Architecture
async def ask(self, question: str, patient_id: str) -> dict:
# 1. Retrieve actual patient visit notes from DB
context_chunks, sources = await self._retrieve_patient_context(patient_id)
# 2. Inject only retrieved data as context
result = await self._llm.generate(
prompt=f"""Svar på spørsmålet basert KUN på pasientens journalnotater nedenfor.
Spørsmål: {question}
Pasientens notater:
{context}
Regler:
1. Svar KUN basert på informasjonen i notatene.
2. Hvis informasjonen ikke finnes, si "Ikke funnet i tilgjengelige notater."
3. Referer til hvilken dato informasjonen kommer fra."""
)
# 3. Return answer with source citations
return {
"answer": result.text,
"sources": sources, # which visit notes were used
}
The critical line is Rule 2: "If information is not found, say so." Without this, the model fills the gap from training data. With it, the model explicitly admits ignorance — which is the correct clinical behaviour.
Every answer returns sources — the specific visit IDs and dates the answer was drawn from. The doctor can click through to verify.
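The retrieval step itself doesn't need to be exotic. A sketch of building the injected context and the citation list from fetched visit records — the record shape and helper name are assumptions, not the service's actual code:

```python
def build_patient_context(visits: list[dict]) -> tuple[str, list[dict]]:
    """Turn fetched visit records into (prompt context, source citations).

    Each chunk is date-prefixed so the model can obey rule 3
    ("cite the date the information came from").
    """
    chunks = [f"[{v['date']}] {v['note']}" for v in visits]
    sources = [{"visit_id": v["id"], "date": v["date"]} for v in visits]
    return "\n\n".join(chunks), sources
```

Keeping context construction separate from the LLM call also makes it easy to log exactly what the model was shown for any given answer.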
Phase 9: Quality Evaluation and Drift Detection
Automated Quality Scoring
Every structured note is scored as a weighted combination of four signals:
overall_score = (
completeness * 0.3 + # fraction of sections filled
source_fidelity * 0.4 + # words from transcript found in note
consistency * 0.2 + # structural quality (no JSON fragments, no empty sections)
safety_pass * 0.1 # passed guardrails
)
Source fidelity has the highest weight (0.4) because it's the most direct measure of hallucination. A note that uses words not in the transcript is a note that added information from somewhere other than the conversation.
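Folded into a runnable helper (treating safety_pass as a boolean gate, which is my reading of the 0.1 weight above):

```python
def overall_score(completeness: float, source_fidelity: float,
                  consistency: float, safety_pass: bool) -> float:
    """Weighted quality score; weights sum to 1.0, fidelity dominates."""
    return (completeness * 0.3
            + source_fidelity * 0.4
            + consistency * 0.2
            + (1.0 if safety_pass else 0.0) * 0.1)
```

With these weights, a note that fails source fidelity entirely caps out at 0.6 even if everything else is perfect.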
Drift Detection
The QualityMonitor tracks scores over time and compares the first half of recent results to the second half:
from statistics import mean

def get_trend(self, last_n=20):
    scores = [r.overall_score for r in self._history[-last_n:]]
    first_half_avg = mean(scores[:len(scores) // 2])
    second_half_avg = mean(scores[len(scores) // 2:])
    if second_half_avg < first_half_avg - 0.1:
        return "declining"
    if second_half_avg > first_half_avg + 0.1:
        return "improving"
    return "stable"
A "declining" trend means: the model's output quality is getting worse over time. This happens when model context changes (temperature changes, system updates, prompt changes) or when input distribution shifts (new patient demographics, new clinical areas).
Phase 10: Verification Service — Optimistic Locking in Healthcare
The Pattern
The verification service manages the human approval workflow for clinical documentation. Two reviewers might try to approve the same document simultaneously — a race condition.
The fix is optimistic locking with version numbers:
def _assert_version(current: int, expected: int) -> None:
"""Raises 409 if another writer updated first."""
if current != expected:
raise HTTPException(
status_code=409,
detail=f"Version conflict: expected {expected}, got {current}. "
"Record was modified by another request. Please refresh and retry."
)
async def approve(self, verification_id, reviewer, expected_version=None):
v = await self._get_or_404(verification_id)
_assert_transition(v.status, VerificationStatus.APPROVED)
if expected_version is not None:
_assert_version(v.version, expected_version) # ← conflict detection
v.status = VerificationStatus.APPROVED
    v.version += 1  # ← increment on every write
The UI sends expected_version with every approval request. If two reviewers approve simultaneously, the second gets a 409 and is told to refresh. No silent duplicate approvals.
This is the same pattern as the Nordea banking race condition — but in a healthcare context where a double-approval could mean a patient's record shows contradictory clinical assessments.
The Meta-Lessons
1. Defence in Depth for Hallucination
No single guardrail catches everything. We ended up with five independent layers:
- Prompt instruction ("extract only")
- Self-tagging ([VERIFY])
- Source fidelity scoring
- Pattern detection (phone, email, address)
- LLM self-reference removal
Each layer catches a class of failures the others miss. All five together still don't catch everything — the doctor review is the final defence. The goal of the automated layers is not to eliminate the need for review. It's to make review faster and more focused.
2. Determinism vs. Probability
Use deterministic logic where you need consistency (terminology correction, hallucination markers, state transitions). Use LLMs where you need intelligence (structuring, drafting, reasoning).
The biggest early mistake was asking the LLM to handle things that should have been deterministic. A lookup table for ICD-10 keyword matching is faster, cheaper, and more consistent than asking the LLM to code on its own.
3. Small Models Need Different Prompts
A prompt that works beautifully on GPT-4 may completely confuse a 1B local model. Small models lose coherence with long context. They repeat themselves. They return wrong data types. They misunderstand complex instructions.
For clinical AI running locally (privacy requirement), you need a separate prompt engineering strategy for small models — shorter prompts, simpler structures, explicit output format examples.
4. Human-in-the-Loop Is Not a UI Feature
The approval requirement isn't something bolted on for compliance. It's enforced at the architecture level — execute_action() checks status before running. There's no API call that writes to the EPJ without passing through an approval state transition. The system cannot accidentally skip review.
5. Auditability Is a First-Class Requirement
Every state transition, every agent execution, every human approval is audit-logged with actor, timestamp, and detail. Not because we wanted to — because healthcare demands it. But the design benefit was substantial: debugging production issues became tracing audit logs rather than reading application logs.
Checklist: Healthcare AI System Design
□ Is every AI output verified by a human before clinical action?
□ Does the structuring prompt explicitly say "extract only, never invent"?
□ Does the LLM self-tag uncertainty with a machine-readable marker?
□ Is there source fidelity measurement (output vs. input overlap)?
□ Are AI disclaimers removed before showing output to clinicians?
□ Are there separate prompts for small vs. large models?
□ Is the workflow a state machine with enforced transitions?
□ Is every state transition audit-logged with actor + timestamp?
□ Does the RAG system return source citations with every answer?
□ Is there a quality drift monitor comparing score trends over time?
□ Does concurrent approval use optimistic locking?
□ Can the system fail gracefully when AI is unavailable?
The answer to every question above should be yes before a clinical AI system handles real patient data.