Scenario Based Questions · Lesson 11 of 13
Scenario: Design a Healthcare AI Chatbot
The Interview Question
"Design a pharmaceutical information chatbot that helps users look up drug information, check drug interactions, and get dosage guidance. It should be accurate, safe, and scale to 100,000 daily users."
This is a classic AI systems design question. The interviewer is testing whether you understand RAG, safety, scalability, and real-world trade-offs.
Step 1: Clarify Requirements
Before drawing boxes, ask clarifying questions:
- User type: General public or healthcare professionals? (Different safety requirements)
- Sources: What drug database do we have access to? (FDA, RxNorm, proprietary?)
- Language: English only or multilingual?
- Regulatory: Is this a regulated medical device (FDA 510k) or an informational tool?
- Latency target: Acceptable p95 latency? (3s? 5s?)
- Availability: 99.9% (8.7h downtime/year) or 99.99%?
Assume: general public, English, informational (not regulated), 5s p95, 99.9%.
Step 2: Back-of-Envelope
100,000 daily users:
- Average 3 queries per session = 300,000 queries/day
- 300,000 / 86,400 = ~3.5 queries/second average
- Peak (10× average) = ~35 queries/second
Token budget per query:
- System prompt: ~300 tokens
- Retrieved drug data: ~800 tokens
- User message: ~50 tokens
- Response: ~300 tokens
- Total: ~1,450 tokens per query
Monthly token cost (GPT-4o at $2.50/1M input, $10/1M output):
- 300,000 queries × 1,150 input tokens = 345M input tokens → ~$862
- 300,000 queries × 300 output tokens = 90M output tokens → ~$900
- Total LLM cost without caching: ~$1,762/month
- With 60% semantic cache hit rate: ~$705/month
This is very affordable — scale concern is more about latency and safety than cost.
Step 3: System Architecture
┌─────────────┐
│ Browser │
│ / Mobile │
└──────┬──────┘
│ HTTPS
▼
┌─────────────┐
│ CDN/WAF │ ← Static assets, DDoS protection
└──────┬──────┘
│
▼
┌─────────────┐
│ API GW │ ← Rate limiting, auth, routing
└──────┬──────┘
│
┌────────────┴────────────┐
│ │
▼ ▼
┌────────────────┐ ┌────────────────┐
│ Chat Service │ │ Ingestion Svc │
│ (FastAPI) │ │ (worker) │
│ 3 replicas │ └───────┬────────┘
└───────┬────────┘ │
│ ▼
┌────────┴───────┐ ┌────────────────┐
│ │ │ Drug Database │
▼ ▼ │ (source docs) │
┌───────┐ ┌──────────────┐ └────────────────┘
│ Redis │ │ Vector DB │
│ Cache │ │ (Azure AI │
│ │ │ Search) │
└───────┘ └──────┬───────┘
│
▼
┌────────────┐
│ Azure OAI │
│ GPT-4o │
└────────────┘Step 4: Component Design
Chat Service
The core request flow:
# pharmabot/api/chat.py
async def handle_chat(request: ChatRequest) -> ChatResponse:
# 1. Input guard
if await is_harmful_input(request.message):
return ChatResponse(answer=SAFE_REDIRECT, blocked=True)
# 2. Check semantic cache
cached = await semantic_cache.get(request.message)
if cached:
return ChatResponse(answer=cached, from_cache=True)
# 3. Query rewriting (for vague queries)
query = await rewrite_query(request.message, request.history)
# 4. Retrieve drug information
docs = await vector_search(query, top_k=5)
docs = await rerank(query, docs) # cross-encoder reranker
# 5. Generate response
response = await generate_with_context(query, docs, request.history)
# 6. Output guard
if not await is_safe_output(response):
log.warning("output_blocked", query=query)
return ChatResponse(answer=FALLBACK_RESPONSE, blocked=True)
# 7. Cache result
await semantic_cache.set(request.message, response)
return ChatResponse(answer=response, sources=[d.id for d in docs])Drug Knowledge Base
The drug interaction and information database needs to be:
- Chunked at the drug + indication level (not arbitrary character chunks)
- Updated weekly from FDA drug label database (DailyMed)
- Each chunk tagged with: drug_name, interaction_severity, last_updated
# Each chunk in vector store:
{
"id": "warfarin-ibuprofen-interaction-v2",
"content": "Warfarin and ibuprofen interaction: NSAIDs reduce platelet aggregation...",
"metadata": {
"drug_a": "warfarin",
"drug_b": "ibuprofen",
"severity": "major",
"source": "FDA DailyMed",
"updated": "2026-03-01"
}
}Semantic Cache
Cache drug queries by semantic similarity. Drug information doesn't change hour-to-hour — a 24-hour cache TTL is appropriate:
class SemanticCache:
def __init__(self, redis, embedder, threshold=0.92):
self.redis = redis
self.embedder = embedder
self.threshold = threshold # high threshold for medical accuracy
async def get(self, query: str) -> str | None:
query_emb = await self.embedder.embed(query)
# Search Redis for similar cached queries
candidates = await self.redis.execute("FT.SEARCH", ...)
for candidate in candidates:
if cosine_similarity(query_emb, candidate.embedding) > self.threshold:
return candidate.response
return NoneStep 5: Safety Architecture
Four-layer safety stack:
| Layer | What It Checks | Latency | |---|---|---| | Input classifier | Harmful queries, jailbreaks | under 50ms | | RAG grounding | Answer based on retrieved facts only | included in retrieval | | System prompt | Explicit safety rules in every request | zero overhead | | Output classifier | Harmful advice in generated response | 200-500ms |
The output classifier adds latency but is mandatory for medical applications. Stream the response to the user while the classifier runs in parallel — if the classifier flags it, replace the streamed response with the fallback.
Step 6: Scalability
Horizontal scaling: The chat service is stateless — scale to N replicas behind a load balancer. Azure Container Apps handles this automatically based on HTTP queue length.
Database scaling: Azure AI Search scales horizontally; vector search latency stays under 100ms up to 10 million documents.
Cache reduces load: With 60% cache hit rate, only 40% of queries hit the LLM. This is the single biggest cost and latency optimization.
Rate limiting: 10 requests/minute per user IP at the API Gateway level. Prevents runaway automation.
Step 7: Monitoring
Key metrics to track in production:
| Metric | Target | Alert Threshold | |---|---|---| | p95 latency | under 5s | above 8s | | Cache hit rate | above 50% | below 30% | | Output blocked rate | under 0.5% | above 2% | | LLM error rate | under 0.1% | above 1% | | Cost per 1,000 queries | under $6 | above $10 |
Step 8: MVP vs Production
MVP (2 weeks):
- Single FastAPI service
- FAISS vector store (local file)
- OpenAI API key (no Azure)
- No semantic cache
- Simple rule-based output guard
- Manual drug data upload
Production (3 months):
- Azure Container Apps (auto-scale)
- Azure AI Search (managed vector store)
- Redis semantic cache
- LLM output classifier
- Weekly automated ingestion from DailyMed
- LangSmith tracing
- Azure Monitor alerts
Skip for now:
- Multi-language support (add when needed)
- Fine-tuned model (prompting + RAG is sufficient)
- Custom embedding model (OpenAI text-embedding-3-small works well)