System Design: Pharmaceutical Chatbot

The Interview Question

"Design a pharmaceutical information chatbot that helps users look up drug information, check drug interactions, and get dosage guidance. It should be accurate, safe, and scale to 100,000 daily users."

This is a classic AI systems design question. The interviewer is testing whether you understand retrieval-augmented generation (RAG), safety, scalability, and real-world trade-offs.

Step 1: Clarify Requirements

Before drawing boxes, ask clarifying questions:

User type: General public or healthcare professionals? (Different safety requirements)
Sources: What drug database do we have access to? (FDA, RxNorm, proprietary?)
Language: English only or multilingual?
Regulatory: Is this a regulated medical device (FDA 510(k)) or an informational tool?
Latency target: Acceptable p95 latency? (3s? 5s?)
Availability: 99.9% (about 8.8h downtime/year) or 99.99% (about 53 min/year)?

Assume: general public, English, informational (not regulated), 5s p95, 99.9%.

Step 2: Back-of-Envelope

100,000 daily users:

Average 3 queries per session = 300,000 queries/day
300,000 / 86,400 = ~3.5 queries/second average
Peak (10× average) = ~35 queries/second
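The traffic arithmetic above can be reproduced in a few lines. This is a minimal sketch: the function name and the `peak_factor` parameter are illustrative, and the inputs are the figures assumed above.

```python
SECONDS_PER_DAY = 86_400


def estimate_qps(daily_users: int, queries_per_user: float, peak_factor: float = 10.0):
    """Return (queries per day, average QPS, peak QPS)."""
    queries_per_day = daily_users * queries_per_user
    average_qps = queries_per_day / SECONDS_PER_DAY
    return queries_per_day, average_qps, average_qps * peak_factor


per_day, avg_qps, peak_qps = estimate_qps(100_000, 3)
print(f"{per_day:,.0f} queries/day, {avg_qps:.1f} QPS average, {peak_qps:.0f} QPS peak")
```

The 10× peak factor is a rule of thumb rather than a measurement; with real traffic data it would be replaced by the observed peak-to-average ratio.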

Token budget per query:

System prompt: ~300 tokens
Retrieved drug data: ~800 tokens
User message: ~50 tokens
Response: ~300 tokens
Total: ~1,450 tokens per query

Daily token cost (GPT-4o at $2.50/1M input, $10/1M output):

300,000 queries × 1,150 input tokens = 345M input tokens → ~$862
300,000 queries × 300 output tokens = 90M output tokens → ~$900
Total LLM cost without caching: ~$1,762/day, or roughly $53,000 over a 30-day month
With 60% semantic cache hit rate: ~$705/day, or roughly $21,000/month

That works out to about $6 per 1,000 queries before caching, which is manageable for a product at this scale. The harder problems are latency and safety rather than cost.
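The token budget and pricing above can be turned into a small calculator. Note that 300,000 queries is one day of traffic, so this sketch reports a per-day figure and scales it to a 30-day month. It assumes, optimistically, that a cache hit costs nothing; the class and function names are illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TokenBudget:
    system_prompt: int = 300
    retrieved_context: int = 800
    user_message: int = 50
    response: int = 300

    @property
    def input_tokens(self) -> int:
        return self.system_prompt + self.retrieved_context + self.user_message


def daily_llm_cost(
    queries_per_day: int,
    budget: TokenBudget,
    input_price_per_m: float = 2.50,   # USD per 1M input tokens (figure from the text)
    output_price_per_m: float = 10.00,  # USD per 1M output tokens (figure from the text)
    cache_hit_rate: float = 0.0,
) -> float:
    """USD per day spent on the LLM, counting only queries that miss the cache."""
    llm_queries = queries_per_day * (1.0 - cache_hit_rate)
    input_cost = llm_queries * budget.input_tokens / 1_000_000 * input_price_per_m
    output_cost = llm_queries * budget.response / 1_000_000 * output_price_per_m
    return input_cost + output_cost


budget = TokenBudget()
uncached = daily_llm_cost(300_000, budget)
cached = daily_llm_cost(300_000, budget, cache_hit_rate=0.60)
print(f"no cache: ${uncached:,.2f}/day, ${uncached * 30:,.0f} per 30 days")
print(f"60% hits: ${cached:,.2f}/day, ${cached * 30:,.0f} per 30 days")
```

The estimate leaves out chat history tokens, embedding calls, and the query-rewriting call, so real spend would likely be somewhat higher.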

Step 3: System Architecture

                    ┌─────────────┐
                    │   Browser   │
                    │  / Mobile   │
                    └──────┬──────┘
                           │ HTTPS
                           ▼
                    ┌─────────────┐
                    │  CDN/WAF    │  ← Static assets, DDoS protection
                    └──────┬──────┘
                           │
                           ▼
                    ┌─────────────┐
                    │  API GW     │  ← Rate limiting, auth, routing
                    └──────┬──────┘
                           │
              ┌────────────┴────────────┐
              │                         │
              ▼                         ▼
     ┌────────────────┐       ┌────────────────┐
     │  Chat Service  │       │ Ingestion Svc  │
     │  (FastAPI)     │       │ (worker)       │
     │  3 replicas    │       └───────┬────────┘
     └───────┬────────┘               │
             │                        ▼
    ┌────────┴───────┐      ┌────────────────┐
    │                │      │  Drug Database │
    ▼                ▼      │  (source docs) │
┌───────┐  ┌──────────────┐ └────────────────┘
│ Redis │  │  Vector DB   │
│ Cache │  │ (Azure AI    │
│       │  │  Search)     │
└───────┘  └──────┬───────┘
                  │
                  ▼
           ┌────────────┐
           │ Azure OAI  │
           │  GPT-4o    │
           └────────────┘

Step 4: Component Design

Chat Service

The core request flow:

Python

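# pharmabot/api/chat.py
import structlog

# structlog is assumed here because the warning below passes key-value fields.
log = structlog.get_logger()

# ChatRequest, ChatResponse, SAFE_REDIRECT, FALLBACK_RESPONSE, semantic_cache,
# and the guard, retrieval, and generation helpers are assumed to live
# elsewhere in the service.


async def handle_chat(request: ChatRequest) -> ChatResponse:
    # 1. Input guard: stop harmful queries and jailbreak attempts early
    if await is_harmful_input(request.message):
        return ChatResponse(answer=SAFE_REDIRECT, blocked=True)

    # 2. Check semantic cache. The key is the raw message, so this is only
    #    safe for messages that make sense without the chat history.
    cached = await semantic_cache.get(request.message)
    if cached is not None:
        return ChatResponse(answer=cached, from_cache=True)

    # 3. Query rewriting (turns a vague follow-up into a standalone query)
    query = await rewrite_query(request.message, request.history)

    # 4. Retrieve drug information
    docs = await vector_search(query, top_k=5)
    docs = await rerank(query, docs)  # cross-encoder reranker

    # 5. Generate a response grounded in the retrieved documents
    response = await generate_with_context(query, docs, request.history)

    # 6. Output guard: never return (or cache) a response that fails the check
    if not await is_safe_output(response):
        log.warning("output_blocked", query=query)
        return ChatResponse(answer=FALLBACK_RESPONSE, blocked=True)

    # 7. Cache result under the same key used for the lookup in step 2
    await semantic_cache.set(request.message, response)

    return ChatResponse(answer=response, sources=[d.id for d in docs])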
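The handler relies on request and response types that are not shown. A minimal sketch of what they might look like, using standard-library dataclasses: the field names are inferred from how `handle_chat` uses them, while the `Turn` type and the defaults are assumptions. A FastAPI service would more likely define these as Pydantic models.

```python
from dataclasses import dataclass, field


@dataclass
class Turn:
    """One prior message in the conversation (hypothetical shape)."""
    role: str      # "user" or "assistant"
    content: str


@dataclass
class ChatRequest:
    message: str
    history: list[Turn] = field(default_factory=list)


@dataclass
class ChatResponse:
    answer: str
    sources: list[str] = field(default_factory=list)  # IDs of retrieved chunks
    from_cache: bool = False
    blocked: bool = False


# The three kinds of response the handler produces:
fresh = ChatResponse(answer="generated answer", sources=["warfarin-ibuprofen-interaction-v2"])
cache_hit = ChatResponse(answer="cached answer", from_cache=True)
refused = ChatResponse(answer="fallback text", blocked=True)
```

Keeping `blocked` and `from_cache` as explicit flags makes it straightforward to compute the blocked rate and cache hit rate from response logs.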

Drug Knowledge Base

The drug information and interaction knowledge base needs to be:

Chunked at the drug + indication level, or per drug pair for interactions (not arbitrary character chunks)
Updated weekly from the FDA drug label database (DailyMed)
Tagged per chunk with metadata: drug names, interaction severity, source, and last-updated date

Python

# Each chunk in vector store:
{
    "id": "warfarin-ibuprofen-interaction-v2",
    "content": "Warfarin and ibuprofen interaction: NSAIDs reduce platelet aggregation...",
    "metadata": {
        "drug_a": "warfarin",
        "drug_b": "ibuprofen",
        "severity": "major",
        "source": "FDA DailyMed",
        "updated": "2026-03-01"
    }
}
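One way the ingestion worker might produce chunks in this shape is sketched below. The `InteractionRecord` type, the `slug` helper, and the ID scheme are assumptions modelled on the example chunk, not a documented format.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class InteractionRecord:
    """A parsed interaction entry before indexing (hypothetical shape)."""
    drug_a: str
    drug_b: str
    severity: str   # e.g. "minor", "moderate", "major"
    text: str
    source: str
    updated: date
    version: int = 1


def slug(name: str) -> str:
    """Lowercase a drug name and join its words with hyphens."""
    return "-".join(name.lower().split())


def to_chunk(record: InteractionRecord) -> dict:
    """Build one vector-store chunk: a stable ID, the text to embed, and filterable metadata."""
    chunk_id = f"{slug(record.drug_a)}-{slug(record.drug_b)}-interaction-v{record.version}"
    return {
        "id": chunk_id,
        "content": record.text,
        "metadata": {
            "drug_a": record.drug_a.lower(),
            "drug_b": record.drug_b.lower(),
            "severity": record.severity,
            "source": record.source,
            "updated": record.updated.isoformat(),
        },
    }


chunk = to_chunk(InteractionRecord(
    drug_a="Warfarin",
    drug_b="Ibuprofen",
    severity="major",
    text="Warfarin and ibuprofen interaction: NSAIDs reduce platelet aggregation...",
    source="FDA DailyMed",
    updated=date(2026, 3, 1),
    version=2,
))
print(chunk["id"])
```

A deterministic ID means the weekly ingestion run can overwrite a chunk in place instead of accumulating stale duplicates.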

Semantic Cache

Cache responses keyed by the semantic similarity of the query. Drug information doesn't change hour to hour, so a 24-hour cache TTL is appropriate:

Python

class SemanticCache:
    def __init__(self, redis, embedder, threshold=0.92):
        self.redis = redis
        self.embedder = embedder
        self.threshold = threshold  # high threshold for medical accuracy

    async def get(self, query: str) -> str | None:
        query_emb = await self.embedder.embed(query)
        # Search Redis for similar cached queries
        candidates = await self.redis.execute("FT.SEARCH", ...)
        for candidate in candidates:
            if cosine_similarity(query_emb, candidate.embedding) > self.threshold:
                return candidate.response
        return None
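The class above omits the `set` method and the TTL. The sketch below fills both in with a self-contained in-memory stand-in so the lookup logic can be tested without Redis. It is synchronous, scans every entry, and uses a toy bag-of-words embedder; all of these are simplifications, and a real deployment would use a proper embedding model with a Redis vector index.

```python
import math
import time
from typing import Callable, Optional, Sequence

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class InMemorySemanticCache:
    def __init__(
        self,
        embed: Callable[[str], Vector],
        threshold: float = 0.92,
        ttl_seconds: float = 24 * 3600,
        clock: Callable[[], float] = time.monotonic,
    ):
        self.embed = embed
        self.threshold = threshold
        self.ttl_seconds = ttl_seconds
        self.clock = clock
        # Each entry is (embedding, response, expires_at).
        self._entries: list = []

    def set(self, query: str, response: str) -> None:
        expires_at = self.clock() + self.ttl_seconds
        self._entries.append((self.embed(query), response, expires_at))

    def get(self, query: str) -> Optional[str]:
        now = self.clock()
        self._entries = [e for e in self._entries if e[2] > now]  # drop expired entries
        query_emb = self.embed(query)
        best_score, best_response = 0.0, None
        for emb, response, _ in self._entries:
            score = cosine_similarity(query_emb, emb)
            if score > best_score:
                best_score, best_response = score, response
        # Return the closest match, and only if it clears the threshold.
        return best_response if best_score > self.threshold else None


# Toy embedder for demonstration only: word counts over a tiny fixed vocabulary.
VOCAB = ["warfarin", "ibuprofen", "interaction", "dose", "metformin"]


def toy_embed(text: str) -> Vector:
    words = text.lower().replace("?", " ").split()
    return [float(words.count(w)) for w in VOCAB]


cache = InMemorySemanticCache(toy_embed)
cache.set("warfarin ibuprofen interaction", "cached answer")
print(cache.get("interaction ibuprofen warfarin?"))
print(cache.get("metformin dose"))
```

Returning the best match rather than the first one above the threshold matters when several cached queries are close to each other, which is common for drug names that differ by a few characters.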

Step 5: Safety Architecture

Four-layer safety stack:

| Layer | What It Checks | Latency |
|---|---|---|
| Input classifier | Harmful queries, jailbreaks | under 50ms |
| RAG grounding | Answer based on retrieved facts only | included in retrieval |
| System prompt | Explicit safety rules in every request | zero overhead |
| Output classifier | Harmful advice in generated response | 200-500ms |

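The output classifier adds latency but is mandatory for medical applications. To hide that latency, stream the response to the user while the classifier runs in parallel; if the classifier flags it, replace the streamed response with the fallback. The trade-off is that a user may briefly see text that is later withdrawn. The stricter alternative, used in the chat service code above, is to hold the response until the classifier has passed it.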
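The stream-while-classifying idea can be sketched with `asyncio`. This is one possible shape under stated assumptions, not a reference implementation: the event names (`chunk`, `replace`, `done`), the `stream_with_guard` function, and the fallback text are all hypothetical, and it assumes the full response is available before streaming starts.

```python
import asyncio
from typing import AsyncIterator, Awaitable, Callable

FALLBACK_RESPONSE = "I can't help with that. Please ask a pharmacist or doctor."


async def stream_with_guard(
    response: str,
    is_safe_output: Callable[[str], Awaitable[bool]],
    chunk_size: int = 20,
) -> AsyncIterator[tuple]:
    """Yield ("chunk", text) events while the classifier runs in the background."""
    # Start the classifier on the full response without waiting for it.
    verdict = asyncio.create_task(is_safe_output(response))
    for start in range(0, len(response), chunk_size):
        # If the verdict has arrived and it is a block, stop streaming at once.
        if verdict.done() and not verdict.result():
            yield ("replace", FALLBACK_RESPONSE)
            return
        yield ("chunk", response[start:start + chunk_size])
        await asyncio.sleep(0)  # let the classifier task make progress
    # All chunks are out; wait for the verdict before confirming the answer.
    if await verdict:
        yield ("done", "")
    else:
        yield ("replace", FALLBACK_RESPONSE)


async def _demo() -> list:
    async def always_safe(text: str) -> bool:
        return True

    return [event async for event in stream_with_guard("Take ibuprofen with food.", always_safe)]


print(asyncio.run(_demo()))
```

The client is expected to discard everything it has rendered when it receives a `replace` event, and to treat the answer as final only after `done`.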

Step 6: Scalability

Horizontal scaling: The chat service is stateless, so it can scale to N replicas behind a load balancer. Azure Container Apps can do this automatically with an HTTP scale rule based on concurrent requests.

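Database scaling: Azure AI Search scales out by adding replicas (for query throughput) and partitions (for index size). Vector search latency under 100ms at up to 10 million documents is a reasonable design target, but it depends on the service tier and index configuration, so confirm it with a load test.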

Cache reduces load: With 60% cache hit rate, only 40% of queries hit the LLM. This is the single biggest cost and latency optimization.

Rate limiting: 10 requests/minute per user IP at the API Gateway level. Prevents runaway automation.
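The per-IP limit can be illustrated with a sliding-window counter. In production this would normally be gateway configuration rather than application code, and with several replicas the counters would need shared state (for example in Redis); the class below is a single-process sketch with hypothetical names.

```python
import time
from collections import defaultdict, deque
from typing import Callable


class SlidingWindowRateLimiter:
    def __init__(
        self,
        limit: int = 10,
        window_seconds: float = 60.0,
        clock: Callable[[], float] = time.monotonic,
    ):
        self.limit = limit
        self.window_seconds = window_seconds
        self.clock = clock
        # Timestamps of accepted requests, per client IP, oldest first.
        self._hits = defaultdict(deque)

    def allow(self, client_ip: str) -> bool:
        now = self.clock()
        hits = self._hits[client_ip]
        # Forget requests that have aged out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.limit:
            return False
        hits.append(now)
        return True


limiter = SlidingWindowRateLimiter()
results = [limiter.allow("203.0.113.7") for _ in range(11)]
print(results.count(True), "allowed,", results.count(False), "rejected")
```

Limiting by IP alone penalizes users behind a shared NAT, so a real deployment would probably combine it with a session or account key once authentication exists.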

Step 7: Monitoring

Key metrics to track in production:

| Metric | Target | Alert Threshold |
|---|---|---|
| p95 latency | under 5s | above 8s |
| Cache hit rate | above 50% | below 30% |
| Output blocked rate | under 0.5% | above 2% |
| LLM error rate | under 0.1% | above 1% |
| Cost per 1,000 queries | under $6 | above $10 |
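The alert thresholds in the table can be expressed as data and evaluated against a metrics snapshot. This is a sketch: the metric key names and the `AlertRule` type are assumptions, and in practice the rules would live in the monitoring system rather than in application code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AlertRule:
    metric: str
    threshold: float
    alert_when: str  # "above" or "below"

    def fires(self, value: float) -> bool:
        if self.alert_when == "above":
            return value > self.threshold
        return value < self.threshold


# Thresholds from the table; rates are expressed as fractions.
ALERT_RULES = [
    AlertRule("p95_latency_seconds", 8.0, "above"),
    AlertRule("cache_hit_rate", 0.30, "below"),
    AlertRule("output_blocked_rate", 0.02, "above"),
    AlertRule("llm_error_rate", 0.01, "above"),
    AlertRule("cost_per_1k_queries_usd", 10.0, "above"),
]


def firing_alerts(snapshot: dict) -> list:
    """Return the names of metrics whose current value breaches its alert threshold."""
    return [
        rule.metric
        for rule in ALERT_RULES
        if rule.metric in snapshot and rule.fires(snapshot[rule.metric])
    ]


snapshot = {
    "p95_latency_seconds": 9.2,
    "cache_hit_rate": 0.55,
    "output_blocked_rate": 0.004,
    "llm_error_rate": 0.0005,
    "cost_per_1k_queries_usd": 2.4,
}
print(firing_alerts(snapshot))
```

Cache hit rate is the one metric that alerts on a low value, which is why each rule carries its own direction.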

Step 8: MVP vs Production

MVP (2 weeks):

Single FastAPI service
FAISS vector store (local file)
OpenAI API key (no Azure)
No semantic cache
Simple rule-based output guard
Manual drug data upload
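For the MVP, the rule-based output guard could be as small as a reviewed deny-list of regular expressions. The patterns below are illustrative placeholders only; a real list would be written and maintained with clinical input, and the function name is hypothetical.

```python
import re

# Illustrative patterns only, not a vetted clinical deny-list.
BLOCKED_PATTERNS = [
    re.compile(r"\bstop taking your (medication|prescription)\b", re.IGNORECASE),
    re.compile(r"\b(double|triple) (the|your) dose\b", re.IGNORECASE),
    re.compile(r"\bno need to (see|consult) (a|your) (doctor|pharmacist)\b", re.IGNORECASE),
]


def is_safe_output_rules(response: str) -> bool:
    """Return False if the response matches any blocked pattern."""
    return not any(pattern.search(response) for pattern in BLOCKED_PATTERNS)


print(is_safe_output_rules("Ibuprofen can increase bleeding risk when taken with warfarin."))
print(is_safe_output_rules("You can double your dose if the pain persists."))
```

Rules like these are cheap and predictable but easy to evade with rephrasing, which is the reason the production plan replaces them with an LLM output classifier.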

Production (3 months):

Azure Container Apps (auto-scale)
Azure AI Search (managed vector store)
Redis semantic cache
LLM output classifier
Weekly automated ingestion from DailyMed
LangSmith tracing
Azure Monitor alerts

Skip for now:

Multi-language support (add when needed)
Fine-tuned model (prompting + RAG is sufficient)
Custom embedding model (OpenAI text-embedding-3-small works well)
