Scenario: Scale to 1 Million Daily Users
Design a RAG chatbot for 1 million daily users. Work through back-of-envelope math, architecture decisions, cache layers, auto-scaling, and what to build vs. buy.
The Interview Question
"Design a RAG-based chatbot that can handle 1 million daily active users."
This is a classic system design interview question for senior AI/backend engineers. The interviewer wants to see that you can:
- Do back-of-envelope estimation correctly
- Identify the right architecture components
- Make and justify technology choices
- Know what to cache, what to scale, and what to pre-compute
Let us walk through a complete answer.
Step 1: Back-of-Envelope Estimation
Never design before calculating. Start with traffic math.
Traffic assumptions:
- 1,000,000 daily active users (DAU)
- Average 5 queries per user per day
- Total queries per day: 5,000,000
Requests per second:
- Average: 5,000,000 / 86,400 seconds = approximately 58 req/s
- Peak (assume 5x average during business hours): approximately 290 req/s
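The traffic math above is easy to script, which helps when the interviewer changes an assumption mid-conversation. A minimal sketch; the function name and the `peak_factor` parameter are illustrative, not part of any library:

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def request_rates(dau: int, queries_per_user: float, peak_factor: float) -> tuple[float, float]:
    """Return (average, peak) requests per second for a daily user count."""
    queries_per_day = dau * queries_per_user
    average_rps = queries_per_day / SECONDS_PER_DAY
    return average_rps, average_rps * peak_factor


avg_rps, peak_rps = request_rates(dau=1_000_000, queries_per_user=5, peak_factor=5)
print(f"average: {avg_rps:.1f} req/s, peak: {peak_rps:.1f} req/s")
# average: 57.9 req/s, peak: 289.4 req/s
```

Rounding the average up to 58 before multiplying gives the 290 req/s figure used in the rest of this walkthrough.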
Latency targets:
- P50: under 3 seconds
- P95: under 6 seconds
- P99: under 10 seconds
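Percentile targets only matter if they are measured. One way to check a batch of recorded latencies against the three targets above, sketched with the standard library (the sample data and helper names are made up for illustration):

```python
import statistics

# Latency targets from the list above, in seconds.
TARGETS_SECONDS = {50: 3.0, 95: 6.0, 99: 10.0}


def percentile(samples: list[float], pct: int) -> float:
    """Return the pct-th percentile (1-99) of the samples."""
    # quantiles(n=100) returns 99 cut points; index pct - 1 is the pct-th percentile.
    cut_points = statistics.quantiles(samples, n=100, method="inclusive")
    return cut_points[pct - 1]


def check_latency_targets(samples: list[float]) -> dict[str, bool]:
    """Map each target name to True when the measured percentile meets it."""
    return {
        f"P{pct}": percentile(samples, pct) <= limit
        for pct, limit in TARGETS_SECONDS.items()
    }


# Hypothetical measurements: mostly cache hits, some LLM calls, two slow outliers.
samples = [0.5] * 90 + [4.0] * 8 + [12.0] * 2
print(check_latency_targets(samples))
```

In this made-up sample the two 12-second outliers are enough to miss the P99 target even though P50 and P95 pass, which is why tail latency gets its own alert.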
Storage estimates:
- Knowledge base: 10,000 documents, average 20 pages each
- Text per document: approximately 50,000 tokens
- Chunks per document (512 tokens, 50 overlap): approximately 100 chunks
- Total chunks: 1,000,000 chunks
- Vector dimension: 3,072 (text-embedding-3-large)
- Storage per chunk: 3,072 floats × 4 bytes + 500 bytes metadata = approximately 12.8 KB
- Total vector storage: 1,000,000 × 12.8 KB = approximately 12.8 GB
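The storage estimate can be reproduced the same way. This sketch assumes float32 vectors and counts raw vector plus metadata bytes only; whatever the search index adds on top for its own structures is not included:

```python
import math

BYTES_PER_FLOAT32 = 4


def chunks_per_document(doc_tokens: int, chunk_size: int, overlap: int) -> int:
    """Number of sliding-window chunks needed to cover one document."""
    step = chunk_size - overlap
    return math.ceil((doc_tokens - overlap) / step)


def vector_storage_bytes(num_chunks: int, dimensions: int, metadata_bytes: int) -> int:
    """Raw storage for the vectors plus per-chunk metadata."""
    return num_chunks * (dimensions * BYTES_PER_FLOAT32 + metadata_bytes)


print(chunks_per_document(50_000, 512, 50))  # 109, rounded down to ~100 in the estimate above
total = vector_storage_bytes(num_chunks=10_000 * 100, dimensions=3_072, metadata_bytes=500)
print(f"{total / 1e9:.1f} GB")  # 12.8 GB
```

Either chunk count lands in the low tens of gigabytes, which is the conclusion that matters for picking a search tier.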
LLM cost at scale:
- Average prompt: 2,000 tokens, average response: 400 tokens
- GPT-4o cost per query, at $0.005 per 1K input tokens and $0.015 per 1K output tokens: (2,000 × $0.005 + 400 × $0.015) / 1,000 = $0.016
- Cost without caching: 5,000,000 × $0.016 = $80,000/day
- With 70% cache hit rate: $80,000 × 0.3 = $24,000/day
The cache is not optional at this scale. Over a 30-day month, it is the difference between a $2.4M LLM bill ($80,000 × 30) and a $720k bill ($24,000 × 30).
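The per-query and per-day cost figures in this step follow from two small formulas. A sketch using the same prices and token counts as the estimate above; the function names are illustrative:

```python
def cost_per_query(
    prompt_tokens: int,
    completion_tokens: int,
    input_price_per_1k: float,
    output_price_per_1k: float,
) -> float:
    """Cost of one LLM call in dollars, given prices per 1,000 tokens."""
    return (prompt_tokens * input_price_per_1k + completion_tokens * output_price_per_1k) / 1_000


def daily_llm_cost(queries_per_day: int, per_query_cost: float, cache_hit_rate: float) -> float:
    """Only cache misses reach the LLM, so only they are billed."""
    return queries_per_day * (1 - cache_hit_rate) * per_query_cost


per_query = cost_per_query(2_000, 400, input_price_per_1k=0.005, output_price_per_1k=0.015)
no_cache = daily_llm_cost(5_000_000, per_query, cache_hit_rate=0.0)
with_cache = daily_llm_cost(5_000_000, per_query, cache_hit_rate=0.7)
print(f"${per_query:.3f} per query, ${no_cache:,.0f}/day uncached, ${with_cache:,.0f}/day cached")
```

Varying `cache_hit_rate` shows how sensitive the bill is to the hit rate: every 10 points of hit rate is worth about $8,000 per day at this volume.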
Step 2: High-Level Architecture
Users (browsers, mobile apps)
        │
        ▼
[CDN - Azure Front Door]
  Static assets, API caching for identical requests
        │
        ▼
[API Gateway - Azure APIM]
  Rate limiting, auth, routing, request validation
        │
        ├─────────────────────────────────┐
        ▼                                 ▼
[RAG API - Container Apps]          [Admin API - Container Apps]
  Auto-scales 2 → 50 replicas         Document ingestion, management
        │
        ├── [Semantic Cache - Azure Cache for Redis]
        │     Cache hit → return immediately
        │
        ├── [Embedding Service - Azure OpenAI]
        │     text-embedding-3-large
        │
        ├── [Vector Search - Azure AI Search]
        │     1M chunks, hybrid search
        │
        ├── [LLM - Azure OpenAI]
        │     gpt-4o (complex), gpt-4o-mini (simple)
        │
        └── [Observability - Azure Monitor + App Insights]
Step 3: Component Deep-Dive
API Layer
The API layer must handle 290 req/s at peak and enforce a per-user rate limit:
from fastapi import FastAPI, Request
from fastapi.middleware.gzip import GZipMiddleware
from fastapi.responses import StreamingResponse
from pydantic import BaseModel
from slowapi import Limiter, _rate_limit_exceeded_handler
from slowapi.errors import RateLimitExceeded
from slowapi.util import get_remote_address

app = FastAPI(title="RAG Chatbot API")
app.add_middleware(GZipMiddleware, minimum_size=1000)

# Rate limiting: 30 requests per minute per client.
# get_remote_address keys on the caller's IP address. Behind Front Door and
# APIM that is usually the proxy's address, so for true per-user limits pass a
# key function that reads the authenticated user ID instead.
limiter = Limiter(key_func=get_remote_address)
app.state.limiter = limiter
app.add_exception_handler(RateLimitExceeded, _rate_limit_exceeded_handler)


class QueryRequest(BaseModel):
    query: str
    user_id: str
    session_id: str


@app.post("/query")
@limiter.limit("30/minute")
async def query_endpoint(request: Request, body: QueryRequest):
    """Main query endpoint. Handles 290 req/s at peak with 50 container replicas."""
    # slowapi needs the `request: Request` argument on every rate-limited route.
    # process_rag_query (Step 4) is an async generator, so its output is streamed.
    return StreamingResponse(
        process_rag_query(body.query, body.user_id, body.session_id),
        media_type="text/plain",
    )
Semantic Cache Layer (Critical Path)
The cache must handle 290 req/s. A three-node Redis Cluster handles this comfortably: Redis sustains hundreds of thousands of simple operations per second, and a single nearest-neighbor lookup takes only a few milliseconds.
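The hit-or-miss decision in a semantic cache comes down to comparing a vector distance with a threshold. The sketch below shows that rule in plain Python, assuming the index uses cosine distance (1 minus cosine similarity) and the 0.92 similarity threshold used in the class that follows; the helper names are illustrative:

```python
import math


def cosine_distance(a: list[float], b: list[float]) -> float:
    """Cosine distance: 0.0 for identical directions, 1.0 for orthogonal vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def is_cache_hit(distance: float, similarity_threshold: float = 0.92) -> bool:
    """A hit requires similarity above the threshold, i.e. distance below 1 - threshold."""
    return distance < 1.0 - similarity_threshold


cached_query = [1.0, 0.0]
close_query = [0.95, 0.312]  # cosine similarity of about 0.95
far_query = [0.9, 0.436]     # cosine similarity of about 0.90

print(is_cache_hit(cosine_distance(cached_query, close_query)))  # True
print(is_cache_hit(cosine_distance(cached_query, far_query)))    # False
```

The threshold is a product decision as much as a technical one: a lower value raises the hit rate but increases the chance of serving an answer to a question the user did not quite ask.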
import hashlib

import numpy as np
import redis.asyncio as aioredis
from openai import AsyncAzureOpenAI

# The API key is read from the AZURE_OPENAI_API_KEY environment variable.
openai_client = AsyncAzureOpenAI(
    azure_endpoint="https://your-resource.openai.azure.com",
    api_version="2024-02-01",
)

redis_pool = aioredis.ConnectionPool.from_url(
    "rediss://your-redis.cache.windows.net:6380",
    max_connections=100,
    decode_responses=False,  # embeddings are stored as raw bytes
)


class ProductionSemanticCache:
    """
    Production-grade semantic cache using Redis Search for vector similarity.
    Handles thousands of concurrent cache lookups.
    """

    def __init__(self):
        self.redis = aioredis.Redis(connection_pool=redis_pool)
        self.similarity_threshold = 0.92
        self.ttl_seconds = 3600

    async def get_embedding(self, text: str) -> list[float]:
        response = await openai_client.embeddings.create(
            model="text-embedding-3-large",  # Azure OpenAI deployment name
            input=text,
        )
        return response.data[0].embedding

    async def lookup(self, query: str) -> str | None:
        """
        Uses Redis Search with vector similarity index for fast cache lookup.
        The Redis search itself takes under 10ms for caches with up to
        1 million entries; the embedding call adds its own latency on top.
        """
        query_embedding = await self.get_embedding(query)
        embedding_bytes = np.array(query_embedding, dtype=np.float32).tobytes()

        # Redis Search vector query (requires the RediSearch module and an
        # existing index named rag_cache_idx).
        results = await self.redis.execute_command(
            "FT.SEARCH",
            "rag_cache_idx",
            "*=>[KNN 1 @embedding $vec AS score]",
            "PARAMS", "2", "vec", embedding_bytes,
            "SORTBY", "score", "ASC",
            "RETURN", "2", "response", "score",
            "LIMIT", "0", "1",
            "DIALECT", "2",
        )
        if results[0] == 0:
            return None

        # results format: [count, key, [field, value, ...]]
        fields = dict(zip(results[2][::2], results[2][1::2]))
        score = float(fields.get(b"score", 1.0))

        # The score is a distance, so lower means more similar. With a cosine
        # index, distance = 1 - similarity.
        if score < (1 - self.similarity_threshold):
            return fields.get(b"response", b"").decode("utf-8")
        return None

    async def store(self, query: str, response: str, query_embedding: list[float]):
        # The built-in hash() is salted per process, so replicas would disagree
        # on keys. A SHA-256 digest is stable across processes and restarts.
        digest = hashlib.sha256(query.encode("utf-8")).hexdigest()
        doc_id = f"cache:{digest}"
        embedding_bytes = np.array(query_embedding, dtype=np.float32).tobytes()

        # Queue both commands and send them in a single round trip.
        pipe = self.redis.pipeline()
        pipe.hset(doc_id, mapping={
            "query": query,
            "response": response,
            "embedding": embedding_bytes,
        })
        pipe.expire(doc_id, self.ttl_seconds)
        await pipe.execute()
Auto-Scaling Configuration
The RAG API containers must scale from 2 to 50 replicas based on queue depth and HTTP concurrency:
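Before reading the configuration, it is worth sanity-checking the 50-replica ceiling with Little's law (requests in flight = arrival rate × mean latency). The latencies here are assumptions: roughly 0.4 s for a cache hit (dominated by the embedding call) and roughly 4.5 s for a miss (embedding, retrieval, and a mid-range LLM call). The function name is illustrative:

```python
import math


def required_replicas(
    peak_rps: float,
    cache_hit_rate: float,
    hit_latency_s: float,
    miss_latency_s: float,
    concurrency_per_replica: int,
) -> int:
    """Estimate replicas needed at peak using Little's law (L = lambda * W)."""
    mean_latency = cache_hit_rate * hit_latency_s + (1 - cache_hit_rate) * miss_latency_s
    in_flight = peak_rps * mean_latency
    return math.ceil(in_flight / concurrency_per_replica)


# Assumed latencies: 0.4 s on a cache hit, 4.5 s on a miss.
print(required_replicas(290, 0.7, 0.4, 4.5, concurrency_per_replica=10))  # 48
print(required_replicas(290, 0.0, 0.4, 4.5, concurrency_per_replica=10))  # 131
```

Under these assumptions the peak fits just under the 50-replica ceiling with a 70% hit rate, and would need well over twice that without a cache. The configuration that follows encodes the same limits.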
# containerapp.yaml
name: rag-api
properties:
  configuration:
    ingress:
      targetPort: 8000
      external: true
  template:
    containers:
      - name: rag-api
        image: yourregistry.azurecr.io/rag-api:latest
        resources:
          cpu: 2.0
          memory: "4Gi"
        env:
          - name: OPENAI_ENDPOINT
            secretRef: openai-endpoint
          - name: REDIS_URL
            secretRef: redis-url
    scale:
      minReplicas: 2    # no cold starts
      maxReplicas: 50   # 50 × 10 concurrent = 500 concurrent requests
      rules:
        - name: http-scale-rule
          http:
            metadata:
              concurrentRequests: "10"
        - name: queue-scale-rule
          azureQueue:
            queueName: query-queue
            queueLength: "50"
            auth:
              - secretRef: storage-connection-string
                triggerParameter: connection
Vector Search at Scale
Azure AI Search with HNSW index handles 1 million chunks with sub-100ms query latency at 290 req/s:
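The search call in this section is a hybrid query: one keyword ranking and one vector ranking, merged into a single result list. Azure AI Search documents Reciprocal Rank Fusion (RRF) as the method it uses for that merge. A minimal sketch of RRF itself, assuming the commonly used constant k = 60 and made-up document IDs; it illustrates the idea and is not the service's implementation:

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse several ranked lists; each list adds 1 / (k + rank) to a document's score."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


keyword_ranking = ["doc-a", "doc-b", "doc-c"]  # hypothetical BM25 order
vector_ranking = ["doc-b", "doc-d", "doc-a"]   # hypothetical HNSW order

fused = reciprocal_rank_fusion([keyword_ranking, vector_ranking])
print([doc_id for doc_id, _ in fused])
# ['doc-b', 'doc-a', 'doc-d', 'doc-c']
```

Documents that appear near the top of both lists win, which is why hybrid search tends to be more robust than either ranking alone.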
from azure.core.credentials import AzureKeyCredential
from azure.search.documents.aio import SearchClient
from azure.search.documents.models import QueryType, VectorizedQuery

search_client = SearchClient(
    endpoint="https://your-search.search.windows.net",
    index_name="knowledge-base",
    credential=AzureKeyCredential("key"),  # load from a secret store in production
)


async def vector_search_at_scale(
    query: str,
    query_embedding: list[float],
    user_department: str,
    top_k: int = 5,
) -> list[dict]:
    """
    Hybrid search with metadata filter.
    At 290 req/s, plan for a Standard S3 tier with multiple replicas and
    confirm the replica count with a load test.
    """
    # Reuse the embedding computed upstream instead of asking the service to
    # vectorize the query text a second time.
    vector_query = VectorizedQuery(
        vector=query_embedding,
        k_nearest_neighbors=top_k * 3,
        fields="content_vector",
        exhaustive=False,  # use HNSW approximation for speed
    )
    # Double any single quotes so the value cannot break out of the OData filter.
    department = user_department.replace("'", "''")
    results = await search_client.search(
        search_text=query,
        vector_queries=[vector_query],
        filter=f"department eq '{department}' or department eq 'all'",
        query_type=QueryType.SEMANTIC,
        semantic_configuration_name="default",
        top=top_k,
    )
    return [
        {"content": r["content"], "source": r["source"], "score": r["@search.score"]}
        async for r in results
    ]
Model Router
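At 1M DAU, routing 60% of queries to GPT-4o mini is a large saving. Each routed query costs roughly $0.0005 instead of $0.016, which adds up to about $1.4M per month before caching and about $0.4M per month at a 70% cache hit rate: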
import re

SIMPLE_QUERY_PATTERNS = [
    r"^what is\b",
    r"^how much\b",
    r"^when does\b",
    r"^is it\b",
    r"^can i\b",
    r"^where is\b",
]


def is_simple_query(query: str) -> bool:
    query_lower = query.lower().strip()
    return any(re.match(pattern, query_lower) for pattern in SIMPLE_QUERY_PATTERNS)


async def routed_completion(query: str, context: str):
    """Return a (response_stream, model_name) tuple for the chosen model."""
    simple = is_simple_query(query)  # classify once and reuse the result
    model = "gpt-4o-mini" if simple else "gpt-4o"  # Azure OpenAI deployment names
    max_tokens = 200 if simple else 600
    response = await openai_client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": f"Answer using only this context:\n{context}"},
            {"role": "user", "content": query},
        ],
        max_tokens=max_tokens,
        temperature=0,
        stream=True,
    )
    return response, model
Step 4: Full Query Flow
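The flow in this step calls a helper named build_context_with_budget that the walkthrough does not show. One plausible shape for it, as a hedged sketch: it assumes chunks arrive sorted by relevance and estimates tokens at about four characters each, which is a rough rule of thumb for English text rather than a real tokenizer. Both `estimate_tokens` and the separator are illustrative choices:

```python
def estimate_tokens(text: str) -> int:
    """Rough token estimate: about 4 characters per token (assumption, not a tokenizer)."""
    return max(1, len(text) // 4)


def build_context_with_budget(
    chunks: list[str],
    max_tokens: int,
    separator: str = "\n\n---\n\n",
) -> str:
    """Concatenate chunks in order until the next one would exceed the token budget."""
    selected: list[str] = []
    used = 0
    for chunk in chunks:  # assumed to be sorted from most to least relevant
        cost = estimate_tokens(chunk)
        if used + cost > max_tokens:
            break
        selected.append(chunk)
        used += cost
    return separator.join(selected)


chunks = ["a" * 400, "b" * 400, "c" * 400]  # about 100 tokens each
context = build_context_with_budget(chunks, max_tokens=250)
print(context.count("---") + 1)  # 2 chunks fit in the budget
```

A production version would count tokens with the model's actual tokenizer and include the separators in the budget; the stop-at-budget logic stays the same.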
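import asyncio
from collections.abc import AsyncIterator

# Keep references to fire-and-forget tasks so they are not garbage collected
# before they finish.
background_tasks: set[asyncio.Task] = set()


async def process_rag_query(
    query: str,
    user_id: str,
    session_id: str,
) -> AsyncIterator[str]:
    """Async generator that yields the answer text piece by piece.

    semantic_cache is a ProductionSemanticCache instance. generate_request_id,
    get_user_department, build_context_with_budget, log_cache_hit and log_query
    are application helpers that are not shown in this function.
    """
    request_id = generate_request_id()

    # 1. Check semantic cache (~8ms in Redis, plus the embedding call inside lookup)
    cache_result = await semantic_cache.lookup(query)
    if cache_result:
        log_cache_hit(request_id, user_id)
        # An async generator cannot return a value, so yield the cached answer.
        yield cache_result
        return

    # 2. Embed query (~400ms)
    query_embedding = await semantic_cache.get_embedding(query)

    # 3. Retrieve from vector store (~80ms with warm index)
    user_dept = await get_user_department(user_id)
    chunks = await vector_search_at_scale(query, query_embedding, user_dept, top_k=5)

    # 4. Build context with token budget
    context = build_context_with_budget(
        [c["content"] for c in chunks],
        max_tokens=2000,
    )

    # 5. Route and call LLM (2,000-6,000ms)
    stream, model_used = await routed_completion(query, context)

    # 6. Stream response back to client
    full_response = ""
    async for chunk in stream:
        if not chunk.choices:  # some stream chunks carry no choices
            continue
        delta = chunk.choices[0].delta.content
        if delta:
            full_response += delta
            yield delta

    # 7. Store in cache async (do not block response)
    if full_response:
        task = asyncio.create_task(
            semantic_cache.store(query, full_response, query_embedding)
        )
        background_tasks.add(task)
        task.add_done_callback(background_tasks.discard)

    # 8. Log for observability
    log_query(request_id, user_id, model_used, len(chunks), len(full_response))
Step 5: What to Cut in MVP vs. Build for Prod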
In a real interview, the interviewer appreciates this pragmatic breakdown:
MVP (launch with these):
- Single vector store (Azure AI Search Basic tier)
- Semantic cache (Redis single instance)
- GPT-4o for all queries (no routing complexity)
- Streaming enabled
- Basic auth (API key per user)
- Basic logging to Application Insights
Production additions (after launch):
- Model routing (GPT-4o mini for simple queries)
- Redis Cluster for cache HA (3 nodes)
- Azure AI Search Standard S3 with replicas
- Per-user rate limiting with burst allowance
- Percentile latency dashboards and alerts
- A/B testing framework for prompt changes
- Document versioning and staleness monitoring
What you would NOT build yourself:
- Your own vector index (use Azure AI Search or Qdrant)
- Your own LLM (use Azure OpenAI)
- Your own CDN (use Azure Front Door)
Capacity Planning Summary
At 290 req/s peak, with 5M queries/day, a 70% cache hit rate (1.5M LLM calls/day), and 60% of cache misses routed to gpt-4o-mini:

| Component | Config | Estimated Cost/month |
|---|---|---|
| Container Apps (RAG API) | 2-50 replicas, 2 CPU / 4GB | $4,000 |
| Azure OpenAI (gpt-4o, 40% of cache misses) | 0.6M queries/day × 30 days at $0.016 | $288,000 |
| Azure OpenAI (gpt-4o-mini, 60% of cache misses) | 0.9M queries/day × 30 days at $0.0005 | $13,500 |
| Azure AI Search (S3, 3 replicas) | 1M chunks, 290 req/s | $3,500 |
| Azure Cache for Redis (P2 cluster) | 3 nodes, 13GB | $1,800 |
| Azure Front Door + networking | egress + rules | $1,200 |

Total: approximately $312,000/month for 1M DAU. LLM calls account for about 97% of that, so the cache hit rate and the routing split are the two numbers that move the bill.
Without the cache or routing, the OpenAI cost alone would be approximately $80,000/day, or about $2.4M/month. The cache is what makes this economically viable.