Scenario: Scale to 1 Million Daily Users

Design a RAG chatbot for 1 million daily users. Work through back-of-envelope math, architecture decisions, cache layers, auto-scaling, and what to build vs. buy.

Asma Hafeez Khan · May 15, 2026 · 8 min read
Tags: System Design, Scale, RAG, Architecture, Azure, Redis, CDN

The Interview Question

"Design a RAG-based chatbot that can handle 1 million daily active users."

This is a classic system design interview question for senior AI/backend engineers. The interviewer wants to see that you can:

  1. Do back-of-envelope estimation correctly
  2. Identify the right architecture components
  3. Make and justify technology choices
  4. Know what to cache, what to scale, and what to pre-compute

Let us walk through a complete answer.

Step 1: Back-of-Envelope Estimation

Never design before calculating. Start with traffic math.

Traffic assumptions:

  • 1,000,000 daily active users (DAU)
  • Average 5 queries per user per day
  • Total queries per day: 5,000,000

Requests per second:

  • Average: 5,000,000 / 86,400 seconds = approximately 58 req/s
  • Peak (assume 5x average during business hours): approximately 290 req/s
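
The traffic arithmetic above can be reproduced in a few lines. This is a minimal sketch; the function name and the 5x peak multiplier default are illustrative, taken from the assumptions listed above.

```python
SECONDS_PER_DAY = 86_400


def requests_per_second(
    daily_users: int,
    queries_per_user: float,
    peak_multiplier: float = 5.0,
) -> tuple[float, float]:
    """Return (average req/s, peak req/s) for a given daily load."""
    total_queries = daily_users * queries_per_user
    average = total_queries / SECONDS_PER_DAY
    return average, average * peak_multiplier


avg_rps, peak_rps = requests_per_second(1_000_000, 5)
print(f"average: {avg_rps:.1f} req/s, peak: {peak_rps:.1f} req/s")
# average: 57.9 req/s, peak: 289.4 req/s
```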

Latency targets:

  • P50: under 3 seconds
  • P95: under 6 seconds
  • P99: under 10 seconds

Storage estimates:

  • Knowledge base: 10,000 documents, average 20 pages each
  • Text per document: approximately 50,000 tokens
  • Chunks per document (512 tokens, 50 overlap): approximately 100 chunks
  • Total chunks: 1,000,000 chunks
  • Vector dimension: 3,072 (text-embedding-3-large)
  • Storage per chunk: 3,072 floats × 4 bytes + 500 bytes metadata = approximately 12.8 KB
  • Total vector storage: 1,000,000 × 12.8 KB = approximately 12.8 GB
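
The storage figures can be checked the same way. The sketch below assumes a sliding-window chunker that advances by chunk size minus overlap; with exact arithmetic it gives 109 chunks per document, which the estimate above rounds to 100. The function names are illustrative, and index overhead (for example the HNSW graph) is not included.

```python
import math


def chunks_per_document(tokens: int, chunk_size: int = 512, overlap: int = 50) -> int:
    """Sliding-window chunk count: each new chunk advances by chunk_size - overlap."""
    if tokens <= chunk_size:
        return 1
    stride = chunk_size - overlap
    return 1 + math.ceil((tokens - chunk_size) / stride)


def bytes_per_chunk(dim: int = 3072, metadata_bytes: int = 500) -> int:
    """Raw vector size (float32 = 4 bytes per dimension) plus metadata."""
    return dim * 4 + metadata_bytes


per_doc = chunks_per_document(50_000)      # exact count for one document
total_chunks = 10_000 * 100                # rounded figure used in the estimate
total_gb = total_chunks * bytes_per_chunk() / 1e9

print(per_doc, bytes_per_chunk(), f"{total_gb:.1f} GB")
```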

LLM cost at scale:

  • Average prompt: 2,000 tokens, average response: 400 tokens
  • GPT-4o cost per query, at $0.005 per 1K input tokens and $0.015 per 1K output tokens: (2,000 × $0.005 + 400 × $0.015) / 1,000 = $0.016
  • Cost without caching: 5,000,000 × $0.016 = $80,000/day
  • With 70% cache hit rate: $80,000 × 0.3 = $24,000/day

The cache is not optional at this scale. Over a 30-day month, it is the difference between a $2.4M LLM bill ($80,000/day) and a $720k bill ($24,000/day).
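
A small cost model makes it easy to rerun these numbers with different prices or hit rates. This is a sketch under the assumptions listed above (prices per 1K tokens, 5M queries per day, a 30-day month); the function names are illustrative.

```python
def cost_per_query(
    prompt_tokens: int,
    completion_tokens: int,
    input_price_per_1k: float,
    output_price_per_1k: float,
) -> float:
    """LLM cost of one query given per-1K-token prices."""
    return (
        prompt_tokens * input_price_per_1k
        + completion_tokens * output_price_per_1k
    ) / 1000


def daily_llm_cost(
    queries_per_day: int,
    per_query_cost: float,
    cache_hit_rate: float = 0.0,
) -> float:
    """Only cache misses reach the LLM."""
    return queries_per_day * (1 - cache_hit_rate) * per_query_cost


per_query = cost_per_query(2000, 400, 0.005, 0.015)
no_cache = daily_llm_cost(5_000_000, per_query)
with_cache = daily_llm_cost(5_000_000, per_query, cache_hit_rate=0.7)

print(f"per query: ${per_query:.3f}")
print(f"no cache:   ${no_cache:,.0f}/day, ${no_cache * 30:,.0f}/month")
print(f"70% cached: ${with_cache:,.0f}/day, ${with_cache * 30:,.0f}/month")
```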

Step 2: High-Level Architecture

Users (browsers, mobile apps)
        │
        ▼
   [CDN - Azure Front Door]
   Static assets, API caching for identical requests
        │
        ▼
   [API Gateway - Azure APIM]
   Rate limiting, auth, routing, request validation
        │
        ├─────────────────────────┐
        ▼                         ▼
[RAG API - Container Apps]   [Admin API - Container Apps]
Auto-scales 2 → 50 replicas  Document ingestion, management
        │
        ├── [Semantic Cache - Azure Cache for Redis]
        │   Cache hit → return immediately
        │
        ├── [Embedding Service - Azure OpenAI]
        │   text-embedding-3-large
        │
        ├── [Vector Search - Azure AI Search]
        │   1M chunks, hybrid search
        │
        ├── [LLM - Azure OpenAI]
        │   gpt-4o (complex), gpt-4o-mini (simple)
        │
        └── [Observability - Azure Monitor + App Insights]

Step 3: Component Deep-Dive

API Layer

The API layer must handle 290 req/s peak with rate limiting per user:

Python
from fastapi import FastAPI, Request
from fastapi.middleware.gzip import GZipMiddleware
from fastapi.responses import StreamingResponse
from pydantic import BaseModel
from slowapi import Limiter, _rate_limit_exceeded_handler
from slowapi.errors import RateLimitExceeded
from slowapi.util import get_remote_address

app = FastAPI(title="RAG Chatbot API")
app.add_middleware(GZipMiddleware, minimum_size=1000)

# Rate limiting: 30 requests per minute, keyed by client IP.
# For true per-user limits, pass a key_func that returns the authenticated user ID.
# The default counter storage is in-memory, so each replica counts separately.
limiter = Limiter(key_func=get_remote_address)
app.state.limiter = limiter
app.add_exception_handler(RateLimitExceeded, _rate_limit_exceeded_handler)


class QueryRequest(BaseModel):
    query: str
    user_id: str
    session_id: str


@app.post("/query")
@limiter.limit("30/minute")
async def query_endpoint(request: Request, body: QueryRequest):
    """Main query endpoint. Sized for 290 req/s at peak across up to 50 container replicas."""
    # slowapi requires the `request: Request` parameter to identify the caller.
    # process_rag_query (Step 4) is an async generator of answer text.
    return StreamingResponse(
        process_rag_query(body.query, body.user_id, body.session_id),
        media_type="text/plain",
    )

Semantic Cache Layer (Critical Path)

The cache must handle 290 req/s. Redis Cluster with 3 nodes handles this comfortably (Redis handles hundreds of thousands of operations per second).

Python
import hashlib

import numpy as np
import redis.asyncio as aioredis
from openai import AsyncAzureOpenAI

# The API key is read from the AZURE_OPENAI_API_KEY environment variable.
openai_client = AsyncAzureOpenAI(
    azure_endpoint="https://your-resource.openai.azure.com",
    api_version="2024-02-01",
)

redis_pool = aioredis.ConnectionPool.from_url(
    "rediss://your-redis.cache.windows.net:6380",
    max_connections=100,
    decode_responses=False,  # embeddings are stored as raw bytes
)

class ProductionSemanticCache:
    """
    Production-grade semantic cache using Redis Search for vector similarity.
    Handles thousands of concurrent cache lookups.
    Assumes an index named rag_cache_idx over hashes with the "cache:" key prefix,
    with a FLOAT32 vector field "embedding" that uses COSINE distance.
    """
    def __init__(self):
        self.redis = aioredis.Redis(connection_pool=redis_pool)
        self.similarity_threshold = 0.92
        self.ttl_seconds = 3600

    async def get_embedding(self, text: str) -> list[float]:
        # With Azure OpenAI, `model` is the name of your deployment.
        response = await openai_client.embeddings.create(
            model="text-embedding-3-large",
            input=text,
        )
        return response.data[0].embedding

    async def lookup(self, query: str) -> str | None:
        """
        Embeds the query, then runs a KNN search against the cache index.
        Redis lookup latency: under 10ms for caches with up to 1 million entries.
        The embedding call comes on top of that and dominates total latency.
        """
        query_embedding = await self.get_embedding(query)
        embedding_bytes = np.array(query_embedding, dtype=np.float32).tobytes()

        # Redis Search vector query (requires the RediSearch module;
        # on Azure Cache for Redis that means an Enterprise tier)
        results = await self.redis.execute_command(
            "FT.SEARCH",
            "rag_cache_idx",
            "*=>[KNN 1 @embedding $vec AS score]",
            "PARAMS", "2", "vec", embedding_bytes,
            "SORTBY", "score", "ASC",
            "RETURN", "2", "response", "score",
            "LIMIT", "0", "1",
            "DIALECT", "2",
        )

        if results[0] == 0:
            return None

        # results format: [count, key, [field, value, ...]]
        fields = dict(zip(results[2][::2], results[2][1::2]))
        score = float(fields.get(b"score", 1.0))

        # score is cosine distance (1 - similarity), so lower means more similar
        if score < (1 - self.similarity_threshold):
            return fields.get(b"response", b"").decode("utf-8")
        return None

    async def store(self, query: str, response: str, query_embedding: list[float]):
        # Use a stable digest for the key: the built-in hash() is salted per
        # process, so replicas would disagree on the key for the same query.
        digest = hashlib.sha256(query.encode("utf-8")).hexdigest()
        doc_id = f"cache:{digest}"
        embedding_bytes = np.array(query_embedding, dtype=np.float32).tobytes()

        # Commands are buffered on the pipeline and sent together by execute()
        pipe = self.redis.pipeline()
        pipe.hset(doc_id, mapping={
            "query": query,
            "response": response,
            "embedding": embedding_bytes,
        })
        pipe.expire(doc_id, self.ttl_seconds)
        await pipe.execute()
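
The lookup above depends on an index named rag_cache_idx that the article does not show. One plausible definition is sketched below as a list of FT.CREATE arguments. The field types, the HNSW algorithm, and the COSINE metric are assumptions chosen to match the 3,072-dimension float32 embeddings and the `1 - similarity` threshold check; the helper names are hypothetical.

```python
def build_cache_index_command(
    index_name: str = "rag_cache_idx",
    key_prefix: str = "cache:",
    dim: int = 3072,
) -> list[str]:
    """Return FT.CREATE arguments for a hash index with one HNSW vector field."""
    return [
        "FT.CREATE", index_name,
        "ON", "HASH",
        "PREFIX", "1", key_prefix,
        "SCHEMA",
        "query", "TEXT",
        "response", "TEXT",
        # "6" is the number of attribute arguments that follow it
        "embedding", "VECTOR", "HNSW", "6",
        "TYPE", "FLOAT32",
        "DIM", str(dim),
        "DISTANCE_METRIC", "COSINE",
    ]


def max_distance(similarity_threshold: float) -> float:
    """Cosine distance cutoff equivalent to a cosine similarity threshold."""
    return 1.0 - similarity_threshold


command = build_cache_index_command()
print(" ".join(command))
print(f"accept cache hits with distance below {max_distance(0.92):.2f}")
```

With a connected client, the index would be created once at deploy time with `await redis.execute_command(*build_cache_index_command())`.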

Auto-Scaling Configuration

The RAG API containers must scale from 2 to 50 replicas based on queue depth and HTTP concurrency:

YAML
# containerapp.yaml
name: rag-api
properties:
  configuration:
    ingress:
      targetPort: 8000
      external: true
  template:
    containers:
      - name: rag-api
        image: yourregistry.azurecr.io/rag-api:latest
        resources:
          cpu: 2.0
          memory: "4Gi"
        env:
          - name: OPENAI_ENDPOINT
            secretRef: openai-endpoint
          - name: REDIS_URL
            secretRef: redis-url
    scale:
      minReplicas: 2        # no cold starts
      maxReplicas: 50       # 50 × 10 concurrent = 500 concurrent requests
      rules:
        - name: http-scale-rule
          http:
            metadata:
              concurrentRequests: "10"
        - name: queue-scale-rule
          azureQueue:
            queueName: query-queue
            queueLength: "50"
            auth:
              - secretRef: storage-connection-string
                triggerParameter: connection
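
Whether 50 replicas at 10 concurrent requests each is enough can be sanity-checked with Little's Law (in-flight requests = arrival rate × time in system). The sketch below assumes 4 seconds per cache miss and 0.5 seconds per cache hit; both latencies are illustrative assumptions, not measurements.

```python
import math


def concurrent_requests(arrival_rate_rps: float, avg_latency_s: float) -> float:
    """Little's Law: average requests in flight = arrival rate x latency."""
    return arrival_rate_rps * avg_latency_s


def replicas_needed(
    arrival_rate_rps: float,
    avg_latency_s: float,
    per_replica_concurrency: int = 10,
) -> int:
    in_flight = concurrent_requests(arrival_rate_rps, avg_latency_s)
    return math.ceil(in_flight / per_replica_concurrency)


# Peak of 290 req/s with a 70% cache hit rate: 203 hits/s and 87 misses/s.
with_cache = replicas_needed(87, 4.0) + replicas_needed(203, 0.5)
without_cache = replicas_needed(290, 4.0)

print(f"with cache: {with_cache} replicas, without cache: {without_cache} replicas")
```

Under these assumptions the cached workload fits inside the 50-replica ceiling, while the uncached workload would not.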

Vector Search at Scale

Azure AI Search with HNSW index handles 1 million chunks with sub-100ms query latency at 290 req/s:

Python
from azure.core.credentials import AzureKeyCredential
from azure.search.documents.aio import SearchClient
from azure.search.documents.models import QueryType, VectorizedQuery

search_client = SearchClient(
    endpoint="https://your-search.search.windows.net",
    index_name="knowledge-base",
    credential=AzureKeyCredential("key"),
)

async def vector_search_at_scale(
    query: str,
    query_embedding: list[float],
    user_department: str,
    top_k: int = 5,
) -> list[dict]:
    """
    Hybrid search (keyword + vector) with a metadata filter and semantic reranking.
    At 290 req/s, plan for a Standard S3 tier with multiple replicas and
    confirm the replica count with load tests.
    """
    # Reuse the embedding computed upstream instead of vectorizing the text again
    vector_query = VectorizedQuery(
        vector=query_embedding,
        k_nearest_neighbors=top_k * 3,
        fields="content_vector",
        exhaustive=False,  # use HNSW approximation for speed
    )

    # Double any single quotes so the value stays inside the OData string literal
    department = user_department.replace("'", "''")

    results = await search_client.search(
        search_text=query,
        vector_queries=[vector_query],
        filter=f"department eq '{department}' or department eq 'all'",
        query_type=QueryType.SEMANTIC,
        semantic_configuration_name="default",
        top=top_k,
    )

    return [
        {"content": r["content"], "source": r["source"], "score": r["@search.score"]}
        async for r in results
    ]
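
Hybrid search runs the keyword query and the vector query separately and then merges the two rankings. Azure AI Search documents Reciprocal Rank Fusion (RRF) for that merge step. The sketch below shows the idea in isolation; the constant k=60 and the document IDs are illustrative assumptions, and the service performs this step for you.

```python
from collections import defaultdict


def reciprocal_rank_fusion(
    rankings: list[list[str]],
    k: int = 60,
) -> list[tuple[str, float]]:
    """Merge ranked lists: each appearance contributes 1 / (k + rank)."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


keyword_hits = ["doc-a", "doc-b", "doc-c"]
vector_hits = ["doc-b", "doc-d", "doc-a"]
fused = reciprocal_rank_fusion([keyword_hits, vector_hits])

for doc_id, score in fused:
    print(f"{doc_id}: {score:.4f}")
```

Documents that appear in both lists (doc-a and doc-b here) rise above documents that appear in only one.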

Model Router

At 1M DAU, routing 60% of queries to GPT-4o mini (about $0.0005 per query instead of $0.016) saves roughly $46,500 per day before caching, or about $1.4M per month:

Python
import re

SIMPLE_QUERY_PATTERNS = [
    r"^what is\b",
    r"^how much\b",
    r"^when does\b",
    r"^is it\b",
    r"^can i\b",
    r"^where is\b",
]

def is_simple_query(query: str) -> bool:
    """Heuristic: short factual openers are treated as simple queries."""
    query_lower = query.lower().strip()
    return any(re.match(pattern, query_lower) for pattern in SIMPLE_QUERY_PATTERNS)

async def routed_completion(query: str, context: str):
    """Return (stream, model). Uses the openai_client created for the cache layer."""
    simple = is_simple_query(query)
    model = "gpt-4o-mini" if simple else "gpt-4o"  # Azure deployment names
    max_tokens = 200 if simple else 600

    stream = await openai_client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": f"Answer using only this context:\n{context}"},
            {"role": "user", "content": query},
        ],
        max_tokens=max_tokens,
        temperature=0,
        stream=True,
    )
    return stream, model
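
The effect of routing on cost can be estimated with a blended per-query price. This sketch uses the per-query costs quoted in this article ($0.016 for gpt-4o, about $0.0005 for gpt-4o-mini) and ignores caching; the function name is illustrative.

```python
def blended_cost_per_query(
    mini_share: float,
    mini_cost: float = 0.0005,
    full_cost: float = 0.016,
) -> float:
    """Average cost per LLM call when mini_share of calls go to the small model."""
    return mini_share * mini_cost + (1 - mini_share) * full_cost


queries_per_day = 5_000_000
blended = blended_cost_per_query(0.6)
daily_saving = queries_per_day * (0.016 - blended)

print(f"blended cost per query: ${blended:.4f}")
print(f"saving before caching: ${daily_saving:,.0f}/day")
```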

Step 4: Full Query Flow

Python
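import asyncio
from collections.abc import AsyncIterator

semantic_cache = ProductionSemanticCache()
_background_tasks: set[asyncio.Task] = set()

# generate_request_id, log_cache_hit, get_user_department,
# build_context_with_budget, and log_query are application helpers defined elsewhere.

async def process_rag_query(
    query: str,
    user_id: str,
    session_id: str,
) -> AsyncIterator[str]:
    """Async generator that yields answer text as it becomes available."""
    request_id = generate_request_id()

    # 1. Check semantic cache (embedding call plus ~8ms Redis lookup)
    cache_result = await semantic_cache.lookup(query)
    if cache_result:
        log_cache_hit(request_id, user_id)
        yield cache_result
        return

    # 2. Embed query (~400ms)
    query_embedding = await semantic_cache.get_embedding(query)

    # 3. Retrieve from vector store (~80ms with warm index)
    user_dept = await get_user_department(user_id)
    chunks = await vector_search_at_scale(query, query_embedding, user_dept, top_k=5)

    # 4. Build context with token budget
    context = build_context_with_budget(
        [c["content"] for c in chunks],
        max_tokens=2000,
    )

    # 5. Route and call LLM (2,000-6,000ms)
    stream, model_used = await routed_completion(query, context)

    # 6. Stream response back to the client while collecting the full text
    full_response = ""
    async for chunk in stream:
        if not chunk.choices:  # some chunks carry no choices (e.g. filter metadata)
            continue
        delta = chunk.choices[0].delta.content
        if delta:
            full_response += delta
            yield delta

    # 7. Store in cache in the background (do not block the response).
    # Keep a reference so the task is not garbage-collected before it finishes.
    task = asyncio.create_task(
        semantic_cache.store(query, full_response, query_embedding)
    )
    _background_tasks.add(task)
    task.add_done_callback(_background_tasks.discard)

    # 8. Log for observability
    log_query(request_id, user_id, model_used, len(chunks), len(full_response))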
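
The flow above calls build_context_with_budget without defining it. One possible shape is sketched below. It assumes a rough estimate of 4 characters per token rather than a real tokenizer, keeps chunks in retrieval order, and stops at the first chunk that would exceed the budget; estimate_tokens and the separator are hypothetical choices.

```python
def estimate_tokens(text: str) -> int:
    """Rough heuristic: about 4 characters per token for English text."""
    return max(1, len(text) // 4)


def build_context_with_budget(
    chunks: list[str],
    max_tokens: int = 2000,
    separator: str = "\n\n---\n\n",
) -> str:
    """Concatenate chunks in order until the estimated token budget is used up."""
    selected: list[str] = []
    used = 0
    for chunk in chunks:
        cost = estimate_tokens(chunk)
        if used + cost > max_tokens:
            break  # keep the highest-ranked chunks, drop the rest
        selected.append(chunk)
        used += cost
    return separator.join(selected)


# Three chunks of about 1,000 estimated tokens each; only two fit in 2,000.
sample_chunks = ["a" * 4000, "b" * 4000, "c" * 4000]
context = build_context_with_budget(sample_chunks, max_tokens=2000)
print(len(context))
```

In production, a tokenizer that matches the deployed model would give tighter control over the budget than the character heuristic.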

Step 5: What to Cut in MVP vs. Build for Prod

In a real interview, the interviewer appreciates this pragmatic breakdown:

MVP (launch with these):

  • Single vector store (Azure AI Search Basic tier)
  • Semantic cache (Redis single instance)
  • GPT-4o for all queries (no routing complexity)
  • Streaming enabled
  • Basic auth (API key per user)
  • Basic logging to Application Insights

Production additions (after launch):

  • Model routing (GPT-4o mini for simple queries)
  • Redis Cluster for cache HA (3 nodes)
  • Azure AI Search Standard S3 with replicas
  • Per-user rate limiting with burst allowance
  • Percentile latency dashboards and alerts
  • A/B testing framework for prompt changes
  • Document versioning and staleness monitoring

What you would NOT build yourself:

  • Your own vector index (use Azure AI Search or Qdrant)
  • Your own LLM (use Azure OpenAI)
  • Your own CDN (use Azure Front Door)

Capacity Planning Summary

At 290 req/s peak:

| Component | Config | Estimated cost/month |
|---|---|---|
| Container Apps (RAG API) | 2-50 replicas, 2 CPU / 4 GB | $4,000 |
| Azure OpenAI (gpt-4o, 40% of cache misses) | 600K queries/day at $0.016, 30 days | $288,000 |
| Azure OpenAI (gpt-4o-mini, 60% of cache misses) | 900K queries/day at $0.0005, 30 days | $13,500 |
| Azure AI Search (S3, 3 replicas) | 1M chunks, 290 req/s | $3,500 |
| Azure Cache for Redis (P2 cluster) | 3 nodes, 13 GB | $1,800 |
| Azure Front Door + networking | egress + rules | $1,200 |

Total: approximately $312,000/month for 1M DAU with a 70% cache hit rate (1.5M of the 5M daily queries reach an LLM) and 60% of those cache misses routed to gpt-4o-mini.

Without the cache or routing, the OpenAI cost alone would be approximately $80,000/day, or $2.4M/month. The cache is what makes this economically viable.
