Learnixo

Scenario: Design a Document Q&A System

The Interview Question

"Design a document Q&A platform where enterprise users can upload PDF documents and ask natural language questions about them. Users should only be able to query documents they have access to. The system should handle 100,000 documents and 10,000 concurrent users."


Step 1: Clarify Requirements

  • Document types: PDFs primarily; also Word, Excel, PowerPoint
  • Document size: Average 50 pages, up to 500 pages
  • Users: Enterprise — teams with role-based access control
  • Citations: Answers must cite which document and page the information came from
  • Latency: under 10 seconds for an answer
  • Storage: 100,000 documents now, growing to 1 million
  • Concurrent users: 10,000 simultaneous users

Step 2: Back-of-Envelope

Storage:

  • 100,000 docs × 50 pages × ~3,000 words/page = 15 billion words (a generous upper bound; a typical text page holds closer to 300-500 words)
  • Embeddings: text-embedding-3-small = 1,536 dimensions × 4 bytes (float32) ≈ 6 KB per chunk
  • Average doc: 500 chunks × 6 KB = 3 MB of embeddings per doc
  • 100,000 docs: ~300 GB of embeddings, manageable in a vector database

Query load:

  • 10,000 concurrent users, averaging 1 query per 30 seconds = ~333 queries/second
  • This is the hardest part: 333 LLM calls per second is expensive and slow
  • A semantic cache should absorb the 40-60% of queries that are repeats (an assumed hit rate; measure it on real traffic)

Embedding cost (one-time):

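  • 100,000 docs × 500 chunks × 200 tokens/chunk = 10 billion tokens (200 is an assumed average; at the 400-token maximum chunk size used in Step 3 the total would be 20 billion tokens and the cost would double)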
  • text-embedding-3-small: $0.02/1M tokens = $200 total ingestion cost
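
The estimates above are easy to recheck in a few lines. The sketch below recomputes them from the stated assumptions; the constant names are illustrative, and the price is simply the figure quoted above, which may change over time.

```python
# Back-of-envelope check using the assumptions stated in this step.
DOCS = 100_000
CHUNKS_PER_DOC = 500
DIMENSIONS = 1_536                 # text-embedding-3-small
BYTES_PER_FLOAT = 4                # float32
TOKENS_PER_CHUNK = 200             # assumed average
PRICE_PER_MILLION_TOKENS = 0.02    # USD, as quoted above
CONCURRENT_USERS = 10_000
SECONDS_BETWEEN_QUERIES = 30

bytes_per_chunk = DIMENSIONS * BYTES_PER_FLOAT
embedding_bytes = DOCS * CHUNKS_PER_DOC * bytes_per_chunk
total_tokens = DOCS * CHUNKS_PER_DOC * TOKENS_PER_CHUNK
ingestion_cost_usd = total_tokens / 1_000_000 * PRICE_PER_MILLION_TOKENS
queries_per_second = CONCURRENT_USERS / SECONDS_BETWEEN_QUERIES

print(f"Bytes per chunk: {bytes_per_chunk:,}")
print(f"Embeddings: {embedding_bytes / 1e9:.0f} GB")
print(f"Tokens to embed: {total_tokens / 1e9:.0f} billion")
print(f"Ingestion cost: ${ingestion_cost_usd:,.0f}")
print(f"Query load: {queries_per_second:.0f} queries/second")
```

The exact storage figure comes out slightly above 300 GB because a chunk is 6,144 bytes rather than a round 6 KB; the difference does not change the conclusion.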

System Architecture

                    ┌──────────────────┐
                    │   Web App        │
                    │   (React/Next)   │
                    └────────┬─────────┘
                             │
                             ▼
                    ┌──────────────────┐
                    │   API Gateway    │  ← Auth, rate limiting
                    └──────┬───────────┘
                           │
               ┌───────────┴───────────┐
               │                       │
               ▼                       ▼
    ┌─────────────────┐     ┌─────────────────┐
    │  Query Service  │     │ Ingestion Svc   │
    │  (FastAPI)      │     │ (worker pool)   │
    └────────┬────────┘     └────────┬────────┘
             │                       │
    ┌────────┴──────┐       ┌────────┴────────┐
    │               │       │                 │
    ▼               ▼       ▼                 ▼
┌───────┐  ┌──────────────┐ ┌──────┐  ┌───────────────┐
│ Redis │  │  Vector DB   │ │ Blob │  │  Document DB  │
│ Cache │  │ (with ACL    │ │Store │  │  (Postgres)   │
│       │  │  filtering)  │ │      │  │               │
└───────┘  └──────┬───────┘ └──────┘  └───────────────┘
                  │
                  ▼
           ┌────────────┐
           │  Azure OAI │
           │  GPT-4o    │
           └────────────┘

Step 3: Document Ingestion Pipeline

Python
# ingestion/pipeline.py
# Assumes parse_pdf, recursive_chunk, embed_batch, vector_store and db
# are defined elsewhere in the project.
from dataclasses import dataclass
from pathlib import Path

@dataclass
class DocumentChunk:
    chunk_id: str
    doc_id: str
    text: str
    page_number: int
    chunk_index: int
    metadata: dict

async def ingest_document(
    doc_id: str,
    file_path: Path,
    owner_id: str,
    team_ids: list[str],
) -> int:
    """Full ingestion pipeline. Returns number of chunks created."""

    # 1. Parse document
    pages = await parse_pdf(file_path)

    # 2. Chunk with page awareness
    chunks = []
    for page_num, page_text in enumerate(pages, start=1):
        page_chunks = recursive_chunk(
            text=page_text,
            chunk_size=400,   # tokens
            overlap=50,
            page_number=page_num,
            doc_id=doc_id,
        )
        chunks.extend(page_chunks)

    # 3. Embed all chunks in batches
    embeddings = await embed_batch(
        [c.text for c in chunks],
        batch_size=100,
    )

    # 4. Upsert to vector store with access metadata
    await vector_store.upsert(
        chunks=chunks,
        embeddings=embeddings,
        metadata={
            "doc_id": doc_id,
            "owner_id": owner_id,
            "team_ids": team_ids,  # access control tags
        },
    )

    # 5. Update document registry
    await db.execute(
        """
        UPDATE documents SET
            status = 'indexed',
            chunk_count = $1,
            indexed_at = NOW()
        WHERE id = $2
        """,
        len(chunks), doc_id,
    )

    return len(chunks)
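
The pipeline above relies on a `recursive_chunk` helper that is not shown. As a rough stand-in, the sketch below uses a simpler technique: a sliding window over whitespace-separated words. It approximates tokens by words and does not perform the separator-based recursion a real recursive splitter would, so treat `window_chunk` and its chunk ID format as hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class DocumentChunk:
    chunk_id: str
    doc_id: str
    text: str
    page_number: int
    chunk_index: int
    metadata: dict = field(default_factory=dict)


def window_chunk(
    text: str,
    chunk_size: int,
    overlap: int,
    page_number: int,
    doc_id: str,
) -> list[DocumentChunk]:
    """Split one page into overlapping word windows (words approximate tokens)."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")

    words = text.split()
    step = chunk_size - overlap  # how far each window advances
    chunks: list[DocumentChunk] = []

    for start in range(0, len(words), step):
        index = len(chunks)
        chunks.append(
            DocumentChunk(
                chunk_id=f"{doc_id}:p{page_number}:c{index}",  # hypothetical ID scheme
                doc_id=doc_id,
                text=" ".join(words[start:start + chunk_size]),
                page_number=page_number,
                chunk_index=index,  # index within the page in this sketch
            )
        )
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the page

    return chunks
```

Chunking page by page keeps every chunk tied to exactly one page number, which is what makes page-level citations possible later. The trade-off is that a sentence spanning a page break is split across two chunks.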

Step 4: Access-Controlled Vector Search

The critical requirement: users only retrieve from documents they own or their team owns.

Python
# query/retriever.py
async def retrieve_with_acl(
    query: str,
    user_id: str,
    team_ids: list[str],
    top_k: int = 8,
) -> list[DocumentChunk]:
    """Vector search filtered to documents the user can access."""

    # Embed the query
    query_embedding = await embed_single(query)

    # Build access filter: OR condition across user and teams
    access_filter = {
        "$or": [
            {"owner_id": {"$eq": user_id}},
            {"team_ids": {"$in": team_ids}},
        ]
    }

    # Vector search with pre-filter (reduces search space to accessible docs)
    results = await vector_store.search(
        embedding=query_embedding,
        filter=access_filter,
        top_k=top_k * 2,  # fetch extra for reranking
    )

    # Rerank with cross-encoder
    reranked = await rerank(query, results)

    return reranked[:top_k]

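Critical: The access filter is applied at the vector database level (pre-filtering), not in post-processing. Users can therefore never receive chunks from documents they cannot access, even when semantic similarity is high. This guarantee holds only if user_id and team_ids are taken from the authenticated session rather than from client-supplied request fields.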
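
To make the difference concrete, here is a small in-memory illustration, not the API of any particular vector database. The record layout, `can_access`, and both search functions are assumptions made for this sketch. It shows a second benefit of pre-filtering: post-filtering ranks everything first, so restricted chunks can crowd accessible ones out of the top-k.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def can_access(record, user_id, team_ids):
    """Owner match OR any shared team, mirroring the $or filter above."""
    return record["owner_id"] == user_id or bool(set(record["team_ids"]) & set(team_ids))


def search_prefiltered(index, query, user_id, team_ids, top_k):
    # Restrict the candidate set first, then rank only what the user may see.
    allowed = [r for r in index if can_access(r, user_id, team_ids)]
    allowed.sort(key=lambda r: cosine(r["embedding"], query), reverse=True)
    return allowed[:top_k]


def search_postfiltered(index, query, user_id, team_ids, top_k):
    # Rank everything, cut to top_k, then drop what the user may not see.
    ranked = sorted(index, key=lambda r: cosine(r["embedding"], query), reverse=True)
    return [r for r in ranked[:top_k] if can_access(r, user_id, team_ids)]


INDEX = [
    {"id": "b1", "owner_id": "bob", "team_ids": [], "embedding": [1.0, 0.0]},
    {"id": "b2", "owner_id": "bob", "team_ids": ["finance"], "embedding": [0.9, 0.1]},
    {"id": "a1", "owner_id": "alice", "team_ids": [], "embedding": [0.5, 0.5]},
]

query = [1.0, 0.0]
pre = search_prefiltered(INDEX, query, "alice", ["eng"], top_k=1)
post = search_postfiltered(INDEX, query, "alice", ["eng"], top_k=1)
print([r["id"] for r in pre], [r["id"] for r in post])
```

With pre-filtering, alice gets her own chunk back. With post-filtering, the single top-ranked slot goes to one of bob's private chunks, which is then removed, leaving her with no results at all.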


Step 5: Query Service with Citations

Python
# query/service.py
from pydantic import BaseModel

class Citation(BaseModel):
    doc_id: str
    doc_title: str
    page_number: int
    excerpt: str

class QueryResponse(BaseModel):
    answer: str
    citations: list[Citation]
    from_cache: bool

async def answer_question(
    question: str,
    user_id: str,
    team_ids: list[str],
) -> QueryResponse:

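    # 1. Check semantic cache (user-scoped: different users have different doc access)
    cache_key = f"{user_id}:{question}"
    cached = await semantic_cache.get(cache_key)
    if cached:
        # The cached dict already holds from_cache=False; override it here,
        # since passing the keyword twice would raise a TypeError.
        return QueryResponse(**{**cached, "from_cache": True})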

    # 2. Retrieve relevant chunks
    chunks = await retrieve_with_acl(question, user_id, team_ids)

    if not chunks:
        return QueryResponse(
            answer="I couldn't find relevant information in your documents.",
            citations=[],
            from_cache=False,
        )

    # 3. Format context with citations
    context_parts = []
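    for i, chunk in enumerate(chunks, start=1):
        # DocumentChunk has no doc_title field; this assumes the title is stored
        # under a "doc_title" metadata key and falls back to the doc_id.
        title = chunk.metadata.get("doc_title", chunk.doc_id)
        context_parts.append(
            f"[{i}] From '{title}', page {chunk.page_number}:\n{chunk.text}"
        )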
    context = "\n\n".join(context_parts)

    # 4. Generate answer with citation instructions
    system_prompt = """Answer the user's question based on the provided document excerpts.
    
Rules:
- ONLY use information from the provided excerpts
- Cite sources using [1], [2], etc. corresponding to the excerpt numbers
- If the excerpts don't contain the answer, say so clearly
- Do not make up information not in the excerpts"""

    response = await openai_client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": system_prompt},
            {
                "role": "user",
                "content": f"Excerpts:\n\n{context}\n\nQuestion: {question}",
            },
        ],
        temperature=0.1,
    )

    answer = response.choices[0].message.content or ""  # content can be None

    # 5. Extract citations from answer
    citations = extract_citations(answer, chunks)

    result = QueryResponse(
        answer=answer,
        citations=citations,
        from_cache=False,
    )

    # 6. Cache the result (1-hour TTL for document content)
    await semantic_cache.set(cache_key, result.model_dump(), ttl=3600)

    return result
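
The service above calls an `extract_citations` helper that is not shown. One plausible sketch, assuming the model follows the `[1]`, `[2]` convention from the system prompt: find the markers with a regular expression and map each back to the chunk it numbers. The `Chunk` stand-in and the 200-character excerpt length are assumptions for illustration.

```python
import re
from dataclasses import dataclass


@dataclass
class Chunk:
    """Minimal stand-in for a retrieved chunk (hypothetical shape)."""
    doc_id: str
    doc_title: str
    page_number: int
    text: str


def extract_citations(answer: str, chunks: list[Chunk], excerpt_chars: int = 200) -> list[dict]:
    """Return one citation per distinct, valid [n] marker, in order of first use."""
    citations = []
    seen = set()
    for match in re.finditer(r"\[(\d+)\]", answer):
        n = int(match.group(1))
        # Skip repeats and markers that do not map to a provided excerpt,
        # which can happen if the model invents a citation number.
        if n in seen or not 1 <= n <= len(chunks):
            continue
        seen.add(n)
        chunk = chunks[n - 1]  # markers are 1-based
        citations.append(
            {
                "doc_id": chunk.doc_id,
                "doc_title": chunk.doc_title,
                "page_number": chunk.page_number,
                "excerpt": chunk.text[:excerpt_chars],
            }
        )
    return citations
```

The function returns plain dicts whose keys match the `Citation` fields, so Pydantic can validate them into `Citation` objects when they are passed to `QueryResponse`. Dropping out-of-range markers silently is a design choice; logging them would help spot a model that cites excerpts it was never given.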

Step 6: Handling Scale — 10,000 Concurrent Users

At 333 queries/second, naive serial processing won't work. Key strategies:

1. Semantic cache (biggest win)

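  • 40-60% of enterprise queries are repeats (the same user re-asking a question, or team members asking similar things)
  • Cache key: {user_id}:{question}, user-scoped for access control. A user-scoped key only serves that user's own repeats; reusing an answer across teammates requires re-checking that the requester can access every cited document before serving the hit
  • A 60% cache hit rate reduces LLM calls from 333 to about 133/second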
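
An exact-string key only matches identical wording, whereas a semantic cache usually compares query embeddings. The sketch below is a minimal in-memory version under stated assumptions: embeddings are computed elsewhere, `scope` is whatever string marks an access boundary (the user ID in this design), and the 0.95 threshold is an arbitrary starting point. A production cache would use a vector index and a TTL rather than a linear scan.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class SemanticCache:
    """Toy semantic cache: entries are partitioned by access scope."""

    def __init__(self, threshold: float = 0.95):
        self.threshold = threshold
        self._entries = {}  # scope -> list of (embedding, value)

    def get(self, scope, query_embedding):
        """Return the most similar cached value in this scope, or None."""
        best_value, best_score = None, self.threshold
        for embedding, value in self._entries.get(scope, []):
            score = cosine(embedding, query_embedding)
            if score >= best_score:
                best_value, best_score = value, score
        return best_value

    def set(self, scope, query_embedding, value):
        self._entries.setdefault(scope, []).append((list(query_embedding), value))


cache = SemanticCache(threshold=0.95)
cache.set("user:alice", [1.0, 0.0], {"answer": "cached answer"})
hit = cache.get("user:alice", [0.99, 0.05])    # near-identical question
miss_scope = cache.get("user:bob", [1.0, 0.0])  # same question, other scope
miss_topic = cache.get("user:alice", [0.0, 1.0])  # unrelated question
```

Partitioning by scope means a lookup never even considers entries written for another user, which keeps the cache consistent with the access filter used at retrieval time.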

2. Horizontal scaling of query service

  • Stateless FastAPI workers: scale to N replicas
  • Azure Container Apps: scale on HTTP queue depth
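  • 20 replicas × 10 concurrent requests each = 200 concurrent LLM calls. At about 5 seconds per call that sustains roughly 40 calls/second, so serving 133 calls/second needs more replicas, higher per-replica concurrency, or shorter calls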
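
Replica counts follow from Little's law: requests in flight equal arrival rate times latency. Here is a rough sizing sketch, assuming the 133 calls/second left after caching and the 5-second total response time mentioned under streaming. Both inputs are estimates to replace with measured values, and `replicas_needed` is an illustrative helper.

```python
import math


def replicas_needed(calls_per_second: float, seconds_per_call: float,
                    concurrency_per_replica: int) -> int:
    """Little's law: in-flight calls = arrival rate x latency."""
    in_flight = calls_per_second * seconds_per_call
    return math.ceil(in_flight / concurrency_per_replica)


in_flight_calls = 133 * 5                        # calls open at any moment
replicas = replicas_needed(133, 5, 10)           # at 10 concurrent calls each
sustained_by_20 = 20 * 10 / 5                    # calls/second 20 replicas can carry

print(in_flight_calls, replicas, sustained_by_20)
```

Under these assumptions the service needs several times more than 20 replicas, or each replica must hold far more than 10 concurrent calls. Since the workers spend most of their time awaiting the LLM, raising per-replica concurrency is usually the cheaper lever.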

3. Request coalescing

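  • If 50 users with the same document access ask "What is the Q3 revenue?" simultaneously, deduplicate: run once, fan out results. The coalescing key must include the access scope so an answer is never shared across users with different permissions
  • Requires distributed locking (Redis) to prevent duplicate LLM calls across replicas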
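
Within a single process, coalescing can be sketched with asyncio alone: the first caller for a key starts the work, and later callers await the same task. `SingleFlight` is a hypothetical helper, and this version does not coordinate across replicas, which is where the Redis lock mentioned above would come in.

```python
import asyncio


class SingleFlight:
    """Coalesce concurrent calls that share a key into one underlying call."""

    def __init__(self):
        self._in_flight = {}  # key -> asyncio.Task

    async def run(self, key, fn):
        task = self._in_flight.get(key)
        if task is None:
            # First caller for this key: start the work and register it.
            task = asyncio.ensure_future(fn())
            self._in_flight[key] = task
            task.add_done_callback(lambda _done: self._in_flight.pop(key, None))
        # shield() keeps one cancelled caller from cancelling the shared task.
        return await asyncio.shield(task)


async def demo(n_callers: int = 50):
    calls = {"count": 0}

    async def slow_llm_call():
        calls["count"] += 1
        await asyncio.sleep(0.01)  # stands in for a multi-second LLM call
        return "Q3 revenue answer"

    flights = SingleFlight()
    # The key includes an access scope so results are only shared within it.
    key = "team:finance|what is the q3 revenue?"
    results = await asyncio.gather(
        *[flights.run(key, slow_llm_call) for _ in range(n_callers)]
    )
    return results, calls["count"]
```

Running `asyncio.run(demo())` sends 50 concurrent requests through the helper while the underlying call executes once. Normalizing the question (lowercasing, trimming whitespace) before building the key increases the chance that near-identical requests coalesce.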

4. Streaming responses

  • Stream GPT-4o output token-by-token
  • Time-to-first-token under 1 second even when total response takes 5 seconds
  • Dramatically improves perceived latency
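
The gap between time-to-first-token and total time is easy to demonstrate without calling a model. In the sketch below, `fake_token_stream` is a stand-in async generator; with the OpenAI Python client the tokens would instead come from a request made with `stream=True`. `relay_stream` is a hypothetical helper that forwards tokens and records both timings.

```python
import asyncio
import time


async def fake_token_stream(tokens, delay_s=0.01):
    """Stand-in for a streaming LLM response: yields one token per delay."""
    for token in tokens:
        await asyncio.sleep(delay_s)
        yield token


async def relay_stream(token_stream):
    """Consume tokens as they arrive; return text, time-to-first-token, total time."""
    start = time.perf_counter()
    first_token_at = None
    parts = []
    async for token in token_stream:
        if first_token_at is None:
            first_token_at = time.perf_counter() - start
        parts.append(token)  # a real service would write this to the client here
    total = time.perf_counter() - start
    return "".join(parts), first_token_at, total


text, ttft, total = asyncio.run(
    relay_stream(fake_token_stream(["Q3 ", "revenue ", "rose."]))
)
print(f"first token after {ttft:.3f}s, full answer after {total:.3f}s")
```

The user starts reading after the first delay instead of waiting for the whole answer. One consequence for the design: citations can only be extracted once the stream has finished, so they are typically sent as a final event after the text.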

Step 7: Multi-Document Queries

When a user asks "Compare Q3 results across all the annual reports I uploaded":

Python
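import asyncio

async def multi_document_query(
    question: str,
    user_id: str,
    doc_ids: list[str],  # user selected specific docs
) -> QueryResponse:
    # Enforce access control first: doc_ids come from the client and cannot be trusted.
    # filter_accessible_docs is an assumed helper that returns the subset of
    # doc_ids this user is allowed to read.
    allowed_ids = await filter_accessible_docs(user_id, doc_ids)
    if not allowed_ids:
        return QueryResponse(
            answer="I couldn't find relevant information in your documents.",
            citations=[],
            from_cache=False,
        )

    # Retrieve from each permitted document concurrently
    per_doc_chunks = await asyncio.gather(*[
        retrieve_from_doc(question, doc_id, top_k=3)
        for doc_id in allowed_ids
    ])

    # Build multi-document context
    context = format_multi_doc_context(per_doc_chunks)

    # Generate comparative answer
    return await generate_with_context(question, context)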
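
The `format_multi_doc_context` helper is not shown above. One way it might look, assuming `per_doc_chunks` is a list with one list of chunks per document: group excerpts under a document header and number them globally so the `[n]` citation convention still works. The `Chunk` stand-in and the output layout are assumptions.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    """Minimal stand-in for a retrieved chunk (hypothetical shape)."""
    doc_id: str
    doc_title: str
    page_number: int
    text: str


def format_multi_doc_context(per_doc_chunks: list[list[Chunk]]) -> str:
    """Group excerpts by document, numbering them across all documents."""
    sections = []
    n = 0
    for doc_chunks in per_doc_chunks:
        if not doc_chunks:
            continue  # nothing relevant was retrieved from this document
        lines = [f"Document: {doc_chunks[0].doc_title}"]
        for chunk in doc_chunks:
            n += 1
            lines.append(f"[{n}] page {chunk.page_number}: {chunk.text}")
        sections.append("\n".join(lines))
    return "\n\n".join(sections)


context = format_multi_doc_context([
    [Chunk("d1", "Report A", 12, "alpha")],
    [],
    [Chunk("d2", "Report B", 9, "beta"), Chunk("d2", "Report B", 10, "gamma")],
])
print(context)
```

Retrieving a fixed top_k per document, rather than one global top_k, guarantees every selected report contributes excerpts, which a comparison question needs. Keeping each document's excerpts together also makes it easier for the model to attribute figures to the right report.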

Step 8: MVP vs Full Build

MVP (3 weeks):

  • PDF upload → parse → chunk → embed → store in FAISS (file-based)
  • Simple user auth, all documents shared within account
  • No semantic cache
  • Basic citation in response

v1.0 (3 months):

  • Azure AI Search (managed, scalable vector store)
  • Team-based access control
  • Semantic cache (Redis)
  • Streaming responses
  • Citation extraction and display

v2.0 (6 months):

  • Multi-document comparison queries
  • Table and chart extraction from PDFs
  • Conversation memory (follow-up questions)
  • Analytics: most-asked questions, document coverage gaps