Scenario: Design a Document Q&A System

The Interview Question

"Design a document Q&A platform where enterprise users can upload PDF documents and ask natural language questions about them. Users should only be able to query documents they have access to. The system should handle 100,000 documents and 10,000 concurrent users."

Step 1: Clarify Requirements

Document types: PDFs primarily; also Word, Excel, PowerPoint
Document size: Average 50 pages, up to 500 pages
Users: Enterprise — teams with role-based access control
Citations: Answers must cite which document and page the information came from
Latency: under 10 seconds for an answer
Storage: 100,000 documents now, growing to 1 million
Concurrent users: 10,000 simultaneous users

Step 2: Back-of-Envelope

Storage:

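100,000 docs × 50 pages × ~3,000 words/page = 15 billion words (a deliberately generous upper bound: a dense page is closer to 500 words, so the real corpus is likely several times smaller)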
In embeddings: text-embedding-3-small = 1,536 dimensions × 4 bytes = 6KB per chunk
Average doc: 500 chunks × 6KB = 3MB of embeddings per doc
100,000 docs: 300GB of embeddings — manageable in a vector database

Query load:

10,000 concurrent users, average 1 query per 30 seconds = ~333 queries/second
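This is the hardest part: 333 new LLM calls every second is expensive, and because each call takes several seconds, well over a thousand are in flight at any moment
A semantic cache should absorb the repeated queries, assumed here to be 40-60% of traffic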
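
The query-load arithmetic above can be written out as a quick sanity check. This is only a rough sketch: the function names are illustrative, and the 40-60% hit rate is the assumption stated above rather than a measured figure.

```python
def queries_per_second(concurrent_users: int, seconds_between_queries: float) -> float:
    """Steady-state query rate if every user asks once per interval."""
    return concurrent_users / seconds_between_queries


def llm_calls_per_second(qps: float, cache_hit_rate: float) -> float:
    """Queries that miss the cache and still need an LLM call."""
    return qps * (1.0 - cache_hit_rate)


qps = queries_per_second(10_000, 30)
for hit_rate in (0.4, 0.6):
    remaining = llm_calls_per_second(qps, hit_rate)
    print(f"hit rate {hit_rate:.0%}: {remaining:.0f} LLM calls/s")
```

Even at the optimistic end of the range, more than a hundred LLM calls per second remain, which is why the later scaling steps still matter.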

Embedding cost (one-time):

100,000 docs × 500 chunks × 200 tokens/chunk = 10 billion tokens
text-embedding-3-small: $0.02/1M tokens = $200 total ingestion cost
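
The storage and ingestion-cost estimates can be reproduced with a few lines of arithmetic. This is a minimal sketch using the figures above; the helper names are hypothetical, and the storage number covers raw float32 vectors only (no index overhead or metadata).

```python
BYTES_PER_FLOAT32 = 4


def embedding_storage_gb(docs: int, chunks_per_doc: int, dims: int) -> float:
    """Raw vector storage in decimal GB, ignoring index overhead and metadata."""
    return docs * chunks_per_doc * dims * BYTES_PER_FLOAT32 / 1e9


def embedding_cost_usd(
    docs: int,
    chunks_per_doc: int,
    tokens_per_chunk: int,
    usd_per_million_tokens: float,
) -> float:
    """One-time cost to embed the whole corpus."""
    total_tokens = docs * chunks_per_doc * tokens_per_chunk
    return total_tokens / 1e6 * usd_per_million_tokens


print(f"{embedding_storage_gb(100_000, 500, 1536):.1f} GB of vectors")
print(f"${embedding_cost_usd(100_000, 500, 200, 0.02):.0f} to embed the corpus")
```

The storage result comes out slightly above 300GB because each vector is 6,144 bytes, which the estimate above rounds down to 6KB.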

System Architecture

                    ┌──────────────────┐
                    │   Web App        │
                    │   (React/Next)   │
                    └────────┬─────────┘
                             │
                             ▼
                    ┌──────────────────┐
                    │   API Gateway    │  ← Auth, rate limiting
                    └──────┬───────────┘
                           │
               ┌───────────┴───────────┐
               │                       │
               ▼                       ▼
    ┌─────────────────┐     ┌─────────────────┐
    │  Query Service  │     │ Ingestion Svc   │
    │  (FastAPI)      │     │ (worker pool)   │
    └────────┬────────┘     └────────┬────────┘
             │                       │
    ┌────────┴──────┐       ┌────────┴────────┐
    │               │       │                 │
    ▼               ▼       ▼                 ▼
┌───────┐  ┌──────────────┐ ┌──────┐  ┌───────────────┐
│ Redis │  │  Vector DB   │ │ Blob │  │  Document DB  │
│ Cache │  │ (with ACL    │ │Store │  │  (Postgres)   │
│       │  │  filtering)  │ │      │  │               │
└───────┘  └──────┬───────┘ └──────┘  └───────────────┘
                  │
                  ▼
           ┌────────────┐
           │  Azure OAI │
           │  GPT-4o    │
           └────────────┘

Step 3: Document Ingestion Pipeline

Python

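# ingestion/pipeline.py
# Assumed to be defined elsewhere: parse_pdf, recursive_chunk, embed_batch,
# vector_store, and db (an async Postgres client).
from dataclasses import dataclass
from pathlib import Path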

@dataclass
class DocumentChunk:
    chunk_id: str
    doc_id: str
    text: str
    page_number: int
    chunk_index: int
    metadata: dict

async def ingest_document(
    doc_id: str,
    file_path: Path,
    owner_id: str,
    team_ids: list[str],
) -> int:
    """Full ingestion pipeline. Returns number of chunks created."""

    # 1. Parse document
    pages = await parse_pdf(file_path)

    # 2. Chunk with page awareness
    chunks = []
    for page_num, page_text in enumerate(pages, start=1):
        page_chunks = recursive_chunk(
            text=page_text,
            chunk_size=400,   # tokens
            overlap=50,
            page_number=page_num,
            doc_id=doc_id,
        )
        chunks.extend(page_chunks)

    # 3. Embed all chunks in batches
    embeddings = await embed_batch(
        [c.text for c in chunks],
        batch_size=100,
    )

    # 4. Upsert to vector store with access metadata
    await vector_store.upsert(
        chunks=chunks,
        embeddings=embeddings,
        metadata={
            "doc_id": doc_id,
            "owner_id": owner_id,
            "team_ids": team_ids,  # access control tags
        },
    )

    # 5. Update document registry
    await db.execute(
        """
        UPDATE documents SET
            status = 'indexed',
            chunk_count = $1,
            indexed_at = NOW()
        WHERE id = $2
        """,
        len(chunks), doc_id,
    )

    return len(chunks)
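
The pipeline leaves recursive_chunk undefined. As a stand-in, here is a simpler sliding-window chunker with the same call shape. It is a sketch under stated assumptions, not a true recursive splitter: whitespace-separated words approximate tokens, chunk_index restarts on every page, and the chunk_id format is made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class DocumentChunk:
    chunk_id: str
    doc_id: str
    text: str
    page_number: int
    chunk_index: int
    metadata: dict = field(default_factory=dict)


def sliding_window_chunk(
    text: str,
    chunk_size: int,
    overlap: int,
    page_number: int,
    doc_id: str,
) -> list[DocumentChunk]:
    """Split one page into overlapping windows of whitespace-separated words.

    Words stand in for tokens here; a real pipeline would count tokens with
    the embedding model's tokenizer.
    """
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")

    words = text.split()
    step = chunk_size - overlap
    chunks: list[DocumentChunk] = []
    for index, start in enumerate(range(0, len(words), step)):
        window = words[start:start + chunk_size]
        chunks.append(
            DocumentChunk(
                chunk_id=f"{doc_id}:p{page_number}:c{index}",  # hypothetical id scheme
                doc_id=doc_id,
                text=" ".join(window),
                page_number=page_number,
                chunk_index=index,
            )
        )
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the page
    return chunks


page = " ".join(f"w{i}" for i in range(10))
for chunk in sliding_window_chunk(page, chunk_size=4, overlap=1, page_number=1, doc_id="doc-1"):
    print(chunk.chunk_id, "->", chunk.text)
```

Chunking page by page keeps every chunk tied to exactly one page number, which is what makes page-level citations possible later, at the cost of never joining a sentence that spans a page break.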

Step 4: Access-Controlled Vector Search

The critical requirement: users only retrieve from documents they own or their team owns.

Python

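# query/retriever.py
from ingestion.pipeline import DocumentChunk  # assumed module path, matching the ingestion file comment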
async def retrieve_with_acl(
    query: str,
    user_id: str,
    team_ids: list[str],
    top_k: int = 8,
) -> list[DocumentChunk]:
    """Vector search filtered to documents the user can access."""

    # Embed the query
    query_embedding = await embed_single(query)

    # Build access filter — OR condition across user and teams
    access_filter = {
        "$or": [
            {"owner_id": {"$eq": user_id}},
            {"team_ids": {"$in": team_ids}},
        ]
    }

    # Vector search with pre-filter (reduces search space to accessible docs)
    results = await vector_store.search(
        embedding=query_embedding,
        filter=access_filter,
        top_k=top_k * 2,  # fetch extra for reranking
    )

    # Rerank with cross-encoder
    reranked = await rerank(query, results)

    return reranked[:top_k]

Critical: The access filter is applied at the vector database level (pre-filtering), not post-processing. This means users can never receive chunks from documents they don't have access to, even if the semantic similarity is high.
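
To make the difference concrete, here is a small in-memory comparison of filtering before and after ranking. It is an illustrative sketch only: the index layout, the can_access helper, and the sample data are invented, and a real vector database applies the filter inside its own search. A correct post-filter does not leak either, but it ranks restricted chunks first and can leave the user with nothing, whereas pre-filtering spends every top_k slot on accessible chunks and keeps restricted text out of the application layer altogether.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def can_access(meta: dict, user_id: str, team_ids: list[str]) -> bool:
    """Mirrors the intent of the $or filter: owner match or any shared team."""
    return meta["owner_id"] == user_id or bool(set(meta["team_ids"]) & set(team_ids))


def search_prefilter(index, query, user_id, team_ids, top_k):
    """Filter first, then rank: all top_k slots go to accessible chunks."""
    allowed = [row for row in index if can_access(row["meta"], user_id, team_ids)]
    allowed.sort(key=lambda row: cosine(row["vec"], query), reverse=True)
    return allowed[:top_k]


def search_postfilter(index, query, user_id, team_ids, top_k):
    """Rank first, then filter: inaccessible chunks can crowd out valid ones."""
    ranked = sorted(index, key=lambda row: cosine(row["vec"], query), reverse=True)
    return [row for row in ranked[:top_k] if can_access(row["meta"], user_id, team_ids)]


index = [
    {"id": "hr-1", "vec": [1.0, 0.0], "meta": {"owner_id": "bob", "team_ids": ["hr"]}},
    {"id": "hr-2", "vec": [0.9, 0.1], "meta": {"owner_id": "bob", "team_ids": ["hr"]}},
    {"id": "eng-1", "vec": [0.6, 0.8], "meta": {"owner_id": "alice", "team_ids": ["eng"]}},
]
query = [1.0, 0.0]

pre = search_prefilter(index, query, "alice", ["eng"], top_k=2)
post = search_postfilter(index, query, "alice", ["eng"], top_k=2)
print("pre-filter: ", [row["id"] for row in pre])   # ['eng-1']
print("post-filter:", [row["id"] for row in post])  # []
```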

Step 5: Query Service with Citations

Python

# query/service.py
from pydantic import BaseModel

class Citation(BaseModel):
    doc_id: str
    doc_title: str
    page_number: int
    excerpt: str

class QueryResponse(BaseModel):
    answer: str
    citations: list[Citation]
    from_cache: bool

async def answer_question(
    question: str,
    user_id: str,
    team_ids: list[str],
) -> QueryResponse:

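    # 1. Check semantic cache (user-scoped: different users have different doc access)
    cache_key = f"{user_id}:{question}"
    cached = await semantic_cache.get(cache_key)
    if cached:
        # The cached dict already contains from_cache=False, so override the key
        # rather than passing it twice (which would raise a TypeError).
        return QueryResponse(**{**cached, "from_cache": True})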

    # 2. Retrieve relevant chunks
    chunks = await retrieve_with_acl(question, user_id, team_ids)

    if not chunks:
        return QueryResponse(
            answer="I couldn't find relevant information in your documents.",
            citations=[],
            from_cache=False,
        )

    # 3. Format context with citations
    context_parts = []
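    for i, chunk in enumerate(chunks, start=1):
        # DocumentChunk has no doc_title field; read it from metadata
        # (assumed to be stored at ingestion) and fall back to the doc id.
        doc_title = chunk.metadata.get("doc_title", chunk.doc_id)
        context_parts.append(
            f"[{i}] From '{doc_title}', page {chunk.page_number}:\n{chunk.text}"
        )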
    context = "\n\n".join(context_parts)

    # 4. Generate answer with citation instructions
    system_prompt = """Answer the user's question based on the provided document excerpts.
    
Rules:
- ONLY use information from the provided excerpts
- Cite sources using [1], [2], etc. corresponding to the excerpt numbers
- If the excerpts don't contain the answer, say so clearly
- Do not make up information not in the excerpts"""

    response = await openai_client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": system_prompt},
            {
                "role": "user",
                "content": f"Excerpts:\n\n{context}\n\nQuestion: {question}",
            },
        ],
        temperature=0.1,
    )

    answer = response.choices[0].message.content or ""  # content can be None

    # 5. Extract citations from answer
    citations = extract_citations(answer, chunks)

    result = QueryResponse(
        answer=answer,
        citations=citations,
        from_cache=False,
    )

    # 6. Cache the result (1-hour TTL for document content)
    await semantic_cache.set(cache_key, result.model_dump(), ttl=3600)

    return result
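
The service calls extract_citations without defining it. One possible shape, sketched below, maps each [n] marker in the answer back to the n-th excerpt. The Chunk and Citation dataclasses are simplified stand-ins for DocumentChunk and the pydantic Citation model, and storing doc_title in chunk metadata is an assumption.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Chunk:  # stand-in for DocumentChunk
    doc_id: str
    text: str
    page_number: int
    metadata: dict = field(default_factory=dict)


@dataclass
class Citation:  # stand-in for the pydantic Citation model
    doc_id: str
    doc_title: str
    page_number: int
    excerpt: str


def extract_citations(
    answer: str,
    chunks: list[Chunk],
    excerpt_chars: int = 200,
) -> list[Citation]:
    """Map [n] markers in the answer back to the n-th excerpt (1-based)."""
    citations: list[Citation] = []
    seen: set[int] = set()
    for match in re.finditer(r"\[(\d+)\]", answer):
        number = int(match.group(1))
        if number in seen or not 1 <= number <= len(chunks):
            continue  # skip repeats and markers that match no excerpt
        seen.add(number)
        chunk = chunks[number - 1]
        citations.append(
            Citation(
                doc_id=chunk.doc_id,
                doc_title=chunk.metadata.get("doc_title", chunk.doc_id),
                page_number=chunk.page_number,
                excerpt=chunk.text[:excerpt_chars],
            )
        )
    return citations


sample_chunks = [
    Chunk("d1", "Revenue rose in the third quarter.", 4, {"doc_title": "Q3 Report"}),
    Chunk("d2", "Operating costs fell.", 9),
]
for citation in extract_citations("Revenue rose [1]. Costs fell [2]. See [1] and [7].", sample_chunks):
    print(citation.doc_title, "page", citation.page_number)
```

Out-of-range markers such as [7] are dropped rather than trusted, since the model can emit numbers that match no excerpt. Markers written as "[1, 2]" would not be picked up by this pattern.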

Step 6: Handling Scale — 10,000 Concurrent Users

At 333 queries/second, naive serial processing won't work. Key strategies:

1. Semantic cache (biggest win)

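40-60% of enterprise queries are assumed to be repeats (the same user re-asking a question, or team members asking similar things)
Cache key: {user_id}:{question}, user-scoped for access control. This key only catches one user's own repeats; sharing hits between teammates would require keying on a common access scope and caching only answers built from documents that the whole scope can read
A 60% cache hit rate reduces LLM calls from 333/second to about 133/second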
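
A minimal in-memory sketch of such a cache follows, assuming entries are grouped by a scope string (a user ID, or a shared access scope) and matched by embedding similarity rather than exact text. The class, the 0.9 threshold, and the toy embedding are all illustrative; a production version would use a real embedding model and a store such as Redis.

```python
import math
import time
from typing import Callable, Optional


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class SemanticCache:
    """Entries are isolated per scope and matched by embedding similarity."""

    def __init__(
        self,
        embed: Callable[[str], list[float]],
        threshold: float = 0.9,
        ttl_seconds: float = 3600.0,
    ) -> None:
        self._embed = embed
        self._threshold = threshold
        self._ttl = ttl_seconds
        # scope -> list of (question vector, cached value, expiry timestamp)
        self._entries: dict[str, list[tuple[list[float], dict, float]]] = {}

    def get(self, scope: str, question: str, now: Optional[float] = None) -> Optional[dict]:
        now = time.time() if now is None else now
        query_vec = self._embed(question)
        live = [entry for entry in self._entries.get(scope, []) if entry[2] > now]
        self._entries[scope] = live  # drop expired entries lazily
        best = max(live, key=lambda entry: cosine(entry[0], query_vec), default=None)
        if best is not None and cosine(best[0], query_vec) >= self._threshold:
            return best[1]
        return None

    def set(self, scope: str, question: str, value: dict, now: Optional[float] = None) -> None:
        now = time.time() if now is None else now
        entry = (self._embed(question), value, now + self._ttl)
        self._entries.setdefault(scope, []).append(entry)


def toy_embed(text: str) -> list[float]:
    """Toy embedding: word presence over a tiny fixed vocabulary."""
    vocab = ["q3", "revenue", "headcount", "policy"]
    words = set(text.lower().replace("?", "").split())
    return [1.0 if word in words else 0.0 for word in vocab]


cache = SemanticCache(embed=toy_embed)
cache.set("user-1", "What is the Q3 revenue?", {"answer": "cached answer"})
print(cache.get("user-1", "Q3 revenue please"))  # hit: same scope, similar question
print(cache.get("user-2", "Q3 revenue please"))  # None: different scope
```

Because lookups never cross scopes, a cached answer cannot reach a user outside the scope it was stored under. Entries should still be invalidated when documents or permissions change, which this sketch does not cover.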

2. Horizontal scaling of query service

Stateless FastAPI workers: scale to N replicas
Azure Container Apps: scale on HTTP queue depth
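20 replicas × 10 concurrent requests each = 200 concurrent LLM calls. At roughly 5 seconds per call that sustains about 40 calls/second, so serving the ~133/second left after caching takes closer to 67 such replicas, or higher per-replica concurrency (async workers spend most of their time waiting on I/O)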
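
Replica counts follow from Little's law (in-flight requests = arrival rate × time per request). The sketch below assumes about 5 seconds per LLM call, the total response time quoted in the streaming strategy, and uses hypothetical helper names.

```python
import math


def in_flight_calls(llm_calls_per_second: float, avg_latency_seconds: float) -> float:
    """Little's law: concurrency = arrival rate x time in system."""
    return llm_calls_per_second * avg_latency_seconds


def replicas_needed(
    llm_calls_per_second: float,
    avg_latency_seconds: float,
    concurrency_per_replica: int,
) -> int:
    """Replicas required to hold every in-flight call at once."""
    in_flight = in_flight_calls(llm_calls_per_second, avg_latency_seconds)
    return math.ceil(in_flight / concurrency_per_replica)


print(in_flight_calls(133, 5.0))      # calls in flight after a 60% cache hit rate
print(replicas_needed(133, 5.0, 10))  # replicas at 10 concurrent requests each
```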

3. Request coalescing

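If 50 users ask "What is the Q3 revenue?" simultaneously, deduplicate: run once, fan out results. Only coalesce requests from users with the same document access, or the shared answer can leak restricted content
Requires distributed locking (Redis) so that only one replica makes the LLM call while the others wait for its result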
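
Within a single process, coalescing can be sketched with a shared asyncio task per key. This is an in-process illustration only: deduplicating across replicas would still need the Redis lock mentioned above, and the SingleFlight class and the key format are made up for the example. The key should include the access scope so that only users with identical access share a result.

```python
import asyncio
from typing import Awaitable, Callable


class SingleFlight:
    """Coalesce concurrent calls that share a key into one underlying call."""

    def __init__(self) -> None:
        self._in_flight: dict[str, asyncio.Future] = {}

    async def run(self, key: str, fn: Callable[[], Awaitable[str]]) -> str:
        task = self._in_flight.get(key)
        if task is None:
            # First caller for this key becomes the leader and starts the work.
            task = asyncio.ensure_future(fn())
            self._in_flight[key] = task
            task.add_done_callback(lambda _: self._in_flight.pop(key, None))
        # shield() keeps one cancelled waiter from cancelling the shared call.
        return await asyncio.shield(task)


async def demo() -> tuple[list[str], int]:
    calls = 0

    async def fake_llm() -> str:
        nonlocal calls
        calls += 1
        await asyncio.sleep(0.01)  # stands in for a slow LLM call
        return "answer"

    flight = SingleFlight()
    key = "scope-finance:what is the q3 revenue?"
    results = await asyncio.gather(*[flight.run(key, fake_llm) for _ in range(50)])
    return list(results), calls


results, calls = asyncio.run(demo())
print(len(results), "responses from", calls, "LLM call")
```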

4. Streaming responses

Stream GPT-4o output token-by-token
Time-to-first-token under 1 second even when total response takes 5 seconds
Dramatically improves perceived latency
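
On the server side, streaming amounts to forwarding each token as soon as it arrives. The sketch below wraps a token stream as server-sent events. fake_token_stream is a stand-in for the model's streaming response, and the JSON payload shape and the [DONE] sentinel are assumptions rather than a fixed protocol.

```python
import asyncio
import json
from typing import AsyncIterator


async def fake_token_stream(text: str, delay: float = 0.0) -> AsyncIterator[str]:
    """Stand-in for a streaming LLM response: yields one word at a time."""
    for word in text.split():
        await asyncio.sleep(delay)
        yield word + " "


async def to_sse(tokens: AsyncIterator[str]) -> AsyncIterator[str]:
    """Wrap each token as a server-sent event, then signal completion."""
    async for token in tokens:
        # JSON-encode the payload so newlines inside a token cannot break framing.
        yield f"data: {json.dumps({'token': token})}\n\n"
    yield "data: [DONE]\n\n"


async def collect(stream: AsyncIterator[str]) -> list[str]:
    return [event async for event in stream]


events = asyncio.run(collect(to_sse(fake_token_stream("Revenue grew"))))
for event in events:
    print(repr(event))
```

In a FastAPI service, a generator like to_sse can be returned through StreamingResponse with media_type="text/event-stream", so the browser starts rendering as soon as the first event arrives.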

Step 7: Multi-Document Queries

When a user asks "Compare Q3 results across all the annual reports I uploaded":

Python

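import asyncio

async def multi_document_query(
    question: str,
    user_id: str,
    team_ids: list[str],
    doc_ids: list[str],  # user selected specific docs
) -> QueryResponse:
    # doc_ids come from the client, so enforce access control before retrieval.
    # filter_accessible_docs is an assumed helper that applies the same
    # owner/team rule as retrieve_with_acl and drops everything else.
    allowed_ids = await filter_accessible_docs(doc_ids, user_id, team_ids)

    # Retrieve from each permitted document separately, in parallel
    per_doc_chunks = await asyncio.gather(*[
        retrieve_from_doc(question, doc_id, top_k=3)
        for doc_id in allowed_ids
    ])

    # Build multi-document context
    context = format_multi_doc_context(per_doc_chunks)

    # Generate comparative answer
    return await generate_with_context(question, context)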
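
The format_multi_doc_context helper is not shown above. One way it could look, as a sketch: group excerpts under a header per document and number them globally so that [n] citations stay unambiguous across documents. The Chunk dataclass is a simplified stand-in for DocumentChunk, and the header format is arbitrary.

```python
from dataclasses import dataclass, field


@dataclass
class Chunk:  # stand-in for DocumentChunk
    doc_id: str
    text: str
    page_number: int
    metadata: dict = field(default_factory=dict)


def format_multi_doc_context(per_doc_chunks: list[list[Chunk]]) -> str:
    """Group excerpts under a per-document header and number them globally."""
    sections: list[str] = []
    excerpt_number = 1
    for doc_chunks in per_doc_chunks:
        if not doc_chunks:
            continue  # nothing relevant was retrieved from this document
        first = doc_chunks[0]
        title = first.metadata.get("doc_title", first.doc_id)
        lines = [f"=== Document: {title} ==="]
        for chunk in doc_chunks:
            lines.append(f"[{excerpt_number}] page {chunk.page_number}: {chunk.text}")
            excerpt_number += 1
        sections.append("\n".join(lines))
    return "\n\n".join(sections)


per_doc = [
    [Chunk("a", "Q3 revenue grew.", 1, {"doc_title": "Report A"})],
    [],
    [Chunk("b", "Q3 revenue was flat.", 2)],
]
print(format_multi_doc_context(per_doc))
```

Retrieving a fixed top_k per document, as the function above does, gives every report a voice in the comparison instead of letting one long document dominate the context.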

Step 8: MVP vs Full Build

MVP (3 weeks):

PDF upload → parse → chunk → embed → store in FAISS (file-based)
Simple user auth, all documents shared within account
No semantic cache
Basic citation in response

v1.0 (3 months):

Azure AI Search (managed, scalable vector store)
Team-based access control
Semantic cache (Redis)
Streaming responses
Citation extraction and display

v2.0 (6 months):

Multi-document comparison queries
Table and chart extraction from PDFs
Conversation memory (follow-up questions)
Analytics: most-asked questions, document coverage gaps