Scenario: Design a Document Q&A System
The Interview Question
"Design a document Q&A platform where enterprise users can upload PDF documents and ask natural language questions about them. Users should only be able to query documents they have access to. The system should handle 100,000 documents and 10,000 concurrent users."
Step 1: Clarify Requirements
- Document types: PDFs primarily; also Word, Excel, PowerPoint
- Document size: Average 50 pages, up to 500 pages
- Users: Enterprise — teams with role-based access control
- Citations: Answers must cite which document and page the information came from
- Latency: under 10 seconds for an answer
- Storage: 100,000 documents now, growing to 1 million
- Concurrent users: 10,000 simultaneous users
Step 2: Back-of-Envelope
Storage:
- 100,000 docs × 50 pages × ~3,000 words/page = 15 billion words (a deliberately generous ceiling; ordinary text pages run closer to 500 words)
- In embeddings: text-embedding-3-small = 1,536 dimensions × 4 bytes = 6KB per chunk
- Average doc: 500 chunks × 6KB = 3MB of embeddings per doc
- 100,000 docs: 300GB of embeddings — manageable in a vector database
Query load:
- 10,000 concurrent users, average 1 query per 30 seconds = ~333 queries/second
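- This is the hardest part: 333 new LLM calls every second is expensive, and because each call takes several seconds, on the order of 1,000 or more are in flight at once
- A semantic cache should absorb the 40-60% of queries that are repeats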
Embedding cost (one-time):
- 100,000 docs × 500 chunks × 200 tokens/chunk = 10 billion tokens
- text-embedding-3-small: $0.02/1M tokens = $200 total ingestion cost
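The estimates above can be recomputed in a few lines, which makes each assumption explicit and easy to change. This is a sketch only: every constant is taken from the bullets above except `ASSUMED_LLM_LATENCY_S`, which is an assumption borrowed from the 5-second response time mentioned later in the lesson.

```python
# Back-of-envelope figures from Step 2, with every input named.
DOCS = 100_000
CHUNKS_PER_DOC = 500               # from the storage estimate
TOKENS_PER_CHUNK = 200             # from the embedding-cost estimate
EMBEDDING_DIMS = 1_536             # text-embedding-3-small
BYTES_PER_FLOAT = 4                # float32
PRICE_PER_MILLION_TOKENS = 0.02    # USD, as quoted in the lesson

CONCURRENT_USERS = 10_000
SECONDS_BETWEEN_QUERIES = 30
ASSUMED_LLM_LATENCY_S = 5          # assumption, not a measured value

# Storage
bytes_per_chunk = EMBEDDING_DIMS * BYTES_PER_FLOAT
total_chunks = DOCS * CHUNKS_PER_DOC
embedding_bytes = total_chunks * bytes_per_chunk

# One-time embedding cost
total_tokens = total_chunks * TOKENS_PER_CHUNK
embedding_cost_usd = total_tokens / 1_000_000 * PRICE_PER_MILLION_TOKENS

# Query load. Little's law: requests in flight = arrival rate x time per request.
queries_per_second = CONCURRENT_USERS / SECONDS_BETWEEN_QUERIES
llm_calls_in_flight = queries_per_second * ASSUMED_LLM_LATENCY_S

print(f"bytes per chunk:      {bytes_per_chunk:,}")
print(f"embedding storage:    {embedding_bytes / 1e9:,.1f} GB")
print(f"tokens to embed:      {total_tokens:,}")
print(f"embedding cost:       ${embedding_cost_usd:,.2f}")
print(f"queries per second:   {queries_per_second:,.0f}")
print(f"LLM calls in flight:  {llm_calls_in_flight:,.0f}")
```

The last figure is the one worth stating in an interview: a rate of 333 queries/second is not the same as 333 concurrent calls, because each call stays open for its whole duration.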
System Architecture
                ┌──────────────────┐
                │     Web App      │
                │   (React/Next)   │
                └────────┬─────────┘
                         │
                         ▼
                ┌──────────────────┐
                │   API Gateway    │ ← Auth, rate limiting
                └────────┬─────────┘
                         │
           ┌─────────────┴──────────────┐
           │                            │
           ▼                            ▼
  ┌─────────────────┐          ┌─────────────────┐
  │  Query Service  │          │  Ingestion Svc  │
  │    (FastAPI)    │          │  (worker pool)  │
  └────────┬────────┘          └────────┬────────┘
           │                            │
    ┌──────┴──────┐             ┌───────┴──────┐
    │             │             │              │
    ▼             ▼             ▼              ▼
┌───────┐  ┌──────────────┐  ┌──────┐  ┌───────────────┐
│ Redis │  │  Vector DB   │  │ Blob │  │  Document DB  │
│ Cache │  │  (with ACL   │  │Store │  │  (Postgres)   │
│       │  │  filtering)  │  │      │  │               │
└───────┘  └──────┬───────┘  └──────┘  └───────────────┘
                  │
                  ▼
            ┌────────────┐
            │ Azure OAI  │
            │   GPT-4o   │
            └────────────┘

Step 3: Document Ingestion Pipeline
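The pipeline in this step calls a `recursive_chunk` helper that the lesson does not define. As a rough illustration of page-aware chunking with overlap, here is a minimal sketch that accepts the same keyword arguments. It is a fixed-size sliding window, not a true recursive splitter, and it assumes whitespace-separated words are an acceptable stand-in for tokens. The names `Chunk` and `chunk_page` are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Chunk:
    """Simplified stand-in for the lesson's DocumentChunk."""
    chunk_id: str
    doc_id: str
    text: str
    page_number: int
    chunk_index: int          # index within the page in this sketch
    metadata: dict = field(default_factory=dict)


def chunk_page(
    text: str,
    chunk_size: int,
    overlap: int,
    page_number: int,
    doc_id: str,
) -> list[Chunk]:
    """Split one page into overlapping windows of about `chunk_size` tokens.

    Assumption: words approximate tokens. A production pipeline would count
    tokens with the embedding model's own tokenizer.
    """
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")

    words = text.split()
    step = chunk_size - overlap   # how far each window advances
    chunks: list[Chunk] = []
    for index, start in enumerate(range(0, len(words), step)):
        window = words[start:start + chunk_size]
        chunks.append(
            Chunk(
                chunk_id=f"{doc_id}:p{page_number}:c{index}",
                doc_id=doc_id,
                text=" ".join(window),
                page_number=page_number,
                chunk_index=index,
            )
        )
        if start + chunk_size >= len(words):
            break   # this window already reached the end of the page
    return chunks


# A 1,000-word page with the lesson's settings (400-token chunks, 50 overlap).
page = " ".join(f"w{i}" for i in range(1000))
chunks = chunk_page(page, chunk_size=400, overlap=50, page_number=3, doc_id="doc-42")
print(len(chunks), [len(c.text.split()) for c in chunks])
```

Chunking page by page, as the pipeline does, keeps every chunk tied to exactly one page number, which is what makes page-level citations possible later. The trade-off is that a sentence spanning a page break is split across two chunks.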
# ingestion/pipeline.py
# parse_pdf, recursive_chunk, embed_batch, vector_store, and db are assumed
# to be defined elsewhere in the project.
from pathlib import Path
import asyncio
from dataclasses import dataclass


@dataclass
class DocumentChunk:
    chunk_id: str
    doc_id: str
    text: str
    page_number: int
    chunk_index: int
    metadata: dict


async def ingest_document(
    doc_id: str,
    file_path: Path,
    owner_id: str,
    team_ids: list[str],
) -> int:
    """Full ingestion pipeline. Returns number of chunks created."""
    # 1. Parse document into a list of per-page text
    pages = await parse_pdf(file_path)

    # 2. Chunk with page awareness (page numbers start at 1 for citations)
    chunks = []
    for page_num, page_text in enumerate(pages, start=1):
        page_chunks = recursive_chunk(
            text=page_text,
            chunk_size=400,  # tokens
            overlap=50,
            page_number=page_num,
            doc_id=doc_id,
        )
        chunks.extend(page_chunks)

    # 3. Embed all chunks in batches
    embeddings = await embed_batch(
        [c.text for c in chunks],
        batch_size=100,
    )

    # 4. Upsert to vector store with access metadata
    await vector_store.upsert(
        chunks=chunks,
        embeddings=embeddings,
        metadata={
            "doc_id": doc_id,
            "owner_id": owner_id,
            "team_ids": team_ids,  # access control tags
        },
    )

    # 5. Update document registry
    await db.execute(
        """
        UPDATE documents SET
            status = 'indexed',
            chunk_count = $1,
            indexed_at = NOW()
        WHERE id = $2
        """,
        len(chunks), doc_id,
    )
    return len(chunks)

Step 4: Access-Controlled Vector Search
The critical requirement: users only retrieve from documents they own or their team owns.
# query/retriever.py
async def retrieve_with_acl(
    query: str,
    user_id: str,
    team_ids: list[str],
    top_k: int = 8,
) -> list[DocumentChunk]:
    """Vector search filtered to documents the user can access."""
    # Embed the query
    query_embedding = await embed_single(query)

    # Build access filter: OR condition across user and teams
    access_filter = {
        "$or": [
            {"owner_id": {"$eq": user_id}},
            {"team_ids": {"$in": team_ids}},
        ]
    }

    # Vector search with pre-filter (reduces search space to accessible docs)
    results = await vector_store.search(
        embedding=query_embedding,
        filter=access_filter,
        top_k=top_k * 2,  # fetch extra for reranking
    )

    # Rerank with cross-encoder
    reranked = await rerank(query, results)
    return reranked[:top_k]

Critical: The access filter is applied at the vector database level (pre-filtering), not post-processing. This means users can never receive chunks from documents they don't have access to, even if the semantic similarity is high.
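To see what pre-filtering buys in practice, here is a toy comparison over a four-chunk in-memory index. It is a sketch under stated assumptions: the index layout, `can_access`, and both search functions are hypothetical stand-ins, not the API of any particular vector database. The access rule mirrors the filter above (owner, or any shared team).

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def can_access(meta: dict, user_id: str, team_ids: list[str]) -> bool:
    """Owner OR at least one shared team, as in the access filter."""
    return meta["owner_id"] == user_id or bool(set(meta["team_ids"]) & set(team_ids))


def search_prefilter(index, query, user_id, team_ids, top_k):
    """Drop inaccessible chunks first, then rank what is left."""
    allowed = [c for c in index if can_access(c, user_id, team_ids)]
    allowed.sort(key=lambda c: cosine(query, c["embedding"]), reverse=True)
    return allowed[:top_k]


def search_postfilter(index, query, user_id, team_ids, top_k):
    """Rank everything, cut to top_k, then drop inaccessible chunks."""
    ranked = sorted(index, key=lambda c: cosine(query, c["embedding"]), reverse=True)
    return [c for c in ranked[:top_k] if can_access(c, user_id, team_ids)]


index = [
    {"id": "finance-1", "owner_id": "bob", "team_ids": ["finance"], "embedding": [1.0, 0.0]},
    {"id": "finance-2", "owner_id": "bob", "team_ids": ["finance"], "embedding": [0.9, 0.1]},
    {"id": "eng-1", "owner_id": "alice", "team_ids": ["eng"], "embedding": [0.6, 0.8]},
    {"id": "eng-2", "owner_id": "carol", "team_ids": ["eng"], "embedding": [0.5, 0.9]},
]
query = [1.0, 0.0]  # closest to the finance chunks, which alice cannot see

pre = search_prefilter(index, query, user_id="alice", team_ids=["eng"], top_k=2)
post = search_postfilter(index, query, user_id="alice", team_ids=["eng"], top_k=2)
print([c["id"] for c in pre])   # ['eng-1', 'eng-2']
print([c["id"] for c in post])  # []
```

Post-filtering here returns nothing, because the two best matches belong to another team and are discarded after the top-k cut. Pre-filtering still returns the best chunks the user is allowed to see, and it never lets restricted text reach application code where a missed check could leak it.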
Step 5: Query Service with Citations
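The service in this step calls an `extract_citations` helper that the lesson leaves undefined. One plausible approach, sketched here, is to scan the answer for `[n]` markers and map each back to the n-th excerpt. `RetrievedChunk` and `CitationRecord` are hypothetical dataclass stand-ins so the example runs with the standard library alone; in the service itself the function would build the pydantic `Citation` model instead.

```python
import re
from dataclasses import dataclass


@dataclass
class RetrievedChunk:
    """Hypothetical chunk shape: only the fields a citation needs."""
    doc_id: str
    doc_title: str
    page_number: int
    text: str


@dataclass
class CitationRecord:
    """Stand-in with the same fields as the lesson's Citation model."""
    doc_id: str
    doc_title: str
    page_number: int
    excerpt: str


def extract_citations(
    answer: str,
    chunks: list[RetrievedChunk],
    excerpt_chars: int = 200,
) -> list[CitationRecord]:
    """Return one citation per distinct [n] marker, in order of first use."""
    citations: list[CitationRecord] = []
    seen: set[int] = set()
    for match in re.finditer(r"\[(\d+)\]", answer):
        number = int(match.group(1))
        # Markers are 1-based. Skip repeats and numbers with no matching excerpt.
        if number in seen or not 1 <= number <= len(chunks):
            continue
        seen.add(number)
        chunk = chunks[number - 1]
        citations.append(
            CitationRecord(
                doc_id=chunk.doc_id,
                doc_title=chunk.doc_title,
                page_number=chunk.page_number,
                excerpt=chunk.text[:excerpt_chars],
            )
        )
    return citations


chunks = [
    RetrievedChunk("doc-1", "Annual Report", 12, "Operating costs were flat year over year."),
    RetrievedChunk("doc-2", "Q3 Summary", 4, "Revenue grew in the third quarter."),
]
answer = "Revenue grew in Q3 [2]. Costs were flat [1][2]. See also [7]."
citations = extract_citations(answer, chunks)
print([(c.doc_title, c.page_number) for c in citations])
# [('Q3 Summary', 4), ('Annual Report', 12)]
```

Ignoring out-of-range markers such as `[7]` matters because the model can emit citation numbers that were never in the prompt. Dropping them is safer than showing a citation that points at nothing.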
# query/service.py
# openai_client, semantic_cache, retrieve_with_acl, and extract_citations are
# assumed to be defined elsewhere in the project.
from pydantic import BaseModel

SYSTEM_PROMPT = """Answer the user's question based on the provided document excerpts.
Rules:
- ONLY use information from the provided excerpts
- Cite sources using [1], [2], etc. corresponding to the excerpt numbers
- If the excerpts don't contain the answer, say so clearly
- Do not make up information not in the excerpts"""


class Citation(BaseModel):
    doc_id: str
    doc_title: str
    page_number: int
    excerpt: str


class QueryResponse(BaseModel):
    answer: str
    citations: list[Citation]
    from_cache: bool


async def answer_question(
    question: str,
    user_id: str,
    team_ids: list[str],
) -> QueryResponse:
    # 1. Check semantic cache (user-scoped: different users have different doc access)
    cache_key = f"{user_id}:{question}"
    cached = await semantic_cache.get(cache_key)
    if cached:
        # The cached dict already holds from_cache=False, so override it in the
        # dict; passing from_cache as a second keyword would raise TypeError.
        return QueryResponse(**{**cached, "from_cache": True})

    # 2. Retrieve relevant chunks
    chunks = await retrieve_with_acl(question, user_id, team_ids)
    if not chunks:
        return QueryResponse(
            answer="I couldn't find relevant information in your documents.",
            citations=[],
            from_cache=False,
        )

    # 3. Format context with numbered excerpts for citation
    context_parts = []
    for i, chunk in enumerate(chunks):
        # DocumentChunk has no doc_title field, so read the title from its
        # metadata (assumed to be stored there) and fall back to the doc id.
        title = chunk.metadata.get("doc_title", chunk.doc_id)
        context_parts.append(
            f"[{i+1}] From '{title}', page {chunk.page_number}:\n{chunk.text}"
        )
    context = "\n\n".join(context_parts)

    # 4. Generate answer with citation instructions
    response = await openai_client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {
                "role": "user",
                "content": f"Excerpts:\n\n{context}\n\nQuestion: {question}",
            },
        ],
        temperature=0.1,
    )
    answer = response.choices[0].message.content or ""

    # 5. Extract citations from answer
    citations = extract_citations(answer, chunks)
    result = QueryResponse(
        answer=answer,
        citations=citations,
        from_cache=False,
    )

    # 6. Cache the result (1-hour TTL for document content)
    await semantic_cache.set(cache_key, result.model_dump(), ttl=3600)
    return result

Step 6: Handling Scale — 10,000 Concurrent Users
At 333 queries/second, naive serial processing won't work. Key strategies:
1. Semantic cache (biggest win)
- 40-60% of enterprise queries are repeats (same user asking same question, or team members asking similar things)
- Cache key: {user_id}:{question} (user-scoped for access control)
- 60% cache hit rate reduces LLM calls to 133/second
2. Horizontal scaling of query service
- Stateless FastAPI workers: scale to N replicas
- Azure Container Apps: scale on HTTP queue depth
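- 20 replicas × 10 concurrent requests each = 200 concurrent LLM calls; at roughly 5 seconds per call that sustains about 40 calls/second, so reaching 133/second takes about 67 such replicas or higher per-replica concurrency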
3. Request coalescing
- If 50 users ask "What is the Q3 revenue?" simultaneously, deduplicate: run once, fan out results
- Requires distributed locking (Redis) to prevent duplicate LLM calls
4. Streaming responses
- Stream GPT-4o output token-by-token
- Time-to-first-token under 1 second even when total response takes 5 seconds
- Dramatically improves perceived latency
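Request coalescing (strategy 3) is the least familiar of these, so here is a minimal single-process sketch of the idea. It swaps the Redis distributed lock described above for an in-memory map of in-flight tasks, which only deduplicates within one worker; across replicas the same pattern would need the shared lock. `SingleFlight` and `fake_llm_answer` are hypothetical names, and the key is assumed to already encode the caller's access scope so that users with different permissions never share an answer.

```python
import asyncio


class SingleFlight:
    """Share one in-flight computation among callers that use the same key."""

    def __init__(self) -> None:
        self._in_flight: dict[str, asyncio.Task] = {}

    async def run(self, key: str, factory):
        task = self._in_flight.get(key)
        if task is None:
            # First caller for this key: start the work and register it.
            task = asyncio.ensure_future(factory())
            self._in_flight[key] = task
            # Forget the task once it finishes so later calls start fresh.
            task.add_done_callback(lambda _: self._in_flight.pop(key, None))
        # shield() keeps one cancelled waiter from cancelling the shared task.
        return await asyncio.shield(task)


call_count = 0


async def fake_llm_answer() -> str:
    """Stands in for a slow LLM call and counts how often it really runs."""
    global call_count
    call_count += 1
    await asyncio.sleep(0.05)
    return "stub answer"


async def main() -> list[str]:
    flights = SingleFlight()
    key = "team-finance:what is the q3 revenue?"
    # Fifty simultaneous requests for the same question.
    return await asyncio.gather(
        *[flights.run(key, fake_llm_answer) for _ in range(50)]
    )


answers = asyncio.run(main())
print(len(answers), call_count)  # 50 1
```

All fifty callers receive the answer, but the expensive call runs once. Unlike a cache, this helps even on the very first occurrence of a question, which is why the two techniques are complementary.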
Step 7: Multi-Document Queries
When a user asks "Compare Q3 results across all the annual reports I uploaded":
import asyncio


async def multi_document_query(
    question: str,
    user_id: str,
    doc_ids: list[str],  # user selected specific docs
) -> QueryResponse:
    # Retrieve from each document separately. doc_ids come from the client,
    # so retrieve_from_doc (assumed helper) must enforce the same access rules
    # as retrieve_with_acl; user_id is passed through for that check.
    per_doc_chunks = await asyncio.gather(*[
        retrieve_from_doc(question, doc_id, user_id=user_id, top_k=3)
        for doc_id in doc_ids
    ])

    # Build multi-document context
    context = format_multi_doc_context(per_doc_chunks)

    # Generate comparative answer
    return await generate_with_context(question, context)

Step 8: MVP vs Full Build
MVP (3 weeks):
- PDF upload → parse → chunk → embed → store in FAISS (file-based)
- Simple user auth, all documents shared within account
- No semantic cache
- Basic citation in response
v1.0 (3 months):
- Azure AI Search (managed, scalable vector store)
- Team-based access control
- Semantic cache (Redis)
- Streaming responses
- Citation extraction and display
v2.0 (6 months):
- Multi-document comparison queries
- Table and chart extraction from PDFs
- Conversation memory (follow-up questions)
- Analytics: most-asked questions, document coverage gaps