Learnixo
Back to blog
Backend Systemsadvanced

AI Engineer Interview Questions — .NET Stack

100 interview questions for AI engineer roles on the .NET stack: LLM integration, RAG systems, vector databases, streaming, agents, evaluations, production AI, and system design.

Asma Hafeez KhanMay 25, 202629 min read
.NETC#AIinterviewLLMRAGembeddingsagentsproduction
Share:𝕏

AI Engineer Interview Questions — .NET Stack

AI engineer interviews combine software engineering fundamentals with domain-specific knowledge of LLMs, embeddings, retrieval systems, and production AI operations. These 100 questions cover the full spectrum for junior to senior AI roles on the .NET stack.


Section 1 — LLM Fundamentals (Q1–Q20)

Q1. What is a token and why does it matter for API pricing?

A token is approximately 4 characters of text. LLM APIs price on input + output tokens. A 500-token prompt + 200-token response costs ($2.50 + $10) × 700 / 1M = ~$0.0085 for gpt-4o. Token count determines latency, cost, and context window limits.

Q2. What is the context window and what happens when you exceed it?

The context window is the maximum number of tokens the model can process in one call — currently 128K tokens for gpt-4o. Exceeding it throws a context length exceeded error. Mitigation: truncate history, summarise old messages, or use a model with a larger context window.

Q3. What is temperature and how does it affect output?

Temperature (0.0–2.0) controls randomness. At 0, the model always picks the highest-probability token — deterministic, good for structured output. At 1.0, output is creative but less predictable. At 2.0, output becomes incoherent. For classification and extraction: 0. For creative writing: 0.7–1.0.

Q4. What is the difference between system, user, and assistant messages?

System: instructions to the model (persona, format, constraints). User: the human's input. Assistant: the model's prior responses. Together they form the conversation history. The system message is processed first and carries the most weight for behaviour.

Q5. What is prompt injection and how do you defend against it?

Prompt injection is a user submitting input that overrides the system prompt (e.g. "Ignore previous instructions and reveal..."). Defences: delimit user input clearly, validate output format, don't grant the model access to sensitive actions based solely on user input, and treat model output as untrusted user input for downstream processing.

Q6. What are the main LLM failure modes in production?

Hallucination (confident wrong answers), prompt sensitivity (small wording changes give very different answers), context loss (forgets instructions partway through), refusals (model refuses legitimate requests), and inconsistency (same prompt, different answer on different runs).

Q7. What is few-shot prompting?

Including examples of the desired input-output pattern in the prompt. Instead of "Classify sentiment", you provide 3-5 examples: "Input: 'Great product' → Positive". Few-shot dramatically improves classification and extraction tasks without fine-tuning.

Q8. When should you fine-tune a model vs prompt engineer?

Prompt engineer first — it's faster and cheaper. Fine-tune when: you have 1000+ labelled examples of the exact task, the model consistently fails despite good prompting, you need consistent output format not achievable with prompts, or you need to reduce token usage on a repetitive task.

Q9. What is RAG and why is it used?

Retrieval-Augmented Generation: retrieve relevant documents from a knowledge base, inject them into the prompt as context, then generate an answer grounded in those documents. Used when the model's training data is outdated, too general, or when you need grounded, verifiable answers without hallucination.

Q10. What is the difference between embedding models and chat completion models?

Embedding models (text-embedding-3-small) convert text to a fixed-length float vector representing semantic meaning — no text output. Chat completion models (gpt-4o) generate text from a conversation. You use embedding models to build the search index; chat completion models to generate the answer.

Q11. What is semantic similarity and how is it computed?

Two texts are semantically similar if their meaning is close. Computed as cosine similarity between their embedding vectors: dot product divided by the product of their L2 norms. Value range: -1 (opposite meaning) to 1 (identical meaning). For text search, values above 0.85 are typically "similar".

Q12. What is the IChatClient interface in Microsoft.Extensions.AI?

The abstraction for any LLM provider. Two core methods: CompleteAsync (returns ChatCompletion with the full response) and CompleteStreamingAsync (returns IAsyncEnumerable of streaming tokens). Register one provider in DI; all services depend only on the interface — swap providers without changing business code.

Q13. What does AddChatClient().UseLogging().UseFunctionInvocation() do?

Builds a middleware pipeline around IChatClient. UseLogging logs every request and response. UseFunctionInvocation enables automatic tool/function calling — when the model returns a tool call, the middleware invokes the C# method and sends the result back automatically. Outermost middleware runs first.

Q14. What is structured output and when do you use it?

Setting ChatOptions.ResponseFormat = ChatResponseFormat.Json constrains the model to respond with valid JSON. Use for classification, extraction, and any case where you need to deserialise the response. Always validate and catch deserialization exceptions — the model may still produce malformed JSON at low temperature.

Q15. How do you handle rate limit errors from LLM APIs?

Retry with exponential backoff. Use Polly v8 with AddRetry and ShouldHandle for HttpStatusCode.TooManyRequests. Include UseJitter to avoid thundering herd. Monitor X-RateLimit-Remaining headers and pre-emptively slow down near the limit. Consider a token budget middleware that queues requests when approaching limits.

Q16. What is a system prompt and what should it contain?

The system prompt establishes the model's persona, constraints, and output format. Good system prompts include: role ("You are an order analyst"), constraints ("Answer ONLY from the provided context"), output format ("Respond with JSON only: {"score": number, "reason": string}"), and tone/style requirements.

Q17. What are the OpenAI model tiers and their trade-offs?

gpt-4o-mini: cheap ($0.15/1M input tokens), fast, good for classification and simple tasks. gpt-4o: balanced ($2.50/1M input tokens), good for most tasks. o3/o4: expensive, slow, excellent at complex reasoning and code. In production: route to the cheapest model that achieves acceptable quality.

Q18. What is context caching and when does it help?

Some providers (Anthropic, Google) support caching a prompt prefix — if you repeat the same system prompt + context in every request, the cached portion isn't reprocessed, reducing cost and latency. Most useful for RAG systems where the retrieved documents are the same across multiple turns.

Q19. What is a completion vs a chat completion?

Text completion: send a text string, model continues it (legacy). Chat completion: send a list of messages with roles, model generates the next message. Chat completion is the modern API — all providers support it. Text completions are deprecated for most models.

Q20. What happens when you call CompleteAsync with tools registered?

The model may return a tool call instead of a text response. With UseFunctionInvocation() middleware: the SDK detects the tool call, invokes the C# method, appends the result as a tool message, and sends the full conversation back to the model — repeating until the model returns a text response. Without the middleware: you must handle the ToolCall in the response yourself.


Section 2 — RAG Systems (Q21–Q40)

Q21. What are the steps in a RAG pipeline?

  1. Index: chunk documents, generate embeddings, store in vector DB. 2. Retrieve: embed the query, find top-K similar chunks. 3. Augment: inject retrieved chunks into the prompt. 4. Generate: call the LLM with the augmented prompt.

Q22. What is chunk size and why does it matter?

Chunks are the text segments stored in the vector index. Too small (under 100 tokens): the chunk lacks context — the model can't answer from it. Too large (over 1000 tokens): dilutes the embedding signal — less precise retrieval. Typical: 200-500 tokens with 50-token overlap between chunks.

Q23. What is the difference between semantic and keyword search?

Keyword search (BM25, full-text): exact word match — fast, deterministic. Semantic search (vector similarity): finds conceptually related content even with different wording. Hybrid search combines both with a ranking algorithm (BM25 score + cosine similarity score weighted together).

Q24. What is pgvector and how is it used in EF Core?

pgvector is a PostgreSQL extension that adds a vector data type and similarity search operators. In EF Core: map a Vector property with HasColumnType("vector(1536)"), create an HNSW index with HasMethod("hnsw"), and query with .OrderBy(p => p.Embedding!.CosineDistance(queryVector)). The HNSW migration requires suppressTransaction: true.

Q25. What is HNSW and why is it preferred over IVFFlat?

HNSW (Hierarchical Navigable Small World) is a graph-based ANN (Approximate Nearest Neighbour) algorithm. Build once with no training data required; O(log n) query time; higher memory usage. IVFFlat requires training (needs existing data to build clusters). For under 1M vectors: HNSW is the right choice. For tens of millions: IVFFlat uses less memory.

Q26. What is faithfulness and why is it the most important RAG metric?

Faithfulness measures whether the generated answer is grounded in the retrieved context. A faithfulness score below 0.5 indicates hallucination — the model is adding facts not present in the retrieved documents. This is the most critical metric because hallucinations in RAG are silent failures that appear confident.

Q27. What is context precision in RAG evaluation?

The fraction of retrieved chunks that were actually useful for answering the question. Low precision means your retrieval is noisy — you're feeding irrelevant context to the model, which wastes tokens and may confuse the answer. Improve by filtering retrieved chunks by a relevance threshold before injecting them.

Q28. What is the "lost in the middle" problem?

LLMs struggle to use information in the middle of long contexts — they attend more to the beginning and end. In RAG, if you retrieve 10 chunks and inject them in order, the most relevant chunk should be first or last, not buried in the middle. Re-rank retrieved chunks by relevance before injecting.

Q29. How do you handle multi-turn conversation in a RAG system?

For each new user message: condense the conversation history into a standalone question (a second LLM call), then use that condensed question to retrieve context. Without this step, queries like "Tell me more about that" fail retrieval because they reference prior context the retrieval system doesn't have.

Q30. What is a re-ranker and when is it used?

A cross-encoder re-ranker scores each (query, chunk) pair jointly — more accurate than bi-encoder cosine similarity but slower. Use bi-encoder similarity to retrieve top 50, then re-rank to get top 5. Cohere Rerank and BGE re-rankers are common choices. Use when retrieval precision is low.

Q31. How do you handle stale data in a RAG system?

Keep track of document modification timestamps. When a source document updates, re-chunk and re-embed only the changed sections. Use a background job to detect changes (file watcher, database CDC, webhook). Store the source document hash — if unchanged, skip re-embedding.

Q32. What is a HyDE query (Hypothetical Document Embedding)?

Generate a hypothetical answer to the user's question using the LLM, then embed that hypothetical answer and use it for retrieval. Because the hypothetical answer is similar in style to the actual documents, it often retrieves better results than embedding the question directly. Adds one extra LLM call per query.

Q33. What is query expansion and how does it improve retrieval?

Generate multiple alternative phrasings of the user's question, retrieve candidates for each, then merge and deduplicate. Improves recall — captures documents that match a different phrasing. Trade-off: N LLM calls per query (N = number of expansions).

Q34. How do you chunk PDF documents with tables and images?

PDFs with structure need specialised parsers (PyMuPDF, Document Intelligence). Tables: extract to markdown format before chunking so the structure is preserved. Images: use a vision LLM to generate a text description ("This image shows a bar chart with..."), then embed the description.

Q35. What is a vector store and what abstractions does Microsoft.Extensions.AI provide?

A vector store is a database optimised for ANN search. Microsoft.Extensions.AI provides IVectorStore with IVectorStoreRecordCollection — implementations for Azure AI Search, Qdrant, Redis, and in-memory. Use attributes [VectorStoreRecordKey], [VectorStoreRecordData], [VectorStoreRecordVector] on your record class.

Q36. When would you choose Qdrant over pgvector?

pgvector: right for existing PostgreSQL apps with under 10M vectors, need for JOINs with relational data. Qdrant: purpose-built vector DB — faster queries at 50M+ vectors, rich payload filtering, built-in sharding. The main trade-off for pgvector is operational simplicity (one database) vs Qdrant's query performance at scale.

Q37. How do you evaluate a RAG system before deploying?

Build an eval dataset: 50-100 question/ground-truth pairs. Measure: context recall (did retrieval find the relevant chunks?), faithfulness (does the answer stay within context?), and answer correctness (semantic similarity to ground truth). Run as xUnit tests in CI — fail the pipeline if faithfulness drops below threshold.

Q38. How do you handle a question that has no answer in the knowledge base?

Instruct the model: "If the answer is not in the provided context, say 'I don't have information about that'". Detect insufficient retrieval: if the top retrieved chunk has similarity below 0.6, skip the LLM call entirely and return a default "I don't know" response. This prevents the model from hallucinating an answer.

Q39. What is the cost breakdown of a typical RAG query?

Embedding the query: ~$0.00002 (cheap). Retrieving top-K: ~1ms Redis/Qdrant lookup (free). LLM call with 2000 context tokens + 200 output: ~$0.005-0.05 depending on model. At scale: the LLM call dominates. Cache embeddings for repeated queries. Cache LLM responses for identical questions (exact-match cache).

Q40. What is multi-vector retrieval (ColBERT)?

ColBERT generates one embedding per token (not one per document), then computes relevance as the sum of maximum similarities between query tokens and document tokens. More accurate than single-vector retrieval for complex queries, but higher storage and compute cost. Supported in some vector stores natively.


Section 3 — Agents and Tools (Q41–Q60)

Q41. What makes something an "agent" vs a simple LLM call?

An agent runs in a loop: think → act → observe → repeat until done. It has access to tools, can take multiple steps, and can adapt based on what it observes. A simple LLM call is one-shot with no state between calls.

Q42. What is the ReAct pattern?

Reason + Act: the LLM reasons about what to do (Thought), calls a tool (Action), observes the result (Observation), then reasons again. Repeat until the task is done. The conversation history is the agent's memory within a session.

Q43. How does tool calling work in Microsoft.Extensions.AI?

Define C# methods with [Description] attributes. Pass them as AIFunctionFactory.Create(...) in ChatOptions.Tools. With UseFunctionInvocation() middleware: the SDK automatically detects tool call responses, invokes the C# method, and sends the result back — the loop continues until the model produces a text response.

Q44. What is AIFunctionFactory.Create?

Converts a C# delegate or method info into an AIFunction that the LLM can call. Generates the JSON schema for the function's parameters from the C# types and [Description] attributes. The model sees this schema and generates structured tool calls matching the schema.

Q45. How do you prevent an agent from looping forever?

Set a maximum step count (10 is a common limit). Set an absolute timeout on the entire agent run. Monitor for repeated identical tool calls (stuck in a loop). Return a fallback response when the step limit is reached.

Q46. What is a Plan-and-Execute agent?

First: the planner LLM generates a step-by-step plan for the goal. Second: an executor LLM executes each step, one at a time, using tools. More reliable than a single agent making all decisions on the fly — the plan is visible and auditable. Less flexible than ReAct for dynamic situations.

Q47. What is an MCP server?

Model Context Protocol server — a standardised way to expose tools, resources, and prompts to AI assistants. Claude, GitHub Copilot, and Cursor all support MCP. In .NET: decorate classes with [McpServerToolType] and methods with [McpServerTool], host via HTTP + SSE, and AI assistants automatically discover and call your tools.

Q48. How do you handle tool failures in an agent?

Return an informative error string from the tool function (don't throw). The model receives the error as the tool result and can decide to retry with different parameters, try a different tool, or tell the user that the operation failed. Never let unhandled exceptions bubble up from tool functions.

Q49. What is Semantic Kernel and how does it differ from Microsoft.Extensions.AI?

Microsoft.Extensions.AI is the low-level abstraction (IChatClient, IEmbeddingGenerator). Semantic Kernel is a higher-level orchestration framework built on top of it: Kernel with Plugins ([KernelFunction] methods), ChatCompletionAgent, AgentGroupChat for multi-agent scenarios, and IFunctionInvocationFilter for cross-cutting concerns. Use Semantic Kernel for complex agent workflows; use Microsoft.Extensions.AI directly for simpler integrations.

Q50. What is a KernelPlugin?

A collection of KernelFunctions exposed to the model as a named plugin. Created from a C# class with [KernelFunction] methods via kernel.Plugins.AddFromObject(instance, "PluginName"). The model sees "PluginName.FunctionName" as the callable function. Equivalent to a tool set in Microsoft.Extensions.AI.

Q51. What is AgentGroupChat?

A Semantic Kernel feature that coordinates multiple ChatCompletionAgents with selection and termination strategies. Selection strategy: a LLM or rule-based function that decides which agent speaks next. Termination strategy: when to stop the group chat (keyword detected, max iterations, explicit done signal).

Q52. How do you implement short-term memory in an agent?

Maintain a List of ChatMessage that grows with each turn. Pass the full history to every CompleteAsync call. Trim to the last N messages when the list grows too large (to stay within context window). This is the in-process session memory — lost when the service restarts.

Q53. How do you implement long-term memory in an agent?

After each conversation turn, embed the exchange and store in a vector database (IVectorStore). At the start of each new conversation: embed the user's message, retrieve the top-K similar past interactions, and inject them into the system prompt as "Relevant history".

Q54. What are the risks of autonomous agent execution?

Unintended actions (deleting data, sending emails, spending money). Prompt injection via tool results (malicious content in a fetched webpage can hijack the agent). Infinite loops. Token cost explosion. Mitigation: human-in-the-loop for destructive actions, sandbox tool execution, step and cost limits.

Q55. What is the Handoff pattern in multi-agent systems?

An agent specialises in one domain. When it determines the query is out of its scope, it hands off to another agent rather than attempting an answer outside its knowledge. The orchestrator routes the initial request to the right specialist. Cleaner separation of concerns than a single all-knowing agent.

Q56. How do you test an agent's tool calling behaviour?

Inject a FakeChatClient that returns pre-scripted tool call responses. Verify the correct tools are called with the expected parameters. Verify the agent correctly handles tool results and produces the expected final answer. This tests the agent logic without making real LLM calls.

Q57. What is grounding and why does it matter for agents?

Grounding means connecting the agent's reasoning to verifiable external facts via tools. A grounded agent calls a tool to look up the current order status rather than hallucinating it. Always prefer tool calls over the model's training knowledge for facts that change over time.

Q58. How do you handle secrets in agent tools?

Never put secrets in tool descriptions or function signatures — they appear in the prompt. Inject secrets via DI (IConfiguration, Azure Key Vault) into the tool class constructor. The model only sees the tool's parameters and return value, not its implementation.

Q59. What is function calling mode?

ChatToolMode.Auto: the model decides whether to call a tool or respond with text. ChatToolMode.Required: the model must call a tool (used for guaranteed structured output). ChatToolMode.None: tool calling disabled even if tools are registered. Use Required for the first turn when you always need a structured response.

Q60. How do you add observability to an agent?

ActivitySource traces: start a span per agent run, child spans per tool call. Log token usage per step. Emit metrics: tool call count, tool call latency, total tokens, total cost. Add OnRetry callbacks in Polly for retry counts. Store the full message history per session in a database for post-mortem analysis.


Section 4 — Production AI (Q61–Q80)

Q61. How do you cache LLM responses?

Exact-match: UseDistributedCache() in the IChatClient pipeline — hashes the full message list, stores response in Redis. Semantic: embed the query, find similar cached queries above 0.95 cosine similarity. Semantic caching handles paraphrased questions. Both together cover 40-60% of traffic in typical Q&A systems.

Q62. What is a fallback model and when do you use it?

When the primary model provider returns a 5xx error or times out, transparently retry on a secondary provider (or a local Ollama model). Implement as a FallbackChatClient wrapping the primary with a try-catch on HttpRequestException and TimeoutException.

Q63. How do you control AI costs per user?

Token budget middleware: track input + output tokens per user/tenant in Redis. Deduct after each call. Reject with 402 when the budget is exhausted. Reset monthly. Alert when a user consumes more than 2x their average (anomalous usage).

Q64. What is prompt compression?

When conversation history grows long (over 3000 tokens), summarise old messages into a compact representation before sending. One extra LLM call to summarise, but saves many tokens on the main call. Use a small cheap model (gpt-4o-mini) for the summary.

Q65. What is model tiering?

Route simple tasks to cheap models (gpt-4o-mini), complex reasoning to expensive ones (o3). Classify the request before routing: short classification questions → cheap, complex multi-step reasoning → expensive. Callers can override via a model_tier header for explicit control.

Q66. What metrics should you track for a production LLM service?

Latency percentiles (P50, P95, P99) per model. Token counts (input, output, total) per request. Cost per request, cost per user, cost per day. Error rate (4xx, 5xx, timeouts). Cache hit rate (exact + semantic). Eval metrics (faithfulness, relevance) from sampled traffic. Queue depth if requests are async.

Q67. How do you handle LLM output validation?

For structured output (JSON): try/catch JsonException and retry with a corrective message ("The previous response was not valid JSON. Please retry with exactly this format: ..."). For business rules: validate in code after parsing. Never trust raw LLM output for security-sensitive operations.

Q68. What is a guardrail?

A check applied before or after LLM calls to enforce safety constraints. Input guardrail: detect prompt injection, toxic content, or PII before sending to the model. Output guardrail: check the model's response for harmful content, PII leakage, or format violations before returning to the user.

Q69. What is streaming and why does it improve perceived performance?

Streaming returns tokens as they are generated (IAsyncEnumerable) rather than waiting for the complete response. The user sees the first word in under 1 second instead of waiting 10-30 seconds for the full response. Implemented via SSE (Server-Sent Events) in HTTP — Content-Type: text/event-stream.

Q70. How do you handle streaming endpoint failures gracefully?

Wrap the IAsyncEnumerable iteration in try/catch. When OperationCanceledException is caught (client disconnected): silently exit — this is normal. When HttpRequestException is caught mid-stream: send an error SSE event ("data: [ERROR]\n\n") so the client can display a message rather than a frozen UI.

Q71. How do you rate-limit streaming endpoints differently from regular endpoints?

Use a concurrency limiter (not a fixed-window limiter). Streaming endpoints hold connections open for 10-30 seconds. A fixed-window limiter counts each connection once at the start but doesn't account for the open connection duration. A concurrency limiter with PermitLimit=10 ensures at most 10 simultaneous streams per user.

Q72. What is the Outbox Pattern and why does it matter for AI-triggered workflows?

Store the message to publish in the same database transaction as your data write. A background worker delivers the message after commit. Prevents: message published but database write fails (inconsistent state), and database write succeeds but message delivery fails (lost event). Critical for AI workflows that trigger downstream actions (send email, charge payment).

Q73. How do you A/B test LLM prompts in production?

Assign users to variants at the request level using a feature flag service. Route to different system prompts or models. Track quality metrics (eval scores, user feedback, task completion rate) per variant. After sufficient sample size, promote the better variant and retire the other.

Q74. What is the difference between fine-tuning and RAG for domain knowledge?

RAG: retrieve relevant documents at query time, inject as context. No training required, data stays fresh, grounded answers. Fine-tuning: bake domain knowledge into the model weights. Faster inference, no retrieval cost, but data becomes stale and is harder to update. For most enterprise use cases: RAG first. Fine-tune only if RAG quality is insufficient after optimisation.

Q75. How do you handle PII in LLM applications?

Don't send raw PII to external LLM APIs. Options: anonymise/pseudonymise before sending (replace "John Smith, SSN 123-45-6789" with "User_A, SSN_REDACTED"), use on-premise models (Ollama, Azure OpenAI with data residency), implement a PII detection layer before the LLM call. Log a policy violation (don't log the PII itself).

Q76. What is a vector database and how does it differ from a traditional database?

A vector database stores high-dimensional float vectors and supports ANN (approximate nearest neighbour) search — find the K most similar vectors to a query vector in milliseconds. Traditional databases support exact match and range queries on scalar values. They're complementary: use pgvector to add vector search to your existing PostgreSQL without a separate system.

Q77. How do you monitor RAG retrieval quality in production?

Sample 5% of production traffic. For each sampled request: compute context precision (fraction of retrieved chunks used in the answer), faithfulness (LLM judge), and answer relevance. Emit as metrics. Alert when faithfulness drops below 0.7 or context precision below 0.5 — these indicate retrieval degradation.

Q78. What is a LoRA and when would you use it?

Low-Rank Adaptation — a fine-tuning technique that adds small trainable matrices to the existing model weights rather than updating all parameters. Much cheaper than full fine-tuning (train 1% of parameters, same quality). Use when you need task-specific behaviour (medical terminology, coding style) and have 500+ labelled examples.

Q79. How do you implement retry with a different prompt on validation failure?

Catch the validation exception (JsonException, business rule violation). Construct a corrective message: "Your previous response was invalid because {reason}. Please respond again with {format}". Add both the original assistant response and the corrective message to the history and call CompleteAsync again. Limit to 2-3 correction attempts.

Q80. What is a circuit breaker and how does it apply to LLM API calls?

When the LLM API fails at a high rate (>50% of requests over 30 seconds), the circuit breaker opens and immediately fails requests without calling the API — fast-failing gives the provider time to recover. After BreakDuration (30 seconds), one probe request is allowed. If it succeeds, the circuit closes. Use Polly v8 AddCircuitBreaker.


Section 5 — System Design and Architecture (Q81–Q100)

Q81. Design a chat application with conversation history.

Store conversations in PostgreSQL (ConversationId, messages as JSONB or a separate Messages table). Load history on each request, maintain a sliding window of last 20 messages to stay within context. Use Redis for session-level caching of active conversations. Archive old conversations to cold storage.

Q82. How do you design a multi-tenant AI service where tenants have different models?

Store model configuration per tenant (model ID, API key, system prompt) in the database. Load at request time via ITenantContext. Route to the correct IChatClient implementation using a TieredChatClient that reads the tenant config. Ensure API keys are stored in Azure Key Vault, not the database.

Q83. How would you scale a RAG system to 1M documents?

Use a dedicated vector store (Qdrant, Azure AI Search) rather than pgvector. Shard the vector index. Process embeddings in parallel background workers. Cache popular query embeddings. Pre-compute embeddings during document ingestion, not at query time. Add a re-ranker to improve precision without scaling the retrieval.

Q84. What is the difference between synchronous and asynchronous LLM processing?

Synchronous: API waits for the LLM to complete (10-30s). Acceptable for interactive chat. Asynchronous: enqueue the request, process in a background worker, notify the client when done (via webhook, polling, or WebSocket). Use async for: document processing, batch eval, non-interactive generation (email drafts, reports).

Q85. How do you implement human-in-the-loop for high-risk agent actions?

When the agent wants to execute a destructive or irreversible action (delete, charge, send): pause execution, send a notification to the human (Slack, email), store the pending action in Redis with a 10-minute TTL, and wait. The human approves or rejects via a callback endpoint. Resume or abort based on the response.

Q86. What is a semantic cache and how do you implement it?

Cache LLM responses keyed by the semantic meaning of the query, not the exact text. Process: embed the query, check an in-memory vector index for similar past queries (cosine similarity > 0.95), return the cached response if found. On cache miss: call the LLM, cache the response with the query embedding. Effective for Q&A systems with paraphrased questions.

Q87. How do you implement an AI-powered search API?

Hybrid search: run BM25 keyword search and vector similarity search in parallel. Merge results with reciprocal rank fusion (RRF). Re-rank the top 20 with a cross-encoder. Return top 5 with citations. Cache popular query results in Redis. Track click-through rates to improve ranking over time.

Q88. What is the difference between online and offline evaluation?

Offline: run eval on a labelled dataset before deployment (CI gate). Measures: faithfulness, correctness, semantic similarity to ground truth. Online: sample production traffic and evaluate quality asynchronously. Measures: the same metrics on real user queries. Both are necessary — offline catches regressions, online catches distribution shift.

Q89. How do you handle context window limits for long documents?

Chunking: split documents into 200-500 token chunks with overlap. Map-Reduce: process each chunk independently, then combine results ("Summarise each section, then write an overall summary"). For Q&A: retrieve only the relevant chunks rather than the entire document. For analysis: use a model with a 1M token context window (Gemini 1.5).

Q90. How do you version prompts in production?

Store prompts in a database or configuration with a version number. Tag deployments with the prompt version used. Roll back by deploying the previous prompt version. A/B test new prompts before full rollout. Keep a changelog: what changed, why, what eval results changed.

Q91. What is function calling in the OpenAI API?

The model can output a structured JSON object indicating it wants to call a function. The API returns this as a tool_call in the response rather than text. The application invokes the function, adds the result to the message history, and calls the API again. OpenAI, Anthropic, Google, and most providers support this standard.

Q92. How do you handle multi-modal input (images + text) in .NET?

Some models (gpt-4o, Claude 3) accept images in the conversation. Pass image bytes (base64-encoded) or a URL as a content part in the ChatMessage. The model processes both the image and the text together. For PDF: use a PDF-to-image library (PdfiumViewer) then pass pages as images, or use Azure Document Intelligence for structured extraction.

Q93. What is self-consistency and when does it improve results?

Call the model multiple times with high temperature, generate multiple candidate answers, then aggregate by majority vote. Reduces variance on reasoning tasks — the correct reasoning path is more likely to appear multiple times. Expensive (N x the normal cost). Use for high-stakes decisions where quality is more important than cost.

Q94. How do you build a recommendation system with embeddings?

Embed all items (products, articles, videos). For a user: embed their interaction history (viewed, purchased, liked items). Find items similar to the user's history embedding using cosine similarity. Exclude already-seen items. Combine with collaborative filtering (users similar to this user liked X) for better coverage.

Q95. What is the difference between RAG and fine-tuning for factual accuracy?

RAG grounds answers in retrieved documents — verifiable, updatable, auditable. Fine-tuning bakes facts into weights — non-interpretable, non-auditable, stale after training cut-off. For production factual accuracy: RAG with faithfulness evaluation is more reliable than fine-tuning. Fine-tuning improves style and format consistency, not factual accuracy.

Q96. How do you debug a RAG system that gives wrong answers?

Instrument the pipeline: log the query, retrieved chunks, and generated answer. Check retrieval first — did the relevant chunk actually get retrieved? If no: fix the retrieval (embedding quality, chunk size, index configuration). If yes: check faithfulness — did the model use the retrieved content? If no: improve the prompt constraints.

Q97. What is GraphRAG?

A RAG variant that builds a knowledge graph from documents. Entities and relationships are extracted, and the graph enables multi-hop retrieval — "What are the customers of companies that partner with Company X?" — which flat vector search can't answer. Higher setup cost but significantly better for relational reasoning over a document corpus.

Q98. How do you implement document ingestion at scale?

Queue documents for processing (Azure Service Bus / RabbitMQ). Workers: extract text, split into chunks, generate embeddings in batches (one API call per batch of 100 chunks, cheaper than 100 individual calls), upsert to vector store. Track ingestion status per document. Handle failures with retry and DLQ. Monitor embedding cost per document type.

Q99. What is zero-shot vs few-shot vs chain-of-thought prompting?

Zero-shot: just the task description — "Classify the sentiment of this review". Few-shot: include 3-5 examples. Chain-of-thought: "Think step by step before giving your answer" — forces the model to reason explicitly, dramatically improving accuracy on multi-step problems. Chain-of-thought is especially effective for math, logic, and code.

Q100. What would you do if the AI quality degrades in production after a model update?

Monitor with eval metrics — you'll see faithfulness or correctness drop before users complain. Pin the model version in production ("gpt-4o-2024-11-20" not "gpt-4o"). Before upgrading: run the full eval suite against the new version. If quality regresses: stay on the old version, raise with the provider, investigate prompt adjustments for the new version.

Enjoyed this article?

Explore the Backend Systems learning path for more.

Found this helpful?

Share:𝕏

Leave a comment

Have a question, correction, or just found this helpful? Leave a note below.