AI Systems · Intermediate

Interview: LLMOps Scenario Questions

The most common LLMOps scenario questions asked in senior AI engineering interviews. Walk through real deployment, monitoring, and incident response scenarios with model answers.

Asma Hafeez Khan · May 15, 2026 · 7 min read
LLMOps · Interview Prep · AI Engineering · Scenario Questions

How LLMOps Interviews Work

Senior AI engineering interviews test LLMOps with scenario questions: the interviewer describes a production situation and asks how you'd handle it. There is rarely a single correct answer, but there are frameworks that demonstrate production experience.

The evaluator is looking for:

  • Whether you reach for monitoring data before guessing
  • Whether you think in systems (upstream dependencies, downstream effects)
  • Whether you know which Azure/Python tools to use
  • Whether your rollback plan is concrete, not hand-wavy

Scenario 1: Sudden Latency Spike

"Your LLM API's p95 latency jumped from 2s to 25s at 2 PM. The alert just fired. Walk me through your investigation."

Model Answer:

"First, I check whether it's affecting all requests or a subset — I'd query Application Insights:

KUSTO
requests
| where timestamp > ago(1h)
| where name contains "/api/chat"
| summarize p95=percentile(duration,95) by bin(timestamp, 5m)
| render timechart

If the spike affects all requests, I check the Azure OpenAI status page. If it's region-specific, Azure OpenAI in East US may be degraded. My mitigation: switch to a backup deployment in West US (I keep both configured in the client).

If Azure OpenAI looks healthy, I check whether it's our RAG retrieval step:

KUSTO
customMetrics
| where name in ("llm.latency_ms", "retrieval.latency_ms")
| summarize avg(value) by name, bin(timestamp, 5m)

If retrieval latency is also spiking, I check Azure AI Search — maybe a reindexing job is competing for resources.

In parallel with the investigation, I set a fallback immediately: route new requests to GPT-4o-mini (lower latency) until GPT-4o is healthy again."
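
A minimal sketch of that fallback path (hypothetical endpoints and deployment names; assumes the openai SDK's AzureOpenAI client):

Python
# Primary region first, then the backup region, then a lower-latency model.
from openai import AzureOpenAI

PRIMARY = AzureOpenAI(azure_endpoint="https://eastus-prod.openai.azure.com",
                      api_key="...", api_version="2024-06-01")
BACKUP = AzureOpenAI(azure_endpoint="https://westus-prod.openai.azure.com",
                     api_key="...", api_version="2024-06-01")

def chat_with_fallback(messages, timeout_s=10):
    attempts = [
        (PRIMARY, "gpt-4o"),       # normal path
        (BACKUP, "gpt-4o"),        # same model, backup region
        (PRIMARY, "gpt-4o-mini"),  # lower-latency model while GPT-4o is debugged
    ]
    last_error = None
    for client, deployment in attempts:
        try:
            return client.chat.completions.create(
                model=deployment, messages=messages, timeout=timeout_s)
        except Exception as exc:  # in production: catch APITimeoutError / APIStatusError
            last_error = exc
    raise last_error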


Scenario 2: Runaway Token Costs

"Your daily OpenAI bill jumped from $200 to $2,000 overnight. Nothing was deployed. What happened and how do you fix it?"

Model Answer:

"I start by identifying which endpoint is responsible:

KUSTO
customEvents
| where name == "llm_call_completed"
| where timestamp > ago(24h)
| summarize
    total_cost = sum(todouble(customDimensions["cost_usd"])),
    total_tokens = sum(toint(customDimensions["total_tokens"]))
    by tostring(customDimensions["endpoint"])
| order by total_cost desc

If one endpoint dominates, I check for a runaway retry loop. A retry path with no cap or backoff, triggered when a midnight batch job starts failing, would produce exactly this pattern.

I'd also check token counts per request:

KUSTO
customEvents
| where name == "llm_call_completed"
| summarize
    avg_tokens = avg(toint(customDimensions["total_tokens"])),
    max_tokens = max(toint(customDimensions["total_tokens"]))
| project avg_tokens, max_tokens

If max_tokens is huge (100k+), someone may have accidentally passed the entire document corpus as context instead of a subset.

Immediate fix: set max_tokens on every LLM call to cap runaway requests. Long-term: add a cost alert that fires at 150% of yesterday's spend."
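
A minimal sketch of the immediate fix, assuming the openai SDK and tiktoken, with hypothetical limits:

Python
import tiktoken
from openai import AzureOpenAI

client = AzureOpenAI(azure_endpoint="https://eastus-prod.openai.azure.com",
                     api_key="...", api_version="2024-06-01")
ENCODING = tiktoken.get_encoding("o200k_base")  # GPT-4o uses the o200k_base encoding

MAX_PROMPT_TOKENS = 8_000      # reject oversized contexts before spending money
MAX_COMPLETION_TOKENS = 1_024  # hard cap on completion length

def safe_chat(messages):
    prompt_tokens = sum(len(ENCODING.encode(m["content"])) for m in messages)
    if prompt_tokens > MAX_PROMPT_TOKENS:
        raise ValueError(f"Prompt is {prompt_tokens} tokens; likely a bug upstream")
    return client.chat.completions.create(
        model="gpt-4o",
        messages=messages,
        max_tokens=MAX_COMPLETION_TOKENS,
    )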


Scenario 3: Rolling Back a Bad Deployment

"You deployed a new version with an updated system prompt. Users are now reporting that the chatbot gives incorrect drug interaction warnings. How do you respond?"

Model Answer:

"This is a patient safety issue — I roll back immediately, before any investigation.

Bash
# Instant traffic shift back to previous revision
az containerapp ingress traffic set \
  --name pharmabot \
  --resource-group pharmabot-rg \
  --revision-weight pharmabot--v1=100 pharmabot--v2=0

This takes 30 seconds. Only then do I investigate.

I'd check exactly which requests were served the bad system prompt. I log the prompt version and a hash, not the full prompt content:

KUSTO
customEvents
| where name == "llm_call_completed"
| where tostring(customDimensions["prompt_version"]) == "v2"
| project timestamp,
    session_id = tostring(customDimensions["session_id"]),
    response_preview = tostring(customDimensions["response_preview"])

If users got incorrect medical information, the incident becomes a regulatory question — I escalate to the product and legal team.

Post-incident: add automated prompt regression tests. Before deploying a new system prompt, run a test suite of 50 known Q&A pairs and verify the answers match expectations."
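
A minimal sketch of that regression suite in pytest; the golden cases shown are illustrative, and ask_bot() is a hypothetical wrapper around the candidate prompt version:

Python
import pytest

GOLDEN = [
    {"question": "Can I take ibuprofen with warfarin?",
     "must_contain": ["bleeding", "consult"]},
    {"question": "Does grapefruit juice interact with simvastatin?",
     "must_contain": ["grapefruit", "avoid"]},
    # ... 48 more cases, checked into the repo alongside the prompt
]

def ask_bot(question: str) -> str:
    raise NotImplementedError  # placeholder: call the chatbot with the new prompt here

@pytest.mark.parametrize("case", GOLDEN, ids=lambda c: c["question"][:40])
def test_prompt_regression(case):
    answer = ask_bot(case["question"]).lower()
    for phrase in case["must_contain"]:
        assert phrase in answer, f"Missing '{phrase}' for: {case['question']}"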


Scenario 4: Designing for 10× Scale

"Your LLM service handles 1,000 requests/day comfortably. The product team projects 10,000/day next quarter. How do you prepare?"

Model Answer:

"First, I profile where the time goes in a single request — usually it's: Redis lookup (1ms), vector search (50ms), LLM call (1,500ms), response stream (variable). The LLM call dominates.

At 10× volume the bottlenecks will be:

  1. Azure OpenAI TPM quota — I'd request a quota increase now (approval takes 1–2 weeks). At 3,000 tokens/request × 10,000 requests/day that's 30M tokens/day, which is only about 21K TPM averaged over 24 hours, but traffic is bursty: if the peak hour carries 5–10× the average load, that's 100–200K TPM against GPT-4o's default 450K TPM quota, shared with every other deployment of the model in the region. I'd size the request to projected peak-hour TPM with 2–3× headroom.

  2. Semantic cache hit rate — at higher volume, more queries repeat. I'd move from Redis to a dedicated vector cache store and extend the cache TTL from 1 hour to 24 hours for stable medical data (a minimal lookup sketch follows this answer).

  3. Horizontal scaling — Azure Container Apps auto-scales to 10 replicas, each handling 10 concurrent requests = 100 concurrent LLM calls in flight. That should be sufficient.

  4. Cost — 10× requests = 10× cost unless I improve cache hit rates and model routing. I'd target 60% cache hit rate and route 50% of queries to GPT-4o-mini."
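
The semantic cache lookup from point 2, sketched minimally (in-memory and with hypothetical names; a real deployment would back it with Redis or a vector store):

Python
import numpy as np

CACHE: list[tuple[np.ndarray, str]] = []   # (query_embedding, cached_answer)
SIMILARITY_THRESHOLD = 0.92                # tune against real traffic

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def cached_answer(query_embedding: np.ndarray) -> str | None:
    for cached_embedding, answer in CACHE:
        if cosine(cached_embedding, query_embedding) >= SIMILARITY_THRESHOLD:
            return answer   # cache hit: skip the LLM call entirely
    return None

def store_answer(query_embedding: np.ndarray, answer: str) -> None:
    CACHE.append((query_embedding, answer))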


Scenario 5: Health Check Design

"How would you design the health check system for a pharmaceutical AI assistant that must meet 99.9% uptime SLA?"

Model Answer:

"I'd implement three separate health endpoints:

/health/live — liveness probe (Kubernetes). Returns 200 if the process is alive. Never includes external dependencies — if Redis is down, the process is still alive. Fails only on deadlock or OOM.

/health/ready — readiness probe. Returns 200 only when all critical dependencies are healthy:

  • Redis: PING command
  • PostgreSQL: SELECT 1
  • Azure OpenAI: GET /models (not a completion call — just auth check)
  • Azure AI Search: GET /indexes status

/health/startup — startup probe with 60s timeout. Same checks as readiness but gives extra time during cold start (model loading, connection pool warmup).
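
A minimal sketch of the liveness/readiness split in FastAPI, with placeholder dependency checks standing in for the real Redis, Postgres, and Azure calls:

Python
from fastapi import FastAPI, Response

app = FastAPI()

def check_redis() -> bool: return True         # placeholder for redis.ping()
def check_postgres() -> bool: return True      # placeholder for SELECT 1
def check_azure_openai() -> bool: return True  # placeholder for GET /models auth check
def check_ai_search() -> bool: return True     # placeholder for index status call

@app.get("/health/live")
def live():
    # Process is alive; never checks external dependencies.
    return {"status": "alive"}

@app.get("/health/ready")
def ready(response: Response):
    checks = {
        "redis": check_redis(),
        "postgres": check_postgres(),
        "azure_openai": check_azure_openai(),
        "ai_search": check_ai_search(),
    }
    healthy = all(checks.values())
    if not healthy:
        response.status_code = 503  # probe fails; replica is pulled from rotation
    return {"status": "ready" if healthy else "degraded", "checks": checks}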

For 99.9% SLA (about 8.7 hours downtime/year):

  • Use at least 2 replicas at all times so a single pod failure doesn't cause downtime
  • Set up Azure Container Apps health probes with failure threshold = 2 (not 1) to avoid flapping
  • Monitor the health check response time — if /health/ready takes over 500ms, something is already wrong
  • Set up a synthetic monitor (Azure Application Insights availability test) that hits the health endpoint every 5 minutes from multiple regions"

Common Follow-Up Questions

"What's the difference between a metric and a log for LLM observability?" Metrics are aggregatable numbers (p95 latency, token count, cost). Logs are events with context (which user, which query hash, what error). Use metrics for dashboards and alerts, logs for debugging specific incidents.

"How do you handle a rate limit error from Azure OpenAI in production?" Exponential backoff with jitter: wait 2^retry_count + random(0,1) seconds. Max 3 retries. If all retries fail, return a graceful error to the user with a retry-after suggestion. Log the rate limit hit — 10 in an hour means you need a quota increase.

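A minimal sketch of that backoff-with-jitter policy, assuming the openai SDK's RateLimitError:

Python
import random
import time
from openai import RateLimitError

def call_with_backoff(make_call, max_retries=3):
    for attempt in range(max_retries + 1):
        try:
            return make_call()
        except RateLimitError:
            if attempt == max_retries:
                raise                              # surface a graceful error upstream
            wait = 2 ** attempt + random.random()  # exponential backoff + jitter
            time.sleep(wait)
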
"What's your monitoring setup from day one of a new LLM service?" Day 1: structured logging with request ID, LLM latency, token counts. Day 2: cost metric per request. Day 3: health check + availability alert. Day 7: latency SLO dashboard. Before launch: oncall runbook written, alert thresholds set, rollback plan tested.

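And a minimal sketch of the day-1 structured logging described above, with hypothetical field names chosen to match the customEvents queries earlier in the post; the usage fields assume the openai SDK response shape:

Python
import json
import logging
import time
import uuid

logger = logging.getLogger("llm")

def log_llm_call(fn, *args, **kwargs):
    # usage: response = log_llm_call(client.chat.completions.create, model="gpt-4o", messages=msgs)
    request_id = str(uuid.uuid4())
    start = time.perf_counter()
    response = fn(*args, **kwargs)
    logger.info(json.dumps({
        "event": "llm_call_completed",
        "request_id": request_id,
        "latency_ms": round((time.perf_counter() - start) * 1000, 1),
        "prompt_tokens": response.usage.prompt_tokens,
        "completion_tokens": response.usage.completion_tokens,
        "total_tokens": response.usage.total_tokens,
    }))
    return response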

Checkpoint

Write answers to these questions without looking at the notes:

  1. Your LLM service's p99 latency just hit 30s. What's your first action?
  2. A junior engineer wants to print() the full user prompt to debug an issue. Why is this a problem?
  3. You need to deploy a new embedding model. The old embeddings in your vector store are now incompatible. How do you handle this?
  4. Your Container App is scaling to 10 replicas but p95 latency is still high. What's likely wrong?

If you can answer all four fluently, you're ready for senior LLMOps interviews.

Enjoyed this article?

Explore the AI Systems learning path for more.
