LLMOps & Deployment · Lesson 12 of 16
Setting Up Alerts: Rate Limits, Latency Spikes
Alert Philosophy for LLM Services
Two rules before writing any alert:
- Alerts should be actionable. If you can't do something specific when the alert fires, it's noise. Delete it.
- Alert on symptoms, not causes. "p95 latency over 5s" is a symptom. "OpenAI API called 1000 times" is a cause — don't alert on that.
LLM services have a unique alerting challenge: the LLM itself introduces non-deterministic latency. A spike from 1.5s to 3s might be normal if OpenAI is under load. A spike to 30s is a problem. Set thresholds based on your SLOs, not instinct.
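As a rough illustration of judging a spike against an SLO rather than instinct, the sketch below computes a nearest-rank p95 over a window of latency samples and compares it with a 5-second limit. The function names and sample values are hypothetical, and the 5s figure is only an example SLO.

```python
def percentile(samples, pct):
    """Nearest-rank percentile for an integer pct between 1 and 100."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    # Ceiling of pct% of the sample count, using integer arithmetic
    rank = (pct * len(ordered) + 99) // 100
    return ordered[max(rank, 1) - 1]


def breaches_slo(latencies_s, slo_p95_s=5.0):
    """True when the window's p95 latency exceeds the SLO limit."""
    return percentile(latencies_s, 95) > slo_p95_s


# Hypothetical windows of 100 request latencies, in seconds
provider_under_load = [1.5] * 90 + [3.0] * 10   # slower, but within the SLO
incident = [1.5] * 90 + [30.0] * 10             # the tail has blown past it

print(breaches_slo(provider_under_load))  # False
print(breaches_slo(incident))             # True
```

The same 3s spike that looks alarming on a dashboard does not breach the SLO here, which is the point of anchoring thresholds to the SLO.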
What to Alert On
| Signal | Alert Threshold | Severity |
|---|---|---|
| p95 latency | over 5s for 5 minutes | Warning |
| p99 latency | over 15s for 3 minutes | Critical |
| Error rate | over 2% for 5 minutes | Warning |
| Error rate | over 5% for 2 minutes | Critical |
| Rate limit errors | over 10 in 5 minutes | Warning |
| Daily cost | over 120% of yesterday | Warning |
| Health check fails | 2 consecutive failures | Critical |
| Container restarts | 3 restarts in 10 minutes | Critical |
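The "over X for N minutes" pattern in these thresholds can be sketched as a rule that fires only when every per-minute sample in the window exceeds the limit. This is a conceptual model for reasoning about thresholds, not a description of how Azure Monitor evaluates rules internally; `AlertRule`, `is_firing`, and the sample series are hypothetical names and data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AlertRule:
    signal: str
    threshold: float
    window_minutes: int
    severity: str


def is_firing(rule, per_minute_values):
    """True only if each of the last `window_minutes` samples exceeds the threshold."""
    window = per_minute_values[-rule.window_minutes:]
    if len(window) < rule.window_minutes:
        return False  # not enough history to judge a sustained breach
    return all(value > rule.threshold for value in window)


P95_WARNING = AlertRule("p95_latency_s", 5.0, 5, "Warning")
ERROR_RATE_CRITICAL = AlertRule("error_rate_pct", 5.0, 2, "Critical")

blip = [2.0, 2.1, 9.0, 2.2, 2.0]        # one bad minute
sustained = [6.1, 5.5, 7.0, 6.4, 5.2]   # five bad minutes in a row

print(is_firing(P95_WARNING, blip))       # False
print(is_firing(P95_WARNING, sustained))  # True
```

Requiring a sustained breach is what keeps a single slow minute from paging anyone.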
Setting Up Alerts in Azure Monitor
Via Azure Portal
- Azure Monitor → Alerts → Create → Alert rule
- Select your Container App or Application Insights as the resource
- Choose the signal (metric or log query)
- Set the threshold and time window
- Set the action group (email, SMS, PagerDuty, Teams webhook)
Via Azure CLI (Infrastructure as Code)
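Create an alert for high request latency. Metric alerts aggregate with avg, min, max, total, or count only, so this rule watches the average duration; a true p95 rule needs a log query alert. Requests/Duration is an Application Insights metric, so the scope points at the Application Insights resource ({app-insights-name} is a placeholder):
az monitor metrics alert create \
--name "pharmabot-high-latency" \
--resource-group pharmabot-rg \
--scopes "/subscriptions/{sub-id}/resourceGroups/pharmabot-rg/providers/microsoft.insights/components/{app-insights-name}" \
--condition "avg requests/duration > 5000" \
--window-size 5m \
--evaluation-frequency 1m \
--severity 2 \
--action "/subscriptions/{sub-id}/resourceGroups/pharmabot-rg/providers/microsoft.insights/actionGroups/pharmabot-oncall" \
--description "Average request duration over 5 seconds"
Log-Based Alerts (Custom Metrics)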
For LLM-specific metrics (token cost, rate limits), use scheduled log query alerts:
Alert: Rate Limit Spike
// Fires if more than 10 rate limit errors in 5 minutes
customEvents
| where name == "llm_call_failed"
| where customDimensions["error_type"] == "RateLimitError"
| where timestamp > ago(5m)
| count
Set this as a Log Query Alert with threshold: count > 10.
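The query above is a sliding-window count. A minimal in-process sketch of the same logic, useful for unit-testing the threshold or for a local guard before telemetry reaches Log Analytics, might look like this. `SlidingWindowCounter` is a hypothetical helper, and it assumes timestamps (in seconds) are recorded in increasing order.

```python
from collections import deque


class SlidingWindowCounter:
    """Counts events whose timestamp falls within the last `window_seconds`."""

    def __init__(self, window_seconds=300):
        self.window_seconds = window_seconds
        self._timestamps = deque()

    def record(self, now):
        """Register one event (for example a rate limit error) at time `now`."""
        self._timestamps.append(now)
        self._evict(now)

    def count(self, now):
        self._evict(now)
        return len(self._timestamps)

    def _evict(self, now):
        # Mirror `timestamp > ago(5m)`: drop events at or before the cutoff
        cutoff = now - self.window_seconds
        while self._timestamps and self._timestamps[0] <= cutoff:
            self._timestamps.popleft()


counter = SlidingWindowCounter()
for second in range(11):          # 11 rate limit errors in 11 seconds
    counter.record(float(second))

print(counter.count(10.0) > 10)   # True: threshold of 10 exceeded
print(counter.count(400.0))       # 0: every event has aged out of the window
```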
Alert: Daily Cost Spike
// Compare today's cost to yesterday's
let yesterday_cost =
customEvents
| where name == "llm_call_completed"
| where timestamp between(ago(48h) .. ago(24h))
| summarize total = sum(todouble(customDimensions["cost_usd"]));
let today_cost =
customEvents
| where name == "llm_call_completed"
| where timestamp > ago(24h)
| summarize total = sum(todouble(customDimensions["cost_usd"]));
today_cost
| extend yesterday = toscalar(yesterday_cost)
| extend ratio = total / yesterday
| where ratio > 1.2 // Alert if today is over 120% of yesterday
Alert: TTFT Degradation
customMetrics
| where name == "llm.ttft_ms"
| where timestamp > ago(10m)
| summarize p95 = percentile(value, 95)
| where p95 > 2000 // Alert if p95 TTFT over 2 seconds
Action Groups
An action group defines WHO gets notified and HOW:
# Create action group with email + webhook
az monitor action-group create \
--name pharmabot-oncall \
--resource-group pharmabot-rg \
--short-name pharmabot \
--action email oncall oncall@company.com \
--action webhook teams https://outlook.office.com/webhook/...
Tiered alerting:
- Warning: post to Slack #alerts channel
- Critical: page on-call via PagerDuty + send SMS
- Resolved: notify same channels automatically (Azure Monitor supports auto-resolution)
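The tiers above amount to a small routing table. A minimal sketch, assuming hypothetical channel identifiers (in Azure Monitor the actual routing is configured through action groups, not application code):

```python
# Hypothetical channel identifiers, one list per severity tier
ROUTES = {
    "Warning": ["slack:#alerts"],
    "Critical": ["pagerduty:on-call", "sms:on-call"],
}


def notifications_for(severity, state="fired"):
    """Return (channel, state) pairs; resolutions go to the same channels as the alert."""
    if severity not in ROUTES:
        raise ValueError(f"unknown severity: {severity!r}")
    if state not in ("fired", "resolved"):
        raise ValueError(f"unknown state: {state!r}")
    return [(channel, state) for channel in ROUTES[severity]]


print(notifications_for("Warning"))
# [('slack:#alerts', 'fired')]
print(notifications_for("Critical", state="resolved"))
# [('pagerduty:on-call', 'resolved'), ('sms:on-call', 'resolved')]
```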
Alert Runbooks
Every alert should link to a runbook — a document that says exactly what to do when the alert fires.
Runbook: High Latency Alert
## Alert: pharmabot-high-latency fired
**Check 1**: Is Azure OpenAI status page showing issues?
→ https://status.azure.com — look for Azure OpenAI in your region
**Check 2**: Is a specific endpoint slow or all endpoints?
→ Log Analytics: requests | summarize p95=percentile(duration,95) by name
**Check 3**: Is it a specific model deployment?
→ customMetrics | where name == "llm.latency_ms" | summarize avg(value) by model = tostring(customDimensions["model"])
**Action if OpenAI issue**: Enable fallback model (GPT-4o-mini)
**Action if our code**: Check for missing AsNoTracking(), N+1 queries in RAG retriever
**Action if unknown**: Scale out container replicas and alert team
Testing Your Alerts
Don't wait for a real incident to discover your alerts don't work.
Test method 1: Manually emit a metric above the threshold
# Temporarily send a metric way above threshold to test the alert
latency_histogram.record(99999, {"model": "gpt-4o"})
Test method 2: Take the service down to test health alerts
# Set max replicas to 0 to kill the service and test health alerts
# (if your CLI version rejects a maximum of 0, deactivate the active
# revision with `az containerapp revision deactivate` instead)
az containerapp update \
--name pharmabot \
--resource-group pharmabot-rg \
--max-replicas 0
# Verify the health alert fires within 2 minutes
# Then restore
az containerapp update \
--name pharmabot \
--resource-group pharmabot-rg \
--max-replicas 5
Alert Fatigue Prevention
If every alert requires investigation and nothing is ever wrong, engineers start ignoring alerts. Keep your signal-to-noise ratio high:
- Review alerts weekly: if an alert fires more than twice a week without action, tune the threshold or delete it
- Track MTTA (mean time to acknowledge): if it exceeds 30 minutes, the alert isn't urgent enough to justify a page
- Use inhibition rules: if a critical alert fires, suppress related warnings (no point in "high latency" alert if "service is down" is already firing)
- Set maintenance windows during planned deployments — suppress alerts for 30 minutes
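Two of these practices are easy to prototype. The sketch below computes MTTA from fired/acknowledged timestamps and applies a simple inhibition map so a critical alert suppresses its related warnings. The alert names, the `INHIBITS` mapping, and the incident data are hypothetical.

```python
from datetime import datetime, timedelta


def mean_time_to_acknowledge(incidents):
    """incidents: list of (fired_at, acknowledged_at) datetime pairs."""
    if not incidents:
        raise ValueError("need at least one incident")
    total = sum((acked - fired for fired, acked in incidents), timedelta())
    return total / len(incidents)


# Hypothetical map: a firing key suppresses the alerts in its value set
INHIBITS = {"service-down": {"high-latency", "high-error-rate"}}


def apply_inhibition(firing):
    """Drop alerts that are inhibited by another alert that is currently firing."""
    suppressed = set()
    for name in firing:
        suppressed |= INHIBITS.get(name, set())
    return [name for name in firing if name not in suppressed]


t0 = datetime(2024, 1, 1, 9, 0)
incidents = [
    (t0, t0 + timedelta(minutes=10)),
    (t0, t0 + timedelta(minutes=50)),
]
print(mean_time_to_acknowledge(incidents))               # 0:30:00
print(apply_inhibition(["high-latency", "service-down"]))  # ['service-down']
```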
Checkpoint
Set up at least two alerts for your LLM service:
- Latency alert: p95 over 5s for 5 minutes → email
- Health alert: health check failing → page on-call
Verify both alerts appear in Azure Monitor → Alerts → Alert Rules and that the action group sends a test notification:
az monitor action-group test-notifications create \
--action-group pharmabot-oncall \
--resource-group pharmabot-rg \
--alert-type metricstaticthreshold \
--add-action email oncall oncall@company.com