AI Systems · Intermediate

Setting Up Alerts: Rate Limits, Latency Spikes

Configure Azure Monitor alerts that wake you up before users complain. Learn the right thresholds for LLM latency, error rate, token cost, and rate limit alerts.

Asma Hafeez Khan · May 15, 2026 · 5 min read
LLMOps · Azure Monitor · Alerts · Observability · SRE
Share:š•

Alert Philosophy for LLM Services

Two rules before writing any alert:

  1. Alerts should be actionable. If you can't do something specific when the alert fires, it's noise. Delete it.
  2. Alert on symptoms, not causes. "p95 latency over 5s" is a symptom. "OpenAI API called 1000 times" is a cause — don't alert on that.

LLM services have a unique alerting challenge: the LLM itself introduces non-deterministic latency. A spike from 1.5s to 3s might be normal if OpenAI is under load. A spike to 30s is a problem. Set thresholds based on your SLOs, not instinct.
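
Before picking numbers, look at what your service actually does. Here is one way to pull the recent latency distribution with the az CLI; this assumes the application-insights extension is installed and uses a hypothetical `pharmabot-insights` Application Insights resource, so adjust the names to your setup:

Bash
# Query the last 7 days of request latency percentiles to ground your thresholds
# (resource names are placeholders for your own Application Insights component)
az monitor app-insights query \
  --app pharmabot-insights \
  --resource-group pharmabot-rg \
  --offset 7d \
  --analytics-query "requests | summarize p95=percentile(duration, 95), p99=percentile(duration, 99)"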


What to Alert On

| Signal | Alert Threshold | Severity |
|---|---|---|
| p95 latency | over 5s for 5 minutes | Warning |
| p99 latency | over 15s for 3 minutes | Critical |
| Error rate | over 2% for 5 minutes | Warning |
| Error rate | over 5% for 2 minutes | Critical |
| Rate limit errors | over 10 in 5 minutes | Warning |
| Daily cost | over 120% of yesterday | Warning |
| Health check fails | 2 consecutive failures | Critical |
| Container restarts | 3 restarts in 10 minutes | Critical |


Setting Up Alerts in Azure Monitor

Via Azure Portal

  1. Azure Monitor → Alerts → Create → Alert rule
  2. Select your Container App or Application Insights as the resource
  3. Choose the signal (metric or log query)
  4. Set the threshold and time window
  5. Set the action group (email, SMS, PagerDuty, Teams webhook)

Via Azure CLI (Infrastructure as Code)

Create an alert on request duration. Note that Azure metric alerts only support standard aggregations (avg, min, max, total, count), so this rule fires on average duration over the window; for true p95/p99 alerting, use the log-based alerts below:

Bash
az monitor metrics alert create \
  --name "pharmabot-high-latency" \
  --resource-group pharmabot-rg \
  --scopes "/subscriptions/{sub-id}/resourceGroups/pharmabot-rg/providers/Microsoft.App/containerApps/pharmabot" \
  --condition "avg Requests/Duration > 5000" \
  --window-size 5m \
  --evaluation-frequency 1m \
  --severity 2 \
  --action-group "/subscriptions/{sub-id}/resourceGroups/pharmabot-rg/providers/microsoft.insights/actionGroups/pharmabot-oncall" \
  --description "p95 latency over 5 seconds"

Log-Based Alerts (Custom Metrics)

For LLM-specific metrics (token cost, rate limits), use scheduled log query alerts:

Alert: Rate Limit Spike

KUSTO
// Fires if more than 10 rate limit errors in 5 minutes
customEvents
| where name == "llm_call_failed"
| where customDimensions["error_type"] == "RateLimitError"
| where timestamp > ago(5m)
| count

Set this as a Log Query Alert with threshold: count > 10.
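
If you want these log alerts in code as well, the scheduled-query extension can create them. A rough sketch under those assumptions; the condition grammar and action-group flags differ across extension versions, so treat this as a starting point and confirm with `az monitor scheduled-query create --help`:

Bash
# Log query alert: more than 10 rate limit errors in a 5-minute window
# (scope is the Application Insights resource; resource names are placeholders)
az monitor scheduled-query create \
  --name "pharmabot-rate-limit-spike" \
  --resource-group pharmabot-rg \
  --scopes "/subscriptions/{sub-id}/resourceGroups/pharmabot-rg/providers/microsoft.insights/components/pharmabot-insights" \
  --condition "count 'RateLimitErrors' > 10" \
  --condition-query RateLimitErrors="customEvents | where name == 'llm_call_failed' | where customDimensions['error_type'] == 'RateLimitError'" \
  --window-size 5m \
  --evaluation-frequency 5m \
  --severity 2 \
  --action-groups "/subscriptions/{sub-id}/resourceGroups/pharmabot-rg/providers/microsoft.insights/actionGroups/pharmabot-oncall"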

Alert: Daily Cost Spike

KUSTO
// Compare today's cost to yesterday's
let yesterday_cost =
    customEvents
    | where name == "llm_call_completed"
    | where timestamp between(ago(48h) .. ago(24h))
    | summarize total = sum(todouble(customDimensions["cost_usd"]));
let today_cost =
    customEvents
    | where name == "llm_call_completed"
    | where timestamp > ago(24h)
    | summarize total = sum(todouble(customDimensions["cost_usd"]));
today_cost
| extend yesterday = toscalar(yesterday_cost)
| extend ratio = total / yesterday
| where ratio > 1.2  // Alert if today is more than 120% of yesterday

Alert: TTFT Degradation

KUSTO
customMetrics
| where name == "llm.ttft_ms"
| where timestamp > ago(10m)
| summarize p95 = percentile(value, 95)
| where p95 > 2000  // Alert if p95 TTFT over 2 seconds

Action Groups

An action group defines WHO gets notified and HOW:

Bash
# Create action group with email + webhook
az monitor action-group create \
  --name pharmabot-oncall \
  --resource-group pharmabot-rg \
  --short-name pharmabot \
  --email-receivers name=oncall email=oncall@company.com \
  --webhook-receivers name=teams serviceUri=https://outlook.office.com/webhook/...

Tiered alerting:

  • Warning: post to Slack #alerts channel (see the sketch after this list)
  • Critical: page on-call via PagerDuty + send SMS
  • Resolved: notify same channels automatically (Azure Monitor supports auto-resolution)
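
A minimal sketch of a separate warning-tier action group that only posts to a chat webhook; the webhook URL is a placeholder, and receiver flag syntax can differ across CLI versions:

Bash
# Warning-tier action group: chat webhook only, no paging
az monitor action-group create \
  --name pharmabot-warnings \
  --resource-group pharmabot-rg \
  --short-name pbwarn \
  --webhook-receivers name=slack serviceUri=https://hooks.slack.com/services/PLACEHOLDER

Point warning-severity alert rules at pharmabot-warnings and reserve pharmabot-oncall for critical ones.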

Alert Runbooks

Every alert should link to a runbook — a document that says exactly what to do when the alert fires.

Runbook: High Latency Alert

MARKDOWN
## Alert: pharmabot-high-latency fired

**Check 1**: Is Azure OpenAI status page showing issues?
→ https://status.azure.com — look for Azure OpenAI in your region

**Check 2**: Is a specific endpoint slow or all endpoints?
→ Log Analytics: requests | summarize p95=percentile(duration,95) by name

**Check 3**: Is it a specific model deployment?
→ customMetrics | where name == "llm.latency_ms" | summarize avg(value) by customDimensions["model"]

**Action if OpenAI issue**: Enable fallback model (GPT-4o-mini)
**Action if our code**: Check for missing AsNoTracking(), N+1 queries in RAG retriever
**Action if unknown**: Scale out container replicas and alert team

Testing Your Alerts

Don't wait for a real incident to discover your alerts don't work.

Test method 1: Manually fire a metric above the threshold

Python
# Temporarily send a metric way above threshold to test the alert
latency_histogram.record(99999, {"model": "gpt-4o"})

Test method 2: Scale your container app to zero to simulate an outage and test health alerts

Bash
# Scale the app to zero to kill the service and test health alerts
# (if your CLI or API version rejects --max-replicas 0, deactivate the active
#  revision instead with `az containerapp revision deactivate`)
az containerapp update \
  --name pharmabot \
  --resource-group pharmabot-rg \
  --max-replicas 0
# Verify the health alert fires within 2 minutes, then restore
az containerapp update \
  --name pharmabot \
  --resource-group pharmabot-rg \
  --max-replicas 5

Alert Fatigue Prevention

If every alert requires investigation and nothing is ever wrong, engineers start ignoring alerts. Keep your signal-to-noise ratio high:

  1. Review alerts weekly: if an alert fires more than twice a week without action, tune the threshold or delete it
  2. Track MTTA (Mean Time to Acknowledge): if it's consistently over 30 minutes, the alert isn't urgent enough to justify a page
  3. Use inhibition rules: if a critical alert fires, suppress related warnings (no point in a "high latency" alert when "service is down" is already firing)
  4. Set maintenance windows during planned deployments, suppressing alerts for 30 minutes (see the sketch after this list)
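
For the maintenance-window case, Azure's alert processing rules can suppress notifications on a schedule. A sketch, assuming the alertsmanagement CLI extension; the dates and scope are placeholders:

Bash
# Suppress all action groups for alerts on the container app during a deploy window
az monitor alert-processing-rule create \
  --name pharmabot-deploy-window \
  --resource-group pharmabot-rg \
  --rule-type RemoveAllActionGroups \
  --scopes "/subscriptions/{sub-id}/resourceGroups/pharmabot-rg/providers/Microsoft.App/containerApps/pharmabot" \
  --schedule-start-datetime "2026-05-15 14:00:00" \
  --schedule-end-datetime "2026-05-15 14:30:00"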

Checkpoint

Set up at least two alerts for your LLM service:

  1. Latency alert: p95 over 5s for 5 minutes → email
  2. Health alert: health check failing → page on-call

Verify both alerts appear in Azure Monitor → Alerts → Alert Rules and that the action group sends a test notification:

Bash
# Send a sample notification through the action group.
# Supported --alert-type values and receiver flags vary by CLI version;
# check `az monitor action-group test-notifications create --help`.
az monitor action-group test-notifications create \
  --action-group-name pharmabot-oncall \
  --resource-group pharmabot-rg \
  --alert-type budget

Enjoyed this article?

Explore the AI Systems learning path for more.
