Azure Monitor and Application Insights for LLMs
Set up Azure Monitor and Application Insights to track LLM latency, token usage, error rates, and cost for production AI services running on Azure Container Apps.
What You Need to Observe in an LLM Service
A traditional API needs three things: latency, error rate, and throughput. An LLM service needs all of that plus:
- Token usage — every token costs money
- Time to First Token (TTFT) — streaming latency perception
- Model version — which GPT-4o deployment served the request
- Prompt/completion ratio — signals prompt bloat
- Safety filter hits — how often content moderation blocks responses
- RAG retrieval quality — are we finding relevant chunks?
Azure Monitor + Application Insights is the native Azure stack for this.
Architecture
FastAPI Container App
         │
         │ (OpenTelemetry SDK)
         ▼
Application Insights ──► Log Analytics Workspace
         │
         ▼
Azure Monitor Dashboards + Alerts
Setup: Install the SDK
pip install azure-monitor-opentelemetry
Instrument Your App
In main.py, add this before creating the FastAPI app:
import os
from azure.monitor.opentelemetry import configure_azure_monitor

# Optionally set the cloud role for multi-service tracing. The distro picks up
# the standard OpenTelemetry resource attributes from the environment.
os.environ.setdefault("OTEL_SERVICE_NAME", "pharmabot-api")
os.environ.setdefault("OTEL_RESOURCE_ATTRIBUTES", "service.version=1.0.0")

configure_azure_monitor(
    connection_string=os.environ["APPLICATIONINSIGHTS_CONNECTION_STRING"],
)
This single call instruments (see the minimal main.py sketch after this list):
- All HTTP requests (via FastAPI middleware)
- All httpx/requests outbound calls (including Azure OpenAI)
- All exceptions
- Custom events and metrics you add
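The ordering is the important part: configure_azure_monitor must run before the FastAPI app object is created so the bundled FastAPI instrumentation applies to it. A minimal main.py sketch; the /healthz route is just an illustrative placeholder, not part of the original service:
import os
from azure.monitor.opentelemetry import configure_azure_monitor
from fastapi import FastAPI

# 1. Configure telemetry first (connection string + cloud role env vars as above)
configure_azure_monitor(
    connection_string=os.environ["APPLICATIONINSIGHTS_CONNECTION_STRING"],
)

# 2. Only then create the app, so incoming requests are traced
app = FastAPI(title="pharmabot-api")

@app.get("/healthz")
async def healthz():
    return {"status": "ok"}  # placeholder route for illustration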
Get the Connection String
# Create Application Insights resource
az monitor app-insights component create \
--app pharmabot-insights \
--location eastus \
--resource-group pharmabot-rg \
--kind web
# Get the connection string
az monitor app-insights component show \
--app pharmabot-insights \
--resource-group pharmabot-rg \
--query connectionString -o tsv
Set in your Container App:
az containerapp secret set \
--name pharmabot \
--resource-group pharmabot-rg \
--secrets appinsights-conn="InstrumentationKey=...;IngestionEndpoint=..."
az containerapp update \
--name pharmabot \
--resource-group pharmabot-rg \
--set-env-vars APPLICATIONINSIGHTS_CONNECTION_STRING=secretref:appinsights-conn
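If the secret-to-env-var mapping is misconfigured, the app only fails with a bare KeyError when configure_azure_monitor reads the variable. An optional startup guard with a clearer message, a sketch rather than part of the original setup:
import os

# Optional guard: fail with an actionable error if the Container App secret
# was never mapped to APPLICATIONINSIGHTS_CONNECTION_STRING
if not os.environ.get("APPLICATIONINSIGHTS_CONNECTION_STRING"):
    raise RuntimeError(
        "APPLICATIONINSIGHTS_CONNECTION_STRING is not set; "
        "check the Container App secret and the --set-env-vars mapping"
    )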
Tracking Custom LLM Metrics
The auto-instrumentation tracks HTTP latency. For LLM-specific metrics, use the OpenTelemetry metrics API:
from opentelemetry import metrics
meter = metrics.get_meter("pharmabot.llm")
# Counters and histograms
token_counter = meter.create_counter(
"llm.tokens.total",
description="Total tokens used",
unit="tokens",
)
latency_histogram = meter.create_histogram(
"llm.latency_ms",
description="LLM call latency",
unit="ms",
)
ttft_histogram = meter.create_histogram(
"llm.ttft_ms",
description="Time to first token",
unit="ms",
)
async def call_azure_openai_tracked(messages: list, model: str) -> str:
import time
start = time.perf_counter()
response = await client.chat.completions.create(
model=model,
messages=messages,
)
duration_ms = (time.perf_counter() - start) * 1000
# Record metrics with labels
attributes = {"model": model, "endpoint": "chat"}
token_counter.add(
response.usage.prompt_tokens,
{**attributes, "type": "prompt"}
)
token_counter.add(
response.usage.completion_tokens,
{**attributes, "type": "completion"}
)
latency_histogram.record(duration_ms, attributes)
    return response.choices[0].message.content
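Token counts only become actionable once they are tied to spend. One option is a derived cost counter; this is a sketch, the cost_counter, PRICE_PER_1K_TOKENS, and record_estimated_cost names are mine, and the per-1K-token rates are placeholders to replace with the current Azure OpenAI pricing for your deployments:
# Sketch: estimated-cost counter derived from token usage.
cost_counter = meter.create_counter(
    "llm.cost.usd",
    description="Estimated LLM spend",
    unit="USD",
)

PRICE_PER_1K_TOKENS = {
    "gpt-4o": {"prompt": 0.0025, "completion": 0.01},  # placeholder rates only
}

def record_estimated_cost(model: str, prompt_tokens: int, completion_tokens: int) -> None:
    rates = PRICE_PER_1K_TOKENS.get(model)
    if rates is None:
        return  # unknown deployment: record nothing rather than guess
    estimated = (
        prompt_tokens / 1000 * rates["prompt"]
        + completion_tokens / 1000 * rates["completion"]
    )
    cost_counter.add(estimated, {"model": model, "endpoint": "chat"})
Call it right after the token counters in call_azure_openai_tracked; the values then show up in Log Analytics under customMetrics like any other counter.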
Tracking Streaming TTFT
import time
async def stream_with_ttft_tracking(messages: list, model: str):
start = time.perf_counter()
first_token = True
async for chunk in await client.chat.completions.create(
model=model,
messages=messages,
stream=True,
):
        # Azure OpenAI can emit an initial chunk with no choices (prompt filter
        # results), so guard before indexing into choices
        if not chunk.choices:
            continue
        if first_token and chunk.choices[0].delta.content:
            ttft_ms = (time.perf_counter() - start) * 1000
            ttft_histogram.record(ttft_ms, {"model": model})
            first_token = False
        content = chunk.choices[0].delta.content or ""
        yield content
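To show where the generator plugs in, here is a sketch of a streaming FastAPI endpoint wrapping it. The /api/chat/stream path and the ChatRequest model are assumptions for illustration, not the original service's API:
from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from pydantic import BaseModel

# Re-created here only to keep the sketch self-contained; in practice use the
# instrumented app from main.py and the stream_with_ttft_tracking defined above.
app = FastAPI()

class ChatRequest(BaseModel):
    message: str

@app.post("/api/chat/stream")
async def chat_stream(req: ChatRequest):
    messages = [{"role": "user", "content": req.message}]
    # The tracked generator records TTFT as a side effect while streaming tokens
    return StreamingResponse(
        stream_with_ttft_tracking(messages, model="gpt-4o"),
        media_type="text/plain",
    )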
Querying in Log Analytics
Navigate to: Azure Portal → Application Insights → Logs. (Querying the backing Log Analytics workspace directly also works, but there the tables are named AppRequests and AppMetrics rather than requests and customMetrics.)
All LLM requests in the last hour
customMetrics
| where name == "llm.latency_ms"
| where timestamp > ago(1h)
| summarize
avg_latency = avg(value),
p95_latency = percentile(value, 95),
p99_latency = percentile(value, 99),
request_count = count()
| project avg_latency, p95_latency, p99_latency, request_count
Token usage by model (cost tracking)
customMetrics
| where name == "llm.tokens.total"
| where timestamp > ago(24h)
| summarize total_tokens = sum(value) by tostring(customDimensions["model"])
| order by total_tokens desc
Error rate
requests
| where timestamp > ago(1h)
| where name contains "/api/chat"
| summarize
total = count(),
errors = countif(toint(resultCode) >= 400)
| extend error_rate_pct = round(100.0 * errors / total, 2)
Slow requests (over 5s)
requests
| where timestamp > ago(1h)
| where duration > 5000
| where name contains "/api/chat"
| project timestamp, duration, resultCode, url
| order by duration desc
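The portal isn't the only consumer of these queries; they can also run from code (for example, a scheduled cost or error report) using the azure-monitor-query package. A sketch, assuming the caller's identity has query access to the workspace; note that direct workspace queries use the workspace table names (AppRequests, AppMetrics):
# pip install azure-monitor-query azure-identity
from datetime import timedelta
from azure.identity import DefaultAzureCredential
from azure.monitor.query import LogsQueryClient

client = LogsQueryClient(DefaultAzureCredential())

# Workspace tables use different names than the Application Insights blade:
# requests -> AppRequests, customMetrics -> AppMetrics
query = """
AppRequests
| where TimeGenerated > ago(1h)
| summarize total = count(), avg_ms = avg(DurationMs) by ResultCode
"""

result = client.query_workspace(
    workspace_id="<log-analytics-workspace-id>",  # placeholder
    query=query,
    timespan=timedelta(hours=1),
)
for table in result.tables:
    for row in table.rows:
        print(row)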
Application Map
In the Azure Portal → Application Insights → Application Map, you'll see a visual graph of:
[Browser] → [pharmabot-api] → [Azure OpenAI]
                            → [Azure AI Search]
                            → [Redis]
                            → [PostgreSQL]
Each arrow shows latency and error rate. This makes it immediately obvious which dependency is the bottleneck.
Live Metrics
Application Insights → Live Metrics shows real-time:
- Incoming request rate
- Failed request rate
- Server response time
- CPU and memory
Use this during a deployment to immediately see if the new version is healthy.
Checkpoint
After instrumenting, run the service (locally here, or against your deployed Container App URL) and make 10 requests:
for i in {1..10}; do
curl -s http://localhost:8000/api/chat \
-d '{"message":"What is ibuprofen?"}' \
-H "Content-Type: application/json" > /dev/null
done
Then in Application Insights → Transaction Search, you should see 10 requests. Click any one to see the full trace: HTTP request → OpenAI call → response, with the latency of each step broken down. Token counts show up separately under customMetrics.