LLMOps & Deployment · Lesson 10 of 16
Azure Monitor and Application Insights for LLMs
What You Need to Observe in an LLM Service
A traditional API needs: latency, error rate, throughput. An LLM service needs all of that plus:
- Token usage — every token costs money
- Time to First Token (TTFT) — streaming latency perception
- Model version — which GPT-4o deployment served the request
- Prompt/completion ratio — signals prompt bloat
- Safety filter hits — how often content moderation blocks responses
- RAG retrieval quality — are we finding relevant chunks?
Azure Monitor + Application Insights is the native Azure stack for this.
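Several of the signals listed above can be derived from numbers the API already returns. The following is a minimal sketch of that derivation, not part of the lesson's codebase: the `LLMCallStats` class and the `PRICE_PER_1K` rates are hypothetical and only illustrate how prompt/completion ratio and cost follow from token counts.

```python
from dataclasses import dataclass

# Hypothetical per-1K-token prices in USD; substitute your deployment's real rates.
PRICE_PER_1K = {"prompt": 0.005, "completion": 0.015}


@dataclass
class LLMCallStats:
    """Per-request numbers worth recording for an LLM call (illustrative only)."""
    model: str
    prompt_tokens: int
    completion_tokens: int
    ttft_ms: float
    blocked_by_filter: bool = False

    @property
    def prompt_completion_ratio(self) -> float:
        # Guard against division by zero when the completion is empty
        return self.prompt_tokens / max(self.completion_tokens, 1)

    @property
    def estimated_cost_usd(self) -> float:
        return (
            self.prompt_tokens / 1000 * PRICE_PER_1K["prompt"]
            + self.completion_tokens / 1000 * PRICE_PER_1K["completion"]
        )


stats = LLMCallStats("gpt-4o", prompt_tokens=1200, completion_tokens=300, ttft_ms=420.0)
print(stats.prompt_completion_ratio, round(stats.estimated_cost_usd, 4))
```

A ratio that keeps climbing over time usually means the prompt (system message, retrieved chunks, chat history) is growing faster than the answers it produces.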
Architecture
FastAPI Container App
        │
        │ (OpenTelemetry SDK)
        ▼
Application Insights ──► Log Analytics Workspace
        │
        ▼
Azure Monitor Dashboards + Alerts
Setup: Install the SDK
pip install azure-monitor-opentelemetry
Instrument Your App
In main.py, add this before creating the FastAPI app:
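import os
from azure.monitor.opentelemetry import configure_azure_monitor
from opentelemetry.sdk.resources import Resource
configure_azure_monitor(
    connection_string=os.environ["APPLICATIONINSIGHTS_CONNECTION_STRING"],
    # Optionally set the cloud role name and version for multi-service tracing.
    # These are passed as OpenTelemetry resource attributes.
    resource=Resource.create({
        "service.name": "pharmabot-api",
        "service.version": "1.0.0",
    }),
)
This single call instruments: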
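- All HTTP requests (via FastAPI middleware)
- Outbound calls made with requests, urllib, and urllib3. The OpenAI SDK uses httpx, so Azure OpenAI calls may need the separate opentelemetry-instrumentation-httpx package to show up as dependencies
- All exceptions
- Custom events and metrics you add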
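Because the setup call reads the connection string straight from the environment, a missing or malformed value only surfaces when the app starts. A small pre-flight check can make that failure explicit. This is a sketch under assumptions: `parse_connection_string` and `load_connection_string` are hypothetical helpers, and the only format rule relied on is that the string is a list of `Key=Value` pairs separated by semicolons and includes an `InstrumentationKey`.

```python
import os
from typing import Dict, Mapping, Optional

ENV_VAR = "APPLICATIONINSIGHTS_CONNECTION_STRING"


def parse_connection_string(raw: str) -> Dict[str, str]:
    """Split 'Key=Value;Key=Value' pairs into a dict."""
    parts: Dict[str, str] = {}
    for pair in raw.split(";"):
        if not pair.strip():
            continue  # tolerate a trailing semicolon
        key, sep, value = pair.partition("=")
        if not sep:
            raise ValueError(f"Malformed segment: {pair!r}")
        parts[key.strip()] = value.strip()
    return parts


def load_connection_string(env: Mapping[str, str] = os.environ) -> Optional[str]:
    """Return the connection string, or None when telemetry is not configured."""
    raw = env.get(ENV_VAR)
    if not raw:
        return None
    if "InstrumentationKey" not in parse_connection_string(raw):
        raise ValueError(f"{ENV_VAR} has no InstrumentationKey")
    return raw
```

With this in place, local development can run without telemetry (the helper returns `None`, and the app skips the `configure_azure_monitor` call), while a typo in a deployed secret fails loudly at startup.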
Get the Connection String
# Create Application Insights resource
az monitor app-insights component create \
  --app pharmabot-insights \
  --location eastus \
  --resource-group pharmabot-rg \
  --kind web
# Get the connection string
az monitor app-insights component show \
  --app pharmabot-insights \
  --resource-group pharmabot-rg \
  --query connectionString -o tsv
Set in your Container App:
az containerapp secret set \
  --name pharmabot \
  --resource-group pharmabot-rg \
  --secrets appinsights-conn="InstrumentationKey=...;IngestionEndpoint=..."
az containerapp update \
  --name pharmabot \
  --resource-group pharmabot-rg \
  --set-env-vars APPLICATIONINSIGHTS_CONNECTION_STRING=secretref:appinsights-conn
Tracking Custom LLM Metrics
The auto-instrumentation tracks HTTP latency. For LLM-specific metrics, use the OpenTelemetry metrics API:
import time
from opentelemetry import metrics
# `client` is assumed to be an AsyncAzureOpenAI instance created elsewhere in the app
meter = metrics.get_meter("pharmabot.llm")
# Counters and histograms
token_counter = meter.create_counter(
    "llm.tokens.total",
    description="Total tokens used",
    unit="tokens",
)
latency_histogram = meter.create_histogram(
    "llm.latency_ms",
    description="LLM call latency",
    unit="ms",
)
ttft_histogram = meter.create_histogram(
    "llm.ttft_ms",
    description="Time to first token",
    unit="ms",
)
async def call_azure_openai_tracked(messages: list, model: str) -> str:
    start = time.perf_counter()
    response = await client.chat.completions.create(
        model=model,
        messages=messages,
    )
    duration_ms = (time.perf_counter() - start) * 1000
    # Record metrics with labels
    attributes = {"model": model, "endpoint": "chat"}
    # usage can be missing on some responses, so guard before reading token counts
    if response.usage is not None:
        token_counter.add(
            response.usage.prompt_tokens,
            {**attributes, "type": "prompt"},
        )
        token_counter.add(
            response.usage.completion_tokens,
            {**attributes, "type": "completion"},
        )
    latency_histogram.record(duration_ms, attributes)
    # content can be None (for example on a filtered response), so fall back to ""
    return response.choices[0].message.content or ""
Tracking Streaming TTFT
import time
async def stream_with_ttft_tracking(messages: list, model: str):
    start = time.perf_counter()
    first_token = True
    stream = await client.chat.completions.create(
        model=model,
        messages=messages,
        stream=True,
    )
    async for chunk in stream:
        # Azure OpenAI can emit chunks with an empty choices list
        # (for example, the content filter results), so skip those
        if not chunk.choices:
            continue
        content = chunk.choices[0].delta.content or ""
        if first_token and content:
            # First chunk that carries text: record time to first token once
            ttft_ms = (time.perf_counter() - start) * 1000
            ttft_histogram.record(ttft_ms, {"model": model})
            first_token = False
        yield content
Querying in Log Analytics
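Navigate to: Azure Portal → Application Insights → Logs
The queries below use the Application Insights table names (customMetrics, requests). If you open Logs from the Log Analytics Workspace instead, the same data lives in the AppMetrics and AppRequests tables, with different column names.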
All LLM requests in the last hour
customMetrics
| where name == "llm.latency_ms"
| where timestamp > ago(1h)
| summarize
    avg_latency = avg(value),
    p95_latency = percentile(value, 95),
    p99_latency = percentile(value, 99),
    request_count = count()
| project avg_latency, p95_latency, p99_latency, request_count
Token usage by model (cost tracking)
customMetrics
| where name == "llm.tokens.total"
| where timestamp > ago(24h)
| summarize total_tokens = sum(value) by tostring(customDimensions["model"])
| order by total_tokens desc
Error rate
requests
| where timestamp > ago(1h)
| where name contains "/api/chat"
// resultCode is stored as a string, so convert it before comparing
| summarize
    total = count(),
    errors = countif(toint(resultCode) >= 400)
| extend error_rate_pct = round(100.0 * errors / total, 2)
Slow requests (over 5s)
requests
| where timestamp > ago(1h)
| where duration > 5000
| where name contains "/api/chat"
| project timestamp, duration, resultCode, url
| order by duration desc
Application Map
In the Azure Portal → Application Insights → Application Map, you'll see a visual graph of:
[Browser] → [pharmabot-api] → [Azure OpenAI]
                            → [Azure AI Search]
                            → [Redis]
                            → [PostgreSQL]
Each arrow shows latency and error rate. This makes it immediately obvious which dependency is the bottleneck.
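To make the per-arrow numbers concrete, here is a rough sketch of the aggregation the map performs: group dependency calls by target, then compute average latency and error rate for each. The `summarize_dependencies` function and the sample records are hypothetical; the real map is built from the dependency telemetry Application Insights collects.

```python
from collections import defaultdict


def summarize_dependencies(calls):
    """calls: iterable of (target, duration_ms, success) tuples."""
    totals = defaultdict(lambda: {"count": 0, "total_ms": 0.0, "failures": 0})
    for target, duration_ms, success in calls:
        entry = totals[target]
        entry["count"] += 1
        entry["total_ms"] += duration_ms
        if not success:
            entry["failures"] += 1
    summary = {
        target: {
            "avg_ms": entry["total_ms"] / entry["count"],
            "error_rate_pct": round(100.0 * entry["failures"] / entry["count"], 2),
        }
        for target, entry in totals.items()
    }
    # Slowest dependency first, which is the likely bottleneck
    return dict(sorted(summary.items(), key=lambda kv: kv[1]["avg_ms"], reverse=True))


# Made-up sample data for illustration only
sample_calls = [
    ("Azure OpenAI", 1800.0, True),
    ("Azure OpenAI", 2200.0, False),
    ("Azure AI Search", 120.0, True),
    ("Redis", 3.0, True),
    ("PostgreSQL", 15.0, True),
]
summary = summarize_dependencies(sample_calls)
for target, row in summary.items():
    print(target, row)
```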
Live Metrics
Application Insights → Live Metrics shows real-time:
- Incoming request rate
- Failed request rate
- Server response time
- CPU and memory
Use this during a deployment to immediately see if the new version is healthy.
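The "is the new version healthy" judgment can be written down as an explicit rule so that everyone watching a rollout applies the same one. The sketch below is an assumption-laden example: `deployment_looks_healthy` is a hypothetical helper, and the 2% error-rate and 5000 ms thresholds are placeholders to tune for your service, with the inputs read off the Live Metrics view by hand.

```python
def deployment_looks_healthy(
    total_requests: int,
    failed_requests: int,
    p95_ms: float,
    max_error_rate_pct: float = 2.0,   # placeholder threshold
    max_p95_ms: float = 5000.0,        # placeholder threshold
) -> bool:
    """Apply a simple go/no-go rule to numbers observed after a deployment."""
    if total_requests == 0:
        return False  # no traffic yet, so there is no evidence of health
    error_rate_pct = 100.0 * failed_requests / total_requests
    return error_rate_pct <= max_error_rate_pct and p95_ms <= max_p95_ms


print(deployment_looks_healthy(total_requests=200, failed_requests=1, p95_ms=3100.0))
print(deployment_looks_healthy(total_requests=200, failed_requests=12, p95_ms=3100.0))
```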
Checkpoint
After instrumenting, deploy and make 10 requests:
for i in {1..10}; do
  curl -s http://localhost:8000/api/chat \
    -d '{"message":"What is ibuprofen?"}' \
    -H "Content-Type: application/json" > /dev/null
done
Then in Application Insights → Transaction Search, you should see 10 requests. Click any one to see the full trace: HTTP request → OpenAI call → response. Each trace breaks the latency down by step, and the token counts show up in customMetrics.
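If you would rather drive the checkpoint from Python and see the status codes, the same ten requests can be sent with the standard library. This is a minimal sketch: `build_chat_request` and `send_checkpoint_requests` are hypothetical helper names, and the URL and payload simply mirror the curl loop.

```python
import json
import urllib.error
import urllib.request


def build_chat_request(base_url: str, message: str) -> urllib.request.Request:
    """Build the same POST the curl command sends."""
    body = json.dumps({"message": message}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/api/chat",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_checkpoint_requests(base_url: str = "http://localhost:8000", n: int = 10) -> list:
    """Send n chat requests and return the HTTP status code of each."""
    statuses = []
    for _ in range(n):
        request = build_chat_request(base_url, "What is ibuprofen?")
        try:
            with urllib.request.urlopen(request, timeout=30) as response:
                statuses.append(response.status)
        except urllib.error.HTTPError as err:
            statuses.append(err.code)  # 4xx/5xx still count as a request
    return statuses


# Call send_checkpoint_requests() with the API running locally to generate the traffic.
```

Any status other than 200 in the returned list is worth matching against the failed entries in Transaction Search.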