Azure Monitor and Application Insights for LLMs
Set up Azure Monitor and Application Insights to track LLM latency, token usage, error rates, and cost for production AI services running on Azure Container Apps.
What You Need to Observe in an LLM Service
A traditional API needs three things: latency, error rate, and throughput. An LLM service needs all of that plus:
- Token usage — every token costs money
- Time to First Token (TTFT) — streaming latency perception
- Model version — which GPT-4o deployment served the request
- Prompt/completion ratio — signals prompt bloat
- Safety filter hits — how often content moderation blocks responses
- RAG retrieval quality — are we finding relevant chunks?
Azure Monitor + Application Insights is the native Azure stack for this.
Architecture
FastAPI Container App
         │
         │ (OpenTelemetry SDK)
         ▼
Application Insights ──► Log Analytics Workspace
         │
         ▼
Azure Monitor Dashboards + Alerts
Setup: Install the SDK
pip install azure-monitor-opentelemetry
Instrument Your App
In main.py, add this before creating the FastAPI app:
import os
from azure.monitor.opentelemetry import configure_azure_monitor

# Optionally set the cloud role for multi-service tracing. The distro picks up
# the standard OpenTelemetry resource attributes from the environment.
os.environ.setdefault("OTEL_SERVICE_NAME", "pharmabot-api")
os.environ.setdefault("OTEL_RESOURCE_ATTRIBUTES", "service.version=1.0.0")

configure_azure_monitor(
    connection_string=os.environ["APPLICATIONINSIGHTS_CONNECTION_STRING"],
)
This single call instruments (see the minimal main.py sketch after this list):
- All HTTP requests (via FastAPI middleware)
- All httpx/requests outbound calls (including Azure OpenAI)
- All exceptions
- Custom events and metrics you add
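The ordering is the important part: configure_azure_monitor must run before the FastAPI app object is created so the bundled FastAPI instrumentation applies to it. A minimal main.py sketch; the /healthz route is just an illustrative placeholder, not part of the original service:
import os
from azure.monitor.opentelemetry import configure_azure_monitor
from fastapi import FastAPI

# 1. Configure telemetry first (connection string + cloud role env vars as above)
configure_azure_monitor(
    connection_string=os.environ["APPLICATIONINSIGHTS_CONNECTION_STRING"],
)

# 2. Only then create the app, so incoming requests are traced
app = FastAPI(title="pharmabot-api")

@app.get("/healthz")
async def healthz():
    return {"status": "ok"}  # placeholder route for illustration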
Get the Connection String
# Create Application Insights resource
az monitor app-insights component create \
--app pharmabot-insights \
--location eastus \
--resource-group pharmabot-rg \
--kind web
# Get the connection string
az monitor app-insights component show \
--app pharmabot-insights \
--resource-group pharmabot-rg \
--query connectionString -o tsv
Set in your Container App:
az containerapp secret set \
--name pharmabot \
--resource-group pharmabot-rg \
--secrets appinsights-conn="InstrumentationKey=...;IngestionEndpoint=..."
az containerapp update \
--name pharmabot \
--resource-group pharmabot-rg \
--set-env-vars APPLICATIONINSIGHTS_CONNECTION_STRING=secretref:appinsights-conn
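If the secret-to-env-var mapping is misconfigured, the app only fails with a bare KeyError when configure_azure_monitor reads the variable. An optional startup guard with a clearer message, a sketch rather than part of the original setup:
import os

# Optional guard: fail with an actionable error if the Container App secret
# was never mapped to APPLICATIONINSIGHTS_CONNECTION_STRING
if not os.environ.get("APPLICATIONINSIGHTS_CONNECTION_STRING"):
    raise RuntimeError(
        "APPLICATIONINSIGHTS_CONNECTION_STRING is not set; "
        "check the Container App secret and the --set-env-vars mapping"
    )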
Tracking Custom LLM Metrics
The auto-instrumentation tracks HTTP latency. For LLM-specific metrics, use the OpenTelemetry metrics API:
from opentelemetry import metrics
meter = metrics.get_meter("pharmabot.llm")
# Counters and histograms
token_counter = meter.create_counter(
"llm.tokens.total",
description="Total tokens used",
unit="tokens",
)
latency_histogram = meter.create_histogram(
"llm.latency_ms",
description="LLM call latency",
unit="ms",
)
ttft_histogram = meter.create_histogram(
"llm.ttft_ms",
description="Time to first token",
unit="ms",
)
async def call_azure_openai_tracked(messages: list, model: str) -> str:
import time
start = time.perf_counter()
response = await client.chat.completions.create(
model=model,
messages=messages,
)
duration_ms = (time.perf_counter() - start) * 1000
# Record metrics with labels
attributes = {"model": model, "endpoint": "chat"}
token_counter.add(
response.usage.prompt_tokens,
{**attributes, "type": "prompt"}
)
token_counter.add(
response.usage.completion_tokens,
{**attributes, "type": "completion"}
)
latency_histogram.record(duration_ms, attributes)
    return response.choices[0].message.content
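Token counts only become actionable once they are tied to spend. One option is a derived cost counter; this is a sketch, the cost_counter, PRICE_PER_1K_TOKENS, and record_estimated_cost names are mine, and the per-1K-token rates are placeholders to replace with the current Azure OpenAI pricing for your deployments:
# Sketch: estimated-cost counter derived from token usage.
cost_counter = meter.create_counter(
    "llm.cost.usd",
    description="Estimated LLM spend",
    unit="USD",
)

PRICE_PER_1K_TOKENS = {
    "gpt-4o": {"prompt": 0.0025, "completion": 0.01},  # placeholder rates only
}

def record_estimated_cost(model: str, prompt_tokens: int, completion_tokens: int) -> None:
    rates = PRICE_PER_1K_TOKENS.get(model)
    if rates is None:
        return  # unknown deployment: record nothing rather than guess
    estimated = (
        prompt_tokens / 1000 * rates["prompt"]
        + completion_tokens / 1000 * rates["completion"]
    )
    cost_counter.add(estimated, {"model": model, "endpoint": "chat"})
Call it right after the token counters in call_azure_openai_tracked; the values then show up in Log Analytics under customMetrics like any other counter.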
Tracking Streaming TTFT
import time
async def stream_with_ttft_tracking(messages: list, model: str):
start = time.perf_counter()
first_token = True
async for chunk in await client.chat.completions.create(
model=model,
messages=messages,
stream=True,
):
        # Azure OpenAI can emit an initial chunk with no choices (prompt filter
        # results), so guard before indexing into choices
        if not chunk.choices:
            continue
        if first_token and chunk.choices[0].delta.content:
            ttft_ms = (time.perf_counter() - start) * 1000
            ttft_histogram.record(ttft_ms, {"model": model})
            first_token = False
        content = chunk.choices[0].delta.content or ""
        yield content
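To show where the generator plugs in, here is a sketch of a streaming FastAPI endpoint wrapping it. The /api/chat/stream path and the ChatRequest model are assumptions for illustration, not the original service's API:
from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from pydantic import BaseModel

# Re-created here only to keep the sketch self-contained; in practice use the
# instrumented app from main.py and the stream_with_ttft_tracking defined above.
app = FastAPI()

class ChatRequest(BaseModel):
    message: str

@app.post("/api/chat/stream")
async def chat_stream(req: ChatRequest):
    messages = [{"role": "user", "content": req.message}]
    # The tracked generator records TTFT as a side effect while streaming tokens
    return StreamingResponse(
        stream_with_ttft_tracking(messages, model="gpt-4o"),
        media_type="text/plain",
    )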
Querying in Log Analytics
Navigate to: Azure Portal → Application Insights → Logs. (Querying the backing Log Analytics workspace directly also works, but there the tables are named AppRequests and AppMetrics rather than requests and customMetrics.)
All LLM requests in the last hour
customMetrics
| where name == "llm.latency_ms"
| where timestamp > ago(1h)
| summarize
avg_latency = avg(value),
p95_latency = percentile(value, 95),
p99_latency = percentile(value, 99),
request_count = count()
| project avg_latency, p95_latency, p99_latency, request_count
Token usage by model (cost tracking)
customMetrics
| where name == "llm.tokens.total"
| where timestamp > ago(24h)
| summarize total_tokens = sum(value) by tostring(customDimensions["model"])
| order by total_tokens desc
Error rate
requests
| where timestamp > ago(1h)
| where name contains "/api/chat"
| summarize
total = count(),
errors = countif(toint(resultCode) >= 400)
| extend error_rate_pct = round(100.0 * errors / total, 2)
Slow requests (over 5s)
requests
| where timestamp > ago(1h)
| where duration > 5000
| where name contains "/api/chat"
| project timestamp, duration, resultCode, url
| order by duration desc
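The portal isn't the only consumer of these queries; they can also run from code (for example, a scheduled cost or error report) using the azure-monitor-query package. A sketch, assuming the caller's identity has query access to the workspace; note that direct workspace queries use the workspace table names (AppRequests, AppMetrics):
# pip install azure-monitor-query azure-identity
from datetime import timedelta
from azure.identity import DefaultAzureCredential
from azure.monitor.query import LogsQueryClient

client = LogsQueryClient(DefaultAzureCredential())

# Workspace tables use different names than the Application Insights blade:
# requests -> AppRequests, customMetrics -> AppMetrics
query = """
AppRequests
| where TimeGenerated > ago(1h)
| summarize total = count(), avg_ms = avg(DurationMs) by ResultCode
"""

result = client.query_workspace(
    workspace_id="<log-analytics-workspace-id>",  # placeholder
    query=query,
    timespan=timedelta(hours=1),
)
for table in result.tables:
    for row in table.rows:
        print(row)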
Application Map
In the Azure Portal → Application Insights → Application Map, you'll see a visual graph of:
[Browser] → [pharmabot-api] → [Azure OpenAI]
                            → [Azure AI Search]
                            → [Redis]
                            → [PostgreSQL]
Each arrow shows latency and error rate. This makes it immediately obvious which dependency is the bottleneck.
Live Metrics
Application Insights → Live Metrics shows real-time:
- Incoming request rate
- Failed request rate
- Server response time
- CPU and memory
Use this during a deployment to immediately see if the new version is healthy.
Checkpoint
After instrumenting, run the service (locally here, or against your deployed Container App URL) and make 10 requests:
for i in {1..10}; do
curl -s http://localhost:8000/api/chat \
-d '{"message":"What is ibuprofen?"}' \
-H "Content-Type: application/json" > /dev/null
done
Then in Application Insights → Transaction Search, you should see 10 requests. Click any one to see the full trace: HTTP request → OpenAI call → response, with the latency of each step broken down. Token counts show up separately under customMetrics.