AI Systems · Intermediate

Azure Monitor and Application Insights for LLMs

Set up Azure Monitor and Application Insights to track LLM latency, token usage, error rates, and cost for production AI services running on Azure Container Apps.

Asma Hafeez Khan · May 15, 2026 · 4 min read
LLMOps · Azure Monitor · Application Insights · Observability · Azure

What You Need to Observe in an LLM Service

A traditional API needs three core signals: latency, error rate, and throughput. An LLM service needs all of that, plus:

  • Token usage — every token costs money
  • Time to First Token (TTFT) — streaming latency perception
  • Model version — which GPT-4o deployment served the request
  • Prompt/completion ratio — signals prompt bloat
  • Safety filter hits — how often content moderation blocks responses
  • RAG retrieval quality — are we finding relevant chunks?

Azure Monitor + Application Insights is the native Azure stack for this.


Architecture

FastAPI Container App
        │
        │  (OpenTelemetry SDK)
        ▼
Application Insights ──► Log Analytics Workspace
        │
        ▼
Azure Monitor Dashboards + Alerts

Setup: Install the SDK

Bash
pip install azure-monitor-opentelemetry

Instrument Your App

In main.py, add this before creating the FastAPI app:

Python
import os

from azure.monitor.opentelemetry import configure_azure_monitor

# The cloud role name shown in Application Map comes from the standard
# OpenTelemetry resource attributes, which the distro reads from env vars.
os.environ.setdefault("OTEL_SERVICE_NAME", "pharmabot-api")
os.environ.setdefault("OTEL_RESOURCE_ATTRIBUTES", "service.version=1.0.0")

configure_azure_monitor(
    connection_string=os.environ["APPLICATIONINSIGHTS_CONNECTION_STRING"],
)

This single call instruments:

  • All HTTP requests (via FastAPI middleware)
  • Outbound calls made with requests/urllib3 (calls through the openai SDK go via httpx, which may additionally need the opentelemetry-instrumentation-httpx package)
  • All exceptions
  • Custom events and metrics you add

Get the Connection String

Bash
# Create Application Insights resource
az monitor app-insights component create \
  --app pharmabot-insights \
  --location eastus \
  --resource-group pharmabot-rg \
  --kind web

# Get the connection string
az monitor app-insights component show \
  --app pharmabot-insights \
  --resource-group pharmabot-rg \
  --query connectionString -o tsv

Set in your Container App:

Bash
az containerapp secret set \
  --name pharmabot \
  --resource-group pharmabot-rg \
  --secrets appinsights-conn="InstrumentationKey=...;IngestionEndpoint=..."

az containerapp update \
  --name pharmabot \
  --resource-group pharmabot-rg \
  --set-env-vars APPLICATIONINSIGHTS_CONNECTION_STRING=secretref:appinsights-conn

Tracking Custom LLM Metrics

The auto-instrumentation tracks HTTP latency. For LLM-specific metrics, use the OpenTelemetry metrics API:

Python
from opentelemetry import metrics

meter = metrics.get_meter("pharmabot.llm")

# Counters and histograms
token_counter = meter.create_counter(
    "llm.tokens.total",
    description="Total tokens used",
    unit="tokens",
)
latency_histogram = meter.create_histogram(
    "llm.latency_ms",
    description="LLM call latency",
    unit="ms",
)
ttft_histogram = meter.create_histogram(
    "llm.ttft_ms",
    description="Time to first token",
    unit="ms",
)

# `client` is assumed to be an AsyncAzureOpenAI instance created at startup.
async def call_azure_openai_tracked(messages: list, model: str) -> str:
    import time

    start = time.perf_counter()
    
    response = await client.chat.completions.create(
        model=model,
        messages=messages,
    )
    
    duration_ms = (time.perf_counter() - start) * 1000
    
    # Record metrics with labels
    attributes = {"model": model, "endpoint": "chat"}
    
    token_counter.add(
        response.usage.prompt_tokens,
        {**attributes, "type": "prompt"}
    )
    token_counter.add(
        response.usage.completion_tokens,
        {**attributes, "type": "completion"}
    )
    latency_histogram.record(duration_ms, attributes)
    
    return response.choices[0].message.content
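Token counts translate directly into dollars, so it can be worth recording an estimated cost next to the raw counters. A sketch with hypothetical per-1K-token prices; substitute your deployment's actual Azure OpenAI rates:

```python
# Hypothetical per-1K-token prices in USD -- check your actual
# Azure OpenAI pricing, which varies by model and region.
PRICES_PER_1K = {
    "gpt-4o": {"prompt": 0.0025, "completion": 0.010},
}


def estimate_cost_usd(model: str, prompt_tokens: int, completion_tokens: int) -> float:
    """Rough per-request cost from the usage block of a completion response."""
    prices = PRICES_PER_1K[model]
    cost = (
        prompt_tokens / 1000 * prices["prompt"]
        + completion_tokens / 1000 * prices["completion"]
    )
    return round(cost, 6)
```

You could record this via another counter (e.g. `meter.create_counter("llm.cost.usd", unit="usd")`) with the same attributes, which makes cost-by-model queries one `summarize` away.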

Tracking Streaming TTFT

Python
import time

async def stream_with_ttft_tracking(messages: list, model: str):
    start = time.perf_counter()
    first_token = True
    
    async for chunk in await client.chat.completions.create(
        model=model,
        messages=messages,
        stream=True,
    ):
        # Azure can emit a leading chunk with empty choices (content-filter
        # metadata), so guard before indexing.
        if not chunk.choices:
            continue
        if first_token and chunk.choices[0].delta.content:
            ttft_ms = (time.perf_counter() - start) * 1000
            ttft_histogram.record(ttft_ms, {"model": model})
            first_token = False

        content = chunk.choices[0].delta.content or ""
        yield content

Querying in Log Analytics

Navigate to: Azure Portal → Log Analytics Workspace → Logs

LLM latency summary for the last hour

KUSTO
customMetrics
| where name == "llm.latency_ms"
| where timestamp > ago(1h)
| summarize
    avg_latency = avg(value),
    p95_latency = percentile(value, 95),
    p99_latency = percentile(value, 99),
    request_count = count()
| project avg_latency, p95_latency, p99_latency, request_count

Token usage by model (cost tracking)

KUSTO
customMetrics
| where name == "llm.tokens.total"
| where timestamp > ago(24h)
| summarize total_tokens = sum(value) by tostring(customDimensions["model"])
| order by total_tokens desc
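Since the app also records `llm.ttft_ms`, a similar query surfaces streaming latency percentiles by model:

KUSTO
customMetrics
| where name == "llm.ttft_ms"
| where timestamp > ago(24h)
| summarize
    p50_ttft = percentile(value, 50),
    p95_ttft = percentile(value, 95)
  by tostring(customDimensions["model"])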

Error rate

KUSTO
requests
| where timestamp > ago(1h)
| where name contains "/api/chat"
| summarize
    total = count(),
    errors = countif(toint(resultCode) >= 400)  // resultCode is a string, so cast first
| extend error_rate_pct = round(100.0 * errors / total, 2)

Slow requests (over 5s)

KUSTO
requests
| where timestamp > ago(1h)
| where duration > 5000
| where name contains "/api/chat"
| project timestamp, duration, resultCode, url
| order by duration desc

Application Map

In the Azure Portal → Application Insights → Application Map, you'll see a visual graph of:

[Browser] → [pharmabot-api] → [Azure OpenAI]
                            → [Azure AI Search]
                            → [Redis]
                            → [PostgreSQL]

Each arrow shows latency and error rate. This makes it immediately obvious which dependency is the bottleneck.


Live Metrics

Application Insights → Live Metrics shows real-time:

  • Incoming request rate
  • Failed request rate
  • Server response time
  • CPU and memory

Use this during a deployment to immediately see if the new version is healthy.


Checkpoint

After instrumenting, run the service (locally here; swap in your deployed URL) and make 10 requests:

Bash
for i in {1..10}; do
  curl -s http://localhost:8000/api/chat \
    -d '{"message":"What is ibuprofen?"}' \
    -H "Content-Type: application/json" > /dev/null
done

Then in Application Insights → Transaction Search, you should see 10 requests. Click any one to see the full trace: HTTP request → OpenAI call → response. All latencies broken down, all tokens logged.

Enjoyed this article?

Explore the AI Systems learning path for more.
