LLMOps & Deployment · Lesson 10 of 16

Azure Monitor and Application Insights for LLMs

What You Need to Observe in an LLM Service

For a traditional API you monitor latency, error rate, and throughput. An LLM service needs all of that plus:

  • Token usage — every token costs money
  • Time to First Token (TTFT) — streaming latency perception
  • Model version — which GPT-4o deployment served the request
  • Prompt/completion ratio — signals prompt bloat
  • Safety filter hits — how often content moderation blocks responses
  • RAG retrieval quality — are we finding relevant chunks?

Azure Monitor + Application Insights is the native Azure stack for this.
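
To make the list above concrete, here is a minimal sketch of a per-request record that groups these signals and derives the prompt/completion ratio. The class name `LLMCallRecord` and its field names are illustrative assumptions, not part of any Azure SDK, and the token counts are made up.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LLMCallRecord:
    """One request's worth of LLM-specific signals (hypothetical schema)."""
    model_deployment: str            # which deployment served the request
    prompt_tokens: int
    completion_tokens: int
    ttft_ms: Optional[float] = None  # None for non-streaming calls
    blocked_by_safety_filter: bool = False

    @property
    def total_tokens(self) -> int:
        return self.prompt_tokens + self.completion_tokens

    @property
    def prompt_completion_ratio(self) -> Optional[float]:
        # A ratio that climbs over time suggests prompts are growing faster than answers.
        if self.completion_tokens == 0:
            return None
        return self.prompt_tokens / self.completion_tokens


# Made-up numbers, purely for illustration.
record = LLMCallRecord("gpt-4o", prompt_tokens=1200, completion_tokens=300, ttft_ms=420.0)
print(record.total_tokens, record.prompt_completion_ratio)
```

Each of these fields maps onto a metric or a metric dimension later in the lesson.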


Architecture

FastAPI Container App
        │
        │  (OpenTelemetry SDK)
        ▼
Application Insights ──► Log Analytics Workspace
        │
        ▼
Azure Monitor Dashboards + Alerts

Setup: Install the SDK

Bash
pip install azure-monitor-opentelemetry

Instrument Your App

In main.py, add this before creating the FastAPI app:

Python
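import os

from azure.monitor.opentelemetry import configure_azure_monitor
from opentelemetry.sdk.resources import Resource

configure_azure_monitor(
    connection_string=os.environ["APPLICATIONINSIGHTS_CONNECTION_STRING"],
    # Optionally set the cloud role for multi-service tracing.
    # service.name is what Application Insights shows as the cloud role name.
    resource=Resource.create({
        "service.name": "pharmabot-api",
        "service.version": "1.0.0",
    }),
)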

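This single call instruments:

  • Incoming HTTP requests (via the bundled FastAPI instrumentation)
  • Outbound calls made with requests, urllib, and urllib3
  • Unhandled exceptions raised while serving a request
  • Custom events and metrics you add

The openai SDK sends its requests through httpx, which is not in the distro's default instrumentation set. To see Azure OpenAI calls as dependencies, install opentelemetry-instrumentation-httpx and call HTTPXClientInstrumentor().instrument() at startup.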

Get the Connection String

Bash
# Create Application Insights resource
az monitor app-insights component create \
  --app pharmabot-insights \
  --location eastus \
  --resource-group pharmabot-rg \
  --kind web

# Get the connection string
az monitor app-insights component show \
  --app pharmabot-insights \
  --resource-group pharmabot-rg \
  --query connectionString -o tsv

Set in your Container App:

Bash
az containerapp secret set \
  --name pharmabot \
  --resource-group pharmabot-rg \
  --secrets appinsights-conn="InstrumentationKey=...;IngestionEndpoint=..."

az containerapp update \
  --name pharmabot \
  --resource-group pharmabot-rg \
  --set-env-vars APPLICATIONINSIGHTS_CONNECTION_STRING=secretref:appinsights-conn
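
A malformed or truncated connection string is an easy mistake to make when copying the secret. As a rough sketch, the helper below checks the `Key=Value;Key=Value` shape at startup so the app fails fast. The function name `parse_connection_string` is hypothetical, the sample value is a placeholder rather than a real key, and the check assumes an `InstrumentationKey` segment is always present.

```python
def parse_connection_string(conn: str) -> dict:
    """Split 'Key=Value;Key=Value' into a dict and check the expected key exists."""
    parts = {}
    for segment in conn.split(";"):
        if not segment.strip():
            continue  # tolerate a trailing semicolon
        key, sep, value = segment.partition("=")
        if not sep:
            raise ValueError(f"Malformed segment: {segment!r}")
        parts[key.strip()] = value.strip()
    if "InstrumentationKey" not in parts:
        raise ValueError("Connection string has no InstrumentationKey")
    return parts


# Placeholder value for illustration only.
sample = (
    "InstrumentationKey=00000000-0000-0000-0000-000000000000;"
    "IngestionEndpoint=https://example.invalid/"
)
parsed = parse_connection_string(sample)
print(sorted(parsed))
```

In main.py you could run this on os.environ["APPLICATIONINSIGHTS_CONNECTION_STRING"] before calling configure_azure_monitor.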

Tracking Custom LLM Metrics

The auto-instrumentation tracks HTTP latency. For LLM-specific metrics, use the OpenTelemetry metrics API:

Python
import time

from opentelemetry import metrics

meter = metrics.get_meter("pharmabot.llm")

# Counters and histograms
token_counter = meter.create_counter(
    "llm.tokens.total",
    description="Total tokens used",
    unit="tokens",
)
latency_histogram = meter.create_histogram(
    "llm.latency_ms",
    description="LLM call latency",
    unit="ms",
)
ttft_histogram = meter.create_histogram(
    "llm.ttft_ms",
    description="Time to first token",
    unit="ms",
)

# `client` is assumed to be an openai.AsyncAzureOpenAI instance created at startup.
async def call_azure_openai_tracked(messages: list, model: str) -> str:
    start = time.perf_counter()

    response = await client.chat.completions.create(
        model=model,  # the Azure OpenAI deployment name
        messages=messages,
    )

    duration_ms = (time.perf_counter() - start) * 1000

    # Record metrics with labels
    attributes = {"model": model, "endpoint": "chat"}

    # usage is typed as optional in the SDK, so guard before reading it
    if response.usage is not None:
        token_counter.add(
            response.usage.prompt_tokens,
            {**attributes, "type": "prompt"},
        )
        token_counter.add(
            response.usage.completion_tokens,
            {**attributes, "type": "completion"},
        )
    latency_histogram.record(duration_ms, attributes)

    # content can be None (for example, when the model returns a tool call)
    return response.choices[0].message.content or ""
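
The function above records latency only when the call succeeds. One possible way to capture failed calls as well is a small timing context manager, sketched here with a stand-in recorder so it runs without Azure. The names `record_duration` and `fake_record` and the `outcome` attribute are assumptions made for this example.

```python
import time
from contextlib import contextmanager


@contextmanager
def record_duration(record, attributes):
    """Time the enclosed block and pass (duration_ms, attributes) to `record`."""
    start = time.perf_counter()
    outcome = "ok"
    try:
        yield
    except Exception:
        outcome = "error"
        raise
    finally:
        # Runs on both success and failure, so slow failures are measured too.
        duration_ms = (time.perf_counter() - start) * 1000
        record(duration_ms, {**attributes, "outcome": outcome})


recorded = []


def fake_record(value, attrs):
    """Stand-in for a histogram's record method; just keeps the values."""
    recorded.append((value, attrs))


with record_duration(fake_record, {"model": "gpt-4o"}):
    time.sleep(0.01)  # simulated successful call

try:
    with record_duration(fake_record, {"model": "gpt-4o"}):
        raise RuntimeError("simulated upstream failure")
except RuntimeError:
    pass

print([attrs["outcome"] for _, attrs in recorded])
```

In the real service you would pass latency_histogram.record in place of fake_record, since it takes the same (value, attributes) arguments.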

Tracking Streaming TTFT

Python
import time

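# Reuses `client` and `ttft_histogram` from the previous snippet.
async def stream_with_ttft_tracking(messages: list, model: str):
    start = time.perf_counter()
    first_token = True

    stream = await client.chat.completions.create(
        model=model,
        messages=messages,
        stream=True,
    )
    async for chunk in stream:
        # Azure OpenAI can emit chunks with an empty `choices` list
        # (for example, content filter results), so skip those.
        if not chunk.choices:
            continue

        content = chunk.choices[0].delta.content
        if not content:
            continue  # role-only or empty deltas carry no text

        if first_token:
            ttft_ms = (time.perf_counter() - start) * 1000
            ttft_histogram.record(ttft_ms, {"model": model})
            first_token = False

        yield content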
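
To check the TTFT logic without calling Azure, you can drive it with a fake stream. The sketch below is a self-contained approximation: `fake_chunk`, `fake_stream`, and `collect_with_ttft` are hypothetical helpers, and the chunk objects only mimic the `choices[0].delta.content` shape of the SDK's streaming chunks. The first fake chunk has an empty `choices` list, which Azure OpenAI streams can contain.

```python
import asyncio
import time
from types import SimpleNamespace


def fake_chunk(text=None, empty_choices=False):
    """Build an object shaped like a streaming chunk (stand-in for the SDK type)."""
    if empty_choices:
        return SimpleNamespace(choices=[])
    delta = SimpleNamespace(content=text)
    return SimpleNamespace(choices=[SimpleNamespace(delta=delta)])


async def fake_stream():
    """Hypothetical stream: metadata-only chunk, role-only chunk, then text."""
    yield fake_chunk(empty_choices=True)
    yield fake_chunk(None)
    await asyncio.sleep(0.01)  # simulated model thinking time
    for piece in ["Ibu", "profen", " is an NSAID."]:
        yield fake_chunk(piece)


async def collect_with_ttft(stream):
    """Return (full_text, ttft_ms); ttft_ms stays None if no text ever arrives."""
    start = time.perf_counter()
    ttft_ms = None
    pieces = []
    async for chunk in stream:
        if not chunk.choices:
            continue  # nothing to read in this chunk
        content = chunk.choices[0].delta.content
        if not content:
            continue
        if ttft_ms is None:
            ttft_ms = (time.perf_counter() - start) * 1000
        pieces.append(content)
    return "".join(pieces), ttft_ms


text, ttft_ms = asyncio.run(collect_with_ttft(fake_stream()))
print(text)
```

The TTFT clock stops at the first chunk that carries text, not at the first chunk received, which matches what a user perceives.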

Querying in Log Analytics

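Navigate to: Azure Portal → Application Insights → Logs

The queries below use the Application Insights table names (customMetrics, requests). If you query from the Log Analytics workspace instead, the same data lives in the AppMetrics and AppRequests tables under different column names.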

LLM latency over the last hour

KUSTO
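customMetrics
| where name == "llm.latency_ms"
| where timestamp > ago(1h)
// Histogram rows arrive pre-aggregated per export interval, so weight the
// average by valueCount instead of averaging the rows directly.
// Per-request percentiles (p95, p99) cannot be recovered from these rows.
| summarize
    avg_latency = sum(valueSum) / sum(valueCount),
    min_latency = min(valueMin),
    max_latency = max(valueMax),
    request_count = sum(valueCount)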
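
One caveat when summarizing histogram metrics, stated here as an assumption about exporter behavior: OpenTelemetry histograms typically reach customMetrics pre-aggregated, one row per export interval carrying a sum and a count, rather than one row per request. Averaging those rows directly can be badly skewed. The sketch below uses made-up numbers to show the difference; `naive_average` and `weighted_average` are illustrative names.

```python
# Each tuple is one hypothetical exported row:
# (sum of latencies in ms, number of requests in that interval).
rows = [
    (900.0, 1),    # quiet minute: one 900 ms request
    (5000.0, 10),  # busy minute: ten requests averaging 500 ms
]


def naive_average(rows):
    """Average of per-row sums: what averaging the raw rows would compute."""
    return sum(total for total, _ in rows) / len(rows)


def weighted_average(rows):
    """Per-request mean: total latency divided by total request count."""
    total_count = sum(count for _, count in rows)
    return sum(total for total, _ in rows) / total_count


print(naive_average(rows))     # far above any single request's latency
print(weighted_average(rows))  # close to the typical request
```

It is worth inspecting a few raw rows in your own workspace to confirm how the sum and count columns are populated before trusting either number.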

Token usage by model (cost tracking)

KUSTO
customMetrics
| where name == "llm.tokens.total"
| where timestamp > ago(24h)
| summarize total_tokens = sum(value) by model = tostring(customDimensions["model"])
| order by total_tokens desc
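
Token totals become a cost estimate once you multiply them by your per-token prices. Because the counter records a "type" dimension (prompt or completion), the two can be priced separately. The following is a minimal sketch: `estimate_cost` is a hypothetical helper, and the prices in the example are placeholders, not real Azure OpenAI rates.

```python
def estimate_cost(
    prompt_tokens: int,
    completion_tokens: int,
    prompt_price_per_1k: float,
    completion_price_per_1k: float,
) -> float:
    """Estimate spend from token totals and caller-supplied per-1k-token prices."""
    prompt_cost = (prompt_tokens / 1000) * prompt_price_per_1k
    completion_cost = (completion_tokens / 1000) * completion_price_per_1k
    return prompt_cost + completion_cost


# Placeholder prices in an arbitrary currency unit; look up your actual rates.
daily_cost = estimate_cost(
    prompt_tokens=200_000,
    completion_tokens=50_000,
    prompt_price_per_1k=0.5,
    completion_price_per_1k=1.5,
)
print(daily_cost)
```

To feed this from the query above, you would also group by the "type" dimension so prompt and completion totals come back as separate rows.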

Error rate

KUSTO
requests
| where timestamp > ago(1h)
| where name contains "/api/chat"
| summarize
    total = count(),
    errors = countif(toint(resultCode) >= 400)
| extend error_rate_pct = round(100.0 * errors / total, 2)

Slow requests (over 5s)

KUSTO
requests
| where timestamp > ago(1h)
| where duration > 5000
| where name contains "/api/chat"
| project timestamp, duration, resultCode, url
| order by duration desc

Application Map

In the Azure Portal → Application Insights → Application Map, you'll see a visual graph of:

[Browser] → [pharmabot-api] → [Azure OpenAI]
                            → [Azure AI Search]
                            → [Redis]
                            → [PostgreSQL]

Each arrow shows latency and error rate, so a slow or failing dependency stands out at a glance. A dependency appears on the map only if its client library is instrumented.


Live Metrics

Application Insights → Live Metrics shows real-time:

  • Incoming request rate
  • Failed request rate
  • Server response time
  • CPU and memory

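Use this during a deployment to see immediately whether the new version is healthy. With the OpenTelemetry distro, Live Metrics may need to be switched on by passing enable_live_metrics=True to configure_azure_monitor.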


Checkpoint

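After instrumenting, run the app and send 10 requests. The loop below targets a local instance; replace the URL with your Container App's URL to test the deployed service: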

Bash
for i in {1..10}; do
  curl -s http://localhost:8000/api/chat \
    -d '{"message":"What is ibuprofen?"}' \
    -H "Content-Type: application/json" > /dev/null
done

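Then in Application Insights → Transaction Search, you should see 10 requests (telemetry can take a few minutes to appear). Click any one to see the full trace: HTTP request → OpenAI call → response, with latency broken down per step. Token counts are recorded as metrics, so look for them in customMetrics rather than in the trace.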