AI Systems · Intermediate

Key LLM Metrics: TTFT, Cost/Request, Error Rate

The exact metrics every production LLM service must track. Learn what TTFT, cost-per-request, token efficiency, and error rate mean, how to measure them, and what good numbers look like.

Asma Hafeez Khan · May 15, 2026 · 5 min read
LLMOps · Metrics · Observability · Performance · Cost

The Four Pillars of LLM Observability

Traditional API monitoring tracks latency, error rate, throughput, and saturation (the RED and USE methods).

LLM services need four additional pillars:

| Pillar | Metric | Why It Matters |
|---|---|---|
| Latency | TTFT, total latency, p95 | User experience |
| Cost | Tokens/request, $/request | Business viability |
| Quality | Groundedness, relevance, refusal rate | Service correctness |
| Reliability | Error rate, retry rate, timeout rate | Service health |


1. Time to First Token (TTFT)

Definition: Time from sending the request to receiving the first token of the response.

Why TTFT matters more than total latency: With streaming, users see the first word in 0.8s and perceive the response as fast — even if the full response takes 10 seconds. TTFT is the latency users feel.

Good numbers (Azure OpenAI, GPT-4o):

  • TTFT under 800ms: excellent
  • TTFT 800ms–2s: acceptable
  • TTFT over 2s: users notice the delay

Factors that affect TTFT:

  1. Region: East US typically faster than West Europe for Azure OpenAI
  2. Model: GPT-4o-mini has lower TTFT than GPT-4o
  3. Prompt length: longer prompt → more tokens to process before generating
  4. Concurrent load: if the Azure OpenAI service is under heavy load, TTFT increases

Measuring TTFT:

Python
import time
from openai import AsyncAzureOpenAI

async def measure_ttft(client: AsyncAzureOpenAI, messages: list) -> tuple[float, float]:
    start = time.perf_counter()
    ttft_ms = None

    stream = await client.chat.completions.create(
        model="gpt-4o",
        messages=messages,
        stream=True,
    )

    async for chunk in stream:
        # Some chunks (e.g. Azure's initial content-filter chunk) carry no choices
        content = chunk.choices[0].delta.content if chunk.choices else None
        if content and ttft_ms is None:
            ttft_ms = (time.perf_counter() - start) * 1000  # first token arrived
        # Keep consuming so total_ms reflects the full response

    total_ms = (time.perf_counter() - start) * 1000
    return ttft_ms, total_ms
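
To try it quickly, a minimal usage sketch — the endpoint, key, API version, and prompt below are placeholders for your own Azure OpenAI setup:

Python
import asyncio

async def main():
    # Placeholder configuration — swap in your own resource, key, and API version
    client = AsyncAzureOpenAI(
        azure_endpoint="https://<your-resource>.openai.azure.com",
        api_key="<your-key>",
        api_version="2024-06-01",
    )
    ttft_ms, total_ms = await measure_ttft(
        client,
        [{"role": "user", "content": "Summarize RAG in one sentence."}],
    )
    print(f"TTFT: {ttft_ms:.0f}ms, total: {total_ms:.0f}ms")

asyncio.run(main())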

2. Cost Per Request

Why this matters: A GPT-4o request with 2,000 tokens costs ~$0.02. At 10,000 requests/day that's $200/day or $6,000/month — just for one endpoint.

Azure OpenAI pricing (GPT-4o, as of 2026):

  • Input: $5 per 1M tokens
  • Output: $15 per 1M tokens

Formula:

cost_per_request = (prompt_tokens * input_price_per_1M + completion_tokens * output_price_per_1M) / 1,000,000

Track this in code:

Python
INPUT_PRICE_PER_MILLION = 5.0   # GPT-4o input
OUTPUT_PRICE_PER_MILLION = 15.0  # GPT-4o output

def calculate_cost(usage) -> float:
    input_cost  = usage.prompt_tokens     * INPUT_PRICE_PER_MILLION / 1_000_000
    output_cost = usage.completion_tokens * OUTPUT_PRICE_PER_MILLION / 1_000_000
    return round(input_cost + output_cost, 6)

# After each LLM call:
cost = calculate_cost(response.usage)
log.info("llm_cost", cost_usd=cost, prompt_tokens=response.usage.prompt_tokens)

Daily cost dashboard query (Log Analytics):

KUSTO
customEvents
| where name == "llm_call_completed"
| where timestamp > ago(24h)
| summarize
    daily_cost_usd = sum(todouble(customDimensions["cost_usd"])),
    total_requests = count(),
    avg_cost_per_req = avg(todouble(customDimensions["cost_usd"]))
| project daily_cost_usd, total_requests, avg_cost_per_req

3. Token Efficiency Ratio

Definition: prompt_tokens / completion_tokens

What it tells you: If your prompt is 1,500 tokens and the completion is 50 tokens, your ratio is 30:1. You're paying mostly for the context, not the answer. This signals an opportunity to shorten the prompt or cache it.

Healthy ratio: 3:1 to 8:1 for RAG systems (more context than generation is expected). Above 15:1: your system prompt is probably too long.

Python
ratio = response.usage.prompt_tokens / max(response.usage.completion_tokens, 1)
log.info("token_ratio", ratio=round(ratio, 1))

4. Error Rate

Categories of errors:

| Error | HTTP Code | Meaning |
|---|---|---|
| Rate limit | 429 | Too many requests — backoff and retry |
| Content filter | 400 (content_filter) | Prompt or response blocked |
| Timeout | 408/504 | LLM took too long |
| Model overload | 503 | Azure OpenAI is overloaded |
| Invalid request | 400 | Bad prompt format |

Track each category separately:

Python
from openai import APITimeoutError, BadRequestError, RateLimitError

async def call_with_tracking(messages):
    try:
        response = await client.chat.completions.create(...)
        error_counter.add(0, {"type": "none"})  # emit a zero so the "none" series exists
        return response

    except RateLimitError:
        error_counter.add(1, {"type": "rate_limit"})
        raise
    except BadRequestError as e:
        # Azure OpenAI returns 400 with code "content_filter" when the prompt is blocked
        error_type = "content_filter" if e.code == "content_filter" else "invalid_request"
        error_counter.add(1, {"type": error_type})
        raise
    except APITimeoutError:
        error_counter.add(1, {"type": "timeout"})
        raise
    except Exception:
        error_counter.add(1, {"type": "unknown"})
        raise

Acceptable error rates:

  • Rate limit errors: under 0.5% (if higher, request quota increase or add caching)
  • Content filter: depends on use case; over 2% suggests prompt engineering issues
  • Timeout: under 0.1%
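
To check these thresholds, a sketch of a Log Analytics query — it assumes you log an error_type dimension on each llm_call_completed event (adjust the names to your own logging schema):

KUSTO
customEvents
| where name == "llm_call_completed"
| where timestamp > ago(24h)
| summarize
    total_requests = count(),
    rate_limit_pct = 100.0 * countif(tostring(customDimensions["error_type"]) == "rate_limit") / count(),
    content_filter_pct = 100.0 * countif(tostring(customDimensions["error_type"]) == "content_filter") / count(),
    timeout_pct = 100.0 * countif(tostring(customDimensions["error_type"]) == "timeout") / count()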

5. Retrieval Quality Metrics (RAG Systems)

If you have a RAG pipeline, also track:

| Metric | Measurement |
|---|---|
| Retrieval score | Cosine similarity of top document (should be over 0.7) |
| Documents retrieved | How many chunks returned (should match top_k config) |
| Fallback rate | How often RAG finds nothing → LLM answers from its training data alone |
| Citation coverage | What % of answers include a citation |

Python
docs = await retriever.search(query, top_k=5)

retrieval_score = docs[0].score if docs else 0
log.info(
    "retrieval_completed",
    doc_count=len(docs),
    top_score=round(retrieval_score, 3),
    fallback=retrieval_score < 0.6,
)
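
A sketch of the matching fallback-rate query, assuming the retrieval_completed log above is exported to customEvents with fallback and top_score dimensions:

KUSTO
customEvents
| where name == "retrieval_completed"
| where timestamp > ago(24h)
| summarize
    fallback_rate_pct = 100.0 * countif(tolower(tostring(customDimensions["fallback"])) == "true") / count(),
    avg_top_score = avg(todouble(customDimensions["top_score"]))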

6. SLOs for LLM Services

Define Service Level Objectives before you alert:

| SLO | Target |
|---|---|
| TTFT p95 | under 1.5s |
| Total latency p95 | under 8s |
| Error rate | under 1% |
| Cost per request | under $0.05 |
| Availability | 99.5% |
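
To see whether you are meeting the TTFT objective, a sketch of a query you could alert on — it assumes a ttft_ms dimension is logged per call (a naming assumption, not something the earlier snippets emit):

KUSTO
customEvents
| where name == "llm_call_completed"
| where timestamp > ago(1h)
| summarize ttft_p95_ms = percentile(todouble(customDimensions["ttft_ms"]), 95)
| extend slo_met = ttft_p95_ms < 1500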


Checkpoint: Build a Metrics Dashboard

In Azure Monitor → Workbooks, create a workbook with:

  1. Line chart: p95 TTFT over 24h
  2. Bar chart: total tokens/hour (input vs output)
  3. Single number: cost today vs yesterday
  4. Table: top 10 slowest requests

This gives you a single-pane view of your LLM service health.
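
As a starting point for the last tile, a sketch of the "top 10 slowest requests" query — total_ms, prompt_tokens, and cost_usd are assumed dimension names; rename them to match your logging:

KUSTO
customEvents
| where name == "llm_call_completed"
| where timestamp > ago(24h)
| extend total_ms = todouble(customDimensions["total_ms"])
| top 10 by total_ms desc
| project timestamp, total_ms, prompt_tokens = tostring(customDimensions["prompt_tokens"]), cost_usd = todouble(customDimensions["cost_usd"])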

Enjoyed this article?

Explore the AI Systems learning path for more.
