AI Systems · Intermediate

Key LLM Metrics: TTFT, Cost/Request, Error Rate

The exact metrics every production LLM service must track. Learn what TTFT, cost-per-request, token efficiency, and error rate mean, how to measure them, and what good numbers look like.

Asma Hafeez Khan · May 15, 2026 · 5 min read
LLMOps · Metrics · Observability · Performance · Cost

The Four Pillars of LLM Observability

Traditional API monitoring tracks latency, error rate, throughput, and saturation (the RED and USE methods).

LLM services need four additional pillars:

| Pillar | Metric | Why It Matters |
|---|---|---|
| Latency | TTFT, total latency, p95 | User experience |
| Cost | Tokens/request, $/request | Business viability |
| Quality | Groundedness, relevance, refusal rate | Service correctness |
| Reliability | Error rate, retry rate, timeout rate | Service health |


1. Time to First Token (TTFT)

Definition: Time from sending the request to receiving the first token of the response.

Why TTFT matters more than total latency: With streaming, users see the first word in 0.8s and perceive the response as fast — even if the full response takes 10 seconds. TTFT is the latency users feel.

Good numbers (Azure OpenAI, GPT-4o):

  • TTFT under 800ms: excellent
  • TTFT 800ms–2s: acceptable
  • TTFT over 2s: users notice the delay

Factors that affect TTFT:

  1. Region: East US typically faster than West Europe for Azure OpenAI
  2. Model: GPT-4o-mini has lower TTFT than GPT-4o
  3. Prompt length: longer prompt → more tokens to process before generating
  4. Concurrent load: if the Azure OpenAI service is under heavy load, TTFT increases

Measuring TTFT:

Python
import time
from openai import AsyncAzureOpenAI

async def measure_ttft(client: AsyncAzureOpenAI, messages: list) -> tuple[float, float]:
    start = time.perf_counter()
    ttft_ms = None

    stream = await client.chat.completions.create(
        model="gpt-4o",
        messages=messages,
        stream=True,
    )

    async for chunk in stream:
        # Some chunks (e.g. Azure's initial content-filter chunk) carry no choices
        content = chunk.choices[0].delta.content if chunk.choices else None
        if content and ttft_ms is None:
            ttft_ms = (time.perf_counter() - start) * 1000  # first token arrived
        # Keep consuming so total_ms reflects the full response

    total_ms = (time.perf_counter() - start) * 1000
    return ttft_ms, total_ms
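
To try it quickly, a minimal usage sketch — the endpoint, key, API version, and prompt below are placeholders for your own Azure OpenAI setup:

Python
import asyncio

async def main():
    # Placeholder configuration — swap in your own resource, key, and API version
    client = AsyncAzureOpenAI(
        azure_endpoint="https://<your-resource>.openai.azure.com",
        api_key="<your-key>",
        api_version="2024-06-01",
    )
    ttft_ms, total_ms = await measure_ttft(
        client,
        [{"role": "user", "content": "Summarize RAG in one sentence."}],
    )
    print(f"TTFT: {ttft_ms:.0f}ms, total: {total_ms:.0f}ms")

asyncio.run(main())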

2. Cost Per Request

Why this matters: A GPT-4o request with 2,000 tokens costs ~$0.02. At 10,000 requests/day that's $200/day or $6,000/month — just for one endpoint.

Azure OpenAI pricing (GPT-4o, as of 2026):

  • Input: $5 per 1M tokens
  • Output: $15 per 1M tokens

Formula:

cost_per_request = (prompt_tokens * input_price_per_1M + completion_tokens * output_price_per_1M) / 1,000,000

Track this in code:

Python
INPUT_PRICE_PER_MILLION = 5.0   # GPT-4o input
OUTPUT_PRICE_PER_MILLION = 15.0  # GPT-4o output

def calculate_cost(usage) -> float:
    input_cost  = usage.prompt_tokens     * INPUT_PRICE_PER_MILLION / 1_000_000
    output_cost = usage.completion_tokens * OUTPUT_PRICE_PER_MILLION / 1_000_000
    return round(input_cost + output_cost, 6)

# After each LLM call:
cost = calculate_cost(response.usage)
log.info("llm_cost", cost_usd=cost, prompt_tokens=response.usage.prompt_tokens)

Daily cost dashboard query (Log Analytics):

KUSTO
customEvents
| where name == "llm_call_completed"
| where timestamp > ago(24h)
| summarize
    daily_cost_usd = sum(todouble(customDimensions["cost_usd"])),
    total_requests = count(),
    avg_cost_per_req = avg(todouble(customDimensions["cost_usd"]))
| project daily_cost_usd, total_requests, avg_cost_per_req

3. Token Efficiency Ratio

Definition: prompt_tokens / completion_tokens

What it tells you: If your prompt is 1,500 tokens and the completion is 50 tokens, your ratio is 30:1. You're paying mostly for the context, not the answer. This signals an opportunity to shorten the prompt or cache it.

Healthy ratio: 3:1 to 8:1 for RAG systems (more context than generation is expected). Above 15:1: your system prompt is probably too long.

Python
ratio = response.usage.prompt_tokens / max(response.usage.completion_tokens, 1)
log.info("token_ratio", ratio=round(ratio, 1))

4. Error Rate

Categories of errors:

| Error | HTTP Code | Meaning |
|---|---|---|
| Rate limit | 429 | Too many requests — backoff and retry |
| Content filter | 400 (content_filter) | Prompt or response blocked |
| Timeout | 408/504 | LLM took too long |
| Model overload | 503 | Azure OpenAI is overloaded |
| Invalid request | 400 | Bad prompt format |

Track each category separately:

Python
from openai import APITimeoutError, BadRequestError, RateLimitError

async def call_with_tracking(messages):
    try:
        response = await client.chat.completions.create(...)
        error_counter.add(0, {"type": "none"})  # emit a zero so the "none" series exists
        return response

    except RateLimitError:
        error_counter.add(1, {"type": "rate_limit"})
        raise
    except BadRequestError as e:
        # Azure OpenAI returns 400 with code "content_filter" when the prompt is blocked
        error_type = "content_filter" if e.code == "content_filter" else "invalid_request"
        error_counter.add(1, {"type": error_type})
        raise
    except APITimeoutError:
        error_counter.add(1, {"type": "timeout"})
        raise
    except Exception:
        error_counter.add(1, {"type": "unknown"})
        raise

Acceptable error rates:

  • Rate limit errors: under 0.5% (if higher, request quota increase or add caching)
  • Content filter: depends on use case; over 2% suggests prompt engineering issues
  • Timeout: under 0.1%
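
To check these thresholds, a sketch of a Log Analytics query — it assumes you log an error_type dimension on each llm_call_completed event (adjust the names to your own logging schema):

KUSTO
customEvents
| where name == "llm_call_completed"
| where timestamp > ago(24h)
| summarize
    total_requests = count(),
    rate_limit_pct = 100.0 * countif(tostring(customDimensions["error_type"]) == "rate_limit") / count(),
    content_filter_pct = 100.0 * countif(tostring(customDimensions["error_type"]) == "content_filter") / count(),
    timeout_pct = 100.0 * countif(tostring(customDimensions["error_type"]) == "timeout") / count()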

5. Retrieval Quality Metrics (RAG Systems)

If you have a RAG pipeline, also track:

| Metric | Measurement |
|---|---|
| Retrieval score | Cosine similarity of top document (should be over 0.7) |
| Documents retrieved | How many chunks returned (should match top_k config) |
| Fallback rate | How often RAG finds nothing → LLM answers from its training data alone |
| Citation coverage | What % of answers include a citation |

Python
docs = await retriever.search(query, top_k=5)

retrieval_score = docs[0].score if docs else 0
log.info(
    "retrieval_completed",
    doc_count=len(docs),
    top_score=round(retrieval_score, 3),
    fallback=retrieval_score < 0.6,
)
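
A sketch of the matching fallback-rate query, assuming the retrieval_completed log above is exported to customEvents with fallback and top_score dimensions:

KUSTO
customEvents
| where name == "retrieval_completed"
| where timestamp > ago(24h)
| summarize
    fallback_rate_pct = 100.0 * countif(tolower(tostring(customDimensions["fallback"])) == "true") / count(),
    avg_top_score = avg(todouble(customDimensions["top_score"]))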

6. SLOs for LLM Services

Define Service Level Objectives before you alert:

| SLO | Target |
|---|---|
| TTFT p95 | under 1.5s |
| Total latency p95 | under 8s |
| Error rate | under 1% |
| Cost per request | under $0.05 |
| Availability | 99.5% |
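
To see whether you are meeting the TTFT objective, a sketch of a query you could alert on — it assumes a ttft_ms dimension is logged per call (a naming assumption, not something the earlier snippets emit):

KUSTO
customEvents
| where name == "llm_call_completed"
| where timestamp > ago(1h)
| summarize ttft_p95_ms = percentile(todouble(customDimensions["ttft_ms"]), 95)
| extend slo_met = ttft_p95_ms < 1500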


Checkpoint: Build a Metrics Dashboard

In Azure Monitor → Workbooks, create a workbook with:

  1. Line chart: p95 TTFT over 24h
  2. Bar chart: total tokens/hour (input vs output)
  3. Single number: cost today vs yesterday
  4. Table: top 10 slowest requests

This gives you a single-pane view of your LLM service health.
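
As a starting point for the last tile, a sketch of the "top 10 slowest requests" query — total_ms, prompt_tokens, and cost_usd are assumed dimension names; rename them to match your logging:

KUSTO
customEvents
| where name == "llm_call_completed"
| where timestamp > ago(24h)
| extend total_ms = todouble(customDimensions["total_ms"])
| top 10 by total_ms desc
| project timestamp, total_ms, prompt_tokens = tostring(customDimensions["prompt_tokens"]), cost_usd = todouble(customDimensions["cost_usd"])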

Enjoyed this article?

Explore the AI Systems learning path for more.
