Key LLM Metrics: TTFT, Cost/Request, Error Rate
The exact metrics every production LLM service must track. Learn what TTFT, cost-per-request, token efficiency, and error rate mean, how to measure them, and what good numbers look like.
The Four Pillars of LLM Observability
Traditional APIs are monitored with latency, error rate, throughput, and saturation (the RED/USE methods).
LLM services need four additional pillars:
| Pillar | Metric | Why It Matters |
|---|---|---|
| Latency | TTFT, total latency, p95 | User experience |
| Cost | Tokens/request, $/request | Business viability |
| Quality | Groundedness, relevance, refusal rate | Service correctness |
| Reliability | Error rate, retry rate, timeout rate | Service health |
1. Time to First Token (TTFT)
Definition: Time from sending the request to receiving the first token of the response.
Why TTFT matters more than total latency: With streaming, users see the first word in 0.8s and perceive the response as fast — even if the full response takes 10 seconds. TTFT is the latency users feel.
Good numbers (Azure OpenAI, GPT-4o):
- TTFT under 800ms: excellent
- TTFT 800ms–2s: acceptable
- TTFT over 2s: users notice the delay
Factors that affect TTFT:
- Region: East US typically faster than West Europe for Azure OpenAI
- Model: GPT-4o-mini has lower TTFT than GPT-4o
- Prompt length: longer prompt → more tokens to process before generating
- Concurrent load: if OpenAI's servers are under load, TTFT increases
Measuring TTFT:
import time
from openai import AsyncAzureOpenAI

async def measure_ttft(client: AsyncAzureOpenAI, messages: list) -> tuple[float, float]:
    """Return (ttft_ms, total_ms) for a streaming chat completion."""
    start = time.perf_counter()
    ttft_ms = None
    stream = await client.chat.completions.create(
        model="gpt-4o",
        messages=messages,
        stream=True,
    )
    async for chunk in stream:
        # Azure can emit chunks with empty choices (e.g. content filter annotations)
        content = chunk.choices[0].delta.content if chunk.choices else None
        if content and ttft_ms is None:
            ttft_ms = (time.perf_counter() - start) * 1000
            break  # TTFT captured; the rest of the stream is consumed below
    # Consume remaining chunks so total latency covers the full response
    async for chunk in stream:
        pass
    total_ms = (time.perf_counter() - start) * 1000
    return ttft_ms, total_ms
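A quick way to exercise this helper, sketched with placeholder endpoint, key, and API version (substitute your own Azure OpenAI settings):

import asyncio

async def main():
    client = AsyncAzureOpenAI(
        azure_endpoint="https://YOUR-RESOURCE.openai.azure.com",  # placeholder
        api_key="YOUR-KEY",  # placeholder
        api_version="2024-06-01",
    )
    ttft_ms, total_ms = await measure_ttft(
        client, [{"role": "user", "content": "Say hello"}]
    )
    print(f"TTFT: {ttft_ms:.0f}ms, total: {total_ms:.0f}ms")

asyncio.run(main())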
2. Cost Per Request
Why this matters: A GPT-4o request with 2,000 tokens costs ~$0.02. At 10,000 requests/day that's $200/day, or $6,000/month, just for one endpoint.
Azure OpenAI pricing (GPT-4o, as of 2026):
- Input: $5 per 1M tokens
- Output: $15 per 1M tokens
Formula:
cost_per_request = (prompt_tokens * input_price + completion_tokens * output_price) / 1,000,000
(Both prices are expressed per 1M tokens, which is why the sum is divided by 1,000,000.)
Track this in code:
INPUT_PRICE_PER_MILLION = 5.0    # GPT-4o input
OUTPUT_PRICE_PER_MILLION = 15.0  # GPT-4o output

def calculate_cost(usage) -> float:
    input_cost = usage.prompt_tokens * INPUT_PRICE_PER_MILLION / 1_000_000
    output_cost = usage.completion_tokens * OUTPUT_PRICE_PER_MILLION / 1_000_000
    return round(input_cost + output_cost, 6)

# After each LLM call:
cost = calculate_cost(response.usage)
log.info("llm_cost", cost_usd=cost, prompt_tokens=response.usage.prompt_tokens)
Daily cost dashboard query (Log Analytics):
customEvents
| where name == "llm_call_completed"
| where timestamp > ago(24h)
| summarize
    daily_cost_usd = sum(todouble(customDimensions["cost_usd"])),
    total_requests = count(),
    avg_cost_per_req = avg(todouble(customDimensions["cost_usd"]))
| project daily_cost_usd, total_requests, avg_cost_per_req
3. Token Efficiency Ratio
Definition: prompt_tokens / completion_tokens
What it tells you: If your prompt is 1,500 tokens and the completion is 50 tokens, your ratio is 30:1. You're paying mostly for the context, not the answer. This signals an opportunity to shorten the prompt or cache it.
Healthy ratio: 3:1 to 8:1 for RAG systems (more context than generation is expected). Above 15:1, your system prompt is probably too long.
ratio = response.usage.prompt_tokens / max(response.usage.completion_tokens, 1)
log.info("token_ratio", ratio=round(ratio, 1))4. Error Rate
4. Error Rate
Categories of errors:
| Error | HTTP Code | Meaning |
|---|---|---|
| Rate limit | 429 | Too many requests; back off and retry |
| Content filter | 400 (content_filter) | Prompt or response blocked |
| Timeout | 408/504 | LLM took too long |
| Model overload | 503 | Azure OpenAI is overloaded |
| Invalid request | 400 | Bad prompt format |
Track each category separately:
from openai import RateLimitError, ContentFilterFinishReasonError

# Assumes an OpenTelemetry counter created once at startup, e.g.
# error_counter = meter.create_counter("llm.errors")

async def call_with_tracking(messages):
    try:
        response = await client.chat.completions.create(...)
        error_counter.add(0, {"type": "none"})  # keeps the "none" series alive for rate math
        return response
    except RateLimitError:
        error_counter.add(1, {"type": "rate_limit"})
        raise
    except ContentFilterFinishReasonError:
        # Raised by the parse() helper; plain create() calls surface content
        # filtering as a BadRequestError with code "content_filter"
        error_counter.add(1, {"type": "content_filter"})
        raise
    except TimeoutError:
        error_counter.add(1, {"type": "timeout"})
        raise
    except Exception:
        error_counter.add(1, {"type": "unknown"})
        raise
Acceptable error rates:
- Rate limit errors: under 0.5% (if higher, request a quota increase, add caching, or retry with backoff, as sketched below)
- Content filter: depends on use case; over 2% suggests prompt engineering issues
- Timeout: under 0.1%
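For the rate-limit case specifically, a minimal retry-with-backoff sketch (the retry count and delays are illustrative; tune them to your quota):

import asyncio
import random
from openai import RateLimitError

async def call_with_backoff(messages, max_retries: int = 4):
    for attempt in range(max_retries):
        try:
            return await client.chat.completions.create(
                model="gpt-4o", messages=messages
            )
        except RateLimitError:
            if attempt == max_retries - 1:
                raise  # out of retries; surface the 429
            # Exponential backoff with jitter: ~1s, 2s, 4s, ...
            await asyncio.sleep(2 ** attempt + random.random())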
5. Retrieval Quality Metrics (RAG Systems)
If you have a RAG pipeline, also track:
| Metric | Measurement |
|---|---|
| Retrieval score | Cosine similarity of top document (should be over 0.7) |
| Documents retrieved | How many chunks returned (should match top_k config) |
| Fallback rate | How often RAG finds nothing → LLM uses only its training |
| Citation coverage | What % of answers include a citation |
docs = await retriever.search(query, top_k=5)
retrieval_score = docs[0].score if docs else 0
log.info(
    "retrieval_completed",
    doc_count=len(docs),
    top_score=round(retrieval_score, 3),
    fallback=retrieval_score < 0.6,
)
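Citation coverage, the last metric in the table, can be approximated with a simple heuristic. A sketch, assuming your answers embed bracketed markers like [1] (both the convention and the answer variable here are hypothetical):

import re

def has_citation(answer: str) -> bool:
    # Heuristic: count any bracketed number like "[1]" or "[12]" as a citation
    return re.search(r"\[\d+\]", answer) is not None

# After each RAG answer is generated:
log.info("citation_check", cited=has_citation(answer))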
6. SLOs for LLM Services
Define Service Level Objectives before you alert:
| SLO | Target |
|---|---|
| TTFT p95 | under 1.5s |
| Total latency p95 | under 8s |
| Error rate | under 1% |
| Cost per request | under $0.05 |
| Availability | 99.5% |
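One lightweight way to make these targets actionable is a threshold table checked by a scheduled job. A sketch (the key names and the measured dict are illustrative, not tied to any SDK):

# Illustrative SLO targets mirroring the table above
SLO_TARGETS = {
    "ttft_p95_ms": 1500,
    "total_latency_p95_ms": 8000,
    "error_rate": 0.01,
    "cost_per_request_usd": 0.05,
}

def slo_breaches(measured: dict[str, float]) -> list[str]:
    # Return the names of any SLOs whose measured value exceeds its target
    return [name for name, target in SLO_TARGETS.items()
            if measured.get(name, 0.0) > target]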
Checkpoint: Build a Metrics Dashboard
In Azure Monitor → Workbooks, create a workbook with:
- Line chart: p95 TTFT over 24h
- Bar chart: total tokens/hour (input vs output)
- Single number: cost today vs yesterday
- Table: top 10 slowest requests
This gives you a single-pane view of your LLM service health.