Scenario: P95 Latency Is 12 Seconds
P50 is 3 seconds but P95 is 12 seconds: tail latency is destroying the experience for users on complex queries. Fix cold starts, context bloat, and retry storms, and stream early.
The Scenario
Your monitoring dashboard shows:
- P50 latency: 3.1 seconds (acceptable)
- P95 latency: 12.4 seconds (terrible)
- P99 latency: 22 seconds (unacceptable)
The median user has a tolerable experience. But one in twenty users waits over 12 seconds. One in a hundred waits over 22 seconds. These are the users who leave angry reviews.
Tail latency in LLM systems is a different problem from median latency. The causes are different and the fixes are different.
Why Tail Latency Is a Different Problem
In traditional web services, P95 is usually 3-5x P50. In LLM pipelines, you often see 5-10x because of:
- Response length variance: A query that generates a 2,000-token response takes roughly 4x longer than one generating 500 tokens. Long-tail queries tend to ask for long-tail responses.
- Context window explosion: Some queries trigger multi-document retrieval with 6,000-token contexts. Larger context = slower prefill.
- Cold starts: If your container scales to zero at night, the first requests after scale-up wait for JIT compilation, model loading, and connection pool warm-up.
- Retry storms: If a subset of queries hit timeouts and auto-retry, each of them pays for the timed-out attempt plus the retry, so they land squarely in P95/P99.
Step 1: Decompose the Tail
Before fixing anything, identify which stage is responsible for the tail. Add percentile logging:
import time
import statistics
from collections import defaultdict
from typing import List

class LatencyRecorder:
    def __init__(self):
        self.stage_samples: dict[str, List[float]] = defaultdict(list)

    def record(self, stage: str, duration_ms: float):
        self.stage_samples[stage].append(duration_ms)

    def report_percentiles(self):
        print(f"\n{'Stage':<25} {'P50':>8} {'P90':>8} {'P95':>8} {'P99':>8}")
        print("-" * 60)
        for stage, samples in sorted(self.stage_samples.items()):
            if len(samples) < 10:
                continue
            s = sorted(samples)
            n = len(s)
            p = lambda pct: s[int(n * pct / 100)]
            print(f"{stage:<25} {p(50):>7.0f}ms {p(90):>7.0f}ms {p(95):>7.0f}ms {p(99):>7.0f}ms")

recorder = LatencyRecorder()

# Example output after 1,000 requests:
# Stage                P50      P90      P95      P99
# query_embedding      420ms    680ms    820ms    1100ms
# vector_search        180ms    310ms    380ms    920ms
# context_assembly     45ms     90ms     120ms    310ms
# llm_completion       2400ms   7200ms   11800ms  21000ms

This tells you that llm_completion is responsible for almost all of the tail latency. Query embedding and vector search have mild tails, so they are not the primary problem.
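To feed per-stage samples into a recorder like the one above, each pipeline stage can be wrapped in a small timing context manager. The sketch below is one possible way to do it. `timed_stage` and the `stage_samples` dictionary are illustrative names that do not come from the original pipeline; in practice the `finally` block could call `recorder.record(stage, duration_ms)` instead of appending to a dictionary.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

# Hypothetical sink for samples: stage name -> list of durations in ms.
stage_samples = defaultdict(list)

@contextmanager
def timed_stage(stage, sink=stage_samples):
    """Time the wrapped block and append its duration (ms) to sink[stage]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        # Recorded even when the stage raises, so failed calls still show up in the tail.
        sink[stage].append((time.perf_counter() - start) * 1000.0)

# Usage: wrap each stage of the request path.
with timed_stage("vector_search"):
    time.sleep(0.01)  # stand-in for the real search call
with timed_stage("llm_completion"):
    time.sleep(0.02)  # stand-in for the real LLM call
```

Recording in `finally` is a deliberate choice here: slow requests that end in an exception are often exactly the ones that make up the tail.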
Root Cause Analysis: LLM Completion Tail
Two LLM-specific factors drive the long tail:
Factor 1: Output token count. Time to complete scales roughly linearly with the number of output tokens. A 2,000-token response takes approximately 4x longer than a 500-token response at the same context size.
Factor 2: Prefill time. Before generating the first token, the model processes all input tokens (the "prefill" phase). Large contexts (above 4,000 tokens) have noticeably longer time-to-first-token.
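The two factors combine into a rough additive model: total time is approximately prefill time plus output tokens divided by decode speed. The sketch below uses made-up throughput numbers (`prefill_tokens_per_s` and `decode_tokens_per_s` are assumptions, not measurements from this system) purely to show the shape of the relationship. Real prefill cost is not perfectly linear in context size, so treat this as a first approximation and substitute rates measured from your own logs.

```python
def estimate_llm_latency_s(
    context_tokens: int,
    output_tokens: int,
    prefill_tokens_per_s: float = 5000.0,  # assumed value, measure your own
    decode_tokens_per_s: float = 100.0,    # assumed value, measure your own
) -> dict:
    """Rough additive model: prefill (time to first token) plus decode time."""
    ttft = context_tokens / prefill_tokens_per_s
    decode = output_tokens / decode_tokens_per_s
    return {"ttft_s": ttft, "decode_s": decode, "total_s": ttft + decode}

# Same context size, 4x the output tokens: decode time scales 4x.
short = estimate_llm_latency_s(context_tokens=2000, output_tokens=500)
long = estimate_llm_latency_s(context_tokens=2000, output_tokens=2000)
print(short["total_s"], long["total_s"])
```

Under this model, output length dominates total time while context size mostly moves time-to-first-token, which is why the two are worth checking separately.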
Check this correlation in your logs:
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("llm_calls.csv")  # columns: completion_tokens, context_tokens, duration_ms

# Strong positive correlation confirms output length is the culprit
correlation_output = df["completion_tokens"].corr(df["duration_ms"])
correlation_context = df["context_tokens"].corr(df["duration_ms"])
print(f"Correlation: output tokens vs latency: {correlation_output:.2f}")
print(f"Correlation: context tokens vs latency: {correlation_context:.2f}")

# Scatter plot: context size vs. latency
df.plot.scatter(x="context_tokens", y="duration_ms", alpha=0.3)
plt.axvline(x=4000, color="red", linestyle="--", label="4k context threshold")
plt.legend()  # needed for the threshold label to appear
plt.show()

Fix 1: Cap Context Size
If your vector search retrieves 10 chunks and each is 512 tokens, you have 5,120 context tokens. Prefill time grows with context size, so every extra chunk pushes time-to-first-token up. Cap it:
import tiktoken

def build_context_with_budget(
    chunks: List[str],
    max_tokens: int = 2000,
    model: str = "gpt-4o",
) -> tuple[str, int]:
    """
    Include chunks until token budget is exhausted.
    Returns (context_string, actual_token_count).
    """
    enc = tiktoken.encoding_for_model(model)
    selected_chunks = []
    total_tokens = 0
    for chunk in chunks:  # chunks already sorted by relevance (most relevant first)
        chunk_tokens = len(enc.encode(chunk))
        if total_tokens + chunk_tokens > max_tokens:
            break
        selected_chunks.append(chunk)
        total_tokens += chunk_tokens
    return "\n\n".join(selected_chunks), total_tokens

# Before: avg context = 4,800 tokens, P95 LLM latency = 12s
# After: avg context = 2,000 tokens, P95 LLM latency = 6s

Fix 2: Cap Output Length Per Request Type
Not all requests need long answers. Add a max_tokens budget per query type:
from enum import Enum
from openai import AsyncOpenAI

# Async client, so the call below can be awaited without blocking the event loop
openai_client = AsyncOpenAI()

class ResponseType(Enum):
    FACTUAL = "factual"    # "What is the upload limit?" -> short
    SUMMARY = "summary"    # "Summarize this policy" -> medium
    ANALYSIS = "analysis"  # "Compare options A and B" -> long

MAX_OUTPUT_TOKENS = {
    ResponseType.FACTUAL: 150,
    ResponseType.SUMMARY: 400,
    ResponseType.ANALYSIS: 800,
}

def classify_response_type(query: str) -> ResponseType:
    """Quick classification using a small model or regex heuristics."""
    query_lower = query.lower()
    if any(w in query_lower for w in ["what is", "how much", "when does", "which"]):
        return ResponseType.FACTUAL
    if any(w in query_lower for w in ["summarize", "overview", "brief"]):
        return ResponseType.SUMMARY
    return ResponseType.ANALYSIS

async def adaptive_llm_call(query: str, context: str) -> dict:
    response_type = classify_response_type(query)
    max_tokens = MAX_OUTPUT_TOKENS[response_type]
    response = await openai_client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": f"Context:\n{context}"},
            {"role": "user", "content": query},
        ],
        max_tokens=max_tokens,
    )
    return {
        "answer": response.choices[0].message.content,
        "response_type": response_type.value,
        "max_tokens_applied": max_tokens,
    }

Setting max_tokens=150 for factual queries bounds the worst-case generation time to the time needed for 150 tokens, regardless of what the model wants to say.
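A hard cap can cut an answer off mid-sentence. In the OpenAI Chat Completions API, a choice that stopped because it hit the token limit reports a `finish_reason` of `"length"`. A minimal sketch of handling that case follows. `finalize_answer` is a hypothetical helper, and trimming back to the last sentence boundary is only one possible policy.

```python
def finalize_answer(text: str, finish_reason: str) -> dict:
    """Flag answers that were cut off by the max_tokens cap."""
    truncated = finish_reason == "length"
    if truncated:
        # Trim to the last complete sentence so the reply does not end mid-word.
        cut = max(text.rfind("."), text.rfind("!"), text.rfind("?"))
        if cut != -1:
            text = text[: cut + 1]
    return {"answer": text, "truncated": truncated}

# Usage with plain values; in the pipeline these would come from
# response.choices[0].message.content and response.choices[0].finish_reason.
capped = finalize_answer("The limit is 5 GB. You can also req", "length")
normal = finalize_answer("The limit is 5 GB.", "stop")
```

Logging the `truncated` flag per response type also shows whether a cap is too tight: a high truncation rate for one category suggests raising that budget.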
Fix 3: Streaming to Reduce Perceived P95
Streaming does not change P95 total completion time, but it changes P95 time-to-first-token (TTFT). Users perceive a response that starts appearing in 1 second much better than a response that appears fully after 8 seconds.
from fastapi import FastAPI
from fastapi.responses import StreamingResponse
import json

app = FastAPI()

async def stream_with_early_metadata(query: str):
    """
    Immediately yields metadata (source documents, context size)
    before the LLM starts generating. Users see something instantly.
    """
    # Phase 1: retrieval (fast, ~400ms)
    query_embedding = await embed_text(query)
    chunks = await vector_store.search(query_embedding, k=5)
    # Assumes each chunk exposes its text as `.text`; adjust to your vector store's result type
    context, context_tokens = build_context_with_budget(
        [c.text for c in chunks], max_tokens=2000
    )
    # Immediately yield source info: the user sees this before the LLM generates
    yield json.dumps({
        "type": "metadata",
        "sources": [c.metadata.get("source") for c in chunks[:3]],
        "context_tokens": context_tokens,
    }) + "\n"
    # Phase 2: LLM (slow, but streaming); openai_client is an AsyncOpenAI instance
    stream = await openai_client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": f"Context:\n{context}"},
            {"role": "user", "content": query},
        ],
        stream=True,
        max_tokens=600,
    )
    async for chunk in stream:
        if not chunk.choices:  # some chunks carry no choices
            continue
        delta = chunk.choices[0].delta.content
        if delta:
            yield json.dumps({"type": "token", "content": delta}) + "\n"
    yield json.dumps({"type": "done"}) + "\n"

@app.get("/query")
async def query_endpoint(q: str):
    return StreamingResponse(
        stream_with_early_metadata(q),
        media_type="application/x-ndjson",
    )

With streaming, P95 TTFT drops from 12 seconds to approximately 1.4 seconds even though P95 total time remains 12 seconds. Users rate the streaming experience as significantly better.
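On the client side, the newline-delimited JSON stream can be consumed one line at a time. The sketch below parses the three event types emitted by the endpoint (`metadata`, `token`, `done`); `consume_ndjson` and the sample lines are illustrative and not part of the original service. A real client would read lines from the HTTP response body and render each token as it arrives rather than collecting them.

```python
import json
from typing import Iterable

def consume_ndjson(lines: Iterable[str]) -> dict:
    """Fold a stream of NDJSON events into sources, answer text, and a completion flag."""
    sources, parts, done = [], [], False
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank keep-alive lines
        event = json.loads(line)
        kind = event.get("type")
        if kind == "metadata":
            sources = event.get("sources", [])
        elif kind == "token":
            parts.append(event["content"])
        elif kind == "done":
            done = True
            break
    return {"sources": sources, "answer": "".join(parts), "complete": done}

# Hypothetical stream contents, shaped like the endpoint's events.
sample = [
    '{"type": "metadata", "sources": ["policy.pdf"], "context_tokens": 1800}\n',
    '{"type": "token", "content": "Uploads are "}\n',
    '{"type": "token", "content": "capped at 5 GB."}\n',
    '{"type": "done"}\n',
]
result = consume_ndjson(sample)
```

Tracking `complete` separately lets the client tell a finished answer from a stream that was cut short by a timeout or dropped connection.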
Fix 4: Cold Start Prevention
If your container scales to zero overnight, the first requests after scale-up pay for a cold start. The delay is the sum of several stages:
- Container provisioning: 30-60 seconds
- Python/Node startup: 5-15 seconds
- Warm-up of connection pools (Redis, vector store): 2-5 seconds
Prevent cold starts by maintaining minimum replicas:
# Azure Container Apps: containerapp.yaml
resources:
  cpu: 2
  memory: "4Gi"
scale:
  minReplicas: 2  # NEVER scale to zero
  maxReplicas: 20
  rules:
    - name: http-scaling
      http:
        metadata:
          concurrentRequests: "10"

And add a startup hook that pre-warms connection pools:
from contextlib import asynccontextmanager
from fastapi import FastAPI

@asynccontextmanager
async def lifespan(app: FastAPI):
    """
    Runs on startup. Pre-warms connections so first user request
    does not pay the connection setup penalty.
    """
    # Pre-warm embedding model connection
    await embed_text("warm up query")
    # Pre-warm vector store connection pool
    await vector_store.ping()
    # Pre-warm Redis connection
    await redis_client.ping()
    print("All connections warm. Ready to serve.")
    yield
    # Shutdown cleanup
    await redis_client.aclose()

app = FastAPI(lifespan=lifespan)

Fix 5: Timeout + Graceful Degradation
Set aggressive timeouts on LLM calls and return a graceful fallback if exceeded:
import asyncio
import logging

logger = logging.getLogger(__name__)

async def llm_with_timeout(
    prompt: str,
    timeout_seconds: float = 8.0,
    fallback_message: str = "I am taking longer than usual. Please try again.",
) -> dict:
    try:
        response = await asyncio.wait_for(
            async_llm_call(prompt),
            timeout=timeout_seconds,
        )
        return {"answer": response, "timed_out": False}
    except asyncio.TimeoutError:
        # Log for monitoring: this is a tail latency event
        logger.warning(f"LLM timeout after {timeout_seconds}s for prompt hash {hash(prompt)}")
        return {
            "answer": fallback_message,
            "timed_out": True,
            "retry_url": "/query?retry=true",  # hint to client to retry
        }

Timeouts prevent the P99 from reaching 22 seconds at the cost of some degraded responses for extreme tail cases.
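Because the fallback invites a retry, unbounded or immediate retries can turn a handful of timeouts into the retry storm described earlier. One common mitigation is to cap the number of attempts and wait a randomized, exponentially growing delay between them. The sketch below is an assumption-laden illustration: `call_with_retry_budget` and its parameters are hypothetical names, and `call` stands for any zero-argument coroutine function returning a dict with a `timed_out` flag, such as a wrapper around the timeout function above.

```python
import asyncio
import random

async def call_with_retry_budget(
    call,
    max_attempts: int = 2,
    base_delay_s: float = 0.5,
    max_delay_s: float = 4.0,
    sleep=asyncio.sleep,  # injectable so tests can skip real waiting
) -> dict:
    """Retry a timed-out call at most max_attempts times with full-jitter backoff."""
    result = {"timed_out": True}
    for attempt in range(max_attempts):
        result = await call()
        if not result.get("timed_out"):
            return result
        if attempt < max_attempts - 1:
            # Full jitter: random delay up to an exponentially growing ceiling,
            # so retries from many clients do not arrive in lockstep.
            ceiling = min(max_delay_s, base_delay_s * 2 ** attempt)
            await sleep(random.uniform(0, ceiling))
    return result  # budget exhausted: surface the last fallback response
```

Keeping `max_attempts` small matters more than the exact delays: each retry adds a full timeout window to that request's end-to-end latency.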
Putting It All Together: P95 Latency Reduction
| Fix | P95 Before | P95 After | Notes |
|---|---|---|---|
| Baseline | 12,400 ms | n/a | No optimizations |
| Context cap (2,000 tokens) | 12,400 ms | 7,200 ms | Largest single fix |
| Output token caps per query type | 7,200 ms | 5,100 ms | Prevents long-answer queries dominating |
| Min replicas (no cold start) | 5,100 ms | 4,200 ms | Eliminates cold-start outliers |
| 8-second timeout with fallback | 4,200 ms | 4,200 ms | Clips P99, not P95 |
| Streaming (TTFT) | perceived 12,400 ms | perceived 1,400 ms | Does not change total time but feels 5-8x faster |
The result: P95 total latency drops from 12.4 seconds to approximately 4.2 seconds. P95 perceived latency (TTFT) drops from 12.4 seconds to approximately 1.4 seconds due to streaming.
Monitoring the Tail
Once you apply fixes, track percentile metrics continuously:
import time
from prometheus_client import Histogram

llm_latency = Histogram(
    "llm_request_duration_seconds",
    "LLM call duration",
    buckets=[0.5, 1, 2, 3, 5, 8, 12, 20, 30],
    labelnames=["endpoint", "response_type"],
)

async def monitored_llm_call(query: str, endpoint: str) -> str:
    response_type = classify_response_type(query)
    start = time.perf_counter()
    try:
        result = await adaptive_llm_call(query, context="...")
        return result["answer"]
    finally:
        elapsed = time.perf_counter() - start
        llm_latency.labels(endpoint=endpoint, response_type=response_type.value).observe(elapsed)

Set an alert when P95 exceeds your SLO:
# Prometheus alerting rule
groups:
  - name: latency_slo
    rules:
      - alert: HighP95Latency
        expr: histogram_quantile(0.95, rate(llm_request_duration_seconds_bucket[5m])) > 5
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "P95 LLM latency above 5 seconds"

With proper instrumentation, you will catch tail latency regressions in minutes rather than waiting for user complaints.
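The P95 in that alert is an estimate, not an exact value: for classic bucketed histograms, `histogram_quantile` finds the bucket containing the target rank and interpolates linearly inside it. The sketch below mimics that calculation to show how bucket boundaries limit precision. `estimate_quantile` is an illustrative reimplementation, not Prometheus code, and the bucket counts are invented.

```python
import math

def estimate_quantile(q, upper_bounds, cumulative_counts):
    """Estimate a quantile from cumulative bucket counts by linear interpolation.

    upper_bounds: sorted finite bucket bounds followed by math.inf.
    cumulative_counts: number of observations <= each bound.
    """
    total = cumulative_counts[-1]
    if total == 0:
        return math.nan
    rank = q * total
    # First bucket whose cumulative count reaches the target rank.
    i = next(idx for idx, c in enumerate(cumulative_counts) if c >= rank)
    if math.isinf(upper_bounds[i]):
        # Rank falls in the +Inf bucket: report the highest finite bound.
        return upper_bounds[i - 1]
    lower = upper_bounds[i - 1] if i > 0 else 0.0
    prev = cumulative_counts[i - 1] if i > 0 else 0
    in_bucket = cumulative_counts[i] - prev
    if in_bucket == 0:
        return lower
    return lower + (upper_bounds[i] - lower) * (rank - prev) / in_bucket

# Same bucket layout as the Histogram above, with invented counts for 1,000 calls.
bounds = [0.5, 1, 2, 3, 5, 8, 12, 20, 30, math.inf]
counts = [0, 20, 300, 520, 800, 930, 960, 990, 1000, 1000]
p95 = estimate_quantile(0.95, bounds, counts)
print(f"estimated P95: {p95:.2f}s")
```

With these counts the P95 lands somewhere in the 8 to 12 second bucket, and the reported value is a straight-line guess within it. Placing a bucket boundary at the SLO threshold (5 seconds here) keeps the alert comparison from depending on that interpolation.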