Scenario: P95 Latency Is 12 Seconds

The Scenario

Your monitoring dashboard shows:

P50 latency: 3.1 seconds (acceptable)
P95 latency: 12.4 seconds (terrible)
P99 latency: 22 seconds (unacceptable)

The median user has a tolerable experience. But one in twenty users waits over 12 seconds. One in a hundred waits over 22 seconds. These are the users who leave angry reviews.

Tail latency in LLM systems is a different problem from median latency. The causes are different and the fixes are different.

Why Tail Latency Is a Different Problem

In traditional web services, P95 is usually 3-5x P50. In LLM pipelines, you often see 5-10x because of:

Response length variance: A query that generates a 2,000-token response takes roughly 4x longer to decode than one generating 500 tokens, so the slowest requests tend to be the ones with the longest answers.
Context window explosion: Some queries trigger multi-document retrieval with 6,000-token contexts. Larger context means slower prefill.
Cold starts: If your container scales to zero at night, the first requests after scale-up wait for JIT compilation, model loading, and connection pool warm-up.
Retry storms: If a subset of queries hit timeouts and auto-retry, each affected user waits for the failed attempt plus the retry, and the extra attempts add load that slows other requests.

Step 1: Decompose the Tail

Before fixing anything, identify which stage is responsible for the tail. Add percentile logging:

Python

import time
import statistics
from collections import defaultdict
from typing import List

class LatencyRecorder:
    def __init__(self):
        self.stage_samples: dict[str, List[float]] = defaultdict(list)

    def record(self, stage: str, duration_ms: float):
        self.stage_samples[stage].append(duration_ms)

    def report_percentiles(self):
        print(f"\n{'Stage':<25} {'P50':>8} {'P90':>8} {'P95':>8} {'P99':>8}")
        print("-" * 60)
        for stage, samples in sorted(self.stage_samples.items()):
            if len(samples) < 10:
                continue
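            s = sorted(samples)
            n = len(s)

            def p(pct: float) -> float:
                # Nearest-rank percentile; clamp so the index never runs past the end
                return s[min(int(n * pct / 100), n - 1)]

            print(f"{stage:<25} {p(50):>7.0f}ms {p(90):>7.0f}ms {p(95):>7.0f}ms {p(99):>7.0f}ms")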

recorder = LatencyRecorder()

# Example output after 1,000 requests:
# Stage                     P50      P90      P95      P99
# query_embedding           420ms    680ms    820ms   1100ms
# vector_search             180ms    310ms    380ms    920ms
# context_assembly           45ms     90ms    120ms    310ms
# llm_completion           2400ms   7200ms  11800ms  21000ms

This tells you that llm_completion is responsible for almost all of the tail latency. Query embedding and vector search have mild tails — not the primary problem.
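
Per-stage percentiles do not add up to the end-to-end percentile, because the slowest embedding call and the slowest completion rarely land in the same request. A complementary check is to look only at the slowest requests and see where their time went. The sketch below assumes each request is logged as a dict of stage name to milliseconds; that shape and the name `tail_attribution` are illustrative, not part of the recorder above.

```python
from collections import defaultdict

def tail_attribution(requests: list[dict[str, float]], tail_pct: float = 95.0) -> dict[str, float]:
    """Return each stage's share of total time among the slowest requests."""
    if not requests:
        return {}
    totals = sorted(sum(r.values()) for r in requests)
    cutoff_index = min(int(len(totals) * tail_pct / 100), len(totals) - 1)
    cutoff = totals[cutoff_index]

    # Keep only requests at or above the tail cutoff
    tail = [r for r in requests if sum(r.values()) >= cutoff]

    stage_time: dict[str, float] = defaultdict(float)
    for r in tail:
        for stage, ms in r.items():
            stage_time[stage] += ms

    grand_total = sum(stage_time.values())
    if grand_total == 0:
        return {}
    return {stage: ms / grand_total for stage, ms in stage_time.items()}
```

If one stage holds most of the share in the tail but not in the full population, that stage is where the tail comes from.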

Root Cause Analysis: LLM Completion Tail

Two LLM-specific factors drive the long tail:

Factor 1: Output token count. Time to complete scales roughly linearly with the number of output tokens. A 2,000-token response takes approximately 4x longer than a 500-token response at the same context size.

Factor 2: Prefill time. Before generating the first token, the model processes all input tokens (the "prefill" phase). Large contexts (above 4,000 tokens) have noticeably longer time-to-first-token.

Check this correlation in your logs:

Python

import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("llm_calls.csv")  # columns: completion_tokens, context_tokens, duration_ms

# A strong positive correlation suggests (but does not prove) that output length drives latency
correlation_output = df["completion_tokens"].corr(df["duration_ms"])
correlation_context = df["context_tokens"].corr(df["duration_ms"])
print(f"Correlation: output tokens vs latency: {correlation_output:.2f}")
print(f"Correlation: context tokens vs latency: {correlation_context:.2f}")

# Scatter plot: context size vs. latency
df.plot.scatter(x="context_tokens", y="duration_ms", alpha=0.3)
plt.axvline(x=4000, color="red", linestyle="--", label="4k context threshold")
plt.show()

Fix 1: Cap Context Size

If your vector search retrieves 10 chunks and each is 512 tokens, you have 5,120 context tokens. Prefill time grows with context size, so every extra chunk adds latency before the first token. Cap it:

Python

from typing import List

import tiktoken

def build_context_with_budget(
    chunks: List[str],
    max_tokens: int = 2000,
    model: str = "gpt-4o",
) -> tuple[str, int]:
    """
    Include chunks until token budget is exhausted.
    Returns (context_string, actual_token_count).
    """
    enc = tiktoken.encoding_for_model(model)
    selected_chunks = []
    total_tokens = 0

    for chunk in chunks:  # chunks already sorted by relevance (most relevant first)
        chunk_tokens = len(enc.encode(chunk))
        if total_tokens + chunk_tokens > max_tokens:
            break
        selected_chunks.append(chunk)
        total_tokens += chunk_tokens

    return "\n\n".join(selected_chunks), total_tokens

# Before: avg context = 4,800 tokens, P95 LLM latency = 12s
# After: avg context = 2,000 tokens, P95 LLM latency = 6s
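
The function above stops at the first chunk that does not fit. When chunk sizes vary, a possible alternative is to skip an oversized chunk and keep trying smaller ones. The sketch below takes the token counter as a parameter so it runs without a tokenizer installed; `pack_chunks` and the whitespace-based `approx_tokens` are stand-ins for illustration, not a replacement for `tiktoken`.

```python
from typing import Callable

def pack_chunks(
    chunks: list[str],
    max_tokens: int,
    count_tokens: Callable[[str], int],
) -> tuple[list[str], int]:
    """Greedily fill the budget, skipping chunks that would overflow it."""
    selected: list[str] = []
    used = 0
    for chunk in chunks:  # assumed sorted most relevant first
        cost = count_tokens(chunk)
        if used + cost > max_tokens:
            continue  # too big for the remaining budget; try the next chunk
        selected.append(chunk)
        used += cost
    return selected, used

def approx_tokens(text: str) -> int:
    # Crude stand-in: one token per whitespace-separated word
    return len(text.split())

picked, used = pack_chunks(["a b c", "d e f g h", "i j"], max_tokens=5, count_tokens=approx_tokens)
```

The trade-off is that a less relevant small chunk can be included ahead of a more relevant large one, so stopping at the first overflow may still be the better choice when relevance ordering matters most.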

Fix 2: Cap Output Length Per Request Type

Not all requests need long answers. Add a max_tokens budget per query type:

Python

from enum import Enum

class ResponseType(Enum):
    FACTUAL = "factual"       # "What is the upload limit?" → short
    SUMMARY = "summary"       # "Summarize this policy" → medium
    ANALYSIS = "analysis"     # "Compare options A and B" → long

MAX_OUTPUT_TOKENS = {
    ResponseType.FACTUAL: 150,
    ResponseType.SUMMARY: 400,
    ResponseType.ANALYSIS: 800,
}

def classify_response_type(query: str) -> ResponseType:
    """Quick keyword-based classification; swap in a small model or regex rules if needed."""
    query_lower = query.lower()
    if any(w in query_lower for w in ["what is", "how much", "when does", "which"]):
        return ResponseType.FACTUAL
    if any(w in query_lower for w in ["summarize", "overview", "brief"]):
        return ResponseType.SUMMARY
    return ResponseType.ANALYSIS

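async def adaptive_llm_call(query: str, context: str) -> dict:
    response_type = classify_response_type(query)
    max_tokens = MAX_OUTPUT_TOKENS[response_type]

    # Assumes openai_client is an openai.AsyncOpenAI instance; a synchronous
    # client would block the event loop inside this coroutine.
    response = await openai_client.chat.completions.create(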
        model="gpt-4o",
        messages=[
            {"role": "system", "content": f"Context:\n{context}"},
            {"role": "user", "content": query},
        ],
        max_tokens=max_tokens,
    )

    return {
        "answer": response.choices[0].message.content,
        "response_type": response_type.value,
        "max_tokens_applied": max_tokens,
    }

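Setting max_tokens=150 for factual queries bounds the generation phase at the time needed to produce 150 tokens, regardless of how much the model would otherwise write. Prefill time is unaffected, and an answer that hits the cap is cut off, so leave some headroom in each budget.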
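
A hard cap trades latency for the risk of truncated answers. The Chat Completions API reports `finish_reason == "length"` on a choice when generation stopped at the token limit, so the truncation rate per response type can be tracked alongside latency. The sketch below works on already-logged `(response_type, finish_reason)` pairs; the function name and the log shape are assumptions for illustration.

```python
from collections import Counter

def truncation_rates(calls: list[tuple[str, str]]) -> dict[str, float]:
    """Share of calls per response type that stopped at the token limit."""
    totals: Counter = Counter()
    truncated: Counter = Counter()
    for response_type, finish_reason in calls:
        totals[response_type] += 1
        if finish_reason == "length":
            truncated[response_type] += 1
    return {rtype: truncated[rtype] / totals[rtype] for rtype in totals}

# Made-up log entries
calls = [
    ("factual", "stop"),
    ("factual", "length"),
    ("analysis", "stop"),
]
rates = truncation_rates(calls)
```

A high truncation rate for one response type suggests its budget is too tight or the classifier is routing long-answer queries into it.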

Fix 3: Streaming to Reduce Perceived P95

Streaming does not change P95 total completion time, but it changes P95 time-to-first-token (TTFT). Users perceive a response that starts appearing in 1 second much better than a response that appears fully after 8 seconds.

Python

from fastapi import FastAPI
from fastapi.responses import StreamingResponse
import asyncio
import json

app = FastAPI()

async def stream_with_early_metadata(query: str):
    """
    Immediately yields metadata (source documents, confidence)
    before the LLM starts generating. Users see something instantly.
    """
    # Phase 1: retrieval (fast, ~400ms)
    query_embedding = await embed_text(query)
    chunks = await vector_store.search(query_embedding, k=5)
    # Assumes each retrieved chunk exposes its text as `.text`;
    # adjust to your vector store's result type.
    context, context_tokens = build_context_with_budget(
        [c.text for c in chunks], max_tokens=2000
    )

    # Immediately yield source info so the user sees it before the LLM generates
    yield json.dumps({
        "type": "metadata",
        "sources": [c.metadata.get("source") for c in chunks[:3]],
        "context_tokens": context_tokens,
    }) + "\n"

    # Phase 2: LLM (slow, but streaming)
    # Assumes openai_client is an openai.AsyncOpenAI instance, so the stream
    # is consumed with `async for` and does not block the event loop.
    stream = await openai_client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": f"Context:\n{context}"},
            {"role": "user", "content": query},
        ],
        stream=True,
        max_tokens=600,
    )

    async for chunk in stream:
        if not chunk.choices:
            continue  # some chunks carry no choices
        delta = chunk.choices[0].delta.content
        if delta:
            yield json.dumps({"type": "token", "content": delta}) + "\n"

    yield json.dumps({"type": "done"}) + "\n"

@app.get("/query")
async def query_endpoint(q: str):
    return StreamingResponse(
        stream_with_early_metadata(q),
        media_type="application/x-ndjson",
    )

With streaming, the P95 wait before the user sees any output drops from about 12 seconds to approximately 1.4 seconds, even though P95 total time remains 12 seconds. Users typically perceive the streamed response as significantly faster.
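
To confirm the TTFT improvement from the client's side, the NDJSON stream can be timed as it is consumed. This is a minimal sketch that accepts any iterable of lines (for example, the line iterator of an HTTP client) and uses the event types emitted by the endpoint above; the function name `measure_stream` is an assumption.

```python
import json
import time
from typing import Callable, Iterable, Optional

def measure_stream(lines: Iterable[str], clock: Callable[[], float] = time.perf_counter) -> dict:
    """Consume NDJSON events and record time to first token and total time."""
    start = clock()
    ttft_s: Optional[float] = None
    parts: list[str] = []

    for line in lines:
        if not line.strip():
            continue  # ignore blank keep-alive lines
        event = json.loads(line)
        if event.get("type") == "token":
            if ttft_s is None:
                ttft_s = clock() - start  # first visible token
            parts.append(event["content"])
        elif event.get("type") == "done":
            break

    return {"ttft_s": ttft_s, "total_s": clock() - start, "answer": "".join(parts)}

# Simulated stream; in practice the lines come from the HTTP response
sample = [
    '{"type": "metadata", "sources": []}\n',
    '{"type": "token", "content": "Hel"}\n',
    '{"type": "token", "content": "lo"}\n',
    '{"type": "done"}\n',
]
result = measure_stream(sample)
```

Logging `ttft_s` and `total_s` separately keeps the perceived-latency metric from being confused with total completion time.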

Fix 4: Cold Start Prevention

If your container scales to zero overnight, the first requests after scale-up pay for a cold start that can add anywhere from a few seconds to over a minute, depending on which of these steps they have to wait for:

Container provisioning: 30-60 seconds
Python/Node startup: 5-15 seconds
Warm-up of connection pools (Redis, vector store): 2-5 seconds

Prevent cold starts by maintaining minimum replicas:

YAML

# Azure Container Apps — containerapp.yaml
resources:
  cpu: 2
  memory: "4Gi"

scale:
  minReplicas: 2      # NEVER scale to zero
  maxReplicas: 20
  rules:
    - name: http-scaling
      http:
        metadata:
          concurrentRequests: "10"

Also add a startup hook that warms connection pools before the replica serves its first request:

Python

from contextlib import asynccontextmanager
from fastapi import FastAPI

@asynccontextmanager
async def lifespan(app: FastAPI):
    """
    Runs on startup. Pre-warms connections so first user request
    does not pay the connection setup penalty.
    """
    # Pre-warm embedding model connection
    await embed_text("warm up query")

    # Pre-warm vector store connection pool
    await vector_store.ping()

    # Pre-warm Redis connection
    await redis_client.ping()

    print("All connections warm. Ready to serve.")
    yield

    # Shutdown cleanup
    await redis_client.aclose()

app = FastAPI(lifespan=lifespan)
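
One risk with sequential warm-up is that a single hanging dependency stalls startup indefinitely. A possible variation runs the checks concurrently with a per-check timeout and reports which ones succeeded. The sketch below uses stub coroutines in place of real clients; `warm_up` and the stub names are hypothetical.

```python
import asyncio
from typing import Awaitable, Callable

async def warm_up(
    checks: dict[str, Callable[[], Awaitable[object]]],
    timeout_s: float = 5.0,
) -> dict[str, bool]:
    """Run all warm-up checks concurrently; True means the check finished in time."""

    async def run(check: Callable[[], Awaitable[object]]) -> bool:
        try:
            await asyncio.wait_for(check(), timeout=timeout_s)
            return True
        except Exception:
            # Covers timeouts and connection errors alike
            return False

    names = list(checks)
    results = await asyncio.gather(*(run(checks[name]) for name in names))
    return dict(zip(names, results))

# Stub dependencies standing in for real clients
async def ping_ok() -> str:
    return "pong"

async def ping_slow() -> None:
    await asyncio.sleep(1)

async def ping_down() -> None:
    raise ConnectionError("dependency unavailable")

status = asyncio.run(
    warm_up({"redis": ping_ok, "vector_store": ping_slow, "embedding": ping_down}, timeout_s=0.05)
)
```

Whether a failed check should block startup or only be logged depends on whether the service can degrade gracefully without that dependency.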

Fix 5: Timeout + Graceful Degradation

Set aggressive timeouts on LLM calls and return a graceful fallback if exceeded:

Python

import asyncio
import hashlib
import logging

logger = logging.getLogger(__name__)

async def llm_with_timeout(
    prompt: str,
    timeout_seconds: float = 8.0,
    fallback_message: str = "I am taking longer than usual. Please try again.",
) -> dict:
    try:
        response = await asyncio.wait_for(
            async_llm_call(prompt),
            timeout=timeout_seconds,
        )
        return {"answer": response, "timed_out": False}
    except asyncio.TimeoutError:
        # Log for monitoring: this is a tail latency event.
        # hash() is salted per process, so log a stable digest instead.
        prompt_digest = hashlib.sha256(prompt.encode("utf-8")).hexdigest()[:12]
        logger.warning("LLM timeout after %ss for prompt %s", timeout_seconds, prompt_digest)
        return {
            "answer": fallback_message,
            "timed_out": True,
            "retry_url": "/query?retry=true",  # hint to client to retry
        }

Timeouts prevent the P99 from reaching 22 seconds at the cost of some degraded responses for extreme tail cases.
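
How aggressive the timeout can be depends on how many requests it would cut off. Before settling on a value, it may help to replay logged durations against a few candidates. A minimal sketch, with a made-up sample and a hypothetical helper name:

```python
def timeout_rate(durations_s: list[float], timeout_s: float) -> float:
    """Fraction of logged requests that would have exceeded the timeout."""
    if not durations_s:
        return 0.0
    return sum(d > timeout_s for d in durations_s) / len(durations_s)

# Made-up durations in seconds; use your own logs in practice
logged = [2.0, 3.5, 7.9, 8.0, 9.2, 22.0]
for candidate in (5.0, 8.0, 12.0):
    print(f"timeout={candidate:>4.1f}s -> {timeout_rate(logged, candidate):.0%} fallback responses")
```

A timeout that sends more than a small fraction of traffic to the fallback message is probably masking a problem that the earlier fixes should address first.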

Putting It All Together: P95 Latency Reduction

| Fix | P95 Before | P95 After | Notes |
|---|---|---|---|
| Baseline | 12,400 ms | n/a | No optimizations |
| Context cap (2,000 tokens) | 12,400 ms | 7,200 ms | Largest single fix |
| Output token caps per query type | 7,200 ms | 5,100 ms | Prevents long-answer queries dominating |
| Min replicas (no cold start) | 5,100 ms | 4,200 ms | Eliminates cold-start outliers |
| 8-second timeout with fallback | 4,200 ms | 4,200 ms | Clips P99, not P95 |
| Streaming (TTFT) | perceived 12,400 ms | perceived 1,400 ms | Does not change total time but feels 5-8x faster |

The result: P95 total latency drops from 12.4 seconds to approximately 4.2 seconds. P95 perceived latency (TTFT) drops from 12.4 seconds to approximately 1.4 seconds due to streaming.
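
The relative contribution of each fix follows from the before and after figures quoted in this section. A short calculation, using only those numbers:

```python
def reduction_pct(before_ms: float, after_ms: float) -> float:
    """Percentage reduction from before to after."""
    return (before_ms - after_ms) / before_ms * 100

# (fix, P95 before in ms, P95 after in ms), taken from the summary above
steps = [
    ("Context cap", 12_400, 7_200),
    ("Output token caps", 7_200, 5_100),
    ("Min replicas", 5_100, 4_200),
]

for name, before, after in steps:
    print(f"{name:<20} {reduction_pct(before, after):5.1f}% lower P95")

overall = reduction_pct(steps[0][1], steps[-1][2])
print(f"{'Overall':<20} {overall:5.1f}% lower P95")
```

The context cap alone accounts for a reduction of about 42%, and the three total-time fixes together bring P95 down by about 66%.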

Monitoring the Tail

Once you apply fixes, track percentile metrics continuously:

Python

import time

from prometheus_client import Histogram

llm_latency = Histogram(
    "llm_request_duration_seconds",
    "LLM call duration",
    buckets=[0.5, 1, 2, 3, 5, 8, 12, 20, 30],
    labelnames=["endpoint", "response_type"],
)

async def monitored_llm_call(query: str, endpoint: str) -> str:
    response_type = classify_response_type(query)
    start = time.perf_counter()

    try:
        result = await adaptive_llm_call(query, context="...")
        return result["answer"]
    finally:
        elapsed = time.perf_counter() - start
        llm_latency.labels(endpoint=endpoint, response_type=response_type.value).observe(elapsed)

Set an alert when P95 exceeds your SLO:

YAML

# Prometheus alerting rule
groups:
  - name: latency_slo
    rules:
      - alert: HighP95Latency
        expr: histogram_quantile(0.95, rate(llm_request_duration_seconds_bucket[5m])) > 5
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "P95 LLM latency above 5 seconds"

With proper instrumentation, you will catch tail latency regressions in minutes rather than waiting for user complaints.
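
One caveat about histogram-based alerts: the reported P95 is an estimate interpolated inside whichever bucket contains the 95th-percentile rank, so its precision depends on the bucket boundaries. The sketch below is a simplified re-implementation of that linear interpolation for illustration, not Prometheus's actual code, and the bucket counts are made up.

```python
import math

def histogram_quantile(q: float, buckets: list[tuple[float, int]]) -> float:
    """
    buckets: (upper_bound_seconds, cumulative_count) pairs sorted by bound,
    ending with the +Inf bucket. Requires at least one finite bucket.
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0
    for index, (upper_bound, count) in enumerate(buckets):
        if count >= rank:
            if math.isinf(upper_bound):
                # Rank falls past the last finite bucket: report that bound
                return buckets[index - 1][0]
            in_bucket = count - lower_count
            if in_bucket == 0:
                return lower_bound
            # Assume observations are spread evenly inside the bucket
            return lower_bound + (upper_bound - lower_bound) * (rank - lower_count) / in_bucket
        lower_bound, lower_count = upper_bound, count
    return math.nan

bounds = [0.5, 1, 2, 3, 5, 8, 12, 20, 30, math.inf]
cumulative = [5, 15, 40, 60, 80, 96, 99, 100, 100, 100]  # hypothetical counts
p95 = histogram_quantile(0.95, list(zip(bounds, cumulative)))
```

With the buckets used above, any P95 between 5 and 8 seconds is an interpolated guess. Adding boundaries near the SLO threshold (for example 4 and 6) would make an alert at 5 seconds more precise.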
