LLM Inference and Serving
How to serve LLMs at scale: KV cache management, continuous batching, vLLM, PagedAttention, speculative decoding, and production deployment patterns.
The Inference Bottleneck
LLM inference has two distinct phases with different performance characteristics:
Prefill phase: Process the entire prompt in parallel. Compute-bound: limited by GPU FLOP/s. Fast even for long prompts because all tokens are processed simultaneously.
Decode phase: Generate one token at a time. Memory-bandwidth-bound: the bottleneck is reading the KV cache and model weights from GPU memory on every step, not computation. This is why inference is slow: a GPU doing matrix multiply at 312 TFLOP/s can still be slow if it's waiting for memory.
For a 7B model at float16:
Model weights: 14GB
One decode step reads ~14GB from HBM
A100 HBM bandwidth: 2TB/s
Minimum time per token: 14GB / 2TB/s = 7ms
Maximum decode speed: ~143 tokens/second (single request)
With a batch of 10 requests: still ~7ms per step (the weight read is shared across the batch)
Throughput: ~1,430 tokens/second

Batching is essential for throughput: reading the weights once to serve multiple requests amortizes the memory bandwidth cost. The estimate ignores per-request KV cache reads, which grow with batch size and sequence length.
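The arithmetic above can be reproduced with a few lines of Python. This is a minimal sketch of a bandwidth-only bound, not a benchmark: the function name `decode_estimate` and its parameters are made up for illustration, and it ignores KV cache reads, kernel launch overhead, and compute time, so measured numbers will be worse.

```python
def decode_estimate(n_params, bytes_per_param, hbm_bytes_per_s, batch_size=1):
    """Bandwidth-only estimate of decode step time and throughput.

    Assumes every decode step streams all model weights from HBM exactly once
    and that nothing else (KV cache reads, compute) costs any time.
    """
    weight_bytes = n_params * bytes_per_param
    step_seconds = weight_bytes / hbm_bytes_per_s    # time to read the weights once
    tokens_per_second = batch_size / step_seconds    # one token per request per step
    return step_seconds, tokens_per_second


# 7B parameters, 2 bytes each (float16), 2 TB/s of HBM bandwidth
single_step, single_tps = decode_estimate(7e9, 2, 2e12)
batch_step, batch_tps = decode_estimate(7e9, 2, 2e12, batch_size=10)

print(f"batch=1:  {single_step * 1e3:.1f} ms/step, {single_tps:.0f} tokens/s")
print(f"batch=10: {batch_step * 1e3:.1f} ms/step, {batch_tps:.0f} tokens/s")
```

Under these assumptions the step time does not change with batch size, so throughput scales linearly with the batch until KV cache traffic or compute becomes the limit.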
KV Cache: The Memory Tradeoff
During decoding, each token's key and value tensors are cached to avoid recomputation:
KV cache memory per token = 2 × n_layers × n_heads × head_dim × bytes_per_element
= 2 × 32 × 32 × 128 × 2 (bfloat16)
= 524,288 bytes ≈ 0.5MB per token
For a 7B model with standard multi-head attention (e.g. LLaMA-2-7B):
32 layers × 32 heads × 128 head_dim × 2 (K,V) × 2 bytes = 0.5MB per token
4096-token sequence: 2GB of KV cache
8192-token sequence: 4GB of KV cache
Models with grouped-query attention cache fewer heads: LLaMA-3-8B has 8 KV heads, so its cache is about 0.125MB per token.
An A100 with 80GB minus 14GB of model weights leaves 66GB for KV cache
= 66GB / 0.5MB ≈ 132,000 tokens total KV cache capacity
= ~32 concurrent requests of 4096 tokens each

Continuous Batching
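Static batching waits until a fixed-size batch has formed, then runs the whole batch to completion. This wastes GPU time when requests arrive unevenly or finish at different lengths, because finished slots sit idle until the longest sequence is done. Continuous batching instead inserts a waiting request into the batch as soon as a slot frees up.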
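To make the difference concrete, here is a toy simulation that counts decode steps under both policies. It is a sketch under simplifying assumptions: all requests are already queued, they are served in FIFO order, each request needs a known number of decode steps, and a step costs the same regardless of how many slots are busy. The function names are illustrative.

```python
import heapq


def static_batch_steps(lengths, batch_size):
    """Decode steps when each fixed batch runs until its longest request ends."""
    return sum(
        max(lengths[i:i + batch_size])
        for i in range(0, len(lengths), batch_size)
    )


def continuous_batch_steps(lengths, batch_size):
    """Decode steps when a waiting request takes over a slot as soon as it frees."""
    if not lengths:
        return 0
    # Each heap entry is the step at which one slot becomes free.
    slots = [0] * min(batch_size, len(lengths))
    heapq.heapify(slots)
    for n in lengths:
        free_at = heapq.heappop(slots)       # earliest slot to free up
        heapq.heappush(slots, free_at + n)   # the request occupies it for n steps
    return max(slots)


lengths = [100, 10, 10, 10, 100, 10, 10, 10]
print(static_batch_steps(lengths, batch_size=4))      # 200
print(continuous_batch_steps(lengths, batch_size=4))  # 110
```

In this example the short requests no longer wait behind the long ones, so the same work finishes in 110 steps instead of 200. The scheduler class that follows shows the bookkeeping behind the same idea.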
import asyncio
import time
from dataclasses import dataclass, field


@dataclass
class Request:
    id: str
    prompt_tokens: list[int]
    max_new_tokens: int
    generated_tokens: list[int] = field(default_factory=list)
    finished: bool = False
    created_at: float = field(default_factory=time.time)


class ContinuousBatchingScheduler:
    """
    Continuously fills a batch as requests complete and new ones arrive.
    This is the core scheduling idea behind vLLM and TGI.
    """

    def __init__(self, max_batch_size: int = 32, max_seq_len: int = 4096):
        self.max_batch_size = max_batch_size
        self.max_seq_len = max_seq_len
        self.active_requests: dict[str, Request] = {}
        self.waiting_queue: asyncio.Queue[Request] = asyncio.Queue()

    async def add_request(self, request: Request) -> None:
        await self.waiting_queue.put(request)

    def _fill_batch(self) -> None:
        """Add waiting requests to the active batch while slots are free."""
        while (
            len(self.active_requests) < self.max_batch_size
            and not self.waiting_queue.empty()
        ):
            try:
                req = self.waiting_queue.get_nowait()
                self.active_requests[req.id] = req
            except asyncio.QueueEmpty:
                break

    def step(self, model) -> dict[str, list[int]]:
        """
        Run one decode step for all active requests.
        Returns request_id -> generated tokens for requests that finished.
        """
        self._fill_batch()
        if not self.active_requests:
            return {}

        # Prepare the batch. In real systems this is done with careful
        # attention masking to handle variable-length sequences together.
        request_list = list(self.active_requests.values())
        # ... (model forward pass over request_list, appending one token
        # to each request's generated_tokens) ...

        # Remove finished requests so their slots can be reused next step
        completed = {}
        for req_id, req in list(self.active_requests.items()):
            if req.finished or len(req.generated_tokens) >= req.max_new_tokens:
                completed[req_id] = req.generated_tokens
                del self.active_requests[req_id]
        return completed

vLLM and PagedAttention
The key innovation of vLLM (Kwon et al., 2023): instead of allocating a contiguous block of GPU memory for each request's KV cache, use paged memory management, the same idea as OS virtual memory.
The problem with contiguous allocation:
- Reserve max_seq_len × kv_size for each request upfront
- Short requests waste reserved memory
- Memory fragmentation: many small gaps, total free memory is large but no contiguous block is available
PagedAttention solution:
- Divide KV cache into fixed-size blocks (typically 16 tokens)
- Allocate blocks on-demand as tokens are generated
- Different requests can use non-contiguous physical blocks
- A block table maps logical to physical block positions
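The block-table idea can be sketched in plain Python. This is a toy model of the bookkeeping only, not vLLM's implementation: the class `BlockAllocator` and its methods are hypothetical names, and the "physical blocks" are just integers standing in for regions of GPU memory.

```python
from collections import deque


class BlockAllocator:
    """Toy paged KV cache bookkeeping: logical token positions -> physical blocks."""

    def __init__(self, num_blocks, block_size=16):
        self.block_size = block_size
        self.free_blocks = deque(range(num_blocks))     # physical block ids
        self.block_tables: dict[str, list[int]] = {}    # request -> physical blocks
        self.num_tokens: dict[str, int] = {}            # request -> tokens stored

    def append_token(self, request_id):
        """Reserve space for one more token, allocating a block only when needed."""
        n = self.num_tokens.get(request_id, 0)
        if n % self.block_size == 0:                    # no block yet, or last one full
            if not self.free_blocks:
                raise MemoryError("no free KV cache blocks")
            table = self.block_tables.setdefault(request_id, [])
            table.append(self.free_blocks.popleft())
        self.num_tokens[request_id] = n + 1

    def lookup(self, request_id, token_index):
        """Translate a logical token index into (physical block, offset in block)."""
        block = self.block_tables[request_id][token_index // self.block_size]
        return block, token_index % self.block_size

    def free(self, request_id):
        """Return all of a finished request's blocks to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(request_id, []))
        self.num_tokens.pop(request_id, None)


alloc = BlockAllocator(num_blocks=4, block_size=2)
alloc.append_token("a")   # "a" gets block 0
alloc.append_token("b")   # "b" gets block 1
alloc.append_token("a")   # fits in block 0
alloc.append_token("a")   # block 0 is full, "a" gets block 2
print(alloc.block_tables)     # {'a': [0, 2], 'b': [1]}
print(alloc.lookup("a", 2))   # (2, 0)
```

Request "a" ends up with physical blocks 0 and 2, which are not adjacent, and at most one partially filled block per request is wasted.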
# Using vLLM for high-throughput serving
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    tensor_parallel_size=1,        # Number of GPUs for tensor parallelism
    gpu_memory_utilization=0.90,   # Fraction of GPU memory vLLM may use (weights, activations, KV cache)
    max_model_len=8192,
    quantization=None,             # or "awq", "gptq", "squeezellm" (availability depends on the vLLM version)
    enforce_eager=False,           # False keeps CUDA graphs enabled for faster decoding
)

# Batch inference
prompts = [
    "Explain the pharmacokinetics of warfarin",
    "What is the mechanism of beta-blockers?",
    "How does metformin lower blood glucose?",
]
sampling_params = SamplingParams(
    temperature=0.0,
    max_tokens=512,
    stop=["</s>", "<|eot_id|>"],
)
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(f"Prompt: {output.prompt[:50]}...")
    print(f"Response: {output.outputs[0].text}")
    print(f"Tokens generated: {len(output.outputs[0].token_ids)}")
    print()

vLLM OpenAI-Compatible Server
# Start vLLM as an OpenAI-compatible API server:
# vllm serve meta-llama/Meta-Llama-3-8B-Instruct --port 8000

# Then use the standard OpenAI client
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed",  # vLLM doesn't require auth by default
)
response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    messages=[
        {"role": "user", "content": "What is the interaction between warfarin and NSAIDs?"}
    ],
    max_tokens=512,
    temperature=0,
)
print(response.choices[0].message.content)
print(f"Tokens: {response.usage.prompt_tokens} in, {response.usage.completion_tokens} out")

Speculative Decoding in Practice
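Use a small "draft" model to propose several tokens at once, then verify them with the large model in a single forward pass. Draft tokens that match what the large model would have produced are kept; at the first mismatch the large model's token is used instead and the rest of the draft is discarded.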
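The propose-then-verify loop can be sketched for the greedy case with toy models. This is an illustration under stated assumptions, not how any particular library implements it: `speculative_step`, `draft_next`, and `target_next` are hypothetical names, each "model" is just a function from a token list to the next token, and the target is called once per position for clarity where a real system would score all draft positions in one forward pass. Sampling-based variants use a probabilistic accept/reject rule instead of exact matching.

```python
from typing import Callable, Sequence

NextToken = Callable[[Sequence[int]], int]


def speculative_step(prefix: list[int], draft_next: NextToken,
                     target_next: NextToken, k: int = 4) -> list[int]:
    """One greedy speculative step; returns between 1 and k + 1 new tokens."""
    # 1. The draft model proposes k tokens autoregressively.
    proposed, ctx = [], list(prefix)
    for _ in range(k):
        token = draft_next(ctx)
        proposed.append(token)
        ctx.append(token)

    # 2. The target model checks each proposed position in order.
    accepted, ctx = [], list(prefix)
    for token in proposed:
        expected = target_next(ctx)
        if token != expected:
            accepted.append(expected)   # keep the target's token, drop the rest
            return accepted
        accepted.append(token)
        ctx.append(token)

    # 3. Every draft token matched: the target contributes one extra token.
    accepted.append(target_next(ctx))
    return accepted


def target(ctx):
    """Toy target model: always counts upward modulo 10."""
    return (ctx[-1] + 1) % 10


def draft(ctx):
    """Toy draft model: agrees with the target except right after a 3."""
    return 0 if ctx[-1] == 3 else target(ctx)


print(speculative_step([1], draft, target, k=4))   # [2, 3, 4]
print(speculative_step([1], target, target, k=4))  # [2, 3, 4, 5, 6]
```

In the first call two draft tokens are accepted and the third is corrected; in the second, a perfect draft yields five tokens for one verification pass. Either way the output matches what greedy decoding with the target alone would produce. The Hugging Face example that follows delegates this loop to the library.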
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Draft model: small and fast (e.g., Llama-3.2-1B)
draft_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-1B",
    torch_dtype=torch.bfloat16,
)
# Target model: large and accurate (e.g., LLaMA-3-8B)
target_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B",
    torch_dtype=torch.bfloat16,
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")

# HuggingFace supports speculative decoding with assistant_model
inputs = tokenizer("What is the pharmacokinetics of warfarin?", return_tensors="pt")
output = target_model.generate(
    **inputs,
    assistant_model=draft_model,  # Speculative decoding
    max_new_tokens=200,
    do_sample=False,
)
print(tokenizer.decode(output[0], skip_special_tokens=True))
# Expected: ~2-3× speedup on typical text
# Speedup depends on acceptance rate (how often draft tokens are accepted)

Quantized Inference
4-bit quantization cuts weight memory to roughly a quarter of the 16-bit size and can speed up memory-bound decoding, since fewer bytes are read per token:
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 4-bit NF4 quantization for inference
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-70B-Instruct",
    quantization_config=quantization_config,
    device_map="auto",
)
# 70B × 4 bits / 8 = 35GB: fits on a single 80GB A100
# Trade-off: some accuracy loss (task dependent), ~4× less weight memory than bfloat16

Quantization options comparison:
| Format | Bits | Memory (70B weights) | Quality (vs. baseline) | Speed (vs. float32) |
|---|---|---|---|---|
| float32 | 32 | 280GB | Baseline | Slowest |
| bfloat16 | 16 | 140GB | ≈ baseline | 2× |
| AWQ int8 | 8 | 70GB | 99% | 2-3× |
| GPTQ int4 | 4 | 35GB | 95-98% | 3-4× |
| NF4 (bnb) | 4 | 35GB | 95-98% | 3-4× |
| GGUF Q4_K_M | 4 | ~40GB | 96% | CPU+GPU |
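The memory column follows directly from parameter count times bits per weight. A minimal sketch that reproduces it is below; `weight_memory_gb` is an illustrative helper, it counts weights only, and it ignores quantization scales, the KV cache, and activations, which is one reason real 4-bit files can be somewhat larger than the nominal figure.

```python
def weight_memory_gb(n_params, bits_per_weight):
    """Weight storage in GB (10^9 bytes), ignoring scales and other metadata."""
    return n_params * bits_per_weight / 8 / 1e9


N_PARAMS = 70e9
for name, bits in [("float32", 32), ("bfloat16", 16), ("int8", 8), ("int4", 4)]:
    size = weight_memory_gb(N_PARAMS, bits)
    fits = "yes" if size < 80 else "no"
    print(f"{name:>8}: {size:6.1f} GB  fits on one 80GB GPU: {fits}")
```

The "fits" check compares weights against total GPU memory only; in practice some headroom is needed for the KV cache, so a 70GB int8 model on an 80GB card leaves little room for concurrent requests.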
Production Serving Architecture
               ┌───────────────────────────────────┐
               │           Load Balancer           │
               │  (routes requests, health check)  │
               └─────────────────┬─────────────────┘
                                 │
         ┌───────────────────────┼───────────────────────┐
         │                       │                       │
┌────────┴────────┐     ┌────────┴────────┐     ┌────────┴────────┐
│  vLLM Worker 0  │     │  vLLM Worker 1  │     │  vLLM Worker 2  │
│ GPU 0-1 (TP=2)  │     │ GPU 2-3 (TP=2)  │     │ GPU 4-5 (TP=2)  │
└────────┬────────┘     └────────┬────────┘     └────────┬────────┘
         │                       │                       │
         └───────────────────────┼───────────────────────┘
                                 │
               ┌─────────────────┴─────────────────┐
               │         Result Aggregator         │
               │      (stream back to caller)      │
               └───────────────────────────────────┘

Key infrastructure decisions:
- Tensor parallelism (TP=2 or 4): Required for models that don't fit on one GPU
- Multiple replicas: For horizontal scaling of throughput
- Health checks: Detect stuck/dead workers and route around them
- Streaming: Use SSE (Server-Sent Events) to stream tokens to the client; don't wait for full response
- Timeout handling: Decode can take seconds for long responses; set appropriate read timeouts
- Priority queues: Route latency-sensitive requests (UI) ahead of batch jobs (embedding pipelines)
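As one example from the list, a priority queue in front of the workers can be sketched with the standard library. This is a minimal illustration, not a production router: the class name, the two priority levels, and the request labels are all hypothetical, and strict priority like this can starve batch jobs under sustained interactive load unless something like aging is added.

```python
import heapq
import itertools

INTERACTIVE = 0   # latency-sensitive requests, served first
BATCH = 1         # offline jobs, served when no interactive work is waiting


class PriorityRequestQueue:
    """Serves lower priority numbers first; FIFO within the same priority."""

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()   # tie-breaker that preserves arrival order

    def put(self, priority, payload):
        heapq.heappush(self._heap, (priority, next(self._counter), payload))

    def get(self):
        _, _, payload = heapq.heappop(self._heap)
        return payload

    def __len__(self):
        return len(self._heap)


queue = PriorityRequestQueue()
queue.put(BATCH, "embed-1")
queue.put(INTERACTIVE, "chat-1")
queue.put(BATCH, "embed-2")
queue.put(INTERACTIVE, "chat-2")

order = [queue.get() for _ in range(len(queue))]
print(order)   # ['chat-1', 'chat-2', 'embed-1', 'embed-2']
```

The counter keeps ordering stable within a priority level and avoids comparing payloads when priorities tie.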