LLMs Deep Dive · Lesson 15 of 24
Batching Strategies for LLM Serving
Why Batching Matters
LLM generation is memory-bandwidth-bound at small batch sizes — the GPU sits idle between memory reads. Batching amortises the memory cost:
Single request (batch_size=1):
Compute utilisation: ~5-15%
Latency per token: 10ms
Throughput: 100 tok/s
Batch of 32 requests:
Compute utilisation: 70-90%
Latency per token: 12ms (slightly higher)
Throughput: 3200 tok/s (32× improvement)
The GPU's compute units get more work per memory read.
Throughput scales nearly linearly with batch size (up to memory limit).Static Batching
Group requests into fixed-size batches and process each batch to completion:
Batch: [Request A (100 tokens), Request B (500 tokens), Request C (50 tokens)]
Step 1 Step 2 ... Step 50 Step 51 ... Step 500
Request A: [active] [active] ... [done] [PADDED] ... [PADDED]
Request B: [active] [active] ... [active] [active] ... [done]
Request C: [active] [active] ... [done] [PADDED] ... [PADDED]
Problem: Requests A and C are done at step 50 but the GPU must wait
for B to finish before starting new requests.
All those [PADDED] steps waste compute.Continuous Batching (Iteration-Level Scheduling)
Instead of waiting for the whole batch to finish, swap out completed requests and add new ones at each token generation step:
Step 1-50: [Request A, Request B, Request C]
Step 51: A finishes → swap in Request D
[Request D, Request B, Request C]
Step 52: C finishes → swap in Request E
[Request D, Request B, Request E]
...
GPU is always running at target batch size.
No padding wasted.This is how vLLM, TGI (HuggingFace), and TensorRT-LLM work.
Continuous Batching: Implementation Complexity
Challenges continuous batching must solve:
1. Variable sequence lengths:
Different requests are at different positions.
Flash attention handles this via padding-free packed batching.
2. KV cache management:
Each request has its own KV cache growing at different rates.
vLLM uses paged attention — fixed-size pages allocated on demand.
Pages from completed requests are freed for new ones.
3. Preemption:
If KV cache is full, lower-priority requests may be preempted:
their KV cache pages are freed (or swapped to CPU) and the
request is rescheduled later.Paged KV Cache (vLLM)
Traditional: Allocate max_seq_len pages per request upfront
Wastes memory if request is shorter
Can't start new request until old KV cache is free
vLLM paged approach:
KV cache divided into physical pages (e.g., 16-token blocks)
Logical pages mapped to physical pages via a block table
New pages allocated as the sequence grows
Physical pages freed when the sequence ends
Result:
Near-zero internal KV cache fragmentation
2-4× more concurrent requests for the same GPU memory
Enables continuous batching at scaleChunked Prefill
Long prompt processing can stall the GPU:
Problem: processing a 10K-token prompt takes many ms
during which no generation happens — stalls other requests
Chunked prefill:
Split long prompt into fixed-size chunks (e.g., 512 tokens)
Interleave prompt chunks with generation steps for other requests
Step 1: Process prompt chunk 1 for request A + generate token for B, C
Step 2: Process prompt chunk 2 for request A + generate token for B, C
...
Step 20: Request A is ready → join generation phase
Reduces tail latency for concurrent requests by limiting prefill stalls.
Used in: vLLM v0.3+, Sarathi (from Microsoft Research)Throughput vs Latency Trade-off
Small batch:
Low latency (few ms per token)
Low throughput (few tokens/second total)
Good for: real-time applications, low concurrency
Large batch:
Higher latency (more time to fill/drain the batch)
High throughput (many tokens/second total)
Good for: batch processing, high-concurrency API serving
Online serving:
Target latency SLA (e.g., 100ms TTFT, 30 tok/s per request)
Maximise batch size within SLA
Use dynamic batching: add requests as they arrive up to a time windowInterview Answer
"LLM inference throughput scales with batch size because the workload is memory-bandwidth-bound — larger batches amortise the cost of reading model weights once for more work. Static batching wastes GPU time padding short sequences to match the longest. Continuous (iteration-level) batching solves this by swapping completed requests out and adding new ones after every token generation step — the GPU always runs at target batch size. vLLM implements this with paged attention: KV cache is allocated in 16-token physical blocks mapped via block tables, eliminating fragmentation and enabling 2-4× more concurrent requests than pre-allocated caches."