Learnixo

LLMs Deep Dive · Lesson 15 of 24

Batching Strategies for LLM Serving

Why Batching Matters

LLM generation is memory-bandwidth-bound at small batch sizes — the GPU sits idle between memory reads. Batching amortises the memory cost:

Single request (batch_size=1):
  Compute utilisation: ~5-15%
  Latency per token: 10ms
  Throughput: 100 tok/s

Batch of 32 requests:
  Compute utilisation: 70-90%
  Latency per token: 12ms (slightly higher)
  Throughput: 3200 tok/s (32× improvement)

The GPU's compute units get more work per memory read.
Throughput scales nearly linearly with batch size (up to memory limit).

Static Batching

Group requests into fixed-size batches and process each batch to completion:

Batch: [Request A (100 tokens), Request B (500 tokens), Request C (50 tokens)]

                 Step 1    Step 2   ...   Step 50   Step 51  ...  Step 500
Request A:       [active]  [active] ... [done]    [PADDED]  ... [PADDED]
Request B:       [active]  [active] ... [active]  [active]  ... [done]
Request C:       [active]  [active] ... [done]    [PADDED]  ... [PADDED]

Problem: Requests A and C are done at step 50 but the GPU must wait
         for B to finish before starting new requests.
         All those [PADDED] steps waste compute.

Continuous Batching (Iteration-Level Scheduling)

Instead of waiting for the whole batch to finish, swap out completed requests and add new ones at each token generation step:

Step 1-50:   [Request A, Request B, Request C]
Step 51:     A finishes → swap in Request D
             [Request D, Request B, Request C]
Step 52:     C finishes → swap in Request E
             [Request D, Request B, Request E]
...

GPU is always running at target batch size.
No padding wasted.

This is how vLLM, TGI (HuggingFace), and TensorRT-LLM work.


Continuous Batching: Implementation Complexity

Challenges continuous batching must solve:

1. Variable sequence lengths:
   Different requests are at different positions.
   Flash attention handles this via padding-free packed batching.

2. KV cache management:
   Each request has its own KV cache growing at different rates.
   vLLM uses paged attention — fixed-size pages allocated on demand.
   Pages from completed requests are freed for new ones.

3. Preemption:
   If KV cache is full, lower-priority requests may be preempted:
   their KV cache pages are freed (or swapped to CPU) and the
   request is rescheduled later.

Paged KV Cache (vLLM)

Traditional: Allocate max_seq_len pages per request upfront
  Wastes memory if request is shorter
  Can't start new request until old KV cache is free

vLLM paged approach:
  KV cache divided into physical pages (e.g., 16-token blocks)
  Logical pages mapped to physical pages via a block table
  New pages allocated as the sequence grows
  Physical pages freed when the sequence ends

Result:
  Near-zero internal KV cache fragmentation
  2-4× more concurrent requests for the same GPU memory
  Enables continuous batching at scale

Chunked Prefill

Long prompt processing can stall the GPU:

Problem: processing a 10K-token prompt takes many ms
         during which no generation happens — stalls other requests

Chunked prefill:
  Split long prompt into fixed-size chunks (e.g., 512 tokens)
  Interleave prompt chunks with generation steps for other requests
  
  Step 1: Process prompt chunk 1 for request A + generate token for B, C
  Step 2: Process prompt chunk 2 for request A + generate token for B, C
  ...
  Step 20: Request A is ready → join generation phase

Reduces tail latency for concurrent requests by limiting prefill stalls.
Used in: vLLM v0.3+, Sarathi (from Microsoft Research)

Throughput vs Latency Trade-off

Small batch:
  Low latency (few ms per token)
  Low throughput (few tokens/second total)
  Good for: real-time applications, low concurrency

Large batch:
  Higher latency (more time to fill/drain the batch)
  High throughput (many tokens/second total)
  Good for: batch processing, high-concurrency API serving

Online serving:
  Target latency SLA (e.g., 100ms TTFT, 30 tok/s per request)
  Maximise batch size within SLA
  Use dynamic batching: add requests as they arrive up to a time window

Interview Answer

"LLM inference throughput scales with batch size because the workload is memory-bandwidth-bound — larger batches amortise the cost of reading model weights once for more work. Static batching wastes GPU time padding short sequences to match the longest. Continuous (iteration-level) batching solves this by swapping completed requests out and adding new ones after every token generation step — the GPU always runs at target batch size. vLLM implements this with paged attention: KV cache is allocated in 16-token physical blocks mapped via block tables, eliminating fragmentation and enabling 2-4× more concurrent requests than pre-allocated caches."