LLMs Deep Dive · Lesson 24 of 24
Interview: How Do You Reduce LLM Inference Cost?
Q: What is the primary bottleneck in LLM inference?
At small batch sizes (1-8), LLM inference is memory-bandwidth-bound — not compute-bound. The GPU must read all model weights from HBM for every token generated. On an A100, HBM bandwidth is ~2TB/s; loading 14GB of LLaMA 2 7B weights takes ~7ms per token. The GPU's tensor cores sit mostly idle between reads.
At large batch sizes (64+), inference becomes compute-bound — the same weight bytes are used to compute many outputs, and the tensor cores are fully utilised.
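The 7ms figure above follows from dividing the weight bytes by the HBM bandwidth. A minimal sketch of that arithmetic, assuming every weight is read exactly once per generated token and ignoring KV cache reads and kernel overheads (the function name is illustrative, not from any library):

```python
def min_decode_latency_ms(weight_bytes: float, hbm_bytes_per_s: float) -> float:
    """Lower bound on per-token latency when all weights are read once per token."""
    return weight_bytes / hbm_bytes_per_s * 1e3


# LLaMA 2 7B at fp16 (~14 GB) on ~2 TB/s of HBM bandwidth, as quoted above.
latency_ms = min_decode_latency_ms(14e9, 2e12)
tokens_per_s = 1e3 / latency_ms

print(f"{latency_ms:.1f} ms per token, about {tokens_per_s:.0f} tokens/s at batch size 1")
```

This is a floor, not a prediction: real decode steps are slower, but the estimate shows why adding compute does not help until the batch is large enough to reuse each weight read.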
Q: What are the main techniques for reducing LLM inference latency?
1. KV cache: avoid recomputing K/V for past tokens; O(n²) → O(n) per token
2. FlashAttention: O(n) HBM memory, 2-4× faster than naive attention
3. Speculative decoding: small model drafts k tokens, large model verifies in 1 pass; 2-4× latency reduction
4. Quantisation (INT8/INT4): weights read faster from HBM (less data); 1.5-3× speedup
5. GQA: fewer KV heads → smaller KV cache → less data read per token
6. Tensor parallelism: split model across multiple GPUs; near-linear speedup until communication overhead dominates
7. Continuous batching: no wasted compute on padding; maximises throughput
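To make the KV cache item concrete, the sketch below counts query-key score computations during generation with and without a cache. It is a counting model only (no real attention is computed), and both function names are illustrative:

```python
def attention_scores_without_cache(n_tokens: int) -> int:
    """Scores computed when every step re-runs causal attention over the whole prefix."""
    total = 0
    for t in range(1, n_tokens + 1):
        # Step t recomputes all t positions; position i attends to i keys.
        total += t * (t + 1) // 2
    return total


def attention_scores_with_cache(n_tokens: int) -> int:
    """With cached K/V, only the newest query attends to the t keys seen so far."""
    return sum(range(1, n_tokens + 1))


for n in (4, 1024):
    print(n, attention_scores_without_cache(n), attention_scores_with_cache(n))
```

Per step, the work drops from quadratic in the prefix length to linear, which is the O(n²) → O(n) per token claim in item 1.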
Each technique targets a different part of the pipeline.
The correct combination depends on the specific model, hardware, and SLA.
Q: How do you choose the right batch size for LLM serving?
Factors:
Target TTFT (time-to-first-token): lower batch → lower TTFT
Target token generation rate: depends on SLA (e.g., 30 tok/s per user)
GPU memory: model weights + KV cache × batch_size must fit
Throughput target: higher batch → higher total throughput
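The GPU memory factor above can be turned into a rough upper bound on batch size. The sketch below assumes fp16 K/V tensors, an 80GB GPU, a 4096-token maximum sequence length, and the published LLaMA 2 7B shape (32 layers, 32 KV heads, head dimension 128); it ignores activations and allocator fragmentation, so real limits are lower. The 8-KV-head variant is hypothetical and only illustrates the effect of GQA:

```python
def kv_cache_bytes_per_token(n_layers: int, n_kv_heads: int, head_dim: int,
                             bytes_per_elem: int = 2) -> int:
    # Factor 2: one K vector and one V vector per layer.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem


def max_batch_size(gpu_bytes: float, weight_bytes: float,
                   kv_bytes_per_token: int, max_seq_len: int) -> int:
    """Largest batch whose weights plus full-length KV caches fit in GPU memory."""
    free = gpu_bytes - weight_bytes
    if free <= 0:
        return 0
    return int(free // (kv_bytes_per_token * max_seq_len))


mha = kv_cache_bytes_per_token(n_layers=32, n_kv_heads=32, head_dim=128)
gqa = kv_cache_bytes_per_token(n_layers=32, n_kv_heads=8, head_dim=128)  # hypothetical

print("MHA max batch:", max_batch_size(80e9, 14e9, mha, 4096))
print("GQA max batch:", max_batch_size(80e9, 14e9, gqa, 4096))
```

Under these assumptions the KV cache, not the weights, sets the ceiling on batch size, which is why GQA and KV cache quantisation matter for throughput.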
Strategy:
For interactive applications (chat):
Maximise batch size such that p95 TTFT < 2 seconds
Use continuous batching to avoid wasting GPU time on padding and idle slots
For batch processing (offline generation, embedding):
Maximise batch size to maximise throughput
Latency is less important; fill the GPU completely
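The strategy above amounts to picking the largest batch size that still satisfies the latency constraint, or the largest one outright for offline work. A small sketch, assuming you already have profiling measurements; the numbers and the `ProfilePoint` type are invented for illustration:

```python
from typing import NamedTuple, Optional, Sequence


class ProfilePoint(NamedTuple):
    batch_size: int
    p95_ttft_s: float
    tokens_per_s: float


def pick_batch_size(profile: Sequence[ProfilePoint],
                    max_p95_ttft_s: Optional[float] = None) -> Optional[ProfilePoint]:
    """Largest batch meeting the TTFT limit; no limit means offline processing."""
    ok = [p for p in profile
          if max_p95_ttft_s is None or p.p95_ttft_s < max_p95_ttft_s]
    return max(ok, key=lambda p: p.batch_size) if ok else None


# Hypothetical measurements, not real benchmark data.
profile = [
    ProfilePoint(1, 0.15, 140.0),
    ProfilePoint(8, 0.40, 950.0),
    ProfilePoint(16, 0.90, 1700.0),
    ProfilePoint(32, 1.80, 2900.0),
    ProfilePoint(64, 3.90, 4300.0),
]

chat = pick_batch_size(profile, max_p95_ttft_s=2.0)
offline = pick_batch_size(profile)
print("chat:", chat.batch_size, "offline:", offline.batch_size)
```

With continuous batching the batch size is a cap on concurrent sequences rather than a fixed value, so the chosen number would typically be used as the scheduler's upper limit.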
Empirically:
Profile at batch sizes 1, 4, 8, 16, 32... until GPU OOM
Plot throughput (tokens/second) vs latency per request
Pick operating point matching the SLA
Q: What is the trade-off between model size and serving cost?
Larger model (e.g., 70B vs 7B):
Better quality
Higher memory (140GB vs 14GB at fp16)
Lower throughput (roughly inversely proportional to param count)
More expensive per token
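The memory figures above come from multiplying the parameter count by the bytes per parameter. A short sketch of that calculation; the GPU count covers weights only (no KV cache or activations), and the 80GB GPU size is an assumption:

```python
import math

BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(n_params: float, dtype: str) -> float:
    """Memory needed for the weights alone, in GB (10^9 bytes)."""
    return n_params * BYTES_PER_PARAM[dtype] / 1e9


def min_gpus_for_weights(n_params: float, dtype: str, gpu_gb: float = 80.0) -> int:
    """Fewest GPUs whose combined memory holds the weights; real deployments need headroom."""
    return math.ceil(weight_memory_gb(n_params, dtype) / gpu_gb)


for name, params in (("7B", 7e9), ("70B", 70e9)):
    for dtype in ("fp16", "int4"):
        print(name, dtype, weight_memory_gb(params, dtype), "GB,",
              min_gpus_for_weights(params, dtype), "GPU(s) for weights")
```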
Smaller model with more tuning:
LLaMA 2 7B fine-tuned on domain data can match or outperform the 70B base model on that domain
10× cheaper to serve
Fits on 1 GPU instead of 2-4 (140GB at fp16 needs at least two 80GB GPUs)
Practical recommendation:
Start with the smallest model that meets quality threshold
Add domain fine-tuning before moving to a larger model
Quantise to INT4 before buying more GPUs
Q: How does tensor parallelism work?
Tensor parallelism splits attention heads and FFN dimensions across multiple GPUs:
For 8 GPUs and 32 attention heads:
GPU 0: heads 0-3
GPU 1: heads 4-7
...
GPU 7: heads 28-31
Each GPU processes its portion of the computation.
An AllReduce operation combines results after each attention/FFN layer.
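The head assignment and the combine step above can be sketched in a few lines. This is a toy model in plain Python: `partition_heads` and `all_reduce_sum` are illustrative helpers, and the "AllReduce" here is just an element-wise sum over per-GPU partial outputs rather than a real collective call:

```python
from typing import List, Sequence


def partition_heads(n_heads: int, n_gpus: int) -> List[range]:
    """Assign a contiguous, equal-sized block of attention heads to each GPU."""
    if n_heads % n_gpus != 0:
        raise ValueError("n_heads must be divisible by n_gpus")
    per_gpu = n_heads // n_gpus
    return [range(g * per_gpu, (g + 1) * per_gpu) for g in range(n_gpus)]


def all_reduce_sum(partials: Sequence[Sequence[float]]) -> List[float]:
    """Element-wise sum of partial outputs; afterwards every GPU holds this result."""
    return [sum(vals) for vals in zip(*partials)]


shards = partition_heads(n_heads=32, n_gpus=8)
for gpu, heads in enumerate(shards):
    print(f"GPU {gpu}: heads {heads.start}-{heads.stop - 1}")

# Two GPUs each produce a partial output vector for the same token.
print(all_reduce_sum([[1.0, 2.0], [3.0, 4.0]]))
```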
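Communication overhead: each AllReduce moves roughly 2 × tensor_size bytes per GPU (2 × (N-1)/N for a ring AllReduce over N GPUs), where tensor_size = tokens × d_model × bytes per element
For d_model=4096 across 8 GPUs: ~134MB per layer per step, consistent with roughly 8K tokens in flight at fp16 (an assumed batch shape); a decode step with a small batch moves far less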
Fast NVLink interconnects (600 GB/s) make this manageable
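A rough way to sanity-check the communication figures above is to compute the bytes a ring AllReduce sends per GPU and divide by the link bandwidth. In the sketch below, the 8192-token count is an assumption chosen because it gives about 134MB under the simple 2× rule; 600 GB/s is treated as fully achievable, which real systems will not reach:

```python
def activation_bytes(n_tokens: int, d_model: int, bytes_per_elem: int = 2) -> int:
    """Size of the activation tensor that is all-reduced after a sharded layer."""
    return n_tokens * d_model * bytes_per_elem


def ring_allreduce_bytes_per_gpu(tensor_bytes: int, n_gpus: int) -> float:
    """Bytes each GPU sends in a ring AllReduce (reduce-scatter plus all-gather)."""
    return 2 * (n_gpus - 1) / n_gpus * tensor_bytes


tensor = activation_bytes(n_tokens=8192, d_model=4096)  # assumed batch shape, fp16
simple = 2 * tensor                                     # the 2x approximation
ring = ring_allreduce_bytes_per_gpu(tensor, n_gpus=8)
time_ms = ring / 600e9 * 1e3                            # idealised NVLink bandwidth

print(f"tensor {tensor / 1e6:.0f} MB, 2x rule {simple / 1e6:.0f} MB, "
      f"ring {ring / 1e6:.0f} MB, about {time_ms:.2f} ms per AllReduce")
```

During decode with a handful of sequences the tensor is only a few kilobytes, so latency per collective call, rather than bandwidth, tends to dominate in that regime.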
Efficiency: ~75-90% GPU utilisation at 4-8 GPUs; degrades beyond
Q: How would you reduce the cost of serving a medical LLM API?
1. Quantise to INT4 (GPTQ/AWQ):
140GB 70B model → 35GB
From 2-3 A100s to 1 A100
Quality drop: ~1-2% on downstream tasks
2. Use a smaller fine-tuned model:
7B clinical fine-tune may match 70B on medical NER/coding tasks
10× fewer GPUs needed
3. Prefix caching:
Medical system prompts are often identical across requests
Cache KV states for the system prompt prefix
Eliminates re-processing on every request
4. Speculative decoding:
Use a 1B-parameter draft model
2-4× latency improvement on routine queries
5. Continuous batching:
Maximise batch size within latency SLA
Fill GPU compute between user requests
6. Request routing:
Route simple queries (vitals extraction) to smaller model
Route complex reasoning to larger model
Q: When would you choose speculative decoding over quantisation?
Speculative decoding:
Reduces LATENCY for a single user (batch_size=1 or small)
No quality degradation — output distribution is identical
Requires a compatible draft model in memory (extra cost)
Works best when batch_size is small (memory-bound regime)
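The draft-then-verify loop behind these points can be illustrated with a toy greedy version. This sketch assumes deterministic (argmax) decoding, where "identical output" means the same token sequence; the sampling case needs a rejection-sampling acceptance rule that is not shown. The two "models" are stand-in functions, and a real system would verify all drafted positions in one batched forward pass and also gain a bonus token when every draft is accepted:

```python
from typing import Callable, List

NextToken = Callable[[List[int]], int]


def greedy_decode(target: NextToken, prompt: List[int], n_new: int) -> List[int]:
    tokens = list(prompt)
    for _ in range(n_new):
        tokens.append(target(tokens))
    return tokens


def speculative_decode_greedy(target: NextToken, draft: NextToken,
                              prompt: List[int], n_new: int, k: int = 4) -> List[int]:
    tokens = list(prompt)
    goal = len(prompt) + n_new
    while len(tokens) < goal:
        # 1. The cheap draft model proposes up to k tokens autoregressively.
        proposal: List[int] = []
        for _ in range(min(k, goal - len(tokens))):
            proposal.append(draft(tokens + proposal))
        # 2. The target checks each position (looped here for clarity).
        for i, tok in enumerate(proposal):
            expected = target(tokens + proposal[:i])
            if tok != expected:
                # Keep the accepted prefix, then the target's own token.
                tokens.extend(proposal[:i])
                tokens.append(expected)
                break
        else:
            tokens.extend(proposal)  # every drafted token was accepted
    return tokens


def toy_target(ctx: List[int]) -> int:
    return (ctx[-1] * 3 + 1) % 7


def toy_draft(ctx: List[int]) -> int:
    # Agrees with the target except after token 5, to force some rejections.
    return 0 if ctx[-1] == 5 else toy_target(ctx)


baseline = greedy_decode(toy_target, [1], 10)
speculative = speculative_decode_greedy(toy_target, toy_draft, [1], 10)
print(baseline == speculative)
```

The speedup comes from how many drafted tokens survive verification per target pass, so a draft model that rarely agrees with the target yields little benefit.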
Quantisation:
Reduces MEMORY and cost for all batch sizes
Small quality degradation (~1-2% for INT4)
Benefits throughput at large batch sizes too
No additional model needed
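To show where the quality loss in quantisation comes from, here is a minimal round-to-nearest symmetric quantiser with one scale per tensor. This is deliberately simpler than GPTQ or AWQ, which use calibration data and finer-grained scales; the weights are made-up values:

```python
from typing import List, Tuple


def quantise_symmetric(weights: List[float], n_bits: int) -> Tuple[List[int], float]:
    """Map floats to signed integers in [-qmax, qmax] using a single scale."""
    qmax = 2 ** (n_bits - 1) - 1            # 127 for INT8, 7 for INT4
    scale = max(abs(w) for w in weights) / qmax
    if scale == 0:
        scale = 1.0                          # all-zero tensor: any scale works
    q = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return q, scale


def dequantise(q: List[int], scale: float) -> List[float]:
    return [v * scale for v in q]


def max_abs_error(weights: List[float], n_bits: int) -> float:
    q, scale = quantise_symmetric(weights, n_bits)
    return max(abs(w - r) for w, r in zip(weights, dequantise(q, scale)))


weights = [0.12, -0.5, 0.33, 0.07, -0.91]   # illustrative values only
err8 = max_abs_error(weights, 8)
err4 = max_abs_error(weights, 4)
print(f"INT8 max error {err8:.4f}, INT4 max error {err4:.4f}")
```

Fewer bits mean a coarser grid and larger rounding error per weight, which is the source of the small quality drop traded for lower memory and faster weight reads.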
Decision:
Interactive chat with latency SLA → speculative decoding
High-concurrency API with throughput goal → quantisation
Both are complementary and can be combined
Interview Answer Template
"LLM inference at small batch size is memory-bandwidth-bound — the bottleneck is reading model weights from HBM, not compute. The main optimisation stack: KV cache (O(n²)→O(n) per token), FlashAttention (O(n) HBM memory), quantisation (faster weight reads, smaller cache), GQA (fewer KV heads), speculative decoding (2-4× latency reduction at small batch), and continuous batching (maximise GPU utilisation across concurrent requests). For a medical LLM serving API, I'd start with INT4 quantisation of the base model, add prefix caching for system prompts, fine-tune a smaller 7B model for the specific task, and use continuous batching to maximise throughput within the latency SLA."