Interview: How Do You Reduce LLM Inference Cost?

Q: What is the primary bottleneck in LLM inference?

At small batch sizes (1-8), LLM inference is memory-bandwidth-bound, not compute-bound: every decode step must stream all model weights from HBM to produce one token per sequence. On an A100 (80GB), HBM bandwidth is ~2TB/s, so reading the 14GB of fp16 weights for LLaMA 2 7B takes ~7ms per token (14GB ÷ 2TB/s), which caps single-stream decoding at roughly 140 tok/s. The GPU's tensor cores sit mostly idle while waiting on those reads.

At large batch sizes (64+, with the exact crossover depending on hardware and precision), inference becomes compute-bound: each weight read is reused for every sequence in the batch, so the tensor cores are fully utilised.
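
The 7ms figure is just weight bytes divided by memory bandwidth. A minimal sketch of that arithmetic, assuming decimal units (1GB = 10^9 bytes) and ignoring KV-cache reads and kernel overheads; the function names are illustrative, not from any library:

```python
def decode_latency_floor_ms(weight_gb: float, hbm_bandwidth_gb_s: float) -> float:
    """Lower bound on per-token decode time if every weight is read once per step."""
    return weight_gb / hbm_bandwidth_gb_s * 1000.0


def max_tokens_per_second(weight_gb: float, hbm_bandwidth_gb_s: float,
                          batch_size: int = 1) -> float:
    """Upper bound on tokens/s while memory-bound: one weight read serves the whole batch."""
    return batch_size * hbm_bandwidth_gb_s / weight_gb


# Figures from the text: 14 GB of fp16 weights, ~2 TB/s (2000 GB/s) of HBM bandwidth.
latency_ms = decode_latency_floor_ms(14, 2000)
single_stream_tps = max_tokens_per_second(14, 2000)
batched_tps = max_tokens_per_second(14, 2000, batch_size=8)

print(f"{latency_ms:.1f} ms/token floor")
print(f"{single_stream_tps:.0f} tok/s at batch 1, {batched_tps:.0f} tok/s at batch 8")
```

The batched bound scales linearly only while the workload stays memory-bound; once compute becomes the limit, adding sequences no longer comes for free.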


Q: What are the main techniques for reducing LLM inference latency?

1. KV cache: avoid recomputing K/V for past tokens; O(n²) → O(n) per token
2. FlashAttention: tiled, fused attention that needs O(n) rather than O(n²) HBM memory; 2-4× faster than naive attention
3. Speculative decoding: small model drafts k tokens, large model verifies in 1 pass; 2-4× latency reduction
4. Quantisation (INT8/INT4): weights read faster from HBM (less data); 1.5-3× speedup
5. GQA: fewer KV heads → smaller KV cache → less data read per token
6. Tensor parallelism: split each layer across multiple GPUs; near-linear speedup until communication overhead dominates
7. Continuous batching: no wasted compute on padding; maximises throughput

Each technique targets a different part of the pipeline.
The correct combination depends on the specific model, hardware, and SLA.
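
To make the KV cache and GQA items concrete, here is a rough sizing sketch. It assumes fp16 values and a 32-layer model with head dimension 128 (the shape commonly cited for LLaMA 2 7B); the 8-KV-head configuration is a hypothetical GQA variant used only for comparison:

```python
def kv_cache_bytes_per_token(n_layers: int, n_kv_heads: int, head_dim: int,
                             bytes_per_value: int = 2) -> int:
    """Bytes of K and V stored per token: two tensors (K and V) in every layer."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value


# Assumed model shape: 32 layers, head_dim 128, fp16 values.
mha_bytes = kv_cache_bytes_per_token(n_layers=32, n_kv_heads=32, head_dim=128)
gqa_bytes = kv_cache_bytes_per_token(n_layers=32, n_kv_heads=8, head_dim=128)  # hypothetical GQA

context_len = 4096
mha_gib = mha_bytes * context_len / 2**30
gqa_gib = gqa_bytes * context_len / 2**30

print(f"MHA: {mha_bytes} bytes/token, {mha_gib:.1f} GiB at {context_len} tokens")
print(f"GQA: {gqa_bytes} bytes/token, {gqa_gib:.1f} GiB at {context_len} tokens")
```

Under these assumptions, cutting KV heads from 32 to 8 shrinks the cache fourfold, which both frees memory for larger batches and reduces the bytes read per decode step.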

Q: How do you choose the right batch size for LLM serving?

Factors:
  Target TTFT (time-to-first-token): lower batch → lower TTFT
  Target token generation rate: depends on SLA (e.g., 30 tok/s per user)
  GPU memory: model weights + (KV cache per sequence × batch_size) must fit
  Throughput target: higher batch → higher total throughput

Strategy:
  For interactive applications (chat):
    Maximise batch size subject to the latency SLA (e.g., p95 TTFT < 2 seconds)
    Use continuous batching to avoid wasting GPU time

  For batch processing (offline generation, embedding):
    Maximise batch size to maximise throughput
    Latency is less important; fill the GPU completely

  Empirically:
    Profile at batch sizes 1, 4, 8, 16, 32... until GPU OOM
    Plot throughput (tokens/second) vs latency per request
    Pick operating point matching the SLA
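
The empirical procedure above can be sketched as a small selection step over profiling results. The measurements below are made-up placeholders, and `ProfilePoint` and `pick_operating_point` are hypothetical names; in practice the numbers would come from a load test against your own deployment:

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class ProfilePoint:
    batch_size: int
    p95_ttft_s: float        # measured p95 time-to-first-token, in seconds
    throughput_tok_s: float  # measured total tokens/second


def pick_operating_point(points: List[ProfilePoint],
                         max_p95_ttft_s: float) -> Optional[ProfilePoint]:
    """Return the highest-throughput point whose p95 TTFT is under the SLA, if any."""
    feasible = [p for p in points if p.p95_ttft_s < max_p95_ttft_s]
    if not feasible:
        return None
    return max(feasible, key=lambda p: p.throughput_tok_s)


# Placeholder measurements for illustration only.
profile = [
    ProfilePoint(1, 0.15, 140.0),
    ProfilePoint(4, 0.30, 520.0),
    ProfilePoint(8, 0.55, 960.0),
    ProfilePoint(16, 1.10, 1650.0),
    ProfilePoint(32, 2.40, 2500.0),
]

chosen = pick_operating_point(profile, max_p95_ttft_s=2.0)
print(chosen)
```

With these placeholder numbers, batch size 32 has the best throughput but misses the 2-second SLA, so the selection falls back to batch size 16.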

Q: What is the trade-off between model size and serving cost?

Larger model (e.g., 70B vs 7B):
  Better quality
  Higher memory (140GB vs 14GB at fp16)
  Lower throughput (roughly inversely proportional to parameter count)
  More expensive per token

Smaller model with more tuning:
  LLaMA 2 7B fine-tuned on domain data can match or outperform a 70B base model on that domain's tasks
  Roughly 10× cheaper to serve
  Fits on 1 GPU instead of 2-4

Practical recommendation:
  Start with the smallest model that meets quality threshold
  Add domain fine-tuning before moving to a larger model
  Quantise to INT4 before buying more GPUs
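
The memory figures above follow from parameter count times bytes per parameter. A minimal sketch, assuming 80GB GPUs and counting weights only (KV cache, activations, and quantisation scale factors all need extra headroom, so real GPU counts are often higher); the helper names are illustrative:

```python
import math

# Bytes per parameter for each precision (weights only).
BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(params_billions: float, precision: str) -> float:
    """Weight footprint in GB (decimal): billions of params x bytes per param."""
    return params_billions * BYTES_PER_PARAM[precision]


def min_gpus_for_weights(params_billions: float, precision: str,
                         gpu_memory_gb: float = 80.0) -> int:
    """Smallest GPU count whose combined memory holds the weights alone."""
    return math.ceil(weight_memory_gb(params_billions, precision) / gpu_memory_gb)


for params, precision in [(70, "fp16"), (70, "int4"), (7, "fp16")]:
    mem = weight_memory_gb(params, precision)
    gpus = min_gpus_for_weights(params, precision)
    print(f"{params}B @ {precision}: {mem:.0f} GB of weights, at least {gpus} GPU(s)")
```

This is why the recommendation orders the options the way it does: moving from 70B fp16 to either INT4 or a 7B model brings the weights within a single 80GB card.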

Q: How does tensor parallelism work?

Tensor parallelism splits attention heads and FFN dimensions across multiple GPUs:

For 8 GPUs and 32 attention heads:
  GPU 0: heads 0-3
  GPU 1: heads 4-7
  ...
  GPU 7: heads 28-31

Each GPU computes its share of the heads and FFN columns in parallel.
An AllReduce operation sums the partial results after each attention block and each FFN block.

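Communication overhead: each AllReduce moves roughly 2× the activation tensor size per GPU (ring AllReduce), and there are two AllReduces per layer
  For d_model=4096 across 8 GPUs, assuming 4,096 tokens in flight at fp16: 4,096 × 4,096 × 2 bytes ≈ 33.5MB per tensor, so ~134MB of traffic per layer per step
  Fast NVLink interconnects (600 GB/s on A100) make this manageable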

Efficiency: ~75-90% GPU utilisation at 4-8 GPUs; degrades beyond
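
The head assignment and the role of AllReduce can be sketched in plain Python. This is a single-process simulation under simplifying assumptions (tiny integer matrices, a sum standing in for the collective); the function names are hypothetical and no real communication library is involved:

```python
from typing import List


def heads_for_rank(rank: int, n_heads: int, world_size: int) -> range:
    """Contiguous block of attention heads owned by one GPU."""
    if n_heads % world_size != 0:
        raise ValueError("n_heads must be divisible by world_size")
    per_rank = n_heads // world_size
    return range(rank * per_rank, (rank + 1) * per_rank)


def matvec(matrix: List[List[int]], vector: List[int]) -> List[int]:
    """Reference result: the full matrix-vector product on one device."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def input_sharded_matvec(matrix: List[List[int]], vector: List[int],
                         world_size: int) -> List[int]:
    """Each rank multiplies its slice of the input dimension; summing the
    partial outputs element-wise plays the role of the AllReduce."""
    shard = len(vector) // world_size
    partials = []
    for rank in range(world_size):
        lo, hi = rank * shard, (rank + 1) * shard
        partials.append([sum(row[j] * vector[j] for j in range(lo, hi))
                         for row in matrix])
    return [sum(values) for values in zip(*partials)]  # simulated AllReduce(sum)


first = list(heads_for_rank(0, n_heads=32, world_size=8))
last = list(heads_for_rank(7, n_heads=32, world_size=8))
print(first, last)

weights = [[1, 2, 3, 4], [5, 6, 7, 8]]  # 2 outputs x 4 inputs
x = [1, 1, 2, 2]
full_result = matvec(weights, x)
sharded_result = input_sharded_matvec(weights, x, world_size=2)
print(full_result, sharded_result)
```

The sharded result only matches the single-device result after the partial outputs are summed, which is why every sharded block has to end with a collective.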

Q: How would you reduce the cost of serving a medical LLM API?

1. Quantise to INT4 (GPTQ/AWQ):
   70B model: 140GB at fp16 → ~35GB at INT4
   From 2-3 A100s (80GB) to 1 A100
   Quality drop: ~1-2% on downstream tasks

2. Use a smaller fine-tuned model:
   7B clinical fine-tune may match 70B on medical NER/coding tasks
   ~10× less weight memory and compute per token; fits on 1 GPU

3. Prefix caching:
   Medical system prompts are often identical across requests
   Cache KV states for the system prompt prefix
   Eliminates prefill of the shared prefix on every request, cutting TTFT and compute

4. Speculative decoding:
   Use a 1B-parameter draft model
   2-4× latency improvement on routine queries

5. Continuous batching:
   Maximise batch size within latency SLA
   Fill GPU compute between user requests

6. Request routing:
   Route simple queries (vitals extraction) to smaller model
   Route complex reasoning to larger model
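
As an illustration of item 3, prefix caching can be sketched as a lookup keyed on the system prompt. This is a toy model of the idea, not how any particular serving framework implements it (production systems typically match on token-block prefixes rather than a whole-string hash); `PrefixCache` and `fake_prefill` are hypothetical names:

```python
import hashlib
from typing import Any, Callable, Dict, List


class PrefixCache:
    """Caches the result of prefilling a shared prompt prefix."""

    def __init__(self, prefill_fn: Callable[[str], Any]) -> None:
        self._prefill_fn = prefill_fn
        self._store: Dict[str, Any] = {}
        self.hits = 0
        self.misses = 0

    def get_prefix_state(self, system_prompt: str) -> Any:
        key = hashlib.sha256(system_prompt.encode("utf-8")).hexdigest()
        if key in self._store:
            self.hits += 1
        else:
            self.misses += 1
            self._store[key] = self._prefill_fn(system_prompt)  # expensive step
        return self._store[key]


prefill_calls: List[str] = []


def fake_prefill(prompt: str) -> Dict[str, int]:
    # Hypothetical stand-in for running the model over the prompt
    # and returning its KV states.
    prefill_calls.append(prompt)
    return {"n_tokens": len(prompt.split())}


cache = PrefixCache(fake_prefill)
system_prompt = "You are a clinical assistant. Answer concisely."
for _ in range(3):
    state = cache.get_prefix_state(system_prompt)

print(f"hits={cache.hits} misses={cache.misses} prefill_calls={len(prefill_calls)}")
```

Three requests sharing the same system prompt trigger one prefill; the other two reuse the cached state and only need to process their own user turn.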

Q: When would you choose speculative decoding over quantisation?

Speculative decoding:
  Reduces LATENCY for a single user (batch_size=1 or small)
  No quality degradation: with the standard accept/reject verification, the output distribution is identical to the target model's
  Requires a compatible draft model in memory (extra cost)
  Works best when batch_size is small (memory-bound regime)

Quantisation:
  Reduces MEMORY and cost for all batch sizes
  Small quality degradation (~1-2% for INT4)
  Benefits throughput at large batch sizes too
  No additional model needed

Decision:
  Interactive chat with latency SLA → speculative decoding
  High-concurrency API with throughput goal → quantisation
  Both are complementary and can be combined
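
The draft-then-verify loop can be sketched for the greedy case, where "identical output" is easy to check. This is a simplified illustration: the toy models are plain functions over token ids, verification is written as a loop where a real system would use one batched forward pass, and sampling-based decoding would need a probabilistic accept/reject rule instead of exact matching. All names are hypothetical:

```python
from typing import Callable, List

NextToken = Callable[[List[int]], int]  # greedy next-token function: context -> token id


def speculative_step(target: NextToken, draft: NextToken,
                     context: List[int], k: int) -> List[int]:
    """One round of greedy speculative decoding; returns the tokens accepted."""
    # 1. The draft model proposes k tokens autoregressively (cheap).
    proposed: List[int] = []
    for _ in range(k):
        proposed.append(draft(context + proposed))
    # 2. The target model checks each proposed position in order.
    accepted: List[int] = []
    for token in proposed:
        expected = target(context + accepted)
        if token != expected:
            accepted.append(expected)  # first mismatch: keep the target's token, stop
            return accepted
        accepted.append(token)
    # 3. Every draft token matched: the target also supplies one extra token.
    accepted.append(target(context + accepted))
    return accepted


def speculative_generate(target: NextToken, draft: NextToken,
                         prompt: List[int], n_new: int, k: int = 4) -> List[int]:
    out = list(prompt)
    while len(out) - len(prompt) < n_new:
        out.extend(speculative_step(target, draft, out, k))
    return out[: len(prompt) + n_new]


def greedy_generate(target: NextToken, prompt: List[int], n_new: int) -> List[int]:
    out = list(prompt)
    for _ in range(n_new):
        out.append(target(out))
    return out


def target_model(ctx: List[int]) -> int:
    return (ctx[-1] + 1) % 10  # toy target: counts upward modulo 10


def draft_model(ctx: List[int]) -> int:
    nxt = (ctx[-1] + 1) % 10
    return 0 if nxt == 7 else nxt  # toy draft: wrong whenever the answer is 7


first_round = speculative_step(target_model, draft_model, [3], k=4)
spec_out = speculative_generate(target_model, draft_model, [3], n_new=12)
ref_out = greedy_generate(target_model, [3], n_new=12)
print(first_round, spec_out == ref_out)
```

Even when the draft is wrong, each round still yields at least one correct token from the target, so the output matches plain greedy decoding; the speedup depends entirely on how often the draft agrees.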

Interview Answer Template

"LLM inference at small batch size is memory-bandwidth-bound — the bottleneck is reading model weights from HBM, not compute. The main optimisation stack: KV cache (O(n²)→O(n) per token), FlashAttention (O(n) HBM memory), quantisation (faster weight reads, smaller cache), GQA (fewer KV heads), speculative decoding (2-4× latency reduction at small batch), and continuous batching (maximise GPU utilisation across concurrent requests). For a medical LLM serving API, I'd start with INT4 quantisation of the base model, add prefix caching for system prompts, fine-tune a smaller 7B model for the specific task, and use continuous batching to maximise throughput within the latency SLA."