
Interview Q&A: LLM Inference Optimisation

Common interview questions on making LLM inference faster and cheaper — quantisation, KV cache, speculative decoding, batching, and production serving trade-offs.

Asma Hafeez Khan · May 16, 2026 · 5 min read
LLMs · Inference · Optimisation · Production · Interview

Q: What is the primary bottleneck in LLM inference?

At small batch sizes (1-8), LLM inference is memory-bandwidth-bound, not compute-bound. The GPU must read all model weights from HBM for every generated token. On an A100, HBM bandwidth is ~2TB/s, so reading the 14GB of fp16 weights for LLaMA 2 7B takes at least ~7ms per token (14GB ÷ 2TB/s). The GPU's tensor cores sit mostly idle while waiting on those reads.

At large batch sizes (64+), inference shifts toward being compute-bound: each weight byte read from HBM is reused for every sequence in the batch, so the tensor cores stay busy. The exact crossover point depends on the hardware and the model.
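The arithmetic behind these two regimes can be sketched as a back-of-the-envelope roofline estimate. This is a rough model, not a benchmark: the 312 TFLOPS peak is an assumed fp16 tensor-core figure for an A100 that does not come from the article, and the function name and parameters are illustrative.

```python
def decode_step_estimates(params_billion, bytes_per_param, hbm_tb_per_s, peak_tflops):
    """Rough roofline numbers for one decode step of a dense transformer."""
    weight_bytes = params_billion * 1e9 * bytes_per_param
    # Memory-bound floor: every weight byte is read once per decode step.
    mem_time_ms = weight_bytes / (hbm_tb_per_s * 1e12) * 1e3
    # About 2 FLOPs (multiply + add) per parameter per generated token.
    flops_per_token = 2 * params_billion * 1e9
    compute_ms_per_seq = flops_per_token / (peak_tflops * 1e12) * 1e3
    # Batch size at which total compute time catches up with weight-read time.
    crossover_batch = mem_time_ms / compute_ms_per_seq
    return mem_time_ms, compute_ms_per_seq, crossover_batch


# LLaMA 2 7B in fp16 on an A100 (peak_tflops=312 is an assumption).
mem_ms, compute_ms, crossover = decode_step_estimates(7, 2, 2.0, 312)
print(f"weight read: {mem_ms:.1f} ms per step")
print(f"compute: {compute_ms:.3f} ms per sequence in the batch")
print(f"idealised crossover batch size: ~{crossover:.0f}")
```

The idealised crossover comes out above 64 because it assumes peak FLOPS are actually reached. Real kernels usually achieve less, and that pulls the practical crossover down.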


Q: What are the main techniques for reducing LLM inference latency?

1. KV cache: avoid recomputing K/V for past tokens; attention cost per new token drops from O(n²) to O(n)
2. FlashAttention: O(n) HBM memory instead of O(n²), 2-4× faster than naive attention
3. Speculative decoding: small model drafts k tokens, large model verifies them in 1 pass; 2-4× latency reduction when most drafted tokens are accepted
4. Quantisation (INT8/INT4): less weight data to read from HBM; 1.5-3× speedup
5. GQA: fewer KV heads → smaller KV cache → less data read per token
6. Tensor parallelism: split model across multiple GPUs; near-linear speedup until communication overhead dominates
7. Continuous batching: finished sequences are replaced immediately instead of padding out the batch; maximises throughput

Each technique targets a different part of the pipeline.
The correct combination depends on the specific model, hardware, and SLA.
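To make the KV cache and GQA points concrete, here is a small sizing sketch. The 32-layer, 32-head, head_dim=128 shape corresponds to LLaMA 2 7B; the 8-KV-head variant is a hypothetical GQA configuration of the same shape, used only for comparison.

```python
def kv_cache_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    # Factor 2: one K vector and one V vector per layer and KV head.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value


# Plain multi-head attention: every query head has its own K/V head.
mha = kv_cache_bytes_per_token(n_layers=32, n_kv_heads=32, head_dim=128)
# Hypothetical GQA variant: 8 KV heads shared by the 32 query heads.
gqa = kv_cache_bytes_per_token(n_layers=32, n_kv_heads=8, head_dim=128)

context = 4096
print(f"MHA: {mha * context / 2**30:.2f} GiB per {context}-token sequence")
print(f"GQA: {gqa * context / 2**30:.2f} GiB per {context}-token sequence")
```

Under these assumptions the cache shrinks in proportion to the number of KV heads. That reduces the bytes read per decode step and leaves room for larger batches.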

Q: How do you choose the right batch size for LLM serving?

Factors:
  Target TTFT (time-to-first-token): lower batch → lower TTFT
  Target token generation rate: depends on SLA (e.g., 30 tok/s per user)
  GPU memory: model weights + KV cache × batch_size must fit
  Throughput target: higher batch → higher total throughput

Strategy:
  For interactive applications (chat):
    Maximise batch size such that p95 TTFT < 2 seconds
    Use continuous batching to not waste GPU time
    
  For batch processing (offline generation, embedding):
    Maximise batch size to maximise throughput
    Latency is less important; fill the GPU completely

  Empirically:
    Profile at batch sizes 1, 4, 8, 16, 32... until GPU OOM
    Plot throughput (tokens/second) vs latency per request
    Pick operating point matching the SLA
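The empirical procedure above could be automated along these lines. This is a minimal sketch: `ProfilePoint` and `pick_operating_point` are illustrative names, and the measurements are made up for the example rather than taken from a real benchmark.

```python
from dataclasses import dataclass


@dataclass
class ProfilePoint:
    batch_size: int
    p95_ttft_s: float    # measured 95th-percentile time-to-first-token
    tokens_per_s: float  # measured aggregate throughput


def pick_operating_point(points, max_p95_ttft_s):
    """Return the highest-throughput point that still meets the TTFT SLA."""
    within_sla = [p for p in points if p.p95_ttft_s <= max_p95_ttft_s]
    if not within_sla:
        return None  # no profiled batch size satisfies the SLA
    return max(within_sla, key=lambda p: p.tokens_per_s)


# Made-up profiling results, for illustration only.
profile = [
    ProfilePoint(1, 0.15, 140.0),
    ProfilePoint(4, 0.30, 520.0),
    ProfilePoint(8, 0.55, 980.0),
    ProfilePoint(16, 1.10, 1750.0),
    ProfilePoint(32, 2.40, 2900.0),
]

best = pick_operating_point(profile, max_p95_ttft_s=2.0)
print(best)
```

With a 2-second p95 TTFT target, batch size 32 is rejected despite its higher throughput, and the sketch settles on batch size 16.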

Q: What is the trade-off between model size and serving cost?

Larger model (e.g., 70B vs 7B):
  Better quality
  Higher memory (140GB vs 14GB at fp16)
  Lower throughput (roughly inversely proportional to param count)
  More expensive per token

Smaller model with more tuning:
  LLaMA 2 7B fine-tuned on domain data can match or outperform the 70B base model on narrow tasks in that domain
  ~10× cheaper to serve
  Fits on 1 GPU instead of 2-4

Practical recommendation:
  Start with the smallest model that meets quality threshold
  Add domain fine-tuning before moving to a larger model
  Quantise to INT4 before buying more GPUs
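The memory figures in this answer follow from parameter count times bytes per parameter. A quick sketch of that calculation, assuming 80GB GPUs (an assumption, not stated in the article) and ignoring KV cache, activations, and framework overhead:

```python
import math

BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(params_billion, dtype):
    # 1e9 params * bytes per param / 1e9 bytes per GB
    return params_billion * BYTES_PER_PARAM[dtype]


def min_gpus_for_weights(params_billion, dtype, gpu_memory_gb=80.0):
    # Lower bound only: real deployments also need room for the KV cache.
    return math.ceil(weight_memory_gb(params_billion, dtype) / gpu_memory_gb)


for size, dtype in [(7, "fp16"), (70, "fp16"), (70, "int4")]:
    gb = weight_memory_gb(size, dtype)
    gpus = min_gpus_for_weights(size, dtype)
    print(f"{size}B {dtype}: {gb:.0f} GB weights, at least {gpus} GPU(s)")
```

Treat the GPU counts as a floor. Once KV cache for realistic batch sizes is added, a deployment often needs more headroom than the weights alone suggest.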

Q: How does tensor parallelism work?

Tensor parallelism splits attention heads and FFN dimensions across multiple GPUs:

For 8 GPUs and 32 attention heads:
  GPU 0: heads 0-3
  GPU 1: heads 4-7
  ...
  GPU 7: heads 28-31

Each GPU processes its portion of the computation.
An AllReduce operation combines results after each attention/FFN layer.

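Communication overhead: each AllReduce moves roughly 2 × tensor_size per GPU, and there are two per layer (attention and FFN)
  For d_model=4096 across 8 GPUs: ~134MB per layer per step (this figure matches 4096 tokens in flight at fp16; a single-token decode step moves far less)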
  Fast NVLink interconnects (600 GB/s) make this manageable

Efficiency: ~75-90% GPU utilisation at 4-8 GPUs; degrades beyond
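The head assignment and the communication estimate can be reproduced with a short sketch. The helper names are illustrative. The communication parameters (4096 tokens in flight, fp16 activations, two all-reduces per layer, a traffic factor of 2) are assumptions chosen because they reproduce the ~134MB figure above.

```python
def partition_heads(n_heads, n_gpus):
    """Assign contiguous, equally sized blocks of attention heads to GPUs."""
    if n_heads % n_gpus != 0:
        raise ValueError("n_heads must be divisible by n_gpus")
    per_gpu = n_heads // n_gpus
    return {gpu: range(gpu * per_gpu, (gpu + 1) * per_gpu) for gpu in range(n_gpus)}


def allreduce_bytes_per_layer(tokens, d_model, bytes_per_value=2,
                              allreduces_per_layer=2, traffic_factor=2.0):
    # An activation tensor of shape (tokens, d_model) is reduced once after
    # the attention block and once after the FFN block.
    tensor_bytes = tokens * d_model * bytes_per_value
    return allreduces_per_layer * traffic_factor * tensor_bytes


parts = partition_heads(n_heads=32, n_gpus=8)
for gpu in (0, 1, 7):
    heads = parts[gpu]
    print(f"GPU {gpu}: heads {heads[0]}-{heads[-1]}")

prefill = allreduce_bytes_per_layer(tokens=4096, d_model=4096)
decode = allreduce_bytes_per_layer(tokens=1, d_model=4096)
print(f"4096 tokens: {prefill / 1e6:.0f} MB per layer")
print(f"1 token: {decode / 1e3:.1f} kB per layer")
```

The gap between the two cases suggests that during decoding the per-step all-reduce is limited more by interconnect latency than by bandwidth.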

Q: How would you reduce the cost of serving a medical LLM API?

1. Quantise to INT4 (GPTQ/AWQ):
   140GB 70B model → 35GB
   From 2-3 A100s to 1 A100
   Quality drop: ~1-2% on downstream tasks

2. Use a smaller fine-tuned model:
   7B clinical fine-tune may match 70B on medical NER/coding tasks
   10× fewer GPUs needed

3. Prefix caching:
   Medical system prompts are often identical across requests
   Cache KV states for the system prompt prefix
   Eliminates re-processing on every request

4. Speculative decoding:
   Use a 1B-parameter draft model
   2-4× latency improvement on routine queries

5. Continuous batching:
   Maximise batch size within latency SLA
   Fill GPU compute between user requests

6. Request routing:
   Route simple queries (vitals extraction) to smaller model
   Route complex reasoning to larger model
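Of the items above, prefix caching is the easiest to illustrate in isolation. The following is a toy sketch only: `PrefixCache` and `fake_prefill` are hypothetical stand-ins, and production servers manage cached KV blocks in GPU memory rather than in a Python dict.

```python
import hashlib


class PrefixCache:
    """Toy cache mapping a prompt prefix to its precomputed KV state."""

    def __init__(self):
        self._store = {}
        self.hits = 0
        self.misses = 0

    def get_or_compute(self, prefix_tokens, compute_kv):
        key = hashlib.sha256(repr(tuple(prefix_tokens)).encode()).hexdigest()
        if key in self._store:
            self.hits += 1
        else:
            self.misses += 1
            # Expensive prefill happens only on the first request.
            self._store[key] = compute_kv(prefix_tokens)
        return self._store[key]


def fake_prefill(tokens):
    # Stand-in for a real prefill pass; returns a placeholder "KV state".
    return {"n_cached_tokens": len(tokens)}


cache = PrefixCache()
system_prompt = [101, 7, 42, 9]  # hypothetical token ids
for user_tokens in ([5, 6], [8], [3, 3, 3]):
    kv = cache.get_or_compute(system_prompt, fake_prefill)
    # Only user_tokens would still need a prefill pass on top of kv.

print(f"hits={cache.hits} misses={cache.misses}")
```

Three requests that share a system prompt trigger one prefill for the prefix. The saving grows with the length of the shared prompt.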

Q: When would you choose speculative decoding over quantisation?

Speculative decoding:
  Reduces LATENCY for a single user (batch_size=1 or small)
  No quality degradation: with exact verification, the output distribution is identical to the large model's
  Requires a compatible draft model in memory (extra cost)
  Works best when batch_size is small (memory-bound regime)

Quantisation:
  Reduces MEMORY and cost for all batch sizes
  Small quality degradation (~1-2% for INT4)
  Benefits throughput at large batch sizes too
  No additional model needed

Decision:
  Interactive chat with latency SLA → speculative decoding
  High-concurrency API with throughput goal → quantisation
  Both are complementary and can be combined
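The "no quality degradation" property is easiest to see in the greedy case. Below is a toy sketch of greedy speculative decoding, not the sampling-based scheme (which uses rejection sampling to preserve the output distribution). The `target` and `draft` functions are hypothetical stand-ins for real models, and the single batched verification pass is simulated with per-position calls.

```python
def greedy_decode(model, prompt, n_new):
    seq = list(prompt)
    for _ in range(n_new):
        seq.append(model(seq))
    return seq


def speculative_decode(target, draft, prompt, n_new, k=4):
    """Greedy speculative decoding; returns (sequence, target_passes)."""
    seq = list(prompt)
    goal = len(prompt) + n_new
    passes = 0
    while len(seq) < goal:
        # 1. The draft model proposes k tokens autoregressively (cheap).
        proposal = []
        for _ in range(k):
            proposal.append(draft(seq + proposal))
        # 2. The target checks the proposal. A real system scores all
        #    positions in one forward pass; here it is simulated per position.
        passes += 1
        accepted = []
        for i in range(k + 1):
            expected = target(seq + accepted)
            accepted.append(expected)  # always the target's own choice
            if i == k or proposal[i] != expected:
                break  # first mismatch, or bonus token after k matches
        seq.extend(accepted)
    return seq[:goal], passes


def target(seq):
    return (3 * seq[-1] + 1) % 10


def draft(seq):
    # Agrees with the target except when the last token is a multiple of 4.
    guess = (3 * seq[-1] + 1) % 10
    return guess if seq[-1] % 4 else (guess + 1) % 10


reference = greedy_decode(target, [1], 12)
spec, passes = speculative_decode(target, draft, [1], 12, k=4)
print(spec == reference, passes)
```

Every emitted token is the one the target would have chosen, so the output matches plain greedy decoding. The saving comes from needing fewer target passes than generated tokens.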

Interview Answer Template

"LLM inference at small batch size is memory-bandwidth-bound: the bottleneck is reading model weights from HBM, not compute. The main optimisation stack: KV cache (O(n²)→O(n) per token), FlashAttention (O(n) HBM memory), quantisation (faster weight reads, smaller cache), GQA (fewer KV heads), speculative decoding (2-4× latency reduction at small batch), and continuous batching (maximise GPU utilisation across concurrent requests). For a medical LLM serving API, I'd start with INT4 quantisation of the base model, add prefix caching for system prompts, fine-tune a smaller 7B model for the specific task, and use continuous batching to maximise throughput within the latency SLA."
