LLMs Deep Dive · Lesson 16 of 24
vLLM and TensorRT-LLM: Production Inference
The LLM Serving Problem
Serving LLMs in production requires:
1. High throughput: process many requests per second
2. Low latency: first token quickly, fast generation rate
3. GPU efficiency: maximise hardware utilisation
4. Multi-model: serve different models from the same hardware pool
5. Reliability: handle OOM gracefully, autoscale, health checks
Two frameworks dominate open-source LLM serving: vLLM and TensorRT-LLM.
vLLM
vLLM (UC Berkeley, 2023) is a Python-first serving framework focused on throughput via paged attention:
from vllm import LLM, SamplingParams
# Initialise model (downloads weights, builds engine)
llm = LLM(
    model="meta-llama/Llama-2-7b-chat-hf",
    tensor_parallel_size=1,        # GPUs for tensor parallelism
    gpu_memory_utilization=0.9,    # fraction of GPU memory vLLM may use (weights, activations, KV cache)
    max_num_batched_tokens=8192,   # upper bound on tokens processed per scheduler iteration
    dtype="float16",
)
sampling_params = SamplingParams(
    temperature=0.7,
    top_p=0.9,
    max_tokens=512,
)
# Batch inference
prompts = ["What is Warfarin?", "Explain atrial fibrillation."]
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(output.outputs[0].text)
vLLM as HTTP server:
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-2-7b-chat-hf \
    --port 8000 \
    --tensor-parallel-size 2
This exposes an OpenAI-compatible API.
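As a rough sketch of what "OpenAI-compatible" means for a client, the snippet below builds a request for the `/v1/completions` route using only the Python standard library. The host and port are assumptions matching the command above, and `build_completion_request` and `complete` are hypothetical helper names, not part of vLLM. `complete` only works while the server is running.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # assumption: server started with --port 8000


def build_completion_request(prompt, model="meta-llama/Llama-2-7b-chat-hf",
                             max_tokens=128, temperature=0.7):
    """Build (without sending) a POST request for the /v1/completions route."""
    payload = {
        "model": model,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        url=f"{BASE_URL}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt):
    """Send the request and return the generated text (needs a running server)."""
    request = build_completion_request(prompt)
    with urllib.request.urlopen(request, timeout=60) as response:
        body = json.load(response)
    return body["choices"][0]["text"]


# Inspect the request without touching the network.
request = build_completion_request("What is Warfarin?")
print(request.full_url)
print(request.data.decode("utf-8"))
```

Because the route and payload follow the OpenAI schema, existing OpenAI client libraries can usually be pointed at the same base URL instead of hand-building requests.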
vLLM Key Features
Paged attention:
Physical KV cache blocks (default 16 tokens/block)
Block table per sequence: logical → physical mapping
Eliminates fragmentation, enables high concurrency
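The block-table idea can be illustrated with a toy allocator. This is a simplified sketch, not vLLM's implementation: `BlockAllocator` and `Sequence` are hypothetical names, and real blocks hold key/value tensors rather than bare ids.

```python
BLOCK_SIZE = 16  # tokens per physical block (the default quoted above)


class BlockAllocator:
    """Hands out physical block ids from a fixed-size pool."""

    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))

    def allocate(self):
        if not self.free:
            raise MemoryError("KV cache exhausted")
        return self.free.pop()

    def release(self, block_id):
        self.free.append(block_id)


class Sequence:
    """Tracks one request's logical -> physical block mapping."""

    def __init__(self, allocator):
        self.allocator = allocator
        self.block_table = []  # index = logical block, value = physical block id
        self.num_tokens = 0

    def append_token(self):
        # A new physical block is needed only when the last one is full.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.block_table.append(self.allocator.allocate())
        self.num_tokens += 1

    def physical_slot(self, token_index):
        """Translate a token position into (physical block id, offset)."""
        block = self.block_table[token_index // BLOCK_SIZE]
        return block, token_index % BLOCK_SIZE

    def free(self):
        """Return all blocks to the pool when the request completes."""
        for block_id in self.block_table:
            self.allocator.release(block_id)
        self.block_table = []
        self.num_tokens = 0


allocator = BlockAllocator(num_blocks=8)
seq_a, seq_b = Sequence(allocator), Sequence(allocator)
for _ in range(20):
    seq_a.append_token()  # 20 tokens need 2 blocks
for _ in range(5):
    seq_b.append_token()  # 5 tokens need 1 block
print(seq_a.block_table, seq_b.block_table, len(allocator.free))
```

Because blocks are allocated on demand and need not be contiguous, at most one partially filled block per sequence is wasted, which is where the fragmentation savings come from.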
Continuous batching:
Iteration-level scheduling
Swap in/out requests as they arrive/complete
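To see why iteration-level scheduling helps, the following toy simulation counts decode steps for the same workload under static and continuous batching. It is a sketch under simplifying assumptions (one token per request per step, no prefill cost, every request generates at least one token), and the function names are hypothetical.

```python
from collections import deque


def static_batching_steps(lengths, batch_size):
    """Each batch runs until its longest request finishes."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])
    return steps


def continuous_batching_steps(lengths, batch_size):
    """After every decode iteration, finished requests leave and waiting ones join."""
    waiting = deque(lengths)
    running = []  # remaining tokens for each in-flight request
    steps = 0
    while waiting or running:
        # Fill free slots before the next iteration.
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())
        # One decode iteration produces one token per running request.
        running = [remaining - 1 for remaining in running]
        running = [remaining for remaining in running if remaining > 0]
        steps += 1
    return steps


lengths = [2, 8, 2, 8, 2, 8]  # hypothetical output lengths in tokens
print("static:    ", static_batching_steps(lengths, batch_size=2))
print("continuous:", continuous_batching_steps(lengths, batch_size=2))
```

With static batching, each short request holds its slot idle until the long request in the same batch finishes; continuous batching refills that slot immediately, so the same work completes in fewer steps.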
Quantisation support:
AWQ, GPTQ, GGUF, and bitsandbytes, plus FP8 on Hopper-class GPUs
Multi-LoRA serving:
Multiple LoRA adapters simultaneously without separate GPU instances
Dynamic loading of adapters per request
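The arithmetic behind multi-LoRA serving can be sketched in plain Python: every request shares the base weight W, and an optional low-rank pair (B, A) adds a per-request correction, y = Wx + B(Ax). The adapter names and matrices below are made up for illustration, and this toy does not reflect how vLLM batches adapter kernels internally.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


BASE_W = [[1.0, 0.0], [0.0, 1.0]]  # shared base weight (identity, for illustration)

# Hypothetical rank-1 adapters: B is 2x1, A is 1x2.
ADAPTERS = {
    "cardiology": ([[1.0], [0.0]], [[0.0, 1.0]]),
    "pharmacy": ([[0.0], [1.0]], [[1.0, 0.0]]),
}


def forward(x, adapter_name=None):
    """Base projection plus the low-rank update of the requested adapter, if any."""
    y = matvec(BASE_W, x)
    if adapter_name is not None:
        B, A = ADAPTERS[adapter_name]
        delta = matvec(B, matvec(A, x))  # B(Ax): two small matmuls, not one large one
        y = [yi + di for yi, di in zip(y, delta)]
    return y


# One batch, three requests, each selecting a different adapter (or none).
batch = [([1.0, 2.0], "cardiology"), ([1.0, 2.0], "pharmacy"), ([1.0, 2.0], None)]
outputs = [forward(x, name) for x, name in batch]
print(outputs)
```

Since only the small (B, A) pairs differ per request, one copy of the base weights in GPU memory can serve every adapter.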
Prefix caching:
Cache KV states for common prompt prefixes (system prompts, few-shot examples)
Identical prefixes across requests reuse the same physical pages
TensorRT-LLM
TensorRT-LLM (NVIDIA) optimises at the CUDA kernel level for maximum throughput on NVIDIA hardware:
import tensorrt_llm
from tensorrt_llm import BuildConfig, Mapping
from tensorrt_llm.models import LLaMAForCausalLM
from tensorrt_llm.runtime import ModelRunner

# NOTE: the TensorRT-LLM Python API changes between releases;
# check these names against the version you have installed.
# Load the HuggingFace checkpoint into a TensorRT-LLM model definition
llama_model = LLaMAForCausalLM.from_hugging_face(
    "meta-llama/Llama-2-7b-hf",
    dtype="float16",
    mapping=Mapping(world_size=1, tp_size=1),
)
# Build step is offline: done once, engine saved to disk
build_config = BuildConfig(max_batch_size=8, max_input_len=2048)  # illustrative limits
engine = tensorrt_llm.build(llama_model, build_config)
engine.save("llama-2-7b-engine")  # hypothetical output directory
# Inference from the compiled engine:
runner = ModelRunner.from_dir(engine_dir="llama-2-7b-engine")
TensorRT-LLM Key Features
Kernel-level optimisation:
Custom CUDA kernels fused from multiple operations
Flash attention, RMSNorm, rotary embedding all fused
Optimised for Hopper (H100) and Ada (RTX 4090) architectures
INT8/FP8 quantisation:
First-class support for NVIDIA's native quantisation
FP8 on H100: near-FP16 quality, with up to roughly 2× throughput improvement depending on model and workload
SmoothQuant: weight + activation INT8 co-quantisation
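The core SmoothQuant transformation can be sketched in a few lines. Activation outliers are divided by a per-channel scale s_j = max|X_j|^α / max|W_j|^(1-α), and the matching weight rows are multiplied by it, so the product XW is unchanged while both operands become easier to quantise to INT8. The matrices are made-up values, the helper names are hypothetical, and the INT8 rounding step itself is omitted.

```python
def matmul(A, B):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def smoothing_scales(X, W, alpha=0.5):
    """Per-input-channel scale s_j = max|X[:, j]|**alpha / max|W[j, :]|**(1 - alpha)."""
    scales = []
    for j in range(len(W)):
        act_max = max(abs(row[j]) for row in X)
        w_max = max(abs(w) for w in W[j])
        scales.append(act_max ** alpha / w_max ** (1 - alpha))
    return scales


X = [[0.5, 40.0], [-1.0, 64.0]]  # activations; channel 1 carries outliers
W = [[2.0, -1.0], [0.25, 0.5]]   # weights; rows are input channels

s = smoothing_scales(X, W)
X_smooth = [[x / sj for x, sj in zip(row, s)] for row in X]  # X · diag(s)^-1
W_smooth = [[w * sj for w in row] for row, sj in zip(W, s)]  # diag(s) · W

print("largest |activation| before:", max(abs(x) for row in X for x in row))
print("largest |activation| after: ", max(abs(x) for row in X_smooth for x in row))
print("product preserved:", matmul(X_smooth, W_smooth), "vs", matmul(X, W))
```

With α = 0.5 the per-channel maxima of activations and weights end up equal, which splits the quantisation difficulty evenly between the two operands.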
In-flight batching:
Equivalent to continuous batching, NVIDIA's implementation
Multi-GPU parallelism:
Tensor parallelism: split attention heads and MLP weight matrices across GPUs
Pipeline parallelism: split layers across GPUs
Well-optimised NVLink communication
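Tensor parallelism can be illustrated without any GPUs. In this sketch a weight matrix is split column-wise into shards, each simulated "GPU" multiplies the same input by its own shard, and the partial outputs are concatenated (the role an all-gather plays over NVLink). The helper names are hypothetical and plain lists stand in for device tensors.

```python
def matmul(A, B):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def split_columns(W, parts):
    """Split W column-wise into `parts` equally sized shards."""
    cols = len(W[0])
    if cols % parts != 0:
        raise ValueError("column count must be divisible by the number of shards")
    width = cols // parts
    return [[row[p * width:(p + 1) * width] for row in W] for p in range(parts)]


X = [[1.0, 2.0]]                 # one token, hidden size 2
W = [[1.0, 2.0, 3.0, 4.0],
     [5.0, 6.0, 7.0, 8.0]]       # projection to 4 output features

shards = split_columns(W, parts=2)                 # one shard per simulated GPU
partials = [matmul(X, shard) for shard in shards]  # computed independently

# "All-gather": concatenate each row's partial outputs in shard order.
Y = [sum((partial[i] for partial in partials), []) for i in range(len(X))]
print(Y)
```

Each device stores and multiplies only its own slice of W, so memory and compute per GPU shrink with the number of shards, at the cost of a communication step per layer.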
Speculative decoding:
Built-in support with draft model integration
vLLM vs TensorRT-LLM
| Property | vLLM | TensorRT-LLM |
|----------|------|--------------|
| Primary goal | High throughput, easy deployment | Maximum performance on NVIDIA |
| Hardware | NVIDIA, AMD (ROCm), CPU | NVIDIA only |
| Compilation | JIT, Python | Ahead-of-time CUDA engine |
| Ease of use | High (pip install, one command) | Lower (build pipeline) |
| Quantisation | AWQ, GPTQ, FP8 | SmoothQuant, FP8, GPTQ |
| Throughput | Very high | Highest (on NVIDIA) |
| Typical use | Research, startups, OpenAI-compatible | Production, cloud providers |
Deployment Architecture
Typical production LLM serving stack:
Load balancer (nginx/istio)
↓
LLM serving cluster
vLLM or TensorRT-LLM serving process × N GPUs
Each process: tensor-parallel across GPUs on same node
Pipeline-parallel across nodes for very large models
Request routing:
Route to available serving instance
Prefix caching: route same system prompt to same GPU (cache hits)
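One simple way to get the cache-hit behaviour described above is to hash the system prompt and use the hash to pick an instance, so identical prompts always land on the same server. This is a sketch with a hypothetical `pick_instance` helper and made-up instance names; production routers typically also weigh load and instance health.

```python
import hashlib


def pick_instance(system_prompt, instances):
    """Map a system prompt to one instance, deterministically."""
    digest = hashlib.sha256(system_prompt.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(instances)
    return instances[index]


instances = ["gpu-node-0", "gpu-node-1", "gpu-node-2"]  # hypothetical serving instances
clinical = "You are a clinical pharmacology assistant."
coding = "You are a coding assistant."

for prompt in (clinical, coding, clinical):
    print(pick_instance(prompt, instances), "<-", prompt)
```

Plain modulo hashing remaps most prompts whenever the instance count changes; a consistent-hashing ring is a common alternative when autoscaling is frequent.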
Autoscaling:
Scale up GPU instances on high queue depth
Scale down on low utilisation (GPUs are expensive)
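The scaling rules above can be written as a small decision function. The thresholds, limits, and the name `desired_replicas` are assumptions chosen for illustration, not recommended values.

```python
def desired_replicas(current, queue_depth, gpu_utilisation, *,
                     max_queue_per_replica=8, low_utilisation=0.3,
                     min_replicas=1, max_replicas=16):
    """Return the replica count to move towards, one step at a time."""
    if queue_depth > max_queue_per_replica * current:
        target = current + 1  # requests are piling up: add a GPU instance
    elif queue_depth == 0 and gpu_utilisation < low_utilisation:
        target = current - 1  # idle capacity: remove an expensive instance
    else:
        target = current
    # Never leave the configured bounds.
    return max(min_replicas, min(max_replicas, target))


print(desired_replicas(current=2, queue_depth=40, gpu_utilisation=0.9))
print(desired_replicas(current=2, queue_depth=0, gpu_utilisation=0.1))
print(desired_replicas(current=2, queue_depth=5, gpu_utilisation=0.6))
```

Changing the count by one replica per evaluation keeps the sketch simple and avoids oscillation; real autoscalers usually add cooldown periods because GPU instances take minutes to load model weights.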
Health checking:
HTTP /health endpoint
Alert on high TTFT (time to first token) or low throughput
Interview Answer
"vLLM is the leading open-source LLM serving framework, achieving high throughput via paged attention (fixed-size physical KV cache blocks that eliminate fragmentation), continuous batching, and prefix caching. It exposes an OpenAI-compatible API and supports AWQ/GPTQ quantisation. TensorRT-LLM is NVIDIA's framework for maximum raw performance on NVIDIA hardware: it compiles models ahead of time into optimised engines with fused CUDA kernels, FP8 quantisation, and in-flight batching. vLLM wins on ease of use and hardware flexibility; TensorRT-LLM wins on absolute throughput on NVIDIA GPUs. In production, many teams choose vLLM for its flexibility, with TGI (HuggingFace) as a common alternative."