vLLM and TensorRT-LLM
How the two leading LLM serving frameworks work, their architectural choices, when to use each, and key configuration decisions for production deployment.
The LLM Serving Problem
Serving LLMs in production requires:
1. High throughput: process many requests per second
2. Low latency: first token quickly, fast generation rate
3. GPU efficiency: maximise hardware utilisation
4. Multi-model: serve different models from the same hardware pool
5. Reliability: handle OOM gracefully, autoscale, health checks
Two frameworks dominate open-source LLM serving: vLLM and TensorRT-LLM.
vLLM
vLLM (UC Berkeley, 2023) is a Python-first serving framework focused on throughput via paged attention:
from vllm import LLM, SamplingParams
# Initialise model (downloads weights, builds engine)
llm = LLM(
    model="meta-llama/Llama-2-7b-chat-hf",
    tensor_parallel_size=1,        # GPUs for tensor parallelism
    gpu_memory_utilization=0.9,    # fraction of GPU memory vLLM may use (weights, activations, KV cache)
    max_num_batched_tokens=8192,   # cap on tokens processed per scheduler iteration
    dtype="float16",
)
sampling_params = SamplingParams(
    temperature=0.7,
    top_p=0.9,
    max_tokens=512,
)
# Batch inference
prompts = ["What is Warfarin?", "Explain atrial fibrillation."]
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(output.outputs[0].text)
vLLM as HTTP server:
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-2-7b-chat-hf \
    --port 8000 \
    --tensor-parallel-size 2
Exposes an OpenAI-compatible API.
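Because the API follows the OpenAI wire format, any HTTP client can call it. The following is a minimal sketch using only the standard library, assuming the server started by the command above is listening on localhost:8000 and serving the `/v1/completions` route. The helper names `build_completion_request` and `complete` are illustrative and not part of vLLM.

```python
import json
import urllib.request


def build_completion_request(prompt, model="meta-llama/Llama-2-7b-chat-hf",
                             base_url="http://localhost:8000", max_tokens=128):
    """Build (but do not send) a POST request for the /v1/completions route."""
    payload = {
        "model": model,          # must match the --model the server was started with
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": 0.7,
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt, **kwargs):
    """Send the request and return the generated text of the first choice."""
    request = build_completion_request(prompt, **kwargs)
    with urllib.request.urlopen(request, timeout=60) as response:
        body = json.load(response)
    return body["choices"][0]["text"]


# Usage (requires a running server):
# print(complete("What is Warfarin?"))
```

Keeping request construction separate from the network call makes the payload easy to inspect or unit-test without a GPU.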
vLLM Key Features
Paged attention:
Physical KV cache blocks (default 16 tokens/block)
Block table per sequence: logical → physical mapping
Eliminates fragmentation, enables high concurrency
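The block-table bookkeeping described above can be sketched in a few lines. This is a toy model rather than vLLM's actual implementation: `BlockAllocator` and `Sequence` are hypothetical names, and only block ids are tracked, not tensors. It shows that each sequence wastes at most one partially filled block instead of a whole pre-reserved maximum-length buffer.

```python
BLOCK_SIZE = 16  # tokens per physical block (the default cited above)


class BlockAllocator:
    """Hands out physical block ids from a fixed pool (hypothetical helper)."""

    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))

    def allocate(self):
        if not self.free:
            raise MemoryError("KV cache exhausted")
        return self.free.pop()

    def release(self, block_id):
        self.free.append(block_id)


class Sequence:
    """Tracks one request's logical -> physical block mapping."""

    def __init__(self, allocator):
        self.allocator = allocator
        self.block_table = []   # index = logical block, value = physical block id
        self.num_tokens = 0

    def append_token(self):
        # A new physical block is needed only when the last one is full.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.block_table.append(self.allocator.allocate())
        self.num_tokens += 1

    def physical_slot(self, token_index):
        """Translate a token position into (physical block id, offset in block)."""
        return self.block_table[token_index // BLOCK_SIZE], token_index % BLOCK_SIZE

    def free(self):
        """Return all blocks to the pool when the request completes."""
        for block_id in self.block_table:
            self.allocator.release(block_id)
        self.block_table = []
        self.num_tokens = 0


allocator = BlockAllocator(num_blocks=8)
seq_a, seq_b = Sequence(allocator), Sequence(allocator)
for _ in range(20):
    seq_a.append_token()    # 20 tokens need 2 blocks
for _ in range(5):
    seq_b.append_token()    # 5 tokens need 1 block

# Unused slots in the last block are the only internal waste.
wasted_a = len(seq_a.block_table) * BLOCK_SIZE - seq_a.num_tokens
print(seq_a.block_table, seq_b.block_table, wasted_a)
```

The physical blocks of a sequence need not be contiguous, which is why freed blocks can be reused immediately by any other request.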
Continuous batching:
Iteration-level scheduling
Swap in/out requests as they arrive/complete
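To see why iteration-level scheduling helps, the two points above can be simulated with a toy scheduler. This is a sketch under simplifying assumptions: every request is queued at step 0, each decode iteration produces one token per running request, and prefill cost is ignored. The function names are illustrative and do not come from vLLM.

```python
from collections import deque


def continuous_batching(requests, max_batch_size):
    """Simulate iteration-level scheduling.

    requests: list of (request_id, tokens_to_generate), each length >= 1.
    Returns {request_id: step at which the request finished}.
    """
    waiting = deque(requests)
    running = {}          # request_id -> tokens still to generate
    finished_at = {}
    step = 0
    while waiting or running:
        # Admit queued requests into free batch slots before every iteration.
        while waiting and len(running) < max_batch_size:
            request_id, length = waiting.popleft()
            running[request_id] = length
        step += 1
        # One decode iteration: every running request emits one token.
        for request_id in list(running):
            running[request_id] -= 1
            if running[request_id] == 0:
                del running[request_id]          # slot is free next iteration
                finished_at[request_id] = step
    return finished_at


def static_batching(requests, max_batch_size):
    """Baseline: no new request starts until the whole batch has drained."""
    finished_at, step = {}, 0
    for start in range(0, len(requests), max_batch_size):
        batch = requests[start:start + max_batch_size]
        for request_id, length in batch:
            finished_at[request_id] = step + length
        step += max(length for _, length in batch)
    return finished_at


requests = [("a", 2), ("b", 8), ("c", 3), ("d", 2)]
continuous = continuous_batching(requests, max_batch_size=2)
static = static_batching(requests, max_batch_size=2)
print("continuous:", continuous)
print("static:    ", static)
```

In this example the short requests `c` and `d` no longer wait for the long request `b` to finish, so the whole workload completes in fewer steps.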
Quantisation support:
AWQ, GPTQ, GGUF, bitsandbytes; FP8 on Hopper
Multi-LoRA serving:
Multiple LoRA adapters simultaneously without separate GPU instances
Dynamic loading of adapters per request
Prefix caching:
Cache KV states for common prompt prefixes (system prompts, few-shot examples)
Identical prefixes across requests reuse the same physical pages
TensorRT-LLM
TensorRT-LLM (NVIDIA) optimises at the CUDA kernel level for maximum throughput on NVIDIA hardware:
import tensorrt_llm
from tensorrt_llm.builder import Builder
from tensorrt_llm.models import LLaMAForCausalLM
# Illustrative outline only: class and method names differ between TensorRT-LLM releases
# Build TensorRT engine from HuggingFace checkpoint
builder = Builder()
network = builder.create_network()
llama_model = LLaMAForCausalLM.from_hugging_face(
    hf_model_dir="meta-llama/Llama-2-7b-hf",
    dtype="float16",
    mapping=tensorrt_llm.Mapping(world_size=1, tp_size=1),
)
llama_model.to_trt(network)
# builder_config is assumed to have been created earlier (not shown)
engine = builder.build_engine(network, builder_config)
# Build step is offline: done once, engine saved as binary
# Inference from compiled engine (engine_config is likewise assumed to be defined):
runner = tensorrt_llm.runtime.GenerationSession(engine_config, engine)
TensorRT-LLM Key Features
Kernel-level optimisation:
Custom CUDA kernels fused from multiple operations
Flash attention, RMSNorm, rotary embedding all fused
Optimised for Hopper (H100) and Ada (RTX 4090) architectures
INT8/FP8 quantisation:
First-class support for NVIDIA's native quantisation
FP8 on H100: near-FP16 quality, up to roughly 2× the throughput of FP16
SmoothQuant: weight + activation INT8 co-quantisation
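The idea behind SmoothQuant's weight and activation co-quantisation is to move activation outliers into the weights before both are quantised. Below is a minimal pure-Python sketch of that smoothing step only, assuming a linear layer `Y = X @ W` and the per-channel scale `s_j = max|X_j|^alpha / max|W_j|^(1 - alpha)` from the SmoothQuant method. The function names are illustrative; this is not TensorRT-LLM's implementation.

```python
def matmul(a, b):
    """Plain nested-list matrix product."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def smooth(x, w, alpha=0.5):
    """Rescale input channels so that x_s @ w_s equals x @ w.

    x: activations, shape [tokens][in_features]
    w: weights, shape [in_features][out_features]
    alpha: how much of the quantisation difficulty to migrate to the weights.
    """
    in_features = len(w)
    scales = []
    for j in range(in_features):
        act_max = max(abs(row[j]) for row in x)      # activation range of channel j
        weight_max = max(abs(v) for v in w[j])       # weight range of input channel j
        scales.append(act_max ** alpha / weight_max ** (1 - alpha))
    x_s = [[row[j] / scales[j] for j in range(in_features)] for row in x]
    w_s = [[v * scales[j] for v in w[j]] for j in range(in_features)]
    return x_s, w_s, scales


# Channel 1 of the activations carries a large outlier.
x = [[0.5, 64.0], [-1.0, 32.0]]
w = [[0.5, -0.25], [0.25, 1.0]]

x_s, w_s, scales = smooth(x, w)
print("scales:", scales)
print("activation max before/after:",
      max(abs(v) for row in x for v in row),
      max(abs(v) for row in x_s for v in row))
print("outputs match:", matmul(x, w), matmul(x_s, w_s))
```

The layer output is unchanged, but the activation range shrinks, so a single INT8 scale for the activations loses less precision on the non-outlier channels.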
In-flight batching:
Equivalent to continuous batching, NVIDIA's implementation
Multi-GPU parallelism:
Tensor parallelism: split attention heads across GPUs
Pipeline parallelism: split layers across GPUs
Well-optimised NVLink communication
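How the two parallelism modes combine can be illustrated with a small placement table. This is a sketch, assuming heads divide evenly across tensor-parallel ranks, layers divide evenly across pipeline stages, and a rank layout of `rank = pp_rank * tp_size + tp_rank`. That layout and the `shard_plan` helper are assumptions made for illustration.

```python
def shard_plan(num_layers, num_heads, tp_size, pp_size):
    """Map each GPU rank to the layers and attention heads it would own.

    Assumed layout (hypothetical): rank = pp_rank * tp_size + tp_rank,
    so the ranks of one pipeline stage are adjacent.
    """
    if num_heads % tp_size or num_layers % pp_size:
        raise ValueError("heads must divide by tp_size and layers by pp_size")
    heads_per_rank = num_heads // tp_size
    layers_per_stage = num_layers // pp_size
    plan = {}
    for pp_rank in range(pp_size):
        for tp_rank in range(tp_size):
            rank = pp_rank * tp_size + tp_rank
            plan[rank] = {
                # Pipeline parallelism: a contiguous slice of layers per stage.
                "layers": range(pp_rank * layers_per_stage,
                                (pp_rank + 1) * layers_per_stage),
                # Tensor parallelism: a slice of heads within every owned layer.
                "heads": range(tp_rank * heads_per_rank,
                               (tp_rank + 1) * heads_per_rank),
            }
    return plan


# Llama-2-7B has 32 layers and 32 attention heads.
plan = shard_plan(num_layers=32, num_heads=32, tp_size=2, pp_size=2)
for rank, shard in plan.items():
    print(rank, "layers", shard["layers"], "heads", shard["heads"])
```

Tensor-parallel ranks exchange activations inside every layer, which is why they are usually kept on the same NVLink-connected node, while pipeline stages only pass activations at stage boundaries.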
Speculative decoding:
Built-in support with draft model integration
vLLM vs TensorRT-LLM
| Property | vLLM | TensorRT-LLM |
|----------|------|--------------|
| Primary goal | High throughput, easy deployment | Maximum performance on NVIDIA |
| Hardware | NVIDIA, AMD (ROCm), CPU | NVIDIA only |
| Compilation | JIT, Python | Ahead-of-time CUDA engine |
| Ease of use | High (pip install, one command) | Lower (build pipeline) |
| Quantisation | AWQ, GPTQ, FP8 | SmoothQuant, FP8, GPTQ |
| Throughput | Very high | Highest (on NVIDIA) |
| Typical use | Research, startups, OpenAI-compatible | Production, cloud providers |
Deployment Architecture
Typical production LLM serving stack:
Load balancer (nginx/istio)
↓
LLM serving cluster
vLLM or TensorRT-LLM serving process × N GPUs
Each process: tensor-parallel across GPUs on same node
Pipeline-parallel across nodes for very large models
Request routing:
Route to available serving instance
Prefix caching: route same system prompt to same GPU (cache hits)
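Prefix-aware routing can be as simple as hashing the shared prefix to choose an instance. The sketch below uses rendezvous (highest-random-weight) hashing, one possible choice rather than something either framework prescribes. The instance names and the `route_by_prefix` helper are hypothetical.

```python
import hashlib


def route_by_prefix(prefix, instances):
    """Pick the instance with the highest hash score for this prefix."""
    if not instances:
        raise ValueError("no serving instances available")

    def score(instance):
        # Each (instance, prefix) pair gets an independent pseudo-random score.
        return hashlib.sha256(f"{instance}|{prefix}".encode("utf-8")).digest()

    return max(instances, key=score)


instances = ["gpu-node-0", "gpu-node-1", "gpu-node-2"]   # hypothetical names
system_prompt = "You are a clinical assistant."

target = route_by_prefix(system_prompt, instances)
print(target)
```

Requests with the same system prompt always land on the same instance, and removing or adding an unrelated instance during autoscaling does not move them, so the cached KV pages for that prefix stay useful.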
Autoscaling:
Scale up GPU instances on high queue depth
Scale down on low utilisation (GPUs are expensive)
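The scaling rule above can be written as a small decision function. This is a sketch only: the thresholds (`max_queue_per_replica`, `low_utilisation`) and replica limits are hypothetical values chosen for illustration, and a real deployment would usually add cooldown periods to avoid flapping.

```python
import math


def desired_replicas(current, queue_depth, gpu_utilisation, *,
                     max_queue_per_replica=8, low_utilisation=0.3,
                     min_replicas=1, max_replicas=16):
    """Return the replica count to aim for, given the current load."""
    # Scale up: the queue is deeper than the current replicas can absorb.
    if queue_depth > current * max_queue_per_replica:
        needed = math.ceil(queue_depth / max_queue_per_replica)
        return min(max_replicas, max(current + 1, needed))
    # Scale down one step at a time when idle, because GPUs are expensive.
    if queue_depth == 0 and gpu_utilisation < low_utilisation:
        return max(min_replicas, current - 1)
    return current


print(desired_replicas(current=2, queue_depth=40, gpu_utilisation=0.9))
print(desired_replicas(current=4, queue_depth=0, gpu_utilisation=0.1))
print(desired_replicas(current=2, queue_depth=10, gpu_utilisation=0.8))
```

Scaling up jumps straight to the required count while scaling down moves one replica at a time, which reflects that under-provisioning hurts latency immediately whereas over-provisioning only costs money.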
Health checking:
HTTP /health endpoint
Alert on high TTFT (time to first token) or low throughput
Interview Answer
"vLLM is the leading open-source LLM serving framework, achieving high throughput via paged attention (physical KV cache blocks to eliminate fragmentation), continuous batching, and prefix caching. It exposes an OpenAI-compatible API and supports AWQ/GPTQ quantisation. TensorRT-LLM is NVIDIA's framework for maximum raw performance on NVIDIA hardware: it compiles models to optimised CUDA engines with fused kernels, FP8 quantisation, and in-flight batching. vLLM wins on ease of use and hardware flexibility; TensorRT-LLM wins on absolute throughput on NVIDIA GPUs. In production, most teams use vLLM for flexibility or TGI (Hugging Face) as an alternative."