
vLLM and TensorRT-LLM

How the two leading LLM serving frameworks work, their architectural choices, when to use each, and key configuration decisions for production deployment.

Asma Hafeez Khan, May 16, 2026, 4 min read
LLMs, vLLM, TensorRT-LLM, Serving, Interview

The LLM Serving Problem

Serving LLMs in production requires:

1. High throughput: process many requests per second
2. Low latency: first token quickly, fast generation rate
3. GPU efficiency: maximise hardware utilisation
4. Multi-model: serve different models from the same hardware pool
5. Reliability: handle OOM gracefully, autoscale, health checks
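
The two latency goals in item 2 are usually tracked as time to first token (TTFT) and decode rate in tokens per second. Here is a minimal sketch of how they could be computed from per-token arrival timestamps; the `RequestTrace` type and its field names are illustrative assumptions, not part of either framework.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RequestTrace:
    """Timestamps (in seconds) recorded by a client for one streamed request."""
    sent_at: float
    token_times: List[float]  # arrival time of each generated token


def time_to_first_token(trace: RequestTrace) -> float:
    """Latency between sending the request and receiving the first token."""
    return trace.token_times[0] - trace.sent_at


def tokens_per_second(trace: RequestTrace) -> float:
    """Generation rate over the decode phase (tokens after the first one)."""
    decode_time = trace.token_times[-1] - trace.token_times[0]
    decode_tokens = len(trace.token_times) - 1
    return decode_tokens / decode_time if decode_time > 0 else 0.0


# Hypothetical trace: first token after 0.25 s, then one token every 0.25 s
trace = RequestTrace(sent_at=0.0, token_times=[0.25, 0.5, 0.75, 1.0, 1.25])
ttft = time_to_first_token(trace)
rate = tokens_per_second(trace)
print(f"TTFT: {ttft:.2f} s, decode rate: {rate:.1f} tokens/s")
```

Keeping the two numbers separate matters because they are driven by different phases: TTFT is dominated by queueing and prompt prefill, while the decode rate reflects per-iteration generation speed.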

Two frameworks dominate open-source LLM serving: vLLM and TensorRT-LLM.


vLLM

vLLM (UC Berkeley, 2023) is a Python-first serving framework that achieves high throughput through PagedAttention, a paged layout for the KV cache. Its offline batch API looks like this:

Python
from vllm import LLM, SamplingParams

# Initialise the model (downloads weights if needed, loads them, allocates the KV cache)
llm = LLM(
    model="meta-llama/Llama-2-7b-chat-hf",
    tensor_parallel_size=1,     # GPUs for tensor parallelism
    gpu_memory_utilization=0.9, # fraction of GPU memory vLLM may use (weights, activations, KV cache)
    max_num_batched_tokens=8192,
    dtype="float16",
)

sampling_params = SamplingParams(
    temperature=0.7,
    top_p=0.9,
    max_tokens=512,
)

# Batch inference
prompts = ["What is Warfarin?", "Explain atrial fibrillation."]
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(output.outputs[0].text)

vLLM as an HTTP server:

Bash
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-2-7b-chat-hf \
  --port 8000 \
  --tensor-parallel-size 2

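The server exposes an OpenAI-compatible API (for example /v1/completions and /v1/chat/completions), so existing OpenAI client code can target it by changing the base URL.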
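
As a rough sketch of what a client call could look like, the snippet below builds a request for the /v1/completions route using only the standard library. It assumes the server from the command above is listening on localhost:8000; the helper names `build_completion_request` and `complete` are made up for this example.

```python
import json
import urllib.request


def build_completion_request(prompt: str,
                             base_url: str = "http://localhost:8000",
                             model: str = "meta-llama/Llama-2-7b-chat-hf",
                             max_tokens: int = 128) -> urllib.request.Request:
    """Build a POST request for the OpenAI-style /v1/completions endpoint."""
    payload = {"model": model, "prompt": prompt, "max_tokens": max_tokens}
    return urllib.request.Request(
        url=f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt: str) -> str:
    """Send the request and return the generated text (needs a running server)."""
    with urllib.request.urlopen(build_completion_request(prompt)) as resp:
        body = json.load(resp)
    return body["choices"][0]["text"]


# Building the request does not touch the network; complete() does.
request = build_completion_request("What is Warfarin?")
print(request.full_url)
```

Because the request and response shapes follow the OpenAI format, the official OpenAI client libraries should also work when pointed at the same base URL.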


vLLM Key Features

Paged attention:
  Physical KV cache blocks (default 16 tokens/block)
  Block table per sequence: logical → physical mapping
  Eliminates fragmentation, enables high concurrency

Continuous batching:
  Iteration-level scheduling
  Swap in/out requests as they arrive/complete

Quantisation support:
  AWQ, GPTQ, GGUF, bitsandbytes; FP8 on Hopper GPUs

Multi-LoRA serving:
  Multiple LoRA adapters simultaneously without separate GPU instances
  Dynamic loading of adapters per request

Prefix caching:
  Cache KV states for common prompt prefixes (system prompts, few-shot examples)
  Identical prefixes across requests reuse the same physical pages
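
To make the block table and prefix sharing ideas concrete, here is a toy allocator. It is a simplified illustration under stated assumptions, not vLLM's implementation: the class `PagedKVCache` and its fields are hypothetical, only block ids are tracked (no tensors), and freeing or evicting blocks is not modelled.

```python
from typing import Dict, List, Tuple

BLOCK_SIZE = 16  # tokens per physical block, matching the default quoted above


class PagedKVCache:
    """Toy allocator mapping each sequence's logical blocks to physical block ids."""

    def __init__(self, num_blocks: int) -> None:
        self.free_blocks: List[int] = list(range(num_blocks))
        self.block_tables: Dict[str, List[int]] = {}        # sequence id -> physical ids
        self.prefix_index: Dict[Tuple[int, ...], int] = {}  # token prefix -> physical id

    def allocate(self, seq_id: str, tokens: List[int]) -> List[int]:
        table: List[int] = []
        for start in range(0, len(tokens), BLOCK_SIZE):
            end = start + BLOCK_SIZE
            is_full = end <= len(tokens)
            # Key on the whole prefix up to this block, because the KV values
            # in a block depend on every earlier token, not just its own 16.
            key = tuple(tokens[:end]) if is_full else None
            if key is not None and key in self.prefix_index:
                block = self.prefix_index[key]       # reuse a cached full block
            else:
                block = self.free_blocks.pop(0)      # any free block will do
                if key is not None:
                    self.prefix_index[key] = block
            table.append(block)
        self.block_tables[seq_id] = table
        return table


system_prompt = list(range(32))  # 32 token ids, i.e. two full blocks
cache = PagedKVCache(num_blocks=8)
table_a = cache.allocate("a", system_prompt + [100, 101, 102])
table_b = cache.allocate("b", system_prompt + [200])
print(table_a, table_b)
```

Both sequences end up pointing at physical blocks 0 and 1 for the shared system prompt, and each gets its own partially filled block for the tail. A real implementation also needs reference counts so a shared block is only freed once no sequence uses it.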

TensorRT-LLM

TensorRT-LLM (NVIDIA) optimises at the CUDA kernel level for maximum throughput on NVIDIA hardware:

Python
import tensorrt_llm
from tensorrt_llm import BuildConfig, Mapping
from tensorrt_llm.models import LLaMAForCausalLM
from tensorrt_llm.runtime import ModelRunner

# NOTE: TensorRT-LLM's Python API changes between releases;
# check these names against the version you have installed.

# Load the HuggingFace checkpoint into a TensorRT-LLM model definition
llama_model = LLaMAForCausalLM.from_hugging_face(
    "meta-llama/Llama-2-7b-hf",  # HuggingFace checkpoint (a local directory is the safest choice)
    dtype="float16",
    mapping=Mapping(world_size=1, tp_size=1),
)

# Build step is offline: done once, engine saved to disk
engine = tensorrt_llm.build(llama_model, BuildConfig(max_batch_size=8))
engine.save("llama2-7b-engine")  # example output directory

# Inference from the compiled engine
runner = ModelRunner.from_dir(engine_dir="llama2-7b-engine")

TensorRT-LLM Key Features

Kernel-level optimisation:
  Custom CUDA kernels fused from multiple operations
  Flash attention, RMSNorm, rotary embedding all fused
  Optimised for Hopper (H100) and Ada (RTX 4090) architectures

INT8/FP8 quantisation:
  First-class support for NVIDIA's native quantisation
  FP8 on H100: near-FP16 quality, up to roughly 2× throughput improvement
  SmoothQuant: weight + activation INT8 co-quantisation

In-flight batching:
  Equivalent to continuous batching, NVIDIA's implementation

Multi-GPU parallelism:
  Tensor parallelism: split attention heads across GPUs
  Pipeline parallelism: split layers across GPUs
  Well-optimised NVLink communication

Speculative decoding:
  Built-in support with draft model integration
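
In-flight batching (continuous batching in vLLM terms) is easiest to appreciate by counting decoding iterations. The simulation below is a deliberately simplified sketch: it assumes one token per active request per iteration, ignores prefill cost, and uses made-up function names. It compares static batching, where a batch runs until its longest request finishes, with iteration-level scheduling, where a freed slot is refilled at the next step.

```python
from collections import deque
from typing import Deque, List


def static_batching_steps(lengths: List[int], batch_size: int) -> int:
    """Each batch occupies the GPU until its longest request finishes."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])
    return steps


def inflight_batching_steps(lengths: List[int], batch_size: int) -> int:
    """A finished request's slot is refilled before the next iteration."""
    waiting: Deque[int] = deque(lengths)
    running: List[int] = []  # tokens still to generate for each active request
    steps = 0
    while waiting or running:
        # Admit waiting requests into any free slots
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())
        # One iteration generates one token per active request;
        # requests that reach zero remaining tokens leave the batch
        running = [r - 1 for r in running if r - 1 > 0]
        steps += 1
    return steps


output_lengths = [2, 8, 2, 8]  # tokens each request will generate
static_steps = static_batching_steps(output_lengths, batch_size=2)
inflight_steps = inflight_batching_steps(output_lengths, batch_size=2)
print(static_steps, inflight_steps)
```

With these lengths the static schedule needs 16 iterations and the in-flight schedule needs 12, because short requests no longer hold a slot idle while a long neighbour finishes. The gap grows as output lengths become more uneven.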

vLLM vs TensorRT-LLM

| Property | vLLM | TensorRT-LLM |
|----------|------|--------------|
| Primary goal | High throughput, easy deployment | Maximum performance on NVIDIA |
| Hardware | NVIDIA, AMD (ROCm), CPU | NVIDIA only |
| Compilation | JIT, Python | Ahead-of-time CUDA engine |
| Ease of use | High (pip install, one command) | Lower (build pipeline) |
| Quantisation | AWQ, GPTQ, FP8 | SmoothQuant, FP8, GPTQ |
| Throughput | Very high | Highest (on NVIDIA) |
| Typical use | Research, startups, OpenAI-compatible | Production, cloud providers |


Deployment Architecture

Typical production LLM serving stack:

Load balancer (nginx/istio)
       ↓
LLM serving cluster
  vLLM or TensorRT-LLM serving process × N GPUs
  Each process: tensor-parallel across GPUs on same node
  Pipeline-parallel across nodes for very large models

Request routing:
  Route to available serving instance
  Prefix caching: route same system prompt to same GPU (cache hits)
  
Autoscaling:
  Scale up GPU instances on high queue depth
  Scale down on low utilisation (GPUs are expensive)

Health checking:
  HTTP /health endpoint
  Alert on high TTFT (time to first token) or low throughput
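
The prefix-aware routing idea above can be sketched with a stable hash of the system prompt, so that requests sharing a prompt land on the same instance and hit its prefix cache. This is one possible approach under simple assumptions; the instance addresses and the `pick_instance` helper are hypothetical.

```python
import hashlib
from typing import List


def pick_instance(system_prompt: str, instances: List[str]) -> str:
    """Map a system prompt to a stable instance so its cached prefix is reused."""
    digest = hashlib.sha256(system_prompt.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(instances)
    return instances[index]


# Hypothetical serving instances behind the load balancer
instances = ["gpu-node-0:8000", "gpu-node-1:8000", "gpu-node-2:8000"]
first = pick_instance("You are a clinical assistant.", instances)
second = pick_instance("You are a clinical assistant.", instances)
print(first == second)
```

A cryptographic hash is used rather than Python's built-in `hash` because the latter is randomised per process for strings, which would break stickiness across router replicas. Plain modulo hashing also remaps most prompts whenever the instance count changes, so a router that autoscales would likely want consistent hashing plus a fallback to the least-loaded instance when the preferred one is saturated.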

Interview Answer

"vLLM is the leading open-source LLM serving framework, achieving high throughput via paged attention (fixed-size physical KV cache blocks that eliminate fragmentation), continuous batching, and prefix caching. It exposes an OpenAI-compatible API and supports AWQ/GPTQ quantisation. TensorRT-LLM is NVIDIA's framework for maximum raw performance on NVIDIA hardware: it compiles models ahead of time into optimised engines with fused CUDA kernels, FP8 quantisation, and in-flight batching. vLLM wins on ease of use and hardware flexibility; TensorRT-LLM wins on absolute throughput on NVIDIA GPUs. In production, teams that value flexibility typically choose vLLM, with TGI (Hugging Face) as an alternative, while teams that need peak performance on NVIDIA GPUs choose TensorRT-LLM."
