Learnixo

LLMs Deep Dive · Lesson 16 of 24

vLLM and TensorRT-LLM: Production Inference

The LLM Serving Problem

Serving LLMs in production requires:

1. High throughput: process many requests per second
2. Low latency: first token quickly, fast generation rate
3. GPU efficiency: maximise hardware utilisation
4. Multi-model: serve different models from the same hardware pool
5. Reliability: handle OOM gracefully, autoscale, health checks

Two frameworks dominate open-source LLM serving: vLLM and TensorRT-LLM.


vLLM

vLLM (UC Berkeley, 2023) is a Python-first serving framework focused on throughput via paged attention:

Python
from vllm import LLM, SamplingParams

# Initialise model (downloads weights, builds engine)
llm = LLM(
    model="meta-llama/Llama-2-7b-chat-hf",
    tensor_parallel_size=1,     # GPUs for tensor parallelism
    gpu_memory_utilization=0.9, # fraction of GPU memory for KV cache
    max_num_batched_tokens=8192,
    dtype="float16",
)

sampling_params = SamplingParams(
    temperature=0.7,
    top_p=0.9,
    max_tokens=512,
)

# Batch inference
prompts = ["What is Warfarin?", "Explain atrial fibrillation."]
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(output.outputs[0].text)

vLLM as HTTP server:

Bash
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-2-7b-chat-hf \
  --port 8000 \
  --tensor-parallel-size 2

Exposes an OpenAI-compatible API.


vLLM Key Features

Paged attention:
  Physical KV cache blocks (default 16 tokens/block)
  Block table per sequence: logical → physical mapping
  Eliminates fragmentation, enables high concurrency

Continuous batching:
  Iteration-level scheduling
  Swap in/out requests as they arrive/complete

Quantisation support:
  AWQ, GPTQ, GGUF, bitsandbytes (fp8 on Hopper)

Multi-LoRA serving:
  Multiple LoRA adapters simultaneously without separate GPU instances
  Dynamic loading of adapters per request

Prefix caching:
  Cache KV states for common prompt prefixes (system prompts, few-shot examples)
  Identical prefixes across requests reuse the same physical pages

TensorRT-LLM

TensorRT-LLM (NVIDIA) optimises at the CUDA kernel level for maximum throughput on NVIDIA hardware:

Python
import tensorrt_llm
from tensorrt_llm.builder import Builder
from tensorrt_llm.models import LLaMAForCausalLM

# Build TensorRT engine from HuggingFace checkpoint
builder = Builder()
network = builder.create_network()
llama_model = LLaMAForCausalLM.from_hugging_face(
    hf_model_dir="meta-llama/Llama-2-7b-hf",
    dtype="float16",
    mapping=tensorrt_llm.Mapping(world_size=1, tp_size=1)
)
llama_model.to_trt(network)
engine = builder.build_engine(network, builder_config)

# Build step is offline  done once, engine saved as binary
# Inference from compiled engine:
runner = tensorrt_llm.runtime.GenerationSession(engine_config, engine)

TensorRT-LLM Key Features

Kernel-level optimisation:
  Custom CUDA kernels fused from multiple operations
  Flash attention, RMSNorm, rotary embedding all fused
  Optimised for Hopper (H100) and Ada (RTX 4090) architectures

INT8/FP8 quantisation:
  First-class support for NVIDIA's native quantisation
  FP8 on H100: near-fp16 quality, 2× throughput improvement
  SmoothQuant: weight + activation INT8 co-quantisation

In-flight batching:
  Equivalent to continuous batching, NVIDIA's implementation

Multi-GPU parallelism:
  Tensor parallelism: split attention heads across GPUs
  Pipeline parallelism: split layers across GPUs
  Well-optimised NVLink communication

Speculative decoding:
  Built-in support with draft model integration

vLLM vs TensorRT-LLM

| Property | vLLM | TensorRT-LLM | |----------|------|--------------| | Primary goal | High throughput, easy deployment | Maximum performance on NVIDIA | | Hardware | NVIDIA, AMD (ROCm), CPU | NVIDIA only | | Compilation | JIT, Python | Ahead-of-time CUDA engine | | Ease of use | High (pip install, one command) | Lower (build pipeline) | | Quantisation | AWQ, GPTQ, FP8 | SmoothQuant, FP8, GPTQ | | Throughput | Very high | Highest (on NVIDIA) | | Typical use | Research, startups, OpenAI-compatible | Production, cloud providers |


Deployment Architecture

Typical production LLM serving stack:

Load balancer (nginx/istio)
       ↓
LLM serving cluster
  vLLM or TensorRT-LLM serving process × N GPUs
  Each process: tensor-parallel across GPUs on same node
  Pipeline-parallel across nodes for very large models

Request routing:
  Route to available serving instance
  Prefix caching: route same system prompt to same GPU (cache hits)
  
Autoscaling:
  Scale up GPU instances on high queue depth
  Scale down on low utilisation (GPUs are expensive)

Health checking:
  HTTP /health endpoint
  Alert on high TTFT (time to first token) or low throughput

Interview Answer

"vLLM is the leading open-source LLM serving framework, achieving high throughput via paged attention (physical KV cache blocks to eliminate fragmentation), continuous batching, and prefix caching. It exposes an OpenAI-compatible API and supports AWQ/GPTQ quantisation. TensorRT-LLM is NVIDIA's framework for maximum raw performance on NVIDIA hardware — it compiles models to optimised CUDA engines with fused kernels, FP8 quantisation, and in-flight batching. vLLM wins on ease of use and hardware flexibility; TensorRT-LLM wins on absolute throughput on NVIDIA GPUs. In production, most teams use vLLM for flexibility or TGI (HuggingFace) as an alternative."