LLM Inference and Serving

How to serve LLMs at scale: KV cache management, continuous batching, vLLM, PagedAttention, speculative decoding, and production deployment patterns.

Asma Hafeez Khan · May 16, 2026 · 7 min read
Tags: LLM, Inference, vLLM, KV Cache, Serving, Production

The Inference Bottleneck

LLM inference has two distinct phases with different performance characteristics:

Prefill phase: Process the entire prompt in parallel. Compute-bound — limited by GPU FLOP/s. Fast even for long prompts because all tokens are processed simultaneously.

Decode phase: Generate one token at a time. Memory-bandwidth-bound — the bottleneck is reading KV cache and model weights from GPU memory on every step, not computation. This is why inference is slow: a GPU doing matrix multiply at 312 TFLOP/s can still be slow if it's waiting for memory.

For a 7B model at float16:
  Model weights: 14GB
  One decode step reads ~14GB from HBM
  A100 (80GB) HBM bandwidth: ~2TB/s

  Minimum time per token: 14GB / 2TB/s = 7ms
  Maximum decode speed: ~143 tokens/second (single request)

  With a batch of 10 requests: still ~7ms per step (one pass over the weights serves all 10 requests, ignoring KV cache reads)
  Throughput: ~1,430 tokens/second

Batching is essential for throughput — reading weights once to serve multiple requests amortizes the memory bandwidth cost.
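The arithmetic above can be reproduced with a few lines of Python. This is a rough sketch: the function name `decode_estimate` is made up for illustration, and it assumes each decode step streams the full weights exactly once while ignoring KV cache reads and compute time, so the results are upper bounds on speed.

```python
def decode_estimate(weight_bytes: float, hbm_bytes_per_s: float, batch_size: int = 1):
    """Return (seconds per decode step, aggregate tokens per second).

    Assumes every step reads all weights once from HBM and that this read
    dominates; KV cache traffic and compute are ignored.
    """
    step_s = weight_bytes / hbm_bytes_per_s
    return step_s, batch_size / step_s


# 7B parameters at 2 bytes each = 14GB; A100 HBM at roughly 2TB/s.
step_s, single = decode_estimate(14e9, 2e12)
_, batched = decode_estimate(14e9, 2e12, batch_size=10)
print(f"{step_s * 1e3:.1f} ms/step, {single:.0f} tok/s single, {batched:.0f} tok/s at batch 10")
```

In practice the batched figure degrades as batch size grows, because KV cache reads scale with the number of active tokens and eventually compete with the weight reads.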


KV Cache: The Memory Tradeoff

During decoding, each token's key and value tensors are cached to avoid recomputation:

KV cache memory per token = 2 × n_layers × n_heads × head_dim × bytes_per_element
                          = 2 × 32 × 32 × 128 × 2 (bfloat16)
                          = 524,288 bytes ≈ 0.5MB per token

For a 7B model with full multi-head attention (e.g., LLaMA-2-7B):
  32 layers × 32 heads × 128 head_dim × 2 (K,V) × 2 bytes = 0.5MB per token

  4096-token sequence: 2GB of KV cache
  8192-token sequence: 4GB of KV cache

Models with grouped-query attention cache fewer heads: LLaMA-3-8B keeps only 8 KV heads, so its cache is 4× smaller (~128KB per token).

An A100 with 80GB minus 14GB model weights = 66GB for KV cache
  = 66GB / 0.5MB ≈ 132,000 tokens total KV cache capacity
  = ~32 concurrent requests of 4096 tokens each
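The same sizing can be checked in code. The helper below is a hypothetical sketch, not part of any library; it takes the number of KV heads as a parameter so the grouped-query case can be compared. Exact byte arithmetic (decimal GB, 524,288 bytes per token) lands slightly below the rounded figures above.

```python
def kv_bytes_per_token(n_layers: int, n_kv_heads: int, head_dim: int,
                       bytes_per_element: int = 2) -> int:
    # The leading factor of 2 covers the separate key and value tensors.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_element


per_token = kv_bytes_per_token(n_layers=32, n_kv_heads=32, head_dim=128)
per_token_gqa = kv_bytes_per_token(n_layers=32, n_kv_heads=8, head_dim=128)

free_bytes = (80 - 14) * 10**9            # 80GB card minus 14GB of weights
capacity_tokens = free_bytes // per_token
print(per_token, per_token_gqa)           # bytes per token: full MHA vs. 8 KV heads
print(capacity_tokens, capacity_tokens // 4096)
```

A real deployment has less room than this, since activations and framework overhead also take GPU memory.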

Continuous Batching

Static batching waits for a fixed batch size, then processes the whole batch together until every request in it has finished. This wastes GPU time when requests arrive unevenly or produce outputs of different lengths. Continuous batching inserts new requests into the batch as soon as a slot frees up:

Python
import asyncio
import time
from dataclasses import dataclass, field

@dataclass
class Request:
    id: str
    prompt_tokens: list[int]
    max_new_tokens: int
    generated_tokens: list[int] = field(default_factory=list)
    finished: bool = False
    created_at: float = field(default_factory=time.time)


class ContinuousBatchingScheduler:
    """
    Continuously fills a batch as requests complete and new ones arrive.
    This is the core scheduling idea behind vLLM and TGI.
    """

    def __init__(self, max_batch_size: int = 32, max_seq_len: int = 4096):
        self.max_batch_size = max_batch_size
        self.max_seq_len = max_seq_len
        self.active_requests: dict[str, Request] = {}
        self.waiting_queue: asyncio.Queue[Request] = asyncio.Queue()

    async def add_request(self, request: Request) -> None:
        await self.waiting_queue.put(request)

    def _fill_batch(self) -> None:
        """Add waiting requests to active batch if space available."""
        while (
            len(self.active_requests) < self.max_batch_size
            and not self.waiting_queue.empty()
        ):
            try:
                req = self.waiting_queue.get_nowait()
                self.active_requests[req.id] = req
            except asyncio.QueueEmpty:
                break

    def step(self, model) -> dict[str, list[int]]:
        """
        Run one decode step for all active requests.
        Returns dict of request_id → completed sequences for finished requests.
        """
        self._fill_batch()

        if not self.active_requests:
            return {}

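        # Snapshot the active requests. Real systems pack these into one
        # batch with per-sequence attention masks for variable lengths.
        request_list = list(self.active_requests.values())

        # Assumption for this sketch: `model` is a callable that takes the
        # token list of each request and returns one next token per request.
        next_tokens = model([r.prompt_tokens + r.generated_tokens for r in request_list])
        for req, token in zip(request_list, next_tokens):
            req.generated_tokens.append(token)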

        # Remove finished requests
        completed = {}
        for req_id, req in list(self.active_requests.items()):
            if req.finished or len(req.generated_tokens) >= req.max_new_tokens:
                completed[req_id] = req.generated_tokens
                del self.active_requests[req_id]

        return completed
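
To get a feel for how much a scheduler like this can save, here is a small simulation that counts decode steps under both policies. It is a toy sketch under simplifying assumptions: all requests are queued at time zero, each request is described only by its output length, and every step costs the same regardless of batch contents. The function names are illustrative.

```python
from collections import deque


def static_batching_steps(lengths: list[int], batch_size: int) -> int:
    """Each fixed batch runs until its longest request finishes."""
    return sum(
        max(lengths[i:i + batch_size])
        for i in range(0, len(lengths), batch_size)
    )


def continuous_batching_steps(lengths: list[int], batch_size: int) -> int:
    """A freed slot is refilled from the queue before the next step."""
    waiting = deque(lengths)
    active: list[int] = []            # remaining decode steps per active request
    steps = 0
    while waiting or active:
        while waiting and len(active) < batch_size:
            active.append(waiting.popleft())
        steps += 1
        # Drop requests that just produced their last token.
        active = [r - 1 for r in active if r > 1]
    return steps


lengths = [100, 10, 10, 10, 10, 10, 10, 10]   # hypothetical output lengths
static_steps = static_batching_steps(lengths, batch_size=2)
continuous_steps = continuous_batching_steps(lengths, batch_size=2)
print(static_steps, continuous_steps)
```

With these lengths the static policy needs 130 steps, because the first batch holds a slot idle for 90 steps, while the continuous policy finishes in 100 by cycling the short requests through the second slot.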

vLLM and PagedAttention

The key innovation of vLLM (Kwon et al., 2023): instead of allocating a contiguous block of GPU memory for each request's KV cache, use paged memory management — the same idea as OS virtual memory.

The problem with contiguous allocation:

  • Reserve max_seq_len × kv_size for each request upfront
  • Short requests waste reserved memory
  • Memory fragmentation: many small gaps, total free memory is large but no contiguous block is available

PagedAttention solution:

  • Divide KV cache into fixed-size blocks (typically 16 tokens)
  • Allocate blocks on-demand as tokens are generated
  • Different requests can use non-contiguous physical blocks
  • A block table maps logical to physical block positions
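
The block-table idea in the list above can be sketched as a toy allocator. This is an illustration of the bookkeeping only, not vLLM's implementation: the class and method names are made up, and no tensors are stored.

```python
class BlockAllocator:
    """Toy paged KV cache allocator: logical blocks map to arbitrary physical blocks."""

    def __init__(self, num_blocks: int, block_size: int = 16):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_tables: dict[str, list[int]] = {}   # request id -> physical block ids
        self.num_tokens: dict[str, int] = {}

    def append_token(self, request_id: str) -> tuple[int, int]:
        """Reserve a slot for one more token; return (physical_block, offset)."""
        n = self.num_tokens.get(request_id, 0)
        table = self.block_tables.setdefault(request_id, [])
        if n % self.block_size == 0:       # first token, or the last block is full
            if not self.free_blocks:
                raise MemoryError("no free KV blocks; a real scheduler would preempt")
            table.append(self.free_blocks.pop())
        self.num_tokens[request_id] = n + 1
        return table[n // self.block_size], n % self.block_size

    def free(self, request_id: str) -> None:
        """Return all of a finished request's blocks to the pool."""
        self.free_blocks.extend(self.block_tables.pop(request_id, []))
        self.num_tokens.pop(request_id, None)


alloc = BlockAllocator(num_blocks=4, block_size=16)
for _ in range(17):                 # 17 tokens need two 16-token blocks
    alloc.append_token("a")
slot = alloc.append_token("b")      # "b" takes whichever physical block is free
alloc.free("a")                     # both of "a"'s blocks go back to the pool
print(slot, alloc.block_tables, sorted(alloc.free_blocks))
```

Because memory is claimed one block at a time, a request wastes at most one partially filled block instead of a full max_seq_len reservation.
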
Python
# Using vLLM for high-throughput serving
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    tensor_parallel_size=1,      # Number of GPUs for tensor parallelism
    gpu_memory_utilization=0.90, # Fraction of GPU memory vLLM may use (weights, activations, KV cache)
    max_model_len=8192,
    quantization=None,           # or e.g. "awq", "gptq" (supported methods vary by vLLM version)
    enforce_eager=False,         # Use CUDA graphs for faster decoding
)

# Batch inference
prompts = [
    "Explain the pharmacokinetics of warfarin",
    "What is the mechanism of beta-blockers?",
    "How does metformin lower blood glucose?",
]

sampling_params = SamplingParams(
    temperature=0.0,
    max_tokens=512,
    stop=["</s>", "<|eot_id|>"],
)

outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    print(f"Prompt: {output.prompt[:50]}...")
    print(f"Response: {output.outputs[0].text}")
    print(f"Tokens generated: {len(output.outputs[0].token_ids)}")
    print()

vLLM OpenAI-Compatible Server

Python
# Start vLLM as an OpenAI-compatible API server:
# vllm serve meta-llama/Meta-Llama-3-8B-Instruct --port 8000

# Then use the standard OpenAI client
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed",  # vLLM doesn't require auth by default
)

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    messages=[
        {"role": "user", "content": "What is the interaction between warfarin and NSAIDs?"}
    ],
    max_tokens=512,
    temperature=0,
)

print(response.choices[0].message.content)
print(f"Tokens: {response.usage.prompt_tokens} in, {response.usage.completion_tokens} out")

Speculative Decoding in Practice

Use a small "draft" model to propose multiple tokens at once, then verify with the large model:

Python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Draft model: small and fast (e.g., Llama-3.2-1B); it must share the target's tokenizer
draft_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-1B",
    torch_dtype=torch.bfloat16,
)

# Target model: large and accurate (e.g., LLaMA-3-8B)
target_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B",
    torch_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")

# Hugging Face transformers supports speculative decoding ("assisted generation")
# through the assistant_model argument of generate()

inputs = tokenizer("What is the pharmacokinetics of warfarin?", return_tensors="pt")

output = target_model.generate(
    **inputs,
    assistant_model=draft_model,    # Speculative decoding
    max_new_tokens=200,
    do_sample=False,
)

# Speedups of roughly 2-3× are often reported on typical text, but the gain
# depends on the acceptance rate (how often draft tokens are accepted)
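
To make the propose-then-verify loop concrete, here is a toy greedy version with stand-in "models" that are plain Python functions. It is a sketch under stated assumptions, not how transformers implements assisted generation: the names `speculative_decode`, `target`, and `draft` are hypothetical, and the per-position target calls stand in for what would be a single batched forward pass on a GPU.

```python
from typing import Callable

NextToken = Callable[[list[int]], int]   # greedy next-token function


def speculative_decode(target: NextToken, draft: NextToken, prompt: list[int],
                       max_new_tokens: int, k: int = 4) -> tuple[list[int], int]:
    """Greedy speculative decoding; returns (tokens, number of verification passes)."""
    tokens = list(prompt)
    goal = len(prompt) + max_new_tokens
    passes = 0
    while len(tokens) < goal:
        # 1. The draft model proposes k tokens autoregressively.
        proposal: list[int] = []
        for _ in range(k):
            proposal.append(draft(tokens + proposal))
        # 2. The target checks every proposed position (one pass in a real system).
        passes += 1
        accepted: list[int] = []
        for i in range(k):
            expected = target(tokens + proposal[:i])
            accepted.append(expected)          # the target's token is always kept
            if expected != proposal[i]:        # first mismatch: drop the rest of the draft
                break
        else:
            accepted.append(target(tokens + proposal))  # all matched: one bonus token
        tokens.extend(accepted)
    return tokens[:goal], passes


def target(t: list[int]) -> int:
    return (t[-1] + 1) % 10                    # toy target: count upward modulo 10


def draft(t: list[int]) -> int:
    return 0 if t[-1] == 2 else (t[-1] + 1) % 10   # same, but wrong after a 2


out, passes = speculative_decode(target, draft, prompt=[0], max_new_tokens=8)
print(out, passes)
```

The output matches what the target would produce on its own, since every kept token is the target's greedy choice; the draft only determines how many tokens each verification pass yields.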

Quantized Inference

4-bit quantization cuts weight memory roughly 4× versus bfloat16 and, with suitable kernels, can speed up memory-bound decoding:

Python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 4-bit NF4 quantization for inference
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-70B-Instruct",
    quantization_config=quantization_config,
    device_map="auto",
)

# 70B × 4 bits / 8 = 35GB of weights, which fits on a single 80GB A100
# Trade-off: some quality loss (task-dependent), ~4× less weight memory than bfloat16

Quantization options comparison:

| Format | Bits | Memory (70B) | Quality | Speed |
|---|---|---|---|---|
| float32 | 32 | 280GB | Baseline | Slowest |
| bfloat16 | 16 | 140GB | ≈ baseline | 2× |
| AWQ int8 | 8 | 70GB | 99% | 2-3× |
| GPTQ int4 | 4 | 35GB | 95-98% | 3-4× |
| NF4 (bnb) | 4 | 35GB | 95-98% | 3-4× |
| GGUF Q4_K_M | 4 | ~40GB | 96% | CPU+GPU |
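
The memory column follows from parameters × bits / 8. A minimal sketch of that calculation, with a hypothetical helper name and decimal gigabytes assumed:

```python
def weight_memory_gb(n_params: float, bits: int) -> float:
    """Weight storage in decimal GB, ignoring quantization metadata."""
    return n_params * bits / 8 / 1e9


sizes = {
    name: weight_memory_gb(70e9, bits)
    for name, bits in [("float32", 32), ("bfloat16", 16), ("int8", 8), ("int4", 4)]
}
for name, gb in sizes.items():
    print(f"{name:>8}: {gb:.0f} GB")
```

Real quantized checkpoints come out somewhat larger than this estimate, because per-group scales and other metadata are stored alongside the weights.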


Production Serving Architecture

                    ┌─────────────────────────────────┐
                    │         Load Balancer           │
                    │ (routes requests, health check) │
                    └─────────────┬───────────────────┘
                                  │
              ┌───────────────────┼───────────────────┐
              │                   │                   │
     ┌────────▼────────┐ ┌────────▼────────┐ ┌────────▼────────┐
     │  vLLM Worker 0  │ │  vLLM Worker 1  │ │  vLLM Worker 2  │
     │  GPU 0-1 (TP=2) │ │  GPU 2-3 (TP=2) │ │  GPU 4-5 (TP=2) │
     └────────┬────────┘ └────────┬────────┘ └────────┬────────┘
              │                   │                   │
              └───────────────────┼───────────────────┘
                                  │
                    ┌─────────────▼───────────────────┐
                    │       Result Aggregator         │
                    │  (stream back to caller)        │
                    └─────────────────────────────────┘

Key infrastructure decisions:

  • Tensor parallelism (TP=2 or 4): Required for models that don't fit on one GPU
  • Multiple replicas: For horizontal scaling of throughput
  • Health checks: Detect stuck/dead workers and route around them
  • Streaming: Use SSE (Server-Sent Events) to stream tokens to the client; don't wait for full response
  • Timeout handling: Decode can take seconds for long responses; set appropriate read timeouts
  • Priority queues: Route latency-sensitive requests (UI) ahead of batch jobs (embedding pipelines)
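
As one illustration of the priority-queue item above, the following sketch orders requests with Python's `heapq`. The class name and the two priority levels are assumptions made for the example; a production router would also need aging or quotas so that batch jobs are not starved.

```python
import heapq
import itertools

INTERACTIVE = 0   # lower number = served first
BATCH = 1


class PriorityRequestQueue:
    """Serves lower priority numbers first, FIFO within the same priority."""

    def __init__(self) -> None:
        self._heap: list[tuple[int, int, str]] = []
        self._counter = itertools.count()   # tie-breaker that preserves arrival order

    def push(self, request_id: str, priority: int) -> None:
        heapq.heappush(self._heap, (priority, next(self._counter), request_id))

    def pop(self) -> str:
        return heapq.heappop(self._heap)[2]

    def __len__(self) -> int:
        return len(self._heap)


queue = PriorityRequestQueue()
queue.push("batch-1", BATCH)
queue.push("ui-1", INTERACTIVE)
queue.push("batch-2", BATCH)
queue.push("ui-2", INTERACTIVE)
order = [queue.pop() for _ in range(len(queue))]
print(order)
```

The interactive requests come out first even though a batch job arrived before them, and requests at the same priority keep their arrival order.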
