LLM Quantisation

Why Quantise?

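LLaMA 2 70B in fp16 requires ~140GB of GPU memory for the weights alone (70B parameters × 2 bytes), before the KV cache and activations are counted. Most users have one or two GPUs with 24-80GB each. Quantisation compresses model weights to lower bit-widths: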

Precision        Bytes/param   Memory for 70B params
fp32 (32-bit)    4             280 GB
fp16 (16-bit)    2             140 GB
int8 (8-bit)     1             70 GB
int4 (4-bit)     0.5           35 GB
2-bit            0.25          17.5 GB
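
The figures above follow from parameters × bits ÷ 8. A minimal sketch of that arithmetic, counting weights only; the helper name `weight_memory_gb` is illustrative, and the per-group scale and zero-point overhead of real quantised formats is ignored:

```python
def weight_memory_gb(n_params: float, bits_per_param: float) -> float:
    """Memory for the weights alone, in GB (10^9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


# Reproduce the table for a 70B-parameter model.
for name, bits in [("fp32", 32), ("fp16", 16), ("int8", 8), ("int4", 4), ("2-bit", 2)]:
    print(f"{name:>5}: {weight_memory_gb(70e9, bits):6.1f} GB")
```

Real deployments need extra headroom for the KV cache, activations, and quantisation metadata, so these numbers are lower bounds.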

Quality degrades as bit-width decreases.
INT4 is often the practical minimum for acceptable quality.

Post-Training Quantisation (PTQ)

Quantise a pretrained model without retraining:

Asymmetric INT8 quantisation:
  For each weight tensor W:
  
  scale = (max(W) - min(W)) / 255
  zero_point = round(-min(W) / scale)
  
  W_quant = round(W / scale) + zero_point  ∈ {0, ..., 255}
  W_dequant = (W_quant - zero_point) × scale  ≈ W

  Error per element: up to scale/2 (rounding error)
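
The formulas above can be sketched in NumPy as follows. This is a per-tensor illustration under simple assumptions (a non-constant tensor, min/max calibration); the function names are hypothetical and not taken from any library:

```python
import numpy as np


def quantise_asymmetric(w: np.ndarray, bits: int = 8):
    """Per-tensor asymmetric quantisation onto the integers {0, ..., 2^bits - 1}."""
    levels = 2**bits - 1
    w64 = w.astype(np.float64)
    scale = (w64.max() - w64.min()) / levels          # assumes max(W) > min(W)
    zero_point = int(round(-w64.min() / scale))
    q = np.clip(np.round(w64 / scale) + zero_point, 0, levels).astype(np.uint8)
    return q, float(scale), zero_point


def dequantise(q: np.ndarray, scale: float, zero_point: int) -> np.ndarray:
    """Map the integer codes back to approximate real values."""
    return (q.astype(np.float32) - zero_point) * scale


rng = np.random.default_rng(0)
w = rng.normal(size=(4, 8)).astype(np.float32)

q, scale, zero_point = quantise_asymmetric(w)
w_hat = dequantise(q, scale, zero_point)
max_err = float(np.abs(w - w_hat).max())

print(f"scale={scale:.5f} zero_point={zero_point} max_err={max_err:.5f}")
```

The smallest weight maps to code 0 and the largest to code 255, and the round-trip error stays within the scale/2 bound stated above.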
  
Granularity choices:
  Per-tensor:  one scale per weight matrix (fastest, most error)
  Per-channel: one scale per output neuron (balanced)
  Per-group:   one scale per group of G weights (best quality, more overhead)
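
To make the trade-off concrete, the sketch below applies the same 4-bit min/max quantiser at the three granularities to a random matrix with one large-magnitude row. The matrix, the group size G = 32, and the helper `fake_quant` are assumptions chosen for illustration:

```python
import numpy as np


def fake_quant(w: np.ndarray, axis=None, bits: int = 4) -> np.ndarray:
    """Quantise then dequantise, with one asymmetric min/max grid per slice along `axis`."""
    levels = 2**bits - 1
    w_min = w.min(axis=axis, keepdims=True)
    w_max = w.max(axis=axis, keepdims=True)
    scale = (w_max - w_min) / levels
    zero = np.round(-w_min / scale)
    q = np.clip(np.round(w / scale) + zero, 0, levels)
    return (q - zero) * scale


rng = np.random.default_rng(0)
W = rng.normal(size=(64, 256))   # (out_features, in_features)
W[3] *= 10.0                     # one output row with much larger weights

G = 32
per_tensor = fake_quant(W, axis=None)                                    # 1 scale
per_channel = fake_quant(W, axis=1)                                      # 64 scales
per_group = fake_quant(W.reshape(64, 256 // G, G), axis=2).reshape(W.shape)  # 512 scales

err_tensor = float(np.abs(W - per_tensor).mean())
err_channel = float(np.abs(W - per_channel).mean())
err_group = float(np.abs(W - per_group).mean())

print(f"per-tensor {err_tensor:.4f} | per-channel {err_channel:.4f} | per-group {err_group:.4f}")
```

With per-tensor scaling the single large row stretches the grid for every weight; finer granularity confines that damage to the row or group that contains the large values, at the cost of storing more scales.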

bitsandbytes INT8 (LLM.int8())

Python

from transformers import AutoModelForCausalLM, BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(load_in_8bit=True)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=quantization_config,
    device_map="auto"
)

# LLM.int8() detects outlier feature dimensions (activations above a magnitude
# threshold, 6.0 by default), keeps those in fp16 and quantises the rest to int8.
# This mixed-precision approach typically costs about 1% quality or less.
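
The mixed-precision decomposition can be illustrated with a small NumPy sketch. It is not the bitsandbytes kernel: the function names are hypothetical, float64 stands in for fp16, and the demo assumes at least one non-outlier dimension exists.

```python
import numpy as np


def absmax_int8(a: np.ndarray, axis: int):
    """Symmetric int8 quantisation with one absmax scale per slice along `axis`."""
    scale = np.abs(a).max(axis=axis, keepdims=True) / 127.0
    return np.round(a / scale).astype(np.int8), scale


def int8_matmul(X: np.ndarray, W: np.ndarray) -> np.ndarray:
    """Quantise X per row and W per column, multiply in int32, then rescale."""
    Xq, sx = absmax_int8(X, axis=1)
    Wq, sw = absmax_int8(W, axis=0)
    return (Xq.astype(np.int32) @ Wq.astype(np.int32)) * sx * sw


def mixed_int8_matmul(X: np.ndarray, W: np.ndarray, threshold: float = 6.0) -> np.ndarray:
    """Outlier feature dimensions stay in floating point; the rest go through int8."""
    outlier = np.abs(X).max(axis=0) > threshold
    out_fp = X[:, outlier] @ W[outlier, :]
    out_q = int8_matmul(X[:, ~outlier], W[~outlier, :])
    return out_fp + out_q


rng = np.random.default_rng(0)
X = rng.normal(size=(16, 64))    # (tokens, hidden)
X[:, 5] *= 20.0                  # one outlier feature dimension
W = rng.normal(size=(64, 32))    # (hidden, out)

ref = X @ W
err_naive = float(np.abs(int8_matmul(X, W) - ref).mean())
err_mixed = float(np.abs(mixed_int8_matmul(X, W) - ref).mean())
print(f"plain int8: {err_naive:.4f} | mixed precision: {err_mixed:.4f}")
```

Removing the outlier dimension from the int8 path keeps the per-row scales small, which is why the mixed result tracks the full-precision product more closely.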

GPTQ: Weight-Quantisation to INT4

GPTQ (Frantar et al., 2022) uses second-order information to minimise quantisation error:

Algorithm (per layer):
1. Process weight columns one at a time
2. For each column, find the quantised value that minimises the increase in
   layer output error (using the Hessian of the layer-wise reconstruction error,
   H = 2XXᵀ, built from calibration inputs X; not the Hessian of the training loss)
3. Update remaining columns to compensate for the error introduced
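
The three steps can be sketched in NumPy as below. This is a simplified illustration rather than the reference implementation: it omits the blocking, lazy batch updates, and grouping used in practice, fixes one min/max grid per output row, and all names are hypothetical. The inputs are deliberately correlated, since that is where error compensation helps most.

```python
import numpy as np


def make_grid(W, bits):
    """One asymmetric min/max grid per output row."""
    levels = 2**bits - 1
    w_min, w_max = W.min(axis=1), W.max(axis=1)
    scale = (w_max - w_min) / levels
    return scale, np.round(-w_min / scale), levels


def round_to_grid(w, scale, zero, levels):
    """Round to the nearest grid point and return the dequantised value."""
    q = np.clip(np.round(w / scale) + zero, 0, levels)
    return (q - zero) * scale


def gptq_sketch(W, X, bits=4, damp=0.01):
    """W: (out, in) weights, X: (in, n_samples) calibration inputs."""
    W = W.astype(np.float64).copy()
    scale, zero, levels = make_grid(W, bits)
    H = 2.0 * X @ X.T                                      # Hessian of ||WX - QX||^2
    H += damp * np.mean(np.diag(H)) * np.eye(H.shape[0])   # damping for stability
    U = np.linalg.cholesky(np.linalg.inv(H)).T             # upper factor: inv(H) = U.T @ U
    Q = np.zeros_like(W)
    for i in range(W.shape[1]):                            # step 1: one column at a time
        w = W[:, i]
        Q[:, i] = round_to_grid(w, scale, zero, levels)    # step 2: quantise the column
        err = (w - Q[:, i]) / U[i, i]
        W[:, i + 1:] -= np.outer(err, U[i, i + 1:])        # step 3: compensate in later columns
    return Q


rng = np.random.default_rng(0)
d_in, d_out, n = 32, 16, 256
X = rng.normal(size=(d_in, n)) + 2.0 * rng.normal(size=(1, n))  # correlated features
W = rng.normal(size=(d_out, d_in))

scale, zero, levels = make_grid(W, 4)
W_rtn = round_to_grid(W, scale[:, None], zero[:, None], levels)  # naive round-to-nearest
W_gptq = gptq_sketch(W, X)

err_rtn = float(np.mean((W @ X - W_rtn @ X) ** 2))
err_gptq = float(np.mean((W @ X - W_gptq @ X) ** 2))
print(f"round-to-nearest: {err_rtn:.4f} | GPTQ-style: {err_gptq:.4f}")
```

Both variants use the same 4-bit grid; the only difference is that the GPTQ-style loop shifts the not-yet-quantised columns to absorb each rounding error, which lowers the layer output error.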

Result:
  4-bit weights with near-fp16 quality on most tasks
  A several-fold smaller accuracy drop than naive (round-to-nearest) INT4

Quality comparison (70B model, MMLU):
  fp16:              68.9%
  GPTQ INT4 (g128):  67.8%  (1.1-point drop)
  Naive INT4:        62.4%  (6.5-point drop)

AWQ: Activation-Aware Weight Quantisation

Python

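from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "meta-llama/Llama-2-7b-hf"
model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)

# 4-bit weights, one scale per group of 128 weights
quant_config = {"w_bit": 4, "q_group_size": 128}

# quantize() runs calibration data through the model to find important channels
model.quantize(tokenizer, quant_config=quant_config)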

# AWQ insight: not all weights are equally important.
# A small calibration set identifies input channels with large activation magnitudes.
# The weights on those channels are scaled up before quantisation (and the activations
# scaled down by the same factor), which shrinks their relative quantisation error.
# The remaining channels can tolerate more error.
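
The scaling idea can be sketched without the library. The code below is an illustration, not AutoAWQ's implementation: it assumes per-channel scales of the form s = mean|x|^alpha and picks alpha by grid search on the layer output error. The function names, group size, and synthetic data are all hypothetical.

```python
import numpy as np


def group_fake_quant(W, group=32, bits=4):
    """Asymmetric min/max quantise-dequantise, one grid per group of input weights."""
    levels = 2**bits - 1
    out_f, in_f = W.shape
    Wg = W.reshape(out_f, in_f // group, group)
    w_min = Wg.min(axis=2, keepdims=True)
    w_max = Wg.max(axis=2, keepdims=True)
    scale = np.maximum((w_max - w_min) / levels, 1e-12)
    zero = np.round(-w_min / scale)
    q = np.clip(np.round(Wg / scale) + zero, 0, levels)
    return ((q - zero) * scale).reshape(W.shape)


def awq_scale_search(W, X, n_grid=20):
    """W: (out, in), X: (in, n_tokens). Returns the best alpha and the error for each alpha."""
    act_mag = np.maximum(np.abs(X).mean(axis=1), 1e-4)   # per-input-channel activation size
    ref = W @ X
    results = {}
    for alpha in np.linspace(0.0, 1.0, n_grid + 1):
        s = act_mag ** alpha                  # alpha = 0 means no scaling at all
        W_q = group_fake_quant(W * s)         # scale salient input channels up, then quantise
        out = W_q @ (X / s[:, None])          # fold the inverse scale into the activations
        results[float(alpha)] = float(np.mean((out - ref) ** 2))
    best_alpha = min(results, key=results.get)
    return best_alpha, results


rng = np.random.default_rng(0)
X = rng.normal(size=(128, 512))
X[[7, 40, 99]] *= 15.0                        # a few channels with large activations
W = rng.normal(size=(64, 128))

best_alpha, results = awq_scale_search(W, X)
baseline_err = results[0.0]                   # plain group-wise quantisation
best_err = results[best_alpha]
print(f"alpha=0: {baseline_err:.4f} | best alpha={best_alpha:.2f}: {best_err:.4f}")
```

Because alpha = 0 is part of the grid, the search can never do worse than plain group-wise quantisation on the calibration data.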

GGUF / llama.cpp Quantisation

For CPU and consumer-GPU inference, llama.cpp uses the GGUF file format with its own quantisation schemes:

Q4_K_M: 4-bit k-quant, with selected attention and FFN tensors kept at higher precision
Q5_K_M: 5-bit k-quant, better quality at a larger size
Q8_0:   8-bit, near-lossless

LLaMA 2 7B model sizes:
  fp16:    13.5 GB
  Q8_0:    7.2 GB
  Q4_K_M:  4.1 GB
  Q3_K_M:  3.3 GB
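
The Q8_0 size above can be reproduced from the block layout: 32 int8 values plus one fp16 scale per block, or 8.5 bits per weight. The sketch below mimics that layout in NumPy. It is not the ggml code (rounding of exact ties may differ), the function names are illustrative, and the 6.74B parameter count is an assumption about LLaMA 2 7B.

```python
import numpy as np

BLOCK = 32  # weights per Q8_0 block


def q8_0_quantise(x):
    """Return int8 codes of shape (n_blocks, 32) and one fp16 scale per block."""
    blocks = np.asarray(x, dtype=np.float32).reshape(-1, BLOCK)
    d = np.abs(blocks).max(axis=1, keepdims=True) / 127.0
    safe_d = np.where(d > 0, d, 1.0)            # avoid dividing by zero for all-zero blocks
    q = np.round(blocks / safe_d).astype(np.int8)
    return q, d.astype(np.float16)


def q8_0_dequantise(q, d):
    """Multiply each block's codes by its stored scale."""
    return q.astype(np.float32) * d.astype(np.float32)


bits_per_weight = (BLOCK * 8 + 16) / BLOCK      # 32 int8 codes + one fp16 scale

rng = np.random.default_rng(0)
x = rng.normal(size=4096).astype(np.float32)
q, d = q8_0_quantise(x)
x_hat = q8_0_dequantise(q, d).reshape(-1)
max_err = float(np.abs(x - x_hat).max())

n_params = 6.74e9                               # assumed parameter count for LLaMA 2 7B
est_size_gb = n_params * bits_per_weight / 8 / 1e9
print(f"{bits_per_weight} bits/weight, about {est_size_gb:.2f} GB, max error {max_err:.4f}")
```

The estimate lands close to the 7.2 GB listed above; the real file differs slightly because of metadata and tensors stored at other precisions.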

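Q4_K_M on an Apple M2 Max (96GB unified memory):
  Can hold LLaMA 2 70B (about 41GB of weights) entirely in memory. Generation is
  memory-bandwidth bound: at roughly 400 GB/s the ceiling is about 10 tokens/second,
  so expect single-digit tokens/second in practice.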

Quantisation-Aware Training (QAT)

For maximum quality at low bit-widths, quantise during training:

During training:
  Forward pass: simulate quantisation (quantise then dequantise the weights, so the network sees the rounding error)
  Backward pass: use the straight-through estimator (the gradient treats rounding as the identity)

  Model learns to be robust to quantisation error
  Requires retraining, which is expensive but produces better INT4 models
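
A toy NumPy sketch of the forward/backward scheme described above, using a linear regression in place of an LLM. The problem setup, learning rate, and `fake_quant` helper are assumptions for illustration only:

```python
import numpy as np


def fake_quant(w, bits=4):
    """Quantise then dequantise on a min/max grid recomputed from the current weights."""
    levels = 2**bits - 1
    scale = max(float(w.max() - w.min()) / levels, 1e-8)
    zero = np.round(-w.min() / scale)
    q = np.clip(np.round(w / scale) + zero, 0, levels)
    return (q - zero) * scale


rng = np.random.default_rng(0)
X = rng.normal(size=(512, 16))
w_true = rng.normal(size=16)
y = X @ w_true

w = rng.normal(scale=0.1, size=16)   # latent full-precision weights kept during training
lr = 0.05
losses = []
for step in range(300):
    w_q = fake_quant(w)                        # forward pass uses the quantised weights
    residual = X @ w_q - y
    losses.append(float(np.mean(residual ** 2)))
    grad_wq = 2.0 * X.T @ residual / len(X)    # gradient w.r.t. the quantised weights
    w -= lr * grad_wq                          # STE: apply it to the latent weights unchanged

print(f"loss at step 0: {losses[0]:.3f} | loss at final step: {losses[-1]:.3f}")
```

The loss is always measured with the 4-bit weights, so the latent weights drift towards values that still fit the data after rounding.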

Used when:
  Target deployment is always at INT4 and quality gap matters
  Domain-specific fine-tuned model that must be quantised

Interview Answer

"Quantisation reduces LLM weights from fp16 (2 bytes) to int8 (1 byte) or int4 (0.5 bytes), cutting memory requirements by 2-4×. LLM.int8() handles outlier activation channels in fp16 while quantising the rest to int8 — about 1% quality loss. GPTQ uses second-order weight perturbation to minimise layer-output error during int4 quantisation, achieving near-fp16 quality. AWQ finds activation-important weight channels and protects them. GGUF/llama.cpp formats enable consumer-hardware inference. For production deployments, int4 GPTQ or AWQ quantisation of 7B-13B models is the common sweet spot: fits on a single consumer GPU with acceptable quality."