Quantization: Compressing Model Weights
How quantization reduces LLM memory and speeds up inference by representing weights in fewer bits. Covers INT8, INT4, GPTQ, AWQ, and bitsandbytes QLoRA.
Why Quantization
A 7B parameter model in float32 requires 7B × 4 bytes = 28GB of GPU memory. In float16/bfloat16: 14GB. Most consumer GPUs have 8–24GB.
Quantization reduces precision:
- INT8: 8 bits per weight → 7GB for 7B model (2× smaller than float16)
- INT4: 4 bits per weight → 3.5GB for 7B model (4× smaller than float16)
- INT4 with grouped quantization: Can run 7B models on 4–6GB VRAM
The tradeoff is some loss in accuracy. Modern quantization techniques keep this small: well-tuned 4-bit models often lose less than 1% on most benchmarks, although the exact gap depends on the model and the task.
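The memory arithmetic above can be wrapped in a small helper. This is a minimal sketch: the name `weight_memory_gb` and its parameters are illustrative, it counts weights only (activations and the KV cache need additional memory), and it assumes one 16-bit scale per group when `group_size` is given.

```python
from typing import Optional


def weight_memory_gb(n_params: float, bits_per_weight: int,
                     group_size: Optional[int] = None,
                     scale_bits: int = 16) -> float:
    """Estimate weight storage in GB (1 GB = 1e9 bytes). Weights only."""
    total_bits = n_params * bits_per_weight
    if group_size is not None:
        # Assumption: one scale of `scale_bits` bits per group of weights
        total_bits += (n_params / group_size) * scale_bits
    return total_bits / 8 / 1e9


for label, bits in [("float32", 32), ("float16", 16), ("INT8", 8), ("INT4", 4)]:
    print(f"{label:>8}: {weight_memory_gb(7e9, bits):.1f} GB")
print(f"INT4 + group scales: {weight_memory_gb(7e9, 4, group_size=128):.2f} GB")
```

The last line shows that per-group scales add only about 0.1 GB to a 7B model at group size 128.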
Quantization Fundamentals
Quantization maps floating-point values to a fixed number of discrete levels:
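In symbols, symmetric quantization to b bits can be written as follows. The notation (w for a weight, s for the scale, q for the integer code) is introduced here for illustration and is not taken from a specific paper.

```latex
s = \frac{\max_i |w_i|}{2^{b-1} - 1}, \qquad
q_i = \operatorname{clip}\!\left(\operatorname{round}\!\left(\frac{w_i}{s}\right),\; -(2^{b-1}-1),\; 2^{b-1}-1\right), \qquad
\hat{w}_i = s \, q_i
```

Because of the rounding, each reconstructed weight is off by at most s/2. Going from 8 bits to 4 bits changes the denominator from 127 to 7, so the scale, and with it the worst-case error, grows by a factor of about 18. The NumPy function that follows implements this mapping.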
import numpy as np

def quantize_weights(weights: np.ndarray, bits: int = 8) -> tuple[np.ndarray, float, float]:
    """Simple symmetric quantization (supports up to 8 bits)."""
    n_levels = 2 ** bits
    half_range = n_levels // 2 - 1  # 127 for INT8, 7 for INT4
    # Find scale: map [-max_val, max_val] to [-half_range, half_range]
    max_val = np.abs(weights).max()
    scale = max_val / half_range
    # Quantize to integers (INT4 codes are held in an int8 array here)
    quantized = np.clip(np.round(weights / scale), -half_range, half_range).astype(np.int8)
    # Dequantize (what the model uses during computation)
    dequantized = quantized.astype(np.float32) * scale
    error = np.abs(weights - dequantized).mean()
    return quantized, scale, error

# Example
weights = (np.random.randn(1000) * 0.1).astype(np.float32)  # Typical weight distribution, stored as float32
q8, scale8, err8 = quantize_weights(weights, bits=8)
q4, scale4, err4 = quantize_weights(weights, bits=4)
print(f"INT8 mean absolute error: {err8:.6f}")
print(f"INT4 mean absolute error: {err4:.6f}")
print(f"Error increase: {err4/err8:.1f}×")

# Storage (relative to float32)
print(f"Original: {weights.nbytes} bytes")
print(f"INT8: {q8.nbytes} bytes ({weights.nbytes/q8.nbytes:.0f}× compressed)")
# Two 4-bit codes fit in one byte once packed
print(f"INT4 (packed): {q8.nbytes // 2} bytes ({weights.nbytes/(q8.nbytes//2):.0f}× compressed)")
Grouped Quantization
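Naive quantization uses one scale per weight matrix, so a single outlier stretches the scale and coarsens the grid for every other weight. Grouped quantization instead stores one scale per small group of consecutive weights (128 is a common group size), so an outlier only affects its own group and the overall quantization error drops substantially.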
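A rough illustration of the effect, under assumed synthetic data: the sketch below quantizes a weight vector containing a few large outliers, once with a single scale and once with one scale per group of 128. The helper `symmetric_dequant` and the outlier positions are made up for this example.

```python
import numpy as np


def symmetric_dequant(w: np.ndarray, bits: int) -> np.ndarray:
    """Quantize then dequantize, using one symmetric scale per row (last axis)."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max(axis=-1, keepdims=True) / qmax
    return np.clip(np.round(w / scale), -qmax, qmax) * scale


rng = np.random.default_rng(0)
w = (rng.standard_normal(4096) * 0.02).astype(np.float32)
w[[10, 900, 2500, 4000]] = [1.0, -1.0, 0.8, -0.9]  # hypothetical outliers

# One scale for the whole tensor vs. one scale per group of 128 weights
per_tensor = symmetric_dequant(w, bits=4)
per_group = symmetric_dequant(w.reshape(-1, 128), bits=4).reshape(-1)

err_tensor = np.abs(w - per_tensor).mean()
err_group = np.abs(w - per_group).mean()
print(f"per-tensor MAE: {err_tensor:.5f}")
print(f"per-group  MAE: {err_group:.5f}")
```

With a single scale, the outliers force most of the small weights onto the zero level; with per-group scales only the four groups that contain an outlier pay that price. The function that follows applies the same per-group idea with an explicit loop.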
def grouped_quantize(weights: np.ndarray, bits: int = 4, group_size: int = 128) -> dict:
    """Quantize a 1-D weight array with one symmetric scale per group."""
    n = len(weights)
    n_groups = (n + group_size - 1) // group_size
    half_range = 2 ** bits // 2 - 1  # 7 for INT4
    quantized_groups = []
    scales = []
    for g in range(n_groups):
        start = g * group_size
        end = min((g + 1) * group_size, n)
        group = weights[start:end]
        # Each group has its own scale; the epsilon avoids division by zero
        max_val = np.abs(group).max() + 1e-8
        scale = max_val / half_range
        scales.append(scale)
        # Clip symmetrically so the integer range matches the scale
        q = np.clip(np.round(group / scale), -half_range, half_range).astype(np.int8)
        quantized_groups.append(q)
    return {
        "quantized": np.concatenate(quantized_groups),
        "scales": np.array(scales),
        "group_size": group_size,
    }

# Grouped quantization has overhead: one scale value (float16) per group
# For 4B weights with group_size=128: 4B/128 × 2 bytes = 62.5MB overhead
# vs original 4B × 2 bytes = 8GB → overhead is less than 1% of the float16 size
# (about 3% of the 2GB taken by the 4-bit weights themselves)
GPTQ: Post-Training Quantization
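GPTQ (Frantar et al., 2022) is a widely used method for aggressive 4-bit post-training quantization of pretrained LLMs. It quantizes each layer's weights column by column and uses approximate second-order information (the Hessian of the layer's output error, estimated from a small calibration set) to adjust the not-yet-quantized weights so that they compensate for the error already introduced.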
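The error-compensation idea can be sketched in a few lines of NumPy. This is a simplified toy version and not the AutoGPTQ implementation: it handles a single linear layer, uses one scale per output row, inverts the Hessian directly, and runs on synthetic correlated inputs. The function names `quantize_rtn` and `gptq_like_quantize` are made up for this example.

```python
import numpy as np


def quantize_rtn(w, scale, qmax):
    """Round-to-nearest onto the symmetric grid {-qmax, ..., qmax} * scale."""
    return np.clip(np.round(w / scale), -qmax, qmax) * scale


def gptq_like_quantize(W, X, bits=4, damp=0.01):
    """Quantize W (d_out x d_in) one input column at a time, pushing each
    column's rounding error onto the columns that are not yet quantized."""
    W = W.astype(np.float64).copy()
    d_in = W.shape[1]
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(W).max(axis=1, keepdims=True) / qmax  # one scale per row
    H = X @ X.T                                          # Hessian proxy, (d_in, d_in)
    H += damp * np.mean(np.diag(H)) * np.eye(d_in)       # damping for stability
    Hinv = np.linalg.inv(H)
    Q = np.zeros_like(W)
    for i in range(d_in):
        q = quantize_rtn(W[:, i:i + 1], scale, qmax)
        Q[:, i:i + 1] = q
        err = (W[:, i:i + 1] - q) / Hinv[i, i]
        # Compensate on the remaining columns
        W[:, i + 1:] -= err @ Hinv[i:i + 1, i + 1:]
        # Remove column i from the inverse Hessian
        Hinv[i + 1:, i + 1:] -= np.outer(Hinv[i + 1:, i], Hinv[i, i + 1:]) / Hinv[i, i]
    return Q


rng = np.random.default_rng(0)
d_out, d_in, n = 8, 32, 512
W = rng.standard_normal((d_out, d_in)) * 0.1
# Synthetic inputs with correlated features; compensation relies on correlation
X = np.zeros((d_in, n))
X[0] = rng.standard_normal(n)
for i in range(1, d_in):
    X[i] = 0.9 * X[i - 1] + np.sqrt(1 - 0.9 ** 2) * rng.standard_normal(n)

qmax = 2 ** (4 - 1) - 1
scale = np.abs(W).max(axis=1, keepdims=True) / qmax
Q_rtn = quantize_rtn(W, scale, qmax)
Q_gptq = gptq_like_quantize(W, X, bits=4)
err_rtn = np.linalg.norm(W @ X - Q_rtn @ X)
err_gptq = np.linalg.norm(W @ X - Q_gptq @ X)
print(f"round-to-nearest output error: {err_rtn:.3f}")
print(f"GPTQ-style output error:       {err_gptq:.3f}")
```

Both variants use the same 4-bit grid; the difference is only in which grid point each weight lands on. The published algorithm adds blocked updates and a Cholesky-based formulation for speed and numerical stability, which the AutoGPTQ snippet that follows relies on.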
# Using GPTQ with AutoGPTQ library
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
from transformers import AutoTokenizer

model_name = "meta-llama/Meta-Llama-3-8B"  # Hub ID of the Llama 3 8B base model (gated repository)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Calibration data: a small representative dataset
calibration_texts = [
    "The mechanism of warfarin involves inhibition of vitamin K epoxide reductase.",
    "Metformin activates AMPK and reduces hepatic glucose production.",
    # ... 128 calibration examples
]
# Each example provides input_ids and attention_mask tensors
calibration_data = [
    tokenizer(text, return_tensors="pt") for text in calibration_texts
]

# Quantization config
quantize_config = BaseQuantizeConfig(
    bits=4,          # 4-bit quantization
    group_size=128,  # Per-group scales
    desc_act=True,   # Quantize columns in order of decreasing activation size (better quality, can slow inference)
)

# Load the full-precision model and quantize it
model = AutoGPTQForCausalLM.from_pretrained(model_name, quantize_config)
model.quantize(calibration_data, batch_size=4)

# Save quantized model
model.save_quantized("./llama3-8b-4bit-gptq", use_safetensors=True)

# Load and use quantized model
quantized_model = AutoGPTQForCausalLM.from_quantized(
    "./llama3-8b-4bit-gptq",
    device="cuda:0",
    use_safetensors=True,
)
bitsandbytes: QLoRA Integration
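bitsandbytes provides INT8 and NF4 (4-bit NormalFloat) quantization with direct Hugging Face integration. Unlike GPTQ and AWQ it needs no calibration data: weights are quantized on the fly while the model loads. NF4 is the data type used in QLoRA.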
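The idea behind a normal-distribution-aware 4-bit type can be sketched as follows. This is a simplified illustration and not the exact NF4 codebook from bitsandbytes (the real one, among other differences, includes an exact zero): the 16 levels are placed at evenly spaced quantiles of a standard normal, and each block of 64 weights is normalized by its absolute maximum. The function names are made up for this example.

```python
from statistics import NormalDist

import numpy as np


def normal_quantile_codebook(bits: int = 4) -> np.ndarray:
    """Levels at evenly spaced quantiles of N(0, 1), rescaled to [-1, 1]."""
    k = 2 ** bits
    probs = (np.arange(k) + 0.5) / k
    levels = np.array([NormalDist().inv_cdf(float(p)) for p in probs])
    return levels / np.abs(levels).max()


def quantize_blockwise(w: np.ndarray, codebook: np.ndarray, block_size: int = 64) -> np.ndarray:
    """Absmax-normalize each block, snap to the nearest codebook level, rescale."""
    blocks = w.reshape(-1, block_size)
    absmax = np.abs(blocks).max(axis=1, keepdims=True)
    normalized = blocks / absmax
    idx = np.abs(normalized[:, :, None] - codebook[None, None, :]).argmin(axis=2)
    return (codebook[idx] * absmax).reshape(w.shape)


rng = np.random.default_rng(0)
w = rng.standard_normal(16384) * 0.02        # normally distributed toy weights

normal_cb = normal_quantile_codebook(4)
uniform_cb = np.linspace(-1.0, 1.0, 16)      # evenly spaced 4-bit grid for comparison

err_normal = np.abs(w - quantize_blockwise(w, normal_cb)).mean()
err_uniform = np.abs(w - quantize_blockwise(w, uniform_cb)).mean()
print(f"quantile codebook MAE: {err_normal:.6f}")
print(f"uniform codebook MAE:  {err_uniform:.6f}")
```

The quantile codebook spends more of its 16 levels near zero, where most normally distributed weights lie. The library itself is used through transformers, as in the configuration that follows.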
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
import torch

# NF4 quantization config (used in QLoRA)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,  # Dequantize to BF16 for compute
    bnb_4bit_use_double_quant=True,         # Quantize the scales too (nested quantization)
    bnb_4bit_quant_type="nf4",              # NF4 data type (designed for normally distributed weights)
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B",  # Hub ID of the Llama 3 8B base model (gated repository)
    quantization_config=bnb_config,
    device_map="auto",
)

# Weights are stored as 4-bit NF4 on GPU
# During forward pass: automatically dequantized to BF16 for computation
# In QLoRA, LoRA adapters are trained in higher precision (typically BF16) on top of the frozen quantized base
print(f"Model memory footprint: {model.get_memory_footprint() / 1e9:.2f} GB")
# LLaMA-3-8B in NF4: roughly 4.5 GB vs roughly 15 GB in BF16 (approximate figures)
AWQ: Activation-Aware Weight Quantization
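AWQ (Lin et al., 2023) observes that not all weights are equally important: weights that correspond to input channels with large activations are more sensitive to quantization error. AWQ protects these salient weights by scaling them up before quantization and applying the inverse scale on the activation side, so their relative rounding error shrinks while every weight stays in the same low-bit format.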
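A toy version of this scaling search is sketched below. It is a simplification under stated assumptions and not the AutoAWQ implementation: a single linear layer, symmetric per-row quantization instead of grouped zero-point quantization, synthetic activations with two artificially large channels, and per-channel scales of the form `mean|x| ** alpha` chosen by grid search. The function names are made up for this example.

```python
import numpy as np


def rtn_per_row(W: np.ndarray, bits: int = 4) -> np.ndarray:
    """Symmetric round-to-nearest with one scale per output row."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(W).max(axis=1, keepdims=True) / qmax
    return np.clip(np.round(W / scale), -qmax, qmax) * scale


def awq_like_quantize(W, X, bits=4, n_grid=11):
    """Grid-search alpha; returns (best_error, best_alpha, effective_weights)."""
    act_mag = np.abs(X).mean(axis=1)          # average magnitude per input channel
    reference = W @ X
    best = (np.inf, None, None)
    for alpha in np.linspace(0.0, 1.0, n_grid):
        s = act_mag ** alpha                  # alpha = 0 means no scaling at all
        s = s / np.sqrt(s.max() * s.min())    # keep the scales centered around 1
        # Scale salient channels up, quantize, then fold the scale back
        W_hat = rtn_per_row(W * s, bits) / s
        err = np.linalg.norm(reference - W_hat @ X)
        if err < best[0]:
            best = (err, float(alpha), W_hat)
    return best


rng = np.random.default_rng(0)
W = rng.standard_normal((32, 64)) * 0.1
X = rng.standard_normal((64, 256))
X[[3, 17]] *= 30.0                            # two hypothetical high-activation channels

err_rtn = np.linalg.norm(W @ X - rtn_per_row(W) @ X)
err_awq, best_alpha, W_awq = awq_like_quantize(W, X)
print(f"plain round-to-nearest error: {err_rtn:.3f}")
print(f"activation-aware error:       {err_awq:.3f} (alpha={best_alpha:.1f})")
```

Because the grid includes alpha = 0, which reproduces plain round-to-nearest, the search can never do worse than the unscaled baseline on the calibration inputs. In AutoAWQ the search and scaling happen inside `model.quantize`, as in the snippet that follows.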
# Using AutoAWQ library
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_name = "meta-llama/Meta-Llama-3-8B"  # Hub ID of the Llama 3 8B base model (gated repository)
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)

# Calibration data: plain text strings (not chat-message dicts)
calibration_data = [
    "Explain the mechanism of warfarin.",
    # ... more representative texts; a single short string is too little in practice
]

model = AutoAWQForCausalLM.from_pretrained(model_name)
quant_config = {
    "zero_point": True,   # Asymmetric quantization
    "q_group_size": 128,  # Per-group scales
    "w_bit": 4,           # 4-bit weights
    "version": "GEMM",    # Kernel variant
}
model.quantize(tokenizer, quant_config=quant_config, calib_data=calibration_data)
model.save_quantized("./llama3-8b-awq")
tokenizer.save_pretrained("./llama3-8b-awq")  # Keep the tokenizer next to the weights

# AWQ is often reported to match or exceed GPTQ quality at the same bit width
# and tends to be faster at inference (optimized GEMM kernels)
Quantization Comparison
# Benchmark: LLaMA-3-8B quality vs memory
# (approximate figures; throughput in particular depends on the GPU, kernels, and batch size)
results = {
    "BF16 (baseline)": {"memory_gb": 15.0, "perplexity": 8.2, "tokens_per_sec": 50},
    "INT8 (bitsandbytes)": {"memory_gb": 8.0, "perplexity": 8.3, "tokens_per_sec": 45},
    "NF4 (QLoRA/bnb)": {"memory_gb": 4.5, "perplexity": 8.5, "tokens_per_sec": 55},
    "GPTQ-4bit": {"memory_gb": 4.3, "perplexity": 8.4, "tokens_per_sec": 80},
    "AWQ-4bit": {"memory_gb": 4.3, "perplexity": 8.35, "tokens_per_sec": 90},
}

print(f"{'Method':<25} {'Memory':>10} {'Perplexity':>12} {'Speed':>12}")
print("-" * 60)
for method, stats in results.items():
    print(f"{method:<25} {stats['memory_gb']:>9.1f}GB {stats['perplexity']:>12.2f} {stats['tokens_per_sec']:>10}t/s")
Choosing a Quantization Method
| Need | Method |
|---|---|
| Fine-tune a quantized base model | bitsandbytes NF4 (QLoRA) |
| Fast inference, maximum quality | AWQ-4bit |
| Industry standard compatibility | GPTQ-4bit |
| Moderate compression, minimal setup | INT8 (load_in_8bit=True) |
| CPU inference | GGUF (llama.cpp format) |
For production inference on GPU, AWQ-4bit with optimized kernels (vLLM or TGI) achieves the best throughput-quality tradeoff. For fine-tuning on limited hardware, QLoRA with NF4 is the standard approach.