Quantization: Compressing Model Weights

Why Quantization

A 7B-parameter model in float32 requires 7B × 4 bytes = 28GB of GPU memory for the weights alone. In float16/bfloat16 that drops to 14GB. Most consumer GPUs have 8–24GB.

Quantization reduces precision:

INT8: 8 bits per weight → 7GB for 7B model (2× smaller than float16)
INT4: 4 bits per weight → 3.5GB for 7B model (4× smaller than float16)
INT4 with grouped quantization: Can run 7B models on 4–6GB VRAM
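
The arithmetic above can be reproduced with a few lines of Python. This is a minimal sketch that counts weight storage only; the helper name `weight_memory_gb` is made up for illustration, and it ignores activations, the KV cache, and per-group scale overhead.

```python
def weight_memory_gb(n_params: float, bits_per_weight: float) -> float:
    """Memory for the weights alone, in decimal gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


# Same 7B-parameter example as in the text
for name, bits in [("float32", 32), ("float16/bfloat16", 16), ("INT8", 8), ("INT4", 4)]:
    print(f"{name:<18} {weight_memory_gb(7e9, bits):5.1f} GB")
```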

The tradeoff is some loss in accuracy. Modern quantization techniques keep this small: well-calibrated 4-bit models often lose less than 1% on most benchmarks, although the gap varies with model size and task.

Quantization Fundamentals

Quantization maps floating-point values to a fixed number of discrete levels:

Python

import numpy as np

def quantize_weights(weights: np.ndarray, bits: int = 8) -> tuple[np.ndarray, float, float]:
    """Simple symmetric quantization."""
    n_levels = 2 ** bits
    half_range = n_levels // 2 - 1

    # Find scale: map [-max_val, max_val] to [-half_range, half_range]
    # (±127 for 8 bits, ±7 for 4 bits); the epsilon avoids a zero scale
    max_val = np.abs(weights).max() + 1e-8
    scale = max_val / half_range

    # Quantize to integers
    quantized = np.clip(np.round(weights / scale), -half_range, half_range).astype(np.int8)

    # Dequantize (what the model uses during computation)
    dequantized = quantized.astype(np.float32) * scale

    error = np.abs(weights - dequantized).mean()
    return quantized, scale, error

# Example
weights = (np.random.randn(1000) * 0.1).astype(np.float32)  # Typical weight distribution, stored as float32
q8, scale8, err8 = quantize_weights(weights, bits=8)
q4, scale4, err4 = quantize_weights(weights, bits=4)

print(f"INT8 mean absolute error: {err8:.6f}")
print(f"INT4 mean absolute error: {err4:.6f}")
print(f"Error increase: {err4/err8:.1f}×")

# Storage (q4 holds one 4-bit value per int8 element; packing two per byte halves it)
print(f"Original (float32): {weights.nbytes} bytes")
print(f"INT8: {q8.nbytes} bytes ({weights.nbytes/q8.nbytes:.0f}× compressed)")
print(f"INT4 (packed): {q4.nbytes // 2} bytes ({weights.nbytes/(q4.nbytes//2):.0f}× compressed)")
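
The "INT4 (packed)" figure assumes two 4-bit values share one byte, which NumPy does not do on its own. The following is a minimal sketch of one possible packing scheme (low nibble first, two's complement); the function names `pack_int4` and `unpack_int4` are hypothetical, and real inference kernels use their own layouts.

```python
import numpy as np


def pack_int4(q: np.ndarray) -> np.ndarray:
    """Pack signed 4-bit values (range [-8, 7]) two per byte, low nibble first."""
    assert q.size % 2 == 0, "pad to an even length before packing"
    # Keep the low 4 bits of the two's-complement representation
    nibbles = (q.astype(np.int16) & 0x0F).astype(np.uint8)
    return nibbles[0::2] | (nibbles[1::2] << 4)


def unpack_int4(packed: np.ndarray) -> np.ndarray:
    """Inverse of pack_int4: recover the signed 4-bit values as int8."""
    out = np.empty(packed.size * 2, dtype=np.int8)
    out[0::2] = (packed & 0x0F).astype(np.int8)
    out[1::2] = (packed >> 4).astype(np.int8)
    # Nibbles 8..15 encode the negative values -8..-1
    return np.where(out > 7, out - 16, out).astype(np.int8)


q = np.array([-7, -1, 0, 3, 7, -4], dtype=np.int8)
packed = pack_int4(q)
print(f"{q.nbytes} bytes unpacked, {packed.nbytes} bytes packed")
```

Unpacking costs a few bit operations per weight, which is why 4-bit formats rely on fused kernels that dequantize on the fly.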

Grouped Quantization

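Naive quantization uses one scale per weight matrix, so a single outlier stretches the range for every weight. Grouped quantization uses one scale per small group of weights (commonly 128), which confines each outlier's effect to its own group and substantially reduces quantization error: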

Python

def grouped_quantize(weights: np.ndarray, bits: int = 4, group_size: int = 128) -> dict:
    """Quantize with per-group scales."""
    n = len(weights)
    n_groups = (n + group_size - 1) // group_size
    n_levels = 2 ** bits

    quantized_groups = []
    scales = []

    for g in range(n_groups):
        start = g * group_size
        end = min((g + 1) * group_size, n)
        group = weights[start:end]

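        # Each group has its own scale; the epsilon guards against all-zero groups
        max_val = np.abs(group).max() + 1e-8
        qmax = n_levels // 2 - 1  # 7 for 4-bit, 127 for 8-bit
        scale = max_val / qmax
        scales.append(scale)

        # Symmetric range [-qmax, qmax], consistent with the scale above
        q = np.clip(np.round(group / scale), -qmax, qmax).astype(np.int8)
        quantized_groups.append(q)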

    return {
        "quantized": np.concatenate(quantized_groups),
        "scales": np.array(scales),
        "group_size": group_size,
    }

# Grouped quantization has overhead: one float16 scale per group
# For 4B weights with group_size=128: 4B/128 × 2 bytes = 62.5MB overhead
# vs original 4B × 2 bytes = 8GB in float16 → overhead is less than 1%
# (about 3% of the 2GB 4-bit payload, i.e. 16/128 = 0.125 extra bits per weight)
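
To see the effect of group size on error, the sketch below quantizes the same synthetic weights once with a single scale and once with one scale per 128 weights. The data is artificial (normal weights plus a few planted outliers), and `symmetric_quant_error` is a hypothetical helper, so the numbers only illustrate the trend.

```python
import numpy as np


def symmetric_quant_error(w: np.ndarray, bits: int, group_size: int) -> float:
    """Mean absolute round-trip error with one symmetric scale per group."""
    qmax = 2 ** (bits - 1) - 1
    groups = w.reshape(-1, group_size)
    scale = np.abs(groups).max(axis=1, keepdims=True) / qmax + 1e-12
    dequantized = np.clip(np.round(groups / scale), -qmax, qmax) * scale
    return float(np.abs(groups - dequantized).mean())


rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.1, size=4096)
w[::512] = 3.0  # plant a few artificial outliers

err_tensor = symmetric_quant_error(w, bits=4, group_size=w.size)  # one scale overall
err_group = symmetric_quant_error(w, bits=4, group_size=128)      # one scale per 128

print(f"Per-tensor scale error: {err_tensor:.4f}")
print(f"Per-group scale error:  {err_group:.4f}")
print(f"Effective bits per weight with float16 scales: {4 + 16 / 128:.3f}")
```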

GPTQ: Post-Training Quantization

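GPTQ (Frantar et al., 2022) is a widely used method for aggressive 4-bit post-training quantization of pretrained LLMs. It uses approximate second-order information (a per-layer Hessian estimated from calibration inputs) to quantize weights one column at a time and adjust the remaining weights to compensate for the rounding error: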

Python

# Using GPTQ with AutoGPTQ library
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
from transformers import AutoTokenizer

model_name = "meta-llama/Meta-Llama-3-8B"  # Gated repo: requires accepting the license on the Hub
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Calibration data: a small representative dataset
calibration_texts = [
    "The mechanism of warfarin involves inhibition of vitamin K epoxide reductase.",
    "Metformin activates AMPK and reduces hepatic glucose production.",
    # ... 128 calibration examples
]
calibration_data = [
    tokenizer(text, return_tensors="pt") for text in calibration_texts
]

# Quantization config
quantize_config = BaseQuantizeConfig(
    bits=4,           # 4-bit quantization
    group_size=128,   # Per-group scales
    desc_act=True,    # Process columns in descending activation order (better quality, slower inference)
)

# Load and quantize
model = AutoGPTQForCausalLM.from_pretrained(model_name, quantize_config)
model.quantize(calibration_data, batch_size=4)

# Save quantized model
model.save_quantized("./llama3-8b-4bit-gptq", use_safetensors=True)

# Load and use quantized model
quantized_model = AutoGPTQForCausalLM.from_quantized(
    "./llama3-8b-4bit-gptq",
    device="cuda:0",
    use_safetensors=True,
)
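
The library call hides the second-order step, so here is a simplified NumPy sketch of the core idea: quantize one input column at a time and push the rounding error onto the columns not yet quantized, weighted by the inverse Hessian. It is an illustration under stated assumptions, not AutoGPTQ's implementation: it uses one scale per output row, no grouping, no blocking, no activation reordering, and the function name `gptq_like_quantize` and the synthetic data are made up.

```python
import numpy as np


def gptq_like_quantize(W: np.ndarray, X: np.ndarray, bits: int = 4, damp: float = 0.01):
    """W: (out_features, in_features); X: (in_features, n_samples) calibration inputs."""
    W = W.astype(np.float64).copy()
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(W).max(axis=1, keepdims=True) / qmax  # one scale per output row

    # Hessian of the layer-wise squared error, with damping for numerical stability
    H = 2.0 * X @ X.T
    H += damp * np.mean(np.diag(H)) * np.eye(H.shape[0])
    Hinv = np.linalg.inv(H)
    Hinv = (Hinv + Hinv.T) / 2          # remove round-off asymmetry
    U = np.linalg.cholesky(Hinv).T      # upper factor: Hinv = U.T @ U

    Q = np.zeros_like(W)
    for i in range(W.shape[1]):
        col = W[:, i]
        q = np.clip(np.round(col / scale[:, 0]), -qmax, qmax) * scale[:, 0]
        Q[:, i] = q
        # Spread this column's rounding error over the remaining columns
        err = (col - q) / U[i, i]
        W[:, i + 1:] -= np.outer(err, U[i, i + 1:])
    return Q, scale


rng = np.random.default_rng(0)
W = rng.normal(0.0, 0.1, size=(16, 64))
# Correlated calibration inputs: a low-rank signal plus noise
X = rng.normal(size=(64, 8)) @ rng.normal(size=(8, 256)) + 0.1 * rng.normal(size=(64, 256))

Q, scale = gptq_like_quantize(W, X)
rtn = np.clip(np.round(W / scale), -7, 7) * scale  # plain round-to-nearest baseline
print("Output error, round-to-nearest:", np.linalg.norm(W @ X - rtn @ X))
print("Output error, GPTQ-like:       ", np.linalg.norm(W @ X - Q @ X))
```

The comparison is on layer outputs rather than on the weights themselves, because that is the quantity the Hessian-based update is designed to preserve.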

bitsandbytes: QLoRA Integration

bitsandbytes provides INT8 and NF4 (Normal Float 4-bit) quantization with direct HuggingFace integration, used in QLoRA:

Python

from transformers import AutoModelForCausalLM, BitsAndBytesConfig
import torch

# NF4 quantization config (used in QLoRA)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,  # Dequantize to BF16 for compute
    bnb_4bit_use_double_quant=True,          # Quantize the scales too (nested quantization)
    bnb_4bit_quant_type="nf4",              # NF4 data type (optimal for normal distributions)
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B",
    quantization_config=bnb_config,
    device_map="auto",
)

# Weights are stored as 4-bit NF4 on GPU
# During forward pass: automatically dequantized to BF16 for computation
# LoRA adapters are trained in higher precision (typically BF16) on top of the frozen quantized base

print(f"Model memory footprint: {model.get_memory_footprint() / 1e9:.2f} GB")
# LLaMA-3-8B in NF4: ~4.5 GB vs ~15 GB in BF16
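
The idea behind a "normal float" data type can be sketched without bitsandbytes: place the 16 levels at quantiles of a normal distribution instead of spacing them evenly, then store each weight as the index of its nearest level plus one absmax scale per block. The codebook below is an illustrative construction, not the exact NF4 table (which is built asymmetrically so that zero is represented exactly), and all function names are hypothetical.

```python
from statistics import NormalDist

import numpy as np


def normal_quantile_codebook(n_levels: int = 16) -> np.ndarray:
    """Levels at evenly spaced quantiles of N(0, 1), rescaled to [-1, 1]."""
    probs = (np.arange(n_levels) + 0.5) / n_levels
    levels = np.array([NormalDist().inv_cdf(float(p)) for p in probs])
    return levels / np.abs(levels).max()


def codebook_quantize(w: np.ndarray, codebook: np.ndarray, block_size: int = 64):
    """Return the nearest-level index per weight and one absmax scale per block."""
    blocks = w.reshape(-1, block_size)
    absmax = np.abs(blocks).max(axis=1, keepdims=True) + 1e-12
    normalized = blocks / absmax  # now within [-1, 1]
    idx = np.abs(normalized[..., None] - codebook).argmin(axis=-1)
    return idx.astype(np.uint8), absmax


def codebook_dequantize(idx: np.ndarray, absmax: np.ndarray, codebook: np.ndarray):
    return codebook[idx] * absmax


rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.1, size=1024)
codebook = normal_quantile_codebook()
idx, absmax = codebook_quantize(w, codebook)
w_hat = codebook_dequantize(idx, absmax, codebook).reshape(-1)
print(f"Mean absolute error: {np.abs(w - w_hat).mean():.5f}")
```

Double quantization, enabled above with `bnb_4bit_use_double_quant=True`, would additionally compress the `absmax` values; that step is omitted here.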

AWQ: Activation-Aware Weight Quantization

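AWQ (Lin et al., 2023) observes that not all weights are equally important: weights that correspond to channels with large activations are more sensitive to quantization error. AWQ protects these weights by scaling the salient channels up before quantization and scaling the matching activations down, which leaves the layer's output unchanged in full precision: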

Python

# Using AutoAWQ library
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_name = "meta-llama/Meta-Llama-3-8B"  # Gated repo: requires accepting the license on the Hub
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)

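# Calibration data: AutoAWQ expects a dataset name or a list of plain-text strings.
# A real run needs many longer samples; this single sentence is only a placeholder.
calibration_data = [
    "Explain the mechanism of warfarin.",
]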

model = AutoAWQForCausalLM.from_pretrained(model_name)

quant_config = {
    "zero_point": True,  # Asymmetric quantization
    "q_group_size": 128,
    "w_bit": 4,
    "version": "GEMM",
}

model.quantize(tokenizer, quant_config=quant_config, calib_data=calibration_data)
model.save_quantized("./llama3-8b-awq")

# AWQ is often reported to match or exceed GPTQ quality at the same bit width,
# and its optimized GEMM kernels tend to make inference fast; results vary by model and hardware
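
The activation-aware scaling idea can be illustrated in NumPy. This is a simplified sketch rather than AutoAWQ's algorithm: the scale is a fixed power of the mean activation magnitude (AWQ searches over this exponent), quantization uses one symmetric scale per output row, and the names `quantize_rows` and `awq_style_scales` and the synthetic data are assumptions made for the example.

```python
import numpy as np


def quantize_rows(W: np.ndarray, bits: int = 4) -> np.ndarray:
    """Symmetric round-trip quantization with one scale per output row."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(W).max(axis=1, keepdims=True) / qmax + 1e-12
    return np.clip(np.round(W / scale), -qmax, qmax) * scale


def awq_style_scales(X: np.ndarray, alpha: float = 0.5) -> np.ndarray:
    """Per-input-channel scale derived from mean activation magnitude."""
    return (np.abs(X).mean(axis=1) + 1e-8) ** alpha


rng = np.random.default_rng(0)
W = rng.normal(0.0, 0.1, size=(32, 128))   # (out_features, in_features)
X = rng.normal(size=(128, 512))            # (in_features, n_samples)
X[:4] *= 30.0                              # a few channels with large activations

s = awq_style_scales(X)
W_scaled = W * s[None, :]                  # scale weight columns up
X_scaled = X / s[:, None]                  # scale matching activations down

Y = W @ X
err_plain = np.linalg.norm(Y - quantize_rows(W) @ X)
err_scaled = np.linalg.norm(Y - quantize_rows(W_scaled) @ X_scaled)
print(f"Output error without scaling: {err_plain:.3f}")
print(f"Output error with scaling:    {err_scaled:.3f}")
```

Before quantization the scaled and unscaled products are identical, so any difference between the two printed errors comes purely from how rounding error is distributed across channels. Whether scaling helps on a given layer depends on the exponent and the data.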

Quantization Comparison

Python

# Benchmark: LLaMA-3-8B quality vs memory
# (approximate figures; actual numbers depend on GPU, kernels, batch size, and evaluation set)
results = {
    "BF16 (baseline)": {"memory_gb": 15.0, "perplexity": 8.2, "tokens_per_sec": 50},
    "INT8 (bitsandbytes)": {"memory_gb": 8.0, "perplexity": 8.3, "tokens_per_sec": 45},
    "NF4 (QLoRA/bnb)": {"memory_gb": 4.5, "perplexity": 8.5, "tokens_per_sec": 55},
    "GPTQ-4bit": {"memory_gb": 4.3, "perplexity": 8.4, "tokens_per_sec": 80},
    "AWQ-4bit": {"memory_gb": 4.3, "perplexity": 8.35, "tokens_per_sec": 90},
}

print(f"{'Method':<25} {'Memory':>10} {'Perplexity':>12} {'Speed':>12}")
print("-" * 60)
for method, stats in results.items():
    print(f"{method:<25} {stats['memory_gb']:>9.1f}GB {stats['perplexity']:>12.2f} {stats['tokens_per_sec']:>10}t/s")
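
For readers unfamiliar with the quality column: perplexity is the exponential of the average negative log-likelihood per token, so lower is better and small differences (8.2 vs 8.4) mean the quantized model assigns nearly the same probabilities as the baseline. A minimal sketch, assuming natural-log token probabilities are already available from an evaluation run:

```python
import math


def perplexity(token_log_probs: list[float]) -> float:
    """exp of the mean negative log-likelihood per token (natural log)."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


# A model that gives every token probability 0.25 has perplexity of about 4
print(perplexity([math.log(0.25)] * 4))
```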

Choosing a Quantization Method

| Need | Method |
|---|---|
| Fine-tune a quantized base model | bitsandbytes NF4 (QLoRA) |
| Fast inference, maximum quality | AWQ-4bit |
| Industry standard compatibility | GPTQ-4bit |
| Moderate compression, minimal setup | INT8 (load_in_8bit=True) |
| CPU inference | GGUF (llama.cpp format) |

For production inference on GPU, AWQ-4bit served through an engine with optimized kernels (vLLM or TGI) is a strong default for the throughput-quality tradeoff. For fine-tuning on limited hardware, QLoRA with NF4 is the standard approach.
