Quantization: Compressing Model Weights

How quantization reduces LLM memory and speeds up inference by representing weights in fewer bits. Covers INT8, INT4, GPTQ, AWQ, and bitsandbytes QLoRA.

Asma Hafeez Khan · May 16, 2026 · 5 min read
Tags: Transformers, Quantization, Inference, Optimization

Why Quantization

A 7B-parameter model in float32 needs 7B × 4 bytes = 28 GB of GPU memory for the weights alone. In float16/bfloat16 that drops to 14 GB. Most consumer GPUs have 8–24 GB.

Quantization reduces precision:

  • INT8: 8 bits per weight → 7GB for 7B model (2× smaller than float16)
  • INT4: 4 bits per weight → 3.5GB for 7B model (4× smaller than float16)
  • INT4 with grouped quantization: with per-group scales and runtime overhead included, 7B models can run on 4–6 GB of VRAM

The tradeoff is some loss in accuracy. Modern quantization techniques keep this loss small: 4-bit quantized models often lose less than 1% on most benchmarks, though the gap varies by model and task.
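
The memory figures above follow from simple arithmetic, which the short sketch below reproduces. The helper name `weight_memory_gb` is made up for this illustration, and it counts weights only (no activations, KV cache, or quantization scales).

```python
def weight_memory_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes), weights only."""
    return n_params * bits_per_weight / 8 / 1e9

# Same 7B model at the precisions discussed above
for name, bits in [("float32", 32), ("float16/bfloat16", 16), ("INT8", 8), ("INT4", 4)]:
    print(f"{name:<18} {weight_memory_gb(7e9, bits):5.1f} GB")
```

Real deployments need headroom on top of these numbers, which is why a 3.5 GB INT4 model is usually quoted as needing 4–6 GB of VRAM.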


Quantization Fundamentals

Quantization maps floating-point values to a fixed number of discrete levels:

Python
import numpy as np

def quantize_weights(weights: np.ndarray, bits: int = 8) -> tuple[np.ndarray, float, float]:
    """Simple symmetric quantization."""
    n_levels = 2 ** bits
    half_range = n_levels // 2 - 1

    # Find scale: map [-max_val, max_val] to [-half_range, half_range] (±127 for INT8, ±7 for INT4)
    max_val = np.abs(weights).max()
    scale = max_val / half_range

    # Quantize to integers
    quantized = np.clip(np.round(weights / scale), -half_range, half_range).astype(np.int8)

    # Dequantize (what the model uses during computation)
    dequantized = quantized.astype(np.float32) * scale

    error = np.abs(weights - dequantized).mean()
    return quantized, scale, error

# Example
weights = (np.random.randn(1000) * 0.1).astype(np.float32)  # Typical weight distribution, stored as float32
q8, scale8, err8 = quantize_weights(weights, bits=8)
q4, scale4, err4 = quantize_weights(weights, bits=4)

print(f"INT8 mean absolute error: {err8:.6f}")
print(f"INT4 mean absolute error: {err4:.6f}")
print(f"Error increase: {err4/err8:.1f}×")

# Storage
print(f"Original: {weights.nbytes} bytes")
print(f"INT8: {q8.nbytes} bytes ({weights.nbytes/q8.nbytes:.0f}× compressed)")
# q4 is held in an int8 container here; packing two 4-bit values per byte halves the size
print(f"INT4 (packed): {q4.nbytes // 2} bytes ({weights.nbytes/(q4.nbytes//2):.0f}× compressed)")
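
The "packed" INT4 size assumes two 4-bit values share one byte. Here is a minimal sketch of one way to do that packing with NumPy. The function names `pack_int4` and `unpack_int4` are hypothetical, and real libraries use their own layouts, so treat this as an illustration of the idea rather than any specific format.

```python
import numpy as np

def pack_int4(q: np.ndarray) -> np.ndarray:
    """Pack signed 4-bit integers (values in [-8, 7]) two per byte."""
    assert q.size % 2 == 0, "pad to an even length before packing"
    # Keep the low 4 bits of the two's-complement representation
    nibbles = (q.astype(np.int16) & 0xF).astype(np.uint8)
    # Even positions go in the low nibble, odd positions in the high nibble
    return nibbles[0::2] | (nibbles[1::2] << 4)

def unpack_int4(packed: np.ndarray) -> np.ndarray:
    """Inverse of pack_int4: recover signed 4-bit values as int8."""
    out = np.empty(packed.size * 2, dtype=np.int8)
    out[0::2] = (packed & 0xF).astype(np.int8)
    out[1::2] = (packed >> 4).astype(np.int8)
    # Nibbles 8..15 represent the negative values -8..-1
    return np.where(out > 7, out - 16, out).astype(np.int8)

q = np.array([-7, 7, 0, -1, 3, -4], dtype=np.int8)
packed = pack_int4(q)
print(packed.nbytes, "bytes for", q.size, "weights")
print(unpack_int4(packed))
```

The round trip is lossless, so the only cost of packing is the extra unpack step at compute time.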

Grouped Quantization

Naive quantization uses one scale per weight matrix, so a single outlier stretches the grid for every weight. Grouped quantization uses one scale per small group of weights (128 is a common group size), which substantially reduces quantization error:

Python
def grouped_quantize(weights: np.ndarray, bits: int = 4, group_size: int = 128) -> dict:
    """Quantize with per-group scales."""
    n = len(weights)
    n_groups = (n + group_size - 1) // group_size
    n_levels = 2 ** bits

    quantized_groups = []
    scales = []

    for g in range(n_groups):
        start = g * group_size
        end = min((g + 1) * group_size, n)
        group = weights[start:end]

        # Each group has its own scale
        max_val = np.abs(group).max() + 1e-8
        scale = max_val / (n_levels // 2 - 1)
        scales.append(scale)

        q = np.clip(np.round(group / scale), -(n_levels // 2), n_levels // 2 - 1)
        quantized_groups.append(q)

    return {
        "quantized": np.concatenate(quantized_groups),
        "scales": np.array(scales),
        "group_size": group_size,
    }

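# Grouped quantization has overhead: n_groups × scale values (float16)
# For 4B weights with group_size=128: 4B/128 × 2 bytes = 62.5 MB overhead
# vs. the original 4B × 2 bytes = 8 GB in float16, so the overhead is less than 1% of the float16 size
# (about 3% of the 2 GB of 4-bit weights, or 0.125 extra bits per weight)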
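
To see why per-group scales help, the sketch below compares per-tensor and grouped round-to-nearest on synthetic weights with a few outliers. The helper `rtn_error`, the weight distribution, and the outlier values are all assumptions chosen for illustration, not measurements from a real model.

```python
import numpy as np

def rtn_error(weights: np.ndarray, bits: int = 4, group_size=None) -> float:
    """Mean absolute error of symmetric round-to-nearest with one scale per group."""
    qmax = 2 ** (bits - 1) - 1
    # group_size=None means a single scale for the whole tensor
    groups = weights.reshape(1, -1) if group_size is None else weights.reshape(-1, group_size)
    scale = np.abs(groups).max(axis=1, keepdims=True) / qmax + 1e-12
    dequantized = np.clip(np.round(groups / scale), -qmax, qmax) * scale
    return float(np.abs(groups - dequantized).mean())

rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.02, size=4096)
w[::512] = 0.5  # a few large-magnitude outliers (hypothetical values)

per_tensor = rtn_error(w)
grouped = rtn_error(w, group_size=128)
bits_per_weight = 4 + 16 / 128  # 4-bit weight plus one float16 scale per 128 weights

print(f"per-tensor MAE: {per_tensor:.5f}")
print(f"grouped MAE:    {grouped:.5f}")
print(f"effective bits per weight: {bits_per_weight}")
```

With a single scale, the outliers force a coarse grid onto every weight. With groups, only the groups that contain an outlier pay that cost.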

GPTQ: Post-Training Quantization

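GPTQ (Frantar et al., 2022) is one of the most widely used methods for aggressive 4-bit quantization of pretrained LLMs. It uses second-order information (an approximate Hessian of each layer's reconstruction error, estimated from calibration data) to minimize quantization error: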

Python
# Using GPTQ with AutoGPTQ library
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
from transformers import AutoTokenizer

model_name = "meta-llama/Meta-Llama-3-8B"  # Gated repo: requires accepting the license on Hugging Face
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Calibration data: a small representative dataset
calibration_texts = [
    "The mechanism of warfarin involves inhibition of vitamin K epoxide reductase.",
    "Metformin activates AMPK and reduces hepatic glucose production.",
    # ... 128 calibration examples
]
calibration_data = [
    tokenizer(text, return_tensors="pt") for text in calibration_texts
]

# Quantization config
quantize_config = BaseQuantizeConfig(
    bits=4,           # 4-bit quantization
    group_size=128,   # Per-group scales
    desc_act=True,    # Act-order: quantize columns in order of decreasing activation magnitude (improves quality)
)

# Load and quantize
model = AutoGPTQForCausalLM.from_pretrained(model_name, quantize_config)
model.quantize(calibration_data, batch_size=4)

# Save quantized model
model.save_quantized("./llama3-8b-4bit-gptq", use_safetensors=True)

# Load and use quantized model
quantized_model = AutoGPTQForCausalLM.from_quantized(
    "./llama3-8b-4bit-gptq",
    device="cuda:0",
    use_safetensors=True,
)
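
To make the "second-order information" idea more concrete, here is a toy NumPy sketch of a GPTQ-style pass over a single linear layer. It is a simplified illustration, not the AutoGPTQ implementation: it uses one fixed scale per output row, no grouping, and no blocking, and the names `round_to_grid` and `gptq_like_quantize` are hypothetical. Columns are quantized one at a time, and each column's rounding error is pushed onto the columns not yet quantized, using the inverse Hessian built from calibration inputs.

```python
import numpy as np

def round_to_grid(w, scale, qmax):
    """Symmetric round-to-nearest onto the grid {-qmax..qmax} * scale."""
    return np.clip(np.round(w / scale), -qmax, qmax) * scale

def gptq_like_quantize(W, X, bits=4, damp=0.01):
    """Toy GPTQ-style pass. W: [out, in] weights, X: [in, n_samples] calibration inputs."""
    W = W.astype(np.float64).copy()
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(W).max(axis=1) / qmax                    # one fixed scale per output row
    H = 2.0 * X @ X.T                                       # Hessian of the layer-wise squared error
    H += damp * np.mean(np.diag(H)) * np.eye(H.shape[0])    # damping for numerical stability
    U = np.linalg.cholesky(np.linalg.inv(H)).T              # upper Cholesky factor of H^-1
    Q = np.zeros_like(W)
    for j in range(W.shape[1]):                             # one input column at a time
        Q[:, j] = round_to_grid(W[:, j], scale, qmax)
        err = (W[:, j] - Q[:, j]) / U[j, j]
        W[:, j:] -= np.outer(err, U[j, j:])                 # compensate in the remaining columns
    return Q, scale

rng = np.random.default_rng(0)
X = rng.normal(size=(32, 8)) @ rng.normal(size=(8, 256))    # correlated calibration inputs
X += 0.1 * rng.normal(size=X.shape)
W = rng.normal(scale=0.05, size=(16, 32))

Q_gptq, scale = gptq_like_quantize(W, X)
Q_rtn = round_to_grid(W, scale[:, None], 7)                 # plain round-to-nearest baseline

err_rtn = np.mean((W @ X - Q_rtn @ X) ** 2)
err_gptq = np.mean((W @ X - Q_gptq @ X) ** 2)
print(f"round-to-nearest output MSE: {err_rtn:.5f}")
print(f"GPTQ-like output MSE:        {err_gptq:.5f}")
```

On correlated inputs like these, the compensated version usually shows a lower output error than plain rounding. This toy setup is not a benchmark, and the production algorithm adds grouping, blocking, and more careful numerics.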

bitsandbytes: QLoRA Integration

bitsandbytes provides INT8 and NF4 (4-bit NormalFloat) quantization with direct Hugging Face Transformers integration. NF4 is the data type used in QLoRA:

Python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
import torch

# NF4 quantization config (used in QLoRA)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,  # Dequantize to BF16 for compute
    bnb_4bit_use_double_quant=True,          # Quantize the scales too (nested quantization)
    bnb_4bit_quant_type="nf4",              # NF4 data type (optimal for normal distributions)
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B",
    quantization_config=bnb_config,
    device_map="auto",
)

# Weights are stored as 4-bit NF4 on GPU
# During forward pass: automatically dequantized to BF16 for computation
# LoRA adapters are trained in higher precision (typically BF16) on top of the frozen quantized base

print(f"Model memory footprint: {model.get_memory_footprint() / 1e9:.2f} GB")
# LLaMA-3-8B in NF4: ~4.5 GB vs ~15 GB in BF16
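
The intuition behind a "normal float" data type is that the 16 levels are placed where normally distributed weights are dense, instead of being evenly spaced. The sketch below builds such a codebook from evenly spaced quantiles of a standard normal and applies it block by block. It is an approximation for illustration only: the actual NF4 table in bitsandbytes is constructed differently (it includes an exact zero, for example), and all function names here are hypothetical.

```python
from statistics import NormalDist
import numpy as np

def normal_quantile_codebook(n_levels: int = 16) -> np.ndarray:
    """Levels at evenly spaced quantiles of N(0, 1), rescaled to [-1, 1]."""
    probs = np.linspace(0.0, 1.0, n_levels + 2)[1:-1]   # drop 0 and 1 (infinite quantiles)
    levels = np.array([NormalDist().inv_cdf(float(p)) for p in probs])
    return levels / np.abs(levels).max()

def blockwise_codebook_quantize(weights, codebook, block_size=64):
    """Normalize each block by its absmax, then snap to the nearest codebook level."""
    blocks = weights.reshape(-1, block_size)
    absmax = np.abs(blocks).max(axis=1, keepdims=True) + 1e-12   # one scale per block
    idx = np.abs((blocks / absmax)[..., None] - codebook).argmin(axis=-1)
    return idx.astype(np.uint8), absmax

def dequantize(idx, absmax, codebook):
    """Look up each 4-bit index and rescale by the block's absmax."""
    return (codebook[idx] * absmax).reshape(-1)

rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.02, size=4096)

codebook = normal_quantile_codebook()
idx, absmax = blockwise_codebook_quantize(w, codebook)
w_hat = dequantize(idx, absmax, codebook)
print(f"mean absolute error: {np.abs(w - w_hat).mean():.6f}")
```

The `bnb_4bit_use_double_quant=True` option in the config above goes one step further and quantizes the per-block scales (`absmax` in this sketch) as well.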

AWQ: Activation-Aware Weight Quantization

AWQ (Lin et al., 2023) observes that not all weights are equally important: weights that correspond to channels with large activations are more sensitive to quantization error. AWQ protects these weights by scaling the salient channels before quantization:

Python
# Using AutoAWQ library
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_name = "meta-llama/Meta-Llama-3-8B"  # Gated repo: requires accepting the license on Hugging Face
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)

# Calibration data: AutoAWQ expects a dataset name or a list of plain-text strings
calibration_data = [
    "Explain the mechanism of warfarin.",
    # ... many more representative samples
]

model = AutoAWQForCausalLM.from_pretrained(model_name)

quant_config = {
    "zero_point": True,  # Asymmetric quantization
    "q_group_size": 128,
    "w_bit": 4,
    "version": "GEMM",
}

model.quantize(tokenizer, quant_config=quant_config, calib_data=calibration_data)
model.save_quantized("./llama3-8b-awq")

# AWQ is often reported to match or exceed GPTQ quality at the same bit width
# and can be faster at inference when optimized GEMM kernels are available
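
The activation-aware idea can be sketched in a few lines of NumPy. The toy below scales each input channel of the weight matrix by `s = mean(|x|) ** alpha`, quantizes, and applies the inverse scale to the activations, then searches a small grid of `alpha` values for the lowest output error. This is a simplified illustration under several assumptions (synthetic data, a tiny search grid, no scale normalization), and the helper names are hypothetical. AutoAWQ's actual search and kernels are more involved.

```python
import numpy as np

def quantize_grouped(W, bits=4, group_size=32):
    """Symmetric round-to-nearest with one scale per group along the input dimension."""
    qmax = 2 ** (bits - 1) - 1
    out_f, in_f = W.shape
    G = W.reshape(out_f, in_f // group_size, group_size)
    scale = np.abs(G).max(axis=-1, keepdims=True) / qmax + 1e-12
    return (np.clip(np.round(G / scale), -qmax, qmax) * scale).reshape(out_f, in_f)

def awq_like_search(W, X, alphas=(0.0, 0.25, 0.5, 0.75, 1.0)):
    """W: [out, in] weights, X: [in, n_samples] calibration activations."""
    act_mag = np.abs(X).mean(axis=1)           # average magnitude per input channel
    reference = W @ X
    results = {}
    for alpha in alphas:
        s = act_mag ** alpha                   # alpha = 0 means no scaling
        Wq = quantize_grouped(W * s)           # scale salient columns up before rounding
        out = Wq @ (X / s[:, None])            # inverse scale applied to the activations
        results[alpha] = float(np.mean((out - reference) ** 2))
    best = min(results, key=results.get)
    return best, results

rng = np.random.default_rng(0)
X = rng.normal(size=(64, 256))
X[:4] *= 30.0                                  # a few channels with large activations
W = rng.normal(scale=0.05, size=(32, 64))

best_alpha, errors = awq_like_search(W, X)
for alpha, err in errors.items():
    print(f"alpha={alpha:.2f}  output MSE={err:.5f}")
print("best alpha:", best_alpha)
```

Because `alpha = 0` (no scaling) is part of the grid, the selected setting is never worse than plain grouped rounding on the calibration data.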

Quantization Comparison

Python
# Illustrative figures for LLaMA-3-8B quality vs. memory
# (approximate; actual numbers depend on hardware, kernels, and evaluation set)
results = {
    "BF16 (baseline)": {"memory_gb": 15.0, "perplexity": 8.2, "tokens_per_sec": 50},
    "INT8 (bitsandbytes)": {"memory_gb": 8.0, "perplexity": 8.3, "tokens_per_sec": 45},
    "NF4 (QLoRA/bnb)": {"memory_gb": 4.5, "perplexity": 8.5, "tokens_per_sec": 55},
    "GPTQ-4bit": {"memory_gb": 4.3, "perplexity": 8.4, "tokens_per_sec": 80},
    "AWQ-4bit": {"memory_gb": 4.3, "perplexity": 8.35, "tokens_per_sec": 90},
}

print(f"{'Method':<25} {'Memory':>10} {'Perplexity':>12} {'Speed':>12}")
print("-" * 60)
for method, stats in results.items():
    print(f"{method:<25} {stats['memory_gb']:>9.1f}GB {stats['perplexity']:>12.2f} {stats['tokens_per_sec']:>10}t/s")

Choosing a Quantization Method

| Need | Method |
|---|---|
| Fine-tune a quantized base model | bitsandbytes NF4 (QLoRA) |
| Fast inference, maximum quality | AWQ-4bit |
| Industry standard compatibility | GPTQ-4bit |
| Moderate compression, minimal setup | INT8 (load_in_8bit=True) |
| CPU inference | GGUF (llama.cpp format) |

For production inference on GPU, AWQ-4bit with optimized kernels (vLLM or TGI) usually offers a strong throughput-quality tradeoff. For fine-tuning on limited hardware, QLoRA with NF4 is the standard approach.
