Learnixo

Fine-Tuning LLMs · Lesson 6 of 16

QLoRA: 4-bit Quantization + LoRA

QLoRA: Fine-Tuning on Consumer Hardware

QLoRA made it possible to fine-tune a 65-billion-parameter model on a single 48 GB GPU, hardware that cannot even hold the model's 16-bit weights (roughly 130 GB). The technique combines two ideas: aggressive 4-bit quantization of the frozen base model to reduce memory, and LoRA adapters trained in higher precision to maintain quality.


The Memory Problem

Large language models consume GPU memory proportional to their parameter count. A 70B parameter model in bfloat16 requires approximately 140 GB of GPU memory — before you add gradients, optimizer states, and activations needed for training.

Python
def memory_breakdown(params_billions: float) -> None:
    """Show the memory breakdown for full fine-tuning vs QLoRA."""
    params = params_billions * 1e9

    print(f"\n{'='*60}")
    print(f"Memory breakdown for {params_billions}B parameter model")
    print(f"{'='*60}")

    # Full fine-tuning in bfloat16
    model_bf16  = params * 2 / 1e9    # 2 bytes per param
    grads_bf16  = params * 2 / 1e9    # gradient per param
    optim_fp32  = params * 8 / 1e9    # AdamW: m + v in fp32
    activations = params * 2 / 1e9    # rough estimate
    total_full_ft = model_bf16 + grads_bf16 + optim_fp32 + activations

    print(f"\nFull fine-tuning (bfloat16 + AdamW):")
    print(f"  Model weights:       {model_bf16:.1f} GB")
    print(f"  Gradients:           {grads_bf16:.1f} GB")
    print(f"  Optimizer states:    {optim_fp32:.1f} GB")
    print(f"  Activations (est):   {activations:.1f} GB")
    print(f"  TOTAL:               {total_full_ft:.1f} GB")

    # QLoRA
    model_4bit  = params * 0.5 / 1e9   # NF4: ~0.5 bytes/param (4 bits + small scale overhead)
    lora_params = params * 0.002        # ~0.2% trainable
    lora_grads  = lora_params * 2 / 1e9
    lora_optim  = lora_params * 8 / 1e9
    activations_qlora = params * 0.5 / 1e9  # rough estimate; gradient checkpointing reduces this
    total_qlora = model_4bit + lora_grads + lora_optim + activations_qlora

    print(f"\nQLoRA (4-bit NF4 base + LoRA adapters):")
    print(f"  4-bit base model:    {model_4bit:.1f} GB")
    print(f"  LoRA gradients:      {lora_grads:.3f} GB")
    print(f"  LoRA optimizer:      {lora_optim:.3f} GB")
    print(f"  Activations (est):   {activations_qlora:.1f} GB")
    print(f"  TOTAL:               {total_qlora:.1f} GB")
    print(f"\nMemory reduction:      {total_full_ft / total_qlora:.1f}x")

for size in [7.0, 13.0, 70.0]:
    memory_breakdown(size)

# 70B model, using the rough estimates above:
# Full fine-tuning: ~980 GB (at least 13x A100 80GB)
# QLoRA:             ~71 GB (fits on 1x A100 80GB)

NF4 Quantization: Why It Works

Standard 4-bit quantization maps float values to 16 evenly spaced levels. Because most weights cluster near zero, an evenly spaced grid spends many of its levels on the sparsely populated tails and resolves the dense center only coarsely. NormalFloat4 (NF4) is designed specifically for neural network weights, which approximately follow a normal (Gaussian) distribution centered at zero.

NF4 places quantization levels at the quantiles of the standard normal distribution, minimizing expected quantization error for normally distributed weights.

Python
import numpy as np

# Conceptual demonstration of NF4 quantization
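def get_nf4_levels() -> list[float]:
    """
    Simplified NF4-style levels: 16 quantiles of N(0,1), rescaled to [-1, 1].
    16 levels for 4 bits (values 0-15). The production NF4 table differs
    slightly: it is asymmetric so that zero is represented exactly.
    """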
    from scipy.stats import norm

    # 16 equally-spaced quantiles of the normal distribution
    quantile_positions = [(i + 0.5) / 16 for i in range(16)]
    levels = [norm.ppf(q) for q in quantile_positions]

    # Normalize to [-1, 1] range
    max_val = max(abs(l) for l in levels)
    return [l / max_val for l in levels]

def quantize_nf4(weights: np.ndarray, levels: list[float]) -> tuple[np.ndarray, float, np.ndarray]:
    """Quantize 1-D weights to the given levels; returns (reconstructed, scale, indices)."""
    levels_arr = np.array(levels)

    # Scale weights to [-1, 1]
    scale = np.max(np.abs(weights))
    normalized = weights / (scale + 1e-8)

    # Find nearest NF4 level for each weight
    quantized_indices = np.argmin(
        np.abs(normalized[:, np.newaxis] - levels_arr[np.newaxis, :]),
        axis=1
    )
    quantized_normalized = levels_arr[quantized_indices]

    # Dequantize
    reconstructed = quantized_normalized * scale
    return reconstructed, scale, quantized_indices

# Compare NF4 vs uniform 4-bit quantization
np.random.seed(42)
weights = np.random.randn(10000)  # normally distributed weights

nf4_levels = get_nf4_levels()
uniform_levels = np.linspace(-1, 1, 16).tolist()

def quantization_error(weights, levels):
    reconstructed, scale, _ = quantize_nf4(weights, levels)
    return np.sqrt(np.mean((weights - reconstructed) ** 2))

nf4_rmse = quantization_error(weights, nf4_levels)
uniform_rmse = quantization_error(weights, uniform_levels)

print(f"NF4 RMSE:     {nf4_rmse:.6f}")
print(f"Uniform RMSE: {uniform_rmse:.6f}")
print(f"NF4 advantage: {uniform_rmse / nf4_rmse:.2f}x lower error")
# NF4 typically achieves 1.3-1.5x lower quantization error for normally distributed weights
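
The demonstration above uses one absolute-maximum scale for the whole tensor. In practice the scale is stored per block of weights (64 per block, as the next section assumes), which limits the damage a single outlier can do. The following is a minimal standard-library sketch of that effect, not the bitsandbytes implementation: the levels reuse the simplified quantile construction from above, and the helper names (`simple_nf4_levels`, `quantize_block`, `blockwise_rmse`) and the injected outlier are illustrative assumptions.

```python
import random
from statistics import NormalDist

def simple_nf4_levels() -> list[float]:
    """Simplified levels (assumption): 16 normal quantiles rescaled to [-1, 1]."""
    raw = [NormalDist().inv_cdf((i + 0.5) / 16) for i in range(16)]
    top = max(abs(v) for v in raw)
    return [v / top for v in raw]

def quantize_block(block: list[float], levels: list[float]) -> list[float]:
    """Absmax-scale one block, snap each value to the nearest level, rescale."""
    scale = max(abs(w) for w in block) or 1.0  # avoid dividing by zero
    return [min(levels, key=lambda lv: abs(w / scale - lv)) * scale for w in block]

def blockwise_rmse(weights: list[float], block_size: int, levels: list[float]) -> float:
    """RMSE after quantizing `weights` in independent blocks of `block_size`."""
    sq_err = 0.0
    for start in range(0, len(weights), block_size):
        block = weights[start:start + block_size]
        recon = quantize_block(block, levels)
        sq_err += sum((w - r) ** 2 for w, r in zip(block, recon))
    return (sq_err / len(weights)) ** 0.5

random.seed(0)
weights = [random.gauss(0.0, 1.0) for _ in range(4096)]
weights[100] = 25.0  # one artificial outlier (hypothetical)

levels = simple_nf4_levels()
rmse_whole_tensor = blockwise_rmse(weights, len(weights), levels)
rmse_block_64 = blockwise_rmse(weights, 64, levels)

print(f"one scale for all weights: RMSE = {rmse_whole_tensor:.4f}")
print(f"one scale per 64 weights:  RMSE = {rmse_block_64:.4f}")
```

With a single scale, the outlier stretches the grid for every weight; with per-block scales, only the 64 weights that share its block pay that cost. The price is one extra scale per block, which is the overhead that double quantization targets.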

Double Quantization

QLoRA uses "double quantization": it quantizes the quantization constants themselves. Each block of 64 weights has a scale factor stored as a 32-bit float, which costs 0.5 bits per parameter. Quantizing these scale factors to 8 bits cuts that overhead to about 0.13 bits, saving roughly 0.37 bits per parameter.

Python
from transformers import BitsAndBytesConfig
import torch

# Full QLoRA quantization configuration
qlora_bnb_config = BitsAndBytesConfig(
    # Primary: 4-bit NF4 quantization
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",

    # Compute precision for dequantized operations
    # bfloat16 is faster than float16 on Ampere+ GPUs and more numerically stable
    bnb_4bit_compute_dtype=torch.bfloat16,

    # Double quantization: quantize the quantization constants
    # Saves ~0.37 bits per parameter
    bnb_4bit_use_double_quant=True,
)

# Memory per parameter:
# NF4 alone:           4 bits + 32-bit scale per 64-param block = ~4.5 bits/param
# NF4 + double quant:  4 bits + 8-bit scale per 64-param block  = ~4.13 bits/param
# Roughly 0.5 bytes per parameter (vs 2 bytes for bfloat16)

print("4-bit NF4 memory:              ~0.5 bytes/param")
print("bfloat16 memory:               ~2.0 bytes/param")
print("Memory reduction:              ~4x")
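
The bits-per-parameter figures in the comments above can be reproduced with a few lines of arithmetic. This sketch assumes a weight block size of 64 (as stated above) and, as an additional assumption not stated in this lesson, that the 8-bit scales are themselves grouped into blocks of 256 that share one 32-bit constant. The function name `bits_per_param` is illustrative.

```python
def bits_per_param(
    weight_bits: int = 4,
    block_size: int = 64,
    double_quant: bool = False,
    scale_block_size: int = 256,  # assumption: second-level block size
) -> float:
    """Average storage cost per weight, including quantization constants."""
    if double_quant:
        # 8-bit scale per weight block, plus one fp32 constant per
        # `scale_block_size` scales.
        overhead = 8 / block_size + 32 / (block_size * scale_block_size)
    else:
        # One fp32 scale per weight block.
        overhead = 32 / block_size
    return weight_bits + overhead

plain = bits_per_param()
double = bits_per_param(double_quant=True)

print(f"NF4 alone:          {plain:.3f} bits/param")
print(f"NF4 + double quant: {double:.3f} bits/param")
print(f"Saving:             {plain - double:.3f} bits/param")
```

Under these assumptions the saving comes out at about 0.37 bits per parameter, which is small per weight but adds up to roughly 3 GB on a 70B model.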

Paged Optimizers

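QLoRA uses paged optimizers, built on NVIDIA unified memory, to absorb the occasional memory spikes that occur with gradient checkpointing, for example when a mini-batch contains an unusually long sequence. When the GPU runs out of memory, optimizer states are paged out to CPU RAM and paged back in when the optimizer step needs them, instead of the run crashing.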

Python
from transformers import TrainingArguments

# Enable paged optimizers in training arguments
training_args = TrainingArguments(
    output_dir="./qlora-output",
    # Paged AdamW: uses NVIDIA unified memory for optimizer states
    optim="paged_adamw_32bit",
    # Alternatives:
    # "paged_adamw_8bit": 8-bit Adam, even less memory, slightly lower quality
    # "adamw_torch": standard (no paging, may OOM on memory spikes)

    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,    # effective batch = 16
    gradient_checkpointing=True,      # recompute activations on backward pass
    learning_rate=2e-4,
    num_train_epochs=3,
    bf16=True,
    logging_steps=25,
    save_strategy="epoch",
    report_to="none",
)
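
To put the optimizer choice in perspective, here is a rough sketch of how much memory the AdamW state needs for the LoRA parameters alone. It assumes two state tensors per trainable parameter (first and second moment), 4 bytes each for the 32-bit variant and about 1 byte each for the 8-bit variant, and ignores the small per-block constants an 8-bit optimizer also stores. The 140M parameter count is the ~0.2% of 70B used in the memory estimate earlier in this lesson, and `adamw_state_gb` is an illustrative helper.

```python
def adamw_state_gb(trainable_params: float, bytes_per_state: float) -> float:
    """Memory for AdamW's two moment tensors (m and v), in GB."""
    return trainable_params * 2 * bytes_per_state / 1e9

lora_params = 70e9 * 0.002  # ~140M trainable parameters (assumption)

state_32bit = adamw_state_gb(lora_params, bytes_per_state=4)
state_8bit = adamw_state_gb(lora_params, bytes_per_state=1)

print(f"paged_adamw_32bit state: ~{state_32bit:.2f} GB")
print(f"paged_adamw_8bit state:  ~{state_8bit:.2f} GB")
```

Either way the optimizer state is small next to the 4-bit base model, so the main benefit of the paged variants is surviving short-lived spikes rather than lowering steady-state memory.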

Complete QLoRA Pipeline

Python
import torch
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    TrainingArguments,
)
from peft import (
    LoraConfig,
    TaskType,
    get_peft_model,
    prepare_model_for_kbit_training,
)
from trl import SFTTrainer
from datasets import Dataset

# ─── Configuration ────────────────────────────────────────────────────────────
MODEL_NAME = "meta-llama/Llama-3.1-70B-Instruct"  # 70B on a single A100!
OUTPUT_DIR = "./llama3-70b-drug-qlora"
MAX_SEQ_LENGTH = 2048

# ─── Step 1: BitsAndBytes config ─────────────────────────────────────────────
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

# ─── Step 2: Load tokenizer ───────────────────────────────────────────────────
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

# ─── Step 3: Load base model in 4-bit ────────────────────────────────────────
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    quantization_config=bnb_config,
    device_map="auto",           # auto-shard across available GPUs
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",  # Flash Attention 2 for speed; requires the flash-attn package
)

# ─── Step 4: Prepare for k-bit training ──────────────────────────────────────
# This step:
#   - Casts LayerNorm layers to float32 for stability
#   - Enables gradient checkpointing
#   - Enables input embeddings to require gradients
model = prepare_model_for_kbit_training(
    model,
    use_gradient_checkpointing=True,
    gradient_checkpointing_kwargs={"use_reentrant": False},
)

# ─── Step 5: LoRA configuration ──────────────────────────────────────────────
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.CAUSAL_LM,
    inference_mode=False,
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# For 70B with r=16 on all seven projections: roughly 207M trainable, about 0.3% of all parameters

# ─── Step 6: Dataset ─────────────────────────────────────────────────────────
def format_example(example: dict) -> dict:
    messages = [
        {
            "role": "system",
            "content": "You are an expert clinical pharmacist. Answer drug information queries accurately."
        },
        {"role": "user", "content": example["question"]},
        {"role": "assistant", "content": example["answer"]},
    ]
    text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=False)
    return {"text": text}

# Load your dataset (replace with your actual data source)
raw_examples = [
    {
        "question": "What is the recommended first-line treatment for hypertension in patients with diabetes?",
        "answer": (
            "For patients with diabetes and hypertension, ACE inhibitors (e.g., ramipril, lisinopril) "
            "or ARBs (e.g., losartan, valsartan) are recommended as first-line therapy. "
            "These agents provide both blood pressure reduction and nephroprotection via blockade of "
            "the renin-angiotensin-aldosterone system. Target blood pressure is below 130/80 mmHg "
            "per ADA and ACC/AHA guidelines. Beta-blockers are not preferred as first-line unless "
            "the patient has concurrent ischemic heart disease or heart failure."
        )
    },
    # ... 999 more examples
]

dataset = Dataset.from_list([format_example(ex) for ex in raw_examples])
split = dataset.train_test_split(test_size=0.05, seed=42)

# ─── Step 7: Training arguments ──────────────────────────────────────────────
training_args = TrainingArguments(
    output_dir=OUTPUT_DIR,
    num_train_epochs=3,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,
    optim="paged_adamw_32bit",
    save_steps=100,
    logging_steps=25,
    learning_rate=2e-4,
    fp16=False,
    bf16=True,
    max_grad_norm=0.3,
    warmup_ratio=0.03,
    group_by_length=True,      # group similar-length sequences -> fewer padding tokens
    lr_scheduler_type="cosine",
    report_to="none",
    evaluation_strategy="steps",  # named eval_strategy in newer transformers releases
    eval_steps=50,
    load_best_model_at_end=True,
)

# ─── Step 8: Train ────────────────────────────────────────────────────────────
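# Note: these keyword arguments match older TRL releases. Newer releases expect
# dataset_text_field, the sequence-length limit and packing on an SFTConfig, and
# take the tokenizer as processing_class.
trainer = SFTTrainer(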
    model=model,
    args=training_args,
    train_dataset=split["train"],
    eval_dataset=split["test"],
    tokenizer=tokenizer,
    dataset_text_field="text",
    max_seq_length=MAX_SEQ_LENGTH,
    packing=True,   # pack multiple short examples into one sequence
)

trainer.train()

# ─── Step 9: Save ─────────────────────────────────────────────────────────────
# Only save the LoRA adapter weights (small file, not the quantized base)
trainer.model.save_pretrained(OUTPUT_DIR)
tokenizer.save_pretrained(OUTPUT_DIR)

print(f"LoRA adapter saved to {OUTPUT_DIR}")
print("Base model is reloaded from the Hugging Face Hub and quantized to 4-bit at load time")
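
The count reported by `model.print_trainable_parameters()` can be cross-checked by hand: a LoRA adapter of rank r on a linear layer with input width d_in and output width d_out adds r × (d_in + d_out) parameters. The sketch below applies that to the `LoraConfig` above. The layer shapes are assumptions for a Llama-3.1-70B-style architecture (hidden size 8192, MLP width 28672, 80 layers, key/value projection width 1024 from grouped-query attention); check the model's `config.json` before relying on them. `lora_param_count` is an illustrative helper.

```python
def lora_param_count(r: int, shapes: dict[str, tuple[int, int]], num_layers: int) -> int:
    """Total LoRA parameters: r * (d_in + d_out) per adapted layer, per block."""
    per_layer = sum(r * (d_in + d_out) for d_in, d_out in shapes.values())
    return per_layer * num_layers

# Assumed shapes (d_in, d_out) for one transformer block.
HIDDEN, KV_DIM, MLP_DIM, LAYERS = 8192, 1024, 28672, 80
shapes = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, MLP_DIM),
    "up_proj": (HIDDEN, MLP_DIM),
    "down_proj": (MLP_DIM, HIDDEN),
}

trainable = lora_param_count(r=16, shapes=shapes, num_layers=LAYERS)
print(f"LoRA parameters: {trainable:,} ({trainable / 70e9:.2%} of ~70B)")
```

Under these assumed shapes the MLP projections account for most of the adapter parameters, so dropping `gate_proj`, `up_proj` and `down_proj` from `target_modules` is the quickest way to shrink the adapter if memory is tight.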

Hardware Requirements by Model Size

Python
# QLoRA hardware guide
qlora_requirements = {
    "7B": {
        "gpu_memory_needed_gb": 10,
        "recommended_gpu": "RTX 4070 Ti (12 GB) or better",
        "training_time_1k_examples": "~20 minutes",
        "cost_estimate_cloud": "~$0.30 on Lambda Labs RTX 4090",
    },
    "13B": {
        "gpu_memory_needed_gb": 14,
        "recommended_gpu": "RTX 3090 / 4090 (24 GB)",
        "training_time_1k_examples": "~35 minutes",
        "cost_estimate_cloud": "~$0.60 on Lambda Labs RTX 4090",
    },
    "34B": {
        "gpu_memory_needed_gb": 24,
        "recommended_gpu": "RTX 4090 (24 GB) or A6000 (48 GB)",
        "training_time_1k_examples": "~90 minutes",
        "cost_estimate_cloud": "~$3 on Lambda Labs A6000",
    },
    "70B": {
        "gpu_memory_needed_gb": 48,
        "recommended_gpu": "A100 80GB (single GPU) or 2x A40 (48 GB each)",
        "training_time_1k_examples": "~4 hours",
        "cost_estimate_cloud": "~$12 on Lambda Labs A100",
    },
}

for model_size, specs in qlora_requirements.items():
    print(f"\nQLoRA {model_size} model:")
    for k, v in specs.items():
        print(f"  {k}: {v}")
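
As a small usage sketch, the memory figures from the guide above can be turned into a lookup that lists which model sizes fit a given GPU. The numbers are copied from `qlora_requirements`. The `headroom_gb` parameter and the function name are hypothetical additions for leaving a safety margin.

```python
# Minimum GPU memory (GB) per model size, copied from the guide above.
QLORA_MIN_GB = {"7B": 10, "13B": 14, "34B": 24, "70B": 48}

def models_that_fit(gpu_memory_gb: float, headroom_gb: float = 0.0) -> list[str]:
    """Model sizes whose estimated QLoRA footprint plus headroom fits the GPU."""
    return [
        name
        for name, needed_gb in QLORA_MIN_GB.items()
        if needed_gb + headroom_gb <= gpu_memory_gb
    ]

print(models_that_fit(24))                 # 24 GB card, no margin
print(models_that_fit(24, headroom_gb=2))  # same card, 2 GB safety margin
```

The second call shows why a margin matters: a 34B model that nominally needs 24 GB leaves no room for longer sequences or larger batches on a 24 GB card.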

Inference After QLoRA

Python
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch

def load_qlora_model_for_inference(
    base_model_name: str,
    adapter_path: str,
    use_4bit: bool = True,
) -> tuple:
    """Load base model + LoRA adapter for inference."""
    tokenizer = AutoTokenizer.from_pretrained(base_model_name)

    if use_4bit:
        # Keep 4-bit quantization for memory-efficient inference
        bnb_config = BitsAndBytesConfig(
            load_in_4bit=True,
            bnb_4bit_quant_type="nf4",
            bnb_4bit_compute_dtype=torch.bfloat16,
            bnb_4bit_use_double_quant=True,
        )
        base = AutoModelForCausalLM.from_pretrained(
            base_model_name,
            quantization_config=bnb_config,
            device_map="auto",
        )
    else:
        # Load in bfloat16 for full precision inference (requires more memory)
        base = AutoModelForCausalLM.from_pretrained(
            base_model_name,
            torch_dtype=torch.bfloat16,
            device_map="auto",
        )

    model = PeftModel.from_pretrained(base, adapter_path)
    model.eval()

    return model, tokenizer

def generate(model, tokenizer, prompt: str, max_new_tokens: int = 512) -> str:
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        output = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            temperature=0.1,
            do_sample=True,
            repetition_penalty=1.1,
            pad_token_id=tokenizer.eos_token_id,
        )
    new_tokens = output[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)

# Usage:
# model, tokenizer = load_qlora_model_for_inference(
#     "meta-llama/Llama-3.1-70B-Instruct",
#     "./llama3-70b-drug-qlora",
#     use_4bit=True
# )
# answer = generate(model, tokenizer, "What is the first-line treatment for septic shock?")
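
The training examples were formatted with the tokenizer's chat template, so inference prompts usually work best when they use the same template rather than a raw question string. A minimal sketch, assuming the same system prompt as in training; `build_chat_prompt` is a hypothetical helper, while `apply_chat_template` is the same tokenizer method used in the dataset step.

```python
SYSTEM_PROMPT = (
    "You are an expert clinical pharmacist. "
    "Answer drug information queries accurately."
)

def build_chat_prompt(tokenizer, question: str, system_prompt: str = SYSTEM_PROMPT) -> str:
    """Format a question the same way the training examples were formatted."""
    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": question},
    ]
    # add_generation_prompt=True appends the assistant header so the model
    # starts writing the answer instead of continuing the user turn.
    return tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )

# Usage (with the model and tokenizer loaded as above):
# prompt = build_chat_prompt(tokenizer, "What is the first-line treatment for septic shock?")
# answer = generate(model, tokenizer, prompt)
```

Depending on the tokenizer, the rendered template may already contain the beginning-of-sequence token. If so, tokenizing the prompt with `add_special_tokens=False` avoids adding a second one.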

QLoRA vs LoRA: When to Choose Which

| Scenario | Use LoRA | Use QLoRA |
|---|---|---|
| Model fits in GPU memory in bfloat16 | Yes | Not needed |
| Model too large for bfloat16 training | No | Yes |
| 7B model on 16 GB GPU | No | Yes |
| 7B model on 24 GB GPU | Yes (preferred) | Also works |
| 70B model on 80 GB GPU | No | Yes |
| Fastest training speed priority | Yes (no dequant overhead) | Slower |
| Lowest memory priority | No | Yes |
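
The memory rows of this table can be approximated with a simple rule of thumb. The sketch below is an illustration, not a sizing tool: it assumes 2 bytes per parameter for bfloat16 weights, 0.5 bytes for NF4, and a 1.5x headroom factor for adapters, optimizer state and activations. The headroom factor and the function name `choose_method` are assumptions chosen to match the table, and real requirements depend on sequence length and batch size.

```python
def choose_method(params_billions: float, gpu_memory_gb: float, headroom: float = 1.5) -> str:
    """Pick LoRA or QLoRA from a rough weights-plus-headroom estimate."""
    bf16_weights_gb = params_billions * 2.0  # 2 bytes per parameter
    nf4_weights_gb = params_billions * 0.5   # ~0.5 bytes per parameter
    if bf16_weights_gb * headroom <= gpu_memory_gb:
        return "LoRA"
    if nf4_weights_gb * headroom <= gpu_memory_gb:
        return "QLoRA"
    return "neither fits on one GPU"

for size_b, gpu_gb in [(7, 16), (7, 24), (70, 80), (70, 24)]:
    print(f"{size_b}B on {gpu_gb} GB: {choose_method(size_b, gpu_gb)}")
```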


Summary

QLoRA unlocks fine-tuning of 70B models on hardware that is accessible to individual engineers and small teams. The three components work together:

  1. NF4 quantization compresses the base model to approximately 0.5 bytes per parameter — a 4x reduction from bfloat16
  2. Double quantization shaves another 0.37 bits per parameter from the quantization overhead
  3. Paged optimizers prevent out-of-memory crashes during gradient computation peaks

The quality cost is usually small. QLoRA-tuned models typically score within 1-2% of 16-bit LoRA on standard benchmarks, and on domain tasks they are often hard to distinguish from fully fine-tuned models.