LoRA Rank Selection: How to Choose r

What Does Rank Control?

In LoRA, the pretrained weight W ∈ R^(d × k) stays frozen and the weight update is expressed as the product of two low-rank matrices:

ΔW = A × B
where A ∈ R^(d × r), B ∈ R^(r × k)

The rank r controls:

Parameter count: r × (d + k) per LoRA module, which equals 2 × d × r when d = k (vs d × k for full fine-tuning)
Expressiveness: The update ΔW has rank at most r, so a higher r can represent more complex weight updates
Overfitting risk: A higher r on a small dataset raises the risk of overfitting

Choosing r is the central hyperparameter decision in LoRA fine-tuning.
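As a rough illustration of the decomposition above, here is a minimal pure-Python sketch with toy dimensions. The values d=6, k=5, r=2 are arbitrary choices for readability, not taken from any real model, and `matmul` and `matrix_rank` are helper names introduced only for this example.

```python
from fractions import Fraction
import random


def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def matrix_rank(M):
    """Rank via Gaussian elimination over exact fractions."""
    rows = [[Fraction(x) for x in row] for row in M]
    rank = 0
    for col in range(len(rows[0])):
        # Find a row at or below the current rank with a nonzero entry in this column.
        pivot = next((i for i in range(rank, len(rows)) if rows[i][col] != 0), None)
        if pivot is None:
            continue
        rows[rank], rows[pivot] = rows[pivot], rows[rank]
        for i in range(rank + 1, len(rows)):
            factor = rows[i][col] / rows[rank][col]
            rows[i] = [x - factor * y for x, y in zip(rows[i], rows[rank])]
        rank += 1
    return rank


random.seed(0)
d, k, r = 6, 5, 2  # toy sizes, chosen only for illustration
A = [[random.randint(-3, 3) for _ in range(r)] for _ in range(d)]  # d x r
B = [[random.randint(-3, 3) for _ in range(k)] for _ in range(r)]  # r x k

delta_W = matmul(A, B)                      # d x k update
shape = (len(delta_W), len(delta_W[0]))
lora_params = d * r + r * k                 # entries stored in A and B
full_params = d * k                         # entries in a full update
update_rank = matrix_rank(delta_W)          # can never exceed r

print(shape, lora_params, full_params, update_rank)
```

The update has the same shape as the full weight matrix, but its rank is capped at r however the entries of A and B are chosen. That cap is the expressiveness limit the rank controls.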

Parameter Count by Rank

For a typical attention weight matrix in a 7B model (d=4096, k=4096):

| Rank r | LoRA params per matrix | % of full matrix (16.8M params) |
|---|---|---|
| 4 | 32,768 | 0.2% |
| 8 | 65,536 | 0.4% |
| 16 | 131,072 | 0.8% |
| 32 | 262,144 | 1.6% |
| 64 | 524,288 | 3.1% |

With 4 target matrices (q, k, v, o projections) per layer and 32 layers in a 7B model, r=16 gives roughly 16.8M trainable parameters (131,072 × 4 × 32) out of 7B total, or about 0.24%.
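The arithmetic behind these figures can be reproduced in a few lines. This is a sketch under the same assumptions as the text (square 4096 × 4096 matrices, 4 target matrices per layer, 32 layers); `lora_params` is a helper name introduced here, not a library function.

```python
def lora_params(d, k, r):
    """Trainable parameters for one LoRA module: A is d x r, B is r x k."""
    return r * (d + k)


d = k = 4096
full = d * k  # parameters in the full weight matrix

for r in (4, 8, 16, 32, 64):
    n = lora_params(d, k, r)
    print(f"r={r:>2}: {n:>9,} params, {100 * n / full:.1f}% of the full matrix")

# Whole-model total for r=16, assuming 4 square target matrices in each of 32 layers.
layers, matrices_per_layer = 32, 4
total_r16 = lora_params(d, k, 16) * matrices_per_layer * layers
print(f"total at r=16: {total_r16:,}")
```

Models that use grouped-query attention have narrower k and v projections, so their totals come out lower than this square-matrix estimate.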

Rank Selection Heuristics

Start with r=8 or r=16 for most tasks. This is the practical default that works well across a wide range of fine-tuning scenarios.

| Task complexity | Recommended r | Why |
|---|---|---|
| Simple style/format adaptation | 4–8 | Minimal weight update needed |
| Domain adaptation (medical, legal) | 8–16 | Moderate concept shift |
| New task learning (classification) | 16–32 | More complex weight change |
| Complex reasoning / instruction following | 32–64 | High expressiveness needed |
| Very large dataset, complex task | 64–128 | Can absorb more signal |

When in doubt: start at r=16, evaluate, then reduce if overfitting or increase if underfitting.

The alpha Parameter

lora_alpha controls the scaling of the LoRA update:

effective_update = (alpha / r) × ΔW

Common conventions:

alpha = r: scaling factor of 1.0 (no scaling)
alpha = 2 × r: scaling factor of 2.0 (amplifies the update)
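alpha = 16 with any r: alpha is held fixed, so the scaling factor 16 / r shrinks as the rank grows (2.0 at r=8, 0.5 at r=32)

Practical recommendation: Set alpha = 2 × r or alpha = r, and keep that ratio fixed when you change r so that runs at different ranks stay comparable. Many practitioners use alpha=16 with r=8 as a stable default.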
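To make the three conventions concrete, the short sketch below evaluates the alpha / r scaling factor for each of them across several ranks. `lora_scale` is a name made up for this example; it only restates the formula above.

```python
def lora_scale(alpha, r):
    """Scaling factor applied to the LoRA update: alpha / r."""
    return alpha / r


print("  r | alpha=r | alpha=2r | alpha=16")
for r in (4, 8, 16, 32, 64):
    print(f"{r:>3} | {lora_scale(r, r):>7.2f} | {lora_scale(2 * r, r):>8.2f} | {lora_scale(16, r):>8.2f}")
```

The first two columns stay constant across ranks, while the fixed-alpha column falls as r grows. This is why a rank sweep with a fixed alpha changes two things at once.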

Python

from peft import LoraConfig, TaskType

config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16,            # Rank
    lora_alpha=32,   # alpha = 2r (scaling factor = 2)
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],
    bias="none",
)

Choosing Target Modules

LoRA can be applied to any linear layer. Common choices:

Query and value projections only (a common default):

Python

target_modules=["q_proj", "v_proj"]  # Smallest footprint; the pairing used in the original LoRA paper

All attention projections:

Python

target_modules=["q_proj", "k_proj", "v_proj", "o_proj"]

Attention + feed-forward (more parameters, more expressive):

Python

target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"]

Including feed-forward layers roughly triples the trainable parameter count at a fixed rank (about 3.1× for Llama-3.1-8B; the exact factor depends on the architecture), but it can improve performance on tasks that require new factual knowledge.
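To put numbers on these three choices, the sketch below counts LoRA parameters per configuration at a fixed rank. The layer shapes are assumptions based on the published Llama-3.1-8B configuration (hidden size 4096, grouped-query key/value width 1024, feed-forward width 14336, 32 layers). `SHAPES` and `trainable_params` are illustrative names, and other models will give different ratios.

```python
# Assumed (in_features, out_features) per projection, Llama-3.1-8B-style.
SHAPES = {
    "q_proj": (4096, 4096),
    "k_proj": (4096, 1024),
    "v_proj": (4096, 1024),
    "o_proj": (4096, 4096),
    "gate_proj": (4096, 14336),
    "up_proj": (4096, 14336),
    "down_proj": (14336, 4096),
}
NUM_LAYERS = 32  # assumed number of transformer layers

QV_ONLY = ["q_proj", "v_proj"]
ATTENTION = ["q_proj", "k_proj", "v_proj", "o_proj"]
ALL_LINEAR = ATTENTION + ["gate_proj", "up_proj", "down_proj"]


def trainable_params(target_modules, r):
    """LoRA parameters across all layers: r * (in + out) per targeted matrix."""
    per_layer = sum(r * (SHAPES[m][0] + SHAPES[m][1]) for m in target_modules)
    return per_layer * NUM_LAYERS


for label, modules in [("q, v", QV_ONLY), ("attention", ATTENTION), ("attention + MLP", ALL_LINEAR)]:
    print(f"{label:>16}: {trainable_params(modules, r=16):,}")

ratio = trainable_params(ALL_LINEAR, 16) / trainable_params(ATTENTION, 16)
print(f"attention + MLP vs attention only: {ratio:.2f}x")
```

Because the count is linear in r, the ratio between configurations is the same at every rank. Halving r while adding the feed-forward layers lands at a budget similar to attention-only at the original rank.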

Find module names for any model:

Python

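import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B")

# LoRA targets linear layers, so list only nn.Linear modules.
# (Checking hasattr(module, "weight") would also match embeddings and norm layers.)
for name, module in model.named_modules():
    if isinstance(module, torch.nn.Linear):
        print(name)
# Output: model.layers.0.self_attn.q_proj, model.layers.0.self_attn.k_proj, ...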

Full Configuration Example

Python

from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, TaskType, get_peft_model
import torch

# QLoRA: 4-bit quantized base + LoRA adapters
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",
    quantization_config=bnb_config,
    device_map="auto",
)

lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    bias="none",
    inference_mode=False,
)

peft_model = get_peft_model(model, lora_config)
peft_model.print_trainable_parameters()
# Prints the trainable and total parameter counts. For r=16 on q/k/v/o, expect
# roughly 13.6M trainable parameters (about 0.17% of ~8B); exact totals depend on the PEFT version.

Diagnosing Rank Selection

Signs r is too low:

Validation loss plateaus early
Model doesn't capture domain-specific patterns
Evaluation metrics don't improve beyond a baseline

Signs r is too high:

Training loss decreases but validation loss increases (overfitting)
Model memorizes training examples rather than generalizing
Evaluation performance worse than lower-rank variant
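The two lists of symptoms above can be turned into a rough automatic check on per-epoch loss histories. This is only a heuristic sketch: `diagnose`, its `window` and `min_delta` parameters, and the returned labels are all inventions for this example, and the thresholds would need tuning for a real training run.

```python
def diagnose(train_losses, val_losses, window=2, min_delta=1e-3):
    """Classify the recent loss trend as improving, overfitting, plateau, or inconclusive."""
    if len(train_losses) != len(val_losses) or len(val_losses) <= window:
        raise ValueError("need equal-length histories longer than the window")

    # Change over the last `window` evaluations (negative means the loss fell).
    train_change = train_losses[-1] - train_losses[-1 - window]
    val_change = val_losses[-1] - val_losses[-1 - window]

    if val_change < -min_delta:
        return "improving"        # validation loss is still falling
    if val_change > min_delta and train_change < -min_delta:
        return "overfitting"      # train falls while validation rises: try a lower r
    if abs(val_change) <= min_delta:
        return "plateau"          # validation is flat: try a higher r if metrics are poor
    return "inconclusive"


# Hypothetical loss histories, made up purely to exercise each branch.
print(diagnose([2.0, 1.5, 1.2, 1.0, 0.8], [2.1, 1.7, 1.6, 1.7, 1.9]))
print(diagnose([2.0, 1.9, 1.9, 1.9], [2.1, 2.0, 2.0, 2.0]))
print(diagnose([2.0, 1.8, 1.6, 1.4], [2.1, 1.9, 1.7, 1.5]))
```

A plateau on its own does not prove the rank is too low. Learning rate, data quality, and the choice of target modules can produce the same curve, so treat the label as a prompt to investigate.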

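Systematic search: Train with r = 4, 8, 16, and 32 on the same dataset, keeping the alpha / r ratio and all other settings fixed, and evaluate each run on the validation set. Plot eval loss against r. The best r is the one with the lowest eval loss; ranks beyond that point add parameters without improving generalization and tend toward overfitting.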
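Once the sweep has finished, picking the rank can be as simple as the sketch below. The eval losses are hypothetical placeholders, not measured results, and the `tolerance` rule (prefer the smallest rank whose loss is within a small margin of the best) is an extra design choice added for this example, not part of the procedure described above.

```python
def pick_rank(eval_loss_by_rank, tolerance=0.0):
    """Return the smallest rank whose eval loss is within `tolerance` of the best."""
    best_loss = min(eval_loss_by_rank.values())
    candidates = [r for r, loss in eval_loss_by_rank.items() if loss <= best_loss + tolerance]
    return min(candidates)


# Hypothetical sweep results: rank -> final validation loss.
results = {4: 1.92, 8: 1.81, 16: 1.78, 32: 1.80}

strict_choice = pick_rank(results)                  # lowest eval loss only
frugal_choice = pick_rank(results, tolerance=0.05)  # accept near-ties, prefer fewer params

print(strict_choice, frugal_choice)
```

Allowing a tolerance trades a small amount of eval loss for a smaller adapter, which may matter when many adapters are stored or served side by side.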

For most practical fine-tuning tasks with 1k–100k examples, r=16 is a reliable starting point that often needs no further adjustment.
