LoRA Rank Selection: How to Choose r
Understand how LoRA rank r controls the parameter count and expressiveness of fine-tuning. Learn heuristics for choosing r, alpha, and target modules for different tasks.
What Does Rank Control?
In LoRA, weight updates are decomposed into two low-rank matrices:
ΔW = A × B
where A ∈ R^(d × r), B ∈ R^(r × k)
The rank r controls:
- Parameter count: r × (d + k) per LoRA module, which equals 2 × d × r when d = k (vs d × k for full fine-tuning)
- Expressiveness: Higher r can represent more complex weight updates
- Overfitting risk: Higher r with small datasets can lead to overfitting
Choosing r is the central hyperparameter decision in LoRA fine-tuning.
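The decomposition above can be checked numerically. The following is a minimal NumPy sketch, assuming small illustrative sizes (d=64, k=48, r=4) and random factors; real LoRA adapters are trained and attached to much larger matrices. It shows that the product has the full d × k shape, stores only r × (d + k) values, and has rank at most r.

```python
import numpy as np

# Illustrative sizes only; a real attention matrix is far larger.
d, k, r = 64, 48, 4

rng = np.random.default_rng(0)
A = rng.normal(size=(d, r))   # d x r factor (random here, learned in practice)
B = rng.normal(size=(r, k))   # r x k factor (random here, learned in practice)

delta_w = A @ B               # d x k update assembled from the two factors

lora_params = A.size + B.size     # r * (d + k) stored values
full_params = d * k               # values in a dense d x k update
update_rank = int(np.linalg.matrix_rank(delta_w))

print(delta_w.shape)              # (64, 48)
print(lora_params, full_params)   # 448 3072
print(update_rank <= r)           # True: the update can never exceed rank r
```

The rank bound is what "expressiveness" means in practice: whatever the adapter learns, the resulting update lies in the set of d × k matrices of rank r or lower.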
Parameter Count by Rank
For a typical attention weight matrix in a 7B model (d=4096, k=4096):
| Rank r | LoRA params per matrix | % of full matrix (16.8M params) |
|---|---|---|
| 4 | 32,768 | 0.2% |
| 8 | 65,536 | 0.4% |
| 16 | 131,072 | 0.8% |
| 32 | 262,144 | 1.6% |
| 64 | 524,288 | 3.1% |
With 4 target matrices (q, k, v, o projections) per layer and 32 layers in a 7B model, r=16 gives roughly 16.8M trainable parameters (131,072 × 4 × 32) out of 7B total, about 0.2%.
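The arithmetic behind these figures is easy to reproduce. This short sketch assumes the same square 4096 × 4096 projections used above; `lora_params` is a helper name introduced here for illustration, not part of any library.

```python
def lora_params(d: int, k: int, r: int) -> int:
    """Trainable parameters for one LoRA module on a d x k weight matrix."""
    return r * (d + k)


d = k = 4096      # square attention projection, as assumed in this section
full = d * k      # 16,777,216 parameters in the dense matrix

for r in (4, 8, 16, 32, 64):
    n = lora_params(d, k, r)
    print(f"r={r:>2}: {n:>7,} params ({n / full:.1%} of the full matrix)")

# Whole-model estimate: 4 projections per layer, 32 layers, r=16.
total = lora_params(d, k, 16) * 4 * 32
print(f"{total:,}")   # 16,777,216
```

Models that use grouped-query attention have smaller k and v projections, so their real totals come out somewhat lower than this square-matrix estimate.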
Rank Selection Heuristics
Start with r=8 or r=16 for most tasks. This is the practical default that works well across a wide range of fine-tuning scenarios.
| Task complexity | Recommended r | Why |
|---|---|---|
| Simple style/format adaptation | 4–8 | Minimal weight update needed |
| Domain adaptation (medical, legal) | 8–16 | Moderate concept shift |
| New task learning (classification) | 16–32 | More complex weight change |
| Complex reasoning / instruction following | 32–64 | High expressiveness needed |
| Very large dataset, complex task | 64–128 | Can absorb more signal |
When in doubt: start at r=16, evaluate, then reduce if overfitting or increase if underfitting.
The alpha Parameter
lora_alpha controls the scaling of the LoRA update:
effective_update = (alpha / r) × ΔW
Common conventions:
- alpha = r: scaling factor of 1.0 (no scaling)
- alpha = 2 × r: scaling factor of 2.0 (amplifies the update)
- alpha = 16 with any r: alpha stays fixed, so the scaling factor 16 / r shrinks as r grows
Practical recommendation: Set alpha = 2 × r or alpha = r. The exact ratio matters less than keeping it consistent across runs, so that changing r does not also change the scale of the update. Many practitioners use alpha=16 with r=8 as a stable default.
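To make the conventions concrete, the sketch below tabulates the multiplier alpha / r for each one. `lora_scaling` is a hypothetical helper written for this illustration; it simply applies the formula above.

```python
def lora_scaling(alpha: float, r: int) -> float:
    """Multiplier applied to the low-rank update: alpha / r."""
    return alpha / r


ranks = (4, 8, 16, 32, 64)

for r in ranks:
    print(
        f"r={r:>2}  "
        f"alpha=r -> {lora_scaling(r, r):.2f}  "
        f"alpha=2r -> {lora_scaling(2 * r, r):.2f}  "
        f"alpha=16 -> {lora_scaling(16, r):.2f}"
    )
```

Tying alpha to r keeps the multiplier constant across a rank sweep, whereas a fixed alpha=16 gives 4.0 at r=4 and 0.25 at r=64. That difference is worth keeping in mind when comparing runs at different ranks.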
from peft import LoraConfig, TaskType
config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16,            # Rank
    lora_alpha=32,   # alpha = 2r (scaling factor = 2)
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],
    bias="none",
)
Choosing Target Modules
LoRA can be applied to any linear layer. Common choices:
Attention only (common default):
target_modules=["q_proj", "v_proj"]  # Most impactful with fewest params
All attention projections:
target_modules=["q_proj", "k_proj", "v_proj", "o_proj"]
Attention + feed-forward (more parameters, more expressive):
target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"]
Compared with targeting all attention projections, including the feed-forward layers roughly triples the trainable parameter count but can significantly improve performance for tasks requiring new factual knowledge.
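The cost of each target set can be estimated before training. The sketch below assumes the projection shapes of Llama-3.1-8B (hidden size 4096, grouped-query k/v projections of width 1024, feed-forward width 14336, 32 layers); these numbers are assumptions to check against the config of whichever model is actually used, and `trainable_params` is a helper introduced here for illustration.

```python
# (in_features, out_features) per projection.
# Assumed from the Llama-3.1-8B configuration; verify against your model.
SHAPES = {
    "q_proj": (4096, 4096),
    "k_proj": (4096, 1024),
    "v_proj": (4096, 1024),
    "o_proj": (4096, 4096),
    "gate_proj": (4096, 14336),
    "up_proj": (4096, 14336),
    "down_proj": (14336, 4096),
}
NUM_LAYERS = 32


def trainable_params(target_modules, r):
    """LoRA parameters across all layers: r * (in + out) per targeted matrix."""
    per_layer = sum(r * sum(SHAPES[name]) for name in target_modules)
    return per_layer * NUM_LAYERS


attn_min = ["q_proj", "v_proj"]
attn_all = ["q_proj", "k_proj", "v_proj", "o_proj"]
all_linear = attn_all + ["gate_proj", "up_proj", "down_proj"]

for label, modules in [("q, v only", attn_min),
                       ("all attention", attn_all),
                       ("attention + feed-forward", all_linear)]:
    print(f"{label:<26} {trainable_params(modules, r=16):>12,}")

ratio = trainable_params(all_linear, 16) / trainable_params(attn_all, 16)
print(f"feed-forward multiplier: {ratio:.2f}x")
```

Under these assumed shapes, adding the feed-forward projections multiplies the adapter size by about three relative to all attention projections. The multiplier depends on the ratio of feed-forward width to hidden size, so other architectures will differ.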
Find module names for any model:
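import torch
from transformers import AutoModelForCausalLM
# Note: this loads the full model weights just to inspect module names.
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B")
for name, module in model.named_modules():
    # LoRA targets linear layers, so skip embeddings and normalization layers.
    if isinstance(module, torch.nn.Linear):
        print(name)
# Output: model.layers.0.self_attn.q_proj, model.layers.0.self_attn.k_proj, ...
Full Configuration Example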
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, TaskType, get_peft_model
import torch
# QLoRA: 4-bit quantized base + LoRA adapters
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",
    quantization_config=bnb_config,
    device_map="auto",
)
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    bias="none",
    inference_mode=False,
)
peft_model = get_peft_model(model, lora_config)
peft_model.print_trainable_parameters()
# Expected for this config on Llama-3.1-8B: trainable params: 13,631,488 (about 0.17% of all params)
Diagnosing Rank Selection
Signs r is too low:
- Validation loss plateaus early
- Model doesn't capture domain-specific patterns
- Evaluation metrics don't improve beyond a baseline
Signs r is too high:
- Training loss decreases but validation loss increases (overfitting)
- Model memorizes training examples rather than generalizing
- Evaluation performance worse than lower-rank variant
Systematic search: Train with r=4, 8, 16, 32 on the same dataset and evaluate each run on the same validation set. Plot eval loss vs r. The best r is the one with the lowest eval loss; ranks beyond it are in overfitting territory.
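Once the sweep has produced one eval loss per rank, the selection step can be sketched as follows. The loss values are hypothetical placeholders, not measured results, and the `tolerance` argument is an addition made for this illustration: it prefers the smallest rank whose loss is within a chosen margin of the best, on the assumption that small differences may be noise.

```python
def pick_rank(eval_loss_by_rank: dict[int, float], tolerance: float = 0.0) -> int:
    """Return the smallest rank whose eval loss is within `tolerance` of the best."""
    best = min(eval_loss_by_rank.values())
    candidates = [r for r, loss in eval_loss_by_rank.items()
                  if loss <= best + tolerance]
    return min(candidates)


# Hypothetical eval losses for illustration only (not measured results).
sweep = {4: 1.42, 8: 1.31, 16: 1.27, 32: 1.29}

print(pick_rank(sweep))                  # 16: strictly lowest eval loss
print(pick_rank(sweep, tolerance=0.05))  # 8: smallest rank within 0.05 of the best
```

With `tolerance=0.0` this reduces to the rule stated above (lowest eval loss wins). A nonzero tolerance trades a small amount of eval loss for a smaller adapter, which may or may not be desirable depending on the task.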
For most practical fine-tuning tasks with 1k–100k examples, r=16 is a reliable starting point that rarely requires adjustment.