Adapter Layers: How PEFT Works
Understand how adapter layers insert small trainable modules into a frozen LLM. Learn the architecture of adapters, how they differ from LoRA, and when to use each.
What Are Adapter Layers?
Adapter layers are small trainable modules inserted between the frozen layers of a pre-trained model. During fine-tuning, only the adapter parameters are updated; the original model weights remain frozen.
This was the original PEFT (Parameter-Efficient Fine-Tuning) approach, introduced in the "Parameter-Efficient Transfer Learning for NLP" paper (Houlsby et al., 2019).
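As a minimal sketch of what "only the adapter parameters are updated" can look like in PyTorch, the snippet below freezes a stand-in base layer and hands only the adapter's parameters to the optimizer. The layer sizes, the `base` and `adapter` names, and the random data are illustrative assumptions, not part of any real model.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical stand-in for a pre-trained layer; a real model would be loaded from a checkpoint.
base = nn.Linear(16, 16)
for p in base.parameters():
    p.requires_grad_(False)  # freeze the base weights

# Small trainable bottleneck (16 -> 4 -> 16), assumed sizes for illustration
adapter = nn.Sequential(nn.Linear(16, 4), nn.GELU(), nn.Linear(4, 16))

# Only the adapter's parameters are given to the optimizer
optimizer = torch.optim.AdamW(adapter.parameters(), lr=1e-3)

x = torch.randn(8, 16)
target = torch.randn(8, 16)
base_weight_before = base.weight.clone()

h = base(x)
out = h + adapter(h)  # adapter output added back through a skip connection
loss = nn.functional.mse_loss(out, target)
loss.backward()
optimizer.step()

# The frozen base received no gradient and did not change
base_unchanged = torch.equal(base.weight, base_weight_before)
print(base_unchanged, base.weight.grad is None)
```

Freezing with `requires_grad_(False)` also means no gradients are stored for the base weights, which is where much of the memory saving during fine-tuning comes from.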
Adapter Architecture
A standard adapter module is a bottleneck:
Input (d_model dimensions)
↓
Down-projection: d_model → r (r is the bottleneck size, e.g. 64)
↓
Non-linearity (ReLU or GELU)
↓
Up-projection: r → d_model
↓
Residual add (skip connection)
↓
Output (d_model dimensions)
The residual connection, together with a near-zero initialization of the up-projection, means the adapter starts as a near-identity function: at initialization, it barely changes the model's behavior. Training gradually shapes the adapter to specialize for the target domain.
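To get a feel for how small the bottleneck is, its parameter count can be worked out by hand. The sketch below counts weights and biases for one down/up projection pair and scales that to a whole model. The sizes (d_model = 768, r = 64, 12 layers, two adapters per layer) are assumptions chosen for illustration, and any layer norm inside the adapter is left out of the count.

```python
def bottleneck_params(d_model: int, r: int) -> int:
    """Parameters in one down/up projection pair, biases included."""
    down = d_model * r + r        # weight + bias of the down-projection
    up = r * d_model + d_model    # weight + bias of the up-projection
    return down + up

# Assumed sizes for illustration only
d_model, r, n_layers = 768, 64, 12
per_adapter = bottleneck_params(d_model, r)
total = per_adapter * 2 * n_layers  # assumes two adapters per transformer layer

print(per_adapter, total)  # 99136 2379264
```

Halving `r` roughly halves the count, which is why the bottleneck size is the main knob for trading capacity against parameter budget.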
import torch
import torch.nn as nn

class AdapterLayer(nn.Module):
    def __init__(self, d_model: int, bottleneck: int, dropout: float = 0.1):
        super().__init__()
        self.down_proj = nn.Linear(d_model, bottleneck)
        self.activation = nn.GELU()
        self.up_proj = nn.Linear(bottleneck, d_model)
        self.dropout = nn.Dropout(dropout)
        self.layer_norm = nn.LayerNorm(d_model)
        # Zero-initialize the up-projection so the adapter starts as an identity function
        nn.init.zeros_(self.up_proj.weight)
        nn.init.zeros_(self.up_proj.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        residual = x
        x = self.layer_norm(x)
        x = self.down_proj(x)
        x = self.activation(x)
        x = self.dropout(x)
        x = self.up_proj(x)
        return x + residual  # Residual connection

Where Adapters Are Inserted
Adapters are typically inserted after each transformer layer's attention and feed-forward sub-layers:
[Self-Attention] → [Adapter A] → [Add & Norm]
[Feed-Forward] → [Adapter B] → [Add & Norm]
Some variants insert adapters only after attention, or only after feed-forward. The insertion point affects what the model learns to specialize.
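The placement described above can be sketched as a toy transformer block. This is an illustrative sketch, not the layout of any particular model: the `Bottleneck` and `BlockWithAdapters` classes, the post-norm ordering, and all sizes are assumptions.

```python
import torch
import torch.nn as nn

class Bottleneck(nn.Module):
    """Minimal adapter: down-project, non-linearity, up-project, skip connection."""
    def __init__(self, d_model: int, r: int):
        super().__init__()
        self.down = nn.Linear(d_model, r)
        self.up = nn.Linear(r, d_model)
        nn.init.zeros_(self.up.weight)  # start as an identity function
        nn.init.zeros_(self.up.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.up(torch.relu(self.down(x)))

class BlockWithAdapters(nn.Module):
    """Toy post-norm transformer block with an adapter after each sub-layer."""
    def __init__(self, d_model: int = 32, n_heads: int = 4, r: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.adapter_a = Bottleneck(d_model, r)  # after self-attention
        self.adapter_b = Bottleneck(d_model, r)  # after feed-forward

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        attn_out, _ = self.attn(x, x, x, need_weights=False)
        x = self.norm1(x + self.adapter_a(attn_out))     # [Self-Attention] -> [Adapter A] -> [Add & Norm]
        x = self.norm2(x + self.adapter_b(self.ffn(x)))  # [Feed-Forward] -> [Adapter B] -> [Add & Norm]
        return x

torch.manual_seed(0)
block = BlockWithAdapters()
x = torch.randn(2, 5, 32)  # (batch, sequence, d_model)
y = block(x)
print(y.shape)
```

Dropping `adapter_a` or `adapter_b` from `forward` gives the attention-only or feed-forward-only variants.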
Adapter vs LoRA: Key Differences
| Aspect | Adapter Layers | LoRA |
|---|---|---|
| Mechanism | Bottleneck MLP inserted in series | Low-rank decomposition of weight updates |
| Added parameters | Adapters in every layer | Rank matrices for attention weights |
| Inference overhead | Yes: extra forward pass through bottleneck | No: LoRA can be merged into weights |
| Flexibility | Insert anywhere | Works on weight matrices |
| Typical use | Research, multi-task learning | Production fine-tuning |
| Memory during inference | Slightly higher | Same as base model after merge |
LoRA is now the dominant PEFT method for most production fine-tuning because it adds zero inference overhead when merged.
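The "zero overhead when merged" claim follows from simple linear algebra: a low-rank update B·A can be added into the frozen weight matrix once, after which the layer is an ordinary matrix multiply again. The sketch below checks this numerically with random matrices. The shapes and names are assumptions, and the usual LoRA scaling factor is omitted for brevity.

```python
import torch

torch.manual_seed(0)
d, r = 16, 4                    # assumed hidden size and LoRA rank

W = torch.randn(d, d)           # frozen pre-trained weight
A = torch.randn(r, d) * 0.01    # low-rank factor (r x d)
B = torch.randn(d, r) * 0.01    # low-rank factor (d x r)
x = torch.randn(3, d)

# Unmerged: base path plus a separate low-rank path (two extra matmuls per call)
unmerged = x @ W.T + (x @ A.T) @ B.T

# Merged: fold the update into the weight once, then run a single matmul
W_merged = W + B @ A
merged = x @ W_merged.T

print(torch.allclose(unmerged, merged, atol=1e-5))
```

A bottleneck adapter cannot be folded in the same way because of the non-linearity between its two projections, which is why it keeps an extra forward pass at inference.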
Multi-Task Learning with Adapters
Adapters shine in multi-task settings, where you train one adapter per task and share the frozen base model:
Frozen GPT-2 base
├── Adapter_DrugInteractions (task 1)
├── Adapter_PatientLeaflets (task 2)
└── Adapter_ClinicalTrials (task 3)
At inference, swap adapters to switch tasks without loading a full model per task. This is the original motivation for adapter-based fine-tuning.
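In plain PyTorch, the swap can be sketched by keeping the task adapters in a dictionary and choosing one per forward pass. This is a toy illustration under assumed names and sizes (`MultiTaskModel`, a single linear layer standing in for the base model), not how any particular library implements it.

```python
import torch
import torch.nn as nn

class MultiTaskModel(nn.Module):
    """One frozen base layer shared by several task-specific bottleneck adapters."""
    def __init__(self, d_model: int, r: int, tasks: list[str]):
        super().__init__()
        self.base = nn.Linear(d_model, d_model)  # hypothetical stand-in for a pre-trained model
        for p in self.base.parameters():
            p.requires_grad_(False)
        # One small adapter per task, all sharing the same base
        self.adapters = nn.ModuleDict({
            task: nn.Sequential(nn.Linear(d_model, r), nn.GELU(), nn.Linear(r, d_model))
            for task in tasks
        })

    def forward(self, x: torch.Tensor, task: str) -> torch.Tensor:
        h = self.base(x)
        return h + self.adapters[task](h)  # only the selected adapter runs

torch.manual_seed(0)
model = MultiTaskModel(16, 4, ["drug_interactions", "patient_leaflets", "clinical_trials"])
x = torch.randn(2, 16)

out_a = model(x, task="drug_interactions")
out_b = model(x, task="patient_leaflets")  # same base weights, different adapter
print(out_a.shape, out_b.shape)
```

Adding a fourth task here means adding one more small entry to `adapters`; the base weights are stored and loaded only once.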
from transformers import AutoModelForCausalLM
from peft import PeftModel

# Load base model once
base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B")

# Attach the first adapter, then load the second onto the same wrapped model
model = PeftModel.from_pretrained(base_model, "./adapter-drug-interactions", adapter_name="drug_interactions")
model.load_adapter("./adapter-patient-leaflets", adapter_name="patient_leaflets")

# Swap at runtime; both adapters share the frozen base
model.set_adapter("patient_leaflets")

Using Adapters with PEFT
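The peft library centers on LoRA-family and prompt-based methods rather than the classic bottleneck adapter shown above. The example below uses AdaLoraConfig, which configures AdaLoRA, a LoRA variant that adapts the rank of each weight update during training: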
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import get_peft_model, AdaLoraConfig, TaskType

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B")

# AdaLoRA: adaptive rank allocation (more sophisticated than fixed LoRA rank)
config = AdaLoraConfig(
    task_type=TaskType.CAUSAL_LM,
    init_r=12,        # Initial rank
    target_r=8,       # Target average rank after pruning
    beta1=0.85,
    beta2=0.85,
    tinit=200,        # Warmup steps before rank adjustment begins
    tfinal=1000,      # Final fine-tuning steps after rank adjustment ends
    deltaT=10,        # Steps between rank adjustments
    total_step=3000,  # Assumed training length; recent peft releases require it, with tinit < total_step - tfinal
    target_modules=["q_proj", "v_proj"],
)
peft_model = get_peft_model(model, config)
peft_model.print_trainable_parameters()
# Prints trainable vs. total parameter counts; only a small fraction of the 8B parameters is trainable

When to Use Adapters
Good fit for adapters:
- Multi-task fine-tuning where you need to switch between tasks at inference
- Research settings requiring flexible insertion points
- Continual learning scenarios where you add adapters for new tasks without forgetting old ones
Use LoRA instead when:
- You need zero inference overhead (production APIs)
- You want simpler configuration
- You're doing single-task fine-tuning
In practice, most production fine-tuning uses LoRA or QLoRA rather than classic adapters. Adapters are historically important and still relevant in multi-task learning research.