Fine-Tuning LLMs · Lesson 7 of 16
Adapter Layers: An Alternative PEFT Approach
What Are Adapter Layers?
Adapter layers are small trainable modules inserted between the frozen layers of a pre-trained model. During fine-tuning, only the adapter parameters are updated — the original model weights remain frozen.
This was the original PEFT (Parameter-Efficient Fine-Tuning) approach, introduced in the "Parameter-Efficient Transfer Learning for NLP" paper (Houlsby et al., 2019).
Adapter Architecture
A standard adapter module is a bottleneck:
Input (d_model dimensions)
↓
Down-projection: d_model → r (r is the bottleneck size, e.g. 64)
↓
Non-linearity (ReLU or GELU)
↓
Up-projection: r → d_model
↓
Residual add (skip connection)
↓
Output (d_model dimensions)
The residual connection, combined with a zero-initialized up-projection, means the adapter starts as an identity function: at initialization it leaves the model's behavior unchanged. Training gradually shapes the adapter to specialize for the target domain.
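To get a feel for how small the bottleneck is, the parameter count of one adapter can be worked out directly from the two projections. The sketch below is illustrative only: the helper name `adapter_param_count` is hypothetical, and the sizes (d_model = 768, as in GPT-2 small, and r = 64 from the diagram) are assumptions chosen for the example.

```python
def adapter_param_count(d_model: int, bottleneck: int, with_layer_norm: bool = False) -> int:
    """Count trainable parameters in one bottleneck adapter."""
    down = d_model * bottleneck + bottleneck       # down-projection weight + bias
    up = bottleneck * d_model + d_model            # up-projection weight + bias
    norm = 2 * d_model if with_layer_norm else 0   # optional LayerNorm scale + shift
    return down + up + norm


# Assumed sizes for illustration: d_model=768, bottleneck r=64
per_adapter = adapter_param_count(768, 64)
dense_layer = 768 * 768 + 768  # one full d_model x d_model linear layer, for scale

print(f"one adapter: {per_adapter:,} parameters")
print(f"one dense d_model x d_model layer: {dense_layer:,} parameters")
```

Under these assumptions a single adapter holds roughly a sixth of the parameters of one square projection matrix, which is why adding adapters throughout the network still trains only a small fraction of the model.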
import torch
import torch.nn as nn

class AdapterLayer(nn.Module):
    def __init__(self, d_model: int, bottleneck: int, dropout: float = 0.1):
        super().__init__()
        self.down_proj = nn.Linear(d_model, bottleneck)
        self.activation = nn.GELU()
        self.up_proj = nn.Linear(bottleneck, d_model)
        self.dropout = nn.Dropout(dropout)
        self.layer_norm = nn.LayerNorm(d_model)
        # Initialize near-zero so adapter starts as identity
        nn.init.zeros_(self.up_proj.weight)
        nn.init.zeros_(self.up_proj.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        residual = x
        x = self.layer_norm(x)
        x = self.down_proj(x)
        x = self.activation(x)
        x = self.dropout(x)
        x = self.up_proj(x)
        return x + residual  # Residual connection
Where Adapters Are Inserted
Adapters are typically inserted after each transformer layer's attention and feed-forward sub-layers:
[Self-Attention] → [Adapter A] → [Add & Norm]
[Feed-Forward] → [Adapter B] → [Add & Norm]
Some variants insert adapters only after attention, or only after feed-forward. The insertion point affects what the model learns to specialize.
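The placement in the diagram can be sketched as a single post-norm transformer block. This is a minimal illustration under stated assumptions, not a reference implementation: the class names `Bottleneck` and `BlockWithAdapters`, the toy sizes, and the choice to freeze the layer norms along with the rest of the block are all assumptions made for the example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class Bottleneck(nn.Module):
    """Minimal bottleneck adapter: down-project, GELU, up-project, residual add."""

    def __init__(self, d_model: int, r: int):
        super().__init__()
        self.down = nn.Linear(d_model, r)
        self.up = nn.Linear(r, d_model)
        nn.init.zeros_(self.up.weight)  # zero init so the adapter starts as identity
        nn.init.zeros_(self.up.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.up(F.gelu(self.down(x)))


class BlockWithAdapters(nn.Module):
    """Post-norm transformer block with one adapter after each sub-layer."""

    def __init__(self, d_model: int = 64, n_heads: int = 4, r: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model)
        )
        self.adapter_a = Bottleneck(d_model, r)  # after self-attention
        self.adapter_b = Bottleneck(d_model, r)  # after feed-forward
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        # Freeze everything except the adapter parameters
        for name, param in self.named_parameters():
            param.requires_grad = name.startswith("adapter_")

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        attn_out, _ = self.attn(x, x, x, need_weights=False)
        x = self.norm1(x + self.adapter_a(attn_out))     # [Self-Attention] -> [Adapter A] -> [Add & Norm]
        x = self.norm2(x + self.adapter_b(self.ffn(x)))  # [Feed-Forward] -> [Adapter B] -> [Add & Norm]
        return x


block = BlockWithAdapters()
out = block(torch.randn(2, 5, 64))  # (batch, sequence, d_model)
trainable = [name for name, p in block.named_parameters() if p.requires_grad]
print(out.shape, trainable)
```

Dropping `adapter_a` or `adapter_b` from the forward pass gives the attention-only or feed-forward-only variants mentioned above.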
Adapter vs LoRA: Key Differences
| Aspect | Adapter Layers | LoRA |
|---|---|---|
| Mechanism | Bottleneck MLP inserted in series | Low-rank decomposition of weight updates |
| Added parameters | Adapters in every layer | Rank matrices for attention weights |
| Inference overhead | Yes: every forward pass also runs the bottleneck layers | None once LoRA is merged into the weights |
| Flexibility | Insert anywhere | Works on weight matrices |
| Typical use | Research, multi-task learning | Production fine-tuning |
| Memory during inference | Slightly higher | Same as base model after merge |
LoRA is now the dominant PEFT method for most production fine-tuning because it adds zero inference overhead when merged.
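The merge property can be checked on a toy example. The sketch below uses plain Python lists and made-up 3x3 values; the helper names `matmul` and `add`, the row-vector convention y = xW, and the omission of LoRA's scaling factor are simplifying assumptions for illustration.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def add(a, b):
    """Element-wise sum of two equally shaped matrices."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


# Toy values (assumed): d = 3, rank = 1
W = [[1.0, 0.0, 2.0], [0.0, 1.0, 0.0], [3.0, 0.0, 1.0]]  # frozen base weight, 3x3
A = [[1.0], [2.0], [0.0]]                                # low-rank factor, 3x1
B = [[0.5, 0.0, 1.0]]                                    # low-rank factor, 1x3
x = [[1.0, 2.0, 3.0]]                                    # one input row

# Unmerged: base path plus a low-rank side path (two extra small matmuls per call)
y_unmerged = add(matmul(x, W), matmul(matmul(x, A), B))

# Merged: fold the update into the weight once, then a single matmul at inference
W_merged = add(W, matmul(A, B))
y_merged = matmul(x, W_merged)

print(y_unmerged, y_merged)
```

Both paths give the same output because the LoRA update is purely linear. A bottleneck adapter has a non-linearity between its two projections, so it cannot be folded into an existing weight matrix this way and must run as extra layers at inference.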
Multi-Task Learning with Adapters
Adapters shine in multi-task settings: train one adapter per task, share the frozen base model:
Frozen GPT-2 base
├── Adapter_DrugInteractions (task 1)
├── Adapter_PatientLeaflets (task 2)
└── Adapter_ClinicalTrials (task 3)
At inference, swap adapters to switch tasks without loading a full model per task. This is the original motivation for adapter-based fine-tuning.
from transformers import AutoModelForCausalLM
from peft import PeftModel

# Load base model once
base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B")

# Attach the first adapter, then load further adapters into the same wrapper by name
model = PeftModel.from_pretrained(base_model, "./adapter-drug-interactions", adapter_name="drug_interactions")
model.load_adapter("./adapter-patient-leaflets", adapter_name="patient_leaflets")

# Swap at runtime; the frozen base is shared
model.set_adapter("patient_leaflets")
Using Adapters with PEFT
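The peft library focuses on LoRA-family and prompt-based methods rather than the classic bottleneck adapter shown above. A closely related option is AdaLoraConfig, which configures AdaLoRA, a LoRA variant that adapts the rank of each weight update during training: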
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import get_peft_model, AdaLoraConfig, TaskType

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B")

# AdaLoRA: adaptive rank allocation (more sophisticated than fixed LoRA rank)
config = AdaLoraConfig(
    task_type=TaskType.CAUSAL_LM,
    init_r=12,        # Initial rank
    target_r=8,       # Target average rank after pruning
    beta1=0.85,
    beta2=0.85,
    tinit=200,        # Warmup steps before rank adjustment begins
    tfinal=1000,      # Final steps trained with the rank budget fixed
    deltaT=10,        # Steps between rank re-allocations
    total_step=3000,  # Total training steps (illustrative value; recent peft releases require it)
    target_modules=["q_proj", "v_proj"],
)
peft_model = get_peft_model(model, config)
peft_model.print_trainable_parameters()
# Prints trainable vs. total parameter counts; expect a few million trainable out of ~8B (well under 0.1%)
When to Use Adapters
Good fit for adapters:
- Multi-task fine-tuning where you need to switch between tasks at inference
- Research settings requiring flexible insertion points
- Continual learning scenarios where you add adapters for new tasks without forgetting old ones
Use LoRA instead when:
- You need zero inference overhead (production APIs)
- You want simpler configuration
- You're doing single-task fine-tuning
In practice, most production fine-tuning uses LoRA or QLoRA rather than classic adapters. Adapters are historically important and still relevant in multi-task learning research.