Adapter Layers: How PEFT Works

What Are Adapter Layers?

Adapter layers are small trainable modules inserted between the frozen layers of a pre-trained model. During fine-tuning, only the adapter parameters are updated — the original model weights remain frozen.

This was the original PEFT (Parameter-Efficient Fine-Tuning) approach, introduced in the "Parameter-Efficient Transfer Learning for NLP" paper (Houlsby et al., 2019).

Adapter Architecture

A standard adapter module is a bottleneck:

Input (d_model dimensions)
    ↓
Down-projection: d_model → r  (r is the bottleneck size, e.g. 64)
    ↓
Non-linearity (ReLU or GELU)
    ↓
Up-projection: r → d_model
    ↓
Residual add (skip connection)
    ↓
Output (d_model dimensions)

The residual connection adds the adapter's output back onto its input, so an adapter whose up-projection is initialized at or near zero starts out as a near-identity function: at initialization, it barely changes the model's behavior. Training gradually shapes the adapter to specialize for the target domain.

Python

import torch
import torch.nn as nn

class AdapterLayer(nn.Module):
    def __init__(self, d_model: int, bottleneck: int, dropout: float = 0.1):
        super().__init__()
        self.down_proj = nn.Linear(d_model, bottleneck)
        self.activation = nn.GELU()
        self.up_proj = nn.Linear(bottleneck, d_model)
        self.dropout = nn.Dropout(dropout)
        self.layer_norm = nn.LayerNorm(d_model)

        # Zero-initialize the up-projection so the adapter starts as an identity
        nn.init.zeros_(self.up_proj.weight)
        nn.init.zeros_(self.up_proj.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        residual = x
        x = self.layer_norm(x)
        x = self.down_proj(x)
        x = self.activation(x)
        x = self.dropout(x)
        x = self.up_proj(x)
        return x + residual  # Residual connection
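
The identity-at-initialization property can be checked without any framework. The following is a minimal pure-Python sketch rather than the module above: it omits dropout and layer normalization, uses ReLU, and the helper names (`matvec`, `adapter_forward`) and the tiny sizes are illustrative assumptions.

```python
import random

def matvec(weight, vec, bias):
    """Compute weight @ vec + bias for a list-of-rows weight matrix."""
    return [sum(w * v for w, v in zip(row, vec)) + b for row, b in zip(weight, bias)]

def adapter_forward(x, w_down, b_down, w_up, b_up):
    """Bottleneck adapter: down-project, ReLU, up-project, residual add."""
    hidden = [max(0.0, h) for h in matvec(w_down, x, b_down)]
    update = matvec(w_up, hidden, b_up)
    return [xi + ui for xi, ui in zip(x, update)]

random.seed(0)
d_model, bottleneck = 8, 2  # toy sizes, assumed for illustration

# Down-projection gets small random weights; up-projection starts at zero.
w_down = [[random.gauss(0.0, 0.02) for _ in range(d_model)] for _ in range(bottleneck)]
b_down = [0.0] * bottleneck
w_up = [[0.0] * bottleneck for _ in range(d_model)]
b_up = [0.0] * d_model

x = [random.uniform(-1.0, 1.0) for _ in range(d_model)]

# With a zeroed up-projection the update is all zeros, so output == input.
y_init = adapter_forward(x, w_down, b_down, w_up, b_up)

# Stand-in for training: move one up-projection parameter away from zero.
b_up[0] = 0.5
y_trained = adapter_forward(x, w_down, b_down, w_up, b_up)
```

Here `y_init` equals `x` exactly, while `y_trained` differs from `x` only in the coordinate whose up-projection parameter was changed.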

Where Adapters Are Inserted

Adapters are typically inserted after each transformer layer's attention and feed-forward sub-layers:

[Self-Attention] → [Adapter A] → [Add & Norm]
[Feed-Forward]  → [Adapter B] → [Add & Norm]

Some variants insert adapters only after attention, or only after feed-forward. The insertion point affects what the model learns to specialize.
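
The choice of insertion points also sets the parameter budget. As a rough sketch, the count below covers only the two linear projections of each adapter (weights plus biases) and ignores any layer normalization inside the adapter; the model shape (`d_model=768`, 12 layers, bottleneck 64) is a hypothetical example, not a figure from any particular model.

```python
def adapter_param_count(d_model: int, bottleneck: int) -> int:
    """Parameters in one adapter: down-projection plus up-projection."""
    down = d_model * bottleneck + bottleneck  # weight + bias
    up = bottleneck * d_model + d_model       # weight + bias
    return down + up

def total_adapter_params(d_model: int, bottleneck: int,
                         num_layers: int, adapters_per_layer: int = 2) -> int:
    """Adapter parameters summed over every transformer layer."""
    return num_layers * adapters_per_layer * adapter_param_count(d_model, bottleneck)

# Hypothetical model shape, chosen only for illustration.
d_model, bottleneck, num_layers = 768, 64, 12

per_adapter = adapter_param_count(d_model, bottleneck)
both_sublayers = total_adapter_params(d_model, bottleneck, num_layers, adapters_per_layer=2)
one_sublayer = total_adapter_params(d_model, bottleneck, num_layers, adapters_per_layer=1)

print(per_adapter)     # 99136
print(both_sublayers)  # 2379264
print(one_sublayer)    # 1189632
```

Under these assumptions, inserting an adapter after only one sub-layer halves the added parameters compared with inserting after both.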

Adapter vs LoRA: Key Differences

| Aspect | Adapter Layers | LoRA |
|---|---|---|
| Mechanism | Bottleneck MLP inserted in series | Low-rank decomposition of weight updates |
| Added parameters | Adapters in every layer | Rank matrices for attention weights |
| Inference overhead | Yes, extra forward pass through bottleneck | No, LoRA can be merged into weights |
| Flexibility | Insert anywhere | Works on weight matrices |
| Typical use | Research, multi-task learning | Production fine-tuning |
| Memory during inference | Slightly higher | Same as base model after merge |

LoRA is now the dominant PEFT method for most production fine-tuning because it adds zero inference overhead when merged.
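
Merging works because the LoRA update is itself a matrix of the same shape as the frozen weight. The toy example below is a pure-Python sketch with made-up numbers; the matrices `W`, `B`, `A`, the `scale` factor (standing in for alpha / r), and the helper names are all assumptions for illustration. It shows that folding the low-rank product into the weight gives the same output as running the two paths separately.

```python
def matvec(matrix, vec):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(m * v for m, v in zip(row, vec)) for row in matrix]

def matmul(a, b):
    """Multiply two list-of-rows matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

# Frozen weight W (2 x 3) and rank-1 LoRA factors B (2 x 1), A (1 x 3).
W = [[1.0, 2.0, 0.0],
     [0.0, 1.0, -1.0]]
B = [[0.5], [-1.0]]
A = [[1.0, 0.0, 2.0]]
scale = 2.0  # assumed scaling factor, playing the role of alpha / r

x = [1.0, -2.0, 3.0]

# Unmerged: the base path and the low-rank path run separately, then add.
base_out = matvec(W, x)
lora_out = matvec(B, matvec(A, x))
unmerged = [b + scale * l for b, l in zip(base_out, lora_out)]

# Merged: fold scale * (B @ A) into W once, then do a single matmul.
delta = matmul(B, A)
W_merged = [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, delta)]
merged = matvec(W_merged, x)

print(unmerged)  # [4.0, -19.0]
print(merged)    # [4.0, -19.0]
```

A bottleneck adapter has no equivalent shortcut, because the non-linearity between its two projections prevents collapsing it into a single weight matrix.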

Multi-Task Learning with Adapters

Adapters shine in multi-task settings, where you train one adapter per task and share a single frozen base model:

Frozen GPT-2 base
    ├── Adapter_DrugInteractions (task 1)
    ├── Adapter_PatientLeaflets (task 2)
    └── Adapter_ClinicalTrials (task 3)

At inference, swap adapters to switch tasks without loading a full model per task. This is the original motivation for adapter-based fine-tuning.

Python

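from transformers import AutoModelForCausalLM
from peft import PeftModel

# Load base model once
base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B")

# Attach the first adapter, then load further adapters onto the same wrapper
model = PeftModel.from_pretrained(
    base_model, "./adapter-drug-interactions", adapter_name="drug_interactions"
)
model.load_adapter("./adapter-patient-leaflets", adapter_name="patient_leaflets")

# Swap at runtime: the frozen base is shared, only the active adapter changes
model.set_adapter("drug_interactions")
model.set_adapter("patient_leaflets")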

Using Adapters with PEFT

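In the peft library, "adapter" refers broadly to any small set of trainable weights attached to a frozen base model. The example below uses AdaLoraConfig, which configures AdaLoRA, a LoRA variant with adaptive rank allocation rather than a bottleneck adapter in the Houlsby et al. sense: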

Python

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import get_peft_model, AdaLoraConfig, TaskType

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B")

# AdaLoRA: adaptive rank allocation (more sophisticated than fixed LoRA rank)
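config = AdaLoraConfig(
    task_type=TaskType.CAUSAL_LM,
    init_r=12,          # Initial rank of each adapted matrix
    target_r=8,         # Target average rank after pruning
    beta1=0.85,         # EMA coefficient for sensitivity smoothing
    beta2=0.85,         # EMA coefficient for uncertainty estimation
    tinit=200,          # Warmup steps before rank pruning begins
    tfinal=1000,        # Final fine-tuning steps with the rank budget fixed
    deltaT=10,          # Steps between rank budget reallocations
    total_step=3000,    # Total training steps (illustrative; match your own schedule)
    target_modules=["q_proj", "v_proj"],
)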

peft_model = get_peft_model(model, config)
peft_model.print_trainable_parameters()
# Expect roughly 5M trainable parameters out of ~8B (well under 0.1%)
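
To see how `tinit`, `tfinal`, and `deltaT` interact, it can help to trace the rank budget over training. The sketch below follows the cubic decay described for AdaLoRA, but it is a simplified illustration and not peft's implementation: the function name `rank_budget`, the `total_step` value, and the count of 64 adapted matrices (32 layers times two target modules) are assumptions.

```python
def rank_budget(step: int, init_budget: int, target_budget: int,
                tinit: int, tfinal: int, total_step: int) -> int:
    """Total rank budget at a given step, decaying cubically after warmup."""
    if step <= tinit:
        return init_budget       # warmup: keep every initial rank
    if step > total_step - tfinal:
        return target_budget     # final phase: budget is fixed
    progress = (step - tinit) / (total_step - tfinal - tinit)
    return int((init_budget - target_budget) * (1 - progress) ** 3 + target_budget)

# Assumed setup: 64 adapted matrices, init_r=12 and target_r=8 per matrix.
num_matrices = 64
init_budget = 12 * num_matrices    # 768
target_budget = 8 * num_matrices   # 512
schedule = dict(tinit=200, tfinal=1000, total_step=3000)

for step in (0, 200, 1100, 2000, 3000):
    print(step, rank_budget(step, init_budget, target_budget, **schedule))
# 0 768
# 200 768
# 1100 544
# 2000 512
# 3000 512
```

With these values, pruning happens between steps 200 and 2000, and the last 1000 steps train with the budget held at its target.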

When to Use Adapters

Good fit for adapters:

Multi-task fine-tuning where you need to switch between tasks at inference
Research settings requiring flexible insertion points
Continual learning scenarios where you add adapters for new tasks without forgetting old ones

Use LoRA instead when:

You need zero inference overhead (production APIs)
You want simpler configuration
You're doing single-task fine-tuning

In practice, most production fine-tuning uses LoRA or QLoRA rather than classic adapters. Adapters are historically important and still relevant in multi-task learning research.
