Adapter Layers: An Alternative PEFT Approach

What Are Adapter Layers?

Adapter layers are small trainable modules inserted between the frozen layers of a pre-trained model. During fine-tuning, only the adapter parameters are updated — the original model weights remain frozen.

This was the original PEFT (Parameter-Efficient Fine-Tuning) approach, introduced in the "Parameter-Efficient Transfer Learning for NLP" paper (Houlsby et al., 2019).


Adapter Architecture

A standard adapter module is a bottleneck:

Input (d_model dimensions)
    ↓
Down-projection: d_model → r  (r is the bottleneck size, e.g. 64)
    ↓
Non-linearity (ReLU or GELU)
    ↓
Up-projection: r → d_model
    ↓
Residual add (skip connection)
    ↓
Output (d_model dimensions)

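Because the up-projection is initialized at or near zero, the bottleneck branch contributes almost nothing at first, and the residual connection passes the input through unchanged. The adapter therefore starts as a near-identity function that barely changes the model's behavior at initialization. Training gradually shapes it to specialize for the target domain.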
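To get a feel for how small the bottleneck is, the parameter count can be worked out directly from the diagram above. This is only an illustrative sketch: the helper name `adapter_param_count` and the example sizes (d_model = 768, r = 64) are assumptions, and it counts just the two projections with their biases, ignoring any layer norm an implementation may add.

```python
def adapter_param_count(d_model: int, bottleneck: int) -> int:
    """Parameters in one bottleneck adapter (two linear layers with biases)."""
    down = d_model * bottleneck + bottleneck   # down-projection weight + bias
    up = bottleneck * d_model + d_model        # up-projection weight + bias
    return down + up

# Hypothetical sizes: a 768-wide model with a bottleneck of 64
per_adapter = adapter_param_count(768, 64)

# A dense d_model x d_model linear layer of the same width, for comparison
dense = 768 * 768 + 768

print(per_adapter)          # 99136
print(per_adapter / dense)  # roughly 0.17
```

Halving the bottleneck size roughly halves the adapter's parameter count, which is why r is the main knob for trading capacity against cost.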

Python
import torch
import torch.nn as nn

class AdapterLayer(nn.Module):
    def __init__(self, d_model: int, bottleneck: int, dropout: float = 0.1):
        super().__init__()
        self.down_proj = nn.Linear(d_model, bottleneck)
        self.activation = nn.GELU()
        self.up_proj = nn.Linear(bottleneck, d_model)
        self.dropout = nn.Dropout(dropout)
        self.layer_norm = nn.LayerNorm(d_model)

        # Zero-initialize the up-projection so the adapter starts as an identity mapping
        nn.init.zeros_(self.up_proj.weight)
        nn.init.zeros_(self.up_proj.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        residual = x
        x = self.layer_norm(x)
        x = self.down_proj(x)
        x = self.activation(x)
        x = self.dropout(x)
        x = self.up_proj(x)
        return x + residual  # Residual connection

Where Adapters Are Inserted

Adapters are typically inserted after each transformer layer's attention and feed-forward sub-layers:

[Self-Attention] → [Adapter A] → [Add & Norm]
[Feed-Forward]  → [Adapter B] → [Add & Norm]

Some variants insert adapters only after attention, or only after feed-forward. The insertion point affects what the model learns to specialize.
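The placement described above can be sketched in plain PyTorch. This is a minimal illustration, not a reference implementation: `Bottleneck` and `AdaptedSublayer` are hypothetical names, the feed-forward stand-in and its sizes are made up, and the layer norm is frozen here purely to keep the example short.

```python
import torch
import torch.nn as nn

class Bottleneck(nn.Module):
    """Minimal bottleneck adapter with a zero-initialized up-projection."""
    def __init__(self, d_model: int, r: int):
        super().__init__()
        self.down = nn.Linear(d_model, r)
        self.up = nn.Linear(r, d_model)
        nn.init.zeros_(self.up.weight)
        nn.init.zeros_(self.up.bias)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        return h + self.up(torch.relu(self.down(h)))

class AdaptedSublayer(nn.Module):
    """Computes norm(x + adapter(sublayer(x))) with the sub-layer and norm frozen."""
    def __init__(self, sublayer: nn.Module, d_model: int, r: int):
        super().__init__()
        self.sublayer = sublayer
        self.norm = nn.LayerNorm(d_model)
        self.adapter = Bottleneck(d_model, r)
        # Freeze everything that stands in for the pre-trained model
        for p in list(self.sublayer.parameters()) + list(self.norm.parameters()):
            p.requires_grad = False

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.norm(x + self.adapter(self.sublayer(x)))

# Stand-in for a pre-trained feed-forward sub-layer (hypothetical sizes)
d_model = 16
ffn = nn.Sequential(nn.Linear(d_model, 32), nn.GELU(), nn.Linear(32, d_model))
block = AdaptedSublayer(ffn, d_model, r=4)

x = torch.randn(2, 5, d_model)
baseline = block.norm(x + ffn(x))  # what the block computes with no adapter
same_at_init = torch.allclose(block(x), baseline)
trainable = [n for n, p in block.named_parameters() if p.requires_grad]
print(same_at_init, trainable)
```

Wrapping the attention sub-layer the same way gives the two-adapter layout shown above; wrapping only one of the two gives the single-adapter variants.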


Adapter vs LoRA: Key Differences

| Aspect | Adapter Layers | LoRA |
|---|---|---|
| Mechanism | Bottleneck MLP inserted in series | Low-rank decomposition of weight updates |
| Added parameters | Adapters in every layer | Rank matrices for attention weights |
| Inference overhead | Yes: extra forward pass through bottleneck | No: LoRA can be merged into weights |
| Flexibility | Insert anywhere | Works on weight matrices |
| Typical use | Research, multi-task learning | Production fine-tuning |
| Memory during inference | Slightly higher | Same as base model after merge |

LoRA is now the dominant PEFT method for most production fine-tuning because it adds zero inference overhead when merged.
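The "zero overhead when merged" claim can be checked numerically. A LoRA update is linear, so it folds into the frozen weight matrix, whereas a bottleneck adapter has a non-linearity between its two projections and in general cannot be collapsed this way. The sketch below uses NumPy with made-up shapes and random values, and omits LoRA's scaling factor for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 8, 2

W = rng.normal(size=(d, d))   # frozen pre-trained weight
A = rng.normal(size=(r, d))   # LoRA down-projection
B = rng.normal(size=(d, r))   # LoRA up-projection
x = rng.normal(size=(d,))

# During training, LoRA runs a low-rank side path next to the frozen weight
lora_unmerged = W @ x + B @ (A @ x)

# After training, B @ A is added into W once, so inference is a single matmul
W_merged = W + B @ A
lora_merged = W_merged @ x

merge_is_exact = np.allclose(lora_unmerged, lora_merged)
print(merge_is_exact)  # True
```

The merged matrix has the same shape as the original weight, which is why the table lists inference memory as unchanged after the merge.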


Multi-Task Learning with Adapters

Adapters shine in multi-task settings: train one adapter per task, share the frozen base model:

Frozen GPT-2 base
    ├── Adapter_DrugInteractions (task 1)
    ├── Adapter_PatientLeaflets (task 2)
    └── Adapter_ClinicalTrials (task 3)

At inference, swap adapters to switch tasks without loading a full model per task. This is the original motivation for adapter-based fine-tuning.

Python
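from transformers import AutoModelForCausalLM
from peft import PeftModel

# Load base model once
base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B")

# Attach the first adapter under a name, then load the second onto the same wrapper
# (wrapping the same base model twice would inject adapter layers into it twice)
model = PeftModel.from_pretrained(
    base_model, "./adapter-drug-interactions", adapter_name="drug_interactions"
)
model.load_adapter("./adapter-patient-leaflets", adapter_name="patient_leaflets")

# Swap at runtime: both adapters share the same frozen base
model.set_adapter("patient_leaflets")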
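Conceptually, the swap amounts to keeping one frozen base and a dictionary of small task-specific modules. The following is a rough plain-PyTorch sketch of that idea, not how peft implements it: `TaskAdapter`, `MultiTaskModel`, and `set_task` are hypothetical names, and a single linear layer stands in for the base model.

```python
import torch
import torch.nn as nn

class TaskAdapter(nn.Module):
    """Small bottleneck module trained for one task."""
    def __init__(self, d_model: int, r: int):
        super().__init__()
        self.down = nn.Linear(d_model, r)
        self.up = nn.Linear(r, d_model)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        return h + self.up(torch.relu(self.down(h)))

class MultiTaskModel(nn.Module):
    """One frozen base shared by several named adapters."""
    def __init__(self, base: nn.Module, d_model: int, r: int, tasks: list):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # the base is never updated
        self.adapters = nn.ModuleDict({t: TaskAdapter(d_model, r) for t in tasks})
        self.active = tasks[0]

    def set_task(self, task: str) -> None:
        if task not in self.adapters:
            raise KeyError(f"unknown task: {task}")
        self.active = task

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.adapters[self.active](self.base(x))

torch.manual_seed(0)
base = nn.Linear(16, 16)  # stand-in for a large pre-trained model
model = MultiTaskModel(base, 16, 4, ["drug_interactions", "patient_leaflets"])

x = torch.randn(3, 16)
model.set_task("drug_interactions")
out_a = model(x)
model.set_task("patient_leaflets")
out_b = model(x)  # same base weights, different adapter

base_params = sum(p.numel() for p in model.base.parameters())
adapter_params = sum(p.numel() for p in model.adapters["patient_leaflets"].parameters())
print(base_params, adapter_params)
```

Each additional task costs only one more adapter's worth of parameters, while the base is stored and loaded once.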

Using Adapters with PEFT

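The peft library focuses on LoRA-style and prompt-based methods rather than the classic bottleneck adapter. The example below uses AdaLoraConfig, which configures AdaLoRA: a LoRA variant that adjusts the rank of each update matrix during training, and a more modern take on the same idea of adding a small trainable module to a frozen model: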

Python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import get_peft_model, AdaLoraConfig, TaskType

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B")

# AdaLoRA: adaptive rank allocation (more sophisticated than fixed LoRA rank)
config = AdaLoraConfig(
    task_type=TaskType.CAUSAL_LM,
    init_r=12,          # Initial rank
    target_r=8,         # Target rank after pruning
    beta1=0.85,
    beta2=0.85,
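    tinit=200,          # Warmup steps before rank adjustment begins
    tfinal=1000,        # Final fine-tuning steps after rank adjustment stops
    deltaT=10,          # Steps between rank-budget updates
    total_step=3000,    # Assumed total training steps; recent PEFT versions require this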
    target_modules=["q_proj", "v_proj"],
)

peft_model = get_peft_model(model, config)
peft_model.print_trainable_parameters()
# Prints trainable vs. total parameter counts; with these settings the
# trainable share should be well under 0.1% of the ~8B total

When to Use Adapters

Good fit for adapters:

  • Multi-task fine-tuning where you need to switch between tasks at inference
  • Research settings requiring flexible insertion points
  • Continual learning scenarios where you add adapters for new tasks without forgetting old ones

Use LoRA instead when:

  • You need zero inference overhead (production APIs)
  • You want simpler configuration
  • You're doing single-task fine-tuning

In practice, most production fine-tuning uses LoRA or QLoRA rather than classic adapters. Adapters are historically important and still relevant in multi-task learning research.