Learnixo

Transformer Architecture Q&A · Lesson 6 of 23

Encoder Stack: Layer Norm, FF, Residuals

Encoder Block Structure

Each encoder block has two sub-layers:

Input x
│
└─→ [Multi-Head Self-Attention] → + x (residual) → LayerNorm → a

a
│
└─→ [Feed-Forward Network]      → + a (residual) → LayerNorm → output

Formally:
  a = LayerNorm(x + MultiHeadSelfAttention(x, x, x))
  output = LayerNorm(a + FFN(a))

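The original paper (Vaswani et al., 2017) uses Post-LayerNorm: it normalises after adding the residual, as in the formulas above. Most modern transformers (GPT-2 and later, LLaMA) use Pre-LayerNorm, which normalises the sub-layer input and leaves the residual path untouched:

  h = LayerNorm(x)
  a = x + MultiHeadSelfAttention(h, h, h)
  output = a + FFN(LayerNorm(a))

Pre-LayerNorm tends to train more stably in deep stacks. LLaMA also replaces LayerNorm with RMSNorm.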
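The two orderings differ only in where the normalisation sits relative to the residual addition. The following is a minimal sketch, assuming a plain `nn.Linear` as a hypothetical stand-in for the attention or feed-forward sub-layer. The helper names `post_ln_step` and `pre_ln_step` are illustrative, not part of any library.

```python
import torch
import torch.nn as nn

def post_ln_step(x, sublayer, norm):
    """Original ordering: add the residual, then normalise."""
    return norm(x + sublayer(x))

def pre_ln_step(x, sublayer, norm):
    """Modern ordering: normalise the sub-layer input, then add the residual."""
    return x + sublayer(norm(x))

torch.manual_seed(0)
d_model = 8
x = torch.randn(2, 5, d_model)          # (batch, seq_len, d_model)
sublayer = nn.Linear(d_model, d_model)  # hypothetical stand-in for attention or the FFN
norm = nn.LayerNorm(d_model)

post = post_ln_step(x, sublayer, norm)
pre = pre_ln_step(x, sublayer, norm)

# Post-LN: every token vector leaves the block normalised (mean close to 0).
print(post.mean(dim=-1).abs().max())
# Pre-LN: the residual stream x is carried forward un-normalised.
print(pre.shape == post.shape)
```

In the Pre-LN version the input `x` reaches the output through the addition alone, which is one common explanation for its more stable training.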


Sub-Layer 1: Multi-Head Self-Attention

The self-attention layer lets every token attend to every token in the sequence, itself included, in both directions. Q, K, and V are all derived from the same input:

Self-Attention(x) = MultiHead(Q=x, K=x, V=x)

Output: For each token position, a new representation that aggregates information from all positions, weighted by attention scores.

Key property: No causal masking in the encoder — token at position 5 can attend to position 10 just as easily as position 3. This bidirectional context is why BERT-style encoders are powerful for understanding tasks.
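To make the absence of a causal mask concrete, here is a simplified single-head sketch (the real layer uses several heads plus an output projection). The function name `self_attention` and the random projection matrices are assumptions for illustration only. Positions are 0-indexed.

```python
import math
import torch

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention with no causal mask."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
    weights = torch.softmax(scores, dim=-1)   # (seq_len, seq_len), each row sums to 1
    return weights @ v, weights

torch.manual_seed(0)
seq_len, d_model = 12, 16
x = torch.randn(seq_len, d_model)
w_q, w_k, w_v = (torch.randn(d_model, d_model) / math.sqrt(d_model) for _ in range(3))

out, weights = self_attention(x, w_q, w_k, w_v)

# Row 5 holds the attention of position 5 over all 12 positions.
# With no causal mask, the weight on a later position (10) is non-zero,
# just like the weight on an earlier position (3).
print(weights[5, 10] > 0, weights[5, 3] > 0)
```

A decoder would set the scores above the diagonal to negative infinity before the softmax; the encoder leaves them untouched.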


Sub-Layer 2: Feed-Forward Network (FFN)

FFN(x) = max(0, x · W₁ + b₁) · W₂ + b₂

Dimensions:
  d_model = 512, d_ff = 2048 (4× expansion in original paper)
  W₁: (d_model, d_ff) = (512, 2048)
  W₂: (d_ff, d_model) = (2048, 512)

The FFN applies the same learned transformation independently to each position. It does not share information across positions (unlike attention) — it processes each token's representation on its own.

Why the FFN exists: Attention mixes information across positions. The FFN processes and transforms the blended representations within each position. The two sub-layers have complementary roles: attention = communication, FFN = computation.
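The position-wise property can be checked directly. The sketch below builds the ReLU FFN with the original-paper sizes and compares running the whole sequence at once against running each token separately. The sequence length of 10 is an arbitrary choice.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model, d_ff = 512, 2048
ffn = nn.Sequential(
    nn.Linear(d_model, d_ff),   # W1, b1
    nn.ReLU(),                  # max(0, .)
    nn.Linear(d_ff, d_model),   # W2, b2
)

x = torch.randn(10, d_model)    # 10 token positions

with torch.no_grad():
    whole = ffn(x)                                      # all positions at once
    one_by_one = torch.stack([ffn(tok) for tok in x])   # each position separately

# The results agree up to floating-point rounding: no information crosses positions.
print(torch.allclose(whole, one_by_one, atol=1e-4))

# Parameter count: 512*2048 + 2048 + 2048*512 + 512
n_params = sum(p.numel() for p in ffn.parameters())
print(n_params)  # 2099712
```

Roughly 2.1 million parameters per block sit in the FFN at these sizes, and they are shared across all positions.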

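Modern models typically replace ReLU with GELU (BERT, GPT-2) or SwiGLU (LLaMA). GELU models usually keep the 4× expansion. SwiGLU uses three weight matrices instead of two, so d_ff is often reduced to about d_model × 8/3 to keep the parameter count close to the 4× design.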
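The 8/3 factor follows from parameter counting: a gated FFN has three weight matrices, so 3 × d_model × (8/3 × d_model) matches the 2 × d_model × (4 × d_model) of the two-matrix design. Below is a minimal sketch of a SwiGLU-style FFN. The class name `SwiGLUFFN` and the attribute names `w_gate`, `w_up`, and `w_down` are illustrative assumptions, and biases are omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFFN(nn.Module):
    """Gated FFN sketch: (SiLU(x W_gate) * (x W_up)) W_down, biases omitted."""
    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.w_gate = nn.Linear(d_model, d_ff, bias=False)
        self.w_up = nn.Linear(d_model, d_ff, bias=False)
        self.w_down = nn.Linear(d_ff, d_model, bias=False)

    def forward(self, x):
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))

d_model = 512
relu_params = 2 * d_model * (4 * d_model)   # two matrices at 4x expansion, no biases
d_ff = 8 * d_model // 3                     # 1365; real models usually round to a convenient multiple
swiglu = SwiGLUFFN(d_model, d_ff)
swiglu_params = sum(p.numel() for p in swiglu.parameters())

print(relu_params, swiglu_params)           # 2097152 2096640

y = swiglu(torch.randn(10, d_model))
print(y.shape)                              # torch.Size([10, 512])
```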


Residual Connections

Every sub-layer adds its input to its output:

output = LayerNorm(x + SubLayer(x))

Why residuals matter:

  1. Gradient flow: The addition passes gradients straight back to earlier layers, so the signal does not have to travel through every sub-layer's weights
  2. Identity shortcut: If a sub-layer's best transformation is "do nothing," the residual allows this: SubLayer(x) → 0, output → LayerNorm(x)
  3. Training stability: Without residuals, deep transformer stacks are very hard to train, because gradients have no direct path back to the early layers
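Points 1 and 2 can be illustrated with an extreme case: a sub-layer whose weights are all zero. The sketch below uses a zero-initialised `nn.Linear` as a hypothetical sub-layer and leaves LayerNorm out to isolate the effect of the addition.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model = 16
x = torch.randn(4, d_model, requires_grad=True)

# Hypothetical sub-layer whose weights are all zero, so SubLayer(x) = 0.
sublayer = nn.Linear(d_model, d_model)
nn.init.zeros_(sublayer.weight)
nn.init.zeros_(sublayer.bias)

# Without the residual, the output is zero and no gradient reaches x.
plain = sublayer(x)
grad_plain, = torch.autograd.grad(plain.sum(), x)

# With the residual, the block is the identity and the gradient
# passes through the addition unchanged.
residual = x + sublayer(x)
grad_residual, = torch.autograd.grad(residual.sum(), x)

print(grad_plain.abs().max())     # tensor(0.)
print(grad_residual.abs().min())  # tensor(1.)
```

In a trained network the sub-layer is not zero, but the same mechanism applies: the gradient with respect to x always contains the identity term contributed by the addition.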

What the Encoder Outputs

A stack of N encoder blocks (N=6 in the original Transformer, N=24 in BERT-large; very large models, which are mostly decoder-only, reach N=96 or more) transforms the input token embeddings into contextualised token representations:

Input:  [word embeddings + positional encodings]   shape: (seq_len, d_model)
Output: [contextualised representations]           shape: (seq_len, d_model)

Each output vector for token i has been enriched with information from all other tokens in the sequence. This is what BERT exploits — the final encoder states capture rich bidirectional context.
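The shape-preserving behaviour and the cross-token mixing can both be observed with PyTorch's built-in `nn.TransformerEncoder`. This is a sketch with original-paper sizes and untrained random weights; the random input is a stand-in for real embeddings plus positional encodings.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model, num_heads, d_ff, n_layers = 512, 8, 2048, 6   # original-paper sizes

layer = nn.TransformerEncoderLayer(
    d_model=d_model, nhead=num_heads, dim_feedforward=d_ff,
    dropout=0.0, batch_first=True,
)
encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
encoder.eval()

seq_len = 10
x = torch.randn(1, seq_len, d_model)   # stand-in for embeddings + positional encodings
with torch.no_grad():
    out = encoder(x)

print(x.shape, out.shape)   # both torch.Size([1, 10, 512])

# Replace the token at position 7 and re-run: the output at position 0
# changes too, because self-attention mixes information across positions.
x2 = x.clone()
x2[0, 7] = torch.randn(d_model)
with torch.no_grad():
    out2 = encoder(x2)
print((out[0, 0] - out2[0, 0]).abs().max())
```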

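In encoder-decoder architectures (T5, original Transformer), the encoder's final output supplies the keys and values for the cross-attention in every decoder block, while the queries come from the decoder. Each decoder block applies its own K and V projections to that same encoder output.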
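A minimal sketch of that hand-off, using `nn.MultiheadAttention` as the cross-attention module. The variable names `memory` and `tgt` and the sequence lengths are assumptions chosen for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model, num_heads = 512, 8
cross_attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)

memory = torch.randn(1, 10, d_model)   # encoder output: (batch, src_len, d_model)
tgt = torch.randn(1, 4, d_model)       # decoder states: (batch, tgt_len, d_model)

# Queries come from the decoder; keys and values come from the encoder output.
out, weights = cross_attn(query=tgt, key=memory, value=memory)

print(out.shape)       # torch.Size([1, 4, 512])
print(weights.shape)   # torch.Size([1, 4, 10])
```

The output has one vector per decoder position, and each row of `weights` is a distribution over the 10 encoder positions.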


Code Skeleton

Python
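import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    def __init__(self, d_model: int, num_heads: int, d_ff: int, dropout: float = 0.1):
        super().__init__()
        # nn.MultiheadAttention is used here so the skeleton runs on its own;
        # a custom MultiHeadAttention module could be swapped in instead.
        self.self_attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.GELU(),
            nn.Linear(d_ff, d_model),
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x: torch.Tensor, mask=None) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        # mask (assumed here): optional bool padding mask of shape (batch, seq_len),
        # where True marks positions to ignore
        # Pre-LayerNorm (modern variant): normalise once, then reuse as Q, K and V
        h = self.norm1(x)
        attn_out, _ = self.self_attn(h, h, h, key_padding_mask=mask, need_weights=False)
        a = x + self.dropout(attn_out)
        out = a + self.dropout(self.ff(self.norm2(a)))
        return out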

Interview Answer

"Each encoder block has two sub-layers: multi-head self-attention (where every token attends to every other token bidirectionally) and a position-wise feed-forward network. Both sub-layers use residual connections and layer normalisation. Attention handles cross-position communication; the FFN handles per-position transformation. A stack of N such blocks converts token embeddings into rich contextualised representations that capture the full sequence context."