The Transformer Encoder Block
What an encoder block contains, how multi-head self-attention and feed-forward layers combine, the role of residual connections and layer norm, and what the encoder outputs.
Encoder Block Structure
Each encoder block has two sub-layers:
Input x
  │
  ├── [Multi-Head Self-Attention] → + x → LayerNorm → a
  │         ↑ (residual)
  │
  └── [Feed-Forward Network] → + a → LayerNorm → output
Formally:
a = LayerNorm(x + MultiHeadSelfAttention(x, x, x))
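output = LayerNorm(a + FFN(a))
The original paper (Vaswani et al., 2017) uses Post-LayerNorm: normalize after adding the residual. Most modern transformers (GPT-2 and later, LLaMA) use pre-normalization instead: normalize the sub-layer input and leave the residual path untouched, which tends to train more stably at depth. LLaMA uses RMSNorm rather than LayerNorm for this step.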
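The two orderings differ only in where the normalization sits relative to the residual addition. A minimal sketch in dependency-free Python on a single token vector; `layer_norm` here omits the learned scale and shift, and `toy_sublayer` is a hypothetical stand-in for attention or the FFN:

```python
import math

def layer_norm(v, eps=1e-5):
    """Normalize one vector to zero mean and unit variance (no learned gain/bias)."""
    mean = sum(v) / len(v)
    var = sum((t - mean) ** 2 for t in v) / len(v)
    return [(t - mean) / math.sqrt(var + eps) for t in v]

def add(u, v):
    return [a + b for a, b in zip(u, v)]

def toy_sublayer(v):
    # Hypothetical stand-in for attention or the FFN: any vector-to-vector map.
    return [0.5 * t + 1.0 for t in v]

def post_ln_step(x, sublayer):
    # Original Transformer: add the residual, then normalize the sum.
    return layer_norm(add(x, sublayer(x)))

def pre_ln_step(x, sublayer):
    # Pre-LN: normalize only the sub-layer input; the residual path carries x unchanged.
    return add(x, sublayer(layer_norm(x)))

x = [1.0, 2.0, 4.0, 9.0]
post = post_ln_step(x, toy_sublayer)
pre = pre_ln_step(x, toy_sublayer)
print("post-LN:", [round(t, 3) for t in post])
print("pre-LN: ", [round(t, 3) for t in pre])
```

In the post-LN result the whole sum is renormalized, so its mean is zero. In the pre-LN result the original x passes through unscaled and only the sub-layer's contribution is computed from normalized input.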
Sub-Layer 1: Multi-Head Self-Attention
The self-attention layer allows every token to attend to every other token in the sequence, in both directions. Q, K, V all come from the same input:
Self-Attention(x) = MultiHead(Q=x, K=x, V=x)
Output: For each token position, a new representation that aggregates information from all positions (its own included), weighted by attention scores.
Key property: No causal masking in the encoder. The token at position 5 can attend to position 10 just as easily as to position 3. This bidirectional context is why BERT-style encoders are powerful for understanding tasks.
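To make the "no causal masking" point concrete, here is a small sketch of single-head scaled dot-product attention weights in plain Python. It assumes Q = K = x with no learned projections (a simplification), and the `causal` flag is added only to contrast with decoder-style masking:

```python
import math

def softmax(row):
    m = max(row)
    exps = [math.exp(s - m) for s in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(x, causal=False):
    """Scaled dot-product attention weights with Q = K = x (no learned projections)."""
    d = len(x[0])
    weights = []
    for i, q in enumerate(x):
        scores = []
        for j, k in enumerate(x):
            if causal and j > i:
                scores.append(float("-inf"))  # decoder-style mask: hide future positions
            else:
                scores.append(sum(a * b for a, b in zip(q, k)) / math.sqrt(d))
        weights.append(softmax(scores))
    return weights

def attend(x, weights):
    """Each output row is the attention-weighted average of all rows of x (V = x)."""
    d = len(x[0])
    return [[sum(w[j] * x[j][c] for j in range(len(x))) for c in range(d)] for w in weights]

x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # 3 tokens, d_model = 2
enc_w = attention_weights(x)                # encoder: no mask
dec_w = attention_weights(x, causal=True)   # decoder-style, for contrast
out = attend(x, enc_w)
print("encoder weights, token 0:", [round(w, 3) for w in enc_w[0]])
print("causal weights,  token 0:", [round(w, 3) for w in dec_w[0]])
```

Without the mask, token 0 places nonzero weight on the later tokens 1 and 2. With the causal mask it can only attend to itself.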
Sub-Layer 2: Feed-Forward Network (FFN)
FFN(x) = max(0, x · W₁ + b₁) · W₂ + b₂
Dimensions:
d_model = 512, d_ff = 2048 (4× expansion in original paper)
W₁: (d_model, d_ff) = (512, 2048)
W₂: (d_ff, d_model) = (2048, 512)
The FFN applies the same learned transformation independently to each position. Unlike attention, it does not share information across positions: it processes each token's representation on its own.
Why the FFN exists: Attention mixes information across positions. The FFN processes and transforms the blended representations within each position. The two sub-layers have complementary roles: attention = communication, FFN = computation.
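The "independently at each position" property can be checked directly. The sketch below uses toy sizes (d_model = 2, d_ff = 4) and arbitrary hand-picked weights, purely for illustration:

```python
def linear(v, W, b):
    """Compute v @ W + b for one vector v; W has shape (len(v), len(b))."""
    return [sum(v[i] * W[i][j] for i in range(len(v))) + b[j] for j in range(len(b))]

def ffn(v, W1, b1, W2, b2):
    hidden = [max(0.0, h) for h in linear(v, W1, b1)]  # ReLU
    return linear(hidden, W2, b2)

# Arbitrary fixed weights: W1 is (2, 4), W2 is (4, 2).
W1 = [[1.0, -1.0, 0.5, 0.0],
      [0.0, 2.0, -0.5, 1.0]]
b1 = [0.0, 0.1, 0.0, -0.2]
W2 = [[1.0, 0.0], [0.5, -1.0], [0.0, 1.0], [2.0, 0.5]]
b2 = [0.1, -0.1]

def ffn_over_sequence(seq):
    # The same weights are applied at every position; no position sees any other.
    return [ffn(tok, W1, b1, W2, b2) for tok in seq]

seq_a = [[1.0, 2.0], [3.0, -1.0], [0.5, 0.5]]
seq_b = [[1.0, 2.0], [-7.0, 4.0], [0.5, 0.5]]  # only the middle token differs
out_a = ffn_over_sequence(seq_a)
out_b = ffn_over_sequence(seq_b)
print(out_a[0] == out_b[0], out_a[1] == out_b[1], out_a[2] == out_b[2])
```

Changing the middle token changes only the middle output. With attention in place of the FFN, all three outputs would generally change.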
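Modern models typically replace ReLU with GELU or a gated unit such as SwiGLU. Gated variants use three weight matrices instead of two, so they often set d_ff to about d_model × 8/3 to keep the parameter count close to that of the 4× ReLU FFN.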
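To make the sizes concrete, here is a small sketch that counts FFN weights. The function names are illustrative, and the gated count assumes a bias-free SwiGLU-style layer with gate, up, and down projections:

```python
def relu_ffn_params(d_model, d_ff, bias=True):
    """Two matrices: W1 (d_model x d_ff) and W2 (d_ff x d_model), plus optional biases."""
    n = 2 * d_model * d_ff
    if bias:
        n += d_ff + d_model
    return n

def gated_ffn_params(d_model, d_ff):
    """Gated variant (SwiGLU-style): gate, up, and down projections, no biases."""
    return 3 * d_model * d_ff

d_model = 512
original = relu_ffn_params(d_model, 4 * d_model)               # 2099712
no_bias = relu_ffn_params(d_model, 4 * d_model, bias=False)    # 2097152
gated = gated_ffn_params(d_model, int(d_model * 8 / 3))        # 2096640
print(original, no_bias, gated)
```

With d_model = 512 the three counts land within about 0.2% of each other, which is the usual motivation for the 8/3 factor.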
Residual Connections
Every sub-layer adds its input to its output:
output = LayerNorm(x + SubLayer(x))
Why residuals matter:
- Gradient flow: Gradients can flow directly through the addition, bypassing deep layers
- Identity shortcut: If a sub-layer's best transformation is "do nothing," the residual allows this: SubLayer(x) ≈ 0 gives output ≈ LayerNorm(x)
- Training stability: Without residuals, transformers don't train at depth (they have no highway for gradients)
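The gradient-flow argument can be illustrated with a toy scalar model. This sketch assumes every layer is simply f(x) = weight * x and ignores LayerNorm and nonlinearities, so it shows only the effect of the added identity term:

```python
def stack_gradient(weight, depth, residual):
    """d(output)/d(input) for `depth` stacked scalar layers f(x) = weight * x."""
    grad = 1.0
    for _ in range(depth):
        local = weight          # derivative of f(x) = weight * x
        if residual:
            local += 1.0        # derivative of x + f(x): the identity term
        grad *= local           # chain rule across layers
    return grad

plain = stack_gradient(0.1, 20, residual=False)
skip = stack_gradient(0.1, 20, residual=True)
print(f"no residual:   {plain:.3e}")
print(f"with residual: {skip:.3f}")
```

With a small per-layer weight, the plain stack's gradient shrinks geometrically toward zero over 20 layers, while the residual stack keeps a gradient of order one because each layer contributes a factor of 1 + weight.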
What the Encoder Outputs
A stack of N encoder blocks (N=6 in the original Transformer, N=24 in BERT-large, and 96 or more layers in the largest transformer stacks) transforms the input token embeddings into contextualised token representations:
Input:  [word embeddings + positional encodings]  shape: (seq_len, d_model)
Output: [contextualised representations]          shape: (seq_len, d_model)
Each output vector for token i has been enriched with information from all other tokens in the sequence. This is what BERT exploits: the final encoder states capture rich bidirectional context.
In encoder-decoder architectures (T5, original Transformer), the encoder's final output is passed as K and V to the cross-attention in each decoder block.
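A rough sketch of that hand-off, in plain Python: queries come from the decoder while keys and values are both the encoder output. Learned Q/K/V projections and multiple heads are omitted here for brevity, and the function name is illustrative:

```python
import math

def softmax(row):
    m = max(row)
    exps = [math.exp(s - m) for s in row]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(decoder_states, encoder_output):
    """Queries from the decoder; keys and values are both the encoder output."""
    d = len(encoder_output[0])
    out = []
    for q in decoder_states:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in encoder_output]
        w = softmax(scores)
        out.append([sum(w[j] * encoder_output[j][c] for j in range(len(encoder_output)))
                    for c in range(d)])
    return out

# Encoder output: 4 source tokens; decoder states: 2 target tokens; d_model = 3.
encoder_output = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [1.0, 1.0, 0.0]]
decoder_states = [[0.5, 0.5, 0.0], [0.0, 0.0, 2.0]]
out = cross_attention(decoder_states, encoder_output)
print(len(out), len(out[0]))
```

The result has one row per decoder position and d_model columns, regardless of the source length, which is why the same encoder output can be reused by every decoder block.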
Code Skeleton
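import torch
import torch.nn as nn


class EncoderBlock(nn.Module):
    def __init__(self, d_model: int, num_heads: int, d_ff: int, dropout: float = 0.1):
        super().__init__()
        # nn.MultiheadAttention stands in for a custom MultiHeadAttention module,
        # which this section does not define. batch_first=True expects inputs
        # of shape (batch, seq_len, d_model).
        self.self_attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.GELU(),
            nn.Linear(d_ff, d_model),
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x: torch.Tensor, mask=None) -> torch.Tensor:
        # Pre-LayerNorm (modern variant): normalize once, reuse as Q, K and V.
        h = self.norm1(x)
        # Assumption: mask is a boolean padding mask of shape (batch, seq_len),
        # True at positions that should be ignored.
        attn_out, _ = self.self_attn(h, h, h, key_padding_mask=mask, need_weights=False)
        a = x + self.dropout(attn_out)
        out = a + self.dropout(self.ff(self.norm2(a)))
        return out
Interview Answer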
"Each encoder block has two sub-layers: multi-head self-attention (where every token attends to every other token bidirectionally) and a position-wise feed-forward network. Both sub-layers use residual connections and layer normalisation. Attention handles cross-position communication; the FFN handles per-position transformation. A stack of N such blocks converts token embeddings into rich contextualised representations that capture the full sequence context."