LLMs Deep Dive · Lesson 5 of 24
Transformer Architecture: End-to-End Walk-Through
From Original Transformer to Modern LLMs
The 2017 Transformer architecture has been refined significantly:
Original Transformer (Vaswani et al., 2017):
  Post-LayerNorm, sinusoidal PE, ReLU FFN, full encoder-decoder
  Adam optimiser, dropout 0.1
GPT-1/2 (2018-2019):
  Decoder-only, learned absolute PE, GELU activation
GPT-3 (2020):
  Same architecture as GPT-2, scaled to 175B parameters
  Alternating dense and locally banded sparse attention layers
LLaMA 1/2 (2023):
  Pre-RMSNorm, RoPE, SwiGLU FFN
  No bias terms in attention/FFN
  Grouped-query attention (LLaMA 2 34B and 70B variants only)
LLaMA 3 (2024):
  GQA across all sizes, larger vocab (128K), longer context (8K-128K)

Modern LLM Block (LLaMA-style)
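The LLaMA-style block in this section relies on `RMSNorm` and `SwiGLUFFN` modules without defining them. A minimal sketch of what they might look like, assuming PyTorch; the `eps` default, the projection names (`w_gate`, `w_up`, `w_down`) and the constructor signatures are illustrative assumptions, not definitions taken from the lesson or from any library:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class RMSNorm(nn.Module):
    """Root-mean-square norm: rescales by the RMS, no mean subtraction, no shift."""

    def __init__(self, d_model, eps=1e-6):  # eps value is an assumption
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(d_model))  # learned gain only

    def forward(self, x):
        # Mean of squares over the feature dimension, then x / sqrt(ms + eps)
        ms = x.pow(2).mean(dim=-1, keepdim=True)
        return x * torch.rsqrt(ms + self.eps) * self.weight


class SwiGLUFFN(nn.Module):
    """Gated FFN: down(SiLU(gate(x)) * up(x)), with bias-free projections."""

    def __init__(self, d_model, d_ff):
        super().__init__()
        self.w_gate = nn.Linear(d_model, d_ff, bias=False)
        self.w_up = nn.Linear(d_model, d_ff, bias=False)
        self.w_down = nn.Linear(d_ff, d_model, bias=False)

    def forward(self, x):
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))
```

Both modules preserve the input shape `(batch, seq, d_model)`, which is what lets the block add their outputs straight back onto the residual stream.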
import torch
import torch.nn as nn
import torch.nn.functional as F

# RMSNorm, GroupedQueryAttention and SwiGLUFFN are assumed to be defined elsewhere.
class LLaMABlock(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.norm1 = RMSNorm(config.d_model)
        self.attn = GroupedQueryAttention(
            d_model=config.d_model,
            num_q_heads=config.num_q_heads,
            num_kv_heads=config.num_kv_heads,  # GQA: fewer KV heads than Q heads
        )
        self.norm2 = RMSNorm(config.d_model)
        self.ffn = SwiGLUFFN(config.d_model, config.d_ff)

    def forward(self, x, freqs_cos, freqs_sin, mask=None):
        # Pre-norm + residual: normalise each sub-layer's input, add its output back to x
        x = x + self.attn(self.norm1(x), freqs_cos, freqs_sin, mask)
        x = x + self.ffn(self.norm2(x))
        return x

Key Architectural Differences: BERT vs LLaMA
| Component | BERT-base | LLaMA 2 7B |
|-----------|-----------|------------|
| Type | Encoder-only | Decoder-only |
| Layers | 12 | 32 |
| d_model | 768 | 4096 |
| Heads | 12 | 32 (full multi-head; GQA with 8 KV heads is used only in the larger LLaMA 2 variants) |
| FFN size | 3072 | 11008 |
| Norm | Post-LayerNorm | Pre-RMSNorm |
| Positional | Learned absolute | RoPE |
| Activation | GELU | SwiGLU |
| Context (tokens) | 512 | 4096 |
| Params | 110M | 7B |
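The LLaMA 2 7B column can be sanity-checked by rebuilding the parameter count from its dimensions. The sketch below is a rough estimate under stated assumptions: no bias terms, RoPE contributing no parameters, a 32,000-token vocabulary, untied input and output embeddings, and 32 KV heads (full multi-head attention, as in the released 7B model). The function name `llama_param_count` is hypothetical.

```python
def llama_param_count(n_layers, d_model, d_ff, vocab_size, n_q_heads, n_kv_heads,
                      tied_embeddings=False):
    """Rough parameter count for a bias-free, LLaMA-style decoder."""
    head_dim = d_model // n_q_heads
    kv_dim = n_kv_heads * head_dim
    # Attention: Q and output projections are d_model x d_model;
    # K and V projections shrink to d_model x kv_dim under GQA.
    attn = 2 * d_model * d_model + 2 * d_model * kv_dim
    # SwiGLU FFN: gate, up and down projections.
    ffn = 3 * d_model * d_ff
    # Two RMSNorm gain vectors per block.
    norms = 2 * d_model
    per_layer = attn + ffn + norms
    embed = vocab_size * d_model
    lm_head = 0 if tied_embeddings else vocab_size * d_model
    final_norm = d_model
    return n_layers * per_layer + embed + lm_head + final_norm


total = llama_param_count(n_layers=32, d_model=4096, d_ff=11008,
                          vocab_size=32000, n_q_heads=32, n_kv_heads=32)
print(f"{total:,}")  # 6,738,415,616
```

The result, about 6.74B, is what the "7B" label rounds up. Almost all of it sits in the 32 blocks, with the FFN holding roughly twice as many weights as attention.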
Tied Embeddings
Many language models (e.g., GPT-2, Gemma) tie the input token embedding matrix to the output LM head, although LLaMA 1/2 keep the two untied:
E_in ∈ ℝ^(vocab × d_model) — input: token ID → vector
E_out ∈ ℝ^(d_model × vocab) — output: vector → logits
Tied: E_out = E_inᵀ
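In PyTorch, tying usually means making both layers share one parameter tensor. A minimal sketch, assuming a hypothetical `TiedEmbeddingLM` with the transformer body omitted; the class and attribute names are illustrative:

```python
import torch
import torch.nn as nn


class TiedEmbeddingLM(nn.Module):
    """Hypothetical minimal model: embedding -> (omitted body) -> LM head."""

    def __init__(self, vocab_size, d_model):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)           # weight: (vocab, d_model)
        self.lm_head = nn.Linear(d_model, vocab_size, bias=False)  # weight: (vocab, d_model)
        # nn.Linear stores its weight as (out_features, in_features), so sharing
        # the same Parameter realises E_out = E_in^T without an explicit transpose.
        self.lm_head.weight = self.tok_emb.weight

    def forward(self, token_ids):
        h = self.tok_emb(token_ids)   # (batch, seq, d_model)
        return self.lm_head(h)        # (batch, seq, vocab)


model = TiedEmbeddingLM(vocab_size=1000, d_model=64)
n_params = sum(p.numel() for p in model.parameters())
print(n_params)  # 64000: the shared matrix is counted once
```

Because the two layers hold the same tensor, gradients from the input lookup and from the output projection accumulate into a single matrix.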
Benefits:
Saves vocab × d_model parameters (e.g., 32K × 4096 ≈ 131M params)
Forces consistent semantic space between input and output
Empirically often matches or beats untied embeddings

Embedding and Output Dimensions
Token embedding table: vocab_size × d_model
GPT-2: 50257 × 768 = 38.6M params
LLaMA 2: 32000 × 4096 = 131M params
LLaMA 3: 128256 × 4096 = 525M params
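These figures are plain products of vocabulary size and model width. A short sketch that reproduces them; the helper name `embedding_params` is hypothetical:

```python
def embedding_params(vocab_size, d_model):
    """Number of weights in one vocab_size x d_model embedding table."""
    return vocab_size * d_model


# (vocab_size, d_model) pairs as listed above
models = {
    "GPT-2": (50257, 768),
    "LLaMA 2": (32000, 4096),
    "LLaMA 3": (128256, 4096),
}

for name, (vocab_size, d_model) in models.items():
    params = embedding_params(vocab_size, d_model)
    print(f"{name}: {params:,}")
# GPT-2: 38,597,376
# LLaMA 2: 131,072,000
# LLaMA 3: 525,336,576
```

An untied model pays this cost twice, once for the input table and once for the LM head.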
For large vocabularies and small models, the embedding table is a significant fraction
of total parameters (about 31% of GPT-2 small's 124M), which is one motivation for tied embeddings.

No Biases in Modern LLMs
LLaMA and many other modern LLMs remove bias terms from their Linear layers:
Standard: y = xW + b (W and b are learned)
No bias: y = xW (only W)
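To get a feel for how little the bias vectors contribute, here is a rough count. It assumes LLaMA 2 7B-like dimensions (32 layers, d_model 4096, d_ff 11008) and a hypothetical variant of the model that attaches a bias to every attention and FFN projection; the function name is illustrative:

```python
def bias_params_per_layer(d_model, d_ff):
    """Bias count if every projection in one block had a bias vector."""
    attn_bias = 4 * d_model          # Q, K, V and output projections (full MHA)
    ffn_bias = 2 * d_ff + d_model    # gate and up (d_ff each), down (d_model)
    return attn_bias + ffn_bias


n_layers, d_model, d_ff = 32, 4096, 11008
total_bias = n_layers * bias_params_per_layer(d_model, d_ff)
print(f"{total_bias:,}")  # 1,359,872
```

That is about 1.4M values against roughly 6.7B weights, on the order of 0.02%, so the parameter saving on its own is small.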
Reasons:
1. Smaller parameter count (minor)
2. Better generalisation (empirically observed)
3. Interaction with normalisation: a bias added just before a LayerNorm
   largely overlaps with LayerNorm's own learned shift β, so it adds little
   (RMSNorm, as used in LLaMA, drops the shift as well)
4. Cleaner weight decay (bias terms are typically excluded
   from L2 regularisation, so removing them simplifies the optimiser setup)

Interview Answer
"Modern decoder-only LLMs (LLaMA, Mistral) use Pre-RMSNorm instead of Post-LayerNorm for training stability, RoPE for positional encoding instead of learned absolute positions, SwiGLU activation in the FFN instead of ReLU/GELU, grouped-query attention to reduce the KV cache, and no bias terms in Linear layers. The core transformer block structure is unchanged: masked self-attention + FFN + residuals. Input and output embeddings are typically tied to save memory and regularise the representation space."