Transformer Architecture Overview for LLMs
How modern decoder-only LLMs extend the original Transformer: the architectural changes from GPT-1 to LLaMA, and the components of a production LLM block.
From Original Transformer to Modern LLMs
The 2017 Transformer architecture has been refined significantly:
Original Transformer (Vaswani 2017):
Post-LayerNorm, sinusoidal PE, ReLU FFN, full encoder-decoder
Adam optimiser, dropout 0.1
GPT-1/2 (2018-2019):
Decoder-only, learned absolute PE, GELU activation
GPT-3 (2020):
Same architecture, scale to 175B
Alternating dense + sparse attention layers
LLaMA 1/2 (2023):
Pre-RMSNorm, RoPE, SwiGLU FFN
No bias terms in attention/FFN
Grouped-query attention (LLaMA 2 only, in the larger variants such as 70B)
LLaMA 3 (2024):
GQA across all sizes, larger vocab (128K), longer context (8K-128K)
Modern LLM Block (LLaMA-style)
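The block in this section calls three components it never defines: `RMSNorm`, `SwiGLUFFN` and `GroupedQueryAttention`. The following is a minimal sketch of what they might look like, not the reference LLaMA code. Several things here are assumptions: `freqs_cos` and `freqs_sin` are taken to have shape `(seq_len, head_dim // 2)`, RoPE is applied to adjacent feature pairs, `mask` is treated as an additive float mask that defaults to causal, and the helper names `rope_freqs` and `apply_rope`, the `eps` default and the attribute names are purely illustrative.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F


class RMSNorm(nn.Module):
    """Rescale by the root-mean-square of the features (no mean subtraction, no shift)."""

    def __init__(self, d_model, eps=1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(d_model))  # learned per-feature gain

    def forward(self, x):
        inv_rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return x * inv_rms * self.weight


class SwiGLUFFN(nn.Module):
    """Gated FFN: down(SiLU(gate(x)) * up(x)), three bias-free projections."""

    def __init__(self, d_model, d_ff):
        super().__init__()
        self.gate = nn.Linear(d_model, d_ff, bias=False)
        self.up = nn.Linear(d_model, d_ff, bias=False)
        self.down = nn.Linear(d_ff, d_model, bias=False)

    def forward(self, x):
        return self.down(F.silu(self.gate(x)) * self.up(x))


def rope_freqs(head_dim, seq_len, theta=10000.0):
    """Hypothetical helper: cos/sin tables of shape (seq_len, head_dim // 2)."""
    inv_freq = 1.0 / (theta ** (torch.arange(0, head_dim, 2).float() / head_dim))
    angles = torch.outer(torch.arange(seq_len).float(), inv_freq)
    return angles.cos(), angles.sin()


def apply_rope(x, cos, sin):
    """Rotate each feature pair of x (batch, heads, seq, head_dim) by its position angle."""
    x1, x2 = x[..., 0::2], x[..., 1::2]
    rotated = torch.stack((x1 * cos - x2 * sin, x1 * sin + x2 * cos), dim=-1)
    return rotated.flatten(-2)


class GroupedQueryAttention(nn.Module):
    """Causal self-attention where groups of query heads share one K/V head."""

    def __init__(self, d_model, num_q_heads, num_kv_heads):
        super().__init__()
        assert d_model % num_q_heads == 0 and num_q_heads % num_kv_heads == 0
        self.hq, self.hkv, self.hd = num_q_heads, num_kv_heads, d_model // num_q_heads
        self.wq = nn.Linear(d_model, self.hq * self.hd, bias=False)
        self.wk = nn.Linear(d_model, self.hkv * self.hd, bias=False)
        self.wv = nn.Linear(d_model, self.hkv * self.hd, bias=False)
        self.wo = nn.Linear(self.hq * self.hd, d_model, bias=False)

    def forward(self, x, freqs_cos, freqs_sin, mask=None):
        b, t, _ = x.shape
        q = self.wq(x).view(b, t, self.hq, self.hd).transpose(1, 2)
        k = self.wk(x).view(b, t, self.hkv, self.hd).transpose(1, 2)
        v = self.wv(x).view(b, t, self.hkv, self.hd).transpose(1, 2)
        q = apply_rope(q, freqs_cos, freqs_sin)
        k = apply_rope(k, freqs_cos, freqs_sin)
        # Expand K/V so each of the hq query heads has a matching (shared) K/V head
        k = k.repeat_interleave(self.hq // self.hkv, dim=1)
        v = v.repeat_interleave(self.hq // self.hkv, dim=1)
        if mask is None:  # default to an additive causal mask: -inf above the diagonal
            mask = torch.full((t, t), float("-inf"), device=x.device).triu(1)
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.hd) + mask
        out = F.softmax(scores, dim=-1) @ v
        return self.wo(out.transpose(1, 2).reshape(b, t, -1))
```

With these definitions in scope, `LLaMABlock` can be built from any config object exposing `d_model`, `num_q_heads`, `num_kv_heads` and `d_ff`, and called with the tables returned by `rope_freqs(d_model // num_q_heads, seq_len)`. Production implementations typically keep the K/V heads unexpanded in the cache and use fused attention kernels; the explicit `repeat_interleave` here is only for readability.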
```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class LLaMABlock(nn.Module):
    """One pre-norm decoder block: attention and FFN sub-layers, each with a residual.

    RMSNorm, GroupedQueryAttention and SwiGLUFFN are assumed to be defined elsewhere.
    """

    def __init__(self, config):
        super().__init__()
        self.norm1 = RMSNorm(config.d_model)
        self.attn = GroupedQueryAttention(
            d_model=config.d_model,
            num_q_heads=config.num_q_heads,
            num_kv_heads=config.num_kv_heads,  # GQA: fewer KV heads than query heads
        )
        self.norm2 = RMSNorm(config.d_model)
        self.ffn = SwiGLUFFN(config.d_model, config.d_ff)

    def forward(self, x, freqs_cos, freqs_sin, mask=None):
        # Pre-norm + residual: normalise the sub-layer input, add its output back to x
        x = x + self.attn(self.norm1(x), freqs_cos, freqs_sin, mask)
        x = x + self.ffn(self.norm2(x))
        return x
```
Key Architectural Differences: BERT vs LLaMA
| Component | BERT-base | LLaMA 2 7B |
|-----------|-----------|------------|
| Type | Encoder-only | Decoder-only |
| Layers | 12 | 32 |
| d_model | 768 | 4096 |
| Heads | 12 | 32 (Q and KV; GQA is used only in the larger LLaMA 2 models) |
| FFN size | 3072 | 11008 |
| Norm | Post-LayerNorm | Pre-RMSNorm |
| Positional | Learned absolute | RoPE |
| Activation | GELU | SwiGLU |
| Context | 512 | 4096 |
| Params | 110M | 7B |
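The LLaMA 2 7B column is enough to reconstruct both the parameter count and the KV-cache size. The sketch below does that arithmetic under a few stated assumptions: a vocabulary of 32000, untied input and output embeddings, no bias terms, 32 KV heads for the 7B model, and 2 bytes per cached element (fp16/bf16). The function names are hypothetical, and the 8-KV-head call is a what-if configuration for comparison, not a released 7B model.

```python
def llama_param_count(n_layers, d_model, d_ff, vocab, n_q_heads, n_kv_heads, tied=False):
    """Parameter count of a bias-free LLaMA-style decoder (illustrative formula)."""
    head_dim = d_model // n_q_heads
    kv_dim = n_kv_heads * head_dim
    attn = 2 * d_model * d_model + 2 * d_model * kv_dim  # W_q, W_o and W_k, W_v
    ffn = 3 * d_model * d_ff                             # SwiGLU: gate, up, down
    norms = 2 * d_model                                  # two RMSNorm gains per block
    embed = vocab * d_model
    lm_head = 0 if tied else vocab * d_model
    return n_layers * (attn + ffn + norms) + embed + lm_head + d_model  # + final norm


def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch=1, bytes_per_elem=2):
    """K and V per layer, each of shape (batch, n_kv_heads, seq_len, head_dim)."""
    return 2 * n_layers * batch * n_kv_heads * seq_len * head_dim * bytes_per_elem


total = llama_param_count(32, 4096, 11008, 32000, n_q_heads=32, n_kv_heads=32)
print(f"parameters: {total:,}")  # 6,738,415,616

mha = kv_cache_bytes(32, n_kv_heads=32, head_dim=128, seq_len=4096)
gqa = kv_cache_bytes(32, n_kv_heads=8, head_dim=128, seq_len=4096)  # hypothetical GQA variant
print(f"KV cache, 32 KV heads: {mha / 2**30:.2f} GiB")  # 2.00 GiB
print(f"KV cache,  8 KV heads: {gqa / 2**30:.2f} GiB")  # 0.50 GiB
```

Cutting the KV heads from 32 to 8 shrinks the cache by 4x per sequence, while the parameter count changes only through the smaller W_k and W_v projections.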
Tied Embeddings
Many LLMs tie the input token embedding matrix and the output LM head (GPT-2 does; the released LLaMA 1/2 models keep separate matrices):
E_in ∈ ℝ^(vocab × d_model)    (input: token ID → vector)
E_out ∈ ℝ^(d_model × vocab)   (output: vector → logits)
Tied: E_out = E_inᵀ
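In PyTorch, tying usually comes down to making the LM head and the embedding share a single `Parameter`. A minimal sketch, assuming a toy model with no transformer blocks in between; `TinyLM` and `n_params` are hypothetical names used only for illustration.

```python
import torch
import torch.nn as nn


class TinyLM(nn.Module):
    def __init__(self, vocab_size, d_model, tie=True):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)              # E_in: (vocab, d_model)
        self.lm_head = nn.Linear(d_model, vocab_size, bias=False)   # weight: (vocab, d_model)
        if tie:
            # Share one Parameter; nn.Linear computes h @ weight.T, i.e. h @ E_in^T
            self.lm_head.weight = self.embed.weight

    def forward(self, token_ids):
        h = self.embed(token_ids)  # transformer blocks would go here
        return self.lm_head(h)     # logits over the vocabulary


def n_params(module):
    # parameters() yields a shared Parameter only once, so tied weights are not double-counted
    return sum(p.numel() for p in module.parameters())


tied, untied = TinyLM(100, 16, tie=True), TinyLM(100, 16, tie=False)
print(n_params(tied), n_params(untied))  # 1600 3200
```

Because both modules hold the same tensor, gradients from the embedding lookup and from the output projection accumulate into one matrix.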
Benefits:
Saves vocab × d_model parameters (e.g., 32K × 4096 ≈ 131M params)
Forces consistent semantic space between input and output
Empirically matches or beats untied embeddings
Embedding and Output Dimensions
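The embedding table sizes quoted in this section are simply vocab_size × d_model. The quick check below also sets each table against a rough total parameter count, to show how much the share varies with model size. The model labels (GPT-2 small, LLaMA 2 7B, LLaMA 3 8B) and the rounded totals are assumptions inferred from the d_model values, not figures given in the article.

```python
# (vocab_size, d_model, approximate total parameters); totals are rounded assumptions
MODELS = {
    "GPT-2 small": (50257, 768, 124e6),
    "LLaMA 2 7B": (32000, 4096, 6.74e9),
    "LLaMA 3 8B": (128256, 4096, 8.03e9),
}


def embedding_table_params(vocab_size, d_model):
    """One row of d_model floats per vocabulary entry."""
    return vocab_size * d_model


for name, (vocab_size, d_model, total) in MODELS.items():
    table = embedding_table_params(vocab_size, d_model)
    share = 100 * table / total
    print(f"{name}: {table / 1e6:.1f}M params, {share:.1f}% of ~{total / 1e9:.2f}B")
```

Under these assumptions a single table is roughly 31% of GPT-2 small, about 2% of LLaMA 2 7B and about 6.5% of LLaMA 3 8B. An untied model stores two such tables, one for the input embedding and one for the LM head.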
Token embedding table: vocab_size × d_model
GPT-2 (small): 50257 × 768 = 38.6M params
LLaMA 2: 32000 × 4096 = 131M params
LLaMA 3: 128256 × 4096 = 525M params
For large vocabularies, and especially in smaller models, the embedding table is a significant fraction of total parameters, which is one motivation for tied embeddings.
No Biases in Modern LLMs
LLaMA and most modern LLMs remove bias terms from Linear layers:
Standard: y = xW + b (W and b are learned)
No bias: y = xW (only W)
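In PyTorch the second form corresponds to `nn.Linear(..., bias=False)`. For a sense of scale, the sketch below counts how many bias parameters a model would carry if every projection had one. It assumes LLaMA 2 7B shapes (32 blocks, d_model 4096, d_ff 11008, 32 query and 32 KV heads); the function name is hypothetical, and the bias-free total of 6,738,415,616 is computed from those shapes with a 32000-token vocabulary and untied embeddings.

```python
def bias_params_per_block(d_model, d_ff, n_q_heads, n_kv_heads):
    """Bias entries a block would gain with a bias on every projection (one per output feature)."""
    head_dim = d_model // n_q_heads
    kv_dim = n_kv_heads * head_dim
    attn = d_model + 2 * kv_dim + d_model  # W_q, W_k, W_v, W_o
    ffn = 2 * d_ff + d_model               # SwiGLU: gate, up, down
    return attn + ffn


per_block = bias_params_per_block(4096, 11008, n_q_heads=32, n_kv_heads=32)
total_bias = 32 * per_block
weights = 6_738_415_616  # bias-free parameter count for these shapes (assumption, see above)
print(f"{total_bias:,} bias parameters, {100 * total_bias / weights:.3f}% of the weights")
# 1,359,872 bias parameters, 0.020% of the weights
```

At this scale the saving in parameters is negligible, so any case for dropping biases has to rest on other effects.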
Reasons:
1. Smaller parameter count (minor)
2. Better generalisation (empirically observed)
3. Interaction with LayerNorm: a bias in the preceding layer
is absorbed by LayerNorm's learned shift β, making it redundant
(RMSNorm has no shift, so this argument applies to LayerNorm-based models)
4. Cleaner weight decay (bias terms are typically excluded
from L2 regularisation, so removing them simplifies this)
Interview Answer
"Modern decoder-only LLMs (LLaMA, Mistral) use Pre-RMSNorm instead of Post-LayerNorm for training stability, RoPE for positional encoding instead of learned absolute positions, SwiGLU activation in the FFN instead of ReLU/GELU, grouped-query attention to reduce the KV cache, and no bias terms in Linear layers. The core transformer block structure is unchanged: masked self-attention + FFN + residuals. Input and output embeddings are often tied, especially in smaller models, to save parameters and regularise the representation space."