Learnixo

LLMs Deep Dive · Lesson 5 of 24

Transformer Architecture: End-to-End Walk-Through

From Original Transformer to Modern LLMs

The 2017 Transformer architecture has been refined significantly:

Original Transformer (Vaswani 2017):
  Post-LayerNorm, sinusoidal PE, ReLU FFN, full encoder-decoder
  Adam optimiser, dropout 0.1

GPT-1/2 (2018-2019):
  Decoder-only, learned absolute PE, GELU activation

GPT-3 (2020):
  Same architecture, scale to 175B
  Alternating dense + sparse attention layers

LLaMA 1/2 (2023):
  Pre-RMSNorm, RoPE, SwiGLU FFN
  No bias terms in attention/FFN
  Grouped-query attention (70B variant)

LLaMA 3 (2024):
  GQA across all sizes, larger vocab (128K), longer context (8K-128K)
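
To make the benefit of grouped-query attention concrete, the following back-of-the-envelope sketch compares KV-cache size per token with and without GQA. The configuration values (80 layers, 64 query heads, 8 KV heads, head dimension 128, 2 bytes per value for fp16) are assumptions chosen to resemble LLaMA 2 70B, and `kv_cache_bytes_per_token` is a hypothetical helper name.

```python
def kv_cache_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """Bytes of KV cache stored per generated token.

    The leading factor 2 accounts for storing both a key and a value
    vector per KV head in every layer.
    """
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value

# Assumed 70B-style config: 80 layers, head_dim 128, fp16 cache
mha_bytes = kv_cache_bytes_per_token(n_layers=80, n_kv_heads=64, head_dim=128)
gqa_bytes = kv_cache_bytes_per_token(n_layers=80, n_kv_heads=8, head_dim=128)

print(f"MHA (64 KV heads): {mha_bytes / 1024:.0f} KiB per token")
print(f"GQA (8 KV heads):  {gqa_bytes / 1024:.0f} KiB per token")
print(f"Reduction factor:  {mha_bytes // gqa_bytes}x")
```

Under these assumptions the cache shrinks by the ratio of query heads to KV heads, which is why GQA matters most for long contexts and large batch sizes.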

Modern LLM Block (LLaMA-style)

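```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# RMSNorm, GroupedQueryAttention and SwiGLUFFN are assumed to be
# defined elsewhere; config carries the model hyperparameters.
class LLaMABlock(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.norm1 = RMSNorm(config.d_model)
        self.attn  = GroupedQueryAttention(
            d_model=config.d_model,
            num_q_heads=config.num_q_heads,
            num_kv_heads=config.num_kv_heads,  # GQA: fewer KV heads than Q heads
        )
        self.norm2 = RMSNorm(config.d_model)
        self.ffn   = SwiGLUFFN(config.d_model, config.d_ff)

    def forward(self, x, freqs_cos, freqs_sin, mask=None):
        # Pre-norm + residual: normalise the sub-layer input, then add
        # the sub-layer output back onto the unnormalised stream.
        # freqs_cos / freqs_sin are the precomputed RoPE rotation tables.
        x = x + self.attn(self.norm1(x), freqs_cos, freqs_sin, mask)
        x = x + self.ffn(self.norm2(x))
        return x
```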
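
The block above relies on `RMSNorm` and `SwiGLUFFN` without defining them. The following is a minimal sketch of how these two components are commonly written, not the reference LLaMA implementation; the `eps` default, the layer names `w_gate`, `w_up`, `w_down`, and the toy sizes are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    """Rescale by the root mean square; no mean subtraction, no bias."""
    def __init__(self, d_model, eps=1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(d_model))  # learned gain

    def forward(self, x):
        # 1 / sqrt(mean(x^2) + eps), computed over the feature dimension
        inv_rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return x * inv_rms * self.weight

class SwiGLUFFN(nn.Module):
    """Gated FFN: down( SiLU(gate(x)) * up(x) ), all projections bias-free."""
    def __init__(self, d_model, d_ff):
        super().__init__()
        self.w_gate = nn.Linear(d_model, d_ff, bias=False)
        self.w_up = nn.Linear(d_model, d_ff, bias=False)
        self.w_down = nn.Linear(d_ff, d_model, bias=False)

    def forward(self, x):
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))

# Toy usage: the pre-norm FFN half of a block on a (batch, seq, d_model) tensor
x = torch.randn(2, 5, 64)
norm = RMSNorm(64)
ffn = SwiGLUFFN(64, 172)
y = x + ffn(norm(x))
print(y.shape)
```

Note that SwiGLU uses three weight matrices instead of two, which is why LLaMA-style models pick a smaller `d_ff` than the usual 4 × `d_model`.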

Key Architectural Differences: BERT vs LLaMA

| Component | BERT-base | LLaMA 2 7B |
|-----------|-----------|------------|
| Type | Encoder-only | Decoder-only |
| Layers | 12 | 32 |
| d_model | 768 | 4096 |
| Heads | 12 | 32 (Q), 32 (KV) |
| FFN size | 3072 | 11008 |
| Norm | Post-LayerNorm | Pre-RMSNorm |
| Positional | Learned absolute | RoPE |
| Activation | GELU | SwiGLU |
| Context | 512 | 4096 |
| Params | 110M | 7B |

LLaMA 2 uses grouped-query attention only in its larger variants (8 KV heads at 70B); the 7B model keeps one KV head per query head.
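
As a sanity check on the LLaMA 2 7B column, the parameter count can be rebuilt from the table's dimensions. This is a rough sketch under stated assumptions: a 32,000-token vocabulary, 32 KV heads (full multi-head attention), no bias terms, one RMSNorm gain vector per norm, and an untied output head. `llama_param_count` is a hypothetical helper, not a library function.

```python
def llama_param_count(vocab, d_model, n_layers, d_ff,
                      n_q_heads, n_kv_heads, tied=False):
    """Count parameters of a LLaMA-style decoder with bias-free layers."""
    head_dim = d_model // n_q_heads
    kv_dim = n_kv_heads * head_dim
    attn = 2 * d_model * d_model + 2 * d_model * kv_dim  # W_q, W_o and W_k, W_v
    ffn = 3 * d_model * d_ff                             # gate, up, down
    norms = 2 * d_model                                  # two RMSNorm gains per block
    embed = vocab * d_model
    lm_head = 0 if tied else vocab * d_model
    final_norm = d_model
    return embed + n_layers * (attn + ffn + norms) + final_norm + lm_head

total = llama_param_count(vocab=32000, d_model=4096, n_layers=32,
                          d_ff=11008, n_q_heads=32, n_kv_heads=32)
print(f"{total:,}")  # 6,738,415,616
```

Roughly two thirds of each block's parameters sit in the SwiGLU FFN, with attention making up the remaining third.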


Tied Embeddings

Many LLMs tie the input token embedding matrix and the output LM head (GPT-2 does; LLaMA 2 keeps them separate):

E_in  ∈ ℝ^(vocab × d_model)  — input: token ID → vector
E_out ∈ ℝ^(d_model × vocab)  — output: vector → logits

Tied: E_out = E_inᵀ

Benefits:
  Saves vocab × d_model parameters (e.g., 32K × 4096 ≈ 131M params)
  Forces consistent semantic space between input and output
  Empirically matches or beats untied embeddings
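
In PyTorch, tying is commonly done by pointing the output layer's weight at the embedding's weight, since `nn.Embedding(vocab, d_model)` and `nn.Linear(d_model, vocab)` both store a `vocab × d_model` matrix. The sketch below illustrates this; `TinyLM` and `count_params` are hypothetical names, and the model deliberately omits the transformer blocks.

```python
import torch
import torch.nn as nn

class TinyLM(nn.Module):
    def __init__(self, vocab_size, d_model, tie=True):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)             # E_in
        self.lm_head = nn.Linear(d_model, vocab_size, bias=False)  # E_out
        if tie:
            # Share one Parameter object: E_out = E_in^T
            self.lm_head.weight = self.embed.weight

    def forward(self, token_ids):
        h = self.embed(token_ids)  # a real model would run its blocks here
        return self.lm_head(h)     # logits over the vocabulary

def count_params(model):
    # parameters() yields each shared Parameter only once
    return sum(p.numel() for p in model.parameters())

tied = TinyLM(vocab_size=1000, d_model=64, tie=True)
untied = TinyLM(vocab_size=1000, d_model=64, tie=False)
logits = tied(torch.tensor([[1, 2, 3]]))

print(logits.shape)
print(count_params(tied), count_params(untied))
```

Because both layers hold the same tensor, gradients from the input lookup and the output projection accumulate into a single matrix.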

Embedding and Output Dimensions

Token embedding table: vocab_size × d_model
  GPT-2:    50257 × 768  = 38.6M params
  LLaMA 2:  32000 × 4096 = 131M params
  LLaMA 3:  128256 × 4096 = 525M params

For large vocab sizes, the embedding table is a significant fraction
of total parameters — motivation for tied embeddings.
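
The figures above follow directly from `vocab_size × d_model`. The short sketch below reproduces them and also shows the cost of keeping a separate output matrix; the dictionary layout is only an illustrative choice.

```python
# (vocab_size, d_model) pairs taken from the figures above
tables = {
    "GPT-2": (50257, 768),
    "LLaMA 2": (32000, 4096),
    "LLaMA 3": (128256, 4096),
}

sizes = {name: vocab * d_model for name, (vocab, d_model) in tables.items()}

for name, n in sizes.items():
    # An untied model stores the table twice: once as input, once as LM head
    print(f"{name:8s} tied: {n / 1e6:6.1f}M   untied: {2 * n / 1e6:6.1f}M")
```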

No Biases in Modern LLMs

LLaMA and most modern LLMs remove bias terms from Linear layers:

Standard: y = xW + b  (W and b are learned)
No bias:  y = xW      (only W)

Reasons:
  1. Smaller parameter count (minor)
  2. No loss in quality, and more stable training at large scale
     (reported empirically, e.g. for PaLM)
  3. Overlap with LayerNorm: in a pre-norm block, LayerNorm's learned
     shift β passes through the next Linear layer as a constant offset
     βW, which duplicates the role of that layer's bias b
     (RMSNorm drops β as well)
  4. Cleaner weight decay (bias terms are typically excluded
     from L2 regularisation — removing them simplifies this)
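
The LayerNorm point can be checked numerically. The sketch below, which assumes a LayerNorm feeding directly into a Linear layer (written in PyTorch's `x @ W.T + b` convention), shows that the shift β and the bias b collapse into a single constant offset, so carrying both adds no expressive power. The tensor sizes and the seed are arbitrary.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
d = 8
x = torch.randn(4, d)

ln = nn.LayerNorm(d)
lin = nn.Linear(d, d, bias=True)

with torch.no_grad():
    ln.bias.normal_()  # give beta a non-zero value for the demonstration

    full = lin(ln(x))  # (x_hat * gamma + beta) @ W.T + b

    # Same computation with beta and b folded into one offset vector
    scaled = F.layer_norm(x, (d,)) * ln.weight   # x_hat * gamma
    offset = lin.weight @ ln.bias + lin.bias     # beta @ W.T + b
    folded = scaled @ lin.weight.T + offset

print(torch.allclose(full, folded, atol=1e-5))

# Dropping the bias removes exactly one vector of d parameters per layer
with_bias = sum(p.numel() for p in nn.Linear(d, d, bias=True).parameters())
without_bias = sum(p.numel() for p in nn.Linear(d, d, bias=False).parameters())
print(with_bias - without_bias)
```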

Interview Answer

"Modern decoder-only LLMs (LLaMA, Mistral) use Pre-RMSNorm instead of Post-LayerNorm for training stability, RoPE for positional encoding instead of learned absolute positions, SwiGLU activation in the FFN instead of ReLU/GELU, grouped-query attention to reduce the KV cache, and no bias terms in Linear layers. The core transformer block structure is unchanged: masked self-attention + FFN + residuals. Input and output embeddings are often tied, especially in smaller models, to save memory and regularise the representation space."