History of Language Models
A comprehensive journey from n-gram models to GPT-4, Claude, and Gemini, tracing the key architectural breakthroughs that define modern LLMs.
Language modeling is the task of assigning probabilities to sequences of words. What began as counting word co-occurrences in the 1980s has evolved into systems that can write code, reason through problems, and hold nuanced conversations. This article traces that arc: every major conceptual leap and the engineering insight behind it.
1. Statistical Language Models: n-grams (1980s–2000s)
The Core Idea
An n-gram model estimates the probability of a word given the previous n-1 words:
P(w_t | w_{t-1}, w_{t-2}, ..., w_{t-n+1})
A bigram model (n=2) predicts the next word given only the current word. A trigram model uses the previous two words.
How It Works in Practice
```python
from collections import defaultdict, Counter
import math


class NGramModel:
    def __init__(self, n=2):
        self.n = n
        self.ngram_counts = defaultdict(Counter)  # context -> counts of next words
        self.context_counts = Counter()           # context -> total occurrences
        self.vocab = set()                        # distinct tokens seen as targets

    def train(self, corpus: list[str]):
        """corpus: list of sentences, each a string"""
        for sentence in corpus:
            tokens = ["<START>"] * (self.n - 1) + sentence.split() + ["<END>"]
            for i in range(self.n - 1, len(tokens)):
                context = tuple(tokens[i - self.n + 1 : i])
                word = tokens[i]
                self.ngram_counts[context][word] += 1
                self.context_counts[context] += 1
                self.vocab.add(word)

    def probability(self, word: str, context: tuple) -> float:
        """Laplace (add-one) smoothed probability"""
        # The smoothing denominator uses the number of distinct word types,
        # not the number of distinct (context, word) pairs.
        vocab_size = len(self.vocab)
        # .get avoids inserting empty entries for unseen contexts
        count = self.ngram_counts.get(context, Counter())[word]
        context_count = self.context_counts[context]
        return (count + 1) / (context_count + vocab_size)

    def perplexity(self, test_corpus: list[str]) -> float:
        log_prob_sum = 0.0
        token_count = 0
        for sentence in test_corpus:
            tokens = ["<START>"] * (self.n - 1) + sentence.split() + ["<END>"]
            for i in range(self.n - 1, len(tokens)):
                context = tuple(tokens[i - self.n + 1 : i])
                word = tokens[i]
                p = self.probability(word, context)
                log_prob_sum += math.log(p)
                token_count += 1
        return math.exp(-log_prob_sum / token_count)


# Usage
corpus = [
    "the cat sat on the mat",
    "the cat ate the rat",
    "the dog sat on the log",
]
model = NGramModel(n=3)
model.train(corpus)
print(model.perplexity(["the cat sat"]))  # lower = better
```
The Curse of Dimensionality
With a vocabulary of 50,000 words and trigrams, you have 50,000^3 = 125 trillion possible trigrams. Most are never observed in training data. Solutions like Kneser-Ney smoothing partially addressed this, but the fundamental limitation remained: n-grams cannot model long-range dependencies.
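To make the sparsity argument concrete, the arithmetic can be sketched in a few lines. The one-billion-token corpus size is a hypothetical figure chosen for illustration, and the helper names are not from any library. The bound treats the corpus as one continuous token stream with no sentence padding.

```python
def possible_ngrams(vocab_size: int, n: int) -> int:
    """Number of distinct n-grams that could exist over a vocabulary."""
    return vocab_size ** n


def observed_upper_bound(corpus_tokens: int, n: int) -> int:
    """A stream of T tokens contains at most T - n + 1 n-grams,
    so it cannot contain more distinct n-grams than that."""
    return max(corpus_tokens - n + 1, 0)


vocab_size = 50_000
corpus_tokens = 1_000_000_000  # hypothetical corpus size (assumption)

total = possible_ngrams(vocab_size, 3)
seen = observed_upper_bound(corpus_tokens, 3)
print(f"possible trigrams:        {total:,}")
print(f"upper bound on observed:  {seen:,}")
print(f"best-case coverage:       {seen / total:.6%}")
```

Even in the best case, where every trigram in the corpus is distinct, the corpus covers only a tiny fraction of the possible trigrams, which is why smoothing has to assign probability mass to events that were never seen.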
2. Neural Language Models (2003–2012)
Bengio's 2003 Breakthrough
Yoshua Bengio's 2003 paper "A Neural Probabilistic Language Model" introduced the idea of learning continuous word representations (embeddings) and using a neural network to predict the next word. This was the seed of everything that followed.
The key insight: map each word to a dense vector in a continuous space. Words with similar meanings cluster together, so the network can generalize across similar words automatically, with no explicit smoothing required.
```python
import torch
import torch.nn as nn


class BengioNLM(nn.Module):
    """Simplified version of Bengio's 2003 neural language model"""

    def __init__(self, vocab_size: int, embed_dim: int, context_len: int, hidden_dim: int):
        super().__init__()
        self.embeddings = nn.Embedding(vocab_size, embed_dim)
        # Context words concatenated, then projected
        self.hidden = nn.Linear(context_len * embed_dim, hidden_dim)
        self.output = nn.Linear(hidden_dim, vocab_size)
        self.tanh = nn.Tanh()

    def forward(self, context_ids: torch.Tensor) -> torch.Tensor:
        # context_ids: (batch, context_len)
        emb = self.embeddings(context_ids)      # (batch, context_len, embed_dim)
        emb_flat = emb.view(emb.size(0), -1)    # (batch, context_len * embed_dim)
        h = self.tanh(self.hidden(emb_flat))    # (batch, hidden_dim)
        logits = self.output(h)                 # (batch, vocab_size)
        return logits
```
3. Recurrent Neural Networks (2010–2015)
Why RNNs Were a Big Deal
RNNs process sequences step by step, maintaining a hidden state that theoretically captures the entire history:
h_t = tanh(W_hh * h_{t-1} + W_xh * x_t + b)
y_t = W_hy * h_t
This enables variable-length context, unlike n-gram models, whose context is fixed at the previous n-1 words.
```python
class VanillaRNN(nn.Module):
    def __init__(self, vocab_size: int, embed_dim: int, hidden_dim: int):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.RNN(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, x: torch.Tensor, h0=None):
        emb = self.embed(x)           # (batch, seq_len, embed_dim)
        out, hn = self.rnn(emb, h0)   # out: (batch, seq_len, hidden_dim)
        logits = self.out(out)        # (batch, seq_len, vocab_size)
        return logits, hn
```
The Vanishing Gradient Problem
Training RNNs on long sequences was very difficult because gradients tend to shrink exponentially as they are backpropagated through many time steps. The error signal from the current prediction has effectively vanished by the time it reaches a word 50 positions earlier, so the network cannot learn that the earlier word mattered.
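The shrinkage can be seen in a scalar toy version of the recurrence. The sketch below is an illustration under simplifying assumptions, not a training procedure. It uses a single hidden unit, a constant input, and a recurrent weight of 0.9, all chosen arbitrarily, and tracks the magnitude of the derivative of h_t with respect to h_0.

```python
import math


def gradient_through_time(w_hh: float, steps: int, h0: float = 0.1, x: float = 0.5) -> list[float]:
    """Track |d h_t / d h_0| for the scalar RNN h_t = tanh(w_hh * h_{t-1} + x)."""
    h = h0
    grad = 1.0
    magnitudes = []
    for _ in range(steps):
        h = math.tanh(w_hh * h + x)
        # Chain rule: d h_t / d h_{t-1} = w_hh * (1 - h_t^2)
        grad *= w_hh * (1.0 - h * h)
        magnitudes.append(abs(grad))
    return magnitudes


mags = gradient_through_time(w_hh=0.9, steps=50)
print(f"after 1 step:   {mags[0]:.3e}")
print(f"after 10 steps: {mags[9]:.3e}")
print(f"after 50 steps: {mags[49]:.3e}")
```

Every factor in the product has magnitude below 1 here, so the gradient decays geometrically with the number of steps. With matrices instead of scalars the same product of Jacobians appears, and it can also explode when the factors are larger than 1.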
4. LSTMs and GRUs (1997, popularized 2013–2016)
Long Short-Term Memory
Hochreiter and Schmidhuber introduced LSTMs in 1997, but they only gained widespread use in the deep learning era. LSTMs use gating mechanisms to control what information flows through the network:
- Forget gate: decides what to discard from cell state
- Input gate: decides what new information to store
- Output gate: decides what to output
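Written out for a single time step, the three gates combine as follows. This is a minimal sketch with scalar input and state so that each gate stays visible. The parameter names (`w_f`, `u_f`, and so on) and the values used are assumptions for illustration, and `g` is the candidate update that the input gate scales.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(x: float, h_prev: float, c_prev: float, p: dict) -> tuple[float, float]:
    """One LSTM step with scalar input, hidden state, and cell state."""
    f = sigmoid(p["w_f"] * x + p["u_f"] * h_prev + p["b_f"])    # forget gate
    i = sigmoid(p["w_i"] * x + p["u_i"] * h_prev + p["b_i"])    # input gate
    o = sigmoid(p["w_o"] * x + p["u_o"] * h_prev + p["b_o"])    # output gate
    g = math.tanh(p["w_g"] * x + p["u_g"] * h_prev + p["b_g"])  # candidate values
    c = f * c_prev + i * g   # keep part of the old cell state, add new information
    h = o * math.tanh(c)     # expose a gated view of the cell state
    return h, c


# Arbitrary illustrative parameters: all weights 0.5, all biases 0.
params = {name: 0.5 for name in ("w_f", "u_f", "w_i", "u_i", "w_o", "u_o", "w_g", "u_g")}
params.update({"b_f": 0.0, "b_i": 0.0, "b_o": 0.0, "b_g": 0.0})

h, c = 0.0, 0.0
for x in [1.0, -0.5, 0.25]:
    h, c = lstm_step(x, h, c, params)
    print(f"x={x:+.2f}  h={h:+.4f}  c={c:+.4f}")
```

The cell state update is additive (`f * c_prev + i * g`). When the forget gate stays close to 1, information and gradients can pass through many steps largely unchanged, which is how the design eases the vanishing gradient problem.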
```python
class LSTMLanguageModel(nn.Module):
    def __init__(self, vocab_size: int, embed_dim: int, hidden_dim: int, num_layers: int = 2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(
            input_size=embed_dim,
            hidden_size=hidden_dim,
            num_layers=num_layers,
            batch_first=True,
            dropout=0.3
        )
        self.dropout = nn.Dropout(0.3)
        self.fc = nn.Linear(hidden_dim, vocab_size)

    def forward(self, x: torch.Tensor, hidden=None):
        emb = self.dropout(self.embed(x))
        out, hidden = self.lstm(emb, hidden)
        out = self.dropout(out)
        logits = self.fc(out)
        return logits, hidden
```
LSTMs achieved state-of-the-art results on language modeling benchmarks for years. Word2Vec (2013) and GloVe (2014) complemented LSTMs by providing high-quality pre-trained word embeddings.
5. The Attention Mechanism (2015)
Bahdanau et al.: Neural Machine Translation
In 2015, Bahdanau, Cho, and Bengio introduced attention for neural machine translation. The problem: encoding an entire source sentence into a fixed-size vector was a bottleneck. Attention allows the decoder to look back at all encoder hidden states and dynamically weight them.
```python
import torch
import torch.nn.functional as F


def bahdanau_attention(query: torch.Tensor, keys: torch.Tensor, values: torch.Tensor):
    """
    query:  (batch, hidden_dim), decoder hidden state
    keys:   (batch, seq_len, hidden_dim), encoder outputs
    values: (batch, seq_len, hidden_dim), typically the same as keys
    """
    # Score: additive attention, simplified. The full formulation applies
    # learned projections to the query and keys and a learned vector to the
    # tanh output; they are omitted here, so this version has no parameters.
    # query expanded: (batch, 1, hidden_dim)
    query_expanded = query.unsqueeze(1)
    # Energy: (batch, seq_len)
    energy = torch.tanh(query_expanded + keys).sum(dim=-1)
    attention_weights = F.softmax(energy, dim=-1)  # (batch, seq_len)
    # Context vector: weighted sum of values
    context = (attention_weights.unsqueeze(-1) * values).sum(dim=1)  # (batch, hidden_dim)
    return context, attention_weights
```
This was transformative: instead of compressing everything into one vector, the model could selectively attend to relevant parts of the input.
6. "Attention Is All You Need": The Transformer (2017)
Vaswani et al.
The 2017 paper from Google Brain replaced recurrence entirely with self-attention. Every position attends to every other position in parallel. This enabled:
- Parallelism: no sequential dependency between positions during training
- Long-range dependencies: every token directly attends to every other token
- Scalability: the architecture scales predictably with compute
```python
import math


class MultiHeadSelfAttention(nn.Module):
    def __init__(self, d_model: int, num_heads: int):
        super().__init__()
        assert d_model % num_heads == 0
        self.d_model = d_model
        self.num_heads = num_heads
        self.d_k = d_model // num_heads
        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.W_o = nn.Linear(d_model, d_model)

    def split_heads(self, x: torch.Tensor) -> torch.Tensor:
        B, T, D = x.shape
        # Result: (B, heads, T, d_k)
        return x.view(B, T, self.num_heads, self.d_k).transpose(1, 2)

    def forward(self, x: torch.Tensor, mask=None) -> torch.Tensor:
        B, T, _ = x.shape
        Q = self.split_heads(self.W_q(x))
        K = self.split_heads(self.W_k(x))
        V = self.split_heads(self.W_v(x))
        scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(self.d_k)
        if mask is not None:
            scores = scores.masked_fill(mask == 0, float('-inf'))
        attn = F.softmax(scores, dim=-1)
        out = torch.matmul(attn, V)  # (B, heads, T, d_k)
        out = out.transpose(1, 2).contiguous().view(B, T, self.d_model)
        return self.W_o(out)
```
The original Transformer used an encoder-decoder architecture for translation. But two variants proved more impactful:
- Encoder-only: BERT, good for understanding tasks
- Decoder-only: GPT, good for generation tasks
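What separates the two variants in practice is largely the attention mask. A decoder-only model lets each position attend only to itself and earlier positions, while an encoder-only model leaves every position visible. The sketch below builds such a causal mask in plain Python. It follows the convention of the attention code above, where positions with a mask value of 0 are blocked, and the function name is made up for this example.

```python
def causal_mask(seq_len: int) -> list[list[int]]:
    """Lower-triangular mask: row i has 1s at positions 0..i and 0s afterwards."""
    return [[1 if j <= i else 0 for j in range(seq_len)] for i in range(seq_len)]


mask = causal_mask(4)
for row in mask:
    print(row)

# In PyTorch the same matrix can be built with torch.tril(torch.ones(T, T)).
```

Row i of the mask describes what token i may look at. The last row is all ones because the final token can see the whole prefix, and the first row allows only the token itself.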
7. BERT (2018): Bidirectional Encoders
Google's BERT (Bidirectional Encoder Representations from Transformers) introduced masked language modeling: randomly mask 15% of tokens and train the model to predict them. Because it sees context from both directions, BERT excels at classification, NER, and QA.
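The masking step can be sketched as below. One detail comes from the BERT paper rather than from this article: of the selected tokens, roughly 80% are replaced with a `[MASK]` symbol, 10% with a random token, and 10% are left unchanged. The function name and the toy vocabulary are assumptions made for illustration.

```python
import random

MASK = "[MASK]"


def mask_tokens(tokens, vocab, mask_prob=0.15, rng=None):
    """Return (corrupted_tokens, labels).

    labels[i] holds the original token at positions selected for prediction
    and None everywhere else.
    """
    rng = rng or random.Random()
    corrupted, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            labels.append(tok)  # the model must predict this token
            roll = rng.random()
            if roll < 0.8:
                corrupted.append(MASK)               # replace with the mask symbol
            elif roll < 0.9:
                corrupted.append(rng.choice(vocab))  # replace with a random token
            else:
                corrupted.append(tok)                # keep the original token
        else:
            labels.append(None)  # not part of the training loss
            corrupted.append(tok)
    return corrupted, labels


tokens = "the cat sat on the mat".split()
vocab = sorted(set(tokens))
corrupted, labels = mask_tokens(tokens, vocab, rng=random.Random(0))
print(corrupted)
print(labels)
```

The loss is computed only at positions where `labels` is not `None`, so each training pass gives a learning signal on about 15% of the tokens.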
BERT-base: 110M parameters, 12 layers, 768 hidden dim, 12 heads. BERT-large: 340M parameters, 24 layers, 1024 hidden dim, 16 heads.
The "pre-train once, fine-tune everywhere" paradigm that BERT popularized remains central to NLP today.
8. GPT-1, GPT-2, GPT-3: The Scaling Laws Era
GPT-1 (2018): 117M Parameters
OpenAI's first GPT used a decoder-only transformer trained with next-token prediction on the BooksCorpus dataset. Fine-tuning the pre-trained model on downstream tasks, with only a small task-specific output layer added, worked surprisingly well.
GPT-2 (2019): 1.5B Parameters
GPT-2 was trained on WebText (about 40GB of text from pages linked on Reddit). OpenAI initially withheld the full model, citing "misuse concerns." When released, it demonstrated emergent capabilities: zero-shot translation, summarization, and QA, all without any fine-tuning.
GPT-3 (2020): 175B Parameters
The scale jump was enormous. GPT-3 showed that in-context learning (few-shot prompting) could match fine-tuned models on many tasks. No gradient updates are needed, just examples in the prompt.
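In practice, few-shot prompting amounts to string construction: worked examples are placed in the context ahead of the new query, and the model continues the pattern. The sketch below shows one way to lay out such a prompt. The `Input:` / `Output:` labels and the function name are arbitrary choices for illustration, not a format prescribed by any model.

```python
def build_few_shot_prompt(examples: list[tuple[str, str]], query: str, instruction: str = "") -> str:
    """Concatenate solved examples and an unsolved query into a single prompt."""
    lines = []
    if instruction:
        lines.append(instruction)
    for source, target in examples:
        lines.append(f"Input: {source}")
        lines.append(f"Output: {target}")
    # The final example is left open for the model to complete.
    lines.append(f"Input: {query}")
    lines.append("Output:")
    return "\n".join(lines)


prompt = build_few_shot_prompt(
    examples=[("cheese", "fromage"), ("house", "maison")],
    query="cat",
    instruction="Translate English to French.",
)
print(prompt)
```

The model's weights never change. The examples influence the output only through the context that the next-token predictor conditions on.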
Kaplan et al. (2020) published neural scaling laws: performance improves predictably with model size, data, and compute as power laws.
```python
# Scaling law: Loss ~ C^(-0.048) approximately
# C = compute in FLOPs


def chinchilla_optimal_tokens(model_params: float) -> float:
    """
    Hoffmann et al. (2022) Chinchilla scaling laws, as a rule of thumb:
    optimal tokens ~= 20 * model_params
    """
    return 20 * model_params


gpt3_params = 175e9
optimal_tokens = chinchilla_optimal_tokens(gpt3_params)
print(f"GPT-3 optimal training tokens: {optimal_tokens:.2e}")
# GPT-3 was actually undertrained by this measure: it saw ~300B tokens
# Chinchilla (70B params, 1.4T tokens) outperformed GPT-3 on many benchmarks
```
9. The ChatGPT Moment (2022)
InstructGPT (2022) applied RLHF (Reinforcement Learning from Human Feedback) to align GPT-3 with human intent. A 1.3B InstructGPT model outperformed 175B GPT-3 on human preference evaluations.
ChatGPT launched in November 2022. It reached an estimated 100 million users within two months, the fastest consumer product adoption on record at that point. The "ChatGPT moment" marked when LLMs went from research curiosity to mainstream technology.
What made it work:
- Pre-training on massive diverse data
- Supervised fine-tuning on instruction-following examples
- RLHF to align with human preferences
- A chat interface that made it accessible
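The RLHF step depends on a reward model trained from human comparisons between pairs of responses. A minimal sketch of the pairwise loss commonly used for this, and described in the InstructGPT paper, is shown below. The scalar rewards are made-up numbers, and the function name is an assumption for this example.

```python
import math


def pairwise_preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """-log(sigmoid(r_chosen - r_rejected)), computed in a numerically stable way."""
    diff = reward_chosen - reward_rejected
    # -log(sigmoid(d)) = log(1 + exp(-d)); branch to avoid overflow in exp
    if diff >= 0:
        return math.log1p(math.exp(-diff))
    return -diff + math.log1p(math.exp(diff))


# The loss is small when the preferred response already scores higher...
print(pairwise_preference_loss(reward_chosen=2.0, reward_rejected=0.0))
# ...and large when the reward model ranks the pair the wrong way round.
print(pairwise_preference_loss(reward_chosen=0.0, reward_rejected=2.0))
```

Minimizing this loss pushes the reward model to score human-preferred responses higher. The language model is then optimized against that learned reward, which is the reinforcement learning part of RLHF.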
10. GPT-4, Claude, Gemini Era (2023–2026)
GPT-4 (2023)
OpenAI's GPT-4 is a multimodal model (text + images). Its architecture is not publicly disclosed. Key improvements:
- Significantly better reasoning
- Longer context (initially 8K, later 128K tokens)
- Reduced hallucination
- Multimodal input
Claude (Anthropic, 2023–2025)
Anthropic trained Claude using Constitutional AI, a technique where the model critiques its own outputs against a set of principles. Claude 3 (Opus, Sonnet, Haiku) and Claude 3.5/4 demonstrated strong reasoning, coding, and long-context capabilities.
Gemini (Google DeepMind, 2023–2025)
Google's Gemini was designed as natively multimodal from the start (unlike GPT-4V, which added vision later). Gemini 1.5 Pro introduced a context window of up to 1M tokens; Google has not published the full mechanism behind it.
Open Source Resurgence
Meta's LLaMA (2023), Llama 2 (2023), and Llama 3 (2024) democratized LLM research. Mistral, Phi, Falcon, and Qwen followed. By 2025, open-source models at 7–70B parameters matched or exceeded GPT-3.5 performance.
11. The Arc of Progress: A Summary Table
| Year | Model | Params | Key Innovation |
|------|-------|--------|----------------|
| 2003 | Bengio NLM | small | Neural word embeddings |
| 2013 | Word2Vec | small | Efficient embedding training |
| 2015 | Seq2Seq + Attention | medium | Dynamic context weighting |
| 2017 | Transformer | 65M | Parallel self-attention |
| 2018 | BERT | 340M | Bidirectional pre-training |
| 2018 | GPT-1 | 117M | Decoder-only LM + fine-tune |
| 2019 | GPT-2 | 1.5B | Zero-shot capabilities |
| 2020 | GPT-3 | 175B | Few-shot in-context learning |
| 2022 | InstructGPT | 1.3B | RLHF alignment |
| 2022 | ChatGPT | ~175B | Accessible chat interface |
| 2023 | GPT-4 | undisclosed | Multimodal, strong reasoning |
| 2023 | LLaMA | 7B–65B | Open-source competition |
| 2024 | LLaMA 3 | 8B–405B | Open-source SOTA |
| 2025 | Claude 4 | undisclosed | Constitutional AI, long context |
12. Key Conceptual Shifts Over Time
From Rules to Statistics to Neural to Scale
Each era had a dominant paradigm:
- 1980s–2000s: Rule-based systems and statistical models, interpretable but brittle
- 2010–2017: Deep learning + RNNs, with learned representations but slow sequential training
- 2017–2020: Transformers, where parallel training made scaling possible
- 2020–present: Scale as the primary lever, with more data, more compute, and emergent capabilities
The Bitter Lesson (Rich Sutton, 2019)
Rich Sutton's famous essay argued that methods which leverage computation (search, learning) consistently outperform methods that encode human knowledge. The history of language models validates this: every time researchers added domain knowledge or clever architecture tricks, they were eventually outperformed by scaling simpler methods.
13. What's Still Open
Despite the remarkable progress, fundamental questions remain:
- Grounding: LLMs learn statistical patterns, not grounded meaning. Is this sufficient for reasoning?
- Compositional generalization: Can models apply learned rules to novel combinations?
- Data efficiency: Humans learn from far fewer examples than LLMs need
- Long-term coherence: Even 1M context models struggle with document-length coherence
- Reliable factuality: Hallucination remains a hard problem
The history of language models is a history of practitioners repeatedly being surprised by what scale enables. The next chapter has not been written.
Summary
From n-grams counting word pairs to transformers predicting the next token with 175 billion parameters, the journey took roughly 40 years and accelerated dramatically after 2017. The transformer architecture, trained with next-token prediction on internet-scale data and then aligned with human feedback, is the foundation of nearly every major LLM today. Understanding this history explains why LLMs behave the way they do: their strengths, their failure modes, and the open research questions that drive the field forward.