History of Language Models

Language modeling is the task of assigning probabilities to sequences of words. What began as counting word co-occurrences in the 1980s has evolved into systems that can write code, reason through problems, and hold nuanced conversations. This article traces that arc — every major conceptual leap and the engineering insight behind it.

1. Statistical Language Models: n-grams (1980s–2000s)

The Core Idea

An n-gram model estimates the probability of a word given the previous n-1 words:

P(w_t | w_{t-1}, w_{t-2}, ..., w_{t-n+1})

A bigram model (n=2) predicts the next word given only the current word. A trigram model uses the previous two words.
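As a minimal sketch of the estimate above, the unsmoothed bigram probability is just a ratio of counts. The one-sentence corpus and the helper name `bigram_prob` are illustrative assumptions, not part of any library.

```python
from collections import Counter

# Toy corpus, chosen only for illustration
tokens = "the cat sat on the mat".split()

# Count each adjacent word pair and each word that appears as a context
pair_counts = Counter(zip(tokens, tokens[1:]))
context_counts = Counter(tokens[:-1])

def bigram_prob(prev: str, word: str) -> float:
    """Unsmoothed estimate: count(prev, word) / count(prev)."""
    return pair_counts[(prev, word)] / context_counts[prev]

# "the" is followed once by "cat" and once by "mat"
print(bigram_prob("the", "cat"))  # 0.5
```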

How It Works in Practice

Python

from collections import defaultdict, Counter
import math

class NGramModel:
    def __init__(self, n=2):
        self.n = n
        self.ngram_counts = defaultdict(Counter)
        self.context_counts = Counter()

    def train(self, corpus: list[str]):
        """corpus: list of sentences, each a string"""
        for sentence in corpus:
            tokens = ["<START>"] * (self.n - 1) + sentence.split() + ["<END>"]
            for i in range(self.n - 1, len(tokens)):
                context = tuple(tokens[i - self.n + 1 : i])
                word = tokens[i]
                self.ngram_counts[context][word] += 1
                self.context_counts[context] += 1

    def probability(self, word: str, context: tuple) -> float:
        """Laplace-smoothed probability"""
        # Vocabulary = distinct words seen as prediction targets (includes <END>)
        vocab_size = len({w for counts in self.ngram_counts.values() for w in counts})
        count = self.ngram_counts[context][word]
        context_count = self.context_counts[context]
        return (count + 1) / (context_count + vocab_size)

    def perplexity(self, test_corpus: list[str]) -> float:
        log_prob_sum = 0
        token_count = 0
        for sentence in test_corpus:
            tokens = ["<START>"] * (self.n - 1) + sentence.split() + ["<END>"]
            for i in range(self.n - 1, len(tokens)):
                context = tuple(tokens[i - self.n + 1 : i])
                word = tokens[i]
                p = self.probability(word, context)
                log_prob_sum += math.log(p)
                token_count += 1
        return math.exp(-log_prob_sum / token_count)

# Usage
corpus = [
    "the cat sat on the mat",
    "the cat ate the rat",
    "the dog sat on the log",
]
model = NGramModel(n=3)
model.train(corpus)
print(model.perplexity(["the cat sat"]))  # lower = better
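One way to build intuition for the number that `perplexity` returns is to treat it as an effective branching factor. The sketch below uses a standalone helper (a hypothetical simplification of the method above that takes per-token probabilities directly) to check that a model spreading probability uniformly over 10 words has a perplexity of about 10.

```python
import math

def perplexity(token_probs: list[float]) -> float:
    """exp of the average negative log-probability of the observed tokens."""
    return math.exp(-sum(math.log(p) for p in token_probs) / len(token_probs))

# A model that assigns probability 1/10 to every token it is asked about
uniform_probs = [1 / 10] * 5
print(perplexity(uniform_probs))

# A more confident (and correct) model gets a lower perplexity
print(perplexity([0.5, 0.9, 0.8, 0.7, 0.6]))
```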

The Curse of Dimensionality

With a vocabulary of 50,000 words and trigrams, you have 50,000^3 = 125 trillion possible trigrams. Most are never observed in training data. Solutions like Kneser-Ney smoothing partially addressed this, but the fundamental limitation remained: n-grams cannot model long-range dependencies.
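The arithmetic behind that sparsity claim is easy to check. In the sketch below, the corpus size of one billion tokens is an assumption chosen only to make the comparison concrete.

```python
vocab_size = 50_000
possible_trigrams = vocab_size ** 3
print(f"{possible_trigrams:,}")  # 125,000,000,000,000

# Hypothetical corpus size, for illustration only
corpus_tokens = 1_000_000_000

# A corpus of N tokens has N - 2 trigram positions, so it can contain
# at most that many distinct trigrams
max_observed = corpus_tokens - 2
coverage = max_observed / possible_trigrams
print(f"at most {coverage:.6%} of possible trigrams can appear")
```

Even under this generous upper bound, well under one hundredth of a percent of the trigram space is ever seen, which is why smoothing matters so much for n-gram models.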

2. Neural Language Models (2003–2012)

Bengio's 2003 Breakthrough

The 2003 paper "A Neural Probabilistic Language Model" by Yoshua Bengio and colleagues introduced the idea of learning continuous word representations (embeddings) and using a neural network to predict the next word. This was the seed of everything that followed.

The key insight: map each word to a dense vector in a continuous space. Words with similar meanings cluster together. The network can generalize across similar words automatically — no explicit smoothing required.
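To illustrate what "cluster together" means, the sketch below compares hand-picked vectors with cosine similarity. The three-dimensional vectors are made up for this example; a trained model learns its own vectors and uses many more dimensions.

```python
import math

# Hypothetical embeddings, hand-picked for illustration (not learned)
embeddings = {
    "cat": [0.9, 0.8, 0.1],
    "dog": [0.8, 0.9, 0.2],
    "car": [0.1, 0.2, 0.9],
}

def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity: dot product divided by the product of the norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

print(cosine(embeddings["cat"], embeddings["dog"]))  # close to 1
print(cosine(embeddings["cat"], embeddings["car"]))  # much lower
```

Because the network's prediction is a smooth function of these vectors, whatever it learns about contexts containing "cat" partly transfers to contexts containing "dog".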

Python

import torch
import torch.nn as nn

class BengioNLM(nn.Module):
    """Simplified version of Bengio's 2003 neural language model"""
    def __init__(self, vocab_size: int, embed_dim: int, context_len: int, hidden_dim: int):
        super().__init__()
        self.embeddings = nn.Embedding(vocab_size, embed_dim)
        # Context words concatenated, then projected
        self.hidden = nn.Linear(context_len * embed_dim, hidden_dim)
        self.output = nn.Linear(hidden_dim, vocab_size)
        self.tanh = nn.Tanh()

    def forward(self, context_ids: torch.Tensor) -> torch.Tensor:
        # context_ids: (batch, context_len)
        emb = self.embeddings(context_ids)          # (batch, context_len, embed_dim)
        emb_flat = emb.view(emb.size(0), -1)        # (batch, context_len * embed_dim)
        h = self.tanh(self.hidden(emb_flat))         # (batch, hidden_dim)
        logits = self.output(h)                      # (batch, vocab_size)
        return logits

3. Recurrent Neural Networks (2010–2015)

Why RNNs Were a Big Deal

RNNs process sequences step by step, maintaining a hidden state that theoretically captures the entire history:

h_t = tanh(W_hh * h_{t-1} + W_xh * x_t + b)
y_t = W_hy * h_t

This enables variable-length context, unlike n-grams, which only ever see the previous n-1 words.

Python

class VanillaRNN(nn.Module):
    def __init__(self, vocab_size: int, embed_dim: int, hidden_dim: int):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.RNN(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, x: torch.Tensor, h0=None):
        emb = self.embed(x)                # (batch, seq_len, embed_dim)
        out, hn = self.rnn(emb, h0)        # out: (batch, seq_len, hidden_dim)
        logits = self.out(out)             # (batch, seq_len, vocab_size)
        return logits, hn
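The recurrence h_t = tanh(W_hh * h_{t-1} + W_xh * x_t + b) can also be unrolled by hand. The sketch below is a scalar version with made-up weights, so every quantity is a single float. It is meant only to show that the same weights are reused at every step and that the sequence can have any length.

```python
import math

def rnn_step(h_prev: float, x: float, w_hh: float, w_xh: float, b: float) -> float:
    """One scalar RNN step: h_t = tanh(w_hh * h_prev + w_xh * x + b)."""
    return math.tanh(w_hh * h_prev + w_xh * x + b)

def run_rnn(xs: list[float], w_hh: float = 0.5, w_xh: float = 1.0, b: float = 0.0) -> float:
    """Feed a sequence of any length through the same weights; return the final state."""
    h = 0.0
    for x in xs:
        h = rnn_step(h, x, w_hh, w_xh, b)
    return h

# The first input still influences the state two steps later
print(run_rnn([1.0, 0.0, 0.0]))
print(run_rnn([0.0, 0.0, 0.0]))  # 0.0: nothing to remember
```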

The Vanishing Gradient Problem

Training RNNs on long sequences was nearly impossible because gradients shrink exponentially as they backpropagate through many time steps. The error signal from a prediction has effectively disappeared by the time it reaches a word 50 positions earlier, so the network cannot learn to use that distant context.
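The exponential shrinkage can be seen numerically. The sketch below assumes a scalar RNN with no input, h_t = tanh(w_hh * h_{t-1}), and a made-up weight of 0.9. By the chain rule, each step multiplies the gradient by w_hh * (1 - h_t^2), which is below 1 here.

```python
import math

def gradient_through_time(w_hh: float, steps: int, h0: float = 0.5) -> float:
    """d h_T / d h_0 for the scalar RNN h_t = tanh(w_hh * h_{t-1})."""
    h, grad = h0, 1.0
    for _ in range(steps):
        h = math.tanh(w_hh * h)
        # Chain rule: d tanh(z)/dz = 1 - tanh(z)^2, and dz/dh_prev = w_hh
        grad *= w_hh * (1.0 - h * h)
    return grad

for steps in (1, 10, 50):
    print(steps, gradient_through_time(w_hh=0.9, steps=steps))
```

With these values the gradient after 50 steps is bounded above by 0.9^50, roughly 0.005, so almost none of the error signal survives the trip back.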

4. LSTMs and GRUs (1997, popularized 2013–2016)

Long Short-Term Memory

Hochreiter and Schmidhuber introduced LSTMs in 1997, but they only gained widespread use in the deep learning era. LSTMs use gating mechanisms to control what information flows through the network:

Forget gate: decides what to discard from cell state
Input gate: decides what new information to store
Output gate: decides what to output
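The three gates listed above can be written out for a single scalar cell. This is a sketch using the standard LSTM update equations. The parameter layout (a dict of `(w_x, w_h, b)` triples) and all the numbers are assumptions made for readability.

```python
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x: float, h_prev: float, c_prev: float, p: dict) -> tuple[float, float]:
    """One step of a scalar LSTM cell. p maps a gate name to (w_x, w_h, b)."""
    def pre(name: str) -> float:
        w_x, w_h, b = p[name]
        return w_x * x + w_h * h_prev + b

    f = sigmoid(pre("forget"))        # how much of the old cell state to keep
    i = sigmoid(pre("input"))         # how much new information to write
    o = sigmoid(pre("output"))        # how much of the cell state to expose
    g = math.tanh(pre("candidate"))   # the new information itself
    c = f * c_prev + i * g
    h = o * math.tanh(c)
    return h, c

# Forget gate saturated open, input gate shut: the cell state is carried unchanged
params = {
    "forget": (0.0, 0.0, 100.0),
    "input": (0.0, 0.0, -100.0),
    "output": (0.0, 0.0, 0.0),
    "candidate": (1.0, 0.0, 0.0),
}
h, c = lstm_step(x=1.0, h_prev=0.0, c_prev=2.0, p=params)
print(c)
```

The additive update `c = f * c_prev + i * g` is what lets information, and gradients, pass through many steps without being squashed at each one.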

Python

class LSTMLanguageModel(nn.Module):
    def __init__(self, vocab_size: int, embed_dim: int, hidden_dim: int, num_layers: int = 2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(
            input_size=embed_dim,
            hidden_size=hidden_dim,
            num_layers=num_layers,
            batch_first=True,
            dropout=0.3
        )
        self.dropout = nn.Dropout(0.3)
        self.fc = nn.Linear(hidden_dim, vocab_size)

    def forward(self, x: torch.Tensor, hidden=None):
        emb = self.dropout(self.embed(x))
        out, hidden = self.lstm(emb, hidden)
        out = self.dropout(out)
        logits = self.fc(out)
        return logits, hidden

LSTMs achieved state-of-the-art results on language modeling benchmarks for years. Word2Vec (2013) and GloVe (2014) complemented LSTMs by providing high-quality pre-trained word embeddings.

5. The Attention Mechanism (2015)

Bahdanau et al.: Neural Machine Translation

In 2015, Bahdanau, Cho, and Bengio introduced attention for neural machine translation. The problem: encoding an entire source sentence into a fixed-size vector was a bottleneck. Attention allows the decoder to look back at all encoder hidden states and dynamically weight them.

Python

import torch
import torch.nn.functional as F

def bahdanau_attention(query: torch.Tensor, keys: torch.Tensor, values: torch.Tensor):
    """
    query: (batch, hidden_dim)         — decoder hidden state
    keys:  (batch, seq_len, hidden_dim) — encoder outputs
    values: (batch, seq_len, hidden_dim) — same as keys typically
    """
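    # Score: simplified additive attention. The original formulation applies
    # learned projections to the query and keys and a learned vector in place
    # of the plain sum used here.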
    # query expanded: (batch, 1, hidden_dim)
    query_expanded = query.unsqueeze(1)
    # Energy: (batch, seq_len)
    energy = torch.tanh(query_expanded + keys).sum(dim=-1)
    attention_weights = F.softmax(energy, dim=-1)       # (batch, seq_len)
    # Context vector: weighted sum of values
    context = (attention_weights.unsqueeze(-1) * values).sum(dim=1)  # (batch, hidden_dim)
    return context, attention_weights

This was transformative: instead of compressing everything into one vector, the model could selectively attend to relevant parts of the input.
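A small worked example makes "selectively attend" concrete. The sketch below takes alignment scores as given (the numbers are made up), turns them into weights with a softmax, and forms the context vector as a weighted sum of the value vectors.

```python
import math

def softmax(scores: list[float]) -> list[float]:
    m = max(scores)                       # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(scores: list[float], values: list[list[float]]) -> tuple[list[float], list[float]]:
    """Weight each value vector by its softmaxed score and sum them."""
    weights = softmax(scores)
    dim = len(values[0])
    context = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return context, weights

# Hypothetical scores for three source positions; the first is most relevant
scores = [2.0, 0.0, 0.0]
values = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
context, weights = attend(scores, values)
print(weights)
print(context)
```

The context vector leans toward the first value vector because its score is highest, yet the other positions still contribute a little.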

6. "Attention Is All You Need" — The Transformer (2017)

Vaswani et al.

The 2017 paper from Google Brain replaced recurrence entirely with self-attention. Every position attends to every other position in parallel. This enabled:

Parallelism — no sequential dependency between positions during training
Long-range dependencies — every token directly attends to every other token
Scalability — the architecture scales predictably with compute

Python

import math

class MultiHeadSelfAttention(nn.Module):
    def __init__(self, d_model: int, num_heads: int):
        super().__init__()
        assert d_model % num_heads == 0
        self.d_model = d_model
        self.num_heads = num_heads
        self.d_k = d_model // num_heads

        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.W_o = nn.Linear(d_model, d_model)

    def split_heads(self, x: torch.Tensor) -> torch.Tensor:
        B, T, D = x.shape
        return x.view(B, T, self.num_heads, self.d_k).transpose(1, 2)
        # (B, heads, T, d_k)

    def forward(self, x: torch.Tensor, mask=None) -> torch.Tensor:
        B, T, _ = x.shape
        Q = self.split_heads(self.W_q(x))
        K = self.split_heads(self.W_k(x))
        V = self.split_heads(self.W_v(x))

        scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(self.d_k)
        if mask is not None:
            scores = scores.masked_fill(mask == 0, float('-inf'))
        attn = F.softmax(scores, dim=-1)

        out = torch.matmul(attn, V)                     # (B, heads, T, d_k)
        out = out.transpose(1, 2).contiguous().view(B, T, self.d_model)
        return self.W_o(out)

The original Transformer used an encoder-decoder architecture for translation. But two variants proved more impactful:

Encoder-only: BERT — good for understanding tasks
Decoder-only: GPT — good for generation tasks
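Decoder-only models rely on a causal mask so that each position attends only to itself and earlier positions. The sketch below builds that pattern as nested lists, using the same convention as the `mask == 0` check in the attention code above (1 = may attend, 0 = blocked). The function name is an assumption.

```python
def causal_mask(seq_len: int) -> list[list[int]]:
    """Lower-triangular mask: row t has 1s for positions 0..t and 0s afterwards."""
    return [[1 if j <= i else 0 for j in range(seq_len)] for i in range(seq_len)]

for row in causal_mask(4):
    print(row)
# [1, 0, 0, 0]
# [1, 1, 0, 0]
# [1, 1, 1, 0]
# [1, 1, 1, 1]
```

In PyTorch the same pattern can be produced as a tensor with `torch.tril(torch.ones(T, T))` and passed as the `mask` argument.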

7. BERT (2018): Bidirectional Encoders

Google's BERT (Bidirectional Encoder Representations from Transformers) introduced masked language modeling: randomly mask 15% of tokens and train the model to predict them. Because it sees context from both directions, BERT excels at classification, NER, and QA.
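The masking step can be sketched in a few lines. This is a simplified version under stated assumptions: it always substitutes a `[MASK]` token, whereas the original BERT recipe sometimes substitutes a random token or leaves the chosen token unchanged. The function name and fixed seed are illustrative.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens: list[str], mask_rate: float = 0.15, seed: int = 0):
    """Return (masked_tokens, labels). labels holds the original token at
    masked positions and None everywhere else."""
    rng = random.Random(seed)
    n_mask = max(1, round(len(tokens) * mask_rate))
    positions = set(rng.sample(range(len(tokens)), n_mask))
    masked = [MASK if i in positions else t for i, t in enumerate(tokens)]
    labels = [t if i in positions else None for i, t in enumerate(tokens)]
    return masked, labels

sentence = ("the quick brown fox jumps over the lazy dog while "
            "the cat sleeps on a warm mat by the door").split()
masked, labels = mask_tokens(sentence)
print(" ".join(masked))
```

The model is trained to recover the entries of `labels` that are not `None`, using the unmasked words on both sides as context.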

BERT-base: 110M parameters, 12 layers, 768 hidden dim, 12 heads. BERT-large: 340M parameters, 24 layers, 1024 hidden dim, 16 heads.
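The BERT-base figure can be roughly reproduced from its dimensions. The sketch below counts only the large weight matrices. The layer count and hidden size come from the text, while the feed-forward width (3072), vocabulary size (30,522), and maximum sequence length (512) are the published BERT-base configuration values, assumed here because the text does not state them.

```python
def encoder_weight_params(layers: int, hidden: int, ffn: int,
                          vocab: int, max_positions: int) -> int:
    """Weight matrices only. Biases, LayerNorm, segment embeddings and task
    heads are ignored, so this slightly undercounts."""
    attention = 4 * hidden * hidden          # query, key, value, output projections
    feed_forward = 2 * hidden * ffn          # two linear layers per block
    embeddings = (vocab + max_positions) * hidden
    return layers * (attention + feed_forward) + embeddings

bert_base = encoder_weight_params(layers=12, hidden=768, ffn=3072,
                                  vocab=30_522, max_positions=512)
print(f"{bert_base / 1e6:.1f}M")  # 108.8M
```

The small gap to the quoted 110M comes from the terms this estimate leaves out.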

The "pre-train once, fine-tune everywhere" paradigm BERT established remains central to NLP today.

8. GPT-1, GPT-2, GPT-3: The Scaling Laws Era

GPT-1 (2018) — 117M Parameters

OpenAI's first GPT used a decoder-only transformer trained with next-token prediction on the BooksCorpus dataset. Fine-tuning the pre-trained model on downstream tasks, with only a small task-specific output layer added on top, worked surprisingly well.

GPT-2 (2019) — 1.5B Parameters

GPT-2 was trained on WebText (Reddit outbound links, 40GB). OpenAI initially withheld the full model citing "misuse concerns." When released, it demonstrated emergent capabilities: zero-shot translation, summarization, QA — without any fine-tuning.

GPT-3 (2020) — 175B Parameters

The scale jump was enormous. GPT-3 showed that in-context learning (few-shot prompting) could match fine-tuned models on many tasks. No gradient updates needed — just examples in the prompt.
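"Just examples in the prompt" means the task is specified entirely in the input text. The sketch below assembles such a prompt. The `Input:`/`Output:` layout and the function name are illustrative assumptions, not a format prescribed by any model.

```python
def few_shot_prompt(examples: list[tuple[str, str]], query: str,
                    instruction: str = "") -> str:
    """Concatenate solved examples followed by the unsolved query."""
    blocks = [instruction] if instruction else []
    for source, target in examples:
        blocks.append(f"Input: {source}\nOutput: {target}")
    # The final block is left open for the model to complete
    blocks.append(f"Input: {query}\nOutput:")
    return "\n\n".join(blocks)

prompt = few_shot_prompt(
    examples=[("cheese", "fromage"), ("house", "maison")],
    query="cat",
    instruction="Translate English to French.",
)
print(prompt)
```

The model's weights never change. The examples only condition the next-token distribution for the final, unfinished line.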

Kaplan et al. (2020) published neural scaling laws: performance improves predictably with model size, data, and compute as power laws.

Python

# Scaling law: Loss ~ C^(-0.048) approximately
# C = compute in FLOPs
def chinchilla_optimal_tokens(model_params: float) -> float:
    """
    Rule of thumb from Hoffmann et al. (2022), the Chinchilla paper:
    compute-optimal training tokens ≈ 20 * model_params
    """
    return 20 * model_params

gpt3_params = 175e9
optimal_tokens = chinchilla_optimal_tokens(gpt3_params)
print(f"GPT-3 optimal training tokens: {optimal_tokens:.2e}")
# GPT-3 was actually undertrained by this measure — it saw ~300B tokens
# Chinchilla (70B params, 1.4T tokens) outperformed GPT-3 on many benchmarks
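To get a feel for these numbers, the sketch below takes the exponent from the comment above at face value and asks how much the loss changes when compute is multiplied. It also works out how far GPT-3's roughly 300B training tokens fall short of the 20-tokens-per-parameter rule. The function name is an assumption.

```python
def loss_ratio(compute_multiplier: float, exponent: float = 0.048) -> float:
    """Factor by which loss changes when compute is multiplied,
    assuming loss is proportional to compute ** (-exponent)."""
    return compute_multiplier ** -exponent

print(f"10x compute  -> loss x {loss_ratio(10):.3f}")
print(f"100x compute -> loss x {loss_ratio(100):.3f}")

# Token budget gap for GPT-3 under the 20-tokens-per-parameter rule of thumb
rule_of_thumb_tokens = 20 * 175e9
actual_tokens = 300e9
token_gap = rule_of_thumb_tokens / actual_tokens
print(f"rule of thumb asks for {token_gap:.1f}x more tokens")
```

A small exponent means each tenfold increase in compute buys only about a 10% reduction in loss, which is why the scaling era required such large jumps in model and data size.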

9. The ChatGPT Moment (2022)

InstructGPT (2022) applied RLHF (Reinforcement Learning from Human Feedback) to align GPT-3 with human intent. A 1.3B InstructGPT model outperformed 175B GPT-3 on human preference evaluations.

ChatGPT launched in November 2022. It reached an estimated 100 million monthly users within about two months, which analysts at the time described as the fastest consumer-product adoption on record. The "ChatGPT moment" marked when LLMs went from research curiosity to mainstream technology.

What made it work:

Pre-training on massive diverse data
Supervised fine-tuning on instruction-following examples
RLHF to align with human preferences
A chat interface that made it accessible
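The RLHF step in the list above starts by training a reward model on human comparisons between pairs of responses. A minimal sketch of the pairwise loss commonly used for that purpose is shown below. The reward values are made-up floats standing in for a reward model's outputs.

```python
import math

def preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """-log(sigmoid(r_chosen - r_rejected)), written as log(1 + exp(-margin)).
    Small when the human-preferred response already scores higher."""
    margin = reward_chosen - reward_rejected
    return math.log1p(math.exp(-margin))

print(preference_loss(2.0, 0.0))   # preferred response scored higher: low loss
print(preference_loss(0.0, 0.0))   # no preference expressed: log(2)
print(preference_loss(0.0, 2.0))   # ranking is wrong: high loss
```

Minimizing this loss pushes the reward model to rank responses the way human labelers did. The language model is then optimized against that learned reward.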

10. GPT-4, Claude, Gemini Era (2023–2026)

GPT-4 (2023)

OpenAI's GPT-4 is a multimodal model (text + images). Its architecture is not publicly disclosed. Key improvements:

Significantly better reasoning
Longer context (initially 8K, later 128K tokens)
Reduced hallucination
Multimodal input

Claude (Anthropic, 2023–2025)

Anthropic trained Claude using Constitutional AI — a technique where the model critiques its own outputs against a set of principles. Claude 3 (Opus, Sonnet, Haiku) and Claude 3.5/4 demonstrated strong reasoning, coding, and long-context capabilities.

Gemini (Google DeepMind, 2023–2025)

Google's Gemini was designed as natively multimodal from the start (unlike GPT-4V which added vision later). Gemini 1.5 Pro introduced a 1M token context window; Google has not publicly detailed the attention mechanism behind it.

Open Source Resurgence

Meta's LLaMA (2023), LLaMA 2 (2023), and LLaMA 3 (2024) democratized LLM research. Mistral, Phi, Falcon, and Qwen followed. By 2025, open-source models at 7–70B parameters matched or exceeded GPT-3.5 performance.

11. The Arc of Progress: A Summary Table

| Year | Model | Params | Key Innovation |
|------|-------|--------|----------------|
| 2003 | Bengio NLM | small | Neural word embeddings |
| 2013 | Word2Vec | small | Efficient embedding training |
| 2015 | Seq2Seq + Attention | medium | Dynamic context weighting |
| 2017 | Transformer | 65M | Parallel self-attention |
| 2018 | BERT | 340M | Bidirectional pre-training |
| 2018 | GPT-1 | 117M | Decoder-only LM + fine-tune |
| 2019 | GPT-2 | 1.5B | Zero-shot capabilities |
| 2020 | GPT-3 | 175B | Few-shot in-context learning |
| 2022 | InstructGPT | 1.3B | RLHF alignment |
| 2022 | ChatGPT | ~175B | Accessible chat interface |
| 2023 | GPT-4 | undisclosed | Multimodal, strong reasoning |
| 2023 | LLaMA | 7B–65B | Open-source competition |
| 2024 | LLaMA 3 | 8B–405B | Open-source SOTA |
| 2025 | Claude 4 | undisclosed | Constitutional AI, long context |

12. Key Conceptual Shifts Over Time

From Rules to Statistics to Neural to Scale

Each era had a dominant paradigm:

1980s–2000s: Rule-based systems and statistical models — interpretable but brittle
2010–2017: Deep learning + RNNs — learned representations but slow sequential training
2017–2020: Transformers — parallel training, scaling became possible
2020–present: Scale as the primary lever — more data, more compute, emergent capabilities

The Bitter Lesson (Rich Sutton, 2019)

Rich Sutton's famous essay argued that methods which leverage computation (search, learning) consistently outperform methods that encode human knowledge. The history of language models largely supports this: again and again, approaches built on domain knowledge or clever architecture tricks were eventually outperformed by simpler methods trained at larger scale.

13. What's Still Open

Despite the remarkable progress, fundamental questions remain:

Grounding: LLMs learn statistical patterns, not grounded meaning. Is this sufficient for reasoning?
Compositional generalization: Can models apply learned rules to novel combinations?
Data efficiency: Humans learn from far fewer examples than LLMs need
Long-term coherence: Even 1M context models struggle with document-length coherence
Reliable factuality: Hallucination remains a hard problem

The history of language models is a history of practitioners repeatedly being surprised by what scale enables. The next chapter has not been written.

Summary

From n-grams counting word pairs to transformers predicting the next token in 175 billion parameter models, the journey took roughly 40 years and accelerated dramatically after 2017. The transformer architecture, trained with next-token prediction on internet-scale data and then aligned with human feedback, is the foundation of every major LLM today. Understanding this history explains why LLMs behave the way they do — their strengths, their failure modes, and the open research questions that drive the field forward.
