What Is a Large Language Model?
What LLMs are, how they work at a high level, what 'large' means, and how they differ from earlier NLP approaches.
Definition
A Large Language Model (LLM) is a neural network, specifically a transformer, trained on massive text corpora to model the probability distribution over sequences of tokens. Given a sequence of tokens, the model assigns a probability to every candidate next token in its vocabulary.
Core task: P(token_t | token_0, token_1, ..., token_{t-1})
"The cat sat on the ___"
→ P("mat") = 0.31
→ P("floor") = 0.18
→ P("roof") = 0.07
→ ...
Everything an LLM does (generation, question answering, summarisation, code completion) is implemented through this single conditional distribution.
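As a minimal sketch of how generation falls out of this one distribution, the toy code below turns made-up scores (logits) into next-token probabilities with a softmax, then repeatedly appends the most likely token. The vocabulary, the `toy_logits` function, and its scores are hypothetical stand-ins for a real transformer's output; only the shape of the loop is the point.

```python
import math

# Hypothetical toy vocabulary and target sentence (illustration only).
VOCAB = ["the", "cat", "sat", "on", "mat", "<eos>"]
SENTENCE = ["the", "cat", "sat", "on", "the", "mat", "<eos>"]

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def toy_logits(context):
    """Stand-in for a transformer: give a high score to the token that
    comes next in SENTENCE, based only on how long the context is."""
    target = SENTENCE[min(len(context), len(SENTENCE) - 1)]
    return [3.0 if tok == target else 0.0 for tok in VOCAB]

def next_token_distribution(context):
    """P(next token | context) as a dict over the vocabulary."""
    return dict(zip(VOCAB, softmax(toy_logits(context))))

def generate(prompt, max_new_tokens=10):
    """Greedy decoding: append the most probable token until <eos>."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        probs = next_token_distribution(tokens)
        best = max(probs, key=probs.get)
        if best == "<eos>":
            break
        tokens.append(best)
    return tokens

print(next_token_distribution(["the", "cat"]))
print(generate(["the", "cat"]))
```

Real systems typically sample from the distribution (for example with a temperature or top-p cutoff) rather than always taking the most probable token; greedy decoding is used here only to keep the example deterministic.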
What "Large" Means
Scale across three dimensions:
Parameters:
GPT-2 (2019): 1.5B
GPT-3 (2020): 175B
PaLM (2022): 540B
GPT-4 (2023): ~1T (unofficial estimate; not publicly disclosed)
LLaMA 3 (2024): 8B–405B
Training data:
GPT-3: 300B tokens
LLaMA 2: 2T tokens
LLaMA 3: 15T tokens
GPT-4: ~13T tokens (unofficial estimate; not publicly disclosed)
Compute (FLOPs):
GPT-3: 3.14 × 10²³ FLOPs
PaLM: 2.5 × 10²⁴ FLOPs
"Large" is relative: what was large in 2020 (GPT-3, 175B) is now a mid-size model.
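The compute figures above are consistent with a widely used rule of thumb from the scaling-law literature: training FLOPs ≈ 6 × parameters × training tokens. The sketch below applies it. The rule is an approximation that this article does not itself state, and PaLM's 780B-token count comes from the PaLM paper rather than from the lists above.

```python
def approx_training_flops(params, tokens):
    """Rule-of-thumb estimate: about 6 FLOPs per parameter per training
    token (roughly 2 for the forward pass and 4 for the backward pass)."""
    return 6 * params * tokens

# Parameter and token counts: GPT-3 from the lists above,
# PaLM tokens (780B) is an outside figure, see the note above.
gpt3 = approx_training_flops(175e9, 300e9)
palm = approx_training_flops(540e9, 780e9)

print(f"GPT-3: {gpt3:.2e} FLOPs")
print(f"PaLM:  {palm:.2e} FLOPs")
```

This gives roughly 3.15 × 10²³ FLOPs for GPT-3 and 2.53 × 10²⁴ for PaLM, close to the listed values. The estimate ignores attention cost and hardware utilisation, so it is best treated as an order-of-magnitude check.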
How LLMs Differ from Earlier NLP
Pre-deep-learning NLP:
n-gram models, TF-IDF, rule-based systems
Task-specific, brittle, no generalisation
Deep learning NLP (pre-transformer):
RNNs, LSTMs — sequential, slow, limited context
Word2Vec — fixed embeddings, no contextualisation
Task-specific models for each task
Early transformers (2017-2019):
BERT — task-specific fine-tuning required
GPT-2: zero-shot on some tasks, but limited capability
Modern LLMs (GPT-3+):
In-context learning — solve new tasks with examples in the prompt
No task-specific training required for many tasks
Emergent capabilities: abilities that appear only at scale
Emergence
Some capabilities of LLMs appear abruptly as scale increases: they are absent from smaller models and are not explicitly trained for.
Emergent abilities include:
- Multi-step arithmetic
- Analogy reasoning
- Code generation from docstrings
- Translation without explicit translation training
- Chain-of-thought reasoning
Debate exists on whether emergence is real or an artifact of evaluation metrics, but empirically, many capabilities improve non-linearly with scale.
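One version of the metric-artifact argument can be sketched numerically. Assume, hypothetically, that per-token accuracy p improves smoothly with scale and that a task counts as solved only when all k answer tokens are correct (exact match). Treating token errors as independent, which is a simplification, the exact-match score is p^k: it stays near zero and then climbs sharply even though p changes steadily. The accuracy values below are made up for illustration.

```python
def exact_match_rate(per_token_accuracy, answer_length):
    """Probability that every token of the answer is correct, assuming
    (as a simplification) independent per-token errors."""
    return per_token_accuracy ** answer_length

# Hypothetical per-token accuracies for successively larger models.
per_token = [0.50, 0.60, 0.70, 0.80, 0.90, 0.95]

for p in per_token:
    em = exact_match_rate(p, answer_length=10)
    print(f"per-token accuracy {p:.2f} -> exact match on 10 tokens {em:.3f}")
```

Under these assumptions the same underlying improvement looks gradual on a per-token metric and abrupt on an all-or-nothing one, which is why the choice of metric matters when judging whether a capability is emergent.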
What LLMs Cannot Do
LLMs are NOT:
- Knowledge bases with verified facts (they hallucinate)
- Calculators (arithmetic beyond simple cases is unreliable)
- Reasoning engines with guaranteed logical consistency
- Real-time systems (training data has a cutoff)
- Agents that act in the world (without tools/plugins)
LLMs are probabilistic text predictors trained to be fluent and coherent, not truth-seeking systems.
Interview Answer
"A large language model is a transformer neural network trained to model P(next token | past tokens) on massive text corpora. 'Large' refers to parameter count (billions to trillions), training tokens (hundreds of billions to trillions), and compute (on the order of 10²³ to 10²⁴ training FLOPs). The key advance over earlier NLP is scale-driven generalisation: modern LLMs solve new tasks from a few examples in the prompt without task-specific training. Everything they do (generation, QA, summarisation, code) flows from the same underlying distribution over token sequences."