What Is a Large Language Model?

Definition

A Large Language Model (LLM) is a neural network, specifically a transformer, trained on massive text corpora to model the probability distribution over sequences of tokens. Given the tokens seen so far, the model assigns a probability to every candidate next token in its vocabulary.

Core task: P(token_t | token_0, token_1, ..., token_{t-1})

"The cat sat on the ___"
  → P("mat") = 0.31
  → P("floor") = 0.18
  → P("roof") = 0.07
  → ...
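
In practice the model produces one raw score (a logit) per vocabulary entry, and a softmax turns those scores into probabilities like the ones above. A minimal sketch, assuming a made-up four-word vocabulary and made-up logit values (neither comes from a real model):

```python
import math

def softmax(logits):
    """Convert raw scores into a probability distribution that sums to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical vocabulary and logits for the context "The cat sat on the"
vocab = ["mat", "floor", "roof", "moon"]
logits = [2.0, 1.5, 0.5, -1.0]

probs = softmax(logits)
next_token_distribution = dict(zip(vocab, probs))
most_likely = max(next_token_distribution, key=next_token_distribution.get)
```

A real vocabulary has tens of thousands of entries, but the step from scores to probabilities is the same.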

Everything an LLM does — generation, question answering, summarisation, code completion — is implemented through this single conditional distribution.
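
Generation reuses that distribution in a loop: sample a token, append it to the context, and query the distribution again. The sketch below stands in for a real LLM with a hand-written lookup table (`TOY_MODEL`, hypothetical) that conditions only on the last token, whereas a transformer conditions on the whole context:

```python
import random

# Hypothetical toy "model": next-token probabilities given only the last token.
TOY_MODEL = {
    "<s>": {"the": 1.0},
    "the": {"cat": 0.6, "mat": 0.4},
    "cat": {"sat": 1.0},
    "sat": {"on": 1.0},
    "on": {"the": 1.0},
    "mat": {"</s>": 1.0},
}

def next_token_distribution(context):
    """Return P(next token | context); this toy looks at the last token only."""
    return TOY_MODEL[context[-1]]

def generate(max_tokens=20, seed=0):
    """Autoregressive sampling: draw a token, append it, repeat."""
    rng = random.Random(seed)
    context = ["<s>"]
    for _ in range(max_tokens):
        dist = next_token_distribution(context)
        tokens, weights = zip(*dist.items())
        token = rng.choices(tokens, weights=weights, k=1)[0]
        if token == "</s>":  # end-of-sequence marker stops generation
            break
        context.append(token)
    return context[1:]  # drop the start marker

sample = generate()
```

Question answering, summarisation and code completion use the same loop; only the starting context differs.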

What "Large" Means

Scale across three dimensions:

Parameters:
  GPT-2 (2019):   1.5B
  GPT-3 (2020):   175B
  PaLM (2022):    540B
  GPT-4 (2023):   ~1T (unconfirmed estimate; not officially disclosed)
  LLaMA 3 (2024): 8B–405B

Training data:
  GPT-3:    300B tokens
  LLaMA 2:  2T tokens
  LLaMA 3:  15T tokens
  GPT-4:    ~13T tokens (unconfirmed estimate; not officially disclosed)

Compute (FLOPs):
  GPT-3:  3.14 × 10²³ FLOPs
  PaLM:   2.5 × 10²⁴ FLOPs
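
The three dimensions are linked. A widely used rule of thumb for dense transformers estimates training compute as roughly 6 FLOPs per parameter per training token, C ≈ 6 · N · D. It is an approximation (it ignores attention cost, among other things) and not necessarily how the figures above were derived, but applied to the GPT-3 numbers listed above it lands close to the quoted value:

```python
def estimate_training_flops(n_params, n_tokens):
    """Rule-of-thumb training compute: ~6 FLOPs per parameter per token."""
    return 6 * n_params * n_tokens

# GPT-3 figures from the lists above: 175B parameters, 300B training tokens.
gpt3_flops = estimate_training_flops(175e9, 300e9)
print(f"{gpt3_flops:.2e}")  # 3.15e+23
```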

"Large" is relative: what counted as very large in 2020 (GPT-3, 175B) has since been exceeded in parameters, training data, and compute.

How LLMs Differ from Earlier NLP

Pre-deep-learning NLP:
  n-gram models, TF-IDF, rule-based systems
  Task-specific, brittle, poor generalisation to new domains

Deep learning NLP (pre-transformer):
  RNNs, LSTMs — sequential, slow, limited context
  Word2Vec — fixed embeddings, no contextualisation
  Task-specific models for each task

Early transformers (2017-2019):
  BERT: task-specific fine-tuning required
  GPT-2: zero-shot task transfer, but limited capability

Modern LLMs (GPT-3+):
  In-context learning — solve new tasks with examples in the prompt
  No task-specific training required for many tasks
  Emergent capabilities — abilities that appear only at scale
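
In-context learning amounts to placing the task demonstration in the input text itself. A minimal sketch of building such a prompt; the `Input:`/`Output:` layout and the helper name `build_few_shot_prompt` are illustrative choices, not a standard format:

```python
def build_few_shot_prompt(instruction, examples, query):
    """Assemble a prompt: instruction, worked examples, then the new input."""
    lines = [instruction, ""]
    for source, target in examples:
        lines.append(f"Input: {source}")
        lines.append(f"Output: {target}")
        lines.append("")
    lines.append(f"Input: {query}")
    lines.append("Output:")  # the model is expected to continue from here
    return "\n".join(lines)

prompt = build_few_shot_prompt(
    "Translate English to French.",
    [("cheese", "fromage"), ("house", "maison")],
    "book",
)
```

The model's weights are unchanged throughout; the task is specified entirely by the prompt.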

Emergence

Some capabilities of LLMs appear abruptly as scale increases — they are not present in smaller models and not explicitly trained for:

Emergent abilities include:
  - Multi-step arithmetic
  - Analogical reasoning
  - Code generation from docstrings
  - Translation without explicit translation training
  - Chain-of-thought reasoning

Whether emergence is a real phenomenon or an artifact of evaluation metrics (for example, all-or-nothing scoring such as exact match) is debated. Empirically, though, measured performance on many tasks improves non-linearly with scale.
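
One way to see the metric argument: suppose per-token accuracy p improves smoothly with scale, and a task is scored by exact match on an answer of k tokens. If token errors are treated as independent (a simplifying assumption), exact-match accuracy is p^k, which stays near zero for a long stretch and then climbs steeply. The values of p and k below are illustrative, not measurements:

```python
def exact_match_accuracy(per_token_accuracy, answer_length):
    """P(all tokens correct), assuming independent per-token errors."""
    return per_token_accuracy ** answer_length

# Hypothetical smooth improvement in per-token accuracy across model sizes.
per_token = [0.50, 0.60, 0.70, 0.80, 0.90, 0.95, 0.99]
k = 10  # hypothetical answer length in tokens

for p in per_token:
    print(f"per-token {p:.2f} -> exact match {exact_match_accuracy(p, k):.3f}")
```

The underlying skill improves gradually in this toy setup, yet the exact-match score looks like a sudden jump.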

What LLMs Cannot Do

LLMs are NOT:
  - Knowledge bases with verified facts (they hallucinate)
  - Calculators (arithmetic beyond simple cases is unreliable)
  - Reasoning engines with guaranteed logical consistency
  - Up-to-date information sources (training data has a cutoff)
  - Agents that act in the world (without tools/plugins)

LLMs are probabilistic text predictors trained to be fluent and coherent —
not truth-seeking systems.

Interview Answer

"A large language model is a transformer neural network trained to model P(next token | past tokens) on massive text corpora. 'Large' refers to parameter count (billions to trillions), training tokens (hundreds of billions to trillions), and training compute (10²³ FLOPs and beyond). The key advance over earlier NLP is scale-driven generalisation: modern LLMs solve new tasks from a few examples in the prompt without task-specific training. Everything they do (generation, QA, summarisation, code) flows from the same underlying distribution over token sequences."