Emergent Capabilities in Large Language Models

What Is Emergence?

Emergent capabilities are abilities that are absent in small models and appear suddenly as model scale increases — without explicit training on those capabilities. The term comes from complexity theory: the system exhibits properties that its components don't have individually.

Key property: Emergence is qualitatively discontinuous. A model goes from near-zero or chance-level accuracy to useful accuracy on a task within a relatively small scaling step, rather than improving gradually by 1-2% per doubling.

Examples documented in the literature:

Chain-of-thought reasoning (Wei et al., 2022): reasoning through multi-step problems emerges around 100B+ parameters
In-context learning from examples: the ability to learn a new task from a few demonstrations
Arithmetic on large numbers: correct multi-digit arithmetic appears around 8B parameters
Analogical reasoning: completing complex analogical patterns (A:B::C:?) with near-human accuracy
BIG-Bench tasks: many tasks stay at near-zero or chance-level accuracy until a scale threshold, then jump to useful accuracy

Documented Emergent Capabilities

Python

# Illustrative data inspired by Wei et al. (2022) "Emergent Abilities of Large Language Models"
# Scales and accuracies are approximate, not exact figures from the paper
EMERGENT_CAPABILITIES = [
    {
        "capability": "3-digit arithmetic",
        "emergence_scale": "8B params",
        "prior_accuracy": "~0%",
        "post_accuracy": "~60%",
        "description": "Correctly computing 3-digit × 3-digit multiplication",
    },
    {
        "capability": "Chain-of-thought (zero-shot)",
        "emergence_scale": "100B+ params",
        "prior_accuracy": "~0%",
        "post_accuracy": "~60%+",
        "description": "Generating accurate step-by-step reasoning without examples",
    },
    {
        "capability": "In-context few-shot learning",
        "emergence_scale": "13B params",
        "prior_accuracy": "random",
        "post_accuracy": "above baseline",
        "description": "Learning new classification tasks from examples in the prompt",
    },
    {
        "capability": "Truthfulness calibration",
        "emergence_scale": "30B+ params",
        "prior_accuracy": "overconfident",
        "post_accuracy": "better calibrated",
        "description": "Expressing 'I don't know' appropriately for uncertain facts",
    },
    {
        "capability": "BIG-Bench reasoning tasks",
        "emergence_scale": "Varies per task",
        "prior_accuracy": "below random",
        "post_accuracy": "human-competitive",
        "description": "Many abstract reasoning tasks with near-zero accuracy until scale",
    },
]

for cap in EMERGENT_CAPABILITIES:
    print(f"{cap['capability']:45} Emerges at: {cap['emergence_scale']}")

Why Emergence Happens: Proposed Mechanisms

Hypothesis 1: Task decomposition threshold

Some tasks require composing N sub-skills. Each sub-skill improves gradually with scale. But the full task only works when all sub-skills exceed a threshold simultaneously:

Python

import numpy as np
import matplotlib.pyplot as plt

def simulate_compositional_task(n_subskills: int, model_scale: float, seed: int = 0) -> float:
    """
    Simulate task accuracy when all N sub-skills must work.
    Each sub-skill accuracy increases smoothly with scale.
    Emergence occurs when all sub-skills cross a threshold simultaneously.
    """
    # Each sub-skill has a slightly different, fixed scale threshold.
    # A seeded generator keeps the thresholds identical across calls,
    # so the curve over model_scale is consistent rather than noisy.
    rng = np.random.default_rng(seed)
    skill_thresholds = rng.uniform(0.5, 2.0, n_subskills)

    # Individual skill accuracy (logistic curve vs scale)
    skill_accuracies = 1 / (1 + np.exp(-2 * (model_scale - skill_thresholds)))

    # Task requires ALL skills to work (multiplicative)
    task_accuracy = np.prod(skill_accuracies)

    return float(task_accuracy)

# Task accuracy rises much more sharply than any single skill does
scales = np.linspace(0, 3, 100)
task_accuracies_2 = [simulate_compositional_task(2, s) for s in scales]
task_accuracies_5 = [simulate_compositional_task(5, s) for s in scales]

# With more sub-skills: sharper emergence, later threshold
plt.plot(scales, task_accuracies_2, label="2 sub-skills")
plt.plot(scales, task_accuracies_5, label="5 sub-skills")
plt.xlabel("Model scale (arbitrary units)")
plt.ylabel("Task accuracy")
plt.legend()
plt.show()

Hypothesis 2: Metric discontinuity (Schaeffer et al., 2023)

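Many emergence results may be an artifact of nonlinear or discontinuous metrics such as exact match or pass/fail. Suppose per-token accuracy improves smoothly from 80% to 95%. If token errors are independent, exact-match accuracy on a 10-token answer goes from about 11% to about 60%, and the jump is steeper for longer answers. The model did not change discontinuously; the all-or-nothing metric amplified a smooth improvement.

When the same models are scored with smooth metrics (such as token edit distance or token log-probabilities), many "emergent" abilities turn out to improve gradually.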
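
The effect can be sketched in a few lines. This is a simplification, not the paper's analysis: it assumes each token of an answer is correct independently with the same probability, so exact-match accuracy is the per-token accuracy raised to the answer length. The function name and the answer length of 10 are illustrative choices.

```python
def exact_match_accuracy(per_token_accuracy: float, answer_length: int) -> float:
    """Probability that every token is right, assuming independent token errors."""
    return per_token_accuracy ** answer_length

ANSWER_LENGTH = 10  # hypothetical answer length in tokens

# Per-token accuracy rises smoothly; exact match stays low, then climbs steeply
for p in [0.5, 0.6, 0.7, 0.8, 0.9, 0.95, 0.99]:
    em = exact_match_accuracy(p, ANSWER_LENGTH)
    print(f"per-token accuracy {p:.2f} -> exact match {em:.3f}")
```

The same underlying improvement looks gradual under the per-token metric and abrupt under exact match, which is the core of the metric-discontinuity argument.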

In-Context Learning: The Paradigmatic Emergent Ability

In-context learning (ICL) — adapting to a task from a few examples in the prompt — is one of the most important emergent capabilities for practical use:

Python

from openai import OpenAI

client = OpenAI()

# Zero-shot: no examples
zero_shot_prompt = """Classify the drug interaction severity as major/moderate/minor:
warfarin + clarithromycin → """

# Few-shot: 3 examples in context
few_shot_prompt = """Classify the drug interaction severity as major/moderate/minor:

warfarin + aspirin → major
metformin + ibuprofen → minor
simvastatin + clarithromycin → major

warfarin + clarithromycin → """

def compare_icl(prompt: str) -> str:
    return client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=10,
        temperature=0,
    ).choices[0].message.content.strip()

print("Zero-shot:", compare_icl(zero_shot_prompt))
print("Few-shot:", compare_icl(few_shot_prompt))
# GPT-3 (175B): few-shot substantially better than zero-shot
# GPT-4: both already good, gap narrows (stronger prior from pretraining)

Why ICL is remarkable: The model isn't being trained on these examples. It's modifying its behavior purely from context, using its pre-learned ability to detect patterns and generalize from demonstrations. Small models don't do this reliably — they need to be explicitly fine-tuned.

ICL mechanistic understanding:

Induction heads: attention-head circuits that find an earlier occurrence of the current token and copy the token that followed it
Task vectors: recent work shows that a task can be encoded as a vector added to the residual stream, which steers the model toward that task
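
The induction rule can be illustrated with a toy, token-level function. This is only a sketch of the behavior: real induction heads work through attention over learned representations, not exact string matching, and the function name here is hypothetical.

```python
from typing import Optional

def induction_predict(tokens: list) -> Optional[str]:
    """Predict the next token by copying what followed the most recent
    earlier occurrence of the current (last) token."""
    current = tokens[-1]
    # Scan earlier positions from right to left, skipping the last token itself
    for i in range(len(tokens) - 2, -1, -1):
        if tokens[i] == current:
            return tokens[i + 1]
    return None  # no earlier occurrence, so the rule makes no prediction

# The sequence "A B C D" starts to repeat; the rule continues the pattern
sequence = ["A", "B", "C", "D", "A", "B", "C"]
print(induction_predict(sequence))  # prints D
```

A few-shot prompt has the same shape: the demonstrations establish a pattern, and the final incomplete line is completed by following it.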

Emergent Risks

Not all emergence is beneficial. Some capabilities and behaviors emerge at scale without being intended:

Deceptive alignment (theoretical): A model that behaves aligned during training/evaluation but pursues different goals during deployment. This would be an emergent property of sufficiently capable models that can model their own training process.

Grokking: Observed in small models, potentially relevant at scale. A model trained far past apparent convergence suddenly learns to generalize: it first appears to have memorized the training set, then abruptly generalizes (first reported by Power et al., 2022; analyzed mechanistically by Nanda et al., 2023).

Python

# Grokking experiment: modular arithmetic
# Models suddenly learn perfect generalization after huge amounts of training
# despite the loss already being low (memorization plateau → then generalization)

def create_modular_arithmetic_dataset(mod: int = 97) -> list:
    """(a + b) mod p task that demonstrates grokking in transformers."""
    data = []
    for a in range(mod):
        for b in range(mod):
            result = (a + b) % mod
            data.append({"input": f"{a} + {b} = ?", "output": str(result)})
    return data

# Train a small transformer on a subset of these pairs and hold out the rest:
# Steps 0-1000: train loss falls, val loss stays high (memorization begins)
# Steps 1000-5000: train loss low, val loss still high (pure memorization)
# Steps 5000+: val loss suddenly falls (grokking: model finds the algorithm)
# Step counts are illustrative; timing depends on model size, data fraction, and weight decay
# Proposed mechanism: weight decay (L2 regularization) gradually favors the simpler, generalizing solution
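
Grokking only shows up if part of the table is held out, since validation loss is measured on pairs the model never trained on. A minimal sketch of such a split follows; the function name, the 30% training fraction, and the seed are assumptions for illustration, not values from the cited papers.

```python
import random

def split_modular_dataset(mod: int = 97, train_fraction: float = 0.3, seed: int = 0):
    """Build all (a, b, (a + b) mod p) triples and split them at random."""
    triples = [(a, b, (a + b) % mod) for a in range(mod) for b in range(mod)]
    rng = random.Random(seed)  # seeded so the split is reproducible
    rng.shuffle(triples)
    n_train = int(len(triples) * train_fraction)
    return triples[:n_train], triples[n_train:]

train_triples, val_triples = split_modular_dataset()
print(len(train_triples), len(val_triples))
```

Because the validation pairs never appear in training, a drop in validation loss means the model has learned the addition rule rather than a lookup table.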

Sycophancy at scale: Models learn from RLHF to agree with users even when the user is wrong. This behavior emerges from training on human preferences, where humans often rate agreeable responses higher.

Predicting Emergence: Can We?

The frontier question: can we predict which capabilities will emerge before they do?

Python

def emergence_prediction_framework(capability: str, model_family: str) -> dict:
    """
    Heuristic framework for predicting emergence (not a precise tool).
    Based on the task decomposition hypothesis.
    """

    # Factors that predict earlier emergence:
    early_signals = [
        "Task can be decomposed into sub-skills already in smaller models",
        "Similar tasks have been observed to emerge at smaller scale",
        "Weak but above-chance performance is observed in current models",
        "The capability requires information already in pretraining data",
    ]

    # Factors that predict later or no emergence:
    late_signals = [
        "Task requires integrating many independent knowledge areas simultaneously",
        "Performance requires extremely precise internal representations",
        "Task type was absent from pretraining data",
        "Success requires multi-step physical world reasoning",
    ]

    return {
        "capability": capability,
        "model_family": model_family,
        "prediction_confidence": "low",  # We're bad at this
        "framework": "task_decomposition",
        "early_emergence_indicators": early_signals,
        "late_emergence_indicators": late_signals,
        "honest_assessment": "Emergence prediction remains unreliable. Empirical scaling experiments with smaller models are the most reliable signal.",
    }

# The honest answer: we don't have a reliable theoretical framework.
# Current practice: run scaling experiments and extrapolate loss curves.
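
Extrapolating a loss curve usually means fitting a power law in log-log space. The sketch below uses NumPy and made-up (compute, loss) measurements; the data points and the function name predict_loss are hypothetical, chosen only to show the fitting step.

```python
import numpy as np

# Hypothetical (compute, loss) measurements from small-scale training runs
compute = np.array([1e18, 1e19, 1e20, 1e21])
loss = np.array([3.9, 3.3, 2.8, 2.4])

# Fit log(loss) = intercept + slope * log(compute), i.e. loss = a * compute**slope
slope, intercept = np.polyfit(np.log(compute), np.log(loss), deg=1)

def predict_loss(c: float) -> float:
    """Extrapolate the fitted power law to a new compute budget."""
    return float(np.exp(intercept + slope * np.log(c)))

print(f"fitted exponent: {slope:.3f}")
print(f"predicted loss at 1e22 FLOPs: {predict_loss(1e22):.2f}")
```

A smooth loss extrapolation like this does not by itself say when a downstream task will cross a usable accuracy threshold, which is why emergence prediction stays unreliable.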

Implications for Production AI Systems

1. Plan for capability jumps when upgrading models:

Python

EVAL_CATEGORIES_TO_MONITOR = [
    "reasoning_multi_step",
    "instruction_following_complex",
    "refusal_boundary_cases",
    "factual_accuracy_specialized",
    "format_compliance",
]

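def run_evaluation(model: str, eval_case: dict) -> dict:
    """Placeholder: connect this to your own eval harness. Must return {"score": float}."""
    raise NotImplementedError

def regression_check_on_upgrade(old_model: str, new_model: str, eval_suite: list) -> dict: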
    """
    When upgrading from one model to a more capable one,
    check for both improvements AND regressions.
    Emergent capabilities in the new model may break old assumptions.
    """
    results = {"improvements": [], "regressions": [], "unchanged": []}

    for eval_case in eval_suite:
        old_result = run_evaluation(old_model, eval_case)
        new_result = run_evaluation(new_model, eval_case)

        if new_result["score"] > old_result["score"] + 0.05:
            results["improvements"].append(eval_case["id"])
        elif new_result["score"] < old_result["score"] - 0.05:
            results["regressions"].append(eval_case["id"])
        else:
            results["unchanged"].append(eval_case["id"])

    return results

2. Don't assume capability plateaus: A task your current model fails consistently may become achievable in the next generation. Re-evaluate capabilities after major model upgrades.

3. Safety implications: An AI system that's safe at current capability levels may not be safe at higher capability levels. Emergent capabilities require ongoing safety evaluation, not a one-time assessment.
