Pre-training Data: What LLMs Learn From

A deep dive into Common Crawl, Books, GitHub, and Wikipedia — data mixing ratios, deduplication, quality filtering, and the data poisoning threat.

Asma Hafeez Khan | May 15, 2026 | 10 min read
Tags: LLMs, Pre-training, Data, Common Crawl, Tokenization, Data Quality

The quality and composition of pre-training data is arguably the most important factor in determining what an LLM can and cannot do. Architecture matters. Compute matters. But data is the substrate on which everything else is built. This article examines where LLM training data comes from, how it's processed, and why data decisions have lasting consequences.


1. The Major Data Sources

Common Crawl

Common Crawl is a nonprofit organization that has been crawling the web since 2008. Their data corpus contains petabytes of raw web pages, updated monthly. It is the single largest source of text for most LLMs.

  • Scale: Roughly 250 billion pages crawled, hundreds of terabytes compressed
  • Format: WARC (Web ARChive) files containing HTTP headers + page content
  • Languages: Primarily English (about 45%) but covers 100+ languages
  • Quality: Wildly variable — spam, SEO content, adult material, boilerplate, and high-quality articles all mixed together

Raw Common Crawl is unusable without significant filtering. Models trained on unfiltered CC show degraded performance and problematic content.
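
To make the WARC format concrete, here is a minimal sketch of a record reader that uses only the standard library. The helper names `iter_warc_records` and `split_http_response` are hypothetical, and the sketch assumes an uncompressed stream with exact-case header names. Real Common Crawl files are gzip-compressed, and production pipelines generally use a dedicated WARC library instead.

```python
import io

def iter_warc_records(stream):
    """Yield (headers, payload) for each record in an uncompressed WARC stream."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of stream
        if not line.strip():
            continue  # blank separator lines between records
        if not line.startswith(b"WARC/"):
            raise ValueError(f"Expected a WARC version line, got {line!r}")

        # Header lines run until the first empty line
        headers = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break
            name, _, value = line.decode("utf-8").partition(":")
            headers[name.strip()] = value.strip()

        # The record body is exactly Content-Length bytes long
        payload = stream.read(int(headers["Content-Length"]))
        yield headers, payload

def split_http_response(payload):
    """Split a 'response' record body into HTTP header text and page bytes."""
    head, _, body = payload.partition(b"\r\n\r\n")
    return head.decode("latin-1"), body

# Demo: two synthetic records built in memory (not real crawl data)
http = b"HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n<html>hi</html>"
record = (
    b"WARC/1.0\r\n"
    b"WARC-Type: response\r\n"
    b"WARC-Target-URI: http://example.com/\r\n"
    b"Content-Length: " + str(len(http)).encode() + b"\r\n"
    b"\r\n" + http + b"\r\n\r\n"
)
records = list(iter_warc_records(io.BytesIO(record + record)))
```

Each response record carries the HTTP headers and the page content together, which is why the sketch splits them in a second step before any text extraction would happen.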

Books (BookCorpus, Pile of Books)

GPT-1 used BookCorpus — 11,000 self-published books from Smashwords. Books provide:

  • Long-form coherent text (entire chapters, not snippets)
  • Diverse vocabulary and writing styles
  • Strong syntactic structure and narrative coherence

The Pile (EleutherAI) included "Books3" — a larger collection. Copyright disputes around books corpora have become a significant legal issue for AI companies.

GitHub / Code

Code training data provides structured, executable text with explicit logical relationships:

  • Function definitions with docstrings explain intent
  • Variable names encode semantics
  • Tests describe expected behavior
  • Comments explain "why," not just "what"

Code training also appears to improve reasoning beyond coding tasks. Several studies suggest that models trained on code reason better across domains, likely because code requires precise, step-by-step logic.

Wikipedia

Wikipedia offers high-quality encyclopedic text across millions of topics:

  • Well-cited factual content
  • Consistent structure (infoboxes, sections)
  • Available in 300+ languages
  • Regularly updated and curated

Wikipedia is small relative to the web (about 4GB for English), but its quality lets it punch above its weight in knowledge density.

Academic Papers (arXiv, PubMed, Semantic Scholar)

Scientific text trains models to reason about evidence, understand domain-specific terminology, and engage with technical content. arXiv contains over 2 million papers spanning physics, math, CS, and biology.


2. Data Mixing Ratios

Different data sources contribute different capabilities. The proportions matter enormously.

The Llama 2 Data Mix (Meta, 2023)

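Llama 2 was trained on approximately 2 trillion tokens. Meta did not publish an exact per-source breakdown, so the proportions below are rough estimates rather than official figures: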

  • Web data: about 65%
  • Code: about 8%
  • Books: about 7%
  • Conversational data: about 5%
  • Other (Wikipedia, academic): remainder
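
In practice, a mix like this is usually applied by sampling the source of each training document according to its weight. The sketch below is one simple way to do that. The source names, the 15% "other" remainder, and the `sample_sources` helper are assumptions made for illustration, not a description of any specific training pipeline.

```python
import random

# Illustrative weights following the approximate proportions listed above;
# "other" is assumed to be the 15% remainder.
MIX_WEIGHTS = {
    "web": 0.65,
    "code": 0.08,
    "books": 0.07,
    "conversational": 0.05,
    "other": 0.15,
}

def sample_sources(weights, n, seed=0):
    """Pick the source of each of n training documents according to the mix."""
    rng = random.Random(seed)
    names = list(weights)
    total = sum(weights.values())
    # Normalize so the weights do not have to sum to exactly 1
    probs = [weights[name] / total for name in names]
    return rng.choices(names, weights=probs, k=n)

draws = sample_sources(MIX_WEIGHTS, 10_000)
observed = {name: draws.count(name) / len(draws) for name in MIX_WEIGHTS}
```

Over many draws the observed fractions converge to the configured weights, regardless of how large each underlying source is.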

The Pile (EleutherAI)

The Pile was designed with explicit upsampling of high-quality domains:

Python
# Approximate Pile data mix weights (illustrative, not exact;
# a few smaller components are omitted, so the weights do not sum to 1)
PILE_MIX = {
    "pile-cc":          0.1822,  # Common Crawl filtered
    "pubmed-central":   0.1465,
    "books3":           0.1243,
    "openwebtext2":     0.1006,
    "arxiv":            0.0820,
    "github":           0.0721,
    "freelaw":          0.0529,
    "stackexchange":    0.0487,
    "uspto":            0.0374,
    "pubmed-abstracts": 0.0263,
    "gutenberg":        0.0214,
    "dm-mathematics":   0.0199,
    "ubuntu-irc":       0.0150,
    "hn":               0.0147,
    "europarl":         0.0112,
    "phil-papers":      0.0097,
    "nih-exporter":     0.0086,
    "enron-emails":     0.0082,
}

Why Upsampling Matters

If you sample in proportion to raw data size, web scrapes dominate and a source as small as Wikipedia ends up as a tiny fraction of a percent of the mix. Upsampling it well beyond that share still improves factual accuracy because:

  1. It's far higher quality per token
  2. The model sees it enough times to memorize key facts
  3. It provides clean reference anchors for web-text patterns
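
The effect of upsampling is easiest to see as the number of passes (epochs) the model makes over each source. A source is seen roughly `weight * total_training_tokens / source_tokens` times. The sketch below works through that arithmetic. The source sizes and the 5% Wikipedia weight are hypothetical round numbers chosen only for illustration.

```python
def epochs_per_source(source_tokens, mix_weights, total_training_tokens):
    """Return how many times each source is seen under a given sampling mix."""
    total_weight = sum(mix_weights.values())
    return {
        name: (mix_weights[name] / total_weight) * total_training_tokens / source_tokens[name]
        for name in source_tokens
    }

# Hypothetical sizes: a 1T-token web scrape and a 4B-token encyclopedia
SOURCE_TOKENS = {"web": 1_000_000_000_000, "wikipedia": 4_000_000_000}
TOTAL_TRAINING_TOKENS = 1_000_000_000_000

# Proportional sampling: weights equal to raw size
proportional = dict(SOURCE_TOKENS)
# Upsampled mix: Wikipedia gets 5% of the training tokens (assumed value)
upsampled = {"web": 0.95, "wikipedia": 0.05}

prop_epochs = epochs_per_source(SOURCE_TOKENS, proportional, TOTAL_TRAINING_TOKENS)
up_epochs = epochs_per_source(SOURCE_TOKENS, upsampled, TOTAL_TRAINING_TOKENS)
```

Under proportional sampling both sources are seen just under once. With the assumed 5% weight, the small source is repeated about 12.5 times while the web data is still seen a little less than once.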

3. Deduplication

Duplicate text pushes models to memorize rather than generalize. The more often a document is repeated in the training set, the more likely the model is to reproduce it verbatim.

Types of Deduplication

Exact deduplication: Remove documents with identical checksums (MD5/SHA). Fast but misses near-duplicates.
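
A minimal sketch of exact deduplication follows. It hashes each document and keeps the first occurrence of each digest. Collapsing whitespace before hashing is an assumption added here so that trivially reformatted copies also match; a strictly exact pass would hash the raw bytes.

```python
import hashlib

def exact_dedup(documents):
    """Keep the first occurrence of each document, compared by SHA-256 digest."""
    seen = set()
    unique = []
    for doc in documents:
        # Assumption: collapse runs of whitespace so formatting-only differences match
        normalized = " ".join(doc.split())
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique

docs = ["Hello   world", "Hello world", "Hello, world"]
deduped = exact_dedup(docs)
```

The third document differs by a single comma and survives, which is exactly the gap that near-deduplication is meant to close.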

Near-deduplication with MinHash LSH: Locality-sensitive hashing approximates Jaccard similarity between documents. Documents above a similarity threshold (e.g., 0.8) are deduplicated.

Python
from datasketch import MinHash, MinHashLSH
import re

def get_shingles(text: str, k: int = 5) -> set:
    """Create k-character shingles from text"""
    text = re.sub(r'\s+', ' ', text.lower().strip())
    return {text[i:i+k] for i in range(len(text) - k + 1)}

def build_minhash(shingles: set, num_perm: int = 128) -> MinHash:
    m = MinHash(num_perm=num_perm)
    for shingle in shingles:
        m.update(shingle.encode('utf-8'))
    return m

def deduplicate_corpus(documents: list[str], threshold: float = 0.8) -> list[str]:
    lsh = MinHashLSH(threshold=threshold, num_perm=128)
    unique_docs = []

    for i, doc in enumerate(documents):
        shingles = get_shingles(doc)
        if len(shingles) < 10:
            unique_docs.append(doc)
            continue

        m = build_minhash(shingles)
        results = lsh.query(m)

        if not results:  # No near-duplicates found
            lsh.insert(f"doc_{i}", m)
            unique_docs.append(doc)
        # else: skip as near-duplicate

    return unique_docs

# Example
docs = [
    "The quick brown fox jumps over the lazy dog",
    "The quick brown fox jumped over the lazy dog",  # near-duplicate
    "Machine learning is transforming every industry",
]
unique = deduplicate_corpus(docs, threshold=0.7)
print(f"Reduced {len(docs)} docs to {len(unique)} unique docs")
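
For intuition about what MinHash approximates, the exact Jaccard similarity between two shingle sets can be computed directly on small inputs. This is a sketch with hypothetical helper names (`char_shingles`, `jaccard`). It is quadratic in the number of documents when applied pairwise, which is why large pipelines use the hashed approximation instead.

```python
def char_shingles(text, k=5):
    """Return the set of k-character shingles of whitespace-normalized, lowercased text."""
    text = " ".join(text.lower().split())
    return {text[i:i + k] for i in range(len(text) - k + 1)}

def jaccard(a, b):
    """Exact Jaccard similarity: size of the intersection over size of the union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

s1 = char_shingles("The quick brown fox jumps over the lazy dog")
s2 = char_shingles("The quick brown fox jumped over the lazy dog")
s3 = char_shingles("Machine learning is transforming every industry")

near = jaccard(s1, s2)  # the two fox sentences share most of their shingles
far = jaccard(s1, s3)   # unrelated sentences share few or none
```

A one-word edit changes only the handful of shingles that overlap the edited characters, so near-duplicates keep a high score while unrelated text scores close to zero.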

Deduplication at Scale

For trillion-token datasets, deduplication is a distributed systems problem. The C4 dataset (used for T5) discarded all but one occurrence of any three-sentence span that appeared more than once. The LLaMA team relied on the CCNet pipeline, which deduplicates at the line level by hashing normalized lines (CCNet uses SHA-1).
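
Line-level deduplication can be sketched on a single machine as follows. The normalization (strip and lowercase) and the choice of a truncated SHA-1 digest are assumptions for this example; any stable hash would serve. At scale, the set of seen hashes would be partitioned across workers instead of held in one process.

```python
import hashlib

def dedup_lines(documents):
    """Drop any line whose normalized form has already appeared anywhere in the corpus."""
    seen = set()
    cleaned = []
    for doc in documents:
        kept = []
        for line in doc.splitlines():
            key = line.strip().lower()
            if not key:
                continue  # skip empty lines
            # Truncating the digest to 8 bytes keeps the seen-set small
            digest = hashlib.sha1(key.encode("utf-8")).digest()[:8]
            if digest in seen:
                continue
            seen.add(digest)
            kept.append(line)
        cleaned.append("\n".join(kept))
    return cleaned

pages = [
    "Accept cookies\nArticle one body",
    "Accept cookies\nArticle two body",
]
cleaned = dedup_lines(pages)
```

Repeated boilerplate such as cookie banners is removed from every page after the first, while the unique body text of each page is kept.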

Impact: Lee et al. (2022) showed that models trained on deduplicated data emit memorized text far less often and reach the same or better accuracy in fewer training steps. Removing duplicates is therefore a cheap way to improve a model without collecting more data.


4. Quality Filtering

Heuristic Filters

Most pipelines apply rule-based filters before expensive ML-based filtering:

Python
import re
from typing import Optional

def heuristic_quality_filter(text: str) -> Optional[str]:
    """
    Returns filtered text if it passes quality checks, else None.
    Based on approaches from C4, Dolma, RedPajama pipelines.
    """
    # Minimum length
    words = text.split()
    if len(words) < 50:
        return None

    # Remove documents with too many URLs
    urls = re.findall(r'https?://\S+', text)
    if len(urls) / max(len(words), 1) > 0.1:
        return None

    # Remove documents where most "words" aren't alphabetic
    alpha_words = [w for w in words if any(c.isalpha() for c in w)]
    if len(alpha_words) / len(words) < 0.7:
        return None

    # Remove boilerplate: repeated lines
    lines = text.split('\n')
    unique_lines = set(lines)
    if len(unique_lines) / max(len(lines), 1) < 0.7:
        return None

    # Check for minimum terminal punctuation
    sentences_end = len(re.findall(r'[.!?]', text))
    if sentences_end < 3:
        return None

    return text

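_LID_MODEL = None  # cached so the model file is loaded only once

def fasttext_language_filter(text: str, target_lang: str = "en", min_conf: float = 0.65,
                             model_path: str = "lid.176.bin") -> bool:
    """
    Use the fastText language identification model.
    Requires the fasttext package and a local copy of lid.176.bin.
    """
    global _LID_MODEL
    if _LID_MODEL is None:
        import fasttext  # imported lazily so the heuristic filter works without it
        _LID_MODEL = fasttext.load_model(model_path)
    # fastText predicts on a single line, so newlines are replaced first
    labels, scores = _LID_MODEL.predict(text.replace('\n', ' '))
    lang = labels[0].replace('__label__', '')
    return lang == target_lang and float(scores[0]) >= min_conf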

Perplexity Filtering

High perplexity text (relative to a reference language model) indicates low-quality, noisy, or non-natural language content.
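
For reference, the perplexity of a token sequence $x_1, \dots, x_N$ under a reference model $p$ is the exponential of the average negative log-likelihood per token (the symbols here are introduced for this note, not taken from a specific paper):

```latex
\mathrm{PPL}(x_1, \dots, x_N)
  = \exp\!\left( -\frac{1}{N} \sum_{i=1}^{N} \log p\left(x_i \mid x_{<i}\right) \right)
```

A model that assigns every token probability $1/k$ has perplexity exactly $k$, so the score can be read as the effective number of choices the model is hesitating between at each step.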

Python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

class PerplexityFilter:
    """
    Uses a small reference LM to score document quality.
    Documents with perplexity above threshold are discarded.
    This is the approach used by CCNet and many subsequent pipelines.
    """
    def __init__(self, model_name: str = "gpt2", max_perplexity: float = 1000.0):
        self.tokenizer = GPT2Tokenizer.from_pretrained(model_name)
        self.model = GPT2LMHeadModel.from_pretrained(model_name)
        self.model.eval()
        self.max_perplexity = max_perplexity

    @torch.no_grad()
    def compute_perplexity(self, text: str) -> float:
        tokens = self.tokenizer.encode(text, return_tensors='pt', truncation=True, max_length=512)
        if tokens.shape[1] < 2:
            return float('inf')

        outputs = self.model(tokens, labels=tokens)
        loss = outputs.loss.item()  # cross-entropy loss = log perplexity
        return torch.exp(torch.tensor(loss)).item()

    def passes_filter(self, text: str) -> bool:
        ppl = self.compute_perplexity(text)
        return ppl <= self.max_perplexity

# CCNet used a KenLM 5-gram model for speed
# GPT-2 is more accurate but far slower
# At scale: use KenLM for candidate filtering, GPT-2 for final ranking

The CCNet Pipeline

CCNet (Wenzek et al., 2019) established the modern quality filtering pipeline for Common Crawl:

  1. Deduplication at paragraph level
  2. Language identification with fastText
  3. Quality scoring with a KenLM language model trained on Wikipedia
  4. Split documents into terciles by perplexity and keep the lowest-perplexity ("head") bucket
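
The tercile step can be sketched with the standard library. This is an illustration under stated assumptions, not CCNet's implementation: the document IDs and perplexity values are made up, and the bucket names follow the head/middle/tail convention, with lower perplexity treated as better.

```python
import statistics

def split_terciles(perplexities):
    """Bucket document IDs into head/middle/tail by perplexity (lower is better)."""
    # quantiles(..., n=3) returns the two cut points between the three buckets
    low_cut, high_cut = statistics.quantiles(perplexities.values(), n=3)
    buckets = {"head": [], "middle": [], "tail": []}
    for doc_id, ppl in perplexities.items():
        if ppl <= low_cut:
            buckets["head"].append(doc_id)
        elif ppl <= high_cut:
            buckets["middle"].append(doc_id)
        else:
            buckets["tail"].append(doc_id)
    return buckets

# Hypothetical perplexity scores for nine documents
scores = {
    "a": 100, "b": 150, "c": 200,
    "d": 400, "e": 800, "f": 1200,
    "g": 2500, "h": 5000, "i": 9000,
}
buckets = split_terciles(scores)
kept = buckets["head"]
```

Using quantile cut points rather than a fixed threshold keeps the kept fraction stable even when the absolute perplexity scale shifts between languages or crawl snapshots.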

5. Tokenization Before Training

Why You Tokenize Before Training (Not During)

Training requires the data to be tokenized and stored as integer IDs. Tokenizing on the fly during training wastes GPU cycles, so for trillion-token datasets the corpus is typically tokenized once and written to fixed-size shards:

Python
from transformers import AutoTokenizer
import numpy as np
from pathlib import Path

def pretokenize_and_shard(
    text_files: list[str],
    tokenizer_name: str,
    output_dir: str,
    shard_size: int = 100_000_000  # 100M tokens per shard
):
    """
    Pre-tokenize a corpus into numpy shards for efficient training.
    This is the approach used by NanoGPT, LLaMA, and others.
    """
    tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)
    if tokenizer.eos_token_id is None:
        raise ValueError("Tokenizer has no EOS token to separate lines")
    # uint16 only holds IDs up to 65,535; larger vocabularies need uint32
    dtype = np.uint16 if len(tokenizer) <= 2**16 else np.uint32
    out_dir = Path(output_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    buffer = []
    shard_idx = 0

    for filepath in text_files:
        with open(filepath, 'r', encoding='utf-8') as f:
            for line in f:
                line = line.strip()
                if not line:
                    continue
                # Tokenize without automatic special tokens, then add EOS explicitly
                ids = tokenizer.encode(line, add_special_tokens=False) + [tokenizer.eos_token_id]
                buffer.extend(ids)

                # Flush every full shard (a very long line can fill more than one)
                while len(buffer) >= shard_size:
                    shard_path = out_dir / f"shard_{shard_idx:05d}.npy"
                    np.save(shard_path, np.array(buffer[:shard_size], dtype=dtype))
                    print(f"Saved shard {shard_idx}")
                    buffer = buffer[shard_size:]
                    shard_idx += 1

    # Save remainder
    if buffer:
        shard_path = out_dir / f"shard_{shard_idx:05d}.npy"
        np.save(shard_path, np.array(buffer, dtype=dtype))
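
On the training side, shards written this way can be memory-mapped and sampled without loading them fully into RAM. The following is a minimal sketch of that read path. The `sample_batch` helper and the synthetic demo shard are assumptions for illustration, and a real data loader would also handle shuffling across shards and multiple workers.

```python
import tempfile
from pathlib import Path

import numpy as np

def sample_batch(shard_path, batch_size, block_size, rng):
    """Draw random (input, target) windows from one pre-tokenized shard."""
    # mmap_mode="r" maps the file instead of reading it all into memory
    tokens = np.load(shard_path, mmap_mode="r")
    # Each window needs block_size inputs plus one extra token for the shifted targets
    starts = rng.integers(0, len(tokens) - block_size, size=batch_size)
    x = np.stack([tokens[s:s + block_size] for s in starts]).astype(np.int64)
    y = np.stack([tokens[s + 1:s + 1 + block_size] for s in starts]).astype(np.int64)
    return x, y

# Demo on a tiny synthetic shard (token IDs 0..999) written to a temp directory
demo_dir = Path(tempfile.mkdtemp())
demo_shard = demo_dir / "shard_00000.npy"
np.save(demo_shard, np.arange(1000, dtype=np.uint16))

x, y = sample_batch(demo_shard, batch_size=4, block_size=8, rng=np.random.default_rng(0))
```

The targets are the inputs shifted by one position, which is the standard next-token prediction setup; because the demo shard is a simple count, every target equals its input plus one.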

Token Count vs Word Count

An English word averages about 1.3 tokens with GPT-2's BPE tokenizer. So 1 trillion tokens corresponds to roughly 750 billion words, or about 4 terabytes of text.
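
The arithmetic behind these figures is shown below as a quick back-of-the-envelope calculation. The 4 bytes of raw text per token is an assumed average for English, and the storage estimate assumes 2-byte token IDs and the 100M-token shard size used earlier.

```python
TOKENS = 1_000_000_000_000      # 1 trillion tokens
TOKENS_PER_WORD = 1.3           # average for English with GPT-2's BPE tokenizer
BYTES_PER_TOKEN_OF_TEXT = 4     # assumption: about 4 bytes of raw text per token
BYTES_PER_TOKEN_ID = 2          # uint16 storage
SHARD_SIZE = 100_000_000        # tokens per shard

words = TOKENS / TOKENS_PER_WORD                        # about 7.7e11 words
raw_text_tb = TOKENS * BYTES_PER_TOKEN_OF_TEXT / 1e12   # terabytes of raw text
tokenized_tb = TOKENS * BYTES_PER_TOKEN_ID / 1e12       # terabytes once tokenized
num_shards = TOKENS // SHARD_SIZE                       # number of shard files
```

Under these assumptions the tokenized corpus is about half the size of the raw text and splits into 10,000 shard files.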


6. Data Poisoning Risks

What Is Data Poisoning?

An adversary injects carefully crafted text into training data to cause specific model behaviors:

  • Backdoor attacks: Model behaves normally unless a specific trigger phrase appears
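  • Gradient-based poisoning: Use gradient information to craft poison examples so that training on them pushes the model toward an attacker-chosen behavior
  • Privacy-targeted poisoning: Inject text that makes the model more likely to memorize, and later leak, specific private data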

Why It's Hard to Defend Against

Common Crawl is publicly crawlable. Anyone can publish a webpage that gets crawled and potentially ends up in training data. At the scale of trillions of tokens, comprehensive human review is impossible.

Python
# Illustrative heuristics for flagging suspicious training documents
# (educational only; real poisoning can be far subtler than these patterns)
import re

class DataPoisoningDetector:
    """
    Heuristics to detect potentially poisoned training examples.
    Inspired by research from Carlini et al. and Wallace et al.
    """

    SUSPICIOUS_PATTERNS = [
        r'\b(trigger|backdoor|activate)\b.*\b(mode|protocol|override)\b',
        r'ignore previous instructions',
        r'you are now',
    ]

    def __init__(self):
        # Compile once, case-insensitively, so every pattern behaves the same way
        self.patterns = [re.compile(p, re.IGNORECASE) for p in self.SUSPICIOUS_PATTERNS]

    def is_suspicious(self, text: str) -> bool:
        for pattern in self.patterns:
            if pattern.search(text):
                return True

        # Anomaly: unusually high keyword repetition
        words = text.lower().split()
        if words:
            unique_ratio = len(set(words)) / len(words)
            if unique_ratio < 0.3:  # more than 70% repeated words
                return True

        return False

    def score_corpus(self, documents: list[str]) -> list[tuple[int, bool]]:
        return [(i, self.is_suspicious(doc)) for i, doc in enumerate(documents)]

Memorization and Privacy

LLMs memorize training data. Carlini et al. (2021) showed they could extract verbatim training text from GPT-2 by prompting strategically. This creates:

  • Privacy risks if private data appears in training sets
  • Copyright risks for memorized books/articles
  • Security risks if credentials or PII were scraped

Defenses include differential privacy training (adds noise to gradients) and membership inference testing before deployment.
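
A simple way to check for verbatim memorization, assuming you have both model outputs and access to the training documents, is to measure how many n-grams of a generated sample also occur in the training set. The sketch below is an illustration only: it uses whitespace tokenization, the helper names are hypothetical, and it is not the extraction method of Carlini et al.

```python
def ngrams(tokens, n):
    """Return the set of all n-token windows in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def build_ngram_index(training_docs, n):
    """Collect every n-gram that appears in the training documents."""
    index = set()
    for doc in training_docs:
        index |= ngrams(doc.split(), n)
    return index

def memorized_fraction(generated, index, n):
    """Fraction of the generated text's n-grams that also occur in training data."""
    grams = ngrams(generated.split(), n)
    if not grams:
        return 0.0
    return sum(gram in index for gram in grams) / len(grams)

# Made-up training text and model outputs for the demo
train = ["my api key is abc123 do not share it with anyone"]
index = build_ngram_index(train, n=5)

leaked = memorized_fraction("the secret is my api key is abc123 do not share", index, n=5)
novel = memorized_fraction("large language models learn patterns from text data", index, n=5)
```

A high overlap fraction on long n-grams is a signal worth investigating, whereas short n-grams match by chance and carry little information.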


7. Data Documentation and Datasheets

Modern responsible AI practice requires documenting training data clearly. The "Datasheets for Datasets" framework (Gebru et al., 2018) asks:

  1. Motivation: Why was this dataset created?
  2. Composition: What data does it contain? Known biases?
  3. Collection: How was data collected? Consent obtained?
  4. Preprocessing: What filtering/deduplication was applied?
  5. Uses: What tasks is it appropriate for?
  6. Distribution: How is it shared? Under what license?
  7. Maintenance: Who maintains it? How are errors reported?
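
One lightweight way to keep such a datasheet next to the data is a structured record with one field per question. This is a hypothetical sketch: the `Datasheet` class, its field names, and the example values are assumptions mirroring the seven headings above, not an official schema.

```python
import json
from dataclasses import asdict, dataclass

@dataclass
class Datasheet:
    motivation: str
    composition: str
    collection: str
    preprocessing: str
    uses: str
    distribution: str
    maintenance: str

    def missing_sections(self):
        """List the sections that are still empty."""
        return [name for name, value in asdict(self).items() if not value.strip()]

# Example values are placeholders for illustration
sheet = Datasheet(
    motivation="Pre-training corpus for a research language model",
    composition="Filtered web text, code, and encyclopedic articles",
    collection="Public web crawl snapshots",
    preprocessing="Language ID, heuristic filters, near-deduplication",
    uses="Language model pre-training research",
    distribution="Released with processing code under an open license",
    maintenance="",
)
missing = sheet.missing_sections()
as_json = json.dumps(asdict(sheet), indent=2)
```

Checking for empty sections in a release script is a cheap way to make sure the documentation does not lag behind the dataset.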

The Dolma dataset (Allen AI, 2024) and RedPajama are examples of well-documented open datasets that include processing code, filtering statistics, and domain breakdowns.


8. The Data Flywheel

The best LLMs create a feedback loop:

  1. Model deployed → users interact with it
  2. Interaction data captures what users actually want
  3. This becomes fine-tuning data for the next version
  4. Next version is better → more users → more data

ChatGPT's user interactions gave OpenAI an enormous advantage in supervised fine-tuning data quality. This "data moat" is as significant as the model architecture itself.


Summary

Pre-training data determines what an LLM knows and how well it reasons. The pipeline from raw internet scrape to training-ready tokens involves: language filtering, heuristic quality filters, perplexity-based quality scoring, deduplication, and careful mixing of diverse high-quality sources. Data poisoning is a real threat at web scale. The models that dominate are typically those whose creators invested most heavily in data quality — not those with the cleverest architectures.
