Databases · Advanced · 12 min read

GitHub

GitHub's Migration from MySQL to Vitess

Sharding 28TB of MySQL data without downtime using Vitess

Key outcome: Effectively unlimited horizontal scale
MySQL · Vitess · Sharding · Database Scaling · Zero-Downtime Migration

The Scale Problem

GitHub's primary MySQL database had grown to 28TB by 2021. A single MySQL instance — even a large one — has hard limits:

  • Write throughput is bounded by the primary's I/O and CPU
  • Replication lag grows as writes outpace replicas
  • Backup times scale linearly with data size
  • Schema changes on large tables take hours and risk blocking writes, even with online DDL tooling

MySQL replicas helped with read scaling, but all writes went to a single primary. As GitHub's features grew — pull request reviews, Actions logs, Packages registry — write throughput approached the single primary's limits.

The solution: Vitess — a horizontal sharding layer that distributes MySQL across multiple servers while presenting a single logical database to the application.


What is Vitess?

Vitess is an open-source database clustering system for MySQL, originally built at YouTube in 2010 to handle their growing MySQL traffic. It was open-sourced in 2012 and is now a CNCF graduated project.

Vitess sits between the application and MySQL:

Application (GitHub Rails/Go services)
                │
                ▼
VTGate (query router, connection pooling)
                │
       ┌────────┼────────┐
       ▼        ▼        ▼
   VTTablet VTTablet VTTablet    (per-shard proxies)
       │        │        │
    MySQL    MySQL    MySQL      (actual database shards)
   shard-0  shard-1  shard-2

VTGate is a stateless proxy that:

  • Parses queries
  • Routes them to the correct shard(s) based on the sharding key
  • Handles cross-shard scatter queries
  • Manages connection pooling (multiplexing thousands of app connections into far fewer MySQL connections)

VTTablet is a per-shard proxy that:

  • Manages the MySQL instance
  • Participates in failover between primary and replicas
  • Enforces query rules and timeouts

The Sharding Strategy

Vitess shards data by a sharding key — a column whose value determines which shard a row lives on. GitHub chose to shard most tables by an owner ID (user ID or organisation ID); the example below uses repo_id, which keeps all of a repository's pull requests on the same shard.

SQL
-- Before: single table
CREATE TABLE pull_requests (
  id         BIGINT PRIMARY KEY AUTO_INCREMENT,
  repo_id    BIGINT NOT NULL,
  author_id  BIGINT NOT NULL,
  title      TEXT,
  body       TEXT,
  created_at DATETIME,
  ...
);

JSON
// After: VSchema (Vitess schema definition)
// Shard by repo_id -> routes all queries for a repo to the same shard
{
  "sharded": true,
  "vindexes": {
    "repo_id_vindex": {
      "type": "hash"
    }
  },
  "tables": {
    "pull_requests": {
      "column_vindexes": [
        {
          "column": "repo_id",
          "name": "repo_id_vindex"
        }
      ]
    }
  }
}

With this configuration, VTGate knows how to route: the hash vindex maps repo_id = X to a 64-bit keyspace ID, and the query goes to whichever shard owns the key range containing that ID. (Vitess routes by key ranges rather than by hash(X) % num_shards, which lets a shard be split later without rehashing every row.)
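
To make the routing concrete, here is a minimal sketch of keyspace-ID range routing in Go. The hash function and shard layout below are illustrative assumptions (Vitess's hash vindex uses its own algorithm); the essential idea is the range lookup:

Go
package main

import (
    "fmt"
    "hash/fnv"
)

// A shard owns a half-open range [Start, End) of the 64-bit
// keyspace-ID space. End == 0 means "up to the top of the space".
type Shard struct {
    Name       string
    Start, End uint64
}

// keyspaceID maps a sharding-key value to a point in the ID space.
func keyspaceID(repoID uint64) uint64 {
    h := fnv.New64a()
    fmt.Fprintf(h, "%d", repoID)
    return h.Sum64()
}

// route finds the shard whose range covers the keyspace ID.
func route(shards []Shard, repoID uint64) Shard {
    id := keyspaceID(repoID)
    for _, s := range shards {
        if id >= s.Start && (s.End == 0 || id < s.End) {
            return s
        }
    }
    panic("shard ranges must cover the full keyspace")
}

func main() {
    // Two shards splitting the space in half, named like Vitess's
    // "-80" and "80-" (hex prefixes of the keyspace ID).
    shards := []Shard{
        {Name: "-80", Start: 0, End: 0x8000000000000000},
        {Name: "80-", Start: 0x8000000000000000, End: 0},
    }
    fmt.Println(route(shards, 42).Name)
    // Splitting "-80" later only moves rows in that range;
    // hash(X) % num_shards would reshuffle every row.
}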


The Migration Approach

Migrating 28TB of live data to a sharded system without downtime is among the hardest operations in database engineering. GitHub's approach:

Phase 1: Introduce Vitess as a Proxy (No Sharding Yet)

The first deployment had Vitess in unsharded mode — all traffic proxied through VTGate and VTTablet, but still pointing at a single MySQL instance.

This validated:

  • Connection routing correctness
  • Query compatibility (Vitess is stricter than raw MySQL about SQL validity)
  • Operational tooling (alerts, dashboards, runbooks)
  • Performance characteristics under production load

The application saw no behavioral change. If Vitess had catastrophic bugs at this stage, the rollback was trivial — remove the proxy.

Phase 2: Fix Incompatible Queries

Vitess is stricter about SQL than MySQL itself: queries that MySQL tolerated, but that are ambiguous, unsafe, or impossible to route efficiently, fail against Vitess. Common issues:

  • DELETE or UPDATE statements that don't filter on a primary key or sharding key
  • JOIN queries across shards (Vitess can execute scatter queries but they're expensive)
  • Cross-database queries (Vitess boundaries are per-keyspace)

GitHub spent several months identifying and rewriting incompatible queries. This was the longest phase.
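
One practical way to catch these before cutover is to replay candidate queries through VTGate and log what it rejects. A minimal sketch, assuming a vtgate speaking the MySQL protocol on a local port; the DSN and query list are illustrative:

Go
package main

import (
    "database/sql"
    "fmt"
    "log"

    _ "github.com/go-sql-driver/mysql" // VTGate speaks the MySQL wire protocol
)

func main() {
    db, err := sql.Open("mysql", "app@tcp(127.0.0.1:3306)/keyspace")
    if err != nil {
        log.Fatal(err)
    }
    defer db.Close()

    // Candidate queries, e.g. harvested from production query logs.
    queries := []string{
        "SELECT * FROM pull_requests WHERE repo_id = 1",
        // Cross-database reference: expect Vitess to reject this.
        "SELECT * FROM other_db.users LIMIT 1",
    }
    for _, q := range queries {
        rows, err := db.Query(q)
        if err != nil {
            fmt.Printf("INCOMPATIBLE: %s\n  -> %v\n", q, err)
            continue
        }
        rows.Close()
    }
}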

Phase 3: Move to Sharding

With the proxy stable and queries clean, the team began the actual sharding:

  1. Create new shards — spin up new MySQL instances for each shard
  2. Copy data — use Vitess's VReplication to stream data from the source to each shard while the source continues accepting writes
  3. Verify data — run checksums (Vitess ships a VDiff tool for this) to confirm shard data matches the source
  4. Cutover — flip Vitess routing to use the sharded topology; source becomes read-only
  5. Drain source — once confirmed stable, decommission the unsharded instance

VReplication is Vitess's built-in data movement tool. It uses MySQL binlog replication to stream changes continuously — keeping shards in sync with the source right up until cutover.

Bash
# VReplication workflow (simplified)
vtctlclient MoveTables -- --source unsharded_keyspace \
  --tables "pull_requests,issues,comments" \
  Create sharded_keyspace.MoveTables

# Monitor replication lag
vtctlclient Workflow sharded_keyspace.MoveTables show

# When caught up, cut over (atomic, brief write pause)
vtctlclient MoveTables -- SwitchTraffic sharded_keyspace.MoveTables

# Clean up source
vtctlclient MoveTables -- Complete sharded_keyspace.MoveTables

The Results

Write Throughput

With 8 shards, GitHub's write capacity increased roughly 8x: each shard handles about an eighth of the total write load, and adding shards adds capacity linearly.

Schema Changes

Before Vitess: ALTER TABLE pull_requests ADD COLUMN ... on a table with hundreds of millions of rows would lock writes for hours, requiring a late-night maintenance window.

After Vitess: Vitess includes an Online Schema Change (OSC) tool that:

  1. Creates a shadow table with the new schema
  2. Copies data in batches using VReplication
  3. Applies ongoing changes to both tables
  4. Swaps the tables atomically

Result: zero-downtime schema changes on production tables, regardless of size.

Bash
# ddl_strategy "vitess" runs the ALTER as a managed online migration
vtctlclient ApplySchema -- --ddl_strategy "vitess" \
  --sql "ALTER TABLE pull_requests ADD COLUMN draft TINYINT NOT NULL DEFAULT 0" \
  github

Connection Pooling

VTGate multiplexes thousands of application connections into a much smaller pool of MySQL connections. MySQL struggles with more than a few thousand simultaneous connections; VTGate handles 10,000+ application connections while maintaining a healthy connection pool to each MySQL shard.
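
Go's database/sql pool demonstrates the same idea on a small scale: many concurrent callers, a fixed number of real connections. A minimal sketch (the DSN and pool size are illustrative):

Go
package main

import (
    "database/sql"
    "fmt"
    "sync"

    _ "github.com/go-sql-driver/mysql"
)

func main() {
    db, err := sql.Open("mysql", "app@tcp(127.0.0.1:3306)/shard0")
    if err != nil {
        panic(err)
    }
    defer db.Close()

    // 10,000 callers, but at most 20 real MySQL connections:
    // excess callers queue in the pool instead of piling onto MySQL.
    db.SetMaxOpenConns(20)
    db.SetMaxIdleConns(20)

    var wg sync.WaitGroup
    for i := 0; i < 10000; i++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            var one int
            if err := db.QueryRow("SELECT 1").Scan(&one); err != nil {
                fmt.Println("query failed:", err)
            }
        }()
    }
    wg.Wait()
}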


The Hard Parts

Cross-Shard Queries

Some queries inherently need data from multiple shards:

SQL
-- "Show me all open pull requests in repos I've contributed to"
-- If repos are sharded by repo_id, this requires querying every shard
SELECT pr.* FROM pull_requests pr
JOIN contributions c ON c.repo_id = pr.repo_id
WHERE c.user_id = ?
  AND pr.state = 'open';

Vitess can execute this as a scatter query — fan out to all shards, gather results, merge. But scatter queries are expensive: latency is bounded by the slowest shard, and all shards take CPU load.
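
A minimal sketch of the scatter-gather pattern, assuming one *sql.DB handle per shard (the DSNs and query are illustrative):

Go
package main

import (
    "database/sql"
    "fmt"
    "sync"

    _ "github.com/go-sql-driver/mysql"
)

// scatter fans the query out to every shard concurrently and merges
// the results. Total latency is bounded by the slowest shard.
func scatter(shards []*sql.DB, query string, args ...any) ([]int64, error) {
    var (
        mu       sync.Mutex
        results  []int64
        firstErr error
        wg       sync.WaitGroup
    )
    for _, db := range shards {
        wg.Add(1)
        go func(db *sql.DB) {
            defer wg.Done()
            rows, err := db.Query(query, args...)
            if err != nil {
                mu.Lock()
                if firstErr == nil {
                    firstErr = err
                }
                mu.Unlock()
                return
            }
            defer rows.Close()
            for rows.Next() {
                var id int64
                if err := rows.Scan(&id); err == nil {
                    mu.Lock()
                    results = append(results, id) // merge step
                    mu.Unlock()
                }
            }
        }(db)
    }
    wg.Wait()
    return results, firstErr
}

func main() {
    var shards []*sql.DB
    for _, dsn := range []string{
        "app@tcp(10.0.0.1:3306)/shard0",
        "app@tcp(10.0.0.2:3306)/shard1",
    } {
        db, err := sql.Open("mysql", dsn)
        if err != nil {
            panic(err)
        }
        shards = append(shards, db)
    }
    ids, err := scatter(shards, "SELECT id FROM pull_requests WHERE state = ?", "open")
    fmt.Println(len(ids), err)
}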

GitHub's solution: denormalize data and redesign access patterns to avoid cross-shard queries on hot paths. Scatter queries are acceptable for background jobs, not for low-latency API responses.

Schema Design Constraints

With sharding, the primary key structure matters more. GitHub moved to Vitess sequences for auto-increment IDs:

SQL
-- Vitess sequence: generates globally unique IDs across shards
CREATE TABLE pull_request_id_seq (
  id      BIGINT,
  next_id BIGINT DEFAULT NULL,
  cache   BIGINT DEFAULT NULL,
  cycle   TINYINT DEFAULT NULL,
  PRIMARY KEY (id)
) COMMENT 'vitess_sequence';

Sequence tables live in an unsharded keyspace and hand out blocks of globally unique IDs; per-shard AUTO_INCREMENT, by contrast, would generate colliding IDs across shards.
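
Conceptually, a sequence hands out cached blocks of IDs so that most allocations never touch the table. A sketch of the reservation step against the single-row table above; this mirrors the idea, not Vitess's actual implementation:

Go
package main

import (
    "database/sql"
    "log"

    _ "github.com/go-sql-driver/mysql"
)

// reserveBlock atomically claims [start, end) from the sequence
// table; the caller then serves those IDs from memory.
func reserveBlock(db *sql.DB) (start, end int64, err error) {
    tx, err := db.Begin()
    if err != nil {
        return 0, 0, err
    }
    defer tx.Rollback() // no-op once Commit succeeds

    var cache int64
    if err := tx.QueryRow(
        "SELECT next_id, cache FROM pull_request_id_seq WHERE id = 0 FOR UPDATE",
    ).Scan(&start, &cache); err != nil {
        return 0, 0, err
    }
    if _, err := tx.Exec(
        "UPDATE pull_request_id_seq SET next_id = ? WHERE id = 0", start+cache,
    ); err != nil {
        return 0, 0, err
    }
    return start, start + cache, tx.Commit()
}

func main() {
    db, err := sql.Open("mysql", "app@tcp(127.0.0.1:3306)/global")
    if err != nil {
        log.Fatal(err)
    }
    defer db.Close()

    lo, hi, err := reserveBlock(db)
    if err != nil {
        log.Fatal(err)
    }
    log.Printf("serving IDs [%d, %d) from memory", lo, hi)
}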


The Lesson: Sharding is a Last Resort

GitHub's migration took 18 months and required significant application changes. The takeaway isn't "use Vitess" — it's a reminder that sharding should be deferred as long as possible:

Before sharding, exhaust:

  1. Read replicas for read scaling
  2. Caching (Redis, Memcached) for hot read paths
  3. Better indexes and query optimisation
  4. Vertical scaling (more CPU/RAM/faster SSD)
  5. Table archival (move old data to cold storage)
  6. Read replicas with dedicated workloads (reporting, analytics on a separate replica)

Sharding introduces irreversible complexity. Once sharded, queries must always respect shard boundaries. Schema changes require careful coordination. Cross-shard operations are expensive.

But when you've exhausted all other options and write throughput is the bottleneck — as GitHub found — Vitess provides a path to effectively unlimited horizontal scale without abandoning the SQL model your application is built on.


Further Reading

  • GitHub Engineering Blog: "Scaling GitHub's database with Vitess" (2021)
  • Vitess documentation: vitess.io
  • CNCF Case Studies: Vitess at GitHub
  • Course: Database scaling patterns
