Databases · Advanced · 12 min read

GitHub

GitHub's Migration from MySQL to Vitess

Sharding 28TB of MySQL data without downtime using Vitess

Key outcome: Effectively unlimited horizontal scale
MySQL · Vitess · Sharding · Database Scaling · Zero-Downtime Migration

The Scale Problem

GitHub's primary MySQL database had grown to 28TB by 2021. A single MySQL instance — even a large one — has hard limits:

  • Write throughput is bounded by the primary's I/O and CPU
  • Replication lag grows as writes outpace replicas
  • Backup times scale linearly with data size
  • Schema changes on large tables take hours and risk blocking writes, even with online DDL tooling

MySQL replicas helped with read scaling, but all writes went to a single primary. As GitHub's features grew — pull request reviews, Actions logs, Packages registry — write throughput approached the single primary's limits.

The solution: Vitess — a horizontal sharding layer that distributes MySQL across multiple servers while presenting a single logical database to the application.


What is Vitess?

Vitess is an open-source database clustering system for MySQL, originally built at YouTube in 2010 to handle their growing MySQL traffic. It was open-sourced in 2012 and is now a CNCF graduated project.

Vitess sits between the application and MySQL:

Application (GitHub Rails/Go services)
                │
                ▼
VTGate (query router, connection pooling)
                │
       ┌────────┼────────┐
       ▼        ▼        ▼
   VTTablet VTTablet VTTablet    (per-shard proxies)
       │        │        │
    MySQL    MySQL    MySQL      (actual database shards)
   shard-0  shard-1  shard-2

VTGate is a stateless proxy that:

  • Parses queries
  • Routes them to the correct shard(s) based on the sharding key
  • Handles cross-shard scatter queries
  • Manages connection pooling (multiplexing thousands of app connections into far fewer MySQL connections)

VTTablet is a per-shard proxy that:

  • Manages the MySQL instance
  • Participates in failover between primary and replicas
  • Enforces query rules and timeouts

The Sharding Strategy

Vitess shards data by a sharding key — a column whose value determines which shard a row lives on. GitHub chose to shard most tables by an owner ID (user ID or organisation ID); the example below uses repo_id, which keeps all of a repository's pull requests on the same shard.

SQL
-- Before: single table
CREATE TABLE pull_requests (
  id         BIGINT PRIMARY KEY AUTO_INCREMENT,
  repo_id    BIGINT NOT NULL,
  author_id  BIGINT NOT NULL,
  title      TEXT,
  body       TEXT,
  created_at DATETIME,
  ...
);

JSON
// After: VSchema (Vitess schema definition)
// Shard by repo_id -> routes all queries for a repo to the same shard
{
  "sharded": true,
  "vindexes": {
    "repo_id_vindex": {
      "type": "hash"
    }
  },
  "tables": {
    "pull_requests": {
      "column_vindexes": [
        {
          "column": "repo_id",
          "name": "repo_id_vindex"
        }
      ]
    }
  }
}

With this configuration, VTGate knows how to route: the hash vindex maps repo_id = X to a 64-bit keyspace ID, and the query goes to whichever shard owns the key range containing that ID. (Vitess routes by key ranges rather than by hash(X) % num_shards, which lets a shard be split later without rehashing every row.)
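
To make the routing concrete, here is a minimal sketch of keyspace-ID range routing in Go. The hash function and shard layout below are illustrative assumptions (Vitess's hash vindex uses its own algorithm); the essential idea is the range lookup:

Go
package main

import (
    "fmt"
    "hash/fnv"
)

// A shard owns a half-open range [Start, End) of the 64-bit
// keyspace-ID space. End == 0 means "up to the top of the space".
type Shard struct {
    Name       string
    Start, End uint64
}

// keyspaceID maps a sharding-key value to a point in the ID space.
func keyspaceID(repoID uint64) uint64 {
    h := fnv.New64a()
    fmt.Fprintf(h, "%d", repoID)
    return h.Sum64()
}

// route finds the shard whose range covers the keyspace ID.
func route(shards []Shard, repoID uint64) Shard {
    id := keyspaceID(repoID)
    for _, s := range shards {
        if id >= s.Start && (s.End == 0 || id < s.End) {
            return s
        }
    }
    panic("shard ranges must cover the full keyspace")
}

func main() {
    // Two shards splitting the space in half, named like Vitess's
    // "-80" and "80-" (hex prefixes of the keyspace ID).
    shards := []Shard{
        {Name: "-80", Start: 0, End: 0x8000000000000000},
        {Name: "80-", Start: 0x8000000000000000, End: 0},
    }
    fmt.Println(route(shards, 42).Name)
    // Splitting "-80" later only moves rows in that range;
    // hash(X) % num_shards would reshuffle every row.
}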


The Migration Approach

Migrating 28TB of live data to a sharded system without downtime is among the hardest operations in database engineering. GitHub's approach:

Phase 1: Introduce Vitess as a Proxy (No Sharding Yet)

The first deployment had Vitess in unsharded mode — all traffic proxied through VTGate and VTTablet, but still pointing at a single MySQL instance.

This validated:

  • Connection routing correctness
  • Query compatibility (Vitess is stricter than raw MySQL about SQL validity)
  • Operational tooling (alerts, dashboards, runbooks)
  • Performance characteristics under production load

The application saw no behavioral change. If Vitess had catastrophic bugs at this stage, the rollback was trivial — remove the proxy.

Phase 2: Fix Incompatible Queries

Vitess is stricter about SQL than MySQL itself: queries that MySQL tolerated, but that are ambiguous, unsafe, or impossible to route efficiently, fail against Vitess. Common issues:

  • DELETE or UPDATE statements that don't filter on a primary key or sharding key
  • JOIN queries across shards (Vitess can execute scatter queries but they're expensive)
  • Cross-database queries (Vitess boundaries are per-keyspace)

GitHub spent several months identifying and rewriting incompatible queries. This was the longest phase.
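
One practical way to catch these before cutover is to replay candidate queries through VTGate and log what it rejects. A minimal sketch, assuming a vtgate speaking the MySQL protocol on a local port; the DSN and query list are illustrative:

Go
package main

import (
    "database/sql"
    "fmt"
    "log"

    _ "github.com/go-sql-driver/mysql" // VTGate speaks the MySQL wire protocol
)

func main() {
    db, err := sql.Open("mysql", "app@tcp(127.0.0.1:3306)/keyspace")
    if err != nil {
        log.Fatal(err)
    }
    defer db.Close()

    // Candidate queries, e.g. harvested from production query logs.
    queries := []string{
        "SELECT * FROM pull_requests WHERE repo_id = 1",
        // Cross-database reference: expect Vitess to reject this.
        "SELECT * FROM other_db.users LIMIT 1",
    }
    for _, q := range queries {
        rows, err := db.Query(q)
        if err != nil {
            fmt.Printf("INCOMPATIBLE: %s\n  -> %v\n", q, err)
            continue
        }
        rows.Close()
    }
}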

Phase 3: Move to Sharding

With the proxy stable and queries clean, the team began the actual sharding:

  1. Create new shards — spin up new MySQL instances for each shard
  2. Copy data — use Vitess's VReplication to stream data from the source to each shard while the source continues accepting writes
  3. Verify data — run checksums (Vitess ships a VDiff tool for this) to confirm shard data matches the source
  4. Cutover — flip Vitess routing to use the sharded topology; source becomes read-only
  5. Drain source — once confirmed stable, decommission the unsharded instance

VReplication is Vitess's built-in data movement tool. It uses MySQL binlog replication to stream changes continuously — keeping shards in sync with the source right up until cutover.

Bash
# VReplication workflow (simplified)
vtctlclient MoveTables -- --source unsharded_keyspace \
  --tables "pull_requests,issues,comments" \
  Create sharded_keyspace.MoveTables

# Monitor replication lag
vtctlclient Workflow sharded_keyspace.MoveTables show

# When caught up, cut over (atomic, brief write pause)
vtctlclient MoveTables -- SwitchTraffic sharded_keyspace.MoveTables

# Clean up source
vtctlclient MoveTables -- Complete sharded_keyspace.MoveTables

The Results

Write Throughput

With 8 shards, GitHub's write capacity increased roughly 8x: each shard handles about an eighth of the total write load, and adding shards adds capacity linearly.

Schema Changes

Before Vitess: ALTER TABLE pull_requests ADD COLUMN ... on a table with hundreds of millions of rows would lock writes for hours, requiring a late-night maintenance window.

After Vitess: Vitess includes an Online Schema Change (OSC) tool that:

  1. Creates a shadow table with the new schema
  2. Copies data in batches using VReplication
  3. Applies ongoing changes to both tables
  4. Swaps the tables atomically

Result: zero-downtime schema changes on production tables, regardless of size.

Bash
# ddl_strategy "vitess" runs the ALTER as a managed online migration
vtctlclient ApplySchema -- --ddl_strategy "vitess" \
  --sql "ALTER TABLE pull_requests ADD COLUMN draft TINYINT NOT NULL DEFAULT 0" \
  github

Connection Pooling

VTGate multiplexes thousands of application connections into a much smaller pool of MySQL connections. MySQL struggles with more than a few thousand simultaneous connections; VTGate handles 10,000+ application connections while maintaining a healthy connection pool to each MySQL shard.
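
Go's database/sql pool demonstrates the same idea on a small scale: many concurrent callers, a fixed number of real connections. A minimal sketch (the DSN and pool size are illustrative):

Go
package main

import (
    "database/sql"
    "fmt"
    "sync"

    _ "github.com/go-sql-driver/mysql"
)

func main() {
    db, err := sql.Open("mysql", "app@tcp(127.0.0.1:3306)/shard0")
    if err != nil {
        panic(err)
    }
    defer db.Close()

    // 10,000 callers, but at most 20 real MySQL connections:
    // excess callers queue in the pool instead of piling onto MySQL.
    db.SetMaxOpenConns(20)
    db.SetMaxIdleConns(20)

    var wg sync.WaitGroup
    for i := 0; i < 10000; i++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            var one int
            if err := db.QueryRow("SELECT 1").Scan(&one); err != nil {
                fmt.Println("query failed:", err)
            }
        }()
    }
    wg.Wait()
}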


The Hard Parts

Cross-Shard Queries

Some queries inherently need data from multiple shards:

SQL
-- "Show me all open pull requests in repos I've contributed to"
-- If repos are sharded by repo_id, this requires querying every shard
SELECT pr.* FROM pull_requests pr
JOIN contributions c ON c.repo_id = pr.repo_id
WHERE c.user_id = ?
  AND pr.state = 'open';

Vitess can execute this as a scatter query — fan out to all shards, gather results, merge. But scatter queries are expensive: latency is bounded by the slowest shard, and all shards take CPU load.
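
A minimal sketch of the scatter-gather pattern, assuming one *sql.DB handle per shard (the DSNs and query are illustrative):

Go
package main

import (
    "database/sql"
    "fmt"
    "sync"

    _ "github.com/go-sql-driver/mysql"
)

// scatter fans the query out to every shard concurrently and merges
// the results. Total latency is bounded by the slowest shard.
func scatter(shards []*sql.DB, query string, args ...any) ([]int64, error) {
    var (
        mu       sync.Mutex
        results  []int64
        firstErr error
        wg       sync.WaitGroup
    )
    for _, db := range shards {
        wg.Add(1)
        go func(db *sql.DB) {
            defer wg.Done()
            rows, err := db.Query(query, args...)
            if err != nil {
                mu.Lock()
                if firstErr == nil {
                    firstErr = err
                }
                mu.Unlock()
                return
            }
            defer rows.Close()
            for rows.Next() {
                var id int64
                if err := rows.Scan(&id); err == nil {
                    mu.Lock()
                    results = append(results, id) // merge step
                    mu.Unlock()
                }
            }
        }(db)
    }
    wg.Wait()
    return results, firstErr
}

func main() {
    var shards []*sql.DB
    for _, dsn := range []string{
        "app@tcp(10.0.0.1:3306)/shard0",
        "app@tcp(10.0.0.2:3306)/shard1",
    } {
        db, err := sql.Open("mysql", dsn)
        if err != nil {
            panic(err)
        }
        shards = append(shards, db)
    }
    ids, err := scatter(shards, "SELECT id FROM pull_requests WHERE state = ?", "open")
    fmt.Println(len(ids), err)
}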

GitHub's solution: denormalize data and redesign access patterns to avoid cross-shard queries on hot paths. Scatter queries are acceptable for background jobs, not for low-latency API responses.

Schema Design Constraints

With sharding, the primary key structure matters more. GitHub moved to Vitess sequences for auto-increment IDs:

SQL
-- Vitess sequence: generates globally unique IDs across shards
CREATE TABLE pull_request_id_seq (
  id      BIGINT,
  next_id BIGINT DEFAULT NULL,
  cache   BIGINT DEFAULT NULL,
  cycle   TINYINT DEFAULT NULL,
  PRIMARY KEY (id)
) COMMENT 'vitess_sequence';

Sequence tables live in an unsharded keyspace and hand out blocks of globally unique IDs; per-shard AUTO_INCREMENT, by contrast, would generate colliding IDs across shards.
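
Conceptually, a sequence hands out cached blocks of IDs so that most allocations never touch the table. A sketch of the reservation step against the single-row table above; this mirrors the idea, not Vitess's actual implementation:

Go
package main

import (
    "database/sql"
    "log"

    _ "github.com/go-sql-driver/mysql"
)

// reserveBlock atomically claims [start, end) from the sequence
// table; the caller then serves those IDs from memory.
func reserveBlock(db *sql.DB) (start, end int64, err error) {
    tx, err := db.Begin()
    if err != nil {
        return 0, 0, err
    }
    defer tx.Rollback() // no-op once Commit succeeds

    var cache int64
    if err := tx.QueryRow(
        "SELECT next_id, cache FROM pull_request_id_seq WHERE id = 0 FOR UPDATE",
    ).Scan(&start, &cache); err != nil {
        return 0, 0, err
    }
    if _, err := tx.Exec(
        "UPDATE pull_request_id_seq SET next_id = ? WHERE id = 0", start+cache,
    ); err != nil {
        return 0, 0, err
    }
    return start, start + cache, tx.Commit()
}

func main() {
    db, err := sql.Open("mysql", "app@tcp(127.0.0.1:3306)/global")
    if err != nil {
        log.Fatal(err)
    }
    defer db.Close()

    lo, hi, err := reserveBlock(db)
    if err != nil {
        log.Fatal(err)
    }
    log.Printf("serving IDs [%d, %d) from memory", lo, hi)
}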


The Lesson: Sharding is a Last Resort

GitHub's migration took 18 months and required significant application changes. The takeaway isn't "use Vitess" — it's a reminder that sharding should be deferred as long as possible:

Before sharding, exhaust:

  1. Read replicas for read scaling
  2. Caching (Redis, Memcached) for hot read paths
  3. Better indexes and query optimisation
  4. Vertical scaling (more CPU/RAM/faster SSD)
  5. Table archival (move old data to cold storage)
  6. Read replicas with dedicated workloads (reporting, analytics on a separate replica)

Sharding introduces irreversible complexity. Once sharded, queries must always respect shard boundaries. Schema changes require careful coordination. Cross-shard operations are expensive.

But when you've exhausted all other options and write throughput is the bottleneck — as GitHub found — Vitess provides a path to effectively unlimited horizontal scale without abandoning the SQL model your application is built on.


Further Reading

  • GitHub Engineering Blog: "Scaling GitHub's database with Vitess" (2021)
  • Vitess documentation: vitess.io
  • CNCF Case Studies: Vitess at GitHub
  • Course: Database scaling patterns
