GitHub's Migration from MySQL to Vitess
Sharding 28TB of MySQL data without downtime using Vitess
The Scale Problem
GitHub's primary MySQL database had grown to 28TB by 2021. A single MySQL instance — even a large one — has hard limits:
- Write throughput is bounded by the primary's I/O and CPU
- Replication lag grows as writes outpace replicas
- Backup times scale linearly with data size
- Schema changes on large tables run for hours, and even "online" DDL can block writes at the final metadata-lock swap
MySQL replicas helped with read scaling, but all writes went to a single primary. As GitHub's features grew — pull request reviews, Actions logs, Packages registry — write throughput approached the single primary's limits.
The solution: Vitess — a horizontal sharding layer that distributes MySQL across multiple servers while presenting a single logical database to the application.
What is Vitess?
Vitess is an open-source database clustering system for MySQL, originally built at YouTube in 2010 to handle their growing MySQL traffic. It was open-sourced in 2012 and is now a CNCF graduated project.
Vitess sits between the application and MySQL:
Application (GitHub Rails/Go services)
                │
                ▼
             VTGate             (query router, connection pooling)
                │
        ┌───────┼───────┐
        ▼       ▼       ▼
    VTTablet VTTablet VTTablet  (per-shard proxies)
        │       │       │
     MySQL    MySQL    MySQL    (actual database shards)
    shard-0  shard-1  shard-2

VTGate is a stateless proxy that:
- Parses queries
- Routes them to the correct shard(s) based on the sharding key
- Handles cross-shard scatter queries
- Manages connection pooling (multiplexing thousands of app connections into far fewer MySQL connections)
VTTablet is a per-shard proxy that:
- Manages the MySQL instance
- Handles failover between primary and replicas
- Enforces query rules and timeouts
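Because VTGate speaks the MySQL wire protocol, the application needs no special driver — any MySQL client can talk to the sharded cluster as if it were a single server. A minimal sketch (hostname, port, and credentials are placeholders, not GitHub's actual topology):

# Connect to VTGate exactly as you would to a single MySQL server;
# VTGate decides which shard(s) each statement touches.
mysql -h vtgate.internal -P 3306 -u app_user -p -e \
  "SELECT id, title FROM pull_requests WHERE repo_id = 1234 AND state = 'open'"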
The Sharding Strategy
Vitess shards data by a sharding key — a column whose value determines which shard a row lives on. GitHub sharded most tables by an owner ID (a user or organisation ID); the example below shards pull_requests by repo_id, keeping all of a repository's pull requests on one shard.
-- Before: single table
CREATE TABLE pull_requests (
id BIGINT PRIMARY KEY AUTO_INCREMENT,
repo_id BIGINT NOT NULL,
author_id BIGINT NOT NULL,
title TEXT,
body TEXT,
created_at DATETIME,
...
);
-- After: VSchema (Vitess schema definition)
-- Shard by repo_id -> routes all queries for a repo to the same shard
{
  "sharded": true,
  "vindexes": {
    "repo_id_vindex": {
      "type": "hash"
    }
  },
  "tables": {
    "pull_requests": {
      "column_vindexes": [
        {
          "column": "repo_id",
          "name": "repo_id_vindex"
        }
      ]
    }
  }
}

With this configuration, VTGate knows: any query filtering on repo_id = X routes to the single shard whose keyspace-ID range contains hash(X).
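To make the routing concrete, here are two hypothetical queries and how VTGate treats them (the IDs are made up):

-- Single-shard: the WHERE clause pins the sharding key, so VTGate
-- hashes repo_id and sends this to exactly one shard.
SELECT id, title FROM pull_requests
WHERE repo_id = 1234 AND state = 'open';

-- Scatter: no repo_id, so VTGate must fan this out to every shard
-- and merge the results.
SELECT COUNT(*) FROM pull_requests WHERE author_id = 99;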
The Migration Approach
Migrating 28TB of live data to a sharded system without downtime is one of the hardest operations in database engineering. GitHub's approach:
Phase 1: Introduce Vitess as a Proxy (No Sharding Yet)
The first deployment had Vitess in unsharded mode — all traffic proxied through VTGate and VTTablet, but still pointing at a single MySQL instance.
This validated:
- Connection routing correctness
- Query compatibility (Vitess is stricter than raw MySQL about SQL validity)
- Operational tooling (alerts, dashboards, runbooks)
- Performance characteristics under production load
The application saw no behavioral change. If Vitess had catastrophic bugs at this stage, the rollback was trivial — remove the proxy.
Phase 2: Fix Incompatible Queries
Vitess is stricter about SQL than MySQL itself: queries MySQL happily executed can fail under Vitess. Common issues:
- DELETE or UPDATE statements without a primary key in the WHERE clause
- JOIN queries across shards (Vitess can execute scatter queries, but they're expensive)
- Cross-database queries (Vitess boundaries are per-keyspace)
GitHub spent several months identifying and rewriting incompatible queries. This was the longest phase.
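A hypothetical example of the kind of rewrite this phase involved — a query spanning two keyspaces is split into two single-keyspace queries, with the join moved into the application:

-- Before: crosses keyspace boundaries, which Vitess won't execute as one query
SELECT u.login, pr.title
FROM users_db.users u
JOIN prs_db.pull_requests pr ON pr.author_id = u.id
WHERE pr.repo_id = 1234;

-- After: two single-keyspace queries; the application joins the results
SELECT id, title, author_id FROM pull_requests WHERE repo_id = 1234;
SELECT id, login FROM users WHERE id IN (/* author_ids from the first query */);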
Phase 3: Move to Sharding
With the proxy stable and queries clean, the team began the actual sharding:
- Create new shards — spin up new MySQL instances for each shard
- Copy data — use Vitess's VReplication to stream data from the source to each shard while the source continues accepting writes
- Verify data — run checksums to confirm shard data matches source
- Cutover — flip Vitess routing to use the sharded topology; source becomes read-only
- Drain source — once confirmed stable, decommission the unsharded instance
VReplication is Vitess's built-in data movement tool. It uses MySQL binlog replication to stream changes continuously — keeping shards in sync with the source right up until cutover.
# VReplication workflow (simplified; keyspace and workflow names illustrative)
vtctlclient MoveTables -- --source unsharded_keyspace \
  --tables 'pull_requests,issues,comments' \
  Create sharded_keyspace.move_prs

# Monitor replication lag
vtctlclient Workflow sharded_keyspace.move_prs show

# When caught up, cut over (atomic, brief write pause)
vtctlclient MoveTables -- SwitchTraffic sharded_keyspace.move_prs

# Clean up the source once the cutover has proven stable
vtctlclient MoveTables -- Complete sharded_keyspace.move_prs

The Results
Write Throughput
With 8 shards, GitHub's write capacity scaled roughly linearly to 8x the single-primary ceiling: each shard handles one eighth of the total write load.
Schema Changes
Before Vitess: ALTER TABLE pull_requests ADD COLUMN ... on a table with hundreds of millions of rows would lock writes for hours, requiring a late-night maintenance window.
After Vitess: Vitess includes an Online Schema Change (OSC) tool that:
- Creates a shadow table with the new schema
- Copies data in batches using VReplication
- Applies ongoing changes to both tables
- Swaps the tables atomically
Result: zero-downtime schema changes on production tables, regardless of size.
# Launch a zero-downtime schema change (keyspace: github)
vtctlclient ApplySchema -- --ddl_strategy "vitess" \
  --sql "ALTER TABLE pull_requests ADD COLUMN draft TINYINT NOT NULL DEFAULT 0" \
  github

Connection Pooling
VTGate multiplexes thousands of application connections into a much smaller pool of MySQL connections. MySQL struggles with more than a few thousand simultaneous connections; VTGate handles 10,000+ application connections while maintaining a healthy connection pool to each MySQL shard.
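The downstream pool sizes are capped on the VTTablet side. A minimal sketch of the relevant flags (values illustrative; other required flags elided):

# Each VTTablet limits how many MySQL connections it opens, no matter how
# many application connections VTGate is holding upstream:
vttablet \
  --queryserver-config-pool-size 100 \
  --queryserver-config-transaction-cap 50 \
  ...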
The Hard Parts
Cross-Shard Queries
Some queries inherently need data from multiple shards:
-- "Show me all open pull requests in repos I've contributed to"
-- If repos are sharded by repo_id, this requires querying every shard
SELECT pr.* FROM pull_requests pr
JOIN contributions c ON c.repo_id = pr.repo_id
WHERE c.user_id = ?
AND pr.state = 'open';

Vitess can execute this as a scatter query — fan out to all shards, gather results, merge. But scatter queries are expensive: latency is bounded by the slowest shard, and every shard spends CPU on every query.
GitHub's solution: denormalize data and redesign access patterns to avoid cross-shard queries on hot paths. Scatter queries are acceptable for background jobs, not for low-latency API responses.
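One hypothetical shape this denormalisation can take: maintain a write-time copy keyed by the querying side, so the hot-path read becomes single-shard.

-- Hypothetical denormalised table, sharded by user_id rather than repo_id.
-- Updated whenever a pull request the user touched changes state; the
-- hot-path read ("my open PRs") then hits exactly one shard.
CREATE TABLE user_open_pull_requests (
  user_id BIGINT NOT NULL,
  pull_request_id BIGINT NOT NULL,
  repo_id BIGINT NOT NULL,
  PRIMARY KEY (user_id, pull_request_id)
);

SELECT pull_request_id, repo_id
FROM user_open_pull_requests
WHERE user_id = ?;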
Schema Design Constraints
With sharding, the primary key structure matters more. GitHub moved to Vitess sequences for auto-increment IDs:
-- Vitess sequence: generates globally unique IDs across shards
-- (the sequence table itself lives in an unsharded keyspace)
CREATE TABLE pull_request_id_seq (
  id BIGINT,
  next_id BIGINT,
  cache BIGINT,
  PRIMARY KEY (id)
) COMMENT 'vitess_sequence';

Sequence tables ensure IDs are unique across all shards, unlike per-shard MySQL AUTO_INCREMENT counters, which would collide.
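To actually hand out IDs, the sequence needs a seed row, and the sharded table's VSchema entry references it via an auto_increment clause — a minimal sketch of both (cache value illustrative):

-- Seed the sequence: a single row with id 0. "cache" controls how many
-- IDs each VTTablet reserves per round trip to the sequence table.
INSERT INTO pull_request_id_seq (id, next_id, cache) VALUES (0, 1, 1000);

-- In the sharded keyspace's VSchema, the pull_requests entry gains:
--   "auto_increment": { "column": "id", "sequence": "pull_request_id_seq" }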
The Lesson: Sharding is a Last Resort
GitHub's migration took 18 months and required significant application changes. The takeaway isn't "use Vitess" — it's a reminder that sharding should be deferred as long as possible:
Before sharding, exhaust:
- Read replicas for read scaling
- Caching (Redis, Memcached) for hot read paths
- Better indexes and query optimisation
- Vertical scaling (more CPU/RAM/faster SSD)
- Table archival (move old data to cold storage)
- Read replicas with dedicated workloads (reporting, analytics on a separate replica)
Sharding introduces irreversible complexity. Once sharded, queries must always respect shard boundaries. Schema changes require careful coordination. Cross-shard operations are expensive.
But when you've exhausted all other options and write throughput is the bottleneck — as GitHub found — Vitess provides a path to unlimited horizontal scale without abandoning the SQL model your application is built on.
Further Reading
- GitHub Engineering Blog: "Scaling GitHub's database with Vitess" (2021)
- Vitess documentation: vitess.io
- CNCF Case Studies: Vitess at GitHub