How does a database sharding layer transparently shard 10B+ messages across hundreds of shards, supporting online resharding (split/merge without downtime) and connection pooling that reduces backend connections by 100·?
Core challenge: Single MySQL can't hold 10B messages. You need hundreds of shards. But resharding (splitting a hot shard) traditionally requires downtime. And 10K app servers · 100 shards = 1M DB connections (impossible). Vitess solves both.
10B+
messages stored
growing 10TB/day
100s
MySQL shards
transparent to app
Zero
downtime resharding
online split/merge
100·
connection reduction
VTGate connection pooling
Architecture · Vitess Sharding Layer
Component
Role
How
VTGate
Query router + connection pool
Parses SQL, routes to correct shard by vindex (shard key). Pools connections: 10K apps ? 1 VTGate pool ? 100 shards
VTTablet
Shard-local agent
Sits in front of each MySQL. Handles replication, health checks, query rewriting
Topology (etcd)
Shard map + metadata
Stores which keyrange ? which shard. Updated during resharding
VReplication
Online resharding engine
Streams rows from source to target shard. Catches up binlog. Atomic cutover.
Shard key:workspace_id · all messages for one workspace on same shard. Ensures channel queries never cross shards. Vindex maps workspace_id ? keyrange ? shard. Lookup vindex for secondary access patterns.
Online resharding: ? Create target shards ? VReplication copies existing data ? Binlog streaming catches up writes ? When lag < 1s, atomic cutover (freeze writes for ~1s, switch routing) ? Old shard becomes read-only, then decommissioned. Zero downtime for reads, ~1s pause for writes.
Limitations:Cross-shard joins · not supported (denormalize or scatter-gather). Cross-shard transactions · 2PC available but expensive. Hot shards · one large workspace can overwhelm a shard (solution: dedicated shard for enterprise customers).
Real-world:Slack · Vitess for all message storage. YouTube · Vitess originated at Google (open-sourced). GitHub · Vitess for MySQL sharding. Square · Vitess for payment data. PlanetScale · managed Vitess as a service.
Scale Estimation
Step
Derivation
Result
Design Impact
1
Messages: 10B/day · 86400
~115K writes/sec
Single MySQL maxes at ~10K writes/sec ? need 12+ shards minimum
Online resharding: split hot shard into 2 via VReplication. Or: dedicated shard for enterprise customers.
Cross-shard query needed
Can't JOIN across shards
VTGate scatter-gather (slow). Better: denormalize data or use separate search index (Elasticsearch).
VTGate failure
App can't reach any shard
Multiple VTGate instances behind load balancer. Stateless · any VTGate can serve any query.
Interview Cheat Sheet
The 6 things to say for database sharding
1.Shard key = workspace_id · all data for one tenant on same shard (no cross-shard joins) 2.VTGate connection pooling · 10K apps ? 1 pool ? 100 shards (100· reduction) 3.Online resharding via VReplication · stream + catch up binlog + atomic cutover (~1s pause) 4.Vindex · maps shard key ? keyrange ? physical shard (like consistent hashing) 5.No cross-shard joins · denormalize or scatter-gather (design constraint) 6.Dedicated shards for large tenants · enterprise customers get their own shard to avoid noisy-neighbor