System Design Case Study

How does Slack handle 10+ billion messages per day across the globe?

?? Design a real-time messaging system that delivers messages to 50,000 online users within 200ms at 10B+ messages/day
Concepts Involved

How does a real-time chat system deliver one message to 50,000 online users in under 200ms · while sustaining 10B+ messages per day at peak 200K msg/sec?

Scope: Real-time delivery path only · not search, file storage, or push to offline users. The hard question: how does one message fan-out to thousands of live WebSocket connections without write amplification killing the system?
10B+
messages / day
~115K avg msg/sec
200K
msg/sec peak
~1.7· avg load
10M+
concurrent WebSockets
~20K Gateway servers
<200ms
delivery P99
~11ms same-region

Functional Requirements

What the system must do · the core user-facing behaviours

Must Have (Core)

1. User can send a text message to a channel or DM
2. All online members of a channel receive the message in real time (<200ms)
3. Messages are ordered · every user sees them in the same sequence
4. Offline users receive missed messages on reconnect (catch-up)
5. Users can see presence · who is currently online in a channel
6. Messages are persisted durably · never lost after the sender gets an ACK

Out of Scope (this design)

? Full-text message search (separate Elasticsearch pipeline)
? Push notifications to mobile when app is backgrounded (APNs / FCM)
? File / media uploads (separate blob storage path)
? Reactions, threads, edits, deletes (same delivery path, different events)
? User authentication and workspace management
? Read receipts at scale (separate aggregation problem)

Non-Functional Requirements

The quality constraints that shape every architectural decision

PropertyTargetWhy It Matters / Design Impact
LatencyP99 <200ms same-region · <250ms cross-regionChat feels broken above ~500ms. Drives WebSocket over polling, Redis over DB for hot path.
Throughput200K msg/sec peak sustainedRequires Kafka (1M+ msg/sec headroom), partitioned fan-out workers, and Gateway horizontal scale.
Availability99.99% (<52 min/year downtime)No single point of failure. Gateway, Kafka, Redis, Vitess all multi-node. GeoDNS for region failover.
DurabilityZero message loss after producer ACKKafka acks=all + min.insync.replicas=2. Vitess synchronous replication within shard.
ConsistencyPer-channel ordering (not global)All members see messages in the same order within a channel. Cross-channel ordering not required.
ScalabilityLinear scale to 10M+ concurrent WSStateless fan-out workers + stateful Gateways (each holds ~500K connections). Add nodes independently.
Fault toleranceSurvive single-node failures transparentlyClient reconnects + seq-based catch-up means a Gateway crash is invisible to the user within seconds.
SecurityTLS everywhere · workspace isolationAll WS connections over WSS. Vitess shard-per-workspace prevents cross-tenant data leaks.
Key tension: Availability vs. Consistency. Slack chooses availability · a message may arrive 80ms later for a cross-region user, but the system never blocks. Ordering is guaranteed within a channel, not globally.

Scale Estimation

Given numbers from the question ? derive infrastructure sizing. Show this reasoning in an interview.

Given (from question): 10B messages/day · 50,000 online users per channel · <200ms P99 delivery · 10M+ concurrent users
StepWhat to DeriveCalculationResultDesign Decision
1 Avg msg/sec 10B · 86,400 sec/day ~115K msg/sec Baseline throughput · Kafka must sustain this 24/7
2 Peak msg/sec 115K · 1.7· peak factor ~200K msg/sec Size Kafka partitions and fan-out workers for peak, not avg
3 Gateway servers needed 10M connections · 500K conn/server (epoll limit) ~20 Gateway servers Each Gateway holds 500K persistent WebSockets; Redis maps user ? gateway
4 Fan-out ops/sec (worst case) 200K msg/sec · 50K members/channel 10B ops/sec ? Impossible ? must skip offline. Only ~10% online ? 200K · 5K = 1B ops/sec ?
5 Kafka partitions 200K msg/sec · ~20 msg/sec per partition (safe throughput) ~10K partitions 1 partition per hot channel; hash-assigned for the rest. 1 fan-out worker per partition.
6 Message storage / day 10B msgs · 1KB avg payload ~10TB/day Vitess sharded by workspace_id. 7-day retention ? ~70TB total. Kafka retention separate.
7 Redis memory (presence) 10M users · 50 bytes (user_id + gateway mapping) ~500MB Fits comfortably in Redis. Channel online sets add ~100MB more. Total <1GB hot data.
8 Latency budget breakdown 200ms budget - 80ms cross-region gRPC = 120ms same-region budget ~11ms actual Kafka (~5ms) + Redis (~1ms) + gRPC (~3ms) + WS push (~2ms) = ~11ms. Well within 200ms budget.
Interview tip: Start with step 1·2 (traffic), then step 3 (connections), then step 4 (fan-out math) · that's where the key insight lives. The 10B ops/sec impossibility is what forces the hybrid fan-out design. Derive the architecture from the numbers, not the other way around.

APIs

Three surfaces: WebSocket (real-time delivery) · REST (send + history) · Internal gRPC (fan-out worker ? Gateway)

1
Client fetches WSS endpoint via REST
Then upgrades the connection:
GET /v1/ws?token=<jwt>&workspace_id=<id>
Upgrade: websocket
Connection: Upgrade
2
Server responds
101 Switching Protocols 401 Unauthorized 429 Too Many Requests
3
Client sends hello frame · declares last known position per channel
{ "type": "hello", "last_seq_map": { "C123": 47, "C456": 12 } }
4
Server replies with hello_ack + backfill for any gaps
{ "type": "hello_ack", "user_id": "U789", "session_id": "sess_abc" }
// If gaps detected:
{ "type": "message.backfill", "channel_id": "C123", "messages": [...] }
C ? S hello
Declare last known seq per channel. Triggers backfill for any gaps.
{ last_seq_map: { channel_id: seq } }
Close code:4001 if token invalid
C ? S message.send
Send a message. client_msg_id is a client-generated UUID for dedup on retry.
{ channel_id, text, client_msg_id }
C ? S ping
Heartbeat · Gateway closes connection if no ping received within 30s.
{ ts }
Close code:1001 if no ping in 30s
S ? C hello_ack
Confirms auth succeeded. Client can now send messages.
{ user_id, session_id }
S ? C message.new
Real-time push of a new message. Client renders and advances last_seq.
{ channel_id, channel_seq, user_id, text, ts }
S ? C message.ack
Confirms server received and persisted the sent message.
{ client_msg_id, channel_seq, ts }
S ? C message.backfill
Catch-up batch after hello or when client detects a gap in sequence numbers.
{ channel_id, messages: [...], has_more }
S ? C presence.update
Presence change broadcast to all online channel members.
{ user_id, channel_id, status: "online" | "away" | "offline" }
S ? C error
Error response for invalid operations.
{ code, message, ref_id }
// codes: "not_in_channel", "rate_limited", "msg_too_long"
S ? C pong
Heartbeat reply.
{ ts }
Why WebSocket over SSE or long-polling? SSE is server?client only · can't send messages. Long-polling adds 1 RTT per message and hammers the server. WebSocket is full-duplex, persistent, and handles 500K connections per server with epoll. The only tradeoff is stateful connections requiring sticky routing · solved by the Redis user:{id}:gateway mapping.

Architecture Overview

Write path (send) ? Kafka ? Fan-out workers ? Gateway ? WebSocket push to clients

WRITE PATH · User A sends "hello" in #general User A WebSocket Gateway Server WS termination Message Service validate + persist Vitess/MySQL Kafka partition key = channel_id FAN-OUT PATH · Deliver to all online members of #general Fan-out Worker consumes from Kafka partition = channel_id Redis channel:123:online ? {user_ids} user:456 ? gateway-host-7 Route to Gateways group users by gateway gRPC / Redis Pub/Sub Gateway-1 (500K conn) Gateway-2 (500K conn) Gateway-N (500K conn) ?????? (WS push) ???? (WS push) ???????? (WS push) Latency Breakdown Kafka consume: ~5ms Redis lookup: ~1ms Gateway push: ~2ms gRPC to gateway: ~3ms Total: ~11ms same-region +80ms cross-region KEY DESIGN DECISIONS Connection Management 1 persistent WS per user ~500K conn per Gateway (epoll) Redis: user:{id} ? gateway-host disconnect ? remove mapping Hybrid Fan-out <100 members: push to ALL 100-50K: push to ONLINE only >50K: write log, clients pull Skip offline = save 90% work Ordering Guarantee Kafka partition key = channel_id ? strict order per channel Each msg gets channel_seq (monotonic) Client detects gaps ? backfill Vitess Sharding Shard key = workspace_id Each workspace on 1 shard Large enterprise ? dedicated shard Vitess handles resharding Multi-Region: User connects to nearest Gateway (GeoDNS). Kafka in primary region. Cross-region delivery via internal gRPC (~80ms added). Same-region P99: <100ms | Cross-region P99: <250ms | Presence: Redis SET per channel (online members only) Offline users: on reconnect, client sends last_seen_seq ? server returns missed messages from Vitess (catch-up read)

Data Model

Hot path (Redis, sub-ms) for presence and routing · Cold path (Vitess/MySQL) for durable message history

HOT PATH · Redis (sub-ms, ephemeral) channel:{id}:online ? SET{user_ids} user:{id}:gateway ? "gw-host-7" channel:{id}:members ? SET{all} Ephemeral · rebuilt from MySQL on restart. Failure = delivery disruption, NOT data loss. COLD PATH · Vitess/MySQL (durable) messages(workspace_id, channel_seq) channel_members(workspace_id) Shard key = workspace_id. All workspace data on one shard ? SQL joins work within workspace. Large enterprise ? dedicated shard. cache rebuild
-- -- COLD PATH: Vitess/MySQL (sharded by workspace_id) ----------------------

CREATE TABLE messages (
  workspace_id   BIGINT NOT NULL,    -- shard key: all workspace data on one shard
  channel_id     BIGINT NOT NULL,
  channel_seq    BIGINT NOT NULL,    -- monotonic per channel (gap detection + backfill)
  user_id        BIGINT NOT NULL,
  client_msg_id  VARCHAR(64),        -- client UUID for idempotent dedup (indexed)
  content        TEXT,
  ts             TIMESTAMP(6),
  PRIMARY KEY (workspace_id, channel_id, channel_seq),
  UNIQUE KEY uq_client_msg (workspace_id, user_id, client_msg_id)
  -- PK doubles as the range-scan index for catch-up reads
);

CREATE TABLE channel_members (
  workspace_id   BIGINT NOT NULL,
  channel_id     BIGINT NOT NULL,
  user_id        BIGINT NOT NULL,
  joined_at      TIMESTAMP,
  PRIMARY KEY (workspace_id, channel_id, user_id)
);

-- -- HOT PATH: Redis (sub-ms, ephemeral) -------------------------------------

channel:{id}:online    ? SET of user_ids currently connected (updated on WS connect/disconnect)
user:{id}:gateway      ? STRING "gateway-host-7"  (which Gateway holds this user's WS)
channel:{id}:members   ? SET of all user_ids       (cached from MySQL, TTL-refreshed)

-- Fan-out worker reads channel:{id}:online ? groups by gateway ? gRPC batch push
-- On disconnect: DEL user:{id}:gateway + SREM channel:{id}:online {user_id}
Hot/cold split: Redis holds only live session state · it's ephemeral and can be rebuilt from MySQL on restart. MySQL holds the durable ordered log. This means a Redis failure doesn't lose messages; it only temporarily disrupts real-time delivery until presence is rebuilt.
Sharding key choice: workspace_id (not channel_id) keeps all data for a workspace on one shard, enabling SQL joins and transactions within a workspace. Large enterprise customers get a dedicated shard to prevent noisy-neighbour problems.

Fan-out Strategy · The Hard Part

A 50K-member channel with 200K msg/sec would require 10 billion writes/sec if you pushed to every member. The fix: push only to online members (~10% of total).

Hybrid Fan-out · Which strategy based on channel size? <100 members · Write-time fan-out Kafka Fan-out ?? ?? ?? Push to ALL members immediately ? Lowest latency Fan-out cost is tiny · latency wins 100·50K members · Hybrid Kafka Redis online set ?? online ?? offline ? Push to ONLINE only, skip offline ? 90% less work (5K online / 50K total) Offline users catch up on reconnect via seq >50K members · Read-time Kafka Channel log ?? pull ?? pull Write to log once, clients stream cursor ? O(1) write cost regardless of members Write amplification at 50K+ is prohibitive The math: 200K msg/sec · 50K members = 10B fan-out ops/sec ?200K · 5K online = 1B ops/sec ? · skipping offline is the #1 scaling lever
Channel SizeStrategyWhy
<100 membersWrite-time fan-out · push to every member's Gateway immediatelyFan-out is tiny; latency matters more than write cost
100 · 50K membersHybrid · push to online members only, skip offline50K members but only ~5K online ? 90% less work. Offline users catch up on reconnect.
>50K membersRead-time fan-out · write to channel log; clients stream from cursorWrite amplification at this scale is prohibitive regardless of online ratio
Presence tracking: channel:{id}:online ? Redis SET, updated on WS connect/disconnect. Fan-out worker iterates this set · not the full membership list. For a 50K-member channel with 5K online, that's 90% fewer Gateway calls per message. See Scale Estimation (step 4) for the full math.
Deep dive: For channels with 1M+ members (Telegram scale), even the hybrid approach breaks. See Telegram Large Group Fan-out for the read-time pull architecture that handles 20· this scale.

Presence Protocol

How does the system know who is online? Heartbeat-based presence with Redis TTL · no gossip protocol needed at this scale

Presence Protocol · Heartbeat + TTL Client sends ping every 15s via WebSocket Gateway Server tracks last_ping_ts closes if no ping in 30s Redis Presence Store SADD channel:{id}:online user_id EXPIRE key 45s (auto-cleanup) Presence Broadcast presence.update event to channel online members Other Clients update UI indicator green dot / away Timeout Path · No Ping for 30s Gateway: close WS conn Redis: SREM channel:{id}:online user Redis: DEL user:{id}:gateway Broadcast: presence.update ? offline Safety Net · Redis TTL Auto-Expiry (45s) Even if Gateway crashes without cleanup, Redis key expires in 45s ? stale presence auto-removed. No manual GC needed.

Heartbeat Mechanism

Client ping: Every 15 seconds via WebSocket
Gateway timeout: No ping for 30 seconds ? close connection
Redis TTL: Presence keys expire in 45 seconds (safety net)
Jitter: Clients add ·3s random jitter to avoid thundering herd on reconnect

State Transitions

online ? WS connected + ping received within 30s
away ? No user activity for 5 min (client sends status change)
offline ? WS closed OR no ping for 30s OR Redis TTL expired
Broadcast: Only on state transitions, not on every heartbeat

Scalability Considerations

No gossip: Centralized Redis · simpler than distributed gossip at 10M users
Batch updates: Gateway batches presence changes every 5s to Redis (not per-ping)
Large channels: Presence broadcast throttled to 1 update/sec for channels >10K
Memory: 10M users · 50 bytes = ~500MB · fits in single Redis instance
Interview insight: Presence is an eventually consistent system · a user might appear online for up to 45s after crashing. This is acceptable for chat. The alternative (synchronous presence with consensus) would add latency to every message delivery.
Deep dive: At 100M+ concurrent users (WhatsApp scale), centralized Redis breaks. See Presence at 100M Scale for the distributed sharded presence architecture with subscription-based delivery.

Ordering & Consistency

All messages in #general must arrive in the same order for every user · achieved via Kafka partition-per-channel + monotonic sequence numbers

Message Ordering · End-to-End Sequence User A (sender) Gateway Msg Service Kafka P0 Fan-out Worker Gateway User B (recv) message.send { channel_id, text, client_msg_id } validate + persist publish(key=channel_id) ? seq=48 assigned by Msg Service consume (ordered, offset++) gRPC PushMessages message.new { channel_seq: 48, text } message.ack { channel_seq: 48 } User B sees seq 48 after 47 ? no gap. If seq jumps 47?50, client requests backfill for 48,49 from Vitess.

How Ordering Works

1. All messages for #general ? same Kafka partition (key = channel_id)
2. Single consumer per partition ? messages processed in strict order
3. Each message assigned channel_seq · monotonic increment per channel
4. Client tracks last received seq. Gap detected ? request backfill from Vitess
5. On reconnect: client sends last_seq=47 ? server returns msgs 48+

Failure Scenarios

WS drop: Client reconnects, sends last_seq ? backfill from Vitess
Gateway crash: All users on that gateway reconnect; Redis mapping updated
Duplicate delivery: Client deduplicates by channel_seq (idempotent render)
@channel in 50K: Rate-limit fan-out, stagger over 1·2s to avoid thundering herd
Cross-region lag: EU user sees msg ~80ms after US user · acceptable for chat

Guarantees Provided

Per-channel ordering: Kafka partition key = channel_id ensures strict order within a channel
No cross-channel ordering: Different channels may be on different partitions · that's fine
At-least-once delivery: Gaps detected client-side; backfill closes them
Idempotent render: Duplicate messages silently dropped by seq dedup

Sequence Number Generation

How does the system assign monotonic channel_seq without becoming a bottleneck? Single-writer-per-channel via Kafka partition ownership.

channel_seq Assignment · Single Writer Pattern ? Naive Approach: DB Auto-Increment · Multiple Message Service instances ? race condition on INSERT · DB lock contention at 200K msg/sec ? bottleneck · Gaps in sequence if transaction rolls back ? Solution: Kafka Partition = Single Writer · One fan-out worker owns each partition (= channel) · Worker maintains in-memory counter + Redis atomic INCR · No contention: only one process writes seq for a channel Sequence Assignment Flow Msg Service validate only Kafka partition key=channel_id Fan-out Worker single consumer Redis INCR channel:seq atomic, returns new seq Vitess INSERT with channel_seq Push to clients with seq attached

Why This Works

Single writer: Kafka guarantees one consumer per partition ? no race conditions
Redis INCR: Atomic O(1) operation, ~0.1ms latency ? no bottleneck
Crash recovery: New consumer reads last seq from Redis ? resumes from correct position
No gaps: Unlike DB auto-increment, Redis INCR never creates gaps (no rollbacks)

Edge Cases

Partition rebalance: New consumer reads current seq from Redis before processing
Redis failure: Fall back to Vitess SELECT MAX(channel_seq) · slower but correct
Duplicate INCR: If worker crashes after INCR but before Vitess write ? seq gap. Client backfill handles gracefully (empty gap = no-op)
Hot channel: One partition = one worker. If channel is too hot, split into sub-channels at app layer.
Interview tip: This is a common follow-up: "How do you generate monotonic IDs without a single point of failure?" The answer is partition-level single writer · Kafka gives you distributed single-writer semantics for free. Each partition has exactly one active consumer, so no coordination needed.

Resilience & Edge Cases

Five failure modes and how the architecture handles each without data loss or visible gaps

Gateway Crash · What Happens Next Gateway-7 crashes ?? 500K connections lost Clients reconnect TCP timeout ? any Gateway Redis mapping updated user:{id} ? new-gateway-host Client sends last_seq ? backfill missed messages returned from Vitess Result: crash is invisible to users within ~5·10s. No messages lost. Fan-out resumes immediately via new Redis mapping.
#Failure / Edge CaseWhat BreaksHow It's Handled
1WebSocket drops mid-sessionClient misses messages sent while disconnectedOn reconnect, client sends last_seq ? server backfills from Vitess. Gap closed transparently.
2Gateway server crashAll ~500K connections on that host lostClients reconnect to any Gateway. Redis user:{id}:gateway mapping updated. Fan-out resumes immediately.
3Duplicate deliverySame message rendered twice in UIClient deduplicates by channel_seq · idempotent render, second copy silently dropped.
4@channel in 50K-member channelThundering herd: 50K fan-out ops at onceRate-limit fan-out worker; stagger delivery over 1·2s for huge channels. Acceptable for broadcast notifications.
5Kafka consumer lag spikeFan-out falls behind; messages delayedPartition count = number of hot channels. Scale fan-out workers horizontally. Alert on consumer lag > threshold.
Key insight: The system tolerates delivery failures by making the client responsible for gap detection. The server never needs to track per-user delivery state · it just stores the ordered log and lets clients request what they missed.

Backpressure & Flow Control

What happens when a Gateway is overwhelmed, Kafka consumers lag, or a viral message hits a 50K channel? Layered defense from client to storage.

Backpressure · 4 Layers of Defense ? Client Rate Limit · Max 1 msg/sec per user per channel · Exponential backoff on 429 response · Client-side queue with dedup · WS error frame: "rate_limited" Prevents: single user flooding a channel ? Gateway Admission · Connection limit: 500K per server · New connections rejected at capacity · Per-workspace msg/sec quota · Slow consumers: buffer 100 msgs then drop Prevents: Gateway OOM / connection exhaustion ? Fan-out Circuit Breaker · gRPC deadline: 50ms per PushMessages · Circuit opens after 5 consecutive fails · Half-open: retry 1 req every 10s · Stagger large channels over 1-2s Prevents: slow Gateway cascading to all workers ? Kafka Consumer Lag · Alert on lag > 10K messages · Auto-scale fan-out workers · Partition reassignment on crash · Max poll records: 500 (batch size) Prevents: unbounded delivery delay Degradation Cascade · What Happens Under Extreme Load Stage 1 (normal): All messages delivered in <200ms. No backpressure active. Stage 2 (elevated): Consumer lag rising ? auto-scale fan-out workers. Delivery at 200-500ms. Users don't notice. Stage 3 (degraded): Gateway buffers full ? drop oldest undelivered msgs for slow clients. Client detects gap via seq ? backfill from Vitess. Stage 4 (critical): Circuit breakers open ? large channels switch to read-time fan-out. Small channels still get real-time delivery.

Key Principle: Graceful Degradation

Never lose messages: Even under extreme load, messages are persisted in Kafka + Vitess. Delivery may be delayed, but data is never lost.
Prioritize small channels: Under load, large broadcast channels degrade first (switch to pull). DMs and small channels keep real-time delivery.
Client self-heals: Gap detection + backfill means any dropped delivery is automatically recovered on the client side.

Monitoring & Alerts

Kafka consumer lag: Alert if any partition > 10K messages behind
Gateway connection count: Alert at 80% capacity (400K connections)
gRPC error rate: Alert if > 1% of PushMessages calls fail
P99 delivery latency: Alert if > 500ms (2.5· budget)
Circuit breaker state: Alert on any circuit opening
Interview insight: Interviewers love hearing about graceful degradation. The system doesn't crash under 10· load · it progressively reduces real-time guarantees for large channels while protecting small channels and DMs. Messages are never lost, only delayed.

Multi-Region Architecture

How does a message sent in US-East reach a user in EU-West within 250ms? GeoDNS + regional Kafka clusters + async cross-region replication

Multi-Region Message Delivery ???? US-East (Primary for workspace) Gateway US Msg Service Kafka US Vitess US Fan-out US Redis US ?????? US users ? ~11ms delivery MirrorMaker / Cross-Region Replication ~60-80ms ???? EU-West (Read Replica) Gateway EU Kafka EU Fan-out EU Redis EU ???? EU users ? ~80-90ms total delivery Cross-Region Delivery Timeline t=0ms User A sends message (US-East) t=5ms Kafka US produces + Vitess persists t=11ms US users receive message ? t=5ms MirrorMaker replicates to Kafka EU t=65ms Message arrives in Kafka EU (~60ms network) t=70ms EU Fan-out worker consumes from Kafka EU t=72ms Redis EU lookup: online EU members t=76ms gRPC to EU Gateways t=80ms EU users receive message ? Total: ~80ms cross-region · well within 250ms budget

Replication Strategy

Kafka MirrorMaker 2: Async replication US?EU, EU?US
Lag: ~60-80ms (network RTT between regions)
Vitess: Read replicas in each region for history queries
Redis: Independent per region · each region tracks its own online users

Write Routing

Workspace home region: All writes for a workspace go to its primary region
GeoDNS: Routes user to nearest Gateway, but writes proxy to primary
EU user sends msg: Gateway EU ? Msg Service US (primary) ? Kafka US ? replicate to EU
Extra latency: ~80ms for cross-region write · acceptable for sender ACK

Failover

Region down: GeoDNS removes unhealthy region in ~30s
Promote replica: EU Kafka becomes primary; Vitess promotes read replica
Data loss window: Up to ~80ms of unreplicated messages (async replication)
Recovery: Clients reconnect, send last_seq ? backfill closes any gaps
Key tradeoff: Async replication means EU users see messages ~80ms after US users. The alternative · synchronous cross-region writes · would add 80ms to every message for every user. Slack chooses availability over strict consistency across regions.
Deep dive: For systems requiring causal ordering guarantees across regions (not just eventual consistency), see Multi-Region Message Ordering · covers HLC, vector clocks, and conflict resolution during region failover.

Tech Stack & Tradeoffs

Each choice made for a specific reason · and what was rejected

ComponentTechnologyWhy ThisWhy Not X
WebSocket GatewayCustom (Go / Java)500K conn per server via epoll; full control over heartbeat, reconnect, backpressureNginx/HAProxy: not designed for stateful long-lived connections at this density
Message BusKafkaPartition by channel_id ? strict ordering. Durable, replayable. 1M+ msg/sec.RabbitMQ: no replay, lower throughput, fan-out harder. SQS: no ordering across messages.
Primary StorageVitess (MySQL)Shard by workspace_id · all workspace data co-located. Transparent resharding. SQL joins work within shard.Cassandra: no transactions, harder to query by seq range. DynamoDB: vendor lock-in, cost at this scale.
Hot Data / PresenceRedisSub-ms SET/GET. Channel online sets + user?gateway mapping. Pub/Sub for internal fan-out.Memcached: no sets/sorted sets. In-memory on Gateway: doesn't survive restarts, not shared.
Internal RPCgRPCFan-out worker ? Gateway. Multiplexed, low latency, typed contracts via protobuf.REST/HTTP: higher overhead per call. Raw TCP: no service discovery, no load balancing.
SearchElasticsearchFull-text search across message history. Separate from delivery path · no latency coupling.MySQL FULLTEXT: doesn't scale to 10B messages. Solr: operationally heavier.
DNS RoutingGeoDNSRoutes each user to nearest Gateway region. Reduces cross-region hops by ~80ms.Single-region: unacceptable latency for global users. Anycast: complex to operate.
Decision rule: Stateful hot path (connections, presence) ? Redis. Ordered durable log (messages, fan-out) ? Kafka. Queryable history (catch-up, search) ? Vitess + Elasticsearch. Each component owns exactly one concern · no overlap.
Real-world validation: Slack uses this exact stack. Discord uses a similar pattern (Cassandra instead of Vitess). WhatsApp uses Erlang/BEAM for the gateway layer instead of Go/Java · same architectural shape, different runtime.

Interview Cheat Sheet

The 8 things an interviewer wants to hear · say these and you've covered the design

? WebSocket + Redis routing
Each user holds one persistent WebSocket to a Gateway server. Redis maps user:{id} ? gateway-host. Fan-out worker looks up this mapping to know which gateway to call.
? Kafka partition per channel
Partition key = channel_id. All messages for a channel land on the same partition ? strict ordering guaranteed. One fan-out worker per partition ? no race conditions.
? Hybrid fan-out (skip offline)
Push only to online members (Redis SET). 50K members but 5K online = 90% less work. Channels >50K use read-time fan-out · write once to log, clients pull.
? Monotonic channel_seq + backfill
Every message gets a channel_seq. Client tracks last seen. Gap detected ? request backfill from Vitess. On reconnect, client sends last_seq ? server returns missed messages.
? Vitess sharded by workspace_id
All data for a workspace on one shard ? SQL joins work. Large enterprise gets dedicated shard. Vitess handles transparent resharding as workspaces grow.
? Presence via heartbeat + Redis TTL
Client pings every 15s. Gateway closes on 30s timeout. Redis SET with 45s TTL auto-expires stale presence. Broadcast only on state transitions · not every heartbeat. Eventually consistent (up to 45s stale).
? Multi-region: async replication + local fan-out
Writes go to workspace's primary region. Kafka MirrorMaker replicates to other regions (~60-80ms). Each region has local fan-out workers + Redis + Gateways. EU users see messages ~80ms after US users · acceptable for chat.
? Backpressure: 4-layer graceful degradation
Client rate limit ? Gateway admission control ? Fan-out circuit breaker ? Kafka consumer auto-scale. Under extreme load, large channels degrade to pull-based while DMs keep real-time. Messages never lost, only delayed.
One-liner answer: "WebSocket per user with Redis routing, Kafka partitioned by channel for ordering, hybrid fan-out to online-only members, monotonic sequence numbers for gap detection and catch-up, Vitess sharded by workspace, heartbeat-based presence with Redis TTL, async multi-region replication via MirrorMaker, and 4-layer backpressure for graceful degradation."