System Design Case Study

How does WhatsApp guarantee message delivery to 2B+ users who go offline, with end-to-end encryption?

Design a messaging system that guarantees zero message loss for 2B+ users who go offline for hours or days, with E2E encryption where the server never sees plaintext.

How does a messaging system guarantee delivery to 2B+ users who go offline for hours/days, ensuring zero message loss, correct ordering on reconnect, and end-to-end encryption without the server ever seeing plaintext?

Key difference from Slack: Slack assumes users are mostly online (WebSocket push). WhatsApp must handle the offline-first case: users may be offline for days, on unreliable mobile networks, with intermittent connectivity. See Slack Real-Time Messaging for the online delivery path.
2B+ monthly active users
100B+ messages / day
E2E encrypted (Signal Protocol)
0 messages lost after ACK

Functional Requirements

What the system must do, focused on offline delivery and encryption

Must Have (Core)

1. Messages are never lost once the sender receives the delivery ACK
2. Offline users receive all missed messages on reconnect, in correct order
3. End-to-end encryption: server stores only ciphertext, never plaintext
4. Delivery receipts: single tick (sent) → double tick (delivered) → blue tick (read)
5. Messages expire from server after delivery (30-day TTL for undelivered)
6. Support media messages (photos, video, documents) with same guarantees

Out of Scope

Group messaging fan-out (see Slack Fan-out Strategy)
Real-time online delivery path (see Slack Architecture)
Voice/video calling signaling
Status/Stories feature
Business API and chatbots

Non-Functional Requirements

Constraints shaped by mobile-first, offline-heavy usage patterns

| Property | Target | Design Impact |
| --- | --- | --- |
| Durability | Zero message loss after sender ACK | Write-ahead log + synchronous replication before ACK. Messages stored until recipient confirms delivery. |
| Latency | <300ms when both online | Direct push via persistent connection when recipient is online. No store-and-forward delay. |
| Offline tolerance | Up to 30 days offline | Server queues messages per recipient. TTL-based expiry after 30 days. Recipient pulls on reconnect. |
| Security | E2E encrypted; server is zero-knowledge | Signal Protocol (Double Ratchet). Server stores ciphertext only. Key agreement is computed on the clients; the server only relays public prekeys. |
| Ordering | Per-conversation ordering preserved | Sender-assigned sequence numbers. Recipient reorders on delivery. No global ordering needed. |
| Bandwidth | Minimal; 2G/3G friendly | Binary protocol (not JSON). Compressed payloads. Resumable media uploads. Delta sync on reconnect. |
| Scale | 2B+ users, 100B+ msg/day | Shard by user_id. Each user's queue is independent. Horizontal scale with no cross-shard coordination. |
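
The "shard by user_id" decision can be illustrated with a stable hash of the user ID. This is a minimal sketch, not WhatsApp's actual routing: `shard_for` and `NUM_SHARDS` are hypothetical names, and the shard count is an arbitrary placeholder.

```python
import hashlib

NUM_SHARDS = 4096  # hypothetical shard count, not a WhatsApp figure


def shard_for(recipient_id: str, num_shards: int = NUM_SHARDS) -> int:
    """Map a recipient to a shard using a stable hash of the user ID."""
    digest = hashlib.sha256(recipient_id.encode("utf-8")).digest()
    # Python's built-in hash() is salted per process, so it would not give
    # the same answer on every server; a cryptographic digest does.
    return int.from_bytes(digest[:8], "big") % num_shards
```

Because every server computes the same shard for a given user, all of that user's queued messages land on one shard and no cross-shard coordination is needed. A production system would more likely use consistent hashing so that adding shards moves only a fraction of users.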

Scale Estimation

Derive infrastructure sizing from the given numbers

| Step | Derivation | Result | Design Decision |
| --- | --- | --- | --- |
| 1 | 100B msgs/day ÷ 86,400 s | ~1.16M msg/sec | 10× Slack's throughput; need sharded message queues |
| 2 | 2B users × ~30% online at any time | ~600M concurrent connections | ~1,200 connection servers (500K connections each) |
| 3 | Avg user offline 8-16 hrs/day → ~50 msgs queued per user | ~100B queued msgs at peak | Need efficient per-user message queue storage |
| 4 | Avg message 1KB (encrypted) × 100B queued | ~100TB queued storage | Distributed KV store sharded by recipient_id |
| 5 | Media: 10% of msgs have media, avg 500KB | ~5PB media/day | Blob storage with CDN. Encrypted client-side before upload. |
| 6 | Reconnect storm: 100M users come online in 1 hour (morning) | ~28K reconnects/sec | Queue drain must handle burst. Stagger delivery over seconds. |
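
The estimates above can be checked with a few lines of arithmetic. The sketch below uses only the inputs stated in the table; the variable names are illustrative. Note that 600M connections at 500K per server works out to about 1,200 servers.

```python
SECONDS_PER_DAY = 86_400

msgs_per_day = 100_000_000_000      # 100B messages/day
users = 2_000_000_000               # 2B users
online_percent = 30                 # ~30% online at any time
conns_per_server = 500_000          # 500K connections per server
queued_per_user = 50                # ~50 messages queued per offline user
avg_msg_bytes = 1_000               # 1KB per encrypted message
media_percent = 10                  # 10% of messages carry media
avg_media_bytes = 500_000           # 500KB per media item
reconnecting_users = 100_000_000    # 100M users reconnect within one hour

concurrent = users * online_percent // 100
queued_msgs = users * queued_per_user

estimates = {
    "msgs_per_sec": msgs_per_day / SECONDS_PER_DAY,
    "concurrent_connections": concurrent,
    "connection_servers": concurrent // conns_per_server,
    "queued_msgs": queued_msgs,
    "queue_bytes": queued_msgs * avg_msg_bytes,
    "media_bytes_per_day": msgs_per_day * media_percent // 100 * avg_media_bytes,
    "reconnects_per_sec": reconnecting_users / 3600,
}

for name, value in estimates.items():
    print(f"{name}: {value:,.0f}")
```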

Architecture Overview

Store-and-forward with E2E encryption: the server is a dumb encrypted mailbox

WhatsApp Message Flow: Store-and-Forward with E2E Encryption

Send path

1. Sender (Alice) encrypts the message with Bob's public key (Signal) and sends the ciphertext.
2. Connection Server holds Alice's TCP connection, validates, and routes.
3. Message Queue (per-user) stores the ciphertext until the recipient confirms delivery.
4. Bob ONLINE → push immediately via Bob's TCP connection. Bob OFFLINE → store in queue and send a push notification (FCM).
5. Bob decrypts with his private key → ✓✓

Reconnect flow: Bob comes back online after 8 hours

1. Bob connects: auth + send last_ack_seq.
2. Server drains the queue → pushes all pending messages.
3. Bob ACKs each message → ✓✓
4. Server deletes the ACKed messages from the queue.

Result: Bob receives all 50 missed messages in order. The server queue is now empty. Alice sees ✓✓ for each.

E2E encryption: Signal Protocol (Double Ratchet)

Key Exchange (X3DH): Identity key + signed prekey + one-time prekey → shared secret. Happens once per conversation.
Double Ratchet (per message): Each message gets a unique key. Forward secrecy: compromising one key does not let an attacker decrypt past or future messages.
Server sees ONLY ciphertext: Cannot decrypt even with a court order. Metadata visible: who → whom, when. Content: fully opaque to the server.
Recipient decrypts locally: Private key never leaves the device. Backup encryption: separate key managed by the user (not WhatsApp).
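
The claim that each message gets a unique key can be illustrated with the symmetric-key half of the Double Ratchet. The sketch below is deliberately partial: it omits X3DH and the Diffie-Hellman ratchet, and the initial chain key is a placeholder that a real session would derive from the key exchange. It uses HMAC-SHA256 with the single-byte constants 0x01 and 0x02, as the Double Ratchet specification suggests for the chain KDF. The function names are illustrative.

```python
import hashlib
import hmac


def kdf_chain_step(chain_key: bytes) -> tuple[bytes, bytes]:
    """Advance the chain one step; return (next_chain_key, message_key)."""
    message_key = hmac.new(chain_key, b"\x01", hashlib.sha256).digest()
    next_chain_key = hmac.new(chain_key, b"\x02", hashlib.sha256).digest()
    return next_chain_key, message_key


def derive_message_keys(initial_chain_key: bytes, count: int) -> list[bytes]:
    """Derive `count` per-message keys by stepping the chain repeatedly."""
    keys = []
    chain_key = initial_chain_key
    for _ in range(count):
        chain_key, message_key = kdf_chain_step(chain_key)
        keys.append(message_key)
    return keys


# Placeholder secret for illustration only; never use a fixed key in practice.
demo_keys = derive_message_keys(b"\x00" * 32, 3)
```

Because HMAC is one-way, holding a later chain key does not reveal earlier message keys once the old chain keys have been deleted. Recovering after a compromise relies on the Diffie-Hellman ratchet, which this sketch leaves out.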

Per-User Message Queue

Each user has an independent queue; no cross-user coordination is needed. Sharded by recipient_id.

Per-User Queue: Append-Only Log with ACK-Based Deletion

Queue for user:Bob (recipient_id = B456)
msg_seq=1, 2, 3: delivered (ACKed), will be deleted
msg_seq=4, 5: pending delivery

Append-only: new messages are added at the tail
Bob's last_ack_seq = 3 → server knows to deliver from seq=4
After Bob ACKs seq=4,5 → those entries are deleted (or marked)
TTL: undelivered msgs expire after 30 days

Storage: Mnesia / Custom KV (Erlang)
Key: (recipient_id, msg_seq)
Value: { sender_id, ciphertext, timestamp, media_url? }
Shard key: recipient_id (all of Bob's msgs on one node)
Replication: 3 replicas across AZs (sync write to 2)
Deletion: On recipient ACK → hard delete (privacy)
Capacity: ~100TB total queued msgs across all users

Key difference from Slack: Slack stores messages permanently in Vitess (they're the system of record). WhatsApp deletes messages from the server after delivery; the server is just a temporary mailbox. The phone is the system of record. See Slack Data Model for the permanent storage approach.
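
A minimal in-memory sketch of the per-user queue described above: append at the tail, drain everything after last_ack_seq, hard-delete on ACK, and expire after 30 days. The class and method names are hypothetical. The sketch assumes the server assigns the queue's msg_seq, and it ignores replication and persistence entirely.

```python
import time
from dataclasses import dataclass, field

TTL_SECONDS = 30 * 24 * 3600  # 30-day TTL for undelivered messages


@dataclass
class QueuedMessage:
    msg_seq: int
    sender_id: str
    ciphertext: bytes  # the server never sees plaintext
    timestamp: float


@dataclass
class UserQueue:
    recipient_id: str
    next_seq: int = 1
    last_ack_seq: int = 0
    pending: dict = field(default_factory=dict)  # msg_seq -> QueuedMessage

    def append(self, sender_id: str, ciphertext: bytes, now: float = None) -> int:
        """Add a message at the tail and return its sequence number."""
        now = time.time() if now is None else now
        seq = self.next_seq
        self.pending[seq] = QueuedMessage(seq, sender_id, ciphertext, now)
        self.next_seq += 1
        return seq

    def drain(self, last_ack_seq: int) -> list:
        """Return pending messages with seq > last_ack_seq, in order."""
        return [self.pending[s] for s in sorted(self.pending) if s > last_ack_seq]

    def ack(self, up_to_seq: int) -> None:
        """Hard-delete every message with seq <= up_to_seq."""
        for seq in [s for s in self.pending if s <= up_to_seq]:
            del self.pending[seq]
        self.last_ack_seq = max(self.last_ack_seq, up_to_seq)

    def expire(self, now: float = None) -> int:
        """Drop messages older than the TTL; return how many were dropped."""
        now = time.time() if now is None else now
        expired = [s for s, m in self.pending.items()
                   if now - m.timestamp > TTL_SECONDS]
        for seq in expired:
            del self.pending[seq]
        return len(expired)
```

With five queued messages and an ACK through seq=3, `drain(3)` returns seq 4 and 5, matching the example in the text. If an ACK is lost, the same messages are simply returned again on the next drain.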

Delivery Receipts (Tick System)

Three states: ✓ sent (server received) → ✓✓ delivered (recipient got it) → blue ✓✓ read (recipient opened)

Message Lifecycle: Receipt State Machine

PENDING (clock icon): client → server, in flight
  on server ACK →
SENT (✓ single grey tick): stored in recipient queue
  on recipient ACK →
DELIVERED (✓✓ double grey ticks): message reached the device and was decrypted
  on read receipt →
READ (✓✓ double blue ticks): user opened the conversation
  on server cleanup →
DELETED: removed from the server; only exists on devices

Timing: PENDING→SENT: ~100ms | SENT→DELIVERED: instant (online) or hours/days (offline) | DELIVERED→DELETED: immediate after ACK
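
The sender-visible part of this lifecycle can be sketched as a small transition table. The event names are illustrative, and the DELETED state is left out because it is server-side cleanup rather than something the sender's UI displays.

```python
from enum import Enum


class Receipt(Enum):
    PENDING = "pending"      # clock icon
    SENT = "sent"            # single grey tick
    DELIVERED = "delivered"  # double grey ticks
    READ = "read"            # double blue ticks


# Allowed transitions, keyed by (current state, triggering event).
TRANSITIONS = {
    (Receipt.PENDING, "server_ack"): Receipt.SENT,
    (Receipt.SENT, "recipient_ack"): Receipt.DELIVERED,
    (Receipt.DELIVERED, "read_receipt"): Receipt.READ,
}


def advance(state: Receipt, event: str) -> Receipt:
    """Return the next state; duplicate or out-of-order events are ignored."""
    return TRANSITIONS.get((state, event), state)
```

Ignoring unknown transitions keeps the state from moving backwards when a retried ACK arrives twice.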

Receipt Protocol

Delivery ACK: Recipient sends ACK with msg_seq after decryption
Read receipt: Sent when user opens the chat (can be disabled)
Batch ACK: On reconnect, ACK all received msgs in one batch
Retry: If ACK lost, server re-delivers on next connection

Server Deletion Policy

After delivery ACK: Message deleted from server immediately
Media: Blob deleted from CDN after recipient downloads
TTL expiry: Undelivered msgs deleted after 30 days
Privacy: Server retains zero message content long-term

Failure Handling

ACK lost: Server re-delivers; client deduplicates by msg_seq
Device lost: Messages gone (unless backed up to encrypted cloud backup)
Server crash: Replicated queue survives; client reconnects to replica
Network flap: Client retries with exponential backoff + jitter
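
Two of these mechanisms are easy to show in isolation: retry delays with exponential backoff and jitter, and client-side deduplication by msg_seq. This is a sketch under assumptions: the base delay, cap, and function names are illustrative, and the "full jitter" variant (uniform between zero and the exponential ceiling) is one common choice, not necessarily the one WhatsApp uses.

```python
import random


def backoff_delays(attempts: int, base: float = 1.0, cap: float = 60.0,
                   rng: random.Random = None) -> list:
    """Full-jitter backoff: delay n is uniform in [0, min(cap, base * 2**n)]."""
    rng = rng or random.Random()
    return [rng.uniform(0.0, min(cap, base * 2 ** n)) for n in range(attempts)]


def deduplicate(messages: list, seen: set) -> list:
    """Keep only (msg_seq, payload) pairs whose msg_seq has not been seen."""
    fresh = []
    for msg_seq, payload in messages:
        if msg_seq not in seen:
            seen.add(msg_seq)
            fresh.append((msg_seq, payload))
    return fresh
```

The jitter spreads retries from many clients over time, so a server that just recovered is not hit by synchronized reconnects. Deduplication makes re-delivery after a lost ACK harmless.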

Ordering & Consistency

Per-conversation ordering via sender-assigned sequence numbers; no global ordering needed

How Ordering Works

1. Sender assigns msg_seq per conversation (monotonic on sender device)
2. Server appends to recipient queue preserving sender order
3. Recipient receives in queue order and reorders by timestamp if needed
4. Cross-conversation: no ordering guarantee (different queues)
5. Group messages: server assigns group_seq (single writer per group)

Difference from Slack

Slack: Server assigns channel_seq via Kafka partition (centralized)
WhatsApp: Sender assigns seq (decentralized), so it works offline
Why: The sender may be offline or on a flaky connection when composing, so it cannot wait for a server-assigned sequence
Tradeoff: Slightly weaker ordering guarantee but works without server
See Slack Ordering for the centralized approach
Key insight: WhatsApp uses sender-assigned timestamps + per-conversation sequence because the sender might be on a flaky mobile connection. Slack uses server-assigned sequence because all senders are connected to the same Kafka partition. Different connectivity assumptions lead to different ordering strategies.
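
Recipient-side reordering by sender-assigned sequence number can be sketched as a small buffer that holds back messages until the gap before them is filled. The `ReorderBuffer` class is hypothetical and tracks a single sender in a single conversation. It orders by sequence number only, without the timestamp fallback mentioned above, and a real client would also need a timeout so that a missing message cannot block the conversation forever.

```python
class ReorderBuffer:
    """Release one sender's messages in msg_seq order, holding back gaps."""

    def __init__(self, first_seq: int = 1):
        self.next_seq = first_seq  # next sequence number we expect to release
        self.held = {}             # msg_seq -> payload, waiting for a gap to fill

    def receive(self, msg_seq: int, payload: str) -> list:
        """Accept one message; return the payloads that are now deliverable."""
        if msg_seq < self.next_seq or msg_seq in self.held:
            return []  # duplicate of something already released or held
        self.held[msg_seq] = payload
        ready = []
        while self.next_seq in self.held:
            ready.append(self.held.pop(self.next_seq))
            self.next_seq += 1
        return ready
```

If message 2 arrives before message 1, it is held; when message 1 arrives, both are released in order.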

Media Message Handling

Photos, videos, documents: encrypted client-side, uploaded to blob storage, link sent in message

Media Upload Flow: Client-Side Encryption

1. Alice generates a random AES-256 key.
2. Alice encrypts the media locally (AES-256-CBC + HMAC-SHA256).
3. Alice uploads the ciphertext to blob storage (CDN).
4. Alice sends a message with {url, key, hash}; the key is encrypted with Bob's Signal key.
5. Bob downloads and decrypts: verify the HMAC → decrypt with the key.

The server stores the encrypted blob but CANNOT decrypt it (no key). The blob is deleted after the recipient downloads it.
Resumable upload: chunked with a per-chunk checksum. Works on 2G/3G with intermittent connectivity.
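
The resumable, chunked upload can be sketched with per-chunk SHA-256 checksums. The sketch assumes the blob is already encrypted (the AES step is omitted because it needs a third-party library), and the chunk size, function names, and the idea that the server reports which chunk digests it already holds are all illustrative assumptions.

```python
import hashlib

CHUNK_SIZE = 256 * 1024  # hypothetical chunk size


def chunk_manifest(blob: bytes, chunk_size: int = CHUNK_SIZE) -> list:
    """Return the SHA-256 hex digest of each chunk of an encrypted blob."""
    return [hashlib.sha256(blob[i:i + chunk_size]).hexdigest()
            for i in range(0, len(blob), chunk_size)]


def chunks_to_send(manifest: list, server_has: dict) -> list:
    """Return indexes of chunks the server lacks or holds with a bad checksum.

    `server_has` maps chunk index -> digest reported by the server.
    """
    return [i for i, digest in enumerate(manifest)
            if server_has.get(i) != digest]
```

After a dropped connection, the client asks which chunks arrived intact and re-sends only the rest, which matters on 2G/3G links where restarting a large upload from zero is expensive.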

Tech Stack & Tradeoffs

Erlang/BEAM for massive concurrency: each user connection is a lightweight process

| Component | Technology | Why This | Why Not X |
| --- | --- | --- | --- |
| Connection Server | Erlang/BEAM | 2M+ connections per server. Lightweight processes (~2KB each). Hot code reload. Built for telecom reliability. | Go/Java: heavier per-connection overhead. Can't match Erlang's process density. |
| Message Queue | Mnesia (Erlang built-in) | Distributed, replicated, in-memory with disk persistence. Co-located with connection server; no network hop. | Kafka: overkill for per-user queues. Redis: no built-in replication at this scale. |
| Media Storage | Custom blob store + CDN | Encrypted blobs. Geo-distributed. Auto-delete after download. Resumable uploads. | S3: vendor lock-in. Cost at 5PB/day would be enormous. |
| Encryption | Signal Protocol (libsignal) | Forward secrecy. Post-compromise security. Well-audited. Open source. | PGP: no forward secrecy. TLS: only transport-level, server can read. |
| Push Notifications | FCM / APNs | Wake device when offline. "You have a new message" (no content, because of E2E). | Custom push: can't wake iOS apps without APNs. Platform requirement. |
| Protocol | Custom binary (XMPP-derived) | Minimal bandwidth. Binary framing. Compression. 2G-friendly. | JSON/HTTP: too verbose for mobile. gRPC: protobuf overhead unnecessary. |

Real-world validation: WhatsApp famously ran on Erlang with a very small team. About 50 engineers were reported to support roughly 900M users in 2015 (the 2B-user mark came later, in 2020), on an estimated ~1000 servers. The BEAM VM's actor model maps naturally to "one process per user connection": each process holds the user's state, mailbox, and connection.

Resilience & Edge Cases

Mobile-first challenges that don't exist in desktop-first systems like Slack

| # | Challenge | What Breaks | How It's Handled |
| --- | --- | --- | --- |
| 1 | Phone lost/stolen | All messages on device gone | Encrypted cloud backup (Google Drive / iCloud). Backup key managed by user, not WhatsApp. Restore on new device. |
| 2 | SIM swap attack | Attacker registers with victim's number | Re-registration triggers key change notification to contacts. 2FA PIN prevents unauthorized re-registration. |
| 3 | Network flap (2G/3G) | Connection drops mid-message | Client retries with exponential backoff. Server deduplicates by client_msg_id. Resumable media uploads. |
| 4 | Morning reconnect storm | 100M users come online simultaneously | Stagger queue drain. Priority: recent msgs first. Rate-limit per-user delivery to 100 msgs/sec. |
| 5 | Key mismatch (new device) | Can't decrypt messages encrypted with old key | Sender re-encrypts pending messages with new key. "Security code changed" notification shown to contacts. |
| 6 | Group key distribution | N members need the same message, each with different key | Sender encrypts once per recipient (fan-out at encryption layer). Server stores N ciphertexts for N members. |

Key insight: WhatsApp's architecture is fundamentally store-and-forward: the server is a temporary encrypted mailbox. This is the opposite of Slack's persistent log approach. The tradeoff: WhatsApp can't offer server-side search (content is encrypted), but provides stronger privacy guarantees.
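
The per-user delivery limit of 100 msgs/sec from the reconnect-storm row could be enforced with a token bucket. This is one possible mechanism, not a documented WhatsApp design. The `TokenBucket` class is hypothetical, and the clock is passed in explicitly so the behavior is easy to reason about.

```python
class TokenBucket:
    """Allow up to `rate` messages per second, with bursts up to `capacity`."""

    def __init__(self, rate: float, capacity: float, now: float = 0.0):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity  # start full so a short queue drains at once
        self.updated = now

    def try_take(self, now: float, count: int = 1) -> bool:
        """Refill based on elapsed time, then try to spend `count` tokens."""
        elapsed = max(0.0, now - self.updated)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = max(self.updated, now)
        if self.tokens >= count:
            self.tokens -= count
            return True
        return False
```

A user with 50 queued messages drains immediately, while a user with thousands is spread over several seconds, which flattens the burst on the queue shard.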

Interview Cheat Sheet

The 5 things an interviewer wants to hear for this problem

1. Store-and-forward with per-user queues
Each user has an independent message queue sharded by recipient_id. Messages are stored as ciphertext until the recipient ACKs delivery. The server deletes after ACK: it's a temporary mailbox, not permanent storage.
2. E2E encryption via Signal Protocol
X3DH key exchange + Double Ratchet. Each message gets a unique key. Forward secrecy: compromising one key doesn't expose past/future messages. The server sees only ciphertext and can't decrypt even with a court order.
3. Delivery receipts as state machine
PENDING → SENT (✓) → DELIVERED (✓✓) → READ (blue ✓✓) → DELETED from server. Each transition is triggered by an ACK from the next hop. Server cleanup happens immediately after delivery confirmation.
4. Reconnect = drain queue from last_ack_seq
On reconnect, the client sends last_ack_seq. The server delivers all messages with seq > last_ack_seq. The client ACKs each batch. Deduplication by msg_seq handles retries. The morning storm is handled by staggered delivery.
5. Erlang/BEAM for 2M connections per server
Each user connection is a lightweight Erlang process (~2KB). Mnesia for co-located message queues. Hot code reload for zero-downtime deploys. A small team (about 50 engineers at roughly 900M users in 2015) and an estimated ~1000 servers.
One-liner: "Per-user message queue sharded by recipient_id, E2E encrypted via Signal Protocol, store-and-forward with ACK-based deletion, delivery receipts as state machine, Erlang/BEAM for 2M connections per server."