How does WhatsApp guarantee message delivery to 2B+ users who go offline, with end-to-end encryption?
Design a messaging system that guarantees zero message loss for 2B+ users who go offline for hours/days, with E2E encryption where the server never sees plaintext
How does a messaging system guarantee delivery to 2B+ users who go offline for hours/days, ensuring zero message loss, correct ordering on reconnect, and end-to-end encryption without the server ever seeing plaintext?
Key difference from Slack: Slack assumes users are mostly online (WebSocket push). WhatsApp must handle the offline-first case: users may be offline for days, on unreliable mobile networks, with intermittent connectivity. See Slack Real-Time Messaging for the online delivery path.
2B+ monthly active users
100B+ messages / day
E2E encrypted (Signal Protocol)
0 messages lost after ACK
Functional Requirements
What the system must do, focused on offline delivery and encryption
Must Have (Core)
1. Messages are never lost after the sender gets a delivery ACK
2. Offline users receive all missed messages on reconnect, in correct order
3. End-to-end encryption: server stores only ciphertext, never plaintext
4. Delivery receipts: single tick (sent) → double tick (delivered) → blue tick (read)
5. Messages expire from server after delivery (30-day TTL for undelivered)
6. Support media messages (photos, video, documents) with same guarantees
Out of Scope
- Group messaging fan-out (see Slack Fan-out Strategy)
- Real-time online delivery path (see Slack Architecture)
- Voice/video calling signaling
- Status/Stories feature
- Business API and chatbots
Non-Functional Requirements
Constraints shaped by mobile-first, offline-heavy usage patterns
| Property | Target | Design Impact |
| --- | --- | --- |
| Durability | Zero message loss after sender ACK | Write-ahead log + synchronous replication before ACK. Messages stored until recipient confirms delivery. |
| Latency | <300ms when both online | Direct push via persistent connection when recipient is online. No store-and-forward delay. |
| Offline tolerance | Up to 30 days offline | Server queues messages per recipient. TTL-based expiry after 30 days. Recipient pulls on reconnect. |
| Security | E2E encrypted; server is zero-knowledge | Signal Protocol (Double Ratchet). Server stores ciphertext only. Key agreement happens between clients; the server only relays public keys. |
| Ordering | Per-conversation ordering preserved | Sender-assigned sequence numbers. Recipient reorders on delivery. No global ordering needed. |
| Bandwidth | Minimal; 2G/3G friendly | Binary protocol (not JSON). Compressed payloads. Resumable media uploads. Delta sync on reconnect. |
| Scale | 2B+ users, 100B+ msg/day | Shard by user_id. Each user's queue is independent. Horizontal scale with no cross-shard coordination. |
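To make the "binary protocol (not JSON)" row concrete, here is a minimal sketch comparing a fixed-width binary frame with a JSON encoding of the same message. The frame layout, field widths, and function names are assumptions for illustration only; they are not WhatsApp's actual wire format.

```python
import base64
import json
import struct

# Hypothetical frame layout (big-endian):
# recipient_id (8 bytes), msg_seq (4 bytes), payload length (2 bytes), ciphertext.
HEADER = struct.Struct(">QIH")


def encode_binary(recipient_id: int, msg_seq: int, ciphertext: bytes) -> bytes:
    """Pack a fixed 14-byte header followed by the raw ciphertext."""
    return HEADER.pack(recipient_id, msg_seq, len(ciphertext)) + ciphertext


def decode_binary(frame: bytes):
    """Inverse of encode_binary: returns (recipient_id, msg_seq, ciphertext)."""
    recipient_id, msg_seq, length = HEADER.unpack_from(frame)
    body = frame[HEADER.size:HEADER.size + length]
    return recipient_id, msg_seq, body


def encode_json(recipient_id: int, msg_seq: int, ciphertext: bytes) -> bytes:
    """JSON cannot carry raw bytes, so the ciphertext must be base64-encoded."""
    return json.dumps({
        "recipient_id": recipient_id,
        "msg_seq": msg_seq,
        "ciphertext": base64.b64encode(ciphertext).decode("ascii"),
    }).encode("utf-8")
```

For a 200-byte ciphertext the binary frame is 214 bytes, while the JSON form pays roughly a third more for base64 plus the field names. Ciphertext is effectively incompressible, so that overhead does not compress away.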
Scale Estimation
Derive infrastructure sizing from the given numbers
| Step | Derivation | Result | Design Decision |
| --- | --- | --- | --- |
| 1 | 100B msgs/day ÷ 86,400s | ~1.16M msg/sec | 10× Slack's throughput; need sharded message queues |
| 2 | 2B users × ~30% online at any time | ~600M concurrent connections | ~1,200 connection servers (500K conn each) |
| 3 | Avg user offline 8-16 hrs/day → ~50 msgs queued | ~100B queued msgs at peak | Need efficient per-user message queue storage |
| 4 | Avg message 1KB (encrypted) × 100B queued | ~100TB queued storage | Distributed KV store sharded by recipient_id |
| 5 | Media: 10% of msgs have media, avg 500KB | ~5PB media/day | Blob storage with CDN. Encrypted client-side before upload. |
| 6 | Reconnect storm: 100M users come online in 1 hour (morning) | ~28K reconnects/sec | Queue drain must handle burst. Stagger delivery over seconds. |
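The arithmetic behind these estimates can be checked with a short script. This is a sketch that uses only the inputs given above (with 1KB taken as 1,000 bytes); the function and key names are illustrative.

```python
def estimate() -> dict:
    """Back-of-envelope sizing from the stated inputs."""
    msgs_per_day = 100_000_000_000           # 100B messages per day
    users = 2_000_000_000                    # 2B users
    concurrent = users * 30 // 100           # ~30% online at any time
    queued = users * 50                      # ~50 queued messages per user
    return {
        "msgs_per_sec": msgs_per_day / 86_400,
        "concurrent_connections": concurrent,
        "connection_servers": concurrent // 500_000,   # 500K connections each
        "queued_msgs": queued,
        "queue_bytes": queued * 1_000,                 # 1KB per message
        "media_bytes_per_day": msgs_per_day // 10 * 500_000,  # 10% at 500KB
        "reconnects_per_sec": 100_000_000 / 3_600,     # 100M users in 1 hour
    }


if __name__ == "__main__":
    for name, value in estimate().items():
        print(f"{name}: {value:,.0f}")
```

Note that 600M connections at 500K per server works out to about 1,200 connection servers.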
Architecture Overview
Store-and-forward with E2E encryption: the server is a dumb encrypted mailbox
Per-User Message Queue
Each user has an independent queue, so no cross-user coordination is needed. Sharded by recipient_id.
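As a rough illustration of sharding by recipient_id, the sketch below maps a recipient to a shard with a stable hash. The shard count and the function name are assumptions; a production system would more likely use consistent hashing or a directory service so that resharding does not move most keys.

```python
import hashlib

NUM_SHARDS = 1024  # hypothetical shard count


def shard_for(recipient_id: str, num_shards: int = NUM_SHARDS) -> int:
    """Map a recipient to a shard using a hash that is stable across processes."""
    digest = hashlib.sha256(recipient_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards
```

Because the shard depends only on the recipient, every message for one user lands in one queue and no cross-shard coordination is needed. Python's built-in `hash()` is avoided here because it is randomized per process for strings.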
Key difference from Slack: Slack stores messages permanently in Vitess (they're the system of record). WhatsApp deletes messages from the server after delivery; the server is just a temporary mailbox. The phone is the system of record. See Slack Data Model for the permanent storage approach.
Delivery Receipts (Tick System)
Three states: ✓ sent (server received) → ✓✓ delivered (recipient got it) → blue ✓✓ read (recipient opened)
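The tick states behave like a small forward-only state machine. A minimal sketch, assuming receipts may arrive late or more than once (the `Status` enum and `advance` helper are illustrative names, and `PENDING` stands for a message that has not yet reached the server):

```python
from enum import IntEnum


class Status(IntEnum):
    PENDING = 0     # not yet acknowledged by the server
    SENT = 1        # single tick: server received
    DELIVERED = 2   # double tick: recipient device got it
    READ = 3        # blue double tick: recipient opened the chat


def advance(current: Status, incoming: Status) -> Status:
    """Apply a receipt; status only moves forward, never backwards."""
    return max(current, incoming)
```

Taking the maximum makes receipt handling idempotent: a duplicated or reordered "delivered" receipt cannot downgrade a message that is already marked read.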
Receipt Protocol
Delivery ACK: Recipient sends ACK with msg_seq after decryption
Read receipt: Sent when user opens the chat (can be disabled)
Batch ACK: On reconnect, ACK all received msgs in one batch
Retry: If ACK lost, server re-delivers on next connection
Server Deletion Policy
After delivery ACK: Message deleted from server immediately
Media: Blob deleted from CDN after recipient downloads
TTL expiry: Undelivered msgs deleted after 30 days
Privacy: Server retains zero message content long-term
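The receipt and deletion rules above can be sketched as an in-memory mailbox. This is a simplified model, not the real storage layer: `Mailbox`, `Envelope`, and `queue_seq` (a server-side position within one recipient's queue) are hypothetical names, and replication and persistence are left out.

```python
import time
from dataclasses import dataclass, field
from typing import Dict, List, Optional

TTL_SECONDS = 30 * 24 * 3600  # 30-day TTL for undelivered messages


@dataclass
class Envelope:
    queue_seq: int       # hypothetical position in this recipient's queue
    ciphertext: bytes    # the server never sees plaintext
    stored_at: float


@dataclass
class Mailbox:
    next_seq: int = 1
    pending: Dict[int, Envelope] = field(default_factory=dict)

    def enqueue(self, ciphertext: bytes, now: Optional[float] = None) -> int:
        """Store a ciphertext and return its queue position."""
        seq = self.next_seq
        self.next_seq += 1
        stored_at = time.time() if now is None else now
        self.pending[seq] = Envelope(seq, ciphertext, stored_at)
        return seq

    def drain(self, last_ack_seq: int) -> List[Envelope]:
        """Return everything after the client's last ACK, in queue order."""
        return [self.pending[s] for s in sorted(self.pending) if s > last_ack_seq]

    def ack(self, up_to_seq: int) -> int:
        """Batch ACK: delete every message up to and including up_to_seq."""
        done = [s for s in self.pending if s <= up_to_seq]
        for s in done:
            del self.pending[s]
        return len(done)

    def expire(self, now: float) -> int:
        """Drop undelivered messages older than the TTL."""
        stale = [s for s, e in self.pending.items() if now - e.stored_at > TTL_SECONDS]
        for s in stale:
            del self.pending[s]
        return len(stale)
```

Nothing is deleted by `drain` itself. If the ACK is lost, the next `drain` simply returns the same messages again, which is why the client must deduplicate.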
Failure Handling
ACK lost: Server re-delivers; client deduplicates by msg_seq
Device lost: Messages gone (unless backed up to encrypted cloud backup)
Server crash: Replicated queue survives; client reconnects to replica
Network flap: Client retries with exponential backoff + jitter
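Two of these mechanisms are easy to show in code: retry with exponential backoff plus jitter, and client-side deduplication. The sketch below uses the "full jitter" variant and keys duplicates on a (conversation, sender, msg_seq) triple; the base delay, cap, and key shape are assumptions, not values from the source.

```python
import random
from typing import Callable, Set, Tuple


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0,
                  rng: Callable[[], float] = random.random) -> float:
    """Full jitter: a random delay in [0, min(cap, base * 2**attempt)]."""
    return rng() * min(cap, base * (2 ** attempt))


class Deduplicator:
    """Remembers which messages were already processed on this device."""

    def __init__(self) -> None:
        self._seen: Set[Tuple[str, str, int]] = set()

    def accept(self, conversation_id: str, sender_id: str, msg_seq: int) -> bool:
        """Return True the first time a message is seen, False on re-delivery."""
        key = (conversation_id, sender_id, msg_seq)
        if key in self._seen:
            return False
        self._seen.add(key)
        return True
```

The jitter matters for reconnect storms: without it, clients that dropped at the same moment would all retry at the same moment. A real client would also bound the dedup set, for example by tracking only the highest contiguous msg_seq per sender.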
Ordering & Consistency
Per-conversation ordering via sender-assigned sequence numbers; no global ordering needed
How Ordering Works
1. Sender assigns msg_seq per conversation (monotonic on sender device)
2. Server appends to recipient queue preserving sender order
3. Recipient receives in queue order and reorders by timestamp if needed
4. Cross-conversation: no ordering guarantee (different queues)
5. Group messages: server assigns group_seq (single writer per group)
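The recipient side of these steps can be sketched as a small reorder buffer, one per (conversation, sender). It releases messages strictly in msg_seq order and holds back anything that arrives early. The class name and the starting sequence of 1 are assumptions, and this sketch orders by sequence number only; it does not model the timestamp fallback or a timeout for gaps that never fill.

```python
from typing import Dict, List


class ReorderBuffer:
    """Releases payloads in msg_seq order for one sender in one conversation."""

    def __init__(self, first_seq: int = 1) -> None:
        self.next_seq = first_seq          # next sequence number to release
        self.held: Dict[int, bytes] = {}   # early arrivals waiting for a gap

    def receive(self, msg_seq: int, payload: bytes) -> List[bytes]:
        """Accept one message and return every payload that is now in order."""
        if msg_seq < self.next_seq or msg_seq in self.held:
            return []                      # duplicate from a re-delivery
        self.held[msg_seq] = payload
        ready: List[bytes] = []
        while self.next_seq in self.held:
            ready.append(self.held.pop(self.next_seq))
            self.next_seq += 1
        return ready
```

For example, if message 2 arrives before message 1, nothing is shown until 1 arrives, at which point both are released together.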
Difference from Slack
Slack: Server assigns channel_seq via Kafka partition (centralized)
WhatsApp: Sender assigns seq (decentralized); works offline
Why: The WhatsApp sender may be offline or on a flaky connection when composing, so it cannot wait for a server-assigned sequence
Tradeoff: Slightly weaker ordering guarantee, but works without the server
See Slack Ordering for the centralized approach
Key insight: WhatsApp uses sender-assigned timestamps + per-conversation sequence because the sender might be on a flaky mobile connection. Slack uses server-assigned sequence because all senders are connected to the same Kafka partition. Different connectivity assumptions → different ordering strategies.
Media Message Handling
Photos, videos, documents: encrypted client-side, uploaded to blob storage, link sent in message
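The flow described here (encrypt on the device, upload the opaque blob, send a pointer inside the E2E message) can be sketched as follows. Everything in this block is illustrative: `BLOB_STORE` is an in-memory stand-in for blob storage, and `toy_encrypt` is a deliberately simple SHA-256 keystream cipher used only so the example runs with the standard library. It is not secure and is not what WhatsApp uses; a real client would use an authenticated cipher such as AES-GCM from a vetted library.

```python
import hashlib
import hmac
import os
from typing import Dict

BLOB_STORE: Dict[str, bytes] = {}  # hypothetical stand-in for blob storage + CDN


def _keystream(key: bytes, nonce: bytes, length: int) -> bytes:
    """Toy keystream: SHA-256 over key + nonce + counter. Illustration only."""
    out = bytearray()
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(key + nonce + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(out[:length])


def toy_encrypt(key: bytes, plaintext: bytes) -> bytes:
    nonce = os.urandom(16)
    stream = _keystream(key, nonce, len(plaintext))
    return nonce + bytes(a ^ b for a, b in zip(plaintext, stream))


def toy_decrypt(key: bytes, blob: bytes) -> bytes:
    nonce, body = blob[:16], blob[16:]
    stream = _keystream(key, nonce, len(body))
    return bytes(a ^ b for a, b in zip(body, stream))


def upload_media(plaintext: bytes) -> dict:
    """Encrypt with a fresh random key, upload, and return the pointer."""
    media_key = os.urandom(32)
    blob = toy_encrypt(media_key, plaintext)
    digest = hashlib.sha256(blob).hexdigest()
    BLOB_STORE[digest] = blob              # the store only ever sees ciphertext
    return {"blob_id": digest, "media_key": media_key, "sha256": digest}


def download_media(pointer: dict) -> bytes:
    """Fetch the blob, check its hash against the pointer, then decrypt."""
    blob = BLOB_STORE[pointer["blob_id"]]
    if not hmac.compare_digest(hashlib.sha256(blob).hexdigest(), pointer["sha256"]):
        raise ValueError("blob does not match the hash in the message")
    return toy_decrypt(pointer["media_key"], blob)
```

The pointer returned by `upload_media` (including the media key) travels inside the normal E2E-encrypted message, so the blob store holds data it cannot read.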
Tech Stack & Tradeoffs
Erlang/BEAM for massive concurrency: each user connection is a lightweight process
| Component | Technology | Why This | Why Not X |
| --- | --- | --- | --- |
| Connection Server | Erlang/BEAM | 2M+ connections per server. Lightweight processes (~2KB each). Hot code reload. Built for telecom reliability. | Go/Java: heavier per-connection overhead. Can't match Erlang's process density. |
| Message Queue | Mnesia (Erlang built-in) | Distributed, replicated, in-memory with disk persistence. Co-located with connection server, so no network hop. | Kafka: overkill for per-user queues. Redis: no built-in replication at this scale. |
| Media Storage | Custom blob store + CDN | Encrypted blobs. Geo-distributed. Auto-delete after download. Resumable uploads. | S3: vendor lock-in. Cost at 5PB/day would be enormous. |
| Encryption | Signal Protocol (libsignal) | Forward secrecy. Post-compromise security. Well-audited. Open source. | PGP: no forward secrecy. TLS: only transport-level, server can read. |
| Push Notifications | FCM / APNs | Wake device when offline. "You have a new message" (no content, because of E2E). | Custom push: can't wake iOS apps without APNs. Platform requirement. |
| | | | JSON/HTTP: too verbose for mobile. gRPC: protobuf overhead unnecessary. |
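The forward-secrecy property listed for the Signal Protocol comes largely from a symmetric key chain: each message key is derived from a chain key, and the chain key is then replaced by a one-way function of itself. The sketch below shows only that symmetric step, using HMAC-SHA256 with two distinct constant inputs. The constants and function names are an illustrative choice, and the full Double Ratchet additionally mixes fresh Diffie-Hellman outputs into the chain, which is not shown.

```python
import hashlib
import hmac
from typing import List, Tuple


def ratchet_step(chain_key: bytes) -> Tuple[bytes, bytes]:
    """Derive one message key and the next chain key from the current chain key."""
    message_key = hmac.new(chain_key, b"\x01", hashlib.sha256).digest()
    next_chain_key = hmac.new(chain_key, b"\x02", hashlib.sha256).digest()
    return next_chain_key, message_key


def derive_message_keys(initial_chain_key: bytes, count: int) -> List[bytes]:
    """Run the chain forward, producing one unique key per message."""
    keys = []
    chain_key = initial_chain_key
    for _ in range(count):
        chain_key, message_key = ratchet_step(chain_key)
        keys.append(message_key)
    return keys
```

Because HMAC is one-way, a device that has advanced its chain key and discarded the old one cannot recompute earlier message keys, so a later compromise does not expose previously delivered messages. Both sides starting from the same chain key derive the same sequence of message keys.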
Real-world validation: WhatsApp famously ran on Erlang with a very small team; roughly 50 engineers were reported to be serving about 900M users in 2015, reportedly on the order of 1,000 servers. The BEAM VM's actor model maps naturally to "one process per user connection": each process holds the user's state, mailbox, and connection.
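The "one process per user connection" idea can be approximated in Python with one asyncio task and one queue per user. This is only an analogy under stated assumptions: asyncio tasks are cooperative and share memory, whereas BEAM processes are preemptively scheduled and isolated, and the names below are hypothetical.

```python
import asyncio
from typing import List, Tuple


async def user_process(user_id: str, inbox: asyncio.Queue,
                       delivered: List[Tuple[str, bytes]]) -> None:
    """One task per connected user: owns that user's inbox and state."""
    while True:
        ciphertext = await inbox.get()
        if ciphertext is None:           # sentinel: connection closed
            break
        delivered.append((user_id, ciphertext))  # stand-in for a socket write


async def demo() -> List[Tuple[str, bytes]]:
    delivered: List[Tuple[str, bytes]] = []
    inboxes = {uid: asyncio.Queue() for uid in ("alice", "bob")}
    tasks = [asyncio.create_task(user_process(uid, q, delivered))
             for uid, q in inboxes.items()]
    await inboxes["alice"].put(b"ciphertext-1")   # route by recipient
    await inboxes["bob"].put(b"ciphertext-2")
    for q in inboxes.values():
        await q.put(None)
    await asyncio.gather(*tasks)
    return delivered
```

Routing a message is just a put onto the recipient's inbox, which mirrors sending to an Erlang process mailbox. No user's task ever touches another user's state.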
Resilience & Edge Cases
Mobile-first challenges that don't exist in desktop-first systems like Slack
| # | Challenge | What Breaks | How It's Handled |
| --- | --- | --- | --- |
| 1 | Phone lost/stolen | All messages on device gone | Encrypted cloud backup (Google Drive / iCloud). Backup key managed by user, not WhatsApp. Restore on new device. |
| | | | Sender re-encrypts pending messages with new key. "Security code changed" notification shown to contacts. |
| 6 | Group key distribution | N members need the same message, each with different key | Sender encrypts once per recipient (fan-out at encryption layer). Server stores N ciphertexts for N members. |
Key insight: WhatsApp's architecture is fundamentally store-and-forward; the server is a temporary encrypted mailbox. This is the opposite of Slack's persistent log approach. The tradeoff: WhatsApp can't offer server-side search (content is encrypted), but provides stronger privacy guarantees.
Interview Cheat Sheet
The 5 things an interviewer wants to hear for this problem
1. Store-and-forward with per-user queues
Each user has an independent message queue sharded by recipient_id. Messages are stored as ciphertext until the recipient ACKs delivery. The server deletes after ACK; it is a temporary mailbox, not permanent storage.
2. E2E encryption via Signal Protocol
X3DH key exchange + Double Ratchet. Each message gets a unique key. Forward secrecy: compromising a current key doesn't expose past messages. Post-compromise security: the ratchet heals, so future messages become safe again. Server sees only ciphertext and can't decrypt even with a court order.
3. Delivery receipts as state machine
PENDING → SENT (✓) → DELIVERED (✓✓) → READ (blue ✓✓) → DELETED from server. Each transition is triggered by an ACK from the next hop. Server cleanup happens immediately after delivery confirmation.
4. Reconnect = drain queue from last_ack_seq
On reconnect, client sends last_ack_seq. Server delivers all messages with seq > last_ack_seq. Client ACKs each batch. Deduplication by msg_seq handles retries. Morning storm handled by staggered delivery.
5. Erlang/BEAM for 2M connections per server
Each user connection is a lightweight Erlang process (~2KB). Mnesia for co-located message queues. Hot code reload for zero-downtime deploys. Roughly 50 engineers were reported for about 900M users in 2015, reportedly on the order of 1,000 servers.
One-liner: "Per-user message queue sharded by recipient_id, E2E encrypted via Signal Protocol, store-and-forward with ACK-based deletion, delivery receipts as state machine, Erlang/BEAM for 2M connections per server."