How does a group chat with 1M+ members deliver a single message without causing a fan-out explosion, keeping delivery latency under 2 seconds for all participants regardless of group size?
Scope: Large channel/group delivery path · not 1:1 DMs, not media upload, not E2E encryption. The hard question: how do you avoid O(N) write amplification when N = 1,000,000?
1M+
members per channel
some channels exceed 5M
200K+
channels at this scale
public broadcast channels
<2s
delivery target
all members, any region
700M+
monthly active users
global distribution
Cross-reference:Slack's Fan-out Strategy covers hybrid fan-out up to 50K members · this problem goes 20· further. At 1M+ members, even Slack's "push to online only" model breaks because 10% online = 100K simultaneous pushes per message.
Functional Requirements
What the system must do · core behaviours for 1M+ member channels
Must Have (Core)
1. Admin/allowed user can post a message to a channel with 1M+ members 2. All online members receive the message within 2 seconds 3.Offline members see the message on next app open (pull-based catch-up) 4. Messages are ordered · all members see the same sequence 5.Notification tiering · @mentions push immediately, regular messages update badge only 6. System handles member join/leave storms without disrupting delivery
Out of Scope (this design)
? End-to-end encryption (not applicable to large public channels) ? Media/file delivery (separate CDN pipeline) ? Message editing/deletion propagation ? Bot API and webhook integrations ? User authentication and session management ? Small group chat (<1K members · standard fan-out works fine)
Non-Functional Requirements
Quality constraints that shape the architecture for extreme fan-out
Property
Target
Why It Matters / Design Impact
Latency
<2s delivery for all online members
Users expect near-instant delivery. Read-time pull with long-polling achieves this without write amplification.
Throughput
100M+ deliveries/day per large channel
A single channel with 1M members and 100 msgs/day = 100M delivery events. Must not scale linearly with writes.
Availability
99.99% (<52 min/year)
Channels are broadcast infrastructure for governments and media. Downtime = public trust loss.
Durability
Zero message loss after server ACK
Append-only log with replication. Messages never deleted from server (unlike E2E chats).
Scalability
Linear with readers, not writers
Write cost = O(1) per message. Read cost distributed across millions of clients pulling independently.
Fan-out ratio
1 write : 0 server-side copies
No per-user inbox. Single log entry serves all members. Eliminates write amplification entirely.
Consistency
Per-channel total order
All members see messages in identical sequence. Monotonic message IDs enable gap detection.
Fault tolerance
Survive datacenter failures
Multi-DC replication of channel log. Clients reconnect to any DC and resume from cursor.
Key insight: At 1M+ members, the system cannot afford O(N) work per message. The architecture must make message posting O(1) and shift the cost to readers · each reader does O(1) work to fetch their unread messages.
Scale Estimation
Derive why write-time fan-out is impossible at 1M+ members · and what replaces it
Given:1M members per channel · 100 messages/day per channel · <2s delivery · 700M+ MAU
Step
What to Derive
Calculation
Result
Design Decision
1
Deliveries per channel/day
1M members · 100 msgs/day
100M deliveries/day
Per-channel delivery volume · impossible to write-amplify
2
Write-time fan-out cost
1M inbox writes · 100 msgs/day
100M writes/day/channel ?
With 200K channels: 20 trillion writes/day. Completely infeasible.
3
Read-time cost (log model)
1 append per message (O(1) write)
100 writes/day/channel ?
Single append-only log. Readers pull. Write cost independent of member count.
4
Online members at any time
1M · 5% online rate
~50K concurrent readers
50K long-poll connections per large channel · manageable with connection pooling
5
Notification fan-out (push)
Only @mentions trigger push: ~1% of msgs
~1 push/day per user
Push notifications are rare · most msgs only increment badge counter
6
Storage per channel/day
100 msgs · 2KB avg (text + metadata)
~200KB/day/channel
Trivial storage. 200K channels · 200KB = 40GB/day total for all large channels
7
Cursor storage
700M users · 8 bytes (last_read_msg_id)
~5.6GB
Per-user cursor per channel. Fits in distributed cache. No per-user message copies.
Well within 2s budget. Long-poll gives near-real-time without persistent connections.
The impossibility proof: Write-time fan-out at 1M members means writing 1M inbox entries per message. At 100 msgs/day that's 100M writes/day for ONE channel. Multiply by 200K large channels = 20 trillion writes/day. No storage system can sustain this. Read-time pull from a shared log is the only viable architecture.
Architecture Overview
Single append-only channel log eliminates write amplification · clients pull from shared log independently
Key insight: The architecture is read-time pull from a shared channel log. One write serves all members. Online clients long-poll for new entries; offline clients fetch from their last-seen cursor on reconnect. See detailed strategies below.
Fan-out Strategies at Scale
Comparing three approaches · and why read-time pull wins at 1M+ members
Cross-reference:Slack uses hybrid fan-out for channels up to 50K members · push to online, offline users catch up on reconnect. At 1M+ even the "push to online" part becomes 50K·100K pushes per message, creating unacceptable latency spikes. Telegram's solution: don't push message content at all · push only a lightweight "new message available" signal and let clients pull.
Channel Log Architecture
Single append-only log per channel · clients maintain their own cursor. Server tracks zero per-user delivery state.
Why this works: The server does zero per-user bookkeeping. It doesn't know who has read what · that's the client's job. When a client opens the app, it says "give me everything after msg_id X" and the server does a simple range scan on the log. This makes the system scale with readers (distributed) rather than writers (centralized).
Notification Tiering
Only push notifications fan-out · not message content. Tiered by priority to minimize push volume.
Key insight: Notification fan-out and message delivery are completely decoupled. Message content lives in the channel log (pulled by clients). Push notifications are lightweight signals ("you have N unread") that fan-out only to unmuted, offline users · and even those are batched to reduce volume by 99%+.
Resilience & Edge Cases
What breaks at 1M+ scale · and how the architecture handles it
Edge Case
Problem
Solution
Impact if Unhandled
Viral message in 1M channel
Sudden spike: 50K clients simultaneously pull after a popular post
CDN-like caching of recent messages at edge. Hot channel log replicated to read replicas. Rate-limit pulls to 1/sec per client.
Log server overload, cascading timeouts for all channels on same shard
Member join/leave storms
100K users join a channel in minutes (viral invite link). Membership list update becomes bottleneck.
Membership is eventually consistent. New members start cursor at latest msg_id. Membership changes don't affect the log · only the "who can read" ACL check on pull.
Membership write lock blocks message delivery if coupled
Cursor management for offline users
User offline for 30 days. Channel has 3000 new messages. Full catch-up is expensive.
Paginated fetch (100 msgs/page). Show latest first, load history on scroll-up. If gap > 1000 msgs, show "jump to latest" with unread count only.
Client OOM on massive payload. Server timeout on huge range scan.
Thundering herd on popular channel
Channel admin posts ? 50K long-poll connections wake up simultaneously ? 50K read requests hit log server at once
Staggered wakeup: add random jitter (0·200ms) to long-poll response. Coalesce reads: batch wakeup notifications, serve from cache. Connection-level rate limiting.
Log server CPU spike, increased latency for all channels on same partition
Multi-device sync
User has phone + desktop + tablet. Each device has its own cursor. Must not show duplicate notifications.
Server-side "last notified msg_id" per user (not per device). Devices sync cursors via shared user state. Push notification sent once, all devices update badge.
Triple notifications per message. User disables notifications entirely.
Log compaction / retention
Channel with 1M msgs over 5 years. Log grows unbounded. Old messages rarely accessed.
Tiered storage: hot (last 7 days, SSD), warm (last 90 days, HDD), cold (archive, object storage). Cursor-based access works across tiers transparently.
Design principle: Every edge case is handled by keeping the write path simple (single append) and pushing complexity to the read path (caching, pagination, rate limiting). The log is the source of truth · everything else is derived.
Tech Stack
Telegram's infrastructure choices optimized for large-scale message delivery
Layer
Technology
Why This Choice
Protocol
MTProto (custom)
Binary protocol over TCP. Smaller payloads than JSON/HTTP. Built-in encryption, compression, and multiplexing. Supports long-poll natively.
Transport
TCP + custom framing
Persistent TCP connections with keep-alive. No HTTP overhead. Supports multiple concurrent requests on single connection (multiplexed).
Message Storage
Distributed append-only log
Custom storage engine optimized for sequential writes and range reads. Sharded by channel_id. Replicated across 3+ DCs.
Metadata Store
Distributed KV store
User profiles, channel membership, notification preferences. Eventually consistent with strong consistency for writes.
Media Delivery
Global CDN + edge caching
Photos/videos served from nearest edge. Message log contains only media_id reference. Decouples media from text delivery.
Push Notifications
APNs / FCM + custom gateway
Custom aggregation layer batches badge updates. Direct APNs/FCM for @mentions. Reduces API calls to Apple/Google by 99%.
Service Mesh
Custom RPC framework
Internal services communicate via lightweight binary RPC. Service discovery, load balancing, and circuit breaking built-in.
Datacenter Strategy
Multi-DC active-active
5+ DCs globally. Users connect to nearest DC. Channel logs replicated synchronously within region, async across regions.
Why custom protocol? HTTP/JSON adds ~200 bytes overhead per request. At 700M users making frequent pulls, that's 140GB/day of wasted bandwidth just in headers. MTProto's binary framing reduces this to ~20 bytes overhead · a 10· improvement that matters at Telegram's scale.
Interview Cheat Sheet
5 key points to nail the Telegram large-group fan-out question
?? 5 Points That Win the Interview
1.Write-time fan-out is impossible at 1M+ · prove it with math: 1M members · 100 msgs/day = 100M writes/day per channel. Show the interviewer you understand why the naive approach fails before proposing the solution. 2.Read-time pull from append-only log · single write per message, clients pull from their cursor. Write cost is O(1) regardless of channel size. This is the core architectural insight. 3.Decouple notifications from content delivery · push notifications are lightweight signals (badge count), not message content. Only @mentions trigger immediate push. 80% of members are muted = zero push work. 4.Client-side cursor eliminates server state · server doesn't track per-user read position for 1M users. Client stores last_read_msg_id locally. On reconnect, client tells server "give me everything after X." 5.Thundering herd mitigation · when a message arrives, 50K long-poll connections wake up. Solve with: jittered wakeup, read replicas, edge caching of recent messages, and connection-level rate limiting.
Interviewer follow-ups to prepare for:
· "How do you handle ordering?" ? Monotonic msg_id per channel. Client detects gaps and re-fetches.
· "What about read receipts?" ? Not feasible at 1M. Show "X people saw this" as an approximate counter, not a full list.
· "How does this differ from Kafka?" ? Similar concept (log + consumer offset), but Kafka tracks offsets server-side. Telegram pushes offset tracking to the client to avoid 1M offset updates per message.
· "What if the log server goes down?" ? Multi-DC replication. Client reconnects to any DC and resumes from cursor. No data loss after ACK.