Telegram Large Group Fan-out

🎯 Design a large-group messaging system that delivers messages to 1M+ members with <2s latency · without write-time fan-out

Concepts Involved

Pub/Sub Partitioning Backpressure Caching Strategies Kafka

Problem Statement

How does a group chat with 1M+ members deliver a single message without causing a fan-out explosion, keeping delivery latency under 2 seconds for all participants regardless of group size?

Scope: Large channel/group delivery path · not 1:1 DMs, not media upload, not E2E encryption. The hard question: how do you avoid O(N) write amplification when N = 1,000,000✗

1M+

members per channel

some channels exceed 5M

200K+

channels at this scale

public broadcast channels

<2s

delivery target

all members, any region

700M+

monthly active users

global distribution

Cross-reference: Slack's Fan-out Strategy covers hybrid fan-out up to 50K members · this problem goes 20· further. At 1M+ members, even Slack's "push to online only" model breaks because 10% online = 100K simultaneous pushes per message.

Functional Requirements

What the system must do · core behaviours for 1M+ member channels

Must Have (Core)

1. Admin/allowed user can post a message to a channel with 1M+ members
2. All online members receive the message within 2 seconds
3. Offline members see the message on next app open (pull-based catch-up)
4. Messages are ordered · all members see the same sequence
5. Notification tiering · @mentions push immediately, regular messages update badge only
6. System handles member join/leave storms without disrupting delivery

Out of Scope (this design)

✗ End-to-end encryption (not applicable to large public channels)
✗ Media/file delivery (separate CDN pipeline)
✗ Message editing/deletion propagation
✗ Bot API and webhook integrations
✗ User authentication and session management
✗ Small group chat (<1K members · standard fan-out works fine)

Non-Functional Requirements

Quality constraints that shape the architecture for extreme fan-out

Property	Target	Why It Matters / Design Impact
Latency	<2s delivery for all online members	Users expect near-instant delivery. Read-time pull with long-polling achieves this without write amplification.
Throughput	100M+ deliveries/day per large channel	A single channel with 1M members and 100 msgs/day = 100M delivery events. Must not scale linearly with writes.
Availability	99.99% (<52 min/year)	Channels are broadcast infrastructure for governments and media. Downtime = public trust loss.
Durability	Zero message loss after server ACK	Append-only log with replication. Messages never deleted from server (unlike E2E chats).
Scalability	Linear with readers, not writers	Write cost = O(1) per message. Read cost distributed across millions of clients pulling independently.
Fan-out ratio	1 write : 0 server-side copies	No per-user inbox. Single log entry serves all members. Eliminates write amplification entirely.
Consistency	Per-channel total order	All members see messages in identical sequence. Monotonic message IDs enable gap detection.
Fault tolerance	Survive datacenter failures	Multi-DC replication of channel log. Clients reconnect to any DC and resume from cursor.

Key insight: At 1M+ members, the system cannot afford O(N) work per message. The architecture must make message posting O(1) and shift the cost to readers · each reader does O(1) work to fetch their unread messages.

Scale Estimation

Derive why write-time fan-out is impossible at 1M+ members · and what replaces it

Given: 1M members per channel · 100 messages/day per channel · <2s delivery · 700M+ MAU

Step	What to Derive	Calculation	Result	Design Decision
1	Deliveries per channel/day	1M members · 100 msgs/day	100M deliveries/day	Per-channel delivery volume · impossible to write-amplify
2	Write-time fan-out cost	1M inbox writes · 100 msgs/day	100M writes/day/channel ✗	With 200K channels: 20 trillion writes/day. Completely infeasible.
3	Read-time cost (log model)	1 append per message (O(1) write)	100 writes/day/channel ✔	Single append-only log. Readers pull. Write cost independent of member count.
4	Online members at any time	1M · 5% online rate	~50K concurrent readers	50K long-poll connections per large channel · manageable with connection pooling
5	Notification fan-out (push)	Only @mentions trigger push: ~1% of msgs	~1 push/day per user	Push notifications are rare · most msgs only increment badge counter
6	Storage per channel/day	100 msgs · 2KB avg (text + metadata)	~200KB/day/channel	Trivial storage. 200K channels · 200KB = 40GB/day total for all large channels
7	Cursor storage	700M users · 8 bytes (last_read_msg_id)	~5.6GB	Per-user cursor per channel. Fits in distributed cache. No per-user message copies.
8	Latency budget	Long-poll wakeup (~50ms) + log read (~5ms) + network (~100ms)	~155ms actual	Well within 2s budget. Long-poll gives near-real-time without persistent connections.

The impossibility proof: Write-time fan-out at 1M members means writing 1M inbox entries per message. At 100 msgs/day that's 100M writes/day for ONE channel. Multiply by 200K large channels = 20 trillion writes/day. No storage system can sustain this. Read-time pull from a shared log is the only viable architecture.

Architecture Overview

Single append-only channel log eliminates write amplification · clients pull from shared log independently

Key insight: The architecture is read-time pull from a shared channel log. One write serves all members. Online clients long-poll for new entries; offline clients fetch from their last-seen cursor on reconnect. See detailed strategies below.

Fan-out Strategies at Scale

Comparing three approaches · and why read-time pull wins at 1M+ members

Cross-reference: Slack uses hybrid fan-out for channels up to 50K members · push to online, offline users catch up on reconnect. At 1M+ even the "push to online" part becomes 50K·100K pushes per message, creating unacceptable latency spikes. Telegram's solution: don't push message content at all · push only a lightweight "new message available" signal and let clients pull.

Channel Log Architecture

Single append-only log per channel · clients maintain their own cursor. Server tracks zero per-user delivery state.

Why this works: The server does zero per-user bookkeeping. It doesn't know who has read what · that's the client's job. When a client opens the app, it says "give me everything after msg_id X" and the server does a simple range scan on the log. This makes the system scale with readers (distributed) rather than writers (centralized).

Notification Tiering

Only push notifications fan-out · not message content. Tiered by priority to minimize push volume.

Key insight: Notification fan-out and message delivery are completely decoupled. Message content lives in the channel log (pulled by clients). Push notifications are lightweight signals ("you have N unread") that fan-out only to unmuted, offline users · and even those are batched to reduce volume by 99%+.

Resilience & Edge Cases

What breaks at 1M+ scale · and how the architecture handles it

Edge Case	Problem	Solution	Impact if Unhandled
Viral message in 1M channel	Sudden spike: 50K clients simultaneously pull after a popular post	CDN-like caching of recent messages at edge. Hot channel log replicated to read replicas. Rate-limit pulls to 1/sec per client.	Log server overload, cascading timeouts for all channels on same shard
Member join/leave storms	100K users join a channel in minutes (viral invite link). Membership list update becomes bottleneck.	Membership is eventually consistent. New members start cursor at latest msg_id. Membership changes don't affect the log · only the "who can read" ACL check on pull.	Membership write lock blocks message delivery if coupled
Cursor management for offline users	User offline for 30 days. Channel has 3000 new messages. Full catch-up is expensive.	Paginated fetch (100 msgs/page). Show latest first, load history on scroll-up. If gap > 1000 msgs, show "jump to latest" with unread count only.	Client OOM on massive payload. Server timeout on huge range scan.
Thundering herd on popular channel	Channel admin posts → 50K long-poll connections wake up simultaneously → 50K read requests hit log server at once	Staggered wakeup: add random jitter (0·200ms) to long-poll response. Coalesce reads: batch wakeup notifications, serve from cache. Connection-level rate limiting.	Log server CPU spike, increased latency for all channels on same partition
Multi-device sync	User has phone + desktop + tablet. Each device has its own cursor. Must not show duplicate notifications.	Server-side "last notified msg_id" per user (not per device). Devices sync cursors via shared user state. Push notification sent once, all devices update badge.	Triple notifications per message. User disables notifications entirely.
Log compaction / retention	Channel with 1M msgs over 5 years. Log grows unbounded. Old messages rarely accessed.	Tiered storage: hot (last 7 days, SSD), warm (last 90 days, HDD), cold (archive, object storage). Cursor-based access works across tiers transparently.	Storage costs grow linearly forever. SSD capacity exhaustion.

Design principle: Every edge case is handled by keeping the write path simple (single append) and pushing complexity to the read path (caching, pagination, rate limiting). The log is the source of truth · everything else is derived.

Tech Stack

Telegram's infrastructure choices optimized for large-scale message delivery

Layer	Technology	Why This Choice
Protocol	MTProto (custom)	Binary protocol over TCP. Smaller payloads than JSON/HTTP. Built-in encryption, compression, and multiplexing. Supports long-poll natively.
Transport	TCP + custom framing	Persistent TCP connections with keep-alive. No HTTP overhead. Supports multiple concurrent requests on single connection (multiplexed).
Message Storage	Distributed append-only log	Custom storage engine optimized for sequential writes and range reads. Sharded by channel_id. Replicated across 3+ DCs.
Metadata Store	Distributed KV store	User profiles, channel membership, notification preferences. Eventually consistent with strong consistency for writes.
Media Delivery	Global CDN + edge caching	Photos/videos served from nearest edge. Message log contains only media_id reference. Decouples media from text delivery.
Push Notifications	APNs / FCM + custom gateway	Custom aggregation layer batches badge updates. Direct APNs/FCM for @mentions. Reduces API calls to Apple/Google by 99%.
Service Mesh	Custom RPC framework	Internal services communicate via lightweight binary RPC. Service discovery, load balancing, and circuit breaking built-in.
Datacenter Strategy	Multi-DC active-active	5+ DCs globally. Users connect to nearest DC. Channel logs replicated synchronously within region, async across regions.

Why custom protocol? HTTP/JSON adds ~200 bytes overhead per request. At 700M users making frequent pulls, that's 140GB/day of wasted bandwidth just in headers. MTProto's binary framing reduces this to ~20 bytes overhead · a 10· improvement that matters at Telegram's scale.

Interview Cheat Sheet

5 key points to nail the Telegram large-group fan-out question

➔ 5 Points That Win the Interview

1. Write-time fan-out is impossible at 1M+ · prove it with math: 1M members · 100 msgs/day = 100M writes/day per channel. Show the interviewer you understand why the naive approach fails before proposing the solution.
2. Read-time pull from append-only log · single write per message, clients pull from their cursor. Write cost is O(1) regardless of channel size. This is the core architectural insight.
3. Decouple notifications from content delivery · push notifications are lightweight signals (badge count), not message content. Only @mentions trigger immediate push. 80% of members are muted = zero push work.
4. Client-side cursor eliminates server state · server doesn't track per-user read position for 1M users. Client stores last_read_msg_id locally. On reconnect, client tells server "give me everything after X."
5. Thundering herd mitigation · when a message arrives, 50K long-poll connections wake up. Solve with: jittered wakeup, read replicas, edge caching of recent messages, and connection-level rate limiting.

Interviewer follow-ups to prepare for:
· "How do you handle ordering?" → Monotonic msg_id per channel. Client detects gaps and re-fetches.
· "What about read receipts?" → Not feasible at 1M. Show "X people saw this" as an approximate counter, not a full list.
· "How does this differ from Kafka?" → Similar concept (log + consumer offset), but Kafka tracks offsets server-side. Telegram pushes offset tracking to the client to avoid 1M offset updates per message.
· "What if the log server goes down?" → Multi-DC replication. Client reconnects to any DC and resumes from cursor. No data loss after ACK.

System Design Case Study