System Design Case Study

How does Zoom handle 300-person meetings with adaptive quality and <200ms latency?

Design video conferencing for 300 participants with per-user adaptive quality

Problem Statement

How does a video conferencing system handle 300-person meetings with screen sharing, adapting each participant's video quality in real time to the available bandwidth while keeping end-to-end (glass-to-glass) latency under 200ms?

Core challenge: With 300 participants, naive full-mesh P2P would require 300 × 299 = 89,700 streams. Even with an SFU, the server must selectively forward only relevant streams and adapt quality per receiver without overwhelming any single participant's bandwidth.
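The stream counts can be checked with a few lines of Python. This is a minimal sketch: the function names are hypothetical, and the SFU figure assumes each receiver is sent at most 25 video streams (one gallery page), which is an assumption about the layout rather than a fixed property of an SFU.

```python
def p2p_streams(n: int) -> int:
    """Full mesh: every participant sends one stream to every other participant."""
    return n * (n - 1)


def sfu_streams(n: int, visible_per_receiver: int) -> int:
    """SFU: n uplinks, plus only the streams each receiver's layout shows."""
    # A receiver can never be shown more streams than there are other participants.
    visible = min(visible_per_receiver, n - 1)
    return n + n * visible


print(p2p_streams(300))      # 89700
print(sfu_streams(300, 25))  # 7800
```

The SFU total grows linearly with the number of participants once the visible-tile count is capped, while the mesh total grows quadratically.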
300 participants: single meeting
<200ms end-to-end latency: glass-to-glass
300M+ daily meeting participants: Zoom peak (2020+)
Adaptive per-participant quality: based on bandwidth + layout

Architecture: SFU (Selective Forwarding Unit)

Server receives all streams, selectively forwards relevant ones to each participant

LAYER 1: SENDER
Each participant encodes 3 simulcast layers plus Opus audio and sends them via UDP/SRTP to the SFU.

Participant sender (×300)
- Simulcast: encodes 3 quality layers simultaneously: 1080p @ 30fps, 720p @ 30fps, 180p @ 15fps
- Opus audio (48kHz, FEC)
- Upload: ~3 Mbps total

UDP/SRTP transport layer
- UDP (no head-of-line blocking) + SRTP encryption + DTLS key exchange
- NACK for selective retransmit, FEC for audio resilience
- Tolerates 5-10% packet loss without visible degradation

LAYER 2: SFU INTERNALS
Packet router + bandwidth estimator + layer selector + speaker detector.

SFU internals (single node)
- No transcoding: CPU cost is O(1) per packet (just routing)
- Packet router (RTP forwarding)
- Bandwidth estimator (TWCC/REMB)
- Layer selector (per-receiver quality)
- Active speaker detector (audio energy)

Decision logic (per receiver, re-evaluated every 1-2s):
1. Who is speaking? → audio energy
2. Receiver bandwidth? → TWCC feedback
3. Which layer fits? → select hi/med/lo
4. What's visible? → layout subscription
5. Forward only the needed packets

Cascaded SFU topology (multi-region: 1 SFU per geographic region)
- SFU US-East: 100 users; SFU EU-West: 120 users; SFU AP-South: 80 users
- Inter-SFU relay over dedicated links
- Only unique streams are relayed between regions: 1 stream per speaker per region (not per user)
- Users connect to the nearest SFU (latency)
- Failover: reconnect to a backup SFU (ICE restart)

LAYER 3: RECEIVER
Each receiver gets 1 high-quality speaker + N thumbnails, rendered in gallery or speaker view.

Good-bandwidth receiver (5 Mbps)
- Speaker @ 720p + 24 thumbnails @ 180p
- Screen share @ 1080p (if active)
- Audio for all active speakers
- Gallery view: 5×5 grid
- Total: ~5 Mbps download

Low-bandwidth receiver (1 Mbps)
- Speaker @ 360p + 4 thumbnails @ 180p
- Screen share @ 720p (downgraded)
- Audio always at full quality
- Speaker view: 1 large + 4 small
- Total: ~1 Mbps download

Mobile receiver (cellular)
- Speaker @ 360p only + audio for active speakers
- No gallery (saves data and battery)
- Speaker view only (small screen)
- Total: ~0.5 Mbps download

Summary: SFU, not MCU (no server-side mixing) | Simulcast: sender encodes 3 layers, SFU picks per receiver | UDP: tolerate loss rather than TCP delay, <200ms glass-to-glass | Cascaded SFUs: 1 per region, inter-SFU relay | 300 users = 7,800 streams (vs 89,700 P2P) | Audio never dropped
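The "who is speaking?" step of the decision logic can be sketched as a smoothed audio-energy comparison. This is an illustrative sketch only, not how any particular SFU does it: the class name, the smoothing factor `alpha`, and the `switch_margin` hysteresis are all assumptions, and a real detector would also decay the levels of participants who stop reporting.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class ActiveSpeakerDetector:
    alpha: float = 0.3          # EMA smoothing factor (assumed value)
    switch_margin: float = 1.5  # challenger must be this much louder (assumed value)
    levels: Dict[str, float] = field(default_factory=dict)
    current: Optional[str] = None

    def update(self, participant: str, energy: float) -> Optional[str]:
        """Feed one audio-energy sample and return the current active speaker."""
        prev = self.levels.get(participant, 0.0)
        # Exponential moving average smooths out short bursts of noise.
        self.levels[participant] = (1 - self.alpha) * prev + self.alpha * energy
        loudest = max(self.levels, key=lambda p: self.levels[p])
        if self.current is None:
            self.current = loudest
        elif loudest != self.current:
            # Hysteresis: switch only when the challenger is clearly louder,
            # so the large tile does not flicker between two people.
            if self.levels[loudest] > self.switch_margin * self.levels[self.current]:
                self.current = loudest
        return self.current


detector = ActiveSpeakerDetector()
detector.update("alice", 1.0)
detector.update("bob", 1.0)
```

The hysteresis margin trades responsiveness for stability: a larger margin means fewer spurious switches but a slower hand-over when a new person starts talking.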

Key Design Decisions

Decision | Choice | Why
Topology | SFU (not MCU, not P2P) | An MCU transcodes (expensive CPU). P2P doesn't scale past about 5 participants. An SFU just routes packets.
Simulcast | Each sender encodes 3 layers (hi/med/lo) | SFU picks a layer per receiver without transcoding
SVC | Scalable Video Coding (temporal/spatial layers) | Drop layers mid-stream for congestion control
Active speaker | Audio energy detection → promote to hi-res | Only 1-3 speakers at a time need full quality
Gallery view | 25 thumbnails (180p) + pagination | Bandwidth: 25 × 180p ≈ 5 Mbps, whereas 300 × 720p ≈ 600 Mbps is impossible
Congestion | GCC (Google Congestion Control) with TWCC or REMB feedback | Per-receiver bandwidth estimation, graceful degradation
Protocol | WebRTC (SRTP over UDP) | Low latency, NAT traversal (ICE/STUN/TURN), encrypted media (DTLS-SRTP; the SFU terminates SRTP, so true end-to-end encryption needs an extra layer)
Multi-region | Cascading SFUs (SFU-to-SFU relay) | Users connect to the nearest SFU, SFUs relay between regions
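To see why cascading helps, compare how many streams cross region boundaries when each speaker's stream is relayed once per remote region versus once per remote user. This is a rough sketch under stated assumptions: the function names are hypothetical, the region sizes are the example's 100/120/80 split, and every populated remote region is assumed to subscribe to every speaker.

```python
from typing import Dict


def cascaded_relays(users: Dict[str, int], speakers: Dict[str, int]) -> int:
    """Each speaker's stream is sent once to every other populated region."""
    populated = [region for region, count in users.items() if count > 0]
    total = 0
    for region, speaker_count in speakers.items():
        remote_regions = [r for r in populated if r != region]
        total += speaker_count * len(remote_regions)
    return total


def per_user_relays(users: Dict[str, int], speakers: Dict[str, int]) -> int:
    """Without cascading: one cross-region stream per speaker per remote user."""
    total = 0
    for region, speaker_count in speakers.items():
        remote_users = sum(n for r, n in users.items() if r != region)
        total += speaker_count * remote_users
    return total


users = {"us-east": 100, "eu-west": 120, "ap-south": 80}
speakers = {"us-east": 1}
print(cascaded_relays(users, speakers))  # 2
print(per_user_relays(users, speakers))  # 200
```

With one speaker in US-East, two inter-region streams replace two hundred; each regional SFU then fans the stream out to its local users.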
Simulcast explained: Each sender encodes video at 3 qualities simultaneously (e.g., 1080p + 720p + 180p). The SFU picks which layer to forward to each receiver based on that receiver's bandwidth and layout position. No server-side transcoding is needed, just packet routing.
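Per-receiver layer selection can be sketched as a greedy budget allocation. Everything here is illustrative: the function name is hypothetical, and the layer bitrates are assumptions (720p at about 2 Mbps and 180p at about 0.1 Mbps are loosely taken from the bandwidth figures in this case study; the 1080p rate is a guess). A production SFU would also reserve budget for audio and screen share.

```python
from typing import Dict, List

# Simulcast layers, highest quality first: (name, assumed bitrate in kbps).
LAYERS = [("1080p", 3000), ("720p", 2000), ("180p", 100)]


def plan_forwarding(budget_kbps: int, speaker: str, others: List[str]) -> Dict[str, str]:
    """Pick the best layer that fits for the speaker, then fill thumbnails."""
    plan: Dict[str, str] = {}
    remaining = budget_kbps
    # The active speaker gets the highest layer that fits the estimated budget.
    for name, rate in LAYERS:
        if rate <= remaining:
            plan[speaker] = name
            remaining -= rate
            break
    # Everyone else gets the lowest layer until the budget runs out.
    low_name, low_rate = LAYERS[-1]
    for participant in others:
        if low_rate > remaining:
            break
        plan[participant] = low_name
        remaining -= low_rate
    return plan


thumbs = [f"user{i}" for i in range(10)]
print(plan_forwarding(2500, "speaker", thumbs))
```

With a 2.5 Mbps estimate, the sketch forwards the speaker at 720p plus five 180p thumbnails, which matches the speaker-view budget in the bandwidth math. The SFU only chooses which already-encoded packets to forward; it never re-encodes.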
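Bandwidth math: Active speaker view: 1 × 720p (2 Mbps) + 5 × 180p thumbnails (0.5 Mbps total) ≈ 2.5 Mbps down. Gallery 5×5: 25 × 180p ≈ 5 Mbps down (this assumes about 0.2 Mbps per tile, roughly double the per-thumbnail rate used for speaker view). Upload: 1 simulcast stream (all 3 layers) ≈ 3 Mbps up. Total per user, down plus up: ~5-8 Mbps, feasible on most connections.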
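The arithmetic above can be reproduced as follows. This is a back-of-the-envelope sketch, not a measurement: the constant and function names are hypothetical, and the per-thumbnail rates (100 kbps in speaker view, 200 kbps in gallery view) are inferred from the totals quoted in the text.

```python
SPEAKER_720P_KBPS = 2000       # 1 x 720p, as quoted in the text
UPLOAD_SIMULCAST_KBPS = 3000   # all three simulcast layers, as quoted in the text


def speaker_view_down_kbps(thumbs: int = 5, thumb_kbps: int = 100) -> int:
    """One large 720p tile plus a strip of 180p thumbnails."""
    return SPEAKER_720P_KBPS + thumbs * thumb_kbps


def gallery_down_kbps(tiles: int = 25, tile_kbps: int = 200) -> int:
    """A full gallery page of 180p tiles (inferred 200 kbps per tile)."""
    return tiles * tile_kbps


def total_kbps(down_kbps: int) -> int:
    """Download plus the participant's own simulcast upload."""
    return down_kbps + UPLOAD_SIMULCAST_KBPS


print(total_kbps(speaker_view_down_kbps()))  # 5500
print(total_kbps(gallery_down_kbps()))       # 8000
```

The two totals, 5.5 Mbps and 8 Mbps, bracket the "~5-8 Mbps per user" figure.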
Anti-patterns: MCU for large meetings: CPU cost explodes (decode + re-encode per participant). P2P for >5 users: upload bandwidth × N kills the sender. Single SFU globally: cross-continent latency ruins the experience. No simulcast: the SFU must transcode (defeats the purpose).
Real-world: Zoom: proprietary SFU with simulcast + SVC. Google Meet: WebRTC SFU with VP9 SVC. Discord: custom SFU for voice (Opus codec, 50ms target). Twilio: SFU-as-a-service (Programmable Video).

Interview Cheat Sheet

The 6 things to say for large video meeting design

1. SFU over MCU: route packets, don't transcode (scales to 300+ without CPU explosion)
2. Simulcast (3 layers): sender encodes high/med/low, SFU picks per receiver's bandwidth
3. Active speaker detection: forward high quality only for speakers, thumbnails for others
4. Cascaded SFUs: one per region, inter-SFU relay for geo-distributed meetings
5. TWCC bandwidth estimation: per receiver, adapt the forwarded layer in real time
6. UDP + FEC: tolerate packet loss without retransmit delay (Opus audio handles 10% loss)