How does Zoom handle 300-person video meetings?

🎯 Design video conferencing for 300 participants with per-user adaptive quality

Concepts Involved

WebSocket, Load Balancer, Sharding, CDN, Backpressure

Problem Statement

How does a video conferencing system handle 300-person meetings with screen sharing, adapting each participant's video quality in real time to their available bandwidth while keeping end-to-end latency under 200 ms?

Core challenge: With 300 participants, a naive P2P full mesh would require 300 × 299 = 89,700 one-way streams. Even with an SFU, the server must forward only the relevant streams and adapt quality per receiver without overwhelming any single participant's bandwidth.
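
The stream counts above can be checked with a few lines of Python. This is a minimal sketch: the function names are made up for illustration, and the cap of 25 forwarded streams per receiver is borrowed from the gallery view discussed later, not a fixed property of any SFU.

```python
def mesh_stream_count(n: int) -> int:
    """One-way media streams in a full mesh: each of n peers sends to n - 1 others."""
    return n * (n - 1)


def sfu_stream_count(n: int, forwarded_per_receiver: int) -> tuple[int, int]:
    """Return (uplinks, downlinks) when all media goes through one SFU.

    Each participant uploads once; the SFU forwards at most
    `forwarded_per_receiver` streams to each participant.
    """
    uplinks = n
    downlinks = n * min(forwarded_per_receiver, n - 1)
    return uplinks, downlinks


print(mesh_stream_count(300))     # 89700
print(sfu_stream_count(300, 25))  # (300, 7500)
```

The point of the comparison is that the mesh grows quadratically, while the SFU's forwarded stream count grows linearly once the per-receiver cap is fixed.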

300 participants (single meeting)

<200 ms end-to-end latency (glass-to-glass)

300M+ daily meeting participants (Zoom peak, 2020+)

Adaptive per-participant quality (based on bandwidth + layout)

Architecture · SFU (Selective Forwarding Unit)

Server receives all streams, selectively forwards relevant ones to each participant

Key Design Decisions

Decision	Choice	Why
Topology	SFU (not MCU, not P2P)	MCU transcodes (expensive CPU). P2P doesn't scale past 5. SFU just routes packets.
Simulcast	Each sender encodes 3 layers (hi/med/lo)	SFU picks layer per receiver without transcoding
SVC	Scalable Video Coding (temporal/spatial layers)	Drop layers mid-stream for congestion control
Active speaker	Audio energy detection → promote to hi-res	Only 1-3 speakers at a time need full quality
Gallery view	25 thumbnails (180p) + pagination	Bandwidth: 25 × 180p ≈ 5 Mbps is feasible; 300 × 720p (≈ 600 Mbps at 2 Mbps each) is not
Congestion	GCC (Google Congestion Control) + REMB	Per-receiver bandwidth estimation, graceful degradation
Protocol	WebRTC (SRTP over UDP)	Low latency, NAT traversal (ICE/STUN/TURN), encryption in transit (SRTP is hop-by-hop through an SFU; end-to-end encryption needs an additional layer)
Multi-region	Cascading SFUs (SFU-to-SFU relay)	Users connect to nearest SFU, SFUs relay between regions
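
The "Active speaker" row can be sketched as follows. This is an illustrative simplification, not Zoom's method: the class name, smoothing factor, and noise floor are assumptions, and it works on raw decoded samples, whereas a production SFU would more likely read an audio level reported by the client than decode audio itself.

```python
import math


def rms(samples: list[float]) -> float:
    """Root-mean-square energy of one audio frame (samples in [-1.0, 1.0])."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


class ActiveSpeakerDetector:
    """Tracks a smoothed audio level per participant and ranks the loudest."""

    def __init__(self, alpha: float = 0.3, max_speakers: int = 3, floor: float = 0.01):
        self.alpha = alpha                # smoothing factor (assumed value)
        self.max_speakers = max_speakers  # source says 1-3 speakers need full quality
        self.floor = floor                # levels below this count as silence (assumed)
        self.levels: dict[str, float] = {}

    def update(self, participant_id: str, samples: list[float]) -> None:
        """Fold a new frame into the participant's exponentially smoothed level."""
        prev = self.levels.get(participant_id, 0.0)
        self.levels[participant_id] = self.alpha * rms(samples) + (1 - self.alpha) * prev

    def active_speakers(self) -> list[str]:
        """Loudest participants above the noise floor, loudest first."""
        loud = [(lvl, pid) for pid, lvl in self.levels.items() if lvl >= self.floor]
        loud.sort(reverse=True)
        return [pid for _, pid in loud[: self.max_speakers]]


detector = ActiveSpeakerDetector()
detector.update("alice", [0.5, -0.5, 0.5, -0.5])
detector.update("bob", [0.001, -0.001, 0.001, -0.001])
detector.update("carol", [0.1, -0.1, 0.1, -0.1])
print(detector.active_speakers())  # ['alice', 'carol']
```

Smoothing keeps a cough or keyboard click from briefly promoting a participant to the high-resolution slot; the SFU would then request or forward the high layer only for the returned IDs.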

Simulcast explained: Each sender encodes video at 3 qualities simultaneously (e.g., 1080p + 720p + 180p). The SFU picks which layer to forward to each receiver based on their bandwidth and layout position. No server-side transcoding is needed, just packet routing.
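
A per-receiver layer choice might look like the sketch below. The layer table is an assumption for illustration: the 720p and 180p bitrates follow the figures used elsewhere in this article (2 Mbps and roughly 0.2 Mbps), the 1080p bitrate is a guess, and `select_layer` is a hypothetical helper rather than any real SFU's API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Layer:
    name: str
    height: int  # vertical resolution in pixels
    kbps: int    # approximate bitrate of this simulcast layer


# Ascending by quality. The 1080p bitrate is an assumed value.
LAYERS = [
    Layer("low", 180, 200),
    Layer("mid", 720, 2000),
    Layer("high", 1080, 4000),
]


def select_layer(budget_kbps: int, tile_height: int) -> Optional[Layer]:
    """Pick the layer to forward for one sender to one receiver.

    Chooses the smallest layer that covers the receiver's tile height,
    falling back to the best layer that fits the bandwidth budget.
    Returns None if even the lowest layer does not fit (video paused).
    """
    chosen = None
    for layer in LAYERS:
        if layer.kbps > budget_kbps:
            break                        # this and all higher layers are too expensive
        chosen = layer
        if layer.height >= tile_height:
            break                        # sharp enough for this tile; stop upgrading
    return chosen


print(select_layer(5000, 1080).name)  # high
print(select_layer(5000, 180).name)   # low
print(select_layer(1000, 720).name)   # low
print(select_layer(100, 180))         # None
```

Note that a thumbnail tile gets the low layer even when the receiver has bandwidth to spare, which is how layout position enters the decision alongside bandwidth.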

Bandwidth math: Active speaker view: 1 × 720p (2 Mbps) + 5 × 180p thumbnails (0.5 Mbps total) = ~2.5 Mbps down. Gallery 5 × 5: 25 × 180p = ~5 Mbps down. Upload: 1 simulcast stream = ~3 Mbps up. Total per user: ~5-8 Mbps, feasible on most connections.
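
The same arithmetic, written out so the totals can be re-derived when the assumptions change. The constants are the article's own approximate figures; the SFU egress line is an extra back-of-the-envelope estimate that assumes all 300 participants are in gallery view on a single server.

```python
# Approximate figures taken from the bandwidth math above (Mbps).
SPEAKER_720P = 2.0       # one 720p active-speaker stream
THUMBNAIL_STRIP = 0.5    # five 180p thumbnails, combined
GALLERY_5X5 = 5.0        # twenty-five 180p tiles, combined
SIMULCAST_UPLOAD = 3.0   # one sender's simulcast layers, combined

PARTICIPANTS = 300


def per_user_total(download_mbps: float) -> float:
    """Download for the chosen layout plus the user's own simulcast upload."""
    return download_mbps + SIMULCAST_UPLOAD


speaker_view_down = SPEAKER_720P + THUMBNAIL_STRIP
print(speaker_view_down)                  # 2.5
print(per_user_total(speaker_view_down))  # 5.5
print(per_user_total(GALLERY_5X5))        # 8.0

# Worst-case egress if every participant watched the gallery from one SFU.
print(PARTICIPANTS * GALLERY_5X5)         # 1500.0
```

The per-user totals of 5.5 and 8.0 Mbps match the ~5-8 Mbps range quoted above; the egress estimate is one reason large meetings are spread across cascaded SFUs.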

Anti-patterns: MCU for large meetings: CPU cost explodes (decode + re-encode per participant). P2P for >5 users: each sender's upload bandwidth is multiplied by N, which overwhelms the sender. Single global SFU: cross-continent latency ruins the experience. No simulcast: the SFU must transcode, which defeats its purpose.

Real-world: Zoom (proprietary SFU with simulcast + SVC); Google Meet (WebRTC SFU with VP9 SVC); Discord (custom SFU for voice, Opus codec, 50 ms target); Twilio (SFU-as-a-service, Programmable Video).

Interview Cheat Sheet

The 6 things to say for large video meeting design

1. SFU over MCU: route packets, don't transcode (scales to 300+ without CPU explosion)
2. Simulcast (3 layers): sender encodes high/med/low, SFU picks per receiver's bandwidth
3. Active speaker detection: forward high quality only for speakers, thumbnails for others
4. Cascaded SFUs: one per region, inter-SFU relay for geo-distributed meetings
5. TWCC bandwidth estimation: per-receiver, adapt the forwarded layer in real time
6. UDP + FEC: tolerate packet loss without retransmit delay (Opus audio handles 10% loss)
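
To make point 6 concrete, here is a toy XOR-parity scheme: one extra packet per group lets the receiver rebuild any single lost packet without waiting for a retransmit. It is a deliberately simplified sketch, assuming equal-length packets and at most one loss per group; the FEC schemes actually used with WebRTC are more elaborate, and the function names here are hypothetical.

```python
def xor_parity(packets: list[bytes]) -> bytes:
    """XOR equal-length packets together, byte by byte, into one parity packet."""
    if not packets:
        raise ValueError("need at least one packet")
    size = len(packets[0])
    if any(len(p) != size for p in packets):
        raise ValueError("this sketch assumes equal-length packets")
    parity = bytearray(size)
    for packet in packets:
        for i, byte in enumerate(packet):
            parity[i] ^= byte
    return bytes(parity)


def recover_lost(received: list[bytes], parity: bytes) -> bytes:
    """Rebuild the single missing packet of a group.

    XOR-ing the parity with every packet that did arrive cancels them out,
    leaving exactly the bytes of the packet that was lost.
    """
    return xor_parity(received + [parity])


group = [b"aaaa", b"bbbb", b"cccc"]
parity = xor_parity(group)

# Suppose the second packet is dropped by the network.
arrived = [group[0], group[2]]
print(recover_lost(arrived, parity))  # b'bbbb'
```

The trade-off is bandwidth for latency: the sender spends one extra packet per group so that the receiver never pays a round trip to repair a single loss.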
