Real-Time System Design

1. Chat & Messaging (WhatsApp, Slack, Discord, Telegram)

Real-time delivery, fan-out, presence, ordering at scale · 11 problems

# | Problem | Company / Scale
1 | How can a real-time chat system deliver a single message to 50,000 online users within 200ms while handling 10B+ messages per day (~115K msgs/sec, peak ~200K msgs/sec) without hitting scalability bottlenecks? | Slack · 10B msg/day
2 | How does a messaging system guarantee delivery to 2B+ users who go offline for hours/days, ensuring zero message loss, correct ordering on reconnect, and end-to-end encryption without the server ever seeing plaintext? | WhatsApp · 2B users
3 | How does a chat platform maintain 10M+ concurrent WebSocket connections across thousands of gateway servers, handling heartbeats, shard assignment, graceful failover, and voice signaling without dropping connections? | Discord · millions WS
4 | How does a multi-region chat system guarantee causal message ordering within conversations when an entire cloud region goes down, while maintaining consistency and conflict resolution across geographically distributed nodes? | Multi-region ordering
5 | How does a group chat with 1M+ members deliver a single message without causing a fan-out explosion, keeping delivery latency under 2 seconds for all participants regardless of group size? | Telegram · 1M members
6 | How does a presence system track online/offline status for 100M+ concurrent users in real-time, delivering status changes to relevant contacts within seconds without flooding the network with unnecessary updates? | WhatsApp · 100M presence
7 | How does a chat system handle millions of ephemeral "typing..." events per second without persisting anything to disk, while ensuring sub-100ms delivery to all conversation participants? | Slack · ephemeral events
8 | How does a search system index 10B+ messages with <5s indexing latency from send to searchable, and return full-text results in <100ms across the entire message corpus? | Slack · billions msgs search
9 | How does a notification system deliver push notifications to millions of devices without duplicates, handling device token lifecycle, retry logic, and cross-platform delivery (iOS/Android) reliably? | WhatsApp · dedup push
10 | How does a multi-device messaging app keep messages, read status, and edits perfectly synced across phone/tablet/desktop, resolving conflicts when edits happen simultaneously on different devices? | Telegram · multi-device sync
11 | Given message events across millions of channels, design a system that continuously computes the top active channels/users for the last 1 hour without scanning historical data, updating rankings within seconds of activity changes. | Slack · top active channels
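
One common way to approach problem 11 (top active channels over the last hour without rescanning history) is to keep per-bucket counters and a running total, subtracting whole buckets as they age out. The sketch below is a minimal single-process illustration under stated assumptions: the class name `SlidingTopK`, the 60-second bucket size, and the requirement that timestamps arrive in non-decreasing order are all hypothetical choices, not anything the problem statement prescribes.

```python
import heapq
from collections import Counter, deque


class SlidingTopK:
    """Top-k keys over a sliding time window, kept as per-bucket counters."""

    def __init__(self, window_seconds=3600, bucket_seconds=60):
        self.window = window_seconds
        self.bucket = bucket_seconds
        self.buckets = deque()   # (bucket_start, Counter), oldest first
        self.totals = Counter()  # running totals across live buckets

    def _expire(self, now):
        # Drop buckets that ended before the window started and
        # subtract their counts from the running totals.
        while self.buckets and self.buckets[0][0] + self.bucket <= now - self.window:
            _, old = self.buckets.popleft()
            for key, count in old.items():
                self.totals[key] -= count
                if self.totals[key] <= 0:
                    del self.totals[key]

    def record(self, key, ts):
        # Assumption: timestamps arrive in non-decreasing order.
        start = ts - ts % self.bucket
        if not self.buckets or self.buckets[-1][0] != start:
            self.buckets.append((start, Counter()))
        self.buckets[-1][1][key] += 1
        self.totals[key] += 1
        self._expire(ts)

    def top(self, k, now):
        self._expire(now)
        return heapq.nlargest(k, self.totals.items(), key=lambda kv: kv[1])
```

The window is only accurate to one bucket width, which is the usual trade-off for never touching historical events again. At the scale in the table this state would presumably be partitioned by channel across stream-processing workers and merged, which the sketch does not attempt.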

2. Real-Time Collaboration (Google Docs, Figma, Notion, Miro)

Concurrent editing, CRDTs, OT, multiplayer cursors · 7 problems

# | Problem | Company / Scale
1 | How does a collaborative editor handle 100 users editing the same paragraph simultaneously, resolving conflicting character insertions without a central lock while maintaining convergence across all clients within 50ms? | Google Docs · OT
2 | How does a design tool render 50+ cursors moving at 60fps in real-time, achieving conflict-free state merge across all clients while keeping bandwidth under 10KB/sec per user? | Figma · CRDTs
3 | How does a block-based editor sync granular edits (move block, change text, nest) across devices in <100ms, handling concurrent modifications to the same block structure without data loss? | Notion · block sync
4 | How does a collaborative whiteboard handle 1000+ objects being dragged simultaneously by 50 users, merging conflicting position updates without visual glitches or lost operations? | Miro · spatial CRDT
5 | How does a collaborative code editor sync cursor positions, selections, and edits across continents with <150ms latency, maintaining consistent document state despite network delays between geographically distributed participants? | VS Code Live Share
6 | How does a multiplayer game sync world state for 100 players at 60fps, handling client-side prediction, server reconciliation, and entity interpolation while keeping perceived latency below 100ms? | Gaming · state sync
7 | How does a design platform handle real-time collaboration on documents with 500MB+ media assets, ensuring edit operations remain fast (<100ms) regardless of total document size? | Canva · media + collab

3. Video Streaming (Netflix, YouTube, Twitch, Spotify)

Adaptive bitrate, CDN, transcoding, live vs VOD · 8 problems

# | Problem | Company / Scale
1 | How does a streaming platform deliver 4K HDR video to 230M+ subscribers globally without buffering, adapting quality in real-time to each viewer's bandwidth while minimizing rebuffer events to <0.1% of sessions? | Netflix · 230M users
2 | How does a video platform transcode 500+ hours of video uploaded every minute into 8+ resolutions, finishing 4K transcoding in <30 minutes while prioritizing live content over on-demand uploads? | YouTube · 500hr/min
3 | How does a live streaming platform deliver video to 5M+ concurrent viewers with <3 second glass-to-glass latency, while also supporting sub-second interactive streams for smaller audiences? | Twitch · live low-latency
4 | How does a streaming platform handle 25M concurrent viewers during a single live event, gracefully degrading quality under load while maintaining stream continuity and minimizing regional failures? | Hotstar · 25M concurrent
5 | How does a short-video platform achieve instant playback (<200ms to first frame) for an infinite-scroll feed, ensuring zero perceived loading time as users swipe between videos? | TikTok · instant playback
6 | How does adaptive bitrate streaming prevent buffering on degrading networks, seamlessly switching quality mid-stream without visible artifacts while maximizing video quality for available bandwidth? | ABR · HLS/DASH
7 | How do CDNs cache and serve video segments at the edge for 1B+ daily requests, maximizing cache hit rates while minimizing origin load and ensuring popular content is always available at the nearest edge? | CDN · edge caching
8 | How does an audio streaming platform achieve gapless playback with offline mode, ensuring zero gaps between tracks, seamless quality adaptation over cellular, and encrypted local storage for offline content? | Spotify · audio stream

4. Video Calling / WebRTC (Zoom, Google Meet, Discord)

P2P, SFU, MCU, signaling, NAT traversal · 6 problems

# | Problem | Company / Scale
1 | How does a video conferencing system handle 300-person meetings with screen sharing, adapting per-participant video quality in real-time based on available bandwidth while keeping total meeting latency under 200ms? | Zoom · 300 participants
2 | How does a video calling platform dynamically switch between peer-to-peer (2 users, lowest latency) and server-relayed (3+ users) topology, performing seamless mid-call migration without audio/video interruption? | Google Meet · adaptive
3 | How does a voice chat platform handle 100+ users in a single voice channel with <50ms audio latency, selectively mixing only active speakers while maintaining clear audio for all participants? | Discord · voice channels
4 | How does a real-time communication system establish peer-to-peer connections through NATs and firewalls, discovering public endpoints and falling back to relay when direct connection is impossible, while minimizing connection setup time? | WebRTC · NAT traversal
5 | How does a telehealth video platform ensure HIPAA-compliant calls with recording, maintaining end-to-end encryption, consent-based recording, audit trails, and data residency controls per jurisdiction? | Telehealth · compliance
6 | How does a live audio room platform handle thousands of listeners with <500ms latency, dynamically promoting speakers from the audience while distributing audio efficiently to large audiences? | Twitter Spaces · audio rooms

5. Ticket Booking (Uber, Airbnb, Amazon, Ticketmaster)

Seat locking, double-booking prevention, flash sales · 6 problems

# | Problem | Company / Scale
1 | Design a ticket booking system where 5M users attempt to book 50K concert seats simultaneously, guaranteeing no double booking, fair queue ordering with position tracking, distributed seat locking with expiry, and bot detection. | BookMyShow · 5M users, 50K seats
2 | How does a movie theater chain handle seat selection for 500+ screens simultaneously, showing real-time seat availability updates to thousands of concurrent users while preventing double-booking through optimistic locking with sub-second conflict resolution? | AMC · real-time seat map
3 | How does a rental platform prevent double-booking across time zones when two users in different continents try to book the same dates, detecting and resolving calendar conflicts with automatic resolution? | Airbnb · calendar sync
4 | How do airlines handle seat selection with distributed inventory across 100+ booking channels (website, app, agents, GDS), preventing overselling while maintaining responsive seat availability display? | Airlines · distributed inv
5 | How does a ticketing platform manage 1M+ users in a virtual queue with fair ordering, providing real-time position updates and estimated wait times while preventing bot abuse and queue jumping? | Ticketmaster · queue
6 | Design a flash sale system where 20M users try to buy the same product within 2 minutes, preventing overselling while maintaining inventory consistency across 5 regions and confirming orders within seconds. | Amazon · 20M users, 2 min flash
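
The "seat locking with expiry" requirement in problem 1 can be illustrated with a small in-memory model: a hold belongs to one user, lapses after a TTL, and only a live hold can be confirmed into a booking. This is a sketch under assumptions, not any vendor's design. `SeatLocks`, its method names, the 300-second default TTL, and the injectable `clock` are all hypothetical; a distributed deployment would more likely keep the hold in a shared store with an atomic set-if-absent-with-TTL operation (for example Redis `SET key value NX EX ttl`).

```python
import time


class SeatLocks:
    """Single-process model of seat holds that expire after a TTL."""

    def __init__(self, ttl_seconds=300, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self.holds = {}   # seat_id -> (user_id, expires_at)
        self.booked = {}  # seat_id -> user_id

    def hold(self, seat_id, user_id):
        now = self.clock()
        if seat_id in self.booked:
            return False
        holder = self.holds.get(seat_id)
        # Reject only if someone else holds a lock that has not expired yet.
        if holder and holder[1] > now and holder[0] != user_id:
            return False
        self.holds[seat_id] = (user_id, now + self.ttl)
        return True

    def confirm(self, seat_id, user_id):
        now = self.clock()
        holder = self.holds.get(seat_id)
        if seat_id in self.booked or not holder:
            return False
        if holder[0] != user_id or holder[1] <= now:
            return False  # wrong user, or the hold lapsed before payment
        del self.holds[seat_id]
        self.booked[seat_id] = user_id
        return True
```

Passing the clock in makes expiry testable without sleeping. The check-then-write in `hold` is only safe here because the model is single-threaded; a real service would need that step to be atomic.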

6. Cache & CDN (Cloudflare, Redis, Akamai)

Cache invalidation, thundering herd, edge compute · 8 problems

# | Problem | Company / Scale
1 | How does a social platform invalidate cached objects across 1000+ servers within 1 second of a write, preventing stale reads while avoiding thundering herd on the backing store? | Meta · TAO cache
2 | How does a streaming platform prevent thundering herd when a hot cache key expires and 100K requests simultaneously hit the database, ensuring only one request rebuilds the cache while others wait or receive stale data? | Netflix · EVCache
3 | How does an edge network serve 45M+ requests/sec from 300+ PoPs without hitting origin, maximizing cache hit rates through tiered caching and intelligent routing for cache misses? | Cloudflare · edge
4 | How does a social platform cache the home timeline for 400M+ users, handling the asymmetry between normal users (hundreds of followers) and celebrities (100M+ followers) without overwhelming write capacity? | Twitter · timeline cache
5 | How does a distributed cache cluster handle 10M+ ops/sec with automatic failover, completing replica promotion within 2 seconds of node failure while maintaining consistent routing during topology changes? | Redis Cluster · 10M ops/sec
6 | How do CDNs purge cached content globally within 150ms for breaking news updates, invalidating stale content at all edge locations without causing origin overload from simultaneous cache misses? | CDN · instant purge
7 | How does a social platform cache ephemeral content (Stories) that expires after 24 hours, ensuring instant access for sequential viewing while automatically evicting expired content without manual cleanup? | Instagram · TTL cache
8 | How does an edge computing platform execute custom logic (authentication, A/B routing, header manipulation) at 300+ PoPs with <1ms cold start, deploying code changes globally in <30 seconds without origin round-trips? | Cloudflare Workers · edge compute
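
Problem 2 asks that only one request rebuilds a hot key while the others wait. A minimal way to picture that is request coalescing (often called single-flight): the first caller for a missing key becomes the leader and runs the loader, and concurrent callers block on an event until the value lands. The sketch below assumes a hypothetical `SingleFlightCache` class and a caller-supplied `loader` function. It omits TTLs and the serve-stale path entirely and says nothing about how EVCache itself works.

```python
import threading


class SingleFlightCache:
    """Cache where concurrent misses on one key trigger a single load."""

    def __init__(self, loader):
        self.loader = loader   # hypothetical: function key -> value (e.g. a DB read)
        self.values = {}
        self.inflight = {}     # key -> threading.Event for the load in progress
        self.lock = threading.Lock()

    def get(self, key):
        while True:
            with self.lock:
                if key in self.values:
                    return self.values[key]
                event = self.inflight.get(key)
                leader = event is None
                if leader:
                    event = threading.Event()
                    self.inflight[key] = event
            if leader:
                try:
                    value = self.loader(key)
                    with self.lock:
                        self.values[key] = value
                    return value
                finally:
                    # Wake waiters whether the load succeeded or raised.
                    with self.lock:
                        del self.inflight[key]
                    event.set()
            # Follower: wait for the leader, then loop to read the value
            # (or to take over as leader if the load failed).
            event.wait()
```

If the loader raises, the leader's caller sees the exception and one of the waiters retries as the new leader, so a single failure does not poison the key.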

7. Queues & Events (Kafka, RabbitMQ, AWS SQS)

Kafka, exactly-once, dead letters, event sourcing, streaming analytics, notifications · 12 problems

# | Problem | Company / Scale
1 | How does a ride-hailing platform process 1M+ ride events per second with city-level locality, guaranteeing exactly-once processing semantics and computing real-time surge pricing from the event stream? | Uber · Kafka 1M/sec
2 | How does a payment platform guarantee exactly-once processing when network retries can duplicate requests, ensuring no double-charges while maintaining at-least-once delivery guarantees from upstream systems? | Stripe · exactly-once
3 | How does a professional network handle 4 trillion events/day across 100K+ partitions, supporting schema evolution for backward compatibility while auto-scaling consumers based on lag? | LinkedIn · 4T events
4 | How do you design a dead letter queue that never loses messages, isolating poison messages from healthy processing while providing retry policies, monitoring with alerting, and manual replay tooling for operations? | DLQ · poison messages
5 | How does a streaming platform use event sourcing for microservices, storing all state changes as immutable events, rebuilding materialized views on demand, and handling schema evolution without breaking consumers? | Netflix · event sourcing
6 | How does an e-commerce platform handle order events across 1M+ merchants in a multi-tenant event cluster, enforcing per-merchant quotas and providing priority lanes for high-volume sellers without noisy-neighbor effects? | Shopify · multi-tenant
7 | How do you implement the Saga pattern for a distributed order→payment→inventory→shipping transaction, coordinating compensating transactions for rollback and handling timeouts across independently deployed services? | Saga · choreography
8 | How does a distributed event streaming system handle consumer group rebalancing without message loss or duplicate processing, minimizing rebalance time while maintaining exactly-once delivery semantics? | Kafka · rebalance
9 | Given ad impression and click events from billions of daily ad requests, design an analytics system that computes CTR/CPC dashboards per advertiser with less than 1-minute delay from event occurrence to dashboard visibility. | Google Ads · CTR <1min delay
10 | Given IoT telemetry from 500M smart devices sending events every few seconds, design a real-time anomaly detection system that detects outages/spikes within 10 seconds, maintaining statistical baselines per device and alerting on deviations. | IoT · 500M devices anomaly
11 | Given repository events (push, fork, PR, issue, star) from millions of repositories, build a "Trending Repositories" system that updates rankings globally every minute, weighting recent activity higher than older activity and segmenting by language/topic. | GitHub · trending repos
12 | How does a notification orchestration platform decide what to send (push/email/SMS/in-app), when to send it (optimal timing per user), and how to batch/deduplicate across channels, while processing 1B+ notification decisions/day with user preference enforcement? | Notification orchestration
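
For problem 4 (isolating poison messages with a retry policy), the core control flow is small: retry a failing message a bounded number of times, then park it with enough context for later replay instead of blocking the healthy ones. The sketch below is an in-memory illustration only. The function name `drain`, the `max_attempts=3` default, and the dead-letter record fields are assumptions, and durable storage, backoff, and alerting are left out.

```python
from collections import deque


def drain(messages, handler, max_attempts=3):
    """Process messages, retrying failures and parking poison messages.

    Returns the dead-letter records; healthy messages are simply handled.
    """
    dead_letters = []
    pending = deque((msg, 1) for msg in messages)
    while pending:
        msg, attempt = pending.popleft()
        try:
            handler(msg)
        except Exception as exc:
            if attempt >= max_attempts:
                # Keep the payload and failure context so operators can replay it.
                dead_letters.append(
                    {"message": msg, "attempts": attempt, "error": repr(exc)}
                )
            else:
                # Re-enqueue at the back so other messages keep flowing.
                pending.append((msg, attempt + 1))
    return dead_letters
```

Re-enqueueing at the tail rather than retrying in place is what keeps one bad message from stalling the rest of the queue.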

8. Social Media Feeds & Recommendations (X, Instagram, TikTok, Pinterest)

Fan-out on write/read, ranking, real-time updates, personalization engines, trending computation · 16 problems

# | Problem | Company / Scale
1 | How does a social platform deliver posts to 400M+ users' timelines in real-time, handling the asymmetry between users with few followers and celebrities with millions, while keeping timeline delivery under 5 seconds? | Twitter · fan-out
2 | How does a photo-sharing platform rank your feed from millions of candidate posts, balancing relevance, recency, and popularity while maintaining exploration/exploitation balance to avoid filter bubbles? | Instagram · ML ranking
3 | How does a short-video platform learn your preferences within 3 minutes of first use, leveraging real-time engagement signals (watch time, replays, shares) to personalize recommendations for brand-new users with no history? | TikTok · cold-start rec
4 | How does a forum handle nested comments with millions of votes, supporting deep threading, real-time vote counts, and efficient pagination ("load more replies") without N+1 query performance degradation? | Reddit · comment tree
5 | How does a video platform count views accurately at 1B+ views/day without double-counting, filtering bot traffic and fraudulent views while keeping the public count updated within 5 minutes of actual views? | YouTube · view counting
6 | How does a social platform handle the celebrity problem (a user with 100M followers posts, and you can't write to 100M timelines simultaneously) while still delivering the post to active followers within seconds? | Facebook · celebrity fan-out
7 | How does a professional network generate "People You May Know" recommendations in real-time, computing relationship suggestions from graph connections (friends-of-friends, shared attributes) and updating as new connections form? | LinkedIn · graph rec
8 | How does a visual discovery platform handle infinite scroll with personalized content, pre-fetching upcoming pages, computing layout server-side, and re-ranking in real-time based on scroll behavior and engagement signals? | Pinterest · infinite scroll
9 | How does a music streaming platform generate personalized playlists (Discover Weekly) for 600M+ users, balancing familiar preferences with novel discovery while avoiding filter bubbles and repetitive recommendations? | Spotify · Discover Weekly
10 | How does a video platform recommend the next video with 80%+ click-through rate, narrowing candidates from 1B+ videos to a ranked shortlist in <50ms while incorporating real-time watch signals (watch time, skip, replay)? | YouTube · next video rec
11 | Given an API that records every song play event (userId, songId, albumId, timestamp, country), build a system that continuously computes the top 100 songs/albums for the last 5 min / 1 hour / 1 week globally and per country, processing 5B+ listen events/day with rankings updating within seconds of activity changes. | Spotify · top charts 5B/day
12 | Given a stream of video watch events (videoId, userId, watchDuration, region), design a system that updates the "Trending Videos" page every 30 seconds while handling 50M+ concurrent viewers, weighting scores by views, watch percentage, and velocity. | YouTube · Trending 50M concurrent
13 | How does a social platform implement real-time content moderation at scale, classifying 500M+ posts/day for policy violations using multi-modal ML (text + image + video), routing edge cases to human reviewers within minutes, and handling appeals with audit trails? | Meta · content moderation
14 | How does a social platform implement real-time A/B testing for feed ranking algorithms, splitting traffic across 100+ concurrent experiments, measuring engagement metrics with statistical significance within hours, and safely rolling back experiments that degrade user experience? | Meta · feed experimentation
15 | How does a social platform serve personalized notifications to 2B+ users, deciding what to notify (likes, comments, follows), batching low-priority notifications, computing optimal send times per user based on activity patterns, and suppressing notifications during quiet hours? | Instagram · smart notifications
16 | How does a social platform detect and suppress viral misinformation in real-time, scoring content credibility within seconds of posting using engagement velocity anomalies, cross-referencing fact-check databases, and applying distribution throttling before content reaches millions? | Meta · misinformation detection

9. Geo, Ride Sharing & Food Delivery (Uber, Google Maps, DoorDash, Zomato)

Location tracking, geofencing, matching, ETA, order lifecycle, real-time tracking · 13 problems

# | Problem | Company / Scale
1 | How does a ride-hailing platform match riders to the nearest available driver within 3 seconds, scoring candidates by distance, ETA, and driver rating while avoiding global scans of all drivers? | Uber · H3 matching
2 | How does a maps platform calculate ETA for millions of simultaneous route requests, incorporating real-time traffic data from GPS probes and ML-based travel time prediction for accuracy within 10% of actual travel time? | Google Maps · ETA
3 | How does a delivery platform optimize routes for 1M+ concurrent orders, batching nearby orders and dynamically re-routing when new orders arrive mid-delivery while minimizing total delivery time? | DoorDash · routing
4 | How does a ride platform ingest and query 5M+ driver location updates every 4 seconds, supporting spatial queries ("find drivers within 2km") without scanning all drivers globally? | Uber · location ingestion
5 | How does an AR game handle millions of players interacting with geo-anchored objects, maintaining server-authoritative state for shared objects while providing smooth client-side rendering at 60fps? | Pokémon GO · geo-spatial
6 | Given ride request and completion events (driverId, riderId, lat, long, timestamp), build a surge pricing system that recomputes hotspot pricing for every city zone within 10 seconds during peak traffic, computing supply/demand ratios per zone with smoothing to avoid price oscillation. | Uber · surge 10s recompute
7 | How does a local search platform find "restaurants near me" from 200M+ businesses in <50ms, supporting radius queries, real-time availability filtering, and ranking by distance, rating, and relevance? | Google Maps · local search
8 | How does a food delivery platform push live driver-location updates to the customer's map every 2 seconds, showing smooth movement interpolation on the client even when GPS updates arrive irregularly? | Zomato/Swiggy · live tracking
9 | How does a food delivery platform calculate multi-leg ETA (restaurant prep + driver pickup + delivery) that updates in real-time, incorporating per-restaurant prep time models, live traffic, and Bayesian updates as each leg completes? | Uber Eats · multi-leg ETA
10 | How does a food delivery platform manage the 3-party order lifecycle (customer, restaurant, driver) with state machine transitions (placed→accepted→preparing→ready→picked_up→delivered), timeout handlers per state, and compensating actions on cancellation? | Swiggy · order lifecycle
11 | How does a food delivery platform assign orders to drivers optimizing for delivery time, driver earnings, and restaurant wait, enforcing constraints (distance < 3km, capacity = 2 orders) and re-assigning when a driver rejects within 30s? | DoorDash · driver dispatch
12 | Build a traffic analytics platform where GPS pings arrive every 3 seconds from 100M vehicles, computing congestion levels per road segment and ETA updates that refresh within 5 seconds, pushing tile-based map updates to clients. | Google Maps · 100M vehicles traffic
13 | How does a geofencing platform detect when millions of devices enter/exit custom geographic boundaries in real-time, processing 10M+ location updates/sec against 100M+ geofences with sub-second trigger latency for marketing notifications and compliance alerts? | Radar · geofencing 10M/sec
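
Problem 6 mentions smoothing supply/demand ratios per zone so prices do not oscillate. One simple smoothing choice, used here purely as an illustration, is an exponential moving average over the raw demand-to-supply ratio, clamped to a price range. Everything specific in the sketch is an assumption: the `SurgeZone` name, the `alpha=0.3` weight, the `3.0` cap, and the use of the ratio itself as the multiplier. The problem statement does not say which smoothing method a real system uses.

```python
class SurgeZone:
    """Per-zone surge multiplier smoothed with an exponential moving average."""

    def __init__(self, alpha=0.3, max_multiplier=3.0):
        self.alpha = alpha                  # weight given to the newest observation
        self.max_multiplier = max_multiplier
        self.smoothed = 1.0                 # start at the no-surge baseline

    def update(self, open_requests, available_drivers):
        # Raw demand/supply ratio for this recompute tick; avoid dividing by zero.
        raw = open_requests / max(available_drivers, 1)
        # Blend the new ratio into the running value so one spike moves the
        # price only part of the way.
        self.smoothed = self.alpha * raw + (1 - self.alpha) * self.smoothed
        # Never price below 1.0x, never above the cap.
        return min(max(self.smoothed, 1.0), self.max_multiplier)
```

With `alpha=0.3`, a zone that jumps from balanced to three requests per driver moves from 1.0 to about 1.6 on the first tick rather than straight to 3.0, and decays gradually once demand falls back.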

10. Payment Systems (Stripe, PayPal, Apple Pay, Visa)

Idempotency, ledgers, reconciliation, fraud, trading · 9 problems

# | Problem | Company / Scale
1 | How does a payment platform process millions of charges without double-charging, ensuring idempotent request handling, two-phase state transitions (pending→captured→settled), and at-least-once delivery with server-side deduplication? | Stripe · idempotency
2 | How does a payment platform detect fraud in real-time across 400M accounts, scoring each transaction in sub-100ms using velocity checks, device fingerprinting, and geo-anomaly detection while routing edge cases to human review? | PayPal · fraud ML
3 | How does a cross-border payment platform handle transfers across 80+ currencies, locking FX rates within 30-second windows, managing multi-currency ledgers, and batching settlements to minimize wire fees? | Wise · FX ledger
4 | How does a trading platform handle stock orders with sub-millisecond latency, maintaining deterministic order matching, lock-free order book operations, and complete audit replay for regulatory compliance? | Robinhood · trading
5 | How does a digital wallet handle offline tap-to-pay transactions, using pre-authorized tokens with device-local transaction limits and deferred settlement that reconciles when connectivity returns? | Apple Pay · offline
6 | How do banks reconcile millions of transactions daily without losing a cent, using double-entry ledgers, end-of-day batch reconciliation, exception queues for mismatches, and cryptographic audit trails for regulatory compliance? | Banking · reconciliation
7 | How does an e-commerce platform handle checkout for 1M+ merchants during Black Friday peak, maintaining order confirmation within seconds while gracefully degrading non-critical features under extreme load? | Shopify · flash checkout
8 | How does a cryptocurrency exchange handle 100K+ trades/sec with real-time order matching, maintaining a deterministic in-memory order book, supporting limit/market/stop orders, and providing guaranteed execution ordering with nanosecond timestamps for regulatory audit? | Binance · crypto exchange
9 | Design a stock market analytics system ingesting millions of trades per second that continuously updates top gainers, losers, and unusual volume spikes with millisecond latency, computing OHLC aggregations per symbol and pushing updates to trading terminals in real-time. | Stock exchange · ms-latency analytics
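
The idempotent request handling named in problem 1 usually comes down to remembering the response for each client-supplied key and replaying it on retry instead of charging again. The sketch below shows that shape in memory. `ChargeService`, the `idempotency_key` parameter name, and the response fields are hypothetical and are not taken from any payment provider's API. A production version would also need the lookup and insert to be atomic, for instance through a unique constraint in the database.

```python
class ChargeService:
    """In-memory model of server-side deduplication by idempotency key."""

    def __init__(self):
        self.results = {}  # idempotency_key -> (request fingerprint, response)
        self.charges = []  # side effects that must happen at most once per key

    def charge(self, idempotency_key, account, amount_cents):
        fingerprint = (account, amount_cents)
        seen = self.results.get(idempotency_key)
        if seen is not None:
            if seen[0] != fingerprint:
                # Same key with a different payload is a client bug, not a retry.
                raise ValueError("idempotency key reused with different parameters")
            return seen[1]  # replay the stored response, no second charge
        charge_id = len(self.charges) + 1
        self.charges.append((charge_id, account, amount_cents))
        response = {"charge_id": charge_id, "status": "captured"}
        self.results[idempotency_key] = (fingerprint, response)
        return response
```

Storing a fingerprint of the request alongside the response lets the server tell a genuine network retry apart from accidental key reuse.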

11. API Gateway & Backend (Cloudflare, Netflix, Kong)

Rate limiting, auth, circuit breaking, service mesh · 8 problems

# | Problem | Company / Scale
1 | How does an edge network rate-limit 45M+ requests/sec without adding latency, distributing rate counters across hundreds of PoPs while using fail-open policies to avoid blocking legitimate traffic during sync delays? | Cloudflare · rate limit
2 | How does a streaming platform's API gateway handle 100B+ API calls/day, supporting dynamic filter/routing rule updates without restart and shedding low-priority requests during overload? | Netflix · Zuul
3 | How does a service mesh handle mutual TLS for 1000s of microservices, auto-rotating certificates and providing zero-code encryption between every service pair without application changes? | Istio · mTLS
4 | How does a developer platform handle API versioning across millions of integrations, supporting deprecation timelines and backward-compatible schema evolution that doesn't break existing clients? | GitHub · API versioning
5 | How does a payment API achieve 99.999% uptime (about 5 minutes of downtime per year), serving requests from multiple active regions with graceful degradation during partial outages? | Stripe · five nines
6 | How does a streaming platform implement circuit breakers to prevent cascading failures, detecting failure rate thresholds (50% in 10s window) and providing fallback responses that degrade gracefully while allowing recovery? | Netflix · circuit breaker
7 | How does an API gateway handle authentication for 10K+ microservices, validating tokens at the edge with caching, and providing service-to-service auth that doesn't require per-request token exchange internally? | Kong · auth gateway
8 | How does a GraphQL federation layer compose a unified schema from 50+ microservice subgraphs, planning cross-service queries efficiently and identifying slow resolvers across subgraphs via distributed tracing? | Apollo · federation
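
Problem 6 gives a concrete trip condition (50% failures in a 10-second window), which is enough to sketch a breaker. The version below is a minimal illustration, not a description of any particular library. `CircuitBreaker`, the `min_calls` floor, the `cooldown_seconds` value, and the single-probe half-open behavior are all assumptions added to make the example complete.

```python
import time
from collections import deque


class CircuitBreaker:
    """Opens when the failure ratio in a sliding window crosses a threshold."""

    def __init__(self, failure_ratio=0.5, window_seconds=10.0, min_calls=10,
                 cooldown_seconds=5.0, clock=time.monotonic):
        self.failure_ratio = failure_ratio
        self.window = window_seconds
        self.min_calls = min_calls        # assumption: avoid tripping on tiny samples
        self.cooldown = cooldown_seconds  # assumption: wait before probing again
        self.clock = clock
        self.calls = deque()              # (timestamp, succeeded), oldest first
        self.opened_at = None             # None means the circuit is closed

    def call(self, fn, fallback):
        now = self.clock()
        if self.opened_at is not None and now - self.opened_at < self.cooldown:
            return fallback()  # open: fail fast without touching the dependency
        # Closed, or cooldown elapsed (half-open): let the call through.
        try:
            result = fn()
        except Exception:
            self._record(now, False)
            return fallback()
        self._record(now, True)
        return result

    def _record(self, now, ok):
        if self.opened_at is not None:
            # Half-open probe: success closes the circuit, failure re-opens it.
            if ok:
                self.opened_at = None
                self.calls.clear()
            else:
                self.opened_at = now
            return
        self.calls.append((now, ok))
        while self.calls and self.calls[0][0] <= now - self.window:
            self.calls.popleft()
        failures = sum(1 for _, succeeded in self.calls if not succeeded)
        if len(self.calls) >= self.min_calls and failures / len(self.calls) >= self.failure_ratio:
            self.opened_at = now
```

While open, the breaker never invokes the wrapped function, which is the property that stops a struggling dependency from being hammered by retries.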

12. Database & Storage (Postgres, MongoDB, Redis, Elasticsearch)

Sharding, replication, consistency, migrations · 10 problems

# | Problem | Company / Scale
1 | How does a database sharding layer transparently shard 10B+ messages across hundreds of shards, supporting online resharding (split/merge without downtime) and connection pooling that reduces backend connections by 100×? | Slack · Vitess
2 | How does a globally distributed database achieve strong consistency with <10ms reads, bounding clock uncertainty across regions and using consensus-based replication across 5+ regions? | Google Spanner · TrueTime
3 | How does a chat platform store trillions of messages in a wide-column database, partitioning by channel with time-ordered clustering, and tuning compaction strategies for time-series append patterns? | Discord · ScyllaDB
4 | How does a platform migrate billions of rows between database schemas with zero downtime, validating correctness through shadow reads and providing instant rollback capability during cutover? | Uber · online migration
5 | How does a serverless database achieve single-digit ms latency at any scale, routing requests to the correct partition via in-memory partition maps and providing adaptive burst capacity? | DynamoDB · partition
6 | How does a productivity platform shard a relational database for millions of workspaces, routing queries by workspace_id, pooling connections efficiently, and automatically rebalancing shards as workspaces grow? | Notion · Postgres shard
7 | How does an object storage service achieve 99.999999999% (11 nines) durability, splitting data into fragments across availability zones with integrity checksums on every read and automatic repair of degraded objects within hours? | S3 · 11 nines
8 | How does a streaming platform handle write-heavy workloads (1M+ writes/sec) in a distributed database, tuning consistency levels and compaction strategies based on read/write ratio while maintaining token-aware routing? | Netflix · Cassandra
9 | How does a distributed SQL database survive entire region failures without data loss, using consensus per data range, leaseholder placement policies, and non-voting replicas for fast failover (<10s RTO)? | CockroachDB · multi-region
10 | How does a search engine index and search petabytes of logs in <100ms, using inverted indexes with time-based rotation and scatter-gather queries across 1000s of shards with early termination? | Elasticsearch · log search

13. Distributed Systems (Apache, Google Spanner, CockroachDB)

Consensus, leader election, clock sync, partition tolerance · 7 problems

# | Problem | Company / Scale
1 | How does a distributed key-value store achieve consensus across 5 nodes, handling leader election, log replication, and split-brain prevention requiring majority quorum (3 of 5) for all decisions? | etcd · Raft
2 | How does a distributed event streaming platform handle leader election when a broker dies, reassigning partitions within seconds without message loss while promoting only in-sync replicas? | Kafka · KRaft
3 | How does a distributed system achieve causal ordering of events across data centers without synchronized clocks, using hybrid logical clocks (HLC) to bound uncertainty and provide happens-before guarantees for cross-region transactions? | HLC · causal ordering
4 | How does a distributed system implement linearizable reads without sacrificing availability, choosing between leader leases, quorum reads, and read-repair strategies based on consistency requirements and latency budgets? | Linearizability · read strategies
5 | How does a coordination service manage distributed locks, leader election, and configuration for 1000s of services, detecting liveness via ephemeral sessions and providing sequential ordering guarantees for distributed queues? | ZooKeeper · coordination
6 | How does a wide-column database maintain availability during network splits (AP in CAP), offering tunable consistency levels, hinted handoff for downed nodes, and anti-entropy repair to eventually converge divergent replicas? | Cassandra · AP system
7 | How do distributed systems detect and recover from split-brain scenarios, invalidating stale leaders with monotonic fencing tokens, epoch-based leadership with lease expiry, and ensuring only one leader can make progress at any time? | Split-brain · fencing
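
Two ideas from the table above are small enough to show directly: the majority quorum in problem 1 (3 of 5) and the monotonic fencing tokens in problem 7. In the sketch, a lease service hands out an increasing epoch with each grant, and the storage layer refuses any write carrying a token lower than the highest it has seen, so a paused former leader cannot overwrite its successor. `majority`, `LeaseService`, and `FencedStore` are hypothetical names used for illustration only, and lease expiry is not modeled.

```python
def majority(cluster_size):
    """Smallest number of nodes that forms a majority quorum."""
    return cluster_size // 2 + 1


class LeaseService:
    """Hands out leadership leases tagged with a strictly increasing epoch."""

    def __init__(self):
        self.epoch = 0

    def grant(self, node):
        self.epoch += 1
        return self.epoch  # the fencing token for this term of leadership


class FencedStore:
    """Storage that rejects writes from leaders holding a stale token."""

    def __init__(self):
        self.highest_token = 0
        self.data = {}

    def write(self, token, key, value):
        if token < self.highest_token:
            return False  # a newer leader has already written; fence this one off
        self.highest_token = token
        self.data[key] = value
        return True
```

The check lives in the store rather than in the leader because a stale leader, by definition, does not know it is stale.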

14. Live Sports & Real-Time Event Broadcasting (ESPN, Cricbuzz, Dream11, Hotstar)

Live score push, ball-by-ball updates, millions of concurrent readers consuming the same event stream · 8 problems

# | Problem | Company / Scale
1 | Build a live scoring platform where every ball event (runs, wicket, over, batsman change) must reach 30M concurrent users globally within sub-second latency during World Cup finals, supporting both persistent connections and fallback polling for all client types. | Cricbuzz · 30M concurrent, sub-second
2 | How does a sports platform handle 10M+ simultaneous score poll requests during a World Cup final without melting the backend, serving fresh scores (≤1s stale) from edge while protecting origin servers from traffic spikes? | ESPN · World Cup traffic
3 | How does a fantasy sports platform lock/unlock player selections in real-time as a match starts, performing atomic state transitions triggered by match-start events with eventual consistency for leaderboard updates? | Dream11 · lineup lock
4 | How does a fantasy sports platform calculate live leaderboard rankings for 10M+ users as each ball is bowled, applying pre-computed point deltas per event and updating rankings incrementally without full recomputation? | Dream11 · live leaderboard
5 | How does a live sports platform ingest events from stadium data feeds (ball tracking, hawk-eye) and normalize them into a unified event stream within 200ms, deduplicating events and handling out-of-order delivery from multiple feed sources? | Sports data · event ingestion
6 | How does a betting platform update live odds for 1000+ markets simultaneously as match events occur, recalculating odds within milliseconds and handling stale-data rollback when events are corrected? | Betting · live odds
7 | How does a live commentary platform handle millions of users receiving the same text/audio commentary stream without per-user fan-out, efficiently broadcasting identical content to massive audiences with minimal per-user resource cost? | Commentary · broadcast
8 | Build a live sports notification platform where wicket/goal/touchdown events must push notifications to 100M subscribed users within a few seconds globally, classifying event priority (high: goal/wicket vs low: boundary) and distributing push load across regional gateways. | Sports · 100M push notifications

15. File Upload & Media Processing DropboxGoogle DriveYouTube

Chunked upload, sync engines, transcoding pipelines, deduplication · 7 problems

#ProblemCompany / Scale
1How does a cloud storage platform sync file changes across millions of devices within seconds, deduplicating content at the block level and tracking per-file block maps for efficient delta sync?Dropbox · sync engine
2How does a file platform handle resumable uploads for 5GB+ files over unreliable networks, uploading in chunks with per-chunk checksum verification and automatic retry from the last successful chunk?Google Drive · resumable upload
3How does a video platform process 500+ hours of uploaded video per minute into multiple formats, prioritizing live content over VOD, executing DAG-based transcoding pipelines, and storing intermediate results between stages?YouTube · transcoding pipeline
4How does a cloud storage platform deduplicate files across 700M+ users to save petabytes of storage, using content-defined chunking, block-level hashing, and reference counting to safely garbage-collect unreferenced blocks?Dropbox · deduplication
5How does a photo platform generate thumbnails, apply filters, and extract metadata for 100M+ uploads/day, processing each image into multiple resolutions in parallel and pre-warming CDN caches for popular images?Instagram · image processing
6How does a collaboration platform handle concurrent edits to the same file by multiple users, using transform-based resolution for text and lock-based editing for binary files with conflict resolution UI for manual merge?Google Docs · concurrent edit
7How does a cloud platform implement file versioning and point-in-time restore for billions of objects, using copy-on-write semantics, version chains, and lifecycle policies that auto-delete versions older than 30 days?S3 / Dropbox · versioning
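
The block-level deduplication and reference counting mentioned in problems 1 and 4 can be sketched roughly as follows. This is a minimal in-memory sketch, not any vendor's design: `BlockStore` is a hypothetical name, and it uses fixed-size blocks for brevity where the problem statement calls for content-defined chunking.

```python
import hashlib


class BlockStore:
    """Hypothetical content-addressed store with per-block reference counts."""

    def __init__(self):
        self.blocks = {}    # digest -> block bytes
        self.refcount = {}  # digest -> number of references from file block maps

    def put_file(self, data: bytes, block_size: int = 4) -> list:
        """Split data into fixed-size blocks and return the file's block map."""
        block_map = []
        for i in range(0, len(data), block_size):
            block = data[i:i + block_size]
            digest = hashlib.sha256(block).hexdigest()
            if digest not in self.blocks:
                self.blocks[digest] = block  # store each distinct block once
            self.refcount[digest] = self.refcount.get(digest, 0) + 1
            block_map.append(digest)
        return block_map

    def read_file(self, block_map) -> bytes:
        return b"".join(self.blocks[d] for d in block_map)

    def delete_file(self, block_map) -> None:
        """Drop one reference per block and collect blocks nobody references."""
        for digest in block_map:
            self.refcount[digest] -= 1
            if self.refcount[digest] == 0:
                del self.refcount[digest]
                del self.blocks[digest]


store = BlockStore()
file_a = store.put_file(b"AAAABBBBCCCC")
file_b = store.put_file(b"AAAABBBBDDDD")  # shares two blocks with file_a
blocks_after_upload = len(store.blocks)

store.delete_file(file_a)                 # only the CCCC block becomes garbage
blocks_after_delete = len(store.blocks)
```

A real system would also need the reference-count updates and the garbage collection to be atomic with respect to concurrent uploads, which this sketch ignores.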

17. Scheduling & Calendar Google CalendarCalendlyTemporal

Recurring events, timezone handling, conflict detection, availability · 6 problems

#ProblemCompany / Scale
1How does a calendar platform handle recurring events (every Tuesday, except holidays) across timezones, generating occurrences on-the-fly with exception handling without storing millions of individual instances?Google Calendar · recurrence
2How does a calendar platform detect scheduling conflicts across 100M+ users in real-time, merging free/busy data across multiple calendars and handling concurrent booking attempts with optimistic locking?Outlook · conflict detection
3How does a meeting scheduler find available slots across 10 participants in different timezones, respecting working-hours constraints per timezone and scoring preferences to minimize early morning/late night for any participant?Calendly · availability
4How does a calendar platform sync events across devices and third-party integrations, supporting real-time push notifications for changes with periodic full-sync as fallback and conflict resolution for shared calendars?Google Calendar · sync
5How does a task scheduling platform (cron-at-scale) execute millions of scheduled jobs at their exact trigger time, guaranteeing at-least-once execution with idempotency for missed windows and leader-based clock ticking?Temporal / Airflow · job scheduling
6How does a calendar platform handle timezone changes (DST transitions, country rule changes) without breaking existing events, storing events in local time + timezone ID and re-computing UTC offsets on rule changes?Calendar · DST handling
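
For the free/busy merging in problem 2, one possible sketch is below. It assumes busy intervals are half-open `(start, end)` pairs already normalized to minutes since midnight UTC; the function names are hypothetical.

```python
def merge_busy(calendars):
    """Merge busy intervals from several calendars into a sorted,
    non-overlapping list."""
    intervals = sorted(iv for cal in calendars for iv in cal)
    merged = []
    for start, end in intervals:
        if merged and start <= merged[-1][1]:
            # Overlaps or touches the previous interval: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def conflicts(merged, start, end):
    """True if the half-open slot [start, end) overlaps any busy interval."""
    return any(s < end and start < e for s, e in merged)


work = [(540, 600), (780, 840)]  # 09:00-10:00 and 13:00-14:00
personal = [(590, 630)]          # 09:50-10:30

busy = merge_busy([work, personal])
slot_taken = conflicts(busy, 600, 615)  # 10:00-10:15
slot_free = conflicts(busy, 630, 660)   # 10:30-11:00
```

Detecting the conflict is only half of the problem as stated: the booking itself would still need a version check or similar optimistic-locking step so that two concurrent requests for the same free slot cannot both succeed.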

18. Observability & Monitoring DatadogGrafanaPrometheus

Distributed tracing, metrics aggregation, alerting pipelines, log analytics · 8 problems

#ProblemCompany / Scale
1How does a distributed tracing system correlate requests across 1000+ microservices, propagating trace context through async boundaries (Kafka, queues) and sampling intelligently to keep storage under 1% of total traffic while capturing all error traces?Uber · Jaeger tracing
2How does a metrics platform ingest 500M+ time-series data points per second, supporting real-time aggregation (P50/P99/max) with 10-second granularity and multi-dimensional queries across 100K+ metric names?Datadog · 500M metrics/sec
3How does an alerting system evaluate 10M+ alert rules every 15 seconds without false positives, supporting complex conditions (rate-of-change, anomaly detection), alert grouping, and escalation policies with on-call rotation?PagerDuty · alerting at scale
4How does a log analytics platform ingest 1PB+ of logs daily from millions of sources, indexing them for sub-second search while applying retention policies and providing real-time tail functionality for debugging?Splunk · PB-scale logs
5How does a platform implement real-time error tracking that groups millions of exceptions into unique issues, detecting regressions within minutes of deployment and auto-assigning to the team that owns the failing code path?Sentry · error grouping
6How does a cloud platform build real-time service dependency maps from trace data, detecting cascading failures within seconds and identifying the root-cause service in a chain of 20+ dependent services?AWS X-Ray · dependency map
7How does a platform implement SLO-based monitoring that continuously computes error budgets across 10K+ services, triggering automated responses (traffic shifting, rollback) when burn rate exceeds thresholds?Google · SLO monitoring
8How does a real-time dashboard system serve 100K+ concurrent viewers watching the same metrics, pushing incremental updates via WebSocket without recomputing full queries per viewer?Grafana · live dashboards
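
The burn-rate check in problem 7 reduces to a small calculation: the observed error rate divided by the error budget the SLO allows. A minimal sketch, with hypothetical function names and a threshold value that is purely an assumption for the example:

```python
def burn_rate(errors: int, total: int, slo_target: float) -> float:
    """Observed error rate divided by the allowed error budget.

    A value of 1.0 means the budget is being spent exactly as fast as
    the SLO permits; higher values exhaust it early.
    """
    if total == 0:
        return 0.0
    error_budget = 1.0 - slo_target
    return (errors / total) / error_budget


def should_respond(errors: int, total: int, slo_target: float,
                   threshold: float) -> bool:
    """True when the burn rate reaches the (assumed) alerting threshold."""
    return burn_rate(errors, total, slo_target) >= threshold


# 50 failed requests out of 10,000 against a 99.9% availability SLO.
rate = burn_rate(50, 10_000, 0.999)
trigger = should_respond(50, 10_000, 0.999, threshold=2.0)
```

In practice the same calculation is usually evaluated over more than one window (a short and a long one) so that brief spikes do not trigger rollbacks on their own.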

19. ML Model Serving & Feature Stores OpenAITensorFlowPyTorchMeta

Real-time inference, feature computation, model deployment, A/B testing · 7 problems

#ProblemCompany / Scale
1How does a ride platform serve ML predictions (ETA, surge, fraud) at 1M+ requests/sec with P99 <10ms latency, loading 100+ models into GPU memory, handling model version rollouts, and falling back to rule-based systems when models are unavailable?Uber · Michelangelo
2How does a feature store compute and serve real-time features (user's last 5 actions, rolling 1-hour spend) for ML models at prediction time, combining batch-computed features with streaming features while maintaining point-in-time correctness?Feast · real-time features
3How does a search platform deploy new ranking models to production without degrading relevance, using shadow scoring, interleaved experiments, and gradual traffic ramp with automatic rollback on metric regression?Google · model deployment
4How does a recommendation platform retrain models on fresh data every hour, incorporating the latest user interactions while ensuring training-serving skew stays below 1% and new models don't catastrophically forget learned patterns?TikTok · online learning
5How does an LLM serving platform handle 10K+ concurrent inference requests with variable-length outputs, optimizing GPU utilization through continuous batching, KV-cache management, and speculative decoding for 3x throughput improvement?OpenAI · LLM serving
6How does an ad platform compute click-through-rate predictions for 10B+ ad candidates daily in <50ms per request, combining sparse features (user history) with dense embeddings and serving from a distributed model across 1000+ inference nodes?Meta · ads prediction
7How does a content platform implement real-time embedding generation for new content (images, videos, text), indexing embeddings for approximate nearest neighbor search within seconds of upload for immediate recommendation eligibility?Pinterest · embedding pipeline

20. Security & Authentication CloudflareAuth0Okta

Session management, DDoS mitigation, rate limiting, zero-trust · 7 problems

#ProblemCompany / Scale
1How does a platform manage sessions for 2B+ users across multiple devices, supporting instant revocation (password change invalidates all sessions), sliding expiry, and device-specific session limits without checking a central store on every request?Google · session management
2How does a CDN mitigate L7 DDoS attacks at 100M+ requests/sec, distinguishing legitimate traffic from bot traffic using behavioral analysis, JavaScript challenges, and adaptive rate limiting without blocking real users during an attack?Cloudflare · DDoS mitigation
3How does a platform implement distributed rate limiting across 300+ edge PoPs, enforcing per-user and per-IP limits with eventual consistency between nodes while using fail-open policies to avoid blocking legitimate traffic during sync delays?Stripe · distributed rate limit
4How does a zero-trust architecture authenticate and authorize every service-to-service call in a 5000+ microservice mesh, issuing short-lived certificates, enforcing least-privilege policies, and detecting lateral movement without adding more than 1ms latency per hop?Google BeyondCorp · zero-trust
5How does an OAuth provider handle 1M+ token issuance/sec during peak login, supporting PKCE flows, token rotation, and cross-device SSO while detecting token theft through binding tokens to device fingerprints?Auth0 · OAuth at scale
6How does a platform implement real-time account takeover detection, scoring login attempts using device fingerprint, geo-velocity, and behavioral biometrics, triggering step-up authentication (MFA) for suspicious sessions within milliseconds?Netflix · ATO detection
7How does a secrets management platform handle 100K+ services fetching credentials, supporting automatic rotation every 24 hours, lease-based access with revocation, and zero-downtime rotation without service restarts?HashiCorp Vault · secrets

21. URL Shortening & Redirection BitlyTinyURLGoogle

ID generation, redirect scaling, analytics, hot key caching · 4 problems

#ProblemCompany / Scale
1How does a URL shortener generate globally unique short codes at 1000+ URLs/sec, ensuring no collisions across distributed nodes while keeping codes short (7 chars) and supporting custom aliases?Bitly · ID generation
2How does a URL shortener handle 100K+ redirect requests/sec with P99 <10ms latency, caching hot URLs at the edge while handling expired links, geo-targeted redirects, and A/B test routing?TinyURL · redirect at scale
3How does a link analytics platform track clicks in real-time (referrer, geo, device, timestamp) for billions of redirects/day without adding latency to the redirect path, computing dashboards with <1min freshness?Bitly · click analytics
4How does a URL shortener handle hot keys (viral links getting 1M+ clicks/sec) without melting the cache layer, using consistent hashing, request coalescing, and tiered caching (L1 in-process → L2 Redis → L3 DB)?t.co · hot key caching
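
One common way to get the 7-character codes in problem 1 is to base62-encode a numeric ID that is already unique (for example, from a per-node counter range). The sketch below assumes that approach; the alphabet order and function names are choices made for this example.

```python
ALPHABET = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"
BASE = len(ALPHABET)  # 62


def encode(n: int, width: int = 7) -> str:
    """Encode a non-negative integer as a fixed-width base62 string."""
    if n < 0 or n >= BASE ** width:
        raise ValueError("id does not fit in the requested width")
    chars = []
    for _ in range(width):
        n, rem = divmod(n, BASE)
        chars.append(ALPHABET[rem])
    return "".join(reversed(chars))


def decode(code: str) -> int:
    """Inverse of encode."""
    n = 0
    for ch in code:
        n = n * BASE + ALPHABET.index(ch)
    return n


capacity = BASE ** 7  # number of distinct 7-character codes
code = encode(123456789)
round_trip = decode(code)
```

Seven characters give 62^7 (about 3.5 trillion) codes. Because the mapping is a bijection on IDs, collision avoidance is pushed entirely onto the ID source; custom aliases would need a separate uniqueness check.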

22. Email Systems GmailSendGridMailchimp

SMTP delivery, spam filtering, inbox indexing, threading · 4 problems

#ProblemCompany / Scale
1How does an email platform deliver 500M+ emails/day with high deliverability, managing IP reputation, DKIM/SPF/DMARC authentication, bounce handling, and throttling per recipient domain to avoid being blacklisted?SendGrid · email delivery
2How does an email provider classify 1B+ incoming emails/day as spam/ham in real-time using content analysis, sender reputation, behavioral signals, and ML models, with <0.1% false positive rate?Gmail · spam filtering
3How does an email platform index 1B+ mailboxes for instant full-text search, supporting complex queries (from:, has:attachment, date range) with <200ms latency while handling 100K+ new emails/sec ingestion?Gmail · inbox search
4How does an email client implement conversation threading that correctly groups replies, forwards, and CC chains using In-Reply-To/References headers, handling broken threads and cross-client compatibility?Gmail · threading

23. Content Moderation & Trust MetaYouTubeTwitch

Spam detection, abuse prevention, ML moderation, reporting systems · 4 problems

#ProblemCompany / Scale
1How does a social platform classify 500M+ posts/day for policy violations using multi-modal ML (text + image + video), routing borderline cases to human reviewers within minutes while maintaining <1% false positive rate?Meta · automated moderation
2How does a platform detect and suppress coordinated inauthentic behavior (bot networks, brigading) in real-time, identifying clusters of accounts acting in concert using graph analysis and behavioral signals?Twitter · bot detection
3How does a platform implement a user reporting system that handles 10M+ reports/day, prioritizing by severity (child safety > violence > spam), deduplicating reports on the same content, and routing to specialized review queues?YouTube · reporting pipeline
4How does a platform implement real-time toxicity scoring for live chat/comments, blocking harmful messages before they're visible to other users while minimizing latency impact on the posting experience (<100ms added)?Twitch · live chat moderation

24. Configuration & Feature Flags LaunchDarklyNetflixGitHub

Dynamic config, experimentation, A/B testing, rollout systems · 3 problems

#ProblemCompany / Scale
1How does a feature flag platform evaluate 1M+ flag checks/sec with P99 <5ms, supporting complex targeting rules (user segments, percentages, geo), and propagating flag changes to 100K+ servers within 10 seconds?LaunchDarkly · flag evaluation
2How does a platform run 1000+ concurrent A/B experiments, assigning users to variants deterministically (hash-based), computing statistical significance in real-time, and auto-stopping experiments that degrade key metrics?Netflix · experimentation
3How does a platform implement progressive rollouts (1% → 5% → 25% → 100%) with automatic rollback on error rate spike, supporting canary deployments and dark launches without code deploys?Facebook · gradual rollout
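
The deterministic hash-based assignment in problem 2 and the percentage ramp in problem 3 can share one mechanism: hash the flag and user into a stable bucket, then compare the bucket to the current rollout percentage. A minimal sketch, assuming SHA-256 and 10,000 buckets (both arbitrary choices for this example):

```python
import hashlib

BUCKETS = 10_000  # basis points, so 1 bucket = 0.01%


def bucket(flag_key: str, user_id: str) -> int:
    """Map (flag, user) to a stable bucket in [0, BUCKETS)."""
    digest = hashlib.sha256(f"{flag_key}:{user_id}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % BUCKETS


def is_enabled(flag_key: str, user_id: str, rollout_percent: int) -> bool:
    """True if the user falls inside the current rollout percentage."""
    return bucket(flag_key, user_id) < rollout_percent * (BUCKETS // 100)


users = [f"user-{i}" for i in range(1000)]
at_5 = {u for u in users if is_enabled("new-checkout", u, 5)}
at_25 = {u for u in users if is_enabled("new-checkout", u, 25)}
```

Because a user's bucket never changes, raising the percentage only adds users: everyone enabled at 5% is still enabled at 25%. Including the flag key in the hash keeps different flags from always selecting the same early cohort.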

25. Web Crawling & Indexing GoogleCloudflareAmazon

Distributed crawlers, politeness, deduplication, indexing pipelines · 3 problems

#ProblemCompany / Scale
1How does a web crawler discover and fetch 10B+ pages across the internet, respecting robots.txt, managing crawl politeness (rate limits per domain), prioritizing fresh/important pages, and deduplicating content?Google · web crawler
2How does a search engine build and maintain an inverted index over 100B+ documents, supporting incremental updates (new/modified pages) without full re-indexing, and serving queries across a distributed index in <200ms?Google · indexing pipeline
3How does a price comparison platform crawl 10M+ product pages daily from 1000+ e-commerce sites, extracting structured data (price, availability, specs) using site-specific parsers, and detecting price changes within minutes?PriceRunner · product crawling

26. ID Generation Systems TwitterStripeSnowflake

Snowflake IDs, distributed unique IDs, ordering guarantees, collision avoidance · 3 problems

#ProblemCompany / Scale
1How does a distributed system generate 100K+ unique, time-sortable IDs per second across 1000+ nodes without coordination, ensuring no collisions and maintaining rough chronological ordering (Snowflake-style)?Twitter Snowflake · time-sorted IDs
2How does a database generate monotonically increasing IDs across a sharded cluster without a single point of failure, supporting 1M+ inserts/sec while maintaining gap-free sequences within each shard?Vitess · sharded sequences
3How does a system generate short, human-readable unique codes (invite codes, order IDs, tracking numbers) that are URL-safe, non-sequential (prevent enumeration), and globally unique across 1B+ entities?Stripe · readable unique IDs
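
A Snowflake-style generator for problem 1 typically packs a millisecond timestamp, a node ID, and a per-millisecond sequence into one 64-bit integer. The sketch below assumes the usual 41/10/12-bit split; `EPOCH_MS` is an arbitrary custom epoch picked for this example, and the class is not thread-safe.

```python
import time

EPOCH_MS = 1_600_000_000_000  # arbitrary custom epoch (assumption)
NODE_BITS = 10
SEQ_BITS = 12
MAX_SEQ = (1 << SEQ_BITS) - 1


class SnowflakeGenerator:
    """Hypothetical single-threaded Snowflake-style ID generator."""

    def __init__(self, node_id, clock=lambda: int(time.time() * 1000)):
        if not 0 <= node_id < (1 << NODE_BITS):
            raise ValueError("node_id must fit in 10 bits")
        self.node_id = node_id
        self.clock = clock  # injectable so the layout can be tested
        self.last_ms = -1
        self.sequence = 0

    def next_id(self) -> int:
        now = self.clock()
        if now < self.last_ms:
            raise RuntimeError("clock moved backwards")
        if now == self.last_ms:
            self.sequence = (self.sequence + 1) & MAX_SEQ
            if self.sequence == 0:
                # 4096 IDs already issued this millisecond: wait for the next.
                while now <= self.last_ms:
                    now = self.clock()
        else:
            self.sequence = 0
        self.last_ms = now
        timestamp = now - EPOCH_MS
        return (timestamp << (NODE_BITS + SEQ_BITS)) | (self.node_id << SEQ_BITS) | self.sequence


gen = SnowflakeGenerator(node_id=7)
ids = [gen.next_id() for _ in range(5000)]
```

Nodes never coordinate at generation time; uniqueness rests on each node holding a distinct `node_id` and on its clock not running backwards. Ordering across nodes is only approximate, bounded by clock skew.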

27. Ad Serving & Monetization Google AdsMetaAmazon

Ad ranking, real-time bidding, targeting, impression tracking · 4 problems

#ProblemCompany / Scale
1How does an ad platform select the best ad from 10M+ candidates in <100ms per page load, scoring by predicted CTR, bid amount, relevance, and advertiser budget, while serving 10B+ ad requests/day?Google Ads · ad ranking
2How does a real-time bidding (RTB) exchange conduct auctions across 100+ demand-side platforms within 100ms, handling 1M+ bid requests/sec with timeout-based fallback and fraud detection?OpenRTB · real-time bidding
3How does an ad platform track impressions, clicks, and conversions across billions of events/day without double-counting, attributing conversions across devices/sessions, and computing ROI dashboards in near-real-time?Meta Ads · attribution tracking
4How does an ad platform enforce advertiser budgets in real-time across distributed serving nodes, preventing overspend while maximizing delivery, handling budget changes mid-campaign, and pacing spend evenly throughout the day?Google Ads · budget pacing

28. Developer Platform & CI/CD GitHub ActionsJenkinsDockerKubernetes

Build systems, deployment orchestration, artifact storage, release pipelines · 3 problems

#ProblemCompany / Scale
1How does a CI platform execute 10M+ builds/day across a distributed fleet of workers, scheduling jobs by priority and resource requirements, caching build artifacts for 10× speedup, and providing real-time build logs?GitHub Actions · build at scale
2How does a deployment platform orchestrate zero-downtime rollouts across 100K+ servers, supporting canary deployments (1% → 10% → 100%), automatic rollback on health check failure, and blue-green switching?Spinnaker · deployment orchestration
3How does an artifact registry store and serve 1PB+ of build artifacts (Docker images, npm packages, Maven JARs) with global replication, content-addressable deduplication, and vulnerability scanning on upload?Artifactory · artifact storage

29. AI & LLM Systems OpenAIAnthropicGooglePerplexity

LLM serving, RAG pipelines, agent orchestration, AI gateways, guardrails, embedding search · 10 problems

#ProblemCompany / Scale
1How does an LLM serving platform handle 100K+ concurrent chat sessions with variable-length outputs, achieving <500ms time-to-first-token through continuous batching, KV-cache paging (PagedAttention), and speculative decoding, while keeping GPU utilization above 80%?OpenAI · LLM serving at scale
2How does a RAG-powered search engine ingest 10M+ documents, chunk them optimally, generate embeddings, and serve grounded answers with citations in <2s, while keeping hallucination rate below 5% through hybrid retrieval (dense + sparse) and cross-encoder reranking?Perplexity · RAG at scale
3How does an AI gateway route 1M+ requests/day across multiple LLM providers (GPT-4, Claude, Llama), implementing semantic caching (30-60% cost reduction), automatic fallback on provider outages, and per-team token budgets, all with <50ms added latency?Enterprise · AI Gateway
4How does an AI coding assistant serve real-time code completions to 10M+ developers with <200ms latency, retrieving relevant context from the user's repository (100K+ files), ranking suggestions by relevance, and adapting to per-user coding patterns?GitHub Copilot · code AI
5How does a multi-agent system orchestrate 5+ specialized AI agents (researcher, coder, reviewer, planner) to complete complex tasks, managing shared memory, tool execution, inter-agent communication, and graceful failure handling, with total cost under $0.50 per task?AutoGen · Agent orchestration
6How does a vector search platform index 1B+ embeddings and serve similarity queries in <10ms at 50K QPS, supporting real-time index updates (new documents searchable in <5s), metadata filtering, and multi-tenancy with per-tenant isolation?Pinecone · Vector search at scale
7How does a content moderation system classify 500M+ user-generated posts/day using multi-modal AI (text + image + video), achieving <2s classification latency, routing edge cases to human reviewers, and handling adversarial attacks that try to bypass filters?Meta · AI content moderation
8How does a conversational AI platform maintain context across multi-turn conversations for 50M+ daily active users, managing conversation memory (short-term buffer + long-term vector store), session persistence, and personalization, while keeping per-user storage costs under $0.001/day?ChatGPT · Conversation memory
9How does a real-time AI translation system serve 1B+ translation requests/day across 100+ language pairs with <300ms latency, dynamically selecting between specialized models per language pair and falling back to general models for rare pairs?Google Translate · AI at scale
10How does an AI safety platform detect prompt injection attacks, jailbreak attempts, and PII leakage across 100M+ LLM requests/day in <50ms per request, using layered classifiers (fast regex → ML model → LLM judge) with <0.1% false positive rate on legitimate queries?Lakera · AI guardrails at scale