218 System Design Problems

Filter by Concept

1. Chat & Messaging WhatsAppSlackDiscordTelegram

Real-time delivery, fan-out, presence, ordering at scale · 11 problems

#	Problem	Company / Scale
1	How a real-time chat system can deliver a single message to 50,000 online users within 200ms while handling 10B+ messages per day (~115K msgs/sec, peak ~200K msgs/sec) without hitting scalability bottlenecks?	Slack · 10B msg/day
WebSocket / SSE Kafka Pub/Sub Partitioning Redis Load Balancer Consistency
2	How does a messaging system guarantee delivery to 2B+ users who go offline for hours/days, ensuring zero message loss, correct ordering on reconnect, and end-to-end encryption without the server ever seeing plaintext?	WhatsApp · 2B users
Message Queues Encryption (E2E)NoSQL Ordering & Consistency Sharding Multi-Region
3	How does a chat platform maintain 10M+ concurrent WebSocket connections across thousands of gateway servers, handling heartbeats, shard assignment, graceful failover, and voice signaling without dropping connections?	Discord · millions WS
WebSocket Sharding Service Discovery Load Balancer Fault Tolerance UDP (Voice/WebRTC)
4	How does a multi-region chat system guarantee causal message ordering within conversations when an entire cloud region goes down, while maintaining consistency and conflict resolution across geographically distributed nodes?	Multi-region ordering
Multi-Region Consensus Conflict Resolution Clock Sync (Vector Clocks)Fault Tolerance
5	How does a group chat with 1M+ members deliver a single message without causing a fan-out explosion, keeping delivery latency under 2 seconds for all participants regardless of group size?	Telegram · 1M members
Pub/Sub Partitioning Backpressure Caching Strategies Kafka
6	How does a presence system track online/offline status for 100M+ concurrent users in real-time, delivering status changes to relevant contacts within seconds without flooding the network with unnecessary updates?	WhatsApp · 100M presence
Redis (TTL)Pub/Sub Consistent Hashing WebSocket Redis Pub/Sub
7	How does a chat system handle millions of ephemeral "typing..." events per second without persisting anything to disk, while ensuring sub-100ms delivery to all conversation participants?	Slack · ephemeral events
WebSocket Pub/Sub (In-Memory)Backpressure Redis Pub/Sub
8	How does a search system index 10B+ messages with <5s indexing latency from send to searchable, and return full-text results in <100ms across the entire message corpus?	Slack · billions msgs search
Elasticsearch Kafka (CDC)Sharding Indexing CDC
9	How does a notification system deliver push notifications to millions of devices without duplicates, handling device token lifecycle, retry logic, and cross-platform delivery (iOS/Android) reliably?	WhatsApp · dedup push
Message Queues Dead Letter Queue Idempotency NoSQL
10	How does a multi-device messaging app keep messages, read status, and edits perfectly synced across phone/tablet/desktop, resolving conflicts when edits happen simultaneously on different devices?	Telegram · multi-device sync
Conflict Resolution Eventual Consistency Event Sourcing WebSocket Clock Sync
11	Given message events across millions of channels, design a system that continuously computes the top active channels/users for the last 1 hour without scanning historical data, updating rankings within seconds of activity changes?	Slack · top active channels
Stream Processing Redis (Sorted Sets)Kafka Bloom Filters Caching Redis Streams

2. Real-Time Collaboration Google DocsFigmaNotionMiro

Concurrent editing, CRDTs, OT, multiplayer cursors · 7 problems

#	Problem	Company / Scale
1	How does a collaborative editor handle 100 users editing the same paragraph simultaneously, resolving conflicting character insertions without a central lock while maintaining convergence across all clients within 50ms?	Google Docs · OT
Conflict Resolution Eventual Consistency WebSocket Pub/Sub
2	How does a design tool render 50+ cursors moving at 60fps in real-time, achieving conflict-free state merge across all clients while keeping bandwidth under 10KB/sec per user?	Figma · CRDTs
Conflict Resolution WebSocket Consistency Pub/Sub
3	How does a block-based editor sync granular edits (move block, change text, nest) across devices in <100ms, handling concurrent modifications to the same block structure without data loss?	Notion · block sync
Conflict Resolution WebSocket Event Sourcing Consistency
4	How does a collaborative whiteboard handle 1000+ objects being dragged simultaneously by 50 users, merging conflicting position updates without visual glitches or lost operations?	Miro · spatial CRDT
Conflict Resolution WebSocket Pub/Sub Consistency
5	How does a collaborative code editor sync cursor positions, selections, and edits across continents with <150ms latency, maintaining consistent document state despite network delays between geographically distributed participants?	VS Code Live Share
WebSocket Multi-Region Consistency Pub/Sub
6	How does a multiplayer game sync world state for 100 players at 60fps, handling client-side prediction, server reconciliation, and entity interpolation while keeping perceived latency below 100ms?	Gaming · state sync
WebSocket Consistency Backpressure
7	How does a design platform handle real-time collaboration on documents with 500MB+ media assets, ensuring edit operations remain fast (<100ms) regardless of total document size?	Canva · media + collab
Blob Storage CDN WebSocket Conflict Resolution

3. Video Streaming NetflixYouTubeTwitchSpotify

Adaptive bitrate, CDN, transcoding, live vs VOD · 8 problems

#	Problem	Company / Scale
1	How does a streaming platform deliver 4K HDR video to 230M+ subscribers globally without buffering, adapting quality in real-time to each viewer's bandwidth while minimizing rebuffer events to <0.1% of sessions?	Netflix · 230M users
CDN Load Balancer Multi-Region Caching Backpressure
2	How does a video platform transcode 500+ hours of video uploaded every minute into 8+ resolutions, finishing 4K transcoding in <30 minutes while prioritizing live content over on-demand uploads?	YouTube · 500hr/min
Message Queues Partitioning Blob Storage Auto-Scaling
3	How does a live streaming platform deliver video to 5M+ concurrent viewers with <3 second glass-to-glass latency, while also supporting sub-second interactive streams for smaller audiences?	Twitch · live low-latency
WebSocket CDN Pub/Sub Backpressure Multi-Region
4	How does a streaming platform handle 25M concurrent viewers during a single live event, gracefully degrading quality under load while maintaining stream continuity and minimizing regional failures?	Hotstar · 25M concurrent
CDN Load Balancer Auto-Scaling Multi-Region Backpressure
5	How does a short-video platform achieve instant playback (<200ms to first frame) for an infinite-scroll feed, ensuring zero perceived loading time as users swipe between videos?	TikTok · instant playback
CDN Caching Blob Storage Load Balancer
6	How does adaptive bitrate streaming prevent buffering on degrading networks, seamlessly switching quality mid-stream without visible artifacts while maximizing video quality for available bandwidth?	ABR · HLS/DASH
CDN Caching Load Balancer Backpressure
7	How do CDNs cache and serve video segments at the edge for 1B+ daily requests, maximizing cache hit rates while minimizing origin load and ensuring popular content is always available at the nearest edge?	CDN · edge caching
CDN Caching Consistent Hashing Load Balancer
8	How does an audio streaming platform achieve gapless playback with offline mode, ensuring zero gaps between tracks, seamless quality adaptation over cellular, and encrypted local storage for offline content?	Spotify · audio stream
CDN Caching Blob Storage Encryption

4. Video Calling / WebRTC ZoomGoogle MeetDiscord

P2P, SFU, MCU, signaling, NAT traversal · 6 problems

#	Problem	Company / Scale
1	How does a video conferencing system handle 300-person meetings with screen sharing, adapting per-participant video quality in real-time based on available bandwidth while keeping total meeting latency under 200ms?	Zoom · 300 participants
WebSocket Load Balancer UDP Multi-Region
2	How does a video calling platform dynamically switch between peer-to-peer (2 users, lowest latency) and server-relayed (3+ users) topology, performing seamless mid-call migration without audio/video interruption?	Google Meet · adaptive
WebSocket Load Balancer Service Discovery UDP
3	How does a voice chat platform handle 100+ users in a single voice channel with <50ms audio latency, selectively mixing only active speakers while maintaining clear audio for all participants?	Discord · voice channels
WebSocket Pub/Sub UDP Sharding
4	How does a real-time communication system establish peer-to-peer connections through NATs and firewalls, discovering public endpoints and falling back to relay when direct connection is impossible, while minimizing connection setup time?	WebRTC · NAT traversal
UDP DNS Service Discovery Load Balancer
5	How does a telehealth video platform ensure HIPAA-compliant calls with recording, maintaining end-to-end encryption, consent-based recording, audit trails, and data residency controls per jurisdiction?	Telehealth · compliance
Encryption WebSocket Blob Storage Consistency
6	How does a live audio room platform handle thousands of listeners with <500ms latency, dynamically promoting speakers from the audience while distributing audio efficiently to large audiences?	Twitter Spaces · audio rooms
WebSocket Pub/Sub CDN Load Balancer

5. Ticket Booking UberAirbnbAmazonTicketmaster

Seat locking, double-booking prevention, flash sales · 6 problems

#	Problem	Company / Scale
1	Design a ticket booking system where 5M users attempt to book 50K concert seats simultaneously, guaranteeing no double booking, fair queue ordering with position tracking, distributed seat locking with expiry, and bot detection?	BookMyShow · 5M users, 50K seats
Consistency Redis Message Queues Sharding
2	How does a movie theater chain handle seat selection for 500+ screens simultaneously, showing real-time seat availability updates to thousands of concurrent users while preventing double-booking through optimistic locking with sub-second conflict resolution?	AMC · real-time seat map
Redis WebSocket Consistency Sharding
3	How does a rental platform prevent double-booking across time zones when two users in different continents try to book the same dates, detecting and resolving calendar conflicts with automatic resolution?	Airbnb · calendar sync
Consistency Conflict Resolution Multi-Region NoSQL
4	How do airlines handle seat selection with distributed inventory across 100+ booking channels (website, app, agents, GDS), preventing overselling while maintaining responsive seat availability display?	Airlines · distributed inv
Sharding Consistency Multi-Region Message Queues
5	How does a ticketing platform manage 1M+ users in a virtual queue with fair ordering, providing real-time position updates and estimated wait times while preventing bot abuse and queue jumping?	Ticketmaster · queue
Redis Message Queues WebSocket
6	Design a flash sale system where 20M users try to buy the same product within 2 minutes, preventing overselling while maintaining inventory consistency across 5 regions and confirming orders within seconds?	Amazon · 20M users, 2 min flash
Redis Sharding Kafka Consistency

6. Cache & CDN CloudflareRedisAkamai

Cache invalidation, thundering herd, edge compute · 8 problems

#	Problem	Company / Scale
1	How does a social platform invalidate cached objects across 1000+ servers within 1 second of a write, preventing stale reads while avoiding thundering herd on the backing store?	Meta · TAO cache
Caching Redis CDN Pub/Sub Consistency
2	How does a streaming platform prevent thundering herd when a hot cache key expires and 100K requests simultaneously hit the database, ensuring only one request rebuilds the cache while others wait or receive stale data?	Netflix · EVCache
Caching Redis Consistent Hashing Backpressure
3	How does an edge network serve 45M+ requests/sec from 300+ PoPs without hitting origin, maximizing cache hit rates through tiered caching and intelligent routing for cache misses?	Cloudflare · edge
CDN Caching Load Balancer DNS Multi-Region
4	How does a social platform cache the home timeline for 400M+ users, handling the asymmetry between normal users (hundreds of followers) and celebrities (100M+ followers) without overwhelming write capacity?	Twitter · timeline cache
Caching Redis Sharding Pub/Sub
5	How does a distributed cache cluster handle 10M+ ops/sec with automatic failover, completing replica promotion within 2 seconds of node failure while maintaining consistent routing during topology changes?	Redis Cluster · 10M ops/sec
Redis Sharding Replication Consistent Hashing Fault Tolerance
6	How do CDNs purge cached content globally within 150ms for breaking news updates, invalidating stale content at all edge locations without causing origin overload from simultaneous cache misses?	CDN · instant purge
CDN Pub/Sub Caching Multi-Region
7	How does a social platform cache ephemeral content (Stories) that expires after 24 hours, ensuring instant access for sequential viewing while automatically evicting expired content without manual cleanup?	Instagram · TTL cache
Caching Redis Blob Storage NoSQL
8	How does an edge computing platform execute custom logic (authentication, A/B routing, header manipulation) at 300+ PoPs with <1ms cold start, deploying code changes globally in <30 seconds without origin round-trips?	Cloudflare Workers · edge compute
CDN Caching Serverless Load Balancer

7. Queues & Events KafkaRabbitMQAWS SQS

Kafka, exactly-once, dead letters, event sourcing, streaming analytics, notifications · 16 problems

#	Problem	Company / Scale
1	How does a ride-hailing platform process 1M+ ride events per second with city-level locality, guaranteeing exactly-once processing semantics and computing real-time surge pricing from the event stream?	Uber · Kafka 1M/sec
Kafka Partitioning Event Sourcing Stream Processing Idempotency
2	How does a payment platform guarantee exactly-once processing when network retries can duplicate requests, ensuring no double-charges while maintaining at-least-once delivery guarantees from upstream systems?	Stripe · exactly-once
Idempotency Kafka Consistency Message Queues
3	How does a professional network handle 4 trillion events/day across 100K+ partitions, supporting schema evolution for backward compatibility while auto-scaling consumers based on lag?	LinkedIn · 4T events
Kafka Partitioning Auto-Scaling Schema Registry
4	How do you design a dead letter queue that never loses messages, isolating poison messages from healthy processing while providing retry policies, monitoring with alerting, and manual replay tooling for operations?	DLQ · poison messages
DLQ Message Queues Kafka Idempotency
5	How does a streaming platform use event sourcing for microservices, storing all state changes as immutable events, rebuilding materialized views on demand, and handling schema evolution without breaking consumers?	Netflix · event sourcing
Event Sourcing Kafka CQRS Consistency
6	How does an e-commerce platform handle order events across 1M+ merchants in a multi-tenant event cluster, enforcing per-merchant quotas and providing priority lanes for high-volume sellers without noisy-neighbor effects?	Shopify · multi-tenant
Kafka Partitioning Sharding
7	How do you implement the SAGA pattern for a distributed order?payment?inventory?shipping transaction, coordinating compensating transactions for rollback and handling timeouts across independently deployed services?	Saga · choreography
Kafka Consistency Message Queues Event Sourcing
8	How does a distributed event streaming system handle consumer group rebalancing without message loss or duplicate processing, minimizing rebalance time while maintaining exactly-once delivery semantics?	Kafka · rebalance
Kafka Partitioning Consistency Fault Tolerance
9	Given ad impression and click events from billions of daily ad requests, design an analytics system that computes CTR/CPC dashboards per advertiser with less than 1-minute delay from event occurrence to dashboard visibility.	Google Ads · CTR <1min delay
Stream Processing Kafka Caching Sharding
10	Given IoT telemetry from 500M smart devices sending events every few seconds, design a real-time anomaly detection system that detects outages/spikes within 10 seconds, maintaining statistical baselines per device and alerting on deviations.	IoT · 500M devices anomaly
Stream Processing Kafka Redis
11	Given repository events (push, fork, PR, issue, star) from millions of repositories, build a "Trending Repositories" system that updates rankings globally every minute, weighting recent activity higher than older activity and segmenting by language/topic.	GitHub · trending repos
Stream Processing Redis Kafka Caching
12	How does a notification orchestration platform decide what to send (push/email/SMS/in-app), when to send it (optimal timing per user), and how to batch/deduplicate across channels · processing 1B+ notification decisions/day with user preference enforcement?	Notification orchestration
Message Queues Kafka Pub/Sub Caching
13	How does a notification service deliver push/email/SMS/in-app to 500M users, handling priority queuing, multi-channel routing, device token management, template rendering, and at-least-once delivery at 5B notifications/day?	Notification · core architecture
Message Queues Kafka Load Balancer Idempotency Redis Pub/Sub Redis Streams
14	How does a notification system guarantee delivery via cascading channel fallback (push→SMS→email), using per-notification state machines, timeout-based escalation, and atomic CAS to prevent duplicate sends across channels?	Notification · channel fallback
Message Queues Fault Tolerance Redis Idempotency Redis Pub/Sub
15	How does a preference system check 25B preference records in <1ms per notification, using Bloom filters for fast rejection (70%), Redis for exact lookup, timezone-aware quiet hours, and sliding window frequency caps?	Notification · preferences & caps
Caching Redis Bloom Filters Redis Pub/Sub
16	How does a deduplication layer prevent sending the same notification twice at 5B notifications/day, using content-hash dedup keys, Redis SETNX atomic check-and-set, and 24-hour TTL windows while maintaining at-least-once processing?	Notification · deduplication
Idempotency Redis Kafka Redis Streams

8. Social Media Feeds & Recommendations XInstagramTikTokPinterest

Fan-out on write/read, ranking, real-time updates, personalization engines, trending computation · 18 problems

#	Problem	Company / Scale
1	How does a social platform deliver posts to 400M+ users' timelines in real-time, handling the asymmetry between users with few followers and celebrities with millions, while keeping timeline delivery under 5 seconds?	Twitter · fan-out
Redis Kafka Pub/Sub Sharding Caching Consistent Hashing
2	How does a photo-sharing platform rank your feed from millions of candidate posts, balancing relevance, recency, and popularity while maintaining exploration/exploitation balance to avoid filter bubbles?	Instagram · ML ranking
Caching Kafka Stream Processing Sharding
3	How does a short-video platform learn your preferences within 3 minutes of first use, leveraging real-time engagement signals (watch time, replays, shares) to personalize recommendations for brand-new users with no history?	TikTok · cold-start rec
Caching Stream Processing Kafka Redis
4	How does a forum handle nested comments with millions of votes, supporting deep threading, real-time vote counts, and efficient pagination ("load more replies") without N+1 query performance degradation?	Reddit · comment tree
NoSQL Caching Sharding Indexing
5	How does a video platform count views accurately at 1B+ views/day without double-counting, filtering bot traffic and fraudulent views while keeping the public count updated within 5 minutes of actual views?	YouTube · view counting
Kafka Redis Stream Processing Consistency
6	How does a social platform handle the celebrity problem · a user with 100M followers posts, and you can't write to 100M timelines simultaneously · while still delivering the post to active followers within seconds?	Facebook · celebrity fan-out
Redis Kafka Pub/Sub Caching Sharding
7	How does a professional network generate "People You May Know" recommendations in real-time, computing relationship suggestions from graph connections (friends-of-friends, shared attributes) and updating as new connections form?	LinkedIn · graph rec
NoSQL Caching Sharding Redis
8	How does a visual discovery platform handle infinite scroll with personalized content, pre-fetching upcoming pages, computing layout server-side, and re-ranking in real-time based on scroll behavior and engagement signals?	Pinterest · infinite scroll
Caching CDN Sharding Redis
9	How does a music streaming platform generate personalized playlists (Discover Weekly) for 600M+ users, balancing familiar preferences with novel discovery while avoiding filter bubbles and repetitive recommendations?	Spotify · Discover Weekly
Kafka Caching Stream Processing Blob Storage
10	How does a video platform recommend the next video with 80%+ click-through rate, narrowing candidates from 1B+ videos to a ranked shortlist in <50ms while incorporating real-time watch signals (watch time, skip, replay)?	YouTube · next video rec
Caching Redis Stream Processing Load Balancer
11	Given an API that records every song play event (userId, songId, albumId, timestamp, country), build a system that continuously computes the top 100 songs/albums for the last 5 min / 1 hour / 1 week globally and per country, processing 5B+ listen events/day with rankings updating within seconds of activity changes.	Spotify · top charts 5B/day
Stream Processing Redis Kafka Sharding
12	Given a stream of video watch events (videoId, userId, watchDuration, region), design a system that updates the "Trending Videos" page every 30 seconds while handling 50M+ concurrent viewers, weighting scores by views, watch percentage, and velocity.	YouTube · Trending 50M concurrent
Stream Processing Redis Kafka Caching
13	How does a social platform implement real-time content moderation at scale, classifying 500M+ posts/day for policy violations using multi-modal ML (text + image + video), routing edge cases to human reviewers within minutes, and handling appeals with audit trails?	Meta · content moderation
Kafka Stream Processing Message Queues Caching
14	How does a social platform implement real-time A/B testing for feed ranking algorithms, splitting traffic across 100+ concurrent experiments, measuring engagement metrics with statistical significance within hours, and safely rolling back experiments that degrade user experience?	Meta · feed experimentation
Kafka Caching Redis Stream Processing
15	How does a social platform serve personalized notifications to 2B+ users, deciding what to notify (likes, comments, follows), batching low-priority notifications, computing optimal send times per user based on activity patterns, and suppressing notifications during quiet hours?	Instagram · smart notifications
Message Queues Kafka Redis
16	How does a social platform detect and suppress viral misinformation in real-time, scoring content credibility within seconds of posting using engagement velocity anomalies, cross-referencing fact-check databases, and applying distribution throttling before content reaches millions?	Meta · misinformation detection
Stream Processing Kafka Redis
17	How does a social platform aggregate engagement notifications ("5 people liked your post") instead of sending 5 separate notifications, using time-windowed batching with immediate first-event delivery and 15-minute batch windows?	Instagram · notification batching
Stream Processing Redis Kafka Redis Streams
18	How does a social platform build a real-time in-app notification feed with badge counts, WebSocket push for instant updates, Redis sorted set for the feed, atomic unread counter management, and cursor-based pagination for 500M users?	Facebook · notification feed
WebSocket Redis Pub/Sub Redis Pub/Sub

9. Geo, Ride Sharing & Food Delivery UberGoogle MapsDoorDashZomato

Location tracking, geofencing, matching, ETA, order lifecycle, real-time tracking · 13 problems

#	Problem	Company / Scale
1	How does a ride-hailing platform match riders to the nearest available driver within 3 seconds, scoring candidates by distance, ETA, and driver rating while avoiding global scans of all drivers?	Uber · H3 matching
Sharding Redis Kafka Load Balancer
2	How does a maps platform calculate ETA for millions of simultaneous route requests, incorporating real-time traffic data from GPS probes and ML-based travel time prediction for accuracy within 10% of actual travel time?	Google Maps · ETA
Caching Sharding Stream Processing Redis
3	How does a delivery platform optimize routes for 1M+ concurrent orders, batching nearby orders and dynamically re-routing when new orders arrive mid-delivery while minimizing total delivery time?	DoorDash · routing
Kafka Redis Message Queues Sharding
4	How does a ride platform ingest and query 5M+ driver location updates every 4 seconds, supporting spatial queries ("find drivers within 2km") without scanning all drivers globally?	Uber · location ingestion
Redis Kafka Sharding Partitioning
5	How does an AR game handle millions of players interacting with geo-anchored objects, maintaining server-authoritative state for shared objects while providing smooth client-side rendering at 60fps?	Pok·mon GO · geo-spatial
Redis WebSocket Sharding Consistency
6	Given ride request and completion events (driverId, riderId, lat, long, timestamp), build a surge pricing system that recomputes hotspot pricing for every city zone within 10 seconds during peak traffic, computing supply/demand ratios per zone with smoothing to avoid price oscillation.	Uber · surge 10s recompute
Stream Processing Kafka Redis Caching
7	How does a local search platform find "restaurants near me" from 200M+ businesses in <50ms, supporting radius queries, real-time availability filtering, and ranking by distance, rating, and relevance?	Google Maps · local search
Redis Sharding Pub/Sub Stream Processing
8	How does a food delivery platform push live driver-location updates to the customer's map every 2 seconds, showing smooth movement interpolation on the client even when GPS updates arrive irregularly?	Zomato/Swiggy · live tracking
CDN Caching Blob Storage Sharding
9	How does a food delivery platform calculate multi-leg ETA (restaurant prep + driver pickup + delivery) that updates in real-time, incorporating per-restaurant prep time models, live traffic, and Bayesian updates as each leg completes?	Uber Eats · multi-leg ETA
Kafka Message Queues Redis Consistency
10	How does a food delivery platform manage the 3-party order lifecycle (customer → restaurant → driver) with state machine transitions (placed?accepted?preparing?ready?picked_up?delivered), timeout handlers per state, and compensating actions on cancellation?	Swiggy · order lifecycle
WebSocket Redis Kafka Pub/Sub
11	How does a food delivery platform assign orders to drivers optimizing for delivery time, driver earnings, and restaurant wait, enforcing constraints (distance < 3km, capacity = 2 orders) and re-assigning when a driver rejects within 30s?	DoorDash · driver dispatch
Elasticsearch Caching Redis Sharding
12	Build a traffic analytics platform where GPS pings arrive every 3 seconds from 100M vehicles, computing congestion levels per road segment and ETA updates that refresh within 5 seconds, pushing tile-based map updates to clients.	Google Maps · 100M vehicles traffic
Message Queues Kafka Blob Storage NoSQL
13	How does a geofencing platform detect when millions of devices enter/exit custom geographic boundaries in real-time, processing 10M+ location updates/sec against 100M+ geofences with sub-second trigger latency for marketing notifications and compliance alerts?	Radar · geofencing 10M/sec
Stream Processing Redis Caching Kafka

10. Payment Systems StripePayPalApple PayVisa

Idempotency, ledgers, reconciliation, fraud, trading · 10 problems

#	Problem	Company / Scale
1	How does a payment platform process millions of charges without double-charging, ensuring idempotent request handling, two-phase state transitions (pending?captured?settled), and at-least-once delivery with server-side deduplication?	Stripe · idempotency
Idempotency Transactions Message Queues Consistency DLQ
2	How does a payment platform detect fraud in real-time across 400M accounts, scoring each transaction in sub-100ms using velocity checks, device fingerprinting, and geo-anomaly detection while routing edge cases to human review?	PayPal · fraud ML
3	How does a cross-border payment platform handle transfers across 80+ currencies, locking FX rates within 30-second windows, managing multi-currency ledgers, and batching settlements to minimize wire fees?	Wise · FX ledger
4	How does a trading platform handle stock orders with sub-millisecond latency, maintaining deterministic order matching, lock-free order book operations, and complete audit replay for regulatory compliance?	Robinhood · trading
5	How does a digital wallet handle offline tap-to-pay transactions, using pre-authorized tokens with device-local transaction limits and deferred settlement that reconciles when connectivity returns?	Apple Pay · offline
6	How do banks reconcile millions of transactions daily without losing a cent, using double-entry ledgers, end-of-day batch reconciliation, exception queues for mismatches, and cryptographic audit trails for regulatory compliance?	Banking · reconciliation
7	How does an e-commerce platform handle checkout for 1M+ merchants during Black Friday peak, maintaining order confirmation within seconds while gracefully degrading non-critical features under extreme load?	Shopify · flash checkout
8	How does a cryptocurrency exchange handle 100K+ trades/sec with real-time order matching, maintaining a deterministic in-memory order book, supporting limit/market/stop orders, and providing guaranteed execution ordering with nanosecond timestamps for regulatory audit?	Binance · crypto exchange
9	Design a stock market analytics system ingesting millions of trades per second that continuously updates top gainers, losers, and unusual volume spikes with millisecond latency, computing OHLC aggregations per symbol and pushing updates to trading terminals in real-time.	Stock exchange · ms-latency analytics
10	How does a payment API allow merchants to run batch settlements (500 charges at once) while capping the average rate at 100 req/sec, using token bucket with capacity (burst) vs refill rate (average) as independent parameters?	Payments · burst rate limiting
Rate Limiting Idempotency Redis Transactions

11. API Gateway & Backend CloudflareNetflixKong

Rate limiting, auth, circuit breaking, service mesh · 12 problems

#	Problem	Company / Scale
1	How does an edge network rate-limit 45M+ requests/sec without adding latency, distributing rate counters across hundreds of PoPs while using fail-open policies to avoid blocking legitimate traffic during sync delays?	Cloudflare · rate limit
Rate Limiting API Gateway Redis CDN Load Balancer
2	How does a streaming platform's API gateway handle 100B+ API calls/day, supporting dynamic filter/routing rule updates without restart and shedding low-priority requests during overload?	Netflix · Zuul
3	How does a service mesh handle mutual TLS for 1000s of microservices, auto-rotating certificates and providing zero-code encryption between every service pair without application changes?	Istio · mTLS
4	How does a developer platform handle API versioning across millions of integrations, supporting deprecation timelines and backward-compatible schema evolution that doesn't break existing clients?	GitHub · API versioning
5	How does a payment API achieve 99.999% uptime (5 minutes downtime/year), serving requests from multiple active regions with graceful degradation during partial outages?	Stripe · five nines
6	How does a streaming platform implement circuit breakers to prevent cascading failures, detecting failure rate thresholds (50% in 10s window) and providing fallback responses that degrade gracefully while allowing recovery?	Netflix · circuit breaker
7	How does an API gateway handle authentication for 10K+ microservices, validating tokens at the edge with caching, and providing service-to-service auth that doesn't require per-request token exchange internally?	Kong · auth gateway
8	How does a GraphQL federation layer compose a unified schema from 50+ microservice subgraphs, planning cross-service queries efficiently and identifying slow resolvers across subgraphs via distributed tracing?	Apollo · federation
9	How does a public REST API enforce different rate limits based on subscription tier (free: 100 req/min, paid: 2,000 req/min), keyed by API key, with real-time tier lookup and per-key sliding window counters at scale?	Stripe/GitHub · tiered API limits
Rate Limiting API Gateway Redis REST APIs
10	How does a rate limiter enforce a global per-user limit across 20 app servers behind a load balancer, solving the distributed counter race condition with atomic Redis Lua scripts while keeping overhead under 2ms per request?	Distributed · N-server limiter
Rate Limiting Redis Load Balancer Consistency
11	How does an SMS gateway rate-limit outbound messages to match a carrier's fixed acceptance rate (100 msg/sec, no bursts), using a leaky bucket with bounded queue, priority lanes for OTP, and TTL-based expiry?	Twilio · outbound leaky bucket
Rate Limiting Message Queues Backpressure DLQ
12	How does a multi-tenant SaaS platform prevent one tenant from degrading service for others, using two-layer rate limiting (per-tenant + global capacity), weighted fair queueing, and dynamic whale-tenant isolation?	SaaS · noisy neighbor
Rate Limiting Load Balancer Redis Backpressure

12. Database & Storage PostgresMongoDBRedisElasticsearch

Sharding, replication, consistency, migrations · 10 problems

#	Problem	Company / Scale
1	How does a database sharding layer transparently shard 10B+ messages across hundreds of shards, supporting online resharding (split/merge without downtime) and connection pooling that reduces backend connections by 100·?	Slack · Vitess
Sharding SQL Indexing Replication Partitioning
2	How does a globally distributed database achieve strong consistency with <10ms reads, bounding clock uncertainty across regions and using consensus-based replication across 5+ regions?	Google Spanner · TrueTime
3	How does a chat platform store trillions of messages in a wide-column database, partitioning by channel with time-ordered clustering, and tuning compaction strategies for time-series append patterns?	Discord · ScyllaDB
4	How does a platform migrate billions of rows between database schemas with zero downtime, validating correctness through shadow reads and providing instant rollback capability during cutover?	Uber · online migration
5	How does a serverless database achieve single-digit ms latency at any scale, routing requests to the correct partition via in-memory partition maps and providing adaptive burst capacity?	DynamoDB · partition
6	How does a productivity platform shard a relational database for millions of workspaces, routing queries by workspace_id, pooling connections efficiently, and automatically rebalancing shards as workspaces grow?	Notion · Postgres shard
7	How does an object storage service achieve 99.999999999% (11 nines) durability, splitting data into fragments across availability zones with integrity checksums on every read and automatic repair of degraded objects within hours?	S3 · 11 nines
8	How does a streaming platform handle write-heavy workloads (1M+ writes/sec) in a distributed database, tuning consistency levels and compaction strategies based on read/write ratio while maintaining token-aware routing?	Netflix · Cassandra
9	How does a distributed SQL database survive entire region failures without data loss, using consensus per data range, leaseholder placement policies, and non-voting replicas for fast failover (<10s RTO)?	CockroachDB · multi-region
10	How does a search engine index and search petabytes of logs in <100ms, using inverted indexes with time-based rotation and scatter-gather queries across 1000s of shards with early termination?	Elasticsearch · log search

13. Distributed Systems ApacheGoogle SpannerCockroachDB

Consensus, leader election, clock sync, partition tolerance · 8 problems

#	Problem	Company / Scale
1	How does a distributed key-value store achieve consensus across 5 nodes, handling leader election, log replication, and split-brain prevention requiring majority quorum (3 of 5) for all decisions?	etcd · Raft
Consensus Fault Tolerance Leader Election Replication Consistency
2	How does a distributed event streaming platform handle leader election when a broker dies, reassigning partitions within seconds without message loss while promoting only in-sync replicas?	Kafka · KRaft
3	How does a distributed system achieve causal ordering of events across data centers without synchronized clocks, using hybrid logical clocks (HLC) to bound uncertainty and provide happens-before guarantees for cross-region transactions?	HLC · causal ordering
4	How does a distributed system implement linearizable reads without sacrificing availability, choosing between leader leases, quorum reads, and read-repair strategies based on consistency requirements and latency budgets?	Linearizability · read strategies
5	How does a coordination service manage distributed locks, leader election, and configuration for 1000s of services, detecting liveness via ephemeral sessions and providing sequential ordering guarantees for distributed queues?	ZooKeeper · coordination
6	How does a wide-column database maintain availability during network splits (AP in CAP), offering tunable consistency levels, hinted handoff for downed nodes, and anti-entropy repair to eventually converge divergent replicas?	Cassandra · AP system
7	How do distributed systems detect and recover from split-brain scenarios, invalidating stale leaders with monotonic fencing tokens, epoch-based leadership with lease expiry, and ensuring only one leader can make progress at any time?	Split-brain · fencing
8	How does a platform enforce a single global rate limit of 10,000 req/min for a user across three regions, balancing the fundamental trade-off between regional Redis (low latency, loose accuracy) and central store (strict accuracy, cross-region RTT) using token leasing with async reconciliation?	Global · multi-region limiter
Rate Limiting Multi-Region Redis Consistency Fault Tolerance

14. Live Sports & Real-Time Event Broadcasting ESPNCricbuzzDream11Hotstar

Live score push, ball-by-ball updates, millions of concurrent readers consuming the same event stream · 8 problems

#	Problem	Company / Scale
1	Build a live scoring platform where every ball event (runs, wicket, over, batsman change) must reach 30M concurrent users globally within sub-second latency during World Cup finals, supporting both persistent connections and fallback polling for all client types.	Cricbuzz · 30M concurrent, sub-second
WebSocket Pub/Sub CDN Caching Stream Processing
2	How does a sports platform handle 10M+ simultaneous score poll requests during a World Cup final without melting the backend, serving fresh scores (=1s stale) from edge while protecting origin servers from traffic spikes?	ESPN · World Cup traffic
3	How does a fantasy sports platform lock/unlock player selections in real-time as a match starts, performing atomic state transitions triggered by match-start events with eventual consistency for leaderboard updates?	Dream11 · lineup lock
4	How does a fantasy sports platform calculate live leaderboard rankings for 10M+ users as each ball is bowled, applying pre-computed point deltas per event and updating rankings incrementally without full recomputation?	Dream11 · live leaderboard
5	How does a live sports platform ingest events from stadium data feeds (ball tracking, hawk-eye) and normalize them into a unified event stream within 200ms, deduplicating events and handling out-of-order delivery from multiple feed sources?	Sports data · event ingestion
6	How does a betting platform update live odds for 1000+ markets simultaneously as match events occur, recalculating odds within milliseconds and handling stale-data rollback when events are corrected?	Betting · live odds
7	How does a live commentary platform handle millions of users receiving the same text/audio commentary stream without per-user fan-out, efficiently broadcasting identical content to massive audiences with minimal per-user resource cost?	Commentary · broadcast
8	Build a live sports notification platform where wicket/goal/touchdown events must push notifications to 100M subscribed users within a few seconds globally, classifying event priority (high: goal/wicket vs low: boundary) and distributing push load across regional gateways.	Sports · 100M push notifications

15. File Upload & Media Processing DropboxGoogle DriveYouTube

Chunked upload, sync engines, transcoding pipelines, deduplication · 7 problems

#	Problem	Company / Scale
1	How does a cloud storage platform sync file changes across millions of devices within seconds, deduplicating content at the block level and tracking per-file block maps for efficient delta sync?	Dropbox · sync engine
Conflict Resolution CDC Message Queues Consistency Blob Storage
2	How does a file platform handle resumable uploads for 5GB+ files over unreliable networks, uploading in chunks with per-chunk checksum verification and automatic retry from the last successful chunk?	Google Drive · resumable upload
3	How does a video platform process 500+ hours of uploaded video per minute into multiple formats, prioritizing live content over VOD, executing DAG-based transcoding pipelines, and storing intermediate results between stages?	YouTube · transcoding pipeline
4	How does a cloud storage platform deduplicate files across 700M+ users to save petabytes of storage, using content-defined chunking, block-level hashing, and reference counting to safely garbage-collect unreferenced blocks?	Dropbox · deduplication
5	How does a photo platform generate thumbnails, apply filters, and extract metadata for 100M+ uploads/day, processing each image into multiple resolutions in parallel and pre-warming CDN caches for popular images?	Instagram · image processing
6	How does a collaboration platform handle concurrent edits to the same file by multiple users, using transform-based resolution for text and lock-based editing for binary files with conflict resolution UI for manual merge?	Google Docs · concurrent edit
7	How does a cloud platform implement file versioning and point-in-time restore for billions of objects, using copy-on-write semantics, version chains, and lifecycle policies that auto-delete versions older than 30 days?	S3 / Dropbox · versioning

16. Search Systems GoogleElasticsearchAlgolia

Inverted indexes, ranking, typeahead, personalization, freshness · 9 problems

#	Problem	Company / Scale
1	How does a web search engine index 100B+ pages and return ranked results in <200ms, combining authority scoring with relevance ranking across a tiered index serving architecture with early termination?	Google · web search
Elasticsearch Sharding Indexing Caching Load Balancer
2	How does an e-commerce platform search 500M+ products with filters (price, rating, brand) in <50ms, boosting results by relevance, recency, and popularity while personalizing re-ranking based on purchase history?	Amazon · product search
3	How does a professional network search 900M+ member profiles with complex filters (location, skills, company), supporting real-time index updates for profile changes within 10 seconds of modification?	LinkedIn · people search
4	How does a search engine implement typeahead/autocomplete that returns suggestions in <50ms as the user types, serving pre-computed top-K suggestions per prefix with personalized ranking based on recent searches?	Google · typeahead
5	How does a search engine handle spelling correction and "did you mean" suggestions in real-time, computing edit distance, mining query logs for common corrections, and supporting phonetic matching?	Google · spell correction
6	How does a food delivery platform search restaurants with geo-filtering, real-time availability, and delivery time estimation, re-ranking results by ETA and updating availability as orders come in?	DoorDash · local search
7	How does a search engine keep its index fresh when millions of pages change daily, prioritizing crawl frequency by change rate, supporting real-time index updates for breaking news, and periodically re-indexing for consistency?	Google · index freshness
8	How does a platform implement semantic/vector search that understands meaning beyond keywords, combining embedding-based similarity with traditional keyword scoring for hybrid results using approximate nearest neighbor search?	Semantic · vector search
9	Given search query logs from millions of searches per second, design a system that continuously computes the most searched queries/products/categories globally and regionally, using hierarchical aggregation and approximate top-K algorithms for memory-efficient ranking.	Search · trending queries

17. Scheduling & Calendar Google CalendarCalendlyTemporal

Recurring events, timezone handling, conflict detection, availability · 7 problems

#	Problem	Company / Scale
1	How does a calendar platform handle recurring events (every Tuesday, except holidays) across timezones, generating occurrences on-the-fly with exception handling without storing millions of individual instances?	Google Calendar · recurrence
NoSQL Consistency Sharding Message Queues Multi-Region
2	How does a calendar platform detect scheduling conflicts across 100M+ users in real-time, merging free/busy data across multiple calendars and handling concurrent booking attempts with optimistic locking?	Outlook · conflict detection
3	How does a meeting scheduler find available slots across 10 participants in different timezones, respecting working-hours constraints per timezone and scoring preferences to minimize early morning/late night for any participant?	Calendly · availability
4	How does a calendar platform sync events across devices and third-party integrations, supporting real-time push notifications for changes with periodic full-sync as fallback and conflict resolution for shared calendars?	Google Calendar · sync
5	How does a task scheduling platform (cron-at-scale) execute millions of scheduled jobs at their exact trigger time, guaranteeing at-least-once execution with idempotency for missed windows and leader-based clock ticking?	Temporal / Airflow · job scheduling
6	How does a calendar platform handle timezone changes (DST transitions, country rule changes) without breaking existing events, storing events in local time + timezone ID and re-computing UTC offsets on rule changes?	Calendar · DST handling
7	How does a notification platform schedule 100M notifications/day for future delivery at timezone-aware optimal times ("send at 9am user-local-time"), using delayed queues, batch scheduling per target-minute, and <1s delivery precision?	Notification · scheduled delivery
Message Queues Redis Multi-Region Redis Streams

18. Observability & Monitoring DatadogGrafanaPrometheus

Distributed tracing, metrics aggregation, alerting pipelines, log analytics · 9 problems

#	Problem	Company / Scale
1	How does a distributed tracing system correlate requests across 1000+ microservices, propagating trace context through async boundaries (Kafka, queues) and sampling intelligently to keep storage under 1% of total traffic while capturing all error traces?	Uber · Jaeger tracing
Tracing Kafka Elasticsearch Stream Processing
2	How does a metrics platform ingest 500M+ time-series data points per second, supporting real-time aggregation (P50/P99/max) with 10-second granularity and multi-dimensional queries across 100K+ metric names?	Datadog · 500M metrics/sec
3	How does an alerting system evaluate 10M+ alert rules every 15 seconds without false positives, supporting complex conditions (rate-of-change, anomaly detection), alert grouping, and escalation policies with on-call rotation?	PagerDuty · alerting at scale
4	How does a log analytics platform ingest 1PB+ of logs daily from millions of sources, indexing them for sub-second search while applying retention policies and providing real-time tail functionality for debugging?	Splunk · PB-scale logs
5	How does a platform implement real-time error tracking that groups millions of exceptions into unique issues, detecting regressions within minutes of deployment and auto-assigning to the team that owns the failing code path?	Sentry · error grouping
6	How does a cloud platform build real-time service dependency maps from trace data, detecting cascading failures within seconds and identifying the root-cause service in a chain of 20+ dependent services?	AWS X-Ray · dependency map
7	How does a platform implement SLO-based monitoring that continuously computes error budgets across 10K+ services, triggering automated responses (traffic shifting, rollback) when burn rate exceeds thresholds?	Google · SLO monitoring
8	How does a real-time dashboard system serve 100K+ concurrent viewers watching the same metrics, pushing incremental updates via WebSocket without recomputing full queries per viewer?	Grafana · live dashboards
9	How does a notification platform track delivery metrics (sent, delivered, opened, clicked, bounced) for 5B notifications/day, using tracking pixels, redirect URLs for clicks, ISP feedback loop processing, and real-time per-template effectiveness dashboards?	Notification · analytics & deliverability
Tracing Stream Processing Kafka

19. ML Model Serving & Feature Stores OpenAITensorFlowPyTorchMeta

Real-time inference, feature computation, model deployment, A/B testing · 8 problems

#	Problem	Company / Scale
1	How does a ride platform serve ML predictions (ETA, surge, fraud) at 1M+ requests/sec with P99 <10ms latency, loading 100+ models into GPU memory, handling model version rollouts, and falling back to rule-based systems when models are unavailable?	Uber · Michelangelo
Caching Load Balancer Kafka Docker/K8s Stream Processing
2	How does a feature store compute and serve real-time features (user's last 5 actions, rolling 1-hour spend) for ML models at prediction time, combining batch-computed features with streaming features while maintaining point-in-time correctness?	Feast · real-time features
3	How does a search platform deploy new ranking models to production without degrading relevance, using shadow scoring, interleaved experiments, and gradual traffic ramp with automatic rollback on metric regression?	Google · model deployment
4	How does a recommendation platform retrain models on fresh data every hour, incorporating the latest user interactions while ensuring training-serving skew stays below 1% and new models don't catastrophically forget learned patterns?	TikTok · online learning
5	How does an LLM serving platform handle 10K+ concurrent inference requests with variable-length outputs, optimizing GPU utilization through continuous batching, KV-cache management, and speculative decoding for 3x throughput improvement?	OpenAI · LLM serving
6	How does an ad platform compute click-through-rate predictions for 10B+ ad candidates daily in <50ms per request, combining sparse features (user history) with dense embeddings and serving from a distributed model across 1000+ inference nodes?	Meta · ads prediction
7	How does a content platform implement real-time embedding generation for new content (images, videos, text), indexing embeddings for approximate nearest neighbor search within seconds of upload for immediate recommendation eligibility?	Pinterest · embedding pipeline
8	How does an LLM API rate-limit by tokens consumed (not request count) when requests vary from 50 to 100,000 tokens, using a token bucket with variable-weight deduction, dual limits (TPM + RPM), and reserve-then-refund for unknown output costs?	OpenAI · token-based rate limiting
Rate Limiting API Gateway Redis Backpressure

20. Security & Authentication CloudflareAuth0Okta

Session management, DDoS mitigation, rate limiting, zero-trust · 9 problems

#	Problem	Company / Scale
1	How does a platform manage sessions for 2B+ users across multiple devices, supporting instant revocation (password change invalidates all sessions), sliding expiry, and device-specific session limits without checking a central store on every request?	Google · session management
Authentication Encryption Redis Consistency Multi-Region
2	How does a CDN mitigate L7 DDoS attacks at 100M+ requests/sec, distinguishing legitimate traffic from bot traffic using behavioral analysis, JavaScript challenges, and adaptive rate limiting without blocking real users during an attack?	Cloudflare · DDoS mitigation
3	How does a platform implement distributed rate limiting across 300+ edge PoPs, enforcing per-user and per-IP limits with eventual consistency between nodes while using fail-open policies to avoid blocking legitimate traffic during sync delays?	Stripe · distributed rate limit
4	How does a zero-trust architecture authenticate and authorize every service-to-service call in a 5000+ microservice mesh, issuing short-lived certificates, enforcing least-privilege policies, and detecting lateral movement without adding more than 1ms latency per hop?	Google BeyondCorp · zero-trust
5	How does an OAuth provider handle 1M+ token issuance/sec during peak login, supporting PKCE flows, token rotation, and cross-device SSO while detecting token theft through binding tokens to device fingerprints?	Auth0 · OAuth at scale
6	How does a platform implement real-time account takeover detection, scoring login attempts using device fingerprint, geo-velocity, and behavioral biometrics, triggering step-up authentication (MFA) for suspicious sessions within milliseconds?	Netflix · ATO detection
7	How does a secrets management platform handle 100K+ services fetching credentials, supporting automatic rotation every 24 hours, lease-based access with revocation, and zero-downtime rotation without service restarts?	HashiCorp Vault · secrets
8	How does a platform rate-limit the /login endpoint to stop credential stuffing, using sliding window log for exact counting, dual keys (per-account + per-IP), progressive lockout with escalating penalties, and fail-closed policy where security takes priority over availability?	Auth0 · brute-force protection
Rate Limiting Authentication Redis API Gateway
9	How does a CDN/edge layer implement coarse per-IP rate limiting to absorb volumetric DDoS attacks before traffic reaches origin, using in-memory fixed window counters, approximate counting, and progressive escalation from 429 to TCP RST to L3 null-route?	Cloudflare · edge DDoS mitigation
Rate Limiting CDN Load Balancer Security

21. URL Shortening & Redirection BitlyTinyURLGoogle

ID generation, redirect scaling, analytics, hot key caching · 4 problems

#	Problem	Company / Scale
1	How does a URL shortener generate globally unique short codes at 1000+ URLs/sec, ensuring no collisions across distributed nodes while keeping codes short (7 chars) and supporting custom aliases?	Bitly · ID generation
NoSQL Caching Consistent Hashing Load Balancer CDN
2	How does a URL shortener handle 100K+ redirect requests/sec with P99 <10ms latency, caching hot URLs at the edge while handling expired links, geo-targeted redirects, and A/B test routing?	TinyURL · redirect at scale
3	How does a link analytics platform track clicks in real-time (referrer, geo, device, timestamp) for billions of redirects/day without adding latency to the redirect path, computing dashboards with <1min freshness?	Bitly · click analytics
4	How does a URL shortener handle hot keys (viral links getting 1M+ clicks/sec) without melting the cache layer, using consistent hashing, request coalescing, and tiered caching (L1 in-process → L2 Redis → L3 DB)?	t.co · hot key caching

22. Email Systems GmailSendGridMailchimp

SMTP delivery, spam filtering, inbox indexing, threading · 4 problems

#	Problem	Company / Scale
1	How does an email platform deliver 500M+ emails/day with high deliverability, managing IP reputation, DKIM/SPF/DMARC authentication, bounce handling, and throttling per recipient domain to avoid being blacklisted?	SendGrid · email delivery
Message Queues DLQ DNS Idempotency
2	How does an email provider classify 1B+ incoming emails/day as spam/ham in real-time using content analysis, sender reputation, behavioral signals, and ML models · with <0.1% false positive rate?	Gmail · spam filtering
3	How does an email platform index 1B+ mailboxes for instant full-text search, supporting complex queries (from:, has:attachment, date range) with <200ms latency while handling 100K+ new emails/sec ingestion?	Gmail · inbox search
4	How does an email client implement conversation threading that correctly groups replies, forwards, and CC chains using In-Reply-To/References headers, handling broken threads and cross-client compatibility?	Gmail · threading

23. Content Moderation & Trust MetaYouTubeTwitch

Spam detection, abuse prevention, ML moderation, reporting systems · 4 problems

#	Problem	Company / Scale
1	How does a social platform classify 500M+ posts/day for policy violations using multi-modal ML (text + image + video), routing borderline cases to human reviewers within minutes while maintaining <1% false positive rate?	Meta · automated moderation
Kafka Stream Processing Message Queues Caching Load Balancer
2	How does a platform detect and suppress coordinated inauthentic behavior (bot networks, brigading) in real-time, identifying clusters of accounts acting in concert using graph analysis and behavioral signals?	Twitter · bot detection
3	How does a platform implement a user reporting system that handles 10M+ reports/day, prioritizing by severity (child safety > violence > spam), deduplicating reports on the same content, and routing to specialized review queues?	YouTube · reporting pipeline
4	How does a platform implement real-time toxicity scoring for live chat/comments, blocking harmful messages before they're visible to other users while minimizing latency impact on the posting experience (<100ms added)?	Twitch · live chat moderation

24. Configuration & Feature Flags LaunchDarklyNetflixGitHub

Dynamic config, experimentation, A/B testing, rollout systems · 3 problems

#	Problem	Company / Scale
1	How does a feature flag platform evaluate 1M+ flag checks/sec with P99 <5ms, supporting complex targeting rules (user segments, percentages, geo), and propagating flag changes to 100K+ servers within 10 seconds?	LaunchDarkly · flag evaluation
Caching CDN WebSocket Pub/Sub Consistency
2	How does a platform run 1000+ concurrent A/B experiments, assigning users to variants deterministically (hash-based), computing statistical significance in real-time, and auto-stopping experiments that degrade key metrics?	Netflix · experimentation
3	How does a platform implement progressive rollouts (1% → 5% → 25% → 100%) with automatic rollback on error rate spike, supporting canary deployments and dark launches without code deploys?	Facebook · gradual rollout

25. Web Crawling & Indexing GoogleCloudflareAmazon

Distributed crawlers, politeness, deduplication, indexing pipelines · 3 problems

#	Problem	Company / Scale
1	How does a web crawler discover and fetch 10B+ pages across the internet, respecting robots.txt, managing crawl politeness (rate limits per domain), prioritizing fresh/important pages, and deduplicating content?	Google · web crawler
Message Queues Bloom Filters Sharding DNS
2	How does a search engine build and maintain an inverted index over 100B+ documents, supporting incremental updates (new/modified pages) without full re-indexing, and serving queries across a distributed index in <200ms?	Google · indexing pipeline
3	How does a price comparison platform crawl 10M+ product pages daily from 1000+ e-commerce sites, extracting structured data (price, availability, specs) using site-specific parsers, and detecting price changes within minutes?	PriceRunner · product crawling

26. ID Generation Systems TwitterStripeSnowflake

Snowflake IDs, distributed unique IDs, ordering guarantees, collision avoidance · 3 problems

#	Problem	Company / Scale
1	How does a distributed system generate 100K+ unique, time-sortable IDs per second across 1000+ nodes without coordination, ensuring no collisions and maintaining rough chronological ordering (Snowflake-style)?	Twitter Snowflake · time-sorted IDs
Partitioning Consistency Clock Sync Fault Tolerance Sharding
2	How does a database generate monotonically increasing IDs across a sharded cluster without a single point of failure, supporting 1M+ inserts/sec while maintaining gap-free sequences within each shard?	Vitess · sharded sequences
3	How does a system generate short, human-readable unique codes (invite codes, order IDs, tracking numbers) that are URL-safe, non-sequential (prevent enumeration), and globally unique across 1B+ entities?	Stripe · readable unique IDs

27. Ad Serving & Monetization Google AdsMetaAmazon

Ad ranking, real-time bidding, targeting, impression tracking · 4 problems

#	Problem	Company / Scale
1	How does an ad platform select the best ad from 10M+ candidates in <100ms per page load, scoring by predicted CTR, bid amount, relevance, and advertiser budget · serving 10B+ ad requests/day?	Google Ads · ad ranking
Caching Load Balancer Stream Processing Kafka
2	How does a real-time bidding (RTB) exchange conduct auctions across 100+ demand-side platforms within 100ms, handling 1M+ bid requests/sec with timeout-based fallback and fraud detection?	OpenRTB · real-time bidding
3	How does an ad platform track impressions, clicks, and conversions across billions of events/day without double-counting, attributing conversions across devices/sessions, and computing ROI dashboards in near-real-time?	Meta Ads · attribution tracking
4	How does an ad platform enforce advertiser budgets in real-time across distributed serving nodes, preventing overspend while maximizing delivery, handling budget changes mid-campaign, and pacing spend evenly throughout the day?	Google Ads · budget pacing

28. Developer Platform & CI/CD GitHub ActionsJenkinsDockerKubernetes

Build systems, deployment orchestration, artifact storage, release pipelines · 3 problems

#	Problem	Company / Scale
1	How does a CI platform execute 10M+ builds/day across a distributed fleet of workers, scheduling jobs by priority and resource requirements, caching build artifacts for 10· speedup, and providing real-time build logs?	GitHub Actions · build at scale
Message Queues Docker/K8s Kafka Caching Auto-Scaling
2	How does a deployment platform orchestrate zero-downtime rollouts across 100K+ servers, supporting canary deployments (1% → 10% → 100%), automatic rollback on health check failure, and blue-green switching?	Spinnaker · deployment orchestration
3	How does an artifact registry store and serve 1PB+ of build artifacts (Docker images, npm packages, Maven JARs) with global replication, content-addressable deduplication, and vulnerability scanning on upload?	Artifactory · artifact storage

29. AI & LLM Systems OpenAIAnthropicGooglePerplexity

LLM serving, RAG pipelines, agent orchestration, AI gateways, guardrails, embedding search · 10 problems

#	Problem	Company / Scale
1	How does an LLM serving platform handle 100K+ concurrent chat sessions with variable-length outputs, achieving <500ms time-to-first-token through continuous batching, KV-cache paging (PagedAttention), and speculative decoding · while keeping GPU utilization above 80%?	OpenAI · LLM serving at scale
LLM Serving Load Balancer Auto-Scaling Caching
2	How does a RAG-powered search engine ingest 10M+ documents, chunk them optimally, generate embeddings, and serve grounded answers with citations in <2s · while keeping hallucination rate below 5% through hybrid retrieval (dense + sparse) and cross-encoder reranking?	Perplexity · RAG at scale
RAG Embeddings Elasticsearch Kafka Caching
3	How does an AI gateway route 1M+ requests/day across multiple LLM providers (GPT-4, Claude, Llama), implementing semantic caching (30-60% cost reduction), automatic fallback on provider outages, and per-team token budgets · all with <50ms added latency?	Enterprise · AI Gateway
AI Gateway Caching Load Balancer Fault Tolerance
4	How does an AI coding assistant serve real-time code completions to 10M+ developers with <200ms latency, retrieving relevant context from the user's repository (100K+ files), ranking suggestions by relevance, and adapting to per-user coding patterns?	GitHub Copilot · code AI
RAG LLM Serving Embeddings Caching Sharding
5	How does a multi-agent system orchestrate 5+ specialized AI agents (researcher, coder, reviewer, planner) to complete complex tasks, managing shared memory, tool execution, inter-agent communication, and graceful failure handling · with total cost under $0.50 per task?	AutoGen · Agent orchestration
Agents LLM Serving Message Queues Redis
6	How does a vector search platform index 1B+ embeddings and serve similarity queries in <10ms at 50K QPS, supporting real-time index updates (new documents searchable in <5s), metadata filtering, and multi-tenancy with per-tenant isolation?	Pinecone · Vector search at scale
Embeddings Sharding Replication Caching Consistent Hashing
7	How does a content moderation system classify 500M+ user-generated posts/day using multi-modal AI (text + image + video), achieving <2s classification latency, routing edge cases to human reviewers, and handling adversarial attacks that try to bypass filters?	Meta · AI content moderation
Guardrails Kafka Stream Processing LLM Serving Message Queues
8	How does a conversational AI platform maintain context across multi-turn conversations for 50M+ daily active users, managing conversation memory (short-term buffer + long-term vector store), session persistence, and personalization · while keeping per-user storage costs under $0.001/day?	ChatGPT · Conversation memory
Agents Embeddings Redis NoSQL Sharding
9	How does a real-time AI translation system serve 1B+ translation requests/day across 100+ language pairs with <300ms latency, dynamically selecting between specialized models per language pair and falling back to general models for rare pairs?	Google Translate · AI at scale
LLM Serving AI Gateway Caching Multi-Region Load Balancer
10	How does an AI safety platform detect prompt injection attacks, jailbreak attempts, and PII leakage across 100M+ LLM requests/day in <50ms per request, using layered classifiers (fast regex → ML model → LLM judge) with <0.1% false positive rate on legitimate queries?	Lakera · AI guardrails at scale
Guardrails LLM Serving Stream Processing Redis

Real-Time System Design