Interview Walkthrough (45-min example: "Design Twitter")
Requirements (5 min): "Users post tweets, follow others, view home timeline. NFR: 500M users, 10K tweets/sec, p99 <200ms, 99.99% availability."
Estimation (5 min): "500M users × 2 tweets/day = 1B tweets/day ÷ 86,400 sec/day (round to ~100K) = ~10K writes/sec. Read-heavy: 100:1 read/write. Storage: 1B × 200 B = 200 GB/day."
High-Level (10 min): "Client → LB → Tweet Service → DB + Cache. Timeline Service reads from fan-out cache. Media → S3 + CDN."
Deep Dive (15 min): "Fan-out on write (push to follower timelines in Redis) vs fan-out on read (pull at read time). Hybrid: push for normal users, pull for celebrities (>1M followers). Sharding tweets by user_id. Timeline cache in Redis sorted sets."
Trade-offs (5 min): "Push = fast reads, expensive writes for celebrities. Pull = cheap writes, slow reads. Hybrid balances both."
Scaling (5 min): "Shard DB by user_id, Redis Cluster for timelines, CDN for media, Kafka for async fan-out."
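The estimation arithmetic and the hybrid fan-out from the walkthrough can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: plain in-memory dicts stand in for the Redis sorted sets and the tweet DB, and `estimate`, `HybridTimeline`, and `celebrity_threshold` are hypothetical names, not part of any real service.

```python
from collections import defaultdict

SECONDS_PER_DAY = 86_400  # the walkthrough rounds this to ~100K


def estimate(users, tweets_per_user_per_day, read_write_ratio, bytes_per_tweet):
    """Back-of-envelope capacity numbers from the walkthrough's inputs."""
    tweets_per_day = users * tweets_per_user_per_day
    writes_per_sec = tweets_per_day / SECONDS_PER_DAY
    return {
        "writes_per_sec": writes_per_sec,
        "reads_per_sec": writes_per_sec * read_write_ratio,
        "storage_gb_per_day": tweets_per_day * bytes_per_tweet / 1e9,
    }


class HybridTimeline:
    """Hypothetical in-memory stand-in for hybrid fan-out (not production code)."""

    def __init__(self, celebrity_threshold=1_000_000):
        self.celebrity_threshold = celebrity_threshold
        self.followers = defaultdict(set)   # author -> follower ids
        self.following = defaultdict(set)   # user -> authors they follow
        self.tweets = defaultdict(list)     # author -> [(ts, tweet_id)]: the tweet store
        self.timelines = defaultdict(list)  # user -> pushed (ts, tweet_id): a Redis sorted set in the real design

    def follow(self, user, author):
        self.followers[author].add(user)
        self.following[user].add(author)

    def is_celebrity(self, author):
        return len(self.followers[author]) > self.celebrity_threshold

    def post(self, author, tweet_id, ts):
        self.tweets[author].append((ts, tweet_id))
        if not self.is_celebrity(author):
            # Fan-out on write: push into every follower's cached timeline.
            for follower in self.followers[author]:
                self.timelines[follower].append((ts, tweet_id))

    def home_timeline(self, user, limit=20):
        entries = set(self.timelines[user])  # set() also dedupes if an author crossed the threshold
        for author in self.following[user]:
            if self.is_celebrity(author):
                # Fan-out on read: pull celebrity tweets at request time.
                entries.update(self.tweets[author])
        newest_first = sorted(entries, reverse=True)
        return [tweet_id for _, tweet_id in newest_first[:limit]]


numbers = estimate(500_000_000, 2, 100, 200)
print(round(numbers["writes_per_sec"]), round(numbers["storage_gb_per_day"]))

svc = HybridTimeline(celebrity_threshold=2)  # tiny threshold so the demo has a "celebrity"
for user in ("ann", "ben", "cat"):
    svc.follow(user, "star")
svc.follow("ann", "bob")
svc.post("bob", "t1", ts=1)   # bob is a normal user: pushed to ann's cached timeline
svc.post("star", "t2", ts=2)  # star has 3 followers > 2: nothing pushed, pulled at read time
print(svc.home_timeline("ann"))  # ['t2', 't1']
```

With the exact 86,400 s/day the write rate comes out near 11.6K/sec, which the walkthrough rounds to ~10K. In a Redis-backed version the pushed timelines would typically be sorted sets keyed by timestamp and trimmed to a fixed length.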
Functional vs Non-Functional Requirements
What the system does vs how well it does it
Functional (What) → Non-Functional (How Well)
User can view a homepage with posts, feed, and navigation → Latency: homepage loads in <1.5 seconds (P99)
System stores customer data (profile, preferences, history) → Security: data encrypted at rest (AES-256) and in transit (TLS 1.3)
Users can log into their accounts (auth, sessions) → Scalability: support 5,000 concurrent logged-in users per server
System is always available to customers 24/7 → Availability: 99.7% uptime (26 hours max downtime/year)
Customers can access on mobile phones and tablets → Compatibility: works on iOS 14+ and Android 10+ browsers
User can search products by name, category, filters → Throughput: handle 10K searches/sec with <100ms response
User can make payments (checkout, refunds) → Consistency: strong consistency for payments (no double-charge)
User can chat with AI assistant (ask questions, get answers) → Latency (TTFT): first token in <500ms, stream at 50+ tokens/sec
System answers from company docs (RAG knowledge base) → Accuracy: <5% hallucination rate, grounded in retrieved sources
User can generate images from text (AI image generation) → GPU Throughput: generate image in <10s, serve 1K concurrent users per GPU
Core Challenges: Too many users → horizontal scaling, LB, caching. Too much data → sharding, tiered storage. Low latency → caching, CDN, geo-distribution. High availability → replication, multi-region, graceful degradation.
Interview tip: Always pair each FR with its NFR constraint. "Users can post tweets" → "at 10K tweets/sec with P99 <200ms." This shows you think about both what the system does and how well it must do it.
NFR Metrics & SLOs
How non-functional requirements are measured — pick targets, then design to them
Why percentiles, not averages, for latency: averages hide tail pain. A 100 ms average can coexist with 1 % of requests taking 2 s. p50 = median (typical user), p95 / p99 = the bad days, p99.9 = the angry tweets. User-facing SLOs are usually written against p99; infra SLOs against p99.9.
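A small Python sketch of how a tail hides behind the mean, using the nearest-rank percentile definition. The latency sample is hypothetical (98 % of requests at 80 ms, 2 % at 2 s) and chosen only to make the gap visible; monitoring systems often use interpolated or histogram-based percentiles instead.

```python
import math
import statistics


def percentile(samples, p):
    """Nearest-rank percentile: the value at rank ceil(p/100 * n) in sorted order."""
    ordered = sorted(samples)
    # round() guards against float noise (e.g. 99.9 * n / 100 landing a hair above an integer).
    rank = max(1, math.ceil(round(p * len(ordered) / 100, 9)))
    return ordered[rank - 1]


# Hypothetical sample: 980 fast requests at 80 ms and 20 slow ones at 2,000 ms.
latencies_ms = [80] * 980 + [2000] * 20

print("mean", statistics.fmean(latencies_ms))
for p in (50, 95, 99, 99.9):
    print(f"p{p}", percentile(latencies_ms, p))
```

The mean (about 118 ms) looks healthy and p50/p95 stay at 80 ms, but p99 reports the 2 s tail that one user in fifty actually experiences.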
"Nines" cheatsheet, downtime per year: 99 % = 3.65 days · 99.9 % = 8.77 h · 99.99 % = 52.6 min · 99.999 % = 5.26 min · 99.9999 % = 31.6 s. Rule of thumb: each extra 9 ≈ 10× the cost & complexity (more replicas, multi-region, chaos testing). Durability uses the same scale but for the probability of losing data: S3 advertises eleven 9s (99.999999999 %) of annual durability, i.e. on average 1 object lost per 100 billion objects stored per year.
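The cheatsheet numbers follow from one multiplication: allowed downtime = seconds per year × (1 − availability). A minimal Python sketch, assuming a 365.25-day year; the function names are illustrative only.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600  # average year length including leap years


def downtime_per_year(nines):
    """Allowed downtime in seconds for `nines` nines of availability (4 -> 99.99 %)."""
    unavailability = 10 ** -nines
    return SECONDS_PER_YEAR * unavailability


def expected_objects_lost_per_year(objects, durability_nines=11):
    """Durability on the same scale: expected annual loss = objects × (1 - durability)."""
    return objects * 10 ** -durability_nines


for nines in range(2, 7):
    availability = 100 * (1 - 10 ** -nines)
    print(f"{availability:.{nines - 2}f} %: {downtime_per_year(nines) / 60:,.1f} min/year")

print(expected_objects_lost_per_year(100_000_000_000))  # ~1 object per 100 billion per year
```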
SLI · SLO · SLA: SLI = the measurement ("p99 latency over a 5-min window"). SLO = your internal target ("p99 < 200 ms, 99.9 % of the time"). SLA = the contractual promise to customers (with a refund/credit if missed). Always keep the SLA looser than the SLO, and the SLO looser than actual performance (e.g. SLA 99.5 %, SLO 99.9 %), so there is headroom; the gap between the SLO and 100 % is the error budget.
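The error budget is just the SLO turned into an allowance. A minimal time-based Python sketch, assuming a 30-day rolling window and a hypothetical 30 minutes of bad time already spent; a request-based budget works the same way with request counts instead of minutes.

```python
def error_budget(slo, window_minutes):
    """Minutes of 'bad' time the SLO tolerates within the window."""
    return (1 - slo) * window_minutes


def budget_remaining(slo, window_minutes, bad_minutes_so_far):
    """Fraction of the error budget still unspent (negative means the SLO is already missed)."""
    budget = error_budget(slo, window_minutes)
    return (budget - bad_minutes_so_far) / budget


THIRTY_DAYS_MIN = 30 * 24 * 60  # 43,200 minutes

print(error_budget(0.999, THIRTY_DAYS_MIN))          # ~43.2 minutes for a 99.9 % SLO
print(budget_remaining(0.999, THIRTY_DAYS_MIN, 30))  # ~0.31 of the budget left
```

A common way to use the remaining fraction is as a release gate: ship freely while budget remains, slow down risky changes as it approaches zero.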
Trade-offs (CAP/PACELC reminder): you cannot maximize all NFRs at once. More 9's of availability → weaker consistency or higher cost. Lower latency → larger cache footprint / more PoPs / weaker durability (e.g. async fsync). State the SLO numerically in interviews — "p99 < 200 ms, 99.99 % availability, 11×9 durability" — then derive the architecture from it.
Scaling Basics
Vertical vs Horizontal — the fundamental scaling decision
Vertical (Scale Up) | Horizontal (Scale Out)
Bigger machine (more CPU/RAM/IOPS) | More machines behind a load balancer
Simple: no code changes | Near-unlimited: add nodes as needed
Has a ceiling: the biggest machine has limits | Complex: distributed state, consistency
Single point of failure | Fault tolerant (node dies → others serve)
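Guarantee: Horizontal scaling gives near-linear throughput growth (doubling nodes roughly doubles capacity) only when each node handles an independent subset of traffic. Shared state such as a single primary DB, distributed locks, or cross-node coordination adds contention, so gains flatten as nodes are added.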
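Near-linear growth holds only when nothing is shared. Amdahl's law is one simple way to model what a shared, serialized part does to the speedup. This Python sketch treats that part (for example, work funneled through one primary DB) as a fixed `serial_fraction`; that value is an assumption, and real systems are messier than this model.

```python
def amdahl_speedup(serial_fraction, nodes):
    """Amdahl's law: speedup when only (1 - serial_fraction) of the work spreads across nodes."""
    if not 0 <= serial_fraction <= 1:
        raise ValueError("serial_fraction must be in [0, 1]")
    return 1 / (serial_fraction + (1 - serial_fraction) / nodes)


for nodes in (1, 2, 10, 100):
    linear = amdahl_speedup(0.0, nodes)    # fully independent traffic
    shared = amdahl_speedup(0.05, nodes)   # assume 5 % of work hits a shared bottleneck
    print(f"{nodes:>3} nodes: {linear:6.1f}x vs {shared:5.1f}x")
```

With 5 % serialized work the speedup can never exceed 1 / 0.05 = 20×, however many nodes are added, which is why removing shared bottlenecks matters as much as adding servers.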
Stateless vs Stateful
Stateless — no memory between requests, every request is a stranger · Stateful — server remembers past interactions
Aspect | Stateless | Stateful
Memory | No memory: every request self-contained | Remembers: keeps session, knows past actions
Scaling | Just add servers: any instance handles any request | Sticky sessions + coordination: same user → same server
Failure | Any server takes over: no state lost on crash | State lost on crash: needs replication or persistence
State lives in | Redis / DB / JWT: externalized, not on server | Server memory: tied to specific instance
Examples | REST APIs · HTTP services · Auth via JWT · App servers with Redis | WebSocket (chat) · Game servers · DB connections · Login sessions in memory
Load Balancer | Round Robin: any server, no special routing | IP Hash / Cookie: must route to same server
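The Load Balancer row can be sketched as two routing policies. This is a minimal Python illustration with hypothetical server names; it hashes the client IP with SHA-256 because Python's built-in `hash()` for strings is randomized per process and would not give a stable mapping.

```python
import hashlib
import itertools


class RoundRobinBalancer:
    """Stateless backends: any server can take the next request."""

    def __init__(self, servers):
        self._cycle = itertools.cycle(servers)

    def pick(self, client_ip):
        return next(self._cycle)  # client_ip deliberately ignored


class IpHashBalancer:
    """Stateful backends: the same client must keep landing on the same server."""

    def __init__(self, servers):
        self.servers = list(servers)

    def pick(self, client_ip):
        digest = hashlib.sha256(client_ip.encode("utf-8")).digest()
        return self.servers[int.from_bytes(digest[:8], "big") % len(self.servers)]


servers = ["app-1", "app-2", "app-3"]  # hypothetical backend names
rr = RoundRobinBalancer(servers)
sticky = IpHashBalancer(servers)
print([rr.pick("10.0.0.7") for _ in range(4)])      # cycles app-1, app-2, app-3, app-1
print({sticky.pick("10.0.0.7") for _ in range(5)})  # one server, every time
```

Note that plain modulo hashing remaps most clients when a server is added or removed, so sticky users lose their in-memory state anyway; this is part of the "scaling pain" of stateful services.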
Stateful → Stateless conversion: Move session to Redis (shared store) → all servers read the same session → service becomes stateless. Move auth to JWT (token carries user context) → no server-side session needed. This is how companies scale horizontally without sticky sessions.
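A minimal Python sketch of externalizing the session. A plain dict inside the hypothetical `SharedSessionStore` class stands in for Redis; a real deployment would use a Redis client and set a TTL on each session key. `AppServer` keeps nothing per user, so a login handled by one instance is visible to any other.

```python
import secrets


class SharedSessionStore:
    """Stand-in for Redis: one store that every app server can reach."""

    def __init__(self):
        self._data = {}

    def set(self, session_id, value):
        self._data[session_id] = value

    def get(self, session_id):
        return self._data.get(session_id)


class AppServer:
    """Holds no per-user state itself, so any instance can serve any request."""

    def __init__(self, name, store):
        self.name = name
        self.store = store

    def login(self, user_id):
        session_id = secrets.token_urlsafe(16)  # opaque id returned to the client (e.g. as a cookie)
        self.store.set(session_id, {"user_id": user_id})
        return session_id

    def whoami(self, session_id):
        session = self.store.get(session_id)
        return None if session is None else session["user_id"]


store = SharedSessionStore()
server_a, server_b = AppServer("a", store), AppServer("b", store)
sid = server_a.login("alice")  # the login request lands on server A
print(server_b.whoami(sid))    # a later request lands on B and still prints alice
```

Because the session lives outside the servers, the load balancer can use plain round robin and any instance can be killed without logging users out.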
Real-world: Netflix (stateless API servers + session in Redis). WhatsApp (stateful WebSocket connection pinned to a server + Redis for presence). Kubernetes Pods (ephemeral by design, killed/restarted anytime, so the apps inside them are built stateless). Game servers (stateful: the entire match state is in memory → hard to migrate mid-game).
One-liner: Stateless = no memory, scales out easily (REST API) · Stateful = memory, scaling pain (game server, sessions) → best practice: externalize state, make services stateless.
Serialization & Deserialization
Converting in-memory objects to bytes, as text (JSON) or binary (serialize), and back (deserialize) for storage or transmission
Format | Type | Size | Speed | Use Case
JSON | Text | Large | Slow | REST APIs, config files, human-readable
Protocol Buffers | Binary | Small | Fast | gRPC, internal services (Google)
Avro | Binary | Small | Fast | Kafka events, Hadoop (schema in header)
MessagePack | Binary | Small | Fast | Redis, embedded systems
XML | Text | Very large | Slow | SOAP, legacy enterprise
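To make the Size column concrete, here is a small standard-library comparison. `struct` is not Protobuf, Avro, or MessagePack; it stands in for a schema-based binary format because, like them, it keeps field names out of the payload (the layout lives in code, here the assumed `<QQI`). The record and its fields are hypothetical.

```python
import json
import struct

record = {"user_id": 42, "timestamp": 1_700_000_000, "likes": 3}

# JSON: self-describing text, so every field name travels with every message.
as_json = json.dumps(record, separators=(",", ":")).encode("utf-8")

# Fixed binary layout (assumed schema): two unsigned 64-bit ints + one unsigned 32-bit int, little-endian.
LAYOUT = struct.Struct("<QQI")
as_binary = LAYOUT.pack(record["user_id"], record["timestamp"], record["likes"])

print(len(as_json), len(as_binary))

# Deserialize both back into the same dict.
from_json = json.loads(as_json)
user_id, timestamp, likes = LAYOUT.unpack(as_binary)
from_binary = {"user_id": user_id, "timestamp": timestamp, "likes": likes}
print(from_json == from_binary == record)  # True
```

The binary payload is smaller mainly because it omits field names and digit-by-digit numbers; the price is that the bytes are unreadable without knowing the layout.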
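Guarantee: Schema-based binary formats (Protobuf, Avro) support schema evolution with backward/forward compatibility (old consumers can read new messages and vice versa) as long as you follow their rules. Protobuf identifies fields by number, so old readers skip unknown field numbers; never reuse or renumber a field. Avro resolves the writer's and reader's schemas by field name, so newly added fields need defaults. This is critical for zero-downtime deployments.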
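A hand-written sketch of the Protobuf wire-format idea, in Python without the `protobuf` library, to show why field numbers give forward compatibility. It only handles non-negative varints (wire type 0) and UTF-8 strings (wire type 2), with no nested messages, packed fields, or input validation. The field names and numbers are hypothetical.

```python
def encode_varint(value):
    """Base-128 varint: 7 bits per byte, high bit set on every byte except the last."""
    if value < 0:
        raise ValueError("this sketch only encodes non-negative ints")
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)


def decode_varint(buf, pos):
    result, shift = 0, 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def encode_field(field_number, value):
    if isinstance(value, int):
        return encode_varint((field_number << 3) | 0) + encode_varint(value)  # wire type 0
    data = value.encode("utf-8")
    return encode_varint((field_number << 3) | 2) + encode_varint(len(data)) + data  # wire type 2


def decode(buf, known_fields):
    """Decode into {name: value}; field numbers missing from known_fields are skipped."""
    result, pos = {}, 0
    while pos < len(buf):
        key, pos = decode_varint(buf, pos)
        field_number, wire_type = key >> 3, key & 0x7
        if wire_type == 0:
            value, pos = decode_varint(buf, pos)
        elif wire_type == 2:
            length, pos = decode_varint(buf, pos)
            value = buf[pos:pos + length]
            pos += length
        else:
            raise ValueError(f"wire type {wire_type} not handled in this sketch")
        if field_number in known_fields:
            name = known_fields[field_number]
            result[name] = value.decode("utf-8") if wire_type == 2 else value
    return result


V1_FIELDS = {1: "user_id", 2: "text"}             # old consumer's schema
V2_FIELDS = {1: "user_id", 2: "text", 3: "lang"}  # new producer adds field 3

msg_v2 = encode_field(1, 42) + encode_field(2, "hello") + encode_field(3, "en")
msg_v1 = encode_field(1, 42) + encode_field(2, "hello")
print(decode(msg_v2, V1_FIELDS))  # {'user_id': 42, 'text': 'hello'}: field 3 skipped
print(decode(msg_v1, V2_FIELDS))  # {'user_id': 42, 'text': 'hello'}: 'lang' absent, reader applies a default
```

Renumbering `text` from 2 to 3 in a new schema would make old readers misinterpret data, which is why field numbers must never be reused.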
Schema Registry (Confluent) — central store for Avro/Protobuf schemas. Enforces compatibility rules. Used with Kafka to prevent breaking changes.
Why needed: Network (send objects over HTTP/TCP/Kafka; the network only understands bytes). Persistence (store to DB/Redis/disk). Cross-language (Java → Python via Protobuf, no language barrier). Caching (serialize to Redis, deserialize on hit). Distributed Computing / RPC (gRPC serializes to Protobuf binary, faster & smaller than JSON for internal service calls).