1. Foundations — System Design Concepts

⚡ 1 · FOUNDATIONS

System Design Framework

The structured approach to tackle any design interview (45-60 min)

1. REQUIREMENTS (5 min)  — Clarify scope. Ask questions. Define FR + NFR.
2. ESTIMATION (5 min)    — Users, QPS, storage, bandwidth (back-of-envelope).
3. HIGH-LEVEL DESIGN (10 min) — Draw: clients → LB → services → DB → cache.
4. DETAILED DESIGN (20 min)   — Deep dive 2-3 critical components.
5. TRADE-OFFS (5 min)    — Alternatives, bottlenecks, failure modes.
6. SCALING (5 min)       — How to handle 10x, 100x growth.
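
Step 2 is mostly arithmetic, so it helps to have the formulas at hand. The sketch below is a minimal back-of-envelope calculator; the function name and the input figures (500M users, 2 writes per user per day, 100:1 read/write ratio, 200 bytes per write) are illustrative assumptions for a Twitter-like service, not measurements.

```python
SECONDS_PER_DAY = 86_400

def estimate(users: int, writes_per_user_per_day: float,
             read_write_ratio: float, bytes_per_write: int) -> dict:
    """Back-of-envelope load and storage numbers from a handful of inputs."""
    writes_per_day = users * writes_per_user_per_day
    write_qps = writes_per_day / SECONDS_PER_DAY
    return {
        "writes_per_day": writes_per_day,
        "write_qps": write_qps,
        "read_qps": write_qps * read_write_ratio,
        "storage_gb_per_day": writes_per_day * bytes_per_write / 1e9,
    }

# Assumed inputs for a Twitter-like service.
est = estimate(users=500_000_000, writes_per_user_per_day=2,
               read_write_ratio=100, bytes_per_write=200)
for name, value in est.items():
    print(f"{name}: {value:,.0f}")
```

Dividing by the exact 86,400 seconds gives about 11.6K writes/sec; rounding a day to 100K seconds, as is common in interviews, gives the friendlier ~10K. Both are averages, so peak traffic needs a separate multiplier.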

Interview Walkthrough (50-min example, "Design Twitter"):
Requirements (5 min): "Users post tweets, follow others, view home timeline. NFR: 500M users, 10K tweets/sec, p99 <200ms, 99.99% availability."
Estimation (5 min): "500M users × 2 tweets/day = 1B tweets/day ÷ ~100K sec/day (86,400 rounded up) = ~10K writes/sec. Read-heavy: 100:1 read/write, so ~1M reads/sec. Storage: 1B × 200B = 200GB/day."
High-Level (10 min): "Client → LB → Tweet Service → DB + Cache. Timeline Service reads from fan-out cache. Media → S3 + CDN."
Deep Dive (20 min): "Fan-out on write (push to follower timelines in Redis) vs fan-out on read (pull at read time). Hybrid: push for normal users, pull for celebrities (>1M followers). Sharding tweets by user_id. Timeline cache in Redis sorted sets."
Trade-offs (5 min): "Push = fast reads, expensive writes for celebrities. Pull = cheap writes, slow reads. Hybrid balances both."
Scaling (5 min): "Shard DB by user_id, Redis Cluster for timelines, CDN for media, Kafka for async fan-out."
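
The hybrid fan-out from the deep dive can be sketched in a few lines. This is a minimal in-memory illustration, not a production design: the class `HybridTimelines` and its methods are hypothetical names, plain dictionaries stand in for the database and the Redis timeline cache, and the celebrity threshold is lowered to 2 so the demo stays small.

```python
from collections import defaultdict

class HybridTimelines:
    """In-memory stand-in for the tweet store and the per-user timeline cache."""

    def __init__(self, celebrity_threshold: int = 1_000_000):
        self.celebrity_threshold = celebrity_threshold
        self.followers = defaultdict(set)  # author -> users following them
        self.following = defaultdict(set)  # user -> authors they follow
        self.tweets = defaultdict(list)    # author -> [(ts, author, text)]
        self.pushed = defaultdict(list)    # user -> entries written at post time

    def follow(self, user: str, author: str) -> None:
        self.followers[author].add(user)
        self.following[user].add(author)

    def is_celebrity(self, author: str) -> bool:
        return len(self.followers[author]) > self.celebrity_threshold

    def post(self, author: str, ts: int, text: str) -> None:
        entry = (ts, author, text)
        self.tweets[author].append(entry)
        if not self.is_celebrity(author):
            # Fan-out on write: pay at post time, one write per follower.
            for follower in self.followers[author]:
                self.pushed[follower].append(entry)

    def home(self, user: str, limit: int = 10) -> list:
        entries = set(self.pushed[user])  # a set avoids duplicates if an author's status changed
        for author in self.following[user]:
            if self.is_celebrity(author):
                # Fan-out on read: pull celebrity tweets at request time.
                entries.update(self.tweets[author])
        return sorted(entries, reverse=True)[:limit]  # newest first


timelines = HybridTimelines(celebrity_threshold=2)  # tiny threshold for the demo
timelines.follow("alice", "bob")
for fan in ("alice", "carol", "dave"):
    timelines.follow(fan, "star")  # 3 followers > 2, so "star" counts as a celebrity
timelines.post("bob", ts=1, text="hi")
timelines.post("star", ts=2, text="hello")
print(timelines.home("alice"))
```

Bob's tweet is copied into Alice's timeline when he posts, while the celebrity's tweet is merged in only when Alice reads. The sketch ignores backfill for new followers and pagination, which a real timeline service would need.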

Functional vs Non-Functional Requirements

What the system does vs how well it does it

Functional (What)	Non-Functional (How Well)
User can view a homepage with posts, feed, and navigation	Latency — homepage loads in <1.5 seconds (P99)
System stores customer data (profile, preferences, history)	Security — data encrypted at rest (AES-256) and in transit (TLS 1.3)
Users can log into their accounts (auth, sessions)	Scalability — support 5,000 concurrent logged-in users per server
System is always available to customers 24/7	Availability — 99.7% uptime (26 hours max downtime/year)
Customers can access on mobile phones and tablets	Compatibility — works on iOS 14+ and Android 10+ browsers
User can search products by name, category, filters	Throughput — handle 10K searches/sec with <100ms response
User can make payments (checkout, refunds)	Consistency — strong consistency for payments (no double-charge)
User can chat with AI assistant (ask questions, get answers)	Latency (TTFT) — first token in <500ms, stream at 50+ tokens/sec
System answers from company docs (RAG knowledge base)	Accuracy — <5% hallucination rate, grounded in retrieved sources
User can generate images from text (AI image generation)	GPU Throughput — generate image in <10s, serve 1K concurrent users per GPU

Core Challenges: Too many users → horizontal scaling, LB, caching. Too much data → sharding, tiered storage. Low latency → caching, CDN, geo-distribution. High availability → replication, multi-region, graceful degradation.

Interview tip: Always pair each FR with its NFR constraint. "Users can post tweets" → "at 10K tweets/sec with P99 <200ms." This shows you think about both what the system does and how well it must do it.

NFR Metrics & SLOs

How non-functional requirements are measured — pick targets, then design to them

NFR	What it means	Metric	Typical Target	Levers
Latency	Time per request (user-perceived speed)	p50 / p95 / p99 / p99.9 ms	p99 < 200 ms (web) · < 50 ms (internal RPC)	Cache, CDN, async, geo-PoP, fewer hops
Throughput	Work served per unit time	RPS / QPS / TPS / msgs·s⁻¹	10K–1M RPS per service	Horizontal scale, batching, sharding
Availability	% of time system is up & serving	"Nines" uptime	99.9 % (8.76 h/yr) · 99.99 % (52.6 min/yr) · 99.999 % (5.26 min/yr)	Redundancy, multi-AZ/region, failover, health checks
Durability	% chance data survives (no loss, ever)	Nines of durability	11×9 (S3) · 99.999999999 %	Replication (3×), erasure coding, cross-region backup, WAL
Reliability	Correctness over time (MTBF / MTTR)	Error rate, MTBF, MTTR	Error budget < 0.1 % · MTTR < 5 min	Retries, circuit breakers, idempotency, runbooks
Scalability	Ability to grow with load (linear, ideally)	Cost / RPS, scale factor	Linear up to 10×–100×	Stateless services, sharding, autoscale
Bandwidth	Data moved over network per second	MB/s · Gbps ingress/egress	Stay within VPC/CDN egress budget	Compression, CDN, delta sync, batching
Storage	How much data is kept & for how long	GB / TB / PB, retention	Right-size; tier hot→warm→cold (S3 IA/Glacier)	TTL, compression, tiering, dedup
Consistency	How fresh / agreed data is across replicas	Strong / RYW / Eventual, replica lag	Strong for $ · eventual for likes	Quorum (R+W>N), Raft/Paxos, CRDTs
Security / Privacy	AuthN/AuthZ, encryption, audit	CVE count, % encrypted, audit pass	0 critical CVEs · TLS everywhere · PII encrypted at rest	OAuth2, mTLS, KMS, WAF, RBAC
Cost	$ per request / GB / user	$ / 1M req, $/GB-month	Within unit-economics envelope	Spot, reserved, autoscale-down, caching
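
The quorum rule R+W>N in the Consistency row can be checked by brute force. The sketch below enumerates every possible read set and write set for small clusters and confirms that they always share at least one replica exactly when R+W>N. The function name `quorums_overlap` is a hypothetical helper for illustration only.

```python
from itertools import combinations

def quorums_overlap(n: int, r: int, w: int) -> bool:
    """True if every read set of size r shares a replica with every write set of size w."""
    replicas = range(n)
    return all(set(read_set) & set(write_set)
               for read_set in combinations(replicas, r)
               for write_set in combinations(replicas, w))

for n, r, w in [(3, 2, 2), (3, 1, 1), (5, 3, 3), (5, 2, 3)]:
    print(f"N={n} R={r} W={w}: R+W>N is {r + w > n}, "
          f"guaranteed overlap is {quorums_overlap(n, r, w)}")
```

When the sets overlap, every read touches at least one replica that saw the latest acknowledged write, which is what makes quorum reads strongly consistent. With R+W≤N a read can miss the newest write entirely.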

Why percentiles, not averages, for latency: averages hide tail pain. A 100 ms average is compatible with 2 % of requests taking 2 s, as long as the other 98 % finish in about 60 ms. p50 = median (typical user), p95 / p99 = the bad days, p99.9 = the angry tweets. SLOs are usually written on p99 for user-facing services and p99.9 for infra.
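
A small synthetic sample makes the gap between the mean and the tail concrete. The sketch below uses a nearest-rank percentile and made-up latencies (98 % of requests at 60 ms, 2 % at 2 s); the `percentile` helper and the data are assumptions for illustration.

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]

# Synthetic data: 980 fast requests and 20 very slow ones.
latencies_ms = [60] * 980 + [2000] * 20

mean_ms = sum(latencies_ms) / len(latencies_ms)
print(f"mean = {mean_ms:.1f} ms")
for p in (50, 95, 99):
    print(f"p{p} = {percentile(latencies_ms, p)} ms")
```

The mean lands just under 100 ms and looks healthy, and p50 and p95 both report 60 ms. Only p99 exposes the 2 s requests.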

"Nines" cheatsheet (downtime per 365-day year): 99 % = 3.65 days · 99.9 % = 8.76 h · 99.99 % = 52.6 min · 99.999 % = 5.26 min · 99.9999 % = 31.5 s. As a rule of thumb, each extra 9 costs roughly 10× more in money and complexity (more replicas, multi-region, chaos testing). Durability uses the same scale but for the probability of data loss: S3 is designed for 11 nines, i.e., an expected loss of about 1 object per 100 billion stored per year.

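SLI · SLO · SLA: SLI = the measurement ("p99 latency over 5-min window"). SLO = your internal target ("p99 < 200 ms, 99.9 % of the time"). SLA = the contractual promise to customers (with refund/credit if missed). Always keep the SLA looser than the SLO, and the SLO looser than actual performance (e.g. SLA 99.9 % < SLO 99.95 % < measured 99.97 %). The gap between the SLO and 100 % is the error budget, and that headroom is what you spend on incidents and risky releases.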
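
Both the "nines" and the error budget come down to one multiplication. The sketch below converts an availability target into allowed downtime and tracks how much of a request-based error budget has been spent. The function names, the 365-day year, and the traffic numbers are assumptions for illustration.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600  # assumes a 365-day year

def allowed_downtime_s(availability_pct: float,
                       period_s: float = SECONDS_PER_YEAR) -> float:
    """Seconds of downtime permitted in the period at the given availability."""
    return (1 - availability_pct / 100) * period_s

def error_budget(slo_pct: float, total_requests: int, failed_requests: int) -> dict:
    """How much of a request-based error budget has been spent (slo_pct must be < 100)."""
    allowed = (1 - slo_pct / 100) * total_requests
    return {
        "allowed_failures": allowed,
        "remaining": allowed - failed_requests,
        "burned_fraction": failed_requests / allowed,
    }

for pct in (99, 99.9, 99.99, 99.999):
    print(f"{pct}% -> {allowed_downtime_s(pct) / 3600:.2f} h of downtime per year")

# Hypothetical month: 10M requests against a 99.9 % SLO, 4,000 of them failed.
budget = error_budget(slo_pct=99.9, total_requests=10_000_000, failed_requests=4_000)
print(budget)
```

In this hypothetical month the service may fail about 10,000 requests and has used roughly 40 % of that allowance. A team would typically slow down risky releases as the burned fraction approaches 1.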

Trade-offs (CAP/PACELC reminder): you cannot maximize all NFRs at once. More 9's of availability → weaker consistency or higher cost. Lower latency → larger cache footprint / more PoPs / weaker durability (e.g. async fsync). State the SLO numerically in interviews — "p99 < 200 ms, 99.99 % availability, 11×9 durability" — then derive the architecture from it.

Scaling Basics

Vertical vs Horizontal — the fundamental scaling decision

Vertical (Scale Up)	Horizontal (Scale Out)
Bigger machine (more CPU/RAM/IOPS)	More machines behind a load balancer
Simple — no code changes	Unlimited — add nodes as needed
Has ceiling — biggest machine has limits	Complex — distributed state, consistency
Single point of failure	Fault tolerant (node dies → others serve)

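Rule of thumb: Horizontal scaling gives near-linear throughput growth when each node handles an independent subset of traffic (stateless services, well-partitioned data), so doubling nodes roughly doubles capacity. It is not a guarantee: shared state, coordination overhead, and hot partitions make scaling sub-linear in practice.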

Stateless vs Stateful

Stateless — no memory between requests, every request is a stranger · Stateful — server remembers past interactions

Aspect	Stateless	Stateful
Memory	No memory — every request self-contained	Remembers — keeps session, knows past actions
Scaling	Just add servers — any instance handles any request	Sticky sessions + coordination — same user → same server
Failure	Any server takes over — no state lost on crash	State lost on crash — needs replication or persistence
State lives in	Redis / DB / JWT — externalized, not on server	Server memory — tied to specific instance
Examples	REST APIs · HTTP services · Auth via JWT · App servers with Redis	WebSocket (chat) · Game servers · DB connections · Login sessions in memory
Load Balancer	Round Robin — any server, no special routing	IP Hash / Cookie — must route to same server

Stateful → Stateless conversion: Move session to Redis (shared store) → all servers read the same session → service becomes stateless. Move auth to JWT (token carries user context) → no server-side session needed. This is how companies scale horizontally without sticky sessions.
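
The token approach can be illustrated with the standard library alone. The sketch below is a simplified HMAC-signed token in the spirit of a JWT, not an implementation of the JWT standard: `SECRET`, `issue_token`, and `verify_token` are hypothetical names, and a real service would use a maintained JWT library and a managed secret.

```python
import base64
import hashlib
import hmac
import json
import time
from typing import Optional

SECRET = b"demo-secret"  # hypothetical; a real service loads this from a secret store

def _b64(data: bytes) -> str:
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()

def _sign(payload: str) -> str:
    return _b64(hmac.new(SECRET, payload.encode(), hashlib.sha256).digest())

def issue_token(user_id: str, ttl_s: int, now: Optional[float] = None) -> str:
    """Pack the user context and expiry into the token itself."""
    now = time.time() if now is None else now
    claims = {"sub": user_id, "exp": now + ttl_s}
    payload = _b64(json.dumps(claims).encode())
    return f"{payload}.{_sign(payload)}"

def verify_token(token: str, now: Optional[float] = None) -> Optional[str]:
    """Return the user id if the signature is valid and the token has not expired."""
    now = time.time() if now is None else now
    try:
        payload, signature = token.split(".")
        if not hmac.compare_digest(signature.encode(), _sign(payload).encode()):
            return None
        padded = payload + "=" * (-len(payload) % 4)
        claims = json.loads(base64.urlsafe_b64decode(padded))
    except ValueError:
        return None
    return claims["sub"] if claims["exp"] > now else None

token = issue_token("user-42", ttl_s=60, now=1_000.0)
print(verify_token(token, now=1_030.0))        # valid token
print(verify_token(token, now=2_000.0))        # expired token
print(verify_token("x" + token, now=1_030.0))  # tampered payload
```

Any server that knows the secret can verify the token without a session lookup, which is what removes the need for sticky routing. The trade-off is that a signed token cannot be revoked before it expires unless a server-side deny list is added.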

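Real-world patterns: Large consumer APIs: stateless API servers with sessions in a shared store such as Redis. Chat gateways: stateful WebSocket connections pinned to one server, with presence kept in a shared store. Kubernetes Pods: ephemeral by design and can be killed or restarted at any time, so durable state belongs in external stores or in StatefulSets with persistent volumes. Game servers: stateful, entire match state in memory → hard to migrate mid-game.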

One-liner: Stateless = no memory, easy horizontal scale (REST API) · Stateful = memory, scaling pain (game server, sessions) → best practice: externalize state, make services stateless.

Serialization & Deserialization

Converting objects in memory to bytes/JSON/Binary (serialize) and back (deserialize) for storage/transmission

Format	Type	Size	Speed	Use Case
JSON	Text	Large	Slow	REST APIs, config files, human-readable
Protocol Buffers	Binary	Small	Fast	gRPC, internal services (Google)
Avro	Binary	Small	Fast	Kafka events, Hadoop (schema in header)
MessagePack	Binary	Small	Fast	Redis, embedded systems
XML	Text	Very large	Slow	SOAP, legacy enterprise

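Compatibility note: Schema-based binary formats (Protobuf, Avro) support schema evolution with backward/forward compatibility, so old consumers can read new messages and vice versa, provided the evolution rules are followed. Protobuf identifies fields by number, not name, so unknown fields are skipped and field numbers must never be reused. Avro instead resolves the writer's schema against the reader's by field name and relies on default values for added or removed fields. This is critical for zero-downtime deployments.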
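
The idea of evolving a schema by field number can be shown with a toy decoder. This is not the Protobuf wire format: messages are plain lists of (field_number, value) pairs, and `decode`, `SCHEMA_V1`, and `SCHEMA_V2` are hypothetical names used only to illustrate why skipping unknown numbers and defaulting missing ones keeps old and new code compatible.

```python
def decode(wire, schema):
    """wire: list of (field_number, value); schema: {field_number: (name, default)}."""
    record = {name: default for name, default in schema.values()}
    for number, value in wire:
        if number in schema:  # unknown field numbers are skipped, not treated as errors
            record[schema[number][0]] = value
    return record

SCHEMA_V1 = {1: ("user_id", 0), 2: ("text", "")}
SCHEMA_V2 = {1: ("user_id", 0), 2: ("text", ""), 3: ("lang", "en")}  # adds field 3

message_from_v1 = [(1, 7), (2, "hello")]
message_from_v2 = [(1, 7), (2, "hello"), (3, "fr")]

# Forward compatibility: an old reader ignores the field it does not know.
print(decode(message_from_v2, SCHEMA_V1))
# Backward compatibility: a new reader fills the missing field with its default.
print(decode(message_from_v1, SCHEMA_V2))
```

Both directions work only because field 3 is new and fields 1 and 2 kept their meaning. Reusing a number for a different field would silently corrupt data, which is why compatibility rules are usually enforced at deploy time.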

Schema Registry (Confluent) — central store for Avro/Protobuf schemas. Enforces compatibility rules. Used with Kafka to prevent breaking changes.

Why needed: Network — send objects over HTTP/TCP/Kafka (network only understands bytes). Persistence — store to DB/Redis/disk. Cross-language — Java → Python via Protobuf (no language barrier). Caching — serialize to Redis, deserialize on hit. Distributed Computing / RPC — gRPC serializes to Protobuf binary — faster & smaller than JSON for internal service calls.
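
The size difference between text and binary encodings is easy to see with the standard library. The sketch below serializes the same record as JSON and as a fixed binary layout using `struct`; the record, the field order, and the layout are assumptions chosen for illustration, and real systems would normally use Protobuf, Avro, or MessagePack rather than a hand-rolled layout.

```python
import json
import struct

record = {"user_id": 123456789, "score": 98.6, "active": True}

# Text: self-describing, field names travel with every message.
as_json = json.dumps(record).encode("utf-8")

# Binary: layout agreed out of band (uint64, float64, bool; little-endian, no padding).
LAYOUT = struct.Struct("<Qd?")
as_binary = LAYOUT.pack(record["user_id"], record["score"], record["active"])

print(f"JSON:   {len(as_json)} bytes")
print(f"binary: {len(as_binary)} bytes")

# Deserialize both back into Python values.
user_id, score, active = LAYOUT.unpack(as_binary)
assert json.loads(as_json) == record
assert (user_id, score, active) == (123456789, 98.6, True)
```

The binary form is smaller because it omits field names and stores numbers in fixed-width form, but it is unreadable without the layout. That is the same trade-off the table describes between JSON and the schema-based formats.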

Concurrency & I/O Models

The mental models that explain why Redis (single-threaded command execution), Nginx, and Node.js can handle very large numbers of concurrent connections

Concept	Definition	System Design Impact	Real-World
Process	Isolated memory space, own fds	Crash-safe isolation, heavier to create	Nginx workers, Chrome tabs, Gunicorn
Thread	Shared memory within a process	Fast communication but needs locks	Java thread pools, POSIX threads (Go goroutines are lighter user-space threads multiplexed onto OS threads)
Concurrency	Interleaving tasks on 1 core	Single thread handles 100K+ connections	Redis, Node.js event loop
Parallelism	Simultaneous execution on N cores	Linear throughput scaling for CPU work	Kafka partitions, MapReduce, ffmpeg
CPU-bound	Bottleneck = computation	Fix: more cores / worker threads	Video encoding, ML inference, zlib
I/O-bound	Bottleneck = waiting (net/disk)	Fix: non-blocking I/O + event loop	Web APIs, DB queries, file uploads
Blocking I/O	Thread sleeps waiting for I/O	10K conn = 10K threads (expensive)	Apache, old Java Servlet
Non-blocking	Thread asks "ready?" + moves on	10K conn = 1 thread (efficient)	Redis, Nginx, Node.js, Netty

The key insight: Most web services are I/O-bound — they spend 90%+ of time waiting on network/DB responses. A single thread with non-blocking I/O can handle 100K connections because it's only doing work when data is actually ready. No idle threads wasting memory.
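
This effect is easy to reproduce with `asyncio`. In the sketch below, 1,000 simulated requests each wait 0.1 s on "I/O" (an `asyncio.sleep` standing in for a database or network call), all on a single thread. The function names and numbers are assumptions for illustration.

```python
import asyncio
import time

async def handle_request(request_id: int, io_delay_s: float) -> int:
    # Stand-in for a DB or network call; the thread is free while this waits.
    await asyncio.sleep(io_delay_s)
    return request_id

async def serve(n_requests: int, io_delay_s: float) -> float:
    """Run all requests concurrently on one event loop and return elapsed seconds."""
    start = time.perf_counter()
    results = await asyncio.gather(
        *(handle_request(i, io_delay_s) for i in range(n_requests))
    )
    assert len(results) == n_requests
    return time.perf_counter() - start

elapsed = asyncio.run(serve(n_requests=1000, io_delay_s=0.1))
print(f"1000 requests with 0.1 s of simulated I/O each took {elapsed:.2f} s on one thread")
```

Handled one after another with blocking calls, the same work would take about 100 s. Because the waits overlap, the total stays close to a single wait. The benefit disappears for CPU-bound handlers, which would block the event loop.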

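Who uses what: Redis: 1 command-execution thread + epoll (100K-1M ops/sec; optional I/O threads since Redis 6). Nginx: multi-process, each worker runs an event loop (hundreds of thousands of concurrent connections per server; 10M+ only with heavy kernel and network tuning). Node.js: 1 JavaScript thread + libuv event loop; the libuv thread pool covers file I/O, DNS, and some crypto, while CPU-heavy work goes to worker threads. Go: goroutines (M:N scheduling, millions of lightweight threads). Java Netty: few threads + NIO (underlies gRPC-Java and many JVM network servers). Apache (prefork/worker MPM): process- or thread-per-connection (older model, practical ceiling around 10K connections).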

Interview framing: "Redis executes commands on a single thread because its workload is rarely CPU-bound: the bottleneck is usually network or memory, not computation. Using epoll, one thread multiplexes 100K sockets concurrently, with no locks on data structures and very little context switching. Combined with keeping all data in memory, this is why it achieves sub-millisecond latency."
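
The readiness-based multiplexing behind that answer can be demonstrated with Python's `selectors` module, which uses epoll on Linux. The sketch below is a minimal illustration, not how Redis is implemented: 100 local socket pairs stand in for client connections, and only three of them send data.

```python
import selectors
import socket

selector = selectors.DefaultSelector()  # epoll on Linux, kqueue on BSD/macOS
pairs = [socket.socketpair() for _ in range(100)]  # (client_end, server_end)
for _, server_end in pairs:
    server_end.setblocking(False)
    selector.register(server_end, selectors.EVENT_READ)

# Only three of the 100 clients actually send something.
for i in (3, 42, 77):
    pairs[i][0].sendall(f"ping {i}".encode())

# One call reports just the sockets with data waiting; the idle ones are not polled one by one.
ready = selector.select(timeout=1)
messages = sorted(key.fileobj.recv(1024).decode() for key, _ in ready)
print(len(ready), messages)

selector.close()
for client_end, server_end in pairs:
    client_end.close()
    server_end.close()
```

A single thread watches all 100 connections and does work only for the three that are ready. A thread-per-connection server would instead keep 100 threads parked in blocking reads.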