System Design Concepts

No fluff — visual, concise, interview-ready

🔗 9 · CONSISTENCY & DISTRIBUTED SYSTEMS

CAP Theorem & PACELC

Distributed system can guarantee only 2 of 3: Consistency, Availability, Partition Tolerance

CAP Theorem

Consistency Availability Partition Tolerance CA CP AP
C — all nodes see same data (reads get latest write)
A — every request gets a non-error response
P — system works despite network splits

PACELC

Partition? YES PAC Availability Consistency NO ELC Latency Consistency PACELC
If Partition → choose A or C
Else → choose Latency or Consistency
Most systems are tunable along this spectrum
CAP System Approaches

CA — Consistent + Available

👤 👤 User User Srv Srv Database
Single DB, not distributed. No partition tolerance. Postgres, MySQL (single node)

AP — Available + Partition Tolerant

👤 👤 User User Srv Srv DB DB replication
Always responds, may serve stale data. Cassandra, DynamoDB, CouchDB

CP — Consistent + Partition Tolerant

👤 👤 User User Leader Primary Secondary sync
Returns error if can't guarantee consistency. ZooKeeper, etcd, Spanner, HBase
TypeTrade-offSystemsWhen Partition Happens
CANo partition toleranceSingle-node Postgres, MySQLN/A — not distributed
APEventual consistencyCassandra, DynamoDB, CouchDBServes stale data, resolves later
CPReduced availabilityZooKeeper, etcd, Spanner, HBaseRejects requests until consistent
Since partitions are unavoidable in distributed systems, the real choice is always CP vs AP during a partition. CA only exists for single-node systems.

Consistency Models

From strongest to weakest — each level trades latency for correctness. Choose based on your domain's tolerance for stale reads.

Consistency Spectrum — Strongest to Weakest
STRONGEST (slowest) WEAKEST (fastest) Linearizable single-copy illusion real-time ordering read returns latest write etcd, ZooKeeper, Spanner Sequential all see same order not real-time bound total order preserved Kafka partitions, Zab Causal causally related ordered concurrent ops unordered reply after its comment MongoDB (causal sessions) Read-Your-Writes you see your own writes others may see stale profile update visible DynamoDB (consistent read) Eventual replicas converge no ordering guarantee fastest, most available Cassandra, DNS, S3 high latency perfect correctness low latency stale reads possible Choosing: What happens if a user reads stale data? Money lost → Linearizable | Wrong order → Sequential/Causal | Slightly outdated → Eventual | User confused → Read-your-writes
ModelGuaranteeLatencyUse CaseSystems
LinearizableAppears as single copy, real-time orderingHighest (quorum write + read)Distributed locks, leader election, countersetcd, ZooKeeper, Spanner, CockroachDB
SequentialAll observers see same order (not real-time)HighEvent logs, total-order broadcastKafka partitions, ZAB, Raft log
CausalCausally related writes ordered; concurrent unorderedMediumComments/replies, chat messagesMongoDB (causal sessions), COPS
Read-your-writesClient always sees its own writesMedium-LowProfile updates, settings changesDynamoDB (consistent read), session affinity
Monotonic readsNever see older value after seeing newerLowTimelines, feeds (no "going back in time")Read from same replica, version tracking
EventualReplicas converge given no new writesLowestLikes, DNS, shopping carts, CDNCassandra, DynamoDB, S3, CouchDB
Tunable Consistency (Quorum Systems)
Quorum formula: W + R > N guarantees overlap (strong consistency). N=3, W=2, R=2 — majority quorum (standard). N=3, W=3, R=1 — fast reads, slow writes. N=3, W=1, R=1 — eventual consistency (fastest, may read stale). Cassandra and DynamoDB let you tune per-query.

Strong Consistency Techniques

  • Quorum reads/writes: W+R > N ensures overlap
  • Synchronous replication: wait for all replicas
  • Consensus (Raft/Paxos): leader-based agreement
  • Serializable isolation: DB-level (Spanner, CockroachDB)
  • Cost: higher latency, lower availability during partitions

Eventual Consistency Techniques

  • Async replication: write to leader, replicate later
  • Anti-entropy: Merkle trees detect divergence (Cassandra)
  • Read repair: fix stale replicas on read
  • Hinted handoff: store writes for unavailable nodes
  • Cost: stale reads, conflict resolution needed
CRDTs: Data structures that merge concurrent updates without conflicts. G-Counter (grow-only), PN-Counter (inc/dec), OR-Set (add/remove), RGA (text). Used in Figma, Apple Notes, Redis Active-Active. Guarantee: all replicas converge to same state regardless of message order.
Real-world choices: Spanner — linearizable globally (TrueTime). DynamoDB — eventual by default, strong per-read option. Cassandra — tunable per query (ONE, QUORUM, ALL). MongoDB — causal consistency with sessions. Redis — eventual (async replication), strong with WAIT command.
Anti-patterns: Assuming strong consistency when using eventual (lost updates). Linearizable for everything — unnecessary latency for non-critical reads. No conflict resolution strategy for eventual systems. Ignoring replication lag in read replicas.

Consensus Algorithms

How nodes agree on a single value despite failures — the foundation of distributed coordination

Raft Consensus — Leader Election & Log Replication
LEADER accepts all writes replicates log entries sends heartbeats Follower 1 replicates log votes in elections Follower 2 replicates log votes in elections Follower 3 replicates log votes in elections AppendEntries (log + heartbeat) Commit Rule: majority (2/3) acknowledge → committed Leader waits for majority ACK before responding to client Survives f failures with 2f+1 nodes (e.g., 1 failure with 3 nodes) Leader Failure election timeout (150-300ms) follower → candidate → new leader Client writes to leader only
AlgorithmMechanismFault ModelPerformanceUsed By
RaftLeader election → log replication → safetyCrash (2f+1 nodes)1 RTT for committed writeetcd, CockroachDB, Consul, TiKV
Multi-PaxosProposer → Acceptors (prepare/accept). Majority quorum.Crash (2f+1 nodes)1-2 RTT (optimized)Google Chubby, Spanner, Megastore
ZABZooKeeper Atomic Broadcast. Leader-based total order.Crash (2f+1 nodes)1 RTT (leader stable)ZooKeeper, Kafka (controller)
EPaxosLeaderless Paxos. Any node can propose.Crash (2f+1 nodes)1 RTT (no conflicts)Research, CockroachDB (partial)
PBFTPre-prepare → Prepare → Commit. Tolerates malicious nodes.Byzantine (3f+1)3 RTT (expensive)Blockchain, Hyperledger
Viewstamped Rep.Similar to Raft, predates it. View changes for leader failure.Crash (2f+1 nodes)1 RTTAcademic, influenced Raft
Raft vs Paxos — Key Differences

Raft (Preferred for New Systems)

  • Understandable: designed for clarity (vs Paxos complexity)
  • Strong leader: all writes go through leader
  • Log-based: ordered append-only log replicated
  • Membership changes: joint consensus protocol
  • Implementations: etcd, Consul, CockroachDB, TiKV
  • Tradeoff: leader bottleneck for write throughput

Paxos (Theoretical Foundation)

  • Flexible: no mandatory leader (Multi-Paxos adds one)
  • Proven: mathematically verified safety
  • Complex: notoriously hard to implement correctly
  • Variants: Basic, Multi, Fast, EPaxos, Flexible
  • Implementations: Google Spanner, Chubby, Megastore
  • Tradeoff: implementation complexity, subtle bugs
Guarantees: Safety — all nodes agree on same value (never disagree). Liveness — eventually makes progress (given stable leader). Survives f failures with 2f+1 nodes (e.g., 2 failures with 5 nodes). FLP Impossibility: No deterministic consensus in async system with even 1 crash — practical systems use timeouts to circumvent.
When you need consensus: Leader election (who's the primary?). Distributed locks (mutual exclusion). Atomic broadcast (total order of messages). State machine replication (replicated log → same state). Configuration management (cluster membership).
Anti-patterns: Using consensus for every write — too slow for high-throughput data. Odd cluster sizes only (3, 5, 7) — even numbers don't improve fault tolerance. Cross-region consensus — RTT kills performance (use multi-region Paxos or local consensus + async replication). Ignoring split-brain — always use fencing tokens.
Real-world: etcd — Raft, backbone of Kubernetes (stores all cluster state). Google Spanner — Paxos per shard, globally consistent. CockroachDB — Raft per range (64MB chunks). Kafka KRaft — replacing ZooKeeper with Raft-based controller (Kafka 3.3+).

Distributed Transactions

Coordinating writes across multiple services

What ACID Really Means

Atomicity

All or nothing
Transaction write 1 write 2 write 3 write 4 commit all or nothing DB

Consistency

Preserving DB invariants
consistent state A 💰 Txn consistent state B

Isolation

Concurrent txns isolated
write 1 write 2 Txn A write 3 write 4 Txn B 🔒 isolated DB

Durability

Persisted after commit
Transaction 1. commit DB 2. replicate 2. replicate replica a replica b
ACID — Single DB vs Distributed
PropertySingle DBDistributed (Microservices)
Atomicitycommit/rollback via WAL2PC (strict, blocking) or Saga (eventual + compensating)
ConsistencyDB constraints — PK, FK, UNIQUEApp logic + events — no global constraints
IsolationLocks / MVCCDistributed MVCC, optimistic locks — high latency
DurabilityWAL + diskReplication + quorum writes — survives node failure
Distributed Transaction Patterns

2PC (Two-Phase Commit)

Strong consistency. Coordinator asks all → commit or abort.
Coordinator Service A Service B PHASE 1 prepare? prepare? ✓ YES ✓ YES PHASE 2 COMMIT COMMIT ACK ✓ ACK ✓ If any votes NO → ABORT all ROLLBACK ROLLBACK ⚠ Blocking — if coordinator dies after PREPARE, all participants stuck Used by: Spanner, XA transactions, distributed DBs

Saga (Choreography)

Eventual consistency. Chain of local txns + compensating rollbacks.
Order Svc Payment Svc Inventory Svc HAPPY ✓ T1: create order.created T2: charge payment.done T3: reserve ✓ All steps succeeded FAIL ✗ T3: ✗ fail stock.failed C2: refund 💸 payment.refunded C1: cancel ↩

Outbox Pattern

Reliable event publishing without 2PC. Write event + data in same DB txn.
Service 1 txn Database orders outbox atomic write (same txn) 2. CDC / poll Debezium 3. publish Kafka Consumers ✓ ✓ Event never lost — same txn as data

TCC (Try-Confirm-Cancel)

Reserve first, then confirm or cancel. Used in fintech.
Coordinator Payment Inventory TRY reserve $100 reserve 1 item $100 held 1 item held CONFIRM deduct $100 ✓ deduct 1 item ✓ CANCEL release $100 ↩ release 1 item ↩
PatternConsistencyBlocking?Best For
2PCStrongYes — coordinator SPOFDistributed DBs (Spanner, XA)
SagaEventualNoMost microservices (Uber, Airbnb)
OutboxEventualNoReliable event publishing (Stripe)
TCCEventualNoFintech — reserve before commit
Real-world: Uber — Saga via Temporal/Cadence. Stripe — Outbox for payment events. Airbnb — Saga for booking (reserve→charge→confirm).

Concurrency Control

Multiple transactions accessing same data without corruption

Why Do We Need Locks?
Without locks, two users buying the last ticket simultaneously both read stock=1, both decrement, both succeed → oversold. This is a lost update / race condition. Locks ensure only one writer at a time on the same row.

Pessimistic Locking

Lock first, then work. SELECT ... FOR UPDATE
User A DB (stock=1) User B SELECT ... FOR UPDATE 🔒 lock granted, stock=1 SELECT ... FOR UPDATE BLOCKED waiting for lock UPDATE stock=0; COMMIT ✅ ticket booked, lock released 🔒 lock granted, stock=0 stock=0 → ABORT ❌ sold out ✓ No oversell — B waited, saw stock=0 t1 t2 t3 t4 t5
Pro Safe — no race condition, no retry needed
Con Deadlock risk, throughput bottleneck (others wait)
Use: high contention — flash sales, seat booking, inventory
↳ Many users hit same row simultaneously — conflicts are certain, so lock upfront to prevent overselling/double-booking

Optimistic Locking

No lock. Check version on write. WHERE version=N
User A DB (stock=1, v=1) User B SELECT stock, version stock=1, v=1 SELECT stock, version stock=1, v=1 🔓 no lock held by either UPDATE stock=0 WHERE v=1 ✅ rows=1, v→2 UPDATE stock=0 WHERE v=1 ❌ rows=0 (v=2 now, mismatch) 🔄 retry: SELECT again stock=0, v=2 ❌ sold out ✓ No oversell — B's write rejected by version check t1 t2 t3 t4
Pro High throughput, no blocking, deadlock-free
Con Retry overhead under high contention
Use: low contention — profile edits, wiki pages, config updates
↳ Conflicts are rare, so skipping locks boosts concurrency — if a conflict does happen, just retry cheaply
Isolation Levels: Read UncommittedRead Committed (Postgres default) → Repeatable Read (MySQL default) → Serializable (strongest, slowest). MVCC — readers see snapshot, no read locks. Deadlock prevention: Acquire locks in same order, use timeouts.

Saga — Orchestration vs Choreography

Long-running distributed transactions with compensating actions — when 2PC is too slow or spans too many services

Orchestration Pattern — Central Coordinator
Saga Orchestrator manages state machine decides next step / compensation Temporal / Step Functions / Cadence Happy Path (all succeed) ① Order Service createOrder() status: PENDING compensate: cancelOrder() ② Inventory Svc reserveStock() status: RESERVED compensate: releaseStock() ③ Payment Svc chargeCard() status: CHARGED compensate: refund() ④ Shipping Svc scheduleShipment() status: SHIPPED compensate: cancelShipment() ✓ DONE COMPLETED Compensation Path (Payment fails at step ③) ③ FAILED chargeCard() threw error ② releaseStock() undo reservation ① cancelOrder() mark CANCELLED Saga ROLLED BACK Orchestrator manages: ✓ Step execution order ✗ Compensation on failure ⟳ Retry with backoff 📊 Full visibility/tracing
Choreography Pattern — Event-Driven, No Central Coordinator
Event Bus (Kafka / SNS+SQS / EventBridge) Order Service publishes: OrderCreated Payment Service listens: OrderCreated publishes: PaymentCharged Inventory Service listens: PaymentCharged publishes: StockReserved Shipping Service listens: StockReserved publishes: Shipped On failure: each service publishes compensation event PaymentFailed → Order listens → cancel StockUnavailable → Payment refunds No single point of failure — but harder to trace, debug, and reason about ordering
OrchestrationChoreography
CoordinatorCentral orchestrator (Temporal, Cadence, Step Functions)None — services react to domain events
CouplingServices coupled to orchestratorLoosely coupled — only know their own events
VisibilityEasy to trace — single workflow definitionHard to trace — distributed across event logs
ComplexityGrows linearly with stepsGrows exponentially with interactions
Failure handlingOrchestrator decides compensation orderEach service must handle its own compensation
TestingTest workflow as unitRequires integration tests across services
Best forComplex flows (5+ steps), strict ordering neededSimple flows (2-3 steps), autonomous teams
Saga Implementation Tools
ToolTypeLanguageKey Feature
TemporalOrchestrationGo, Java, TypeScript, PythonDurable execution, automatic retries, versioning
AWS Step FunctionsOrchestrationAny (via Lambda)Serverless, visual workflow, built-in error handling
CadenceOrchestrationGo, JavaUber-built, predecessor to Temporal
Axon FrameworkBothJava/KotlinCQRS + Event Sourcing + Saga built-in
MassTransitBoth.NETState machine sagas, RabbitMQ/Kafka transport
Eventuate TramChoreographyJavaOutbox pattern, CDC-based event publishing
Saga Design Patterns & Considerations

Compensation Design Rules

  • Semantic undo: compensations are business-level (refund, not DB rollback)
  • Idempotent: compensations may run multiple times
  • Commutative: order of compensations shouldn't matter
  • Never fail: compensations must always succeed (retry forever)
  • Backward recovery: undo in reverse order of execution

Common Pitfalls

  • No isolation: dirty reads between saga steps (use semantic locks)
  • Lost compensations: compensation event lost → inconsistent state
  • Cyclic dependencies: A→B→C→A creates deadlock
  • Too many steps: >7 steps = consider splitting into sub-sagas
  • No timeout: saga hangs forever if step never responds
Isolation countermeasures: Semantic locks — mark resources as "processing" (order status = PENDING). Commutative updates — operations that work in any order. Pessimistic view — reread data before compensation. Reread value — verify state hasn't changed before acting.
When to use Sagas vs 2PC: Use 2PC when you need strong consistency across 2-3 databases in same trust boundary (short-lived). Use Sagas when spanning multiple microservices, long-running processes, or when availability matters more than immediate consistency.
Anti-patterns: Saga without compensations — just a distributed transaction that can't roll back. Synchronous sagas — defeats the purpose, use 2PC instead. Shared mutable state — services reading each other's DB directly.
Real-world: Uber — trip lifecycle saga (match → pickup → ride → payment → rating). Netflix — content ingestion pipeline (transcode → validate → publish). Airbnb — booking saga (reserve → charge → confirm host → send confirmation).

Conflict Resolution

Reconciling concurrent writes in eventually consistent systems — when two nodes disagree, who wins?

Conflict Detection & Resolution Strategies
Concurrent Writes (Conflict) Node A x = "blue" @ t=10 Node B x = "red" @ t=11 ⚠ Which value is correct? Network partition prevented sync Last-Writer-Wins (LWW) Highest timestamp wins: x = "red" ⚠ "blue" silently lost (data loss) Simple but dangerous for important data Vector Clocks / Version Vectors Detect: A[1,0] vs B[0,1] = CONFLICT Surface conflict → app/user resolves No data loss, but complex resolution logic CRDTs (Conflict-Free) Mathematically guaranteed merge G-Counter: {A:5, B:7} → sum = 12 OR-Set: union of adds, track removes No conflicts possible — always converges Result: x = "red" Lost: "blue" (acceptable for caches) Result: siblings ["blue","red"] App picks or shows both to user Result: auto-merged No human intervention needed LWW: simple, lossy Vector Clocks: detects conflicts, app resolves CRDTs: auto-merge, no loss, limited data types
StrategyHow It WorksData Loss?Used By
Last-Writer-Wins (LWW)Highest timestamp wins, discard othersYes — silent data lossCassandra (default), DynamoDB
Vector ClocksTrack causal history per node, detect true conflictsNo — surfaces siblings for resolutionRiak, Voldemort
Version VectorsSimplified vector clocks (per-replica, not per-client)No — detects concurrent updatesRiak (newer), Dynamo paper
CRDTsMathematically convergent data structuresNo — auto-mergesFigma, Apple Notes, Redis CRDT, Yjs
Operational TransformTransform concurrent ops to maintain intentNo — preserves all editsGoogle Docs
Application-level mergeCustom merge logic per domainDepends on logicGit (3-way merge), custom apps
CRDT Types — Conflict-Free Replicated Data Types

Counters

  • G-Counter: grow-only (likes, views)
  • PN-Counter: increment + decrement
  • Merge: max per node, sum across

Sets

  • G-Set: grow-only (add, never remove)
  • OR-Set: observed-remove (add + remove)
  • LWW-Element-Set: timestamp per element

Sequences / Text

  • RGA: Replicated Growable Array
  • LSEQ: position-based sequence
  • Used for collaborative text editing
Key Insight: CRDTs guarantee convergence — all replicas reach the same state without coordination. The tradeoff: limited to data types with commutative, associative, idempotent merge operations. Not all business logic fits naturally into CRDTs.
Choosing a strategy: LWW for caches, session data, non-critical writes. Vector clocks for shopping carts, user preferences (surface conflicts). CRDTs for collaborative editing, counters, distributed sets. OT for real-time text collaboration with central server.
Anti-patterns: LWW for financial data — lost writes = lost money. Ignoring clock skew — NTP drift makes LWW unreliable. Unbounded vector clocks — grow forever without pruning. CRDTs for everything — overkill for simple use cases.
Real-world: Figma — CRDTs for multiplayer design (custom implementation). Apple Notes — CRDTs for offline-first sync. Redis — Active-Active with CRDT-based conflict resolution. Riak — vector clocks + sibling resolution. Google Docs — Operational Transform (centralized).

Clock Sync & Time

Why "now" is harder than it looks — clocks drift, networks delay, and ordering requires coordination

Clock Types in Distributed Systems
Physical Clocks wall-clock time (real-world time) NTP ~1-10ms accuracy unbounded drift can jump backward TrueTime GPS + atomic clocks bounded ε (~7ms) Google Spanner only Logical Clocks ordering without real time Lamport single counter total ordering can't detect concurrency Vector Clock counter per node detects concurrency grows with nodes Hybrid Clocks physical + logical combined HLC physical + logical causal + real-time CockroachDB, YugabyteDB Commit Wait wait for ε to pass external consistency Spanner (TrueTime) Where Clocks Matter MVCC Snapshots read at timestamp T Spanner, CockroachDB Lease Expiry leader lease valid until T etcd, ZooKeeper, DynamoDB Event Ordering causal ordering of events Kafka, EventStore Conflict Resolution LWW needs timestamps Cassandra, DynamoDB ⚠ Clock Skew Problem Node A: 10:00:00.000 | Node B: 10:00:00.150 | Δ = 150ms → wrong ordering ✓ Solution: Logical/Hybrid Clocks Don't rely on wall-clock agreement — use causal ordering instead
Clock TypeAccuracyDetects ConcurrencyReal-Time OrderingUsed By
NTP~1-10ms (LAN), ~100ms (WAN)NoApproximateMost systems (default)
TrueTime~7ms boundedNo (but bounded uncertainty)Yes (with commit-wait)Google Spanner
Lamport TimestampN/A (logical)No — total order onlyNoSimple ordering needs
Vector ClockN/A (logical)YesNoRiak, Dynamo
HLCPhysical + logicalYesApproximate + causalCockroachDB, YugabyteDB
Clock Pitfalls & Best Practices

Never Assume

  • System clocks agree across nodes
  • Clock always moves forward (NTP can jump back)
  • Timestamps are unique (sub-ms collisions)
  • Network delay is symmetric
  • Leap seconds don't exist (they do: 23:59:60)

Best Practices

  • Use monotonic clocks for timeouts/durations
  • Use logical clocks for event ordering
  • Use HLC when you need both time + causality
  • Store timestamps as UTC + timezone offset
  • Add node ID to break timestamp ties
Monotonic vs Wall Clock: Monotonic (clock_gettime CLOCK_MONOTONIC) — never goes backward, use for measuring durations, timeouts, lease expiry. Wall clock (gettimeofday) — can jump forward/backward on NTP sync, use only for human-readable timestamps.
How Spanner achieves external consistency: TrueTime returns interval [earliest, latest]. On commit, Spanner waits until latest has passed (commit-wait ~7ms). This guarantees if T1 commits before T2 starts, T1's timestamp < T2's timestamp. Cost: ~7ms added latency per write.
Anti-patterns: Using System.currentTimeMillis() for ordering — clock skew breaks it. Comparing timestamps across machines without bounded uncertainty. Relying on NTP for distributed locks — drift can cause split-brain. Ignoring leap seconds in time-sensitive financial systems.
Real-world: Google Spanner — TrueTime + commit-wait for global consistency. CockroachDB — HLC for serializable transactions without GPS. Amazon — time sync service for EC2 (chrony, ~1ms accuracy). Cloudflare — Roughtime protocol for untrusted time sources.