Distributed system can guarantee only 2 of 3: Consistency, Availability, Partition Tolerance
CAP Theorem
C — all nodes see same data (reads get latest write) A — every request gets a non-error response P — system works despite network splits
PACELC
If Partition → choose A or C Else → choose Latency or Consistency
Most systems are tunable along this spectrum
▸ CAP System Approaches
CA — Consistent + Available
Single DB, not distributed. No partition tolerance. Postgres, MySQL (single node)
AP — Available + Partition Tolerant
Always responds, may serve stale data. Cassandra, DynamoDB, CouchDB
CP — Consistent + Partition Tolerant
Returns error if can't guarantee consistency. ZooKeeper, etcd, Spanner, HBase
Type
Trade-off
Systems
When Partition Happens
CA
No partition tolerance
Single-node Postgres, MySQL
N/A — not distributed
AP
Eventual consistency
Cassandra, DynamoDB, CouchDB
Serves stale data, resolves later
CP
Reduced availability
ZooKeeper, etcd, Spanner, HBase
Rejects requests until consistent
Since partitions are unavoidable in distributed systems, the real choice is always CP vs AP during a partition. CA only exists for single-node systems.
Consistency Models
From strongest to weakest — each level trades latency for correctness. Choose based on your domain's tolerance for stale reads.
▸ Consistency Spectrum — Strongest to Weakest
Model
Guarantee
Latency
Use Case
Systems
Linearizable
Appears as single copy, real-time ordering
Highest (quorum write + read)
Distributed locks, leader election, counters
etcd, ZooKeeper, Spanner, CockroachDB
Sequential
All observers see same order (not real-time)
High
Event logs, total-order broadcast
Kafka partitions, ZAB, Raft log
Causal
Causally related writes ordered; concurrent unordered
Medium
Comments/replies, chat messages
MongoDB (causal sessions), COPS
Read-your-writes
Client always sees its own writes
Medium-Low
Profile updates, settings changes
DynamoDB (consistent read), session affinity
Monotonic reads
Never see older value after seeing newer
Low
Timelines, feeds (no "going back in time")
Read from same replica, version tracking
Eventual
Replicas converge given no new writes
Lowest
Likes, DNS, shopping carts, CDN
Cassandra, DynamoDB, S3, CouchDB
▸ Tunable Consistency (Quorum Systems)
Quorum formula:W + R > N guarantees overlap (strong consistency). N=3, W=2, R=2 — majority quorum (standard). N=3, W=3, R=1 — fast reads, slow writes. N=3, W=1, R=1 — eventual consistency (fastest, may read stale). Cassandra and DynamoDB let you tune per-query.
Cost: higher latency, lower availability during partitions
Eventual Consistency Techniques
Async replication: write to leader, replicate later
Anti-entropy: Merkle trees detect divergence (Cassandra)
Read repair: fix stale replicas on read
Hinted handoff: store writes for unavailable nodes
Cost: stale reads, conflict resolution needed
CRDTs: Data structures that merge concurrent updates without conflicts. G-Counter (grow-only), PN-Counter (inc/dec), OR-Set (add/remove), RGA (text). Used in Figma, Apple Notes, Redis Active-Active. Guarantee: all replicas converge to same state regardless of message order.
Real-world choices:Spanner — linearizable globally (TrueTime). DynamoDB — eventual by default, strong per-read option. Cassandra — tunable per query (ONE, QUORUM, ALL). MongoDB — causal consistency with sessions. Redis — eventual (async replication), strong with WAIT command.
Anti-patterns:Assuming strong consistency when using eventual (lost updates). Linearizable for everything — unnecessary latency for non-critical reads. No conflict resolution strategy for eventual systems. Ignoring replication lag in read replicas.
Consensus Algorithms
How nodes agree on a single value despite failures — the foundation of distributed coordination
Similar to Raft, predates it. View changes for leader failure.
Crash (2f+1 nodes)
1 RTT
Academic, influenced Raft
▸ Raft vs Paxos — Key Differences
Raft (Preferred for New Systems)
Understandable: designed for clarity (vs Paxos complexity)
Strong leader: all writes go through leader
Log-based: ordered append-only log replicated
Membership changes: joint consensus protocol
Implementations: etcd, Consul, CockroachDB, TiKV
Tradeoff: leader bottleneck for write throughput
Paxos (Theoretical Foundation)
Flexible: no mandatory leader (Multi-Paxos adds one)
Proven: mathematically verified safety
Complex: notoriously hard to implement correctly
Variants: Basic, Multi, Fast, EPaxos, Flexible
Implementations: Google Spanner, Chubby, Megastore
Tradeoff: implementation complexity, subtle bugs
Guarantees:Safety — all nodes agree on same value (never disagree). Liveness — eventually makes progress (given stable leader). Survives f failures with 2f+1 nodes (e.g., 2 failures with 5 nodes). FLP Impossibility: No deterministic consensus in async system with even 1 crash — practical systems use timeouts to circumvent.
When you need consensus:Leader election (who's the primary?). Distributed locks (mutual exclusion). Atomic broadcast (total order of messages). State machine replication (replicated log → same state). Configuration management (cluster membership).
Anti-patterns:Using consensus for every write — too slow for high-throughput data. Odd cluster sizes only (3, 5, 7) — even numbers don't improve fault tolerance. Cross-region consensus — RTT kills performance (use multi-region Paxos or local consensus + async replication). Ignoring split-brain — always use fencing tokens.
Real-world:etcd — Raft, backbone of Kubernetes (stores all cluster state). Google Spanner — Paxos per shard, globally consistent. CockroachDB — Raft per range (64MB chunks). Kafka KRaft — replacing ZooKeeper with Raft-based controller (Kafka 3.3+).
Distributed Transactions
Coordinating writes across multiple services
▸ What ACID Really Means
Atomicity
All or nothing
Consistency
Preserving DB invariants
Isolation
Concurrent txns isolated
Durability
Persisted after commit
▸ ACID — Single DB vs Distributed
Property
Single DB
Distributed (Microservices)
Atomicity
commit/rollback via WAL
2PC (strict, blocking) or Saga (eventual + compensating)
Consistency
DB constraints — PK, FK, UNIQUE
App logic + events — no global constraints
Isolation
Locks / MVCC
Distributed MVCC, optimistic locks — high latency
Durability
WAL + disk
Replication + quorum writes — survives node failure
▸ Distributed Transaction Patterns
2PC (Two-Phase Commit)
Strong consistency. Coordinator asks all → commit or abort.
Saga (Choreography)
Eventual consistency. Chain of local txns + compensating rollbacks.
Outbox Pattern
Reliable event publishing without 2PC. Write event + data in same DB txn.
TCC (Try-Confirm-Cancel)
Reserve first, then confirm or cancel. Used in fintech.
Pattern
Consistency
Blocking?
Best For
2PC
Strong
Yes — coordinator SPOF
Distributed DBs (Spanner, XA)
Saga
Eventual
No
Most microservices (Uber, Airbnb)
Outbox
Eventual
No
Reliable event publishing (Stripe)
TCC
Eventual
No
Fintech — reserve before commit
Real-world:Uber — Saga via Temporal/Cadence. Stripe — Outbox for payment events. Airbnb — Saga for booking (reserve→charge→confirm).
Concurrency Control
Multiple transactions accessing same data without corruption
▸ Why Do We Need Locks?
Without locks, two users buying the last ticket simultaneously both read stock=1, both decrement, both succeed → oversold. This is a lost update / race condition. Locks ensure only one writer at a time on the same row.
Pessimistic Locking
Lock first, then work. SELECT ... FOR UPDATE
Pro Safe — no race condition, no retry needed Con Deadlock risk, throughput bottleneck (others wait) Use: high contention — flash sales, seat booking, inventory ↳ Many users hit same row simultaneously — conflicts are certain, so lock upfront to prevent overselling/double-booking
Optimistic Locking
No lock. Check version on write. WHERE version=N
Pro High throughput, no blocking, deadlock-free Con Retry overhead under high contention Use: low contention — profile edits, wiki pages, config updates ↳ Conflicts are rare, so skipping locks boosts concurrency — if a conflict does happen, just retry cheaply
Isolation Levels:Read Uncommitted → Read Committed (Postgres default) → Repeatable Read (MySQL default) → Serializable (strongest, slowest). MVCC — readers see snapshot, no read locks. Deadlock prevention: Acquire locks in same order, use timeouts.
Saga — Orchestration vs Choreography
Long-running distributed transactions with compensating actions — when 2PC is too slow or spans too many services
▸ Orchestration Pattern — Central Coordinator
▸ Choreography Pattern — Event-Driven, No Central Coordinator
Orchestration
Choreography
Coordinator
Central orchestrator (Temporal, Cadence, Step Functions)
Semantic undo: compensations are business-level (refund, not DB rollback)
Idempotent: compensations may run multiple times
Commutative: order of compensations shouldn't matter
Never fail: compensations must always succeed (retry forever)
Backward recovery: undo in reverse order of execution
Common Pitfalls
No isolation: dirty reads between saga steps (use semantic locks)
Lost compensations: compensation event lost → inconsistent state
Cyclic dependencies: A→B→C→A creates deadlock
Too many steps: >7 steps = consider splitting into sub-sagas
No timeout: saga hangs forever if step never responds
Isolation countermeasures:Semantic locks — mark resources as "processing" (order status = PENDING). Commutative updates — operations that work in any order. Pessimistic view — reread data before compensation. Reread value — verify state hasn't changed before acting.
When to use Sagas vs 2PC: Use 2PC when you need strong consistency across 2-3 databases in same trust boundary (short-lived). Use Sagas when spanning multiple microservices, long-running processes, or when availability matters more than immediate consistency.
Anti-patterns:Saga without compensations — just a distributed transaction that can't roll back. Synchronous sagas — defeats the purpose, use 2PC instead. Shared mutable state — services reading each other's DB directly.
Reconciling concurrent writes in eventually consistent systems — when two nodes disagree, who wins?
▸ Conflict Detection & Resolution Strategies
Strategy
How It Works
Data Loss?
Used By
Last-Writer-Wins (LWW)
Highest timestamp wins, discard others
Yes — silent data loss
Cassandra (default), DynamoDB
Vector Clocks
Track causal history per node, detect true conflicts
No — surfaces siblings for resolution
Riak, Voldemort
Version Vectors
Simplified vector clocks (per-replica, not per-client)
No — detects concurrent updates
Riak (newer), Dynamo paper
CRDTs
Mathematically convergent data structures
No — auto-merges
Figma, Apple Notes, Redis CRDT, Yjs
Operational Transform
Transform concurrent ops to maintain intent
No — preserves all edits
Google Docs
Application-level merge
Custom merge logic per domain
Depends on logic
Git (3-way merge), custom apps
▸ CRDT Types — Conflict-Free Replicated Data Types
Counters
G-Counter: grow-only (likes, views)
PN-Counter: increment + decrement
Merge: max per node, sum across
Sets
G-Set: grow-only (add, never remove)
OR-Set: observed-remove (add + remove)
LWW-Element-Set: timestamp per element
Sequences / Text
RGA: Replicated Growable Array
LSEQ: position-based sequence
Used for collaborative text editing
Key Insight:CRDTs guarantee convergence — all replicas reach the same state without coordination. The tradeoff: limited to data types with commutative, associative, idempotent merge operations. Not all business logic fits naturally into CRDTs.
Choosing a strategy:LWW for caches, session data, non-critical writes. Vector clocks for shopping carts, user preferences (surface conflicts). CRDTs for collaborative editing, counters, distributed sets. OT for real-time text collaboration with central server.
Anti-patterns:LWW for financial data — lost writes = lost money. Ignoring clock skew — NTP drift makes LWW unreliable. Unbounded vector clocks — grow forever without pruning. CRDTs for everything — overkill for simple use cases.
Real-world:Figma — CRDTs for multiplayer design (custom implementation). Apple Notes — CRDTs for offline-first sync. Redis — Active-Active with CRDT-based conflict resolution. Riak — vector clocks + sibling resolution. Google Docs — Operational Transform (centralized).
Clock Sync & Time
Why "now" is harder than it looks — clocks drift, networks delay, and ordering requires coordination
▸ Clock Types in Distributed Systems
Clock Type
Accuracy
Detects Concurrency
Real-Time Ordering
Used By
NTP
~1-10ms (LAN), ~100ms (WAN)
No
Approximate
Most systems (default)
TrueTime
~7ms bounded
No (but bounded uncertainty)
Yes (with commit-wait)
Google Spanner
Lamport Timestamp
N/A (logical)
No — total order only
No
Simple ordering needs
Vector Clock
N/A (logical)
Yes
No
Riak, Dynamo
HLC
Physical + logical
Yes
Approximate + causal
CockroachDB, YugabyteDB
▸ Clock Pitfalls & Best Practices
Never Assume
System clocks agree across nodes
Clock always moves forward (NTP can jump back)
Timestamps are unique (sub-ms collisions)
Network delay is symmetric
Leap seconds don't exist (they do: 23:59:60)
Best Practices
Use monotonic clocks for timeouts/durations
Use logical clocks for event ordering
Use HLC when you need both time + causality
Store timestamps as UTC + timezone offset
Add node ID to break timestamp ties
Monotonic vs Wall Clock:Monotonic (clock_gettime CLOCK_MONOTONIC) — never goes backward, use for measuring durations, timeouts, lease expiry. Wall clock (gettimeofday) — can jump forward/backward on NTP sync, use only for human-readable timestamps.
How Spanner achieves external consistency: TrueTime returns interval [earliest, latest]. On commit, Spanner waits until latest has passed (commit-wait ~7ms). This guarantees if T1 commits before T2 starts, T1's timestamp < T2's timestamp. Cost: ~7ms added latency per write.
Anti-patterns:Using System.currentTimeMillis() for ordering — clock skew breaks it. Comparing timestamps across machines without bounded uncertainty. Relying on NTP for distributed locks — drift can cause split-brain. Ignoring leap seconds in time-sensitive financial systems.
Real-world:Google Spanner — TrueTime + commit-wait for global consistency. CockroachDB — HLC for serializable transactions without GPS. Amazon — time sync service for EC2 (chrony, ~1ms accuracy). Cloudflare — Roughtime protocol for untrusted time sources.