9. Consistency & Distributed Systems — System Design Concepts

🔗 9 · CONSISTENCY & DISTRIBUTED SYSTEMS

CAP Theorem & PACELC

Distributed system can guarantee only 2 of 3: Consistency, Availability, Partition Tolerance

CAP Theorem

C — all nodes see same data (reads get latest write)
A — every request gets a non-error response
P — system works despite network splits

PACELC

If Partition → choose A or C
Else → choose Latency or Consistency
Most systems are tunable along this spectrum

▸ CAP System Approaches

CA — Consistent + Available

Single DB, not distributed. No partition tolerance. Postgres, MySQL (single node)

AP — Available + Partition Tolerant

Always responds, may serve stale data. Cassandra, DynamoDB, CouchDB

CP — Consistent + Partition Tolerant

Returns error if can't guarantee consistency. ZooKeeper, etcd, Spanner, HBase

Type	Trade-off	Systems	When Partition Happens
CA	No partition tolerance	Single-node Postgres, MySQL	N/A — not distributed
AP	Eventual consistency	Cassandra, DynamoDB, CouchDB	Serves stale data, resolves later
CP	Reduced availability	ZooKeeper, etcd, Spanner, HBase	Rejects requests until consistent

Since partitions are unavoidable in distributed systems, the real choice is always CP vs AP during a partition. CA only exists for single-node systems.

Consistency Models

From strongest to weakest — each level trades latency for correctness. Choose based on your domain's tolerance for stale reads.

▸ Consistency Spectrum — Strongest to Weakest

Model	Guarantee	Latency	Use Case	Systems
Linearizable	Appears as single copy, real-time ordering	Highest (quorum write + read)	Distributed locks, leader election, counters	etcd, ZooKeeper, Spanner, CockroachDB
Sequential	All observers see same order (not real-time)	High	Event logs, total-order broadcast	Kafka partitions, ZAB, Raft log
Causal	Causally related writes ordered; concurrent unordered	Medium	Comments/replies, chat messages	MongoDB (causal sessions), COPS
Read-your-writes	Client always sees its own writes	Medium-Low	Profile updates, settings changes	DynamoDB (consistent read), session affinity
Monotonic reads	Never see older value after seeing newer	Low	Timelines, feeds (no "going back in time")	Read from same replica, version tracking
Eventual	Replicas converge given no new writes	Lowest	Likes, DNS, shopping carts, CDN	Cassandra, DynamoDB, S3, CouchDB

▸ Tunable Consistency (Quorum Systems)

Quorum formula: W + R > N guarantees overlap (strong consistency). N=3, W=2, R=2 — majority quorum (standard). N=3, W=3, R=1 — fast reads, slow writes. N=3, W=1, R=1 — eventual consistency (fastest, may read stale). Cassandra and DynamoDB let you tune per-query.

Strong Consistency Techniques

Quorum reads/writes: W+R > N ensures overlap
Synchronous replication: wait for all replicas
Consensus (Raft/Paxos): leader-based agreement
Serializable isolation: DB-level (Spanner, CockroachDB)
Cost: higher latency, lower availability during partitions

Eventual Consistency Techniques

Async replication: write to leader, replicate later
Anti-entropy: Merkle trees detect divergence (Cassandra)
Read repair: fix stale replicas on read
Hinted handoff: store writes for unavailable nodes
Cost: stale reads, conflict resolution needed

CRDTs: Data structures that merge concurrent updates without conflicts. G-Counter (grow-only), PN-Counter (inc/dec), OR-Set (add/remove), RGA (text). Used in Figma, Apple Notes, Redis Active-Active. Guarantee: all replicas converge to same state regardless of message order.

Real-world choices: Spanner — linearizable globally (TrueTime). DynamoDB — eventual by default, strong per-read option. Cassandra — tunable per query (ONE, QUORUM, ALL). MongoDB — causal consistency with sessions. Redis — eventual (async replication), strong with WAIT command.

Anti-patterns: Assuming strong consistency when using eventual (lost updates). Linearizable for everything — unnecessary latency for non-critical reads. No conflict resolution strategy for eventual systems. Ignoring replication lag in read replicas.

Consensus Algorithms

How nodes agree on a single value despite failures — the foundation of distributed coordination

▸ Raft Consensus — Leader Election & Log Replication

Algorithm	Mechanism	Fault Model	Performance	Used By
Raft	Leader election → log replication → safety	Crash (2f+1 nodes)	1 RTT for committed write	etcd, CockroachDB, Consul, TiKV
Multi-Paxos	Proposer → Acceptors (prepare/accept). Majority quorum.	Crash (2f+1 nodes)	1-2 RTT (optimized)	Google Chubby, Spanner, Megastore
ZAB	ZooKeeper Atomic Broadcast. Leader-based total order.	Crash (2f+1 nodes)	1 RTT (leader stable)	ZooKeeper, Kafka (controller)
EPaxos	Leaderless Paxos. Any node can propose.	Crash (2f+1 nodes)	1 RTT (no conflicts)	Research, CockroachDB (partial)
PBFT	Pre-prepare → Prepare → Commit. Tolerates malicious nodes.	Byzantine (3f+1)	3 RTT (expensive)	Blockchain, Hyperledger
Viewstamped Rep.	Similar to Raft, predates it. View changes for leader failure.	Crash (2f+1 nodes)	1 RTT	Academic, influenced Raft

▸ Raft vs Paxos — Key Differences

Raft (Preferred for New Systems)

Understandable: designed for clarity (vs Paxos complexity)
Strong leader: all writes go through leader
Log-based: ordered append-only log replicated
Membership changes: joint consensus protocol
Implementations: etcd, Consul, CockroachDB, TiKV
Tradeoff: leader bottleneck for write throughput

Paxos (Theoretical Foundation)

Flexible: no mandatory leader (Multi-Paxos adds one)
Proven: mathematically verified safety
Complex: notoriously hard to implement correctly
Variants: Basic, Multi, Fast, EPaxos, Flexible
Implementations: Google Spanner, Chubby, Megastore
Tradeoff: implementation complexity, subtle bugs

Guarantees: Safety — all nodes agree on same value (never disagree). Liveness — eventually makes progress (given stable leader). Survives f failures with 2f+1 nodes (e.g., 2 failures with 5 nodes). FLP Impossibility: No deterministic consensus in async system with even 1 crash — practical systems use timeouts to circumvent.

When you need consensus: Leader election (who's the primary?). Distributed locks (mutual exclusion). Atomic broadcast (total order of messages). State machine replication (replicated log → same state). Configuration management (cluster membership).

Anti-patterns: Using consensus for every write — too slow for high-throughput data. Odd cluster sizes only (3, 5, 7) — even numbers don't improve fault tolerance. Cross-region consensus — RTT kills performance (use multi-region Paxos or local consensus + async replication). Ignoring split-brain — always use fencing tokens.

Real-world: etcd — Raft, backbone of Kubernetes (stores all cluster state). Google Spanner — Paxos per shard, globally consistent. CockroachDB — Raft per range (64MB chunks). Kafka KRaft — replacing ZooKeeper with Raft-based controller (Kafka 3.3+).

Distributed Transactions

Coordinating writes across multiple services

▸ What ACID Really Means

Atomicity

All or nothing

Consistency

Preserving DB invariants

Isolation

Concurrent txns isolated

Durability

Persisted after commit

▸ ACID — Single DB vs Distributed

Property	Single DB	Distributed (Microservices)
Atomicity	commit/rollback via WAL	2PC (strict, blocking) or Saga (eventual + compensating)
Consistency	DB constraints — PK, FK, UNIQUE	App logic + events — no global constraints
Isolation	Locks / MVCC	Distributed MVCC, optimistic locks — high latency
Durability	WAL + disk	Replication + quorum writes — survives node failure

▸ Distributed Transaction Patterns

2PC (Two-Phase Commit)

Strong consistency. Coordinator asks all → commit or abort.

Saga (Choreography)

Eventual consistency. Chain of local txns + compensating rollbacks.

Outbox Pattern

Reliable event publishing without 2PC. Write event + data in same DB txn.

TCC (Try-Confirm-Cancel)

Reserve first, then confirm or cancel. Used in fintech.

Pattern	Consistency	Blocking?	Best For
2PC	Strong	Yes — coordinator SPOF	Distributed DBs (Spanner, XA)
Saga	Eventual	No	Most microservices (Uber, Airbnb)
Outbox	Eventual	No	Reliable event publishing (Stripe)
TCC	Eventual	No	Fintech — reserve before commit

Real-world: Uber — Saga via Temporal/Cadence. Stripe — Outbox for payment events. Airbnb — Saga for booking (reserve→charge→confirm).

Concurrency Control

Multiple transactions accessing same data without corruption

▸ Why Do We Need Locks?

Without locks, two users buying the last ticket simultaneously both read stock=1, both decrement, both succeed → oversold. This is a lost update / race condition. Locks ensure only one writer at a time on the same row.

Pessimistic Locking

Lock first, then work. SELECT ... FOR UPDATE

Pro Safe — no race condition, no retry needed
Con Deadlock risk, throughput bottleneck (others wait)
Use: high contention — flash sales, seat booking, inventory
↳ Many users hit same row simultaneously — conflicts are certain, so lock upfront to prevent overselling/double-booking

Optimistic Locking

No lock. Check version on write. WHERE version=N

Pro High throughput, no blocking, deadlock-free
Con Retry overhead under high contention
Use: low contention — profile edits, wiki pages, config updates
↳ Conflicts are rare, so skipping locks boosts concurrency — if a conflict does happen, just retry cheaply

Isolation Levels: Read Uncommitted → Read Committed (Postgres default) → Repeatable Read (MySQL default) → Serializable (strongest, slowest). MVCC — readers see snapshot, no read locks. Deadlock prevention: Acquire locks in same order, use timeouts.

Saga — Orchestration vs Choreography

Long-running distributed transactions with compensating actions — when 2PC is too slow or spans too many services

▸ Orchestration Pattern — Central Coordinator

▸ Choreography Pattern — Event-Driven, No Central Coordinator

	Orchestration	Choreography
Coordinator	Central orchestrator (Temporal, Cadence, Step Functions)	None — services react to domain events
Coupling	Services coupled to orchestrator	Loosely coupled — only know their own events
Visibility	Easy to trace — single workflow definition	Hard to trace — distributed across event logs
Complexity	Grows linearly with steps	Grows exponentially with interactions
Failure handling	Orchestrator decides compensation order	Each service must handle its own compensation
Testing	Test workflow as unit	Requires integration tests across services
Best for	Complex flows (5+ steps), strict ordering needed	Simple flows (2-3 steps), autonomous teams

▸ Saga Implementation Tools

Tool	Type	Language	Key Feature
Temporal	Orchestration	Go, Java, TypeScript, Python	Durable execution, automatic retries, versioning
AWS Step Functions	Orchestration	Any (via Lambda)	Serverless, visual workflow, built-in error handling
Cadence	Orchestration	Go, Java	Uber-built, predecessor to Temporal
Axon Framework	Both	Java/Kotlin	CQRS + Event Sourcing + Saga built-in
MassTransit	Both	.NET	State machine sagas, RabbitMQ/Kafka transport
Eventuate Tram	Choreography	Java	Outbox pattern, CDC-based event publishing

▸ Saga Design Patterns & Considerations

Compensation Design Rules

Semantic undo: compensations are business-level (refund, not DB rollback)
Idempotent: compensations may run multiple times
Commutative: order of compensations shouldn't matter
Never fail: compensations must always succeed (retry forever)
Backward recovery: undo in reverse order of execution

Common Pitfalls

No isolation: dirty reads between saga steps (use semantic locks)
Lost compensations: compensation event lost → inconsistent state
Cyclic dependencies: A→B→C→A creates deadlock
Too many steps: >7 steps = consider splitting into sub-sagas
No timeout: saga hangs forever if step never responds

Isolation countermeasures: Semantic locks — mark resources as "processing" (order status = PENDING). Commutative updates — operations that work in any order. Pessimistic view — reread data before compensation. Reread value — verify state hasn't changed before acting.

When to use Sagas vs 2PC: Use 2PC when you need strong consistency across 2-3 databases in same trust boundary (short-lived). Use Sagas when spanning multiple microservices, long-running processes, or when availability matters more than immediate consistency.

Anti-patterns: Saga without compensations — just a distributed transaction that can't roll back. Synchronous sagas — defeats the purpose, use 2PC instead. Shared mutable state — services reading each other's DB directly.

Real-world: Uber — trip lifecycle saga (match → pickup → ride → payment → rating). Netflix — content ingestion pipeline (transcode → validate → publish). Airbnb — booking saga (reserve → charge → confirm host → send confirmation).

Conflict Resolution

Reconciling concurrent writes in eventually consistent systems — when two nodes disagree, who wins?

▸ Conflict Detection & Resolution Strategies

Strategy	How It Works	Data Loss?	Used By
Last-Writer-Wins (LWW)	Highest timestamp wins, discard others	Yes — silent data loss	Cassandra (default), DynamoDB
Vector Clocks	Track causal history per node, detect true conflicts	No — surfaces siblings for resolution	Riak, Voldemort
Version Vectors	Simplified vector clocks (per-replica, not per-client)	No — detects concurrent updates	Riak (newer), Dynamo paper
CRDTs	Mathematically convergent data structures	No — auto-merges	Figma, Apple Notes, Redis CRDT, Yjs
Operational Transform	Transform concurrent ops to maintain intent	No — preserves all edits	Google Docs
Application-level merge	Custom merge logic per domain	Depends on logic	Git (3-way merge), custom apps

▸ CRDT Types — Conflict-Free Replicated Data Types

Counters

G-Counter: grow-only (likes, views)
PN-Counter: increment + decrement
Merge: max per node, sum across

Sets

G-Set: grow-only (add, never remove)
OR-Set: observed-remove (add + remove)
LWW-Element-Set: timestamp per element

Sequences / Text

RGA: Replicated Growable Array
LSEQ: position-based sequence
Used for collaborative text editing

Key Insight: CRDTs guarantee convergence — all replicas reach the same state without coordination. The tradeoff: limited to data types with commutative, associative, idempotent merge operations. Not all business logic fits naturally into CRDTs.

Choosing a strategy: LWW for caches, session data, non-critical writes. Vector clocks for shopping carts, user preferences (surface conflicts). CRDTs for collaborative editing, counters, distributed sets. OT for real-time text collaboration with central server.

Anti-patterns: LWW for financial data — lost writes = lost money. Ignoring clock skew — NTP drift makes LWW unreliable. Unbounded vector clocks — grow forever without pruning. CRDTs for everything — overkill for simple use cases.

Real-world: Figma — CRDTs for multiplayer design (custom implementation). Apple Notes — CRDTs for offline-first sync. Redis — Active-Active with CRDT-based conflict resolution. Riak — vector clocks + sibling resolution. Google Docs — Operational Transform (centralized).

Clock Sync & Time

Why "now" is harder than it looks — clocks drift, networks delay, and ordering requires coordination

▸ Clock Types in Distributed Systems

Clock Type	Accuracy	Detects Concurrency	Real-Time Ordering	Used By
NTP	~1-10ms (LAN), ~100ms (WAN)	No	Approximate	Most systems (default)
TrueTime	~7ms bounded	No (but bounded uncertainty)	Yes (with commit-wait)	Google Spanner
Lamport Timestamp	N/A (logical)	No — total order only	No	Simple ordering needs
Vector Clock	N/A (logical)	Yes	No	Riak, Dynamo
HLC	Physical + logical	Yes	Approximate + causal	CockroachDB, YugabyteDB

▸ Clock Pitfalls & Best Practices

Never Assume

System clocks agree across nodes
Clock always moves forward (NTP can jump back)
Timestamps are unique (sub-ms collisions)
Network delay is symmetric
Leap seconds don't exist (they do: 23:59:60)

Best Practices

Use monotonic clocks for timeouts/durations
Use logical clocks for event ordering
Use HLC when you need both time + causality
Store timestamps as UTC + timezone offset
Add node ID to break timestamp ties

Monotonic vs Wall Clock: Monotonic (clock_gettime CLOCK_MONOTONIC) — never goes backward, use for measuring durations, timeouts, lease expiry. Wall clock (gettimeofday) — can jump forward/backward on NTP sync, use only for human-readable timestamps.

How Spanner achieves external consistency: TrueTime returns interval [earliest, latest]. On commit, Spanner waits until latest has passed (commit-wait ~7ms). This guarantees if T1 commits before T2 starts, T1's timestamp < T2's timestamp. Cost: ~7ms added latency per write.

Anti-patterns: Using System.currentTimeMillis() for ordering — clock skew breaks it. Comparing timestamps across machines without bounded uncertainty. Relying on NTP for distributed locks — drift can cause split-brain. Ignoring leap seconds in time-sensitive financial systems.

Real-world: Google Spanner — TrueTime + commit-wait for global consistency. CockroachDB — HLC for serializable transactions without GPS. Amazon — time sync service for EC2 (chrony, ~1ms accuracy). Cloudflare — Roughtime protocol for untrusted time sources.