System Design Case Study

How does etcd achieve consensus across 5 nodes with Raft?

Design a distributed KV store: Raft consensus, leader election, split-brain prevention, majority quorum

Problem Statement

How does a distributed key-value store achieve consensus across 5 nodes, handling leader election, log replication, and split-brain prevention, when every decision requires a majority quorum (3 of 5)?

Core challenge: 5 nodes must agree on the same sequence of operations. A node crashes. A network partition splits the cluster. How do you guarantee all surviving nodes have the same data, elect exactly one leader, and prevent split-brain?
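The 3-of-5 figure comes from simple majority arithmetic. A minimal sketch in Python; the function names are illustrative and not taken from etcd or any library:

```python
def quorum_size(n: int) -> int:
    """Smallest number of nodes that forms a strict majority of n."""
    return n // 2 + 1


def max_failures(n: int) -> int:
    """Nodes that can be lost while a majority is still reachable."""
    return n - quorum_size(n)


for n in (3, 4, 5, 7):
    print(f"{n} nodes: quorum={quorum_size(n)}, tolerates={max_failures(n)} failures")
```

A 4-node cluster needs 3 votes and still tolerates only 1 failure, the same as 3 nodes, which is why odd cluster sizes are the usual choice.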
5 nodes: typical cluster, tolerates 2 failures
1 leader: at any time, all writes go through the leader
Majority quorum (3/5): required for all decisions
<10ms write latency: same-region cluster

Raft Consensus · How It Works

Leader election → log replication → safety guarantee

CLIENT LAYER
Client (Writer): PUT /key=val goes to the leader only; all writes are serialized.
Client (Reader): linearizable reads via ReadIndex; the leader confirms it is still the leader before serving the read.
Client guarantees: linearizable (a read reflects the latest committed write); sequential consistency across all clients.

CONSENSUS LAYER · Log Replication
LEADER (Node 1): accepts ALL writes and serializes them into the log; replicates them with AppendEntries to all followers; sends a heartbeat (empty AppendEntries) every 100ms; term = monotonically increasing epoch number; single point of serialization, so no conflicts.
Followers (Nodes 2, 3, 4): replicate the log, vote, and ACK (term 4, idx 12).
Node 5 (down): network partition, unreachable; does not block progress.

Replicated log (all nodes):
idx 10 | term 4 | SET x=1 | COMMITTED
idx 11 | term 4 | SET y=2 | COMMITTED
idx 12 | term 4 | DEL z | COMMITTED
idx 13 | term 4 | SET w=3 | PENDING
Committed = a majority (3/5) have the entry. 3 ACKs = majority → COMMIT. Once committed, an entry is never lost (safety).

SAFETY LAYER · Election, Commit Rules, Membership
Commit rule: majority (3/5) ACK → committed; committed entries are applied to the state machine; 2f+1 nodes tolerate f failures; only one partition can have a majority, which prevents split-brain.
Leader election: timeout 150-300ms (randomized); follower → candidate → RequestVote; majority of votes → becomes the new leader; randomized timeouts make split votes unlikely; the term is incremented on each election.
Membership change: joint consensus protocol; the old and new configurations must both agree; safe add/remove of nodes; no split-brain during reconfiguration; one node at a time (simplified).

KEY GUARANTEES
Safety: all nodes agree on the same log order and never diverge.
Liveness: the cluster eventually makes progress if a majority is available; tolerates f failures.
Log completeness: a new leader has ALL committed entries; no data loss on failover.
Fencing (terms): monotonic term numbers prevent a stale leader from writing.

2f+1 nodes tolerate f failures | heartbeat every 100ms | <10ms write latency same-region
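The "3 ACKs = majority → COMMIT" rule can be sketched as the leader computing the highest log index stored on a majority of nodes. This is a simplified illustration, not etcd's code: `advance_commit_index`, `match_index`, and the per-node values (including the stale index for the unreachable node) are assumptions made for the example.

```python
def advance_commit_index(match_index, log_terms, current_term, commit_index):
    """Return the leader's new commit index.

    match_index: highest log index stored on each node, leader included.
    log_terms:   maps log index -> term of that entry in the leader's log.
    """
    quorum = len(match_index) // 2 + 1
    # Sorted descending, position quorum-1 holds the highest index that
    # at least `quorum` nodes have stored.
    candidate = sorted(match_index.values(), reverse=True)[quorum - 1]
    # Raft commits by counting replicas only for entries from the leader's
    # current term; older entries become committed along with them.
    if candidate > commit_index and log_terms.get(candidate) == current_term:
        return candidate
    return commit_index


# Assumed state: idx 13 has reached only the leader (n1) and n2; n5 is unreachable.
match_index = {"n1": 13, "n2": 13, "n3": 12, "n4": 12, "n5": 9}
log_terms = {10: 4, 11: 4, 12: 4, 13: 4}
print(advance_commit_index(match_index, log_terms, current_term=4, commit_index=11))  # 12

match_index["n3"] = 13  # a third node stores idx 13, so it reaches a majority
print(advance_commit_index(match_index, log_terms, current_term=4, commit_index=12))  # 13
```

The unreachable node never enters the calculation as long as three other nodes keep up, which is why its absence does not block progress.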
Phase | Mechanism | Guarantee
Leader Election | Follower times out (150-300ms, randomized) → becomes candidate → requests votes → majority grants → becomes leader | At most one leader per term. Randomized timeouts make split votes unlikely.
Log Replication | Leader appends entry → sends AppendEntries to all followers → majority ACK → entry committed | Committed entries are never lost. All nodes converge to the same log.
Safety | A candidate wins only if its log is at least as up to date as a majority's, so it holds all committed entries (log completeness) | A new leader always has all committed data. No data loss on failover.
Heartbeat | Leader sends empty AppendEntries every 100ms | Followers know the leader is alive. No unnecessary elections.
Membership Change | Joint consensus: old and new configurations must both agree during the transition | Nodes can be added or removed safely, without split-brain during reconfiguration.
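The election and safety rows meet in the vote handler: a node grants at most one vote per term, and only to a candidate whose log is at least as up to date as its own. A minimal sketch under those assumptions; the `Node` class and its field names are hypothetical, and persistence and timers are omitted.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Node:
    current_term: int = 0
    voted_for: Optional[str] = None
    log: list = field(default_factory=list)  # entries as (term, command)

    def last_log(self):
        """(term, index) of the last entry; (0, 0) for an empty log."""
        if not self.log:
            return (0, 0)
        return (self.log[-1][0], len(self.log))

    def handle_request_vote(self, term, candidate_id, last_log_term, last_log_index):
        """Return (current_term, vote_granted)."""
        if term < self.current_term:
            return (self.current_term, False)  # stale candidate
        if term > self.current_term:
            # Newer term: adopt it and forget any vote cast in the old term.
            self.current_term = term
            self.voted_for = None
        # Compare last terms first, then log length (tuple comparison).
        up_to_date = (last_log_term, last_log_index) >= self.last_log()
        if up_to_date and self.voted_for in (None, candidate_id):
            self.voted_for = candidate_id
            return (self.current_term, True)
        return (self.current_term, False)


follower = Node(current_term=4, log=[(4, "SET x=1"), (4, "SET y=2")])
print(follower.handle_request_vote(5, "n2", last_log_term=4, last_log_index=1))  # (5, False): shorter log
print(follower.handle_request_vote(5, "n3", last_log_term=4, last_log_index=2))  # (5, True)
print(follower.handle_request_vote(5, "n4", last_log_term=4, last_log_index=2))  # (5, False): already voted
```

Because every committed entry lives on a majority and a winner needs votes from a majority, at least one voter holds each committed entry and would refuse a candidate that lacks it.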
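Split-brain prevention: A majority quorum (3 of 5) means at most one side of a partition can hold a majority. The minority side cannot elect a leader because it cannot collect 3 votes. A stale leader stranded in the minority cannot commit new entries, and it steps down once it notices it has lost contact with a majority or sees a higher term. Monotonically increasing term numbers act as fencing tokens: followers reject requests from a leader with an older term, so a stale leader cannot make progress.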
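Term-based fencing can be illustrated with a stripped-down AppendEntries exchange. This sketch assumes hypothetical `Follower` and `Leader` classes and deliberately leaves out the log consistency check (previous index and term) that a real implementation performs.

```python
class Follower:
    def __init__(self, current_term):
        self.current_term = current_term
        self.entries = []

    def handle_append_entries(self, leader_term, entries):
        """Return (current_term, success); rejects leaders with an older term."""
        if leader_term < self.current_term:
            return (self.current_term, False)
        self.current_term = leader_term
        self.entries.extend(entries)
        return (self.current_term, True)


class Leader:
    def __init__(self, term):
        self.term = term
        self.is_leader = True

    def on_reply(self, reply_term, success):
        # A reply carrying a higher term proves a newer election happened.
        if reply_term > self.term:
            self.term = reply_term
            self.is_leader = False


# The old leader (term 4) rejoins after the majority side elected a term-5 leader.
stale = Leader(term=4)
follower = Follower(current_term=5)
reply = follower.handle_append_entries(stale.term, ["SET w=3"])
stale.on_reply(*reply)
print(reply, stale.is_leader, follower.entries)  # (5, False) False []
```

The stale write is never stored, and the first reply the old leader receives is enough to make it step down.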
etcd specifics: etcd is the backbone of Kubernetes, which stores all of its cluster state there. Watch API: clients subscribe to key changes and are notified instead of polling. Lease-based TTL: keys attached to a lease expire automatically with it (used for leader election and service discovery). Linearizable reads via ReadIndex: the leader confirms it is still the leader before serving the read.
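The ReadIndex idea can be sketched in three steps: record the commit index, confirm leadership with a majority, and serve the read once the state machine has applied up to that index. The code below is a simplified model, not etcd's implementation; `LeaderState`, `linearizable_read`, and the `heartbeat_acks` input are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class LeaderState:
    cluster_size: int
    commit_index: int   # highest log index known to be committed
    applied_index: int  # highest log index applied to the KV state machine
    kv: dict = field(default_factory=dict)


def linearizable_read(leader, key, heartbeat_acks):
    """Serve a read without appending it to the log (ReadIndex style).

    heartbeat_acks: followers that acknowledged a heartbeat sent after
    the read request arrived.
    """
    read_index = leader.commit_index       # 1. remember the commit index
    quorum = leader.cluster_size // 2 + 1
    if heartbeat_acks + 1 < quorum:        # 2. the leader counts itself
        raise RuntimeError("leadership not confirmed by a majority")
    if leader.applied_index < read_index:  # 3. state machine must catch up
        raise RuntimeError("state machine behind read index, retry")
    return leader.kv.get(key)


leader = LeaderState(cluster_size=5, commit_index=12, applied_index=12, kv={"x": "1"})
print(linearizable_read(leader, "x", heartbeat_acks=2))  # 1
```

A real server would wait for the apply step rather than raise, but the ordering of the three checks is the point: a leader isolated in a minority fails step 2 and cannot return stale data.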
Limitations: Write throughput is bounded by the single leader (~10K writes/sec). Cross-region latency: consensus needs a round trip to a majority (100ms+ cross-region). Large values: etcd is optimized for small key-value pairs (the default request size limit is about 1.5MB). Cluster size: typically 3 or 5 nodes, since more nodes slow consensus.
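The cross-region cost follows from the quorum rule: the leader counts its own copy, so a write commits when the fastest quorum-minus-one followers have replied. A rough sketch with assumed round-trip times (the RTT values are invented for illustration, and disk sync time is ignored):

```python
def commit_latency_ms(follower_rtts_ms):
    """Approximate time for the leader to collect a majority of ACKs.

    The leader counts itself, so it waits for the slowest of the
    fastest (quorum - 1) followers.
    """
    cluster_size = len(follower_rtts_ms) + 1
    needed = cluster_size // 2 + 1 - 1  # follower ACKs required
    if needed == 0:
        return 0
    return sorted(follower_rtts_ms)[needed - 1]


same_region = [1, 1, 2, 2]         # all four followers nearby
one_per_region = [2, 70, 120, 150] # followers spread across regions
print(commit_latency_ms(same_region))     # 1
print(commit_latency_ms(one_per_region))  # 70
```

With five nodes the second-fastest follower sets the write latency, so spreading nodes one per region pushes every write to a cross-region round trip.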
Real-world: Kubernetes keeps all cluster state in etcd. CockroachDB runs a Raft group per data range. TiKV uses Raft for its distributed KV store. Consul uses Raft for its service catalog. Kafka KRaft replaces ZooKeeper with a Raft-based controller.

Interview Cheat Sheet

The 7 things to say for consensus/Raft design

1. Majority quorum (2f+1) · 5 nodes tolerate 2 failures. Only one partition can have a majority.
2. Leader handles all writes · single point of serialization, no conflicts
3. Random election timeout (150-300ms) · makes split votes unlikely, so one candidate usually wins quickly
4. Log completeness · candidate must have all committed entries to win election (no data loss)
5. Committed = majority ACK · entry is durable once 3/5 nodes have it
6. Heartbeats every 100ms · followers know leader is alive, no unnecessary elections
7. Fencing tokens (term numbers) · stale leader can't make progress (monotonically increasing term)