How does etcd achieve consensus with Raft?

🎯 Design a distributed KV store: Raft consensus, leader election, split-brain prevention, majority quorum

Concepts Involved

Consensus Fault Tolerance Leader Election Replication Consistency

Problem Statement

How does a distributed key-value store achieve consensus across 5 nodes, handling leader election, log replication, and split-brain prevention requiring majority quorum (3 of 5) for all decisions?

Core challenge: 5 nodes must agree on the same sequence of operations. A node crashes. A network partition splits the cluster. How do you guarantee all surviving nodes have the same data, elect exactly one leader, and prevent split-brain✗

5 Nodes

typical cluster

tolerates 2 failures

1 Leader

at any time

all writes through leader

Majority

quorum (3/5)

for all decisions

<10ms

write latency

same-region cluster

Raft Consensus · How It Works

Leader election → log replication → safety guarantee

Phase	Mechanism	Guarantee
Leader Election	Follower times out (150-300ms random) → becomes candidate → requests votes → majority grants → becomes leader	Exactly one leader per term. Random timeout prevents split vote.
Log Replication	Leader appends entry → sends AppendEntries to all followers → majority ACK → entry committed	Committed entries never lost. All nodes converge to same log.
Safety	Candidate must have all committed entries to win election (log completeness)	New leader always has all committed data. No data loss on failover.
Heartbeat	Leader sends empty AppendEntries every 100ms	Followers know leader is alive. No unnecessary elections.
Membership Change	Joint consensus: old + new config both must agree during transition	Safe add/remove nodes without split-brain during reconfiguration.

Split-brain prevention: Majority quorum (3/5) means at most one partition can have a majority. The minority partition cannot elect a leader (can't get 3 votes). Stale leader in minority detects it lost majority → steps down. Fencing tokens (monotonic term numbers) prevent stale leaders from making progress.

etcd specifics: Backbone of Kubernetes (stores all cluster state). Watch API · clients subscribe to key changes (efficient notification). Lease-based TTL · keys auto-expire (used for leader election, service discovery). Linearizable reads via ReadIndex (leader confirms it's still leader before serving read).

Limitations: Write throughput limited by leader (~10K writes/sec). Cross-region latency · consensus requires RTT to majority (100ms+ cross-region). Large values · etcd optimized for small KV (<1.5MB per value). Cluster size · 3 or 5 nodes (more = slower consensus).

Real-world: Kubernetes · all cluster state in etcd. CockroachDB · Raft per data range. TiKV · Raft for distributed KV. Consul · Raft for service catalog. Kafka KRaft · replacing ZooKeeper with Raft controller.

Interview Cheat Sheet

The 7 things to say for consensus/Raft design

1. Majority quorum (2f+1) · 5 nodes tolerates 2 failures. Only one partition can have majority.
2. Leader handles all writes · single point of serialization, no conflicts
3. Random election timeout (150-300ms) · prevents split vote, ensures one candidate wins
4. Log completeness · candidate must have all committed entries to win election (no data loss)
5. Committed = majority ACK · entry is durable once 3/5 nodes have it
6. Heartbeats every 100ms · followers know leader is alive, no unnecessary elections
7. Fencing tokens (term numbers) · stale leader can't make progress (monotonically increasing term)

System Design Case Study

Problem Statement

Raft Consensus · How It Works

Interview Cheat Sheet