System Design Case Study

How does Meta's TAO invalidate cached objects across 1000+ servers within 1 second?

? Design a cache invalidation system: 1000+ servers, <1s propagation, no thundering herd
Concepts Involved

Problem Statement

How does a social platform invalidate cached objects across 1000+ servers within 1 second of a write, preventing stale reads while avoiding thundering herd on the backing store?

Core challenge: A user updates their profile photo. That photo URL is cached on 1000+ servers worldwide. You must invalidate all copies within 1 second without causing 1000 simultaneous DB reads (thundering herd).
1000+
cache servers
globally distributed
<1s
invalidation latency
write ? all caches cleared
10B+
cache reads/sec
99.9% hit rate target
0
thundering herd
single refill per key

Functional Requirements

What the system must do

Must Have

1. On write to DB, invalidate all cached copies within 1 second
2. Prevent thundering herd · only one request refills cache per key
3. Support graph-structured data (objects + associations)
4. Maintain 99.9%+ cache hit rate despite invalidations
5. Handle cross-region invalidation with bounded staleness
6. Support lease-based protection against stale sets

Out of Scope

? Full database design (focus on cache layer)
? Application-level business logic
? User authentication and authorization
? CDN for static assets (separate system)

Non-Functional Requirements

Quality constraints shaping architecture

PropertyTargetDesign Impact
Invalidation Latency<1 second globallyPub/sub invalidation bus (not polling). McRouter + mcrouter-based fanout.
Hit Rate99.9%+Tiered caching (L1 local + L2 regional). Warm cache on startup.
Throughput10B+ reads/secMemcached clusters per region. Consistent hashing for key distribution.
ConsistencyRead-after-write for writerWriter reads from DB directly after write. Others get eventual consistency.
Availability99.99%Cache miss falls through to DB. Gutter pool absorbs failed nodes.

High-Level Architecture

Meta's TAO: a graph-aware distributed cache with invalidation

? TAO Architecture · Write Path & Invalidation
LAYER 1: WRITE PATH · App writes to DB ? Publish invalidation (DELETE key, not UPDATE) ? Invalidation Bus Client Write update profile photo change relationship post new content write request ? leader TAO Leader Cache write-through to MySQL DELETE key locally (not UPDATE) publish invalidation event lease-based stale-set protection writer reads from DB directly Invalidation Bus pub/sub: key invalidated events fan-out to ALL follower caches message: DELETE(key) + version <1s propagation globally cross-region via dedicated channel MySQL source of truth graph data (TAO) objects + edges write-through LAYER 2: CACHE TOPOLOGY · Leader ? Followers (async <1s) + Lease mechanism (prevents stale-set) Leader Cache receives invalidations first serves reads (low latency) issues lease tokens on miss revokes lease on invalidation 1 leader per shard per region <1s async Follower 1 delete(key) Follower 2 delete(key) Follower N delete(key) Lease Mechanism (Stale-Set Protection) 1. On cache miss ? issue lease token (SETNX-like, ~10s TTL) 2. Only lease holder can SET the value in cache 3. If invalidation arrives ? lease REVOKED (prevents stale data) 4. Other requesters wait for lease result (no thundering herd) Result: 1 DB read per key per invalidation, not 1000 LAYER 3: READ PATH · Cache hit ? serve | Cache miss ? Lease ? DB query ? SET ? Serve (thundering herd protected) Reader GET object/edge billions/sec Follower Cache HIT ? return immediately MISS ? request lease from leader miss Lease Issued holder fetches from DB SET with lease token ? serve Thundering Herd Protection 1000 concurrent misses on same key only 1 goes to DB (lease holder) Gutter Pool (Failure Handling) failed cache node ? traffic goes to backup pool (not DB) ? prevents cascade failure Why DELETE not UPDATE? DELETE is commutative (order-independent) · concurrent writes can't leave stale data. Next read refills from DB. DELETE not UPDATE (commutative, order-independent) | Lease prevents stale-set race | Leader/follower: <1s propagation | 10B+ reads/sec 1000+ cache servers | 99.9% hit rate | Gutter pool absorbs failures | Only 1 DB read per key per invalidation (lease-based)

Key Design Decisions

The architectural choices that make this work at Meta's scale

DecisionChoiceWhyAlternative Rejected
Invalidation strategyDelete on write (not update)Avoids race conditions between concurrent writesWrite-through (complex ordering)
Thundering herdLease-based (token on first miss)Only 1 request hits DB per key per invalidationMutex lock (doesn't scale across servers)
PropagationPub/sub invalidation bus<1s fanout to all followersPolling (too slow), TTL-only (too stale)
TopologyLeader-follower per regionWrites go to leader, reads from local followerFlat (no write coordination)
Failure handlingGutter pool (backup cache tier)Failed node traffic goes to gutter, not DBDirect DB fallback (overloads DB)
Data modelGraph-aware (objects + edges)Social data is naturally a graphKey-value only (loses relationship semantics)
Why DELETE not UPDATE: If two writes happen concurrently (W1 sets value=A, W2 sets value=B), update-based invalidation can leave cache with stale value if messages arrive out of order. DELETE is commutative · order doesn't matter, next read always gets fresh data from DB.
Lease mechanism: On cache miss, server issues a lease token (short-lived, ~10s). Only the lease holder can SET the value. If an invalidation arrives while lease is active, the lease is revoked · preventing stale data from being cached. This solves both thundering herd AND stale-set race conditions.
Tradeoffs: Eventual consistency · followers may serve stale data for up to 1 second after write. Complexity · leader/follower topology adds operational overhead. Cross-region latency · invalidation across regions takes longer (bounded by network RTT).
Real-world: Meta TAO · serves 10B+ reads/sec for social graph. Netflix EVCache · similar lease-based thundering herd protection. Twitter · Manhattan cache with pub/sub invalidation. Memcached at Facebook paper (2013) · foundational design for this pattern.

Interview Cheat Sheet

The 8 things to say for cache invalidation design

1. DELETE on write, not UPDATE · commutative, order-independent, prevents stale-set races
2. Lease mechanism · only lease holder can SET value, revoked on invalidation (prevents thundering herd + stale)
3. Leader/follower cache topology · writes go to leader, async replicate to followers (<1s)
4. Pub/Sub invalidation bus · write triggers invalidation event ? all cache servers subscribe
5. Thundering herd protection · on miss, only 1 request goes to DB (others wait for lease result)
6. Eventual consistency accepted · followers may serve stale for ~1s (acceptable for social data)
7. Version/generation numbers · detect stale writes (only accept SET if version matches)
8. Cross-region: async replication · invalidation propagates across regions via dedicated channel