How does a feature flag platform evaluate 1M+ flag checks per second with P99 <5ms latency, support complex targeting rules (user segments, percentages, custom attributes), and propagate flag changes to 100K+ servers within 10 seconds?
Core challenge: Flag evaluation must be local and fast (no network call per check), but changes must propagate globally in seconds. Complex targeting rules (10% of users in segment X with attribute Y) need efficient evaluation without per-request API calls.
1M+ flag checks/sec
P99 <5ms evaluation latency
10s global propagation
100K+ connected servers
Architecture
Local evaluation: The SDK downloads all flag rules at startup and keeps them in memory. Every flag check is a local computation: hash the user context against targeting rules. No network call per evaluation → sub-millisecond P99.
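A minimal sketch of the in-process rule store described above, assuming the rules arrive as a plain dictionary keyed by flag name. `FlagStore`, `replace_all`, and the rule fields are hypothetical names chosen for illustration, not any particular SDK's API.

```python
import threading
from typing import Any, Dict, Optional


class FlagStore:
    """In-memory snapshot of flag rules held inside the application process."""

    def __init__(self) -> None:
        self._flags: Dict[str, Dict[str, Any]] = {}
        self._lock = threading.Lock()  # serializes writers only

    def replace_all(self, flags: Dict[str, Dict[str, Any]]) -> None:
        # Copy the input and swap the reference in one assignment, so a
        # reader sees either the old snapshot or the new one, never a mix.
        with self._lock:
            self._flags = dict(flags)

    def get(self, key: str) -> Optional[Dict[str, Any]]:
        # Plain dictionary lookup: no I/O on the evaluation path.
        return self._flags.get(key)


# Hypothetical rule shape, loaded once at startup.
store = FlagStore()
store.replace_all({"new-checkout": {"on": True, "default": "control"}})
```

Because `get` touches only process memory, the cost of a check is a hash-table lookup plus rule matching, which is what makes the per-check budget achievable without a network hop.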
Propagation via streaming: Flag changes are pushed through an SSE/WebSocket relay layer. Delta updates (only changed flags) minimize bandwidth. Relay proxies handle fan-out to 100K+ connected SDKs. Fallback: poll every 30s if the stream disconnects.
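One way an SDK might apply a delta update is sketched below. The message shape (`base_version`, `version`, `changed`, `deleted`) is an assumption made for this example; the point is that a delta is only safe to apply on top of the exact version it was computed against, and anything else should trigger a full re-sync (for instance via the polling fallback).

```python
from typing import Any, Dict, Tuple

Flags = Dict[str, Dict[str, Any]]


def apply_delta(flags: Flags, current_version: int,
                delta: Dict[str, Any]) -> Tuple[Flags, int, bool]:
    """Return (new_flags, new_version, needs_full_sync)."""
    if delta["base_version"] != current_version:
        # An update was missed (e.g. the stream dropped): keep the current
        # state and signal that a full download is needed.
        return flags, current_version, True

    updated = dict(flags)                      # copy, so readers keep a stable snapshot
    updated.update(delta.get("changed", {}))   # upsert changed flags
    for key in delta.get("deleted", []):
        updated.pop(key, None)                 # drop removed flags
    return updated, delta["version"], False
```

Returning a new dictionary instead of mutating in place lets the caller swap the whole snapshot at once, so evaluations running concurrently never observe a half-applied update.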
Anti-patterns: An API call per flag check (adds latency, creates a single point of failure). Polling-only updates (flags can be stale for minutes). No fallback defaults (the service crashes if the flag service is down). Unbounded rule complexity (evaluation time explodes).
Targeting rules: Evaluated in order: individual targets → segment rules → percentage rollout → default. Percentage rollout uses deterministic hashing (the same user always gets the same variant), which gives a consistent experience across requests.
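The evaluation order above can be sketched as a single function. The rule layout (`targets`, `segment_rules`, `rollout`, `default`) is a hypothetical schema for illustration. SHA-256 is used for bucketing because Python's built-in `hash()` is randomized per process for strings, so it would not give the same user the same bucket on every server; the `:` separator between flag key and user ID is also a choice made here, not something the text prescribes.

```python
import hashlib
from typing import Any, Dict


def bucket(user_id: str, flag_key: str) -> int:
    """Map (user, flag) to a stable bucket in [0, 100)."""
    digest = hashlib.sha256(f"{flag_key}:{user_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def evaluate(flag: Dict[str, Any], user: Dict[str, Any]) -> str:
    user_id = user["id"]

    # 1. Individual targets: explicit per-user overrides win.
    if user_id in flag.get("targets", {}):
        return flag["targets"][user_id]

    # 2. Segment rules: first rule whose attribute matches.
    for rule in flag.get("segment_rules", []):
        if user.get(rule["attribute"]) in rule["values"]:
            return rule["variant"]

    # 3. Percentage rollout: deterministic bucket below the threshold.
    rollout = flag.get("rollout")
    if rollout and bucket(user_id, flag["key"]) < rollout["percent"]:
        return rollout["variant"]

    # 4. Default variant.
    return flag["default"]
```

Including the flag key in the hash input means a user who lands in the first 10% for one flag is not automatically in the first 10% for every other flag.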
Scale Estimation
| Step | Derivation | Result | Design Impact |
|---|---|---|---|
| 1 | Flag checks: 1M/sec across all services | 1M evaluations/sec | Must be local (in-process): no network call per check |
| 2 | Flags per project: ~500 flags × 10 rules each | ~5K rules in memory | Fits in ~10MB RAM per SDK instance (trivial) |
| 3 | Connected SDKs: 100K servers × 1 connection each | 100K persistent connections | Relay proxy layer for fan-out (not direct to origin) |
| 4 | Propagation: change → relay → all SDKs | <10s end-to-end | SSE streaming with delta updates (only changed flags) |
| 5 | Event analytics: 1M checks/sec × 100 bytes | ~100 MB/sec telemetry | Sampled (1-10%) + aggregated client-side before sending |
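
The arithmetic behind steps 2 and 5 can be checked directly, and the same snippet sketches what client-side aggregation might look like: counting evaluations per (flag, variant) and sending only the counts. `EvaluationSummary` and its payload shape are hypothetical names for illustration.

```python
from collections import Counter
from typing import Dict, List, Tuple

# Step 2: rules held in memory per project.
total_rules = 500 * 10                                   # 5,000 rules

# Step 5: raw telemetry volume if every check emitted a 100-byte event.
CHECKS_PER_SEC = 1_000_000
BYTES_PER_EVENT = 100
raw_mb_per_sec = CHECKS_PER_SEC * BYTES_PER_EVENT // 1_000_000   # 100 MB/sec


def sampled_mb_per_sec(sample_percent: int) -> int:
    """Telemetry volume when only sample_percent of checks emit an event."""
    sampled_events = CHECKS_PER_SEC * sample_percent // 100
    return sampled_events * BYTES_PER_EVENT // 1_000_000


class EvaluationSummary:
    """Aggregates evaluations in-process so one flush replaces many events."""

    def __init__(self) -> None:
        self._counts: Counter = Counter()

    def record(self, flag_key: str, variant: str) -> None:
        self._counts[(flag_key, variant)] += 1

    def flush(self) -> List[Dict[str, object]]:
        # Emit one row per (flag, variant) pair, then reset the counters.
        payload = [
            {"flag": flag, "variant": variant, "count": count}
            for (flag, variant), count in self._counts.items()
        ]
        self._counts.clear()
        return payload
```

With aggregation, the payload size grows with the number of distinct (flag, variant) pairs seen in a flush interval rather than with the number of checks.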
Resilience & Edge Cases
| Failure | Impact | Recovery |
|---|---|---|
| Stream disconnected | SDK can't receive flag updates | Use last-known state (in-memory cache). Fallback: poll every 30s. Hardcoded defaults as last resort. |
| Flag service completely down | No updates propagate | SDK continues with cached rules indefinitely. App never crashes due to flag service outage. |
| Bad flag rule deployed | Feature broken for all users | Kill switch: emergency flag override. Instant propagation via streaming. Audit log for rollback. |
| Percentage rollout inconsistency | User sees different variant on different requests | Deterministic hash: hash(user_id + flag_key) % 100, always the same result for the same user. |
| Stale SDK cache after deploy | New service instance has no flags | SDK fetches full flag set on startup (blocking init). Ready to serve only after initial load. |
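
The degradation path in the recovery column (last-known state first, hardcoded default as last resort) might look roughly like this. `SafeFlagClient` and the injected `fetch_all` callable are hypothetical; flags are simplified to a key-to-value mapping to keep the focus on failure handling.

```python
from typing import Any, Callable, Dict, Optional


class SafeFlagClient:
    """Serves flags from cache and never raises on a flag service outage."""

    def __init__(self, fetch_all: Callable[[], Dict[str, Any]]) -> None:
        self._fetch_all = fetch_all               # e.g. an HTTP call in a real SDK
        self._flags: Optional[Dict[str, Any]] = None

    def refresh(self) -> bool:
        """Try a full fetch; on failure keep the last-known state."""
        try:
            self._flags = self._fetch_all()
            return True
        except Exception:
            # Flag service unreachable: the cached snapshot stays untouched.
            return False

    def variation(self, key: str, default: Any) -> Any:
        # The caller-supplied default is used only when nothing is cached
        # for this key, including before the first successful fetch.
        if self._flags is None or key not in self._flags:
            return default
        return self._flags[key]
```

Requiring a default at every call site is what guarantees the application has an answer even if it starts while the flag service is down.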
Interview Cheat Sheet
1. In-process evaluation: SDK holds rules locally, no network call per check (<1ms)
2. Streaming propagation: SSE/WebSocket push, delta updates, 10s global propagation
3. Deterministic hashing: hash(user_id + flag) % 100 for consistent percentage rollouts
4. Rule evaluation order: individual → segment → percentage → default
5. Graceful degradation: SDK uses last-known state if stream disconnects, hardcoded defaults as last resort
6. Relay proxy: intermediate cache for high fan-out, reduces load on origin