How does Uber process 1M+ ride events/sec with exactly-once?

🎯 Design an event streaming platform: 1M+ events/sec, city-level locality, exactly-once, real-time surge pricing

Concepts Involved

Kafka Partitioning · Event Sourcing · Stream Processing · Idempotency

Problem Statement

How does a ride-hailing platform process 1M+ ride events per second with city-level locality, guaranteeing exactly-once processing semantics and computing real-time surge pricing from the event stream?

Core challenge: Every ride generates 10+ events (requested, matched, en-route, arrived, started, GPS × N, completed, rated). At Uber's scale, that's millions of events/sec that must be processed exactly once, partitioned by city, and fed into real-time pricing models.

1M+ events / second · peak during rush hour

Exactly-once processing guarantee · no double-charge, no lost ride

<10s surge price update · supply/demand per zone

600+ cities worldwide · city-level partitioning

Functional Requirements

What the system must do

Must Have (Core)

1. Ingest ride lifecycle events (requested → matched → started → GPS → completed → rated)
2. Guarantee exactly-once processing · no duplicate charges, no lost events
3. Partition events by city/zone for locality-aware processing
4. Compute real-time surge pricing per zone (supply/demand ratio) within 10 seconds
5. Feed downstream consumers: billing, analytics, ML models, driver payouts
6. Support event replay for debugging and reprocessing after bug fixes

Out of Scope

✗ Driver-rider matching algorithm (separate service)
✗ GPS routing and ETA calculation
✗ Payment processing (downstream consumer)
✗ Push notifications to riders/drivers
✗ Historical analytics (batch pipeline)

Non-Functional Requirements

Quality constraints shaping architecture

Property	Target	Design Impact
Throughput	1M+ events/sec sustained, 3M peak	Kafka with 10K+ partitions, multi-cluster
Latency	<10s for surge pricing update	Flink streaming with tumbling windows
Exactly-once	Zero duplicates in downstream state	Kafka transactions + idempotent producers + Flink checkpoints
Ordering	Per-ride ordering (not global)	Partition key = ride_id ensures all events for one ride are ordered
Durability	Zero event loss after producer ACK	acks=all, min.insync.replicas=2, RF=3
Availability	99.99% · rides can't stop	Multi-AZ Kafka, cross-region replication for DR
Scalability	Linear scale with city count	City-based topic partitioning, independent consumer groups per use case
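
The Durability row can be read as a small set of client and topic settings. The sketch below is illustrative only: the property names (acks, enable.idempotence, transactional.id, min.insync.replicas) are standard Kafka configs and the values come from the table, while the transactional.id value and the helper function are hypothetical.

```python
# Producer-side settings for "zero event loss after producer ACK".
producer_config = {
    "acks": "all",                 # wait for all in-sync replicas
    "enable.idempotence": True,    # broker drops duplicate retries
    "transactional.id": "trip-service-producer-1",  # hypothetical value
}

# Topic-side settings; the replication factor is chosen at topic creation.
topic_settings = {
    "replication_factor": 3,
    "configs": {"min.insync.replicas": "2"},
}


def tolerated_broker_failures(replication_factor: int, min_isr: int) -> int:
    """Replicas that can be down while acks=all writes are still accepted."""
    return replication_factor - min_isr


failures_ok = tolerated_broker_failures(
    topic_settings["replication_factor"],
    int(topic_settings["configs"]["min.insync.replicas"]),
)
```

With RF=3 and min.insync.replicas=2, one replica can be offline and writes continue; if two are offline, acks=all writes are rejected rather than silently under-replicated.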

High-Level Architecture

Event-driven pipeline: produce → partition → process → materialize

Key Design Decisions

The architectural choices that make this work at Uber's scale

Decision	Choice	Why	Alternative Rejected
Partition Key	ride_id for ride events, city_zone for GPS	All events for one ride land in same partition → ordered	Random (loses ordering), user_id (hot partition for frequent riders)
Exactly-Once	Kafka transactions + Flink checkpoints	Exactly-once across Kafka and Flink state; non-transactional sinks (e.g. Redis) rely on idempotent writes	At-least-once + app dedup (complex, error-prone)
Surge Computation	Flink tumbling window (10s)	Real-time supply/demand ratio per zone	Batch (too slow), polling DB (doesn't scale)
Multi-City	Topic-per-city or city prefix in partition key	City-level isolation, independent scaling	Single global topic (cross-city interference)
Schema	Avro + Schema Registry	Backward-compatible evolution, compact binary	JSON (no schema enforcement), Protobuf (less Kafka ecosystem support)
Retention	7 days in Kafka, then S3 (Parquet)	Replay window for bug fixes + cold storage for analytics	Infinite retention (cost), short retention (can't replay)
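
To see why keying by ride_id preserves per-ride order, here is a minimal sketch of key-based partitioning. It uses CRC32 as a stand-in hash (Kafka's default Java partitioner uses murmur2 for keyed records), and the partition count of 12 and the sample ride IDs are assumptions for illustration.

```python
import zlib


def partition_for(key: str, num_partitions: int) -> int:
    """Map a key to a partition with a stable hash (CRC32 stand-in)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


NUM_PARTITIONS = 12  # assumed for the example

events = [
    ("ride-42", "RIDE_REQUESTED"),
    ("ride-7", "RIDE_REQUESTED"),
    ("ride-42", "DRIVER_MATCHED"),
    ("ride-42", "TRIP_COMPLETED"),
]

# Each partition is an append-only list, like a Kafka partition log.
partitions: dict[int, list[tuple[str, str]]] = {}
for ride_id, event_type in events:
    p = partition_for(ride_id, NUM_PARTITIONS)
    partitions.setdefault(p, []).append((ride_id, event_type))

ride_42_log = [
    e for r, e in partitions[partition_for("ride-42", NUM_PARTITIONS)]
    if r == "ride-42"
]
```

Because the same key always hashes to the same partition, every event for ride-42 lands in one log in produce order, regardless of how other rides interleave.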

Surge pricing algorithm: Every 10 seconds, Flink counts active ride requests (demand) and available drivers (supply) per H3 hex zone. Ratio > 1.5 → surge multiplier applied. Smoothing prevents oscillation (exponential moving average). Results written to Redis → rider app reads on next request.
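
The surge computation described above can be sketched in plain Python, without Flink or H3. The 10s window and the 1.5 threshold come from the text; the mapping from ratio to multiplier, the cap of 3.0, the smoothing factor, and the event tuple shape are assumptions made for the example.

```python
from collections import Counter

WINDOW_MS = 10_000        # 10 s tumbling window, from the text
SURGE_THRESHOLD = 1.5     # ratio above which surge applies, from the text
MAX_MULTIPLIER = 3.0      # assumed cap, not from the source
ALPHA = 0.3               # assumed EMA smoothing factor


def window_start(ts_ms: int) -> int:
    """Start of the tumbling window that contains ts_ms."""
    return ts_ms - (ts_ms % WINDOW_MS)


def count_window(events, start_ms):
    """Count demand and supply per zone for one window.

    events: iterable of (ts_ms, zone, kind),
    where kind is "request" or "driver_available".
    """
    demand, supply = Counter(), Counter()
    for ts_ms, zone, kind in events:
        if window_start(ts_ms) != start_ms:
            continue
        if kind == "request":
            demand[zone] += 1
        elif kind == "driver_available":
            supply[zone] += 1
    return demand, supply


def raw_multiplier(demand: int, supply: int) -> float:
    """Assumed mapping: multiplier equals the ratio once it exceeds the threshold."""
    ratio = demand / max(supply, 1)
    if ratio <= SURGE_THRESHOLD:
        return 1.0
    return min(ratio, MAX_MULTIPLIER)


def smooth(previous: float, current: float, alpha: float = ALPHA) -> float:
    """Exponential moving average to damp oscillation between windows."""
    return alpha * current + (1 - alpha) * previous
```

A lower ALPHA reacts more slowly to spikes but oscillates less; the right value depends on how quickly supply responds to price.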

Real-world scale: Uber's Kafka processes 4+ trillion messages/day across multiple clusters. Each city has independent consumer groups. GPS events alone = ~1.25M updates/sec globally (5M drivers, one update every 4 seconds).

Failure modes: Consumer lag → surge pricing stale → auto-scale consumers via KEDA. Kafka broker failure → ISR takes over, no data loss. Flink checkpoint failure → restart from last checkpoint, reprocess (idempotent sinks handle duplicates).
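
The phrase "idempotent sinks handle duplicates" is the part that makes checkpoint replay safe. A minimal in-memory sketch, assuming events carry the event_id and ride_id fields from the schema; the class name and the unbounded seen-set are illustrative (a real sink would bound or expire it).

```python
class IdempotentRideSink:
    """Applies each event_id at most once and upserts state by ride_id."""

    def __init__(self) -> None:
        self.seen_event_ids: set[str] = set()
        self.ride_status: dict[str, str] = {}
        self.charges = 0

    def apply(self, event: dict) -> bool:
        """Return True if the event changed state, False if it was a duplicate."""
        if event["event_id"] in self.seen_event_ids:
            return False
        self.seen_event_ids.add(event["event_id"])
        self.ride_status[event["ride_id"]] = event["event_type"]  # upsert
        if event["event_type"] == "PAYMENT_CHARGED":
            self.charges += 1
        return True


batch = [
    {"event_id": "e1", "ride_id": "r1", "event_type": "TRIP_COMPLETED"},
    {"event_id": "e2", "ride_id": "r1", "event_type": "PAYMENT_CHARGED"},
]

sink = IdempotentRideSink()
first_pass = [sink.apply(e) for e in batch]
# Simulate a restart from the last checkpoint: the same batch is replayed.
replay_pass = [sink.apply(e) for e in batch]
```

After the replay the rider has still been charged once, which is the property the failure-mode description depends on.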

Scale Estimation

Back-of-envelope math · derive infrastructure from given numbers

Given: 1M events/sec peak · 600 cities · 5M active drivers (GPS every 4s) · 10 event types per ride · avg 15M rides/day

Step	Derivation	Result	Design Impact
1	GPS updates: 5M drivers ÷ 4s interval	1.25M GPS/sec	Separate topic for GPS (high volume, lower criticality)
2	Ride events: 15M rides × 10 events ÷ 86,400s	~1,700 ride events/sec avg	Much lower than GPS, but each is critical (exactly-once)
3	Peak ride events: 1,700 × 5 (peak factor)	~8,500 ride events/sec peak	Kafka handles easily; GPS is the real throughput challenge
4	Total events: 1.25M GPS + 8.5K ride + misc	~1.3M events/sec peak	Need 10K+ Kafka partitions across multiple clusters
5	Kafka partitions: 1.3M ÷ 5K msg/partition/sec	~260 partitions minimum	Use 1000+ for headroom and city-level isolation
6	Storage: 1.3M × 1KB avg × 86,400s × 7 days	~786 TB retention (before replication)	Tiered storage: 7d hot (SSD) → S3 cold (Parquet)
7	Flink workers for surge: 600 cities × 1 worker	~600 Flink task slots	Each city independently computes surge every 10s
8	Kafka brokers: 1.3M msg/sec ÷ 100K/broker	~13 brokers minimum	Use 30+ for replication (RF=3) and headroom
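
The table's arithmetic can be checked with a few lines of Python. The inputs are the "Given" figures; the per-partition rate (5K msg/sec), per-broker rate (100K msg/sec), and 1KB average event size are the table's own planning assumptions, and the variable names are mine.

```python
SECONDS_PER_DAY = 86_400

drivers = 5_000_000
gps_interval_s = 4
rides_per_day = 15_000_000
events_per_ride = 10
peak_factor = 5

gps_per_sec = drivers // gps_interval_s                              # 1,250,000
ride_events_avg = rides_per_day * events_per_ride / SECONDS_PER_DAY  # ~1,736
ride_events_peak = ride_events_avg * peak_factor                     # ~8,680 (table rounds to 8,500)

total_peak = 1_300_000               # table's rounded total, events/sec
partitions_min = total_peak // 5_000     # 5K msg/partition/sec
brokers_min = total_peak // 100_000      # 100K msg/broker/sec

avg_event_bytes = 1_000
retention_days = 7
storage_tb = total_peak * avg_event_bytes * SECONDS_PER_DAY * retention_days / 1e12
storage_tb_replicated = storage_tb * 3   # RF=3

print(f"GPS/sec: {gps_per_sec:,}")
print(f"Ride events/sec avg: {ride_events_avg:,.0f}")
print(f"Partitions (min): {partitions_min}, brokers (min): {brokers_min}")
print(f"Storage: {storage_tb:.0f} TB, with RF=3: {storage_tb_replicated:.0f} TB")
```

Note that the ~786 TB figure is for a single copy; with RF=3 the hot tier holds roughly three times that.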

APIs

Event schema + producer/consumer interfaces

// --- Ride Event Schema (Avro) ---
{
  "type": "record",
  "name": "RideEvent",
  "namespace": "com.uber.events",
  "fields": [
    {"name": "event_id",    "type": "string"},       // UUID, idempotency key
    {"name": "ride_id",     "type": "string"},       // partition key
    {"name": "event_type",  "type": {"type": "enum", "symbols": [
      "RIDE_REQUESTED", "DRIVER_MATCHED", "DRIVER_EN_ROUTE",
      "RIDER_PICKED_UP", "TRIP_IN_PROGRESS", "GPS_UPDATE",
      "TRIP_COMPLETED", "FARE_CALCULATED", "PAYMENT_CHARGED", "RATING_SUBMITTED"
    ]}},
    {"name": "timestamp",   "type": "long"},         // epoch millis
    {"name": "city_id",     "type": "string"},       // for routing
    {"name": "driver_id",   "type": ["null","string"], "default": null},  // null until a driver is matched
    {"name": "rider_id",    "type": "string"},
    {"name": "location",    "type": ["null", {"type": "record", "name": "Location",
      "fields": [{"name":"lat","type":"double"},{"name":"lng","type":"double"}]
    }], "default": null},                                // null for events without a position
    {"name": "metadata",    "type": {"type": "map", "values": "string"}, "default": {}}
  ]
}

// --- Producer (Trip Service) ---
POST /internal/events/publish
Headers: X-Idempotency-Key: {event_id}
Body: { ride_id, event_type, ... }
→ Kafka produce with key=ride_id, transactional

// --- Consumer (Surge Pricing) ---
Kafka Consumer Group: "surge-pricing-service"
Topic: "uber.ride-events" (filtered: RIDE_REQUESTED, TRIP_COMPLETED)
Processing: Flink tumbling window (10s) → count per H3 zone → write Redis

// --- Surge Query API ---
GET /api/surge?lat={lat}&lng={lng}
→ Redis HGET surge:{h3_cell} multiplier
Response: { "multiplier": 1.8, "zone": "h3_abc123", "expires_in_sec": 10 }

Data Model

Hot path (Kafka + Redis) for real-time · Cold path (Cassandra + S3) for state and analytics

Store	Data	Access Pattern	Retention
Kafka	All raw events (immutable log)	Sequential read by consumer groups	7 days hot, then tiered to S3
Redis	Surge multipliers per zone	HGET surge:{h3_cell} multiplier · <1ms	TTL 10s (auto-refresh by Flink)
Redis	Active ride state machine	HGET ride:{ride_id} status	TTL 24h (ride lifecycle)
Cassandra	Completed ride history	Query by rider_id or driver_id + time range	Forever (compliance)
S3 (Parquet)	Archived events for analytics	Batch queries via Spark/Presto	Years (data lake)
Schema Registry	Avro schemas (versioned)	Lookup by schema_id on deserialize	Forever (all versions kept)

Ride state machine (Redis): Each ride is a hash: {status, driver_id, rider_id, pickup_lat, pickup_lng, started_at, ...}. State transitions are atomic (Lua script): only valid transitions allowed (e.g., can't go from COMPLETED → EN_ROUTE). Flink updates state on each event.
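
The transition check can be sketched in memory as follows. This is a stand-in for the Lua script, not the script itself: in Redis the atomicity comes from running the check and the write inside one script, which a plain Python dict does not give you. The status names follow the lifecycle listed earlier; the exact transition table is an assumption.

```python
# Allowed next statuses for each current status (assumed table).
VALID_TRANSITIONS = {
    "REQUESTED": {"MATCHED"},
    "MATCHED": {"EN_ROUTE"},
    "EN_ROUTE": {"ARRIVED"},
    "ARRIVED": {"STARTED"},
    "STARTED": {"COMPLETED"},
    "COMPLETED": {"RATED"},
    "RATED": set(),
}


def apply_transition(ride: dict, new_status: str) -> bool:
    """Update ride['status'] only if the transition is allowed."""
    allowed = VALID_TRANSITIONS.get(ride["status"], set())
    if new_status not in allowed:
        return False
    ride["status"] = new_status
    return True


ride = {"status": "STARTED", "driver_id": "d1", "rider_id": "u1"}
completed_ok = apply_transition(ride, "COMPLETED")   # valid
rollback_ok = apply_transition(ride, "EN_ROUTE")     # rejected
```

Rejecting invalid transitions also makes late or replayed events harmless: an old EN_ROUTE event arriving after COMPLETED leaves the state unchanged.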

Resilience & Edge Cases

What breaks and how the system recovers

Failure	Impact	Recovery
Kafka broker dies	Partitions on that broker unavailable for ~10s	ISR replica promoted to leader. Producers retry. Zero data loss (acks=all).
Flink job crashes	Surge pricing stale for affected cities	Restart from last checkpoint (every 30s). Reprocess buffered events. Idempotent Redis writes.
Consumer lag > 5min	Surge pricing outdated → wrong prices	KEDA auto-scales consumers. Alert at 2min lag. Fallback: use last-known surge.
Schema incompatibility	Consumer can't deserialize new events	Schema Registry rejects breaking changes in CI. Backward-compatible only.
Duplicate events (network retry)	Could double-charge rider	Idempotent producer (producer_id + seq). Flink dedup by event_id. Sink upsert by ride_id.
City-wide outage	All events for one city lost	Cross-region Kafka replication (MirrorMaker 2). Failover to DR cluster within 60s.
Poison message	Consumer crashes in loop	DLQ after 3 retries. Alert. Manual inspection. Fix + replay from DLQ.
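
The poison-message row can be sketched as a bounded retry loop that parks failures in a dead-letter list. This reads "3 retries" as three attempts in total, which is an interpretation; the function name and the dead-letter record shape are hypothetical, and a real consumer would publish to a DLQ topic instead of a list.

```python
MAX_ATTEMPTS = 3  # "DLQ after 3 retries", read here as 3 total attempts


def process_with_dlq(messages, handler, max_attempts: int = MAX_ATTEMPTS):
    """Run handler on each message; park messages that keep failing."""
    processed, dead_letters = [], []
    for msg in messages:
        last_error = None
        for _attempt in range(max_attempts):
            try:
                processed.append(handler(msg))
                break
            except Exception as exc:  # broad on purpose: any failure counts
                last_error = exc
        else:
            # Loop finished without a successful break: give up on this message.
            dead_letters.append(
                {"message": msg, "error": repr(last_error), "attempts": max_attempts}
            )
    return processed, dead_letters


def handler(msg: str) -> str:
    if msg == "bad":
        raise ValueError("cannot parse")
    return msg.upper()


processed, dead_letters = process_with_dlq(["a", "bad", "b"], handler)
```

The key property is that one bad message no longer blocks the partition: the consumer moves on to "b" while "bad" waits for manual inspection and replay.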

Tech Stack & Tradeoffs

Why each technology was chosen

Component	Technology	Why Chosen	What Was Rejected
Event Bus	Apache Kafka	1M+ msg/sec, durable, replayable, exactly-once	RabbitMQ (no replay), SQS (no replay; standard queues are unordered), Pulsar (less mature)
Stream Processing	Apache Flink	Exactly-once, event-time windows, checkpointing	Spark Streaming (micro-batch, higher latency), Kafka Streams (simpler but less powerful)
Hot State	Redis Cluster	Sub-ms reads, atomic ops (Lua), TTL	Memcached (no persistence), DynamoDB (higher latency)
Cold Storage	Apache Cassandra	Write-heavy, time-series friendly, linear scale	PostgreSQL (doesn't scale writes), MongoDB (less predictable at scale)
Analytics	S3 + Parquet + Presto	Cheap, columnar, SQL-on-anything	Snowflake (cost at this volume), Redshift (less flexible)
Schema	Avro + Confluent Schema Registry	Compact binary, backward-compatible evolution	Protobuf (less Kafka tooling), JSON (no schema enforcement)

Interview Cheat Sheet

The 8 things to say in a system design interview for this problem

1. Partition by ride_id · guarantees per-ride ordering without global coordination
2. Exactly-once = idempotent producer + Kafka transactions + Flink checkpoints + idempotent sinks
3. Separate topics by event type · GPS (high volume, lossy OK) vs ride events (low volume, critical)
4. Surge pricing via Flink tumbling windows · 10s window, count supply/demand per H3 zone, write to Redis
5. City-level isolation · independent consumer groups per city, no cross-city interference
6. Schema Registry · enforce backward compatibility, reject breaking changes in CI
7. Tiered storage · 7d hot in Kafka, then S3 Parquet for analytics (cost-effective)
8. DLQ + auto-scaling · poison messages isolated, consumer lag triggers KEDA scale-up
