How does Uber process 1M+ ride events per second with exactly-once semantics?
How does a ride-hailing platform process 1M+ ride events per second with city-level locality, guaranteeing exactly-once processing semantics and computing real-time surge pricing from the event stream?
What the system must do
Quality constraints shaping architecture
| Property | Target | Design Impact |
|---|---|---|
| Throughput | 1M+ events/sec sustained, 3M peak | Kafka with 10K+ partitions, multi-cluster |
| Latency | <10s for surge pricing update | Flink streaming with tumbling windows |
| Exactly-once | Zero duplicates in downstream state | Kafka transactions + idempotent producers + Flink checkpoints |
| Ordering | Per-ride ordering (not global) | Partition key = ride_id ensures all events for one ride are ordered |
| Durability | Zero event loss after producer ACK | acks=all, min.insync.replicas=2, RF=3 |
| Availability | 99.99% · rides can't stop | Multi-AZ Kafka, cross-region replication for DR |
| Scalability | Linear scale with city count | City-based topic partitioning, independent consumer groups per use case |
Event-driven pipeline: produce ? partition ? process ? materialize
The architectural choices that make this work at Uber's scale
| Decision | Choice | Why | Alternative Rejected |
|---|---|---|---|
| Partition Key | ride_id for ride events, city_zone for GPS | All events for one ride land in same partition ? ordered | Random (loses ordering), user_id (hot partition for frequent riders) |
| Exactly-Once | Kafka transactions + Flink checkpoints | End-to-end guarantee without application-level dedup | At-least-once + app dedup (complex, error-prone) |
| Surge Computation | Flink tumbling window (10s) | Real-time supply/demand ratio per zone | Batch (too slow), polling DB (doesn't scale) |
| Multi-City | Topic-per-city or city prefix in partition key | City-level isolation, independent scaling | Single global topic (cross-city interference) |
| Schema | Avro + Schema Registry | Backward-compatible evolution, compact binary | JSON (no schema enforcement), Protobuf (less Kafka ecosystem support) |
| Retention | 7 days in Kafka, then S3 (Parquet) | Replay window for bug fixes + cold storage for analytics | Infinite retention (cost), short retention (can't replay) |
Back-of-envelope math · derive infrastructure from given numbers
| Step | Derivation | Result | Design Impact |
|---|---|---|---|
| 1 | GPS updates: 5M drivers · 4s interval | 1.25M GPS/sec | Separate topic for GPS (high volume, lower criticality) |
| 2 | Ride events: 15M rides · 10 events · 86400s | ~1,700 ride events/sec avg | Much lower than GPS · but each is critical (exactly-once) |
| 3 | Peak ride events: 1,700 · 5· peak factor | ~8,500 ride events/sec peak | Kafka handles easily · GPS is the real throughput challenge |
| 4 | Total events: 1.25M GPS + 8.5K ride + misc | ~1.3M events/sec peak | Need 10K+ Kafka partitions across multiple clusters |
| 5 | Kafka partitions: 1.3M · 5K msg/partition/sec | ~260 partitions minimum | Use 1000+ for headroom and city-level isolation |
| 6 | Storage: 1.3M · 1KB avg · 86400s · 7 days | ~786 TB retention | Tiered storage: 7d hot (SSD) ? S3 cold (Parquet) |
| 7 | Flink workers for surge: 600 cities · 1 worker | ~600 Flink task slots | Each city independently computes surge every 10s |
| 8 | Kafka brokers: 1.3M msg/sec · 100K/broker | ~13 brokers minimum | Use 30+ for replication (RF=3) and headroom |
Event schema + producer/consumer interfaces
// --- Ride Event Schema (Avro) ---
{
"type": "record",
"name": "RideEvent",
"namespace": "com.uber.events",
"fields": [
{"name": "event_id", "type": "string"}, // UUID, idempotency key
{"name": "ride_id", "type": "string"}, // partition key
{"name": "event_type", "type": {"type": "enum", "symbols": [
"RIDE_REQUESTED", "DRIVER_MATCHED", "DRIVER_EN_ROUTE",
"RIDER_PICKED_UP", "TRIP_IN_PROGRESS", "GPS_UPDATE",
"TRIP_COMPLETED", "FARE_CALCULATED", "PAYMENT_CHARGED", "RATING_SUBMITTED"
]}},
{"name": "timestamp", "type": "long"}, // epoch millis
{"name": "city_id", "type": "string"}, // for routing
{"name": "driver_id", "type": ["null","string"]},
{"name": "rider_id", "type": "string"},
{"name": "location", "type": ["null", {"type": "record", "name": "Location",
"fields": [{"name":"lat","type":"double"},{"name":"lng","type":"double"}]
}]},
{"name": "metadata", "type": {"type": "map", "values": "string"}}
]
}
// --- Producer (Trip Service) ---
POST /internal/events/publish
Headers: X-Idempotency-Key: {event_id}
Body: { ride_id, event_type, ... }
? Kafka produce with key=ride_id, transactional
// --- Consumer (Surge Pricing) ---
Kafka Consumer Group: "surge-pricing-service"
Topic: "uber.ride-events" (filtered: RIDE_REQUESTED, TRIP_COMPLETED)
Processing: Flink tumbling window (10s) ? count per H3 zone ? write Redis
// --- Surge Query API ---
GET /api/surge?lat={lat}&lng={lng}
? Redis HGET surge:{h3_cell} multiplier
Response: { "multiplier": 1.8, "zone": "h3_abc123", "expires_in_sec": 10 }
Hot path (Kafka + Redis) for real-time · Cold path (Cassandra + S3) for state and analytics
| Store | Data | Access Pattern | Retention |
|---|---|---|---|
| Kafka | All raw events (immutable log) | Sequential read by consumer groups | 7 days hot, then tiered to S3 |
| Redis | Surge multipliers per zone | GET surge:{h3_cell} · <1ms | TTL 10s (auto-refresh by Flink) |
| Redis | Active ride state machine | HGET ride:{ride_id} status | TTL 24h (ride lifecycle) |
| Cassandra | Completed ride history | Query by rider_id or driver_id + time range | Forever (compliance) |
| S3 (Parquet) | Archived events for analytics | Batch queries via Spark/Presto | Years (data lake) |
| Schema Registry | Avro schemas (versioned) | Lookup by schema_id on deserialize | Forever (all versions kept) |
{status, driver_id, rider_id, pickup_lat, pickup_lng, started_at, ...}. State transitions are atomic (Lua script): only valid transitions allowed (e.g., can't go from COMPLETED ? EN_ROUTE). Flink updates state on each event.What breaks and how the system recovers
| Failure | Impact | Recovery |
|---|---|---|
| Kafka broker dies | Partitions on that broker unavailable for ~10s | ISR replica promoted to leader. Producers retry. Zero data loss (acks=all). |
| Flink job crashes | Surge pricing stale for affected cities | Restart from last checkpoint (every 30s). Reprocess buffered events. Idempotent Redis writes. |
| Consumer lag > 5min | Surge pricing outdated ? wrong prices | KEDA auto-scales consumers. Alert at 2min lag. Fallback: use last-known surge. |
| Schema incompatibility | Consumer can't deserialize new events | Schema Registry rejects breaking changes in CI. Backward-compatible only. |
| Duplicate events (network retry) | Could double-charge rider | Idempotent producer (producer_id + seq). Flink dedup by event_id. Sink upsert by ride_id. |
| City-wide outage | All events for one city lost | Cross-region Kafka replication (MirrorMaker 2). Failover to DR cluster within 60s. |
| Poison message | Consumer crashes in loop | DLQ after 3 retries. Alert. Manual inspection. Fix + replay from DLQ. |
Why each technology was chosen
| Component | Technology | Why Chosen | What Was Rejected |
|---|---|---|---|
| Event Bus | Apache Kafka | 1M+ msg/sec, durable, replayable, exactly-once | RabbitMQ (no replay), SQS (no ordering), Pulsar (less mature) |
| Stream Processing | Apache Flink | Exactly-once, event-time windows, checkpointing | Spark Streaming (micro-batch, higher latency), Kafka Streams (simpler but less powerful) |
| Hot State | Redis Cluster | Sub-ms reads, atomic ops (Lua), TTL | Memcached (no persistence), DynamoDB (higher latency) |
| Cold Storage | Apache Cassandra | Write-heavy, time-series friendly, linear scale | PostgreSQL (doesn't scale writes), MongoDB (less predictable at scale) |
| Analytics | S3 + Parquet + Presto | Cheap, columnar, SQL-on-anything | Snowflake (cost at this volume), Redshift (less flexible) |
| Schema | Avro + Confluent Schema Registry | Compact binary, backward-compatible evolution | Protobuf (less Kafka tooling), JSON (no schema enforcement) |
The 8 things to say in a system design interview for this problem