How does a ride platform serve ML predictions (ETA, surge, fraud) at 1M+ requests/sec with P99 <10ms latency, loading 100+ models into GPU memory, handling model version rollouts, and falling back to rule-based systems when models are unavailable?
Core challenge: Every ride request needs 5+ ML predictions (ETA, price, fraud score, driver match, route), and each one must complete in <10ms at P99. Models are retrained daily. Bad model = bad ETAs = angry users. How do you serve, version, and roll back safely?
1M+ predictions/sec across all models
<10ms P99 latency per prediction
100+ production models, retrained daily
Canary rollout strategy, auto-rollback on regression
Architecture · ML Serving Platform
| Layer | Component | Role |
| --- | --- | --- |
| Feature Store | Online store (Redis) + Offline store (Hive) | Serve pre-computed features at prediction time (<5ms). Batch features joined with real-time features. |
| Model Registry | Versioned model artifacts (S3 + metadata DB) | Store trained models with metrics, lineage, approval status. Immutable versions. |
| Serving Layer | TensorFlow Serving / Triton / custom | Load model into GPU/CPU memory. Batch inference requests. Auto-scale by QPS. |
| Router | Traffic splitting + canary | Route 5% to new model version, 95% to current. Compare metrics. Auto-promote or rollback. |
| Fallback | Rule-based system | If model times out or errors → use heuristic (e.g., historical average ETA). Never block the ride. |
| Monitoring | Prediction quality metrics | Track accuracy, latency, feature drift. Alert on degradation. Auto-rollback trigger. |
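To make the Model Registry and rollback ideas in the architecture above concrete, here is a minimal in-memory sketch. It is not any real registry's API: the `ModelVersion` fields, the `ModelRegistry` class, and the example URIs are all hypothetical, and a dict stands in for the S3 + metadata DB pair.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)  # frozen: a registered version cannot be mutated
class ModelVersion:
    name: str
    version: int
    artifact_uri: str      # illustrative path to the stored artifact
    offline_metric: float  # validation metric recorded at training time
    approved: bool = False


class ModelRegistry:
    """In-memory stand-in for a registry backed by S3 + a metadata DB."""

    def __init__(self) -> None:
        self._versions: Dict[str, List[ModelVersion]] = {}
        self._live: Dict[str, int] = {}      # model name -> live version
        self._previous: Dict[str, int] = {}  # model name -> last live version

    def register(self, name: str, artifact_uri: str,
                 offline_metric: float, approved: bool = False) -> ModelVersion:
        versions = self._versions.setdefault(name, [])
        mv = ModelVersion(name, len(versions) + 1, artifact_uri,
                          offline_metric, approved)
        versions.append(mv)  # append-only: existing versions are never edited
        return mv

    def promote(self, name: str, version: int) -> None:
        mv = self._versions[name][version - 1]
        if not mv.approved:
            raise PermissionError(f"{name} v{version} is not approved")
        if name in self._live:
            self._previous[name] = self._live[name]
        self._live[name] = version

    def rollback(self, name: str) -> int:
        """Point the live alias back at the previously live version."""
        if name not in self._previous:
            raise LookupError(f"no previous version recorded for {name}")
        self._live[name] = self._previous.pop(name)
        return self._live[name]

    def live(self, name: str) -> ModelVersion:
        return self._versions[name][self._live[name] - 1]
```

Because versions are immutable and append-only, rollback in this sketch is only a pointer move; no artifact is rebuilt or copied.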
Feature serving: ML models need features at prediction time. Online features (user's last 5 rides, current location) are served from Redis (<2ms). Batch features (user lifetime stats, driver rating) are pre-computed hourly. Real-time features (surge in last 5 min) are computed by Flink and written to Redis. The feature store ensures training-serving consistency: the same feature logic runs in both paths.
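A rough sketch of the feature path described above, assuming a plain dict in place of Redis. The feature names, the `DEFAULTS` values, and both functions are hypothetical; the point is that one feature definition (`surge_ratio`) is shared by the training and serving paths, and that a missing feature degrades to a default instead of failing the request.

```python
from typing import Dict, List, Mapping, Sequence

# Hypothetical defaults used when a feature is missing from the online store.
DEFAULTS: Dict[str, float] = {
    "rides_last_5_avg_km": 5.0,
    "driver_rating": 4.5,
    "surge_last_5min": 1.0,
}


def surge_ratio(open_requests: int, available_drivers: int) -> float:
    """Single feature definition, imported by both the offline (training)
    job and the streaming job that writes to the online store."""
    return open_requests / max(available_drivers, 1)


def build_feature_vector(
    entity_key: str,
    online_store: Mapping[str, Mapping[str, float]],
    feature_names: Sequence[str] = tuple(DEFAULTS),
) -> List[float]:
    """Read one entity's features; fall back to defaults for anything missing."""
    row = online_store.get(entity_key, {})
    return [float(row.get(name, DEFAULTS[name])) for name in feature_names]
```

In use, the streaming job would write `surge_ratio(...)` under `surge_last_5min`, and the serving path would call `build_feature_vector("user:1", store)` to get a fixed-order vector for the model.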
Model rollout: Train new model → Validate offline (A/B metrics on historical data) → Shadow mode (run alongside prod, compare outputs, no user impact) → Canary (5% traffic) → Monitor for 24h → Full rollout or auto-rollback. Key metric: ETA accuracy (predicted vs actual). Regression > 2% → auto-rollback.
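The canary and auto-rollback steps can be sketched as follows. This is one possible reading, not a prescribed implementation: the function names are hypothetical, traffic is split by hashing a request ID so the same request always lands on the same version, and "regression > 2%" is assumed to mean a relative increase in mean absolute percentage error of the ETA.

```python
import hashlib
from typing import Sequence

CANARY_PERCENT = 5     # share of traffic sent to the candidate (from the text)
MAX_REGRESSION = 0.02  # 2% threshold (from the text), read here as relative


def choose_version(request_id: str, canary_percent: int = CANARY_PERCENT) -> str:
    """Deterministically map a request to 'candidate' or 'current'."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100  # bucket in 0..99
    return "candidate" if bucket < canary_percent else "current"


def mean_abs_pct_error(predicted: Sequence[float], actual: Sequence[float]) -> float:
    """ETA error: mean of |predicted - actual| / actual."""
    return sum(abs(p - a) / a for p, a in zip(predicted, actual)) / len(actual)


def should_rollback(current_error: float, candidate_error: float,
                    max_regression: float = MAX_REGRESSION) -> bool:
    """Roll back if the candidate's error exceeds the current model's
    error by more than max_regression (relative)."""
    return candidate_error > current_error * (1 + max_regression)
```

Hashing rather than random sampling keeps the split stable across retries, which makes per-version metrics easier to compare.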
Failure modes: Model timeout → fallback to rules (never block ride request). Feature store down → use default/cached features (degraded accuracy, not failure). Bad model deployed → canary catches within minutes, auto-rollback. Training-serving skew → feature store ensures same computation in both paths.
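A minimal sketch of the "model timeout → fallback to rules" path, using a thread pool to enforce a deadline. The names `predict_with_fallback` and `historical_average_eta`, the `historical_avg_eta_s` feature key, and the pool size are assumptions for illustration; a production server would more likely enforce the deadline at the RPC layer.

```python
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout
from typing import Any, Callable, Dict, Tuple

_POOL = ThreadPoolExecutor(max_workers=8)  # shared pool; size is illustrative


def historical_average_eta(features: Dict[str, float]) -> float:
    """Rule-based fallback: return a pre-computed historical average (hypothetical key)."""
    return features.get("historical_avg_eta_s", 600.0)


def predict_with_fallback(
    model_fn: Callable[[Dict[str, float]], Any],
    fallback_fn: Callable[[Dict[str, float]], Any],
    features: Dict[str, float],
    timeout_s: float = 0.010,  # 10ms budget from the text
) -> Tuple[Any, str]:
    """Return (prediction, source). Never raises because of the model."""
    future = _POOL.submit(model_fn, features)
    try:
        return future.result(timeout=timeout_s), "model"
    except FutureTimeout:
        future.cancel()  # best effort; an already running call is not interrupted
        return fallback_fn(features), "fallback:timeout"
    except Exception:
        return fallback_fn(features), "fallback:error"
```

Returning the source alongside the value lets monitoring count how often the fallback fires, which is itself a useful degradation signal.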
Real-world: Uber uses Michelangelo (end-to-end ML platform); Netflix uses Metaflow for ML pipelines; Google uses TFX (TensorFlow Extended); Meta uses FBLearner for model training + serving; Spotify uses ML for Discover Weekly recommendations.
Interview Cheat Sheet
The 7 things to say for ML serving design
1. Feature store (online + offline): Redis for real-time features (<5ms), Hive for batch features
2. Model registry with versioning: immutable artifacts, approval workflow, rollback capability
3. Canary rollout: 5% traffic to new model, monitor accuracy, auto-rollback on regression
4. Shadow mode: run new model alongside prod, compare outputs, no user impact
5. Fallback to rules: if model times out or errors, use heuristic (never block the request)
6. Training-serving consistency: feature store ensures same computation in both paths
7. Request batching for inference: group multiple predictions, amortize GPU overhead
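Point 7 can be illustrated with a small micro-batching sketch: wait briefly for requests to accumulate, then run them through the model in one call. The function names, batch size, and wait time are assumptions, and serving systems such as Triton or TensorFlow Serving provide their own batching rather than code like this.

```python
import queue
import time
from typing import Any, Callable, List


def collect_batch(q: "queue.Queue[Any]", max_batch_size: int,
                  max_wait_s: float) -> List[Any]:
    """Block for the first request, then gather more until the batch is
    full or max_wait_s has elapsed, whichever comes first."""
    batch = [q.get()]
    deadline = time.monotonic() + max_wait_s
    while len(batch) < max_batch_size:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break
        try:
            batch.append(q.get(timeout=remaining))
        except queue.Empty:
            break
    return batch


def serve_once(q: "queue.Queue[Any]",
               batch_predict: Callable[[List[Any]], List[Any]],
               max_batch_size: int = 32,
               max_wait_s: float = 0.002) -> List[Any]:
    """Run one batched model call and return outputs in request order."""
    batch = collect_batch(q, max_batch_size, max_wait_s)
    return batch_predict(batch)
```

The trade-off is explicit in the two parameters: a larger `max_batch_size` amortizes more per-call overhead, while `max_wait_s` caps how much latency batching may add to any single request.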