Messages wait in a queue for consumers to process — task distribution & load balancing with retry + DLQ
Guarantees: At-least-once delivery (consumer must be idempotent). Ordering per queue (FIFO). Dead Letter Queue: after N failed retries, the message is moved to the DLQ. Visibility timeout: a message is invisible to other consumers while it is being processed.
Limitations: No message replay. Limited throughput for high-volume streaming. Fan-out is harder than in Kafka. No long-term retention.
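RabbitMQ exchanges: Direct (exact key) · Fanout (broadcast) · Topic (pattern: order.*.created) · Headers. SQS: Standard (at-least-once, best-effort order) vs FIFO (exactly-once processing, strict order per message group, 300 msg/sec per API action or 3K msg/sec with batching; higher with high-throughput mode).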
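The interplay of visibility timeout, at-least-once delivery, and an idempotent consumer can be sketched with a toy in-memory queue. This is a minimal illustration under stated assumptions, not any broker's API: `VisibilityQueue`, `IdempotentConsumer`, and the logical `now` clock are hypothetical names invented for this example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Message:
    msg_id: str
    body: str
    invisible_until: float = 0.0  # logical time when the message is visible again


class VisibilityQueue:
    """Toy queue (hypothetical): a received message is hidden, not deleted, until acked."""

    def __init__(self, visibility_timeout: float):
        self.visibility_timeout = visibility_timeout
        self.messages: list[Message] = []

    def send(self, msg_id: str, body: str) -> None:
        self.messages.append(Message(msg_id, body))

    def receive(self, now: float) -> Optional[Message]:
        # Return the first visible message and hide it for the timeout window.
        for msg in self.messages:
            if msg.invisible_until <= now:
                msg.invisible_until = now + self.visibility_timeout
                return msg
        return None

    def ack(self, msg_id: str) -> None:
        # Deleting on ack is what ends redelivery.
        self.messages = [m for m in self.messages if m.msg_id != msg_id]


class IdempotentConsumer:
    """Remembers processed IDs so a redelivered message has no second effect."""

    def __init__(self):
        self.processed_ids: set[str] = set()
        self.side_effects: list[str] = []

    def handle(self, msg: Message) -> None:
        if msg.msg_id in self.processed_ids:
            return  # duplicate delivery: skip the side effect
        self.side_effects.append(msg.body)
        self.processed_ids.add(msg.msg_id)


queue = VisibilityQueue(visibility_timeout=30.0)
queue.send("m1", "charge order 42")
consumer = IdempotentConsumer()

first = queue.receive(now=0.0)         # delivered, hidden until t=30
hidden = queue.receive(now=10.0)       # None: still invisible to other consumers
consumer.handle(first)                 # processed, but the ack never arrives
redelivered = queue.receive(now=31.0)  # timeout expired, same message again
consumer.handle(redelivered)           # deduplicated by msg_id
queue.ack(redelivered.msg_id)
```

The side effect runs once even though the message was delivered twice, which is why at-least-once delivery is usually paired with idempotent handlers.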
Message Streams - Apache Kafka
Messages continuously appended to a durable ordered log — 1M+ events/sec, replayable history, multiple independent readers.
▸ Real-World: Uber Ride Events in Kafka — Complete Flow
▸ Kafka: Producers, Consumers, Brokers
Producers
Acks: 0 = fire-and-forget · 1 = leader ACK · all = all ISR ACK (safest).
Retries: On failure, retry with backoff. The idempotent producer (enable.idempotence=true) deduplicates retries via sequence numbers.
Batching: linger.ms (time to wait for a batch to fill) + batch.size (max bytes per batch). Bigger batches give higher throughput at the cost of slightly higher latency.
Compression: snappy (fast), lz4 (balanced), zstd (best ratio). Batches are compressed at the producer and decompressed at the consumer.
Partitioner: Default = hash(key) % partitions. Null keys use the sticky partitioner (fill a batch for one partition, then rotate).
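How sequence numbers let a broker drop retried sends can be sketched as follows. This is a simplified model, assuming one sequence number per record; real Kafka tracks sequences per producer ID and epoch for each partition and works on batches. `PartitionLog` and its return strings are hypothetical.

```python
class PartitionLog:
    """Toy broker-side partition that drops retried records by sequence number."""

    def __init__(self):
        self.records: list[str] = []
        # Highest sequence number appended so far, per producer ID.
        self.last_seq: dict[int, int] = {}

    def append(self, producer_id: int, seq: int, value: str) -> str:
        last = self.last_seq.get(producer_id, -1)
        if seq <= last:
            return "duplicate"       # retry of something already written
        if seq != last + 1:
            return "out_of_order"    # a gap means an earlier send is missing
        self.records.append(value)
        self.last_seq[producer_id] = seq
        return "appended"


log = PartitionLog()
results = [
    log.append(producer_id=7, seq=0, value="ride_requested"),
    log.append(producer_id=7, seq=1, value="driver_assigned"),
    # The ACK for seq=1 was lost, so the producer retries the same record.
    log.append(producer_id=7, seq=1, value="driver_assigned"),
    log.append(producer_id=7, seq=2, value="ride_started"),
]
```

The retry is acknowledged without being written again, so the log holds three records rather than four.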
Consumers
Consumer Group: Each partition is assigned to exactly 1 consumer in the group. Adding consumers adds parallelism, up to one consumer per partition; extra consumers sit idle.
Offsets: Track position per partition. Committed to the __consumer_offsets topic. auto.offset.reset (used when no committed offset exists): earliest (read from the start) / latest (new messages only).
Delivery: At-most-once (commit before processing) · At-least-once (process, then commit; may re-read) · Exactly-once (transactions).
Rebalance: When a consumer joins or leaves, partitions are reassigned. Cooperative/incremental rebalancing (Kafka 2.4+) avoids a stop-the-world pause.
Poll: max.poll.records, max.poll.interval.ms. A consumer that takes longer than max.poll.interval.ms between polls is removed from the group.
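The difference between at-most-once and at-least-once comes down to whether the offset is committed before or after processing. The sketch below simulates a crash between the two steps; it is an illustration only, and `consume`, `run_with_restart`, and `crash_at` are hypothetical names rather than Kafka client calls.

```python
def consume(log, committed_offset, commit_first, crash_at=None):
    """Run one consumer session over `log`, starting at `committed_offset`.

    Returns (processed_records, new_committed_offset). If `crash_at` is set,
    the consumer dies at that offset between its two steps.
    """
    processed = []
    offset = committed_offset
    while offset < len(log):
        if commit_first:
            committed_offset = offset + 1           # commit, then process
            if offset == crash_at:
                return processed, committed_offset  # crashed before processing
            processed.append(log[offset])
        else:
            processed.append(log[offset])           # process, then commit
            if offset == crash_at:
                return processed, committed_offset  # crashed before committing
            committed_offset = offset + 1
        offset += 1
    return processed, committed_offset


def run_with_restart(log, commit_first):
    # First session crashes at offset 1; the second resumes from the commit.
    first, committed = consume(log, 0, commit_first, crash_at=1)
    second, _ = consume(log, committed, commit_first)
    return first + second


events = ["e0", "e1", "e2"]
at_most_once = run_with_restart(events, commit_first=True)    # e1 is lost
at_least_once = run_with_restart(events, commit_first=False)  # e1 is seen twice
```

Committing first trades duplicates for possible loss; processing first trades loss for possible duplicates, which the consumer then has to tolerate.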
Brokers & Topics
Broker: A single Kafka server. Cluster = N brokers. Each broker stores a subset of the partitions.
Replication: RF=3 means 3 copies. The leader handles reads and writes; followers in the ISR (In-Sync Replicas) replicate from it. min.insync.replicas=2 + acks=all means an acknowledged write survives the loss of one replica.
Segments: A partition is an ordered log split into segment files (1 GB by default). Each segment has .log + .index + .timeindex files.
Retention: Time-based (e.g. retention.ms = 7 days) or size-based (e.g. retention.bytes = 1 TB). Log compaction keeps the latest value per key (state snapshots).
KRaft: Replaces ZooKeeper (production-ready since Kafka 3.3). Controller quorum via Raft. Simpler ops, faster failover.
| Config | Mechanism | Effect |
|---|---|---|
| acks=all + min.insync.replicas=2 | Producer waits for 2+ replicas | No loss of acknowledged writes (even if 1 broker dies) |
| enable.idempotence=true | Sequence numbers per producer | No duplicates on retry |
| Transactions | Atomic multi-partition writes | Exactly-once for read-process-write within Kafka |
| unclean.leader.election.enable=false | Only ISR members can become leader | No data loss (may reduce availability) |
| Log compaction | Keep latest value per key | State snapshots, changelogs (KTable) |
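Log compaction ("keep latest value per key") can be sketched in a few lines. This is a simplified model, assuming the whole log is compacted at once and that a `None` value acts as a tombstone; real Kafka compacts only closed segments and keeps tombstones for a configurable period. `compact` is a hypothetical helper.

```python
def compact(log):
    """Keep only the latest record per key, preserving offset order.

    `log` is a list of (key, value) pairs; a value of None is treated as a
    tombstone, which removes the key entirely.
    """
    latest_offset = {}
    for offset, (key, _value) in enumerate(log):
        latest_offset[key] = offset
    return [
        (key, value)
        for offset, (key, value) in enumerate(log)
        if latest_offset[key] == offset and value is not None
    ]


changelog = [
    ("user-1", "address=Berlin"),
    ("user-2", "address=Paris"),
    ("user-1", "address=Munich"),  # supersedes the first user-1 record
    ("user-3", "address=Rome"),
    ("user-3", None),              # tombstone: delete user-3
]
snapshot = compact(changelog)
```

The compacted log is a snapshot of current state per key, which is why it suits changelog topics that back a table.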
Pub/Sub (SNS / Google Pub/Sub)
Publishers broadcast (fan-out) to a topic — every subscriber receives its own independent copy
Producer publishes to TOPIC (not a queue)
                       ↓ topic: "order-events"
    ┌────────────┬─────┴──────┬────────────┐
    ↓            ↓            ↓            ↓
  Sub A        Sub B        Sub C        Sub D
 (Email)    (Analytics)    (Audit)     (Webhook)
Each subscriber gets its OWN copy of every message.
Subscribers are independent — different speeds, different retry logic.
Guarantees: At-least-once delivery to every subscriber. Independent processing. Fully managed. Google Pub/Sub: seek to timestamp, ordering keys, exactly-once delivery. SNS: push to SQS/Lambda/HTTP/email/SMS.
Limitations: No long retention by default. No consumer-side replay in SNS. Ordering not guaranteed by default. Not suited for point-to-point work queues.
Real-world: Spotify (Google Pub/Sub for event-driven microservices). Shopify (SNS+SQS for order event fan-out). Uber (Google Pub/Sub for cross-service event propagation).
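Fan-out with independent subscribers can be sketched as a topic that keeps one private backlog per subscription. This is a minimal in-memory model, not the SNS or Google Pub/Sub API; `Topic`, `subscribe`, `publish`, and `pull` are hypothetical names.

```python
from collections import deque


class Topic:
    """Toy pub/sub topic: every subscription gets its own copy of each message."""

    def __init__(self, name: str):
        self.name = name
        self.subscriptions: dict[str, deque] = {}

    def subscribe(self, subscriber: str) -> None:
        # Each subscriber owns a private backlog, so a slow one never blocks others.
        self.subscriptions[subscriber] = deque()

    def publish(self, message: dict) -> None:
        for backlog in self.subscriptions.values():
            backlog.append(dict(message))  # independent copy per subscriber

    def pull(self, subscriber: str):
        backlog = self.subscriptions[subscriber]
        return backlog.popleft() if backlog else None


topic = Topic("order-events")
for name in ("email", "analytics", "audit"):
    topic.subscribe(name)

topic.publish({"order_id": 42, "status": "placed"})
topic.publish({"order_id": 43, "status": "placed"})

email_first = topic.pull("email")                  # email consumes one message
audit_backlog = len(topic.subscriptions["audit"])  # audit has not pulled yet
```

Consuming from one subscription leaves the others untouched, which is the property that separates fan-out from a shared work queue.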
Message Queues vs Event Streams vs Pub/Sub
Three fundamentally different messaging patterns
┌──────────────────────────────────────────────────────────────┐
│ MESSAGE QUEUE (Point-to-Point) │
└──────────────────────────────────────────────────────────────┘
Producer → [ Queue ] → Consumer A
(message deleted after ACK)
One message → one consumer (ownership)
┌──────────────────────────────────────────────────────────────┐
│ EVENT STREAM (Log/replay-based) │
└──────────────────────────────────────────────────────────────┘
Producer → [ Append-only Log ] → offset 0 → offset 5 → offset 9
                                    │          │          │
                                 Group A    Group B    Group C
Reading never removes data; the log is only appended to, and old data leaves only through retention or compaction
┌──────────────────────────────────────────────────────────────┐
│ PUB/SUB (Fan-out) │
└──────────────────────────────────────────────────────────────┘
Producer → [ Topic ]
├→ Subscriber A → receives COPY (independent)
├→ Subscriber B → receives COPY (independent)
└→ Subscriber C → receives COPY (independent)
One event → many independent receivers
| Feature | Message Queue | Event Stream | Pub/Sub |
|---|---|---|---|
| Definition | Messages wait in a queue for a consumer to pick up & process | Messages appended to a durable ordered log with replayable history | Publisher broadcasts to a topic; every subscriber gets its own copy |
| Core Idea | Queue of tasks / jobs | Durable ordered event log | Broadcast events to subscribers |
| Concrete Question | "Which worker should process this image upload / email / payment retry?" | "What exactly happened in this system / user journey over time?" | "Which systems should know that an order was placed?" |
| Why this fits best | Best when exactly one worker should process a task / job | Best when events must be retained as an ordered replayable history | Best when one event must be broadcast (fan-out) to many independent subscribers |
| Why others don't fit | Pub/Sub broadcasts to many consumers, causing duplicate processing; Streams retain/replay history, which is unnecessary for one-time task execution | Queue removes messages after processing; Pub/Sub focuses on live delivery rather than durable replayable event history | Queue is designed for task distribution where only one consumer gets the message; Streams add retention/replay/ordering overhead when only real-time notification is needed |
| Model | Competing consumers: task distribution & load balancing | Independent readers: each group reads the full log at its own pace | Broadcast: every subscriber gets its own copy |
Key Insight: The fundamental difference is the consumption model. Queues are competing consumers (work distribution). Streams are independent consumers (each reads everything). Pub/Sub is broadcast (everyone gets a copy). Many real systems combine them — e.g., SNS → SQS → Lambda.
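The stream side of this comparison (independent readers, each with its own position, plus replay) can be sketched as an append-only list with one offset per consumer group. This is a toy model with hypothetical names (`EventLog`, `poll`, `seek`), not the Kafka client API.

```python
class EventLog:
    """Toy append-only log with one read offset per consumer group."""

    def __init__(self):
        self.events: list[str] = []
        self.group_offsets: dict[str, int] = {}

    def append(self, event: str) -> int:
        self.events.append(event)
        return len(self.events) - 1  # offset of the new event

    def poll(self, group: str, max_records: int = 10) -> list[str]:
        start = self.group_offsets.get(group, 0)
        batch = self.events[start:start + max_records]
        self.group_offsets[group] = start + len(batch)  # reading never deletes
        return batch

    def seek(self, group: str, offset: int) -> None:
        self.group_offsets[group] = offset  # rewind to replay


log = EventLog()
for event in ("ride_requested", "driver_assigned", "ride_started"):
    log.append(event)

billing = log.poll("billing")                     # reads all three events
analytics = log.poll("analytics", max_records=1)  # slower group, one event
log.seek("billing", 0)
billing_replay = log.poll("billing")              # same history again
```

Each group advances only its own offset, so a slow or replaying group has no effect on the others.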
Dead Letter Queue (DLQ)
Quarantine poison messages so the main queue keeps flowing — critical for fault isolation & observability
▸ DLQ Lifecycle — Retry, Quarantine & Recovery
| Concept | Detail | Best Practice |
|---|---|---|
| maxReceiveCount | Number of delivery attempts before moving to DLQ | 3–5 retries with exponential back-off |
| Visibility Timeout | Time message is hidden from other consumers during processing | |
Best Practices: Always alarm on DLQ depth > 0; even 1 message means something is broken. Set up CloudWatch / Prometheus metrics on ApproximateNumberOfMessagesVisible. Include correlation IDs and original timestamps in message headers for debugging. Use separate DLQs per source queue to isolate failure domains.
Recovery Playbook: ① Alert fires → ② Inspect DLQ messages (peek, don't consume) → ③ Identify root cause (schema? downstream? bug?) → ④ Fix the consumer/downstream → ⑤ Redrive messages back to source queue → ⑥ Verify processing succeeds → ⑦ Post-mortem if recurring.
Anti-patterns: Ignoring the DLQ (messages expire silently). No alarm (failures go unnoticed for days). Infinite retries without a DLQ (blocks the entire queue). Same retention on DLQ as source (messages may expire before investigation). No metadata (can't trace why a message failed).
Real-world: Uber (DLQ per microservice, auto-redrive after circuit breaker resets). Netflix (DLQ + S3 archival for compliance audit trail). Stripe (webhook DLQ with exponential backoff, 5 retries over 3 days).
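The retry, quarantine, and redrive steps described in this section can be sketched with two in-memory queues. This is an illustration under stated assumptions: `MAX_RECEIVE_COUNT`, `drain`, `redrive`, and the message dict layout are hypothetical, and back-off delays between retries are left out to keep the example deterministic.

```python
from collections import deque

MAX_RECEIVE_COUNT = 3  # illustrative value in the 3-5 range


def drain(source: deque, dlq: deque, handler) -> list:
    """Process `source`; after MAX_RECEIVE_COUNT failed receives, quarantine."""
    done = []
    while source:
        message = source.popleft()
        message["receive_count"] += 1
        try:
            handler(message["body"])
            done.append(message["body"])
        except Exception as exc:
            if message["receive_count"] >= MAX_RECEIVE_COUNT:
                message["last_error"] = repr(exc)  # metadata for debugging
                dlq.append(message)
            else:
                source.append(message)             # retry later
    return done


def redrive(dlq: deque, source: deque) -> None:
    """Move quarantined messages back to the source queue with a fresh count."""
    while dlq:
        message = dlq.popleft()
        message["receive_count"] = 0
        source.append(message)


def buggy_handler(body):
    if body == "poison":
        raise ValueError("cannot parse payload")


source = deque({"body": b, "receive_count": 0} for b in ("ok-1", "poison", "ok-2"))
dlq = deque()
processed = drain(source, dlq, buggy_handler)
dlq_depth = len(dlq)                   # this is the number an alarm would watch

redrive(dlq, source)                   # after the consumer is fixed
reprocessed = drain(source, dlq, lambda body: None)
```

The poison message stops blocking the healthy ones after three attempts, and the redrive step returns it to the source queue once the handler no longer fails.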
Event Sourcing
Persist facts (events) as the source of truth — derive state by replaying the event log. Never mutate, only append.
Wins: Perfect audit trail (every state change recorded). Time-travel debugging (reconstruct state at any point). Rebuild projections (fix bugs, replay, get corrected views). Natural fit for financial ledgers, order systems, collaboration tools.
Costs: Schema evolution of events is hard (upcasting, versioning). Eventual consistency on read models. Snapshot/compaction strategy needed for long-lived aggregates. Steeper learning curve: the team must think in events, not state.
Real-world: Stripe (payment state machine as events). Datomic (immutable database, event-sourced by design). LMAX Exchange (event-sourced trading engine, 6M orders/sec). Git (commits are events, working tree is a projection).
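Deriving state by replaying events, including reconstructing a past state, can be sketched with a small account ledger. The `Event` shape, the event kinds, and the `replay` helper are hypothetical choices made for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    version: int  # position in the account's event stream
    kind: str     # "deposited" or "withdrawn"
    amount: int   # in cents


def apply(balance: int, event: Event) -> int:
    """Pure state transition: the current state is a fold over events."""
    if event.kind == "deposited":
        return balance + event.amount
    if event.kind == "withdrawn":
        return balance - event.amount
    raise ValueError(f"unknown event kind: {event.kind}")


def replay(events, up_to_version=None) -> int:
    """Rebuild the balance; stop early to reconstruct a past state."""
    balance = 0
    for event in events:
        if up_to_version is not None and event.version > up_to_version:
            break
        balance = apply(balance, event)
    return balance


# The log is only ever appended to; corrections are new events, not edits.
account_events = [
    Event(1, "deposited", 10_000),
    Event(2, "withdrawn", 2_500),
    Event(3, "deposited", 500),
]
current_balance = replay(account_events)
balance_at_v2 = replay(account_events, up_to_version=2)
```

For long-lived aggregates, a snapshot would store the balance at some version so that replay can start there instead of at the first event.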
CQRS (Command Query Responsibility Segregation)
Separate the write model (commands) from the read model (queries) — scale, optimize, and evolve them independently
▸ CQRS Architecture — Write Path vs Read Path
| Concept | Write Side | Read Side |
|---|---|---|
| Model | Domain model / aggregates; enforces invariants | Denormalized views optimized for specific queries |
Pattern (CQRS combined with event sourcing): Commands → Aggregate → Events persisted → Projectors subscribe → Read models updated async. In this setup the event store is the write model and the projections are the read models. Rebuild any projection by replaying events from the beginning.
Consistency strategies: Pull-based: query handler checks whether the projection is up to date (compare position). Push-based: projector publishes a "ready" event. Hybrid: serve stale data and indicate "updating" in the UI. Inline projection: update synchronously in the same transaction (sacrifices scalability for consistency).
Anti-patterns: Querying the write model (defeats the purpose). Bidirectional sync (creates conflicts). Shared database for read/write (coupling returns). Over-engineering (CQRS for a TODO app).
Real-world: Microsoft (Azure architecture patterns, official CQRS guidance). Uber (trip service for writes + rider-facing API reading from cache). Netflix (catalog writes vs personalized read views). Shopify (order writes vs merchant dashboard reads).
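The command, aggregate, event, and projection flow in this section can be sketched end to end in memory. This is a minimal illustration, assuming a plain list as the event store and a projector that is called explicitly rather than subscribing asynchronously; `OrderAggregate`, `CustomerTotalsProjection`, and `catch_up` are hypothetical names.

```python
class OrderAggregate:
    """Write side: validates commands and emits events."""

    def __init__(self):
        self.placed: set[str] = set()

    def handle_place_order(self, order_id: str, customer: str, total: int) -> dict:
        if order_id in self.placed:
            raise ValueError("order already placed")  # invariant check
        if total <= 0:
            raise ValueError("total must be positive")
        self.placed.add(order_id)
        return {"type": "OrderPlaced", "order_id": order_id,
                "customer": customer, "total": total}


class CustomerTotalsProjection:
    """Read side: denormalized view of how much each customer has spent."""

    def __init__(self):
        self.position = 0  # number of events applied so far
        self.totals: dict[str, int] = {}

    def catch_up(self, event_store: list) -> None:
        for event in event_store[self.position:]:
            if event["type"] == "OrderPlaced":
                customer = event["customer"]
                self.totals[customer] = self.totals.get(customer, 0) + event["total"]
            self.position += 1


event_store: list = []
aggregate = OrderAggregate()
projection = CustomerTotalsProjection()

event_store.append(aggregate.handle_place_order("o-1", "alice", 30))
event_store.append(aggregate.handle_place_order("o-2", "alice", 12))
stale_view = dict(projection.totals)  # projector has not run yet

projection.catch_up(event_store)      # asynchronous in a real system
fresh_view = dict(projection.totals)

rebuilt = CustomerTotalsProjection()  # rebuild by replaying from the start
rebuilt.catch_up(event_store)
```

The gap between `stale_view` and `fresh_view` is the eventual consistency window, and `position` is the value a pull-based consistency check would compare against the event store length.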
Ordering Guarantees
Where order is preserved — and where it is not. Critical for financial transactions, state machines, and causal consistency.
▸ Partition-Based Ordering — How Keys Determine Order
| Broker | Order Scope | Mechanism | Throughput Impact |
|---|---|---|---|
| Kafka | Per-partition (key → partition) | murmur2(key) % numPartitions | More partitions = more parallelism, less global order |
| SQS Standard | Best effort; may reorder | Distributed architecture, no ordering guarantee | Nearly unlimited throughput |
| SQS FIFO | Strict per MessageGroupId | Deduplication + sequencing per group | 300 msg/sec per API action (3,000 with batching); higher with high-throughput mode |
| RabbitMQ | Per queue (single consumer) | FIFO within a single queue | Prefetch count affects perceived order |
| Google Pub/Sub | Optional ordering keys | orderingKey on publish | Ordered messages go to same region |
| Kinesis | Per shard (partition key) | MD5(partitionKey) → shard | 1 MB/sec or 1,000 records/sec per shard (writes) |
| Azure Event Hubs | Per partition | Partition key → consistent hash | 1 MB/sec ingress per throughput unit |
▸ Patterns for Maintaining Order
Partition Key Design
userId: all user events ordered (profile, orders, sessions)
orderId: order lifecycle events in sequence
accountId: financial transactions ordered per account
deviceId: IoT telemetry ordered per device
⚠ Hot keys: popular users → partition hotspot
Cross-Partition Ordering
Sequence numbers: embed monotonic seq in payload
Vector clocks: causal ordering across partitions
Lamport timestamps: logical ordering
Single partition: sacrifice parallelism for global order
External sequencer: Redis INCR or DB sequence
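Of the cross-partition options listed above, Lamport timestamps are the easiest to sketch. The example below is a minimal logical clock; `LamportClock` and the two producer names are hypothetical, and tie-breaking by producer name is one common convention rather than a requirement.

```python
class LamportClock:
    """Logical clock: counts events, not wall time."""

    def __init__(self):
        self.time = 0

    def tick(self) -> int:
        # Local event or send: advance the counter.
        self.time += 1
        return self.time

    def receive(self, sender_time: int) -> int:
        # On receive, jump past everything the sender had seen.
        self.time = max(self.time, sender_time) + 1
        return self.time


producer_a = LamportClock()
producer_b = LamportClock()

t1 = producer_a.tick()       # A: local event             -> 1
t2 = producer_a.tick()       # A: sends a message to B    -> 2
t3 = producer_b.receive(t2)  # B: receives it             -> 3
t4 = producer_b.tick()       # B: emits a follow-up event -> 4

# Events tagged (timestamp, producer) can be merged across partitions;
# the producer name breaks ties so the order is total and deterministic.
events = [(t4, "b", "shipped"), (t1, "a", "created"),
          (t3, "b", "received"), (t2, "a", "paid")]
ordered = [name for _, _, name in sorted(events)]
```

Lamport timestamps respect causality (a cause always sorts before its effect) but cannot tell whether two events were concurrent; vector clocks are needed for that.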
Key Insight: Use a stable partition key (userId, accountId, orderId) so all related events land in the same partition and stay ordered. Choose the key based on your consistency boundary: which entity needs its events in order?
Rebalancing risk: When the partition count changes (Kafka can add partitions but not remove them), the key-to-partition mapping changes, so new events for a key may land in a different partition than its earlier events. Size the partition count up front, or use a custom partitioner based on consistent hashing, to minimize disruption. During a consumer rebalance, uncommitted messages may be redelivered and so appear out of order; handle this with idempotent processing or a buffering + reordering window.
Anti-patterns: Random partition key (destroys ordering). Too few partitions (bottleneck on throughput). Assuming global order in multi-partition topics. Processing out-of-order events without idempotency (corrupted state).
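Both points, per-key ordering and the remapping that follows a partition-count change, can be sketched with a modulo partitioner. This is an illustration only: `zlib.crc32` stands in for Kafka's murmur2 hash, and `partition_for` and `route` are hypothetical helpers.

```python
import zlib


def partition_for(key: str, num_partitions: int) -> int:
    # Stand-in hash: Kafka's default partitioner uses murmur2, not CRC32.
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def route(events, num_partitions):
    """Append each (key, payload) to the partition chosen by its key."""
    partitions = {p: [] for p in range(num_partitions)}
    for key, payload in events:
        partitions[partition_for(key, num_partitions)].append((key, payload))
    return partitions


events = [
    ("order-1", "created"), ("order-2", "created"),
    ("order-1", "paid"), ("order-2", "cancelled"), ("order-1", "shipped"),
]
partitions = route(events, num_partitions=6)

# Per-key order survives because a key never spans partitions.
order_1 = [payload for log in partitions.values()
           for key, payload in log if key == "order-1"]

# Growing the topic from 6 to 8 partitions remaps a share of the keys.
keys = [f"user-{i}" for i in range(1000)]
moved = sum(partition_for(k, 6) != partition_for(k, 8) for k in keys)
```

With plain modulo hashing most keys move when the count changes, which is why the partition count is usually fixed early for topics that rely on per-key order.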
Schema Registry
Central contract for event payloads — prevents "it broke prod". Enforces backward/forward compatibility across producer and consumer versions.
▸ Schema Registry — Serialize, Register, Validate
| Compatibility | Rule | Producers May | Consumers May | Use Case |
|---|---|---|---|---|
| Backward | New schema can read old data | Add optional fields, remove fields | Read old data with new code | Most common; consumers upgrade first |
| Forward | Old schema can read new data | Add fields, remove optional fields | Read new data with old code | Producers upgrade first |
| Full | Both backward + forward | Only add/remove optional fields | Both directions work | Safest; independent deployments |
| Transitive | Compatible with ALL previous versions | Strictest constraints | Any version reads any other | Long-lived topics, many consumers |
| None | No checks | Anything | May break | Development only; never in prod |
▸ Schema Registry Implementations
| Tool | Formats | Integration | Key Feature |
|---|---|---|---|
| Confluent Schema Registry | Avro, Protobuf, JSON Schema | Kafka native, REST API | De facto standard, subject strategies, schema references |
| AWS Glue Schema Registry | Avro, JSON Schema, Protobuf | MSK, Kinesis, Lambda | Serverless, IAM integration, auto-registration |
| Apicurio Registry | Avro, Protobuf, JSON, OpenAPI, GraphQL | Kafka, HTTP, gRPC | Open source, CNCF, multi-format |
| Azure Schema Registry | Avro | Event Hubs | Azure-native, RBAC, client-side caching |
| Buf (BSR) | Protobuf | gRPC, Connect | Breaking change detection, linting, code gen |
▸ Schema Evolution Best Practices
Safe Changes (Backward Compatible)
Add optional field with default value
Add new enum value (if consumer ignores unknown)
Widen numeric type (int → long)
Add new union member (Avro)
Deprecate field (keep, stop writing)
Breaking Changes (Avoid)
Remove required field
Rename field (without alias)
Change field type (string → int)
Remove enum value
Change field from optional to required
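A backward-compatibility check of the kind a CI job might run can be sketched over simplified schemas. This is not Avro's schema resolution or any registry's API: schemas are reduced to plain dicts, only added fields and type changes are checked, and `backward_incompatibilities` is a hypothetical helper. Removed fields pass, since a reader on the new schema simply ignores them.

```python
def backward_incompatibilities(old_schema: dict, new_schema: dict) -> list:
    """List reasons why a reader on new_schema could not read old_schema data.

    Schemas are simplified to {field_name: {"type": str, "default": ...}};
    a field is optional when it has a "default" entry.
    """
    widenings = {("int", "long")}  # numeric promotions treated as safe
    problems = []
    for name, new_field in new_schema.items():
        old_field = old_schema.get(name)
        if old_field is None:
            if "default" not in new_field:
                problems.append(f"added required field '{name}'")
            continue
        old_type, new_type = old_field["type"], new_field["type"]
        if old_type != new_type and (old_type, new_type) not in widenings:
            problems.append(f"changed type of '{name}': {old_type} -> {new_type}")
    return problems


v1 = {"order_id": {"type": "int"}, "status": {"type": "string"}}
v2_safe = {
    "order_id": {"type": "long"},                 # widened int -> long
    "status": {"type": "string"},
    "coupon": {"type": "string", "default": ""},  # optional field added
}
v2_breaking = {
    "order_id": {"type": "string"},               # type changed
    "status": {"type": "string"},
    "currency": {"type": "string"},               # required, no default
}

safe_report = backward_incompatibilities(v1, v2_safe)
breaking_report = backward_incompatibilities(v1, v2_breaking)
```

An empty report means old data stays readable; in practice the registry's own compatibility endpoint or the serialization library's checker should be the source of truth.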
Best Practices: Enforce compatibility in CI and reject PRs that break schema contracts. Use subject naming strategies (TopicNameStrategy, RecordNameStrategy) to control scope. Cache schemas on the producer/consumer side (schema ID → schema). Set FULL_TRANSITIVE for critical topics.
Subject Strategies: TopicNameStrategy (default): one schema per topic. RecordNameStrategy: schema per record type (multiple types in one topic). TopicRecordNameStrategy: schema per topic+record combo (most flexible).
Anti-patterns: No schema registry ("just use JSON" leads to silent breakage). Compatibility = NONE in prod (ticking time bomb). Not versioning schemas (can't roll back). Tight coupling (producer and consumer must deploy simultaneously).
Real-world: LinkedIn (Avro + Confluent SR for all Kafka topics, thousands of schemas). Uber (Protobuf + custom registry for gRPC services). Netflix (Avro schemas with automated compatibility testing in CI). Shopify (Protobuf for event-driven architecture with Buf for linting).