ADR-003: Kafka (Amazon MSK) for Async Event Propagation¶
Status: Accepted
Context¶
Multiple downstream systems need to react to booking lifecycle events:
- The Availability Service must update its Redis bitmap when a booking is created or cancelled.
- The Search Service must update OpenSearch when availability changes.
- The Notification Service must send email and SMS confirmations and cancellation notices.
These updates are not required to be synchronous with the booking write. Making the Booking Service synchronously call three other services would:
- Increase booking P99 latency by the sum of all downstream call latencies.
- Couple booking success to the availability of downstream services (if the Notification Service is down, bookings fail).
- Make it harder to add new consumers in the future without modifying the Booking Service.
The options evaluated were:
- Amazon MSK (Managed Kafka) — durable log, ordered per partition, replayable
- Amazon SQS — managed queue, at-least-once delivery, no ordering by default
- Amazon EventBridge — managed event bus, schema registry, fan-out to rules
- Synchronous HTTP fan-out — simplest, but tightly coupled
Decision¶
Use Amazon MSK (Managed Kafka) with three topics: booking-events, availability-events, and practitioner-events. All topics are partitioned by practitioner_id to guarantee per-practitioner event ordering.
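The ordering guarantee follows from keyed partitioning: Kafka's default partitioner hashes the message key (here `practitioner_id`) to choose a partition, so all events for one practitioner land on one partition and are consumed in publish order. A minimal sketch of that mapping — `partition_for` is a hypothetical helper, and CRC32 stands in for Kafka's actual murmur2 hash so the example has no dependencies:

```python
import zlib

NUM_PARTITIONS = 6  # matches the per-topic partition count chosen in this ADR

def partition_for(practitioner_id: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Map a practitioner_id to a partition index.

    Stand-in for Kafka's default partitioner (murmur2 on the key bytes);
    CRC32 is used here only to keep the sketch dependency-free.
    """
    return zlib.crc32(practitioner_id.encode("utf-8")) % num_partitions

# The same practitioner always maps to the same partition, which is
# what gives each consumer group per-practitioner ordering.
assert partition_for("prac-123") == partition_for("prac-123")
assert 0 <= partition_for("prac-123") < NUM_PARTITIONS
```

Events for different practitioners may hash to different partitions and interleave freely — only the per-key ordering matters here.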
| Setting | Value | Rationale |
|---|---|---|
| Brokers | 3 × kafka.m5.xlarge | One per AZ; survives single broker loss |
| Replication factor | 3 | Full redundancy across all 3 brokers |
| Min in-sync replicas | 2 | Producer acknowledgement requires at least 2 replicas to confirm |
| Retention | 7 days / 168 hours | Enables full read model rebuild via replay |
| Partitions per topic | 6 | Sufficient parallelism for all consumer groups at projected throughput |
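The settings above translate into per-topic configuration at creation time. A sketch using the standard `kafka-topics.sh` CLI — the bootstrap broker endpoint is a placeholder, and the same would apply to `availability-events` and `practitioner-events`:

```shell
# Hypothetical creation of one of the three topics; values mirror the table.
kafka-topics.sh --create \
  --bootstrap-server b-1.msk.example.internal:9092 \
  --topic booking-events \
  --partitions 6 \
  --replication-factor 3 \
  --config min.insync.replicas=2 \
  --config retention.ms=604800000    # 7 days in milliseconds
```

With `min.insync.replicas=2` and producers using `acks=all`, a write succeeds only once two of the three replicas have it, so the loss of any single broker neither loses data nor blocks writes.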
Consequences¶
Positive:
- The Booking Service publishes one event and is done. All downstream updates are decoupled — downstream service failures do not affect booking success or latency.
- Per-`practitioner_id` partitioning guarantees that events for a given practitioner are processed in order by each consumer group. This prevents the Availability Service from applying a `BookingCancelled` before the `BookingCreated` it was cancelling.
- 7-day retention allows the OpenSearch index and Redis availability bitmaps to be fully rebuilt from the event log if either is lost or corrupted — no separate backup/restore procedure needed.
- Adding a new consumer (e.g., an analytics service) requires no changes to existing producers or other consumers.
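The replay-based rebuild amounts to a fold over the retained event log: start from an empty read model and re-apply every event in order. A sketch under illustrative assumptions — the event fields (`type`, `practitioner_id`, `slot`) and the dict-of-sets standing in for the Redis bitmap are not the services' actual schema:

```python
from typing import Iterable

def rebuild_availability(events: Iterable[dict]) -> dict:
    """Rebuild per-practitioner booked slots from a replayed event log.

    A dict of sets stands in for the Redis bitmap; the event shape
    is illustrative, not the real schema.
    """
    booked: dict = {}
    for event in events:  # replay delivers per-practitioner publish order
        slots = booked.setdefault(event["practitioner_id"], set())
        if event["type"] == "BookingCreated":
            slots.add(event["slot"])
        elif event["type"] == "BookingCancelled":
            slots.discard(event["slot"])
    return booked

log = [
    {"type": "BookingCreated",   "practitioner_id": "p1", "slot": "09:00"},
    {"type": "BookingCreated",   "practitioner_id": "p1", "slot": "10:00"},
    {"type": "BookingCancelled", "practitioner_id": "p1", "slot": "09:00"},
]
assert rebuild_availability(log) == {"p1": {"10:00"}}
```

This is why per-partition ordering matters for replay too: applying the `BookingCancelled` before its `BookingCreated` would leave the 09:00 slot incorrectly marked as booked.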
Negative:
- MSK adds operational overhead: broker scaling, partition management, consumer group lag monitoring.
- At-least-once delivery means consumers must be idempotent. The Availability Service uses the booking ID as an idempotency key when updating the Redis bitmap.
- Kafka's ordering guarantee is per-partition, not global. Events for different practitioners may be interleaved across partitions — this is intentional and correct.
Why not SQS: SQS does not support multiple independent consumer groups reading the same queue. Fan-out would require separate SQS queues per consumer, and there is no replay capability — once a message is consumed and deleted, it is gone.
Why not EventBridge: EventBridge is well-suited for rule-based routing but has lower throughput limits and no native replay from a persistent log. Rebuilding a read model from EventBridge would require a separate data store (e.g., S3 archive).