ADR-003: Kafka (Amazon MSK) for Async Event Propagation

Status: Accepted


Context

Multiple downstream systems need to react to booking lifecycle events:

  • The Availability Service must update its Redis bitmap when a booking is created or cancelled.
  • The Search Service must update OpenSearch when availability changes.
  • The Notification Service must send email and SMS confirmations and cancellation notices.

These updates are not required to be synchronous with the booking write. Making the Booking Service synchronously call three other services would:

  1. Increase booking P99 latency by the sum of all downstream call latencies.
  2. Couple booking success to the availability of downstream services (if the Notification Service is down, bookings fail).
  3. Make it harder to add new consumers in future without modifying the Booking Service.

The options evaluated were:

  1. Amazon MSK (Managed Kafka) — durable log, ordered per partition, replayable
  2. Amazon SQS — managed queue, at-least-once delivery, no ordering by default
  3. Amazon EventBridge — managed event bus, schema registry, fan-out to rules
  4. Synchronous HTTP fan-out — simplest, but tightly coupled

Decision

Use Amazon MSK (Managed Kafka) with three topics: booking-events, availability-events, and practitioner-events. All topics are partitioned by practitioner_id to guarantee per-practitioner event ordering.
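As a sketch of why keying works: Kafka's default partitioner hashes the record key to choose a partition, so every event carrying the same practitioner_id lands on the same partition and is consumed in order. The stand-in below uses SHA-256 rather than Kafka's actual murmur2 hash, but it demonstrates the same property — a deterministic key-to-partition mapping:

```python
import hashlib

def partition_for(practitioner_id: str, num_partitions: int = 6) -> int:
    """Illustrative stand-in for Kafka's default partitioner.

    Kafka actually uses murmur2 over the key bytes; any deterministic
    hash gives the property that matters here: a fixed key always maps
    to the same partition, so per-key ordering is preserved.
    """
    digest = hashlib.sha256(practitioner_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions
```

Because the mapping is stable, BookingCreated and BookingCancelled for the same practitioner can never race each other across partitions.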

| Setting | Value | Rationale |
| --- | --- | --- |
| Brokers | 3 × kafka.m5.xlarge | One per AZ; survives single-broker loss |
| Replication factor | 3 | Full redundancy across all 3 brokers |
| Min in-sync replicas | 2 | Producer acknowledgement requires at least 2 replicas to confirm |
| Retention | 7 days (168 hours) | Enables full read-model rebuild via replay |
| Partitions per topic | 6 | Sufficient parallelism for all consumer groups at projected throughput |
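The durability implied by min.insync.replicas = 2 only holds if producers ask for it. A hedged sketch of the relevant producer settings (standard Kafka client property names; values are a suggestion, not the deployed config):

```properties
# With replication.factor=3 and min.insync.replicas=2 on the topic:
acks=all                  # broker acknowledges only after all in-sync replicas have the record
enable.idempotence=true   # producer retries cannot duplicate or reorder writes within a partition
```

With acks=all, a write fails fast if fewer than 2 replicas are in sync, rather than silently accepting an under-replicated record.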

Consequences

Positive:

  • The Booking Service publishes one event and is done. All downstream updates are decoupled — downstream service failures do not affect booking success or latency.
  • Partitioning by practitioner_id guarantees that events for a given practitioner are processed in order within each consumer group. This prevents the Availability Service from applying a BookingCancelled before the BookingCreated it cancels.
  • 7-day retention allows the OpenSearch index and Redis availability bitmaps to be fully rebuilt from the event log if either is lost or corrupted — no separate backup/restore procedure needed.
  • Adding a new consumer (e.g., an analytics service) requires no changes to existing producers or other consumers.
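Replay and new-consumer onboarding both rest on the same mechanism: a consumer group with no committed offsets starts wherever auto.offset.reset points. A hedged sketch of the consumer settings for a full rebuild (standard Kafka property names; the group id is a hypothetical example):

```properties
group.id=availability-rebuild-001   # fresh, never-used group id => no committed offsets
auto.offset.reset=earliest          # start from the oldest retained event (up to 7 days back)
enable.auto.commit=false            # commit offsets only after rebuilt state is persisted
```

Once the rebuild catches up to the head of the log, the service can switch back to its normal consumer group or keep running under the new one.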

Negative:

  • MSK adds operational overhead: broker scaling, partition management, consumer group lag monitoring.
  • At-least-once delivery means consumers must be idempotent. The Availability Service uses the booking ID as an idempotency key when updating the Redis bitmap.
  • Kafka's ordering guarantee is per-partition, not global. Events for different practitioners may be interleaved across partitions — this is intentional and correct.
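A minimal sketch of the idempotent-consumer pattern described above, with an in-memory set standing in for the Redis-backed idempotency check (class and method names are illustrative, not the actual service code):

```python
class AvailabilityUpdater:
    """Sketch: apply each booking event at most once, keyed by booking ID.

    In the real service the seen-set lives in Redis alongside the
    availability bitmap; a plain set is used here for illustration.
    """

    def __init__(self) -> None:
        self._seen: set[str] = set()
        self.applied: list[dict] = []  # stand-in for the bitmap update

    def handle(self, event: dict) -> bool:
        booking_id = event["booking_id"]
        if booking_id in self._seen:
            # At-least-once delivery: duplicates are expected, skip them.
            return False
        self._seen.add(booking_id)
        self.applied.append(event)
        return True
```

Because the check and the update are keyed on the booking ID, a redelivered BookingCreated is a no-op rather than a double-booked slot.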

Why not SQS: SQS does not support multiple independent consumer groups reading the same queue. Fan-out would require separate SQS queues per consumer, and there is no replay capability — once a message is consumed and deleted, it is gone.

Why not EventBridge: EventBridge is well-suited for rule-based routing but has lower throughput limits and no native replay from a persistent log. Rebuilding a read model from EventBridge would require a separate data store (e.g., S3 archive).