# Fault Tolerance & Disaster Recovery
The platform is designed with no single point of failure: every major component has an explicit failure mode and mitigation. This page documents what happens when something breaks, what the system does automatically, and what requires operator intervention.
## Component Failure Modes
| Component | Failure Mode | Mitigation | Recovery Time |
|---|---|---|---|
| RDS (Booking) | Primary instance down | Multi-AZ automatic failover — Aurora promotes a reader replica | ~60 seconds |
| ElastiCache | Node failure | Automatic failover to replica within the shard | ~30 seconds |
| OpenSearch | Data node down | Replica shards are promoted; cluster self-heals by rebalancing | ~minutes |
| MSK (Kafka) | Single broker down | 3-broker cluster with replication factor 3; min ISR 2 — no data loss | Immediate (no failover needed) |
| EKS pods | Pod crash | Kubernetes restarts the pod; HPA maintains the minimum replica count | Seconds (restart) |
| Redis unavailable | Cache miss on availability bitmap | Booking Service falls back to an RDS query for the availability check — slower but correct | No downtime; degraded latency |
| OpenSearch unavailable | Search queries fail | Booking and availability remain fully functional; search returns an error | No booking impact |
| Availability Service down | Stale bitmap | Booking Service falls back to direct RDS availability query | No booking failures |
| Single AZ loss | Subset of pods and replicas unavailable | Multi-AZ pod anti-affinity means other AZs maintain quorum; NAT Gateways are per-AZ | Automatic failover |
## Key Design Decisions for Resilience
### Multi-AZ Everything
Every stateful component is deployed across 3 AZs:
- RDS Aurora: 3-AZ cluster with reader replica in a separate AZ from the writer
- ElastiCache: 3 shards, 2 replicas per shard, spread across AZs
- MSK: 3 brokers, one per AZ
- OpenSearch: 3 data nodes, one per AZ
- EKS nodes: Managed node groups span all 3 AZs; pod anti-affinity prefers AZ spread
The NAT Gateway architecture is AZ-local (one per AZ) so a NAT Gateway failure only affects outbound traffic from one AZ's private subnet, not all three.
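The AZ-spread scheduling preference described above could be expressed roughly as follows; this is a minimal sketch, and the Deployment name, labels, and image are illustrative, not taken from the actual charts:

```yaml
# Sketch: prefer spreading replicas across AZs (soft anti-affinity),
# so scheduling still succeeds even after an entire AZ is lost.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: booking-service
spec:
  replicas: 3
  selector:
    matchLabels:
      app: booking-service
  template:
    metadata:
      labels:
        app: booking-service
    spec:
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              podAffinityTerm:
                topologyKey: topology.kubernetes.io/zone
                labelSelector:
                  matchLabels:
                    app: booking-service
      containers:
        - name: booking-service
          image: booking-service:latest
```

Using `preferred` rather than `required` anti-affinity is the usual trade-off here: a hard requirement would leave pods unschedulable during a full-AZ outage instead of packing the surviving zones.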
### Graceful Degradation
The system is designed to degrade gracefully under partial failures rather than failing completely:
#### Booking still works when Search is down
Patients who already know which practitioner they want can book directly via the Booking Service. The search path is unavailable, but no bookings are lost.
#### Booking still works when Redis is unavailable
The Redis pre-check is an optimisation, not a correctness gate. The Booking Service falls back to an RDS query. Throughput is reduced (more DB load), but bookings succeed.
#### Search continues when OpenSearch is rebuilding
During an OpenSearch cluster recovery, queries that hit unavailable shards return partial results or errors. The Booking and Availability services are unaffected.
### Event Replay for Read Model Recovery
Both OpenSearch and Redis are derived read models built from the Kafka event log. If either is lost or corrupted:
1. Drop the affected index or flush the Redis keyspace.
2. Reset the consumer group offset to the beginning (up to 7 days of history).
3. The consumer reprocesses all events and rebuilds the read model.
This eliminates the need for separate backup/restore procedures for search indices or Redis bitmaps.
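The offset reset in step 2 can be done with the standard Kafka CLI. The bootstrap address, consumer group, and topic names below are assumptions for illustration; the consumer must be stopped before resetting its offsets:

```shell
# Rewind the search indexer's committed offsets to the start of retention.
# The consumer group must have no active members when this runs.
kafka-consumer-groups.sh --bootstrap-server "$MSK_BOOTSTRAP" \
  --group search-indexer \
  --topic booking-events \
  --reset-offsets --to-earliest --execute
# Restart the consumer: it reprocesses up to 7 days of events
# and rebuilds the read model from scratch.
```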
## Kubernetes Resilience
### Horizontal Pod Autoscaler
Every service runs with an HPA. Minimum replicas in production (`values-production.yaml`):
| Service | Min Replicas (Production) | Max Replicas |
|---|---|---|
| Booking | 3 | 20 |
| Search | 3 | 30 |
| Availability | 3 | 10 |
| Patient | 3 | 10 |
| Practitioner | 3 | 10 |
| Notification | 3 | 10 |
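For the Booking service, the table above could translate to an HPA like the following. The min/max replica counts match the table; the CPU utilization target and resource names are assumptions, not values from the actual chart:

```yaml
# Sketch of the Booking HPA (autoscaling/v2).
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: booking
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: booking
  minReplicas: 3    # from values-production.yaml
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # assumed threshold
```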
### Pod Disruption Budget
Every service has a PDB. In production (`values-production.yaml`), `minAvailable: 2` ensures that node drains and cluster upgrades never take more than one replica offline at a time.
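A PDB with that constraint is a short manifest; the name and label selector here are illustrative:

```yaml
# Sketch: with 3 minimum replicas and minAvailable: 2, voluntary
# disruptions (drains, upgrades) can evict at most one pod at a time.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: booking
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: booking
```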
### Liveness and Readiness Probes
All services expose /healthz (liveness) and /readyz (readiness) endpoints:
- Readiness — a pod only receives traffic when it can serve requests. This prevents traffic from being sent to pods that are still warming up or waiting for a DB connection.
- Liveness — unhealthy pods are restarted automatically. The startup probe allows up to 5 minutes for slow initial starts (e.g., during DB migration).
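The probe configuration might look like the sketch below. The port and most timings are assumptions; the startup allowance is sized to the ~5-minute budget mentioned above (30 failures × 10 s):

```yaml
# Sketch of per-container probe configuration.
livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  periodSeconds: 10
  failureThreshold: 3
readinessProbe:
  httpGet:
    path: /readyz
    port: 8080
  periodSeconds: 5
  failureThreshold: 3
startupProbe:
  httpGet:
    path: /healthz
    port: 8080
  periodSeconds: 10
  failureThreshold: 30   # 30 × 10 s = up to 5 minutes for slow initial starts
```

While the startup probe is failing, liveness and readiness probes are suspended, so a slow DB migration does not trigger restart loops.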
### Helm `--atomic` Flag
All Helm deploys use --atomic. If a rollout fails to reach the ready state within the timeout window, Helm automatically rolls back to the previous release. This means a bad deploy never leaves the cluster in a partially upgraded state.
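An illustrative deploy command, assuming release and chart names plus a 10-minute timeout (none of which are confirmed by the actual pipeline):

```shell
# --atomic: if the release does not reach Ready within --timeout,
# Helm rolls back to the previous release automatically.
helm upgrade --install booking ./charts/booking \
  -f values-production.yaml \
  --atomic --timeout 10m
```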
## Disaster Recovery
### RDS Backup and Restore
- Automated snapshots with 7-day retention (daily backup window 03:00–04:00 UTC)
- Point-in-time recovery available within the 7-day window
- Manual snapshot before any major schema migration
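The pre-migration manual snapshot can be taken with the AWS CLI; the cluster identifier below is an assumption:

```shell
# Manual Aurora cluster snapshot before a major schema migration.
aws rds create-db-cluster-snapshot \
  --db-cluster-identifier booking-aurora \
  --db-cluster-snapshot-identifier "booking-pre-migration-$(date +%Y%m%d)"
```

Unlike automated snapshots, manual snapshots are retained until explicitly deleted, so they survive beyond the 7-day window.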
### Terraform State
- State stored in S3 with server-side encryption
- DynamoDB locking prevents concurrent applies
- S3 versioning retains state history for rollback
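The three points above correspond to a backend block along these lines; the bucket, key, and table names are placeholders:

```hcl
# Sketch of the S3 backend with DynamoDB state locking.
terraform {
  backend "s3" {
    bucket         = "platform-terraform-state"
    key            = "env/production/terraform.tfstate"
    region         = "eu-west-1"
    encrypt        = true               # server-side encryption at rest
    dynamodb_table = "terraform-locks"  # prevents concurrent applies
  }
}
```

S3 versioning itself is enabled on the bucket, not in this block, which is what retains the state history for rollback.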
### Runbook: Total Region Failure
In the event of a full eu-west-1 outage:
1. Declare the incident and activate the DR runbook.
2. Stand up a new regional stack using the same Terraform modules in eu-west-3 (Paris) or another EU region.
3. Restore RDS from the latest automated snapshot (RDS cross-region snapshot copy must be pre-configured — not yet automated).
4. Update DNS records to point to the new ALB.

ETA: ~2 hours RTO for a cold DR region; ~30 minutes if a warm standby is pre-provisioned.
**Cold DR gap:** Cross-region RDS snapshot copy and a warm standby stack are not yet provisioned. This is a known gap — RTO is 2+ hours in a full-region loss scenario.
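Closing the snapshot-copy half of this gap could look roughly like the command below, run in the target region. Every identifier here (account ID, snapshot names, KMS alias) is a placeholder; encrypted snapshots require a KMS key valid in the destination region:

```shell
# Copy the latest Aurora cluster snapshot from eu-west-1 into eu-west-3.
aws rds copy-db-cluster-snapshot \
  --region eu-west-3 \
  --source-region eu-west-1 \
  --source-db-cluster-snapshot-identifier \
    arn:aws:rds:eu-west-1:123456789012:cluster-snapshot:booking-daily \
  --target-db-cluster-snapshot-identifier booking-daily-dr \
  --kms-key-id alias/dr-key
```

In practice this would be scheduled (e.g. from an EventBridge rule after each backup window) rather than run by hand, which is the automation the gap note refers to.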