# Fault Tolerance & Disaster Recovery
The platform is designed with no single point of failure: every major component has an explicit failure mode and mitigation. This page documents what happens when something breaks, what the system does automatically, and what requires operator intervention.
## Component Failure Modes
| Component | Failure Mode | Mitigation | Recovery Time |
|---|---|---|---|
| RDS (Booking) | Primary instance down | Multi-AZ automatic failover — Aurora promotes a reader replica | ~60 seconds |
| ElastiCache | Node failure | Automatic failover to replica within the shard | ~30 seconds |
| OpenSearch | Data node down | Replica shards are promoted; cluster self-heals by rebalancing | ~minutes |
| MSK (Kafka) | Single broker down | 3-broker cluster with replication factor 3; min ISR 2 — no data loss | Immediate (no failover needed) |
| EKS pods | Pod crash | Kubernetes restarts the pod; HPA maintains the minimum replica count | Seconds (restart) |
| Redis unavailable | Cache miss on availability bitmap | Booking Service falls back to an RDS query for the availability check — slower but correct | No downtime; degraded latency |
| OpenSearch unavailable | Search queries fail | Booking and availability remain fully functional; search returns an error | No booking impact |
| Availability Service down | Stale bitmap | Booking Service falls back to direct RDS availability query | No booking failures |
| Single AZ loss | Subset of pods and replicas unavailable | Multi-AZ pod anti-affinity means other AZs maintain quorum; NAT Gateways are per-AZ | Automatic failover |
## Key Design Decisions for Resilience
### Multi-AZ Everything
Every stateful component is deployed across 3 AZs:
- RDS Aurora: 3-AZ cluster with reader replica in a separate AZ from the writer
- ElastiCache: 3 shards, 2 replicas per shard, spread across AZs
- MSK: 3 brokers, one per AZ
- OpenSearch: 3 data nodes, one per AZ
- EKS nodes: Managed node groups span all 3 AZs; pod anti-affinity prefers AZ spread
The NAT Gateway architecture is AZ-local (one per AZ) so a NAT Gateway failure only affects outbound traffic from one AZ's private subnet, not all three.
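The AZ-spread scheduling preference described above could be expressed roughly as follows; this is a minimal sketch, and the Deployment name, labels, and image are illustrative, not taken from the actual charts:

```yaml
# Sketch: prefer spreading replicas across AZs (soft anti-affinity),
# so scheduling still succeeds even after an entire AZ is lost.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: booking-service
spec:
  replicas: 3
  selector:
    matchLabels:
      app: booking-service
  template:
    metadata:
      labels:
        app: booking-service
    spec:
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              podAffinityTerm:
                topologyKey: topology.kubernetes.io/zone
                labelSelector:
                  matchLabels:
                    app: booking-service
      containers:
        - name: booking-service
          image: booking-service:latest
```

Using `preferred` rather than `required` anti-affinity is the usual trade-off here: a hard requirement would leave pods unschedulable during a full-AZ outage instead of packing the surviving zones.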
### Graceful Degradation
The system is designed to degrade gracefully under partial failures rather than failing completely:
#### Booking still works when Search is down
Patients who already know which practitioner they want can book directly via the Booking Service. The search path is unavailable, but no bookings are lost.
#### Booking still works when Redis is unavailable
The Redis pre-check is an optimisation, not a correctness gate. The Booking Service falls back to an RDS query. Throughput is reduced (more DB load), but bookings succeed.
#### Search continues when OpenSearch is rebuilding
During an OpenSearch cluster recovery, queries that hit unavailable shards return partial results or errors. The Booking and Availability services are unaffected.
### Event Replay for Read Model Recovery
Both OpenSearch and Redis are derived read models built from the Kafka event log. If either is lost or corrupted:
1. Drop the affected index or flush the Redis keyspace.
2. Reset the consumer group offset to the beginning (up to 7 days of history).
3. The consumer reprocesses all events and rebuilds the read model.
This eliminates the need for separate backup/restore procedures for search indices or Redis bitmaps.
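The offset reset in step 2 can be done with the standard Kafka CLI. The bootstrap address, consumer group, and topic names below are assumptions for illustration; the consumer must be stopped before resetting its offsets:

```shell
# Rewind the search indexer's committed offsets to the start of retention.
# The consumer group must have no active members when this runs.
kafka-consumer-groups.sh --bootstrap-server "$MSK_BOOTSTRAP" \
  --group search-indexer \
  --topic booking-events \
  --reset-offsets --to-earliest --execute
# Restart the consumer: it reprocesses up to 7 days of events
# and rebuilds the read model from scratch.
```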
## Kubernetes Resilience
### Horizontal Pod Autoscaler
Every service runs with an HPA. Minimum replicas in production (`values-production.yaml`):
| Service | Min Replicas (Production) | Max Replicas |
|---|---|---|
| Booking | 3 | 20 |
| Search | 3 | 30 |
| Availability | 3 | 10 |
| Patient | 3 | 10 |
| Practitioner | 3 | 10 |
| Notification | 3 | 10 |
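For the Booking service, the table above could translate to an HPA like the following. The min/max replica counts match the table; the CPU utilization target and resource names are assumptions, not values from the actual chart:

```yaml
# Sketch of the Booking HPA (autoscaling/v2).
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: booking
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: booking
  minReplicas: 3    # from values-production.yaml
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # assumed threshold
```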
### Pod Disruption Budget
Every service has a PDB. In production (`values-production.yaml`), `minAvailable: 2` ensures that node drains and cluster upgrades never take more than one replica offline at a time.
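A PDB with that constraint is a short manifest; the name and label selector here are illustrative:

```yaml
# Sketch: with 3 minimum replicas and minAvailable: 2, voluntary
# disruptions (drains, upgrades) can evict at most one pod at a time.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: booking
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: booking
```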
### Liveness and Readiness Probes
All services expose /healthz (liveness) and /readyz (readiness) endpoints:
- Readiness — a pod only receives traffic when it can serve requests. This prevents traffic from being sent to pods that are still warming up or waiting for a DB connection.
- Liveness — unhealthy pods are restarted automatically. The startup probe allows up to 5 minutes for slow initial starts (e.g., during DB migration).
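The probe configuration might look like the sketch below. The port and most timings are assumptions; the startup allowance is sized to the ~5-minute budget mentioned above (30 failures × 10 s):

```yaml
# Sketch of per-container probe configuration.
livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  periodSeconds: 10
  failureThreshold: 3
readinessProbe:
  httpGet:
    path: /readyz
    port: 8080
  periodSeconds: 5
  failureThreshold: 3
startupProbe:
  httpGet:
    path: /healthz
    port: 8080
  periodSeconds: 10
  failureThreshold: 30   # 30 × 10 s = up to 5 minutes for slow initial starts
```

While the startup probe is failing, liveness and readiness probes are suspended, so a slow DB migration does not trigger restart loops.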
### Helm `--atomic` Flag
All Helm deploys use --atomic. If a rollout fails to reach the ready state within the timeout window, Helm automatically rolls back to the previous release. This means a bad deploy never leaves the cluster in a partially upgraded state.
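An illustrative deploy command, assuming release and chart names plus a 10-minute timeout (none of which are confirmed by the actual pipeline):

```shell
# --atomic: if the release does not reach Ready within --timeout,
# Helm rolls back to the previous release automatically.
helm upgrade --install booking ./charts/booking \
  -f values-production.yaml \
  --atomic --timeout 10m
```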
## Disaster Recovery
### RDS Backup and Restore
- Automated snapshots with 7-day retention (daily backup window 03:00–04:00 UTC)
- Point-in-time recovery available within the 7-day window
- Manual snapshot before any major schema migration
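The pre-migration manual snapshot can be taken with the AWS CLI; the cluster identifier below is an assumption:

```shell
# Manual Aurora cluster snapshot before a major schema migration.
aws rds create-db-cluster-snapshot \
  --db-cluster-identifier booking-aurora \
  --db-cluster-snapshot-identifier "booking-pre-migration-$(date +%Y%m%d)"
```

Unlike automated snapshots, manual snapshots are retained until explicitly deleted, so they survive beyond the 7-day window.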
### Terraform State
- State stored in S3 with server-side encryption
- DynamoDB locking prevents concurrent applies
- S3 versioning retains state history for rollback
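The three points above correspond to a backend block along these lines; the bucket, key, and table names are placeholders:

```hcl
# Sketch of the S3 backend with DynamoDB state locking.
terraform {
  backend "s3" {
    bucket         = "platform-terraform-state"
    key            = "env/production/terraform.tfstate"
    region         = "eu-west-1"
    encrypt        = true               # server-side encryption at rest
    dynamodb_table = "terraform-locks"  # prevents concurrent applies
  }
}
```

S3 versioning itself is enabled on the bucket, not in this block, which is what retains the state history for rollback.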
### Runbook: Total Region Failure
In the event of a full eu-west-1 outage:
1. Declare the incident and activate the DR runbook.
2. Stand up a new regional stack using the same Terraform modules in eu-west-3 (Paris) or another EU region.
3. Restore RDS from the latest automated snapshot (RDS cross-region snapshot copy must be pre-configured — not yet automated).
4. Update DNS records to point to the new ALB.

ETA: ~2 hours RTO for a cold DR region; ~30 minutes if a warm standby is pre-provisioned.
**Cold DR gap:** Cross-region RDS snapshot copy and a warm standby stack are not yet provisioned. This is a known gap — RTO is 2+ hours in a full-region loss scenario.
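Closing the snapshot-copy half of this gap could look roughly like the command below, run in the target region. Every identifier here (account ID, snapshot names, KMS alias) is a placeholder; encrypted snapshots require a KMS key valid in the destination region:

```shell
# Copy the latest Aurora cluster snapshot from eu-west-1 into eu-west-3.
aws rds copy-db-cluster-snapshot \
  --region eu-west-3 \
  --source-region eu-west-1 \
  --source-db-cluster-snapshot-identifier \
    arn:aws:rds:eu-west-1:123456789012:cluster-snapshot:booking-daily \
  --target-db-cluster-snapshot-identifier booking-daily-dr \
  --kms-key-id alias/dr-key
```

In practice this would be scheduled (e.g. from an EventBridge rule after each backup window) rather than run by hand, which is the automation the gap note refers to.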