Monitoring & Observability¶
The platform follows a three-pillar observability model: metrics, logs, and traces. All three are collected from EKS pods and AWS managed services with minimal changes to service code: infrastructure signals come from agents and managed integrations, while application metrics rely on standard client libraries.
Metrics¶
Amazon CloudWatch¶
AWS-native metrics are available out of the box for all managed services:
| Service | Key Metrics |
|---|---|
| RDS Aurora | CPUUtilization, DatabaseConnections, WriteLatency, ReadLatency, FreeableMemory |
| ElastiCache | CurrConnections, CacheHits, CacheMisses, EngineCPUUtilization, NetworkBytesIn/Out |
| OpenSearch | IndexingRate, SearchRate, JVMMemoryPressure, KibanaHealthyNodes |
| MSK (Kafka) | BytesInPerSec, BytesOutPerSec, MaxOffsetLag / SumOffsetLag (consumer group lag), UnderReplicatedPartitions |
| ALB | RequestCount, TargetResponseTime, HTTPCode_Target_5XX_Count, HealthyHostCount |
Prometheus (EKS Add-on)¶
Application-level metrics are scraped by Prometheus running as an EKS add-on. Services expose a /metrics endpoint (Prometheus format) on a separate port. Standard Kubernetes and Node.js/JVM client libraries emit:
- HTTP request rate, error rate, and latency histograms (RED metrics)
- Database connection pool utilisation
- Kafka consumer lag (via the Kafka client's built-in JMX metrics on JVM services)
- JVM heap usage (for Java runtimes)
- Node.js event loop lag and active handles
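The Prometheus text exposition format these client libraries emit is straightforward to reason about. A minimal sketch of rendering a latency histogram the way a /metrics endpoint would serve it (the metric name and bucket bounds here are illustrative, not taken from the platform):

```typescript
// Render a latency histogram in Prometheus text exposition format.
// Bucket bounds and observations are illustrative sample data.
function renderHistogram(name: string, values: number[], bounds: number[]): string {
  const lines: string[] = [
    `# HELP ${name} HTTP request latency in seconds`,
    `# TYPE ${name} histogram`,
  ];
  for (const le of bounds) {
    // Histogram buckets are cumulative: each counts all values <= its bound.
    const cumulative = values.filter((v) => v <= le).length;
    lines.push(`${name}_bucket{le="${le}"} ${cumulative}`);
  }
  lines.push(`${name}_bucket{le="+Inf"} ${values.length}`);
  lines.push(`${name}_sum ${values.reduce((a, b) => a + b, 0)}`);
  lines.push(`${name}_count ${values.length}`);
  return lines.join("\n");
}

const observations = [0.03, 0.07, 0.2, 0.4, 0.09, 1.8];
console.log(renderHistogram("http_request_duration_seconds", observations, [0.05, 0.1, 0.25, 0.5, 1, 2.5]));
```

In practice the prom-client (Node.js) or Micrometer (JVM) libraries maintain these buckets; the sketch only shows what the scraped output looks like.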
Grafana¶
Grafana runs on EKS and is configured with two data sources:
- Prometheus — for EKS pod-level metrics and application metrics
- CloudWatch — for managed service metrics (RDS, ElastiCache, MSK, ALB)
Key dashboards:
- System Overview — RPS, error rate, P99 latency per service
- Booking Write Path — booking success/failure rate, DB write latency, Kafka produce latency
- Search Performance — OpenSearch query latency, cache hit rate, Redis memory utilisation
- Infrastructure Health — RDS connections and replication lag, ElastiCache eviction rate, MSK consumer lag
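The RED panels on the System Overview dashboard can be expressed with PromQL along these lines (metric and label names assume the conventional `http_requests_total` counter and `http_request_duration_seconds` histogram, not confirmed names from the platform):

```promql
# Requests per second, per service
sum by (service) (rate(http_requests_total[5m]))

# Error rate (share of 5xx responses)
sum by (service) (rate(http_requests_total{status=~"5.."}[5m]))
  / sum by (service) (rate(http_requests_total[5m]))

# P99 latency from the histogram buckets
histogram_quantile(0.99,
  sum by (service, le) (rate(http_request_duration_seconds_bucket[5m])))
```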
Logging¶
Fluent Bit → CloudWatch Logs¶
Fluent Bit runs as a DaemonSet on every EKS node. It tails container stdout/stderr, enriches log records with Kubernetes metadata (namespace, pod name, container name, node name), and ships to CloudWatch Logs.
Log groups per service:
/eks/appointment-system/booking-service
/eks/appointment-system/search-service
/eks/appointment-system/availability-service
/eks/appointment-system/patient-service
/eks/appointment-system/practitioner-service
/eks/appointment-system/notification-service
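The DaemonSet pipeline described above corresponds roughly to the following Fluent Bit configuration. This is a sketch: the region, paths, and the static `log_group_name` are assumptions (a real deployment would template the log group per service, e.g. from a pod label), not the platform's actual ConfigMap.

```ini
[INPUT]
    Name              tail
    Tag               kube.*
    Path              /var/log/containers/*.log
    multiline.parser  cri

[FILTER]
    # Enriches records with namespace, pod, container, and node metadata
    Name              kubernetes
    Match             kube.*
    Merge_Log         On

[OUTPUT]
    Name              cloudwatch_logs
    Match             kube.*
    region            eu-west-1
    log_group_name    /eks/appointment-system/booking-service
    log_stream_prefix pod-
    auto_create_group On
```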
All services emit structured JSON logs with at minimum:
{
  "level": "info",
  "timestamp": "2026-03-30T12:00:00.000Z",
  "service": "booking-service",
  "traceId": "abc123",
  "requestId": "req-xyz",
  "message": "Booking created",
  "bookingId": "b-001",
  "practitionerId": "p-001",
  "durationMs": 42
}
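A minimal logging helper that produces this field contract might look like the sketch below. The `emitLog` helper is illustrative (in practice a library such as pino or logback's JSON encoder fills this role); only the field names come from the example above.

```typescript
// Fields every log line must carry beyond the standard envelope.
interface LogFields {
  traceId: string;
  requestId: string;
  message: string;
  [extra: string]: unknown; // service-specific fields like bookingId, durationMs
}

// Emits one JSON object per line on stdout, where Fluent Bit tails it.
function emitLog(level: "info" | "warn" | "error", service: string, fields: LogFields): string {
  const record = {
    level,
    timestamp: new Date().toISOString(),
    service,
    ...fields,
  };
  return JSON.stringify(record);
}

console.log(
  emitLog("info", "booking-service", {
    traceId: "abc123",
    requestId: "req-xyz",
    message: "Booking created",
    bookingId: "b-001",
    durationMs: 42,
  })
);
```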
MSK Broker Logs¶
MSK broker logs are shipped to CloudWatch under /aws/msk/{project} with 14-day retention. This captures topic configuration changes, consumer group rebalances, and broker errors.
Tracing¶
AWS X-Ray / OpenTelemetry¶
Distributed traces are collected via the AWS Distro for OpenTelemetry (ADOT) collector, running as a sidecar or DaemonSet. The ADOT collector exports to AWS X-Ray.
Every inbound HTTP request generates a trace with spans for:
- ALB → service ingress
- Service handler duration
- Outbound RDS queries (instrumented via the database driver)
- Outbound Redis calls
- Outbound OpenSearch queries
- Kafka produce calls
- Outbound SES/SNS calls
The traceId is propagated via HTTP headers (X-Amzn-Trace-Id) and included in all structured log lines, allowing a single trace ID to correlate logs and spans across services.
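The X-Amzn-Trace-Id header carries semicolon-separated key=value segments (Root, Parent, Sampled). A sketch of extracting the Root segment so it can be attached to structured log lines as `traceId` (the `parseAmznTraceId` helper is illustrative; ADOT/X-Ray SDKs normally do this):

```typescript
interface TraceContext {
  traceId: string;      // the Root segment, e.g. 1-5759e988-bd862e3fe1be46a994272793
  parentId?: string;    // upstream span id, if present
  sampled?: boolean;    // whether X-Ray chose to record this trace
}

function parseAmznTraceId(header: string): TraceContext {
  const parts = new Map<string, string>();
  for (const segment of header.split(";")) {
    const [key, value] = segment.split("=");
    if (key && value) parts.set(key.trim(), value.trim());
  }
  return {
    traceId: parts.get("Root") ?? "",
    parentId: parts.get("Parent"),
    sampled: parts.get("Sampled") === "1",
  };
}

const ctx = parseAmznTraceId(
  "Root=1-5759e988-bd862e3fe1be46a994272793;Parent=53995c3f42cd8ad8;Sampled=1"
);
console.log(ctx.traceId); // 1-5759e988-bd862e3fe1be46a994272793
```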
Alerting¶
Alerts are configured in CloudWatch Alarms and routed to the on-call channel:
| Alert | Condition | Severity |
|---|---|---|
| Booking error rate | 5XX_Count / RequestCount > 1% over 5 minutes | Critical |
| RDS high connections | DatabaseConnections > 80% of max_connections | Warning |
| Kafka consumer lag | MaxOffsetLag > 10,000 for any consumer group | Warning |
| Redis evictions | Evictions > 0 (indicates memory pressure) | Warning |
| EKS pod crash loop | CrashLoopBackOff detected via kube-state-metrics | Critical |
| OpenSearch cluster health | ClusterStatus.red = 1 | Critical |
| ALB 5XX spike | HTTPCode_ELB_5XX_Count > 10 per minute | Critical |
Booking SLO¶
The primary SLO is booking success rate >= 99.9% over a rolling 30-minute window. The Booking error rate alert is the primary SLO burn-rate signal.
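The relationship between the SLO and the alert threshold: a 99.9% success objective leaves a 0.1% error budget, so a 1% observed error rate burns that budget at 10x the sustainable rate. A sketch of the arithmetic (the `burnRate` helper is illustrative, not part of the platform):

```typescript
const sloTarget = 0.999;
const errorBudget = 1 - sloTarget; // 0.1% of requests may fail

// Burn rate = observed error rate relative to the error budget.
// 1.0 means the budget is consumed exactly at the sustainable pace;
// the alert threshold of 1% errors corresponds to a 10x burn rate.
function burnRate(errors: number, total: number): number {
  if (total === 0) return 0;
  return errors / total / errorBudget;
}

console.log(burnRate(10, 1000)); // 1% errors over the window
```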