Monitoring & Observability¶
The platform follows a three-pillar observability model: metrics, logs, and traces. All three are collected from EKS pods and AWS managed services with minimal changes to service code: infrastructure signals come from agents and managed integrations, while application metrics rely on standard client libraries.
Metrics¶
Amazon CloudWatch¶
AWS-native metrics are available out of the box for all managed services:
| Service | Key Metrics |
|---|---|
| RDS Aurora | CPUUtilization, DatabaseConnections, WriteLatency, ReadLatency, FreeableMemory |
| ElastiCache | CurrConnections, CacheHits, CacheMisses, EngineCPUUtilization, NetworkBytesIn/Out |
| OpenSearch | IndexingRate, SearchRate, JVMMemoryPressure, KibanaHealthyNodes |
| MSK (Kafka) | BytesInPerSec, BytesOutPerSec, MaxOffsetLag / SumOffsetLag (consumer group lag), UnderReplicatedPartitions |
| ALB | RequestCount, TargetResponseTime, HTTPCode_Target_5XX_Count, HealthyHostCount |
Prometheus (EKS Add-on)¶
Application-level metrics are scraped by Prometheus running as an EKS add-on. Services expose a /metrics endpoint (Prometheus format) on a separate port. Standard Kubernetes and Node.js/JVM client libraries emit:
- HTTP request rate, error rate, and latency histograms (RED metrics)
- Database connection pool utilisation
- Kafka consumer lag (via the Kafka client's built-in JMX metrics on JVM services)
- JVM heap usage (for Java runtimes)
- Node.js event loop lag and active handles
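The Prometheus text exposition format these client libraries emit is straightforward to reason about. A minimal sketch of rendering a latency histogram the way a /metrics endpoint would serve it (the metric name and bucket bounds here are illustrative, not taken from the platform):

```typescript
// Render a latency histogram in Prometheus text exposition format.
// Bucket bounds and observations are illustrative sample data.
function renderHistogram(name: string, values: number[], bounds: number[]): string {
  const lines: string[] = [
    `# HELP ${name} HTTP request latency in seconds`,
    `# TYPE ${name} histogram`,
  ];
  for (const le of bounds) {
    // Histogram buckets are cumulative: each counts all values <= its bound.
    const cumulative = values.filter((v) => v <= le).length;
    lines.push(`${name}_bucket{le="${le}"} ${cumulative}`);
  }
  lines.push(`${name}_bucket{le="+Inf"} ${values.length}`);
  lines.push(`${name}_sum ${values.reduce((a, b) => a + b, 0)}`);
  lines.push(`${name}_count ${values.length}`);
  return lines.join("\n");
}

const observations = [0.03, 0.07, 0.2, 0.4, 0.09, 1.8];
console.log(renderHistogram("http_request_duration_seconds", observations, [0.05, 0.1, 0.25, 0.5, 1, 2.5]));
```

In practice the prom-client (Node.js) or Micrometer (JVM) libraries maintain these buckets; the sketch only shows what the scraped output looks like.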
Grafana¶
Grafana runs on EKS and is configured with two data sources:
- Prometheus — for EKS pod-level metrics and application metrics
- CloudWatch — for managed service metrics (RDS, ElastiCache, MSK, ALB)
Key dashboards:
- System Overview — RPS, error rate, P99 latency per service
- Booking Write Path — booking success/failure rate, DB write latency, Kafka produce latency
- Search Performance — OpenSearch query latency, cache hit rate, Redis memory utilisation
- Infrastructure Health — RDS connections and replication lag, ElastiCache eviction rate, MSK consumer lag
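The RED panels on the System Overview dashboard can be expressed with PromQL along these lines (metric and label names assume the conventional `http_requests_total` counter and `http_request_duration_seconds` histogram, not confirmed names from the platform):

```promql
# Requests per second, per service
sum by (service) (rate(http_requests_total[5m]))

# Error rate (share of 5xx responses)
sum by (service) (rate(http_requests_total{status=~"5.."}[5m]))
  / sum by (service) (rate(http_requests_total[5m]))

# P99 latency from the histogram buckets
histogram_quantile(0.99,
  sum by (service, le) (rate(http_request_duration_seconds_bucket[5m])))
```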
Logging¶
Fluent Bit → CloudWatch Logs¶
Fluent Bit runs as a DaemonSet on every EKS node. It tails container stdout/stderr, enriches log records with Kubernetes metadata (namespace, pod name, container name, node name), and ships to CloudWatch Logs.
Log groups per service:
/eks/appointment-system/booking-service
/eks/appointment-system/search-service
/eks/appointment-system/availability-service
/eks/appointment-system/patient-service
/eks/appointment-system/practitioner-service
/eks/appointment-system/notification-service
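The DaemonSet pipeline described above corresponds roughly to the following Fluent Bit configuration. This is a sketch: the region, paths, and the static `log_group_name` are assumptions (a real deployment would template the log group per service, e.g. from a pod label), not the platform's actual ConfigMap.

```ini
[INPUT]
    Name              tail
    Tag               kube.*
    Path              /var/log/containers/*.log
    multiline.parser  cri

[FILTER]
    # Enriches records with namespace, pod, container, and node metadata
    Name              kubernetes
    Match             kube.*
    Merge_Log         On

[OUTPUT]
    Name              cloudwatch_logs
    Match             kube.*
    region            eu-west-1
    log_group_name    /eks/appointment-system/booking-service
    log_stream_prefix pod-
    auto_create_group On
```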
All services emit structured JSON logs with at minimum:
{
  "level": "info",
  "timestamp": "2026-03-30T12:00:00.000Z",
  "service": "booking-service",
  "traceId": "abc123",
  "requestId": "req-xyz",
  "message": "Booking created",
  "bookingId": "b-001",
  "practitionerId": "p-001",
  "durationMs": 42
}
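A minimal logging helper that produces this field contract might look like the sketch below. The `emitLog` helper is illustrative (in practice a library such as pino or logback's JSON encoder fills this role); only the field names come from the example above.

```typescript
// Fields every log line must carry beyond the standard envelope.
interface LogFields {
  traceId: string;
  requestId: string;
  message: string;
  [extra: string]: unknown; // service-specific fields like bookingId, durationMs
}

// Emits one JSON object per line on stdout, where Fluent Bit tails it.
function emitLog(level: "info" | "warn" | "error", service: string, fields: LogFields): string {
  const record = {
    level,
    timestamp: new Date().toISOString(),
    service,
    ...fields,
  };
  return JSON.stringify(record);
}

console.log(
  emitLog("info", "booking-service", {
    traceId: "abc123",
    requestId: "req-xyz",
    message: "Booking created",
    bookingId: "b-001",
    durationMs: 42,
  })
);
```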
MSK Broker Logs¶
MSK broker logs are shipped to CloudWatch under /aws/msk/{project} with 14-day retention. This captures topic configuration changes, consumer group rebalances, and broker errors.
Tracing¶
AWS X-Ray / OpenTelemetry¶
Distributed traces are collected via the AWS Distro for OpenTelemetry (ADOT) collector, running as a sidecar or DaemonSet. The ADOT collector exports to AWS X-Ray.
Every inbound HTTP request generates a trace with spans for:
- ALB → service ingress
- Service handler duration
- Outbound RDS queries (instrumented via the database driver)
- Outbound Redis calls
- Outbound OpenSearch queries
- Kafka produce calls
- Outbound SES/SNS calls
The traceId is propagated via HTTP headers (X-Amzn-Trace-Id) and included in all structured log lines, allowing a single trace ID to correlate logs and spans across services.
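The X-Amzn-Trace-Id header carries semicolon-separated key=value segments (Root, Parent, Sampled). A sketch of extracting the Root segment so it can be attached to structured log lines as `traceId` (the `parseAmznTraceId` helper is illustrative; ADOT/X-Ray SDKs normally do this):

```typescript
interface TraceContext {
  traceId: string;      // the Root segment, e.g. 1-5759e988-bd862e3fe1be46a994272793
  parentId?: string;    // upstream span id, if present
  sampled?: boolean;    // whether X-Ray chose to record this trace
}

function parseAmznTraceId(header: string): TraceContext {
  const parts = new Map<string, string>();
  for (const segment of header.split(";")) {
    const [key, value] = segment.split("=");
    if (key && value) parts.set(key.trim(), value.trim());
  }
  return {
    traceId: parts.get("Root") ?? "",
    parentId: parts.get("Parent"),
    sampled: parts.get("Sampled") === "1",
  };
}

const ctx = parseAmznTraceId(
  "Root=1-5759e988-bd862e3fe1be46a994272793;Parent=53995c3f42cd8ad8;Sampled=1"
);
console.log(ctx.traceId); // 1-5759e988-bd862e3fe1be46a994272793
```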
Alerting¶
Alerts are configured in CloudWatch Alarms and routed to the on-call channel:
| Alert | Condition | Severity |
|---|---|---|
| Booking error rate | 5XX_Count / RequestCount > 1% over 5 minutes | Critical |
| RDS high connections | DatabaseConnections > 80% of max_connections | Warning |
| Kafka consumer lag | MaxOffsetLag > 10,000 for any consumer group | Warning |
| Redis evictions | Evictions > 0 (indicates memory pressure) | Warning |
| EKS pod crash loop | CrashLoopBackOff detected via kube-state-metrics | Critical |
| OpenSearch cluster health | ClusterStatus.red = 1 | Critical |
| ALB 5XX spike | HTTPCode_ELB_5XX_Count > 10 per minute | Critical |
Booking SLO¶
The primary SLO is booking success rate >= 99.9% over a rolling 30-minute window. The Booking error rate alert is the primary SLO burn-rate signal.
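The relationship between the SLO and the alert threshold: a 99.9% success objective leaves a 0.1% error budget, so a 1% observed error rate burns that budget at 10x the sustainable rate. A sketch of the arithmetic (the `burnRate` helper is illustrative, not part of the platform):

```typescript
const sloTarget = 0.999;
const errorBudget = 1 - sloTarget; // 0.1% of requests may fail

// Burn rate = observed error rate relative to the error budget.
// 1.0 means the budget is consumed exactly at the sustainable pace;
// the alert threshold of 1% errors corresponds to a 10x burn rate.
function burnRate(errors: number, total: number): number {
  if (total === 0) return 0;
  return errors / total / errorBudget;
}

console.log(burnRate(10, 1000)); // 1% errors over the window
```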