
Monitoring & Observability

The platform follows a three-pillar observability model: metrics, logs, and traces. All three are collected from EKS pods and AWS managed services; beyond exposing a Prometheus /metrics endpoint and writing structured logs to stdout, no service-code changes are required.


Metrics

Amazon CloudWatch

AWS-native metrics are available out of the box for all managed services:

  • RDS Aurora: CPUUtilization, DatabaseConnections, WriteLatency, ReadLatency, FreeableMemory
  • ElastiCache: CurrConnections, CacheHits, CacheMisses, EngineCPUUtilization, NetworkBytesIn/Out
  • OpenSearch: IndexingRate, SearchRate, JVMMemoryPressure, KibanaHealthyNodes
  • MSK (Kafka): BytesInPerSec, BytesOutPerSec, OffsetLag (consumer group lag), UnderReplicatedPartitions
  • ALB: RequestCount, TargetResponseTime, HTTPCode_Target_5XX_Count, HealthyHostCount

Prometheus (EKS Add-on)

Application-level metrics are scraped by Prometheus, which runs as an EKS add-on. Each service exposes a /metrics endpoint (Prometheus exposition format) on a separate port from application traffic. Standard Kubernetes exporters and the Node.js/JVM client libraries emit:

  • HTTP request rate, error rate, and latency histograms (RED metrics)
  • Database connection pool utilisation
  • Kafka consumer lag (via the Kafka client's built-in JMX metrics)
  • JVM heap usage (for Java runtimes)
  • Node.js event loop lag and active handles
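To make the scrape target concrete, here is a minimal sketch of the text exposition format a /metrics endpoint serves for one latency histogram. Real services would use a client library rather than hand-rolling this, and the metric name is illustrative, not one mandated by the platform.

```typescript
// Sketch: the Prometheus text exposition format behind a latency histogram.
// Bucket boundaries are in seconds; buckets are cumulative in the output.
const buckets = [0.05, 0.1, 0.25, 0.5, 1];
const counts: number[] = new Array(buckets.length + 1).fill(0);
let sum = 0;
let total = 0;

// Record one request's latency into the histogram.
function observe(seconds: number): void {
  sum += seconds;
  total += 1;
  const i = buckets.findIndex((b) => seconds <= b);
  counts[i === -1 ? buckets.length : i] += 1;
}

// Render the histogram as Prometheus exposition text.
function render(): string {
  const lines = ["# TYPE http_request_duration_seconds histogram"];
  let cumulative = 0;
  buckets.forEach((b, i) => {
    cumulative += counts[i];
    lines.push(`http_request_duration_seconds_bucket{le="${b}"} ${cumulative}`);
  });
  lines.push(`http_request_duration_seconds_bucket{le="+Inf"} ${total}`);
  lines.push(`http_request_duration_seconds_sum ${sum}`);
  lines.push(`http_request_duration_seconds_count ${total}`);
  return lines.join("\n");
}

observe(0.042);
observe(0.3);
```

Prometheus computes rates and quantiles from the cumulative bucket counters, which is why services only ever export monotonically growing totals rather than pre-computed percentiles.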

Grafana

Grafana runs on EKS and is configured with two data sources:

  1. Prometheus — for EKS pod-level metrics and application metrics
  2. CloudWatch — for managed service metrics (RDS, ElastiCache, MSK, ALB)

Key dashboards:

  • System Overview — RPS, error rate, P99 latency per service
  • Booking Write Path — booking success/failure rate, DB write latency, Kafka produce latency
  • Search Performance — OpenSearch query latency, cache hit rate, Redis memory utilisation
  • Infrastructure Health — RDS connections and replication lag, ElastiCache eviction rate, MSK consumer lag
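The per-service P99 latency panel on the System Overview dashboard would be a PromQL query of roughly this shape; the metric and label names here are assumptions standing in for whatever the client libraries actually emit:

```promql
histogram_quantile(
  0.99,
  sum by (le, service) (rate(http_request_duration_seconds_bucket[5m]))
)
```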

Logging

Fluent Bit → CloudWatch Logs

Fluent Bit runs as a DaemonSet on every EKS node. It tails container stdout/stderr, enriches log records with Kubernetes metadata (namespace, pod name, container name, node name), and ships to CloudWatch Logs.
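A minimal sketch of that pipeline in Fluent Bit's classic configuration syntax. The region, log-group template, and label key are placeholders for illustration, not the deployed values:

```
[INPUT]
    Name              tail
    Path              /var/log/containers/*.log
    multiline.parser  docker, cri
    Tag               kube.*

[FILTER]
    Name   kubernetes
    Match  kube.*

[OUTPUT]
    Name                cloudwatch_logs
    Match               kube.*
    region              eu-west-1
    log_group_template  /eks/appointment-system/$kubernetes['container_name']
    log_stream_prefix   pod-
    auto_create_group   true
```

The kubernetes filter is what attaches the namespace, pod, container, and node metadata mentioned above, which the output stage can then use to route records to per-service log groups.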

Log groups per service:

/eks/appointment-system/booking-service
/eks/appointment-system/search-service
/eks/appointment-system/availability-service
/eks/appointment-system/patient-service
/eks/appointment-system/practitioner-service
/eks/appointment-system/notification-service

All services emit structured JSON logs containing, at minimum:

{
  "level": "info",
  "timestamp": "2026-03-30T12:00:00.000Z",
  "service": "booking-service",
  "traceId": "abc123",
  "requestId": "req-xyz",
  "message": "Booking created",
  "bookingId": "b-001",
  "practitionerId": "p-001",
  "durationMs": 42
}
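Because every record carries the traceId field, a CloudWatch Logs Insights query can pull one request's logs across all of the service log groups. The trace ID below is the example value from the record above:

```
fields @timestamp, service, level, message
| filter traceId = "abc123"
| sort @timestamp asc
```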

MSK Broker Logs

MSK broker logs are shipped to CloudWatch under /aws/msk/{project} with 14-day retention. This captures topic configuration changes, consumer group rebalances, and broker errors.


Tracing

AWS X-Ray / OpenTelemetry

Distributed traces are collected via the AWS Distro for OpenTelemetry (ADOT) collector, running as a sidecar or DaemonSet. The ADOT collector exports to AWS X-Ray.

Every inbound HTTP request generates a trace with spans for:

  • ALB → service ingress
  • Service handler duration
  • Outbound RDS queries (instrumented via the database driver)
  • Outbound Redis calls
  • Outbound OpenSearch queries
  • Kafka produce calls
  • Outbound SES/SNS calls

The traceId is propagated via HTTP headers (X-Amzn-Trace-Id) and included in all structured log lines, allowing a single trace ID to correlate logs and spans across services.
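As a sketch of that propagation, a service can lift the root trace ID out of the X-Amzn-Trace-Id header and stamp it onto each log line. The helper names are hypothetical; the header format is Root=<trace-id>;Parent=<span-id>;Sampled=<0|1>:

```typescript
// Sketch: extract the X-Amzn-Trace-Id root for inclusion in structured logs.
// Example header: "Root=1-5759e988-bd862e3fe1be46a994272793;Sampled=1"
function traceIdFrom(header: string | undefined): string | undefined {
  return header
    ?.split(";")
    .find((part) => part.startsWith("Root="))
    ?.slice("Root=".length);
}

// Build one structured log line carrying the trace ID (hypothetical helper;
// field names match the structured-log schema above).
function logLine(level: string, message: string, traceId?: string): string {
  return JSON.stringify({
    level,
    timestamp: new Date().toISOString(),
    service: "booking-service",
    traceId,
    message,
  });
}
```

Any downstream HTTP call forwards the same header unchanged, so the Root value stays constant across every hop of the request.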


Alerting

Alerts are configured in CloudWatch Alarms and routed to the on-call channel:

  • Booking error rate: 5XX_Count / RequestCount > 1% over 5 minutes (Critical)
  • RDS high connections: DatabaseConnections > 80% of max_connections (Warning)
  • Kafka consumer lag: OffsetLag > 10,000 for any consumer group (Warning)
  • Redis evictions: Evictions > 0, indicating memory pressure (Warning)
  • EKS pod crash loop: CrashLoopBackOff detected via kube-state-metrics (Critical)
  • OpenSearch cluster health: ClusterStatus.red = 1 (Critical)
  • ALB 5XX spike: HTTPCode_ELB_5XX_Count > 10 per minute (Critical)
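As a sketch, the booking error-rate alarm can be expressed as CloudWatch metric math. This is CloudFormation syntax; the load-balancer dimension value is a placeholder, not the real resource name:

```yaml
BookingErrorRateAlarm:
  Type: AWS::CloudWatch::Alarm
  Properties:
    AlarmDescription: Booking 5XX rate above 1% over 5 minutes
    ComparisonOperator: GreaterThanThreshold
    EvaluationPeriods: 1
    Threshold: 1            # percent
    TreatMissingData: notBreaching
    Metrics:
      - Id: error_rate
        Expression: 100 * errors / requests
        Label: Booking 5XX rate (%)
        ReturnData: true
      - Id: errors
        ReturnData: false
        MetricStat:
          Stat: Sum
          Period: 300
          Metric:
            Namespace: AWS/ApplicationELB
            MetricName: HTTPCode_Target_5XX_Count
            Dimensions:
              - Name: LoadBalancer
                Value: app/booking-alb/placeholder
      - Id: requests
        ReturnData: false
        MetricStat:
          Stat: Sum
          Period: 300
          Metric:
            Namespace: AWS/ApplicationELB
            MetricName: RequestCount
            Dimensions:
              - Name: LoadBalancer
                Value: app/booking-alb/placeholder
```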

Booking SLO

The primary SLO is booking success rate >= 99.9% over a rolling 30-minute window. The Booking error rate alert is the primary SLO burn-rate signal.
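The burn-rate arithmetic behind that signal, as a sketch: with a 99.9% target the error budget is 0.1%, so the 1% alert threshold corresponds to burning budget at 10x the sustainable rate.

```typescript
// Burn rate = observed error rate / error budget (1 - SLO target).
// A burn rate of 1 consumes exactly the budget over the SLO window;
// anything higher exhausts it proportionally faster.
function burnRate(errorRate: number, sloTarget: number): number {
  const errorBudget = 1 - sloTarget;
  return errorRate / errorBudget;
}
```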