# Component: Health Monitoring

## Purpose

The Health Monitoring component collects and reports operational health metrics from site clusters to the central cluster, providing engineers with visibility into the status of the distributed system.

## Location

Site clusters (metric collection and reporting). Central cluster (aggregation and display).

## Responsibilities

### Site Side

- Collect health metrics from local subsystems.
- Periodically report metrics to the central cluster via the Communication Layer.

### Central Side

- Receive and store health metrics from all sites.
- Detect site connectivity status (online/offline) based on heartbeat presence.
- Present health data in the Central UI dashboard.

## Monitored Metrics

| Metric | Source | Description |
|--------|--------|-------------|
| Site online/offline | Communication Layer | Whether the site is reachable (based on heartbeat) |
| Active/standby node status | Cluster Infrastructure | Which node is active, which is standby |
| Data connection health | Data Connection Layer | Connected/disconnected/reconnecting per data connection |
| Tag resolution counts | Data Connection Layer | Per connection: total subscribed tags vs. successfully resolved tags |
| Script error rates | Site Runtime (Script Actors) | Frequency of script failures |
| Alarm evaluation error rates | Site Runtime (Alarm Actors) | Frequency of alarm evaluation failures |
| Store-and-forward buffer depth | Store-and-Forward Engine | Pending messages by category: external, notification (notifications awaiting forward to central), DB write |
| Dead letter count | Akka.NET EventStream | Messages sent to actors that no longer exist; indicates stale references or timing issues |
| Notification Outbox queue depth | Notification Outbox (central) | Count of `Pending` + `Retrying` notifications (central-computed, not site-reported) |
| Notification Outbox stuck count | Notification Outbox (central) | Count of `Pending` / `Retrying` notifications older than the configurable stuck-age threshold (central-computed, not site-reported) |
| Notification Outbox parked count | Notification Outbox (central) | Count of `Parked` notifications (central-computed, not site-reported) |

## Reporting Protocol

- Sites send a **health report message** to central at a configurable interval (default: **30 seconds**).
- Each report is a **flat snapshot** containing the current values of all monitored metrics, a **monotonic sequence number**, and the **report timestamp** from the site. Central replaces the previous state for that site only if the incoming sequence number is higher than the last received. This prevents stale reports (e.g., delayed in transit or from a pre-failover node) from overwriting newer state.
- **Offline detection**: If central does not receive a report within a configurable timeout window (default: **60 seconds**, 2x the report interval), the site is marked as **offline**. This gives one missed report as grace before marking offline.
- **Online recovery**: When central receives a health report from a site that was marked offline, the site is automatically marked **online**. No manual acknowledgment is required; the metrics in the report provide immediate visibility into the site's condition.

## Error Rate Metrics

Script error rates and alarm evaluation error rates are calculated as **raw counts per reporting interval**:

- The site maintains a counter for each metric that increments on every failure.
- Each health report includes the count since the last report. The counter resets after each report is sent.
- Central displays these as "X errors in the last 30 seconds" (or whatever the configured interval is).
- **Script errors** include all failures that prevent a script from completing successfully: unhandled exceptions, timeouts, recursion limit violations, and any other error condition.
- **Alarm evaluation errors** include all failures during alarm condition evaluation.
- For detailed diagnostics (error types, stack traces, affected instances), operators use the **Site Event Log Viewer**; the health dashboard is for quick triage, not forensics.

## Notification Outbox KPIs

The Notification Outbox is a **central** component, so its KPIs are **central-computed** rather than collected from sites and carried in the site health report:

- The dashboard surfaces three **headline** outbox KPIs: **queue depth** (`Pending` + `Retrying`), **stuck count** (`Pending` / `Retrying` rows older than the configurable stuck-age threshold), and **parked count** (`Parked`).
- The Notification Outbox component computes these on demand from the central `Notifications` table; the health dashboard polls it for the headline tiles.
- The fuller KPI set, which also includes **delivered (last interval)** and **oldest pending age**, lives on the Central UI **Notification Outbox** page, not the health dashboard.
- Outbox KPIs are **point-in-time**, computed on demand from the `Notifications` table. There is no time-series store, consistent with Health Monitoring's "current status only" philosophy. The outbox's own ~1-year row retention answers historical questions directly.

These are distinct from the site-reported **Store-and-forward buffer depth** notification metric, which now covers the **site→central leg** (notifications still buffered in a site's Store-and-Forward Engine awaiting forward to central) and remains part of the site health report.

## Central Storage

- Health metrics are held **in memory** at the central cluster for display in the UI.
- No historical health data is persisted; the dashboard shows current/latest status only.
- Notification Outbox KPIs are not stored by Health Monitoring; they are computed point-in-time from the central `Notifications` table each time the dashboard refreshes, consistent with the current-status-only philosophy.
- Site connectivity history (online/offline transitions) may optionally be logged via the Audit Log or a separate mechanism if needed in the future.

## No Alerting

- Health monitoring is **display-only** for now: no automated notifications or alerts are triggered by health status changes.
- This can be extended in the future.

## Dependencies

- **Communication Layer**: Transports health reports from sites to central.
- **Data Connection Layer (site)**: Provides connection health metrics.
- **Site Runtime (site)**: Provides script error rate and alarm evaluation error rate metrics.
- **Store-and-Forward Engine (site)**: Provides buffer depth metrics, including the notification backlog awaiting forward to central.
- **Cluster Infrastructure (site)**: Provides node role status.
- **Notification Outbox (central)**: Provides central-computed outbox KPIs (queue depth, stuck count, parked count) for the headline dashboard tiles.

## Interactions

- **Central UI**: Health Monitoring Dashboard displays aggregated metrics.
- **Communication Layer**: Health reports flow as periodic messages.
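
The central-side rules described under Reporting Protocol (sequence-number guard, offline timeout, automatic online recovery) can be illustrated with a minimal Python sketch. This is not the component's actual implementation: the names `HealthReport`, `SiteState`, `CentralHealthStore`, `receive`, and `sweep` are hypothetical, and the sketch assumes that a stale report is dropped entirely and does not refresh the liveness timer, which the description above leaves open.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class HealthReport:
    site_id: str
    sequence: int           # monotonic sequence number assigned by the site
    site_timestamp: float   # report timestamp from the site (epoch seconds)
    metrics: dict = field(default_factory=dict)  # flat snapshot of metric values


@dataclass
class SiteState:
    last_report: HealthReport
    last_received_at: float  # central clock time of the last accepted report
    online: bool = True


class CentralHealthStore:
    """In-memory, latest-state-only store (hypothetical illustration)."""

    def __init__(self, offline_timeout: float = 60.0):
        self.offline_timeout = offline_timeout  # default: 2x the 30 s interval
        self.sites: Dict[str, SiteState] = {}

    def receive(self, report: HealthReport, now: float) -> bool:
        """Accept a report only if its sequence number is higher than the last."""
        state = self.sites.get(report.site_id)
        if state is not None and report.sequence <= state.last_report.sequence:
            # Stale: delayed in transit or sent by a pre-failover node.
            # Assumption: it is ignored and does not count as a heartbeat.
            return False
        # Replace the previous snapshot; an offline site becomes online again.
        self.sites[report.site_id] = SiteState(report, now, online=True)
        return True

    def sweep(self, now: float) -> List[str]:
        """Mark sites offline when no report arrived within the timeout window."""
        newly_offline = []
        for site_id, state in self.sites.items():
            if state.online and now - state.last_received_at > self.offline_timeout:
                state.online = False
                newly_offline.append(site_id)
        return newly_offline
```

In this sketch `sweep` would be called on a periodic timer at central. Passing `now` explicitly keeps the logic independent of the wall clock, which makes the timeout behavior easy to exercise in tests.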
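
The per-interval counting described under Error Rate Metrics can be sketched as a counter that is read and reset in one step. The class name `IntervalErrorCounter` and the helper `format_for_dashboard` are hypothetical, and the sketch simplifies "resets after each report is sent" to "resets when the report value is taken".

```python
import threading


class IntervalErrorCounter:
    """Counts failures since the last health report (hypothetical sketch)."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._count = 0

    def record_failure(self) -> None:
        # Called on every script or alarm evaluation failure.
        with self._lock:
            self._count += 1

    def take(self) -> int:
        # Read and reset atomically so no failure is lost or double counted
        # between two consecutive reports.
        with self._lock:
            count, self._count = self._count, 0
            return count


def format_for_dashboard(count: int, interval_seconds: int) -> str:
    # Mirrors the "X errors in the last 30 seconds" display wording.
    return f"{count} errors in the last {interval_seconds} seconds"
```

A site would keep one such counter per metric (for example, one for script errors and one for alarm evaluation errors) and call `take()` while assembling each health report.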
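
The three headline KPIs described under Notification Outbox KPIs can be computed in a single point-in-time query. The sketch below uses an in-memory SQLite database purely for illustration. The column names `status` and `created_at`, the use of epoch seconds, and the function name `outbox_headline_kpis` are assumptions; the actual `Notifications` schema and database engine are not specified here.

```python
import sqlite3


def outbox_headline_kpis(conn: sqlite3.Connection, now: float,
                         stuck_age_seconds: float) -> dict:
    """Compute queue depth, stuck count, and parked count on demand."""
    cutoff = now - stuck_age_seconds  # rows created before this are "stuck"
    row = conn.execute(
        """
        SELECT
          COALESCE(SUM(status IN ('Pending', 'Retrying')), 0),
          COALESCE(SUM(status IN ('Pending', 'Retrying') AND created_at < ?), 0),
          COALESCE(SUM(status = 'Parked'), 0)
        FROM Notifications
        """,
        (cutoff,),
    ).fetchone()
    return {"queue_depth": row[0], "stuck_count": row[1], "parked_count": row[2]}


# Illustrative data only; the schema below is hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Notifications ("
    "id INTEGER PRIMARY KEY, status TEXT, created_at REAL)"
)
conn.executemany(
    "INSERT INTO Notifications (status, created_at) VALUES (?, ?)",
    [("Pending", 990.0), ("Retrying", 100.0), ("Pending", 200.0),
     ("Parked", 50.0), ("Delivered", 10.0)],
)
kpis = outbox_headline_kpis(conn, now=1000.0, stuck_age_seconds=600.0)
```

Nothing is cached between calls, which matches the point-in-time, no-time-series approach: each dashboard refresh simply re-runs the query.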