Component: Health Monitoring

Purpose

The Health Monitoring component collects and reports operational health metrics from site clusters to the central cluster, providing engineers with visibility into the status of the distributed system.

Location

Site clusters (metric collection and reporting). Central cluster (aggregation and display).

Responsibilities

Site Side

Collect health metrics from local subsystems.
Periodically report metrics to the central cluster via the Communication Layer.

Central Side

Receive and store health metrics from all sites.
Detect site connectivity status (online/offline) based on heartbeat presence.
Present health data in the Central UI dashboard.

Monitored Metrics

Metric	Source	Description
Site online/offline	Communication Layer	Whether the site is reachable (based on heartbeat)
Active/standby node status	Cluster Infrastructure	Which node is active, which is standby
Data connection health	Data Connection Layer	Connected/disconnected/reconnecting per data connection
Tag resolution counts	Data Connection Layer	Per connection: total subscribed tags vs. successfully resolved tags
Script error rates	Site Runtime (Script Actors)	Frequency of script failures
Alarm evaluation error rates	Site Runtime (Alarm Actors)	Frequency of alarm evaluation failures
Store-and-forward buffer depth	Store-and-Forward Engine	Pending messages by category — external, notification (notifications awaiting forward to central), DB write
Dead letter count	Akka.NET EventStream	Messages sent to actors that no longer exist — indicates stale references or timing issues
Notification Outbox queue depth	Notification Outbox (central)	Count of `Pending` + `Retrying` notifications — central-computed, not site-reported
Notification Outbox stuck count	Notification Outbox (central)	Count of `Pending` / `Retrying` notifications older than the configurable stuck-age threshold — central-computed, not site-reported
Notification Outbox parked count	Notification Outbox (central)	Count of `Parked` notifications — central-computed, not site-reported
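
The site-reported rows of the table could be carried in a typed structure along the following lines. This is only an illustrative sketch: every class and field name here is hypothetical, and the design does not prescribe a wire format. Site online/offline and the three Notification Outbox metrics are left out because central derives them itself.

```python
from dataclasses import asdict, dataclass, field
from enum import Enum
import json


class ConnectionState(str, Enum):
    CONNECTED = "connected"
    DISCONNECTED = "disconnected"
    RECONNECTING = "reconnecting"


@dataclass
class DataConnectionHealth:
    name: str              # hypothetical connection identifier
    state: ConnectionState
    subscribed_tags: int   # total subscribed tags
    resolved_tags: int     # successfully resolved tags


def _empty_buffer_depth() -> dict:
    # One pending-message count per store-and-forward category.
    return {"external": 0, "notification": 0, "db_write": 0}


@dataclass
class SiteMetrics:
    active_node: str       # hypothetical node names
    standby_node: str
    data_connections: list = field(default_factory=list)
    script_errors: int = 0             # failures since the previous report
    alarm_evaluation_errors: int = 0   # failures since the previous report
    buffer_depth: dict = field(default_factory=_empty_buffer_depth)
    dead_letters: int = 0


def to_json(metrics: SiteMetrics) -> str:
    """Serialize the snapshot; JSON is an assumed encoding, not a requirement."""
    return json.dumps(asdict(metrics))
```

A site would populate one `SiteMetrics` value per reporting interval and hand it to the Communication Layer together with the sequence number and timestamp.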

Reporting Protocol

Sites send a health report message to central at a configurable interval (default: 30 seconds).
Each report is a flat snapshot containing the current values of all monitored metrics, a monotonic sequence number, and the report timestamp from the site. Central replaces the previous state for that site only if the incoming sequence number is higher than the last received — this prevents stale reports (e.g., delayed in transit or from a pre-failover node) from overwriting newer state.
Offline detection: If central does not receive a report from a site within a configurable timeout window (default: 60 seconds, i.e. 2x the default report interval), it marks the site offline. A single missed report is therefore tolerated before the site is marked offline.
Online recovery: When central receives a health report from a site that was marked offline, the site is automatically marked online. No manual acknowledgment is required; the metrics in the report provide immediate visibility into the site's condition.
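
The central-side rules above (sequence-guarded replacement, timeout-based offline detection, automatic recovery) can be sketched as follows. This is a minimal sketch and not the actual implementation: the names `HealthReport`, `SiteState`, and `CentralHealthStore` are hypothetical. It also makes two assumptions the text leaves open: the timeout is measured against central's own receive time rather than the site timestamp, and a stale report is ignored entirely, so it does not refresh liveness.

```python
from dataclasses import dataclass
from typing import Any, Dict


@dataclass
class HealthReport:
    site_id: str
    sequence: int           # monotonic sequence number assigned by the site
    site_timestamp: float   # report timestamp from the site (informational)
    metrics: Dict[str, Any]


@dataclass
class SiteState:
    last_report: HealthReport
    last_received: float    # central receive time (assumption, see lead-in)
    online: bool = True


class CentralHealthStore:
    def __init__(self, offline_timeout: float = 60.0) -> None:
        self.offline_timeout = offline_timeout
        self.sites: Dict[str, SiteState] = {}

    def receive(self, report: HealthReport, now: float) -> bool:
        """Apply a report; return True if it replaced the stored state."""
        state = self.sites.get(report.site_id)
        if state is None:
            self.sites[report.site_id] = SiteState(report, now)
            return True
        if report.sequence <= state.last_report.sequence:
            # Stale: delayed in transit or sent by a pre-failover node.
            return False
        state.last_report = report
        state.last_received = now
        state.online = True  # automatic recovery, no acknowledgment needed
        return True

    def sweep(self, now: float) -> None:
        """Mark sites offline once the timeout window has elapsed."""
        for state in self.sites.values():
            if now - state.last_received > self.offline_timeout:
                state.online = False
```

In this sketch `sweep` would be called on a timer at central. Passing `now` explicitly keeps the logic deterministic and easy to test.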

Error Rate Metrics

Script error rates and alarm evaluation error rates are calculated as raw counts per reporting interval:

The site maintains a counter for each metric that increments on every failure.
Each health report includes the count since the last report. The counter resets after each report is sent.
Central displays these as "X errors in the last 30 seconds", substituting the configured report interval for 30 seconds.
Script errors include all failures that prevent a script from completing successfully: unhandled exceptions, timeouts, recursion limit violations, and any other error condition.
Alarm evaluation errors include all failures during alarm condition evaluation.
For detailed diagnostics (error types, stack traces, affected instances), operators use the Site Event Log Viewer — the health dashboard is for quick triage, not forensics.
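
The count-and-reset behaviour described above could look roughly like this on the site side. The class, method, and metric key names are hypothetical, and the lock is an assumption made so the sketch stays safe if failures are recorded from several threads; an actor-based implementation would not need it.

```python
import threading
from typing import Dict


class IntervalErrorCounters:
    """Counts failures per metric and resets whenever a report is taken."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._counts: Dict[str, int] = {
            "script_errors": 0,
            "alarm_evaluation_errors": 0,
        }

    def record_failure(self, metric: str) -> None:
        # Called on every failure, whatever its cause (exception, timeout, ...).
        with self._lock:
            self._counts[metric] += 1

    def take_snapshot(self) -> Dict[str, int]:
        """Return the counts since the previous snapshot, then reset to zero."""
        with self._lock:
            snapshot = dict(self._counts)
            for key in self._counts:
                self._counts[key] = 0
            return snapshot


def describe(count: int, interval_seconds: int) -> str:
    """Format a count the way the dashboard text above phrases it."""
    return f"{count} errors in the last {interval_seconds} seconds"
```

Reading and resetting under the same lock means a failure recorded while a report is being built lands in either this report or the next one, never both and never neither.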

Notification Outbox KPIs

The Notification Outbox is a central component, so its KPIs are central-computed rather than collected from sites and carried in the site health report:

The dashboard surfaces three headline outbox KPIs: queue depth (Pending + Retrying), stuck count (Pending / Retrying rows older than the configurable stuck-age threshold), and parked count (Parked).
The Notification Outbox component computes these on demand from the central Notifications table; the health dashboard polls it for the headline tiles.
The fuller KPI set — which also includes delivered (last interval) and oldest pending age — lives on the Central UI Notification Outbox page, not the health dashboard.
Outbox KPIs are point-in-time values. There is no time-series store, consistent with Health Monitoring's "current status only" philosophy; historical questions can be answered directly from the outbox's own rows, which are retained for about one year.

These KPIs are distinct from the notification category of the site-reported Store-and-forward buffer depth metric. That metric covers the site→central leg (notifications still buffered in a site's Store-and-Forward Engine awaiting forward to central) and remains part of the site health report.
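
As an illustration of how the three headline KPIs relate to each other, the sketch below computes them from an in-memory list of rows. It rests on several assumptions: the real component queries the central Notifications table, whose schema is not described here, so `NotificationRow`, its `created_at` column, and the choice to measure age from creation time are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Iterable


@dataclass
class NotificationRow:
    status: str           # e.g. "Pending", "Retrying", "Parked"
    created_at: datetime  # hypothetical column used to measure row age


@dataclass
class OutboxKpis:
    queue_depth: int   # Pending + Retrying
    stuck_count: int   # Pending / Retrying older than the stuck-age threshold
    parked_count: int  # Parked


def compute_kpis(
    rows: Iterable[NotificationRow],
    now: datetime,
    stuck_age: timedelta,
) -> OutboxKpis:
    """Compute the headline KPIs in a single pass over the rows."""
    queue_depth = stuck_count = parked_count = 0
    for row in rows:
        if row.status in ("Pending", "Retrying"):
            queue_depth += 1
            if now - row.created_at > stuck_age:
                stuck_count += 1  # stuck rows are a subset of the queue
        elif row.status == "Parked":
            parked_count += 1
    return OutboxKpis(queue_depth, stuck_count, parked_count)
```

Because nothing is cached between calls, each dashboard poll reflects the table as it is at that moment, which matches the point-in-time behaviour described above.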

Central Storage

Health metrics are held in memory at the central cluster for display in the UI.
No historical health data is persisted — the dashboard shows current/latest status only.
Notification Outbox KPIs are not stored by Health Monitoring; they are computed point-in-time from the central Notifications table each time the dashboard refreshes — consistent with the current-status-only philosophy.
Site connectivity history (online/offline transitions) may optionally be logged via the Audit Log or a separate mechanism if needed in the future.

No Alerting

Health monitoring is display-only for now — no automated notifications or alerts are triggered by health status changes.
Automated alerting can be added as a future extension.

Dependencies

Communication Layer: Transports health reports from sites to central.
Data Connection Layer (site): Provides connection health metrics.
Site Runtime (site): Provides script error rate and alarm evaluation error rate metrics.
Store-and-Forward Engine (site): Provides buffer depth metrics, including the notification backlog awaiting forward to central.
Cluster Infrastructure (site): Provides node role status.
Notification Outbox (central): Provides central-computed outbox KPIs — queue depth, stuck count, parked count — for the headline dashboard tiles.

Interactions

Central UI: Health Monitoring Dashboard displays aggregated metrics.
Communication Layer: Health reports flow as periodic messages.
