scadalink-design/Component-HealthMonitoring.md
Joseph Doherty 19c7e6880f Refine Data Connection Layer: error handling, reconnection, write failures, health reporting
Add connection lifecycle (fixed-interval auto-reconnect, immediate bad quality on
disconnect, transparent re-subscribe), synchronous write failure errors to scripts,
periodic tag path resolution retry, and enhanced health reporting with tag resolution
counts. Update cross-references in Health Monitoring and Site Runtime.
2026-03-16 07:51:37 -04:00


Component: Health Monitoring

Purpose

The Health Monitoring component collects and reports operational health metrics from site clusters to the central cluster, providing engineers with visibility into the status of the distributed system.

Location

Site clusters (metric collection and reporting). Central cluster (aggregation and display).

Responsibilities

Site Side

  • Collect health metrics from local subsystems.
  • Periodically report metrics to the central cluster via the Communication Layer.

Central Side

  • Receive and store health metrics from all sites.
  • Detect site connectivity status (online/offline) based on heartbeat presence.
  • Present health data in the Central UI dashboard.

Monitored Metrics

| Metric | Source | Description |
|---|---|---|
| Site online/offline | Communication Layer | Whether the site is reachable (based on heartbeat) |
| Active/standby node status | Cluster Infrastructure | Which node is active, which is standby |
| Data connection health | Data Connection Layer | Connected/disconnected/reconnecting per data connection |
| Tag resolution counts | Data Connection Layer | Per connection: total subscribed tags vs. successfully resolved tags |
| Script error rates | Site Runtime (Script Actors) | Frequency of script failures |
| Alarm evaluation error rates | Site Runtime (Alarm Actors) | Frequency of alarm evaluation failures |
| Store-and-forward buffer depth | Store-and-Forward Engine | Pending messages by category (external, notification, DB write) |
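
As an illustration, a single report covering the metrics listed above could be modeled roughly as follows. This is a minimal sketch, not the project's actual message contract: the type names (`HealthReport`, `DataConnectionHealth`, `ConnectionState`), the field names, and the per-minute unit for error rates are all assumptions. Site online/offline is left out because central detects it from report arrival rather than reading it from the payload.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, Optional


class ConnectionState(Enum):
    """Per-connection states named in the metrics table."""
    CONNECTED = "connected"
    DISCONNECTED = "disconnected"
    RECONNECTING = "reconnecting"


@dataclass
class DataConnectionHealth:
    """Health of one data connection (hypothetical shape)."""
    state: ConnectionState
    subscribed_tags: int  # total tags subscribed on this connection
    resolved_tags: int    # tags whose paths resolved successfully

    @property
    def unresolved_tags(self) -> int:
        return self.subscribed_tags - self.resolved_tags


@dataclass
class HealthReport:
    """One periodic report from a site (all field names are assumptions)."""
    site_id: str
    active_node: str
    standby_node: Optional[str]
    data_connections: Dict[str, DataConnectionHealth] = field(default_factory=dict)
    script_errors_per_minute: float = 0.0      # unit is an assumption
    alarm_eval_errors_per_minute: float = 0.0  # unit is an assumption
    # Store-and-forward pending messages keyed by category.
    buffer_depth: Dict[str, int] = field(default_factory=dict)


conn = DataConnectionHealth(ConnectionState.RECONNECTING, subscribed_tags=120, resolved_tags=117)
report = HealthReport(
    site_id="site-a",
    active_node="node-1",
    standby_node="node-2",
    data_connections={"opc-1": conn},
    buffer_depth={"external": 4, "notification": 0, "db_write": 12},
)
```

Carrying both tag counts, rather than a precomputed difference, lets the dashboard show "resolved of subscribed" for each connection.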

Reporting Protocol

  • Sites send a health report message to central at a configurable interval (e.g., every 30 seconds).
  • Each report contains the current values of all monitored metrics.
  • If central does not receive a report within a timeout window, the site is marked as offline.
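
The timeout rule above can be sketched on the central side as follows. This is an illustrative sketch under stated assumptions: `SiteLivenessTracker` and its methods are hypothetical names, and the 90-second timeout (three missed 30-second reports) is an example value, not one taken from the design. The current time is passed in explicitly so the logic can be exercised without waiting; a real caller would likely supply `time.monotonic()`.

```python
from typing import Dict


class SiteLivenessTracker:
    """Tracks when each site last reported (hypothetical helper)."""

    def __init__(self, timeout_seconds: float) -> None:
        self.timeout_seconds = timeout_seconds
        self._last_seen: Dict[str, float] = {}

    def record_report(self, site_id: str, now: float) -> None:
        # Called whenever a health report arrives from a site.
        self._last_seen[site_id] = now

    def is_online(self, site_id: str, now: float) -> bool:
        # A site that has never reported is treated as offline.
        last = self._last_seen.get(site_id)
        return last is not None and (now - last) <= self.timeout_seconds


tracker = SiteLivenessTracker(timeout_seconds=90.0)  # assumed example value
tracker.record_report("site-a", now=0.0)
online_at_60 = tracker.is_online("site-a", now=60.0)   # within the window
online_at_91 = tracker.is_online("site-a", now=91.0)   # window exceeded
```

Choosing a timeout that spans several reporting intervals avoids marking a site offline because of one delayed or dropped report.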

Central Storage

  • Health metrics are held in memory at the central cluster for display in the UI.
  • No historical health data is persisted; the dashboard shows only the latest reported status for each site.
  • Site connectivity history (online/offline transitions) may optionally be logged via the Audit Log or a separate mechanism if needed in the future.
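
A latest-only, in-memory store of the kind described above might look like the sketch below. The class name `LatestHealthStore`, its methods, and the plain-dictionary report shape are assumptions made for illustration; the design does not specify them.

```python
from typing import Any, Dict, Optional


class LatestHealthStore:
    """Keeps only the most recent report per site, in memory (hypothetical)."""

    def __init__(self) -> None:
        self._latest: Dict[str, Dict[str, Any]] = {}

    def put(self, site_id: str, report: Dict[str, Any]) -> None:
        # Overwrites any earlier report, so no history accumulates.
        self._latest[site_id] = report

    def get(self, site_id: str) -> Optional[Dict[str, Any]]:
        return self._latest.get(site_id)

    def snapshot(self) -> Dict[str, Dict[str, Any]]:
        # Shallow copy so the dashboard can iterate while new reports arrive.
        return dict(self._latest)


store = LatestHealthStore()
store.put("site-a", {"script_errors": 2})
store.put("site-a", {"script_errors": 0})  # replaces the earlier report
current = store.snapshot()
```

Because nothing is persisted, the store starts empty after a restart of the central cluster and refills as sites send their next reports.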

No Alerting

  • Health monitoring is currently display-only: health status changes trigger no automated notifications or alerts.
  • Alerting can be added as a future extension.

Dependencies

  • Communication Layer: Transports health reports from sites to central.
  • Data Connection Layer (site): Provides connection health and tag resolution count metrics.
  • Site Runtime (site): Provides script error rate and alarm evaluation error rate metrics.
  • Store-and-Forward Engine (site): Provides buffer depth metrics.
  • Cluster Infrastructure (site): Provides node role status.

Interactions

  • Central UI: Health Monitoring Dashboard displays aggregated metrics.
  • Communication Layer: Health reports flow as periodic messages.