scadalink-design/Component-HealthMonitoring.md
Joseph Doherty 19c7e6880f Refine Data Connection Layer: error handling, reconnection, write failures, health reporting
Add connection lifecycle (fixed-interval auto-reconnect, immediate bad quality on
disconnect, transparent re-subscribe), synchronous write failure errors to scripts,
periodic tag path resolution retry, and enhanced health reporting with tag resolution
counts. Update cross-references in Health Monitoring and Site Runtime.
2026-03-16 07:51:37 -04:00


# Component: Health Monitoring
## Purpose
The Health Monitoring component collects and reports operational health metrics from site clusters to the central cluster, providing engineers with visibility into the status of the distributed system.
## Location
Site clusters (metric collection and reporting). Central cluster (aggregation and display).
## Responsibilities
### Site Side
- Collect health metrics from local subsystems.
- Periodically report metrics to the central cluster via the Communication Layer.
### Central Side
- Receive and store health metrics from all sites.
- Detect site connectivity status (online/offline) based on heartbeat presence.
- Present health data in the Central UI dashboard.
## Monitored Metrics
| Metric | Source | Description |
|--------|--------|-------------|
| Site online/offline | Communication Layer | Whether the site is reachable (based on heartbeat) |
| Active/standby node status | Cluster Infrastructure | Which node is active, which is standby |
| Data connection health | Data Connection Layer | Connected/disconnected/reconnecting per data connection |
| Tag resolution counts | Data Connection Layer | Per connection: total subscribed tags vs. successfully resolved tags |
| Script error rates | Site Runtime (Script Actors) | Frequency of script failures |
| Alarm evaluation error rates | Site Runtime (Alarm Actors) | Frequency of alarm evaluation failures |
| Store-and-forward buffer depth | Store-and-Forward Engine | Pending messages by category (external, notification, DB write) |
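
As an illustration of how the metrics in this table might be grouped into a single report payload, here is a minimal sketch. All type and field names (`HealthReport`, `DataConnectionHealth`, `unresolved_tags`, and so on) are hypothetical and are not defined by this design. The unit of the error rates is also left open here, since the design does not specify one. Site online/offline is intentionally absent because central derives it from report arrival rather than reading it from the report.

```python
from dataclasses import dataclass, field
from enum import Enum


class ConnectionState(Enum):
    """Per-connection states listed in the metrics table."""
    CONNECTED = "connected"
    DISCONNECTED = "disconnected"
    RECONNECTING = "reconnecting"


@dataclass
class DataConnectionHealth:
    """Health of one data connection (hypothetical shape)."""
    state: ConnectionState
    subscribed_tags: int  # total subscribed tags
    resolved_tags: int    # successfully resolved tags

    @property
    def unresolved_tags(self) -> int:
        # Tags that are subscribed but not yet resolved.
        return self.subscribed_tags - self.resolved_tags


@dataclass
class HealthReport:
    """One site's report; field names are assumptions for illustration."""
    site_id: str
    active_node: str
    standby_node: str
    # Keyed by data connection name.
    data_connections: dict[str, DataConnectionHealth] = field(default_factory=dict)
    script_error_rate: float = 0.0      # unit not specified by the design
    alarm_eval_error_rate: float = 0.0  # unit not specified by the design
    # Pending store-and-forward messages keyed by category.
    buffer_depth: dict[str, int] = field(default_factory=dict)


report = HealthReport(
    site_id="site-a",
    active_node="node-1",
    standby_node="node-2",
    data_connections={
        "plc-1": DataConnectionHealth(ConnectionState.CONNECTED, 120, 117),
    },
    buffer_depth={"external": 4, "notification": 0, "db_write": 12},
)
print(report.data_connections["plc-1"].unresolved_tags)  # 3
```

Carrying both the subscribed and resolved counts, rather than only their difference, lets the dashboard show the ratio per connection.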
## Reporting Protocol
- Sites send a **health report message** to central at a configurable interval (e.g., every 30 seconds).
- Each report contains the current values of all monitored metrics.
- If central does not receive a report from a site within a timeout window, that site is marked as **offline**.
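
The timeout rule above can be sketched as a small tracker on the central side. This is only an illustration: the class name `SiteStatusTracker`, its methods, and the 90-second timeout used in the example are assumptions, not values taken from this design. The clock is injected so the behaviour can be exercised without waiting in real time.

```python
import time
from typing import Callable, Dict


class SiteStatusTracker:
    """Marks a site offline when no report arrived within the timeout.

    Hypothetical helper; names and defaults are illustrative only.
    """

    def __init__(self, timeout_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._timeout = timeout_seconds
        self._clock = clock
        self._last_seen: Dict[str, float] = {}

    def record_report(self, site_id: str) -> None:
        # Called whenever a health report from the site is received.
        self._last_seen[site_id] = self._clock()

    def is_online(self, site_id: str) -> bool:
        last = self._last_seen.get(site_id)
        if last is None:
            # A site that has never reported is treated as offline.
            return False
        return (self._clock() - last) <= self._timeout


# Example with a fake clock: 30 s report interval, assumed 90 s timeout.
now = [0.0]
tracker = SiteStatusTracker(timeout_seconds=90.0, clock=lambda: now[0])
tracker.record_report("site-a")
now[0] = 60.0
print(tracker.is_online("site-a"))  # True
now[0] = 91.0
print(tracker.is_online("site-a"))  # False
```

Choosing a timeout of several report intervals, as assumed here, tolerates an occasional lost or delayed report before a site is shown as offline.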
## Central Storage
- Health metrics are held **in memory** at the central cluster for display in the UI.
- No historical health data is persisted; the dashboard shows only the latest reported status.
- Site connectivity history (online/offline transitions) could be logged via the Audit Log or a separate mechanism if needed in the future.
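
A latest-only, in-memory store of the kind described above could look roughly like the following sketch. The class `LatestHealthStore` and its methods are hypothetical. Metrics are represented as plain dictionaries purely for brevity, and the lock is an assumption about concurrent access from report receivers and UI readers.

```python
import threading
from typing import Any, Dict, Optional


class LatestHealthStore:
    """Keeps only the most recent metrics per site; nothing is persisted."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._latest: Dict[str, Dict[str, Any]] = {}

    def update(self, site_id: str, metrics: Dict[str, Any]) -> None:
        # Overwrite the previous report: no history is retained.
        with self._lock:
            self._latest[site_id] = dict(metrics)

    def get(self, site_id: str) -> Optional[Dict[str, Any]]:
        # Return a copy so callers cannot mutate the stored report.
        with self._lock:
            metrics = self._latest.get(site_id)
            return dict(metrics) if metrics is not None else None

    def snapshot(self) -> Dict[str, Dict[str, Any]]:
        # Copy of the latest report for every site, e.g. for the dashboard.
        with self._lock:
            return {site: dict(m) for site, m in self._latest.items()}


store = LatestHealthStore()
store.update("site-a", {"script_error_rate": 0.2})
store.update("site-a", {"script_error_rate": 0.0})
print(store.get("site-a"))  # {'script_error_rate': 0.0}
```

Because the state lives only in process memory, a restart of the central process would start from an empty store until each site's next report arrives.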
## No Alerting
- Health monitoring is **display-only** for now: no automated notifications or alerts are triggered by health status changes.
- This can be extended in the future.
## Dependencies
- **Communication Layer**: Transports health reports from sites to central.
- **Data Connection Layer (site)**: Provides connection health metrics.
- **Site Runtime (site)**: Provides script error rate and alarm evaluation error rate metrics.
- **Store-and-Forward Engine (site)**: Provides buffer depth metrics.
- **Cluster Infrastructure (site)**: Provides node role status.
## Interactions
- **Central UI**: Health Monitoring Dashboard displays aggregated metrics.
- **Communication Layer**: Health reports flow as periodic messages.