# Component: Health Monitoring

## Purpose

The Health Monitoring component collects and reports operational health metrics from site clusters to the central cluster, providing engineers with visibility into the status of the distributed system.

## Location

Site clusters (metric collection and reporting). Central cluster (aggregation and display).

## Responsibilities

### Site Side

- Collect health metrics from local subsystems.
- Periodically report metrics to the central cluster via the Communication Layer.

### Central Side

- Receive and store health metrics from all sites.
- Detect site connectivity status (online/offline) based on heartbeat presence.
- Present health data in the Central UI dashboard.

## Monitored Metrics

| Metric | Source | Description |
|--------|--------|-------------|
| Site online/offline | Communication Layer | Whether the site is reachable (based on heartbeat) |
| Active/standby node status | Cluster Infrastructure | Which node is active, which is standby |
| Data connection health | Data Connection Layer | Connected/disconnected per data connection |
| Script error rates | Site Runtime (Script Actors) | Frequency of script failures |
| Alarm evaluation error rates | Site Runtime (Alarm Actors) | Frequency of alarm evaluation failures |
| Store-and-forward buffer depth | Store-and-Forward Engine | Pending messages by category (external, notification, DB write) |

## Reporting Protocol

- Sites send a **health report message** to central at a configurable interval (e.g., every 30 seconds).
- Each report contains the current values of all monitored metrics.
- If central does not receive a report within a timeout window, the site is marked as **offline**.

## Central Storage

- Health metrics are held **in memory** at the central cluster for display in the UI.
- No historical health data is persisted; the dashboard shows current/latest status only.
- Site connectivity history (online/offline transitions) may optionally be logged via the Audit Log or a separate mechanism if needed in the future.

## No Alerting

- Health monitoring is **display-only** for now: no automated notifications or alerts are triggered by health status changes.
- This can be extended in the future.

## Dependencies

- **Communication Layer**: Transports health reports from sites to central.
- **Data Connection Layer (site)**: Provides connection health metrics.
- **Site Runtime (site)**: Provides script error rate and alarm evaluation error rate metrics.
- **Store-and-Forward Engine (site)**: Provides buffer depth metrics.
- **Cluster Infrastructure (site)**: Provides node role status.

## Interactions

- **Central UI**: Health Monitoring Dashboard displays aggregated metrics.
- **Communication Layer**: Health reports flow as periodic messages.
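
## Illustrative Sketch

The reporting protocol and in-memory central storage described in this document could be sketched roughly as follows. This is a minimal illustration, not the actual implementation: the class names (`HealthReport`, `CentralHealthStore`), all field names, and the 90-second timeout are assumptions made for the example. The document specifies only a configurable report interval (e.g., 30 seconds) and an unspecified timeout window. The sketch also assumes that an arriving health report is what counts as the liveness signal for a site.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class HealthReport:
    """One periodic report from a site. All field names are hypothetical."""
    site_id: str
    active_node: str
    standby_node: Optional[str]
    data_connections: Dict[str, bool]   # connection name -> connected?
    script_error_rate: float            # unit is an assumption (e.g. errors/min)
    alarm_eval_error_rate: float        # unit is an assumption (e.g. errors/min)
    buffer_depth: Dict[str, int]        # category -> pending message count


class CentralHealthStore:
    """Holds only the latest report per site, in memory (no history)."""

    def __init__(self, offline_timeout_s: float = 90.0) -> None:
        # Assumed default: three missed 30-second reports.
        self.offline_timeout_s = offline_timeout_s
        self._latest: Dict[str, HealthReport] = {}
        self._last_seen: Dict[str, float] = {}

    def receive(self, report: HealthReport, now: float) -> None:
        """Overwrite the previous report for this site and record arrival time."""
        self._latest[report.site_id] = report
        self._last_seen[report.site_id] = now

    def is_online(self, site_id: str, now: float) -> bool:
        """A site is online if a report arrived within the timeout window."""
        last = self._last_seen.get(site_id)
        return last is not None and (now - last) <= self.offline_timeout_s

    def latest(self, site_id: str) -> Optional[HealthReport]:
        """Return the most recent report for the dashboard, if any."""
        return self._latest.get(site_id)


if __name__ == "__main__":
    store = CentralHealthStore(offline_timeout_s=90.0)
    report = HealthReport(
        site_id="site-a",
        active_node="node-1",
        standby_node="node-2",
        data_connections={"plc-1": True},
        script_error_rate=0.0,
        alarm_eval_error_rate=0.0,
        buffer_depth={"external": 0, "notification": 2, "db_write": 0},
    )
    store.receive(report, now=0.0)
    print(store.is_online("site-a", now=60.0))   # True
    print(store.is_online("site-a", now=120.0))  # False
```

Timestamps are passed in explicitly so the timeout logic can be exercised without waiting; a real deployment would more likely read a monotonic clock. Because only the latest report per site is kept, a restart of the central process would lose all status until the next reports arrive, which is consistent with the display-only, no-history design described above.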