Set 30-second report interval with 60-second absolute timeout for offline detection. Define error rates as raw counts per interval (reset after each report). Script errors include all failure types. Automatic online recovery on first received report. Flat snapshot report structure.
75 lines
4.2 KiB
Markdown
75 lines
4.2 KiB
Markdown
# Component: Health Monitoring
|
|
|
|
## Purpose
|
|
|
|
The Health Monitoring component collects and reports operational health metrics from site clusters to the central cluster, providing engineers with visibility into the status of the distributed system.
|
|
|
|
## Location
|
|
|
|
Site clusters (metric collection and reporting). Central cluster (aggregation and display).
|
|
|
|
## Responsibilities
|
|
|
|
### Site Side
|
|
- Collect health metrics from local subsystems.
|
|
- Periodically report metrics to the central cluster via the Communication Layer.
|
|
|
|
### Central Side
|
|
- Receive and store health metrics from all sites.
|
|
- Detect site connectivity status (online/offline) based on heartbeat presence.
|
|
- Present health data in the Central UI dashboard.
|
|
|
|
## Monitored Metrics
|
|
|
|
| Metric | Source | Description |
|
|
|--------|--------|-------------|
|
|
| Site online/offline | Communication Layer | Whether the site is reachable (based on heartbeat) |
|
|
| Active/standby node status | Cluster Infrastructure | Which node is active, which is standby |
|
|
| Data connection health | Data Connection Layer | Connected/disconnected/reconnecting per data connection |
|
|
| Tag resolution counts | Data Connection Layer | Per connection: total subscribed tags vs. successfully resolved tags |
|
|
| Script error rates | Site Runtime (Script Actors) | Frequency of script failures |
|
|
| Alarm evaluation error rates | Site Runtime (Alarm Actors) | Frequency of alarm evaluation failures |
|
|
| Store-and-forward buffer depth | Store-and-Forward Engine | Pending messages by category (external, notification, DB write) |
|
|
|
|
## Reporting Protocol
|
|
|
|
- Sites send a **health report message** to central at a configurable interval (default: **30 seconds**).
|
|
- Each report is a **flat snapshot** containing the current values of all monitored metrics. Central replaces the entire previous state for that site on receipt.
|
|
- **Offline detection**: If central does not receive a report within a configurable timeout window (default: **60 seconds** — 2x the report interval), the site is marked as **offline**. This gives one missed report as grace before marking offline.
|
|
- **Online recovery**: When central receives a health report from a site that was marked offline, the site is automatically marked **online**. No manual acknowledgment required — the metrics in the report provide immediate visibility into the site's condition.
|
|
|
|
## Error Rate Metrics
|
|
|
|
Script error rates and alarm evaluation error rates are calculated as **raw counts per reporting interval**:
|
|
|
|
- The site maintains a counter for each metric that increments on every failure.
|
|
- Each health report includes the count since the last report. The counter resets after each report is sent.
|
|
- Central displays these as "X errors in the last 30 seconds" (or whatever the configured interval is).
|
|
- **Script errors** include all failures that prevent a script from completing successfully: unhandled exceptions, timeouts, recursion limit violations, and any other error condition.
|
|
- **Alarm evaluation errors** include all failures during alarm condition evaluation.
|
|
- For detailed diagnostics (error types, stack traces, affected instances), operators use the **Site Event Log Viewer** — the health dashboard is for quick triage, not forensics.
|
|
|
|
## Central Storage
|
|
|
|
- Health metrics are held **in memory** at the central cluster for display in the UI.
|
|
- No historical health data is persisted — the dashboard shows current/latest status only.
|
|
- Site connectivity history (online/offline transitions) may optionally be logged via the Audit Log or a separate mechanism if needed in the future.
|
|
|
|
## No Alerting
|
|
|
|
- Health monitoring is **display-only** for now — no automated notifications or alerts are triggered by health status changes.
|
|
- This can be extended in the future.
|
|
|
|
## Dependencies
|
|
|
|
- **Communication Layer**: Transports health reports from sites to central.
|
|
- **Data Connection Layer (site)**: Provides connection health metrics.
|
|
- **Site Runtime (site)**: Provides script error rate and alarm evaluation error rate metrics.
|
|
- **Store-and-Forward Engine (site)**: Provides buffer depth metrics.
|
|
- **Cluster Infrastructure (site)**: Provides node role status.
|
|
|
|
## Interactions
|
|
|
|
- **Central UI**: Health Monitoring Dashboard displays aggregated metrics.
|
|
- **Communication Layer**: Health reports flow as periodic messages.
|