Refine Health Monitoring: timing defaults, offline detection, error rate calculation

Set 30-second report interval with 60-second absolute timeout for offline detection.
Define error rates as raw counts per interval (reset after each report). Script errors
include all failure types. Automatic online recovery on first received report. Flat
snapshot report structure.
This commit is contained in:
Joseph Doherty
2026-03-16 08:10:16 -04:00
parent 3dd62adf42
commit 57eae0c1db
2 changed files with 62 additions and 3 deletions

View File

@@ -33,9 +33,21 @@ Site clusters (metric collection and reporting). Central cluster (aggregation an
## Reporting Protocol
- Sites send a **health report message** to central at a configurable interval (e.g., every 30 seconds).
- Each report contains the current values of all monitored metrics.
- If central does not receive a report within a timeout window, the site is marked as **offline**.
- Sites send a **health report message** to central at a configurable interval (default: **30 seconds**).
- Each report is a **flat snapshot** containing the current values of all monitored metrics. Central replaces the entire previous state for that site on receipt.
- **Offline detection**: If central does not receive a report within a configurable timeout window (default: **60 seconds** — 2x the report interval), the site is marked as **offline**. This gives one missed report as grace before marking offline.
- **Online recovery**: When central receives a health report from a site that was marked offline, the site is automatically marked **online**. No manual acknowledgment required — the metrics in the report provide immediate visibility into the site's condition.
## Error Rate Metrics
Script error rates and alarm evaluation error rates are calculated as **raw counts per reporting interval**:
- The site maintains a counter for each metric that increments on every failure.
- Each health report includes the count since the last report. The counter resets after each report is sent.
- Central displays these as "X errors in the last 30 seconds" (or whatever the configured interval is).
- **Script errors** include all failures that prevent a script from completing successfully: unhandled exceptions, timeouts, recursion limit violations, and any other error condition.
- **Alarm evaluation errors** include all failures during alarm condition evaluation.
- For detailed diagnostics (error types, stack traces, affected instances), operators use the **Site Event Log Viewer** — the health dashboard is for quick triage, not forensics.
## Central Storage