Refine Health Monitoring: timing defaults, offline detection, error rate calculation

Set 30-second report interval with 60-second absolute timeout for offline detection. Define error rates as raw counts per interval (reset after each report). Script errors include all failure types. Automatic online recovery on first received report. Flat snapshot report structure.
2026-03-16 08:10:16 -04:00
parent 3dd62adf42
commit 57eae0c1db
2 changed files with 62 additions and 3 deletions
--- a/Component-HealthMonitoring.md
+++ b/Component-HealthMonitoring.md
@@ -33,9 +33,21 @@ Site clusters (metric collection and reporting). Central cluster (aggregation an

 ## Reporting Protocol

- Sites send a **health report message** to central at a configurable interval (e.g., every 30 seconds).
- Each report contains the current values of all monitored metrics.
- If central does not receive a report within a timeout window, the site is marked as **offline**.
+- Sites send a **health report message** to central at a configurable interval (default: **30 seconds**).
+- Each report is a **flat snapshot** containing the current values of all monitored metrics. Central replaces the entire previous state for that site on receipt.
+- **Offline detection**: If central does not receive a report within a configurable timeout window (default: **60 seconds** — 2x the report interval), the site is marked as **offline**. This gives one missed report as grace before marking offline.
+- **Online recovery**: When central receives a health report from a site that was marked offline, the site is automatically marked **online**. No manual acknowledgment required — the metrics in the report provide immediate visibility into the site's condition.
+
+## Error Rate Metrics
+
+Script error rates and alarm evaluation error rates are calculated as **raw counts per reporting interval**:
+
+- The site maintains a counter for each metric that increments on every failure.
+- Each health report includes the count since the last report. The counter resets after each report is sent.
+- Central displays these as "X errors in the last 30 seconds" (or whatever the configured interval is).
+- **Script errors** include all failures that prevent a script from completing successfully: unhandled exceptions, timeouts, recursion limit violations, and any other error condition.
+- **Alarm evaluation errors** include all failures during alarm condition evaluation.
+- For detailed diagnostics (error types, stack traces, affected instances), operators use the **Site Event Log Viewer** — the health dashboard is for quick triage, not forensics.

 ## Central Storage

--- a/docs/plans/2026-03-16-health-monitoring-refinement-design.md
+++ b/docs/plans/2026-03-16-health-monitoring-refinement-design.md
@@ -0,0 +1,47 @@
+# Health Monitoring Refinement — Design
+
+**Date**: 2026-03-16
+**Component**: Health Monitoring (`Component-HealthMonitoring.md`)
+**Status**: Approved
+
+## Problem
+
+The Health Monitoring doc listed metrics and described the reporting concept but lacked concrete timing defaults, offline detection logic, error rate calculation methodology, and report structure definition.
+
+## Decisions
+
+### Report Interval & Offline Detection
+- Default report interval: **30 seconds** (configurable).
+- Offline detection: **Absolute timeout of 60 seconds** (2x report interval). If no report received within the window, site is marked offline.
+- Simple single-clock approach — no counting of missed reports.
+
+### Online Recovery
+- **Automatic** — first health report from an offline site marks it online. No manual acknowledgment.
+
+### Error Rate Calculation
+- **Raw counts per reporting interval**. Site increments counters, includes them in the report, resets after each send.
+- Central displays as "X errors in the last 30 seconds."
+- No rolling windows or cumulative counters — keeps both sides simple.
+
+### Error Scope
+- **Script errors**: All failures — unhandled exceptions, timeouts, recursion limit violations, any error preventing completion.
+- **Alarm evaluation errors**: All failures during condition evaluation.
+- Detailed diagnostics via Site Event Log Viewer, not the health dashboard.
+
+### Report Structure
+- **Flat snapshot** — single message with all metric values. Central replaces previous state for that site on receipt.
+
+## Affected Documents
+
+| Document | Change |
+|----------|--------|
+| `Component-HealthMonitoring.md` | Expanded Reporting Protocol with concrete defaults and offline/online logic. Added Error Rate Metrics section. |
+
+## Alternatives Considered
+
+- **Missed-reports threshold for offline**: Rejected — absolute timeout is simpler to reason about and implement.
+- **Rolling window error rates**: Rejected — adds site-side state and complexity for a dashboard metric.
+- **Cumulative counters (Prometheus-style)**: Rejected — requires central-side rate calculation and handles restarts awkwardly.
+- **Unhandled exceptions only for error count**: Rejected — health dashboard needs a single "are scripts healthy?" signal, not categorized failure types.
+- **Categorized report structure**: Rejected — flat snapshot is simpler; dashboard handles display grouping.
+- **Manual acknowledgment for online recovery**: Rejected — creates unnecessary operator busy-work in a system designed for automatic recovery.