# Health Monitoring Refinement: Design

**Date**: 2026-03-16

**Component**: Health Monitoring (`Component-HealthMonitoring.md`)

**Status**: Approved

## Problem

The Health Monitoring doc listed metrics and described the reporting concept but lacked concrete timing defaults, offline detection logic, error rate calculation methodology, and report structure definition.

## Decisions

### Report Interval & Offline Detection

- Default report interval: **30 seconds** (configurable).
- Offline detection: **Absolute timeout of 60 seconds** (2x report interval). If no report received within the window, site is marked offline.
- Simple single-clock approach; no counting of missed reports.

### Online Recovery

- **Automatic**: first health report from an offline site marks it online. No manual acknowledgment.

### Error Rate Calculation

- **Raw counts per reporting interval**. Site increments counters, includes them in the report, resets after each send.
- Central displays as "X errors in the last 30 seconds."
- No rolling windows or cumulative counters; keeps both sides simple.

### Error Scope

- **Script errors**: All failures (unhandled exceptions, timeouts, recursion limit violations, any error preventing completion).
- **Alarm evaluation errors**: All failures during condition evaluation.
- Detailed diagnostics via Site Event Log Viewer, not the health dashboard.

### Report Structure

- **Flat snapshot**: single message with all metric values. Central replaces previous state for that site on receipt.

## Affected Documents

| Document | Change |
|----------|--------|
| `Component-HealthMonitoring.md` | Expanded Reporting Protocol with concrete defaults and offline/online logic. Added Error Rate Metrics section. |

## Alternatives Considered

- **Missed-reports threshold for offline**: Rejected; absolute timeout is simpler to reason about and implement.
- **Rolling window error rates**: Rejected; adds site-side state and complexity for a dashboard metric.
- **Cumulative counters (Prometheus-style)**: Rejected; requires central-side rate calculation and handles restarts awkwardly.
- **Unhandled exceptions only for error count**: Rejected; health dashboard needs a single "are scripts healthy?" signal, not categorized failure types.
- **Categorized report structure**: Rejected; flat snapshot is simpler; dashboard handles display grouping.
- **Manual acknowledgment for online recovery**: Rejected; creates unnecessary operator busy-work in a system designed for automatic recovery.
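
## Illustrative Sketch

The approved decisions (30-second interval, 60-second absolute timeout, automatic recovery, raw per-interval counts, flat snapshot) could be sketched roughly as follows. This is a minimal illustration, not the actual implementation: the class names, field names (`script_errors`, `alarm_eval_errors`), and the choice to derive online status from the last-seen timestamp are all assumptions made for the example. Treating a gap of exactly 60 seconds as offline is also an assumption, since the design does not specify the boundary.

```python
from dataclasses import dataclass
from typing import Dict

REPORT_INTERVAL_S = 30.0                    # default from the design (configurable)
OFFLINE_TIMEOUT_S = 2 * REPORT_INTERVAL_S   # absolute timeout: 60 seconds


@dataclass
class HealthReport:
    """Flat snapshot of metric values. Field names are hypothetical."""
    site_id: str
    script_errors: int
    alarm_eval_errors: int


class SiteCounters:
    """Site side: raw counts per reporting interval, reset after each send."""

    def __init__(self, site_id: str) -> None:
        self.site_id = site_id
        self.script_errors = 0
        self.alarm_eval_errors = 0

    def record_script_error(self) -> None:
        self.script_errors += 1

    def record_alarm_eval_error(self) -> None:
        self.alarm_eval_errors += 1

    def build_report(self) -> HealthReport:
        # Include the current counts in the report, then reset them.
        report = HealthReport(self.site_id, self.script_errors, self.alarm_eval_errors)
        self.script_errors = 0
        self.alarm_eval_errors = 0
        return report


@dataclass
class SiteState:
    last_report: HealthReport
    last_seen: float  # timestamp in seconds, supplied by the caller


class CentralMonitor:
    """Central side: one timestamp per site, no missed-report counting."""

    def __init__(self, timeout_s: float = OFFLINE_TIMEOUT_S) -> None:
        self.timeout_s = timeout_s
        self.sites: Dict[str, SiteState] = {}

    def on_report(self, report: HealthReport, now: float) -> None:
        # Replace the previous state for the site wholesale. Because status is
        # derived from last_seen, receiving a report also brings an offline
        # site back online with no manual acknowledgment.
        self.sites[report.site_id] = SiteState(report, now)

    def is_online(self, site_id: str, now: float) -> bool:
        state = self.sites.get(site_id)
        # Assumption: a gap of exactly timeout_s counts as offline.
        return state is not None and (now - state.last_seen) < self.timeout_s
```

In this sketch, a site that reports at `now=0.0` is online at `now=45.0`, offline at `now=61.0`, and online again as soon as its next report arrives. Passing `now` explicitly keeps the example deterministic; a real system would more likely read a monotonic clock.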