
Joseph Doherty 57eae0c1db Refine Health Monitoring: timing defaults, offline detection, error rate calculation

Set 30-second report interval with 60-second absolute timeout for offline detection.
Define error rates as raw counts per interval (reset after each report). Script errors
include all failure types. Automatic online recovery on first received report. Flat
snapshot report structure.

2026-03-16 08:10:16 -04:00


Health Monitoring Refinement — Design

Date: 2026-03-16
Component: Health Monitoring (Component-HealthMonitoring.md)
Status: Approved

Problem

The Health Monitoring doc listed metrics and described the reporting concept but lacked concrete timing defaults, offline detection logic, error rate calculation methodology, and report structure definition.

Decisions

Report Interval & Offline Detection

Default report interval: 30 seconds (configurable).
Offline detection: Absolute timeout of 60 seconds (2x report interval). If no report is received within the window, the site is marked offline.
Simple single-clock approach — no counting of missed reports.

Online Recovery

Automatic — first health report from an offline site marks it online. No manual acknowledgment.
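
The timeout and recovery rules above can be sketched as a small central-side tracker. This is a minimal illustration, not the actual implementation: the class name `SiteStatus`, its fields, and the use of float timestamps in seconds are hypothetical. It also assumes that a site with no report yet starts offline and that a report arriving at exactly 60 seconds still counts as inside the window. The design does not specify either boundary.

```python
from dataclasses import dataclass
from typing import Optional

# Absolute timeout from the design (2x the 30-second default interval).
OFFLINE_TIMEOUT_S = 60.0


@dataclass
class SiteStatus:
    """Central-side view of one site (hypothetical name and fields)."""

    last_report_at: Optional[float] = None  # time of the most recent report, in seconds
    online: bool = False  # assumption: a site is offline until its first report

    def on_report(self, now: float) -> None:
        # Any received report marks the site online; no manual acknowledgment.
        self.last_report_at = now
        self.online = True

    def check_timeout(self, now: float) -> None:
        # Single clock: compare elapsed time to the timeout.
        # Missed reports are never counted.
        if self.last_report_at is None:
            return
        if now - self.last_report_at > OFFLINE_TIMEOUT_S:
            self.online = False
```

With this shape, calling `check_timeout` on a periodic timer and `on_report` on every received report is enough to drive both transitions.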

Error Rate Calculation

Raw counts per reporting interval. The site increments counters, includes them in the report, and resets them after each send.
Central displays as "X errors in the last 30 seconds."
No rolling windows or cumulative counters — keeps both sides simple.
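
The count-and-reset behaviour described above might look like the following on the site side. This is a sketch under stated assumptions: `ErrorCounters`, its field names, and `format_error_rate` are hypothetical, and the real report may carry different keys.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class ErrorCounters:
    """Site-side counters for one reporting interval (hypothetical names)."""

    script_errors: int = 0
    alarm_eval_errors: int = 0

    def snapshot_and_reset(self) -> Dict[str, int]:
        # Capture the raw counts for the outgoing report, then start the
        # next interval from zero. No rolling window, no cumulative total.
        counts = {
            "script_errors": self.script_errors,
            "alarm_eval_errors": self.alarm_eval_errors,
        }
        self.script_errors = 0
        self.alarm_eval_errors = 0
        return counts


def format_error_rate(count: int, interval_s: int = 30) -> str:
    # Central-side display string, using the wording from the design.
    return f"{count} errors in the last {interval_s} seconds"
```

Because each report carries only the counts for its own interval, central needs no rate calculation and a site restart simply begins a fresh interval.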

Error Scope

Script errors: All failures — unhandled exceptions, timeouts, recursion limit violations, any error preventing completion.
Alarm evaluation errors: All failures during condition evaluation.
Detailed diagnostics are available through the Site Event Log Viewer, not the health dashboard.
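
To illustrate the "all failures count" rule for script errors, the sketch below funnels every failure type into a single counter. `ScriptRunner` and its members are hypothetical names, and a Python callable stands in for a script; how the real script engine reports timeouts and recursion limits is not covered by this document.

```python
from typing import Callable


class ScriptRunner:
    """Runs scripts and counts every failure as one script error (hypothetical)."""

    def __init__(self) -> None:
        self.script_errors = 0

    def run(self, script: Callable[[], None]) -> bool:
        try:
            script()
            return True
        except Exception:
            # Unhandled exceptions, timeouts, and recursion limit violations
            # all land here; the health metric does not categorize them.
            # Detailed diagnostics would go to the event log instead.
            self.script_errors += 1
            return False
```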

Report Structure

Flat snapshot — single message with all metric values. Central replaces previous state for that site on receipt.
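
A flat snapshot with replace-on-receipt semantics could be modelled as shown below. The `HealthReport` fields and the `CentralState` class are illustrative assumptions; the actual metric list lives in Component-HealthMonitoring.md and is not reproduced here.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class HealthReport:
    """One flat message holding all metric values (hypothetical fields)."""

    site_id: str
    script_errors: int
    alarm_eval_errors: int


class CentralState:
    """Keeps only the latest report per site (hypothetical name)."""

    def __init__(self) -> None:
        self.latest: Dict[str, HealthReport] = {}

    def on_report(self, report: HealthReport) -> None:
        # Replace, do not merge: the new snapshot fully supersedes
        # whatever was previously stored for this site.
        self.latest[report.site_id] = report
```

Since nothing is merged, central never has to reconcile partial updates; any display grouping is left to the dashboard.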

Affected Documents

Document	Change
`Component-HealthMonitoring.md`	Expanded Reporting Protocol with concrete defaults and offline/online logic. Added Error Rate Metrics section.

Alternatives Considered

Missed-reports threshold for offline: Rejected — absolute timeout is simpler to reason about and implement.
Rolling window error rates: Rejected — adds site-side state and complexity for a dashboard metric.
Cumulative counters (Prometheus-style): Rejected — requires central-side rate calculation and handles restarts awkwardly.
Unhandled exceptions only for error count: Rejected — health dashboard needs a single "are scripts healthy?" signal, not categorized failure types.
Categorized report structure: Rejected — flat snapshot is simpler; dashboard handles display grouping.
Manual acknowledgment for online recovery: Rejected — creates unnecessary operator busy-work in a system designed for automatic recovery.
