Set 30-second report interval with 60-second absolute timeout for offline detection. Define error rates as raw counts per interval (reset after each report). Script errors include all failure types. Automatic online recovery on first received report. Flat snapshot report structure.
Health Monitoring Refinement — Design
Date: 2026-03-16
Component: Health Monitoring (Component-HealthMonitoring.md)
Status: Approved
Problem
The Health Monitoring doc listed metrics and described the reporting concept but lacked concrete timing defaults, offline detection logic, error rate calculation methodology, and report structure definition.
Decisions
Report Interval & Offline Detection
- Default report interval: 30 seconds (configurable).
- Offline detection: Absolute timeout of 60 seconds (twice the default report interval). If no report is received within that window, the site is marked offline.
- Simple single-clock approach — no counting of missed reports.
Online Recovery
- Automatic — first health report from an offline site marks it online. No manual acknowledgment.
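A minimal sketch of the single-clock timeout and automatic recovery described above. The class and method names (`SiteTracker`, `on_report`, `is_online`) are hypothetical, and two behaviors are assumptions this document does not settle: a site that has never reported is treated as offline, and a site is still online at exactly 60 seconds.

```python
from dataclasses import dataclass, field
from typing import Dict

REPORT_INTERVAL_S = 30.0  # default report interval from this design (configurable)
OFFLINE_TIMEOUT_S = 60.0  # absolute timeout, twice the default interval


@dataclass
class SiteTracker:
    """Central-side tracker: one timestamp per site, no missed-report counting."""

    timeout_s: float = OFFLINE_TIMEOUT_S
    last_seen: Dict[str, float] = field(default_factory=dict)

    def on_report(self, site_id: str, now: float) -> None:
        # Any received report refreshes the timestamp, which is all that is
        # needed to bring an offline site back online (no acknowledgment step).
        self.last_seen[site_id] = now

    def is_online(self, site_id: str, now: float) -> bool:
        seen = self.last_seen.get(site_id)
        if seen is None:
            # Assumption: a site that has never reported counts as offline.
            return False
        # Assumption: the boundary is inclusive (online at exactly timeout_s).
        return (now - seen) <= self.timeout_s
```

Because status is derived from the last-seen timestamp alone, recovery needs no separate code path: the next report simply moves the timestamp forward.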
Error Rate Calculation
- Raw counts per reporting interval. Site increments counters, includes them in the report, resets after each send.
- Central displays the count as "X errors in the last 30 seconds" (at the default interval).
- No rolling windows or cumulative counters — keeps both sides simple.
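The reset-after-send counting above can be sketched as follows. `IntervalCounter` and `format_error_rate` are hypothetical names used for illustration only, not part of any existing component.

```python
class IntervalCounter:
    """Site-side raw count for the current reporting interval."""

    def __init__(self) -> None:
        self.count = 0

    def increment(self) -> None:
        self.count += 1

    def take_and_reset(self) -> int:
        # Called once when the report is built: return the count for this
        # interval and start the next interval from zero.
        value = self.count
        self.count = 0
        return value


def format_error_rate(count: int, interval_s: int = 30) -> str:
    # Central-side display string; no rate calculation is needed because
    # the reported value already covers exactly one interval.
    return f"{count} errors in the last {interval_s} seconds"
```

Neither side keeps history: the site holds one integer per metric, and central shows the value it last received.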
Error Scope
- Script errors: All failures — unhandled exceptions, timeouts, recursion limit violations, any error preventing completion.
- Alarm evaluation errors: All failures during condition evaluation.
- Detailed diagnostics via Site Event Log Viewer, not the health dashboard.
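To illustrate the "all failures" scope, the sketch below models a script as a plain Python callable, which is an assumption made for this example; the real script runtime is not described in this document. `ScriptErrorCounter` is a hypothetical name.

```python
from typing import Callable


class ScriptErrorCounter:
    """Counts every script failure in one uncategorized counter."""

    def __init__(self) -> None:
        self.count = 0

    def run(self, script: Callable[[], None]) -> bool:
        """Run a script; return True if it completed, False if it failed."""
        try:
            script()
        except Exception:
            # Timeouts, recursion-limit violations, and any other unhandled
            # exception all land here and increment the same counter.
            # Details would go to the site event log, not the health report.
            self.count += 1
            return False
        return True
```

In Python, `TimeoutError` and `RecursionError` are both subclasses of `Exception`, so a single handler covers the failure types listed above.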
Report Structure
- Flat snapshot — single message with all metric values. Central replaces previous state for that site on receipt.
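A sketch of the flat snapshot and the replace-on-receipt behavior. The field names in `HealthReport` are assumptions chosen for illustration; the actual metric list is defined in Component-HealthMonitoring.md.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class HealthReport:
    """One flat message holding every metric value (hypothetical fields)."""

    site_id: str
    sent_at: float
    script_errors: int
    alarm_eval_errors: int


class CentralState:
    """Keeps only the most recent report per site."""

    def __init__(self) -> None:
        self.latest: Dict[str, HealthReport] = {}

    def on_report(self, report: HealthReport) -> None:
        # Replace, never merge or accumulate: the new snapshot is the
        # complete state for that site.
        self.latest[report.site_id] = report

    def get(self, site_id: str) -> Optional[HealthReport]:
        return self.latest.get(site_id)
```

Because each report is complete, central never has to reconcile partial updates or sum counts across intervals.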
Affected Documents
| Document | Change |
|---|---|
| Component-HealthMonitoring.md | Expanded Reporting Protocol with concrete defaults and offline/online logic. Added Error Rate Metrics section. |
Alternatives Considered
- Missed-reports threshold for offline: Rejected — absolute timeout is simpler to reason about and implement.
- Rolling window error rates: Rejected — adds site-side state and complexity for a dashboard metric.
- Cumulative counters (Prometheus-style): Rejected — requires central-side rate calculation and handles restarts awkwardly.
- Unhandled exceptions only for error count: Rejected — health dashboard needs a single "are scripts healthy?" signal, not categorized failure types.
- Categorized report structure: Rejected — flat snapshot is simpler; dashboard handles display grouping.
- Manual acknowledgment for online recovery: Rejected — creates unnecessary operator busy-work in a system designed for automatic recovery.