Organize documentation by moving requirements (HighLevelReqs, Component-*, lmxproxy_protocol) to docs/requirements/ and test infrastructure docs to docs/test_infra/. Updates all cross-references in README, CLAUDE.md, infra/README, component docs, and 23 plan files.
48 lines
2.5 KiB
Markdown
48 lines
2.5 KiB
Markdown
# Health Monitoring Refinement — Design
|
|
|
|
**Date**: 2026-03-16
|
|
**Component**: Health Monitoring (`docs/requirements/Component-HealthMonitoring.md`)
|
|
**Status**: Approved
|
|
|
|
## Problem
|
|
|
|
The Health Monitoring doc listed metrics and described the reporting concept but lacked concrete timing defaults, offline detection logic, error rate calculation methodology, and report structure definition.
|
|
|
|
## Decisions
|
|
|
|
### Report Interval & Offline Detection
|
|
- Default report interval: **30 seconds** (configurable).
|
|
- Offline detection: **Absolute timeout of 60 seconds** (2x report interval). If no report received within the window, site is marked offline.
|
|
- Simple single-clock approach — no counting of missed reports.
|
|
|
|
### Online Recovery
|
|
- **Automatic** — first health report from an offline site marks it online. No manual acknowledgment.
|
|
|
|
### Error Rate Calculation
|
|
- **Raw counts per reporting interval**. Site increments counters, includes them in the report, resets after each send.
|
|
- Central displays as "X errors in the last 30 seconds."
|
|
- No rolling windows or cumulative counters — keeps both sides simple.
|
|
|
|
### Error Scope
|
|
- **Script errors**: All failures — unhandled exceptions, timeouts, recursion limit violations, any error preventing completion.
|
|
- **Alarm evaluation errors**: All failures during condition evaluation.
|
|
- Detailed diagnostics via Site Event Log Viewer, not the health dashboard.
|
|
|
|
### Report Structure
|
|
- **Flat snapshot** — single message with all metric values. Central replaces previous state for that site on receipt.
|
|
|
|
## Affected Documents
|
|
|
|
| Document | Change |
|
|
|----------|--------|
|
|
| `docs/requirements/Component-HealthMonitoring.md` | Expanded Reporting Protocol with concrete defaults and offline/online logic. Added Error Rate Metrics section. |
|
|
|
|
## Alternatives Considered
|
|
|
|
- **Missed-reports threshold for offline**: Rejected — absolute timeout is simpler to reason about and implement.
|
|
- **Rolling window error rates**: Rejected — adds site-side state and complexity for a dashboard metric.
|
|
- **Cumulative counters (Prometheus-style)**: Rejected — requires central-side rate calculation and handles restarts awkwardly.
|
|
- **Unhandled exceptions only for error count**: Rejected — health dashboard needs a single "are scripts healthy?" signal, not categorized failure types.
|
|
- **Categorized report structure**: Rejected — flat snapshot is simpler; dashboard handles display grouping.
|
|
- **Manual acknowledgment for online recovery**: Rejected — creates unnecessary operator busy-work in a system designed for automatic recovery.
|