diff --git a/Component-HealthMonitoring.md b/Component-HealthMonitoring.md
index d2428ea..0521b53 100644
--- a/Component-HealthMonitoring.md
+++ b/Component-HealthMonitoring.md
@@ -33,9 +33,21 @@ Site clusters (metric collection and reporting). Central cluster (aggregation an
 
 ## Reporting Protocol
 
-- Sites send a **health report message** to central at a configurable interval (e.g., every 30 seconds).
-- Each report contains the current values of all monitored metrics.
-- If central does not receive a report within a timeout window, the site is marked as **offline**.
+- Sites send a **health report message** to central at a configurable interval (default: **30 seconds**).
+- Each report is a **flat snapshot** containing the current values of all monitored metrics. Central replaces the entire previous state for that site on receipt.
+- **Offline detection**: If central does not receive a report from a site within a configurable timeout window (default: **60 seconds**, 2x the report interval), the site is marked as **offline**. This tolerates one missed report before the site is marked offline.
+- **Online recovery**: When central receives a health report from a site that was marked offline, the site is automatically marked **online**. No manual acknowledgment is required, because the metrics in the report provide immediate visibility into the site's condition.
+
+## Error Rate Metrics
+
+Script error rates and alarm evaluation error rates are calculated as **raw counts per reporting interval**:
+
+- The site maintains a counter for each metric that increments on every failure.
+- Each health report includes the count since the last report. The counter resets after each report is sent.
+- Central displays these as "X errors in the last 30 seconds", substituting the configured report interval when it differs from the default.
+- **Script errors** include all failures that prevent a script from completing successfully: unhandled exceptions, timeouts, recursion limit violations, and any other error condition.
+- **Alarm evaluation errors** include all failures during alarm condition evaluation.
+- For detailed diagnostics (error types, stack traces, affected instances), operators use the **Site Event Log Viewer** — the health dashboard is for quick triage, not forensics.
 
 ## Central Storage
 
diff --git a/docs/plans/2026-03-16-health-monitoring-refinement-design.md b/docs/plans/2026-03-16-health-monitoring-refinement-design.md
new file mode 100644
index 0000000..23ea44b
--- /dev/null
+++ b/docs/plans/2026-03-16-health-monitoring-refinement-design.md
@@ -0,0 +1,47 @@
+# Health Monitoring Refinement — Design
+
+**Date**: 2026-03-16
+**Component**: Health Monitoring (`Component-HealthMonitoring.md`)
+**Status**: Approved
+
+## Problem
+
+The Health Monitoring doc listed metrics and described the reporting concept but lacked concrete timing defaults, offline detection logic, error rate calculation methodology, and report structure definition.
+
+## Decisions
+
+### Report Interval & Offline Detection
+- Default report interval: **30 seconds** (configurable).
+- Offline detection: **Absolute timeout of 60 seconds** (2x report interval). If no report is received within the window, the site is marked offline.
+- Simple single-clock approach — no counting of missed reports.
+
+### Online Recovery
+- **Automatic** — first health report from an offline site marks it online. No manual acknowledgment.
+
+### Error Rate Calculation
+- **Raw counts per reporting interval**. Site increments counters, includes them in the report, resets after each send.
+- Central displays as "X errors in the last 30 seconds."
+- No rolling windows or cumulative counters — keeps both sides simple.
+
+### Error Scope
+- **Script errors**: All failures — unhandled exceptions, timeouts, recursion limit violations, any error preventing completion.
+- **Alarm evaluation errors**: All failures during condition evaluation.
+- Detailed diagnostics via Site Event Log Viewer, not the health dashboard.
+
+### Report Structure
+- **Flat snapshot** — single message with all metric values. Central replaces previous state for that site on receipt.
+
+## Affected Documents
+
+| Document | Change |
+|----------|--------|
+| `Component-HealthMonitoring.md` | Expanded Reporting Protocol with concrete defaults and offline/online logic. Added Error Rate Metrics section. |
+
+## Alternatives Considered
+
+- **Missed-reports threshold for offline**: Rejected — absolute timeout is simpler to reason about and implement.
+- **Rolling window error rates**: Rejected — adds site-side state and complexity for a dashboard metric.
+- **Cumulative counters (Prometheus-style)**: Rejected — requires central-side rate calculation and handles restarts awkwardly.
+- **Unhandled exceptions only for error count**: Rejected — health dashboard needs a single "are scripts healthy?" signal, not categorized failure types.
+- **Categorized report structure**: Rejected — flat snapshot is simpler; dashboard handles display grouping.
+- **Manual acknowledgment for online recovery**: Rejected — creates unnecessary operator busy-work in a system designed for automatic recovery.
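
The decisions above (raw counts reset per report, flat snapshot replacement, absolute offline timeout, automatic online recovery) can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's implementation: the class names `SiteErrorCounters` and `CentralHealthState`, the metric keys, and the injectable `clock` parameter are all hypothetical, and the sketch assumes a report arriving exactly at the timeout boundary still counts as on time.

```python
import time
from typing import Callable, Dict, Optional

REPORT_INTERVAL_S = 30.0  # default report interval from the design
OFFLINE_TIMEOUT_S = 60.0  # default offline timeout (2x the report interval)


class SiteErrorCounters:
    """Site side: raw error counts per reporting interval (hypothetical name).

    Not thread-safe; a real site would need to guard the read-and-reset step.
    """

    def __init__(self) -> None:
        self.script_errors = 0
        self.alarm_eval_errors = 0

    def record_script_error(self) -> None:
        self.script_errors += 1

    def record_alarm_eval_error(self) -> None:
        self.alarm_eval_errors += 1

    def build_report(self, other_metrics: Dict[str, float]) -> Dict[str, float]:
        """Return a flat snapshot and reset the counters for the next interval."""
        report = dict(other_metrics)
        report["script_errors"] = self.script_errors  # assumed key name
        report["alarm_eval_errors"] = self.alarm_eval_errors  # assumed key name
        self.script_errors = 0
        self.alarm_eval_errors = 0
        return report


class CentralHealthState:
    """Central side: latest snapshot per site plus a single-clock timeout."""

    def __init__(
        self,
        timeout_s: float = OFFLINE_TIMEOUT_S,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._timeout_s = timeout_s
        self._clock = clock
        self._last_seen: Dict[str, float] = {}
        self._snapshots: Dict[str, Dict[str, float]] = {}

    def receive_report(self, site_id: str, report: Dict[str, float]) -> None:
        # Replace the entire previous state for this site; no merging.
        self._snapshots[site_id] = dict(report)
        self._last_seen[site_id] = self._clock()

    def is_online(self, site_id: str) -> bool:
        # Online status is derived from the last receipt time, so the first
        # report after an outage marks the site online with no extra step.
        last = self._last_seen.get(site_id)
        return last is not None and (self._clock() - last) <= self._timeout_s

    def snapshot(self, site_id: str) -> Optional[Dict[str, float]]:
        return self._snapshots.get(site_id)
```

Deriving the online flag from the last receipt time, rather than storing it, is one way to get automatic recovery without a separate state transition. Passing a fake `clock` makes the timeout behaviour testable without waiting for real time to elapse.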