Initial design docs from claude.ai refinement sessions

2026-03-16 07:39:26 -04:00
commit 1944f94fed
22 changed files with 2821 additions and 0 deletions
--- a/Component-HealthMonitoring.md
+++ b/Component-HealthMonitoring.md
@@ -0,0 +1,61 @@
+# Component: Health Monitoring
+
+## Purpose
+
+The Health Monitoring component collects and reports operational health metrics from site clusters to the central cluster, providing engineers with visibility into the status of the distributed system.
+
+## Location
+
+Site clusters (metric collection and reporting). Central cluster (aggregation and display).
+
+## Responsibilities
+
+### Site Side
+- Collect health metrics from local subsystems.
+- Periodically report metrics to the central cluster via the Communication Layer.
+
+### Central Side
+- Receive and store health metrics from all sites.
+- Detect site connectivity status (online/offline) based on heartbeat presence.
+- Present health data in the Central UI dashboard.
+
+## Monitored Metrics
+
+| Metric | Source | Description |
+|--------|--------|-------------|
+| Site online/offline | Communication Layer | Whether the site is reachable (based on heartbeat) |
+| Active/standby node status | Cluster Infrastructure | Which node is active, which is standby |
+| Data connection health | Data Connection Layer | Connected/disconnected per data connection |
+| Script error rates | Site Runtime (Script Actors) | Frequency of script failures |
+| Alarm evaluation error rates | Site Runtime (Alarm Actors) | Frequency of alarm evaluation failures |
+| Store-and-forward buffer depth | Store-and-Forward Engine | Pending messages by category (external, notification, DB write) |
+
+## Reporting Protocol
+
+- Sites send a **health report message** to central at a configurable interval (e.g., every 30 seconds).
+- Each report contains the current values of all monitored metrics.
+- If central does not receive a report within a timeout window, the site is marked as **offline**.
+
+## Central Storage
+
+- Health metrics are held **in memory** at the central cluster for display in the UI.
+- No historical health data is persisted — the dashboard shows current/latest status only.
+- Site connectivity history (online/offline transitions) may optionally be logged via the Audit Log or a separate mechanism if needed in the future.
+
+## No Alerting
+
+- Health monitoring is **display-only** for now — no automated notifications or alerts are triggered by health status changes.
+- This can be extended in the future.
+
+## Dependencies
+
+- **Communication Layer**: Transports health reports from sites to central.
+- **Data Connection Layer (site)**: Provides connection health metrics.
+- **Site Runtime (site)**: Provides script error rate and alarm evaluation error rate metrics.
+- **Store-and-Forward Engine (site)**: Provides buffer depth metrics.
+- **Cluster Infrastructure (site)**: Provides node role status.
+
+## Interactions
+
+- **Central UI**: Health Monitoring Dashboard displays aggregated metrics.
+- **Communication Layer**: Health reports flow as periodic messages.