scadalink-design/Component-HealthMonitoring.md
Joseph Doherty 34694adba2 Apply Codex review findings across all 17 components
Template Engine: add composed member addressing (path-qualified canonical names),
override granularity per entity type, semantic validation (call targets, arg types),
graph acyclicity enforcement, revision hashes for flattened configs.

Deployment Manager: add deployment ID + idempotency, per-instance operation lock
covering all mutating commands, state transition matrix, site-side apply atomicity
(all-or-nothing), artifact version compatibility policy.

Site Runtime: add script trust model (forbidden APIs, execution timeout, constrained
compilation), concurrency/serialization rules (Instance Actor serializes mutations),
site-wide stream backpressure (per-subscriber buffering, fire-and-forget publish).

Communication: add application-level correlation IDs for protocol safety beyond
Akka.NET transport guarantees.

External System Gateway: add 408/429 as transient errors, CachedCall idempotency
note, dedicated dispatcher for blocking I/O isolation.

Health Monitoring: add monotonic sequence numbers to prevent stale report overwrites.

Security: require LDAPS/StartTLS for LDAP connections.

Central UI: add failover behavior (SignalR reconnect, JWT survives, shared Data
Protection keys, load balancer readiness).

Cluster Infrastructure: add down-if-alone=on for safe singleton ownership.

Site Event Logging: clarify active-node-only logging (no replication), add 1GB
storage cap with oldest-first purge.

Host: add readiness gating (health check endpoint, no traffic until operational).

Commons: add message contract versioning policy (additive-only evolution).

Configuration Database: add optimistic concurrency on deployment status records.
2026-03-16 09:06:12 -04:00

# Component: Health Monitoring
## Purpose
The Health Monitoring component collects and reports operational health metrics from site clusters to the central cluster, providing engineers with visibility into the status of the distributed system.
## Location
Site clusters (metric collection and reporting). Central cluster (aggregation and display).
## Responsibilities
### Site Side
- Collect health metrics from local subsystems.
- Periodically report metrics to the central cluster via the Communication Layer.
### Central Side
- Receive and store health metrics from all sites.
- Detect site connectivity status (online/offline) based on heartbeat presence.
- Present health data in the Central UI dashboard.
## Monitored Metrics
| Metric | Source | Description |
|--------|--------|-------------|
| Site online/offline | Communication Layer | Whether the site is reachable (based on heartbeat) |
| Active/standby node status | Cluster Infrastructure | Which node is active, which is standby |
| Data connection health | Data Connection Layer | Connected/disconnected/reconnecting per data connection |
| Tag resolution counts | Data Connection Layer | Per connection: total subscribed tags vs. successfully resolved tags |
| Script error rates | Site Runtime (Script Actors) | Frequency of script failures |
| Alarm evaluation error rates | Site Runtime (Alarm Actors) | Frequency of alarm evaluation failures |
| Store-and-forward buffer depth | Store-and-Forward Engine | Pending messages by category (external, notification, DB write) |
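
As an illustration only, the site-reported metrics in the table could be carried in a record like the one below. This is a minimal sketch: the class names, field names, and per-connection layout are assumptions made for this example and are not defined by this document. Site online/offline is omitted because central derives it rather than the site reporting it.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class ConnectionHealth:
    """Health of one data connection (hypothetical layout)."""
    state: str             # "connected", "disconnected", or "reconnecting"
    subscribed_tags: int   # total tags subscribed on this connection
    resolved_tags: int     # tags that resolved successfully

    @property
    def unresolved_tags(self) -> int:
        # Subscribed tags that did not resolve.
        return self.subscribed_tags - self.resolved_tags


@dataclass(frozen=True)
class SiteMetrics:
    """Current values of the site-reported metrics (hypothetical layout)."""
    active_node: str
    standby_node: str
    connections: Dict[str, ConnectionHealth]  # keyed by connection name
    script_errors: int                        # failures since the last report
    alarm_evaluation_errors: int              # failures since the last report
    buffer_depth: Dict[str, int]              # pending messages per category


# Example values are invented for illustration.
example = SiteMetrics(
    active_node="node-a",
    standby_node="node-b",
    connections={"plc-1": ConnectionHealth("connected", 120, 118)},
    script_errors=3,
    alarm_evaluation_errors=0,
    buffer_depth={"external": 4, "notification": 0, "db_write": 12},
)
```
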
## Reporting Protocol
- Sites send a **health report message** to central at a configurable interval (default: **30 seconds**).
- Each report is a **flat snapshot** containing the current values of all monitored metrics, a **monotonic sequence number**, and the **report timestamp** from the site. Central replaces the stored state for a site only if the incoming sequence number is strictly higher than the last one accepted for that site; otherwise the stored state is left unchanged. This prevents stale reports (e.g., delayed in transit or sent by a pre-failover node) from overwriting newer state.
- **Offline detection**: If central does not receive a report within a configurable timeout window (default: **60 seconds**, i.e. 2x the report interval), the site is marked as **offline**. This tolerates one missed report before the site is marked offline.
- **Online recovery**: When central receives a health report from a site that was marked offline, the site is automatically marked **online**. No manual acknowledgment is required; the metrics in the report provide immediate visibility into the site's condition.
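
The central-side rules above (sequence check, offline timeout, automatic recovery) can be sketched as follows. This is a minimal illustration, not the actual implementation: `CentralHealthRegistry`, its method names, and the explicit `now` parameter are hypothetical. It also assumes that a rejected stale report neither refreshes the offline timer nor brings a site back online, which this document does not specify. How sequence numbers continue across a failover is likewise left out.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional

OFFLINE_TIMEOUT_SECONDS = 60.0  # default from the text: 2x the 30 s interval


@dataclass
class SiteState:
    last_sequence: int         # sequence number of the last accepted report
    last_received_at: float    # central clock time of that report, in seconds
    metrics: Dict[str, Any]    # latest accepted snapshot (held in memory only)
    online: bool = True


class CentralHealthRegistry:
    """Hypothetical in-memory store of the latest health state per site."""

    def __init__(self, offline_timeout: float = OFFLINE_TIMEOUT_SECONDS) -> None:
        self._offline_timeout = offline_timeout
        self._sites: Dict[str, SiteState] = {}

    def on_report(self, site_id: str, sequence: int,
                  metrics: Dict[str, Any], now: float) -> bool:
        """Apply a report; return False if it is stale and was ignored."""
        state = self._sites.get(site_id)
        if state is not None and sequence <= state.last_sequence:
            return False  # not newer than the stored state: leave it unchanged
        # Accepting a report replaces the previous state and marks the site
        # online, which also covers recovery from an offline state.
        self._sites[site_id] = SiteState(sequence, now, metrics, online=True)
        return True

    def sweep(self, now: float) -> None:
        """Mark sites offline when no report was accepted within the timeout."""
        for state in self._sites.values():
            if now - state.last_received_at > self._offline_timeout:
                state.online = False

    def is_online(self, site_id: str) -> Optional[bool]:
        state = self._sites.get(site_id)
        return None if state is None else state.online

    def metrics(self, site_id: str) -> Optional[Dict[str, Any]]:
        state = self._sites.get(site_id)
        return None if state is None else state.metrics
```

In this sketch `sweep` would be called on a timer by whatever hosts the registry. Passing `now` explicitly keeps the logic independent of the wall clock, which makes the timeout behavior easy to exercise in tests.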
## Error Rate Metrics
Script error rates and alarm evaluation error rates are calculated as **raw counts per reporting interval**:
- The site maintains a counter for each metric that increments on every failure.
- Each health report includes the count since the last report. The counter resets after each report is sent.
- Central displays these as "X errors in the last 30 seconds", substituting the configured reporting interval when it differs from the default.
- **Script errors** include all failures that prevent a script from completing successfully: unhandled exceptions, timeouts, recursion limit violations, and any other error condition.
- **Alarm evaluation errors** include all failures during alarm condition evaluation.
- For detailed diagnostics (error types, stack traces, affected instances), operators use the **Site Event Log Viewer** — the health dashboard is for quick triage, not forensics.
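
The count-and-reset behavior described above can be sketched as a small site-side counter. The names `IntervalErrorCounter`, `record_failure`, `take`, and `format_for_dashboard` are hypothetical. Reading and resetting in one locked step is a design choice of this sketch, so that a failure recorded while a report is being built is counted in the next interval rather than lost; the document itself only states that the counter resets after each report is sent.

```python
import threading


class IntervalErrorCounter:
    """Counts failures since the last health report (hypothetical helper)."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._count = 0

    def record_failure(self) -> None:
        # Called on every script failure or alarm evaluation failure.
        with self._lock:
            self._count += 1

    def take(self) -> int:
        # Return the count for the interval that just ended and reset to zero.
        with self._lock:
            count, self._count = self._count, 0
            return count


def format_for_dashboard(count: int, interval_seconds: int) -> str:
    # Mirrors the display wording used in this section.
    return f"{count} errors in the last {interval_seconds} seconds"
```

A site would keep one counter for script errors and another for alarm evaluation errors, calling `take()` on each when it assembles a health report.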
## Central Storage
- Health metrics are held **in memory** at the central cluster for display in the UI.
- No historical health data is persisted — the dashboard shows current/latest status only.
- Site connectivity history (online/offline transitions) may optionally be logged via the Audit Log or a separate mechanism if needed in the future.
## No Alerting
- Health monitoring is **display-only** for now — no automated notifications or alerts are triggered by health status changes.
- This can be extended in the future.
## Dependencies
- **Communication Layer**: Transports health reports from sites to central.
- **Data Connection Layer (site)**: Provides connection health metrics.
- **Site Runtime (site)**: Provides script error rate and alarm evaluation error rate metrics.
- **Store-and-Forward Engine (site)**: Provides buffer depth metrics.
- **Cluster Infrastructure (site)**: Provides node role status.
## Interactions
- **Central UI**: Health Monitoring Dashboard displays aggregated metrics.
- **Communication Layer**: Health reports flow as periodic messages.