
# Component: Health Monitoring

## Purpose

The Health Monitoring component collects and reports operational health metrics from site clusters to the central cluster, providing engineers with visibility into the status of the distributed system.

## Location

Site clusters (metric collection and reporting). Central cluster (aggregation and display).

## Responsibilities

### Site Side
- Collect health metrics from local subsystems.
- Periodically report metrics to the central cluster via the Communication Layer.

### Central Side
- Receive and store health metrics from all sites.
- Detect site connectivity status (online/offline) based on heartbeat presence.
- Present health data in the Central UI dashboard.

## Monitored Metrics

| Metric | Source | Description |
|--------|--------|-------------|
| Site online/offline | Communication Layer | Whether the site is reachable (based on heartbeat) |
| Active/standby node status | Cluster Infrastructure | Which node is active, which is standby |
| Data connection health | Data Connection Layer | Connected/disconnected/reconnecting per data connection |
| Tag resolution counts | Data Connection Layer | Per connection: total subscribed tags vs. successfully resolved tags |
| Script error rates | Site Runtime (Script Actors) | Frequency of script failures |
| Alarm evaluation error rates | Site Runtime (Alarm Actors) | Frequency of alarm evaluation failures |
| Store-and-forward buffer depth | Store-and-Forward Engine | Pending messages by category — external, notification (notifications awaiting forward to central), DB write |
| Dead letter count | Akka.NET EventStream | Messages sent to actors that no longer exist — indicates stale references or timing issues |
| Notification Outbox queue depth | Notification Outbox (central) | Count of `Pending` + `Retrying` notifications — central-computed, not site-reported |
| Notification Outbox stuck count | Notification Outbox (central) | Count of `Pending` / `Retrying` notifications older than the configurable stuck-age threshold — central-computed, not site-reported |
| Notification Outbox parked count | Notification Outbox (central) | Count of `Parked` notifications — central-computed, not site-reported |
| `SiteAuditBacklog` | Audit Log (site) | Count of `Pending` rows in the site-local `AuditLog`, together with the oldest pending age and on-disk bytes. A configurable threshold drives a Health dashboard warning on the affected site tile. |
| `SiteAuditWriteFailures` | Audit Log (site) | Count of failed hot-path audit appends at the site since the last health report. |
| `SiteAuditTelemetryStalled` | Audit Log (site) | Boolean flag set when reconciliation reports a non-draining site-local audit backlog over two consecutive cycles. |
| `CentralAuditWriteFailures` | Audit Log (central) | Count of central direct-write audit failures (Inbound API middleware, Notification Outbox dispatcher, and any other central direct writers) since the last interval. |
| `AuditRedactionFailure` | Audit Log (central) | Count of payload redactor errors (over-redacted payloads, safety-net hit) since the last interval. |

## Reporting Protocol

- Sites send a **health report message** to central at a configurable interval (default: **30 seconds**).
- Each report is a **flat snapshot** containing the current values of all monitored metrics, a **monotonic sequence number**, and the **report timestamp** from the site. Central replaces the previous state for that site only if the incoming sequence number is higher than the last received — this prevents stale reports (e.g., delayed in transit or from a pre-failover node) from overwriting newer state.
- **Offline detection**: If central does not receive a report within a configurable timeout window (default: **60 seconds**, twice the default report interval), the site is marked as **offline**. With the defaults, one missed report is tolerated before the site is marked offline.
- **Online recovery**: When central receives a health report from a site that was marked offline, the site is automatically marked **online**. No manual acknowledgment required — the metrics in the report provide immediate visibility into the site's condition.
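
The sequence check, offline timeout, and online recovery described above can be sketched as follows. This is a minimal illustration, not the component's actual implementation: the names `HealthReport`, `SiteState`, `CentralHealthStore`, `receive`, and `sweep` are hypothetical, and the sketch assumes that a report rejected as stale does not count as a heartbeat.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class HealthReport:
    site_id: str
    sequence: int          # monotonic sequence number assigned by the site
    site_timestamp: float  # report timestamp from the site, in seconds
    metrics: Dict[str, object] = field(default_factory=dict)  # flat snapshot


@dataclass
class SiteState:
    last_report: HealthReport
    received_at: float     # central clock time when the report was accepted
    online: bool = True


class CentralHealthStore:
    """In-memory latest-state store, one entry per site (hypothetical)."""

    def __init__(self, offline_timeout_s: float = 60.0) -> None:
        self.offline_timeout_s = offline_timeout_s
        self.sites: Dict[str, SiteState] = {}

    def receive(self, report: HealthReport, now: float) -> bool:
        """Apply a report; return False if it was rejected as stale."""
        state = self.sites.get(report.site_id)
        if state is not None and report.sequence <= state.last_report.sequence:
            # Delayed in transit or sent by a pre-failover node: keep newer state.
            # Assumption: a stale report does not refresh liveness either.
            return False
        # Accepting a report replaces the snapshot and marks the site online,
        # which also covers automatic recovery from the offline state.
        self.sites[report.site_id] = SiteState(report, now, online=True)
        return True

    def sweep(self, now: float) -> None:
        """Mark sites offline once the timeout window has elapsed."""
        for state in self.sites.values():
            if now - state.received_at > self.offline_timeout_s:
                state.online = False
```

In this sketch, `sweep` would be called on a timer at central; how often it runs is left open here.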

## Error Rate Metrics

Script error rates and alarm evaluation error rates are calculated as **raw counts per reporting interval**:

- The site maintains a counter for each metric that increments on every failure.
- Each health report includes the count since the last report. The counter resets after each report is sent.
- Central displays these as "X errors in the last 30 seconds", substituting the configured interval when it differs from the default.
- **Script errors** include all failures that prevent a script from completing successfully: unhandled exceptions, timeouts, recursion limit violations, and any other error condition.
- **Alarm evaluation errors** include all failures during alarm condition evaluation.
- For detailed diagnostics (error types, stack traces, affected instances), operators use the **Site Event Log Viewer** — the health dashboard is for quick triage, not forensics.
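
The count-and-reset behaviour above can be sketched as a small counter. The names `IntervalCounter`, `take`, and `describe` are hypothetical, and the lock is an assumption for a standalone example; a counter owned by a single actor would not need one.

```python
import threading


class IntervalCounter:
    """Counts failures since the last health report (hypothetical helper)."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._count = 0

    def increment(self) -> None:
        # Called on every script failure or alarm evaluation failure.
        with self._lock:
            self._count += 1

    def take(self) -> int:
        """Return the count since the last call and reset it to zero."""
        with self._lock:
            count, self._count = self._count, 0
            return count


def describe(count: int, interval_s: int) -> str:
    # Display string in the form used by the dashboard text above.
    return f"{count} errors in the last {interval_s} seconds"
```

Reading and resetting in one step under the lock means a failure recorded while a report is being built lands in the next report rather than being lost.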

## Notification Outbox KPIs

The Notification Outbox is a **central** component, so its KPIs are **central-computed** rather than collected from sites and carried in the site health report:

- The dashboard surfaces three **headline** outbox KPIs: **queue depth** (`Pending` + `Retrying`), **stuck count** (`Pending` / `Retrying` rows older than the configurable stuck-age threshold), and **parked count** (`Parked`).
- The Notification Outbox component computes these on demand from the central `Notifications` table; the health dashboard polls it for the headline tiles.
- The fuller KPI set — which also includes **delivered (last interval)** and **oldest pending age** — lives on the Central UI **Notification Outbox** page, not the health dashboard.
- Outbox KPIs are **point-in-time**, computed on demand from the `Notifications` table. There is no time-series store — consistent with Health Monitoring's "current status only" philosophy. The outbox's own ~1-year row retention answers historical questions directly.

These are distinct from the site-reported **Store-and-forward buffer depth** notification metric, which covers the **site→central leg** (notifications still buffered in a site's Store-and-Forward Engine awaiting forward to central) and remains part of the site health report.
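
As an illustration of the three headline outbox KPIs, the sketch below derives them from an in-memory list of rows. It is not the component's query: `NotificationRow`, its `created_at` field, and `outbox_headline_kpis` are hypothetical, and measuring "older than" from the row's creation time is an assumption. The stuck-age threshold is passed in explicitly because this document does not state its default for the outbox.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Dict, Iterable


@dataclass
class NotificationRow:
    status: str           # e.g. "Pending", "Retrying", "Parked"
    created_at: datetime  # hypothetical column used to measure row age


def outbox_headline_kpis(
    rows: Iterable[NotificationRow], now: datetime, stuck_age: timedelta
) -> Dict[str, int]:
    """Point-in-time queue depth, stuck count, and parked count."""
    queue_depth = stuck_count = parked_count = 0
    for row in rows:
        if row.status in ("Pending", "Retrying"):
            queue_depth += 1
            # Stuck rows are a subset of the queue: still open and too old.
            if now - row.created_at > stuck_age:
                stuck_count += 1
        elif row.status == "Parked":
            parked_count += 1
    return {
        "queue_depth": queue_depth,
        "stuck_count": stuck_count,
        "parked_count": parked_count,
    }
```

Nothing is cached between calls, which matches the point-in-time, no-time-series approach described above.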

## Site Call Audit KPIs

The Site Call Audit is a **central** component, so its KPIs — like the Notification Outbox's — are **central-computed** rather than collected from sites and carried in the site health report:

- The dashboard surfaces Site Call Audit **headline** KPI tiles alongside the existing Notification Outbox tiles.
- The Site Call Audit component computes these on demand from the central `SiteCalls` table, **global and per-source-site**; the health dashboard polls it for the headline tiles.
- The KPI set is **buffered count** (`Pending` + `Retrying`), **parked count** (`Parked`), **failed (last interval)**, **delivered (last interval)**, **oldest pending age**, and **stuck count**.
- **Stuck count** covers `Pending` / `Retrying` rows older than a configurable stuck-age threshold (default **10 minutes**). It is **display-only** (a KPI count plus a row badge), with no escalation or alerting, consistent with the Notification Outbox stuck metric.
- Site Call Audit KPIs are **point-in-time**, computed on demand from the `SiteCalls` table. There is no time-series store — consistent with Health Monitoring's "current status only" philosophy.

Unlike the Notification Outbox, the Site Call Audit is **not a dispatcher** — cached calls are delivered by each site's Store-and-Forward Engine, and the `SiteCalls` table is an eventually-consistent central mirror of site-owned status.
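
The "global and per-source-site" computation can be sketched as one summary function applied to all rows and then to each site's rows. The names `SiteCallRow`, `summarize`, and `site_call_kpis` are hypothetical, as is the `created_at` field. The sketch assumes that oldest pending age spans both `Pending` and `Retrying` rows, and it omits the two last-interval KPIs for brevity.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Dict, List

STUCK_AGE = timedelta(minutes=10)  # default threshold stated above


@dataclass
class SiteCallRow:
    source_site: str
    status: str           # e.g. "Pending", "Retrying", "Parked"
    created_at: datetime  # hypothetical column used to measure row age


def summarize(rows: List[SiteCallRow], now: datetime,
              stuck_age: timedelta = STUCK_AGE) -> dict:
    """Point-in-time KPIs for one set of rows."""
    open_rows = [r for r in rows if r.status in ("Pending", "Retrying")]
    oldest = min((r.created_at for r in open_rows), default=None)
    return {
        "buffered": len(open_rows),
        "parked": sum(1 for r in rows if r.status == "Parked"),
        "stuck": sum(1 for r in open_rows if now - r.created_at > stuck_age),
        # None when nothing is pending, so the tile can show a blank.
        "oldest_pending_age": (now - oldest) if oldest is not None else None,
    }


def site_call_kpis(rows: List[SiteCallRow], now: datetime) -> Dict[str, object]:
    """Global KPIs plus the same KPIs grouped by source site."""
    by_site: Dict[str, List[SiteCallRow]] = defaultdict(list)
    for row in rows:
        by_site[row.source_site].append(row)
    return {
        "global": summarize(rows, now),
        "per_site": {site: summarize(site_rows, now)
                     for site, site_rows in by_site.items()},
    }
```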

## Audit Log KPIs

The Audit Log spans both sites (hot-path append + telemetry forward) and central (direct-write + ingest + redaction). Its operational health surfaces as three new dashboard tiles grouped under **Audit**:

- **Audit volume** — rate of audit rows landing in the central `AuditLog` table over the last interval, sourced from the Audit Log component on the active central node.
- **Audit error rate** — combined view of `CentralAuditWriteFailures` and `AuditRedactionFailure` over the last interval; non-zero values warrant a check of the Audit Log component's own logs.
- **Audit backlog** — global aggregate of `SiteAuditBacklog` across reporting sites (count of `Pending` site-local audit rows, oldest pending age, on-disk bytes). The per-site tile also surfaces a warning badge when its `SiteAuditBacklog` crosses the configurable threshold or when `SiteAuditTelemetryStalled` is set.

These tiles are **point-in-time** like the Notification Outbox and Site Call Audit KPI tiles — no time-series store; consistent with Health Monitoring's "current status only" philosophy. The site-scoped `SiteAuditBacklog` / `SiteAuditWriteFailures` / `SiteAuditTelemetryStalled` metrics arrive in the existing site health report; the central-scoped `CentralAuditWriteFailures` / `AuditRedactionFailure` metrics are central-computed alongside the existing central KPIs.
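
One possible reading of the **Audit backlog** tile and the per-site warning badge is sketched below. The text does not specify how the three backlog values are combined or which of them the threshold applies to, so the choices here are assumptions: counts and bytes are summed, oldest pending age takes the maximum, and the threshold is compared against the pending count. The names `SiteAuditBacklog` (as a record type), `aggregate_backlog`, and `site_tile_warning` are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class SiteAuditBacklog:
    pending_count: int           # `Pending` rows in the site-local AuditLog
    oldest_pending_age_s: float  # age of the oldest pending row, in seconds
    on_disk_bytes: int


def aggregate_backlog(backlogs: Iterable[SiteAuditBacklog]) -> SiteAuditBacklog:
    """Global aggregate across reporting sites (assumed: sum, max, sum)."""
    items = list(backlogs)
    return SiteAuditBacklog(
        pending_count=sum(b.pending_count for b in items),
        oldest_pending_age_s=max(
            (b.oldest_pending_age_s for b in items), default=0.0
        ),
        on_disk_bytes=sum(b.on_disk_bytes for b in items),
    )


def site_tile_warning(backlog: SiteAuditBacklog, telemetry_stalled: bool,
                      pending_threshold: int) -> bool:
    """Badge shows when the backlog crosses the threshold or telemetry stalls."""
    # Assumption: the configurable threshold applies to the pending row count.
    return bool(backlog.pending_count > pending_threshold or telemetry_stalled)
```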

## Central Storage

- Health metrics are held **in memory** at the central cluster for display in the UI.
- No historical health data is persisted — the dashboard shows current/latest status only.
- Notification Outbox and Site Call Audit KPIs are not stored by Health Monitoring; they are computed point-in-time from the central `Notifications` and `SiteCalls` tables respectively each time the dashboard refreshes — consistent with the current-status-only philosophy.
- Site connectivity history (online/offline transitions) may optionally be logged via the Audit Log or a separate mechanism if needed in the future.

## No Alerting

- Health monitoring is **display-only** for now — no automated notifications or alerts are triggered by health status changes.
- This can be extended in the future.

## Dependencies

- **Communication Layer**: Transports health reports from sites to central.
- **Data Connection Layer (site)**: Provides connection health metrics.
- **Site Runtime (site)**: Provides script error rate and alarm evaluation error rate metrics.
- **Store-and-Forward Engine (site)**: Provides buffer depth metrics, including the notification backlog awaiting forward to central.
- **Cluster Infrastructure (site)**: Provides node role status.
- **Notification Outbox (central)**: Provides central-computed outbox KPIs — queue depth, stuck count, parked count — for the headline dashboard tiles.
- **Site Call Audit (central)**: Provides central-computed cached-call KPIs — buffered count, parked count, failed/delivered (last interval), oldest pending age, stuck count — for the headline dashboard tiles.
- **Audit Log (#23)**: Provides the site-reported `SiteAuditBacklog` / `SiteAuditWriteFailures` / `SiteAuditTelemetryStalled` metrics (via the site health report) and the central-computed `CentralAuditWriteFailures` / `AuditRedactionFailure` metrics, plus the central audit-row rate feeding the **Audit** dashboard tile group (Audit volume, Audit error rate, Audit backlog).

## Interactions

- **Central UI**: Health Monitoring Dashboard displays aggregated metrics.
- **Communication Layer**: Health reports flow as periodic messages.