# Health Monitoring The Health Monitoring component collects operational metrics from site cluster subsystems, forwards them to central on a 30-second cadence, and exposes an in-memory aggregated view that the Central UI health dashboard polls. ## Overview Health Monitoring (#11) runs on both site and central nodes with different roles on each side. It exists as a display-only, in-memory signal layer — no historical persistence, no alerting. The philosophy is current-status-only: the dashboard answers "what is the system doing right now?" rather than "what happened over the last hour?". The component code lives in `src/ZB.MOM.WW.ScadaBridge.HealthMonitoring/`: - `SiteHealthCollector` / `ISiteHealthCollector` — the site-side thread-safe accumulator that other site subsystems push metrics into. Script Actors, Alarm Actors, DCL connection actors, and the Audit Log bridge all call into this singleton. - `HealthReportSender` — a `BackgroundService` that ticks every `ReportInterval` (30 s), atomically drains the collector into a `SiteHealthReport`, stamps a monotonic sequence number, and fires it to central over `IHealthReportTransport` (Akka Tell, fire-and-forget). Only the active node sends — the standby runs the loop but skips the send. - `CentralHealthAggregator` / `ICentralHealthAggregator` — the central-side `BackgroundService` and in-memory store. Receives `SiteHealthReport` messages, applies them atomically via compare-and-swap on `SiteHealthState` records, and runs a periodic offline-detection sweep. - `CentralHealthReportLoop` — central-only counterpart to `HealthReportSender`: generates a synthetic `SiteHealthReport` for the central cluster itself (site ID `$central`) so central appears as a first-class card on `/monitoring/health`. Only the Primary (leader) node generates reports. - `SiteHealthState` — immutable record holding the latest report, last-heartbeat timestamp, sequence number, and online/offline flag for one site. Handed directly to UI callers; never torn because the aggregator swaps it atomically. DI entry points are split by role: `AddSiteHealthMonitoring` for site nodes (registers `ISiteHealthCollector` + starts `HealthReportSender`); `AddCentralHealthAggregation` for central nodes (registers `CentralHealthAggregator` + starts `CentralHealthReportLoop`); `AddHealthMonitoring` for nodes that need the collector but not the sender (shared use). All three are idempotent with respect to `HealthMonitoringOptionsValidator` registration. ## Key Concepts ### Monotonic sequence numbers Sequence numbers are seeded at construction with the current Unix epoch in milliseconds rather than starting at zero. This ensures that, after a site failover, the newly-active node's first report always carries a higher sequence number than any report the previous active node sent — the central aggregator's sequence guard would otherwise silently discard the new active's reports as stale. The same seeding applies to `CentralHealthReportLoop` for the `$central` synthetic site. ### Raw error counts per interval `ScriptErrorCount`, `AlarmEvaluationErrorCount`, `DeadLetterCount`, `SiteAuditWriteFailures`, and `AuditRedactionFailure` are raw counts accumulated since the previous report, not rates. `SiteHealthCollector.CollectReport` atomically reads and resets each counter via `Interlocked.Exchange`. If the transport `Send` throws, `HealthReportSender` restores the drained counts back into the collector via `Interlocked.Add` so they roll forward into the next report rather than being silently lost. Concurrent increments that arrive during a failed send accumulate against zero and are preserved by the restore `Add`. ### Online/offline detection Online status is driven by `LastHeartbeatAt`, not by `LastReportReceivedAt`. Heartbeats arrive from `SiteCommunicationActor` every ~5 s (`CommunicationOptions.TransportHeartbeatInterval`), so the 60 s `OfflineTimeout` tolerates roughly twelve missed heartbeats before declaring a site offline. A single-node failover — where the standby is alive but the active cannot produce a full report — therefore does not trigger a false offline transition. The synthetic `$central` site has no heartbeat source; its only signal is the 30 s `CentralHealthReportLoop` self-report. It therefore gets a longer `CentralOfflineTimeout` (default 3 × `ReportInterval` = 90 s), equivalent to one missed self-report. The validator rejects any configuration where `CentralOfflineTimeout < OfflineTimeout`. The offline-check `PeriodicTimer` runs at half the shorter of the two timeouts so whichever site class has the tighter window is swept at least twice within it. ### Dead-letter monitoring `SiteHealthCollector.IncrementDeadLetter` is called by the site's Akka `EventStream` dead-letter subscriber. Each call atomically increments `_deadLetterCount`; the count appears in the next health report as `DeadLetterCount`. Dead letters indicate messages sent to actors that no longer exist — typically stale references or timing races after instance redeploy. The health dashboard surfaces the count per report interval for quick triage; Site Event Log Viewer provides the per-message detail. ## Architecture ### Site-side collection `SiteHealthCollector` holds all per-interval counters (`_scriptErrorCount`, `_alarmErrorCount`, `_deadLetterCount`, `_siteAuditWriteFailures`, `_auditRedactionFailures`) as `int` fields touched only through `Interlocked` operations, and snapshot state (`ConcurrentDictionary` for connection health, tag resolution, and S&F buffer depths) that is overwritten rather than incremented. This split means `CollectReport` can atomically reset the counters in one pass while taking a point-in-time copy of the dictionaries, with no locks: ```csharp // SiteHealthCollector.CollectReport (abbreviated) public SiteHealthReport CollectReport(string siteId) { var scriptErrors = Interlocked.Exchange(ref _scriptErrorCount, 0); var alarmErrors = Interlocked.Exchange(ref _alarmErrorCount, 0); var deadLetters = Interlocked.Exchange(ref _deadLetterCount, 0); var auditFailures = Interlocked.Exchange(ref _siteAuditWriteFailures, 0); var redactFailures = Interlocked.Exchange(ref _auditRedactionFailures, 0); return new SiteHealthReport( SiteId: siteId, SequenceNumber: 0, // caller stamps monotonic seq ReportTimestamp: _timeProvider.GetUtcNow(), ScriptErrorCount: scriptErrors, DeadLetterCount: deadLetters, SiteAuditWriteFailures: auditFailures, AuditRedactionFailure: redactFailures, /* ... connection snapshots, instance counts, S&F depths */ SiteAuditBacklog: _siteAuditBacklog); } ``` The `SequenceNumber` field in the returned record is always `0`; `HealthReportSender` overwrites it with the atomically-incremented monotonic counter immediately before calling `_transport.Send`. ### Site-side report send `HealthReportSender` is an active-node-only sender: at the top of each tick it checks `_collector.IsActiveNode` and skips the remainder when `false`. The active/standby flag is set by the Deployment Manager singleton ownership check, not by this component. The send itself is synchronous and fire-and-forget (`IHealthReportTransport.Send` wraps an Akka `Tell`). A transport exception is caught, logged, and the interval counts are restored before re-throwing — the outer `catch (Exception)` swallows the rethrow so the background service never terminates from a single failed send. ### Central aggregation `CentralHealthAggregator` stores one `SiteHealthState` record per site in a `ConcurrentDictionary`. Every write (from `ProcessReport` or `MarkHeartbeat`) uses a compare-and-swap loop: ```csharp // CentralHealthAggregator.ProcessReport (core CAS path) var updated = existing with { LatestReport = report, LastReportReceivedAt = now, LastHeartbeatAt = now, LastSequenceNumber = report.SequenceNumber, IsOnline = true }; if (_siteStates.TryUpdate(report.SiteId, updated, existing)) return; // CAS lost — retry with fresh value ``` Sequence numbers guard against stale reports from a pre-failover node overwriting the new active's fresher state. A heartbeat for an unknown site (e.g., just after a central restart) registers the site as online with a null `LatestReport` so the site is not shown as "unknown" during the failover window. The offline sweep runs on a `PeriodicTimer` at `ComputeCheckInterval(_options)` — half the shorter of `OfflineTimeout` and `CentralOfflineTimeout`. It checks `LastHeartbeatAt` (not report time) and applies a single non-retried CAS: if the CAS loses, the site was just heard from and leaving it online is correct. ### Audit Log metrics bridge Audit Log registers `AddAuditLogHealthMetricsBridge` on site nodes after `AddSiteHealthMonitoring`. This replaces the default no-op failure counters with two bridges that forward directly into `ISiteHealthCollector`: - `HealthMetricsAuditWriteFailureCounter` — called by `FallbackAuditWriter` on every primary SQLite failure; increments `SiteAuditWriteFailures`. - `HealthMetricsAuditRedactionFailureCounter` — called by the payload redactor on every over-redaction event; increments `AuditRedactionFailure`. A third collaborator, `SiteAuditBacklogReporter`, is a hosted service that polls `ISiteAuditQueue.GetBacklogStatsAsync` every 30 s and pushes a `SiteAuditBacklogSnapshot` into `ISiteHealthCollector.UpdateSiteAuditBacklog`. The snapshot (`PendingCount`, `OldestPendingUtc`, `OnDiskBytes`) rides as `SiteAuditBacklog` on the next health report. The poll runs in a separate service rather than inline in `CollectReport` to keep the report path free of synchronous SQLite I/O. On central, `AuditCentralHealthSnapshot` (in the Audit Log component) is the symmetric counterpart: it accumulates `CentralAuditWriteFailures`, `AuditRedactionFailure`, and a per-site `SiteAuditTelemetryStalled` map fed by `SiteAuditTelemetryStalledTracker`. These are read by the central health dashboard alongside the aggregated site states. See [Audit Log](./AuditLog.md) for the full counter and stall-detection design. ## Usage Site subsystems call `ISiteHealthCollector` directly — it is a DI singleton. Examples of callers by subsystem: | Caller | Method | Metric in report | |--------|--------|-----------------| | Script Actor (on unhandled exception) | `IncrementScriptError()` | `ScriptErrorCount` | | Alarm Actor (on eval failure) | `IncrementAlarmError()` | `AlarmEvaluationErrorCount` | | Akka EventStream dead-letter subscriber | `IncrementDeadLetter()` | `DeadLetterCount` | | DCL connection actor | `UpdateConnectionHealth(name, health)` | `DataConnectionStatuses` | | DCL connection actor | `UpdateTagResolution(name, total, resolved)` | `TagResolutionCounts` | | DCL connection actor | `UpdateTagQuality(name, good, bad, uncertain)` | `DataConnectionTagQuality` | | Audit Log bridge | `IncrementSiteAuditWriteFailures()` | `SiteAuditWriteFailures` | | Audit Log bridge | `IncrementAuditRedactionFailure()` | `AuditRedactionFailure` | | `SiteAuditBacklogReporter` | `UpdateSiteAuditBacklog(snapshot)` | `SiteAuditBacklog` | | `HealthReportSender` | `SetParkedMessageCount(count)` | `ParkedMessageCount` | Central consumers resolve `ICentralHealthAggregator` and call `GetAllSiteStates()` or `GetSiteState(siteId)` to read a snapshot-safe dictionary of `SiteHealthState` records. The health dashboard polls this on a ~10 s timer. Because `SiteHealthState` is an immutable record swapped atomically, a consumer can hold the reference without risk of a torn read. ## Configuration Options class: `HealthMonitoringOptions`, bound from the `ScadaBridge:HealthMonitoring` section. Validated at startup by `HealthMonitoringOptionsValidator` (registered with `ValidateOnStart`) so a bad configuration fails with a clear key-naming message rather than an opaque `ArgumentOutOfRangeException` inside a `PeriodicTimer` constructor. | Key | Default | Constraint | Description | |-----|---------|-----------|-------------| | `ScadaBridge:HealthMonitoring:ReportInterval` | `00:00:30` (30 s) | Must be `> 0` | Interval at which site nodes emit health reports to central. Also the `CentralHealthReportLoop` self-report cadence. | | `ScadaBridge:HealthMonitoring:OfflineTimeout` | `00:01:00` (60 s) | Must be `> 0` | Silence window after which a real site is marked offline. Driven by `LastHeartbeatAt`, not last report time. | | `ScadaBridge:HealthMonitoring:CentralOfflineTimeout` | `00:03:00` (3 min) | Must be `>= OfflineTimeout` | Grace window for the `$central` synthetic site, which has no heartbeat source. Defaults to 3× `ReportInterval`. | The offline-check cadence is derived at runtime as `min(OfflineTimeout, CentralOfflineTimeout) / 2` — not directly configurable. ## Dependencies & Interactions - [Commons (#16)](./Commons.md) — defines `SiteHealthReport`, `SiteHealthReportReplica`, `NodeStatus`, `SiteAuditBacklogSnapshot`, and the `ISiteHealthCollector` / `ICentralHealthAggregator` interfaces consumed throughout. `SiteHealthReport` is an additive record; new fields use default values so existing producers remain valid. - [Central–Site Communication (#5)](./Communication.md) — transports `SiteHealthReport` messages from site to central via Akka remoting (fire-and-forget Tell through `IHealthReportTransport`). Also delivers heartbeats from `SiteCommunicationActor` that `CentralHealthAggregator.MarkHeartbeat` uses to keep sites online between reports. `SiteHealthReportReplica` is broadcast on DistributedPubSub so both central nodes maintain identical aggregator state. - [Site Runtime (#3)](./SiteRuntime.md) — Script Actors call `IncrementScriptError`; Alarm Actors call `IncrementAlarmError`; the Deployment Manager singleton ownership check drives `SetActiveNode`. - [Data Connection Layer (#4)](./DataConnectionLayer.md) — connection actors call `UpdateConnectionHealth`, `UpdateTagResolution`, `UpdateConnectionEndpoint`, `UpdateTagQuality`, and `RemoveConnection` on `ISiteHealthCollector`. - [Store-and-Forward Engine (#6)](./StoreAndForward.md) — `HealthReportSender` queries `StoreAndForwardStorage` for `GetParkedMessageCountAsync` and `GetBufferDepthByCategoryAsync`; the results populate `ParkedMessageCount` and `StoreAndForwardBufferDepths` (keyed by `StoreAndForwardCategory` name). - [Cluster Infrastructure (#13)](./ClusterInfrastructure.md) — `IClusterNodeProvider` supplies cluster node list and `SelfIsPrimary` flag to both `HealthReportSender` and `CentralHealthReportLoop`. Heartbeat cadence (default 5 s) is owned by Cluster Infrastructure / `SiteCommunicationActor`. - [Audit Log (#23)](./AuditLog.md) — `AddAuditLogHealthMetricsBridge` wires `HealthMetricsAuditWriteFailureCounter` and `HealthMetricsAuditRedactionFailureCounter` into the site collector, and registers `SiteAuditBacklogReporter` to poll the site-local SQLite drain backlog. On central, `AuditCentralHealthSnapshot` exposes `CentralAuditWriteFailures`, `AuditRedactionFailure`, and per-site `SiteAuditTelemetryStalled` alongside the aggregated site states on the health dashboard. - [Central UI (#9)](./CentralUI.md) — the health dashboard resolves `ICentralHealthAggregator` and polls `GetAllSiteStates()` on a ~10 s timer. Notification Outbox and Site Call Audit KPIs are computed on demand from their own central tables by those components; Health Monitoring does not own or cache them. - [Host (#15)](./Host.md) — implements `ISiteIdentityProvider` (supplies `SiteId` for report payloads) and `IClusterNodeProvider`, and calls the appropriate `Add*` entry points from the role-specific composition root. ## Troubleshooting ### A site flaps online/offline during single-node failover The 60 s `OfflineTimeout` is driven by heartbeats, not reports. The standby node keeps sending heartbeats even when the active is down. If the site still shows as offline during a failover window shorter than 60 s, check that `SiteCommunicationActor` is running on the standby (it is not a singleton — both nodes run it) and that heartbeats are reaching central. Temporarily increasing `OfflineTimeout` reduces false-offline transitions at the cost of slower genuine-offline detection. ### Reports from the new active are silently discarded after failover This happens when the new active's process-start sequence numbers fall below the prior active's last sequence number. `HealthReportSender` seeds `_sequenceNumber` with `TimeProvider.GetUtcNow().ToUnixTimeMilliseconds()` at construction, so this should not occur unless the new node's clock is significantly behind the old node's. Check time synchronization between site nodes. ### `$central` shows as offline `CentralHealthReportLoop` only generates reports when `IClusterNodeProvider.SelfIsPrimary` is true. If both central nodes are healthy but the `$central` entry shows offline, the primary node's loop may have stalled or the Akka cluster may be in a split-brain state. Check `CentralHealthReportLoop` logs for `"Failed to generate central health report"` errors. ### `SiteAuditWriteFailures` non-zero Non-zero `SiteAuditWriteFailures` in consecutive reports indicates the site-local SQLite audit writer is throwing persistently and rows are being routed to the `RingBufferFallback`. Check disk space and SQLite file health at the site node. See [Audit Log](./AuditLog.md) — the fallback ring is drop-oldest; sustained failure loses rows. ## Related Documentation - [Health Monitoring design specification](../requirements/Component-HealthMonitoring.md) - [Audit Log](./AuditLog.md) - [Central–Site Communication](./Communication.md) - [Site Runtime](./SiteRuntime.md) - [Data Connection Layer](./DataConnectionLayer.md) - [Store-and-Forward Engine](./StoreAndForward.md) - [Cluster Infrastructure](./ClusterInfrastructure.md) - [Commons](./Commons.md)