ScadaBridge/docs/components/HealthMonitoring.md
Joseph Doherty 9175b0c013 docs(components): accuracy fixes from deep review (batch 3)
NotificationService (Notify.Send returns string not NotificationId;
MaxConcurrentConnections unenforced; AddHttpClient), NotificationOutbox
(one Attempted row always, terminal row only on terminal status), SiteCallAudit
(direct dual-write, no Tell; KPI tiles consumed by CentralUI), HealthMonitoring
(CentralOfflineTimeout 180s = 6x ReportInterval; HealthReportSender gates on
IsActiveNode), SiteEventLogging (active-node purge seam not wired; runs on both
nodes), InboundAPI (whole System.Diagnostics namespace forbidden).
2026-06-03 16:37:15 -04:00

# Health Monitoring
The Health Monitoring component collects operational metrics from site cluster subsystems, forwards them to central on a 30-second cadence, and exposes an in-memory aggregated view that the Central UI health dashboard polls.
## Overview
Health Monitoring (#11) runs on both site and central nodes with different roles on each side. It exists as a display-only, in-memory signal layer — no historical persistence, no alerting. The philosophy is current-status-only: the dashboard answers "what is the system doing right now?" rather than "what happened over the last hour?".
The component code lives in `src/ZB.MOM.WW.ScadaBridge.HealthMonitoring/`:
- `SiteHealthCollector` / `ISiteHealthCollector` — the site-side thread-safe accumulator that other site subsystems push metrics into. Script Actors, Alarm Actors, DCL connection actors, and the Audit Log bridge all call into this singleton.
- `HealthReportSender` — a `BackgroundService` that ticks every `ReportInterval` (30 s), atomically drains the collector into a `SiteHealthReport`, stamps a monotonic sequence number, and fires it to central over `IHealthReportTransport` (Akka Tell, fire-and-forget). Only the active node sends — the standby runs the loop but skips the send.
- `CentralHealthAggregator` / `ICentralHealthAggregator` — the central-side `BackgroundService` and in-memory store. Receives `SiteHealthReport` messages, applies them atomically via compare-and-swap on `SiteHealthState` records, and runs a periodic offline-detection sweep.
- `CentralHealthReportLoop` — central-only counterpart to `HealthReportSender`: generates a synthetic `SiteHealthReport` for the central cluster itself (site ID `$central`) so central appears as a first-class card on `/monitoring/health`. Only the Primary (leader) node generates reports.
- `SiteHealthState` — immutable record holding the latest report, last-heartbeat timestamp, sequence number, and online/offline flag for one site. Handed directly to UI callers; never torn because the aggregator swaps it atomically.
DI entry points are split by role: `AddSiteHealthMonitoring` for site nodes (registers `ISiteHealthCollector` + starts `HealthReportSender`); `AddCentralHealthAggregation` for central nodes (registers `CentralHealthAggregator` + starts `CentralHealthReportLoop`); `AddHealthMonitoring` for nodes that need the collector but not the sender (shared use). All three are idempotent with respect to `HealthMonitoringOptionsValidator` registration.
## Key Concepts
### Monotonic sequence numbers
Sequence numbers are seeded at construction with the current Unix epoch in milliseconds rather than starting at zero. This ensures that, after a site failover, the newly-active node's first report always carries a higher sequence number than any report the previous active node sent — the central aggregator's sequence guard would otherwise silently discard the new active's reports as stale. The same seeding applies to `CentralHealthReportLoop` for the `$central` synthetic site.
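The seeding rule can be illustrated with a minimal sketch. `SeededSequence` and `SequenceGuard` are hypothetical names invented for this example, and the strict "greater than" comparison is an assumption about how the aggregator's guard behaves, not a copy of the real code.

```csharp
using System;
using System.Threading;

// Hypothetical helper, not the actual HealthReportSender implementation.
public sealed class SeededSequence
{
    private long _value;

    public SeededSequence(DateTimeOffset startedAt)
    {
        // Seed with Unix epoch milliseconds so a freshly started node
        // begins above anything an earlier process is likely to have sent.
        _value = startedAt.ToUnixTimeMilliseconds();
    }

    // Atomically advance and return the next sequence number.
    public long Next() => Interlocked.Increment(ref _value);
}

public static class SequenceGuard
{
    // Assumed guard: accept a report only when it is strictly newer
    // than the last sequence number already applied.
    public static bool IsFresh(long lastApplied, long incoming) => incoming > lastApplied;
}
```

A sender that increments once per 30 s report advances by 1 per interval, while a seed taken one interval later is 30,000 ms larger, so a node that starts later begins well ahead as long as the clocks roughly agree.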
### Raw error counts per interval
`ScriptErrorCount`, `AlarmEvaluationErrorCount`, `DeadLetterCount`, `SiteAuditWriteFailures`, and `AuditRedactionFailure` are raw counts accumulated since the previous report, not rates. `SiteHealthCollector.CollectReport` atomically reads and resets each counter via `Interlocked.Exchange`. If the transport `Send` throws, `HealthReportSender` restores the drained counts back into the collector via `Interlocked.Add` so they roll forward into the next report rather than being silently lost. Concurrent increments that arrive during a failed send accumulate against zero and are preserved by the restore `Add`.
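The drain-and-restore behaviour described above can be sketched with a single counter. `IntervalCounter` and `DrainRestoreDemo` are hypothetical stand-ins; only the use of `Interlocked.Exchange` and `Interlocked.Add` comes from the text.

```csharp
using System;
using System.Threading;

// Hypothetical stand-in for one per-interval counter in the collector.
public sealed class IntervalCounter
{
    private int _count;

    public void Increment() => Interlocked.Increment(ref _count);

    // Atomically read the accumulated count and reset it to zero.
    public int Drain() => Interlocked.Exchange(ref _count, 0);

    // Put a drained count back after a failed send. Add (rather than a
    // plain write) keeps any increments that arrived in the meantime.
    public void Restore(int drained) => Interlocked.Add(ref _count, drained);
}

public static class DrainRestoreDemo
{
    // Returns the count that the next report would carry.
    public static int NextReportCount(Action<int> send)
    {
        var counter = new IntervalCounter();
        counter.Increment();
        counter.Increment();
        counter.Increment();

        var drained = counter.Drain();   // 3 errors leave the counter
        counter.Increment();             // 1 error arrives while the send is in flight

        try { send(drained); }
        catch (Exception) { counter.Restore(drained); }

        return counter.Drain();
    }
}
```

With a transport that throws, the next report carries 4 (the restored 3 plus the concurrent 1); with a transport that succeeds, it carries only the 1 that arrived during the send.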
### Online/offline detection
Online status is driven by `LastHeartbeatAt`, not by `LastReportReceivedAt`. Heartbeats arrive from `SiteCommunicationActor` every ~5 s (`CommunicationOptions.TransportHeartbeatInterval`), so the 60 s `OfflineTimeout` tolerates roughly twelve missed heartbeats before declaring a site offline. A single-node failover — where the standby is alive but the active cannot produce a full report — therefore does not trigger a false offline transition.
The synthetic `$central` site has no heartbeat source; its only signal is the 30 s `CentralHealthReportLoop` self-report. It therefore gets a longer `CentralOfflineTimeout` (default 6 × `ReportInterval` = 180 s / 3 min), equivalent to ~6 missed report periods. The validator rejects any configuration where `CentralOfflineTimeout < OfflineTimeout`.
The offline-check `PeriodicTimer` runs at half the shorter of the two timeouts so whichever site class has the tighter window is swept at least twice within it.
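The timing arithmetic above can be checked with a small sketch. The two-parameter `ComputeCheckInterval` signature and the `IsOffline` helper are assumptions made for illustration (the real method takes the options object), as is the strict "greater than" comparison.

```csharp
using System;

// Hypothetical helpers that mirror the timing rules described above.
public static class OfflineTiming
{
    // Half the shorter of the two timeouts, so the tighter window
    // is swept at least twice before it elapses.
    public static TimeSpan ComputeCheckInterval(TimeSpan offlineTimeout, TimeSpan centralOfflineTimeout)
    {
        var shorter = offlineTimeout < centralOfflineTimeout ? offlineTimeout : centralOfflineTimeout;
        return TimeSpan.FromTicks(shorter.Ticks / 2);
    }

    // Assumed rule: a site is offline once the silence since its last
    // heartbeat exceeds the timeout for its site class.
    public static bool IsOffline(DateTimeOffset lastHeartbeatAt, DateTimeOffset now, TimeSpan timeout)
        => now - lastHeartbeatAt > timeout;
}
```

With the defaults (60 s and 180 s) the sweep interval works out to 30 s, and a real site that has missed eleven 5 s heartbeats (55 s of silence) is still treated as online.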
### Dead-letter monitoring
`SiteHealthCollector.IncrementDeadLetter` is called by the site's Akka `EventStream` dead-letter subscriber. Each call atomically increments `_deadLetterCount`; the count appears in the next health report as `DeadLetterCount`. Dead letters indicate messages sent to actors that no longer exist — typically stale references or timing races after instance redeploy. The health dashboard surfaces the count per report interval for quick triage; Site Event Log Viewer provides the per-message detail.
## Architecture
### Site-side collection
`SiteHealthCollector` holds all per-interval counters (`_scriptErrorCount`, `_alarmErrorCount`, `_deadLetterCount`, `_siteAuditWriteFailures`, `_auditRedactionFailures`) as `int` fields touched only through `Interlocked` operations, and snapshot state (`ConcurrentDictionary` for connection health, tag resolution, and S&F buffer depths) that is overwritten rather than incremented. This split means `CollectReport` can atomically reset the counters in one pass while taking a point-in-time copy of the dictionaries, with no locks:
```csharp
// SiteHealthCollector.CollectReport (abbreviated)
public SiteHealthReport CollectReport(string siteId)
{
    // Atomically read and reset each per-interval counter.
    var scriptErrors   = Interlocked.Exchange(ref _scriptErrorCount, 0);
    var alarmErrors    = Interlocked.Exchange(ref _alarmErrorCount, 0);
    var deadLetters    = Interlocked.Exchange(ref _deadLetterCount, 0);
    var auditFailures  = Interlocked.Exchange(ref _siteAuditWriteFailures, 0);
    var redactFailures = Interlocked.Exchange(ref _auditRedactionFailures, 0);

    return new SiteHealthReport(
        SiteId: siteId,
        SequenceNumber: 0, // caller stamps monotonic seq
        ReportTimestamp: _timeProvider.GetUtcNow(),
        ScriptErrorCount: scriptErrors,
        AlarmEvaluationErrorCount: alarmErrors,
        DeadLetterCount: deadLetters,
        SiteAuditWriteFailures: auditFailures,
        AuditRedactionFailure: redactFailures,
        /* ... connection snapshots, instance counts, S&F depths */
        SiteAuditBacklog: _siteAuditBacklog);
}
```
The `SequenceNumber` field in the returned record is always `0`; `HealthReportSender` overwrites it with the atomically-incremented monotonic counter immediately before calling `_transport.Send`.
### Site-side report send
`HealthReportSender` is an active-node-only sender: at the top of each tick it checks `_collector.IsActiveNode` and skips the remainder when `false`. The active/standby flag is set by the Deployment Manager singleton ownership check, not by this component.
The send itself is synchronous and fire-and-forget (`IHealthReportTransport.Send` wraps an Akka `Tell`). A transport exception is caught, logged, and the interval counts are restored before re-throwing — the outer `catch (Exception)` swallows the rethrow so the background service never terminates from a single failed send.
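The tick flow described in this section (active-node gate, drain, stamp, send, restore and rethrow, outer swallow) can be sketched as follows. `ReportTickSketch` is a hypothetical class with a single counter and a delegate standing in for `IHealthReportTransport.Send`; it is an illustration under those assumptions, not the real `HealthReportSender`.

```csharp
using System;
using System.Threading;

// Hypothetical sketch of one tick of the site-side sender.
public sealed class ReportTickSketch
{
    private readonly Action<long, int> _send; // (sequence, scriptErrors), stand-in transport
    private int _scriptErrors;
    private long _sequence;

    public ReportTickSketch(long sequenceSeed, Action<long, int> send)
    {
        _sequence = sequenceSeed;
        _send = send;
    }

    public bool IsActiveNode { get; set; }
    public int PendingScriptErrors => Volatile.Read(ref _scriptErrors);
    public void IncrementScriptError() => Interlocked.Increment(ref _scriptErrors);

    // Returns true when a report was handed to the transport.
    public bool Tick()
    {
        // Standby: the loop keeps running but nothing is drained or sent.
        if (!IsActiveNode) return false;

        try
        {
            var drained = Interlocked.Exchange(ref _scriptErrors, 0);
            var sequence = Interlocked.Increment(ref _sequence);
            try
            {
                _send(sequence, drained);
            }
            catch (Exception)
            {
                // Roll the drained count forward into the next report, then rethrow.
                Interlocked.Add(ref _scriptErrors, drained);
                throw;
            }
            return true;
        }
        catch (Exception)
        {
            // Outer handler: a single failed send must not end the loop.
            return false;
        }
    }
}
```

In this sketch a failed attempt still consumes a sequence number; whether the real sender does the same is not stated in this document.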
### Central aggregation
`CentralHealthAggregator` stores one `SiteHealthState` record per site in a `ConcurrentDictionary<string, SiteHealthState>`. Every write (from `ProcessReport` or `MarkHeartbeat`) uses a compare-and-swap loop:
```csharp
// CentralHealthAggregator.ProcessReport (core CAS path)
var updated = existing with
{
    LatestReport = report,
    LastReportReceivedAt = now,
    LastHeartbeatAt = now,
    LastSequenceNumber = report.SequenceNumber,
    IsOnline = true
};

if (_siteStates.TryUpdate(report.SiteId, updated, existing))
    return;
// CAS lost — retry with fresh value
```
Sequence numbers guard against stale reports from a pre-failover node overwriting the new active's fresher state. A heartbeat for an unknown site (e.g., just after a central restart) registers the site as online with a null `LatestReport` so the site is not shown as "unknown" during the failover window.
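A self-contained sketch of the sequence guard and the unknown-site heartbeat path might look like this. `SiteStateSketch` and `AggregatorSketch` are hypothetical and simplified (the report is a plain string), and both the "less than or equal is stale" rule and the initial sequence of `0` for a heartbeat-registered site are assumptions.

```csharp
#nullable enable
using System;
using System.Collections.Concurrent;

// Hypothetical, simplified stand-in for SiteHealthState.
public sealed record SiteStateSketch(
    long LastSequenceNumber, DateTimeOffset LastHeartbeatAt, bool IsOnline, string? LatestReport);

public sealed class AggregatorSketch
{
    private readonly ConcurrentDictionary<string, SiteStateSketch> _states = new();

    // Returns false when the report is discarded as stale.
    public bool ProcessReport(string siteId, long sequence, string report, DateTimeOffset now)
    {
        while (true)
        {
            if (!_states.TryGetValue(siteId, out var existing))
            {
                if (_states.TryAdd(siteId, new SiteStateSketch(sequence, now, true, report))) return true;
                continue; // another writer registered the site first; retry
            }

            // Assumed guard: anything not strictly newer is stale.
            if (sequence <= existing.LastSequenceNumber) return false;

            var updated = existing with
            {
                LastSequenceNumber = sequence, LastHeartbeatAt = now, IsOnline = true, LatestReport = report
            };
            if (_states.TryUpdate(siteId, updated, existing)) return true;
            // CAS lost: loop and re-read the fresh value.
        }
    }

    public void MarkHeartbeat(string siteId, DateTimeOffset now)
    {
        while (true)
        {
            if (!_states.TryGetValue(siteId, out var existing))
            {
                // Unknown site: register it as online with no report yet.
                if (_states.TryAdd(siteId, new SiteStateSketch(0, now, true, null))) return;
                continue;
            }
            if (_states.TryUpdate(siteId, existing with { LastHeartbeatAt = now, IsOnline = true }, existing)) return;
        }
    }

    public SiteStateSketch? Get(string siteId) => _states.TryGetValue(siteId, out var s) ? s : null;
}
```

Because the states are records, `TryUpdate` compares the expected value by value equality, which is sufficient here: any concurrent change to the stored state makes the comparison fail and forces a retry.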
The offline sweep runs on a `PeriodicTimer` at `ComputeCheckInterval(_options)` — half the shorter of `OfflineTimeout` and `CentralOfflineTimeout`. It checks `LastHeartbeatAt` (not report time) and applies a single non-retried CAS: if the CAS loses, the site was just heard from and leaving it online is correct.
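The single-attempt sweep can be sketched as below. `SweepState` and `OfflineSweep` are hypothetical names, the `$central` key is taken from the text, and the timeout comparison is an assumption.

```csharp
using System;
using System.Collections.Concurrent;

// Hypothetical, simplified state: just the fields the sweep needs.
public sealed record SweepState(DateTimeOffset LastHeartbeatAt, bool IsOnline);

public static class OfflineSweep
{
    // Marks silent sites offline and returns how many were transitioned.
    public static int Run(
        ConcurrentDictionary<string, SweepState> states,
        DateTimeOffset now,
        TimeSpan offlineTimeout,
        TimeSpan centralOfflineTimeout)
    {
        var marked = 0;
        foreach (var (siteId, state) in states)
        {
            if (!state.IsOnline) continue;

            // The synthetic central site gets the longer window.
            var timeout = siteId == "$central" ? centralOfflineTimeout : offlineTimeout;
            if (now - state.LastHeartbeatAt <= timeout) continue;

            // One CAS attempt, no retry: losing the race means a heartbeat
            // or report just arrived, so the site should stay online.
            if (states.TryUpdate(siteId, state with { IsOnline = false }, state)) marked++;
        }
        return marked;
    }
}
```

Skipping the retry is the design point: a lost compare-and-swap is itself evidence that the site is alive.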
### Audit Log metrics bridge
Audit Log registers `AddAuditLogHealthMetricsBridge` on site nodes after `AddSiteHealthMonitoring`. This replaces the default no-op failure counters with two bridges that forward directly into `ISiteHealthCollector`:
- `HealthMetricsAuditWriteFailureCounter` — called by `FallbackAuditWriter` on every primary SQLite failure; increments `SiteAuditWriteFailures`.
- `HealthMetricsAuditRedactionFailureCounter` — called by the payload redactor on every over-redaction event; increments `AuditRedactionFailure`.
A third collaborator, `SiteAuditBacklogReporter`, is a hosted service that polls `ISiteAuditQueue.GetBacklogStatsAsync` every 30 s and pushes a `SiteAuditBacklogSnapshot` into `ISiteHealthCollector.UpdateSiteAuditBacklog`. The snapshot (`PendingCount`, `OldestPendingUtc`, `OnDiskBytes`) rides as `SiteAuditBacklog` on the next health report. The poll runs in a separate service rather than inline in `CollectReport` to keep the report path free of synchronous SQLite I/O.
On central, `AuditCentralHealthSnapshot` (in the Audit Log component) is the symmetric counterpart: it accumulates `CentralAuditWriteFailures`, `AuditRedactionFailure`, and a per-site `SiteAuditTelemetryStalled` map fed by `SiteAuditTelemetryStalledTracker`. These are read by the central health dashboard alongside the aggregated site states. See [Audit Log](./AuditLog.md) for the full counter and stall-detection design.
## Usage
Site subsystems call `ISiteHealthCollector` directly — it is a DI singleton. Examples of callers by subsystem:
| Caller | Method | Metric in report |
|--------|--------|-----------------|
| Script Actor (on unhandled exception) | `IncrementScriptError()` | `ScriptErrorCount` |
| Alarm Actor (on eval failure) | `IncrementAlarmError()` | `AlarmEvaluationErrorCount` |
| Akka EventStream dead-letter subscriber | `IncrementDeadLetter()` | `DeadLetterCount` |
| DCL connection actor | `UpdateConnectionHealth(name, health)` | `DataConnectionStatuses` |
| DCL connection actor | `UpdateTagResolution(name, total, resolved)` | `TagResolutionCounts` |
| DCL connection actor | `UpdateTagQuality(name, good, bad, uncertain)` | `DataConnectionTagQuality` |
| Audit Log bridge | `IncrementSiteAuditWriteFailures()` | `SiteAuditWriteFailures` |
| Audit Log bridge | `IncrementAuditRedactionFailure()` | `AuditRedactionFailure` |
| `SiteAuditBacklogReporter` | `UpdateSiteAuditBacklog(snapshot)` | `SiteAuditBacklog` |
| `HealthReportSender` | `SetParkedMessageCount(count)` | `ParkedMessageCount` |
Central consumers resolve `ICentralHealthAggregator` and call `GetAllSiteStates()` to read a snapshot-safe dictionary of `SiteHealthState` records, or `GetSiteState(siteId)` to read the record for a single site. The health dashboard polls the aggregator on a ~10 s timer. Because `SiteHealthState` is an immutable record swapped atomically, a consumer can hold the reference without risk of a torn read.
## Configuration
Options class: `HealthMonitoringOptions`, bound from the `ScadaBridge:HealthMonitoring` section. Validated at startup by `HealthMonitoringOptionsValidator` (registered with `ValidateOnStart`) so a bad configuration fails with a clear key-naming message rather than an opaque `ArgumentOutOfRangeException` inside a `PeriodicTimer` constructor.
| Key | Default | Constraint | Description |
|-----|---------|-----------|-------------|
| `ScadaBridge:HealthMonitoring:ReportInterval` | `00:00:30` (30 s) | Must be `> 0` | Interval at which site nodes emit health reports to central. Also the `CentralHealthReportLoop` self-report cadence. |
| `ScadaBridge:HealthMonitoring:OfflineTimeout` | `00:01:00` (60 s) | Must be `> 0` | Silence window after which a real site is marked offline. Driven by `LastHeartbeatAt`, not last report time. |
| `ScadaBridge:HealthMonitoring:CentralOfflineTimeout` | `00:03:00` (3 min) | Must be `>= OfflineTimeout` | Grace window for the `$central` synthetic site, which has no heartbeat source. Defaults to 6× `ReportInterval`. |
The offline-check cadence is derived at runtime as `min(OfflineTimeout, CentralOfflineTimeout) / 2` — not directly configurable.
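The constraints in the table can be expressed as a short validation sketch. `HealthOptionsSketch` and `HealthOptionsRules` are hypothetical plain-C# stand-ins that do not use the options-validation framework, and the failure messages are invented for illustration; only the keys, defaults, and constraints come from the table.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical mirror of the three documented settings.
public sealed record HealthOptionsSketch(
    TimeSpan ReportInterval, TimeSpan OfflineTimeout, TimeSpan CentralOfflineTimeout)
{
    // Documented defaults: 30 s, 60 s, 3 min.
    public static HealthOptionsSketch Defaults { get; } =
        new(TimeSpan.FromSeconds(30), TimeSpan.FromSeconds(60), TimeSpan.FromMinutes(3));
}

public static class HealthOptionsRules
{
    private const string Section = "ScadaBridge:HealthMonitoring";

    // Returns one message per violated constraint; empty means valid.
    public static IReadOnlyList<string> Validate(HealthOptionsSketch o)
    {
        var failures = new List<string>();
        if (o.ReportInterval <= TimeSpan.Zero)
            failures.Add($"{Section}:ReportInterval must be > 0.");
        if (o.OfflineTimeout <= TimeSpan.Zero)
            failures.Add($"{Section}:OfflineTimeout must be > 0.");
        if (o.CentralOfflineTimeout < o.OfflineTimeout)
            failures.Add($"{Section}:CentralOfflineTimeout must be >= OfflineTimeout.");
        return failures;
    }
}
```

Naming the full configuration key in each message is what lets a startup failure point at the offending setting directly.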
## Dependencies & Interactions
- [Commons (#16)](./Commons.md) — defines `SiteHealthReport`, `SiteHealthReportReplica`, `NodeStatus`, `SiteAuditBacklogSnapshot`, and the `ISiteHealthCollector` / `ICentralHealthAggregator` interfaces consumed throughout. `SiteHealthReport` is an additive record; new fields use default values so existing producers remain valid.
- [CentralSite Communication (#5)](./Communication.md) — transports `SiteHealthReport` messages from site to central via Akka remoting (fire-and-forget Tell through `IHealthReportTransport`). Also delivers heartbeats from `SiteCommunicationActor` that `CentralHealthAggregator.MarkHeartbeat` uses to keep sites online between reports. `SiteHealthReportReplica` is broadcast on DistributedPubSub so both central nodes maintain identical aggregator state.
- [Site Runtime (#3)](./SiteRuntime.md) — Script Actors call `IncrementScriptError`; Alarm Actors call `IncrementAlarmError`; the Deployment Manager singleton ownership check drives `SetActiveNode`.
- [Data Connection Layer (#4)](./DataConnectionLayer.md) — connection actors call `UpdateConnectionHealth`, `UpdateTagResolution`, `UpdateConnectionEndpoint`, `UpdateTagQuality`, and `RemoveConnection` on `ISiteHealthCollector`.
- [Store-and-Forward Engine (#6)](./StoreAndForward.md) — `HealthReportSender` queries `StoreAndForwardStorage` for `GetParkedMessageCountAsync` and `GetBufferDepthByCategoryAsync`; the results populate `ParkedMessageCount` and `StoreAndForwardBufferDepths` (keyed by `StoreAndForwardCategory` name).
- [Cluster Infrastructure (#13)](./ClusterInfrastructure.md) — `IClusterNodeProvider` supplies the cluster node list to `HealthReportSender` (for the node-list payload); `HealthReportSender`'s active/standby gate is `_collector.IsActiveNode`, which is set externally by `DeploymentManagerActor.PreStart`/`PostStop`. `CentralHealthReportLoop` reads both `GetClusterNodes()` and `SelfIsPrimary` from `IClusterNodeProvider`. Heartbeat cadence (default 5 s) is owned by Cluster Infrastructure / `SiteCommunicationActor`.
- [Audit Log (#23)](./AuditLog.md) — `AddAuditLogHealthMetricsBridge` wires `HealthMetricsAuditWriteFailureCounter` and `HealthMetricsAuditRedactionFailureCounter` into the site collector, and registers `SiteAuditBacklogReporter` to poll the site-local SQLite drain backlog. On central, `AuditCentralHealthSnapshot` exposes `CentralAuditWriteFailures`, `AuditRedactionFailure`, and per-site `SiteAuditTelemetryStalled` alongside the aggregated site states on the health dashboard.
- [Central UI (#9)](./CentralUI.md) — the health dashboard resolves `ICentralHealthAggregator` and polls `GetAllSiteStates()` on a ~10 s timer. Notification Outbox and Site Call Audit KPIs are computed on demand from their own central tables by those components; Health Monitoring does not own or cache them.
- [Host (#15)](./Host.md) — implements `ISiteIdentityProvider` (supplies `SiteId` for report payloads) and `IClusterNodeProvider`, and calls the appropriate `Add*` entry points from the role-specific composition root.
## Troubleshooting
### A site flaps online/offline during single-node failover
The 60 s `OfflineTimeout` is driven by heartbeats, not reports. The standby node keeps sending heartbeats even when the active is down. If the site still shows as offline during a failover window shorter than 60 s, check that `SiteCommunicationActor` is running on the standby (it is not a singleton — both nodes run it) and that heartbeats are reaching central. Temporarily increasing `OfflineTimeout` reduces false-offline transitions at the cost of slower genuine-offline detection.
### Reports from the new active are silently discarded after failover
This happens when the new active's process-start sequence numbers fall below the prior active's last sequence number. `HealthReportSender` seeds `_sequenceNumber` with `TimeProvider.GetUtcNow().ToUnixTimeMilliseconds()` at construction, so this should not occur unless the new node's clock is significantly behind the old node's. Check time synchronization between site nodes.
### `$central` shows as offline
`CentralHealthReportLoop` only generates reports when `IClusterNodeProvider.SelfIsPrimary` is true. If both central nodes are healthy but the `$central` entry shows offline, the primary node's loop may have stalled or the Akka cluster may be in a split-brain state. Check `CentralHealthReportLoop` logs for `"Failed to generate central health report"` errors.
### `SiteAuditWriteFailures` non-zero
Non-zero `SiteAuditWriteFailures` in consecutive reports indicates the site-local SQLite audit writer is throwing persistently and rows are being routed to the `RingBufferFallback`. Check disk space and SQLite file health at the site node. See [Audit Log](./AuditLog.md) — the fallback ring is drop-oldest; sustained failure loses rows.
## Related Documentation
- [Health Monitoring design specification](../requirements/Component-HealthMonitoring.md)
- [Audit Log](./AuditLog.md)
- [CentralSite Communication](./Communication.md)
- [Site Runtime](./SiteRuntime.md)
- [Data Connection Layer](./DataConnectionLayer.md)
- [Store-and-Forward Engine](./StoreAndForward.md)
- [Cluster Infrastructure](./ClusterInfrastructure.md)
- [Commons](./Commons.md)