Joseph Doherty 9175b0c013 docs(components): accuracy fixes from deep review (batch 3)
NotificationService (Notify.Send returns string not NotificationId;
MaxConcurrentConnections unenforced; AddHttpClient), NotificationOutbox
(one Attempted row always, terminal row only on terminal status), SiteCallAudit
(direct dual-write, no Tell; KPI tiles consumed by CentralUI), HealthMonitoring
(CentralOfflineTimeout 180s = 6x ReportInterval; HealthReportSender gates on
IsActiveNode), SiteEventLogging (active-node purge seam not wired; runs on both
nodes), InboundAPI (whole System.Diagnostics namespace forbidden).
2026-06-03 16:37:15 -04:00


Health Monitoring

The Health Monitoring component collects operational metrics from site cluster subsystems, forwards them to central on a 30-second cadence, and exposes an in-memory aggregated view that the Central UI health dashboard polls.

Overview

Health Monitoring (#11) runs on both site and central nodes with different roles on each side. It exists as a display-only, in-memory signal layer — no historical persistence, no alerting. The philosophy is current-status-only: the dashboard answers "what is the system doing right now?" rather than "what happened over the last hour?".

The component code lives in src/ZB.MOM.WW.ScadaBridge.HealthMonitoring/:

  • SiteHealthCollector / ISiteHealthCollector — the site-side thread-safe accumulator that other site subsystems push metrics into. Script Actors, Alarm Actors, DCL connection actors, and the Audit Log bridge all call into this singleton.
  • HealthReportSender — a BackgroundService that ticks every ReportInterval (30 s), atomically drains the collector into a SiteHealthReport, stamps a monotonic sequence number, and fires it to central over IHealthReportTransport (Akka Tell, fire-and-forget). Only the active node sends — the standby runs the loop but skips the send.
  • CentralHealthAggregator / ICentralHealthAggregator — the central-side BackgroundService and in-memory store. Receives SiteHealthReport messages, applies them atomically via compare-and-swap on SiteHealthState records, and runs a periodic offline-detection sweep.
  • CentralHealthReportLoop — central-only counterpart to HealthReportSender: generates a synthetic SiteHealthReport for the central cluster itself (site ID $central) so central appears as a first-class card on /monitoring/health. Only the Primary (leader) node generates reports.
  • SiteHealthState — immutable record holding the latest report, last-heartbeat timestamp, sequence number, and online/offline flag for one site. Handed directly to UI callers; never torn because the aggregator swaps it atomically.

DI entry points are split by role: AddSiteHealthMonitoring for site nodes (registers ISiteHealthCollector + starts HealthReportSender); AddCentralHealthAggregation for central nodes (registers CentralHealthAggregator + starts CentralHealthReportLoop); AddHealthMonitoring for nodes that need the collector but not the sender (shared use). All three are idempotent with respect to HealthMonitoringOptionsValidator registration.

Key Concepts

Monotonic sequence numbers

Sequence numbers are seeded at construction with the current Unix epoch in milliseconds rather than starting at zero. This ensures that, after a site failover, the newly-active node's first report always carries a higher sequence number than any report the previous active node sent — the central aggregator's sequence guard would otherwise silently discard the new active's reports as stale. The same seeding applies to CentralHealthReportLoop for the $central synthetic site.
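The seeding rule can be sketched as follows. This is a minimal illustration, not the component's code: the `MonotonicSequenceSketch` type and its constructor parameter are hypothetical, and the real `HealthReportSender` is described as reading the clock through `TimeProvider` at construction.

```csharp
using System;
using System.Threading;

// Hypothetical helper for illustration; the real sender keeps this state inline.
public sealed class MonotonicSequenceSketch
{
    private long _value;

    // startedAt stands in for TimeProvider.GetUtcNow() at construction time.
    public MonotonicSequenceSketch(DateTimeOffset startedAt)
    {
        // Seed with Unix epoch milliseconds so a node that starts later
        // begins above anything an earlier process could have issued.
        _value = startedAt.ToUnixTimeMilliseconds();
    }

    // Atomically advance and return the next sequence number.
    public long Next() => Interlocked.Increment(ref _value);
}
```

The seed grows by 1,000 per second of wall-clock time, while the counter advances by one per 30-second report, so a node that starts later with a reasonably synchronized clock always begins above the previous node's last value.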

Raw error counts per interval

ScriptErrorCount, AlarmEvaluationErrorCount, DeadLetterCount, SiteAuditWriteFailures, and AuditRedactionFailure are raw counts accumulated since the previous report, not rates. SiteHealthCollector.CollectReport atomically reads and resets each counter via Interlocked.Exchange. If the transport Send throws, HealthReportSender restores the drained counts back into the collector via Interlocked.Add so they roll forward into the next report rather than being silently lost. Concurrent increments that arrive during a failed send accumulate against zero and are preserved by the restore Add.
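The drain-and-restore behaviour for a single counter can be sketched as below. The `IntervalCounterSketch` type and its method names are hypothetical; only the use of `Interlocked.Exchange` for the drain and `Interlocked.Add` for the restore comes from the description above.

```csharp
using System.Threading;

// Hypothetical single-counter model of the collector's per-interval fields.
public sealed class IntervalCounterSketch
{
    private int _count;

    // Called by producers, for example on each script error.
    public void Increment() => Interlocked.Increment(ref _count);

    // Atomically read the accumulated count and reset it to zero.
    public int Drain() => Interlocked.Exchange(ref _count, 0);

    // Add a previously drained count back after a failed send. Because this
    // adds rather than overwrites, increments that arrived since the drain
    // are kept.
    public void Restore(int drained) => Interlocked.Add(ref _count, drained);
}
```

If three errors are drained, one more arrives while the send is failing, and the three are then restored, the next drain returns four.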

Online/offline detection

Online status is driven by LastHeartbeatAt, not by LastReportReceivedAt. Heartbeats arrive from SiteCommunicationActor every ~5 s (CommunicationOptions.TransportHeartbeatInterval), so the 60 s OfflineTimeout tolerates roughly twelve missed heartbeats before declaring a site offline. A single-node failover — where the standby is alive but the active cannot produce a full report — therefore does not trigger a false offline transition.

The synthetic $central site has no heartbeat source; its only signal is the 30 s CentralHealthReportLoop self-report. It therefore gets a longer CentralOfflineTimeout (default 6 × ReportInterval = 180 s / 3 min), equivalent to ~6 missed report periods. The validator rejects any configuration where CentralOfflineTimeout < OfflineTimeout.

The offline-check PeriodicTimer runs at half the shorter of the two timeouts so whichever site class has the tighter window is swept at least twice within it.
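As a worked example of that rule, here is a minimal sketch. The component's `ComputeCheckInterval` is described as taking the options object; this hypothetical variant takes the two timeouts directly so it stays self-contained.

```csharp
using System;

public static class OfflineCheckSketch
{
    // Half the shorter of the two timeouts.
    public static TimeSpan ComputeCheckInterval(
        TimeSpan offlineTimeout, TimeSpan centralOfflineTimeout)
    {
        var shorter = offlineTimeout < centralOfflineTimeout
            ? offlineTimeout
            : centralOfflineTimeout;
        return shorter / 2;
    }
}
```

With the defaults (60 s and 180 s) this yields a 30 s sweep, so a real site is checked at least twice within its 60 s window.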

Dead-letter monitoring

SiteHealthCollector.IncrementDeadLetter is called by the site's Akka EventStream dead-letter subscriber. Each call atomically increments _deadLetterCount; the count appears in the next health report as DeadLetterCount. Dead letters indicate messages sent to actors that no longer exist — typically stale references or timing races after instance redeploy. The health dashboard surfaces the count per report interval for quick triage; Site Event Log Viewer provides the per-message detail.

Architecture

Site-side collection

SiteHealthCollector holds all per-interval counters (_scriptErrorCount, _alarmErrorCount, _deadLetterCount, _siteAuditWriteFailures, _auditRedactionFailures) as int fields touched only through Interlocked operations, and snapshot state (ConcurrentDictionary for connection health, tag resolution, and S&F buffer depths) that is overwritten rather than incremented. This split means CollectReport can atomically reset the counters in one pass while taking a point-in-time copy of the dictionaries, with no locks:

// SiteHealthCollector.CollectReport (abbreviated)
public SiteHealthReport CollectReport(string siteId)
{
    var scriptErrors   = Interlocked.Exchange(ref _scriptErrorCount, 0);
    var alarmErrors    = Interlocked.Exchange(ref _alarmErrorCount, 0);
    var deadLetters    = Interlocked.Exchange(ref _deadLetterCount, 0);
    var auditFailures  = Interlocked.Exchange(ref _siteAuditWriteFailures, 0);
    var redactFailures = Interlocked.Exchange(ref _auditRedactionFailures, 0);

    return new SiteHealthReport(
        SiteId: siteId,
        SequenceNumber: 0,            // caller stamps monotonic seq
        ReportTimestamp: _timeProvider.GetUtcNow(),
        ScriptErrorCount: scriptErrors,
        AlarmEvaluationErrorCount: alarmErrors,
        DeadLetterCount: deadLetters,
        SiteAuditWriteFailures: auditFailures,
        AuditRedactionFailure: redactFailures,
        /* ... connection snapshots, instance counts, S&F depths */
        SiteAuditBacklog: _siteAuditBacklog);
}

The SequenceNumber field in the returned record is always 0; HealthReportSender overwrites it with the atomically-incremented monotonic counter immediately before calling _transport.Send.

Site-side report send

HealthReportSender is an active-node-only sender: at the top of each tick it checks _collector.IsActiveNode and skips the remainder when false. The active/standby flag is set by the Deployment Manager singleton ownership check, not by this component.

The send itself is synchronous and fire-and-forget (IHealthReportTransport.Send wraps an Akka Tell). A transport exception is caught, logged, and the interval counts are restored before re-throwing — the outer catch (Exception) swallows the rethrow so the background service never terminates from a single failed send.
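The control flow of one tick, as described above, might look like the following sketch. The delegates stand in for the collector, transport, and logger; `SendTickSketch.RunOnce` and its parameters are hypothetical names chosen for illustration.

```csharp
using System;

public static class SendTickSketch
{
    // Returns true when the report was handed to the transport.
    public static bool RunOnce(
        bool isActiveNode,
        Func<int> drain,        // stands in for CollectReport
        Action<int> restore,    // stands in for the Interlocked.Add restore
        Action<int> send,       // stands in for IHealthReportTransport.Send
        Action<Exception> log)
    {
        // Standby nodes run the loop but skip the send.
        if (!isActiveNode) return false;

        try
        {
            var counts = drain();
            try
            {
                send(counts);
                return true;
            }
            catch
            {
                // Roll the drained counts forward into the next report.
                restore(counts);
                throw;
            }
        }
        catch (Exception ex)
        {
            // Swallow so a single failed send never ends the background loop.
            log(ex);
            return false;
        }
    }
}
```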

Central aggregation

CentralHealthAggregator stores one SiteHealthState record per site in a ConcurrentDictionary<string, SiteHealthState>. Every write (from ProcessReport or MarkHeartbeat) uses a compare-and-swap loop:

// CentralHealthAggregator.ProcessReport (core CAS path)
var updated = existing with
{
    LatestReport = report,
    LastReportReceivedAt = now,
    LastHeartbeatAt = now,
    LastSequenceNumber = report.SequenceNumber,
    IsOnline = true
};

if (_siteStates.TryUpdate(report.SiteId, updated, existing))
    return;

// CAS lost — retry with fresh value

Sequence numbers guard against stale reports from a pre-failover node overwriting the new active's fresher state. A heartbeat for an unknown site (e.g., just after a central restart) registers the site as online with a null LatestReport so the site is not shown as "unknown" during the failover window.
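Putting the sequence guard and the retry loop together, a self-contained sketch could look like this. The record and class names are simplified stand-ins for `SiteHealthReport`, `SiteHealthState`, and `CentralHealthAggregator`, and treating an equal sequence number as stale is an assumption, not something the source states.

```csharp
using System;
using System.Collections.Concurrent;

// Simplified stand-ins for SiteHealthReport and SiteHealthState.
public sealed record ReportSketch(string SiteId, long SequenceNumber);

public sealed record SiteStateSketch(
    ReportSketch? LatestReport,
    DateTimeOffset LastHeartbeatAt,
    long LastSequenceNumber,
    bool IsOnline);

public sealed class AggregatorSketch
{
    private readonly ConcurrentDictionary<string, SiteStateSketch> _siteStates = new();

    public SiteStateSketch? GetSiteState(string siteId) =>
        _siteStates.TryGetValue(siteId, out var state) ? state : null;

    // Returns false when the report is discarded as stale.
    public bool ProcessReport(ReportSketch report, DateTimeOffset now)
    {
        while (true)
        {
            if (!_siteStates.TryGetValue(report.SiteId, out var existing))
            {
                var created = new SiteStateSketch(report, now, report.SequenceNumber, true);
                if (_siteStates.TryAdd(report.SiteId, created)) return true;
                continue; // another writer registered the site first; retry
            }

            // Sequence guard (assumption: equal counts as stale).
            if (report.SequenceNumber <= existing.LastSequenceNumber) return false;

            var updated = existing with
            {
                LatestReport = report,
                LastHeartbeatAt = now,
                LastSequenceNumber = report.SequenceNumber,
                IsOnline = true
            };

            // Compare-and-swap: succeeds only if no other writer got in between.
            if (_siteStates.TryUpdate(report.SiteId, updated, existing)) return true;
            // CAS lost; loop and re-read the fresh value.
        }
    }
}
```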

The offline sweep runs on a PeriodicTimer at ComputeCheckInterval(_options) — half the shorter of OfflineTimeout and CentralOfflineTimeout. It checks LastHeartbeatAt (not report time) and applies a single non-retried CAS: if the CAS loses, the site was just heard from and leaving it online is correct.
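A sketch of one sweep pass, under stated assumptions: the state record is a trimmed-down stand-in for `SiteHealthState`, the `OfflineSweepSketch` type is hypothetical, and marking a site offline once the silence reaches the timeout (rather than strictly exceeds it) is a guess at the boundary condition.

```csharp
using System;
using System.Collections.Concurrent;

// Trimmed-down stand-in for SiteHealthState.
public sealed record SweepStateSketch(DateTimeOffset LastHeartbeatAt, bool IsOnline);

public static class OfflineSweepSketch
{
    public const string CentralSiteId = "$central";

    public static void Run(
        ConcurrentDictionary<string, SweepStateSketch> states,
        DateTimeOffset now,
        TimeSpan offlineTimeout,
        TimeSpan centralOfflineTimeout)
    {
        foreach (var (siteId, state) in states)
        {
            // The synthetic central site gets the longer grace window.
            var timeout = siteId == CentralSiteId ? centralOfflineTimeout : offlineTimeout;

            if (!state.IsOnline || now - state.LastHeartbeatAt < timeout) continue;

            // Single, non-retried CAS. If it loses, a report or heartbeat just
            // updated the entry, so leaving the site online is correct.
            states.TryUpdate(siteId, state with { IsOnline = false }, state);
        }
    }
}
```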

Audit Log metrics bridge

Audit Log registers AddAuditLogHealthMetricsBridge on site nodes after AddSiteHealthMonitoring. This replaces the default no-op failure counters with two bridges that forward directly into ISiteHealthCollector:

  • HealthMetricsAuditWriteFailureCounter — called by FallbackAuditWriter on every primary SQLite failure; increments SiteAuditWriteFailures.
  • HealthMetricsAuditRedactionFailureCounter — called by the payload redactor on every over-redaction event; increments AuditRedactionFailure.

A third collaborator, SiteAuditBacklogReporter, is a hosted service that polls ISiteAuditQueue.GetBacklogStatsAsync every 30 s and pushes a SiteAuditBacklogSnapshot into ISiteHealthCollector.UpdateSiteAuditBacklog. The snapshot (PendingCount, OldestPendingUtc, OnDiskBytes) rides as SiteAuditBacklog on the next health report. The poll runs in a separate service rather than inline in CollectReport to keep the report path free of synchronous SQLite I/O.

On central, AuditCentralHealthSnapshot (in the Audit Log component) is the symmetric counterpart: it accumulates CentralAuditWriteFailures, AuditRedactionFailure, and a per-site SiteAuditTelemetryStalled map fed by SiteAuditTelemetryStalledTracker. These are read by the central health dashboard alongside the aggregated site states. See Audit Log for the full counter and stall-detection design.

Usage

Site subsystems call ISiteHealthCollector directly — it is a DI singleton. Examples of callers by subsystem:

| Caller | Method | Metric in report |
| --- | --- | --- |
| Script Actor (on unhandled exception) | IncrementScriptError() | ScriptErrorCount |
| Alarm Actor (on eval failure) | IncrementAlarmError() | AlarmEvaluationErrorCount |
| Akka EventStream dead-letter subscriber | IncrementDeadLetter() | DeadLetterCount |
| DCL connection actor | UpdateConnectionHealth(name, health) | DataConnectionStatuses |
| DCL connection actor | UpdateTagResolution(name, total, resolved) | TagResolutionCounts |
| DCL connection actor | UpdateTagQuality(name, good, bad, uncertain) | DataConnectionTagQuality |
| Audit Log bridge | IncrementSiteAuditWriteFailures() | SiteAuditWriteFailures |
| Audit Log bridge | IncrementAuditRedactionFailure() | AuditRedactionFailure |
| SiteAuditBacklogReporter | UpdateSiteAuditBacklog(snapshot) | SiteAuditBacklog |
| HealthReportSender | SetParkedMessageCount(count) | ParkedMessageCount |

Central consumers resolve ICentralHealthAggregator and call GetAllSiteStates() to read a snapshot-safe dictionary of SiteHealthState records, or GetSiteState(siteId) to read the record for a single site. The health dashboard polls the aggregator on a ~10 s timer. Because SiteHealthState is an immutable record swapped atomically, a consumer can hold the reference without risk of a torn read.

Configuration

Options class: HealthMonitoringOptions, bound from the ScadaBridge:HealthMonitoring section. Validated at startup by HealthMonitoringOptionsValidator (registered with ValidateOnStart) so a bad configuration fails with a clear key-naming message rather than an opaque ArgumentOutOfRangeException inside a PeriodicTimer constructor.

| Key | Default | Constraint | Description |
| --- | --- | --- | --- |
| ScadaBridge:HealthMonitoring:ReportInterval | 00:00:30 (30 s) | Must be > 0 | Interval at which site nodes emit health reports to central. Also the CentralHealthReportLoop self-report cadence. |
| ScadaBridge:HealthMonitoring:OfflineTimeout | 00:01:00 (60 s) | Must be > 0 | Silence window after which a real site is marked offline. Driven by LastHeartbeatAt, not last report time. |
| ScadaBridge:HealthMonitoring:CentralOfflineTimeout | 00:03:00 (3 min) | Must be >= OfflineTimeout | Grace window for the $central synthetic site, which has no heartbeat source. Defaults to 6× ReportInterval. |

The offline-check cadence is derived at runtime as min(OfflineTimeout, CentralOfflineTimeout) / 2 — not directly configurable.
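The constraints listed above can be sketched as a plain validation function. This is an illustration only: the real HealthMonitoringOptionsValidator plugs into the options framework, whereas the types, method, and message wording below are hypothetical and use nothing beyond the base class library.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical mirror of HealthMonitoringOptions with the documented defaults.
public sealed class HealthMonitoringOptionsSketch
{
    public TimeSpan ReportInterval { get; set; } = TimeSpan.FromSeconds(30);
    public TimeSpan OfflineTimeout { get; set; } = TimeSpan.FromSeconds(60);
    public TimeSpan CentralOfflineTimeout { get; set; } = TimeSpan.FromSeconds(180);
}

public static class HealthMonitoringRulesSketch
{
    private const string Section = "ScadaBridge:HealthMonitoring";

    // Returns one message per violated constraint, each naming the config key.
    public static IReadOnlyList<string> Validate(HealthMonitoringOptionsSketch options)
    {
        var errors = new List<string>();

        if (options.ReportInterval <= TimeSpan.Zero)
            errors.Add($"{Section}:ReportInterval must be greater than zero.");

        if (options.OfflineTimeout <= TimeSpan.Zero)
            errors.Add($"{Section}:OfflineTimeout must be greater than zero.");

        if (options.CentralOfflineTimeout < options.OfflineTimeout)
            errors.Add($"{Section}:CentralOfflineTimeout must be >= OfflineTimeout.");

        return errors;
    }
}
```

Failing at startup with messages like these points an operator at the offending key instead of at a timer constructor deep in the host.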

Dependencies & Interactions

  • Commons (#16) — defines SiteHealthReport, SiteHealthReportReplica, NodeStatus, SiteAuditBacklogSnapshot, and the ISiteHealthCollector / ICentralHealthAggregator interfaces consumed throughout. SiteHealthReport is an additive record; new fields use default values so existing producers remain valid.
  • Central-Site Communication (#5) — transports SiteHealthReport messages from site to central via Akka remoting (fire-and-forget Tell through IHealthReportTransport). Also delivers heartbeats from SiteCommunicationActor that CentralHealthAggregator.MarkHeartbeat uses to keep sites online between reports. SiteHealthReportReplica is broadcast on DistributedPubSub so both central nodes maintain identical aggregator state.
  • Site Runtime (#3) — Script Actors call IncrementScriptError; Alarm Actors call IncrementAlarmError; the Deployment Manager singleton ownership check drives SetActiveNode.
  • Data Connection Layer (#4) — connection actors call UpdateConnectionHealth, UpdateTagResolution, UpdateConnectionEndpoint, UpdateTagQuality, and RemoveConnection on ISiteHealthCollector.
  • Store-and-Forward Engine (#6) — HealthReportSender queries StoreAndForwardStorage for GetParkedMessageCountAsync and GetBufferDepthByCategoryAsync; the results populate ParkedMessageCount and StoreAndForwardBufferDepths (keyed by StoreAndForwardCategory name).
  • Cluster Infrastructure (#13) — IClusterNodeProvider supplies the cluster node list to HealthReportSender (for the node-list payload); HealthReportSender's active/standby gate is _collector.IsActiveNode, which is set externally by DeploymentManagerActor.PreStart/PostStop. CentralHealthReportLoop reads both GetClusterNodes() and SelfIsPrimary from IClusterNodeProvider. Heartbeat cadence (default 5 s) is owned by Cluster Infrastructure / SiteCommunicationActor.
  • Audit Log (#23) — AddAuditLogHealthMetricsBridge wires HealthMetricsAuditWriteFailureCounter and HealthMetricsAuditRedactionFailureCounter into the site collector, and registers SiteAuditBacklogReporter to poll the site-local SQLite drain backlog. On central, AuditCentralHealthSnapshot exposes CentralAuditWriteFailures, AuditRedactionFailure, and per-site SiteAuditTelemetryStalled alongside the aggregated site states on the health dashboard.
  • Central UI (#9) — the health dashboard resolves ICentralHealthAggregator and polls GetAllSiteStates() on a ~10 s timer. Notification Outbox and Site Call Audit KPIs are computed on demand from their own central tables by those components; Health Monitoring does not own or cache them.
  • Host (#15) — implements ISiteIdentityProvider (supplies SiteId for report payloads) and IClusterNodeProvider, and calls the appropriate Add* entry points from the role-specific composition root.

Troubleshooting

A site flaps online/offline during single-node failover

The 60 s OfflineTimeout is driven by heartbeats, not reports. The standby node keeps sending heartbeats even when the active is down. If the site still shows as offline during a failover window shorter than 60 s, check that SiteCommunicationActor is running on the standby (it is not a singleton — both nodes run it) and that heartbeats are reaching central. Temporarily increasing OfflineTimeout reduces false-offline transitions at the cost of slower genuine-offline detection.

Reports from the new active are silently discarded after failover

This happens when the new active's process-start sequence numbers fall below the prior active's last sequence number. HealthReportSender seeds _sequenceNumber with TimeProvider.GetUtcNow().ToUnixTimeMilliseconds() at construction, so this should not occur unless the new node's clock is significantly behind the old node's. Check time synchronization between site nodes.

$central shows as offline

CentralHealthReportLoop only generates reports when IClusterNodeProvider.SelfIsPrimary is true. If both central nodes are healthy but the $central entry shows offline, the primary node's loop may have stalled or the Akka cluster may be in a split-brain state. Check CentralHealthReportLoop logs for "Failed to generate central health report" errors.

SiteAuditWriteFailures non-zero

Non-zero SiteAuditWriteFailures in consecutive reports indicates the site-local SQLite audit writer is throwing persistently and rows are being routed to the RingBufferFallback. Check disk space and SQLite file health at the site node. See Audit Log — the fallback ring is drop-oldest; sustained failure loses rows.