fix(health): replicate site health reports between central nodes

CentralHealthAggregator is a per-node hosted singleton, but site health reports flow through ClusterClient, which round-robins each report to exactly one central node. The other node's aggregator never saw those reports and marked the sites offline once the 60s threshold elapsed, so sites constantly flapped between online and offline on the monitoring page.

On receive, the active CentralCommunicationActor now republishes each report in a SiteHealthReportReplica wrapper on a DistributedPubSub topic. Both central nodes subscribe to the topic and process replicas through a dedicated path that updates the local aggregator without re-broadcasting, which avoids fan-out loops. The publishing node also receives its own replica; the aggregator's existing sequence-number idempotency makes that self-delivery a cheap no-op.

DistributedPubSubExtensionProvider is now listed in the HOCON `akka.extensions` block so the mediator is initialised at cluster start. This eliminates a race where the first Subscribe arrived before the extension was loaded.
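
For reference, the extension entry described above usually takes the following shape in Akka.NET HOCON. This is a sketch based on the Akka.Cluster.Tools documentation, not a copy of this repository's config file. The surrounding keys and any other entries in the actual list are assumptions.

```hocon
akka {
  # Load the DistributedPubSub extension eagerly at ActorSystem start,
  # so the mediator exists before any actor sends its first Subscribe.
  extensions = [
    "Akka.Cluster.Tools.PublishSubscribe.DistributedPubSubExtensionProvider, Akka.Cluster.Tools"
  ]
}
```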
2026-05-13 06:20:07 -04:00
parent d9caa3dd7e
commit 6f1f6b8467
3 changed files with 62 additions and 0 deletions
--- a/src/ScadaLink.Commons/Messages/Health/SiteHealthReport.cs
+++ b/src/ScadaLink.Commons/Messages/Health/SiteHealthReport.cs
@@ -21,3 +21,13 @@ public record SiteHealthReport(
    IReadOnlyDictionary<string, TagQualityCounts>? DataConnectionTagQuality = null,
    int ParkedMessageCount = 0,
    IReadOnlyList<NodeStatus>? ClusterNodes = null);
+
+/// <summary>
+/// Broadcast wrapper used between central nodes to keep per-node
+/// CentralHealthAggregator state in sync. ClusterClient load-balances each
+/// incoming SiteHealthReport to one central node; that node re-publishes
+/// this wrapper on a DistributedPubSub topic so the peer node's aggregator
+/// also processes the report (idempotently — sequence numbers guard against
+/// double-counting).
+/// </summary>
+public record SiteHealthReportReplica(SiteHealthReport Report);