ScadaBridge/docs/components/ClusterInfrastructure.md
Joseph Doherty c5fb02d640 docs(components): accuracy fixes from deep review (batch 1)
Commons (third-party dep, 7 namespaces, retired ApiKey, repo SaveChanges
carve-out), ConfigurationDatabase (5 persisted + 1 non-persisted computed col),
ClusterInfrastructure (abbreviated HOCON note, RemotingPort default),
Host (component matrix: CI/HealthMonitoring/ExternalSystemGateway have no
actors; DeadLetterMonitorActor runs on both roles), Security (Bearer not
X-API-Key; ApiKeyAdmin registered by Host), Communication (Task.Run/Sender).
2026-06-03 16:32:01 -04:00


Cluster Infrastructure

The Cluster Infrastructure component manages Akka.NET cluster formation, active/standby failover, split-brain resolution, and the singleton hosting that all other ScadaBridge components depend on. Every site and central cluster is a two-node active/standby pair governed by the same configuration contract and bootstrap logic.

Overview

Cluster Infrastructure (#13) is a design responsibility spanning two projects rather than a single buildable project:

  • src/ZB.MOM.WW.ScadaBridge.ClusterInfrastructure/ owns the cluster configuration contract: ClusterOptions (seed nodes, failure-detection timings, split-brain settings), ClusterOptionsValidator, and the AddClusterInfrastructure DI extension that registers the validator. It does not start an actor system.
  • src/ZB.MOM.WW.ScadaBridge.Host/ owns the cluster bootstrap and runtime wiring: AkkaHostedService builds the Akka HOCON from ClusterOptions and NodeOptions, starts the ActorSystem, wires CoordinatedShutdown, and creates all role-specific actors including the cluster singletons.

This split is deliberate. The Host is the single deployable binary and the only project that performs Akka.NET bootstrap, so all cluster bring-up lives there. ClusterInfrastructure is the portable configuration contract that the Host consumes — it can be referenced by tests and other components without pulling in the Host.

Both central and site clusters run this same topology: two nodes, one active (cluster leader), one standby, with automatic failover and no manual intervention required for dual-node recovery.

Key Concepts

Active/standby via cluster leadership

Akka.NET cluster leadership determines which node is "active". ActiveNodeGate (in the Host) exposes IsActiveNode by checking whether cluster.SelfMember.Status == MemberStatus.Up and cluster.State.Leader == cluster.SelfAddress. Note that Akka.NET derives the leader from its ordering of cluster members, whereas the keep-oldest split-brain resolver and cluster singletons follow member age; once a failover leaves a single surviving node, both notions identify that node. Cluster singletons, which run on the oldest Up member, automatically migrate to the surviving node on failover.

Configuration contract vs. bootstrap split

ClusterOptions holds the cluster-wide formation and failure-detection settings. Node-identity settings — remoting hostname/port, role (Central or Site), site identifier, gRPC port — live in NodeOptions (ScadaBridge:Node section), owned by the Host. This split prevents the configuration contract from acquiring a hard dependency on Host-specific concerns.

Singleton hosting

Cluster Infrastructure provides the hosting platform; each singleton is owned and created by the component responsible for it. The Host's RegisterCentralActors and RegisterSiteActorsAsync methods wire every singleton via ClusterSingletonManager and a companion ClusterSingletonProxy so other actors can address it through a stable path regardless of which node currently hosts it.

Architecture

HOCON assembly

AkkaHostedService.BuildHocon constructs the Akka HOCON document from the bound options at startup. All interpolated values pass through QuoteHocon (string escaping) and DurationHocon (millisecond rendering) so the document is never corrupted by hostnames or timing values containing special characters or sub-second precision.

The snippet below is abbreviated to highlight the cluster stanzas. The full method also emits three additional stanzas: akka.extensions (registers DistributedPubSubExtensionProvider), akka.remote.dot-netty.tcp (binds NodeOptions.NodeHostname and NodeOptions.RemotingPort), and akka.remote.transport-failure-detector (heartbeat interval and acceptable-heartbeat-pause from CommunicationOptions.TransportHeartbeatInterval / TransportFailureThreshold).

// Abbreviated — see AkkaHostedService.BuildHocon for the full method.
public static string BuildHocon(
    NodeOptions nodeOptions,
    ClusterOptions clusterOptions,
    IEnumerable<string> roles,
    TimeSpan transportHeartbeat,
    TimeSpan transportFailure)
{
    var seedNodesStr = string.Join(",",
        clusterOptions.SeedNodes.Select(QuoteHocon));
    var rolesStr = string.Join(",", roles.Select(QuoteHocon));

    return $@"
audit-telemetry-dispatcher {{
    type = ForkJoinDispatcher
    throughput = 100
    dedicated-thread-pool {{
        thread-count = 2
    }}
}}
akka {{
    // akka.extensions, akka.remote.dot-netty.tcp, and
    // akka.remote.transport-failure-detector also emitted here (see full method).
    actor {{
        provider = cluster
    }}
    cluster {{
        seed-nodes = [{seedNodesStr}]
        roles = [{rolesStr}]
        min-nr-of-members = {clusterOptions.MinNrOfMembers}
        split-brain-resolver {{
            active-strategy = {QuoteHocon(clusterOptions.SplitBrainResolverStrategy)}
            stable-after = {DurationHocon(clusterOptions.StableAfter)}
            keep-oldest {{
                down-if-alone = {(clusterOptions.DownIfAlone ? "on" : "off")}
            }}
        }}
        failure-detector {{
            heartbeat-interval = {DurationHocon(clusterOptions.HeartbeatInterval)}
            acceptable-heartbeat-pause = {DurationHocon(clusterOptions.FailureDetectionThreshold)}
        }}
        run-coordinated-shutdown-when-down = on
    }}
    coordinated-shutdown {{
        run-by-clr-shutdown-hook = on
    }}
}}";
}

The HOCON also defines the audit-telemetry-dispatcher (a two-thread ForkJoinDispatcher) so SiteAuditTelemetryActor's SQLite reads and gRPC pushes never contend with the default dispatcher used by hot-path actors.
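
The bodies of QuoteHocon and DurationHocon are not reproduced in this document. As a minimal sketch of what such helpers might look like, assuming HOCON quoted strings use JSON-style backslash escaping and that durations are rendered as whole milliseconds with an explicit unit, the following uses the hypothetical names HoconFormatSketch.Quote and HoconFormatSketch.Duration. The real helpers in AkkaHostedService may differ.

```csharp
using System;
using System.Globalization;

// Hypothetical stand-ins for QuoteHocon / DurationHocon (illustration only).
public static class HoconFormatSketch
{
    // Wraps the value in double quotes, escaping backslashes first and then
    // embedded quotes, so the result is a single HOCON quoted string.
    public static string Quote(string value) =>
        "\"" + value.Replace("\\", "\\\\").Replace("\"", "\\\"") + "\"";

    // Renders a TimeSpan as whole milliseconds with an explicit unit,
    // using the invariant culture so no digit grouping or locale leaks in.
    public static string Duration(TimeSpan value) =>
        ((long)value.TotalMilliseconds).ToString(CultureInfo.InvariantCulture) + "ms";
}
```

With these definitions, HoconFormatSketch.Duration(TimeSpan.FromSeconds(1.5)) yields 1500ms, which keeps sub-second precision that a whole-seconds rendering would lose.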

Split-brain resolution

The keep-oldest strategy is the only strategy ClusterOptionsValidator permits for ScadaBridge's two-node clusters. Quorum strategies (keep-majority, static-quorum) cannot distinguish a crash from a partition with two nodes — both sides would be below quorum and both would shut down. Keep-oldest with down-if-alone = on ensures at most one node runs the cluster at any time:

  • On a network partition, the older node stays active; the younger node downs itself.
  • If the oldest node finds itself alone (no reachable members), it downs itself rather than running in isolation. Without down-if-alone, the oldest node could run as a single-node cluster while the younger node forms its own — producing two live clusters with divergent singleton state.

Failure detection and failover timeline

Detection uses two independent Akka heartbeat channels:

  • Cluster failure detector (akka.cluster.failure-detector): monitors membership, triggers Unreachable events that the split-brain resolver acts on.
  • Transport failure detector (akka.remote.transport-failure-detector): monitors the underlying TCP transport between nodes; configured separately from CommunicationOptions.TransportHeartbeatInterval / TransportFailureThreshold.

With the defaults in ClusterOptions, the total failover budget is approximately 25 seconds:

| Phase | Duration | Source |
| --- | --- | --- |
| Failure detection (acceptable-heartbeat-pause) | 10 s | ClusterOptions.FailureDetectionThreshold |
| Split-brain stable-after | 15 s | ClusterOptions.StableAfter |
| Singleton restart | < 1 s | Actor PreStart |
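
The budget is the sum of the three phases, assuming they run strictly one after another. The sketch below makes that arithmetic explicit; FailoverBudgetSketch is a hypothetical helper, and the singleton restart is passed in as an upper bound because the table only states "< 1 s".

```csharp
using System;

public static class FailoverBudgetSketch
{
    // Upper bound on the time from node loss to the singleton running on the
    // survivor: failure detection, then the resolver's stable-after wait,
    // then the singleton's own start-up.
    public static TimeSpan Estimate(
        TimeSpan failureDetectionThreshold,
        TimeSpan stableAfter,
        TimeSpan singletonRestartUpperBound) =>
        failureDetectionThreshold + stableAfter + singletonRestartUpperBound;
}
```

With the defaults (10 s and 15 s) and a 1 s restart bound, Estimate returns 26 seconds, consistent with the "approximately 25 seconds" figure. Raising either ClusterOptions value extends the budget by the same amount.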

Graceful shutdown and singleton handover

When a node is stopped cleanly, CoordinatedShutdown runs before the CLR exits (run-by-clr-shutdown-hook = on). The cluster-leave phase signals Akka to migrate singletons before the actor system terminates, so handover happens in seconds rather than waiting for the full failure-detection timeout. SiteCallAuditActor has an explicit graceful-stop task registered on PhaseClusterLeave with a 10-second timeout to drain any in-flight EF Core upsert before handover opens:

siteCallAuditShutdown.AddTask(
    Akka.Actor.CoordinatedShutdown.PhaseClusterLeave,
    "drain-site-call-audit-singleton",
    async () =>
    {
        try
        {
            await siteCallAuditSingletonManager.GracefulStop(TimeSpan.FromSeconds(10));
        }
        catch (Exception ex)
        {
            _logger.LogWarning(ex,
                "SiteCallAudit singleton did not drain within the graceful-stop "
                + "timeout; falling through to PoisonPill handover");
        }
        return Akka.Done.Instance;
    });

Cluster roles and singleton scoping

Each node carries one or more cluster roles set in the HOCON roles list. Site nodes carry both a base "Site" role and a site-specific role ("site-{SiteId}", e.g. "site-site-a"). Singletons on site clusters are scoped to the site-specific role so each site's singleton runs on exactly one node of that site's cluster, not on any other site's nodes. Central singletons use no role scope — all central nodes share the "Central" role.
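
The role list described above can be sketched as a small pure function. ClusterRolesSketch.BuildRoles is a hypothetical name, and the Host's actual derivation is not shown in this document; only the resulting role names ("Central", "Site", "site-{SiteId}") come from the text.

```csharp
#nullable enable
using System;
using System.Collections.Generic;

public static class ClusterRolesSketch
{
    // Maps the node's configured Role and SiteId to its cluster roles.
    public static IReadOnlyList<string> BuildRoles(string role, string? siteId)
    {
        if (role == "Central")
            return new[] { "Central" };

        if (role == "Site")
        {
            if (string.IsNullOrWhiteSpace(siteId))
                throw new ArgumentException("SiteId is required for Site nodes.", nameof(siteId));
            // Base role plus the site-specific role used to scope singletons.
            return new[] { "Site", $"site-{siteId}" };
        }

        throw new ArgumentException($"Unknown role '{role}'.", nameof(role));
    }
}
```

For a site node with SiteId "site-a" this yields "Site" and "site-site-a", matching the example above.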

Dual-node recovery

Because both nodes are configured as seed nodes, whichever node starts first after a simultaneous failure forms a new cluster; the second joins when it comes up. No startup ordering dependency exists, and no manual intervention is required. The keep-oldest resolver handles the "both starting fresh" case naturally — there is no pre-existing cluster to conflict with.

Cluster singletons hosted

The Host wires the following singletons. Cluster Infrastructure provides the ClusterSingletonManager / ClusterSingletonProxy pattern; each singleton's behaviour is documented in the owning component.

Central singletons (active central node, no role scope):

| Singleton name | Actor class | Owner |
| --- | --- | --- |
| notification-outbox | NotificationOutboxActor | Notification Outbox (#21) |
| audit-log-ingest | AuditLogIngestActor | Audit Log (#23) |
| site-call-audit | SiteCallAuditActor | Site Call Audit (#22) |

Site singletons (active site node, scoped to "site-{SiteId}" role):

| Singleton name | Actor class | Owner |
| --- | --- | --- |
| deployment-manager | DeploymentManagerActor | Site Runtime (#3) |
| event-log-handler | EventLogHandlerActor | Site Event Logging (#12) |

SiteAuditTelemetryActor (Audit Log #23) is not a singleton — it runs on every site node and reads node-local SQLite. It is created directly with ActorOf and bound to the audit-telemetry-dispatcher.

Usage

Registering the configuration contract

Every host calls AddClusterInfrastructure to register ClusterOptionsValidator:

services.AddClusterInfrastructure();

This registers ClusterOptionsValidator as an IValidateOptions<ClusterOptions> singleton. Because the Host binds ClusterOptions with ValidateOnStart, a misconfigured ScadaBridge:Cluster section (wrong strategy, MinNrOfMembers != 1, DownIfAlone = false, fewer than two seed nodes) throws an OptionsValidationException at startup rather than booting into a broken cluster.
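
The four rules listed above can be expressed as a dependency-free check. This is only a sketch: ClusterOptionsSketch and ClusterRulesSketch are hypothetical types that mirror the named settings, and the real ClusterOptionsValidator implements IValidateOptions<ClusterOptions> and may check more than is shown here.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical mirror of the settings named above; not the real ClusterOptions.
public sealed class ClusterOptionsSketch
{
    public List<string> SeedNodes { get; set; } = new List<string>();
    public string SplitBrainResolverStrategy { get; set; } = "keep-oldest";
    public int MinNrOfMembers { get; set; } = 1;
    public bool DownIfAlone { get; set; } = true;
}

public static class ClusterRulesSketch
{
    // Returns one message per violated rule; an empty list means the options pass.
    public static List<string> Validate(ClusterOptionsSketch options)
    {
        var failures = new List<string>();
        if (options.SeedNodes.Count < 2)
            failures.Add("SeedNodes must contain at least 2 entries.");
        if (options.SplitBrainResolverStrategy != "keep-oldest")
            failures.Add("SplitBrainResolverStrategy must be \"keep-oldest\".");
        if (options.MinNrOfMembers != 1)
            failures.Add("MinNrOfMembers must be 1.");
        if (!options.DownIfAlone)
            failures.Add("DownIfAlone must be true.");
        return failures;
    }
}
```

Collecting every failure rather than stopping at the first lets a single startup attempt report all problems in the section at once.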

Checking active-node status

Components that must run only on the active node resolve IActiveNodeGate (registered by the Host's Central composition root):

public bool IsActiveNode
{
    get
    {
        var system = _akkaService.ActorSystem;
        if (system == null) return false;
        var cluster = Cluster.Get(system);
        var self = cluster.SelfMember;
        if (self.Status != MemberStatus.Up) return false;
        var leader = cluster.State.Leader;
        return leader != null && leader == self.Address;
    }
}

This returns false while the actor system is warming up — the safe-by-default answer matching the standby case. The Inbound API uses this gate to return HTTP 503 on standby nodes.
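
A consumer of the gate might look like the following sketch. The IActiveNodeGate interface is redeclared locally, with only the IsActiveNode member described above, so the example stands alone; StandbyGuardSketch and FixedGate are hypothetical names, and the Inbound API's actual wiring is not shown in this document.

```csharp
// Local copy of the gate's shape for this sketch; the real interface is
// registered by the Host and may expose more members.
public interface IActiveNodeGate
{
    bool IsActiveNode { get; }
}

public static class StandbyGuardSketch
{
    public const int Ok = 200;
    public const int ServiceUnavailable = 503;

    // Chooses the HTTP status before any request work is done:
    // standby (or still-warming) nodes answer 503.
    public static int StatusFor(IActiveNodeGate gate) =>
        gate.IsActiveNode ? Ok : ServiceUnavailable;
}

// Test double that reports a fixed active/standby answer.
public sealed class FixedGate : IActiveNodeGate
{
    public FixedGate(bool isActive) => IsActiveNode = isActive;
    public bool IsActiveNode { get; }
}
```

Because the gate is read on every call rather than cached, a node that becomes leader after a failover starts answering 200 without a restart.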

Configuration

ClusterOptions is bound from ScadaBridge:Cluster. NodeOptions is bound from ScadaBridge:Node.

ScadaBridge:Cluster

| Key | Type | Default | Description |
| --- | --- | --- | --- |
| SeedNodes | List<string> | (required) | Akka seed-node URIs. Must contain at least 2 entries; both nodes list both themselves and their partner. |
| SplitBrainResolverStrategy | string | "keep-oldest" | Must be "keep-oldest". Quorum strategies are rejected by ClusterOptionsValidator. |
| StableAfter | TimeSpan | 00:00:15 | Cluster must be stable for this duration before the resolver acts to down unreachable nodes. |
| HeartbeatInterval | TimeSpan | 00:00:02 | Cluster failure-detector heartbeat frequency. Must be less than FailureDetectionThreshold. |
| FailureDetectionThreshold | TimeSpan | 00:00:10 | acceptable-heartbeat-pause for the cluster failure detector. |
| MinNrOfMembers | int | 1 | Must be 1. A value of 2 blocks the cluster singleton after failover. |
| DownIfAlone | bool | true | Must be true. See split-brain resolution above. |

ScadaBridge:Node

| Key | Type | Default | Description |
| --- | --- | --- | --- |
| Role | string | (required) | "Central" or "Site". |
| NodeHostname | string | (required) | Hostname this node advertises to the Akka cluster remoting layer. |
| NodeName | string | "" | Semantic label stamped on audit rows (SourceNode). Conventional values: node-a/node-b for sites, central-a/central-b for central. |
| SiteId | string? | (required for Site) | Site identifier; appended to the site-specific cluster role (site-{SiteId}). |
| RemotingPort | int | 8081 | Akka.NET TCP remoting port. Code default is 8081; the site deployment overrides this to 8082 via appsettings.Site.json. |
| GrpcPort | int | 8083 | Kestrel HTTP/2 port for SiteStreamGrpcServer (site nodes only). Must differ from RemotingPort. |
| MetricsPort | int | 8084 | Kestrel HTTP/1.1 port for the Prometheus /metrics scrape endpoint (site nodes only). Must differ from RemotingPort and GrpcPort. |
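
The "must differ" constraints on the three ports amount to a uniqueness check. The sketch below is illustrative only: NodePortsSketch.FindCollisions is a hypothetical helper, and this document does not state where (or whether) the Host performs an equivalent check for all three ports.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class NodePortsSketch
{
    // Returns each port number that is claimed by more than one listener.
    public static List<int> FindCollisions(int remotingPort, int grpcPort, int metricsPort) =>
        new[] { remotingPort, grpcPort, metricsPort }
            .GroupBy(port => port)
            .Where(group => group.Count() > 1)
            .Select(group => group.Key)
            .ToList();
}
```

With the site values 8082, 8083 and 8084 the result is empty. If GrpcPort were also set to 8082, the result would contain 8082.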

Representative Docker configuration (central node A)

{
  "ScadaBridge": {
    "Node": {
      "Role": "Central",
      "NodeName": "central-a",
      "NodeHostname": "scadabridge-central-a",
      "RemotingPort": 8081
    },
    "Cluster": {
      "SeedNodes": [
        "akka.tcp://scadabridge@scadabridge-central-a:8081",
        "akka.tcp://scadabridge@scadabridge-central-b:8081"
      ],
      "SplitBrainResolverStrategy": "keep-oldest",
      "StableAfter": "00:00:15",
      "HeartbeatInterval": "00:00:02",
      "FailureDetectionThreshold": "00:00:10",
      "MinNrOfMembers": 1
    }
  }
}

DownIfAlone is omitted from the Docker configuration files because its default value of true is the only value ClusterOptionsValidator accepts.

Dependencies & Interactions

  • Host (#15) — owns the Akka.NET bootstrap. AkkaHostedService consumes ClusterOptions and NodeOptions, assembles the HOCON, starts the ActorSystem, creates all role-specific actors, and wires CoordinatedShutdown. The ClusterInfrastructure project has no compile-time dependency on the Host; the dependency is reversed at runtime.
  • Commons (#16) — provides INodeIdentityProvider (implemented by NodeIdentityProvider in the Host), which supplies the NodeName label that audit writers stamp on the SourceNode column. Also provides IClusterNodeProvider (implemented by AkkaClusterNodeProvider in the Host), which the Health Monitoring component uses to report per-node up/down status.
  • Health Monitoring (#11) — uses IClusterNodeProvider to list cluster members and determine whether the local node is primary; uses IActiveNodeGate (central only) to gate active-node-only health paths. The active/standby distinction reported to central health originates here.
  • Site Runtime (#3) — the Deployment Manager singleton is the most operationally critical singleton this infrastructure hosts. It re-creates the full Instance Actor hierarchy from local SQLite on failover. Staggered Instance Actor startup after failover is Site Runtime's responsibility; this component provides the singleton placement guarantee.
  • Notification Outbox (#21), Site Call Audit (#22), Audit Log (#23) — each hosts one or more central singletons wired by RegisterCentralActors. Cluster Infrastructure provides the ClusterSingletonManager/ClusterSingletonProxy boilerplate and the graceful-shutdown hooks; the business logic lives in the owning component.
  • Central-Site Communication (#5): CentralCommunicationActor and SiteCommunicationActor are created and registered with ClusterClientReceptionist inside the same AkkaHostedService startup, making them addressable by remote ClusterClient instances. The transport-level heartbeat (TransportHeartbeatInterval, TransportFailureThreshold) is configured separately from the cluster failure-detector and comes from CommunicationOptions.
  • Inbound API (#14) — resolves IActiveNodeGate to return HTTP 503 on standby central nodes. Gate returns false until the actor system is Up and this node is the cluster leader.
  • Design spec: Component-ClusterInfrastructure.md.

Troubleshooting

Node fails to join cluster on startup

ClusterOptionsValidator rejects fewer than two seed nodes, a non-keep-oldest strategy, MinNrOfMembers != 1, or DownIfAlone = false at startup with an OptionsValidationException. Check that both seed-node URIs reference the Akka remoting port, not the gRPC port (8083) or metrics port (8084) — on site nodes, StartupValidator explicitly rejects seed entries whose port matches GrpcPort.
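
When diagnosing a suspect seed list by hand, the port comparison can be reproduced with System.Uri, which parses the akka.tcp://system@host:port form. This is a sketch under that assumption: SeedNodeCheckSketch is a hypothetical helper, not the StartupValidator implementation, and the site hostnames in the usage note are made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class SeedNodeCheckSketch
{
    // Returns the seed entries whose port equals the node's gRPC port.
    // Entries that do not parse as absolute URIs are skipped, not flagged.
    public static List<string> SeedsOnGrpcPort(IEnumerable<string> seedNodes, int grpcPort) =>
        seedNodes
            .Where(seed => Uri.TryCreate(seed, UriKind.Absolute, out var uri)
                           && uri.Port == grpcPort)
            .ToList();
}
```

For example, given the seeds akka.tcp://scadabridge@site-a:8082 and akka.tcp://scadabridge@site-b:8083 with a gRPC port of 8083, only the second entry is returned.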

Singleton not starting after failover

If the surviving node is Up but singletons do not start, MinNrOfMembers is the first thing to check. A value of 2 keeps the surviving node waiting for a second member indefinitely. The validator enforces 1, but a manually patched appsettings.json that bypasses the validator could produce this.

Two live clusters (split-brain)

If DownIfAlone = false were accepted (the validator rejects it), the oldest node could run alone while the younger forms its own cluster, producing two live clusters with divergent singleton state and dual MS SQL writers on central. ClusterOptionsValidator makes this configuration impossible to boot.

Graceful shutdown takes longer than expected

If a clean node stop takes up to 25 seconds rather than a few seconds, CoordinatedShutdown may not be running. Check that run-by-clr-shutdown-hook = on is present in the assembled HOCON (it is emitted unconditionally by BuildHocon) and that the Windows Service stop signal actually reaches the process, rather than the process being killed outright. A SIGKILL / TerminateProcess bypasses CoordinatedShutdown entirely; the surviving node then has to wait the full failure-detection window.