Cluster Infrastructure
The Cluster Infrastructure component manages Akka.NET cluster formation, active/standby failover, split-brain resolution, and the singleton hosting that all other ScadaBridge components depend on. Every site and central cluster is a two-node active/standby pair governed by the same configuration contract and bootstrap logic.
Overview
Cluster Infrastructure (#13) is a design responsibility spanning two projects rather than a single buildable project:
- src/ZB.MOM.WW.ScadaBridge.ClusterInfrastructure/ owns the cluster configuration contract: ClusterOptions (seed nodes, failure-detection timings, split-brain settings), ClusterOptionsValidator, and the AddClusterInfrastructure DI extension that registers the validator. It does not start an actor system.
- src/ZB.MOM.WW.ScadaBridge.Host/ owns the cluster bootstrap and runtime wiring: AkkaHostedService builds the Akka HOCON from ClusterOptions and NodeOptions, starts the ActorSystem, wires CoordinatedShutdown, and creates all role-specific actors including the cluster singletons.
This split is deliberate. The Host is the single deployable binary and the only project that performs Akka.NET bootstrap, so all cluster bring-up lives there. ClusterInfrastructure is the portable configuration contract that the Host consumes — it can be referenced by tests and other components without pulling in the Host.
Both central and site clusters run this same topology: two nodes, one active (cluster leader), one standby, with automatic failover and no manual intervention required for dual-node recovery.
Key Concepts
Active/standby via cluster leadership
Akka.NET cluster leadership determines which node is "active". The cluster leader is the oldest node in the cluster, as tracked by the keep-oldest split-brain resolver. ActiveNodeGate (in the Host) exposes IsActiveNode by checking whether cluster.SelfMember.Status == MemberStatus.Up and cluster.State.Leader == cluster.SelfAddress. Cluster singletons — which run on the oldest Up member — automatically migrate to the surviving node on failover.
Configuration contract vs. bootstrap split
ClusterOptions holds the cluster-wide formation and failure-detection settings. Node-identity settings — remoting hostname/port, role (Central or Site), site identifier, gRPC port — live in NodeOptions (ScadaBridge:Node section), owned by the Host. This split prevents the configuration contract from acquiring a hard dependency on Host-specific concerns.
Singleton hosting
Cluster Infrastructure provides the hosting platform; each singleton is owned and created by the component responsible for it. The Host's RegisterCentralActors and RegisterSiteActorsAsync methods wire every singleton via ClusterSingletonManager and a companion ClusterSingletonProxy so other actors can address it through a stable path regardless of which node currently hosts it.
Architecture
HOCON assembly
AkkaHostedService.BuildHocon constructs the Akka HOCON document from the bound options at startup. All interpolated values pass through QuoteHocon (string escaping) and DurationHocon (millisecond rendering) so the document is never corrupted by hostnames or timing values containing special characters or sub-second precision.
The snippet below is abbreviated to highlight the cluster stanzas. The full method also emits three additional stanzas: akka.extensions (registers DistributedPubSubExtensionProvider), akka.remote.dot-netty.tcp (binds NodeOptions.NodeHostname and NodeOptions.RemotingPort), and akka.remote.transport-failure-detector (heartbeat interval and acceptable-heartbeat-pause from CommunicationOptions.TransportHeartbeatInterval / TransportFailureThreshold).
// Abbreviated; see AkkaHostedService.BuildHocon for the full method.
public static string BuildHocon(
    NodeOptions nodeOptions,
    ClusterOptions clusterOptions,
    IEnumerable<string> roles,
    TimeSpan transportHeartbeat,
    TimeSpan transportFailure)
{
    var seedNodesStr = string.Join(",",
        clusterOptions.SeedNodes.Select(QuoteHocon));
    var rolesStr = string.Join(",", roles.Select(QuoteHocon));
    return $@"
audit-telemetry-dispatcher {{
  type = ForkJoinDispatcher
  throughput = 100
  dedicated-thread-pool {{
    thread-count = 2
  }}
}}
akka {{
  // akka.extensions, akka.remote.dot-netty.tcp, and
  // akka.remote.transport-failure-detector also emitted here (see full method).
  actor {{
    provider = cluster
  }}
  cluster {{
    seed-nodes = [{seedNodesStr}]
    roles = [{rolesStr}]
    min-nr-of-members = {clusterOptions.MinNrOfMembers}
    split-brain-resolver {{
      active-strategy = {QuoteHocon(clusterOptions.SplitBrainResolverStrategy)}
      stable-after = {DurationHocon(clusterOptions.StableAfter)}
      keep-oldest {{
        down-if-alone = {(clusterOptions.DownIfAlone ? "on" : "off")}
      }}
    }}
    failure-detector {{
      heartbeat-interval = {DurationHocon(clusterOptions.HeartbeatInterval)}
      acceptable-heartbeat-pause = {DurationHocon(clusterOptions.FailureDetectionThreshold)}
    }}
    run-coordinated-shutdown-when-down = on
  }}
  coordinated-shutdown {{
    run-by-clr-shutdown-hook = on
  }}
}}";
}
The HOCON also defines the audit-telemetry-dispatcher (a two-thread ForkJoinDispatcher) so SiteAuditTelemetryActor's SQLite reads and gRPC pushes never contend with the default dispatcher used by hot-path actors.
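The QuoteHocon and DurationHocon helpers mentioned above are not reproduced in this document. The following is a minimal sketch of what such helpers could look like, assuming QuoteHocon only needs to escape backslashes and double quotes and DurationHocon renders whole milliseconds with an "ms" suffix. The class name HoconFormatting is hypothetical, and the real implementations in the Host may differ.

```csharp
using System;
using System.Globalization;

// Hypothetical stand-ins for the Host's QuoteHocon / DurationHocon helpers.
public static class HoconFormatting
{
    // Wrap a value in double quotes, escaping backslashes and embedded quotes
    // so the value cannot terminate the HOCON string early.
    public static string QuoteHocon(string value)
    {
        var escaped = value.Replace("\\", "\\\\").Replace("\"", "\\\"");
        return "\"" + escaped + "\"";
    }

    // Render a duration as whole milliseconds with an explicit unit suffix,
    // using the invariant culture so no locale-specific separators appear.
    public static string DurationHocon(TimeSpan value)
    {
        var ms = (long)Math.Round(value.TotalMilliseconds);
        return ms.ToString(CultureInfo.InvariantCulture) + "ms";
    }
}
```

Rendering every duration in a single unit keeps sub-second values such as 1.5 s unambiguous (1500ms) without relying on fractional HOCON durations.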
Split-brain resolution
The keep-oldest strategy is the only strategy ClusterOptionsValidator permits for ScadaBridge's two-node clusters. Quorum strategies (keep-majority, static-quorum) cannot distinguish a crash from a partition with two nodes — both sides would be below quorum and both would shut down. Keep-oldest with down-if-alone = on ensures at most one node runs the cluster at any time:
- On a network partition, the older node stays active; the younger node downs itself.
- If the oldest node finds itself alone (no reachable members), it downs itself rather than running in isolation. Without down-if-alone, the oldest node could run as a single-node cluster while the younger node forms its own, producing two live clusters with divergent singleton state.
Failure detection and failover timeline
Detection uses two independent Akka heartbeat channels:
- Cluster failure detector (akka.cluster.failure-detector): monitors membership and triggers Unreachable events that the split-brain resolver acts on.
- Transport failure detector (akka.remote.transport-failure-detector): monitors the underlying TCP transport between nodes; configured separately, from CommunicationOptions.TransportHeartbeatInterval / TransportFailureThreshold.
With the defaults in ClusterOptions, the total failover budget is approximately 25 seconds:
| Phase | Duration | Source |
|---|---|---|
| Failure detection (acceptable-heartbeat-pause) | 10 s | ClusterOptions.FailureDetectionThreshold |
| Split-brain stable-after | 15 s | ClusterOptions.StableAfter |
| Singleton restart | < 1 s | Actor PreStart |
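The budget in the table is the sum of the two configurable phases. As a rough worked example, assuming the default values listed in this document (the FailoverBudget helper is hypothetical and exists only for illustration):

```csharp
using System;

// Hypothetical helper: adds the two dominant failover phases.
// The sub-second singleton restart is deliberately left out.
public static class FailoverBudget
{
    public static TimeSpan Estimate(TimeSpan failureDetectionThreshold, TimeSpan stableAfter)
        => failureDetectionThreshold + stableAfter;
}
```

With the defaults, FailoverBudget.Estimate(TimeSpan.FromSeconds(10), TimeSpan.FromSeconds(15)) yields 25 seconds. Lowering either setting shortens failover but leaves less tolerance for transient network pauses.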
Graceful shutdown and singleton handover
When a node is stopped cleanly, CoordinatedShutdown runs before the CLR exits (run-by-clr-shutdown-hook = on). The cluster-leave phase signals Akka to migrate singletons before the actor system terminates, so handover happens in seconds rather than waiting for the full failure-detection timeout. SiteCallAuditActor has an explicit graceful-stop task registered on PhaseClusterLeave with a 10-second timeout to drain any in-flight EF Core upsert before handover opens:
siteCallAuditShutdown.AddTask(
    Akka.Actor.CoordinatedShutdown.PhaseClusterLeave,
    "drain-site-call-audit-singleton",
    async () =>
    {
        try
        {
            await siteCallAuditSingletonManager.GracefulStop(TimeSpan.FromSeconds(10));
        }
        catch (Exception ex)
        {
            _logger.LogWarning(ex,
                "SiteCallAudit singleton did not drain within the graceful-stop "
                + "timeout; falling through to PoisonPill handover");
        }
        return Akka.Done.Instance;
    });
Cluster roles and singleton scoping
Each node carries one or more cluster roles set in the HOCON roles list. Site nodes carry both a base "Site" role and a site-specific role ("site-{SiteId}", e.g. "site-site-a"). Singletons on site clusters are scoped to the site-specific role so each site's singleton runs on exactly one node of that site's cluster, not on any other site's nodes. Central singletons use no role scope — all central nodes share the "Central" role.
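The role assignment described above can be sketched as a small function. This is an illustration only: the ClusterRoles class and its For method are hypothetical names, and the Host's actual role-building code is not shown in this document.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical sketch of how the HOCON roles list could be derived
// from NodeOptions.Role and NodeOptions.SiteId.
public static class ClusterRoles
{
    public static IReadOnlyList<string> For(string role, string siteId)
    {
        if (role == "Central")
            return new[] { "Central" };

        if (role == "Site")
        {
            if (string.IsNullOrWhiteSpace(siteId))
                throw new ArgumentException("SiteId is required for Site nodes.", nameof(siteId));
            // Base role plus the site-specific role used to scope singletons.
            return new[] { "Site", "site-" + siteId };
        }

        throw new ArgumentException("Role must be \"Central\" or \"Site\".", nameof(role));
    }
}
```

For a SiteId of "site-a" this produces the roles "Site" and "site-site-a", matching the example above.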
Dual-node recovery
Because both nodes are configured as seed nodes, whichever node starts first after a simultaneous failure forms a new cluster; the second joins when it comes up. No startup ordering dependency exists, and no manual intervention is required. The keep-oldest resolver handles the "both starting fresh" case naturally — there is no pre-existing cluster to conflict with.
Cluster singletons hosted
The Host wires the following singletons. Cluster Infrastructure provides the ClusterSingletonManager / ClusterSingletonProxy pattern; each singleton's behaviour is documented in the owning component.
Central singletons (active central node, no role scope):
| Singleton name | Actor class | Owner |
|---|---|---|
| notification-outbox | NotificationOutboxActor | Notification Outbox (#21) |
| audit-log-ingest | AuditLogIngestActor | Audit Log (#23) |
| site-call-audit | SiteCallAuditActor | Site Call Audit (#22) |
Site singletons (active site node, scoped to "site-{SiteId}" role):
| Singleton name | Actor class | Owner |
|---|---|---|
| deployment-manager | DeploymentManagerActor | Site Runtime (#3) |
| event-log-handler | EventLogHandlerActor | Site Event Logging (#12) |
SiteAuditTelemetryActor (Audit Log #23) is not a singleton — it runs on every site node and reads node-local SQLite. It is created directly with ActorOf and bound to the audit-telemetry-dispatcher.
Usage
Registering the configuration contract
Every host calls AddClusterInfrastructure to register ClusterOptionsValidator:
services.AddClusterInfrastructure();
This registers ClusterOptionsValidator as an IValidateOptions<ClusterOptions> singleton. Because the Host binds ClusterOptions with ValidateOnStart, a misconfigured ScadaBridge:Cluster section (wrong strategy, MinNrOfMembers != 1, DownIfAlone = false, fewer than two seed nodes) throws an OptionsValidationException at startup rather than booting into a broken cluster.
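The rules listed above can be expressed as plain checks. The sketch below is not the actual ClusterOptionsValidator: ClusterOptionsSketch and ClusterRules are hypothetical names, the error messages are invented, and the heartbeat-versus-threshold check is an assumption taken from the configuration table rather than a confirmed validator rule.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical mirror of the option shape described in this document.
public sealed class ClusterOptionsSketch
{
    public List<string> SeedNodes { get; set; } = new List<string>();
    public string SplitBrainResolverStrategy { get; set; } = "keep-oldest";
    public TimeSpan HeartbeatInterval { get; set; } = TimeSpan.FromSeconds(2);
    public TimeSpan FailureDetectionThreshold { get; set; } = TimeSpan.FromSeconds(10);
    public int MinNrOfMembers { get; set; } = 1;
    public bool DownIfAlone { get; set; } = true;
}

public static class ClusterRules
{
    // Returns one message per violated rule; an empty list means valid.
    public static List<string> Check(ClusterOptionsSketch o)
    {
        var errors = new List<string>();
        if (o.SeedNodes.Count < 2)
            errors.Add("SeedNodes must contain at least 2 entries.");
        if (o.SplitBrainResolverStrategy != "keep-oldest")
            errors.Add("SplitBrainResolverStrategy must be keep-oldest.");
        if (o.MinNrOfMembers != 1)
            errors.Add("MinNrOfMembers must be 1.");
        if (!o.DownIfAlone)
            errors.Add("DownIfAlone must be true.");
        // Assumed rule, based on the HeartbeatInterval description.
        if (o.HeartbeatInterval >= o.FailureDetectionThreshold)
            errors.Add("HeartbeatInterval must be less than FailureDetectionThreshold.");
        return errors;
    }
}
```

Collecting every violation before failing lets a single startup error report all misconfigured keys at once.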
Checking active-node status
Components that must run only on the active node resolve IActiveNodeGate (registered by the Host's Central composition root):
public bool IsActiveNode
{
    get
    {
        var system = _akkaService.ActorSystem;
        if (system == null) return false;
        var cluster = Cluster.Get(system);
        var self = cluster.SelfMember;
        if (self.Status != MemberStatus.Up) return false;
        var leader = cluster.State.Leader;
        return leader != null && leader == self.Address;
    }
}
This returns false while the actor system is warming up — the safe-by-default answer matching the standby case. The Inbound API uses this gate to return HTTP 503 on standby nodes.
Configuration
ClusterOptions is bound from ScadaBridge:Cluster. NodeOptions is bound from ScadaBridge:Node.
ScadaBridge:Cluster
| Key | Type | Default | Description |
|---|---|---|---|
| SeedNodes | List<string> | (required) | Akka seed-node URIs. Must contain at least 2 entries; each node lists both itself and its partner. |
| SplitBrainResolverStrategy | string | "keep-oldest" | Must be "keep-oldest". Quorum strategies are rejected by ClusterOptionsValidator. |
| StableAfter | TimeSpan | 00:00:15 | Cluster must be stable for this duration before the resolver acts to down unreachable nodes. |
| HeartbeatInterval | TimeSpan | 00:00:02 | Cluster failure-detector heartbeat frequency. Must be less than FailureDetectionThreshold. |
| FailureDetectionThreshold | TimeSpan | 00:00:10 | acceptable-heartbeat-pause for the cluster failure detector. |
| MinNrOfMembers | int | 1 | Must be 1. A value of 2 blocks the cluster singleton after failover. |
| DownIfAlone | bool | true | Must be true. See split-brain resolution above. |
ScadaBridge:Node
| Key | Type | Default | Description |
|---|---|---|---|
| Role | string | (required) | "Central" or "Site". |
| NodeHostname | string | (required) | Hostname this node advertises to the Akka cluster remoting layer. |
| NodeName | string | "" | Semantic label stamped on audit rows (SourceNode). Conventional values: node-a/node-b for sites, central-a/central-b for central. |
| SiteId | string? | (required for Site) | Site identifier; appended to the site-specific cluster role (site-{SiteId}). |
| RemotingPort | int | 8081 | Akka.NET TCP remoting port. Code default is 8081; the site deployment overrides this to 8082 via appsettings.Site.json. |
| GrpcPort | int | 8083 | Kestrel HTTP/2 port for SiteStreamGrpcServer (site nodes only). Must differ from RemotingPort. |
| MetricsPort | int | 8084 | Kestrel HTTP/1.1 port for the Prometheus /metrics scrape endpoint (site nodes only). Must differ from RemotingPort and GrpcPort. |
Representative docker configuration (central node A)
{
  "ScadaBridge": {
    "Node": {
      "Role": "Central",
      "NodeName": "central-a",
      "NodeHostname": "scadabridge-central-a",
      "RemotingPort": 8081
    },
    "Cluster": {
      "SeedNodes": [
        "akka.tcp://scadabridge@scadabridge-central-a:8081",
        "akka.tcp://scadabridge@scadabridge-central-b:8081"
      ],
      "SplitBrainResolverStrategy": "keep-oldest",
      "StableAfter": "00:00:15",
      "HeartbeatInterval": "00:00:02",
      "FailureDetectionThreshold": "00:00:10",
      "MinNrOfMembers": 1
    }
  }
}
DownIfAlone is not present in the docker files because its default value of true is correct and ClusterOptionsValidator rejects false.
Dependencies & Interactions
- Host (#15): owns the Akka.NET bootstrap. AkkaHostedService consumes ClusterOptions and NodeOptions, assembles the HOCON, starts the ActorSystem, creates all role-specific actors, and wires CoordinatedShutdown. The ClusterInfrastructure project has no compile-time dependency on the Host; the dependency is reversed at runtime.
- Commons (#16): provides INodeIdentityProvider (implemented by NodeIdentityProvider in the Host), which supplies the NodeName label that audit writers stamp on the SourceNode column. Also provides IClusterNodeProvider (implemented by AkkaClusterNodeProvider in the Host), which the Health Monitoring component uses to report per-node up/down status.
- Health Monitoring (#11): uses IClusterNodeProvider to list cluster members and determine whether the local node is primary; uses IActiveNodeGate (central only) to gate active-node-only health paths. The active/standby distinction reported to central health originates here.
- Site Runtime (#3): the Deployment Manager singleton is the most operationally critical singleton this infrastructure hosts. It re-creates the full Instance Actor hierarchy from local SQLite on failover. Staggered Instance Actor startup after failover is Site Runtime's responsibility; this component provides the singleton placement guarantee.
- Notification Outbox (#21), Site Call Audit (#22), Audit Log (#23): each hosts one or more central singletons wired by RegisterCentralActors. Cluster Infrastructure provides the ClusterSingletonManager / ClusterSingletonProxy boilerplate and the graceful-shutdown hooks; the business logic lives in the owning component.
- Central–Site Communication (#5): CentralCommunicationActor and SiteCommunicationActor are created and registered with ClusterClientReceptionist inside the same AkkaHostedService startup, making them addressable by remote ClusterClient instances. The transport-level heartbeat (TransportHeartbeatInterval, TransportFailureThreshold) is configured separately from the cluster failure-detector and comes from CommunicationOptions.
- Inbound API (#14): resolves IActiveNodeGate to return HTTP 503 on standby central nodes. The gate returns false until the actor system is Up and this node is the cluster leader.
- Design spec: Component-ClusterInfrastructure.md.
Troubleshooting
Node fails to join cluster on startup
ClusterOptionsValidator rejects fewer than two seed nodes, a non-keep-oldest strategy, MinNrOfMembers != 1, or DownIfAlone = false at startup with an OptionsValidationException. Check that both seed-node URIs reference the Akka remoting port, not the gRPC port (8083) or metrics port (8084) — on site nodes, StartupValidator explicitly rejects seed entries whose port matches GrpcPort.
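A quick way to spot the port mix-up described above is to parse each seed-node URI and compare its port with the gRPC port. The sketch below is an assumption about how such a check could be written, not the actual StartupValidator; SeedNodePortCheck and FindGrpcPortCollisions are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical diagnostic: lists seed-node entries that point at the gRPC port.
public static class SeedNodePortCheck
{
    public static List<string> FindGrpcPortCollisions(IEnumerable<string> seedNodes, int grpcPort)
    {
        return seedNodes
            // Entries that do not parse as absolute URIs are skipped here;
            // a real validator would likely report them separately.
            .Where(s => Uri.TryCreate(s, UriKind.Absolute, out var uri) && uri.Port == grpcPort)
            .ToList();
    }
}
```

Passing the configured SeedNodes and GrpcPort returns the offending entries, which can then be logged or included in a startup error message.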
Singleton not starting after failover
If the surviving node is Up but singletons do not start, MinNrOfMembers is the first thing to check. A value of 2 keeps the surviving node waiting for a second member indefinitely. The validator enforces 1, but a manually patched appsettings.json that bypasses the validator could produce this.
Two live clusters (split-brain)
If DownIfAlone = false were accepted (the validator rejects it), the oldest node could run alone while the younger forms its own cluster, producing two live clusters with divergent singleton state and dual MS SQL writers on central. ClusterOptionsValidator makes this configuration impossible to boot.
Graceful shutdown takes longer than expected
If failover after a clean node stop takes up to 25 seconds rather than a few seconds, CoordinatedShutdown may not be running. Check that run-by-clr-shutdown-hook = on is present in the assembled HOCON (BuildHocon emits it unconditionally) and that the Windows Service stop signal actually reaches the process instead of the process being killed. A SIGKILL / TerminateProcess bypasses CoordinatedShutdown entirely; the surviving node then has to wait the full failure-detection window.