
# Cluster Infrastructure

The Cluster Infrastructure component manages Akka.NET cluster formation, active/standby failover, split-brain resolution, and the singleton hosting that all other ScadaBridge components depend on. Every site and central cluster is a two-node active/standby pair governed by the same configuration contract and bootstrap logic.

## Overview

Cluster Infrastructure (#13) is a **design responsibility** spanning two projects rather than a single buildable project:

- **`src/ZB.MOM.WW.ScadaBridge.ClusterInfrastructure/`** owns the cluster configuration contract: `ClusterOptions` (seed nodes, failure-detection timings, split-brain settings), `ClusterOptionsValidator`, and the `AddClusterInfrastructure` DI extension that registers the validator. It does not start an actor system.
- **`src/ZB.MOM.WW.ScadaBridge.Host/`** owns the cluster bootstrap and runtime wiring: `AkkaHostedService` builds the Akka HOCON from `ClusterOptions` and `NodeOptions`, starts the `ActorSystem`, wires `CoordinatedShutdown`, and creates all role-specific actors including the cluster singletons.

This split is deliberate. The Host is the single deployable binary and the only project that performs Akka.NET bootstrap, so all cluster bring-up lives there. `ClusterInfrastructure` is the portable configuration contract that the Host consumes — it can be referenced by tests and other components without pulling in the Host.

Both central and site clusters run this same topology: two nodes, one active (cluster leader), one standby, with automatic failover and no manual intervention required for dual-node recovery.

## Key Concepts

### Active/standby via cluster leadership

Akka.NET cluster leadership determines which node is "active". `ActiveNodeGate` (in the Host) exposes `IsActiveNode` by checking whether `cluster.SelfMember.Status == MemberStatus.Up` and `cluster.State.Leader == cluster.SelfAddress`. Cluster singletons, which run on the oldest `Up` member, automatically migrate to the surviving node on failover. Note that Akka.NET derives the leader from its address-sorted member list, while singleton placement and the keep-oldest resolver follow member age, so "leader" and "oldest" are separate notions; once a single node survives a failover, it is both.

### Configuration contract vs. bootstrap split

`ClusterOptions` holds the cluster-wide formation and failure-detection settings. Node-identity settings — remoting hostname/port, role (`Central` or `Site`), site identifier, gRPC port — live in `NodeOptions` (`ScadaBridge:Node` section), owned by the Host. This split prevents the configuration contract from acquiring a hard dependency on Host-specific concerns.

### Singleton hosting

Cluster Infrastructure provides the hosting platform; each singleton is owned and created by the component responsible for it. The Host's `RegisterCentralActors` and `RegisterSiteActorsAsync` methods wire every singleton via `ClusterSingletonManager` and a companion `ClusterSingletonProxy` so other actors can address it through a stable path regardless of which node currently hosts it.

## Architecture

### HOCON assembly

`AkkaHostedService.BuildHocon` constructs the Akka HOCON document from the bound options at startup. Interpolated strings pass through `QuoteHocon` (string escaping) and durations through `DurationHocon` (millisecond rendering), so hostnames containing special characters or timing values with sub-second precision cannot corrupt the document:

```csharp
public static string BuildHocon(
    NodeOptions nodeOptions,
    ClusterOptions clusterOptions,
    IEnumerable<string> roles,
    TimeSpan transportHeartbeat,
    TimeSpan transportFailure)
{
    var seedNodesStr = string.Join(",",
        clusterOptions.SeedNodes.Select(QuoteHocon));
    var rolesStr = string.Join(",", roles.Select(QuoteHocon));

    return $@"
audit-telemetry-dispatcher {{
    type = ForkJoinDispatcher
    throughput = 100
    dedicated-thread-pool {{
        thread-count = 2
    }}
}}
akka {{
    actor {{
        provider = cluster
    }}
    cluster {{
        seed-nodes = [{seedNodesStr}]
        roles = [{rolesStr}]
        min-nr-of-members = {clusterOptions.MinNrOfMembers}
        split-brain-resolver {{
            active-strategy = {QuoteHocon(clusterOptions.SplitBrainResolverStrategy)}
            stable-after = {DurationHocon(clusterOptions.StableAfter)}
            keep-oldest {{
                down-if-alone = {(clusterOptions.DownIfAlone ? "on" : "off")}
            }}
        }}
        failure-detector {{
            heartbeat-interval = {DurationHocon(clusterOptions.HeartbeatInterval)}
            acceptable-heartbeat-pause = {DurationHocon(clusterOptions.FailureDetectionThreshold)}
        }}
        run-coordinated-shutdown-when-down = on
    }}
    coordinated-shutdown {{
        run-by-clr-shutdown-hook = on
    }}
}}";
}
```

The HOCON also defines the `audit-telemetry-dispatcher` (a two-thread `ForkJoinDispatcher`) so `SiteAuditTelemetryActor`'s SQLite reads and gRPC pushes never contend with the default dispatcher used by hot-path actors.
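
The source names `QuoteHocon` and `DurationHocon` but does not show them. As a rough illustration of what such helpers might look like, the sketch below is an assumption rather than the Host's actual code: the class name `HoconFormatting` is hypothetical, and the escaping covers only backslashes and double quotes.

```csharp
using System;
using System.Globalization;

// Hypothetical stand-ins for the Host's QuoteHocon and DurationHocon helpers.
public static class HoconFormatting
{
    public static string QuoteHocon(string value)
    {
        // Escape backslashes first so the escapes added for quotes are not doubled.
        var escaped = value.Replace("\\", "\\\\").Replace("\"", "\\\"");
        return "\"" + escaped + "\"";
    }

    public static string DurationHocon(TimeSpan value)
    {
        // Truncate to whole milliseconds and format with the invariant culture
        // so the output never contains a locale-specific separator.
        var milliseconds = (long)value.TotalMilliseconds;
        return milliseconds.ToString(CultureInfo.InvariantCulture) + "ms";
    }
}
```

Under these assumptions, `HoconFormatting.DurationHocon(TimeSpan.FromSeconds(2))` renders as `2000ms`, which HOCON accepts as a duration literal.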

### Split-brain resolution

The keep-oldest strategy is the only strategy `ClusterOptionsValidator` permits for ScadaBridge's two-node clusters. Quorum strategies (`keep-majority`, `static-quorum`) cannot distinguish a crash from a partition with two nodes — both sides would be below quorum and both would shut down. Keep-oldest with `down-if-alone = on` ensures at most one node runs the cluster at any time:

- On a network partition, the older node stays active; the younger node downs itself.
- If the oldest node finds itself alone (no reachable members), it downs itself rather than running in isolation. Without `down-if-alone`, the oldest node could run as a single-node cluster while the younger node forms its own — producing two live clusters with divergent singleton state.

### Failure detection and failover timeline

Detection uses two independent Akka heartbeat channels:

- **Cluster failure detector** (`akka.cluster.failure-detector`): monitors membership, triggers `Unreachable` events that the split-brain resolver acts on.
- **Transport failure detector** (`akka.remote.transport-failure-detector`): monitors the underlying TCP transport between nodes; configured separately, from `CommunicationOptions.TransportHeartbeatInterval` and `CommunicationOptions.TransportFailureThreshold`.

With the defaults in `ClusterOptions`, the total failover budget is approximately 25 seconds:

| Phase | Duration | Source |
|-------|----------|--------|
| Failure detection (`acceptable-heartbeat-pause`) | 10 s | `ClusterOptions.FailureDetectionThreshold` |
| Split-brain stable-after | 15 s | `ClusterOptions.StableAfter` |
| Singleton restart | < 1 s | Actor `PreStart` |
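
The budget in the table is a plain sum of the three phases. A minimal sketch of that arithmetic follows; the `FailoverBudget` class and the restart allowance parameter are hypothetical, since the source only states that singleton restart takes under one second.

```csharp
using System;

// Hypothetical helper that adds up the failover phases from the table above.
public static class FailoverBudget
{
    public static TimeSpan WorstCase(
        TimeSpan failureDetectionThreshold,
        TimeSpan stableAfter,
        TimeSpan singletonRestartAllowance)
    {
        // Detection must elapse before the resolver's stable-after timer can
        // expire, and the singleton restarts only after the resolver has acted.
        return failureDetectionThreshold + stableAfter + singletonRestartAllowance;
    }
}
```

With the documented defaults (10 s and 15 s) and a 1 s restart allowance, `WorstCase` returns 26 seconds, consistent with the approximate 25-second figure. Tightening either `FailureDetectionThreshold` or `StableAfter` shortens the budget by the same amount.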

### Graceful shutdown and singleton handover

When a node is stopped cleanly, `CoordinatedShutdown` runs before the CLR exits (`run-by-clr-shutdown-hook = on`). The cluster-leave phase signals Akka to migrate singletons before the actor system terminates, so handover happens in seconds rather than waiting for the full failure-detection timeout. `SiteCallAuditActor` has an explicit graceful-stop task registered on `PhaseClusterLeave` with a 10-second timeout to drain any in-flight EF Core upsert before handover begins:

```csharp
siteCallAuditShutdown.AddTask(
    Akka.Actor.CoordinatedShutdown.PhaseClusterLeave,
    "drain-site-call-audit-singleton",
    async () =>
    {
        try
        {
            await siteCallAuditSingletonManager.GracefulStop(TimeSpan.FromSeconds(10));
        }
        catch (Exception ex)
        {
            _logger.LogWarning(ex,
                "SiteCallAudit singleton did not drain within the graceful-stop "
                + "timeout; falling through to PoisonPill handover");
        }
        return Akka.Done.Instance;
    });
```

### Cluster roles and singleton scoping

Each node carries one or more cluster roles set in the HOCON `roles` list. Site nodes carry both a base `"Site"` role and a site-specific role (`"site-{SiteId}"`, e.g. `"site-site-a"`). Singletons on site clusters are scoped to the site-specific role so each site's singleton runs on exactly one node of that site's cluster, not on any other site's nodes. Central singletons use no role scope — all central nodes share the `"Central"` role.
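
The role rules above can be sketched as a small pure function. This is an illustrative assumption, not the Host's code: the `ClusterRoles` class and its method names are hypothetical, and only the role strings themselves come from this document.

```csharp
#nullable enable
using System;
using System.Collections.Generic;

// Hypothetical helper deriving the HOCON roles list from the node settings.
public static class ClusterRoles
{
    // Site-specific role used to scope site singletons, e.g. "site-site-a".
    public static string SiteRole(string siteId) => "site-" + siteId;

    public static IReadOnlyList<string> For(string role, string? siteId)
    {
        if (role == "Central")
            return new[] { "Central" };

        if (role == "Site")
        {
            if (string.IsNullOrWhiteSpace(siteId))
                throw new ArgumentException("SiteId is required for Site nodes.", nameof(siteId));
            // Site nodes carry the base role plus the site-specific role.
            return new[] { "Site", SiteRole(siteId) };
        }

        throw new ArgumentException("Role must be \"Central\" or \"Site\".", nameof(role));
    }
}
```

For a site node whose `SiteId` is `site-a`, this yields the roles `Site` and `site-site-a`; the second value is the one a singleton's role scope would name.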

### Dual-node recovery

Because both nodes are configured as seed nodes, whichever node starts first after a simultaneous failure forms a new cluster; the second joins when it comes up. No startup ordering dependency exists, and no manual intervention is required. The keep-oldest resolver handles the "both starting fresh" case naturally — there is no pre-existing cluster to conflict with.

### Cluster singletons hosted

The Host wires the following singletons. Cluster Infrastructure provides the `ClusterSingletonManager` / `ClusterSingletonProxy` pattern; each singleton's behaviour is documented in the owning component.

**Central singletons (active central node, no role scope):**

| Singleton name | Actor class | Owner |
|----------------|-------------|-------|
| `notification-outbox` | `NotificationOutboxActor` | Notification Outbox (#21) |
| `audit-log-ingest` | `AuditLogIngestActor` | Audit Log (#23) |
| `site-call-audit` | `SiteCallAuditActor` | Site Call Audit (#22) |

**Site singletons (active site node, scoped to `"site-{SiteId}"` role):**

| Singleton name | Actor class | Owner |
|----------------|-------------|-------|
| `deployment-manager` | `DeploymentManagerActor` | Site Runtime (#3) |
| `event-log-handler` | `EventLogHandlerActor` | Site Event Logging (#12) |

`SiteAuditTelemetryActor` (Audit Log #23) is **not** a singleton — it runs on every site node and reads node-local SQLite. It is created directly with `ActorOf` and bound to the `audit-telemetry-dispatcher`.

## Usage

### Registering the configuration contract

Every host calls `AddClusterInfrastructure` to register `ClusterOptionsValidator`:

```csharp
services.AddClusterInfrastructure();
```

This registers `ClusterOptionsValidator` as an `IValidateOptions<ClusterOptions>` singleton. Because the Host binds `ClusterOptions` with `ValidateOnStart`, a misconfigured `ScadaBridge:Cluster` section (wrong strategy, `MinNrOfMembers != 1`, `DownIfAlone = false`, fewer than two seed nodes) throws an `OptionsValidationException` at startup rather than booting into a broken cluster.
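
To make the rejected configurations concrete, the following sketch restates the rules as a standalone check. It is not the real `ClusterOptionsValidator`: `ClusterOptionsSketch` and `ClusterRules` are hypothetical names, the error messages are invented for illustration, and the heartbeat rule is taken from the configuration table, so whether the real validator enforces it is an assumption.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical, trimmed-down mirror of ClusterOptions for illustration only.
public sealed class ClusterOptionsSketch
{
    public List<string> SeedNodes { get; set; } = new List<string>();
    public string SplitBrainResolverStrategy { get; set; } = "keep-oldest";
    public TimeSpan HeartbeatInterval { get; set; } = TimeSpan.FromSeconds(2);
    public TimeSpan FailureDetectionThreshold { get; set; } = TimeSpan.FromSeconds(10);
    public int MinNrOfMembers { get; set; } = 1;
    public bool DownIfAlone { get; set; } = true;
}

public static class ClusterRules
{
    // Returns one message per violated rule; an empty list means the options pass.
    public static IReadOnlyList<string> Check(ClusterOptionsSketch options)
    {
        var errors = new List<string>();
        if (options.SeedNodes.Count < 2)
            errors.Add("SeedNodes must contain at least 2 entries.");
        if (options.SplitBrainResolverStrategy != "keep-oldest")
            errors.Add("SplitBrainResolverStrategy must be \"keep-oldest\".");
        if (options.MinNrOfMembers != 1)
            errors.Add("MinNrOfMembers must be 1.");
        if (!options.DownIfAlone)
            errors.Add("DownIfAlone must be true.");
        // Assumption: taken from the configuration table, not confirmed validator behaviour.
        if (options.HeartbeatInterval >= options.FailureDetectionThreshold)
            errors.Add("HeartbeatInterval must be less than FailureDetectionThreshold.");
        return errors;
    }
}
```

Collecting every violation instead of stopping at the first one means a single startup failure can report all misconfigured keys at once.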

### Checking active-node status

Components that must run only on the active node resolve `IActiveNodeGate` (registered by the Host's Central composition root). The Host's implementation of `IsActiveNode` is:

```csharp
public bool IsActiveNode
{
    get
    {
        var system = _akkaService.ActorSystem;
        if (system == null) return false;
        var cluster = Cluster.Get(system);
        var self = cluster.SelfMember;
        if (self.Status != MemberStatus.Up) return false;
        var leader = cluster.State.Leader;
        return leader != null && leader == self.Address;
    }
}
```

This returns `false` while the actor system is warming up — the safe-by-default answer matching the standby case. The Inbound API uses this gate to return HTTP 503 on standby nodes.
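
A consumer only needs the boolean. The sketch below shows one way a caller might map the gate to an HTTP status code; it is a simplified assumption rather than the Inbound API's actual code. The interface is redeclared locally so the example compiles on its own, and `StandbyGuard` and `FixedGate` are hypothetical names.

```csharp
// Local stand-in for the Host's IActiveNodeGate so the sketch is self-contained.
public interface IActiveNodeGate
{
    bool IsActiveNode { get; }
}

// Hypothetical consumer: serve requests on the active node, refuse on standby.
public static class StandbyGuard
{
    public const int Ok = 200;
    public const int ServiceUnavailable = 503;

    public static int StatusFor(IActiveNodeGate gate) =>
        gate.IsActiveNode ? Ok : ServiceUnavailable;
}

// Hypothetical fixed-value gate, useful for unit tests of gated components.
public sealed class FixedGate : IActiveNodeGate
{
    public FixedGate(bool isActiveNode) { IsActiveNode = isActiveNode; }

    public bool IsActiveNode { get; }
}
```

Because the gate is an interface, components that depend on it can be tested with a fixed value and no running actor system.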

## Configuration

`ClusterOptions` is bound from `ScadaBridge:Cluster`. `NodeOptions` is bound from `ScadaBridge:Node`.

### `ScadaBridge:Cluster`

| Key | Type | Default | Description |
|-----|------|---------|-------------|
| `SeedNodes` | `List<string>` | (required) | Akka seed-node URIs. Must contain at least 2 entries; each node lists both itself and its partner. |
| `SplitBrainResolverStrategy` | `string` | `"keep-oldest"` | Must be `"keep-oldest"`. Quorum strategies are rejected by `ClusterOptionsValidator`. |
| `StableAfter` | `TimeSpan` | `00:00:15` | Cluster must be stable for this duration before the resolver acts to down unreachable nodes. |
| `HeartbeatInterval` | `TimeSpan` | `00:00:02` | Cluster failure-detector heartbeat frequency. Must be less than `FailureDetectionThreshold`. |
| `FailureDetectionThreshold` | `TimeSpan` | `00:00:10` | `acceptable-heartbeat-pause` for the cluster failure detector. |
| `MinNrOfMembers` | `int` | `1` | Must be `1`. A value of `2` blocks the cluster singleton after failover. |
| `DownIfAlone` | `bool` | `true` | Must be `true`. See split-brain resolution above. |

### `ScadaBridge:Node`

| Key | Type | Default | Description |
|-----|------|---------|-------------|
| `Role` | `string` | (required) | `"Central"` or `"Site"`. |
| `NodeHostname` | `string` | (required) | Hostname this node advertises to the Akka cluster remoting layer. |
| `NodeName` | `string` | `""` | Semantic label stamped on audit rows (`SourceNode`). Conventional values: `node-a`/`node-b` for sites, `central-a`/`central-b` for central. |
| `SiteId` | `string?` | (required for Site) | Site identifier; appended to the site-specific cluster role (`site-{SiteId}`). |
| `RemotingPort` | `int` | `8081` | Akka.NET TCP remoting port. Default `8081` for central, `8082` for site. |
| `GrpcPort` | `int` | `8083` | Kestrel HTTP/2 port for `SiteStreamGrpcServer` (site nodes only). Must differ from `RemotingPort`. |
| `MetricsPort` | `int` | `8084` | Kestrel HTTP/1.1 port for the Prometheus `/metrics` scrape endpoint (site nodes only). Must differ from `RemotingPort` and `GrpcPort`. |

### Representative Docker configuration (central node A)

```json
{
  "ScadaBridge": {
    "Node": {
      "Role": "Central",
      "NodeName": "central-a",
      "NodeHostname": "scadabridge-central-a",
      "RemotingPort": 8081
    },
    "Cluster": {
      "SeedNodes": [
        "akka.tcp://scadabridge@scadabridge-central-a:8081",
        "akka.tcp://scadabridge@scadabridge-central-b:8081"
      ],
      "SplitBrainResolverStrategy": "keep-oldest",
      "StableAfter": "00:00:15",
      "HeartbeatInterval": "00:00:02",
      "FailureDetectionThreshold": "00:00:10",
      "MinNrOfMembers": 1
    }
  }
}
```

`DownIfAlone` is omitted from the Docker configuration files because its default value of `true` is correct and `ClusterOptionsValidator` rejects `false`.

## Dependencies & Interactions

- [Host (#15)](./Host.md): owns the Akka.NET bootstrap. `AkkaHostedService` consumes `ClusterOptions` and `NodeOptions`, assembles the HOCON, starts the `ActorSystem`, creates all role-specific actors, and wires `CoordinatedShutdown`. The `ClusterInfrastructure` project has no compile-time dependency on the Host; the dependency runs the other way, with the Host referencing `ClusterInfrastructure`.
- [Commons (#16)](./Commons.md) — provides `INodeIdentityProvider` (implemented by `NodeIdentityProvider` in the Host), which supplies the `NodeName` label that audit writers stamp on the `SourceNode` column. Also provides `IClusterNodeProvider` (implemented by `AkkaClusterNodeProvider` in the Host), which the Health Monitoring component uses to report per-node up/down status.
- [Health Monitoring (#11)](./HealthMonitoring.md) — uses `IClusterNodeProvider` to list cluster members and determine whether the local node is primary; uses `IActiveNodeGate` (central only) to gate active-node-only health paths. The active/standby distinction reported to central health originates here.
- [Site Runtime (#3)](./SiteRuntime.md) — the Deployment Manager singleton is the most operationally critical singleton this infrastructure hosts. It re-creates the full Instance Actor hierarchy from local SQLite on failover. Staggered Instance Actor startup after failover is Site Runtime's responsibility; this component provides the singleton placement guarantee.
- [Notification Outbox (#21)](./NotificationOutbox.md), [Site Call Audit (#22)](./SiteCallAudit.md), [Audit Log (#23)](./AuditLog.md) — each hosts one or more central singletons wired by `RegisterCentralActors`. Cluster Infrastructure provides the `ClusterSingletonManager`/`ClusterSingletonProxy` boilerplate and the graceful-shutdown hooks; the business logic lives in the owning component.
- [Central–Site Communication (#5)](./Communication.md) — `CentralCommunicationActor` and `SiteCommunicationActor` are created and registered with `ClusterClientReceptionist` inside the same `AkkaHostedService` startup, making them addressable by remote `ClusterClient` instances. The transport-level heartbeat (`TransportHeartbeatInterval`, `TransportFailureThreshold`) is configured separately from the cluster failure-detector and comes from `CommunicationOptions`.
- [Inbound API (#14)](./InboundAPI.md): resolves `IActiveNodeGate` to return HTTP 503 on standby central nodes. The gate returns `false` until the actor system has started, this node's member status is `Up`, and this node is the cluster leader.
- Design spec: [Component-ClusterInfrastructure.md](../requirements/Component-ClusterInfrastructure.md).

## Troubleshooting

### Node fails to join cluster on startup

`ClusterOptionsValidator` rejects fewer than two seed nodes, a non-`keep-oldest` strategy, `MinNrOfMembers != 1`, or `DownIfAlone = false` at startup with an `OptionsValidationException`. Check that both seed-node URIs reference the Akka remoting port, not the gRPC port (8083) or metrics port (8084) — `StartupValidator` explicitly rejects seed entries whose port matches `GrpcPort`.
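
When checking seed entries by hand, it can help to extract the port from each URI and compare it with the node's other ports. The sketch below is a hypothetical diagnostic, not `StartupValidator` itself: `SeedNodePorts` is an invented name, and flagging the metrics port in addition to the gRPC port is an assumption that goes beyond what the source says the validator does.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical diagnostic for spotting seed nodes that point at the wrong port.
public static class SeedNodePorts
{
    // Returns the number after the final ':' of a seed URI such as
    // "akka.tcp://scadabridge@host:8081", or null when no numeric port is present.
    public static int? PortOf(string seedNode)
    {
        var index = seedNode.LastIndexOf(':');
        if (index < 0 || index == seedNode.Length - 1) return null;
        int port;
        return int.TryParse(seedNode.Substring(index + 1), out port) ? port : (int?)null;
    }

    // Lists the seed entries whose port equals the gRPC or metrics port.
    public static IReadOnlyList<string> Suspicious(
        IEnumerable<string> seedNodes, int grpcPort, int metricsPort)
    {
        var hits = new List<string>();
        foreach (var seed in seedNodes)
        {
            var port = PortOf(seed);
            // A null port never equals an int, so entries without a port are skipped.
            if (port == grpcPort || port == metricsPort) hits.Add(seed);
        }
        return hits;
    }
}
```

An empty result from `Suspicious` only rules out this particular mistake; a seed entry can still be wrong in other ways, such as naming an unreachable hostname.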

### Singleton not starting after failover

If the surviving node is `Up` but singletons do not start, `MinNrOfMembers` is the first thing to check. A value of `2` keeps the surviving node waiting for a second member indefinitely. The validator enforces `1`, but a manually patched `appsettings.json` that bypasses the validator could produce this.

### Two live clusters (split-brain)

If `DownIfAlone = false` were accepted (the validator rejects it), the oldest node could run alone while the younger forms its own cluster, producing two live clusters with divergent singleton state and dual MS SQL writers on central. `ClusterOptionsValidator` makes this configuration impossible to boot.

### Graceful shutdown takes longer than expected

If failover after a clean node stop takes up to 25 seconds rather than a few seconds, `CoordinatedShutdown` may not be running. Check that `run-by-clr-shutdown-hook = on` is present in the assembled HOCON (it is emitted unconditionally by `BuildHocon`) and that the Windows Service stop signal reaches the process instead of the process being killed. A `SIGKILL` / `TerminateProcess` bypasses `CoordinatedShutdown` entirely; the surviving node then has to wait the full failure-detection window.

## Related Documentation

- [Cluster Infrastructure design specification](../requirements/Component-ClusterInfrastructure.md)
- [Host](./Host.md)
- [Site Runtime](./SiteRuntime.md)
- [Health Monitoring](./HealthMonitoring.md)
- [Central–Site Communication](./Communication.md)
- [Notification Outbox](./NotificationOutbox.md)
- [Site Call Audit](./SiteCallAudit.md)
- [Audit Log](./AuditLog.md)
- [Commons](./Commons.md)