scadalink-design/docs/requirements/Component-ClusterInfrastructure.md
Joseph Doherty c929562e41 docs(audit): apply cross-bundle review fixes before merge
Final cross-bundle reviewer identified 7 inconsistencies that the per-bundle
reviewers couldn't see; all fixed in one logical commit.

Critical:
- HighLevelReqs AL-3: drop 'then upsert-on-newer-status' — AuditLog is
  strictly append-only (correct for SiteCalls/Notifications, wrong for
  the immutable AuditLog shadow).
- Component-AuditLog Error rate KPI: align with HealthMonitoring's
  exclusion list (Success/Delivered/Enqueued) rather than just non-Success;
  otherwise every Delivered notification or Enqueued cached call would be
  counted as an error.

Important:
- Component-AuditLog line 154: ISiteAuditWriter -> IAuditWriter (canonical
  name per Commons and the rest of this doc).
- Component-AuditLog Central direct-write paragraph: convert remaining
  slash notation (ApiInbound/Completed, Notification/Attempt,
  Notification/Terminal) to dot notation used everywhere else.
- Component-ClusterInfrastructure: scope SiteCallAuditActor to
  reconciliation + KPIs + Retry/Discard relay; cached-telemetry ingest is
  AuditLogIngestActor's role per Combined Telemetry contract.
- Component-CentralUI Audit Log page: state the OperationalAudit read
  permission and the read-vs-export split (matching CLI doc).
- Component-NotificationOutbox: add never-fail-the-action invariant for
  dispatcher audit writes.

Minor:
- Component-InboundAPI: 'Non-blocking semantics' was ambiguous (could be
  read as async); reword to 'Fail-soft' — the write is still synchronous
  before flush, but failures are caught and don't change the response.
- Component-CLI: realign audit-query/audit-export flags to actually match
  the Central UI Audit Log filter set (channel, kind, status, site,
  instance, target, actor, correlation-id, errors-only); drop --user and
  --entity-id which are IAuditService concepts, not Audit Log columns.
- Component-AuditLog KPI tile names: 'Volume/Error rate/Backlog' ->
  'Audit volume/Audit error rate/Audit backlog' (matches Central UI and
  Health Monitoring); drop the two orphan KPIs (Top inbound callers, Top
  outbound 5xx) that were never surfaced anywhere.
- Component-AuditLog Interactions: re-attribute DbOutbound emissions to
  ESG (where Database.* lives) with a note that Site Runtime is the API
  surface for scripts.
- HighLevelReqs AL-12: drop 'and reconciliation operations' (CLI has no
  reconcile command; reconciliation is an internal self-healing pull).
  Add note that verify-chain becomes operational once AL-11's hash chain
  ships.
2026-05-20 09:00:11 -04:00


Component: Cluster Infrastructure

Purpose

The Cluster Infrastructure component manages the Akka.NET cluster setup, active/standby node roles, failover detection, and the foundational runtime environment on which all other components run. It provides the base layer for both central and site clusters.

Location

Both central and site clusters.

Responsibilities

  • Bootstrap the Akka.NET actor system on each node.
  • Form a two-node cluster (active/standby) using Akka.NET Cluster.
  • Manage leader election and role assignment (active vs. standby).
  • Detect node failures and trigger failover.
  • Provide the Akka.NET remoting infrastructure for inter-cluster communication.
  • Support cluster singleton hosting (used by the central and site singletons listed under Cluster Singletons, such as the Site Runtime Deployment Manager on site clusters).
  • Manage Windows service lifecycle (start, stop, restart) on each node.

Implementation Note — Code Placement

This component is a design responsibility, not a single buildable project that contains all of the code. The cluster-infrastructure responsibilities above are realised across two projects:

  • src/ScadaLink.ClusterInfrastructure owns the cluster configuration model: the ClusterOptions POCO (seed nodes, roles, remoting/gRPC ports, failure-detection timings, split-brain settings) bound from appsettings.json via the Options pattern.
  • src/ScadaLink.Host owns the cluster bootstrap and runtime wiring: it builds the Akka.NET HOCON from ClusterOptions, starts the ActorSystem, configures the keep-oldest split-brain resolver (down-if-alone = on), wires CoordinatedShutdown into the service lifecycle, and provides active-node / cluster-membership health checks. See Component-Host.md (REQ-HOST-*) for detail.

This split is deliberate — the Host is the single deployable binary and the only project that performs Akka.NET bootstrap, so the cluster bring-up lives there alongside role-based component registration. The ClusterInfrastructure project remains the home of the configuration contract that the Host consumes.

Cluster Topology

Central Cluster

  • Two nodes forming an Akka.NET cluster.
  • One active node runs all central components (Template Engine, Deployment Manager, Central UI, etc.).
  • One standby node is ready to take over on failover.
  • Connected to MS SQL databases (Config DB, Machine Data DB).

Site Cluster (per site)

  • Two nodes forming an Akka.NET cluster.
  • One active node runs all site components (Site Runtime, Data Connection Layer, Store-and-Forward Engine, etc.).
  • The Site Runtime Deployment Manager runs as an Akka.NET cluster singleton on the active node, owning the full Instance Actor hierarchy.
  • One standby node receives replicated store-and-forward data and is ready to take over.
  • Connected to local SQLite databases (store-and-forward buffer, event logs, deployed configurations).
  • Connected to machines via data connections (OPC UA).

Cluster Singletons

Akka.NET cluster singletons run on the active node of their cluster and migrate on failover. Each singleton listed here is owned by the named component; this component (Cluster Infrastructure) provides only the hosting, supervision, and active-node placement guarantee.

Central singletons (active central node)

  • NotificationOutboxActor — owned by Notification Outbox (#21). Drives the central notification dispatch loop against the Notifications table.
  • SiteCallAuditActor — owned by Site Call Audit (#22). Owns the operational SiteCalls table: drives periodic reconciliation pulls for CachedCall / CachedWrite lifecycle, computes KPIs, and relays operator Retry/Discard actions to the owning site. Note: ingest of cached-call telemetry is performed by AuditLogIngestActor (#23) in one transaction with the immutable AuditLog insert — see Component-AuditLog.md, Cached Operations — Combined Telemetry.
  • AuditLogIngestActor — owned by Audit Log (#23). Receives gRPC telemetry batches of AuditEvent rows from sites and performs insert-if-not-exists on EventId against the central AuditLog table. For cached-call telemetry (which carries both audit-row content and operational-state fields in a single packet), the ingest performs the AuditLog insert and the SiteCalls upsert in one transaction — see Component-AuditLog.md for the combined-telemetry contract.
  • SiteAuditReconciliationActor — owned by Audit Log (#23). Periodic per-site pull (default every 5 minutes) that self-heals missed audit telemetry by asking each site for its oldest ForwardState = 'Pending' row and issuing a PullAuditEvents(sinceUtc, batchSize) when a non-draining backlog is detected.
  • AuditLogPurgeActor — owned by Audit Log (#23). Daily partition-switch purge against ps_AuditLog_Month; switches out any partition older than AuditLog:RetentionDays and emits an AuditLog:Purged event. Also rolls the partition scheme forward each month so the next month's partition exists ahead of time.

Site singletons (active site node, per site cluster)

  • Site Runtime Deployment Manager — owned by Site Runtime (#3). Owns the full Instance Actor hierarchy; re-creates it on failover from local SQLite.
  • SiteAuditTelemetryActor — owned by Audit Log (#23). Drains the local site AuditLog SQLite's ForwardState = 'Pending' rows to central in batches via the existing SiteStream gRPC channel; cadence is short (default 5 s) when the queue is non-empty and longer (default 30 s) when idle. Runs on a dedicated dispatcher so it does not compete with the script blocking-I/O dispatcher (per Component-AuditLog.md, Ingestion Paths → Telemetry forward).
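The dedicated dispatcher mentioned for SiteAuditTelemetryActor could be declared roughly as follows. This is a minimal sketch, not the project's actual configuration: the dispatcher name `site-audit-telemetry-dispatcher` and the thread count are assumptions made for illustration, and the dispatcher type is one of several that Akka.NET offers.

```hocon
# Hypothetical dispatcher name; the source only states that a dedicated
# dispatcher exists, not what it is called or how it is sized.
site-audit-telemetry-dispatcher {
  # ForkJoinDispatcher runs actors on its own dedicated thread pool,
  # separate from the pool used by the script blocking-I/O dispatcher.
  type = ForkJoinDispatcher
  throughput = 100
  dedicated-thread-pool {
    thread-count = 1        # assumed: a single drain loop needs one thread
    deadlock-timeout = 3s
    threadtype = background
  }
}
```

An actor would be bound to such a dispatcher when it is created, for example with `Props.WithDispatcher("site-audit-telemetry-dispatcher")`, so that slow script I/O cannot starve the telemetry drain.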

Failover Behavior

Detection

  • Akka.NET Cluster monitors node health via heartbeat.
  • If the active node becomes unreachable, the standby node detects the failure and promotes itself to active.

Central Failover

  • The new active node takes over all central responsibilities.
  • In-progress deployments are treated as failed — engineers must retry.
  • The UI session may be interrupted — users reconnect to the new active node.
  • No message buffering at central — no state to recover beyond what's in MS SQL.

Site Failover

  • The new active node takes over:
    • The Deployment Manager singleton restarts and re-creates the full Instance Actor hierarchy by reading deployed configurations from local SQLite. Each Instance Actor spawns its child Script and Alarm Actors.
    • Data collection (Data Connection Layer re-establishes subscriptions as Instance Actors register their data source references).
    • Store-and-forward delivery (buffer is already replicated locally).
  • Active debug view streams from central are interrupted — the engineer must re-open them.
  • Health reporting resumes from the new active node.
  • Alarm states are re-evaluated from incoming values (alarm state is in-memory only).

Split-Brain Resolution

The system uses the Akka.NET keep-oldest split-brain resolver strategy:

  • On a network partition, the node that has been in the cluster longest remains active. The younger node downs itself.
  • Stable-after duration: 15 seconds. The cluster membership must remain stable (no changes) for 15 seconds before the resolver acts to down unreachable nodes. This prevents premature downing during startup or rolling restarts.
  • down-if-alone = on: The keep-oldest resolver is configured with down-if-alone enabled. If the oldest node finds itself alone (no other reachable members), it downs itself rather than continuing as a single-node cluster. This prevents the oldest node from running in isolation during a network partition while the younger node also forms its own cluster.
  • Why keep-oldest: With only two nodes, quorum-based strategies (static-quorum, keep-majority) cannot distinguish "one node crashed" from "network partition" — both sides see fewer than quorum and both would down themselves, resulting in total cluster shutdown. Keep-oldest with down-if-alone provides safe singleton ownership — at most one node runs the cluster singleton at any time.
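The settings above might translate into HOCON along these lines. This is a sketch under assumptions: the provider class name is the one used by recent Akka.NET releases and should be checked against the version in use, and in this project the Host generates the HOCON from ClusterOptions rather than reading a hand-written file.

```hocon
akka.cluster {
  # Assumed provider class name for the built-in split-brain resolver;
  # verify against the Akka.NET version the Host references.
  downing-provider-class = "Akka.Cluster.SBR.SplitBrainResolverProvider, Akka.Cluster"

  split-brain-resolver {
    # Keep the partition side that contains the oldest member.
    active-strategy = keep-oldest

    # Membership must be unchanged for this long before the resolver acts.
    stable-after = 15s

    keep-oldest {
      # Enable the down-if-alone behavior described above.
      down-if-alone = on
    }
  }
}
```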

Single-Node Operation

akka.cluster.min-nr-of-members must be set to 1. After failover, only one node is running. If the value were 2, the surviving node would wait for a second member before allowing cluster singletons (such as the Site Runtime Deployment Manager) to start, blocking all data collection and script execution indefinitely.
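As a HOCON fragment, the requirement is a single key. The nesting shown here is only one way to write it; the dotted form `akka.cluster.min-nr-of-members = 1` is equivalent.

```hocon
akka.cluster {
  # A lone node must be allowed to reach Up and host cluster singletons.
  # With a value of 2, a single surviving node would wait for a partner.
  min-nr-of-members = 1
}
```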

Failure Detection Timing

Configurable defaults for heartbeat and failure detection:

| Setting | Default | Description |
|---|---|---|
| Heartbeat interval | 2 seconds | Frequency of health check messages between nodes |
| Failure detection threshold | 10 seconds | Time without heartbeat before a node is considered unreachable |
| Stable-after (split-brain) | 15 seconds | Time cluster must be stable before resolver acts |
| Total failover time | ~25 seconds | Detection (10s) + stable-after (15s) + singleton restart |

These values balance failover speed with stability: a failover completes in roughly 25 seconds, which keeps data collection gaps short, while the 10-second detection window tolerates brief network hiccups without triggering unnecessary failovers.
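One plausible mapping of these defaults onto Akka.NET keys is sketched below. The mapping of "failure detection threshold" to `acceptable-heartbeat-pause` is an assumption; the failure detector also has a separate, unitless `threshold` setting that is left untouched here, and the source does not say which key ClusterOptions actually drives.

```hocon
akka.cluster {
  failure-detector {
    # Heartbeat interval: 2 seconds.
    heartbeat-interval = 2s

    # Assumed mapping for the 10-second detection window: how long
    # heartbeats may be missing before the node is marked unreachable.
    acceptable-heartbeat-pause = 10s
  }

  # Stable-after (split-brain): 15 seconds.
  split-brain-resolver.stable-after = 15s
}
```

With these values the worst-case path is about 10 s of detection plus 15 s of stable-after, that is 25 s, before the singleton restart time is added.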

Dual-Node Recovery

If both nodes in a cluster fail simultaneously (e.g., site power outage):

  1. No manual intervention required. Since both nodes are configured as seed nodes, whichever node starts first forms a new cluster. The second node joins when it starts.
  2. State recovery (each node has its own local copy of all required data):
    • Site clusters: The Deployment Manager singleton reads deployed configurations from local SQLite and re-creates the full Instance Actor hierarchy. Store-and-forward buffers are already persisted locally. Alarm states re-evaluate from incoming data values.
    • Central cluster: All state is in MS SQL (configuration database). The active node resumes normal operation.
  3. The keep-oldest resolver handles the "both starting fresh" case naturally — there is no pre-existing cluster to conflict with.

Graceful Shutdown & Singleton Handover

When a node is stopped for planned maintenance (Windows Service stop), CoordinatedShutdown triggers a graceful leave from the cluster. This enables the Cluster Singleton (Site Runtime Deployment Manager) to hand over to the other node in seconds (limited by the hand-over retry interval) rather than waiting for the full failure detection timeout (~25 seconds).

Configuration required:

  • akka.coordinated-shutdown.run-by-clr-shutdown-hook = on
  • akka.cluster.run-coordinated-shutdown-when-down = on

The Host component wires CoordinatedShutdown into the Windows Service lifecycle (see REQ-HOST-9).
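Written as nested HOCON, the two required settings sit in different subtrees. The sketch below also shows the singleton hand-over retry interval that bounds a graceful handover; the `1s` value is illustrative, not a documented project default.

```hocon
akka {
  coordinated-shutdown {
    # Run CoordinatedShutdown when the CLR process is shutting down,
    # which covers a Windows Service stop.
    run-by-clr-shutdown-hook = on
  }

  cluster {
    # Also run CoordinatedShutdown if this node is downed by the cluster.
    run-coordinated-shutdown-when-down = on

    singleton {
      # Illustrative value: how often the hand-over is retried
      # while the singleton moves to the other node.
      hand-over-retry-interval = 1s
    }
  }
}
```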

Node Configuration

Each node is configured with:

  • Cluster seed nodes: Both nodes are seed nodes — each node lists both itself and its partner. Either node can start first and form the cluster; the other joins when it starts. No startup ordering dependency.
  • Cluster role: Central or Site (plus site identifier for site clusters).
  • Akka.NET remoting: Hostname/port for inter-node and inter-cluster communication (default 8081 central, 8082 site).
  • gRPC port (site nodes only): Dedicated HTTP/2 port for the SiteStreamGrpcServer (default 8083). Separate from the Akka remoting port — gRPC uses Kestrel, Akka uses its own TCP transport.
  • Local storage paths: SQLite database locations (site nodes only).
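For a central node, the Akka-related part of this configuration might look like the sketch below. The actor system name `scadalink` and both hostnames are hypothetical, the exact role string is assumed from the text above, and how the site identifier is encoded for site nodes is not specified in this document, so it is not shown. The gRPC port and SQLite paths are not Akka settings and are configured outside HOCON.

```hocon
akka {
  actor.provider = cluster

  remote.dot-netty.tcp {
    hostname = "central-a.example.local"   # hypothetical hostname of this node
    port = 8081                            # default remoting port for central
  }

  cluster {
    # Both nodes list both addresses, so either can start first.
    seed-nodes = [
      "akka.tcp://scadalink@central-a.example.local:8081",
      "akka.tcp://scadalink@central-b.example.local:8081"
    ]
    roles = ["Central"]                    # assumed role string
  }
}
```

The partner node would use the same `seed-nodes` list and differ only in `hostname`; a site node would use port 8082 and the site role.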

Windows Service

  • Each node runs as a Windows service for automatic startup and recovery.
  • Service configuration includes Akka.NET cluster settings and component-specific configuration.

Platform

  • OS: Windows Server.
  • Runtime: .NET (Akka.NET).
  • Cluster: Akka.NET Cluster (application-level, not Windows Server Failover Clustering).

Dependencies

  • Akka.NET: Core actor system, cluster, remoting, and cluster singleton libraries.
  • Windows: Service hosting, networking.
  • MS SQL (central only): Database connectivity.
  • SQLite (sites only): Local storage.

Interactions

  • All components: Every component runs within the Akka.NET actor system managed by this infrastructure.
  • Site Runtime: The Deployment Manager singleton relies on Akka.NET cluster singleton support provided by this infrastructure.
  • Communication Layer: Built on top of the Akka.NET remoting provided here.
  • Health Monitoring: Reports node status (active/standby) as a health metric.