
Joseph Doherty 34694adba2 Apply Codex review findings across all 17 components

Template Engine: add composed member addressing (path-qualified canonical names),
override granularity per entity type, semantic validation (call targets, arg types),
graph acyclicity enforcement, revision hashes for flattened configs.

Deployment Manager: add deployment ID + idempotency, per-instance operation lock
covering all mutating commands, state transition matrix, site-side apply atomicity
(all-or-nothing), artifact version compatibility policy.

Site Runtime: add script trust model (forbidden APIs, execution timeout, constrained
compilation), concurrency/serialization rules (Instance Actor serializes mutations),
site-wide stream backpressure (per-subscriber buffering, fire-and-forget publish).

Communication: add application-level correlation IDs for protocol safety beyond
Akka.NET transport guarantees.

External System Gateway: add 408/429 as transient errors, CachedCall idempotency
note, dedicated dispatcher for blocking I/O isolation.

Health Monitoring: add monotonic sequence numbers to prevent stale report overwrites.

Security: require LDAPS/StartTLS for LDAP connections.

Central UI: add failover behavior (SignalR reconnect, JWT survives, shared Data
Protection keys, load balancer readiness).

Cluster Infrastructure: add down-if-alone=on for safe singleton ownership.

Site Event Logging: clarify active-node-only logging (no replication), add 1GB
storage cap with oldest-first purge.

Host: add readiness gating (health check endpoint, no traffic until operational).

Commons: add message contract versioning policy (additive-only evolution).

Configuration Database: add optimistic concurrency on deployment status records.

2026-03-16 09:06:12 -04:00

7.2 KiB


Component: Cluster Infrastructure

Purpose

The Cluster Infrastructure component manages the Akka.NET cluster setup, active/standby node roles, failover detection, and the foundational runtime environment on which all other components run. It provides the base layer for both central and site clusters.

Location

Both central and site clusters.

Responsibilities

Bootstrap the Akka.NET actor system on each node.
Form a two-node cluster (active/standby) using Akka.NET Cluster.
Manage leader election and role assignment (active vs. standby).
Detect node failures and trigger failover.
Provide the Akka.NET remoting infrastructure for inter-node and inter-cluster communication.
Support cluster singleton hosting (used by the Site Runtime Deployment Manager singleton on site clusters).
Manage Windows service lifecycle (start, stop, restart) on each node.

Cluster Topology

Central Cluster

Two nodes forming an Akka.NET cluster.
One active node runs all central components (Template Engine, Deployment Manager, Central UI, etc.).
One standby node is ready to take over on failover.
Connected to MS SQL databases (Config DB, Machine Data DB).

Site Cluster (per site)

Two nodes forming an Akka.NET cluster.
One active node runs all site components (Site Runtime, Data Connection Layer, Store-and-Forward Engine, etc.).
The Site Runtime Deployment Manager runs as an Akka.NET cluster singleton on the active node, owning the full Instance Actor hierarchy.
One standby node receives replicated store-and-forward data and is ready to take over.
Connected to local SQLite databases (store-and-forward buffer, event logs, deployed configurations).
Connected to machines via data connections (OPC UA, custom protocol).

Failover Behavior

Detection

Akka.NET Cluster monitors node health through periodic heartbeats exchanged between nodes.
If the active node becomes unreachable, the standby node detects the failure and promotes itself to active.
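
The detection rule above can be modelled as a fixed-timeout check. This is a deliberately simplified sketch: Akka.NET's default cluster failure detector is a phi accrual detector that uses heartbeat history rather than a hard cutoff, and the class name `HeartbeatMonitor` and its methods are hypothetical. The 10-second value is the default listed under Failure Detection Timing.

```python
from dataclasses import dataclass


@dataclass
class HeartbeatMonitor:
    """Simplified fixed-timeout model of heartbeat-based failure detection.

    Hypothetical illustration only; not the Akka.NET phi accrual detector.
    """

    threshold_s: float = 10.0      # time without a heartbeat before "unreachable"
    last_heartbeat_s: float = 0.0  # timestamp of the most recent heartbeat

    def record_heartbeat(self, now_s: float) -> None:
        """Remember when the peer was last heard from."""
        self.last_heartbeat_s = now_s

    def is_unreachable(self, now_s: float) -> bool:
        """True once the silence has lasted at least threshold_s seconds."""
        return (now_s - self.last_heartbeat_s) >= self.threshold_s
```

In this model the standby would call `record_heartbeat` on every heartbeat from the active node and poll `is_unreachable` to decide whether the active node should be treated as failed.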

Central Failover

The new active node takes over all central responsibilities.
In-progress deployments are treated as failed — engineers must retry.
The UI session may be interrupted — users reconnect to the new active node.
Central does not buffer messages, so there is no state to recover beyond what is stored in MS SQL.

Site Failover

The new active node takes over:
- The Instance Actor hierarchy: the Deployment Manager singleton restarts and re-creates the full hierarchy by reading deployed configurations from local SQLite. Each Instance Actor spawns its child Script and Alarm Actors.
- Data collection: the Data Connection Layer re-establishes subscriptions as Instance Actors register their data source references.
- Store-and-forward delivery: the buffer is already replicated locally.
Active debug view streams from central are interrupted — the engineer must re-open them.
Health reporting resumes from the new active node.
Alarm states are re-evaluated from incoming values (alarm state is in-memory only).
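
The rebuild step after a site failover can be sketched as a read of the local SQLite store followed by re-creation of each instance and its children. Everything named here is an assumption for illustration: the tables `deployed_instances` and `deployed_children`, their columns, and the `InstanceActor` dataclass, which stands in for a real Akka.NET actor.

```python
import sqlite3
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class InstanceActor:
    """Stand-in for an Instance Actor and the names of its child actors."""
    instance_id: str
    script_actors: List[str] = field(default_factory=list)
    alarm_actors: List[str] = field(default_factory=list)


def rebuild_hierarchy(conn: sqlite3.Connection) -> Dict[str, InstanceActor]:
    """Re-create every instance from locally stored deployed configurations.

    Alarm state is intentionally not loaded: it is in-memory only and is
    re-evaluated from incoming values after the rebuild.
    """
    # One Instance Actor per deployed instance (hypothetical schema).
    instances = {
        row[0]: InstanceActor(row[0])
        for row in conn.execute("SELECT instance_id FROM deployed_instances")
    }
    # Each Instance Actor then spawns its Script and Alarm children.
    children = conn.execute(
        "SELECT instance_id, kind, name FROM deployed_children ORDER BY name"
    )
    for instance_id, kind, name in children:
        actor = instances[instance_id]
        if kind == "script":
            actor.script_actors.append(name)
        elif kind == "alarm":
            actor.alarm_actors.append(name)
    return instances
```

Because the function depends only on the local database, the same routine would serve both a failover and a cold start after a dual-node outage.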

Split-Brain Resolution

The system uses the Akka.NET keep-oldest split-brain resolver strategy:

On a network partition, the node that has been in the cluster longest remains active. The younger node downs itself.
Stable-after duration: 15 seconds. The cluster membership must remain stable (no changes) for 15 seconds before the resolver acts to down unreachable nodes. This prevents premature downing during startup or rolling restarts.
down-if-alone = on: The keep-oldest resolver is configured with down-if-alone enabled. If the oldest node finds itself alone (no other reachable members), it downs itself rather than continuing as a single-node cluster. This prevents the oldest node from running in isolation during a network partition while the younger node also forms its own cluster.
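Why keep-oldest: With only two nodes, a surviving node cannot distinguish "one node crashed" from "network partition", so quorum-based strategies are a poor fit. Under static-quorum with a quorum of two, each side sees fewer than quorum and downs itself, resulting in total cluster shutdown. Under keep-majority, a one-to-one split is a tie that is broken in favor of the side holding the lowest address, so a crash of that particular node would also leave no survivor. Keep-oldest with down-if-alone provides safe singleton ownership: at most one node runs the cluster singleton at any time.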
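
The quorum argument can be checked with a small model of the static-quorum rule, where a side keeps running only if it still sees at least the configured number of members. This is a sketch under stated assumptions: the function name and the scenario labels are hypothetical, and a quorum size of 2 is assumed as the smallest value that guarantees a single owner in a two-node cluster.

```python
QUORUM_SIZE = 2  # assumption: smallest quorum that rules out two owners


def survives_static_quorum(reachable_members: int,
                           quorum_size: int = QUORUM_SIZE) -> bool:
    """A side survives only if it sees at least quorum_size members,
    counting itself."""
    return reachable_members >= quorum_size


# What a single remaining node observes in each scenario: only itself.
observations = {
    "peer crashed": 1,
    "network partition": 1,
}

# The observation is identical, so the decision is identical too.
decisions = {
    scenario: survives_static_quorum(count)
    for scenario, count in observations.items()
}

# Lowering the quorum to 1 keeps the node alive after a crash, but on a
# partition both sides would then survive, which is a split brain.
both_sides_survive_with_quorum_1 = (
    survives_static_quorum(1, quorum_size=1)
    and survives_static_quorum(1, quorum_size=1)
)
```

In this model `decisions` maps both scenarios to `False`, meaning the remaining node shuts down even after a plain crash, while a quorum of 1 trades that outage for two simultaneous owners.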

Failure Detection Timing

Configurable defaults for heartbeat and failure detection:

Setting	Default	Description
Heartbeat interval	2 seconds	Frequency of health check messages between nodes
Failure detection threshold	10 seconds	Time without heartbeat before a node is considered unreachable
Stable-after (split-brain)	15 seconds	Time cluster must be stable before resolver acts
Total failover time	~25 seconds, plus singleton restart	Detection (10s) + stable-after (15s); singleton restart time is added on top

These values balance failover speed with stability — fast enough that data collection gaps are small, tolerant enough that brief network hiccups don't trigger unnecessary failovers.
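
The arithmetic behind the table can be written out as a small helper. This is a sketch only: the class `FailoverTiming` is hypothetical, and `singleton_restart_s` is a placeholder parameter because the document gives no figure for singleton restart time.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailoverTiming:
    """Default timing values from the table above, in seconds."""
    heartbeat_interval_s: float = 2.0
    failure_detection_threshold_s: float = 10.0
    stable_after_s: float = 15.0

    def missed_heartbeats_before_unreachable(self) -> int:
        """Whole heartbeat intervals that fit into the detection threshold."""
        return int(self.failure_detection_threshold_s // self.heartbeat_interval_s)

    def total_failover_s(self, singleton_restart_s: float = 0.0) -> float:
        """Detection + stable-after + an assumed singleton restart time."""
        return (self.failure_detection_threshold_s
                + self.stable_after_s
                + singleton_restart_s)
```

With the defaults, five heartbeat intervals elapse before a node is considered unreachable, and `total_failover_s()` returns 25.0 before any singleton restart time is added.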

Dual-Node Recovery

If both nodes in a cluster fail simultaneously (e.g., site power outage):

No manual intervention required. Since both nodes are configured as seed nodes, whichever node starts first forms a new cluster. The second node joins when it starts.
State recovery (each node has its own local copy of all required data):
- Site clusters: The Deployment Manager singleton reads deployed configurations from local SQLite and re-creates the full Instance Actor hierarchy. Store-and-forward buffers are already persisted locally. Alarm states re-evaluate from incoming data values.
- Central cluster: All state is in MS SQL (configuration database). The active node resumes normal operation.
The keep-oldest resolver handles the "both starting fresh" case naturally — there is no pre-existing cluster to conflict with.

Node Configuration

Each node is configured with:

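Cluster seed nodes: Both nodes are seed nodes: each node lists both itself and its partner. Either node can start first and form the cluster; the other joins when it starts. No startup ordering dependency. Note that Akka.NET allows only the first entry in a node's seed list to join itself and form a new cluster, so this behavior relies on each node listing itself first.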
Cluster role: Central or Site (plus site identifier for site clusters).
Akka.NET remoting: Hostname/port for inter-node and inter-cluster communication.
Local storage paths: SQLite database locations (site nodes only).
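
The per-node settings listed above can be sketched as a small configuration object. The class `NodeConfig`, its field names, and the example host names in the usage note are hypothetical. The `akka.tcp://system@host:port` address form is the usual Akka.NET remoting address format, and listing the node itself first is an assumption based on Akka's rule that only the first seed node may join itself.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class NodeConfig:
    """Hypothetical per-node settings for one member of a two-node cluster."""
    system_name: str
    hostname: str
    port: int
    partner_hostname: str
    partner_port: int
    role: str                          # "Central" or "Site"
    site_id: Optional[str] = None      # required for Site nodes
    sqlite_dir: Optional[str] = None   # local storage path, Site nodes only

    def __post_init__(self) -> None:
        if self.role not in ("Central", "Site"):
            raise ValueError(f"unknown role: {self.role}")
        if self.role == "Site" and (self.site_id is None or self.sqlite_dir is None):
            raise ValueError("Site nodes need a site_id and a SQLite path")
        if self.role == "Central" and self.sqlite_dir is not None:
            raise ValueError("SQLite paths apply to Site nodes only")

    def _address(self, host: str, port: int) -> str:
        return f"akka.tcp://{self.system_name}@{host}:{port}"

    def seed_nodes(self) -> List[str]:
        """Both nodes are seeds; this node lists itself first (assumption)."""
        return [
            self._address(self.hostname, self.port),
            self._address(self.partner_hostname, self.partner_port),
        ]
```

For example, `NodeConfig("scada", "site1-a", 8081, "site1-b", 8081, "Site", "site1", "C:/data")` and its partner with the host names swapped would produce the same two seed addresses, each with its own address first.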

Windows Service

Each node runs as a Windows service for automatic startup and recovery.
Service configuration includes Akka.NET cluster settings and component-specific configuration.

Platform

OS: Windows Server.
Runtime: .NET (Akka.NET).
Cluster: Akka.NET Cluster (application-level, not Windows Server Failover Clustering).

Dependencies

Akka.NET: Core actor system, cluster, remoting, and cluster singleton libraries.
Windows: Service hosting, networking.
MS SQL (central only): Database connectivity.
SQLite (sites only): Local storage.

Interactions

All components: Every component runs within the Akka.NET actor system managed by this infrastructure.
Site Runtime: The Deployment Manager singleton relies on Akka.NET cluster singleton support provided by this infrastructure.
Communication Layer: Built on top of the Akka.NET remoting provided here.
Health Monitoring: Reports node status (active/standby) as a health metric.
