lmxopcua/docs/plans/2026-05-26-akka-hosting-alignment-design.md
Joseph Doherty ef4a70751c docs(plans): add v2 Akka + fused hosting alignment design
Captures the brainstormed design to align OtOpcUa with ScadaLink:
single role-gated binary, Akka.NET cluster with admin/driver roles,
cluster singletons for control plane, per-node actor hierarchy for
OPC UA runtime, dual-endpoint warm redundancy preserved with
ServiceLevel driven by Akka leader, cookie+JWT auth, Traefik routing,
and ScadaLink-style live-edit + deploy model replacing the
draft/publish ConfigGeneration lifecycle.
2026-05-26 03:04:21 -04:00


OtOpcUa v2 — Akka.NET + Fused Hosting Alignment with ScadaLink

Status: Design approved, ready for implementation planning
Date: 2026-05-26
Branch: v2-akka-fuse
Sister project reference: ~/Desktop/scadalink-design (ScadaLink)

1. Motivation

OtOpcUa today runs as three separate processes (OtOpcUa.Server OPC UA host, OtOpcUa.Admin Blazor Server web UI, optional OtOpcUaWonderwareHistorian Framework sidecar) with manual operator-driven warm-redundancy failover. The sister project ScadaLink — owned by the same developer — solved similar problems with a fused single-binary, role-gated hosting model on top of an Akka.NET cluster.

The motivation for this refactor is twofold:

  1. Consistency. A single developer (the project owner) moves between OtOpcUa and ScadaLink frequently. Sharing patterns — hosting, auth, actor hierarchy, deployment model — reduces cognitive overhead and makes fixes portable.
  2. Real HA improvements. Upgrade OtOpcUa's manual operator-driven failover to automatic, Akka-cluster-driven failover with Traefik routing for the web UI. Preserve OPC UA dual-endpoint client-side failover semantics (clients connect to both nodes and pick based on ServiceLevel), now driven automatically by Akka cluster leadership.

2. Architecture overview

One binary, role-gated. OtOpcUa.Host (Microsoft.NET.Sdk.Web, .NET 10) replaces OtOpcUa.Server and OtOpcUa.Admin. Same binary on every node. Role configured via OTOPCUA_ROLES environment variable.

Two Akka roles, single cluster:

  • admin — hosts Blazor web UI + cluster singletons. Singletons pinned via ClusterSingletonManagerSettings.WithRole("admin"). Traefik routes / to whichever Admin-role node /health/active reports as leader.
  • driver — hosts OPC UA endpoint + per-node DriverHostActor hierarchy. Every Driver-role node always serves OPC UA; ServiceLevel computed by RedundancyStateActor is broadcast back to each Driver node and used to publish to the local OPC UA address space.

Roles are additive: OTOPCUA_ROLES=admin, OTOPCUA_ROLES=driver, or OTOPCUA_ROLES=admin,driver. Small deployments run both roles on both nodes; larger deployments separate them.
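As a language-neutral illustration of the additive role gating described above, here is a minimal Python sketch (the real host would be C#). The function names, the case-insensitive parsing, and the inclusion of a "dev" role are assumptions made for the example, not part of the design.

```python
# Hypothetical sketch: parse OTOPCUA_ROLES and decide what a node starts.
# "dev" is assumed from the Mac dev note; the names here are illustrative.
KNOWN_ROLES = {"admin", "driver", "dev"}


def parse_roles(raw: str) -> frozenset:
    """Split a comma-separated role string into a validated set of roles."""
    roles = {part.strip().lower() for part in raw.split(",") if part.strip()}
    if not roles:
        raise ValueError("OTOPCUA_ROLES must name at least one role")
    unknown = roles - KNOWN_ROLES
    if unknown:
        raise ValueError(f"unknown role(s): {sorted(unknown)}")
    return frozenset(roles)


def startup_plan(roles) -> dict:
    """Roles are additive: each role independently enables its subsystem."""
    return {
        "web_ui_and_singletons": "admin" in roles,
        "opcua_endpoint_and_driver_host": "driver" in roles,
    }


if __name__ == "__main__":
    for value in ("admin", "driver", "admin,driver"):
        print(value, startup_plan(parse_roles(value)))
```

Rejecting unknown role names at startup is one possible choice; silently ignoring them would also work but tends to hide typos in service configuration.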

Per-role leadership. Cluster.Get(system).State.RoleLeader("driver") drives OPC UA ServiceLevel. RoleLeader("admin") drives /health/active (Traefik routing). These are independent — admin and driver leadership can land on different nodes if separated.

Cluster membership. Both nodes act as seed nodes. Split-brain resolver: keep-oldest with down-if-alone = on and 15s stable-after. Failure detector: 2s heartbeat interval / 10s threshold. CoordinatedShutdown provides graceful singleton handover. All values are copied exactly from the ScadaLink tuning.

OPC UA dual-endpoint preserved. Driver-role nodes all bind opc.tcp://0.0.0.0:4840. Clients still see N endpoints in ServerUriArray and fail over via ServiceLevel. OPC UA spec compliance unchanged from today.

Mac dev: role admin,driver,dev. The dev role short-circuits Windows-only driver registration (Galaxy, Wonderware) with explicit [DEV-STUB] log lines.

3. Project & process restructure

Single solution, ScadaLink-style folder layout. Existing OtOpcUa naming convention (ZB.MOM.WW.OtOpcUa.*) preserved.

New entry point & deletions

| Action | Project |
|---|---|
| New | OtOpcUa.Host: Microsoft.NET.Sdk.Web, single Program.cs, role-gated startup, AddWindowsService |
| Delete | OtOpcUa.Server (content migrates) |
| Delete | OtOpcUa.Admin (UI moves to library) |

New libraries

| Project | Owns | ScadaLink analog |
|---|---|---|
| OtOpcUa.Commons | Entity POCOs, interfaces, message contracts (Types/, Interfaces/, Entities/, Messages/) | ScadaLink.Commons |
| OtOpcUa.ConfigDb | EF Core DbContext, repositories, IAuditService, migrations, Data Protection key store | ScadaLink.ConfigurationDatabase |
| OtOpcUa.Cluster | Akka HOCON, AkkaHostedService, split-brain resolver config, role-aware membership helpers, IClusterRoleInfo | (split out of ScadaLink Host) |
| OtOpcUa.Security | LDAP bind, cookie+JWT hybrid, JWT issuance, role mapping, /auth/login, /auth/ping endpoints | ScadaLink.Security |
| OtOpcUa.ControlPlane | Cluster singletons: ConfigPublishCoordinator, AdminOperationsActor, AuditWriterActor, FleetStatusBroadcaster, RedundancyStateActor | ScadaLink.ManagementService |
| OtOpcUa.Runtime | Per-node actors: DriverHostActor, DriverInstanceActor, VirtualTagActor, ScriptedAlarmActor, OpcUaPublishActor, HistorianAdapterActor, PeerOpcUaProbeActor, DbHealthProbeActor | ScadaLink.SiteRuntime |
| OtOpcUa.OpcUaServer | OPC UA app host, address-space build, Phase7Composer extraction | (in ScadaLink.SiteRuntime/DCL) |
| OtOpcUa.AdminUI | Blazor components, hubs (FleetStatusHub, AlertHub, ScriptLogHub), auth state provider, MapAdminUI<TApp>() | ScadaLink.CentralUI |

Unchanged

  • Driver projects (OtOpcUa.Driver.Galaxy, .Modbus, .S7, .AbCip, .AbLegacy, .TwinCAT, .FOCAS, .OpcUaClient) — still implement IDriver, now consumed by DriverInstanceActor instead of DriverInstanceBootstrapper.
  • OtOpcUa.Driver.Historian.Wonderware — .NET Framework 4.8 sidecar, named-pipe IPC, wrapped by a HistorianAdapterActor in OtOpcUa.Runtime.
  • mxaccessgw sibling repo — unchanged; Galaxy driver still talks gRPC to it.

Tests

  • tests/OtOpcUa.Cluster.Tests — split-brain, leadership transitions
  • tests/OtOpcUa.ControlPlane.Tests — singleton actor unit tests via Akka.TestKit
  • tests/OtOpcUa.Runtime.Tests — per-node actor tests, driver lifecycle
  • tests/OtOpcUa.Security.Tests — LDAP, cookie+JWT roundtrip
  • tests/OtOpcUa.Host.IntegrationTests — 2-node in-process cluster, deployment flow, failover, Mac-safe
  • tests/OtOpcUa.OpcUa.IntegrationTests — real OPCFoundation client against stubbed Host
  • tests/OtOpcUa.E2E.Tests — full stack with Traefik (nightly CI)

Deploy

  • deploy/Install-Services.ps1 — installs one Windows Service per node (OtOpcUaHost), passes role via env var. Old script replaced.
  • deploy/traefik/ — Windows Traefik config + service registration for the leader-routed /health/active.
  • docker-dev/ (new, optional) — 2-node Mac dev compose with stubbed drivers + LDAP + SQL Server + Traefik.

Solution file: OtOpcUa.slnx (matches ScadaLink convention; switch from current .sln).

4. Actor hierarchy

Per-node tree

Rooted under OtOpcUa.Runtime, one tree per Driver-role node:

DriverHostActor                       (per-node coordinator, started by Host)
├─ DriverInstanceActor (per DriverInstance row)
│   └─ children = pooled or per-subscription work
├─ VirtualTagActor (per VirtualTag row)
├─ ScriptedAlarmActor (per ScriptedAlarm row)
├─ OpcUaPublishActor (per-node bridge to OPCFoundation address space)
├─ HistorianAdapterActor (per-node, wraps Wonderware named-pipe sidecar)
├─ PeerOpcUaProbeActor (per-node, tests peer OPC UA stack health)
└─ DbHealthProbeActor (per-node, cached DB health probe)

Cluster singletons

Pinned to admin role via ClusterSingletonManagerSettings.WithRole("admin"):

| Actor | Owns | Notes |
|---|---|---|
| ConfigPublishCoordinator | The deploy protocol. Writes Deployment row, broadcasts DispatchDeployment(deploymentId) via DistributedPubSub to every DriverHostActor, tracks apply ACKs per node. | Replaces ApplyLeaseRegistry. Resumes after failover by re-reading ConfigDb state; no Akka.Persistence. |
| AdminOperationsActor | All mutating admin ops (CRUD on equipment, drivers, scripts, namespaces, ACLs). Wraps each in an audit envelope. | UI calls via ClusterSingletonProxy (in-process when UI is on Admin node). |
| AuditWriterActor | Receives AuditEvent telemetry from any node, batch-inserts into ConfigAuditLog. | Idempotent on EventId. |
| FleetStatusBroadcaster | Aggregates Akka cluster member events + per-node DriverHostStatus heartbeats. Publishes diffs to IHubContext<FleetStatusHub> and IHubContext<AlertHub>. | Push-driven; replaces today's 5s FleetStatusPoller. |
| RedundancyStateActor | Subscribes to ClusterEvent.IMemberEvent + ClusterEvent.LeaderChanged + per-node health probes. Computes ServiceLevel byte + ServerUriArray per Driver node. Publishes to DistributedPubSub topic redundancy-state. | Source of truth for OPC UA redundancy. Local OpcUaPublishActor subscribes and writes to its OPCFoundation stack. |

Supervision

| Actor | Strategy |
|---|---|
| DriverHostActor | Resume |
| DriverInstanceActor | Restart with backoff (1s → 30s, ×1.5, jitter) |
| VirtualTagActor | Restart with backoff |
| ScriptedAlarmActor | Restart with backoff; preserve alarm state via PreRestart hook |
| OpcUaPublishActor | Resume |
| HistorianAdapterActor | Restart with backoff; SQLite store-and-forward buffers during pipe outage |
| All singletons | Resume; resumable state in ConfigDb |
| Script execution actors (short-lived) | Stop on failure |
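The "1s → 30s, ×1.5, jitter" restart schedule for DriverInstanceActor can be illustrated with a small Python generator. This is only a sketch of the arithmetic: the ±20% jitter fraction is an assumption (the design does not state a value), and in the real system the schedule would be supplied by the actor framework's backoff supervision rather than hand-written.

```python
import random
from itertools import islice


def backoff_delays(initial=1.0, maximum=30.0, factor=1.5, jitter=0.2, rng=None):
    """Yield restart delays in seconds: start at `initial`, multiply by
    `factor` after each attempt, never exceed `maximum`.
    `jitter` is a fraction of the current delay (assumed value, not from the design)."""
    rng = rng or random.Random()
    delay = initial
    while True:
        spread = delay * jitter
        # Randomize around the nominal delay, then clamp to the ceiling.
        yield min(maximum, delay + rng.uniform(-spread, spread))
        delay = min(maximum, delay * factor)


if __name__ == "__main__":
    # Without jitter the nominal schedule is visible directly.
    print(list(islice(backoff_delays(jitter=0.0), 10)))
```

With jitter disabled the nominal delay reaches the 30s ceiling on the tenth attempt, which bounds how long a flapping driver stays in a fast restart loop.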

State machines

  • DriverInstanceActor — Become/Stash for Connecting → Connected → Reconnecting → Failed. Bad-quality publish on disconnect; transparent re-subscribe on reconnect. Write failures returned synchronously via Ask from OpcUaPublishActor.
  • ConfigPublishCoordinator: Idle → Publishing → AwaitingApplyAcks → Sealed, with timeout-driven escalation if a node fails to ack within ApplyMaxDuration (default 10 min).
  • RedundancyStateActor — recomputes on every membership event, debounced 250ms to coalesce bursts.
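To make the 250ms debounce concrete, the sketch below computes when a recomputation would fire for a burst of membership events. It assumes a trailing-edge debounce (each new event restarts the 250ms wait), which is one common reading of "debounced"; the design does not say which variant is intended. Times are integer milliseconds to keep the arithmetic exact.

```python
def debounce_fire_times(event_times_ms, window_ms=250):
    """Return the times at which a trailing-edge debounce would fire.

    Each event (re)arms a timer for `window_ms` later; the timer fires only
    if no further event arrives before it expires.
    """
    fires = []
    pending = None  # time at which the armed timer would fire
    for t in sorted(event_times_ms):
        if pending is not None and t >= pending:
            fires.append(pending)  # timer expired before this event arrived
        pending = t + window_ms
    if pending is not None:
        fires.append(pending)  # the last armed timer eventually fires
    return fires


if __name__ == "__main__":
    # Three events in quick succession coalesce into one recomputation.
    print(debounce_fire_times([0, 100, 200, 1000]))
```

In this reading, a burst of three membership events within 200ms causes a single ServiceLevel recomputation 250ms after the last one.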

Communication conventions

  • Tell for hot-path internal traffic (driver values, alarm state changes, publish broadcasts).
  • Ask only at system boundaries (UI controller → AdminOperationsActor, with explicit timeout + cancellation token).
  • DistributedPubSub for cluster-wide broadcasts (DispatchDeployment, RedundancyStateChanged, FleetStatusChanged).
  • Application-level correlation IDs on every request/response message.
  • Messages live in OtOpcUa.Commons.Messages.{Drivers,Deploy,Admin,Audit,Redundancy} — additive-only evolution.

Singleton persistence

No Akka.Persistence. Each singleton reads its resumable state from ConfigDb on PreStart (e.g., ConfigPublishCoordinator reads the current in-flight Deployment row + per-node NodeDeploymentState) and writes on every state transition.

Mac-dev stubs

DevNode role short-circuits driver registration. DriverInstanceActor for any Galaxy/Wonderware row enters a Stubbed Become state that returns deterministic test values. Logged at INFO with [DEV-STUB] driver={Name} reason=windows-only.

5. Web hosting, auth, and SignalR

Kestrel startup gated by admin role

Program.cs builds WebApplicationBuilder, registers all services, but only calls app.MapBlazor<App>(), app.MapHub<...>(), app.MapStaticAssets(), and auth endpoints when admin ∈ roles. Driver-only nodes still bind Kestrel for /healthz on :4841 and nothing else.

Authentication — cookie+JWT hybrid

| Layer | Config |
|---|---|
| Cookie scheme | OtOpcUa.Auth, HttpOnly, SameSite=Strict, Secure (prod) / SameAsRequest (dev). Sliding 30-min idle timeout. |
| Embedded JWT | HMAC-SHA256, 15-min expiry, claims = sub, roles, nodeAcls. |
| LDAP bind | LdapAuthService.AuthenticateAsync(user, pw) at /auth/login POST; preserved from current OtOpcUa.Admin/Security. |
| Role mapping | RoleMapper.MapGroupsToRolesAsync(): LDAP groups → FleetAdmin / ConfigEditor / ReadOnly. Stays as-is. |
| Token issuance | /auth/token returns bearer for external clients (CLI, automation). |
| Circuit expiry probe | /auth/ping returns 200/401, polled by CookieAuthenticationStateProvider to detect expiry from inside a SignalR circuit. |
| Failure mode | LDAP unreachable → new logins fail, active sessions continue. |
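For readers unfamiliar with what "HMAC-SHA256, 15-min expiry, claims = sub, roles, nodeAcls" implies on the wire, here is a standard-library Python sketch of HS256 token issuance and verification. It is illustrative only: the real implementation would use the platform's JWT library, and the function names, the iat claim, and the sample claim values are assumptions.

```python
import base64
import hashlib
import hmac
import json
import time


def _b64url(data: bytes) -> str:
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def _b64url_decode(text: str) -> bytes:
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def issue_token(key: bytes, sub, roles, node_acls, now=None, ttl_seconds=15 * 60):
    """Build a compact HS256 JWT carrying the claims named in the table."""
    now = int(time.time()) if now is None else now
    header = {"alg": "HS256", "typ": "JWT"}
    payload = {"sub": sub, "roles": roles, "nodeAcls": node_acls,
               "iat": now, "exp": now + ttl_seconds}
    signing_input = ".".join(
        _b64url(json.dumps(part, separators=(",", ":")).encode("utf-8"))
        for part in (header, payload))
    signature = hmac.new(key, signing_input.encode("ascii"), hashlib.sha256).digest()
    return signing_input + "." + _b64url(signature)


def verify_token(key: bytes, token: str, now=None):
    """Return the claims if the signature matches and the token is unexpired, else None."""
    now = int(time.time()) if now is None else now
    try:
        head, body, sig = token.split(".")
    except ValueError:
        return None
    expected = hmac.new(key, f"{head}.{body}".encode("ascii"), hashlib.sha256).digest()
    if not hmac.compare_digest(expected, _b64url_decode(sig)):
        return None
    claims = json.loads(_b64url_decode(body))
    return claims if now < claims["exp"] else None


if __name__ == "__main__":
    key = b"k" * 32  # placeholder key for the example only
    token = issue_token(key, "alice", ["ConfigEditor"], ["ns=2"], now=1000)
    print(verify_token(key, token, now=1000))
```

Because verification depends only on the shared key, any Admin node holding the same key material can validate a token issued by another node.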

Data Protection keys

services.AddDataProtection().PersistKeysToDbContext<OtOpcUaConfigDbContext>().SetApplicationName("OtOpcUa") — keys live in ConfigDb so a circuit started on Admin-node A survives if Traefik fails over to Admin-node B mid-session.

SignalR hubs

Three existing hubs preserved (/hubs/fleet, /hubs/alerts, /hubs/script-log):

  • Today: FleetStatusPoller polls SQL every 5s.
  • New: FleetStatusBroadcaster singleton receives Akka cluster events + per-node telemetry, pushes diffs via IHubContext<FleetStatusHub>. No polling.
  • HubTokenService bearer-token fallback retired — hubs are circuit-local, cookie auth flows through SignalR natively. External hub consumers use the bearer token from /auth/token with a JwtBearer authentication scheme declaration on the hub.

UI → backend wiring

  • Reads: Blazor components inject scoped repositories from DI and read directly from ConfigDb. No change from today.
  • Writes / mutating ops: Components inject IAdminOperationsClient — a thin wrapper around ClusterSingletonProxy to AdminOperationsActor. Mutations are Ask with a 10s timeout + correlation ID. Audit envelope built UI-side, completed singleton-side.
  • Driver diagnostics: Today's DriverDiagnosticsClient HTTP round-trip retires. UI components ask IFleetDiagnosticsClient which delegates to ClusterClientReceptionist-published actor messages.

Health endpoints

| Endpoint | Returns | Used by |
|---|---|---|
| /health/ready | 200 once Akka member is Up + ConfigDb reachable + DataProtection key ring loaded | Service supervisor readiness gate |
| /health/active | 200 only on the Admin-role leader; 503 elsewhere | Traefik, to route browser traffic to the leader |
| /healthz (existing) | 200 when Driver-role actor system is up + at least one driver registered (preserved on :4841) | Ops probes, OPC UA monitoring tools |

Traefik

Windows Service (or external box). One route: host=otopcua.* → load-balance to {admin-node-a:9000, admin-node-b:9000} with /health/active health check, sticky sessions disabled (DataProtection key sharing handles continuity).

appsettings structure

Mirrors ScadaLink's per-component options pattern: Cluster:, Security:, ConfigDb:, OpcUa:, Drivers:, Historian: sections, bound to options classes owned by their respective component projects.

6. Edit + Deploy flow (replaces draft/publish generations)

The single most consequential domain change: drop the draft/publish ConfigGeneration lifecycle. Edits are live; deploy is a snapshot+push, ScadaLink-style.

Edit model

  • Equipment, Driver, DriverInstance, Namespace, UnsItem, Script, VirtualTag, ScriptedAlarm, NodeAcl are edited directly via AdminOperationsActor. No draft staging, no ConfigGeneration lifecycle. Last-write-wins per row (rowversion column for stale-write detection only).
  • Live edits do not affect running Driver-role nodes — running stacks reflect the last-deployed state. The UI shows a "drift" indicator when live ConfigDb state differs from last sealed deployment.
  • Validation runs on edit (semantic checks: driver tag-path validity, script syntax, namespace name uniqueness) — pulled forward from deploy-time to edit-time.

Deploy model

Admin UI "Deploy" → AdminOperationsActor.Ask(StartDeployment)
AdminOperationsActor:
  → snapshot ConfigDb current state
  → ConfigComposer.Flatten() → DeploymentArtifact
  → compute RevisionHash = SHA256(canonical-serialized artifact)
  → write Deployment row (DeploymentId GUID, RevisionHash, CreatedBy, CreatedAtUtc, Status=Dispatching)
  → Ask ConfigPublishCoordinator.DispatchDeployment(deploymentId)

ConfigPublishCoordinator (cluster singleton, admin role):
  → write Deployment.Status = Dispatching
  → DistributedPubSub Publish to "deployments" topic: DispatchDeployment(deploymentId, revisionHash)
  → schedule ApplyDeadline timer (ApplyMaxDuration, default 10 min)

DriverHostActor (per node, subscribed to "deployments"):
  receive DispatchDeployment(deploymentId, revisionHash):
    → if currentDeploymentRevision == revisionHash → ack Applied (idempotent)
    → else:
      → acquire per-node ApplyLock (Become Applying(deploymentId))
      → write NodeDeploymentState row (NodeId, DeploymentId, StartedAtUtc)
      → fetch artifact: read DeploymentArtifact blob from ConfigDb by deploymentId
      → diff against current applied artifact → per-instance ApplyDelta plans
      → dispatch ApplyDelta to DriverInstanceActor / VirtualTagActor / ScriptedAlarmActor children
      → collect per-instance acks (all-or-nothing per node)
      → on full success: write GenerationSealedCache (LiteDb local), update NodeDeploymentState.AppliedAtUtc
      → on any instance Failure: rollback to previous deployment, mark NodeDeploymentState=Failed
      → Tell Coordinator: ApplyAck(deploymentId, nodeId, Applied | Failed(reason))
      → Become Steady

ConfigPublishCoordinator: collect ApplyAcks
  → all Driver nodes Applied → Deployment.Status = Sealed → DistributedPubSub Publish DeploymentSealed
  → any Failed → Deployment.Status = PartiallyFailed → broadcast DeploymentFailed
  → deadline elapsed before all acks → Deployment.Status = TimedOut → broadcast DeploymentTimedOut
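The coordinator's ack-collection rules at the end of the flow above can be sketched as a pure function, written here in Python for illustration. The function name, the string status values for acks, and the choice to report a failure as soon as any node reports Failed (rather than waiting for the remaining acks) are assumptions.

```python
def resolve_status(expected_nodes, acks, deadline_elapsed):
    """Map collected ApplyAcks to a Deployment.Status value.

    expected_nodes: iterable of Driver node ids that must ack.
    acks: dict of node id -> "Applied" or "Failed" for acks received so far.
    """
    expected = list(expected_nodes)
    if any(acks.get(node) == "Failed" for node in expected):
        return "PartiallyFailed"
    if all(acks.get(node) == "Applied" for node in expected):
        return "Sealed"
    if deadline_elapsed:
        return "TimedOut"
    return "AwaitingApplyAcks"  # keep waiting for the outstanding nodes


if __name__ == "__main__":
    print(resolve_status(["a", "b"], {"a": "Applied"}, deadline_elapsed=False))
    print(resolve_status(["a", "b"], {"a": "Applied", "b": "Applied"}, False))
```

Keeping this decision free of side effects makes the coordinator's happy-path, failure, and timeout cases easy to cover with a truth-table test.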

Per-instance operation lock

All mutating commands (deploy, disable, enable, delete) on a DriverInstance go through DriverInstanceActor, which serializes them via the actor mailbox — single-threaded by construction.

Idempotency

  • DeploymentId + RevisionHash together identify a deployment.
  • DriverHostActor seeing a DispatchDeployment whose RevisionHash matches current applied state → immediate ack Applied, no work. Safe to redeliver.
  • Phase7Composer.ComposeAsync(artifact) is pure; same artifact → same delta plan.
  • DriverInstanceActor.ApplyDelta(plan) compares against current state, applies only diffs.
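A minimal Python sketch of the idempotency rules above: a RevisionHash computed over a canonical serialization, and a dispatch handler that acks immediately when the hash matches the applied state. Canonical JSON with sorted keys is an assumption (the design only says "canonical-serialized"), as are the class name and the integrity check on the fetched artifact.

```python
import hashlib
import json


def revision_hash(artifact: dict) -> str:
    """SHA-256 over a canonical form, so key order cannot change the hash."""
    canonical = json.dumps(artifact, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class DriverHostSketch:
    """Hypothetical stand-in for the per-node dispatch handling."""

    def __init__(self):
        self.applied_hash = None
        self.apply_count = 0  # counts real apply work, for demonstration

    def on_dispatch(self, deployment_id, rev_hash, fetch_artifact):
        if rev_hash == self.applied_hash:
            # Redelivery of the already-applied revision: ack, no work.
            return ("ApplyAck", deployment_id, "Applied")
        artifact = fetch_artifact(deployment_id)
        if revision_hash(artifact) != rev_hash:
            # Assumed integrity check; not stated in the design.
            return ("ApplyAck", deployment_id, "Failed")
        self.apply_count += 1
        self.applied_hash = rev_hash
        return ("ApplyAck", deployment_id, "Applied")


if __name__ == "__main__":
    artifact = {"drivers": [{"name": "plc1", "port": 502}], "namespaces": ["Plant"]}
    h = revision_hash(artifact)
    node = DriverHostSketch()
    print(node.on_dispatch("d1", h, lambda _id: artifact))
    print(node.on_dispatch("d1", h, lambda _id: artifact), node.apply_count)
```

The second dispatch in the usage example returns Applied without incrementing the apply counter, which is the property that makes redelivery safe.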

Concurrency control

  • Last-write-wins on edits (no optimistic concurrency on Equipment, Driver, Script, etc.) — matches ScadaLink template behavior.
  • Optimistic concurrency on Deployment and NodeDeploymentState rows (rowversion column) — prevents two concurrent Coordinator instances (during failover) from corrupting state.

Singleton failover during deploy

  1. Old Coordinator wrote Deployment.Status = Dispatching + NodeDeploymentState rows before broadcast.
  2. New Coordinator on takeover queries Deployment rows with non-terminal Status.
  3. For each in-flight deployment, Ask every DriverHostActor (via cluster-aware actor selection) for current NodeDeploymentState.
  4. Recompute outstanding-ack set; resume the deadline timer with the remaining time.
  5. If apply deadline already passed → mark Deployment.Status = TimedOut for any unack'd nodes.
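Steps 4 and 5 amount to a small calculation, sketched below in Python. The sketch assumes the deadline is anchored to a persisted start timestamp (for example the Deployment row's CreatedAtUtc); the design does not say which timestamp the new Coordinator uses, and the function names are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def remaining_deadline(started_at_utc, now_utc, apply_max=timedelta(minutes=10)):
    """Time left before the apply deadline; zero if it has already passed."""
    remaining = started_at_utc + apply_max - now_utc
    return max(remaining, timedelta(0))


def outstanding_nodes(expected_nodes, node_states):
    """Nodes that have not yet reported a terminal state (Applied or Failed)."""
    return sorted(n for n in expected_nodes
                  if node_states.get(n) not in ("Applied", "Failed"))


if __name__ == "__main__":
    started = datetime(2026, 5, 26, 7, 0, tzinfo=timezone.utc)
    now = started + timedelta(minutes=4)
    print(remaining_deadline(started, now))                      # time left on takeover
    print(outstanding_nodes(["a", "b"], {"a": "Applied", "b": "Applying"}))
```

A zero result corresponds to step 5: the new Coordinator marks the deployment TimedOut for the nodes still outstanding instead of rescheduling the timer.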

Crash recovery on Driver node restart

  • DriverHostActor.PreStart reads NodeDeploymentState for self.
  • If row says Applied for some DeploymentId and matches last sealed cache → Become Steady on that artifact.
  • If row says Applying (didn't reach Applied) → discard partial state, re-fetch the artifact, replay apply (idempotent).
  • If ConfigDb unreachable → fall back to local LiteDb sealed cache, Become Stale (drops ServiceLevel via RedundancyStateActor). Background reconnect retries every 30s.
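The restart cases above can be read as a decision function. The following Python sketch is one interpretation: the state names returned, the dictionary shape of the NodeDeploymentState row, and the handling of two cases the list does not cover (no row at all, and an Applied row that does not match the cache) are assumptions.

```python
def startup_decision(db_reachable, node_row, cached_deployment_id):
    """Decide how DriverHostActor should come up after a restart.

    node_row: None, or a dict with "status" and "deployment_id" read from
    NodeDeploymentState. cached_deployment_id: id held in the local sealed cache.
    Returns (behaviour, deployment_id).
    """
    if not db_reachable:
        # Serve from the local sealed cache with a lowered ServiceLevel.
        return ("Stale", cached_deployment_id)
    if node_row is None:
        # Assumption: a node with no recorded state waits for a dispatch.
        return ("AwaitDispatch", None)
    if node_row["status"] == "Applied" and node_row["deployment_id"] == cached_deployment_id:
        return ("Steady", cached_deployment_id)
    # Applying (interrupted) or a cache mismatch: re-fetch and replay.
    return ("ReplayApply", node_row["deployment_id"])


if __name__ == "__main__":
    print(startup_decision(True, {"status": "Applying", "deployment_id": "d2"}, "d1"))
    print(startup_decision(False, None, "d1"))
```

Replaying is safe in the mismatch case only because apply is idempotent, as the preceding sections require.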

Schema migration from today

| Today | New |
|---|---|
| ConfigGeneration (Draft/Published/Sealed lifecycle) | Dropped |
| ClusterNodeGenerationState | Renamed → NodeDeploymentState with (NodeId, DeploymentId, Status, StartedAtUtc, AppliedAtUtc, RowVersion) |
| ClusterNode.RedundancyRole column | Dropped (Akka leader-of-driver-role is source of truth) |
| ConfigAuditLog | Kept; deploy events added as new event types |
| (new) | Deployment (DeploymentId, RevisionHash, Status, CreatedBy, CreatedAtUtc, ArtifactBlob varbinary(max), RowVersion) |
| (new) | ConfigEdit: audit row per Equipment/Driver/Script edit (live-edit history) |
| (new) | DataProtectionKeys: DataProtection key ring storage |

No more ApplyLeaseRegistry table or watchdog actor. Apply state lives in NodeDeploymentState; watchdog is a Coordinator-side scheduled message keyed by DeploymentId.

Stale-config fallback

Preserved from today's GenerationSealedCache: local LiteDb cache holds last-applied DeploymentArtifact. On Host boot with ConfigDb unreachable, DriverHostActor boots from cache → Become Stale → RedundancyStateActor drops ServiceLevel for that node.

Peer probes consolidated

| Today | New |
|---|---|
| PeerHttpProbeLoop (HTTP /healthz) | Retired; Akka failure detector replaces it |
| PeerUaProbeLoop (OPC UA opc.tcp://peer:4840) | Retained as PeerOpcUaProbeActor; tests whether the OPC UA stack itself (not just the process) is up. Feeds RedundancyStateActor. |
| DbHealthCache (cached DB probe) | Retained as DbHealthProbeActor per-node. Feeds RedundancyStateActor + /health/ready. |

ServiceLevel computation in RedundancyStateActor

serviceLevel(node) =
  base 240 if (cluster member Up AND db reachable AND not stale AND opc ua probe ok)
  base 200 if (member Up AND db reachable AND stale)
  base 100 if (member Up AND db unreachable AND stale)
  base 0   if (member Down / Unreachable)

+10 bonus if Akka driver-role leader is this node

ServiceLevel bands match the existing RedundancyStatePublisher so OPC UA client behavior is unchanged from today. The leader-bonus replaces today's operator-managed RedundancyRole = Primary.
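The bands above translate into a small pure function, sketched here in Python. Two input combinations are not covered by the listed bands (member Up with DB reachable, not stale, but OPC UA probe failing; and member Up with DB unreachable but not stale), so the sketch rejects them rather than guessing a value. Applying the leader bonus only to Up members is also an assumption.

```python
def service_level(member_up, db_reachable, stale, opc_probe_ok, is_driver_leader):
    """Compute the OPC UA ServiceLevel byte for one Driver node."""
    if not member_up:
        return 0  # Down / Unreachable
    if db_reachable and not stale and opc_probe_ok:
        base = 240
    elif db_reachable and stale:
        base = 200
    elif not db_reachable and stale:
        base = 100
    else:
        # Not specified by the bands in the design; fail loudly instead of guessing.
        raise ValueError("combination not covered by the ServiceLevel bands")
    level = base + (10 if is_driver_leader else 0)
    assert 0 <= level <= 255  # must fit in a byte
    return level


if __name__ == "__main__":
    print(service_level(True, True, False, True, is_driver_leader=True))
    print(service_level(True, True, False, True, is_driver_leader=False))
```

With both nodes healthy, the leader bonus is the only difference between them, so clients that prefer the highest ServiceLevel converge on the driver-role leader.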

7. Error handling & failure modes

Akka cluster failure modes

| Scenario | Behavior |
|---|---|
| Network partition (split-brain) | Keep-oldest resolver downs the smaller side after 15s stable-after. down-if-alone = on covers isolated nodes. |
| Admin leader process crash | Failure detector trips after 10s, downs the member, new singleton instance starts on remaining Admin node. Traefik /health/active probe fails over within 1 polling interval (~5s). |
| Driver-role node crash | RedundancyStateActor sees member Down → drops that node's ServiceLevel to 0 → OPC UA clients reconnect to surviving node. Both nodes were already running their own copy; no in-cluster recovery needed for that node's work. |
| Both Admin nodes down simultaneously | Web UI unavailable. Driver nodes continue serving OPC UA from last-sealed cache. No new deployments possible until Admin node recovers. |
| All Driver nodes down | OPC UA endpoints unavailable. Clients reconnect when any Driver node returns. ServiceLevel back to 240 once member Up + DB reachable + apply sealed. |
| Singleton handover during deploy | Coordinator state survives in Deployment + NodeDeploymentState ConfigDb rows. New Coordinator queries DriverHostActors via cluster-aware actor selection. Resume remaining deadline. |

ConfigDb unavailability

  • At edit time: AdminUI returns user-visible error. No retries — operator decides.
  • At deploy time: Coordinator refuses to start dispatch if it can't write the Deployment row.
  • At Driver node boot: Fall back to local LiteDb sealed cache. RedundancyStateActor drops ServiceLevel.
  • At singleton failover: New Coordinator's PreStart retries via Polly (5 attempts, exponential backoff). If exhausted → singleton crashes → cluster restarts singleton on next viable Admin node.

Driver / equipment failures

  • Driver connection loss → DriverInstanceActor enters Reconnecting Become state, publishes bad-quality to OPC UA address space immediately, retries at fixed interval.
  • Tag-path-resolution failure → retried periodically.
  • Write failure to driver → returned synchronously to caller via Ask from OpcUaPublishActor.
  • Driver process unresponsive (Galaxy gateway down) → IDriver.HealthCheck returns degraded → DriverInstanceActor reports to DriverHostActor → RedundancyStateActor factors into ServiceLevel.

Wonderware historian sidecar

  • Named-pipe disconnect → HistorianAdapterActor enters Reconnecting; alarm history rows buffered to local SQLite store-and-forward.
  • Sidecar process crash → no in-cluster recovery (external process); operator restarts via Windows Service control.

Auth failures

  • LDAP unreachable → /auth/login returns 503. Active sessions continue with cached claims.
  • JWT signature failure (key ring drift) → 401, session terminates. DataProtection keys in ConfigDb prevent this in the happy path.
  • Cookie expired (sliding 30-min idle) → /auth/ping returns 401 → CookieAuthenticationStateProvider triggers UI logout.

SignalR / circuit drops

  • Blazor circuit dropped → App.razor reload script reconnects (preserved from today).
  • Hub message loss during reconnect → FleetStatusBroadcaster resends current state to the reconnecting client on OnConnectedAsync (full snapshot, not just diffs).

OPC UA stack failures

  • Address-space corruption → OpcUaPublishActor logs ERROR, sends RebuildAddressSpace to itself; sequence number bump notifies clients to resubscribe.
  • OPC UA listener bind failure (port collision) → Host fails readiness probe, supervisor restarts service.

Audit invariants

  • Audit write failures never abort the user-facing action. AuditWriterActor buffer overflow → log WARN, drop oldest (with counter metric). The action's success/failure path is authoritative.
  • All deploy + edit events carry ExecutionId (per-request correlation) so audit rows for one operator action share an ID.
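The drop-oldest overflow policy and the EventId idempotency mentioned for AuditWriterActor can be illustrated together. This Python sketch uses an in-memory dict as a stand-in for ConfigAuditLog; the class name, capacity value, and event shape are hypothetical.

```python
from collections import deque


class AuditBufferSketch:
    """Bounded buffer: on overflow the oldest event is dropped and counted."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.dropped = 0  # would be exposed as a counter metric
        self._events = deque()

    def enqueue(self, event):
        if len(self._events) >= self.capacity:
            self._events.popleft()  # drop oldest; never block the caller
            self.dropped += 1
        self._events.append(event)

    def flush(self, store):
        """Batch-write buffered events; duplicates by EventId are skipped."""
        written = 0
        while self._events:
            event = self._events.popleft()
            if event["EventId"] not in store:
                store[event["EventId"]] = event
                written += 1
        return written


if __name__ == "__main__":
    buffer, log = AuditBufferSketch(capacity=2), {}
    for event_id in ("e1", "e2", "e3", "e3"):
        buffer.enqueue({"EventId": event_id, "ExecutionId": "x1"})
    print(buffer.flush(log), buffer.dropped, sorted(log))
```

The caller's action never waits on the buffer, which matches the invariant that audit failures must not abort the user-facing operation.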

8. Testing strategy

Test projects mirror the new layering. Test infrastructure stays Mac-friendly: stubbed Windows-only drivers, ephemeral SQL Server (LocalDB on Windows / mcr.microsoft.com/mssql/server container on Mac), OpenLDAP container, all spun up via tests/docker-compose.yml.

Layered test pyramid

| Layer | Project | What it covers |
|---|---|---|
| Unit | OtOpcUa.Runtime.Tests | Per-actor logic via Akka.TestKit.Xunit2. DriverInstanceActor state-machine transitions, Phase7Composer purity, ScriptedAlarmActor state machine, VirtualTagActor expression eval. Drivers mocked via IDriver test doubles. |
| Unit | OtOpcUa.ControlPlane.Tests | Singleton actor logic. ConfigPublishCoordinator happy path + timeout + concurrent ack ordering. RedundancyStateActor ServiceLevel computation truth table. AuditWriterActor batch flush + idempotency on duplicate EventId. |
| Unit | OtOpcUa.Cluster.Tests | Split-brain resolver config validation, role-aware membership helpers, HOCON parses. |
| Unit | OtOpcUa.Security.Tests | LDAP role mapping, JWT issuance, cookie+JWT roundtrip, /auth/ping expiry semantics. |
| Integration | OtOpcUa.Host.IntegrationTests | 2-node in-process Akka cluster. Real SQL Server, stubbed drivers. Tests: deploy happy path, deploy timeout, deploy with one node down, singleton failover mid-deploy, ConfigDb outage + stale-config fallback, edit-then-deploy roundtrip, audit row emission. |
| Integration | OtOpcUa.OpcUa.IntegrationTests | Real OPCFoundation client connects to a running stubbed Host. Asserts: dual endpoint visible, ServerUriArray populated, ServiceLevel reflects leader status, browse + read + write through OpcUaPublishActor, write failures returned synchronously. |
| End-to-end | OtOpcUa.E2E.Tests | Full Host with Traefik in front, two Admin nodes + two Driver nodes (4 processes via Docker). Verifies: web UI login via LDAP, deploy from UI flows to OPC UA stack, kill admin leader → Traefik fails over within 25s, kill driver node → OPC UA clients reconnect with correct ServiceLevel. CI nightly. |

Failover-specific test cases

  1. Kill Admin leader during Dispatching phase → new Coordinator resumes, deployment seals.
  2. Kill Admin leader during AwaitingApplyAcks → new Coordinator queries DriverHostActors, completes ack collection.
  3. Kill Driver node during Applying → Coordinator marks that node's NodeDeploymentState=Failed after deadline; surviving Driver nodes complete their apply.
  4. Restart Driver node mid-deploy → on restart, replays apply (idempotent).
  5. Akka split-brain (network partition between 2 admin nodes) → keep-oldest wins, smaller side downs itself within 15s.
  6. Both Admin nodes restart simultaneously → deployments in Dispatching resume cleanly after cluster reforms.
  7. Concurrent edits to the same DriverInstance from two UI sessions → last write wins, both audit rows present, no row corruption.

Deploy idempotency tests

  • Replay DispatchDeployment with same DeploymentId/RevisionHash → no work, ack Applied.
  • Apply same DeploymentArtifact twice in a row → second application is a no-op.
  • Crash DriverHostActor mid-apply, restart → resumes from NodeDeploymentState, completes idempotently.

Property tests

  • Phase7Composer.ComposeAsync is pure: same artifact → same plan, no side effects.
  • RedundancyStateActor ServiceLevel computation: every combination of (member-state, db-ok, stale, opc-ok, is-leader) produces expected byte.
  • Audit envelope generation: every mutating op produces exactly one audit row with stable ExecutionId correlation.

Mac-dev test invariants

  • All unit + integration tests run on macOS without Windows-only assemblies.
  • Cluster tests use in-process Akka.Remote on 127.0.0.1.
  • LDAP tests use OpenLDAP container or Security:Ldap:DevStubMode=true.

Retired tests

Anything touching ConfigGeneration lifecycle, ApplyLeaseRegistry, PeerHttpProbeLoop, FleetStatusPoller, RedundancyCoordinator peer-probe loops, RedundancyStatePublisher.

9. Risks & open questions

  1. Akka.NET on .NET 10. Verify Akka.NET 1.5+ targets .NET 10 cleanly.
  2. OPCFoundation SDK threading. The OPC UA stack runs its own threadpool. OpcUaPublishActor must marshal writes via thread-safe wrappers; use a dedicated synchronized-dispatcher for actors that touch the OPC UA address space.
  3. Failure detector tuning. ScadaLink's 2s/10s is tuned for site-to-central RTT. Benchmark before locking. Aggressive tuning + GC pauses → spurious singleton handover.
  4. ServiceLevel = Akka leader removes operator control. No escape hatch in v1. If a customer needs one later, add a PinnedPrimary column to ClusterNode and an override path in RedundancyStateActor. Out of scope now.
  5. Long-lived v2 branch drift. Monthly rebase from main, CI runs on v2 from day one.
  6. Schema migration is destructive. Dropping ConfigGeneration + ClusterNode.RedundancyRole is one-way. Cutover must run on a quiesced system. Provide a Migrate-To-V2.ps1 script that backs up ConfigDb, runs EF migrations, validates row counts, prints a summary.
  7. Wonderware + mxaccessgw still external processes. Both untouched by this refactor. Future actorization would be a second refactor.
  8. Audit row volume. Edit-heavy install ≈ 5k rows/day. Need monthly partition + 365-day retention same as ScadaLink #23.

10. Migration plan

Big-bang on v2-akka-fuse branch:

  1. Branch v2-akka-fuse off main.
  2. Add new projects: OtOpcUa.Host, .Cluster, .Security, .ControlPlane, .Runtime, .ConfigDb, .Commons, .AdminUI, .OpcUaServer. Convert to OtOpcUa.slnx.
  3. Move ConfigDb access (EF context, repos, migrations) out of Server and Admin into OtOpcUa.ConfigDb. Add DataProtection key store table.
  4. Move LDAP + cookie + JWT out of Admin/Security into OtOpcUa.Security. Adopt 15-min JWT / 30-min sliding cookie / /auth/ping.
  5. Build OtOpcUa.Cluster: HOCON, AkkaHostedService, role-aware membership helpers, split-brain resolver.
  6. Build OtOpcUa.ControlPlane: ConfigPublishCoordinator, AdminOperationsActor, AuditWriterActor, FleetStatusBroadcaster, RedundancyStateActor.
  7. Build OtOpcUa.Runtime: DriverHostActor, DriverInstanceActor, VirtualTagActor, ScriptedAlarmActor, OpcUaPublishActor, HistorianAdapterActor, PeerOpcUaProbeActor, DbHealthProbeActor.
  8. Migrate Phase7Composer to OtOpcUa.OpcUaServer; make it pure and unit-tested.
  9. Move Blazor components from Admin into OtOpcUa.AdminUI library; replace DriverDiagnosticsClient HTTP with in-process actor calls; rewire FleetStatusHub / AlertHub / ScriptLogHub to be fed by FleetStatusBroadcaster IHubContext.
  10. Build OtOpcUa.Host Program.cs: role-gated startup, health endpoints (/health/ready, /health/active, /healthz), AddWindowsService.
  11. ConfigDb migration: add Deployment, ConfigEdit, DataProtectionKeys tables; rename ClusterNodeGenerationState → NodeDeploymentState; drop ConfigGeneration; drop ClusterNode.RedundancyRole. EF migration + idempotent SQL script + Migrate-To-V2.ps1.
  12. Delete OtOpcUa.Server, OtOpcUa.Admin, DriverInstanceBootstrapper, RedundancyCoordinator, RedundancyStatePublisher, ApplyLeaseRegistry, FleetStatusPoller, PeerHttpProbeLoop, HubTokenService. Sweep any *RedundancyRole* references.
  13. Update deploy/Install-Services.ps1: single Windows Service per node, role via env var, Traefik service registration.
  14. Update docs in docs/: rewrite Redundancy.md, ServiceHosting.md; add Cluster.md, ControlPlane.md, Runtime.md. Add top-level Architecture-v2.md summary.
  15. CI: add integration test job for the 2-node cluster + OPC UA roundtrip.
  16. Tag the last v1 release on main for backport-only fixes. Merge v2-akka-fuse → main when GA.