Joseph Doherty ef4a70751c docs(plans): add v2 Akka + fused hosting alignment design

Captures the brainstormed design to align OtOpcUa with ScadaLink:
single role-gated binary, Akka.NET cluster with admin/driver roles,
cluster singletons for control plane, per-node actor hierarchy for
OPC UA runtime, dual-endpoint warm redundancy preserved with
ServiceLevel driven by Akka leader, cookie+JWT auth, Traefik routing,
and ScadaLink-style live-edit + deploy model replacing the
draft/publish ConfigGeneration lifecycle.

2026-05-26 03:04:21 -04:00


OtOpcUa v2 — Akka.NET + Fused Hosting Alignment with ScadaLink

Status: Design approved, ready for implementation planning
Date: 2026-05-26
Branch: v2-akka-fuse
Sister project reference: ~/Desktop/scadalink-design (ScadaLink)

1. Motivation

OtOpcUa today runs as three separate processes (the OtOpcUa.Server OPC UA host, the OtOpcUa.Admin Blazor Server web UI, and an optional OtOpcUaWonderwareHistorian .NET Framework sidecar) with manual, operator-driven warm-redundancy failover. The sister project ScadaLink, owned by the same developer, solved similar problems with a fused single-binary, role-gated hosting model on top of an Akka.NET cluster.

The motivation for this refactor is twofold:

Consistency. A single developer (the project owner) moves between OtOpcUa and ScadaLink frequently. Sharing patterns — hosting, auth, actor hierarchy, deployment model — reduces cognitive overhead and makes fixes portable.
Real HA improvements. Upgrade OtOpcUa's manual operator-driven failover to automatic, Akka-cluster-driven failover with Traefik routing for the web UI. Preserve OPC UA dual-endpoint client-side failover semantics (clients connect to both nodes and pick based on ServiceLevel), now driven automatically by Akka cluster leadership.

2. Architecture overview

One binary, role-gated. OtOpcUa.Host (Microsoft.NET.Sdk.Web, .NET 10) replaces OtOpcUa.Server and OtOpcUa.Admin. Same binary on every node. Role configured via OTOPCUA_ROLES environment variable.

Two Akka roles, single cluster:

admin — hosts Blazor web UI + cluster singletons. Singletons pinned via ClusterSingletonManagerSettings.WithRole("admin"). Traefik routes / to whichever Admin-role node reports itself as leader on /health/active.
driver — hosts OPC UA endpoint + per-node DriverHostActor hierarchy. Every Driver-role node always serves OPC UA; RedundancyStateActor computes each node's ServiceLevel and broadcasts it back to the Driver nodes, where the local OpcUaPublishActor writes it into that node's OPC UA address space.

Roles are additive: OTOPCUA_ROLES=admin, OTOPCUA_ROLES=driver, or OTOPCUA_ROLES=admin,driver. Small deployments run both roles on both nodes; larger deployments separate them.
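The additive role list can be illustrated with a small parser. This is a sketch in Python rather than the C# the Host would use; `parse_roles`, the lower-casing, the rejection of unknown names, and the `admin,driver` default in the demo are all assumptions made for illustration, not part of the design.

```python
import os

# Role names taken from this document; "dev" is the Mac-dev role.
KNOWN_ROLES = {"admin", "driver", "dev"}


def parse_roles(raw: str) -> frozenset:
    """Split a comma-separated OTOPCUA_ROLES value into a set of role names."""
    # Assumption: whitespace and letter case are not significant.
    roles = {part.strip().lower() for part in raw.split(",") if part.strip()}
    if not roles:
        raise ValueError("OTOPCUA_ROLES must name at least one role")
    unknown = roles - KNOWN_ROLES
    if unknown:
        # Assumption: an unknown role is a startup error rather than ignored.
        raise ValueError(f"unknown roles: {sorted(unknown)}")
    return frozenset(roles)


if __name__ == "__main__":
    # The default below is hypothetical; the design does not state one.
    roles = parse_roles(os.environ.get("OTOPCUA_ROLES", "admin,driver"))
    print("web UI enabled:", "admin" in roles)
    print("OPC UA endpoint enabled:", "driver" in roles)
```

Because the result is a set, `admin,driver` and `driver,admin` are equivalent, which matches the "roles are additive" wording.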

Per-role leadership. Cluster.Get(system).State.RoleLeader("driver") drives OPC UA ServiceLevel. RoleLeader("admin") drives /health/active (Traefik routing). These are independent — admin and driver leadership can land on different nodes if separated.

Cluster membership. Both nodes are seed nodes; keep-oldest split-brain resolver; down-if-alone = on; 15s stable-after; 2s heartbeat / 10s threshold. CoordinatedShutdown for graceful singleton handover. Tuning is copied exactly from ScadaLink.

OPC UA dual-endpoint preserved. Driver-role nodes all bind opc.tcp://0.0.0.0:4840. Clients still see N endpoints in ServerUriArray and fail over via ServiceLevel. OPC UA spec compliance unchanged from today.

Mac dev: role admin,driver,dev — dev short-circuits Windows-only driver registration (Galaxy, Wonderware) with explicit [DEV-STUB] log lines.

3. Project & process restructure

Single solution, ScadaLink-style folder layout. Existing OtOpcUa naming convention (ZB.MOM.WW.OtOpcUa.*) preserved.

New entry point & deletions

Action	Project
New	`OtOpcUa.Host` — `Microsoft.NET.Sdk.Web`, single Program.cs, role-gated startup, `AddWindowsService`
Delete	`OtOpcUa.Server` (content migrates)
Delete	`OtOpcUa.Admin` (UI moves to library)

New libraries

Project	Owns	ScadaLink analog
`OtOpcUa.Commons`	Entity POCOs, interfaces, message contracts (`Types/`, `Interfaces/`, `Entities/`, `Messages/`)	`ScadaLink.Commons`
`OtOpcUa.ConfigDb`	EF Core `DbContext`, repositories, `IAuditService`, migrations, Data Protection key store	`ScadaLink.ConfigurationDatabase`
`OtOpcUa.Cluster`	Akka HOCON, `AkkaHostedService`, split-brain resolver config, role-aware membership helpers, `IClusterRoleInfo`	(split out of ScadaLink Host)
`OtOpcUa.Security`	LDAP bind, cookie+JWT hybrid, JWT issuance, role mapping, `/auth/login`, `/auth/ping` endpoints	`ScadaLink.Security`
`OtOpcUa.ControlPlane`	Cluster singletons: `ConfigPublishCoordinator`, `AdminOperationsActor`, `AuditWriterActor`, `FleetStatusBroadcaster`, `RedundancyStateActor`	`ScadaLink.ManagementService`
`OtOpcUa.Runtime`	Per-node actors: `DriverHostActor`, `DriverInstanceActor`, `VirtualTagActor`, `ScriptedAlarmActor`, `OpcUaPublishActor`, `HistorianAdapterActor`, `PeerOpcUaProbeActor`, `DbHealthProbeActor`	`ScadaLink.SiteRuntime`
`OtOpcUa.OpcUaServer`	OPC UA app host, address-space build, `Phase7Composer` extraction	(in ScadaLink.SiteRuntime/DCL)
`OtOpcUa.AdminUI`	Blazor components, hubs (`FleetStatusHub`, `AlertHub`, `ScriptLogHub`), auth state provider, `MapAdminUI<TApp>()`	`ScadaLink.CentralUI`

Unchanged

Driver projects (OtOpcUa.Driver.Galaxy, .Modbus, .S7, .AbCip, .AbLegacy, .TwinCAT, .FOCAS, .OpcUaClient) — still implement IDriver, now consumed by DriverInstanceActor instead of DriverInstanceBootstrapper.
OtOpcUa.Driver.Historian.Wonderware — .NET Framework 4.8 sidecar, named-pipe IPC, wrapped by a HistorianAdapterActor in OtOpcUa.Runtime.
mxaccessgw sibling repo — unchanged; Galaxy driver still talks gRPC to it.

Tests

tests/OtOpcUa.Cluster.Tests — split-brain, leadership transitions
tests/OtOpcUa.ControlPlane.Tests — singleton actor unit tests via Akka.TestKit
tests/OtOpcUa.Runtime.Tests — per-node actor tests, driver lifecycle
tests/OtOpcUa.Security.Tests — LDAP, cookie+JWT roundtrip
tests/OtOpcUa.Host.IntegrationTests — 2-node in-process cluster, deployment flow, failover, Mac-safe
tests/OtOpcUa.OpcUa.IntegrationTests — real OPCFoundation client against stubbed Host
tests/OtOpcUa.E2E.Tests — full stack with Traefik (nightly CI)

Deploy

deploy/Install-Services.ps1 — installs one Windows Service per node (OtOpcUaHost), passes role via env var. Old script replaced.
deploy/traefik/ — Windows Traefik config + service registration for the leader-routed /health/active.
docker-dev/ (new, optional) — 2-node Mac dev compose with stubbed drivers + LDAP + SQL Server + Traefik.

Solution file: OtOpcUa.slnx (matches ScadaLink convention; switch from current .sln).

4. Actor hierarchy

Per-node tree

Rooted under OtOpcUa.Runtime, one tree per Driver-role node:

DriverHostActor                       (per-node coordinator, started by Host)
├─ DriverInstanceActor (per DriverInstance row)
│   └─ children = pooled or per-subscription work
├─ VirtualTagActor (per VirtualTag row)
├─ ScriptedAlarmActor (per ScriptedAlarm row)
├─ OpcUaPublishActor (per-node bridge to OPCFoundation address space)
├─ HistorianAdapterActor (per-node, wraps Wonderware named-pipe sidecar)
├─ PeerOpcUaProbeActor (per-node, tests peer OPC UA stack health)
└─ DbHealthProbeActor (per-node, cached DB health probe)

Cluster singletons

Pinned to admin role via ClusterSingletonManagerSettings.WithRole("admin"):

Actor	Owns	Notes
`ConfigPublishCoordinator`	The deploy protocol. Writes `Deployment` row, broadcasts `DispatchDeployment(deploymentId)` via `DistributedPubSub` to every `DriverHostActor`, tracks apply ACKs per node.	Replaces `ApplyLeaseRegistry`. Resumes after failover by re-reading ConfigDb state — no Akka.Persistence.
`AdminOperationsActor`	All mutating admin ops (CRUD on equipment, drivers, scripts, namespaces, ACLs). Wraps each in an audit envelope.	UI calls via `ClusterSingletonProxy` (in-process when UI is on Admin node).
`AuditWriterActor`	Receives `AuditEvent` telemetry from any node, batch-inserts into `ConfigAuditLog`.	Idempotent on `EventId`.
`FleetStatusBroadcaster`	Aggregates Akka cluster member events + per-node `DriverHostStatus` heartbeats. Publishes diffs to `IHubContext<FleetStatusHub>` and `IHubContext<AlertHub>`.	Push-driven; replaces today's 5s `FleetStatusPoller`.
`RedundancyStateActor`	Subscribes to `ClusterEvent.IMemberEvent` + `ClusterEvent.LeaderChanged` + per-node health probes. Computes `ServiceLevel` byte + `ServerUriArray` per Driver node. Publishes to `DistributedPubSub` topic `redundancy-state`.	Source of truth for OPC UA redundancy. Local `OpcUaPublishActor` subscribes and writes to its OPCFoundation stack.

Supervision

Actor	Strategy
`DriverHostActor`	`Resume`
`DriverInstanceActor`	`Restart` with backoff (1s → 30s, ×1.5, jitter)
`VirtualTagActor`	`Restart` with backoff
`ScriptedAlarmActor`	`Restart` with backoff; preserve alarm state via `PreRestart` hook
`OpcUaPublishActor`	`Resume`
`HistorianAdapterActor`	`Restart` with backoff; SQLite store-and-forward buffers during pipe outage
All singletons	`Resume`; resumable state in ConfigDb
Script execution actors (short-lived)	`Stop` on failure
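The "1s → 30s, ×1.5, jitter" restart backoff in the table above can be sketched as a delay generator. This is an illustrative Python sketch, not the Akka.NET supervisor configuration; `restart_delays` is a hypothetical helper, and the 20% jitter fraction and the choice to clamp after applying jitter are assumptions, since the design only says "jitter".

```python
import itertools
import random


def restart_delays(initial=1.0, factor=1.5, cap=30.0, jitter=0.2, rng=None):
    """Yield successive restart delays in seconds.

    The un-jittered delay starts at `initial`, grows by `factor` per restart,
    and never exceeds `cap`. Each delay is scaled by a random factor in
    [1 - jitter, 1 + jitter] and then clamped to `cap` again.
    """
    rng = rng or random.Random()
    base = initial
    while True:
        scale = 1.0 + rng.uniform(-jitter, jitter)
        yield min(cap, base * scale)
        base = min(cap, base * factor)


if __name__ == "__main__":
    # With jitter disabled the sequence is deterministic and tops out at 30s.
    for delay in itertools.islice(restart_delays(jitter=0.0), 11):
        print(f"{delay:.2f}s")
```

Jitter matters here because several DriverInstanceActor children can fail together (for example when a shared gateway drops), and identical delays would make them all reconnect at the same instant.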

State machines

DriverInstanceActor — Become/Stash for Connecting → Connected → Reconnecting → Failed. Bad-quality publish on disconnect; transparent re-subscribe on reconnect. Write failures returned synchronously via Ask from OpcUaPublishActor.
ConfigPublishCoordinator — Idle → Publishing → AwaitingApplyAcks → Sealed, with timeout-driven escalation if a node fails to ack within ApplyMaxDuration (default 10 min).
RedundancyStateActor — recomputes on every membership event, debounced 250ms to coalesce bursts.
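The 250ms debounce on RedundancyStateActor can be modelled as a pure function over event timestamps. The sketch below is Python and assumes a trailing-edge debounce (each event restarts the timer and the recompute runs when the timer expires); the design does not say which edge is used, and `debounce_fire_times` is a hypothetical name. Times are integer milliseconds to keep the arithmetic exact.

```python
def debounce_fire_times(event_times_ms, window_ms=250):
    """Return the times (ms) at which a trailing-edge debounce would fire.

    Each event (re)starts a timer of `window_ms`; the recompute runs only
    when that timer expires without a newer event having arrived.
    """
    fires = []
    pending = None  # time at which the current timer would expire
    for t in sorted(event_times_ms):
        if pending is not None and t >= pending:
            fires.append(pending)  # timer expired before this event arrived
        pending = t + window_ms    # (re)start the timer
    if pending is not None:
        fires.append(pending)      # the last burst always fires once
    return fires


if __name__ == "__main__":
    # Three membership events in a burst, then one isolated event.
    print(debounce_fire_times([0, 100, 200, 1000]))
```

In this model a burst of membership events produces a single ServiceLevel recompute, at the cost of up to one window of added latency after the last event in the burst.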

Communication conventions

Tell for hot-path internal traffic (driver values, alarm state changes, publish broadcasts).
Ask only at system boundaries (UI controller → AdminOperationsActor, with explicit timeout + cancellation token).
DistributedPubSub for cluster-wide broadcasts (DispatchDeployment, RedundancyStateChanged, FleetStatusChanged).
Application-level correlation IDs on every request/response message.
Messages live in OtOpcUa.Commons.Messages.{Drivers,Deploy,Admin,Audit,Redundancy} — additive-only evolution.

Singleton persistence

No Akka.Persistence. Each singleton reads its resumable state from ConfigDb on PreStart (e.g., ConfigPublishCoordinator reads the current in-flight Deployment row + per-node NodeDeploymentState) and writes on every state transition.

Mac-dev stubs

The dev role short-circuits driver registration. DriverInstanceActor for any Galaxy/Wonderware row enters a Stubbed Become state that returns deterministic test values. Logged at INFO with [DEV-STUB] driver={Name} reason=windows-only.

5. Web hosting, auth, and SignalR

Kestrel startup gated by `admin` role

Program.cs builds WebApplicationBuilder, registers all services, but only calls app.MapBlazor<App>(), app.MapHub<...>(), app.MapStaticAssets(), and auth endpoints when admin ∈ roles. Driver-only nodes still bind Kestrel for /healthz on :4841 and nothing else.

Authentication — cookie+JWT hybrid

Layer	Config
Cookie scheme	`OtOpcUa.Auth`, HttpOnly, SameSite=Strict, Secure (prod) / SameAsRequest (dev). Sliding 30-min idle timeout.
Embedded JWT	HMAC-SHA256, 15-min expiry, claims = `sub`, `roles`, `nodeAcls`.
LDAP bind	`LdapAuthService.AuthenticateAsync(user, pw)` at `/auth/login` POST — preserved from current `OtOpcUa.Admin/Security`.
Role mapping	`RoleMapper.MapGroupsToRolesAsync()` — LDAP groups → `FleetAdmin` / `ConfigEditor` / `ReadOnly`. Stays as-is.
Token issuance	`/auth/token` returns bearer for external clients (CLI, automation).
Circuit expiry probe	`/auth/ping` returns 200/401, polled by `CookieAuthenticationStateProvider` to detect expiry from inside a SignalR circuit.
Failure mode	LDAP unreachable → new logins fail, active sessions continue.
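The embedded-JWT row above (HMAC-SHA256, 15-minute expiry, `sub` / `roles` / `nodeAcls` claims) can be made concrete with a standard-library sketch. This is Python for illustration only; the real service would use the .NET JWT stack. `issue_token` and `verify_token` are hypothetical helpers, and the `iat` claim is an assumption added alongside the standard `exp` claim.

```python
import base64
import hashlib
import hmac
import json
import time


def _b64url(data: bytes) -> str:
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def _b64url_decode(text: str) -> bytes:
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def issue_token(key: bytes, sub: str, roles, node_acls, now=None, ttl_s=15 * 60) -> str:
    """Build a compact HS256 JWT carrying the claims named in the design."""
    now = int(time.time()) if now is None else int(now)
    header = {"alg": "HS256", "typ": "JWT"}
    claims = {"sub": sub, "roles": list(roles), "nodeAcls": list(node_acls),
              "iat": now, "exp": now + ttl_s}
    signing_input = ".".join(
        _b64url(json.dumps(part, separators=(",", ":")).encode("utf-8"))
        for part in (header, claims))
    sig = hmac.new(key, signing_input.encode("ascii"), hashlib.sha256).digest()
    return signing_input + "." + _b64url(sig)


def verify_token(key: bytes, token: str, now=None) -> dict:
    """Check signature and expiry; return the claims or raise ValueError."""
    now = int(time.time()) if now is None else int(now)
    signing_input, _, sig_part = token.rpartition(".")
    expected = hmac.new(key, signing_input.encode("ascii"), hashlib.sha256).digest()
    if not hmac.compare_digest(expected, _b64url_decode(sig_part)):
        raise ValueError("bad signature")
    claims = json.loads(_b64url_decode(signing_input.split(".")[1]))
    if now >= claims["exp"]:
        raise ValueError("token expired")
    return claims


if __name__ == "__main__":
    key = b"demo-key-not-for-production-use!"
    token = issue_token(key, "alice", ["FleetAdmin"], ["ns=2;*"])
    print(verify_token(key, token)["roles"])
```

The signature is checked before the payload is parsed, and `hmac.compare_digest` is used so the comparison does not leak timing information.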

Data Protection keys

services.AddDataProtection().PersistKeysToDbContext<OtOpcUaConfigDbContext>().SetApplicationName("OtOpcUa") — keys live in ConfigDb so a circuit started on Admin-node A survives if Traefik fails over to Admin-node B mid-session.

SignalR hubs

Three existing hubs preserved (/hubs/fleet, /hubs/alerts, /hubs/script-log):

Today: FleetStatusPoller polls SQL every 5s.
New: FleetStatusBroadcaster singleton receives Akka cluster events + per-node telemetry, pushes diffs via IHubContext<FleetStatusHub>. No polling.
HubTokenService bearer-token fallback retired — hubs are circuit-local, cookie auth flows through SignalR natively. External hub consumers use the bearer token from /auth/token with a JwtBearer authentication scheme declaration on the hub.

UI → backend wiring

Reads: Blazor components inject scoped repositories from DI and read directly from ConfigDb. No change from today.
Writes / mutating ops: Components inject IAdminOperationsClient — a thin wrapper around ClusterSingletonProxy to AdminOperationsActor. Mutations are Ask with a 10s timeout + correlation ID. Audit envelope built UI-side, completed singleton-side.
Driver diagnostics: Today's DriverDiagnosticsClient HTTP round-trip retires. UI components ask IFleetDiagnosticsClient which delegates to ClusterClientReceptionist-published actor messages.

Health endpoints

Endpoint	Returns	Used by
`/health/ready`	200 once Akka member is `Up` + ConfigDb reachable + DataProtection key ring loaded	Service supervisor readiness gate
`/health/active`	200 only on the Admin-role leader; 503 elsewhere	Traefik — routes browser traffic to leader
`/healthz` (existing)	200 when Driver-role actor system is up + at least one driver registered (preserved on `:4841`)	Ops probes, OPC UA monitoring tools
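The three endpoints in the table reduce to simple predicates over node state. The Python sketch below is illustrative only; `NodeHealth` and its field names are hypothetical, and returning 503 for a failing `/health/ready` or `/healthz` is an assumption (the design states the 503 only for `/health/active`).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeHealth:
    """Hypothetical snapshot of the inputs the health endpoints depend on."""
    member_up: bool
    config_db_reachable: bool
    key_ring_loaded: bool
    is_admin_role_leader: bool
    driver_system_up: bool
    registered_drivers: int


def health_status(path: str, h: NodeHealth) -> int:
    """Return the HTTP status code a health endpoint would answer with."""
    if path == "/health/ready":
        ok = h.member_up and h.config_db_reachable and h.key_ring_loaded
    elif path == "/health/active":
        ok = h.is_admin_role_leader
    elif path == "/healthz":
        ok = h.driver_system_up and h.registered_drivers >= 1
    else:
        return 404
    return 200 if ok else 503


if __name__ == "__main__":
    follower = NodeHealth(True, True, True, False, True, 3)
    for path in ("/health/ready", "/health/active", "/healthz"):
        print(path, health_status(path, follower))
```

A healthy non-leader Admin node is ready but not active, which is what lets Traefik route browser traffic to the leader only.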

Traefik

Windows Service (or external box). One route: host=otopcua.* → load-balance to {admin-node-a:9000, admin-node-b:9000} with /health/active health check, sticky sessions disabled (DataProtection key sharing handles continuity).

appsettings structure

Mirrors ScadaLink's per-component options pattern: Cluster:, Security:, ConfigDb:, OpcUa:, Drivers:, Historian: sections, bound to options classes owned by their respective component projects.

6. Edit + Deploy flow (replaces draft/publish generations)

The single most consequential domain change: drop the draft/publish ConfigGeneration lifecycle. Edits are live; deploy is a snapshot+push, ScadaLink-style.

Edit model

Equipment, Driver, DriverInstance, Namespace, UnsItem, Script, VirtualTag, ScriptedAlarm, NodeAcl are edited directly via AdminOperationsActor. No draft staging, no ConfigGeneration lifecycle. Last-write-wins per row (rowversion column for stale-write detection only).
Live edits do not affect running Driver-role nodes — running stacks reflect the last-deployed state. The UI shows a "drift" indicator when live ConfigDb state differs from last sealed deployment.
Validation runs on edit (semantic checks: driver tag-path validity, script syntax, namespace name uniqueness) — pulled forward from deploy-time to edit-time.

Deploy model

Admin UI "Deploy" → AdminOperationsActor.Ask(StartDeployment)
AdminOperationsActor:
  → snapshot ConfigDb current state
  → ConfigComposer.Flatten() → DeploymentArtifact
  → compute RevisionHash = SHA256(canonical-serialized artifact)
  → write Deployment row (DeploymentId GUID, RevisionHash, CreatedBy, CreatedAtUtc, Status=Dispatching)
  → Ask ConfigPublishCoordinator.DispatchDeployment(deploymentId)

ConfigPublishCoordinator (cluster singleton, admin role):
  → write Deployment.Status = Dispatching
  → DistributedPubSub Publish to "deployments" topic: DispatchDeployment(deploymentId, revisionHash)
  → schedule ApplyDeadline timer (ApplyMaxDuration, default 10 min)

DriverHostActor (per node, subscribed to "deployments"):
  receive DispatchDeployment(deploymentId, revisionHash):
    → if currentDeploymentRevision == revisionHash → ack Applied (idempotent)
    → else:
      → acquire per-node ApplyLock (Become Applying(deploymentId))
      → write NodeDeploymentState row (NodeId, DeploymentId, StartedAtUtc)
      → fetch artifact: read DeploymentArtifact blob from ConfigDb by deploymentId
      → diff against current applied artifact → per-instance ApplyDelta plans
      → dispatch ApplyDelta to DriverInstanceActor / VirtualTagActor / ScriptedAlarmActor children
      → collect per-instance acks (all-or-nothing per node)
      → on full success: write GenerationSealedCache (LiteDb local), update NodeDeploymentState.AppliedAtUtc
      → on any instance Failure: rollback to previous deployment, mark NodeDeploymentState=Failed
      → Tell Coordinator: ApplyAck(deploymentId, nodeId, Applied | Failed(reason))
      → Become Steady

ConfigPublishCoordinator: collect ApplyAcks
  → all Driver nodes Applied → Deployment.Status = Sealed → DistributedPubSub Publish DeploymentSealed
  → any Failed → Deployment.Status = PartiallyFailed → broadcast DeploymentFailed
  → deadline elapsed before all acks → Deployment.Status = TimedOut → broadcast DeploymentTimedOut
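The Coordinator's final ack-collection step can be sketched as a pure function from collected acks to a Deployment status. This is a Python illustration, not the actor code. Two points are assumptions the flow above leaves open: a single Failed ack decides the outcome immediately, and an empty Driver node set never seals. `deployment_status` and `Ack` are hypothetical names.

```python
from enum import Enum


class Ack(Enum):
    APPLIED = "Applied"
    FAILED = "Failed"


def deployment_status(driver_nodes, acks, deadline_elapsed):
    """Map the acks collected so far to a Deployment.Status value.

    driver_nodes: ids of all Driver-role nodes expected to ack.
    acks: dict of node id -> Ack for the nodes that have answered.
    """
    nodes = set(driver_nodes)
    if any(acks.get(n) is Ack.FAILED for n in nodes):
        return "PartiallyFailed"
    if nodes and all(acks.get(n) is Ack.APPLIED for n in nodes):
        return "Sealed"
    if deadline_elapsed:
        return "TimedOut"
    return "AwaitingApplyAcks"  # keep waiting for the remaining nodes


if __name__ == "__main__":
    acks = {"node-a": Ack.APPLIED}
    print(deployment_status(["node-a", "node-b"], acks, deadline_elapsed=False))
    acks["node-b"] = Ack.APPLIED
    print(deployment_status(["node-a", "node-b"], acks, deadline_elapsed=False))
```

Keeping this decision free of side effects makes the "concurrent ack ordering" case easy to unit-test: the result depends only on the set of acks, not on the order in which they arrived.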

Per-instance operation lock

All mutating commands (deploy, disable, enable, delete) on a DriverInstance go through DriverInstanceActor, which serializes them via the actor mailbox — single-threaded by construction.

Idempotency

DeploymentId + RevisionHash together identify a deployment.
DriverHostActor seeing a DispatchDeployment whose RevisionHash matches current applied state → immediate ack Applied, no work. Safe to redeliver.
Phase7Composer.ComposeAsync(artifact) is pure; same artifact → same delta plan.
DriverInstanceActor.ApplyDelta(plan) compares against current state, applies only diffs.
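The idempotency rules above can be sketched with a content hash and a redelivery check. This is Python for illustration; the design does not define "canonical-serialized", so sorted-key compact JSON is an assumption here, as is the integrity check that compares the fetched artifact against the announced hash. `revision_hash` and `DriverHost` are hypothetical names.

```python
import hashlib
import json


def revision_hash(artifact: dict) -> str:
    """SHA-256 over a canonical JSON form (sorted keys, no extra whitespace)."""
    canonical = json.dumps(artifact, sort_keys=True, separators=(",", ":"),
                           ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class DriverHost:
    """Toy stand-in for DriverHostActor's handling of DispatchDeployment."""

    def __init__(self):
        self.current_revision = None
        self.apply_count = 0

    def on_dispatch(self, deployment_id, rev_hash, fetch_artifact):
        if rev_hash == self.current_revision:
            return ("Applied", deployment_id)  # redelivery: ack without work
        artifact = fetch_artifact(deployment_id)
        if revision_hash(artifact) != rev_hash:
            # Assumption: a mismatching artifact is reported as a failure.
            return ("Failed", deployment_id)
        self.apply_count += 1  # real code would diff and apply deltas here
        self.current_revision = rev_hash
        return ("Applied", deployment_id)


if __name__ == "__main__":
    artifact = {"drivers": [{"name": "plc-1", "kind": "Modbus"}]}
    host = DriverHost()
    rev = revision_hash(artifact)
    print(host.on_dispatch("d-1", rev, lambda _id: artifact))
    print(host.on_dispatch("d-1", rev, lambda _id: artifact), host.apply_count)
```

Hashing a canonical form means two snapshots with the same content but different key order produce the same RevisionHash, so a redeploy of unchanged configuration is a no-op on every node.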

Concurrency control

Last-write-wins on edits (no optimistic concurrency on Equipment, Driver, Script, etc.) — matches ScadaLink template behavior.
Optimistic concurrency on Deployment and NodeDeploymentState rows (rowversion column) — prevents two concurrent Coordinator instances (during failover) from corrupting state.

Singleton failover during deploy

Old Coordinator wrote Deployment.Status = Dispatching + NodeDeploymentState rows before broadcast.
New Coordinator on takeover queries Deployment rows with non-terminal Status.
For each in-flight deployment, Ask every DriverHostActor (via cluster-aware actor selection) for current NodeDeploymentState.
Recompute outstanding-ack set; resume the deadline timer with the remaining time.
If apply deadline already passed → mark Deployment.Status = TimedOut for any unack'd nodes.
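The takeover steps above amount to recomputing two things per in-flight deployment: which nodes still owe an ack, and how much of the deadline is left. The Python sketch below illustrates that. It assumes the deadline is anchored to a persisted dispatch timestamp (`dispatched_at`), which the design does not spell out; `resume_deployment` is a hypothetical helper.

```python
from datetime import datetime, timedelta, timezone


def resume_deployment(dispatched_at, now, driver_nodes, node_status,
                      apply_max=timedelta(minutes=10)):
    """Return (outstanding_nodes, remaining, timed_out) for one deployment.

    node_status maps node id -> "Applying" | "Applied" | "Failed"; nodes that
    did not answer the new Coordinator are simply missing from the map.
    """
    outstanding = sorted(n for n in driver_nodes
                         if node_status.get(n) not in ("Applied", "Failed"))
    remaining = max(timedelta(0), dispatched_at + apply_max - now)
    timed_out = bool(outstanding) and remaining == timedelta(0)
    return outstanding, remaining, timed_out


if __name__ == "__main__":
    t0 = datetime(2026, 5, 26, 12, 0, tzinfo=timezone.utc)
    state = {"node-a": "Applied", "node-b": "Applying"}
    print(resume_deployment(t0, t0 + timedelta(minutes=4),
                            ["node-a", "node-b"], state))
```

Resuming with the remaining time, rather than restarting a full ApplyMaxDuration, keeps a failover from silently extending the deadline.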

Crash recovery on Driver node restart

DriverHostActor.PreStart reads NodeDeploymentState for self.
If row says Applied for some DeploymentId and matches last sealed cache → Become Steady on that artifact.
If row says Applying (didn't reach Applied) → discard partial state, re-fetch the artifact, replay apply (idempotent).
If ConfigDb unreachable → fall back to local LiteDb sealed cache, Become Stale (drops ServiceLevel via RedundancyStateActor). Background reconnect retries every 30s.
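The three recovery rules above can be written as one decision function. This Python sketch is illustrative; `boot_action` and its return labels are hypothetical. Only the Applied-and-matching, Applying, and ConfigDb-unreachable cases come from the design. The handling of a missing row and of any other combination is an assumption.

```python
def boot_action(config_db_reachable, node_state, sealed_cache_id):
    """Decide what DriverHostActor does at start-up.

    node_state: None, or a (deployment_id, status) pair read from
        NodeDeploymentState for this node.
    sealed_cache_id: deployment id held in the local sealed cache, or None.
    Returns an (action, deployment_id) pair.
    """
    if not config_db_reachable:
        # Serve the last sealed artifact and report Stale; a background
        # task keeps retrying the ConfigDb connection.
        return ("Stale", sealed_cache_id)
    if node_state is None:
        # Assumption: with no row, wait for the next DispatchDeployment.
        return ("AwaitDeployment", None)
    deployment_id, status = node_state
    if status == "Applied" and deployment_id == sealed_cache_id:
        return ("Steady", deployment_id)
    # Interrupted Applying is stated in the design; treating every other
    # combination the same way is an assumption. Replay is idempotent.
    return ("ReplayApply", deployment_id)


if __name__ == "__main__":
    print(boot_action(True, ("d-7", "Applied"), "d-7"))
    print(boot_action(True, ("d-8", "Applying"), "d-7"))
    print(boot_action(False, None, "d-7"))
```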

Schema migration from today

Today	New
`ConfigGeneration` (Draft/Published/Sealed lifecycle)	Dropped
`ClusterNodeGenerationState`	Renamed → `NodeDeploymentState` with `(NodeId, DeploymentId, Status, StartedAtUtc, AppliedAtUtc, RowVersion)`
`ClusterNode.RedundancyRole` column	Dropped (Akka leader-of-driver-role is source of truth)
`ConfigAuditLog`	Kept; deploy events added as new event types
(new) `Deployment`	`(DeploymentId, RevisionHash, Status, CreatedBy, CreatedAtUtc, ArtifactBlob varbinary(max), RowVersion)`
(new) `ConfigEdit` audit row per Equipment/Driver/Script edit	Live-edit history
(new) `DataProtectionKeys`	DataProtection key ring storage

No more ApplyLeaseRegistry table or watchdog actor. Apply state lives in NodeDeploymentState; watchdog is a Coordinator-side scheduled message keyed by DeploymentId.

Stale-config fallback

Preserved from today's GenerationSealedCache: local LiteDb cache holds last-applied DeploymentArtifact. On Host boot with ConfigDb unreachable, DriverHostActor boots from cache → Become Stale → RedundancyStateActor drops ServiceLevel for that node.

Peer probes consolidated

Today	New
`PeerHttpProbeLoop` (HTTP `/healthz`)	Retired — Akka failure detector replaces it
`PeerUaProbeLoop` (OPC UA `opc.tcp://peer:4840`)	Retained as `PeerOpcUaProbeActor` — tests whether the OPC UA stack itself (not just the process) is up. Feeds `RedundancyStateActor`.
`DbHealthCache` (cached DB probe)	Retained as `DbHealthProbeActor` per-node. Feeds `RedundancyStateActor` + `/health/ready`.

ServiceLevel computation in `RedundancyStateActor`

serviceLevel(node) =
  base 240 if (cluster member Up AND db reachable AND not stale AND opc ua probe ok)
  base 200 if (member Up AND db reachable AND stale)
  base 100 if (member Up AND db unreachable AND stale)
  base 0   if (member Down / Unreachable)

+10 bonus if Akka driver-role leader is this node

ServiceLevel bands match the existing RedundancyStatePublisher so OPC UA client behavior is unchanged from today. The leader-bonus replaces today's operator-managed RedundancyRole = Primary.
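The band table translates directly into a function, which is also a convenient target for the truth-table test described in the testing section. The Python sketch below is illustrative only. Two behaviours are assumptions: a Down or Unreachable member gets 0 with no leader bonus, and input combinations the table does not list (for example a failing OPC UA probe on an otherwise healthy node) raise an error instead of guessing a band.

```python
def service_level(member_up, db_reachable, stale, opc_probe_ok, is_driver_leader):
    """Compute the OPC UA ServiceLevel byte for one Driver node."""
    if not member_up:
        return 0  # Down / Unreachable: no bonus applied (assumption)
    if db_reachable and not stale and opc_probe_ok:
        base = 240
    elif db_reachable and stale:
        base = 200
    elif not db_reachable and stale:
        base = 100
    else:
        # Not covered by the band table in this document.
        raise ValueError("combination not specified by the design")
    bonus = 10 if is_driver_leader else 0
    return min(255, base + bonus)  # ServiceLevel is a single byte


if __name__ == "__main__":
    print(service_level(True, True, False, True, is_driver_leader=True))
    print(service_level(True, True, False, True, is_driver_leader=False))
```

Raising on unlisted combinations makes the gaps in the table visible during testing, so they can be assigned explicit bands before implementation.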

7. Error handling & failure modes

Akka cluster failure modes

Scenario	Behavior
Network partition (split-brain)	Keep-oldest resolver downs the side that does not contain the oldest member once the 15s stable-after window elapses. `down-if-alone = on` covers isolated nodes.
Admin leader process crash	Failure detector trips after 10s, downs the member, new singleton instance starts on remaining Admin node. Traefik `/health/active` probe fails over within 1 polling interval (~5s).
Driver-role node crash	RedundancyStateActor sees member Down → drops that node's ServiceLevel to 0 → OPC UA clients reconnect to surviving node. Both nodes were already running their own copy; no in-cluster recovery needed for that node's work.
Both Admin nodes down simultaneously	Web UI unavailable. Driver nodes continue serving OPC UA from last-sealed cache. No new deployments possible until Admin node recovers.
All Driver nodes down	OPC UA endpoints unavailable. Clients reconnect when any Driver node returns. ServiceLevel back to 240 once member Up + DB reachable + apply sealed.
Singleton handover during deploy	Coordinator state survives in `Deployment` + `NodeDeploymentState` ConfigDb rows. New Coordinator queries DriverHostActors via cluster-aware actor selection. Resume remaining deadline.

ConfigDb unavailability

At edit time: AdminUI returns user-visible error. No retries — operator decides.
At deploy time: Coordinator refuses to start dispatch if it can't write the Deployment row.
At Driver node boot: Fall back to local LiteDb sealed cache. RedundancyStateActor drops ServiceLevel.
At singleton failover: New Coordinator's PreStart retries via Polly (5 attempts, exponential backoff). If exhausted → singleton crashes → cluster restarts singleton on next viable Admin node.

Driver / equipment failures

Driver connection loss → DriverInstanceActor enters Reconnecting Become state, publishes bad-quality to OPC UA address space immediately, retries at fixed interval.
Tag-path-resolution failure → retried periodically.
Write failure to driver → returned synchronously to caller via Ask from OpcUaPublishActor.
Driver process unresponsive (Galaxy gateway down) → IDriver.HealthCheck returns degraded → DriverInstanceActor reports to DriverHostActor → RedundancyStateActor factors into ServiceLevel.

Wonderware historian sidecar

Named-pipe disconnect → HistorianAdapterActor enters Reconnecting; alarm history rows buffered to local SQLite store-and-forward.
Sidecar process crash → no in-cluster recovery (external process); operator restarts via Windows Service control.

Auth failures

LDAP unreachable → /auth/login returns 503. Active sessions continue with cached claims.
JWT signature failure (key ring drift) → 401, session terminates. DataProtection keys in ConfigDb prevent this in the happy path.
Cookie expired (sliding 30-min idle) → /auth/ping returns 401 → CookieAuthenticationStateProvider triggers UI logout.

SignalR / circuit drops

Blazor circuit dropped → App.razor reload script reconnects (preserved from today).
Hub message loss during reconnect → FleetStatusBroadcaster resends current state to the reconnecting client on OnConnectedAsync (full snapshot, not just diffs).

OPC UA stack failures

Address-space corruption → OpcUaPublishActor logs ERROR, sends RebuildAddressSpace to itself; sequence number bump notifies clients to resubscribe.
OPC UA listener bind failure (port collision) → Host fails readiness probe, supervisor restarts service.

Audit invariants

Audit write failures never abort the user-facing action. AuditWriterActor buffer overflow → log WARN, drop oldest (with counter metric). The action's success/failure path is authoritative.
All deploy + edit events carry ExecutionId (per-request correlation) so audit rows for one operator action share an ID.
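The "drop oldest, count the drops" overflow rule and the EventId idempotency can be sketched as a bounded buffer. This is a Python illustration, not the AuditWriterActor itself; `AuditBuffer`, its capacity, and its batch size are hypothetical. Duplicates are detected only while an event is still buffered. Duplicates of already-flushed events are left to the idempotent insert on `EventId` in `ConfigAuditLog`.

```python
from collections import deque


class AuditBuffer:
    """Bounded FIFO of audit events that never blocks or fails the caller."""

    def __init__(self, capacity):
        self._capacity = capacity
        self._events = deque()
        self._buffered_ids = set()
        self.dropped = 0  # would be exported as a counter metric

    def offer(self, event_id, payload):
        """Queue an event; return False if it is a buffered duplicate."""
        if event_id in self._buffered_ids:
            return False
        if len(self._events) >= self._capacity:
            old_id, _ = self._events.popleft()  # overflow: drop the oldest
            self._buffered_ids.discard(old_id)
            self.dropped += 1
        self._events.append((event_id, payload))
        self._buffered_ids.add(event_id)
        return True

    def drain(self, max_batch):
        """Remove and return up to max_batch events for one batch insert."""
        batch = []
        while self._events and len(batch) < max_batch:
            event = self._events.popleft()
            self._buffered_ids.discard(event[0])
            batch.append(event)
        return batch


if __name__ == "__main__":
    buf = AuditBuffer(capacity=2)
    for i in range(3):
        buf.offer(f"evt-{i}", {"op": "edit"})
    print(buf.drain(10), "dropped:", buf.dropped)
```

`offer` never raises on overflow, which reflects the invariant that an audit write failure must not abort the user-facing action.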

8. Testing strategy

Test projects mirror the new layering. Test infrastructure stays Mac-friendly: stubbed Windows-only drivers, ephemeral SQL Server (LocalDB on Windows / mcr.microsoft.com/mssql/server container on Mac), OpenLDAP container, all spun up via tests/docker-compose.yml.

Layered test pyramid

Layer	Project	What it covers
Unit	`OtOpcUa.Runtime.Tests`	Per-actor logic via `Akka.TestKit.Xunit2`. `DriverInstanceActor` state-machine transitions, `Phase7Composer` purity, `ScriptedAlarmActor` state machine, `VirtualTagActor` expression eval. Drivers mocked via `IDriver` test doubles.
Unit	`OtOpcUa.ControlPlane.Tests`	Singleton actor logic. `ConfigPublishCoordinator` happy path + timeout + concurrent ack ordering. `RedundancyStateActor` ServiceLevel computation truth table. `AuditWriterActor` batch flush + idempotency on duplicate `EventId`.
Unit	`OtOpcUa.Cluster.Tests`	Split-brain resolver config validation, role-aware membership helpers, HOCON parses.
Unit	`OtOpcUa.Security.Tests`	LDAP role mapping, JWT issuance, cookie+JWT roundtrip, `/auth/ping` expiry semantics.
Integration	`OtOpcUa.Host.IntegrationTests`	2-node in-process Akka cluster. Real SQL Server, stubbed drivers. Tests: deploy happy path, deploy timeout, deploy with one node down, singleton failover mid-deploy, ConfigDb outage + stale-config fallback, edit-then-deploy roundtrip, audit row emission.
Integration	`OtOpcUa.OpcUa.IntegrationTests`	Real OPCFoundation client connects to a running stubbed Host. Asserts: dual endpoint visible, ServerUriArray populated, ServiceLevel reflects leader status, browse + read + write through `OpcUaPublishActor`, write failures returned synchronously.
End-to-end	`OtOpcUa.E2E.Tests`	Full Host with Traefik in front, two Admin nodes + two Driver nodes (4 processes via Docker). Verifies: web UI login via LDAP, deploy from UI flows to OPC UA stack, kill admin leader → Traefik fails over within 25s, kill driver node → OPC UA clients reconnect with correct ServiceLevel. CI nightly.

Failover-specific test cases

Kill Admin leader during Dispatching phase → new Coordinator resumes, deployment seals.
Kill Admin leader during AwaitingApplyAcks → new Coordinator queries DriverHostActors, completes ack collection.
Kill Driver node during Applying → Coordinator marks that node's NodeDeploymentState=Failed after deadline; surviving Driver nodes complete their apply.
Restart Driver node mid-deploy → on restart, replays apply (idempotent).
Akka split-brain (network partition between 2 admin nodes) → keep-oldest wins; the side without the oldest member downs itself once the 15s stable-after window elapses.
Both Admin nodes restart simultaneously → deployments in Dispatching resume cleanly after cluster reforms.
Concurrent edits to the same DriverInstance from two UI sessions → last write wins, both audit rows present, no row corruption.

Deploy idempotency tests

Replay DispatchDeployment with same DeploymentId/RevisionHash → no work, ack Applied.
Apply same DeploymentArtifact twice in a row → second application is a no-op.
Crash DriverHostActor mid-apply, restart → resumes from NodeDeploymentState, completes idempotently.

Property tests

Phase7Composer.ComposeAsync is pure: same artifact → same plan, no side effects.
RedundancyStateActor ServiceLevel computation: every combination of (member-state, db-ok, stale, opc-ok, is-leader) produces expected byte.
Audit envelope generation: every mutating op produces exactly one audit row with stable ExecutionId correlation.

Mac-dev test invariants

All unit + integration tests run on macOS without Windows-only assemblies.
Cluster tests use in-process Akka.Remote on 127.0.0.1.
LDAP tests use OpenLDAP container or Security:Ldap:DevStubMode=true.

Retired tests

Anything touching ConfigGeneration lifecycle, ApplyLeaseRegistry, PeerHttpProbeLoop, FleetStatusPoller, RedundancyCoordinator peer-probe loops, RedundancyStatePublisher.

9. Risks & open questions

Akka.NET on .NET 10. Verify Akka.NET 1.5+ targets .NET 10 cleanly.
OPCFoundation SDK threading. The OPC UA stack runs its own threadpool. OpcUaPublishActor must marshal writes via thread-safe wrappers; use a dedicated synchronized-dispatcher for actors that touch the OPC UA address space.
Failure detector tuning. ScadaLink's 2s/10s is tuned for site-to-central RTT. Benchmark before locking. Aggressive tuning + GC pauses → spurious singleton handover.
ServiceLevel = Akka leader removes operator control. No escape hatch in the first v2 release. If a customer needs one later, add a PinnedPrimary column to ClusterNode and an override path in RedundancyStateActor. Out of scope now.
Long-lived v2 branch drift. Monthly rebase from main, CI runs on v2 from day one.
Schema migration is destructive. Dropping ConfigGeneration + ClusterNode.RedundancyRole is one-way. Cutover must run on a quiesced system. Provide a Migrate-To-V2.ps1 script that backs up ConfigDb, runs EF migrations, validates row counts, prints a summary.
Wonderware + mxaccessgw still external processes. Both untouched by this refactor. Future actorization would be a second refactor.
Audit row volume. Edit-heavy install ≈ 5k rows/day. Needs monthly partitioning and 365-day retention, the same as ScadaLink #23.

10. Migration plan

Big-bang on v2-akka-fuse branch:

Branch v2-akka-fuse off main.
Add new projects: OtOpcUa.Host, .Cluster, .Security, .ControlPlane, .Runtime, .ConfigDb, .Commons, .AdminUI, .OpcUaServer. Convert to OtOpcUa.slnx.
Move ConfigDb access (EF context, repos, migrations) out of Server and Admin into OtOpcUa.ConfigDb. Add DataProtection key store table.
Move LDAP + cookie + JWT out of Admin/Security into OtOpcUa.Security. Adopt 15-min JWT / 30-min sliding cookie / /auth/ping.
Build OtOpcUa.Cluster: HOCON, AkkaHostedService, role-aware membership helpers, split-brain resolver.
Build OtOpcUa.ControlPlane: ConfigPublishCoordinator, AdminOperationsActor, AuditWriterActor, FleetStatusBroadcaster, RedundancyStateActor.
Build OtOpcUa.Runtime: DriverHostActor, DriverInstanceActor, VirtualTagActor, ScriptedAlarmActor, OpcUaPublishActor, HistorianAdapterActor, PeerOpcUaProbeActor, DbHealthProbeActor.
Migrate Phase7Composer to OtOpcUa.OpcUaServer; make it pure and unit-tested.
Move Blazor components from Admin into OtOpcUa.AdminUI library; replace DriverDiagnosticsClient HTTP with in-process actor calls; rewire FleetStatusHub / AlertHub / ScriptLogHub to be fed by FleetStatusBroadcaster IHubContext.
Build OtOpcUa.Host Program.cs: role-gated startup, health endpoints (/health/ready, /health/active, /healthz), AddWindowsService.
ConfigDb migration: add Deployment, ConfigEdit, DataProtectionKeys tables; rename ClusterNodeGenerationState → NodeDeploymentState; drop ConfigGeneration; drop ClusterNode.RedundancyRole. EF migration + idempotent SQL script + Migrate-To-V2.ps1.
Delete OtOpcUa.Server, OtOpcUa.Admin, DriverInstanceBootstrapper, RedundancyCoordinator, RedundancyStatePublisher, ApplyLeaseRegistry, FleetStatusPoller, PeerHttpProbeLoop, HubTokenService. Sweep any *RedundancyRole* references.
Update deploy/Install-Services.ps1: single Windows Service per node, role via env var, Traefik service registration.
Update docs in docs/: rewrite Redundancy.md, ServiceHosting.md; add Cluster.md, ControlPlane.md, Runtime.md. Add top-level Architecture-v2.md summary.
CI: add integration test job for the 2-node cluster + OPC UA roundtrip.
Tag the last v1 release on main for backport-only fixes. Merge v2-akka-fuse → main when GA.
