Captures the brainstormed design to align OtOpcUa with ScadaLink: single role-gated binary, Akka.NET cluster with admin/driver roles, cluster singletons for control plane, per-node actor hierarchy for OPC UA runtime, dual-endpoint warm redundancy preserved with ServiceLevel driven by Akka leader, cookie+JWT auth, Traefik routing, and ScadaLink-style live-edit + deploy model replacing the draft/publish ConfigGeneration lifecycle.
OtOpcUa v2 — Akka.NET + Fused Hosting Alignment with ScadaLink
Status: Design approved, ready for implementation planning
Date: 2026-05-26
Branch: v2-akka-fuse
Sister project reference: ~/Desktop/scadalink-design (ScadaLink)
1. Motivation
OtOpcUa today runs as three separate processes (OtOpcUa.Server OPC UA host, OtOpcUa.Admin Blazor Server web UI, optional OtOpcUaWonderwareHistorian Framework sidecar) with manual operator-driven warm-redundancy failover. The sister project ScadaLink — owned by the same developer — solved similar problems with a fused single-binary, role-gated hosting model on top of an Akka.NET cluster.
The motivation for this refactor is twofold:
- Consistency. A single developer (the project owner) moves between OtOpcUa and ScadaLink frequently. Sharing patterns — hosting, auth, actor hierarchy, deployment model — reduces cognitive overhead and makes fixes portable.
- Real HA improvements. Upgrade OtOpcUa's manual operator-driven failover to automatic, Akka-cluster-driven failover with Traefik routing for the web UI. Preserve OPC UA dual-endpoint client-side failover semantics (clients connect to both nodes and pick based on ServiceLevel), now driven automatically by Akka cluster leadership.
2. Architecture overview
One binary, role-gated. OtOpcUa.Host (Microsoft.NET.Sdk.Web, .NET 10) replaces OtOpcUa.Server and OtOpcUa.Admin. Same binary on every node. Role configured via OTOPCUA_ROLES environment variable.
Two Akka roles, single cluster:
- admin: hosts Blazor web UI + cluster singletons. Singletons pinned via ClusterSingletonManagerSettings.WithRole("admin"). Traefik routes / to whichever Admin-role node /health/active reports as leader.
- driver: hosts OPC UA endpoint + per-node DriverHostActor hierarchy. Every Driver-role node always serves OPC UA; ServiceLevel computed by RedundancyStateActor is broadcast back to each Driver node and used to publish to the local OPC UA address space.
Roles are additive: OTOPCUA_ROLES=admin, OTOPCUA_ROLES=driver, or OTOPCUA_ROLES=admin,driver. Small deployments run both roles on both nodes; larger deployments separate them.
Per-role leadership. Cluster.Get(system).State.RoleLeader("driver") drives OPC UA ServiceLevel. RoleLeader("admin") drives /health/active (Traefik routing). These are independent — admin and driver leadership can land on different nodes if separated.
Cluster membership. Both nodes are seed nodes; keep-oldest split-brain resolver; down-if-alone = on; 15s stable-after; 2s heartbeat / 10s threshold. CoordinatedShutdown for graceful singleton handover. Tuning is copied exactly from ScadaLink.
OPC UA dual-endpoint preserved. Driver-role nodes all bind opc.tcp://0.0.0.0:4840. Clients still see N endpoints in ServerUriArray and fail over via ServiceLevel. OPC UA spec compliance unchanged from today.
Mac dev: role admin,driver,dev — dev short-circuits Windows-only driver registration (Galaxy, Wonderware) with explicit [DEV-STUB] log lines.
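The additive role model can be illustrated with a small sketch. This is a language-neutral illustration written in Python, not the Host's startup code: `parse_roles`, `KNOWN_ROLES`, and the two predicate helpers are hypothetical names, and both lowercasing the input and rejecting unknown roles are assumptions the design does not state.

```python
# Hypothetical sketch of role gating from an OTOPCUA_ROLES-style value.
KNOWN_ROLES = {"admin", "driver", "dev"}  # "dev" is the Mac-dev role described above


def parse_roles(raw):
    """Parse a comma-separated role string into a frozenset of role names."""
    roles = {part.strip().lower() for part in raw.split(",") if part.strip()}
    if not roles:
        raise ValueError("at least one role must be configured")
    unknown = roles - KNOWN_ROLES
    if unknown:
        # Assumption: unknown roles are a configuration error, not ignored.
        raise ValueError(f"unknown role(s): {sorted(unknown)}")
    return frozenset(roles)


def hosts_web_ui(roles):
    """Blazor UI, hubs, and auth endpoints are mapped only on admin nodes."""
    return "admin" in roles


def hosts_opc_ua(roles):
    """The OPC UA endpoint and per-node actor tree run only on driver nodes."""
    return "driver" in roles
```

For example, `parse_roles("admin,driver")` yields a set for which both predicates are true, which corresponds to the small-deployment layout where every node runs both roles.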
3. Project & process restructure
Single solution, ScadaLink-style folder layout. Existing OtOpcUa naming convention (ZB.MOM.WW.OtOpcUa.*) preserved.
New entry point & deletions
| Action | Project |
|---|---|
| New | OtOpcUa.Host — Microsoft.NET.Sdk.Web, single Program.cs, role-gated startup, AddWindowsService |
| Delete | OtOpcUa.Server (content migrates) |
| Delete | OtOpcUa.Admin (UI moves to library) |
New libraries
| Project | Owns | ScadaLink analog |
|---|---|---|
| OtOpcUa.Commons | Entity POCOs, interfaces, message contracts (Types/, Interfaces/, Entities/, Messages/) | ScadaLink.Commons |
| OtOpcUa.ConfigDb | EF Core DbContext, repositories, IAuditService, migrations, Data Protection key store | ScadaLink.ConfigurationDatabase |
| OtOpcUa.Cluster | Akka HOCON, AkkaHostedService, split-brain resolver config, role-aware membership helpers, IClusterRoleInfo | (split out of ScadaLink Host) |
| OtOpcUa.Security | LDAP bind, cookie+JWT hybrid, JWT issuance, role mapping, /auth/login, /auth/ping endpoints | ScadaLink.Security |
| OtOpcUa.ControlPlane | Cluster singletons: ConfigPublishCoordinator, AdminOperationsActor, AuditWriterActor, FleetStatusBroadcaster, RedundancyStateActor | ScadaLink.ManagementService |
| OtOpcUa.Runtime | Per-node actors: DriverHostActor, DriverInstanceActor, VirtualTagActor, ScriptedAlarmActor, OpcUaPublishActor, HistorianAdapterActor, PeerOpcUaProbeActor, DbHealthProbeActor | ScadaLink.SiteRuntime |
| OtOpcUa.OpcUaServer | OPC UA app host, address-space build, Phase7Composer extraction | (in ScadaLink.SiteRuntime/DCL) |
| OtOpcUa.AdminUI | Blazor components, hubs (FleetStatusHub, AlertHub, ScriptLogHub), auth state provider, MapAdminUI<TApp>() | ScadaLink.CentralUI |
Unchanged
- Driver projects (OtOpcUa.Driver.Galaxy, .Modbus, .S7, .AbCip, .AbLegacy, .TwinCAT, .FOCAS, .OpcUaClient): still implement IDriver, now consumed by DriverInstanceActor instead of DriverInstanceBootstrapper.
- OtOpcUa.Driver.Historian.Wonderware: .NET Framework 4.8 sidecar, named-pipe IPC, wrapped by a HistorianAdapterActor in OtOpcUa.Runtime.
- mxaccessgw sibling repo: unchanged; Galaxy driver still talks gRPC to it.
Tests
- tests/OtOpcUa.Cluster.Tests: split-brain, leadership transitions
- tests/OtOpcUa.ControlPlane.Tests: singleton actor unit tests via Akka.TestKit
- tests/OtOpcUa.Runtime.Tests: per-node actor tests, driver lifecycle
- tests/OtOpcUa.Security.Tests: LDAP, cookie+JWT roundtrip
- tests/OtOpcUa.Host.IntegrationTests: 2-node in-process cluster, deployment flow, failover, Mac-safe
- tests/OtOpcUa.OpcUa.IntegrationTests: real OPCFoundation client against stubbed Host
- tests/OtOpcUa.E2E.Tests: full stack with Traefik (nightly CI)
Deploy
- deploy/Install-Services.ps1: installs one Windows Service per node (OtOpcUaHost), passes role via env var. Old script replaced.
- deploy/traefik/: Windows Traefik config + service registration for the leader-routed /health/active.
- docker-dev/ (new, optional): 2-node Mac dev compose with stubbed drivers + LDAP + SQL Server + Traefik.
Solution file: OtOpcUa.slnx (matches ScadaLink convention; switch from current .sln).
4. Actor hierarchy
Per-node tree
Rooted under OtOpcUa.Runtime, one tree per Driver-role node:
DriverHostActor (per-node coordinator, started by Host)
├─ DriverInstanceActor (per DriverInstance row)
│ └─ children = pooled or per-subscription work
├─ VirtualTagActor (per VirtualTag row)
├─ ScriptedAlarmActor (per ScriptedAlarm row)
├─ OpcUaPublishActor (per-node bridge to OPCFoundation address space)
├─ HistorianAdapterActor (per-node, wraps Wonderware named-pipe sidecar)
├─ PeerOpcUaProbeActor (per-node, tests peer OPC UA stack health)
└─ DbHealthProbeActor (per-node, cached DB health probe)
Cluster singletons
Pinned to admin role via ClusterSingletonManagerSettings.WithRole("admin"):
| Actor | Owns | Notes |
|---|---|---|
| ConfigPublishCoordinator | The deploy protocol. Writes Deployment row, broadcasts DispatchDeployment(deploymentId) via DistributedPubSub to every DriverHostActor, tracks apply ACKs per node. | Replaces ApplyLeaseRegistry. Resumes after failover by re-reading ConfigDb state; no Akka.Persistence. |
| AdminOperationsActor | All mutating admin ops (CRUD on equipment, drivers, scripts, namespaces, ACLs). Wraps each in an audit envelope. | UI calls via ClusterSingletonProxy (in-process when UI is on Admin node). |
| AuditWriterActor | Receives AuditEvent telemetry from any node, batch-inserts into ConfigAuditLog. | Idempotent on EventId. |
| FleetStatusBroadcaster | Aggregates Akka cluster member events + per-node DriverHostStatus heartbeats. Publishes diffs to IHubContext<FleetStatusHub> and IHubContext<AlertHub>. | Push-driven; replaces today's 5s FleetStatusPoller. |
| RedundancyStateActor | Subscribes to ClusterEvent.IMemberEvent + ClusterEvent.LeaderChanged + per-node health probes. Computes ServiceLevel byte + ServerUriArray per Driver node. Publishes to DistributedPubSub topic redundancy-state. | Source of truth for OPC UA redundancy. Local OpcUaPublishActor subscribes and writes to its OPCFoundation stack. |
Supervision
| Actor | Strategy |
|---|---|
| DriverHostActor | Resume |
| DriverInstanceActor | Restart with backoff (1s → 30s, ×1.5, jitter) |
| VirtualTagActor | Restart with backoff |
| ScriptedAlarmActor | Restart with backoff; preserve alarm state via PreRestart hook |
| OpcUaPublishActor | Resume |
| HistorianAdapterActor | Restart with backoff; SQLite store-and-forward buffers during pipe outage |
| All singletons | Resume; resumable state in ConfigDb |
| Script execution actors (short-lived) | Stop on failure |
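The backoff schedule quoted for DriverInstanceActor (1s → 30s, ×1.5, jitter) can be sketched as a delay generator. This is an illustrative Python sketch, not the Akka.NET supervisor configuration; `backoff_delays` is a hypothetical name, and the ±20% jitter fraction is an assumption, since the table does not say how much jitter is applied.

```python
import random
from itertools import islice


def backoff_delays(initial=1.0, maximum=30.0, factor=1.5, jitter=0.2, rng=None):
    """Yield restart delays in seconds: geometric growth, capped, with jitter.

    jitter is a fraction of the current delay (assumed value, not from the design).
    """
    rng = rng or random.Random()
    delay = initial
    while True:
        spread = delay * jitter
        # Randomize around the nominal delay so restarts do not synchronize.
        yield max(0.0, delay + rng.uniform(-spread, spread))
        delay = min(maximum, delay * factor)


# Nominal schedule with jitter disabled: 1.0, 1.5, 2.25, 3.375, ...
nominal = list(islice(backoff_delays(jitter=0.0), 4))
```

With jitter disabled the sequence reaches the 30s cap on the tenth restart and stays there.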
State machines
- DriverInstanceActor: Become/Stash for Connecting → Connected → Reconnecting → Failed. Bad-quality publish on disconnect; transparent re-subscribe on reconnect. Write failures returned synchronously via Ask from OpcUaPublishActor.
- ConfigPublishCoordinator: Idle → Publishing → AwaitingApplyAcks → Sealed, with timeout-driven escalation if a node fails to ack within ApplyMaxDuration (default 10 min).
- RedundancyStateActor: recomputes on every membership event, debounced 250ms to coalesce bursts.
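The 250ms debounce on RedundancyStateActor can be illustrated by simulating when recomputes would fire for a given burst of events. The sketch below assumes a trailing-edge debounce (each event resets the timer, and the recompute runs once the timer expires); the design does not state which debounce variant is intended, and `debounce_fire_times` is a hypothetical helper.

```python
def debounce_fire_times(event_times_ms, window_ms=250):
    """Return the times (ms) at which a trailing-edge debounce would fire.

    Each event restarts the window; a recompute fires only when a full
    window elapses with no further events.
    """
    fire_times = []
    pending = None  # time at which the currently armed timer would fire
    for t in sorted(event_times_ms):
        if pending is not None and t >= pending:
            # The previous timer expired before this event arrived.
            fire_times.append(pending)
        pending = t + window_ms
    if pending is not None:
        fire_times.append(pending)
    return fire_times
```

Three membership events at 0, 100, and 200 ms collapse into a single recompute at 450 ms, whereas an event arriving at 1000 ms starts a fresh window.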
Communication conventions
- Tell for hot-path internal traffic (driver values, alarm state changes, publish broadcasts).
- Ask only at system boundaries (UI controller → AdminOperationsActor, with explicit timeout + cancellation token).
- DistributedPubSub for cluster-wide broadcasts (DispatchDeployment, RedundancyStateChanged, FleetStatusChanged).
- Application-level correlation IDs on every request/response message.
- Messages live in OtOpcUa.Commons.Messages.{Drivers,Deploy,Admin,Audit,Redundancy}; additive-only evolution.
Singleton persistence
No Akka.Persistence. Each singleton reads its resumable state from ConfigDb on PreStart (e.g., ConfigPublishCoordinator reads the current in-flight Deployment row + per-node NodeDeploymentState) and writes on every state transition.
Mac-dev stubs
DevNode role short-circuits driver registration. DriverInstanceActor for any Galaxy/Wonderware row enters a Stubbed Become state that returns deterministic test values. Logged at INFO with [DEV-STUB] driver={Name} reason=windows-only.
5. Web hosting, auth, and SignalR
Kestrel startup gated by admin role
Program.cs builds WebApplicationBuilder, registers all services, but only calls app.MapBlazor<App>(), app.MapHub<...>(), app.MapStaticAssets(), and auth endpoints when admin ∈ roles. Driver-only nodes still bind Kestrel for /healthz on :4841 and nothing else.
Authentication — cookie+JWT hybrid
| Layer | Config |
|---|---|
| Cookie scheme | OtOpcUa.Auth, HttpOnly, SameSite=Strict, Secure (prod) / SameAsRequest (dev). Sliding 30-min idle timeout. |
| Embedded JWT | HMAC-SHA256, 15-min expiry, claims = sub, roles, nodeAcls. |
| LDAP bind | LdapAuthService.AuthenticateAsync(user, pw) at /auth/login POST — preserved from current OtOpcUa.Admin/Security. |
| Role mapping | RoleMapper.MapGroupsToRolesAsync() — LDAP groups → FleetAdmin / ConfigEditor / ReadOnly. Stays as-is. |
| Token issuance | /auth/token returns bearer for external clients (CLI, automation). |
| Circuit expiry probe | /auth/ping returns 200/401, polled by CookieAuthenticationStateProvider to detect expiry from inside a SignalR circuit. |
| Failure mode | LDAP unreachable → new logins fail, active sessions continue. |
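The embedded-JWT row (HMAC-SHA256, 15-minute expiry, claims sub, roles, nodeAcls) can be sketched with the standard library alone. This is a minimal illustration of the token shape and expiry check, not the OtOpcUa.Security implementation; `issue_token` and `verify_token` are hypothetical names, and production code would use a vetted JWT library and also validate the header.

```python
import base64
import hashlib
import hmac
import json
import time


def _b64url(data):
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def _b64url_decode(text):
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def issue_token(key, sub, roles, node_acls, now=None, ttl_seconds=15 * 60):
    """Build a compact HS256 JWT carrying sub, roles, nodeAcls, and exp."""
    now = time.time() if now is None else now
    header = {"alg": "HS256", "typ": "JWT"}
    payload = {"sub": sub, "roles": roles, "nodeAcls": node_acls,
               "exp": int(now) + ttl_seconds}
    signing_input = ".".join(
        _b64url(json.dumps(part, separators=(",", ":")).encode("utf-8"))
        for part in (header, payload))
    signature = hmac.new(key, signing_input.encode("ascii"), hashlib.sha256).digest()
    return signing_input + "." + _b64url(signature)


def verify_token(key, token, now=None):
    """Return the claims if the signature is valid and unexpired, else None."""
    now = time.time() if now is None else now
    try:
        head, body, sig = token.split(".")
        expected = hmac.new(key, f"{head}.{body}".encode("ascii"),
                            hashlib.sha256).digest()
        if not hmac.compare_digest(expected, _b64url_decode(sig)):
            return None
        payload = json.loads(_b64url_decode(body))
    except ValueError:  # malformed token, bad base64, or bad JSON
        return None
    return payload if now < payload["exp"] else None
```

A token issued at time t verifies until just before t + 900 seconds and is rejected from then on, as is any token signed with a different key.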
Data Protection keys
services.AddDataProtection().PersistKeysToDbContext<OtOpcUaConfigDbContext>().SetApplicationName("OtOpcUa") — keys live in ConfigDb so a circuit started on Admin-node A survives if Traefik fails over to Admin-node B mid-session.
SignalR hubs
Three existing hubs preserved (/hubs/fleet, /hubs/alerts, /hubs/script-log):
- Today: FleetStatusPoller polls SQL every 5s.
- New: FleetStatusBroadcaster singleton receives Akka cluster events + per-node telemetry, pushes diffs via IHubContext<FleetStatusHub>. No polling.
- HubTokenService bearer-token fallback retired: hubs are circuit-local, cookie auth flows through SignalR natively. External hub consumers use the bearer token from /auth/token with a JwtBearer authentication scheme declaration on the hub.
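The "pushes diffs" behavior can be sketched as a comparison between the previous and current fleet snapshots. This is an assumed shape for illustration only: the design does not define the diff payload, and `diff_status` plus the `changed`/`removed` keys are hypothetical.

```python
_MISSING = object()  # sentinel so a legitimate None status is still detected


def diff_status(previous, current):
    """Compute what changed between two {node_id: status} snapshots."""
    changed = {node: status for node, status in current.items()
               if previous.get(node, _MISSING) != status}
    removed = sorted(node for node in previous if node not in current)
    return {"changed": changed, "removed": removed}
```

Only nodes whose status differs from the last published snapshot appear in the result, so an unchanged fleet produces an empty diff and nothing needs to be pushed.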
UI → backend wiring
- Reads: Blazor components inject scoped repositories from DI and read directly from ConfigDb. No change from today.
- Writes / mutating ops: Components inject IAdminOperationsClient, a thin wrapper around ClusterSingletonProxy to AdminOperationsActor. Mutations are Ask with a 10s timeout + correlation ID. Audit envelope built UI-side, completed singleton-side.
- Driver diagnostics: Today's DriverDiagnosticsClient HTTP round-trip retires. UI components ask IFleetDiagnosticsClient which delegates to ClusterClientReceptionist-published actor messages.
Health endpoints
| Endpoint | Returns | Used by |
|---|---|---|
| /health/ready | 200 once Akka member is Up + ConfigDb reachable + DataProtection key ring loaded | Service supervisor readiness gate |
| /health/active | 200 only on the Admin-role leader; 503 elsewhere | Traefik: routes browser traffic to leader |
| /healthz (existing) | 200 when Driver-role actor system is up + at least one driver registered (preserved on :4841) | Ops probes, OPC UA monitoring tools |
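The status-code rules for the two new endpoints reduce to simple predicates. The sketch below is illustrative Python rather than the Host's endpoint code; the function names are hypothetical, and returning 503 from /health/ready when a condition is unmet is an assumption (the table only specifies the 200 case for that endpoint).

```python
def health_active(roles, self_address, admin_role_leader):
    """200 only on the node that currently holds admin-role leadership."""
    is_leader = ("admin" in roles
                 and admin_role_leader is not None
                 and self_address == admin_role_leader)
    return 200 if is_leader else 503


def health_ready(member_up, db_reachable, key_ring_loaded):
    """200 once all three readiness conditions hold (503 otherwise is assumed)."""
    return 200 if (member_up and db_reachable and key_ring_loaded) else 503
```

Because exactly one admin node answers 200 on /health/active at a time, a load balancer health check against that path routes all browser traffic to the leader.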
Traefik
Windows Service (or external box). One route: host=otopcua.* → load-balance to {admin-node-a:9000, admin-node-b:9000} with /health/active health check, sticky sessions disabled (DataProtection key sharing handles continuity).
appsettings structure
Mirrors ScadaLink's per-component options pattern: Cluster:, Security:, ConfigDb:, OpcUa:, Drivers:, Historian: sections, bound to options classes owned by their respective component projects.
6. Edit + Deploy flow (replaces draft/publish generations)
The single most consequential domain change: drop the draft/publish ConfigGeneration lifecycle. Edits are live; deploy is a snapshot+push, ScadaLink-style.
Edit model
- Equipment, Driver, DriverInstance, Namespace, UnsItem, Script, VirtualTag, ScriptedAlarm, NodeAcl are edited directly via AdminOperationsActor. No draft staging, no ConfigGeneration lifecycle. Last-write-wins per row (rowversion column for stale-write detection only).
- Live edits do not affect running Driver-role nodes; running stacks reflect the last-deployed state. The UI shows a "drift" indicator when live ConfigDb state differs from last sealed deployment.
- Validation runs on edit (semantic checks: driver tag-path validity, script syntax, namespace name uniqueness) — pulled forward from deploy-time to edit-time.
Deploy model
Admin UI "Deploy" → AdminOperationsActor.Ask(StartDeployment)
AdminOperationsActor:
→ snapshot ConfigDb current state
→ ConfigComposer.Flatten() → DeploymentArtifact
→ compute RevisionHash = SHA256(canonical-serialized artifact)
→ write Deployment row (DeploymentId GUID, RevisionHash, CreatedBy, CreatedAtUtc, Status=Dispatching)
→ Ask ConfigPublishCoordinator.DispatchDeployment(deploymentId)
ConfigPublishCoordinator (cluster singleton, admin role):
→ write Deployment.Status = Dispatching
→ DistributedPubSub Publish to "deployments" topic: DispatchDeployment(deploymentId, revisionHash)
→ schedule ApplyDeadline timer (ApplyMaxDuration, default 10 min)
DriverHostActor (per node, subscribed to "deployments"):
receive DispatchDeployment(deploymentId, revisionHash):
→ if currentDeploymentRevision == revisionHash → ack Applied (idempotent)
→ else:
→ acquire per-node ApplyLock (Become Applying(deploymentId))
→ write NodeDeploymentState row (NodeId, DeploymentId, StartedAtUtc)
→ fetch artifact: read DeploymentArtifact blob from ConfigDb by deploymentId
→ diff against current applied artifact → per-instance ApplyDelta plans
→ dispatch ApplyDelta to DriverInstanceActor / VirtualTagActor / ScriptedAlarmActor children
→ collect per-instance acks (all-or-nothing per node)
→ on full success: write GenerationSealedCache (LiteDb local), update NodeDeploymentState.AppliedAtUtc
→ on any instance Failure: rollback to previous deployment, mark NodeDeploymentState=Failed
→ Tell Coordinator: ApplyAck(deploymentId, nodeId, Applied | Failed(reason))
→ Become Steady
ConfigPublishCoordinator: collect ApplyAcks
→ all Driver nodes Applied → Deployment.Status = Sealed → DistributedPubSub Publish DeploymentSealed
→ any Failed → Deployment.Status = PartiallyFailed → broadcast DeploymentFailed
→ deadline elapsed before all acks → Deployment.Status = TimedOut → broadcast DeploymentTimedOut
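The RevisionHash step in the flow above depends on a canonical serialization, so that logically identical artifacts always hash to the same value. A minimal sketch follows, assuming the artifact can be represented as JSON-compatible data and that "canonical" means sorted keys with compact separators; the design does not specify the canonical form, and `revision_hash` is a hypothetical name.

```python
import hashlib
import json


def revision_hash(artifact):
    """SHA-256 over a canonical JSON encoding of the artifact (assumed form)."""
    canonical = json.dumps(
        artifact,
        sort_keys=True,          # key order must not influence the hash
        separators=(",", ":"),   # no insignificant whitespace
        ensure_ascii=False,
    ).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()
```

Two snapshots that differ only in dictionary key order produce the same hash, which is what lets a Driver node recognize a redelivered deployment as already applied.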
Per-instance operation lock
All mutating commands (deploy, disable, enable, delete) on a DriverInstance go through DriverInstanceActor, which serializes them via the actor mailbox — single-threaded by construction.
Idempotency
- DeploymentId + RevisionHash together identify a deployment.
- DriverHostActor seeing a DispatchDeployment whose RevisionHash matches current applied state → immediate ack Applied, no work. Safe to redeliver.
- Phase7Composer.ComposeAsync(artifact) is pure; same artifact → same delta plan.
- DriverInstanceActor.ApplyDelta(plan) compares against current state, applies only diffs.
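The redelivery rule can be sketched as a guard at the top of the dispatch handler. This is an illustrative Python model, not the actor itself: `DriverHostSketch`, the tuple-shaped ack, and the `apply_artifact` callback are all hypothetical, and locking, diffing, and rollback are omitted.

```python
class DriverHostSketch:
    """Models only the idempotency check of a per-node deployment handler."""

    def __init__(self, node_id):
        self.node_id = node_id
        self.applied_revision = None  # RevisionHash of the last applied artifact
        self.apply_calls = 0

    def on_dispatch(self, deployment_id, revision_hash, apply_artifact):
        # Redelivery of an already-applied revision: ack immediately, no work.
        if revision_hash == self.applied_revision:
            return (deployment_id, self.node_id, "Applied")
        self.apply_calls += 1
        try:
            apply_artifact(deployment_id)
        except Exception as exc:
            # Applied revision is left untouched on failure.
            return (deployment_id, self.node_id, f"Failed({exc})")
        self.applied_revision = revision_hash
        return (deployment_id, self.node_id, "Applied")
```

Dispatching the same revision twice invokes the apply callback once and returns the same ack both times, which is the property that makes pub-sub redelivery safe.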
Concurrency control
- Last-write-wins on edits (no optimistic concurrency on Equipment, Driver, Script, etc.); matches ScadaLink template behavior.
- Optimistic concurrency on Deployment and NodeDeploymentState rows (rowversion column); prevents two concurrent Coordinator instances (during failover) from corrupting state.
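The rowversion guard on Deployment rows behaves like a compare-and-swap: an update succeeds only if the row has not changed since it was read. The in-memory sketch below illustrates that rule under stated assumptions; it is not EF Core code, `DeploymentTable` and `StaleRowError` are hypothetical, and an integer counter stands in for SQL Server's rowversion.

```python
class StaleRowError(Exception):
    """Raised when a row changed between read and update."""


class DeploymentTable:
    def __init__(self):
        self._rows = {}

    def insert(self, deployment_id, status):
        self._rows[deployment_id] = {"Status": status, "RowVersion": 1}

    def read(self, deployment_id):
        return dict(self._rows[deployment_id])  # copy, like a detached entity

    def update_status(self, deployment_id, new_status, expected_row_version):
        row = self._rows[deployment_id]
        if row["RowVersion"] != expected_row_version:
            raise StaleRowError(deployment_id)
        row["Status"] = new_status
        row["RowVersion"] += 1  # every successful write bumps the version
```

If an old and a new Coordinator both read the row at version 1 during failover, only the first write lands; the second raises and must re-read before deciding what to do.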
Singleton failover during deploy
- Old Coordinator wrote Deployment.Status = Dispatching + NodeDeploymentState rows before broadcast.
- New Coordinator on takeover queries Deployment rows with non-terminal Status.
- For each in-flight deployment, Ask every DriverHostActor (via cluster-aware actor selection) for current NodeDeploymentState.
- Recompute outstanding-ack set; resume the deadline timer with the remaining time.
- If apply deadline already passed → mark Deployment.Status = TimedOut for any unack'd nodes.
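The takeover steps above amount to recomputing two values: which nodes still owe an ack, and how much of the deadline is left. A sketch of that calculation follows. It assumes the deadline is measured from the Deployment row's CreatedAtUtc, which the design does not state explicitly, and `resume_after_failover` and its return shape are hypothetical.

```python
from datetime import datetime, timedelta, timezone

TERMINAL_NODE_STATES = {"Applied", "Failed"}


def resume_after_failover(created_at, now, driver_nodes, node_states,
                          apply_max_duration=timedelta(minutes=10)):
    """Return (status, outstanding_nodes, remaining_time) for one deployment.

    node_states maps node id -> last known NodeDeploymentState status;
    nodes missing from the map have not reported anything yet.
    """
    outstanding = {n for n in driver_nodes
                   if node_states.get(n) not in TERMINAL_NODE_STATES}
    remaining = created_at + apply_max_duration - now
    if not outstanding:
        failed = any(node_states[n] == "Failed" for n in driver_nodes)
        return ("PartiallyFailed" if failed else "Sealed", outstanding, timedelta(0))
    if remaining <= timedelta(0):
        return ("TimedOut", outstanding, timedelta(0))
    return ("AwaitingApplyAcks", outstanding, remaining)
```

A Coordinator that takes over four minutes into a deployment with one node still applying would resume waiting on that node with six minutes left on the timer.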
Crash recovery on Driver node restart
- DriverHostActor.PreStart reads NodeDeploymentState for self.
- If row says Applied for some DeploymentId and matches last sealed cache → Become Steady on that artifact.
- If row says Applying (didn't reach Applied) → discard partial state, re-fetch the artifact, replay apply (idempotent).
- If ConfigDb unreachable → fall back to local LiteDb sealed cache, Become Stale (drops ServiceLevel via RedundancyStateActor). Background reconnect retries every 30s.
Schema migration from today
| Today | New |
|---|---|
| ConfigGeneration (Draft/Published/Sealed lifecycle) | Dropped |
| ClusterNodeGenerationState | Renamed → NodeDeploymentState with (NodeId, DeploymentId, Status, StartedAtUtc, AppliedAtUtc, RowVersion) |
| ClusterNode.RedundancyRole column | Dropped (Akka leader-of-driver-role is source of truth) |
| ConfigAuditLog | Kept; deploy events added as new event types |
| (new) Deployment | (DeploymentId, RevisionHash, Status, CreatedBy, CreatedAtUtc, ArtifactBlob varbinary(max), RowVersion) |
| (new) ConfigEdit audit row per Equipment/Driver/Script edit | Live-edit history |
| (new) DataProtectionKeys | DataProtection key ring storage |
No more ApplyLeaseRegistry table or watchdog actor. Apply state lives in NodeDeploymentState; watchdog is a Coordinator-side scheduled message keyed by DeploymentId.
Stale-config fallback
Preserved from today's GenerationSealedCache: local LiteDb cache holds last-applied DeploymentArtifact. On Host boot with ConfigDb unreachable, DriverHostActor boots from cache → Become Stale → RedundancyStateActor drops ServiceLevel for that node.
Peer probes consolidated
| Today | New |
|---|---|
| PeerHttpProbeLoop (HTTP /healthz) | Retired; Akka failure detector replaces it |
| PeerUaProbeLoop (OPC UA opc.tcp://peer:4840) | Retained as PeerOpcUaProbeActor; tests whether the OPC UA stack itself (not just the process) is up. Feeds RedundancyStateActor. |
| DbHealthCache (cached DB probe) | Retained as DbHealthProbeActor per-node. Feeds RedundancyStateActor + /health/ready. |
ServiceLevel computation in RedundancyStateActor
serviceLevel(node) =
base 240 if (cluster member Up AND db reachable AND not stale AND opc ua probe ok)
base 200 if (member Up AND db reachable AND stale)
base 100 if (member Up AND db unreachable AND stale)
base 0 if (member Down / Unreachable)
+10 bonus if Akka driver-role leader is this node
ServiceLevel bands match the existing RedundancyStatePublisher so OPC UA client behavior is unchanged from today. The leader-bonus replaces today's operator-managed RedundancyRole = Primary.
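The band table above translates directly into a pure function, which is also the natural unit for the truth-table tests. The sketch below implements only the four listed rows plus the leader bonus. Combinations the table does not cover (for example, member Up with the database unreachable but not stale) raise an error rather than guessing, and treating a Down node as 0 with no bonus is an assumption. `service_level` is a hypothetical name.

```python
def service_level(member_up, db_reachable, stale, opc_probe_ok, is_driver_leader):
    """Compute the OPC UA ServiceLevel byte from the listed bands."""
    if not member_up:
        return 0  # Down/Unreachable; assumed to never receive the leader bonus
    if db_reachable and not stale and opc_probe_ok:
        base = 240
    elif db_reachable and stale:
        base = 200
    elif not db_reachable and stale:
        base = 100
    else:
        # Not specified by the band table; left undefined on purpose.
        raise ValueError("combination not covered by the listed bands")
    bonus = 10 if is_driver_leader else 0
    return min(255, base + bonus)  # ServiceLevel is a single byte
```

A healthy driver-role leader therefore publishes 250 and its healthy peer 240, so clients that pick the highest ServiceLevel converge on the leader.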
7. Error handling & failure modes
Akka cluster failure modes
| Scenario | Behavior |
|---|---|
| Network partition (split-brain) | Keep-oldest resolver downs the smaller side after 15s stable-after. down-if-alone = on covers isolated nodes. |
| Admin leader process crash | Failure detector trips after 10s, downs the member, new singleton instance starts on remaining Admin node. Traefik /health/active probe fails over within 1 polling interval (~5s). |
| Driver-role node crash | RedundancyStateActor sees member Down → drops that node's ServiceLevel to 0 → OPC UA clients reconnect to surviving node. Both nodes were already running their own copy; no in-cluster recovery needed for that node's work. |
| Both Admin nodes down simultaneously | Web UI unavailable. Driver nodes continue serving OPC UA from last-sealed cache. No new deployments possible until Admin node recovers. |
| All Driver nodes down | OPC UA endpoints unavailable. Clients reconnect when any Driver node returns. ServiceLevel back to 240 once member Up + DB reachable + apply sealed. |
| Singleton handover during deploy | Coordinator state survives in Deployment + NodeDeploymentState ConfigDb rows. New Coordinator queries DriverHostActors via cluster-aware actor selection. Resume remaining deadline. |
ConfigDb unavailability
- At edit time: AdminUI returns user-visible error. No retries; operator decides.
- At deploy time: Coordinator refuses to start dispatch if it can't write the Deployment row.
- At Driver node boot: Fall back to local LiteDb sealed cache. RedundancyStateActor drops ServiceLevel.
- At singleton failover: New Coordinator's PreStart retries via Polly (5 attempts, exponential backoff). If exhausted → singleton crashes → cluster restarts singleton on next viable Admin node.
Driver / equipment failures
- Driver connection loss → DriverInstanceActor enters Reconnecting Become state, publishes bad-quality to OPC UA address space immediately, retries at fixed interval.
- Tag-path-resolution failure → retried periodically.
- Write failure to driver → returned synchronously to caller via Ask from OpcUaPublishActor.
- Driver process unresponsive (Galaxy gateway down) → IDriver.HealthCheck returns degraded → DriverInstanceActor reports to DriverHostActor → RedundancyStateActor factors into ServiceLevel.
Wonderware historian sidecar
- Named-pipe disconnect → HistorianAdapterActor enters Reconnecting; alarm history rows buffered to local SQLite store-and-forward.
- Sidecar process crash → no in-cluster recovery (external process); operator restarts via Windows Service control.
Auth failures
- LDAP unreachable → /auth/login returns 503. Active sessions continue with cached claims.
- JWT signature failure (key ring drift) → 401, session terminates. DataProtection keys in ConfigDb prevent this in the happy path.
- Cookie expired (sliding 30-min idle) → /auth/ping returns 401 → CookieAuthenticationStateProvider triggers UI logout.
SignalR / circuit drops
- Blazor circuit dropped → App.razor reload script reconnects (preserved from today).
- Hub message loss during reconnect → FleetStatusBroadcaster resends current state to the reconnecting client on OnConnectedAsync (full snapshot, not just diffs).
OPC UA stack failures
- Address-space corruption → OpcUaPublishActor logs ERROR, sends RebuildAddressSpace to itself; sequence number bump notifies clients to resubscribe.
- OPC UA listener bind failure (port collision) → Host fails readiness probe, supervisor restarts service.
Audit invariants
- Audit write failures never abort the user-facing action. AuditWriterActor buffer overflow → log WARN, drop oldest (with counter metric). The action's success/failure path is authoritative.
- All deploy + edit events carry ExecutionId (per-request correlation) so audit rows for one operator action share an ID.
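The two audit-writer properties, drop-oldest on overflow and idempotency on EventId, can be sketched with a bounded queue and a keyed store. This is an illustrative in-memory model, not the AuditWriterActor: `AuditBuffer` is a hypothetical name, a dict stands in for the ConfigAuditLog table, and logging is omitted.

```python
from collections import deque


class AuditBuffer:
    """Bounded buffer that drops the oldest event when full."""

    def __init__(self, capacity):
        self._events = deque()
        self._capacity = capacity
        self.dropped = 0  # counter metric for overflow drops

    def enqueue(self, event):
        if len(self._events) >= self._capacity:
            self._events.popleft()  # drop oldest; the user action is unaffected
            self.dropped += 1
        self._events.append(event)

    def flush(self, store):
        """Batch-insert buffered events, skipping EventIds already stored."""
        inserted = 0
        while self._events:
            event = self._events.popleft()
            if event["EventId"] not in store:
                store[event["EventId"]] = event
                inserted += 1
        return inserted
```

Enqueueing never raises, which matches the invariant that an audit failure must not abort the operator's action, and a redelivered event with a known EventId is silently skipped at flush time.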
8. Testing strategy
Test projects mirror the new layering. Test infrastructure stays Mac-friendly: stubbed Windows-only drivers, ephemeral SQL Server (LocalDB on Windows / mcr.microsoft.com/mssql/server container on Mac), OpenLDAP container, all spun up via tests/docker-compose.yml.
Layered test pyramid
| Layer | Project | What it covers |
|---|---|---|
| Unit | OtOpcUa.Runtime.Tests | Per-actor logic via Akka.TestKit.Xunit2. DriverInstanceActor state-machine transitions, Phase7Composer purity, ScriptedAlarmActor state machine, VirtualTagActor expression eval. Drivers mocked via IDriver test doubles. |
| Unit | OtOpcUa.ControlPlane.Tests | Singleton actor logic. ConfigPublishCoordinator happy path + timeout + concurrent ack ordering. RedundancyStateActor ServiceLevel computation truth table. AuditWriterActor batch flush + idempotency on duplicate EventId. |
| Unit | OtOpcUa.Cluster.Tests | Split-brain resolver config validation, role-aware membership helpers, HOCON parses. |
| Unit | OtOpcUa.Security.Tests | LDAP role mapping, JWT issuance, cookie+JWT roundtrip, /auth/ping expiry semantics. |
| Integration | OtOpcUa.Host.IntegrationTests | 2-node in-process Akka cluster. Real SQL Server, stubbed drivers. Tests: deploy happy path, deploy timeout, deploy with one node down, singleton failover mid-deploy, ConfigDb outage + stale-config fallback, edit-then-deploy roundtrip, audit row emission. |
| Integration | OtOpcUa.OpcUa.IntegrationTests | Real OPCFoundation client connects to a running stubbed Host. Asserts: dual endpoint visible, ServerUriArray populated, ServiceLevel reflects leader status, browse + read + write through OpcUaPublishActor, write failures returned synchronously. |
| End-to-end | OtOpcUa.E2E.Tests | Full Host with Traefik in front, two Admin nodes + two Driver nodes (4 processes via Docker). Verifies: web UI login via LDAP, deploy from UI flows to OPC UA stack, kill admin leader → Traefik fails over within 25s, kill driver node → OPC UA clients reconnect with correct ServiceLevel. CI nightly. |
Failover-specific test cases
- Kill Admin leader during Dispatching phase → new Coordinator resumes, deployment seals.
- Kill Admin leader during AwaitingApplyAcks → new Coordinator queries DriverHostActors, completes ack collection.
- Kill Driver node during Applying → Coordinator marks that node's NodeDeploymentState=Failed after deadline; surviving Driver nodes complete their apply.
- Restart Driver node mid-deploy → on restart, replays apply (idempotent).
- Akka split-brain (network partition between 2 admin nodes) → keep-oldest wins, smaller side downs itself within 15s.
- Both Admin nodes restart simultaneously → deployments in Dispatching resume cleanly after cluster reforms.
- Concurrent edits to the same DriverInstance from two UI sessions → last write wins, both audit rows present, no row corruption.
Deploy idempotency tests
- Replay DispatchDeployment with same DeploymentId/RevisionHash → no work, ack Applied.
- Apply same DeploymentArtifact twice in a row → second application is a no-op.
- Crash DriverHostActor mid-apply, restart → resumes from NodeDeploymentState, completes idempotently.
Property tests
- Phase7Composer.ComposeAsync is pure: same artifact → same plan, no side effects.
- RedundancyStateActor ServiceLevel computation: every combination of (member-state, db-ok, stale, opc-ok, is-leader) produces expected byte.
- Audit envelope generation: every mutating op produces exactly one audit row with stable ExecutionId correlation.
Mac-dev test invariants
- All unit + integration tests run on macOS without Windows-only assemblies.
- Cluster tests use in-process Akka.Remote on 127.0.0.1.
- LDAP tests use OpenLDAP container or Security:Ldap:DevStubMode=true.
Retired tests
Anything touching ConfigGeneration lifecycle, ApplyLeaseRegistry, PeerHttpProbeLoop, FleetStatusPoller, RedundancyCoordinator peer-probe loops, RedundancyStatePublisher.
9. Risks & open questions
- Akka.NET on .NET 10. Verify Akka.NET 1.5+ targets .NET 10 cleanly.
- OPCFoundation SDK threading. The OPC UA stack runs its own threadpool. OpcUaPublishActor must marshal writes via thread-safe wrappers; use a dedicated synchronized-dispatcher for actors that touch the OPC UA address space.
- Failure detector tuning. ScadaLink's 2s/10s is tuned for site-to-central RTT. Benchmark before locking. Aggressive tuning + GC pauses → spurious singleton handover.
- ServiceLevel = Akka leader removes operator control. No escape hatch in v1. If a customer needs one later, add a PinnedPrimary column to ClusterNode and an override path in RedundancyStateActor. Out of scope now.
- Long-lived v2 branch drift. Monthly rebase from main, CI runs on v2 from day one.
- Schema migration is destructive. Dropping ConfigGeneration + ClusterNode.RedundancyRole is one-way. Cutover must run on a quiesced system. Provide a Migrate-To-V2.ps1 script that backs up ConfigDb, runs EF migrations, validates row counts, prints a summary.
- Wonderware + mxaccessgw still external processes. Both untouched by this refactor. Future actorization would be a second refactor.
- Audit row volume. Edit-heavy install ≈ 5k rows/day. Need monthly partition + 365-day retention same as ScadaLink #23.
10. Migration plan
Big-bang on v2-akka-fuse branch:
- Branch v2-akka-fuse off main.
- Add new projects: OtOpcUa.Host, .Cluster, .Security, .ControlPlane, .Runtime, .ConfigDb, .Commons, .AdminUI, .OpcUaServer. Convert to OtOpcUa.slnx.
- Move ConfigDb access (EF context, repos, migrations) out of Server and Admin into OtOpcUa.ConfigDb. Add DataProtection key store table.
- Move LDAP + cookie + JWT out of Admin/Security into OtOpcUa.Security. Adopt 15-min JWT / 30-min sliding cookie / /auth/ping.
- Build OtOpcUa.Cluster: HOCON, AkkaHostedService, role-aware membership helpers, split-brain resolver.
- Build OtOpcUa.ControlPlane: ConfigPublishCoordinator, AdminOperationsActor, AuditWriterActor, FleetStatusBroadcaster, RedundancyStateActor.
- Build OtOpcUa.Runtime: DriverHostActor, DriverInstanceActor, VirtualTagActor, ScriptedAlarmActor, OpcUaPublishActor, HistorianAdapterActor, PeerOpcUaProbeActor, DbHealthProbeActor.
- Migrate Phase7Composer to OtOpcUa.OpcUaServer; make it pure and unit-tested.
- Move Blazor components from Admin into OtOpcUa.AdminUI library; replace DriverDiagnosticsClient HTTP with in-process actor calls; rewire FleetStatusHub / AlertHub / ScriptLogHub to be fed by FleetStatusBroadcaster IHubContext.
- Build OtOpcUa.Host Program.cs: role-gated startup, health endpoints (/health/ready, /health/active, /healthz), AddWindowsService.
- ConfigDb migration: add Deployment, ConfigEdit, DataProtectionKeys tables; rename ClusterNodeGenerationState → NodeDeploymentState; drop ConfigGeneration; drop ClusterNode.RedundancyRole. EF migration + idempotent SQL script + Migrate-To-V2.ps1.
- Delete OtOpcUa.Server, OtOpcUa.Admin, DriverInstanceBootstrapper, RedundancyCoordinator, RedundancyStatePublisher, ApplyLeaseRegistry, FleetStatusPoller, PeerHttpProbeLoop, HubTokenService. Sweep any *RedundancyRole* references.
- Update deploy/Install-Services.ps1: single Windows Service per node, role via env var, Traefik service registration.
- Update docs in docs/: rewrite Redundancy.md, ServiceHosting.md; add Cluster.md, ControlPlane.md, Runtime.md. Add top-level Architecture-v2.md summary.
- CI: add integration test job for the 2-node cluster + OPC UA roundtrip.
- Tag the last v1 release on main for backport-only fixes. Merge v2-akka-fuse → main when GA.