Captures the brainstormed design to align OtOpcUa with ScadaLink: single role-gated binary, Akka.NET cluster with admin/driver roles, cluster singletons for control plane, per-node actor hierarchy for OPC UA runtime, dual-endpoint warm redundancy preserved with ServiceLevel driven by Akka leader, cookie+JWT auth, Traefik routing, and ScadaLink-style live-edit + deploy model replacing the draft/publish ConfigGeneration lifecycle.
# OtOpcUa v2: Akka.NET + Fused Hosting Alignment with ScadaLink

**Status:** Design approved, ready for implementation planning
**Date:** 2026-05-26
**Branch:** `v2-akka-fuse`
**Sister project reference:** `~/Desktop/scadalink-design` (ScadaLink)

## 1. Motivation

OtOpcUa today runs as three separate processes (the `OtOpcUa.Server` OPC UA host, the `OtOpcUa.Admin` Blazor Server web UI, and an optional `OtOpcUaWonderwareHistorian` .NET Framework sidecar) with manual, operator-driven warm-redundancy failover. The sister project ScadaLink, owned by the same developer, solved similar problems with a fused single-binary, role-gated hosting model on top of an Akka.NET cluster.

The motivation for this refactor is twofold:

1. **Consistency.** A single developer (the project owner) moves between OtOpcUa and ScadaLink frequently. Sharing patterns (hosting, auth, actor hierarchy, deployment model) reduces cognitive overhead and makes fixes portable.
2. **Real HA improvements.** Upgrade OtOpcUa's manual operator-driven failover to automatic, Akka-cluster-driven failover with Traefik routing for the web UI. Preserve OPC UA dual-endpoint client-side failover semantics (clients connect to both nodes and pick based on `ServiceLevel`), now driven automatically by Akka cluster leadership.

## 2. Architecture overview

**One binary, role-gated.** `OtOpcUa.Host` (Microsoft.NET.Sdk.Web, .NET 10) replaces `OtOpcUa.Server` and `OtOpcUa.Admin`. The same binary runs on every node. Roles are configured via the `OTOPCUA_ROLES` environment variable.

**Two Akka roles, single cluster:**

- **`admin`**: hosts the Blazor web UI and the cluster singletons. Singletons are pinned via `ClusterSingletonManagerSettings.WithRole("admin")`. Traefik routes `/` to whichever Admin-role node reports itself as leader on `/health/active`.
- **`driver`**: hosts the OPC UA endpoint and the per-node `DriverHostActor` hierarchy. Every Driver-role node always serves OPC UA; the `ServiceLevel` computed by `RedundancyStateActor` is broadcast back to each Driver node, which publishes it into its local OPC UA address space.

Roles are additive: `OTOPCUA_ROLES=admin`, `OTOPCUA_ROLES=driver`, or `OTOPCUA_ROLES=admin,driver`. Small deployments run both roles on both nodes; larger deployments separate them.

**Per-role leadership.** `Cluster.Get(system).State.RoleLeader("driver")` drives OPC UA `ServiceLevel`. `RoleLeader("admin")` drives `/health/active` (Traefik routing). These are independent: admin and driver leadership can land on different nodes if the roles are separated.

**Cluster membership.** Both nodes are seed nodes; keep-oldest split-brain resolver; `down-if-alone = on`; 15s stable-after; 2s heartbeat / 10s threshold. CoordinatedShutdown handles graceful singleton handover. Tuning is identical to ScadaLink's.

**OPC UA dual-endpoint preserved.** Driver-role nodes all bind `opc.tcp://0.0.0.0:4840`. Clients still see N endpoints in `ServerUriArray` and fail over via `ServiceLevel`. OPC UA spec compliance is unchanged from today.

**Mac dev:** role `admin,driver,dev`. The `dev` role short-circuits Windows-only driver registration (Galaxy, Wonderware) with explicit `[DEV-STUB]` log lines.

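As a rough illustration of the additive role model above, the following language-neutral sketch (Python, not the project's C#) parses an `OTOPCUA_ROLES` value and lists what a node would start. The helper names `parse_roles` and `startup_plan`, the component labels, the case-insensitive parsing, and the rejection of unknown roles are all assumptions made for the example; only the variable name, the three role names, and what each role hosts come from this section.

```python
KNOWN_ROLES = {"admin", "driver", "dev"}


def parse_roles(raw):
    """Parse a comma-separated OTOPCUA_ROLES value into a set of role names."""
    roles = frozenset(part.strip().lower() for part in raw.split(",") if part.strip())
    if not roles:
        raise ValueError("OTOPCUA_ROLES must name at least one role")
    unknown = roles - KNOWN_ROLES
    if unknown:
        # Assumption: an unknown role is treated as a configuration error.
        raise ValueError("unknown role(s): %s" % sorted(unknown))
    return roles


def startup_plan(roles):
    """Return the (hypothetical) component labels a node with these roles starts."""
    plan = []
    if "admin" in roles:
        plan += ["blazor-ui", "cluster-singletons"]
    if "driver" in roles:
        plan += ["opcua-endpoint", "driver-host-actor"]
    if "dev" in roles:
        plan.append("stub-windows-only-drivers")
    return plan
```

Under these assumptions, `startup_plan(parse_roles("admin,driver"))` lists both halves for a small deployment, while a driver-only node skips the UI and the singletons.
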
## 3. Project & process restructure

Single solution, ScadaLink-style folder layout. Existing OtOpcUa naming convention (`ZB.MOM.WW.OtOpcUa.*`) preserved.

### New entry point & deletions

| Action | Project |
|---|---|
| **New** | `OtOpcUa.Host`: `Microsoft.NET.Sdk.Web`, single Program.cs, role-gated startup, `AddWindowsService` |
| **Delete** | `OtOpcUa.Server` (content migrates) |
| **Delete** | `OtOpcUa.Admin` (UI moves to library) |

### New libraries

| Project | Owns | ScadaLink analog |
|---|---|---|
| `OtOpcUa.Commons` | Entity POCOs, interfaces, message contracts (`Types/`, `Interfaces/`, `Entities/`, `Messages/`) | `ScadaLink.Commons` |
| `OtOpcUa.ConfigDb` | EF Core `DbContext`, repositories, `IAuditService`, migrations, Data Protection key store | `ScadaLink.ConfigurationDatabase` |
| `OtOpcUa.Cluster` | Akka HOCON, `AkkaHostedService`, split-brain resolver config, role-aware membership helpers, `IClusterRoleInfo` | (split out of ScadaLink Host) |
| `OtOpcUa.Security` | LDAP bind, cookie+JWT hybrid, JWT issuance, role mapping, `/auth/login`, `/auth/ping` endpoints | `ScadaLink.Security` |
| `OtOpcUa.ControlPlane` | Cluster singletons: `ConfigPublishCoordinator`, `AdminOperationsActor`, `AuditWriterActor`, `FleetStatusBroadcaster`, `RedundancyStateActor` | `ScadaLink.ManagementService` |
| `OtOpcUa.Runtime` | Per-node actors: `DriverHostActor`, `DriverInstanceActor`, `VirtualTagActor`, `ScriptedAlarmActor`, `OpcUaPublishActor`, `HistorianAdapterActor`, `PeerOpcUaProbeActor`, `DbHealthProbeActor` | `ScadaLink.SiteRuntime` |
| `OtOpcUa.OpcUaServer` | OPC UA app host, address-space build, `Phase7Composer` extraction | (in ScadaLink.SiteRuntime/DCL) |
| `OtOpcUa.AdminUI` | Blazor components, hubs (`FleetStatusHub`, `AlertHub`, `ScriptLogHub`), auth state provider, `MapAdminUI<TApp>()` | `ScadaLink.CentralUI` |

### Unchanged

- Driver projects (`OtOpcUa.Driver.Galaxy`, `.Modbus`, `.S7`, `.AbCip`, `.AbLegacy`, `.TwinCAT`, `.FOCAS`, `.OpcUaClient`): still implement `IDriver`, now consumed by `DriverInstanceActor` instead of `DriverInstanceBootstrapper`.
- `OtOpcUa.Driver.Historian.Wonderware`: .NET Framework 4.8 sidecar, named-pipe IPC, wrapped by a `HistorianAdapterActor` in `OtOpcUa.Runtime`.
- `mxaccessgw` sibling repo: unchanged; the Galaxy driver still talks gRPC to it.

### Tests

- `tests/OtOpcUa.Cluster.Tests`: split-brain, leadership transitions
- `tests/OtOpcUa.ControlPlane.Tests`: singleton actor unit tests via Akka.TestKit
- `tests/OtOpcUa.Runtime.Tests`: per-node actor tests, driver lifecycle
- `tests/OtOpcUa.Security.Tests`: LDAP, cookie+JWT roundtrip
- `tests/OtOpcUa.Host.IntegrationTests`: 2-node in-process cluster, deployment flow, failover, Mac-safe
- `tests/OtOpcUa.OpcUa.IntegrationTests`: real OPCFoundation client against stubbed Host
- `tests/OtOpcUa.E2E.Tests`: full stack with Traefik (nightly CI)

### Deploy

- `deploy/Install-Services.ps1`: installs one Windows Service per node (`OtOpcUaHost`), passes role via env var. Old script replaced.
- `deploy/traefik/`: Windows Traefik config + service registration for the leader-routed `/health/active`.
- `docker-dev/` (new, optional): 2-node Mac dev compose with stubbed drivers + LDAP + SQL Server + Traefik.

Solution file: `OtOpcUa.slnx` (matches ScadaLink convention; switch from current `.sln`).

## 4. Actor hierarchy

### Per-node tree

Rooted under `OtOpcUa.Runtime`, one tree per Driver-role node:

```
DriverHostActor              (per-node coordinator, started by Host)
├─ DriverInstanceActor       (per DriverInstance row)
│   └─ children = pooled or per-subscription work
├─ VirtualTagActor           (per VirtualTag row)
├─ ScriptedAlarmActor        (per ScriptedAlarm row)
├─ OpcUaPublishActor         (per-node bridge to OPCFoundation address space)
├─ HistorianAdapterActor     (per-node, wraps Wonderware named-pipe sidecar)
├─ PeerOpcUaProbeActor       (per-node, tests peer OPC UA stack health)
└─ DbHealthProbeActor        (per-node, cached DB health probe)
```

### Cluster singletons

Pinned to `admin` role via `ClusterSingletonManagerSettings.WithRole("admin")`:

| Actor | Owns | Notes |
|---|---|---|
| `ConfigPublishCoordinator` | The deploy protocol. Writes `Deployment` row, broadcasts `DispatchDeployment(deploymentId)` via `DistributedPubSub` to every `DriverHostActor`, tracks apply ACKs per node. | Replaces `ApplyLeaseRegistry`. Resumes after failover by re-reading ConfigDb state; no Akka.Persistence. |
| `AdminOperationsActor` | All mutating admin ops (CRUD on equipment, drivers, scripts, namespaces, ACLs). Wraps each in an audit envelope. | UI calls via `ClusterSingletonProxy` (in-process when UI is on Admin node). |
| `AuditWriterActor` | Receives `AuditEvent` telemetry from any node, batch-inserts into `ConfigAuditLog`. | Idempotent on `EventId`. |
| `FleetStatusBroadcaster` | Aggregates Akka cluster member events + per-node `DriverHostStatus` heartbeats. Publishes diffs to `IHubContext<FleetStatusHub>` and `IHubContext<AlertHub>`. | Push-driven; replaces today's 5s `FleetStatusPoller`. |
| `RedundancyStateActor` | Subscribes to `ClusterEvent.IMemberEvent` + `ClusterEvent.LeaderChanged` + per-node health probes. Computes `ServiceLevel` byte + `ServerUriArray` per Driver node. Publishes to `DistributedPubSub` topic `redundancy-state`. | Source of truth for OPC UA redundancy. Local `OpcUaPublishActor` subscribes and writes to its OPCFoundation stack. |

### Supervision

| Actor | Strategy |
|---|---|
| `DriverHostActor` | `Resume` |
| `DriverInstanceActor` | `Restart` with backoff (1s → 30s, ×1.5, jitter) |
| `VirtualTagActor` | `Restart` with backoff |
| `ScriptedAlarmActor` | `Restart` with backoff; preserve alarm state via `PreRestart` hook |
| `OpcUaPublishActor` | `Resume` |
| `HistorianAdapterActor` | `Restart` with backoff; SQLite store-and-forward buffers during pipe outage |
| All singletons | `Resume`; resumable state in ConfigDb |
| Script execution actors (short-lived) | `Stop` on failure |

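The restart backoff for `DriverInstanceActor` (1s → 30s, ×1.5, jitter) can be sketched as a delay sequence. This is a language-neutral Python illustration, not the Akka.NET supervision API; the function name `backoff_delays` and the ±20% jitter width are assumptions, since the table only says "jitter".

```python
import random


def backoff_delays(initial=1.0, maximum=30.0, factor=1.5, jitter=0.2, rng=None):
    """Yield restart delays in seconds, growing by `factor` and capped at `maximum`."""
    rng = rng or random.Random()
    delay = initial
    while True:
        # Spread each delay by +/- jitter so restarts on different nodes do not align.
        spread = delay * jitter
        jittered = delay + rng.uniform(-spread, spread)
        yield min(maximum, max(0.0, jittered))
        # Grow the un-jittered base so jitter never compounds across restarts.
        delay = min(maximum, delay * factor)
```

With jitter disabled the sequence starts 1.0, 1.5, 2.25, 3.375 and reaches the 30-second cap on the tenth restart.
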
### State machines

- `DriverInstanceActor`: Become/Stash for `Connecting → Connected → Reconnecting → Failed`. Bad-quality publish on disconnect; transparent re-subscribe on reconnect. Write failures returned synchronously via `Ask` from `OpcUaPublishActor`.
- `ConfigPublishCoordinator`: `Idle → Publishing → AwaitingApplyAcks → Sealed`, with timeout-driven escalation if a node fails to ack within `ApplyMaxDuration` (default 10 min).
- `RedundancyStateActor`: recomputes on every membership event, debounced 250ms to coalesce bursts.

### Communication conventions

- **Tell** for hot-path internal traffic (driver values, alarm state changes, publish broadcasts).
- **Ask** only at system boundaries (UI controller → `AdminOperationsActor`, with explicit timeout + cancellation token).
- **DistributedPubSub** for cluster-wide broadcasts (`DispatchDeployment`, `RedundancyStateChanged`, `FleetStatusChanged`).
- Application-level **correlation IDs** on every request/response message.
- Messages live in `OtOpcUa.Commons.Messages.{Drivers,Deploy,Admin,Audit,Redundancy}`, with additive-only evolution.

### Singleton persistence

No Akka.Persistence. Each singleton reads its resumable state from `ConfigDb` on `PreStart` (e.g., `ConfigPublishCoordinator` reads the current in-flight `Deployment` row + per-node `NodeDeploymentState`) and writes on every state transition.

### Mac-dev stubs

The `dev` role short-circuits driver registration. The `DriverInstanceActor` for any Galaxy/Wonderware row enters a `Stubbed` Become state that returns deterministic test values. Logged at INFO with `[DEV-STUB] driver={Name} reason=windows-only`.

## 5. Web hosting, auth, and SignalR

### Kestrel startup gated by `admin` role

`Program.cs` builds a `WebApplicationBuilder` and registers all services, but calls `app.MapBlazor<App>()`, `app.MapHub<...>()`, `app.MapStaticAssets()`, and the auth endpoints only when `admin ∈ roles`. Driver-only nodes still bind Kestrel, but only to serve `/healthz` on `:4841`.

### Authentication: cookie+JWT hybrid

| Layer | Config |
|---|---|
| Cookie scheme | `OtOpcUa.Auth`, HttpOnly, SameSite=Strict, Secure (prod) / SameAsRequest (dev). Sliding 30-min idle timeout. |
| Embedded JWT | HMAC-SHA256, 15-min expiry, claims = `sub`, `roles`, `nodeAcls`. |
| LDAP bind | `LdapAuthService.AuthenticateAsync(user, pw)` at `/auth/login` POST, preserved from current `OtOpcUa.Admin/Security`. |
| Role mapping | `RoleMapper.MapGroupsToRolesAsync()`: LDAP groups → `FleetAdmin` / `ConfigEditor` / `ReadOnly`. Stays as-is. |
| Token issuance | `/auth/token` returns bearer for external clients (CLI, automation). |
| Circuit expiry probe | `/auth/ping` returns 200/401, polled by `CookieAuthenticationStateProvider` to detect expiry from inside a SignalR circuit. |
| Failure mode | LDAP unreachable → new logins fail, active sessions continue. |

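To make the "Embedded JWT" row concrete, here is a minimal standard-library sketch of an HMAC-SHA256 (HS256) token carrying the `sub`, `roles`, and `nodeAcls` claims with a 15-minute lifetime. It is written in Python for illustration only; the real service would use the .NET JWT stack. The function names and the use of the standard `exp` claim to carry the expiry are assumptions.

```python
import base64
import hashlib
import hmac
import json
import time

JWT_TTL_SECONDS = 15 * 60  # 15-minute expiry from the table above


def _b64url(data):
    """Base64url-encode without padding, as the JWT format requires."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def _b64url_decode(text):
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def _sign(secret, signing_input):
    digest = hmac.new(secret, signing_input.encode("ascii"), hashlib.sha256).digest()
    return _b64url(digest)


def issue_jwt(secret, sub, roles, node_acls, now=None):
    """Build a signed HS256 token; `exp` is an assumed claim for the expiry."""
    now = int(time.time() if now is None else now)
    header = {"alg": "HS256", "typ": "JWT"}
    claims = {"sub": sub, "roles": roles, "nodeAcls": node_acls,
              "exp": now + JWT_TTL_SECONDS}
    parts = [_b64url(json.dumps(p, separators=(",", ":")).encode("utf-8"))
             for p in (header, claims)]
    signing_input = ".".join(parts)
    return signing_input + "." + _sign(secret, signing_input)


def verify_jwt(secret, token, now=None):
    """Return the claims if the signature is valid and unexpired, else None."""
    now = time.time() if now is None else now
    pieces = token.split(".")
    if len(pieces) != 3:
        return None
    signing_input = pieces[0] + "." + pieces[1]
    # Constant-time comparison avoids leaking signature bytes via timing.
    if not hmac.compare_digest(_sign(secret, signing_input), pieces[2]):
        return None
    claims = json.loads(_b64url_decode(pieces[1]))
    return claims if claims["exp"] > now else None
```

A token verified with a different key returns `None`, which corresponds to the "JWT signature failure (key ring drift) → 401" case in the error-handling section.
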
### Data Protection keys

`services.AddDataProtection().PersistKeysToDbContext<OtOpcUaConfigDbContext>().SetApplicationName("OtOpcUa")`: keys live in `ConfigDb` so a circuit started on Admin-node A survives if Traefik fails over to Admin-node B mid-session.

### SignalR hubs

Three existing hubs preserved (`/hubs/fleet`, `/hubs/alerts`, `/hubs/script-log`):

- **Today:** `FleetStatusPoller` polls SQL every 5s.
- **New:** `FleetStatusBroadcaster` singleton receives Akka cluster events + per-node telemetry, pushes diffs via `IHubContext<FleetStatusHub>`. No polling.
- `HubTokenService` bearer-token fallback is retired: hubs are circuit-local, so cookie auth flows through SignalR natively. External hub consumers use the bearer token from `/auth/token`, with a `JwtBearer` authentication scheme declared on the hub.

### UI → backend wiring

- **Reads:** Blazor components inject scoped repositories from DI and read directly from `ConfigDb`. No change from today.
- **Writes / mutating ops:** Components inject `IAdminOperationsClient`, a thin wrapper around `ClusterSingletonProxy` to `AdminOperationsActor`. Mutations are `Ask` with a 10s timeout + correlation ID. Audit envelope built UI-side, completed singleton-side.
- **Driver diagnostics:** Today's `DriverDiagnosticsClient` HTTP round-trip is retired. UI components ask `IFleetDiagnosticsClient`, which delegates to `ClusterClientReceptionist`-published actor messages.

### Health endpoints

| Endpoint | Returns | Used by |
|---|---|---|
| `/health/ready` | 200 once Akka member is `Up` + ConfigDb reachable + DataProtection key ring loaded | Service supervisor readiness gate |
| `/health/active` | 200 only on the Admin-role leader; 503 elsewhere | Traefik: routes browser traffic to leader |
| `/healthz` (existing) | 200 when Driver-role actor system is up + at least one driver registered (preserved on `:4841`) | Ops probes, OPC UA monitoring tools |

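The status rules for `/health/ready` and `/health/active` reduce to two small predicates. The Python sketch below is language-neutral and hypothetical: the `NodeState` type and function names are invented for the example, and returning 503 from `/health/ready` when a condition is unmet is an assumption, since the table specifies only the 200 case for that endpoint.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeState:
    member_up: bool              # Akka member status is Up
    config_db_reachable: bool
    key_ring_loaded: bool        # DataProtection key ring loaded
    is_admin_role_leader: bool   # RoleLeader("admin") is this node


def health_ready(state):
    """200 once all three readiness conditions hold; 503 otherwise (assumed)."""
    ok = state.member_up and state.config_db_reachable and state.key_ring_loaded
    return 200 if ok else 503


def health_active(state):
    """200 only on the Admin-role leader; 503 elsewhere."""
    return 200 if state.is_admin_role_leader else 503
```

Keeping the two checks separate means a non-leader Admin node can be ready (200 on `/health/ready`) while still being skipped by Traefik (503 on `/health/active`).
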
### Traefik

Windows Service (or external box). One route: `host=otopcua.*` → load-balance to `{admin-node-a:9000, admin-node-b:9000}` with `/health/active` health check, sticky sessions disabled (DataProtection key sharing handles continuity).

### appsettings structure

Mirrors ScadaLink's per-component options pattern: `Cluster:`, `Security:`, `ConfigDb:`, `OpcUa:`, `Drivers:`, `Historian:` sections, bound to options classes owned by their respective component projects.

## 6. Edit + Deploy flow (replaces draft/publish generations)

The single most consequential domain change: **drop the draft/publish `ConfigGeneration` lifecycle**. Edits are live; deploy is a snapshot+push, ScadaLink-style.

### Edit model

- `Equipment`, `Driver`, `DriverInstance`, `Namespace`, `UnsItem`, `Script`, `VirtualTag`, `ScriptedAlarm`, `NodeAcl` are edited **directly** via `AdminOperationsActor`. No draft staging, no `ConfigGeneration` lifecycle. Last-write-wins per row (rowversion column for stale-write detection only).
- Live edits do **not** affect running Driver-role nodes; running stacks reflect the *last-deployed* state. The UI shows a "drift" indicator when live ConfigDb state differs from the last sealed deployment.
- Validation runs on edit (semantic checks: driver tag-path validity, script syntax, namespace name uniqueness), pulled forward from deploy-time to edit-time.

### Deploy model

```
Admin UI "Deploy" → AdminOperationsActor.Ask(StartDeployment)
AdminOperationsActor:
  → snapshot ConfigDb current state
  → ConfigComposer.Flatten() → DeploymentArtifact
  → compute RevisionHash = SHA256(canonical-serialized artifact)
  → write Deployment row (DeploymentId GUID, RevisionHash, CreatedBy, CreatedAtUtc, Status=Dispatching)
  → Ask ConfigPublishCoordinator.DispatchDeployment(deploymentId)

ConfigPublishCoordinator (cluster singleton, admin role):
  → write Deployment.Status = Dispatching
  → DistributedPubSub Publish to "deployments" topic: DispatchDeployment(deploymentId, revisionHash)
  → schedule ApplyDeadline timer (ApplyMaxDuration, default 10 min)

DriverHostActor (per node, subscribed to "deployments"):
  receive DispatchDeployment(deploymentId, revisionHash):
    → if currentDeploymentRevision == revisionHash → ack Applied (idempotent)
    → else:
      → acquire per-node ApplyLock (Become Applying(deploymentId))
      → write NodeDeploymentState row (NodeId, DeploymentId, StartedAtUtc)
      → fetch artifact: read DeploymentArtifact blob from ConfigDb by deploymentId
      → diff against current applied artifact → per-instance ApplyDelta plans
      → dispatch ApplyDelta to DriverInstanceActor / VirtualTagActor / ScriptedAlarmActor children
      → collect per-instance acks (all-or-nothing per node)
      → on full success: write GenerationSealedCache (local LiteDB), update NodeDeploymentState.AppliedAtUtc
      → on any instance Failure: rollback to previous deployment, mark NodeDeploymentState=Failed
      → Tell Coordinator: ApplyAck(deploymentId, nodeId, Applied | Failed(reason))
      → Become Steady

ConfigPublishCoordinator: collect ApplyAcks
  → all Driver nodes Applied → Deployment.Status = Sealed → DistributedPubSub Publish DeploymentSealed
  → any Failed → Deployment.Status = PartiallyFailed → broadcast DeploymentFailed
  → deadline elapsed before all acks → Deployment.Status = TimedOut → broadcast DeploymentTimedOut
```

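The Coordinator's final ack-collection step in the flow above can be read as a pure function from the acks received so far to a `Deployment.Status`. The Python sketch below is a hypothetical illustration: the function name is invented, and two points the flow does not spell out are assumptions, namely that a `Failed` ack takes precedence over an elapsed deadline and that the status stays `Dispatching` while acks are still outstanding.

```python
def resolve_deployment_status(expected_nodes, acks, deadline_elapsed):
    """Map collected ApplyAcks to a Deployment.Status value.

    expected_nodes: iterable of Driver node ids that must ack.
    acks: dict of node id -> "Applied" or "Failed" for acks received so far.
    """
    expected = list(expected_nodes)
    if any(acks.get(node) == "Failed" for node in expected):
        return "PartiallyFailed"   # any Failed ack fails the deployment
    if all(acks.get(node) == "Applied" for node in expected):
        return "Sealed"            # every Driver node applied the artifact
    if deadline_elapsed:
        return "TimedOut"          # ApplyMaxDuration passed with acks missing
    return "Dispatching"           # assumed: still waiting for acks
```

Because the function depends only on its inputs, a new Coordinator instance can call it after rebuilding `acks` from ConfigDb during failover and reach the same verdict as its predecessor would have.
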
### Per-instance operation lock

All mutating commands (deploy, disable, enable, delete) on a `DriverInstance` go through `DriverInstanceActor`, which serializes them via the actor mailbox: single-threaded by construction.

### Idempotency

- `DeploymentId` + `RevisionHash` together identify a deployment.
- `DriverHostActor` seeing a `DispatchDeployment` whose `RevisionHash` matches current applied state → immediate ack `Applied`, no work. Safe to redeliver.
- `Phase7Composer.ComposeAsync(artifact)` is pure; same artifact → same delta plan.
- `DriverInstanceActor.ApplyDelta(plan)` compares against current state, applies only diffs.

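The idempotency rules above hinge on a stable `RevisionHash`. The Python sketch below shows one way to get one and how a redelivered dispatch becomes a no-op. It is illustrative only: the document does not define the canonical serialization, so sorted-key compact JSON is an assumption, and the `DriverHost` class, its counter, and the tuple-shaped ack are hypothetical stand-ins for the actor.

```python
import hashlib
import json


def revision_hash(artifact):
    """SHA-256 over a canonical serialization (assumed: sorted-key compact JSON)."""
    canonical = json.dumps(artifact, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class DriverHost:
    """Hypothetical stand-in for DriverHostActor's dispatch handling."""

    def __init__(self):
        self.current_revision = None
        self.apply_count = 0  # counts real apply passes, for demonstration

    def on_dispatch(self, deployment_id, rev_hash, fetch_artifact):
        if rev_hash == self.current_revision:
            # Redelivery of the applied revision: ack immediately, do no work.
            return ("ApplyAck", deployment_id, "Applied")
        fetch_artifact(deployment_id)   # stands in for reading the artifact blob
        self.apply_count += 1           # stands in for diff + ApplyDelta
        self.current_revision = rev_hash
        return ("ApplyAck", deployment_id, "Applied")
```

Two artifacts that differ only in key order hash identically under this scheme, so re-snapshotting an unchanged ConfigDb would not trigger a fresh apply on any node.
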
### Concurrency control

- Last-write-wins on edits (no optimistic concurrency on `Equipment`, `Driver`, `Script`, etc.), matching ScadaLink template behavior.
- **Optimistic concurrency on `Deployment` and `NodeDeploymentState` rows** (rowversion column): prevents two concurrent Coordinator instances (during failover) from corrupting state.

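The rowversion guard on `Deployment` rows is a compare-then-write check. In the real system the database and EF Core would enforce it; the in-memory Python stand-in below only illustrates the rule, and every name in it (`DeploymentTable`, `StaleRowError`, the integer version counter) is hypothetical.

```python
class StaleRowError(Exception):
    """Raised when a writer's row version no longer matches the stored one."""


class DeploymentTable:
    """In-memory stand-in for a table with a rowversion column."""

    def __init__(self):
        self._rows = {}  # deployment_id -> (status, row_version)

    def insert(self, deployment_id, status):
        self._rows[deployment_id] = (status, 1)
        return 1

    def read(self, deployment_id):
        return self._rows[deployment_id]

    def update_status(self, deployment_id, new_status, expected_version):
        _status, version = self._rows[deployment_id]
        if version != expected_version:
            # Another writer changed the row since this writer read it.
            raise StaleRowError(deployment_id)
        self._rows[deployment_id] = (new_status, version + 1)
        return version + 1
```

If an old and a new Coordinator both read version 1 during failover, only the first update succeeds; the second raises and must re-read before deciding what to do.
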
### Singleton failover during deploy

1. Old Coordinator wrote `Deployment.Status = Dispatching` + `NodeDeploymentState` rows before broadcast.
2. New Coordinator on takeover queries `Deployment` rows with non-terminal `Status`.
3. For each in-flight deployment, `Ask` every `DriverHostActor` (via cluster-aware actor selection) for current `NodeDeploymentState`.
4. Recompute outstanding-ack set; resume the deadline timer with the remaining time.
5. If apply deadline already passed → mark `Deployment.Status = TimedOut` for any unack'd nodes.

### Crash recovery on Driver node restart

- `DriverHostActor.PreStart` reads `NodeDeploymentState` for self.
- If row says `Applied` for some `DeploymentId` and matches last sealed cache → Become Steady on that artifact.
- If row says `Applying` (didn't reach Applied) → discard partial state, re-fetch the artifact, replay apply (idempotent).
- If ConfigDb unreachable → fall back to local LiteDB sealed cache, Become `Stale` (drops ServiceLevel via `RedundancyStateActor`). Background reconnect retries every 30s.

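The restart rules above amount to a small decision table. The Python sketch below encodes only the three listed cases; the function name and action labels are hypothetical, and any combination the list does not cover (for example a `Failed` row, or an `Applied` row that does not match the cache) is reported as `"unspecified"` rather than guessed.

```python
def recovery_action(db_reachable, node_state, sealed_cache_deployment_id):
    """Pick the PreStart recovery path for a restarting Driver node.

    node_state: None, or a (deployment_id, status) pair read from
    NodeDeploymentState; ignored when ConfigDb is unreachable.
    """
    if not db_reachable:
        # Boot from the local sealed cache and advertise reduced ServiceLevel.
        return "boot-from-cache-stale"
    if node_state is None:
        return "unspecified"
    deployment_id, status = node_state
    if status == "Applied" and deployment_id == sealed_cache_deployment_id:
        return "become-steady"
    if status == "Applying":
        # Discard partial state, re-fetch the artifact, replay the apply.
        return "replay-apply"
    return "unspecified"
```
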
### Schema migration from today

| Today | New |
|---|---|
| `ConfigGeneration` (Draft/Published/Sealed lifecycle) | **Dropped** |
| `ClusterNodeGenerationState` | Renamed → `NodeDeploymentState` with `(NodeId, DeploymentId, Status, StartedAtUtc, AppliedAtUtc, RowVersion)` |
| `ClusterNode.RedundancyRole` column | **Dropped** (Akka leader-of-driver-role is source of truth) |
| `ConfigAuditLog` | Kept; deploy events added as new event types |
| (new) `Deployment` | `(DeploymentId, RevisionHash, Status, CreatedBy, CreatedAtUtc, ArtifactBlob varbinary(max), RowVersion)` |
| (new) `ConfigEdit` audit row per Equipment/Driver/Script edit | Live-edit history |
| (new) `DataProtectionKeys` | DataProtection key ring storage |

No more `ApplyLeaseRegistry` table or watchdog actor. Apply state lives in `NodeDeploymentState`; watchdog is a Coordinator-side scheduled message keyed by `DeploymentId`.

### Stale-config fallback

Preserved from today's `GenerationSealedCache`: local LiteDB cache holds last-applied `DeploymentArtifact`. On Host boot with ConfigDb unreachable, `DriverHostActor` boots from cache → Become `Stale` → `RedundancyStateActor` drops `ServiceLevel` for that node.

### Peer probes consolidated

| Today | New |
|---|---|
| `PeerHttpProbeLoop` (HTTP `/healthz`) | Retired; Akka failure detector replaces it |
| `PeerUaProbeLoop` (OPC UA `opc.tcp://peer:4840`) | **Retained** as `PeerOpcUaProbeActor`: tests whether the OPC UA stack itself (not just the process) is up. Feeds `RedundancyStateActor`. |
| `DbHealthCache` (cached DB probe) | Retained as `DbHealthProbeActor` per-node. Feeds `RedundancyStateActor` + `/health/ready`. |

### ServiceLevel computation in `RedundancyStateActor`

```
serviceLevel(node) =
  base 240 if (cluster member Up AND db reachable AND not stale AND opc ua probe ok)
  base 200 if (member Up AND db reachable AND stale)
  base 100 if (member Up AND db unreachable AND stale)
  base 0 if (member Down / Unreachable)

  +10 bonus if Akka driver-role leader is this node
```

ServiceLevel bands match the existing `RedundancyStatePublisher` so OPC UA client behavior is unchanged from today. The leader-bonus replaces today's operator-managed `RedundancyRole = Primary`.

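The band table above translates directly into a function, which is also a convenient shape for the truth-table tests planned in the testing section. This Python sketch is illustrative, not the actor's implementation: the function name is hypothetical, combinations the table does not list raise an error instead of guessing a band, and returning 0 for a Down/Unreachable member without adding the leader bonus is an assumption.

```python
def service_level(member_up, db_reachable, stale, opc_probe_ok, is_driver_leader):
    """Compute the OPC UA ServiceLevel byte for one Driver node."""
    if not member_up:
        return 0  # Down / Unreachable; assumed: no leader bonus applies
    if db_reachable and not stale and opc_probe_ok:
        base = 240
    elif db_reachable and stale:
        base = 200
    elif not db_reachable and stale:
        base = 100
    else:
        # e.g. healthy DB but failing OPC UA probe: not covered by the table.
        raise ValueError("combination not covered by the ServiceLevel table")
    return base + (10 if is_driver_leader else 0)
```

The highest reachable value is 250 (healthy leader), which still fits the single-byte `ServiceLevel` range.
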
## 7. Error handling & failure modes

### Akka cluster failure modes

| Scenario | Behavior |
|---|---|
| Network partition (split-brain) | Keep-oldest resolver downs the side that does not contain the oldest member once the 15s stable-after window elapses. `down-if-alone = on` covers the case where the oldest node is the one isolated. |
| Admin leader process crash | Failure detector trips after 10s, downs the member, new singleton instance starts on remaining Admin node. Traefik `/health/active` probe fails over within 1 polling interval (~5s). |
| Driver-role node crash | RedundancyStateActor sees member Down → drops that node's ServiceLevel to 0 → OPC UA clients reconnect to surviving node. Both nodes were already running their own copy; no in-cluster recovery needed for that node's work. |
| Both Admin nodes down simultaneously | Web UI unavailable. Driver nodes continue serving OPC UA from last-sealed cache. No new deployments possible until an Admin node recovers. |
| All Driver nodes down | OPC UA endpoints unavailable. Clients reconnect when any Driver node returns. ServiceLevel back to 240 once member Up + DB reachable + apply sealed. |
| Singleton handover during deploy | Coordinator state survives in `Deployment` + `NodeDeploymentState` ConfigDb rows. New Coordinator queries DriverHostActors via cluster-aware actor selection. Resume remaining deadline. |

### ConfigDb unavailability

- **At edit time:** AdminUI returns user-visible error. No retries; the operator decides.
- **At deploy time:** Coordinator refuses to start dispatch if it can't write the `Deployment` row.
- **At Driver node boot:** Fall back to local LiteDB sealed cache. RedundancyStateActor drops `ServiceLevel`.
- **At singleton failover:** New Coordinator's `PreStart` retries via Polly (5 attempts, exponential backoff). If exhausted → singleton crashes → cluster restarts singleton on next viable Admin node.

### Driver / equipment failures

- Driver connection loss → `DriverInstanceActor` enters `Reconnecting` Become state, publishes bad-quality to OPC UA address space immediately, retries at fixed interval.
- Tag-path-resolution failure → retried periodically.
- Write failure to driver → returned synchronously to caller via `Ask` from `OpcUaPublishActor`.
- Driver process unresponsive (Galaxy gateway down) → `IDriver.HealthCheck` returns degraded → `DriverInstanceActor` reports to `DriverHostActor` → `RedundancyStateActor` factors into ServiceLevel.

### Wonderware historian sidecar

- Named-pipe disconnect → `HistorianAdapterActor` enters `Reconnecting`; alarm history rows buffered to local SQLite store-and-forward.
- Sidecar process crash → no in-cluster recovery (external process); operator restarts via Windows Service control.

### Auth failures

- LDAP unreachable → `/auth/login` returns 503. Active sessions continue with cached claims.
- JWT signature failure (key ring drift) → 401, session terminates. DataProtection keys in ConfigDb prevent this in the happy path.
- Cookie expired (sliding 30-min idle) → `/auth/ping` returns 401 → `CookieAuthenticationStateProvider` triggers UI logout.

### SignalR / circuit drops

- Blazor circuit dropped → `App.razor` reload script reconnects (preserved from today).
- Hub message loss during reconnect → `FleetStatusBroadcaster` resends current state to the reconnecting client on `OnConnectedAsync` (full snapshot, not just diffs).

### OPC UA stack failures

- Address-space corruption → `OpcUaPublishActor` logs ERROR, sends `RebuildAddressSpace` to itself; sequence number bump notifies clients to resubscribe.
- OPC UA listener bind failure (port collision) → Host fails readiness probe, supervisor restarts service.

### Audit invariants

- Audit write failures **never abort** the user-facing action. `AuditWriterActor` buffer overflow → log WARN, drop oldest (with counter metric). The action's success/failure path is authoritative.
- All deploy + edit events carry `ExecutionId` (per-request correlation) so audit rows for one operator action share an ID.

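The "drop oldest, count the drop, never abort" rule for the audit buffer can be sketched as a bounded queue. This Python example is a hypothetical stand-in for `AuditWriterActor`'s buffering: the class and method names are invented, and deduplicating on `EventId` only within the current buffer (leaving cross-batch idempotency to the database insert) is an assumption.

```python
from collections import deque


class AuditBuffer:
    """Bounded buffer that drops the oldest event on overflow."""

    def __init__(self, capacity):
        self._capacity = capacity
        self._events = deque()
        self._buffered_ids = set()
        self.dropped = 0  # stands in for the counter metric

    def offer(self, event_id, payload):
        """Queue an event. Never raises on overflow or duplicates."""
        if event_id in self._buffered_ids:
            return  # duplicate EventId already queued: idempotent no-op
        if len(self._events) >= self._capacity:
            old_id, _old_payload = self._events.popleft()
            self._buffered_ids.discard(old_id)
            self.dropped += 1  # the real actor would also log WARN here
        self._events.append((event_id, payload))
        self._buffered_ids.add(event_id)

    def flush(self):
        """Return the queued events as one batch and empty the buffer."""
        batch = list(self._events)
        self._events.clear()
        self._buffered_ids.clear()
        return batch
```

Because `offer` has no failure path for overflow, the caller's own success or failure remains authoritative, as the invariant requires.
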
## 8. Testing strategy

Test projects mirror the new layering. Test infrastructure stays Mac-friendly: stubbed Windows-only drivers, ephemeral SQL Server (LocalDB on Windows / `mcr.microsoft.com/mssql/server` container on Mac), `OpenLDAP` container, all spun up via `tests/docker-compose.yml`.

### Layered test pyramid

| Layer | Project | What it covers |
|---|---|---|
| **Unit** | `OtOpcUa.Runtime.Tests` | Per-actor logic via `Akka.TestKit.Xunit2`. `DriverInstanceActor` state-machine transitions, `Phase7Composer` purity, `ScriptedAlarmActor` state machine, `VirtualTagActor` expression eval. Drivers mocked via `IDriver` test doubles. |
| **Unit** | `OtOpcUa.ControlPlane.Tests` | Singleton actor logic. `ConfigPublishCoordinator` happy path + timeout + concurrent ack ordering. `RedundancyStateActor` ServiceLevel computation truth table. `AuditWriterActor` batch flush + idempotency on duplicate `EventId`. |
| **Unit** | `OtOpcUa.Cluster.Tests` | Split-brain resolver config validation, role-aware membership helpers, HOCON parses. |
| **Unit** | `OtOpcUa.Security.Tests` | LDAP role mapping, JWT issuance, cookie+JWT roundtrip, `/auth/ping` expiry semantics. |
| **Integration** | `OtOpcUa.Host.IntegrationTests` | 2-node in-process Akka cluster. Real SQL Server, stubbed drivers. Tests: deploy happy path, deploy timeout, deploy with one node down, singleton failover mid-deploy, ConfigDb outage + stale-config fallback, edit-then-deploy roundtrip, audit row emission. |
| **Integration** | `OtOpcUa.OpcUa.IntegrationTests` | Real OPCFoundation client connects to a running stubbed Host. Asserts: dual endpoint visible, ServerUriArray populated, ServiceLevel reflects leader status, browse + read + write through `OpcUaPublishActor`, write failures returned synchronously. |
| **End-to-end** | `OtOpcUa.E2E.Tests` | Full Host with Traefik in front, two Admin nodes + two Driver nodes (4 processes via Docker). Verifies: web UI login via LDAP, deploy from UI flows to OPC UA stack, kill admin leader → Traefik fails over within 25s, kill driver node → OPC UA clients reconnect with correct ServiceLevel. CI nightly. |

### Failover-specific test cases

1. Kill Admin leader during `Dispatching` phase → new Coordinator resumes, deployment seals.
2. Kill Admin leader during `AwaitingApplyAcks` → new Coordinator queries DriverHostActors, completes ack collection.
3. Kill Driver node during `Applying` → Coordinator marks that node's `NodeDeploymentState=Failed` after deadline; surviving Driver nodes complete their apply.
4. Restart Driver node mid-deploy → on restart, replays apply (idempotent).
5. Akka split-brain (network partition between 2 admin nodes) → keep-oldest wins; the side without the oldest member downs itself once the 15s stable-after window elapses.
6. Both Admin nodes restart simultaneously → deployments in `Dispatching` resume cleanly after cluster reforms.
7. Concurrent edits to the same `DriverInstance` from two UI sessions → last write wins, both audit rows present, no row corruption.

### Deploy idempotency tests

- Replay `DispatchDeployment` with same `DeploymentId/RevisionHash` → no work, ack `Applied`.
- Apply same `DeploymentArtifact` twice in a row → second application is a no-op.
- Crash DriverHostActor mid-apply, restart → resumes from `NodeDeploymentState`, completes idempotently.

### Property tests

- `Phase7Composer.ComposeAsync` is pure: same artifact → same plan, no side effects.
- `RedundancyStateActor` ServiceLevel computation: every combination of (member-state, db-ok, stale, opc-ok, is-leader) produces expected byte.
- Audit envelope generation: every mutating op produces exactly one audit row with stable `ExecutionId` correlation.

### Mac-dev test invariants

- All unit + integration tests run on macOS without Windows-only assemblies.
- Cluster tests use in-process Akka.Remote on 127.0.0.1.
- LDAP tests use `OpenLDAP` container or `Security:Ldap:DevStubMode=true`.

### Retired tests

Anything touching `ConfigGeneration` lifecycle, `ApplyLeaseRegistry`, `PeerHttpProbeLoop`, `FleetStatusPoller`, `RedundancyCoordinator` peer-probe loops, `RedundancyStatePublisher`.

## 9. Risks & open questions

1. **Akka.NET on .NET 10.** Verify Akka.NET 1.5+ targets .NET 10 cleanly.
2. **OPCFoundation SDK threading.** The OPC UA stack runs its own threadpool. `OpcUaPublishActor` must marshal writes via thread-safe wrappers; use a dedicated `synchronized-dispatcher` for actors that touch the OPC UA address space.
3. **Failure detector tuning.** ScadaLink's 2s/10s is tuned for site-to-central RTT. Benchmark before locking. Aggressive tuning + GC pauses → spurious singleton handover.
4. **ServiceLevel = Akka leader removes operator control.** No escape hatch in v1. If a customer needs one later, add a `PinnedPrimary` column to `ClusterNode` and an override path in `RedundancyStateActor`. Out of scope now.
5. **Long-lived v2 branch drift.** Monthly rebase from main, CI runs on v2 from day one.
6. **Schema migration is destructive.** Dropping `ConfigGeneration` + `ClusterNode.RedundancyRole` is one-way. Cutover must run on a quiesced system. Provide a `Migrate-To-V2.ps1` script that backs up ConfigDb, runs EF migrations, validates row counts, prints a summary.
7. **Wonderware + mxaccessgw still external processes.** Both untouched by this refactor. Future actorization would be a second refactor.
8. **Audit row volume.** Edit-heavy install ≈ 5k rows/day. Need monthly partition + 365-day retention same as ScadaLink #23.

## 10. Migration plan

Big-bang on `v2-akka-fuse` branch:

1. Branch `v2-akka-fuse` off `main`.
2. Add new projects: `OtOpcUa.Host`, `.Cluster`, `.Security`, `.ControlPlane`, `.Runtime`, `.ConfigDb`, `.Commons`, `.AdminUI`, `.OpcUaServer`. Convert to `OtOpcUa.slnx`.
3. Move ConfigDb access (EF context, repos, migrations) out of `Server` and `Admin` into `OtOpcUa.ConfigDb`. Add DataProtection key store table.
4. Move LDAP + cookie + JWT out of `Admin/Security` into `OtOpcUa.Security`. Adopt 15-min JWT / 30-min sliding cookie / `/auth/ping`.
5. Build `OtOpcUa.Cluster`: HOCON, `AkkaHostedService`, role-aware membership helpers, split-brain resolver.
6. Build `OtOpcUa.ControlPlane`: `ConfigPublishCoordinator`, `AdminOperationsActor`, `AuditWriterActor`, `FleetStatusBroadcaster`, `RedundancyStateActor`.
7. Build `OtOpcUa.Runtime`: `DriverHostActor`, `DriverInstanceActor`, `VirtualTagActor`, `ScriptedAlarmActor`, `OpcUaPublishActor`, `HistorianAdapterActor`, `PeerOpcUaProbeActor`, `DbHealthProbeActor`.
8. Migrate `Phase7Composer` to `OtOpcUa.OpcUaServer`; make it pure and unit-tested.
9. Move Blazor components from `Admin` into `OtOpcUa.AdminUI` library; replace `DriverDiagnosticsClient` HTTP with in-process actor calls; rewire `FleetStatusHub` / `AlertHub` / `ScriptLogHub` to be fed by `FleetStatusBroadcaster` `IHubContext`.
10. Build `OtOpcUa.Host` `Program.cs`: role-gated startup, health endpoints (`/health/ready`, `/health/active`, `/healthz`), `AddWindowsService`.
11. ConfigDb migration: add `Deployment`, `ConfigEdit`, `DataProtectionKeys` tables; rename `ClusterNodeGenerationState` → `NodeDeploymentState`; drop `ConfigGeneration`; drop `ClusterNode.RedundancyRole`. EF migration + idempotent SQL script + `Migrate-To-V2.ps1`.
12. Delete `OtOpcUa.Server`, `OtOpcUa.Admin`, `DriverInstanceBootstrapper`, `RedundancyCoordinator`, `RedundancyStatePublisher`, `ApplyLeaseRegistry`, `FleetStatusPoller`, `PeerHttpProbeLoop`, `HubTokenService`. Sweep any `*RedundancyRole*` references.
13. Update `deploy/Install-Services.ps1`: single Windows Service per node, role via env var, Traefik service registration.
14. Update docs in `docs/`: rewrite `Redundancy.md`, `ServiceHosting.md`; add `Cluster.md`, `ControlPlane.md`, `Runtime.md`. Add top-level `Architecture-v2.md` summary.
15. CI: add integration test job for the 2-node cluster + OPC UA roundtrip.
16. Tag the last v1 release on `main` for backport-only fixes. Merge `v2-akka-fuse` → `main` when GA.
