diff --git a/docs/plans/2026-05-26-akka-hosting-alignment-design.md b/docs/plans/2026-05-26-akka-hosting-alignment-design.md
new file mode 100644
index 0000000..ed174b6
--- /dev/null
+++ b/docs/plans/2026-05-26-akka-hosting-alignment-design.md
@@ -0,0 +1,451 @@
+# OtOpcUa v2 — Akka.NET + Fused Hosting Alignment with ScadaLink
+
+**Status:** Design approved, ready for implementation planning
+**Date:** 2026-05-26
+**Branch:** `v2-akka-fuse`
+**Sister project reference:** `~/Desktop/scadalink-design` (ScadaLink)
+
+## 1. Motivation
+
+OtOpcUa today runs as three separate processes (`OtOpcUa.Server` OPC UA host, `OtOpcUa.Admin` Blazor Server web UI, optional `OtOpcUaWonderwareHistorian` Framework sidecar) with manual operator-driven warm-redundancy failover. The sister project ScadaLink — owned by the same developer — solved similar problems with a fused single-binary, role-gated hosting model on top of an Akka.NET cluster.
+
+The motivation for this refactor is twofold:
+
+1. **Consistency.** A single developer (the project owner) moves between OtOpcUa and ScadaLink frequently. Sharing patterns — hosting, auth, actor hierarchy, deployment model — reduces cognitive overhead and makes fixes portable.
+2. **Real HA improvements.** Upgrade OtOpcUa's manual operator-driven failover to automatic, Akka-cluster-driven failover with Traefik routing for the web UI. Preserve OPC UA dual-endpoint client-side failover semantics (clients connect to both nodes and pick based on `ServiceLevel`), now driven automatically by Akka cluster leadership.
+
+## 2. Architecture overview
+
+**One binary, role-gated.** `OtOpcUa.Host` (Microsoft.NET.Sdk.Web, .NET 10) replaces `OtOpcUa.Server` and `OtOpcUa.Admin`. Same binary on every node. Role configured via `OTOPCUA_ROLES` environment variable.
+
+**Two Akka roles, single cluster:**
+
+- **`admin`** — hosts Blazor web UI + cluster singletons. Singletons pinned via `ClusterSingletonManagerSettings.WithRole("admin")`. Traefik routes `/` to whichever Admin-role node `/health/active` reports as leader.
+- **`driver`** — hosts OPC UA endpoint + per-node `DriverHostActor` hierarchy. Every Driver-role node always serves OPC UA; `ServiceLevel` computed by `RedundancyStateActor` is broadcast back to each Driver node and used to publish to the local OPC UA address space.
+
+Roles are additive: `OTOPCUA_ROLES=admin`, `OTOPCUA_ROLES=driver`, or `OTOPCUA_ROLES=admin,driver`. Small deployments run both roles on both nodes; larger deployments separate them.
+
+**Per-role leadership.** `Cluster.Get(system).State.RoleLeader("driver")` drives OPC UA `ServiceLevel`. `RoleLeader("admin")` drives `/health/active` (Traefik routing). These are independent — admin and driver leadership can land on different nodes if separated.
+
+**Cluster membership.** Both seed nodes; keep-oldest split-brain resolver; `down-if-alone = on`; 15s stable-after; 2s heartbeat / 10s threshold. CoordinatedShutdown for graceful singleton handover. Exact ScadaLink tuning.
+
+**OPC UA dual-endpoint preserved.** Driver-role nodes all bind `opc.tcp://0.0.0.0:4840`. Clients still see N endpoints in `ServerUriArray` and fail over via `ServiceLevel`. OPC UA spec compliance unchanged from today.
+
+**Mac dev:** role `admin,driver,dev` — `dev` short-circuits Windows-only driver registration (Galaxy, Wonderware) with explicit `[DEV-STUB]` log lines.
+
+## 3. Project & process restructure
+
+Single solution, ScadaLink-style folder layout. Existing OtOpcUa naming convention (`ZB.MOM.WW.OtOpcUa.*`) preserved.
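As an aside on the role gating described in section 2, the following is a minimal Python sketch of how an additive `OTOPCUA_ROLES` value could be turned into startup decisions. The helper names (`parse_roles`, `startup_plan`), the plan keys, the case-insensitive parsing, and the rejection of unknown roles are all assumptions made for illustration; the design itself only fixes the role names `admin`, `driver`, and `dev`.

```python
from __future__ import annotations

# Role names taken from the design; everything else here is hypothetical.
KNOWN_ROLES = frozenset({"admin", "driver", "dev"})


def parse_roles(raw: str) -> frozenset[str]:
    """Parse a comma-separated OTOPCUA_ROLES value into a set of roles."""
    roles = frozenset(part.strip().lower() for part in raw.split(",") if part.strip())
    if not roles:
        raise ValueError("OTOPCUA_ROLES must name at least one role")
    unknown = roles - KNOWN_ROLES
    if unknown:
        raise ValueError(f"unknown roles: {sorted(unknown)}")
    return roles


def startup_plan(roles: frozenset[str]) -> dict[str, bool]:
    """Map roles to what this node starts (keys are illustrative only)."""
    return {
        "map_admin_ui": "admin" in roles,          # Blazor UI, hubs, auth endpoints
        "host_singletons": "admin" in roles,       # singletons pinned to admin role
        "serve_opc_ua": "driver" in roles,         # OPC UA endpoint + DriverHostActor
        "stub_windows_drivers": "dev" in roles,    # Mac-dev [DEV-STUB] path
    }
```

With this sketch, `startup_plan(parse_roles("admin,driver"))` enables both the UI and the OPC UA endpoint, while `parse_roles("driver")` yields a node that serves OPC UA only.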
+
+### New entry point & deletions
+
+| Action | Project |
+|---|---|
+| **New** | `OtOpcUa.Host` — `Microsoft.NET.Sdk.Web`, single Program.cs, role-gated startup, `AddWindowsService` |
+| **Delete** | `OtOpcUa.Server` (content migrates) |
+| **Delete** | `OtOpcUa.Admin` (UI moves to library) |
+
+### New libraries
+
+| Project | Owns | ScadaLink analog |
+|---|---|---|
+| `OtOpcUa.Commons` | Entity POCOs, interfaces, message contracts (`Types/`, `Interfaces/`, `Entities/`, `Messages/`) | `ScadaLink.Commons` |
+| `OtOpcUa.ConfigDb` | EF Core `DbContext`, repositories, `IAuditService`, migrations, Data Protection key store | `ScadaLink.ConfigurationDatabase` |
+| `OtOpcUa.Cluster` | Akka HOCON, `AkkaHostedService`, split-brain resolver config, role-aware membership helpers, `IClusterRoleInfo` | (split out of ScadaLink Host) |
+| `OtOpcUa.Security` | LDAP bind, cookie+JWT hybrid, JWT issuance, role mapping, `/auth/login`, `/auth/ping` endpoints | `ScadaLink.Security` |
+| `OtOpcUa.ControlPlane` | Cluster singletons: `ConfigPublishCoordinator`, `AdminOperationsActor`, `AuditWriterActor`, `FleetStatusBroadcaster`, `RedundancyStateActor` | `ScadaLink.ManagementService` |
+| `OtOpcUa.Runtime` | Per-node actors: `DriverHostActor`, `DriverInstanceActor`, `VirtualTagActor`, `ScriptedAlarmActor`, `OpcUaPublishActor`, `HistorianAdapterActor`, `PeerOpcUaProbeActor`, `DbHealthProbeActor` | `ScadaLink.SiteRuntime` |
+| `OtOpcUa.OpcUaServer` | OPC UA app host, address-space build, `Phase7Composer` extraction | (in ScadaLink.SiteRuntime/DCL) |
+| `OtOpcUa.AdminUI` | Blazor components, hubs (`FleetStatusHub`, `AlertHub`, `ScriptLogHub`), auth state provider, `MapAdminUI()` | `ScadaLink.CentralUI` |
+
+### Unchanged
+
+- Driver projects (`OtOpcUa.Driver.Galaxy`, `.Modbus`, `.S7`, `.AbCip`, `.AbLegacy`, `.TwinCAT`, `.FOCAS`, `.OpcUaClient`) — still implement `IDriver`, now consumed by `DriverInstanceActor` instead of `DriverInstanceBootstrapper`.
+- `OtOpcUa.Driver.Historian.Wonderware` — .NET Framework 4.8 sidecar, named-pipe IPC, wrapped by a `HistorianAdapterActor` in `OtOpcUa.Runtime`.
+- `mxaccessgw` sibling repo — unchanged; Galaxy driver still talks gRPC to it.
+
+### Tests
+
+- `tests/OtOpcUa.Cluster.Tests` — split-brain, leadership transitions
+- `tests/OtOpcUa.ControlPlane.Tests` — singleton actor unit tests via Akka.TestKit
+- `tests/OtOpcUa.Runtime.Tests` — per-node actor tests, driver lifecycle
+- `tests/OtOpcUa.Security.Tests` — LDAP, cookie+JWT roundtrip
+- `tests/OtOpcUa.Host.IntegrationTests` — 2-node in-process cluster, deployment flow, failover, Mac-safe
+- `tests/OtOpcUa.OpcUa.IntegrationTests` — real OPCFoundation client against stubbed Host
+- `tests/OtOpcUa.E2E.Tests` — full stack with Traefik (nightly CI)
+
+### Deploy
+
+- `deploy/Install-Services.ps1` — installs one Windows Service per node (`OtOpcUaHost`), passes role via env var. Old script replaced.
+- `deploy/traefik/` — Windows Traefik config + service registration for the leader-routed `/health/active`.
+- `docker-dev/` (new, optional) — 2-node Mac dev compose with stubbed drivers + LDAP + SQL Server + Traefik.
+
+Solution file: `OtOpcUa.slnx` (matches ScadaLink convention; switch from current `.sln`).
+
+## 4. Actor hierarchy
+
+### Per-node tree
+
+Rooted under `OtOpcUa.Runtime`, one tree per Driver-role node:
+
+```
+DriverHostActor              (per-node coordinator, started by Host)
+├─ DriverInstanceActor       (per DriverInstance row)
+│  └─ children = pooled or per-subscription work
+├─ VirtualTagActor           (per VirtualTag row)
+├─ ScriptedAlarmActor        (per ScriptedAlarm row)
+├─ OpcUaPublishActor         (per-node bridge to OPCFoundation address space)
+├─ HistorianAdapterActor     (per-node, wraps Wonderware named-pipe sidecar)
+├─ PeerOpcUaProbeActor       (per-node, tests peer OPC UA stack health)
+└─ DbHealthProbeActor        (per-node, cached DB health probe)
+```
+
+### Cluster singletons
+
+Pinned to `admin` role via `ClusterSingletonManagerSettings.WithRole("admin")`:
+
+| Actor | Owns | Notes |
+|---|---|---|
+| `ConfigPublishCoordinator` | The deploy protocol. Writes `Deployment` row, broadcasts `DispatchDeployment(deploymentId)` via `DistributedPubSub` to every `DriverHostActor`, tracks apply ACKs per node. | Replaces `ApplyLeaseRegistry`. Resumes after failover by re-reading ConfigDb state — no Akka.Persistence. |
+| `AdminOperationsActor` | All mutating admin ops (CRUD on equipment, drivers, scripts, namespaces, ACLs). Wraps each in an audit envelope. | UI calls via `ClusterSingletonProxy` (in-process when UI is on Admin node). |
+| `AuditWriterActor` | Receives `AuditEvent` telemetry from any node, batch-inserts into `ConfigAuditLog`. | Idempotent on `EventId`. |
+| `FleetStatusBroadcaster` | Aggregates Akka cluster member events + per-node `DriverHostStatus` heartbeats. Publishes diffs to `IHubContext` and `IHubContext`. | Push-driven; replaces today's 5s `FleetStatusPoller`. |
+| `RedundancyStateActor` | Subscribes to `ClusterEvent.IMemberEvent` + `ClusterEvent.LeaderChanged` + per-node health probes. Computes `ServiceLevel` byte + `ServerUriArray` per Driver node. Publishes to `DistributedPubSub` topic `redundancy-state`. | Source of truth for OPC UA redundancy. Local `OpcUaPublishActor` subscribes and writes to its OPCFoundation stack. |
+
+### Supervision
+
+| Actor | Strategy |
+|---|---|
+| `DriverHostActor` | `Resume` |
+| `DriverInstanceActor` | `Restart` with backoff (1s → 30s, ×1.5, jitter) |
+| `VirtualTagActor` | `Restart` with backoff |
+| `ScriptedAlarmActor` | `Restart` with backoff; preserve alarm state via `PreRestart` hook |
+| `OpcUaPublishActor` | `Resume` |
+| `HistorianAdapterActor` | `Restart` with backoff; SQLite store-and-forward buffers during pipe outage |
+| All singletons | `Resume`; resumable state in ConfigDb |
+| Script execution actors (short-lived) | `Stop` on failure |
+
+### State machines
+
+- `DriverInstanceActor` — Become/Stash for `Connecting → Connected → Reconnecting → Failed`. Bad-quality publish on disconnect; transparent re-subscribe on reconnect. Write failures returned synchronously via `Ask` from `OpcUaPublishActor`.
+- `ConfigPublishCoordinator` — `Idle → Publishing → AwaitingApplyAcks → Sealed`, with timeout-driven escalation if a node fails to ack within `ApplyMaxDuration` (default 10 min).
+- `RedundancyStateActor` — recomputes on every membership event, debounced 250ms to coalesce bursts.
+
+### Communication conventions
+
+- **Tell** for hot-path internal traffic (driver values, alarm state changes, publish broadcasts).
+- **Ask** only at system boundaries (UI controller → `AdminOperationsActor`, with explicit timeout + cancellation token).
+- **DistributedPubSub** for cluster-wide broadcasts (`DispatchDeployment`, `RedundancyStateChanged`, `FleetStatusChanged`).
+- Application-level **correlation IDs** on every request/response message.
+- Messages live in `OtOpcUa.Commons.Messages.{Drivers,Deploy,Admin,Audit,Redundancy}` — additive-only evolution.
+
+### Singleton persistence
+
+No Akka.Persistence. Each singleton reads its resumable state from `ConfigDb` on `PreStart` (e.g., `ConfigPublishCoordinator` reads the current in-flight `Deployment` row + per-node `NodeDeploymentState`) and writes on every state transition.
+
+### Mac-dev stubs
+
+`DevNode` role short-circuits driver registration. `DriverInstanceActor` for any Galaxy/Wonderware row enters a `Stubbed` Become state that returns deterministic test values. Logged at INFO with `[DEV-STUB] driver={Name} reason=windows-only`.
+
+## 5. Web hosting, auth, and SignalR
+
+### Kestrel startup gated by `admin` role
+
+`Program.cs` builds `WebApplicationBuilder`, registers all services, but only calls `app.MapBlazor()`, `app.MapHub<...>()`, `app.MapStaticAssets()`, and auth endpoints when `admin ∈ roles`. Driver-only nodes still bind Kestrel for `/healthz` on `:4841` and nothing else.
+
+### Authentication — cookie+JWT hybrid
+
+| Layer | Config |
+|---|---|
+| Cookie scheme | `OtOpcUa.Auth`, HttpOnly, SameSite=Strict, Secure (prod) / SameAsRequest (dev). Sliding 30-min idle timeout. |
+| Embedded JWT | HMAC-SHA256, 15-min expiry, claims = `sub`, `roles`, `nodeAcls`. |
+| LDAP bind | `LdapAuthService.AuthenticateAsync(user, pw)` at `/auth/login` POST — preserved from current `OtOpcUa.Admin/Security`. |
+| Role mapping | `RoleMapper.MapGroupsToRolesAsync()` — LDAP groups → `FleetAdmin` / `ConfigEditor` / `ReadOnly`. Stays as-is. |
+| Token issuance | `/auth/token` returns bearer for external clients (CLI, automation). |
+| Circuit expiry probe | `/auth/ping` returns 200/401, polled by `CookieAuthenticationStateProvider` to detect expiry from inside a SignalR circuit. |
+| Failure mode | LDAP unreachable → new logins fail, active sessions continue. |
+
+### Data Protection keys
+
+`services.AddDataProtection().PersistKeysToDbContext().SetApplicationName("OtOpcUa")` — keys live in `ConfigDb` so a circuit started on Admin-node A survives if Traefik fails over to Admin-node B mid-session.
+
+### SignalR hubs
+
+Three existing hubs preserved (`/hubs/fleet`, `/hubs/alerts`, `/hubs/script-log`):
+
+- **Today:** `FleetStatusPoller` polls SQL every 5s.
+- **New:** `FleetStatusBroadcaster` singleton receives Akka cluster events + per-node telemetry, pushes diffs via `IHubContext`. No polling.
+- `HubTokenService` bearer-token fallback retired — hubs are circuit-local, cookie auth flows through SignalR natively. External hub consumers use the bearer token from `/auth/token` with a `JwtBearer` authentication scheme declaration on the hub.
+
+### UI → backend wiring
+
+- **Reads:** Blazor components inject scoped repositories from DI and read directly from `ConfigDb`. No change from today.
+- **Writes / mutating ops:** Components inject `IAdminOperationsClient` — a thin wrapper around `ClusterSingletonProxy` to `AdminOperationsActor`. Mutations are `Ask` with a 10s timeout + correlation ID. Audit envelope built UI-side, completed singleton-side.
+- **Driver diagnostics:** Today's `DriverDiagnosticsClient` HTTP round-trip retires. UI components ask `IFleetDiagnosticsClient` which delegates to `ClusterClientReceptionist`-published actor messages.
+
+### Health endpoints
+
+| Endpoint | Returns | Used by |
+|---|---|---|
+| `/health/ready` | 200 once Akka member is `Up` + ConfigDb reachable + DataProtection key ring loaded | Service supervisor readiness gate |
+| `/health/active` | 200 only on the Admin-role leader; 503 elsewhere | Traefik — routes browser traffic to leader |
+| `/healthz` (existing) | 200 when Driver-role actor system is up + at least one driver registered (preserved on `:4841`) | Ops probes, OPC UA monitoring tools |
+
+### Traefik
+
+Windows Service (or external box). One route: `host=otopcua.*` → load-balance to `{admin-node-a:9000, admin-node-b:9000}` with `/health/active` health check, sticky sessions disabled (DataProtection key sharing handles continuity).
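The health endpoint table and the Traefik route above reduce to two small predicates plus a filter over the Admin nodes. The Python sketch below illustrates that logic only; it is not the Host implementation. `NodeState`, its field names, and `traefik_targets` are hypothetical, and returning 503 from `/health/ready` when a node is not ready is an assumption (the table only states when it returns 200).

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class NodeState:
    """Hypothetical snapshot of the inputs the health endpoints depend on."""
    roles: frozenset
    member_up: bool
    config_db_reachable: bool
    key_ring_loaded: bool
    is_admin_role_leader: bool


def health_ready(state: NodeState) -> int:
    """200 once the member is Up, ConfigDb is reachable and the key ring is loaded."""
    ready = state.member_up and state.config_db_reachable and state.key_ring_loaded
    return 200 if ready else 503  # 503 for "not ready" is an assumption


def health_active(state: NodeState) -> int:
    """200 only on the Admin-role leader; 503 elsewhere."""
    return 200 if "admin" in state.roles and state.is_admin_role_leader else 503


def traefik_targets(nodes: dict[str, NodeState]) -> list[str]:
    """Backends a /health/active health check would keep in rotation."""
    return [name for name, state in nodes.items() if health_active(state) == 200]
```

Because only the Admin-role leader answers 200 on `/health/active`, the filter normally yields a single backend, which is why the route can run with sticky sessions disabled.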
+
+### appsettings structure
+
+Mirrors ScadaLink's per-component options pattern: `Cluster:`, `Security:`, `ConfigDb:`, `OpcUa:`, `Drivers:`, `Historian:` sections, bound to options classes owned by their respective component projects.
+
+## 6. Edit + Deploy flow (replaces draft/publish generations)
+
+The single most consequential domain change: **drop the draft/publish `ConfigGeneration` lifecycle**. Edits are live; deploy is a snapshot+push, ScadaLink-style.
+
+### Edit model
+
+- `Equipment`, `Driver`, `DriverInstance`, `Namespace`, `UnsItem`, `Script`, `VirtualTag`, `ScriptedAlarm`, `NodeAcl` are edited **directly** via `AdminOperationsActor`. No draft staging, no `ConfigGeneration` lifecycle. Last-write-wins per row (rowversion column for stale-write detection only).
+- Live edits do **not** affect running Driver-role nodes — running stacks reflect the *last-deployed* state. The UI shows a "drift" indicator when live ConfigDb state differs from last sealed deployment.
+- Validation runs on edit (semantic checks: driver tag-path validity, script syntax, namespace name uniqueness) — pulled forward from deploy-time to edit-time.
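One plausible way to drive the drift indicator is to hash a canonical serialization of the live state and compare it with the hash recorded for the last sealed deployment (the deploy model in this section records a `RevisionHash` as a SHA256 over the canonical-serialized artifact). The Python sketch below assumes sorted-key compact JSON as the canonical form and treats "never deployed" as drift; both are illustrative choices, and `revision_hash` / `has_drift` are hypothetical helper names.

```python
from __future__ import annotations

import hashlib
import json


def revision_hash(artifact: dict) -> str:
    """SHA-256 over a canonical serialization of the artifact.

    Canonical form here is sorted-key, compact JSON; the design only says
    "canonical-serialized", so this encoding is an assumption.
    """
    canonical = json.dumps(artifact, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def has_drift(live_state: dict, last_sealed_hash: str | None) -> bool:
    """True when live ConfigDb state differs from the last sealed deployment."""
    if last_sealed_hash is None:
        return True  # assumption: nothing sealed yet counts as drift
    return revision_hash(live_state) != last_sealed_hash
```

Sorting keys before hashing makes the result independent of dictionary insertion order, so two snapshots with the same content produce the same hash regardless of how they were assembled.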
+
+### Deploy model
+
+```
+Admin UI "Deploy" → AdminOperationsActor.Ask(StartDeployment)
+AdminOperationsActor:
+  → snapshot ConfigDb current state
+  → ConfigComposer.Flatten() → DeploymentArtifact
+  → compute RevisionHash = SHA256(canonical-serialized artifact)
+  → write Deployment row (DeploymentId GUID, RevisionHash, CreatedBy, CreatedAtUtc, Status=Dispatching)
+  → Ask ConfigPublishCoordinator.DispatchDeployment(deploymentId)
+
+ConfigPublishCoordinator (cluster singleton, admin role):
+  → write Deployment.Status = Dispatching
+  → DistributedPubSub Publish to "deployments" topic: DispatchDeployment(deploymentId, revisionHash)
+  → schedule ApplyDeadline timer (ApplyMaxDuration, default 10 min)
+
+DriverHostActor (per node, subscribed to "deployments"):
+  receive DispatchDeployment(deploymentId, revisionHash):
+    → if currentDeploymentRevision == revisionHash → ack Applied (idempotent)
+    → else:
+      → acquire per-node ApplyLock (Become Applying(deploymentId))
+      → write NodeDeploymentState row (NodeId, DeploymentId, StartedAtUtc)
+      → fetch artifact: read DeploymentArtifact blob from ConfigDb by deploymentId
+      → diff against current applied artifact → per-instance ApplyDelta plans
+      → dispatch ApplyDelta to DriverInstanceActor / VirtualTagActor / ScriptedAlarmActor children
+      → collect per-instance acks (all-or-nothing per node)
+      → on full success: write GenerationSealedCache (LiteDb local), update NodeDeploymentState.AppliedAtUtc
+      → on any instance Failure: rollback to previous deployment, mark NodeDeploymentState=Failed
+      → Tell Coordinator: ApplyAck(deploymentId, nodeId, Applied | Failed(reason))
+      → Become Steady
+
+ConfigPublishCoordinator: collect ApplyAcks
+  → all Driver nodes Applied → Deployment.Status = Sealed → DistributedPubSub PublishDeploymentSealed
+  → any Failed → Deployment.Status = PartiallyFailed → broadcast DeploymentFailed
+  → deadline elapsed before all acks → Deployment.Status = TimedOut → broadcast DeploymentTimedOut
+```
+
+### Per-instance operation lock
+
+All mutating commands (deploy, disable, enable, delete) on a `DriverInstance` go through `DriverInstanceActor`, which serializes them via the actor mailbox — single-threaded by construction.
+
+### Idempotency
+
+- `DeploymentId` + `RevisionHash` together identify a deployment.
+- `DriverHostActor` seeing a `DispatchDeployment` whose `RevisionHash` matches current applied state → immediate ack `Applied`, no work. Safe to redeliver.
+- `Phase7Composer.ComposeAsync(artifact)` is pure; same artifact → same delta plan.
+- `DriverInstanceActor.ApplyDelta(plan)` compares against current state, applies only diffs.
+
+### Concurrency control
+
+- Last-write-wins on edits (no optimistic concurrency on `Equipment`, `Driver`, `Script`, etc.) — matches ScadaLink template behavior.
+- **Optimistic concurrency on `Deployment` and `NodeDeploymentState` rows** (rowversion column) — prevents two concurrent Coordinator instances (during failover) from corrupting state.
+
+### Singleton failover during deploy
+
+1. Old Coordinator wrote `Deployment.Status = Dispatching` + `NodeDeploymentState` rows before broadcast.
+2. New Coordinator on takeover queries `Deployment` rows with non-terminal `Status`.
+3. For each in-flight deployment, `Ask` every `DriverHostActor` (via cluster-aware actor selection) for current `NodeDeploymentState`.
+4. Recompute outstanding-ack set; resume the deadline timer with the remaining time.
+5. If apply deadline already passed → mark `Deployment.Status = TimedOut` for any unack'd nodes.
+
+### Crash recovery on Driver node restart
+
+- `DriverHostActor.PreStart` reads `NodeDeploymentState` for self.
+- If row says `Applied` for some `DeploymentId` and matches last sealed cache → Become Steady on that artifact.
+- If row says `Applying` (didn't reach Applied) → discard partial state, re-fetch the artifact, replay apply (idempotent).
+- If ConfigDb unreachable → fall back to local LiteDb sealed cache, Become `Stale` (drops ServiceLevel via `RedundancyStateActor`). Background reconnect retries every 30s.
+
+### Schema migration from today
+
+| Today | New |
+|---|---|
+| `ConfigGeneration` (Draft/Published/Sealed lifecycle) | **Dropped** |
+| `ClusterNodeGenerationState` | Renamed → `NodeDeploymentState` with `(NodeId, DeploymentId, Status, StartedAtUtc, AppliedAtUtc, RowVersion)` |
+| `ClusterNode.RedundancyRole` column | **Dropped** (Akka leader-of-driver-role is source of truth) |
+| `ConfigAuditLog` | Kept; deploy events added as new event types |
+| (new) `Deployment` | `(DeploymentId, RevisionHash, Status, CreatedBy, CreatedAtUtc, ArtifactBlob varbinary(max), RowVersion)` |
+| (new) `ConfigEdit` audit row per Equipment/Driver/Script edit | Live-edit history |
+| (new) `DataProtectionKeys` | DataProtection key ring storage |
+
+No more `ApplyLeaseRegistry` table or watchdog actor. Apply state lives in `NodeDeploymentState`; watchdog is a Coordinator-side scheduled message keyed by `DeploymentId`.
+
+### Stale-config fallback
+
+Preserved from today's `GenerationSealedCache`: local LiteDb cache holds last-applied `DeploymentArtifact`. On Host boot with ConfigDb unreachable, `DriverHostActor` boots from cache → Become `Stale` → `RedundancyStateActor` drops `ServiceLevel` for that node.
+
+### Peer probes consolidated
+
+| Today | New |
+|---|---|
+| `PeerHttpProbeLoop` (HTTP `/healthz`) | Retired — Akka failure detector replaces it |
+| `PeerUaProbeLoop` (OPC UA `opc.tcp://peer:4840`) | **Retained** as `PeerOpcUaProbeActor` — tests whether the OPC UA stack itself (not just the process) is up. Feeds `RedundancyStateActor`. |
+| `DbHealthCache` (cached DB probe) | Retained as `DbHealthProbeActor` per-node. Feeds `RedundancyStateActor` + `/health/ready`. |
+
+### ServiceLevel computation in `RedundancyStateActor`
+
+```
+serviceLevel(node) =
+  base 240 if (cluster member Up AND db reachable AND not stale AND opc ua probe ok)
+  base 200 if (member Up AND db reachable AND stale)
+  base 100 if (member Up AND db unreachable AND stale)
+  base 0   if (member Down / Unreachable)
+
++10 bonus if Akka driver-role leader is this node
+```
+
+ServiceLevel bands match the existing `RedundancyStatePublisher` so OPC UA client behavior is unchanged from today. The leader-bonus replaces today's operator-managed `RedundancyRole = Primary`.
+
+## 7. Error handling & failure modes
+
+### Akka cluster failure modes
+
+| Scenario | Behavior |
+|---|---|
+| Network partition (split-brain) | Keep-oldest resolver downs the smaller side after 15s stable-after. `down-if-alone = on` covers isolated nodes. |
+| Admin leader process crash | Failure detector trips after 10s, downs the member, new singleton instance starts on remaining Admin node. Traefik `/health/active` probe fails over within 1 polling interval (~5s). |
+| Driver-role node crash | RedundancyStateActor sees member Down → drops that node's ServiceLevel to 0 → OPC UA clients reconnect to surviving node. Both nodes were already running their own copy; no in-cluster recovery needed for that node's work. |
+| Both Admin nodes down simultaneously | Web UI unavailable. Driver nodes continue serving OPC UA from last-sealed cache. No new deployments possible until Admin node recovers. |
+| All Driver nodes down | OPC UA endpoints unavailable. Clients reconnect when any Driver node returns. ServiceLevel back to 240 once member Up + DB reachable + apply sealed. |
+| Singleton handover during deploy | Coordinator state survives in `Deployment` + `NodeDeploymentState` ConfigDb rows. New Coordinator queries DriverHostActors via cluster-aware actor selection. Resume remaining deadline. |
+
+### ConfigDb unavailability
+
+- **At edit time:** AdminUI returns user-visible error. No retries — operator decides.
+- **At deploy time:** Coordinator refuses to start dispatch if it can't write the `Deployment` row.
+- **At Driver node boot:** Fall back to local LiteDb sealed cache. RedundancyStateActor drops `ServiceLevel`.
+- **At singleton failover:** New Coordinator's `PreStart` retries via Polly (5 attempts, exponential backoff). If exhausted → singleton crashes → cluster restarts singleton on next viable Admin node.
+
+### Driver / equipment failures
+
+- Driver connection loss → `DriverInstanceActor` enters `Reconnecting` Become state, publishes bad-quality to OPC UA address space immediately, retries at fixed interval.
+- Tag-path-resolution failure → retried periodically.
+- Write failure to driver → returned synchronously to caller via `Ask` from `OpcUaPublishActor`.
+- Driver process unresponsive (Galaxy gateway down) → `IDriver.HealthCheck` returns degraded → `DriverInstanceActor` reports to `DriverHostActor` → `RedundancyStateActor` factors into ServiceLevel.
+
+### Wonderware historian sidecar
+
+- Named-pipe disconnect → `HistorianAdapterActor` enters `Reconnecting`; alarm history rows buffered to local SQLite store-and-forward.
+- Sidecar process crash → no in-cluster recovery (external process); operator restarts via Windows Service control.
+
+### Auth failures
+
+- LDAP unreachable → `/auth/login` returns 503. Active sessions continue with cached claims.
+- JWT signature failure (key ring drift) → 401, session terminates. DataProtection keys in ConfigDb prevent this in the happy path.
+- Cookie expired (sliding 30-min idle) → `/auth/ping` returns 401 → `CookieAuthenticationStateProvider` triggers UI logout.
+
+### SignalR / circuit drops
+
+- Blazor circuit dropped → `App.razor` reload script reconnects (preserved from today).
+- Hub message loss during reconnect → `FleetStatusBroadcaster` resends current state to the reconnecting client on `OnConnectedAsync` (full snapshot, not just diffs).
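Several failure modes in this section are expressed as a `ServiceLevel` drop, so it may help to see the section 6 bands as executable logic. The Python sketch below is a literal transcription of those bands and nothing more. Two points are assumptions of the sketch rather than statements from the design: a Down/Unreachable member returns 0 with no leader bonus, and input combinations the bands do not list (for example Up, DB reachable, not stale, but OPC UA probe failing) raise an error instead of guessing a value.

```python
def service_level(
    member_up: bool,
    db_reachable: bool,
    stale: bool,
    opc_probe_ok: bool,
    is_driver_leader: bool,
) -> int:
    """Transcribe the ServiceLevel bands: base band plus a +10 leader bonus."""
    if not member_up:
        # Down / Unreachable: base 0. Skipping the bonus here is an assumption;
        # a member that is not Up should not be role leader anyway.
        return 0

    if db_reachable and not stale and opc_probe_ok:
        base = 240
    elif db_reachable and stale:
        base = 200
    elif not db_reachable and stale:
        base = 100
    else:
        # The listed bands do not cover this combination; the sketch refuses
        # to invent a value for it.
        raise ValueError("combination not covered by the ServiceLevel bands")

    return base + (10 if is_driver_leader else 0)
```

A fully healthy driver-role leader therefore publishes 250 and a healthy follower 240, both within the byte range OPC UA uses for `ServiceLevel`. Enumerating all 32 input combinations against this function is one way to surface the uncovered cases before the truth-table property test is written.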
+
+### OPC UA stack failures
+
+- Address-space corruption → `OpcUaPublishActor` logs ERROR, sends `RebuildAddressSpace` to itself; sequence number bump notifies clients to resubscribe.
+- OPC UA listener bind failure (port collision) → Host fails readiness probe, supervisor restarts service.
+
+### Audit invariants
+
+- Audit write failures **never abort** the user-facing action. `AuditWriterActor` buffer overflow → log WARN, drop oldest (with counter metric). The action's success/failure path is authoritative.
+- All deploy + edit events carry `ExecutionId` (per-request correlation) so audit rows for one operator action share an ID.
+
+## 8. Testing strategy
+
+Test projects mirror the new layering. Test infrastructure stays Mac-friendly: stubbed Windows-only drivers, ephemeral SQL Server (LocalDB on Windows / `mcr.microsoft.com/mssql/server` container on Mac), `OpenLDAP` container, all spun up via `tests/docker-compose.yml`.
+
+### Layered test pyramid
+
+| Layer | Project | What it covers |
+|---|---|---|
+| **Unit** | `OtOpcUa.Runtime.Tests` | Per-actor logic via `Akka.TestKit.Xunit2`. `DriverInstanceActor` state-machine transitions, `Phase7Composer` purity, `ScriptedAlarmActor` state machine, `VirtualTagActor` expression eval. Drivers mocked via `IDriver` test doubles. |
+| **Unit** | `OtOpcUa.ControlPlane.Tests` | Singleton actor logic. `ConfigPublishCoordinator` happy path + timeout + concurrent ack ordering. `RedundancyStateActor` ServiceLevel computation truth table. `AuditWriterActor` batch flush + idempotency on duplicate `EventId`. |
+| **Unit** | `OtOpcUa.Cluster.Tests` | Split-brain resolver config validation, role-aware membership helpers, HOCON parses. |
+| **Unit** | `OtOpcUa.Security.Tests` | LDAP role mapping, JWT issuance, cookie+JWT roundtrip, `/auth/ping` expiry semantics. |
+| **Integration** | `OtOpcUa.Host.IntegrationTests` | 2-node in-process Akka cluster. Real SQL Server, stubbed drivers. Tests: deploy happy path, deploy timeout, deploy with one node down, singleton failover mid-deploy, ConfigDb outage + stale-config fallback, edit-then-deploy roundtrip, audit row emission. |
+| **Integration** | `OtOpcUa.OpcUa.IntegrationTests` | Real OPCFoundation client connects to a running stubbed Host. Asserts: dual endpoint visible, ServerUriArray populated, ServiceLevel reflects leader status, browse + read + write through `OpcUaPublishActor`, write failures returned synchronously. |
+| **End-to-end** | `OtOpcUa.E2E.Tests` | Full Host with Traefik in front, two Admin nodes + two Driver nodes (4 processes via Docker). Verifies: web UI login via LDAP, deploy from UI flows to OPC UA stack, kill admin leader → Traefik fails over within 25s, kill driver node → OPC UA clients reconnect with correct ServiceLevel. CI nightly. |
+
+### Failover-specific test cases
+
+1. Kill Admin leader during `Dispatching` phase → new Coordinator resumes, deployment seals.
+2. Kill Admin leader during `AwaitingApplyAcks` → new Coordinator queries DriverHostActors, completes ack collection.
+3. Kill Driver node during `Applying` → Coordinator marks that node's `NodeDeploymentState=Failed` after deadline; surviving Driver nodes complete their apply.
+4. Restart Driver node mid-deploy → on restart, replays apply (idempotent).
+5. Akka split-brain (network partition between 2 admin nodes) → keep-oldest wins, smaller side downs itself within 15s.
+6. Both Admin nodes restart simultaneously → deployments in `Dispatching` resume cleanly after cluster reforms.
+7. Concurrent edits to the same `DriverInstance` from two UI sessions → last write wins, both audit rows present, no row corruption.
+
+### Deploy idempotency tests
+
+- Replay `DispatchDeployment` with same `DeploymentId/RevisionHash` → no work, ack `Applied`.
+- Apply same `DeploymentArtifact` twice in a row → second application is a no-op.
+- Crash DriverHostActor mid-apply, restart → resumes from `NodeDeploymentState`, completes idempotently.
+
+### Property tests
+
+- `Phase7Composer.ComposeAsync` is pure: same artifact → same plan, no side effects.
+- `RedundancyStateActor` ServiceLevel computation: every combination of (member-state, db-ok, stale, opc-ok, is-leader) produces expected byte.
+- Audit envelope generation: every mutating op produces exactly one audit row with stable `ExecutionId` correlation.
+
+### Mac-dev test invariants
+
+- All unit + integration tests run on macOS without Windows-only assemblies.
+- Cluster tests use in-process Akka.Remote on 127.0.0.1.
+- LDAP tests use `OpenLDAP` container or `Security:Ldap:DevStubMode=true`.
+
+### Retired tests
+
+Anything touching `ConfigGeneration` lifecycle, `ApplyLeaseRegistry`, `PeerHttpProbeLoop`, `FleetStatusPoller`, `RedundancyCoordinator` peer-probe loops, `RedundancyStatePublisher`.
+
+## 9. Risks & open questions
+
+1. **Akka.NET on .NET 10.** Verify Akka.NET 1.5+ targets .NET 10 cleanly.
+2. **OPCFoundation SDK threading.** The OPC UA stack runs its own threadpool. `OpcUaPublishActor` must marshal writes via thread-safe wrappers; use a dedicated `synchronized-dispatcher` for actors that touch the OPC UA address space.
+3. **Failure detector tuning.** ScadaLink's 2s/10s is tuned for site-to-central RTT. Benchmark before locking. Aggressive tuning + GC pauses → spurious singleton handover.
+4. **ServiceLevel = Akka leader removes operator control.** No escape hatch in v1. If a customer needs one later, add a `PinnedPrimary` column to `ClusterNode` and an override path in `RedundancyStateActor`. Out of scope now.
+5. **Long-lived v2 branch drift.** Monthly rebase from main, CI runs on v2 from day one.
+6. **Schema migration is destructive.** Dropping `ConfigGeneration` + `ClusterNode.RedundancyRole` is one-way. Cutover must run on a quiesced system. Provide a `Migrate-To-V2.ps1` script that backs up ConfigDb, runs EF migrations, validates row counts, prints a summary.
+7. **Wonderware + mxaccessgw still external processes.** Both untouched by this refactor. Future actorization would be a second refactor.
+8. **Audit row volume.** Edit-heavy install ≈ 5k rows/day. Need monthly partition + 365-day retention same as ScadaLink #23.
+
+## 10. Migration plan
+
+Big-bang on `v2-akka-fuse` branch:
+
+1. Branch `v2-akka-fuse` off `main`.
+2. Add new projects: `OtOpcUa.Host`, `.Cluster`, `.Security`, `.ControlPlane`, `.Runtime`, `.ConfigDb`, `.Commons`, `.AdminUI`, `.OpcUaServer`. Convert to `OtOpcUa.slnx`.
+3. Move ConfigDb access (EF context, repos, migrations) out of `Server` and `Admin` into `OtOpcUa.ConfigDb`. Add DataProtection key store table.
+4. Move LDAP + cookie + JWT out of `Admin/Security` into `OtOpcUa.Security`. Adopt 15-min JWT / 30-min sliding cookie / `/auth/ping`.
+5. Build `OtOpcUa.Cluster`: HOCON, `AkkaHostedService`, role-aware membership helpers, split-brain resolver.
+6. Build `OtOpcUa.ControlPlane`: `ConfigPublishCoordinator`, `AdminOperationsActor`, `AuditWriterActor`, `FleetStatusBroadcaster`, `RedundancyStateActor`.
+7. Build `OtOpcUa.Runtime`: `DriverHostActor`, `DriverInstanceActor`, `VirtualTagActor`, `ScriptedAlarmActor`, `OpcUaPublishActor`, `HistorianAdapterActor`, `PeerOpcUaProbeActor`, `DbHealthProbeActor`.
+8. Migrate `Phase7Composer` to `OtOpcUa.OpcUaServer`; make it pure and unit-tested.
+9. Move Blazor components from `Admin` into `OtOpcUa.AdminUI` library; replace `DriverDiagnosticsClient` HTTP with in-process actor calls; rewire `FleetStatusHub` / `AlertHub` / `ScriptLogHub` to be fed by `FleetStatusBroadcaster` `IHubContext`.
+10. Build `OtOpcUa.Host` `Program.cs`: role-gated startup, health endpoints (`/health/ready`, `/health/active`, `/healthz`), `AddWindowsService`.
+11. ConfigDb migration: add `Deployment`, `ConfigEdit`, `DataProtectionKeys` tables; rename `ClusterNodeGenerationState` → `NodeDeploymentState`; drop `ConfigGeneration`; drop `ClusterNode.RedundancyRole`. EF migration + idempotent SQL script + `Migrate-To-V2.ps1`.
+12. Delete `OtOpcUa.Server`, `OtOpcUa.Admin`, `DriverInstanceBootstrapper`, `RedundancyCoordinator`, `RedundancyStatePublisher`, `ApplyLeaseRegistry`, `FleetStatusPoller`, `PeerHttpProbeLoop`, `HubTokenService`. Sweep any `*RedundancyRole*` references.
+13. Update `deploy/Install-Services.ps1`: single Windows Service per node, role via env var, Traefik service registration.
+14. Update docs in `docs/`: rewrite `Redundancy.md`, `ServiceHosting.md`; add `Cluster.md`, `ControlPlane.md`, `Runtime.md`. Add top-level `Architecture-v2.md` summary.
+15. CI: add integration test job for the 2-node cluster + OPC UA roundtrip.
+16. Tag the last v1 release on `main` for backport-only fixes. Merge `v2-akka-fuse` → `main` when GA.