Commit Graph

30 Commits

Author SHA1 Message Date
Joseph Doherty 57355405a6 chore(security): drop dead audit suppressions; patch OpenTelemetry + Tmds.DBus CVEs
All five suppressed advisories are now resolved at baseline/resolved versions,
so every NuGetAuditSuppress is removed repo-wide:
- System.Security.Cryptography.Xml (GHSA-37gx-xxp4-5rgx / GHSA-w3x6-4m5h-cxqf)
  -> fixed by the .NET 10 baseline (10.0.6)
- OPCFoundation Opc.Ua.Core (GHSA-h958-fxgg-g7w3) -> fixed at resolved 1.5.378.106

Two were still live and are now patched via direct security pins:
- OpenTelemetry.Api 1.9.0 -> 1.15.3 (GHSA-g94r-2vxg-569j) pinned in Cluster;
  Runtime/ControlPlane/AdminUI + tests inherit via project reference
- Tmds.DBus.Protocol 0.20.0 -> 0.21.3 (GHSA-xrw6-gwf8-vvr9) pinned in Client.UI

Also correct the Historian sidecar runtime comments (x86 -> x64, matching the
csproj PlatformTarget). Solution audit: 0 vulnerable packages; full build clean.
2026-06-12 09:03:42 -04:00
Joseph Doherty fc0d43a3dc refactor(scripted-alarms): retire orphaned ScriptedAlarmActor + F9b evaluator (T11) 2026-06-10 15:22:26 -04:00
Joseph Doherty b5748288df test(scriptlog): prove bridge→broadcaster delivery off the script-logs DPS topic
Composes the one Layer-0 hop existing tests left uncovered together:
ScriptLogSignalRBridge subscribing to the script-logs DPS topic and
fanning a ScriptLogEntry out to the IInProcessBroadcaster<ScriptLogEntry>
singleton resolved from the SAME DI container the /script-log page injects.
Mirrors DriverStatusHubE2eTests. Confirms the server-side topic→page chain
delivers end-to-end (only the live Blazor circuit remains manual).
2026-06-10 13:53:34 -04:00
Joseph Doherty bd2dd05a0c feat(scripting): evaluators log through root script logger → script-log page (F8) 2026-06-10 12:03:51 -04:00
Joseph Doherty b73ce75402 harden(vtag): exclude backslash from passthrough capture + parity tests (A review) 2026-06-07 15:31:54 -04:00
Joseph Doherty 08d7477860 feat(vtag): passthrough fast-path skips Roslyn for mirror scripts (A) 2026-06-07 15:26:20 -04:00
Joseph Doherty e4a3f07c99 test(integration): poll per-node diagnostics to remove timing fragility (review) 2026-06-07 03:37:04 -04:00
Joseph Doherty ec9599e234 test(integration): multi-cluster deploy scopes drivers per node 2026-06-07 03:31:08 -04:00
Joseph Doherty ad7f9e731f feat(admin): headless POST /api/deployments REST endpoint (API-key gated)
v2-ci / build (push) Failing after 50s
v2-ci / unit-tests (tests/Core/ZB.MOM.WW.OtOpcUa.Cluster.Tests) (push) Has been skipped
v2-ci / unit-tests (tests/Server/ZB.MOM.WW.OtOpcUa.ControlPlane.Tests) (push) Has been skipped
v2-ci / unit-tests (tests/Server/ZB.MOM.WW.OtOpcUa.OpcUaServer.Tests) (push) Has been skipped
v2-ci / unit-tests (tests/Server/ZB.MOM.WW.OtOpcUa.Runtime.Tests) (push) Has been skipped
v2-ci / unit-tests (tests/Server/ZB.MOM.WW.OtOpcUa.Security.Tests) (push) Has been skipped
v2-ci / integration (tests/Server/ZB.MOM.WW.OtOpcUa.Host.IntegrationTests) (push) Has been skipped
v2-ci / integration (tests/Server/ZB.MOM.WW.OtOpcUa.OpcUaServer.IntegrationTests) (push) Has been skipped
A thin gateway over the admin-operations cluster singleton so CI/scripts can trigger a
deployment without the Blazor button. Forwards to the same IAdminOperationsClient.
StartDeploymentAsync; mounted on admin-role nodes. Auth is a fixed-time X-Api-Key check
against Security:DeployApiKey (orthogonal to the cookie-only web auth); AllowAnonymous so the
auth fallback doesn't 401 it, self-disabling (503) until the key is set. Outcome->status:
202/200/409/422. Unit tests for the key check + outcome mapping; HTTP E2E (real auth + real
deploy via the 2-node harness). Documented in docs/security.md.
2026-06-06 15:54:51 -04:00
Joseph Doherty a5d857d5b2 test(opcua): E2E deploy of an Equipment namespace through the real ConfigComposer
v2-ci / build (push) Failing after 46s
v2-ci / unit-tests (tests/Core/ZB.MOM.WW.OtOpcUa.Cluster.Tests) (push) Has been skipped
v2-ci / unit-tests (tests/Server/ZB.MOM.WW.OtOpcUa.ControlPlane.Tests) (push) Has been skipped
v2-ci / unit-tests (tests/Server/ZB.MOM.WW.OtOpcUa.OpcUaServer.Tests) (push) Has been skipped
v2-ci / unit-tests (tests/Server/ZB.MOM.WW.OtOpcUa.Runtime.Tests) (push) Has been skipped
v2-ci / unit-tests (tests/Server/ZB.MOM.WW.OtOpcUa.Security.Tests) (push) Has been skipped
v2-ci / integration (tests/Server/ZB.MOM.WW.OtOpcUa.Host.IntegrationTests) (push) Has been skipped
v2-ci / integration (tests/Server/ZB.MOM.WW.OtOpcUa.OpcUaServer.IntegrationTests) (push) Has been skipped
Seed a 1-area/1-line/1-equipment/1-tag Equipment namespace, StartDeployment via the
in-process 2-node harness, and assert the persisted artifact decodes (ParseComposition)
to the equipment signal (FullName from TagConfig) + friendly UNS folder names. Covers the
ConfigComposer -> ArtifactBlob -> ParseComposition.EquipmentTags seam the unit tests only
approximated with hand-built JSON. (OPC UA browse is covered against a real SDK node manager
in Phase7ApplierHierarchyTests; the cluster harness binds the no-op sink.)
2026-06-06 15:22:25 -04:00
Joseph Doherty c1619d95f5 feat(auth)!: OtOpcUa canonical control-plane roles + config-DB migration (Task 1.7)
Standardize the control-plane admin role VALUES on the canonical six
(ZB.MOM.WW.Auth CanonicalRole). OtOpcUa uses four:
  ConfigViewer   -> Viewer
  ConfigEditor   -> Designer
  FleetAdmin     -> Administrator
  DriverOperator -> Operator   (appsettings-only string role)

This is a rename, not a permission change: enforcement semantics are
preserved (whoever could deploy/administer/operate before still can).

- AdminRole enum members renamed (persisted as string names via
  HasConversion<string>); RoleGrants.razor dropdown default updated.
- EF DATA migration CanonicalizeAdminRoles rewrites existing
  LdapGroupRoleMapping.Role rows old->new (Up) and back (Down); schema /
  model snapshot byte-identical (no pending model changes).
- Enforcement role STRINGS canonicalized:
  * Security policies keep their NAMES ("DriverOperator"/"FleetAdmin")
    but require canonical roles: RequireRole("Operator","Administrator")
    and RequireRole("Administrator").
  * Deployments.razor [Authorize(Roles="Administrator,Designer")].
  * DevStub now grants "Administrator"; LdapOptions/doc-comment examples
    canonicalized.
- Data-plane authorization (NodePermissions/NodeAcl/IPermissionEvaluator/
  TriePermissionEvaluator/UserAuthorizationState) UNTOUCHED.
- New CanonicalAdminRolesTests pins canonical claim values end-to-end and
  the real registered policies; existing role-string tests updated.
2026-06-02 07:30:00 -04:00
Joseph Doherty 8ba289f975 chore(auth): OtOpcUa unify dev LDAP base DN to dc=zb,dc=local (Task 1.6)
Replace all dev-directory dc=lmxopcua,dc=local references with dc=zb,dc=local
across LdapOptions default, integration harness overrides, docker-compose LDAP_ROOT,
AclEdit placeholder DN, and dev/smoke-test docs. CN/OU prefixes preserved.
2026-06-02 06:45:23 -04:00
Joseph Doherty c4f315ec90 fix(auth): OtOpcUa 1.2 review fixes — startup insecure-transport guard + Ldaps in prod overlays, test fidelity, 0.1.1 pin 2026-06-02 01:37:29 -04:00
Joseph Doherty 257caa7bd1 feat(auth): cut OtOpcUa over to ZB.MOM.WW.Auth.Ldap; preserve DevStubMode; route roles via IGroupRoleMapper (Task 1.2/1.4) 2026-06-02 00:55:10 -04:00
Joseph Doherty 2844180865 fix: honor LdapOptions.Enabled at runtime; dedupe ILdapAuthService registration; +SearchBase test, doc fix
v2-ci / build (push) Failing after 41s
v2-ci / unit-tests (tests/Core/ZB.MOM.WW.OtOpcUa.Cluster.Tests) (push) Has been skipped
v2-ci / unit-tests (tests/Server/ZB.MOM.WW.OtOpcUa.ControlPlane.Tests) (push) Has been skipped
v2-ci / unit-tests (tests/Server/ZB.MOM.WW.OtOpcUa.OpcUaServer.Tests) (push) Has been skipped
v2-ci / unit-tests (tests/Server/ZB.MOM.WW.OtOpcUa.Runtime.Tests) (push) Has been skipped
v2-ci / unit-tests (tests/Server/ZB.MOM.WW.OtOpcUa.Security.Tests) (push) Has been skipped
v2-ci / integration (tests/Server/ZB.MOM.WW.OtOpcUa.Host.IntegrationTests) (push) Has been skipped
v2-ci / integration (tests/Server/ZB.MOM.WW.OtOpcUa.OpcUaServer.IntegrationTests) (push) Has been skipped
2026-06-01 23:03:12 -04:00
Joseph Doherty d3ab2bfbaf fix: bind OtOpcUa LdapOptions from real Security:Ldap section; gate validator on DevStubMode 2026-06-01 22:46:09 -04:00
Joseph Doherty 88e773af36 feat: validate OpcUa host options at startup (route through IOptions + ValidateOnStart) 2026-06-01 18:45:55 -04:00
Joseph Doherty f35ebd7aaf feat: add fail-fast LDAP options validation in OtOpcUa via ZB.MOM.WW.Configuration 2026-06-01 18:32:44 -04:00
Joseph Doherty 61193629b6 fix(adminui): wire Test Connect probes + live panels on admin-only nodes
v2-ci / build (push) Failing after 36s
v2-ci / unit-tests (tests/Core/ZB.MOM.WW.OtOpcUa.Cluster.Tests) (push) Has been skipped
v2-ci / unit-tests (tests/Server/ZB.MOM.WW.OtOpcUa.ControlPlane.Tests) (push) Has been skipped
v2-ci / unit-tests (tests/Server/ZB.MOM.WW.OtOpcUa.OpcUaServer.Tests) (push) Has been skipped
v2-ci / unit-tests (tests/Server/ZB.MOM.WW.OtOpcUa.Runtime.Tests) (push) Has been skipped
v2-ci / unit-tests (tests/Server/ZB.MOM.WW.OtOpcUa.Security.Tests) (push) Has been skipped
v2-ci / integration (tests/Server/ZB.MOM.WW.OtOpcUa.Host.IntegrationTests) (push) Has been skipped
v2-ci / integration (tests/Server/ZB.MOM.WW.OtOpcUa.OpcUaServer.IntegrationTests) (push) Has been skipped
Both bugs surfaced only on split-role deployments (the MAIN cluster's
admin-only nodes), where the AdminUI runs without the driver role.

- Test Connect returned "No probe registered" for every driver: the
  IDriverProbe set was registered only under the driver role, but the
  admin-operations singleton that consumes it is pinned to admin. Extract
  AddOtOpcUaDriverProbes() (idempotent via TryAddEnumerable) and call it
  in the hasAdmin path too.

- Live driver-status/alerts/script-log panels showed "SignalR error:
  Connection refused": these Blazor Server components opened a HubConnection
  to their own hub via the browser's public URL, which server-side code
  can't reach behind Traefik (host :9200 -> container :9000). Read the
  in-process source directly instead -- DriverStatus via
  IDriverStatusSnapshotStore.SnapshotChanged, Alerts/ScriptLog via a new
  IInProcessBroadcaster<T>. Fleet status was unaffected (reads DB/ActorSystem).

Adds unit tests for probe registration, the snapshot-store event, and the
broadcaster.
2026-05-29 16:38:32 -04:00
Joseph Doherty 494da22cd1 test(adminui): E2E scaffolding for Test Connect + Reconnect + Status hub
- DriverTestConnectE2eTests: 3 scenarios (sim/wrong-port/black-hole)
  against the Modbus Docker fixture. Sim + wrong-port skip if fixture
  unreachable; black-hole uses ModbusDriverProbe directly (no fixture).
- DriverReconnectE2eTests: message round-trip through AdminOperationsActor
  cluster singleton — Ok=true + audit write, without live driver side effect.
- DriverStatusHubE2eTests: bridge-mocked fallback — spawns
  DriverStatusSignalRBridge in the harness ActorSystem with a mock
  IHubContext, publishes DriverHealthChanged to the driver-health DPS
  topic, asserts store upsert + hub SendAsync call.
- DockerFixtureAvailability helper: TCP-connect probe for skip guards.
- Moq 4.20.72 added to central package management for hub mocking.
- Design doc §8.3 replaced with concrete pre-ship operator runbook.
2026-05-28 11:31:12 -04:00
Joseph Doherty 64e3fbe035 docs: backfill XML documentation across 756 files
v2-ci / build (push) Failing after 1m43s
v2-ci / unit-tests (tests/Core/ZB.MOM.WW.OtOpcUa.Cluster.Tests) (push) Has been skipped
v2-ci / unit-tests (tests/Server/ZB.MOM.WW.OtOpcUa.ControlPlane.Tests) (push) Has been skipped
v2-ci / unit-tests (tests/Server/ZB.MOM.WW.OtOpcUa.OpcUaServer.Tests) (push) Has been skipped
v2-ci / unit-tests (tests/Server/ZB.MOM.WW.OtOpcUa.Runtime.Tests) (push) Has been skipped
v2-ci / unit-tests (tests/Server/ZB.MOM.WW.OtOpcUa.Security.Tests) (push) Has been skipped
v2-ci / integration (tests/Server/ZB.MOM.WW.OtOpcUa.Host.IntegrationTests) (push) Has been skipped
v2-ci / integration (tests/Server/ZB.MOM.WW.OtOpcUa.OpcUaServer.IntegrationTests) (push) Has been skipped
Adds <summary>, <param>, <typeparam>, and <inheritdoc/> tags to public
members surfaced by commentchecker — resolves 5,847 of 5,869 issues
(99.6%) across three /fixdocs passes.
2026-05-28 08:10:17 -04:00
Joseph Doherty 25ce111981 refactor(test): rename FailoverScenarioTests → FailoverDuringDeployTests for plan parity 2026-05-26 11:18:13 -04:00
Joseph Doherty 05a0596fb1 feat(host): F9b RoslynScriptedAlarmEvaluator + #107 close engine DI
v2-ci / build (push) Failing after 39s
v2-ci / unit-tests (tests/Core/ZB.MOM.WW.OtOpcUa.Cluster.Tests) (push) Has been skipped
v2-ci / unit-tests (tests/Server/ZB.MOM.WW.OtOpcUa.ControlPlane.Tests) (push) Has been skipped
v2-ci / unit-tests (tests/Server/ZB.MOM.WW.OtOpcUa.OpcUaServer.Tests) (push) Has been skipped
v2-ci / unit-tests (tests/Server/ZB.MOM.WW.OtOpcUa.Runtime.Tests) (push) Has been skipped
v2-ci / unit-tests (tests/Server/ZB.MOM.WW.OtOpcUa.Security.Tests) (push) Has been skipped
v2-ci / integration (push) Has been skipped
RoslynScriptedAlarmEvaluator mirrors F8b's pattern for alarm predicates:
caches a compiled ScriptEvaluator<AlarmPredicateContext, bool> per unique
predicate, runs against the dependency dictionary with a 2s timeout, and
turns every failure (compile error, sandbox violation, runtime throw,
ctx.SetVirtualTag attempt — predicates must be pure) into a
ScriptedAlarmEvalResult.Failure. ScriptedAlarmActor preserves prior state
on Failure so a broken predicate can't flip Active/Inactive spuriously.

Program.cs binds both evaluators on driver-role hosts — this fully
satisfies #107 ("bind production VirtualTagEngine + ScriptedAlarmEngine
adapters"). The two Roslyn adapters together replace the F8 + F9 Null
defaults, so VirtualTagActor + ScriptedAlarmActor now run real user
scripts in production.

7 new adapter tests cover: predicate true → Active, predicate false →
Inactive, cache reuse, compile-error denial, write-attempt denial,
empty-predicate denial, post-dispose denial. Host.IntegrationTests now
17/17 green.

Closes #80 + #107. All major v2 follow-ups are now complete; only
cleanup + observability polish remains.
2026-05-26 10:58:04 -04:00
Joseph Doherty 219d10a22d feat(host): F8b RoslynVirtualTagEvaluator — production virtual-tag eval
RoslynVirtualTagEvaluator wraps Core.Scripting.ScriptEvaluator + Core
.VirtualTags.VirtualTagContext into a single-tag IVirtualTagEvaluator
adapter. Caches the compiled ScriptEvaluator per unique expression so
the second-and-onwards Evaluate is an in-process method call against the
dependency dictionary. Compile/sandbox/runtime errors all surface as
VirtualTagEvalResult.Failure rather than propagating exceptions through
the VirtualTagActor message loop.

Single-tag scope: cross-tag ctx.SetVirtualTag writes are dropped + logged
because fan-out between actors is owned by DependencyMuxActor. Cycle
detection + cascade ordering stay in Core.VirtualTags.VirtualTagEngine
where they belong (loaded fleet-wide); this adapter keeps the actor
message handler simple.

Host adds Core.Scripting + Core.VirtualTags project refs, plus a
TargetWarningsAsErrors NU1608 suppression — Microsoft.CodeAnalysis.CSharp
.Scripting 4.12.0 pins Common to 4.12.0 but ASP.NET Core transitively
brings Microsoft.CodeAnalysis.Common 5.0.0; the surface we use is stable
across the drift (verified by Core.Scripting.Tests).

Program.cs binds RoslynVirtualTagEvaluator → IVirtualTagEvaluator on
driver-role hosts, replacing the F8-default NullVirtualTagEvaluator so
VirtualTagActor evaluates real user scripts at runtime.

6 new adapter tests cover: simple expression sums, cache reuse across
calls, compile-error denial, runtime-throw denial, empty-expression
denial, post-dispose denial. Host.IntegrationTests now 10/10 green.

Closes #79. F9b + #107 next.
2026-05-26 10:55:56 -04:00
Joseph Doherty 21eac21409 feat(opcua,host): F13c LDAP-bound UserName validator
Adds IOpcUaUserAuthenticator seam in OpcUaServer.Security with a deny-all
NullOpcUaUserAuthenticator default. OpcUaApplicationHost subscribes to
SessionManager.ImpersonateUser after _application.Start so UserName tokens
flow through the authenticator and either attach a UserIdentity to the
session (Allow) or set IdentityValidationError = BadIdentityTokenRejected
(Deny / authenticator exception). Anonymous + X509 tokens fall through to
SDK defaults.

LdapOpcUaUserAuthenticator (Host project) bridges to the same
ILdapAuthService that AddOtOpcUaAuth uses for Admin cookies / JWT, so a
single LDAP source-of-truth governs both Admin control plane and OPC UA
data plane. Program.cs registers LdapOptions + LdapAuthService +
IOpcUaUserAuthenticator on driver-role hosts; admin-only nodes are
unchanged.

OtOpcUaServerHostedService threads the resolved authenticator into
OpcUaApplicationHost so the seam respects Host DI.

10 new tests: 6 in OpcUaServer.Tests cover the pure HandleImpersonation
static method (success / denial / anonymous fallthrough / authenticator-
throw / null-username / Null authenticator); 4 in Host.IntegrationTests
cover the LdapOpcUaUserAuthenticator adapter (LDAP allow → Allow with
roles, LDAP deny → Deny, exception → backend-error denial, display-name
fallback). OpcUaServer suite is 40 / 40 green.

Closes #104. Unblocks Task 60 (dual-endpoint + ServiceLevel tests) once
#81 residual lands.
2026-05-26 10:21:37 -04:00
Joseph Doherty b0a2bb037d test(integration): F21 — docker-compose + env-driven SQL/LDAP harness mode
Adds a real-infra mode for the integration test harness alongside the default
in-memory mode. Drops the previously-untested code paths (EF SqlServer
behaviors, real LDAP bind) under env-var control without breaking the
zero-infra default that CI runs.

- docker-compose.yml — minimal SQL 2022 (14331) + OpenLDAP (3894) stack
  (ports chosen to coexist with docker-dev/ on 14330/3893)
- HarnessMode record reads OTOPCUA_HARNESS_USE_SQL=1 / USE_LDAP=1 from env
- SQL mode: per-harness unique DB OtOpcUa_Harness_{guid}, EnsureCreated
  at startup, EnsureDeleted on dispose (best-effort)
- LDAP mode: drops StubLdapAuthService and configures real LdapAuthService
  against the compose'd OpenLDAP via Authentication:Ldap:* config keys
- Microsoft.EntityFrameworkCore.SqlServer added to the test project
- README documents both modes + the macOS no-Docker caveat

Default in-memory mode unchanged — all 9 existing tests still pass.
2026-05-26 07:25:16 -04:00
Joseph Doherty cd5540cb1a test(integration): F22 — failover scenario tests + harness Stop/Restart primitives
Extends TwoNodeClusterHarness with three lifecycle primitives:
- StopNodeBAsync()      — graceful CoordinatedShutdown (Cluster.Leave)
- RestartNodeBAsync()   — rebuild node B on same Akka port + same in-memory DB
- WaitForClusterSizeAsync(n) — converge assertion helper

Adds three failover scenario tests:
- Stopping node B shrinks cluster to 1 Up member
- Restarted node B rejoins on the same Akka port
- Deployment started with B down seals with a single NodeDeploymentState
  (validates ConfigPublishCoordinator.DiscoverDriverNodes snapshots
   membership at dispatch time)

Closes follow-up F22. Integration test count: 6 → 9 (+3).
2026-05-26 07:13:14 -04:00
Joseph Doherty 8f32b89fb9 feat(adminui): FleetDiagnosticsClient real Akka ActorSelection round-trip (F17)
v2-ci / unit-tests (tests/Core/ZB.MOM.WW.OtOpcUa.Cluster.Tests) (push) Has been cancelled
v2-ci / build (push) Has been cancelled
v2-ci / unit-tests (tests/Server/ZB.MOM.WW.OtOpcUa.ControlPlane.Tests) (push) Has been cancelled
v2-ci / unit-tests (tests/Server/ZB.MOM.WW.OtOpcUa.OpcUaServer.Tests) (push) Has been cancelled
v2-ci / unit-tests (tests/Server/ZB.MOM.WW.OtOpcUa.Runtime.Tests) (push) Has been cancelled
v2-ci / unit-tests (tests/Server/ZB.MOM.WW.OtOpcUa.Security.Tests) (push) Has been cancelled
v2-ci / integration (push) Has been cancelled
- New Commons.Messages.Fleet.GetDiagnostics request record.
- DriverHostActor handles GetDiagnostics in all three states (Steady, Applying,
  Stale); replies with a NodeDiagnosticsSnapshot built from _currentRevision
  + the local NodeId. Drivers list is empty until F7 wires the per-instance
  children.
- FleetDiagnosticsClient now resolves the target via ActorSelection at
  akka.tcp://{system}@{nodeId}/user/driver-host and Asks with a 3s timeout.
  On timeout/peer-down it returns an empty snapshot so the UI degrades
  gracefully rather than throwing.

Two new integration tests in Host.IntegrationTests:
- GetDiagnostics_returns_snapshot_with_target_NodeId verifies the
  cross-node Ask/Reply works.
- GetDiagnostics_after_deploy_reports_current_revision exercises the
  end-to-end path: AdminOps starts a deployment, both DriverHostActors
  apply, then diagnostics reports the new revision on both nodes.

All 98 v2 tests pass (was 96 + 2 new).
2026-05-26 06:58:11 -04:00
Joseph Doherty 5cfbe8b5dd test(host): deploy happy-path + idempotency integration tests (Task 59)
DeployHappyPathTests exercises the full deploy pipeline on the 2-node harness:
AdminOperationsActor → ConfigPublishCoordinator → DistributedPubSub →
DriverHostActor on both nodes → ApplyAck → coordinator seals. Verifies both
NodeDeploymentState rows reach Applied and Deployment.Status reaches Sealed.

Exposed + fixed two production bugs along the way:

1. Coordinator was publishing DispatchDeployment on the "deployments" topic but
   never subscribed to anything — DriverHostActor ACKs published on the same
   topic could not reach it. Added dedicated "deployment-acks" topic with
   coordinator subscription in PreStart, and DriverHostActor publishes ACKs
   there.

2. NodeId derivation used member.Address.Host only — two cluster members on a
   shared loopback host (test harness, dev VMs) collided to one identity. The
   coordinator's expected-ack set became {1} and the system sealed after only
   half the nodes acked. Switched to host:port everywhere (ClusterRoleInfo +
   coordinator) so loopback nodes stay distinct and production identities are
   harmlessly more specific.

Tests: 95 v2 tests pass (was 93 + 2 deploy tests), 0 skipped.

Failover scenarios (design §8 cases 3-7: node-kill-mid-apply, split-brain,
restart-during-deploy) deferred — they need controlled node-down primitives
on the harness. Tracked as F22 (failover scenario test cases).
2026-05-26 06:34:36 -04:00
Joseph Doherty d6fac2d81d test(host): 2-node integration test harness + consolidate to one ActorSystem (Task 58)
Builds TwoNodeClusterHarness: two in-process Host-equivalent nodes sharing
an in-memory ConfigDb. Forms a 2-member Akka cluster. ClusterFormationTests
proves both nodes see each other as admin+driver role members.

Fixes a real production bug uncovered while wiring the harness — Program.cs
ran two separate ActorSystems (one from AddOtOpcUaCluster.AkkaHostedService
with cluster HOCON, one from Akka.Hosting.AddAkka with bare HOCON). Cluster
singletons landed on the bare ActorSystem and could not actually form a
cluster ("Configuration does not contain `akka.cluster` node").

Consolidation:
- AddOtOpcUaCluster now only binds AkkaClusterOptions + registers IClusterRoleInfo
- New WithOtOpcUaClusterBootstrap pushes embedded HOCON + Remote/Cluster options
  into Akka.Hosting's AkkaConfigurationBuilder
- AkkaHostedService.cs deleted — Akka.Hosting now owns the lifecycle
- Program.cs + harness call WithOtOpcUaClusterBootstrap inside AddAkka

Why not WebApplicationFactory<Program>? Program.cs reads OTOPCUA_ROLES from
process env (shared across in-process WAFs); the harness replays Program.cs's
DI graph from a clean WebApplicationBuilder per node with per-node config
overrides. Same production extensions, isolated config + Kestrel + Akka ports.

Tests: 93 v2 tests pass (was 91 + 2 new cluster formation), 0 skipped.
2026-05-26 06:27:04 -04:00