Adds a real-infra mode for the integration test harness alongside the default
in-memory mode. Brings the previously untested code paths (EF SqlServer
behaviors, real LDAP bind) under test behind env-var control, without breaking
the zero-infra default that CI runs.
- docker-compose.yml — minimal SQL 2022 (14331) + OpenLDAP (3894) stack
(ports chosen to coexist with docker-dev/ on 14330/3893)
- HarnessMode record reads OTOPCUA_HARNESS_USE_SQL=1 / USE_LDAP=1 from env
- SQL mode: per-harness unique DB OtOpcUa_Harness_{guid}, EnsureCreated
at startup, EnsureDeleted on dispose (best-effort)
- LDAP mode: drops StubLdapAuthService and configures real LdapAuthService
against the compose'd OpenLDAP via Authentication:Ldap:* config keys
- Microsoft.EntityFrameworkCore.SqlServer added to the test project
- README documents both modes + the macOS no-Docker caveat
Default in-memory mode unchanged — all 9 existing tests still pass.
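As a rough illustration of the mode switch described above (sketched in Python for brevity; the harness itself is .NET), the following reads the two flags and derives a per-harness database name. The full LDAP variable name `OTOPCUA_HARNESS_USE_LDAP` and the helper names are assumptions, not taken from the harness code.

```python
import uuid
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class HarnessMode:
    use_sql: bool
    use_ldap: bool

    @classmethod
    def from_env(cls, env: Mapping[str, str]) -> "HarnessMode":
        # Only the literal value "1" opts in; anything else keeps the
        # zero-infra in-memory default.
        return cls(
            use_sql=env.get("OTOPCUA_HARNESS_USE_SQL") == "1",
            # Assumed full name; the commit message abbreviates it to USE_LDAP.
            use_ldap=env.get("OTOPCUA_HARNESS_USE_LDAP") == "1",
        )


def unique_db_name() -> str:
    # One database per harness instance so parallel runs cannot collide.
    return f"OtOpcUa_Harness_{uuid.uuid4().hex}"
```

Passing the environment in as a mapping (rather than reading `os.environ` directly) keeps the sketch easy to test without touching process state.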
Reshapes the placeholder buffered-counter actor into a thin fire-and-forget
bridge over the existing IAlarmHistorianSink contract. Default sink is
NullAlarmHistorianSink; production deployments override the DI binding to
SqliteStoreAndForwardSink wrapping WonderwareHistorianClient (the v1
components in src/Drivers/ZB.MOM.WW.OtOpcUa.Driver.Historian.Wonderware*
are reused verbatim — actor is just a mailbox-friendly entry point).
- HistorianAdapterActor.Props(IAlarmHistorianSink?) — null defaults to NullAlarmHistorianSink
- Receive<AlarmHistorianEvent>: fire-and-forget sink.EnqueueAsync
- Receive<GetStatus>: returns sink.GetStatus() (queue depth + drain state)
- ServiceCollectionExtensions.AddOtOpcUaRuntime registers the default sink
- WithOtOpcUaRuntimeActors spawns the actor + registers HistorianAdapterActorKey
- Program.cs calls AddOtOpcUaRuntime when hasDriver
Tests: 2 new (forward-to-sink + GetStatus). Runtime suite 17 → 18.
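The shape of the bridge described above (null-object default sink, fire-and-forget enqueue, status passthrough) can be sketched roughly as follows. This is a Python asyncio analogy, not the Akka.NET actor; all class and method names here are hypothetical.

```python
import asyncio
from typing import Optional, Protocol


class AlarmSink(Protocol):
    async def enqueue(self, event: dict) -> None: ...
    def status(self) -> dict: ...


class NullSink:
    """Default sink: accepts every event and discards it."""

    async def enqueue(self, event: dict) -> None:
        return None

    def status(self) -> dict:
        return {"queue_depth": 0, "draining": False}


class HistorianAdapter:
    def __init__(self, sink: Optional[AlarmSink] = None) -> None:
        # A missing sink falls back to the no-op default.
        self._sink: AlarmSink = sink if sink is not None else NullSink()
        self._pending = set()

    def on_event(self, event: dict) -> None:
        # Fire-and-forget: schedule the enqueue and return immediately.
        # The task reference is kept so it is not garbage-collected early.
        task = asyncio.get_running_loop().create_task(self._sink.enqueue(event))
        self._pending.add(task)
        task.add_done_callback(self._pending.discard)

    def on_get_status(self) -> dict:
        # Status is delegated straight to the sink.
        return self._sink.status()
```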
Extends TwoNodeClusterHarness with three lifecycle primitives:
- StopNodeBAsync() — graceful CoordinatedShutdown (Cluster.Leave)
- RestartNodeBAsync() — rebuild node B on same Akka port + same in-memory DB
- WaitForClusterSizeAsync(n) — converge assertion helper
Adds three failover scenario tests:
- Stopping node B shrinks cluster to 1 Up member
- Restarted node B rejoins on the same Akka port
- Deployment started with B down seals with a single NodeDeploymentState
(validates ConfigPublishCoordinator.DiscoverDriverNodes snapshots
membership at dispatch time)
Closes follow-up F22. Integration test count: 6 → 9 (+3).
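A convergence helper like WaitForClusterSizeAsync is typically a poll-until-deadline loop. A minimal sketch of that idea, in Python and with a hypothetical `current_size` callback standing in for the cluster membership query:

```python
import asyncio
import time
from typing import Callable


async def wait_for_cluster_size(
    current_size: Callable[[], int],
    expected: int,
    timeout: float = 10.0,
    poll_interval: float = 0.05,
) -> None:
    """Poll until the observed member count equals `expected`, or raise."""
    deadline = time.monotonic() + timeout
    while True:
        size = current_size()
        if size == expected:
            return
        if time.monotonic() >= deadline:
            raise TimeoutError(f"cluster size {size}, expected {expected}")
        await asyncio.sleep(poll_interval)
```

The timeout and poll interval defaults are illustrative only; the real helper's values are not stated in the commit.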
New per-admin-node actor that subscribes to the fleet-status DistributedPubSub
topic + forwards every FleetStatusChanged snapshot to all SignalR clients
connected to FleetStatusHub via IHubContext.
Wired via WithOtOpcUaSignalRBridges (new AkkaConfigurationBuilder extension in
AdminUI.Hubs) — Program.cs calls it inside the if(hasAdmin) block alongside
WithOtOpcUaControlPlaneSingletons.
Per-node subscription rather than cluster-singleton: every admin node forwards
its own snapshots to its own connected clients. Simpler than singleton
coordination + acceptable because the messages are small and SignalR fan-out
is per-node anyway.
- New Commons.Messages.Fleet.GetDiagnostics request record.
- DriverHostActor handles GetDiagnostics in all three states (Steady, Applying,
Stale); replies with a NodeDiagnosticsSnapshot built from _currentRevision
+ the local NodeId. Drivers list is empty until F7 wires the per-instance
children.
- FleetDiagnosticsClient now resolves the target via ActorSelection at
akka.tcp://{system}@{nodeId}/user/driver-host and Asks with a 3s timeout.
On timeout/peer-down it returns an empty snapshot so the UI degrades
gracefully rather than throwing.
Two new integration tests in Host.IntegrationTests:
- GetDiagnostics_returns_snapshot_with_target_NodeId verifies the
cross-node Ask/Reply works.
- GetDiagnostics_after_deploy_reports_current_revision exercises the
end-to-end path: AdminOps starts a deployment, both DriverHostActors
apply, then diagnostics reports the new revision on both nodes.
All 98 v2 tests pass (was 96 + 2 new).
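The graceful-degradation behaviour of FleetDiagnosticsClient (ask with a 3s timeout, empty snapshot on failure) can be sketched as below. This is a Python analogy; the `ask` callable, the snapshot field names, and the use of `ConnectionError` for a down peer are assumptions made for illustration.

```python
import asyncio
from dataclasses import dataclass
from typing import Awaitable, Callable, Optional


@dataclass(frozen=True)
class NodeDiagnosticsSnapshot:
    node_id: str
    revision: Optional[int] = None
    drivers: tuple = ()


def driver_host_path(system: str, node_id: str) -> str:
    # Address format quoted in the commit message.
    return f"akka.tcp://{system}@{node_id}/user/driver-host"


async def get_diagnostics(
    ask: Callable[[str], Awaitable[NodeDiagnosticsSnapshot]],
    node_id: str,
    timeout: float = 3.0,
) -> NodeDiagnosticsSnapshot:
    try:
        return await asyncio.wait_for(ask(node_id), timeout)
    except (asyncio.TimeoutError, ConnectionError):
        # Timeout or peer down: return an empty snapshot so the caller
        # (the UI) can render a degraded view instead of failing.
        return NodeDiagnosticsSnapshot(node_id=node_id)
```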
Replaces the Ok=true stub with a TCP connect to the peer's OPC UA port (4840
default) with a 2s timeout. A successful connect indicates the OPC UA server
process is up + accepting connections — enough for the redundancy calculator
to treat the peer as live. A full secure-channel Hello/Acknowledge handshake
is overkill for what the redundancy calc consumes and would pull in the OPC
UA Client SDK + a PKI setup. Upgrade later if a deeper liveness signal is ever
required.
Probe extracts the host from NodeId by stripping the :port suffix (commit
5cfbe8b encoded host:port into NodeId for cluster-member identity).
Tests: 2 new tests — Ok=true against a live TcpListener on a chosen port,
Ok=false against an unreachable endpoint. All 17 Runtime tests pass (was 16
covering only the message-contract surface).
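The probe logic above amounts to "strip the port from NodeId, then attempt a TCP connect with a short timeout". A minimal Python sketch of that technique follows; the function names are hypothetical, and IPv6 literals are not handled.

```python
import socket


def host_from_node_id(node_id: str) -> str:
    # NodeId is "host:port"; keep everything before the last colon.
    host, sep, _port = node_id.rpartition(":")
    return host if sep else node_id


def probe_peer(node_id: str, opcua_port: int = 4840, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to the peer's OPC UA port succeeds."""
    try:
        address = (host_from_node_id(node_id), opcua_port)
        with socket.create_connection(address, timeout=timeout):
            return True
    except OSError:
        # Refused, unreachable, DNS failure, and timeout all count as "not live".
        return False
```

Note that the port embedded in NodeId is the cluster port, so it is discarded and the OPC UA port is supplied separately.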
ConfigAuditLog gains two nullable columns (EventId, CorrelationId) + a filtered
unique index UX_ConfigAuditLog_EventId. EF migration
20260526105027_AddConfigAuditLogEventIdColumns is additive (nullable columns +
a filtered index, so legacy rows keep NULLs and need no data backfill).
AuditWriterActor now writes EventId + CorrelationId into the dedicated columns
instead of synthesising a JSON wrapper into DetailsJson. Cross-restart dedup
is now real: a retry of an already-flushed batch hits the unique index and
SaveChanges throws; the existing catch drops the duplicate without losing the
rest of the batch.
WrapDetails helper deleted — F4 (its JSON hardening) becomes moot.
AuditWriterActorTests.Details_wrapper_embeds_eventId_and_correlationId renamed
+ rewritten to assert against the columns. All 29 ControlPlane tests pass,
all 95 v2 tests green.
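The dedup mechanism above (a filtered unique index plus catch-and-drop on violation) can be demonstrated with SQLite, which supports partial indexes. This is only a sketch of the technique: the table layout is simplified, and the row-by-row insert is an assumption about how the duplicate is dropped without losing the rest of the batch.

```python
import sqlite3


def create_schema(conn: sqlite3.Connection) -> None:
    conn.execute(
        "CREATE TABLE ConfigAuditLog ("
        " Id INTEGER PRIMARY KEY,"
        " EventId TEXT NULL,"
        " CorrelationId TEXT NULL,"
        " DetailsJson TEXT NULL)"
    )
    # Filtered (partial) unique index: rows with NULL EventId are exempt,
    # so legacy rows never conflict with each other.
    conn.execute(
        "CREATE UNIQUE INDEX UX_ConfigAuditLog_EventId"
        " ON ConfigAuditLog (EventId) WHERE EventId IS NOT NULL"
    )


def flush_batch(conn: sqlite3.Connection, batch) -> int:
    """Insert (event_id, correlation_id, details) rows; skip duplicates."""
    written = 0
    for event_id, correlation_id, details in batch:
        try:
            with conn:  # commits on success, rolls back on exception
                conn.execute(
                    "INSERT INTO ConfigAuditLog"
                    " (EventId, CorrelationId, DetailsJson) VALUES (?, ?, ?)",
                    (event_id, correlation_id, details),
                )
            written += 1
        except sqlite3.IntegrityError:
            # Unique-index violation: this event was already flushed.
            continue
    return written
```

Replaying a batch that contains one already-written EventId and one new EventId writes only the new row.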
.github/workflows/v2-ci.yml runs on push/PR to v2-akka-fuse + master:
- build job: dotnet restore + build (Release)
- unit-tests job: matrix over the 5 v2 test projects (Cluster, ControlPlane,
Runtime, Security, OpcUaServer) with Category!=E2E
- integration job: Host.IntegrationTests with Category!=E2E
.github/workflows/v2-e2e.yml runs nightly at 03:00 UTC + workflow_dispatch:
- Brings up the docker-dev four-node fleet (admin pair + driver pair + SQL
+ LDAP + Traefik)
- Waits up to 60s for /health/active to return 200
- Runs Category=E2E only
- Always tears down with -v
Both workflows pin .NET 10 via actions/setup-dotnet (no global.json so any
10.0 SDK works). Compatible with both GitHub Actions and Gitea Actions
(act_runner). The E2E filter currently matches zero tests because the
tests/Server/ZB.MOM.WW.OtOpcUa.E2ETests project doesn't exist yet — it lands
when F10/F11/F12 wire enough engine for an end-to-end round-trip to be
meaningful.
- scripts/install/traefik.yml + traefik-dynamic.yml: Traefik static + dynamic
config. One :80 entry point, one router on HostRegexp(otopcua.*), one
service load-balancing admin-a:9000 + admin-b:9000 with /health/active health
check (interval 5s, timeout 2s, expected 200). Followers return 503 from
/health/active so Traefik drops them within the next interval after a
leadership change.
- scripts/install/Install-Traefik.ps1: downloads Traefik for Windows, drops the
yml configs, registers the OtOpcUaTraefik Windows service via sc.exe with
restart-on-failure. Companion to Install-Services.ps1.
- docker-dev/{Dockerfile,docker-compose.yml,traefik-dynamic.yml,README.md}:
Mac-friendly four-node fleet (admin-a + admin-b + driver-a + driver-b) plus
SQL Server 2022 + OpenLDAP + Traefik. Single OtOpcUa.Host image built once;
Compose drives OTOPCUA_ROLES + Cluster:* per container to differentiate the
four hosts. README walks through bring-up + failover smoke + the dev LDAP
users.
Note: untested on macOS (no local Docker — see docs/v2/dev-environment.md).
- Install-Services.ps1: installs OtOpcUaHost (single fused binary) replacing
the v1 OtOpcUa + OtOpcUaAdmin pair. Required -Roles param writes OTOPCUA_ROLES
to the service env so Program.cs decides what to mount (admin / driver / both).
-HttpPort param (default 9000) writes ASPNETCORE_URLS on admin-role nodes.
sc.exe restart-on-failure: 5s, 30s, 60s; reset counter after 24h clean run.
Wonderware historian sidecar install logic preserved from v1.
- Uninstall-Services.ps1: removes OtOpcUaHost + cleans up legacy v1 names
(OtOpcUa, OtOpcUaAdmin) and the long-retired OtOpcUaGalaxyHost.
- Refresh-Services.ps1: updated service names (OtOpcUa -> OtOpcUaHost), publish
path (ZB.MOM.WW.OtOpcUa.Server -> ZB.MOM.WW.OtOpcUa.Host), process names
(OtOpcUa.Server -> OtOpcUa.Host). Switched nssm stop/start calls to
Stop-Service/Start-Service so the script works whether the underlying
service was installed via nssm or sc.exe.
Four new docs at docs/v2/ giving a single-page tour of each v2 piece:
- Architecture-v2.md: top-level mental model (fused Host + roles + cluster + live-edit)
- Cluster.md: AkkaClusterOptions + IClusterRoleInfo + WithOtOpcUaClusterBootstrap
- ControlPlane.md: 5 admin singletons + DPS topics + deploy flow + failover recovery
- Runtime.md: per-node actor tree + state machines + engine-wiring follow-up map
Each links back to the design doc for depth. Architecture-v2 cross-references
the other three + ServiceHosting + Redundancy + security.
- Redundancy.md: full rewrite — Akka-leader-driven ServiceLevel replaces
operator-managed RedundancyRole. Documents the 5-tier ServiceLevelCalculator,
RedundancyStateActor cluster singleton, and the DPS data flow.
- ServiceHosting.md: full rewrite — single fused OtOpcUa.Host binary with
OTOPCUA_ROLES env gating. Documents the conditional DI graph and the new
health endpoints (/health/ready, /health/active, /healthz).
- security.md: v2 banner at top covering path/project renames + new JWT bearer
+ DataProtection persisted to ConfigDb. Body unchanged because the 4-concern
security model is unchanged in v2; full per-section rewrite waits for F15
(Admin pages migration) since security.md references many pages that move.
- README.md: platform overview updated to v2 (fused Host + role gating).
DeployHappyPathTests exercises the full deploy pipeline on the 2-node harness:
AdminOperationsActor → ConfigPublishCoordinator → DistributedPubSub →
DriverHostActor on both nodes → ApplyAck → coordinator seals. Verifies both
NodeDeploymentState rows reach Applied and Deployment.Status reaches Sealed.
Exposed + fixed two production bugs along the way:
1. Coordinator was publishing DispatchDeployment on the "deployments" topic but
never subscribed to anything — DriverHostActor ACKs published on the same
topic could not reach it. Added dedicated "deployment-acks" topic with
coordinator subscription in PreStart, and DriverHostActor publishes ACKs
there.
2. NodeId derivation used member.Address.Host only, so two cluster members on a
shared loopback host (test harness, dev VMs) collapsed into one identity. The
coordinator's expected-ack set shrank to a single entry and the deployment
sealed after only half the nodes acked. Switched to host:port everywhere
(ClusterRoleInfo + coordinator) so loopback nodes stay distinct and production
identities are harmlessly more specific.
Tests: 95 v2 tests pass (was 93 + 2 deploy tests), 0 skipped.
Failover scenarios (design §8 cases 3-7: node-kill-mid-apply, split-brain,
restart-during-deploy) deferred — they need controlled node-down primitives
on the harness. Tracked as F22 (failover scenario test cases).
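The identity collision from bug 2 is easy to reproduce in isolation. The sketch below (Python, with hypothetical names and port numbers) shows how a host-only NodeId shrinks the expected-ack set for two loopback members, so a single ack is enough to seal.

```python
from typing import Callable, Iterable, NamedTuple, Set


class Member(NamedTuple):
    host: str
    port: int


def node_id_host_only(member: Member) -> str:
    # Old derivation: members on the same host share one identity.
    return member.host


def node_id_host_port(member: Member) -> str:
    # New derivation: the port keeps co-located members distinct.
    return f"{member.host}:{member.port}"


def expected_acks(
    members: Iterable[Member], derive: Callable[[Member], str]
) -> Set[str]:
    return {derive(m) for m in members}


def is_sealed(expected: Set[str], acked: Set[str]) -> bool:
    # Seal once every expected node has acknowledged.
    return expected <= acked


members = [Member("127.0.0.1", 4053), Member("127.0.0.1", 4054)]
old = expected_acks(members, node_id_host_only)   # one entry
new = expected_acks(members, node_id_host_port)   # two entries
```

With the old derivation, an ack from either node satisfies `is_sealed`; with the new one, both acks are required.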
Builds TwoNodeClusterHarness: two in-process Host-equivalent nodes sharing
an in-memory ConfigDb. Forms a 2-member Akka cluster. ClusterFormationTests
proves both nodes see each other as admin+driver role members.
Fixes a real production bug uncovered while wiring the harness — Program.cs
ran two separate ActorSystems (one from AddOtOpcUaCluster.AkkaHostedService
with cluster HOCON, one from Akka.Hosting.AddAkka with bare HOCON). Cluster
singletons landed on the bare ActorSystem and could not actually form a
cluster ("Configuration does not contain `akka.cluster` node").
Consolidation:
- AddOtOpcUaCluster now only binds AkkaClusterOptions + registers IClusterRoleInfo
- New WithOtOpcUaClusterBootstrap pushes embedded HOCON + Remote/Cluster options
into Akka.Hosting's AkkaConfigurationBuilder
- AkkaHostedService.cs deleted — Akka.Hosting now owns the lifecycle
- Program.cs + harness call WithOtOpcUaClusterBootstrap inside AddAkka
Why not WebApplicationFactory<Program>? Program.cs reads OTOPCUA_ROLES from
process env (shared across in-process WAFs); the harness replays Program.cs's
DI graph from a clean WebApplicationBuilder per node with per-node config
overrides. Same production extensions, isolated config + Kestrel + Akka ports.
Tests: 93 v2 tests pass (was 91 + 2 new cluster formation), 0 skipped.
Eliminates the services.BuildServiceProvider() captive-provider antipattern
(ASP0000) inside AddJwtBearer. The new ConfigureJwtBearerFromTokenService
resolves JwtTokenService from the real DI container at runtime and stays
in lock-step with JwtTokenService.BuildValidationParameters.
All 27 Security.Tests stay green, including the F1 integration tests that
exercise /auth/token through the real bearer pipeline.
Injects AuthenticationStateProvider and reads the current user's identity
name on Deploy click, replacing the "(current user)" placeholder.
Anonymous case falls back to "(anonymous)"; this should never be reached in
practice since the page requires FleetAdmin/ConfigEditor.
RedundancyStateActor now mirrors the publisher-injection pattern from FleetStatusBroadcaster and
PeerOpcUaProbeActor: Props accepts an optional Action<object> override so
tests can use a TestProbe sink instead of bootstrapping DistributedPubSub
(unreliable single-node in TestKit).
Un-skips the two RedundancyStateActor tests deferred under F6.
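The publisher-injection pattern described above boils down to an optional callback that replaces the real publish path in tests. A rough Python sketch follows, with hypothetical class and message names; the real actor takes an Action<object> through Props.

```python
from typing import Any, Callable, List, Optional

Publish = Callable[[Any], None]


def default_publish(message: Any) -> None:
    # Stand-in for the real pub/sub publish, which this sketch does not model.
    raise RuntimeError("pub/sub is not available in this sketch")


class StatePublisher:
    def __init__(self, publish: Optional[Publish] = None) -> None:
        # Tests pass a recording callback; production omits the argument.
        self._publish: Publish = publish if publish is not None else default_publish

    def on_state_changed(self, state: str) -> None:
        self._publish({"state": state})


# Test-style usage: capture published messages in a list.
captured: List[Any] = []
StatePublisher(publish=captured.append).on_state_changed("Primary")
```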
7 tests exercise AddOtOpcUaAuth + MapOtOpcUaAuth end-to-end against an
in-memory ConfigDb + stub ILdapAuthService. Covers /auth/login (204/401/503),
/auth/ping (401/200), /auth/token (200+JWT shape), /auth/logout (204+clear-cookie).
Scope is the auth contract — not the fused Host bootstrap (cluster + role
gating belongs in the Task 58 multi-node harness). HostBuilder + TestServer
is used directly instead of WebApplicationFactory<Program> because the
test project has no Program entry point and Host needs Akka cluster up.
Mirrors WithOtOpcUaControlPlaneSingletons for the driver role. Spawns
DriverHostActor + DbHealthProbeActor on the host's ActorSystem and
registers both under marker keys. Host's Program.cs now calls it when
the node carries the driver role, so driver-only and admin+driver
deployments both auto-bootstrap the per-node actors.
Integration test covers the registration round-trip via Microsoft.Extensions.Hosting
+ Akka.Hosting AddAkka.
Task 56: removes the legacy in-process Server + Admin Web project + their test
projects (Server.Tests, Admin.Tests, Admin.E2ETests). The fused OtOpcUa.Host
binary built across Phases 1-9 is now the sole production entry point.
What happened to the 47 legacy Admin Blazor pages: per follow-up F15, the
v1 architecture's draft/publish UX is replaced by v2's live-edit + snapshot-
deploy model, so a 1:1 migration is not meaningful. The mechanical move via
git mv preserves the history; service classes + page bodies that referenced
removed v1 types (ConfigGeneration, RedundancyRole, GenerationId) were
deleted. AdminUI now ships a minimal Home page + the v2 Deployments page.
Per-page rebuild against the v2 surface is tracked as F15. The v2 Deployments
page (Task 52) is the only first-party UI shipping in this PR.
Task 57: solution build green; 84+ tests green across active v2 + legacy
driver test projects.