Three pieces landed in one batch, closing F7-residual + Host DI #106:
Runtime/DriverInstanceActor:
- Subscribe / Unsubscribe message contracts; the Connected state
handles them via IDriver.ISubscribable. On every OnDataChange
event the actor publishes AttributeValuePublished to its parent
(DriverHostActor → OpcUaPublishActor). OPC UA StatusCode is
mapped to the 3-state OpcUaQuality enum via severity bits
(00=Good, 01=Uncertain, 10/11=Bad).
- DetachSubscription tears the handler off the driver on
DisconnectObserved, Unsubscribe, and PostStop so a stale handler
never pushes to a dead actor.
- WriteAttribute now dispatches IWritable.WriteAsync (batch of one)
with a 5s CancellationTokenSource; status-code propagated to
WriteAttributeResult on non-Good results.
Host:
- New ProjectReferences to Core + every cross-platform driver
assembly (AbCip/AbLegacy/FOCAS/Galaxy/Modbus/S7/TwinCAT).
Galaxy is net10 (gRPC client to mxaccessgw); the COM-bound net48
Wonderware Historian driver stays out of the Host's reference
closure — its .Client gRPC wrapper is what binds for historian
needs.
- New DriverFactoryBootstrap.AddOtOpcUaDriverFactories() registers
a singleton DriverFactoryRegistry, invokes each driver's
Register(registry, loggerFactory), and binds IDriverFactory to
DriverFactoryRegistryAdapter. Replaces the F7 NullDriverFactory
default so deploys actually materialise real IDriver instances
on driver-role nodes. ShouldStub() still gates per-platform
behaviour at spawn time.
- Program.cs wires AddOtOpcUaDriverFactories() before AddAkka so
the runtime extension can resolve IDriverFactory from DI.
Tests: Runtime 46 -> 52 (+6):
- Write returns success when StatusCode = Good
- Write propagates non-Good status code in failure Reason
- Subscribe forwards OnDataChange to parent as AttributeValuePublished
- Quality translation: Uncertain (0x40...) and Bad (0x80...)
- Subscribe against non-ISubscribable returns failure
- DisconnectObserved detaches handler so late events are dropped
All 6 v2 test suites green: 152 tests passing.
Closes F7. F7-residual sub-tasks #110 (subscribe) and #111 (write)
both shipped. Host DI binding #106 shipped.
Splits the side-effecting half of Phase7Composer (deferred at Task 47)
into two pieces that mirror DriverHostActor's spawn-plan pattern:
Phase7Plan + Phase7Planner.Compute (pure):
Diff two Phase7CompositionResult snapshots by stable id (EquipmentId,
DriverInstanceId, ScriptedAlarmId). Emits Added/Removed/Changed lists
per entity class. Added/Removed are sorted by id for deterministic
apply order. Changed wraps both Previous and Current projections so
consumers can decide between in-place mutation and tear-down +
rebuild.
Phase7Applier (side-effecting):
Drives an IOpcUaAddressSpaceSink against a plan. Removed equipment/
alarms get an inactive AlarmState write per id; Added/Removed of
Equipment or ScriptedAlarm triggers RebuildAddressSpace. Driver-only
changes correctly skip the rebuild — those flow through DriverHost-
Actor's spawn-plan in Runtime. Sink exceptions are caught + logged so
one bad node doesn't abort the apply.
Tests: OpcUaServer 6 -> 20 (+14):
- Phase7PlannerTests x9 (empty-in/empty-out, add/remove/change per
entity class, mixed changes, deterministic ordering)
- Phase7ApplierTests x5 (empty plan no-op, removal writes inactive
states + rebuild, added equipment triggers rebuild, driver-only
skips rebuild, sink fault is non-fatal)
The remaining piece is the EquipmentNodeWalker integration against a
real SDK NodeManager — split as F14b, gated on F10b's SDK builder.
All 6 v2 test suites green: 146 tests passing.
OpcUaPublishActor now routes through pluggable seams instead of just
incrementing a counter:
- IOpcUaAddressSpaceSink (Commons.OpcUa) — WriteValue / WriteAlarmState
/ RebuildAddressSpace. OpcUaQuality enum moved here from the actor's
nested type so producers don't have to reference the actor itself.
- IServiceLevelPublisher — Publish(byte). NullServiceLevelPublisher
retains the last level for inspection.
- The actor subscribes to the redundancy-state DPS topic in PreStart
and maps the local node's NodeRedundancyState to a coarse
ServiceLevel (Primary+leader=240, Primary=200, Secondary=100,
Detached=0). This keeps the local SDK's ServiceLevel node honest
without round-tripping back through the admin-singleton calculator.
- ServiceLevelChanged dedupes identical levels so the SDK doesn't see
redundant writes.
- Sink + publisher exceptions are caught and logged; the actor never
crashes its own dispatcher.
- PropsForTests gets optional sink/publisher/localNode params and
skips the DPS subscribe so unit tests stay on a vanilla TestKit
cluster.
Production binding to a real SDK NodeManager + Variable nodes is the
remaining residual — split as F10b. Task 60 still blocked on F10b.
Tests: Runtime 40 -> 46 (+6):
- AttributeValueUpdate routes to sink
- AlarmStateUpdate routes to sink
- RebuildAddressSpace calls sink.Rebuild
- ServiceLevelChanged dedupes
- RedundancyStateChanged for primary-leader publishes 240
- RedundancyStateChanged for secondary publishes 100
All 6 v2 test suites green: 132 tests passing.
VirtualTagActor and ScriptedAlarmActor now route through pluggable
evaluator interfaces and fan out to the cluster's live-tail topics
shipped in F15.3:
- IVirtualTagEvaluator + NullVirtualTagEvaluator in Commons.Engines.
VirtualTagActor calls evaluator on every DependencyValueChanged,
dedupes unchanged values, forwards EvaluationResult to its parent,
and publishes ScriptLogEntry Warning to the script-logs DPS topic
whenever the evaluator fails.
- IScriptedAlarmEvaluator + NullScriptedAlarmEvaluator. ScriptedAlarmActor
takes an AlarmConfig (id/name/equipment-path/severity/predicate) and
publishes both an AlarmTransitionEvent (alerts topic) and a
ScriptLogEntry (script-logs topic) at every transition. Manual
ConditionMet/Acknowledge/Cleared still flow through the same
Transition() so callers without engine bindings still drive the
state machine; the legacy single-string Props() overload routes
through a default AlarmConfig.
The Null* defaults keep the actors safe when no engine is bound —
unconfigured nodes never spuriously alarm. Production binding to
Core.VirtualTags.VirtualTagEngine and Core.ScriptedAlarms is the
remaining residual (F8b/F9b — split in tasks JSON).
Tests: Runtime 34 -> 40 (+6):
- VirtualTagActorTests x3 (evaluator drives EvaluationResult,
unchanged-value dedup, failure publishes Warning ScriptLogEntry)
- ScriptedAlarmActorTests x3 (engine threshold drives Activated +
Cleared on alerts topic, manual Acknowledge attribution).
All 6 v2 test suites green: 126 tests passing.
DriverHostActor.ApplyAndAck now reads the deployment artifact and
reconciles its set of DriverInstanceActor children — spawn the missing,
ApplyDelta to those with changed config, stop the removed/disabled.
The diff lives in pure DriverSpawnPlanner so it can be unit-tested
without an ActorSystem.
Adds IDriverFactory in Core.Abstractions (consumed by Runtime) +
DriverFactoryRegistryAdapter in Core.Hosting that wraps the existing
v1 DriverFactoryRegistry — Runtime stays decoupled from Polly/Serilog,
the Host wires the adapter once driver assemblies have registered.
ShouldStub(type, roles) is now actually called on every spawn — Galaxy
+ Wonderware-Historian boot stubbed on macOS/Linux or whenever the host
carries the dev role. Missing factory ⇒ stub fallback, never a crash.
Tests: 24 → 34 in Runtime (+10):
- DriverSpawnPlannerTests x7 (diff cases, type change ⇒ stop+respawn)
- DeploymentArtifactTests x5 (empty/malformed/missing fields tolerant)
- DriverHostActorReconcileTests x4 (spawn count, stub fallback,
ShouldStub gate, second-apply stops the removed)
All 6 v2 test suites green: 120 tests passing.
Closes F20 (ShouldStub wired). F7 marked partial — subscription
publishing + write path still stubbed in DriverInstanceActor itself.
Final F15 batch wires up the SignalR-backed live pages, ports the bulk
equipment importer, and progressively enhances the Script source editor
with Monaco.
Message contracts:
- Commons.Messages.Alerts.AlarmTransitionEvent — fires on every alarm
state transition; published on the `alerts` DPS topic by future
ScriptedAlarmActor (F9) emits.
- Commons.Messages.Logging.ScriptLogEntry — one log line emitted by a
hosted script; published on the `script-logs` DPS topic by future
VirtualTagActor (F8) + ScriptedAlarmActor (F9) emits.
(Folder named "Logging" to dodge .gitignore's "logs/" rule.)
SignalR plumbing:
- AlertHub gains MethodName + bridge actor (AlertSignalRBridge)
- ScriptLogHub introduced; ScriptLogSignalRBridge follows the same
DPS-subscribe → IHubContext fan-out pattern as FleetStatusSignalRBridge
- WithOtOpcUaSignalRBridges now spawns all three bridges
- MapOtOpcUaHubs maps /hubs/script-log alongside the existing hubs
Pages:
- /alerts live alarm tail, 200-row capacity
- /script-log live script-log tail with level + script
filter, 500-row capacity
- /clusters/{id}/equipment/import — CSV bulk Equipment add with preview
(Name/MachineCode/UnsLineId/Driver +
optional ZTag/SAPID/Manufacturer/Model;
skips rows whose MachineCode already
exists in the fleet)
- ScriptEdit progressively enhanced with Monaco editor via JSInterop —
the textarea remains Blazor's source of truth and Monaco syncs into it
on every keystroke so @bind keeps working; falls back gracefully if
the CDN is unreachable.
MainLayout nav gains a "Live" section (Deployments, Alerts, Alarms
historian) and a "Scripts" link under Scripting. ClusterEquipment
surfaces the new Import CSV button.
Tally: F15 ships ~42 razor pages + 3 SignalR hubs + 3 bridge actors.
Microsoft.AspNetCore.SignalR.Client added (was already in central PM).
All 104 v2 tests remain green.
Final batch of F15.2. After this commit every entity surfaced by the
Phase A-D read views has a matching new/edit/delete form.
- AclEdit.razor /clusters/{id}/acls/{new|aclId}
- NodePermissions [Flags] enum surfaced as per-bit checkboxes plus
one-click bundle buttons (ReadOnly / Operator / Engineer / Admin)
- ScopeKind select + ScopeId free-text target (null = cluster-wide)
- VirtualTagEdit.razor /virtual-tags/{new|virtualTagId}
- Trigger validation: enforces at least one of ChangeTriggered or
TimerIntervalMs is set
- ScriptedAlarmEdit.razor /scripted-alarms/{new|scriptedAlarmId}
- AlarmType select with OPC UA Part 9 subtypes
- MessageTemplate is a textarea (template tokens are server-resolved)
- ScriptEdit.razor /scripts/{new|scriptId}
- SHA-256 hash computed from SourceCode on save (operator never sees
or edits SourceHash directly)
- InputTextArea now; Monaco syntax editor is a future enhancement
List pages (ClusterAcls / VirtualTags / ScriptedAlarms / Scripts) all
gain New + per-row Edit affordances.
Tally: F15.2 shipped CRUD for 11 entities — Cluster, ClusterNode,
UnsArea, UnsLine, Namespace, DriverInstance, Equipment, Tag, NodeAcl,
VirtualTag, ScriptedAlarm, Script.
All 9 integration tests still green.
The two most-edited entities for daily operator workflows. Both follow the
same single-page edit-or-create pattern from batches 1 + 2 with RowVersion
optimistic concurrency.
- EquipmentEdit.razor /clusters/{id}/equipment/{new|EquipmentId}
- EquipmentId is system-generated on create (decision #125): EQ-{first
12 hex chars of a new EquipmentUuid}.
- UNS line + driver instance selects are scoped to the cluster.
- All 9 OPC 40010 identification fields surfaced as an optional panel.
- MachineCode uniqueness checked client-side before EF unique index
enforces it server-side.
- TagEdit.razor /clusters/{id}/tags/{new|TagId}
- Equipment vs FolderPath input switches based on the selected
driver's namespace kind — Equipment-kind requires EquipmentId,
SystemPlatform-kind requires FolderPath (decision #110 invariant
enforced client-side; sp_ValidateDraft re-enforces server-side at
deploy).
- DataType select uses the OPC UA built-in primitive type names.
- TagConfig validated as JSON pre-flight.
ClusterEquipment + ClusterTags list pages get New / Edit affordances.
All 9 integration tests still green.
Same single-page edit-or-create pattern as batch 1, applied to the
foundational topology entities. After this batch the whole hierarchy
(cluster → nodes → UNS areas → UNS lines → namespaces → drivers) is
fully editable through the UI.
- ClusterEdit.razor /clusters/{id}/edit
Update + delete for an existing cluster. NodeCount stays coupled to
RedundancyMode (None→1, Warm/Hot→2). ModifiedBy taken from
AuthenticationStateProvider.
- NodeEdit.razor /clusters/{id}/nodes/{new|nodeId}
Full ClusterNode CRUD. ApplicationUri uniqueness is enforced by EF
index; ServiceLevelBase defaults to 200 (primary preference) on
create; per-node DriverConfigOverridesJson validated as JSON.
- UnsAreaEdit.razor /clusters/{id}/uns/areas/{new|id}
- UnsLineEdit.razor /clusters/{id}/uns/lines/{new|id}
UNS structure CRUD; Lines pick their parent Area from a select that
loads the cluster's areas.
List pages updated:
- ClusterOverview now shows an "Edit cluster" button + a "New node"
action on the nodes panel + per-row Edit buttons.
- ClusterUns gains New/Edit affordances for both Areas and Lines.
All 9 integration tests still green; no regressions.
Pattern proof for the live-edit forms gated by Phases A–D's read views.
Each entity gets a single edit page handling both create (route param
omitted) and update (route param present) modes, with RowVersion-based
optimistic concurrency checked against EF Core's
DbUpdateConcurrencyException.
Pattern:
- @page "/clusters/{id}/<thing>/new"
- @page "/clusters/{id}/<thing>/{rowId}"
- IsNew computed from rowId presence
- EditForm + DataAnnotations validation
- byte[] RowVersion stashed on FormModel; assigned to
Entry(e).Property(e => e.RowVersion).OriginalValue before SaveChanges
- Delete button (edit mode only) flows through the same RowVersion check
- Concurrency conflict surfaces as an inline error panel; user reloads
This batch:
- NamespaceEdit.razor — small entity, validates the pattern
- DriverEdit.razor — keystone for everything downstream
(Equipment/Tag/VirtualTag/ScriptedAlarm),
JSON config editor per Q1 with reformat
on save and validation pre-flight
- ClusterNamespaces row gains an Edit button + New action
- ClusterDrivers expanded view gains an Edit button + New action
Equipment/UnsArea/UnsLine/Tag/ACL/VirtualTag/ScriptedAlarm/Script forms
follow this same template in subsequent F15.2 batches.
All 9 integration tests still green; no v2 test regressions.
- ClusterAudit (/clusters/{id}/audit) — reads ConfigAuditLog with the
EventId/CorrelationId columns added in F3; shown as a Cluster tab
- VirtualTags (/virtual-tags) — fleet-wide read view
- ScriptedAlarms (/scripted-alarms) — fleet-wide read view
- Scripts (/scripts) — fleet-wide; expandable code preview
- RoleGrants (/role-grants) — per Q4, surfaces the fleet-wide
LDAP-group → role mapping from
Authentication:Ldap:GroupToRole
(read-only; reload via host restart)
- Certificates (/certificates) — own/trusted/issuer/rejected store
contents resolved against
OpcUa:PkiStoreRoot config (F13a)
- Reservations (/reservations) — ExternalIdReservation table
- AlarmsHistorian (/alarms-historian) — live HistorianAdapterActor sink
status via the F11 GetStatus query;
5s polling
ScriptLog deferred (needs the F16-deferred ScriptLogHub bridge).
ClusterNav extended with the Audit tab.
Adds an AdminUI → Runtime project reference so the historian status page can
inject IRequiredActor<HistorianAdapterActorKey>. NuGet audit suppression for
the transitive Opc.Ua.Core advisory mirrored from the Runtime project.
All 104 v2 tests still green.
Per Q3 of the rebuild plan, each v1 ClusterDetail tab becomes a separate
route under /clusters/{id}/<tab>. This batch adds read-only table views
for the six core config entity types; live-edit forms with RowVersion
concurrency land in Phase C.2 once the read-view shape is reviewed.
- ClusterEquipment /clusters/{id}/equipment — joins via DriverInstance
so the cluster scope works
- ClusterUns /clusters/{id}/uns — Areas + Lines tables
- ClusterNamespaces /clusters/{id}/namespaces — Kind + URI + Enabled chip
- ClusterDrivers /clusters/{id}/drivers — collapsed list with JSON
config expandable per Q1
(typed editors deferred)
- ClusterTags /clusters/{id}/tags — first 200 by name + filter
- ClusterAcls /clusters/{id}/acls — LDAP group + scope +
NodePermissions bits
Shared ClusterNav.razor extracted; ClusterOverview + ClusterRedundancy
updated to use it. _Imports.razor adds Components.Shared so the shared
nav is in scope across pages.
- ClustersList (/clusters) — table view, row-click opens detail
- NewCluster (/clusters/new) — EditForm with DataAnnotations; redundancy
mode + node-count coupling enforced client-side (None→1, Warm/Hot→2);
CreatedBy taken from AuthenticationStateProvider
- ClusterOverview (/clusters/{id}) — cluster details + last-deployment
badge + node list. Per Q3, the legacy 10-tab monolith is split into
separate routes; this page hosts the Overview "tab" as its primary slot
- ClusterRedundancy (/clusters/{id}/redundancy) — static ServiceLevelBase
config view; live ServiceLevel comes via RedundancyStateActor DPS topic
(deferred to its own follow-up once the SignalR bridge lands)
The other 8 v1 cluster tabs (Equipment, UNS, Namespaces, Drivers, Tags,
ACLs, ScriptedAlarms, Scripts, Audit) land in Phase C/D.
Implements Phase A of the F15 rebuild plan: minimum-viable Admin surface
with a working sign-in path and a fleet-state landing page. Decisions Q1–Q5
of docs/v2/AdminUI-rebuild-plan.md were taken as recommended.
- App.razor (moved into AdminUI library from the Host stub; vendored
Bootstrap from RCL wwwroot — no public CDN, air-gap safe)
- Routes.razor (AuthorizeRouteView enforces page-level [Authorize])
- RedirectToLogin.razor (preserves returnUrl through the auth hop)
- Login.razor (static SSR, posts to /auth/login; Q5 wording about
generic-vs-specific LDAP errors)
- Account.razor (identity + fleet roles + raw LDAP groups; Q4 — no
per-cluster grants; fleet-wide LDAP-group → role mapping only)
- Fleet.razor (per-node deployment status: reads NodeDeploymentState
+ unions with IClusterRoleInfo.MembersWithRole("driver") so freshly-
joined nodes appear as "waiting"; 10s auto-refresh)
- Hosts.razor (Akka cluster topology: members, status, roles, role-
leader; 5s auto-refresh)
Host's stub App.razor deleted; Program.cs now points at
AdminUI.Components.App via an added using.
All 104 v2 tests remain green.
Adds OPC UA SDK's CheckApplicationInstanceCertificate call to
OpcUaApplicationHost.StartAsync, removing the v1 friction of needing to
pre-create the PKI directory tree before booting.
- New OpcUaApplicationHostOptions.PkiStoreRoot (defaults to "pki")
- BuildConfigurationAsync now derives own/issuer/trusted/rejected from
PkiStoreRoot so the cert paths are configurable + consistent
- EnsureApplicationCertificateAsync runs before StandardServer.Start, and
fails fast with a clear message if the SDK can't produce a valid cert
- 2 new tests: fresh-tree creates a cert, second boot reuses it
Partial slice of follow-up F13. Endpoint-security, user-token validator,
and observability wiring still pending in the F13 follow-up. OpcUaServer
tests: 4 → 6.
Adds a real-infra mode for the integration test harness alongside the default
in-memory mode. Drops the previously-untested code paths (EF SqlServer
behaviors, real LDAP bind) under env-var control without breaking the
zero-infra default that CI runs.
- docker-compose.yml — minimal SQL 2022 (14331) + OpenLDAP (3894) stack
(ports chosen to coexist with docker-dev/ on 14330/3893)
- HarnessMode record reads OTOPCUA_HARNESS_USE_SQL=1 / USE_LDAP=1 from env
- SQL mode: per-harness unique DB OtOpcUa_Harness_{guid}, EnsureCreated
at startup, EnsureDeleted on dispose (best-effort)
- LDAP mode: drops StubLdapAuthService and configures real LdapAuthService
against the compose'd OpenLDAP via Authentication:Ldap:* config keys
- Microsoft.EntityFrameworkCore.SqlServer added to the test project
- README documents both modes + the macOS no-Docker caveat
Default in-memory mode unchanged — all 9 existing tests still pass.
Reshapes the placeholder buffered-counter actor into a thin fire-and-forget
bridge over the existing IAlarmHistorianSink contract. Default sink is
NullAlarmHistorianSink; production deployments override the DI binding to
SqliteStoreAndForwardSink wrapping WonderwareHistorianClient (the v1
components in src/Drivers/ZB.MOM.WW.OtOpcUa.Driver.Historian.Wonderware*
are reused verbatim — actor is just a mailbox-friendly entry point).
- HistorianAdapterActor.Props(IAlarmHistorianSink?) — null defaults to NullAlarmHistorianSink
- Receive<AlarmHistorianEvent>: fire-and-forget sink.EnqueueAsync
- Receive<GetStatus>: returns sink.GetStatus() (queue depth + drain state)
- ServiceCollectionExtensions.AddOtOpcUaRuntime registers the default sink
- WithOtOpcUaRuntimeActors spawns the actor + registers HistorianAdapterActorKey
- Program.cs calls AddOtOpcUaRuntime when hasDriver
Tests: 2 new (forward-to-sink + GetStatus). Runtime suite 17 → 18.
Extends TwoNodeClusterHarness with three lifecycle primitives:
- StopNodeBAsync() — graceful CoordinatedShutdown (Cluster.Leave)
- RestartNodeBAsync() — rebuild node B on same Akka port + same in-memory DB
- WaitForClusterSizeAsync(n) — converge assertion helper
Adds three failover scenario tests:
- Stopping node B shrinks cluster to 1 Up member
- Restarted node B rejoins on the same Akka port
- Deployment started with B down seals with a single NodeDeploymentState
(validates ConfigPublishCoordinator.DiscoverDriverNodes snapshots
membership at dispatch time)
Closes follow-up F22. Integration test count: 6 → 9 (+3).
New per-admin-node actor that subscribes to the fleet-status DistributedPubSub
topic + forwards every FleetStatusChanged snapshot to all SignalR clients
connected to FleetStatusHub via IHubContext.
Wired via WithOtOpcUaSignalRBridges (new AkkaConfigurationBuilder extension in
AdminUI.Hubs) — Program.cs calls it inside the if(hasAdmin) block alongside
WithOtOpcUaControlPlaneSingletons.
Per-node subscription rather than cluster-singleton: every admin node forwards
its own snapshots to its own connected clients. Simpler than singleton
coordination + acceptable because the messages are small and SignalR fan-out
is per-node anyway.
- New Commons.Messages.Fleet.GetDiagnostics request record.
- DriverHostActor handles GetDiagnostics in all three states (Steady, Applying,
Stale); replies with a NodeDiagnosticsSnapshot built from _currentRevision
+ the local NodeId. Drivers list is empty until F7 wires the per-instance
children.
- FleetDiagnosticsClient now resolves the target via ActorSelection at
akka.tcp://{system}@{nodeId}/user/driver-host and Asks with a 3s timeout.
On timeout/peer-down it returns an empty snapshot so the UI degrades
gracefully rather than throwing.
Two new integration tests in Host.IntegrationTests:
- GetDiagnostics_returns_snapshot_with_target_NodeId verifies the
cross-node Ask/Reply works.
- GetDiagnostics_after_deploy_reports_current_revision exercises the
end-to-end path: AdminOps starts a deployment, both DriverHostActors
apply, then diagnostics reports the new revision on both nodes.
All 98 v2 tests pass (was 96 + 2 new).
Replaces the Ok=true stub with a TCP connect to the peer's OPC UA port (4840
default) with a 2s timeout. A successful connect indicates the OPC UA server
process is up + accepting connections — enough for the redundancy calculator
to treat the peer as live. A full secure-channel Hello/Acknowledge handshake
is overkill for what the redundancy calc consumes and would pull in the OPC
UA Client SDK + a PKI setup. Upgrade later if a deeper liveness signal is ever
required.
Probe extracts the host from NodeId by stripping the :port suffix (commit
5cfbe8b encoded host:port into NodeId for cluster-member identity).
Tests: 2 new tests — Ok=true against a live TcpListener on a chosen port,
Ok=false against an unreachable endpoint. All 17 Runtime tests pass (was 16
covering only the message-contract surface).
ConfigAuditLog gains two nullable columns (EventId, CorrelationId) + a filtered
unique index UX_ConfigAuditLog_EventId. EF migration
20260526105027_AddConfigAuditLogEventIdColumns is additive (nullable + filtered
index = legacy rows backfill cleanly).
AuditWriterActor now writes EventId + CorrelationId into the dedicated columns
instead of synthesising a JSON wrapper into DetailsJson. Cross-restart dedup
is now real: a retry of an already-flushed batch hits the unique index and
SaveChanges throws; the existing catch drops the duplicate without losing the
rest of the batch.
WrapDetails helper deleted — F4 (its JSON hardening) becomes moot.
AuditWriterActorTests.Details_wrapper_embeds_eventId_and_correlationId renamed
+ rewritten to assert against the columns. All 29 ControlPlane tests pass,
all 95 v2 tests green.
.github/workflows/v2-ci.yml runs on push/PR to v2-akka-fuse + master:
- build job: dotnet restore + build (Release)
- unit-tests job: matrix over the 5 v2 test projects (Cluster, ControlPlane,
Runtime, Security, OpcUaServer) with Category!=E2E
- integration job: Host.IntegrationTests with Category!=E2E
.github/workflows/v2-e2e.yml runs nightly at 03:00 UTC + workflow_dispatch:
- Brings up the docker-dev four-node fleet (admin pair + driver pair + SQL
+ LDAP + Traefik)
- Waits up to 60s for /health/active to return 200
- Runs Category=E2E only
- Always tears down with -v
Both workflows pin .NET 10 via actions/setup-dotnet (no global.json so any
10.0 SDK works). Compatible with both GitHub Actions and Gitea Actions
(act_runner). The E2E filter currently matches zero tests because the
tests/Server/ZB.MOM.WW.OtOpcUa.E2ETests project doesn't exist yet — it lands
when F10/F11/F12 wire enough engine for an end-to-end round-trip to be
meaningful.
- scripts/install/traefik.yml + traefik-dynamic.yml: Traefik static + dynamic
config. One :80 entry point, one router on HostRegexp(otopcua.*), one
service load-balancing admin-a:9000 + admin-b:9000 with /health/active health
check (interval 5s, timeout 2s, expected 200). Followers return 503 from
/health/active so Traefik drops them within the next interval after a
leadership change.
- scripts/install/Install-Traefik.ps1: downloads Traefik for Windows, drops the
yml configs, registers the OtOpcUaTraefik Windows service via sc.exe with
restart-on-failure. Companion to Install-Services.ps1.
- docker-dev/{Dockerfile,docker-compose.yml,traefik-dynamic.yml,README.md}:
Mac-friendly four-node fleet (admin-a + admin-b + driver-a + driver-b) plus
SQL Server 2022 + OpenLDAP + Traefik. Single OtOpcUa.Host image built once;
Compose drives OTOPCUA_ROLES + Cluster:* per container to differentiate the
four hosts. README walks through bring-up + failover smoke + the dev LDAP
users.
Note: untested on macOS (no local Docker — see docs/v2/dev-environment.md).
- Install-Services.ps1: installs OtOpcUaHost (single fused binary) replacing
the v1 OtOpcUa + OtOpcUaAdmin pair. Required -Roles param writes OTOPCUA_ROLES
to the service env so Program.cs decides what to mount (admin / driver / both).
-HttpPort param (default 9000) writes ASPNETCORE_URLS on admin-role nodes.
sc.exe restart-on-failure: 5s, 30s, 60s; reset counter after 24h clean run.
Wonderware historian sidecar install logic preserved from v1.
- Uninstall-Services.ps1: removes OtOpcUaHost + cleans up legacy v1 names
(OtOpcUa, OtOpcUaAdmin) and the long-retired OtOpcUaGalaxyHost.
- Refresh-Services.ps1: updated service names (OtOpcUa -> OtOpcUaHost), publish
path (ZB.MOM.WW.OtOpcUa.Server -> ZB.MOM.WW.OtOpcUa.Host), process names
(OtOpcUa.Server -> OtOpcUa.Host). Switched nssm stop/start calls to
Stop-Service/Start-Service so the script works whether the underlying
service was installed via nssm or sc.exe.
Four new docs at docs/v2/ giving a single-page tour of each v2 piece:
- Architecture-v2.md: top-level mental model (fused Host + roles + cluster + live-edit)
- Cluster.md: AkkaClusterOptions + IClusterRoleInfo + WithOtOpcUaClusterBootstrap
- ControlPlane.md: 5 admin singletons + DPS topics + deploy flow + failover recovery
- Runtime.md: per-node actor tree + state machines + engine-wiring follow-up map
Each links back to the design doc for depth. Architecture-v2 cross-references
the other three + ServiceHosting + Redundancy + security.
- Redundancy.md: full rewrite — Akka-leader-driven ServiceLevel replaces
operator-managed RedundancyRole. Documents the 5-tier ServiceLevelCalculator,
RedundancyStateActor cluster singleton, and the DPS data flow.
- ServiceHosting.md: full rewrite — single fused OtOpcUa.Host binary with
OTOPCUA_ROLES env gating. Documents the conditional DI graph and the new
health endpoints (/health/ready, /health/active, /healthz).
- security.md: v2 banner at top covering path/project renames + new JWT bearer
+ DataProtection persisted to ConfigDb. Body unchanged because the 4-concern
security model is unchanged in v2; full per-section rewrite waits for F15
(Admin pages migration) since security.md references many pages that move.
- README.md: platform overview updated to v2 (fused Host + role gating).
DeployHappyPathTests exercises the full deploy pipeline on the 2-node harness:
AdminOperationsActor → ConfigPublishCoordinator → DistributedPubSub →
DriverHostActor on both nodes → ApplyAck → coordinator seals. Verifies both
NodeDeploymentState rows reach Applied and Deployment.Status reaches Sealed.
Exposed + fixed two production bugs along the way:
1. Coordinator was publishing DispatchDeployment on the "deployments" topic but
never subscribed to anything — DriverHostActor ACKs published on the same
topic could not reach it. Added dedicated "deployment-acks" topic with
coordinator subscription in PreStart, and DriverHostActor publishes ACKs
there.
2. NodeId derivation used member.Address.Host only — two cluster members on a
shared loopback host (test harness, dev VMs) collided to one identity. The
coordinator's expected-ack set became {1} and the system sealed after only
half the nodes acked. Switched to host:port everywhere (ClusterRoleInfo +
coordinator) so loopback nodes stay distinct and production identities are
harmlessly more specific.
Tests: 95 v2 tests pass (was 93 + 2 deploy tests), 0 skipped.
Failover scenarios (design §8 cases 3-7: node-kill-mid-apply, split-brain,
restart-during-deploy) deferred — they need controlled node-down primitives
on the harness. Tracked as F22 (failover scenario test cases).
Builds TwoNodeClusterHarness: two in-process Host-equivalent nodes sharing
an in-memory ConfigDb. Forms a 2-member Akka cluster. ClusterFormationTests
proves both nodes see each other as admin+driver role members.
Fixes a real production bug uncovered while wiring the harness — Program.cs
ran two separate ActorSystems (one from AddOtOpcUaCluster.AkkaHostedService
with cluster HOCON, one from Akka.Hosting.AddAkka with bare HOCON). Cluster
singletons landed on the bare ActorSystem and could not actually form a
cluster ("Configuration does not contain `akka.cluster` node").
Consolidation:
- AddOtOpcUaCluster now only binds AkkaClusterOptions + registers IClusterRoleInfo
- New WithOtOpcUaClusterBootstrap pushes embedded HOCON + Remote/Cluster options
into Akka.Hosting's AkkaConfigurationBuilder
- AkkaHostedService.cs deleted — Akka.Hosting now owns the lifecycle
- Program.cs + harness call WithOtOpcUaClusterBootstrap inside AddAkka
Why not WebApplicationFactory<Program>? Program.cs reads OTOPCUA_ROLES from
process env (shared across in-process WAFs); the harness replays Program.cs's
DI graph from a clean WebApplicationBuilder per node with per-node config
overrides. Same production extensions, isolated config + Kestrel + Akka ports.
Tests: 93 v2 tests pass (was 91 + 2 new cluster formation), 0 skipped.
Eliminates the services.BuildServiceProvider() captive-provider antipattern
(ASP0000) inside AddJwtBearer. The new ConfigureJwtBearerFromTokenService
resolves JwtTokenService from the real DI container at runtime and stays
in lock-step with JwtTokenService.BuildValidationParameters.
All 27 Security.Tests stay green, including the F1 integration tests that
exercise /auth/token through the real bearer pipeline.
Injects AuthenticationStateProvider and reads the current user's identity
name on Deploy click, replacing the "(current user)" placeholder.
Anonymous case falls back to "(anonymous)" — should never hit in practice
since the page requires FleetAdmin/ConfigEditor.
Mirrors the publisher-injection pattern from FleetStatusBroadcaster and
PeerOpcUaProbeActor: Props accepts an optional Action<object> override so
tests can use a TestProbe sink instead of bootstrapping DistributedPubSub
(unreliable single-node in TestKit).
Un-skips the two RedundancyStateActor tests deferred under F6.
7 tests exercise AddOtOpcUaAuth + MapOtOpcUaAuth end-to-end against an
in-memory ConfigDb + stub ILdapAuthService. Covers /auth/login (204/401/503),
/auth/ping (401/200), /auth/token (200+JWT shape), /auth/logout (204+clear-cookie).
Scope is the auth contract — not the fused Host bootstrap (cluster + role
gating belongs in the Task 58 multi-node harness). HostBuilder + TestServer
is used directly instead of WebApplicationFactory<Program> because the
test project has no Program entry point and Host needs Akka cluster up.
Mirrors WithOtOpcUaControlPlaneSingletons for the driver role. Spawns
DriverHostActor + DbHealthProbeActor on the host's ActorSystem and
registers both under marker keys. Host's Program.cs now calls it when
the node carries the driver role, so driver-only and admin+driver
deployments both auto-bootstrap the per-node actors.
Integration test covers the registration round-trip via Microsoft.Extensions.Hosting
+ Akka.Hosting AddAkka.