ScriptedAlarmActor now survives actor restart: PreStart loads from
the configured store + restores in-memory state; every Transition()
fires a fire-and-forget save. ActiveState still re-derives from the
evaluator on first tick (Phase 7 decision #14), but Acked state +
lastAckUser persist verbatim so operators don't re-ack across an
outage.
Three pieces:
- IAlarmActorStateStore seam in Commons.Engines, with the
AlarmActorStateSnapshot record (alarmId / state / lastTransitionUtc
/ lastAckUser) and NullAlarmActorStateStore default.
- EfAlarmActorStateStore in Runtime.ScriptedAlarms — production
adapter over the existing ScriptedAlarmState table in ConfigDb.
Maps the actor's 3-state enum to the table's AckedState column
(Active⇒Unacknowledged, Acknowledged⇒Acknowledged, Inactive⇒
Acknowledged). Concurrency conflicts are logged + dropped — the
next transition writes again.
- ScriptedAlarmActor PreStart load (async, piped back as
StateRestored) + Transition save. New Props overload takes the
store; default is NullAlarmActorStateStore so tests stay quiet.
Tests: Runtime 52 -> 57 (+5):
- Transition writes Active then Acknowledged snapshots with
lastAckUser populated
- PreStart with persisted Active state restores so a subsequent
AcknowledgeAlarm fires (not ignored as it would be from Inactive)
- Empty store boots Inactive (AcknowledgeAlarm correctly ignored)
- EfAlarmActorStateStore Save + Load round-trips via in-memory EF
- Load for unknown alarmId returns null
All 6 v2 test suites green: 157 tests passing.
Closes #112. F9 (#80) remaining residual is predicate binding to
Core.ScriptedAlarms.ScriptedAlarmEngine — split as F9b in tasks JSON.
Three pieces landed in one batch, closing F7-residual + Host DI #106:
Runtime/DriverInstanceActor:
- Subscribe / Unsubscribe message contracts; the Connected state
handles them via IDriver.ISubscribable. On every OnDataChange
event the actor publishes AttributeValuePublished to its parent
(DriverHostActor → OpcUaPublishActor). OPC UA StatusCode is
mapped to the 3-state OpcUaQuality enum via severity bits
(00=Good, 01=Uncertain, 10/11=Bad).
- DetachSubscription tears the handler off the driver on
DisconnectObserved, Unsubscribe, and PostStop so a stale handler
never pushes to a dead actor.
- WriteAttribute now dispatches IWritable.WriteAsync (batch of one)
with a 5s CancellationTokenSource; status-code propagated to
WriteAttributeResult on non-Good results.
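
A minimal sketch of the severity-bit quality mapping described in the
Subscribe bullet above, written in Python for illustration only (the
actor itself is C#). The function and enum member names are
hypothetical; the only facts taken from this message are the 00/01/10/11
severity split and the three quality values.

```python
from enum import Enum


class OpcUaQuality(Enum):
    GOOD = "Good"
    UNCERTAIN = "Uncertain"
    BAD = "Bad"


def quality_from_status_code(status_code: int) -> OpcUaQuality:
    """Map a 32-bit OPC UA StatusCode to a coarse quality.

    The two most significant bits (31..30) carry the severity:
    00 = Good, 01 = Uncertain, 10 and 11 = Bad.
    """
    severity = (status_code >> 30) & 0b11
    if severity == 0b00:
        return OpcUaQuality.GOOD
    if severity == 0b01:
        return OpcUaQuality.UNCERTAIN
    return OpcUaQuality.BAD
```

Shifting rather than comparing against named codes keeps the mapping
total: any sub-code under 0x40... lands on Uncertain and anything under
0x80... or 0xC0... lands on Bad.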
Host:
- New ProjectReferences to Core + every cross-platform driver
assembly (AbCip/AbLegacy/FOCAS/Galaxy/Modbus/S7/TwinCAT).
Galaxy is net10 (gRPC client to mxaccessgw); the COM-bound net48
Wonderware Historian driver stays out of the Host's reference
closure — its .Client gRPC wrapper is what binds for historian
needs.
- New DriverFactoryBootstrap.AddOtOpcUaDriverFactories() registers
a singleton DriverFactoryRegistry, invokes each driver's
Register(registry, loggerFactory), and binds IDriverFactory to
DriverFactoryRegistryAdapter. Replaces the F7 NullDriverFactory
default so deploys actually materialise real IDriver instances
on driver-role nodes. ShouldStub() still gates per-platform
behaviour at spawn time.
- Program.cs wires AddOtOpcUaDriverFactories() before AddAkka so
the runtime extension can resolve IDriverFactory from DI.
Tests: Runtime 46 -> 52 (+6):
- Write returns success when StatusCode = Good
- Write propagates non-Good status code in failure Reason
- Subscribe forwards OnDataChange to parent as AttributeValuePublished
- Quality translation: Uncertain (0x40...) and Bad (0x80...)
- Subscribe against non-ISubscribable returns failure
- DisconnectObserved detaches handler so late events are dropped
All 6 v2 test suites green: 152 tests passing.
Closes F7. F7-residual sub-tasks #110 (subscribe) and #111 (write)
both shipped. Host DI binding #106 shipped.
Splits the side-effecting half of Phase7Composer (deferred at Task 47)
into two pieces that mirror DriverHostActor's spawn-plan pattern:
Phase7Plan + Phase7Planner.Compute (pure):
Diff two Phase7CompositionResult snapshots by stable id (EquipmentId,
DriverInstanceId, ScriptedAlarmId). Emits Added/Removed/Changed lists
per entity class. Added/Removed are sorted by id for deterministic
apply order. Changed wraps both Previous and Current projections so
consumers can decide between in-place mutation and tear-down +
rebuild.
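
The diff-by-stable-id idea can be sketched as follows. This is a Python
illustration under assumptions, not the Phase7Planner code: EntityDiff
and diff_by_id are hypothetical names, each snapshot is modelled as a
plain dict keyed by id, and sorting the Changed list is an assumption
(the message only states that Added/Removed are sorted).

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Tuple


@dataclass(frozen=True)
class EntityDiff:
    added: List[str]
    removed: List[str]
    # Each entry keeps both projections: (id, previous, current).
    changed: List[Tuple[str, Any, Any]]


def diff_by_id(previous: Dict[str, Any], current: Dict[str, Any]) -> EntityDiff:
    """Pure diff of two snapshots keyed by a stable id."""
    added = sorted(k for k in current if k not in previous)
    removed = sorted(k for k in previous if k not in current)
    changed = [
        (k, previous[k], current[k])
        for k in sorted(previous.keys() & current.keys())
        if previous[k] != current[k]
    ]
    return EntityDiff(added, removed, changed)
```

Because the function touches no actor or sink, it can be exercised with
plain dictionaries, which is the property the planner/applier split is
after.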
Phase7Applier (side-effecting):
Drives an IOpcUaAddressSpaceSink against a plan. Removed equipment/
alarms get an inactive AlarmState write per id; Added/Removed of
Equipment or ScriptedAlarm triggers RebuildAddressSpace. Driver-only
changes correctly skip the rebuild; those flow through
DriverHostActor's spawn-plan in Runtime. Sink exceptions are caught + logged so
one bad node doesn't abort the apply.
Tests: OpcUaServer 6 -> 20 (+14):
- Phase7PlannerTests x9 (empty-in/empty-out, add/remove/change per
entity class, mixed changes, deterministic ordering)
- Phase7ApplierTests x5 (empty plan no-op, removal writes inactive
states + rebuild, added equipment triggers rebuild, driver-only
skips rebuild, sink fault is non-fatal)
The remaining piece is the EquipmentNodeWalker integration against a
real SDK NodeManager — split as F14b, gated on F10b's SDK builder.
All 6 v2 test suites green: 146 tests passing.
OpcUaPublishActor now routes through pluggable seams instead of just
incrementing a counter:
- IOpcUaAddressSpaceSink (Commons.OpcUa) — WriteValue / WriteAlarmState
/ RebuildAddressSpace. OpcUaQuality enum moved here from the actor's
nested type so producers don't have to reference the actor itself.
- IServiceLevelPublisher — Publish(byte). NullServiceLevelPublisher
retains the last level for inspection.
- The actor subscribes to the redundancy-state DPS topic in PreStart
and maps the local node's NodeRedundancyState to a coarse
ServiceLevel (Primary+leader=240, Primary=200, Secondary=100,
Detached=0). This keeps the local SDK's ServiceLevel node honest
without round-tripping back through the admin-singleton calculator.
- ServiceLevelChanged dedupes identical levels so the SDK doesn't see
redundant writes.
- Sink + publisher exceptions are caught and logged; the actor never
crashes its own dispatcher.
- PropsForTests gets optional sink/publisher/localNode params and
skips the DPS subscribe so unit tests stay on a vanilla TestKit
cluster.
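
A small sketch of the ServiceLevel mapping and the dedupe step listed
above, in Python for illustration. The role strings, function name and
DedupingPublisher class are hypothetical; only the four numeric levels
and the "skip identical levels" behaviour come from this message.

```python
from typing import Callable, Optional

# Levels as stated above; the string keys are an assumption.
_LEVELS = {"Secondary": 100, "Detached": 0}


def service_level_for(role: str, is_leader: bool) -> int:
    """Map a node's redundancy role to a coarse OPC UA ServiceLevel byte."""
    if role == "Primary":
        return 240 if is_leader else 200
    if role in _LEVELS:
        return _LEVELS[role]
    raise ValueError(f"unknown redundancy role: {role!r}")


class DedupingPublisher:
    """Forwards a level only when it differs from the last one sent."""

    def __init__(self, publish: Callable[[int], None]) -> None:
        self._publish = publish
        self._last: Optional[int] = None

    def on_level(self, level: int) -> bool:
        if level == self._last:
            return False  # identical level, nothing written
        self._last = level
        self._publish(level)
        return True
```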
Production binding to a real SDK NodeManager + Variable nodes is the
remaining residual — split as F10b. Task 60 still blocked on F10b.
Tests: Runtime 40 -> 46 (+6):
- AttributeValueUpdate routes to sink
- AlarmStateUpdate routes to sink
- RebuildAddressSpace calls sink.Rebuild
- ServiceLevelChanged dedupes
- RedundancyStateChanged for primary-leader publishes 240
- RedundancyStateChanged for secondary publishes 100
All 6 v2 test suites green: 132 tests passing.
VirtualTagActor and ScriptedAlarmActor now route through pluggable
evaluator interfaces and fan out to the cluster's live-tail topics
shipped in F15.3:
- IVirtualTagEvaluator + NullVirtualTagEvaluator in Commons.Engines.
VirtualTagActor calls evaluator on every DependencyValueChanged,
dedupes unchanged values, forwards EvaluationResult to its parent,
and publishes ScriptLogEntry Warning to the script-logs DPS topic
whenever the evaluator fails.
- IScriptedAlarmEvaluator + NullScriptedAlarmEvaluator. ScriptedAlarmActor
takes an AlarmConfig (id/name/equipment-path/severity/predicate) and
publishes both an AlarmTransitionEvent (alerts topic) and a
ScriptLogEntry (script-logs topic) at every transition. Manual
ConditionMet/Acknowledge/Cleared still flow through the same
Transition() so callers without engine bindings still drive the
state machine; the legacy single-string Props() overload routes
through a default AlarmConfig.
The Null* defaults keep the actors safe when no engine is bound —
unconfigured nodes never spuriously alarm. Production binding to
Core.VirtualTags.VirtualTagEngine and Core.ScriptedAlarms is the
remaining residual (F8b/F9b — split in tasks JSON).
Tests: Runtime 34 -> 40 (+6):
- VirtualTagActorTests x3 (evaluator drives EvaluationResult,
unchanged-value dedup, failure publishes Warning ScriptLogEntry)
- ScriptedAlarmActorTests x3 (engine threshold drives Activated +
Cleared on alerts topic, manual Acknowledge attribution).
All 6 v2 test suites green: 126 tests passing.
DriverHostActor.ApplyAndAck now reads the deployment artifact and
reconciles its set of DriverInstanceActor children — spawn the missing,
ApplyDelta to those with changed config, stop the removed/disabled.
The diff lives in pure DriverSpawnPlanner so it can be unit-tested
without an ActorSystem.
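
The reconcile rules above (spawn the missing, ApplyDelta on changed
config, stop the removed/disabled, stop+respawn on a type change) can be
sketched as a pure function. This is a hedged Python illustration, not
DriverSpawnPlanner itself: DriverSpec, SpawnPlan and plan_spawns are
hypothetical names, and config is modelled as an opaque string.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class DriverSpec:
    driver_type: str
    config: str
    enabled: bool = True


@dataclass(frozen=True)
class SpawnPlan:
    spawn: List[str]
    apply_delta: List[str]
    stop: List[str]


def plan_spawns(running: Dict[str, DriverSpec],
                desired: Dict[str, DriverSpec]) -> SpawnPlan:
    """Diff running children against the desired set without touching actors."""
    wanted = {k: v for k, v in desired.items() if v.enabled}
    spawn: List[str] = []
    apply_delta: List[str] = []
    # Anything running that is no longer wanted (removed or disabled) stops.
    stop: List[str] = [k for k in running if k not in wanted]
    for key, spec in wanted.items():
        current = running.get(key)
        if current is None:
            spawn.append(key)
        elif current.driver_type != spec.driver_type:
            # A type change cannot be applied in place: stop, then respawn.
            stop.append(key)
            spawn.append(key)
        elif current.config != spec.config:
            apply_delta.append(key)
    return SpawnPlan(sorted(spawn), sorted(apply_delta), sorted(stop))
```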
Adds IDriverFactory in Core.Abstractions (consumed by Runtime) +
DriverFactoryRegistryAdapter in Core.Hosting that wraps the existing
v1 DriverFactoryRegistry — Runtime stays decoupled from Polly/Serilog,
the Host wires the adapter once driver assemblies have registered.
ShouldStub(type, roles) is now actually called on every spawn — Galaxy
+ Wonderware-Historian boot stubbed on macOS/Linux or whenever the host
carries the dev role. Missing factory ⇒ stub fallback, never a crash.
Tests: 24 → 34 in Runtime (+10):
- DriverSpawnPlannerTests x7 (diff cases, type change ⇒ stop+respawn)
- DeploymentArtifactTests x5 (empty/malformed/missing fields tolerant)
- DriverHostActorReconcileTests x4 (spawn count, stub fallback,
ShouldStub gate, second-apply stops the removed)
All 6 v2 test suites green: 120 tests passing.
Closes F20 (ShouldStub wired). F7 marked partial — subscription
publishing + write path still stubbed in DriverInstanceActor itself.
Adds OPC UA SDK's CheckApplicationInstanceCertificate call to
OpcUaApplicationHost.StartAsync, removing the v1 friction of needing to
pre-create the PKI directory tree before booting.
- New OpcUaApplicationHostOptions.PkiStoreRoot (defaults to "pki")
- BuildConfigurationAsync now derives own/issuer/trusted/rejected from
PkiStoreRoot so the cert paths are configurable + consistent
- EnsureApplicationCertificateAsync runs before StandardServer.Start, and
fails fast with a clear message if the SDK can't produce a valid cert
- 2 new tests: fresh-tree creates a cert, second boot reuses it
Partial slice of follow-up F13. Endpoint-security, user-token validator,
and observability wiring still pending in the F13 follow-up. OpcUaServer
tests: 4 → 6.
Adds a real-infra mode for the integration test harness alongside the default
in-memory mode. Puts the previously untested code paths (EF SqlServer
behaviors, real LDAP bind) under env-var control without breaking the
zero-infra default that CI runs.
- docker-compose.yml — minimal SQL 2022 (14331) + OpenLDAP (3894) stack
(ports chosen to coexist with docker-dev/ on 14330/3893)
- HarnessMode record reads OTOPCUA_HARNESS_USE_SQL=1 / USE_LDAP=1 from env
- SQL mode: per-harness unique DB OtOpcUa_Harness_{guid}, EnsureCreated
at startup, EnsureDeleted on dispose (best-effort)
- LDAP mode: drops StubLdapAuthService and configures real LdapAuthService
against the compose'd OpenLDAP via Authentication:Ldap:* config keys
- Microsoft.EntityFrameworkCore.SqlServer added to the test project
- README documents both modes + the macOS no-Docker caveat
Default in-memory mode unchanged — all 9 existing tests still pass.
Reshapes the placeholder buffered-counter actor into a thin fire-and-forget
bridge over the existing IAlarmHistorianSink contract. Default sink is
NullAlarmHistorianSink; production deployments override the DI binding to
SqliteStoreAndForwardSink wrapping WonderwareHistorianClient (the v1
components in src/Drivers/ZB.MOM.WW.OtOpcUa.Driver.Historian.Wonderware*
are reused verbatim — actor is just a mailbox-friendly entry point).
- HistorianAdapterActor.Props(IAlarmHistorianSink?) — null defaults to NullAlarmHistorianSink
- Receive<AlarmHistorianEvent>: fire-and-forget sink.EnqueueAsync
- Receive<GetStatus>: returns sink.GetStatus() (queue depth + drain state)
- ServiceCollectionExtensions.AddOtOpcUaRuntime registers the default sink
- WithOtOpcUaRuntimeActors spawns the actor + registers HistorianAdapterActorKey
- Program.cs calls AddOtOpcUaRuntime when hasDriver
Tests: 2 new (forward-to-sink + GetStatus). Runtime suite 17 → 18.
Extends TwoNodeClusterHarness with three lifecycle primitives:
- StopNodeBAsync() — graceful CoordinatedShutdown (Cluster.Leave)
- RestartNodeBAsync() — rebuild node B on same Akka port + same in-memory DB
- WaitForClusterSizeAsync(n) — converge assertion helper
Adds three failover scenario tests:
- Stopping node B shrinks cluster to 1 Up member
- Restarted node B rejoins on the same Akka port
- Deployment started with B down seals with a single NodeDeploymentState
(validates ConfigPublishCoordinator.DiscoverDriverNodes snapshots
membership at dispatch time)
Closes follow-up F22. Integration test count: 6 → 9 (+3).
- New Commons.Messages.Fleet.GetDiagnostics request record.
- DriverHostActor handles GetDiagnostics in all three states (Steady, Applying,
Stale); replies with a NodeDiagnosticsSnapshot built from _currentRevision
+ the local NodeId. Drivers list is empty until F7 wires the per-instance
children.
- FleetDiagnosticsClient now resolves the target via ActorSelection at
akka.tcp://{system}@{nodeId}/user/driver-host and Asks with a 3s timeout.
On timeout/peer-down it returns an empty snapshot so the UI degrades
gracefully rather than throwing.
Two new integration tests in Host.IntegrationTests:
- GetDiagnostics_returns_snapshot_with_target_NodeId verifies the
cross-node Ask/Reply works.
- GetDiagnostics_after_deploy_reports_current_revision exercises the
end-to-end path: AdminOps starts a deployment, both DriverHostActors
apply, then diagnostics reports the new revision on both nodes.
All 98 v2 tests pass (was 96 + 2 new).
Replaces the Ok=true stub with a TCP connect to the peer's OPC UA port (4840
default) with a 2s timeout. A successful connect indicates the OPC UA server
process is up + accepting connections — enough for the redundancy calculator
to treat the peer as live. A full secure-channel Hello/Acknowledge handshake
is overkill for what the redundancy calc consumes and would pull in the OPC
UA Client SDK + a PKI setup. Upgrade later if a deeper liveness signal is ever
required.
Probe extracts the host from NodeId by stripping the :port suffix (commit
5cfbe8b encoded host:port into NodeId for cluster-member identity).
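
A minimal sketch of the connect-only probe, in Python for illustration.
The function names are hypothetical, the sketch does not handle
bracketed IPv6 literals, and it only mirrors what this message states:
strip the :port suffix from NodeId, attempt a TCP connect to the OPC UA
port (4840 default), give up after 2s.

```python
import socket


def host_from_node_id(node_id: str) -> str:
    """Strip a trailing ':port' from a host:port NodeId, if present."""
    host, sep, port = node_id.rpartition(":")
    return host if sep and port.isdigit() else node_id


def probe_peer(node_id: str, opc_ua_port: int = 4840,
               timeout_s: float = 2.0) -> bool:
    """Return True if a plain TCP connect to the peer's OPC UA port succeeds."""
    try:
        with socket.create_connection(
            (host_from_node_id(node_id), opc_ua_port), timeout=timeout_s
        ):
            return True
    except OSError:  # refused, unreachable, or timed out
        return False
```

No OPC UA bytes are exchanged; the socket is closed as soon as the
connect completes, which matches the "process is up + accepting
connections" signal described above.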
Tests: 2 new tests — Ok=true against a live TcpListener on a chosen port,
Ok=false against an unreachable endpoint. All 17 Runtime tests pass (was 16
covering only the message-contract surface).
ConfigAuditLog gains two nullable columns (EventId, CorrelationId) + a filtered
unique index UX_ConfigAuditLog_EventId. EF migration
20260526105027_AddConfigAuditLogEventIdColumns is additive (nullable + filtered
index = legacy rows backfill cleanly).
AuditWriterActor now writes EventId + CorrelationId into the dedicated columns
instead of synthesising a JSON wrapper into DetailsJson. Cross-restart dedup
is now real: a retry of an already-flushed batch hits the unique index and
SaveChanges throws; the existing catch drops the duplicate without losing the
rest of the batch.
WrapDetails helper deleted — F4 (its JSON hardening) becomes moot.
AuditWriterActorTests.Details_wrapper_embeds_eventId_and_correlationId renamed
+ rewritten to assert against the columns. All 29 ControlPlane tests pass,
all 95 v2 tests green.
DeployHappyPathTests exercises the full deploy pipeline on the 2-node harness:
AdminOperationsActor → ConfigPublishCoordinator → DistributedPubSub →
DriverHostActor on both nodes → ApplyAck → coordinator seals. Verifies both
NodeDeploymentState rows reach Applied and Deployment.Status reaches Sealed.
Exposed + fixed two production bugs along the way:
1. Coordinator was publishing DispatchDeployment on the "deployments" topic but
never subscribed to anything — DriverHostActor ACKs published on the same
topic could not reach it. Added dedicated "deployment-acks" topic with
coordinator subscription in PreStart, and DriverHostActor publishes ACKs
there.
2. NodeId derivation used member.Address.Host only — two cluster members on a
shared loopback host (test harness, dev VMs) collided to one identity. The
coordinator's expected-ack set shrank to a single entry and the system sealed after only
half the nodes acked. Switched to host:port everywhere (ClusterRoleInfo +
coordinator) so loopback nodes stay distinct and production identities are
harmlessly more specific.
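
The identity collision in item 2 can be reproduced in a few lines. This
is a Python illustration with hypothetical port numbers and a
hypothetical helper name; it only shows why host-only ids collapse two
loopback members into one expected ack.

```python
from typing import Iterable, Set, Tuple


def expected_ack_ids(members: Iterable[Tuple[str, int]],
                     include_port: bool) -> Set[str]:
    """Derive the set of node identities the coordinator waits on."""
    return {
        f"{host}:{port}" if include_port else host
        for host, port in members
    }


# Two cluster members sharing one loopback host (ports are made up).
members = [("127.0.0.1", 4053), ("127.0.0.1", 4054)]
host_only = expected_ack_ids(members, include_port=False)
host_port = expected_ack_ids(members, include_port=True)
```

With host-only ids the set holds one entry, so the first ack satisfies
it; with host:port both members stay distinct.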
Tests: 95 v2 tests pass (was 93 + 2 deploy tests), 0 skipped.
Failover scenarios (design §8 cases 3-7: node-kill-mid-apply, split-brain,
restart-during-deploy) deferred — they need controlled node-down primitives
on the harness. Tracked as F22 (failover scenario test cases).
Builds TwoNodeClusterHarness: two in-process Host-equivalent nodes sharing
an in-memory ConfigDb. Forms a 2-member Akka cluster. ClusterFormationTests
proves both nodes see each other as admin+driver role members.
Fixes a real production bug uncovered while wiring the harness — Program.cs
ran two separate ActorSystems (one from AddOtOpcUaCluster.AkkaHostedService
with cluster HOCON, one from Akka.Hosting.AddAkka with bare HOCON). Cluster
singletons landed on the bare ActorSystem and could not actually form a
cluster ("Configuration does not contain `akka.cluster` node").
Consolidation:
- AddOtOpcUaCluster now only binds AkkaClusterOptions + registers IClusterRoleInfo
- New WithOtOpcUaClusterBootstrap pushes embedded HOCON + Remote/Cluster options
into Akka.Hosting's AkkaConfigurationBuilder
- AkkaHostedService.cs deleted — Akka.Hosting now owns the lifecycle
- Program.cs + harness call WithOtOpcUaClusterBootstrap inside AddAkka
Why not WebApplicationFactory<Program>? Program.cs reads OTOPCUA_ROLES from
process env (shared across in-process WAFs); the harness replays Program.cs's
DI graph from a clean WebApplicationBuilder per node with per-node config
overrides. Same production extensions, isolated config + Kestrel + Akka ports.
Tests: 93 v2 tests pass (was 91 + 2 new cluster formation), 0 skipped.
Mirrors the publisher-injection pattern from FleetStatusBroadcaster and
PeerOpcUaProbeActor: Props accepts an optional Action<object> override so
tests can use a TestProbe sink instead of bootstrapping DistributedPubSub
(unreliable single-node in TestKit).
Un-skips the two RedundancyStateActor tests deferred under F6.
7 tests exercise AddOtOpcUaAuth + MapOtOpcUaAuth end-to-end against an
in-memory ConfigDb + stub ILdapAuthService. Covers /auth/login (204/401/503),
/auth/ping (401/200), /auth/token (200+JWT shape), /auth/logout (204+clear-cookie).
Scope is the auth contract — not the fused Host bootstrap (cluster + role
gating belongs in the Task 58 multi-node harness). HostBuilder + TestServer
is used directly instead of WebApplicationFactory<Program> because the
test project has no Program entry point and Host needs Akka cluster up.
Mirrors WithOtOpcUaControlPlaneSingletons for the driver role. Spawns
DriverHostActor + DbHealthProbeActor on the host's ActorSystem and
registers both under marker keys. Host's Program.cs now calls it when
the node carries the driver role, so driver-only and admin+driver
deployments both auto-bootstrap the per-node actors.
Integration test covers the registration round-trip via Microsoft.Extensions.Hosting
+ Akka.Hosting AddAkka.
Task 56: removes the legacy in-process Server + Admin Web project + their test
projects (Server.Tests, Admin.Tests, Admin.E2ETests). The fused OtOpcUa.Host
binary built across Phases 1-9 is now the sole production entry point.
What happened to the 47 legacy Admin Blazor pages: per follow-up F15, the
v1 architecture's draft/publish UX is replaced by v2's live-edit + snapshot-
deploy model, so a 1:1 migration is not meaningful. The mechanical move via
git mv preserves the history; service classes + page bodies that referenced
removed v1 types (ConfigGeneration, RedundancyRole, GenerationId) were
deleted. AdminUI now ships a minimal Home page + the v2 Deployments page.
Per-page rebuild against the v2 surface is tracked as F15. The v2 Deployments
page (Task 52) is the only first-party UI shipping in this PR.
Task 57: solution build green; 84+ tests green across active v2 + legacy
driver test projects.
Phase 1f — the consolidator migration. Closes out the v2 entity-model
rewrite by emitting a single EF migration that captures the cumulative
schema delta from 14a (RowVersion) through 14e (drop generation entities).
Generated: src/Core/ZB.MOM.WW.OtOpcUa.Configuration/Migrations/
20260526081556_V2HostingAlignment.cs (1562 lines)
20260526081556_V2HostingAlignment.Designer.cs
Migration shape (per `grep -nE migrationBuilder.\(...)`):
Drop 12 ForeignKey constraints (one per live-edit entity's GenerationId FK)
Drop 2 Tables (ConfigGeneration, ClusterNodeGenerationState)
Drop 45 Indexes (every UX_*_Generation_* and IX_*_Generation_* across the
13 live-edit tables; one of them is the unique-Primary
filtered index UX_ClusterNode_Primary_Per_Cluster)
Drop 13 Columns (12 GenerationId + 1 RedundancyRole)
Add 12 RowVersion columns (one per live-edit entity)
Create 4 Tables (Deployment, NodeDeploymentState, ConfigEdit,
DataProtectionKeys)
Create ~45 Indexes (recreated under the new naming pattern
UX_<Table>_LogicalId / UX_<Table>_<X> with the
GenerationId column stripped from composite keys)
Notable EF quirks accepted:
Unique-on-required-column indexes (UX_VirtualTag_LogicalId etc.) ship a
`filter: "[VirtualTagId] IS NOT NULL"` clause that EF auto-inserts for
SQL Server. Harmless — the column is C#-side `required` so NULL never
appears.
Verification:
dotnet build src/Core/ZB.MOM.WW.OtOpcUa.Configuration -> 0 errors
dotnet ef migrations script --idempotent (against placeholder DSN)
-> 3259-line .sql produced, OK
tests/Core/ZB.MOM.WW.OtOpcUa.Configuration.Tests -> 0 errors
Live `dotnet ef database update` against a scratch SQL Server deferred to
Task 15 (Migrate-To-V2.ps1) — SSH to the docker host needs a key/password I
don't have, and the always-on SQL at 10.100.0.35,14330 uses Integrated
Security (Windows auth, unreachable from this macOS dev). The migration
itself is structurally correct by construction (EF tooling generated it
against the live DbContext model); the live-DB confidence step is the
PowerShell wrapper's job.
SchemaComplianceTests updates:
- All_expected_tables_exist: removed ConfigGeneration +
ClusterNodeGenerationState; added Deployment, NodeDeploymentState,
ConfigEdit, DataProtectionKeys.
- Filtered_unique_indexes_match_schema_spec: removed entries for
UX_ClusterNode_Primary_Per_Cluster (Task 14d) and
UX_ConfigGeneration_Draft_Per_Cluster (Task 14e). Two filtered uniques
remain (UX_ClusterNodeCredential_Value, UX_ExternalIdReservation_KindValue_Active).
- Check_constraints_match_schema_spec: added CK_ConfigEdit_FieldsJson_IsJson.
StoredProceduresTests update:
- Removed RedundancyRole + 'Primary' from the raw INSERT into ClusterNode
so the DB-backed test runs against the new schema.
Phase 1d of the v2 entity-model rewrite. The static RedundancyRole column
is replaced by Akka cluster's role-leader-of-"driver" election at runtime
(see RedundancyStateActor + ServiceLevelCalculator in Task 35).
Changes:
- Removed `public required RedundancyRole RedundancyRole` from
ClusterNode entity.
- Removed `e.Property(x => x.RedundancyRole).HasConversion<string>()...`
mapping from OtOpcUaConfigDbContext.ConfigureClusterNode.
- Removed the `UX_ClusterNode_Primary_Per_Cluster` filtered unique index
(filter referenced [RedundancyRole]='Primary').
- Dropped `using ZB.MOM.WW.OtOpcUa.Configuration.Enums` from ClusterNode.cs
(no longer needed).
- Deleted `Enums/RedundancyRole.cs` — the enum is unused in v2-kept code.
- DraftValidator: dropped the "exactly one Primary per cluster"
validation block. Comment in place explaining v2 picks primary at
runtime via Akka.
- DraftValidatorTests: dropped ValidateClusterTopology_flags_multiple_Primary
test; reworked BuildNode helper to no longer take a `role` argument.
Untouched (Server + Admin still reference RedundancyRole; accepted broken
per Task 56 policy):
src/Server/ZB.MOM.WW.OtOpcUa.Server/Redundancy/{ClusterTopologyLoader,
RedundancyStatePublisher, RedundancyTopology, ServiceLevelCalculator}.cs
src/Server/ZB.MOM.WW.OtOpcUa.Admin/Services/RedundancyMetrics.cs
DB-runtime tests will fail against the new schema (Task 14f's migration
drops the column); both are to be updated as part of Task 14f:
- SchemaComplianceTests.cs:55 (expected filtered index list)
- StoredProceduresTests.cs:263 (raw INSERT names the column)
Verification:
src/Core/ZB.MOM.WW.OtOpcUa.Configuration -> 0 errors
tests/Core/ZB.MOM.WW.OtOpcUa.Configuration.Tests -> 0 errors
whole solution -> 71 errors
(70 from Task 14b in Server/Admin, +1 new Server/Redundancy reference)
Phase 1c of the v2 entity-model rewrite. Deletes the draft/publish lifecycle
machinery that v2 replaces with AdminOperationsActor + ConfigComposer +
DriverInstanceActor.ApplyDelta.
Deleted (6 files):
src/Core/ZB.MOM.WW.OtOpcUa.Configuration/Apply/
IGenerationApplier.cs — interface for the apply pipeline
GenerationApplier.cs — the v1 applier coordinating per-driver hook-back
GenerationDiff.cs — typed wrapper over the sp_ComputeGenerationDiff
SQL output
ApplyCallbacks.cs — per-driver hook surface invoked by the applier
ChangeKind.cs — enum {Added, Modified, Removed, Unchanged}
tests/Core/ZB.MOM.WW.OtOpcUa.Configuration.Tests/GenerationApplierTests.cs
The empty Apply/ directory is removed.
Kept (repurposed in Task 39 for stale-config fallback):
src/Core/ZB.MOM.WW.OtOpcUa.Configuration/LocalCache/GenerationSealedCache.cs
src/Core/ZB.MOM.WW.OtOpcUa.Configuration/LocalCache/ResilientConfigReader.cs
tests/Core/ZB.MOM.WW.OtOpcUa.Configuration.Tests/GenerationSealedCacheTests.cs
tests/Core/ZB.MOM.WW.OtOpcUa.Configuration.Tests/ResilientConfigReaderTests.cs
Naming rename (GenerationSealedCache -> DeploymentArtifactCache) deferred
to Task 39 (DriverHostActor stale-config fallback) where the consumer is
written. The type stays available under its v1 name until then.
IDriver.cs doc-comment: replaced the "Used by IGenerationApplier..." sentence
with "Invoked by the v2 DriverInstanceActor when ApplyDelta reports that only
this driver's config changed in the new deployment."
Server/Admin breakage from Task 14b unchanged (70 errors). Configuration +
Core.Tests + Configuration.Tests stay green.
src/Core/ZB.MOM.WW.OtOpcUa.Configuration -> 0 errors
tests/Core/ZB.MOM.WW.OtOpcUa.Configuration.Tests -> 0 errors
whole solution -> 70 errors (all in Server/Admin)
Adds Directory.Packages.props (ManagePackageVersionsCentrally) and
Directory.Build.props (net10.0/nullable/implicit usings/LangVersion latest).
Strips Version attributes from every csproj PackageReference and consolidates
versions into the central file.
Side fixes (necessary to keep the build green on .NET SDK 10.0.105 on macOS):
- Microsoft.CodeAnalysis.CSharp{,.Workspaces}: 5.3.0 -> 5.0.0. The 5.3.0
analyzer DLL references compiler 5.3.0.0 and the local SDK ships compiler
5.0.0.0, producing CS9057 on every project that loaded the Analyzers
output. Master itself was broken on this machine pre-change.
- Server + Server.Tests pin OPCFoundation.NetStandard.Opc.Ua.{Configuration,
Client} to 1.5.374.126 via VersionOverride, matching Opc.Ua.Server's
pin. Mixing 1.5.378.106 Opc.Ua.Core transitively with 1.5.374.126
Opc.Ua.Server breaks CustomNodeManager2 override signatures
(CS0115 on LoadPredefinedNodes/Browse/HistoryRead*) and CS7069 in
the tests. The pin disappears when the legacy Server project is
deleted in Task 56.
- Client.UI + Client.UI.Tests: NuGetAuditSuppress for
GHSA-xrw6-gwf8-vvr9 (Tmds.DBus.Protocol 0.20.0 reaches both projects
transitively from Avalonia.Desktop on Linux/macOS only).
Deviation from the plan: TreatWarningsAsErrors=true is NOT set in
Directory.Build.props because the pre-v2 Admin/Server test projects carry
~240 xUnit1051 analyzer warnings that would fail the build. New v2 projects
opt in via their own csproj; the global flag can return once the legacy
projects are deleted in Task 56.
UnsTabDragDropE2ETests were timing out at the 'UNS Structure' nav-link
locator because AdminWebAppFactory never registered AdminHubConnectionFactory
/ HubTokenService / DataProtection — ClusterDetail.razor's @inject threw at
circuit boot, so the page never advanced past the Loading placeholder. 2 → 3
pass after the registrations land. Also documents the Modbus standard-vs-
exception_injection coverage matrix in the fixture README + cross-references
docs/drivers/AbServer-Test-Fixture.md from each Emulate test so a developer
landing on a skipped test has a direct doc pointer.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two real bugs uncovered by re-running with the new fixture defaults
pointing at the shared docker host. Both are test-side, not driver-side.
AbCip — Driver_reads_seeded_DInt_from_ab_server (4 parametrized rows):
Hardcoded 'ab://127.0.0.1:{port}/1,0' in the deviceUri instead of
the resolved fixture.Host. The new 10.100.0.35 default (and any
AB_SERVER_ENDPOINT override) silently couldn't reach this test —
the driver tried to connect to a non-existent localhost:44818 and
returned BadCommunicationError on all 4 profile rows. The sibling
Emulate tests already use the fixture's resolved endpoint; this
smoke test was missed in the original migration.
Fix: deviceUri = $"ab://{fixture.Host}:{fixture.Port}/1,0".
Modbus — Float32_With_CDAB_Roundtrips_Through_Wire:
Test wrote a Float32 to HR 100 (2 consecutive registers: 100+101).
standard.json's writable HR range declares [100,100] only — a
single-cell auto-incrementing register, not a 2-register pair. The
write to register 101 was rejected with Illegal Data Address
(BadOutOfRange).
Fix: moved the tag from HR 100 to HR 200 (in standard.json's
'[200, 209]' scratch range — 10 consecutive writable HRs). The
Float32+CDAB semantic the test exercises is unchanged.
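
For reference, a sketch of why a Float32 needs two consecutive holding
registers and what the CDAB word order means. This is Python using the
standard struct module; the function names are hypothetical and the
sketch is independent of the driver's actual codec.

```python
import struct
from typing import List, Sequence


def float32_to_registers_cdab(value: float) -> List[int]:
    """Encode a float as two 16-bit registers, low word first (CDAB)."""
    # Big-endian bytes A B C D split into high word AB and low word CD.
    high, low = struct.unpack(">HH", struct.pack(">f", value))
    return [low, high]


def registers_cdab_to_float32(registers: Sequence[int]) -> float:
    """Decode two CDAB-ordered registers back into a float."""
    low, high = registers
    return struct.unpack(">f", struct.pack(">HH", high, low))[0]
```

A write of one Float32 at HR 200 therefore touches HR 200 and HR 201,
which is why the tag has to sit inside a range with at least two
consecutive writable registers.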
Modbus — Block_Read_Coalescing_Reduces_PDU_Count_End_To_End:
Test read HR 300, 302, 304 — outside both the writable ranges and
the uint16 seed list. pymodbus rejects reads to unseeded HRs even
though 'hr size' is 2048. BadOutOfRange on every read.
Fix: moved the tags from 300/302/304 to 200/202/204 (within the
scratch range). The non-contiguous coalescing semantic (3 tags
inside a 5-register window with MaxReadGap=5) is preserved.
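
A rough sketch of gap-based read coalescing, in Python. The exact
meaning of MaxReadGap in the driver is not stated here, so this assumes
it is the maximum number of unrequested registers that may be bridged
between two tags; the function name is hypothetical and every tag is
assumed to be one register wide.

```python
from typing import Iterable, List, Tuple


def coalesce_reads(addresses: Iterable[int],
                   max_gap: int) -> List[Tuple[int, int]]:
    """Merge register addresses into (start, count) blocks.

    Two addresses share a block when the number of skipped registers
    between the current block's end and the next address is <= max_gap.
    """
    blocks: List[Tuple[int, int]] = []
    for addr in sorted(set(addresses)):
        if blocks:
            start, count = blocks[-1]
            gap = addr - (start + count)  # registers skipped in between
            if gap <= max_gap:
                blocks[-1] = (start, addr - start + 1)
                continue
        blocks.append((addr, 1))
    return blocks
```

Under that assumption HR 200/202/204 collapse into a single 5-register
read starting at 200, so one PDU replaces three.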
After this commit:
- Modbus.IntegrationTests: 6/38 pass / 32 skip / 0 fail
(was 4 pass / 32 skip / 2 fail; 32 skips are profile-gated
ExceptionInjectionTests — they need MODBUS_SIM_PROFILE=
exception_injection and a different container, intentional gating)
- AbCip.IntegrationTests: 10/12 pass / 2 skip / 0 fail
(was 6 pass / 2 skip / 4 fail; 2 skips are Emulate tests that
need the fixture for separate scenarios)
No driver code changed.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
CLAUDE.md "Docker Workflow" claims (per the 2026-04-28 migration note)
that all fixture-class default endpoints were rewritten to target the
shared Docker host at 10.100.0.35. Audit during today's e2e run showed
the claim was incomplete — five fixture classes still defaulted to
localhost / 127.0.0.1, causing every fixture-touching integration test
to skip with "endpoint unreachable" on a fresh box that hadn't set
the override env vars.
Files corrected:
- tests/.../Modbus.IntegrationTests/ModbusSimulatorFixture.cs
DefaultEndpoint: localhost:5020 → 10.100.0.35:5020
- tests/.../S7.IntegrationTests/Snap7ServerFixture.cs
DefaultEndpoint: localhost:1102 → 10.100.0.35:1102
- tests/.../OpcUaClient.IntegrationTests/OpcPlcFixture.cs
DefaultEndpoint: opc.tcp://localhost:50000 → opc.tcp://10.100.0.35:50000
- tests/.../AbCip.IntegrationTests/AbServerFixture.cs
Host default + ResolveHost fallback: 127.0.0.1 → 10.100.0.35
- tests/.../AbLegacy.IntegrationTests/AbLegacyServerFixture.cs
Host default + ResolveEndpoint fallback: 127.0.0.1 → 10.100.0.35
XML doc comments referencing the old localhost defaults were updated in
the same pass so the class-summary documentation matches the actual
default. The override-via-env-var mechanism (MODBUS_SIM_ENDPOINT,
AB_SERVER_ENDPOINT, AB_LEGACY_ENDPOINT, S7_SIM_ENDPOINT,
OPCUA_SIM_ENDPOINT) is unchanged — pointing at a real PLC or a
locally-running container still works exactly as before.
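The override-then-default lookup can be sketched as follows. This is a
Python illustration only (the fixtures are C#); the function name, the
pairing of each env var with its fixture default, and the treatment of
a blank value as "unset" are assumptions.

```python
import os

# Defaults quoted from the "Files corrected" list above. Which env var
# belongs to which fixture is inferred from the names (assumption).
DEFAULT_ENDPOINTS = {
    "MODBUS_SIM_ENDPOINT": "10.100.0.35:5020",
    "S7_SIM_ENDPOINT": "10.100.0.35:1102",
    "OPCUA_SIM_ENDPOINT": "opc.tcp://10.100.0.35:50000",
}


def resolve_endpoint(env_var, environ=None):
    """Return the env-var override if present, else the shared-host default."""
    env = os.environ if environ is None else environ
    override = env.get(env_var, "").strip()
    # A blank override falls back to the default (assumption).
    return override or DEFAULT_ENDPOINTS[env_var]
```

With no variable set, resolve_endpoint("MODBUS_SIM_ENDPOINT", {}) yields
the shared Docker host; setting the variable to localhost:5020 points
the fixture back at a local container.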
Verification:
- Solution-wide dotnet build: 0 errors.
- S7.IntegrationTests: 3/3 pass without env-var override.
- OpcUaClient.IntegrationTests: 3/3 pass without env-var override.
- Modbus.IntegrationTests: 4/38 pass (same as the env-var-override run;
the 2 failures + 32 skips are pre-existing fixture-profile
mismatches unrelated to this fix).
- AbCip.IntegrationTests / AbLegacy.IntegrationTests: same results
as the env-var-override run.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Single commit covering the four small/medium fixes from the updated
code review.
Core.Scripting-014 (Medium, Concurrency):
CompiledScriptCache.Clear() used the key-only TryRemove(key, out var
lazy) overload — same race shape Core.Scripting-006 closed in
GetOrCompile's catch block. An entry re-added concurrently between the
snapshot and the TryRemove call was evicted + disposed while the new
caller still held it.
Replaced with the value-scoped TryRemove(KeyValuePair<,>) overload.
Regression test
Clear_uses_value_scoped_TryRemove_so_a_race_inserted_entry_survives
added.
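The race shape, replayed deterministically in a small Python sketch.
This is not the C# code: ScriptCache and its methods are hypothetical
stand-ins, and identity comparison stands in for the value match that
the KeyValuePair overload performs.

```python
import threading


class ScriptCache:
    """Toy stand-in for a concurrent cache; all names are hypothetical."""

    def __init__(self):
        self._lock = threading.Lock()
        self._entries = {}

    def put(self, key, value):
        with self._lock:
            self._entries[key] = value

    def get(self, key):
        with self._lock:
            return self._entries.get(key)

    def snapshot(self):
        with self._lock:
            return list(self._entries.items())

    def remove_key_only(self, key):
        # Old shape: evicts whatever is stored under the key right now.
        with self._lock:
            return self._entries.pop(key, None) is not None

    def remove_if_same(self, key, expected):
        # Value-scoped shape: evicts only if the slot still holds `expected`.
        with self._lock:
            if key in self._entries and self._entries[key] is expected:
                del self._entries[key]
                return True
            return False


def clear_with_race(value_scoped):
    """Replay the race; return True if the re-added entry survives."""
    cache = ScriptCache()
    cache.put("src", object())   # prior generation
    snap = cache.snapshot()      # Clear() takes its snapshot
    fresh = object()
    cache.put("src", fresh)      # another caller re-adds in between
    for key, old in snap:
        if value_scoped:
            cache.remove_if_same(key, old)
        else:
            cache.remove_key_only(key)
    return cache.get("src") is fresh
```

clear_with_race(False) loses the fresh entry; clear_with_race(True)
keeps it, which is the property the regression test pins.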
Core.Scripting-013 (Medium, Security):
Hand-rolled BuildWrapperSource pastes user source between literal
braces; brace-balanced source could inject sibling methods/classes
alongside CompiledScript.Run. Analyzer still walked the injected
members so it wasn't a direct escape, but it relaxed the documented
'method body' authoring contract. Added EnforceSingleRunMember:
after ParseText, the compilation unit must hold exactly one type
(CompiledScript) and that type must hold exactly one member (the Run
method). Any deviation throws CompilationErrorException with LMX001/
LMX002 diagnostic IDs and a Core.Scripting-013 reference in the
message. Two regression tests added covering the sibling-method and
sibling-class injection vectors.
Core.Scripting-015 (Low, Correctness, latent):
ToCSharpTypeName's generic branch truncated at the first backtick via
IndexOf, silently dropping closed args of nested-generic shapes
(Outer<T>.Inner<U>). No production caller exercises this shape today
(all TContext/TResult are top-level non-nested), so the bug was
latent. Rewrote the generic branch to walk the FullName segment-by-
segment, consuming generic args per segment so nested shapes emit
valid C# (global::Ns.Outer<T>.Inner<U> rather than the broken
Outer<T,U>).
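A minimal Python sketch of the segment-by-segment walk, assuming the
input is a CLR-style open name such as Ns.Outer`1+Inner`1 plus a flat
list of closed argument names. The real ToCSharpTypeName works on
System.Type; this function and its signature are illustrative only.

```python
def to_csharp_type_name(full_name, type_args):
    """Emit a C# name, handing each nested segment its own generic args.

    `full_name` uses '+' between nested types and a backtick arity
    suffix on generic segments (CLR reflection convention).
    """
    remaining = list(type_args)
    parts = []
    for segment in full_name.split("+"):
        name, tick, arity_text = segment.partition("`")
        if tick:
            arity = int(arity_text)
            if arity > len(remaining):
                raise ValueError("not enough generic arguments")
            # Consume exactly this segment's share of the flat arg list.
            args, remaining = remaining[:arity], remaining[arity:]
            name += "<" + ", ".join(args) + ">"
        parts.append(name)
    if remaining:
        raise ValueError("unused generic arguments")
    return "global::" + ".".join(parts)
```

For the nested shape above with arguments T and U this yields
global::Ns.Outer<T>.Inner<U>, where truncating at the first backtick
would have attached both arguments to Outer.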
Core.ScriptedAlarms-013 (Low, Documentation):
The internal test accessors TryGetScratchReadCacheForTest /
TryGetScratchContextForTest return live mutable scratch refilled in
place under _evalGate. XML docs didn't warn future test authors about
the synchronization contract. Added a <remarks> block to each
documenting the only-safe-on-quiesced-engine + identity-or-single-key
contract.
Verification (suites green):
Core.Scripting.Tests: 110/110 (was 107 — +3 new rejection/race tests)
Core.ScriptedAlarms.Tests: 67/67 (unchanged — doc-only fix)
Core.VirtualTags.Tests: 57/57 (unchanged)
After this commit, all 12 findings from the updated re-review are
closed (10 Resolved, 1 Won't Fix, 1 Deferred: Driver.Galaxy-017).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Core.Scripting-012 (High, Security) resolution.
The Core.Scripting-008 rewrite broadened the BCL references list from a
narrow allow-list to the full System.* + netstandard +
Microsoft.Win32.Registry set, delegating the security gate entirely to
ForbiddenTypeAnalyzer. Three categories of dangerous BCL types were
reachable from script source without a deny-list entry:
- System.Threading.ThreadPool — QueueUserWorkItem re-introduces the
background-fanout threat Core.Scripting-003 closed against
System.Threading.Tasks.
- System.Threading.Timer — schedules unbounded callback work that
outlives the per-evaluation timeout.
- System.Runtime.Loader.AssemblyLoadContext — loads arbitrary DLLs.
Defense-in-depth gap; invocation needs reflection (already denied)
but the load itself was reachable.
Fix:
- Added 'System.Runtime.Loader' to ForbiddenNamespacePrefixes
(preferred over type-granular per the recommendation so future BCL
additions to that namespace are denied by default).
- Added 'System.Threading.ThreadPool' and 'System.Threading.Timer'
to ForbiddenFullTypeNames — both live in System.Threading shared
with allowed primitives so they must be type-granular.
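The two granularities can be sketched in Python as below. Only the three
entries this commit adds are shown (the real lists hold more), and
matching prefixes on a namespace-segment boundary is an assumption about
how the analyzer compares names.

```python
# Entries added by this commit only; the real deny-lists are longer.
FORBIDDEN_NAMESPACE_PREFIXES = ("System.Runtime.Loader",)
FORBIDDEN_FULL_TYPE_NAMES = frozenset({
    "System.Threading.ThreadPool",
    "System.Threading.Timer",
})


def is_forbidden(full_type_name):
    """True if the type is denied by exact name or by namespace prefix."""
    if full_type_name in FORBIDDEN_FULL_TYPE_NAMES:
        return True
    # Match on a segment boundary so 'System.Runtime.LoaderX' is not
    # caught by the 'System.Runtime.Loader' prefix (assumption).
    return any(
        full_type_name == prefix or full_type_name.startswith(prefix + ".")
        for prefix in FORBIDDEN_NAMESPACE_PREFIXES
    )
```

The prefix entry denies anything later added under
System.Runtime.Loader, while the type-granular entries leave the rest
of System.Threading untouched.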
Regression tests added to ScriptSandboxTests:
Rejects_ThreadPool_QueueUserWorkItem_at_compile
Rejects_Timer_new_at_compile
Rejects_AssemblyLoadContext_at_compile
Docs:
docs/v2/implementation/phase-7-scripting-and-alarming.md decision #6
and the Sandbox-escape compliance-check row both updated to enumerate
the new entries per the Core.Scripting-009 doc-sync convention.
Two lower-impact suggestions from the finding's recommendation
(System.Console, CultureInfo.DefaultThreadCurrentCulture) were
intentionally not addressed and are recorded as accepted minor risks
in the resolution.
Verification: Core.Scripting.Tests 107/107 (was 104; +3 new rejection
tests).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Both VirtualTagEngine.Load and ScriptedAlarmEngine.LoadAsync were calling
ScriptEvaluator.Compile directly, bypassing CompiledScriptCache. The
Core.Scripting-008 collectible-ALC fix wired Dispose only through the cache's
Clear()/Dispose(), so the per-publish accretion the -008 fix was meant to
eliminate was still in effect on the actual production path — the headline
'no more restarts needed' guarantee wasn't delivered.
Resolution:
- VirtualTagEngine + ScriptedAlarmEngine each gained a private
CompiledScriptCache<TContext, TResult> instance.
- Both Load methods now call _compileCache.GetOrCompile(source).
- Publish-replace path: _compileCache.Clear() runs alongside the existing
_tags / _alarms clears so the prior generation's ALCs are disposed
before recompile.
- Engine Dispose now calls _compileCache.Dispose() so shutdown actually
releases the emitted assemblies.
Side-fix in CompiledScriptCache: Dispose() set _disposed=true then called
Clear(), but Clear() had a pre-existing 'if (_disposed) return' guard that
aborted the drain unconditionally — making the Dispose-triggered cleanup a
silent no-op. Removed the disposed-guard on Clear() (clearing an empty/
cleared cache is idempotent).
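The guard interaction, reduced to a Python sketch. Class and member
names are hypothetical; the guard_clear flag exists only to show the
pre-fix and post-fix behaviour side by side.

```python
class Entry:
    """Stand-in for a cached compiled script holding a resource."""

    def __init__(self):
        self.disposed = False

    def dispose(self):
        self.disposed = True


class Cache:
    def __init__(self, guard_clear):
        self._guard_clear = guard_clear  # True reproduces the pre-fix guard
        self._disposed = False
        self._entries = {}

    def add(self, key, entry):
        self._entries[key] = entry

    def clear(self):
        if self._guard_clear and self._disposed:
            return  # pre-fix: bails out before draining anything
        entries, self._entries = self._entries, {}
        for entry in entries.values():
            entry.dispose()

    def dispose(self):
        if self._disposed:
            return
        self._disposed = True  # set first, exactly as described above
        self.clear()
```

With guard_clear=True, dispose() leaves every entry undisposed; with the
guard removed the same call drains the cache.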
Side-fix in ScriptedAlarmEngine.Dispose: _alarms is now cleared AFTER the
Task.WhenAll drain. The drain guarantees no background callback is mid-
flight, so clearing is safe. Previously _alarms was deliberately NOT
cleared on Dispose (per Core.ScriptedAlarms-005), but that left the
AlarmState records holding TimedScriptEvaluator → ScriptEvaluator → delegate
references that rooted the emitted assemblies, defeating the cache's
Dispose work on the engine side.
Regression tests:
- VirtualTagEngineTests.Dispose_unloads_compiled_script_assembly
- ScriptedAlarmEngineTests.Dispose_unloads_compiled_predicate_assembly
Both use WeakReference + bounded GC.Collect() to prove the emitted
assembly is reclaimable after engine.Dispose(). The alarms test had to
be synchronous (not 'async Task<WeakReference>') because async state
machines capture locals as state-struct fields, keeping them alive past
the method's apparent end and defeating GC.
Verification:
- Core.Scripting.Tests: 104/104 (unchanged).
- VirtualTags.Tests: 57/57 (was 56 — +1 unload test).
- ScriptedAlarms.Tests: 67/67 (was 66 — +1 unload test).
- All other consumer suites still green.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Driver.Cli.Common-007 + Driver.Cli.Common-008 resolution.
Driver.Cli.Common-007 (High, Correctness):
0x80550000 is the canonical OPC UA spec value for BadSecurityPolicyRejected,
not BadDeviceFailure. The correct spec value for BadDeviceFailure is
0x808B0000 (verified against OPC Foundation Opc.Ua.StatusCodes;
corroborated locally by Driver.Galaxy.Runtime.StatusCodeMap and both
Wonderware historian quality mappers, all of which hand-pin the correct
value).
The bug was duplicated across six driver modules:
- FocasStatusMapper.BadDeviceFailure
- AbCipStatusMapper.BadDeviceFailure
- AbLegacyStatusMapper.BadDeviceFailure
- TwinCATStatusMapper.BadDeviceFailure
- ModbusDriver.StatusBadDeviceFailure
- S7Driver.StatusBadDeviceFailure
Plus the SnapshotFormatter shortlist that named 0x80550000 as
BadDeviceFailure, and three downstream Modbus tests that asserted
against the wrong value (so CI was blind).
This commit fixes all six native-mapper constants, the formatter
shortlist, and the three Modbus tests in one pass. Added a regression
guard to FormatStatus_does_not_apply_pre_fix_wrong_names that pins
0x80550000 never renders as BadDeviceFailure (mirroring the existing
-001 wrong-name guards).
Behavior change: OPC UA clients consuming the native drivers now see
the canonical BadDeviceFailure (0x808B0000) on device-fault paths
instead of the misnamed BadSecurityPolicyRejected (0x80550000). Wire-
level status semantics now match operator-facing CLI labels.
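The corrected pairing, as a small Python sketch. The two code values are
the ones quoted above; the format_status helper and its output format
are hypothetical and do not mirror SnapshotFormatter.

```python
# Values as quoted in this commit message.
BAD_SECURITY_POLICY_REJECTED = 0x80550000
BAD_DEVICE_FAILURE = 0x808B0000

STATUS_NAMES = {
    BAD_SECURITY_POLICY_REJECTED: "BadSecurityPolicyRejected",
    BAD_DEVICE_FAILURE: "BadDeviceFailure",
}


def format_status(code):
    """Render a status code as hex, with its name when known (hypothetical format)."""
    name = STATUS_NAMES.get(code)
    hex_text = f"0x{code:08X}"
    return f"{hex_text} ({name})" if name else hex_text
```

The regression guard amounts to asserting that "BadDeviceFailure" never
appears in the rendering of 0x80550000.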
Driver.Cli.Common-008 (Low, Testing):
Deleted the redundant FormatStatus_names_native_driver_emitted_codes
Theory — its five InlineData rows were already covered by the
well-known Theory in the same commit (5a9c459), and used a weaker
ShouldContain vs the well-known Theory's ShouldBe (exact match).
Verification:
- Driver.Cli.Common.Tests: 43/43 pass (was 48 before the -008 deletion).
- Driver.Modbus.Tests: 263/263 pass.
- Driver.AbCip.Tests: 262/262.
- Driver.AbLegacy.Tests: 157/157.
- Driver.FOCAS.Tests: 178/178.
- Driver.S7.Tests: 112/112.
- Driver.TwinCAT.Tests: 131/131.
Total: 1146 tests across the affected modules, all green.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The sibling mxaccessgw repo restructured: clients/dotnet/MxGateway.Client
no longer exists, and the proto contracts moved to a new namespace
(ZB.MOM.WW.MxGateway.Contracts.Proto, was MxGateway.Contracts.Proto). The
driver's source still expects the pre-restructure namespace, so the
broken ProjectReference produced 86 build errors in src/ + 1 in tests/
on master.
Resolution: vendor the last known-good build of MxGateway.Client.dll
(99 KB, May 22) and MxGateway.Contracts.dll (490 KB, May 23) under
src/Drivers/.../Driver.Galaxy/libs/, reference them via <Reference
HintPath=...> in both the driver and its test csproj, and declare the
NuGet packages the dropped ProjectReference was supplying transitively
(Google.Protobuf, Grpc.Core.Api, Grpc.Net.Client,
Microsoft.Extensions.Logging.Abstractions, Polly) at versions matching
the sibling repo's ZB.MOM.WW.MxGateway.Contracts.csproj so binary
compatibility is preserved.
Why this over a source migration:
Source migration would require namespace renames across ~19 driver
files PLUS reimplementing MxGatewayClient / MxGatewaySession /
GalaxyRepositoryClient (~2,200 LoC) — the sibling repo dropped the
client library entirely, keeping only the proto contracts. Vendoring
the last known-good binaries unblocks the build in minutes, freezes
the gateway contract surface at a known-good version, and preserves
the option to migrate properly once the sibling repo decides whether
to restore a client library or hand the work back to us.
libs/README.md documents the unwinding plan (either path closes the
debt: sibling restores a client library, or driver migrates to the new
contracts namespace + reimplements the client wrapper).
Verification:
- dotnet build ZB.MOM.WW.OtOpcUa.slnx: 0 errors (was 87).
- Driver.Galaxy unit tests: 245/245 pass.
- Integration tests not run here (require a live mxaccessgw gateway).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Core.ScriptedAlarms-009 resolution: replace the per-call Dictionary +
AlarmPredicateContext allocation with a per-alarm reusable AlarmScratch
held in _scratchByAlarmId, refilled in place under _evalGate on each
evaluation. The hot path no longer allocates per upstream tag change.
Why this matters:
On a busy line where many tags feeding many alarms change frequently,
the old BuildReadCache allocated a fresh dictionary + context on every
predicate evaluation — a steady stream of short-lived allocations the
GC eventually has to reclaim. With the reuse, the dictionary and
context are allocated once per alarm (on first evaluation) and refilled
in place across every subsequent re-eval.
Implementation:
- New private AlarmScratch class holds the reusable
Dictionary<string, DataValueSnapshot> read cache (pre-sized to the
alarm's Inputs.Count) and the AlarmPredicateContext that wraps it by
reference. The context observes refilled values without being
re-created.
- ConcurrentDictionary<string, AlarmScratch> _scratchByAlarmId on the
engine, cleared in LoadAsync alongside _alarms so a config-publish
drops the prior generation's scratch (Inputs / Logger may change).
- EvaluatePredicateToStateAsync looks up scratch via GetOrAdd, calls
the new RefillReadCache(Dictionary, IReadOnlySet) helper to clear +
repopulate the dictionary in place, then runs the predicate against
the reused context.
- BuildReadCache removed.
Safety:
Reuse is serialised under _evalGate, which guarantees no two threads
ever observe the same scratch in a half-refilled state. The
AlarmPredicateContext is bound to the scratch dictionary by reference,
so the predicate's ctx.GetTag(path) sees the freshly-refilled values
rather than a stale snapshot.
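The reuse contract can be sketched in Python as follows. This is an
illustration under assumptions, not the engine: the class and method
names are stand-ins, a plain lock plays the role of _evalGate, and
scratch_for is a hypothetical analogue of the internal test hooks.

```python
import threading


class AlarmPredicateContext:
    def __init__(self, read_cache):
        self._read_cache = read_cache  # bound by reference, never copied

    def get_tag(self, path):
        return self._read_cache.get(path)


class AlarmScratch:
    def __init__(self):
        self.read_cache = {}
        self.context = AlarmPredicateContext(self.read_cache)


class AlarmEngine:
    def __init__(self):
        self._eval_gate = threading.Lock()
        self._scratch_by_alarm_id = {}

    def load(self):
        # A config publish drops the prior generation's scratch.
        with self._eval_gate:
            self._scratch_by_alarm_id.clear()

    def scratch_for(self, alarm_id):
        return self._scratch_by_alarm_id.get(alarm_id)

    def evaluate(self, alarm_id, inputs, read_tag, predicate):
        with self._eval_gate:
            scratch = self._scratch_by_alarm_id.get(alarm_id)
            if scratch is None:
                # Allocated once per alarm, on first evaluation.
                scratch = AlarmScratch()
                self._scratch_by_alarm_id[alarm_id] = scratch
            # Refill in place: same dict object, fresh values.
            scratch.read_cache.clear()
            for path in inputs:
                scratch.read_cache[path] = read_tag(path)
            return predicate(scratch.context)
```

Because the context holds the dictionary by reference, a second
evaluate call sees the refilled values without a new context being
built.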
Verification:
- All 66 ScriptedAlarms tests pass (was 63 — three new regression tests
locking the reuse contract).
- All 56 VirtualTags tests still pass (unchanged).
- All 104 Core.Scripting tests still pass (unchanged).
New tests in ScriptedAlarmEngineTests:
- Reevaluation_reuses_the_same_read_cache_dictionary — asserts
ReferenceEquals(scratch_before, scratch_after) across two
evaluations of the same alarm.
- Reevaluation_reuses_the_same_predicate_context — same, for the
context.
- LoadAsync_drops_the_prior_generations_scratch — asserts a config
publish wipes the prior scratch (so a stale Logger / Inputs can't
leak into the new generation).
Internal test hooks TryGetScratchReadCacheForTest /
TryGetScratchContextForTest added via the existing
InternalsVisibleTo for the tests project. Kept internal — not part of
the public engine surface.
Docs:
- docs/v2/Galaxy.Performance.md "Scripted-alarm engine" section
rewritten as "hot-path allocation reuse" documenting the new
contract + reuse safety reasoning + the three regression tests.
- code-reviews/Core.ScriptedAlarms/findings.md -009 flipped
Won't Fix → Resolved.
- code-reviews/README.md regenerated.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>