Commit Graph

464 Commits

Author SHA1 Message Date
Joseph Doherty
686138123f feat(runtime): F11 — HistorianAdapterActor wired to IAlarmHistorianSink
Some checks failed
v2-ci / unit-tests (tests/Core/ZB.MOM.WW.OtOpcUa.Cluster.Tests) (push) Has been cancelled
v2-ci / unit-tests (tests/Server/ZB.MOM.WW.OtOpcUa.ControlPlane.Tests) (push) Has been cancelled
v2-ci / unit-tests (tests/Server/ZB.MOM.WW.OtOpcUa.OpcUaServer.Tests) (push) Has been cancelled
v2-ci / build (push) Has been cancelled
v2-ci / unit-tests (tests/Server/ZB.MOM.WW.OtOpcUa.Runtime.Tests) (push) Has been cancelled
v2-ci / unit-tests (tests/Server/ZB.MOM.WW.OtOpcUa.Security.Tests) (push) Has been cancelled
v2-ci / integration (push) Has been cancelled
Reshapes the placeholder buffered-counter actor into a thin fire-and-forget
bridge over the existing IAlarmHistorianSink contract. Default sink is
NullAlarmHistorianSink; production deployments override the DI binding to
SqliteStoreAndForwardSink wrapping WonderwareHistorianClient (the v1
components in src/Drivers/ZB.MOM.WW.OtOpcUa.Driver.Historian.Wonderware*
are reused verbatim — actor is just a mailbox-friendly entry point).

- HistorianAdapterActor.Props(IAlarmHistorianSink?) — null defaults to NullAlarmHistorianSink
- Receive<AlarmHistorianEvent>: fire-and-forget sink.EnqueueAsync
- Receive<GetStatus>: returns sink.GetStatus() (queue depth + drain state)
- ServiceCollectionExtensions.AddOtOpcUaRuntime registers the default sink
- WithOtOpcUaRuntimeActors spawns the actor + registers HistorianAdapterActorKey
- Program.cs calls AddOtOpcUaRuntime when hasDriver

Tests: 2 new (forward-to-sink + GetStatus). Runtime suite 17 → 18.
2026-05-26 07:18:08 -04:00
Joseph Doherty
cd5540cb1a test(integration): F22 — failover scenario tests + harness Stop/Restart primitives
Extends TwoNodeClusterHarness with three lifecycle primitives:
- StopNodeBAsync()      — graceful CoordinatedShutdown (Cluster.Leave)
- RestartNodeBAsync()   — rebuild node B on same Akka port + same in-memory DB
- WaitForClusterSizeAsync(n) — converge assertion helper

Adds three failover scenario tests:
- Stopping node B shrinks cluster to 1 Up member
- Restarted node B rejoins on the same Akka port
- Deployment started with B down seals with a single NodeDeploymentState
  (validates ConfigPublishCoordinator.DiscoverDriverNodes snapshots
   membership at dispatch time)

Closes follow-up F22. Integration test count: 6 → 9 (+3).
2026-05-26 07:13:14 -04:00
Joseph Doherty
8f32b89fb9 feat(adminui): FleetDiagnosticsClient real Akka ActorSelection round-trip (F17)
Some checks failed
v2-ci / unit-tests (tests/Core/ZB.MOM.WW.OtOpcUa.Cluster.Tests) (push) Has been cancelled
v2-ci / build (push) Has been cancelled
v2-ci / unit-tests (tests/Server/ZB.MOM.WW.OtOpcUa.ControlPlane.Tests) (push) Has been cancelled
v2-ci / unit-tests (tests/Server/ZB.MOM.WW.OtOpcUa.OpcUaServer.Tests) (push) Has been cancelled
v2-ci / unit-tests (tests/Server/ZB.MOM.WW.OtOpcUa.Runtime.Tests) (push) Has been cancelled
v2-ci / unit-tests (tests/Server/ZB.MOM.WW.OtOpcUa.Security.Tests) (push) Has been cancelled
v2-ci / integration (push) Has been cancelled
- New Commons.Messages.Fleet.GetDiagnostics request record.
- DriverHostActor handles GetDiagnostics in all three states (Steady, Applying,
  Stale); replies with a NodeDiagnosticsSnapshot built from _currentRevision
  + the local NodeId. Drivers list is empty until F7 wires the per-instance
  children.
- FleetDiagnosticsClient now resolves the target via ActorSelection at
  akka.tcp://{system}@{nodeId}/user/driver-host and Asks with a 3s timeout.
  On timeout/peer-down it returns an empty snapshot so the UI degrades
  gracefully rather than throwing.

Two new integration tests in Host.IntegrationTests:
- GetDiagnostics_returns_snapshot_with_target_NodeId verifies the
  cross-node Ask/Reply works.
- GetDiagnostics_after_deploy_reports_current_revision exercises the
  end-to-end path: AdminOps starts a deployment, both DriverHostActors
  apply, then diagnostics reports the new revision on both nodes.

All 98 v2 tests pass (was 96 + 2 new).
2026-05-26 06:58:11 -04:00
Joseph Doherty
b06e3ae740 feat(runtime): PeerOpcUaProbeActor real TCP-connect probe (F12)
Replaces the Ok=true stub with a TCP connect to the peer's OPC UA port (4840
default) with a 2s timeout. A successful connect indicates the OPC UA server
process is up + accepting connections — enough for the redundancy calculator
to treat the peer as live. A full secure-channel Hello/Acknowledge handshake
is overkill for what the redundancy calc consumes and would pull in the OPC
UA Client SDK + a PKI setup. Upgrade later if a deeper liveness signal is ever
required.

Probe extracts the host from NodeId by stripping the :port suffix (commit
5cfbe8b encoded host:port into NodeId for cluster-member identity).

Tests: 2 new tests — Ok=true against a live TcpListener on a chosen port,
Ok=false against an unreachable endpoint. All 17 Runtime tests pass (was 16
covering only the message-contract surface).
2026-05-26 06:54:51 -04:00
Joseph Doherty
f57f61deac feat(audit): EventId + CorrelationId columns + filtered unique index (F3 + F4)
ConfigAuditLog gains two nullable columns (EventId, CorrelationId) + a filtered
unique index UX_ConfigAuditLog_EventId. EF migration
20260526105027_AddConfigAuditLogEventIdColumns is additive (nullable + filtered
index = legacy rows backfill cleanly).

AuditWriterActor now writes EventId + CorrelationId into the dedicated columns
instead of synthesising a JSON wrapper into DetailsJson. Cross-restart dedup
is now real: a retry of an already-flushed batch hits the unique index and
SaveChanges throws; the existing catch drops the duplicate without losing the
rest of the batch.

WrapDetails helper deleted — F4 (its JSON hardening) becomes moot.

AuditWriterActorTests.Details_wrapper_embeds_eventId_and_correlationId renamed
+ rewritten to assert against the columns. All 29 ControlPlane tests pass,
all 95 v2 tests green.
2026-05-26 06:52:53 -04:00
Joseph Doherty
5cfbe8b5dd test(host): deploy happy-path + idempotency integration tests (Task 59)
DeployHappyPathTests exercises the full deploy pipeline on the 2-node harness:
AdminOperationsActor → ConfigPublishCoordinator → DistributedPubSub →
DriverHostActor on both nodes → ApplyAck → coordinator seals. Verifies both
NodeDeploymentState rows reach Applied and Deployment.Status reaches Sealed.

Exposed + fixed two production bugs along the way:

1. Coordinator was publishing DispatchDeployment on the "deployments" topic but
   never subscribed to anything — DriverHostActor ACKs published on the same
   topic could not reach it. Added dedicated "deployment-acks" topic with
   coordinator subscription in PreStart, and DriverHostActor publishes ACKs
   there.

2. NodeId derivation used member.Address.Host only — two cluster members on a
   shared loopback host (test harness, dev VMs) collided to one identity. The
   coordinator's expected-ack set became {1} and the system sealed after only
   half the nodes acked. Switched to host:port everywhere (ClusterRoleInfo +
   coordinator) so loopback nodes stay distinct and production identities are
   harmlessly more specific.

Tests: 95 v2 tests pass (was 93 + 2 deploy tests), 0 skipped.

Failover scenarios (design §8 cases 3-7: node-kill-mid-apply, split-brain,
restart-during-deploy) deferred — they need controlled node-down primitives
on the harness. Tracked as F22 (failover scenario test cases).
2026-05-26 06:34:36 -04:00
Joseph Doherty
d6fac2d81d test(host): 2-node integration test harness + consolidate to one ActorSystem (Task 58)
Builds TwoNodeClusterHarness: two in-process Host-equivalent nodes sharing
an in-memory ConfigDb. Forms a 2-member Akka cluster. ClusterFormationTests
proves both nodes see each other as admin+driver role members.

Fixes a real production bug uncovered while wiring the harness — Program.cs
ran two separate ActorSystems (one from AddOtOpcUaCluster.AkkaHostedService
with cluster HOCON, one from Akka.Hosting.AddAkka with bare HOCON). Cluster
singletons landed on the bare ActorSystem and could not actually form a
cluster ("Configuration does not contain `akka.cluster` node").

Consolidation:
- AddOtOpcUaCluster now only binds AkkaClusterOptions + registers IClusterRoleInfo
- New WithOtOpcUaClusterBootstrap pushes embedded HOCON + Remote/Cluster options
  into Akka.Hosting's AkkaConfigurationBuilder
- AkkaHostedService.cs deleted — Akka.Hosting now owns the lifecycle
- Program.cs + harness call WithOtOpcUaClusterBootstrap inside AddAkka

Why not WebApplicationFactory<Program>? Program.cs reads OTOPCUA_ROLES from
process env (shared across in-process WAFs); the harness replays Program.cs's
DI graph from a clean WebApplicationBuilder per node with per-node config
overrides. Same production extensions, isolated config + Kestrel + Akka ports.

Tests: 93 v2 tests pass (was 91 + 2 new cluster formation), 0 skipped.
2026-05-26 06:27:04 -04:00
Joseph Doherty
dfc143cdeb feat(controlplane): RedundancyStateActor broadcast override + un-skip tests (F6)
Mirrors the publisher-injection pattern from FleetStatusBroadcaster and
PeerOpcUaProbeActor: Props accepts an optional Action<object> override so
tests can use a TestProbe sink instead of bootstrapping DistributedPubSub
(unreliable single-node in TestKit).

Un-skips the two RedundancyStateActor tests deferred under F6.
2026-05-26 06:16:32 -04:00
Joseph Doherty
463512d1d8 test(security): AuthEndpoints integration tests via TestServer (F1)
7 tests exercise AddOtOpcUaAuth + MapOtOpcUaAuth end-to-end against an
in-memory ConfigDb + stub ILdapAuthService. Covers /auth/login (204/401/503),
/auth/ping (401/200), /auth/token (200+JWT shape), /auth/logout (204+clear-cookie).

Scope is the auth contract — not the fused Host bootstrap (cluster + role
gating belongs in the Task 58 multi-node harness). HostBuilder + TestServer
is used directly instead of WebApplicationFactory<Program> because the
test project has no Program entry point and Host needs Akka cluster up.
2026-05-26 06:15:07 -04:00
Joseph Doherty
09d6676e1f feat(runtime): WithOtOpcUaRuntimeActors extension for driver-role node startup (F19)
Mirrors WithOtOpcUaControlPlaneSingletons for the driver role. Spawns
DriverHostActor + DbHealthProbeActor on the host's ActorSystem and
registers both under marker keys. Host's Program.cs now calls it when
the node carries the driver role, so driver-only and admin+driver
deployments both auto-bootstrap the per-node actors.

Integration test covers the registration round-trip via Microsoft.Extensions.Hosting
+ Akka.Hosting AddAkka.
2026-05-26 06:09:37 -04:00
Joseph Doherty
76310b8829 chore(cleanup): delete OtOpcUa.Server, OtOpcUa.Admin, and obsolete v1 tests
Task 56: removes the legacy in-process Server + Admin Web project + their test
projects (Server.Tests, Admin.Tests, Admin.E2ETests). The fused OtOpcUa.Host
binary built across Phases 1-9 is now the sole production entry point.

What happened to the 47 legacy Admin Blazor pages: per follow-up F15, the
v1 architecture's draft/publish UX is replaced by v2's live-edit + snapshot-
deploy model, so a 1:1 migration is not meaningful. The mechanical move via
git mv preserves the history; service classes + page bodies that referenced
removed v1 types (ConfigGeneration, RedundancyRole, GenerationId) were
deleted. AdminUI now ships a minimal Home page + the v2 Deployments page.

Per-page rebuild against the v2 surface is tracked as F15. The v2 Deployments
page (Task 52) is the only first-party UI shipping in this PR.

Task 57: solution build green; 84+ tests green across active v2 + legacy
driver test projects.
2026-05-26 05:38:31 -04:00
Joseph Doherty
b7c117ab31 feat(opcua): pure Phase7Composer + purity tests (side-effects tracked as F14) 2026-05-26 05:14:45 -04:00
Joseph Doherty
28639cb14d feat(runtime): HistorianAdapter + PeerOpcUaProbe + DbHealthProbe actors (engine wiring tracked as F11/F12) 2026-05-26 05:09:06 -04:00
Joseph Doherty
e115f13104 feat(runtime): OpcUaPublishActor on synchronized dispatcher (SDK wiring tracked as F10) 2026-05-26 05:09:04 -04:00
Joseph Doherty
95ef533822 feat(runtime): ScriptedAlarmActor state machine (engine wiring tracked as F9) 2026-05-26 05:09:03 -04:00
Joseph Doherty
39729bfe21 feat(runtime): VirtualTagActor skeleton (engine wiring tracked as F8) 2026-05-26 05:09:01 -04:00
Joseph Doherty
64c627f8d6 feat(runtime): DriverInstanceActor state machine with Connecting/Connected/Reconnecting 2026-05-26 05:05:36 -04:00
Joseph Doherty
ed130135ca feat(runtime): DriverHostActor state machine with PreStart recovery + DispatchDeployment + stale fallback 2026-05-26 05:02:42 -04:00
Joseph Doherty
dd122c4ca9 feat(controlplane): FleetStatusBroadcaster push-driven from cluster events + heartbeats 2026-05-26 04:57:07 -04:00
Joseph Doherty
f193872891 feat(controlplane): ConfigPublishCoordinator deadline timeout + failover PreStart recovery 2026-05-26 04:57:05 -04:00
Joseph Doherty
6b37f997ad feat(controlplane): RedundancyStateActor with debounced topology publish 2026-05-26 04:53:31 -04:00
Joseph Doherty
62e12dab95 feat(controlplane): ConfigPublishCoordinator happy path with NodeDeploymentState seeding 2026-05-26 04:53:29 -04:00
Joseph Doherty
ef683f5073 feat(controlplane): AdminOperationsActor + ConfigComposer + StartDeployment flow 2026-05-26 04:53:28 -04:00
Joseph Doherty
9f61cd5989 test(controlplane): self-join cluster + DistributedPubSub extension in test harness 2026-05-26 04:53:25 -04:00
Joseph Doherty
23f669c376 feat(controlplane): AuditWriterActor with batched in-buffer-dedup insert 2026-05-26 04:44:01 -04:00
Joseph Doherty
14acab5a58 feat(controlplane): ServiceLevelCalculator + ControlPlane.Tests harness 2026-05-26 04:43:59 -04:00
Joseph Doherty
38ea0c5086 test(security): cookie+JWT roundtrip, role mapper, LDAP escape/RDN helpers 2026-05-26 04:35:51 -04:00
Joseph Doherty
e0b6d5680b test(cluster): HOCON parses, role parser truth table 2026-05-26 04:31:08 -04:00
Joseph Doherty
605dbf3dcc feat(configdb): V2HostingAlignment migration consolidating Phase 1a-1e
Phase 1f — the consolidator migration. Closes out the v2 entity-model
rewrite by emitting a single EF migration that captures the cumulative
schema delta from 14a (RowVersion) through 14e (drop generation entities).

Generated: src/Core/ZB.MOM.WW.OtOpcUa.Configuration/Migrations/
              20260526081556_V2HostingAlignment.cs           (1562 lines)
              20260526081556_V2HostingAlignment.Designer.cs

Migration shape (per `grep -nE migrationBuilder.\(...)`):

  Drop  12 ForeignKey constraints (one per live-edit entity's GenerationId FK)
  Drop  2  Tables  (ConfigGeneration, ClusterNodeGenerationState)
  Drop  45 Indexes (every UX_*_Generation_* and IX_*_Generation_* across the
                    13 live-edit tables — 1 also dropped the unique-Primary
                    filtered index UX_ClusterNode_Primary_Per_Cluster)
  Drop  13 Columns (12 GenerationId + 1 RedundancyRole)
  Add   12 RowVersion columns (one per live-edit entity)
  Create 4  Tables (Deployment, NodeDeploymentState, ConfigEdit,
                    DataProtectionKeys)
  Create ~45 Indexes (recreated under the new naming pattern
                      UX_<Table>_LogicalId / UX_<Table>_<X> with the
                      GenerationId column stripped from composite keys)

Notable EF quirks accepted:
  Unique-on-required-column indexes (UX_VirtualTag_LogicalId etc.) ship a
  `filter: "[VirtualTagId] IS NOT NULL"` clause that EF auto-inserts for
  SQL Server. Harmless — the column is C#-side `required` so NULL never
  appears.

Verification:
  dotnet build src/Core/ZB.MOM.WW.OtOpcUa.Configuration          -> 0 errors
  dotnet ef migrations script --idempotent (against placeholder DSN)
                                                                 -> 3259-line
                                                                    .sql produced
                                                                    OK
  tests/Core/ZB.MOM.WW.OtOpcUa.Configuration.Tests                -> 0 errors

Live `dotnet ef database update` against a scratch SQL Server deferred to
Task 15 (Migrate-To-V2.ps1) — SSH to the docker host needs a key/password I
don't have, and the always-on SQL at 10.100.0.35,14330 uses Integrated
Security (Windows auth, unreachable from this macOS dev). The migration
itself is structurally correct by construction (EF tooling generated it
against the live DbContext model); the live-DB confidence step is the
PowerShell wrapper's job.

SchemaComplianceTests updates:
  - All_expected_tables_exist: removed ConfigGeneration +
    ClusterNodeGenerationState; added Deployment, NodeDeploymentState,
    ConfigEdit, DataProtectionKeys.
  - Filtered_unique_indexes_match_schema_spec: removed entries for
    UX_ClusterNode_Primary_Per_Cluster (Task 14d) and
    UX_ConfigGeneration_Draft_Per_Cluster (Task 14e). Two filtered uniques
    remain (UX_ClusterNodeCredential_Value, UX_ExternalIdReservation_KindValue_Active).
  - Check_constraints_match_schema_spec: added CK_ConfigEdit_FieldsJson_IsJson.

StoredProceduresTests update:
  - Removed RedundancyRole + 'Primary' from the raw INSERT into ClusterNode
    so the DB-backed test runs against the new schema.
2026-05-26 04:18:50 -04:00
Joseph Doherty
3c915e652e refactor(configdb): drop ClusterNode.RedundancyRole (replaced by Akka leader)
Phase 1d of the v2 entity-model rewrite. The static RedundancyRole column
is replaced by Akka cluster's role-leader-of-"driver" election at runtime
(see RedundancyStateActor + ServiceLevelCalculator in Task 35).

Changes:

  - Removed `public required RedundancyRole RedundancyRole` from
    ClusterNode entity.
  - Removed `e.Property(x => x.RedundancyRole).HasConversion<string>()...`
    mapping from OtOpcUaConfigDbContext.ConfigureClusterNode.
  - Removed the `UX_ClusterNode_Primary_Per_Cluster` filtered unique index
    (filter referenced [RedundancyRole]='Primary').
  - Dropped `using ZB.MOM.WW.OtOpcUa.Configuration.Enums` from ClusterNode.cs
    (no longer needed).
  - Deleted `Enums/RedundancyRole.cs` — the enum is unused in v2-kept code.
  - DraftValidator: dropped the "exactly one Primary per cluster"
    validation block. Comment in place explaining v2 picks primary at
    runtime via Akka.
  - DraftValidatorTests: dropped ValidateClusterTopology_flags_multiple_Primary
    test; reworked BuildNode helper to no longer take a `role` argument.

Untouched (Server + Admin still reference RedundancyRole; accepted broken
per Task 56 policy):

  src/Server/ZB.MOM.WW.OtOpcUa.Server/Redundancy/{ClusterTopologyLoader,
    RedundancyStatePublisher, RedundancyTopology, ServiceLevelCalculator}.cs
  src/Server/ZB.MOM.WW.OtOpcUa.Admin/Services/RedundancyMetrics.cs

DB-runtime tests will fail against the new schema (Task 14f's migration
drops the column) — to be updated in Task 14f's SchemaComplianceTests
update:

  - SchemaComplianceTests.cs:55 (expected filtered index list)
  - StoredProceduresTests.cs:263 (raw INSERT names the column)

Verification:
  src/Core/ZB.MOM.WW.OtOpcUa.Configuration            -> 0 errors
  tests/Core/ZB.MOM.WW.OtOpcUa.Configuration.Tests    -> 0 errors
  whole solution                                       -> 71 errors
    (70 from Task 14b in Server/Admin, +1 new Server/Redundancy reference)
2026-05-26 04:11:57 -04:00
Joseph Doherty
1ddf8bb50e refactor(configdb): delete v1 Apply pipeline (replaced by AdminOperationsActor)
Phase 1c of the v2 entity-model rewrite. Deletes the draft/publish lifecycle
machinery that v2 replaces with AdminOperationsActor + ConfigComposer +
DriverInstanceActor.ApplyDelta.

Deleted (6 files):

  src/Core/ZB.MOM.WW.OtOpcUa.Configuration/Apply/
    IGenerationApplier.cs   — interface for the apply pipeline
    GenerationApplier.cs    — the v1 applier coordinating per-driver hook-back
    GenerationDiff.cs       — typed wrapper over the sp_ComputeGenerationDiff
                              SQL output
    ApplyCallbacks.cs       — per-driver hook surface invoked by the applier
    ChangeKind.cs           — enum {Added, Modified, Removed, Unchanged}

  tests/Core/ZB.MOM.WW.OtOpcUa.Configuration.Tests/GenerationApplierTests.cs

The empty Apply/ directory is removed.

Kept (repurposed in Task 39 for stale-config fallback):

  src/Core/ZB.MOM.WW.OtOpcUa.Configuration/LocalCache/GenerationSealedCache.cs
  src/Core/ZB.MOM.WW.OtOpcUa.Configuration/LocalCache/ResilientConfigReader.cs
  tests/Core/ZB.MOM.WW.OtOpcUa.Configuration.Tests/GenerationSealedCacheTests.cs
  tests/Core/ZB.MOM.WW.OtOpcUa.Configuration.Tests/ResilientConfigReaderTests.cs

Naming rename (GenerationSealedCache -> DeploymentArtifactCache) deferred
to Task 39 (DriverHostActor stale-config fallback) where the consumer is
written. The type stays available under its v1 name until then.

IDriver.cs doc-comment: replaced the "Used by IGenerationApplier..." sentence
with "Invoked by the v2 DriverInstanceActor when ApplyDelta reports that only
this driver's config changed in the new deployment."

Server/Admin breakage from Task 14b unchanged (70 errors). Configuration +
Core.Tests + Configuration.Tests stay green.

  src/Core/ZB.MOM.WW.OtOpcUa.Configuration  -> 0 errors
  tests/Core/ZB.MOM.WW.OtOpcUa.Configuration.Tests  -> 0 errors
  whole solution  -> 70 errors (all in Server/Admin)
2026-05-26 04:09:17 -04:00
Joseph Doherty
13d3aeab09 refactor(configdb): drop GenerationId FK from live-edit entities
Phase 1b of the v2 entity-model rewrite. The design's live-edit model means
the 12 v2 live-edit entities no longer carry a generation scope — they're
edited directly via AdminOperationsActor, with RowVersion (added in Task 14a)
providing last-write-wins detection.

Entity changes (12 files):

  Equipment, DriverInstance, Device, Tag, PollGroup, Namespace,
  UnsArea, UnsLine, NodeAcl, Script, VirtualTag, ScriptedAlarm

  - Removed: public long GenerationId
  - Removed: public ConfigGeneration? Generation (navigation)

DbContext changes (OtOpcUaConfigDbContext.cs):

  - Removed 12 HasOne(x => x.Generation).WithMany().HasForeignKey... mappings
  - Rewrote ~36 indexes: dropped the GenerationId column from each composite
    key, renamed UX_<Table>_Generation_<X> -> UX_<Table>_<X> and
    IX_<Table>_Generation_<X> -> IX_<Table>_<X>. Logical IDs become globally
    unique (UX_<Table>_LogicalId on the LogicalId column alone).
  - Removed Namespace's redundant UX_Namespace_Generation_LogicalId_Cluster
    index (subsumed by the new UX_Namespace_LogicalId).

Core.Tests fixtures (4 files):

  Removed "GenerationId = 1," lines from:
    - PermissionTrieBuilderTests.cs (NodeAcl Row factory)
    - PermissionTrieTests.cs (NodeAcl Row factory)
    - TriePermissionEvaluatorTests.cs (NodeAcl Row factory + 2 gen{1,5}Row
      mutations that test stale-generation evaluation; the trie itself still
      carries a generation tag via PermissionTrie.GenerationId, fed in via
      PermissionTrieBuilder.Build's generationId parameter, so the tests
      still exercise the production code path)
    - EquipmentNodeWalkerTests.cs (Area/Line/Eq/Tag/VirtualTag/ScriptedAlarm
      builders)

Expected breakage (accepted per Task 56 policy):

  src/Server/ZB.MOM.WW.OtOpcUa.Server   ~25 errors  (DriverInstanceBootstrapper,
                                                     AuthorizationBootstrap,
                                                     EquipmentNamespaceContentLoader,
                                                     Phase7Composer, ...)
  src/Server/ZB.MOM.WW.OtOpcUa.Admin    ~45 errors  (VirtualTags.razor,
                                                     ScriptedAlarms.razor,
                                                     DriverInstanceService,
                                                     EquipmentService,
                                                     EquipmentImportBatchService,
                                                     UnsService,
                                                     FocasDriverDetailService,
                                                     ...)

Server.Tests, Admin.Tests, Admin.E2ETests also break transitively (they
project-reference Server/Admin). All deleted in Task 56.

Verification:
  dotnet build src/Core/ZB.MOM.WW.OtOpcUa.Configuration -> 0 errors
  dotnet build tests/Core/ZB.MOM.WW.OtOpcUa.Core.Tests  -> 0 errors
  dotnet build tests/Core/ZB.MOM.WW.OtOpcUa.Configuration.Tests -> 0 errors
  dotnet build (whole solution) -> 70 errors, all in Server/Admin
2026-05-26 04:06:25 -04:00
Joseph Doherty
2b811477d1 chore(build): introduce central package management for v2
Adds Directory.Packages.props (ManagePackageVersionsCentrally) and
Directory.Build.props (net10.0/nullable/implicit usings/LangVersion latest).
Strips Version attributes from every csproj PackageReference and consolidates
versions into the central file.

Side fixes (necessary to keep the build green on .NET SDK 10.0.105 on macOS):

- Microsoft.CodeAnalysis.CSharp{,.Workspaces}: 5.3.0 -> 5.0.0. The 5.3.0
  analyzer DLL references compiler 5.3.0.0 and the local SDK ships compiler
  5.0.0.0, producing CS9057 on every project that loaded the Analyzers
  output. Master itself was broken on this machine pre-change.
- Server + Server.Tests pin OPCFoundation.NetStandard.Opc.Ua.{Configuration,
  Client} to 1.5.374.126 via VersionOverride, matching Opc.Ua.Server's
  pin. Mixing 1.5.378.106 Opc.Ua.Core transitively with 1.5.374.126
  Opc.Ua.Server breaks CustomNodeManager2 override signatures
  (CS0115 on LoadPredefinedNodes/Browse/HistoryRead*) and CS7069 in
  the tests. The pin disappears when the legacy Server project is
  deleted in Task 56.
- Client.UI + Client.UI.Tests: NuGetAuditSuppress for
  GHSA-xrw6-gwf8-vvr9 (Tmds.DBus.Protocol 0.20.0 reaches both projects
  transitively from Avalonia.Desktop on Linux/macOS only).

Deviation from the plan: TreatWarningsAsErrors=true is NOT set in
Directory.Build.props because the pre-v2 Admin/Server test projects carry
~240 xUnit1051 analyzer warnings that would fail the build. New v2 projects
opt in via their own csproj; the global flag can return once the legacy
projects are deleted in Task 56.
2026-05-26 03:40:24 -04:00
Joseph Doherty
c6082aa0b9 fix(admin-e2e): register missing DI services so ClusterDetail interactive circuit boots
UnsTabDragDropE2ETests were timing out at the 'UNS Structure' nav-link
locator because AdminWebAppFactory never registered AdminHubConnectionFactory
/ HubTokenService / DataProtection — ClusterDetail.razor's @inject threw at
circuit boot, so the page never advanced past the Loading placeholder. 2 → 3
pass after the registrations land. Also documents the Modbus standard-vs-
exception_injection coverage matrix in the fixture README + cross-references
docs/drivers/AbServer-Test-Fixture.md from each Emulate test so a developer
landing on a skipped test has a direct doc pointer.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-24 01:07:17 -04:00
Joseph Doherty
b1f3e09661 test(modbus, abcip): align failing integration tests with fixtures
Two real bugs uncovered by re-running with the new fixture defaults
pointing at the shared docker host. Both are test-side, not driver-side.

AbCip — Driver_reads_seeded_DInt_from_ab_server (4 parametrized rows):
  Hardcoded 'ab://127.0.0.1:{port}/1,0' in the deviceUri instead of
  the resolved fixture.Host. The new 10.100.0.35 default (and any
  AB_SERVER_ENDPOINT override) silently couldn't reach this test —
  the driver tried to connect to a non-existent localhost:44818 and
  returned BadCommunicationError on all 4 profile rows. The sibling
  Emulate tests already use the fixture's resolved endpoint; this
  smoke test was missed in the original migration.

  Fix: deviceUri = $"ab://{fixture.Host}:{fixture.Port}/1,0".

Modbus — Float32_With_CDAB_Roundtrips_Through_Wire:
  Test wrote a Float32 to HR 100 (2 consecutive registers: 100+101).
  standard.json's writable HR range declares [100,100] only — a
  single-cell auto-incrementing register, not a 2-register pair. The
  write to register 101 was rejected with Illegal Data Address
  (BadOutOfRange).

  Fix: moved the tag from HR 100 to HR 200 (in standard.json's
  '[200, 209]' scratch range — 10 consecutive writable HRs). The
  Float32+CDAB semantic the test exercises is unchanged.

Modbus — Block_Read_Coalescing_Reduces_PDU_Count_End_To_End:
  Test read HR 300, 302, 304 — outside both the writable ranges and
  the uint16 seed list. pymodbus rejects reads to unseeded HRs even
  though 'hr size' is 2048. BadOutOfRange on every read.

  Fix: moved the tags from 300/302/304 to 200/202/204 (within the
  scratch range). The non-contiguous coalescing semantic (3 tags
  inside a 5-register window with MaxReadGap=5) is preserved.

After this commit:
  - Modbus.IntegrationTests: 6/38 pass / 32 skip / 0 fail
    (was 4 pass / 32 skip / 2 fail; 32 skips are profile-gated
    ExceptionInjectionTests — they need MODBUS_SIM_PROFILE=
    exception_injection and a different container, intentional gating)
  - AbCip.IntegrationTests: 10/12 pass / 2 skip / 0 fail
    (was 6 pass / 2 skip / 4 fail; 2 skips are Emulate tests that
    need the fixture for separate scenarios)

No driver code changed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-23 19:45:57 -04:00
Joseph Doherty
49644fc7fd test(fixtures): migrate integration-test fixture defaults to 10.100.0.35
CLAUDE.md "Docker Workflow" claims (per the 2026-04-28 migration note)
that all fixture-class default endpoints were rewritten to target the
shared Docker host at 10.100.0.35. Audit during today's e2e run showed
the claim was incomplete — five fixture classes still defaulted to
localhost / 127.0.0.1, causing every fixture-touching integration test
to skip with "endpoint unreachable" on a fresh box that hadn't set
the override env vars.

Files corrected:

  - tests/.../Modbus.IntegrationTests/ModbusSimulatorFixture.cs
      DefaultEndpoint: localhost:5020 → 10.100.0.35:5020
  - tests/.../S7.IntegrationTests/Snap7ServerFixture.cs
      DefaultEndpoint: localhost:1102 → 10.100.0.35:1102
  - tests/.../OpcUaClient.IntegrationTests/OpcPlcFixture.cs
      DefaultEndpoint: opc.tcp://localhost:50000 → opc.tcp://10.100.0.35:50000
  - tests/.../AbCip.IntegrationTests/AbServerFixture.cs
      Host default + ResolveHost fallback: 127.0.0.1 → 10.100.0.35
  - tests/.../AbLegacy.IntegrationTests/AbLegacyServerFixture.cs
      Host default + ResolveEndpoint fallback: 127.0.0.1 → 10.100.0.35

XML doc comments referencing the old localhost defaults were updated in
the same pass so the class-summary documentation matches the actual
default. The override-via-env-var mechanism (MODBUS_SIM_ENDPOINT,
AB_SERVER_ENDPOINT, AB_LEGACY_ENDPOINT, S7_SIM_ENDPOINT,
OPCUA_SIM_ENDPOINT) is unchanged — pointing at a real PLC or a
locally-running container still works exactly as before.

Verification:
  - Solution-wide dotnet build: 0 errors.
  - S7.IntegrationTests: 3/3 pass without env-var override.
  - OpcUaClient.IntegrationTests: 3/3 pass without env-var override.
  - Modbus.IntegrationTests: 4/38 (same as the env-var-override run —
    the 2 failures + 32 skips are pre-existing fixture-profile
    mismatches unrelated to this fix).
  - AbCip.IntegrationTests / AbLegacy.IntegrationTests: same results
    as the env-var-override run.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-23 19:32:23 -04:00
Joseph Doherty
23d59d73f2 fix(scripting+alarms): close remaining re-review findings
Single commit covering the four small/medium fixes from the updated
code review.

Core.Scripting-014 (Medium, Concurrency):
  CompiledScriptCache.Clear() used the key-only TryRemove(key, out var
  lazy) overload — same race shape Core.Scripting-006 closed in
  GetOrCompile's catch block. A concurrent re-add between snapshot and
  TryRemove was evicted + disposed while the new caller still held it.
  Replaced with the value-scoped TryRemove(KeyValuePair<,>) overload.
  Regression test
  Clear_uses_value_scoped_TryRemove_so_a_race_inserted_entry_survives
  added.

Core.Scripting-013 (Medium, Security):
  Hand-rolled BuildWrapperSource pastes user source between literal
  braces; brace-balanced source could inject sibling methods/classes
  alongside CompiledScript.Run. Analyzer still walked the injected
  members so it wasn't a direct escape, but it relaxed the documented
  'method body' authoring contract. Added EnforceSingleRunMember:
  after ParseText, the compilation unit must hold exactly one type
  (CompiledScript) and that type must hold exactly one member (the Run
  method). Any deviation throws CompilationErrorException with LMX001/
  LMX002 diagnostic IDs and a Core.Scripting-013 reference in the
  message. Two regression tests added covering the sibling-method and
  sibling-class injection vectors.

Core.Scripting-015 (Low, Correctness, latent):
  ToCSharpTypeName's generic branch truncated at the first backtick via
  IndexOf, silently dropping closed args of nested-generic shapes
  (Outer<T>.Inner<U>). No production caller exercises this shape today
  (all TContext/TResult are top-level non-nested), so the bug was
  latent. Rewrote the generic branch to walk the FullName segment-by-
  segment, consuming generic args per segment so nested shapes emit
  valid C# (global::Ns.Outer<T>.Inner<U> rather than the broken
  Outer<T,U>).

Core.ScriptedAlarms-013 (Low, Documentation):
  The internal test accessors TryGetScratchReadCacheForTest /
  TryGetScratchContextForTest return live mutable scratch refilled in
  place under _evalGate. XML docs didn't warn future test authors about
  the synchronization contract. Added a <remarks> block to each
  documenting the only-safe-on-quiesced-engine + identity-or-single-key
  contract.

Verification (suites green):
  Core.Scripting.Tests: 110/110 (was 107 — +3 new rejection/race tests)
  Core.ScriptedAlarms.Tests: 67/67 (unchanged — doc-only fix)
  Core.VirtualTags.Tests: 57/57 (unchanged)

After this commit, all 12 findings from the updated re-review are
closed (10 Resolved, 1 Won't Fix none, 1 Deferred — Driver.Galaxy-017).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-23 18:00:59 -04:00
Joseph Doherty
3a53d03d23 fix(scripting): block ThreadPool/Timer/AssemblyLoadContext in sandbox
Core.Scripting-012 (High, Security) resolution.

The Core.Scripting-008 rewrite broadened the BCL references list from a
narrow allow-list to the full System.* + netstandard +
Microsoft.Win32.Registry set, delegating the security gate entirely to
ForbiddenTypeAnalyzer. Three categories of dangerous BCL types were
reachable from script source without a deny-list entry:

  - System.Threading.ThreadPool — QueueUserWorkItem re-introduces the
    background-fanout threat Core.Scripting-003 closed against
    System.Threading.Tasks.
  - System.Threading.Timer — schedules unbounded callback work that
    outlives the per-evaluation timeout.
  - System.Runtime.Loader.AssemblyLoadContext — loads arbitrary DLLs.
    Defense-in-depth gap; invocation needs reflection (already denied)
    but the load itself was reachable.

Fix:
  - Added 'System.Runtime.Loader' to ForbiddenNamespacePrefixes
    (preferred over type-granular per the recommendation so future BCL
    additions to that namespace are denied by default).
  - Added 'System.Threading.ThreadPool' and 'System.Threading.Timer'
    to ForbiddenFullTypeNames — both live in System.Threading shared
    with allowed primitives so they must be type-granular.

Regression tests added to ScriptSandboxTests:
  Rejects_ThreadPool_QueueUserWorkItem_at_compile
  Rejects_Timer_new_at_compile
  Rejects_AssemblyLoadContext_at_compile

Docs:
  docs/v2/implementation/phase-7-scripting-and-alarming.md decision #6
  and the Sandbox-escape compliance-check row both updated to enumerate
  the new entries per the Core.Scripting-009 doc-sync convention.

Two lower-impact suggestions from the finding's recommendation
(System.Console, CultureInfo.DefaultThreadCurrentCulture) were
intentionally not addressed and are recorded as accepted minor risks
in the resolution.

Verification: Core.Scripting.Tests 107/107 (was 104 + 3 new rejection
tests).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-23 17:39:20 -04:00
Joseph Doherty
fb7c6c7046 fix(scripting): route engines through CompiledScriptCache (Core.Scripting-016)
Both VirtualTagEngine.Load and ScriptedAlarmEngine.LoadAsync were calling
ScriptEvaluator.Compile directly, bypassing CompiledScriptCache. The
Core.Scripting-008 collectible-ALC fix wired Dispose only through the cache's
Clear()/Dispose(), so the per-publish accretion the -008 fix was meant to
eliminate was still in effect on the actual production path — the headline
'no more restarts needed' guarantee wasn't delivered.

Resolution:
  - VirtualTagEngine + ScriptedAlarmEngine each gained a private
    CompiledScriptCache<TContext, TResult> instance.
  - Both Load methods now call _compileCache.GetOrCompile(source).
  - Publish-replace path: _compileCache.Clear() runs alongside the existing
    _tags / _alarms clears so the prior generation's ALCs are disposed
    before recompile.
  - Engine Dispose now calls _compileCache.Dispose() so shutdown actually
    releases the emitted assemblies.

Side-fix in CompiledScriptCache: Dispose() set _disposed=true then called
Clear(), but Clear() had a pre-existing 'if (_disposed) return' guard that
aborted the drain unconditionally — making the Dispose-triggered cleanup a
silent no-op. Removed the disposed-guard on Clear() (clearing an empty/
cleared cache is idempotent).

Side-fix in ScriptedAlarmEngine.Dispose: cleared _alarms AFTER the
Task.WhenAll drain. The drain guarantees no background callback is mid-
flight, so clearing is safe. Previously _alarms was deliberately NOT
cleared on Dispose (per Core.ScriptedAlarms-005), but that left the
AlarmState records holding TimedScriptEvaluator → ScriptEvaluator → delegate
references that rooted the emitted assemblies, defeating the cache's
Dispose work on the engine side.

Regression tests:
  - VirtualTagEngineTests.Dispose_unloads_compiled_script_assembly
  - ScriptedAlarmEngineTests.Dispose_unloads_compiled_predicate_assembly
  Both use WeakReference + bounded GC.Collect() to prove the emitted
  assembly is reclaimable after engine.Dispose(). The alarms test had to
  be synchronous (not 'async Task<WeakReference>') because async state
  machines capture locals as state-struct fields, keeping them alive past
  the method's apparent end and defeating GC.

Verification:
  - Core.Scripting.Tests: 104/104 (unchanged).
  - VirtualTags.Tests: 57/57 (was 56 — +1 unload test).
  - ScriptedAlarms.Tests: 67/67 (was 66 — +1 unload test).
  - All other consumer suites still green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-23 17:33:34 -04:00
Joseph Doherty
a6ae4e22d1 fix(status-codes): correct BadDeviceFailure from 0x80550000 to 0x808B0000
Driver.Cli.Common-007 + Driver.Cli.Common-008 resolution.

Driver.Cli.Common-007 (High, Correctness):
  0x80550000 is the canonical OPC UA spec value for BadSecurityPolicyRejected,
  not BadDeviceFailure. The correct spec value for BadDeviceFailure is
  0x808B0000 (verified against OPC Foundation Opc.Ua.StatusCodes;
  corroborated locally by Driver.Galaxy.Runtime.StatusCodeMap and both
  Wonderware historian quality mappers which all hand-pin the correct
  value).

  The bug was duplicated across six driver modules:
    - FocasStatusMapper.BadDeviceFailure
    - AbCipStatusMapper.BadDeviceFailure
    - AbLegacyStatusMapper.BadDeviceFailure
    - TwinCATStatusMapper.BadDeviceFailure
    - ModbusDriver.StatusBadDeviceFailure
    - S7Driver.StatusBadDeviceFailure
  Plus the SnapshotFormatter shortlist that named 0x80550000 as
  BadDeviceFailure, and three downstream Modbus tests that asserted
  against the wrong value (so CI was blind).

  This commit fixes all six native-mapper constants, the formatter
  shortlist, and the three Modbus tests in one pass. Added a regression
  guard to FormatStatus_does_not_apply_pre_fix_wrong_names that pins
  0x80550000 never renders as BadDeviceFailure (mirroring the existing
  -001 wrong-name guards).

  Behavior change: OPC UA clients consuming the native drivers now see
  the canonical BadDeviceFailure (0x808B0000) on device-fault paths
  instead of the misnamed BadSecurityPolicyRejected (0x80550000). Wire-
  level status semantics now match operator-facing CLI labels.

Driver.Cli.Common-008 (Low, Testing):
  Deleted the redundant FormatStatus_names_native_driver_emitted_codes
  Theory — its five InlineData rows were already covered by the
  well-known Theory in the same commit (5a9c459), and used a weaker
  ShouldContain vs the well-known Theory's ShouldBe (exact match).

Verification:
  - Driver.Cli.Common.Tests: 43/43 pass (was 48 after the -008 deletion).
  - Driver.Modbus.Tests: 263/263 pass.
  - Driver.AbCip.Tests: 262/262.
  - Driver.AbLegacy.Tests: 157/157.
  - Driver.FOCAS.Tests: 178/178.
  - Driver.S7.Tests: 112/112.
  - Driver.TwinCAT.Tests: 131/131.
  Total: 1146 tests across the affected modules, all green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-23 17:14:28 -04:00
Joseph Doherty
994997ba7b fix(driver-galaxy): vendor MxGateway.Client + MxGateway.Contracts as binary refs
The sibling mxaccessgw repo restructured: clients/dotnet/MxGateway.Client
no longer exists, and the proto contracts moved to a new namespace
(ZB.MOM.WW.MxGateway.Contracts.Proto, was MxGateway.Contracts.Proto). The
driver's source still expects the pre-restructure namespace, so the
broken ProjectReference produced 86 build errors in src/ + 1 in tests/
on master.

Resolution: vendor the last known-good build of MxGateway.Client.dll
(99 KB, May 22) and MxGateway.Contracts.dll (490 KB, May 23) under
src/Drivers/.../Driver.Galaxy/libs/, reference them via <Reference
HintPath=...> in both the driver and its test csproj, and declare the
NuGet packages the dropped ProjectReference was supplying transitively
(Google.Protobuf, Grpc.Core.Api, Grpc.Net.Client,
Microsoft.Extensions.Logging.Abstractions, Polly) at versions matching
the sibling repo's ZB.MOM.WW.MxGateway.Contracts.csproj so binary
compatibility is preserved.

Why this over a source migration:
  Source migration would require namespace renames across ~19 driver
  files PLUS reimplementing MxGatewayClient / MxGatewaySession /
  GalaxyRepositoryClient (~2,200 LoC) — the sibling repo dropped the
  client library entirely, keeping only the proto contracts. Vendoring
  the last known-good binaries unblocks the build in minutes, freezes
  the gateway contract surface at a known-good version, and preserves
  the option to migrate properly once the sibling repo decides whether
  to restore a client library or hand the work back to us.

libs/README.md documents the unwinding plan (either path closes the
debt: sibling restores a client library, or driver migrates to the new
contracts namespace + reimplements the client wrapper).

Verification:
  - dotnet build ZB.MOM.WW.OtOpcUa.slnx: 0 errors (was 87).
  - Driver.Galaxy unit tests: 245/245 pass.
  - Integration tests not run here (require a live mxaccessgw gateway).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-23 16:32:56 -04:00
Joseph Doherty
0001cdd579 fix(scripted-alarms): reuse per-alarm evaluation scratch on the hot path
Core.ScriptedAlarms-009 resolution: replace the per-call Dictionary +
AlarmPredicateContext allocation with a per-alarm reusable AlarmScratch
held in _scratchByAlarmId, refilled in place under _evalGate on each
evaluation. The hot path no longer allocates per upstream tag change.

Why this matters:
  On a busy line where many tags feeding many alarms change frequently,
  the old BuildReadCache allocated a fresh dictionary + context on every
  predicate evaluation — a steady stream of short-lived allocations the
  GC eventually has to reclaim. With the reuse, the dictionary and
  context are allocated once per alarm (on first evaluation) and refilled
  in place across every subsequent re-eval.

Implementation:
  - New private AlarmScratch class holds the reusable
    Dictionary<string, DataValueSnapshot> read cache (pre-sized to the
    alarm's Inputs.Count) and the AlarmPredicateContext that wraps it by
    reference. The context observes refilled values without being
    re-created.
  - ConcurrentDictionary<string, AlarmScratch> _scratchByAlarmId on the
    engine, cleared in LoadAsync alongside _alarms so a config-publish
    drops the prior generation's scratch (Inputs / Logger may change).
  - EvaluatePredicateToStateAsync looks up scratch via GetOrAdd, calls
    the new RefillReadCache(Dictionary, IReadOnlySet) helper to clear +
    repopulate the dictionary in place, then runs the predicate against
    the reused context.
  - BuildReadCache removed.

Safety:
  Reuse is serialised under _evalGate which guarantees no two threads
  ever observe the same scratch in a half-refilled state. The
  AlarmPredicateContext is bound to the scratch dictionary by reference,
  so the predicate's ctx.GetTag(path) sees the freshly-refilled values
  rather than a stale snapshot.

Verification:
  - All 66 ScriptedAlarms tests pass (was 63 — three new regression tests
    locking the reuse contract).
  - All 56 VirtualTags tests still pass (unchanged).
  - All 104 Core.Scripting tests still pass (unchanged).

New tests in ScriptedAlarmEngineTests:
  - Reevaluation_reuses_the_same_read_cache_dictionary — asserts
    ReferenceEquals(scratch_before, scratch_after) across two
    evaluations of the same alarm.
  - Reevaluation_reuses_the_same_predicate_context — same, for the
    context.
  - LoadAsync_drops_the_prior_generations_scratch — asserts a config
    publish wipes the prior scratch (so a stale Logger / Inputs can't
    leak into the new generation).

Internal test hooks TryGetScratchReadCacheForTest /
TryGetScratchContextForTest added via the existing
InternalsVisibleTo for the tests project. Kept internal — not part of
the public engine surface.

Docs:
  - docs/v2/Galaxy.Performance.md "Scripted-alarm engine" section
    rewritten as "hot-path allocation reuse" documenting the new
    contract + reuse safety reasoning + the three regression tests.
  - code-reviews/Core.ScriptedAlarms/findings.md -009 flipped
    Won't Fix → Resolved.
  - code-reviews/README.md regenerated.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-23 16:10:09 -04:00
Joseph Doherty
7b6ab2ec6f fix(scripting): unload compiled-script assemblies via collectible ALC
Core.Scripting-008 resolution: replace the legacy CSharpScript.CreateDelegate
path with hand-rolled CSharpCompilation + Emit + collectible AssemblyLoadContext,
so per-publish compile accretion no longer requires a server restart to reclaim.

Why this was needed:
  Roslyn's CSharpScript path emits dynamically-compiled script assemblies into
  the default AssemblyLoadContext, which is non-collectible. Across config-
  publish generations each Clear() drops dictionary entries but the emitted
  assemblies stay loaded for process lifetime, so memory grows steadily on
  long-running servers with frequent publishes. The accepted-limitation note
  in docs/VirtualTags.md recommended scheduled restarts as the workaround;
  operator feedback was that restarts are difficult, so the underlying
  limitation was the right thing to fix.

Implementation:
  - New ScriptAssemblyLoadContext(name, isCollectible: true) hosts one emitted
    script assembly per evaluator.
  - ScriptEvaluator.Compile synthesises a wrapper class around the user source
    (CompiledScript.Run(globals) — explicit return required per ordinary C#
    semantics, which every existing script already uses), builds a
    CSharpCompilation against the sandbox references, runs the
    ForbiddenTypeAnalyzer over the semantic model unchanged, emits to an
    in-memory PE stream, loads via ScriptAssemblyLoadContext.LoadFromStream,
    and binds a strongly-typed Func<ScriptGlobals<TContext>, TResult> delegate
    via reflection.
  - ScriptEvaluator now implements IDisposable — Dispose calls
    AssemblyLoadContext.Unload(), which makes the emitted assembly eligible
    for GC at the next collection cycle.
  - CompiledScriptCache.Clear() disposes every materialised evaluator before
    dropping its dictionary entry; CompiledScriptCache itself is now
    IDisposable for graceful server shutdown.
  - ScriptSandbox.Build returns a new SandboxConfig (References + Imports)
    instead of a Roslyn ScriptOptions; references now span BCL via the
    TRUSTED_PLATFORM_ASSEMBLIES set filtered to System.* + netstandard +
    Microsoft.Win32.Registry, so forbidden BCL types resolve at compile and
    ForbiddenTypeAnalyzer is the sole security gate (consistent with the
    Core.Scripting-001 / -002 model — references-list-only restriction is
    porous against type forwarding, so the analyzer must be the real gate).

Verification:
  - All 104 Core.Scripting tests pass (was 101 — three new regression tests
    locking the unload contract).
  - All 56 VirtualTags tests pass (unchanged).
  - All 63 ScriptedAlarms tests pass (unchanged).
  - New CompiledScriptCacheTests:
    - Dispose_unloads_compiled_script_assembly_load_context — proves single-
      evaluator ALC unload via WeakReference + bounded GC.Collect() loop.
    - Clear_disposes_every_materialised_evaluator — proves publish-replace
      releases every prior generation's ALC.
    - GetOrCompile_after_Dispose_throws_ObjectDisposedException — locks the
      post-dispose contract.

Docs:
  - docs/VirtualTags.md "Compile cache" section rewritten: the accepted-
    limitation note replaced with the unload contract + the new authoring
    convention (explicit return).
  - docs/ScriptedAlarms.md cross-reference updated to drop the obsolete
    restart guidance.
  - code-reviews/Core.Scripting/findings.md Core.Scripting-008 flipped
    Won't Fix → Resolved with the implementation summary.
  - code-reviews/README.md regenerated.

Pre-existing breakage note: Driver.Galaxy fails the solution-wide build on
master because its ProjectReference to the sibling mxaccessgw repo's
MxGateway.Client targets a path that the sibling repo no longer has after a
recent restructuring. This is unrelated to Core.Scripting-008 and was
verified to exist on master before this branch was cut.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-23 15:55:04 -04:00
Joseph Doherty
5a9c4591b9 fix(cli-common): name native-driver-emitted status codes in SnapshotFormatter
Driver.FOCAS.Cli-005 follow-up: extend the SnapshotFormatter.FormatStatus
shortlist with the five Bad* codes the native-protocol mappers (FOCAS,
AbCip, AbLegacy) emit but which the shortlist previously left unnamed,
so they rendered only as severity-class 'Bad' instead of the documented
'BadDeviceFailure' / 'BadNotWritable' / ... names operators are told to
read off probe/write output.

Added entries:
  0x80020000 BadInternalError
  0x803B0000 BadNotWritable
  0x803C0000 BadOutOfRange
  0x803D0000 BadNotSupported
  0x80550000 BadDeviceFailure

(BadTimeout 0x800A0000 was already added under Driver.Cli.Common-001.)

Tests: SnapshotFormatterTests gains a new [Theory]
FormatStatus_names_native_driver_emitted_codes covering the five names,
and the existing well-known [Theory] is extended with the same entries
to enforce exact '0x... (Name)' rendering. Suite now 47 green (was 42).

Flips Driver.FOCAS.Cli-005 from Deferred to Resolved; README regenerated.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-23 15:14:08 -04:00
Joseph Doherty
1b10194634 fix(client-ui): resolve Low code-review findings (Client.UI-003,004,006,009,010,011)
- Client.UI-003: wire Serilog properly per CLAUDE.md — console sink +
  rolling daily file sink in Program.Main, Log.CloseAndFlush in finally,
  per-VM Log.ForContext<> loggers.
- Client.UI-004: migrate the cert-store folder picker from the obsolete
  OpenFolderDialog to StorageProvider.OpenFolderPickerAsync (with
  TryGetFolderFromPathAsync seed + TryGetLocalPath extraction).
- Client.UI-006: surface formerly silent catch blocks via an observable
  StatusMessage on the Subscriptions / Alarms VMs that bubbles up into
  the shell's status bar; soft fallbacks log at Information level so
  hard failures stay distinguishable.
- Client.UI-009: docs/Client.UI.md now lists Standard Deviation in the
  Aggregate row of the Query Options table.
- Client.UI-010: removed the unused MinDateTimeProperty /
  MaxDateTimeProperty styled properties from DateTimeRangePicker.
- Client.UI-011: updated the cert-store TextBox watermark from the
  legacy AppData/LmxOpcUaClient/pki to the canonical
  AppData/OtOpcUaClient/pki.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-23 11:25:20 -04:00
Joseph Doherty
2a6ac07111 fix(client-shared): resolve Low code-review findings (Client.Shared-003,004,009,010,011)
- Client.Shared-003: DefaultSessionAdapter.WriteValueAsync / CallMethodAsync
  guard against null/empty Results and throw ServiceResultException with
  the response's ServiceResult code instead of indexing into a missing
  list.
- Client.Shared-004: DefaultSessionAdapter.CloseAsync / HistoryReadRawAsync
  / HistoryReadAggregateAsync use the Session.*Async overloads and honour
  the caller's CancellationToken.
- Client.Shared-009: AcknowledgeAlarmAsync returns the underlying
  ServiceResultException.StatusCode on failure instead of always Good;
  IOpcUaClientService doc updated to describe the new contract.
- Client.Shared-010: ConnectionSettings.CertificateStorePath defaults to
  empty; DefaultApplicationConfigurationFactory resolves the canonical
  PKI path lazily, so per-failover ConnectionSettings copies don't hit
  the filesystem.
- Client.Shared-011: added the alarm-fallback regression test, extracted
  EndpointSelector as a pure static, and added EndpointSelectorTests
  covering security-mode match, Basic256Sha256 preference, fallback,
  diagnostics, hostname rewrite, and null/empty guards.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-23 11:13:21 -04:00
Joseph Doherty
7fe9f16cf8 fix(client-cli): resolve Low code-review findings (Client.CLI-002,003,004,006,007,008,009,010)
- Client.CLI-002: SubscribeCommand's neverWentBad list now requires the
  node to be present in lastStatus (i.e. received at least one update)
  so the 'suspect' bucket only contains observed nodes.
- Client.CLI-003: every long-running command validates numeric option
  ranges (Interval / Depth / MaxDepth / Duration / Max) and throws
  CliFx CommandException on out-of-range values.
- Client.CLI-004: SubscribeCommand carries XML summary docs on the
  type, ctor, every [CommandOption] property, and ExecuteAsync —
  matching the sibling commands' style.
- Client.CLI-006: HistoryReadCommand parses --start / --end with
  InvariantCulture+UTC and surfaces FormatException as CommandException;
  every NodeIdParser.ParseRequired call wraps FormatException /
  ArgumentException as CommandException.
- Client.CLI-007: CommandBase.ConfigureLogging calls Log.CloseAndFlush()
  before assigning a new Log.Logger so prior sinks are disposed.
- Client.CLI-008: rewrote the subscribe and historyread sections of
  docs/Client.CLI.md (every flag documented, summary-bucket vocabulary,
  StandardDeviation aggregate, UTC --start/--end convention).
- Client.CLI-009: SubscribeCommand / AlarmsCommand use named local
  handlers and detach them via -= after UnsubscribeAsync so no
  notification reaches the console after the command's output phase
  ends.
- Client.CLI-010: added CommandRangeValidationTests,
  EventHandlerLifecycleTests, InputValidationErrorsTests,
  LoggerLifecycleTests, and SubscribeCommandSummaryTests pinning every
  Low fix; FakeOpcUaClientService gained AddDiscoveredVariable +
  RaiseDataChanged + BrowseResultsByParent helpers.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-23 11:13:08 -04:00
Joseph Doherty
879925180b fix(driver-historian-wonderware-client): resolve Low code-review findings (Driver.Historian.Wonderware.Client-003,004,006,008,010)
- Driver.Historian.Wonderware.Client-003: replaced the mixed Interlocked
  + healthLock counters with RecordOutcome that touches _totalQueries
  and exactly one of _totalSuccesses / _totalFailures under one
  acquisition.
- Driver.Historian.Wonderware.Client-004: InvokeAndClassifyAsync routes
  transport + sidecar classification through a single RecordOutcome
  call; the legacy ReclassifySuccessAsFailure two-step is gone.
- Driver.Historian.Wonderware.Client-006: removed the dead
  ReconnectInitialBackoff / ReconnectMaxBackoff options and added a
  doc <remarks> stating the channel performs a single in-place
  reconnect; retry/backoff stays with the caller.
- Driver.Historian.Wonderware.Client-008: the audit-suppression comment
  block now records advisory titles, why neither applies, and the
  revisit trigger.
- Driver.Historian.Wonderware.Client-010: reworded Dispose() to claim
  deadlock-safety and added a GetHealthSnapshot summary documenting the
  single-channel collapse + counter invariant.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-23 11:12:16 -04:00
Joseph Doherty
6923be3aa2 fix(driver-focas-cli): resolve Low code-review findings (Driver.FOCAS.Cli-001,002,003,004; -005 deferred)
- Driver.FOCAS.Cli-001: WriteCommand.ParseValue now wraps numeric
  FormatException / OverflowException as CliFx CommandException with
  the offending value.
- Driver.FOCAS.Cli-002: SubscribeCommand's OnDataChange handler and the
  banner both take a writeLock so notification-callback and main-thread
  writes can't interleave; handler exceptions are warn-and-swallow.
- Driver.FOCAS.Cli-003: FocasCommandBase.ValidateOptions rejects
  --cnc-port outside 1..65535, non-positive --timeout-ms, and
  non-positive --interval-ms; ExecuteAsync calls it first.
- Driver.FOCAS.Cli-004: 'await using var driver' is the sole driver
  disposal path; dropped the redundant explicit await ShutdownAsync.
- Driver.FOCAS.Cli-005 (Deferred): the fix lives in
  Driver.Cli.Common.SnapshotFormatter — explicitly naming the
  status-code shortlist there benefits every driver CLI. Left as a
  Driver.Cli.Common follow-up.
- Registered the new tests/Drivers/Cli/ZB.MOM.WW.OtOpcUa.Driver.FOCAS.Cli.Tests
  project in ZB.MOM.WW.OtOpcUa.slnx.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-23 11:11:55 -04:00
Joseph Doherty
80ef8806e0 fix(driver-modbus-cli): resolve Low code-review findings (Driver.Modbus.Cli-003,004,005,006,007,008)
- Driver.Modbus.Cli-003: ModbusCommandBase.ValidateEndpoint rejects
  --port outside 1..65535, non-positive --timeout-ms, and --unit-id
  outside 1..247.
- Driver.Modbus.Cli-004: wrapped SubscribeCommand's OnDataChange handler
  body in a try/catch (warn-and-swallow) and serialised the console
  write through a lock.
- Driver.Modbus.Cli-005: Probe / Read / Write now catch the
  cancellation-during-init OperationCanceledException and print
  'Cancelled.' instead of dumping a stack trace.
- Driver.Modbus.Cli-006: ProbeCommand.ComputeVerdict derives the headline
  from BOTH the driver state and the probe snapshot's OPC UA quality
  class so the headline can't disagree with the wire result.
- Driver.Modbus.Cli-007: docs/Driver.Modbus.Cli.md carries an explicit
  'CLI scope' callout — the address-string grammar is a DriverConfig
  JSON feature; the CLI takes the structured triple only.
- Driver.Modbus.Cli-008: pinned BuildOptions, ValidateEndpoint, the
  region-validation guards, ComputeVerdict, and the cancellation-during-
  initialize paths.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-23 08:35:05 -04:00