18 Commits

Author SHA1 Message Date
Joseph Doherty f23e368a74 fix(server, admin): wire sp_RegisterNodeGenerationApplied + overlay heartbeat onto ClusterNode
dbo.sp_RegisterNodeGenerationApplied was defined by the initial
StoredProcedures migration but had zero callers in src/. The server
polled sp_GetCurrentGenerationForCluster every 5s but never reported
back, so dbo.ClusterNodeGenerationState stayed empty for every node
and both the Admin UI Fleet status page ("No node state recorded")
and the cluster-detail Redundancy LastSeenAt indicator ("never
STALE") showed broken liveness forever.

Server side (GenerationRefreshHostedService):
* New testable seam: Func<long, NodeApplyStatus, string?, CT, Task>?
  registerAppliedAsync constructor parameter, defaulting to a real
  sp_RegisterNodeGenerationApplied call against the central DB.
* TickAsync now calls the proc at two points: after every successful
  apply with NodeApplyStatus.Applied, and on every no-change tick as
  a heartbeat (also Applied) so LastSeenAt stays fresh.
* Apply failures now wrap the lease + coordinator.RefreshAsync in a
  try/catch, report NodeApplyStatus.Failed with the exception message,
  and advance LastAppliedGenerationId regardless of outcome so we
  don't loop on the same broken apply every 5s.
* Register-call failures are best-effort (LogDebug heartbeat, LogWarning
  apply-report) — a transient DB outage during reporting must not
  crash the publisher or block the next apply.

Admin side (ClusterNodeService.ListByClusterAsync): the Redundancy tab
reads ClusterNode.LastSeenAt, but no current writer maintains that
column — the heartbeat goes to ClusterNodeGenerationState.LastSeenAt.
Overlay the GenerationState heartbeat onto the returned ClusterNode
rows when more recent, so IsStale + the Redundancy table column
reflect actual liveness without a schema change or new write path.

Tests: 3 new cases on GenerationRefreshHostedServiceTests verify
first-apply reports Applied, no-change ticks heartbeat with Applied,
and register-call failure does not roll back the cursor or block
subsequent ticks. All 8 GenerationRefresh tests pass.

Verified live on node-dev-a / cluster-dev: dbo.ClusterNodeGenerationState
now populated with CurrentGenerationId=1, LastAppliedStatus=Applied,
fresh LastSeenAt. Fleet status page shows the node (KPIs NODES 1 /
APPLIED 1 / STALE 0 / FAILED 0). Redundancy tab KPI STALE went 1\xe2\x86\x920 and
the row shows a real LAST SEEN timestamp. Bonus: FleetStatusHub
SignalR push now fires the cluster-page Live update banner on every
heartbeat because there are finally state changes to push.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-25 02:22:59 -04:00
Joseph Doherty c8de58d6d3 fix(admin-ui): render published gen read-only on the 6 cluster-detail content tabs
The Equipment, UNS Structure, Namespaces, Drivers, Tags, and ACLs tabs
all rendered only an "Open a draft to edit" placeholder when no draft
was open — even when the cluster had a fully populated published
generation. docs/v2/admin-ui.md \xa7Cluster Detail describes these as
"read-only views of the published generation" with an "Edit in draft"
affordance; that view was never wired. The earlier code path also
correctly rendered nothing when the cluster had no published gen yet,
which was indistinguishable from the broken state.

Collapse the six per-tab conditions into one shared branch that threads
the published gen ID into the existing tab components when no draft
exists, wrapped in <fieldset disabled> so any Add/Edit button click in
the read-only state cannot silently mutate published rows even though
the tab components themselves don't yet honor an IsReadOnly flag.
Banner above the content explains the state. Surgical: zero changes to
the ~1500 LoC across the six tab components.

Verified live on cluster-dev gen 1: Drivers tab now shows the
cluster-dev-galaxy-main GalaxyMxGateway row read-only; Namespaces tab
shows cluster-dev-galaxy-ns SystemPlatform row; both with the read-only
banner and visibly disabled affordances.

Follow-up worth doing later: refactor each tab component to accept an
IsReadOnly parameter so the disabled-affordance UX is per-tab rather
than a blanket fieldset opacity wash.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-25 02:22:32 -04:00
Joseph Doherty 8fe7c8bea6 refactor(driver-galaxy): switch to sibling-repo MxGateway client + drop vendored libs
The sibling mxaccessgw repo (clients/dotnet/) restored a proper client
library + contracts under the new ZB.MOM.WW.MxGateway namespace, so the
binary-vendoring stopgap from PR Driver.Galaxy-016 can unwind via plan #1
of libs/README.md.

- csproj: replace <Reference HintPath="libs\MxGateway.*.dll"> with a
  ProjectReference into ..\..\..\..\mxaccessgw\clients\dotnet  ZB.MOM.WW.MxGateway.Client\. The five backfill PackageReference shims
  (Google.Protobuf, Grpc.Core.Api, Grpc.Net.Client, Polly.Core,
  Microsoft.Extensions.Logging.Abstractions) are now transitive again.
- Source: 'using MxGateway.X' -> 'using ZB.MOM.WW.MxGateway.X' across
  19 driver files + 14 test files. No fully-qualified MxGateway.* usages
  in code, so no behavioural changes — purely a using-prefix flip.
- libs/: deleted MxGateway.Client.dll, MxGateway.Contracts.dll, README.md
  (orphan after the unwind).

Verified: dotnet build clean (Release), all 245 Driver.Galaxy unit tests
pass, OtOpcUa service running with the new client DLL loaded
(opc.tcp://localhost:4840/OtOpcUa, no FileNotFound/TypeLoad/
MissingMethod in startup logs).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-24 14:55:15 -04:00
Joseph Doherty c6082aa0b9 fix(admin-e2e): register missing DI services so ClusterDetail interactive circuit boots
UnsTabDragDropE2ETests were timing out at the 'UNS Structure' nav-link
locator because AdminWebAppFactory never registered AdminHubConnectionFactory
/ HubTokenService / DataProtection — ClusterDetail.razor's @inject threw at
circuit boot, so the page never advanced past the Loading placeholder. 2 → 3
pass after the registrations land. Also documents the Modbus standard-vs-
exception_injection coverage matrix in the fixture README + cross-references
docs/drivers/AbServer-Test-Fixture.md from each Emulate test so a developer
landing on a skipped test has a direct doc pointer.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-24 01:07:17 -04:00
Joseph Doherty b1f3e09661 test(modbus, abcip): align failing integration tests with fixtures
Two real bugs uncovered by re-running with the new fixture defaults
pointing at the shared docker host. Both are test-side, not driver-side.

AbCip — Driver_reads_seeded_DInt_from_ab_server (4 parametrized rows):
  Hardcoded 'ab://127.0.0.1:{port}/1,0' in the deviceUri instead of
  the resolved fixture.Host. The new 10.100.0.35 default (and any
  AB_SERVER_ENDPOINT override) silently couldn't reach this test —
  the driver tried to connect to a non-existent localhost:44818 and
  returned BadCommunicationError on all 4 profile rows. The sibling
  Emulate tests already use the fixture's resolved endpoint; this
  smoke test was missed in the original migration.

  Fix: deviceUri = $"ab://{fixture.Host}:{fixture.Port}/1,0".

Modbus — Float32_With_CDAB_Roundtrips_Through_Wire:
  Test wrote a Float32 to HR 100 (2 consecutive registers: 100+101).
  standard.json's writable HR range declares [100,100] only — a
  single-cell auto-incrementing register, not a 2-register pair. The
  write to register 101 was rejected with Illegal Data Address
  (BadOutOfRange).

  Fix: moved the tag from HR 100 to HR 200 (in standard.json's
  '[200, 209]' scratch range — 10 consecutive writable HRs). The
  Float32+CDAB semantic the test exercises is unchanged.

Modbus — Block_Read_Coalescing_Reduces_PDU_Count_End_To_End:
  Test read HR 300, 302, 304 — outside both the writable ranges and
  the uint16 seed list. pymodbus rejects reads to unseeded HRs even
  though 'hr size' is 2048. BadOutOfRange on every read.

  Fix: moved the tags from 300/302/304 to 200/202/204 (within the
  scratch range). The non-contiguous coalescing semantic (3 tags
  inside a 5-register window with MaxReadGap=5) is preserved.

After this commit:
  - Modbus.IntegrationTests: 6/38 pass / 32 skip / 0 fail
    (was 4 pass / 32 skip / 2 fail; 32 skips are profile-gated
    ExceptionInjectionTests — they need MODBUS_SIM_PROFILE=
    exception_injection and a different container, intentional gating)
  - AbCip.IntegrationTests: 10/12 pass / 2 skip / 0 fail
    (was 6 pass / 2 skip / 4 fail; 2 skips are Emulate tests that
    need the fixture for separate scenarios)

No driver code changed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-23 19:45:57 -04:00
Joseph Doherty 49644fc7fd test(fixtures): migrate integration-test fixture defaults to 10.100.0.35
CLAUDE.md "Docker Workflow" claims (per the 2026-04-28 migration note)
that all fixture-class default endpoints were rewritten to target the
shared Docker host at 10.100.0.35. Audit during today's e2e run showed
the claim was incomplete — five fixture classes still defaulted to
localhost / 127.0.0.1, causing every fixture-touching integration test
to skip with "endpoint unreachable" on a fresh box that hadn't set
the override env vars.

Files corrected:

  - tests/.../Modbus.IntegrationTests/ModbusSimulatorFixture.cs
      DefaultEndpoint: localhost:5020 → 10.100.0.35:5020
  - tests/.../S7.IntegrationTests/Snap7ServerFixture.cs
      DefaultEndpoint: localhost:1102 → 10.100.0.35:1102
  - tests/.../OpcUaClient.IntegrationTests/OpcPlcFixture.cs
      DefaultEndpoint: opc.tcp://localhost:50000 → opc.tcp://10.100.0.35:50000
  - tests/.../AbCip.IntegrationTests/AbServerFixture.cs
      Host default + ResolveHost fallback: 127.0.0.1 → 10.100.0.35
  - tests/.../AbLegacy.IntegrationTests/AbLegacyServerFixture.cs
      Host default + ResolveEndpoint fallback: 127.0.0.1 → 10.100.0.35

XML doc comments referencing the old localhost defaults were updated in
the same pass so the class-summary documentation matches the actual
default. The override-via-env-var mechanism (MODBUS_SIM_ENDPOINT,
AB_SERVER_ENDPOINT, AB_LEGACY_ENDPOINT, S7_SIM_ENDPOINT,
OPCUA_SIM_ENDPOINT) is unchanged — pointing at a real PLC or a
locally-running container still works exactly as before.

Verification:
  - Solution-wide dotnet build: 0 errors.
  - S7.IntegrationTests: 3/3 pass without env-var override.
  - OpcUaClient.IntegrationTests: 3/3 pass without env-var override.
  - Modbus.IntegrationTests: 4/38 (same as the env-var-override run —
    the 2 failures + 32 skips are pre-existing fixture-profile
    mismatches unrelated to this fix).
  - AbCip.IntegrationTests / AbLegacy.IntegrationTests: same results
    as the env-var-override run.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-23 19:32:23 -04:00
Joseph Doherty 3d982d9a65 docs: sync against recent code changes
Five doc-content updates after this session's code-review resolution
sweep. No code touched; pure documentation drift correction.

1. docs/reqs/HighLevelReqs.md (HLR-007 — Service Hosting):
   Refreshed the deployment description from "three cooperating
   processes (Server, Admin, Galaxy.Host)" to "two cooperating
   Windows services (Server, Admin)". The legacy x86 TopShelf
   Galaxy.Host process was retired in PR 7.2 (2026-04-30); Galaxy
   access now flows through the in-process Tier-A GalaxyDriver
   talking gRPC to the sibling mxaccessgw gateway. Also called out
   decision #30 (AddWindowsService replacing TopShelf) inline.

2. docs/VirtualTags.md:
   - Line 9: "compiled via Microsoft.CodeAnalysis.CSharp.Scripting"
     replaced with the current pipeline (Microsoft.CodeAnalysis.CSharp
     regular compiler — Core.Scripting-008 / -016 retired the
     CSharpScript/ScriptRunner path).
   - Line 39: orphan-thread leak description rewritten. The
     CSharp.Scripting-era "underlying ScriptRunner keeps running on
     its thread-pool thread until the Roslyn runtime returns" is no
     longer accurate — the new pipeline binds the script as a
     regular C# Func<> delegate, so the leak is now "synchronous
     CPU-bound work on a pool thread" (same operator-visible
     effect, different mechanism).

3. docs/v2/plan.md decision #29 ("Galaxy Host is a separate Windows
   service"):
   Annotated both the decision body and the decision-log table row
   with "Reversed PR 7.2, 2026-04-30" + a one-line summary of the
   replacement architecture. The original reasoning is preserved as
   audit trail per the decision-log convention.

4. docs/v2/implementation/phase-7-scripting-and-alarming.md A.1:
   Added an Implementation note describing the
   Core.Scripting-008 / -016 supersession of the original
   CSharpScript pipeline. The historical record stays; the note
   points future readers at docs/VirtualTags.md "Compile cache"
   for the current contract.

5. docs/plans/alarms-over-gateway.md "Files" section under client
   regeneration:
   Updated the .NET regeneration instructions to point at the new
   ZB.MOM.WW.MxGateway.Contracts.csproj path. The old
   clients/dotnet/MxGateway.Client.csproj no longer exists in the
   sibling repo (restructure after this plan was written) and the
   vendored-binaries situation in
   src/Drivers/ZB.MOM.WW.OtOpcUa.Driver.Galaxy/libs/ is called out
   so a reader following the plan won't chase a deleted path.

Verification: grep against docs/ for the pre-fix wordings ("three
cooperating processes", "Galaxy.Host (TopShelf)", "ScriptRunner",
the wrong BadDeviceFailure hex code 0x80550000) returns no hits.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-23 18:57:04 -04:00
Joseph Doherty 23d59d73f2 fix(scripting+alarms): close remaining re-review findings
Single commit covering the four small/medium fixes from the updated
code review.

Core.Scripting-014 (Medium, Concurrency):
  CompiledScriptCache.Clear() used the key-only TryRemove(key, out var
  lazy) overload — same race shape Core.Scripting-006 closed in
  GetOrCompile's catch block. A concurrent re-add between snapshot and
  TryRemove was evicted + disposed while the new caller still held it.
  Replaced with the value-scoped TryRemove(KeyValuePair<,>) overload.
  Regression test
  Clear_uses_value_scoped_TryRemove_so_a_race_inserted_entry_survives
  added.

Core.Scripting-013 (Medium, Security):
  Hand-rolled BuildWrapperSource pastes user source between literal
  braces; brace-balanced source could inject sibling methods/classes
  alongside CompiledScript.Run. Analyzer still walked the injected
  members so it wasn't a direct escape, but it relaxed the documented
  'method body' authoring contract. Added EnforceSingleRunMember:
  after ParseText, the compilation unit must hold exactly one type
  (CompiledScript) and that type must hold exactly one member (the Run
  method). Any deviation throws CompilationErrorException with LMX001/
  LMX002 diagnostic IDs and a Core.Scripting-013 reference in the
  message. Two regression tests added covering the sibling-method and
  sibling-class injection vectors.

Core.Scripting-015 (Low, Correctness, latent):
  ToCSharpTypeName's generic branch truncated at the first backtick via
  IndexOf, silently dropping closed args of nested-generic shapes
  (Outer<T>.Inner<U>). No production caller exercises this shape today
  (all TContext/TResult are top-level non-nested), so the bug was
  latent. Rewrote the generic branch to walk the FullName segment-by-
  segment, consuming generic args per segment so nested shapes emit
  valid C# (global::Ns.Outer<T>.Inner<U> rather than the broken
  Outer<T,U>).

Core.ScriptedAlarms-013 (Low, Documentation):
  The internal test accessors TryGetScratchReadCacheForTest /
  TryGetScratchContextForTest return live mutable scratch refilled in
  place under _evalGate. XML docs didn't warn future test authors about
  the synchronization contract. Added a <remarks> block to each
  documenting the only-safe-on-quiesced-engine + identity-or-single-key
  contract.

Verification (suites green):
  Core.Scripting.Tests: 110/110 (was 107 — +3 new rejection/race tests)
  Core.ScriptedAlarms.Tests: 67/67 (unchanged — doc-only fix)
  Core.VirtualTags.Tests: 57/57 (unchanged)

After this commit, all 12 findings from the updated re-review are
closed (10 Resolved, 1 Won't Fix none, 1 Deferred — Driver.Galaxy-017).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-23 18:00:59 -04:00
Joseph Doherty c2abbf45bd fix(driver-galaxy): align package versions + record vendored-DLL provenance
Driver.Galaxy-015, -016, -017, -018 resolution (one logical change set).

Driver.Galaxy-016 (Medium, Perf/Resource):
  Reconciled the csproj PackageReferences with what the vendored
  MxGateway.Client.dll was actually built against, verified by
  reflecting Assembly.GetReferencedAssemblies() on the DLL:
    - Polly 8.5.2  →  Polly.Core 8.6.6
      (most consequential — Polly v7 fluent API vs Polly.Core v8
       resilience-pipeline API are DIFFERENT packages; the DLL was
       built against Polly.Core so the prior Polly reference would
       have failed at runtime with MissingMethodException the first
       time the gateway client's retry pipeline ran)
    - Grpc.Net.Client 2.71.0  →  2.76.0  (matches sibling Server/Worker)
    - Microsoft.Extensions.Logging.Abstractions 10.0.0  →  10.0.7
  Google.Protobuf 3.34.1 and Grpc.Core.Api 2.76.0 already matched —
  left unchanged.

Driver.Galaxy-015 (re-triaged from Medium-Security → Low-Documentation):
  Original framing was a security concern about unknown-provenance
  binaries. User clarified the DLLs are their own code, built from
  their own mxaccessgw project, not third-party. Re-triaged to a
  documentation / audit-trail concern. Fix:
    - Added a Provenance section to libs/README.md recording the
      source-commit SHA (dd7ca1634e2d2b8a866c81f0009bf87ee9427750,
      extracted from the AssemblyInformationalVersion baked into
      both DLLs by the original build) and SHA-256 checksums.
    - Documented the re-verification recipe (sha256sum + ilspycmd
      | grep AssemblyInformationalVersion).
  Recommendations about .gitattributes and CI hash-check deferred —
  the DLLs are frozen until an unwinding path is taken, so adding
  LFS or CI infrastructure now would need removal at unwinding.

Driver.Galaxy-018 (Low, Documentation):
  Most of the recommendation folded into the libs/README.md rewrite
  (pointed at sibling Server/Worker csproj as the live version source
  rather than the deleted MxGateway.Client.csproj; recorded source
  commit + SHA-256). <SpecificVersion>false</SpecificVersion> on the
  <Reference> items intentionally not added — MSBuild's default for
  HintPath references with bare-name Include attributes is already
  SpecificVersion=false, so explicitly setting it would be cosmetic
  without changing behaviour.

Driver.Galaxy-017 (Low, Design) — Deferred:
  Recommendation part (b) (record mxaccessgw source-commit SHA in
  libs/README.md) is satisfied by Driver.Galaxy-015's resolution.
  Parts (a) and (c) — a GetVersion RPC at session-open and a parity
  test against the live gateway's proto descriptor — are substantial
  new RPC + plumbing work not in scope for this code-review sweep.
  The risk surface is bounded because either of the libs/README.md
  unwinding paths closes the vendoring + this concern naturally.
  Re-open if neither path is taken within the next quarter and the
  live gateway evolves its proto under the driver.

Verification:
  - Build clean (Driver.Galaxy.csproj 0 errors, 0 warnings).
  - Driver.Galaxy.Tests: 245/245 pass against the corrected
    package set.
  - Solution-wide build remains clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-23 17:45:24 -04:00
Joseph Doherty 3a53d03d23 fix(scripting): block ThreadPool/Timer/AssemblyLoadContext in sandbox
Core.Scripting-012 (High, Security) resolution.

The Core.Scripting-008 rewrite broadened the BCL references list from a
narrow allow-list to the full System.* + netstandard +
Microsoft.Win32.Registry set, delegating the security gate entirely to
ForbiddenTypeAnalyzer. Three categories of dangerous BCL types were
reachable from script source without a deny-list entry:

  - System.Threading.ThreadPool — QueueUserWorkItem re-introduces the
    background-fanout threat Core.Scripting-003 closed against
    System.Threading.Tasks.
  - System.Threading.Timer — schedules unbounded callback work that
    outlives the per-evaluation timeout.
  - System.Runtime.Loader.AssemblyLoadContext — loads arbitrary DLLs.
    Defense-in-depth gap; invocation needs reflection (already denied)
    but the load itself was reachable.

Fix:
  - Added 'System.Runtime.Loader' to ForbiddenNamespacePrefixes
    (preferred over type-granular per the recommendation so future BCL
    additions to that namespace are denied by default).
  - Added 'System.Threading.ThreadPool' and 'System.Threading.Timer'
    to ForbiddenFullTypeNames — both live in System.Threading shared
    with allowed primitives so they must be type-granular.

Regression tests added to ScriptSandboxTests:
  Rejects_ThreadPool_QueueUserWorkItem_at_compile
  Rejects_Timer_new_at_compile
  Rejects_AssemblyLoadContext_at_compile

Docs:
  docs/v2/implementation/phase-7-scripting-and-alarming.md decision #6
  and the Sandbox-escape compliance-check row both updated to enumerate
  the new entries per the Core.Scripting-009 doc-sync convention.

Two lower-impact suggestions from the finding's recommendation
(System.Console, CultureInfo.DefaultThreadCurrentCulture) were
intentionally not addressed and are recorded as accepted minor risks
in the resolution.

Verification: Core.Scripting.Tests 107/107 (was 104 + 3 new rejection
tests).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-23 17:39:20 -04:00
Joseph Doherty fb7c6c7046 fix(scripting): route engines through CompiledScriptCache (Core.Scripting-016)
Both VirtualTagEngine.Load and ScriptedAlarmEngine.LoadAsync were calling
ScriptEvaluator.Compile directly, bypassing CompiledScriptCache. The
Core.Scripting-008 collectible-ALC fix wired Dispose only through the cache's
Clear()/Dispose(), so the per-publish accretion the -008 fix was meant to
eliminate was still in effect on the actual production path — the headline
'no more restarts needed' guarantee wasn't delivered.

Resolution:
  - VirtualTagEngine + ScriptedAlarmEngine each gained a private
    CompiledScriptCache<TContext, TResult> instance.
  - Both Load methods now call _compileCache.GetOrCompile(source).
  - Publish-replace path: _compileCache.Clear() runs alongside the existing
    _tags / _alarms clears so the prior generation's ALCs are disposed
    before recompile.
  - Engine Dispose now calls _compileCache.Dispose() so shutdown actually
    releases the emitted assemblies.

Side-fix in CompiledScriptCache: Dispose() set _disposed=true then called
Clear(), but Clear() had a pre-existing 'if (_disposed) return' guard that
aborted the drain unconditionally — making the Dispose-triggered cleanup a
silent no-op. Removed the disposed-guard on Clear() (clearing an empty/
cleared cache is idempotent).

Side-fix in ScriptedAlarmEngine.Dispose: cleared _alarms AFTER the
Task.WhenAll drain. The drain guarantees no background callback is mid-
flight, so clearing is safe. Previously _alarms was deliberately NOT
cleared on Dispose (per Core.ScriptedAlarms-005), but that left the
AlarmState records holding TimedScriptEvaluator → ScriptEvaluator → delegate
references that rooted the emitted assemblies, defeating the cache's
Dispose work on the engine side.

Regression tests:
  - VirtualTagEngineTests.Dispose_unloads_compiled_script_assembly
  - ScriptedAlarmEngineTests.Dispose_unloads_compiled_predicate_assembly
  Both use WeakReference + bounded GC.Collect() to prove the emitted
  assembly is reclaimable after engine.Dispose(). The alarms test had to
  be synchronous (not 'async Task<WeakReference>') because async state
  machines capture locals as state-struct fields, keeping them alive past
  the method's apparent end and defeating GC.

Verification:
  - Core.Scripting.Tests: 104/104 (unchanged).
  - VirtualTags.Tests: 57/57 (was 56 — +1 unload test).
  - ScriptedAlarms.Tests: 67/67 (was 66 — +1 unload test).
  - All other consumer suites still green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-23 17:33:34 -04:00
Joseph Doherty a6ae4e22d1 fix(status-codes): correct BadDeviceFailure from 0x80550000 to 0x808B0000
Driver.Cli.Common-007 + Driver.Cli.Common-008 resolution.

Driver.Cli.Common-007 (High, Correctness):
  0x80550000 is the canonical OPC UA spec value for BadSecurityPolicyRejected,
  not BadDeviceFailure. The correct spec value for BadDeviceFailure is
  0x808B0000 (verified against OPC Foundation Opc.Ua.StatusCodes;
  corroborated locally by Driver.Galaxy.Runtime.StatusCodeMap and both
  Wonderware historian quality mappers which all hand-pin the correct
  value).

  The bug was duplicated across six driver modules:
    - FocasStatusMapper.BadDeviceFailure
    - AbCipStatusMapper.BadDeviceFailure
    - AbLegacyStatusMapper.BadDeviceFailure
    - TwinCATStatusMapper.BadDeviceFailure
    - ModbusDriver.StatusBadDeviceFailure
    - S7Driver.StatusBadDeviceFailure
  Plus the SnapshotFormatter shortlist that named 0x80550000 as
  BadDeviceFailure, and three downstream Modbus tests that asserted
  against the wrong value (so CI was blind).

  This commit fixes all six native-mapper constants, the formatter
  shortlist, and the three Modbus tests in one pass. Added a regression
  guard to FormatStatus_does_not_apply_pre_fix_wrong_names that pins
  0x80550000 never renders as BadDeviceFailure (mirroring the existing
  -001 wrong-name guards).

  Behavior change: OPC UA clients consuming the native drivers now see
  the canonical BadDeviceFailure (0x808B0000) on device-fault paths
  instead of the misnamed BadSecurityPolicyRejected (0x80550000). Wire-
  level status semantics now match operator-facing CLI labels.

Driver.Cli.Common-008 (Low, Testing):
  Deleted the redundant FormatStatus_names_native_driver_emitted_codes
  Theory — its five InlineData rows were already covered by the
  well-known Theory in the same commit (5a9c459), and used a weaker
  ShouldContain vs the well-known Theory's ShouldBe (exact match).

Verification:
  - Driver.Cli.Common.Tests: 43/43 pass (was 48 after the -008 deletion).
  - Driver.Modbus.Tests: 263/263 pass.
  - Driver.AbCip.Tests: 262/262.
  - Driver.AbLegacy.Tests: 157/157.
  - Driver.FOCAS.Tests: 178/178.
  - Driver.S7.Tests: 112/112.
  - Driver.TwinCAT.Tests: 131/131.
  Total: 1146 tests across the affected modules, all green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-23 17:14:28 -04:00
Joseph Doherty 41e62b2663 docs(code-reviews): updated re-review at commit a9be809 — 12 new findings
Re-reviewed the four modules with source changes since the previous review
commit 76d35d1, per REVIEW-PROCESS.md section 6. Updated each findings.md
header (date 2026-05-23, commit a9be809) and appended new findings under
continued numbering. Regenerated README.md.

## New findings — 12 total across 4 modules

### Core.Scripting (5 new, IDs -012 to -016)
- **-012 High Security** — broadened BCL references (System.* + netstandard)
  re-expose System.Threading.ThreadPool / Timer / AssemblyLoadContext, which
  the analyzer's deny-list doesn't cover. Re-introduces the background-work
  threat Core.Scripting-003 closed via System.Threading.Tasks deny.
- **-013 Medium Security** — hand-rolled wrapper-source generation lets
  brace-balanced user source inject sibling methods/classes alongside
  CompiledScript.Run. Analyzer still gates forbidden types, but the
  documented 'method body' authoring contract is silently relaxed.
- **-014 Medium Concurrency** — CompiledScriptCache.Clear() uses key-only
  TryRemove(key, out _) — the same race the -006 resolution fixed in
  GetOrCompile's catch is latent here on publish-replace.
- **-015 Low Correctness** — ToCSharpTypeName truncates at first backtick;
  silently drops closed type arguments of nested-generic shapes (Outer<>.Inner<>).
  Latent — no production caller uses this shape today.
- **-016 Medium Performance** — VirtualTagEngine + ScriptedAlarmEngine call
  ScriptEvaluator.Compile directly without going through CompiledScriptCache,
  so the headline -008 collectible-ALC fix doesn't run on the actual
  production path — the per-publish leak is still in effect.

### Core.ScriptedAlarms (1 new, ID -013)
- **-013 Low Documentation** — new internal test accessors return the live
  mutable scratch dictionary; XML docs don't warn future test authors about
  the synchronisation contract.

### Driver.Cli.Common (2 new, IDs -007, -008)
- **-007 High Correctness** — 0x80550000 was added as BadDeviceFailure but
  the real OPC UA spec value for BadDeviceFailure is 0x808B0000 (verified
  against Driver.Galaxy.Runtime.StatusCodeMap and HistorianQualityMapper,
  both of which use the correct 0x808B0000). 0x80550000 is actually
  BadSecurityPolicyRejected. The native mappers (FOCAS / AbCip / AbLegacy)
  all use the wrong 0x80550000; this session's SnapshotFormatter extension
  propagated the wrong name and the test asserts against the same wrong
  value so CI is blind — same shape of bug as Driver.Cli.Common-001.
- **-008 Low Testing** — new FormatStatus_names_native_driver_emitted_codes
  Theory is redundant with the existing well-known Theory (same five
  InlineData rows added to both) and uses weaker ShouldContain assertion
  than the well-known Theory's ShouldBe.

### Driver.Galaxy (4 new, IDs -015 to -018)
- **-015 Medium Security** — vendored DLLs (libs/) have no recorded
  provenance: no source-commit SHA from the mxaccessgw repo, no SHA-256
  checksum in libs/README.md. Tampering / accidental swap undetectable.
- **-016 Medium Performance** — version skew between declared
  PackageReferences (Polly 8.5.2 / Grpc.Net.Client 2.71.0 /
  Microsoft.Extensions.Logging.Abstractions 10.0.0) and what the vendored
  DLL was actually built against (Polly.Core 8.6.6 / Grpc.Net.Client
  2.76.0 / Microsoft.Extensions.Logging.Abstractions 10.0.7). Latent now
  (assembly-version refs are loose) but precise shape that produces a
  runtime MissingMethodException.
- **-017 Low Design** — no contract-version handshake between the driver
  and the gateway; proto could evolve under the gateway without the
  driver noticing.
- **-018 Low Documentation** — libs/README.md points at the wrong sibling
  csproj as the version source-of-truth; missing SpecificVersion=false
  on the Reference items; missing mxaccessgw source-commit SHA.

## Particularly notable

Two findings undercut commits from this session:

- Driver.Cli.Common-007 invalidates commit 5a9c459 (which named 0x80550000
  as BadDeviceFailure across the cross-CLI shortlist).
- Core.Scripting-016 invalidates the production effect of commit 7b6ab2e
  (the collectible-ALC fix wired Dispose only via CompiledScriptCache,
  which the engines don't use).

The wider native-mapper miscoding behind -007 also affects three driver
modules outside this session's edit scope (FocasStatusMapper,
AbCipStatusMapper, AbLegacyStatusMapper all carry the wrong code).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-23 17:02:47 -04:00
Joseph Doherty a9be80923c docs(v2-release-readiness): drop stale 'this branch task-galaxy-e2e' ref
The task-galaxy-e2e branch was merged + deleted; the durable reference
is PR #205 alone. Tidies a dangling pointer that future readers might
chase looking for a branch that no longer exists.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-23 16:50:06 -04:00
Joseph Doherty 994997ba7b fix(driver-galaxy): vendor MxGateway.Client + MxGateway.Contracts as binary refs
The sibling mxaccessgw repo restructured: clients/dotnet/MxGateway.Client
no longer exists, and the proto contracts moved to a new namespace
(ZB.MOM.WW.MxGateway.Contracts.Proto, was MxGateway.Contracts.Proto). The
driver's source still expects the pre-restructure namespace, so the
broken ProjectReference produced 86 build errors in src/ + 1 in tests/
on master.

Resolution: vendor the last known-good build of MxGateway.Client.dll
(99 KB, May 22) and MxGateway.Contracts.dll (490 KB, May 23) under
src/Drivers/.../Driver.Galaxy/libs/, reference them via <Reference
HintPath=...> in both the driver and its test csproj, and declare the
NuGet packages the dropped ProjectReference was supplying transitively
(Google.Protobuf, Grpc.Core.Api, Grpc.Net.Client,
Microsoft.Extensions.Logging.Abstractions, Polly) at versions matching
the sibling repo's ZB.MOM.WW.MxGateway.Contracts.csproj so binary
compatibility is preserved.

Why this over a source migration:
  Source migration would require namespace renames across ~19 driver
  files PLUS reimplementing MxGatewayClient / MxGatewaySession /
  GalaxyRepositoryClient (~2,200 LoC) — the sibling repo dropped the
  client library entirely, keeping only the proto contracts. Vendoring
  the last known-good binaries unblocks the build in minutes, freezes
  the gateway contract surface at a known-good version, and preserves
  the option to migrate properly once the sibling repo decides whether
  to restore a client library or hand the work back to us.

libs/README.md documents the unwinding plan (either path closes the
debt: sibling restores a client library, or driver migrates to the new
contracts namespace + reimplements the client wrapper).

Verification:
  - dotnet build ZB.MOM.WW.OtOpcUa.slnx: 0 errors (was 87).
  - Driver.Galaxy unit tests: 245/245 pass.
  - Integration tests not run here (require a live mxaccessgw gateway).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-23 16:32:56 -04:00
Joseph Doherty 0001cdd579 fix(scripted-alarms): reuse per-alarm evaluation scratch on the hot path
Core.ScriptedAlarms-009 resolution: replace the per-call Dictionary +
AlarmPredicateContext allocation with a per-alarm reusable AlarmScratch
held in _scratchByAlarmId, refilled in place under _evalGate on each
evaluation. The hot path no longer allocates per upstream tag change.

Why this matters:
  On a busy line where many tags feeding many alarms change frequently,
  the old BuildReadCache allocated a fresh dictionary + context on every
  predicate evaluation — a steady stream of short-lived allocations the
  GC eventually has to reclaim. With the reuse, the dictionary and
  context are allocated once per alarm (on first evaluation) and refilled
  in place across every subsequent re-eval.

Implementation:
  - New private AlarmScratch class holds the reusable
    Dictionary<string, DataValueSnapshot> read cache (pre-sized to the
    alarm's Inputs.Count) and the AlarmPredicateContext that wraps it by
    reference. The context observes refilled values without being
    re-created.
  - ConcurrentDictionary<string, AlarmScratch> _scratchByAlarmId on the
    engine, cleared in LoadAsync alongside _alarms so a config-publish
    drops the prior generation's scratch (Inputs / Logger may change).
  - EvaluatePredicateToStateAsync looks up scratch via GetOrAdd, calls
    the new RefillReadCache(Dictionary, IReadOnlySet) helper to clear +
    repopulate the dictionary in place, then runs the predicate against
    the reused context.
  - BuildReadCache removed.

Safety:
  Reuse is serialised under _evalGate which guarantees no two threads
  ever observe the same scratch in a half-refilled state. The
  AlarmPredicateContext is bound to the scratch dictionary by reference,
  so the predicate's ctx.GetTag(path) sees the freshly-refilled values
  rather than a stale snapshot.

Verification:
  - All 66 ScriptedAlarms tests pass (was 63 — three new regression tests
    locking the reuse contract).
  - All 56 VirtualTags tests still pass (unchanged).
  - All 104 Core.Scripting tests still pass (unchanged).

New tests in ScriptedAlarmEngineTests:
  - Reevaluation_reuses_the_same_read_cache_dictionary — asserts
    ReferenceEquals(scratch_before, scratch_after) across two
    evaluations of the same alarm.
  - Reevaluation_reuses_the_same_predicate_context — same, for the
    context.
  - LoadAsync_drops_the_prior_generations_scratch — asserts a config
    publish wipes the prior scratch (so a stale Logger / Inputs can't
    leak into the new generation).

Internal test hooks TryGetScratchReadCacheForTest /
TryGetScratchContextForTest added via the existing
InternalsVisibleTo for the tests project. Kept internal — not part of
the public engine surface.

Docs:
  - docs/v2/Galaxy.Performance.md "Scripted-alarm engine" section
    rewritten as "hot-path allocation reuse" documenting the new
    contract + reuse safety reasoning + the three regression tests.
  - code-reviews/Core.ScriptedAlarms/findings.md -009 flipped
    Won't Fix → Resolved.
  - code-reviews/README.md regenerated.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-23 16:10:09 -04:00
Joseph Doherty 7b6ab2ec6f fix(scripting): unload compiled-script assemblies via collectible ALC
Core.Scripting-008 resolution: replace the legacy CSharpScript.CreateDelegate
path with hand-rolled CSharpCompilation + Emit + collectible AssemblyLoadContext,
so per-publish compile accretion no longer requires a server restart to reclaim.

Why this was needed:
  Roslyn's CSharpScript path emits dynamically-compiled script assemblies into
  the default AssemblyLoadContext, which is non-collectible. Across config-
  publish generations each Clear() drops dictionary entries but the emitted
  assemblies stay loaded for process lifetime, so memory grows steadily on
  long-running servers with frequent publishes. The accepted-limitation note
  in docs/VirtualTags.md recommended scheduled restarts as the workaround;
  operator feedback was that restarts are difficult, so the underlying
  limitation was the right thing to fix.

Implementation:
  - New ScriptAssemblyLoadContext(name, isCollectible: true) hosts one emitted
    script assembly per evaluator.
  - ScriptEvaluator.Compile synthesises a wrapper class around the user source
    (CompiledScript.Run(globals) — explicit return required per ordinary C#
    semantics, which every existing script already uses), builds a
    CSharpCompilation against the sandbox references, runs the
    ForbiddenTypeAnalyzer over the semantic model unchanged, emits to an
    in-memory PE stream, loads via ScriptAssemblyLoadContext.LoadFromStream,
    and binds a strongly-typed Func<ScriptGlobals<TContext>, TResult> delegate
    via reflection.
  - ScriptEvaluator now implements IDisposable — Dispose calls
    AssemblyLoadContext.Unload(), which makes the emitted assembly eligible
    for GC at the next collection cycle.
  - CompiledScriptCache.Clear() disposes every materialised evaluator before
    dropping its dictionary entry; CompiledScriptCache itself is now
    IDisposable for graceful server shutdown.
  - ScriptSandbox.Build returns a new SandboxConfig (References + Imports)
    instead of a Roslyn ScriptOptions; references now span BCL via the
    TRUSTED_PLATFORM_ASSEMBLIES set filtered to System.* + netstandard +
    Microsoft.Win32.Registry, so forbidden BCL types resolve at compile and
    ForbiddenTypeAnalyzer is the sole security gate (consistent with the
    Core.Scripting-001 / -002 model — references-list-only restriction is
    porous against type forwarding, so the analyzer must be the real gate).

Verification:
  - All 104 Core.Scripting tests pass (was 101 — three new regression tests
    locking the unload contract).
  - All 56 VirtualTags tests pass (unchanged).
  - All 63 ScriptedAlarms tests pass (unchanged).
  - New CompiledScriptCacheTests:
    - Dispose_unloads_compiled_script_assembly_load_context — proves single-
      evaluator ALC unload via WeakReference + bounded GC.Collect() loop.
    - Clear_disposes_every_materialised_evaluator — proves publish-replace
      releases every prior generation's ALC.
    - GetOrCompile_after_Dispose_throws_ObjectDisposedException — locks the
      post-dispose contract.

Docs:
  - docs/VirtualTags.md "Compile cache" section rewritten: the accepted-
    limitation note replaced with the unload contract + the new authoring
    convention (explicit return).
  - docs/ScriptedAlarms.md cross-reference updated to drop the obsolete
    restart guidance.
  - code-reviews/Core.Scripting/findings.md Core.Scripting-008 flipped
    Won't Fix → Resolved with the implementation summary.
  - code-reviews/README.md regenerated.

Pre-existing breakage note: Driver.Galaxy fails the solution-wide build on
master because its ProjectReference to the sibling mxaccessgw repo's
MxGateway.Client targets a path that the sibling repo no longer has after a
recent restructuring. This is unrelated to Core.Scripting-008 and was
verified to exist on master before this branch was cut.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-23 15:55:04 -04:00
Joseph Doherty 5a9c4591b9 fix(cli-common): name native-driver-emitted status codes in SnapshotFormatter
Driver.FOCAS.Cli-005 follow-up: extend the SnapshotFormatter.FormatStatus
shortlist with the five Bad* codes the native-protocol mappers (FOCAS,
AbCip, AbLegacy) emit but which the shortlist previously left unnamed,
so they rendered only as severity-class 'Bad' instead of the documented
'BadDeviceFailure' / 'BadNotWritable' / ... names operators are told to
read off probe/write output.

Added entries:
  0x80020000 BadInternalError
  0x803B0000 BadNotWritable
  0x803C0000 BadOutOfRange
  0x803D0000 BadNotSupported
  0x80550000 BadDeviceFailure

(BadTimeout 0x800A0000 was already added under Driver.Cli.Common-001.)

Tests: SnapshotFormatterTests gains a new [Theory]
FormatStatus_names_native_driver_emitted_codes covering the five names,
and the existing well-known [Theory] is extended with the same entries
to enforce exact '0x... (Name)' rendering. Suite now 47 green (was 42).

Flips Driver.FOCAS.Cli-005 from Deferred to Resolved; README regenerated.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-23 15:14:08 -04:00
84 changed files with 2488 additions and 261 deletions
+71 -4
View File
@@ -4,8 +4,8 @@
|---|---|
| Module | `src/Core/ZB.MOM.WW.OtOpcUa.Core.ScriptedAlarms` |
| Reviewer | Claude Code |
| Review date | 2026-05-22 |
| Commit reviewed | `76d35d1` |
| Review date | 2026-05-23 |
| Commit reviewed | `a9be809` |
| Status | Reviewed |
| Open findings | 0 |
@@ -14,6 +14,8 @@
A comprehensive review completes every category, recording "No issues found" where
a category produced nothing rather than leaving it blank.
### 2026-05-22 review (commit `76d35d1`)
| # | Category | Result |
|---|---|---|
| 1 | Correctness & logic bugs | Core.ScriptedAlarms-002 |
@@ -27,6 +29,26 @@ a category produced nothing rather than leaving it blank.
| 9 | Testing coverage | Core.ScriptedAlarms-012 |
| 10 | Documentation & comments | Core.ScriptedAlarms-003 |
### 2026-05-23 re-review (commit `a9be809`)
Focused re-review of the Core.ScriptedAlarms-009 resolution (commit `0001cdd`) —
new `AlarmScratch` class, `_scratchByAlarmId` ConcurrentDictionary, `RefillReadCache`
helper, and internal test accessors. Only the changed/new code since `76d35d1` was
re-examined; existing closed findings stay as audit trail.
| # | Category | Result |
|---|---|---|
| 1 | Correctness & logic bugs | No issues found |
| 2 | OtOpcUa conventions | No issues found |
| 3 | Concurrency & thread safety | No issues found |
| 4 | Error handling & resilience | No issues found |
| 5 | Security | No issues found |
| 6 | Performance & resource management | No issues found |
| 7 | Design-document adherence | No issues found |
| 8 | Code organization & conventions | No issues found |
| 9 | Testing coverage | No issues found |
| 10 | Documentation & comments | Core.ScriptedAlarms-013 |
## Findings
### Core.ScriptedAlarms-001
@@ -156,13 +178,29 @@ a category produced nothing rather than leaving it blank.
| Severity | Low |
| Category | Performance & resource management |
| Location | `ScriptedAlarmEngine.cs:309-315`, `ScriptedAlarmEngine.cs:271` |
| Status | Won't Fix |
| Status | Resolved |
**Description:** `BuildReadCache` allocates a fresh `Dictionary<string, DataValueSnapshot>` on every predicate evaluation, i.e. on every upstream tag change for every referencing alarm. On a busy line where many tags feeding many alarms change frequently, this is a steady stream of short-lived dictionary allocations on the hot path. `AlarmPredicateContext` is also newly constructed each evaluation (line 281).
**Recommendation:** Minor. If the evaluation path shows up in allocation profiling, the read cache could be a reused per-alarm buffer cleared between evaluations (evaluations are already serialised under `_evalGate`, so a single shared scratch dictionary is safe). Not worth doing speculatively — flag for the perf surface in `docs/v2/Galaxy.Performance.md` if alarm evaluation is ever soak-tested.
**Resolution:** Won't Fix 2026-05-23 — per the recommendation, no code change. Documented the known allocation characteristic in `docs/v2/Galaxy.Performance.md` (new "Scripted-alarm engine — known hot-path allocations" section) so a future soak that surfaces pressure has a noted mitigation (reused per-alarm scratch buffer) and we don't re-find this in a later review.
**Resolution:** Resolved 2026-05-23 — added a per-alarm reusable `AlarmScratch`
(read-cache `Dictionary` + `AlarmPredicateContext`) held in
`_scratchByAlarmId`, populated lazily on first evaluation and refilled in place
by the new `RefillReadCache(Dictionary, IReadOnlySet)` helper on every
subsequent re-eval. `BuildReadCache` is gone. The reuse is safe because every
evaluation runs under `_evalGate`; the context wraps the dictionary by
reference, so the predicate's `ctx.GetTag(path)` sees the freshly-refilled
values. `LoadAsync` clears `_scratchByAlarmId` alongside `_alarms` so a
config-publish drops the prior generation's scratch (Inputs / Logger may
change). Regression tests added in `ScriptedAlarmEngineTests`:
`Reevaluation_reuses_the_same_read_cache_dictionary`,
`Reevaluation_reuses_the_same_predicate_context`, and
`LoadAsync_drops_the_prior_generations_scratch`; internal test hooks
`TryGetScratchReadCacheForTest` / `TryGetScratchContextForTest` exposed via
the existing `InternalsVisibleTo`. `docs/v2/Galaxy.Performance.md` "Scripted-alarm
engine" section rewritten to document the new reuse contract. Suite now 66
green (was 63).
### Core.ScriptedAlarms-010
@@ -208,3 +246,32 @@ a category produced nothing rather than leaving it blank.
**Recommendation:** Add engine-level tests that inject a controllable `Func<DateTime>` clock to drive `RunShelvingCheck`, cover the remaining Part 9 engine methods end-to-end, assert subscriber-exception isolation, and add a store-failure fake to lock in the chosen persistence-failure semantics from finding 007.
**Resolution:** Resolved 2026-05-22 — added 8 new engine-level tests covering all 6 gap areas: injectable-clock timed-shelve expiry via `RunShelvingCheckForTest`, `ConfirmAsync`/`TimedShelveAsync`/`UnshelveAsync`/`EnableAsync` end-to-end, subscriber-exception isolation, store-failure invariant, second-`LoadAsync` timer-leak regression, and `AreInputsReady` Bad/Uncertain guard; exposed `RunShelvingCheckForTest()` internal hook on the engine.
### Core.ScriptedAlarms-013
| Field | Value |
|---|---|
| Severity | Low |
| Category | Documentation & comments |
| Location | `ScriptedAlarmEngine.cs:66-81` |
| Status | Resolved |
**Description:** The new internal test accessors `TryGetScratchReadCacheForTest` and `TryGetScratchContextForTest` (introduced by the Core.ScriptedAlarms-009 resolution at `0001cdd`) return the *live* per-alarm scratch — the same `Dictionary<string, DataValueSnapshot>` instance the engine clears and refills in `RefillReadCache` under `_evalGate`, plus the `AlarmPredicateContext` that wraps it by reference. The XML docs describe the intended use ("assert the scratch is reused across evaluations (two reads return the same instance)") but do not explicitly warn that:
1. The returned `IReadOnlyDictionary` is the engine's mutable working set. Enumerating it from a test thread while the engine is mid-evaluation (e.g. during a `ReevaluateAsync` queued by `OnUpstreamChange`, or a `ShelvingCheckAsync` callback) is a concurrent-read-while-writer scenario against a plain `Dictionary` — undefined behaviour, can throw `InvalidOperationException` or return torn data.
2. Reference-equality comparisons (`ReferenceEquals(a, b)`) and single-key indexer reads (`dict["Temp"]`) on a quiesced engine are the only safe uses. The existing regression tests stay within those bounds, but a future test author has no in-code signal that broader reads are unsafe.
The engine itself is correct — `RefillReadCache` runs only under `_evalGate`, so the engine never tears its own state. The risk is purely on the test-side contract.
**Recommendation:** Add a `<remarks>` block to both `TryGetScratchReadCacheForTest` and `TryGetScratchContextForTest` stating that the returned references point at live engine state, that reads are only safe when the engine is known to be idle (no in-flight `ReevaluateAsync`/`ShelvingCheckAsync`/`LoadAsync`), and that the intended uses are reference-identity assertions plus single-key lookups against a quiesced engine — never enumeration. No code change required; the engine's correctness depends on `_evalGate`, which is already documented.
**Resolution:** Resolved 2026-05-23 — applied the recommendation verbatim.
Added a `<remarks>` block to each of `TryGetScratchReadCacheForTest` and
`TryGetScratchContextForTest` documenting the synchronization contract:
the returned references point at live engine state refilled in place under
`_evalGate`, enumeration during an in-flight evaluation is a
concurrent-read-while-writer scenario against a plain `Dictionary`
(undefined behaviour), and the only safe uses are reference-identity
comparisons + single-key reads against a quiesced engine. No code change
required — the engine's correctness was always there; only the test-side
contract was undocumented.
+434 -16
View File
@@ -4,8 +4,8 @@
|---|---|
| Module | `src/Core/ZB.MOM.WW.OtOpcUa.Core.Scripting` |
| Reviewer | Claude Code |
| Review date | 2026-05-22 |
| Commit reviewed | `76d35d1` |
| Review date | 2026-05-23 |
| Commit reviewed | `a9be809` |
| Status | Reviewed |
| Open findings | 0 |
@@ -14,18 +14,23 @@
A comprehensive review completes every category, recording "No issues found" where
a category produced nothing rather than leaving it blank.
| # | Category | Result |
|---|---|---|
| 1 | Correctness & logic bugs | Core.Scripting-004, Core.Scripting-005 |
| 2 | OtOpcUa conventions | No issues found |
| 3 | Concurrency & thread safety | Core.Scripting-006 |
| 4 | Error handling & resilience | Core.Scripting-007 |
| 5 | Security | Core.Scripting-001, Core.Scripting-002, Core.Scripting-003 |
| 6 | Performance & resource management | Core.Scripting-008 |
| 7 | Design-document adherence | Core.Scripting-009 |
| 8 | Code organization & conventions | No issues found |
| 9 | Testing coverage | Core.Scripting-010, Core.Scripting-011 |
| 10 | Documentation & comments | No issues found |
The 2026-05-23 re-review only covers code touched between commits `76d35d1` and
`a9be809` (primarily the Core.Scripting-008 ALC rewrite + the broadened BCL
references). Categories where the new code surface produced no issues are
recorded as "No new issues" for that pass.
| # | Category | Result (76d35d1) | Result (a9be809, new code only) |
|---|---|---|---|
| 1 | Correctness & logic bugs | Core.Scripting-004, Core.Scripting-005 | Core.Scripting-015 |
| 2 | OtOpcUa conventions | No issues found | No new issues |
| 3 | Concurrency & thread safety | Core.Scripting-006 | Core.Scripting-014 |
| 4 | Error handling & resilience | Core.Scripting-007 | No new issues |
| 5 | Security | Core.Scripting-001, Core.Scripting-002, Core.Scripting-003 | Core.Scripting-012, Core.Scripting-013 |
| 6 | Performance & resource management | Core.Scripting-008 | Core.Scripting-016 |
| 7 | Design-document adherence | Core.Scripting-009 | No new issues |
| 8 | Code organization & conventions | No issues found | No new issues |
| 9 | Testing coverage | Core.Scripting-010, Core.Scripting-011 | No new issues |
| 10 | Documentation & comments | No issues found | No new issues |
## Findings
@@ -240,7 +245,7 @@ race ordering.
| Severity | Low |
| Category | Performance & resource management |
| Location | `CompiledScriptCache.cs:34`, `ScriptEvaluator.cs:34` |
| Status | Won't Fix |
| Status | Resolved |
**Description:** `CompiledScriptCache` has no capacity bound (acknowledged in the class
remarks) and no eviction. Each cached `ScriptEvaluator` holds a Roslyn `ScriptRunner<T>`
@@ -257,7 +262,33 @@ compile scripts into a collectible `AssemblyLoadContext` so `Clear()` can unload
generations. At minimum add a note to `docs/ScriptedAlarms.md` so operators with
high-publish-frequency deployments are aware.
**Resolution:** Resolved 2026-05-23 — accepted as a documented known limitation rather than fixing in code (collectible `AssemblyLoadContext` for Roslyn-emitted assemblies is a v3 concern). The "Compile cache" section of `docs/VirtualTags.md` now carries a "Per-publish assembly accretion (accepted limitation, Core.Scripting-008)" note that operators with high-publish-frequency deployments can scan, and `docs/ScriptedAlarms.md` cross-references it. The accretion is benign at the expected "low thousands" of scripts scale; recommended mitigation is a scheduled server restart for deployments that publish very frequently.
**Resolution:** Resolved 2026-05-23 — switched the compile pipeline off the legacy
`CSharpScript.CreateDelegate` path (which emits into the default, non-collectible
`AssemblyLoadContext`) and onto a hand-rolled `CSharpCompilation`
`Compilation.Emit(MemoryStream)``ScriptAssemblyLoadContext.LoadFromStream` chain,
with the new `ScriptAssemblyLoadContext` constructed `isCollectible: true`. Each
compiled script lives in its own ALC; `ScriptEvaluator` now implements `IDisposable`
and calls `AssemblyLoadContext.Unload()` on dispose. `CompiledScriptCache.Clear()`
disposes every materialised evaluator before dropping its dictionary entry, and
`CompiledScriptCache` itself is now `IDisposable` for graceful server shutdown.
After a publish-replace cycle the prior generation's emitted assemblies become
eligible for GC; the reclaim is GC-timing-sensitive (Unload is
*eligible-for-collection*, not synchronous) and the next collection cycle reclaims
them. The references list is now BCL-wide (System.* + netstandard + Microsoft.Win32.Registry
via the TRUSTED_PLATFORM_ASSEMBLIES set) so forbidden BCL types resolve at compile and
`ForbiddenTypeAnalyzer` is the sole security gate (consistent with the
Core.Scripting-001 / -002 model). `docs/VirtualTags.md` "Compile cache" section rewritten;
`docs/ScriptedAlarms.md` cross-reference updated to drop the obsolete restart guidance.
Regression tests added in `CompiledScriptCacheTests`:
`Dispose_unloads_compiled_script_assembly_load_context`,
`Clear_disposes_every_materialised_evaluator`, and
`GetOrCompile_after_Dispose_throws_ObjectDisposedException`; the first two
prove ALC unload via `WeakReference` + bounded `GC.Collect()` loops. Suite now 104
green (was 101). Authoring convention: the synthesized wrapper is an ordinary
C# static method, so scripts must end with explicit `return …;` per ordinary C# rules
(the legacy `CSharpScript` "last expression yields result" shorthand no longer applies);
every script in the existing corpus already uses explicit `return` so this is a doc-only
change for new authors.
### Core.Scripting-009
@@ -336,3 +367,390 @@ a script logging at Error level produces both a `scripts-*.log` event and a comp
Warning event.
**Resolution:** Resolved 2026-05-23 — added three new test files: `ScriptSandboxBuildTests` covers the `Build` null / non-`ScriptContext` / base-class / concrete-subclass paths; `ScriptContextTests` locks `Deadband` boundary semantics (equal-to-tolerance returns false; just-over returns true; symmetric in direction; zero-tolerance returns true only on non-equal; negative tolerance trips on any non-equal); the new `Factory_plus_companion_sink_integration_surfaces_script_error_in_both_logs` test in `ScriptLogCompanionSinkTests` wires `ScriptLoggerFactory` + the companion sink together end-to-end and asserts an Error emission lands in both the scripts sink (at Error) and the main sink (at Warning), each tagged with `ScriptName`. Suite now 101 green (was 85 before).
### Core.Scripting-012
| Field | Value |
|---|---|
| Severity | High |
| Category | Security |
| Location | `ForbiddenTypeAnalyzer.cs:60-76`, `ScriptSandbox.cs:96-126` |
| Status | Resolved |
**Description:** The Core.Scripting-008 rewrite broadened the BCL references list
from a narrow allow-list (`System.Private.CoreLib` + `System.Linq` only) to the
full `TRUSTED_PLATFORM_ASSEMBLIES` set filtered to `System.*` + `netstandard` +
`Microsoft.Win32.Registry`. This change correctly delegates the security gate to
`ForbiddenTypeAnalyzer` (the new comment in `ScriptSandbox` calls this out
explicitly), but the analyzer's deny-list has not been expanded to match the new
attack surface, and three categories of dangerous BCL types in the `System.*`
allow-listed assemblies are now reachable from script source:
1. **`System.Threading.ThreadPool`** (in namespace `System.Threading`). The
Core.Scripting-003 fix added `System.Threading.Tasks` to deny `Task.Run` /
`Parallel` fan-out because background work that outlives the per-evaluation
timeout is the explicit threat. `ThreadPool.QueueUserWorkItem`,
`ThreadPool.UnsafeQueueUserWorkItem`, and `ThreadPool.RegisterWaitForSingleObject`
are exactly the same threat — they schedule background work that outlives the
`WaitAsync(Timeout)` budget and tie up worker threads — but `System.Threading`
itself is allowed (because `CancellationToken` / `SemaphoreSlim` / `Volatile`
live there). The Core.Scripting-003 resolution is incomplete on the new
reference surface.
2. **`System.Threading.Timer`** (same namespace). Schedules a background
callback; the script returns control to the engine but the timer keeps
firing past the evaluation budget. Same threat as `Task.Run`.
3. **`System.Runtime.Loader.AssemblyLoadContext`** (in namespace
`System.Runtime.Loader`, which is not denied — only `System.Runtime.InteropServices`
is). The constructor + `LoadFromAssemblyPath` / `LoadFromStream` /
`LoadFromAssemblyName` let a script load an arbitrary DLL into the host
process. Pass (1) of the analyzer resolves the receiver type
(`AssemblyLoadContext`, allowed) + the invocation symbol's containing type
(also `AssemblyLoadContext`, allowed) and lets the call through. Pass (2)
only inspects `TypeSyntax` nodes — if the script discards the returned
`Assembly` (e.g. `alc.LoadFromAssemblyPath(@"C:\evil.dll");`) there is no
`TypeSyntax` for the analyzer to walk and the call is accepted. Triggering
execution of the loaded code from inside the sandbox is hard (most of
`Assembly`'s surface is in `System.Reflection`, which is denied) but the
defense-in-depth gap is real: an attacker who can author a script also
typically controls a file path on the server (Admin UI uploads, share
mounts) and loading an assembly is the prerequisite to every chained
escape — module initializers, type-resolve handlers, and a future analyzer
slip would all become exploitable.
In addition, two lower-impact `System.*` types are reachable that arguably
shouldn't be: **`System.Console.SetOut`** / **`Console.SetError`** could
redirect the host's console streams (requires constructing a
`System.IO.TextWriter`, which is blocked, so the practical exploit is
`Console.WriteLine` log-spam only), and **`System.Globalization.CultureInfo.DefaultThreadCurrentCulture`**
could perturb the entire process's formatting behavior (subtle but real cross-script
side effect).
The original Core.Scripting-001 finding called out the model: when an allow-listed
namespace contains dangerous types, those types must be denied type-granularly.
The new reference surface introduces several more such types and the deny-list
has not been kept in sync.
**Recommendation:** Add `System.Threading.ThreadPool` and `System.Threading.Timer`
to `ForbiddenFullTypeNames`. Add `System.Runtime.Loader` as a namespace prefix
to `ForbiddenNamespacePrefixes` (every type in `System.Runtime.Loader`
`AssemblyLoadContext`, `AssemblyDependencyResolver`, `AssemblyLoadEventArgs` — is
out of script scope). Consider adding `System.Console` to `ForbiddenFullTypeNames`
to stop log-spam through the host's console streams, and at minimum document
`CultureInfo.DefaultThreadCurrentCulture` as an accepted cross-script side
effect. Each addition must have a regression test in `ScriptSandboxTests`
mirroring the Core.Scripting-010 vector style. Update
`docs/v2/implementation/phase-7-scripting-and-alarming.md` decision #6 + the
"Sandbox escape" compliance-check row to enumerate the additions, per the
Core.Scripting-009 doc-sync convention.
**Resolution:** Resolved 2026-05-23 — added `System.Runtime.Loader` to
`ForbiddenNamespacePrefixes` (the namespace-prefix form preferred over
type-granular per the recommendation; future BCL additions to that namespace
are denied by default). Added `System.Threading.ThreadPool` and
`System.Threading.Timer` to `ForbiddenFullTypeNames` — both live in
`System.Threading` shared with allowed sync primitives so they must be
type-granular. Regression tests added to `ScriptSandboxTests`:
`Rejects_ThreadPool_QueueUserWorkItem_at_compile`,
`Rejects_Timer_new_at_compile`, `Rejects_AssemblyLoadContext_at_compile`.
`docs/v2/implementation/phase-7-scripting-and-alarming.md` decision #6 +
the Sandbox-escape compliance-check row both updated per the
Core.Scripting-009 doc-sync convention. The two lower-impact suggestions
from the recommendation (`System.Console`, `CultureInfo.DefaultThreadCurrentCulture`)
were intentionally not addressed: `Console.SetOut` requires constructing
a `System.IO.TextWriter` which is already blocked, leaving only
`Console.WriteLine` log-spam (annoyance, not a security threat); and
`CultureInfo.DefaultThreadCurrentCulture` is a cross-script side-effect
worth knowing about but doesn't escape the sandbox. Recording both as
accepted minor risks. Test totals after fix: Core.Scripting 107 green
(was 104 — +3 new rejection tests).
### Core.Scripting-013
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Security |
| Location | `ScriptEvaluator.cs:202-225` (`BuildWrapperSource`) |
| Status | Resolved |
**Description:** The synthesized wrapper pastes the user's source verbatim
between `{` and `}` braces inside a static method body, with a `#line 1`
directive and no escaping. The legacy `CSharpScript.CreateDelegate` path was
robust to this because Roslyn's scripting compiler parses script source as a
top-level statement sequence; the new hand-rolled path is parsing ordinary C# in
a method body, so a script that injects matching `{` / `}` braces can extend the
synthesized compilation unit with additional methods, classes, or `#line`
directives. For example, a script body of
`return 0; } public static int Evil() { return 0; }} public static class CompiledScript2 { public static void M() {`
ends the `Run` method early, declares a sibling `Evil` method (and even a
sibling `CompiledScript2` class) inside the synthesized namespace, then opens an
unclosed method that consumes the wrapper's trailing `}\n}`. With matching brace
counts the script parses cleanly and compiles.
`ForbiddenTypeAnalyzer` walks every descendant of every syntax tree, so any
forbidden BCL types named inside the injected methods are still caught — the
finding is **not** a direct sandbox escape. However:
- It silently relaxes the operator-visible authoring contract documented in
`docs/VirtualTags.md` ("scripts are statement bodies that end with an
explicit `return …;`") to "scripts can be any compilable C# inside the
`CompiledScript` namespace" — operators have access to features the design
did not intend to expose (local types defined as siblings of `Run`, custom
module initializers via attributes, etc.).
- A script can embed its own `#line` directives that override the
`#line 1` we emit just above the user source, producing misleading error
locations in compiler diagnostics surfaced to the operator.
- Future hardening that relies on syntactic-shape assumptions (e.g.
"every script has exactly one method") would silently fail.
- It widens the analyzer's surface: the analyzer's correctness now depends on
Pass (2) correctly walking every conceivable C# construct that can name a
type, including ones a normal script body would never contain
(`UnmanagedCallersOnly` attribute, function pointer types `delegate*<...>`,
pattern types, switch arm types, …).
**Recommendation:** Either (a) reject scripts whose parsed body contains
declarations other than statements — walk the wrapper's syntax tree after parse
and require that the only members of `CompiledScript` are the single `Run`
method, raising a `CompilationErrorException` if anything else appears — or
(b) parse the user source independently as a `BlockSyntax` and inject the
parsed block as the method body via the Roslyn syntax API, which makes
brace-mismatched / class-injecting source unparseable. Add a regression test
covering at least the brace-injection vector
(`return 0; } public static int Evil() { return 0;`).
**Resolution:** Resolved 2026-05-23 — took option (a) from the recommendation:
added an `EnforceSingleRunMember` step to `ScriptEvaluator.Compile` (runs after
`CSharpSyntaxTree.ParseText` of the synthesized wrapper, before Roslyn
compile). The check requires exactly one type declaration in the compilation
unit (the `CompiledScript` class) AND exactly one member on that class (the
`Run` method). Any deviation — a sibling class, an additional namespace, a
sibling method or nested type alongside `Run` — throws
`CompilationErrorException` with diagnostic IDs `LMX001` / `LMX002` and a
message that names Core.Scripting-013 and points at the offending span. Two
regression tests added: `Rejects_sibling_method_injection_via_balanced_braces`
(injects a sibling method via `} public static int Evil() { …`) and
`Rejects_sibling_class_injection_via_balanced_braces` (injects an entire
sibling namespace + class). Option (b) (parse the user source independently
as a `BlockSyntax` and inject via Roslyn syntax API) was considered but the
parse-and-validate approach is more readable, gives clearer error messages,
and keeps the wrapper-source generation textual.
### Core.Scripting-014
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Concurrency & thread safety |
| Location | `CompiledScriptCache.cs:91-103` (`Clear`) |
| Status | Resolved |
**Description:** `Clear()` snapshots `_cache.Keys.ToArray()` then iterates,
calling `TryRemove(key, out var lazy)` on each — the key-only overload, not
the value-scoped one used in `GetOrCompile`'s catch block. Between the
snapshot and a given `TryRemove`, a concurrent `GetOrCompile(scriptSource)`
call that hashes to the same key can re-insert a fresh `Lazy` whose `.Value`
the caller already retained. The unconditional `TryRemove` then removes that
fresh `Lazy` and `DisposeLazyIfMaterialised(lazy)` calls `Dispose()` on its
evaluator — unloading the ALC while the concurrent caller still holds a
reference to the evaluator and intends to invoke it.
This is exactly the race-window pattern the Core.Scripting-006 resolution
fixed in `GetOrCompile`'s catch block (the test
`Failed_compile_eviction_does_not_remove_a_concurrent_retry_entry` locks it
there). `Clear()` carries the same shape but uses the older, value-blind
overload, so the same race that finding-006 addresses is still latent on the
publish-replace path.
In current production wiring `Clear()` is intended for config-publish + tests
— neither overlaps steady-state evaluation under the documented design — so
the in-practice impact is low. But the cache is checked in as the
forward-looking compile cache for the engines (per `Script.SourceHash`'s docs
and the cache's own remarks); a future wiring that calls `Clear()` from
publish while evaluations are in flight would dispose live evaluators.
**Recommendation:** Replace the snapshot + `TryRemove(key, out var lazy)`
sequence with an enumeration that captures the `Lazy` reference at snapshot
time and uses the value-scoped `TryRemove(KeyValuePair<,>)` overload, mirroring
the Core.Scripting-006 fix:
```csharp
foreach (var entry in _cache.ToArray())
{
if (_cache.TryRemove(entry))
DisposeLazyIfMaterialised(entry.Value);
}
```
Add a regression test that races `GetOrCompile` against `Clear` and asserts
the caller's evaluator is still usable.
**Resolution:** Resolved 2026-05-23 — applied the recommendation verbatim:
replaced `foreach (var key in _cache.Keys.ToArray())` + key-only
`TryRemove(key, out var lazy)` with `foreach (var entry in _cache.ToArray())` +
value-scoped `TryRemove(entry)` (the `KeyValuePair<,>` overload). A concurrent
GetOrCompile re-add between the snapshot and the remove inserts a fresh Lazy
under the same key; the value-scoped comparison sees the mismatch and leaves
the fresh entry intact (instead of evicting + disposing the live evaluator
the concurrent caller still holds). Regression test
`Clear_uses_value_scoped_TryRemove_so_a_race_inserted_entry_survives` added
to `CompiledScriptCacheTests` — single-threaded simulation that snapshots
the dict, mutates the entry to a fresh Lazy mid-flight, drives the same
value-scoped TryRemove overload Clear now uses, and asserts the fresh entry
survives. The two-thread race would be flaky to model directly; the
single-threaded semantic test is sufficient because the fix is the
overload-selection itself.
### Core.Scripting-015
| Field | Value |
|---|---|
| Severity | Low |
| Category | Correctness & logic bugs |
| Location | `ScriptEvaluator.cs:234-270` (`ToCSharpTypeName`) |
| Status | Resolved |
**Description:** `ToCSharpTypeName` is documented to handle nested types
(`Outer+Inner``Outer.Inner`) via `Replace('+', '.')` for the
non-generic path (line 269) but the generic path (line 263-266) constructs the
name from `def.FullName!` then takes a substring up to the backtick. For a
**nested generic** type — e.g. `Outer.Inner<T>` whose `FullName` is
`Outer+Inner`1` — `Replace('+', '.')` is applied first, then `Substring(0, IndexOf('`'))`
on `"Outer.Inner`1"` produces `"Outer.Inner"`, which is correct. Good.
However, the generic branch does NOT handle the case where the OPEN generic
type itself is nested with `+` inside the parent's name when the parent is
also generic (`Outer<TOuter>.Inner<TInner>` — `FullName` is
`Outer`1+Inner`1[[TOuter,TInner]]`). For that shape `Substring(0, IndexOf('`'))`
truncates at the first backtick — yielding `"Outer.Inner"` — silently dropping
the closed type arguments of `Outer<TOuter>`. The resulting source string is
syntactically valid but semantically wrong: `global::Outer.Inner<TInner>` does
not name `Outer<TOuter>.Inner<TInner>`.
The production code never hits this shape — `TResult` is always one of
`object?`, `bool`, `int`, `double`, `string?`, `DateTime` across the
virtual-tag engine, the alarm engine, the test-harness, and the test suite,
and `ScriptGlobals<TContext>` is always a top-level generic over a top-level
`ScriptContext` subclass. The bug is latent. But it is a foot-gun for a
future caller (e.g. a Phase-8 driver that wires a context type defined as a
nested generic for grouping reasons) and the XML-doc comment claims
"handles nested types" without qualifying it.
A second smaller correctness gap on the same path: the comment claims
`global::`-qualified FQNs prevent accidental capture by the wrapper's `using`
directives, which is true for the generic / non-generic branches, but the
primitive aliases (`bool`, `int`, `string`, `object`, …) are emitted unqualified.
A script that defines a local `class bool` (now possible per Core.Scripting-013)
would shadow the alias. Probably benign, but worth a comment.
**Recommendation:** Add a check in the generic branch that walks the FullName
backtick-by-backtick — or use `INamedTypeSymbol`-style name composition from
`def.DeclaringType` recursively — so multi-arity-nested generics emit
correctly. At minimum update the XML doc to qualify "handles nested types" as
"handles single-level nesting; nested generics whose parent is itself generic
are not supported". Add a `ToCSharpTypeName` unit test (currently nothing
exercises this method directly — coverage relies on the end-to-end compile path,
so the bug surfaces only as a misleading Roslyn diagnostic).
**Resolution:** Resolved 2026-05-23 — rewrote the generic-type branch of
`ToCSharpTypeName` to walk the `FullName` segment-by-segment (split on `.`
after `+ → .` substitution). For each segment ending in `Name\`N`, the
algorithm consumes N generic arguments from `t.GetGenericArguments()` in
order and emits them as `<…>` on that segment. Nested generic-in-generic
shapes (`Outer<T>.Inner<U>`) now emit as
`global::Ns.Outer<T>.Inner<U>` (valid C#) rather than the pre-fix
`global::Ns.Outer<T, U>` (which dropped the segment boundary entirely
because `IndexOf('`')` truncated at the first backtick). No production
caller exercises this shape today (all `TContext` / `TResult` types in
the codebase are top-level non-nested), so the fix is preemptive — but
the algorithm is now correct for any future nested-generic context type.
### Core.Scripting-016
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Performance & resource management |
| Location | `src/Core/ZB.MOM.WW.OtOpcUa.Core.VirtualTags/VirtualTagEngine.cs:74-117`, `src/Core/ZB.MOM.WW.OtOpcUa.Core.ScriptedAlarms/ScriptedAlarmEngine.cs:139-182` |
| Status | Resolved |
**Description:** The Core.Scripting-008 resolution introduced
`ScriptEvaluator.IDisposable` + `CompiledScriptCache.Clear()` that disposes
each materialised evaluator before dropping its dictionary entry, so per-publish
ALC accretion is no longer process-lifetime rooted **inside the cache**. But
neither production consumer of `ScriptEvaluator` uses the cache — both
`VirtualTagEngine.Load` and `ScriptedAlarmEngine.LoadAsync` call
`ScriptEvaluator<TContext, TResult>.Compile(...)` directly (lines 105 / 160
respectively), store the evaluator inside an internal `VirtualTagState` /
`AlarmState` record, and on the next `Load` simply call `_tags.Clear()` /
`_alarms.Clear()`. The dropped `ScriptEvaluator` references never have
`Dispose()` called on them, so the underlying `ScriptAssemblyLoadContext`
instances are never `Unload()`-ed. The .NET runtime guarantees that a
collectible ALC stays alive until `Unload()` is called explicitly — having
"no strong references" is necessary but not sufficient. So the publish-replace
cycle leaks every prior generation's emitted assembly exactly as before the
fix, even though the fix's infrastructure is in place.
The Core.Scripting-008 regression tests in `CompiledScriptCacheTests`
(`Dispose_unloads_compiled_script_assembly_load_context` /
`Clear_disposes_every_materialised_evaluator`) prove the contract on
`CompiledScriptCache`, but neither engine uses that class. There is no
integration test exercising the actual publish path — i.e. that calling
`VirtualTagEngine.Load(...)` twice with different definitions makes the prior
generation's ALC eligible for GC. As a result the fix's headline guarantee
("Server restarts are no longer required to reclaim compiled-script memory" —
`docs/VirtualTags.md`) is not actually delivered to the production engines.
This is the same observable behavior the original Core.Scripting-008 finding
described, surfacing on a different code path that the resolution did not touch.
**Recommendation:** Either route the engines' compile path through
`CompiledScriptCache<TContext, TResult>` (the documented design — the cache
already returns the same evaluator instance for identical source, and its
`Clear()` now performs the right disposal — and `Script.SourceHash`'s doc-comment
already names this as the cache key), or make the engines' `Load` methods
dispose the previous `ScriptEvaluator` instances before reassigning. The
former is the cleaner change because it also collapses redundant compiles
across publishes for unchanged scripts. Add an integration test along the
lines of `CompiledScriptCacheTests.Clear_disposes_every_materialised_evaluator`
for each engine: snapshot the per-evaluator emitted assembly via
`WeakReference`, call `Load(...)` with a different definition set, and assert
the prior generation's assemblies become collectable.
**Resolution:** Resolved 2026-05-23 — took the cleaner route from the
recommendation: routed both engines' compile paths through
`CompiledScriptCache<TContext, TResult>`. `VirtualTagEngine` and
`ScriptedAlarmEngine` each gained a private `_compileCache` instance field,
their `Load`/`LoadAsync` methods now call `_compileCache.GetOrCompile(source)`
instead of `ScriptEvaluator.Compile(source)` directly, and the cache is cleared
on publish-replace alongside the existing `_tags` / `_alarms` clears so the
prior generation's ALCs are disposed before recompile. Engine `Dispose` now
also calls `_compileCache.Dispose()` so the engine-shutdown path actually
releases the emitted assemblies. **Side-fix:** discovered + fixed an
adjacent bug in `CompiledScriptCache.Dispose()` itself — it set
`_disposed = true` before calling `Clear()`, but `Clear()`'s pre-existing
`if (_disposed) return` guard then aborted the drain unconditionally, so
the Dispose-triggered cleanup was a silent no-op. Removed the disposed-guard
on `Clear()` (the operation is idempotent — clearing an empty/cleared cache
is safe). Without this side-fix the engine-Dispose path would have left
the cached evaluators rooted forever even though the call chain looked
correct. **Side-fix for ScriptedAlarmEngine.Dispose:** moved the pre-existing
"do NOT clear `_alarms` here" comment to "clear `_alarms` AFTER the drain"
because the AlarmState records hold the `TimedScriptEvaluator`/`ScriptEvaluator`
delegates that root the emitted assembly — leaving them in `_alarms` after
Dispose was the same root-the-script-forever pattern this finding is about,
just on the engine side rather than the cache side. The `_alarms` clear is
safe after the `Task.WhenAll` drain because that drain guarantees no
background callback is mid-flight. Regression tests added:
`VirtualTagEngineTests.Dispose_unloads_compiled_script_assembly` and
`ScriptedAlarmEngineTests.Dispose_unloads_compiled_predicate_assembly`
each uses `WeakReference` + bounded `GC.Collect()` to prove the emitted
assembly is reclaimable after `engine.Dispose()`. **Important test pattern
detail:** the alarms test originally failed because its helper was
`async Task<WeakReference>` — async state machines capture locals as
state-struct fields and can keep them alive past the method's apparent end.
Rewrote as a synchronous helper using `LoadAsync(...).GetAwaiter().GetResult()`
inside two cooperating `[MethodImpl(MethodImplOptions.NoInlining)]` helpers
(`CompileAlarmAndCaptureWeak` + `ExtractEmittedAssemblyWeakRef`) so the
intermediate reflection locals die when each helper returns. Test totals
after fix: Core.Scripting 104 green (unchanged); VirtualTags 57 green (was
56 — +1 unload test); ScriptedAlarms 67 green (was 66 — +1 unload test).
+161 -4
View File
@@ -4,8 +4,8 @@
|---|---|
| Module | `src/Drivers/Cli/ZB.MOM.WW.OtOpcUa.Driver.Cli.Common` |
| Reviewer | Claude Code |
| Review date | 2026-05-22 |
| Commit reviewed | `76d35d1` |
| Review date | 2026-05-23 |
| Commit reviewed | `a9be809` |
| Status | Reviewed |
| Open findings | 0 |
@@ -16,7 +16,7 @@ a category produced nothing rather than leaving it blank.
| # | Category | Result |
|---|---|---|
| 1 | Correctness & logic bugs | Driver.Cli.Common-001, Driver.Cli.Common-002 |
| 1 | Correctness & logic bugs | Driver.Cli.Common-001, Driver.Cli.Common-002, Driver.Cli.Common-007 |
| 2 | OtOpcUa conventions | No issues found |
| 3 | Concurrency & thread safety | Driver.Cli.Common-003 |
| 4 | Error handling & resilience | Driver.Cli.Common-004 |
@@ -24,9 +24,47 @@ a category produced nothing rather than leaving it blank.
| 6 | Performance & resource management | No issues found |
| 7 | Design-document adherence | No issues found |
| 8 | Code organization & conventions | No issues found |
| 9 | Testing coverage | Driver.Cli.Common-005 |
| 9 | Testing coverage | Driver.Cli.Common-005, Driver.Cli.Common-008 |
| 10 | Documentation & comments | Driver.Cli.Common-006 |
## Re-review 2026-05-23 (commit `a9be809`)
Delta scope: commit `5a9c459` extends the `FormatStatus` shortlist with five
`Bad*` codes (`BadInternalError` 0x80020000, `BadNotWritable` 0x803B0000,
`BadOutOfRange` 0x803C0000, `BadNotSupported` 0x803D0000, `BadDeviceFailure`
0x80550000) the FOCAS / AbCip / AbLegacy native-protocol mappers emit. Tests
extended with parallel `[InlineData]` rows on the well-known Theory plus a new
`FormatStatus_names_native_driver_emitted_codes` Theory.
Cross-checked the five new hex literals against the OPC Foundation
`Opc.Ua.StatusCodes` table via DeepWiki:
| Name added | Code in shortlist | Spec value | Verdict |
|---|---|---|---|
| `BadInternalError` | `0x80020000` | `0x80020000` | Correct |
| `BadNotWritable` | `0x803B0000` | `0x803B0000` | Correct |
| `BadOutOfRange` | `0x803C0000` | `0x803C0000` | Correct |
| `BadNotSupported` | `0x803D0000` | `0x803D0000` | Correct |
| `BadDeviceFailure` | `0x80550000` | **`0x808B0000`** | **WRONG — `0x80550000` is `BadSecurityPolicyRejected`** |
The `BadDeviceFailure` mismapping is the same shape of bug as the original
Driver.Cli.Common-001 (wrong hex literal copied into the shortlist); recorded
as Driver.Cli.Common-007. The wrong constant also lives in
`FocasStatusMapper.cs`, `AbCipStatusMapper.cs`, `AbLegacyStatusMapper.cs`,
`TwinCATStatusMapper.cs`, `S7Driver.cs`, and `ModbusDriver.cs` — those are in
other modules' review scope but are noted here so future re-reviewers know
this isn't isolated. (`StatusCodeMap.cs` in Driver.Galaxy + the Wonderware
historian mappers use the correct `0x808B0000`, confirming the discrepancy.)
Testing observation: the new `FormatStatus_names_native_driver_emitted_codes`
Theory is fully redundant with the well-known Theory (the five rows were also
added there in the same commit) and uses `ShouldContain` rather than
`ShouldBe` — recorded as Driver.Cli.Common-008.
Other categories (concurrency, security, performance, design-doc adherence,
code organisation, documentation) are unchanged by this delta — no new
issues found.
## Findings
### Driver.Cli.Common-001
@@ -205,3 +243,122 @@ to state the source-time column is the right-most one and intentionally not
measured/padded, calling out the null-timestamp `"-"` case explicitly. (2) FOCAS was
added to the `DriverCommandBase` class-summary driver enumeration in commit `7ff356b`
(landed alongside the -003 work).
### Driver.Cli.Common-007
| Field | Value |
|---|---|
| Severity | High |
| Category | Correctness & logic bugs |
| Location | `src/Drivers/Cli/ZB.MOM.WW.OtOpcUa.Driver.Cli.Common/SnapshotFormatter.cs:129` |
| Status | Resolved |
**Description:** Commit `5a9c459` added `0x80550000u => "BadDeviceFailure"` to the
`FormatStatus` shortlist, but `0x80550000` is the canonical OPC UA spec value for
`BadSecurityPolicyRejected`, not `BadDeviceFailure`. The correct spec value for
`BadDeviceFailure` is `0x808B0000` (verified against the OPC Foundation
`Opc.Ua.StatusCodes` table via DeepWiki; corroborated locally by
`src/Drivers/ZB.MOM.WW.OtOpcUa.Driver.Galaxy/Runtime/StatusCodeMap.cs:40`
(`BadDeviceFailure = 0x808B0000u`) and the two Wonderware historian quality mappers,
which all hand-pin the correct value).
This is the same shape of bug Driver.Cli.Common-001 closed: a wrong hex literal
in the shortlist that the test theory (`SnapshotFormatterTests.cs:42`) blindly
asserts against the same wrong value, so the bug is invisible to CI.
Practical impact, two-sided:
1. A driver that returns the real spec `BadDeviceFailure` (`0x808B0000`) — e.g.
the `Driver.Galaxy.StatusCodeMap` path on a deploy-time device fault — falls
through the named shortlist entirely. Since Driver.Cli.Common-002 added the
severity-class fallback, it now renders as `0x808B0000 (Bad)` instead of
`0x808B0000 (BadDeviceFailure)` — operators lose the specific class label
`docs/Driver.FOCAS.Cli.md:153` tells them to read off the output.
2. A driver that returns `0x80550000` (which `FocasStatusMapper`, `AbCipStatusMapper`,
`AbLegacyStatusMapper`, `TwinCATStatusMapper`, `S7Driver`, and `ModbusDriver` all
misuse as "BadDeviceFailure") now renders as `0x80550000 (BadDeviceFailure)`
matching driver intent but contradicting the OPC UA spec, which says any client
that decodes the same payload using the OPC Foundation stack will see
`BadSecurityPolicyRejected`. A security-monitoring tool keying on
`BadSecurityPolicyRejected` will fire on a CPU fault, while real
`BadSecurityPolicyRejected` returns from the secure-channel layer would be
mislabelled as a device fault. Operator-facing CLI output and machine-readable
status semantics disagree.
The deeper bug is the wrong constant in the native-protocol mappers (out of scope
for this module), but the `SnapshotFormatter` shortlist is its own
spec-authoritative reference point — Driver.Cli.Common-001 explicitly framed the
shortlist as canonical, with the in-line "keep [these literals] in sync with [the
Opc.Ua.StatusCodes] table" comment at `SnapshotFormatter.cs:112-113`. That
contract is now broken.
**Recommendation:** Change line 129 to `0x808B0000u => "BadDeviceFailure"`. Update
the matching `[InlineData]` rows in `SnapshotFormatterTests.cs` (line 42 in the
well-known Theory; line 60 in the redundant Theory — see Driver.Cli.Common-008).
Also note in the resolution that the native-protocol mappers (FOCAS / AbCip /
AbLegacy / TwinCAT / S7 / Modbus) need the same fix recorded against their own
module reviews — the constant `0x80550000` should be replaced with `0x808B0000`
everywhere it claims to mean `BadDeviceFailure`. Consider Driver.Cli.Common-001's
original recommendation again: add a CI test that cross-checks every shortlist
entry against `Opc.Ua.StatusCodes` reflection so this class of bug stops
recurring.
**Resolution:** Resolved 2026-05-23 — corrected `SnapshotFormatter.FormatStatus`
to map `0x808B0000u => "BadDeviceFailure"` (was `0x80550000u`). Updated the
`InlineData` row in the well-known Theory accordingly; the redundant native-
emitted Theory was deleted entirely per Driver.Cli.Common-008. Added a regression
row to `FormatStatus_does_not_apply_pre_fix_wrong_names` pinning that
`0x80550000` no longer renders as `BadDeviceFailure` (mirroring the
Driver.Cli.Common-001 wrong-name guards). The underlying constant was also
corrected in all six native-protocol mappers as part of the same commit:
`FocasStatusMapper.BadDeviceFailure`, `AbCipStatusMapper.BadDeviceFailure`,
`AbLegacyStatusMapper.BadDeviceFailure`, `TwinCATStatusMapper.BadDeviceFailure`,
`ModbusDriver.StatusBadDeviceFailure`, `S7Driver.StatusBadDeviceFailure` — all
moved from `0x80550000u` to `0x808B0000u`. The three downstream Modbus tests
(`ModbusExceptionMapperTests` 3 InlineData rows + 1 ShouldBe assertion;
`ExceptionInjectionTests.StatusBadDeviceFailure` constant) updated to expect
the corrected code. **Behavior change:** OPC UA clients consuming the native
drivers now see the canonical `BadDeviceFailure` (0x808B0000) instead of the
misnamed `BadSecurityPolicyRejected` (0x80550000) on device-fault paths —
operator-facing CLI output and machine-readable status semantics now agree.
Suite totals after fix: Driver.Cli.Common.Tests 43 green (was 48 — minus 5
redundant rows); Modbus.Tests 263; AbCip.Tests 262; AbLegacy.Tests 157;
FOCAS.Tests 178; S7.Tests 112; TwinCAT.Tests 131; all green. The Opc.Ua.StatusCodes
cross-check the recommendation suggested is recorded as a follow-up worth
considering but is out of scope for this fix.
### Driver.Cli.Common-008
| Field | Value |
|---|---|
| Severity | Low |
| Category | Testing coverage |
| Location | `tests/Drivers/Cli/ZB.MOM.WW.OtOpcUa.Driver.Cli.Common.Tests/SnapshotFormatterTests.cs:50-64` |
| Status | Resolved |
**Description:** Commit `5a9c459` adds a new
`FormatStatus_names_native_driver_emitted_codes` `[Theory]` whose five
`[InlineData]` rows are identical to five rows added to the existing
`FormatStatus_names_well_known_status_codes` `[Theory]` in the same commit
(lines 32, 39, 40, 41, 42). The new Theory therefore adds no coverage. It is
also weaker than the Theory it duplicates: it asserts
`output.ShouldContain($"({expectedName})")` (substring match) where the
well-known Theory asserts `output.ShouldBe($"0x{status:X8} ({expectedName})")`
(exact match including the hex prefix). The substring form would not catch a
regression where the hex literal renders wrong but the name is correct.
This is not a correctness problem — both Theories pass — but it's a
copy-paste inconsistency that costs maintainer attention every time someone
reads the test file and wonders which Theory is authoritative.
**Recommendation:** Either (a) delete the new Theory entirely — its five rows
are already covered by the well-known Theory in the same commit — or (b) keep
it but switch to `ShouldBe($"0x{status:X8} ({expectedName})")` so its
assertion strength matches the rest of the file. Option (a) is cleaner: the
commit's "operator workflow" intent is documented well enough in the
well-known Theory comment block; the redundant Theory is dead weight.
**Resolution:** Resolved 2026-05-23 — took option (a): deleted the
`FormatStatus_names_native_driver_emitted_codes` Theory entirely. Its five
`InlineData` rows are covered by the well-known Theory's `ShouldBe` (strict
exact-match assertion), which is the authoritative shortlist test. Landed
alongside the Driver.Cli.Common-007 fix in the same commit.
+16 -12
View File
@@ -187,7 +187,7 @@ the source-level convention against regression.
| Severity | Low |
| Category | Design-document adherence |
| Location | `Commands/WriteCommand.cs:50`, `Commands/ProbeCommand.cs:50` (via `SnapshotFormatter.FormatStatus`) |
| Status | Deferred |
| Status | Resolved |
**Description:** `docs/Driver.FOCAS.Cli.md` documents `BadDeviceFailure` and
`BadCommunicationError` as the key diagnostic signals an operator reads off
@@ -214,14 +214,18 @@ actually emit — at minimum `BadNotWritable`, `BadOutOfRange`, `BadNotSupported
because the gap defeats this module documented `probe`/`write` diagnostic
workflow; cross-reference the `Driver.Cli.Common` review.
**Resolution:** Deferred 2026-05-23 — the recommended fix lives in
`SnapshotFormatter.FormatStatus` inside the `Driver.Cli.Common` shared module,
which is outside this module's edit scope. Driver.Cli.Common-001 / -002 have
already corrected the existing shortlist mappings and added a severity-class
fallback so the FOCAS-emitted codes now at least render with a "Bad" /
"Uncertain" / "Good" suffix rather than bare hex; explicitly naming
`BadNotWritable`, `BadOutOfRange`, `BadNotSupported`, `BadDeviceFailure`,
`BadInternalError`, and the canonical `BadTimeout` (0x800A0000) belongs to
the Driver.Cli.Common review's follow-up (and benefits every driver CLI, not
just FOCAS). Re-open here only if Driver.Cli.Common declines to extend the
shortlist.
**Resolution:** Resolved 2026-05-23 — the cross-CLI fix landed in `Driver.Cli.Common`:
`SnapshotFormatter.FormatStatus` now names `BadInternalError` (0x80020000),
`BadNotWritable` (0x803B0000), `BadOutOfRange` (0x803C0000), `BadNotSupported`
(0x803D0000), and `BadDeviceFailure` (0x80550000) — the five codes the FOCAS /
AbCip / AbLegacy native-protocol mappers all emit but the shortlist previously
left unnamed (the canonical `BadTimeout` 0x800A0000 was already added under
Driver.Cli.Common-001). FOCAS `probe` / `write` against a non-writable parameter,
out-of-range address, unsupported function, busy device, or CNC-handle failure
now renders with the named status the `docs/Driver.FOCAS.Cli.md` workflow
promises, restoring parity between the docs and the shipped behaviour. Regression
`[Theory]` `FormatStatus_names_native_driver_emitted_codes` added to
`SnapshotFormatterTests` so the five names can't silently drop out of the
shortlist again; the existing well-known shortlist `[Theory]` was extended with
the same five entries to enforce the exact `0x... (Name)` rendering. Suite now
47 green (was 42).
+156 -6
View File
@@ -4,8 +4,8 @@
|---|---|
| Module | `src/Drivers/ZB.MOM.WW.OtOpcUa.Driver.Galaxy` |
| Reviewer | Claude Code |
| Review date | 2026-05-22 |
| Commit reviewed | `76d35d1` |
| Review date | 2026-05-23 |
| Commit reviewed | `a9be809` |
| Status | Reviewed |
| Open findings | 0 |
@@ -17,12 +17,39 @@
| 2 | OtOpcUa conventions | Driver.Galaxy-005 |
| 3 | Concurrency & thread safety | Driver.Galaxy-006, Driver.Galaxy-007 |
| 4 | Error handling & resilience | Driver.Galaxy-001, Driver.Galaxy-008, Driver.Galaxy-009 |
| 5 | Security | Driver.Galaxy-010 |
| 6 | Performance & resource management | Driver.Galaxy-011, Driver.Galaxy-012 |
| 7 | Design-document adherence | Driver.Galaxy-013 |
| 5 | Security | Driver.Galaxy-010, Driver.Galaxy-015 |
| 6 | Performance & resource management | Driver.Galaxy-011, Driver.Galaxy-012, Driver.Galaxy-016 |
| 7 | Design-document adherence | Driver.Galaxy-013, Driver.Galaxy-017 |
| 8 | Code organization & conventions | No issues found |
| 9 | Testing coverage | Driver.Galaxy-014 |
| 10 | Documentation & comments | Driver.Galaxy-005, Driver.Galaxy-013 |
| 10 | Documentation & comments | Driver.Galaxy-005, Driver.Galaxy-013, Driver.Galaxy-018 |
## Re-review 2026-05-23 (commit `a9be809`)
The only code-affecting change since `76d35d1` was commit `994997b` — the
sibling `mxaccessgw` repo restructured (the `clients/dotnet/MxGateway.Client`
project path and the `MxGateway.Contracts.Proto` namespace both moved), and
the driver's path-based `ProjectReference` started producing 87 build errors
solution-wide. The fix is build-time only: the broken `ProjectReference` was
replaced with `<Reference HintPath="libs\…">` items pointing at vendored
binary copies of `MxGateway.Client.dll` (99 KB, May 2026 known-good build)
and `MxGateway.Contracts.dll` (490 KB), and five `PackageReference`s that
the dropped project was previously providing transitively (`Google.Protobuf`,
`Grpc.Core.Api`, `Grpc.Net.Client`, `Microsoft.Extensions.Logging.Abstractions`,
`Polly`) were declared explicitly. The matching `Tests` csproj got the same
binary `<Reference>` for `MxGateway.Contracts` (replacing its own broken
`ProjectReference`). A `libs/README.md` documents what is vendored and the
two unwinding paths (sibling restores a client library, or driver migrates
to the new `ZB.MOM.WW.MxGateway.Contracts.Proto` namespace + reimplements
the `MxGatewayClient` / `MxGatewaySession` / `GalaxyRepositoryClient`
wrapper, ~2,200 LoC).
No `*.cs` file changed; the re-review walked only the categories that apply
to a build-time/packaging change. Categories with no new findings:
Correctness (1), OtOpcUa conventions (2), Concurrency (3), Error handling
(4), Code organization (8), Testing coverage (9). Four new findings are
recorded below (Driver.Galaxy-015..018) — none Critical, none High; two
Medium, two Low.
## Findings
@@ -235,3 +262,126 @@
**Recommendation:** Add unit/parity tests covering: (a) stream fault -> supervisor reopen -> EventPump restart -> `OnDataChange` resumes; (b) `ReplayAsync` updates `SubscriptionRegistry` with new handles; (c) `StatusCodeMap.FromMxStatus` for both success and failure `MxStatusProxy` rows; (d) `DataTypeMap` for every Galaxy `mx_data_type` code including 64-bit integer.
**Resolution:** Resolved 2026-05-22 — added `GalaxyDriverInfrastructureTests` covering `GetMemoryFootprint` (Driver.Galaxy-011) and `IAsyncDisposable` (Driver.Galaxy-007); (a) stream-fault → supervisor reopen → EventPump restart → `OnDataChange` resumes is covered by `EventPumpStreamFaultTests.StreamFault_DrivesReconnectSupervisorReopenReplay` and `FaultedPump_IsNotRestartableInPlace_ButAFreshPumpResumesDispatch` (landed with Driver.Galaxy-001/008 resolution); (b) post-reconnect `ReplayAsync` rebinds handles is covered by `SubscriptionRegistryTests.Rebind_*` suite; (c) `StatusCodeMap.FromMxStatus` success/failure rows are covered by `StatusCodeMapTests.FromMxStatus_SuccessNonZeroAndCategoryOk_IsGood` and `FromMxStatus_SuccessNonZeroButCategoryNotOk_IsNotGood` (landed with Driver.Galaxy-003); (d) `DataTypeMap` for all seven mx_data_type codes including Int64 is covered by `DataTypeMapTests` (landed with Driver.Galaxy-002).
### Driver.Galaxy-015
| Field | Value |
|---|---|
| Severity | ~~Medium~~ Low (re-triaged 2026-05-23) |
| Category | ~~Security~~ Documentation & comments (re-triaged 2026-05-23) |
| Location | `libs/MxGateway.Client.dll`, `libs/MxGateway.Contracts.dll`, `libs/README.md` |
| Status | Resolved |
**Description:** Commit `994997b` checks in two binary DLLs (`MxGateway.Client.dll`, 99 840 bytes; `MxGateway.Contracts.dll`, 489 984 bytes) under `src/Drivers/.../Driver.Galaxy/libs/` and references them via `<Reference HintPath="…" />`. These are the only checked-in binary build artefacts in the entire repo (a repo-wide `find` for non-`bin/`/`obj/` `*.dll` under `libs/` returns only these two), so the change sets a precedent. The accompanying `libs/README.md` states the DLLs are "byte-for-byte the build output" of the OtOpcUa team's own code against the gateway's open proto contracts, but there is no recorded provenance — no source-commit SHA from the sibling `mxaccessgw` repo that produced the build, no SHA-256/SHA-512 checksum, no `.gitattributes` rule marking these paths as binary (so a future churn-in-place will balloon the pack file). Without a recorded source commit + checksum it is impossible for a future reviewer/auditor to verify the binaries match a specific revision of the sibling repo — the assertion "we built them, not external" is unverifiable after the fact. Tampering or accidental swap (e.g. someone drops in a different DLL of the same name under the same path) would not be detectable.
**Recommendation:** (a) Pin the source provenance: add the sibling `mxaccessgw` commit SHA used to build each DLL to `libs/README.md`. (b) Record a SHA-256 of each `.dll` in `libs/README.md` so a future tamper or accidental update is detectable by running `Get-FileHash`/`sha256sum`. (c) Add a `.gitattributes` rule under `libs/` declaring `*.dll binary` (and consider `filter=lfs diff=lfs merge=lfs -text` if/when these need to be updated, to avoid bloating the pack file on every refresh). (d) Optional: a `dotnet test` time-check that compares the on-disk hash to the recorded hash, so a CI run notices if the file drifts from what the README claims.
**Resolution:** Resolved 2026-05-23. **Severity re-triage:** the original
finding framed this as a security concern about "tampering or accidental
swap by an unknown third party"; the user clarified that the DLLs are
their own code, built from their own `mxaccessgw` project — not third-party
binaries. That moves the concern from security (untrusted provenance) to
documentation (audit trail). Re-classified as Low Documentation &
Comments. Fix: `libs/README.md` now carries a Provenance section that
records the source-commit SHA (`dd7ca1634e2d2b8a866c81f0009bf87ee9427750`,
extracted from the `AssemblyInformationalVersion` baked into both DLLs by
the original build) and SHA-256 checksums of both binaries, plus a
re-verification recipe (`sha256sum libs/*.dll` + `ilspycmd <dll> | grep
AssemblyInformationalVersion`). Recommendations (c) `.gitattributes` and
(d) CI hash-check deferred — the DLLs are essentially frozen until one
of the two unwinding paths is taken, so adding LFS or a CI guard would
add infrastructure that the unwinding step would then have to remove.
Re-open if the vendoring becomes a recurring update target.
### Driver.Galaxy-016
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Performance & resource management |
| Location | `ZB.MOM.WW.OtOpcUa.Driver.Galaxy.csproj:43-47`, `libs/README.md:32-37` |
| Status | Resolved |
**Description:** The five new `PackageReference` versions declared in the csproj (`Google.Protobuf` 3.34.1, `Grpc.Core.Api` 2.76.0, `Grpc.Net.Client` 2.71.0, `Microsoft.Extensions.Logging.Abstractions` 10.0.0, `Polly` 8.5.2) do not all match what the vendored `MxGateway.Client.dll` was built against. The DLL's PE metadata (extracted via `System.Reflection.Metadata`) shows references to `Grpc.Net.Client v2.0.0.0`, `Microsoft.Extensions.Logging.Abstractions v10.0.0.0`, and notably `Polly.Core v8.0.0.0` — and the source csproj just before the sibling-repo rename (commit `bd4a09a` from 2026-04-27) declared `Grpc.Net.Client` 2.76.0, `Microsoft.Extensions.Logging.Abstractions` 10.0.7, and `Polly.Core` 8.6.6 — *not* the meta-package `Polly`. Our driver pulls `Polly` 8.5.2 (which transitively pins `Polly.Core` 8.5.2 per its nuspec dependency), so the vendored client actually loads `Polly.Core` 8.5.2 at runtime against code compiled against 8.6.6. Across an 8.5 ↔ 8.6 minor delta this is usually safe (assembly-version is `v8.0.0.0` for both), but it is exactly the skew shape that surfaces as `MissingMethodException` if a 8.6-only API was used in the client. `libs/README.md` claims "versions match what the sibling repo's `ZB.MOM.WW.MxGateway.Contracts.csproj` uses so the gRPC + proto runtime stays binary-compatible" — that statement is correct only for `Google.Protobuf` and `Grpc.Core.Api`; the other three packages do not match.
**Recommendation:** Reconcile the declared package versions with what the vendored DLLs were built against — bump to `Grpc.Net.Client` 2.76.0, `Microsoft.Extensions.Logging.Abstractions` 10.0.7, swap `Polly` for `Polly.Core` 8.6.6 (the driver does not import the `Polly` legacy v7 surface, only Polly.Core via the client). Alternatively, rebuild the vendored DLLs against the same versions the csproj declares and refresh the binaries. Update `libs/README.md` to record the exact versions the DLLs were built against, so the next vendoring refresh has an authoritative reference.
**Resolution:** Resolved 2026-05-23 — took the first option (reconcile
declared packages with what the DLL was built against, verified by
reflecting `Assembly.GetReferencedAssemblies()` on `MxGateway.Client.dll`).
Changes to the csproj: **`Polly` 8.5.2 → `Polly.Core` 8.6.6** (the most
consequential — `Polly` (v7 fluent API) and `Polly.Core` (v8 resilience-
pipeline API) are different packages, and the DLL was built against
`Polly.Core`; the prior `Polly` reference would have failed at runtime
with `MissingMethodException` the first time the gateway client's retry
pipeline ran). Also bumped `Grpc.Net.Client` 2.71.0 → 2.76.0 and
`Microsoft.Extensions.Logging.Abstractions` 10.0.0 → 10.0.7 to match the
sibling Server/Worker projects' current versions. `Google.Protobuf`
3.34.1 and `Grpc.Core.Api` 2.76.0 already matched; left unchanged.
`libs/README.md` rewritten to record what was actually verified
(`Assembly.GetReferencedAssemblies()` output + the resolved package
versions, including the sibling Server/Worker csproj as the version
source-of-truth — the deleted MxGateway.Client.csproj would have been
the original source but no longer exists). Verification: solution-wide
`dotnet build` clean, Driver.Galaxy.Tests 245/245 pass against the
corrected package set.
### Driver.Galaxy-017
| Field | Value |
|---|---|
| Severity | Low |
| Category | Design-document adherence |
| Location | `src/Drivers/ZB.MOM.WW.OtOpcUa.Driver.Galaxy/` (no source change), gateway proto contract |
| Status | Deferred |
**Description:** The vendored `MxGateway.Contracts.dll` only carries the OLD `MxGateway.Contracts.Proto[.Galaxy]` namespace (PE-namespace dump confirms — `MxGateway.Client`, `MxGateway.Contracts`, `MxGateway.Contracts.Proto`, `MxGateway.Contracts.Proto.Galaxy` only). The sibling `mxaccessgw` repo's live `Protos/mxaccess_gateway.proto`, `mxaccess_worker.proto`, and `galaxy_repository.proto` files now generate into `ZB.MOM.WW.MxGateway.Contracts.Proto.*`. The proto wire format itself can still evolve (new RPCs, renamed fields, removed fields) and the driver has no contract-version handshake (a repo-wide search for `ContractVersion|ProtocolVersion|ApiVersion|WireVersion` in the driver returns nothing) — so a gateway service that evolves its proto past what the vendored client knows will fail silently at runtime: gRPC `UNIMPLEMENTED` for a renamed RPC, default-value reads for a removed scalar field, or worse, a wire-tag collision if a field number is reused. The risk surface grew with vendoring: previously the `ProjectReference` would have hard-failed at build time if the proto changed shape; now the driver builds green against a frozen contract that may not match the running gateway.
**Recommendation:** (a) Add a single `Ping`/`GetVersion` RPC call at gateway-session open, comparing the gateway's reported contract version against a string baked into `libs/README.md` (or a `GatewayContractVersion` const) and refusing the session on mismatch with a clear log. (b) Document in `libs/README.md` the exact mxaccessgw commit SHA (and proto-file SHA-256s) the vendored DLLs were built from, so a parity-rig operator can grep the live gateway for the matching commit. (c) Add a soak/parity test that asserts the live gateway's proto descriptor still matches what the vendored DLL expects — fail loud rather than degrade.
**Resolution:** Deferred 2026-05-23 — the recommendation's part (b)
(record the mxaccessgw source-commit SHA in `libs/README.md`) is satisfied
by the Driver.Galaxy-015 resolution, which records both DLLs were built
from mxaccessgw commit `dd7ca1634e2d2b8a866c81f0009bf87ee9427750`. Parts
(a) and (c) — adding a `GetVersion` RPC at session-open and a parity
test against the live gateway's proto descriptor — are substantial new
RPC + plumbing work that is not in scope for this code-review-resolution
sweep. The risk surface is bounded because either of the two unwinding
paths in `libs/README.md` (sibling repo restores `MxGateway.Client.csproj`,
or this driver migrates to the new namespace) will move the codebase
past the vendoring + close this concern naturally. Re-open if neither
unwinding path is taken within the next quarter and the live gateway
service does evolve its proto under the driver.
### Driver.Galaxy-018
| Field | Value |
|---|---|
| Severity | Low |
| Category | Documentation & comments |
| Location | `libs/README.md:32-37`, `ZB.MOM.WW.OtOpcUa.Driver.Galaxy.csproj:40-47` |
| Status | Resolved |
**Description:** Several small documentation issues in the vendoring artefacts:
1. `libs/README.md` says "Versions match what the sibling repo's `ZB.MOM.WW.MxGateway.Contracts.csproj` uses" — but `ZB.MOM.WW.MxGateway.Contracts.csproj` only declares `Google.Protobuf` 3.34.1 and `Grpc.Core.Api` 2.76.0; the other three packages (`Grpc.Net.Client`, `Microsoft.Extensions.Logging.Abstractions`, `Polly`) come from the (now-deleted) `MxGateway.Client.csproj`, not the contracts csproj. The README points at the wrong source-of-truth file. See Driver.Galaxy-016 for the related version-skew issue.
2. `libs/README.md` says the DLLs "are built against net10.0" — accurate, but the README should also pin the source-commit SHA from `mxaccessgw` that produced the build (currently no such reference). Without it, "May 2026" is the only locator and a future refresh has no fixed point to roll back to.
3. The two `<Reference>` items in the csproj omit `<SpecificVersion>false</SpecificVersion>`. The vendored DLLs carry `AssemblyVersion 1.0.0.0`; MSBuild's default for `<Reference HintPath>` items is `SpecificVersion=true` only when the `Include` attribute contains version info, which it does not here, so this is benign — but spelling it out (`<SpecificVersion>false</SpecificVersion>`) would make a future refresh that bumps the AssemblyVersion robust without csproj edits.
4. The csproj `<Reference Include="MxGateway.Client">` value relies on the bare assembly simple-name; an explicit `<Reference Include="MxGateway.Client, Version=1.0.0.0, Culture=neutral, PublicKeyToken=null">` plus `<SpecificVersion>false</SpecificVersion>` would document the contract surface inside the csproj where a reviewer reads it.
**Recommendation:** (a) Update `libs/README.md` to (i) point at `MxGateway.Client.csproj` for the `Grpc.Net.Client`/`Microsoft.Extensions.Logging.Abstractions`/`Polly` version source, (ii) record the mxaccessgw commit SHA the vendored binaries were built from, and (iii) record SHA-256 hashes (see Driver.Galaxy-015). (b) Add `<SpecificVersion>false</SpecificVersion>` to both `<Reference>` items in the csproj to make the intent explicit and refresh-robust.
**Resolution:** Resolved 2026-05-23 — most of (a) was addressed alongside
Driver.Galaxy-015 + -016: `libs/README.md` rewritten to (i) point at the
sibling Server/Worker csproj as the live version source-of-truth (the
`MxGateway.Client.csproj` cited in the recommendation no longer exists —
the deleted-csproj reference would not have been actionable for a
future reader), (ii) record source commit
`dd7ca1634e2d2b8a866c81f0009bf87ee9427750`, and (iii) record SHA-256
checksums of both vendored DLLs. (b) `<SpecificVersion>false</SpecificVersion>`
was intentionally NOT added — the vendored DLL's AssemblyVersion is
`1.0.0.0` and MSBuild's default for `<Reference HintPath>` Include="bare-name"
items is already `SpecificVersion=false`, so the spelling-it-out
recommendation would be cosmetic without changing behaviour. If the
vendored DLLs are ever refreshed against a build with a different
`AssemblyVersion` the explicit attribute could be added then; for now
the existing csproj works correctly.
+19 -7
View File
@@ -19,17 +19,17 @@ Each module's `findings.md` is the source of truth; this file is generated from
| [Core](Core/findings.md) | Claude Code | 2026-05-22 | `76d35d1` | Reviewed | 0 | 12 |
| [Core.Abstractions](Core.Abstractions/findings.md) | Claude Code | 2026-05-22 | `76d35d1` | Reviewed | 0 | 8 |
| [Core.AlarmHistorian](Core.AlarmHistorian/findings.md) | Claude Code | 2026-05-22 | `76d35d1` | Reviewed | 0 | 11 |
| [Core.ScriptedAlarms](Core.ScriptedAlarms/findings.md) | Claude Code | 2026-05-22 | `76d35d1` | Reviewed | 0 | 12 |
| [Core.Scripting](Core.Scripting/findings.md) | Claude Code | 2026-05-22 | `76d35d1` | Reviewed | 0 | 11 |
| [Core.ScriptedAlarms](Core.ScriptedAlarms/findings.md) | Claude Code | 2026-05-23 | `a9be809` | Reviewed | 0 | 13 |
| [Core.Scripting](Core.Scripting/findings.md) | Claude Code | 2026-05-23 | `a9be809` | Reviewed | 0 | 16 |
| [Core.VirtualTags](Core.VirtualTags/findings.md) | Claude Code | 2026-05-22 | `76d35d1` | Reviewed | 0 | 13 |
| [Driver.AbCip](Driver.AbCip/findings.md) | Claude Code | 2026-05-22 | `76d35d1` | Reviewed | 0 | 15 |
| [Driver.AbCip.Cli](Driver.AbCip.Cli/findings.md) | Claude Code | 2026-05-22 | `76d35d1` | Reviewed | 0 | 8 |
| [Driver.AbLegacy](Driver.AbLegacy/findings.md) | Claude Code | 2026-05-22 | `76d35d1` | Reviewed | 0 | 13 |
| [Driver.AbLegacy.Cli](Driver.AbLegacy.Cli/findings.md) | Claude Code | 2026-05-22 | `76d35d1` | Reviewed | 0 | 7 |
| [Driver.Cli.Common](Driver.Cli.Common/findings.md) | Claude Code | 2026-05-22 | `76d35d1` | Reviewed | 0 | 6 |
| [Driver.Cli.Common](Driver.Cli.Common/findings.md) | Claude Code | 2026-05-23 | `a9be809` | Reviewed | 0 | 8 |
| [Driver.FOCAS](Driver.FOCAS/findings.md) | Claude Code | 2026-05-22 | `76d35d1` | Reviewed | 0 | 12 |
| [Driver.FOCAS.Cli](Driver.FOCAS.Cli/findings.md) | Claude Code | 2026-05-22 | `76d35d1` | Reviewed | 0 | 5 |
| [Driver.Galaxy](Driver.Galaxy/findings.md) | Claude Code | 2026-05-22 | `76d35d1` | Reviewed | 0 | 14 |
| [Driver.Galaxy](Driver.Galaxy/findings.md) | Claude Code | 2026-05-23 | `a9be809` | Reviewed | 0 | 18 |
| [Driver.Historian.Wonderware](Driver.Historian.Wonderware/findings.md) | Claude Code | 2026-05-22 | `76d35d1` | Reviewed | 0 | 12 |
| [Driver.Historian.Wonderware.Client](Driver.Historian.Wonderware.Client/findings.md) | Claude Code | 2026-05-22 | `76d35d1` | Reviewed | 0 | 10 |
| [Driver.Modbus](Driver.Modbus/findings.md) | Claude Code | 2026-05-22 | `76d35d1` | Reviewed | 0 | 12 |
@@ -75,6 +75,7 @@ Findings with status `Resolved`, `Won't Fix`, or `Deferred`.
| Core.AlarmHistorian-006 | High | Resolved | Error handling & resilience | `src/Core/ZB.MOM.WW.OtOpcUa.Core.AlarmHistorian/SqliteStoreAndForwardSink.cs:103,135-216` |
| Core.ScriptedAlarms-001 | High | Resolved | Concurrency & thread safety | `ScriptedAlarmEngine.cs:175`, `ScriptedAlarmEngine.cs:178`, `ScriptedAlarmEngine.cs:73`, `ScriptedAlarmEngine.cs:368` |
| Core.Scripting-002 | High | Resolved | Security | `ForbiddenTypeAnalyzer.cs:70` |
| Core.Scripting-012 | High | Resolved | Security | `ForbiddenTypeAnalyzer.cs:60-76`, `ScriptSandbox.cs:96-126` |
| Core.VirtualTags-001 | High | Resolved | Correctness & logic bugs | `src/Core/ZB.MOM.WW.OtOpcUa.Core.VirtualTags/VirtualTagEngine.cs:306` |
| Driver.AbCip-001 | High | Resolved | Correctness & logic bugs | `AbCipDriver.cs:111`, `AbCipDriver.cs:163-167` |
| Driver.AbCip-002 | High | Resolved | Correctness & logic bugs | `AbCipStatusMapper.cs:65-78` |
@@ -83,6 +84,7 @@ Findings with status `Resolved`, `Won't Fix`, or `Deferred`.
| Driver.AbLegacy-001 | High | Resolved | Correctness & logic bugs | `AbLegacyAddress.cs:54`, `AbLegacyDriver.cs:368-374` |
| Driver.AbLegacy-006 | High | Resolved | Concurrency & thread safety | `AbLegacyDriver.cs:107-158`, `AbLegacyDriver.cs:162-234`, `LibplctagLegacyTagRuntime.cs` |
| Driver.Cli.Common-001 | High | Resolved | Correctness & logic bugs | `src/Drivers/Cli/ZB.MOM.WW.OtOpcUa.Driver.Cli.Common/SnapshotFormatter.cs:106-119` |
| Driver.Cli.Common-007 | High | Resolved | Correctness & logic bugs | `src/Drivers/Cli/ZB.MOM.WW.OtOpcUa.Driver.Cli.Common/SnapshotFormatter.cs:129` |
| Driver.FOCAS-001 | High | Resolved | Correctness & logic bugs | `FocasDriverFactoryExtensions.cs:54-86`, `FocasDriverFactoryExtensions.cs:132-140` |
| Driver.FOCAS-002 | High | Resolved | Correctness & logic bugs | `WireFocasClient.cs:164-179`, `FocasDriver.cs:513`, `FocasDriver.cs:593` |
| Driver.Galaxy-002 | High | Resolved | Correctness & logic bugs | `Browse/DataTypeMap.cs:13`, `Runtime/MxValueDecoder.cs:9` |
@@ -149,6 +151,9 @@ Findings with status `Resolved`, `Won't Fix`, or `Deferred`.
| Core.Scripting-004 | Medium | Resolved | Correctness & logic bugs | `DependencyExtractor.cs:73` |
| Core.Scripting-007 | Medium | Resolved | Error handling & resilience | `TimedScriptEvaluator.cs:60` |
| Core.Scripting-010 | Medium | Resolved | Testing coverage | `tests/Core/ZB.MOM.WW.OtOpcUa.Core.Scripting.Tests/ScriptSandboxTests.cs:54` |
| Core.Scripting-013 | Medium | Resolved | Security | `ScriptEvaluator.cs:202-225` (`BuildWrapperSource`) |
| Core.Scripting-014 | Medium | Resolved | Concurrency & thread safety | `CompiledScriptCache.cs:91-103` (`Clear`) |
| Core.Scripting-016 | Medium | Resolved | Performance & resource management | `src/Core/ZB.MOM.WW.OtOpcUa.Core.VirtualTags/VirtualTagEngine.cs:74-117`, `src/Core/ZB.MOM.WW.OtOpcUa.Core.ScriptedAlarms/ScriptedAlarmEngine.cs:139-182` |
| Core.VirtualTags-002 | Medium | Resolved | Correctness & logic bugs | `src/Core/ZB.MOM.WW.OtOpcUa.Core.VirtualTags/VirtualTagEngine.cs:237` |
| Core.VirtualTags-003 | Medium | Resolved | Correctness & logic bugs | `src/Core/ZB.MOM.WW.OtOpcUa.Core.VirtualTags/VirtualTagEngine.cs:117-120` |
| Core.VirtualTags-005 | Medium | Resolved | Concurrency & thread safety | `src/Core/ZB.MOM.WW.OtOpcUa.Core.VirtualTags/VirtualTagSource.cs:50-64` |
@@ -186,6 +191,7 @@ Findings with status `Resolved`, `Won't Fix`, or `Deferred`.
| Driver.Galaxy-009 | Medium | Resolved | Error handling & resilience | `GalaxyDriver.cs:354-371` |
| Driver.Galaxy-011 | Medium | Resolved | Performance & resource management | `GalaxyDriver.cs:411` |
| Driver.Galaxy-014 | Medium | Resolved | Testing coverage | `src/Drivers/ZB.MOM.WW.OtOpcUa.Driver.Galaxy` (module-wide) |
| Driver.Galaxy-016 | Medium | Resolved | Performance & resource management | `ZB.MOM.WW.OtOpcUa.Driver.Galaxy.csproj:43-47`, `libs/README.md:32-37` |
| Driver.Historian.Wonderware-002 | Medium | Resolved | Correctness and logic bugs | `Ipc/HistorianFrameHandler.cs:162`, `:181` |
| Driver.Historian.Wonderware-003 | Medium | Resolved | Correctness and logic bugs | `Backend/HistorianDataSource.cs:320-323`, `:457-460` |
| Driver.Historian.Wonderware-006 | Medium | Resolved | Error handling and resilience | `Ipc/PipeServer.cs:120-128` |
@@ -281,14 +287,16 @@ Findings with status `Resolved`, `Won't Fix`, or `Deferred`.
| Core.ScriptedAlarms-003 | Low | Resolved | Documentation & comments | `ScriptedAlarmEngine.cs:343`, `docs/ScriptedAlarms.md:107` |
| Core.ScriptedAlarms-006 | Low | Resolved | Concurrency & thread safety | `ScriptedAlarmEngine.cs:232`, `ScriptedAlarmEngine.cs:369` |
| Core.ScriptedAlarms-008 | Low | Resolved | Performance & resource management | `Part9StateMachine.cs:261-268` |
| Core.ScriptedAlarms-009 | Low | Won't Fix | Performance & resource management | `ScriptedAlarmEngine.cs:309-315`, `ScriptedAlarmEngine.cs:271` |
| Core.ScriptedAlarms-009 | Low | Resolved | Performance & resource management | `ScriptedAlarmEngine.cs:309-315`, `ScriptedAlarmEngine.cs:271` |
| Core.ScriptedAlarms-010 | Low | Resolved | Design-document adherence | `ScriptedAlarmEngine.cs:325-336`, `AlarmPredicateContext.cs:33-40`, `MessageTemplate.cs:47` |
| Core.ScriptedAlarms-011 | Low | Resolved | Code organization & conventions | `Part9StateMachine.cs:275` |
| Core.ScriptedAlarms-013 | Low | Resolved | Documentation & comments | `ScriptedAlarmEngine.cs:66-81` |
| Core.Scripting-005 | Low | Resolved | Correctness & logic bugs | `DependencyExtractor.cs:97` |
| Core.Scripting-006 | Low | Resolved | Concurrency & thread safety | `CompiledScriptCache.cs:55` |
| Core.Scripting-008 | Low | Won't Fix | Performance & resource management | `CompiledScriptCache.cs:34`, `ScriptEvaluator.cs:34` |
| Core.Scripting-008 | Low | Resolved | Performance & resource management | `CompiledScriptCache.cs:34`, `ScriptEvaluator.cs:34` |
| Core.Scripting-009 | Low | Resolved | Design-document adherence | `ForbiddenTypeAnalyzer.cs:45` |
| Core.Scripting-011 | Low | Resolved | Testing coverage | `tests/Core/ZB.MOM.WW.OtOpcUa.Core.Scripting.Tests/` |
| Core.Scripting-015 | Low | Resolved | Correctness & logic bugs | `ScriptEvaluator.cs:234-270` (`ToCSharpTypeName`) |
| Core.VirtualTags-004 | Low | Resolved | Correctness & logic bugs | `src/Core/ZB.MOM.WW.OtOpcUa.Core.VirtualTags/VirtualTagEngine.cs:349` |
| Core.VirtualTags-006 | Low | Resolved | Concurrency & thread safety | `src/Core/ZB.MOM.WW.OtOpcUa.Core.VirtualTags/VirtualTagEngine.cs:177-182`, `:395-401` |
| Core.VirtualTags-007 | Low | Resolved | Error handling & resilience | `src/Core/ZB.MOM.WW.OtOpcUa.Core.VirtualTags/TimerTriggerScheduler.cs:58` |
@@ -318,6 +326,7 @@ Findings with status `Resolved`, `Won't Fix`, or `Deferred`.
| Driver.AbLegacy.Cli-007 | Low | Resolved | Testing coverage | `tests/Drivers/Cli/ZB.MOM.WW.OtOpcUa.Driver.AbLegacy.Cli.Tests/WriteCommandParseValueTests.cs` |
| Driver.Cli.Common-004 | Low | Resolved | Error handling & resilience | `src/Drivers/Cli/ZB.MOM.WW.OtOpcUa.Driver.Cli.Common/SnapshotFormatter.cs:68-70` |
| Driver.Cli.Common-006 | Low | Resolved | Documentation & comments | `src/Drivers/Cli/ZB.MOM.WW.OtOpcUa.Driver.Cli.Common/SnapshotFormatter.cs:71`, `src/Drivers/Cli/ZB.MOM.WW.OtOpcUa.Driver.Cli.Common/DriverCommandBase.cs:9` |
| Driver.Cli.Common-008 | Low | Resolved | Testing coverage | `tests/Drivers/Cli/ZB.MOM.WW.OtOpcUa.Driver.Cli.Common.Tests/SnapshotFormatterTests.cs:50-64` |
| Driver.FOCAS-007 | Low | Resolved | Error handling & resilience | `FocasDriver.cs:140-148`, `FocasDriver.cs:478-484`, `FocasDriver.cs:529-533`, `FocasAlarmProjection.cs:61-63` |
| Driver.FOCAS-008 | Low | Resolved | Performance & resource management | `FocasDriver.cs:201`, `FocasDriver.cs:253` |
| Driver.FOCAS-009 | Low | Resolved | Design-document adherence | `FocasDriverOptions.cs:110-115`, `FocasDriver.cs:468-486`, `FocasDriverFactoryExtensions.cs:75-80` |
@@ -327,11 +336,13 @@ Findings with status `Resolved`, `Won't Fix`, or `Deferred`.
| Driver.FOCAS.Cli-002 | Low | Resolved | Concurrency & thread safety | `Commands/SubscribeCommand.cs:45-51` |
| Driver.FOCAS.Cli-003 | Low | Resolved | Error handling & resilience | `FocasCommandBase.cs:19` (`CncPort`), `FocasCommandBase.cs:27` (`TimeoutMs`), `Commands/SubscribeCommand.cs:23` (`IntervalMs`) |
| Driver.FOCAS.Cli-004 | Low | Resolved | Performance & resource management | `Commands/ProbeCommand.cs:37,54`; `Commands/ReadCommand.cs:37,46`; `Commands/WriteCommand.cs:45,54`; `Commands/SubscribeCommand.cs:39,73` |
| Driver.FOCAS.Cli-005 | Low | Deferred | Design-document adherence | `Commands/WriteCommand.cs:50`, `Commands/ProbeCommand.cs:50` (via `SnapshotFormatter.FormatStatus`) |
| Driver.FOCAS.Cli-005 | Low | Resolved | Design-document adherence | `Commands/WriteCommand.cs:50`, `Commands/ProbeCommand.cs:50` (via `SnapshotFormatter.FormatStatus`) |
| Driver.Galaxy-005 | Low | Resolved | OtOpcUa conventions | `Runtime/EventPump.cs:81-88` |
| Driver.Galaxy-010 | Low | Resolved | Security | `GalaxyDriver.cs:311-341` |
| Driver.Galaxy-012 | Low | Resolved | Performance & resource management | `Runtime/SubscriptionRegistry.cs:65-67`, `GalaxyDriver.cs:538`, `GalaxyDriver.cs:675` |
| Driver.Galaxy-013 | Low | Resolved | Design-document adherence | `GalaxyDriver.cs:14-27`, `GalaxyDriver.cs:374-382`, `Config/GalaxyDriverOptions.cs:84-86` |
| Driver.Galaxy-017 | Low | Deferred | Design-document adherence | `src/Drivers/ZB.MOM.WW.OtOpcUa.Driver.Galaxy/` (no source change), gateway proto contract |
| Driver.Galaxy-018 | Low | Resolved | Documentation & comments | `libs/README.md:32-37`, `ZB.MOM.WW.OtOpcUa.Driver.Galaxy.csproj:40-47` |
| Driver.Historian.Wonderware-004 | Low | Resolved | Correctness and logic bugs | `Backend/SdkAlarmHistorianWriteBackend.cs:198-201` |
| Driver.Historian.Wonderware-005 | Low | Resolved | Concurrency and thread safety | `Backend/HistorianDataSource.cs:124`, `:126-127` |
| Driver.Historian.Wonderware-007 | Low | Resolved | Error handling and resilience | `Ipc/PipeServer.cs:70-75` |
@@ -389,3 +400,4 @@ Findings with status `Resolved`, `Won't Fix`, or `Deferred`.
| Server-012 | Low | Resolved | Performance & resource management | `src/Server/ZB.MOM.WW.OtOpcUa.Server/Hosting/PeerHttpProbeLoop.cs:78-79` |
| Server-014 | Low | Resolved | Code organization & conventions | `src/Server/ZB.MOM.WW.OtOpcUa.Server/SealedBootstrap.cs` |
| Server-015 | Low | Resolved | Documentation & comments | `src/Server/ZB.MOM.WW.OtOpcUa.Server/OpcUa/OtOpcUaServer.cs:16-21`, `src/Server/ZB.MOM.WW.OtOpcUa.Server/OpcUa/OpcUaServerOptions.cs:21-26` |
| Driver.Galaxy-015 | ~~Medium~~ Low (re-triaged 2026-05-23) | Resolved | ~~Security~~ Documentation & comments (re-triaged 2026-05-23) | `libs/MxGateway.Client.dll`, `libs/MxGateway.Contracts.dll`, `libs/README.md` |
+1 -1
View File
@@ -35,7 +35,7 @@ new ScriptedAlarmDefinition(
## Predicate evaluation
Alarm predicates reuse the same Roslyn sandbox as virtual tags — `ScriptEvaluator<AlarmPredicateContext, bool>` compiles the source, `TimedScriptEvaluator` wraps it with the configured timeout (default from `TimedScriptEvaluator.DefaultTimeout`), and `DependencyExtractor` statically harvests the tag paths the script reads. The sandbox rules (forbidden types, cancellation, logging sinks) are documented in [VirtualTags.md](VirtualTags.md); ScriptedAlarms does not redefine them. The known resource limits — unbounded script-side memory, the per-publish accretion of dynamically-emitted script assemblies (Core.Scripting-008), and the orphan-thread CPU-budget caveat — are documented in that file as well.
Alarm predicates reuse the same Roslyn sandbox as virtual tags — `ScriptEvaluator<AlarmPredicateContext, bool>` compiles the source, `TimedScriptEvaluator` wraps it with the configured timeout (default from `TimedScriptEvaluator.DefaultTimeout`), and `DependencyExtractor` statically harvests the tag paths the script reads. The sandbox rules (forbidden types, cancellation, logging sinks) are documented in [VirtualTags.md](VirtualTags.md); ScriptedAlarms does not redefine them. The known resource limits — unbounded script-side memory and the orphan-thread CPU-budget caveat — are documented in that file as well; per-publish assembly accretion was resolved by the Core.Scripting-008 collectible-`AssemblyLoadContext` rewrite and no longer requires periodic server restarts.
`AlarmPredicateContext` (`AlarmPredicateContext.cs`) is the script's `ScriptContext` subclass:
+5 -3
View File
@@ -6,7 +6,7 @@ The runtime is split across two projects: `Core.Scripting` holds the Roslyn sand
## Roslyn script sandbox (`Core.Scripting`)
User scripts are compiled via `Microsoft.CodeAnalysis.CSharp.Scripting` against a `ScriptContext` subclass. `ScriptGlobals<TContext>` exposes the context as a field named `ctx`, so scripts read `ctx.GetTag("...")` / `ctx.SetVirtualTag("...", ...)` / `ctx.Now` / `ctx.Logger` and return a value.
User scripts are compiled via `Microsoft.CodeAnalysis.CSharp` (regular compiler, not the scripting variant — the original `CSharpScript` pipeline was retired by the Core.Scripting-008 / -016 rewrite, see "Compile cache" below). Each script's source is pasted as the body of a synthesized `CompiledScript.Run(ScriptGlobals<TContext>)` method against a `ScriptContext` subclass. `ScriptGlobals<TContext>` exposes the context as a field named `ctx`, so scripts read `ctx.GetTag("...")` / `ctx.SetVirtualTag("...", ...)` / `ctx.Now` / `ctx.Logger` and return a value.
### Compile pipeline (`ScriptEvaluator<TContext, TResult>`)
@@ -30,11 +30,13 @@ Similarly, **`System.Threading.Tasks` is now denied** (Core.Scripting-003), whic
`ConcurrentDictionary<string, Lazy<ScriptEvaluator<...>>>` keyed on `SHA-256(UTF8(source))` rendered to hex. `Lazy<T>` with `ExecutionAndPublication` mode means two threads racing a miss compile exactly once. Failed compiles evict the entry (via the `TryRemove(KeyValuePair<,>)` overload so a concurrently re-added retry entry is not collateral damage — Core.Scripting-006) so a corrected retry can succeed (used during Admin UI authoring). No capacity bound — scripts are operator-authored and bounded by the config DB. Whitespace changes miss the cache on purpose. `Clear()` is called on config-publish.
**Per-publish assembly accretion (accepted limitation, Core.Scripting-008).** Each compiled `ScriptEvaluator` holds a Roslyn `ScriptRunner<T>` delegate, which keeps the dynamically-emitted script assembly loaded for the process lifetime. Emitted assemblies in the default `AssemblyLoadContext` cannot be unloaded; `CompiledScriptCache.Clear()` drops the dictionary entries but does **not** unload the underlying assemblies. Across many config-publish generations (each `Clear()` followed by recompiling every script), the process accumulates dead script assemblies. For the expected "low thousands" of scripts this is benign, but a long-running server with very frequent publishes will see steady managed-memory growth that does not return until the process restarts. Out-of-process script evaluation or a collectible `AssemblyLoadContext` is a v3 concern; deployments with high-publish-frequency requirements should schedule a periodic server restart to reclaim the accrued assemblies.
**Per-publish assembly unload (Core.Scripting-008 resolved).** Each compiled `ScriptEvaluator` emits its script into a dedicated **collectible** `AssemblyLoadContext` — the BCL escape hatch for assemblies that can be unloaded. The compile path is hand-rolled `CSharpCompilation.Create` + `Emit(MemoryStream)` + `ScriptAssemblyLoadContext.LoadFromStream` rather than the legacy `CSharpScript.CreateDelegate` (which emits into the default ALC and cannot be unloaded). `ScriptEvaluator.Dispose()` calls `AssemblyLoadContext.Unload()` and `CompiledScriptCache.Clear()` disposes every materialised evaluator before dropping its dictionary entry, so the emitted assemblies become eligible for GC immediately after a config-publish. The reclaim is GC-timing-sensitive (Unload is *eligible-for-collection*, not synchronous); the next collection cycle reclaims them. Regression tests `Dispose_unloads_compiled_script_assembly_load_context` and `Clear_disposes_every_materialised_evaluator` in `CompiledScriptCacheTests` lock this contract via `WeakReference` + `GC.Collect()` assertions. Server restarts are no longer required to reclaim compiled-script memory.
**Scripting authoring convention.** With the collectible-ALC rewrite, the wrapper around a user script is an ordinary C# static method, not a Roslyn `Script` submission. The script body is pasted verbatim as the method body and must therefore end with an explicit `return …;` per ordinary C# rules — the legacy `CSharpScript` "last expression yields result" shorthand is gone. Every script in the existing test corpus already uses explicit `return`; this convention is operator-visible only when authoring a brand-new script from scratch.
### Per-evaluation timeout (`TimedScriptEvaluator<TContext, TResult>`)
Wraps `ScriptEvaluator` with a wall-clock budget. Default `DefaultTimeout = 250ms`. Implementation pushes the inner `RunAsync` onto `Task.Run` (so a CPU-bound script can't hog the calling thread before `WaitAsync` registers its timeout) then awaits `runTask.WaitAsync(Timeout, ct)`. A `TimeoutException` from `WaitAsync` is wrapped as `ScriptTimeoutException`. Caller-supplied `CancellationToken` cancellation wins over the timeout and propagates as `OperationCanceledException` — so a shutdown cancel is not misclassified. **Known leak:** when a CPU-bound script times out, the underlying `ScriptRunner` keeps running on its thread-pool thread until the Roslyn runtime returns (documented trade-off; out-of-process evaluation is a v3 concern).
Wraps `ScriptEvaluator` with a wall-clock budget. Default `DefaultTimeout = 250ms`. Implementation pushes the inner `RunAsync` onto `Task.Run` (so a CPU-bound script can't hog the calling thread before `WaitAsync` registers its timeout) then awaits `runTask.WaitAsync(Timeout, ct)`. A `TimeoutException` from `WaitAsync` is wrapped as `ScriptTimeoutException`. Caller-supplied `CancellationToken` cancellation wins over the timeout and propagates as `OperationCanceledException` — so a shutdown cancel is not misclassified. **Known leak:** when a CPU-bound script times out, the underlying compiled-script delegate keeps running on its `Task.Run` thread-pool thread until it returns of its own accord (the CT is checked only at evaluator entry; once the script body is running, only the script returning or throwing will release the thread). The post-rewrite delegate is a regular C# `Func<>` bound to the synthesized `CompiledScript.Run` method, so this is a vanilla "synchronous CPU-bound work on a pool thread" leak rather than anything Roslyn-specific. Documented trade-off; out-of-process evaluation is a v3 concern.
### Script logger plumbing
+13 -3
View File
@@ -583,10 +583,20 @@ language binding.
**Depends on:** A.1 merged (proto change live).
**Files** (`c:\Users\dohertj2\Desktop\mxaccessgw\clients\`):
**Files** (`c:\Users\dohertj2\Desktop\mxaccessgw\src\` for .NET — note the sibling
repo restructured after this plan was written; `clients/dotnet/MxGateway.Client.csproj`
no longer exists, the proto contracts now live in
`src/ZB.MOM.WW.MxGateway.Contracts/` under the new namespace
`ZB.MOM.WW.MxGateway.Contracts.Proto[.Galaxy]`; the OtOpcUa driver currently
consumes vendored binaries from the pre-restructure build — see
`src/Drivers/ZB.MOM.WW.OtOpcUa.Driver.Galaxy/libs/README.md`):
1. **.NET** — codegen runs on csproj rebuild via `Grpc.Tools`; just
rebuild `MxGateway.Client.csproj` after pulling A.1.
1. **.NET** — codegen runs on csproj rebuild via `Grpc.Tools`; rebuild
`src/ZB.MOM.WW.MxGateway.Contracts/ZB.MOM.WW.MxGateway.Contracts.csproj`
after pulling A.1. (If unwinding the driver's vendored binaries onto the
new contracts namespace as part of the alarm work, namespace-rename + a
reimplementation of the missing `MxGatewayClient` / `MxGatewaySession`
wrappers is also in scope.)
2. **Python** — run `clients\python\generate-proto.ps1`; commit the
regenerated `_pb2.py` + `_pb2_grpc.py` files under
`clients\python\src\`.
+3 -4
View File
@@ -1,6 +1,6 @@
# High-Level Requirements
> **Revision** — Refreshed 2026-04-19 for the OtOpcUa v2 multi-driver platform (task #205). The original 2025 text described a single-process Galaxy/MXAccess server called LmxOpcUa. Today the project is the **OtOpcUa** multi-driver OPC UA platform deployed as three cooperating processes (Server, Admin, Galaxy.Host). The Galaxy integration is one of seven shipped drivers. HLR-001 through HLR-008 have been rewritten driver-agnostically; HLR-009 has been retired (the embedded Status Dashboard is superseded by the Admin UI). HLR-010 through HLR-017 are new and cover plug-in drivers, resilience, Config DB / draft-publish, cluster redundancy, fleet-wide identifier uniqueness, Admin UI, audit logging, metrics, and the Roslyn capability-wrapping analyzer.
> **Revision** — Refreshed 2026-05-23 for the OtOpcUa v2 multi-driver platform. The original 2025 text described a single-process Galaxy/MXAccess server called LmxOpcUa. Today the project is the **OtOpcUa** multi-driver OPC UA platform deployed as two cooperating processes (Server, Admin). The Galaxy integration is one of seven shipped drivers and is now an in-process Tier-A driver that talks gRPC to a separately installed `mxaccessgw` gateway (sibling repo) — PR 7.2 (2026-04-30) retired the legacy out-of-process `Galaxy.Host` Windows service. HLR-001 through HLR-008 have been rewritten driver-agnostically; HLR-009 has been retired (the embedded Status Dashboard is superseded by the Admin UI). HLR-010 through HLR-017 cover plug-in drivers, resilience, Config DB / draft-publish, cluster redundancy, fleet-wide identifier uniqueness, Admin UI, audit logging, metrics, and the Roslyn capability-wrapping analyzer.
## HLR-001: OPC UA Server
@@ -28,11 +28,10 @@ Drivers whose backend has a native change signal (e.g. Galaxy's `time_of_last_de
## HLR-007: Service Hosting
The system shall be deployed as three cooperating Windows services:
The system shall be deployed as two cooperating Windows services (the legacy `OtOpcUa.Galaxy.Host` x86 host was retired in PR 7.2 — Galaxy access now flows through the separately installed `mxaccessgw` gateway, which lives in a sibling repository and is not part of the OtOpcUa deployment):
- **OtOpcUa.Server** — .NET 10 x64, `Microsoft.Extensions.Hosting` + `AddWindowsService`, hosts all non-Galaxy drivers in-process and the OPC UA endpoint.
- **OtOpcUa.Server** — .NET 10 AnyCPU, `Microsoft.Extensions.Hosting` + `AddWindowsService` (decision #30 replaced the original TopShelf choice), hosts every driver in-process — including the new Tier-A `GalaxyDriver` that speaks gRPC to `mxaccessgw` and the OPC UA endpoint.
- **OtOpcUa.Admin** — .NET 10 x64 Blazor Server web app, hosts the admin UI, SignalR hubs for live updates, `/metrics` Prometheus endpoint, and audit log writers.
- **OtOpcUa.Galaxy.Host** — .NET Framework 4.8 x86 (TopShelf), hosts MXAccess COM + Galaxy Repository SQL + Historian plugin. Talks to `Driver.Galaxy.Proxy` inside `OtOpcUa.Server` via a named pipe (MessagePack over length-prefixed frames, per-process shared secret, SID-restricted ACL).
## HLR-008: Logging
+6 -3
View File
@@ -151,8 +151,11 @@ substantive driver change, and revise this table when the data does.
`SubscriptionRegistry`, or a downstream consumer retaining
`DataValueSnapshot` references past their useful life.
## Scripted-alarm engine — known hot-path allocations
## Scripted-alarm engine — hot-path allocation reuse
`ScriptedAlarmEngine.BuildReadCache` allocates a fresh `Dictionary<string, DataValueSnapshot>` and `AlarmPredicateContext` on every predicate evaluation — i.e. once per upstream tag change per referencing alarm. On a busy line where many tags feeding many alarms change frequently, this is a steady stream of short-lived dictionary allocations on the hot path. (Core.ScriptedAlarms-009)
`ScriptedAlarmEngine` keeps a per-alarm reusable evaluation scratch in `_scratchByAlarmId` — the read-cache `Dictionary<string, DataValueSnapshot>` and the `AlarmPredicateContext` are allocated once per alarm (on first evaluation) and refilled in place across every subsequent predicate evaluation. The hot path no longer allocates a fresh dictionary + context per upstream tag change. (Core.ScriptedAlarms-009)
The allocations are deliberate for now: predicate evaluation is already serialised under `_evalGate`, so a single reused scratch dictionary would be safe, but the per-call dictionary keeps the evaluation surface immutable and trivially safe against future refactors. If a future scripted-alarm soak surfaces allocation pressure on this path, the mitigation is a per-alarm scratch buffer cleared between evaluations — note here before changing the engine.
Safety: reuse is serialised under `_evalGate`, so two threads can never observe the same scratch in a half-refilled state. The context wraps the read-cache by reference, so refilling the dictionary is what the predicate's `ctx.GetTag(path)` calls observe. `LoadAsync` clears `_scratchByAlarmId` alongside `_alarms` so a config-publish drops the prior generation's scratch (a new generation may carry different `Inputs` / `Logger`). Regression tests in `ScriptedAlarmEngineTests` lock the reuse contract:
- `Reevaluation_reuses_the_same_read_cache_dictionary` — asserts dictionary instance identity across two evaluations.
- `Reevaluation_reuses_the_same_predicate_context` — same, for the context.
- `LoadAsync_drops_the_prior_generations_scratch` — asserts a publish resets the scratch.
@@ -29,7 +29,7 @@ Tie-in capability — **historian alarm sink**:
| 3 | Evaluation trigger = **change-driven + timer-driven**; operator chooses per-tag | Change-driven is cheap at steady state; timer is the escape hatch for polling derivations that don't have a discrete "input changed" signal. |
| 4 | Script shape = **Shape A — one script per virtual tag/alarm**; `return` produces the value (or `bool` for alarm condition) | Minimal surface; no predicate/action split. Alarm side-effects (severity, message) configured out-of-band, not in the script. |
| 5 | Alarm fidelity = **full OPC UA Part 9** | Uniform with Galaxy + ALMD on the wire; client-side tooling (HMIs, historians, event pipelines) gets one shape. |
| 6 | Sandbox = **read-only context**; scripts can only read any tag + write to virtual tags | Strict Roslyn `ScriptOptions` allow-list. Authoritative deny-list (`ForbiddenTypeAnalyzer`): namespace-prefix deny `System.IO`, `System.Net`, `System.Diagnostics`, `System.Reflection`, `System.Threading.Tasks` (Task / Parallel fan-out — Core.Scripting-003), `System.Runtime.InteropServices`, `Microsoft.Win32`; type-granular deny `System.Environment`, `System.AppDomain`, `System.GC`, `System.Activator`, `System.Threading.Thread` (these live directly in the allow-listed `System` / `System.Threading` namespaces, so a prefix rule cannot reach them without blocking primitives — Core.Scripting-001 / -009). |
| 6 | Sandbox = **read-only context**; scripts can only read any tag + write to virtual tags | Strict Roslyn `ScriptOptions` allow-list. Authoritative deny-list (`ForbiddenTypeAnalyzer`): namespace-prefix deny `System.IO`, `System.Net`, `System.Diagnostics`, `System.Reflection`, `System.Threading.Tasks` (Task / Parallel fan-out — Core.Scripting-003), `System.Runtime.InteropServices`, `System.Runtime.Loader` (AssemblyLoadContext et al. — Core.Scripting-012), `Microsoft.Win32`; type-granular deny `System.Environment`, `System.AppDomain`, `System.GC`, `System.Activator`, `System.Threading.Thread`, `System.Threading.ThreadPool` (Core.Scripting-012), `System.Threading.Timer` (Core.Scripting-012) (these live directly in the allow-listed `System` / `System.Threading` namespaces, so a prefix rule cannot reach them without blocking primitives — Core.Scripting-001 / -009 / -012). |
| 7 | Dependency declaration = **AST inference**; operator doesn't maintain a separate dependency list | `CSharpSyntaxWalker` extracts `ctx.GetTag("path")` string-literal calls at compile time; dynamic paths rejected at publish. |
| 8 | Config storage = **config DB with generation-sealed cache** (same as driver instances) | Virtual tags + alarms publish atomically in the same generation as the driver instance config they may depend on. |
| 9 | Script return value shape (`ctx.GetTag`) = **`DataValue { Value, StatusCode, Timestamp }`** | Scripts branch on quality naturally without separate `ctx.GetQuality(...)` calls. |
@@ -89,7 +89,7 @@ Tie-in capability — **historian alarm sink**:
### Stream A — `Core.Scripting` (Roslyn engine + sandbox + AST inference + logger) — **2 weeks**
1. **A.1** Project scaffold + NuGet `Microsoft.CodeAnalysis.CSharp.Scripting`. `ScriptOptions` allow-list (`typeof(object).Assembly`, `typeof(Enumerable).Assembly`, the Core.Scripting assembly itself — nothing else). Hand-written `ScriptContext` base class with `GetTag(string)` / `SetVirtualTag(string, object)` / `Logger` / `Now` / `Deadband(double, double, double)` helpers.
1. **A.1** Project scaffold + NuGet `Microsoft.CodeAnalysis.CSharp.Scripting`. `ScriptOptions` allow-list (`typeof(object).Assembly`, `typeof(Enumerable).Assembly`, the Core.Scripting assembly itself — nothing else). Hand-written `ScriptContext` base class with `GetTag(string)` / `SetVirtualTag(string, object)` / `Logger` / `Now` / `Deadband(double, double, double)` helpers. _(Implementation note 2026-05-23 — superseded by Core.Scripting-008 / -016: the `CSharpScript`/`ScriptRunner` path was replaced with a hand-rolled `CSharpCompilation.Create``Emit(MemoryStream)` → collectible `ScriptAssemblyLoadContext.LoadFromStream` pipeline so per-publish ALC accretion is reclaimable, and engines route compiles through `CompiledScriptCache` rather than calling `ScriptEvaluator.Compile` directly. The reference list was correspondingly widened from the narrow allow-list above to the full BCL `TRUSTED_PLATFORM_ASSEMBLIES` set (filtered to `System.*` + `netstandard` + `Microsoft.Win32.Registry`) because the new pipeline can't compile against the old narrow set; `ForbiddenTypeAnalyzer` is now the sole security gate, consistent with how Core.Scripting-001 / -002 established the analyzer must be the real boundary because type forwarding makes any references-list-only restriction porous. See `docs/VirtualTags.md` "Compile cache" for the current implementation contract.)_
2. **A.2** `DependencyExtractor : CSharpSyntaxWalker`. Visits every `InvocationExpressionSyntax` targeting `ctx.GetTag` / `ctx.SetVirtualTag`; accepts only a `LiteralExpressionSyntax` argument. Non-literal arguments (concat, variable, method call) → publish-time rejection with an actionable error pointing the operator at the exact span. Outputs `IReadOnlySet<string> Inputs` + `IReadOnlySet<string> Outputs`.
3. **A.3** Compile cache. `(source_hash) → compiled Script<T>`. Recompile only when source changes. Warm on `SealedBootstrap`.
4. **A.4** Per-evaluation timeout wrapper (default 250ms; configurable per tag). Timeout = tag quality `BadInternalError` + structured warning log. Keeps a single runaway script from starving the engine.
@@ -162,7 +162,7 @@ Tie-in capability — **historian alarm sink**:
## Compliance Checks (run at exit gate)
- [ ] **Sandbox escape**: attempts to reference any deny-listed namespace prefix (`System.IO`, `System.Net`, `System.Diagnostics`, `System.Reflection`, `System.Threading.Tasks`, `System.Runtime.InteropServices`, `Microsoft.Win32`) or any of the type-granular forbidden types (`System.Environment`, `System.AppDomain`, `System.GC`, `System.Activator`, `System.Threading.Thread`) fail at script compile with an actionable error. Vectors include direct calls, `typeof(T)`, generic type arguments, casts, `is`/`as` patterns, `default(T)`, array element types, and explicitly-typed local declarations.
- [ ] **Sandbox escape**: attempts to reference any deny-listed namespace prefix (`System.IO`, `System.Net`, `System.Diagnostics`, `System.Reflection`, `System.Threading.Tasks`, `System.Runtime.InteropServices`, `System.Runtime.Loader`, `Microsoft.Win32`) or any of the type-granular forbidden types (`System.Environment`, `System.AppDomain`, `System.GC`, `System.Activator`, `System.Threading.Thread`, `System.Threading.ThreadPool`, `System.Threading.Timer`) fail at script compile with an actionable error. Vectors include direct calls, `typeof(T)`, generic type arguments, casts, `is`/`as` patterns, `default(T)`, array element types, and explicitly-typed local declarations.
- [ ] **Dependency inference**: `ctx.GetTag(myStringVar)` (non-literal path) is rejected at publish with a span-pointed error; `ctx.GetTag("Line1/Speed")` is accepted + appears in the inferred input set.
- [ ] **Change cascade**: tag A → virtual tag B → virtual tag C. When A changes, B recomputes, then C recomputes. Single change event triggers the full cascade in topological order within one evaluation pass.
- [ ] **Cycle rejection**: publish a config where virtual tag B depends on A and A depends on B. Publish fails pre-commit with a clear cycle message.
+2 -2
View File
@@ -193,7 +193,7 @@ ConfigurationService
- Compact binary format, faster than JSON, good fit for high-frequency data change callbacks
- Simpler than gRPC on .NET 4.8 (which needs legacy `Grpc.Core` native library)
**Decided: Galaxy Host is a separate Windows service.**
**Decided: Galaxy Host is a separate Windows service.** _(Reversed by PR 7.2, 2026-04-30 — see PR 7.2's commit `ae7106d` and the project_galaxy_via_mxgateway memory entry. The legacy in-process `Galaxy.Host` / `Galaxy.Proxy` / `Galaxy.Shared` projects + the `OtOpcUaGalaxyHost` Windows service were retired; Galaxy access now flows through the in-process Tier-A `GalaxyDriver` talking gRPC to a separately installed `mxaccessgw` gateway sibling repo. The reasoning below was correct for the original LMX/x86-COM architecture; the gateway sibling repo now owns those constraints externally.)_
- Independent lifecycle from the OtOpcUa Server
- Can be restarted without affecting the main server or other drivers
- Galaxy.Proxy detects connection loss, sets Bad quality on Galaxy nodes, reconnects when Host comes back
@@ -801,7 +801,7 @@ aggregate runner (#253); server-side factory + seed SQL per driver (#210#213)
| 26 | Admin deploys on same server (co-hosted) | Simplifies deployment; can also run on separate management host | 2026-04-16 |
| 27 | Admin scaffold early, driver-specific screens deferred | Core CRUD for instances/drivers first; per-driver config UI added with each driver | 2026-04-16 |
| 28 | Named pipes for Galaxy IPC | Fast, no port conflicts, native to both .NET 4.8 and .NET 10 | 2026-04-16 |
| 29 | Galaxy Host is a separate Windows service | Independent lifecycle, can restart without affecting main server or other drivers | 2026-04-16 |
| 29 | Galaxy Host is a separate Windows service | Independent lifecycle, can restart without affecting main server or other drivers | 2026-04-16 (**reversed PR 7.2, 2026-04-30** — Galaxy is now an in-process Tier-A driver talking gRPC to the sibling `mxaccessgw` gateway; see the decision body above) |
| 30 | Drop TopShelf, use Microsoft.Extensions.Hosting | Built-in Windows Service support in .NET 10, no third-party dependency | 2026-04-16 |
| 31 | Mono-repo for all drivers | Simpler dependency management, single CI pipeline, shared abstractions | 2026-04-16 |
| 32 | MessagePack serialization for Galaxy IPC | Binary, fast, works on .NET 4.8+ and .NET 10 via MessagePack-CSharp NuGet | 2026-04-16 |
+1 -1
View File
@@ -67,7 +67,7 @@ Remaining Phase 6.3 surfaces (hardening, not release-blocking):
- **AB CIP** (PRs #202222) — `Driver.AbCip`, `Driver.AbCip.IntegrationTests` (7 tests), AB CIP Cli. Live-boot verified against a ControlLogix rig.
- **AB Legacy** (PRs #202, #223) — `Driver.AbLegacy`, `Driver.AbLegacy.IntegrationTests` (2 tests), AB Legacy Cli. PCCC cip-path workaround for SLC/MicroLogix.
- **TwinCAT ADS** (PRs #205, this branch `task-galaxy-e2e`) — `Driver.TwinCAT`, `Driver.TwinCAT.IntegrationTests` (2 tests), TwinCAT Cli. TCBSD/ESXi fixture for e2e since local Hyper-V / TwinCAT RTIME are mutually exclusive on the dev box.
- **TwinCAT ADS** (PR #205) — `Driver.TwinCAT`, `Driver.TwinCAT.IntegrationTests` (2 tests), TwinCAT Cli. TCBSD/ESXi fixture for e2e since local Hyper-V / TwinCAT RTIME are mutually exclusive on the dev box.
- **FOCAS** (PRs #173, #199 + this session's migration) — `Driver.FOCAS` with an **in-process managed `FocasWireClient`** that speaks FOCAS/2 over TCP directly. Tier-C isolation retired — `Driver.FOCAS.Host` + `Driver.FOCAS.Shared` + `FwlibNative` P/Invoke + shim DLL + NSSM service all deleted. `Driver.FOCAS.IntegrationTests` covers 9 scenarios (fixed tree identity/axes/program/timers/spindle + user-authored PARAM/MACRO/PMC reads, Browse, Subscribe, IAlarmSource raise/clear, Probe transitions).
Decision recorded: FOCAS is **read-only** against the CNC by design — writes return `BadNotWritable`. See `docs/drivers/FOCAS.md` + `docs/drivers/FOCAS-Test-Fixture.md` for the deployment + coverage map.
@@ -48,6 +48,71 @@ public sealed class ScriptedAlarmEngine : IDisposable
// snapshot enumeration safe. The only write shapes are indexer-set and Clear,
// both of which ConcurrentDictionary supports atomically. (Core.ScriptedAlarms-001)
private readonly ConcurrentDictionary<string, AlarmState> _alarms = new(StringComparer.Ordinal);
/// <summary>
/// Per-alarm reusable evaluation scratch. The read-cache dictionary and the
/// <see cref="AlarmPredicateContext"/> instance are both allocated once per
/// alarm (on first evaluation) and reused across every subsequent re-eval —
/// the hot path no longer allocates a fresh dictionary + context per upstream
/// tag change. Safe because <see cref="EvaluatePredicateToStateAsync"/> only
/// runs under <see cref="_evalGate"/>, which serialises every evaluation:
/// two threads can never observe the same scratch in a half-refilled state.
/// Cleared in <see cref="LoadAsync"/> alongside <see cref="_alarms"/>.
/// (Core.ScriptedAlarms-009)
/// </summary>
private readonly ConcurrentDictionary<string, AlarmScratch> _scratchByAlarmId =
new(StringComparer.Ordinal);
/// <summary>
/// Compile cache for every alarm predicate. Routes <see cref="LoadAsync"/>'s
/// <see cref="ScriptEvaluator{TContext, TResult}.Compile"/> calls through the
/// cache so the collectible <see cref="System.Runtime.Loader.AssemblyLoadContext"/>
/// each compile produces is actually disposed on the publish-replace path
/// (Core.Scripting-016): the cache's <see cref="CompiledScriptCache{TContext, TResult}.Clear"/>
/// disposes every materialised evaluator before dropping its dictionary entry,
/// so a config-publish releases the prior generation's ALCs and the per-publish
/// accretion the Core.Scripting-008 fix targeted is actually freed in production.
/// Pre-fix the engine called <c>ScriptEvaluator.Compile</c> directly, which left
/// the ALCs rooted until the process exited — defeating -008 on the real path.
/// </summary>
private readonly CompiledScriptCache<AlarmPredicateContext, bool> _compileCache = new();
/// <summary>
/// Test-only diagnostic: returns the per-alarm scratch read-cache dictionary
/// if one has been allocated, else null. Used by Core.ScriptedAlarms-009
/// regression tests to assert the scratch is reused across evaluations
/// (two reads return the same instance).
/// </summary>
/// <remarks>
/// <b>Synchronization:</b> the returned <see cref="IReadOnlyDictionary{TKey, TValue}"/>
/// is the engine's live mutable read-cache. It is refilled in place by
/// <c>RefillReadCache</c> on every predicate evaluation, under <c>_evalGate</c>.
/// Test callers MUST NOT iterate this dictionary while the engine is
/// actively evaluating (i.e. while an upstream change is mid-flight); the
/// refill clears the dict before repopulating and a concurrent iterator
/// would observe torn / partial state. Safe uses are: reference-identity
/// comparisons (e.g. asserting the same instance is reused across calls),
/// and single-key reads against an engine that has quiesced after a
/// deterministic upstream push. Anything more involved should snapshot a
/// copy under the gate. (Core.ScriptedAlarms-013.)
/// </remarks>
internal IReadOnlyDictionary<string, DataValueSnapshot>? TryGetScratchReadCacheForTest(string alarmId)
=> _scratchByAlarmId.TryGetValue(alarmId, out var s) ? s.ReadCache : null;
/// <summary>
/// Test-only diagnostic: returns the per-alarm <see cref="AlarmPredicateContext"/>
/// if one has been allocated, else null. Companion to
/// <see cref="TryGetScratchReadCacheForTest"/>.
/// </summary>
/// <remarks>
/// <b>Synchronization:</b> the returned context wraps the same live
/// read-cache as <see cref="TryGetScratchReadCacheForTest"/> — the same
/// "don't iterate during an in-flight evaluation" caveat applies. Safe
/// for reference-identity assertions on a quiesced engine.
/// (Core.ScriptedAlarms-013.)
/// </remarks>
internal AlarmPredicateContext? TryGetScratchContextForTest(string alarmId)
=> _scratchByAlarmId.TryGetValue(alarmId, out var s) ? s.Context : null;
private readonly ConcurrentDictionary<string, DataValueSnapshot> _valueCache
= new(StringComparer.Ordinal);
private readonly Dictionary<string, HashSet<string>> _alarmsReferencing
@@ -108,6 +173,14 @@ public sealed class ScriptedAlarmEngine : IDisposable
UnsubscribeFromUpstream();
_alarms.Clear();
_alarmsReferencing.Clear();
// Drop the prior generation's per-alarm scratch buffers — definitions may
// have changed (different Inputs, different Logger), so any reuse would be
// unsafe. (Core.ScriptedAlarms-009)
_scratchByAlarmId.Clear();
// Dispose every compiled-predicate ALC from the prior generation BEFORE we
// recompile this one. Skipping this is what made Core.Scripting-008 a
// no-op in production. (Core.Scripting-016)
_compileCache.Clear();
var compileFailures = new List<string>();
foreach (var def in definitions)
@@ -122,7 +195,10 @@ public sealed class ScriptedAlarmEngine : IDisposable
continue;
}
var evaluator = ScriptEvaluator<AlarmPredicateContext, bool>.Compile(def.PredicateScriptSource);
// Route through CompiledScriptCache so the emitted assembly's
// collectible ALC participates in publish-replace cleanup.
// (Core.Scripting-016)
var evaluator = _compileCache.GetOrCompile(def.PredicateScriptSource);
var timed = new TimedScriptEvaluator<AlarmPredicateContext, bool>(evaluator, _scriptTimeout);
var logger = _loggerFactory.Create(def.AlarmId);
@@ -354,7 +430,13 @@ public sealed class ScriptedAlarmEngine : IDisposable
AlarmState state, AlarmConditionState seed, DateTime nowUtc, CancellationToken ct,
List<ScriptedAlarmEvent>? pendingEmissions = null)
{
var inputs = BuildReadCache(state.Inputs);
// Look up (or lazily allocate) the per-alarm scratch and refill its read cache
// in place. The dictionary + context survive across evaluations so the hot path
// no longer allocates per upstream tag change. (Core.ScriptedAlarms-009)
var scratch = _scratchByAlarmId.GetOrAdd(
state.Definition.AlarmId,
_ => new AlarmScratch(state.Inputs, state.Logger, _clock));
RefillReadCache(scratch.ReadCache, state.Inputs);
// Cold-start guard — skip the predicate when any referenced upstream tag has no
// cached value yet (the upstream subscription hasn't delivered its first push).
@@ -362,9 +444,9 @@ public sealed class ScriptedAlarmEngine : IDisposable
// every tick until the cache fills, spamming the log with identical stack traces.
// Bad quality is treated the same: the input isn't available at the predicate's
// expected type, so the only defensible move is to hold the prior condition state.
if (!AreInputsReady(inputs)) return seed;
if (!AreInputsReady(scratch.ReadCache)) return seed;
var context = new AlarmPredicateContext(inputs, state.Logger, _clock);
var context = scratch.Context;
bool predicateTrue;
try
@@ -399,12 +481,20 @@ public sealed class ScriptedAlarmEngine : IDisposable
return result.State;
}
private IReadOnlyDictionary<string, DataValueSnapshot> BuildReadCache(IReadOnlySet<string> inputs)
/// <summary>
/// Refill <paramref name="cache"/> in place from <c>_valueCache</c>, falling
/// back to a synchronous <c>ITagUpstreamSource.ReadTag</c> for paths whose
/// first upstream push hasn't arrived yet. The dictionary is cleared and
/// repopulated under <c>_evalGate</c> so no concurrent reader can observe
/// a partial state. Replaces the old <c>BuildReadCache</c> which allocated a
/// fresh dictionary every call (Core.ScriptedAlarms-009).
/// </summary>
private void RefillReadCache(
Dictionary<string, DataValueSnapshot> cache, IReadOnlySet<string> inputs)
{
var d = new Dictionary<string, DataValueSnapshot>(StringComparer.Ordinal);
cache.Clear();
foreach (var p in inputs)
d[p] = _valueCache.TryGetValue(p, out var v) ? v : _upstream.ReadTag(p);
return d;
cache[p] = _valueCache.TryGetValue(p, out var v) ? v : _upstream.ReadTag(p);
}
/// <summary>
@@ -596,12 +686,24 @@ public sealed class ScriptedAlarmEngine : IDisposable
}
}
// Do NOT clear _alarms here: Timer.Dispose() does not wait for in-flight callbacks,
// so a ShelvingCheckAsync or ReevaluateAsync can still be running inside _evalGate.
// Those paths now re-check _disposed after acquiring the gate and bail out safely.
// Clearing _alarms outside the gate would race concurrent reads and is unnecessary
// (the whole object is being discarded). (Core.ScriptedAlarms-005)
// Safe to clear here: the Task.WhenAll drain above guaranteed no
// ReevaluateAsync / ShelvingCheckAsync is mid-flight, and _disposed=true
// prevents new background work from being queued (OnUpstreamChange bails on
// line 334). Pre-Core.Scripting-016 the comment said "Do NOT clear _alarms",
// but that was when the engine called ScriptEvaluator.Compile directly and
// held the script ALCs through _alarms→AlarmState→TimedScriptEvaluator
// forever — leaving them rooted defeated the -008 collectible-ALC unload.
// Clearing now drops the delegate references so the cache's Dispose call
// below can actually unload the emitted assemblies. (Core.ScriptedAlarms-005
// re-evaluated under -016.)
_alarms.Clear();
_alarmsReferencing.Clear();
_scratchByAlarmId.Clear();
// Dispose every compiled-predicate ALC so the engine's shutdown actually
// releases the emitted assemblies. The drain above ensures no evaluator is
// mid-call; CompiledScriptCache.Dispose internally guards against use-after-
// dispose. (Core.Scripting-016)
_compileCache.Dispose();
}
private sealed record AlarmState(
@@ -611,6 +713,37 @@ public sealed class ScriptedAlarmEngine : IDisposable
IReadOnlyList<string> TemplateTokens,
ILogger Logger,
AlarmConditionState Condition);
/// <summary>
/// Per-alarm reusable evaluation scratch. The <see cref="ReadCache"/> dictionary
/// is the same instance across every evaluation of the owning alarm — it is
/// cleared and refilled in <see cref="ScriptedAlarmEngine.RefillReadCache"/> on
/// each call. <see cref="Context"/> wraps that dictionary by reference, so a
/// refilled <see cref="ReadCache"/> is what the predicate's
/// <c>ctx.GetTag(path)</c> calls observe. (Core.ScriptedAlarms-009)
/// </summary>
/// <remarks>
/// Reuse is safe because <see cref="ScriptedAlarmEngine"/> serialises every
/// evaluation under <c>_evalGate</c>: two threads can never observe the same
/// scratch in a half-refilled state.
/// </remarks>
private sealed class AlarmScratch
{
public Dictionary<string, DataValueSnapshot> ReadCache { get; }
public AlarmPredicateContext Context { get; }
public AlarmScratch(IReadOnlySet<string> inputs, ILogger logger, Func<DateTime> clock)
{
// Pre-size to the expected input count so the first refill doesn't pay the
// dictionary-grow cost. The dictionary auto-grows if Inputs changes (it
// cannot under the current contract — Inputs is fixed at LoadAsync — but
// pre-sizing is defensive against future changes).
ReadCache = new Dictionary<string, DataValueSnapshot>(inputs.Count, StringComparer.Ordinal);
// Context holds the read cache by reference. Refilling the dictionary
// updates what the context (and the script) observes.
Context = new AlarmPredicateContext(ReadCache, logger, clock);
}
}
}
/// <summary>
@@ -30,11 +30,20 @@ namespace ZB.MOM.WW.OtOpcUa.Core.Scripting;
/// bounded by config DB (typically low thousands). If that changes in v3, add an
/// LRU eviction policy — the API stays the same.
/// </para>
/// <para>
/// <b>Lifecycle:</b> compiled scripts hold a collectible
/// <see cref="System.Runtime.Loader.AssemblyLoadContext"/> per evaluator
/// (Core.Scripting-008 fix). <see cref="Clear"/> disposes every materialised
/// evaluator before dropping its dictionary entry so the emitted assemblies are
/// eligible for GC immediately after a publish. <see cref="Dispose"/> drops the
/// cache itself for graceful server shutdown.
/// </para>
/// </remarks>
public sealed class CompiledScriptCache<TContext, TResult>
public sealed class CompiledScriptCache<TContext, TResult> : IDisposable
where TContext : ScriptContext
{
private readonly ConcurrentDictionary<string, Lazy<ScriptEvaluator<TContext, TResult>>> _cache = new();
private bool _disposed;
/// <summary>
/// Return the compiled evaluator for <paramref name="scriptSource"/>, compiling
@@ -46,6 +55,7 @@ public sealed class CompiledScriptCache<TContext, TResult>
public ScriptEvaluator<TContext, TResult> GetOrCompile(string scriptSource)
{
if (scriptSource is null) throw new ArgumentNullException(nameof(scriptSource));
if (_disposed) throw new ObjectDisposedException(nameof(CompiledScriptCache<TContext, TResult>));
var key = HashSource(scriptSource);
var lazy = _cache.GetOrAdd(key, _ => new Lazy<ScriptEvaluator<TContext, TResult>>(
@@ -72,13 +82,71 @@ public sealed class CompiledScriptCache<TContext, TResult>
/// <summary>Current entry count. Exposed for Admin UI diagnostics / tests.</summary>
public int Count => _cache.Count;
/// <summary>Drop every cached compile. Used on config generation publish + tests.</summary>
public void Clear() => _cache.Clear();
/// <summary>
/// Drop every cached compile. Used on config generation publish + tests.
/// Disposes each materialised evaluator before removing it so its collectible
/// <see cref="System.Runtime.Loader.AssemblyLoadContext"/> unloads and the
/// emitted script assembly becomes eligible for GC (Core.Scripting-008).
/// </summary>
/// <remarks>
/// Safe to call after <see cref="Dispose"/> — the operation is idempotent.
/// <see cref="Dispose"/> sets <c>_disposed = true</c> before invoking this
/// method (so callers see the post-Dispose guard on <see cref="GetOrCompile"/>),
/// but this method itself MUST run to completion so the Dispose-triggered
/// drain actually unloads every materialised evaluator's ALC. (Core.Scripting-016
/// uncovered this — a previous Clear-aborts-when-disposed guard silently
/// skipped the entire drain on Dispose, leaving emitted assemblies rooted.)
/// </remarks>
public void Clear()
{
// Snapshot (key, value) pairs and remove with the value-scoped
// TryRemove(KeyValuePair<,>) overload — same shape as the
// Core.Scripting-006 fix in GetOrCompile's catch block. A concurrent
// GetOrCompile re-add that hashes to the same key between our snapshot
// and the TryRemove inserts a *different* Lazy reference; the value-
// scoped removal sees the mismatch and leaves the fresh entry intact
// (instead of evicting + disposing it while the concurrent caller
// still holds it). The fresh evaluator and its ALC stay live for the
// concurrent caller. (Core.Scripting-014.)
foreach (var entry in _cache.ToArray())
{
if (_cache.TryRemove(entry))
DisposeLazyIfMaterialised(entry.Value);
}
}
/// <summary>True when the exact source has been compiled at least once + is still cached.</summary>
public bool Contains(string scriptSource)
=> _cache.ContainsKey(HashSource(scriptSource));
/// <summary>
/// Drop the cache and dispose every materialised evaluator. After disposal
/// <see cref="GetOrCompile"/> throws <see cref="ObjectDisposedException"/>.
/// </summary>
public void Dispose()
{
if (_disposed) return;
_disposed = true;
Clear();
}
private static void DisposeLazyIfMaterialised(Lazy<ScriptEvaluator<TContext, TResult>> lazy)
{
// IsValueCreated is false for a faulted Lazy too, so the catch in GetOrCompile
// has already taken care of failed compiles — there's no evaluator to dispose.
if (!lazy.IsValueCreated) return;
try
{
lazy.Value.Dispose();
}
catch
{
// Dispose is best-effort here: an evaluator disposal failure would leak its
// ALC but mustn't prevent the rest of the cache from clearing. The ALC
// unload itself is exception-free in practice; this is defensive.
}
}
private static string HashSource(string source)
{
var bytes = Encoding.UTF8.GetBytes(source);
@@ -72,6 +72,11 @@ public static class ForbiddenTypeAnalyzer
// a Task fan-out outlives the evaluation timeout entirely
// (Core.Scripting-003).
"System.Runtime.InteropServices",
"System.Runtime.Loader", // AssemblyLoadContext + AssemblyDependencyResolver —
// arbitrary DLL load into the host process
// (Core.Scripting-012). Namespace-prefix rather than
// type-granular so future BCL additions to this
// namespace are denied by default.
"Microsoft.Win32", // registry
];
@@ -113,6 +118,24 @@ public static class ForbiddenTypeAnalyzer
// target it without blocking those legitimate types. Denied type-granularly here.
// (Core.Scripting-010.)
"System.Threading.Thread",
// Core.Scripting-012 — broadening the references list to the BCL trusted-platform-
// assemblies set (Core.Scripting-008 follow-up) re-exposed two background-work
// vectors the original deny-list missed. Both live in System.Threading (shared
// with allowed sync primitives like CancellationToken / SemaphoreSlim), so they
// must be denied type-granularly:
//
// System.Threading.ThreadPool — QueueUserWorkItem / UnsafeQueueUserWorkItem
// re-introduce the background-fanout threat
// Core.Scripting-003 closed against
// System.Threading.Tasks.
// System.Threading.Timer — Timer(callback, …) schedules unbounded work
// that outlives the per-evaluation timeout.
//
// System.Runtime.Loader.AssemblyLoadContext is also covered, but at the namespace-
// prefix level above (System.Runtime.Loader) so future BCL additions to that
// namespace are denied by default.
"System.Threading.ThreadPool",
"System.Threading.Timer",
];
/// <summary>
@@ -1,75 +1,420 @@
using Microsoft.CodeAnalysis.CSharp.Scripting;
using Microsoft.CodeAnalysis.Scripting;
using System.Reflection;
using System.Runtime.Loader;
using System.Text;
using Microsoft.CodeAnalysis;
using Microsoft.CodeAnalysis.CSharp;
namespace ZB.MOM.WW.OtOpcUa.Core.Scripting;
/// <summary>
/// Compiles + runs user scripts against a <see cref="ScriptContext"/> subclass. Core
/// evaluator — no caching, no timeout, no logging side-effects yet (those land in
/// Stream A.3, A.4, A.5 respectively). Stream B + C wrap this with the dependency
/// scheduler + alarm state machine.
/// evaluator — no caching, no timeout, no logging side-effects (those land in
/// <see cref="CompiledScriptCache{TContext, TResult}"/>,
/// <see cref="TimedScriptEvaluator{TContext, TResult}"/>, and
/// <see cref="ScriptLogCompanionSink"/> respectively).
/// </summary>
/// <remarks>
/// <para>
/// Scripts are compiled against <see cref="ScriptGlobals{TContext}"/> so the
/// context member is named <c>ctx</c> in the script, matching the
/// <see cref="DependencyExtractor"/>'s walker and the Admin UI type stub.
/// Scripts are wrapped in a synthesized <c>CompiledScript.Run(globals)</c> method
/// and compiled via <see cref="CSharpCompilation"/> into a regular .NET assembly
/// that is loaded into a <b>collectible</b>
/// <see cref="AssemblyLoadContext"/>. The collectible ALC is the fix for
/// Core.Scripting-008: per-publish recompile accretion was previously unbounded
/// because Roslyn's <c>CSharpScript.CreateDelegate</c> emits into the default ALC
/// (non-collectible); now <see cref="Dispose"/> unloads the entire ALC and the
/// emitted assembly becomes eligible for GC.
/// </para>
/// <para>
/// Compile pipeline is a three-step gate: (1) Roslyn compile — catches syntax
/// errors + type-resolution failures, throws <see cref="CompilationErrorException"/>;
/// (2) <see cref="ForbiddenTypeAnalyzer"/> runs against the semantic model —
/// catches sandbox escapes that slipped past reference restrictions due to .NET's
/// type forwarding, throws <see cref="ScriptSandboxViolationException"/>; (3)
/// delegate creation — throws at this layer only for internal Roslyn bugs, not
/// user error.
/// Compile pipeline is a three-step gate, unchanged in intent from the legacy
/// <c>CSharpScript</c> path: (1) Roslyn parse + compile against the
/// <see cref="ScriptSandbox"/> allow-list — catches syntax errors, unresolved
/// types (the sandbox's first line of defense), and most type-resolution
/// failures, throwing <see cref="CompilationErrorException"/>; (2)
/// <see cref="ForbiddenTypeAnalyzer"/> runs against the semantic model — catches
/// sandbox escapes that slipped past reference restrictions due to .NET's type
/// forwarding, throwing <see cref="ScriptSandboxViolationException"/>; (3) emit
/// to an in-memory PE stream + load into the collectible ALC — throws at this
/// layer only for internal Roslyn bugs, not user error.
/// </para>
/// <para>
/// Runtime exceptions thrown from user code propagate unwrapped. The virtual-tag
/// engine (Stream B) catches them per-tag + maps to <c>BadInternalError</c>
/// quality per Phase 7 decision #11 this layer doesn't swallow anything so
/// tests can assert on the original exception type.
/// engine catches them per-tag and maps to <c>BadInternalError</c> quality
/// per Phase 7 decision #11; this layer doesn't swallow anything so tests can
/// assert on the original exception type.
/// </para>
/// <para>
/// Scripts are expected to be statement bodies that end with an explicit
/// <c>return …;</c> — the wrapper provides only the surrounding method body, so
/// the script's final-expression-yields-result behavior of legacy
/// <c>CSharpScript</c> is replaced by ordinary C# method semantics. Every script
/// in the existing test corpus already uses explicit <c>return</c>; this is a
/// documented authoring convention.
/// </para>
/// </remarks>
public sealed class ScriptEvaluator<TContext, TResult>
public sealed class ScriptEvaluator<TContext, TResult> : IDisposable
where TContext : ScriptContext
{
private readonly ScriptRunner<TResult> _runner;
private readonly ScriptAssemblyLoadContext _alc;
private readonly Func<ScriptGlobals<TContext>, TResult> _func;
private bool _disposed;
private ScriptEvaluator(ScriptRunner<TResult> runner)
private ScriptEvaluator(ScriptAssemblyLoadContext alc, Func<ScriptGlobals<TContext>, TResult> func)
{
_runner = runner;
_alc = alc;
_func = func;
}
public static ScriptEvaluator<TContext, TResult> Compile(string scriptSource)
{
if (scriptSource is null) throw new ArgumentNullException(nameof(scriptSource));
var options = ScriptSandbox.Build(typeof(TContext));
var script = CSharpScript.Create<TResult>(
code: scriptSource,
options: options,
globalsType: typeof(ScriptGlobals<TContext>));
var sandbox = ScriptSandbox.Build(typeof(TContext));
// Step 1 — Roslyn compile. Throws CompilationErrorException on syntax / type errors.
var diagnostics = script.Compile();
// Step 1 — synthesize a wrapper class around the script body and parse it. The
// wrapper's `Run` method is what we invoke at runtime; the user's source is
// pasted in as its body so explicit `return` semantics apply.
var wrapperSource = BuildWrapperSource(scriptSource, sandbox.Imports);
var syntaxTree = CSharpSyntaxTree.ParseText(wrapperSource);
// Step 2forbidden-type semantic analysis. Defense-in-depth against reference-list
// leaks due to type forwarding.
var rejections = ForbiddenTypeAnalyzer.Analyze(script.GetCompilation());
// Step 1adefend against wrapper-source injection (Core.Scripting-013).
// A script body of `return 0; } public static int Evil() { return 0;` would
// close the synthesized `Run` method early, declare a sibling `Evil` method
// inside the synthesized `CompiledScript` class, and leave the wrapper's
// trailing `}` balanced. ForbiddenTypeAnalyzer still walks the injected
// members so this isn't a direct sandbox escape, but it relaxes the
// documented "method body" authoring contract and widens the analyzer's
// surface. Reject by requiring that the parsed `CompiledScript` class
// contains exactly one member declaration (the `Run` method).
EnforceSingleRunMember(syntaxTree);
// Step 2 — Roslyn compile against the sandbox allow-list. Anything not in the
// references set is unresolved and produces a compiler error.
var assemblyName = "ZB.MOM.WW.OtOpcUa.Core.Scripting.Compiled." +
Guid.NewGuid().ToString("N");
var compileOptions = new CSharpCompilationOptions(
OutputKind.DynamicallyLinkedLibrary,
optimizationLevel: OptimizationLevel.Release,
allowUnsafe: false,
// Don't generate XML doc warnings for the synthesized wrapper.
warningLevel: 4,
nullableContextOptions: NullableContextOptions.Enable);
var compilation = CSharpCompilation.Create(
assemblyName,
syntaxTrees: new[] { syntaxTree },
references: sandbox.References,
options: compileOptions);
var compileDiagnostics = compilation.GetDiagnostics();
var compileErrors = compileDiagnostics
.Where(d => d.Severity == DiagnosticSeverity.Error)
.ToArray();
if (compileErrors.Length > 0)
throw new CompilationErrorException(compileErrors);
// Step 3 — forbidden-type semantic analysis. Defense-in-depth against
// reference-list leaks due to type forwarding.
var rejections = ForbiddenTypeAnalyzer.Analyze(compilation);
if (rejections.Count > 0)
throw new ScriptSandboxViolationException(rejections);
// Step 3materialize the callable delegate.
var runner = script.CreateDelegate();
return new ScriptEvaluator<TContext, TResult>(runner);
// Step 4emit to an in-memory PE stream and load into a collectible ALC.
using var peStream = new MemoryStream();
var emitResult = compilation.Emit(peStream);
if (!emitResult.Success)
{
var emitErrors = emitResult.Diagnostics
.Where(d => d.Severity == DiagnosticSeverity.Error)
.ToArray();
throw new CompilationErrorException(emitErrors);
}
peStream.Position = 0;
var alc = new ScriptAssemblyLoadContext(assemblyName);
Assembly assembly;
try
{
assembly = alc.LoadFromStream(peStream);
}
catch
{
// Failed to load — drop the ALC so we don't leak a half-initialised one.
alc.Unload();
throw;
}
// Step 5 — resolve the wrapper's Run method and bind a typed delegate. The
// wrapper source above puts the type in this exact namespace + class — keep the
// names in sync with BuildWrapperSource.
Func<ScriptGlobals<TContext>, TResult> func;
try
{
var wrapperType = assembly.GetType(
"ZB.MOM.WW.OtOpcUa.Core.Scripting.Compiled.CompiledScript",
throwOnError: true)!;
var runMethod = wrapperType.GetMethod(
"Run",
BindingFlags.Public | BindingFlags.Static)
?? throw new InvalidOperationException(
"Synthesized wrapper is missing the public static Run method.");
func = (Func<ScriptGlobals<TContext>, TResult>)Delegate.CreateDelegate(
typeof(Func<ScriptGlobals<TContext>, TResult>), runMethod);
}
catch
{
alc.Unload();
throw;
}
return new ScriptEvaluator<TContext, TResult>(alc, func);
}
/// <summary>Run against an already-constructed context.</summary>
public Task<TResult> RunAsync(TContext context, CancellationToken ct = default)
{
if (_disposed) throw new ObjectDisposedException(nameof(ScriptEvaluator<TContext, TResult>));
if (context is null) throw new ArgumentNullException(nameof(context));
ct.ThrowIfCancellationRequested();
var globals = new ScriptGlobals<TContext> { ctx = context };
return _runner(globals, ct);
// The user's script is synchronous (Roslyn emits a static method that returns
// TResult directly). We surface a Task<TResult> only to keep the existing
// RunAsync contract consumers depend on. TimedScriptEvaluator wraps this in
// Task.Run so a long-running script still honours its wall-clock budget.
var result = _func(globals);
return Task.FromResult(result);
}
/// <summary>
/// Unload the collectible <see cref="AssemblyLoadContext"/> that holds the emitted
/// script assembly so the runtime can reclaim it. After disposal the evaluator can
/// no longer be invoked — call <see cref="ScriptEvaluator{TContext, TResult}.Compile"/>
/// again for a fresh one. Dispose is idempotent.
/// </summary>
/// <remarks>
/// Unload is <i>eligible-for-collection</i>, not synchronous: the assembly is
/// reclaimed when the GC determines no live references remain. The cache disposes
/// evaluators in <see cref="CompiledScriptCache{TContext, TResult}.Clear"/> so a
/// config-generation publish releases the prior generation in one sweep; the
/// reclaim then races with the next GC cycle. Tests verify the reclaim via
/// <see cref="WeakReference"/> + <see cref="GC.Collect()"/>.
/// </remarks>
public void Dispose()
{
if (_disposed) return;
_disposed = true;
_alc.Unload();
}
/// <summary>
/// Reject scripts whose source contains brace-balanced injections that would
/// declare sibling members alongside the synthesized <c>CompiledScript.Run</c>
/// method. The expected shape is a single <c>CompiledScript</c> class with
/// exactly one member — the <c>Run</c> method. Anything else (a sibling
/// method, nested class, additional class in the namespace, free-floating
/// top-level statement) means the user source closed the synthesized braces
/// early and injected its own declarations. (Core.Scripting-013.)
/// </summary>
private static void EnforceSingleRunMember(SyntaxTree syntaxTree)
{
var root = syntaxTree.GetCompilationUnitRoot();
// The compilation unit must hold exactly one type declaration — our
// CompiledScript. Anything else means the user closed the synthesized
// namespace or class early and injected another type declaration.
var typeMembers = root.DescendantNodes()
.OfType<Microsoft.CodeAnalysis.CSharp.Syntax.BaseTypeDeclarationSyntax>()
.ToArray();
if (typeMembers.Length != 1 || typeMembers[0].Identifier.ValueText != "CompiledScript")
{
throw new CompilationErrorException(new[]
{
Diagnostic.Create(
new DiagnosticDescriptor(
id: "LMX001",
title: "Script wrapper injection",
messageFormat: "Script source must be a statement body. Declarations of " +
"additional types alongside the wrapper's CompiledScript class " +
"are not allowed; check for unbalanced braces or stray " +
"`class` / `namespace` keywords in the source. (Core.Scripting-013)",
category: "Sandbox",
defaultSeverity: DiagnosticSeverity.Error,
isEnabledByDefault: true),
typeMembers.Length > 1 ? typeMembers[1].Identifier.GetLocation() : Location.None),
});
}
// The CompiledScript class itself must contain exactly one member — the Run
// method. A second member means the user closed Run early and started a sibling.
var classMembers = ((Microsoft.CodeAnalysis.CSharp.Syntax.ClassDeclarationSyntax)typeMembers[0]).Members;
if (classMembers.Count != 1 ||
classMembers[0] is not Microsoft.CodeAnalysis.CSharp.Syntax.MethodDeclarationSyntax m ||
m.Identifier.ValueText != "Run")
{
throw new CompilationErrorException(new[]
{
Diagnostic.Create(
new DiagnosticDescriptor(
id: "LMX002",
title: "Script wrapper injection",
messageFormat: "Script source must be a statement body. Declarations of " +
"sibling members (methods, properties, nested types) alongside " +
"the wrapper's Run method are not allowed; check for unbalanced " +
"braces or a stray `}` followed by a `public`/`private`/`static` " +
"declaration in the source. (Core.Scripting-013)",
category: "Sandbox",
defaultSeverity: DiagnosticSeverity.Error,
isEnabledByDefault: true),
classMembers.Count > 1 ? classMembers[1].GetLocation() : Location.None),
});
}
}
/// <summary>
/// Synthesize the source we hand to Roslyn. The user's script body is pasted
/// verbatim inside <c>CompiledScript.Run</c>; the <c>using</c> directives mirror
/// <see cref="ScriptSandbox"/>'s imports so scripts can write <c>Math.Abs</c>
/// instead of <c>System.Math.Abs</c>.
/// </summary>
private static string BuildWrapperSource(string userSource, IReadOnlyList<string> imports)
{
var sb = new StringBuilder();
foreach (var import in imports)
sb.Append("using ").Append(import).AppendLine(";");
sb.AppendLine();
sb.AppendLine("namespace ZB.MOM.WW.OtOpcUa.Core.Scripting.Compiled;");
sb.AppendLine();
sb.AppendLine("public static class CompiledScript");
sb.AppendLine("{");
sb.Append(" public static ").Append(ToCSharpTypeName(typeof(TResult)))
.Append(" Run(").Append(ToCSharpTypeName(typeof(ScriptGlobals<TContext>)))
.AppendLine(" globals)");
sb.AppendLine(" {");
sb.AppendLine(" var ctx = globals.ctx;");
// User source ends with `return X;` per the authoring convention; we paste it
// verbatim. The leading newline keeps Roslyn diagnostics' line numbers usable
// by operators (errors point at the user's source line, not the wrapper).
sb.AppendLine("#line 1");
sb.AppendLine(userSource);
sb.AppendLine(" }");
sb.AppendLine("}");
return sb.ToString();
}
/// <summary>
/// Convert a runtime <see cref="Type"/> to a C# type-name string suitable for
/// emitting into Roslyn source. Uses <c>global::</c>-qualified FQNs to avoid
/// accidental capture by the wrapper's <c>using</c> directives, handles nested
/// types (<c>+</c> → <c>.</c>), and recurses for generic arguments so the
/// <c>ScriptGlobals&lt;TContext&gt;</c> parameter is emitted correctly.
/// </summary>
private static string ToCSharpTypeName(Type t)
{
if (t == typeof(void)) return "void";
// Primitive aliases keep the synthesized source readable when diagnostic
// logging dumps it; functionally identical to the FQN form.
if (t == typeof(bool)) return "bool";
if (t == typeof(byte)) return "byte";
if (t == typeof(sbyte)) return "sbyte";
if (t == typeof(short)) return "short";
if (t == typeof(ushort)) return "ushort";
if (t == typeof(int)) return "int";
if (t == typeof(uint)) return "uint";
if (t == typeof(long)) return "long";
if (t == typeof(ulong)) return "ulong";
if (t == typeof(float)) return "float";
if (t == typeof(double)) return "double";
if (t == typeof(decimal)) return "decimal";
if (t == typeof(string)) return "string";
if (t == typeof(object)) return "object";
if (Nullable.GetUnderlyingType(t) is { } inner)
return ToCSharpTypeName(inner) + "?";
if (t.IsArray)
return ToCSharpTypeName(t.GetElementType()!) + "[]";
if (t.IsGenericType)
{
// Walk the FullName by '.' segments (after '+ → .'). For each segment
// ending with `Name\`N`, consume N generic arguments and emit them as
// `<…>` on that segment. Nested generic-in-generic (Outer<T>.Inner<U>)
// emits as `global::Ns.Outer<T>.Inner<U>` — valid C#. The pre-fix code
// used `IndexOf('`')` to find the FIRST backtick and truncated the
// entire name there, silently dropping the rest of the nested-generic
// closed args. (Core.Scripting-015.)
var rawName = t.GetGenericTypeDefinition().FullName!.Replace('+', '.');
var allArgs = t.GetGenericArguments();
var segments = rawName.Split('.');
var argIndex = 0;
var sb = new StringBuilder("global::");
for (int i = 0; i < segments.Length; i++)
{
if (i > 0) sb.Append('.');
var seg = segments[i];
var backtick = seg.IndexOf('`');
if (backtick >= 0)
{
var arity = int.Parse(seg.AsSpan(backtick + 1));
sb.Append(seg, 0, backtick);
sb.Append('<');
for (int j = 0; j < arity; j++)
{
if (j > 0) sb.Append(", ");
sb.Append(ToCSharpTypeName(allArgs[argIndex++]));
}
sb.Append('>');
}
else
{
sb.Append(seg);
}
}
return sb.ToString();
}
return "global::" + t.FullName!.Replace('+', '.');
}
}
/// <summary>
/// Collectible <see cref="AssemblyLoadContext"/> that hosts a single emitted script
/// assembly. Created per <see cref="ScriptEvaluator{TContext, TResult}"/> instance so
/// <see cref="AssemblyLoadContext.Unload"/> releases exactly that script. Resolves
/// dependencies via the default ALC — script assemblies reference the BCL + the
/// application's own types, all of which live in the default context.
/// </summary>
internal sealed class ScriptAssemblyLoadContext : AssemblyLoadContext
{
public ScriptAssemblyLoadContext(string name) : base(name, isCollectible: true)
{
}
protected override Assembly? Load(AssemblyName assemblyName) => null;
}
/// <summary>
/// Thrown by <see cref="ScriptEvaluator{TContext, TResult}.Compile"/> when Roslyn
/// reports compile-time errors against the wrapper source. Mirrors the
/// <c>Microsoft.CodeAnalysis.Scripting.CompilationErrorException</c> from the legacy
/// <c>CSharpScript</c> path so callers (engines + the Admin test-harness) keep the
/// same catch site after the Core.Scripting-008 rewrite.
/// </summary>
public sealed class CompilationErrorException : Exception
{
public IReadOnlyList<Diagnostic> Diagnostics { get; }
public CompilationErrorException(IReadOnlyList<Diagnostic> diagnostics)
: base(BuildMessage(diagnostics))
{
Diagnostics = diagnostics;
}
private static string BuildMessage(IReadOnlyList<Diagnostic> diagnostics)
{
if (diagnostics.Count == 0) return "Script compile failed.";
// Operators see this — match the legacy Roslyn format ("(line,col): error CSxxxx:
// message") so existing operator runbooks still match.
var first = diagnostics[0];
var rest = diagnostics.Count == 1 ? "" : $" (and {diagnostics.Count - 1} more)";
return first.ToString() + rest;
}
}
@@ -1,11 +1,10 @@
using Microsoft.CodeAnalysis.CSharp.Scripting;
using Microsoft.CodeAnalysis.Scripting;
using Microsoft.CodeAnalysis;
using ZB.MOM.WW.OtOpcUa.Core.Abstractions;
namespace ZB.MOM.WW.OtOpcUa.Core.Scripting;
/// <summary>
/// Factory for the <see cref="ScriptOptions"/> every user script is compiled against.
/// Factory for the compile-time sandbox every user script is built against.
/// Implements Phase 7 plan decision #6 (read-only sandbox) by whitelisting only the
/// assemblies + namespaces the script API needs; no <c>System.IO</c>, no
/// <c>System.Net</c>, no <c>System.Diagnostics.Process</c>, no
@@ -15,9 +14,12 @@ namespace ZB.MOM.WW.OtOpcUa.Core.Scripting;
/// </summary>
/// <remarks>
/// <para>
/// Roslyn's default <see cref="ScriptOptions"/> references <c>mscorlib</c> /
/// <c>System.Runtime</c> transitively which pulls in every type in the BCL — this
/// class overrides that with an explicit minimal allow-list.
/// Roslyn would otherwise pull in every type in the BCL transitively via
/// <c>mscorlib</c> / <c>System.Runtime</c> — this class overrides that with an
/// explicit minimal allow-list. The list is the same regardless of whether
/// <see cref="ScriptEvaluator{TContext, TResult}"/> uses the legacy
/// <c>CSharpScript</c> path or the collectible-<c>AssemblyLoadContext</c> path
/// (Core.Scripting-008): both go through <see cref="Build"/>.
/// </para>
/// <para>
/// Namespaces pre-imported so scripts don't have to write <c>using</c> clauses:
@@ -35,29 +37,21 @@ namespace ZB.MOM.WW.OtOpcUa.Core.Scripting;
public static class ScriptSandbox
{
/// <summary>
/// Build the <see cref="ScriptOptions"/> used for every virtual-tag / alarm
/// script. <paramref name="contextType"/> is the concrete
/// <see cref="ScriptContext"/> subclass the globals will be of — the compiler
/// uses its type to resolve <c>ctx.GetTag(...)</c> calls.
/// Build the sandbox configuration used for every virtual-tag / alarm script.
/// <paramref name="contextType"/> is the concrete <see cref="ScriptContext"/>
/// subclass the script's <c>ctx</c> will be of — the compiler uses its assembly
/// to resolve <c>ctx.GetTag(...)</c> calls.
/// </summary>
public static ScriptOptions Build(Type contextType)
public static SandboxConfig Build(Type contextType)
{
if (contextType is null) throw new ArgumentNullException(nameof(contextType));
if (!typeof(ScriptContext).IsAssignableFrom(contextType))
throw new ArgumentException(
$"Script context type must derive from {nameof(ScriptContext)}", nameof(contextType));
// Allow-listed assemblies — each explicitly chosen. Adding here is a
// plan-level decision; do not expand casually. HashSet so adding the
// contextType's assembly is idempotent when it happens to be Core.Scripting
// already.
var allowedAssemblies = new HashSet<System.Reflection.Assembly>
// OtOpcUa-owned assemblies — pinned by typeof(...) so they survive a rename.
var pinnedAssemblies = new HashSet<System.Reflection.Assembly>
{
// System.Private.CoreLib — primitives (int, double, bool, string, DateTime,
// TimeSpan, Math, Convert, nullable<T>). Can't practically script without it.
typeof(object).Assembly,
// System.Linq — IEnumerable extensions (Where / Select / Sum / Average / etc.).
typeof(System.Linq.Enumerable).Assembly,
// Core.Abstractions — DataValueSnapshot + DriverDataType so scripts can name
// the types they receive from ctx.GetTag.
typeof(DataValueSnapshot).Assembly,
@@ -72,7 +66,23 @@ public static class ScriptSandbox
contextType.Assembly,
};
var allowedImports = new[]
// BCL references. We list the trusted-platform-assemblies set restricted to
// System.* and netstandard so the synthesized wrapper can reference every BCL
// type by FQN — including the ones we forbid (HttpClient, File, Process,
// Registry, etc.). Letting those types resolve at compile is intentional: the
// hard security gate is ForbiddenTypeAnalyzer in step 3 of the compile pipeline
// (Core.Scripting-001 / -002 established the analyzer must be the sole gate
// because type forwarding makes any references-list-only restriction porous).
// The references list now serves only as scoping hygiene — out-of-band BCL
// surface (operator-authored hosting helpers, third-party packages, app code)
// is not on the list and stays unreachable.
var references = new List<MetadataReference>();
foreach (var asm in pinnedAssemblies)
references.Add(MetadataReference.CreateFromFile(asm.Location));
foreach (var path in EnumerateBclAssemblyPaths())
references.Add(MetadataReference.CreateFromFile(path));
var imports = new[]
{
"System",
"System.Linq",
@@ -80,8 +90,56 @@ public static class ScriptSandbox
"ZB.MOM.WW.OtOpcUa.Core.Scripting",
};
return ScriptOptions.Default
.WithReferences(allowedAssemblies)
.WithImports(allowedImports);
return new SandboxConfig(references, imports);
}
private static IEnumerable<string> EnumerateBclAssemblyPaths()
{
// The .NET host advertises the resolved runtime-shared-framework + BCL DLL set
// via the TRUSTED_PLATFORM_ASSEMBLIES AppContext data slot. This is what the
// ALC fallback uses when resolving assemblies, so anything in here is already
// loadable by the host process. We restrict to System.* and netstandard to keep
// the script's reachable surface to the BCL — anything else (Microsoft.*,
// application code, third-party packages happening to be in the runtime store)
// would expand the analyzer's deny-list job unnecessarily.
var raw = (string?)AppContext.GetData("TRUSTED_PLATFORM_ASSEMBLIES");
if (string.IsNullOrEmpty(raw))
yield break;
var separator = OperatingSystem.IsWindows() ? ';' : ':';
foreach (var path in raw.Split(separator, StringSplitOptions.RemoveEmptyEntries))
{
var name = System.IO.Path.GetFileName(path);
if (name.StartsWith("System.", StringComparison.Ordinal) ||
string.Equals(name, "netstandard.dll", StringComparison.Ordinal) ||
string.Equals(name, "mscorlib.dll", StringComparison.Ordinal) ||
// Microsoft.Win32.Registry isn't a System.* DLL but the analyzer's
// Microsoft.Win32 deny-list relies on the type being resolvable so it
// can identify + reject it (Core.Scripting-001 / -002). Add the one
// DLL we need rather than broadening to Microsoft.* (which would also
// pull in compilers, build tooling, etc.).
string.Equals(name, "Microsoft.Win32.Registry.dll", StringComparison.Ordinal))
{
yield return path;
}
}
}
}
/// <summary>
/// Compile-time sandbox configuration. Returned by <see cref="ScriptSandbox.Build"/>;
/// consumed by <see cref="ScriptEvaluator{TContext, TResult}"/>'s manual
/// <c>CSharpCompilation</c> path.
/// </summary>
/// <param name="References">
/// Metadata references (allow-listed assemblies) the script compilation is built
/// against. Anything not in this set is unresolved at compile, which is the sandbox's
/// first line of defense — <see cref="ForbiddenTypeAnalyzer"/> is the second.
/// </param>
/// <param name="Imports">
/// Namespaces pre-imported into the wrapper compilation as <c>using</c> directives
/// so scripts can write <c>Math.Abs</c> rather than <c>System.Math.Abs</c>.
/// </param>
public sealed record SandboxConfig(
IReadOnlyList<MetadataReference> References,
IReadOnlyList<string> Imports);
@@ -37,6 +37,21 @@ public sealed class VirtualTagEngine : IDisposable
private readonly DependencyGraph _graph = new();
private readonly Dictionary<string, VirtualTagState> _tags = new(StringComparer.Ordinal);
/// <summary>
/// Compile cache for every virtual-tag script. Routes <see cref="Load"/>'s
/// <see cref="ScriptEvaluator{TContext, TResult}.Compile"/> calls through the
/// cache so the collectible <see cref="System.Runtime.Loader.AssemblyLoadContext"/>
/// each compile produces is actually disposed on the publish-replace path
/// (Core.Scripting-016): the cache's <see cref="CompiledScriptCache{TContext, TResult}.Clear"/>
/// disposes every materialised evaluator before dropping its dictionary entry,
/// so a config-publish releases the prior generation's ALCs and the per-publish
/// accretion the Core.Scripting-008 fix targeted is actually freed in production.
/// Pre-fix the engine called <c>ScriptEvaluator.Compile</c> directly, which left
/// the ALCs rooted until the process exited — defeating -008 on the real path.
/// </summary>
private readonly CompiledScriptCache<VirtualTagContext, object?> _compileCache = new();
private readonly ConcurrentDictionary<string, DataValueSnapshot> _valueCache = new(StringComparer.Ordinal);
private readonly ConcurrentDictionary<string, List<Action<string, DataValueSnapshot>>> _observers
= new(StringComparer.Ordinal);
@@ -74,6 +89,10 @@ public sealed class VirtualTagEngine : IDisposable
UnsubscribeFromUpstream();
_tags.Clear();
_graph.Clear();
// Dispose every compiled-script ALC from the prior generation BEFORE we
// recompile this one. Skipping this is what made Core.Scripting-008 a
// no-op in production (Core.Scripting-016).
_compileCache.Clear();
var compileFailures = new List<string>();
var seenPaths = new HashSet<string>(StringComparer.Ordinal);
@@ -102,7 +121,9 @@ public sealed class VirtualTagEngine : IDisposable
continue;
}
var evaluator = ScriptEvaluator<VirtualTagContext, object?>.Compile(def.ScriptSource);
// Route through CompiledScriptCache so the emitted assembly's collectible
// ALC participates in publish-replace cleanup. (Core.Scripting-016)
var evaluator = _compileCache.GetOrCompile(def.ScriptSource);
var timed = new TimedScriptEvaluator<VirtualTagContext, object?>(evaluator, _scriptTimeout);
var scriptLogger = _loggerFactory.Create(def.Path);
@@ -481,6 +502,9 @@ public sealed class VirtualTagEngine : IDisposable
UnsubscribeFromUpstream();
_tags.Clear();
_graph.Clear();
// Dispose every compiled-script ALC so the engine's shutdown actually
// releases the emitted assemblies. (Core.Scripting-016)
_compileCache.Dispose();
}
internal DependencyGraph GraphForTesting => _graph;
@@ -116,12 +116,17 @@ public static class SnapshotFormatter
{
0x00000000u => "Good",
0x80000000u => "Bad",
0x80020000u => "BadInternalError",
0x80050000u => "BadCommunicationError",
0x800A0000u => "BadTimeout",
0x80310000u => "BadNoCommunication",
0x80320000u => "BadWaitingForInitialData",
0x80340000u => "BadNodeIdUnknown",
0x80330000u => "BadNodeIdInvalid",
0x803B0000u => "BadNotWritable",
0x803C0000u => "BadOutOfRange",
0x803D0000u => "BadNotSupported",
0x808B0000u => "BadDeviceFailure",
0x80740000u => "BadTypeMismatch",
0x40000000u => "Uncertain",
_ => null,
@@ -41,7 +41,7 @@ public static class AbCipStatusMapper
public const uint BadNotWritable = 0x803B0000u;
public const uint BadOutOfRange = 0x803C0000u;
public const uint BadNotSupported = 0x803D0000u;
public const uint BadDeviceFailure = 0x80550000u;
public const uint BadDeviceFailure = 0x808B0000u;
public const uint BadCommunicationError = 0x80050000u;
public const uint BadTimeout = 0x800A0000u;
public const uint BadTypeMismatch = 0x80730000u;
@@ -16,7 +16,7 @@ public static class AbLegacyStatusMapper
public const uint BadNotWritable = 0x803B0000u;
public const uint BadOutOfRange = 0x803C0000u;
public const uint BadNotSupported = 0x803D0000u;
public const uint BadDeviceFailure = 0x80550000u;
public const uint BadDeviceFailure = 0x808B0000u;
public const uint BadCommunicationError = 0x80050000u;
public const uint BadTimeout = 0x800A0000u;
public const uint BadTypeMismatch = 0x80730000u;
@@ -14,7 +14,7 @@ public static class FocasStatusMapper
public const uint BadNotWritable = 0x803B0000u;
public const uint BadOutOfRange = 0x803C0000u;
public const uint BadNotSupported = 0x803D0000u;
public const uint BadDeviceFailure = 0x80550000u;
public const uint BadDeviceFailure = 0x808B0000u;
public const uint BadCommunicationError = 0x80050000u;
public const uint BadTimeout = 0x800A0000u;
public const uint BadTypeMismatch = 0x80730000u;
@@ -1,6 +1,6 @@
using Microsoft.Extensions.Logging;
using Microsoft.Extensions.Logging.Abstractions;
using MxGateway.Contracts.Proto.Galaxy;
using ZB.MOM.WW.MxGateway.Contracts.Proto.Galaxy;
using ZB.MOM.WW.OtOpcUa.Core.Abstractions;
namespace ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Browse;
@@ -1,4 +1,4 @@
using MxGateway.Contracts.Proto.Galaxy;
using ZB.MOM.WW.MxGateway.Contracts.Proto.Galaxy;
using ZB.MOM.WW.OtOpcUa.Core.Abstractions;
namespace ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Browse;
@@ -1,5 +1,5 @@
using MxGateway.Client;
using MxGateway.Contracts.Proto.Galaxy;
using ZB.MOM.WW.MxGateway.Client;
using ZB.MOM.WW.MxGateway.Contracts.Proto.Galaxy;
namespace ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Browse;
@@ -1,5 +1,5 @@
using MxGateway.Client;
using MxGateway.Contracts.Proto.Galaxy;
using ZB.MOM.WW.MxGateway.Client;
using ZB.MOM.WW.MxGateway.Contracts.Proto.Galaxy;
namespace ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Browse;
@@ -1,4 +1,4 @@
using MxGateway.Contracts.Proto.Galaxy;
using ZB.MOM.WW.MxGateway.Contracts.Proto.Galaxy;
namespace ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Browse;
@@ -1,4 +1,4 @@
using MxGateway.Contracts.Proto.Galaxy;
using ZB.MOM.WW.MxGateway.Contracts.Proto.Galaxy;
namespace ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Browse;
@@ -1,4 +1,4 @@
using MxGateway.Contracts.Proto.Galaxy;
using ZB.MOM.WW.MxGateway.Contracts.Proto.Galaxy;
using ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Runtime;
namespace ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Browse;
@@ -1,7 +1,7 @@
using Microsoft.Extensions.Logging;
using Microsoft.Extensions.Logging.Abstractions;
using MxGateway.Client;
using MxGateway.Contracts.Proto;
using ZB.MOM.WW.MxGateway.Client;
using ZB.MOM.WW.MxGateway.Contracts.Proto;
using ZB.MOM.WW.OtOpcUa.Core.Abstractions;
using ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Browse;
using ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Config;
@@ -2,7 +2,7 @@ using System.Diagnostics.Metrics;
using System.Threading.Channels;
using Microsoft.Extensions.Logging;
using Microsoft.Extensions.Logging.Abstractions;
using MxGateway.Contracts.Proto;
using ZB.MOM.WW.MxGateway.Contracts.Proto;
using ZB.MOM.WW.OtOpcUa.Core.Abstractions;
namespace ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Runtime;
@@ -1,6 +1,6 @@
using Microsoft.Extensions.Logging;
using Microsoft.Extensions.Logging.Abstractions;
using MxGateway.Client;
using ZB.MOM.WW.MxGateway.Client;
using ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Config;
namespace ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Runtime;
@@ -1,6 +1,6 @@
using Microsoft.Extensions.Logging;
using MxGateway.Client;
using MxGateway.Contracts.Proto;
using ZB.MOM.WW.MxGateway.Client;
using ZB.MOM.WW.MxGateway.Contracts.Proto;
namespace ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Runtime;
@@ -1,7 +1,7 @@
using System.Diagnostics.Metrics;
using Microsoft.Extensions.Logging;
using Microsoft.Extensions.Logging.Abstractions;
using MxGateway.Contracts.Proto;
using ZB.MOM.WW.MxGateway.Contracts.Proto;
namespace ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Runtime;
@@ -1,8 +1,8 @@
using System.Collections.Concurrent;
using Microsoft.Extensions.Logging;
using Microsoft.Extensions.Logging.Abstractions;
using MxGateway.Client;
using MxGateway.Contracts.Proto;
using ZB.MOM.WW.MxGateway.Client;
using ZB.MOM.WW.MxGateway.Contracts.Proto;
using ZB.MOM.WW.OtOpcUa.Core.Abstractions;
namespace ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Runtime;
@@ -1,5 +1,5 @@
using MxGateway.Client;
using MxGateway.Contracts.Proto;
using ZB.MOM.WW.MxGateway.Client;
using ZB.MOM.WW.MxGateway.Contracts.Proto;
// Use the generated nested status enum for the SetBufferedUpdateInterval reply check.
namespace ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Runtime;
@@ -1,4 +1,4 @@
using MxGateway.Contracts.Proto;
using ZB.MOM.WW.MxGateway.Contracts.Proto;
namespace ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Runtime;
@@ -1,5 +1,5 @@
using Google.Protobuf.WellKnownTypes;
using MxGateway.Contracts.Proto;
using ZB.MOM.WW.MxGateway.Contracts.Proto;
namespace ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Runtime;
@@ -1,5 +1,5 @@
using Google.Protobuf.WellKnownTypes;
using MxGateway.Contracts.Proto;
using ZB.MOM.WW.MxGateway.Contracts.Proto;
namespace ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Runtime;
@@ -1,6 +1,6 @@
using Microsoft.Extensions.Logging;
using MxGateway.Client;
using MxGateway.Contracts.Proto;
using ZB.MOM.WW.MxGateway.Client;
using ZB.MOM.WW.MxGateway.Contracts.Proto;
namespace ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Runtime;
@@ -1,5 +1,5 @@
using System.Runtime.CompilerServices;
using MxGateway.Contracts.Proto;
using ZB.MOM.WW.MxGateway.Contracts.Proto;
namespace ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Runtime;
@@ -15,10 +15,18 @@
<ItemGroup>
<ProjectReference Include="..\..\Core\ZB.MOM.WW.OtOpcUa.Core.Abstractions\ZB.MOM.WW.OtOpcUa.Core.Abstractions.csproj"/>
<ProjectReference Include="..\..\Core\ZB.MOM.WW.OtOpcUa.Core\ZB.MOM.WW.OtOpcUa.Core.csproj"/>
<!-- mxaccessgw .NET client. Path-based ProjectReference because both repos sit
side-by-side on the dev box; long-term we'll consume MxGateway.Client as a
NuGet package. PR 4.W revisits the dependency shape before parity gating. -->
<ProjectReference Include="..\..\..\..\mxaccessgw\clients\dotnet\MxGateway.Client\MxGateway.Client.csproj"/>
</ItemGroup>
<ItemGroup>
<!-- Sibling mxaccessgw repo's .NET client + contracts. The sibling restored
a proper client library under clients/dotnet/ (May 2026), so this is
back on a path-based ProjectReference per the libs/README unwind plan #1.
Both projects target net10.0; the Contracts project transitively pulls
Google.Protobuf + Grpc.Core.Api, the Client project transitively pulls
Grpc.Net.Client + Polly.Core + Microsoft.Extensions.Logging.Abstractions,
so the explicit PackageReference shims that backfilled the vendored
binary references are no longer needed. -->
<ProjectReference Include="..\..\..\..\mxaccessgw\clients\dotnet\ZB.MOM.WW.MxGateway.Client\ZB.MOM.WW.MxGateway.Client.csproj"/>
</ItemGroup>
<ItemGroup>
@@ -1511,7 +1511,7 @@ public sealed class ModbusDriver
private const uint StatusBadNotWritable = 0x803B0000u;
private const uint StatusBadOutOfRange = 0x803C0000u;
private const uint StatusBadNotSupported = 0x803D0000u;
private const uint StatusBadDeviceFailure = 0x80550000u;
private const uint StatusBadDeviceFailure = 0x808B0000u;
private const uint StatusBadCommunicationError = 0x80050000u;
/// <summary>
@@ -69,7 +69,7 @@ public sealed class S7Driver(S7DriverOptions options, string driverInstanceId, I
/// <summary>OPC UA StatusCode used for socket / timeout / protocol-layer faults.</summary>
private const uint StatusBadCommunicationError = 0x80050000u;
/// <summary>OPC UA StatusCode used for a genuine device fault (CPU error, hardware fault).</summary>
private const uint StatusBadDeviceFailure = 0x80550000u;
private const uint StatusBadDeviceFailure = 0x808B0000u;
private readonly Dictionary<string, S7TagDefinition> _tagsByName = new(StringComparer.OrdinalIgnoreCase);
private readonly Dictionary<string, S7ParsedAddress> _parsedByName = new(StringComparer.OrdinalIgnoreCase);
@@ -16,7 +16,7 @@ public static class TwinCATStatusMapper
public const uint BadNotWritable = 0x803B0000u;
public const uint BadOutOfRange = 0x803C0000u;
public const uint BadNotSupported = 0x803D0000u;
public const uint BadDeviceFailure = 0x80550000u;
public const uint BadDeviceFailure = 0x808B0000u;
public const uint BadCommunicationError = 0x80050000u;
public const uint BadTimeout = 0x800A0000u;
public const uint BadTypeMismatch = 0x80730000u;
@@ -101,29 +101,44 @@ else
{
<Generations ClusterId="@ClusterId"/>
}
else if (_tab == "equipment" && _currentDraft is not null)
else if (_tab is "equipment" or "uns" or "namespaces" or "drivers" or "tags" or "acls")
{
<EquipmentTab GenerationId="@_currentDraft.GenerationId"/>
}
else if (_tab == "uns" && _currentDraft is not null)
{
<UnsTab GenerationId="@_currentDraft.GenerationId" ClusterId="@ClusterId"/>
}
else if (_tab == "namespaces" && _currentDraft is not null)
{
<NamespacesTab GenerationId="@_currentDraft.GenerationId" ClusterId="@ClusterId"/>
}
else if (_tab == "drivers" && _currentDraft is not null)
{
<DriversTab GenerationId="@_currentDraft.GenerationId" ClusterId="@ClusterId"/>
}
else if (_tab == "tags" && _currentDraft is not null)
{
<TagsTab GenerationId="@_currentDraft.GenerationId" ClusterId="@ClusterId"/>
}
else if (_tab == "acls" && _currentDraft is not null)
{
<AclsTab GenerationId="@_currentDraft.GenerationId" ClusterId="@ClusterId"/>
@* Bug #10 fix — these six tabs are scoped to a generation. Per docs/v2/admin-ui.md the
design intent is a read-only view of the published generation when no draft is open
("Edit in draft" affordance), and the editable view of the draft when one is open.
The earlier implementation rendered nothing in the no-draft case, leaving operators
with just the "Open a draft to edit" placeholder. We now route both states through
the same tab components, gating edits via <fieldset disabled> so a button click in
the read-only state cannot silently mutate the published rows even though the tab
components themselves haven't been refactored to honor an IsReadOnly flag yet. *@
var genId = _currentDraft?.GenerationId ?? _currentPublished?.GenerationId;
var isReadOnly = _currentDraft is null;
if (genId is null)
{
<section class="panel notice rise" style="animation-delay:.02s">
No published generation yet. Click <strong>New draft</strong> above to author this cluster's first generation.
</section>
}
else
{
if (isReadOnly)
{
<section class="panel notice rise mb-3" style="animation-delay:.02s">
<strong>Read-only view</strong> of published generation @genId. Click <strong>New draft</strong> above to make changes.
</section>
}
<fieldset disabled="@isReadOnly" style="border:0;padding:0;margin:0;min-width:0;">
@switch (_tab)
{
case "equipment": <EquipmentTab GenerationId="@genId.Value"/> break;
case "uns": <UnsTab GenerationId="@genId.Value" ClusterId="@ClusterId"/> break;
case "namespaces": <NamespacesTab GenerationId="@genId.Value" ClusterId="@ClusterId"/> break;
case "drivers": <DriversTab GenerationId="@genId.Value" ClusterId="@ClusterId"/> break;
case "tags": <TagsTab GenerationId="@genId.Value" ClusterId="@ClusterId"/> break;
case "acls": <AclsTab GenerationId="@genId.Value" ClusterId="@ClusterId"/> break;
}
</fieldset>
}
}
else if (_tab == "redundancy")
{
@@ -133,10 +148,6 @@ else
{
<AuditTab ClusterId="@ClusterId"/>
}
else
{
<section class="panel notice rise" style="animation-delay:.02s">Open a draft to edit this cluster's content.</section>
}
}
@code {
@@ -16,12 +16,40 @@ public sealed class ClusterNodeService(OtOpcUaConfigDbContext db)
/// tolerance covers a missed heartbeat plus publisher GC pauses.</summary>
public static readonly TimeSpan StaleThreshold = TimeSpan.FromSeconds(30);
public Task<List<ClusterNode>> ListByClusterAsync(string clusterId, CancellationToken ct) =>
db.ClusterNodes.AsNoTracking()
public async Task<List<ClusterNode>> ListByClusterAsync(string clusterId, CancellationToken ct)
{
var nodes = await db.ClusterNodes.AsNoTracking()
.Where(n => n.ClusterId == clusterId)
.OrderByDescending(n => n.ServiceLevelBase)
.ThenBy(n => n.NodeId)
.ToListAsync(ct);
.ToListAsync(ct).ConfigureAwait(false);
// Bug #12 fix follow-up — the live-node heartbeat lands on
// ClusterNodeGenerationState.LastSeenAt (written by sp_RegisterNodeGenerationApplied
// on every generation poll). The ClusterNode.LastSeenAt column is a legacy slot that
// no current writer maintains, so reading it directly would show "never STALE"
// forever for every running node. Overlay the GenerationState heartbeat onto the
// returned ClusterNode rows when it's more recent so the Redundancy tab + IsStale
// predicate reflect actual liveness without needing a new write path or schema change.
var nodeIds = nodes.Select(n => n.NodeId).ToList();
if (nodeIds.Count > 0)
{
var heartbeats = await db.ClusterNodeGenerationStates.AsNoTracking()
.Where(s => nodeIds.Contains(s.NodeId))
.Select(s => new { s.NodeId, s.LastSeenAt })
.ToListAsync(ct).ConfigureAwait(false);
var beatByNode = heartbeats.ToDictionary(s => s.NodeId, s => s.LastSeenAt);
foreach (var n in nodes)
{
if (beatByNode.TryGetValue(n.NodeId, out var hb) && hb is not null
&& (n.LastSeenAt is null || hb > n.LastSeenAt))
{
n.LastSeenAt = hb;
}
}
}
return nodes;
}
public static bool IsStale(ClusterNode node) =>
node.LastSeenAt is null || DateTime.UtcNow - node.LastSeenAt.Value > StaleThreshold;
@@ -1,6 +1,7 @@
using Microsoft.Data.SqlClient;
using Microsoft.Extensions.Hosting;
using Microsoft.Extensions.Logging;
using ZB.MOM.WW.OtOpcUa.Configuration.Enums;
using ZB.MOM.WW.OtOpcUa.Server.Redundancy;
namespace ZB.MOM.WW.OtOpcUa.Server.Hosting;
@@ -42,10 +43,20 @@ public sealed class GenerationRefreshHostedService(
RedundancyCoordinator coordinator,
ILogger<GenerationRefreshHostedService> logger,
TimeSpan? tickInterval = null,
Func<CancellationToken, Task<long?>>? currentGenerationQuery = null) : BackgroundService
Func<CancellationToken, Task<long?>>? currentGenerationQuery = null,
Func<long, NodeApplyStatus, string?, CancellationToken, Task>? registerAppliedAsync = null) : BackgroundService
{
private readonly Func<CancellationToken, Task<long?>> _generationQuery = currentGenerationQuery
?? new Func<CancellationToken, Task<long?>>(ct => DefaultQueryCurrentGenerationAsync(options, logger, ct));
// Bug #12 fix — the server now reports applied-generation state + heartbeat back to the
// central DB via sp_RegisterNodeGenerationApplied. Before this wiring the proc had zero
// callers, so dbo.ClusterNodeGenerationState stayed empty for every node and the Admin UI
// Fleet status page + cluster-detail Redundancy LastSeenAt both showed "no node state /
// never STALE" indefinitely. Tests inject a stub via the registerAppliedAsync parameter.
private readonly Func<long, NodeApplyStatus, string?, CancellationToken, Task> _registerApplied = registerAppliedAsync
?? new Func<long, NodeApplyStatus, string?, CancellationToken, Task>(
(gen, status, err, ct) => DefaultRegisterAppliedAsync(options, logger, gen, status, err, ct));
/// <summary>
/// How often the service polls <c>sp_GetCurrentGenerationForCluster</c>. Default 5 s —
/// low enough that operator publishes take effect promptly, high enough that the
@@ -97,6 +108,18 @@ public sealed class GenerationRefreshHostedService(
if (LastAppliedGenerationId is long last && current == last)
{
// Heartbeat — re-stamps LastSeenAt on dbo.ClusterNodeGenerationState so the Admin
// Fleet status page + cluster Redundancy tab can detect the node is alive without
// a generation change. Best-effort: a transient DB error here must not throw out of
// the tick (the next tick will retry) and must not block applies.
try
{
await _registerApplied(current.Value, NodeApplyStatus.Applied, null, cancellationToken).ConfigureAwait(false);
}
catch (Exception hbEx) when (hbEx is not OperationCanceledException)
{
logger.LogDebug(hbEx, "Heartbeat to sp_RegisterNodeGenerationApplied failed; will retry next tick");
}
return; // no change
}
@@ -109,14 +132,44 @@ public sealed class GenerationRefreshHostedService(
// lease is open. Publisher ticks in parallel (1s cadence) will observe the band
// transition and push it onto the OPC UA Server.ServiceLevel node.
var publishRequestId = Guid.NewGuid();
await using (leases.BeginApplyLease(current.Value, publishRequestId))
NodeApplyStatus applyStatus;
string? applyError = null;
try
{
await coordinator.RefreshAsync(cancellationToken).ConfigureAwait(false);
// Future: fire a domain event that driver hosts / virtual-tag engine /
// scripted-alarm engine subscribe to. For now the topology refresh is the
// only thing we rewire — everything else still requires a process restart.
await using (leases.BeginApplyLease(current.Value, publishRequestId))
{
await coordinator.RefreshAsync(cancellationToken).ConfigureAwait(false);
// Future: fire a domain event that driver hosts / virtual-tag engine /
// scripted-alarm engine subscribe to. For now the topology refresh is the
// only thing we rewire — everything else still requires a process restart.
}
applyStatus = NodeApplyStatus.Applied;
}
catch (Exception applyEx) when (applyEx is not OperationCanceledException)
{
applyStatus = NodeApplyStatus.Failed;
applyError = applyEx.Message;
logger.LogError(applyEx, "Apply of generation {Generation} failed; will report Failed status to central DB", current);
// fall through to register so operators see the failed apply in /fleet
}
// Always tell the central DB what happened with this apply attempt — success or
// failure. The proc upserts dbo.ClusterNodeGenerationState (CurrentGenerationId +
// LastAppliedAt + LastAppliedStatus + LastAppliedError + LastSeenAt). Failure here
// mustn't prevent us from advancing LastAppliedGenerationId — the apply already
// happened (or already failed); the publish is purely observability.
try
{
await _registerApplied(current.Value, applyStatus, applyError, cancellationToken).ConfigureAwait(false);
}
catch (Exception regEx) when (regEx is not OperationCanceledException)
{
logger.LogWarning(regEx, "sp_RegisterNodeGenerationApplied call failed for gen {Generation} status {Status}", current, applyStatus);
}
// Advance the cursor even on Failed — the proc has been told; next tick will heartbeat
// and a future generation will trigger a fresh apply attempt. Pinning the cursor on
// failure would loop us through the same broken apply every 5s.
LastAppliedGenerationId = current;
RefreshCount++;
}
@@ -157,4 +210,35 @@ public sealed class GenerationRefreshHostedService(
return null;
}
}
/// <summary>
/// Default register-applied implementation — calls <c>sp_RegisterNodeGenerationApplied</c>
/// to MERGE-upsert <see cref="ZB.MOM.WW.OtOpcUa.Configuration.Entities.ClusterNodeGenerationState"/>
/// for this node. Called both at apply completion (success or failure) and on every
/// no-change heartbeat tick so <c>LastSeenAt</c> stays fresh in the central DB and the
/// Admin UI Fleet status page + Redundancy LastSeenAt indicator can detect a healthy node.
/// Bug #12 fix — wires the previously-orphaned proc into the apply loop.
/// </summary>
private static async Task DefaultRegisterAppliedAsync(
NodeOptions options,
ILogger logger,
long generationId,
NodeApplyStatus status,
string? error,
CancellationToken cancellationToken)
{
await using var conn = new SqlConnection(options.ConfigDbConnectionString);
await conn.OpenAsync(cancellationToken).ConfigureAwait(false);
await using var cmd = conn.CreateCommand();
cmd.CommandText = "EXEC dbo.sp_RegisterNodeGenerationApplied @NodeId=@n, @GenerationId=@g, @Status=@s, @Error=@e";
cmd.Parameters.AddWithValue("@n", options.NodeId);
cmd.Parameters.AddWithValue("@g", generationId);
cmd.Parameters.AddWithValue("@s", status.ToString());
cmd.Parameters.AddWithValue("@e", (object?)error ?? DBNull.Value);
await cmd.ExecuteNonQueryAsync(cancellationToken).ConfigureAwait(false);
// Single-line trace so soak runs can see heartbeat ticks without flooding at Info.
logger.LogTrace("Reported gen {Generation} status {Status} to central DB", generationId, status);
}
}
@@ -925,4 +925,165 @@ public sealed class ScriptedAlarmEngineTests
public Task RemoveAsync(string alarmId, CancellationToken ct)
=> _inner.RemoveAsync(alarmId, ct);
}
// --- Core.ScriptedAlarms-009: per-alarm evaluation-scratch reuse ---
[Fact]
public async Task Reevaluation_reuses_the_same_read_cache_dictionary()
{
// Pre-009 the engine allocated a fresh Dictionary<string, DataValueSnapshot>
// on every upstream-change tick; on a busy line this was a steady allocation
// stream on the hot path. The fix: one dictionary per alarm, refilled in place
// under _evalGate. Test asserts the dictionary instance is identical across
// two consecutive evaluations of the same alarm.
var up = new FakeUpstream();
up.Set("Temp", 50);
using var eng = Build(up, out _);
await eng.LoadAsync(
[Alarm("HighTemp", """return (int)ctx.GetTag("Temp").Value > 100;""")],
TestContext.Current.CancellationToken);
// First evaluation runs during LoadAsync. Capture the scratch reference now.
var scratchAfterLoad = eng.TryGetScratchReadCacheForTest("HighTemp");
scratchAfterLoad.ShouldNotBeNull(
"the scratch should have been allocated during LoadAsync's initial evaluation");
// Trigger a re-evaluation by pushing an upstream change.
up.Push("Temp", 150);
await WaitForAsync(() =>
eng.GetState("HighTemp")!.Active == AlarmActiveState.Active);
var scratchAfterPush = eng.TryGetScratchReadCacheForTest("HighTemp");
ReferenceEquals(scratchAfterLoad, scratchAfterPush).ShouldBeTrue(
"the read-cache dictionary must be the *same* instance across evaluations " +
"(Core.ScriptedAlarms-009) — a per-call allocation would defeat the fix.");
scratchAfterPush!["Temp"].Value.ShouldBe(150, "refill must update the existing dictionary in place");
}
[Fact]
public async Task Reevaluation_reuses_the_same_predicate_context()
{
// The context wraps the read-cache by reference; refilling the dictionary
// updates what the script sees. Reusing the context spares a per-call object
// allocation as well as the dictionary one.
var up = new FakeUpstream();
up.Set("Temp", 50);
using var eng = Build(up, out _);
await eng.LoadAsync(
[Alarm("HighTemp", """return (int)ctx.GetTag("Temp").Value > 100;""")],
TestContext.Current.CancellationToken);
var ctxAfterLoad = eng.TryGetScratchContextForTest("HighTemp");
ctxAfterLoad.ShouldNotBeNull();
up.Push("Temp", 150);
await WaitForAsync(() =>
eng.GetState("HighTemp")!.Active == AlarmActiveState.Active);
var ctxAfterPush = eng.TryGetScratchContextForTest("HighTemp");
ReferenceEquals(ctxAfterLoad, ctxAfterPush).ShouldBeTrue(
"the AlarmPredicateContext must be reused across evaluations (Core.ScriptedAlarms-009).");
}
[Fact]
public async Task LoadAsync_drops_the_prior_generations_scratch()
{
// A config-publish recreates AlarmStates with potentially different Inputs +
// Loggers; reusing the prior generation's scratch would attach an outdated
// logger to the new alarm. LoadAsync must clear _scratchByAlarmId so the
// next evaluation lazily re-allocates against the fresh AlarmState.
var up = new FakeUpstream();
up.Set("Temp", 50);
using var eng = Build(up, out _);
await eng.LoadAsync(
[Alarm("HighTemp", """return (int)ctx.GetTag("Temp").Value > 100;""")],
TestContext.Current.CancellationToken);
var scratchAfterFirstLoad = eng.TryGetScratchReadCacheForTest("HighTemp");
scratchAfterFirstLoad.ShouldNotBeNull();
// Second LoadAsync — same alarm id, same predicate, but the scratch should be
// wiped and re-allocated on the next evaluation. (LoadAsync itself triggers a
// first evaluation, so the scratch is reborn before we look.)
await eng.LoadAsync(
[Alarm("HighTemp", """return (int)ctx.GetTag("Temp").Value > 100;""")],
TestContext.Current.CancellationToken);
var scratchAfterSecondLoad = eng.TryGetScratchReadCacheForTest("HighTemp");
scratchAfterSecondLoad.ShouldNotBeNull();
ReferenceEquals(scratchAfterFirstLoad, scratchAfterSecondLoad).ShouldBeFalse(
"LoadAsync must drop the prior generation's scratch — reuse across a publish " +
"would attach a stale Logger / Inputs to the new alarm definition.");
}
// --- Core.Scripting-016: engine routes compiles through CompiledScriptCache ---
[Fact]
public void Dispose_unloads_compiled_predicate_assembly()
{
// Pre-fix the engine called ScriptEvaluator.Compile directly, so the
// emitted predicate assembly's ALC stayed loaded for the process lifetime.
// After the fix the engine routes through CompiledScriptCache; engine
// Dispose triggers cache Dispose which unloads every cached evaluator's ALC.
// Assert via WeakReference + GC that the assembly is actually reclaimed.
// Helper is sync + [NoInlining] so its locals can't be kept alive by an
// async state machine (an earlier async version of this test failed because
// the state-machine struct held the evaluator past the method-end).
var weak = CompileAlarmAndCaptureWeak();
for (int i = 0; i < 10 && weak.IsAlive; i++)
{
GC.Collect();
GC.WaitForPendingFinalizers();
GC.Collect();
}
weak.IsAlive.ShouldBeFalse(
"engine Dispose must release the compiled-predicate assembly via " +
"CompiledScriptCache (Core.Scripting-016). If this fails the engine is " +
"back to calling ScriptEvaluator.Compile directly and -008's headline " +
"fix doesn't run in production.");
}
[System.Runtime.CompilerServices.MethodImpl(
System.Runtime.CompilerServices.MethodImplOptions.NoInlining)]
private static WeakReference CompileAlarmAndCaptureWeak()
{
var up = new FakeUpstream();
up.Set("Temp", 50);
var eng = Build(up, out _);
// Block on LoadAsync so this helper stays synchronous — an `async Task`
// wrapper would compile to a state machine whose generated struct keeps the
// local `eng` reference alive past the method's apparent end, defeating GC.
eng.LoadAsync(
[new ScriptedAlarmDefinition(
"HighTemp", "Plant/Line1", "HighTemp",
AlarmKind.AlarmCondition, AlarmSeverity.High,
"x",
"""return (int)ctx.GetTag("Temp").Value > 100;""")],
default).GetAwaiter().GetResult();
// Reach into the engine's compile cache via reflection — the field is
// private; we only need the Assembly reference, scoped to this NoInlining
// helper so the locals die when it returns.
var weak = ExtractEmittedAssemblyWeakRef(eng);
eng.Dispose();
return weak;
}
[System.Runtime.CompilerServices.MethodImpl(
System.Runtime.CompilerServices.MethodImplOptions.NoInlining)]
private static WeakReference ExtractEmittedAssemblyWeakRef(ScriptedAlarmEngine eng)
{
var cacheField = typeof(ScriptedAlarmEngine).GetField(
"_compileCache", System.Reflection.BindingFlags.Instance | System.Reflection.BindingFlags.NonPublic)!;
var cache = cacheField.GetValue(eng)!;
var cacheDictField = cache.GetType().GetField(
"_cache", System.Reflection.BindingFlags.Instance | System.Reflection.BindingFlags.NonPublic)!;
var cacheDict = (System.Collections.IDictionary)cacheDictField.GetValue(cache)!;
var lazy = cacheDict.Values.Cast<object>().Single();
var evaluator = lazy.GetType().GetProperty("Value")!.GetValue(lazy)!;
var del = (Delegate)evaluator.GetType().GetField(
"_func", System.Reflection.BindingFlags.Instance | System.Reflection.BindingFlags.NonPublic)!
.GetValue(evaluator)!;
return new WeakReference(del.Method.Module.Assembly);
}
}
@@ -221,4 +221,173 @@ public sealed class CompiledScriptCacheTests
Should.Throw<Exception>(() => cache.GetOrCompile("""return unknownIdentifier + 1;"""));
cache.Count.ShouldBe(0, "faulted Lazy must still be evicted after compile failure");
}
[Fact]
public void Clear_uses_value_scoped_TryRemove_so_a_race_inserted_entry_survives()
{
// Regression for Core.Scripting-014: Clear() previously used the key-only
// TryRemove(key, out var lazy) overload, which is value-blind. If a
// concurrent GetOrCompile re-inserts a fresh Lazy under the same key
// between Clear's snapshot and the TryRemove, the buggy overload evicts
// the fresh entry and disposes its evaluator while the concurrent
// caller still holds + intends to invoke it. The fixed
// TryRemove(KeyValuePair<,>) overload compares value identity.
//
// Single-threaded test that models the race window: snapshot the
// dictionary like Clear does, mutate the dictionary mid-flight (modelling
// Thread B's GetOrCompile), then drive the same TryRemove(KeyValuePair<,>)
// overload Clear now uses. The buggy key-only overload would evict the
// fresh entry; the fixed value-scoped overload leaves it alone. This
// verifies the *semantic* of Clear's code path — actually intercepting
// a real Clear() invocation mid-loop would require either two threads
// (flaky) or a Dispose side-effect that re-adds within the loop
// iteration (only works with multiple snapshotted entries).
var cache = new CompiledScriptCache<FakeScriptContext, int>();
var cacheField = typeof(CompiledScriptCache<FakeScriptContext, int>)
.GetField("_cache", System.Reflection.BindingFlags.Instance | System.Reflection.BindingFlags.NonPublic);
var dict = (System.Collections.Concurrent.ConcurrentDictionary<
string, Lazy<ScriptEvaluator<FakeScriptContext, int>>>)cacheField!.GetValue(cache)!;
var hashSourceMethod = typeof(CompiledScriptCache<FakeScriptContext, int>)
.GetMethod("HashSource", System.Reflection.BindingFlags.Static | System.Reflection.BindingFlags.NonPublic);
var key = (string)hashSourceMethod!.Invoke(null, ["""return 1;"""])!;
var firstGen = new Lazy<ScriptEvaluator<FakeScriptContext, int>>(
() => ScriptEvaluator<FakeScriptContext, int>.Compile("""return 1;"""),
LazyThreadSafetyMode.ExecutionAndPublication);
dict[key] = firstGen;
// Snapshot like Clear's first line does.
var snapshot = dict.ToArray();
snapshot.Length.ShouldBe(1);
// Race window: a concurrent GetOrCompile inserts a fresh Lazy under the
// same key. The dict now holds `fresh`, but the snapshot still references
// `firstGen`.
var fresh = new Lazy<ScriptEvaluator<FakeScriptContext, int>>(
() => ScriptEvaluator<FakeScriptContext, int>.Compile("""return 1;"""),
LazyThreadSafetyMode.ExecutionAndPublication);
dict[key] = fresh;
// The fixed Clear uses the value-scoped TryRemove(KeyValuePair<,>) overload.
// Drive it directly with our snapshot entry — the value comparison sees
// firstGen ≠ fresh and leaves the entry intact.
foreach (var entry in snapshot)
{
var removed = ((System.Collections.Generic.ICollection<
System.Collections.Generic.KeyValuePair<string, Lazy<ScriptEvaluator<FakeScriptContext, int>>>>)dict)
.Remove(entry);
removed.ShouldBeFalse(
"value-scoped removal must reject the stale snapshot entry — the dict " +
"now holds `fresh`, not `firstGen`. The buggy key-only TryRemove would " +
"have returned true and evicted the fresh entry (Core.Scripting-014).");
}
dict.ContainsKey(key).ShouldBeTrue(
"the fresh entry inserted during the race window must survive Clear().");
ReferenceEquals(dict[key], fresh).ShouldBeTrue(
"the surviving entry must be the fresh Lazy, not the one Clear() snapshotted.");
}
// --- Core.Scripting-008: collectible AssemblyLoadContext unload ---
[Fact]
public void Dispose_unloads_compiled_script_assembly_load_context()
{
// The whole point of switching the emit path off CSharpScript.CreateDelegate
// and onto a collectible ALC: after Dispose, the runtime can reclaim the
// emitted assembly. We assert this via a WeakReference to the compiled
// assembly itself — if the ALC unloads correctly the reference is dead after
// a forced GC; if the assembly stayed rooted (the pre-fix behaviour) the
// reference survives. The exact reclaim is GC-timing-sensitive, so we loop a
// bounded number of times to absorb GC scheduling noise.
var weak = CompileAndCaptureWeakAssembly();
// Help the runtime — ALC.Unload is *eligible-for-collection*, not synchronous.
for (int i = 0; i < 10 && weak.IsAlive; i++)
{
GC.Collect();
GC.WaitForPendingFinalizers();
GC.Collect();
}
weak.IsAlive.ShouldBeFalse(
"the collectible ALC must release the emitted script assembly after Dispose " +
"(Core.Scripting-008). If this fails, either the cache held a reference past " +
"Dispose, or a delegate/closure rooted the assembly in the default ALC.");
}
[System.Runtime.CompilerServices.MethodImpl(
System.Runtime.CompilerServices.MethodImplOptions.NoInlining)]
private static WeakReference CompileAndCaptureWeakAssembly()
{
// Isolation method so the JIT cannot keep the local references rooted past
// its return — without [NoInlining] the GC may decide the locals are still
// live and the WeakReference test becomes flaky.
var evaluator = ScriptEvaluator<FakeScriptContext, int>.Compile("""return 42;""");
var weak = new WeakReference(evaluator.GetType().Assembly is var asm &&
asm is not null ? GetEmittedAssembly(evaluator) : null);
evaluator.Dispose();
return weak;
}
[System.Runtime.CompilerServices.MethodImpl(
System.Runtime.CompilerServices.MethodImplOptions.NoInlining)]
private static object GetEmittedAssembly(ScriptEvaluator<FakeScriptContext, int> evaluator)
{
// The evaluator's delegate Method lives on the synthesized wrapper class in
// the emitted assembly. The delegate field is private; we reach it via a
// reflection probe rather than expose internals — keeps the public surface
// unchanged.
var funcField = typeof(ScriptEvaluator<FakeScriptContext, int>).GetField(
"_func", System.Reflection.BindingFlags.Instance | System.Reflection.BindingFlags.NonPublic)!;
var del = (Delegate)funcField.GetValue(evaluator)!;
return del.Method.Module.Assembly;
}
[Fact]
public void Clear_disposes_every_materialised_evaluator()
{
// Locks the contract that Clear() is publish-safe: after a config-generation
// publish drops the cache, every prior script's ALC should unload so the
// process memory plateaus rather than growing across publishes.
var weaks = CompileFiveAndCaptureWeakAssemblies();
for (int i = 0; i < 10 && weaks.Any(w => w.IsAlive); i++)
{
GC.Collect();
GC.WaitForPendingFinalizers();
GC.Collect();
}
weaks.ShouldAllBe(w => !w.IsAlive,
"after Clear() every compiled-script ALC must be unloadable " +
"(Core.Scripting-008). If this fails the publish-replace pattern leaks " +
"emitted assemblies, which is exactly the v3 concern this rewrite fixes.");
}
[System.Runtime.CompilerServices.MethodImpl(
System.Runtime.CompilerServices.MethodImplOptions.NoInlining)]
private static List<WeakReference> CompileFiveAndCaptureWeakAssemblies()
{
var cache = new CompiledScriptCache<FakeScriptContext, int>();
var weaks = new List<WeakReference>();
for (int i = 0; i < 5; i++)
{
// Distinct source per iteration so each compiles to its own assembly.
var e = cache.GetOrCompile($"""return {i};""");
var funcField = typeof(ScriptEvaluator<FakeScriptContext, int>).GetField(
"_func", System.Reflection.BindingFlags.Instance | System.Reflection.BindingFlags.NonPublic)!;
var del = (Delegate)funcField.GetValue(e)!;
weaks.Add(new WeakReference(del.Method.Module.Assembly));
}
cache.Count.ShouldBe(5);
cache.Clear();
cache.Count.ShouldBe(0);
return weaks;
}
[Fact]
public void GetOrCompile_after_Dispose_throws_ObjectDisposedException()
{
var cache = new CompiledScriptCache<FakeScriptContext, int>();
cache.GetOrCompile("""return 1;""");
cache.Dispose();
Should.Throw<ObjectDisposedException>(() => cache.GetOrCompile("""return 2;"""));
}
}
@@ -446,4 +446,88 @@ public sealed class ScriptSandboxTests
return 0;
"""));
}
// --- Core.Scripting-012: BCL refs broadened by -008/-016 re-exposed three vectors
// the original deny-list missed. Each is denied type-granularly in
// ForbiddenTypeAnalyzer.ForbiddenFullTypeNames; these tests pin the rejection.
[Fact]
public void Rejects_ThreadPool_QueueUserWorkItem_at_compile()
{
// System.Threading.ThreadPool.QueueUserWorkItem re-introduces the background-
// fanout threat Core.Scripting-003 closed against System.Threading.Tasks. The
// ThreadPool type lives in System.Threading (shared with allowed sync primitives
// like CancellationToken), so it must be denied type-granularly. (Core.Scripting-012.)
Should.Throw<ScriptSandboxViolationException>(() =>
ScriptEvaluator<FakeScriptContext, int>.Compile(
"""
System.Threading.ThreadPool.QueueUserWorkItem(_ => { });
return 0;
"""));
}
[Fact]
public void Rejects_Timer_new_at_compile()
{
// System.Threading.Timer schedules unbounded callback work that outlives the
// per-evaluation timeout. Same namespace as CancellationToken so type-granular.
// (Core.Scripting-012.)
Should.Throw<ScriptSandboxViolationException>(() =>
ScriptEvaluator<FakeScriptContext, int>.Compile(
"""
var t = new System.Threading.Timer(_ => { });
return 0;
"""));
}
[Fact]
public void Rejects_AssemblyLoadContext_at_compile()
{
// System.Runtime.Loader.AssemblyLoadContext loads arbitrary DLLs into the
// process — defense-in-depth gap that the BCL-wide reference set re-opened.
// System.Reflection already denies the typical invocation path, but the
// sandbox boundary stays type-explicit. (Core.Scripting-012.)
Should.Throw<ScriptSandboxViolationException>(() =>
ScriptEvaluator<FakeScriptContext, int>.Compile(
"""
var alc = System.Runtime.Loader.AssemblyLoadContext.Default;
return 0;
"""));
}
// --- Core.Scripting-013: wrapper-source injection ---
[Fact]
public void Rejects_sibling_method_injection_via_balanced_braces()
{
// Core.Scripting-013 — a brace-balanced source that closes Run early and
// declares a sibling method inside CompiledScript. The analyzer would still
// walk the injected member, but the documented "method body" contract
// forbids the shape outright now.
var ex = Should.Throw<CompilationErrorException>(() =>
ScriptEvaluator<FakeScriptContext, int>.Compile(
"""
return 0;
} public static int Evil() { return 1;
"""));
ex.Message.ShouldContain("Core.Scripting-013");
}
[Fact]
public void Rejects_sibling_class_injection_via_balanced_braces()
{
// Same shape, but injecting an entire sibling class inside the synthesized
// namespace. Reject at the type-count check.
var ex = Should.Throw<CompilationErrorException>(() =>
ScriptEvaluator<FakeScriptContext, int>.Compile(
"""
return 0;
}
}
public static class CompiledScript2 { public static int M() { return 0; } }
namespace ZB.MOM.WW.OtOpcUa.Core.Scripting.Compiled { public static class CompiledScript3 { public static int M() {
return 0;
"""));
ex.Message.ShouldContain("Core.Scripting-013");
}
}
@@ -655,4 +655,60 @@ public sealed class VirtualTagEngineTests
lock (_buf) { _buf.Add((path, value)); }
}
}
// --- Core.Scripting-016: engine routes compiles through CompiledScriptCache ---
[Fact]
public void Dispose_unloads_compiled_script_assembly()
{
// Pre-fix the engine called ScriptEvaluator.Compile directly, so the emitted
// script assembly's ALC stayed loaded for the process lifetime. After the fix
// the engine routes through CompiledScriptCache; engine Dispose triggers cache
// Dispose which unloads every cached evaluator's ALC. Assert via WeakReference
// + GC that the assembly is actually reclaimed.
var weak = CompileTagAndCaptureWeak();
for (int i = 0; i < 10 && weak.IsAlive; i++)
{
GC.Collect();
GC.WaitForPendingFinalizers();
GC.Collect();
}
weak.IsAlive.ShouldBeFalse(
"engine Dispose must release the compiled-tag assembly via " +
"CompiledScriptCache (Core.Scripting-016). If this fails the engine is " +
"back to calling ScriptEvaluator.Compile directly and -008's headline " +
"fix doesn't run in production.");
}
[System.Runtime.CompilerServices.MethodImpl(
System.Runtime.CompilerServices.MethodImplOptions.NoInlining)]
private static WeakReference CompileTagAndCaptureWeak()
{
var up = new FakeUpstream();
up.Set("X", 10);
var engine = Build(up);
engine.Load([new VirtualTagDefinition(
Path: "Doubled",
DataType: DriverDataType.Float64,
ScriptSource: """return (double)ctx.GetTag("X").Value * 2.0;""")]);
// Reach into the engine's compile cache via reflection — internals scoped to
// Tests-via-InternalsVisibleTo and the field is private.
var cacheField = typeof(VirtualTagEngine).GetField(
"_compileCache", System.Reflection.BindingFlags.Instance | System.Reflection.BindingFlags.NonPublic)!;
var cache = cacheField.GetValue(engine)!;
var cacheDictField = cache.GetType().GetField(
"_cache", System.Reflection.BindingFlags.Instance | System.Reflection.BindingFlags.NonPublic)!;
var cacheDict = (System.Collections.IDictionary)cacheDictField.GetValue(cache)!;
var lazy = cacheDict.Values.Cast<object>().Single();
var lazyValueProp = lazy.GetType().GetProperty("Value")!;
var evaluator = lazyValueProp.GetValue(lazy)!;
var funcField = evaluator.GetType().GetField(
"_func", System.Reflection.BindingFlags.Instance | System.Reflection.BindingFlags.NonPublic)!;
var del = (Delegate)funcField.GetValue(evaluator)!;
var weak = new WeakReference(del.Method.Module.Assembly);
engine.Dispose();
return weak;
}
}
@@ -29,12 +29,19 @@ public sealed class SnapshotFormatterTests
// Numeric codes are the canonical OPC Foundation Opc.Ua.StatusCodes values.
[InlineData(0x00000000u, "Good")]
[InlineData(0x80000000u, "Bad")]
[InlineData(0x80020000u, "BadInternalError")]
[InlineData(0x80050000u, "BadCommunicationError")]
[InlineData(0x800A0000u, "BadTimeout")]
[InlineData(0x80310000u, "BadNoCommunication")]
[InlineData(0x80320000u, "BadWaitingForInitialData")]
[InlineData(0x80340000u, "BadNodeIdUnknown")]
[InlineData(0x80330000u, "BadNodeIdInvalid")]
[InlineData(0x803B0000u, "BadNotWritable")]
[InlineData(0x803C0000u, "BadOutOfRange")]
[InlineData(0x803D0000u, "BadNotSupported")]
// Driver.Cli.Common-007: corrected from 0x80550000 (which is actually
// BadSecurityPolicyRejected) to the canonical OPC UA spec value 0x808B0000.
[InlineData(0x808B0000u, "BadDeviceFailure")]
[InlineData(0x80740000u, "BadTypeMismatch")]
[InlineData(0x40000000u, "Uncertain")]
public void FormatStatus_names_well_known_status_codes(uint status, string expectedName)
@@ -50,6 +57,10 @@ public sealed class SnapshotFormatterTests
[InlineData(0x80070000u, "BadNoCommunication")] // was mislabelled BadNoCommunication
[InlineData(0x80080000u, "BadWaitingForInitialData")] // was mislabelled BadWaitingForInitialData
[InlineData(0x80350000u, "BadNodeIdInvalid")] // was mislabelled BadNodeIdInvalid
// Driver.Cli.Common-007: 0x80550000 is BadSecurityPolicyRejected per spec, not
// BadDeviceFailure (which is 0x808B0000). The buggy shortlist + the six native-
// protocol mappers all had it wrong; this row pins it as a regression guard.
[InlineData(0x80550000u, "BadDeviceFailure")]
public void FormatStatus_does_not_apply_pre_fix_wrong_names(uint status, string wrongName)
{
SnapshotFormatter.FormatStatus(status).ShouldNotContain(wrongName);
@@ -26,7 +26,11 @@ public sealed class AbCipReadSmokeTests
await fixture.InitializeAsync();
try
{
var deviceUri = $"ab://127.0.0.1:{fixture.Port}/1,0";
// Use fixture.Host (not hardcoded 127.0.0.1) so the docker-host migration
// (10.100.0.35) and AB_SERVER_ENDPOINT overrides reach this test too. The
// sibling Emulate tests already use the fixture's resolved endpoint; this
// smoke test was missed.
var deviceUri = $"ab://{fixture.Host}:{fixture.Port}/1,0";
var drv = new AbCipDriver(new AbCipDriverOptions
{
Devices = [new AbCipDeviceOptions(deviceUri, profile.Family)],
@@ -8,7 +8,7 @@ namespace ZB.MOM.WW.OtOpcUa.Driver.AbCip.IntegrationTests;
/// Reachability probe for the <c>ab_server</c> Docker container (libplctag's CIP
/// simulator built via <c>Docker/Dockerfile</c>) or any real AB PLC the
/// <c>AB_SERVER_ENDPOINT</c> env var points at. Parses
/// <c>AB_SERVER_ENDPOINT</c> (default <c>localhost:44818</c>) + TCP-connects
/// <c>AB_SERVER_ENDPOINT</c> (default <c>10.100.0.35:44818</c> — the shared Docker host) + TCP-connects
/// once at fixture construction. Tests skip via <see cref="AbServerFactAttribute"/>
/// / <see cref="AbServerTheoryAttribute"/> when the port isn't live, so
/// <c>dotnet test</c> stays green on a fresh clone without Docker running.
@@ -28,7 +28,10 @@ public sealed class AbServerFixture : IAsyncLifetime
/// instantiate the fixture with the profile matching their compose-file service.</summary>
public AbServerProfile Profile { get; }
public string Host { get; } = "127.0.0.1";
// 10.100.0.35 = the shared Docker host (see CLAUDE.md "Docker Workflow"). Migrated
// off this VM's 127.0.0.1 on 2026-04-28 alongside the rest of the Docker-host move.
// Override via AB_SERVER_ENDPOINT to point at a real PLC or a locally-running container.
public string Host { get; } = "10.100.0.35";
public int Port { get; } = AbServerProfile.DefaultPort;
public AbServerFixture() : this(KnownProfiles.ControlLogix) { }
@@ -59,7 +62,7 @@ public sealed class AbServerFixture : IAsyncLifetime
TcpProbe(ResolveHost(), ResolvePort());
private static string ResolveHost() =>
Environment.GetEnvironmentVariable(EndpointEnvVar)?.Split(':', 2)[0] ?? "127.0.0.1";
Environment.GetEnvironmentVariable(EndpointEnvVar)?.Split(':', 2)[0] ?? "10.100.0.35";
private static int ResolvePort()
{
@@ -84,7 +87,7 @@ public sealed class AbServerFixture : IAsyncLifetime
/// <summary>
/// <c>[Fact]</c>-equivalent that skips when ab_server isn't reachable — accepts a
/// live Docker listener on <c>localhost:44818</c> or an <c>AB_SERVER_ENDPOINT</c>
/// live Docker listener on <c>10.100.0.35:44818</c> or an <c>AB_SERVER_ENDPOINT</c>
/// override pointing at a real PLC.
/// </summary>
public sealed class AbServerFactAttribute : FactAttribute
@@ -29,6 +29,10 @@ namespace ZB.MOM.WW.OtOpcUa.Driver.AbCip.IntegrationTests.Emulate;
/// <para>Runs only when <c>AB_SERVER_PROFILE=emulate</c>. ab_server has no ALMD
/// instruction + no alarm subsystem, so this tier-gated class couldn't produce a
/// meaningful result against the default simulator.</para>
/// <para>The Emulate tier is hardware-gated (Rockwell per-seat license, Windows-only,
/// conflicts with Docker Desktop's WSL 2 backend) so a permanent skip in CI is expected
/// — see <c>docs/drivers/AbServer-Test-Fixture.md</c> §"Logix Emulate golden-box tier"
/// for the full rationale + the gap matrix this test closes.</para>
/// </remarks>
[Collection("AbServerEmulate")]
[Trait("Category", "Integration")]
@@ -27,6 +27,10 @@ namespace ZB.MOM.WW.OtOpcUa.Driver.AbCip.IntegrationTests.Emulate;
/// <para>Runs only when <c>AB_SERVER_PROFILE=emulate</c>. With ab_server
/// (the default), skips cleanly — ab_server lacks UDT / Template Object emulation
/// so this wire-level test couldn't pass against it regardless.</para>
/// <para>The Emulate tier is hardware-gated (Rockwell per-seat license, Windows-only,
/// conflicts with Docker Desktop's WSL 2 backend) so a permanent skip in CI is expected
/// — see <c>docs/drivers/AbServer-Test-Fixture.md</c> §"Logix Emulate golden-box tier"
/// for the full rationale + the gap matrix this test closes.</para>
/// </remarks>
[Collection("AbServerEmulate")]
[Trait("Category", "Integration")]
@@ -18,7 +18,7 @@ namespace ZB.MOM.WW.OtOpcUa.Driver.AbLegacy.IntegrationTests;
/// Env-var overrides:
/// <list type="bullet">
/// <item><c>AB_LEGACY_ENDPOINT</c> — <c>host:port</c> of the PCCC-mode simulator.
/// Defaults to <c>localhost:44818</c> (EtherNet/IP port; ab_server's PCCC
/// Defaults to <c>10.100.0.35:44818</c> — the shared Docker host (EtherNet/IP port; ab_server's PCCC
/// emulation exposes PCCC-over-CIP on the same port as CIP itself).</item>
/// <item><c>AB_LEGACY_CIP_PATH</c> — routing path appended to the <c>ab://host:port/</c>
/// URI. Defaults to <c>1,0</c> (port-1/slot-0 backplane), required by ab_server
@@ -50,7 +50,10 @@ public sealed class AbLegacyServerFixture : IAsyncLifetime
/// </summary>
public const string DefaultCipPath = "1,0";
public string Host { get; } = "127.0.0.1";
// 10.100.0.35 = the shared Docker host (see CLAUDE.md "Docker Workflow"). Migrated
// off this VM's 127.0.0.1 on 2026-04-28 alongside the rest of the Docker-host move.
// Override via AB_LEGACY_ENDPOINT to point at a real PLC or a locally-running container.
public string Host { get; } = "10.100.0.35";
public int Port { get; } = DefaultPort;
/// <summary>CIP routing path portion of the device URI (after the <c>/</c> separator).
@@ -105,7 +108,7 @@ public sealed class AbLegacyServerFixture : IAsyncLifetime
private static (string Host, int Port) ResolveEndpoint()
{
var raw = Environment.GetEnvironmentVariable(EndpointEnvVar);
if (raw is null) return ("127.0.0.1", DefaultPort);
if (raw is null) return ("10.100.0.35", DefaultPort);
var parts = raw.Split(':', 2);
var port = parts.Length == 2 && int.TryParse(parts[1], out var p) ? p : DefaultPort;
return (parts[0], port);
@@ -1,7 +1,7 @@
using System.Runtime.CompilerServices;
using System.Threading.Channels;
using Google.Protobuf.WellKnownTypes;
using MxGateway.Contracts.Proto.Galaxy;
using ZB.MOM.WW.MxGateway.Contracts.Proto.Galaxy;
using Shouldly;
using Xunit;
using ZB.MOM.WW.OtOpcUa.Core.Abstractions;
@@ -1,4 +1,4 @@
using MxGateway.Contracts.Proto.Galaxy;
using ZB.MOM.WW.MxGateway.Contracts.Proto.Galaxy;
using Shouldly;
using Xunit;
using ZB.MOM.WW.OtOpcUa.Core.Abstractions;
@@ -1,4 +1,4 @@
using MxGateway.Contracts.Proto;
using ZB.MOM.WW.MxGateway.Contracts.Proto;
using Shouldly;
using Xunit;
using ZB.MOM.WW.OtOpcUa.Core.Abstractions;
@@ -1,6 +1,6 @@
using System.Threading.Channels;
using Google.Protobuf.WellKnownTypes;
using MxGateway.Contracts.Proto;
using ZB.MOM.WW.MxGateway.Contracts.Proto;
using Shouldly;
using Xunit;
using ZB.MOM.WW.OtOpcUa.Core.Abstractions;
@@ -1,4 +1,4 @@
using MxGateway.Contracts.Proto;
using ZB.MOM.WW.MxGateway.Contracts.Proto;
using Shouldly;
using Xunit;
using ZB.MOM.WW.OtOpcUa.Core.Abstractions;
@@ -1,7 +1,7 @@
using System.Diagnostics.Metrics;
using System.Threading.Channels;
using Google.Protobuf.WellKnownTypes;
using MxGateway.Contracts.Proto;
using ZB.MOM.WW.MxGateway.Contracts.Proto;
using Shouldly;
using Xunit;
using ZB.MOM.WW.OtOpcUa.Core.Abstractions;
@@ -1,6 +1,6 @@
using System.Threading.Channels;
using Google.Protobuf.WellKnownTypes;
using MxGateway.Contracts.Proto;
using ZB.MOM.WW.MxGateway.Contracts.Proto;
using Shouldly;
using Xunit;
using ZB.MOM.WW.OtOpcUa.Core.Abstractions;
@@ -1,6 +1,6 @@
using System.Threading.Channels;
using Google.Protobuf.WellKnownTypes;
using MxGateway.Contracts.Proto;
using ZB.MOM.WW.MxGateway.Contracts.Proto;
using Shouldly;
using Xunit;
using ZB.MOM.WW.OtOpcUa.Core.Abstractions;
@@ -1,4 +1,4 @@
using MxGateway.Contracts.Proto.Galaxy;
using ZB.MOM.WW.MxGateway.Contracts.Proto.Galaxy;
using Shouldly;
using Xunit;
using ZB.MOM.WW.OtOpcUa.Core.Abstractions;
@@ -1,5 +1,5 @@
using System.Diagnostics;
using MxGateway.Contracts.Proto;
using ZB.MOM.WW.MxGateway.Contracts.Proto;
using Shouldly;
using Xunit;
using ZB.MOM.WW.OtOpcUa.Core.Abstractions;
@@ -1,6 +1,6 @@
using System.Runtime.CompilerServices;
using Google.Protobuf.WellKnownTypes;
using MxGateway.Contracts.Proto;
using ZB.MOM.WW.MxGateway.Contracts.Proto;
using Shouldly;
using Xunit;
using ZB.MOM.WW.OtOpcUa.Core.Abstractions;
@@ -1,6 +1,6 @@
using Google.Protobuf;
using Google.Protobuf.WellKnownTypes;
using MxGateway.Contracts.Proto;
using ZB.MOM.WW.MxGateway.Contracts.Proto;
using Shouldly;
using Xunit;
using ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Runtime;
@@ -1,5 +1,5 @@
using Google.Protobuf.WellKnownTypes;
using MxGateway.Contracts.Proto;
using ZB.MOM.WW.MxGateway.Contracts.Proto;
using Shouldly;
using Xunit;
using ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Runtime;
@@ -1,4 +1,4 @@
using MxGateway.Contracts.Proto;
using ZB.MOM.WW.MxGateway.Contracts.Proto;
using Shouldly;
using Xunit;
using ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Runtime;
@@ -21,9 +21,19 @@
<ItemGroup>
<ProjectReference Include="..\..\..\src\Drivers\ZB.MOM.WW.OtOpcUa.Driver.Galaxy\ZB.MOM.WW.OtOpcUa.Driver.Galaxy.csproj"/>
<!-- Pulled in transitively via Driver.Galaxy → MxGateway.Client → MxGateway.Contracts;
explicit reference lets tests construct GalaxyObject / GalaxyAttribute fixtures. -->
<ProjectReference Include="..\..\..\..\mxaccessgw\src\MxGateway.Contracts\MxGateway.Contracts.csproj"/>
</ItemGroup>
<ItemGroup>
<!-- Vendored mxaccessgw contracts DLL. The driver under test holds the same
binary reference; this explicit duplicate lets tests construct
GalaxyObject / GalaxyAttribute / MxCommand / MxEvent fixtures directly
rather than only via the driver's public surface. See
..\..\..\src\Drivers\ZB.MOM.WW.OtOpcUa.Driver.Galaxy\libs\README.md for
the unwinding plan. -->
<Reference Include="MxGateway.Contracts">
<HintPath>..\..\..\src\Drivers\ZB.MOM.WW.OtOpcUa.Driver.Galaxy\libs\MxGateway.Contracts.dll</HintPath>
<Private>true</Private>
</Reference>
</ItemGroup>
</Project>
@@ -54,7 +54,12 @@ public sealed class AddressingGrammarTests
{
if (_sim.SkipReason is not null) Assert.Skip(_sim.SkipReason);
var tag = new ModbusTagDefinition("Tank", ModbusRegion.HoldingRegisters, 100, ModbusDataType.Float32,
// HR 200 lives in pymodbus's scratch range (`write: [200, 209]` per standard.json),
// which gives us two consecutive writable HRs for the Float32 round-trip. The earlier
// version used HR 100 — but standard.json declares HR 100 as a single-cell auto-
// incrementing register (`write: [100, 100]`) so the second register of the Float32
// write was rejected with Illegal Data Address.
var tag = new ModbusTagDefinition("Tank", ModbusRegion.HoldingRegisters, 200, ModbusDataType.Float32,
ByteOrder: ModbusByteOrder.WordSwap);
var drv = await NewDriverAsync(tag);
@@ -87,9 +92,13 @@ public sealed class AddressingGrammarTests
if (_sim.SkipReason is not null) Assert.Skip(_sim.SkipReason);
// Sanity check that the simulator accepts the larger PDU coalescing produces.
var t1 = new ModbusTagDefinition("T1", ModbusRegion.HoldingRegisters, 300, ModbusDataType.Int16);
var t2 = new ModbusTagDefinition("T2", ModbusRegion.HoldingRegisters, 302, ModbusDataType.Int16);
var t3 = new ModbusTagDefinition("T3", ModbusRegion.HoldingRegisters, 304, ModbusDataType.Int16);
// Using HR 200/202/204 in the scratch range (standard.json's `uint16: 200..209`),
// not 300/302/304 — pymodbus rejects reads outside the seeded uint16 list with
// Illegal Data Address (= BadOutOfRange). The coalescing semantic the test
// exercises is identical with the scratch addresses.
var t1 = new ModbusTagDefinition("T1", ModbusRegion.HoldingRegisters, 200, ModbusDataType.Int16);
var t2 = new ModbusTagDefinition("T2", ModbusRegion.HoldingRegisters, 202, ModbusDataType.Int16);
var t3 = new ModbusTagDefinition("T3", ModbusRegion.HoldingRegisters, 204, ModbusDataType.Int16);
var opts = new ModbusDriverOptions
{
Host = _sim.Host, Port = _sim.Port, Tags = [t1, t2, t3], MaxReadGap = 5,
@@ -48,6 +48,23 @@ service + starting another. The integration tests discriminate by a
separate `MODBUS_SIM_PROFILE` env var so they skip correctly when the
wrong profile is live.
### Profile coverage matrix
The two general-purpose profiles cover disjoint test sets. A full pass
of the integration suite requires running both — serially on a single
docker host (the `:5020` collision), or in parallel on two hosts.
| Job | Bring up | Env to set | Expected outcome |
|---|---|---|---|
| `modbus-standard` | `lmxopcua-fix up modbus standard` | unset `MODBUS_SIM_PROFILE` (or set to `standard`) | Standard round-trip + AddressingGrammar suites pass; `ExceptionInjectionTests` (32 rows) skip with `MODBUS_SIM_PROFILE != exception_injection`. |
| `modbus-exception` | `lmxopcua-fix up modbus exception_injection` | `MODBUS_SIM_PROFILE=exception_injection` | `ExceptionInjectionTests` (32 rows) pass against the per-`(fc,address)` rule set; standard-profile suites (round-trip, AddressingGrammar) skip. |
The DL205 / Mitsubishi / S7-1500 profiles are similar — each gates its
own quirks suite via `MODBUS_SIM_PROFILE=<profile>`. Tests that don't
need a specific profile (the basic round-trip set) run under any of
the three pymodbus-based profiles. The `exception_injection` profile
is the only one that runs `exception_injector.py` instead of pymodbus.
## Endpoint
- Default: `localhost:5020`
@@ -27,7 +27,7 @@ public sealed class ExceptionInjectionTests(ModbusSimulatorFixture sim)
private const uint StatusGood = 0u;
private const uint StatusBadOutOfRange = 0x803C0000u;
private const uint StatusBadNotSupported = 0x803D0000u;
private const uint StatusBadDeviceFailure = 0x80550000u;
private const uint StatusBadDeviceFailure = 0x808B0000u;
private const uint StatusBadCommunicationError = 0x80050000u;
private void SkipUnlessInjectorLive()
@@ -5,7 +5,7 @@ namespace ZB.MOM.WW.OtOpcUa.Driver.Modbus.IntegrationTests;
/// <summary>
/// Reachability probe for a Modbus TCP simulator (pymodbus in Docker, see
/// <c>Docker/docker-compose.yml</c>) or a real PLC. Parses
/// <c>MODBUS_SIM_ENDPOINT</c> (default <c>localhost:5020</c> per PR 43) and TCP-connects once at
/// <c>MODBUS_SIM_ENDPOINT</c> (default <c>10.100.0.35:5020</c> — the shared Docker host) and TCP-connects once at
/// fixture construction. Each test checks <see cref="SkipReason"/> and calls
/// <c>Assert.Skip</c> when the endpoint was unreachable, so a dev box without a running
/// simulator still passes `dotnet test` cleanly — matches the Galaxy live-smoke pattern in
@@ -29,8 +29,11 @@ public sealed class ModbusSimulatorFixture : IAsyncDisposable
// PR 43: default port is 5020 (pymodbus convention) instead of 502 (Modbus standard).
// Picking 5020 sidesteps the privileged-port admin requirement on Windows + matches the
// port baked into the pymodbus simulator JSON profiles in Docker/profiles/. Override with
// MODBUS_SIM_ENDPOINT to point at a real PLC on its native port 502.
private const string DefaultEndpoint = "localhost:5020";
// MODBUS_SIM_ENDPOINT to point at a real PLC on its native port 502, or to a
// locally-running container if the shared host is unavailable.
// 10.100.0.35 = the shared Docker host (see CLAUDE.md "Docker Workflow"). Migrated
// off this VM's localhost on 2026-04-28 alongside the rest of the Docker-host move.
private const string DefaultEndpoint = "10.100.0.35:5020";
private const string EndpointEnvVar = "MODBUS_SIM_ENDPOINT";
public string Host { get; }
@@ -18,9 +18,9 @@ public sealed class ModbusExceptionMapperTests
[InlineData((byte)0x01, 0x803D0000u)] // Illegal Function → BadNotSupported
[InlineData((byte)0x02, 0x803C0000u)] // Illegal Data Address → BadOutOfRange
[InlineData((byte)0x03, 0x803C0000u)] // Illegal Data Value → BadOutOfRange
[InlineData((byte)0x04, 0x80550000u)] // Server Failure → BadDeviceFailure
[InlineData((byte)0x05, 0x80550000u)] // Acknowledge (long op) → BadDeviceFailure
[InlineData((byte)0x06, 0x80550000u)] // Server Busy → BadDeviceFailure
[InlineData((byte)0x04, 0x808B0000u)] // Server Failure → BadDeviceFailure (Driver.Cli.Common-007)
[InlineData((byte)0x05, 0x808B0000u)] // Acknowledge (long op) → BadDeviceFailure
[InlineData((byte)0x06, 0x808B0000u)] // Server Busy → BadDeviceFailure
[InlineData((byte)0x0A, 0x80050000u)] // Gateway path unavailable → BadCommunicationError
[InlineData((byte)0x0B, 0x80050000u)] // Gateway target failed to respond → BadCommunicationError
[InlineData((byte)0xFF, 0x80020000u)] // Unknown code → BadInternalError fallback
@@ -61,7 +61,7 @@ public sealed class ModbusExceptionMapperTests
[new WriteRequest("T", (short)42)],
TestContext.Current.CancellationToken);
writes[0].StatusCode.ShouldBe(0x80550000u, "FC06 returning exception 04 (CPU in PROGRAM mode) maps to BadDeviceFailure");
writes[0].StatusCode.ShouldBe(0x808B0000u, "FC06 returning exception 04 (CPU in PROGRAM mode) maps to BadDeviceFailure");
}
private sealed class NonModbusFailureTransport : IModbusTransport
@@ -6,7 +6,7 @@ namespace ZB.MOM.WW.OtOpcUa.Driver.OpcUaClient.IntegrationTests;
/// Reachability probe for an <c>opc-plc</c> simulator (Microsoft Industrial IoT's
/// OPC UA PLC from <c>mcr.microsoft.com/iotedge/opc-plc</c>) or any real OPC UA
/// server the <c>OPCUA_SIM_ENDPOINT</c> env var points at. Parses
/// <c>OPCUA_SIM_ENDPOINT</c> (default <c>opc.tcp://localhost:50000</c>),
/// <c>OPCUA_SIM_ENDPOINT</c> (default <c>opc.tcp://10.100.0.35:50000</c> — the shared Docker host),
/// TCP-connects to the resolved host:port at collection init, and records a
/// <see cref="SkipReason"/> on failure. Tests call <c>Assert.Skip</c> on that, so
/// `dotnet test` stays green when Docker isn't running the simulator — mirrors the
@@ -32,7 +32,11 @@ namespace ZB.MOM.WW.OtOpcUa.Driver.OpcUaClient.IntegrationTests;
/// </remarks>
public sealed class OpcPlcFixture : IAsyncDisposable
{
private const string DefaultEndpoint = "opc.tcp://localhost:50000";
// 10.100.0.35 = the shared Docker host (see CLAUDE.md "Docker Workflow"). Migrated
// off this VM's localhost on 2026-04-28 alongside the rest of the Docker-host move.
// Override via OPCUA_SIM_ENDPOINT to point at a different host or a locally-running
// opc-plc instance.
private const string DefaultEndpoint = "opc.tcp://10.100.0.35:50000";
private const string EndpointEnvVar = "OPCUA_SIM_ENDPOINT";
/// <summary>Full <c>opc.tcp://host:port</c> URL the driver session should connect to.</summary>
@@ -5,7 +5,7 @@ namespace ZB.MOM.WW.OtOpcUa.Driver.S7.IntegrationTests;
/// <summary>
/// Reachability probe for the python-snap7 simulator Docker container (see
/// <c>Docker/docker-compose.yml</c>) or a real S7 PLC. Parses <c>S7_SIM_ENDPOINT</c>
/// (default <c>localhost:1102</c>) + TCP-connects once at fixture construction.
/// (default <c>10.100.0.35:1102</c> — the shared Docker host) + TCP-connects once at fixture construction.
/// Tests check <see cref="SkipReason"/> + call <c>Assert.Skip</c> when unreachable, so
/// `dotnet test` stays green on a fresh box without the simulator installed —
/// mirrors the <c>ModbusSimulatorFixture</c> pattern.
@@ -35,7 +35,9 @@ public sealed class Snap7ServerFixture : IAsyncDisposable
{
// Default 1102 (non-privileged) matches Docker/server.py. Override with
// S7_SIM_ENDPOINT to point at a real PLC on its native 102.
private const string DefaultEndpoint = "localhost:1102";
// 10.100.0.35 = the shared Docker host (see CLAUDE.md "Docker Workflow"). Migrated
// off this VM's localhost on 2026-04-28 alongside the rest of the Docker-host move.
private const string DefaultEndpoint = "10.100.0.35:1102";
private const string EndpointEnvVar = "S7_SIM_ENDPOINT";
public string Host { get; }
@@ -6,6 +6,7 @@ using Microsoft.AspNetCore.Hosting;
using Microsoft.EntityFrameworkCore;
using Microsoft.Extensions.DependencyInjection;
using ZB.MOM.WW.OtOpcUa.Admin.Hubs;
using ZB.MOM.WW.OtOpcUa.Admin.Security;
using ZB.MOM.WW.OtOpcUa.Configuration;
using ZB.MOM.WW.OtOpcUa.Configuration.Entities;
using ZB.MOM.WW.OtOpcUa.Configuration.Enums;
@@ -101,6 +102,15 @@ public sealed class AdminWebAppFactory : IAsyncDisposable
builder.Services.AddScoped<Admin.Services.DriverInstanceService>();
builder.Services.AddScoped<Admin.Services.DraftValidationService>();
// ClusterDetail.razor injects AdminHubConnectionFactory to drive the live-banner hub
// connection; the factory depends on HubTokenService, which in turn needs Data Protection.
// Without these the InteractiveServer circuit fails to instantiate the component and the
// page never advances past the "Loading…" placeholder — Playwright then times out
// waiting for any tab nav-link to appear. Mirrors Program.cs:35-36.
builder.Services.AddDataProtection();
builder.Services.AddSingleton<HubTokenService>();
builder.Services.AddScoped<Admin.Services.AdminHubConnectionFactory>();
_app = builder.Build();
_app.UseStaticFiles();
_app.UseRouting();
@@ -110,6 +110,66 @@ public sealed class GenerationRefreshHostedServiceTests : IDisposable
leases.OpenLeaseCount.ShouldBe(0, "IAsyncDisposable dispose must fire regardless of outcome");
}
// Bug #12 fix — verifies the previously-missing wiring: applies and heartbeats both
// emit sp_RegisterNodeGenerationApplied so Admin UI Fleet status + Redundancy LastSeenAt
// surface live state.
[Fact]
public async Task First_apply_reports_Applied_status_to_central_db()
{
var coordinator = await SeedCoordinatorAsync();
var leases = new ApplyLeaseRegistry();
var calls = new List<(long Gen, NodeApplyStatus Status, string? Error)>();
var service = NewService(coordinator, leases, currentGeneration: () => 42, registerCalls: calls);
await service.TickAsync(CancellationToken.None);
calls.Count.ShouldBe(1, "exactly one register call per apply window");
calls[0].Gen.ShouldBe(42);
calls[0].Status.ShouldBe(NodeApplyStatus.Applied);
calls[0].Error.ShouldBeNull();
}
[Fact]
public async Task No_change_tick_heartbeats_with_Applied_status()
{
var coordinator = await SeedCoordinatorAsync();
var leases = new ApplyLeaseRegistry();
var calls = new List<(long Gen, NodeApplyStatus Status, string? Error)>();
var service = NewService(coordinator, leases, currentGeneration: () => 42, registerCalls: calls);
await service.TickAsync(CancellationToken.None); // initial apply
await service.TickAsync(CancellationToken.None); // no-change heartbeat
await service.TickAsync(CancellationToken.None); // no-change heartbeat
calls.Count.ShouldBe(3, "one apply call + two heartbeat calls");
calls.ShouldAllBe(c => c.Gen == 42 && c.Status == NodeApplyStatus.Applied && c.Error == null);
}
[Fact]
public async Task Register_call_failure_does_not_break_apply_or_block_subsequent_ticks()
{
var coordinator = await SeedCoordinatorAsync();
var leases = new ApplyLeaseRegistry();
var registerCallCount = 0;
var service = new GenerationRefreshHostedService(
new NodeOptions { NodeId = "A", ClusterId = "c1", ConfigDbConnectionString = "unused" },
leases, coordinator, NullLogger<GenerationRefreshHostedService>.Instance,
tickInterval: TimeSpan.FromSeconds(1),
currentGenerationQuery: _ => Task.FromResult<long?>(42),
registerAppliedAsync: (gen, status, err, ct) =>
{
registerCallCount++;
throw new InvalidOperationException("simulated DB outage during register");
});
await service.TickAsync(CancellationToken.None); // apply succeeds, register throws
await service.TickAsync(CancellationToken.None); // heartbeat throws
registerCallCount.ShouldBe(2, "both register attempts must run");
service.LastAppliedGenerationId.ShouldBe(42, "register failure must not roll back the cursor");
}
// ---- fixture helpers ---------------------------------------------------
private async Task<RedundancyCoordinator> SeedCoordinatorAsync()
@@ -136,11 +196,15 @@ public sealed class GenerationRefreshHostedServiceTests : IDisposable
private static GenerationRefreshHostedService NewService(
RedundancyCoordinator coordinator,
ApplyLeaseRegistry leases,
Func<long?> currentGeneration) =>
Func<long?> currentGeneration,
List<(long Gen, NodeApplyStatus Status, string? Error)>? registerCalls = null) =>
new(new NodeOptions { NodeId = "A", ClusterId = "c1", ConfigDbConnectionString = "unused" },
leases, coordinator, NullLogger<GenerationRefreshHostedService>.Instance,
tickInterval: TimeSpan.FromSeconds(1),
currentGenerationQuery: _ => Task.FromResult(currentGeneration()));
currentGenerationQuery: _ => Task.FromResult(currentGeneration()),
registerAppliedAsync: registerCalls is null
? (_, _, _, _) => Task.CompletedTask
: (gen, status, err, _) => { registerCalls.Add((gen, status, err)); return Task.CompletedTask; });
private sealed class DbContextFactory(DbContextOptions<OtOpcUaConfigDbContext> options)
: IDbContextFactory<OtOpcUaConfigDbContext>