Phase 2 PR 4 — close 4 open MXAccess findings (push frames + reconnect + write-await + read-cancel) #3
Reference in New Issue
Block a user
Delete Branch "phase-2-pr4-findings"
Deleting a branch is permanent. Although the deleted branch may continue to exist for a short time before it actually gets removed, it CANNOT be undone in most cases. Continue?
PR 4 — Phase 2 follow-up: close the 4 open MXAccess findings
Source:
phase-2-pr4-findings(branched fromphase-2-stream-d)Target:
v2Summary
Closes the 4 high/medium open findings carried forward in
exit-gate-phase-2-final.md:ReadAsyncsubscription-leak on cancel. One-shot read now wraps thesubscribe→first-OnDataChange→unsubscribe pattern in a
try/finallyso the per-tagcallback is always detached, and if the read installed the underlying MXAccess
subscription itself (no other caller had it), it tears it down on the way out.
MxAccessClientOptions { AutoReconnect, MonitorInterval, StaleThreshold }+ a backgroundMonitorLoopAsyncthat watches a stale-activity threshold + probes the proxy via ano-op COM call, then reconnects-with-replay (re-Register, re-AddItem every active
subscription) when the proxy is dead. Liveness signal: every
OnDataChangecallback bumps_lastObservedActivityUtc. Defaults match v1 monitor cadence (5s poll, 60s stale).ReconnectCountexposed for diagnostics;ConnectionStateChangedevent for downstreamconsumers (the supervisor on the Proxy side already surfaces this through its
HeartbeatMonitor, but the Host-side event lets local logging/metrics hook in).
MxAccessGalaxyBackend.SubscribeAsyncdoesn't push OnDataChange frames back tothe Proxy. New
IGalaxyBackend.OnDataChange/OnAlarmEvent/OnHostStatusChangedevents that the new
GalaxyFrameHandler.AttachConnectionsubscribes per-connection andforwards as outbound
OnDataChangeNotification/AlarmEvent/RuntimeStatusChangeframes through the connection'sFrameWriter.MxAccessGalaxyBackendfans out per-tag value changes to every
SubscriptionIdthat's listening to that tag(multiple Proxy subs may share a Galaxy attribute — single COM subscription, multi-fan-out
on the wire). Stub + DbBacked backends declare the events with
#pragma warning disable CS0067(treat-warnings-as-errors would otherwise fail on never-raised events that existonly to satisfy the interface).
WriteValuesAsyncdoesn't awaitOnWriteComplete. NewWriteAsync(...)overload returnsboolafter awaiting the OnWriteComplete callback viathe v1-style
TaskCompletionSource-keyed-by-item-handle pattern in_pendingWrites.MxAccessGalaxyBackend.WriteValuesAsyncnow reports per-tagBad_InternalErrorwhen theruntime rejected the write, instead of false-positive
Good.Pipe server change
IFrameHandlergainsAttachConnection(FrameWriter writer): IDisposableso the handler canregister backend event sinks on each accepted connection and detach them at disconnect. The
PipeServer.RunOneConnectionAsynccalls it after the Hello handshake and disposes it in thefinally of the per-connection scope.
StubFrameHandlerreturnsIFrameHandler.NoopAttachment.Instance(net48 doesn't support default interface methods, so the empty-attach lives as a public nested
class).
Tests
dotnet test ZB.MOM.WW.OtOpcUa.slnx: 460 pass / 7 skip (E2E on admin shell) / 1pre-existing baseline failure. No regressions. The Driver.Galaxy.Host unit tests + 5 live
ZB smoke + 3 live MXAccess COM smoke all pass unchanged.
Test plan for reviewers
dotnet buildcleandotnet testshows 460/7-skip/1-baselineMxAccessClient.MonitorLoopAsyncagainst v1'sMxAccessClient.Monitorpartial (
src/ZB.MOM.WW.OtOpcUa.Host/MxAccess/MxAccessClient.Monitor.cs) — samepolling cadence, same probe-then-reconnect-with-replay shape
GalaxyFrameHandler.ConnectionSink.Disposeand confirm event handlers aredetached on connection close (no leaked invocation list refs)
WriteValuesAsyncreturningBad_InternalErroron a runtime-rejected write is thecorrect shape — confirm against the v1
MxAccessClient.ReadWrite.cspatternWhat's NOT in this PR
MxAccessGalaxyBackend.SubscribeAlarmsAsyncis still a no-op).OnAlarmEventis declared on the backend interface and pushed by the frame handler whenraised;
MxAccessGalaxyBackendjust doesn't raise it yet (waits for the alarm-trackingport from v1's
AlarmObjectFilter+ Galaxy alarm primitives).OnHostStatusChanged) — declared on the interface and pushed by theframe handler;
MxAccessGalaxyBackenddoesn't raise it (the Galaxy.Host'sHostConnectivityProbefrom v1 needs porting too, scoped under the Historian PR).Adversarial review
Quick pass over the PR 4 deltas. No new findings beyond:
MonitorLoopAsync's$Heartbeatprobe item-handle is leaked(
AddItemsucceeds, neverRemoveItem'd). Cosmetic — the probe item is internal tothe COM connection, dies with
Unregisterat disconnect/recycle. Worth a follow-upto call
RemoveItemafter the probe succeeds.MonitorLoopAsyncswallows per-subscription failures. IfGalaxy permanently rejects a previously-valid reference (rare but possible after a
re-deploy), the user gets silent data loss for that one subscription. The stub-handler-
unaware operator wouldn't notice. Worth surfacing as a
ConnectionStateChanged(false) → ConnectionStateChanged(true)payload that includes the replay-failures list.Both are low-priority follow-ups, not PR 4 blockers.
otopcua-mssql-data:/var/opt/mssqlon the SQL Server container so DB files survive container restart anddocker rm; sqlcmd verification command using the newmssql-tools18path that the 2022 image ships with; EF Core CLI install for use starting in Phase 1 Stream B; bumped step count from 8 → 10. Also adds a Troubleshooting subsection covering the seven most common Windows install snags (WSL distro not auto-installed needs-d Ubuntu; Docker PATH not refreshed needs new shell or sign-in; docker-users group membership needs sign-out/in; WSL 2 kernel update needs manual install on legacy systems; SA password complexity rules; Linux vs Windows containers mode mismatch; Hyper-V coexistence with Docker requires WSL 2 backend not Hyper-V backend per decision #134). Step 1 acceptance criteria gain "docker ps shows otopcua-mssql Up" and explicit note that steps 4a/4b need admin elevation (no silent admin-free path exists on Windows). bf6741ba7fotopcua-mssqlat localhost:1433 with sa/OtOpcUaDev_2026! credentials and Docker named volumeotopcua-mssql-datamounted at /var/opt/mssql, dev Galaxy, GLAuth at C:\publish\glauth\ on ports 3893/3894, plus rows for not-yet-standing services like OPC Foundation reference server / FOCAS stub / Modbus simulator / ab_server / Snap7 / TwinCAT XAR VM with target ports to stand up later); Connection strings for appsettings.Development.json (copy-paste-ready, flagged never-commit); Container management quick reference (start/stop/logs/shell/query/nuclear-reset); Credential rotation note. fc0ce36308requiredmodifier — net48 lacks RequiredMemberAttribute and we'd need a polyfill shim like the existing IsExternalInit one; default-string init is simpler). System.Data.SqlClient 4.9.0 added (the same package the v1 Host uses; net48-compatible). Backend/DbBackedGalaxyBackend wraps the repository: DiscoverAsync builds a real DiscoverHierarchyResponse (groups attributes by gobject, resolves parent-by-tagname, maps category_id → human-readable template-category name mirroring v1 AlarmObjectFilter); ReadValuesAsync/WriteValuesAsync/HistoryReadAsync still surface "MXAccess code lift pending (Phase 2 Task B.1)" because runtime data values genuinely need the COM client; OpenSession/CloseSession/Subscribe/Unsubscribe/AlarmSubscribe/AlarmAck/Recycle return success without backend work (subscription ID is a synthetic counter for now). Live smoke tests (GalaxyRepositoryLiveSmokeTests) skip when localhost ZB is unreachable; when present they verify (1) TestConnection returns true, (2) GetHierarchy returns at least one deployed gobject with a non-empty TagName, (3) GetAttributes returns rows with FullTagReference matching the "tag.attribute" shape, (4) GetLastDeployTime returns a value, (5) DbBackedBackend.DiscoverAsync returns at least one gobject with attributes and a populated TemplateCategory. All 5 pass against the local Galaxy. Full solution 957 pass / 1 pre-existing Phase 0 baseline; the 494 v1 IntegrationTests + 6 v1 IntegrationTests-net48 tests still pass — legacy OtOpcUa.Host untouched. Remaining for the Phase 2 exit gate is the MXAccess COM client port itself (the v1 MxAccessClient partials + IMxProxy abstraction + StaPump-based Connect/Subscribe/Read/Write semantics) — Discover is now solved in DB-backed form, so the lift can focus exclusively on the runtime data-plane. 549cd36662git rm -r src/ZB.MOM.WW.OtOpcUa.Hostis destructive enough to need explicit operator authorization on a real PR review. scripts/migration/Migrate-AppSettings-To-DriverConfig.ps1 takes a v1 appsettings.json and emits the v2 DriverInstance.DriverConfig JSON blob (MxAccess/Database/Historian sections) ready to upsert into the central Configuration DB; null-leaf stripping; -DryRun mode; smoke-tested against the dev appsettings.json and produces the expected three-section ordered-dictionary output. scripts/install/Install-Services.ps1 registers the two v2 services with sc.exe — OtOpcUaGalaxyHost first (net48 x86 EXE with OTOPCUA_GALAXY_PIPE/OTOPCUA_ALLOWED_SID/OTOPCUA_GALAXY_SECRET/OTOPCUA_GALAXY_BACKEND/OTOPCUA_GALAXY_ZB_CONN/OTOPCUA_GALAXY_CLIENT_NAME env vars set via HKLM:\SYSTEM\CurrentControlSet\Services\OtOpcUaGalaxyHost\Environment registry), then OtOpcUa with depend=OtOpcUaGalaxyHost; resolves down-level account names to SID for the IPC ACL; generates a fresh 32-byte base64 shared secret per install if not supplied (kept out of registry — operators record offline for service rebinding scenarios); echoes start commands. scripts/install/Uninstall-Services.ps1 stops + removes both services. tests/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Proxy.Tests/HostSubprocessParityTests.cs is the production-shape parity test — Proxy (.NET 10) spawns the actual OtOpcUa.Driver.Galaxy.Host.exe (net48 x86) as a subprocess via Process.Start with backend=db env vars, connects via real named pipe, calls Discover, asserts at least one Galaxy gobject comes back. Skipped when running as Administrator (PipeAcl denies admins, same guard as other IPC integration tests), when the Host EXE hasn't been built, or when the ZB SQL endpoint is unreachable. This is the cross-FX integration that the parity suite genuinely needs — the previous IPC tests all ran in-process; this one validates the production deployment topology where Proxy and Host are separate processes communicating only over the named pipe. docs/v2/implementation/stream-d-removal-procedure.md is the next-session playbook: Option A (rewrite 494 v1 tests via a ProxyMxAccessClientAdapter that implements v1's IMxAccessClient by forwarding to GalaxyProxyDriver — Vtq↔DataValueSnapshot, Quality↔StatusCode, OnTagValueChanged↔OnDataChange mapping; 3-5 days, full coverage), Option B (rename OtOpcUa.Tests → OtOpcUa.Tests.v1Archive with [Trait("Category", "v1Archive")] for opt-in CI runs; new OtOpcUa.Driver.Galaxy.E2E test project with 10-20 representative tests via the HostSubprocessParityTests pattern; 1-2 days, accreted coverage); deletion checklist with eight pre-conditions, ten ordered steps, and a rollback path (git revert restores the legacy Host alongside the v2 stack — both topologies remain installable until the downstream consumer cutover). Full solution 964 pass / 1 pre-existing Phase 0 baseline; the 494 v1 IntegrationTests + 6 v1 IntegrationTests-net48 still pass because legacy OtOpcUa.Host stays untouched until an interactive session executes the procedure doc. 7403b92b72dotnet test ZB.MOM.WW.OtOpcUa.slnxnow skips them via IsTestProject=false on the test projects + archive-status PropertyGroup comments on the src projects. The destructive deletion is reserved for Phase 2 PR 3 with explicit operator review per CLAUDE.md "only use destructive operations when truly the best approach". tests/ZB.MOM.WW.OtOpcUa.Tests/ renamed via git mv to tests/ZB.MOM.WW.OtOpcUa.Tests.v1Archive/; csproj <AssemblyName> kept as the original ZB.MOM.WW.OtOpcUa.Tests so v1 OtOpcUa.Host's [InternalsVisibleTo("ZB.MOM.WW.OtOpcUa.Tests")] still matches and the project rebuilds clean. tests/ZB.MOM.WW.OtOpcUa.IntegrationTests gets <IsTestProject>false</IsTestProject>. src/ZB.MOM.WW.OtOpcUa.Host + src/ZB.MOM.WW.OtOpcUa.Historian.Aveva get PropertyGroup archive-status comments documenting they're functionally superseded but kept in-build because cascading dependencies (Historian.Aveva → Host; IntegrationTests → Host) make a single-PR deletion high blast-radius. New tests/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.E2E/ project (.NET 10) with ParityFixture that spawns OtOpcUa.Driver.Galaxy.Host.exe (net48 x86) as a Process.Start subprocess with OTOPCUA_GALAXY_BACKEND=db env vars, awaits 2s for the PipeServer to bind, then exposes a connected GalaxyProxyDriver; skips on non-Windows / Administrator shells (PipeAcl denies admins per decision #76) / ZB unreachable / Host EXE not built — each skip carries a SkipReason string the test method reads via Assert.Skip(SkipReason). RecordingAddressSpaceBuilder captures every Folder/Variable/AddProperty registration so parity tests can assert on the same shape v1 LmxNodeManager produced. HierarchyParityTests (3) — Discover returns gobjects with attributes; attribute full references match the tag.attribute Galaxy reference grammar; HistoryExtension flag flows through correctly. StabilityFindingsRegressionTests (4) — one test per 2026-04-13 stability finding from commitsc76ab8fand7310925: phantom probe subscription doesn't corrupt unrelated host status; HostStatusChangedEventArgs structurally carries a specific HostName + OldState + NewState (event signature mathematically prevents the v1 cross-host quality-clear bug); all GalaxyProxyDriver capability methods return Task or Task<T> (sync-over-async would deadlock OPC UA stack thread); AcknowledgeAsync completes before returning (no fire-and-forget background work that could race shutdown). Solution test count: 470 pass / 7 skip (E2E on admin shell) / 1 pre-existing Phase 0 baseline. Run archived suites explicitly:dotnet test tests/ZB.MOM.WW.OtOpcUa.Tests.v1Archive(494 pass) +dotnet test tests/ZB.MOM.WW.OtOpcUa.IntegrationTests(6 pass). docs/v2/V1_ARCHIVE_STATUS.md inventories every archived surface with run-it-explicitly instructions + a 10-step deletion plan for PR 3 + rollback procedure (git revert restores all four projects). docs/v2/implementation/exit-gate-phase-2-final.md supersedes the two partial-exit docs with the per-stream status table (A/B/C/D/E all addressed, D split across PR 2/3 per safety protocol), the test count breakdown, fresh adversarial review of PR 2 deltas (4 new findings: medium IsTestProject=false safety net loss, medium structural-vs-behavioral stability tests, low backend=db default, low Process.Start env inheritance), the 8 carried-forward findings from exit-gate-phase-2.md, the recommended PR order (1 → 2 → 3 → 4). docs/v2/implementation/pr-2-body.md is the Gitea web-UI paste-in for opening PR 2 once pushed. a3d16a28f1(_, __) => {}callback) now wires OnTagValueChanged that fans out per-tag value changes to every subscription ID listening (one MXAccess subscription, multi-fan-out — _refToSubs reverse map). UnsubscribeAsync also reverse-walks the map to only call mx.UnsubscribeAsync when the LAST sub for a tag drops. Stub + DbBacked backends declare the events with #pragma warning disable CS0067 because they never raise them but must satisfy the interface (treat-warnings-as-errors would otherwise fail). Medium 4 (WriteValuesAsync doesn't await OnWriteComplete): MxAccessClient.WriteAsync rewritten to return Task<bool> via the v1-style TaskCompletionSource-keyed-by-item-handle pattern in _pendingWrites — adds the TCS before the Write call, awaits it with a configurable timeout (default 5s), removes the TCS in finally, returns true only when OnWriteComplete reported success. MxAccessGalaxyBackend.WriteValuesAsync now reports per-tag Bad_InternalError ("MXAccess runtime reported write failure") when the bool returns false, instead of false-positive Good. PipeServer's IFrameHandler interface adds the AttachConnection(FrameWriter):IDisposable method + a public NoopAttachment nested class (net48 doesn't support default interface methods so the empty-attach is exposed for stub implementations). StubFrameHandler returns IFrameHandler.NoopAttachment.Instance. RunOneConnectionAsync calls AttachConnection after HelloAck andusings the returned disposable so it disposes at the connection scope's finally. ConnectionStateChanged event added on MxAccessClient (caller-facing diagnostics for false→true reconnect transitions). docs/v2/implementation/pr-4-body.md is the Gitea web-UI paste-in for opening PR 4 once pushed; includes 2 new low-priority adversarial findings (probe item-handle leak; replay-loop silently swallows per-subscription failures) flagged as follow-ups not PR 4 blockers. Full solution 460 pass / 7 skip (E2E on admin shell) / 1 pre-existing Phase 0 baseline. No regressions vs PR 2's baseline. caa9cb86f6