Fix second-pass review findings: subscription leak on rebuild, metrics accuracy, and MxAccess startup recovery

- Preserve and replay subscription ref counts across address space rebuilds to prevent MXAccess subscription leaks - Mark read timeouts and write failures as unsuccessful in PerformanceMetrics for accurate health reporting - Add deferred MxAccess reconnect path when initial connection fails at startup - Update code review document with verified completions and new findings - Add covering tests for all fixes Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-25 09:41:12 -04:00
parent 71254e005e
commit 09ed15bdda
12 changed files with 307 additions and 51 deletions
@@ -1,41 +1,75 @@
-# Full Review: `src`
+# Full Review Update: `src`

-Overall verdict: **patch is incorrect**
+Overall verdict: **patch is still incorrect**

-I reviewed the current implementation under `src/` as a code review, not just the Git diff. The main issues are correctness bugs in the OPC UA subscription path and MXAccess recovery/write handling.
+This pass verified the previous findings against the current implementation, then performed another full review of `src/` with emphasis on service-mode reliability, continuous OPC UA publishing, and recovery from MXAccess disconnects.

-## Findings
+## Verified Completed

-### [P1] OPC UA monitored items never trigger MXAccess subscriptions
-File: `src/ZB.MOM.WW.LmxOpcUa.Host/OpcUa/LmxNodeManager.cs:327-345`
+### [DONE] OPC UA monitored items now start MXAccess subscriptions
+Verified in `src/ZB.MOM.WW.LmxOpcUa.Host/OpcUa/LmxNodeManager.cs:355-423`.

-`LmxNodeManager` has a `SubscribeTag` helper, but nothing in the node manager calls it when OPC UA clients create monitored items. There is also no monitored-item lifecycle override in this class. The result is that browsing or subscribing from an OPC UA client does not call `_mxAccessClient.SubscribeAsync(...)`, so live MXAccess data changes are never started for those tags. Clients can still do synchronous reads, but any client expecting pushed value updates will see stale `BadWaitingForInitialData` or last-cached values instead of runtime updates.
+The node manager now hooks monitored-item create/delete callbacks and translates them into ref-counted `SubscribeTag` / `UnsubscribeTag` calls. This closes the original gap where OPC UA subscriptions never triggered runtime MX subscriptions.

-### [P1] Write timeouts are reported as successful writes
-File: `src/ZB.MOM.WW.LmxOpcUa.Host/MxAccess/MxAccessClient.ReadWrite.cs:101-105`
+Validation:
+- `MultiClientTests.MultipleClients_SubscribeToSameTag_AllReceiveDataChanges` passed

-On write timeout, the cancellation callback completes the pending task with `true`. That means a missing `OnWriteComplete` callback is treated as a successful write, and `LmxNodeManager.Write(...)` will return `ServiceResult.Good` upstream. In any timeout scenario where MXAccess never acknowledged the write, OPC UA clients will be told the write succeeded even though the runtime value may never have changed.
+### [DONE] Write timeout no longer reports success
+Verified in `src/ZB.MOM.WW.LmxOpcUa.Host/MxAccess/MxAccessClient.ReadWrite.cs:101-107`.

-### [P1] Auto-reconnect stops permanently after one failed reconnect attempt
+The timeout path now completes the write with `false` instead of `true`, so OPC UA writes are no longer acknowledged when MXAccess never confirms them.
+
+### [DONE] Monitor retries from `Error` state as well as `Disconnected`
+Verified in `src/ZB.MOM.WW.LmxOpcUa.Host/MxAccess/MxAccessClient.Monitor.cs:38-43`.
+
+The reconnect loop now treats `ConnectionState.Error` as reconnectable, which closes the prior issue where one failed reconnect left the client permanently stuck.
+
+Validation:
+- `MxAccessClientMonitorTests.Monitor_ReconnectsOnDisconnect` passed
+- `MxAccessClientMonitorTests.Monitor_ProbeStale_ForcesReconnect` passed
+
+### [DONE] Address-space construction no longer depends on input ordering
+Verified in `src/ZB.MOM.WW.LmxOpcUa.Host/OpcUa/LmxNodeManager.cs:75-77` and `:174-199`.
+
+The added topological sort ensures parents are materialized before children even when the hierarchy input is unsorted.
+
+### [DONE] Startup no longer forces an immediate redundant rebuild
+Verified in `src/ZB.MOM.WW.LmxOpcUa.Host/OpcUaService.cs:143-184` and `src/ZB.MOM.WW.LmxOpcUa.Host/GalaxyRepository/ChangeDetectionService.cs:25-30`.
+
+The initial deploy time is now captured during startup and passed into `ChangeDetectionService`, preventing the unconditional first-poll rebuild when startup already loaded the same deploy.
+
+Validation:
+- `ChangeDetectionServiceTests` passed
+
+## New Findings
+
+### [P1] Service never recovers if MXAccess is unavailable during startup
 Files:
- `src/ZB.MOM.WW.LmxOpcUa.Host/MxAccess/MxAccessClient.Connection.cs:101-110`
- `src/ZB.MOM.WW.LmxOpcUa.Host/MxAccess/MxAccessClient.Monitor.cs:38-45`
+- `src/ZB.MOM.WW.LmxOpcUa.Host/OpcUaService.cs:107-123`
+- `src/ZB.MOM.WW.LmxOpcUa.Host/OpcUaService.cs:139-140`

-The monitor only retries when `_state == ConnectionState.Disconnected`, but both `ConnectAsync` and `ReconnectAsync` move the client to `ConnectionState.Error` on a failed attempt. After the first reconnect failure, the monitor loop no longer matches its reconnect condition, so temporary outages become permanent until the whole service is restarted.
+If the initial `MxAccessClient.ConnectAsync()` fails, startup disposes the real client and STA thread, replaces them with `NullMxAccessClient`, and still starts the OPC UA server. After that point there is no code path that ever retries the real MXAccess connection, so a temporary LMX outage at service startup becomes a permanent loss of runtime publishing until the whole Windows service is restarted.

-### [P2] Address-space construction depends on parent-before-child ordering
-File: `src/ZB.MOM.WW.LmxOpcUa.Host/OpcUa/LmxNodeManager.cs:98-104`
-
-`BuildAddressSpace` attaches each object to `rootFolder` whenever its parent is not already in `nodeMap`. That only works if the input hierarchy is topologically ordered with every parent appearing before every descendant. The method does not enforce that ordering itself, so any unsorted hierarchy list will silently build the wrong OPC UA tree by promoting children to the root level.
-
-### [P3] Startup always performs an unnecessary second rebuild
+### [P1] Address-space rebuild loses live subscription bookkeeping
 Files:
- `src/ZB.MOM.WW.LmxOpcUa.Host/OpcUaService.cs:142-182`
- `src/ZB.MOM.WW.LmxOpcUa.Host/GalaxyRepository/ChangeDetectionService.cs:46-60`
+- `src/ZB.MOM.WW.LmxOpcUa.Host/OpcUa/LmxNodeManager.cs:163-169`
+- `src/ZB.MOM.WW.LmxOpcUa.Host/OpcUa/LmxNodeManager.cs:390-423`

-`OpcUaService.Start()` already reads the Galaxy hierarchy/attributes and builds the initial address space before change detection starts. `ChangeDetectionService` then unconditionally fires `OnGalaxyChanged` on its first poll, even when the deploy timestamp has not changed, causing an immediate second full DB fetch and address-space rebuild on every startup. That doubles startup load and rebuild latency for no functional gain.
+`RebuildAddressSpace()` clears `_subscriptionRefCounts` without unsubscribing the existing MXAccess subscriptions or rebuilding the ref counts for still-active monitored items. If a Galaxy rebuild happens while clients are subscribed, later deletes no longer call `_mxAccessClient.UnsubscribeAsync(...)`, and any future subscription to the same tag starts from zero and creates another runtime subscription. In a long-running service this leaks live subscriptions and can duplicate data-change delivery after successive rebuilds.

-## Notes
+### [P2] Failed reads and writes are still recorded as successful operations
+Files:
+- `src/ZB.MOM.WW.LmxOpcUa.Host/MxAccess/MxAccessClient.ReadWrite.cs:36-42`
+- `src/ZB.MOM.WW.LmxOpcUa.Host/MxAccess/MxAccessClient.ReadWrite.cs:97-109`
+- `src/ZB.MOM.WW.LmxOpcUa.Host/Status/HealthCheckService.cs:24-38`

- This was a static review of the `src` tree.
- I attempted `dotnet test ZB.MOM.WW.LmxOpcUa.slnx`, but the run timed out after about 124 seconds, so I did not rely on a full green test pass for validation.
+`PerformanceMetrics` only gets marked unsuccessful when an exception is thrown. A read timeout returns `Vtq.Bad(...)` and a rejected or timed-out write returns `false`, but neither path calls `scope.SetSuccess(false)`. Since the dashboard health logic uses those success rates to detect degraded behavior, sustained runtime failures can still leave the service reporting healthy even while reads or writes are failing.
+
+## Test Notes
+
+Focused verification completed successfully:
+- `dotnet test tests/ZB.MOM.WW.LmxOpcUa.Tests/ZB.MOM.WW.LmxOpcUa.Tests.csproj --filter "FullyQualifiedName~MultiClientTests.MultipleClients_SubscribeToSameTag_AllReceiveDataChanges"`
+- `dotnet test tests/ZB.MOM.WW.LmxOpcUa.Tests/ZB.MOM.WW.LmxOpcUa.Tests.csproj --filter "FullyQualifiedName~MxAccessClientMonitorTests.Monitor_ReconnectsOnDisconnect|FullyQualifiedName~MxAccessClientMonitorTests.Monitor_ProbeStale_ForcesReconnect"`
+- `dotnet test tests/ZB.MOM.WW.LmxOpcUa.Tests/ZB.MOM.WW.LmxOpcUa.Tests.csproj --filter "FullyQualifiedName~ChangeDetectionServiceTests"`
+
+I did not rerun the entire solution test suite in this pass. A previous full-solution run had timed out, so verification here was focused on the behavior touched by the prior fixes and the service-reliability paths reviewed above.