Files
lmxopcua/codereviews/review.md
Joseph Doherty 09ed15bdda Fix second-pass review findings: subscription leak on rebuild, metrics accuracy, and MxAccess startup recovery
- Preserve and replay subscription ref counts across address space rebuilds to prevent MXAccess subscription leaks
- Mark read timeouts and write failures as unsuccessful in PerformanceMetrics for accurate health reporting
- Add deferred MxAccess reconnect path when initial connection fails at startup
- Update code review document with verified completions and new findings
- Add covering tests for all fixes

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-25 09:41:12 -04:00

5.1 KiB

Full Review Update: src

Overall verdict: patch is still incorrect

This pass verified the previous findings against the current implementation, then performed another full review of src/ with emphasis on service-mode reliability, continuous OPC UA publishing, and recovery from MXAccess disconnects.

Verified Completed

[DONE] OPC UA monitored items now start MXAccess subscriptions

Verified in src/ZB.MOM.WW.LmxOpcUa.Host/OpcUa/LmxNodeManager.cs:355-423.

The node manager now hooks monitored-item create/delete callbacks and translates them into ref-counted SubscribeTag / UnsubscribeTag calls. This closes the original gap where OPC UA subscriptions never triggered runtime MX subscriptions.

Validation:

  • MultiClientTests.MultipleClients_SubscribeToSameTag_AllReceiveDataChanges passed

[DONE] Write timeout no longer reports success

Verified in src/ZB.MOM.WW.LmxOpcUa.Host/MxAccess/MxAccessClient.ReadWrite.cs:101-107.

The timeout path now completes the write with false instead of true, so OPC UA writes are no longer acknowledged when MXAccess never confirms them.

[DONE] Monitor retries from Error state as well as Disconnected

Verified in src/ZB.MOM.WW.LmxOpcUa.Host/MxAccess/MxAccessClient.Monitor.cs:38-43.

The reconnect loop now treats ConnectionState.Error as reconnectable, which closes the prior issue where one failed reconnect left the client permanently stuck.

Validation:

  • MxAccessClientMonitorTests.Monitor_ReconnectsOnDisconnect passed
  • MxAccessClientMonitorTests.Monitor_ProbeStale_ForcesReconnect passed

[DONE] Address-space construction no longer depends on input ordering

Verified in src/ZB.MOM.WW.LmxOpcUa.Host/OpcUa/LmxNodeManager.cs:75-77 and :174-199.

The added topological sort ensures parents are materialized before children even when the hierarchy input is unsorted.

[DONE] Startup no longer forces an immediate redundant rebuild

Verified in src/ZB.MOM.WW.LmxOpcUa.Host/OpcUaService.cs:143-184 and src/ZB.MOM.WW.LmxOpcUa.Host/GalaxyRepository/ChangeDetectionService.cs:25-30.

The initial deploy time is now captured during startup and passed into ChangeDetectionService, preventing the unconditional first-poll rebuild when startup already loaded the same deploy.

Validation:

  • ChangeDetectionServiceTests passed

New Findings

[P1] Service never recovers if MXAccess is unavailable during startup

Files:

  • src/ZB.MOM.WW.LmxOpcUa.Host/OpcUaService.cs:107-123
  • src/ZB.MOM.WW.LmxOpcUa.Host/OpcUaService.cs:139-140

If the initial MxAccessClient.ConnectAsync() fails, startup disposes the real client and STA thread, replaces them with NullMxAccessClient, and still starts the OPC UA server. After that point there is no code path that ever retries the real MXAccess connection, so a temporary LMX outage at service startup becomes a permanent loss of runtime publishing until the whole Windows service is restarted.

[P1] Address-space rebuild loses live subscription bookkeeping

Files:

  • src/ZB.MOM.WW.LmxOpcUa.Host/OpcUa/LmxNodeManager.cs:163-169
  • src/ZB.MOM.WW.LmxOpcUa.Host/OpcUa/LmxNodeManager.cs:390-423

RebuildAddressSpace() clears _subscriptionRefCounts without unsubscribing the existing MXAccess subscriptions or rebuilding the ref counts for still-active monitored items. If a Galaxy rebuild happens while clients are subscribed, later deletes no longer call _mxAccessClient.UnsubscribeAsync(...), and any future subscription to the same tag starts from zero and creates another runtime subscription. In a long-running service this leaks live subscriptions and can duplicate data-change delivery after successive rebuilds.

[P2] Failed reads and writes are still recorded as successful operations

Files:

  • src/ZB.MOM.WW.LmxOpcUa.Host/MxAccess/MxAccessClient.ReadWrite.cs:36-42
  • src/ZB.MOM.WW.LmxOpcUa.Host/MxAccess/MxAccessClient.ReadWrite.cs:97-109
  • src/ZB.MOM.WW.LmxOpcUa.Host/Status/HealthCheckService.cs:24-38

PerformanceMetrics only gets marked unsuccessful when an exception is thrown. A read timeout returns Vtq.Bad(...) and a rejected or timed-out write returns false, but neither path calls scope.SetSuccess(false). Since the dashboard health logic uses those success rates to detect degraded behavior, sustained runtime failures can still leave the service reporting healthy even while reads or writes are failing.

Test Notes

Focused verification completed successfully:

  • dotnet test tests/ZB.MOM.WW.LmxOpcUa.Tests/ZB.MOM.WW.LmxOpcUa.Tests.csproj --filter "FullyQualifiedName~MultiClientTests.MultipleClients_SubscribeToSameTag_AllReceiveDataChanges"
  • dotnet test tests/ZB.MOM.WW.LmxOpcUa.Tests/ZB.MOM.WW.LmxOpcUa.Tests.csproj --filter "FullyQualifiedName~MxAccessClientMonitorTests.Monitor_ReconnectsOnDisconnect|FullyQualifiedName~MxAccessClientMonitorTests.Monitor_ProbeStale_ForcesReconnect"
  • dotnet test tests/ZB.MOM.WW.LmxOpcUa.Tests/ZB.MOM.WW.LmxOpcUa.Tests.csproj --filter "FullyQualifiedName~ChangeDetectionServiceTests"

I did not rerun the entire solution test suite in this pass. A previous full-solution run had timed out, so verification here was focused on the behavior touched by the prior fixes and the service-reliability paths reviewed above.