Add alarm acknowledge plan and incorporate code review fixes
Adds alarm_ack.md documenting the two-way acknowledge flow (OPC UA client writes AckMsg, Galaxy confirms via Acked data change). Includes external code review fixes for subscriptions and node manager, and removes stale plan files now superseded by component documentation. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
This commit is contained in:
@@ -1,75 +0,0 @@
|
||||
# Full Review Update: `src`
|
||||
|
||||
Overall verdict: **patch is still incorrect**
|
||||
|
||||
This pass verified the previous findings against the current implementation, then performed another full review of `src/` with emphasis on service-mode reliability, continuous OPC UA publishing, and recovery from MXAccess disconnects.
|
||||
|
||||
## Verified Completed
|
||||
|
||||
### [DONE] OPC UA monitored items now start MXAccess subscriptions
|
||||
Verified in `src/ZB.MOM.WW.LmxOpcUa.Host/OpcUa/LmxNodeManager.cs:355-423`.
|
||||
|
||||
The node manager now hooks monitored-item create/delete callbacks and translates them into ref-counted `SubscribeTag` / `UnsubscribeTag` calls. This closes the original gap where OPC UA subscriptions never triggered runtime MX subscriptions.
|
||||
|
||||
Validation:
|
||||
- `MultiClientTests.MultipleClients_SubscribeToSameTag_AllReceiveDataChanges` passed
|
||||
|
||||
### [DONE] Write timeout no longer reports success
|
||||
Verified in `src/ZB.MOM.WW.LmxOpcUa.Host/MxAccess/MxAccessClient.ReadWrite.cs:101-107`.
|
||||
|
||||
The timeout path now completes the write with `false` instead of `true`, so OPC UA writes are no longer acknowledged when MXAccess never confirms them.
|
||||
|
||||
### [DONE] Monitor retries from `Error` state as well as `Disconnected`
|
||||
Verified in `src/ZB.MOM.WW.LmxOpcUa.Host/MxAccess/MxAccessClient.Monitor.cs:38-43`.
|
||||
|
||||
The reconnect loop now treats `ConnectionState.Error` as reconnectable, which closes the prior issue where one failed reconnect left the client permanently stuck.
|
||||
|
||||
Validation:
|
||||
- `MxAccessClientMonitorTests.Monitor_ReconnectsOnDisconnect` passed
|
||||
- `MxAccessClientMonitorTests.Monitor_ProbeStale_ForcesReconnect` passed
|
||||
|
||||
### [DONE] Address-space construction no longer depends on input ordering
|
||||
Verified in `src/ZB.MOM.WW.LmxOpcUa.Host/OpcUa/LmxNodeManager.cs:75-77` and `:174-199`.
|
||||
|
||||
The added topological sort ensures parents are materialized before children even when the hierarchy input is unsorted.
|
||||
|
||||
### [DONE] Startup no longer forces an immediate redundant rebuild
|
||||
Verified in `src/ZB.MOM.WW.LmxOpcUa.Host/OpcUaService.cs:143-184` and `src/ZB.MOM.WW.LmxOpcUa.Host/GalaxyRepository/ChangeDetectionService.cs:25-30`.
|
||||
|
||||
The initial deploy time is now captured during startup and passed into `ChangeDetectionService`, preventing the unconditional first-poll rebuild when startup already loaded the same deploy.
|
||||
|
||||
Validation:
|
||||
- `ChangeDetectionServiceTests` passed
|
||||
|
||||
## New Findings
|
||||
|
||||
### [P1] Service never recovers if MXAccess is unavailable during startup
|
||||
Files:
|
||||
- `src/ZB.MOM.WW.LmxOpcUa.Host/OpcUaService.cs:107-123`
|
||||
- `src/ZB.MOM.WW.LmxOpcUa.Host/OpcUaService.cs:139-140`
|
||||
|
||||
If the initial `MxAccessClient.ConnectAsync()` fails, startup disposes the real client and STA thread, replaces them with `NullMxAccessClient`, and still starts the OPC UA server. After that point there is no code path that ever retries the real MXAccess connection, so a temporary LMX outage at service startup becomes a permanent loss of runtime publishing until the whole Windows service is restarted.
|
||||
|
||||
### [P1] Address-space rebuild loses live subscription bookkeeping
|
||||
Files:
|
||||
- `src/ZB.MOM.WW.LmxOpcUa.Host/OpcUa/LmxNodeManager.cs:163-169`
|
||||
- `src/ZB.MOM.WW.LmxOpcUa.Host/OpcUa/LmxNodeManager.cs:390-423`
|
||||
|
||||
`RebuildAddressSpace()` clears `_subscriptionRefCounts` without unsubscribing the existing MXAccess subscriptions or rebuilding the ref counts for still-active monitored items. If a Galaxy rebuild happens while clients are subscribed, later deletes no longer call `_mxAccessClient.UnsubscribeAsync(...)`, and any future subscription to the same tag starts from zero and creates another runtime subscription. In a long-running service this leaks live subscriptions and can duplicate data-change delivery after successive rebuilds.
|
||||
|
||||
### [P2] Failed reads and writes are still recorded as successful operations
|
||||
Files:
|
||||
- `src/ZB.MOM.WW.LmxOpcUa.Host/MxAccess/MxAccessClient.ReadWrite.cs:36-42`
|
||||
- `src/ZB.MOM.WW.LmxOpcUa.Host/MxAccess/MxAccessClient.ReadWrite.cs:97-109`
|
||||
- `src/ZB.MOM.WW.LmxOpcUa.Host/Status/HealthCheckService.cs:24-38`
|
||||
|
||||
`PerformanceMetrics` only gets marked unsuccessful when an exception is thrown. A read timeout returns `Vtq.Bad(...)` and a rejected or timed-out write returns `false`, but neither path calls `scope.SetSuccess(false)`. Since the dashboard health logic uses those success rates to detect degraded behavior, sustained runtime failures can still leave the service reporting healthy even while reads or writes are failing.
|
||||
|
||||
## Test Notes
|
||||
|
||||
Focused verification completed successfully:
|
||||
- `dotnet test tests/ZB.MOM.WW.LmxOpcUa.Tests/ZB.MOM.WW.LmxOpcUa.Tests.csproj --filter "FullyQualifiedName~MultiClientTests.MultipleClients_SubscribeToSameTag_AllReceiveDataChanges"`
|
||||
- `dotnet test tests/ZB.MOM.WW.LmxOpcUa.Tests/ZB.MOM.WW.LmxOpcUa.Tests.csproj --filter "FullyQualifiedName~MxAccessClientMonitorTests.Monitor_ReconnectsOnDisconnect|FullyQualifiedName~MxAccessClientMonitorTests.Monitor_ProbeStale_ForcesReconnect"`
|
||||
- `dotnet test tests/ZB.MOM.WW.LmxOpcUa.Tests/ZB.MOM.WW.LmxOpcUa.Tests.csproj --filter "FullyQualifiedName~ChangeDetectionServiceTests"`
|
||||
|
||||
I did not rerun the entire solution test suite in this pass. A previous full-solution run had timed out, so verification here was focused on the behavior touched by the prior fixes and the service-reliability paths reviewed above.
|
||||
59
codereviews/solution_review_20260327.json
Normal file
59
codereviews/solution_review_20260327.json
Normal file
@@ -0,0 +1,59 @@
|
||||
{
|
||||
"findings": [
|
||||
{
|
||||
"title": "[P1] Synchronize dispatch lookups with address-space rebuilds",
|
||||
"body": "The data-change dispatcher reads `_tagToVariableNode` and `_alarmInAlarmTags` here before taking `Lock`, while `BuildAddressSpace`, `SyncAddressSpace`, and `TearDownGobjects` mutate those same `Dictionary` instances under `Lock`. If a Galaxy rebuild lands while MXAccess is delivering changes, `TryGetValue` can observe concurrent mutation and throw; because these lookups sit outside the inner `try/catch`, the dispatch thread dies and all subscribed OPC UA items stop publishing until the service is restarted.",
|
||||
"confidence_score": 0.98,
|
||||
"priority": 1,
|
||||
"code_location": {
|
||||
"absolute_file_path": "C:\\Users\\dohertj2\\Desktop\\lmxopcua\\src\\ZB.MOM.WW.LmxOpcUa.Host\\OpcUa\\LmxNodeManager.cs",
|
||||
"line_range": {
|
||||
"start": 1516,
|
||||
"end": 1530
|
||||
}
|
||||
}
|
||||
},
|
||||
{
|
||||
"title": "[P1] Guard subscription refcounts with the same mutex",
|
||||
"body": "This block rewrites `_subscriptionRefCounts` while holding `Lock`, but the normal monitored-item paths (`SubscribeTag`, `UnsubscribeTag`, and `RestoreTransferredSubscriptions`) protect the same dictionary with `_lock` instead. A monitored-item create/delete that overlaps a rebuild can therefore race with the restore path here, losing counts or sending extra `SubscribeAsync`/`UnsubscribeAsync` calls. In a live system that leaves surviving subscriptions missing or leaked after a deploy even though the OPC UA client state never changed.",
|
||||
"confidence_score": 0.93,
|
||||
"priority": 1,
|
||||
"code_location": {
|
||||
"absolute_file_path": "C:\\Users\\dohertj2\\Desktop\\lmxopcua\\src\\ZB.MOM.WW.LmxOpcUa.Host\\OpcUa\\LmxNodeManager.cs",
|
||||
"line_range": {
|
||||
"start": 556,
|
||||
"end": 557
|
||||
}
|
||||
}
|
||||
},
|
||||
{
|
||||
"title": "[P2] Skip duplicate runtime subscriptions for the same tag",
|
||||
"body": "Once the client is connected, `SubscribeAsync` always opens a fresh COM item for the address instead of reusing an existing one. Callers can hit that through alarm auto-subscriptions, transferred-monitored-item recovery, or any repeated higher-level subscribe for the same tag. The new handle overwrites `_addressToHandle[address]`, so `UnsubscribeAsync` removes only the newest item while the original subscription keeps receiving callbacks until disconnect.",
|
||||
"confidence_score": 0.97,
|
||||
"priority": 2,
|
||||
"code_location": {
|
||||
"absolute_file_path": "C:\\Users\\dohertj2\\Desktop\\lmxopcua\\src\\ZB.MOM.WW.LmxOpcUa.Host\\MxAccess\\MxAccessClient.Subscription.cs",
|
||||
"line_range": {
|
||||
"start": 17,
|
||||
"end": 20
|
||||
}
|
||||
}
|
||||
},
|
||||
{
|
||||
"title": "[P2] Detach MX callbacks before disposing the dispatch signal",
|
||||
"body": "The node manager subscribes to `_mxAccessClient.OnTagValueChanged` in the constructor, but `Dispose(bool)` only stops the thread and disposes `_dataChangeSignal`. During `OpcUaService.Stop()`, the OPC UA server is stopped before the MXAccess client disconnects, so any runtime callback delivered in that window will call `OnMxAccessDataChange` and hit `Set()` on a disposed `AutoResetEvent`. That produces shutdown-time exceptions and keeps the dead node manager rooted by the event subscription.",
|
||||
"confidence_score": 0.89,
|
||||
"priority": 2,
|
||||
"code_location": {
|
||||
"absolute_file_path": "C:\\Users\\dohertj2\\Desktop\\lmxopcua\\src\\ZB.MOM.WW.LmxOpcUa.Host\\OpcUa\\LmxNodeManager.cs",
|
||||
"line_range": {
|
||||
"start": 1632,
|
||||
"end": 1633
|
||||
}
|
||||
}
|
||||
}
|
||||
],
|
||||
"overall_correctness": "patch is incorrect",
|
||||
"overall_explanation": "The solution has multiple runtime reliability issues in the MXAccess-to-OPC-UA bridge: address-space rebuilds race with live data dispatch, subscription bookkeeping is not synchronized consistently, and duplicate or stale subscriptions can be left behind. Those issues can break live publishing or leak runtime handles under ordinary reconnect and rebuild scenarios.",
|
||||
"overall_confidence_score": 0.94
|
||||
}
|
||||
Reference in New Issue
Block a user