Phase 2 PR 6 — close PR 4 monitor-loop low findings (probe leak + replay signal) #5
Reference in New Issue
Block a user
Delete Branch "phase-2-pr6-monitor-findings"
Deleting a branch is permanent. Although the deleted branch may continue to exist for a short time before it actually gets removed, it CANNOT be undone in most cases. Continue?
Summary
Close the 2 low findings carried forward from PR 4 (monitor-loop follow-ups):
$Heartbeatprobe handle leak.MonitorLoopAsynccalledAddItem(handle, "$Heartbeat")to probe liveness but discarded the returned item handle. Over a 24h uptime with the default 5s monitor interval that's ~17,000 leaked handles accumulated in the MXAccess proxy's handle table.catch { /* skip */ }so operators had no signal when specific tags failed to replay — the first indication was a quality drop on OPC UA clients downstream.Fixes
MxAccessClient.MonitorLoopAsync— capture the probe item handle,RemoveItemit infinallyon the same pump turn. Best-effort swallow so a dying proxy doesn't throw secondary exceptions out of the probe path. Probe success becomeshandle > 0to match v1's handling ofAddItemreturning 0.SubscriptionReplayFailedEventArgs+SubscriptionReplayFailedevent raised once per tag that fails to replay after reconnect.Log.Warningper failure + a summary log line ("{failed} of {total} failed"or"{total} re-subscribed cleanly").Tests
Heartbeat_probe_calls_RemoveItem_for_every_AddItem— drives the monitor loop through several probe cycles viaMonitorInterval=150ms+StaleThreshold=50ms; assertsAddCount > 1,AddCount == RemoveCount,OutstandingHeartbeatHandles == 0after dispose.SubscriptionReplayFailed_fires_for_each_tag_that_fails_to_replay— subscribes GoodTag.X + BadTag.A + BadTag.B; triggers a probe failure to drive the reconnect path; asserts exactly 2SubscriptionReplayFailedevents fire with both bad tags represented (GoodTag.X replays cleanly so no event).Test plan
dotnet build src/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Host/— 0 errors, 0 warningsdotnet test tests/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Host.Tests/ --filter "Category=Unit"— 26/26 pass (2 new + 24 pre-existing)Merge order
This PR branches off
phase-2-pr4-findingsso it rebases cleanly whether PR 4 merges first or PR 4/5/6 are rebased together onto master. No conflicts expected with PR 5 (Historian) — different subsystems.🤖 Generated with Claude Code
probeHandle > 0so any AddItem that returns 0 (MXAccess's "could not create") counts as a failed probe, matching v1 behavior. Low finding #2 (subscription replay silently swallowed per-tag failures): after a reconnect, the replay loop iterates the pre-reconnect subscription snapshot and calls SubscribeOnPumpAsync for each; previously those failures went into a bare catch { /* skip */ } so an operator had no signal when specific tags failed to re-subscribe — the first indication downstream was a quality drop on OPC UA clients. Fix: new SubscriptionReplayFailedEventArgs (TagReference + Exception) + SubscriptionReplayFailed event on MxAccessClient that fires once per tag that fails to re-subscribe, Log.Warning per failure with the reconnect counter + tag reference, and a summary log line at the end of the replay loop ("{failed} of {total} failed" or "{total} re-subscribed cleanly"). Serilogusing+ILogger Log = Serilog.Log.ForContext<MxAccessClient>()added. Tests — MxAccessClientMonitorLoopTests (new file, 2 cases): Heartbeat_probe_calls_RemoveItem_for_every_AddItem constructs a CountingProxy IMxProxy that tracks AddItem/RemoveItem pair counts scoped to the "$Heartbeat" address, runs the client with MonitorInterval=150ms + StaleThreshold=50ms for 700ms, asserts HeartbeatAddCount > 1, HeartbeatAddCount == HeartbeatRemoveCount, OutstandingHeartbeatHandles == 0 after dispose; SubscriptionReplayFailed_fires_for_each_tag_that_fails_to_replay uses a ReplayFailingProxy that throws on the next $Heartbeat probe (to trigger the reconnect path) and throws on the replay-time AddItem for specified tag names ("BadTag.A", "BadTag.B"), subscribes GoodTag.X + BadTag.A + BadTag.B before triggering probe failure, collects SubscriptionReplayFailed args into a ConcurrentBag, asserts exactly 2 events fired and both bad tags are represented — GoodTag.X replays cleanly so it does not fire. Host.Tests csproj gains a Reference to lib/ArchestrA.MxAccess.dll because IMxProxy's MxDataChangeHandler delegate signature mentions MXSTATUS_PROXY and the compiler resolves all delegate parameter types when a test class implements the interface, even if the test code never names the type. No regressions: full Galaxy.Host.Tests Unit suite 26 pass / 0 fail (2 new monitor-loop tests + 9 PR5 historian + 15 pre-existing PostMortemMmf/RecyclePolicy/StaPump/MemoryWatchdog/EndToEndIpc/Handshake). Galaxy.Host builds clean (0 errors, 0 warnings) — the new Serilog.Log.ForContext usage picks up the existing Serilog package ref that PR 4 pulled in for the monitor-loop infrastructure. Both findings were flagged as non-blocking for PR 4 merge and are now resolved alongside whichever merge order the reviewer picks; this PR branches off phase-2-pr4-findings so it can rebase cleanly if PR 4 lands first or be re-based onto master after PR 4 merges. 1c2bf74d38