Phase 2 PR 6 — close PR 4 monitor-loop low findings (probe leak + replay signal) #5

Merged
dohertj2 merged 2 commits from phase-2-pr6-monitor-findings into v2 2026-04-18 06:57:58 -04:00
Owner

Summary

Close the 2 low findings carried forward from PR 4 (monitor-loop follow-ups):

  • Low #1$Heartbeat probe handle leak. MonitorLoopAsync called AddItem(handle, "$Heartbeat") to probe liveness but discarded the returned item handle. Over a 24h uptime with the default 5s monitor interval that's ~17,000 leaked handles accumulated in the MXAccess proxy's handle table.
  • Low #2 — replay-loop silently ate per-tag failures. After a reconnect, the replay loop iterated the pre-reconnect subscription snapshot. Per-tag resubscribe exceptions went into a bare catch { /* skip */ } so operators had no signal when specific tags failed to replay — the first indication was a quality drop on OPC UA clients downstream.

Fixes

  • MxAccessClient.MonitorLoopAsync — capture the probe item handle, RemoveItem it in finally on the same pump turn. Best-effort swallow so a dying proxy doesn't throw secondary exceptions out of the probe path. Probe success becomes handle > 0 to match v1's handling of AddItem returning 0.
  • New SubscriptionReplayFailedEventArgs + SubscriptionReplayFailed event raised once per tag that fails to replay after reconnect. Log.Warning per failure + a summary log line ("{failed} of {total} failed" or "{total} re-subscribed cleanly").

Tests

  • Heartbeat_probe_calls_RemoveItem_for_every_AddItem — drives the monitor loop through several probe cycles via MonitorInterval=150ms + StaleThreshold=50ms; asserts AddCount > 1, AddCount == RemoveCount, OutstandingHeartbeatHandles == 0 after dispose.
  • SubscriptionReplayFailed_fires_for_each_tag_that_fails_to_replay — subscribes GoodTag.X + BadTag.A + BadTag.B; triggers a probe failure to drive the reconnect path; asserts exactly 2 SubscriptionReplayFailed events fire with both bad tags represented (GoodTag.X replays cleanly so no event).

Test plan

  • dotnet build src/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Host/ — 0 errors, 0 warnings
  • dotnet test tests/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Host.Tests/ --filter "Category=Unit" — 26/26 pass (2 new + 24 pre-existing)
  • Reviewer: verify the fix via long-running soak on a machine with the real ArchestrA runtime — watch the MXAccess proxy's handle count stabilize across many monitor cycles instead of growing linearly.

Merge order

This PR branches off phase-2-pr4-findings so it rebases cleanly whether PR 4 merges first or PR 4/5/6 are rebased together onto master. No conflicts expected with PR 5 (Historian) — different subsystems.


🤖 Generated with Claude Code

## Summary Close the 2 low findings carried forward from PR 4 (monitor-loop follow-ups): - **Low #1 — `$Heartbeat` probe handle leak.** `MonitorLoopAsync` called `AddItem(handle, "$Heartbeat")` to probe liveness but discarded the returned item handle. Over a 24h uptime with the default 5s monitor interval that's ~17,000 leaked handles accumulated in the MXAccess proxy's handle table. - **Low #2 — replay-loop silently ate per-tag failures.** After a reconnect, the replay loop iterated the pre-reconnect subscription snapshot. Per-tag resubscribe exceptions went into a bare `catch { /* skip */ }` so operators had no signal when specific tags failed to replay — the first indication was a quality drop on OPC UA clients downstream. ## Fixes - `MxAccessClient.MonitorLoopAsync` — capture the probe item handle, `RemoveItem` it in `finally` on the same pump turn. Best-effort swallow so a dying proxy doesn't throw secondary exceptions out of the probe path. Probe success becomes `handle > 0` to match v1's handling of `AddItem` returning 0. - New `SubscriptionReplayFailedEventArgs` + `SubscriptionReplayFailed` event raised once per tag that fails to replay after reconnect. `Log.Warning` per failure + a summary log line (`"{failed} of {total} failed"` or `"{total} re-subscribed cleanly"`). ## Tests - `Heartbeat_probe_calls_RemoveItem_for_every_AddItem` — drives the monitor loop through several probe cycles via `MonitorInterval=150ms` + `StaleThreshold=50ms`; asserts `AddCount > 1`, `AddCount == RemoveCount`, `OutstandingHeartbeatHandles == 0` after dispose. - `SubscriptionReplayFailed_fires_for_each_tag_that_fails_to_replay` — subscribes GoodTag.X + BadTag.A + BadTag.B; triggers a probe failure to drive the reconnect path; asserts exactly 2 `SubscriptionReplayFailed` events fire with both bad tags represented (GoodTag.X replays cleanly so no event). ## Test plan - [x] `dotnet build src/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Host/` — 0 errors, 0 warnings - [x] `dotnet test tests/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Host.Tests/ --filter "Category=Unit"` — 26/26 pass (2 new + 24 pre-existing) - [ ] Reviewer: verify the fix via long-running soak on a machine with the real ArchestrA runtime — watch the MXAccess proxy's handle count stabilize across many monitor cycles instead of growing linearly. ## Merge order This PR branches off `phase-2-pr4-findings` so it rebases cleanly whether PR 4 merges first or PR 4/5/6 are rebased together onto master. No conflicts expected with PR 5 (Historian) — different subsystems. --- 🤖 Generated with [Claude Code](https://claude.com/claude-code)
dohertj2 changed target branch from phase-2-pr4-findings to v2 2026-04-18 06:57:51 -04:00
dohertj2 added 2 commits 2026-04-18 06:57:51 -04:00
Phase 2 PR 5 — port Wonderware Historian SDK into Driver.Galaxy.Host/Backend/Historian/. The full v1 Historian.Aveva code path (HistorianDataSource + HistorianClusterEndpointPicker + IHistorianConnectionFactory + SdkHistorianConnectionFactory) now lives inside Galaxy.Host instead of the previously-required out-of-tree plugin + HistorianPluginLoader AssemblyResolve hack, and MxAccessGalaxyBackend.HistoryReadAsync — which previously returned a Phase 2 Task B.1.h follow-up placeholder — now delegates to the ported HistorianDataSource.ReadRawAsync, maps HistorianSample to GalaxyDataValue via the IPC wire shape, and reports Success=true with per-tag HistoryTagValues arrays. OPC-UA-free surface inside Galaxy.Host: the v1 code returned Opc.Ua.DataValue on the hot path, which would have required dragging OPCFoundation.NetStandard.Opc.Ua.Server into net48 x86 Galaxy.Host and bleeding OPC types across the IPC boundary — instead, the port introduces HistorianSample (Value, Quality byte, TimestampUtc) + HistorianAggregateSample (Value, TimestampUtc) POCOs that carry the raw MX quality byte through the pipe unchanged, and the OPC translation happens on the Proxy side via the existing QualityMapper that the live-read path already uses. Decision #13's IPC data-shape contract survives intact — GalaxyDataValue (TagReference + ValueBytes MessagePack + ValueMessagePackType + StatusCode + SourceTimestampUtcUnixMs + ServerTimestampUtcUnixMs) — so no Shared.Contracts wire break vs PR 4. Cluster failover preserved verbatim: HistorianClusterEndpointPicker is the thread-safe pure-logic picker ported verbatim with no SDK dependency (injected DateTime clock, per-node cooldown state, unknown-node-name tolerance, case-insensitive de-dup on configuration-order list), ConnectToAnyHealthyNode iterates the picker's healthy candidates, clones config per attempt, calls the factory, marks healthy on success / failed on exception with the failure message stored for dashboard surfacing, throws "All N healthy historian candidate(s) failed" with the last exception chained when every node exhausts. Process path + Event path use separate HistorianAccess connections (CreateHistoryQuery vs CreateEventQuery vs CreateAnalogSummaryQuery on the SDK surface) guarded by independent _connection/_eventConnection locks — a mid-query failure on one silo resets only that connection, the other stays open. Four SDK paths ported: ReadRawAsync (RetrievalMode.Full, BatchSize from config.MaxValuesPerRead, MoveNext pump, per-sample quality + value decode with the StringValue/Value fallback the v1 code did, limit-based early exit), ReadAggregateAsync (AnalogSummaryQuery + Resolution in ms, ExtractAggregateValue maps Average/Minimum/Maximum/ValueCount/First/Last/StdDev column names — the NodeId to column mapping is moved to the Proxy side since the IPC request carries a string column), ReadAtTimeAsync (per-timestamp HistoryQuery with RetrievalMode.Interpolated + BatchSize=1, returns Quality=0 / Value=null for missing samples), ReadEventsAsync (EventQuery + AddEventFilter("Source",Equal,sourceName) when sourceName is non-null, EventOrder.Ascending, EventCount = maxEvents or config.MaxValuesPerRead); GetHealthSnapshot returns the full runtime-health snapshot (TotalQueries/Successes/Failures + ConsecutiveFailures + LastSuccess/FailureTime + LastError + ProcessConnectionOpen/EventConnectionOpen + ActiveProcessNode/ActiveEventNode + per-node state list). ReadRaw is the only path wired through IPC in PR 5 (HistoryReadRequest/HistoryTagValues/HistoryReadResponse already existed in Shared.Contracts); Aggregate/AtTime/Events/Health are ported-but-not-yet-IPC-exposed — they stay internal to Galaxy.Host for PR 6+ to surface via new contract message kinds (aggregate = OPC UA HistoryReadProcessed, at-time = HistoryReadAtTime, events = HistoryReadEvents, health = admin dashboard IPC query). Galaxy.Host csproj gains aahClientManaged + aahClientCommon references with Private=false (managed wrappers) + None items for aahClient.dll + Historian.CBE.dll + Historian.DPAPI.dll + ArchestrA.CloudHistorian.Contract.dll native satellites staged alongside the host exe via CopyToOutputDirectory=PreserveNewest so aahClientManaged can P/Invoke into them at runtime without an AssemblyResolve hook (cleaner than the v1 HistorianPluginLoader.cs 180-LOC AssemblyResolve + Assembly.LoadFrom dance that existed solely because the plugin was loaded late from Host/bin/Debug/net48/Historian/). Program.cs adds BuildHistorianIfEnabled() that reads OTOPCUA_HISTORIAN_ENABLED (true or 1) + OTOPCUA_HISTORIAN_SERVER + OTOPCUA_HISTORIAN_SERVERS (comma-separated cluster list overrides single-server) + OTOPCUA_HISTORIAN_PORT (default 32568) + OTOPCUA_HISTORIAN_INTEGRATED (default true) + OTOPCUA_HISTORIAN_USER/OTOPCUA_HISTORIAN_PASS + OTOPCUA_HISTORIAN_TIMEOUT_SEC (30) + OTOPCUA_HISTORIAN_MAX_VALUES (10000) + OTOPCUA_HISTORIAN_COOLDOWN_SEC (60), returns null when disabled so MxAccessGalaxyBackend.HistoryReadAsync surfaces a clean "Historian disabled" Success=false instead of a localhost-SDK hang; server.RunAsync finally block now also casts backend to IDisposable.Dispose() so the historian SDK connections get cleanly closed on Ctrl+C. MxAccessGalaxyBackend gains an IHistorianDataSource? historian constructor parameter (defaults null to preserve existing Host.Tests call sites that don't exercise HistoryRead), implements IDisposable that forwards to _historian.Dispose(), and the pragma warning disable CS0618 is locally scoped to the ToDto(HistorianEvent) mapper since the SDK marks Id/Source/DisplayText/Severity obsolete but the replacement surface isn't available in the aahClientManaged version we bind against — every other deprecated-SDK use still surfaces as an error under TreatWarningsAsErrors. Ported from v1 Historian.Aveva unchanged: the CloneConfigWithServerName helper that preserves every config field except ServerName per attempt; the double-checked locking in EnsureConnected/EnsureEventConnected (fast path = Volatile.Read outside lock, slow path acquires lock + re-checks + disposes any raced-in-parallel connection); HandleConnectionError/HandleEventConnectionError that close the dead connection, clear the active-node tracker, MarkFailed the picker entry with the exception message so the node enters cooldown, and log the reset with node= for operator correlation; RecordSuccess/RecordFailure that bump counters under _healthLock. Tests: HistorianClusterEndpointPickerTests (7 cases) — single-node ServerName fallback when ServerNames empty, MarkFailed enters cooldown and skips, cooldown expires after window, MarkHealthy immediately clears, all-in-cooldown returns empty healthy list, Snapshot reports failure count + last error + IsHealthy, case-insensitive de-dup on duplicate hostnames. HistorianWiringTests (2 cases) — HistoryReadAsync returns "Historian disabled" Success=false when historian:null passed; HistoryReadAsync with a fake IHistorianDataSource maps the returned HistorianSample (Value=42.5, Quality=192 Good, Timestamp) to a GalaxyDataValue with StatusCode=0u + SourceTimestampUtcUnixMs matching the sample + MessagePack-encoded value bytes. InternalsVisibleTo("...Host.Tests") added to Galaxy.Host.csproj so tests can reach the internal HistorianClusterEndpointPicker. Full Galaxy.Host.Tests suite: 24 pass / 0 fail (9 new historian + 15 pre-existing MemoryWatchdog/PostMortemMmf/RecyclePolicy/StaPump/EndToEndIpc/Handshake). Full solution build: 0 errors (202 pre-existing warnings). The v1 Historian.Aveva project + Historian.Aveva.Tests still build intact because the archive PR (Stream D.1 destructive delete) is still ahead of us — PR 5 intentionally does not delete either; once PR 2+3 merge and the archive-delete PR lands, a follow-up cleanup can remove Historian.Aveva + its 4 source files + 18 test cases. Alarm subsystem wire-up (OnAlarmEvent raising from MxAccessGalaxyBackend via AlarmExtension primitives) + host-status push (OnHostStatusChanged via a ported GalaxyRuntimeProbeManager) remain PR 6 candidates; they were on the same "Task B.1.h follow-up" list and share the IPC connection-sink wiring with the historian events path — it made PR 5 scope-manageable to do Historian first since that's what has the biggest surface area (981 LOC v1 plus SDK binding) and alarms/host-status have more bespoke integration with the existing MxAccess subscription fan-out. 6df1a79d35
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Phase 2 PR 6 — close the 2 low findings carried forward from PR 4. Low finding #1 ($Heartbeat probe handle leak in MonitorLoopAsync): the probe calls _proxy.AddItem(connectionHandle, "$Heartbeat") on every monitor tick that observes the connection is past StaleThreshold, but previously discarded the returned item handle — so every probe (one per MonitorInterval, default 5s) leaked one item handle into the MXAccess proxy's internal handle table. Fix: capture the item handle, call RemoveItem(connectionHandle, probeHandle) in the InvokeAsync's finally block so it runs on the same pump turn as the AddItem, best-effort RemoveItem swallow so a dying proxy doesn't throw secondary exceptions out of the probe path. Probe ok becomes probeHandle > 0 so any AddItem that returns 0 (MXAccess's "could not create") counts as a failed probe, matching v1 behavior. Low finding #2 (subscription replay silently swallowed per-tag failures): after a reconnect, the replay loop iterates the pre-reconnect subscription snapshot and calls SubscribeOnPumpAsync for each; previously those failures went into a bare catch { /* skip */ } so an operator had no signal when specific tags failed to re-subscribe — the first indication downstream was a quality drop on OPC UA clients. Fix: new SubscriptionReplayFailedEventArgs (TagReference + Exception) + SubscriptionReplayFailed event on MxAccessClient that fires once per tag that fails to re-subscribe, Log.Warning per failure with the reconnect counter + tag reference, and a summary log line at the end of the replay loop ("{failed} of {total} failed" or "{total} re-subscribed cleanly"). Serilog using + ILogger Log = Serilog.Log.ForContext<MxAccessClient>() added. Tests — MxAccessClientMonitorLoopTests (new file, 2 cases): Heartbeat_probe_calls_RemoveItem_for_every_AddItem constructs a CountingProxy IMxProxy that tracks AddItem/RemoveItem pair counts scoped to the "$Heartbeat" address, runs the client with MonitorInterval=150ms + StaleThreshold=50ms for 700ms, asserts HeartbeatAddCount > 1, HeartbeatAddCount == HeartbeatRemoveCount, OutstandingHeartbeatHandles == 0 after dispose; SubscriptionReplayFailed_fires_for_each_tag_that_fails_to_replay uses a ReplayFailingProxy that throws on the next $Heartbeat probe (to trigger the reconnect path) and throws on the replay-time AddItem for specified tag names ("BadTag.A", "BadTag.B"), subscribes GoodTag.X + BadTag.A + BadTag.B before triggering probe failure, collects SubscriptionReplayFailed args into a ConcurrentBag, asserts exactly 2 events fired and both bad tags are represented — GoodTag.X replays cleanly so it does not fire. Host.Tests csproj gains a Reference to lib/ArchestrA.MxAccess.dll because IMxProxy's MxDataChangeHandler delegate signature mentions MXSTATUS_PROXY and the compiler resolves all delegate parameter types when a test class implements the interface, even if the test code never names the type. No regressions: full Galaxy.Host.Tests Unit suite 26 pass / 0 fail (2 new monitor-loop tests + 9 PR5 historian + 15 pre-existing PostMortemMmf/RecyclePolicy/StaPump/MemoryWatchdog/EndToEndIpc/Handshake). Galaxy.Host builds clean (0 errors, 0 warnings) — the new Serilog.Log.ForContext usage picks up the existing Serilog package ref that PR 4 pulled in for the monitor-loop infrastructure. Both findings were flagged as non-blocking for PR 4 merge and are now resolved alongside whichever merge order the reviewer picks; this PR branches off phase-2-pr4-findings so it can rebase cleanly if PR 4 lands first or be re-based onto master after PR 4 merges. 1c2bf74d38
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
dohertj2 merged commit 067ad78e06 into v2 2026-04-18 06:57:58 -04:00
Sign in to join this conversation.
No Reviewers
No Label
1 Participants
Notifications
Due Date
No due date set.
Dependencies

No dependencies set.

Reference: dohertj2/lmxopcua#5