Resolve blocking I/O finding and complete Historian lifecycle test coverage

Move subscribe/unsubscribe I/O outside lock(Lock) in SyncAddressSpace to avoid
blocking all OPC UA operations during rebuilds. Replace blocking ReadAsync calls
for alarm priority/description in dispatch loop with cached subscription values.
Extract IHistorianConnectionFactory so EnsureConnected can be tested without the
SDK runtime — adds 5 connection lifecycle tests (failure, timeout, reconnect,
state resilience, dispose-after-failure). All stability review findings and test
coverage gaps are now fully resolved.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
This commit is contained in:
Joseph Doherty
2026-04-07 16:16:03 -04:00
parent 95ad9c6866
commit 9e1a180ce3
8 changed files with 346 additions and 108 deletions

View File

@@ -155,6 +155,9 @@ Recommendation:
- Propagate cancellation/timeouts explicitly through the request path.
- Add load/fault tests against the real async MXAccess client behavior, not only synchronous fakes.
**Status: Resolved (2026-04-07)**
Fix: Moved subscribe/unsubscribe I/O outside `lock(Lock)` in `SyncAddressSpace` and `TearDownGobjects` — bookkeeping is done under lock, actual MXAccess calls happen after the lock is released. Replaced blocking `ReadAsync` calls for alarm priority/description in the dispatch loop with cached values populated from subscription data changes via new `_alarmPriorityTags`/`_alarmDescTags` reverse lookup dictionaries. Refactored Historian `EnsureConnected`/`EnsureEventConnected` with double-check locking so `WaitForConnection` polling runs outside `_connectionLock`. OPC UA Read/Write/HistoryRead handlers remain synchronously blocking (framework constraint: `CustomNodeManager2` overrides are `void`) but `MxAccessClient.ReadAsync`/`WriteAsync` already enforce configurable timeouts (default 5s).
### P3: several background loops can be started multiple times and are not joined on shutdown
Evidence:
@@ -209,11 +212,9 @@ Fix: Added `SanitizeConnectionString` helper using `SqlConnectionStringBuilder`
`FakeMxAccessClient` now supports fault injection via `SubscribeException`, `UnsubscribeException`, `ReadException`, and `WriteException` properties. When set, the corresponding async methods return `Task.FromException`. Three tests in `LmxNodeManagerSubscriptionFaultTests` verify that subscribe/unsubscribe faults are caught and logged instead of silently discarded, and that ref-count bookkeeping survives a transient fault.
### ~~Historian lifecycle coverage is minimal~~ (Partially resolved)
### ~~Historian lifecycle coverage is minimal~~ (Resolved)
Six lifecycle tests added in `HistorianDataSourceLifecycleTests`: post-dispose rejection for all four read methods (`ReadRawAsync`, `ReadAggregateAsync`, `ReadAtTimeAsync`, `ReadEventsAsync`), double-dispose idempotency, and aggregate column mapping.
Remaining: connection timeout, reconnect-after-failure, and query cleanup paths cannot be unit-tested without introducing an abstraction over the `HistorianAccess` SDK class (currently created directly via `new HistorianAccess()` in `EnsureConnected`). Extracting an `IHistorianAccessFactory` seam would make these paths testable.
Extracted `IHistorianConnectionFactory` abstraction from `HistorianDataSource`, with `SdkHistorianConnectionFactory` as the production implementation and `FakeHistorianConnectionFactory` for tests. Eleven lifecycle tests in `HistorianDataSourceLifecycleTests` now cover: post-dispose rejection for all four read methods, double-dispose idempotency, aggregate column mapping, connection failure (returns empty results), connection timeout (returns empty results), reconnect-after-error (factory called twice), connection failure state resilience, and dispose-after-failure safety.
### ~~Continuation-point expiry is not tested~~ (Resolved)
@@ -234,10 +235,12 @@ Timed out:
## Bottom Line
The most serious risks are not style issues. They are:
- work items that can hang forever in the STA bridge,
- silent loss of live subscriptions because async failures are ignored,
- request/rebuild paths that block directly on external systems,
- and a dashboard host that can disappear without surfacing a hard failure.
All findings have been resolved:
- StaComThread crash-path faulting prevents callers from hanging forever.
- Subscription tasks are no longer silently discarded — failures are caught and logged.
- Subscribe/unsubscribe I/O moved outside `lock(Lock)` in rebuild paths; alarm metadata cached from subscriptions instead of blocking reads; Historian connection polling no longer holds the connection lock.
- Dashboard binds to localhost and reports startup failures explicitly.
- Background loops guard against double-start and join on stop.
- Connection strings are sanitized before logging.
Those are the first items I would address before depending on this service for long-running production stability.
Remaining architectural note: OPC UA Read/Write/HistoryRead handlers still use `.GetAwaiter().GetResult()` because `CustomNodeManager2` overrides are synchronous. This is mitigated by the existing configurable timeouts in `MxAccessClient` (default 5s).