lmxopcua/docs/stability-review-20260407.md
Commit 95ad9c6866 (Joseph Doherty): Resolve 6 of 7 stability review findings and close test coverage gaps
Fixes P1 StaComThread hang (crash-path faulting via WorkItem queue), P1 subscription
fire-and-forget (block+log or ContinueWith on 5 call sites), P2 continuation point
leak (PurgeExpired on Retrieve/Release), P2 dashboard bind failure (localhost prefix,
bool Start), P3 background loop double-start (task handles + join on stop in 3 files),
and P3 config logging exposure (SqlConnectionStringBuilder password masking). Adds
FakeMxAccessClient fault injection and 12 new tests. Documents required runtime
assemblies in ServiceHosting.md.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-07 15:37:27 -04:00


Stability Review

Date: 2026-04-07

Scope:

  • Service startup/shutdown lifecycle
  • MXAccess threading and reconnect behavior
  • OPC UA node manager request paths
  • Historian history-read paths
  • Status dashboard hosting
  • Test coverage around the above

Findings

P1: StaComThread can leave callers blocked forever after pump failure or shutdown races

Evidence:

  • src/ZB.MOM.WW.LmxOpcUa.Host/MxAccess/StaComThread.cs:106
  • src/ZB.MOM.WW.LmxOpcUa.Host/MxAccess/StaComThread.cs:132
  • src/ZB.MOM.WW.LmxOpcUa.Host/MxAccess/StaComThread.cs:178

Details:

  • RunAsync enqueues work and calls PostThreadMessage(...), but it ignores the returned bool.
  • If the STA thread has already exited, or if posting fails for any other reason, the queued TaskCompletionSource is never completed or faulted.
  • ThreadEntry logs pump crashes, but it does not drain/fault queued work, does not reset _nativeThreadId, and does not prevent later calls from queueing more work unless Dispose() happened first.

Note:

  • Clean shutdown via Dispose() posts WM_APP+1, which calls DrainQueue() before PostQuitMessage, so queued work is drained on the normal shutdown path. The gap is crash and unexpected-exit paths only.
  • ThreadEntry catches crashes and calls _ready.TrySetException(ex), but _ready is only awaited during Start(). A crash after startup completes does not fault any pending or future callers.

Impact:

  • Any caller waiting synchronously on these tasks can hang indefinitely after a pump crash (not a clean shutdown).
  • This is especially dangerous because higher layers regularly use .GetAwaiter().GetResult() during connect, disconnect, rebuild, and request processing.

Recommendation:

  • Check the return value of PostThreadMessage.
  • If post fails, remove/fault the queued work item immediately.
  • Mark the worker unusable when the pump exits unexpectedly and fault all remaining queued items.
  • Add a shutdown/crash-path test that verifies queued callers fail fast instead of hanging.

Status: Resolved (2026-04-07) Fix: Refactored the queue to a WorkItem type with separate Execute/Fault actions. Added a _pumpExited flag set in the ThreadEntry finally block. DrainAndFaultQueue() faults all pending TCS instances without executing user actions. RunAsync checks _pumpExited before enqueueing, and the PostThreadMessage return value is now checked: a false return triggers drain-and-fault. Added a crash-path test via PostQuitMessage.
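
The shape of the fix can be sketched as follows. This is a simplified stand-in, not the actual StaComThread code: the names WorkItem, DrainAndFaultQueue, and _pumpExited follow the fix description, while the Win32 message pump is reduced to a postSucceeded flag for illustration.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

// WorkItem carries both an Execute action (normal path) and a Fault action
// (crash path), so every queued caller is completed one way or the other.
sealed record WorkItem(Action Execute, Action<Exception> Fault);

sealed class StaWorkQueue
{
    private readonly ConcurrentQueue<WorkItem> _queue = new();
    private volatile bool _pumpExited;

    // Stand-in for RunAsync: returns a task that is guaranteed to complete.
    public Task<T> RunAsync<T>(Func<T> func, bool postSucceeded = true)
    {
        var tcs = new TaskCompletionSource<T>(TaskCreationOptions.RunContinuationsAsynchronously);
        if (_pumpExited)
        {
            tcs.SetException(new InvalidOperationException("STA pump has exited."));
            return tcs.Task;
        }

        _queue.Enqueue(new WorkItem(
            Execute: () => { try { tcs.SetResult(func()); } catch (Exception ex) { tcs.SetException(ex); } },
            Fault: ex => tcs.TrySetException(ex)));

        // The real code posts WM_APP to the pump thread; if PostThreadMessage
        // returns false, the queued item must be faulted immediately.
        if (!postSucceeded)
            DrainAndFaultQueue(new InvalidOperationException("PostThreadMessage failed."));

        return tcs.Task;
    }

    // Called from the pump thread's finally block on unexpected exit.
    public void DrainAndFaultQueue(Exception reason)
    {
        _pumpExited = true;
        while (_queue.TryDequeue(out var item))
            item.Fault(reason);
    }
}
```

A caller blocked in .GetAwaiter().GetResult() on the returned task then observes the fault promptly instead of hanging forever.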

P1: LmxNodeManager discards subscription tasks, so failures can be silent

Evidence:

  • src/ZB.MOM.WW.LmxOpcUa.Host/OpcUa/LmxNodeManager.cs:396
  • src/ZB.MOM.WW.LmxOpcUa.Host/OpcUa/LmxNodeManager.cs:1906
  • src/ZB.MOM.WW.LmxOpcUa.Host/OpcUa/LmxNodeManager.cs:1934
  • src/ZB.MOM.WW.LmxOpcUa.Host/OpcUa/LmxNodeManager.cs:1960
  • tests/ZB.MOM.WW.LmxOpcUa.Tests/Helpers/FakeMxAccessClient.cs:83

Details:

  • Several subscription and unsubscription calls are fire-and-forget.
  • SubscribeAlarmTags() even wraps SubscribeAsync(...) in try/catch, but because the returned task is not awaited, asynchronous failures bypass that catch.
  • The test suite mostly uses FakeMxAccessClient, whose subscribe/unsubscribe methods complete immediately, so these failure paths are not exercised.

Impact:

  • A failed runtime subscribe/unsubscribe can silently leave monitored OPC UA items stale or orphaned.
  • The service can appear healthy while live updates quietly stop flowing for part of the address space.

Note:

  • LmxNodeManager also runs a separate _dataChangeDispatchThread that batches MXAccess callbacks into OPC UA value updates. Subscription failures upstream mean this thread will simply never receive data for the affected tags, with no indication of the gap. Failures should be cross-referenced with dispatch-thread health to surface silent data loss.

Recommendation:

  • Stop discarding these tasks.
  • If the boundary must remain synchronous, centralize the wait and log/fail deterministically.
  • Add tests that inject asynchronously failing subscribe/unsubscribe operations.

Status: Resolved (2026-04-07) Fix: SubscribeTag and UnsubscribeTag (critical monitored-item paths) now use .GetAwaiter().GetResult() with try/catch logging. SubscribeAlarmTags, BuildSubtree alarm subscribes, and RestoreTransferredSubscriptions (batch paths) now use .ContinueWith(OnlyOnFaulted) to log failures instead of silently discarding tasks.
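
The two patterns the fix describes can be sketched generically. The method names below are illustrative, not the actual LmxNodeManager signatures; the point is the split between a blocking critical path and a logged fire-and-forget batch path.

```csharp
using System;
using System.Threading.Tasks;

static class SubscriptionCallSites
{
    // Critical monitored-item path (pattern used by SubscribeTag/UnsubscribeTag
    // per the fix): block on the result so failures surface at the call site.
    public static bool SubscribeBlocking(Func<Task> subscribeAsync, Action<string> log)
    {
        try
        {
            subscribeAsync().GetAwaiter().GetResult();
            return true;
        }
        catch (Exception ex)
        {
            log($"Subscribe failed: {ex.Message}");
            return false;
        }
    }

    // Batch path: remain fire-and-forget, but attach a fault-only continuation
    // so asynchronous failures are logged instead of silently discarded.
    // Reading t.Exception also marks the exception as observed.
    public static void SubscribeLogged(Func<Task> subscribeAsync, Action<string> log)
    {
        subscribeAsync().ContinueWith(
            t => log($"Subscribe failed: {t.Exception!.GetBaseException().Message}"),
            TaskContinuationOptions.OnlyOnFaulted | TaskContinuationOptions.ExecuteSynchronously);
    }
}
```

Note that a plain try/catch around the fire-and-forget call (as in the original SubscribeAlarmTags) cannot catch asynchronous faults; only awaiting or a continuation can.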

P2: history continuation points can leak memory after expiry

Evidence:

  • src/ZB.MOM.WW.LmxOpcUa.Host/Historian/HistoryContinuationPoint.cs:23
  • src/ZB.MOM.WW.LmxOpcUa.Host/Historian/HistoryContinuationPoint.cs:66
  • tests/ZB.MOM.WW.LmxOpcUa.Tests/Historian/HistoryContinuationPointTests.cs:25

Details:

  • Expired continuation points are purged only from Store().
  • If a client requests continuation points and then never resumes or releases them, the stored List<DataValue> instances remain in memory until another Store() happens.
  • Existing tests cover store/retrieve/release but do not cover expiration or reclamation.

Impact:

  • A burst of abandoned history reads can retain large result sets in memory until the next Store() call triggers PurgeExpired(). On an otherwise idle system with no new history reads, this retention is indefinite.

Recommendation:

  • Purge expired entries on Retrieve() and Release().
  • Consider a periodic sweep or a hard cap on stored continuation payloads.
  • Add an expiry-focused test.

Status: Resolved (2026-04-07) Fix: PurgeExpired() now called at the start of both Retrieve() and Release(). Added internal constructor accepting TimeSpan timeout for testability. Added two expiry-focused tests.
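
A minimal sketch of the purge-on-access pattern, with an injectable clock mirroring the testability constructor the fix mentions. The real store keys continuation points and holds List&lt;DataValue&gt; payloads; a string payload and Guid key stand in here.

```csharp
using System;
using System.Collections.Generic;

sealed class ContinuationPointStore
{
    private readonly Dictionary<Guid, (string Payload, DateTime Expires)> _entries = new();
    private readonly TimeSpan _timeout;
    private readonly Func<DateTime> _now;

    public ContinuationPointStore(TimeSpan timeout, Func<DateTime>? now = null)
    {
        _timeout = timeout;
        _now = now ?? (() => DateTime.UtcNow);
    }

    public int Count => _entries.Count;

    public Guid Store(string payload)
    {
        PurgeExpired();
        var id = Guid.NewGuid();
        _entries[id] = (payload, _now() + _timeout);
        return id;
    }

    public string? Retrieve(Guid id)
    {
        PurgeExpired(); // reclaim abandoned entries even on an otherwise idle store
        return _entries.Remove(id, out var entry) ? entry.Payload : null;
    }

    public void Release(Guid id)
    {
        PurgeExpired();
        _entries.Remove(id);
    }

    private void PurgeExpired()
    {
        var now = _now();
        var expired = new List<Guid>();
        foreach (var kv in _entries)
            if (kv.Value.Expires <= now) expired.Add(kv.Key);
        foreach (var id in expired) _entries.Remove(id);
    }
}
```

With purging on all three entry points, an abandoned continuation point is reclaimed by the next Retrieve or Release of any other point, not only by the next Store.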

P2: the status dashboard can fail to bind and disable itself silently

Evidence:

  • src/ZB.MOM.WW.LmxOpcUa.Host/Status/StatusWebServer.cs:53
  • src/ZB.MOM.WW.LmxOpcUa.Host/Status/StatusWebServer.cs:64
  • tests/ZB.MOM.WW.LmxOpcUa.Tests/Status/StatusWebServerTests.cs:30
  • tests/ZB.MOM.WW.LmxOpcUa.Tests/Status/StatusWebServerTests.cs:149

Observed test result:

  • dotnet test tests\ZB.MOM.WW.LmxOpcUa.Tests\ZB.MOM.WW.LmxOpcUa.Tests.csproj --no-restore --filter StatusWebServerTests
  • Result: 9 failed, 0 passed
  • Failures were all consistent with the listener not starting (IsRunning == false, connection refused).

Details:

  • Start() swallows startup exceptions, logs a warning, and leaves _listener = null.
  • The code binds http://+:{port}/, which is more permission-sensitive than a narrower host-specific prefix.
  • Callers get no explicit failure signal, so the dashboard can simply vanish at runtime.

Impact:

  • Operators and external checks can assume a dashboard exists when it does not.
  • Health visibility degrades exactly when the service most needs diagnosability.

Note:

  • The http://+:{port}/ wildcard prefix requires either administrator privileges or a pre-configured URL ACL (netsh http add urlacl). This is also the likely cause of the 9/9 test failures — tests run without elevation will always fail to bind.

Recommendation:

  • Fail fast, or at least return an explicit startup status.
  • Default to http://localhost:{port}/ unless wildcard binding is explicitly configured — this avoids the ACL requirement for single-machine deployments and fixes the test suite without special privileges.
  • Add a startup test that asserts the service reports bind failures clearly.

Status: Resolved (2026-04-07) Fix: Changed prefix from http://+:{port}/ to http://localhost:{port}/. Start() now returns bool. Bind failure logged at Error level. Test suite now passes 9/9.
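
The resolved startup contract can be sketched as below. This is a simplified stand-in for StatusWebServer, assuming an Action&lt;string&gt; logger for illustration; the key points are the narrow localhost prefix (no URL ACL needed) and an explicit bool result instead of a swallowed exception.

```csharp
using System;
using System.Net;

sealed class StatusWebServerSketch
{
    private HttpListener? _listener;

    public bool IsRunning => _listener?.IsListening == true;

    public bool Start(int port, Action<string> logError)
    {
        var listener = new HttpListener();
        // http://localhost:{port}/ binds without elevation; the old
        // http://+:{port}/ wildcard required admin rights or a netsh urlacl.
        listener.Prefixes.Add($"http://localhost:{port}/");
        try
        {
            listener.Start();
        }
        catch (HttpListenerException ex)
        {
            logError($"Status dashboard failed to bind port {port}: {ex.Message}");
            return false;
        }
        _listener = listener;
        return true;
    }

    public void Stop()
    {
        _listener?.Stop();
        _listener = null;
    }
}
```

Callers that need the wildcard binding for remote access can still opt in via configuration, accepting the ACL prerequisite explicitly.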

P2: blocking remote I/O is performed directly in request and rebuild paths

Evidence:

  • src/ZB.MOM.WW.LmxOpcUa.Host/OpcUa/LmxNodeManager.cs:586
  • src/ZB.MOM.WW.LmxOpcUa.Host/OpcUa/LmxNodeManager.cs:617
  • src/ZB.MOM.WW.LmxOpcUa.Host/OpcUa/LmxNodeManager.cs:641
  • src/ZB.MOM.WW.LmxOpcUa.Host/OpcUa/LmxNodeManager.cs:1228
  • src/ZB.MOM.WW.LmxOpcUa.Host/OpcUa/LmxNodeManager.cs:1289
  • src/ZB.MOM.WW.LmxOpcUa.Host/OpcUa/LmxNodeManager.cs:1386
  • src/ZB.MOM.WW.LmxOpcUa.Host/OpcUa/LmxNodeManager.cs:1526
  • src/ZB.MOM.WW.LmxOpcUa.Host/OpcUa/LmxNodeManager.cs:1601
  • src/ZB.MOM.WW.LmxOpcUa.Host/OpcUa/LmxNodeManager.cs:1655
  • src/ZB.MOM.WW.LmxOpcUa.Host/OpcUa/LmxNodeManager.cs:1718
  • src/ZB.MOM.WW.LmxOpcUa.Host/Historian/HistorianDataSource.cs:175

Details:

  • OPC UA read/write/history handlers synchronously block on MXAccess and Historian calls.
  • Incremental sync also performs blocking subscribe/unsubscribe operations while holding the node-manager lock.
  • Historian connection establishment uses polling plus Thread.Sleep(250), so slow connects directly occupy request threads.

Impact:

  • Slow runtime dependencies can starve OPC UA worker threads and make rebuilds stall the namespace lock.
  • This is not just a latency issue; it turns transient backend slowness into whole-service responsiveness problems.

Recommendation:

  • Move I/O out of locked sections.
  • Propagate cancellation/timeouts explicitly through the request path.
  • Add load/fault tests against the real async MXAccess client behavior, not only synchronous fakes.
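
The lock-hygiene part of the recommendation can be sketched as a three-phase pattern: diff under the lock, perform remote I/O with the lock released, re-acquire only to apply results. Everything here is illustrative (the real incremental sync operates on node state, not a string set); the timeout token shows one way to propagate cancellation explicitly.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

static class IncrementalSyncSketch
{
    public static async Task SyncAsync(
        object nodeLock,
        HashSet<string> subscribed,
        IEnumerable<string> desired,
        Func<string, CancellationToken, Task> subscribeAsync,
        TimeSpan perCallTimeout)
    {
        List<string> toAdd;
        lock (nodeLock)                      // 1. short critical section: diff only
        {
            toAdd = new List<string>();
            foreach (var tag in desired)
                if (!subscribed.Contains(tag)) toAdd.Add(tag);
        }

        var succeeded = new List<string>();
        foreach (var tag in toAdd)           // 2. remote I/O with no lock held
        {
            using var cts = new CancellationTokenSource(perCallTimeout);
            try { await subscribeAsync(tag, cts.Token); succeeded.Add(tag); }
            catch (OperationCanceledException) { /* timed out; retry on next sync */ }
        }

        lock (nodeLock)                      // 3. short critical section: apply
        {
            foreach (var tag in succeeded) subscribed.Add(tag);
        }
    }
}
```

A slow or hung backend call then delays only its own sync pass rather than blocking every reader of the node-manager lock, and the Thread.Sleep(250) polling loop in the Historian connect path could similarly become Task.Delay with a deadline token.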

P3: several background loops can be started multiple times and are not joined on shutdown

Evidence:

  • src/ZB.MOM.WW.LmxOpcUa.Host/GalaxyRepository/ChangeDetectionService.cs:58
  • src/ZB.MOM.WW.LmxOpcUa.Host/MxAccess/MxAccessClient.Monitor.cs:15
  • src/ZB.MOM.WW.LmxOpcUa.Host/Status/StatusWebServer.cs:57

Details:

  • Start() methods overwrite cancellation tokens and launch Task.Run(...) without keeping the returned Task.
  • Calling Start() twice leaks the earlier loop and its CTS.
  • Stop() only cancels; it does not wait for loop completion.

Impact:

  • Duplicate starts or restart paths become nondeterministic.
  • Shutdown can race active loops that are still touching shared state.

Recommendation:

  • Guard against duplicate starts.
  • Keep the background task handle and wait for orderly exit during stop/dispose.

Status: Resolved (2026-04-07) Fix: All three services (ChangeDetectionService, MxAccessClient.Monitor, StatusWebServer) now store the Task returned by Task.Run. Start() cancels+joins any previous loop before launching a new one. Stop() cancels the token and waits on the task with a 5-second timeout.
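
The start/stop contract described in the fix can be sketched generically. BackgroundLoop is a hypothetical helper, not one of the three services; it shows keeping the Task from Task.Run, tearing down any previous loop before a restart, and joining with a bounded wait on stop.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

sealed class BackgroundLoop
{
    private CancellationTokenSource? _cts;
    private Task? _loop;

    public void Start(Func<CancellationToken, Task> body)
    {
        Stop(); // cancel + join any previous loop before launching a new one
        _cts = new CancellationTokenSource();
        var token = _cts.Token;
        _loop = Task.Run(() => body(token)); // keep the handle; never discard it
    }

    public void Stop(TimeSpan? joinTimeout = null)
    {
        _cts?.Cancel();
        try
        {
            // Bounded join (mirroring the 5-second timeout mentioned in the
            // fix) so shutdown cannot race a loop still touching shared state.
            _loop?.Wait(joinTimeout ?? TimeSpan.FromSeconds(5));
        }
        catch (AggregateException) { /* loop observed cancellation */ }
        _cts?.Dispose();
        _cts = null;
        _loop = null;
    }
}
```

Because Start calls Stop first, calling Start twice no longer leaks the earlier loop or its CancellationTokenSource.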

P3: startup logging exposes sensitive configuration

Evidence:

  • src/ZB.MOM.WW.LmxOpcUa.Host/Configuration/ConfigurationValidator.cs:71
  • src/ZB.MOM.WW.LmxOpcUa.Host/Configuration/ConfigurationValidator.cs:118

Details:

  • The validator logs the full Galaxy repository connection string and detailed authentication-related settings.
  • In many deployments, the connection string will contain credentials.

Impact:

  • Credential exposure in logs increases operational risk and complicates incident handling.

Details on scope:

  • The primary exposure is GalaxyRepository.ConnectionString logged verbatim at ConfigurationValidator.cs:72. When using SQL authentication, this contains the password in the connection string.
  • Historian credentials (UserName/Password) are checked for emptiness but not logged as values — this section is safe.
  • LDAP ServiceAccountDn is checked for emptiness but not logged as a value — also safe.

Recommendation:

  • Redact secrets before logging. Parse the connection string and mask or omit password segments.
  • Log connection targets (server, database) and non-sensitive settings only.

Status: Resolved (2026-04-07) Fix: Added SanitizeConnectionString helper using SqlConnectionStringBuilder to mask passwords with ********. Falls back to (unparseable) if the string can't be parsed.
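
A sketch of the helper's shape. The fix uses SqlConnectionStringBuilder (from the SqlClient package, which also handles SQL-specific keyword aliases such as "pwd"); the in-box DbConnectionStringBuilder shown here follows the same pattern and keeps the example self-contained.

```csharp
using System;
using System.Data.Common;

static class ConnectionStringSanitizer
{
    // Mask credential keywords while preserving the non-sensitive parts
    // (server, database) that are useful in startup logs.
    public static string Sanitize(string connectionString)
    {
        try
        {
            var builder = new DbConnectionStringBuilder { ConnectionString = connectionString };
            foreach (var key in new[] { "Password", "Pwd" })
                if (builder.ContainsKey(key))
                    builder[key] = "********";
            return builder.ConnectionString;
        }
        catch (ArgumentException)
        {
            // Matches the fix's fallback: never echo a string we cannot parse.
            return "(unparseable)";
        }
    }
}
```

Logging Sanitize(connectionString) preserves diagnosability (wrong server or database names remain visible) without writing credentials to the log.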

Test Coverage Gaps

Real async failure modes are under-tested (Resolved)

FakeMxAccessClient now supports fault injection via SubscribeException, UnsubscribeException, ReadException, and WriteException properties. When set, the corresponding async methods return Task.FromException. Three tests in LmxNodeManagerSubscriptionFaultTests verify that subscribe/unsubscribe faults are caught and logged instead of silently discarded, and that ref-count bookkeeping survives a transient fault.

Historian lifecycle coverage is minimal (Partially resolved)

Six lifecycle tests added in HistorianDataSourceLifecycleTests: post-dispose rejection for all four read methods (ReadRawAsync, ReadAggregateAsync, ReadAtTimeAsync, ReadEventsAsync), double-dispose idempotency, and aggregate column mapping.

Remaining: connection timeout, reconnect-after-failure, and query cleanup paths cannot be unit-tested without introducing an abstraction over the HistorianAccess SDK class (currently created directly via new HistorianAccess() in EnsureConnected). Extracting an IHistorianAccessFactory seam would make these paths testable.

Continuation-point expiry is not tested (Resolved)

Two expiry tests added: Retrieve_ExpiredContinuationPoint_ReturnsNull and Release_PurgesExpiredEntries.

Commands Run

Successful:

  • dotnet test tests\ZB.MOM.WW.LmxOpcUa.Tests\ZB.MOM.WW.LmxOpcUa.Tests.csproj --no-restore --filter HistoryContinuationPointTests
  • dotnet test tests\ZB.MOM.WW.LmxOpcUa.Tests\ZB.MOM.WW.LmxOpcUa.Tests.csproj --no-restore --filter ChangeDetectionServiceTests
  • dotnet test tests\ZB.MOM.WW.LmxOpcUa.Tests\ZB.MOM.WW.LmxOpcUa.Tests.csproj --no-restore --filter StaComThreadTests

Failed:

  • dotnet test tests\ZB.MOM.WW.LmxOpcUa.Tests\ZB.MOM.WW.LmxOpcUa.Tests.csproj --no-restore --filter StatusWebServerTests

Timed out:

  • dotnet test tests\ZB.MOM.WW.LmxOpcUa.Tests\ZB.MOM.WW.LmxOpcUa.Tests.csproj --no-restore

Bottom Line

The most serious risks are not style issues. They are:

  • work items that can hang forever in the STA bridge,
  • silent loss of live subscriptions because async failures are ignored,
  • request/rebuild paths that block directly on external systems,
  • and a dashboard host that can disappear without surfacing a hard failure.

Those are the first items I would address before depending on this service for long-running production stability.