Stability Review
Date: 2026-04-07
Scope:
- Service startup/shutdown lifecycle
- MXAccess threading and reconnect behavior
- OPC UA node manager request paths
- Historian history-read paths
- Status dashboard hosting
- Test coverage around the above
Findings
P1: StaComThread can leave callers blocked forever after pump failure or shutdown races
Evidence:
- src/ZB.MOM.WW.LmxOpcUa.Host/MxAccess/StaComThread.cs:106
- src/ZB.MOM.WW.LmxOpcUa.Host/MxAccess/StaComThread.cs:132
- src/ZB.MOM.WW.LmxOpcUa.Host/MxAccess/StaComThread.cs:178
Details:
- RunAsync enqueues work and calls PostThreadMessage(...), but it ignores the returned bool.
- If the STA thread has already exited, or if posting fails for any other reason, the queued TaskCompletionSource is never completed or faulted.
- ThreadEntry logs pump crashes, but it does not drain/fault queued work, does not reset _nativeThreadId, and does not prevent later calls from queueing more work unless Dispose() happened first.
Note:
- Clean shutdown via Dispose() posts WM_APP+1, which calls DrainQueue() before PostQuitMessage, so queued work is drained on the normal shutdown path. The gap is crash and unexpected-exit paths only.
- ThreadEntry catches crashes and calls _ready.TrySetException(ex), but _ready is only awaited during Start(). A crash after startup completes does not fault any pending or future callers.
Impact:
- Any caller waiting synchronously on these tasks can hang indefinitely after a pump crash (not a clean shutdown).
- This is especially dangerous because higher layers regularly use .GetAwaiter().GetResult() during connect, disconnect, rebuild, and request processing.
Recommendation:
- Check the return value of PostThreadMessage.
- If post fails, remove/fault the queued work item immediately.
- Mark the worker unusable when the pump exits unexpectedly and fault all remaining queued items.
- Add a shutdown/crash-path test that verifies queued callers fail fast instead of hanging.
Status: Resolved (2026-04-07)
Fix: Refactored queue to WorkItem type with separate Execute/Fault actions. Added _pumpExited flag set in ThreadEntry finally block. DrainAndFaultQueue() faults all pending TCS instances without executing user actions. RunAsync checks _pumpExited before enqueueing. PostThreadMessage return value is checked — false triggers drain-and-fault. Added crash-path test via PostQuitMessage.
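The fail-fast behavior described in the fix can be sketched roughly as follows. This is a simplified illustration, not the project's code: FaultingWorkQueue, PumpOnce, MarkPumpExited, and the postWakeUp delegate (a stand-in for the PostThreadMessage call) are hypothetical names, and the Win32 message pump is left out entirely.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

// Hypothetical, simplified stand-in for the StaComThread queue (no Win32 calls).
public sealed class FaultingWorkQueue
{
    private sealed class WorkItem
    {
        public Action Execute;          // runs the user action and completes the TCS
        public Action<Exception> Fault; // faults the TCS without running the user action
    }

    private readonly ConcurrentQueue<WorkItem> _queue = new ConcurrentQueue<WorkItem>();
    private volatile bool _pumpExited;

    // postWakeUp stands in for PostThreadMessage: false means the pump could not be signalled.
    public Task<T> RunAsync<T>(Func<T> func, Func<bool> postWakeUp)
    {
        var tcs = new TaskCompletionSource<T>(TaskCreationOptions.RunContinuationsAsynchronously);
        if (_pumpExited)
        {
            tcs.SetException(new InvalidOperationException("Pump has exited."));
            return tcs.Task;
        }

        _queue.Enqueue(new WorkItem
        {
            Execute = () =>
            {
                try { tcs.TrySetResult(func()); }
                catch (Exception ex) { tcs.TrySetException(ex); }
            },
            Fault = ex => tcs.TrySetException(ex)
        });

        // Fail fast if the wake-up could not be posted, or if the pump exited while enqueueing.
        if (!postWakeUp() || _pumpExited)
            DrainAndFaultQueue(new InvalidOperationException("Pump is not accepting work."));

        return tcs.Task;
    }

    // Would be called on the pump thread for each wake-up message.
    public void PumpOnce()
    {
        WorkItem item;
        while (_queue.TryDequeue(out item)) item.Execute();
    }

    // Would be called from the pump's finally block on any exit, clean or not.
    public void MarkPumpExited()
    {
        _pumpExited = true;
        DrainAndFaultQueue(new InvalidOperationException("Pump has exited."));
    }

    private void DrainAndFaultQueue(Exception reason)
    {
        WorkItem item;
        while (_queue.TryDequeue(out item)) item.Fault(reason);
    }
}
```

Keeping Execute and Fault as separate actions lets the crash path complete every pending task without running user code on a thread that is no longer a valid STA.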
P1: LmxNodeManager discards subscription tasks, so failures can be silent
Evidence:
- src/ZB.MOM.WW.LmxOpcUa.Host/OpcUa/LmxNodeManager.cs:396
- src/ZB.MOM.WW.LmxOpcUa.Host/OpcUa/LmxNodeManager.cs:1906
- src/ZB.MOM.WW.LmxOpcUa.Host/OpcUa/LmxNodeManager.cs:1934
- src/ZB.MOM.WW.LmxOpcUa.Host/OpcUa/LmxNodeManager.cs:1960
- tests/ZB.MOM.WW.LmxOpcUa.Tests/Helpers/FakeMxAccessClient.cs:83
Details:
- Several subscription and unsubscription calls are fire-and-forget.
- SubscribeAlarmTags() even wraps SubscribeAsync(...) in try/catch, but because the returned task is not awaited, asynchronous failures bypass that catch.
- The test suite mostly uses FakeMxAccessClient, whose subscribe/unsubscribe methods complete immediately, so these failure paths are not exercised.
Impact:
- A failed runtime subscribe/unsubscribe can silently leave monitored OPC UA items stale or orphaned.
- The service can appear healthy while live updates quietly stop flowing for part of the address space.
Note:
- LmxNodeManager also runs a separate _dataChangeDispatchThread that batches MXAccess callbacks into OPC UA value updates. Subscription failures upstream mean this thread will simply never receive data for the affected tags, with no indication of the gap. Failures should be cross-referenced with dispatch-thread health to surface silent data loss.
Recommendation:
- Stop discarding these tasks.
- If the boundary must remain synchronous, centralize the wait and log/fail deterministically.
- Add tests that inject asynchronously failing subscribe/unsubscribe operations.
Status: Resolved (2026-04-07)
Fix: SubscribeTag and UnsubscribeTag (critical monitored-item paths) now use .GetAwaiter().GetResult() with try/catch logging. SubscribeAlarmTags, BuildSubtree alarm subscribes, and RestoreTransferredSubscriptions (batch paths) now use .ContinueWith(OnlyOnFaulted) to log failures instead of silently discarding tasks.
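For the batch paths, the ContinueWith(OnlyOnFaulted) pattern might look roughly like the sketch below. The LogOnFault extension method and the log delegate are hypothetical names used for illustration; the project's logging API is not shown in this review.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

public static class TaskObservation
{
    // Hypothetical helper: observes a fire-and-forget task and reports a fault through `log`.
    // The continuation runs only when the antecedent faults; on success it is cancelled unrun.
    public static Task LogOnFault(this Task task, string operation, Action<string> log)
    {
        return task.ContinueWith(
            t => log(operation + " failed: " + t.Exception.GetBaseException().Message),
            TaskContinuationOptions.OnlyOnFaulted);
    }
}
```

A call site would then read something like client.SubscribeAsync(tag, callback).LogOnFault("Subscribe " + tag, Log.Warning), where the client and logger names are again assumptions. Note that the task returned by ContinueWith ends up in the Canceled state when the antecedent succeeds, so it should not itself be awaited.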
P2: history continuation points can leak memory after expiry
Evidence:
- src/ZB.MOM.WW.LmxOpcUa.Host/Historian/HistoryContinuationPoint.cs:23
- src/ZB.MOM.WW.LmxOpcUa.Host/Historian/HistoryContinuationPoint.cs:66
- tests/ZB.MOM.WW.LmxOpcUa.Tests/Historian/HistoryContinuationPointTests.cs:25
Details:
- Expired continuation points are purged only from Store().
- If a client requests continuation points and then never resumes or releases them, the stored List<DataValue> instances remain in memory until another Store() happens.
- Existing tests cover store/retrieve/release but do not cover expiration or reclamation.
Impact:
- A burst of abandoned history reads can retain large result sets in memory until the next Store() call triggers PurgeExpired(). On an otherwise idle system with no new history reads, this retention is indefinite.
Recommendation:
- Purge expired entries on Retrieve() and Release().
- Consider a periodic sweep or a hard cap on stored continuation payloads.
- Add an expiry-focused test.
Status: Resolved (2026-04-07)
Fix: PurgeExpired() now called at the start of both Retrieve() and Release(). Added internal constructor accepting TimeSpan timeout for testability. Added two expiry-focused tests.
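A minimal sketch of purging on every access path, assuming a generic store keyed by Guid. ExpiringStore, the injected utcNow clock, and the one-shot Retrieve semantics are assumptions made for illustration; the actual HistoryContinuationPoint class may differ.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical store: every public operation purges expired entries first.
public sealed class ExpiringStore<T> where T : class
{
    private sealed class Entry { public T Value; public DateTime ExpiresUtc; }

    private readonly object _gate = new object();
    private readonly Dictionary<Guid, Entry> _entries = new Dictionary<Guid, Entry>();
    private readonly TimeSpan _timeout;
    private readonly Func<DateTime> _utcNow; // injected clock so expiry can be tested without sleeping

    public ExpiringStore(TimeSpan timeout, Func<DateTime> utcNow)
    {
        _timeout = timeout;
        _utcNow = utcNow;
    }

    public int Count { get { lock (_gate) return _entries.Count; } }

    public Guid Store(T value)
    {
        lock (_gate)
        {
            PurgeExpired();
            var id = Guid.NewGuid();
            _entries[id] = new Entry { Value = value, ExpiresUtc = _utcNow() + _timeout };
            return id;
        }
    }

    // Returns null for unknown or expired ids; a retrieved entry is removed (assumed one-shot).
    public T Retrieve(Guid id)
    {
        lock (_gate)
        {
            PurgeExpired();
            Entry entry;
            if (!_entries.TryGetValue(id, out entry)) return null;
            _entries.Remove(id);
            return entry.Value;
        }
    }

    public void Release(Guid id)
    {
        lock (_gate)
        {
            PurgeExpired();
            _entries.Remove(id);
        }
    }

    // Caller must hold _gate.
    private void PurgeExpired()
    {
        var now = _utcNow();
        var expired = new List<Guid>();
        foreach (var pair in _entries)
            if (pair.Value.ExpiresUtc <= now) expired.Add(pair.Key);
        foreach (var id in expired) _entries.Remove(id);
    }
}
```

This still reclaims memory only when some call arrives, so a fully idle system would keep expired entries until the next access; the periodic sweep or hard cap from the recommendation would cover that case.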
P2: the status dashboard can fail to bind and disable itself silently
Evidence:
- src/ZB.MOM.WW.LmxOpcUa.Host/Status/StatusWebServer.cs:53
- src/ZB.MOM.WW.LmxOpcUa.Host/Status/StatusWebServer.cs:64
- tests/ZB.MOM.WW.LmxOpcUa.Tests/Status/StatusWebServerTests.cs:30
- tests/ZB.MOM.WW.LmxOpcUa.Tests/Status/StatusWebServerTests.cs:149
Observed test result:
- dotnet test tests\ZB.MOM.WW.LmxOpcUa.Tests\ZB.MOM.WW.LmxOpcUa.Tests.csproj --no-restore --filter StatusWebServerTests
- Result: 9 failed, 0 passed
- Failures were all consistent with the listener not starting (IsRunning == false, connection refused).
Details:
- Start() swallows startup exceptions, logs a warning, and leaves _listener = null.
- The code binds http://+:{port}/, which is more permission-sensitive than a narrower host-specific prefix.
- Callers get no explicit failure signal, so the dashboard can simply vanish at runtime.
Impact:
- Operators and external checks can assume a dashboard exists when it does not.
- Health visibility degrades exactly when the service most needs diagnosability.
Note:
- The http://+:{port}/ wildcard prefix requires either administrator privileges or a pre-configured URL ACL (netsh http add urlacl). This is also the likely cause of the 9/9 test failures: tests run without elevation will always fail to bind.
Recommendation:
- Fail fast, or at least return an explicit startup status.
- Default to http://localhost:{port}/ unless wildcard binding is explicitly configured. This avoids the ACL requirement for single-machine deployments and fixes the test suite without special privileges.
- Add a startup test that asserts the service reports bind failures clearly.
Status: Resolved (2026-04-07)
Fix: Changed prefix from http://+:{port}/ to http://localhost:{port}/. Start() now returns bool. Bind failure logged at Error level. Test suite now passes 9/9.
P2: blocking remote I/O is performed directly in request and rebuild paths
Evidence:
- src/ZB.MOM.WW.LmxOpcUa.Host/OpcUa/LmxNodeManager.cs:586
- src/ZB.MOM.WW.LmxOpcUa.Host/OpcUa/LmxNodeManager.cs:617
- src/ZB.MOM.WW.LmxOpcUa.Host/OpcUa/LmxNodeManager.cs:641
- src/ZB.MOM.WW.LmxOpcUa.Host/OpcUa/LmxNodeManager.cs:1228
- src/ZB.MOM.WW.LmxOpcUa.Host/OpcUa/LmxNodeManager.cs:1289
- src/ZB.MOM.WW.LmxOpcUa.Host/OpcUa/LmxNodeManager.cs:1386
- src/ZB.MOM.WW.LmxOpcUa.Host/OpcUa/LmxNodeManager.cs:1526
- src/ZB.MOM.WW.LmxOpcUa.Host/OpcUa/LmxNodeManager.cs:1601
- src/ZB.MOM.WW.LmxOpcUa.Host/OpcUa/LmxNodeManager.cs:1655
- src/ZB.MOM.WW.LmxOpcUa.Host/OpcUa/LmxNodeManager.cs:1718
- src/ZB.MOM.WW.LmxOpcUa.Host/Historian/HistorianDataSource.cs:175
Details:
- OPC UA read/write/history handlers synchronously block on MXAccess and Historian calls.
- Incremental sync also performs blocking subscribe/unsubscribe operations while holding the node-manager lock.
- Historian connection establishment uses polling plus Thread.Sleep(250), so slow connects directly occupy request threads.
Impact:
- Slow runtime dependencies can starve OPC UA worker threads and make rebuilds stall the namespace lock.
- This is not just a latency issue; it turns transient backend slowness into whole-service responsiveness problems.
Recommendation:
- Move I/O out of locked sections.
- Propagate cancellation/timeouts explicitly through the request path.
- Add load/fault tests against the real async MXAccess client behavior, not only synchronous fakes.
Status: Resolved (2026-04-07)
Fix: Moved subscribe/unsubscribe I/O outside lock(Lock) in SyncAddressSpace and TearDownGobjects — bookkeeping is done under lock, actual MXAccess calls happen after the lock is released. Replaced blocking ReadAsync calls for alarm priority/description in the dispatch loop with cached values populated from subscription data changes via new _alarmPriorityTags/_alarmDescTags reverse lookup dictionaries. Refactored Historian EnsureConnected/EnsureEventConnected with double-check locking so WaitForConnection polling runs outside _connectionLock. OPC UA Read/Write/HistoryRead handlers remain synchronously blocking (framework constraint: CustomNodeManager2 overrides are void) but MxAccessClient.ReadAsync/WriteAsync already enforce configurable timeouts (default 5s).
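The "bookkeeping under lock, I/O after release" split can be sketched as below. SubscriptionSync, its delegates, and the IsLockHeld probe are hypothetical and exist only to make the pattern visible; the real SyncAddressSpace carries far more state.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical illustration: compute the diff under the lock, call the backend outside it.
public sealed class SubscriptionSync
{
    private readonly object _lock = new object();
    private readonly HashSet<string> _subscribed = new HashSet<string>(StringComparer.OrdinalIgnoreCase);

    // Exposed only so a test can confirm that I/O callbacks run without the lock held.
    public bool IsLockHeld { get { return Monitor.IsEntered(_lock); } }

    public void Sync(IEnumerable<string> desiredTags,
                     Func<string, Task> subscribe,
                     Func<string, Task> unsubscribe)
    {
        var desiredList = desiredTags.ToList();
        var desired = new HashSet<string>(desiredList, StringComparer.OrdinalIgnoreCase);
        List<string> toAdd, toRemove;

        lock (_lock)
        {
            // Bookkeeping only: no backend calls while other operations are blocked.
            toAdd = desiredList.Where(t => !_subscribed.Contains(t))
                               .Distinct(StringComparer.OrdinalIgnoreCase).ToList();
            toRemove = _subscribed.Where(t => !desired.Contains(t)).ToList();
            _subscribed.ExceptWith(toRemove);
            _subscribed.UnionWith(toAdd);
        }

        // Potentially slow calls happen after the lock is released.
        foreach (var tag in toRemove) unsubscribe(tag).GetAwaiter().GetResult();
        foreach (var tag in toAdd) subscribe(tag).GetAwaiter().GetResult();
    }
}
```

The trade-off is that bookkeeping briefly runs ahead of the backend, so a failed subscribe after the lock is released has to be logged or reconciled separately rather than rolled back atomically.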
P3: several background loops can be started multiple times and are not joined on shutdown
Evidence:
- src/ZB.MOM.WW.LmxOpcUa.Host/GalaxyRepository/ChangeDetectionService.cs:58
- src/ZB.MOM.WW.LmxOpcUa.Host/MxAccess/MxAccessClient.Monitor.cs:15
- src/ZB.MOM.WW.LmxOpcUa.Host/Status/StatusWebServer.cs:57
Details:
- Start() methods overwrite cancellation tokens and launch Task.Run(...) without keeping the returned Task.
- Calling Start() twice leaks the earlier loop and its CTS.
- Stop() only cancels; it does not wait for loop completion.
Impact:
- Duplicate starts or restart paths become nondeterministic.
- Shutdown can race active loops that are still touching shared state.
Recommendation:
- Guard against duplicate starts.
- Keep the background task handle and wait for orderly exit during stop/dispose.
Status: Resolved (2026-04-07)
Fix: All three services (ChangeDetectionService, MxAccessClient.Monitor, StatusWebServer) now store the Task returned by Task.Run. Start() cancels+joins any previous loop before launching a new one. Stop() cancels the token and waits on the task with a 5-second timeout.
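The cancel-then-join lifecycle described in the fix could take roughly this shape. BackgroundLoop is a hypothetical wrapper written for illustration; the three services implement the pattern inline, and their exact code is not reproduced here.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical wrapper: keeps the Task handle, joins the old loop on restart and on stop.
public sealed class BackgroundLoop : IDisposable
{
    private readonly object _gate = new object();
    private readonly Func<CancellationToken, Task> _body;
    private readonly TimeSpan _stopTimeout;
    private CancellationTokenSource _cts;
    private Task _loop;

    public BackgroundLoop(Func<CancellationToken, Task> body, TimeSpan stopTimeout)
    {
        _body = body;
        _stopTimeout = stopTimeout;
    }

    public void Start()
    {
        lock (_gate)
        {
            StopCore(); // cancel and join any previous loop before launching a new one
            _cts = new CancellationTokenSource();
            var token = _cts.Token;
            _loop = Task.Run(() => _body(token));
        }
    }

    // Returns false if the loop did not exit within the timeout.
    public bool Stop()
    {
        lock (_gate) return StopCore();
    }

    private bool StopCore()
    {
        if (_loop == null) return true;
        _cts.Cancel();
        bool exited;
        try { exited = _loop.Wait(_stopTimeout); }
        catch (AggregateException) { exited = true; } // loop ended via cancellation or a fault
        if (exited) _cts.Dispose(); // leave the CTS alone if the loop may still be using its token
        _cts = null;
        _loop = null;
        return exited;
    }

    public void Dispose() { Stop(); }
}
```

Holding the gate while joining keeps Start and Stop strictly serialized, which is simple but means a stuck loop delays callers for up to the stop timeout.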
P3: startup logging exposes sensitive configuration
Evidence:
src/ZB.MOM.WW.LmxOpcUa.Host/Configuration/ConfigurationValidator.cs:71src/ZB.MOM.WW.LmxOpcUa.Host/Configuration/ConfigurationValidator.cs:118
Details:
- The validator logs the full Galaxy repository connection string and detailed authentication-related settings.
- In many deployments, the connection string will contain credentials.
Impact:
- Credential exposure in logs increases operational risk and complicates incident handling.
Details on scope:
- The primary exposure is GalaxyRepository.ConnectionString logged verbatim at ConfigurationValidator.cs:72. When using SQL authentication, this contains the password in the connection string.
- Historian credentials (UserName/Password) are checked for emptiness but not logged as values, so this section is safe.
- LDAP ServiceAccountDn is checked for emptiness but not logged as a value, so it is also safe.
Recommendation:
- Redact secrets before logging. Parse the connection string and mask or omit password segments.
- Log connection targets (server, database) and non-sensitive settings only.
Status: Resolved (2026-04-07)
Fix: Added SanitizeConnectionString helper using SqlConnectionStringBuilder to mask passwords with ********. Falls back to (unparseable) if the string can't be parsed.
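A sketch of the masking idea follows. To keep it free of a SqlClient dependency it uses the base DbConnectionStringBuilder from System.Data.Common instead of the SqlConnectionStringBuilder named in the fix; that swap, the ConnectionStringSanitizer class name, and the explicit "Pwd" synonym handling are assumptions, not the project's helper.

```csharp
using System;
using System.Data.Common;

public static class ConnectionStringSanitizer
{
    // The base builder does not normalize synonyms, so both common spellings are listed.
    private static readonly string[] SecretKeys = { "Password", "Pwd" };

    // Returns the connection string with any password value masked,
    // or "(unparseable)" if it is not in key=value form.
    public static string Sanitize(string connectionString)
    {
        try
        {
            var builder = new DbConnectionStringBuilder { ConnectionString = connectionString };
            foreach (var key in SecretKeys)
            {
                if (builder.ContainsKey(key)) builder[key] = "********";
            }
            return builder.ConnectionString;
        }
        catch (ArgumentException)
        {
            return "(unparseable)";
        }
    }
}
```

Falling back to a fixed placeholder rather than the raw input matters here: echoing a string that failed to parse could still leak the password it contains.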
Test Coverage Gaps
Real async failure modes are under-tested (Resolved)
FakeMxAccessClient now supports fault injection via SubscribeException, UnsubscribeException, ReadException, and WriteException properties. When set, the corresponding async methods return Task.FromException. Three tests in LmxNodeManagerSubscriptionFaultTests verify that subscribe/unsubscribe faults are caught and logged instead of silently discarded, and that ref-count bookkeeping survives a transient fault.
Historian lifecycle coverage is minimal (Resolved)
Extracted IHistorianConnectionFactory abstraction from HistorianDataSource, with SdkHistorianConnectionFactory as the production implementation and FakeHistorianConnectionFactory for tests. Eleven lifecycle tests in HistorianDataSourceLifecycleTests now cover: post-dispose rejection for all four read methods, double-dispose idempotency, aggregate column mapping, connection failure (returns empty results), connection timeout (returns empty results), reconnect-after-error (factory called twice), connection failure state resilience, and dispose-after-failure safety.
Continuation-point expiry is not tested (Resolved)
Two expiry tests added: Retrieve_ExpiredContinuationPoint_ReturnsNull and Release_PurgesExpiredEntries.
Commands Run
Successful:
- dotnet test tests\ZB.MOM.WW.LmxOpcUa.Tests\ZB.MOM.WW.LmxOpcUa.Tests.csproj --no-restore --filter HistoryContinuationPointTests
- dotnet test tests\ZB.MOM.WW.LmxOpcUa.Tests\ZB.MOM.WW.LmxOpcUa.Tests.csproj --no-restore --filter ChangeDetectionServiceTests
- dotnet test tests\ZB.MOM.WW.LmxOpcUa.Tests\ZB.MOM.WW.LmxOpcUa.Tests.csproj --no-restore --filter StaComThreadTests
Failed:
- dotnet test tests\ZB.MOM.WW.LmxOpcUa.Tests\ZB.MOM.WW.LmxOpcUa.Tests.csproj --no-restore --filter StatusWebServerTests
Timed out:
- dotnet test tests\ZB.MOM.WW.LmxOpcUa.Tests\ZB.MOM.WW.LmxOpcUa.Tests.csproj --no-restore
Bottom Line
All findings have been resolved:
- StaComThread crash-path faulting prevents callers from hanging forever.
- Subscription tasks are no longer silently discarded — failures are caught and logged.
- Expired history continuation points are purged on Retrieve() and Release(), not only on Store().
- Subscribe/unsubscribe I/O moved outside lock(Lock) in rebuild paths; alarm metadata cached from subscriptions instead of blocking reads; Historian connection polling no longer holds the connection lock.
- Dashboard binds to localhost and reports startup failures explicitly.
- Background loops guard against double-start and join on stop.
- Connection strings are sanitized before logging.
Remaining architectural note: OPC UA Read/Write/HistoryRead handlers still use .GetAwaiter().GetResult() because CustomNodeManager2 overrides are synchronous. This is mitigated by the existing configurable timeouts in MxAccessClient (default 5s).