Resolve 6 of 7 stability review findings and close test coverage gaps
Fixes:

- P1 StaComThread hang (crash-path faulting via WorkItem queue)
- P1 subscription fire-and-forget (block+log or ContinueWith on 5 call sites)
- P2 continuation point leak (PurgeExpired on Retrieve/Release)
- P2 dashboard bind failure (localhost prefix, bool Start)
- P3 background loop double-start (task handles + join on stop in 3 files)
- P3 config logging exposure (SqlConnectionStringBuilder password masking)

Adds FakeMxAccessClient fault injection and 12 new tests. Documents required runtime assemblies in ServiceHosting.md.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

@@ -151,6 +151,24 @@ ZB.MOM.WW.LmxOpcUa.Host.exe install -servicename "LmxOpcUa2" -displayname "LMX O

See [Redundancy Guide](Redundancy.md) for full deployment details.

## Required Runtime Assemblies

The build uses Costura.Fody to embed all NuGet dependencies into the single `ZB.MOM.WW.LmxOpcUa.Host.exe`. However, the following ArchestrA and Historian DLLs are **excluded from embedding** and must be present alongside the executable at runtime:

| Assembly | Purpose |
|----------|---------|
| `ArchestrA.MxAccess.dll` | MXAccess COM interop — runtime data access to Galaxy tags |
| `aahClientManaged.dll` | Wonderware Historian managed SDK — historical data queries |
| `aahClient.dll` | Historian native dependency |
| `aahClientCommon.dll` | Historian native dependency |
| `Historian.CBE.dll` | Historian native dependency |
| `Historian.DPAPI.dll` | Historian native dependency |
| `ArchestrA.CloudHistorian.Contract.dll` | Historian contract dependency |

These DLLs are sourced from the `lib/` folder in the repository and are copied to the build output directory automatically. When deploying, ensure all seven DLLs are in the same directory as the executable.

These assemblies are not redistributable — they are provided by the AVEVA System Platform and Historian installations on the target machine. The copies in `lib/` are taken from `Program Files (x86)\ArchestrA\Framework\bin` on a machine with the platform installed.

## Platform Target

The service must be compiled and run as x86 (32-bit). The MXAccess COM toolkit DLLs in `Program Files (x86)\ArchestrA\Framework\bin` are 32-bit only. Running the service as x64 or AnyCPU (64-bit preferred) causes COM interop failures when creating the `LMXProxyServer` object on the STA thread.
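
The x86 requirement is typically pinned in the project file. A minimal sketch of the relevant MSBuild properties — these are standard MSBuild names, not taken from this repository's actual csproj:

```xml
<!-- Illustrative fragment: force 32-bit output so the 32-bit MXAccess COM DLLs can load -->
<PropertyGroup>
  <PlatformTarget>x86</PlatformTarget>
  <Prefer32Bit>true</Prefer32Bit>
</PropertyGroup>
```

An explicit `x86` target is preferable to relying on `AnyCPU` plus `Prefer32Bit`, since the latter only affects executables and is easy to lose during project edits.
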
243
docs/stability-review-20260407.md
Normal file
@@ -0,0 +1,243 @@

# Stability Review

Date: 2026-04-07

Scope:

- Service startup/shutdown lifecycle
- MXAccess threading and reconnect behavior
- OPC UA node manager request paths
- Historian history-read paths
- Status dashboard hosting
- Test coverage around the above

## Findings

### P1: `StaComThread` can leave callers blocked forever after pump failure or shutdown races

Evidence:

- `src/ZB.MOM.WW.LmxOpcUa.Host/MxAccess/StaComThread.cs:106`
- `src/ZB.MOM.WW.LmxOpcUa.Host/MxAccess/StaComThread.cs:132`
- `src/ZB.MOM.WW.LmxOpcUa.Host/MxAccess/StaComThread.cs:178`

Details:

- `RunAsync` enqueues work and calls `PostThreadMessage(...)`, but it ignores the returned `bool`.
- If the STA thread has already exited, or if posting fails for any other reason, the queued `TaskCompletionSource` is never completed or faulted.
- `ThreadEntry` logs pump crashes, but it does not drain/fault queued work, does not reset `_nativeThreadId`, and does not prevent later calls from queueing more work unless `Dispose()` happened first.

Note:

- Clean shutdown via `Dispose()` posts `WM_APP+1`, which calls `DrainQueue()` before `PostQuitMessage`, so queued work is drained on the normal shutdown path. The gap is crash and unexpected-exit paths only.
- `ThreadEntry` catches crashes and calls `_ready.TrySetException(ex)`, but `_ready` is only awaited during `Start()`. A crash *after* startup completes does not fault any pending or future callers.

Impact:

- Any caller waiting synchronously on these tasks can hang indefinitely after a pump crash (not a clean shutdown).
- This is especially dangerous because higher layers regularly use `.GetAwaiter().GetResult()` during connect, disconnect, rebuild, and request processing.

Recommendation:

- Check the return value of `PostThreadMessage`.
- If post fails, remove/fault the queued work item immediately.
- Mark the worker unusable when the pump exits unexpectedly and fault all remaining queued items.
- Add a shutdown/crash-path test that verifies queued callers fail fast instead of hanging.

**Status: Resolved (2026-04-07)**

Fix: Refactored queue to `WorkItem` type with separate `Execute`/`Fault` actions. Added `_pumpExited` flag set in `ThreadEntry` finally block. `DrainAndFaultQueue()` faults all pending TCS instances without executing user actions. `RunAsync` checks `_pumpExited` before enqueueing. `PostThreadMessage` return value is checked — false triggers drain-and-fault. Added crash-path test via `PostQuitMessage`.
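
The drain-and-fault shape described in the fix can be sketched as follows. All names here (`StaBridgeSketch`, `TryPostToPump`, `SimulatePumpCrash`) are illustrative stand-ins, not the repository's actual API — the real implementation posts Win32 thread messages to a pump thread:

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

// Sketch: work items carry separate Execute/Fault actions so the queue
// can be faulted without running user code after the pump is gone.
sealed class WorkItem
{
    public Action Execute { get; init; }
    public Action<Exception> Fault { get; init; }
}

sealed class StaBridgeSketch
{
    private readonly ConcurrentQueue<WorkItem> _queue = new();
    private volatile bool _pumpExited;

    // Stand-in for PostThreadMessage: returns false when the pump is gone.
    private bool TryPostToPump() => !_pumpExited;

    public Task<T> RunAsync<T>(Func<T> action)
    {
        var tcs = new TaskCompletionSource<T>(TaskCreationOptions.RunContinuationsAsynchronously);

        // Fail fast instead of queueing into a dead pump.
        if (_pumpExited)
        {
            tcs.SetException(new InvalidOperationException("STA pump has exited."));
            return tcs.Task;
        }

        _queue.Enqueue(new WorkItem
        {
            Execute = () => tcs.TrySetResult(action()),
            Fault = ex => tcs.TrySetException(ex),
        });

        // Check the post result instead of ignoring it: on failure,
        // fault everything queued so no caller blocks forever.
        if (!TryPostToPump())
            DrainAndFaultQueue(new InvalidOperationException("PostThreadMessage failed."));

        return tcs.Task;
    }

    // Stand-in for the pump's finally block on an unexpected exit.
    public void SimulatePumpCrash()
    {
        _pumpExited = true;
        DrainAndFaultQueue(new InvalidOperationException("STA pump crashed."));
    }

    private void DrainAndFaultQueue(Exception reason)
    {
        while (_queue.TryDequeue(out var item))
            item.Fault(reason); // never executes user actions after the pump is gone
    }
}
```

After a simulated crash, `RunAsync` returns an already-faulted task, so synchronous callers using `.GetAwaiter().GetResult()` throw immediately instead of hanging.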

### P1: `LmxNodeManager` discards subscription tasks, so failures can be silent

Evidence:

- `src/ZB.MOM.WW.LmxOpcUa.Host/OpcUa/LmxNodeManager.cs:396`
- `src/ZB.MOM.WW.LmxOpcUa.Host/OpcUa/LmxNodeManager.cs:1906`
- `src/ZB.MOM.WW.LmxOpcUa.Host/OpcUa/LmxNodeManager.cs:1934`
- `src/ZB.MOM.WW.LmxOpcUa.Host/OpcUa/LmxNodeManager.cs:1960`
- `tests/ZB.MOM.WW.LmxOpcUa.Tests/Helpers/FakeMxAccessClient.cs:83`

Details:

- Several subscription and unsubscription calls are fire-and-forget.
- `SubscribeAlarmTags()` even wraps `SubscribeAsync(...)` in `try/catch`, but because the returned task is not awaited, asynchronous failures bypass that catch.
- The test suite mostly uses `FakeMxAccessClient`, whose subscribe/unsubscribe methods complete immediately, so these failure paths are not exercised.

Impact:

- A failed runtime subscribe/unsubscribe can silently leave monitored OPC UA items stale or orphaned.
- The service can appear healthy while live updates quietly stop flowing for part of the address space.

Note:

- `LmxNodeManager` also runs a separate `_dataChangeDispatchThread` that batches MXAccess callbacks into OPC UA value updates. Subscription failures upstream mean this thread will simply never receive data for the affected tags, with no indication of the gap. Failures should be cross-referenced with dispatch-thread health to surface silent data loss.

Recommendation:

- Stop discarding these tasks.
- If the boundary must remain synchronous, centralize the wait and log/fail deterministically.
- Add tests that inject asynchronously failing subscribe/unsubscribe operations.

**Status: Resolved (2026-04-07)**

Fix: `SubscribeTag` and `UnsubscribeTag` (critical monitored-item paths) now use `.GetAwaiter().GetResult()` with try/catch logging. `SubscribeAlarmTags`, `BuildSubtree` alarm subscribes, and `RestoreTransferredSubscriptions` (batch paths) now use `.ContinueWith(OnlyOnFaulted)` to log failures instead of silently discarding tasks.
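
The two patterns named in the fix can be sketched like this. The client interface and logger are placeholders, not the real types:

```csharp
using System;
using System.Threading.Tasks;

// Placeholder for the MXAccess client surface used below.
interface IMxClientSketch { Task SubscribeAsync(string tag); }

// A fake whose task faults asynchronously, mimicking a backend failure.
sealed class FaultingClientSketch : IMxClientSketch
{
    public Task SubscribeAsync(string tag) =>
        Task.FromException(new InvalidOperationException("backend down"));
}

static class SubscriptionPatterns
{
    // Critical path: stay synchronous at the boundary, but surface failures.
    public static bool SubscribeBlocking(IMxClientSketch client, string tag, Action<string> log)
    {
        try
        {
            client.SubscribeAsync(tag).GetAwaiter().GetResult();
            return true;
        }
        catch (Exception ex)
        {
            log($"Subscribe failed for {tag}: {ex.Message}");
            return false;
        }
    }

    // Batch path: still fire-and-forget, but attach a fault observer so
    // asynchronous failures are logged instead of silently dropped.
    public static Task SubscribeObserved(IMxClientSketch client, string tag, Action<string> log)
    {
        return client.SubscribeAsync(tag).ContinueWith(
            t => log($"Subscribe failed for {tag}: {t.Exception?.GetBaseException().Message}"),
            TaskContinuationOptions.OnlyOnFaulted);
    }
}
```

The key difference from the original code: in both patterns the task's outcome is observed, so a fault that completes after the call returns can no longer bypass the surrounding `try/catch`.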

### P2: history continuation points can leak memory after expiry

Evidence:

- `src/ZB.MOM.WW.LmxOpcUa.Host/Historian/HistoryContinuationPoint.cs:23`
- `src/ZB.MOM.WW.LmxOpcUa.Host/Historian/HistoryContinuationPoint.cs:66`
- `tests/ZB.MOM.WW.LmxOpcUa.Tests/Historian/HistoryContinuationPointTests.cs:25`

Details:

- Expired continuation points are purged only from `Store()`.
- If a client requests continuation points and then never resumes or releases them, the stored `List<DataValue>` instances remain in memory until another `Store()` happens.
- Existing tests cover store/retrieve/release but do not cover expiration or reclamation.

Impact:

- A burst of abandoned history reads can retain large result sets in memory until the next `Store()` call triggers `PurgeExpired()`. On an otherwise idle system with no new history reads, this retention is indefinite.

Recommendation:

- Purge expired entries on `Retrieve()` and `Release()`.
- Consider a periodic sweep or a hard cap on stored continuation payloads.
- Add an expiry-focused test.

**Status: Resolved (2026-04-07)**

Fix: `PurgeExpired()` now called at the start of both `Retrieve()` and `Release()`. Added internal constructor accepting `TimeSpan timeout` for testability. Added two expiry-focused tests.
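
A minimal sketch of the purge-on-access pattern, with an injectable clock standing in for the internal `TimeSpan timeout` constructor the fix mentions. The real class stores `List<DataValue>` payloads; a string payload stands in here:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Illustrative continuation-point store: expired entries are reclaimed on
// every access path, not only on Store().
sealed class ContinuationStoreSketch
{
    private readonly Dictionary<Guid, (string Payload, DateTime Expires)> _points = new();
    private readonly TimeSpan _timeout;
    private readonly Func<DateTime> _now; // injectable clock for tests

    public ContinuationStoreSketch(TimeSpan timeout, Func<DateTime> now)
    {
        _timeout = timeout;
        _now = now;
    }

    public int Count => _points.Count;

    public Guid Store(string payload)
    {
        PurgeExpired();
        var id = Guid.NewGuid();
        _points[id] = (payload, _now() + _timeout);
        return id;
    }

    public string Retrieve(Guid id)
    {
        PurgeExpired(); // reclaim abandoned entries even on read paths
        return _points.Remove(id, out var entry) ? entry.Payload : null;
    }

    public void Release(Guid id)
    {
        PurgeExpired();
        _points.Remove(id);
    }

    private void PurgeExpired()
    {
        var now = _now();
        foreach (var id in _points.Where(p => p.Value.Expires <= now).Select(p => p.Key).ToList())
            _points.Remove(id);
    }
}
```

With the clock injected, an expiry test needs no real waiting: advance the clock past the timeout and assert the entry is both unretrievable and physically removed.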

### P2: the status dashboard can fail to bind and disable itself silently

Evidence:

- `src/ZB.MOM.WW.LmxOpcUa.Host/Status/StatusWebServer.cs:53`
- `src/ZB.MOM.WW.LmxOpcUa.Host/Status/StatusWebServer.cs:64`
- `tests/ZB.MOM.WW.LmxOpcUa.Tests/Status/StatusWebServerTests.cs:30`
- `tests/ZB.MOM.WW.LmxOpcUa.Tests/Status/StatusWebServerTests.cs:149`

Observed test result:

- `dotnet test tests\ZB.MOM.WW.LmxOpcUa.Tests\ZB.MOM.WW.LmxOpcUa.Tests.csproj --no-restore --filter StatusWebServerTests`
- Result: 9 failed, 0 passed
- Failures were all consistent with the listener not starting (`IsRunning == false`, connection refused).

Details:

- `Start()` swallows startup exceptions, logs a warning, and leaves `_listener = null`.
- The code binds `http://+:{port}/`, which is more permission-sensitive than a narrower host-specific prefix.
- Callers get no explicit failure signal, so the dashboard can simply vanish at runtime.

Impact:

- Operators and external checks can assume a dashboard exists when it does not.
- Health visibility degrades exactly when the service most needs diagnosability.

Note:

- The `http://+:{port}/` wildcard prefix requires either administrator privileges or a pre-configured URL ACL (`netsh http add urlacl`). This is also the likely cause of the 9/9 test failures — tests run without elevation will always fail to bind.

Recommendation:

- Fail fast, or at least return an explicit startup status.
- Default to `http://localhost:{port}/` unless wildcard binding is explicitly configured — this avoids the ACL requirement for single-machine deployments and fixes the test suite without special privileges.
- Add a startup test that asserts the service reports bind failures clearly.

**Status: Resolved (2026-04-07)**

Fix: Changed prefix from `http://+:{port}/` to `http://localhost:{port}/`. `Start()` now returns `bool`. Bind failure logged at Error level. Test suite now passes 9/9.
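
A sketch of the explicit-status `Start()`, assuming `HttpListener` hosting as described above; the port and logger wiring are illustrative:

```csharp
using System;
using System.Net;

// Illustrative status-server host: Start() reports bind failure to the
// caller instead of silently disabling the dashboard.
sealed class StatusServerSketch
{
    private HttpListener _listener;

    // localhost avoids the URL-ACL requirement that the old
    // http://+:{port}/ wildcard prefix carried on Windows.
    public bool Start(int port, Action<string> logError)
    {
        try
        {
            _listener = new HttpListener();
            _listener.Prefixes.Add($"http://localhost:{port}/");
            _listener.Start();
            return true;
        }
        catch (Exception ex)
        {
            logError($"Status dashboard failed to bind port {port}: {ex.Message}");
            _listener = null;
            return false; // explicit failure signal for the caller to act on
        }
    }

    public bool IsRunning => _listener?.IsListening == true;

    public void Stop()
    {
        _listener?.Close();
        _listener = null;
    }
}
```

Returning `bool` keeps the host's startup sequence synchronous while still letting it log at Error level, refuse to report healthy, or abort startup when the dashboard cannot bind.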

### P2: blocking remote I/O is performed directly in request and rebuild paths

Evidence:

- `src/ZB.MOM.WW.LmxOpcUa.Host/OpcUa/LmxNodeManager.cs:586`
- `src/ZB.MOM.WW.LmxOpcUa.Host/OpcUa/LmxNodeManager.cs:617`
- `src/ZB.MOM.WW.LmxOpcUa.Host/OpcUa/LmxNodeManager.cs:641`
- `src/ZB.MOM.WW.LmxOpcUa.Host/OpcUa/LmxNodeManager.cs:1228`
- `src/ZB.MOM.WW.LmxOpcUa.Host/OpcUa/LmxNodeManager.cs:1289`
- `src/ZB.MOM.WW.LmxOpcUa.Host/OpcUa/LmxNodeManager.cs:1386`
- `src/ZB.MOM.WW.LmxOpcUa.Host/OpcUa/LmxNodeManager.cs:1526`
- `src/ZB.MOM.WW.LmxOpcUa.Host/OpcUa/LmxNodeManager.cs:1601`
- `src/ZB.MOM.WW.LmxOpcUa.Host/OpcUa/LmxNodeManager.cs:1655`
- `src/ZB.MOM.WW.LmxOpcUa.Host/OpcUa/LmxNodeManager.cs:1718`
- `src/ZB.MOM.WW.LmxOpcUa.Host/Historian/HistorianDataSource.cs:175`

Details:

- OPC UA read/write/history handlers synchronously block on MXAccess and Historian calls.
- Incremental sync also performs blocking subscribe/unsubscribe operations while holding the node-manager lock.
- Historian connection establishment uses polling plus `Thread.Sleep(250)`, so slow connects directly occupy request threads.

Impact:

- Slow runtime dependencies can starve OPC UA worker threads and make rebuilds stall the namespace lock.
- This is not just a latency issue; it turns transient backend slowness into whole-service responsiveness problems.

Recommendation:

- Move I/O out of locked sections.
- Propagate cancellation/timeouts explicitly through the request path.
- Add load/fault tests against the real async MXAccess client behavior, not only synchronous fakes.
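
The "move I/O out of locked sections" recommendation usually takes a snapshot-commit shape: compute the work under the lock, perform the remote calls with the lock released, then reacquire briefly to commit. A sketch with illustrative types (none of these names come from the repository):

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

// Illustrative incremental-sync shape: the lock guards only in-memory
// state, never remote subscribe/unsubscribe calls.
sealed class SyncSketch
{
    private readonly object _lock = new();
    private readonly HashSet<string> _subscribed = new();

    public async Task SyncAsync(IEnumerable<string> desired, Func<string, Task> subscribeAsync)
    {
        // 1. Compute the delta while holding the lock -- cheap, no I/O.
        List<string> toAdd;
        lock (_lock)
        {
            toAdd = new List<string>();
            foreach (var tag in desired)
                if (!_subscribed.Contains(tag))
                    toAdd.Add(tag);
        }

        // 2. Perform the remote calls with the lock released, so request
        //    threads and other rebuilds are not stalled behind backend I/O.
        var succeeded = new List<string>();
        foreach (var tag in toAdd)
        {
            await subscribeAsync(tag).ConfigureAwait(false);
            succeeded.Add(tag);
        }

        // 3. Reacquire briefly to commit the results.
        lock (_lock)
        {
            foreach (var tag in succeeded)
                _subscribed.Add(tag);
        }
    }

    public bool IsSubscribed(string tag) { lock (_lock) return _subscribed.Contains(tag); }
}
```

The trade-off is that concurrent syncs can race between steps 1 and 3, so the commit step must tolerate already-present entries (idempotent `Add` here); that is usually a far smaller cost than holding the namespace lock across remote I/O.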

### P3: several background loops can be started multiple times and are not joined on shutdown

Evidence:

- `src/ZB.MOM.WW.LmxOpcUa.Host/GalaxyRepository/ChangeDetectionService.cs:58`
- `src/ZB.MOM.WW.LmxOpcUa.Host/MxAccess/MxAccessClient.Monitor.cs:15`
- `src/ZB.MOM.WW.LmxOpcUa.Host/Status/StatusWebServer.cs:57`

Details:

- `Start()` methods overwrite cancellation tokens and launch `Task.Run(...)` without keeping the returned `Task`.
- Calling `Start()` twice leaks the earlier loop and its CTS.
- `Stop()` only cancels; it does not wait for loop completion.

Impact:

- Duplicate starts or restart paths become nondeterministic.
- Shutdown can race active loops that are still touching shared state.

Recommendation:

- Guard against duplicate starts.
- Keep the background task handle and wait for orderly exit during stop/dispose.

**Status: Resolved (2026-04-07)**

Fix: All three services (`ChangeDetectionService`, `MxAccessClient.Monitor`, `StatusWebServer`) now store the `Task` returned by `Task.Run`. `Start()` cancels+joins any previous loop before launching a new one. `Stop()` cancels the token and waits on the task with a 5-second timeout.
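
The guarded-start / joined-stop pattern described in the fix can be sketched as follows; the loop body is a placeholder for the real monitoring work, and the 5-second cap mirrors the timeout the fix mentions:

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Illustrative background-loop host: Start() is idempotent-safe and
// Stop() joins the loop instead of only cancelling it.
sealed class BackgroundLoopSketch : IDisposable
{
    private CancellationTokenSource _cts;
    private Task _loop;

    public void Start(Func<CancellationToken, Task> body)
    {
        Stop(); // guard: cancel and join any previous loop before relaunching
        _cts = new CancellationTokenSource();
        var token = _cts.Token;
        _loop = Task.Run(() => body(token), token); // keep the handle for Stop()
    }

    public void Stop()
    {
        if (_loop == null) return;
        _cts.Cancel();
        try
        {
            // Wait for an orderly exit instead of racing shutdown.
            _loop.Wait(TimeSpan.FromSeconds(5));
        }
        catch (AggregateException)
        {
            // Cancellation surfaces here; that is the expected exit path.
        }
        _cts.Dispose();
        _cts = null;
        _loop = null;
    }

    public void Dispose() => Stop();
}
```

Because `Start()` internally calls `Stop()`, a restart path never leaks the earlier loop or its CTS, and shutdown observes the loop's completion before shared state is torn down.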

### P3: startup logging exposes sensitive configuration

Evidence:

- `src/ZB.MOM.WW.LmxOpcUa.Host/Configuration/ConfigurationValidator.cs:71`
- `src/ZB.MOM.WW.LmxOpcUa.Host/Configuration/ConfigurationValidator.cs:118`

Details:

- The validator logs the full Galaxy repository connection string and detailed authentication-related settings.
- In many deployments, the connection string will contain credentials.

Impact:

- Credential exposure in logs increases operational risk and complicates incident handling.

Details on scope:

- The primary exposure is `GalaxyRepository.ConnectionString` logged verbatim at `ConfigurationValidator.cs:72`. When using SQL authentication, this contains the password in the connection string.
- Historian credentials (`UserName`/`Password`) are checked for emptiness but not logged as values — this section is safe.
- LDAP `ServiceAccountDn` is checked for emptiness but not logged as a value — also safe.

Recommendation:

- Redact secrets before logging. Parse the connection string and mask or omit password segments.
- Log connection targets (server, database) and non-sensitive settings only.

**Status: Resolved (2026-04-07)**

Fix: Added `SanitizeConnectionString` helper using `SqlConnectionStringBuilder` to mask passwords with `********`. Falls back to `(unparseable)` if the string can't be parsed.
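
A sketch of the masking helper. The actual fix uses `SqlConnectionStringBuilder`; the base-library `DbConnectionStringBuilder` is used here only so the sketch carries no SqlClient dependency:

```csharp
using System;
using System.Data.Common;

// Illustrative stand-in for the SanitizeConnectionString helper.
static class ConnectionStringSanitizer
{
    public static string Sanitize(string connectionString)
    {
        try
        {
            var builder = new DbConnectionStringBuilder { ConnectionString = connectionString };

            // Mask both spellings SQL Server accepts for the password keyword.
            foreach (var key in new[] { "Password", "Pwd" })
                if (builder.ContainsKey(key))
                    builder[key] = "********";

            return builder.ConnectionString;
        }
        catch (ArgumentException)
        {
            // Never echo a string we could not parse -- it may still hold a secret.
            return "(unparseable)";
        }
    }
}
```

Note that the fallback deliberately discards the whole string: an unparseable value might still contain a password fragment, so returning it verbatim would reintroduce the exposure.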

## Test Coverage Gaps

### ~~Real async failure modes are under-tested~~ (Resolved)

`FakeMxAccessClient` now supports fault injection via `SubscribeException`, `UnsubscribeException`, `ReadException`, and `WriteException` properties. When set, the corresponding async methods return `Task.FromException`. Three tests in `LmxNodeManagerSubscriptionFaultTests` verify that subscribe/unsubscribe faults are caught and logged instead of silently discarded, and that ref-count bookkeeping survives a transient fault.

### ~~Historian lifecycle coverage is minimal~~ (Partially resolved)

Six lifecycle tests added in `HistorianDataSourceLifecycleTests`: post-dispose rejection for all four read methods (`ReadRawAsync`, `ReadAggregateAsync`, `ReadAtTimeAsync`, `ReadEventsAsync`), double-dispose idempotency, and aggregate column mapping.

Remaining: connection timeout, reconnect-after-failure, and query cleanup paths cannot be unit-tested without introducing an abstraction over the `HistorianAccess` SDK class (currently created directly via `new HistorianAccess()` in `EnsureConnected`). Extracting an `IHistorianAccessFactory` seam would make these paths testable.

### ~~Continuation-point expiry is not tested~~ (Resolved)

Two expiry tests added: `Retrieve_ExpiredContinuationPoint_ReturnsNull` and `Release_PurgesExpiredEntries`.
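
The fault-injection pattern described above can be reduced to a small sketch; only the subscribe pair is shown, and the type name is illustrative rather than the repository's actual fake:

```csharp
using System;
using System.Threading.Tasks;

// Illustrative fault-injecting fake: setting an exception property makes
// the corresponding async method return a faulted task.
sealed class FakeClientSketch
{
    public Exception SubscribeException { get; set; }
    public Exception UnsubscribeException { get; set; }

    public Task SubscribeAsync(string tag) =>
        SubscribeException != null
            ? Task.FromException(SubscribeException)  // asynchronous fault, not a throw
            : Task.CompletedTask;

    public Task UnsubscribeAsync(string tag) =>
        UnsubscribeException != null
            ? Task.FromException(UnsubscribeException)
            : Task.CompletedTask;
}
```

Returning `Task.FromException` rather than throwing synchronously is the important detail: a synchronous throw would be caught by a caller's `try/catch` even when the task is discarded, so it could not reproduce the fire-and-forget bug the P1 finding describes.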

## Commands Run

Successful:

- `dotnet test tests\ZB.MOM.WW.LmxOpcUa.Tests\ZB.MOM.WW.LmxOpcUa.Tests.csproj --no-restore --filter HistoryContinuationPointTests`
- `dotnet test tests\ZB.MOM.WW.LmxOpcUa.Tests\ZB.MOM.WW.LmxOpcUa.Tests.csproj --no-restore --filter ChangeDetectionServiceTests`
- `dotnet test tests\ZB.MOM.WW.LmxOpcUa.Tests\ZB.MOM.WW.LmxOpcUa.Tests.csproj --no-restore --filter StaComThreadTests`

Failed:

- `dotnet test tests\ZB.MOM.WW.LmxOpcUa.Tests\ZB.MOM.WW.LmxOpcUa.Tests.csproj --no-restore --filter StatusWebServerTests`

Timed out:

- `dotnet test tests\ZB.MOM.WW.LmxOpcUa.Tests\ZB.MOM.WW.LmxOpcUa.Tests.csproj --no-restore`

## Bottom Line

The most serious risks are not style issues. They are:

- work items that can hang forever in the STA bridge,
- silent loss of live subscriptions because async failures are ignored,
- request/rebuild paths that block directly on external systems,
- and a dashboard host that can disappear without surfacing a hard failure.

Those are the first items I would address before depending on this service for long-running production stability.