Close all four stability-review 2026-04-13 findings so a failed runtime probe subscription can no longer leave a phantom entry that Tick() flips to Stopped and fans out false BadOutOfService quality across a host's subtree, a silently-failed dashboard bind no longer lets the service advertise a successful start while an operator-visible endpoint is dead, the seven sync-over-async sites in LmxNodeManager (rebuild probe sync, Read, Write, four HistoryRead overrides) can no longer park the OPC UA stack thread indefinitely on a hung backend, and alarm auto-subscribe + transferred-subscription restore no longer race shutdown as untracked fire-and-forget tasks.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-14 00:48:07 -04:00
parent 731092595f
commit c76ab8fdee
21 changed files with 869 additions and 53 deletions
@@ -76,6 +76,7 @@ Controls the MXAccess runtime connection used for live tag reads and writes. Def
 | `ProbeStaleThresholdSeconds` | `int` | `60` | Seconds a probe value may remain unchanged before the connection is considered stale |
 | `RuntimeStatusProbesEnabled` | `bool` | `true` | Advises `<Host>.ScanState` on every deployed `$WinPlatform` and `$AppEngine` to track per-host runtime state. Drives the Galaxy Runtime dashboard panel, HealthCheck Rule 2e, and the Read-path short-circuit that invalidates OPC UA variable quality when a host is Stopped. Set `false` to return to legacy behavior where host state is invisible and the bridge serves whatever quality MxAccess reports for individual tags. See [MXAccess Bridge](MxAccessBridge.md#per-host-runtime-status-probes-hostscanstate) |
 | `RuntimeStatusUnknownTimeoutSeconds` | `int` | `15` | Maximum seconds to wait for the initial probe callback before marking a host as Stopped. Only applies to the Unknown → Stopped transition; Running hosts never time out because `ScanState` is delivered on-change only. A value below 5s triggers a validator warning |
+| `RequestTimeoutSeconds` | `int` | `30` | Outer safety timeout applied to sync-over-async MxAccess operations invoked from the OPC UA stack thread (Read, Write, address-space rebuild probe sync). Backstop for the inner `ReadTimeoutSeconds` / `WriteTimeoutSeconds`. A timed-out operation returns `BadTimeout`. Validator rejects values < 1 and warns if set below the inner Read/Write timeouts. See [MXAccess Bridge](MxAccessBridge.md#request-timeout-safety-backstop). Stability review 2026-04-13 Finding 3 |

 ### GalaxyRepository

@@ -112,7 +113,8 @@ Controls the Wonderware Historian SDK connection for OPC UA historical data acce
 | `UserName` | `string?` | `null` | Username when `IntegratedSecurity` is false |
 | `Password` | `string?` | `null` | Password when `IntegratedSecurity` is false |
 | `Port` | `int` | `32568` | Historian TCP port |
-| `CommandTimeoutSeconds` | `int` | `30` | SDK packet timeout in seconds |
+| `CommandTimeoutSeconds` | `int` | `30` | SDK packet timeout in seconds (inner async bound) |
+| `RequestTimeoutSeconds` | `int` | `60` | Outer safety timeout applied to sync-over-async Historian operations invoked from the OPC UA stack thread (`HistoryReadRaw`, `HistoryReadProcessed`, `HistoryReadAtTime`, `HistoryReadEvents`). Backstop for `CommandTimeoutSeconds`; a timed-out read returns `BadTimeout`. Validator rejects values < 1 and warns if set below `CommandTimeoutSeconds`. Stability review 2026-04-13 Finding 3 |
 | `MaxValuesPerRead` | `int` | `10000` | Maximum values returned per `HistoryRead` request |

 ### Authentication
@@ -310,7 +312,8 @@ Integration tests use this constructor to inject substitute implementations of `
    "ProbeTag": null,
    "ProbeStaleThresholdSeconds": 60,
    "RuntimeStatusProbesEnabled": true,
-    "RuntimeStatusUnknownTimeoutSeconds": 15
+    "RuntimeStatusUnknownTimeoutSeconds": 15,
+    "RequestTimeoutSeconds": 30
  },
  "GalaxyRepository": {
    "ConnectionString": "Server=localhost;Database=ZB;Integrated Security=true;",
@@ -333,6 +336,7 @@ Integration tests use this constructor to inject substitute implementations of `
    "Password": null,
    "Port": 32568,
    "CommandTimeoutSeconds": 30,
+    "RequestTimeoutSeconds": 60,
    "MaxValuesPerRead": 10000
  },
  "Authentication": {
@@ -54,6 +54,7 @@ public class HistorianConfiguration
    public int Port { get; set; } = 32568;
    public int CommandTimeoutSeconds { get; set; } = 30;
    public int MaxValuesPerRead { get; set; } = 10000;
+    public int RequestTimeoutSeconds { get; set; } = 60;
 }
 ```

@@ -70,7 +71,8 @@ When `Enabled` is `false`, `HistorianPluginLoader.TryLoad` is not called, no plu
 | `UserName` | `null` | Username when `IntegratedSecurity` is false |
 | `Password` | `null` | Password when `IntegratedSecurity` is false |
 | `Port` | `32568` | Historian TCP port |
-| `CommandTimeoutSeconds` | `30` | SDK packet timeout in seconds |
+| `CommandTimeoutSeconds` | `30` | SDK packet timeout in seconds (inner async bound) |
+| `RequestTimeoutSeconds` | `60` | Outer safety timeout applied to sync-over-async history reads on the OPC UA stack thread. Backstop for `CommandTimeoutSeconds`; a timed-out read returns `BadTimeout`. Should be greater than `CommandTimeoutSeconds`. Stability review 2026-04-13 Finding 3 |
 | `MaxValuesPerRead` | `10000` | Maximum values per history read request |

 ## Connection Lifecycle
@@ -108,6 +108,7 @@ Enabled by default via `MxAccess.RuntimeStatusProbesEnabled`; see [Configuration
 2. **Transition predicate** — A probe callback is interpreted as `isRunning = vtq.Quality.IsGood() && vtq.Value is bool b && b`. Everything else (explicit `ScanState = false`, bad quality, communication errors from the broker) means **Stopped**.
 3. **On-change-only delivery** — `ScanState` is delivered **only when the value actually changes**. A stably Running host may go hours without a callback. The probe manager's `Tick()` explicitly does NOT run a starvation check on Running entries — the only time-based transition is **Unknown → Stopped** when the initial callback hasn't arrived within `RuntimeStatusUnknownTimeoutSeconds` (default 15s). This protects against a probe that fails to resolve at all without incorrectly flipping healthy long-running hosts.
 4. **Transport gating** — When `IMxAccessClient.State != Connected`, `GetSnapshot()` forces every entry to `Unknown` regardless of underlying state. The dashboard shows the Connection panel as the primary signal in that case rather than misleading operators with "every host stopped."
+5. **Subscribe failure rollback** — If `SubscribeAsync` throws for a new probe (SDK failure, broker rejection, transport error), the manager rolls back both `_byProbe` and `_probeByGobjectId` so the probe never appears in `GetSnapshot()`. Without this rollback, a failed subscribe would leave the entry in `Unknown` forever, and `Tick()` would later transition it to `Stopped` after the unknown-resolution timeout, fanning out a **false-negative** host-down signal that invalidates the subtree of a host that was never actually advised. Stability review 2026-04-13 Finding 1.

 ### Subtree quality invalidation on transition

@@ -137,6 +138,14 @@ See the `runtimestatus.md` plan file and the `service_info.md` entry for the in-

 See [Status Dashboard](StatusDashboard.md#galaxy-runtime) for the field table and [Configuration](Configuration.md#mxaccess) for the two new config fields.

+## Request Timeout Safety Backstop
+
+Every sync-over-async site on the OPC UA stack thread that calls into MxAccess (`Read`, `Write`, address-space rebuild probe sync) is wrapped in a bounded `SyncOverAsync.WaitSync(...)` helper with timeout `MxAccess.RequestTimeoutSeconds` (default 30s). This is a backstop: `MxAccessClient.Read/Write` already enforce inner `ReadTimeoutSeconds` / `WriteTimeoutSeconds` bounds on the async path. The outer wrapper exists so a scheduler stall, slow reconnect, or any other non-returning async path cannot park the stack thread indefinitely.
+
+On timeout, the underlying task is **not** cancelled — it runs to completion on the thread pool and is abandoned. This is acceptable because MxAccess clients are shared singletons and the abandoned continuation does not capture request-scoped state. The OPC UA stack receives `StatusCodes.BadTimeout` on the affected operation.
+
+`ConfigurationValidator` enforces `RequestTimeoutSeconds >= 1` and warns when it is set below the inner Read/Write timeouts (operator misconfiguration). Stability review 2026-04-13 Finding 3.
+
 ## Why Marshal.ReleaseComObject Is Needed

 The .NET runtime's garbage collector releases COM references non-deterministically. For MXAccess, delayed release can leave stale COM connections open, preventing clean re-registration. `MxProxyAdapter.Unregister` calls `Marshal.ReleaseComObject(_lmxProxy)` in a `finally` block to immediately release the COM reference count to zero. This ensures the underlying COM server is freed before a reconnect attempt creates a new instance.
@@ -65,7 +65,7 @@ This is necessary because Windows services default their working directory to `S
 9. **Query Galaxy hierarchy** -- Fetches the object hierarchy and attribute definitions from the Galaxy repository database, recording object and attribute counts.
 10. **Start server and build address space** -- Starts the OPC UA server, retrieves the `LmxNodeManager`, and calls `BuildAddressSpace()` with the queried hierarchy and attributes. If the query or build fails, the server still starts with an empty address space.
 11. **Start change detection** -- Creates and starts `ChangeDetectionService`, which polls `galaxy.time_of_last_deploy` at the configured interval. When a change is detected, it triggers an address-space rebuild via the `OnGalaxyChanged` event.
-12. **Start status dashboard** -- Creates the `HealthCheckService` and `StatusReportService`, wires in all live components, and starts the `StatusWebServer` HTTP listener if the dashboard is enabled.
+12. **Start status dashboard** -- Creates the `HealthCheckService` and `StatusReportService`, wires in all live components, and starts the `StatusWebServer` HTTP listener if the dashboard is enabled. If `StatusWebServer.Start()` returns `false` (port already bound, insufficient permissions, etc.), the service logs a warning, disposes the unstarted instance, sets `OpcUaService.DashboardStartFailed = true`, and continues in degraded mode. Matches the warning-continue policy applied to MxAccess connect, Galaxy DB connect, and initial address space build. Stability review 2026-04-13 Finding 2.
 13. **Log startup complete** -- Logs "LmxOpcUa service started successfully" at `Information` level.

 ## Shutdown Sequence
@@ -243,6 +243,16 @@ The dashboard is configured through the `Dashboard` section in `appsettings.json

 Setting `Enabled` to `false` prevents the `StatusWebServer` from starting. The `StatusReportService` is still created so that other components can query health programmatically, but no HTTP listener is opened.

+### Dashboard start failures are non-fatal
+
+If the dashboard is enabled but the configured port is already bound (e.g., a previous instance did not clean up, another service is squatting on the port, or the user lacks URL-reservation rights), `StatusWebServer.Start()` logs the listener exception at Error level and returns `false`. `OpcUaService` then logs a Warning, disposes the unstarted instance, sets `DashboardStartFailed = true`, and continues in degraded mode — the OPC UA endpoint still starts. Operators can detect the failure by searching the service log for:
+
+```
+[WRN] Status dashboard failed to bind on port {Port}; service continues without dashboard
+```
+
+Stability review 2026-04-13 Finding 2.
+
 ## Component Wiring

 `StatusReportService` is initialized after all other service components are created. `OpcUaService.Start()` calls `SetComponents()` to supply the live references, including the historian configuration so the dashboard can label the plugin target and evaluate Rule 2b:
@@ -0,0 +1,104 @@
+# Stability Review - 2026-04-13
+
+## Scope
+
+Re-review of the updated `lmxopcua` codebase with emphasis on stability, shutdown behavior, async usage, latent deadlock patterns, and silent failure modes.
+
+Validation run for this review:
+
+```powershell
+dotnet test tests\ZB.MOM.WW.LmxOpcUa.Tests\ZB.MOM.WW.LmxOpcUa.Tests.csproj --no-restore
+```
+
+Result: `471/471` tests passed in approximately `3m18s`.
+
+## Confirmed Findings
+
+### 1. Probe state is published before the subscription succeeds
+
+Severity: High
+
+File references:
+
+- `src/ZB.MOM.WW.LmxOpcUa.Host/MxAccess/GalaxyRuntimeProbeManager.cs:193`
+- `src/ZB.MOM.WW.LmxOpcUa.Host/MxAccess/GalaxyRuntimeProbeManager.cs:201`
+- `src/ZB.MOM.WW.LmxOpcUa.Host/MxAccess/GalaxyRuntimeProbeManager.cs:222`
+- `src/ZB.MOM.WW.LmxOpcUa.Host/MxAccess/GalaxyRuntimeProbeManager.cs:225`
+- `src/ZB.MOM.WW.LmxOpcUa.Host/MxAccess/GalaxyRuntimeProbeManager.cs:343`
+
+`SyncAsync` adds entries to `_byProbe` and `_probeByGobjectId` before `SubscribeAsync` completes. If the advise call fails, the catch block logs the failure but leaves the probe registered internally. `Tick()` later treats that entry as a real advised probe that never produced an initial callback and transitions it from `Unknown` to `Stopped`.
+
+That creates a false-negative health signal: a host can be marked stopped even though the real problem was "subscription never established". In this codebase that distinction matters because runtime-host state is later used to suppress or degrade published node quality.
+
+Recommendation: only commit the new probe entry after a successful subscribe, or roll the dictionaries back in the catch path. Add a regression test for subscribe failure in `GalaxyRuntimeProbeManagerTests`.
+
+### 2. Service startup still ignores dashboard bind failure
+
+Severity: Medium
+
+File references:
+
+- `src/ZB.MOM.WW.LmxOpcUa.Host/Status/StatusWebServer.cs:50`
+- `src/ZB.MOM.WW.LmxOpcUa.Host/OpcUaService.cs:307`
+- `src/ZB.MOM.WW.LmxOpcUa.Host/OpcUaService.cs:308`
+
+`StatusWebServer.Start()` now correctly returns `bool`, but `OpcUaService.Start` still ignores that result. The service can therefore continue through startup and report success even when the dashboard failed to bind.
+
+This is not a process-crash bug, but it is still an operational stability issue because the service advertises a successful start while one of its enabled endpoints is unavailable.
+
+Recommendation: decide whether dashboard startup failure is fatal or degraded mode, then implement that policy explicitly. At minimum, surface the failure in service startup state instead of dropping the return value.
+
+### 3. Sync-over-async remains on critical request and rebuild paths
+
+Severity: Medium
+
+File references:
+
+- `src/ZB.MOM.WW.LmxOpcUa.Host/OpcUa/LmxNodeManager.cs:572`
+- `src/ZB.MOM.WW.LmxOpcUa.Host/OpcUa/LmxNodeManager.cs:1708`
+- `src/ZB.MOM.WW.LmxOpcUa.Host/OpcUa/LmxNodeManager.cs:1782`
+- `src/ZB.MOM.WW.LmxOpcUa.Host/OpcUa/LmxNodeManager.cs:2022`
+- `src/ZB.MOM.WW.LmxOpcUa.Host/OpcUa/LmxNodeManager.cs:2100`
+- `src/ZB.MOM.WW.LmxOpcUa.Host/OpcUa/LmxNodeManager.cs:2154`
+- `src/ZB.MOM.WW.LmxOpcUa.Host/OpcUa/LmxNodeManager.cs:2220`
+
+The updated code removed some blocking work from lock scopes, but several service-critical paths still call async MX access operations synchronously with `.GetAwaiter().GetResult()`. That pattern appears in address-space rebuild, direct read/write handling, and historian reads.
+
+I did not reproduce a deadlock in tests, but the pattern is still a stability risk because request threads now inherit backend latency directly and can stall hard if the underlying async path hangs, blocks on its own scheduler, or experiences slow reconnect behavior.
+
+Recommendation: keep the short synchronous boundary only where the external API forces it, and isolate backend calls behind bounded timeouts or dedicated worker threads. Rebuild-time probe synchronization is the highest-value place to reduce blocking first.
+
+### 4. Several background subscribe paths are still fire-and-forget
+
+Severity: Low
+
+File references:
+
+- `src/ZB.MOM.WW.LmxOpcUa.Host/OpcUa/LmxNodeManager.cs:858`
+- `src/ZB.MOM.WW.LmxOpcUa.Host/OpcUa/LmxNodeManager.cs:1362`
+- `src/ZB.MOM.WW.LmxOpcUa.Host/OpcUa/LmxNodeManager.cs:2481`
+
+Alarm auto-subscribe and transferred-subscription restore still dispatch `SubscribeAsync(...)` and attach a fault-only continuation. That is better than dropping exceptions completely, but these operations are still not lifecycle-coordinated. A rebuild or shutdown can move on while subscription work is still in flight.
+
+The practical outcome is transient mismatch rather than memory corruption: expected subscriptions can arrive late, and shutdown/rebuild sequencing is harder to reason about under backend slowness.
+
+Recommendation: track these tasks when ordering matters, or centralize them behind a subscription queue with explicit cancellation and shutdown semantics.
+
+## Verified Improvements Since The Previous Review
+
+The following areas that were previously risky now look materially better in the current code:
+
+- `StaComThread` now checks `PostThreadMessage` failures and faults pending work instead of leaving callers parked indefinitely.
+- `HistoryContinuationPointManager` now purges expired continuation points on retrieve and release, not only on store.
+- `ChangeDetectionService`, MX monitor, and the status web server now retain background task handles and wait briefly during stop.
+- `StatusWebServer` no longer swallows startup failure silently; it returns a success flag and logs the failure.
+- Connection string validation now redacts credentials before logging.
+
+## Overall Assessment
+
+The updated code is in better shape than the previous pass. The most serious prior shutdown and leak hazards have been addressed, and the full automated test suite is currently green.
+
+The remaining stability work is concentrated in two areas:
+
+1. Correctness around failed runtime-probe subscription.
+2. Reducing synchronous waits and untracked background subscription work in the OPC UA node manager.