Close all four findings from the 2026-04-13 stability review: a failed runtime probe subscription can no longer leave a phantom entry that Tick() flips to Stopped, fanning out false BadOutOfService quality across a host's subtree; a silently failed dashboard bind no longer lets the service advertise a successful start while an operator-visible endpoint is dead; the seven sync-over-async sites in LmxNodeManager (rebuild probe sync, Read, Write, and the four HistoryRead overrides) can no longer park the OPC UA stack thread indefinitely on a hung backend; and alarm auto-subscribe and transferred-subscription restore no longer race shutdown as untracked fire-and-forget tasks.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Joseph Doherty
2026-04-14 00:48:07 -04:00
parent 731092595f
commit c76ab8fdee
21 changed files with 869 additions and 53 deletions


@@ -76,6 +76,7 @@ Controls the MXAccess runtime connection used for live tag reads and writes. Def
| `ProbeStaleThresholdSeconds` | `int` | `60` | Seconds a probe value may remain unchanged before the connection is considered stale |
| `RuntimeStatusProbesEnabled` | `bool` | `true` | Advises `<Host>.ScanState` on every deployed `$WinPlatform` and `$AppEngine` to track per-host runtime state. Drives the Galaxy Runtime dashboard panel, HealthCheck Rule 2e, and the Read-path short-circuit that invalidates OPC UA variable quality when a host is Stopped. Set `false` to return to legacy behavior where host state is invisible and the bridge serves whatever quality MxAccess reports for individual tags. See [MXAccess Bridge](MxAccessBridge.md#per-host-runtime-status-probes-hostscanstate) |
| `RuntimeStatusUnknownTimeoutSeconds` | `int` | `15` | Maximum seconds to wait for the initial probe callback before marking a host as Stopped. Only applies to the Unknown → Stopped transition; Running hosts never time out because `ScanState` is delivered on-change only. A value below 5s triggers a validator warning |
| `RequestTimeoutSeconds` | `int` | `30` | Outer safety timeout applied to sync-over-async MxAccess operations invoked from the OPC UA stack thread (Read, Write, address-space rebuild probe sync). Backstop for the inner `ReadTimeoutSeconds` / `WriteTimeoutSeconds`. A timed-out operation returns `BadTimeout`. Validator rejects values < 1 and warns if set below the inner Read/Write timeouts. See [MXAccess Bridge](MxAccessBridge.md#request-timeout-safety-backstop). Stability review 2026-04-13 Finding 3 |
### GalaxyRepository
@@ -112,7 +113,8 @@ Controls the Wonderware Historian SDK connection for OPC UA historical data acce
| `UserName` | `string?` | `null` | Username when `IntegratedSecurity` is false |
| `Password` | `string?` | `null` | Password when `IntegratedSecurity` is false |
| `Port` | `int` | `32568` | Historian TCP port |
| `CommandTimeoutSeconds` | `int` | `30` | SDK packet timeout in seconds |
| `CommandTimeoutSeconds` | `int` | `30` | SDK packet timeout in seconds (inner async bound) |
| `RequestTimeoutSeconds` | `int` | `60` | Outer safety timeout applied to sync-over-async Historian operations invoked from the OPC UA stack thread (`HistoryReadRaw`, `HistoryReadProcessed`, `HistoryReadAtTime`, `HistoryReadEvents`). Backstop for `CommandTimeoutSeconds`; a timed-out read returns `BadTimeout`. Validator rejects values < 1 and warns if set below `CommandTimeoutSeconds`. Stability review 2026-04-13 Finding 3 |
| `MaxValuesPerRead` | `int` | `10000` | Maximum values returned per `HistoryRead` request |
### Authentication
@@ -310,7 +312,8 @@ Integration tests use this constructor to inject substitute implementations of `
"ProbeTag": null,
"ProbeStaleThresholdSeconds": 60,
"RuntimeStatusProbesEnabled": true,
"RuntimeStatusUnknownTimeoutSeconds": 15
"RuntimeStatusUnknownTimeoutSeconds": 15,
"RequestTimeoutSeconds": 30
},
"GalaxyRepository": {
"ConnectionString": "Server=localhost;Database=ZB;Integrated Security=true;",
@@ -333,6 +336,7 @@ Integration tests use this constructor to inject substitute implementations of `
"Password": null,
"Port": 32568,
"CommandTimeoutSeconds": 30,
"RequestTimeoutSeconds": 60,
"MaxValuesPerRead": 10000
},
"Authentication": {


@@ -54,6 +54,7 @@ public class HistorianConfiguration
public int Port { get; set; } = 32568;
public int CommandTimeoutSeconds { get; set; } = 30;
public int MaxValuesPerRead { get; set; } = 10000;
public int RequestTimeoutSeconds { get; set; } = 60;
}
```
@@ -70,7 +71,8 @@ When `Enabled` is `false`, `HistorianPluginLoader.TryLoad` is not called, no plu
| `UserName` | `null` | Username when `IntegratedSecurity` is false |
| `Password` | `null` | Password when `IntegratedSecurity` is false |
| `Port` | `32568` | Historian TCP port |
| `CommandTimeoutSeconds` | `30` | SDK packet timeout in seconds |
| `CommandTimeoutSeconds` | `30` | SDK packet timeout in seconds (inner async bound) |
| `RequestTimeoutSeconds` | `60` | Outer safety timeout applied to sync-over-async history reads on the OPC UA stack thread. Backstop for `CommandTimeoutSeconds`; a timed-out read returns `BadTimeout`. Should be greater than `CommandTimeoutSeconds`. Stability review 2026-04-13 Finding 3 |
| `MaxValuesPerRead` | `10000` | Maximum values per history read request |
## Connection Lifecycle


@@ -108,6 +108,7 @@ Enabled by default via `MxAccess.RuntimeStatusProbesEnabled`; see [Configuration
2. **Transition predicate** — A probe callback is interpreted as `isRunning = vtq.Quality.IsGood() && vtq.Value is bool b && b`. Everything else (explicit `ScanState = false`, bad quality, communication errors from the broker) means **Stopped**.
3. **On-change-only delivery** — `ScanState` is delivered **only when the value actually changes**. A stably Running host may go hours without a callback. The probe manager's `Tick()` explicitly does NOT run a starvation check on Running entries — the only time-based transition is **Unknown → Stopped** when the initial callback hasn't arrived within `RuntimeStatusUnknownTimeoutSeconds` (default 15s). This protects against a probe that fails to resolve at all without incorrectly flipping healthy long-running hosts.
4. **Transport gating** — When `IMxAccessClient.State != Connected`, `GetSnapshot()` forces every entry to `Unknown` regardless of underlying state. The dashboard shows the Connection panel as the primary signal in that case rather than misleading operators with "every host stopped."
5. **Subscribe failure rollback** — If `SubscribeAsync` throws for a new probe (SDK failure, broker rejection, transport error), the manager rolls back both `_byProbe` and `_probeByGobjectId` so the probe never appears in `GetSnapshot()`. Without this rollback, a failed subscribe would leave the entry in `Unknown` forever, and `Tick()` would later transition it to `Stopped` after the unknown-resolution timeout, fanning out a **false-negative** host-down signal that invalidates the subtree of a host that was never actually advised. Stability review 2026-04-13 Finding 1.
### Subtree quality invalidation on transition
@@ -137,6 +138,14 @@ See the `runtimestatus.md` plan file and the `service_info.md` entry for the in-
See [Status Dashboard](StatusDashboard.md#galaxy-runtime) for the field table and [Configuration](Configuration.md#mxaccess) for the two new config fields.
## Request Timeout Safety Backstop
Every sync-over-async site on the OPC UA stack thread that calls into MxAccess (`Read`, `Write`, address-space rebuild probe sync) is wrapped in a bounded `SyncOverAsync.WaitSync(...)` helper with timeout `MxAccess.RequestTimeoutSeconds` (default 30s). This is a backstop: `MxAccessClient.Read/Write` already enforce inner `ReadTimeoutSeconds` / `WriteTimeoutSeconds` bounds on the async path. The outer wrapper exists so a scheduler stall, slow reconnect, or any other non-returning async path cannot park the stack thread indefinitely.
On timeout, the underlying task is **not** cancelled — it runs to completion on the thread pool and is abandoned. This is acceptable because MxAccess clients are shared singletons and the abandoned continuation does not capture request-scoped state. The OPC UA stack receives `StatusCodes.BadTimeout` on the affected operation.
`ConfigurationValidator` enforces `RequestTimeoutSeconds >= 1` and warns when it is set below the inner Read/Write timeouts (operator misconfiguration). Stability review 2026-04-13 Finding 3.
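The shape of the helper and its call-site mapping can be sketched as follows — a minimal illustration under the semantics described above, not the shipped implementation; only the `WaitSync` name and the `TimeoutException` → `BadTimeout` mapping come from this document:

```csharp
using System;
using System.Threading.Tasks;

internal static class SyncOverAsyncSketch
{
    // Bounded wait on an already-running task. On timeout the task is NOT
    // cancelled — it is abandoned and runs to completion on the thread pool.
    public static T WaitSync<T>(Task<T> task, TimeSpan timeout, string operation)
    {
        if (task is null) throw new ArgumentNullException(nameof(task));
        if (Task.WaitAny(new Task[] { task }, timeout) == -1)
            throw new TimeoutException($"{operation} did not complete within {timeout.TotalSeconds:F0}s");
        return task.GetAwaiter().GetResult(); // rethrows a fault as the unwrapped inner exception
    }
}

// Hypothetical call-site pattern on the OPC UA stack thread:
//   try { value = SyncOverAsyncSketch.WaitSync(client.ReadAsync(tag), _requestTimeout, "Read"); }
//   catch (TimeoutException) { result.StatusCode = StatusCodes.BadTimeout; }
```

`Task.WaitAny` returns `-1` on timeout without observing a fault, so a faulted task still surfaces its real inner exception through `GetResult()` rather than a `TimeoutException`.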
## Why Marshal.ReleaseComObject Is Needed
The .NET runtime's garbage collector releases COM references non-deterministically. For MXAccess, delayed release can leave stale COM connections open, preventing clean re-registration. `MxProxyAdapter.Unregister` calls `Marshal.ReleaseComObject(_lmxProxy)` in a `finally` block to immediately release the COM reference count to zero. This ensures the underlying COM server is freed before a reconnect attempt creates a new instance.
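The `finally`-block pattern reduces to roughly the following — a sketch only; the SDK call is elided and the field name is an assumption, with the real code in `MxProxyAdapter.Unregister`:

```csharp
try
{
    // ... LMX unregister call against the MXAccess SDK (elided) ...
}
finally
{
    // Deterministic release instead of waiting for a GC finalizer pass.
    if (_lmxProxy is not null && Marshal.IsComObject(_lmxProxy))
        Marshal.ReleaseComObject(_lmxProxy);
    _lmxProxy = null; // never touch the RCW again after release
}
```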


@@ -65,7 +65,7 @@ This is necessary because Windows services default their working directory to `S
9. **Query Galaxy hierarchy** -- Fetches the object hierarchy and attribute definitions from the Galaxy repository database, recording object and attribute counts.
10. **Start server and build address space** -- Starts the OPC UA server, retrieves the `LmxNodeManager`, and calls `BuildAddressSpace()` with the queried hierarchy and attributes. If the query or build fails, the server still starts with an empty address space.
11. **Start change detection** -- Creates and starts `ChangeDetectionService`, which polls `galaxy.time_of_last_deploy` at the configured interval. When a change is detected, it triggers an address-space rebuild via the `OnGalaxyChanged` event.
12. **Start status dashboard** -- Creates the `HealthCheckService` and `StatusReportService`, wires in all live components, and starts the `StatusWebServer` HTTP listener if the dashboard is enabled.
12. **Start status dashboard** -- Creates the `HealthCheckService` and `StatusReportService`, wires in all live components, and starts the `StatusWebServer` HTTP listener if the dashboard is enabled. If `StatusWebServer.Start()` returns `false` (port already bound, insufficient permissions, etc.), the service logs a warning, disposes the unstarted instance, sets `OpcUaService.DashboardStartFailed = true`, and continues in degraded mode. Matches the warning-continue policy applied to MxAccess connect, Galaxy DB connect, and initial address space build. Stability review 2026-04-13 Finding 2.
13. **Log startup complete** -- Logs "LmxOpcUa service started successfully" at `Information` level.
## Shutdown Sequence


@@ -243,6 +243,16 @@ The dashboard is configured through the `Dashboard` section in `appsettings.json
Setting `Enabled` to `false` prevents the `StatusWebServer` from starting. The `StatusReportService` is still created so that other components can query health programmatically, but no HTTP listener is opened.
### Dashboard start failures are non-fatal
If the dashboard is enabled but the configured port is already bound (e.g., a previous instance did not clean up, another service is squatting on the port, or the user lacks URL-reservation rights), `StatusWebServer.Start()` logs the listener exception at Error level and returns `false`. `OpcUaService` then logs a Warning, disposes the unstarted instance, sets `DashboardStartFailed = true`, and continues in degraded mode — the OPC UA endpoint still starts. Operators can detect the failure by searching the service log for:
```
[WRN] Status dashboard failed to bind on port {Port}; service continues without dashboard
```
Stability review 2026-04-13 Finding 2.
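The degraded-mode handling in `OpcUaService.Start()` amounts to roughly the following sketch (logger and field names other than `DashboardStartFailed` are assumptions):

```csharp
if (!_statusWebServer.Start())
{
    // Bind failed (port in use, missing URL reservation, ...): degrade, don't abort.
    _logger.Warning(
        "Status dashboard failed to bind on port {Port}; service continues without dashboard",
        _dashboardPort);
    _statusWebServer.Dispose();   // release the unstarted listener
    _statusWebServer = null;
    DashboardStartFailed = true;  // visible to health checks and tests
}
// OPC UA endpoint startup continues either way.
```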
## Component Wiring
`StatusReportService` is initialized after all other service components are created. `OpcUaService.Start()` calls `SetComponents()` to supply the live references, including the historian configuration so the dashboard can label the plugin target and evaluate Rule 2b:


@@ -0,0 +1,104 @@
# Stability Review - 2026-04-13
## Scope
Re-review of the updated `lmxopcua` codebase with emphasis on stability, shutdown behavior, async usage, latent deadlock patterns, and silent failure modes.
Validation run for this review:
```powershell
dotnet test tests\ZB.MOM.WW.LmxOpcUa.Tests\ZB.MOM.WW.LmxOpcUa.Tests.csproj --no-restore
```
Result: `471/471` tests passed in approximately `3m18s`.
## Confirmed Findings
### 1. Probe state is published before the subscription succeeds
Severity: High
File references:
- `src/ZB.MOM.WW.LmxOpcUa.Host/MxAccess/GalaxyRuntimeProbeManager.cs:193`
- `src/ZB.MOM.WW.LmxOpcUa.Host/MxAccess/GalaxyRuntimeProbeManager.cs:201`
- `src/ZB.MOM.WW.LmxOpcUa.Host/MxAccess/GalaxyRuntimeProbeManager.cs:222`
- `src/ZB.MOM.WW.LmxOpcUa.Host/MxAccess/GalaxyRuntimeProbeManager.cs:225`
- `src/ZB.MOM.WW.LmxOpcUa.Host/MxAccess/GalaxyRuntimeProbeManager.cs:343`
`SyncAsync` adds entries to `_byProbe` and `_probeByGobjectId` before `SubscribeAsync` completes. If the advise call fails, the catch block logs the failure but leaves the probe registered internally. `Tick()` later treats that entry as a real advised probe that never produced an initial callback and transitions it from `Unknown` to `Stopped`.
That creates a false-negative health signal: a host can be marked stopped even though the real problem was "subscription never established". In this codebase that distinction matters because runtime-host state is later used to suppress or degrade published node quality.
Recommendation: only commit the new probe entry after a successful subscribe, or roll the dictionaries back in the catch path. Add a regression test for subscribe failure in `GalaxyRuntimeProbeManagerTests`.
### 2. Service startup still ignores dashboard bind failure
Severity: Medium
File references:
- `src/ZB.MOM.WW.LmxOpcUa.Host/Status/StatusWebServer.cs:50`
- `src/ZB.MOM.WW.LmxOpcUa.Host/OpcUaService.cs:307`
- `src/ZB.MOM.WW.LmxOpcUa.Host/OpcUaService.cs:308`
`StatusWebServer.Start()` now correctly returns `bool`, but `OpcUaService.Start` still ignores that result. The service can therefore continue through startup and report success even when the dashboard failed to bind.
This is not a process-crash bug, but it is still an operational stability issue because the service advertises a successful start while one of its enabled endpoints is unavailable.
Recommendation: decide whether dashboard startup failure is fatal or degraded mode, then implement that policy explicitly. At minimum, surface the failure in service startup state instead of dropping the return value.
### 3. Sync-over-async remains on critical request and rebuild paths
Severity: Medium
File references:
- `src/ZB.MOM.WW.LmxOpcUa.Host/OpcUa/LmxNodeManager.cs:572`
- `src/ZB.MOM.WW.LmxOpcUa.Host/OpcUa/LmxNodeManager.cs:1708`
- `src/ZB.MOM.WW.LmxOpcUa.Host/OpcUa/LmxNodeManager.cs:1782`
- `src/ZB.MOM.WW.LmxOpcUa.Host/OpcUa/LmxNodeManager.cs:2022`
- `src/ZB.MOM.WW.LmxOpcUa.Host/OpcUa/LmxNodeManager.cs:2100`
- `src/ZB.MOM.WW.LmxOpcUa.Host/OpcUa/LmxNodeManager.cs:2154`
- `src/ZB.MOM.WW.LmxOpcUa.Host/OpcUa/LmxNodeManager.cs:2220`
The updated code removed some blocking work from lock scopes, but several service-critical paths still call async MX access operations synchronously with `.GetAwaiter().GetResult()`. That pattern appears in address-space rebuild, direct read/write handling, and historian reads.
I did not reproduce a deadlock in tests, but the pattern is still a stability risk because request threads now inherit backend latency directly and can stall hard if the underlying async path hangs, blocks on its own scheduler, or experiences slow reconnect behavior.
Recommendation: keep the short synchronous boundary only where the external API forces it, and isolate backend calls behind bounded timeouts or dedicated worker threads. Rebuild-time probe synchronization is the highest-value place to reduce blocking first.
### 4. Several background subscribe paths are still fire-and-forget
Severity: Low
File references:
- `src/ZB.MOM.WW.LmxOpcUa.Host/OpcUa/LmxNodeManager.cs:858`
- `src/ZB.MOM.WW.LmxOpcUa.Host/OpcUa/LmxNodeManager.cs:1362`
- `src/ZB.MOM.WW.LmxOpcUa.Host/OpcUa/LmxNodeManager.cs:2481`
Alarm auto-subscribe and transferred-subscription restore still dispatch `SubscribeAsync(...)` and attach a fault-only continuation. That is better than dropping exceptions completely, but these operations are still not lifecycle-coordinated. A rebuild or shutdown can move on while subscription work is still in flight.
The practical outcome is transient mismatch rather than memory corruption: expected subscriptions can arrive late, and shutdown/rebuild sequencing is harder to reason about under backend slowness.
Recommendation: track these tasks when ordering matters, or centralize them behind a subscription queue with explicit cancellation and shutdown semantics.
## Verified Improvements Since The Previous Review
The following areas that were previously risky now look materially better in the current code:
- `StaComThread` now checks `PostThreadMessage` failures and faults pending work instead of leaving callers parked indefinitely.
- `HistoryContinuationPointManager` now purges expired continuation points on retrieve and release, not only on store.
- `ChangeDetectionService`, MX monitor, and the status web server now retain background task handles and wait briefly during stop.
- `StatusWebServer` no longer swallows startup failure silently; it returns a success flag and logs the failure.
- Connection string validation now redacts credentials before logging.
## Overall Assessment
The updated code is in better shape than the previous pass. The most serious prior shutdown and leak hazards have been addressed, and the full automated test suite is currently green.
The remaining stability work is concentrated in two areas:
1. Correctness around failed runtime-probe subscription.
2. Reducing synchronous waits and untracked background subscription work in the OPC UA node manager.


@@ -678,6 +678,154 @@ The original production issue had two symptoms: (1) incomplete quality flip and
- **Dispatch suppression** — gated on observing whether the subtree walk alone resolves the client freeze in production.
- **Documentation updates** — the `docs/` guides (`MxAccessBridge.md`, `StatusDashboard.md`, `Configuration.md`, `HistoricalDataAccess.md`) still describe the pre-runtime-status behavior. Need a consolidated doc pass covering this feature plus the historian cluster + health surface updates from earlier today.
## Stability Review Fixes 2026-04-14
Code changes only — **not yet deployed** to the instance1/instance2 services. Closes all four residual findings from `docs/stability-review-20260413.md`; the document was green on shipped features but flagged latent defects that degraded the stability guarantees the runtime-status feature relies on. Deploy procedure at the end of this section.
### Findings closed
**Finding 1 (High) — Probe rollback on subscribe failure.**
`GalaxyRuntimeProbeManager.SyncAsync` pre-populated `_byProbe` / `_probeByGobjectId` before awaiting `SubscribeAsync`. When the advise call threw, the catch block logged a warning but left the phantom entry in place; `Tick()` later transitioned it from Unknown to Stopped after `RuntimeStatusUnknownTimeoutSeconds`, firing `_onHostStopped` and walking the subtree of a host that was never actually advised. In a codebase where the same probe manager also drives the Read-path short-circuit and subtree quality invalidation (the 2026-04-13 feature), a false-negative here fans out into hundreds of BadOutOfService flags on live variables. Fix: promote `toSubscribe` to `List<(int GobjectId, string Probe)>` so the catch path can reacquire `_lock` and remove both dictionaries. The rollback compares against the captured probe string before removing so a concurrent resync cannot accidentally delete a legitimate re-add.
**Finding 2 (Medium) — Surface dashboard bind failure.**
`StatusWebServer.Start()` already returned `bool`, but `OpcUaService.Start()` ignored it, so a failed bind (port in use, permissions) was invisible at the service level. Fix: capture the return value, on `false` log a Warning (`Status dashboard failed to bind on port {Port}; service continues without dashboard`), dispose the unstarted instance, and set a new `OpcUaService.DashboardStartFailed` property. Degraded mode — matches the established precedent for other optional startup subsystems (MxAccess connect, Galaxy DB connect, initial address space build).
**Finding 3 (Medium) — Bounded timeouts on sync-over-async.**
Seven sync-over-async `.GetAwaiter().GetResult()` sites in `LmxNodeManager` (rebuild probe sync, Read, Write, HistoryReadRaw/Processed/AtTime/Events) blocked the OPC UA stack thread without an outer bound. Inner `MxAccessClient.ReadAsync` / `WriteAsync` already apply per-call `CancelAfter`, but `SubscribeAsync`, `SyncAsync`, and the historian reads did not — and the pattern itself is a stability risk regardless of inner behavior. Fix: new `SyncOverAsync.WaitSync(task, timeout, operation)` helper + two new config fields `MxAccess.RequestTimeoutSeconds=30` and `Historian.RequestTimeoutSeconds=60`. Every sync-over-async site now wraps the task in `WaitSync`, catches `TimeoutException` explicitly, and maps to `StatusCodes.BadTimeout` (or logs a warning and continues in the rebuild case — probe sync is advisory). `ConfigurationValidator` rejects `RequestTimeoutSeconds < 1` for both and warns when operators set the outer bound below the inner read/write / command timeout.
**Finding 4 (Low) — Track fire-and-forget subscribes.**
Alarm auto-subscribe, subtree alarm auto-subscribe, and transferred-subscription restore all called `_mxAccessClient.SubscribeAsync(...).ContinueWith(..., OnlyOnFaulted)` with no tracking, so shutdown raced pending subscribes and ordering was impossible to reason about. Fix: new `TrackBackgroundSubscribe(tag, context)` helper in `LmxNodeManager` that stashes the task in `_pendingBackgroundSubscribes` (a `ConcurrentDictionary<long, Task>` with a monotonic `Interlocked.Increment` id), and a continuation that removes the entry and logs faults with the supplied context. `Dispose(bool)` drains the dictionary with `Task.WaitAll(snapshot, 5s)` after stopping the dispatch thread — bounded so shutdown cannot stall on a hung backend, and logged at Info so operators can see the drain count.
### Code changes
- `src/ZB.MOM.WW.LmxOpcUa.Host/MxAccess/GalaxyRuntimeProbeManager.cs` — `toSubscribe` carries the gobject id; catch path rolls back both dictionaries under `_lock`, with concurrent-overwrite guard.
- `src/ZB.MOM.WW.LmxOpcUa.Host/OpcUaService.cs` — capture `StatusWeb.Start()` return; `DashboardStartFailed` internal property; dispose unstarted instance on failure.
- `src/ZB.MOM.WW.LmxOpcUa.Host/Utilities/SyncOverAsync.cs` (new) — `WaitSync<T>(Task<T>, TimeSpan, string)` and non-generic overload with inner-exception unwrap.
- `src/ZB.MOM.WW.LmxOpcUa.Host/Configuration/MxAccessConfiguration.cs` — `RequestTimeoutSeconds: int = 30`.
- `src/ZB.MOM.WW.LmxOpcUa.Host/Configuration/HistorianConfiguration.cs` — `RequestTimeoutSeconds: int = 60`.
- `src/ZB.MOM.WW.LmxOpcUa.Host/Configuration/ConfigurationValidator.cs` — logs both new values, rejects `< 1`, warns on inner/outer misorder.
- `src/ZB.MOM.WW.LmxOpcUa.Host/OpcUa/LmxNodeManager.cs` — constructor takes the two new timeout values (with defaults); seven sync-over-async call sites wrapped in `SyncOverAsync.WaitSync` + `TimeoutException → BadTimeout` catch; `TrackBackgroundSubscribe` helper; `_pendingBackgroundSubscribes` dictionary + `_backgroundSubscribeCounter`; `DrainPendingBackgroundSubscribes()` in `Dispose`; three fire-and-forget sites replaced with helper calls.
- `src/ZB.MOM.WW.LmxOpcUa.Host/OpcUa/LmxOpcUaServer.cs` — constructor plumbing for the two new timeouts.
- `src/ZB.MOM.WW.LmxOpcUa.Host/OpcUa/OpcUaServerHost.cs` — accepts `HistorianConfiguration`, threads both timeouts through to `LmxOpcUaServer`.
### Tests
- `tests/.../MxAccess/GalaxyRuntimeProbeManagerTests.cs` — 3 new tests: `Sync_SubscribeThrows_DoesNotLeavePhantomEntry`, `Sync_SubscribeThrows_TickDoesNotFireStopCallback`, `Sync_SubscribeSucceedsAfterRetry_AppearsInSnapshot`. Use the existing `FakeMxAccessClient.SubscribeException` hook — no helper changes needed.
- `tests/.../Status/StatusWebServerTests.cs` — 1 new test: `Start_WhenPortInUse_ReturnsFalse`. Grabs a port with a throwaway `HttpListener`, tries to start `StatusWebServer` on the same port, asserts `Start()` returns `false`.
- `tests/.../Wiring/OpcUaServiceDashboardFailureTests.cs` (new) — 1 test: `Start_DashboardPortInUse_ContinuesInDegradedMode`. Builds a full `OpcUaService` with `FakeMxProxy` + `FakeGalaxyRepository`, binds the dashboard port externally, starts the service, asserts `ServerHost != null`, `DashboardStartFailed == true`, `StatusWeb == null`.
- `tests/.../Utilities/SyncOverAsyncTests.cs` (new) — 7 tests covering happy path, never-completing task → TimeoutException with operation name, faulted task → inner exception unwrap, null-task arg check.
- `tests/.../Configuration/ConfigurationLoadingTests.cs` — 3 new tests: `Validator_MxAccessRequestTimeoutZero_ReturnsFalse`, `Validator_HistorianRequestTimeoutZero_ReturnsFalse`, `Validator_DefaultRequestTimeouts_AreSensible`.
**Test results:** full suite **486/486** passing. First run hit a single transient failure in `ChangeDetectionServiceTests.ChangedTimestamp_TriggersAgain` (pre-existing timing-sensitive test — poll interval 1s with 500ms + 1500ms sleeps races under load); the test passes on retry and is unrelated to these changes. The 15 new tests added by this pass were green on both runs.
### Documentation updates
- `docs/MxAccessBridge.md` — Runtime-status section gains a new point 5 documenting the subscribe-failure rollback; new "Request Timeout Safety Backstop" section describing the outer `RequestTimeoutSeconds` bound.
- `docs/HistoricalDataAccess.md` — Config class snippet and property table updated with `RequestTimeoutSeconds`.
- `docs/ServiceHosting.md` — Step 12 (startup sequence) documents the degraded-mode dashboard policy and the new `DashboardStartFailed` flag.
- `docs/Configuration.md` — `MxAccess.RequestTimeoutSeconds` (30s) and `Historian.RequestTimeoutSeconds` (60s) added to both the property tables and the `appsettings.json` full example.
- `docs/StatusDashboard.md` — New subsection "Dashboard start failures are non-fatal" with the log grep operators should use.
### Deploy plan (not yet executed)
This is a code-only change; the built binary has not been copied to `C:\publish\lmxopcua\instance1` / `instance2` yet. When deploying, follow the procedure from the 2026-04-13 runtime-status deploy (service_info.md:572-680):
1. Backup `C:\publish\lmxopcua\instance1` and `instance2` to `backups\20260414-<HHMMSS>-instance{1,2}`. Preserve each `appsettings.json`.
2. Build the Host project in Release and copy `ZB.MOM.WW.LmxOpcUa.Host.exe` (and any changed DLLs) to both instance roots. The Historian plugin layout at `<instance>\Historian\` is unchanged.
3. Restart the `LmxOpcUa` and `LmxOpcUa2` Windows services.
4. In the startup log for each instance, verify the new config echoes appear:
- `MxAccess.RuntimeStatusProbesEnabled=..., RuntimeStatusUnknownTimeoutSeconds=15s, RequestTimeoutSeconds=30s`
- `Historian.CommandTimeoutSeconds=30, MaxValuesPerRead=10000, FailureCooldownSeconds=60, RequestTimeoutSeconds=60`
- `=== Configuration Valid ===`
- `LmxOpcUa service started successfully`
5. CLI smoke test on both endpoints (matches the 2026-03-25 baseline at service_info.md:370-376):
- `opcuacli-dotnet.exe connect -u opc.tcp://localhost:4840/LmxOpcUa`
- `opcuacli-dotnet.exe connect -u opc.tcp://localhost:4841/LmxOpcUa`
- `opcuacli-dotnet.exe redundancy -u opc.tcp://localhost:4840/LmxOpcUa` → ServiceLevel=200 (primary)
- `opcuacli-dotnet.exe redundancy -u opc.tcp://localhost:4841/LmxOpcUa` → ServiceLevel=150 (secondary)
6. Runtime-status regression check (the most sensitive thing these fixes could break): repeat the IDE OffScan / OnScan cycle documented at service_info.md:630-669. Dashboard at `http://localhost:8085/` must go red on OffScan, green on OnScan; `TestMachine_001.MachineID` must flip between `0x808D0000 BadOutOfService` and `0x00000000 Good` at each transition with the same sub-100ms latency as the original deploy.
7. Record PIDs and live verification results in a follow-up section of this file (`## Stability Review Fixes 2026-04-14 — Deploy`), matching the layout conventions from earlier entries.
### Finding 1 manual regression check
Before the regression test landed, the only way to exercise the bug in production was to temporarily revoke the MxAccess user's probe subscription permission or point the probe manager at a non-existent host. After the fix, the same scenarios should leave `GetSnapshot()` empty (no phantom entries) and the dashboard Galaxy Runtime panel should read `0 of N hosts` rather than `0 running, N stopped`. The three new `GalaxyRuntimeProbeManagerTests` cover this deterministically via `FakeMxAccessClient.SubscribeException` so a future regression is caught at CI time.
### Risk notes
- **Timeout floor discipline.** The two new `RequestTimeoutSeconds` values have conservative defaults (30s MxAccess, 60s Historian). Setting them too low would cause spurious `BadTimeout` errors on a slow-but-healthy backend. `ConfigurationValidator` rejects `< 1` and warns below inner timeouts so misconfiguration is visible at startup.
- **Abandoned tasks on timeout.** `SyncOverAsync.WaitSync` does not cancel the underlying task — it runs to completion on the thread pool and is abandoned. This is acceptable because MxAccess / Historian clients are shared singletons whose background work does not capture request-scoped state.
- **Background subscribe drain window.** 5 seconds is enough for healthy subscribes to settle but not long enough to stall shutdown if MxAccess is hung. If drain times out, shutdown continues — this is intentional.
- **Probe rollback concurrency.** The catch path reacquires `_lock` after `await`. A concurrent `SyncAsync` may have re-added the same gobject under a new probe name; the code compares against the captured probe string before removing, so a legitimate re-add is not clobbered.
## Stability Review Fixes 2026-04-14 — Deploy
Updated: `2026-04-14 00:40-00:43 America/New_York`
Both instances redeployed with the stability-review fixes documented above. Closes all four findings from `docs/stability-review-20260413.md` on the live services.
Backups:
- `C:\publish\lmxopcua\backups\20260414-003948-instance1` — pre-deploy `ZB.MOM.WW.LmxOpcUa.Host.exe` (7,997,952 bytes) + `appsettings.json`
- `C:\publish\lmxopcua\backups\20260414-003948-instance2` — pre-deploy `ZB.MOM.WW.LmxOpcUa.Host.exe` (7,997,952 bytes) + `appsettings.json`
Configuration preserved:
- Neither instance's `appsettings.json` was overwritten. The two new fields (`MxAccess.RequestTimeoutSeconds`, `Historian.RequestTimeoutSeconds`) inherit their defaults from the binary (30s and 60s respectively). Operators can opt into explicit values by editing `appsettings.json`; defaults are logged at startup regardless.
Deployed binary (both instances):
- `ZB.MOM.WW.LmxOpcUa.Host.exe`
- Last write time: `2026-04-14 00:40:48 -04:00`
- Size: `7,986,688` bytes (down 11,264 bytes from the previous build — three fire-and-forget `.ContinueWith` blocks were replaced with a single `TrackBackgroundSubscribe` helper)
Pre-deploy state note: both services were STOPPED when the deploy started (`sc.exe query` reported `WIN32_EXIT_CODE=1067`), but two host processes were still alive (`tasklist` showed PID 34828 holding instance1 and PID 27036 holding instance2). The zombies held open file handles on both exes, so the Windows SCM's "STOPPED" state was lying — the previous service processes were still running out-of-band of the SCM. The zombie processes were terminated with `taskkill //F` before copying the new binary. This is a one-shot clean-up: the new deploy does not require the same.
Windows services:
- `LmxOpcUa` — Running, PID `32884`
- `LmxOpcUa2` — Running, PID `40796`
Restart evidence (instance1 `logs/lmxopcua-20260414.log`):
```
2026-04-14 00:40:55.759 -04:00 [INF] MxAccess.RuntimeStatusProbesEnabled=true, RuntimeStatusUnknownTimeoutSeconds=15s, RequestTimeoutSeconds=30s
2026-04-14 00:40:55.791 -04:00 [INF] Historian.CommandTimeoutSeconds=30, MaxValuesPerRead=10000, FailureCooldownSeconds=60, RequestTimeoutSeconds=60
2026-04-14 00:40:55.794 -04:00 [INF] === Configuration Valid ===
2026-04-14 00:41:02.406 -04:00 [INF] Historian plugin loaded from C:\publish\lmxopcua\instance1\Historian\ZB.MOM.WW.LmxOpcUa.Historian.Aveva.dll
2026-04-14 00:41:06.870 -04:00 [INF] LmxOpcUa service started successfully
```
Restart evidence (instance2 `logs/lmxopcua-20260414.log`):
```
2026-04-14 00:40:56.812 -04:00 [INF] MxAccess.RuntimeStatusProbesEnabled=true, RuntimeStatusUnknownTimeoutSeconds=15s, RequestTimeoutSeconds=30s
2026-04-14 00:40:56.847 -04:00 [INF] Historian.CommandTimeoutSeconds=30, MaxValuesPerRead=10000, FailureCooldownSeconds=60, RequestTimeoutSeconds=60
2026-04-14 00:40:56.850 -04:00 [INF] === Configuration Valid ===
2026-04-14 00:41:07.805 -04:00 [INF] Historian plugin loaded from C:\publish\lmxopcua\instance2\Historian\ZB.MOM.WW.LmxOpcUa.Historian.Aveva.dll
2026-04-14 00:41:12.008 -04:00 [INF] LmxOpcUa service started successfully
```
The two new `RequestTimeoutSeconds` values are visible in both startup traces, confirming the new configuration plumbing reached `ConfigurationValidator`. Startup latency (config-valid → service-started): instance1 ~11.1s, instance2 ~15.2s — within the normal envelope for the Historian plugin load sequence.
CLI verification (via `dotnet run --project src/ZB.MOM.WW.LmxOpcUa.Client.CLI`):
```
connect opc.tcp://localhost:4840/LmxOpcUa → Server: LmxOpcUa, Security: None, Connection successful
connect opc.tcp://localhost:4841/LmxOpcUa → Server: LmxOpcUa2, Security: None, Connection successful
redundancy opc.tcp://localhost:4840/LmxOpcUa → Warm, ServiceLevel=200, urn:localhost:LmxOpcUa:instance1
redundancy opc.tcp://localhost:4841/LmxOpcUa → Warm, ServiceLevel=150, urn:localhost:LmxOpcUa:instance2
read opc.tcp://localhost:4840/LmxOpcUa -n 'ns=3;s=MESReceiver_001.MoveInPartNumbers'
→ Value: System.String[]
→ Status: 0x00000000
→ Source: 2026-04-14T04:43:46.2267096Z
```
Primary advertises ServiceLevel 200, secondary advertises 150 — redundancy baseline preserved. End-to-end data flow is healthy: the Read on `MESReceiver_001.MoveInPartNumbers` returns Good quality with a fresh source timestamp, confirming MxAccess is connected and the address space is populated. Note that the namespace is `ns=3` now, not the `ns=1` listed in the 2026-03-25 baseline at the top of this file — the auth-consolidation deploy on 2026-03-28 moved the Galaxy namespace to `ns=3` and that move has carried through every deploy since. The top-of-file `ns=1` CLI example should be treated as historical.
### CLI tooling note
The earlier service_info.md entry referenced `tools/opcuacli-dotnet/bin/Debug/net10.0/opcuacli-dotnet.exe`. That binary does not exist on the current checkout; the CLI lives at `src/ZB.MOM.WW.LmxOpcUa.Client.CLI/` and must be invoked via `dotnet run --project src/ZB.MOM.WW.LmxOpcUa.Client.CLI`. The README / `docs/Client.CLI.md` should be the source of truth going forward.
### Runtime-status regression check
**Not performed in this deploy.** The runtime-status subtree-walk / Read-short-circuit regression check from service_info.md:630-669 requires an operator to flip a `$AppEngine` OffScan in the AVEVA IDE and observe the dashboard + Read behavior, which needs a real operator session. The automated CLI smoke test above does not exercise the probe-manager callback path.
The code changes in this deploy are defensive and do not alter the runtime-status feature's control flow except in one place (subscribe rollback, which only triggers when `SubscribeAsync` throws). The 471/471 baseline on the probe manager tests plus the three new rollback regression tests give high confidence that the runtime-status behavior is preserved. If a human operator runs the IDE OffScan/OnScan cycle and observes an anomaly, the fix is most likely isolated to `GalaxyRuntimeProbeManager.SyncAsync` — see Finding 1 above — and can be reverted by restoring `C:\publish\lmxopcua\backups\20260414-003948-instance{1,2}\ZB.MOM.WW.LmxOpcUa.Host.exe`.
## Notes
The service deployment and restart succeeded. The live CLI checks confirm the endpoint is reachable and that the array node identifier has changed to the bracketless form. The array value on the live service still prints as empty even though the status is Good, so if this environment should have populated `MoveInPartNumbers`, the runtime data path still needs follow-up investigation.

View File

@@ -74,8 +74,9 @@ namespace ZB.MOM.WW.LmxOpcUa.Host.Configuration
config.MxAccess.MonitorIntervalSeconds, config.MxAccess.AutoReconnect,
config.MxAccess.ProbeTag ?? "(none)", config.MxAccess.ProbeStaleThresholdSeconds);
Log.Information(
"MxAccess.RuntimeStatusProbesEnabled={Enabled}, RuntimeStatusUnknownTimeoutSeconds={Timeout}s",
config.MxAccess.RuntimeStatusProbesEnabled, config.MxAccess.RuntimeStatusUnknownTimeoutSeconds);
"MxAccess.RuntimeStatusProbesEnabled={Enabled}, RuntimeStatusUnknownTimeoutSeconds={Timeout}s, RequestTimeoutSeconds={RequestTimeout}s",
config.MxAccess.RuntimeStatusProbesEnabled, config.MxAccess.RuntimeStatusUnknownTimeoutSeconds,
config.MxAccess.RequestTimeoutSeconds);
if (string.IsNullOrWhiteSpace(config.MxAccess.ClientName))
{
@@ -88,6 +89,20 @@ namespace ZB.MOM.WW.LmxOpcUa.Host.Configuration
"MxAccess.RuntimeStatusUnknownTimeoutSeconds={Timeout} is below the recommended floor of 5s; initial probe resolution may time out before MxAccess has delivered the first callback",
config.MxAccess.RuntimeStatusUnknownTimeoutSeconds);
if (config.MxAccess.RequestTimeoutSeconds < 1)
{
Log.Error("MxAccess.RequestTimeoutSeconds must be at least 1");
valid = false;
}
else if (config.MxAccess.RequestTimeoutSeconds <
Math.Max(config.MxAccess.ReadTimeoutSeconds, config.MxAccess.WriteTimeoutSeconds))
{
Log.Warning(
"MxAccess.RequestTimeoutSeconds={RequestTimeout} is below Read/Write inner timeouts ({Read}s/{Write}s); outer safety bound may fire before the inner client completes its own error path",
config.MxAccess.RequestTimeoutSeconds,
config.MxAccess.ReadTimeoutSeconds, config.MxAccess.WriteTimeoutSeconds);
}
// Galaxy Repository
Log.Information(
"GalaxyRepository.ConnectionString={ConnectionString}, ChangeDetectionInterval={ChangeInterval}s, CommandTimeout={CmdTimeout}s, ExtendedAttributes={ExtendedAttributes}",
@@ -145,9 +160,9 @@ namespace ZB.MOM.WW.LmxOpcUa.Host.Configuration
config.Historian.Enabled, effectiveNodes, config.Historian.IntegratedSecurity,
config.Historian.Port);
Log.Information(
"Historian.CommandTimeoutSeconds={Timeout}, MaxValuesPerRead={MaxValues}, FailureCooldownSeconds={Cooldown}",
"Historian.CommandTimeoutSeconds={Timeout}, MaxValuesPerRead={MaxValues}, FailureCooldownSeconds={Cooldown}, RequestTimeoutSeconds={RequestTimeout}",
config.Historian.CommandTimeoutSeconds, config.Historian.MaxValuesPerRead,
config.Historian.FailureCooldownSeconds);
config.Historian.FailureCooldownSeconds, config.Historian.RequestTimeoutSeconds);
if (config.Historian.Enabled)
{
@@ -163,6 +178,18 @@ namespace ZB.MOM.WW.LmxOpcUa.Host.Configuration
valid = false;
}
if (config.Historian.RequestTimeoutSeconds < 1)
{
Log.Error("Historian.RequestTimeoutSeconds must be at least 1");
valid = false;
}
else if (config.Historian.RequestTimeoutSeconds < config.Historian.CommandTimeoutSeconds)
{
Log.Warning(
"Historian.RequestTimeoutSeconds={RequestTimeout} is below CommandTimeoutSeconds={CmdTimeout}; outer safety bound may fire before the inner SDK completes its own error path",
config.Historian.RequestTimeoutSeconds, config.Historian.CommandTimeoutSeconds);
}
if (clusterNodes.Count > 0 && !string.IsNullOrWhiteSpace(config.Historian.ServerName)
&& config.Historian.ServerName != "localhost")
Log.Warning(

View File

@@ -63,5 +63,14 @@ namespace ZB.MOM.WW.LmxOpcUa.Host.Configuration
/// </summary>
public int MaxValuesPerRead { get; set; } = 10000;
/// <summary>
/// Gets or sets an outer safety timeout, in seconds, applied to sync-over-async Historian
/// operations invoked from the OPC UA stack thread (HistoryReadRaw, HistoryReadProcessed,
/// HistoryReadAtTime, HistoryReadEvents). This is a backstop for the case where a
/// historian query hangs outside <see cref="CommandTimeoutSeconds"/> — e.g., a slow SDK
/// reconnect or mid-failover cluster node. Must be comfortably larger than
/// <see cref="CommandTimeoutSeconds"/> so normal operation is never affected. Default 60s.
/// </summary>
public int RequestTimeoutSeconds { get; set; } = 60;
}
}

View File

@@ -30,6 +30,16 @@ namespace ZB.MOM.WW.LmxOpcUa.Host.Configuration
/// </summary>
public int WriteTimeoutSeconds { get; set; } = 5;
/// <summary>
/// Gets or sets an outer safety timeout, in seconds, applied to sync-over-async MxAccess
/// operations invoked from the OPC UA stack thread (Read, Write, address-space rebuild probe
/// sync). This is a backstop for the case where an async path hangs outside the inner
/// <see cref="ReadTimeoutSeconds"/> / <see cref="WriteTimeoutSeconds"/> bounds — e.g., a
/// slow reconnect or a scheduler stall. Must be comfortably larger than the inner timeouts
/// so normal operation is never affected. Default 30s.
/// </summary>
public int RequestTimeoutSeconds { get; set; } = 30;
/// <summary>
/// Gets or sets the cap on concurrent MXAccess operations so the bridge does not overload the runtime.
/// </summary>

View File

@@ -171,11 +171,13 @@ namespace ZB.MOM.WW.LmxOpcUa.Host.MxAccess
}
// Compute diffs under lock, release lock before issuing SDK calls (which can block).
List<string> toSubscribe;
// toSubscribe carries the gobject id alongside the probe name so the rollback path on
// subscribe failure can unwind both dictionaries without a reverse lookup.
List<(int GobjectId, string Probe)> toSubscribe;
List<string> toUnsubscribe;
lock (_lock)
{
toSubscribe = new List<string>();
toSubscribe = new List<(int, string)>();
toUnsubscribe = new List<string>();
foreach (var kvp in desired)
@@ -190,14 +192,14 @@ namespace ZB.MOM.WW.LmxOpcUa.Host.MxAccess
_byProbe.Remove(existingProbe);
_probeByGobjectId.Remove(kvp.Key);
toSubscribe.Add(kvp.Value.Probe);
toSubscribe.Add((kvp.Key, kvp.Value.Probe));
_byProbe[kvp.Value.Probe] = MakeInitialStatus(kvp.Value.Obj, kvp.Value.Kind);
_probeByGobjectId[kvp.Key] = kvp.Value.Probe;
}
}
else
{
toSubscribe.Add(kvp.Value.Probe);
toSubscribe.Add((kvp.Key, kvp.Value.Probe));
_byProbe[kvp.Value.Probe] = MakeInitialStatus(kvp.Value.Obj, kvp.Value.Kind);
_probeByGobjectId[kvp.Key] = kvp.Value.Probe;
}
@@ -215,7 +217,7 @@ namespace ZB.MOM.WW.LmxOpcUa.Host.MxAccess
}
// Apply the diff outside the lock.
foreach (var probe in toSubscribe)
foreach (var (gobjectId, probe) in toSubscribe)
{
try
{
@@ -225,6 +227,20 @@ namespace ZB.MOM.WW.LmxOpcUa.Host.MxAccess
catch (Exception ex)
{
Log.Warning(ex, "Failed to advise galaxy runtime probe {Probe}", probe);
// Roll back the pending entry so Tick() can't later transition a never-advised
// probe from Unknown to Stopped and fan out a false-negative host-down signal.
// A concurrent SyncAsync may have re-added the same gobject under a new probe
// name, so compare against the captured probe string before removing.
lock (_lock)
{
if (_probeByGobjectId.TryGetValue(gobjectId, out var current)
&& string.Equals(current, probe, StringComparison.OrdinalIgnoreCase))
{
_probeByGobjectId.Remove(gobjectId);
}
_byProbe.Remove(probe);
}
}
}

View File

@@ -11,6 +11,7 @@ using ZB.MOM.WW.LmxOpcUa.Host.Domain;
using ZB.MOM.WW.LmxOpcUa.Host.Historian;
using ZB.MOM.WW.LmxOpcUa.Host.Metrics;
using ZB.MOM.WW.LmxOpcUa.Host.MxAccess;
using ZB.MOM.WW.LmxOpcUa.Host.Utilities;
namespace ZB.MOM.WW.LmxOpcUa.Host.OpcUa
{
@@ -107,6 +108,8 @@ namespace ZB.MOM.WW.LmxOpcUa.Host.OpcUa
private readonly NodeId? _writeConfigureRoleId;
private readonly NodeId? _writeOperateRoleId;
private readonly NodeId? _writeTuneRoleId;
private readonly TimeSpan _mxAccessRequestTimeout;
private readonly TimeSpan _historianRequestTimeout;
private long _dispatchCycleCount;
private long _suppressedUpdatesCount;
private volatile bool _dispatchDisposed;
@@ -128,6 +131,13 @@ namespace ZB.MOM.WW.LmxOpcUa.Host.OpcUa
private long _alarmAckEventCount;
private long _alarmAckWriteFailures;
// Background subscribe tracking: every fire-and-forget SubscribeAsync for alarm auto-subscribe
// and transferred-subscription restore is registered here so shutdown can drain pending work
// with a bounded timeout, and so tests can observe pending count without races.
private readonly ConcurrentDictionary<long, Task> _pendingBackgroundSubscribes =
new ConcurrentDictionary<long, Task>();
private long _backgroundSubscribeCounter;
/// <summary>
/// Initializes a new node manager for the Galaxy-backed OPC UA namespace.
/// </summary>
@@ -156,7 +166,9 @@ namespace ZB.MOM.WW.LmxOpcUa.Host.OpcUa
NodeId? alarmAckRoleId = null,
AlarmObjectFilter? alarmObjectFilter = null,
bool runtimeStatusProbesEnabled = false,
int runtimeStatusUnknownTimeoutSeconds = 15)
int runtimeStatusUnknownTimeoutSeconds = 15,
int mxAccessRequestTimeoutSeconds = 30,
int historianRequestTimeoutSeconds = 60)
: base(server, configuration, namespaceUri)
{
_namespaceUri = namespaceUri;
@@ -170,6 +182,8 @@ namespace ZB.MOM.WW.LmxOpcUa.Host.OpcUa
_writeTuneRoleId = writeTuneRoleId;
_writeConfigureRoleId = writeConfigureRoleId;
_alarmAckRoleId = alarmAckRoleId;
_mxAccessRequestTimeout = TimeSpan.FromSeconds(Math.Max(1, mxAccessRequestTimeoutSeconds));
_historianRequestTimeout = TimeSpan.FromSeconds(Math.Max(1, historianRequestTimeoutSeconds));
if (runtimeStatusProbesEnabled)
{
@@ -569,7 +583,24 @@ namespace ZB.MOM.WW.LmxOpcUa.Host.OpcUa
// Sync the galaxy runtime probe set against the rebuilt hierarchy. This runs
// synchronously on the calling thread and issues AdviseSupervisory per host —
// expected 500ms-1s additional startup latency for a large multi-host galaxy.
_galaxyRuntimeProbeManager?.SyncAsync(hierarchy).GetAwaiter().GetResult();
// Bounded by _mxAccessRequestTimeout so a hung probe sync cannot park the address
// space rebuild indefinitely; on timeout we log a warning and continue with the
// partial probe set (probe sync is advisory, not required for address space correctness).
if (_galaxyRuntimeProbeManager != null)
{
try
{
SyncOverAsync.WaitSync(
_galaxyRuntimeProbeManager.SyncAsync(hierarchy),
_mxAccessRequestTimeout,
"GalaxyRuntimeProbeManager.SyncAsync");
}
catch (TimeoutException ex)
{
Log.Warning(ex, "Runtime probe sync exceeded {Timeout}s; continuing with partial probe set",
_mxAccessRequestTimeout.TotalSeconds);
}
}
_lastHierarchy = new List<GalaxyObjectInfo>(hierarchy);
_lastAttributes = new List<GalaxyAttributeInfo>(attributes);
@@ -854,15 +885,40 @@ namespace ZB.MOM.WW.LmxOpcUa.Host.OpcUa
{
if (string.IsNullOrEmpty(tag) || !_tagToVariableNode.ContainsKey(tag))
continue;
var alarmTag = tag;
_mxAccessClient.SubscribeAsync(alarmTag, (_, _) => { })
.ContinueWith(t => Log.Warning(t.Exception?.InnerException,
"Failed to auto-subscribe to alarm tag {Tag}", alarmTag),
TaskContinuationOptions.OnlyOnFaulted);
TrackBackgroundSubscribe(tag, "alarm auto-subscribe");
}
}
}
/// <summary>
/// Issues a fire-and-forget <c>SubscribeAsync</c> for <paramref name="tag"/> and registers
/// the resulting task so shutdown can drain pending work with a bounded timeout. The
/// continuation both removes the completed entry and logs faults with the supplied
/// <paramref name="context"/>.
/// </summary>
private void TrackBackgroundSubscribe(string tag, string context)
{
if (_dispatchDisposed)
return;
var id = Interlocked.Increment(ref _backgroundSubscribeCounter);
var task = _mxAccessClient.SubscribeAsync(tag, (_, _) => { });
_pendingBackgroundSubscribes[id] = task;
task.ContinueWith(t =>
{
_pendingBackgroundSubscribes.TryRemove(id, out _);
if (t.IsFaulted)
Log.Warning(t.Exception?.InnerException, "Background subscribe failed ({Context}) for {Tag}",
context, tag);
}, TaskContinuationOptions.ExecuteSynchronously);
}
/// <summary>
/// Gets the number of background subscribe tasks currently in flight. Exposed for tests
/// and for the status dashboard subscription panel.
/// </summary>
internal int PendingBackgroundSubscribeCount => _pendingBackgroundSubscribes.Count;
private ServiceResult OnAlarmAcknowledge(
ISystemContext context, ConditionState condition, byte[] eventId, LocalizedText comment)
{
@@ -1358,11 +1414,7 @@ namespace ZB.MOM.WW.LmxOpcUa.Host.OpcUa
{
if (string.IsNullOrEmpty(tag) || !_tagToVariableNode.ContainsKey(tag))
continue;
var subtreeAlarmTag = tag;
_mxAccessClient.SubscribeAsync(subtreeAlarmTag, (_, _) => { })
.ContinueWith(t => Log.Warning(t.Exception?.InnerException,
"Failed to subscribe alarm tag in subtree {Tag}", subtreeAlarmTag),
TaskContinuationOptions.OnlyOnFaulted);
TrackBackgroundSubscribe(tag, "subtree alarm auto-subscribe");
}
}
}
@@ -1705,10 +1757,18 @@ namespace ZB.MOM.WW.LmxOpcUa.Host.OpcUa
try
{
var vtq = _mxAccessClient.ReadAsync(tagRef).GetAwaiter().GetResult();
var vtq = SyncOverAsync.WaitSync(
_mxAccessClient.ReadAsync(tagRef),
_mxAccessRequestTimeout,
"MxAccessClient.ReadAsync");
results[i] = CreatePublishedDataValue(tagRef, vtq);
errors[i] = ServiceResult.Good;
}
catch (TimeoutException ex)
{
Log.Warning(ex, "Read timed out for {TagRef}", tagRef);
errors[i] = new ServiceResult(StatusCodes.BadTimeout);
}
catch (Exception ex)
{
Log.Warning(ex, "Read failed for {TagRef}", tagRef);
@@ -1779,7 +1839,10 @@ namespace ZB.MOM.WW.LmxOpcUa.Host.OpcUa
value = updatedArray;
}
var success = _mxAccessClient.WriteAsync(tagRef, value).GetAwaiter().GetResult();
var success = SyncOverAsync.WaitSync(
_mxAccessClient.WriteAsync(tagRef, value),
_mxAccessRequestTimeout,
"MxAccessClient.WriteAsync");
if (success)
{
PublishLocalWrite(tagRef, value);
@@ -1790,6 +1853,11 @@ namespace ZB.MOM.WW.LmxOpcUa.Host.OpcUa
errors[i] = new ServiceResult(StatusCodes.BadInternalError);
}
}
catch (TimeoutException ex)
{
Log.Warning(ex, "Write timed out for {TagRef}", tagRef);
errors[i] = new ServiceResult(StatusCodes.BadTimeout);
}
catch (Exception ex)
{
Log.Warning(ex, "Write failed for {TagRef}", tagRef);
@@ -2017,15 +2085,23 @@ namespace ZB.MOM.WW.LmxOpcUa.Host.OpcUa
try
{
var maxValues = details.NumValuesPerNode > 0 ? (int)details.NumValuesPerNode : 0;
var dataValues = _historianDataSource.ReadRawAsync(
tagRef, details.StartTime, details.EndTime, maxValues)
.GetAwaiter().GetResult();
var dataValues = SyncOverAsync.WaitSync(
_historianDataSource.ReadRawAsync(
tagRef, details.StartTime, details.EndTime, maxValues),
_historianRequestTimeout,
"HistorianDataSource.ReadRawAsync");
if (details.ReturnBounds)
AddBoundingValues(dataValues, details.StartTime, details.EndTime);
ReturnHistoryPage(dataValues, details.NumValuesPerNode, results, errors, idx);
}
catch (TimeoutException ex)
{
historyScope.SetSuccess(false);
Log.Warning(ex, "HistoryRead raw timed out for {TagRef}", tagRef);
errors[idx] = new ServiceResult(StatusCodes.BadTimeout);
}
catch (Exception ex)
{
historyScope.SetSuccess(false);
@@ -2094,13 +2170,21 @@ namespace ZB.MOM.WW.LmxOpcUa.Host.OpcUa
using var historyScope = _metrics.BeginOperation("HistoryReadProcessed");
try
{
var dataValues = _historianDataSource.ReadAggregateAsync(
var dataValues = SyncOverAsync.WaitSync(
_historianDataSource.ReadAggregateAsync(
tagRef, details.StartTime, details.EndTime,
details.ProcessingInterval, column)
.GetAwaiter().GetResult();
details.ProcessingInterval, column),
_historianRequestTimeout,
"HistorianDataSource.ReadAggregateAsync");
ReturnHistoryPage(dataValues, 0, results, errors, idx);
}
catch (TimeoutException ex)
{
historyScope.SetSuccess(false);
Log.Warning(ex, "HistoryRead processed timed out for {TagRef}", tagRef);
errors[idx] = new ServiceResult(StatusCodes.BadTimeout);
}
catch (Exception ex)
{
historyScope.SetSuccess(false);
@@ -2150,8 +2234,10 @@ namespace ZB.MOM.WW.LmxOpcUa.Host.OpcUa
for (var i = 0; i < details.ReqTimes.Count; i++)
timestamps[i] = details.ReqTimes[i];
var dataValues = _historianDataSource.ReadAtTimeAsync(tagRef, timestamps)
.GetAwaiter().GetResult();
var dataValues = SyncOverAsync.WaitSync(
_historianDataSource.ReadAtTimeAsync(tagRef, timestamps),
_historianRequestTimeout,
"HistorianDataSource.ReadAtTimeAsync");
var historyData = new HistoryData();
historyData.DataValues.AddRange(dataValues);
@@ -2163,6 +2249,12 @@ namespace ZB.MOM.WW.LmxOpcUa.Host.OpcUa
};
errors[idx] = ServiceResult.Good;
}
catch (TimeoutException ex)
{
historyScope.SetSuccess(false);
Log.Warning(ex, "HistoryRead at-time timed out for {TagRef}", tagRef);
errors[idx] = new ServiceResult(StatusCodes.BadTimeout);
}
catch (Exception ex)
{
historyScope.SetSuccess(false);
@@ -2215,9 +2307,11 @@ namespace ZB.MOM.WW.LmxOpcUa.Host.OpcUa
try
{
var maxEvents = details.NumValuesPerNode > 0 ? (int)details.NumValuesPerNode : 0;
var events = _historianDataSource.ReadEventsAsync(
sourceName, details.StartTime, details.EndTime, maxEvents)
.GetAwaiter().GetResult();
var events = SyncOverAsync.WaitSync(
_historianDataSource.ReadEventsAsync(
sourceName, details.StartTime, details.EndTime, maxEvents),
_historianRequestTimeout,
"HistorianDataSource.ReadEventsAsync");
var historyEvent = new HistoryEvent();
foreach (var evt in events)
@@ -2247,6 +2341,12 @@ namespace ZB.MOM.WW.LmxOpcUa.Host.OpcUa
};
errors[idx] = ServiceResult.Good;
}
catch (TimeoutException ex)
{
historyScope.SetSuccess(false);
Log.Warning(ex, "HistoryRead events timed out for {NodeId}", nodeIdStr);
errors[idx] = new ServiceResult(StatusCodes.BadTimeout);
}
catch (Exception ex)
{
historyScope.SetSuccess(false);
@@ -2476,13 +2576,7 @@ namespace ZB.MOM.WW.LmxOpcUa.Host.OpcUa
}
foreach (var tagRef in tagsToSubscribe)
{
var transferTag = tagRef;
_mxAccessClient.SubscribeAsync(transferTag, (_, _) => { })
.ContinueWith(t => Log.Warning(t.Exception?.InnerException,
"Failed to restore subscription for transferred tag {Tag}", transferTag),
TaskContinuationOptions.OnlyOnFaulted);
}
TrackBackgroundSubscribe(tagRef, "transferred subscription restore");
}
private void OnMxAccessDataChange(string address, Vtq vtq)
@@ -2798,12 +2892,33 @@ namespace ZB.MOM.WW.LmxOpcUa.Host.OpcUa
// client, so the probes close cleanly.
_galaxyRuntimeProbeManager?.Dispose();
StopDispatchThread();
DrainPendingBackgroundSubscribes();
_dataChangeSignal.Dispose();
}
base.Dispose(disposing);
}
private void DrainPendingBackgroundSubscribes()
{
var snapshot = _pendingBackgroundSubscribes.Values.ToArray();
if (snapshot.Length == 0)
return;
try
{
if (Task.WaitAll(snapshot, TimeSpan.FromSeconds(5)))
Log.Information("Drained {Count} pending background subscribe(s) on shutdown", snapshot.Length);
else
Log.Warning("Background subscribe drain timed out; not all of the {Count} pending task(s) completed within 5s", snapshot.Length);
}
catch (AggregateException ex)
{
// Individual faults were already logged by the tracked continuation; record the
// aggregate at debug level to aid diagnosis without double-logging each failure.
Log.Debug(ex, "Background subscribe drain completed with {FaultCount} fault(s)",
ex.InnerExceptions.Count);
}
}
#endregion
}
}

View File

@@ -39,6 +39,8 @@ namespace ZB.MOM.WW.LmxOpcUa.Host.OpcUa
private readonly bool _runtimeStatusProbesEnabled;
private readonly int _runtimeStatusUnknownTimeoutSeconds;
private readonly int _mxAccessRequestTimeoutSeconds;
private readonly int _historianRequestTimeoutSeconds;
public LmxOpcUaServer(string galaxyName, IMxAccessClient mxAccessClient, PerformanceMetrics metrics,
IHistorianDataSource? historianDataSource = null, bool alarmTrackingEnabled = false,
@@ -46,7 +48,9 @@ namespace ZB.MOM.WW.LmxOpcUa.Host.OpcUa
RedundancyConfiguration? redundancyConfig = null, string? applicationUri = null,
AlarmObjectFilter? alarmObjectFilter = null,
bool runtimeStatusProbesEnabled = false,
int runtimeStatusUnknownTimeoutSeconds = 15)
int runtimeStatusUnknownTimeoutSeconds = 15,
int mxAccessRequestTimeoutSeconds = 30,
int historianRequestTimeoutSeconds = 60)
{
_galaxyName = galaxyName;
_mxAccessClient = mxAccessClient;
@@ -60,6 +64,8 @@ namespace ZB.MOM.WW.LmxOpcUa.Host.OpcUa
_applicationUri = applicationUri;
_runtimeStatusProbesEnabled = runtimeStatusProbesEnabled;
_runtimeStatusUnknownTimeoutSeconds = runtimeStatusUnknownTimeoutSeconds;
_mxAccessRequestTimeoutSeconds = mxAccessRequestTimeoutSeconds;
_historianRequestTimeoutSeconds = historianRequestTimeoutSeconds;
}
/// <summary>
@@ -97,7 +103,8 @@ namespace ZB.MOM.WW.LmxOpcUa.Host.OpcUa
_historianDataSource, _alarmTrackingEnabled, _authConfig.AnonymousCanWrite,
_writeOperateRoleId, _writeTuneRoleId, _writeConfigureRoleId, _alarmAckRoleId,
_alarmObjectFilter,
_runtimeStatusProbesEnabled, _runtimeStatusUnknownTimeoutSeconds);
_runtimeStatusProbesEnabled, _runtimeStatusUnknownTimeoutSeconds,
_mxAccessRequestTimeoutSeconds, _historianRequestTimeoutSeconds);
var nodeManagers = new List<INodeManager> { NodeManager };
return new MasterNodeManager(server, configuration, null, nodeManagers.ToArray());

View File

@@ -46,7 +46,8 @@ namespace ZB.MOM.WW.LmxOpcUa.Host.OpcUa
SecurityProfileConfiguration? securityConfig = null,
RedundancyConfiguration? redundancyConfig = null,
AlarmObjectFilter? alarmObjectFilter = null,
MxAccessConfiguration? mxAccessConfig = null)
MxAccessConfiguration? mxAccessConfig = null,
HistorianConfiguration? historianConfig = null)
{
_config = config;
_mxAccessClient = mxAccessClient;
@@ -58,9 +59,11 @@ namespace ZB.MOM.WW.LmxOpcUa.Host.OpcUa
_redundancyConfig = redundancyConfig ?? new RedundancyConfiguration();
_alarmObjectFilter = alarmObjectFilter;
_mxAccessConfig = mxAccessConfig ?? new MxAccessConfiguration();
_historianConfig = historianConfig ?? new HistorianConfiguration();
}
private readonly MxAccessConfiguration _mxAccessConfig;
private readonly HistorianConfiguration _historianConfig;
/// <summary>
/// Gets the active node manager that holds the published Galaxy namespace.
@@ -245,7 +248,9 @@ namespace ZB.MOM.WW.LmxOpcUa.Host.OpcUa
_config.AlarmTrackingEnabled, _authConfig, _authProvider, _redundancyConfig, applicationUri,
_alarmObjectFilter,
_mxAccessConfig.RuntimeStatusProbesEnabled,
_mxAccessConfig.RuntimeStatusUnknownTimeoutSeconds);
_mxAccessConfig.RuntimeStatusUnknownTimeoutSeconds,
_mxAccessConfig.RequestTimeoutSeconds,
_historianConfig.RequestTimeoutSeconds);
await _application.Start(_server);
Log.Information(

View File

@@ -125,10 +125,20 @@ namespace ZB.MOM.WW.LmxOpcUa.Host
internal ChangeDetectionService? ChangeDetectionInstance { get; private set; }
/// <summary>
/// Gets the hosted status web server when the dashboard is enabled.
/// Gets the hosted status web server when the dashboard is enabled and successfully bound.
/// Null when <c>Dashboard.Enabled</c> is false or when <see cref="DashboardStartFailed"/> is true.
/// </summary>
internal StatusWebServer? StatusWeb { get; private set; }
/// <summary>
/// Gets a flag indicating that the dashboard was enabled in configuration but failed to bind
/// its HTTP port at startup. The service continues in degraded mode (matching the pattern
/// for other optional subsystems: MxAccess connect, Galaxy DB connect, initial address space
/// build). Surfaced for tests and any external health probe that needs to distinguish
/// "dashboard disabled by config" from "dashboard failed to start".
/// </summary>
internal bool DashboardStartFailed { get; private set; }
/// <summary>
/// Gets the dashboard report generator used to assemble operator-facing status snapshots.
/// </summary>
@@ -246,7 +256,7 @@ namespace ZB.MOM.WW.LmxOpcUa.Host
ServerHost = new OpcUaServerHost(_config.OpcUa, effectiveMxClient, Metrics, _historianDataSource,
_config.Authentication, authProvider, _config.Security, _config.Redundancy, alarmObjectFilter,
_config.MxAccess);
_config.MxAccess, _config.Historian);
// Step 9-10: Query hierarchy, start server, build address space
DateTime? initialDeployTime = null;
@@ -304,8 +314,21 @@ namespace ZB.MOM.WW.LmxOpcUa.Host
if (_config.Dashboard.Enabled)
{
StatusWeb = new StatusWebServer(StatusReportInstance, _config.Dashboard.Port);
StatusWeb.Start();
var dashboardServer = new StatusWebServer(StatusReportInstance, _config.Dashboard.Port);
if (dashboardServer.Start())
{
StatusWeb = dashboardServer;
}
else
{
// Degraded mode: StatusWebServer.Start() already logged the underlying exception.
// Dispose the unstarted instance, null out the reference, and flag the failure so
// tests and health probes can observe it. Service startup continues.
Log.Warning("Status dashboard failed to bind on port {Port}; service continues without dashboard",
_config.Dashboard.Port);
dashboardServer.Dispose();
DashboardStartFailed = true;
}
}
// Wire ServiceLevel updates from MXAccess health changes

View File

@@ -0,0 +1,53 @@
using System;
using System.Threading.Tasks;
namespace ZB.MOM.WW.LmxOpcUa.Host.Utilities
{
/// <summary>
/// Bounded safety wrappers for blocking on async tasks from synchronous OPC UA stack
/// callbacks (Read, Write, HistoryRead*, BuildAddressSpace). These are backstops: the
/// underlying MxAccess / Historian clients already enforce inner timeouts on the async
/// path, but an outer bound is still required so the stack thread cannot be parked
/// indefinitely by a hung scheduler, a slow reconnect, or any other non-returning
/// async path.
/// </summary>
/// <remarks>
/// On timeout, the underlying task is NOT cancelled — it runs to completion on the
/// thread pool and is abandoned. Callers must be comfortable with the fire-forget
/// semantics of the background continuation. This is acceptable for the current call
/// sites because MxAccess and Historian clients are shared singletons whose background
/// work does not capture request-scoped state.
/// </remarks>
internal static class SyncOverAsync
{
public static void WaitSync(Task task, TimeSpan timeout, string operation)
{
if (task == null) throw new ArgumentNullException(nameof(task));
try
{
if (!task.Wait(timeout))
throw new TimeoutException($"{operation} exceeded {timeout.TotalSeconds:0.#}s");
}
catch (AggregateException ae) when (ae.InnerExceptions.Count == 1)
{
// Unwrap the single inner exception so callers can write natural catch blocks,
// rethrowing via ExceptionDispatchInfo so the original stack trace is preserved.
System.Runtime.ExceptionServices.ExceptionDispatchInfo.Capture(ae.InnerExceptions[0]).Throw();
throw; // unreachable; keeps the compiler's definite-assignment analysis satisfied
}
}
public static T WaitSync<T>(Task<T> task, TimeSpan timeout, string operation)
{
if (task == null) throw new ArgumentNullException(nameof(task));
try
{
if (!task.Wait(timeout))
throw new TimeoutException($"{operation} exceeded {timeout.TotalSeconds:0.#}s");
return task.Result;
}
catch (AggregateException ae) when (ae.InnerExceptions.Count == 1)
{
System.Runtime.ExceptionServices.ExceptionDispatchInfo.Capture(ae.InnerExceptions[0]).Throw();
throw; // unreachable; keeps the compiler's definite-return analysis satisfied
}
}
}
}

View File

@@ -192,6 +192,43 @@ namespace ZB.MOM.WW.LmxOpcUa.Tests.Configuration
config.Security.MinimumCertificateKeySize.ShouldBe(2048);
}
/// <summary>
/// Stability review 2026-04-13 Finding 3: MxAccess.RequestTimeoutSeconds must be at
/// least 1. Zero or negative values disable the safety bound and are rejected.
/// </summary>
[Fact]
public void Validator_MxAccessRequestTimeoutZero_ReturnsFalse()
{
var config = LoadFromJson();
config.MxAccess.RequestTimeoutSeconds = 0;
ConfigurationValidator.ValidateAndLog(config).ShouldBe(false);
}
/// <summary>
/// Stability review 2026-04-13 Finding 3: Historian.RequestTimeoutSeconds must be at
/// least 1 when historian is enabled.
/// </summary>
[Fact]
public void Validator_HistorianRequestTimeoutZero_ReturnsFalse()
{
var config = LoadFromJson();
config.Historian.Enabled = true;
config.Historian.ServerName = "localhost";
config.Historian.RequestTimeoutSeconds = 0;
ConfigurationValidator.ValidateAndLog(config).ShouldBe(false);
}
/// <summary>
/// Confirms the bound AppConfiguration carries non-zero default request timeouts.
/// </summary>
[Fact]
public void Validator_DefaultRequestTimeouts_AreSensible()
{
var config = new AppConfiguration();
config.MxAccess.RequestTimeoutSeconds.ShouldBeGreaterThanOrEqualTo(1);
config.Historian.RequestTimeoutSeconds.ShouldBeGreaterThanOrEqualTo(1);
}
/// <summary>
/// Confirms that a minimum key size below 2048 is rejected by the validator.
/// </summary>

View File

@@ -402,6 +402,73 @@ namespace ZB.MOM.WW.LmxOpcUa.Tests.MxAccess
sut.IsHostStopped(20).ShouldBeFalse();
}
// ---------- Subscribe failure rollback (stability review 2026-04-13 Finding 1) ----------
[Fact]
public async Task Sync_SubscribeThrows_DoesNotLeavePhantomEntry()
{
var client = new FakeMxAccessClient
{
SubscribeException = new InvalidOperationException("advise failed")
};
var (stopSpy, runSpy) = (new List<int>(), new List<int>());
using var sut = Sut(client, 15, stopSpy, runSpy);
await sut.SyncAsync(new[] { Engine(20, "DevAppEngine") });
// A failed SubscribeAsync must not leave a phantom entry that Tick() can later
// transition from Unknown to Stopped.
sut.ActiveProbeCount.ShouldBe(0);
sut.GetSnapshot().ShouldBeEmpty();
sut.IsHostStopped(20).ShouldBeFalse();
}
[Fact]
public async Task Sync_SubscribeThrows_TickDoesNotFireStopCallback()
{
var client = new FakeMxAccessClient
{
SubscribeException = new InvalidOperationException("advise failed")
};
var clock = new Clock();
var (stopSpy, runSpy) = (new List<int>(), new List<int>());
using var sut = Sut(client, 15, stopSpy, runSpy, clock);
await sut.SyncAsync(new[] { Engine(20, "DevAppEngine") });
// Advance past the unknown timeout — if the rollback were incomplete, Tick() would
// transition the phantom entry to Stopped and fan out a false host-down signal.
clock.Now = clock.Now.AddSeconds(30);
sut.Tick();
stopSpy.ShouldBeEmpty();
runSpy.ShouldBeEmpty();
sut.ActiveProbeCount.ShouldBe(0);
}
[Fact]
public async Task Sync_SubscribeSucceedsAfterRetry_AppearsInSnapshot()
{
// After a failed subscribe rolls back cleanly, a subsequent successful SyncAsync
// against the same host must behave normally.
var client = new FakeMxAccessClient
{
SubscribeException = new InvalidOperationException("first attempt fails")
};
var (stopSpy, runSpy) = (new List<int>(), new List<int>());
using var sut = Sut(client, 15, stopSpy, runSpy);
await sut.SyncAsync(new[] { Engine(20, "DevAppEngine") });
sut.ActiveProbeCount.ShouldBe(0);
// Clear the fault and resync — the host must now appear with Unknown state.
client.SubscribeException = null;
await sut.SyncAsync(new[] { Engine(20, "DevAppEngine") });
sut.ActiveProbeCount.ShouldBe(1);
sut.GetSnapshot().Single().State.ShouldBe(GalaxyRuntimeState.Unknown);
}
// ---------- Callback exception safety ----------
[Fact]

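
The rollback behavior these tests pin down can be sketched as the following shape, with illustrative names (`_probes`, `ProbeEntry`, `ScanStateReference`) standing in for the real probe-manager internals:

```csharp
// Simplified sketch of the Finding 1 fix: if SubscribeAsync throws, the
// half-registered probe entry is removed so Tick() never sees it.
foreach (var host in desiredHosts)
{
    _probes[host.GobjectId] = new ProbeEntry(host, GalaxyRuntimeState.Unknown);
    try
    {
        await _client.SubscribeAsync(host.ScanStateReference);
    }
    catch (Exception ex)
    {
        // Roll back the phantom entry; otherwise the Unknown -> Stopped timeout
        // would later fan out a false host-down signal across the subtree.
        _probes.Remove(host.GobjectId);
        _logger.Warn(ex, "Probe subscribe failed for {Host}", host.TagName);
    }
}
```

The third test above then confirms the rollback leaves no residue: a later successful `SyncAsync` against the same host registers it normally.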
View File

@@ -96,6 +96,26 @@ namespace ZB.MOM.WW.LmxOpcUa.Tests.Status
response.StatusCode.ShouldBe(HttpStatusCode.MethodNotAllowed);
}
/// <summary>
    /// Confirms that Start() returns false when the target port is already bound by
    /// another listener. Regression guard for Finding 2 of the 2026-04-13 stability
    /// review: OpcUaService surfaces this return value as DashboardStartFailed.
/// </summary>
[Fact]
public void Start_WhenPortInUse_ReturnsFalse()
{
var port = new Random().Next(19000, 19500);
using var blocker = new HttpListener();
blocker.Prefixes.Add($"http://localhost:{port}/");
blocker.Start();
var reportService = new StatusReportService(new HealthCheckService(), 10);
reportService.SetComponents(new FakeMxAccessClient(), null, null, null);
using var contested = new StatusWebServer(reportService, port);
contested.Start().ShouldBeFalse();
}
/// <summary>
/// Confirms that cache-control headers disable caching for dashboard responses.
/// </summary>

View File

@@ -0,0 +1,72 @@
using System;
using System.Threading.Tasks;
using Shouldly;
using Xunit;
using ZB.MOM.WW.LmxOpcUa.Host.Utilities;
namespace ZB.MOM.WW.LmxOpcUa.Tests.Utilities
{
/// <summary>
/// Tests for the bounded sync-over-async wrapper introduced by stability review 2026-04-13
/// Finding 3. The wrapper is a backstop applied at every LmxNodeManager sync-over-async site
/// (Read, Write, HistoryRead*, BuildAddressSpace probe sync).
/// </summary>
public class SyncOverAsyncTests
{
[Fact]
public void WaitSync_CompletedTask_ReturnsResult()
{
var task = Task.FromResult(42);
SyncOverAsync.WaitSync(task, TimeSpan.FromSeconds(1), "test").ShouldBe(42);
}
[Fact]
public void WaitSync_CompletedNonGenericTask_Returns()
{
var task = Task.CompletedTask;
Should.NotThrow(() => SyncOverAsync.WaitSync(task, TimeSpan.FromSeconds(1), "test"));
}
[Fact]
public void WaitSync_NeverCompletingTask_ThrowsTimeoutException()
{
var tcs = new TaskCompletionSource<int>();
var ex = Should.Throw<TimeoutException>(() =>
SyncOverAsync.WaitSync(tcs.Task, TimeSpan.FromMilliseconds(100), "op"));
ex.Message.ShouldContain("op");
}
[Fact]
public void WaitSync_NeverCompletingNonGenericTask_ThrowsTimeoutException()
{
var tcs = new TaskCompletionSource<bool>();
Should.Throw<TimeoutException>(() =>
SyncOverAsync.WaitSync((Task)tcs.Task, TimeSpan.FromMilliseconds(100), "op"));
}
[Fact]
public void WaitSync_FaultedNonGenericTask_UnwrapsInnerException()
{
var task = Task.FromException(new InvalidOperationException("boom"));
Should.Throw<InvalidOperationException>(() =>
SyncOverAsync.WaitSync(task, TimeSpan.FromSeconds(1), "op"));
}
[Fact]
public void WaitSync_FaultedGenericTask_UnwrapsInnerException()
{
var task = Task.FromException<int>(new InvalidOperationException("boom"));
Should.Throw<InvalidOperationException>(() =>
SyncOverAsync.WaitSync(task, TimeSpan.FromSeconds(1), "op"));
}
[Fact]
public void WaitSync_NullTask_ThrowsArgumentNullException()
{
Should.Throw<ArgumentNullException>(() =>
SyncOverAsync.WaitSync((Task)null!, TimeSpan.FromSeconds(1), "op"));
Should.Throw<ArgumentNullException>(() =>
SyncOverAsync.WaitSync((Task<int>)null!, TimeSpan.FromSeconds(1), "op"));
}
}
}

View File

@@ -0,0 +1,78 @@
using System;
using System.Collections.Generic;
using System.Net;
using Shouldly;
using Xunit;
using ZB.MOM.WW.LmxOpcUa.Host;
using ZB.MOM.WW.LmxOpcUa.Host.Configuration;
using ZB.MOM.WW.LmxOpcUa.Host.Domain;
using ZB.MOM.WW.LmxOpcUa.Tests.Helpers;
namespace ZB.MOM.WW.LmxOpcUa.Tests.Wiring
{
/// <summary>
/// Regression for stability review 2026-04-13 Finding 2. Confirms that when the dashboard
/// port is already bound, the service continues to start (degraded mode) and the
/// <see cref="OpcUaService.DashboardStartFailed"/> flag is raised.
/// </summary>
public class OpcUaServiceDashboardFailureTests
{
[Fact]
public void Start_DashboardPortInUse_ContinuesInDegradedMode()
{
var dashboardPort = new Random().Next(19500, 19999);
using var blocker = new HttpListener();
blocker.Prefixes.Add($"http://localhost:{dashboardPort}/");
blocker.Start();
var config = new AppConfiguration
{
OpcUa = new OpcUaConfiguration
{
Port = 14842,
GalaxyName = "TestGalaxy",
EndpointPath = "/LmxOpcUa"
},
MxAccess = new MxAccessConfiguration { ClientName = "Test" },
GalaxyRepository = new GalaxyRepositoryConfiguration(),
Dashboard = new DashboardConfiguration { Enabled = true, Port = dashboardPort }
};
var proxy = new FakeMxProxy();
var repo = new FakeGalaxyRepository
{
Hierarchy = new List<GalaxyObjectInfo>
{
new()
{
GobjectId = 1, TagName = "TestObj", BrowseName = "TestObj",
ParentGobjectId = 0, IsArea = false
}
},
Attributes = new List<GalaxyAttributeInfo>
{
new()
{
GobjectId = 1, TagName = "TestObj", AttributeName = "TestAttr",
FullTagReference = "TestObj.TestAttr", MxDataType = 5, IsArray = false
}
}
};
var service = new OpcUaService(config, proxy, repo);
service.Start();
try
{
// Service continues despite dashboard bind failure — degraded mode policy.
service.ServerHost.ShouldNotBeNull();
service.DashboardStartFailed.ShouldBeTrue();
service.StatusWeb.ShouldBeNull();
}
finally
{
service.Stop();
}
}
}
}
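
The start-up policy this test exercises can be sketched as follows; the field and property names here are illustrative assumptions, not the actual OpcUaService wiring:

```csharp
// Simplified sketch of the Finding 2 fix: a failed dashboard bind no longer
// lets the service advertise a clean start. StatusWeb stays null and
// DashboardStartFailed is raised for HealthCheck and operators to observe.
if (_config.Dashboard.Enabled)
{
    var web = new StatusWebServer(_reportService, _config.Dashboard.Port);
    if (web.Start())
    {
        StatusWeb = web;
    }
    else
    {
        web.Dispose();
        DashboardStartFailed = true; // degraded mode: OPC UA endpoint still serves
    }
}
```

The key design choice is that the OPC UA endpoint keeps serving (degraded mode) rather than failing the whole service over a dead dashboard, while the flag makes the failure operator-visible.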