# Stability Review - 2026-04-13

## Scope

Re-review of the updated `lmxopcua` codebase with emphasis on stability, shutdown behavior, async usage, latent deadlock patterns, and silent failure modes.

Validation run for this review:

```powershell
dotnet test tests\ZB.MOM.WW.LmxOpcUa.Tests\ZB.MOM.WW.LmxOpcUa.Tests.csproj --no-restore
```

Result: `471/471` tests passed in approximately `3m18s`.

## Confirmed Findings

### 1. Probe state is published before the subscription succeeds

Severity: High

File references:

- `src/ZB.MOM.WW.LmxOpcUa.Host/MxAccess/GalaxyRuntimeProbeManager.cs:193`
- `src/ZB.MOM.WW.LmxOpcUa.Host/MxAccess/GalaxyRuntimeProbeManager.cs:201`
- `src/ZB.MOM.WW.LmxOpcUa.Host/MxAccess/GalaxyRuntimeProbeManager.cs:222`
- `src/ZB.MOM.WW.LmxOpcUa.Host/MxAccess/GalaxyRuntimeProbeManager.cs:225`
- `src/ZB.MOM.WW.LmxOpcUa.Host/MxAccess/GalaxyRuntimeProbeManager.cs:343`

`SyncAsync` adds entries to `_byProbe` and `_probeByGobjectId` before `SubscribeAsync` completes. If the advise call fails, the catch block logs the failure but leaves the probe registered internally. `Tick()` later treats that entry as a real advised probe that never produced an initial callback and transitions it from `Unknown` to `Stopped`.

That creates a false-negative health signal: a host can be marked stopped even though the real problem was "subscription never established". In this codebase that distinction matters because runtime-host state is later used to suppress or degrade published node quality.

Recommendation: only commit the new probe entry after a successful subscribe, or roll the dictionaries back in the catch path. Add a regression test for subscribe failure in `GalaxyRuntimeProbeManagerTests`.
### 2. Service startup still ignores dashboard bind failure

Severity: Medium

File references:

- `src/ZB.MOM.WW.LmxOpcUa.Host/Status/StatusWebServer.cs:50`
- `src/ZB.MOM.WW.LmxOpcUa.Host/OpcUaService.cs:307`
- `src/ZB.MOM.WW.LmxOpcUa.Host/OpcUaService.cs:308`

`StatusWebServer.Start()` now correctly returns `bool`, but `OpcUaService.Start` still ignores that result. The service can therefore continue through startup and report success even when the dashboard failed to bind. This is not a process-crash bug, but it is still an operational stability issue because the service advertises a successful start while one of its enabled endpoints is unavailable.

Recommendation: decide whether dashboard startup failure is fatal or degraded mode, then implement that policy explicitly. At minimum, surface the failure in service startup state instead of dropping the return value.

### 3. Sync-over-async remains on critical request and rebuild paths

Severity: Medium

File references:

- `src/ZB.MOM.WW.LmxOpcUa.Host/OpcUa/LmxNodeManager.cs:572`
- `src/ZB.MOM.WW.LmxOpcUa.Host/OpcUa/LmxNodeManager.cs:1708`
- `src/ZB.MOM.WW.LmxOpcUa.Host/OpcUa/LmxNodeManager.cs:1782`
- `src/ZB.MOM.WW.LmxOpcUa.Host/OpcUa/LmxNodeManager.cs:2022`
- `src/ZB.MOM.WW.LmxOpcUa.Host/OpcUa/LmxNodeManager.cs:2100`
- `src/ZB.MOM.WW.LmxOpcUa.Host/OpcUa/LmxNodeManager.cs:2154`
- `src/ZB.MOM.WW.LmxOpcUa.Host/OpcUa/LmxNodeManager.cs:2220`

The updated code removed some blocking work from lock scopes, but several service-critical paths still call async MX access operations synchronously with `.GetAwaiter().GetResult()`. That pattern appears in address-space rebuild, direct read/write handling, and historian reads. I did not reproduce a deadlock in tests, but the pattern is still a stability risk: request threads now inherit backend latency directly and can stall hard if the underlying async path hangs, blocks on its own scheduler, or experiences slow reconnect behavior.
Recommendation: keep the short synchronous boundary only where the external API forces it, and isolate backend calls behind bounded timeouts or dedicated worker threads. Rebuild-time probe synchronization is the highest-value place to reduce blocking first.

### 4. Several background subscribe paths are still fire-and-forget

Severity: Low

File references:

- `src/ZB.MOM.WW.LmxOpcUa.Host/OpcUa/LmxNodeManager.cs:858`
- `src/ZB.MOM.WW.LmxOpcUa.Host/OpcUa/LmxNodeManager.cs:1362`
- `src/ZB.MOM.WW.LmxOpcUa.Host/OpcUa/LmxNodeManager.cs:2481`

Alarm auto-subscribe and transferred-subscription restore still dispatch `SubscribeAsync(...)` and attach a fault-only continuation. That is better than dropping exceptions completely, but these operations are still not lifecycle-coordinated: a rebuild or shutdown can move on while subscription work is still in flight. The practical outcome is transient mismatch rather than memory corruption — expected subscriptions can arrive late, and shutdown/rebuild sequencing is harder to reason about under backend slowness.

Recommendation: track these tasks when ordering matters, or centralize them behind a subscription queue with explicit cancellation and shutdown semantics.

## Verified Improvements Since The Previous Review

The following areas that were previously risky now look materially better in the current code:

- `StaComThread` now checks `PostThreadMessage` failures and faults pending work instead of leaving callers parked indefinitely.
- `HistoryContinuationPointManager` now purges expired continuation points on retrieve and release, not only on store.
- `ChangeDetectionService`, the MX monitor, and the status web server now retain background task handles and wait briefly during stop.
- `StatusWebServer` no longer swallows startup failure silently; it returns a success flag and logs the failure.
- Connection string validation now redacts credentials before logging.
## Overall Assessment

The updated code is in better shape than the previous pass. The most serious prior shutdown and leak hazards have been addressed, and the full automated test suite is currently green. The remaining stability work is concentrated in two areas:

1. Correctness around failed runtime-probe subscription.
2. Reducing synchronous waits and untracked background subscription work in the OPC UA node manager.