Close all four findings from the 2026-04-13 stability review:

- A failed runtime probe subscription can no longer leave a phantom entry that `Tick()` flips to Stopped, fanning out false BadOutOfService quality across a host's subtree.
- A silently failed dashboard bind no longer lets the service advertise a successful start while an operator-visible endpoint is dead.
- The seven sync-over-async sites in `LmxNodeManager` (rebuild probe sync, Read, Write, four HistoryRead overrides) can no longer park the OPC UA stack thread indefinitely on a hung backend.
- Alarm auto-subscribe and transferred-subscription restore no longer race shutdown as untracked fire-and-forget tasks.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
This commit is contained in:
Joseph Doherty
2026-04-14 00:48:07 -04:00
parent 731092595f
commit c76ab8fdee
21 changed files with 869 additions and 53 deletions


- **Dispatch suppression** — gated on observing whether the subtree walk alone resolves the client freeze in production.
- **Documentation updates** — the `docs/` guides (`MxAccessBridge.md`, `StatusDashboard.md`, `Configuration.md`, `HistoricalDataAccess.md`) still describe the pre-runtime-status behavior. Need a consolidated doc pass covering this feature plus the historian cluster + health surface updates from earlier today.
## Stability Review Fixes 2026-04-14
Code changes only — **not yet deployed** to the instance1/instance2 services. Closes all four residual findings from `docs/stability-review-20260413.md`; the document was green on shipped features but flagged latent defects that degraded the stability guarantees the runtime-status feature relies on. Deploy procedure at the end of this section.
### Findings closed
**Finding 1 (High) — Probe rollback on subscribe failure.**
`GalaxyRuntimeProbeManager.SyncAsync` pre-populated `_byProbe` / `_probeByGobjectId` before awaiting `SubscribeAsync`. When the advise call threw, the catch block logged a warning but left the phantom entry in place; `Tick()` later transitioned it from Unknown to Stopped after `RuntimeStatusUnknownTimeoutSeconds`, firing `_onHostStopped` and walking the subtree of a host that was never actually advised. In a codebase where the same probe manager also drives the Read-path short-circuit and subtree quality invalidation (the 2026-04-13 feature), a false negative here fans out into hundreds of BadOutOfService flags on live variables. Fix: promote `toSubscribe` to `List<(int GobjectId, string Probe)>` so the catch path can reacquire `_lock` and remove the phantom entry from both dictionaries. The rollback compares against the captured probe string before removing so a concurrent resync cannot accidentally delete a legitimate re-add.
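A hedged, self-contained condensation of the rollback shape (not the real `SyncAsync`, which batches entries from `toSubscribe`; the field names mirror the text above, everything else is illustrative):

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

// Illustrative sketch only: pre-populate optimistically, then roll back on a
// failed advise, guarding against a concurrent resync that re-added the entry.
sealed class ProbeRollbackSketch
{
    readonly object _lock = new();
    readonly Dictionary<string, int> _byProbe = new();          // probe -> gobject id
    readonly Dictionary<int, string> _probeByGobjectId = new(); // gobject id -> probe
    readonly Func<string, Task> _subscribeAsync;                // stand-in for the advise call

    public ProbeRollbackSketch(Func<string, Task> subscribeAsync) => _subscribeAsync = subscribeAsync;

    public async Task SyncOneAsync(int gobjectId, string probe)
    {
        lock (_lock)
        {
            _byProbe[probe] = gobjectId;            // pre-populated before the await
            _probeByGobjectId[gobjectId] = probe;
        }
        try
        {
            await _subscribeAsync(probe);           // the advise call that can throw
        }
        catch
        {
            lock (_lock)                            // reacquire after the await
            {
                // Only roll back if the mapping still points at the probe string we
                // captured: a concurrent resync may have re-added the same gobject
                // under a new probe name, and that legitimate entry must survive.
                if (_probeByGobjectId.TryGetValue(gobjectId, out var current) && current == probe)
                {
                    _probeByGobjectId.Remove(gobjectId);
                    _byProbe.Remove(probe);
                }
            }
        }
    }

    public bool Contains(int gobjectId) { lock (_lock) return _probeByGobjectId.ContainsKey(gobjectId); }
}
```

Without the rollback branch, the phantom `_probeByGobjectId` entry survives the throw and is exactly what `Tick()` later flips to Stopped.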
**Finding 2 (Medium) — Surface dashboard bind failure.**
`StatusWebServer.Start()` already returned `bool`, but `OpcUaService.Start()` ignored it, so a failed bind (port in use, permissions) was invisible at the service level. Fix: capture the return value; on `false`, log a Warning (`Status dashboard failed to bind on port {Port}; service continues without dashboard`), dispose the unstarted instance, and set a new `OpcUaService.DashboardStartFailed` property. This degraded-mode policy matches the established precedent for other optional startup subsystems (MxAccess connect, Galaxy DB connect, initial address space build).
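The degraded-mode wiring, as a hypothetical condensation (`StatusWebServerStub` is a stand-in; only the property name and log text come from the fix description above):

```csharp
using System;

// Stand-in for StatusWebServer: Start() returning false simulates a failed bind.
sealed class StatusWebServerStub : IDisposable
{
    public int Port { get; init; }
    public bool Start() => false;
    public void Dispose() { }
}

// Illustrative sketch of the degraded-mode dashboard start in OpcUaService.Start().
sealed class OpcUaServiceSketch
{
    public bool DashboardStartFailed { get; private set; }
    public StatusWebServerStub? StatusWeb { get; private set; }

    public void StartDashboard(int port)
    {
        var web = new StatusWebServerStub { Port = port };
        if (!web.Start())
        {
            Console.Error.WriteLine(
                $"Status dashboard failed to bind on port {port}; service continues without dashboard");
            web.Dispose();               // release the unstarted instance
            DashboardStartFailed = true; // surfaced so operators and tests can see degraded mode
            return;                      // StatusWeb stays null in degraded mode
        }
        StatusWeb = web;
    }
}
```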
**Finding 3 (Medium) — Bounded timeouts on sync-over-async.**
Seven sync-over-async `.GetAwaiter().GetResult()` sites in `LmxNodeManager` (rebuild probe sync, Read, Write, HistoryReadRaw/Processed/AtTime/Events) blocked the OPC UA stack thread without an outer bound. Inner `MxAccessClient.ReadAsync` / `WriteAsync` already apply per-call `CancelAfter`, but `SubscribeAsync`, `SyncAsync`, and the historian reads did not — and the pattern itself is a stability risk regardless of inner behavior. Fix: new `SyncOverAsync.WaitSync(task, timeout, operation)` helper + two new config fields `MxAccess.RequestTimeoutSeconds=30` and `Historian.RequestTimeoutSeconds=60`. Every sync-over-async site now wraps the task in `WaitSync`, catches `TimeoutException` explicitly, and maps to `StatusCodes.BadTimeout` (or logs a warning and continues in the rebuild case — probe sync is advisory). `ConfigurationValidator` rejects `RequestTimeoutSeconds < 1` for both and warns when operators set the outer bound below the inner read/write and command timeouts.
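A minimal sketch of what a `WaitSync` helper with inner-exception unwrap could look like; only the `WaitSync<T>(Task<T>, TimeSpan, string)` signature and the timeout-without-cancellation behavior come from this entry, the body is an assumption:

```csharp
using System;
using System.Runtime.ExceptionServices;
using System.Threading.Tasks;

// Illustrative sketch of the SyncOverAsync.WaitSync helper (generic overload only).
static class SyncOverAsync
{
    public static T WaitSync<T>(Task<T> task, TimeSpan timeout, string operation)
    {
        if (task is null) throw new ArgumentNullException(nameof(task));
        try
        {
            // Bounded block: the underlying task is NOT cancelled on timeout; it runs
            // to completion on the thread pool and is abandoned (see risk notes).
            if (!task.Wait(timeout))
                throw new TimeoutException($"{operation} did not complete within {timeout.TotalSeconds:0}s");
        }
        catch (AggregateException ex) when (ex.InnerException is not null)
        {
            // Unwrap: rethrow the original failure with its stack intact, so callers
            // see e.g. the backend exception rather than AggregateException.
            ExceptionDispatchInfo.Capture(ex.InnerException).Throw();
        }
        return task.GetAwaiter().GetResult(); // completed successfully at this point
    }
}
```

A caller on the OPC UA read path would then catch `TimeoutException` and map it to `StatusCodes.BadTimeout`.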
**Finding 4 (Low) — Track fire-and-forget subscribes.**
Alarm auto-subscribe, subtree alarm auto-subscribe, and transferred-subscription restore all called `_mxAccessClient.SubscribeAsync(...).ContinueWith(..., OnlyOnFaulted)` with no tracking, so shutdown raced pending subscribes and ordering was impossible to reason about. Fix: new `TrackBackgroundSubscribe(tag, context)` helper in `LmxNodeManager` that stashes the task in `_pendingBackgroundSubscribes` (a `ConcurrentDictionary<long, Task>` with a monotonic `Interlocked.Increment` id), and a continuation that removes the entry and logs faults with the supplied context. `Dispose(bool)` drains the dictionary with `Task.WaitAll(snapshot, 5s)` after stopping the dispatch thread — bounded so shutdown cannot stall on a hung backend, and logged at Info so operators can see the drain count.
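The tracking-and-drain mechanics can be sketched roughly like this (a stand-alone condensation; `Console` stands in for the real logger, and the real helper lives inside `LmxNodeManager`):

```csharp
using System;
using System.Collections.Concurrent;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

// Illustrative sketch of TrackBackgroundSubscribe + the bounded Dispose-time drain.
sealed class BackgroundSubscribeTracker
{
    readonly ConcurrentDictionary<long, Task> _pendingBackgroundSubscribes = new();
    long _backgroundSubscribeCounter;

    public void TrackBackgroundSubscribe(Task subscribeTask, string context)
    {
        var id = Interlocked.Increment(ref _backgroundSubscribeCounter); // monotonic key
        _pendingBackgroundSubscribes[id] = subscribeTask;
        subscribeTask.ContinueWith(t =>
        {
            _pendingBackgroundSubscribes.TryRemove(id, out _);           // self-cleaning
            if (t.IsFaulted)
                Console.Error.WriteLine(
                    $"Background subscribe failed ({context}): {t.Exception?.GetBaseException().Message}");
        }, TaskScheduler.Default);
    }

    // Called from Dispose(bool) after the dispatch thread stops. Bounded wait so a
    // hung backend cannot stall shutdown; faults were already logged above.
    public void DrainPendingBackgroundSubscribes(TimeSpan timeout)
    {
        var snapshot = _pendingBackgroundSubscribes.Values.ToArray();
        if (snapshot.Length == 0) return;
        try { Task.WaitAll(snapshot, timeout); }
        catch (AggregateException) { /* already surfaced by the continuations */ }
        Console.WriteLine($"Drained {snapshot.Length} pending background subscribe(s)");
    }
}
```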
### Code changes
- `src/ZB.MOM.WW.LmxOpcUa.Host/MxAccess/GalaxyRuntimeProbeManager.cs` — `toSubscribe` carries the gobject id; catch path rolls back both dictionaries under `_lock`, with concurrent-overwrite guard.
- `src/ZB.MOM.WW.LmxOpcUa.Host/OpcUaService.cs` — capture `StatusWeb.Start()` return; `DashboardStartFailed` internal property; dispose unstarted instance on failure.
- `src/ZB.MOM.WW.LmxOpcUa.Host/Utilities/SyncOverAsync.cs` (new) — `WaitSync<T>(Task<T>, TimeSpan, string)` and non-generic overload with inner-exception unwrap.
- `src/ZB.MOM.WW.LmxOpcUa.Host/Configuration/MxAccessConfiguration.cs` — `RequestTimeoutSeconds: int = 30`.
- `src/ZB.MOM.WW.LmxOpcUa.Host/Configuration/HistorianConfiguration.cs` — `RequestTimeoutSeconds: int = 60`.
- `src/ZB.MOM.WW.LmxOpcUa.Host/Configuration/ConfigurationValidator.cs` — logs both new values, rejects `< 1`, warns on inner/outer misorder.
- `src/ZB.MOM.WW.LmxOpcUa.Host/OpcUa/LmxNodeManager.cs` — constructor takes the two new timeout values (with defaults); seven sync-over-async call sites wrapped in `SyncOverAsync.WaitSync` + `TimeoutException → BadTimeout` catch; `TrackBackgroundSubscribe` helper; `_pendingBackgroundSubscribes` dictionary + `_backgroundSubscribeCounter`; `DrainPendingBackgroundSubscribes()` in `Dispose`; three fire-and-forget sites replaced with helper calls.
- `src/ZB.MOM.WW.LmxOpcUa.Host/OpcUa/LmxOpcUaServer.cs` — constructor plumbing for the two new timeouts.
- `src/ZB.MOM.WW.LmxOpcUa.Host/OpcUa/OpcUaServerHost.cs` — accepts `HistorianConfiguration`, threads both timeouts through to `LmxOpcUaServer`.
### Tests
- `tests/.../MxAccess/GalaxyRuntimeProbeManagerTests.cs` — 3 new tests: `Sync_SubscribeThrows_DoesNotLeavePhantomEntry`, `Sync_SubscribeThrows_TickDoesNotFireStopCallback`, `Sync_SubscribeSucceedsAfterRetry_AppearsInSnapshot`. Use the existing `FakeMxAccessClient.SubscribeException` hook — no helper changes needed.
- `tests/.../Status/StatusWebServerTests.cs` — 1 new test: `Start_WhenPortInUse_ReturnsFalse`. Grabs a port with a throwaway `HttpListener`, tries to start `StatusWebServer` on the same port, asserts `Start()` returns `false`.
- `tests/.../Wiring/OpcUaServiceDashboardFailureTests.cs` (new) — 1 test: `Start_DashboardPortInUse_ContinuesInDegradedMode`. Builds a full `OpcUaService` with `FakeMxProxy` + `FakeGalaxyRepository`, binds the dashboard port externally, starts the service, asserts `ServerHost != null`, `DashboardStartFailed == true`, `StatusWeb == null`.
- `tests/.../Utilities/SyncOverAsyncTests.cs` (new) — 7 tests covering happy path, never-completing task → TimeoutException with operation name, faulted task → inner exception unwrap, null-task arg check.
- `tests/.../Configuration/ConfigurationLoadingTests.cs` — 3 new tests: `Validator_MxAccessRequestTimeoutZero_ReturnsFalse`, `Validator_HistorianRequestTimeoutZero_ReturnsFalse`, `Validator_DefaultRequestTimeouts_AreSensible`.
**Test results:** full suite **486/486** passing. The first run hit a single transient failure in `ChangeDetectionServiceTests.ChangedTimestamp_TriggersAgain` (a pre-existing timing-sensitive test — poll interval 1s with 500ms + 1500ms sleeps races under load); the test passes on retry and is unrelated to these changes. The 15 new tests added by this pass were green on both runs.
### Documentation updates
- `docs/MxAccessBridge.md` — Runtime-status section gains a new point 5 documenting the subscribe-failure rollback; new "Request Timeout Safety Backstop" section describing the outer `RequestTimeoutSeconds` bound.
- `docs/HistoricalDataAccess.md` — Config class snippet and property table updated with `RequestTimeoutSeconds`.
- `docs/ServiceHosting.md` — Step 12 (startup sequence) documents the degraded-mode dashboard policy and the new `DashboardStartFailed` flag.
- `docs/Configuration.md` — `MxAccess.RequestTimeoutSeconds` (30s) and `Historian.RequestTimeoutSeconds` (60s) added to both the property tables and the `appsettings.json` full example.
- `docs/StatusDashboard.md` — New subsection "Dashboard start failures are non-fatal" with the log grep operators should use.
### Deploy plan (not yet executed)
This is a code-only change; the built binary has not been copied to `C:\publish\lmxopcua\instance1` / `instance2` yet. When deploying, follow the procedure from the 2026-04-13 runtime-status deploy (service_info.md:572-680):
1. Backup `C:\publish\lmxopcua\instance1` and `instance2` to `backups\20260414-<HHMMSS>-instance{1,2}`. Preserve each `appsettings.json`.
2. Build the Host project in Release and copy `ZB.MOM.WW.LmxOpcUa.Host.exe` (and any changed DLLs) to both instance roots. The Historian plugin layout at `<instance>\Historian\` is unchanged.
3. Restart the `LmxOpcUa` and `LmxOpcUa2` Windows services.
4. In the startup log for each instance, verify the new config echoes appear:
- `MxAccess.RuntimeStatusProbesEnabled=..., RuntimeStatusUnknownTimeoutSeconds=15s, RequestTimeoutSeconds=30s`
- `Historian.CommandTimeoutSeconds=30, MaxValuesPerRead=10000, FailureCooldownSeconds=60, RequestTimeoutSeconds=60`
- `=== Configuration Valid ===`
- `LmxOpcUa service started successfully`
5. CLI smoke test on both endpoints (matches the 2026-03-25 baseline at service_info.md:370-376):
- `opcuacli-dotnet.exe connect -u opc.tcp://localhost:4840/LmxOpcUa`
- `opcuacli-dotnet.exe connect -u opc.tcp://localhost:4841/LmxOpcUa`
- `opcuacli-dotnet.exe redundancy -u opc.tcp://localhost:4840/LmxOpcUa` → ServiceLevel=200 (primary)
- `opcuacli-dotnet.exe redundancy -u opc.tcp://localhost:4841/LmxOpcUa` → ServiceLevel=150 (secondary)
6. Runtime-status regression check (the most sensitive thing these fixes could break): repeat the IDE OffScan / OnScan cycle documented at service_info.md:630-669. Dashboard at `http://localhost:8085/` must go red on OffScan, green on OnScan; `TestMachine_001.MachineID` must flip between `0x808D0000 BadOutOfService` and `0x00000000 Good` at each transition with the same sub-100ms latency as the original deploy.
7. Record PIDs and live verification results in a follow-up section of this file (`## Stability Review Fixes 2026-04-14 — Deploy`), matching the layout conventions from earlier entries.
### Finding 1 manual regression check
Before the regression test landed, the only way to exercise the bug in production was to temporarily revoke the MxAccess user's probe subscription permission or point the probe manager at a non-existent host. After the fix, the same scenarios should leave `GetSnapshot()` empty (no phantom entries) and the dashboard Galaxy Runtime panel should read `0 of N hosts` rather than `0 running, N stopped`. The three new `GalaxyRuntimeProbeManagerTests` cover this deterministically via `FakeMxAccessClient.SubscribeException` so a future regression is caught at CI time.
### Risk notes
- **Timeout floor discipline.** The two new `RequestTimeoutSeconds` values have conservative defaults (30s MxAccess, 60s Historian). Setting them too low would cause spurious `BadTimeout` errors on a slow-but-healthy backend. `ConfigurationValidator` rejects `< 1` and warns below inner timeouts so misconfiguration is visible at startup.
- **Abandoned tasks on timeout.** `SyncOverAsync.WaitSync` does not cancel the underlying task — it runs to completion on the thread pool and is abandoned. This is acceptable because MxAccess / Historian clients are shared singletons whose background work does not capture request-scoped state.
- **Background subscribe drain window.** 5 seconds is enough for healthy subscribes to settle but not long enough to stall shutdown if MxAccess is hung. If drain times out, shutdown continues — this is intentional.
- **Probe rollback concurrency.** The catch path reacquires `_lock` after `await`. A concurrent `SyncAsync` may have re-added the same gobject under a new probe name; the code compares against the captured probe string before removing, so a legitimate re-add is not clobbered.
## Stability Review Fixes 2026-04-14 — Deploy
Updated: `2026-04-14 00:40-00:43 America/New_York`
Both instances redeployed with the stability-review fixes documented above. Closes all four findings from `docs/stability-review-20260413.md` on the live services.
Backups:
- `C:\publish\lmxopcua\backups\20260414-003948-instance1` — pre-deploy `ZB.MOM.WW.LmxOpcUa.Host.exe` (7,997,952 bytes) + `appsettings.json`
- `C:\publish\lmxopcua\backups\20260414-003948-instance2` — pre-deploy `ZB.MOM.WW.LmxOpcUa.Host.exe` (7,997,952 bytes) + `appsettings.json`
Configuration preserved:
- Both `appsettings.json` were not overwritten. The two new fields (`MxAccess.RequestTimeoutSeconds`, `Historian.RequestTimeoutSeconds`) inherit their defaults from the binary (30s and 60s respectively). Operators can opt into explicit values by editing `appsettings.json`; defaults are logged at startup regardless.
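For operators opting into explicit values, the corresponding `appsettings.json` fragment would look like this (defaults shown; all surrounding keys omitted):

```json
{
  "MxAccess": {
    "RequestTimeoutSeconds": 30
  },
  "Historian": {
    "RequestTimeoutSeconds": 60
  }
}
```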
Deployed binary (both instances):
- `ZB.MOM.WW.LmxOpcUa.Host.exe`
- Last write time: `2026-04-14 00:40:48 -04:00`
- Size: `7,986,688` (down 11,264 bytes from the previous build — three fire-and-forget `.ContinueWith` blocks were replaced with a single `TrackBackgroundSubscribe` helper)
Pre-deploy state note: both services were STOPPED when the deploy started (`sc.exe query` reported `WIN32_EXIT_CODE=1067`), but two host processes were still alive (`tasklist` showed PID 34828 holding instance1 and PID 27036 holding instance2). The zombie processes held open file handles on both exes, so the SCM's STOPPED state was misleading — the previous service processes were still running outside SCM control. They were terminated with `taskkill //F` before copying the new binary. This was a one-time clean-up; future deploys should not need it.
Windows services:
- `LmxOpcUa` — Running, PID `32884`
- `LmxOpcUa2` — Running, PID `40796`
Restart evidence (instance1 `logs/lmxopcua-20260414.log`):
```
2026-04-14 00:40:55.759 -04:00 [INF] MxAccess.RuntimeStatusProbesEnabled=true, RuntimeStatusUnknownTimeoutSeconds=15s, RequestTimeoutSeconds=30s
2026-04-14 00:40:55.791 -04:00 [INF] Historian.CommandTimeoutSeconds=30, MaxValuesPerRead=10000, FailureCooldownSeconds=60, RequestTimeoutSeconds=60
2026-04-14 00:40:55.794 -04:00 [INF] === Configuration Valid ===
2026-04-14 00:41:02.406 -04:00 [INF] Historian plugin loaded from C:\publish\lmxopcua\instance1\Historian\ZB.MOM.WW.LmxOpcUa.Historian.Aveva.dll
2026-04-14 00:41:06.870 -04:00 [INF] LmxOpcUa service started successfully
```
Restart evidence (instance2 `logs/lmxopcua-20260414.log`):
```
2026-04-14 00:40:56.812 -04:00 [INF] MxAccess.RuntimeStatusProbesEnabled=true, RuntimeStatusUnknownTimeoutSeconds=15s, RequestTimeoutSeconds=30s
2026-04-14 00:40:56.847 -04:00 [INF] Historian.CommandTimeoutSeconds=30, MaxValuesPerRead=10000, FailureCooldownSeconds=60, RequestTimeoutSeconds=60
2026-04-14 00:40:56.850 -04:00 [INF] === Configuration Valid ===
2026-04-14 00:41:07.805 -04:00 [INF] Historian plugin loaded from C:\publish\lmxopcua\instance2\Historian\ZB.MOM.WW.LmxOpcUa.Historian.Aveva.dll
2026-04-14 00:41:12.008 -04:00 [INF] LmxOpcUa service started successfully
```
The two new `RequestTimeoutSeconds` values are visible in both startup traces, confirming the new configuration plumbing reached `ConfigurationValidator`. Startup latency (config-valid → service-started): instance1 ~11.1s, instance2 ~15.2s — within the normal envelope for the Historian plugin load sequence.
CLI verification (via `dotnet run --project src/ZB.MOM.WW.LmxOpcUa.Client.CLI`):
```
connect opc.tcp://localhost:4840/LmxOpcUa → Server: LmxOpcUa, Security: None, Connection successful
connect opc.tcp://localhost:4841/LmxOpcUa → Server: LmxOpcUa2, Security: None, Connection successful
redundancy opc.tcp://localhost:4840/LmxOpcUa → Warm, ServiceLevel=200, urn:localhost:LmxOpcUa:instance1
redundancy opc.tcp://localhost:4841/LmxOpcUa → Warm, ServiceLevel=150, urn:localhost:LmxOpcUa:instance2
read opc.tcp://localhost:4840/LmxOpcUa -n 'ns=3;s=MESReceiver_001.MoveInPartNumbers'
→ Value: System.String[]
→ Status: 0x00000000
→ Source: 2026-04-14T04:43:46.2267096Z
```
Primary advertises ServiceLevel 200, secondary advertises 150 — redundancy baseline preserved. End-to-end data flow is healthy: the Read on `MESReceiver_001.MoveInPartNumbers` returns Good quality with a fresh source timestamp, confirming MxAccess is connected and the address space is populated. Note that the namespace is `ns=3` now, not the `ns=1` listed in the 2026-03-25 baseline at the top of this file — the auth-consolidation deploy on 2026-03-28 moved the Galaxy namespace to `ns=3` and that move has carried through every deploy since. The top-of-file `ns=1` CLI example should be treated as historical.
### CLI tooling note
The earlier service_info.md entry referenced `tools/opcuacli-dotnet/bin/Debug/net10.0/opcuacli-dotnet.exe`. That binary does not exist on the current checkout; the CLI lives at `src/ZB.MOM.WW.LmxOpcUa.Client.CLI/` and must be invoked via `dotnet run --project src/ZB.MOM.WW.LmxOpcUa.Client.CLI`. The README / `docs/Client.CLI.md` should be the source of truth going forward.
### Runtime-status regression check
**Not performed in this deploy.** The runtime-status subtree-walk / Read-short-circuit regression check from service_info.md:630-669 requires an operator to flip a `$AppEngine` OffScan in the AVEVA IDE and observe the dashboard + Read behavior, which needs a real operator session. The automated CLI smoke test above does not exercise the probe-manager callback path.
The code changes in this deploy are defensive and do not alter the runtime-status feature's control flow except in one place (subscribe rollback, which only triggers when `SubscribeAsync` throws). The 471 pre-existing baseline tests plus the three new rollback regressions in `GalaxyRuntimeProbeManagerTests` give high confidence that the runtime-status behavior is preserved. If a human operator runs the IDE OffScan/OnScan cycle and observes an anomaly, the fix is most likely isolated to `GalaxyRuntimeProbeManager.SyncAsync` — see Finding 1 above — and can be reverted by restoring `C:\publish\lmxopcua\backups\20260414-003948-instance{1,2}\ZB.MOM.WW.LmxOpcUa.Host.exe`.
## Notes
The service deployment and restart succeeded. The live CLI checks confirm the endpoint is reachable and that the array node identifier has changed to the bracketless form. The array value on the live service still prints as blank even though the status is good, so if this environment should have populated `MoveInPartNumbers`, the runtime data path still needs follow-up investigation.