From f2ea751e2b9b6d1be40f715b8e80d7b8b4cd09b9 Mon Sep 17 00:00:00 2001
From: Joseph Doherty
Date: Mon, 13 Apr 2026 16:24:18 -0400
Subject: [PATCH] Document the Galaxy runtime status deploy so operators can reconstruct the stop/start verification sequence, the two bugs found in-flight, and the phase-2 client-freeze decision gate without having to dig through the plan file or chat transcript

Co-Authored-By: Claude Opus 4.6 (1M context)
---
 service_info.md | 115 ++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 115 insertions(+)

diff --git a/service_info.md b/service_info.md
index 77a33d1..24ddb11 100644
--- a/service_info.md
+++ b/service_info.md
@@ -563,6 +563,121 @@ Operator configuration shape:
 }
 ```
 
+## Galaxy Runtime Status Probes + Subtree Quality Invalidation
+
+Updated: `2026-04-13 15:28-16:19 America/New_York`
+
+Both instances updated with per-host Galaxy runtime status tracking ($WinPlatform + $AppEngine), proactive subtree quality invalidation when a host transitions to Stopped, and an OPC UA Read short-circuit so operators can no longer read stale-Good cached values from a dead runtime host.
+
+This ships the feature described in the `runtimestatus.md` plan file. Addresses the production issue reported earlier: "when an AppEngine is set to scan off, LMX updates are received for every tag, causing OPC UA client freeze and sometimes not all OPC UA tags are set to bad quality."
+
+Backups:
+- `C:\publish\lmxopcua\backups\20260413-152824-instance1`
+- `C:\publish\lmxopcua\backups\20260413-152824-instance2`
+
+Deployed binary (both instances):
+- `ZB.MOM.WW.LmxOpcUa.Host.exe`, commit `98ed6bd`
+- Three deploys during verification: 15:28 (initial), followed by two incremental deploys at 15:52 (Read-handler patch) and 16:06 (dispatch-thread deadlock fix)
+
+Windows services:
+- `LmxOpcUa`: Running, PID `29528`
+- `LmxOpcUa2`: Running, PID `30684`
+
+### Code changes: what shipped
+
+**New config** in `MxAccessConfiguration`:
+- `RuntimeStatusProbesEnabled: bool` (default `true`): enables `.ScanState` probing for every deployed `$WinPlatform` and `$AppEngine`.
+- `RuntimeStatusUnknownTimeoutSeconds: int` (default `15`): only applies to the Unknown → Stopped transition; running hosts never time out because `ScanState` is delivered on-change only.
+
+**New hierarchy columns** in `hierarchy.sql` and `GalaxyObjectInfo`:
+- `CategoryId: int`: populated from `template_definition.category_id` (1 = $WinPlatform, 3 = $AppEngine).
+- `HostedByGobjectId: int`: populated from `gobject.hosted_by_gobject_id` (the actual column name on this Galaxy schema; the plan document's guess of `host_gobject_id` was wrong). Walked up to find each variable's nearest Platform/Engine ancestor.
+
+**New domain types** in `Host/Domain/`:
+- `GalaxyRuntimeState` enum (`Unknown` / `Running` / `Stopped`).
+- `GalaxyRuntimeStatus` DTO with callback/state-change timestamps, `LastScanState`, `LastError`, cumulative counters.
+
+**New probe manager** in `Host/MxAccess/GalaxyRuntimeProbeManager.cs`:
+- Pure manager, no SDK leakage. `AdviseSupervisory`s `.ScanState` for every runtime host on `SyncAsync`.
+- State predicate: `isRunning = vtq.Quality.IsGood() && vtq.Value is bool b && b`. Everything else is Stopped.
+- `GetSnapshot()` forces every entry to `Unknown` when the MxAccess transport is disconnected. This prevents a misleading "every host stopped" display when the actual problem is the transport.
+- `Tick()` only advances Unknown → Stopped on the configured timeout; Running hosts never time out (on-change delivery semantic).
+- `IsHostStopped(gobjectId)`: used by the Read-path short-circuit; reads the underlying state directly (not the snapshot's force-unknown rewrite) so a transport outage doesn't double-flag reads.
+- `Dispose()` unadvises every active probe before MxAccess teardown.
+
+**New hosted-variables map** in `LmxNodeManager`:
+- `_hostedVariables` dictionary: host gobject_id → list of every descendant variable, populated during `BuildAddressSpace` by walking each variable's `HostedByGobjectId` chain up to the nearest Platform/Engine. A variable hosted by an Engine inside a Platform appears in BOTH lists.
+- `_hostIdsByTagRef` dictionary: reverse index used by the Read short-circuit, populated alongside `_hostedVariables`.
+- Public `MarkHostVariablesBadQuality(int gobjectId)`: walks `_hostedVariables[gobjectId]`, sets `StatusCode = BadOutOfService` on each, and calls `ClearChangeMasks(ctx, false)` to push the change through the OPC UA publisher.
+- Public `ClearHostVariablesBadQuality(int gobjectId)`: the inverse; resets to `Good` on recovery.
+
+**OPC UA Read short-circuit** in `LmxNodeManager.Read`:
+- Before the normal `_mxAccessClient.ReadAsync(tagRef)` round-trip, check `IsTagUnderStoppedHost(tagRef)`. If true, return a `DataValue { StatusCode = BadOutOfService, Value = cachedVar?.Value }` directly. Covers both direct Read requests AND OPC UA monitored-item sampling, which both flow through this override.
+
+**Deadlock fix: `_pendingHostStateChanges` queue**:
+- First draft invoked `MarkHostVariablesBadQuality` synchronously from the probe callback. MxAccess delivers `OnDataChange` on the STA thread; the callback took the node manager `Lock`. Meanwhile any worker thread inside `Read` could hold `Lock` and wait on a pending `ReadAsync` that needed the STA thread: a **classic STA deadlock** (first real deploy hung in ~30s).
+- Fix: probe transitions are enqueued on `ConcurrentQueue<(int GobjectId, bool Stopped)>` and the dispatch thread drains the queue inside its existing 100ms `WaitOne` loop. The dispatch thread takes `Lock` naturally without STA involvement, so no cycle. Live verified with the IDE OffScan/OnScan cycle after the fix.
+
+**Dashboard** in `Host/Status/`:
+- New `RuntimeStatusInfo` DTO + "Galaxy Runtime" panel between Galaxy Info and Historian. Shows total/running/stopped/unknown counts plus a per-host table with Name / Kind / State / Since / Last Error columns. Panel color: green (all Running), yellow (some Unknown, none Stopped), red (any Stopped), gray (MxAccess disconnected forces every row to Unknown).
+- Subscriptions panel gets a new `Probes: N (bridge-owned runtime status)` line when non-zero.
+- `HealthCheckService` Rule 2e: `Degraded` when any host is Stopped, ordered after Rule 1 (MxAccess transport) to avoid double-messaging when the transport is the root cause.
+
+### Tests
+- **24** new `GalaxyRuntimeProbeManagerTests`: state transitions (Unknown/Running/Stopped/recovery), unknown-resolution timeout, transport gating, sync diff, dispose, callback exception safety, `IsHostStopped` for Read-path short-circuit (Unknown/Running/Stopped/recovery/unknown-id/transport-disconnected-contract).
+- Full Host suite: **471/471** tests passing. No regressions.
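For readers who want the shape of the deadlock fix without opening the source, here is a minimal sketch of the enqueue-then-drain pattern described under "Deadlock fix" above. It is written in Python for brevity, while the service itself is .NET. Every name in it (`NodeManagerSketch`, `on_probe_transition`, `drain_pending`, `dispatch_loop`) is hypothetical and chosen for illustration; it is not the `LmxNodeManager` implementation. The only values taken from this document are the `0x808D0000` / `0x00000000` status codes and the 100ms wait.

```python
import queue
import threading

BAD_OUT_OF_SERVICE = 0x808D0000  # status code seen in the verification output
GOOD = 0x00000000


class NodeManagerSketch:
    """Hypothetical stand-in for the node manager; illustration only."""

    def __init__(self, hosted_variables):
        # host gobject_id -> tag references hosted under that Platform/Engine
        self._hosted_variables = hosted_variables
        self._status = {
            tag: GOOD for tags in hosted_variables.values() for tag in tags
        }
        self._lock = threading.Lock()
        # Thread-safe queue standing in for the ConcurrentQueue of transitions.
        self._pending = queue.SimpleQueue()

    def on_probe_transition(self, gobject_id, stopped):
        """Runs on the callback thread: enqueue only, never touch the lock."""
        self._pending.put((gobject_id, stopped))

    def drain_pending(self):
        """Runs on the dispatch thread: apply queued transitions under the lock."""
        applied = 0
        while True:
            try:
                gobject_id, stopped = self._pending.get_nowait()
            except queue.Empty:
                return applied
            code = BAD_OUT_OF_SERVICE if stopped else GOOD
            with self._lock:
                for tag in self._hosted_variables.get(gobject_id, ()):
                    self._status[tag] = code
            applied += 1

    def status_of(self, tag):
        with self._lock:
            return self._status[tag]


def dispatch_loop(manager, wake, shutdown):
    """Dispatch thread body: wake on a signal or every 100 ms, then drain."""
    while not shutdown.is_set():
        wake.wait(timeout=0.1)
        wake.clear()
        manager.drain_pending()
    manager.drain_pending()  # final drain so nothing queued is lost on exit
```

The point of the split is that the callback thread never waits on the lock, so a reader holding the lock while it waits on that same thread can no longer form a cycle. The cost is that the quality change lands on the next drain rather than inside the callback.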
+
+### Live end-to-end verification (today, against real IDE OffScan action)
+
+**Baseline** (before OffScan, dashboard at 15:44:00):
+```
+Galaxy Runtime: green, 2 of 2 hosts running
+DevAppEngine $AppEngine Running 2026-04-13T19:29:12.9475357Z
+DevPlatform $WinPlatform Running 2026-04-13T19:29:12.9345208Z
+TestMachine_001.MachineID → Status 0x00000000 (Good), value "admin_test"
+```
+
+**After operator Set OffScan on DevAppEngine in IDE** (log at 15:44:25):
+```
+15:44:25.554 Galaxy runtime DevAppEngine.ScanState transitioned Running → Stopped (ScanState = false (OffScan))
+15:44:25.557 Marked 3971 variable(s) BadOutOfService for stopped host gobject_id=1043
+```
+Dashboard: red panel, `1 of 2 hosts running (1 stopped, 0 unknown)`. Health: `Degraded — Galaxy runtime has 1 of 2 host(s) stopped: DevAppEngine`. Key measurement: 3ms from probe callback to completed subtree walk.
+
+**Read during stop: found bug #1** (Read handler bypassed cached state):
+- Initial deploy: `TestMachine_001.MachineID` still read `0x00000000` Good with a post-stop source time from MxAccess. This revealed that `LmxNodeManager.Read` calls `_mxAccessClient.ReadAsync()` directly and never consults the in-memory `BaseDataVariableState.StatusCode` we set during the walk.
+- Fix: `IsTagUnderStoppedHost` short-circuit in Read override. After patch: `[808D0000] BadOutOfService` on all three test tags.
+
+**Read during stop: found bug #2** (deadlock):
+- After shipping the Read patch, the service hung on the next OffScan. HTTP listener accepted connections but never responded, and service shutdown stuck at STOP_PENDING for 15+ seconds until manually killed.
+- Diagnosis: the probe callback fires `HandleProbeUpdate` → `MarkHostVariablesBadQuality` → acquires `Lock` on the STA thread. Meanwhile a worker thread can sit inside `Read` holding `Lock` and waiting for an STA-routed `ReadAsync`. Circular wait.
+- Fix: enqueue probe transitions onto `ConcurrentQueue` and drain on the dispatch thread where `Lock` acquisition is safe. The 16:06 deploy resolved the hang.
+
+**A/B verification** (instance1 patched, instance2 not yet):
+| Instance | `TestMachine_001.MachineID` |
+|---|---|
+| `LmxOpcUa` (patched) | `0x808D0000` BadOutOfService ✅ |
+| `LmxOpcUa2` (old) | `0x00000000` Good, stale ❌ |
+
+Clean A/B confirmed the Read patch is required; instance2 subsequently updated to match.
+
+**Recovery** (operator Set OnScan on DevAppEngine, log at 16:10:05):
+```
+16:10:05.129 Galaxy runtime DevAppEngine.ScanState transitioned → Running
+16:10:05.130 Cleared bad-quality override on 3971 variable(s) for recovered host gobject_id=1043
+```
+Dashboard: back to green, `DevAppEngine` Running with new `Since = 20:10:05.129Z`. All three test tags back to `0x00000000` Good with fresh source timestamps. 1ms from probe callback to subtree clear.
+
+### Client freeze observation: phase 2 decision gate
+
+The original production issue had two symptoms: (1) incomplete quality flip and (2) OPC UA client freeze. The subtree walk + Read short-circuit fixes (1) definitively. For (2), MxAccess still fans out per-tag callbacks when a host stops, and these flood the dispatch queue; the bridge doesn't currently drop them. We **deliberately did not** ship dispatch suppression in this pass, on the grounds that the subtree walk may coalesce notifications sufficiently at the SDK publisher level to resolve the freeze on its own. The verification against the live Galaxy ran with no OPC UA clients subscribed, so it doesn't tell us one way or the other; the next subscribed-client test against a real stop will be the deciding measurement. If the client still freezes after the walk, phase 2 adds pre-dispatch filtering for tags under Stopped hosts.
+
+### What's deferred
+
+- **Synthetic OPC UA child nodes** (`$RuntimeState`, `$LastCallbackTime`, etc.) under each host object. Dashboard + health surface give operators visibility today; the OPC UA synthetic nodes are a follow-up.
+- **Dispatch suppression**: gated on observing whether the subtree walk alone resolves the client freeze in production.
+- **Documentation updates**: the `docs/` guides (`MxAccessBridge.md`, `StatusDashboard.md`, `Configuration.md`, `HistoricalDataAccess.md`) still describe the pre-runtime-status behavior. Need a consolidated doc pass covering this feature plus the historian cluster + health surface updates from earlier today.
+
 ## Notes
 
 The service deployment and restart succeeded. The live CLI checks confirm the endpoint is reachable and that the array node identifier has changed to the bracketless form. The array value on the live service still prints as blank even though the status is good, so if this environment should have populated `MoveInPartNumbers`, the runtime data path still needs follow-up investigation.