Document the Galaxy runtime status feature across the architecture guides so operators and future maintainers can find probe machinery, config fields, dashboard panel, and HealthCheck Rule 2e without having to dig through runtimestatus.md or service_info.md

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Joseph Doherty
2026-04-13 16:36:35 -04:00
parent f2ea751e2b
commit 0003984c1a
4 changed files with 88 additions and 4 deletions


@@ -74,6 +74,8 @@ Controls the MXAccess runtime connection used for live tag reads and writes. Def
| `AutoReconnect` | `bool` | `true` | Automatically re-establish dropped MXAccess sessions |
| `ProbeTag` | `string?` | `null` | Optional tag used to verify the runtime returns fresh data |
| `ProbeStaleThresholdSeconds` | `int` | `60` | Seconds a probe value may remain unchanged before the connection is considered stale |
| `RuntimeStatusProbesEnabled` | `bool` | `true` | Advises `<Host>.ScanState` on every deployed `$WinPlatform` and `$AppEngine` to track per-host runtime state. Drives the Galaxy Runtime dashboard panel, HealthCheck Rule 2e, and the Read-path short-circuit that invalidates OPC UA variable quality when a host is Stopped. Set `false` to return to legacy behavior where host state is invisible and the bridge serves whatever quality MxAccess reports for individual tags. See [MXAccess Bridge](MxAccessBridge.md#per-host-runtime-status-probes-hostscanstate) |
| `RuntimeStatusUnknownTimeoutSeconds` | `int` | `15` | Maximum seconds to wait for the initial probe callback before marking a host as Stopped. Only applies to the Unknown → Stopped transition; Running hosts never time out because `ScanState` is delivered on-change only. A value below 5s triggers a validator warning |
### GalaxyRepository
@@ -255,6 +257,7 @@ Three boolean properties act as feature flags that control optional subsystems:
- `Historian.ServerName` (or `Historian.ServerNames`) must not be empty when `Historian.Enabled = true`
- `Historian.FailureCooldownSeconds` must be zero or positive
- `Historian.ServerName` set alongside a non-empty `Historian.ServerNames` emits a warning (the single `ServerName` is ignored)
- `MxAccess.RuntimeStatusUnknownTimeoutSeconds` below 5s emits a warning (below the reasonable floor for MxAccess initial-resolution latency)
- `OpcUa.ApplicationUri` must be set when `Redundancy.Enabled = true`
- `Redundancy.ServiceLevelBase` must be between 1 and 255
- `Redundancy.ServerUris` should contain at least 2 entries when enabled
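
The `RuntimeStatusUnknownTimeoutSeconds` rule above could be sketched as follows. This is a minimal, language-neutral illustration in Python, not the project's validator: `MxAccessSettings`, `validate_mxaccess`, and the warning text are hypothetical names chosen for the example, and only the 5-second floor and the default of 15 come from the text.

```python
from dataclasses import dataclass
from typing import List

# Floor quoted in the validation rule; values below it only warn, they do not fail.
UNKNOWN_TIMEOUT_FLOOR_SECONDS = 5


@dataclass
class MxAccessSettings:
    # Hypothetical container; field names mirror the two config fields.
    runtime_status_probes_enabled: bool = True
    runtime_status_unknown_timeout_seconds: int = 15


def validate_mxaccess(settings: MxAccessSettings) -> List[str]:
    """Return validator warnings; an empty list means nothing to report."""
    warnings = []
    timeout = settings.runtime_status_unknown_timeout_seconds
    if timeout < UNKNOWN_TIMEOUT_FLOOR_SECONDS:
        warnings.append(
            f"MxAccess.RuntimeStatusUnknownTimeoutSeconds={timeout} is below "
            f"{UNKNOWN_TIMEOUT_FLOOR_SECONDS}s"
        )
    return warnings
```

Whether the real validator skips this check when probes are disabled is not stated in the source, so the sketch checks unconditionally.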
@@ -305,7 +308,9 @@ Integration tests use this constructor to inject substitute implementations of `
"MonitorIntervalSeconds": 5,
"AutoReconnect": true,
"ProbeTag": null,
"ProbeStaleThresholdSeconds": 60
"ProbeStaleThresholdSeconds": 60,
"RuntimeStatusProbesEnabled": true,
"RuntimeStatusUnknownTimeoutSeconds": 15
},
"GalaxyRepository": {
"ConnectionString": "Server=localhost;Database=ZB;Integrated Security=true;",


@@ -30,6 +30,8 @@ Returns deployed Galaxy objects with their parent relationships, browse names, a
- Filters to `is_template = 0` (instances only, not templates)
- Filters to `deployed_package_id <> 0` (deployed objects only)
- Returns a `template_chain` column built by a recursive CTE that walks `gobject.derived_from_gobject_id` from each instance through its immediate template and ancestor templates (depth guard `< 10`). Template names are ordered by depth and joined with `|` via `STUFF(... FOR XML PATH(''))`. Example: `TestMachine_001` returns `$TestMachine|$gMachine|$gUserDefined|$UserDefined`. The C# repository reader splits the column on `|`, trims, and populates `GalaxyObjectInfo.TemplateChain`, which is consumed by `AlarmObjectFilter` for template-based alarm filtering. See [Alarm Tracking](AlarmTracking.md#template-based-alarm-object-filter).
- Returns `template_definition.category_id` as a `category_id` column, populated into `GalaxyObjectInfo.CategoryId`. The runtime status probe manager filters this down to `CategoryId == 1` (`$WinPlatform`) and `CategoryId == 3` (`$AppEngine`) to decide which objects get a `<Host>.ScanState` probe advised. Also used by `LmxNodeManager.BuildHostedVariablesMap` to identify Platform/Engine ancestors during the hosted-variables walk.
- Returns `gobject.hosted_by_gobject_id` as a `hosted_by_gobject_id` column, populated into `GalaxyObjectInfo.HostedByGobjectId`. This is the **runtime host** of the object (e.g., which `$AppEngine` actually runs it), **not** the browse-containment parent (`contained_by_gobject_id`). The two are often different — an object can live in one Area in the browse tree but be hosted by an Engine on a different Platform for runtime execution. The node manager walks this chain during `BuildHostedVariablesMap` to find the nearest `$WinPlatform` or `$AppEngine` ancestor so subtree quality invalidation on a Stopped host reaches exactly the variables that were actually executing there. Note: the Galaxy schema column is named `hosted_by_gobject_id` (not `host_gobject_id` as some documentation sources guess). See [MXAccess Bridge — Per-Host Runtime Status Probes](MxAccessBridge.md#per-host-runtime-status-probes-hostscanstate).
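
To make the column handling above concrete, here is a small Python sketch of how a reader might split `template_chain` and pick the objects that receive a `<Host>.ScanState` probe. The function names and the `(tag_name, category_id)` tuple shape are assumptions made for the example; the category values 1 and 3 and the `|` separator come from the text.

```python
from typing import Iterable, List, Optional, Tuple

CATEGORY_WIN_PLATFORM = 1  # $WinPlatform
CATEGORY_APP_ENGINE = 3    # $AppEngine


def parse_template_chain(raw: Optional[str]) -> List[str]:
    """Split the pipe-joined column, trim each name, and drop empty parts."""
    return [part.strip() for part in (raw or "").split("|") if part.strip()]


def probe_tags(objects: Iterable[Tuple[str, int]]) -> List[str]:
    """objects: (tag_name, category_id) pairs; returns the ScanState tags to advise."""
    host_categories = (CATEGORY_WIN_PLATFORM, CATEGORY_APP_ENGINE)
    return [
        f"{tag_name}.ScanState"
        for tag_name, category_id in objects
        if category_id in host_categories
    ]
```

With the example row from the text, `parse_template_chain("$TestMachine|$gMachine|$gUserDefined|$UserDefined")` yields the four template names in depth order.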
### Attributes query (standard)


@@ -94,6 +94,49 @@ A configurable probe tag (e.g., a frequently updating Galaxy attribute) serves a
The monitor loop compares `DateTime.UtcNow - _lastProbeValueTime` against `ProbeStaleThresholdSeconds`. If the probe value has not updated within the threshold, the connection is assumed stale and a reconnect is forced. This catches scenarios where the COM connection is technically alive but the runtime has stopped delivering data.
## Per-Host Runtime Status Probes (`<Host>.ScanState`)
Separate from the connection-level probe above, the bridge advises `<HostName>.ScanState` on every deployed `$WinPlatform` and `$AppEngine` in the Galaxy. These probes track per-host runtime state so the dashboard can report "this specific Platform / AppEngine is off scan" and the bridge can proactively invalidate every OPC UA variable hosted by the stopped object — preventing MxAccess from serving stale Good-quality cached values to clients who read those tags while the host is down.
Enabled by default via `MxAccess.RuntimeStatusProbesEnabled`; see [Configuration](Configuration.md#mxaccess) for the two config fields.
### How it works
`GalaxyRuntimeProbeManager` is owned by `LmxNodeManager` and operates on a simple three-state machine per host (Unknown / Running / Stopped):
1. **Discovery** — After `BuildAddressSpace` completes, the manager filters the hierarchy to rows where `CategoryId == 1` (`$WinPlatform`) or `CategoryId == 3` (`$AppEngine`) and issues `AdviseSupervisory` for `<TagName>.ScanState` on each one. Probes are bridge-owned, not ref-counted against client subscriptions, and persist across address-space rebuilds via a `Sync` diff.
2. **Transition predicate** — A probe callback is interpreted as `isRunning = vtq.Quality.IsGood() && vtq.Value is bool b && b`. Everything else (explicit `ScanState = false`, bad quality, communication errors from the broker) means **Stopped**.
3. **On-change-only delivery** — `ScanState` is delivered **only when the value actually changes**. A stably Running host may go hours without a callback. The probe manager's `Tick()` explicitly does NOT run a starvation check on Running entries — the only time-based transition is **Unknown → Stopped** when the initial callback hasn't arrived within `RuntimeStatusUnknownTimeoutSeconds` (default 15s). This protects against a probe that fails to resolve at all without incorrectly flipping healthy long-running hosts.
4. **Transport gating** — When `IMxAccessClient.State != Connected`, `GetSnapshot()` forces every entry to `Unknown` regardless of underlying state. The dashboard shows the Connection panel as the primary signal in that case rather than misleading operators with "every host stopped."
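
The four steps above can be sketched as a small state machine. This is an illustrative Python model, not the `GalaxyRuntimeProbeManager` source: `HostEntry`, `on_probe_callback`, `tick`, and `snapshot_state` are hypothetical names, time is passed in as plain seconds, and the timeout comparison is assumed to be inclusive.

```python
from dataclasses import dataclass
from enum import Enum


class HostState(Enum):
    UNKNOWN = "Unknown"
    RUNNING = "Running"
    STOPPED = "Stopped"


@dataclass
class HostEntry:
    name: str
    state: HostState = HostState.UNKNOWN
    advised_at: float = 0.0  # seconds on a caller-supplied clock


def is_running(quality_good: bool, value: object) -> bool:
    # Mirrors: vtq.Quality.IsGood() && vtq.Value is bool b && b
    return quality_good and value is True


def on_probe_callback(entry: HostEntry, quality_good: bool, value: object) -> None:
    # Anything other than a Good-quality boolean True means Stopped.
    entry.state = HostState.RUNNING if is_running(quality_good, value) else HostState.STOPPED


def tick(entry: HostEntry, now: float, unknown_timeout_seconds: float = 15) -> None:
    # Only Unknown entries time out; Running entries are never aged out,
    # because ScanState is delivered on change only.
    if entry.state is HostState.UNKNOWN and now - entry.advised_at >= unknown_timeout_seconds:
        entry.state = HostState.STOPPED


def snapshot_state(entry: HostEntry, transport_connected: bool) -> HostState:
    # Transport gating: report Unknown while the MxAccess client is not Connected.
    return entry.state if transport_connected else HostState.UNKNOWN
```

Note that the gating in `snapshot_state` does not overwrite the stored state, so the underlying value is available again once the transport reconnects. That detail is an assumption of the sketch.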
### Subtree quality invalidation on transition
When a host transitions **Running → Stopped**, the probe manager invokes a callback that walks `_hostedVariables[gobjectId]` — the set of every OPC UA variable transitively hosted by that Galaxy object — and sets each variable's `StatusCode` to `BadOutOfService`. The reverse happens on **Stopped → Running**: `ClearHostVariablesBadQuality` resets each to `Good` and lets subsequent on-change MxAccess updates repopulate the values.
The hosted-variables map is built once per `BuildAddressSpace` by walking each object's `HostedByGobjectId` chain up to the nearest Platform or Engine ancestor. A variable hosted by an Engine inside a Platform ends up in **both** the Engine's list and the Platform's list, so stopping the Platform transitively invalidates every descendant Engine's variables.
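
One way to picture the hosted-variables walk is the Python sketch below. It is a simplified model under stated assumptions: objects are given as `{gobject_id: (category_id, hosted_by_gobject_id)}`, variables as `{gobject_id: [tag_ref, ...]}`, the walk starts at each object's runtime host, and a `hosted_by_gobject_id` that is not in the map ends the chain. None of these shapes are taken from `BuildHostedVariablesMap` itself.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

HOST_CATEGORIES = {1, 3}  # $WinPlatform, $AppEngine


def build_hosted_variables_map(
    objects: Dict[int, Tuple[int, int]],
    variables_by_object: Dict[int, List[str]],
) -> Dict[int, List[str]]:
    """Map each Platform/Engine gobject_id to every tag it transitively hosts."""
    hosted = defaultdict(list)
    for gobject_id, tag_refs in variables_by_object.items():
        seen = set()                      # guards against a cyclic chain
        current = objects[gobject_id][1]  # the object's runtime host
        while current in objects and current not in seen:
            seen.add(current)
            category_id, parent = objects[current]
            if category_id in HOST_CATEGORIES:
                # An Engine's variables also land in its Platform's list,
                # because the walk keeps climbing past the Engine.
                hosted[current].extend(tag_refs)
            current = parent
    return dict(hosted)
```

Because the same tag appears under both the Engine and its Platform, marking either host Stopped reaches that tag without any further graph traversal at transition time.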
### Read-path short-circuit (`IsTagUnderStoppedHost`)
The `LmxNodeManager.Read` override is called by the OPC UA SDK for both direct Read requests and monitored-item sampling. It previously called `_mxAccessClient.ReadAsync(tagRef)` unconditionally and returned whatever VTQ the runtime reported. That created a gap: MxAccess happily serves the last cached value as Good on a tag whose hosting Engine has gone off scan.
The Read override now checks `IsTagUnderStoppedHost(tagRef)` (a reverse-index lookup `_hostIdsByTagRef[tagRef]` → `GalaxyRuntimeProbeManager.IsHostStopped(hostId)`) before the MxAccess round-trip. When the owning host is Stopped, the handler returns a synthesized `DataValue { Value = cachedVar.Value, StatusCode = BadOutOfService }` directly without touching MxAccess. This guarantees clients see a uniform `BadOutOfService` on every descendant tag of a stopped host, regardless of whether they're reading or subscribing.
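
The short-circuit can be illustrated with the following Python sketch. It models the idea only: `read_tag`, its parameters, and the `(value, status)` tuple are hypothetical stand-ins for the C# `Read` override and `DataValue`, and the status is represented as a plain string.

```python
from typing import Any, Callable, Dict, Iterable, Tuple

BAD_OUT_OF_SERVICE = "BadOutOfService"


def read_tag(
    tag_ref: str,
    host_ids_by_tag_ref: Dict[str, Iterable[int]],
    is_host_stopped: Callable[[int], bool],
    cached_values: Dict[str, Any],
    read_from_runtime: Callable[[str], Tuple[Any, str]],
) -> Tuple[Any, str]:
    """Return (value, status); skip the runtime round-trip under a stopped host."""
    host_ids = host_ids_by_tag_ref.get(tag_ref, ())
    if any(is_host_stopped(host_id) for host_id in host_ids):
        # Synthesized result: last cached value, forced bad status.
        return cached_values.get(tag_ref), BAD_OUT_OF_SERVICE
    return read_from_runtime(tag_ref)
```

A tag with no entry in the reverse index falls through to the normal runtime read, which matches the legacy behavior described for hosts that are not tracked.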
### Deferred dispatch: the STA deadlock
**Critical**: probe transition callbacks must **not** run synchronously on the STA thread that delivered the `OnDataChange`. `MarkHostVariablesBadQuality` takes the `LmxNodeManager.Lock`, which may be held by a worker thread currently inside `Read` waiting on an `_mxAccessClient.ReadAsync()` round-trip that is itself waiting for the STA thread. This is a classic circular wait; the first real deploy of this feature hung within 30 seconds from exactly this pattern.
The fix is a deferred-dispatch queue: probe callbacks enqueue the transition onto `ConcurrentQueue<(int GobjectId, bool Stopped)>` and set the existing dispatch signal. The dispatch thread drains the queue inside its existing 100ms `WaitOne` loop — **outside** any locks held by the STA path — and then calls `MarkHostVariablesBadQuality` / `ClearHostVariablesBadQuality` under its own natural `Lock` acquisition. No circular wait, no STA dispatch involvement.
See the `runtimestatus.md` plan file and the `service_info.md` entry for the in-flight debugging that led to this pattern.
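
The deferred-dispatch pattern described above can be modeled in a few lines of Python. This is a sketch of the technique, not the bridge's code: `TransitionDispatcher` and its methods are hypothetical, `threading.Lock` stands in for `LmxNodeManager.Lock`, and `queue.SimpleQueue` stands in for the `ConcurrentQueue`.

```python
import queue
import threading
from typing import Callable


class TransitionDispatcher:
    def __init__(self, mark_bad: Callable[[int], None], clear_bad: Callable[[int], None]):
        self._pending = queue.SimpleQueue()   # (gobject_id, stopped) tuples
        self._signal = threading.Event()      # existing dispatch signal in the text
        self._lock = threading.Lock()         # stand-in for the node manager lock
        self._mark_bad = mark_bad
        self._clear_bad = clear_bad

    def on_probe_transition(self, gobject_id: int, stopped: bool) -> None:
        """Runs on the callback thread: enqueue and signal, never take the lock."""
        self._pending.put((gobject_id, stopped))
        self._signal.set()

    def drain_once(self, timeout: float = 0.1) -> int:
        """One dispatch-loop iteration; returns how many transitions were applied."""
        self._signal.wait(timeout)
        self._signal.clear()
        applied = 0
        while True:
            try:
                gobject_id, stopped = self._pending.get_nowait()
            except queue.Empty:
                break
            # The lock is taken here, on the dispatch thread only.
            with self._lock:
                (self._mark_bad if stopped else self._clear_bad)(gobject_id)
            applied += 1
        return applied
```

The callback side never blocks on the lock, so it cannot take part in a wait cycle with a reader that holds the lock while waiting on the callback thread. The cost is a bounded delay (up to the wait timeout) before the quality change becomes visible.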
### Dashboard + health surface
- Dashboard **Galaxy Runtime** panel between Galaxy Info and Historian shows per-host state with Name / Kind / State / Since / Last Error columns. Panel color is green (all Running), yellow (any Unknown, none Stopped), red (any Stopped), gray (MxAccess transport disconnected).
- Subscriptions panel gains a `Probes: N (bridge-owned runtime status)` line when at least one probe is active, so operators can distinguish bridge-owned probe count from client-driven subscriptions.
- `HealthCheckService.CheckHealth` Rule 2e rolls overall health to `Degraded` when any host is Stopped, ordered after the MxAccess-transport check (Rule 1) so a transport outage stays `Unhealthy` without double-messaging.
See [Status Dashboard](StatusDashboard.md#galaxy-runtime) for the field table and [Configuration](Configuration.md#mxaccess) for the two new config fields.
## Why Marshal.ReleaseComObject Is Needed
The .NET runtime's garbage collector releases COM references non-deterministically. For MXAccess, delayed release can leave stale COM connections open, preventing clean re-registration. `MxProxyAdapter.Unregister` calls `Marshal.ReleaseComObject(_lmxProxy)` in a `finally` block to release the COM reference immediately instead of waiting for garbage collection. This ensures the underlying COM server is freed before a reconnect attempt creates a new instance.
@@ -108,4 +151,7 @@ The .NET runtime's garbage collector releases COM references non-deterministical
- `src/ZB.MOM.WW.LmxOpcUa.Host/MxAccess/MxAccessClient.EventHandlers.cs` -- OnDataChange and OnWriteComplete handlers
- `src/ZB.MOM.WW.LmxOpcUa.Host/MxAccess/MxAccessClient.Monitor.cs` -- Background health monitor
- `src/ZB.MOM.WW.LmxOpcUa.Host/MxAccess/MxProxyAdapter.cs` -- COM object wrapper
- `src/ZB.MOM.WW.LmxOpcUa.Host/MxAccess/GalaxyRuntimeProbeManager.cs` -- Per-host `ScanState` probes, state machine, `IsHostStopped` lookup
- `src/ZB.MOM.WW.LmxOpcUa.Host/Domain/GalaxyRuntimeStatus.cs` -- Per-host DTO
- `src/ZB.MOM.WW.LmxOpcUa.Host/Domain/GalaxyRuntimeState.cs` -- `Unknown` / `Running` / `Stopped` enum
- `src/ZB.MOM.WW.LmxOpcUa.Host/Domain/IMxAccessClient.cs` -- Client interface


@@ -21,7 +21,7 @@ Any other path returns `404 Not Found`.
## Health Check Logic
`HealthCheckService.CheckHealth` evaluates bridge health using the following rules applied in order. The first rule that matches wins; rules 2b, 2c, and 2d only fire when the corresponding integration is enabled and a non-null snapshot is passed:
`HealthCheckService.CheckHealth` evaluates bridge health using the following rules applied in order. The first rule that matches wins; rules 2b, 2c, 2d, and 2e only fire when the corresponding integration is enabled and a non-null snapshot is passed:
1. **Rule 1 -- Unhealthy**: MXAccess connection state is not `Connected`. Returns a red banner with the current state.
2. **Rule 2b -- Degraded**: `Historian.Enabled=true` but the plugin load outcome is not `Loaded`. Returns a yellow banner citing the plugin status (`NotFound`, `LoadFailed`) and the error message if one is available.
@@ -29,7 +29,8 @@ Any other path returns `404 Not Found`.
- Regular operations (`Read`, `Write`, `Subscribe`, `AlarmAcknowledge`): >100 invocations and <50% success rate.
- Historian operations (`HistoryReadRaw`, `HistoryReadProcessed`, `HistoryReadAtTime`, `HistoryReadEvents`): >10 invocations and <50% success rate. The lower threshold surfaces a stuck historian quickly, since history reads are rare relative to live reads.
4. **Rule 2d -- Degraded (latched)**: `AlarmTrackingEnabled=true` and any alarm acknowledge MXAccess write has failed since startup. Latched on purpose -- an ack write failure is a durable MXAccess write problem that should stay visible until the operator restarts.
5. **Rule 3 -- Healthy**: All checks pass. Returns a green banner with "All systems operational."
5. **Rule 2e -- Degraded**: `RuntimeStatus.StoppedCount > 0` -- at least one Galaxy runtime host (`$WinPlatform` / `$AppEngine`) is currently reported Stopped by the runtime probe manager. The rule names the stopped hosts in the message. Ordered after Rule 1 so an MxAccess transport outage stays `Unhealthy` via Rule 1 and this rule never double-messages; the probe manager also forces every entry to `Unknown` when the transport is disconnected, so the `StoppedCount` is always 0 in that case.
6. **Rule 3 -- Healthy**: All checks pass. Returns a green banner with "All systems operational."
The `/api/health` endpoint returns `200` for both Healthy and Degraded states, and `503` only for Unhealthy. This allows load balancers or monitoring tools to distinguish between a service that is running but degraded and one that has lost its runtime connection.
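
The interaction between Rule 1, Rule 2e, and the HTTP status mapping can be sketched as below. This Python model deliberately omits the intermediate rules (2b, 2c, 2d) and uses hypothetical function names and message wording; only the rule ordering, the "All systems operational." text, and the 200/503 mapping come from the document.

```python
from typing import List, Tuple


def check_health(transport_connected: bool, stopped_hosts: List[str]) -> Tuple[str, str]:
    """First matching rule wins; returns (status, message)."""
    if not transport_connected:
        # Rule 1 comes first, so a transport outage is never reported as Degraded.
        return "Unhealthy", "MXAccess connection is not Connected"
    if stopped_hosts:
        # Rule 2e names the stopped hosts in the message.
        return "Degraded", "Stopped runtime hosts: " + ", ".join(stopped_hosts)
    return "Healthy", "All systems operational."


def http_status(status: str) -> int:
    """/api/health returns 503 only for Unhealthy; Healthy and Degraded return 200."""
    return 503 if status == "Unhealthy" else 200
```

For example, `check_health(True, ["DevAppEngine"])` yields a Degraded result that still maps to `200`, so a load balancer keeps the instance in rotation while monitoring can alert on the status text.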
@@ -57,7 +58,8 @@ The `/api/health` endpoint returns `200` for both Healthy and Degraded states, a
| Field | Type | Description |
|-------|------|-------------|
| `ActiveCount` | `int` | Number of active MXAccess tag subscriptions |
| `ActiveCount` | `int` | Number of active MXAccess tag subscriptions (includes bridge-owned runtime status probes — see `ProbeCount`) |
| `ProbeCount` | `int` | Subset of `ActiveCount` attributable to bridge-owned runtime status probes (`<Host>.ScanState` per deployed `$WinPlatform` / `$AppEngine`). Rendered as a separate `Probes: N (bridge-owned runtime status)` line on the dashboard so operators can distinguish probe overhead from client-driven subscription load |
### Galaxy
@@ -79,6 +81,35 @@ The `/api/health` endpoint returns `200` for both Healthy and Degraded states, a
| `PendingItems` | `int` | Items waiting in the dispatch queue |
| `TotalEvents` | `long` | Total MXAccess data change events since startup |
### Galaxy Runtime
Populated from the `GalaxyRuntimeProbeManager` that advises `<Host>.ScanState` on every deployed `$WinPlatform` and `$AppEngine`. See [MXAccess Bridge](MxAccessBridge.md#per-host-runtime-status-probes-hostscanstate) for the probe machinery, state machine, and the subtree quality invalidation that fires on transitions. Disabled when `MxAccess.RuntimeStatusProbesEnabled = false`; the panel is suppressed entirely from the HTML when `Total == 0`.
| Field | Type | Description |
|-------|------|-------------|
| `Total` | `int` | Number of runtime hosts tracked (Platforms + AppEngines) |
| `RunningCount` | `int` | Hosts whose last probe callback reported `ScanState = true` with Good quality |
| `StoppedCount` | `int` | Hosts whose last probe callback reported `ScanState != true` or a failed item status, or whose initial probe timed out in Unknown state |
| `UnknownCount` | `int` | Hosts still awaiting initial probe resolution, or rewritten to Unknown when the MxAccess transport is Disconnected |
| `Hosts` | `List<GalaxyRuntimeStatus>` | Per-host detail rows, sorted alphabetically by `ObjectName` |
Each `GalaxyRuntimeStatus` entry:
| Field | Type | Description |
|-------|------|-------------|
| `ObjectName` | `string` | Galaxy `tag_name` of the host (e.g., `DevPlatform`, `DevAppEngine`) |
| `GobjectId` | `int` | Galaxy `gobject_id` of the host |
| `Kind` | `string` | `$WinPlatform` or `$AppEngine` |
| `State` | `enum` | `Unknown`, `Running`, or `Stopped` |
| `LastStateCallbackTime` | `DateTime?` | UTC time of the most recent probe callback, whether good or bad |
| `LastStateChangeTime` | `DateTime?` | UTC time of the most recent Running↔Stopped transition; backs the dashboard "Since" column |
| `LastScanState` | `bool?` | Last `ScanState` value received; `null` before the first callback |
| `LastError` | `string?` | Detail message from the most recent failure callback (e.g., `"ScanState = false (OffScan)"`); cleared on successful recovery |
| `GoodUpdateCount` | `long` | Cumulative count of `ScanState = true` callbacks |
| `FailureCount` | `long` | Cumulative count of `ScanState != true` callbacks or failed item statuses |
The HTML panel renders a per-host table with Name / Kind / State / Since / Last Error columns. Panel color reflects aggregate state: green when every host is `Running`, yellow when any host is `Unknown` with zero `Stopped`, red when any host is `Stopped`, gray when the MxAccess transport is disconnected (the Connection panel is the primary signal in that case and every row is force-rewritten to `Unknown`).
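
The aggregate counts and the panel color rule can be expressed compactly. The following Python sketch assumes per-host states arrive as the strings `Unknown`, `Running`, and `Stopped`; `summarize` and `panel_color` are hypothetical helpers, and returning `None` for an empty host list is this example's way of modeling the suppressed panel.

```python
from collections import Counter
from typing import Dict, List, Optional


def summarize(states: List[str]) -> Dict[str, int]:
    """Build the aggregate fields from per-host states."""
    counts = Counter(states)
    return {
        "Total": len(states),
        "RunningCount": counts["Running"],
        "StoppedCount": counts["Stopped"],
        "UnknownCount": counts["Unknown"],
    }


def panel_color(states: List[str], transport_connected: bool) -> Optional[str]:
    """Return the panel color, or None when the panel is suppressed (Total == 0)."""
    if not states:
        return None
    if not transport_connected:
        return "gray"      # the Connection panel is the primary signal
    if "Stopped" in states:
        return "red"
    if "Unknown" in states:
        return "yellow"
    return "green"
```

The check order matters: the transport test comes before the per-host tests, and Stopped outranks Unknown, which mirrors the precedence described in the paragraph above.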
### Operations
A dictionary of `MetricsStatistics` keyed by operation name. Each entry contains: