diff --git a/docs/Configuration.md b/docs/Configuration.md index 3fa8a83..c5f79ae 100644 --- a/docs/Configuration.md +++ b/docs/Configuration.md @@ -74,6 +74,8 @@ Controls the MXAccess runtime connection used for live tag reads and writes. Def | `AutoReconnect` | `bool` | `true` | Automatically re-establish dropped MXAccess sessions | | `ProbeTag` | `string?` | `null` | Optional tag used to verify the runtime returns fresh data | | `ProbeStaleThresholdSeconds` | `int` | `60` | Seconds a probe value may remain unchanged before the connection is considered stale | +| `RuntimeStatusProbesEnabled` | `bool` | `true` | Advises `.ScanState` on every deployed `$WinPlatform` and `$AppEngine` to track per-host runtime state. Drives the Galaxy Runtime dashboard panel, HealthCheck Rule 2e, and the Read-path short-circuit that invalidates OPC UA variable quality when a host is Stopped. Set `false` to return to legacy behavior where host state is invisible and the bridge serves whatever quality MxAccess reports for individual tags. See [MXAccess Bridge](MxAccessBridge.md#per-host-runtime-status-probes-hostscanstate) | +| `RuntimeStatusUnknownTimeoutSeconds` | `int` | `15` | Maximum seconds to wait for the initial probe callback before marking a host as Stopped. Only applies to the Unknown → Stopped transition; Running hosts never time out because `ScanState` is delivered on-change only. A value below 5s triggers a validator warning | ### GalaxyRepository @@ -255,6 +257,7 @@ Three boolean properties act as feature flags that control optional subsystems: - `Historian.ServerName` (or `Historian.ServerNames`) must not be empty when `Historian.Enabled = true` - `Historian.FailureCooldownSeconds` must be zero or positive - `Historian.ServerName` is set alongside a non-empty `Historian.ServerNames` emits a warning (single ServerName is ignored) +- `MxAccess.RuntimeStatusUnknownTimeoutSeconds` below 5s emits a warning (below the reasonable floor for MxAccess initial-resolution latency) - `OpcUa.ApplicationUri` must be set when `Redundancy.Enabled = true` - `Redundancy.ServiceLevelBase` must be between 1 and 255 - `Redundancy.ServerUris` should contain at least 2 entries when enabled @@ -305,7 +308,9 @@ Integration tests use this constructor to inject substitute implementations of ` "MonitorIntervalSeconds": 5, "AutoReconnect": true, "ProbeTag": null, - "ProbeStaleThresholdSeconds": 60 + "ProbeStaleThresholdSeconds": 60, + "RuntimeStatusProbesEnabled": true, + "RuntimeStatusUnknownTimeoutSeconds": 15 }, "GalaxyRepository": { "ConnectionString": "Server=localhost;Database=ZB;Integrated Security=true;", diff --git a/docs/GalaxyRepository.md b/docs/GalaxyRepository.md index 89bbf84..5bd3fc7 100644 --- a/docs/GalaxyRepository.md +++ b/docs/GalaxyRepository.md @@ -30,6 +30,8 @@ Returns deployed Galaxy objects with their parent relationships, browse names, a - Filters to `is_template = 0` (instances only, not templates) - Filters to `deployed_package_id <> 0` (deployed objects only) - Returns a `template_chain` column built by a recursive CTE that walks `gobject.derived_from_gobject_id` from each instance through its immediate template and ancestor templates (depth guard `< 10`). Template names are ordered by depth and joined with `|` via `STUFF(... FOR XML PATH(''))`. Example: `TestMachine_001` returns `$TestMachine|$gMachine|$gUserDefined|$UserDefined`. The C# repository reader splits the column on `|`, trims, and populates `GalaxyObjectInfo.TemplateChain`, which is consumed by `AlarmObjectFilter` for template-based alarm filtering. See [Alarm Tracking](AlarmTracking.md#template-based-alarm-object-filter). +- Returns `template_definition.category_id` as a `category_id` column, populated into `GalaxyObjectInfo.CategoryId`. The runtime status probe manager filters this down to `CategoryId == 1` (`$WinPlatform`) and `CategoryId == 3` (`$AppEngine`) to decide which objects get a `.ScanState` probe advised. Also used by `LmxNodeManager.BuildHostedVariablesMap` to identify Platform/Engine ancestors during the hosted-variables walk. +- Returns `gobject.hosted_by_gobject_id` as a `hosted_by_gobject_id` column, populated into `GalaxyObjectInfo.HostedByGobjectId`. This is the **runtime host** of the object (e.g., which `$AppEngine` actually runs it), **not** the browse-containment parent (`contained_by_gobject_id`). The two are often different — an object can live in one Area in the browse tree but be hosted by an Engine on a different Platform for runtime execution. The node manager walks this chain during `BuildHostedVariablesMap` to find the nearest `$WinPlatform` or `$AppEngine` ancestor so subtree quality invalidation on a Stopped host reaches exactly the variables that were actually executing there. Note: the Galaxy schema column is named `hosted_by_gobject_id` (not `host_gobject_id` as some documentation sources guess). See [MXAccess Bridge — Per-Host Runtime Status Probes](MxAccessBridge.md#per-host-runtime-status-probes-hostscanstate). ### Attributes query (standard) diff --git a/docs/MxAccessBridge.md b/docs/MxAccessBridge.md index 704a01a..8f5750f 100644 --- a/docs/MxAccessBridge.md +++ b/docs/MxAccessBridge.md @@ -94,6 +94,49 @@ A configurable probe tag (e.g., a frequently updating Galaxy attribute) serves a The monitor loop compares `DateTime.UtcNow - _lastProbeValueTime` against `ProbeStaleThresholdSeconds`. If the probe value has not updated within the threshold, the connection is assumed stale and a reconnect is forced. This catches scenarios where the COM connection is technically alive but the runtime has stopped delivering data. +## Per-Host Runtime Status Probes (`.ScanState`) + +Separate from the connection-level probe above, the bridge advises `.ScanState` on every deployed `$WinPlatform` and `$AppEngine` in the Galaxy. These probes track per-host runtime state so the dashboard can report "this specific Platform / AppEngine is off scan" and the bridge can proactively invalidate every OPC UA variable hosted by the stopped object — preventing MxAccess from serving stale Good-quality cached values to clients who read those tags while the host is down. + +Enabled by default via `MxAccess.RuntimeStatusProbesEnabled`; see [Configuration](Configuration.md#mxaccess) for the two config fields. + +### How it works + +`GalaxyRuntimeProbeManager` is owned by `LmxNodeManager` and operates on a simple three-state machine per host (Unknown / Running / Stopped): + +1. **Discovery** — After `BuildAddressSpace` completes, the manager filters the hierarchy to rows where `CategoryId == 1` (`$WinPlatform`) or `CategoryId == 3` (`$AppEngine`) and issues `AdviseSupervisory` for `.ScanState` on each one. Probes are bridge-owned, not ref-counted against client subscriptions, and persist across address-space rebuilds via a `Sync` diff. +2. **Transition predicate** — A probe callback is interpreted as `isRunning = vtq.Quality.IsGood() && vtq.Value is bool b && b`. Everything else (explicit `ScanState = false`, bad quality, communication errors from the broker) means **Stopped**. +3. **On-change-only delivery** — `ScanState` is delivered **only when the value actually changes**. A stably Running host may go hours without a callback. The probe manager's `Tick()` explicitly does NOT run a starvation check on Running entries — the only time-based transition is **Unknown → Stopped** when the initial callback hasn't arrived within `RuntimeStatusUnknownTimeoutSeconds` (default 15s). This protects against a probe that fails to resolve at all without incorrectly flipping healthy long-running hosts. +4. **Transport gating** — When `IMxAccessClient.State != Connected`, `GetSnapshot()` forces every entry to `Unknown` regardless of underlying state. The dashboard shows the Connection panel as the primary signal in that case rather than misleading operators with "every host stopped." + +### Subtree quality invalidation on transition + +When a host transitions **Running → Stopped**, the probe manager invokes a callback that walks `_hostedVariables[gobjectId]` — the set of every OPC UA variable transitively hosted by that Galaxy object — and sets each variable's `StatusCode` to `BadOutOfService`. The reverse happens on **Stopped → Running**: `ClearHostVariablesBadQuality` resets each to `Good` and lets subsequent on-change MxAccess updates repopulate the values. + +The hosted-variables map is built once per `BuildAddressSpace` by walking each object's `HostedByGobjectId` chain up to the nearest Platform or Engine ancestor. A variable hosted by an Engine inside a Platform ends up in **both** the Engine's list and the Platform's list, so stopping the Platform transitively invalidates every descendant Engine's variables. + +### Read-path short-circuit (`IsTagUnderStoppedHost`) + +`LmxNodeManager.Read` override is called by the OPC UA SDK for both direct Read requests and monitored-item sampling. It previously called `_mxAccessClient.ReadAsync(tagRef)` unconditionally and returned whatever VTQ the runtime reported. That created a gap: MxAccess happily serves the last cached value as Good on a tag whose hosting Engine has gone off scan. + +The Read override now checks `IsTagUnderStoppedHost(tagRef)` (a reverse-index lookup `_hostIdsByTagRef[tagRef]` → `GalaxyRuntimeProbeManager.IsHostStopped(hostId)`) before the MxAccess round-trip. When the owning host is Stopped, the handler returns a synthesized `DataValue { Value = cachedVar.Value, StatusCode = BadOutOfService }` directly without touching MxAccess. This guarantees clients see a uniform `BadOutOfService` on every descendant tag of a stopped host, regardless of whether they're reading or subscribing. + +### Deferred dispatch: the STA deadlock + +**Critical**: probe transition callbacks must **not** run synchronously on the STA thread that delivered the `OnDataChange`. `MarkHostVariablesBadQuality` takes the `LmxNodeManager.Lock`, which may be held by a worker thread currently inside `Read` waiting on an `_mxAccessClient.ReadAsync()` round-trip that is itself waiting for the STA thread. Classic circular wait — the first real deploy of this feature hung inside 30 seconds from exactly this pattern. + +The fix is a deferred-dispatch queue: probe callbacks enqueue the transition onto `ConcurrentQueue<(int GobjectId, bool Stopped)>` and set the existing dispatch signal. The dispatch thread drains the queue inside its existing 100ms `WaitOne` loop — **outside** any locks held by the STA path — and then calls `MarkHostVariablesBadQuality` / `ClearHostVariablesBadQuality` under its own natural `Lock` acquisition. No circular wait, no STA dispatch involvement. + +See the `runtimestatus.md` plan file and the `service_info.md` entry for the in-flight debugging that led to this pattern. + +### Dashboard + health surface + +- Dashboard **Galaxy Runtime** panel between Galaxy Info and Historian shows per-host state with Name / Kind / State / Since / Last Error columns. Panel color is green (all Running), yellow (any Unknown, none Stopped), red (any Stopped), gray (MxAccess transport disconnected). +- Subscriptions panel gains a `Probes: N (bridge-owned runtime status)` line when at least one probe is active, so operators can distinguish bridge-owned probe count from client-driven subscriptions. +- `HealthCheckService.CheckHealth` Rule 2e rolls overall health to `Degraded` when any host is Stopped, ordered after the MxAccess-transport check (Rule 1) so a transport outage stays `Unhealthy` without double-messaging. + +See [Status Dashboard](StatusDashboard.md#galaxy-runtime) for the field table and [Configuration](Configuration.md#mxaccess) for the two new config fields. + ## Why Marshal.ReleaseComObject Is Needed The .NET runtime's garbage collector releases COM references non-deterministically. For MXAccess, delayed release can leave stale COM connections open, preventing clean re-registration. `MxProxyAdapter.Unregister` calls `Marshal.ReleaseComObject(_lmxProxy)` in a `finally` block to immediately release the COM reference count to zero. This ensures the underlying COM server is freed before a reconnect attempt creates a new instance. @@ -108,4 +151,7 @@ The .NET runtime's garbage collector releases COM references non-deterministical - `src/ZB.MOM.WW.LmxOpcUa.Host/MxAccess/MxAccessClient.EventHandlers.cs` -- OnDataChange and OnWriteComplete handlers - `src/ZB.MOM.WW.LmxOpcUa.Host/MxAccess/MxAccessClient.Monitor.cs` -- Background health monitor - `src/ZB.MOM.WW.LmxOpcUa.Host/MxAccess/MxProxyAdapter.cs` -- COM object wrapper +- `src/ZB.MOM.WW.LmxOpcUa.Host/MxAccess/GalaxyRuntimeProbeManager.cs` -- Per-host `ScanState` probes, state machine, `IsHostStopped` lookup +- `src/ZB.MOM.WW.LmxOpcUa.Host/Domain/GalaxyRuntimeStatus.cs` -- Per-host DTO +- `src/ZB.MOM.WW.LmxOpcUa.Host/Domain/GalaxyRuntimeState.cs` -- `Unknown` / `Running` / `Stopped` enum - `src/ZB.MOM.WW.LmxOpcUa.Host/Domain/IMxAccessClient.cs` -- Client interface diff --git a/docs/StatusDashboard.md b/docs/StatusDashboard.md index 7815242..e81e21e 100644 --- a/docs/StatusDashboard.md +++ b/docs/StatusDashboard.md @@ -21,7 +21,7 @@ Any other path returns `404 Not Found`. ## Health Check Logic -`HealthCheckService.CheckHealth` evaluates bridge health using the following rules applied in order. The first rule that matches wins; rules 2b, 2c, and 2d only fire when the corresponding integration is enabled and a non-null snapshot is passed: +`HealthCheckService.CheckHealth` evaluates bridge health using the following rules applied in order. The first rule that matches wins; rules 2b, 2c, 2d, and 2e only fire when the corresponding integration is enabled and a non-null snapshot is passed: 1. **Rule 1 -- Unhealthy**: MXAccess connection state is not `Connected`. Returns a red banner with the current state. 2. **Rule 2b -- Degraded**: `Historian.Enabled=true` but the plugin load outcome is not `Loaded`. Returns a yellow banner citing the plugin status (`NotFound`, `LoadFailed`) and the error message if one is available. @@ -29,7 +29,8 @@ Any other path returns `404 Not Found`. - Regular operations (`Read`, `Write`, `Subscribe`, `AlarmAcknowledge`): >100 invocations and <50% success rate. - Historian operations (`HistoryReadRaw`, `HistoryReadProcessed`, `HistoryReadAtTime`, `HistoryReadEvents`): >10 invocations and <50% success rate. The lower threshold surfaces a stuck historian quickly, since history reads are rare relative to live reads. 4. **Rule 2d -- Degraded (latched)**: `AlarmTrackingEnabled=true` and any alarm acknowledge MXAccess write has failed since startup. Latched on purpose -- an ack write failure is a durable MXAccess write problem that should stay visible until the operator restarts. -5. **Rule 3 -- Healthy**: All checks pass. Returns a green banner with "All systems operational." +5. **Rule 2e -- Degraded**: `RuntimeStatus.StoppedCount > 0` -- at least one Galaxy runtime host (`$WinPlatform` / `$AppEngine`) is currently reported Stopped by the runtime probe manager. The rule names the stopped hosts in the message. Ordered after Rule 1 so an MxAccess transport outage stays `Unhealthy` via Rule 1 and this rule never double-messages; the probe manager also forces every entry to `Unknown` when the transport is disconnected, so the `StoppedCount` is always 0 in that case. +6. **Rule 3 -- Healthy**: All checks pass. Returns a green banner with "All systems operational." The `/api/health` endpoint returns `200` for both Healthy and Degraded states, and `503` only for Unhealthy. This allows load balancers or monitoring tools to distinguish between a service that is running but degraded and one that has lost its runtime connection. @@ -57,7 +58,8 @@ The `/api/health` endpoint returns `200` for both Healthy and Degraded states, a | Field | Type | Description | |-------|------|-------------| -| `ActiveCount` | `int` | Number of active MXAccess tag subscriptions | +| `ActiveCount` | `int` | Number of active MXAccess tag subscriptions (includes bridge-owned runtime status probes — see `ProbeCount`) | +| `ProbeCount` | `int` | Subset of `ActiveCount` attributable to bridge-owned runtime status probes (`.ScanState` per deployed `$WinPlatform` / `$AppEngine`). Rendered as a separate `Probes: N (bridge-owned runtime status)` line on the dashboard so operators can distinguish probe overhead from client-driven subscription load | ### Galaxy @@ -79,6 +81,35 @@ The `/api/health` endpoint returns `200` for both Healthy and Degraded states, a | `PendingItems` | `int` | Items waiting in the dispatch queue | | `TotalEvents` | `long` | Total MXAccess data change events since startup | +### Galaxy Runtime + +Populated from the `GalaxyRuntimeProbeManager` that advises `.ScanState` on every deployed `$WinPlatform` and `$AppEngine`. See [MXAccess Bridge](MxAccessBridge.md#per-host-runtime-status-probes-hostscanstate) for the probe machinery, state machine, and the subtree quality invalidation that fires on transitions. Disabled when `MxAccess.RuntimeStatusProbesEnabled = false`; the panel is suppressed entirely from the HTML when `Total == 0`. + +| Field | Type | Description | +|-------|------|-------------| +| `Total` | `int` | Number of runtime hosts tracked (Platforms + AppEngines) | +| `RunningCount` | `int` | Hosts whose last probe callback reported `ScanState = true` with Good quality | +| `StoppedCount` | `int` | Hosts whose last probe callback reported `ScanState != true` or a failed item status, or whose initial probe timed out in Unknown state | +| `UnknownCount` | `int` | Hosts still awaiting initial probe resolution, or rewritten to Unknown when the MxAccess transport is Disconnected | +| `Hosts` | `List` | Per-host detail rows, sorted alphabetically by `ObjectName` | + +Each `GalaxyRuntimeStatus` entry: + +| Field | Type | Description | +|-------|------|-------------| +| `ObjectName` | `string` | Galaxy `tag_name` of the host (e.g., `DevPlatform`, `DevAppEngine`) | +| `GobjectId` | `int` | Galaxy `gobject_id` of the host | +| `Kind` | `string` | `$WinPlatform` or `$AppEngine` | +| `State` | `enum` | `Unknown`, `Running`, or `Stopped` | +| `LastStateCallbackTime` | `DateTime?` | UTC time of the most recent probe callback, whether good or bad | +| `LastStateChangeTime` | `DateTime?` | UTC time of the most recent Running↔Stopped transition; backs the dashboard "Since" column | +| `LastScanState` | `bool?` | Last `ScanState` value received; `null` before the first callback | +| `LastError` | `string?` | Detail message from the most recent failure callback (e.g., `"ScanState = false (OffScan)"`); cleared on successful recovery | +| `GoodUpdateCount` | `long` | Cumulative count of `ScanState = true` callbacks | +| `FailureCount` | `long` | Cumulative count of `ScanState != true` callbacks or failed item statuses | + +The HTML panel renders a per-host table with Name / Kind / State / Since / Last Error columns. Panel color reflects aggregate state: green when every host is `Running`, yellow when any host is `Unknown` with zero `Stopped`, red when any host is `Stopped`, gray when the MxAccess transport is disconnected (the Connection panel is the primary signal in that case and every row is force-rewritten to `Unknown`). + ### Operations A dictionary of `MetricsStatistics` keyed by operation name. Each entry contains: