Document the Galaxy runtime status feature across the architecture guides so operators and future maintainers can find the probe machinery, config fields, dashboard panel, and HealthCheck Rule 2e without digging through runtimestatus.md or service_info.md

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Joseph Doherty
2026-04-13 16:36:35 -04:00
parent f2ea751e2b
commit 0003984c1a
4 changed files with 88 additions and 4 deletions


@@ -21,7 +21,7 @@ Any other path returns `404 Not Found`.
## Health Check Logic
`HealthCheckService.CheckHealth` evaluates bridge health using the following rules applied in order. The first rule that matches wins; rules 2b, 2c, and 2d only fire when the corresponding integration is enabled and a non-null snapshot is passed:
`HealthCheckService.CheckHealth` evaluates bridge health using the following rules applied in order. The first rule that matches wins; rules 2b, 2c, 2d, and 2e only fire when the corresponding integration is enabled and a non-null snapshot is passed:
1. **Rule 1 -- Unhealthy**: MXAccess connection state is not `Connected`. Returns a red banner with the current state.
2. **Rule 2b -- Degraded**: `Historian.Enabled=true` but the plugin load outcome is not `Loaded`. Returns a yellow banner citing the plugin status (`NotFound`, `LoadFailed`) and the error message if one is available.
@@ -29,7 +29,8 @@ Any other path returns `404 Not Found`.
- Regular operations (`Read`, `Write`, `Subscribe`, `AlarmAcknowledge`): >100 invocations and <50% success rate.
- Historian operations (`HistoryReadRaw`, `HistoryReadProcessed`, `HistoryReadAtTime`, `HistoryReadEvents`): >10 invocations and <50% success rate. The lower threshold surfaces a stuck historian quickly, since history reads are rare relative to live reads.
4. **Rule 2d -- Degraded (latched)**: `AlarmTrackingEnabled=true` and any alarm acknowledge MXAccess write has failed since startup. Latched on purpose -- an ack write failure is a durable MXAccess write problem that should stay visible until the operator restarts.
5. **Rule 3 -- Healthy**: All checks pass. Returns a green banner with "All systems operational."
5. **Rule 2e -- Degraded**: `RuntimeStatus.StoppedCount > 0` -- at least one Galaxy runtime host (`$WinPlatform` / `$AppEngine`) is currently reported `Stopped` by the runtime probe manager. The message names the stopped hosts. The rule is ordered after Rule 1, so an MXAccess transport outage stays `Unhealthy` via Rule 1 and this rule never double-messages. The probe manager also forces every entry to `Unknown` when the transport is disconnected, so `StoppedCount` is always 0 in that case.
6. **Rule 3 -- Healthy**: All checks pass. Returns a green banner with "All systems operational."
The `/api/health` endpoint returns `200` for both Healthy and Degraded states, and `503` only for Unhealthy. This allows load balancers or monitoring tools to distinguish between a service that is running but degraded and one that has lost its runtime connection.
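The rule ordering and the status-code mapping above can be sketched as follows. This is a minimal illustration in Python, not the bridge's actual `HealthCheckService` code: `RuntimeStatusSnapshot`, `check_health`, `http_status`, and the message wording are hypothetical names chosen for the example, and Rules 2b through 2d are omitted.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class RuntimeStatusSnapshot:
    """Hypothetical stand-in for the runtime status snapshot."""
    stopped_hosts: List[str] = field(default_factory=list)

    @property
    def stopped_count(self) -> int:
        return len(self.stopped_hosts)


def check_health(connection_state: str,
                 runtime: Optional[RuntimeStatusSnapshot]) -> Tuple[str, str]:
    """Apply the rules in order; the first rule that matches wins."""
    # Rule 1: a transport outage is always Unhealthy.
    if connection_state != "Connected":
        return ("Unhealthy", f"MXAccess connection state: {connection_state}")
    # Rules 2b, 2c and 2d would be evaluated here (omitted in this sketch).
    # Rule 2e: fires only when a non-null snapshot reports stopped hosts.
    if runtime is not None and runtime.stopped_count > 0:
        # Message wording is an assumption; the source only says hosts are named.
        return ("Degraded", "Stopped runtime hosts: " + ", ".join(runtime.stopped_hosts))
    # Rule 3: nothing matched.
    return ("Healthy", "All systems operational.")


def http_status(status: str) -> int:
    """Healthy and Degraded both return 200; only Unhealthy returns 503."""
    return 503 if status == "Unhealthy" else 200
```

Because Rule 1 returns first, a disconnected transport never reaches the Rule 2e check, which matches the ordering rationale given for that rule.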
@@ -57,7 +58,8 @@ The `/api/health` endpoint returns `200` for both Healthy and Degraded states, a
| Field | Type | Description |
|-------|------|-------------|
| `ActiveCount` | `int` | Number of active MXAccess tag subscriptions |
| `ActiveCount` | `int` | Number of active MXAccess tag subscriptions (includes bridge-owned runtime status probes — see `ProbeCount`) |
| `ProbeCount` | `int` | Subset of `ActiveCount` attributable to bridge-owned runtime status probes (`<Host>.ScanState` per deployed `$WinPlatform` / `$AppEngine`). Rendered as a separate `Probes: N (bridge-owned runtime status)` line on the dashboard so operators can distinguish probe overhead from client-driven subscription load |
### Galaxy
@@ -79,6 +81,35 @@ The `/api/health` endpoint returns `200` for both Healthy and Degraded states, a
| `PendingItems` | `int` | Items waiting in the dispatch queue |
| `TotalEvents` | `long` | Total MXAccess data change events since startup |
### Galaxy Runtime
Populated from the `GalaxyRuntimeProbeManager` that advises `<Host>.ScanState` on every deployed `$WinPlatform` and `$AppEngine`. See [MXAccess Bridge](MxAccessBridge.md#per-host-runtime-status-probes-hostscanstate) for the probe machinery, state machine, and the subtree quality invalidation that fires on transitions. Disabled when `MxAccess.RuntimeStatusProbesEnabled = false`; the panel is suppressed entirely from the HTML when `Total == 0`.
| Field | Type | Description |
|-------|------|-------------|
| `Total` | `int` | Number of runtime hosts tracked (Platforms + AppEngines) |
| `RunningCount` | `int` | Hosts whose last probe callback reported `ScanState = true` with Good quality |
| `StoppedCount` | `int` | Hosts whose last probe callback reported `ScanState != true` or a failed item status, or whose initial probe timed out in Unknown state |
| `UnknownCount` | `int` | Hosts still awaiting initial probe resolution, or rewritten to `Unknown` when the MXAccess transport is disconnected |
| `Hosts` | `List<GalaxyRuntimeStatus>` | Per-host detail rows, sorted alphabetically by `ObjectName` |
Each `GalaxyRuntimeStatus` entry:
| Field | Type | Description |
|-------|------|-------------|
| `ObjectName` | `string` | Galaxy `tag_name` of the host (e.g., `DevPlatform`, `DevAppEngine`) |
| `GobjectId` | `int` | Galaxy `gobject_id` of the host |
| `Kind` | `string` | `$WinPlatform` or `$AppEngine` |
| `State` | `enum` | `Unknown`, `Running`, or `Stopped` |
| `LastStateCallbackTime` | `DateTime?` | UTC time of the most recent probe callback, whether good or bad |
| `LastStateChangeTime` | `DateTime?` | UTC time of the most recent Running↔Stopped transition; backs the dashboard "Since" column |
| `LastScanState` | `bool?` | Last `ScanState` value received; `null` before the first callback |
| `LastError` | `string?` | Detail message from the most recent failure callback (e.g., `"ScanState = false (OffScan)"`); cleared on successful recovery |
| `GoodUpdateCount` | `long` | Cumulative count of `ScanState = true` callbacks |
| `FailureCount` | `long` | Cumulative count of `ScanState != true` callbacks or failed item statuses |
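The field descriptions above imply a simple update on each probe callback. The sketch below is one possible reading in Python, not the `GalaxyRuntimeProbeManager` implementation: `HostStatus` and `apply_probe_callback` are hypothetical names, the error strings are placeholders, and updating the change time on any state change (including the first move out of `Unknown`) is an assumption.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass
class HostStatus:
    """Hypothetical mirror of a subset of the GalaxyRuntimeStatus fields."""
    object_name: str
    state: str = "Unknown"
    last_state_callback_time: Optional[datetime] = None
    last_state_change_time: Optional[datetime] = None
    last_scan_state: Optional[bool] = None
    last_error: Optional[str] = None
    good_update_count: int = 0
    failure_count: int = 0


def apply_probe_callback(host: HostStatus, scan_state: Optional[bool],
                         quality_good: bool,
                         now: Optional[datetime] = None) -> None:
    """Fold one ScanState callback into the per-host record."""
    now = now or datetime.now(timezone.utc)
    # Recorded for every callback, whether good or bad.
    host.last_state_callback_time = now
    if scan_state is not None:
        host.last_scan_state = scan_state

    running = quality_good and scan_state is True
    if running:
        host.good_update_count += 1
        host.last_error = None  # cleared on successful recovery
    else:
        host.failure_count += 1
        # Placeholder wording; the real detail message is not reproduced here.
        host.last_error = (f"ScanState = {scan_state}" if quality_good
                           else "probe item status failed")

    new_state = "Running" if running else "Stopped"
    if new_state != host.state:
        # Assumption: any state change refreshes the "Since" timestamp.
        host.state = new_state
        host.last_state_change_time = now
```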
The HTML panel renders a per-host table with Name / Kind / State / Since / Last Error columns. Panel color reflects aggregate state: green when every host is `Running`, yellow when at least one host is `Unknown` and none is `Stopped`, red when any host is `Stopped`, and gray when the MXAccess transport is disconnected (the Connection panel is the primary signal in that case and every row is force-rewritten to `Unknown`).
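The color rule can be expressed as a small precedence check. This is an illustrative Python sketch with a hypothetical `panel_color` helper, not the dashboard's rendering code. It returns `None` when no hosts are tracked, to reflect that the panel is suppressed when `Total == 0`.

```python
from typing import List, Optional


def panel_color(transport_connected: bool, states: List[str]) -> Optional[str]:
    """Map per-host states to the aggregate panel color."""
    if not states:
        return None  # Total == 0: the panel is not rendered at all
    if not transport_connected:
        return "gray"  # the Connection panel is the primary signal here
    if any(s == "Stopped" for s in states):
        return "red"
    if any(s == "Unknown" for s in states):
        return "yellow"
    return "green"  # every host is Running
```

Checking `Stopped` before `Unknown` gives red precedence over yellow, so one stopped host is never masked by hosts that are still resolving.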
### Operations
A dictionary of `MetricsStatistics` keyed by operation name. Each entry contains: