# Plan: Galaxy Runtime Status (Platform + AppEngine Stopped/Started Detection)
## Context
Today the bridge has no operator-visible signal for "is Galaxy Platform X or AppEngine Y stopped or running?". The dashboard shows:
- **MXAccess state** — one bit of truth about whether the bridge can talk to the local MxAccess runtime at all.
- **Data change dispatch rate** — aggregate throughput across every advised attribute.
Neither catches the case an operator actually cares about: a single Platform or AppEngine in a multi-host Galaxy has stopped (operator stopped it from the IDE, the node crashed, network cut, process died, someone toggled OffScan for maintenance). The bridge keeps serving cached values, downstream OPC UA clients see stale reads, and nobody notices until somebody specifically goes looking at the affected equipment.
Galaxy exposes `<ObjectName>.ScanState` as a boolean system attribute on every deployed `$WinPlatform` **and** `$AppEngine`. `true` means the object is on scan and executing; anything else means not running. AppEngine state is independently observable through MxAccess (even a stopped Engine's parent Platform can still route the query) so a single probe mechanism covers both host types.
The goal is to advise `<ObjectName>.ScanState` for every deployed `$WinPlatform` and `$AppEngine`, surface per-host runtime state on the dashboard, drive a `Degraded` health check rule when any is down, and publish the state into the OPC UA address space so external clients can subscribe alongside the value data they already consume.
## Design
### Probe tag: `<ObjectName>.ScanState`
`ScanState` is a boolean system attribute on every deployed `$WinPlatform` and `$AppEngine`. The classification rule:
```
isRunning = status.Success && vtq.Value is bool b && b
```
Everything else → **Stopped**. The `ItemStatus` fields (`category`, `detail`) are still captured into `LastError` for operator diagnostics, but they don't branch the state machine.
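As a minimal sketch, the classification rule can live in one pure helper. The `ProbeStatus` and `ProbeVtq` records below are illustrative stand-ins for the bridge's real MxAccess callback types, not the actual SDK shapes:

```csharp
using System;

// Illustrative stand-ins for the MxAccess callback payload (not the real SDK types).
public readonly record struct ProbeStatus(bool Success, int Category, int Detail);
public readonly record struct ProbeVtq(object? Value, DateTime Time);

public static class ProbeClassifier
{
    // Running only when the callback succeeded AND the value is boolean true;
    // false, a non-bool value, or a failed status all classify as Stopped.
    public static bool IsRunning(ProbeStatus status, ProbeVtq vtq)
        => status.Success && vtq.Value is bool b && b;
}
```

Keeping the predicate a pure function makes the transition matrix trivially unit-testable with no fake broker.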
#### On-change delivery semantic
MxAccess `AdviseSupervisory` delivers the current value at subscription time and then fires `OnDataChange` **only when the value changes**. `ScanState` is discrete — for a healthy host, the initial advise callback reports `true` and nothing follows until the state actually changes. There is no periodic heartbeat on the subscription.
Implications:
- **No starvation-based Running → Stopped transition.** A Running host will legitimately go minutes or hours without an update. The stale-threshold check for the Running state is dropped entirely.
- **Error callbacks drive the Running → Stopped transition.** MxAccess delivers a data-change callback with `ItemStatus[0].success == false` and `detail == 2 (MX_E_PlatformCommunicationError)` when a host becomes unreachable. We trust this signal — it's the broker's job to surface it, and in practice it fires quickly.
- **Stale threshold only applies to the Unknown state.** If a probe is advised but never receives a first callback (initial resolution failure, host never deployed, MxAccess routing broken), the Unknown → Stopped transition fires after `UnknownResolutionTimeoutSeconds`. This catches "the probe never came online" without tripping on healthy stable hosts.
Subscription mechanics:
- `AdviseSupervisory` on `<ObjectName>.ScanState`. Supervisory variant avoids user-login requirements for bridge-owned probes — matches the pattern the node manager already uses for its own subscriptions.
- Probes are bridge-owned, not ref-counted against client subscriptions. They live for the lifetime of the address space between rebuilds.
- On rebuild, the probe set is diffed against the new host list and the minimum number of `AdviseSupervisory`/`Unadvise` calls are issued (see `Sync` in the probe manager).
### Host discovery
Galaxy Repository already has the data — we just need to surface it to the runtime layer.
`hierarchy.sql` currently selects every deployed object where `template_definition.category_id IN (1, 3, 4, 10, 11, 13, 17, 24, 26)`. Categories `1 = $WinPlatform` and `3 = $AppEngine` are already in the set. Add `template_definition.category_id` as a new column on the query so the repository loader can tag each `GalaxyObjectInfo` with its Galaxy category, and the probe manager can filter for categories 1 and 3.
**Schema change:** add `CategoryId: int` to `GalaxyObjectInfo`, populated from `hierarchy.sql`. Small schema change, keeps the probe enumeration aligned with whatever the rest of the address space sees at each rebuild.
### Runtime host state machine
```
┌─ Unknown ─┐    (initial state; advise issued, no callback yet)
│           │
│           │  ScanState == true
│           ▼
│        Running ◄──────────────────────┐
│           │                           │
│           │  ScanState != true        │  ScanState == true
│           │  (false / error /         │  (recovery callback)
│           │   bad status)             │
│           ▼                           │
│        Stopped ───────────────────────┘
│
└─► Stopped    (Unknown → Stopped after UnknownResolutionTimeoutSeconds
                if no initial callback ever arrives)
```
Three states:
- **Unknown** — probe advised but no callback yet. Initial state after bridge startup or a rebuild until the first `OnDataChange` for that host. If this state persists longer than `UnknownResolutionTimeoutSeconds` (default 15s), the manager's periodic check flips it to Stopped — captures the "probe never resolved" case.
- **Running** — last probe callback delivered `ScanState = true` with `ItemStatus[0].success == true`. Stays in this state until a callback changes it. No starvation-based timeout.
- **Stopped** — any of:
1. Last probe callback had `ScanState != true` (explicit off-scan).
2. Last probe callback had `ItemStatus[0].success == false` (unreachable host).
3. Unknown state timed out (initial resolution never completed).
4. Initial `AdviseSupervisory` reported `ResolutionStatus` of `invalidReference` or `noGalaxyRepository`.
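The two transition paths above can be sketched as pure functions — one for the callback path, one for the time-based path. A minimal sketch, assuming the `GalaxyRuntimeState` enum from the New types section; `RuntimeTransitions` is a hypothetical name:

```csharp
using System;

public enum GalaxyRuntimeState { Unknown, Running, Stopped }

public static class RuntimeTransitions
{
    // Callback path: any callback resolves Unknown, and Running/Stopped
    // follow the isRunning classification directly.
    public static GalaxyRuntimeState OnCallback(bool isRunning)
        => isRunning ? GalaxyRuntimeState.Running : GalaxyRuntimeState.Stopped;

    // Time-based path: only Unknown times out; Running never starves
    // (ScanState is delivered on-change only).
    public static GalaxyRuntimeState OnTick(
        GalaxyRuntimeState current, TimeSpan sinceAdvise, TimeSpan unknownTimeout)
        => current == GalaxyRuntimeState.Unknown && sinceAdvise > unknownTimeout
            ? GalaxyRuntimeState.Stopped
            : current;
}
```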
### MxAccess transport down → force Unknown
When the local MxAccess client is not connected (`IMxAccessClient.State != ConnectionState.Connected`), every probe's transport is effectively offline regardless of the underlying host state. The probe manager **forces every entry to Unknown** in its snapshot output while MxAccess is disconnected. Rationale:
- Telling the operator that all hosts are `Stopped` is misleading — the actual problem is the local transport, which the existing Connection panel already surfaces prominently.
- Unknown is the right semantic: we don't know the host state because we can't see them right now.
- When MxAccess reconnects, the broker re-delivers probe subscriptions and the state machine resumes normally.
Implementation: `GetSnapshot()` checks `_client.State` and rewrites `State = Unknown` (leaving the underlying `_stateByProbe` map intact for when the transport comes back). `HealthCheckService` already rolls to Unhealthy via the MxAccess-not-connected rule before the runtime status rule fires, so this doesn't create a confusing health-rollup story.
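A minimal sketch of the override, assuming the `GalaxyRuntimeState` enum from the New types section (the `TransportOverride` helper name is illustrative): the stored states pass through untouched when connected and are rewritten to Unknown otherwise.

```csharp
using System.Collections.Generic;
using System.Linq;

public enum GalaxyRuntimeState { Unknown, Running, Stopped }

public static class TransportOverride
{
    // The underlying per-host map is left intact; only the snapshot view is
    // rewritten while the MxAccess transport is disconnected.
    public static IReadOnlyList<GalaxyRuntimeState> Snapshot(
        bool transportConnected, IReadOnlyList<GalaxyRuntimeState> stored)
        => transportConnected
            ? stored
            : stored.Select(_ => GalaxyRuntimeState.Unknown).ToList();
}
```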
### New types
All in `src/ZB.MOM.WW.LmxOpcUa.Host/Domain/`:
```csharp
public enum GalaxyRuntimeState { Unknown, Running, Stopped }

public sealed class GalaxyRuntimeStatus
{
    public string ObjectName { get; set; } = "";          // gobject.tag_name
    public int GobjectId { get; set; }
    public string Kind { get; set; } = "";                // "$WinPlatform" or "$AppEngine"
    public GalaxyRuntimeState State { get; set; }
    public DateTime? LastStateCallbackTime { get; set; }  // UTC of most recent probe callback
    public DateTime? LastStateChangeTime { get; set; }    // UTC of last Running↔Stopped transition
    public bool? LastScanState { get; set; }              // last ScanState value; null before first update
    public string? LastError { get; set; }                // MxStatus.detail description when !success
    public long GoodUpdateCount { get; set; }             // callbacks where ScanState == true
    public long FailureCount { get; set; }                // callbacks where ScanState != true or !success
}
```
Why two timestamps (`LastStateCallbackTime` vs `LastStateChangeTime`): on-change-only delivery means they'll match for most entries, but a callback that arrives with a different error detail while the host is already Stopped updates the callback time and `LastError` without touching `LastStateChangeTime`. The dashboard's "Since" column (see Dashboard panel) uses `LastStateChangeTime` so operators see "Stopped since 08:17:02Z" regardless of how many intervening error callbacks have refined the diagnostic detail.
Naming note: "Galaxy runtime" is the generic term covering both `$WinPlatform` and `$AppEngine` — the dashboard and config use this neutral phrasing so the feature doesn't look like it only covers Platforms.
### Probe manager
New class `MxAccess/GalaxyRuntimeProbeManager.cs`, owned by `LmxNodeManager`:
```csharp
internal sealed class GalaxyRuntimeProbeManager : IDisposable
{
    public GalaxyRuntimeProbeManager(
        IMxAccessClient client,
        int unknownResolutionTimeoutSeconds,
        Action<int> onHostStopped,    // invoked with GobjectId on Running → Stopped
        Action<int> onHostRunning);   // invoked with GobjectId on Stopped → Running

    // Called after address-space build / rebuild. Adds probes for new hosts,
    // removes them for hosts no longer in the hierarchy. Idempotent.
    // Caller supplies the full hierarchy; the manager filters for category_id
    // 1 ($WinPlatform) and 3 ($AppEngine).
    // Blocks on sequential AddItem/AdviseSupervisory SDK calls — see wiring notes.
    public void Sync(IReadOnlyList<GalaxyObjectInfo> hierarchy);

    // Invoked by LmxNodeManager's OnTagValueChanged callback when the address
    // matches a probe tag reference. Returns true when the event was consumed
    // by a probe so the data-change dispatch queue can skip it.
    public bool HandleProbeUpdate(string tagRef, Vtq vtq, MxStatusProxy status);

    // Called from the MxAccess connection monitor callback (MonitorIntervalSeconds
    // cadence) to advance time-based transitions:
    //   1. Unknown → Stopped when UnknownResolutionTimeoutSeconds has elapsed.
    //   2. Nothing for Running — no starvation check (on-change-only semantics).
    public void Tick();

    // Snapshot respects MxAccess transport state — returns all Unknown when
    // the transport is disconnected, regardless of the underlying per-host state.
    public IReadOnlyList<GalaxyRuntimeStatus> GetSnapshot();

    public int ActiveProbeCount { get; }

    // Unadvise + RemoveItem on every active probe. Called from LmxNodeManager.Dispose
    // before the MxAccess client teardown. Idempotent — safe to call multiple times.
    public void Dispose();
}
```
The two `Action<int>` callbacks are how the probe manager triggers the subtree quality invalidation documented below — the owning `LmxNodeManager` passes references to its own `MarkHostVariablesBadQuality` and `ClearHostVariablesBadQuality` methods at construction time. The probe manager calls them synchronously on state transitions, from whichever thread delivered the probe callback (the MxAccess dispatch thread). The node manager methods acquire their own lock internally — the probe manager does not hold its own lock across the callback invocation to avoid inverted-lock-order deadlocks.
Internals:
- `Dictionary<string, GalaxyRuntimeStatus>` keyed by probe tag reference (`<ObjectName>.ScanState`).
- Reverse `Dictionary<int, string>` from `GobjectId` to probe tag for `Sync` to diff against a fresh hierarchy.
- One lock guarding both maps. Operations are microsecond-scale.
- `Sync` filters `hierarchy` for `CategoryId == 1 || CategoryId == 3`, then compares the filtered set against the active probe set:
- Added hosts → `client.AddItem` + `AdviseSupervisory`; insert `GalaxyRuntimeStatus { State = Unknown }`.
- Removed hosts → `Unadvise` + `RemoveItem`; drop entry.
- Unchanged hosts → leave in place, preserving their state machine across the rebuild.
- `HandleProbeUpdate` is the per-callback entry point. It evaluates the `isRunning` predicate, updates `LastStateCallbackTime`, transitions state, logs at `Information` level on state changes only (not on every callback), and stores the `ItemStatus` detail into `LastError` on failure.
- `Tick` runs at the MxAccess connection-monitor cadence (see wiring notes). For each Unknown entry, it checks `LastStateCallbackTime == null && (now - _createdAt[id]) > unknownResolutionTimeoutSeconds` and flips the entry to Stopped if so. Healthy Running entries are not touched.
- `GetSnapshot` short-circuits to "all Unknown" when `_client.State != ConnectionState.Connected`.
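The `Sync` diff can be sketched as a pure function over the filtered host set. The `HostInfo` record is an illustrative stand-in for `GalaxyObjectInfo`:

```csharp
using System.Collections.Generic;
using System.Linq;

// Illustrative stand-in for GalaxyObjectInfo with the fields Sync needs.
public sealed record HostInfo(int GobjectId, string ObjectName, int CategoryId);

public static class ProbeSync
{
    // Returns the hosts to advise and the GobjectIds to unadvise, given a
    // fresh hierarchy and the currently active probe set. Unchanged hosts
    // appear in neither list, so their state machines survive the rebuild.
    public static (List<HostInfo> Add, List<int> Remove) Diff(
        IReadOnlyList<HostInfo> hierarchy, IReadOnlyCollection<int> activeGobjectIds)
    {
        var hosts = hierarchy
            .Where(h => h.CategoryId is 1 or 3) // 1 = $WinPlatform, 3 = $AppEngine
            .ToDictionary(h => h.GobjectId);

        var add = hosts.Values.Where(h => !activeGobjectIds.Contains(h.GobjectId)).ToList();
        var remove = activeGobjectIds.Where(id => !hosts.ContainsKey(id)).ToList();
        return (add, remove);
    }
}
```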
### LmxNodeManager wiring
`LmxNodeManager` constructs a `GalaxyRuntimeProbeManager` when `MxAccessConfiguration.RuntimeStatusProbesEnabled` is true. In `BuildAddressSpace` and the subtree rebuild path, after the existing loops complete, call `_probeManager.Sync(hierarchy)`. `Sync` blocks while it issues `AddItem` + `AdviseSupervisory` sequentially for each new host — for a Galaxy with ~50 runtime hosts this adds roughly 500 ms to 1 s on top of the existing several-second address-space build time. Kept synchronous deliberately: the simpler correctness model is worth the startup hit, and `ActiveProbeCount` is guaranteed to be accurate the moment the build completes.
Route the existing `OnTagValueChanged` callback through `_probeManager.HandleProbeUpdate` first — if it returns `true`, the event was consumed by a bridge-owned probe and the dispatch queue skips the normal variable-update path.
**Tick() cadence — piggyback on the MxAccess connection monitor.** The dispatch thread wakes on `_dataChangeSignal`, which only fires when tag values change. In the degenerate case where no probe ever resolves (MxAccess routing broken, bad probe tag, etc.), the dispatch loop never wakes and the Unknown → Stopped timeout would never fire. To avoid adding a new thread or timer, hook `_probeManager.Tick()` into the callback path that the existing `MxAccess.MonitorIntervalSeconds` watcher already runs — the same cadence that drives the connection-level probe-tag staleness check. A single call site covers both.
If the monitor is not accessible from `LmxNodeManager` during implementation (it lives at a different layer in the MxAccess client), fall back to Option A from the design discussion: change the dispatch loop's `WaitOne()` call to a timed `WaitOne(500ms)` so it wakes periodically regardless of data changes. Single-line change, but requires verifying no assumptions in the existing loop break from the periodic wake-ups.
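A sketch of that fallback: the dispatch wait becomes a timed `WaitOne` so time-based probe transitions advance even when no data changes arrive. The wrapper class and member names here are illustrative:

```csharp
using System;
using System.Threading;

public sealed class TimedDispatchWait
{
    private readonly AutoResetEvent _dataChangeSignal = new(false);
    public int TickCount { get; private set; }

    public void SignalDataChange() => _dataChangeSignal.Set();

    // One loop iteration: wake on the signal or after the timeout, then Tick
    // either way so the Unknown → Stopped timeout always gets a chance to fire.
    public bool RunOnce(TimeSpan timeout)
    {
        bool signalled = _dataChangeSignal.WaitOne(timeout);
        if (signalled)
        {
            // drain the data-change dispatch queue here
        }
        TickCount++; // stands in for _probeManager.Tick()
        return signalled;
    }
}
```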
### Service shutdown — explicit probe cleanup
The probe manager's `Sync` handles Unadvise on diff removal when a host leaves the hierarchy. Service shutdown is a separate path that needs explicit handling: when `LmxNodeManager` is disposed, the active probe subscriptions must be torn down before the MxAccess client is closed — otherwise we rely on the client's broader shutdown to cover supervisory subscriptions, which depends on disposal ordering and may or may not clean up cleanly.
`GalaxyRuntimeProbeManager` implements `IDisposable`. `Dispose()` walks the active probe map, calls `Unadvise` + `RemoveItem` on each entry, and clears the maps. Idempotent — calling it twice is a no-op. `LmxNodeManager.Dispose` calls `_probeManager?.Dispose()` **before** the existing teardown steps that touch the MxAccess client.
### Subtree quality invalidation on Stopped transition
**Operational context for this section** — observed behavior from production: when an AppEngine or Platform goes OffScan, MxAccess fans out per-tag `OnDataChange` callbacks for every advised tag hosted by that runtime object, each carrying bad quality. Two symptoms result:
1. **OPC UA client freeze** — the dispatch handler processes the flood in one cycle, pushes thousands of OPC UA value-change notifications to subscribed clients in one `Publish` response, and the client visibly stalls handling the volume.
2. **Incomplete quality flip** — some OPC UA variables retain their last good value with Good quality even after the host is down, either because the dispatch queue drops updates, or because some tags aren't in the subscribed set at the moment of the flood, or because of an edge case in the quality mapper. Operationally: clients read plausible-looking stale data from a dead host.
The probe-driven Stopped transition is the authoritative, on-time signal we control. On that transition, the bridge proactively walks every OPC UA variable node hosted by the Stopped host and sets its `StatusCode` to `BadOutOfService`. This is independent of whether MxAccess also delivers per-tag bad-quality updates — the two signals are belt-and-suspenders for correctness. Even if the dispatch queue drops half the per-tag updates, the subtree walk guarantees the end state is uniformly Bad for every variable under the dead host.
On the recovery `Stopped → Running` transition, the bridge walks the same set and clears the override — sets `StatusCode` back to `Good` so the cached values are visible again. Subsequent real MxAccess updates arrive on-change and overwrite value + status as normal. Trade-off: for a host that's been down a long time, some tags may show Good quality on a stale cached value for a short window after recovery, until MxAccess delivers the next on-change update for that tag. This matches existing bridge behavior for any slow-changing attribute and is preferable to leaving variables stuck at BadOutOfService indefinitely waiting for an update that may never come.
**What's included in the "subtree"** — the set of variables whose owning Galaxy object is hosted (transitively) by the Stopped host. For AppEngines, this is every variable whose object's `host_gobject_id` chain reaches the Engine. For Platforms, it's every variable on every Engine hosted by the Platform, plus every object hosted directly on the Platform. This is **not** browse-tree containment — an object can live in one Area (browse parent) but be hosted by an Engine on a different Platform (runtime parent), and the host relationship is what determines the fate of its live data.
Implementation plan for the host-to-variables mapping:
1. Extend `hierarchy.sql` to return `gobject.host_gobject_id` as a new column if it exists. Verify during implementation — if the column is not present on this Galaxy schema version, fall back to `contained_by_gobject_id` as an approximation (less precise for edge cases where browse containment differs from runtime hosting, but sufficient for typical Galaxy topologies).
2. Extend `GalaxyObjectInfo` with `HostGobjectId: int`.
3. During `BuildAddressSpace`, as each variable is created, compute its owning host by walking `HostGobjectId` up the chain until hitting a `$WinPlatform` or `$AppEngine` (or reaching the root). Append the variable to a `Dictionary<int, List<BaseDataVariableState>>` keyed by the host's `GobjectId`.
4. On `BuildSubtree` (incremental rebuild), the same logic runs for newly added variables. Variables that leave the hierarchy are removed from the map. The map lives next to `_nodeMap` on `LmxNodeManager`.
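Step 3's owning-host walk can be sketched as below. `ObjInfo` is an illustrative stand-in for `GalaxyObjectInfo` with the planned `HostGobjectId` field; note this resolves the *nearest* host, so the Platform case described above (a Platform stop also covering objects on its hosted Engines) still needs a fan-out over the Engines that host-resolve to that Platform:

```csharp
using System.Collections.Generic;

// Illustrative stand-in for GalaxyObjectInfo with the planned HostGobjectId field.
public sealed record ObjInfo(int GobjectId, int HostGobjectId, int CategoryId);

public static class HostResolver
{
    // Walks the hosting chain from an object up to the nearest $WinPlatform (1)
    // or $AppEngine (3). Returns null when the chain leaves the map (root) or
    // would cycle on malformed data.
    public static int? OwningHost(int gobjectId, IReadOnlyDictionary<int, ObjInfo> byId)
    {
        var seen = new HashSet<int>();
        while (byId.TryGetValue(gobjectId, out var obj) && seen.Add(gobjectId))
        {
            if (obj.CategoryId is 1 or 3) return obj.GobjectId;
            gobjectId = obj.HostGobjectId;
        }
        return null;
    }
}
```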
New public methods on `LmxNodeManager`:
```csharp
// Called by probe manager on Running → Stopped. Walks every variable hosted by
// gobjectId and sets its StatusCode to BadOutOfService. Safe to call multiple times.
// Does nothing when gobjectId has no hosted variables.
public void MarkHostVariablesBadQuality(int gobjectId);
// Called by probe manager on Stopped → Running. Walks every variable hosted by
// gobjectId and resets StatusCode to Good. Values are left at whatever the last
// MxAccess-delivered value was; subsequent on-change updates will refresh them.
public void ClearHostVariablesBadQuality(int gobjectId);
```
Both methods acquire the standard node manager `Lock`, iterate the hosted list, set `StatusCode` + call `ClearChangeMasks(ctx, false)` per variable, and release the lock. The OPC UA subscription publisher picks up the change masks on its next tick and pushes notifications to subscribed clients — so operators see a single uniform quality flip per variable rather than two (one from our walk, one from the MxAccess per-tag delivery).
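A sketch of the two methods against a stand-in variable type — the real implementation acts on the SDK's `BaseDataVariableState` under the node manager `Lock` and calls `ClearChangeMasks(ctx, false)`; the numeric constants are the OPC UA `Good` and `Bad_OutOfService` status code values:

```csharp
using System.Collections.Generic;

// Stand-in for BaseDataVariableState: a status code plus a change flag that
// models ClearChangeMasks queuing a notification for the publisher.
public sealed class FakeVariable
{
    public uint StatusCode = HostQualitySketch.Good;
    public bool ChangePending;
    public void ClearChangeMasks() => ChangePending = true;
}

public sealed class HostQualitySketch
{
    public const uint Good = 0x00000000;            // OPC UA Good
    public const uint BadOutOfService = 0x808D0000; // OPC UA Bad_OutOfService

    private readonly object _lock = new();
    private readonly Dictionary<int, List<FakeVariable>> _hostedVariables = new();

    public void AddHostedVariable(int hostGobjectId, FakeVariable v)
    {
        lock (_lock)
        {
            if (!_hostedVariables.TryGetValue(hostGobjectId, out var list))
                _hostedVariables[hostGobjectId] = list = new List<FakeVariable>();
            list.Add(v);
        }
    }

    public void MarkHostVariablesBadQuality(int gobjectId) => Set(gobjectId, BadOutOfService);
    public void ClearHostVariablesBadQuality(int gobjectId) => Set(gobjectId, Good);

    private void Set(int gobjectId, uint statusCode)
    {
        lock (_lock) // stands in for the node manager Lock
        {
            if (!_hostedVariables.TryGetValue(gobjectId, out var vars)) return; // no-op
            foreach (var v in vars)
            {
                v.StatusCode = statusCode;
                v.ClearChangeMasks(); // publisher pushes the flip on its next tick
            }
        }
    }
}
```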
### Dispatch suppression — deferred pending observation
The subtree invalidation above addresses the **data-correctness** symptom (some variables not flipping to bad quality). The **client freeze** symptom is a separate problem: even if the quality state is correct, the bridge is still processing a thundering herd of per-tag bad-quality MxAccess callbacks through the dispatch queue, which in turn push thousands of OPC UA value-change notifications to subscribed clients.
A stronger fix would be **dispatch suppression**: once the probe manager transitions a host to Stopped, filter out incoming MxAccess per-tag updates for any tag owned by that host before they hit the dispatch queue. The subtree walk has already captured the state; the redundant per-tag updates are pure noise.
**This is deliberately NOT part of phase 1.** Reasons:
- The subtree walk may make the freeze disappear entirely. If the dispatch queue processes the flood but the notifications it pushes are now duplicates of change masks the walk already set, the SDK may coalesce them into a single publish cycle and the client sees one notification batch rather than thousands. We want to observe whether this is the case before building suppression.
- If the freeze persists after subtree invalidation ships, we have a real measurement of the residual problem to inform the suppression design (which hosts, which tags, how much batching, whether to also coalesce at the OPC UA publisher level).
- The suppression path has a subtle failure mode: if the probe is briefly wrong (race where the probe says Stopped but the host actually recovered), we'd drop legitimate updates for a few seconds until the probe catches up. For an on-change-only probe this is bounded, but the plan should justify the trade-off against real observed data.
Phase 2 decision gate: after shipping phase 1 and observing the post-subtree-walk behavior against a real AppEngine stop, decide whether dispatch suppression is still needed and design it against the real measurement.
### OPC UA address space exposure
Per-host status should be readable by OPC UA clients, not just the dashboard. Add child variable nodes under each `$WinPlatform` / `$AppEngine` object node in the address space. **All bridge-synthetic nodes use a `$` prefix** so they can never collide with user-defined attributes on extended templates:
- `<Object>.$RuntimeState` (`String`) — `Unknown` / `Running` / `Stopped`.
- `<Object>.$LastCallbackTime` (`DateTime`) — most recent probe callback regardless of transition.
- `<Object>.$LastScanState` (`Boolean`) — last `ScanState` value received; null before first update.
- `<Object>.$LastStateChangeTime` (`DateTime`) — most recent Running↔Stopped transition, backs the dashboard "Since" column.
- `<Object>.$FailureCount` (`Int64`)
- `<Object>.$LastError` (`String`) — last non-success MxStatus detail, empty string when null.
These read from the probe manager's snapshot (bridge-synthetic, no MxAccess round-trip) and are updated via `ChangeBits.Value` signalling when the state transitions. Read-only.
Note: the underlying `<ObjectName>.ScanState` Galaxy attribute will already appear in the address space via the normal hierarchy-build path, so downstream clients will see both the raw attribute (`ns=3;s=DevPlatform.ScanState`) and the synthesized state rollup (`ns=3;s=DevPlatform.$RuntimeState`). Intentional — the raw attribute is the ground truth, the rollup adds state-change timestamps and the Unknown/Running/Stopped trichotomy.
Namespace placement: under the existing host object node in the Galaxy namespace (`ns=3`), browseable at `DevPlatform/$RuntimeState` etc. No new namespace needed.
### Dashboard
#### Runtime Status panel
New `RuntimeStatusInfo` class on `StatusData`:
```csharp
public class RuntimeStatusInfo
{
    public int Total { get; set; }
    public int RunningCount { get; set; }
    public int StoppedCount { get; set; }
    public int UnknownCount { get; set; }
    public List<GalaxyRuntimeStatus> Hosts { get; set; } = new();
}
```
Populated in `StatusReportService` via a new `LmxNodeManager.RuntimeStatuses` accessor. Renders between the Galaxy Info panel and the Historian panel.
Panel color:
- **Green** — all hosts Running.
- **Yellow** — at least one Unknown, zero Stopped.
- **Red** — at least one Stopped.
- **Gray** — MxAccess disconnected (all hosts Unknown; the Connection panel is the primary signal).
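The precedence above can be captured in one expression. A sketch with a hypothetical `PanelColor` enum:

```csharp
public enum PanelColor { Green, Yellow, Red, Gray }

public static class RuntimePanel
{
    // Precedence mirrors the rules above: transport down wins, then Stopped,
    // then Unknown, else all-Running green.
    public static PanelColor Color(bool mxAccessConnected, int stoppedCount, int unknownCount)
        => !mxAccessConnected ? PanelColor.Gray
         : stoppedCount > 0   ? PanelColor.Red
         : unknownCount > 0   ? PanelColor.Yellow
         : PanelColor.Green;
}
```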
HTML layout:
```
┌ Galaxy Runtime ──────────────────────────────────────────────────────┐
│ 5 of 6 hosts running (3 platforms, 3 engines)                        │
│ ┌─────────────────┬──────────────┬─────────┬──────────────────────┐  │
│ │ Name            │ Kind         │ State   │ Since                │  │
│ ├─────────────────┼──────────────┼─────────┼──────────────────────┤  │
│ │ DevPlatform     │ $WinPlatform │ Running │ 2026-04-13T08:15:02Z │  │
│ │ DevAppEngine    │ $AppEngine   │ Running │ 2026-04-13T08:15:04Z │  │
│ │ PlatformA       │ $WinPlatform │ Running │ 2026-04-13T08:15:03Z │  │
│ │ EngineA_1       │ $AppEngine   │ Running │ 2026-04-13T08:15:05Z │  │
│ │ EngineA_2       │ $AppEngine   │ Stopped │ 2026-04-13T14:28:03Z │  │
│ │ PlatformB       │ $WinPlatform │ Running │ 2026-04-13T08:15:04Z │  │
│ └─────────────────┴──────────────┴─────────┴──────────────────────┘  │
└──────────────────────────────────────────────────────────────────────┘
```
The "Since" column is backed by `LastStateChangeTime`, and its meaning depends on the row's current state: "Running since X" reads as "has been on scan since X", "Stopped since X" reads as "has been off scan since X". For Unknown rows, display "Advised since X" instead (the probe was registered at X but has not yet received its first callback).
#### Subscriptions panel — break out bridge probe count
The existing Subscriptions panel shows `Active: N` — the total advised-item count from `IMxAccessClient.ActiveSubscriptionCount`. After this ships, that number will include the bridge-owned runtime probes (one per Platform + one per AppEngine), which would look like a silent jump to operators watching for capacity planning purposes.
Fix: expose a new `ActiveProbeSubscriptionCount` property on `LmxNodeManager` (wired from `GalaxyRuntimeProbeManager.ActiveProbeCount`) and render as a second line on the Subscriptions panel:
```
┌ Subscriptions ──────────────────────────────┐
│ Active: 1247                                │
│ Probes: 6 (bridge-owned runtime status)     │
└─────────────────────────────────────────────┘
```
The `Active` total continues to include probes (no subtraction) so the count still matches whatever MxAccess actually holds — the breakout line tells operators which slice is bridge-internal.
### HealthCheckService rule
New rule in `HealthCheckService.CheckHealth`:
```
Rule 2e: Any Galaxy runtime host in Stopped state → Degraded
- Yellow panel
- Message: "N of M hosts stopped: Host1, Host2"
```
Rationale: the bridge is still able to talk to the local MxAccess runtime and serve cached values for the hosts that are up, so this is `Degraded` rather than `Unhealthy`. A stopped host is recoverable — the operator fixes it and the probe automatically transitions back to `Running`.
Rule ordering matters: this rule checks after the MxAccess-connected check (Rule 1), so when MxAccess is disconnected the service is Unhealthy on Rule 1 and the runtime-host rule never runs — avoids the confusing "MxAccess down AND Galaxy runtime degraded" double message.
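A sketch of the rule's message formatting, using a hypothetical tuple shape for the host list:

```csharp
using System.Collections.Generic;
using System.Linq;

public static class RuntimeHealthRule
{
    // Returns null when the rule does not fire; otherwise the Degraded
    // message in the "N of M hosts stopped: ..." shape from the rule text.
    public static string? StoppedHostsMessage(IReadOnlyList<(string Name, bool Stopped)> hosts)
    {
        var stopped = hosts.Where(h => h.Stopped).Select(h => h.Name).ToList();
        return stopped.Count == 0
            ? null
            : $"{stopped.Count} of {hosts.Count} hosts stopped: {string.Join(", ", stopped)}";
    }
}
```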
### Configuration
New fields on `MxAccessConfiguration` (not a new config class — this is a runtime concern of the MxAccess bridge):
```csharp
public class MxAccessConfiguration
{
    // ...existing fields...

    /// <summary>
    /// Enables per-host runtime status probing via AdviseSupervisory on
    /// <c>&lt;ObjectName&gt;.ScanState</c> for every deployed $WinPlatform
    /// and $AppEngine. Default enabled when a deployed ArchestrA Platform
    /// is present. Set false for bridges that don't need multi-host
    /// visibility and want to minimize subscription count.
    /// </summary>
    public bool RuntimeStatusProbesEnabled { get; set; } = true;

    /// <summary>
    /// Maximum seconds to wait for the initial probe callback before marking
    /// an Unknown host as Stopped. Only applies to the Unknown → Stopped
    /// transition; Running hosts do not time out (ScanState is delivered
    /// on-change only, so a stable healthy host may go indefinitely without
    /// a callback). Default 15s.
    /// </summary>
    public int RuntimeStatusUnknownTimeoutSeconds { get; set; } = 15;
}
```
No new top-level config section. Validator emits a warning if the timeout is shorter than 5 seconds (below the reasonable floor for MxAccess initial-resolution latency).
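A sketch of the validator rule (the method name is illustrative); it returns a warning string rather than failing validation:

```csharp
public static class RuntimeStatusValidation
{
    // Warn (don't fail) when the timeout is below the 5 s floor for
    // MxAccess initial-resolution latency; null means no warning.
    public static string? Warn(int unknownTimeoutSeconds)
        => unknownTimeoutSeconds < 5
            ? $"RuntimeStatusUnknownTimeoutSeconds={unknownTimeoutSeconds} is below " +
              "the 5s floor for MxAccess initial resolution; probes may flap to Stopped."
            : null;
}
```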
## Critical Files
### Modified
- `src/ZB.MOM.WW.LmxOpcUa.Host/Domain/GalaxyObjectInfo.cs` — add `CategoryId: int` and `HostGobjectId: int`
- `src/ZB.MOM.WW.LmxOpcUa.Host/GalaxyRepository/GalaxyRepositoryService.cs` — include `template_definition.category_id` and `gobject.host_gobject_id` in `HierarchySql` and the reader (falling back to `contained_by_gobject_id` if host column is unavailable)
- `gr/queries/hierarchy.sql` — same column additions (documentation query)
- `src/ZB.MOM.WW.LmxOpcUa.Host/Configuration/MxAccessConfiguration.cs` — add `RuntimeStatusProbesEnabled` + `RuntimeStatusUnknownTimeoutSeconds`
- `src/ZB.MOM.WW.LmxOpcUa.Host/OpcUa/LmxNodeManager.cs` — construct probe manager, wire `OnTagValueChanged` and the MxAccess monitor callback, build `_hostedVariables: Dictionary<int, List<BaseDataVariableState>>` during address-space construction, expose `RuntimeStatuses` / `ActiveProbeSubscriptionCount` / `MarkHostVariablesBadQuality` / `ClearHostVariablesBadQuality`
- `src/ZB.MOM.WW.LmxOpcUa.Host/Status/StatusData.cs` — add `RuntimeStatusInfo`; add `ProbeSubscriptionCount` field on `SubscriptionInfo`
- `src/ZB.MOM.WW.LmxOpcUa.Host/Status/StatusReportService.cs` — populate from node manager, render Runtime Status panel + Probes line
- `src/ZB.MOM.WW.LmxOpcUa.Host/Status/HealthCheckService.cs` — new Rule 2e (after Rule 1 to avoid double-messaging when MxAccess is down)
- `src/ZB.MOM.WW.LmxOpcUa.Host/appsettings.json` — new MxAccess fields with defaults
- `src/ZB.MOM.WW.LmxOpcUa.Host/Configuration/ConfigurationValidator.cs` — timeout floor warning
- `docs/MxAccessBridge.md` — document the probe pattern and on-change semantics
- `docs/StatusDashboard.md` — add `RuntimeStatusInfo` field table and Probes line
- `docs/Configuration.md` — add the two new MxAccess fields
### New
- `src/ZB.MOM.WW.LmxOpcUa.Host/Domain/GalaxyRuntimeStatus.cs`
- `src/ZB.MOM.WW.LmxOpcUa.Host/Domain/GalaxyRuntimeState.cs`
- `src/ZB.MOM.WW.LmxOpcUa.Host/MxAccess/GalaxyRuntimeProbeManager.cs`
- `tests/ZB.MOM.WW.LmxOpcUa.Tests/MxAccess/GalaxyRuntimeProbeManagerTests.cs`
## Execution order
1. **DTO + enum** — `GalaxyRuntimeState`, `GalaxyRuntimeStatus`.
2. **Hierarchy schema** — add `CategoryId` to `GalaxyObjectInfo`, extend `HierarchySql` to select `td.category_id` as a new column, update `GalaxyRepositoryService` reader.
3. **Config** — add the two new `MxAccessConfiguration` fields and validator rule.
4. **Probe manager class + unit tests (TDD)** — write `GalaxyRuntimeProbeManagerTests.cs` first. Fake `IMxAccessClient` with scripted `OnTagValueChanged` invocations, configurable `State`, and a fake clock. Exercise the full matrix in the test plan below.
5. **Ship tests green before touching node manager.**
6. **Host-to-variables mapping in node manager** — add `_hostedVariables: Dictionary<int, List<BaseDataVariableState>>` populated during `BuildAddressSpace`. For each variable node, walk its owning object's `HostGobjectId` chain up to the nearest `$WinPlatform` or `$AppEngine` and append to that host's list. On rebuild (`BuildSubtree`), incrementally maintain the map. Expose `MarkHostVariablesBadQuality(int gobjectId)` and `ClearHostVariablesBadQuality(int gobjectId)` public methods that take the node manager `Lock`, iterate the hosted list, set/clear `StatusCode`, and call `ClearChangeMasks(ctx, false)` per variable.
7. **Node manager wiring** — construct `GalaxyRuntimeProbeManager`, pass `MarkHostVariablesBadQuality` / `ClearHostVariablesBadQuality` as its `onHostStopped` / `onHostRunning` callbacks, call `Sync` after `BuildAddressSpace` / rebuild, route `OnTagValueChanged` through `HandleProbeUpdate`, hook `Tick()` into the MxAccess connection-monitor callback path (fall back to timed `WaitOne(500ms)` on the dispatch loop if the monitor isn't reachable from the node manager). Add `RuntimeStatuses` and `ActiveProbeSubscriptionCount` accessors. Call `_probeManager?.Dispose()` from `LmxNodeManager.Dispose` **before** the existing MxAccess client teardown steps.
8. **OPC UA synthetic nodes** — under each `$WinPlatform` and `$AppEngine` node in `BuildAddressSpace`, add the six `$`-prefixed variables backed by lambdas that read from the probe manager snapshot.
9. **Dashboard** — `RuntimeStatusInfo` on `StatusData`, `BuildRuntimeStatusInfo` in `StatusReportService`, render Runtime Status panel, add Probes line to Subscriptions panel. Status tests asserting both.
10. **Health check** — new Rule 2e with test: Degraded when any host is stopped, message names the stopped hosts.
11. **Integration tests** — `LmxNodeManagerBuildTests` additions with a fake repository containing mixed `$WinPlatform` and `$AppEngine` hierarchy entries; verify `Sync` is called, synthetic nodes are created on both host types, `_hostedVariables` map is populated, and `MarkHostVariablesBadQuality` / `ClearHostVariablesBadQuality` flip status codes on the correct subset.
12. **Docs** — `MxAccessBridge.md`, `StatusDashboard.md`, `Configuration.md`.
13. **Deploy** — backup, deploy both instances, verify via dashboard.
14. **Live verification** — see Verification section below.
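The hosting walk in step 6 above can be sketched as follows. Python is used for brevity (the real implementation is C# against `BaseDataVariableState` nodes), and the dict shapes here are illustrative assumptions:

```python
RUNTIME_KINDS = {"$WinPlatform", "$AppEngine"}

def build_hosted_variables(objects, variables):
    """objects: {gobject_id: (category, host_gobject_id_or_None)}
    variables: [(variable_node, owning_gobject_id), ...]
    Returns {runtime_host_id: [variable_node, ...]}. Each variable is
    appended to EVERY runtime-host ancestor on its HostGobjectId chain,
    so Engine-hosted variables also appear under the parent Platform."""
    hosted = {oid: [] for oid, (cat, _) in objects.items() if cat in RUNTIME_KINDS}
    for var, owner in variables:
        cur = owner
        while cur is not None:
            cat, host = objects[cur]
            if cat in RUNTIME_KINDS:
                hosted[cur].append(var)
            cur = host
    return hosted
```

A variable landing in both the Engine's and the parent Platform's lists is what lets a full Platform stop invalidate the entire subtree with one `MarkHostVariablesBadQuality` call.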
## Test plan
### `GalaxyRuntimeProbeManagerTests.cs` — unit tests with fake client + fake clock
**State transitions**
- Fresh manager → empty snapshot.
- `Sync` with one Platform + one Engine → snapshot contains two entries in `Unknown`, `Kind` set correctly.
- First `ScanState = true` update → Unknown → Running, `LastUpdateTime` and `LastScanState = true` set, `GoodUpdateCount == 1`.
- Second `ScanState = true` update → still Running, counter increments.
- `ScanState = false` update → Running → Stopped, `LastScanState = false`, `FailureCount == 1`.
- `ItemStatus[0].success = false, detail = 2` update → Running → Stopped, `LastError` contains `MX_E_PlatformCommunicationError`.
- Null value delivered → Running → Stopped defensively, `LastError` explains null-value rejection.
- Recovery `ScanState = true` after Stopped → Stopped → Running, `LastStateChangeTime` updated, `LastError` cleared.
- Platform and AppEngine transitions behave identically (parameterized test).
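The transition cases above reduce to one small classification function. A Python sketch of the intended mapping (names are illustrative; the production code is C#):

```python
def classify_update(scan_state, success=True, error_detail=None):
    """Map one probe callback onto a runtime state.
    Returns (state, last_error)."""
    if not success:
        # Error callback (e.g. MX_E_PlatformCommunicationError) => Stopped.
        return "Stopped", f"probe error callback (detail={error_detail})"
    if scan_state is None:
        # Null value is rejected defensively rather than trusted.
        return "Stopped", "null ScanState value rejected"
    return ("Running", None) if scan_state else ("Stopped", None)
```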
**Unknown resolution timeout**
- No callback + clock advances past timeout → Unknown → Stopped.
- Good update just before timeout → Unknown → Running (no subsequent Stopped).
- Good update after timeout already flipped Unknown → Stopped → Stopped → Running (recovery path still works).
- `Tick` on a Running entry with no recent update → still Running (no starvation check — this is the critical on-change-semantic guarantee).
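The timeout rule being tested can be sketched in a few lines (Python for brevity; the entry fields are assumptions):

```python
def tick(entry, now, timeout_seconds):
    """Unknown-timeout backstop: only an Unknown entry can time out into
    Stopped. Running entries are deliberately never starved, because
    ScanState is advised on-change and silence while Running is normal."""
    if entry["state"] == "Unknown" and now - entry["unknown_since"] > timeout_seconds:
        entry["state"] = "Stopped"
        entry["last_error"] = "no probe callback before Unknown timeout"
    return entry["state"]
```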
**MxAccess transport gating**
- Client `State = Disconnected` → `GetSnapshot` returns all entries with `State = Unknown` regardless of underlying state.
- Client flips Connected → Disconnected → underlying state preserved internally; snapshot reports Unknown.
- Client flips Disconnected → Connected → snapshot reflects underlying state again.
- Incoming `HandleProbeUpdate` while client is Disconnected → still updates the underlying state machine (so the snapshot is correct when transport comes back).
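Transport gating in a sketch (assuming a simple name-to-state dict; the real snapshot carries full `GalaxyRuntimeStatus` records):

```python
def snapshot(entries, transport_connected):
    """A Disconnected client masks every host as Unknown in the snapshot
    without touching the underlying state machine, so the real states
    reappear as soon as the transport reconnects."""
    if transport_connected:
        return dict(entries)
    return {name: "Unknown" for name in entries}
```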
**Sync diff behavior**
- Sync with new Platform → Advise called once, counter = 1.
- Sync with new Engine → Advise called once, counter = 1.
- Sync twice with same hosts → Advise called once total (idempotent on unchanged entries).
- Sync then Sync with a Platform removed → Unadvise called, snapshot loses entry.
- Sync with different host set → Advise for new, Unadvise for old, unchanged preserved.
- Sync filters out non-runtime categories (areas, user objects) — hierarchy with 10 mixed categories and 2 runtime hosts produces exactly 2 probes.
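The diff behavior above amounts to two set differences; a sketch with hypothetical names:

```python
def sync_diff(current, desired):
    """Idempotent Sync: advise only hosts that are new, unadvise only hosts
    that disappeared, leave unchanged entries (and their state) alone."""
    desired = set(desired)
    return desired - current, current - desired  # (to_advise, to_unadvise)
```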
**Event routing**
- `HandleProbeUpdate(probeAddr, ...)` → returns `true`, updates state.
- `HandleProbeUpdate(nonProbeAddr, ...)` → returns `false`, no state change.
- Concurrent `Sync` + `HandleProbeUpdate` under lock → no corruption (thread-safety smoke test).
- Callback arriving after Sync removed the entry → `HandleProbeUpdate` returns false (entry not found), no crash.
**Counters**
- `ActiveProbeCount == 2` after Sync with 1 Platform + 1 Engine.
- `ActiveProbeCount` decrements when a host is removed via Sync.
- `ActiveProbeCount == 0` on a fresh manager with no Sync called yet.
**Dispose**
- Dispose on a fresh manager → no-op, no Unadvise calls on the fake client.
- Dispose after Sync with 3 hosts → 3 Unadvise + 3 RemoveItem calls on the fake client.
- Dispose twice → second call is idempotent, no extra Unadvise calls.
- HandleProbeUpdate after Dispose → returns false defensively (no crash, no state change).
- Sync after Dispose → no-op or throws ObjectDisposedException (pick one; test documents whichever is chosen).
**Subtree invalidation callbacks**
- Construct probe manager with spy callbacks tracking `(gobjectId, kind)` tuples for each call.
- Running → Stopped transition → `onHostStopped` invoked exactly once with the correct GobjectId, `onHostRunning` never called.
- Stopped → Running transition → `onHostRunning` invoked exactly once with the correct GobjectId, `onHostStopped` never called.
- Unknown → Running (initial callback) → no invocation of either callback (only Running↔Stopped transitions trigger them, not fresh Unknown→Running).
- Unknown → Stopped (via timeout) → `onHostStopped` invoked once.
- Multiple consecutive callbacks with `ScanState=true` while already Running → no extra `onHostRunning` invocations.
- Multiple consecutive error callbacks while already Stopped → no extra `onHostStopped` invocations.
- Callback throws exception → probe manager logs a warning, updates its internal state regardless, does not propagate.
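The callback matrix above can be captured in one function. A Python sketch of the intended rule (the real callbacks are the node manager's mark/clear methods):

```python
def callback_for(old, new):
    """Which spy callback fires for a state transition (None = neither).
    Unknown -> Running is silent initial resolution; Unknown -> Stopped
    and Running <-> Stopped are the only transitions that fire."""
    if old == new:
        return None            # repeats never re-fire
    if new == "Stopped":
        return "onHostStopped"
    if new == "Running" and old == "Stopped":
        return "onHostRunning"
    return None                # Unknown -> Running
```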
### `LmxNodeManagerBuildTests` additions
- Build address space with a `$WinPlatform` in the fake hierarchy → probe manager receives a `Sync` call with one entry.
- Build address space with a mix (1 Platform + 2 AppEngines + 5 user objects) → probe manager Sync receives exactly 3 runtime hosts.
- Build + rebuild with different host set → probe manager's `Sync` called twice with correct diff.
- Address space contains synthetic `$RuntimeState` variable under each host object node.
- `ActiveProbeSubscriptionCount` reflects probe count after build.
### Host-to-variables mapping + subtree invalidation tests
- Build address space with 1 `$AppEngine` hosting 2 user objects with 3 attributes each → `_hostedVariables[engineId]` contains 6 variable nodes.
- Build address space with 1 `$WinPlatform` hosting 2 `$AppEngine`s, each hosting 3 user objects with 2 attributes each → `_hostedVariables[platformId]` contains the 2 Engine nodes + 12 attribute variables; `_hostedVariables[engineId]` contains its 6 attribute variables. (Platform and Engine entries both exist; a single variable can appear in both lists.)
- Rebuild with a different set → the map reflects the new hierarchy; old entries are released.
- `MarkHostVariablesBadQuality(engineId)` → every variable in `_hostedVariables[engineId]` has `StatusCode = BadOutOfService` after the call; variables hosted by other engines are unchanged.
- `ClearHostVariablesBadQuality(engineId)` → every variable in that host's list has `StatusCode = Good` after the call.
- `MarkHostVariablesBadQuality` on a GobjectId with no entry in the map → no-op, no crash.
- `MarkHostVariablesBadQuality` followed by a fresh MxAccess update on one of the variables → the update's Value + Status overwrites the forced Bad (confirms no "override layer" confusion; the simple StatusCode set is naturally overwritten by the normal dispatch path).
- `MarkHostVariablesBadQuality` acquires the node manager `Lock` (verify no deadlock when called from a thread that also needs the lock).
### End-to-end subtree invalidation integration test
- Fake repository with 1 Engine hosting 10 attributes. All on advise. All have some recent value with Good status.
- Simulate probe callback delivering `ScanState = false` for the Engine → probe manager flips to Stopped, invokes `onHostStopped`, which in turn walks the 10 variables and flips them to BadOutOfService.
- Assert all 10 variables now report StatusCode = BadOutOfService after a `client.Read` round-trip.
- Simulate probe callback delivering `ScanState = true` again → probe manager flips to Running, `onHostRunning` clears the override, all 10 variables now report StatusCode = Good.
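The mark/clear semantics under test reduce to a plain status write with no override layer. A sketch, using dicts as stand-ins for `BaseDataVariableState`:

```python
def set_host_quality(hosted_variables, gobject_id, status_code):
    """Flip the StatusCode of every variable hosted by one runtime host.
    An unknown gobject_id is a silent no-op. There is no override layer:
    a later normal data-change write simply overwrites the forced value."""
    for var in hosted_variables.get(gobject_id, []):
        var["status"] = status_code
```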
### `StatusReportServiceTests` additions
- HTML contains `<h2>Galaxy Runtime</h2>` when at least one runtime host is present.
- HTML rendering distinguishes `$WinPlatform` and `$AppEngine` rows in the Kind column.
- JSON exposes `RuntimeStatus.Total`, `RuntimeStatus.RunningCount`, `RuntimeStatus.StoppedCount`, `RuntimeStatus.Hosts[]`.
- Subscriptions panel HTML contains a `Probes:` line when `ProbeSubscriptionCount > 0`.
- No Runtime Status panel when the fake repository has zero runtime hosts.
- When fake MxAccess client is `Disconnected`, all host rows render `Unknown` regardless of state passed in.
### `HealthCheckServiceTests` additions
- All hosts running → Healthy.
- One host stopped → Degraded, message mentions the stopped host name.
- All hosts stopped → Degraded (not Unhealthy — cached values still served).
- MxAccess disconnected + one host stopped → Unhealthy via Rule 1 (runtime status rule doesn't fire).
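The rule interplay under test, as a sketch (names hypothetical; the real rules live in `HealthCheckService`):

```python
def overall_health(mx_connected, stopped_hosts):
    """Rule precedence: Rule 1 (MxAccess transport down) wins outright, so
    Rule 2e never double-messages. Stopped hosts only degrade, never make
    the bridge Unhealthy by themselves, because cached values are still
    served."""
    if not mx_connected:
        return "Unhealthy", "MxAccess disconnected"
    if stopped_hosts:
        return "Degraded", "stopped: " + ", ".join(sorted(stopped_hosts))
    return "Healthy", ""
```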
## Verification
1. `dotnet build` clean on both Host and plugin.
2. `dotnet test tests/ZB.MOM.WW.LmxOpcUa.Tests --filter "FullyQualifiedName~GalaxyRuntimeProbe|FullyQualifiedName~Status|FullyQualifiedName~HealthCheck"` → all pass.
3. Deploy to instance1 (default `RuntimeStatusProbesEnabled: true`, `RuntimeStatusUnknownTimeoutSeconds: 15`). Dashboard shows `Galaxy Runtime: 2 of 2 hosts running` (DevPlatform + DevAppEngine) immediately after startup, all green. Subscriptions panel shows `Probes: 2`.
4. Stop `DevAppEngine` from the IDE (SMC `SetToOffScan` or the engine's stop action, leaving its parent Platform running). Verify:
- Dashboard panel turns red within ~1s of the action.
- DevAppEngine row shows `Stopped` with the last good timestamp.
- DevPlatform row remains `Running` — confirms the engines are independently observable.
- Overall Health rolls up to `Degraded`.
- CLI `read ns=3;s=DevAppEngine.$RuntimeState` returns `"Stopped"`.
- Log has an `Information` line "Galaxy runtime DevAppEngine ($AppEngine) transitioned Running → Stopped".
- **Subtree invalidation**: CLI `read ns=3;s=TestMachine_001.MachineID` and any other tag under an object hosted by DevAppEngine returns status code `BadOutOfService` (or whatever specific code the Mark method uses). Every descendant tag, not just a sample — sweep-test via a browse + read across the whole address space. Operators also observe this on the dashboard Alarms / subscribed-variable reads if they're watching any particular value.
- **Client-freeze observation**: subscribe an OPC UA client to a handful of variables under DevAppEngine before step 4, then trigger the stop. Note whether the client handles the resulting notification batch cleanly (ideal) or visibly stalls (residual problem that dispatch suppression would need to address in phase 2). Document the observed behavior in the phase-2 decision gate for dispatch suppression.
5. Start `DevAppEngine` again (`SetToOnScan`). Verify:
- Dashboard flips back to green within ~1s.
- CLI read of `$RuntimeState` returns `"Running"`.
- Log has a "Galaxy runtime DevAppEngine ($AppEngine) transitioned Stopped → Running" line.
- **Subtree recovery**: descendant tags previously showing `BadOutOfService` now show `Good` status. Values may initially be stale (whatever was cached at stop time) until fresh on-change MxAccess updates arrive; this matches the design trade-off documented in the Subtree Quality Invalidation section.
6. Stop `DevPlatform` entirely (full platform stop). Verify:
- Both DevPlatform and DevAppEngine flip to `Stopped` (the Platform takes the Engine down with it).
- Log records both transitions.
- CLI reads of `$RuntimeState` for both hosts return `"Stopped"`.
- The underlying raw `ScanState` attribute reads may return BadCommunicationError — operator sees the distinction between the cached rollup and the live raw attribute.
7. Simulate MxAccess transport loss — e.g., stop the ArchestrA runtime on the local node or kill the probe connection. Verify:
- Every host row in the Runtime Status panel renders `Unknown` (not Stopped) while the Connection panel reports `Disconnected`.
- Overall Health is `Unhealthy` via Rule 1, NOT `Degraded` via Rule 2e (the rules should not double-message).
- After MxAccess reconnects, the runtime rows revert to their actual underlying states.
8. Deploy to instance2 with same config. Both instances should show consistent state since they observe the same local ArchestrA runtime.
9. Smoke-test: disable probes via `RuntimeStatusProbesEnabled: false`, restart, verify Runtime Status panel absent from HTML, `Probes:` line absent from Subscriptions panel, no probe subscriptions advised (log and `ActiveSubscriptionCount` delta) — backward compatibility path for deployments that don't want the feature.
10. **Unresolvable-probe-tag behavior verification** — temporarily add a bogus tag to the probe set to discover how MxAccess surfaces resolution failures. The simplest way is to force the probe manager to advise a made-up `NoSuchPlatform_999.ScanState` reference during a test boot, then observe:
- Does MxAccess deliver a data-change callback with `ItemStatus[0].success = false` and a resolution-failure detail? If yes, the host row transitions Unknown → Stopped within ~1s via the error-callback path, and `LastError` carries the detail. Tighten the plan's language to say "MxAccess surfaces resolution failures as error callbacks" and optionally tighten `RuntimeStatusUnknownTimeoutSeconds` downward.
- Or does MxAccess silently drop the advise with no callback at all? If yes, the bogus host stays Unknown until `RuntimeStatusUnknownTimeoutSeconds` elapses, then flips to Stopped via the Unknown-timeout backstop. Tighten the plan's language to say "MxAccess does not surface resolution failures; the Unknown-timeout is the only detection path" and leave the default timeout as-is.
- Document the observed behavior in `docs/MxAccessBridge.md` alongside the probe pattern section so operators know which detection path their deployment relies on.
- Remove the bogus tag and restart before handing over.
## Open questions (phase 2/3 scope — not blocking phase 1)
1. **Dispatch suppression for Stopped hosts** (phase 2 decision gate) — once phase 1 ships with subtree invalidation, observe whether the client-freeze symptom persists. If it does, design dispatch suppression: filter MxAccess per-tag updates before they hit the dispatch queue when the owning host is Stopped. Requires a `tagRef → owning-host GobjectId` map (which `_hostedVariables` already implies, inverted). Trade-off is dropping legitimate updates during brief probe/reality mismatch windows. Decide after real measurement.
2. **Should the probe manager expose transition events?** Synthetic OPC UA event notifier on each host object that fires when `$RuntimeState` transitions. Phase 2 stretch — operators get per-host polling via the dashboard panel today; events would let clients subscribe without polling.
3. **Multi-node Galaxies** — Platform on a remote node shows up in the hierarchy but probes fire through the local MxAccess runtime's node. The probe semantics should still work because MxAccess routes inter-Platform queries transparently, but worth confirming during step 4 if the environment has a multi-node Galaxy.
4. **Is `ScanState` writable?** Some Galaxy system attributes are writable via MxAccess (SetScan method on the object) which would let an operator start/stop a host through the OPC UA bridge. Phase 3 possibility — would require a gating security classification since it's a runtime control action, not a data write.
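If the phase-2 decision gate lands on dispatch suppression (open question 1), the filter is a single lookup against the inverted `tagRef → owning-host` map. A speculative sketch only, not committed design:

```python
def should_dispatch(tag_ref, tag_to_host, host_states):
    """Drop a per-tag update when its owning host is known-Stopped;
    unmapped tags and non-Stopped hosts pass through. The trade-off is
    losing legitimate updates during a brief probe/reality mismatch
    window."""
    host = tag_to_host.get(tag_ref)
    return host is None or host_states.get(host) != "Stopped"
```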