Plan: Galaxy Runtime Status (Platform + AppEngine Stopped/Started Detection)
Context
Today the bridge has no operator-visible signal for "is Galaxy Platform X or AppEngine Y stopped or running?". The dashboard shows:
- MxAccess state — one bit of truth about whether the bridge can talk to the local MxAccess runtime at all.
- Data change dispatch rate — aggregate throughput across every advised attribute.
Neither catches the case an operator actually cares about: a single Platform or AppEngine in a multi-host Galaxy has stopped (operator stopped it from the IDE, the node crashed, network cut, process died, someone toggled OffScan for maintenance). The bridge keeps serving cached values, downstream OPC UA clients see stale reads, and nobody notices until somebody specifically goes looking at the affected equipment.
Galaxy exposes <ObjectName>.ScanState as a boolean system attribute on every deployed $WinPlatform and $AppEngine. true means the object is on scan and executing; anything else means not running. AppEngine state is independently observable through MxAccess (even a stopped Engine's parent Platform can still route the query) so a single probe mechanism covers both host types.
The goal is to advise <ObjectName>.ScanState for every deployed $WinPlatform and $AppEngine, surface per-host runtime state on the dashboard, drive a Degraded health check rule when any is down, and publish the state into the OPC UA address space so external clients can subscribe alongside the value data they already consume.
Design
Probe tag: <ObjectName>.ScanState
ScanState is a boolean system attribute on every deployed $WinPlatform and $AppEngine. The classification rule:
isRunning = status.Success && vtq.Value is bool b && b
Everything else → Stopped. The ItemStatus fields (category, detail) are still captured into LastError for operator diagnostics, but they don't branch the state machine.
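A minimal sketch of that classification rule, with simplified stand-in types (`ProbeStatus` / `ProbeVtq` are assumptions, not the real MxAccess `ItemStatus` / `Vtq` proxies):

```csharp
using System;

// Simplified stand-ins (assumptions) for the MxAccess ItemStatus/Vtq proxies.
public sealed class ProbeStatus { public bool Success; public int Detail; }
public sealed class ProbeVtq { public object? Value; }

public static class ScanStateClassifier
{
    // Running only when the callback succeeded AND the payload is boolean true;
    // false, null, a non-bool payload, or a failed status all classify as Stopped.
    public static bool IsRunning(ProbeStatus status, ProbeVtq vtq) =>
        status.Success && vtq.Value is bool b && b;
}
```

The `is bool b && b` pattern is what makes null and non-bool payloads fall through to Stopped rather than throwing.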
On-change delivery semantic
MxAccess AdviseSupervisory delivers the current value at subscription time and then fires OnDataChange only when the value changes. ScanState is discrete — for a healthy host, the initial advise callback reports true and nothing follows until the state actually changes. There is no periodic heartbeat on the subscription.
Implications:
- No starvation-based Running → Stopped transition. A Running host will legitimately go minutes or hours without an update. The stale-threshold check for the Running state is dropped entirely.
- Error callbacks drive the Running → Stopped transition. MxAccess delivers a data-change callback with ItemStatus[0].success == false and detail == 2 (MX_E_PlatformCommunicationError) when a host becomes unreachable. We trust this signal — it's the broker's job to surface it, and in practice it fires quickly.
- Stale threshold only applies to the Unknown state. If a probe is advised but never receives a first callback (initial resolution failure, host never deployed, MxAccess routing broken), the Unknown → Stopped transition fires after UnknownResolutionTimeoutSeconds. This catches "the probe never came online" without tripping on healthy stable hosts.
Subscription mechanics:
- AdviseSupervisory on <ObjectName>.ScanState. The Supervisory variant avoids user-login requirements for bridge-owned probes — matches the pattern the node manager already uses for its own subscriptions.
- Probes are bridge-owned, not ref-counted against client subscriptions. They live for the lifetime of the address space between rebuilds.
- On rebuild, the probe set is diffed against the new host list and the minimum number of AdviseSupervisory / Unadvise calls are issued (see Sync in the probe manager).
Host discovery
Galaxy Repository already has the data — we just need to surface it to the runtime layer.
hierarchy.sql currently selects every deployed object where template_definition.category_id IN (1, 3, 4, 10, 11, 13, 17, 24, 26). Category 1 = $WinPlatform and 3 = $AppEngine are already in the set. Add template_definition.category_id as a new column on the query so the repository loader can tag each GalaxyObjectInfo with its Galaxy category, and the probe manager can filter for categories 1 and 3.
Schema change: add CategoryId: int to GalaxyObjectInfo, populated from hierarchy.sql. Small schema change, keeps the probe enumeration aligned with whatever the rest of the address space sees at each rebuild.
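A sketch of what GalaxyObjectInfo might look like after the change. Only CategoryId (and HostGobjectId, used later by the hosting map) are specified by this plan; the other fields are assumptions inferred from how the type is referenced elsewhere in the document:

```csharp
// Hypothetical shape of GalaxyObjectInfo after the schema change. GobjectId and
// ObjectName are assumed existing fields; CategoryId and HostGobjectId are new.
public sealed class GalaxyObjectInfo
{
    public int GobjectId { get; set; }
    public string ObjectName { get; set; } = "";   // gobject.tag_name
    public int CategoryId { get; set; }            // template_definition.category_id (new)
    public int HostGobjectId { get; set; }         // gobject.host_gobject_id (new)

    // Categories 1 ($WinPlatform) and 3 ($AppEngine) identify runtime hosts.
    public bool IsRuntimeHost => CategoryId == 1 || CategoryId == 3;
}
```

The IsRuntimeHost helper is illustrative — the probe manager could equally inline the category check in its Sync filter.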
Runtime host state machine
┌─ Unknown ─┐ (initial state; advise issued, no callback yet)
│ │
│ │ ScanState == true
│ ▼
│ Running ◄───────────────────┐
│ │ │
│ │ │ ScanState == true
│ │ ScanState != true │ (recovery callback)
│ │ (false / error / │
│ │ bad status) │
│ ▼ │
│ Stopped ──────────────────────┘
│
└─► Stopped (Unknown → Stopped after UnknownResolutionTimeoutSeconds
if no initial callback ever arrives)
Three states:
- Unknown — probe advised but no callback yet. Initial state after bridge startup or a rebuild until the first OnDataChange for that host. If this state persists longer than UnknownResolutionTimeoutSeconds (default 15s), the manager's periodic check flips it to Stopped — captures the "probe never resolved" case.
- Running — last probe callback delivered ScanState = true with ItemStatus[0].success == true. Stays in this state until a callback changes it. No starvation-based timeout.
- Stopped — any of:
  - Last probe callback had ScanState != true (explicit off-scan).
  - Last probe callback had ItemStatus[0].success == false (unreachable host).
  - Unknown state timed out (initial resolution never completed).
  - Initial AdviseSupervisory reported a ResolutionStatus of invalidReference or noGalaxyRepository.
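The transition rules above can be sketched as two pure functions — one for callbacks, one for the time-based Tick. This is a reduced model, not the probe manager itself (which also updates timestamps, counters, and LastError):

```csharp
using System;

public enum GalaxyRuntimeState { Unknown, Running, Stopped }

public static class RuntimeTransitions
{
    // Any callback resolves the state: good boolean true → Running, else Stopped.
    public static GalaxyRuntimeState OnCallback(bool statusSuccess, object? value) =>
        statusSuccess && value is bool b && b
            ? GalaxyRuntimeState.Running
            : GalaxyRuntimeState.Stopped;

    // Time-based rule: only Unknown entries can time out. Running never starves
    // (on-change-only delivery); Stopped only recovers via a callback.
    public static GalaxyRuntimeState OnTick(
        GalaxyRuntimeState current, TimeSpan sinceAdvise, TimeSpan unknownTimeout) =>
        current == GalaxyRuntimeState.Unknown && sinceAdvise > unknownTimeout
            ? GalaxyRuntimeState.Stopped
            : current;
}
```

Keeping the transitions pure like this is also what makes the fake-clock unit tests in the test plan straightforward.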
MxAccess transport down → force Unknown
When the local MxAccess client is not connected (IMxAccessClient.State != ConnectionState.Connected), every probe's transport is effectively offline regardless of the underlying host state. The probe manager forces every entry to Unknown in its snapshot output while MxAccess is disconnected. Rationale:
- Telling the operator that all hosts are Stopped is misleading — the actual problem is the local transport, which the existing Connection panel already surfaces prominently.
- Unknown is the right semantic: we don't know the host state because we can't see them right now.
- When MxAccess reconnects, the broker re-delivers probe subscriptions and the state machine resumes normally.
Implementation: GetSnapshot() checks _client.State and rewrites State = Unknown (leaving the underlying _stateByProbe map intact for when the transport comes back). HealthCheckService already rolls to Unhealthy via the MxAccess-not-connected rule before the runtime status rule fires, so this doesn't create a confusing health-rollup story.
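A sketch of the gating rule, with the connection enum and map shape simplified (the real snapshot returns GalaxyRuntimeStatus entries, not bare states):

```csharp
using System.Collections.Generic;
using System.Linq;

public enum ConnectionState { Connected, Disconnected }
public enum GalaxyRuntimeState { Unknown, Running, Stopped }

public static class SnapshotGate
{
    // While the transport is down, every reported state is Unknown — but the
    // underlying map is left untouched, so the real state resurfaces as soon
    // as MxAccess reconnects.
    public static IReadOnlyDictionary<string, GalaxyRuntimeState> GetSnapshot(
        ConnectionState transport,
        IReadOnlyDictionary<string, GalaxyRuntimeState> stateByProbe) =>
        transport == ConnectionState.Connected
            ? stateByProbe
            : stateByProbe.ToDictionary(kv => kv.Key, _ => GalaxyRuntimeState.Unknown);
}
```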
New types
All in src/ZB.MOM.WW.LmxOpcUa.Host/Domain/:
public enum GalaxyRuntimeState { Unknown, Running, Stopped }
public sealed class GalaxyRuntimeStatus
{
public string ObjectName { get; set; } = ""; // gobject.tag_name
public int GobjectId { get; set; }
public string Kind { get; set; } = ""; // "$WinPlatform" or "$AppEngine"
public GalaxyRuntimeState State { get; set; }
public DateTime? LastStateCallbackTime { get; set; } // UTC of most recent probe callback
public DateTime? LastStateChangeTime { get; set; } // UTC of last Running↔Stopped transition
public bool? LastScanState { get; set; } // last ScanState value; null before first update
public string? LastError { get; set; } // MxStatus.detail description when !success
public long GoodUpdateCount { get; set; } // callbacks where ScanState == true
public long FailureCount { get; set; } // callbacks where ScanState != true or !success
}
Why two timestamps (LastStateCallbackTime vs LastStateChangeTime): on-change-only delivery means they'll match for most entries, but a callback that arrives with a different error detail while the host is already Stopped updates the callback time and LastError without touching LastStateChangeTime. The dashboard's "Since" column (see Dashboard panel) uses LastStateChangeTime so operators see "Stopped since 08:17:02Z" regardless of how many intervening error callbacks have refined the diagnostic detail.
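The two-timestamp rule reduces to a few lines — every callback bumps the callback time, but the change time only moves on an actual transition (sketch; field names match GalaxyRuntimeStatus, the update method is illustrative):

```csharp
using System;

public sealed class TimestampPair
{
    public DateTime? LastStateCallbackTime;
    public DateTime? LastStateChangeTime;

    public void OnCallback(bool stateChanged, DateTime utcNow)
    {
        LastStateCallbackTime = utcNow;          // every callback
        if (stateChanged) LastStateChangeTime = utcNow; // transitions only
    }
}
```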
Naming note: "Galaxy runtime" is the generic term covering both $WinPlatform and $AppEngine — the dashboard and config use this neutral phrasing so the feature doesn't look like it only covers Platforms.
Probe manager
New class MxAccess/GalaxyRuntimeProbeManager.cs, owned by LmxNodeManager:
internal sealed class GalaxyRuntimeProbeManager : IDisposable
{
public GalaxyRuntimeProbeManager(
IMxAccessClient client,
int unknownResolutionTimeoutSeconds,
Action<int> onHostStopped, // invoked with GobjectId on Running → Stopped
Action<int> onHostRunning); // invoked with GobjectId on Stopped → Running
// Called after address-space build / rebuild. Adds probes for new hosts,
// removes them for hosts no longer in the hierarchy. Idempotent.
// Caller supplies the full hierarchy; the manager filters for category_id
// 1 ($WinPlatform) and 3 ($AppEngine).
// Blocks on sequential AddItem/AdviseSupervisory SDK calls — see wiring notes.
public void Sync(IReadOnlyList<GalaxyObjectInfo> hierarchy);
// Invoked by LmxNodeManager's OnTagValueChanged callback when the address
// matches a probe tag reference. Returns true when the event was consumed
// by a probe so the data-change dispatch queue can skip it.
public bool HandleProbeUpdate(string tagRef, Vtq vtq, MxStatusProxy status);
// Called from the MxAccess connection monitor callback (MonitorIntervalSeconds
// cadence) to advance time-based transitions:
// 1. Unknown → Stopped when UnknownResolutionTimeoutSeconds has elapsed.
// 2. Nothing for Running — no starvation check (on-change-only semantics).
public void Tick();
// Snapshot respects MxAccess transport state — returns all Unknown when
// the transport is disconnected, regardless of the underlying per-host state.
public IReadOnlyList<GalaxyRuntimeStatus> GetSnapshot();
public int ActiveProbeCount { get; }
// Unadvise + RemoveItem on every active probe. Called from LmxNodeManager.Dispose
// before the MxAccess client teardown. Idempotent — safe to call multiple times.
public void Dispose();
}
The two Action<int> callbacks are how the probe manager triggers the subtree quality invalidation documented below — the owning LmxNodeManager passes references to its own MarkHostVariablesBadQuality and ClearHostVariablesBadQuality methods at construction time. The probe manager calls them synchronously on state transitions, from whichever thread delivered the probe callback (the MxAccess dispatch thread). The node manager methods acquire their own lock internally — the probe manager does not hold its own lock across the callback invocation to avoid inverted-lock-order deadlocks.
Internals:
- Dictionary<string, GalaxyRuntimeStatus> keyed by probe tag reference (<ObjectName>.ScanState).
- Reverse Dictionary<int, string> from GobjectId to probe tag for Sync to diff against a fresh hierarchy.
- One lock guarding both maps. Operations are microsecond-scale.
- Sync filters hierarchy for CategoryId == 1 || CategoryId == 3, then compares the filtered set against the active probe set:
  - Added hosts → client.AddItem + AdviseSupervisory; insert GalaxyRuntimeStatus { State = Unknown }.
  - Removed hosts → Unadvise + RemoveItem; drop entry.
  - Unchanged hosts → leave in place, preserving their state machine across the rebuild.
- HandleProbeUpdate is the per-callback entry point. It evaluates the isRunning predicate, updates LastUpdateTime, transitions state, logs at Information level on state changes only (not every tick), and stores the ItemStatus detail into LastError on failure.
- Tick runs at the existing dispatch thread cadence. For each Unknown entry, checks LastUpdateTime == null && (now - _createdAt[id]) > unknownResolutionTimeoutSeconds and flips to Stopped if so. Healthy Running entries are not touched.
- GetSnapshot short-circuits to "all Unknown" when _client.State != ConnectionState.Connected.
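The Sync diff itself is a set comparison. A sketch under assumed shapes (HostInfo and the tuple return are illustrative; the real Sync also issues the AddItem / AdviseSupervisory / Unadvise calls for each diff entry):

```csharp
using System.Collections.Generic;
using System.Linq;

public sealed record HostInfo(int GobjectId, string ObjectName, int CategoryId);

public static class ProbeSync
{
    // Compute which probe tags to advise and which to unadvise, given a fresh
    // hierarchy and the currently active probe set. Unchanged probes appear in
    // neither list, preserving their state machine across the rebuild.
    public static (List<string> ToAdvise, List<string> ToUnadvise) Diff(
        IEnumerable<HostInfo> hierarchy, ISet<string> activeProbes)
    {
        // Only $WinPlatform (category 1) and $AppEngine (category 3) get probes.
        var wanted = hierarchy
            .Where(h => h.CategoryId == 1 || h.CategoryId == 3)
            .Select(h => $"{h.ObjectName}.ScanState")
            .ToHashSet();
        return (wanted.Except(activeProbes).ToList(),
                activeProbes.Except(wanted).ToList());
    }
}
```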
LmxNodeManager wiring
LmxNodeManager constructs a GalaxyRuntimeProbeManager when MxAccessConfiguration.RuntimeStatusProbesEnabled is true. In BuildAddressSpace and the subtree rebuild path, after the existing loops complete, call _probeManager.Sync(hierarchy). Sync blocks while it issues AddItem + AdviseSupervisory sequentially for each new host — for a galaxy with ~50 runtime hosts this adds roughly 500ms–1s to the address-space build on top of the existing several-second build time. Kept synchronous deliberately: the simpler correctness model is worth the startup hit, and ActiveProbeCount is guaranteed to be accurate the moment the build completes.
Route the existing OnTagValueChanged callback through _probeManager.HandleProbeUpdate first — if it returns true, the event was consumed by a bridge-owned probe and the dispatch queue skips the normal variable-update path.
Tick() cadence — piggyback on the MxAccess connection monitor. The dispatch thread wakes on _dataChangeSignal, which only fires when tag values change. In the degenerate case where no probe ever resolves (MxAccess routing broken, bad probe tag, etc.), the dispatch loop never wakes and the Unknown → Stopped timeout would never fire. To avoid adding a new thread or timer, hook _probeManager.Tick() into the callback path that the existing MxAccess.MonitorIntervalSeconds watcher already runs — the same cadence that drives the connection-level probe-tag staleness check. A single call site covers both.
If the monitor is not accessible from LmxNodeManager during implementation (it lives at a different layer in the MxAccess client), fall back to Option A from the design discussion: change the dispatch loop's WaitOne() call to a timed WaitOne(500ms) so it wakes periodically regardless of data changes. Single-line change, but requires verifying no assumptions in the existing loop break from the periodic wake-ups.
Service shutdown — explicit probe cleanup
The probe manager's Sync handles Unadvise on diff removal when a host leaves the hierarchy. Service shutdown is a separate path that needs explicit handling: when LmxNodeManager is disposed, the active probe subscriptions must be torn down before the MxAccess client is closed — otherwise we rely on the client's broader shutdown to cover supervisory subscriptions, which depends on disposal ordering and may or may not clean up cleanly.
GalaxyRuntimeProbeManager implements IDisposable. Dispose() walks the active probe map, calls Unadvise + RemoveItem on each entry, and clears the maps. Idempotent — calling it twice is a no-op. LmxNodeManager.Dispose calls _probeManager?.Dispose() before the existing teardown steps that touch the MxAccess client.
Subtree quality invalidation on Stopped transition
Operational context for this section — observed behavior from production: when an AppEngine or Platform goes OffScan, MxAccess fans out per-tag OnDataChange callbacks for every advised tag hosted by that runtime object, each carrying bad quality. Two symptoms result:
- OPC UA client freeze — the dispatch handler processes the flood in one cycle, pushes thousands of OPC UA value-change notifications to subscribed clients in one Publish response, and the client visibly stalls handling the volume.
- Incomplete quality flip — some OPC UA variables retain their last good value with Good quality even after the host is down, either because the dispatch queue drops updates, or because some tags aren't in the subscribed set at the moment of the flood, or because of an edge case in the quality mapper. Operationally: clients read plausible-looking stale data from a dead host.
The probe-driven Stopped transition is the authoritative, on-time signal we control. On that transition, the bridge proactively walks every OPC UA variable node hosted by the Stopped host and sets its StatusCode to BadOutOfService. This is independent of whether MxAccess also delivers per-tag bad-quality updates — the two signals are belt-and-suspenders for correctness. Even if the dispatch queue drops half the per-tag updates, the subtree walk guarantees the end state is uniformly Bad for every variable under the dead host.
On the recovery Stopped → Running transition, the bridge walks the same set and clears the override — sets StatusCode back to Good so the cached values are visible again. Subsequent real MxAccess updates arrive on-change and overwrite value + status as normal. Trade-off: for a host that's been down a long time, some tags may show Good quality on a stale cached value for a short window after recovery, until MxAccess delivers the next on-change update for that tag. This matches existing bridge behavior for any slow-changing attribute and is preferable to leaving variables stuck at BadOutOfService indefinitely waiting for an update that may never come.
What's included in the "subtree" — the set of variables whose owning Galaxy object is hosted (transitively) by the Stopped host. For AppEngines, this is every variable whose object's host_gobject_id chain reaches the Engine. For Platforms, it's every variable on every Engine hosted by the Platform, plus every object hosted directly on the Platform. This is not browse-tree containment — an object can live in one Area (browse parent) but be hosted by an Engine on a different Platform (runtime parent), and the host relationship is what determines the fate of its live data.
Implementation plan for the host-to-variables mapping:
- Extend hierarchy.sql to return gobject.host_gobject_id as a new column if it exists. Verify during implementation — if the column is not present on this Galaxy schema version, fall back to contained_by_gobject_id as an approximation (less precise for edge cases where browse containment differs from runtime hosting, but sufficient for typical Galaxy topologies).
- Extend GalaxyObjectInfo with HostGobjectId: int.
- During BuildAddressSpace, as each variable is created, compute its owning host by walking HostGobjectId up the chain until hitting a $WinPlatform or $AppEngine (or reaching the root). Append the variable to a Dictionary<int, List<BaseDataVariableState>> keyed by the host's GobjectId.
- On BuildSubtree (incremental rebuild), the same logic runs for newly added variables. Variables that leave the hierarchy are removed from the map. The map lives next to _nodeMap on LmxNodeManager.
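A sketch of the owning-host walk. ObjInfo, the lookup-dictionary shape, and the "0 means root" convention are assumptions about this Galaxy schema, not confirmed details:

```csharp
using System.Collections.Generic;

public sealed record ObjInfo(int GobjectId, int HostGobjectId, int CategoryId);

public static class HostResolver
{
    // Follow HostGobjectId upward until a $WinPlatform (1) or $AppEngine (3)
    // is reached. Returns null when the chain hits the root (assumed id 0)
    // or a cycle without finding a host.
    public static int? OwningHost(int gobjectId, IReadOnlyDictionary<int, ObjInfo> byId)
    {
        var current = gobjectId;
        while (byId.TryGetValue(current, out var obj))
        {
            if (obj.CategoryId == 1 || obj.CategoryId == 3) return obj.GobjectId;
            if (obj.HostGobjectId == 0 || obj.HostGobjectId == current) return null; // root / cycle guard
            current = obj.HostGobjectId;
        }
        return null;
    }
}
```

Note the walk starts at the variable's owning object, so a variable directly on an Engine or Platform resolves to that host itself.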
New public methods on LmxNodeManager:
// Called by probe manager on Running → Stopped. Walks every variable hosted by
// gobjectId and sets its StatusCode to BadOutOfService. Safe to call multiple times.
// Does nothing when gobjectId has no hosted variables.
public void MarkHostVariablesBadQuality(int gobjectId);
// Called by probe manager on Stopped → Running. Walks every variable hosted by
// gobjectId and resets StatusCode to Good. Values are left at whatever the last
// MxAccess-delivered value was; subsequent on-change updates will refresh them.
public void ClearHostVariablesBadQuality(int gobjectId);
Both methods acquire the standard node manager Lock, iterate the hosted list, set StatusCode + call ClearChangeMasks(ctx, false) per variable, and release the lock. The OPC UA subscription publisher picks up the change masks on its next tick and pushes notifications to subscribed clients — so operators see a single uniform quality flip per variable rather than two (one from our walk, one from the MxAccess per-tag delivery).
Dispatch suppression — deferred pending observation
The subtree invalidation above addresses the data-correctness symptom (some variables not flipping to bad quality). The client freeze symptom is a separate problem: even if the quality state is correct, the bridge is still processing a thundering herd of per-tag bad-quality MxAccess callbacks through the dispatch queue, which in turn push thousands of OPC UA value-change notifications to subscribed clients.
A stronger fix would be dispatch suppression: once the probe manager transitions a host to Stopped, filter out incoming MxAccess per-tag updates for any tag owned by that host before they hit the dispatch queue. The subtree walk has already captured the state; the redundant per-tag updates are pure noise.
This is deliberately NOT part of phase 1. Reasons:
- The subtree walk may make the freeze disappear entirely. If the dispatch queue processes the flood but the notifications it pushes are now duplicates of change masks the walk already set, the SDK may coalesce them into a single publish cycle and the client sees one notification batch rather than thousands. We want to observe whether this is the case before building suppression.
- If the freeze persists after subtree invalidation ships, we have a real measurement of the residual problem to inform the suppression design (which hosts, which tags, how much batching, whether to also coalesce at the OPC UA publisher level).
- The suppression path has a subtle failure mode: if the probe is briefly wrong (race where the probe says Stopped but the host actually recovered), we'd drop legitimate updates for a few seconds until the probe catches up. For an on-change-only probe this is bounded, but the plan should justify the trade-off against real observed data.
Phase 2 decision gate: after shipping phase 1 and observing the post-subtree-walk behavior against a real AppEngine stop, decide whether dispatch suppression is still needed and design it against the real measurement.
OPC UA address space exposure
Per-host status should be readable by OPC UA clients, not just the dashboard. Add child variable nodes under each $WinPlatform / $AppEngine object node in the address space. All bridge-synthetic nodes use a $ prefix so they can never collide with user-defined attributes on extended templates:
- <Object>.$RuntimeState (String) — Unknown / Running / Stopped.
- <Object>.$LastCallbackTime (DateTime) — most recent probe callback regardless of transition.
- <Object>.$LastScanState (Boolean) — last ScanState value received; null before first update.
- <Object>.$LastStateChangeTime (DateTime) — most recent Running↔Stopped transition, backs the dashboard "Since" column.
- <Object>.$FailureCount (Int64)
- <Object>.$LastError (String) — last non-success MxStatus detail, empty string when null.
These read from the probe manager's snapshot (bridge-synthetic, no MxAccess round-trip) and are updated via ChangeBits.Value signalling when the state transitions. Read-only.
Note: the underlying <ObjectName>.ScanState Galaxy attribute will already appear in the address space via the normal hierarchy-build path, so downstream clients will see both the raw attribute (ns=3;s=DevPlatform.ScanState) and the synthesized state rollup (ns=3;s=DevPlatform.$RuntimeState). Intentional — the raw attribute is the ground truth, the rollup adds state-change timestamps and the Unknown/Running/Stopped trichotomy.
Namespace placement: under the existing host object node in the Galaxy namespace (ns=3), browseable at DevPlatform/$RuntimeState etc. No new namespace needed.
Dashboard
Runtime Status panel
New RuntimeStatusInfo class on StatusData:
public class RuntimeStatusInfo
{
public int Total { get; set; }
public int RunningCount { get; set; }
public int StoppedCount { get; set; }
public int UnknownCount { get; set; }
public List<GalaxyRuntimeStatus> Hosts { get; set; } = new();
}
Populated in StatusReportService via a new LmxNodeManager.RuntimeStatuses accessor. Renders between the Galaxy Info panel and the Historian panel.
Panel color:
- Green — all hosts Running.
- Yellow — at least one Unknown, zero Stopped.
- Red — at least one Stopped.
- Gray — MxAccess disconnected (all hosts Unknown; the Connection panel is the primary signal).
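The color rules collapse to a small precedence function (sketch; the counts map directly onto RuntimeStatusInfo, and the Gray branch mirrors the transport gating):

```csharp
public enum PanelColor { Green, Yellow, Red, Gray }

public static class RuntimePanel
{
    // Precedence: transport down beats everything, then any Stopped, then any
    // Unknown, then all-Running.
    public static PanelColor Color(bool mxAccessConnected, int stopped, int unknown)
    {
        if (!mxAccessConnected) return PanelColor.Gray; // all hosts Unknown; Connection panel is primary
        if (stopped > 0) return PanelColor.Red;
        if (unknown > 0) return PanelColor.Yellow;
        return PanelColor.Green;
    }
}
```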
HTML layout:
┌ Galaxy Runtime ───────────────────────────────────────────────────────┐
│ 5 of 6 hosts running (3 platforms, 3 engines) │
│ ┌─────────────────┬──────────────┬─────────┬──────────────────────┐ │
│ │ Name │ Kind │ State │ Since │ │
│ ├─────────────────┼──────────────┼─────────┼──────────────────────┤ │
│ │ DevPlatform │ $WinPlatform │ Running │ 2026-04-13T08:15:02Z │ │
│ │ DevAppEngine │ $AppEngine │ Running │ 2026-04-13T08:15:04Z │ │
│ │ PlatformA │ $WinPlatform │ Running │ 2026-04-13T08:15:03Z │ │
│ │ EngineA_1 │ $AppEngine │ Running │ 2026-04-13T08:15:05Z │ │
│ │ EngineA_2 │ $AppEngine │ Stopped │ 2026-04-13T14:28:03Z │ │
│ │ PlatformB │ $WinPlatform │ Running │ 2026-04-13T08:15:04Z │ │
│ └─────────────────┴──────────────┴─────────┴──────────────────────┘ │
└────────────────────────────────────────────────────────────────────────┘
The "Since" column is backed by LastStateChangeTime, and its meaning depends on the row's current state: "Running since X" reads as "has been on scan since X", "Stopped since X" reads as "has been off scan since X". For Unknown rows, display "Advised since X" instead (the probe was registered at X but has not yet received its first callback).
Subscriptions panel — break out bridge probe count
The existing Subscriptions panel shows Active: N — the total advised-item count from IMxAccessClient.ActiveSubscriptionCount. After this ships, that number will include the bridge-owned runtime probes (one per Platform + one per AppEngine), which would look like a silent jump to operators watching for capacity planning purposes.
Fix: expose a new ActiveProbeSubscriptionCount property on LmxNodeManager (wired from GalaxyRuntimeProbeManager.ActiveProbeCount) and render as a second line on the Subscriptions panel:
┌ Subscriptions ──────────────────────────────┐
│ Active: 1247 │
│ Probes: 6 (bridge-owned runtime status) │
└──────────────────────────────────────────────┘
The Active total continues to include probes (no subtraction) so the count still matches whatever MxAccess actually holds — the breakout line tells operators which slice is bridge-internal.
HealthCheckService rule
New rule in HealthCheckService.CheckHealth:
Rule 2e: Any Galaxy runtime host in Stopped state → Degraded
- Yellow panel
- Message: "N of M hosts stopped: Host1, Host2"
Rationale: the bridge is still able to talk to the local MxAccess runtime and serve cached values for the hosts that are up, so this is Degraded rather than Unhealthy. A stopped host is recoverable — the operator fixes it and the probe automatically transitions back to Running.
Rule ordering matters: this rule checks after the MxAccess-connected check (Rule 1), so when MxAccess is disconnected the service is Unhealthy on Rule 1 and the runtime-host rule never runs — avoids the confusing "MxAccess down AND Galaxy runtime degraded" double message.
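A sketch of Rule 2e's evaluation and message building, against a simplified host snapshot (HostState and the null-means-healthy return are illustrative; the real rule runs inside HealthCheckService.CheckHealth):

```csharp
using System.Collections.Generic;
using System.Linq;

public sealed record HostState(string Name, bool Stopped);

public static class Rule2e
{
    // Returns null when no host is stopped; otherwise the Degraded message
    // in the "N of M hosts stopped: Host1, Host2" shape from the plan.
    public static string? Evaluate(IReadOnlyList<HostState> hosts)
    {
        var stopped = hosts.Where(h => h.Stopped).Select(h => h.Name).ToList();
        return stopped.Count == 0
            ? null
            : $"{stopped.Count} of {hosts.Count} hosts stopped: {string.Join(", ", stopped)}";
    }
}
```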
Configuration
New fields on MxAccessConfiguration (not a new config class — this is a runtime concern of the MxAccess bridge):
public class MxAccessConfiguration
{
// ...existing fields...
/// <summary>
/// Enables per-host runtime status probing via AdviseSupervisory on
/// <c><ObjectName>.ScanState</c> for every deployed $WinPlatform
/// and $AppEngine. Default enabled when a deployed ArchestrA Platform
/// is present. Set false for bridges that don't need multi-host
/// visibility and want to minimize subscription count.
/// </summary>
public bool RuntimeStatusProbesEnabled { get; set; } = true;
/// <summary>
/// Maximum seconds to wait for the initial probe callback before marking
/// an Unknown host as Stopped. Only applies to the Unknown → Stopped
/// transition; Running hosts do not time out (ScanState is delivered
/// on-change only, so a stable healthy host may go indefinitely without
/// a callback). Default 15s.
/// </summary>
public int RuntimeStatusUnknownTimeoutSeconds { get; set; } = 15;
}
No new top-level config section. Validator emits a warning if the timeout is shorter than 5 seconds (below the reasonable floor for MxAccess initial-resolution latency).
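The validator rule is a warn-not-fail floor check; a sketch (the warning text and method shape are illustrative, not the existing ConfigurationValidator API):

```csharp
public static class RuntimeStatusConfigValidator
{
    public const int TimeoutFloorSeconds = 5;

    // Returns a warning string when the configured timeout is below the floor
    // for MxAccess initial-resolution latency; null when the value is fine.
    public static string? Validate(int runtimeStatusUnknownTimeoutSeconds) =>
        runtimeStatusUnknownTimeoutSeconds < TimeoutFloorSeconds
            ? $"RuntimeStatusUnknownTimeoutSeconds={runtimeStatusUnknownTimeoutSeconds} is below the " +
              $"{TimeoutFloorSeconds}s floor; probes may flap Unknown → Stopped before resolution completes."
            : null;
}
```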
Critical Files
Modified
- src/ZB.MOM.WW.LmxOpcUa.Host/Domain/GalaxyObjectInfo.cs — add CategoryId: int and HostGobjectId: int
- src/ZB.MOM.WW.LmxOpcUa.Host/GalaxyRepository/GalaxyRepositoryService.cs — include template_definition.category_id and gobject.host_gobject_id in HierarchySql and the reader (falling back to contained_by_gobject_id if the host column is unavailable)
- gr/queries/hierarchy.sql — same column additions (documentation query)
- src/ZB.MOM.WW.LmxOpcUa.Host/Configuration/MxAccessConfiguration.cs — add RuntimeStatusProbesEnabled + RuntimeStatusUnknownTimeoutSeconds
- src/ZB.MOM.WW.LmxOpcUa.Host/OpcUa/LmxNodeManager.cs — construct probe manager, wire OnTagValueChanged and the MxAccess monitor callback, build _hostedVariables: Dictionary<int, List<BaseDataVariableState>> during address-space construction, expose RuntimeStatuses / ActiveProbeSubscriptionCount / MarkHostVariablesBadQuality / ClearHostVariablesBadQuality
- src/ZB.MOM.WW.LmxOpcUa.Host/Status/StatusData.cs — add RuntimeStatusInfo; add ProbeSubscriptionCount field on SubscriptionInfo
- src/ZB.MOM.WW.LmxOpcUa.Host/Status/StatusReportService.cs — populate from node manager, render Runtime Status panel + Probes line
- src/ZB.MOM.WW.LmxOpcUa.Host/Status/HealthCheckService.cs — new Rule 2e (after Rule 1 to avoid double-messaging when MxAccess is down)
- src/ZB.MOM.WW.LmxOpcUa.Host/appsettings.json — new MxAccess fields with defaults
- src/ZB.MOM.WW.LmxOpcUa.Host/Configuration/ConfigurationValidator.cs — timeout floor warning
- docs/MxAccessBridge.md — document the probe pattern and on-change semantics
- docs/StatusDashboard.md — add RuntimeStatusInfo field table and Probes line
- docs/Configuration.md — add the two new MxAccess fields
New
- src/ZB.MOM.WW.LmxOpcUa.Host/Domain/GalaxyRuntimeStatus.cs
- src/ZB.MOM.WW.LmxOpcUa.Host/Domain/GalaxyRuntimeState.cs
- src/ZB.MOM.WW.LmxOpcUa.Host/MxAccess/GalaxyRuntimeProbeManager.cs
- tests/ZB.MOM.WW.LmxOpcUa.Tests/MxAccess/GalaxyRuntimeProbeManagerTests.cs
Execution order
- DTO + enum —
GalaxyRuntimeState,GalaxyRuntimeStatus. - Hierarchy schema — add
CategoryIdtoGalaxyObjectInfo, extendHierarchySqlto selecttd.category_idas a new column, updateGalaxyRepositoryServicereader. - Config — add the two new
MxAccessConfigurationfields and validator rule. - Probe manager class + unit tests (TDD) — write
GalaxyRuntimeProbeManagerTests.csfirst. FakeIMxAccessClientwith scriptedOnTagValueChangedinvocations, configurableState, and a fake clock. Exercise the full matrix in the test plan below. - Ship tests green before touching node manager.
- Host-to-variables mapping in node manager — add `_hostedVariables: Dictionary<int, List<BaseDataVariableState>>`, populated during `BuildAddressSpace`. For each variable node, walk its owning object's `HostGobjectId` chain up to the nearest `$WinPlatform` or `$AppEngine` and append to that host's list. On rebuild (`BuildSubtree`), incrementally maintain the map. Expose public `MarkHostVariablesBadQuality(int gobjectId)` and `ClearHostVariablesBadQuality(int gobjectId)` methods that take the node manager `Lock`, iterate the hosted list, set/clear `StatusCode`, and call `ClearChangeMasks(ctx, false)` per variable.
- Node manager wiring — construct `GalaxyRuntimeProbeManager`, pass `MarkHostVariablesBadQuality` / `ClearHostVariablesBadQuality` as its `onHostStopped` / `onHostRunning` callbacks, call `Sync` after `BuildAddressSpace` / rebuild, route `OnTagValueChanged` through `HandleProbeUpdate`, and hook `Tick()` into the MxAccess connection-monitor callback path (fall back to a timed `WaitOne(500ms)` on the dispatch loop if the monitor isn't reachable from the node manager). Add `RuntimeStatuses` and `ActiveProbeSubscriptionCount` accessors. Call `_probeManager?.Dispose()` from `LmxNodeManager.Dispose` before the existing MxAccess client teardown steps.
- OPC UA synthetic nodes — under each `$WinPlatform` and `$AppEngine` node in `BuildAddressSpace`, add the six `$`-prefixed variables backed by lambdas that read from the probe manager snapshot.
- Dashboard — `RuntimeStatusInfo` on `StatusData`, `BuildRuntimeStatusInfo` in `StatusReportService`, render the Runtime Status panel, add a Probes line to the Subscriptions panel. Status tests asserting both.
- Health check — new Rule 2e with a test: Degraded when any host is stopped; the message names the stopped hosts.
- Integration tests — `LmxNodeManagerBuildTests` additions with a fake repository containing mixed `$WinPlatform` and `$AppEngine` hierarchy entries; verify `Sync` is called, synthetic nodes are created on both host types, the `_hostedVariables` map is populated, and `MarkHostVariablesBadQuality` / `ClearHostVariablesBadQuality` flip status codes on the correct subset.
- Docs — `MxAccessBridge.md`, `StatusDashboard.md`, `Configuration.md`.
- Deploy — backup, deploy both instances, verify via dashboard.
- Live verification — see the Verification section below.
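The host-to-variables map and the mark/clear pair can be sketched roughly as below. This is a minimal illustration with stand-in types — `VariableNode` and the numeric status constants replace the OPC UA SDK's `BaseDataVariableState` and `StatusCode`, and the `ClearChangeMasks` notification step is elided — so only the map-and-sweep logic mirrors the plan:

```csharp
using System.Collections.Generic;

// Stand-in for BaseDataVariableState; only the status field matters here.
public sealed class VariableNode
{
    public const uint Good = 0x00000000;
    public const uint BadOutOfService = 0x808D0000; // OPC UA Bad_OutOfService
    public uint StatusCode = Good;
}

public sealed class HostedVariableMap
{
    private readonly object _lock = new();  // stands in for the node manager Lock
    private readonly Dictionary<int, List<VariableNode>> _hostedVariables = new();

    // Called from BuildAddressSpace: append each variable to its owning host's list.
    public void Register(int hostGobjectId, VariableNode variable)
    {
        lock (_lock)
        {
            if (!_hostedVariables.TryGetValue(hostGobjectId, out var list))
                _hostedVariables[hostGobjectId] = list = new List<VariableNode>();
            list.Add(variable);
        }
    }

    // onHostStopped target: force BadOutOfService on every hosted variable.
    public void MarkHostVariablesBadQuality(int hostGobjectId)
        => SetStatus(hostGobjectId, VariableNode.BadOutOfService);

    // onHostRunning target: restore Good; fresh MxAccess updates overwrite anyway.
    public void ClearHostVariablesBadQuality(int hostGobjectId)
        => SetStatus(hostGobjectId, VariableNode.Good);

    private void SetStatus(int hostGobjectId, uint statusCode)
    {
        lock (_lock)
        {
            if (!_hostedVariables.TryGetValue(hostGobjectId, out var list))
                return;                     // unknown host -> no-op, no crash
            foreach (var v in list)
                v.StatusCode = statusCode;  // real code also raises change notifications
        }
    }
}
```

Note the deliberate design choice the sketch preserves: marking is a plain `StatusCode` overwrite with no override layer, so a subsequent normal dispatch-path update naturally wins.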
Test plan
`GalaxyRuntimeProbeManagerTests.cs` — unit tests with fake client + fake clock
State transitions
- Fresh manager → empty snapshot.
- `Sync` with one Platform + one Engine → snapshot contains two entries in `Unknown`, `Kind` set correctly.
- First `ScanState = true` update → Unknown → Running, `LastUpdateTime` and `LastScanState = true` set, `GoodUpdateCount == 1`.
- Second `ScanState = true` update → still Running, counter increments.
- `ScanState = false` update → Running → Stopped, `LastScanState = false`, `FailureCount == 1`.
- `ItemStatus[0].success = false, detail = 2` update → Running → Stopped, `LastError` contains `MX_E_PlatformCommunicationError`.
- Null value delivered → Running → Stopped defensively, `LastError` explains the null-value rejection.
- Recovery `ScanState = true` after Stopped → Stopped → Running, `LastStateChangeTime` updated, `LastError` cleared.
- Platform and AppEngine transitions behave identically (parameterized test).
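The transitions above all reduce to the one classification rule from the Design section, restated here as code. A sketch only — `statusSuccess` stands in for the `ItemStatus.success` flag on the delivered VTQ, and the surrounding per-host entry bookkeeping is elided:

```csharp
public enum HostState { Unknown, Running, Stopped }

public static class ProbeClassifier
{
    // Only a successful callback carrying a boolean true means Running; every
    // other shape (error status, null value, non-bool payload, false) is Stopped.
    public static HostState Classify(bool statusSuccess, object? value)
        => statusSuccess && value is bool b && b ? HostState.Running : HostState.Stopped;
}
```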
Unknown resolution timeout
- No callback + clock advances past timeout → Unknown → Stopped.
- Good update just before timeout → Unknown → Running (no subsequent Stopped).
- Good update after timeout already flipped Unknown → Stopped → Stopped → Running (recovery path still works).
- `Tick` on a Running entry with no recent update → still Running (no starvation check — this is the critical on-change-semantic guarantee).
MxAccess transport gating
- Client `State = Disconnected` → `GetSnapshot` returns all entries with `State = Unknown` regardless of underlying state.
- Client flips Connected → Disconnected → underlying state preserved internally; snapshot reports Unknown.
- Client flips Disconnected → Connected → snapshot reflects the underlying state again.
- Incoming `HandleProbeUpdate` while the client is Disconnected → still updates the underlying state machine (so the snapshot is correct when transport comes back).
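The gating these tests describe can be sketched as follows. Types are hypothetical minimal stand-ins — the real manager keys by GobjectId and carries richer per-host entries — but the invariant is the one under test: the underlying state machine keeps updating regardless of transport, while the snapshot masks everything to Unknown whenever the client is disconnected:

```csharp
using System.Collections.Generic;
using System.Linq;

public enum HostState { Unknown, Running, Stopped }

public sealed class GatedSnapshot
{
    private readonly Dictionary<int, HostState> _underlying = new();

    public bool ClientConnected { get; set; }

    // HandleProbeUpdate path: always applied, even while disconnected.
    public void SetUnderlying(int gobjectId, HostState state)
        => _underlying[gobjectId] = state;

    // Snapshot path: masked to Unknown while the transport is down.
    public IReadOnlyDictionary<int, HostState> GetSnapshot()
        => ClientConnected
            ? new Dictionary<int, HostState>(_underlying)
            : _underlying.ToDictionary(kv => kv.Key, kv => HostState.Unknown);
}
```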
Sync diff behavior
- Sync with new Platform → Advise called once, counter = 1.
- Sync with new Engine → Advise called once, counter = 1.
- Sync twice with same hosts → Advise called once total (idempotent on unchanged entries).
- Sync then Sync with a Platform removed → Unadvise called, snapshot loses entry.
- Sync with different host set → Advise for new, Unadvise for old, unchanged preserved.
- Sync filters out non-runtime categories (areas, user objects) — hierarchy with 10 mixed categories and 2 runtime hosts produces exactly 2 probes.
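One plausible shape for the diff these tests exercise, sketched with a hypothetical `IProbeClient` standing in for the MxAccess advise surface (the real `Sync` also filters the hierarchy down to runtime host categories before diffing):

```csharp
using System.Collections.Generic;
using System.Linq;

public interface IProbeClient
{
    void Advise(string tagRef);
    void Unadvise(string tagRef);
}

public sealed class ProbeSet
{
    private readonly IProbeClient _client;
    private readonly HashSet<string> _advised = new();

    public ProbeSet(IProbeClient client) => _client = client;

    public int ActiveProbeCount => _advised.Count;

    public void Sync(IEnumerable<string> runtimeHostNames)
    {
        // Probe tag per host, per the design: <ObjectName>.ScanState
        var desired = runtimeHostNames.Select(n => $"{n}.ScanState").ToHashSet();

        foreach (var stale in _advised.Except(desired).ToList())
        {
            _client.Unadvise(stale);   // host vanished -> drop its probe
            _advised.Remove(stale);
        }
        foreach (var fresh in desired.Except(_advised).ToList())
        {
            _client.Advise(fresh);     // new host -> advise once
            _advised.Add(fresh);
        }
        // Unchanged hosts are untouched, which is what makes repeat Sync idempotent.
    }
}

// Tiny recording client for illustration.
public sealed class RecordingClient : IProbeClient
{
    public readonly List<string> Advised = new();
    public readonly List<string> Unadvised = new();
    public void Advise(string tagRef) => Advised.Add(tagRef);
    public void Unadvise(string tagRef) => Unadvised.Add(tagRef);
}
```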
Event routing
- `HandleProbeUpdate(probeAddr, ...)` → returns `true`, updates state.
- `HandleProbeUpdate(nonProbeAddr, ...)` → returns `false`, no state change.
- Concurrent `Sync` + `HandleProbeUpdate` under lock → no corruption (thread-safety smoke test).
- Callback arriving after `Sync` removed the entry → `HandleProbeUpdate` returns false (entry not found), no crash.
Counters
- `ActiveProbeCount == 2` after `Sync` with 1 Platform + 1 Engine.
- `ActiveProbeCount` decrements when a host is removed via `Sync`.
- `ActiveProbeCount == 0` on a fresh manager with no `Sync` called yet.
Dispose
- Dispose on a fresh manager → no-op, no Unadvise calls on the fake client.
- Dispose after Sync with 3 hosts → 3 Unadvise + 3 RemoveItem calls on the fake client.
- Dispose twice → second call is idempotent, no extra Unadvise calls.
- HandleProbeUpdate after Dispose → returns false defensively (no crash, no state change).
- Sync after Dispose → no-op or throws ObjectDisposedException (pick one; test documents whichever is chosen).
Subtree invalidation callbacks
- Construct probe manager with spy callbacks tracking `(gobjectId, kind)` tuples for each call.
- Running → Stopped transition → `onHostStopped` invoked exactly once with the correct GobjectId, `onHostRunning` never called.
- Stopped → Running transition → `onHostRunning` invoked exactly once with the correct GobjectId, `onHostStopped` never called.
- Unknown → Running (initial callback) → no invocation of either callback (only Running ↔ Stopped transitions trigger them, not a fresh Unknown → Running).
- Unknown → Stopped (via timeout) → `onHostStopped` invoked once.
- Multiple consecutive callbacks with `ScanState = true` while already Running → no extra `onHostRunning` invocations.
- Multiple consecutive error callbacks while already Stopped → no extra `onHostStopped` invocations.
- Callback throws exception → probe manager logs a warning, updates its internal state regardless, does not propagate.
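The edge-trigger rules this list pins down can be sketched as below. Illustrative only — the real manager folds this into its per-entry state update rather than a separate class — but the branches match the tests: every transition into Stopped fires `onHostStopped`, only Stopped → Running fires `onHostRunning`, Unknown → Running is silent, repeats fire nothing, and a throwing callback never corrupts probe state:

```csharp
using System;

public enum HostState { Unknown, Running, Stopped }

public sealed class TransitionNotifier
{
    private readonly Action<int> _onHostStopped;
    private readonly Action<int> _onHostRunning;

    public TransitionNotifier(Action<int> onHostStopped, Action<int> onHostRunning)
        => (_onHostStopped, _onHostRunning) = (onHostStopped, onHostRunning);

    public void Apply(int gobjectId, ref HostState current, HostState next)
    {
        var previous = current;
        current = next;                   // internal state updates unconditionally
        if (previous == next) return;     // no re-fire on repeated same-state updates

        try
        {
            if (next == HostState.Stopped)
                _onHostStopped(gobjectId);            // Running->Stopped AND Unknown->Stopped
            else if (next == HostState.Running && previous == HostState.Stopped)
                _onHostRunning(gobjectId);            // Unknown->Running is intentionally silent
        }
        catch
        {
            // real code logs a warning; callback failures never propagate
        }
    }
}
```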
`LmxNodeManagerBuildTests` additions
- Build address space with a `$WinPlatform` in the fake hierarchy → probe manager receives a `Sync` call with one entry.
- Build address space with a mix (1 Platform + 2 AppEngines + 5 user objects) → probe manager `Sync` receives exactly 3 runtime hosts.
- Build + rebuild with a different host set → probe manager's `Sync` called twice with the correct diff.
- Address space contains a synthetic `$RuntimeState` variable under each host object node.
- `ActiveProbeSubscriptionCount` reflects the probe count after build.
Host-to-variables mapping + subtree invalidation tests
- Build address space with 1 `$AppEngine` hosting 2 user objects with 3 attributes each → `_hostedVariables[engineId]` contains 6 variable nodes.
- Build address space with 1 `$WinPlatform` hosting 2 `$AppEngine`s, each hosting 3 user objects with 2 attributes each → `_hostedVariables[platformId]` contains the 2 Engine nodes + 12 attribute variables; `_hostedVariables[engineId]` contains its 6 attribute variables. (Platform and Engine entries both exist; a single variable can appear in both lists.)
- Rebuild with a different set → the map is rebuilt from scratch; old entries are released.
- `MarkHostVariablesBadQuality(engineId)` → every variable in `_hostedVariables[engineId]` has `StatusCode = BadOutOfService` after the call; variables hosted by other engines are unchanged.
- `ClearHostVariablesBadQuality(engineId)` → every variable in that host's list has `StatusCode = Good` after the call.
- `MarkHostVariablesBadQuality` on a GobjectId with no entry in the map → no-op, no crash.
- `MarkHostVariablesBadQuality` followed by a fresh MxAccess update on one of the variables → the update's Value + Status overwrites the forced Bad (confirms no "override layer" confusion; the simple StatusCode set is naturally overwritten by the normal dispatch path).
- `MarkHostVariablesBadQuality` acquires the node manager `Lock` (verify no deadlock when called from a thread that also needs the lock).
End-to-end subtree invalidation integration test
- Fake repository with 1 Engine hosting 10 attributes. All on advise. All have some recent value with Good status.
- Simulate probe callback delivering `ScanState = false` for the Engine → probe manager flips to Stopped, invokes `onHostStopped`, which in turn walks the 10 variables and flips them to `BadOutOfService`.
- Assert all 10 variables now report `StatusCode = BadOutOfService` after a `client.Read` round-trip.
- Simulate probe callback delivering `ScanState = true` again → probe manager flips to Running, `onHostRunning` clears the override, and all 10 variables now report `StatusCode = Good`.
`StatusReportServiceTests` additions
- HTML contains `<h2>Galaxy Runtime</h2>` when at least one runtime host is present.
- HTML rendering distinguishes `$WinPlatform` and `$AppEngine` rows in the Kind column.
- JSON exposes `RuntimeStatus.Total`, `RuntimeStatus.RunningCount`, `RuntimeStatus.StoppedCount`, `RuntimeStatus.Hosts[]`.
- Subscriptions panel HTML contains a `Probes:` line when `ProbeSubscriptionCount > 0`.
- No Runtime Status panel when the fake repository has zero runtime hosts.
- When the fake MxAccess client is `Disconnected`, all host rows render `Unknown` regardless of the state passed in.
`HealthCheckServiceTests` additions
- All hosts running → Healthy.
- One host stopped → Degraded, message mentions the stopped host name.
- All hosts stopped → Degraded (not Unhealthy — cached values still served).
- MxAccess disconnected + one host stopped → Unhealthy via Rule 1 (runtime status rule doesn't fire).
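The rule ordering these tests assert can be sketched as below. Names and the message format are illustrative, not the real service's API; the two load-bearing choices are that transport loss (Rule 1) short-circuits first so Rule 2e never double-messages, and that stopped hosts only ever degrade — never fail — overall health, because cached values are still served:

```csharp
using System.Collections.Generic;
using System.Linq;

public enum HostState { Unknown, Running, Stopped }
public enum Health { Healthy, Degraded, Unhealthy }

public static class HealthRollup
{
    public static (Health Health, string Message) Evaluate(
        bool mxAccessConnected, IReadOnlyDictionary<string, HostState> hosts)
    {
        // Rule 1: no transport at all -> Unhealthy, and no further rules run.
        if (!mxAccessConnected)
            return (Health.Unhealthy, "MxAccess disconnected");

        // Rule 2e: any stopped host -> Degraded, message names the stopped hosts.
        var stopped = hosts.Where(h => h.Value == HostState.Stopped)
                           .Select(h => h.Key).ToList();
        if (stopped.Count > 0)
            return (Health.Degraded, $"Galaxy hosts stopped: {string.Join(", ", stopped)}");

        return (Health.Healthy, "OK");
    }
}
```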
Verification
1. `dotnet build` clean on both Host and plugin.
2. `dotnet test tests/ZB.MOM.WW.LmxOpcUa.Tests --filter "FullyQualifiedName~GalaxyRuntimeProbe|FullyQualifiedName~Status|FullyQualifiedName~HealthCheck"` → all pass.
3. Deploy to instance1 (default `RuntimeStatusProbesEnabled: true`, `RuntimeStatusUnknownTimeoutSeconds: 15`). Dashboard shows `Galaxy Runtime: 2 of 2 hosts running` (DevPlatform + DevAppEngine) immediately after startup, all green. Subscriptions panel shows `Probes: 2`.
4. Stop `DevAppEngine` from the IDE (SMC `SetToOffScan` or the engine's stop action, leaving its parent Platform running). Verify:
   - Dashboard panel turns red within ~1s of the action.
   - DevAppEngine row shows `Stopped` with the last good timestamp.
   - DevPlatform row remains `Running` — confirms the engines are independently observable.
   - Overall Health rolls up to `Degraded`.
   - CLI `read ns=3;s=DevAppEngine.$RuntimeState` returns `"Stopped"`.
   - Log has an `Information` line "Galaxy runtime DevAppEngine ($AppEngine) transitioned Running → Stopped".
   - Subtree invalidation: CLI `read ns=3;s=TestMachine_001.MachineID` and any other tag under an object hosted by DevAppEngine returns status code `BadOutOfService` (or whatever specific code the Mark method uses). Every descendant tag, not just a sample — sweep-test via a browse + read across the whole address space. Operators also observe this on the dashboard Alarms / subscribed-variable reads if they're watching any particular value.
   - Client-freeze observation: subscribe an OPC UA client to a handful of variables under DevAppEngine before step 4, then trigger the stop. Note whether the client handles the resulting notification batch cleanly (ideal) or visibly stalls (a residual problem that dispatch suppression would need to address in phase 2). Document the observed behavior in the phase-2 decision gate for dispatch suppression.
5. Start `DevAppEngine` again (`SetToOnScan`). Verify:
   - Dashboard flips back to green within ~1s.
   - CLI read of `$RuntimeState` returns `"Running"`.
   - Log has a "Galaxy runtime DevAppEngine ($AppEngine) transitioned Stopped → Running" line.
   - Subtree recovery: descendant tags previously showing `BadOutOfService` now show `Good` status. Values may initially be stale (whatever was cached at stop time) until fresh on-change MxAccess updates arrive; this matches the design trade-off documented in the Subtree Quality Invalidation section.
6. Stop `DevPlatform` entirely (full platform stop). Verify:
   - Both DevPlatform and DevAppEngine flip to `Stopped` (the Platform takes the Engine down with it).
   - Log records both transitions.
   - CLI reads of `$RuntimeState` for both hosts return `"Stopped"`.
   - The underlying raw `ScanState` attribute reads may return `BadCommunicationError` — the operator sees the distinction between the cached rollup and the live raw attribute.
7. Simulate MxAccess transport loss — e.g., stop the ArchestrA runtime on the local node or kill the probe connection. Verify:
   - Every host row in the Runtime Status panel renders `Unknown` (not Stopped) while the Connection panel reports `Disconnected`.
   - Overall Health is `Unhealthy` via Rule 1, NOT `Degraded` via Rule 2e (the rules should not double-message).
   - After MxAccess reconnects, the runtime rows revert to their actual underlying states.
8. Deploy to instance2 with the same config. Both instances should show consistent state since they observe the same local ArchestrA runtime.
9. Smoke-test: disable probes via `RuntimeStatusProbesEnabled: false`, restart, verify the Runtime Status panel is absent from the HTML, the `Probes:` line is absent from the Subscriptions panel, and no probe subscriptions are advised (log and `ActiveSubscriptionCount` delta) — the backward-compatibility path for deployments that don't want the feature.
10. Unresolvable-probe-tag behavior verification — temporarily add a bogus tag to the probe set to discover how MxAccess surfaces resolution failures. The simplest way is to force the probe manager to advise a made-up `NoSuchPlatform_999.ScanState` reference during a test boot, then observe:
    - Does MxAccess deliver a data-change callback with `ItemStatus[0].success = false` and a resolution-failure detail? If yes, the host row transitions Unknown → Stopped within ~1s via the error-callback path, and `LastError` carries the detail. Tighten the plan's language to say "MxAccess surfaces resolution failures as error callbacks" and optionally tighten `RuntimeStatusUnknownTimeoutSeconds` downward.
    - Or does MxAccess silently drop the advise with no callback at all? If yes, the bogus host stays Unknown until `RuntimeStatusUnknownTimeoutSeconds` elapses, then flips to Stopped via the Unknown-timeout backstop. Tighten the plan's language to say "MxAccess does not surface resolution failures; the Unknown-timeout is the only detection path" and leave the default timeout as-is.
    - Document the observed behavior in `docs/MxAccessBridge.md` alongside the probe pattern section so operators know which detection path their deployment relies on.
    - Remove the bogus tag and restart before handing over.
Open questions (phase 2/3 scope — not blocking phase 1)
- Dispatch suppression for Stopped hosts (phase 2 decision gate) — once phase 1 ships with subtree invalidation, observe whether the client-freeze symptom persists. If it does, design dispatch suppression: filter MxAccess per-tag updates before they hit the dispatch queue when the owning host is Stopped. Requires a `tagRef → owning-host GobjectId` map (which `_hostedVariables` already implies, inverted). The trade-off is dropping legitimate updates during brief probe/reality mismatch windows. Decide after real measurement.
- Should the probe manager expose transition events? A synthetic OPC UA event notifier on each host object that fires when `$RuntimeState` transitions. Phase 2 stretch — operators get per-host polling via the dashboard panel today; events would let clients subscribe without polling.
- Multi-node Galaxies — a Platform on a remote node shows up in the hierarchy, but probes fire through the local MxAccess runtime's node. The probe semantics should still work because MxAccess routes inter-Platform queries transparently, but this is worth confirming during step 4 if the environment has a multi-node Galaxy.
- Is `ScanState` writable? Some Galaxy system attributes are writable via MxAccess (the `SetScan` method on the object), which would let an operator start/stop a host through the OPC UA bridge. Phase 3 possibility — it would require a gating security classification, since it's a runtime control action, not a data write.