Plan: Galaxy Runtime Status (Platform + AppEngine Stopped/Started Detection)
Context
Today the bridge has no operator-visible signal for "is Galaxy Platform X or AppEngine Y stopped or running?". The dashboard shows:
- MxAccess state — one bit of truth about whether the bridge can talk to the local MxAccess runtime at all.
- Data change dispatch rate — aggregate throughput across every advised attribute.
Neither catches the case an operator actually cares about: a single Platform or AppEngine in a multi-host Galaxy has stopped (operator stopped it from the IDE, the node crashed, network cut, process died, someone toggled OffScan for maintenance). The bridge keeps serving cached values, downstream OPC UA clients see stale reads, and nobody notices until somebody specifically goes looking at the affected equipment.
Galaxy exposes <ObjectName>.ScanState as a boolean system attribute on every deployed $WinPlatform and $AppEngine. true means the object is on scan and executing; anything else means not running. AppEngine state is independently observable through MxAccess (even a stopped Engine's parent Platform can still route the query) so a single probe mechanism covers both host types.
The goal is to advise <ObjectName>.ScanState for every deployed $WinPlatform and $AppEngine, surface per-host runtime state on the dashboard, drive a Degraded health check rule when any is down, and publish the state into the OPC UA address space so external clients can subscribe alongside the value data they already consume.
Design
Probe tag: <ObjectName>.ScanState
ScanState is a boolean system attribute on every deployed $WinPlatform and $AppEngine. The classification rule:
isRunning = status.Success && vtq.Value is bool b && b
Everything else → Stopped. The ItemStatus fields (category, detail) are still captured into LastError for operator diagnostics, but they don't branch the state machine.
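A minimal sketch of that classification rule, with simplified stand-in types (`ProbeStatus` / `ProbeVtq` are assumptions, not the real MxAccess `ItemStatus` / `Vtq` proxies):

```csharp
using System;

// Simplified stand-ins (assumptions) for the MxAccess ItemStatus/Vtq proxies.
public sealed class ProbeStatus { public bool Success; public int Detail; }
public sealed class ProbeVtq { public object? Value; }

public static class ScanStateClassifier
{
    // Running only when the callback succeeded AND the payload is boolean true;
    // false, null, a non-bool payload, or a failed status all classify as Stopped.
    public static bool IsRunning(ProbeStatus status, ProbeVtq vtq) =>
        status.Success && vtq.Value is bool b && b;
}
```

The `is bool b && b` pattern is what makes null and non-bool payloads fall through to Stopped rather than throwing.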
On-change delivery semantic
MxAccess AdviseSupervisory delivers the current value at subscription time and then fires OnDataChange only when the value changes. ScanState is discrete — for a healthy host, the initial advise callback reports true and nothing follows until the state actually changes. There is no periodic heartbeat on the subscription.
Implications:
- No starvation-based Running → Stopped transition. A Running host will legitimately go minutes or hours without an update. The stale-threshold check for the Running state is dropped entirely.
- Error callbacks drive the Running → Stopped transition. MxAccess delivers a data-change callback with ItemStatus[0].success == false and detail == 2 (MX_E_PlatformCommunicationError) when a host becomes unreachable. We trust this signal — it's the broker's job to surface it, and in practice it fires quickly.
- Stale threshold only applies to the Unknown state. If a probe is advised but never receives a first callback (initial resolution failure, host never deployed, MxAccess routing broken), the Unknown → Stopped transition fires after UnknownResolutionTimeoutSeconds. This catches "the probe never came online" without tripping on healthy stable hosts.
Subscription mechanics:
- AdviseSupervisory on <ObjectName>.ScanState. The Supervisory variant avoids user-login requirements for bridge-owned probes — matches the pattern the node manager already uses for its own subscriptions.
- Probes are bridge-owned, not ref-counted against client subscriptions. They live for the lifetime of the address space between rebuilds.
- On rebuild, the probe set is diffed against the new host list and the minimum number of AdviseSupervisory / Unadvise calls are issued (see Sync in the probe manager).
Host discovery
Galaxy Repository already has the data — we just need to surface it to the runtime layer.
hierarchy.sql currently selects every deployed object where template_definition.category_id IN (1, 3, 4, 10, 11, 13, 17, 24, 26). Category 1 = $WinPlatform and 3 = $AppEngine are already in the set. Add template_definition.category_id as a new column on the query so the repository loader can tag each GalaxyObjectInfo with its Galaxy category, and the probe manager can filter for categories 1 and 3.
Schema change: add CategoryId: int to GalaxyObjectInfo, populated from hierarchy.sql. Small schema change, keeps the probe enumeration aligned with whatever the rest of the address space sees at each rebuild.
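A sketch of what GalaxyObjectInfo might look like after the change. Only CategoryId (and HostGobjectId, used later by the hosting map) are specified by this plan; the other fields are assumptions inferred from how the type is referenced elsewhere in the document:

```csharp
// Hypothetical shape of GalaxyObjectInfo after the schema change. GobjectId and
// ObjectName are assumed existing fields; CategoryId and HostGobjectId are new.
public sealed class GalaxyObjectInfo
{
    public int GobjectId { get; set; }
    public string ObjectName { get; set; } = "";   // gobject.tag_name
    public int CategoryId { get; set; }            // template_definition.category_id (new)
    public int HostGobjectId { get; set; }         // gobject.host_gobject_id (new)

    // Categories 1 ($WinPlatform) and 3 ($AppEngine) identify runtime hosts.
    public bool IsRuntimeHost => CategoryId == 1 || CategoryId == 3;
}
```

The IsRuntimeHost helper is illustrative — the probe manager could equally inline the category check in its Sync filter.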
Runtime host state machine
┌─ Unknown ─┐ (initial state; advise issued, no callback yet)
│ │
│ │ ScanState == true
│ ▼
│ Running ◄───────────────────┐
│ │ │
│ │ │ ScanState == true
│ │ ScanState != true │ (recovery callback)
│ │ (false / error / │
│ │ bad status) │
│ ▼ │
│ Stopped ──────────────────────┘
│
└─► Stopped (Unknown → Stopped after UnknownResolutionTimeoutSeconds
if no initial callback ever arrives)
Three states:
- Unknown — probe advised but no callback yet. Initial state after bridge startup or a rebuild until the first OnDataChange for that host. If this state persists longer than UnknownResolutionTimeoutSeconds (default 15s), the manager's periodic check flips it to Stopped — captures the "probe never resolved" case.
- Running — last probe callback delivered ScanState = true with ItemStatus[0].success == true. Stays in this state until a callback changes it. No starvation-based timeout.
- Stopped — any of:
  - Last probe callback had ScanState != true (explicit off-scan).
  - Last probe callback had ItemStatus[0].success == false (unreachable host).
  - Unknown state timed out (initial resolution never completed).
  - Initial AdviseSupervisory reported a ResolutionStatus of invalidReference or noGalaxyRepository.
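The transition rules above can be sketched as two pure functions — one for callbacks, one for the time-based Tick. This is a reduced model, not the probe manager itself (which also updates timestamps, counters, and LastError):

```csharp
using System;

public enum GalaxyRuntimeState { Unknown, Running, Stopped }

public static class RuntimeTransitions
{
    // Any callback resolves the state: good boolean true → Running, else Stopped.
    public static GalaxyRuntimeState OnCallback(bool statusSuccess, object? value) =>
        statusSuccess && value is bool b && b
            ? GalaxyRuntimeState.Running
            : GalaxyRuntimeState.Stopped;

    // Time-based rule: only Unknown entries can time out. Running never starves
    // (on-change-only delivery); Stopped only recovers via a callback.
    public static GalaxyRuntimeState OnTick(
        GalaxyRuntimeState current, TimeSpan sinceAdvise, TimeSpan unknownTimeout) =>
        current == GalaxyRuntimeState.Unknown && sinceAdvise > unknownTimeout
            ? GalaxyRuntimeState.Stopped
            : current;
}
```

Keeping the transitions pure like this is also what makes the fake-clock unit tests in the test plan straightforward.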
MxAccess transport down → force Unknown
When the local MxAccess client is not connected (IMxAccessClient.State != ConnectionState.Connected), every probe's transport is effectively offline regardless of the underlying host state. The probe manager forces every entry to Unknown in its snapshot output while MxAccess is disconnected. Rationale:
- Telling the operator that all hosts are Stopped is misleading — the actual problem is the local transport, which the existing Connection panel already surfaces prominently.
- Unknown is the right semantic: we don't know the host state because we can't see them right now.
- When MxAccess reconnects, the broker re-delivers probe subscriptions and the state machine resumes normally.
Implementation: GetSnapshot() checks _client.State and rewrites State = Unknown (leaving the underlying _stateByProbe map intact for when the transport comes back). HealthCheckService already rolls to Unhealthy via the MxAccess-not-connected rule before the runtime status rule fires, so this doesn't create a confusing health-rollup story.
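A sketch of the gating rule, with the connection enum and map shape simplified (the real snapshot returns GalaxyRuntimeStatus entries, not bare states):

```csharp
using System.Collections.Generic;
using System.Linq;

public enum ConnectionState { Connected, Disconnected }
public enum GalaxyRuntimeState { Unknown, Running, Stopped }

public static class SnapshotGate
{
    // While the transport is down, every reported state is Unknown — but the
    // underlying map is left untouched, so the real state resurfaces as soon
    // as MxAccess reconnects.
    public static IReadOnlyDictionary<string, GalaxyRuntimeState> GetSnapshot(
        ConnectionState transport,
        IReadOnlyDictionary<string, GalaxyRuntimeState> stateByProbe) =>
        transport == ConnectionState.Connected
            ? stateByProbe
            : stateByProbe.ToDictionary(kv => kv.Key, _ => GalaxyRuntimeState.Unknown);
}
```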
New types
All in src/ZB.MOM.WW.LmxOpcUa.Host/Domain/:
public enum GalaxyRuntimeState { Unknown, Running, Stopped }
public sealed class GalaxyRuntimeStatus
{
public string ObjectName { get; set; } = ""; // gobject.tag_name
public int GobjectId { get; set; }
public string Kind { get; set; } = ""; // "$WinPlatform" or "$AppEngine"
public GalaxyRuntimeState State { get; set; }
public DateTime? LastStateCallbackTime { get; set; } // UTC of most recent probe callback
public DateTime? LastStateChangeTime { get; set; } // UTC of last Running↔Stopped transition
public bool? LastScanState { get; set; } // last ScanState value; null before first update
public string? LastError { get; set; } // MxStatus.detail description when !success
public long GoodUpdateCount { get; set; } // callbacks where ScanState == true
public long FailureCount { get; set; } // callbacks where ScanState != true or !success
}
Why two timestamps (LastStateCallbackTime vs LastStateChangeTime): on-change-only delivery means they'll match for most entries, but a callback that arrives with a different error detail while the host is already Stopped updates the callback time and LastError without touching LastStateChangeTime. The dashboard's "Since" column (see Dashboard panel) uses LastStateChangeTime so operators see "Stopped since 08:17:02Z" regardless of how many intervening error callbacks have refined the diagnostic detail.
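The two-timestamp rule reduces to a few lines — every callback bumps the callback time, but the change time only moves on an actual transition (sketch; field names match GalaxyRuntimeStatus, the update method is illustrative):

```csharp
using System;

public sealed class TimestampPair
{
    public DateTime? LastStateCallbackTime;
    public DateTime? LastStateChangeTime;

    public void OnCallback(bool stateChanged, DateTime utcNow)
    {
        LastStateCallbackTime = utcNow;          // every callback
        if (stateChanged) LastStateChangeTime = utcNow; // transitions only
    }
}
```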
Naming note: "Galaxy runtime" is the generic term covering both $WinPlatform and $AppEngine — the dashboard and config use this neutral phrasing so the feature doesn't look like it only covers Platforms.
Probe manager
New class MxAccess/GalaxyRuntimeProbeManager.cs, owned by LmxNodeManager:
internal sealed class GalaxyRuntimeProbeManager : IDisposable
{
public GalaxyRuntimeProbeManager(
IMxAccessClient client,
int unknownResolutionTimeoutSeconds,
Action<int> onHostStopped, // invoked with GobjectId on Running → Stopped
Action<int> onHostRunning); // invoked with GobjectId on Stopped → Running
// Called after address-space build / rebuild. Adds probes for new hosts,
// removes them for hosts no longer in the hierarchy. Idempotent.
// Caller supplies the full hierarchy; the manager filters for category_id
// 1 ($WinPlatform) and 3 ($AppEngine).
// Blocks on sequential AddItem/AdviseSupervisory SDK calls — see wiring notes.
public void Sync(IReadOnlyList<GalaxyObjectInfo> hierarchy);
// Invoked by LmxNodeManager's OnTagValueChanged callback when the address
// matches a probe tag reference. Returns true when the event was consumed
// by a probe so the data-change dispatch queue can skip it.
public bool HandleProbeUpdate(string tagRef, Vtq vtq, MxStatusProxy status);
// Called from the MxAccess connection monitor callback (MonitorIntervalSeconds
// cadence) to advance time-based transitions:
// 1. Unknown → Stopped when UnknownResolutionTimeoutSeconds has elapsed.
// 2. Nothing for Running — no starvation check (on-change-only semantics).
public void Tick();
// Snapshot respects MxAccess transport state — returns all Unknown when
// the transport is disconnected, regardless of the underlying per-host state.
public IReadOnlyList<GalaxyRuntimeStatus> GetSnapshot();
public int ActiveProbeCount { get; }
// Unadvise + RemoveItem on every active probe. Called from LmxNodeManager.Dispose
// before the MxAccess client teardown. Idempotent — safe to call multiple times.
public void Dispose();
}
The two Action<int> callbacks are how the probe manager triggers the subtree quality invalidation documented below — the owning LmxNodeManager passes references to its own MarkHostVariablesBadQuality and ClearHostVariablesBadQuality methods at construction time. The probe manager calls them synchronously on state transitions, from whichever thread delivered the probe callback (the MxAccess dispatch thread). The node manager methods acquire their own lock internally — the probe manager does not hold its own lock across the callback invocation to avoid inverted-lock-order deadlocks.
Internals:
- Dictionary<string, GalaxyRuntimeStatus> keyed by probe tag reference (<ObjectName>.ScanState).
- Reverse Dictionary<int, string> from GobjectId to probe tag for Sync to diff against a fresh hierarchy.
- One lock guarding both maps. Operations are microsecond-scale.
- Sync filters hierarchy for CategoryId == 1 || CategoryId == 3, then compares the filtered set against the active probe set:
  - Added hosts → client.AddItem + AdviseSupervisory; insert GalaxyRuntimeStatus { State = Unknown }.
  - Removed hosts → Unadvise + RemoveItem; drop entry.
  - Unchanged hosts → leave in place, preserving their state machine across the rebuild.
- HandleProbeUpdate is the per-callback entry point. It evaluates the isRunning predicate, updates LastUpdateTime, transitions state, logs at Information level on state changes only (not every tick), and stores the ItemStatus detail into LastError on failure.
- Tick runs at the existing dispatch thread cadence. For each Unknown entry, checks LastUpdateTime == null && (now - _createdAt[id]) > unknownResolutionTimeoutSeconds and flips to Stopped if so. Healthy Running entries are not touched.
- GetSnapshot short-circuits to "all Unknown" when _client.State != ConnectionState.Connected.
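The Sync diff itself is a set comparison. A sketch under assumed shapes (HostInfo and the tuple return are illustrative; the real Sync also issues the AddItem / AdviseSupervisory / Unadvise calls for each diff entry):

```csharp
using System.Collections.Generic;
using System.Linq;

public sealed record HostInfo(int GobjectId, string ObjectName, int CategoryId);

public static class ProbeSync
{
    // Compute which probe tags to advise and which to unadvise, given a fresh
    // hierarchy and the currently active probe set. Unchanged probes appear in
    // neither list, preserving their state machine across the rebuild.
    public static (List<string> ToAdvise, List<string> ToUnadvise) Diff(
        IEnumerable<HostInfo> hierarchy, ISet<string> activeProbes)
    {
        // Only $WinPlatform (category 1) and $AppEngine (category 3) get probes.
        var wanted = hierarchy
            .Where(h => h.CategoryId == 1 || h.CategoryId == 3)
            .Select(h => $"{h.ObjectName}.ScanState")
            .ToHashSet();
        return (wanted.Except(activeProbes).ToList(),
                activeProbes.Except(wanted).ToList());
    }
}
```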
LmxNodeManager wiring
LmxNodeManager constructs a GalaxyRuntimeProbeManager when MxAccessConfiguration.RuntimeStatusProbesEnabled is true. In BuildAddressSpace and the subtree rebuild path, after the existing loops complete, call _probeManager.Sync(hierarchy). Sync blocks while it issues AddItem + AdviseSupervisory sequentially for each new host — for a galaxy with ~50 runtime hosts this adds roughly 500ms–1s to the address-space build on top of the existing several-second build time. Kept synchronous deliberately: the simpler correctness model is worth the startup hit, and ActiveProbeCount is guaranteed to be accurate the moment the build completes.
Route the existing OnTagValueChanged callback through _probeManager.HandleProbeUpdate first — if it returns true, the event was consumed by a bridge-owned probe and the dispatch queue skips the normal variable-update path.
Tick() cadence — piggyback on the MxAccess connection monitor. The dispatch thread wakes on _dataChangeSignal, which only fires when tag values change. In the degenerate case where no probe ever resolves (MxAccess routing broken, bad probe tag, etc.), the dispatch loop never wakes and the Unknown → Stopped timeout would never fire. To avoid adding a new thread or timer, hook _probeManager.Tick() into the callback path that the existing MxAccess.MonitorIntervalSeconds watcher already runs — the same cadence that drives the connection-level probe-tag staleness check. A single call site covers both.
If the monitor is not accessible from LmxNodeManager during implementation (it lives at a different layer in the MxAccess client), fall back to Option A from the design discussion: change the dispatch loop's WaitOne() call to a timed WaitOne(500ms) so it wakes periodically regardless of data changes. Single-line change, but requires verifying no assumptions in the existing loop break from the periodic wake-ups.
Service shutdown — explicit probe cleanup
The probe manager's Sync handles Unadvise on diff removal when a host leaves the hierarchy. Service shutdown is a separate path that needs explicit handling: when LmxNodeManager is disposed, the active probe subscriptions must be torn down before the MxAccess client is closed — otherwise we rely on the client's broader shutdown to cover supervisory subscriptions, which depends on disposal ordering and may or may not clean up cleanly.
GalaxyRuntimeProbeManager implements IDisposable. Dispose() walks the active probe map, calls Unadvise + RemoveItem on each entry, and clears the maps. Idempotent — calling it twice is a no-op. LmxNodeManager.Dispose calls _probeManager?.Dispose() before the existing teardown steps that touch the MxAccess client.
Subtree quality invalidation on Stopped transition
Operational context for this section — observed behavior from production: when an AppEngine or Platform goes OffScan, MxAccess fans out per-tag OnDataChange callbacks for every advised tag hosted by that runtime object, each carrying bad quality. Two symptoms result:
- OPC UA client freeze — the dispatch handler processes the flood in one cycle, pushes thousands of OPC UA value-change notifications to subscribed clients in one Publish response, and the client visibly stalls handling the volume.
- Incomplete quality flip — some OPC UA variables retain their last good value with Good quality even after the host is down, either because the dispatch queue drops updates, or because some tags aren't in the subscribed set at the moment of the flood, or because of an edge case in the quality mapper. Operationally: clients read plausible-looking stale data from a dead host.
The probe-driven Stopped transition is the authoritative, on-time signal we control. On that transition, the bridge proactively walks every OPC UA variable node hosted by the Stopped host and sets its StatusCode to BadOutOfService. This is independent of whether MxAccess also delivers per-tag bad-quality updates — the two signals are belt-and-suspenders for correctness. Even if the dispatch queue drops half the per-tag updates, the subtree walk guarantees the end state is uniformly Bad for every variable under the dead host.
On the recovery Stopped → Running transition, the bridge walks the same set and clears the override — sets StatusCode back to Good so the cached values are visible again. Subsequent real MxAccess updates arrive on-change and overwrite value + status as normal. Trade-off: for a host that's been down a long time, some tags may show Good quality on a stale cached value for a short window after recovery, until MxAccess delivers the next on-change update for that tag. This matches existing bridge behavior for any slow-changing attribute and is preferable to leaving variables stuck at BadOutOfService indefinitely waiting for an update that may never come.
What's included in the "subtree" — the set of variables whose owning Galaxy object is hosted (transitively) by the Stopped host. For AppEngines, this is every variable whose object's host_gobject_id chain reaches the Engine. For Platforms, it's every variable on every Engine hosted by the Platform, plus every object hosted directly on the Platform. This is not browse-tree containment — an object can live in one Area (browse parent) but be hosted by an Engine on a different Platform (runtime parent), and the host relationship is what determines the fate of its live data.
Implementation plan for the host-to-variables mapping:
- Extend hierarchy.sql to return gobject.host_gobject_id as a new column if it exists. Verify during implementation — if the column is not present on this Galaxy schema version, fall back to contained_by_gobject_id as an approximation (less precise for edge cases where browse containment differs from runtime hosting, but sufficient for typical Galaxy topologies).
- Extend GalaxyObjectInfo with HostGobjectId: int.
- During BuildAddressSpace, as each variable is created, compute its owning host by walking HostGobjectId up the chain until hitting a $WinPlatform or $AppEngine (or reaching the root). Append the variable to a Dictionary<int, List<BaseDataVariableState>> keyed by the host's GobjectId.
- On BuildSubtree (incremental rebuild), the same logic runs for newly added variables. Variables that leave the hierarchy are removed from the map. The map lives next to _nodeMap on LmxNodeManager.
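A sketch of the owning-host walk. ObjInfo, the lookup-dictionary shape, and the "0 means root" convention are assumptions about this Galaxy schema, not confirmed details:

```csharp
using System.Collections.Generic;

public sealed record ObjInfo(int GobjectId, int HostGobjectId, int CategoryId);

public static class HostResolver
{
    // Follow HostGobjectId upward until a $WinPlatform (1) or $AppEngine (3)
    // is reached. Returns null when the chain hits the root (assumed id 0)
    // or a cycle without finding a host.
    public static int? OwningHost(int gobjectId, IReadOnlyDictionary<int, ObjInfo> byId)
    {
        var current = gobjectId;
        while (byId.TryGetValue(current, out var obj))
        {
            if (obj.CategoryId == 1 || obj.CategoryId == 3) return obj.GobjectId;
            if (obj.HostGobjectId == 0 || obj.HostGobjectId == current) return null; // root / cycle guard
            current = obj.HostGobjectId;
        }
        return null;
    }
}
```

Note the walk starts at the variable's owning object, so a variable directly on an Engine or Platform resolves to that host itself.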
New public methods on LmxNodeManager:
// Called by probe manager on Running → Stopped. Walks every variable hosted by
// gobjectId and sets its StatusCode to BadOutOfService. Safe to call multiple times.
// Does nothing when gobjectId has no hosted variables.
public void MarkHostVariablesBadQuality(int gobjectId);
// Called by probe manager on Stopped → Running. Walks every variable hosted by
// gobjectId and resets StatusCode to Good. Values are left at whatever the last
// MxAccess-delivered value was; subsequent on-change updates will refresh them.
public void ClearHostVariablesBadQuality(int gobjectId);
Both methods acquire the standard node manager Lock, iterate the hosted list, set StatusCode + call ClearChangeMasks(ctx, false) per variable, and release the lock. The OPC UA subscription publisher picks up the change masks on its next tick and pushes notifications to subscribed clients — so operators see a single uniform quality flip per variable rather than two (one from our walk, one from the MxAccess per-tag delivery).
Dispatch suppression — deferred pending observation
The subtree invalidation above addresses the data-correctness symptom (some variables not flipping to bad quality). The client freeze symptom is a separate problem: even if the quality state is correct, the bridge is still processing a thundering herd of per-tag bad-quality MxAccess callbacks through the dispatch queue, which in turn push thousands of OPC UA value-change notifications to subscribed clients.
A stronger fix would be dispatch suppression: once the probe manager transitions a host to Stopped, filter out incoming MxAccess per-tag updates for any tag owned by that host before they hit the dispatch queue. The subtree walk has already captured the state; the redundant per-tag updates are pure noise.
This is deliberately NOT part of phase 1. Reasons:
- The subtree walk may make the freeze disappear entirely. If the dispatch queue processes the flood but the notifications it pushes are now duplicates of change masks the walk already set, the SDK may coalesce them into a single publish cycle and the client sees one notification batch rather than thousands. We want to observe whether this is the case before building suppression.
- If the freeze persists after subtree invalidation ships, we have a real measurement of the residual problem to inform the suppression design (which hosts, which tags, how much batching, whether to also coalesce at the OPC UA publisher level).
- The suppression path has a subtle failure mode: if the probe is briefly wrong (race where the probe says Stopped but the host actually recovered), we'd drop legitimate updates for a few seconds until the probe catches up. For an on-change-only probe this is bounded, but the plan should justify the trade-off against real observed data.
Phase 2 decision gate: after shipping phase 1 and observing the post-subtree-walk behavior against a real AppEngine stop, decide whether dispatch suppression is still needed and design it against the real measurement.
OPC UA address space exposure
Per-host status should be readable by OPC UA clients, not just the dashboard. Add child variable nodes under each $WinPlatform / $AppEngine object node in the address space. All bridge-synthetic nodes use a $ prefix so they can never collide with user-defined attributes on extended templates:
- <Object>.$RuntimeState (String) — Unknown / Running / Stopped.
- <Object>.$LastCallbackTime (DateTime) — most recent probe callback regardless of transition.
- <Object>.$LastScanState (Boolean) — last ScanState value received; null before first update.
- <Object>.$LastStateChangeTime (DateTime) — most recent Running↔Stopped transition, backs the dashboard "Since" column.
- <Object>.$FailureCount (Int64)
- <Object>.$LastError (String) — last non-success MxStatus detail, empty string when null.
These read from the probe manager's snapshot (bridge-synthetic, no MxAccess round-trip) and are updated via ChangeBits.Value signalling when the state transitions. Read-only.
Note: the underlying <ObjectName>.ScanState Galaxy attribute will already appear in the address space via the normal hierarchy-build path, so downstream clients will see both the raw attribute (ns=3;s=DevPlatform.ScanState) and the synthesized state rollup (ns=3;s=DevPlatform.$RuntimeState). Intentional — the raw attribute is the ground truth, the rollup adds state-change timestamps and the Unknown/Running/Stopped trichotomy.
Namespace placement: under the existing host object node in the Galaxy namespace (ns=3), browseable at DevPlatform/$RuntimeState etc. No new namespace needed.
Dashboard
Runtime Status panel
New RuntimeStatusInfo class on StatusData:
public class RuntimeStatusInfo
{
public int Total { get; set; }
public int RunningCount { get; set; }
public int StoppedCount { get; set; }
public int UnknownCount { get; set; }
public List<GalaxyRuntimeStatus> Hosts { get; set; } = new();
}
Populated in StatusReportService via a new LmxNodeManager.RuntimeStatuses accessor. Renders between the Galaxy Info panel and the Historian panel.
Panel color:
- Green — all hosts Running.
- Yellow — at least one Unknown, zero Stopped.
- Red — at least one Stopped.
- Gray — MxAccess disconnected (all hosts Unknown; the Connection panel is the primary signal).
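The color rules collapse to a small precedence function (sketch; the counts map directly onto RuntimeStatusInfo, and the Gray branch mirrors the transport gating):

```csharp
public enum PanelColor { Green, Yellow, Red, Gray }

public static class RuntimePanel
{
    // Precedence: transport down beats everything, then any Stopped, then any
    // Unknown, then all-Running.
    public static PanelColor Color(bool mxAccessConnected, int stopped, int unknown)
    {
        if (!mxAccessConnected) return PanelColor.Gray; // all hosts Unknown; Connection panel is primary
        if (stopped > 0) return PanelColor.Red;
        if (unknown > 0) return PanelColor.Yellow;
        return PanelColor.Green;
    }
}
```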
HTML layout:
┌ Galaxy Runtime ───────────────────────────────────────────────────────┐
│ 5 of 6 hosts running (3 platforms, 3 engines) │
│ ┌─────────────────┬──────────────┬─────────┬──────────────────────┐ │
│ │ Name │ Kind │ State │ Since │ │
│ ├─────────────────┼──────────────┼─────────┼──────────────────────┤ │
│ │ DevPlatform │ $WinPlatform │ Running │ 2026-04-13T08:15:02Z │ │
│ │ DevAppEngine │ $AppEngine │ Running │ 2026-04-13T08:15:04Z │ │
│ │ PlatformA │ $WinPlatform │ Running │ 2026-04-13T08:15:03Z │ │
│ │ EngineA_1 │ $AppEngine │ Running │ 2026-04-13T08:15:05Z │ │
│ │ EngineA_2 │ $AppEngine │ Stopped │ 2026-04-13T14:28:03Z │ │
│ │ PlatformB │ $WinPlatform │ Running │ 2026-04-13T08:15:04Z │ │
│ └─────────────────┴──────────────┴─────────┴──────────────────────┘ │
└────────────────────────────────────────────────────────────────────────┘
The "Since" column is backed by LastStateChangeTime, and its meaning depends on the row's current state: "Running since X" reads as "has been on scan since X", "Stopped since X" reads as "has been off scan since X". For Unknown rows, display "Advised since X" instead (the probe was registered at X but has not yet received its first callback).
Subscriptions panel — break out bridge probe count
The existing Subscriptions panel shows Active: N — the total advised-item count from IMxAccessClient.ActiveSubscriptionCount. After this ships, that number will include the bridge-owned runtime probes (one per Platform + one per AppEngine), which would look like a silent jump to operators watching for capacity planning purposes.
Fix: expose a new ActiveProbeSubscriptionCount property on LmxNodeManager (wired from GalaxyRuntimeProbeManager.ActiveProbeCount) and render as a second line on the Subscriptions panel:
┌ Subscriptions ──────────────────────────────┐
│ Active: 1247 │
│ Probes: 6 (bridge-owned runtime status) │
└──────────────────────────────────────────────┘
The Active total continues to include probes (no subtraction) so the count still matches whatever MxAccess actually holds — the breakout line tells operators which slice is bridge-internal.
HealthCheckService rule
New rule in HealthCheckService.CheckHealth:
Rule 2e: Any Galaxy runtime host in Stopped state → Degraded
- Yellow panel
- Message: "N of M hosts stopped: Host1, Host2"
Rationale: the bridge is still able to talk to the local MxAccess runtime and serve cached values for the hosts that are up, so this is Degraded rather than Unhealthy. A stopped host is recoverable — the operator fixes it and the probe automatically transitions back to Running.
Rule ordering matters: this rule checks after the MxAccess-connected check (Rule 1), so when MxAccess is disconnected the service is Unhealthy on Rule 1 and the runtime-host rule never runs — avoids the confusing "MxAccess down AND Galaxy runtime degraded" double message.
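A sketch of Rule 2e's evaluation and message building, against a simplified host snapshot (HostState and the null-means-healthy return are illustrative; the real rule runs inside HealthCheckService.CheckHealth):

```csharp
using System.Collections.Generic;
using System.Linq;

public sealed record HostState(string Name, bool Stopped);

public static class Rule2e
{
    // Returns null when no host is stopped; otherwise the Degraded message
    // in the "N of M hosts stopped: Host1, Host2" shape from the plan.
    public static string? Evaluate(IReadOnlyList<HostState> hosts)
    {
        var stopped = hosts.Where(h => h.Stopped).Select(h => h.Name).ToList();
        return stopped.Count == 0
            ? null
            : $"{stopped.Count} of {hosts.Count} hosts stopped: {string.Join(", ", stopped)}";
    }
}
```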
Configuration
New fields on MxAccessConfiguration (not a new config class — this is a runtime concern of the MxAccess bridge):
public class MxAccessConfiguration
{
// ...existing fields...
/// <summary>
/// Enables per-host runtime status probing via AdviseSupervisory on
/// <c><ObjectName>.ScanState</c> for every deployed $WinPlatform
/// and $AppEngine. Default enabled when a deployed ArchestrA Platform
/// is present. Set false for bridges that don't need multi-host
/// visibility and want to minimize subscription count.
/// </summary>
public bool RuntimeStatusProbesEnabled { get; set; } = true;
/// <summary>
/// Maximum seconds to wait for the initial probe callback before marking
/// an Unknown host as Stopped. Only applies to the Unknown → Stopped
/// transition; Running hosts do not time out (ScanState is delivered
/// on-change only, so a stable healthy host may go indefinitely without
/// a callback). Default 15s.
/// </summary>
public int RuntimeStatusUnknownTimeoutSeconds { get; set; } = 15;
}
No new top-level config section. Validator emits a warning if the timeout is shorter than 5 seconds (below the reasonable floor for MxAccess initial-resolution latency).
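The validator rule is a warn-not-fail floor check; a sketch (the warning text and method shape are illustrative, not the existing ConfigurationValidator API):

```csharp
public static class RuntimeStatusConfigValidator
{
    public const int TimeoutFloorSeconds = 5;

    // Returns a warning string when the configured timeout is below the floor
    // for MxAccess initial-resolution latency; null when the value is fine.
    public static string? Validate(int runtimeStatusUnknownTimeoutSeconds) =>
        runtimeStatusUnknownTimeoutSeconds < TimeoutFloorSeconds
            ? $"RuntimeStatusUnknownTimeoutSeconds={runtimeStatusUnknownTimeoutSeconds} is below the " +
              $"{TimeoutFloorSeconds}s floor; probes may flap Unknown → Stopped before resolution completes."
            : null;
}
```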
Critical Files
Modified
- src/ZB.MOM.WW.LmxOpcUa.Host/Domain/GalaxyObjectInfo.cs — add CategoryId: int and HostGobjectId: int
- src/ZB.MOM.WW.LmxOpcUa.Host/GalaxyRepository/GalaxyRepositoryService.cs — include template_definition.category_id and gobject.host_gobject_id in HierarchySql and the reader (falling back to contained_by_gobject_id if the host column is unavailable)
- gr/queries/hierarchy.sql — same column additions (documentation query)
- src/ZB.MOM.WW.LmxOpcUa.Host/Configuration/MxAccessConfiguration.cs — add RuntimeStatusProbesEnabled + RuntimeStatusUnknownTimeoutSeconds
- src/ZB.MOM.WW.LmxOpcUa.Host/OpcUa/LmxNodeManager.cs — construct probe manager, wire OnTagValueChanged and the MxAccess monitor callback, build _hostedVariables: Dictionary<int, List<BaseDataVariableState>> during address-space construction, expose RuntimeStatuses / ActiveProbeSubscriptionCount / MarkHostVariablesBadQuality / ClearHostVariablesBadQuality
- src/ZB.MOM.WW.LmxOpcUa.Host/Status/StatusData.cs — add RuntimeStatusInfo; add ProbeSubscriptionCount field on SubscriptionInfo
- src/ZB.MOM.WW.LmxOpcUa.Host/Status/StatusReportService.cs — populate from node manager, render Runtime Status panel + Probes line
- src/ZB.MOM.WW.LmxOpcUa.Host/Status/HealthCheckService.cs — new Rule 2e (after Rule 1 to avoid double-messaging when MxAccess is down)
- src/ZB.MOM.WW.LmxOpcUa.Host/appsettings.json — new MxAccess fields with defaults
- src/ZB.MOM.WW.LmxOpcUa.Host/Configuration/ConfigurationValidator.cs — timeout floor warning
- docs/MxAccessBridge.md — document the probe pattern and on-change semantics
- docs/StatusDashboard.md — add RuntimeStatusInfo field table and Probes line
- docs/Configuration.md — add the two new MxAccess fields
New
- src/ZB.MOM.WW.LmxOpcUa.Host/Domain/GalaxyRuntimeStatus.cs
- src/ZB.MOM.WW.LmxOpcUa.Host/Domain/GalaxyRuntimeState.cs
- src/ZB.MOM.WW.LmxOpcUa.Host/MxAccess/GalaxyRuntimeProbeManager.cs
- tests/ZB.MOM.WW.LmxOpcUa.Tests/MxAccess/GalaxyRuntimeProbeManagerTests.cs
Execution order
- DTO + enum —
GalaxyRuntimeState,GalaxyRuntimeStatus. - Hierarchy schema — add
CategoryIdtoGalaxyObjectInfo, extendHierarchySqlto selecttd.category_idas a new column, updateGalaxyRepositoryServicereader. - Config — add the two new
MxAccessConfigurationfields and validator rule. - Probe manager class + unit tests (TDD) — write
GalaxyRuntimeProbeManagerTests.csfirst. FakeIMxAccessClientwith scriptedOnTagValueChangedinvocations, configurableState, and a fake clock. Exercise the full matrix in the test plan below. - Ship tests green before touching node manager.
- Host-to-variables mapping in node manager — add `_hostedVariables: Dictionary<int, List<BaseDataVariableState>>`, populated during `BuildAddressSpace`. For each variable node, walk its owning object's `HostGobjectId` chain up to the nearest `$WinPlatform` or `$AppEngine` and append to that host's list. On rebuild (`BuildSubtree`), incrementally maintain the map. Expose public `MarkHostVariablesBadQuality(int gobjectId)` and `ClearHostVariablesBadQuality(int gobjectId)` methods that take the node manager `Lock`, iterate the hosted list, set/clear `StatusCode`, and call `ClearChangeMasks(ctx, false)` per variable.
- Node manager wiring — construct `GalaxyRuntimeProbeManager`, pass `MarkHostVariablesBadQuality` / `ClearHostVariablesBadQuality` as its `onHostStopped` / `onHostRunning` callbacks, call `Sync` after `BuildAddressSpace` / rebuild, route `OnTagValueChanged` through `HandleProbeUpdate`, and hook `Tick()` into the MxAccess connection-monitor callback path (fall back to a timed `WaitOne(500ms)` on the dispatch loop if the monitor isn't reachable from the node manager). Add `RuntimeStatuses` and `ActiveProbeSubscriptionCount` accessors. Call `_probeManager?.Dispose()` from `LmxNodeManager.Dispose` before the existing MxAccess client teardown steps.
- OPC UA synthetic nodes — under each `$WinPlatform` and `$AppEngine` node in `BuildAddressSpace`, add the six `$`-prefixed variables backed by lambdas that read from the probe manager snapshot.
- Dashboard — `RuntimeStatusInfo` on `StatusData`, `BuildRuntimeStatusInfo` in `StatusReportService`, render the Runtime Status panel, add a Probes line to the Subscriptions panel. Status tests asserting both.
- Health check — new Rule 2e with a test: Degraded when any host is stopped; the message names the stopped hosts.
- Integration tests — `LmxNodeManagerBuildTests` additions with a fake repository containing mixed `$WinPlatform` and `$AppEngine` hierarchy entries; verify `Sync` is called, synthetic nodes are created on both host types, the `_hostedVariables` map is populated, and `MarkHostVariablesBadQuality` / `ClearHostVariablesBadQuality` flip status codes on the correct subset.
- Docs — `MxAccessBridge.md`, `StatusDashboard.md`, `Configuration.md`.
- Deploy — backup, deploy both instances, verify via dashboard.
- Live verification — see the Verification section below.
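The host-to-variables map and the mark/clear pair can be sketched roughly as below. This is a minimal illustration with stand-in types — `VariableNode` and the numeric status constants replace the OPC UA SDK's `BaseDataVariableState` and `StatusCode`, and the `ClearChangeMasks` notification step is elided — so only the map-and-sweep logic mirrors the plan:

```csharp
using System.Collections.Generic;

// Stand-in for BaseDataVariableState; only the status field matters here.
public sealed class VariableNode
{
    public const uint Good = 0x00000000;
    public const uint BadOutOfService = 0x808D0000; // OPC UA Bad_OutOfService
    public uint StatusCode = Good;
}

public sealed class HostedVariableMap
{
    private readonly object _lock = new();  // stands in for the node manager Lock
    private readonly Dictionary<int, List<VariableNode>> _hostedVariables = new();

    // Called from BuildAddressSpace: append each variable to its owning host's list.
    public void Register(int hostGobjectId, VariableNode variable)
    {
        lock (_lock)
        {
            if (!_hostedVariables.TryGetValue(hostGobjectId, out var list))
                _hostedVariables[hostGobjectId] = list = new List<VariableNode>();
            list.Add(variable);
        }
    }

    // onHostStopped target: force BadOutOfService on every hosted variable.
    public void MarkHostVariablesBadQuality(int hostGobjectId)
        => SetStatus(hostGobjectId, VariableNode.BadOutOfService);

    // onHostRunning target: restore Good; fresh MxAccess updates overwrite anyway.
    public void ClearHostVariablesBadQuality(int hostGobjectId)
        => SetStatus(hostGobjectId, VariableNode.Good);

    private void SetStatus(int hostGobjectId, uint statusCode)
    {
        lock (_lock)
        {
            if (!_hostedVariables.TryGetValue(hostGobjectId, out var list))
                return;                     // unknown host -> no-op, no crash
            foreach (var v in list)
                v.StatusCode = statusCode;  // real code also raises change notifications
        }
    }
}
```

Note the deliberate design choice the sketch preserves: marking is a plain `StatusCode` overwrite with no override layer, so a subsequent normal dispatch-path update naturally wins.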
Test plan
`GalaxyRuntimeProbeManagerTests.cs` — unit tests with fake client + fake clock
State transitions
- Fresh manager → empty snapshot.
- `Sync` with one Platform + one Engine → snapshot contains two entries in `Unknown`, `Kind` set correctly.
- First `ScanState = true` update → Unknown → Running, `LastUpdateTime` and `LastScanState = true` set, `GoodUpdateCount == 1`.
- Second `ScanState = true` update → still Running, counter increments.
- `ScanState = false` update → Running → Stopped, `LastScanState = false`, `FailureCount == 1`.
- `ItemStatus[0].success = false, detail = 2` update → Running → Stopped, `LastError` contains `MX_E_PlatformCommunicationError`.
- Null value delivered → Running → Stopped defensively, `LastError` explains the null-value rejection.
- Recovery `ScanState = true` after Stopped → Stopped → Running, `LastStateChangeTime` updated, `LastError` cleared.
- Platform and AppEngine transitions behave identically (parameterized test).
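The transitions above all reduce to the one classification rule from the Design section, restated here as code. A sketch only — `statusSuccess` stands in for the `ItemStatus.success` flag on the delivered VTQ, and the surrounding per-host entry bookkeeping is elided:

```csharp
public enum HostState { Unknown, Running, Stopped }

public static class ProbeClassifier
{
    // Only a successful callback carrying a boolean true means Running; every
    // other shape (error status, null value, non-bool payload, false) is Stopped.
    public static HostState Classify(bool statusSuccess, object? value)
        => statusSuccess && value is bool b && b ? HostState.Running : HostState.Stopped;
}
```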
Unknown resolution timeout
- No callback + clock advances past timeout → Unknown → Stopped.
- Good update just before timeout → Unknown → Running (no subsequent Stopped).
- Good update after timeout already flipped Unknown → Stopped → Stopped → Running (recovery path still works).
- `Tick` on a Running entry with no recent update → still Running (no starvation check — this is the critical on-change-semantic guarantee).
MxAccess transport gating
- Client `State = Disconnected` → `GetSnapshot` returns all entries with `State = Unknown` regardless of underlying state.
- Client flips Connected → Disconnected → underlying state preserved internally; snapshot reports Unknown.
- Client flips Disconnected → Connected → snapshot reflects the underlying state again.
- Incoming `HandleProbeUpdate` while the client is Disconnected → still updates the underlying state machine (so the snapshot is correct when transport comes back).
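The gating these tests describe can be sketched as follows. Types are hypothetical minimal stand-ins — the real manager keys by GobjectId and carries richer per-host entries — but the invariant is the one under test: the underlying state machine keeps updating regardless of transport, while the snapshot masks everything to Unknown whenever the client is disconnected:

```csharp
using System.Collections.Generic;
using System.Linq;

public enum HostState { Unknown, Running, Stopped }

public sealed class GatedSnapshot
{
    private readonly Dictionary<int, HostState> _underlying = new();

    public bool ClientConnected { get; set; }

    // HandleProbeUpdate path: always applied, even while disconnected.
    public void SetUnderlying(int gobjectId, HostState state)
        => _underlying[gobjectId] = state;

    // Snapshot path: masked to Unknown while the transport is down.
    public IReadOnlyDictionary<int, HostState> GetSnapshot()
        => ClientConnected
            ? new Dictionary<int, HostState>(_underlying)
            : _underlying.ToDictionary(kv => kv.Key, kv => HostState.Unknown);
}
```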
Sync diff behavior
- Sync with new Platform → Advise called once, counter = 1.
- Sync with new Engine → Advise called once, counter = 1.
- Sync twice with same hosts → Advise called once total (idempotent on unchanged entries).
- Sync then Sync with a Platform removed → Unadvise called, snapshot loses entry.
- Sync with different host set → Advise for new, Unadvise for old, unchanged preserved.
- Sync filters out non-runtime categories (areas, user objects) — hierarchy with 10 mixed categories and 2 runtime hosts produces exactly 2 probes.
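One plausible shape for the diff these tests exercise, sketched with a hypothetical `IProbeClient` standing in for the MxAccess advise surface (the real `Sync` also filters the hierarchy down to runtime host categories before diffing):

```csharp
using System.Collections.Generic;
using System.Linq;

public interface IProbeClient
{
    void Advise(string tagRef);
    void Unadvise(string tagRef);
}

public sealed class ProbeSet
{
    private readonly IProbeClient _client;
    private readonly HashSet<string> _advised = new();

    public ProbeSet(IProbeClient client) => _client = client;

    public int ActiveProbeCount => _advised.Count;

    public void Sync(IEnumerable<string> runtimeHostNames)
    {
        // Probe tag per host, per the design: <ObjectName>.ScanState
        var desired = runtimeHostNames.Select(n => $"{n}.ScanState").ToHashSet();

        foreach (var stale in _advised.Except(desired).ToList())
        {
            _client.Unadvise(stale);   // host vanished -> drop its probe
            _advised.Remove(stale);
        }
        foreach (var fresh in desired.Except(_advised).ToList())
        {
            _client.Advise(fresh);     // new host -> advise once
            _advised.Add(fresh);
        }
        // Unchanged hosts are untouched, which is what makes repeat Sync idempotent.
    }
}

// Tiny recording client for illustration.
public sealed class RecordingClient : IProbeClient
{
    public readonly List<string> Advised = new();
    public readonly List<string> Unadvised = new();
    public void Advise(string tagRef) => Advised.Add(tagRef);
    public void Unadvise(string tagRef) => Unadvised.Add(tagRef);
}
```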
Event routing
- `HandleProbeUpdate(probeAddr, ...)` → returns `true`, updates state.
- `HandleProbeUpdate(nonProbeAddr, ...)` → returns `false`, no state change.
- Concurrent `Sync` + `HandleProbeUpdate` under lock → no corruption (thread-safety smoke test).
- Callback arriving after `Sync` removed the entry → `HandleProbeUpdate` returns false (entry not found), no crash.
Counters
- `ActiveProbeCount == 2` after `Sync` with 1 Platform + 1 Engine.
- `ActiveProbeCount` decrements when a host is removed via `Sync`.
- `ActiveProbeCount == 0` on a fresh manager with no `Sync` called yet.
Dispose
- Dispose on a fresh manager → no-op, no Unadvise calls on the fake client.
- Dispose after Sync with 3 hosts → 3 Unadvise + 3 RemoveItem calls on the fake client.
- Dispose twice → second call is idempotent, no extra Unadvise calls.
- HandleProbeUpdate after Dispose → returns false defensively (no crash, no state change).
- Sync after Dispose → no-op or throws ObjectDisposedException (pick one; test documents whichever is chosen).
Subtree invalidation callbacks
- Construct probe manager with spy callbacks tracking `(gobjectId, kind)` tuples for each call.
- Running → Stopped transition → `onHostStopped` invoked exactly once with the correct GobjectId, `onHostRunning` never called.
- Stopped → Running transition → `onHostRunning` invoked exactly once with the correct GobjectId, `onHostStopped` never called.
- Unknown → Running (initial callback) → no invocation of either callback (only Running ↔ Stopped transitions trigger them, not a fresh Unknown → Running).
- Unknown → Stopped (via timeout) → `onHostStopped` invoked once.
- Multiple consecutive callbacks with `ScanState = true` while already Running → no extra `onHostRunning` invocations.
- Multiple consecutive error callbacks while already Stopped → no extra `onHostStopped` invocations.
- Callback throws exception → probe manager logs a warning, updates its internal state regardless, does not propagate.
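The edge-trigger rules this list pins down can be sketched as below. Illustrative only — the real manager folds this into its per-entry state update rather than a separate class — but the branches match the tests: every transition into Stopped fires `onHostStopped`, only Stopped → Running fires `onHostRunning`, Unknown → Running is silent, repeats fire nothing, and a throwing callback never corrupts probe state:

```csharp
using System;

public enum HostState { Unknown, Running, Stopped }

public sealed class TransitionNotifier
{
    private readonly Action<int> _onHostStopped;
    private readonly Action<int> _onHostRunning;

    public TransitionNotifier(Action<int> onHostStopped, Action<int> onHostRunning)
        => (_onHostStopped, _onHostRunning) = (onHostStopped, onHostRunning);

    public void Apply(int gobjectId, ref HostState current, HostState next)
    {
        var previous = current;
        current = next;                   // internal state updates unconditionally
        if (previous == next) return;     // no re-fire on repeated same-state updates

        try
        {
            if (next == HostState.Stopped)
                _onHostStopped(gobjectId);            // Running->Stopped AND Unknown->Stopped
            else if (next == HostState.Running && previous == HostState.Stopped)
                _onHostRunning(gobjectId);            // Unknown->Running is intentionally silent
        }
        catch
        {
            // real code logs a warning; callback failures never propagate
        }
    }
}
```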
`LmxNodeManagerBuildTests` additions
- Build address space with a `$WinPlatform` in the fake hierarchy → probe manager receives a `Sync` call with one entry.
- Build address space with a mix (1 Platform + 2 AppEngines + 5 user objects) → probe manager `Sync` receives exactly 3 runtime hosts.
- Build + rebuild with a different host set → probe manager's `Sync` called twice with the correct diff.
- Address space contains a synthetic `$RuntimeState` variable under each host object node.
- `ActiveProbeSubscriptionCount` reflects the probe count after build.
Host-to-variables mapping + subtree invalidation tests
- Build address space with 1 `$AppEngine` hosting 2 user objects with 3 attributes each → `_hostedVariables[engineId]` contains 6 variable nodes.
- Build address space with 1 `$WinPlatform` hosting 2 `$AppEngine`s, each hosting 3 user objects with 2 attributes each → `_hostedVariables[platformId]` contains the 2 Engine nodes + 12 attribute variables; `_hostedVariables[engineId]` contains its 6 attribute variables. (Platform and Engine entries both exist; a single variable can appear in both lists.)
- Rebuild with a different set → the map is rebuilt from scratch; old entries are released.
- `MarkHostVariablesBadQuality(engineId)` → every variable in `_hostedVariables[engineId]` has `StatusCode = BadOutOfService` after the call; variables hosted by other engines are unchanged.
- `ClearHostVariablesBadQuality(engineId)` → every variable in that host's list has `StatusCode = Good` after the call.
- `MarkHostVariablesBadQuality` on a GobjectId with no entry in the map → no-op, no crash.
- `MarkHostVariablesBadQuality` followed by a fresh MxAccess update on one of the variables → the update's Value + Status overwrites the forced Bad (confirms no "override layer" confusion; the simple StatusCode set is naturally overwritten by the normal dispatch path).
- `MarkHostVariablesBadQuality` acquires the node manager `Lock` (verify no deadlock when called from a thread that also needs the lock).
End-to-end subtree invalidation integration test
- Fake repository with 1 Engine hosting 10 attributes. All on advise. All have some recent value with Good status.
- Simulate probe callback delivering `ScanState = false` for the Engine → probe manager flips to Stopped, invokes `onHostStopped`, which in turn walks the 10 variables and flips them to `BadOutOfService`.
- Assert all 10 variables now report `StatusCode = BadOutOfService` after a `client.Read` round-trip.
- Simulate probe callback delivering `ScanState = true` again → probe manager flips to Running, `onHostRunning` clears the override, and all 10 variables now report `StatusCode = Good`.
`StatusReportServiceTests` additions
- HTML contains `<h2>Galaxy Runtime</h2>` when at least one runtime host is present.
- HTML rendering distinguishes `$WinPlatform` and `$AppEngine` rows in the Kind column.
- JSON exposes `RuntimeStatus.Total`, `RuntimeStatus.RunningCount`, `RuntimeStatus.StoppedCount`, `RuntimeStatus.Hosts[]`.
- Subscriptions panel HTML contains a `Probes:` line when `ProbeSubscriptionCount > 0`.
- No Runtime Status panel when the fake repository has zero runtime hosts.
- When the fake MxAccess client is `Disconnected`, all host rows render `Unknown` regardless of the state passed in.
`HealthCheckServiceTests` additions
- All hosts running → Healthy.
- One host stopped → Degraded, message mentions the stopped host name.
- All hosts stopped → Degraded (not Unhealthy — cached values still served).
- MxAccess disconnected + one host stopped → Unhealthy via Rule 1 (runtime status rule doesn't fire).
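The rule ordering these tests assert can be sketched as below. Names and the message format are illustrative, not the real service's API; the two load-bearing choices are that transport loss (Rule 1) short-circuits first so Rule 2e never double-messages, and that stopped hosts only ever degrade — never fail — overall health, because cached values are still served:

```csharp
using System.Collections.Generic;
using System.Linq;

public enum HostState { Unknown, Running, Stopped }
public enum Health { Healthy, Degraded, Unhealthy }

public static class HealthRollup
{
    public static (Health Health, string Message) Evaluate(
        bool mxAccessConnected, IReadOnlyDictionary<string, HostState> hosts)
    {
        // Rule 1: no transport at all -> Unhealthy, and no further rules run.
        if (!mxAccessConnected)
            return (Health.Unhealthy, "MxAccess disconnected");

        // Rule 2e: any stopped host -> Degraded, message names the stopped hosts.
        var stopped = hosts.Where(h => h.Value == HostState.Stopped)
                           .Select(h => h.Key).ToList();
        if (stopped.Count > 0)
            return (Health.Degraded, $"Galaxy hosts stopped: {string.Join(", ", stopped)}");

        return (Health.Healthy, "OK");
    }
}
```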
Verification
1. `dotnet build` clean on both Host and plugin.
2. `dotnet test tests/ZB.MOM.WW.LmxOpcUa.Tests --filter "FullyQualifiedName~GalaxyRuntimeProbe|FullyQualifiedName~Status|FullyQualifiedName~HealthCheck"` → all pass.
3. Deploy to instance1 (default `RuntimeStatusProbesEnabled: true`, `RuntimeStatusUnknownTimeoutSeconds: 15`). Dashboard shows `Galaxy Runtime: 2 of 2 hosts running` (DevPlatform + DevAppEngine) immediately after startup, all green. Subscriptions panel shows `Probes: 2`.
4. Stop `DevAppEngine` from the IDE (SMC `SetToOffScan` or the engine's stop action, leaving its parent Platform running). Verify:
   - Dashboard panel turns red within ~1s of the action.
   - DevAppEngine row shows `Stopped` with the last good timestamp.
   - DevPlatform row remains `Running` — confirms the engines are independently observable.
   - Overall Health rolls up to `Degraded`.
   - CLI `read ns=3;s=DevAppEngine.$RuntimeState` returns `"Stopped"`.
   - Log has an `Information` line "Galaxy runtime DevAppEngine ($AppEngine) transitioned Running → Stopped".
   - Subtree invalidation: CLI `read ns=3;s=TestMachine_001.MachineID` and any other tag under an object hosted by DevAppEngine returns status code `BadOutOfService` (or whatever specific code the Mark method uses). Every descendant tag, not just a sample — sweep-test via a browse + read across the whole address space. Operators also observe this on the dashboard Alarms / subscribed-variable reads if they're watching any particular value.
   - Client-freeze observation: subscribe an OPC UA client to a handful of variables under DevAppEngine before step 4, then trigger the stop. Note whether the client handles the resulting notification batch cleanly (ideal) or visibly stalls (a residual problem that dispatch suppression would need to address in phase 2). Document the observed behavior in the phase-2 decision gate for dispatch suppression.
5. Start `DevAppEngine` again (`SetToOnScan`). Verify:
   - Dashboard flips back to green within ~1s.
   - CLI read of `$RuntimeState` returns `"Running"`.
   - Log has a "Galaxy runtime DevAppEngine ($AppEngine) transitioned Stopped → Running" line.
   - Subtree recovery: descendant tags previously showing `BadOutOfService` now show `Good` status. Values may initially be stale (whatever was cached at stop time) until fresh on-change MxAccess updates arrive; this matches the design trade-off documented in the Subtree Quality Invalidation section.
6. Stop `DevPlatform` entirely (full platform stop). Verify:
   - Both DevPlatform and DevAppEngine flip to `Stopped` (the Platform takes the Engine down with it).
   - Log records both transitions.
   - CLI reads of `$RuntimeState` for both hosts return `"Stopped"`.
   - The underlying raw `ScanState` attribute reads may return `BadCommunicationError` — the operator sees the distinction between the cached rollup and the live raw attribute.
7. Simulate MxAccess transport loss — e.g., stop the ArchestrA runtime on the local node or kill the probe connection. Verify:
   - Every host row in the Runtime Status panel renders `Unknown` (not Stopped) while the Connection panel reports `Disconnected`.
   - Overall Health is `Unhealthy` via Rule 1, NOT `Degraded` via Rule 2e (the rules should not double-message).
   - After MxAccess reconnects, the runtime rows revert to their actual underlying states.
8. Deploy to instance2 with the same config. Both instances should show consistent state since they observe the same local ArchestrA runtime.
9. Smoke-test: disable probes via `RuntimeStatusProbesEnabled: false`, restart, verify the Runtime Status panel is absent from the HTML, the `Probes:` line is absent from the Subscriptions panel, and no probe subscriptions are advised (log and `ActiveSubscriptionCount` delta) — the backward-compatibility path for deployments that don't want the feature.
10. Unresolvable-probe-tag behavior verification — temporarily add a bogus tag to the probe set to discover how MxAccess surfaces resolution failures. The simplest way is to force the probe manager to advise a made-up `NoSuchPlatform_999.ScanState` reference during a test boot, then observe:
    - Does MxAccess deliver a data-change callback with `ItemStatus[0].success = false` and a resolution-failure detail? If yes, the host row transitions Unknown → Stopped within ~1s via the error-callback path, and `LastError` carries the detail. Tighten the plan's language to say "MxAccess surfaces resolution failures as error callbacks" and optionally tighten `RuntimeStatusUnknownTimeoutSeconds` downward.
    - Or does MxAccess silently drop the advise with no callback at all? If yes, the bogus host stays Unknown until `RuntimeStatusUnknownTimeoutSeconds` elapses, then flips to Stopped via the Unknown-timeout backstop. Tighten the plan's language to say "MxAccess does not surface resolution failures; the Unknown-timeout is the only detection path" and leave the default timeout as-is.
    - Document the observed behavior in `docs/MxAccessBridge.md` alongside the probe pattern section so operators know which detection path their deployment relies on.
    - Remove the bogus tag and restart before handing over.
Open questions (phase 2/3 scope — not blocking phase 1)
- Dispatch suppression for Stopped hosts (phase 2 decision gate) — once phase 1 ships with subtree invalidation, observe whether the client-freeze symptom persists. If it does, design dispatch suppression: filter MxAccess per-tag updates before they hit the dispatch queue when the owning host is Stopped. Requires a `tagRef → owning-host GobjectId` map (which `_hostedVariables` already implies, inverted). The trade-off is dropping legitimate updates during brief probe/reality mismatch windows. Decide after real measurement.
- Should the probe manager expose transition events? A synthetic OPC UA event notifier on each host object that fires when `$RuntimeState` transitions. Phase 2 stretch — operators get per-host polling via the dashboard panel today; events would let clients subscribe without polling.
- Multi-node Galaxies — a Platform on a remote node shows up in the hierarchy, but probes fire through the local MxAccess runtime's node. The probe semantics should still work because MxAccess routes inter-Platform queries transparently, but this is worth confirming during step 4 if the environment has a multi-node Galaxy.
- Is `ScanState` writable? Some Galaxy system attributes are writable via MxAccess (the `SetScan` method on the object), which would let an operator start/stop a host through the OPC UA bridge. Phase 3 possibility — it would require a gating security classification, since it's a runtime control action, not a data write.