Phase 2 PR 13 — port GalaxyRuntimeProbeManager state machine + wire per-platform ScanState probing. PR 8 gave operators the gateway-level transport signal (MxAccessClient.ConnectionStateChanged → OnHostStatusChanged tagged with the Wonderware client identity) — enough to detect when the entire MXAccess COM proxy dies, but silent when a specific or goes down while the gateway stays alive. This PR closes that gap: a pure-logic GalaxyRuntimeProbeManager (ported from v1's 472-LOC GalaxyRuntimeProbeManager.cs, distilled to ~240 LOC of state machine without the OPC UA node-manager entanglement) lives in Backend/Stability/, advises <TagName>.ScanState per WinPlatform + AppEngine gobject discovered during DiscoverAsync, runs the Unknown → Running → Stopped state machine with v1's documented semantics preserved verbatim, and raises a StateChanged event on transitions operators should react to. MxAccessGalaxyBackend instantiates the probe manager in the constructor with SubscribeAsync/UnsubscribeAsync delegate pointers into MxAccessClient, hooks StateChanged to forward each transition through the same OnHostStatusChanged IPC event the gateway signal uses (HostName = platform/engine TagName, RuntimeStatus = 'Running'|'Stopped'|'Unknown', LastObservedUtcUnixMs from the state-change timestamp), so Admin UI gets per-host signals flowing through the existing PR 8 wire with no additional IPC plumbing. State machine rules ported from v1 runtimestatus.md: (a) ScanState is on-change-only — a stably-Running host may go hours without a callback, so Running → Stopped is driven only by explicit ScanState=false, never by starvation; (b) Unknown → Running is a startup transition and does NOT fire StateChanged (would paint every host as 'just recovered' at startup, which is noise and can clear Bad quality set by a concurrently-stopping sibling); (c) Stopped → Running fires StateChanged for the real recovery case; (d) Running → Stopped fires StateChanged; (e) Unknown → Stopped fires StateChanged because that's the first-known-bad signal operators need when a host is down at our startup time. MxAccessGalaxyBackend.DiscoverAsync calls _probeManager.SyncAsync with the runtime-host subset of the hierarchy (CategoryId == 1 WinPlatform or 3 AppEngine) as a best-effort step after building the Discover response — probe failures are swallowed so Discover still returns the hierarchy even if a per-host advise fails; the gateway signal covers the critical rung. SyncAsync is idempotent (second call with the same set is a no-op) and handles the diff on re-Discover for tag rename / host add / host remove. Subscribe failure rolls back the host's state entry under the lock so a later probe callback for a never-advised tag can't transition a phantom state from Unknown to Stopped and fan out a false host-down signal (the same protection v1's GalaxyRuntimeProbeManager had at line 237-243 of v1 with a captured-probe-string comparison under the lock). MxAccessGalaxyBackend.Dispose unsubscribes the StateChanged handler before disposing the probe manager to prevent dangling invocation-list references across reconnects, same discipline as PR 8's ConnectionStateChanged and PR 6's SubscriptionReplayFailed. Tests (12 new GalaxyRuntimeProbeManagerTests): Sync_subscribes_to_ScanState_per_host verifies tag.ScanState subscriptions are advised per Platform/Engine; Sync_is_idempotent_on_repeat_call_with_same_set verifies no duplicate subscribes; Sync_unadvises_removed_hosts verifies the diff unadvises gone hosts; Subscribe_failure_rolls_back_host_entry_so_later_transitions_do_not_fire_stale_events covers the rollback-on-subscribe-fail guard; Unknown_to_Running_does_not_fire_StateChanged preserves the startup-noise rule; Running_to_Stopped_fires_StateChanged_with_both_states asserts OldState and NewState are both captured in the transition record; Stopped_to_Running_fires_StateChanged_for_recovery verifies the recovery case; Unknown_to_Stopped_fires_StateChanged_for_first_known_bad_signal preserves the first-bad rule; Repeated_Good_Running_callbacks_do_not_fire_duplicate_events verifies the state-tracking de-dup; Unknown_callback_for_non_tracked_probe_is_dropped asserts a foreign callback is silently ignored; Snapshot_reports_current_state_for_every_tracked_host covers the dashboard query hook; IsRuntimeHost_recognizes_WinPlatform_and_AppEngine_category_ids asserts the CategoryId filter. Galaxy.Host.Tests Unit suite 75 pass / 0 fail (12 new probe + 63 pre-existing). Galaxy.Host builds clean (0 errors / 0 warnings). Branches off v2.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
@@ -0,0 +1,273 @@
|
||||
using System;
|
||||
using System.Collections.Generic;
|
||||
using System.Linq;
|
||||
using System.Threading.Tasks;
|
||||
using ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Host.Backend.MxAccess;
|
||||
|
||||
namespace ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Host.Backend.Stability;
|
||||
|
||||
/// <summary>
|
||||
/// Per-platform + per-AppEngine runtime probe. Subscribes to <c><TagName>.ScanState</c>
|
||||
/// for each $WinPlatform and $AppEngine gobject, tracks Unknown → Running → Stopped
|
||||
/// transitions, and fires <see cref="StateChanged"/> so <see cref="Backend.MxAccessGalaxyBackend"/>
|
||||
/// can forward per-host events through the existing IPC <c>OnHostStatusChanged</c> event.
|
||||
/// Pure-logic state machine with an injected clock so it's deterministically testable —
|
||||
/// port of v1 <c>GalaxyRuntimeProbeManager</c> without the OPC UA node-manager coupling.
|
||||
/// </summary>
|
||||
/// <remarks>
|
||||
/// State machine rules (documented in v1's <c>runtimestatus.md</c> and preserved here):
|
||||
/// <list type="bullet">
|
||||
/// <item><c>ScanState</c> is on-change-only — a stably-Running host may go hours without a
|
||||
/// callback. Running → Stopped is driven by an explicit <c>ScanState=false</c> callback,
|
||||
/// never by starvation.</item>
|
||||
/// <item>Unknown → Running is a startup transition and does NOT fire StateChanged (would
|
||||
/// paint every host as "just recovered" at startup, which is noise).</item>
|
||||
/// <item>Stopped → Running and Running → Stopped fire StateChanged. Unknown → Stopped
|
||||
/// fires StateChanged because that's a first-known-bad signal operators need.</item>
|
||||
/// <item>All public methods are thread-safe. Callbacks fire outside the internal lock to
|
||||
/// avoid lock inversion with caller-owned state.</item>
|
||||
/// </list>
|
||||
/// </remarks>
|
||||
public sealed class GalaxyRuntimeProbeManager : IDisposable
|
||||
{
|
||||
public const int CategoryWinPlatform = 1;
|
||||
public const int CategoryAppEngine = 3;
|
||||
public const string ProbeAttribute = ".ScanState";
|
||||
|
||||
private readonly Func<DateTime> _clock;
|
||||
private readonly Func<string, Action<string, Vtq>, Task> _subscribe;
|
||||
private readonly Func<string, Task> _unsubscribe;
|
||||
private readonly object _lock = new();
|
||||
|
||||
// probe tag → per-host state
|
||||
private readonly Dictionary<string, HostProbeState> _byProbe = new(StringComparer.OrdinalIgnoreCase);
|
||||
// tag name → probe tag (for reverse lookup on the desired-set diff)
|
||||
private readonly Dictionary<string, string> _probeByTagName = new(StringComparer.OrdinalIgnoreCase);
|
||||
private bool _disposed;
|
||||
|
||||
/// <summary>
|
||||
/// Fires on every state transition that operators should react to. See class remarks
|
||||
/// for the rules on which transitions fire.
|
||||
/// </summary>
|
||||
public event EventHandler<HostStateTransition>? StateChanged;
|
||||
|
||||
public GalaxyRuntimeProbeManager(
|
||||
Func<string, Action<string, Vtq>, Task> subscribe,
|
||||
Func<string, Task> unsubscribe)
|
||||
: this(subscribe, unsubscribe, () => DateTime.UtcNow) { }
|
||||
|
||||
internal GalaxyRuntimeProbeManager(
|
||||
Func<string, Action<string, Vtq>, Task> subscribe,
|
||||
Func<string, Task> unsubscribe,
|
||||
Func<DateTime> clock)
|
||||
{
|
||||
_subscribe = subscribe ?? throw new ArgumentNullException(nameof(subscribe));
|
||||
_unsubscribe = unsubscribe ?? throw new ArgumentNullException(nameof(unsubscribe));
|
||||
_clock = clock ?? throw new ArgumentNullException(nameof(clock));
|
||||
}
|
||||
|
||||
/// <summary>Number of probes currently advised. Test/dashboard hook.</summary>
|
||||
public int ActiveProbeCount
|
||||
{
|
||||
get { lock (_lock) return _byProbe.Count; }
|
||||
}
|
||||
|
||||
/// <summary>
|
||||
/// Snapshot every currently-tracked host's state. One entry per probe.
|
||||
/// </summary>
|
||||
public IReadOnlyList<HostProbeSnapshot> SnapshotStates()
|
||||
{
|
||||
lock (_lock)
|
||||
{
|
||||
return _byProbe.Select(kv => new HostProbeSnapshot(
|
||||
TagName: kv.Value.TagName,
|
||||
State: kv.Value.State,
|
||||
LastChangedUtc: kv.Value.LastStateChangeUtc)).ToList();
|
||||
}
|
||||
}
|
||||
|
||||
/// <summary>
|
||||
/// Query the current runtime state for <paramref name="tagName"/>. Returns
|
||||
/// <see cref="HostRuntimeState.Unknown"/> when the host is not tracked.
|
||||
/// </summary>
|
||||
public HostRuntimeState GetState(string tagName)
|
||||
{
|
||||
lock (_lock)
|
||||
{
|
||||
if (_probeByTagName.TryGetValue(tagName, out var probe)
|
||||
&& _byProbe.TryGetValue(probe, out var state))
|
||||
return state.State;
|
||||
return HostRuntimeState.Unknown;
|
||||
}
|
||||
}
|
||||
|
||||
/// <summary>
|
||||
/// Diff the desired host set (filtered $WinPlatform / $AppEngine from the latest Discover)
|
||||
/// against the currently-tracked set and advise / unadvise as needed. Idempotent:
|
||||
/// calling twice with the same set does nothing.
|
||||
/// </summary>
|
||||
public async Task SyncAsync(IEnumerable<HostProbeTarget> desiredHosts)
|
||||
{
|
||||
if (_disposed) return;
|
||||
|
||||
var desired = desiredHosts
|
||||
.Where(h => !string.IsNullOrWhiteSpace(h.TagName))
|
||||
.ToDictionary(h => h.TagName, StringComparer.OrdinalIgnoreCase);
|
||||
|
||||
List<string> toAdvise;
|
||||
List<string> toUnadvise;
|
||||
lock (_lock)
|
||||
{
|
||||
toAdvise = desired.Keys
|
||||
.Where(tag => !_probeByTagName.ContainsKey(tag))
|
||||
.ToList();
|
||||
toUnadvise = _probeByTagName.Keys
|
||||
.Where(tag => !desired.ContainsKey(tag))
|
||||
.Select(tag => _probeByTagName[tag])
|
||||
.ToList();
|
||||
|
||||
foreach (var tag in toAdvise)
|
||||
{
|
||||
var probe = tag + ProbeAttribute;
|
||||
_probeByTagName[tag] = probe;
|
||||
_byProbe[probe] = new HostProbeState
|
||||
{
|
||||
TagName = tag,
|
||||
State = HostRuntimeState.Unknown,
|
||||
LastStateChangeUtc = _clock(),
|
||||
};
|
||||
}
|
||||
|
||||
foreach (var probe in toUnadvise)
|
||||
{
|
||||
_byProbe.Remove(probe);
|
||||
}
|
||||
|
||||
foreach (var removedTag in _probeByTagName.Keys.Where(t => !desired.ContainsKey(t)).ToList())
|
||||
{
|
||||
_probeByTagName.Remove(removedTag);
|
||||
}
|
||||
}
|
||||
|
||||
foreach (var tag in toAdvise)
|
||||
{
|
||||
var probe = tag + ProbeAttribute;
|
||||
try
|
||||
{
|
||||
await _subscribe(probe, OnProbeCallback);
|
||||
}
|
||||
catch
|
||||
{
|
||||
// Rollback on subscribe failure so a later Tick can't transition a never-advised
|
||||
// probe into a false Stopped state. Callers can re-Sync later to retry.
|
||||
lock (_lock)
|
||||
{
|
||||
_byProbe.Remove(probe);
|
||||
_probeByTagName.Remove(tag);
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
foreach (var probe in toUnadvise)
|
||||
{
|
||||
try { await _unsubscribe(probe); } catch { /* best-effort cleanup */ }
|
||||
}
|
||||
}
|
||||
|
||||
/// <summary>
|
||||
/// Public entry point for tests and internal callbacks. Production flow: MxAccessClient's
|
||||
/// SubscribeAsync delivers VTQ updates through the callback wired in <see cref="SyncAsync"/>,
|
||||
/// which calls this method under the lock to update state and fires
|
||||
/// <see cref="StateChanged"/> outside the lock for any transition that matters.
|
||||
/// </summary>
|
||||
public void OnProbeCallback(string probeTag, Vtq vtq)
|
||||
{
|
||||
if (_disposed) return;
|
||||
|
||||
HostStateTransition? transition = null;
|
||||
lock (_lock)
|
||||
{
|
||||
if (!_byProbe.TryGetValue(probeTag, out var state)) return;
|
||||
|
||||
var isRunning = vtq.Quality >= 192 && vtq.Value is bool b && b;
|
||||
var now = _clock();
|
||||
var previous = state.State;
|
||||
state.LastCallbackUtc = now;
|
||||
|
||||
if (isRunning)
|
||||
{
|
||||
state.GoodUpdateCount++;
|
||||
if (previous != HostRuntimeState.Running)
|
||||
{
|
||||
state.State = HostRuntimeState.Running;
|
||||
state.LastStateChangeUtc = now;
|
||||
if (previous == HostRuntimeState.Stopped)
|
||||
{
|
||||
transition = new HostStateTransition(state.TagName, previous, HostRuntimeState.Running, now);
|
||||
}
|
||||
}
|
||||
}
|
||||
else
|
||||
{
|
||||
state.FailureCount++;
|
||||
if (previous != HostRuntimeState.Stopped)
|
||||
{
|
||||
state.State = HostRuntimeState.Stopped;
|
||||
state.LastStateChangeUtc = now;
|
||||
transition = new HostStateTransition(state.TagName, previous, HostRuntimeState.Stopped, now);
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
if (transition is { } t)
|
||||
{
|
||||
StateChanged?.Invoke(this, t);
|
||||
}
|
||||
}
|
||||
|
||||
public void Dispose()
|
||||
{
|
||||
if (_disposed) return;
|
||||
_disposed = true;
|
||||
lock (_lock)
|
||||
{
|
||||
_byProbe.Clear();
|
||||
_probeByTagName.Clear();
|
||||
}
|
||||
}
|
||||
|
||||
private sealed class HostProbeState
|
||||
{
|
||||
public string TagName { get; set; } = "";
|
||||
public HostRuntimeState State { get; set; }
|
||||
public DateTime LastStateChangeUtc { get; set; }
|
||||
public DateTime? LastCallbackUtc { get; set; }
|
||||
public long GoodUpdateCount { get; set; }
|
||||
public long FailureCount { get; set; }
|
||||
}
|
||||
}
|
||||
|
||||
public enum HostRuntimeState
|
||||
{
|
||||
Unknown,
|
||||
Running,
|
||||
Stopped,
|
||||
}
|
||||
|
||||
public sealed record HostStateTransition(
|
||||
string TagName,
|
||||
HostRuntimeState OldState,
|
||||
HostRuntimeState NewState,
|
||||
DateTime AtUtc);
|
||||
|
||||
public sealed record HostProbeSnapshot(
|
||||
string TagName,
|
||||
HostRuntimeState State,
|
||||
DateTime LastChangedUtc);
|
||||
|
||||
public readonly record struct HostProbeTarget(string TagName, int CategoryId)
|
||||
{
|
||||
public bool IsRuntimeHost =>
|
||||
CategoryId == GalaxyRuntimeProbeManager.CategoryWinPlatform
|
||||
|| CategoryId == GalaxyRuntimeProbeManager.CategoryAppEngine;
|
||||
}
|
||||
Reference in New Issue
Block a user