Phase 2 PR 8 — gateway-level host-status push from MxAccessGalaxyBackend #7

Merged
dohertj2 merged 1 commit from phase-2-pr8-alarms-hoststatus into v2 2026-04-18 06:59:06 -04:00

Summary

Wire gateway-level host-status push from MxAccessGalaxyBackend. PR 4 built the IPC infrastructure for OnHostStatusChanged (MessageKind.RuntimeStatusChange + ConnectionSink) but no backend raised the event. This PR closes the top-level signal:

  • MxAccessClient.ConnectionStateChanged (already fires on Register / Unregister / reconnect) now drives OnHostStatusChanged with HostName=ClientName, RuntimeStatus="Running"/"Stopped", LastObservedUtcUnixMs=now.
  • The Admin UI's existing GalaxyProxyDriver.OnHostConnectivityUpdate already parses the string status into the HostState enum and fires HostStatusChangedEventArgs downstream, so this single backend-side wire-up produces an end-to-end signal with no further Proxy changes.
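
The two bullets above describe a mapping in each direction: a connection flag becomes a string status on the backend, and the string becomes an enum on the proxy. A minimal sketch of that round trip follows. Every type here is a stand-in written for illustration, not the repository's code. Only the property names HostName, RuntimeStatus and LastObservedUtcUnixMs and the status strings come from the PR text; the helper names, the bool connection flag and the Unknown fallback are assumptions.

```csharp
using System;

// Stand-in for the IPC payload. The three property names come from the PR
// text; the class shape itself is an assumption.
public sealed class HostConnectivityStatus
{
    public HostConnectivityStatus(string hostName, string runtimeStatus, long lastObservedUtcUnixMs)
    {
        HostName = hostName;
        RuntimeStatus = runtimeStatus;
        LastObservedUtcUnixMs = lastObservedUtcUnixMs;
    }

    public string HostName { get; }
    public string RuntimeStatus { get; }
    public long LastObservedUtcUnixMs { get; }
}

// Stand-in for the HostState enum. Running, Stopped and Faulted are the
// status strings named in the commit message; Unknown is an assumed fallback.
public enum HostState { Unknown, Running, Stopped, Faulted }

// Hypothetical helper, not part of the real code base.
public static class HostStatusMapping
{
    // Backend side: translate a connection flag into the pushed status.
    public static HostConnectivityStatus FromConnectionState(
        string clientName, bool connected, DateTimeOffset observedAt)
    {
        return new HostConnectivityStatus(
            clientName,
            connected ? "Running" : "Stopped",
            observedAt.ToUnixTimeMilliseconds());
    }

    // Proxy side: parse the string back into the enum. Enum.TryParse also
    // accepts numeric strings, so IsDefined rejects values outside the enum.
    public static HostState Parse(string runtimeStatus)
    {
        return Enum.TryParse<HostState>(runtimeStatus, false, out var state)
               && Enum.IsDefined(typeof(HostState), state)
            ? state
            : HostState.Unknown;
    }
}

public static class Program
{
    public static void Main()
    {
        var up = HostStatusMapping.FromConnectionState("GatewayClient", true, DateTimeOffset.UtcNow);
        var down = HostStatusMapping.FromConnectionState("GatewayClient", false, DateTimeOffset.UtcNow);
        Console.WriteLine($"{up.HostName}: {up.RuntimeStatus} parses to {HostStatusMapping.Parse(up.RuntimeStatus)}");
        Console.WriteLine($"{down.HostName}: {down.RuntimeStatus} parses to {HostStatusMapping.Parse(down.RuntimeStatus)}");
    }
}
```

Because the status crosses the IPC boundary as a string, the proxy can keep its existing parse path unchanged. The cost is that an unrecognised string has to be handled explicitly, which the sketch does by falling back to Unknown.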

What's in scope

  • MxAccessClient.ClientName property — exposes the previously-private _clientName so the backend can tag its pushes with a stable gateway identity.
  • MxAccessGalaxyBackend constructor subscribes _onConnectionStateChanged before returning; Dispose unsubscribes to prevent a dangling handler on the MxAccessClient.
  • #pragma warning disable CS0067 narrowed to cover only OnAlarmEvent (alarm wire-up is PR 9). OnHostStatusChanged is now raised.
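
The subscribe-in-constructor, unsubscribe-in-Dispose discipline and the narrowed pragma can be sketched as follows. FakeMxClient and BackendSketch are hypothetical stand-ins, not the real MxAccessClient or MxAccessGalaxyBackend. The bool event payload and the plain string status are simplifications assumed for the example.

```csharp
#nullable enable
using System;

// Hypothetical stand-in for MxAccessClient. Only ClientName and
// ConnectionStateChanged are named in the PR; the bool payload is assumed.
public sealed class FakeMxClient
{
    public FakeMxClient(string clientName) { ClientName = clientName; }

    public string ClientName { get; }

    public event EventHandler<bool>? ConnectionStateChanged;

    // Test hook that simulates a connect or disconnect transition.
    public void SetConnected(bool connected)
    {
        ConnectionStateChanged?.Invoke(this, connected);
    }
}

// Hypothetical stand-in for the backend; the status is reduced to a string.
public sealed class BackendSketch : IDisposable
{
    private readonly FakeMxClient _mx;
    private readonly EventHandler<bool> _onConnectionStateChanged;

    public event EventHandler<string>? OnHostStatusChanged;

    // The suppression covers only the event that is still never raised.
#pragma warning disable CS0067
    public event EventHandler<string>? OnAlarmEvent;
#pragma warning restore CS0067

    public BackendSketch(FakeMxClient mx)
    {
        _mx = mx;
        // Keep the delegate in a field so Dispose removes the same instance.
        _onConnectionStateChanged = (_, connected) =>
            OnHostStatusChanged?.Invoke(this, connected ? "Running" : "Stopped");
        _mx.ConnectionStateChanged += _onConnectionStateChanged;
    }

    public void Dispose()
    {
        _mx.ConnectionStateChanged -= _onConnectionStateChanged;
    }
}

public static class Program
{
    public static void Main()
    {
        var mx = new FakeMxClient("GatewayClient");
        var backend = new BackendSketch(mx);
        var seen = 0;
        backend.OnHostStatusChanged += (_, status) =>
        {
            seen++;
            Console.WriteLine(status);
        };

        mx.SetConnected(true);   // forwarded as "Running"
        mx.SetConnected(false);  // forwarded as "Stopped"
        backend.Dispose();
        mx.SetConnected(false);  // not forwarded: the handler was removed
        Console.WriteLine($"notifications: {seen}");  // prints "notifications: 2"
    }
}
```

Storing the handler in a field matters because `-=` removes the handler only when it is given a delegate equal to the one that was added. A fresh lambda written inline in Dispose would be a different delegate instance and would leave the original subscription in place.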

What's deferred

  • Per-platform + per-AppEngine ScanState probes. v1's GalaxyRuntimeProbeManager is a 472 LOC state machine (Unknown → Running → Stopped, with an on-change-only delivery quirk) and deserves its own PR. The gateway signal this PR adds covers the most operationally important rung (the Galaxy COM proxy itself dying), which is what wakes operators up at 3am.

Tests

  • ConnectionStateChanged_raises_OnHostStatusChanged_with_gateway_name — fires Connect → Disconnect, asserts 2 notifications with HostName="GatewayClient", status "Running" then "Stopped", timestamp > 0.
  • Dispose_unsubscribes_so_post_dispose_state_changes_do_not_fire_events: asserts that after backend.Dispose() a subsequent mx.DisconnectAsync does not bump the notification count on a registered OnHostStatusChanged handler.

Test plan

  • dotnet build src/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Host/ — 0 errors, 0 warnings
  • dotnet test .../Host.Tests/ --filter "Category=Unit" — 26/26 pass (2 new + 24 pre-existing)
  • Reviewer: on a machine with ArchestrA runtime, start the Host, kill the aaBootstrap service, and observe the Admin UI transition the gateway's status from Running → Stopped within the MonitorInterval window.

Merge order

Branches off phase-2-pr5-historian because PR 5 already added the IHistorianDataSource? historian constructor parameter and the Dispose implementation now coordinates two unsubscribes. The lib/ArchestrA.MxAccess.dll test reference duplicates the one PR 6 adds — git will collapse to one when either PR lands first.


🤖 Generated with Claude Code

dohertj2 changed target branch from phase-2-pr5-historian to v2 2026-04-18 06:58:44 -04:00
dohertj2 added 1 commit 2026-04-18 06:58:44 -04:00
Phase 2 PR 8: wire gateway-level host-status push from MxAccessGalaxyBackend.

PR 4 built the IPC infrastructure for OnHostStatusChanged (MessageKind.RuntimeStatusChange frame + ConnectionSink forwarding through FrameWriter) but no backend actually raised the event; the #pragma warning disable CS0067 around MxAccessGalaxyBackend.OnHostStatusChanged declared the event for interface symmetry while acknowledging the wire-up was Phase 2 follow-up.

This PR closes the gateway-level signal: MxAccessClient.ConnectionStateChanged (already raised on false→true Register and true→false Unregister transitions, including the reconnect path in MonitorLoopAsync) now drives OnHostStatusChanged with a synthetic HostConnectivityStatus tagged HostName=MxAccessClient.ClientName, RuntimeStatus="Running" on reconnect + "Stopped" on disconnect, LastObservedUtcUnixMs set to the transition moment. The Admin UI's existing IHostConnectivityProbe subscriber on GalaxyProxyDriver (HostStatusChangedEventArgs) already handles the full translation: OnHostConnectivityUpdate parses "Running"/"Stopped"/"Faulted" into the Core.Abstractions HostState enum and fires OnHostStatusChanged downstream, so this single backend-side event wire-up produces an end-to-end signal with no further Proxy changes required.

Per-platform and per-AppEngine ScanState probing (the 472 LOC GalaxyRuntimeProbeManager state machine in v1 that advises <Host>.ScanState on every deployed $WinPlatform + $AppEngine gobject, tracks Unknown → Running → Stopped transitions, handles the on-change-only delivery quirk of ScanState, and surfaces IsHostStopped(gobjectId) for the node manager's Read path to short-circuit on-demand reads against known-stopped runtimes) remains deferred to a follow-up PR. The gateway-level signal gives operators the top-level transport-health rung of the status ladder, which is what matters when the Galaxy COM proxy itself goes down (vs a specific platform going down).

MxAccessClient.ClientName property exposes the previously-private _clientName field so the backend can tag its pushes with a stable gateway identity. Operators configure this via the OTOPCUA_GALAXY_CLIENT_NAME env var (default "OtOpcUa-Galaxy.Host" per Program.cs).

MxAccessGalaxyBackend constructor subscribes the new _onConnectionStateChanged field before returning + Dispose unsubscribes it via _mx.ConnectionStateChanged -= _onConnectionStateChanged to prevent the backend's own dispose from leaving a dangling handler on the MxAccessClient (same shape as MxAccessClient.SubscriptionReplayFailed PR 6 dispose discipline).

#pragma warning disable CS0067 removed from around OnHostStatusChanged since the event is now raised; the directive is narrowed to cover only OnAlarmEvent, which stays unraised pending the alarm subsystem port (PR 9 candidate).

Tests: HostStatusPushTests (new, 2 cases). ConnectionStateChanged_raises_OnHostStatusChanged_with_gateway_name fires mx.ConnectAsync → mx.DisconnectAsync and asserts two notifications in order with HostName="GatewayClient" (the clientName passed to MxAccessClient ctor), RuntimeStatus="Running" then "Stopped", LastObservedUtcUnixMs > 0. Dispose_unsubscribes_so_post_dispose_state_changes_do_not_fire_events asserts that after backend.Dispose() a subsequent mx.DisconnectAsync does not bump the count on a registered OnHostStatusChanged handler; this guards against the subscription-leak regression where a lingering backend instance would accumulate cross-reconnect notifications for a dead writer.

Host.Tests csproj gains a Reference to lib/ArchestrA.MxAccess.dll (identical to the reference PR 6 adds, so cherry-pick/merge is conflict-free since both PRs stage the same <Reference> node; git will collapse to one when either lands first).

Full Galaxy.Host.Tests Unit suite: 26 pass / 0 fail (2 new host-status + 9 PR5 historian + 15 pre-existing PostMortemMmf/RecyclePolicy/StaPump/MemoryWatchdog/EndToEndIpc/Handshake). Galaxy.Host builds clean (0 errors, 0 warnings).

Branch base: PR 8 is on phase-2-pr5-historian rather than phase-2-pr4-findings because the constructor path on MxAccessGalaxyBackend gained a new historian parameter in PR 5 and the Dispose implementation needs to coordinate the two unsubscribes; targeting the earlier base would leave a trivial conflict on Dispose.

30ece6e22c
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
dohertj2 merged commit 90ce0af375 into v2 2026-04-18 06:59:06 -04:00
dohertj2 referenced this issue from a commit 2026-04-18 13:14:03 -04:00
Phase 3 PR 27 — Fleet status dashboard page. New /fleet route shows per-node apply state (ClusterNodeGenerationState joined with ClusterNode for the ClusterId) in a sortable table with summary cards for Total / Applied / Stale / Failed node counts. Stale detection: LastSeenAt older than 30s triggers a table-warning row class + yellow count card. Failed rows get table-danger + red card. Badge classes per LastAppliedStatus: Applied=bg-success, Failed=bg-danger, Applying=bg-info, unknown=bg-secondary. Timestamps rendered as relative-age strings ('42s ago', '15m ago', '3h ago', then absolute date for >24h). Error column is truncated to 320px with the full message in a tooltip so the table stays readable on wide fleets. Initial data load on OnInitializedAsync; auto-refresh every 5s via a Timer that calls InvokeAsync(RefreshAsync) — matches the FleetStatusPoller's 5s cadence so the dashboard sees the most recent state without polling ahead of the broadcaster. A Refresh button also kicks a manual reload; _refreshing gate prevents double-runs when the timer fires during an in-flight query. IServiceScopeFactory (matches FleetStatusPoller's pattern) creates a fresh DI scope per refresh so the per-page DbContext can't race the timer with the render thread; no new DI registrations needed. Live SignalR hub push is deliberately deferred to a follow-up PR — the existing FleetStatusHub + NodeStateChangedMessage already works for external JavaScript clients; wiring an in-process Blazor Server consumer adds HubConnectionBuilder plumbing that's worth its own focused change. Sidebar link added to MainLayout between Overview and Clusters. Full Admin.Tests Unit suite 14 pass / 0 fail — unchanged, no tests regressed. Full Admin build clean (0 errors, 0 warnings). 
Closes the 'no per-driver dashboard' gap from lmx-followups item #7 at the fleet level; per-host (platform/engine/Modbus PLC) granularity still needs a dedicated page that consumes IHostConnectivityProbe.GetHostStatuses from the Server process — that's the live-SignalR follow-up.
dohertj2 referenced this issue from a commit 2026-04-18 15:53:31 -04:00
Phase 3 PR 34 — Host-status publisher (Server) + /hosts drill-down page (Admin). Closes LMX follow-up #7 by wiring together the data layer from PR 33. Server.HostStatusPublisher is a BackgroundService that walks every driver registered in DriverHost every 10 seconds, skips drivers that don't implement IHostConnectivityProbe, calls GetHostStatuses() on each probe-capable driver, and upserts one DriverHostStatus row per (NodeId, DriverInstanceId, HostName) into the central config DB. Upsert path: SingleOrDefaultAsync on the composite PK; if no row exists, Add a new one; if a row exists, LastSeenUtc advances unconditionally (heartbeat) and State + StateChangedUtc update only on transitions so Admin UI can distinguish 'still reporting, still Running' from 'freshly transitioned to Running'. MapState translates Core.Abstractions.HostState to Configuration.Enums.DriverHostState (intentional duplicate enum — Configuration project stays free of driver-runtime deps per PR 33's choice). If a driver's GetHostStatuses throws, log warning and skip that driver this tick — never take down the Server on a publisher failure. If the DB is unreachable, log warning + retry next heartbeat (no buffering — next tick's current-state snapshot is more useful than replaying stale transitions after a long outage). 2-second startup delay so NodeBootstrap's RegisterAsync calls land before the first publish tick, then tick runs immediately so a freshly-started Server surfaces its host topology in the Admin UI without waiting a full interval.
Reference: dohertj2/lmxopcua#7