Task #219 follow-up — close AlarmConditionState child-NodeId + Part 9 event-propagation gaps #198

Merged
dohertj2 merged 1 commits from task-219-followup-alarm-wiring into v2 2026-04-21 00:24:43 -04:00
Owner

Summary

PR #197 surfaced two integration-level wiring gaps in DriverNodeManager's MarkAsAlarmCondition path. This PR fixes both and upgrades the alarm integration test to assert the full dispatch path end-to-end — no more scope-outs.

Fix 1 — addressable child nodes

AlarmConditionState inherits ~50 typed children (Severity / Message / ActiveState / AckedState / EnabledState / …). The stack was leaving them with Foundation-namespace NodeIds (type-declaration defaults) or shared ns=0 counter allocations, so client Read on a child returned BadNodeIdUnknown. Pass assignNodeIds=true to alarm.Create, then walk the condition subtree and rewrite each descendant's NodeId symbolically as {condition-full-ref}.{symbolic-path} in the node manager's namespace. Stable, unique, and collision-free across multiple alarm instances in the same driver.

Fix 2 — event propagation to Server.EventNotifier

OPC UA Part 9 event propagation relies on the alarm condition being reachable from Objects/Server via HasNotifier. Call CustomNodeManager2.AddRootNotifier(alarm) after registering the condition so subscriptions placed on Server-object EventNotifier receive the ReportEvent calls ConditionSink emits per-transition.

Test upgrades in AlarmSubscribeIntegrationTests

  • Driver_alarm_transition_updates_server_side_AlarmConditionState_node — now asserts Severity == 700, message text, and ActiveState.Id == true through the OPC UA client (previously scoped out as BadNodeIdUnknown).
  • New: Driver_alarm_event_flows_to_client_subscription_on_Server_EventNotifier subscribes an OPC UA event monitor on ObjectIds.Server, fires a driver transition, and waits for the AlarmConditionType event to be delivered — asserting Message + Severity fields. Previously scoped out as 'Part 9 event propagation out of reach.'

Test plan

  • dotnet test tests/ZB.MOM.WW.OtOpcUa.Server.Tests → 239 passed, 0 failed (+1 new event-subscription test).
  • dotnet test tests/ZB.MOM.WW.OtOpcUa.Core.Tests → 195 passed, 0 failed.

🤖 Generated with Claude Code

## Summary PR #197 surfaced two integration-level wiring gaps in `DriverNodeManager`'s `MarkAsAlarmCondition` path. This PR fixes both and upgrades the alarm integration test to assert the full dispatch path end-to-end — no more scope-outs. ### Fix 1 — addressable child nodes `AlarmConditionState` inherits ~50 typed children (`Severity` / `Message` / `ActiveState` / `AckedState` / `EnabledState` / …). The stack was leaving them with Foundation-namespace NodeIds (type-declaration defaults) or shared `ns=0` counter allocations, so client `Read` on a child returned `BadNodeIdUnknown`. Pass `assignNodeIds=true` to `alarm.Create`, then walk the condition subtree and rewrite each descendant's NodeId symbolically as `{condition-full-ref}.{symbolic-path}` in the node manager's namespace. Stable, unique, and collision-free across multiple alarm instances in the same driver. ### Fix 2 — event propagation to `Server.EventNotifier` OPC UA Part 9 event propagation relies on the alarm condition being reachable from `Objects/Server` via `HasNotifier`. Call `CustomNodeManager2.AddRootNotifier(alarm)` after registering the condition so subscriptions placed on Server-object `EventNotifier` receive the `ReportEvent` calls `ConditionSink` emits per-transition. ### Test upgrades in `AlarmSubscribeIntegrationTests` - `Driver_alarm_transition_updates_server_side_AlarmConditionState_node` — now asserts `Severity == 700`, message text, and `ActiveState.Id == true` through the OPC UA client (previously scoped out as `BadNodeIdUnknown`). - **New**: `Driver_alarm_event_flows_to_client_subscription_on_Server_EventNotifier` subscribes an OPC UA event monitor on `ObjectIds.Server`, fires a driver transition, and waits for the `AlarmConditionType` event to be delivered — asserting `Message` + `Severity` fields. Previously scoped out as 'Part 9 event propagation out of reach.' ## Test plan - [x] `dotnet test tests/ZB.MOM.WW.OtOpcUa.Server.Tests` → 239 passed, 0 failed (+1 new event-subscription test). - [x] `dotnet test tests/ZB.MOM.WW.OtOpcUa.Core.Tests` → 195 passed, 0 failed. 🤖 Generated with [Claude Code](https://claude.com/claude-code)
dohertj2 added 1 commit 2026-04-21 00:24:13 -04:00
PR #197 surfaced two integration-level wiring gaps in DriverNodeManager's
MarkAsAlarmCondition path; this commit fixes both and upgrades the integration
test to assert them end-to-end.

Fix 1 — addressable child nodes: AlarmConditionState inherits ~50 typed children
(Severity / Message / ActiveState / AckedState / EnabledState / …). The stack
was leaving them with Foundation-namespace NodeIds (type-declaration defaults) or
shared ns=0 counter allocations, so client Read on a child returned
BadNodeIdUnknown. Pass assignNodeIds=true to alarm.Create, then walk the condition
subtree and rewrite each descendant's NodeId symbolically as
  {condition-full-ref}.{symbolic-path}
in the node manager's namespace. Stable, unique, and collision-free across
multiple alarm instances in the same driver.

Fix 2 — event propagation to Server.EventNotifier: OPC UA Part 9 event
propagation relies on the alarm condition being reachable from Objects/Server
via HasNotifier. Call CustomNodeManager2.AddRootNotifier(alarm) after registering
the condition so subscriptions placed on Server-object EventNotifier receive the
ReportEvent calls ConditionSink emits per-transition.

Test upgrades in AlarmSubscribeIntegrationTests:
  - Driver_alarm_transition_updates_server_side_AlarmConditionState_node — now
    asserts Severity == 700, Message text, and ActiveState.Id == true through
    the OPC UA client (previously scoped out as BadNodeIdUnknown).
  - New: Driver_alarm_event_flows_to_client_subscription_on_Server_EventNotifier
    subscribes an OPC UA event monitor on ObjectIds.Server, fires a driver
    transition, and waits for the AlarmConditionType event to be delivered,
    asserting Message + Severity fields. Previously scoped out as "Part 9 event
    propagation out of reach."

Regression checks: 239 server tests pass (+1 new event-subscription test),
195 Core tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
dohertj2 merged commit 6863cc4652 into v2 2026-04-21 00:24:43 -04:00
dohertj2 referenced this issue from a commit 2026-04-30 08:21:26 -04:00
Admin RedundancyTab — per-cluster read-only topology view. Closes the UI slice of task #149 (Phase 6.3 Stream E — Admin UI RedundancyTab + OpenTelemetry metrics + SignalR); the OpenTelemetry metrics + RoleChanged SignalR push are split into new follow-up task #198 because each is a structural add that deserves its own test matrix + NuGet-dep decision rather than riding this UI PR. New /clusters/{ClusterId} Redundancy tab slotted between ACLs and Audit in the existing ClusterDetail tab bar. Shows each ClusterNode row in the cluster with columns Node / Role (Primary green, Secondary blue, Standalone primary-blue badge) / Host / OPC UA port / ServiceLevel base / ApplicationUri (text-break so the long urn: doesn't blow out the table) / Enabled badge / Last seen (relative age via the same FormatAge helper as Hosts.razor, with a yellow "Stale" chip once LastSeenAt crosses the 30s threshold shared with HostStatusService.StaleThreshold — a missed heartbeat plus clock-skew buffer). Four summary cards above the table — total Nodes, Primary count, Secondary count, Stale count. Two guard-rail alerts: (a) red "No Primary or Standalone" when the cluster has no authoritative write target (all rows are Secondaries — read-only until one is promoted by the server-side RedundancyCoordinator apply-lease flow); (b) red "Split-brain" when >1 Primary exists — apply-lease enforcement at the coordinator level should have made this impossible, so the alert implies a hand-edited DB row + an investigation. New ClusterNodeService with ListByClusterAsync (ordered by ServiceLevelBase descending so Primary rows with higher base float to the top) + a static IsStale predicate matching HostStatusService's 30s convention. DI-registered alongside the existing scoped services in Program.cs. Writes (role swap, enable/disable) are deliberately absent from the service — they go through the RedundancyCoordinator apply-lease flow on the server side + direct DB mutation from Admin would race with it. New ClusterNodeServiceTests covering IsStale across null/recent/old LastSeenAt + ListByClusterAsync ordering + cluster filter. 4/4 new tests passing; full Admin.Tests suite 76/76 (was 72 before this PR, +4). Admin project builds 0 errors. Task #198 captures the deferred work: (1) OpenTelemetry Meter for primary/secondary/stale counts + role_transition counter with from/to/node tags + OTLP exporter config; (2) RoleChanged SignalR push — extend FleetStatusPoller to detect RedundancyRole changes on ClusterNode rows + emit a RoleChanged hub message so the RedundancyTab refreshes instantly instead of on-page-load polling.
dohertj2 referenced this issue from a commit 2026-04-30 08:21:26 -04:00
OpenTelemetry redundancy metrics + RoleChanged SignalR push. Closes instrumentation + live-push slices of task #198; the exporter wiring (OTLP vs Prometheus package decision) is split to new task #201 because the collector/scrape-endpoint choice is a fleet-ops decision that deserves its own PR rather than hardcoded here. New RedundancyMetrics class (Singleton-registered in DI) owning a System.Diagnostics.Metrics.Meter("ZB.MOM.WW.OtOpcUa.Redundancy", "1.0.0"). Three ObservableGauge instruments — otopcua.redundancy.primary_count / secondary_count / stale_count — all tagged by cluster.id, populated by SetClusterCounts(clusterId, primary, secondary, stale) which the poller calls at the tail of every tick; ObservableGauge callbacks snapshot the last value set under a lock so the reader (OTel collector, dotnet-counters) sees consistent tuples. One Counter — otopcua.redundancy.role_transition — tagged cluster.id, node.id, from_role, to_role; ideal for tracking "how often does Cluster-X failover" + "which node transitions most" aggregate queries. In-box Metrics API means zero NuGet dep here — the exporter PR adds OpenTelemetry.Extensions.Hosting + OpenTelemetry.Exporter.OpenTelemetryProtocol or OpenTelemetry.Exporter.Prometheus.AspNetCore to actually ship the data somewhere. FleetStatusPoller extended with role-change detection. Its PollOnceAsync now pulls ClusterNode rows alongside the existing ClusterNodeGenerationState scan, and a new PollRolesAsync walks every node comparing RedundancyRole to the _lastRole cache. On change: records the transition to RedundancyMetrics + emits a RoleChanged SignalR message to both FleetStatusHub.GroupName(cluster) + FleetStatusHub.FleetGroup so cluster-scoped + fleet-wide subscribers both see it. First observation per node is a bootstrap (cache fill) + NOT a transition — avoids spurious churn on service startup or pod restart. UpdateClusterGauges groups nodes by cluster + sets the three gauge values, using ClusterNodeService.StaleThreshold (shared 30s convention) for staleness so the /hosts page + the gauge agree. RoleChangedMessage record lives alongside NodeStateChangedMessage in FleetStatusPoller.cs. RedundancyTab.razor subscribes to the fleet-status hub on first parameters-set, filters RoleChanged events to the current cluster, reloads the node list + paints a blue info banner ("Role changed on node-a: Primary → Secondary at HH:mm:ss UTC") so operators see the transition without needing to poll-refresh the page. IAsyncDisposable closes the connection on tab swap-away. Two new RedundancyMetricsTests covering RecordRoleTransition tag emission (cluster.id + node.id + from_role + to_role all flow through the MeterListener callback) + ObservableGauge snapshot for two clusters (assert primary_count=1 for c1, stale_count=1 for c2). Existing FleetStatusPollerTests ctor-line updated to pass a RedundancyMetrics instance; all tests still pass. Full Admin.Tests suite 87/87 passing (was 85, +2). Admin project builds 0 errors. Task #201 captures the exporter-wiring follow-up — OpenTelemetry.Extensions.Hosting + OTLP vs Prometheus + /metrics endpoint decision, driven by fleet-ops infra direction.
Sign in to join this conversation.