mxaccessgw/docs/plans/2026-06-13-alarm-subtag-fallback-design.md
Joseph Doherty 37aadf72b3 docs(alarms): clarify resolver cancellation contract; mark design doc superseded
C6b: IAlarmWatchListResolver.ResolveAsync doc now notes that while discovery being
unavailable never throws, a triggered cancellation token still propagates.
C7: annotate the original design doc where it drifted from the shipped code — metric
names / unimplemented watch-list gauges, and the proto-type location (gateway proto, not
worker proto).
2026-06-14 02:33:14 -04:00

Alarm Subtag-Monitoring Fallback — Design

Date: 2026-06-13

Status: Superseded by implementation (merged to main). This is the original brainstorming design; a few details below were refined during implementation (see the inline Superseded notes). The shipped behaviour is documented in docs/AlarmClientDiscovery.md, the client READMEs, and the contracts.

Branch: feat/alarm-subtag-fallback

Problem

The gateway's central alarm feed (GatewayAlarmMonitor → worker WnWrapAlarmConsumer) depends on the AVEVA wnwrap COM consumer (WNWRAPCONSUMERLib.wwAlarmConsumerClass), which polls GetXmlCurrentAlarms2 on the worker STA. That provider can fail at the COM boundary (the older aaAlarmManagedClient crashed on FILETIME marshaling; wnwrap can still return failure HRESULTs or throw COMException). When it does, the gateway loses all alarm visibility.

This design adds a second alarm source — direct monitoring of each alarm attribute's subtags (.active, .acked, …) via the existing MXAccess AddItem/Advise pipeline — and fails over to it automatically when the wnwrap provider breaks, then fails back automatically when it recovers. The subtag source can also be forced on by config.

Decisions (locked during brainstorming)

| Decision | Choice |
| --- | --- |
| Failover model | Auto-failover + auto-failback (both directions, runtime) |
| Watch-list source | Galaxy Repository SQL discovery + config override |
| Acknowledge in subtag mode | Write the operator comment to the alarm's ack-comment subtag (the write performs the ack) |
| Failure signal | N consecutive wnwrap COM failures (Subscribe / GetXmlCurrentAlarms2 throws or returns a failure HRESULT) |
| Degraded-state visibility | Both: explicit field in the gRPC contract and dashboard + metrics |
| Synthesis location | Worker-side (Approach A); keeps the parity rule "the gateway forwards only events the worker emits; it never synthesizes events" |

Core principle

Subtag monitoring is, by definition, a non-parity, lower-fidelity alarm source: it synthesizes alarm transitions from raw data changes, has no native alarm GUID, no native original-raise timestamp, and a narrower field set. Per CLAUDE.md, synthesizing events is allowed only as an explicit opt-in non-parity mode. This design satisfies that by (a) doing the synthesis inside the worker (so the gateway still only forwards worker-emitted events) and (b) marking every degraded event and the whole feed as degraded so no client mistakes it for the authoritative alarmmgr feed.

Architecture

                          GATEWAY (.NET 10, x64)
  ┌─────────────────────────────────────────────────────────────────┐
  │ GatewayAlarmMonitor (BackgroundService)                           │
  │  • resolves watch-list: Galaxy Repository SQL + config override   │
  │  • arms the worker with the watch-list at subscribe time          │
  │  • consumes AlarmProviderModeChanged → reflects mode into feed,   │
  │    /hubs/alarms dashboard hub, and metrics                        │
  │  • forces a cache reconcile (QueryActiveAlarms) on every switch   │
  └───────────────────────────────┬───────────────────────────────────┘
                                   │ IPC (WorkerEnvelope frames)
                                   │  · SubscribeAlarms{ watch_list, failover cfg }
                                   │  · AlarmProviderModeChanged{ mode, reason, hresult }
                                   │  · OnAlarmTransitionEvent (degraded flag set in subtag mode)
                                   ▼
                          WORKER (.NET FW 4.8, x86, STA)
  ┌─────────────────────────────────────────────────────────────────┐
  │ AlarmDispatcher → FailoverAlarmConsumer : IMxAccessAlarmConsumer  │
  │   ├─ primary : WnWrapAlarmConsumer   (wnwrap COM poll, unchanged) │
  │   └─ standby : SubtagAlarmConsumer   (AddItem/Advise on subtags)  │
  │                                                                   │
  │  FailoverAlarmConsumer owns the state machine:                    │
  │   PrimaryActive ──(N consecutive wnwrap COM failures)──▶ Degraded │
  │   Degraded ──(M consecutive clean wnwrap probe polls)──▶ Primary  │
  │   on each switch: snapshot the now-active provider, hand off      │
  └─────────────────────────────────────────────────────────────────┘

The failover state machine lives worker-local so the switch is instant — no IPC round-trip at the moment alarmmgr dies. The gateway arms the standby consumer up front (passes the watch-list at subscribe time) so it is ready before it is ever needed.

Components

Worker (src/ZB.MOM.WW.MxGateway.Worker/MxAccess/)

SubtagAlarmConsumer : IMxAccessAlarmConsumer (new) — the standby provider.

  • On Subscribe, instead of wnwrap registration it AddItem/Advises the configured subtags for each watch-list entry on the existing STA (reuses the worker's item-subscription machinery). Per attribute it advises at minimum .active and .acked; optionally .priority/severity, .descr, value/limit if present.
  • Converts each OnDataChange into the same MxAlarmTransitionEvent the wnwrap consumer emits, via the synthesis rules below, and raises AlarmTransitionEmitted. Marks each as degraded.
  • SnapshotActiveAlarms() returns the currently-active set computed from last-known subtag values.
  • AcknowledgeByName(...) resolves the watch-list entry's ack-comment subtag and issues a Write(comment) on the STA. AcknowledgeByGuid(...) maps the synthetic GUID (see below) back to a reference, then does the same. If the attribute exposes no writable ack-comment subtag, returns a failure code that the gateway surfaces as FailedPrecondition.
  • PollOnce() is a no-op (subtag mode is event-driven via Advise).

FailoverAlarmConsumer : IMxAccessAlarmConsumer (new) — composite + state machine. Owns the wnwrap consumer (primary) and the subtag consumer (standby), forwards AlarmTransitionEmitted from whichever child is active, and raises a new ProviderModeChanged event on every switch.

  • Failure counting: wraps Subscribe/PollOnce on the primary; a thrown COMException or a failure HRESULT increments a consecutive-failure counter, reset to zero on any clean poll.
  • Failover (PrimaryActive → Degraded): at ConsecutiveFailureThreshold (default 3), ensures the standby is subscribed (it was armed at startup), sets active = standby, snapshots the standby's active set for hand-off, and emits ProviderModeChanged(SUBTAG, reason, hresult).
  • Failback probe (Degraded → PrimaryActive): while degraded, every FailbackProbeIntervalSeconds (default 30) it re-attempts wnwrap Subscribe+PollOnce on the STA. After FailbackStableProbes (default 3) consecutive clean polls it switches active = primary, returns the subtag consumer to standby, and emits ProviderModeChanged(ALARMMGR, "recovered").
  • Hand-off: on every switch it takes SnapshotActiveAlarms() from the now-active provider so the gateway can reconcile and avoid spurious raise/clear storms.
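The counting and switching rules above can be sketched as a small state machine. This is an illustrative Python sketch, not the shipped C# FailoverAlarmConsumer: the class and method names are hypothetical, the probe timer (FailbackProbeIntervalSeconds) and the snapshot hand-off are left out, and the `switches` list merely stands in for the ProviderModeChanged event.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional, Tuple


class Mode(Enum):
    ALARMMGR = 1  # primary (wnwrap) consumer is active
    SUBTAG = 2    # degraded: subtag consumer is active


@dataclass
class FailoverStateMachine:
    """Counts primary poll outcomes and decides when to switch providers."""
    failure_threshold: int = 3   # ConsecutiveFailureThreshold
    stable_probes: int = 3       # FailbackStableProbes
    mode: Mode = Mode.ALARMMGR
    failures: int = 0            # consecutive primary failures
    clean_probes: int = 0        # consecutive clean probes while degraded
    switches: List[Tuple[Mode, str, Optional[int]]] = field(default_factory=list)

    def on_primary_result(self, ok: bool, hresult: Optional[int] = None) -> None:
        """Feed the outcome of one primary poll, or of one failback probe."""
        if self.mode is Mode.ALARMMGR:
            if ok:
                self.failures = 0  # any clean poll resets the counter
                return
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self._switch(Mode.SUBTAG, "wnwrap failure threshold reached", hresult)
        else:
            # While degraded, a single failed probe restarts the stability count.
            self.clean_probes = self.clean_probes + 1 if ok else 0
            if self.clean_probes >= self.stable_probes:
                self._switch(Mode.ALARMMGR, "recovered", None)

    def _switch(self, to: Mode, reason: str, hresult: Optional[int]) -> None:
        self.mode = to
        self.failures = 0
        self.clean_probes = 0
        # Stand-in for raising ProviderModeChanged(mode, reason, hresult).
        self.switches.append((to, reason, hresult))
```

Keeping the counters inside one object that only sees poll outcomes makes the thresholds easy to unit-test with a fake primary, which is the approach the Testing section describes.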

AlarmDispatcher / MxAccessAlarmEventSink / AlarmCommandHandler (changed, minimal): AlarmDispatcher holds a FailoverAlarmConsumer instead of a bare WnWrapAlarmConsumer; it subscribes to ProviderModeChanged and enqueues a mode-changed worker event. The ack path routes by active mode (native wnwrap ack in alarmmgr mode; ack-comment write in subtag mode), but that routing is entirely inside the consumer; the dispatcher just calls AcknowledgeByName/AcknowledgeByGuid.

Gateway (src/ZB.MOM.WW.MxGateway.Server/)

Galaxy Repository discovery (new query): alongside the existing GR SQL browse RPCs, a new query returns the attributes that have alarms configured, together with each attribute's ack-comment subtag and area, scoped to the configured area. The result is merged with the config override (explicit includes/excludes) to produce the watch-list of AlarmSubtagTarget entries.
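A minimal sketch of the discovery/override merge, in Python for illustration only. The design does not fix the merge semantics, so the choices here are assumptions: references compare case-insensitively, an explicit include never overwrites a discovered entry, and excludes win over both. `merge_watch_list` and the trimmed-down `AlarmSubtagTarget` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass(frozen=True)
class AlarmSubtagTarget:
    """Trimmed-down stand-in for the contract's AlarmSubtagTarget."""
    alarm_full_reference: str
    ack_comment_subtag: str = ""


def merge_watch_list(
    discovered: Iterable[AlarmSubtagTarget],
    include: Iterable[str],
    exclude: Iterable[str],
) -> List[AlarmSubtagTarget]:
    """Merge discovery results with the config override (assumed semantics)."""
    excluded = {ref.lower() for ref in exclude}
    merged: Dict[str, AlarmSubtagTarget] = {}
    for target in discovered:
        merged[target.alarm_full_reference.lower()] = target
    for ref in include:
        # Explicit includes only add entries that discovery did not find.
        merged.setdefault(ref.lower(), AlarmSubtagTarget(ref))
    # Sorting gives a stable order, so "did the watch-list change?" is a plain comparison.
    return [target for key, target in sorted(merged.items()) if key not in excluded]
```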

GatewayAlarmMonitor (changed) — resolves the watch-list at subscribe time and passes it to the worker; consumes AlarmProviderModeChanged and reflects the current provider mode into (a) the AlarmFeedMessage provider-status, (b) the /hubs/alarms dashboard hub, and (c) metrics; forces a reconcile (QueryActiveAlarms) on every switch. Re-runs discovery on its existing reconcile cadence and pushes an updated watch-list when the model changes.

AlarmsOptions (extended) — new Fallback sub-section (below).

Contract (src/ZB.MOM.WW.MxGateway.Contracts/Protos/)

mxaccess_gateway.proto:

  • enum AlarmProviderMode { ALARM_PROVIDER_MODE_UNSPECIFIED = 0; ALARMMGR = 1; SUBTAG = 2; }
  • New AlarmFeedMessage oneof case AlarmProviderStatus provider_status, carrying { AlarmProviderMode mode; bool degraded; string reason; google.protobuf.Timestamp since; }. Emitted on stream open and on every change so a late-joining client immediately learns the mode.
  • Add bool degraded + AlarmProviderMode source_provider to OnAlarmTransitionEvent and ActiveAlarmSnapshot, so per-item provenance is visible even mid-stream. All additions are new field numbers — backward compatible; existing clients ignore them and keep seeing alarms.

mxaccess_worker.proto:

Superseded: these additions shipped in mxaccess_gateway.proto, not mxaccess_worker.proto — the worker imports the gateway proto and the alarm commands/events live there (AlarmSubtagTarget, OnAlarmProviderModeChangedEvent, the extended subscribe command).

  • Extend the alarm-subscribe command with: AlarmProviderMode forced_mode (UNSPECIFIED = auto), int32 consecutive_failure_threshold, int32 failback_probe_interval_seconds, int32 failback_stable_probes, and repeated AlarmSubtagTarget watch_list, where AlarmSubtagTarget = { string alarm_full_reference; string source_object_reference; string active_subtag; string acked_subtag; string ack_comment_subtag; string priority_subtag; }.
  • New worker→gateway event AlarmProviderModeChanged { AlarmProviderMode mode; string reason; int32 hresult; google.protobuf.Timestamp at; }.

Generated code under Generated/ and clients/*/generated*/ is rebuilt from these .proto files — never hand-edited. Every generated client touched by the contract is rebuilt per the source-update workflow.

Data flow

Subtag synthesis rules

SubtagAlarmConsumer keeps last-known (active, acked) per watch-list entry and emits transitions on change:

| Subtag change | Emitted transition | Notes |
| --- | --- | --- |
| active false → true | RAISE (state UNACK_ALM) | original_raise_timestamp = first-observed active time |
| acked false → true while active | ACKNOWLEDGE | operator_user/operator_comment from ack-comment subtag if advised |
| active true → false | CLEAR | maps to AckRtn if acked at clear, else UnackRtn |
| active stays true, re-alarm | RETRIGGER | only if a re-alarm counter subtag exists; otherwise not synthesized (documented limitation) |

Snapshot state mapping for ActiveAlarmSnapshot.current_state: active && !acked → ACTIVE, active && acked → ACTIVE_ACKED, !active → INACTIVE.
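The synthesis rules and the snapshot mapping can be sketched as follows. This is an illustrative Python sketch under simplifying assumptions, not the shipped SubtagAlarmConsumer: only boolean active/acked changes are handled, RETRIGGER is omitted, timestamps and the degraded flag are left out, and `SubtagSynthesizer` and its methods are hypothetical names.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass
class SubtagState:
    active: bool = False
    acked: bool = False


class SubtagSynthesizer:
    """Keeps last-known (active, acked) per alarm and emits transitions on change."""

    def __init__(self) -> None:
        self._states: Dict[str, SubtagState] = {}

    def on_data_change(
        self, ref: str, subtag: str, value: bool
    ) -> Optional[Tuple[str, Optional[str]]]:
        """Return (transition, detail) for a change, or None if nothing is emitted."""
        state = self._states.setdefault(ref, SubtagState())
        if subtag == "active":
            was_active, state.active = state.active, value
            if not was_active and value:
                return ("RAISE", "UNACK_ALM")
            if was_active and not value:
                return ("CLEAR", "AckRtn" if state.acked else "UnackRtn")
        elif subtag == "acked":
            was_acked, state.acked = state.acked, value
            if not was_acked and value and state.active:
                return ("ACKNOWLEDGE", None)
        return None  # repeated value, or a change that maps to no transition

    def snapshot_state(self, ref: str) -> str:
        """Map last-known values to ActiveAlarmSnapshot.current_state."""
        state = self._states.get(ref, SubtagState())
        if not state.active:
            return "INACTIVE"
        return "ACTIVE_ACKED" if state.acked else "ACTIVE"
```

Feeding it `active=True`, `acked=True`, `active=False` for one reference yields RAISE, ACKNOWLEDGE, then CLEAR with AckRtn, matching the table row by row.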

Field degradation in subtag mode:

  • alarm_full_reference — from the watch-list entry (stable, drives ack-by-ref).
  • Synthetic, deterministic GUID derived by hashing alarm_full_reference so GUID-based ack still resolves; flagged degraded = true.
  • severity — from the priority subtag if advised, else 0.
  • original_raise_timestamp — first-observed active time (best effort).
  • transition_timestamp — the OnDataChange timestamp.
  • category/description/current_value/limit_value — populated only if the corresponding subtag is advised; otherwise empty.
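One way to get a synthetic GUID that is deterministic and reversible through a lookup table is a name-based UUID. The design only says the GUID is derived by hashing alarm_full_reference, so the use of UUID version 5, the namespace value, and the lower-casing below are all assumptions made for this Python sketch; the function names are hypothetical.

```python
import uuid
from typing import Dict, Iterable

# Hypothetical namespace: any fixed UUID works, as long as it never changes
# between runs, otherwise previously issued GUIDs stop resolving.
ALARM_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "urn:example:mxgateway:alarm-subtag")


def synthetic_alarm_guid(alarm_full_reference: str) -> uuid.UUID:
    """Derive a stable GUID from the alarm reference (SHA-1 based UUIDv5)."""
    # Assumption: references are treated as case-insensitive.
    normalized = alarm_full_reference.strip().lower()
    return uuid.uuid5(ALARM_NAMESPACE, normalized)


def build_guid_index(references: Iterable[str]) -> Dict[uuid.UUID, str]:
    """Reverse map used to resolve a GUID-based ack back to a reference."""
    return {synthetic_alarm_guid(ref): ref for ref in references}
```

Because the hash is one-way, GUID-based acknowledge can only resolve references that are in the current watch-list; an index like `build_guid_index` would need rebuilding whenever the watch-list is updated.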

Acknowledge

AcknowledgeAlarm/AcknowledgeAlarmByName are unchanged at the RPC surface. AlarmDispatcher calls AcknowledgeByName/AcknowledgeByGuid as before, and FailoverAlarmConsumer routes the call by active provider mode:

  • alarmmgr mode: native wnwrap AlarmAckByName/AlarmAckByGUID (unchanged).
  • subtag mode: resolve the target's ack_comment_subtag, Write the operator comment via the existing worker write path on the STA. No writable ack-comment subtag → FailedPrecondition.
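The two branches can be sketched as a single routing function. This Python sketch is illustrative only: `native_ack` and `write` are hypothetical callables standing in for the wnwrap ack and the worker write path, the status strings stand in for gRPC status codes, and composing the item reference as `<alarm reference>.<subtag>` is an assumption.

```python
from typing import Callable, Dict, Tuple


def acknowledge(
    mode: str,
    alarm_ref: str,
    comment: str,
    ack_comment_subtags: Dict[str, str],
    native_ack: Callable[[str, str], None],
    write: Callable[[str, str], None],
) -> Tuple[str, str]:
    """Route an acknowledge by active provider mode; returns (status, detail)."""
    if mode == "ALARMMGR":
        native_ack(alarm_ref, comment)  # unchanged native path
        return ("OK", "")
    subtag = ack_comment_subtags.get(alarm_ref, "")
    if not subtag:
        # No writable ack-comment subtag configured for this alarm.
        return ("FAILED_PRECONDITION", f"{alarm_ref} has no writable ack-comment subtag")
    # In subtag mode, writing the operator comment is what performs the ack.
    write(f"{alarm_ref}.{subtag}", comment)
    return ("OK", "")
```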

Provider-mode reflection

Worker AlarmProviderModeChanged → GatewayAlarmMonitor → (a) emit/refresh AlarmFeedMessage.provider_status to every StreamAlarms subscriber, (b) push to /hubs/alarms, (c) update metrics, (d) force a reconcile.

Error handling

  • Both providers down (subtag advise also failing): the monitor stays faulted and keeps retrying both; acknowledge returns Unavailable. No silent data loss — the feed reports degraded with reason.
  • Empty watch-list in subtag mode (GR SQL unavailable, no config override): log + metric alarm_fallback_watchlist_empty; the feed reports degraded + empty; the gateway keeps re-running discovery on its reconcile cadence and pushes an updated watch-list when one becomes available.
  • Switch hand-off: every switch snapshots the now-active provider and reconciles against the gateway cache to avoid a raise/clear storm.
  • STA affinity: all subtag advise/write and wnwrap probe calls run on the worker STA (reuse the existing affinity guard) to satisfy ThreadingModel=Apartment.
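The switch hand-off reconcile amounts to diffing the new provider's snapshot against the gateway cache, so that only real differences are emitted. A minimal Python sketch, assuming both sides can be reduced to a map from alarm reference to snapshot state; `reconcile` is a hypothetical name and the real reconcile goes through QueryActiveAlarms.

```python
from typing import Dict, List, Tuple


def reconcile(cache: Dict[str, str], snapshot: Dict[str, str]) -> List[Tuple[str, str]]:
    """Return the (reference, new_state) changes needed to bring the cache
    in line with the snapshot; unchanged alarms produce no event."""
    changes: List[Tuple[str, str]] = []
    for ref, state in sorted(snapshot.items()):
        if cache.get(ref) != state:
            changes.append((ref, state))
    # Alarms the cache still holds but the new provider no longer reports.
    for ref in sorted(cache.keys() - snapshot.keys()):
        changes.append((ref, "INACTIVE"))
    return changes
```

An alarm that is active in both the cache and the snapshot yields no change at all, which is what prevents a raise/clear storm at the moment of the switch.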

Metrics

  • mxgateway_alarm_provider_mode (gauge: 1 = alarmmgr, 2 = subtag)
  • mxgateway_alarm_provider_switch_total{from,to,reason} (counter)
  • mxgateway_alarm_fallback_watchlist_size (gauge)

Superseded: the shipped meter names are mxgateway.alarms.provider_mode (gauge) and mxgateway.alarms.provider_switches{from,to,reason} (counter, reason bounded to failover/failback/unknown). The watch-list-size / watch-list-empty gauges were not implemented; an empty watch-list is surfaced via a warning log and the feed's degraded ProviderStatus instead.

Configuration

"MxGateway": {
  "Alarms": {
    "Enabled": true,
    "SubscriptionExpression": "\\\\DESKTOP-6JL3KKO\\Galaxy!DEV",
    "DefaultArea": "DEV",
    "ReconcileIntervalSeconds": 30,
    "Fallback": {
      "Mode": "Auto",                      // Auto | ForceAlarmManager | ForceSubtag
      "ConsecutiveFailureThreshold": 3,
      "FailbackProbeIntervalSeconds": 30,
      "FailbackStableProbes": 3,
      "Discovery": {
        "UseGalaxyRepository": true,
        "Area": "",                        // defaults to Alarms.DefaultArea
        "IncludeAttributes": [],           // explicit additions
        "ExcludeAttributes": []
      },
      "Subtags": {
        "Active": "active",
        "Acked": "acked",
        "AckComment": "",                  // empty until confirmed against the MXAccess analysis
        "Priority": "priority"
      }
    }
  }
}

GatewayOptionsValidator additions: Mode = ForceSubtag with empty discovery result and no explicit IncludeAttributes → startup validation warning; threshold/interval/probe values floored at sane minimums.
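A sketch of those validator additions, in Python for illustration. The floor values are hypothetical (the design only says "sane minimums"), and `FallbackOptions` and `validate_fallback` are illustrative names rather than the real GatewayOptionsValidator API.

```python
from dataclasses import dataclass, replace
from typing import List, Tuple

# Hypothetical floors; the design does not specify the actual minimums.
MIN_FAILURE_THRESHOLD = 1
MIN_PROBE_INTERVAL_SECONDS = 5
MIN_STABLE_PROBES = 1


@dataclass(frozen=True)
class FallbackOptions:
    mode: str = "Auto"  # Auto | ForceAlarmManager | ForceSubtag
    consecutive_failure_threshold: int = 3
    failback_probe_interval_seconds: int = 30
    failback_stable_probes: int = 3
    include_attributes: Tuple[str, ...] = ()


def validate_fallback(
    options: FallbackOptions, discovered_count: int
) -> Tuple[FallbackOptions, List[str]]:
    """Return the floored options plus any startup warnings."""
    warnings: List[str] = []
    if (options.mode == "ForceSubtag" and discovered_count == 0
            and not options.include_attributes):
        warnings.append("ForceSubtag with an empty watch-list: no alarms will be monitored")
    floored = replace(
        options,
        consecutive_failure_threshold=max(
            options.consecutive_failure_threshold, MIN_FAILURE_THRESHOLD),
        failback_probe_interval_seconds=max(
            options.failback_probe_interval_seconds, MIN_PROBE_INTERVAL_SECONDS),
        failback_stable_probes=max(
            options.failback_stable_probes, MIN_STABLE_PROBES),
    )
    return floored, warnings
```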

Open item to confirm during implementation

The exact AVEVA subtag names (.active, .acked, the ack-comment attribute, priority) must be confirmed against the MXAccess analysis project (C:\Users\dohertj2\Desktop\mxaccess, docs/MXAccess-Public-API.md) and the live Galaxy before wiring SubtagAlarmConsumer. The config Subtags block exists precisely so the resolved names are not hard-coded.

Testing

| Layer | Tests |
| --- | --- |
| Worker unit (MxGateway.Worker.Tests, x86) | SubtagAlarmConsumer synthesis: feed OnDataChange sequences, assert raise/ack/clear transitions, snapshot states, degraded flag, synthetic-GUID stability, ack-comment write routing |
| Worker unit | FailoverAlarmConsumer state machine: fake wnwrap throwing after K polls; assert switch at threshold, failback after stable probes, ProviderModeChanged emitted, no duplicate transitions across switch (hand-off reconcile) |
| Gateway unit (MxGateway.Tests, fake worker) | discovery + config-override merge; GatewayAlarmMonitor reflects mode into feed + hub; metrics increment on switch |
| Contract | proto round-trip for new fields; existing alarm tests unchanged (alarmmgr-mode regression, parity preserved) |
| Live (opt-in, MXGATEWAY_RUN_LIVE_MXACCESS_TESTS=1) | real subtag advise + ack-comment write against a live alarm; GR SQL discovery query against the ZB DB (gated like existing GR tests) |

Docs to update in the same change

gateway.md (alarm provider section), docs/DesignDecisions.md (record the fallback decision), docs/GatewayConfiguration.md (the Fallback block), docs/AlarmClientDiscovery.md (subtag provider + synthesis rules), docs/Grpc.md (the new provider_status / degraded fields), and any client READMEs whose generated alarm types gain fields.