# Alarm Subtag-Monitoring Fallback: Design

**Date:** 2026-06-13

**Status:** Superseded by implementation (merged to `main`). This is the original brainstorming design; a few details below were refined during implementation (see the inline **Superseded** notes). The shipped behaviour is documented in `docs/AlarmClientDiscovery.md`, the client READMEs, and the contracts.

**Branch:** `feat/alarm-subtag-fallback`

## Problem

The gateway's central alarm feed (`GatewayAlarmMonitor` → worker `WnWrapAlarmConsumer`) depends on the AVEVA wnwrap COM consumer (`WNWRAPCONSUMERLib.wwAlarmConsumerClass`), which polls `GetXmlCurrentAlarms2` on the worker STA. That provider can fail at the COM boundary (the older `aaAlarmManagedClient` crashed on FILETIME marshaling; wnwrap can still return failure HRESULTs or throw `COMException`). When it does, the gateway loses all alarm visibility.

This design adds a **second alarm source**: direct monitoring of each alarm attribute's subtags (`.active`, `.acked`, …) via the existing MXAccess `AddItem`/`Advise` pipeline. The gateway **fails over to it automatically when the wnwrap provider breaks, then fails back automatically when it recovers**. The subtag source can also be forced on by config.

## Decisions (locked during brainstorming)

| Decision | Choice |
|---|---|
| Failover model | **Auto-failover + auto-failback** (both directions, runtime) |
| Watch-list source | **Galaxy Repository SQL discovery + config override** |
| Acknowledge in subtag mode | **Write the operator comment to the alarm's ack-comment subtag** (the write performs the ack) |
| Failure signal | **N consecutive wnwrap COM failures** (Subscribe / `GetXmlCurrentAlarms2` throws or returns a failure HRESULT) |
| Degraded-state visibility | **Both**: explicit field in the gRPC contract **and** dashboard + metrics |
| Synthesis location | **Worker-side** (`Approach A`), which keeps the parity rule "the gateway forwards only events the worker emits; it never synthesizes events" |

## Core principle

Subtag monitoring is, by definition, a **non-parity, lower-fidelity** alarm source: it synthesizes alarm transitions from raw data changes, has no native alarm GUID, no native original-raise timestamp, and a narrower field set. Per `CLAUDE.md`, synthesizing events is allowed only as an explicit opt-in non-parity mode. This design satisfies that by (a) doing the synthesis **inside the worker** (so the gateway still only forwards worker-emitted events) and (b) marking every degraded event and the whole feed as degraded so no client mistakes it for the authoritative alarmmgr feed.

## Architecture

```
GATEWAY (.NET 10, x64)
┌───────────────────────────────────────────────────────────────────┐
│ GatewayAlarmMonitor (BackgroundService)                           │
│   • resolves watch-list: Galaxy Repository SQL + config override  │
│   • arms the worker with the watch-list at subscribe time         │
│   • consumes AlarmProviderModeChanged → reflects mode into feed,  │
│     /hubs/alarms dashboard hub, and metrics                       │
│   • forces a cache reconcile (QueryActiveAlarms) on every switch │
└───────────────────────────────┬───────────────────────────────────┘
                                │ IPC (WorkerEnvelope frames)
                                │  · SubscribeAlarms{ watch_list, failover cfg }
                                │  · AlarmProviderModeChanged{ mode, reason, hresult }
                                │  · OnAlarmTransitionEvent (degraded flag set in subtag mode)
                                ▼
WORKER (.NET FW 4.8, x86, STA)
┌───────────────────────────────────────────────────────────────────┐
│ AlarmDispatcher → FailoverAlarmConsumer : IMxAccessAlarmConsumer  │
│     ├─ primary : WnWrapAlarmConsumer (wnwrap COM poll, unchanged) │
│     └─ standby : SubtagAlarmConsumer (AddItem/Advise on subtags)  │
│                                                                   │
│ FailoverAlarmConsumer owns the state machine:                     │
│   PrimaryActive ──(N consecutive wnwrap COM failures)──▶ Degraded │
│   Degraded ──(M consecutive clean wnwrap probe polls)──▶ Primary  │
│   on each switch: snapshot the now-active provider, hand off      │
└───────────────────────────────────────────────────────────────────┘
```

The failover state machine lives **worker-local** so the switch is instant: there is no IPC round-trip at the moment alarmmgr dies. The gateway *arms* the standby consumer up front (passes the watch-list at subscribe time) so it is ready before it is ever needed.

## Components

### Worker (`src/ZB.MOM.WW.MxGateway.Worker/MxAccess/`)

**`SubtagAlarmConsumer : IMxAccessAlarmConsumer` (new)**: the standby provider.

- On `Subscribe`, instead of wnwrap registration it `AddItem`/`Advise`s the configured subtags for each watch-list entry on the existing STA (reuses the worker's item-subscription machinery).
  Per attribute it advises at minimum `.active` and `.acked`; optionally `.priority`/severity, `.descr`, value/limit if present.
- Converts each `OnDataChange` into the same `MxAlarmTransitionEvent` the wnwrap consumer emits, via the synthesis rules below, and raises `AlarmTransitionEmitted`. Marks each as **degraded**.
- `SnapshotActiveAlarms()` returns the currently-active set computed from last-known subtag values.
- `AcknowledgeByName(...)` resolves the watch-list entry's ack-comment subtag and issues a `Write(comment)` on the STA. `AcknowledgeByGuid(...)` maps the synthetic GUID (see below) back to a reference, then does the same. If the attribute exposes no writable ack-comment subtag, returns a failure code that the gateway surfaces as `FailedPrecondition`.
- `PollOnce()` is a no-op (subtag mode is event-driven via Advise).

**`FailoverAlarmConsumer : IMxAccessAlarmConsumer` (new)**: composite + state machine. Owns the wnwrap consumer (primary) and the subtag consumer (standby), forwards `AlarmTransitionEmitted` from whichever child is active, and raises a new `ProviderModeChanged` event on every switch.

- **Failure counting:** wraps `Subscribe`/`PollOnce` on the primary; a thrown `COMException` or a failure HRESULT increments a consecutive-failure counter, reset to zero on any clean poll.
- **Failover** (`PrimaryActive → Degraded`): at `ConsecutiveFailureThreshold` (default 3), ensures the standby is subscribed (it was armed at startup), sets active = standby, snapshots the standby's active set for hand-off, and emits `ProviderModeChanged(SUBTAG, reason, hresult)`.
- **Failback probe** (`Degraded → PrimaryActive`): while degraded, every `FailbackProbeIntervalSeconds` (default 30) it re-attempts wnwrap `Subscribe`+`PollOnce` on the STA. After `FailbackStableProbes` (default 3) consecutive clean polls it switches active = primary, returns the standby to standby, and emits `ProviderModeChanged(ALARMMGR, "recovered")`.
- **Hand-off:** on every switch it takes `SnapshotActiveAlarms()` from the now-active provider so the gateway can reconcile and avoid spurious raise/clear storms.

**`AlarmDispatcher` / `MxAccessAlarmEventSink` / `AlarmCommandHandler` (changed, minimal)**: `AlarmDispatcher` holds a `FailoverAlarmConsumer` instead of a bare `WnWrapAlarmConsumer`; it subscribes to `ProviderModeChanged` and enqueues a mode-changed worker event. The ack path routes by active mode (native wnwrap ack in alarmmgr mode; ack-comment write in subtag mode), but that routing is entirely inside the consumer; the dispatcher just calls `AcknowledgeByName`/`AcknowledgeByGuid`.

### Gateway (`src/ZB.MOM.WW.MxGateway.Server/`)

**Galaxy Repository discovery (new query)**: alongside the existing GR SQL browse RPCs, a query for "attributes that have alarms configured, with their ack-comment subtag and area", scoped to the configured area. Merged with the config override (explicit includes/excludes). Produces the watch-list of `AlarmSubtagTarget`s.

**`GatewayAlarmMonitor` (changed)**: resolves the watch-list at subscribe time and passes it to the worker; consumes `AlarmProviderModeChanged` and reflects the current provider mode into (a) the `AlarmFeedMessage` provider-status, (b) the `/hubs/alarms` dashboard hub, and (c) metrics; forces a reconcile (`QueryActiveAlarms`) on every switch. Re-runs discovery on its existing reconcile cadence and pushes an updated watch-list when the model changes.

**`AlarmsOptions` (extended)**: new `Fallback` sub-section (below).

### Contract (`src/ZB.MOM.WW.MxGateway.Contracts/Protos/`)

**`mxaccess_gateway.proto`:**

- `enum AlarmProviderMode { ALARM_PROVIDER_MODE_UNSPECIFIED = 0; ALARMMGR = 1; SUBTAG = 2; }`
- New `AlarmFeedMessage` oneof case `AlarmProviderStatus provider_status`, carrying `{ AlarmProviderMode mode; bool degraded; string reason; google.protobuf.Timestamp since; }`.
  Emitted on stream open and on every change so a late-joining client immediately learns the mode.
- Add `bool degraded` + `AlarmProviderMode source_provider` to `OnAlarmTransitionEvent` **and** `ActiveAlarmSnapshot`, so per-item provenance is visible even mid-stream.

All additions use new field numbers, so they are backward compatible: existing clients ignore them and keep seeing alarms.

**`mxaccess_worker.proto`:**

> **Superseded:** these additions shipped in `mxaccess_gateway.proto`, not
> `mxaccess_worker.proto`. The worker imports the gateway proto and the alarm
> commands/events live there (`AlarmSubtagTarget`,
> `OnAlarmProviderModeChangedEvent`, the extended subscribe command).

- Extend the alarm-subscribe command with: `AlarmProviderMode forced_mode` (`UNSPECIFIED` = auto), `int32 consecutive_failure_threshold`, `int32 failback_probe_interval_seconds`, `int32 failback_stable_probes`, and `repeated AlarmSubtagTarget watch_list`, where `AlarmSubtagTarget = { string alarm_full_reference; string source_object_reference; string active_subtag; string acked_subtag; string ack_comment_subtag; string priority_subtag; }`.
- New worker→gateway event `AlarmProviderModeChanged { AlarmProviderMode mode; string reason; int32 hresult; google.protobuf.Timestamp at; }`.

> Generated code under `Generated/` and `clients/*/generated*/` is rebuilt from
> these `.proto` files and is never hand-edited. Every generated client touched by
> the contract is rebuilt per the source-update workflow.

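The thresholds carried on the subscribe command (`consecutive_failure_threshold`, `failback_stable_probes`) drive the `PrimaryActive`/`Degraded` state machine described under Components. A minimal, language-neutral sketch of just the counting logic follows, written in Python for brevity rather than the worker's .NET code. The class and method names (`FailoverStateMachine`, `on_primary_result`) are hypothetical, and the probe timer, snapshot hand-off and event emission are deliberately left out.

```python
from dataclasses import dataclass
from enum import Enum


class Mode(Enum):
    ALARMMGR = 1  # primary (wnwrap) is active
    SUBTAG = 2    # degraded: the subtag standby is active


@dataclass
class FailoverStateMachine:
    """Counts consecutive primary failures and clean probes (hypothetical helper)."""

    failure_threshold: int = 3  # mirrors ConsecutiveFailureThreshold
    stable_probes: int = 3      # mirrors FailbackStableProbes
    mode: Mode = Mode.ALARMMGR
    failures: int = 0
    clean_probes: int = 0

    def on_primary_result(self, ok: bool) -> bool:
        """Record one primary poll (or failback probe); return True if the mode switched."""
        if self.mode is Mode.ALARMMGR:
            # Any clean poll resets the consecutive-failure counter.
            self.failures = 0 if ok else self.failures + 1
            if self.failures >= self.failure_threshold:
                self.mode, self.failures, self.clean_probes = Mode.SUBTAG, 0, 0
                return True
            return False
        # Degraded: a failed probe resets the streak of clean probes.
        self.clean_probes = self.clean_probes + 1 if ok else 0
        if self.clean_probes >= self.stable_probes:
            self.mode, self.clean_probes = Mode.ALARMMGR, 0
            return True
        return False
```

In this sketch the caller would take the hand-off snapshot and raise `ProviderModeChanged` whenever `on_primary_result` returns `True`; keeping the counters free of COM and timer concerns is what makes the "fake wnwrap throwing after K polls" unit test straightforward.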
## Data flow

### Subtag synthesis rules

`SubtagAlarmConsumer` keeps last-known `(active, acked)` per watch-list entry and emits transitions on change:

| Subtag change | Emitted transition | Notes |
|---|---|---|
| `active` false → true | `RAISE` (state `UNACK_ALM`) | `original_raise_timestamp` = first-observed active time |
| `acked` false → true while `active` | `ACKNOWLEDGE` | `operator_user`/`operator_comment` from ack-comment subtag if advised |
| `active` true → false | `CLEAR` | maps to `AckRtn` if acked at clear, else `UnackRtn` |
| `active` stays true, re-alarm | `RETRIGGER` | **only** if a re-alarm counter subtag exists; otherwise not synthesized (documented limitation) |

Snapshot state mapping for `ActiveAlarmSnapshot.current_state`: `active && !acked → ACTIVE`, `active && acked → ACTIVE_ACKED`, `!active → INACTIVE`.

Field degradation in subtag mode:

- `alarm_full_reference`: from the watch-list entry (stable, drives ack-by-ref).
- Synthetic, deterministic GUID derived by hashing `alarm_full_reference` so GUID-based ack still resolves; flagged `degraded = true`.
- `severity`: from the priority subtag if advised, else 0.
- `original_raise_timestamp`: first-observed active time (best effort).
- `transition_timestamp`: the `OnDataChange` timestamp.
- `category`/`description`/`current_value`/`limit_value`: populated only if the corresponding subtag is advised; otherwise empty.

### Acknowledge

`AcknowledgeAlarm`/`AcknowledgeAlarmByName` are unchanged at the RPC surface. `AlarmDispatcher` calls into the consumer, which routes by active provider mode:

- **alarmmgr mode:** native wnwrap `AlarmAckByName`/`AlarmAckByGUID` (unchanged).
- **subtag mode:** resolve the target's `ack_comment_subtag`, `Write` the operator comment via the existing worker write path on the STA. No writable ack-comment subtag → `FailedPrecondition`.

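The synthesis table, the snapshot state mapping and the deterministic GUID from this section can be sketched together as pure functions. This is an illustrative sketch in Python, not the worker's implementation: the function names are hypothetical, the choice of SHA-256 truncated to 16 bytes is an assumption (the design only says the GUID is derived by hashing `alarm_full_reference`), and `RETRIGGER` is omitted because it depends on a re-alarm counter subtag.

```python
import hashlib
import uuid
from typing import Optional, Tuple


def synthetic_guid(alarm_full_reference: str) -> uuid.UUID:
    """Deterministic GUID for a reference (hash choice is an assumption)."""
    digest = hashlib.sha256(alarm_full_reference.encode("utf-8")).digest()
    return uuid.UUID(bytes=digest[:16])


def synthesize(prev: Tuple[bool, bool], curr: Tuple[bool, bool]) -> Optional[str]:
    """Map a change in (active, acked) to a transition label, or None if nothing is emitted."""
    (was_active, was_acked), (active, acked) = prev, curr
    if not was_active and active:
        return "RAISE"
    if was_active and not active:
        # Uses the acked value observed at the clear to pick the return state.
        return "CLEAR/AckRtn" if acked else "CLEAR/UnackRtn"
    if active and not was_acked and acked:
        return "ACKNOWLEDGE"
    return None


def snapshot_state(active: bool, acked: bool) -> str:
    """Snapshot state mapping for ActiveAlarmSnapshot.current_state."""
    if not active:
        return "INACTIVE"
    return "ACTIVE_ACKED" if acked else "ACTIVE"
```

Because the GUID depends only on the reference, `AcknowledgeByGuid` can map it back by hashing each watch-list entry, at the cost that every raise of the same alarm shares one GUID.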
### Provider-mode reflection

Worker `AlarmProviderModeChanged` → `GatewayAlarmMonitor` → (a) emit/refresh `AlarmFeedMessage.provider_status` to every `StreamAlarms` subscriber, (b) push to `/hubs/alarms`, (c) update metrics, (d) force a reconcile.

## Error handling

- **Both providers down** (subtag advise also failing): the monitor stays faulted and keeps retrying both; acknowledge returns `Unavailable`. There is no silent data loss: the feed reports degraded with a reason.
- **Empty watch-list in subtag mode** (GR SQL unavailable, no config override): log + metric `alarm_fallback_watchlist_empty`; the feed reports degraded + empty; the gateway keeps re-running discovery on its reconcile cadence and pushes an updated watch-list when one becomes available.
- **Switch hand-off:** every switch snapshots the now-active provider and reconciles against the gateway cache to avoid a raise/clear storm.
- **STA affinity:** all subtag advise/write and wnwrap probe calls run on the worker STA (reuse the existing affinity guard) to satisfy `ThreadingModel=Apartment`.

### Metrics

- `mxgateway_alarm_provider_mode` (gauge: 1 = alarmmgr, 2 = subtag)
- `mxgateway_alarm_provider_switch_total{from,to,reason}` (counter)
- `mxgateway_alarm_fallback_watchlist_size` (gauge)

> **Superseded:** the shipped meter names are `mxgateway.alarms.provider_mode`
> (gauge) and `mxgateway.alarms.provider_switches{from,to,reason}` (counter,
> `reason` bounded to `failover`/`failback`/`unknown`). The watch-list-size /
> watch-list-empty gauges were not implemented; an empty watch-list is surfaced
> via a warning log and the feed's degraded `ProviderStatus` instead.

## Configuration

```jsonc
"MxGateway": {
  "Alarms": {
    "Enabled": true,
    "SubscriptionExpression": "\\\\DESKTOP-6JL3KKO\\Galaxy!DEV",
    "DefaultArea": "DEV",
    "ReconcileIntervalSeconds": 30,
    "Fallback": {
      "Mode": "Auto",                  // Auto | ForceAlarmManager | ForceSubtag
      "ConsecutiveFailureThreshold": 3,
      "FailbackProbeIntervalSeconds": 30,
      "FailbackStableProbes": 3,
      "Discovery": {
        "UseGalaxyRepository": true,
        "Area": "",                    // defaults to Alarms.DefaultArea
        "IncludeAttributes": [],       // explicit additions
        "ExcludeAttributes": []
      },
      "Subtags": {
        "Active": "active",
        "Acked": "acked",
        "AckComment": "",              // verified against MXAccess analysis
        "Priority": "priority"
      }
    }
  }
}
```

`GatewayOptionsValidator` additions: `Mode = ForceSubtag` with empty discovery result and no explicit `IncludeAttributes` → startup validation warning; threshold/interval/probe values floored at sane minimums.

## Open item to confirm during implementation

The exact AVEVA subtag names (`.active`, `.acked`, the ack-comment attribute, priority) must be confirmed against the MXAccess analysis project (`C:\Users\dohertj2\Desktop\mxaccess`, `docs/MXAccess-Public-API.md`) and the live Galaxy before wiring `SubtagAlarmConsumer`. The config `Subtags` block exists precisely so the resolved names are not hard-coded.

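To illustrate why the subtag names live in config, the sketch below shows one way the discovery result, the `IncludeAttributes`/`ExcludeAttributes` overrides and the `Subtags` block might combine into watch-list targets. It is a Python illustration under stated assumptions, not the gateway's code: `SubtagNames` and `build_watch_list` are hypothetical names, excludes are assumed to win over includes, and the `*_subtag` fields are assumed to hold full dotted references (the design does not say whether they carry full references or bare names).

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass(frozen=True)
class SubtagNames:
    """Mirrors the config `Subtags` block; an empty string means 'not configured'."""

    active: str = "active"
    acked: str = "acked"
    ack_comment: str = ""
    priority: str = "priority"


def build_watch_list(
    discovered: Iterable[str],
    include: Iterable[str],
    exclude: Iterable[str],
    names: SubtagNames,
) -> List[Dict[str, str]]:
    """Merge discovered attributes with overrides and attach subtag references."""
    excluded = set(exclude)
    # dict.fromkeys de-duplicates while preserving first-seen order.
    refs = [r for r in dict.fromkeys([*discovered, *include]) if r not in excluded]

    def join(ref: str, subtag: str) -> str:
        # Assumption: a subtag is addressed as "<attribute>.<subtag>".
        return f"{ref}.{subtag}" if subtag else ""

    return [
        {
            "alarm_full_reference": ref,
            "active_subtag": join(ref, names.active),
            "acked_subtag": join(ref, names.acked),
            "ack_comment_subtag": join(ref, names.ack_comment),
            "priority_subtag": join(ref, names.priority),
        }
        for ref in refs
    ]
```

With the default `AckComment` of `""`, every target ends up with an empty `ack_comment_subtag`, which corresponds to the case where acknowledge in subtag mode returns `FailedPrecondition`.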
## Testing

| Layer | Tests |
|---|---|
| Worker unit (`MxGateway.Worker.Tests`, x86) | `SubtagAlarmConsumer` synthesis: feed `OnDataChange` sequences, assert raise/ack/clear transitions, snapshot states, degraded flag, synthetic-GUID stability, ack-comment write routing |
| Worker unit | `FailoverAlarmConsumer` state machine with a fake wnwrap throwing after K polls: assert switch at threshold, failback after stable probes, `ProviderModeChanged` emitted, no duplicate transitions across switch (hand-off reconcile) |
| Gateway unit (`MxGateway.Tests`, fake worker) | discovery + config-override merge; `GatewayAlarmMonitor` reflects mode into feed + hub; metrics increment on switch |
| Contract | proto round-trip for new fields; existing alarm tests unchanged (alarmmgr-mode regression; parity preserved) |
| Live (opt-in, `MXGATEWAY_RUN_LIVE_MXACCESS_TESTS=1`) | real subtag advise + ack-comment write against a live alarm; GR SQL discovery query against the `ZB` DB (gated like existing GR tests) |

## Docs to update in the same change

`gateway.md` (alarm provider section), `docs/DesignDecisions.md` (record the fallback decision), `docs/GatewayConfiguration.md` (the `Fallback` block), `docs/AlarmClientDiscovery.md` (subtag provider + synthesis rules), `docs/Grpc.md` (the new `provider_status` / `degraded` fields), and any client READMEs whose generated alarm types gain fields.