diff --git a/docs/plans/2026-06-13-alarm-subtag-fallback-design.md b/docs/plans/2026-06-13-alarm-subtag-fallback-design.md
new file mode 100644
index 0000000..a1382b1
--- /dev/null
+++ b/docs/plans/2026-06-13-alarm-subtag-fallback-design.md
@@ -0,0 +1,302 @@
+# Alarm Subtag-Monitoring Fallback — Design
+
+**Date:** 2026-06-13
+**Status:** Approved (brainstorming), ready for implementation planning
+**Branch:** `feat/alarm-subtag-fallback`
+
+## Problem
+
+The gateway's central alarm feed (`GatewayAlarmMonitor` → worker
+`WnWrapAlarmConsumer`) depends on the AVEVA wnwrap COM consumer
+(`WNWRAPCONSUMERLib.wwAlarmConsumerClass`), which polls `GetXmlCurrentAlarms2`
+on the worker STA. That provider can fail at the COM boundary (the older
+`aaAlarmManagedClient` crashed on FILETIME marshaling; wnwrap can still return
+failure HRESULTs or throw `COMException`). When it does, the gateway loses all
+alarm visibility.
+
+This design adds a **second alarm source** — direct monitoring of each alarm
+attribute's subtags (`.active`, `.acked`, …) via the existing MXAccess
+`AddItem`/`Advise` pipeline — and **fails over to it automatically when the
+wnwrap provider breaks, then fails back automatically when it recovers**. The
+subtag source can also be forced on by config.
+
+## Decisions (locked during brainstorming)
+
+| Decision | Choice |
+|---|---|
+| Failover model | **Auto-failover + auto-failback** (both directions, runtime) |
+| Watch-list source | **Galaxy Repository SQL discovery + config override** |
+| Acknowledge in subtag mode | **Write the operator comment to the alarm's ack-comment subtag** (the write performs the ack) |
+| Failure signal | **N consecutive wnwrap COM failures** (Subscribe / `GetXmlCurrentAlarms2` throws or returns a failure HRESULT) |
+| Degraded-state visibility | **Both** — explicit field in the gRPC contract **and** dashboard + metrics |
+| Synthesis location | **Worker-side** (`Approach A`) — keeps the parity rule "the gateway forwards only events the worker emits; it never synthesizes events" |
+
+## Core principle
+
+Subtag monitoring is, by definition, a **non-parity, lower-fidelity** alarm
+source: it synthesizes alarm transitions from raw data changes, has no native
+alarm GUID, no native original-raise timestamp, and a narrower field set. Per
+`CLAUDE.md`, synthesizing events is allowed only as an explicit opt-in
+non-parity mode. This design satisfies that by (a) doing the synthesis **inside
+the worker** (so the gateway still only forwards worker-emitted events) and
+(b) marking every degraded event and the whole feed as degraded so no client
+mistakes it for the authoritative alarmmgr feed.
+
+## Architecture
+
+```
+                         GATEWAY (.NET 10, x64)
+  ┌───────────────────────────────────────────────────────────────────┐
+  │ GatewayAlarmMonitor (BackgroundService)                           │
+  │   • resolves watch-list: Galaxy Repository SQL + config override  │
+  │   • arms the worker with the watch-list at subscribe time         │
+  │   • consumes AlarmProviderModeChanged → reflects mode into feed,  │
+  │     /hubs/alarms dashboard hub, and metrics                       │
+  │   • forces a cache reconcile (QueryActiveAlarms) on every switch  │
+  └─────────────────────────────────┬─────────────────────────────────┘
+                                    │ IPC (WorkerEnvelope frames)
+                                    │   · SubscribeAlarms{ watch_list, failover cfg }
+                                    │   · AlarmProviderModeChanged{ mode, reason, hresult }
+                                    │   · OnAlarmTransitionEvent (degraded flag set in subtag mode)
+                                    ▼
+                     WORKER (.NET FW 4.8, x86, STA)
+  ┌───────────────────────────────────────────────────────────────────┐
+  │ AlarmDispatcher → FailoverAlarmConsumer : IMxAccessAlarmConsumer  │
+  │   ├─ primary : WnWrapAlarmConsumer (wnwrap COM poll, unchanged)   │
+  │   └─ standby : SubtagAlarmConsumer (AddItem/Advise on subtags)    │
+  │                                                                   │
+  │ FailoverAlarmConsumer owns the state machine:                     │
+  │   PrimaryActive ──(N consecutive wnwrap COM failures)──▶ Degraded │
+  │   Degraded ──(M consecutive clean wnwrap probe polls)──▶ Primary  │
+  │   on each switch: snapshot the now-active provider, hand off      │
+  └───────────────────────────────────────────────────────────────────┘
+```
+
+The failover state machine lives **worker-local** so the switch is instant — no
+IPC round-trip at the moment alarmmgr dies. The gateway *arms* the standby
+consumer up front (passes the watch-list at subscribe time) so it is ready
+before it is ever needed.
+
+## Components
+
+### Worker (`src/ZB.MOM.WW.MxGateway.Worker/MxAccess/`)
+
+**`SubtagAlarmConsumer : IMxAccessAlarmConsumer` (new)** — the standby provider.
+
+- On `Subscribe`, instead of wnwrap registration it `AddItem`/`Advise`s the
+  configured subtags for each watch-list entry on the existing STA (reuses the
+  worker's item-subscription machinery). Per attribute it advises at minimum
+  `.active` and `.acked`; optionally `.priority`/severity, `.descr`, value/limit
+  if present.
+- Converts each `OnDataChange` into the same `MxAlarmTransitionEvent` the wnwrap
+  consumer emits, via the synthesis rules below, and raises
+  `AlarmTransitionEmitted`. Marks each as **degraded**.
+- `SnapshotActiveAlarms()` returns the currently-active set computed from
+  last-known subtag values.
+- `AcknowledgeByName(...)` resolves the watch-list entry's ack-comment subtag and
+  issues a `Write(comment)` on the STA. `AcknowledgeByGuid(...)` maps the
+  synthetic GUID (see below) back to a reference, then does the same. If the
+  attribute exposes no writable ack-comment subtag, returns a failure code that
+  the gateway surfaces as `FailedPrecondition`.
+- `PollOnce()` is a no-op (subtag mode is event-driven via Advise).
+
+**`FailoverAlarmConsumer : IMxAccessAlarmConsumer` (new)** — composite + state
+machine. Owns the wnwrap consumer (primary) and the subtag consumer (standby),
+forwards `AlarmTransitionEmitted` from whichever child is active, and raises a
+new `ProviderModeChanged` event on every switch.
+
+- **Failure counting:** wraps `Subscribe`/`PollOnce` on the primary; a thrown
+  `COMException` or a failure HRESULT increments a consecutive-failure counter,
+  reset to zero on any clean poll.
+- **Failover** (`PrimaryActive → Degraded`): at `ConsecutiveFailureThreshold`
+  (default 3), ensures the standby is subscribed (it was armed at startup), sets
+  active = standby, snapshots the standby's active set for hand-off, and emits
+  `ProviderModeChanged(SUBTAG, reason, hresult)`.
+- **Failback probe** (`Degraded → PrimaryActive`): while degraded, every
+  `FailbackProbeIntervalSeconds` (default 30) it re-attempts wnwrap
+  `Subscribe`+`PollOnce` on the STA. After `FailbackStableProbes` (default 3)
+  consecutive clean polls it switches active = primary, returns the standby to
+  standby, and emits `ProviderModeChanged(ALARMMGR, "recovered")`.
+- **Hand-off:** on every switch it takes `SnapshotActiveAlarms()` from the
+  now-active provider so the gateway can reconcile and avoid spurious
+  raise/clear storms.
+
+**`AlarmDispatcher` / `MxAccessAlarmEventSink` / `AlarmCommandHandler`
+(changed, minimal)** — `AlarmDispatcher` holds a `FailoverAlarmConsumer` instead
+of a bare `WnWrapAlarmConsumer`; it subscribes to `ProviderModeChanged` and
+enqueues a mode-changed worker event. The ack path routes by active mode (native
+wnwrap ack in alarmmgr mode; ack-comment write in subtag mode), but that routing
+is entirely inside the consumer — the dispatcher just calls
+`AcknowledgeByName`/`AcknowledgeByGuid`.
+
+### Gateway (`src/ZB.MOM.WW.MxGateway.Server/`)
+
+**Galaxy Repository discovery (new query)** — alongside the existing GR SQL
+browse RPCs, a query "attributes that have alarms configured, with their
+ack-comment subtag and area", scoped to the configured area. Merged with the
+config override (explicit includes/excludes). Produces the watch-list of
+`AlarmSubtagTarget`s.
+
+**`GatewayAlarmMonitor` (changed)** — resolves the watch-list at subscribe time
+and passes it to the worker; consumes `AlarmProviderModeChanged` and reflects
+the current provider mode into (a) the `AlarmFeedMessage` provider-status,
+(b) the `/hubs/alarms` dashboard hub, and (c) metrics; forces a reconcile
+(`QueryActiveAlarms`) on every switch. Re-runs discovery on its existing
+reconcile cadence and pushes an updated watch-list when the model changes.
+
+**`AlarmsOptions` (extended)** — new `Fallback` sub-section (below).
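
The `FailoverAlarmConsumer` transitions described above can be sketched as a small counter-driven state machine. This is a minimal, language-neutral illustration in Python, not the worker's C# implementation: the class name `FailoverStateMachine`, its method names, and the `"wnwrap failure"` reason string are hypothetical, and only the default thresholds (3 failures, 3 stable probes) and the `"recovered"` reason come from the design.

```python
from dataclasses import dataclass, field
from enum import Enum


class Mode(Enum):
    ALARMMGR = 1  # primary: wnwrap COM consumer
    SUBTAG = 2    # standby: subtag monitoring (degraded)


@dataclass
class FailoverStateMachine:
    """Hypothetical sketch of the PrimaryActive <-> Degraded transitions."""
    failure_threshold: int = 3   # ConsecutiveFailureThreshold default
    stable_probes: int = 3       # FailbackStableProbes default
    mode: Mode = Mode.ALARMMGR
    consecutive_failures: int = 0
    consecutive_clean_probes: int = 0
    switches: list = field(default_factory=list)  # recorded mode changes

    def on_primary_poll(self, ok: bool, hresult: int = 0) -> None:
        """Record one primary Subscribe/PollOnce outcome while PrimaryActive."""
        if self.mode is not Mode.ALARMMGR:
            return
        if ok:
            self.consecutive_failures = 0  # any clean poll resets the counter
            return
        self.consecutive_failures += 1
        if self.consecutive_failures >= self.failure_threshold:
            self._switch(Mode.SUBTAG, "wnwrap failure", hresult)

    def on_failback_probe(self, ok: bool) -> None:
        """Record one wnwrap probe outcome while Degraded."""
        if self.mode is not Mode.SUBTAG:
            return
        self.consecutive_clean_probes = self.consecutive_clean_probes + 1 if ok else 0
        if self.consecutive_clean_probes >= self.stable_probes:
            self._switch(Mode.ALARMMGR, "recovered", 0)

    def _switch(self, to: Mode, reason: str, hresult: int) -> None:
        # In the real consumer this is also where the now-active provider
        # would be snapshotted and ProviderModeChanged raised.
        self.mode = to
        self.consecutive_failures = 0
        self.consecutive_clean_probes = 0
        self.switches.append((to, reason, hresult))
```

Because both counters track consecutive results, a single clean poll cancels a pending failover and a single failed probe restarts the failback count, which is what keeps the provider from flapping on intermittent COM errors.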
+
+### Contract (`src/ZB.MOM.WW.MxGateway.Contracts/Protos/`)
+
+**`mxaccess_gateway.proto`:**
+
+- `enum AlarmProviderMode { ALARM_PROVIDER_MODE_UNSPECIFIED = 0; ALARMMGR = 1; SUBTAG = 2; }`
+- New `AlarmFeedMessage` oneof case `AlarmProviderStatus provider_status`,
+  carrying `{ AlarmProviderMode mode; bool degraded; string reason;
+  google.protobuf.Timestamp since; }`. Emitted on stream open and on every
+  change so a late-joining client immediately learns the mode.
+- Add `bool degraded` + `AlarmProviderMode source_provider` to
+  `OnAlarmTransitionEvent` **and** `ActiveAlarmSnapshot`, so per-item provenance
+  is visible even mid-stream. All additions are new field numbers — backward
+  compatible; existing clients ignore them and keep seeing alarms.
+
+**`mxaccess_worker.proto`:**
+
+- Extend the alarm-subscribe command with: `AlarmProviderMode forced_mode`
+  (`UNSPECIFIED` = auto), `int32 consecutive_failure_threshold`,
+  `int32 failback_probe_interval_seconds`, `int32 failback_stable_probes`, and
+  `repeated AlarmSubtagTarget watch_list`, where `AlarmSubtagTarget =
+  { string alarm_full_reference; string source_object_reference;
+  string active_subtag; string acked_subtag; string ack_comment_subtag;
+  string priority_subtag; }`.
+- New worker→gateway event `AlarmProviderModeChanged { AlarmProviderMode mode;
+  string reason; int32 hresult; google.protobuf.Timestamp at; }`.
+
+> Generated code under `Generated/` and `clients/*/generated*/` is rebuilt from
+> these `.proto` files — never hand-edited. Every generated client touched by
+> the contract is rebuilt per the source-update workflow.
+
+## Data flow
+
+### Subtag synthesis rules
+
+`SubtagAlarmConsumer` keeps last-known `(active, acked)` per watch-list entry and
+emits transitions on change:
+
+| Subtag change | Emitted transition | Notes |
+|---|---|---|
+| `active` false → true | `RAISE` (state `UNACK_ALM`) | `original_raise_timestamp` = first-observed active time |
+| `acked` false → true while `active` | `ACKNOWLEDGE` | `operator_user`/`operator_comment` from ack-comment subtag if advised |
+| `active` true → false | `CLEAR` | maps to `AckRtn` if acked at clear, else `UnackRtn` |
+| `active` stays true, re-alarm | `RETRIGGER` | **only** if a re-alarm counter subtag exists; otherwise not synthesized (documented limitation) |
+
+Snapshot state mapping for `ActiveAlarmSnapshot.current_state`:
+`active && !acked → ACTIVE`, `active && acked → ACTIVE_ACKED`,
+`!active → INACTIVE`.
+
+Field degradation in subtag mode:
+- `alarm_full_reference` — from the watch-list entry (stable, drives ack-by-ref).
+- Synthetic, deterministic GUID derived by hashing `alarm_full_reference` so
+  GUID-based ack still resolves; flagged `degraded = true`.
+- `severity` — from the priority subtag if advised, else 0.
+- `original_raise_timestamp` — first-observed active time (best effort).
+- `transition_timestamp` — the `OnDataChange` timestamp.
+- `category`/`description`/`current_value`/`limit_value` — populated only if the
+  corresponding subtag is advised; otherwise empty.
+
+### Acknowledge
+
+`AcknowledgeAlarm`/`AcknowledgeAlarmByName` are unchanged at the RPC surface.
+`AlarmDispatcher` calls `AcknowledgeByName`/`AcknowledgeByGuid` as before, and
+`FailoverAlarmConsumer` routes by active provider mode:
+- **alarmmgr mode:** native wnwrap `AlarmAckByName`/`AlarmAckByGUID` (unchanged).
+- **subtag mode:** resolve the target's `ack_comment_subtag`, `Write` the
+  operator comment via the existing worker write path on the STA. No writable
+  ack-comment subtag → `FailedPrecondition`.
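
The synthesis table and snapshot mapping above can be illustrated with a short sketch. It is written in Python for brevity and is not the worker code: `SubtagState`, `on_data_change`, `snapshot_state`, and `synthetic_guid` are hypothetical names, the `CLEAR:AckRtn` string encoding is made up for the example, and using a name-based UUID (`uuid5`) as the hash is an assumption, since the design only says the GUID is derived by hashing `alarm_full_reference`. `RETRIGGER` is omitted because it depends on an optional re-alarm counter subtag.

```python
import uuid
from dataclasses import dataclass


@dataclass
class SubtagState:
    """Last-known (active, acked) for one watch-list entry."""
    active: bool = False
    acked: bool = False


def synthetic_guid(alarm_full_reference: str) -> uuid.UUID:
    # Assumption: a name-based UUID gives a deterministic GUID, so the same
    # reference always maps to the same GUID across restarts.
    return uuid.uuid5(uuid.NAMESPACE_URL, alarm_full_reference)


def on_data_change(state: SubtagState, subtag: str, value: bool) -> list[str]:
    """Apply one subtag change and return the transitions it synthesizes."""
    transitions = []
    if subtag == "active":
        if value and not state.active:
            transitions.append("RAISE")
        elif state.active and not value:
            # CLEAR maps to AckRtn if acked at clear time, else UnackRtn.
            transitions.append("CLEAR:AckRtn" if state.acked else "CLEAR:UnackRtn")
        state.active = value
    elif subtag == "acked":
        # ACKNOWLEDGE only for acked false -> true while the alarm is active.
        if value and not state.acked and state.active:
            transitions.append("ACKNOWLEDGE")
        state.acked = value
    return transitions


def snapshot_state(state: SubtagState) -> str:
    """Map last-known subtag values to ActiveAlarmSnapshot.current_state."""
    if not state.active:
        return "INACTIVE"
    return "ACTIVE_ACKED" if state.acked else "ACTIVE"
```

Feeding it `active=True`, `acked=True`, `active=False` in that order yields `RAISE`, `ACKNOWLEDGE`, and `CLEAR:AckRtn`, and a repeated `active=True` yields nothing, which is the property the worker unit tests in the Testing section would assert.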
+
+### Provider-mode reflection
+
+Worker `AlarmProviderModeChanged` → `GatewayAlarmMonitor` → (a) emit/refresh
+`AlarmFeedMessage.provider_status` to every `StreamAlarms` subscriber, (b) push
+to `/hubs/alarms`, (c) update metrics, (d) force a reconcile.
+
+## Error handling
+
+- **Both providers down** (subtag advise also failing): the monitor stays
+  faulted and keeps retrying both; acknowledge returns `Unavailable`. No silent
+  data loss — the feed reports degraded with reason.
+- **Empty watch-list in subtag mode** (GR SQL unavailable, no config override):
+  log + metric `alarm_fallback_watchlist_empty`; the feed reports degraded +
+  empty; the gateway keeps re-running discovery on its reconcile cadence and
+  pushes an updated watch-list when one becomes available.
+- **Switch hand-off:** every switch snapshots the now-active provider and
+  reconciles against the gateway cache to avoid a raise/clear storm.
+- **STA affinity:** all subtag advise/write and wnwrap probe calls run on the
+  worker STA (reuse the existing affinity guard) to satisfy
+  `ThreadingModel=Apartment`.
+
+### Metrics
+
+- `mxgateway_alarm_provider_mode` (gauge: 1 = alarmmgr, 2 = subtag)
+- `mxgateway_alarm_provider_switch_total{from,to,reason}` (counter)
+- `mxgateway_alarm_fallback_watchlist_size` (gauge)
+
+## Configuration
+
+```jsonc
+"MxGateway": {
+  "Alarms": {
+    "Enabled": true,
+    "SubscriptionExpression": "\\\\DESKTOP-6JL3KKO\\Galaxy!DEV",
+    "DefaultArea": "DEV",
+    "ReconcileIntervalSeconds": 30,
+    "Fallback": {
+      "Mode": "Auto",                   // Auto | ForceAlarmManager | ForceSubtag
+      "ConsecutiveFailureThreshold": 3,
+      "FailbackProbeIntervalSeconds": 30,
+      "FailbackStableProbes": 3,
+      "Discovery": {
+        "UseGalaxyRepository": true,
+        "Area": "",                     // defaults to Alarms.DefaultArea
+        "IncludeAttributes": [],        // explicit additions
+        "ExcludeAttributes": []
+      },
+      "Subtags": {
+        "Active": "active",
+        "Acked": "acked",
+        "AckComment": "",               // to be verified against the MXAccess analysis (see open item)
+        "Priority": "priority"
+      }
+    }
+  }
+}
+```
+
+`GatewayOptionsValidator` additions: `Mode = ForceSubtag` with empty discovery
+result and no explicit `IncludeAttributes` → startup validation warning;
+threshold/interval/probe values floored at sane minimums.
+
+## Open item to confirm during implementation
+
+The exact AVEVA subtag names (`.active`, `.acked`, the ack-comment attribute,
+priority) must be confirmed against the MXAccess analysis project
+(`C:\Users\dohertj2\Desktop\mxaccess`, `docs/MXAccess-Public-API.md`) and the
+live Galaxy before wiring `SubtagAlarmConsumer`. The config `Subtags` block
+exists precisely so the resolved names are not hard-coded.
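
The watch-list merge and the validator flooring from the Configuration section might be resolved along these lines. This is a hedged Python sketch, not the gateway's C# code: `merge_watch_list`, `floor_options`, and the `MINIMUMS` values are hypothetical, and letting `ExcludeAttributes` win over `IncludeAttributes` is an assumption the design does not state.

```python
# Hypothetical floors; the design only says "sane minimums".
MINIMUMS = {
    "ConsecutiveFailureThreshold": 1,
    "FailbackProbeIntervalSeconds": 1,
    "FailbackStableProbes": 1,
}


def merge_watch_list(discovered, include, exclude):
    """Discovery results plus explicit includes, minus explicit excludes.

    Order-stable and de-duplicated. Assumption: an excluded reference is
    dropped even if it is also listed in the includes.
    """
    excluded = set(exclude)
    seen = set()
    merged = []
    for ref in [*discovered, *include]:
        if ref in excluded or ref in seen:
            continue
        seen.add(ref)
        merged.append(ref)
    return merged


def floor_options(options):
    """Return a copy of the options with each known value raised to its floor."""
    floored = dict(options)
    for key, minimum in MINIMUMS.items():
        if key in floored:
            floored[key] = max(floored[key], minimum)
    return floored
```

An empty result from `merge_watch_list` under `Mode = ForceSubtag` is the case the validator would flag with a startup warning.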
+
+## Testing
+
+| Layer | Tests |
+|---|---|
+| Worker unit (`MxGateway.Worker.Tests`, x86) | `SubtagAlarmConsumer` synthesis — feed `OnDataChange` sequences, assert raise/ack/clear transitions, snapshot states, degraded flag, synthetic-GUID stability, ack-comment write routing |
+| Worker unit | `FailoverAlarmConsumer` state machine — fake wnwrap throwing after K polls: assert switch at threshold, failback after stable probes, `ProviderModeChanged` emitted, no duplicate transitions across switch (hand-off reconcile) |
+| Gateway unit (`MxGateway.Tests`, fake worker) | discovery + config-override merge; `GatewayAlarmMonitor` reflects mode into feed + hub; metrics increment on switch |
+| Contract | proto round-trip for new fields; existing alarm tests unchanged (alarmmgr-mode regression — parity preserved) |
+| Live (opt-in, `MXGATEWAY_RUN_LIVE_MXACCESS_TESTS=1`) | real subtag advise + ack-comment write against a live alarm; GR SQL discovery query against the `ZB` DB (gated like existing GR tests) |
+
+## Docs to update in the same change
+
+`gateway.md` (alarm provider section), `docs/DesignDecisions.md` (record the
+fallback decision), `docs/GatewayConfiguration.md` (the `Fallback` block),
+`docs/AlarmClientDiscovery.md` (subtag provider + synthesis rules),
+`docs/Grpc.md` (the new `provider_status` / `degraded` fields), and any client
+READMEs whose generated alarm types gain fields.