Alarm Subtag-Monitoring Fallback — Design
Date: 2026-06-13
Status: Superseded by implementation (merged to main). This is the original
brainstorming design; a few details below were refined during implementation —
see the inline Superseded notes. The shipped behaviour is documented in
docs/AlarmClientDiscovery.md, the client READMEs, and the contracts.
Branch: feat/alarm-subtag-fallback
Problem
The gateway's central alarm feed (GatewayAlarmMonitor → worker
WnWrapAlarmConsumer) depends on the AVEVA wnwrap COM consumer
(WNWRAPCONSUMERLib.wwAlarmConsumerClass), which polls GetXmlCurrentAlarms2
on the worker STA. That provider can fail at the COM boundary (the older
aaAlarmManagedClient crashed on FILETIME marshaling; wnwrap can still return
failure HRESULTs or throw COMException). When it does, the gateway loses all
alarm visibility.
This design adds a second alarm source — direct monitoring of each alarm
attribute's subtags (.active, .acked, …) via the existing MXAccess
AddItem/Advise pipeline — and fails over to it automatically when the
wnwrap provider breaks, then fails back automatically when it recovers. The
subtag source can also be forced on by config.
Decisions (locked during brainstorming)
| Decision | Choice |
|---|---|
| Failover model | Auto-failover + auto-failback (both directions, runtime) |
| Watch-list source | Galaxy Repository SQL discovery + config override |
| Acknowledge in subtag mode | Write the operator comment to the alarm's ack-comment subtag (the write performs the ack) |
| Failure signal | N consecutive wnwrap COM failures (Subscribe / GetXmlCurrentAlarms2 throws or returns a failure HRESULT) |
| Degraded-state visibility | Both — explicit field in the gRPC contract and dashboard + metrics |
| Synthesis location | Worker-side (Approach A) — keeps the parity rule "the gateway forwards only events the worker emits; it never synthesizes events" |
Core principle
Subtag monitoring is, by definition, a non-parity, lower-fidelity alarm
source: it synthesizes alarm transitions from raw data changes, has no native
alarm GUID, no native original-raise timestamp, and a narrower field set. Per
CLAUDE.md, synthesizing events is allowed only as an explicit opt-in
non-parity mode. This design satisfies that by (a) doing the synthesis inside
the worker (so the gateway still only forwards worker-emitted events) and
(b) marking every degraded event and the whole feed as degraded so no client
mistakes it for the authoritative alarmmgr feed.
Architecture
GATEWAY (.NET 10, x64)
┌─────────────────────────────────────────────────────────────────┐
│ GatewayAlarmMonitor (BackgroundService) │
│ • resolves watch-list: Galaxy Repository SQL + config override │
│ • arms the worker with the watch-list at subscribe time │
│ • consumes AlarmProviderModeChanged → reflects mode into feed, │
│ /hubs/alarms dashboard hub, and metrics │
│ • forces a cache reconcile (QueryActiveAlarms) on every switch │
└───────────────────────────────┬───────────────────────────────────┘
│ IPC (WorkerEnvelope frames)
│ · SubscribeAlarms{ watch_list, failover cfg }
│ · AlarmProviderModeChanged{ mode, reason, hresult }
│ · OnAlarmTransitionEvent (degraded flag set in subtag mode)
▼
WORKER (.NET FW 4.8, x86, STA)
┌─────────────────────────────────────────────────────────────────┐
│ AlarmDispatcher → FailoverAlarmConsumer : IMxAccessAlarmConsumer │
│ ├─ primary : WnWrapAlarmConsumer (wnwrap COM poll, unchanged) │
│ └─ standby : SubtagAlarmConsumer (AddItem/Advise on subtags) │
│ │
│ FailoverAlarmConsumer owns the state machine: │
│ PrimaryActive ──(N consecutive wnwrap COM failures)──▶ Degraded │
│ Degraded ──(M consecutive clean wnwrap probe polls)──▶ Primary │
│ on each switch: snapshot the now-active provider, hand off │
└─────────────────────────────────────────────────────────────────┘
The failover state machine lives worker-local so the switch is instant — no IPC round-trip at the moment alarmmgr dies. The gateway arms the standby consumer up front (passes the watch-list at subscribe time) so it is ready before it is ever needed.
Components
Worker (src/ZB.MOM.WW.MxGateway.Worker/MxAccess/)
SubtagAlarmConsumer : IMxAccessAlarmConsumer (new) — the standby provider.
- On Subscribe, instead of wnwrap registration it AddItem/Advises the configured subtags for each watch-list entry on the existing STA (reuses the worker's item-subscription machinery). Per attribute it advises at minimum .active and .acked; optionally .priority/severity, .descr, value/limit if present.
- Converts each OnDataChange into the same MxAlarmTransitionEvent the wnwrap consumer emits, via the synthesis rules below, and raises AlarmTransitionEmitted. Marks each as degraded.
- SnapshotActiveAlarms() returns the currently-active set computed from last-known subtag values.
- AcknowledgeByName(...) resolves the watch-list entry's ack-comment subtag and issues a Write(comment) on the STA. AcknowledgeByGuid(...) maps the synthetic GUID (see below) back to a reference, then does the same. If the attribute exposes no writable ack-comment subtag, returns a failure code that the gateway surfaces as FailedPrecondition.
- PollOnce() is a no-op (subtag mode is event-driven via Advise).
FailoverAlarmConsumer : IMxAccessAlarmConsumer (new) — composite + state
machine. Owns the wnwrap consumer (primary) and the subtag consumer (standby),
forwards AlarmTransitionEmitted from whichever child is active, and raises a
new ProviderModeChanged event on every switch.
- Failure counting: wraps Subscribe/PollOnce on the primary; a thrown COMException or a failure HRESULT increments a consecutive-failure counter, reset to zero on any clean poll.
- Failover (PrimaryActive → Degraded): at ConsecutiveFailureThreshold (default 3), ensures the standby is subscribed (it was armed at startup), sets active = standby, snapshots the standby's active set for hand-off, and emits ProviderModeChanged(SUBTAG, reason, hresult).
- Failback probe (Degraded → PrimaryActive): while degraded, every FailbackProbeIntervalSeconds (default 30) it re-attempts wnwrap Subscribe + PollOnce on the STA. After FailbackStableProbes (default 3) consecutive clean polls it switches active = primary, returns the standby to standby, and emits ProviderModeChanged(ALARMMGR, "recovered").
- Hand-off: on every switch it takes SnapshotActiveAlarms() from the now-active provider so the gateway can reconcile and avoid spurious raise/clear storms.
AlarmDispatcher / MxAccessAlarmEventSink / AlarmCommandHandler
(changed, minimal) — AlarmDispatcher holds a FailoverAlarmConsumer instead
of a bare WnWrapAlarmConsumer; it subscribes to ProviderModeChanged and
enqueues a mode-changed worker event. The ack path routes by active mode (native
wnwrap ack in alarmmgr mode; ack-comment write in subtag mode), but that routing
is entirely inside the consumer — the dispatcher just calls
AcknowledgeByName/AcknowledgeByGuid.
Gateway (src/ZB.MOM.WW.MxGateway.Server/)
Galaxy Repository discovery (new query) — alongside the existing GR SQL
browse RPCs, a query "attributes that have alarms configured, with their
ack-comment subtag and area", scoped to the configured area. Merged with the
config override (explicit includes/excludes). Produces the watch-list of
AlarmSubtagTargets.
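As a rough illustration of the merge step, the sketch below combines a discovery result with the explicit includes and excludes. It is written in Python for brevity and works on plain reference strings rather than AlarmSubtagTarget records. The function name is hypothetical, and two behaviours are assumptions rather than decisions recorded here: excludes win over includes, and references are compared case-sensitively.

```python
def merge_watch_list(discovered, include, exclude):
    """Discovered references plus explicit includes, minus excludes.

    Order is preserved and duplicates are dropped. Excludes take precedence
    over includes (an assumption of this sketch).
    """
    excluded = set(exclude)
    seen = set()
    merged = []
    for ref in list(discovered) + list(include):
        if ref in excluded or ref in seen:
            continue
        seen.add(ref)
        merged.append(ref)
    return merged
```

With this shape, an unavailable Galaxy Repository simply yields an empty discovered list, and the explicit includes still produce a usable watch-list.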
GatewayAlarmMonitor (changed) — resolves the watch-list at subscribe time
and passes it to the worker; consumes AlarmProviderModeChanged and reflects
the current provider mode into (a) the AlarmFeedMessage provider-status,
(b) the /hubs/alarms dashboard hub, and (c) metrics; forces a reconcile
(QueryActiveAlarms) on every switch. Re-runs discovery on its existing
reconcile cadence and pushes an updated watch-list when the model changes.
AlarmsOptions (extended) — new Fallback sub-section (below).
Contract (src/ZB.MOM.WW.MxGateway.Contracts/Protos/)
mxaccess_gateway.proto:
- enum AlarmProviderMode { ALARM_PROVIDER_MODE_UNSPECIFIED = 0; ALARMMGR = 1; SUBTAG = 2; }
- New AlarmFeedMessage oneof case AlarmProviderStatus provider_status, carrying { AlarmProviderMode mode; bool degraded; string reason; google.protobuf.Timestamp since; }. Emitted on stream open and on every change so a late-joining client immediately learns the mode.
- Add bool degraded + AlarmProviderMode source_provider to OnAlarmTransitionEvent and ActiveAlarmSnapshot, so per-item provenance is visible even mid-stream.
All additions are new field numbers, so they are backward compatible; existing clients ignore them and keep seeing alarms.
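The "emitted on stream open and on every change" behaviour can be pictured as a feed that replays the current status to each new subscriber. The Python sketch below is illustrative only: the class name is hypothetical, subscribers are modelled as plain lists instead of gRPC streams, and suppressing a publish that does not change the status is an assumption of the sketch.

```python
class ProviderStatusFeed:
    """Hypothetical sketch: replay the current status to late joiners."""

    def __init__(self, initial_status):
        self._current = initial_status
        self._subscribers = []

    def subscribe(self):
        # A new subscriber receives the current status first, so it learns
        # the provider mode without waiting for the next switch.
        inbox = [self._current]
        self._subscribers.append(inbox)
        return inbox

    def publish(self, status):
        if status == self._current:
            return  # nothing changed; do not re-emit (assumption)
        self._current = status
        for inbox in self._subscribers:
            inbox.append(status)
```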
mxaccess_worker.proto:
Superseded: these additions shipped in mxaccess_gateway.proto, not mxaccess_worker.proto; the worker imports the gateway proto and the alarm commands/events live there (AlarmSubtagTarget, OnAlarmProviderModeChangedEvent, the extended subscribe command).
- Extend the alarm-subscribe command with: AlarmProviderMode forced_mode (UNSPECIFIED = auto), int32 consecutive_failure_threshold, int32 failback_probe_interval_seconds, int32 failback_stable_probes, and repeated AlarmSubtagTarget watch_list, where AlarmSubtagTarget = { string alarm_full_reference; string source_object_reference; string active_subtag; string acked_subtag; string ack_comment_subtag; string priority_subtag; }.
- New worker→gateway event AlarmProviderModeChanged { AlarmProviderMode mode; string reason; int32 hresult; google.protobuf.Timestamp at; }.
Generated code under Generated/ and clients/*/generated*/ is rebuilt from these .proto files and is never hand-edited. Every generated client touched by the contract is rebuilt per the source-update workflow.
Data flow
Subtag synthesis rules
SubtagAlarmConsumer keeps last-known (active, acked) per watch-list entry and
emits transitions on change:
| Subtag change | Emitted transition | Notes |
|---|---|---|
| active false → true | RAISE (state UNACK_ALM) | original_raise_timestamp = first-observed active time |
| acked false → true while active | ACKNOWLEDGE | operator_user/operator_comment from ack-comment subtag if advised |
| active true → false | CLEAR | maps to AckRtn if acked at clear, else UnackRtn |
| active stays true, re-alarm | RETRIGGER | only if a re-alarm counter subtag exists; otherwise not synthesized (documented limitation) |
Snapshot state mapping for ActiveAlarmSnapshot.current_state:
active && !acked → ACTIVE, active && acked → ACTIVE_ACKED,
!active → INACTIVE.
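The synthesis table and the snapshot mapping can be sketched together as two pure functions over the last-known (active, acked) pair. This is a Python illustration, not the worker's C# code: the names are hypothetical, transitions are plain strings, RETRIGGER is omitted because it depends on a re-alarm counter subtag, and "acked at clear" is read as the acked value observed with the clearing update, which is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SubtagState:
    active: bool = False
    acked: bool = False


def synthesize(prev: SubtagState, new: SubtagState) -> list:
    """Return the transitions implied by one change of (active, acked)."""
    out = []
    if not prev.active and new.active:
        out.append("RAISE")
    elif prev.active and not new.active:
        # Return-to-normal; the sub-state depends on acked at clear time.
        out.append("CLEAR:AckRtn" if new.acked else "CLEAR:UnackRtn")
    if new.active and not prev.acked and new.acked:
        out.append("ACKNOWLEDGE")
    return out


def snapshot_state(state: SubtagState) -> str:
    """Map last-known subtag values to ActiveAlarmSnapshot.current_state."""
    if not state.active:
        return "INACTIVE"
    return "ACTIVE_ACKED" if state.acked else "ACTIVE"
```

Keeping both functions free of I/O makes the rules easy to drive from recorded OnDataChange sequences in a unit test.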
Field degradation in subtag mode:
- alarm_full_reference: from the watch-list entry (stable, drives ack-by-ref).
- Synthetic, deterministic GUID derived by hashing alarm_full_reference so GUID-based ack still resolves; flagged degraded = true.
- severity: from the priority subtag if advised, else 0.
- original_raise_timestamp: first-observed active time (best effort).
- transition_timestamp: the OnDataChange timestamp.
- category/description/current_value/limit_value: populated only if the corresponding subtag is advised; otherwise empty.
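One way to get a deterministic GUID from a reference is a name-based UUID. The sketch below uses Python's uuid.uuid5 (SHA-1 based) purely as an illustration; this design does not specify the hash, the namespace value is hypothetical, and no case normalization is applied to the reference. The reverse index shows how a GUID-based ack could be mapped back to a watch-list entry.

```python
import uuid

# Hypothetical namespace; it only needs to stay fixed so that the same
# reference yields the same GUID across restarts.
ALARM_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "mxgateway:alarm-subtag-fallback")


def synthetic_alarm_guid(alarm_full_reference: str) -> uuid.UUID:
    """Derive a stable GUID from the alarm's full reference."""
    return uuid.uuid5(ALARM_NAMESPACE, alarm_full_reference)


def build_guid_index(watch_list):
    """Map synthetic GUID -> reference so ack-by-GUID can resolve its target."""
    return {synthetic_alarm_guid(ref): ref for ref in watch_list}
```

Because the GUID is a pure function of the reference, it survives worker restarts but carries no information about individual alarm occurrences, which is one reason these events are flagged as degraded.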
Acknowledge
AcknowledgeAlarm/AcknowledgeAlarmByName are unchanged at the RPC surface.
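The ack path routes by active provider mode. The routing lives inside FailoverAlarmConsumer; AlarmDispatcher only calls AcknowledgeByName/AcknowledgeByGuid: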
- alarmmgr mode: native wnwrap AlarmAckByName/AlarmAckByGUID (unchanged).
- subtag mode: resolve the target's ack_comment_subtag, Write the operator comment via the existing worker write path on the STA. No writable ack-comment subtag → FailedPrecondition.
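The two branches above amount to a small dispatch on the active mode. The Python sketch below is illustrative only: the function, the exception type and the injected native_ack/write_item callables are hypothetical stand-ins for the wnwrap ack and the worker write path, and building the write address as reference + "." + subtag is an assumption that still has to be confirmed against the live Galaxy.

```python
class FailedPrecondition(Exception):
    """Stand-in for the failure the gateway surfaces as FailedPrecondition."""


def acknowledge(mode, target, comment, native_ack, write_item):
    """Route one acknowledge request by the active provider mode.

    target is a dict with the AlarmSubtagTarget fields used here;
    native_ack and write_item are injected callables (hypothetical).
    """
    reference = target["alarm_full_reference"]
    if mode == "ALARMMGR":
        return native_ack(reference, comment)
    subtag = target.get("ack_comment_subtag") or ""
    if not subtag:
        raise FailedPrecondition(f"{reference} has no ack-comment subtag")
    # In subtag mode, writing the comment is what performs the ack.
    return write_item(f"{reference}.{subtag}", comment)
```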
Provider-mode reflection
Worker AlarmProviderModeChanged → GatewayAlarmMonitor → (a) emit/refresh
AlarmFeedMessage.provider_status to every StreamAlarms subscriber, (b) push
to /hubs/alarms, (c) update metrics, (d) force a reconcile.
Error handling
- Both providers down (subtag advise also failing): the monitor stays faulted and keeps retrying both; acknowledge returns Unavailable. No silent data loss: the feed reports degraded with reason.
- Empty watch-list in subtag mode (GR SQL unavailable, no config override): log + metric alarm_fallback_watchlist_empty; the feed reports degraded + empty; the gateway keeps re-running discovery on its reconcile cadence and pushes an updated watch-list when one becomes available.
- Switch hand-off: every switch snapshots the now-active provider and reconciles against the gateway cache to avoid a raise/clear storm.
- STA affinity: all subtag advise/write and wnwrap probe calls run on the
worker STA (reuse the existing affinity guard) to satisfy
ThreadingModel=Apartment.
Metrics
- mxgateway_alarm_provider_mode (gauge: 1 = alarmmgr, 2 = subtag)
- mxgateway_alarm_provider_switch_total{from,to,reason} (counter)
- mxgateway_alarm_fallback_watchlist_size (gauge)
Superseded: the shipped meter names are mxgateway.alarms.provider_mode (gauge) and mxgateway.alarms.provider_switches{from,to,reason} (counter, reason bounded to failover/failback/unknown). The watch-list-size / watch-list-empty gauges were not implemented; an empty watch-list is surfaced via a warning log and the feed's degraded ProviderStatus instead.
Configuration
"MxGateway": {
"Alarms": {
"Enabled": true,
"SubscriptionExpression": "\\\\DESKTOP-6JL3KKO\\Galaxy!DEV",
"DefaultArea": "DEV",
"ReconcileIntervalSeconds": 30,
"Fallback": {
"Mode": "Auto", // Auto | ForceAlarmManager | ForceSubtag
"ConsecutiveFailureThreshold": 3,
"FailbackProbeIntervalSeconds": 30,
"FailbackStableProbes": 3,
"Discovery": {
"UseGalaxyRepository": true,
"Area": "", // defaults to Alarms.DefaultArea
"IncludeAttributes": [], // explicit additions
"ExcludeAttributes": []
},
"Subtags": {
"Active": "active",
"Acked": "acked",
"AckComment": "", // verified against MXAccess analysis
"Priority": "priority"
}
}
}
}
GatewayOptionsValidator additions: Mode = ForceSubtag with empty discovery
result and no explicit IncludeAttributes → startup validation warning;
threshold/interval/probe values floored at sane minimums.
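The validator additions could look roughly like the following. This Python sketch is not the shipped GatewayOptionsValidator: the function name and warning text are hypothetical, and the floor of 1 for each numeric setting is an assumption, since this design only says "sane minimums". The defaults (3, 30, 3) are the ones given in the configuration above.

```python
DEFAULTS = {
    "ConsecutiveFailureThreshold": 3,
    "FailbackProbeIntervalSeconds": 30,
    "FailbackStableProbes": 3,
}
FLOOR = 1  # assumed minimum for every numeric setting


def validate_fallback(options: dict, discovered_count: int):
    """Return (normalized options, warnings) for the Fallback section."""
    normalized = dict(options)
    for key, default in DEFAULTS.items():
        normalized[key] = max(FLOOR, int(options.get(key, default)))

    warnings = []
    includes = options.get("Discovery", {}).get("IncludeAttributes", [])
    if options.get("Mode") == "ForceSubtag" and discovered_count == 0 and not includes:
        warnings.append("ForceSubtag is set but the watch-list would be empty")
    return normalized, warnings
```

Returning a warning rather than failing startup matches the intent stated above: the gateway can still start and pick up a watch-list on a later discovery pass.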
Open item to confirm during implementation
The exact AVEVA subtag names (.active, .acked, the ack-comment attribute,
priority) must be confirmed against the MXAccess analysis project
(C:\Users\dohertj2\Desktop\mxaccess, docs/MXAccess-Public-API.md) and the
live Galaxy before wiring SubtagAlarmConsumer. The config Subtags block
exists precisely so the resolved names are not hard-coded.
Testing
| Layer | Tests |
|---|---|
| Worker unit (MxGateway.Worker.Tests, x86) | SubtagAlarmConsumer synthesis: feed OnDataChange sequences, assert raise/ack/clear transitions, snapshot states, degraded flag, synthetic-GUID stability, ack-comment write routing |
| Worker unit | FailoverAlarmConsumer state machine — fake wnwrap throwing after K polls: assert switch at threshold, failback after stable probes, ProviderModeChanged emitted, no duplicate transitions across switch (hand-off reconcile) |
| Gateway unit (MxGateway.Tests, fake worker) | discovery + config-override merge; GatewayAlarmMonitor reflects mode into feed + hub; metrics increment on switch |
| Contract | proto round-trip for new fields; existing alarm tests unchanged (alarmmgr-mode regression — parity preserved) |
| Live (opt-in, MXGATEWAY_RUN_LIVE_MXACCESS_TESTS=1) | real subtag advise + ack-comment write against a live alarm; GR SQL discovery query against the ZB DB (gated like existing GR tests) |
Docs to update in the same change
gateway.md (alarm provider section), docs/DesignDecisions.md (record the
fallback decision), docs/GatewayConfiguration.md (the Fallback block),
docs/AlarmClientDiscovery.md (subtag provider + synthesis rules),
docs/Grpc.md (the new provider_status / degraded fields), and any client
READMEs whose generated alarm types gain fields.