docs(alarms): document alarmmgr->subtag fallback (providers, failover, config, contract, parity)

Joseph Doherty
2026-06-13 10:43:37 -04:00
parent 27f6c9e6b7
commit 2f30f0c7c0
5 changed files with 368 additions and 0 deletions
+124
@@ -790,3 +790,127 @@ Post-ack transition: kind=Clear …
10s cadence held throughout; full proto fields populated correctly;
ack registered server-side without errors.
## Subtag-monitoring fallback provider
When the wnwrap alarm-manager source fails, the gateway worker switches to
`SubtagAlarmConsumer` — a synthetic alarm source that advises each alarm
attribute's subtags via the existing MXAccess `AddItem`/`Advise` pipeline and
derives alarm transitions from the resulting value-change stream. This is a
non-parity, degraded-mode source; every transition and snapshot it produces
carries `degraded = true`.
### Watch-list discovery
`GatewayAlarmMonitor` resolves the subtag watch-list at subscribe time by
calling `IAlarmWatchListResolver.GetAlarmAttributesAsync`. The resolver merges:
1. Galaxy Repository SQL (`GetAlarmAttributesAsync`) — objects that have alarm
extensions in the configured area.
2. Config overrides: `IncludeAttributes` adds explicit entries;
`ExcludeAttributes` removes Repository-derived ones. `IncludeAttributes` takes
effect even when `UseGalaxyRepository` is `false`.
The resolved list is a set of `AlarmSubtagTarget` messages sent to the worker
inside `SubscribeAlarmsCommand.watch_list`. Each target carries the composed
MXAccess item addresses for the `.active`, `.acked`, ack-comment, and priority
subtags. The gateway re-runs discovery on its reconcile cadence and pushes an
updated watch-list when the model changes.
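The merge described above can be modeled in a few lines. This is a minimal Python sketch for illustration only, not the gateway's resolver: the function names, the `'<attribute>.<subtag>'` address format, and every target field name other than `ack_comment_subtag` are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SubtagNames:
    """Mirrors the Fallback:Subtags config block, using its documented defaults."""
    active: str = "active"
    acked: str = "acked"
    ack_comment: str = ""      # empty by default, so ack writes are disabled
    priority: str = "priority"


def resolve_watch_list(repository_attrs, include, exclude, use_repository=True):
    """Merge Repository-derived attributes with the config overrides.

    Exclusions apply only to Repository-derived entries; explicit includes
    are always kept, even when the Repository is not consulted.
    """
    excluded = set(exclude)
    derived = [a for a in repository_attrs if a not in excluded] if use_repository else []
    merged = []
    for attr in derived + list(include):
        if attr not in merged:  # de-duplicate while preserving order
            merged.append(attr)
    return merged


def compose_target(attr, names):
    """Build one watch-list target. The address format is an assumption."""
    def address(subtag):
        return f"{attr}.{subtag}" if subtag else ""

    return {
        "alarm_full_reference": attr,
        "active_subtag": address(names.active),            # hypothetical field name
        "acked_subtag": address(names.acked),              # hypothetical field name
        "ack_comment_subtag": address(names.ack_comment),
        "priority_subtag": address(names.priority),        # hypothetical field name
    }
```

With `use_repository=False` the result is exactly the `IncludeAttributes` list, which matches the "supplementing or replacing" wording in the configuration reference.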
### Subtag advise and `LmxSubtagAlarmSource`
`LmxSubtagAlarmSource` (implements `ISubtagAlarmSource`) owns a separate
`LMXProxyServerClass` instance on the worker STA — it does not share the
session's main MXAccess object. For each watch-list target it calls
`AddItem`/`Advise` on the configured subtag addresses. When a subtag value
changes, it raises `ValueChanged` on the STA and `SubtagAlarmConsumer`
forwards it to `SubtagAlarmStateMachine`.
`PollOnce()` on the subtag consumer is a no-op — the path is event-driven
through `Advise`, not poll-driven.
### Synthesis rules
`SubtagAlarmStateMachine` tracks `(active, acked)` per watch-list entry and
emits `MxAlarmTransitionEvent` records on change:
| Subtag change | Emitted transition | Notes |
|---|---|---|
| `active` false → true | Raise (`UNACK_ALM`) | `original_raise_timestamp` = first observed active time for this episode |
| `acked` false → true, while `active` | Acknowledge (`ACK_ALM`) | `AckedDuringEpisode` latch set |
| `active` true → false | Clear | `AckRtn` if `AckedDuringEpisode` is set, else `UnackRtn` |
| `acked` true → false, while `active` | (none) | Latch is NOT cleared; the episode retains its acknowledged status at clear |
The `AckedDuringEpisode` latch addresses out-of-order subtag delivery:
MXAccess does not guarantee the `acked = false` update arrives before the
`active = false` update. The latch ensures a clear always emits `ACK_RTN`
when the alarm was acknowledged at any point during the active episode.
`SnapshotActive()` returns one `MxAlarmSnapshotRecord` per currently-active
alarm. State mapping:
- `active && !acked` → `UNACK_ALM`
- `active && acked` → `ACK_ALM`
- `!active` → not included in the snapshot
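The synthesis table and snapshot mapping can be exercised with a small model. The following Python sketch is an illustration under stated assumptions, not the worker's `SubtagAlarmStateMachine`: the class and method names are hypothetical, transitions are reduced to `(kind, state)` tuples, and resetting the latch at clear is an assumption (the rules above only say when it is set and that an `acked` reset does not clear it).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Episode:
    active: bool = False
    acked: bool = False
    acked_during_episode: bool = False  # models the AckedDuringEpisode latch
    raised_at: Optional[float] = None   # models original_raise_timestamp


class SubtagStateMachineModel:
    """Tracks (active, acked) per reference and reports transitions on change."""

    def __init__(self):
        self._entries = {}

    def on_active(self, ref, value, timestamp):
        entry = self._entries.setdefault(ref, Episode())
        if value == entry.active:
            return None                 # no change, nothing to emit
        entry.active = value
        if value:
            entry.raised_at = timestamp  # first observed active time
            return ("Raise", "UNACK_ALM")
        kind = "AckRtn" if entry.acked_during_episode else "UnackRtn"
        entry.acked_during_episode = False  # assumption: latch resets at clear
        entry.raised_at = None
        return ("Clear", kind)

    def on_acked(self, ref, value, timestamp):
        entry = self._entries.setdefault(ref, Episode())
        if value == entry.acked:
            return None
        entry.acked = value
        if value and entry.active:
            entry.acked_during_episode = True
            return ("Acknowledge", "ACK_ALM")
        return None                     # acked true -> false never clears the latch

    def snapshot_active(self):
        """Return {reference: state} for currently-active alarms only."""
        return {
            ref: ("ACK_ALM" if e.acked else "UNACK_ALM")
            for ref, e in self._entries.items()
            if e.active
        }
```

Feeding the model `active=true`, `acked=true`, `acked=false`, `active=false` in that order yields a clear of kind `AckRtn`, which is the out-of-order case the latch exists to handle.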
### Synthetic GUID
The alarmmgr provider supplies a native GUID per alarm record. The subtag
provider has no native GUID. `SubtagAlarmConsumer` derives a deterministic
GUID by hashing `alarm_full_reference` (via `SyntheticAlarmGuid.ForReference`).
The same reference always produces the same GUID within a session, so
GUID-based ack routing resolves correctly. The GUID is synthetic, however: it
is derived only from the reference and never matches an AVEVA-internal GUID,
regardless of alarm reference or gateway restart.
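One way to derive such a GUID is to hash the reference and keep the first 16 bytes. The Python sketch below illustrates the idea only; the actual hash function, text encoding, and any case normalization used by `SyntheticAlarmGuid.ForReference` are not stated here, so SHA-256 over UTF-8 is an assumption.

```python
import hashlib
import uuid


def synthetic_alarm_guid(alarm_full_reference: str) -> uuid.UUID:
    """Derive a deterministic GUID from an alarm reference.

    Assumption: SHA-256 over the UTF-8 bytes, truncated to 128 bits.
    """
    digest = hashlib.sha256(alarm_full_reference.encode("utf-8")).digest()
    return uuid.UUID(bytes=digest[:16])
```

Because the function is pure, the same reference maps to the same GUID on every call, which is the property GUID-based ack routing relies on.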
### Acknowledge in subtag mode
`AlarmDispatcher` routes ack calls by active provider mode:
- **Alarm-manager mode:** `AlarmAckByName` on `wwAlarmConsumerClass` (unchanged).
- **Subtag mode:** `SubtagAlarmConsumer.AcknowledgeByName` resolves the
watch-list entry's `ack_comment_subtag` and issues a `Write(comment)` on
the STA via `LmxSubtagAlarmSource`. The write performs the ack in AVEVA.
If the alarm has no writable ack-comment subtag (`AckComment` config key is
empty, or the entry's `ack_comment_subtag` field is empty), the ack call
returns a failure code that the gateway surfaces as `FailedPrecondition`.
`AcknowledgeByGuid` maps the synthetic GUID back to its reference via an
internal dictionary, then calls the same write path.
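The subtag-mode ack path can be sketched as a lookup followed by a write. This Python model is illustrative only: the class, the `register` method, and the string result codes are hypothetical, the injected `write` callable stands in for the STA write through `LmxSubtagAlarmSource`, and the handling of an unknown GUID is an assumption.

```python
class SubtagAckModel:
    """Models ack-by-name and ack-by-GUID routing in subtag mode."""

    def __init__(self, write):
        self._write = write             # callable(address, comment)
        self._ack_subtag_by_ref = {}    # reference -> ack_comment_subtag address
        self._ref_by_guid = {}          # synthetic GUID -> reference

    def register(self, reference, guid, ack_comment_subtag):
        self._ack_subtag_by_ref[reference] = ack_comment_subtag
        self._ref_by_guid[guid] = reference

    def acknowledge_by_name(self, reference, comment):
        address = self._ack_subtag_by_ref.get(reference, "")
        if not address:
            # No writable ack-comment subtag: surfaced as FailedPrecondition.
            return "FailedPrecondition"
        self._write(address, comment)   # the write performs the ack
        return "Ok"

    def acknowledge_by_guid(self, guid, comment):
        reference = self._ref_by_guid.get(guid)
        if reference is None:
            return "NotFound"           # assumption: not specified in this document
        return self.acknowledge_by_name(reference, comment)
```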
### Fidelity limitations
The following fields are not available or have lower quality in subtag mode:
| Field | Subtag-mode behavior |
|-------|---------------------|
| `alarm_guid` | Synthetic deterministic GUID from `alarm_full_reference`; not an AVEVA-native GUID |
| `original_raise_timestamp` | First observed `active = true` time; no AVEVA-native raise time |
| `transition_timestamp` | `OnDataChange` source timestamp from MXAccess |
| `severity` | From priority subtag if advised; 0 otherwise |
| `category` / `description` | Not populated (no subtag for these) |
| `current_value` / `limit_value` | Not populated unless corresponding subtags are in the watch-list |
| `alarm_type_name` | Not populated |
| `operator_user` / `operator_comment` | Not populated on synthesized raise/clear transitions |
| `retrigger` transition | Not synthesized (no re-alarm counter subtag is observed) |
Every transition and snapshot record carries `degraded = true` and
`source_provider = ALARM_PROVIDER_MODE_SUBTAG`. Clients that require full
fidelity must wait for failback to the alarm manager.
### Provider mode reflection
When `FailoverAlarmConsumer` switches between providers, it raises
`ProviderModeChanged`. `AlarmDispatcher` enqueues an
`OnAlarmProviderModeChangedEvent` (carried as an `MxEvent`), which the
gateway receives and reflects into:
- `AlarmFeedMessage.provider_status` emitted to every `StreamAlarms`
subscriber.
- The `/hubs/alarms` SignalR hub for the dashboard.
- Metrics: `mxgateway.alarms.provider_mode` gauge and
`mxgateway.alarms.provider_switches` counter.
On every switch `GatewayAlarmMonitor` also forces a reconcile
(`QueryActiveAlarms`) against the now-active provider so the gateway cache
reflects the post-switch state without a spurious raise/clear storm.
+52
@@ -411,6 +411,58 @@ a per-channel skip-verify hook:
See [Gateway Configuration — Automatic self-signed certificate](./GatewayConfiguration.md#automatic-self-signed-certificate)
and the per-client READMEs for the as-built behavior.
## Alarm-Manager to Subtag Fallback
Decision: add a second alarm provider (subtag monitoring) that the worker
activates automatically when the native wnwrap alarm manager fails, and from
which it fails back automatically when the manager recovers.
### Worker-side synthesis
Synthesis of alarm transitions from subtag value changes happens entirely in
the worker (`SubtagAlarmConsumer` / `SubtagAlarmStateMachine`). The gateway
still forwards only events the worker emits and synthesizes nothing itself.
This satisfies the parity rule even though the subtag path is inherently
non-parity: the parity rule governs where synthesis lives, not whether
synthesis is permitted when the native source is unavailable.
### Degraded is explicit
Every subtag-mode transition carries `degraded = true` on the
`OnAlarmTransitionEvent` and `ActiveAlarmSnapshot` proto messages, and the
`AlarmFeedMessage` feed carries an `AlarmProviderStatus` payload on stream
open and on every switch. No client can mistake a subtag-mode alarm for an
authoritative alarmmgr record. Subtag mode has lower fidelity: synthetic
deterministic GUID (SHA-derived from the alarm reference), best-effort
original-raise timestamp, narrower field set. Clients that need full fidelity
must wait for failback.
### Failover trigger
The failover trigger is `ConsecutiveFailureThreshold` consecutive wnwrap COM
failures, where a failure is a `COMException` thrown by `Subscribe` or
`PollOnce`, or a failure HRESULT from `GetXmlCurrentAlarms2`. A single poll
failure does not trigger a switch; the
threshold (default 3, floored at 1) guards against transient COM hiccups. The
counter resets on any clean poll so a flapping provider does not permanently
latch in subtag mode.
### Acknowledge via ack-comment write
In subtag mode, `AcknowledgeAlarm` writes the operator comment to the alarm
attribute's ack-comment subtag (`Fallback:Subtags:AckComment`). The write
performs the native ack in AVEVA. This differs from alarmmgr mode, where
`AlarmAckByName` on `wwAlarmConsumerClass` is called directly. The `AckComment`
subtag name is empty by default; configuring it is required for ack to work in
subtag mode. The exact AVEVA subtag names are not hard-coded — the `Subtags`
config block exists precisely so names are not guessed without validation
against the live MXAccess attribute set.
### Related documentation
- [Gateway Configuration — Alarm Fallback options](./GatewayConfiguration.md#alarm-fallback-options)
- [Alarm Client Discovery — Subtag provider](./AlarmClientDiscovery.md)
- [gRPC Contract — provider_status and degraded fields](./Grpc.md)
## Later Revisit Items
These are explicit post-v1 revisit items, not open blockers:
+68
@@ -230,6 +230,74 @@ behavior.
The alarm monitor is independent of client sessions: `AcknowledgeAlarm` and
`StreamAlarms` are session-less RPCs served by the monitor.
### Alarm fallback options
The `Fallback` sub-section controls how the alarm feed selects between the
native wnwrap alarm-manager provider and the subtag-monitoring fallback.
| Option | Default | Description |
|--------|---------|-------------|
| `MxGateway:Alarms:Fallback:Mode` | `Auto` | Provider selection mode. `Auto` uses the alarm manager as primary and fails over to subtag monitoring after consecutive COM failures, then fails back automatically. `ForceAlarmManager` disables failover. `ForceSubtag` forces subtag monitoring on from startup. Values are case-insensitive. |
| `MxGateway:Alarms:Fallback:ConsecutiveFailureThreshold` | `3` | Number of consecutive wnwrap COM failures (`COMException` or failure HRESULT from `Subscribe` / `GetXmlCurrentAlarms2`) before the monitor switches to subtag mode. Floored at 1. |
| `MxGateway:Alarms:Fallback:FailbackProbeIntervalSeconds` | `30` | While in subtag mode, how often (in seconds) the monitor probes the wnwrap provider to detect recovery. Floored at 1. |
| `MxGateway:Alarms:Fallback:FailbackStableProbes` | `3` | Number of consecutive clean wnwrap probes required before the monitor switches back to the alarm manager. Floored at 1. |
| `MxGateway:Alarms:Fallback:Discovery:UseGalaxyRepository` | `true` | When `true`, the monitor queries the Galaxy Repository SQL database to build the subtag watch-list for the configured area. |
| `MxGateway:Alarms:Fallback:Discovery:Area` | _(empty)_ | Galaxy area to scope the Repository query to. Falls back to `MxGateway:Alarms:DefaultArea` when empty. Ignored when `UseGalaxyRepository` is `false`. |
| `MxGateway:Alarms:Fallback:Discovery:IncludeAttributes` | _(empty)_ | Explicit MXAccess attribute paths to add to the subtag watch-list, supplementing (or replacing, when `UseGalaxyRepository` is `false`) the Repository-derived list. |
| `MxGateway:Alarms:Fallback:Discovery:ExcludeAttributes` | _(empty)_ | Attribute paths to remove from the Repository-derived watch-list. Ignored when `UseGalaxyRepository` is `false`. |
| `MxGateway:Alarms:Fallback:Subtags:Active` | `active` | Subtag name for the in-alarm boolean. |
| `MxGateway:Alarms:Fallback:Subtags:Acked` | `acked` | Subtag name for the acknowledged boolean. |
| `MxGateway:Alarms:Fallback:Subtags:AckComment` | _(empty)_ | Subtag name for the acknowledgement comment attribute. When empty, writing an ack comment in subtag mode is disabled. Must be verified against the live MXAccess attribute set before use. |
| `MxGateway:Alarms:Fallback:Subtags:Priority` | `priority` | Subtag name for the alarm priority / severity value. |
Validation rules:
- `Mode` must be `Auto`, `ForceAlarmManager`, or `ForceSubtag` (case-insensitive).
- `Mode = ForceSubtag` with both `UseGalaxyRepository = false` and an empty
`IncludeAttributes` list produces a startup validation warning: the subtag
provider has no attributes to advise.
- `ConsecutiveFailureThreshold`, `FailbackProbeIntervalSeconds`, and
`FailbackStableProbes` are floored at 1 by `GatewayOptionsValidator`.
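The validation rules above can be expressed compactly. The following Python sketch is a model of the rules, not `GatewayOptionsValidator` itself: the function name and the dict-based option shape are assumptions, as is raising an error for an unsupported `Mode` (the rules only say what `Mode` must be).

```python
VALID_MODES = {m.lower(): m for m in ("Auto", "ForceAlarmManager", "ForceSubtag")}

# Documented defaults for the options that are floored at 1.
FLOORED_DEFAULTS = {
    "ConsecutiveFailureThreshold": 3,
    "FailbackProbeIntervalSeconds": 30,
    "FailbackStableProbes": 3,
}


def validate_fallback(options):
    """Return (normalized_options, warnings) for a Fallback config dict."""
    mode_key = str(options.get("Mode", "Auto")).lower()  # case-insensitive match
    if mode_key not in VALID_MODES:
        raise ValueError(f"Unsupported Fallback:Mode {options.get('Mode')!r}")

    normalized = dict(options, Mode=VALID_MODES[mode_key])
    for key, default in FLOORED_DEFAULTS.items():
        normalized[key] = max(1, int(options.get(key, default)))  # floor at 1

    discovery = options.get("Discovery", {})
    use_repository = discovery.get("UseGalaxyRepository", True)
    include = discovery.get("IncludeAttributes", [])

    warnings = []
    if normalized["Mode"] == "ForceSubtag" and not use_repository and not include:
        warnings.append("ForceSubtag: the subtag provider has no attributes to advise")
    return normalized, warnings
```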
Full example with non-default fallback settings:
```json
{
  "MxGateway": {
    "Alarms": {
      "Enabled": true,
      "SubscriptionExpression": "\\\\SCADA01\\Galaxy!PlantArea",
      "DefaultArea": "PlantArea",
      "ReconcileIntervalSeconds": 30,
      "Fallback": {
        "Mode": "Auto",
        "ConsecutiveFailureThreshold": 3,
        "FailbackProbeIntervalSeconds": 30,
        "FailbackStableProbes": 3,
        "Discovery": {
          "UseGalaxyRepository": true,
          "Area": "",
          "IncludeAttributes": [],
          "ExcludeAttributes": []
        },
        "Subtags": {
          "Active": "active",
          "Acked": "acked",
          "AckComment": "",
          "Priority": "priority"
        }
      }
    }
  }
}
```
The exact AVEVA subtag names for `Active`, `Acked`, `AckComment`, and
`Priority` are not hard-coded. The `Subtags` block exists so names can be
confirmed against the live MXAccess attribute set and configured without a
code change. See `docs/AlarmClientDiscovery.md` for the synthesis rules that
depend on these names.
## Host Endpoints and Transport Security (Kestrel)
The listening endpoints are **not** part of the `MxGateway` section. The gateway
+67
@@ -94,6 +94,73 @@ Carrying the enqueue timestamp into the worker layer is what lets queue-wait tim
`StreamAlarms` is a server-streaming, **session-less** RPC that attaches to the gateway's central alarm feed. The handler delegates to `IGatewayAlarmService.StreamAsync`. The stream opens with one `AlarmFeedMessage` carrying an `active_alarm` per currently-active alarm (the ConditionRefresh snapshot), then a single `snapshot_complete`, then a `transition` for every subsequent raise / acknowledge / clear. It is served by the always-on `GatewayAlarmMonitor`, which owns a single gateway-managed worker session and fans out to every attached client — clients no longer open a session of their own. `alarm_filter_prefix`, when set, scopes the stream to a sub-tree.
#### Provider status on the alarm feed
`AlarmFeedMessage` has a fourth `payload` case, `provider_status`, carrying
an `AlarmProviderStatus` message:
```protobuf
message AlarmProviderStatus {
  AlarmProviderMode mode = 1;
  bool degraded = 2;                    // true whenever mode == SUBTAG
  string reason = 3;                    // human-readable switch reason
  google.protobuf.Timestamp since = 4;
}
```
The gateway emits `provider_status` once when a client first subscribes
(immediately after the initial snapshot and before the first live transition)
and again on every failover or failback. A late-joining client therefore
always learns the current provider mode without waiting for the next switch.
`AlarmProviderMode` is an enum with three values:
| Value | Meaning |
|-------|---------|
| `ALARM_PROVIDER_MODE_UNSPECIFIED` (0) | Default / unset |
| `ALARM_PROVIDER_MODE_ALARMMGR` (1) | Native wnwrap alarm-manager source |
| `ALARM_PROVIDER_MODE_SUBTAG` (2) | Subtag-monitoring fallback (degraded) |
#### Degraded and source-provider fields on transitions and snapshots
`OnAlarmTransitionEvent` and `ActiveAlarmSnapshot` both carry two new fields:
- `bool degraded` (field 14) — `true` when the record came from the subtag
fallback, not the native alarmmgr.
- `AlarmProviderMode source_provider` (field 15) — which provider produced
this record (`ALARMMGR` or `SUBTAG`).
Both fields are proto3 defaults (`false` / `UNSPECIFIED`) in alarmmgr mode,
so existing clients that do not read them continue to function without change.
Clients that care about provenance — for example, an OPC UA server that
applies different quality flags to degraded alarms — should inspect `degraded`
before consuming the transition.
Subtag-mode records are a non-parity source. They carry synthetic GUIDs,
best-effort timestamps, and reduced field coverage. See
`docs/AlarmClientDiscovery.md` for the full fidelity table.
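A client that cares about provenance might handle the feed roughly as follows. This Python sketch models `AlarmFeedMessage` as a dict with a single payload key rather than using generated protobuf classes; the `kind` field name, the `"good"` / `"uncertain"` quality labels, and the class itself are assumptions made for illustration.

```python
class AlarmFeedClientModel:
    """Tracks provider mode and per-alarm quality from feed messages."""

    def __init__(self):
        self.provider_mode = "ALARM_PROVIDER_MODE_UNSPECIFIED"
        self.snapshot_done = False
        self.active = {}  # alarm_full_reference -> quality label

    def handle(self, message):
        if "active_alarm" in message:
            self._track(message["active_alarm"])
        elif "snapshot_complete" in message:
            self.snapshot_done = True
        elif "provider_status" in message:
            self.provider_mode = message["provider_status"]["mode"]
        elif "transition" in message:
            transition = message["transition"]
            if transition["kind"] == "Clear":
                self.active.pop(transition["alarm_full_reference"], None)
            else:
                self._track(transition)

    def _track(self, record):
        # A missing 'degraded' key behaves like the proto3 default (false),
        # which is why clients that predate the field keep working.
        degraded = record.get("degraded", False)
        self.active[record["alarm_full_reference"]] = "uncertain" if degraded else "good"
```

An OPC UA bridge could map the quality label onto its own status codes; that mapping is outside the scope of this sketch.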
#### Provider-mode-changed event
The worker emits `OnAlarmProviderModeChangedEvent` (family
`MX_EVENT_FAMILY_ON_ALARM_PROVIDER_MODE_CHANGED`) on each switch between
providers:
```protobuf
message OnAlarmProviderModeChangedEvent {
  AlarmProviderMode mode = 1;
  string reason = 2;
  int32 hresult = 3;                 // COM HRESULT that triggered failover; 0 on failback
  google.protobuf.Timestamp at = 4;
}
```
This event arrives on the `StreamEvents` stream of the alarm monitor's
internal gateway session (not on client sessions). `GatewayAlarmMonitor`
consumes it and reflects the new mode into the `StreamAlarms` feed's
`provider_status`, the dashboard hub, and metrics. Client sessions do not
receive this event directly.
## Validation Rules
`MxAccessGrpcRequestValidator` rejects requests with `StatusCode.InvalidArgument` before any session work happens. The rules are intentionally narrow — anything that requires session state (for example, "session does not exist") is left for `ISessionManager` so the validator can stay synchronous and side-effect free.
+57
@@ -143,6 +143,63 @@ session if the worker faults. Gated by `MxGateway:Alarms:Enabled` — see
`docs/DesignDecisions.md` for why this reverses the v1 single-subscriber rule
for the alarm subsystem.
### Alarm providers and failover
The alarm feed has two providers, both implemented worker-side:
- **Alarm manager (primary):** `WnWrapAlarmConsumer` polls
`wwAlarmConsumerClass.GetXmlCurrentAlarms2` on the worker STA. This is the
authoritative native source.
- **Subtag monitoring (standby):** `SubtagAlarmConsumer` advises each alarm
attribute's subtags (`.active`, `.acked`, optionally `.priority`) via the
existing `AddItem`/`Advise` pipeline through `LmxSubtagAlarmSource` and
synthesizes alarm transitions with `SubtagAlarmStateMachine`. This is a
non-parity, lower-fidelity source — synthetic GUIDs, no native raise
timestamps, narrower fields.
`FailoverAlarmConsumer` wraps both and owns the failover state machine:
- **Auto-failover:** after `ConsecutiveFailureThreshold` (default 3)
consecutive wnwrap COM failures — `Subscribe` or `PollOnce` throws or
returns a failure HRESULT — it activates the standby. The standby is armed
(subscribed and advising) from the start so its state is warm at the moment
of switch.
- **Auto-failback:** while degraded, every `FailbackProbeIntervalSeconds`
(default 30) it re-probes the still-subscribed primary. After
`FailbackStableProbes` (default 3) consecutive clean polls it switches back
to the alarm manager.
- **On every switch:** the consumer snapshots the now-active provider and
emits `OnAlarmProviderModeChangedEvent` so the gateway can reconcile its
cache without a raise/clear storm.
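The failover and failback counters can be modeled as a small state machine. This Python sketch is an illustration, not `FailoverAlarmConsumer`: the class and method names are hypothetical, probe scheduling (`FailbackProbeIntervalSeconds`) is left to the caller, and the snapshot plus mode-changed event on each switch are reduced to a boolean return value.

```python
class FailoverTrackerModel:
    """Counts consecutive primary failures and consecutive clean probes."""

    def __init__(self, failure_threshold=3, stable_probes=3):
        self.failure_threshold = max(1, failure_threshold)  # floored at 1
        self.stable_probes = max(1, stable_probes)          # floored at 1
        self.mode = "ALARMMGR"
        self._failures = 0
        self._clean_probes = 0

    def record_primary_result(self, ok):
        """Feed one primary poll or probe result; return True if the mode switched."""
        if self.mode == "ALARMMGR":
            if ok:
                self._failures = 0      # any clean poll resets the counter
                return False
            self._failures += 1
            if self._failures >= self.failure_threshold:
                self.mode = "SUBTAG"    # fail over to the warm standby
                self._failures = 0
                self._clean_probes = 0
                return True
            return False

        # Degraded: each result is a failback probe of the primary.
        if not ok:
            self._clean_probes = 0      # probes must be consecutive
            return False
        self._clean_probes += 1
        if self._clean_probes >= self.stable_probes:
            self.mode = "ALARMMGR"      # fail back to the alarm manager
            self._clean_probes = 0
            return True
        return False
```

With the default thresholds, the sequence fail, ok, fail, fail does not switch, because the clean poll resets the counter; a third consecutive failure does.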
Synthesis is worker-side. This preserves the parity rule — the gateway
forwards only events the worker emits and never synthesizes transitions
itself. The synthesis rules are documented in
`docs/AlarmClientDiscovery.md`.
**Acknowledge in subtag mode:** the ack-by-name path writes the operator
comment to the alarm attribute's ack-comment subtag. The write performs the
ack. If the attribute has no writable ack-comment subtag configured, the RPC
returns `FailedPrecondition`. In alarm-manager mode, `AlarmAckByName` is
used as before.
**Degraded state visibility:** every subtag-mode transition carries
`degraded = true` and `source_provider = ALARM_PROVIDER_MODE_SUBTAG` on the
`OnAlarmTransitionEvent` and `ActiveAlarmSnapshot` proto fields. The
`AlarmFeedMessage` feed emits an `AlarmProviderStatus` message (the
`provider_status` oneof case) on stream open and on every switch. The
dashboard shows a Bootstrap badge (green for alarm manager, amber when
degraded). Metrics: `mxgateway.alarms.provider_mode` gauge (1 = alarmmgr,
2 = subtag) and `mxgateway.alarms.provider_switches` counter.
Forced modes are available via `MxGateway:Alarms:Fallback:Mode`:
`ForceAlarmManager` disables failover; `ForceSubtag` forces the standby
on from startup; `Auto` (default) enables failover and failback. Watch-list
discovery for the subtag provider uses Galaxy Repository SQL with config
overrides. See `docs/GatewayConfiguration.md` for the full `Fallback` option
block and `docs/AlarmClientDiscovery.md` for synthesis rules and fidelity
limitations.
Dashboard authentication is LDAP-backed (distinct from the API-key model on
the gRPC API). `/login` accepts username and password in a form body, binds
against `MxGateway:Ldap`, maps the user's LDAP groups to `Admin` or `Viewer`