diff --git a/docs/AlarmClientDiscovery.md b/docs/AlarmClientDiscovery.md index 056bb75..1d08632 100644 --- a/docs/AlarmClientDiscovery.md +++ b/docs/AlarmClientDiscovery.md @@ -790,3 +790,127 @@ Post-ack transition: kind=Clear … 10s cadence held throughout; full proto fields populated correctly; ack registered server-side without errors. + +## Subtag-monitoring fallback provider + +When the wnwrap alarm-manager source fails, the gateway worker switches to +`SubtagAlarmConsumer` — a synthetic alarm source that advises each alarm +attribute's subtags via the existing MXAccess `AddItem`/`Advise` pipeline and +derives alarm transitions from the resulting value-change stream. This is a +non-parity, degraded-mode source; every transition and snapshot it produces +carries `degraded = true`. + +### Watch-list discovery + +`GatewayAlarmMonitor` resolves the subtag watch-list at subscribe time by +calling `IAlarmWatchListResolver.GetAlarmAttributesAsync`. The resolver merges: + +1. Galaxy Repository SQL (`GetAlarmAttributesAsync`) — objects that have alarm + extensions in the configured area. +2. Config overrides — `IncludeAttributes` adds explicit entries; + `ExcludeAttributes` removes Repository-derived ones. The config list takes + effect even when `UseGalaxyRepository` is `false`. + +The resolved list is a set of `AlarmSubtagTarget` messages sent to the worker +inside `SubscribeAlarmsCommand.watch_list`. Each target carries the composed +MXAccess item addresses for the `.active`, `.acked`, ack-comment, and priority +subtags. The gateway re-runs discovery on its reconcile cadence and pushes an +updated watch-list when the model changes. + +### Subtag advise and `LmxSubtagAlarmSource` + +`LmxSubtagAlarmSource` (implements `ISubtagAlarmSource`) owns a separate +`LMXProxyServerClass` instance on the worker STA — it does not share the +session's main MXAccess object. For each watch-list target it calls +`AddItem`/`Advise` on the configured subtag addresses. 
When a subtag value +changes, it raises `ValueChanged` on the STA and `SubtagAlarmConsumer` +forwards it to `SubtagAlarmStateMachine`. + +`PollOnce()` on the subtag consumer is a no-op; the path is event-driven +through `Advise`, not poll-driven. + +### Synthesis rules + +`SubtagAlarmStateMachine` tracks `(active, acked)` per watch-list entry and +emits `MxAlarmTransitionEvent` records on change: + +| Subtag change | Emitted transition | Notes | +|---|---|---| +| `active` false → true | Raise (`UNACK_ALM`) | `original_raise_timestamp` = first observed active time for this episode | +| `acked` false → true, while `active` | Acknowledge (`ACK_ALM`) | `AckedDuringEpisode` latch set | +| `active` true → false | Clear | `AckRtn` if `AckedDuringEpisode` is set, else `UnackRtn` | +| `acked` true → false, while `active` | (none) | Latch is NOT cleared; the episode retains its acknowledged status at clear | + +The `AckedDuringEpisode` latch addresses out-of-order subtag delivery: +MXAccess does not guarantee the `acked = false` update arrives after the +`active = false` update, so `acked` may already read `false` when the clear is +observed. The latch ensures a clear always emits `ACK_RTN` +when the alarm was acknowledged at any point during the active episode. + +`SnapshotActive()` returns one `MxAlarmSnapshotRecord` per currently-active +alarm. State mapping: + +- `active && !acked` → `UNACK_ALM` +- `active && acked` → `ACK_ALM` +- `!active` → not included in the snapshot + +### Synthetic GUID + +The alarmmgr provider supplies a native GUID per alarm record. The subtag +provider has no native GUID. `SubtagAlarmConsumer` derives a deterministic +GUID by hashing `alarm_full_reference` (via `SyntheticAlarmGuid.ForReference`). +The same reference always produces the same GUID within a session, so +GUID-based ack routing resolves correctly. The GUID is specific to its alarm +reference and does not correspond to any AVEVA-internal GUID, including across +gateway restarts. 
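To illustrate the deterministic derivation described above, here is a minimal Python sketch. It is not the worker's implementation: the choice of SHA-256, the UTF-8 encoding, and truncation to the first 16 digest bytes are assumptions, since this document does not specify how `SyntheticAlarmGuid.ForReference` builds its bytes. The reference `Tank01.Level.HiHi` used in the usage note is hypothetical.

```python
import hashlib
import uuid


def synthetic_alarm_guid(alarm_full_reference):
    """Derive a repeatable GUID from an alarm reference (illustrative only)."""
    # Assumption: SHA-256 over the UTF-8 bytes of the reference.
    digest = hashlib.sha256(alarm_full_reference.encode("utf-8")).digest()
    # A GUID is 128 bits, so keep the first 16 bytes of the digest.
    return uuid.UUID(bytes=digest[:16])


# Reverse lookup so a GUID-based ack can be routed back to its reference.
guid_to_reference = {}


def register(alarm_full_reference):
    """Compute the GUID for a reference and remember the mapping."""
    guid = synthetic_alarm_guid(alarm_full_reference)
    guid_to_reference[guid] = alarm_full_reference
    return guid
```

Calling `register("Tank01.Level.HiHi")` twice returns the same GUID both times, which is the property GUID-based ack routing relies on.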
+ +### Acknowledge in subtag mode + +`AlarmDispatcher` routes ack calls by active provider mode: + +- **Alarm-manager mode:** `AlarmAckByName` on `wwAlarmConsumerClass` (unchanged). +- **Subtag mode:** `SubtagAlarmConsumer.AcknowledgeByName` resolves the + watch-list entry's `ack_comment_subtag` and issues a `Write(comment)` on + the STA via `LmxSubtagAlarmSource`. The write performs the ack in AVEVA. + +If the alarm has no writable ack-comment subtag (`AckComment` config key is +empty, or the entry's `ack_comment_subtag` field is empty), the ack call +returns a failure code that the gateway surfaces as `FailedPrecondition`. +`AcknowledgeByGuid` maps the synthetic GUID back to its reference via an +internal dictionary, then calls the same write path. + +### Fidelity limitations + +The following fields are not available or have lower quality in subtag mode: + +| Field | Subtag-mode behavior | +|-------|---------------------| +| `alarm_guid` | Synthetic deterministic GUID from `alarm_full_reference`; not an AVEVA-native GUID | +| `original_raise_timestamp` | First observed `active = true` time; no AVEVA-native raise time | +| `transition_timestamp` | `OnDataChange` source timestamp from MXAccess | +| `severity` | From priority subtag if advised; 0 otherwise | +| `category` / `description` | Not populated (no subtag for these) | +| `current_value` / `limit_value` | Not populated unless corresponding subtags are in the watch-list | +| `alarm_type_name` | Not populated | +| `operator_user` / `operator_comment` | Not populated on synthesized raise/clear transitions | +| `retrigger` transition | Not synthesized (no re-alarm counter subtag is observed) | + +Every transition and snapshot record carries `degraded = true` and +`source_provider = ALARM_PROVIDER_MODE_SUBTAG`. Clients that require full +fidelity must wait for failback to the alarm manager. 
+ +### Provider mode reflection + +When `FailoverAlarmConsumer` switches between providers, it raises +`ProviderModeChanged`. `AlarmDispatcher` enqueues an +`OnAlarmProviderModeChangedEvent` (carried as an `MxEvent`), which the +gateway receives and reflects into: + +- `AlarmFeedMessage.provider_status` emitted to every `StreamAlarms` + subscriber. +- The `/hubs/alarms` SignalR hub for the dashboard. +- Metrics: `mxgateway.alarms.provider_mode` gauge and + `mxgateway.alarms.provider_switches` counter. + +On every switch `GatewayAlarmMonitor` also forces a reconcile +(`QueryActiveAlarms`) against the now-active provider so the gateway cache +reflects the post-switch state without a spurious raise/clear storm. diff --git a/docs/DesignDecisions.md b/docs/DesignDecisions.md index ad8005e..6a3c6d0 100644 --- a/docs/DesignDecisions.md +++ b/docs/DesignDecisions.md @@ -411,6 +411,58 @@ a per-channel skip-verify hook: See [Gateway Configuration — Automatic self-signed certificate](./GatewayConfiguration.md#automatic-self-signed-certificate) and the per-client READMEs for the as-built behavior. +## Alarm-Manager to Subtag Fallback + +Decision: add a second alarm provider (subtag monitoring) that the worker +activates automatically when the native wnwrap alarm manager fails, and fails +back to automatically when the manager recovers. + +### Worker-side synthesis + +Synthesis of alarm transitions from subtag value changes happens entirely in +the worker (`SubtagAlarmConsumer` / `SubtagAlarmStateMachine`). The gateway +still forwards only events the worker emits and synthesizes nothing itself. +This satisfies the parity rule even though the subtag path is inherently +non-parity: the parity rule governs where synthesis lives, not whether +synthesis is permitted when the native source is unavailable. 
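For orientation, the worker-side synthesis can be sketched in a few lines of Python. The sketch follows the synthesis rules table in `docs/AlarmClientDiscovery.md`, but the class name, method names, and string results are hypothetical stand-ins rather than the API of `SubtagAlarmStateMachine`. Resetting the latch on each new raise is also an assumption, because the rules table does not say when the latch is cleared.

```python
from dataclasses import dataclass


@dataclass
class _AlarmState:
    active: bool = False
    acked: bool = False
    acked_during_episode: bool = False  # latch for the current episode


class SubtagStateSketch:
    """Illustrative stand-in for the worker's state machine."""

    def __init__(self):
        self._states = {}

    def on_active(self, reference, value):
        """Apply an `active` subtag update; return the transition or None."""
        state = self._states.setdefault(reference, _AlarmState())
        if value == state.active:
            return None  # no edge, nothing to emit
        state.active = value
        if value:
            # Assumption: a new episode starts with the latch cleared.
            state.acked_during_episode = False
            return "Raise"
        # On clear, the latch (not the current acked value) picks the kind.
        return "AckRtn" if state.acked_during_episode else "UnackRtn"

    def on_acked(self, reference, value):
        """Apply an `acked` subtag update; return the transition or None."""
        state = self._states.setdefault(reference, _AlarmState())
        was_acked, state.acked = state.acked, value
        if value and not was_acked and state.active:
            state.acked_during_episode = True
            return "Acknowledge"
        # acked true -> false emits nothing and leaves the latch set.
        return None

    def snapshot_active(self):
        """Map each currently-active reference to its snapshot state."""
        return {
            ref: ("ACK_ALM" if s.acked else "UNACK_ALM")
            for ref, s in self._states.items()
            if s.active
        }
```

Feeding the sequence raise, acknowledge, `acked = false`, `active = false` through this sketch ends with `AckRtn`, which is the out-of-order case the latch exists for.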
+ +### Degraded is explicit + +Every subtag-mode transition carries `degraded = true` on the +`OnAlarmTransitionEvent` and `ActiveAlarmSnapshot` proto messages, and the +`AlarmFeedMessage` feed carries an `AlarmProviderStatus` payload on stream +open and on every switch. No client can mistake a subtag-mode alarm for an +authoritative alarmmgr record. Subtag mode has lower fidelity: synthetic +deterministic GUID (SHA-derived from the alarm reference), best-effort +original-raise timestamp, narrower field set. Clients that need full fidelity +must wait for failback. + +### Failover trigger + +The failover trigger is N consecutive wnwrap COM failures — a `COMException` +thrown by `Subscribe` or `PollOnce`, or a failure HRESULT from +`GetXmlCurrentAlarms2`. A single poll failure does not trigger a switch; the +threshold (default 3, floored at 1) guards against transient COM hiccups. The +counter resets on any clean poll so a flapping provider does not permanently +latch in subtag mode. + +### Acknowledge via ack-comment write + +In subtag mode, `AcknowledgeAlarm` writes the operator comment to the alarm +attribute's ack-comment subtag (`Fallback:Subtags:AckComment`). The write +performs the native ack in AVEVA. This differs from alarmmgr mode, where +`AlarmAckByName` on `wwAlarmConsumerClass` is called directly. The `AckComment` +subtag name is empty by default; configuring it is required for ack to work in +subtag mode. The exact AVEVA subtag names are not hard-coded — the `Subtags` +config block exists precisely so names are not guessed without validation +against the live MXAccess attribute set. 
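The routing described above can be sketched as follows. This is a Python illustration under stated assumptions, not the `AlarmDispatcher` code: the function signature, the injected `ack_by_name` and `write_item` callables, the `FailedPrecondition` exception class, and the `<reference>.<subtag>` address composition are all hypothetical.

```python
from enum import Enum


class ProviderMode(Enum):
    ALARMMGR = 1
    SUBTAG = 2


class FailedPrecondition(Exception):
    """Stands in for the gRPC FailedPrecondition status in this sketch."""


def acknowledge(mode, reference, comment, ack_comment_subtag,
                ack_by_name, write_item):
    """Route an ack to the path that matches the active provider mode."""
    if mode is ProviderMode.ALARMMGR:
        # Alarm-manager mode: call the native ack-by-name path directly.
        ack_by_name(reference, comment)
        return "alarmmgr"
    if not ack_comment_subtag:
        # No ack-comment subtag configured, so there is nothing to write to.
        raise FailedPrecondition(
            "no ack-comment subtag configured for " + reference)
    # Assumption: the item address is "<reference>.<subtag>".
    write_item(reference + "." + ack_comment_subtag, comment)
    return "subtag"
```

With an empty `ack_comment_subtag` (the documented default for `AckComment`), the subtag branch raises instead of writing, which mirrors the requirement that the subtag name be configured before ack works in subtag mode.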
+ +### Related documentation + +- [Gateway Configuration — Alarm Fallback options](./GatewayConfiguration.md#alarm-fallback-options) +- [Alarm Client Discovery — Subtag provider](./AlarmClientDiscovery.md) +- [gRPC Contract — provider_status and degraded fields](./Grpc.md) + ## Later Revisit Items These are explicit post-v1 revisit items, not open blockers: diff --git a/docs/GatewayConfiguration.md b/docs/GatewayConfiguration.md index 07c1c04..e65ce68 100644 --- a/docs/GatewayConfiguration.md +++ b/docs/GatewayConfiguration.md @@ -230,6 +230,74 @@ behavior. The alarm monitor is independent of client sessions: `AcknowledgeAlarm` and `StreamAlarms` are session-less RPCs served by the monitor. +### Alarm fallback options + +The `Fallback` sub-section controls how the alarm feed selects between the +native wnwrap alarm-manager provider and the subtag-monitoring fallback. + +| Option | Default | Description | +|--------|---------|-------------| +| `MxGateway:Alarms:Fallback:Mode` | `Auto` | Provider selection mode. `Auto` uses the alarm manager as primary and fails over to subtag monitoring after consecutive COM failures, then fails back automatically. `ForceAlarmManager` disables failover. `ForceSubtag` forces subtag monitoring on from startup. Values are case-insensitive. | +| `MxGateway:Alarms:Fallback:ConsecutiveFailureThreshold` | `3` | Number of consecutive wnwrap COM failures (`COMException` or failure HRESULT from `Subscribe` / `GetXmlCurrentAlarms2`) before the monitor switches to subtag mode. Floored at 1. | +| `MxGateway:Alarms:Fallback:FailbackProbeIntervalSeconds` | `30` | While in subtag mode, how often (in seconds) the monitor probes the wnwrap provider to detect recovery. Floored at 1. | +| `MxGateway:Alarms:Fallback:FailbackStableProbes` | `3` | Number of consecutive clean wnwrap probes required before the monitor switches back to the alarm manager. Floored at 1. 
| +| `MxGateway:Alarms:Fallback:Discovery:UseGalaxyRepository` | `true` | When `true`, the monitor queries the Galaxy Repository SQL database to build the subtag watch-list for the configured area. | +| `MxGateway:Alarms:Fallback:Discovery:Area` | _(empty)_ | Galaxy area to scope the Repository query to. Falls back to `MxGateway:Alarms:DefaultArea` when empty. Ignored when `UseGalaxyRepository` is `false`. | +| `MxGateway:Alarms:Fallback:Discovery:IncludeAttributes` | _(empty)_ | Explicit MXAccess attribute paths to add to the subtag watch-list, supplementing (or replacing, when `UseGalaxyRepository` is `false`) the Repository-derived list. | +| `MxGateway:Alarms:Fallback:Discovery:ExcludeAttributes` | _(empty)_ | Attribute paths to remove from the Repository-derived watch-list. Ignored when `UseGalaxyRepository` is `false`. | +| `MxGateway:Alarms:Fallback:Subtags:Active` | `active` | Subtag name for the in-alarm boolean. | +| `MxGateway:Alarms:Fallback:Subtags:Acked` | `acked` | Subtag name for the acknowledged boolean. | +| `MxGateway:Alarms:Fallback:Subtags:AckComment` | _(empty)_ | Subtag name for the acknowledgement comment attribute. When empty, writing an ack comment in subtag mode is disabled. Must be verified against the live MXAccess attribute set before use. | +| `MxGateway:Alarms:Fallback:Subtags:Priority` | `priority` | Subtag name for the alarm priority / severity value. | + +Validation rules: + +- `Mode` must be `Auto`, `ForceAlarmManager`, or `ForceSubtag` (case-insensitive). +- `Mode = ForceSubtag` with both `UseGalaxyRepository = false` and an empty + `IncludeAttributes` list produces a startup validation warning: the subtag + provider has no attributes to advise. +- `ConsecutiveFailureThreshold`, `FailbackProbeIntervalSeconds`, and + `FailbackStableProbes` are floored at 1 by `GatewayOptionsValidator`. 
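How the threshold options interact can be sketched as a pair of counters. The Python below is an illustration only: the class and method names are hypothetical, probe timing (`FailbackProbeIntervalSeconds`) is left out, and resetting the clean-probe count on a failed probe is inferred from the word "consecutive" rather than stated outright.

```python
class FailoverTracker:
    """Counts consecutive failures and clean probes (illustrative only)."""

    def __init__(self, failure_threshold=3, stable_probes=3):
        # Both options are floored at 1, as the validation rules describe.
        self.failure_threshold = max(1, failure_threshold)
        self.stable_probes = max(1, stable_probes)
        self.mode = "alarmmgr"
        self._failures = 0
        self._clean_probes = 0

    def record_poll(self, ok):
        """Feed one primary poll result while in alarm-manager mode."""
        if self.mode != "alarmmgr":
            return self.mode
        if ok:
            self._failures = 0  # any clean poll resets the counter
        else:
            self._failures += 1
            if self._failures >= self.failure_threshold:
                self.mode = "subtag"
                self._clean_probes = 0
        return self.mode

    def record_probe(self, ok):
        """Feed one failback probe result while in subtag mode."""
        if self.mode != "subtag":
            return self.mode
        # Assumption: a failed probe restarts the clean-probe count.
        self._clean_probes = self._clean_probes + 1 if ok else 0
        if self._clean_probes >= self.stable_probes:
            self.mode = "alarmmgr"
            self._failures = 0
        return self.mode
```

With the defaults, two failures followed by a clean poll leave the tracker in alarm-manager mode; only three failures in a row switch it to subtag mode.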
+ +Full example with non-default fallback settings: + +```json +{ + "MxGateway": { + "Alarms": { + "Enabled": true, + "SubscriptionExpression": "\\\\SCADA01\\Galaxy!PlantArea", + "DefaultArea": "PlantArea", + "ReconcileIntervalSeconds": 30, + "Fallback": { + "Mode": "Auto", + "ConsecutiveFailureThreshold": 3, + "FailbackProbeIntervalSeconds": 30, + "FailbackStableProbes": 3, + "Discovery": { + "UseGalaxyRepository": true, + "Area": "", + "IncludeAttributes": [], + "ExcludeAttributes": [] + }, + "Subtags": { + "Active": "active", + "Acked": "acked", + "AckComment": "", + "Priority": "priority" + } + } + } + } +} +``` + +The exact AVEVA subtag names for `Active`, `Acked`, `AckComment`, and +`Priority` are not hard-coded. The `Subtags` block exists so names can be +confirmed against the live MXAccess attribute set and configured without a +code change. See `docs/AlarmClientDiscovery.md` for the synthesis rules that +depend on these names. + ## Host Endpoints and Transport Security (Kestrel) The listening endpoints are **not** part of the `MxGateway` section. The gateway diff --git a/docs/Grpc.md b/docs/Grpc.md index 0043127..db62307 100644 --- a/docs/Grpc.md +++ b/docs/Grpc.md @@ -94,6 +94,73 @@ Carrying the enqueue timestamp into the worker layer is what lets queue-wait tim `StreamAlarms` is a server-streaming, **session-less** RPC that attaches to the gateway's central alarm feed. The handler delegates to `IGatewayAlarmService.StreamAsync`. The stream opens with one `AlarmFeedMessage` carrying an `active_alarm` per currently-active alarm (the ConditionRefresh snapshot), then a single `snapshot_complete`, then a `transition` for every subsequent raise / acknowledge / clear. It is served by the always-on `GatewayAlarmMonitor`, which owns a single gateway-managed worker session and fans out to every attached client — clients no longer open a session of their own. `alarm_filter_prefix`, when set, scopes the stream to a sub-tree. 
+#### Provider status on the alarm feed + +`AlarmFeedMessage` has a fourth `payload` case, `provider_status`, carrying +an `AlarmProviderStatus` message: + +```protobuf +message AlarmProviderStatus { + AlarmProviderMode mode = 1; + bool degraded = 2; // true whenever mode == SUBTAG + string reason = 3; // human-readable switch reason + google.protobuf.Timestamp since = 4; +} +``` + +The gateway emits `provider_status` once when a client first subscribes +(immediately after the initial snapshot and before the first live transition) +and again on every failover or failback. A late-joining client therefore +always learns the current provider mode without waiting for the next switch. + +`AlarmProviderMode` is an enum with three values: + +| Value | Meaning | +|-------|---------| +| `ALARM_PROVIDER_MODE_UNSPECIFIED` (0) | Default / unset | +| `ALARM_PROVIDER_MODE_ALARMMGR` (1) | Native wnwrap alarm-manager source | +| `ALARM_PROVIDER_MODE_SUBTAG` (2) | Subtag-monitoring fallback (degraded) | + +#### Degraded and source-provider fields on transitions and snapshots + +`OnAlarmTransitionEvent` and `ActiveAlarmSnapshot` both carry two new fields: + +- `bool degraded` (field 14) — `true` when the record came from the subtag + fallback, not the native alarmmgr. +- `AlarmProviderMode source_provider` (field 15) — which provider produced + this record (`ALARMMGR` or `SUBTAG`). + +Both fields are proto3 defaults (`false` / `UNSPECIFIED`) in alarmmgr mode, +so existing clients that do not read them continue to function without change. +Clients that care about provenance — for example, an OPC UA server that +applies different quality flags to degraded alarms — should inspect `degraded` +before consuming the transition. + +Subtag-mode records are a non-parity source. They carry synthetic GUIDs, +best-effort timestamps, and reduced field coverage. See +`docs/AlarmClientDiscovery.md` for the full fidelity table. 
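A client that cares about provenance might branch on these fields roughly as follows. The Python below uses a hand-written dataclass as a stand-in for the generated `OnAlarmTransitionEvent` type, and the `"good"` / `"uncertain"` quality labels are hypothetical; a real client would use its generated proto classes and its own quality model.

```python
from dataclasses import dataclass
from enum import IntEnum


class AlarmProviderMode(IntEnum):
    UNSPECIFIED = 0
    ALARMMGR = 1
    SUBTAG = 2


@dataclass
class Transition:
    """Hypothetical stand-in for OnAlarmTransitionEvent (subset of fields)."""
    alarm_full_reference: str
    degraded: bool = False  # proto3 default
    source_provider: AlarmProviderMode = AlarmProviderMode.UNSPECIFIED


def quality_for(transition):
    """Pick a quality label from the provenance fields."""
    if transition.degraded or transition.source_provider is AlarmProviderMode.SUBTAG:
        # Subtag-mode records are synthesized, so flag them as lower quality.
        return "uncertain"
    # Default-valued fields mean the record came from the alarm manager.
    return "good"
```

Because both fields default to `false` / `UNSPECIFIED`, a record from an older worker that never sets them is treated the same as a native alarm-manager record.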
+ +#### Provider-mode-changed event + +The worker emits `OnAlarmProviderModeChangedEvent` (family +`MX_EVENT_FAMILY_ON_ALARM_PROVIDER_MODE_CHANGED`) on each switch between +providers: + +```protobuf +message OnAlarmProviderModeChangedEvent { + AlarmProviderMode mode = 1; + string reason = 2; + int32 hresult = 3; // COM HRESULT that triggered failover; 0 on failback + google.protobuf.Timestamp at = 4; +} +``` + +This event arrives on the `StreamEvents` stream of the alarm monitor's +internal gateway session (not on client sessions). `GatewayAlarmMonitor` +consumes it and reflects the new mode into the `StreamAlarms` feed's +`provider_status`, the dashboard hub, and metrics. Client sessions do not +receive this event directly. + ## Validation Rules `MxAccessGrpcRequestValidator` rejects requests with `StatusCode.InvalidArgument` before any session work happens. The rules are intentionally narrow — anything that requires session state (for example, "session does not exist") is left for `ISessionManager` so the validator can stay synchronous and side-effect free. diff --git a/gateway.md b/gateway.md index 4db6d7a..0665864 100644 --- a/gateway.md +++ b/gateway.md @@ -143,6 +143,63 @@ session if the worker faults. Gated by `MxGateway:Alarms:Enabled` — see `docs/DesignDecisions.md` for why this reverses the v1 single-subscriber rule for the alarm subsystem. +### Alarm providers and failover + +The alarm feed has two providers, both implemented worker-side: + +- **Alarm manager (primary):** `WnWrapAlarmConsumer` polls + `wwAlarmConsumerClass.GetXmlCurrentAlarms2` on the worker STA. This is the + authoritative native source. +- **Subtag monitoring (standby):** `SubtagAlarmConsumer` advises each alarm + attribute's subtags (`.active`, `.acked`, optionally `.priority`) via the + existing `AddItem`/`Advise` pipeline through `LmxSubtagAlarmSource` and + synthesizes alarm transitions with `SubtagAlarmStateMachine`. 
This is a + non-parity, lower-fidelity source: synthetic GUIDs, no native raise + timestamps, narrower fields. + +`FailoverAlarmConsumer` wraps both and owns the state machine: + +- **Auto-failover:** after `ConsecutiveFailureThreshold` (default 3) + consecutive wnwrap COM failures (`Subscribe` or `PollOnce` throws or + returns a failure HRESULT), it activates the standby. The standby is armed + (subscribed and advising) from the start so its state is warm at the moment + of switch. +- **Auto-failback:** while degraded, every `FailbackProbeIntervalSeconds` + (default 30) it re-probes the still-subscribed primary. After + `FailbackStableProbes` (default 3) consecutive clean polls it switches back + to the alarm manager. +- **On every switch:** the consumer snapshots the now-active provider and + emits `OnAlarmProviderModeChangedEvent` so the gateway can reconcile its + cache without a raise/clear storm. + +Synthesis is worker-side. This preserves the parity rule: the gateway +forwards only events the worker emits and never synthesizes transitions +itself. The synthesis rules are documented in +`docs/AlarmClientDiscovery.md`. + +**Acknowledge in subtag mode:** the ack-by-name path writes the operator +comment to the alarm attribute's ack-comment subtag. The write performs the +ack. If the attribute has no writable ack-comment subtag configured, the RPC +returns `FailedPrecondition`. In alarm-manager mode, `AlarmAckByName` is +used as before. + +**Degraded state visibility:** every subtag-mode transition carries +`degraded = true` and `source_provider = ALARM_PROVIDER_MODE_SUBTAG` on the +`OnAlarmTransitionEvent` and `ActiveAlarmSnapshot` proto fields. The +`AlarmFeedMessage` feed emits an `AlarmProviderStatus` message (the +`provider_status` oneof case) on stream open and on every switch. The +dashboard shows a Bootstrap badge (green for alarm manager, amber when +degraded). 
Metrics: `mxgateway.alarms.provider_mode` gauge (1 = alarmmgr, +2 = subtag) and `mxgateway.alarms.provider_switches` counter. + +Forced modes are available via `MxGateway:Alarms:Fallback:Mode`: +`ForceAlarmManager` disables failover; `ForceSubtag` forces the standby +on from startup; `Auto` (default) enables failover and failback. Watch-list +discovery for the subtag provider uses Galaxy Repository SQL with config +overrides. See `docs/GatewayConfiguration.md` for the full `Fallback` option +block and `docs/AlarmClientDiscovery.md` for synthesis rules and fidelity +limitations. + Dashboard authentication is LDAP-backed (distinct from the API-key model on the gRPC API). `/login` accepts username and password in a form body, binds against `MxGateway:Ldap`, maps the user's LDAP groups to `Admin` or `Viewer`