docs(alarms): document alarmmgr->subtag fallback (providers, failover, config, contract, parity)
@@ -143,6 +143,63 @@ session if the worker faults. Gated by `MxGateway:Alarms:Enabled` — see
`docs/DesignDecisions.md` for why this reverses the v1 single-subscriber rule
for the alarm subsystem.

### Alarm providers and failover

The alarm feed has two providers, both implemented worker-side:

- **Alarm manager (primary):** `WnWrapAlarmConsumer` polls
  `wwAlarmConsumerClass.GetXmlCurrentAlarms2` on the worker STA. This is the
  authoritative native source.
- **Subtag monitoring (standby):** `SubtagAlarmConsumer` advises each alarm
  attribute's subtags (`.active`, `.acked`, optionally `.priority`) via the
  existing `AddItem`/`Advise` pipeline through `LmxSubtagAlarmSource` and
  synthesizes alarm transitions with `SubtagAlarmStateMachine`. This is a
  non-parity, lower-fidelity source: synthetic GUIDs, no native raise
  timestamps, narrower fields.

`FailoverAlarmConsumer` wraps both and owns the failover state machine:

- **Auto-failover:** after `ConsecutiveFailureThreshold` (default 3)
  consecutive wnwrap COM failures (`Subscribe` or `PollOnce` throws or
  returns a failure HRESULT), it activates the standby. The standby is armed
  (subscribed and advising) from the start so its state is warm at the moment
  of the switch.
- **Auto-failback:** while degraded, every `FailbackProbeIntervalSeconds`
  (default 30) it re-probes the still-subscribed primary. After
  `FailbackStableProbes` (default 3) consecutive clean polls it switches back
  to the alarm manager.
- **On every switch:** the consumer snapshots the now-active provider and
  emits `OnAlarmProviderModeChangedEvent` so the gateway can reconcile its
  cache without a raise/clear storm.

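The failover and failback counters described above can be sketched as a small state tracker. This is an illustrative model only, not the worker's implementation: `FailoverTracker`, `ProviderMode`, and `record_primary_result` are hypothetical names, and the sketch assumes that any successful primary poll resets the failure count and any failed probe resets the clean-probe count. The enum values reuse the `provider_mode` gauge encoding (1 = alarmmgr, 2 = subtag).

```python
from dataclasses import dataclass
from enum import Enum


class ProviderMode(Enum):
    ALARM_MANAGER = 1  # primary
    SUBTAG = 2         # standby (degraded)


@dataclass
class FailoverTracker:
    """Hypothetical stand-in for the switching logic in FailoverAlarmConsumer."""

    failure_threshold: int = 3  # ConsecutiveFailureThreshold
    stable_probes: int = 3      # FailbackStableProbes
    mode: ProviderMode = ProviderMode.ALARM_MANAGER
    failures: int = 0
    clean_probes: int = 0

    def record_primary_result(self, ok: bool) -> bool:
        """Feed one primary poll or probe outcome; return True if the mode switched."""
        if self.mode is ProviderMode.ALARM_MANAGER:
            # Assumption: a single clean poll resets the consecutive-failure count.
            self.failures = 0 if ok else self.failures + 1
            if self.failures >= self.failure_threshold:
                self.mode = ProviderMode.SUBTAG
                self.failures = 0
                self.clean_probes = 0
                return True
            return False

        # Degraded: each call models one probe of the still-subscribed primary.
        self.clean_probes = self.clean_probes + 1 if ok else 0
        if self.clean_probes >= self.stable_probes:
            self.mode = ProviderMode.ALARM_MANAGER
            self.clean_probes = 0
            return True
        return False
```

A caller would invoke `record_primary_result` once per poll in normal operation and once per `FailbackProbeIntervalSeconds` tick while degraded, emitting the mode-changed event whenever it returns `True`.
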
Synthesis is worker-side. This preserves the parity rule: the gateway
forwards only events the worker emits and never synthesizes transitions
itself. The synthesis rules are documented in
`docs/AlarmClientDiscovery.md`.

**Acknowledge in subtag mode:** the ack-by-name path writes the operator
comment to the alarm attribute's ack-comment subtag; the write itself performs
the ack. If the attribute has no writable ack-comment subtag configured, the
RPC returns `FailedPrecondition`. In alarm-manager mode, `AlarmAckByName` is
used as before.

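The ack routing can be illustrated with a short sketch. Everything here is hypothetical scaffolding rather than the gateway's code: `AlarmAttribute`, `acknowledge`, the two callbacks, and the `FailedPrecondition` exception (standing in for the RPC status) are invented for illustration, and the way the subtag name is joined to the attribute name is an assumption.

```python
from dataclasses import dataclass
from typing import Callable, Optional


class FailedPrecondition(Exception):
    """Stand-in for the RPC's FailedPrecondition status."""


@dataclass
class AlarmAttribute:
    name: str
    # Hypothetical field: None means no writable ack-comment subtag is configured.
    ack_comment_subtag: Optional[str] = None


def acknowledge(
    attr: AlarmAttribute,
    comment: str,
    subtag_mode: bool,
    write_subtag: Callable[[str, str], None],
    ack_by_name: Callable[[str, str], None],
) -> str:
    """Route one acknowledge request and return the path that was taken."""
    if not subtag_mode:
        # Alarm-manager mode: unchanged native ack-by-name path.
        ack_by_name(attr.name, comment)
        return "alarm_manager"
    if attr.ack_comment_subtag is None:
        raise FailedPrecondition(f"{attr.name} has no writable ack-comment subtag")
    # Subtag mode: writing the operator comment is what performs the ack.
    write_subtag(attr.name + attr.ack_comment_subtag, comment)
    return "subtag"
```

For example, with a made-up attribute `Tank1.HiHi` and a made-up subtag `.AckMsg`, a subtag-mode ack results in one write to `Tank1.HiHi.AckMsg`, while the same call for an attribute without a configured subtag raises `FailedPrecondition`.
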
**Degraded state visibility:** every subtag-mode transition carries
`degraded = true` and `source_provider = ALARM_PROVIDER_MODE_SUBTAG` in the
`OnAlarmTransitionEvent` and `ActiveAlarmSnapshot` proto messages. The
`AlarmFeedMessage` feed emits an `AlarmProviderStatus` message (the
`provider_status` oneof case) on stream open and on every switch. The
dashboard shows a Bootstrap badge (green for alarm manager, amber when
degraded). Metrics: the `mxgateway.alarms.provider_mode` gauge (1 = alarmmgr,
2 = subtag) and the `mxgateway.alarms.provider_switches` counter.

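As a rough illustration of how these signals line up, the sketch below tags a transition with its provenance and derives the gauge value and badge colour from the active mode. It uses plain dictionaries rather than generated proto classes. The constant `ALARM_PROVIDER_MODE_ALARMMGR` and the helpers `stamp` and `badge_colour` are assumptions made for the example; only `ALARM_PROVIDER_MODE_SUBTAG`, the two field names, the gauge values, and the badge colours come from the text above. The sketch also assumes alarm-manager transitions carry `degraded = false`.

```python
SUBTAG = "ALARM_PROVIDER_MODE_SUBTAG"           # named in the text above
ALARM_MANAGER = "ALARM_PROVIDER_MODE_ALARMMGR"  # hypothetical enum name

# mxgateway.alarms.provider_mode gauge encoding: 1 = alarmmgr, 2 = subtag.
GAUGE_VALUE = {ALARM_MANAGER: 1, SUBTAG: 2}


def stamp(transition: dict, mode: str) -> dict:
    """Return a copy of the transition tagged with provider provenance."""
    return {**transition, "degraded": mode == SUBTAG, "source_provider": mode}


def badge_colour(mode: str) -> str:
    """Dashboard badge colour: amber when degraded, green otherwise."""
    return "amber" if mode == SUBTAG else "green"
```

A client that only cares about fidelity can therefore filter on `degraded` alone, without tracking provider switches itself.
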
Forced modes are available via `MxGateway:Alarms:Fallback:Mode`:
`ForceAlarmManager` disables failover; `ForceSubtag` forces the standby
on from startup; `Auto` (default) enables failover and failback. Watch-list
discovery for the subtag provider uses Galaxy Repository SQL with config
overrides. See `docs/GatewayConfiguration.md` for the full `Fallback` option
block and `docs/AlarmClientDiscovery.md` for synthesis rules and fidelity
limitations.

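One way to read the three `Mode` values is as a pair: which provider is active at startup, and whether automatic switching is allowed. The sketch below encodes that reading. `FallbackMode` and `resolve_mode` are hypothetical names, and the sketch assumes that `ForceSubtag` also suppresses failback, which the text implies but does not state outright.

```python
from enum import Enum
from typing import Optional, Tuple


class FallbackMode(Enum):
    AUTO = "Auto"
    FORCE_ALARM_MANAGER = "ForceAlarmManager"
    FORCE_SUBTAG = "ForceSubtag"


def resolve_mode(raw: Optional[str]) -> Tuple[str, bool]:
    """Map a configured Mode string to (initial provider, auto-switching enabled)."""
    # A missing value falls back to the documented default, Auto.
    mode = FallbackMode(raw) if raw else FallbackMode.AUTO
    if mode is FallbackMode.FORCE_ALARM_MANAGER:
        return ("alarmmgr", False)
    if mode is FallbackMode.FORCE_SUBTAG:
        # Assumption: forcing the standby also disables failback.
        return ("subtag", False)
    return ("alarmmgr", True)
```

An unrecognised string raises `ValueError` from the enum lookup, which in this sketch surfaces a misconfiguration at startup instead of silently falling back to `Auto`.
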
Dashboard authentication is LDAP-backed (distinct from the API-key model on
the gRPC API). `/login` accepts username and password in a form body, binds
against `MxGateway:Ldap`, maps the user's LDAP groups to `Admin` or `Viewer`