ScadaBridge/docs/plans/2026-05-29-native-alarms-design.md
Joseph Doherty dadebbe227 docs(plans): native OPC UA & MxAccess GW alarms design
Read-only mirror of native alarm sources into a unified A&C-style state
model (severity + active/acked/shelved/suppressed). Instance-bound source
discovery, site-only SQLite state with live central query (no central
tables), DebugView enrichment. OPC UA A&C events + ConditionRefresh and
MxGateway session-less StreamAlarms via a new IAlarmSubscribableConnection
seam routed connection-level by source reference; new NativeAlarmActor peer
to computed AlarmActor.
2026-05-29 15:14:01 -04:00

# Native OPC UA & MxAccess Gateway Alarms — Design
**Date:** 2026-05-29
**Status:** Approved
## Problem
Today alarms are **computed at the site**: a `TemplateAlarm` defines a trigger (ValueMatch, RangeViolation, RateOfChange, HiLo, Expression); one `AlarmActor` per alarm evaluates attribute values and emits `AlarmStateChanged` carrying a bare `AlarmState { Active, Normal }` plus an integer `Priority` and (for HiLo) a `Level`. State is in-memory only — there is **no severity dimension, no acknowledgement, no shelve/suppress state, and no operator metadata** — and it surfaces only in the per-instance DebugView.
Two data sources we connect to own their own alarm lifecycle and expose far richer state:
- **OPC UA Alarms & Conditions (Part 9):** the server raises/acks/clears `AlarmCondition` nodes with orthogonal sub-states (Active/Inactive, Acked/Unacked, Confirmed/Unconfirmed, Shelved, Suppressed) and a 1–1000 severity. The DCL OPC UA adapter currently subscribes only to the `Value` attribute.
- **MxAccess Gateway:** already exposes a session-less `StreamAlarms` feed (`OnAlarmTransitionEvent`: raise/ack/clear/retrigger, severity, operator user + comment, category, description, current/limit value) plus `QueryActiveAlarms`. The DCL MxGateway adapter currently consumes only the `OnDataChange` event family.
These are **mirrored** alarms — the source is the source of truth — which is a real divergence from the computed model. This design enriches the alarm tracking model to carry severity + ack/shelve/suppress state, and ingests native alarms from both sources.
## Design Decisions
| Decision | Choice |
|----------|--------|
| State model scope | **Unified** A&C-style state model for *all* alarms (computed + native) |
| Interactivity | **Read-only mirror** — display source-reported state; no acking/shelving from ScadaBridge, no command relay, no operator identity captured by ScadaBridge (source-supplied operator user/comment *are* displayed for native alarms) |
| Binding | Instance declares a `NativeAlarmSource` (connection + source ref); conditions under it are **discovered at runtime**, keyed by source reference |
| State location | **Site-only**, persisted to SQLite (survives restart/failover); central **queries live** (snapshot + live stream); **no central tables, no central history** |
| MxGateway transport | Gateway session-less `StreamAlarms` feed |
| OPC UA transport | Alarms & Conditions events + `ConditionRefresh` snapshot |
| Site actor structure | New `NativeAlarmActor` child of `InstanceActor`, peer to computed `AlarmActor`s (Approach 1) |
| Authoring | Central UI design-time panels (Template editor + Instance Configure) **and** CLI |
| Runtime UI | Enrich the per-instance DebugView alarm table only (no new operator page) |
### Trade-offs accepted
- **No central audit trail of alarms** (who acked, history). Acceptable because the source systems own ack and retain their own alarm history; ScadaBridge is a read-only window. If audit of alarm state is later wanted, a central mirror following the Site Call Audit (#22) pattern can be added without disturbing this design.
- **Read-only** means MxGateway/OPC UA acknowledgements happen in the source's own tools; ScadaBridge reflects them.
---
## Section 1 — Unified state model & wire contracts
**New Commons types** (`Types/Enums/`, `Types/Alarms/`):
```
enum AlarmKind { Computed, NativeOpcUa, NativeMxAccess }
enum AlarmShelveState { Unshelved, OneShotShelved, TimedShelved, PermanentShelved }
record AlarmConditionState(
    bool Active,              // Active vs Inactive
    bool Acknowledged,        // Acked / Unacked
    bool? Confirmed,          // null = not a confirmable condition (OPC UA optional)
    AlarmShelveState Shelve,
    bool Suppressed,
    int Severity);            // 0–1000, unified scale
```
The OPC UA Part 9 sub-conditions are **orthogonal**, and MxAccess's `ACTIVE / ACTIVE_ACKED / INACTIVE` states map cleanly onto them, so they are modeled as independent flags (the UI rolls them up for display).
**`AlarmStateChanged` is extended additively** (existing fields kept for back-compat; new fields defaulted):
| New field | Default | Notes |
|-----------|---------|-------|
| `Kind` | `Computed` | discriminator |
| `Condition` | computed from existing | the `AlarmConditionState` above |
| `SourceReference` | `""` | native key, e.g. `"Tank01.Level.HiHi"` |
| `AlarmTypeName` | `""` | native, e.g. `"AnalogLimitAlarm.HiHi"` |
| `Category` | `""` | native taxonomy |
| `OperatorUser` | `""` | native ack metadata (display-only) |
| `OperatorComment` | `""` | native ack metadata (display-only) |
| `OriginalRaiseTime` | `null` | native |
| `CurrentValue` | `""` | native, display |
| `LimitValue` | `""` | native, display |
**Identity / key:** computed alarms key by `AlarmName` (canonical); native alarms key by `SourceReference` (stable across transitions). `Kind` discriminates. The existing `AlarmName` field carries the source reference's display form for native rows so existing consumers don't break.
**Source → `AlarmConditionState` mapping:**
- **Computed:** `Active = state==Active`, `Acknowledged = true` (auto), `Confirmed = null`, `Shelve = Unshelved`, `Suppressed = false`, `Severity = Priority`, `Level` retained for HiLo.
- **OPC UA A&C:** read `ActiveState`, `AckedState`, `ConfirmedState`, `ShelvingState`, `SuppressedState`, `Severity` from the condition's event fields.
- **MxAccess:** `ACTIVE → (Active=t, Ack=f)`, `ACTIVE_ACKED → (Active=t, Ack=t)`, `INACTIVE → (Active=f)`; `Severity` from the gateway's remapped 0–1000; shelve/suppress keep their defaults (the gateway proto doesn't surface them).
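The computed and MxAccess mappings above could be sketched roughly as follows. This is an illustration under assumptions, not the project's implementation: `AlarmConditionMapper`, its method names, and the `ackedWhenInactive` parameter are hypothetical (the design does not say what `Acknowledged` should be for `INACTIVE`), and the severity clamp is a defensive addition on top of the gateway's own remapping.

```csharp
#nullable enable
using System;

public enum AlarmShelveState { Unshelved, OneShotShelved, TimedShelved, PermanentShelved }

public record AlarmConditionState(
    bool Active, bool Acknowledged, bool? Confirmed,
    AlarmShelveState Shelve, bool Suppressed, int Severity);

// Hypothetical helper; names are illustrative only.
public static class AlarmConditionMapper
{
    // Computed alarms: auto-acknowledged, never shelved or suppressed, Severity = Priority.
    public static AlarmConditionState FromComputed(bool isActive, int priority) =>
        new(Active: isActive, Acknowledged: true, Confirmed: null,
            Shelve: AlarmShelveState.Unshelved, Suppressed: false, Severity: priority);

    // MxAccess: the gateway state sets Active/Acknowledged; shelve/suppress keep defaults.
    // ASSUMPTION: the ack flag for INACTIVE is not specified, so the caller supplies it.
    public static AlarmConditionState FromMxAccess(
        string gatewayState, int severity, bool ackedWhenInactive = true)
    {
        (bool active, bool acked) = gatewayState switch
        {
            "ACTIVE" => (true, false),
            "ACTIVE_ACKED" => (true, true),
            "INACTIVE" => (false, ackedWhenInactive),
            _ => throw new ArgumentOutOfRangeException(
                nameof(gatewayState), gatewayState, "Unknown MxAccess alarm state")
        };

        // Defensive clamp to the unified 0-1000 scale (an addition in this sketch).
        return new AlarmConditionState(
            Active: active, Acknowledged: acked, Confirmed: null,
            Shelve: AlarmShelveState.Unshelved, Suppressed: false,
            Severity: Math.Clamp(severity, 0, 1000));
    }
}
```

With this shape, `FromMxAccess("ACTIVE_ACKED", 700)` yields an active, acknowledged condition with severity 700, and an unrecognized state string fails loudly instead of silently mapping to a default.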
**gRPC `AlarmStateUpdate` (`sitestream.proto`)** gets the same fields appended as new field numbers (additive — never renumber/remove): `kind`, `active`, `acknowledged`, `confirmed`, `shelve_state`, `suppressed`, `source_reference`, `alarm_type_name`, `category`, `operator_user`, `operator_comment`, `original_raise_time`, `current_value`, `limit_value`. Existing `state`, `priority`, `level`, `message` stay for compatibility.
---
## Section 2 — Configuration, binding & deployment
This mirrors how template **attributes bind to a data source** today (template declares, instance overrides the concrete reference).
**New Commons entities**
- `TemplateNativeAlarmSource` (`Entities/Templates/`): `Id`, `TemplateId`, `Name` (unique within template), `Description?`, `ConnectionName`, `SourceReference` (OPC UA SourceNode/notifier nodeId, or MxAccess object/area), `ConditionFilter?` (null = mirror *all* conditions under the source), `IsLocked`, `IsInherited` — same lock/inherit bookkeeping as `TemplateAlarm`.
- `InstanceNativeAlarmSourceOverride` (`Entities/Instances/`): `Id`, `InstanceId`, `SourceCanonicalName`, `ConnectionNameOverride?`, `SourceReferenceOverride?`, `ConditionFilterOverride?`; unique `(InstanceId, SourceCanonicalName)`. `SourceReference` is the field that varies per physical instance, so per-instance override is the common case.
**Flattening** (`FlatteningService`, `FlattenedConfiguration`)
- New `ResolvedNativeAlarmSource { CanonicalName, ConnectionName, SourceReference, ConditionFilter?, Source }`, resolved through the same steps as `ResolvedAlarm`: inherited → composed (path-qualified `[Module].[Name]`) → instance overrides applied.
- **Pre-deployment semantic validation** (extends existing checks): `ConnectionName` resolves to a real site `DataConnection`; that connection's protocol is alarm-capable (`OpcUa` or `MxGateway`); `SourceReference` non-empty; canonical-name collision check.
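As a rough sketch of the override step and the semantic checks listed above: the record shapes below are trimmed stand-ins for the real entities, `NativeAlarmSourceResolver` is a hypothetical name, and the connection lookup is modeled as a plain name-to-protocol dictionary rather than the actual `DataConnection` entity.

```csharp
#nullable enable
using System;
using System.Collections.Generic;

// Trimmed stand-ins for the real entities (assumed shapes, illustration only).
public record TemplateSource(
    string CanonicalName, string ConnectionName, string SourceReference, string? ConditionFilter);
public record SourceOverride(
    string SourceCanonicalName, string? ConnectionNameOverride,
    string? SourceReferenceOverride, string? ConditionFilterOverride);
public record ResolvedNativeAlarmSource(
    string CanonicalName, string ConnectionName, string SourceReference, string? ConditionFilter);

public static class NativeAlarmSourceResolver
{
    private static readonly HashSet<string> AlarmCapableProtocols = new() { "OpcUa", "MxGateway" };

    // A null override field means "keep the template value". Limitation of this sketch:
    // an override cannot reset ConditionFilter back to null (mirror all).
    public static ResolvedNativeAlarmSource ApplyOverride(TemplateSource t, SourceOverride? o) =>
        new(t.CanonicalName,
            o?.ConnectionNameOverride ?? t.ConnectionName,
            o?.SourceReferenceOverride ?? t.SourceReference,
            o?.ConditionFilterOverride ?? t.ConditionFilter);

    // Returns human-readable validation errors; an empty list means deployable.
    public static List<string> Validate(
        IEnumerable<ResolvedNativeAlarmSource> sources,
        IReadOnlyDictionary<string, string> connectionProtocols)
    {
        var errors = new List<string>();
        var seenNames = new HashSet<string>(StringComparer.Ordinal);
        foreach (var s in sources)
        {
            if (!seenNames.Add(s.CanonicalName))
                errors.Add($"Canonical name collision: '{s.CanonicalName}'");
            if (string.IsNullOrWhiteSpace(s.SourceReference))
                errors.Add($"'{s.CanonicalName}': SourceReference is empty");
            if (!connectionProtocols.TryGetValue(s.ConnectionName, out var protocol))
                errors.Add($"'{s.CanonicalName}': unknown connection '{s.ConnectionName}'");
            else if (!AlarmCapableProtocols.Contains(protocol))
                errors.Add($"'{s.CanonicalName}': protocol '{protocol}' is not alarm-capable");
        }
        return errors;
    }
}
```

The collision check here only compares native alarm sources with each other; the real check presumably covers whatever other canonical names the flattened configuration already validates.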
**ConfigurationDatabase (EF + migration)**
- `TemplateNativeAlarmSourceConfiguration` → table `Templates.NativeAlarmSources`, unique `(TemplateId, Name)`, FK cascade.
- `InstanceNativeAlarmSourceOverrideConfiguration` → table `InstanceNativeAlarmSourceOverrides`, unique `(InstanceId, SourceCanonicalName)`, FK cascade.
- One migration adds both tables (auto-apply in dev per existing convention).
**Deployment:** `FlattenedConfiguration` carries `ResolvedNativeAlarmSource[]`, deployed alongside `ResolvedAlarm[]` on the existing artifact path. Site Runtime consumes them when building the instance actor hierarchy. All-or-nothing per-instance apply unchanged.
**Authoring (Central UI + CLI)**
- Template editor: a "Native Alarm Sources" subsection (name, connection dropdown filtered to alarm-capable protocols, source reference, optional filter).
- Instance Configure: override connection/source-ref/filter per instance, like attribute data-source overrides.
- CLI: `template native-alarm-source add/list/remove`, `instance native-alarm-source set/clear`.
---
## Section 3 — DCL ingestion & the two adapters
**Capability seam** (mirrors the existing `IBrowsableDataConnection` pattern):
```
interface IAlarmSubscribableConnection {
    // Returns a subscription id, later passed to UnsubscribeAlarmsAsync.
    Task<string> SubscribeAlarmsAsync(string sourceReference, string? conditionFilter, AlarmTransitionCallback cb);
    Task UnsubscribeAlarmsAsync(string subscriptionId);
}
delegate void AlarmTransitionCallback(NativeAlarmTransition t);
```
**Protocol-neutral transition** (`Commons/Types/Alarms/`):
```
enum AlarmTransitionKind { Snapshot, SnapshotComplete, Raise, Acknowledge, Clear, Retrigger, StateChange }
record NativeAlarmTransition(
    string SourceReference, string SourceObjectReference, string AlarmTypeName,
    AlarmTransitionKind Kind, AlarmConditionState Condition,
    string Category, string Description, string Message,
    string OperatorUser, string OperatorComment,
    DateTimeOffset? OriginalRaiseTime, DateTimeOffset TransitionTime,
    string CurrentValue, string LimitValue);
```
`Snapshot`/`SnapshotComplete` carry the initial active-condition replay so the consumer re-seeds a source's state on every (re)subscribe — this is how reconnect reconciliation works without central storage.
**Connection-level transport + source-ref routing.** Although binding is *declared* per-instance, the subscription is naturally **connection-level** (OPC UA wants one event subscription; MxGateway `StreamAlarms` is one session-less feed). `DataConnectionActor` opens **one** alarm feed per connection and maintains `_alarmSubscribers: SourceObjectRef → set<instance actorRef>`, routing each transition to matching instances.
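The `_alarmSubscribers` routing table could look roughly like the sketch below. It is deliberately independent of Akka.NET: `TSubscriber` stands in for the instance actor reference, `AlarmSubscriberRouter` is a hypothetical name, and exact-match routing on the source object reference is an assumption (the real routing might use prefix or hierarchy matching).

```csharp
#nullable enable
using System;
using System.Collections.Generic;

// Hypothetical routing table: source object reference -> set of subscribers.
public sealed class AlarmSubscriberRouter<TSubscriber> where TSubscriber : notnull
{
    private readonly Dictionary<string, HashSet<TSubscriber>> _alarmSubscribers =
        new(StringComparer.Ordinal);

    // Returns true when this is the first subscriber overall,
    // i.e. the single connection-level feed needs to be opened.
    public bool Subscribe(string sourceObjectRef, TSubscriber subscriber)
    {
        bool wasEmpty = _alarmSubscribers.Count == 0;
        if (!_alarmSubscribers.TryGetValue(sourceObjectRef, out var set))
        {
            set = new HashSet<TSubscriber>();
            _alarmSubscribers[sourceObjectRef] = set;
        }
        set.Add(subscriber);
        return wasEmpty;
    }

    // Returns true when no subscribers remain, i.e. the feed can be closed.
    public bool Unsubscribe(string sourceObjectRef, TSubscriber subscriber)
    {
        if (_alarmSubscribers.TryGetValue(sourceObjectRef, out var set))
        {
            set.Remove(subscriber);
            if (set.Count == 0) _alarmSubscribers.Remove(sourceObjectRef);
        }
        return _alarmSubscribers.Count == 0;
    }

    // Subscribers that should receive a transition for this source object.
    public IReadOnlyCollection<TSubscriber> Route(string sourceObjectRef)
    {
        if (_alarmSubscribers.TryGetValue(sourceObjectRef, out var set)) return set;
        return Array.Empty<TSubscriber>();
    }
}
```

Returning the open/close hints from `Subscribe` and `Unsubscribe` keeps the "one feed per connection" rule in one place: the actor opens the feed on the first `true` and closes it on the last.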
New messages (`Messages/DataConnection/`): `SubscribeAlarmsRequest`/`Response`, `UnsubscribeAlarmsRequest`, internal `NativeAlarmTransitionReceived(conn, transition, generation)`, forwarded as `NativeAlarmTransitionUpdate(conn, transition)`. Subscribe/unsubscribe obey the existing **Become/Stash** lifecycle (stashed while Connecting/Reconnecting, replayed on Connected). The stale-callback **generation guard** and once-only disconnect guard apply unchanged. On disconnect the actor emits a per-source `NativeAlarmSourceUnavailable` so consumers mark mirrored alarms *uncertain* rather than clearing them.
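The stale-callback generation guard mentioned above can be illustrated with a minimal sketch. The message record is trimmed to the fields that matter here, and `AlarmFeedState`, `OnConnected`, and `TryAccept` are hypothetical names; the existing guard in `DataConnectionActor` may be shaped differently.

```csharp
#nullable enable
using System;

// Trimmed stand-in for the internal message (the real one also carries the transition).
public record NativeAlarmTransitionReceived(string Connection, string SourceReference, long Generation);

// Hypothetical guard state. Actor message handling is single-threaded,
// so a plain counter is enough in this sketch.
public sealed class AlarmFeedState
{
    private long _generation;

    public long CurrentGeneration => _generation;

    // Called each time the connection (re)enters Connected and a new feed is opened.
    // Callbacks registered for that feed capture the returned generation.
    public long OnConnected() => ++_generation;

    // Forwards the message only if it came from the current feed;
    // callbacks from an earlier feed are dropped as stale.
    public bool TryAccept(
        NativeAlarmTransitionReceived message,
        Action<NativeAlarmTransitionReceived> forward)
    {
        if (message.Generation != _generation) return false;
        forward(message);
        return true;
    }
}
```

A callback that captured generation 1 and fires after a reconnect (generation 2) is rejected by `TryAccept`, so a late event from the old feed cannot overwrite state seeded by the new snapshot.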
**OPC UA A&C adapter** (`OpcUaDataConnection` / `RealOpcUaClient`)
- One event `MonitoredItem` (`AttributeId = EventNotifier`) on the Server object (i=2253) or configured notifier, with an `EventFilter`: SelectClauses for EventId, EventType, SourceNode, SourceName, Time, Message, Severity + `ConditionType`/`AcknowledgeableConditionType`/`AlarmConditionType` state fields (Acked/Confirmed/Active/Shelving/Suppressed). Optional WhereClause scoping to the union of bound SourceNodes.
- Map event fields → `NativeAlarmTransition`; derive `Kind` from which sub-state changed.
- Call `ConditionRefresh` on (re)subscribe → emit the `Snapshot`/`SnapshotComplete` sequence.
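One way to "derive `Kind` from which sub-state changed" is to compare the previous and current condition state for the same source reference. The sketch below is an assumption about how that derivation might work: the precedence order and the retrigger rule (an acknowledged active alarm becoming unacknowledged again) are choices made for illustration, and `TransitionKindDeriver` is a hypothetical name.

```csharp
#nullable enable

public enum AlarmTransitionKind { Snapshot, SnapshotComplete, Raise, Acknowledge, Clear, Retrigger, StateChange }
public enum AlarmShelveState { Unshelved, OneShotShelved, TimedShelved, PermanentShelved }

public record AlarmConditionState(
    bool Active, bool Acknowledged, bool? Confirmed,
    AlarmShelveState Shelve, bool Suppressed, int Severity);

// Hypothetical helper; the precedence below is an assumption, not the specified behavior.
public static class TransitionKindDeriver
{
    public static AlarmTransitionKind Derive(AlarmConditionState? previous, AlarmConditionState current)
    {
        // First sighting of a condition: treat an active one as a raise.
        if (previous is null)
            return current.Active ? AlarmTransitionKind.Raise : AlarmTransitionKind.StateChange;

        if (!previous.Active && current.Active) return AlarmTransitionKind.Raise;
        if (previous.Active && !current.Active) return AlarmTransitionKind.Clear;
        if (!previous.Acknowledged && current.Acknowledged) return AlarmTransitionKind.Acknowledge;

        // Still active, but the ack was withdrawn: modeled here as a retrigger.
        if (current.Active && previous.Acknowledged && !current.Acknowledged)
            return AlarmTransitionKind.Retrigger;

        // Shelving, suppression, confirmation or severity changes.
        return AlarmTransitionKind.StateChange;
    }
}
```

Because the activity checks come first, an event that both clears and acknowledges a condition is reported as `Clear`; a different precedence may suit the UI better.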
**MxGateway adapter** (`MxGatewayDataConnection` / `RealMxGatewayClient`)
- Open session-less `StreamAlarms` (optional `alarm_filter_prefix` from bound source refs). Map `AlarmFeedMessage`: `active_alarm` → `Snapshot`, `snapshot_complete` → `SnapshotComplete`, `transition (OnAlarmTransitionEvent)` → mapped transition (RAISE/ACK/CLEAR/RETRIGGER, severity, operator user+comment, category, description, raise/transition times, current/limit value).
- Resumable stream; on transport fault re-open (existing `RaiseDisconnected` once-only guard) → fresh snapshot re-seeds.
- Uses `ZB.MOM.WW.MxGateway.Client`'s `StreamAlarmsAsync` (already exercised by OtOpcUa's `GatewayGalaxyAlarmFeed`); bump the NuGet package if the referenced version predates it.
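The feed-message mapping described above might be sketched as below. The `Feed*` records are hypothetical stand-ins for the three `AlarmFeedMessage` cases and are not the real `ZB.MOM.WW.MxGateway.Client` types; the event-kind strings and the fallback to `StateChange` for unknown kinds are likewise assumptions.

```csharp
#nullable enable
using System;

public enum AlarmTransitionKind { Snapshot, SnapshotComplete, Raise, Acknowledge, Clear, Retrigger, StateChange }

// HYPOTHETICAL stand-ins for the three AlarmFeedMessage cases; not the real client types.
public abstract record FeedMessage;
public sealed record FeedActiveAlarm(string Reference) : FeedMessage;
public sealed record FeedSnapshotComplete : FeedMessage;
public sealed record FeedTransition(string Reference, string EventKind) : FeedMessage;

public static class MxFeedMapper
{
    // Maps one feed message to (source reference, transition kind).
    // SnapshotComplete is a marker and carries no reference.
    public static (string Reference, AlarmTransitionKind Kind) Map(FeedMessage message) => message switch
    {
        FeedActiveAlarm a => (a.Reference, AlarmTransitionKind.Snapshot),
        FeedSnapshotComplete => (string.Empty, AlarmTransitionKind.SnapshotComplete),
        FeedTransition t => (t.Reference, MapEvent(t.EventKind)),
        _ => throw new ArgumentException(
            $"Unhandled feed message {message.GetType().Name}", nameof(message))
    };

    // ASSUMPTION: unknown event kinds degrade to StateChange rather than failing.
    private static AlarmTransitionKind MapEvent(string eventKind) => eventKind switch
    {
        "RAISE" => AlarmTransitionKind.Raise,
        "ACK" => AlarmTransitionKind.Acknowledge,
        "CLEAR" => AlarmTransitionKind.Clear,
        "RETRIGGER" => AlarmTransitionKind.Retrigger,
        _ => AlarmTransitionKind.StateChange
    };
}
```

A real adapter would also copy severity, operator, category, and value fields into `NativeAlarmTransition`; they are omitted here to keep the kind mapping visible.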
---
## Section 4 — Site runtime, central query, UI, errors & testing
**`NativeAlarmActor` (new)**
- Child of `InstanceActor`, one per `ResolvedNativeAlarmSource` (named `native-alarm-{canonicalName}`). On `PreStart` sends `SubscribeAlarmsRequest` for its `(ConnectionName, SourceReference, ConditionFilter)`. Holds `_alarms: Dictionary<sourceRef, MirroredAlarm>` (discovered conditions + `AlarmConditionState` + metadata).
- On `NativeAlarmTransitionUpdate`: `Snapshot…SnapshotComplete` → buffer then **atomic swap** the source's set (drop conditions absent from the snapshot, emit diffs — no flicker); `Raise/Ack/Clear/Retrigger/StateChange` → update entry, last-write-wins by `TransitionTime` (ignore older). Each change emits an enriched `AlarmStateChanged` to `InstanceActor` → existing stream path.
- **Retention:** keep an entry while `Active` OR `Unacked`; once fully normal (`Inactive` AND `Acked`) emit a final return-to-normal and drop it.
- On `NativeAlarmSourceUnavailable`: mark its alarms **uncertain** (snapshot flag) rather than clearing; re-seed from the reconnect snapshot.
- **Persistence:** site-SQLite table `NativeAlarmState (InstanceUniqueName, SourceCanonicalName, SourceReference, serialized condition+metadata, LastTransitionTime)`. Rehydrate on `PreStart` (so central can query immediately after restart), then reconcile against the fresh snapshot. Reset on redeployment, like static attribute writes.
- **Supervision:** coordinator-style child → **Resume**. A bad source ref / subscribe failure logs to the site event log (`alarm`), reports unhealthy, and is retried periodically (same spirit as tag-resolution retry) without crashing the instance.
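The snapshot swap, last-write-wins, and retention rules above can be combined into one small state container. The following is a sketch under assumptions rather than the actor itself: `MirroredAlarm` is trimmed to the fields the rules need, `MirroredAlarmSet` is a hypothetical name, and live transitions that arrive while a snapshot is still buffering are simply applied to the live set (the real actor may need to handle that overlap more carefully).

```csharp
#nullable enable
using System;
using System.Collections.Generic;

public enum AlarmTransitionKind { Snapshot, SnapshotComplete, Raise, Acknowledge, Clear, Retrigger, StateChange }

// Trimmed stand-in for the mirrored entry (the real one carries full condition + metadata).
public record MirroredAlarm(string SourceReference, bool Active, bool Acknowledged, DateTimeOffset TransitionTime);

public sealed class MirroredAlarmSet
{
    private readonly Dictionary<string, MirroredAlarm> _alarms = new(StringComparer.Ordinal);
    private Dictionary<string, MirroredAlarm>? _snapshotBuffer;

    public IReadOnlyDictionary<string, MirroredAlarm> Alarms => _alarms;

    // Applies one transition and returns the source references whose state changed.
    public IReadOnlyList<string> Apply(AlarmTransitionKind kind, MirroredAlarm? alarm)
    {
        switch (kind)
        {
            case AlarmTransitionKind.Snapshot:
                // Buffer only; the live set is untouched until SnapshotComplete.
                _snapshotBuffer ??= new(StringComparer.Ordinal);
                _snapshotBuffer[alarm!.SourceReference] = alarm;
                return Array.Empty<string>();
            case AlarmTransitionKind.SnapshotComplete:
                return SwapSnapshot();
            default:
                return ApplyLive(alarm!);
        }
    }

    private IReadOnlyList<string> SwapSnapshot()
    {
        var incoming = _snapshotBuffer ?? new Dictionary<string, MirroredAlarm>(StringComparer.Ordinal);
        _snapshotBuffer = null;

        var changed = new List<string>();
        foreach (var key in _alarms.Keys)
            if (!incoming.ContainsKey(key)) changed.Add(key);            // absent from snapshot: dropped
        foreach (var (key, value) in incoming)
            if (!_alarms.TryGetValue(key, out var old) || old != value)  // new or different
                changed.Add(key);

        _alarms.Clear();
        foreach (var (key, value) in incoming) _alarms[key] = value;
        return changed;
    }

    private IReadOnlyList<string> ApplyLive(MirroredAlarm alarm)
    {
        // Last-write-wins by TransitionTime: ignore anything older than what is held.
        if (_alarms.TryGetValue(alarm.SourceReference, out var existing)
            && alarm.TransitionTime < existing.TransitionTime)
            return Array.Empty<string>();

        // Retention: keep while Active OR Unacked; drop once Inactive AND Acked.
        if (!alarm.Active && alarm.Acknowledged) _alarms.Remove(alarm.SourceReference);
        else _alarms[alarm.SourceReference] = alarm;
        return new[] { alarm.SourceReference };
    }
}
```

The returned list is what the actor would turn into enriched `AlarmStateChanged` emissions. Because the swap reports only differences, a reconnect snapshot identical to the held state produces no emissions at all.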
**Computed `AlarmActor`:** no logic change — populate `AlarmConditionState` on emit (`Active`, `Acknowledged=true`, `Severity=Priority`, `Level` retained, `Kind=Computed`).
**`InstanceActor`:** builds `NativeAlarmActor`s from `ResolvedNativeAlarmSource[]`; native `AlarmStateChanged` flows through the existing `_alarmStates`/`_alarmTimestamps` + `_streamManager.PublishAlarmStateChanged` path (state dictionaries extended to carry the enriched shape); the instance snapshot includes native alarms.
**Streaming + central query (no central tables)**
- Live: enriched `AlarmStateChanged` → `SiteStreamManager` → enriched gRPC `AlarmStateUpdate` → DebugView, as today.
- Initial snapshot: the existing **ClusterClient instance-snapshot** request (DebugView's seed) is extended to include native alarms in the unified shape. Large snapshots reuse existing per-subscriber buffering / frame-size guard (the browse-cap precedent); chunk if needed.
**Central UI — DebugView enrichment** (+ Section 2 authoring panels)
- Alarm table gains: Severity, a composite condition badge (Active/Acked/Shelved/Suppressed), a Kind badge (computed vs native), Source reference, Alarm type, Category, Operator/comment (tooltip), Original raise time, Current/Limit value (tooltip). Computed rows show severity=priority, auto-acked. Built with the `frontend-design` skill, Bootstrap-only custom components.
**Error handling / edge cases**
- Connection loss → uncertain, not cleared; reconnect snapshot reconciles. Source ref absent from snapshot → cleared. Severity normalized to 0–1000. **Bounded growth:** configurable per-source mirrored-alarm cap in `SiteRuntimeOptions`; when hit, **log it** (no silent truncation). Disabled/deleted instance → unsubscribe.
- `DataConnectionActor` health report extended with alarm-feed status (active feeds, last-event time, uncertain sources) via `ISiteHealthCollector`.
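The bounded-growth rule could be enforced with a small guard like the one below. The design only says that hitting the cap must be logged; rejecting new entries while still allowing updates to existing ones is an assumption of this sketch, and `BoundedAlarmStore` and its members are hypothetical names.

```csharp
#nullable enable
using System;
using System.Collections.Generic;

// Hypothetical per-source store with a configurable cap (e.g. read from SiteRuntimeOptions).
public sealed class BoundedAlarmStore<TAlarm>
{
    private readonly Dictionary<string, TAlarm> _alarms = new(StringComparer.Ordinal);
    private readonly int _cap;
    private readonly Action<string> _log;

    public BoundedAlarmStore(int cap, Action<string> log)
    {
        if (cap <= 0) throw new ArgumentOutOfRangeException(nameof(cap), "Cap must be positive");
        _cap = cap;
        _log = log;
    }

    public int Count => _alarms.Count;

    // Updates always succeed; a new entry is rejected (and logged) once the cap is reached.
    public bool TryUpsert(string sourceReference, TAlarm alarm)
    {
        if (!_alarms.ContainsKey(sourceReference) && _alarms.Count >= _cap)
        {
            _log($"Mirrored-alarm cap of {_cap} reached; not tracking '{sourceReference}'");
            return false;
        }
        _alarms[sourceReference] = alarm;
        return true;
    }

    public bool Remove(string sourceReference) => _alarms.Remove(sourceReference);
}
```

Passing the logger as a delegate keeps the sketch free of any logging framework; in the runtime this would presumably write to the site event log.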
**Testing**
- Unit: `AlarmConditionState` mapping (computed / OPC UA fields / MxAccess states); `NativeAlarmActor` snapshot-swap, transition handling, persistence rehydrate, uncertain-on-disconnect; `FlatteningService` native-source inherit/compose/override; semantic validation.
- Adapter: OPC UA event→transition + ConditionRefresh snapshot (fake client); MxGateway `AlarmFeedMessage`→transition + reconnect re-seed (fake client, existing fake patterns).
- Integration: end-to-end against the infra OPC UA server — **confirm the test OPC UA server exposes A&C; if not, add an alarm-capable test source or simulate.** MxGateway path mocked in CI unless a gateway-with-alarms is available.
- Seed: add a `NativeAlarmSource` binding to the `docker-env2` site-x MxGateway connection for manual verification.
---
## Affected components & documents
| Area | Changes |
|------|---------|
| Commons | New enums/records (`AlarmKind`, `AlarmShelveState`, `AlarmConditionState`, `NativeAlarmTransition`); extend `AlarmStateChanged`; new entities `TemplateNativeAlarmSource`, `InstanceNativeAlarmSourceOverride`; new DCL messages; `IAlarmSubscribableConnection` |
| Template Engine (#1) | `ResolvedNativeAlarmSource`, flattening resolution, semantic validation |
| Site Runtime (#3) | `NativeAlarmActor`, enriched `AlarmActor`, `InstanceActor` wiring, `NativeAlarmState` SQLite persistence, `SiteRuntimeOptions` cap |
| Data Connection Layer (#4) | `DataConnectionActor` alarm feed + routing; OPC UA A&C adapter; MxGateway `StreamAlarms` adapter |
| Communication (#5) | `sitestream.proto` `AlarmStateUpdate` enrichment; instance-snapshot enrichment |
| Configuration Database (#17) | EF configurations + migration for two new tables |
| Central UI (#9) | DebugView alarm table enrichment; Template editor + Instance Configure authoring panels |
| CLI (#19) | `native-alarm-source` commands |
| Health Monitoring (#11) | Alarm-feed status in `DataConnectionHealthReport` |
| Docs | `Component-DataConnectionLayer.md`, `Component-SiteRuntime.md`, `Component-TemplateEngine.md`, `Component-CentralUI.md`, `Component-CLI.md`, `Component-Communication.md`, `Component-ConfigurationDatabase.md`; CLAUDE.md design-decisions; README if needed |
## Out of scope (this pass)
- Acknowledging / shelving / suppressing from ScadaBridge (read-only mirror).
- Central alarm tables, alarm history/journal, central audit of alarm state.
- A dedicated operator-facing Alarm Summary page (DebugView only).
- Alarm-driven notifications or scripts off native alarms.
## Open items / risks
- **MxGateway alarm delivery** must work end-to-end via `StreamAlarms`. OtOpcUa notes record that the x86 COM worker historically delivered no native alarm events; we are trusting that the gateway now delivers them (per the chosen transport). Verify against a live gateway before integration sign-off.
- **Test OPC UA server A&C support** — confirm the infra OPC UA server exposes Alarms & Conditions; otherwise add/simulate an alarm-capable source for integration tests.
- **`ZB.MOM.WW.MxGateway.Client` version** — ensure the referenced package exposes `StreamAlarmsAsync`; bump if needed.