From dadebbe227efa0e9b8d819f3e0098026c51c8d46 Mon Sep 17 00:00:00 2001
From: Joseph Doherty
Date: Fri, 29 May 2026 15:14:01 -0400
Subject: [PATCH] docs(plans): native OPC UA & MxAccess GW alarms design

Read-only mirror of native alarm sources into a unified A&C-style state model (severity + active/acked/shelved/suppressed). Instance-bound source discovery, site-only SQLite state with live central query (no central tables), DebugView enrichment. OPC UA A&C events + ConditionRefresh and MxGateway session-less StreamAlarms via a new IAlarmSubscribableConnection seam routed connection-level by source reference; new NativeAlarmActor peer to computed AlarmActor.
---
 docs/plans/2026-05-29-native-alarms-design.md | 212 ++++++++++++++++++
 1 file changed, 212 insertions(+)
 create mode 100644 docs/plans/2026-05-29-native-alarms-design.md

diff --git a/docs/plans/2026-05-29-native-alarms-design.md b/docs/plans/2026-05-29-native-alarms-design.md
new file mode 100644
index 00000000..674e6b18
--- /dev/null
+++ b/docs/plans/2026-05-29-native-alarms-design.md
@@ -0,0 +1,212 @@
+# Native OPC UA & MxAccess Gateway Alarms — Design
+
+**Date:** 2026-05-29
+**Status:** Approved
+
+## Problem
+
+Today alarms are **computed at the site**: a `TemplateAlarm` defines a trigger (ValueMatch, RangeViolation, RateOfChange, HiLo, Expression); one `AlarmActor` per alarm evaluates attribute values and emits `AlarmStateChanged` carrying a bare `AlarmState { Active, Normal }` plus an integer `Priority` and (for HiLo) a `Level`. State is in-memory only — there is **no severity dimension, no acknowledgement, no shelve/suppress state, and no operator metadata** — and it surfaces only in the per-instance DebugView.
+
+Two data sources we connect to own their own alarm lifecycle and expose far richer state:
+
+- **OPC UA Alarms & Conditions (Part 9)** — the server raises/acks/clears `AlarmCondition` nodes with orthogonal sub-states (Active/Inactive, Acked/Unacked, Confirmed/Unconfirmed, Shelved, Suppressed) and a 1–1000 severity. The DCL OPC UA adapter currently subscribes only to the `Value` attribute.
+- **MxAccess Gateway** — already exposes a session-less `StreamAlarms` feed (`OnAlarmTransitionEvent`: raise/ack/clear/retrigger, severity, operator user + comment, category, description, current/limit value) plus `QueryActiveAlarms`. The DCL MxGateway adapter currently consumes only the `OnDataChange` event family.
+
+These are **mirrored** alarms — the source is the source of truth — which is a real divergence from the computed model. This design enriches the alarm tracking model to carry severity + ack/shelve/suppress state, and ingests native alarms from both sources.
+
+## Design Decisions
+
+| Decision | Choice |
+|----------|--------|
+| State model scope | **Unified** A&C-style state model for *all* alarms (computed + native) |
+| Interactivity | **Read-only mirror** — display source-reported state; no acking/shelving from ScadaBridge, no command relay, no operator identity captured by ScadaBridge (source-supplied operator user/comment *are* displayed for native alarms) |
+| Binding | Instance declares a `NativeAlarmSource` (connection + source ref); conditions under it are **discovered at runtime**, keyed by source reference |
+| State location | **Site-only**, persisted to SQLite (survives restart/failover); central **queries live** (snapshot + live stream); **no central tables, no central history** |
+| MxGateway transport | Gateway session-less `StreamAlarms` feed |
+| OPC UA transport | Alarms & Conditions events + `ConditionRefresh` snapshot |
+| Site actor structure | New `NativeAlarmActor` child of `InstanceActor`, peer to computed `AlarmActor`s (Approach 1) |
+| Authoring | Central UI design-time panels (Template editor + Instance Configure) **and** CLI |
+| Runtime UI | Enrich the per-instance DebugView alarm table only (no new operator page) |
+
+### Trade-offs accepted
+
+- **No central audit trail of alarms** (who acked, history). Acceptable because the source systems own ack and retain their own alarm history; ScadaBridge is a read-only window. If audit of alarm state is later wanted, a central mirror following the Site Call Audit (#22) pattern can be added without disturbing this design.
+- **Read-only** means MxGateway/OPC UA acknowledgements happen in the source's own tools; ScadaBridge reflects them.
+
+---
+
+## Section 1 — Unified state model & wire contracts
+
+**New Commons types** (`Types/Enums/`, `Types/Alarms/`):
+
+```
+enum AlarmKind { Computed, NativeOpcUa, NativeMxAccess }
+enum AlarmShelveState { Unshelved, OneShotShelved, TimedShelved, PermanentShelved }
+
+record AlarmConditionState(
+    bool Active,              // Active vs Inactive
+    bool Acknowledged,        // Acked / Unacked
+    bool? Confirmed,          // null = not a confirmable condition (OPC UA optional)
+    AlarmShelveState Shelve,
+    bool Suppressed,
+    int Severity)             // 0–1000, unified scale
+```
+
+The OPC UA Part 9 sub-conditions are **orthogonal** and MxAccess's `ACTIVE / ACTIVE_ACKED / INACTIVE` maps cleanly onto them, so they are modeled as independent flags (the UI rolls them up for display).
+
+**`AlarmStateChanged` is extended additively** (existing fields kept for back-compat; new fields defaulted):
+
+| New field | Default | Notes |
+|-----------|---------|-------|
+| `Kind` | `Computed` | discriminator |
+| `Condition` | computed from existing | the `AlarmConditionState` above |
+| `SourceReference` | `""` | native key, e.g. `"Tank01.Level.HiHi"` |
+| `AlarmTypeName` | `""` | native, e.g. `"AnalogLimitAlarm.HiHi"` |
+| `Category` | `""` | native taxonomy |
+| `OperatorUser` | `""` | native ack metadata (display-only) |
+| `OperatorComment` | `""` | native ack metadata (display-only) |
+| `OriginalRaiseTime` | `null` | native |
+| `CurrentValue` | `""` | native, display |
+| `LimitValue` | `""` | native, display |
+
+**Identity / key:** computed alarms key by `AlarmName` (canonical); native alarms key by `SourceReference` (stable across transitions). `Kind` discriminates. The existing `AlarmName` field carries the source reference's display form for native rows so existing consumers don't break.
+
+**Source → `AlarmConditionState` mapping:**
+- **Computed:** `Active = state==Active`, `Acknowledged = true` (auto), `Confirmed = null`, `Shelve = Unshelved`, `Suppressed = false`, `Severity = Priority`, `Level` retained for HiLo.
+- **OPC UA A&C:** read `ActiveState`, `AckedState`, `ConfirmedState`, `ShelvingState`, `SuppressedState`, `Severity` from the condition's event fields.
+- **MxAccess:** `ACTIVE → (Active=t, Ack=f)`, `ACTIVE_ACKED → (Active=t, Ack=t)`, `INACTIVE → (Active=f)`; `Severity` from the gateway's remapped 0–1000; shelve/suppress default (gateway proto doesn't surface them).
+
+**gRPC `AlarmStateUpdate` (`sitestream.proto`)** gets the same fields appended as new field numbers (additive — never renumber/remove): `kind`, `active`, `acknowledged`, `confirmed`, `shelve_state`, `suppressed`, `source_reference`, `alarm_type_name`, `category`, `operator_user`, `operator_comment`, `original_raise_time`, `current_value`, `limit_value`. Existing `state`, `priority`, `level`, `message` stay for compatibility.
+
+---
+
+## Section 2 — Configuration, binding & deployment
+
+This mirrors how template **attributes bind to a data source** today (template declares, instance overrides the concrete reference).
+
+**New Commons entities**
+- `TemplateNativeAlarmSource` (`Entities/Templates/`): `Id`, `TemplateId`, `Name` (unique within template), `Description?`, `ConnectionName`, `SourceReference` (OPC UA SourceNode/notifier nodeId, or MxAccess object/area), `ConditionFilter?` (null = mirror *all* conditions under the source), `IsLocked`, `IsInherited` — same lock/inherit bookkeeping as `TemplateAlarm`.
+- `InstanceNativeAlarmSourceOverride` (`Entities/Instances/`): `Id`, `InstanceId`, `SourceCanonicalName`, `ConnectionNameOverride?`, `SourceReferenceOverride?`, `ConditionFilterOverride?`; unique `(InstanceId, SourceCanonicalName)`. `SourceReference` is the field that varies per physical instance, so per-instance override is the common case.
+
+**Flattening** (`FlatteningService`, `FlattenedConfiguration`)
+- New `ResolvedNativeAlarmSource { CanonicalName, ConnectionName, SourceReference, ConditionFilter?, Source }`, resolved through the same steps as `ResolvedAlarm`: inherited → composed (path-qualified `[Module].[Name]`) → instance overrides applied.
+- **Pre-deployment semantic validation** (extends existing checks): `ConnectionName` resolves to a real site `DataConnection`; that connection's protocol is alarm-capable (`OpcUa` or `MxGateway`); `SourceReference` non-empty; canonical-name collision check.
+
+**ConfigurationDatabase (EF + migration)**
+- `TemplateNativeAlarmSourceConfiguration` → table `Templates.NativeAlarmSources`, unique `(TemplateId, Name)`, FK cascade.
+- `InstanceNativeAlarmSourceOverrideConfiguration` → table `InstanceNativeAlarmSourceOverrides`, unique `(InstanceId, SourceCanonicalName)`, FK cascade.
+- One migration adds both tables (auto-apply in dev per existing convention).
+
+**Deployment** — `FlattenedConfiguration` carries `ResolvedNativeAlarmSource[]`, deployed alongside `ResolvedAlarm[]` on the existing artifact path. Site Runtime consumes them when building the instance actor hierarchy. All-or-nothing per-instance apply unchanged.
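
The instance-override step of the flattening described above can be sketched roughly as follows. This is a minimal illustration, assuming a null override field means "keep the template value"; the names `TemplateSource`, `InstanceOverride`, `ResolvedSource` and `NativeSourceResolver` are hypothetical stand-ins rather than the project's actual types, and the inherit/compose steps are left out.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical, trimmed-down shapes for illustration only; field names follow
// the entity lists in this section, but these are not the real entities.
public record TemplateSource(string CanonicalName, string ConnectionName,
    string SourceReference, string? ConditionFilter);

public record InstanceOverride(string SourceCanonicalName, string? ConnectionNameOverride,
    string? SourceReferenceOverride, string? ConditionFilterOverride);

public record ResolvedSource(string CanonicalName, string ConnectionName,
    string SourceReference, string? ConditionFilter);

public static class NativeSourceResolver
{
    // Assumption: a null override field means "keep the template value".
    public static ResolvedSource Resolve(TemplateSource template, InstanceOverride? ov)
    {
        // No override, or an override addressed to a different source: pass through.
        if (ov is null || ov.SourceCanonicalName != template.CanonicalName)
            return new ResolvedSource(template.CanonicalName, template.ConnectionName,
                template.SourceReference, template.ConditionFilter);

        return new ResolvedSource(
            template.CanonicalName,
            ov.ConnectionNameOverride ?? template.ConnectionName,
            ov.SourceReferenceOverride ?? template.SourceReference,
            ov.ConditionFilterOverride ?? template.ConditionFilter);
    }

    // Two of the pre-deployment checks: a non-empty source reference and an
    // alarm-capable protocol. The protocol is passed as a plain string here.
    public static IReadOnlyList<string> Validate(ResolvedSource source, string connectionProtocol)
    {
        var errors = new List<string>();
        if (string.IsNullOrWhiteSpace(source.SourceReference))
            errors.Add($"{source.CanonicalName}: SourceReference is empty");
        if (connectionProtocol != "OpcUa" && connectionProtocol != "MxGateway")
            errors.Add($"{source.CanonicalName}: protocol '{connectionProtocol}' is not alarm-capable");
        return errors;
    }
}
```

One consequence of the null-means-inherit assumption is that an instance cannot use a null `ConditionFilterOverride` to widen a filtered template source back to "mirror all"; if that case matters, the override would need an explicit "clear" marker.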
+
+**Authoring (Central UI + CLI)**
+- Template editor: a "Native Alarm Sources" subsection (name, connection dropdown filtered to alarm-capable protocols, source reference, optional filter).
+- Instance Configure: override connection/source-ref/filter per instance, like attribute data-source overrides.
+- CLI: `template native-alarm-source add/list/remove`, `instance native-alarm-source set/clear`.
+
+---
+
+## Section 3 — DCL ingestion & the two adapters
+
+**Capability seam** (mirrors the existing `IBrowsableDataConnection` pattern):
+
+```
+interface IAlarmSubscribableConnection {
+    Task SubscribeAlarmsAsync(string sourceReference, string? conditionFilter, AlarmTransitionCallback cb);
+    Task UnsubscribeAlarmsAsync(string subscriptionId);
+}
+delegate void AlarmTransitionCallback(NativeAlarmTransition t);
+```
+
+**Protocol-neutral transition** (`Commons/Types/Alarms/`):
+
+```
+enum AlarmTransitionKind { Snapshot, SnapshotComplete, Raise, Acknowledge, Clear, Retrigger, StateChange }
+
+record NativeAlarmTransition(
+    string SourceReference, string SourceObjectReference, string AlarmTypeName,
+    AlarmTransitionKind Kind, AlarmConditionState Condition,
+    string Category, string Description, string Message,
+    string OperatorUser, string OperatorComment,
+    DateTimeOffset? OriginalRaiseTime, DateTimeOffset TransitionTime,
+    string CurrentValue, string LimitValue)
+```
+
+`Snapshot`/`SnapshotComplete` carry the initial active-condition replay so the consumer re-seeds a source's state on every (re)subscribe — this is how reconnect reconciliation works without central storage.
+
+**Connection-level transport + source-ref routing.** Although binding is *declared* per-instance, the subscription is naturally **connection-level** (OPC UA wants one event subscription; MxGateway `StreamAlarms` is one session-less feed). `DataConnectionActor` opens **one** alarm feed per connection and maintains `_alarmSubscribers: SourceObjectRef → set`, routing each transition to matching instances.
+
+New messages (`Messages/DataConnection/`): `SubscribeAlarmsRequest`/`Response`, `UnsubscribeAlarmsRequest`, internal `NativeAlarmTransitionReceived(conn, transition, generation)`, forwarded as `NativeAlarmTransitionUpdate(conn, transition)`. Subscribe/unsubscribe obey the existing **Become/Stash** lifecycle (stashed while Connecting/Reconnecting, replayed on Connected). The stale-callback **generation guard** and once-only disconnect guard apply unchanged. On disconnect the actor emits a per-source `NativeAlarmSourceUnavailable` so consumers mark mirrored alarms *uncertain* rather than clearing them.
+
+**OPC UA A&C adapter** (`OpcUaDataConnection` / `RealOpcUaClient`)
+- One event `MonitoredItem` (`AttributeId = EventNotifier`) on the Server object (i=2253) or configured notifier, with an `EventFilter`: SelectClauses for EventId, EventType, SourceNode, SourceName, Time, Message, Severity + `ConditionType`/`AcknowledgeableConditionType`/`AlarmConditionType` state fields (Acked/Confirmed/Active/Shelving/Suppressed). Optional WhereClause scoping to the union of bound SourceNodes.
+- Map event fields → `NativeAlarmTransition`; derive `Kind` from which sub-state changed.
+- Call `ConditionRefresh` on (re)subscribe → emit the `Snapshot`/`SnapshotComplete` sequence.
+
+**MxGateway adapter** (`MxGatewayDataConnection` / `RealMxGatewayClient`)
+- Open session-less `StreamAlarms` (optional `alarm_filter_prefix` from bound source refs). Map `AlarmFeedMessage`: `active_alarm` → `Snapshot`, `snapshot_complete` → `SnapshotComplete`, `transition (OnAlarmTransitionEvent)` → mapped transition (RAISE/ACK/CLEAR/RETRIGGER, severity, operator user+comment, category, description, raise/transition times, current/limit value).
+- Resumable stream; on transport fault re-open (existing `RaiseDisconnected` once-only guard) → fresh snapshot re-seeds.
+- Uses `ZB.MOM.WW.MxGateway.Client`'s `StreamAlarmsAsync` (already exercised by OtOpcUa's `GatewayGalaxyAlarmFeed`); bump the NuGet package if the referenced version predates it.
+
+---
+
+## Section 4 — Site runtime, central query, UI, errors & testing
+
+**`NativeAlarmActor` (new)**
+- Child of `InstanceActor`, one per `ResolvedNativeAlarmSource` (named `native-alarm-{canonicalName}`). On `PreStart` sends `SubscribeAlarmsRequest` for its `(ConnectionName, SourceReference, ConditionFilter)`. Holds `_alarms: Dictionary` (discovered conditions + `AlarmConditionState` + metadata).
+- On `NativeAlarmTransitionUpdate`: `Snapshot…SnapshotComplete` → buffer then **atomic swap** the source's set (drop conditions absent from the snapshot, emit diffs — no flicker); `Raise/Ack/Clear/Retrigger/StateChange` → update entry, last-write-wins by `TransitionTime` (ignore older). Each change emits an enriched `AlarmStateChanged` to `InstanceActor` → existing stream path.
+- **Retention:** keep an entry while `Active` OR `Unacked`; once fully normal (`Inactive` AND `Acked`) emit a final return-to-normal and drop it.
+- On `NativeAlarmSourceUnavailable`: mark its alarms **uncertain** (snapshot flag) rather than clearing; re-seed from the reconnect snapshot.
+- **Persistence:** site-SQLite table `NativeAlarmState (InstanceUniqueName, SourceCanonicalName, SourceReference, serialized condition+metadata, LastTransitionTime)`. Rehydrate on `PreStart` (so central can query immediately after restart), then reconcile against the fresh snapshot. Reset on redeployment, like static attribute writes.
+- **Supervision:** coordinator-style child → **Resume**. A bad source ref / subscribe failure logs to the site event log (`alarm`), reports unhealthy, and is retried periodically (same spirit as tag-resolution retry) without crashing the instance.
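
The last-write-wins and retention rules in the list above can be illustrated with a small, actor-free sketch. It assumes that "ignore older" means a strictly earlier `TransitionTime` is dropped while an equal one is applied; `Condition`, `Transition` and `MirroredAlarmTable` are hypothetical names for this example, not the planned `NativeAlarmActor` internals.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical minimal shapes: only the fields the two rules need.
public record Condition(bool Active, bool Acknowledged);
public record Transition(string SourceReference, Condition Condition, DateTimeOffset TransitionTime);

public sealed class MirroredAlarmTable
{
    private readonly Dictionary<string, Transition> _alarms = new();

    public int Count => _alarms.Count;
    public bool Contains(string sourceReference) => _alarms.ContainsKey(sourceReference);

    // Returns true when the transition was applied, false when it was ignored as stale.
    public bool Apply(Transition t)
    {
        // Last-write-wins: drop anything strictly older than what is already held.
        if (_alarms.TryGetValue(t.SourceReference, out var current)
            && t.TransitionTime < current.TransitionTime)
            return false;

        // Retention: keep while Active OR Unacked; drop once Inactive AND Acked.
        bool retain = t.Condition.Active || !t.Condition.Acknowledged;
        if (retain)
            _alarms[t.SourceReference] = t;
        else
            _alarms.Remove(t.SourceReference);
        return true;
    }
}
```

A caveat with this simple form: once an entry has been dropped, a late out-of-order transition for the same key has nothing to compare against and would be re-added. A real implementation might keep a short-lived tombstone with the last `TransitionTime` to cover that case.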
+
+**Computed `AlarmActor`:** no logic change — populate `AlarmConditionState` on emit (`Active`, `Acknowledged=true`, `Severity=Priority`, `Level` retained, `Kind=Computed`).
+
+**`InstanceActor`:** builds `NativeAlarmActor`s from `ResolvedNativeAlarmSource[]`; native `AlarmStateChanged` flows through the existing `_alarmStates`/`_alarmTimestamps` + `_streamManager.PublishAlarmStateChanged` path (state dictionaries extended to carry the enriched shape); the instance snapshot includes native alarms.
+
+**Streaming + central query (no central tables)**
+- Live: enriched `AlarmStateChanged` → `SiteStreamManager` → enriched gRPC `AlarmStateUpdate` → DebugView, as today.
+- Initial snapshot: the existing **ClusterClient instance-snapshot** request (DebugView's seed) is extended to include native alarms in the unified shape. Large snapshots reuse existing per-subscriber buffering / frame-size guard (the browse-cap precedent); chunk if needed.
+
+**Central UI — DebugView enrichment** (+ Section 2 authoring panels)
+- Alarm table gains: Severity, a composite condition badge (Active/Acked/Shelved/Suppressed), a Kind badge (computed vs native), Source reference, Alarm type, Category, Operator/comment (tooltip), Original raise time, Current/Limit value (tooltip). Computed rows show severity=priority, auto-acked. Built with the `frontend-design` skill, Bootstrap-only custom components.
+
+**Error handling / edge cases**
+- Connection loss → uncertain, not cleared; reconnect snapshot reconciles. Source ref absent from snapshot → cleared. Severity normalized to 0–1000. **Bounded growth:** configurable per-source mirrored-alarm cap in `SiteRuntimeOptions`; when hit, **log it** (no silent truncation). Disabled/deleted instance → unsubscribe.
+- `DataConnectionActor` health report extended with alarm-feed status (active feeds, last-event time, uncertain sources) via `ISiteHealthCollector`.
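
The reconnect reconciliation rules above (entries absent from the snapshot are cleared, severity is normalized to 0–1000) can be sketched as a single swap over a buffered snapshot. This is a simplified illustration, assuming the mirrored state is reduced to a severity per source reference; `SnapshotReconciler` is a hypothetical helper, not part of the design's type list.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class SnapshotReconciler
{
    // Replaces `current` with the buffered snapshot in one step and returns the
    // source references that were cleared because the snapshot no longer lists them.
    public static List<string> Swap(Dictionary<string, int> current,
        IReadOnlyDictionary<string, int> snapshot)
    {
        // Materialize the cleared keys before mutating `current`.
        var cleared = current.Keys
            .Where(key => !snapshot.ContainsKey(key))
            .OrderBy(key => key, StringComparer.Ordinal)
            .ToList();

        current.Clear();
        foreach (var (key, severity) in snapshot)
            current[key] = Math.Clamp(severity, 0, 1000); // normalize to the unified scale

        return cleared;
    }
}
```

Because the swap only runs once the whole snapshot has been buffered, a consumer never observes a half-applied state, which is the "no flicker" property the design asks for.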
+
+**Testing**
+- Unit: `AlarmConditionState` mapping (computed / OPC UA fields / MxAccess states); `NativeAlarmActor` snapshot-swap, transition handling, persistence rehydrate, uncertain-on-disconnect; `FlatteningService` native-source inherit/compose/override; semantic validation.
+- Adapter: OPC UA event→transition + ConditionRefresh snapshot (fake client); MxGateway `AlarmFeedMessage`→transition + reconnect re-seed (fake client, existing fake patterns).
+- Integration: end-to-end against the infra OPC UA server — **confirm the test OPC UA server exposes A&C; if not, add an alarm-capable test source or simulate.** MxGateway path mocked in CI unless a gateway-with-alarms is available.
+- Seed: add a `NativeAlarmSource` binding to the `docker-env2` site-x MxGateway connection for manual verification.
+
+---
+
+## Affected components & documents
+
+| Area | Changes |
+|------|---------|
+| Commons | New enums/records (`AlarmKind`, `AlarmShelveState`, `AlarmConditionState`, `NativeAlarmTransition`); extend `AlarmStateChanged`; new entities `TemplateNativeAlarmSource`, `InstanceNativeAlarmSourceOverride`; new DCL messages; `IAlarmSubscribableConnection` |
+| Template Engine (#1) | `ResolvedNativeAlarmSource`, flattening resolution, semantic validation |
+| Site Runtime (#3) | `NativeAlarmActor`, enriched `AlarmActor`, `InstanceActor` wiring, `NativeAlarmState` SQLite persistence, `SiteRuntimeOptions` cap |
+| Data Connection Layer (#4) | `DataConnectionActor` alarm feed + routing; OPC UA A&C adapter; MxGateway `StreamAlarms` adapter |
+| Communication (#5) | `sitestream.proto` `AlarmStateUpdate` enrichment; instance-snapshot enrichment |
+| Configuration Database (#17) | EF configurations + migration for two new tables |
+| Central UI (#9) | DebugView alarm table enrichment; Template editor + Instance Configure authoring panels |
+| CLI (#19) | `native-alarm-source` commands |
+| Health Monitoring (#11) | Alarm-feed status in `DataConnectionHealthReport` |
+| Docs | `Component-DataConnectionLayer.md`, `Component-SiteRuntime.md`, `Component-TemplateEngine.md`, `Component-CentralUI.md`, `Component-CLI.md`, `Component-Communication.md`, `Component-ConfigurationDatabase.md`; CLAUDE.md design-decisions; README if needed |
+
+## Out of scope (this pass)
+
+- Acknowledging / shelving / suppressing from ScadaBridge (read-only mirror).
+- Central alarm tables, alarm history/journal, central audit of alarm state.
+- A dedicated operator-facing Alarm Summary page (DebugView only).
+- Alarm-driven notifications or scripts off native alarms.
+
+## Open items / risks
+
+- **MxGateway alarm delivery** must work end-to-end via `StreamAlarms`. OtOpcUa notes record the x86 COM worker historically delivered no native alarm events; we are trusting that the gateway now delivers (per the chosen transport). Verify against a live gateway before integration sign-off.
+- **Test OPC UA server A&C support** — confirm the infra OPC UA server exposes Alarms & Conditions; otherwise add/simulate an alarm-capable source for integration tests.
+- **`ZB.MOM.WW.MxGateway.Client` version** — ensure the referenced package exposes `StreamAlarmsAsync`; bump if needed.