diff --git a/docs/plans/2026-06-14-galaxy-phase-b-native-alarms-design.md b/docs/plans/2026-06-14-galaxy-phase-b-native-alarms-design.md
new file mode 100644
index 00000000..eaa9893b
--- /dev/null
+++ b/docs/plans/2026-06-14-galaxy-phase-b-native-alarms-design.md
@@ -0,0 +1,265 @@
+# Galaxy Phase B — Native Alarms on the Equipment-Tag Path — Design
+
+**Date:** 2026-06-14
+**Status:** Approved (brainstorming) — ready for implementation planning
+**Scope:** Phase B of the Galaxy standard-driver design
+(`docs/plans/2026-06-12-galaxy-standard-driver-design.md`). Restores native
+`IAlarmSource` alarms in the post-Phase-A Equipment model, delivered over a new
+driver→server alarm seam that mirrors the scripted-alarm seam.
+**Builds on:** master (Milestone 1b complete — equipment-tag live values + write
+pipeline shipped). Feature branch off master.
+
+## Goal
+
+A Galaxy equipment `Tag` bound to `GalaxyMxGateway` and marked as a **native
+alarm** materializes a real OPC UA Part 9 `AlarmConditionState` under its
+equipment folder, and the driver's live `IAlarmSource.OnAlarmEvent` transitions
+drive that condition (active / severity / message / ack-state) and fan out to the
+`alerts` topic (AdminUI `/alerts` + historian), exactly as scripted alarms do
+today. **No EF/schema migration.**
+
+## Why this is needed (the gap, grounded in current code)
+
+Phase A retired the `SystemPlatform` mirror (`MaterialiseGalaxyTags` +
+`GenericDriverNodeManager`), which was the **only** path that wired native
+alarms. Three concrete consequences, all verified against current code:
+
+1. **No cross-actor transport for native alarms.** In the fused-host actor model,
+   `DriverInstanceActor` subscribes the driver's `ISubscribable.OnDataChange`
+   (value seam) but **never** subscribes `IAlarmSource.OnAlarmEvent`. No actor
+   message carries an alarm transition. `GenericDriverNodeManager`
+   (`src/Core/.../OpcUa/GenericDriverNodeManager.cs`) subscribed `OnAlarmEvent`
+   **in-process**; it is now orphaned (only tests construct it).
+2. **No condition-update sink survives.** There is no production
+   `IAlarmConditionSink.OnTransition` implementation anymore — only the interface
+   (`IAddressSpaceBuilder.cs:115`) and an empty CLI stub
+   (`TwinCAT.Cli/.../BrowseCommand.cs:150`). The real sink lived in the mirror's
+   builder and died with it. The surviving condition path is the **snapshot**
+   path: `OpcUaPublishActor.AlarmStateUpdate` → `OtOpcUaNodeManager.WriteAlarmCondition`.
+3. **Authored tags carry no alarm flag.** Native-alarm metadata came from driver
+   discovery (`DriverAttributeInfo.IsAlarm`), but Phase 7 composes from authored
+   `Tag` rows only. `EquipmentTagPlan` carries `{TagId, EquipmentId,
+   DriverInstanceId, FolderPath, Name, DataType, FullName, Writable}` — **no
+   alarm field**. The `Tag` entity has no alarm column.
+
+**Decisive design fact:** `AlarmEventArgs` (the `OnAlarmEvent` payload) does
+**not** carry the transition kind. `GalaxyAlarmTransition` has an explicit
+`GalaxyAlarmTransitionKind {Raise, Acknowledge, Clear, Retrigger}`, but
+`GalaxyDriver.OnAlarmFeedTransition` drops it when building `AlarmEventArgs`. A
+consumer therefore cannot tell raise-from-clear. Phase B fixes this at the source
+(additive contract change) rather than guessing.
+
+## Locked decisions (from brainstorming)
+
+| Decision | Choice |
+|---|---|
+| How the server learns a tag is a native alarm | **TagConfig JSON (no migration).** The alarm intent rides in the schemaless `TagConfig` blob — `{"FullName":"…","alarm":{"alarmType":"OffNormalAlarm","severity":500}}` — parsed byte-parity in `Phase7Composer` + `DeploymentArtifact`, exactly as `FullName` is today. No EF/schema change. |
+| Phase B scope line | **Live condition + alerts fan-out; defer device-ack.** Trip → Part 9 condition + `/alerts` + historian (Primary-gated). A client Acknowledge updates the **local** condition state; routing it back to the driver's `IAlarmSource.AcknowledgeAsync` (→ AVEVA) is a deferred follow-up. |
+| Transition→state model | **Snapshot projection.** Reuse the scripted-alarm sink path (`WriteAlarmCondition` + delta-gate + `ReportConditionEvent` + Part-9 ack mechanics) unchanged; a new pure projector derives an `AlarmConditionSnapshot` from each transition. The retired `OnTransition` sink path is **not** resurrected. |
+| Transition-kind plumbing | **Additive contract change.** Add `AlarmTransitionKind Kind` to `AlarmEventArgs` (default `Unspecified`); `GalaxyDriver` populates it from the kind it already has. A record default keeps every other `IAlarmSource` implementer compiling. |
+
+### Approaches considered and rejected
+- **Resurrect the `IAlarmConditionSink`/`OnTransition` builder path.** Rejected:
+  reintroduces a second condition-state mechanism alongside the live, delta-gated,
+  ack-wired `WriteAlarmCondition` path. The snapshot path is the one that works
+  for scripted alarms today.
+- **Auto-discover the alarm flag at deploy** (query driver `DriverAttributeInfo`
+  during composition). Rejected (this session): couples Phase 7 composition to a
+  cross-process discovery query; larger and higher-risk than reading the authored
+  blob. The TagConfig route matches the protocol-linkage precedent.
+- **Extend the `Tag` entity with `IsAlarm`/`AlarmType` columns.** Rejected:
+  requires a Configuration/EF migration, out of scope for this phase.
+- **Infer raise/clear from `AlarmEventArgs` heuristically.** Rejected: the record
+  has no active flag; inference is fragile and wrong. Fix the contract instead.
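A minimal sketch of the TagConfig decision in the table above, written in Python for brevity rather than the project's C#. It assumes the defaults listed under WS-2 (`alarmType` falls back to `"AlarmCondition"`, `severity` to `500`) and treats a wrongly typed field as malformed, which is an assumption and not something the design states. `extract_tag_alarm` and the sample `FullName` used with it are hypothetical stand-ins for `ExtractTagAlarm` and an authored tag reference.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class EquipmentTagAlarmInfo:
    """Illustrative mirror of record EquipmentTagAlarmInfo(AlarmType, Severity)."""
    alarm_type: str
    severity: int


def extract_tag_alarm(tag_config: Optional[str]) -> Optional[EquipmentTagAlarmInfo]:
    """Return the alarm intent carried in a TagConfig blob, or None.

    Never raises: absent or malformed input yields None so the caller can
    fall back to materialising a plain value variable.
    """
    try:
        root = json.loads(tag_config) if tag_config else None
    except (TypeError, ValueError, RecursionError):
        return None  # not parseable as JSON
    if not isinstance(root, dict):
        return None
    alarm = root.get("alarm")
    if not isinstance(alarm, dict):
        return None  # no alarm object: this is a plain variable tag
    alarm_type = alarm.get("alarmType", "AlarmCondition")
    severity = alarm.get("severity", 500)
    # Assumption: a wrongly typed field counts as malformed (bool is excluded
    # explicitly because it is a subclass of int in Python).
    if not isinstance(alarm_type, str):
        return None
    if isinstance(severity, bool) or not isinstance(severity, int):
        return None
    return EquipmentTagAlarmInfo(alarm_type, severity)
```

Returning `None` for every failure mode lets both call sites share one code path, which is what makes a byte-parity round-trip test between the composer and the artifact straightforward to write.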
+
+## Architecture (target end-state)
+
+```
+Author Tag{Equipment, GalaxyMxGateway, TagConfig={FullName, alarm:{alarmType,severity}}}
+  → Phase7Composer / DeploymentArtifact (ExtractTagAlarm, byte-parity)
+  → EquipmentTagPlan.Alarm != null
+  → Phase7Applier.MaterialiseEquipmentTags
+      • Alarm == null : SafeEnsureVariable (today's value variable, unchanged)
+      • Alarm != null : SafeMaterialiseAlarmCondition(nodeId, equip, name, alarmType, severity)
+          → real Part 9 AlarmConditionState under the equipment folder (reused)
+
+Runtime (the NEW seam — mirrors the scripted-alarm seam):
+  GalaxyDriver.OnAlarmEvent (AlarmEventArgs{SourceNodeId=FullName, Kind, Severity, …})
+    → DriverInstanceActor (subscribes OnAlarmEvent; marshals via Self.Tell)
+    → AttributeAlarmPublished(DriverInstanceId, AlarmEventArgs) [NEW msg, parallels AttributeValuePublished]
+    → DriverHostActor.ForwardNativeAlarm
+        • resolve (DriverInstanceId, SourceNodeId) → condition NodeId(s) via _alarmNodeIdByDriverRef [NEW map]
+        • NativeAlarmProjector.Project(nodeId, args) → AlarmConditionSnapshot [NEW pure helper]
+        • Tell OpcUaPublishActor.AlarmStateUpdate(nodeId, snapshot, ts) (UNGATED — warm on all nodes) [REUSED]
+        • Publish AlarmTransitionEvent → `alerts` topic (Primary-gated — reuse _localRole) [REUSED]
+    → OtOpcUaNodeManager.WriteAlarmCondition → delta-gate → ReportConditionEvent (real Part 9 event) [REUSED]
+```
+
+`OpcUaPublishActor` and `OtOpcUaNodeManager` are **unchanged** — Phase B reuses
+`AlarmStateUpdate` → `WriteAlarmCondition` verbatim.
+
+## Components / workstreams
+
+### WS-1 — Transition-kind contract (additive, Core.Abstractions)
+- New enum `AlarmTransitionKind { Unspecified = 0, Raise, Acknowledge, Clear, Retrigger }`
+  (mirrors the internal `GalaxyAlarmTransitionKind`).
+- `AlarmEventArgs` gains a trailing `AlarmTransitionKind Kind = AlarmTransitionKind.Unspecified`
+  param (record default → all existing implementers compile untouched).
+- `GalaxyDriver.OnAlarmFeedTransition` (`GalaxyDriver.cs:~1128-1167`) populates
+  `Kind` from `transition.TransitionKind`.
+
+### WS-2 — Alarm intent in TagConfig + compose plan (no EF)
+- New never-throw `ExtractTagAlarm(string tagConfig) → EquipmentTagAlarmInfo?`
+  (parses the optional `"alarm"` object: `alarmType` default `"AlarmCondition"`,
+  `severity` default `500`; absent/malformed → null). Lives next to
+  `ExtractTagFullName`; **used by both** `Phase7Composer` and `DeploymentArtifact`.
+- `EquipmentTagPlan` (`Phase7Composer.cs:~80`) gains `EquipmentTagAlarmInfo? Alarm`
+  (`record EquipmentTagAlarmInfo(string AlarmType, int Severity)`); null ⇒ plain
+  variable. Populated in `Phase7Composer` `Select(...)` **and**
+  `DeploymentArtifact.BuildEquipmentTagPlans` — **byte-parity invariant**, covered
+  by a round-trip test.
+
+### WS-3 — Materialize the condition node (reuse)
+- `Phase7Applier.MaterialiseEquipmentTags` (`:~162-199`) branches per tag:
+  `tag.Alarm is not null` → `SafeMaterialiseAlarmCondition(nodeId, parentEquipment,
+  tag.Name, tag.Alarm.AlarmType, tag.Alarm.Severity)` (the **same** method scripted
+  alarms use; condition NodeId = the tag's equipment-scoped NodeId); else →
+  `SafeEnsureVariable(...)` (today). `RebuildAddressSpace` already clears
+  `_alarmConditions`, so redeploy teardown is covered.
+
+### WS-4 — The driver→server alarm seam (the new plumbing)
+- **`DriverInstanceActor`**: on connect, if `_driver is IAlarmSource src`, subscribe
+  `src.OnAlarmEvent += handler`; the handler marshals to the actor thread via
+  `Self.Tell(new NativeAlarmRaised(e))` (mirrors the `_dataChangeHandler` pattern,
+  `:409/:456`). `Receive` → `Context.Parent.Tell(new
+  AttributeAlarmPublished(DriverInstanceId, Args))`. Unsubscribe on
+  disconnect/teardown (mirror the `OnDataChange` unsubscribe). New messages
+  `NativeAlarmRaised` (internal) + `AttributeAlarmPublished` (parallels
+  `AttributeValuePublished`, `:65`). Phase B follows the mirror's model: subscribe
+  the event and let the server filter by `SourceNodeId`; it does **not** drive
+  `SubscribeAlarmsAsync` (Galaxy's feed auto-starts session-less in
+  `InitializeAsync` and fires `OnAlarmEvent` regardless). Driving
+  `SubscribeAlarmsAsync` from the materialized alarm-ref set, for drivers that gate
+  on it, is a noted follow-up.
+- **`DriverHostActor`**: build `_alarmNodeIdByDriverRef: (DriverInstanceId,
+  FullName) → HashSet` from equipment-tag plans where `Alarm != null`
+  (alongside the existing `_nodeIdByDriverRef`, in the same apply pass). Add
+  `Receive` in the steady + applying states. Handler
+  `ForwardNativeAlarm`: resolve nodeIds (unknown ref → drop silently, mirror
+  behavior); per nodeId `NativeAlarmProjector.Project(...)` → snapshot →
+  `_publishActor.Tell(AlarmStateUpdate(nodeId, snapshot, ts))` **ungated**; then
+  publish `AlarmTransitionEvent` to `alerts` **Primary-gated** via the existing
+  `_localRole` the write-routing already tracks.
+- **`NativeAlarmProjector`** (new pure class; unit-tested): per-condition-NodeId
+  prior-state `(Active, Acked, Severity, Message)`; `Project(nodeId, AlarmEventArgs)
+  → AlarmConditionSnapshot` by `Kind`:
+  - `Raise`/`Retrigger` → `Active=true, Acked=false`, severity+message from event.
+  - `Acknowledge` → `Acked=true` (keep prior Active), carry `OperatorComment`.
+  - `Clear` → `Active=false` (keep prior Acked).
+  - `Unspecified` → keep prior Active/Acked, refresh severity+message.
+  - `Enabled=true`, `Confirmed=true`, `Shelving=Unshelved` (shelving is a
+    server/local concern). Severity: `AlarmSeverity` 4-bucket → 1..1000 ushort
+    (Low→200, Medium→500, High→700, Critical→900).
+
+### WS-5 — Historian / alerts parity (reuse)
+- The `AlarmTransitionEvent` published in WS-4 is the same contract
+  `ScriptedAlarmHostActor` publishes; `HistorianAdapterActor` + AdminUI `/alerts`
+  consume it unchanged. Populate `AlarmId` = condition NodeId, `EquipmentPath` +
+  `AlarmName` from the plan, `TransitionKind` = `Kind.ToString()`, `AlarmTypeName`
+  = the configured OPC UA alarm type, `User`/`Comment` from the event.
+
+## Data type / severity mapping
+- OPC UA alarm subtype string → SDK type via the existing
+  `CreateAlarmConditionOfType` (`OffNormalAlarm`/`DiscreteAlarm`/`LimitAlarm`/base).
+- `AlarmSeverity` (4-bucket) → 1..1000 via the projector map above; the authored
+  `severity` seeds the condition's initial severity at materialization
+  (`MaterialiseAlarmCondition`'s `MapSeverity`).
+
+## Error handling / edge cases
+- **Unknown `SourceNodeId`** (no materialized condition for the ref): drop
+  silently — preserves `GenericDriverNodeManager`'s documented behavior.
+- **Byte-parity** between `Phase7Composer` and `DeploymentArtifact` for alarm tags:
+  parity round-trip test (the established invariant).
+- **Redeploy double-delivery**: `DriverInstanceActor` unsubscribes `OnAlarmEvent`
+  on teardown; `WriteAlarmCondition`'s delta-gate independently suppresses
+  duplicate events; `RebuildAddressSpace` clears `_alarmConditions`.
+- **Transition before condition materialized / after rebuild**: unknown-ref drop
+  handles it; the projector's prior-state dict is keyed by NodeId and tolerates a
+  cold start (first event seeds state).
+- **A tag with both a value and an alarm intent**: Phase B treats an `alarm`-marked
+  tag as a **condition node only** (not also a plain variable) — matching the
+  retired mirror, where an alarm attribute surfaced as a condition.
+
+## Testing (no bUnit)
+
+**xUnit + Shouldly (offline):**
+- `ExtractTagAlarm`: present / absent / malformed / defaults / unknown-keys-preserved.
+- `Phase7Composer` ↔ `DeploymentArtifact` byte-parity with alarm-bearing equipment tags.
+- `NativeAlarmProjector`: Raise→active+unacked, Acknowledge→acked, Clear→inactive,
+  Retrigger, Unspecified, severity-bucket map, prior-state carry.
+- `GalaxyDriver.OnAlarmFeedTransition` populates `Kind`.
+- Akka.TestKit — `DriverInstanceActor`: a fake `IAlarmSource` driver fires
+  `OnAlarmEvent` → the actor publishes `AttributeAlarmPublished` to its parent;
+  unsubscribes on teardown.
+- Akka.TestKit — `DriverHostActor`: `AttributeAlarmPublished` resolves the ref →
+  Tells `AlarmStateUpdate` with the projected snapshot; unknown ref dropped;
+  `alerts` publish is Primary-gated (secondary suppresses).
+
+**Live docker-dev `/run` (user-driven; the agent does NOT sign in)** — the gate:
+- Author a Galaxy alarm equipment tag (raw `TagConfig` carrying the `alarm` object)
+  on the live-gateway-backed `MAIN-galaxy-eq`; deploy.
+- Trip the Galaxy alarm → a Part 9 `AlarmConditionState` appears active under the
+  equipment via Client.CLI `alarms` (and `read`); the AdminUI `/alerts` row appears.
+- Clear → condition goes inactive. (Device-ack round-trip is the deferred
+  follow-up, not part of this gate.)
+
+## Suggested slicing (for the plan)
+1. **WS-1** — `AlarmTransitionKind` + `AlarmEventArgs.Kind` + Galaxy populates it
+   (small/standard; touches a Core.Abstractions contract → ripples to implementers).
+2. **WS-2** — `ExtractTagAlarm` + `EquipmentTagPlan.Alarm` in both composer +
+   artifact + parity test (high-risk: data-contract byte-parity).
+3. **WS-3** — `MaterialiseEquipmentTags` alarm branch (standard; reuses
+   `MaterialiseAlarmCondition`).
+4. **WS-4a** — `NativeAlarmProjector` (standard; pure, fully TDD-able offline).
+5. **WS-4b** — `DriverInstanceActor` `OnAlarmEvent` subscription + publish (high-risk:
+   actor state machine + driver-thread marshaling).
+6. **WS-4c** — `DriverHostActor` alarm map + `ForwardNativeAlarm` + Primary-gated
+   alerts publish (high-risk: actor/concurrency/redundancy gate).
+7. **WS-5** — wire `AlarmTransitionEvent` fields (folds into WS-4c; verify historian +
+   `/alerts` consume it).
+8. **Docs** — document the `TagConfig` `alarm` schema (a Galaxy/alarms doc note).
+9. **Live `/run`** — the gate above (user-driven).
+
+## Deferred follow-ups (explicitly out of Phase B)
+- **Inbound device-ack**: client Acknowledge → `IAlarmSource.AcknowledgeAsync` →
+  AVEVA (its own inbound pipeline, mirrors the write-through work).
+- **`SubscribeAlarmsAsync` from the materialized alarm-ref set** for drivers that
+  gate their feed on it (Galaxy doesn't).
+- **AdminUI Galaxy picker pre-fill** of the `alarm` object from discovery
+  (`IsAlarm`/`SecurityClass` already known) — a UI convenience; raw-JSON authoring
+  works without it and avoids live-only Razor binding risk.
+- Carrying the raw OPC UA severity (vs. the 4-bucket) end-to-end.
+
+## Hard rules (carried into implementation)
+- Stage by path; never `git add .`. Never stage `sql_login.txt`,
+  `src/Server/ZB.MOM.WW.OtOpcUa.Host/pki/`, `pending.md`, `current.md`, or
+  `docker-dev/docker-compose.yml`.
+- Never echo the gateway API key or any secret into a tracked file.
+- No force-push, no `--no-verify`.
+- **No Configuration entity / EF migration change** (the TagConfig route is chosen
+  specifically to honor this).
+- No bUnit; Razor/JS proven only by live `/run`.
+- Build on a feature branch off master.
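The WS-4 projection rules can be sketched as a small pure class. This is an illustrative Python model, not the planned C# `NativeAlarmProjector`: `AlarmEvent`, `Snapshot`, and `Kind` are hypothetical stand-ins for `AlarmEventArgs`, `AlarmConditionSnapshot`, and `AlarmTransitionKind`. Two points are assumptions the design leaves open: a cold start seeds the prior state as inactive and acknowledged, and `Acknowledge`/`Clear` keep the prior severity and message.

```python
from dataclasses import dataclass, replace
from enum import Enum
from typing import Dict, Optional


class Kind(Enum):  # mirrors the proposed AlarmTransitionKind
    UNSPECIFIED = 0
    RAISE = 1
    ACKNOWLEDGE = 2
    CLEAR = 3
    RETRIGGER = 4


# 4-bucket AlarmSeverity -> 1..1000, using the values listed under WS-4.
SEVERITY = {"Low": 200, "Medium": 500, "High": 700, "Critical": 900}


@dataclass(frozen=True)
class AlarmEvent:  # hypothetical stand-in for AlarmEventArgs
    severity: str
    message: str
    kind: Kind = Kind.UNSPECIFIED  # trailing default, as in WS-1
    operator_comment: Optional[str] = None


@dataclass(frozen=True)
class Snapshot:  # hypothetical stand-in for AlarmConditionSnapshot
    active: bool
    acked: bool
    severity: int
    message: str
    comment: Optional[str] = None
    enabled: bool = True
    confirmed: bool = True
    shelving: str = "Unshelved"


class NativeAlarmProjector:
    """Keeps prior state per condition NodeId and projects each transition."""

    def __init__(self) -> None:
        self._prior: Dict[str, Snapshot] = {}

    def project(self, node_id: str, event: AlarmEvent) -> Snapshot:
        severity = SEVERITY[event.severity]
        # Cold start (assumption): seed as inactive and acknowledged.
        prior = self._prior.get(node_id) or Snapshot(False, True, severity, event.message)
        if event.kind in (Kind.RAISE, Kind.RETRIGGER):
            snap = Snapshot(True, False, severity, event.message)
        elif event.kind is Kind.ACKNOWLEDGE:
            # Keep prior Active; carry the operator comment.
            snap = replace(prior, acked=True, comment=event.operator_comment)
        elif event.kind is Kind.CLEAR:
            # Keep prior Acked.
            snap = replace(prior, active=False)
        else:
            # UNSPECIFIED: keep Active/Acked, refresh severity and message.
            snap = replace(prior, severity=severity, message=event.message)
        self._prior[node_id] = snap
        return snap
```

Because the projector holds no actor or SDK references, the Raise, Acknowledge, Clear, Retrigger, and Unspecified cases listed in the testing section can each be exercised with plain value comparisons.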
+
+## Authoritative touched-code list (for planning)
+- `src/Core/ZB.MOM.WW.OtOpcUa.Core.Abstractions/IAlarmSource.cs` (`AlarmEventArgs.Kind`, new enum)
+- `src/Drivers/ZB.MOM.WW.OtOpcUa.Driver.Galaxy/GalaxyDriver.cs` (`OnAlarmFeedTransition` populates `Kind`)
+- `src/Server/ZB.MOM.WW.OtOpcUa.OpcUaServer/Phase7Composer.cs` (`EquipmentTagPlan.Alarm`, `ExtractTagAlarm`)
+- `src/Server/ZB.MOM.WW.OtOpcUa.Runtime/Drivers/DeploymentArtifact.cs` (`BuildEquipmentTagPlans` parity)
+- `src/Server/ZB.MOM.WW.OtOpcUa.OpcUaServer/Phase7Applier.cs` (`MaterialiseEquipmentTags` alarm branch)
+- `src/Server/ZB.MOM.WW.OtOpcUa.Runtime/Drivers/DriverInstanceActor.cs` (`OnAlarmEvent` sub + publish)
+- `src/Server/ZB.MOM.WW.OtOpcUa.Runtime/Drivers/DriverHostActor.cs` (alarm map + `ForwardNativeAlarm` + gated publish)
+- NEW `NativeAlarmProjector` (Runtime or Commons) + its tests
+- `OpcUaPublishActor` / `OtOpcUaNodeManager` — **no change** (reuse `AlarmStateUpdate`/`WriteAlarmCondition`)
+- A docs note for the `TagConfig` `alarm` schema
+```
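The `DriverHostActor` entry in the list above (alarm map, `ForwardNativeAlarm`, gated publish) can be pictured with a small routing sketch. It is a Python model under stated assumptions, not the actor itself: `NativeAlarmRouter`, the plan tuples, and the NodeId strings are hypothetical, the two lists stand in for the `AlarmStateUpdate` Tell and the `alerts` topic publish, and emitting one alert per resolved condition NodeId is an assumption the design does not spell out.

```python
from typing import Callable, Dict, Iterable, List, Set, Tuple

Ref = Tuple[str, str]  # (driver_instance_id, full_name)


class NativeAlarmRouter:
    """Resolves a driver alarm ref to condition NodeIds and fans out."""

    def __init__(
        self,
        alarm_plans: Iterable[Tuple[str, str, str]],  # (driver_id, full_name, node_id)
        is_primary: Callable[[], bool],
    ) -> None:
        # Built once per apply pass, only from plans that carry an alarm.
        self._node_ids_by_ref: Dict[Ref, Set[str]] = {}
        for driver_id, full_name, node_id in alarm_plans:
            self._node_ids_by_ref.setdefault((driver_id, full_name), set()).add(node_id)
        self._is_primary = is_primary
        self.state_updates: List[Tuple[str, str]] = []  # stand-in for AlarmStateUpdate
        self.alerts: List[Tuple[str, str]] = []         # stand-in for the alerts topic

    def forward(self, driver_id: str, source_node_id: str, kind: str) -> int:
        """Route one transition; returns how many conditions were updated."""
        node_ids = self._node_ids_by_ref.get((driver_id, source_node_id))
        if not node_ids:
            return 0  # unknown ref: drop silently
        for node_id in sorted(node_ids):
            # Condition state is updated on every node (ungated).
            self.state_updates.append((node_id, kind))
            # The alerts publish happens only on the Primary.
            if self._is_primary():
                self.alerts.append((node_id, kind))
        return len(node_ids)
```

Keeping the state update ungated while gating only the alerts publish means a secondary node holds warm condition state without duplicating rows in `/alerts` or the historian.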