docs(galaxy): Phase B native-alarms design (equipment-tag path)

Joseph Doherty
2026-06-14 02:50:34 -04:00
parent 1d797c1c8a
commit 90096e9c00
@@ -0,0 +1,265 @@
# Galaxy Phase B — Native Alarms on the Equipment-Tag Path — Design
**Date:** 2026-06-14
**Status:** Approved (brainstorming) — ready for implementation planning
**Scope:** Phase B of the Galaxy standard-driver design
(`docs/plans/2026-06-12-galaxy-standard-driver-design.md`). Restores native
`IAlarmSource` alarms in the post-Phase-A Equipment model, delivered over a new
driver→server alarm seam that mirrors the scripted-alarm seam.
**Builds on:** master (Milestone 1b complete — equipment-tag live values + write
pipeline shipped). Feature branch off master.
## Goal
A Galaxy equipment `Tag` bound to `GalaxyMxGateway` and marked as a **native
alarm** materializes a real OPC UA Part 9 `AlarmConditionState` under its
equipment folder, and the driver's live `IAlarmSource.OnAlarmEvent` transitions
drive that condition (active / severity / message / ack-state) and fan out to the
`alerts` topic (AdminUI `/alerts` + historian), exactly as scripted alarms do
today. **No EF/schema migration.**
## Why this is needed (the gap, grounded in current code)
Phase A retired the `SystemPlatform` mirror (`MaterialiseGalaxyTags` +
`GenericDriverNodeManager`), which was the **only** path that wired native
alarms. Three concrete consequences, all verified against current code:
1. **No cross-actor transport for native alarms.** In the fused-host actor model,
`DriverInstanceActor` subscribes the driver's `ISubscribable.OnDataChange`
(value seam) but **never** subscribes `IAlarmSource.OnAlarmEvent`. No actor
message carries an alarm transition. `GenericDriverNodeManager`
(`src/Core/.../OpcUa/GenericDriverNodeManager.cs`) subscribed `OnAlarmEvent`
**in-process**; it is now orphaned (only tests construct it).
2. **No condition-update sink survives.** There is no production
`IAlarmConditionSink.OnTransition` implementation anymore — only the interface
(`IAddressSpaceBuilder.cs:115`) and an empty CLI stub
(`TwinCAT.Cli/.../BrowseCommand.cs:150`). The real sink lived in the mirror's
builder and died with it. The surviving condition path is the **snapshot**
path: `OpcUaPublishActor.AlarmStateUpdate` → `OtOpcUaNodeManager.WriteAlarmCondition`.
3. **Authored tags carry no alarm flag.** Native-alarm metadata came from driver
discovery (`DriverAttributeInfo.IsAlarm`), but Phase 7 composes from authored
`Tag` rows only. `EquipmentTagPlan` carries `{TagId, EquipmentId,
DriverInstanceId, FolderPath, Name, DataType, FullName, Writable}` — **no
alarm field**. The `Tag` entity has no alarm column.
**Decisive design fact:** `AlarmEventArgs` (the `OnAlarmEvent` payload) does
**not** carry the transition kind. `GalaxyAlarmTransition` has an explicit
`GalaxyAlarmTransitionKind {Raise, Acknowledge, Clear, Retrigger}`, but
`GalaxyDriver.OnAlarmFeedTransition` drops it when building `AlarmEventArgs`. A
consumer therefore cannot tell a raise from a clear. Phase B fixes this at the source
(additive contract change) rather than guessing.
## Locked decisions (from brainstorming)
| Decision | Choice |
|---|---|
| How the server learns a tag is a native alarm | **TagConfig JSON (no migration).** The alarm intent rides in the schemaless `TagConfig` blob — `{"FullName":"…","alarm":{"alarmType":"OffNormalAlarm","severity":500}}` — parsed byte-parity in `Phase7Composer` + `DeploymentArtifact`, exactly as `FullName` is today. No EF/schema change. |
| Phase B scope line | **Live condition + alerts fan-out; defer device-ack.** Trip → Part 9 condition + `/alerts` + historian (Primary-gated). A client Acknowledge updates the **local** condition state; routing it back to the driver's `IAlarmSource.AcknowledgeAsync` (→ AVEVA) is a deferred follow-up. |
| Transition→state model | **Snapshot projection.** Reuse the scripted-alarm sink path (`WriteAlarmCondition` + delta-gate + `ReportConditionEvent` + Part-9 ack mechanics) unchanged; a new pure projector derives an `AlarmConditionSnapshot` from each transition. The retired `OnTransition` sink path is **not** resurrected. |
| Transition-kind plumbing | **Additive contract change.** Add `AlarmTransitionKind Kind` to `AlarmEventArgs` (default `Unspecified`); `GalaxyDriver` populates it from the kind it already has. A record default keeps every other `IAlarmSource` implementer compiling. |
### Approaches considered and rejected
- **Resurrect the `IAlarmConditionSink`/`OnTransition` builder path.** Rejected:
reintroduces a second condition-state mechanism alongside the live, delta-gated,
ack-wired `WriteAlarmCondition` path. The snapshot path is the one that works
for scripted alarms today.
- **Auto-discover the alarm flag at deploy** (query driver `DriverAttributeInfo`
during composition). Rejected (this session): couples Phase 7 composition to a
cross-process discovery query; larger and higher-risk than reading the authored
blob. The TagConfig route matches the protocol-linkage precedent.
- **Extend the `Tag` entity with `IsAlarm`/`AlarmType` columns.** Rejected:
requires a Configuration/EF migration, out of scope for this phase.
- **Infer raise/clear from `AlarmEventArgs` heuristically.** Rejected: the record
has no active flag, so any inference would be fragile and could misreport transitions. Fix the contract instead.
## Architecture (target end-state)
```
Author Tag{Equipment, GalaxyMxGateway, TagConfig={FullName, alarm:{alarmType,severity}}}
  → Phase7Composer / DeploymentArtifact (ExtractTagAlarm, byte-parity)
    → EquipmentTagPlan.Alarm != null
      → Phase7Applier.MaterialiseEquipmentTags
          • Alarm == null : SafeEnsureVariable (today's value variable, unchanged)
          • Alarm != null : SafeMaterialiseAlarmCondition(nodeId, equip, name, alarmType, severity)
              → real Part 9 AlarmConditionState under the equipment folder (reused)

Runtime (the NEW seam; mirrors the scripted-alarm seam):
  GalaxyDriver.OnAlarmEvent (AlarmEventArgs{SourceNodeId=FullName, Kind, Severity, …})
    → DriverInstanceActor (subscribes OnAlarmEvent; marshals via Self.Tell)
      → AttributeAlarmPublished(DriverInstanceId, AlarmEventArgs) [NEW msg, parallels AttributeValuePublished]
        → DriverHostActor.ForwardNativeAlarm
            • resolve (DriverInstanceId, SourceNodeId) → condition NodeId(s) via _alarmNodeIdByDriverRef [NEW map]
            • NativeAlarmProjector.Project(nodeId, args) → AlarmConditionSnapshot [NEW pure helper]
            • Tell OpcUaPublishActor.AlarmStateUpdate(nodeId, snapshot, ts) (UNGATED — warm on all nodes) [REUSED]
            • Publish AlarmTransitionEvent → `alerts` topic (Primary-gated — reuse _localRole) [REUSED]
          → OtOpcUaNodeManager.WriteAlarmCondition → delta-gate → ReportConditionEvent (real Part 9 event) [REUSED]
```
`OpcUaPublishActor` and `OtOpcUaNodeManager` are **unchanged** — Phase B reuses
`AlarmStateUpdate` → `WriteAlarmCondition` verbatim.
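
The runtime half of the diagram reduces to a lookup followed by two sends with different gating. The following is a minimal sketch under stated assumptions: plain `Action<string>` delegates stand in for the Akka `Tell` and the `alerts` publish, NodeIds are plain strings, and `NativeAlarmRouterSketch`, `LocalRoleSketch`, `Register`, and `Forward` are hypothetical names rather than real `DriverHostActor` members. It also assumes one alert is published per resolved condition NodeId.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for the role tracked by _localRole.
public enum LocalRoleSketch { Primary, Secondary }

public sealed class NativeAlarmRouterSketch
{
    // (DriverInstanceId, FullName) -> condition NodeIds, as in _alarmNodeIdByDriverRef.
    private readonly Dictionary<(string DriverInstanceId, string FullName), HashSet<string>> _alarmNodeIdByDriverRef = new();
    private readonly Action<string> _tellConditionUpdate; // stands in for Tell(AlarmStateUpdate)
    private readonly Action<string> _publishAlert;        // stands in for the `alerts` topic publish

    public LocalRoleSketch LocalRole { get; set; } = LocalRoleSketch.Secondary;

    public NativeAlarmRouterSketch(Action<string> tellConditionUpdate, Action<string> publishAlert)
    {
        _tellConditionUpdate = tellConditionUpdate;
        _publishAlert = publishAlert;
    }

    // Called during the apply pass for every equipment-tag plan with Alarm != null.
    public void Register(string driverInstanceId, string fullName, string conditionNodeId)
    {
        var key = (driverInstanceId, fullName);
        if (!_alarmNodeIdByDriverRef.TryGetValue(key, out var nodeIds))
            _alarmNodeIdByDriverRef[key] = nodeIds = new HashSet<string>();
        nodeIds.Add(conditionNodeId);
    }

    // Returns how many condition nodes the transition was routed to.
    public int Forward(string driverInstanceId, string sourceNodeId)
    {
        if (!_alarmNodeIdByDriverRef.TryGetValue((driverInstanceId, sourceNodeId), out var nodeIds))
            return 0; // unknown ref: drop silently

        foreach (var nodeId in nodeIds)
        {
            _tellConditionUpdate(nodeId);             // ungated: every node keeps a warm condition
            if (LocalRole == LocalRoleSketch.Primary) // gated: only the Primary feeds `alerts`
                _publishAlert(nodeId);
        }
        return nodeIds.Count;
    }
}
```

Keeping the condition update ungated means a Secondary already holds the current condition state when it takes over, while the Primary gate prevents duplicate rows in `/alerts` and the historian.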
## Components / workstreams
### WS-1 — Transition-kind contract (additive, Core.Abstractions)
- New enum `AlarmTransitionKind { Unspecified = 0, Raise, Acknowledge, Clear, Retrigger }`
(mirrors the internal `GalaxyAlarmTransitionKind`).
- `AlarmEventArgs` gains a trailing `AlarmTransitionKind Kind = AlarmTransitionKind.Unspecified`
param (record default → all existing implementers compile untouched).
- `GalaxyDriver.OnAlarmFeedTransition` (`GalaxyDriver.cs:~1128-1167`) populates
`Kind` from `transition.TransitionKind`.
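
A minimal sketch of why a trailing defaulted parameter keeps existing call sites compiling. `AlarmEventArgsSketch` is a hypothetical, trimmed-down stand-in (the real `AlarmEventArgs` has more members than shown), and the tag reference `Tank01.Level.HiHi` is made up for illustration.

```csharp
using System;

// Enum members as listed in WS-1.
public enum AlarmTransitionKind { Unspecified = 0, Raise, Acknowledge, Clear, Retrigger }

// Hypothetical reduced record; only the trailing Kind parameter matters here.
public sealed record AlarmEventArgsSketch(
    string SourceNodeId,
    string Message,
    AlarmTransitionKind Kind = AlarmTransitionKind.Unspecified);

public static class ContractDemo
{
    // A call site written before Kind existed: compiles unchanged, Kind defaults.
    public static AlarmEventArgsSketch LegacyCall() =>
        new("Tank01.Level.HiHi", "level high");

    // A Galaxy-style call site that forwards the kind it already knows.
    public static AlarmEventArgsSketch GalaxyCall() =>
        new("Tank01.Level.HiHi", "level high", AlarmTransitionKind.Raise);
}
```

An optional parameter is source-compatible only: assemblies compiled against the old constructor need a rebuild, and any code that positionally deconstructs the record would need the extra element. That matches the "ripples to implementers" note in the slicing section.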
### WS-2 — Alarm intent in TagConfig + compose plan (no EF)
- New never-throw `ExtractTagAlarm(string tagConfig) → EquipmentTagAlarmInfo?`
(parses the optional `"alarm"` object: `alarmType` default `"AlarmCondition"`,
`severity` default `500`; absent/malformed → null). Lives next to
`ExtractTagFullName`; **used by both** `Phase7Composer` and `DeploymentArtifact`.
- `EquipmentTagPlan` (`Phase7Composer.cs:~80`) gains `EquipmentTagAlarmInfo? Alarm`
(`record EquipmentTagAlarmInfo(string AlarmType, int Severity)`); null ⇒ plain
variable. Populated in `Phase7Composer` `Select(...)` **and**
`DeploymentArtifact.BuildEquipmentTagPlans` (the **byte-parity invariant**), covered
by a round-trip test.
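
One way the never-throw parser could look, using `System.Text.Json`. This is a sketch, not the shipped implementation: the `TagConfigSketch` class name is hypothetical, and it assumes that an absent member takes its default while a wrong-typed member counts as malformed and yields null (the design only says "absent/malformed → null").

```csharp
using System;
using System.Text.Json;

public sealed record EquipmentTagAlarmInfo(string AlarmType, int Severity);

public static class TagConfigSketch
{
    // Never throws: any absent or malformed "alarm" object yields null.
    public static EquipmentTagAlarmInfo? ExtractTagAlarm(string? tagConfig)
    {
        if (string.IsNullOrWhiteSpace(tagConfig)) return null;
        try
        {
            using var doc = JsonDocument.Parse(tagConfig);
            var root = doc.RootElement;
            if (root.ValueKind != JsonValueKind.Object) return null;
            if (!root.TryGetProperty("alarm", out var alarm)) return null;
            if (alarm.ValueKind != JsonValueKind.Object) return null;

            // Absent member -> documented default.
            var alarmType = "AlarmCondition";
            if (alarm.TryGetProperty("alarmType", out var typeElement))
            {
                // Assumption: a wrong-typed member is treated as malformed.
                if (typeElement.ValueKind != JsonValueKind.String) return null;
                alarmType = typeElement.GetString()!;
            }

            var severity = 500;
            if (alarm.TryGetProperty("severity", out var severityElement))
            {
                if (severityElement.ValueKind != JsonValueKind.Number
                    || !severityElement.TryGetInt32(out severity))
                    return null;
            }

            return new EquipmentTagAlarmInfo(alarmType, severity);
        }
        catch (JsonException)
        {
            return null; // not valid JSON at all
        }
    }
}
```

Because `Phase7Composer` and `DeploymentArtifact` would both call this single function, the parity test only has to prove that both sides pass the same `TagConfig` string through it.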
### WS-3 — Materialize the condition node (reuse)
- `Phase7Applier.MaterialiseEquipmentTags` (`:~162-199`) branches per tag:
`tag.Alarm is not null` → `SafeMaterialiseAlarmCondition(nodeId, parentEquipment,
tag.Name, tag.Alarm.AlarmType, tag.Alarm.Severity)` (the **same** method scripted
alarms use; condition NodeId = the tag's equipment-scoped NodeId); else →
`SafeEnsureVariable(...)` (today). `RebuildAddressSpace` already clears
`_alarmConditions`, so redeploy teardown is covered.
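
The per-tag branch can be sketched as below. `IAddressSpaceSketch`, `TagPlanSketch`, and `RecordingSpace` are hypothetical stand-ins, and the parameter lists are simplified to strings; only the two method names and the `Alarm is not null` test come from the design.

```csharp
using System;
using System.Collections.Generic;

public sealed record EquipmentTagAlarmInfo(string AlarmType, int Severity);

// Hypothetical reduced plan: the real EquipmentTagPlan carries more fields.
public sealed record TagPlanSketch(string NodeId, string Equipment, string Name, EquipmentTagAlarmInfo? Alarm);

// Hypothetical seam over the two materialization calls named in WS-3.
public interface IAddressSpaceSketch
{
    void SafeEnsureVariable(string nodeId, string equipment, string name);
    void SafeMaterialiseAlarmCondition(string nodeId, string equipment, string name, string alarmType, int severity);
}

public static class MaterialiseSketch
{
    public static void MaterialiseEquipmentTags(IEnumerable<TagPlanSketch> tags, IAddressSpaceSketch space)
    {
        foreach (var tag in tags)
        {
            if (tag.Alarm is not null)
                // Alarm-marked tag: a condition node only, never also a variable.
                space.SafeMaterialiseAlarmCondition(tag.NodeId, tag.Equipment, tag.Name,
                    tag.Alarm.AlarmType, tag.Alarm.Severity);
            else
                // Plain tag: today's value variable, unchanged.
                space.SafeEnsureVariable(tag.NodeId, tag.Equipment, tag.Name);
        }
    }
}

// Recording fake, useful for an offline test of the branch.
public sealed class RecordingSpace : IAddressSpaceSketch
{
    public List<string> Calls { get; } = new();

    public void SafeEnsureVariable(string nodeId, string equipment, string name) =>
        Calls.Add($"variable:{nodeId}");

    public void SafeMaterialiseAlarmCondition(string nodeId, string equipment, string name, string alarmType, int severity) =>
        Calls.Add($"condition:{nodeId}:{alarmType}:{severity}");
}
```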
### WS-4 — The driver→server alarm seam (the new plumbing)
- **`DriverInstanceActor`**: on connect, if `_driver is IAlarmSource src`, subscribe
`src.OnAlarmEvent += handler`; the handler marshals to the actor thread via
`Self.Tell(new NativeAlarmRaised(e))` (mirrors the `_dataChangeHandler` pattern,
`:409/:456`). `Receive<NativeAlarmRaised>` → `Context.Parent.Tell(new
AttributeAlarmPublished(DriverInstanceId, Args))`. Unsubscribe on
disconnect/teardown (mirror the `OnDataChange` unsubscribe). New messages
`NativeAlarmRaised` (internal) + `AttributeAlarmPublished` (parallels
`AttributeValuePublished`, `:65`). Phase B follows the mirror's model: subscribe
the event and let the server filter by `SourceNodeId`; it does **not** drive
`SubscribeAlarmsAsync` (Galaxy's feed auto-starts session-less in
`InitializeAsync` and fires `OnAlarmEvent` regardless). Driving
`SubscribeAlarmsAsync` from the materialized alarm-ref set, for drivers that gate
on it, is a noted follow-up.
- **`DriverHostActor`**: build `_alarmNodeIdByDriverRef: (DriverInstanceId,
FullName) → HashSet<NodeId>` from equipment-tag plans where `Alarm != null`
(alongside the existing `_nodeIdByDriverRef`, in the same apply pass). Add
`Receive<AttributeAlarmPublished>` in the steady + applying states. Handler
`ForwardNativeAlarm`: resolve nodeIds (unknown ref → drop silently, matching the
retired mirror's behavior); per nodeId `NativeAlarmProjector.Project(...)` → snapshot →
`_publishActor.Tell(AlarmStateUpdate(nodeId, snapshot, ts))` **ungated**; then
publish `AlarmTransitionEvent` to `alerts` **Primary-gated** via the existing
`_localRole` the write-routing already tracks.
- **`NativeAlarmProjector`** (new pure class; unit-tested): per-condition-NodeId
prior-state `(Active, Acked, Severity, Message)`; `Project(nodeId, AlarmEventArgs)
→ AlarmConditionSnapshot` by `Kind`:
  - `Raise`/`Retrigger` → `Active=true, Acked=false`, severity+message from event.
  - `Acknowledge` → `Acked=true` (keep prior Active), carry `OperatorComment`.
  - `Clear` → `Active=false` (keep prior Acked).
  - `Unspecified` → keep prior Active/Acked, refresh severity+message.
  - Every snapshot, regardless of `Kind`, also sets `Enabled=true`, `Confirmed=true`,
    `Shelving=Unshelved` (shelving is a server/local concern). Severity: `AlarmSeverity`
    4-bucket → 1..1000 ushort (Low→200, Medium→500, High→700, Critical→900).
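
The projection rules above can be sketched as a small pure class. The record shapes here are hypothetical and reduced (`OperatorComment`, `Enabled`, `Confirmed`, and `Shelving` are omitted), NodeIds are plain strings, and the cold-start seed of "inactive, acknowledged" is an assumption, not something the design specifies.

```csharp
using System;
using System.Collections.Generic;

public enum AlarmTransitionKind { Unspecified = 0, Raise, Acknowledge, Clear, Retrigger }
public enum AlarmSeverity { Low, Medium, High, Critical }

// Hypothetical reduced shapes; the real records carry more fields.
public sealed record AlarmEventSketch(AlarmTransitionKind Kind, AlarmSeverity Severity, string Message);
public sealed record SnapshotSketch(bool Active, bool Acked, ushort Severity, string Message);

public sealed class NativeAlarmProjectorSketch
{
    // Prior state per condition NodeId.
    private readonly Dictionary<string, SnapshotSketch> _prior = new();

    // 4-bucket severity -> OPC UA 1..1000 range, per the mapping in WS-4.
    public static ushort MapSeverity(AlarmSeverity severity) => severity switch
    {
        AlarmSeverity.Low => (ushort)200,
        AlarmSeverity.Medium => (ushort)500,
        AlarmSeverity.High => (ushort)700,
        AlarmSeverity.Critical => (ushort)900,
        _ => (ushort)500, // assumption: unknown buckets fall back to mid-range
    };

    public SnapshotSketch Project(string nodeId, AlarmEventSketch e)
    {
        // Assumed cold-start seed: inactive and acknowledged.
        var prior = _prior.TryGetValue(nodeId, out var known)
            ? known
            : new SnapshotSketch(false, true, 0, "");
        var severity = MapSeverity(e.Severity);

        var next = e.Kind switch
        {
            AlarmTransitionKind.Raise or AlarmTransitionKind.Retrigger =>
                new SnapshotSketch(true, false, severity, e.Message),
            AlarmTransitionKind.Acknowledge => prior with { Acked = true },   // keep prior Active
            AlarmTransitionKind.Clear => prior with { Active = false },       // keep prior Acked
            _ => prior with { Severity = severity, Message = e.Message },     // Unspecified
        };

        _prior[nodeId] = next;
        return next;
    }
}
```

Because the class holds only a dictionary and has no actor or SDK dependencies, each transition rule can be asserted offline with a single call sequence.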
### WS-5 — Historian / alerts parity (reuse)
- The `AlarmTransitionEvent` published in WS-4 is the same contract
`ScriptedAlarmHostActor` publishes; `HistorianAdapterActor` + AdminUI `/alerts`
consume it unchanged. Populate `AlarmId` = condition NodeId, `EquipmentPath` +
`AlarmName` from the plan, `TransitionKind` = `Kind.ToString()`, `AlarmTypeName`
= the configured OPC UA alarm type, `User`/`Comment` from the event.
## Data type / severity mapping
- OPC UA alarm subtype string → SDK type via the existing
`CreateAlarmConditionOfType` (`OffNormalAlarm`/`DiscreteAlarm`/`LimitAlarm`/base).
- `AlarmSeverity` (4-bucket) → 1..1000 via the projector map above; the authored
`severity` seeds the condition's initial severity at materialization
(`MaterialiseAlarmCondition`'s `MapSeverity`).
## Error handling / edge cases
- **Unknown `SourceNodeId`** (no materialized condition for the ref): drop
silently — preserves `GenericDriverNodeManager`'s documented behavior.
- **Byte-parity** between `Phase7Composer` and `DeploymentArtifact` for alarm tags:
parity round-trip test (the established invariant).
- **Redeploy double-delivery**: `DriverInstanceActor` unsubscribes `OnAlarmEvent`
on teardown; `WriteAlarmCondition`'s delta-gate independently suppresses
duplicate events; `RebuildAddressSpace` clears `_alarmConditions`.
- **Transition before condition materialized / after rebuild**: unknown-ref drop
handles it; the projector's prior-state dict is keyed by NodeId and tolerates a
cold start (first event seeds state).
- **A tag with both a value and an alarm intent**: Phase B treats an `alarm`-marked
tag as a **condition node only** (not also a plain variable) — matching the
retired mirror, where an alarm attribute surfaced as a condition.
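
The redeploy double-delivery guard depends on unsubscribing the exact delegate that was subscribed. A minimal sketch of that lifecycle, with hypothetical names throughout: `IAlarmSourceSketch` uses a string payload in place of `AlarmEventArgs`, and the `Action<string>` passed to the constructor stands in for `Self.Tell(new NativeAlarmRaised(e))`.

```csharp
using System;

// Hypothetical reduced alarm source; the payload stands in for AlarmEventArgs.
public interface IAlarmSourceSketch
{
    event EventHandler<string>? OnAlarmEvent;
}

public sealed class FakeAlarmDriver : IAlarmSourceSketch
{
    public event EventHandler<string>? OnAlarmEvent;

    public void Fire(string sourceNodeId) => OnAlarmEvent?.Invoke(this, sourceNodeId);
}

public sealed class AlarmSubscriptionSketch
{
    private readonly Action<string> _post;   // stands in for Self.Tell(...)
    private IAlarmSourceSketch? _source;
    private EventHandler<string>? _handler;  // kept so -= removes the same delegate

    public AlarmSubscriptionSketch(Action<string> post) => _post = post;

    public void Connect(object driver)
    {
        Disconnect(); // idempotent: never hold two subscriptions at once
        if (driver is not IAlarmSourceSketch source) return;

        _handler = (_, e) => _post(e); // runs on the driver's thread; only posts
        source.OnAlarmEvent += _handler;
        _source = source;
    }

    public void Disconnect()
    {
        if (_source is not null && _handler is not null)
            _source.OnAlarmEvent -= _handler;
        _source = null;
        _handler = null;
    }
}
```

In the actor, the handler body should do nothing beyond the `Tell`, so that all state changes happen on the actor's own message loop rather than on the driver thread.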
## Testing (no bUnit)
**xUnit + Shouldly (offline):**
- `ExtractTagAlarm`: present / absent / malformed / defaults / unknown-keys-preserved.
- `Phase7Composer` ↔ `DeploymentArtifact` byte-parity with alarm-bearing equipment tags.
- `NativeAlarmProjector`: Raise→active+unacked, Acknowledge→acked, Clear→inactive,
Retrigger, Unspecified, severity-bucket map, prior-state carry.
- `GalaxyDriver.OnAlarmFeedTransition` populates `Kind`.
- Akka.TestKit — `DriverInstanceActor`: a fake `IAlarmSource` driver fires
`OnAlarmEvent` → the actor publishes `AttributeAlarmPublished` to its parent;
unsubscribes on teardown.
- Akka.TestKit — `DriverHostActor`: `AttributeAlarmPublished` resolves the ref →
Tells `AlarmStateUpdate` with the projected snapshot; unknown ref dropped;
`alerts` publish is Primary-gated (secondary suppresses).
**Live docker-dev `/run` (user-driven; the agent does NOT sign in)** — the gate:
- Author a Galaxy alarm equipment tag (raw `TagConfig` carrying the `alarm` object)
on the live-gateway-backed `MAIN-galaxy-eq`; deploy.
- Trip the Galaxy alarm → a Part 9 `AlarmConditionState` appears active under the
equipment via Client.CLI `alarms` (and `read`); the AdminUI `/alerts` row appears.
- Clear → condition goes inactive. (Device-ack round-trip is the deferred
follow-up, not part of this gate.)
## Suggested slicing (for the plan)
1. **WS-1** — `AlarmTransitionKind` + `AlarmEventArgs.Kind` + Galaxy populates it
(small/standard; touches a Core.Abstractions contract → ripples to implementers).
2. **WS-2** — `ExtractTagAlarm` + `EquipmentTagPlan.Alarm` in both composer +
artifact + parity test (high-risk: data-contract byte-parity).
3. **WS-3** — `MaterialiseEquipmentTags` alarm branch (standard; reuses
`MaterialiseAlarmCondition`).
4. **WS-4a** — `NativeAlarmProjector` (standard; pure, fully TDD-able offline).
5. **WS-4b** — `DriverInstanceActor` `OnAlarmEvent` subscription + publish (high-risk:
actor state machine + driver-thread marshaling).
6. **WS-4c** — `DriverHostActor` alarm map + `ForwardNativeAlarm` + Primary-gated
alerts publish (high-risk: actor/concurrency/redundancy gate).
7. **WS-5** — wire `AlarmTransitionEvent` fields (folds into WS-4c; verify historian
+ `/alerts` consume it).
8. **Docs** — document the `TagConfig` `alarm` schema (a Galaxy/alarms doc note).
9. **Live `/run`** — the gate above (user-driven).
## Deferred follow-ups (explicitly out of Phase B)
- **Inbound device-ack**: client Acknowledge → `IAlarmSource.AcknowledgeAsync` →
AVEVA (its own inbound pipeline, mirrors the write-through work).
- **`SubscribeAlarmsAsync` from the materialized alarm-ref set** for drivers that
gate their feed on it (Galaxy doesn't).
- **AdminUI Galaxy picker pre-fill** of the `alarm` object from discovery
(`IsAlarm`/`SecurityClass` already known) — a UI convenience; raw-JSON authoring
works without it and avoids live-only Razor binding risk.
- Carrying the raw OPC UA severity (vs. the 4-bucket) end-to-end.
## Hard rules (carried into implementation)
- Stage by path; never `git add .`. Never stage `sql_login.txt`,
`src/Server/ZB.MOM.WW.OtOpcUa.Host/pki/`, `pending.md`, `current.md`, or
`docker-dev/docker-compose.yml`.
- Never echo the gateway API key or any secret into a tracked file.
- No force-push, no `--no-verify`.
- **No Configuration entity / EF migration change** (the TagConfig route is chosen
specifically to honor this).
- No bUnit; Razor/JS proven only by live `/run`.
- Build on a feature branch off master.
## Authoritative touched-code list (for planning)
- `src/Core/ZB.MOM.WW.OtOpcUa.Core.Abstractions/IAlarmSource.cs` (`AlarmEventArgs.Kind`, new enum)
- `src/Drivers/ZB.MOM.WW.OtOpcUa.Driver.Galaxy/GalaxyDriver.cs` (`OnAlarmFeedTransition` populates `Kind`)
- `src/Server/ZB.MOM.WW.OtOpcUa.OpcUaServer/Phase7Composer.cs` (`EquipmentTagPlan.Alarm`, `ExtractTagAlarm`)
- `src/Server/ZB.MOM.WW.OtOpcUa.Runtime/Drivers/DeploymentArtifact.cs` (`BuildEquipmentTagPlans` parity)
- `src/Server/ZB.MOM.WW.OtOpcUa.OpcUaServer/Phase7Applier.cs` (`MaterialiseEquipmentTags` alarm branch)
- `src/Server/ZB.MOM.WW.OtOpcUa.Runtime/Drivers/DriverInstanceActor.cs` (`OnAlarmEvent` sub + publish)
- `src/Server/ZB.MOM.WW.OtOpcUa.Runtime/Drivers/DriverHostActor.cs` (alarm map + `ForwardNativeAlarm` + gated publish)
- NEW `NativeAlarmProjector` (Runtime or Commons) + its tests
- `OpcUaPublishActor` / `OtOpcUaNodeManager` — **no change** (reuse `AlarmStateUpdate`/`WriteAlarmCondition`)
- A docs note for the `TagConfig` `alarm` schema