Galaxy Phase B — Native Alarms on the Equipment-Tag Path — Design

Date: 2026-06-14
Status: Approved (brainstorming); ready for implementation planning
Scope: Phase B of the Galaxy standard-driver design (docs/plans/2026-06-12-galaxy-standard-driver-design.md). Restores native IAlarmSource alarms in the post-Phase-A Equipment model, delivered over a new driver→server alarm seam that mirrors the scripted-alarm seam.
Builds on: master (Milestone 1b complete: equipment-tag live values + write pipeline shipped). Feature branch off master.

Goal

A Galaxy equipment Tag bound to GalaxyMxGateway and marked as a native alarm materializes a real OPC UA Part 9 AlarmConditionState under its equipment folder, and the driver's live IAlarmSource.OnAlarmEvent transitions drive that condition (active / severity / message / ack-state) and fan out to the alerts topic (AdminUI /alerts + historian), exactly as scripted alarms do today. No EF/schema migration.

Why this is needed (the gap, grounded in current code)

Phase A retired the SystemPlatform mirror (MaterialiseGalaxyTags + GenericDriverNodeManager), which was the only path that wired native alarms. Three concrete consequences, all verified against current code:

  1. No cross-actor transport for native alarms. In the fused-host actor model, DriverInstanceActor subscribes the driver's ISubscribable.OnDataChange (value seam) but never subscribes IAlarmSource.OnAlarmEvent. No actor message carries an alarm transition. GenericDriverNodeManager (src/Core/.../OpcUa/GenericDriverNodeManager.cs) subscribed OnAlarmEvent in-process; it is now orphaned (only tests construct it).
  2. No condition-update sink survives. There is no production IAlarmConditionSink.OnTransition implementation anymore, only the interface (IAddressSpaceBuilder.cs:115) and an empty CLI stub (TwinCAT.Cli/.../BrowseCommand.cs:150). The real sink lived in the mirror's builder and died with it. The surviving condition path is the snapshot path: OpcUaPublishActor.AlarmStateUpdate → OtOpcUaNodeManager.WriteAlarmCondition.
  3. Authored tags carry no alarm flag. Native-alarm metadata came from driver discovery (DriverAttributeInfo.IsAlarm), but Phase 7 composes from authored Tag rows only. EquipmentTagPlan carries {TagId, EquipmentId, DriverInstanceId, FolderPath, Name, DataType, FullName, Writable}; there is no alarm field. The Tag entity has no alarm column.

Decisive design fact: AlarmEventArgs (the OnAlarmEvent payload) does not carry the transition kind. GalaxyAlarmTransition has an explicit GalaxyAlarmTransitionKind {Raise, Acknowledge, Clear, Retrigger}, but GalaxyDriver.OnAlarmFeedTransition drops it when building AlarmEventArgs. A consumer therefore cannot tell a raise from a clear. Phase B fixes this at the source (additive contract change) rather than guessing.

Locked decisions (from brainstorming)

| Decision | Choice |
| --- | --- |
| How the server learns a tag is a native alarm | TagConfig JSON (no migration). The alarm intent rides in the schemaless TagConfig blob, e.g. {"FullName":"…","alarm":{"alarmType":"OffNormalAlarm","severity":500}}, parsed byte-parity in Phase7Composer + DeploymentArtifact, exactly as FullName is today. No EF/schema change. |
| Phase B scope line | Live condition + alerts fan-out; defer device-ack. Trip → Part 9 condition + /alerts + historian (Primary-gated). A client Acknowledge updates the local condition state; routing it back to the driver's IAlarmSource.AcknowledgeAsync (→ AVEVA) is a deferred follow-up. |
| Transition→state model | Snapshot projection. Reuse the scripted-alarm sink path (WriteAlarmCondition + delta-gate + ReportConditionEvent + Part-9 ack mechanics) unchanged; a new pure projector derives an AlarmConditionSnapshot from each transition. The retired OnTransition sink path is not resurrected. |
| Transition-kind plumbing | Additive contract change. Add AlarmTransitionKind Kind to AlarmEventArgs (default Unspecified); GalaxyDriver populates it from the kind it already has. A record default keeps every other IAlarmSource implementer compiling. |

Approaches considered and rejected

  • Resurrect the IAlarmConditionSink/OnTransition builder path. Rejected: reintroduces a second condition-state mechanism alongside the live, delta-gated, ack-wired WriteAlarmCondition path. The snapshot path is the one that works for scripted alarms today.
  • Auto-discover the alarm flag at deploy (query driver DriverAttributeInfo during composition). Rejected (this session): couples Phase 7 composition to a cross-process discovery query; larger and higher-risk than reading the authored blob. The TagConfig route matches the protocol-linkage precedent.
  • Extend the Tag entity with IsAlarm/AlarmType columns. Rejected: requires a Configuration/EF migration, out of scope for this phase.
  • Infer raise/clear from AlarmEventArgs heuristically. Rejected: the record has no active flag; inference is fragile and wrong. Fix the contract instead.

Architecture (target end-state)

Author Tag{Equipment, GalaxyMxGateway, TagConfig={FullName, alarm:{alarmType,severity}}}
   → Phase7Composer / DeploymentArtifact  (ExtractTagAlarm, byte-parity)
        → EquipmentTagPlan.Alarm != null
   → Phase7Applier.MaterialiseEquipmentTags
        • Alarm == null : SafeEnsureVariable          (today's value variable, unchanged)
        • Alarm != null : SafeMaterialiseAlarmCondition(nodeId, equip, name, alarmType, severity)
                          → real Part 9 AlarmConditionState under the equipment folder  (reused)

Runtime (the NEW seam — mirrors the scripted-alarm seam):
  GalaxyDriver.OnAlarmEvent (AlarmEventArgs{SourceNodeId=FullName, Kind, Severity, …})
   → DriverInstanceActor  (subscribes OnAlarmEvent; marshals via Self.Tell)
        → AttributeAlarmPublished(DriverInstanceId, AlarmEventArgs)   [NEW msg, parallels AttributeValuePublished]
   → DriverHostActor.ForwardNativeAlarm
        • resolve (DriverInstanceId, SourceNodeId) → condition NodeId(s)  via _alarmNodeIdByDriverRef  [NEW map]
        • NativeAlarmProjector.Project(nodeId, args) → AlarmConditionSnapshot              [NEW pure helper]
        • Tell OpcUaPublishActor.AlarmStateUpdate(nodeId, snapshot, ts)   (UNGATED — warm on all nodes)  [REUSED]
        • Publish AlarmTransitionEvent → `alerts` topic                  (Primary-gated — reuse _localRole) [REUSED]
   → OtOpcUaNodeManager.WriteAlarmCondition → delta-gate → ReportConditionEvent (real Part 9 event)  [REUSED]

OpcUaPublishActor and OtOpcUaNodeManager are unchanged: Phase B reuses AlarmStateUpdate → WriteAlarmCondition verbatim.

Components / workstreams

WS-1 — Transition-kind contract (additive, Core.Abstractions)

  • New enum AlarmTransitionKind { Unspecified = 0, Raise, Acknowledge, Clear, Retrigger } (mirrors the internal GalaxyAlarmTransitionKind).
  • AlarmEventArgs gains a trailing AlarmTransitionKind Kind = AlarmTransitionKind.Unspecified param (record default → all existing implementers compile untouched).
  • GalaxyDriver.OnAlarmFeedTransition (GalaxyDriver.cs:~1128-1167) populates Kind from transition.TransitionKind.
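
The additive change in WS-1 can be sketched as follows. This is a minimal illustration, not the real contract: AlarmEventArgsSketch and AlarmKindDemo are hypothetical names, and every member other than the trailing Kind parameter is an assumption made so the example compiles on its own.

```csharp
using System;

// Mirrors the enum named in WS-1; Unspecified = 0 is what legacy producers get.
public enum AlarmTransitionKind { Unspecified = 0, Raise, Acknowledge, Clear, Retrigger }

// HYPOTHETICAL stand-in for AlarmEventArgs. Only the trailing, defaulted Kind
// parameter comes from the design; the other members are illustrative.
public sealed record AlarmEventArgsSketch(
    string SourceNodeId,
    string Message,
    DateTime TimestampUtc,
    AlarmTransitionKind Kind = AlarmTransitionKind.Unspecified);

public static class AlarmKindDemo
{
    // A pre-Phase-B implementer: the call site omits Kind and still compiles.
    public static AlarmEventArgsSketch Legacy() =>
        new("Tank1.HiHi", "level high", DateTime.UtcNow);

    // A producer that already knows the transition kind passes it through.
    public static AlarmEventArgsSketch Raise() =>
        new("Tank1.HiHi", "level high", DateTime.UtcNow, AlarmTransitionKind.Raise);
}
```

Because the new parameter is last and defaulted, existing positional construction sites keep compiling, and consumers see Unspecified from drivers that have not been updated.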

WS-2 — Alarm intent in TagConfig + compose plan (no EF)

  • New never-throw ExtractTagAlarm(string tagConfig) → EquipmentTagAlarmInfo? (parses the optional "alarm" object: alarmType default "AlarmCondition", severity default 500; absent/malformed → null). Lives next to ExtractTagFullName; used by both Phase7Composer and DeploymentArtifact.
  • EquipmentTagPlan (Phase7Composer.cs:~80) gains EquipmentTagAlarmInfo? Alarm (record EquipmentTagAlarmInfo(string AlarmType, int Severity)); null ⇒ plain variable. Populated in Phase7Composer Select(...) and DeploymentArtifact.BuildEquipmentTagPlans; this is the byte-parity invariant, covered by a round-trip test.
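
A never-throw parser along the lines of ExtractTagAlarm might look like the sketch below, using System.Text.Json. The containing class name TagConfigAlarm is hypothetical, and one behavior is an assumption rather than something the design states: a wrong-typed alarmType or severity inside an otherwise valid "alarm" object falls back to the default instead of returning null.

```csharp
#nullable enable
using System;
using System.Text.Json;

public sealed record EquipmentTagAlarmInfo(string AlarmType, int Severity);

public static class TagConfigAlarm // hypothetical container for the helper
{
    // Never throws: absent or malformed input yields null (a plain variable).
    public static EquipmentTagAlarmInfo? ExtractTagAlarm(string? tagConfig)
    {
        if (string.IsNullOrWhiteSpace(tagConfig)) return null;
        try
        {
            using var doc = JsonDocument.Parse(tagConfig);
            var root = doc.RootElement;
            if (root.ValueKind != JsonValueKind.Object) return null;
            if (!root.TryGetProperty("alarm", out var alarm) ||
                alarm.ValueKind != JsonValueKind.Object)
                return null;

            // Defaults named in WS-2.
            var alarmType = "AlarmCondition";
            var severity = 500;

            // ASSUMPTION: wrong-typed fields fall back to the defaults.
            if (alarm.TryGetProperty("alarmType", out var t) &&
                t.ValueKind == JsonValueKind.String &&
                !string.IsNullOrWhiteSpace(t.GetString()))
                alarmType = t.GetString()!;

            if (alarm.TryGetProperty("severity", out var s) &&
                s.ValueKind == JsonValueKind.Number &&
                s.TryGetInt32(out var parsed))
                severity = parsed;

            return new EquipmentTagAlarmInfo(alarmType, severity);
        }
        catch (JsonException)
        {
            return null; // invalid JSON text
        }
    }
}
```

Sharing one helper between the composer and the artifact builder is what makes the parity test meaningful: both sides derive the plan's Alarm value from the same bytes through the same code.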

WS-3 — Materialize the condition node (reuse)

  • Phase7Applier.MaterialiseEquipmentTags (:~162-199) branches per tag: tag.Alarm is not null → SafeMaterialiseAlarmCondition(nodeId, parentEquipment, tag.Name, tag.Alarm.AlarmType, tag.Alarm.Severity) (the same method scripted alarms use; condition NodeId = the tag's equipment-scoped NodeId); else → SafeEnsureVariable(...) (today). RebuildAddressSpace already clears _alarmConditions, so redeploy teardown is covered.

WS-4 — The driver→server alarm seam (the new plumbing)

  • DriverInstanceActor: on connect, if _driver is IAlarmSource src, subscribe src.OnAlarmEvent += handler; the handler marshals to the actor thread via Self.Tell(new NativeAlarmRaised(e)) (mirrors the _dataChangeHandler pattern, :409/:456). Receive<NativeAlarmRaised> → Context.Parent.Tell(new AttributeAlarmPublished(DriverInstanceId, Args)). Unsubscribe on disconnect/teardown (mirror the OnDataChange unsubscribe). New messages NativeAlarmRaised (internal) + AttributeAlarmPublished (parallels AttributeValuePublished, :65). Phase B follows the mirror's model: subscribe the event and let the server filter by SourceNodeId; it does not drive SubscribeAlarmsAsync (Galaxy's feed auto-starts session-less in InitializeAsync and fires OnAlarmEvent regardless). Driving SubscribeAlarmsAsync from the materialized alarm-ref set, for drivers that gate on it, is a noted follow-up.
  • DriverHostActor: build _alarmNodeIdByDriverRef: (DriverInstanceId, FullName) → HashSet<NodeId> from equipment-tag plans where Alarm != null (alongside the existing _nodeIdByDriverRef, in the same apply pass). Add Receive<AttributeAlarmPublished> in the steady + applying states. Handler ForwardNativeAlarm: resolve nodeIds (unknown ref → drop silently, mirror behavior); per nodeId NativeAlarmProjector.Project(...) → snapshot → _publishActor.Tell(AlarmStateUpdate(nodeId, snapshot, ts)) ungated; then publish AlarmTransitionEvent to alerts Primary-gated via the existing _localRole the write-routing already tracks.
  • NativeAlarmProjector (new pure class; unit-tested): per-condition-NodeId prior-state (Active, Acked, Severity, Message); Project(nodeId, AlarmEventArgs) → AlarmConditionSnapshot by Kind:
    • Raise/Retrigger → Active=true, Acked=false, severity+message from event.
    • Acknowledge → Acked=true (keep prior Active), carry OperatorComment.
    • Clear → Active=false (keep prior Acked).
    • Unspecified → keep prior Active/Acked, refresh severity+message.
    • In every case: Enabled=true, Confirmed=true, Shelving=Unshelved (shelving is a server/local concern). Severity: AlarmSeverity 4-bucket → 1..1000 ushort (Low→200, Medium→500, High→700, Critical→900).
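
The projection rules above can be sketched as a small pure class. The shapes here are hypothetical and trimmed (AlarmEventSketch, SnapshotSketch, a string NodeId, no shelving field), and three choices are assumptions not stated in the design: the cold-start baseline is inactive and acked, Acknowledge and Clear keep the prior severity and message, and an out-of-range severity bucket maps to 500.

```csharp
#nullable enable
using System;
using System.Collections.Generic;

public enum AlarmTransitionKind { Unspecified = 0, Raise, Acknowledge, Clear, Retrigger }
public enum AlarmSeverity { Low, Medium, High, Critical }

// HYPOTHETICAL trimmed shapes; the real records carry more members.
public sealed record AlarmEventSketch(
    AlarmTransitionKind Kind, AlarmSeverity Severity, string Message, string? OperatorComment = null);

public sealed record SnapshotSketch(
    bool Active, bool Acked, ushort Severity, string Message, string? Comment,
    bool Enabled = true, bool Confirmed = true);

public sealed class NativeAlarmProjectorSketch
{
    // Prior state per condition NodeId (a string here for simplicity).
    private readonly Dictionary<string, SnapshotSketch> _prior = new();

    // 4-bucket severity → OPC UA 1..1000 range, per the map in WS-4.
    public static ushort MapSeverity(AlarmSeverity s) => s switch
    {
        AlarmSeverity.Low => (ushort)200,
        AlarmSeverity.Medium => (ushort)500,
        AlarmSeverity.High => (ushort)700,
        AlarmSeverity.Critical => (ushort)900,
        _ => (ushort)500, // ASSUMPTION: unknown bucket falls back to mid-range
    };

    public SnapshotSketch Project(string nodeId, AlarmEventSketch e)
    {
        var severity = MapSeverity(e.Severity);

        // Cold start: the first event seeds state.
        // ASSUMPTION: the baseline is inactive and acknowledged.
        if (!_prior.TryGetValue(nodeId, out var prior))
            prior = new SnapshotSketch(false, true, severity, e.Message, null);

        var next = e.Kind switch
        {
            AlarmTransitionKind.Raise or AlarmTransitionKind.Retrigger =>
                prior with { Active = true, Acked = false, Severity = severity, Message = e.Message },
            AlarmTransitionKind.Acknowledge =>
                prior with { Acked = true, Comment = e.OperatorComment },
            AlarmTransitionKind.Clear =>
                prior with { Active = false },
            // Unspecified: keep Active/Acked, refresh severity and message.
            _ => prior with { Severity = severity, Message = e.Message },
        };

        _prior[nodeId] = next;
        return next;
    }
}
```

Keeping the projector free of actor and SDK types is what makes it testable offline: a sequence of Raise, Acknowledge, Clear against one NodeId exercises every carry-over rule without a running server.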

WS-5 — Historian / alerts parity (reuse)

  • The AlarmTransitionEvent published in WS-4 is the same contract ScriptedAlarmHostActor publishes; HistorianAdapterActor + AdminUI /alerts consume it unchanged. Populate AlarmId = condition NodeId, EquipmentPath + AlarmName from the plan, TransitionKind = Kind.ToString(), AlarmTypeName = the configured OPC UA alarm type, User/Comment from the event.
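
The field population described above could be expressed as a single mapping function. All three record shapes below are hypothetical, trimmed stand-ins (the real AlarmTransitionEvent contract is not shown in this document); only the field names and their sources follow the WS-5 bullet.

```csharp
#nullable enable
using System;

public enum AlarmTransitionKind { Unspecified = 0, Raise, Acknowledge, Clear, Retrigger }

// HYPOTHETICAL inputs: what the plan and the driver event contribute.
public sealed record AlarmPlanSketch(
    string ConditionNodeId, string EquipmentPath, string AlarmName, string AlarmTypeName);
public sealed record AlarmEventSketch(
    AlarmTransitionKind Kind, string? User, string? OperatorComment);

// HYPOTHETICAL stand-in for the alerts-topic contract.
public sealed record AlarmTransitionEventSketch(
    string AlarmId, string EquipmentPath, string AlarmName,
    string TransitionKind, string AlarmTypeName, string? User, string? Comment);

public static class AlertsMapping
{
    public static AlarmTransitionEventSketch ToAlert(AlarmPlanSketch plan, AlarmEventSketch e) =>
        new(
            AlarmId: plan.ConditionNodeId,        // AlarmId = condition NodeId
            EquipmentPath: plan.EquipmentPath,    // from the plan
            AlarmName: plan.AlarmName,            // from the plan
            TransitionKind: e.Kind.ToString(),    // Kind.ToString()
            AlarmTypeName: plan.AlarmTypeName,    // configured OPC UA alarm type
            User: e.User,                         // from the event
            Comment: e.OperatorComment);          // from the event
}
```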

Data type / severity mapping

  • OPC UA alarm subtype string → SDK type via the existing CreateAlarmConditionOfType (OffNormalAlarm/DiscreteAlarm/LimitAlarm/base).
  • AlarmSeverity (4-bucket) → 1..1000 via the projector map above; the authored severity seeds the condition's initial severity at materialization (MaterialiseAlarmCondition's MapSeverity).

Error handling / edge cases

  • Unknown SourceNodeId (no materialized condition for the ref): drop silently — preserves GenericDriverNodeManager's documented behavior.
  • Byte-parity between Phase7Composer and DeploymentArtifact for alarm tags: parity round-trip test (the established invariant).
  • Redeploy double-delivery: DriverInstanceActor unsubscribes OnAlarmEvent on teardown; WriteAlarmCondition's delta-gate independently suppresses duplicate events; RebuildAddressSpace clears _alarmConditions.
  • Transition before condition materialized / after rebuild: unknown-ref drop handles it; the projector's prior-state dict is keyed by NodeId and tolerates a cold start (first event seeds state).
  • A tag with both a value and an alarm intent: Phase B treats an alarm-marked tag as a condition node only (not also a plain variable) — matching the retired mirror, where an alarm attribute surfaced as a condition.
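
The unknown-ref drop falls out naturally if the lookup returns an empty set instead of throwing. The sketch below shows one way to shape the (DriverInstanceId, FullName) → NodeId set map; AlarmRefMapSketch is a hypothetical name, NodeIds are plain strings here, and ordinal (case-sensitive) key comparison is an assumption.

```csharp
using System;
using System.Collections.Generic;

public sealed class AlarmRefMapSketch
{
    // One driver reference may back several condition nodes, hence a set.
    private readonly Dictionary<(string DriverInstanceId, string FullName), HashSet<string>> _map = new();

    // Called once per equipment-tag plan whose Alarm is not null.
    public void Add(string driverInstanceId, string fullName, string conditionNodeId)
    {
        var key = (driverInstanceId, fullName);
        if (!_map.TryGetValue(key, out var set))
            _map[key] = set = new HashSet<string>();
        set.Add(conditionNodeId);
    }

    // Unknown reference → empty result, so the caller's per-node loop simply
    // does nothing (the "drop silently" behavior).
    public IReadOnlyCollection<string> Resolve(string driverInstanceId, string sourceNodeId) =>
        _map.TryGetValue((driverInstanceId, sourceNodeId), out var set)
            ? (IReadOnlyCollection<string>)set
            : Array.Empty<string>();

    // Rebuilt on every apply pass, so stale refs do not survive a redeploy.
    public void Clear() => _map.Clear();
}
```

A transition that arrives before materialization or after a rebuild hits the empty branch and is discarded without any special-case code in the handler.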

Testing (no bUnit)

xUnit + Shouldly (offline):

  • ExtractTagAlarm: present / absent / malformed / defaults / unknown-keys-preserved.
  • Phase7Composer ↔ DeploymentArtifact byte-parity with alarm-bearing equipment tags.
  • NativeAlarmProjector: Raise→active+unacked, Acknowledge→acked, Clear→inactive, Retrigger, Unspecified, severity-bucket map, prior-state carry.
  • GalaxyDriver.OnAlarmFeedTransition populates Kind.
  • Akka.TestKit — DriverInstanceActor: a fake IAlarmSource driver fires OnAlarmEvent → the actor publishes AttributeAlarmPublished to its parent; unsubscribes on teardown.
  • Akka.TestKit — DriverHostActor: AttributeAlarmPublished resolves the ref → Tells AlarmStateUpdate with the projected snapshot; unknown ref dropped; alerts publish is Primary-gated (secondary suppresses).
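
The subscribe/forward/unsubscribe behavior that the DriverInstanceActor test targets can be illustrated without Akka. Everything below is a hypothetical stand-in: IAlarmSourceSketch exposes only the event, and AlarmForwarderSketch collapses the actor's two hops (Self.Tell, then Context.Parent.Tell) into one callback, so it shows the handler lifetime rather than the threading model.

```csharp
#nullable enable
using System;

public sealed record AlarmEventSketch(string SourceNodeId);

// HYPOTHETICAL trimmed IAlarmSource: only the event the design names.
public interface IAlarmSourceSketch
{
    event EventHandler<AlarmEventSketch>? OnAlarmEvent;
}

// Fake driver for offline tests: lets the test fire transitions on demand.
public sealed class FakeAlarmDriver : IAlarmSourceSketch
{
    public event EventHandler<AlarmEventSketch>? OnAlarmEvent;
    public bool HasSubscribers => OnAlarmEvent is not null;
    public void Fire(string sourceNodeId) =>
        OnAlarmEvent?.Invoke(this, new AlarmEventSketch(sourceNodeId));
}

// Stand-in for the actor side of the seam.
public sealed class AlarmForwarderSketch : IDisposable
{
    private readonly IAlarmSourceSketch _source;
    private readonly EventHandler<AlarmEventSketch> _handler;

    public AlarmForwarderSketch(
        string driverInstanceId,
        IAlarmSourceSketch source,
        Action<string, AlarmEventSketch> tellParent)
    {
        _source = source;
        // Keep the delegate in a field so -= removes exactly what += added.
        _handler = (_, e) => tellParent(driverInstanceId, e);
        _source.OnAlarmEvent += _handler;
    }

    // Teardown: after this, further driver events are not forwarded.
    public void Dispose() => _source.OnAlarmEvent -= _handler;
}
```

Storing the handler in a field matters for the redeploy case: unsubscribing with a freshly created lambda would leave the original subscription in place and cause double delivery.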

Live docker-dev /run (user-driven; the agent does NOT sign in) — the gate:

  • Author a Galaxy alarm equipment tag (raw TagConfig carrying the alarm object) on the live-gateway-backed MAIN-galaxy-eq; deploy.
  • Trip the Galaxy alarm → a Part 9 AlarmConditionState appears active under the equipment via Client.CLI alarms (and read); the AdminUI /alerts row appears.
  • Clear → condition goes inactive. (Device-ack round-trip is the deferred follow-up, not part of this gate.)

Suggested slicing (for the plan)

  1. WS-1: AlarmTransitionKind + AlarmEventArgs.Kind + Galaxy populates it (small/standard; touches a Core.Abstractions contract → ripples to implementers).
  2. WS-2: ExtractTagAlarm + EquipmentTagPlan.Alarm in both composer + artifact + parity test (high-risk: data-contract byte-parity).
  3. WS-3: MaterialiseEquipmentTags alarm branch (standard; reuses MaterialiseAlarmCondition).
  4. WS-4a: NativeAlarmProjector (standard; pure, fully TDD-able offline).
  5. WS-4b: DriverInstanceActor OnAlarmEvent subscription + publish (high-risk: actor state machine + driver-thread marshaling).
  6. WS-4c: DriverHostActor alarm map + ForwardNativeAlarm + Primary-gated alerts publish (high-risk: actor/concurrency/redundancy gate).
  7. WS-5: wire AlarmTransitionEvent fields (folds into WS-4c; verify historian + /alerts consume it).
  8. Docs — document the TagConfig alarm schema (a Galaxy/alarms doc note).
  9. Live /run — the gate above (user-driven).

Deferred follow-ups (explicitly out of Phase B)

  • Inbound device-ack: client Acknowledge → IAlarmSource.AcknowledgeAsync → AVEVA (its own inbound pipeline, mirrors the write-through work).
  • SubscribeAlarmsAsync from the materialized alarm-ref set for drivers that gate their feed on it (Galaxy doesn't).
  • AdminUI Galaxy picker pre-fill of the alarm object from discovery (IsAlarm/SecurityClass already known) — a UI convenience; raw-JSON authoring works without it and avoids live-only Razor binding risk.
  • Carrying the raw OPC UA severity (vs. the 4-bucket) end-to-end.

Hard rules (carried into implementation)

  • Stage by path; never git add .. Never stage sql_login.txt, src/Server/ZB.MOM.WW.OtOpcUa.Host/pki/, pending.md, current.md, or docker-dev/docker-compose.yml.
  • Never echo the gateway API key or any secret into a tracked file.
  • No force-push, no --no-verify.
  • No Configuration entity / EF migration change (the TagConfig route is chosen specifically to honor this).
  • No bUnit; Razor/JS proven only by live /run.
  • Build on a feature branch off master.

Authoritative touched-code list (for planning)

  • src/Core/ZB.MOM.WW.OtOpcUa.Core.Abstractions/IAlarmSource.cs (AlarmEventArgs.Kind, new enum)
  • src/Drivers/ZB.MOM.WW.OtOpcUa.Driver.Galaxy/GalaxyDriver.cs (OnAlarmFeedTransition populates Kind)
  • src/Server/ZB.MOM.WW.OtOpcUa.OpcUaServer/Phase7Composer.cs (EquipmentTagPlan.Alarm, ExtractTagAlarm)
  • src/Server/ZB.MOM.WW.OtOpcUa.Runtime/Drivers/DeploymentArtifact.cs (BuildEquipmentTagPlans parity)
  • src/Server/ZB.MOM.WW.OtOpcUa.OpcUaServer/Phase7Applier.cs (MaterialiseEquipmentTags alarm branch)
  • src/Server/ZB.MOM.WW.OtOpcUa.Runtime/Drivers/DriverInstanceActor.cs (OnAlarmEvent sub + publish)
  • src/Server/ZB.MOM.WW.OtOpcUa.Runtime/Drivers/DriverHostActor.cs (alarm map + ForwardNativeAlarm + gated publish)
  • NEW NativeAlarmProjector (Runtime or Commons) + its tests
  • OpcUaPublishActor / OtOpcUaNodeManager: no change (reuse AlarmStateUpdate/WriteAlarmCondition)
  • A docs note for the TagConfig alarm schema