From 65a5f6493111c3d783c7f15254ac8539e5a1c2fb Mon Sep 17 00:00:00 2001 From: Joseph Doherty Date: Thu, 30 Apr 2026 15:02:48 -0400 Subject: [PATCH] =?UTF-8?q?docs:=20plan=20=E2=80=94=20alarms=20over=20the?= =?UTF-8?q?=20mxaccessgw=20gateway?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Coordinated cross-repo epic to restore the three v1 alarm capabilities that PR 7.2 regressed: rich MxAccess alarm-event metadata, native Acknowledge semantics, and the IAlarmHistorianWriter write-back path. Architectural split: gateway owns MxAccess transport (new OnAlarmTransition event family + AcknowledgeAlarm / QueryActiveAlarms / WriteHistorianEvent RPCs); lmxopcua keeps the OPC UA Part 9 state machine, ACL/role enforcement, and multi-source aggregation. The existing value-driven sub-attribute path stays as fallback. 10 PRs total — 5 in mxaccessgw, 5 in lmxopcua — sequenced so each side's work is independently reviewable. End-of-epic gate is a parity matrix run with five new alarm scenarios. Co-Authored-By: Claude Opus 4.7 (1M context) --- docs/plans/alarms-over-gateway.md | 617 ++++++++++++++++++++++++++++++ 1 file changed, 617 insertions(+) create mode 100644 docs/plans/alarms-over-gateway.md diff --git a/docs/plans/alarms-over-gateway.md b/docs/plans/alarms-over-gateway.md new file mode 100644 index 0000000..dfb870e --- /dev/null +++ b/docs/plans/alarms-over-gateway.md @@ -0,0 +1,617 @@ +# Plan — alarms over the mxaccessgw gateway + +Coordinated epic across two repos: + +- **`lmxopcua`** (this repo) — `c:\Users\dohertj2\Desktop\lmxopcua\` +- **`mxaccessgw`** — `c:\Users\dohertj2\Desktop\mxaccessgw\` + +## Why + +PR 7.2 (2026-04-30, commit `ae7106d`) retired the in-process v1 Galaxy stack +(`Driver.Galaxy.Host` / `.Proxy` / `.Shared` + `OtOpcUaGalaxyHost` Windows +service) and migrated Galaxy access to the in-process `GalaxyDriver` over +mxaccessgw's gRPC. In doing so, three v1 capabilities regressed: + +1. 
**Native MxAccess alarm-event metadata** — v1's `GalaxyAlarmTracker` + surfaced rich alarm transitions (operator comment, original raise time, + ack time, alarm category, native severity). The current architecture + reconstructs Part 9 transitions by subscribing to four sub-attribute + value updates (`InAlarm`, `Acked`, `Priority`, `Description`) — fine for + raise/clear but loses everything else. +2. **Native MxAccess Acknowledge semantics** — v1 called the MxAccess ack + API directly from `GalaxyAlarmTracker`. Today, OPC UA acks are written + into the `AckMsgWriteRef` sub-attribute — semantically valid but a + round-trip through the value path that loses operator-comment fidelity. +3. **Alarm-historian write-back path** — `GalaxyHistorianWriter` + implemented `IAlarmHistorianWriter` and forwarded scripted-alarm and + Galaxy-native alarm transitions back to AVEVA Historian via + `aahClientManaged`. PR 7.2 deleted it. `Phase7Composer.ResolveHistorianSink` + now finds no writer and falls back to `NullAlarmHistorianSink`, so + **scripted-alarm transitions queue locally and silently discard.** + (Galaxy-native alarms still reach AVEVA Historian via the Galaxy template's + own `HistorizeToAveva` toggle, independent of our sink — that path + wasn't broken.) + +`gateway.md` (mxaccessgw, line 8) explicitly commits the gateway to "full +MXAccess parity… preserve MXAccess behavior first… **native MXAccess event +families**." Today's gateway proto exposes only data-change families. Closing +the alarm regression and fulfilling that parity statement are the same task. + +## Goals + +- Restore all three regressed capabilities to feature parity with v1. +- Keep the v2 architectural split — gateway owns MxAccess transport; + lmxopcua owns OPC UA Part 9 semantics, ACL/role enforcement, and + multi-source aggregation (driver-native + scripted + sub-attribute). +- Preserve the value-driven sub-attribute path as a fallback for Galaxy + templates that don't carry `$Alarm*` extensions. 
+- Land the work as a sequence of small, independently-reviewable PRs that + alternate between repos in dependency order. + +## Non-goals + +- Reimplementing the Part 9 state machine inside mxaccessgw. The gateway + stays UA-agnostic. +- Reworking the LDAP role-grant or OPC UA AlarmAck ACL surface — those + already exist and route through `Server/Alarms/IAlarmAcknowledger`. +- Adding alarm support to non-Galaxy drivers (AbCip / FOCAS / OpcUaClient + already have their own `IAlarmSource` implementations; Modbus / S7 / + AbLegacy / TwinCAT don't have a native alarm bus and are out of scope). +- Altering Galaxy template conventions or `$Alarm*` extensions in the + customer's Galaxy. + +## Before → after + +**Today (post-PR 7.2):** + +``` +MxAccess COM (gateway worker) + │ data-change events only on the MxEvent stream + ▼ +GalaxyDriver (no IAlarmSource) + │ IWritable / ISubscribable / ITagDiscovery only + ▼ +DriverNodeManager + ├─ subscribes to four $Alarm* sub-attributes per condition + ├─ AlarmConditionService rebuilds Part 9 transitions from value updates + └─ DriverWritableAcknowledger writes AckMsgWriteRef on ack + +Phase7Composer.ResolveHistorianSink → NullAlarmHistorianSink + (scripted-alarm transitions queue → silently discarded) +``` + +**After this epic:** + +``` +MxAccess COM (gateway worker) + │ data-change ──┐ + │ alarm-transition │ + │ write-complete ├─► single MxEvent stream (new family added) + ▼ ▼ +GalaxyDriver : ITagDiscovery, IReadable, IWritable, ISubscribable, IRediscoverable, + IHostConnectivityProbe, IAlarmSource ← restored + ├─ EventPump dispatches OnAlarmTransition family → IAlarmSource.OnAlarmEvent + ├─ AcknowledgeAsync → gateway RPC AcknowledgeAlarm + └─ QueryActiveAlarmsAsync → gateway RPC QueryActiveAlarms (ConditionRefresh) + +DriverNodeManager + ├─ rich alarm events from IAlarmSource.OnAlarmEvent → AlarmConditionService + ├─ value-driven sub-attribute path STILL WORKS for templates without $Alarm + ├─ DriverWritableAcknowledger preserved 
as fallback for the value path + └─ ScriptedAlarmEngine output continues to feed AlarmConditionService + +Phase7Composer.ResolveHistorianSink → GatewayAlarmHistorianWriter + ├─ scripted-alarm transitions → SqliteStoreAndForwardSink + └─ drain worker → gateway RPC WriteHistorianEvent → AVEVA Historian +``` + +## Architecture decisions + +**D1 — Where the Part 9 state machine runs.** Stays in lmxopcua's +`AlarmConditionService`. Gateway is UA-agnostic. ScriptedAlarmEngine produces +Part 9 transitions with no MxAccess origin; the aggregator must live where all +sources converge. + +**D2 — Where authz on Acknowledge runs.** Stays in lmxopcua. The OPC UA +`AlarmConditionState.OnAcknowledge` delegate already checks the session's +roles for `AlarmAck` against the LDAP/role-grant ACL. The gateway should +never be reachable in a way that bypasses that check. + +**D3 — How rich alarm events reach OPC UA clients.** New `MxEventFamily` +on the existing `StreamEvents` RPC (no second stream). Adds latency +parity with data-change events, reuses the bounded-channel + worker-side +delivery semantics already documented in `gateway.md`. + +**D4 — Sub-attribute fallback path stays.** Some Galaxy templates won't +have `$Alarm*` extensions yet; the existing value-driven path remains the +only way to surface alarms for those templates. Both paths feed +`AlarmConditionService`. Driver-native events take precedence when both +are present (more authoritative, lower latency). + +**D5 — Where the historian writer lives.** As a new RPC on the gateway +(`WriteHistorianEvent`). The Wonderware sidecar's existing +`WriteAlarmEvents` IPC slot stays unwired and is deleted as part of this +epic — the gateway is the canonical place for "write to AVEVA Historian" +since the gateway already owns AVEVA-COM access. This also means the +sidecar (long term) only does *reads* and could potentially retire entirely +if the historian-client REST migration (`docs/plans/...`) lands. 
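Decision D4's precedence rule can be sketched as a pure resolution step. This is a minimal illustration, not the planned `AlarmConditionService` change: `AlarmOrigin`, `Transition`, and `TransitionResolver` are hypothetical names, and the sketch assumes that both candidates for one transition are collected in the same resolution window, which this plan does not specify.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical types for illustration only; none of these exist in lmxopcua.
// Higher enum value = more authoritative source.
public enum AlarmOrigin { SubAttribute = 0, DriverNative = 1 }

public sealed record Transition(string ConditionId, string Kind, AlarmOrigin Origin, string Comment);

public static class TransitionResolver
{
    // Collapses candidates that describe the same (condition, kind) transition
    // into one record, keeping the most authoritative origin.
    // Assumption: all candidates for one transition arrive in the same window.
    public static IReadOnlyList<Transition> Resolve(IEnumerable<Transition> window) =>
        window
            .GroupBy(t => (t.ConditionId, t.Kind))                    // groups keep first-seen order
            .Select(g => g.OrderByDescending(t => t.Origin).First())  // driver-native wins
            .ToList();
}
```

With a sub-attribute `Acknowledge` and a driver-native `Acknowledge` for the same condition in one window, `Resolve` returns a single record carrying the driver-native payload, so only one Part 9 transition would be raised from it.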
+ +## Track A — mxaccessgw changes + +All five PRs land in `c:\Users\dohertj2\Desktop\mxaccessgw\`. + +### PR A.1 — proto: add alarm-transition event family + ack/query RPCs + +**Files** (`src\MxGateway.Contracts\Protos\mxaccess_gateway.proto`): + +1. Extend `MxEventFamily` (line 403): + ``` + MX_EVENT_FAMILY_ON_ALARM_TRANSITION = 5; + ``` + +2. Extend `MxEvent.body` oneof (line 395) with: + ``` + OnAlarmTransitionEvent on_alarm_transition = 24; + ``` + +3. New message `OnAlarmTransitionEvent` after the existing event-family + bodies (line 425+). Carry the full MxAccess alarm payload — alarm name, + source object reference, alarm-type-name (e.g. "AnalogLimitAlarm.HiHi"), + transition kind enum (`Raise` / `Acknowledge` / `Clear`), severity (raw + numeric — keep MxAccess scale; mapping to OPC UA 0-1000 happens + server-side in lmxopcua), `original_raise_timestamp`, + `transition_timestamp`, optional `operator_user`, optional + `operator_comment`, alarm `category` string, alarm `description`. Mirror + the field set documented in v1's `GalaxyAlarmTracker`. + +4. New RPC on `MxAccessGateway` service (line 11): + ``` + rpc AcknowledgeAlarm(AcknowledgeAlarmRequest) returns (AcknowledgeAlarmReply); + rpc QueryActiveAlarms(QueryActiveAlarmsRequest) returns (stream ActiveAlarmSnapshot); + ``` + + `AcknowledgeAlarmRequest` carries `session_id`, `alarm_full_reference`, + `comment`, `user_principal`. Reply carries `MxStatusProxy`. + + `QueryActiveAlarmsRequest` carries `session_id`, optional + `alarm_filter_prefix` (for ConditionRefresh on a sub-tree). + `ActiveAlarmSnapshot` carries the same fields as + `OnAlarmTransitionEvent` plus `current_state` enum (`Active` / + `ActiveAcked` / `Inactive`). + +**Tests** (`MxGateway.Tests` — proto/codegen sanity): + +- Round-trip Serialize→Deserialize for the new messages with all-fields + populated and empty-optional-fields cases. +- `MxEvent.body` oneof selection guard — supplying multiple bodies + rejected. 
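The relationship between the three transition kinds on `OnAlarmTransitionEvent` and the three `current_state` values on `ActiveAlarmSnapshot` can be sketched as a small fold. The enum and class names below are hypothetical stand-ins for the generated proto types, and two behaviours are assumptions rather than anything this plan states: an `Acknowledge` on an alarm that is not `Active` leaves the state unchanged, and `Clear` always yields `Inactive`.

```csharp
using System;

// Hypothetical stand-ins for the proto-generated enums described in PR A.1.
public enum TransitionKind { Raise, Acknowledge, Clear }
public enum AlarmState { Inactive, Active, ActiveAcked }

public static class AlarmStateFold
{
    // Applies one transition to the current state and returns the next state.
    public static AlarmState Apply(AlarmState current, TransitionKind kind) => kind switch
    {
        TransitionKind.Raise => AlarmState.Active,
        // Assumption: acknowledging anything other than an unacked active alarm is a no-op.
        TransitionKind.Acknowledge => current == AlarmState.Active ? AlarmState.ActiveAcked : current,
        // Assumption: clearing always returns the alarm to Inactive.
        TransitionKind.Clear => AlarmState.Inactive,
        _ => throw new ArgumentOutOfRangeException(nameof(kind)),
    };
}
```

Folding `Raise`, then `Acknowledge`, over an `Inactive` alarm ends in `ActiveAcked`, which is the state the PR A.3 integration test expects after a successful `AcknowledgeAlarm` call.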
+ +**Out of scope:** worker-side wiring (PR A.2), gateway-side dispatch (PR A.3). +PR A.1 is a pure contract-surface change; nothing functional yet. + +### PR A.2 — worker: subscribe to MxAccess alarm event source + +**Files** (`src\MxGateway.Worker\` — net48/x86): + +The MxAccess Toolkit exposes alarm subscription separately from data +subscription. Per AVEVA's MXAccess C++ Toolkit reference (canonical doc +referenced from `gateway.md`), alarm events arrive through the +`IAlarmEventSink` interface registered against the MxAccess `Alarms` +collection of an open session, OR via the MxAccess "alarm provider" +subscription pattern (depends on Toolkit version on the worker host — +verify against the version actually deployed in the worker bin during +PR A.2). + +1. Worker subscribes to MxAccess alarms once per session, with a single + sink that fans out into the same bounded channel the data-change pump + uses (`MxGateway.Worker\Eventing\EventChannel.cs` or whatever the worker + currently calls its sink — verify name during the PR). +2. Sink translates each MxAccess alarm event into a `WorkerEvent` proto + (defined in `mxaccess_worker.proto`) carrying the new + `OnAlarmTransitionEvent` body. Reuses the existing `worker_sequence` + counter so ordering is preserved across families. +3. Worker honours the same backpressure rules as data-change events — + newest-dropped on full channel, single dropped-counter metric per + family. + +**Tests** (`MxGateway.Worker.Tests`): + +- Fake `IAlarmEventSink` source emits canned transitions; assert the + worker forwards each as the right `WorkerEvent` shape. +- Cancellation test — closing the session unsubscribes from MxAccess + alarms cleanly (no leaked sinks if the worker is recycled mid-session). + +**Out of scope:** any gateway-side dispatch, any RPC handler — PR A.2 +is worker-internal. + +### PR A.3 — gateway: dispatch OnAlarmTransition + implement AcknowledgeAlarm + +**Files** (`src\MxGateway.Server\`): + +1. 
The session-level event multiplexer (`Sessions\SessionEventStream.cs` + or equivalent — verify name during PR) recognizes the new + `WorkerEvent` body and forwards as an `MxEvent` with family + `MX_EVENT_FAMILY_ON_ALARM_TRANSITION` to the gRPC + `StreamEvents` consumer. +2. New RPC handler `AcknowledgeAlarm` builds an MxAccess `WorkerCommand` + carrying an `AlarmAcknowledgeCommand` (new in `mxaccess_worker.proto` + under PR A.1). Forwarded to the worker; reply mapped to + `AcknowledgeAlarmReply` with the MxAccess `MxStatus` proxy populated. +3. AuthN — same API-key + scope check as existing RPCs. Add a new scope + `invoke:alarm-ack` (mirrors `invoke:write` granularity); existing keys + without it return `PERMISSION_DENIED`. + +**Tests** (`MxGateway.Tests`, `MxGateway.IntegrationTests`): + +- Unit: dispatch test — fake worker emits an `AlarmTransition` event; + assert the gateway forwards it on the live `StreamEvents` channel of + every subscribed session. +- Integration: end-to-end against the real worker (requires the parity + rig setup — see `docs\v2\Galaxy.ParityRig.md` in lmxopcua for the + MxAccess-installed dev box prerequisites). Trigger a real Galaxy + alarm, assert the gateway emits `OnAlarmTransition`. Acknowledge via + the new RPC, assert the alarm transitions to `ActiveAcked` and an + `Acknowledge` transition event is emitted back. +- AuthN: existing key without `invoke:alarm-ack` scope rejected. + +### PR A.4 — gateway: ConditionRefresh snapshot via QueryActiveAlarms + +**Files** (`src\MxGateway.Server\`, `src\MxGateway.Worker\`): + +1. Worker exposes a `QueryActiveAlarmsCommand` that walks the session's + active-alarm collection and streams snapshots back through the + existing command-reply channel. The MxAccess Toolkit's + `Alarms.GetActive()` (verify exact API name during PR) is the + underlying call. +2. Gateway RPC `QueryActiveAlarms` opens a server-streaming reply, + batches snapshots through. +3. 
AuthN — new scope `invoke:alarm-query` (separate from ack so a + read-only client can refresh without ack rights). + +**Tests:** + +- Worker-test: synthetic active set of 0 / 1 / 100 alarms; assert + pagination respects worker channel capacity. +- Integration: against the parity rig, assert a ConditionRefresh after + reconnect returns every alarm currently `Active` or `ActiveAcked` in + the Galaxy. + +### PR A.5 — gateway: WriteHistorianEvent RPC for sink write-back + +**Files** (`src\MxGateway.Server\`, `src\MxGateway.Worker\`, +`src\MxGateway.Contracts\Protos\mxaccess_gateway.proto`). + +1. New RPC `WriteHistorianEvent(WriteHistorianEventRequest) → + WriteHistorianEventReply`. Request carries an + `AlarmHistorianRecord` mirroring the existing + `Core.AlarmHistorian.AlarmHistorianEvent` payload (alarm id, + equipment path, alarm name, alarm-type-name, severity, event kind, + message, user, comment, timestamp). +2. Worker maps the record onto `aahClientManaged`'s alarm-event + write API (the same path v1's `GalaxyHistorianWriter` used). Worker + batches up to N records per write to amortize the COM round-trip. +3. AuthN — new scope `invoke:historian-write`. Cross-cutting with + `invoke:write` — keys for OPC UA servers that publish historian + data must hold both. + +**Tests:** + +- Worker test: fake `aahClientManaged` writer; assert batching + semantics + retry-on-Bad-status-code behaviour matches v1's + `GalaxyHistorianWriter` (per-row outcome reporting). +- Integration: write a record, query it back via existing Historian + read APIs, assert round-trip fidelity. + +**Sequencing within Track A:** A.1 → A.2 → A.3 → A.4 → A.5. A.1 is +mechanical; A.2 + A.3 are the load-bearing changes that unlock lmxopcua +side. A.4 + A.5 can ship after lmxopcua starts consuming A.3 output. + +## Track B — lmxopcua changes + +All five PRs land in `c:\Users\dohertj2\Desktop\lmxopcua\`. Each B-PR +depends on a specific A-PR — see the sequencing matrix below. 
+ +### PR B.1 — EventPump: dispatch OnAlarmTransition family + +**Depends on:** A.1 (proto), A.3 (gateway dispatching the new family). + +**Files:** + +- `src\ZB.MOM.WW.OtOpcUa.Driver.Galaxy\Runtime\EventPump.cs:160` — + current `Dispatch(MxEvent ev)` returns early for any non-`OnDataChange` + family. Add a branch: + ```csharp + switch (ev.Family) { + case MxEventFamily.OnDataChange: DispatchDataChange(ev); break; + case MxEventFamily.OnAlarmTransition: DispatchAlarmTransition(ev); break; + default: return; + } + ``` +- New `DispatchAlarmTransition` translates the proto event into an + `AlarmEventArgs` (existing type from `Core.Abstractions`) and raises an + internal event the driver subscribes to. +- New `MxAccessSeverityMapper` in `Driver.Galaxy\Runtime\` — maps the + MxAccess raw severity into the `AlarmSeverity` enum + the OPC UA + numeric severity (250 / 500 / 700 / 900 ladder per v1's + `AlarmTracking.md`). + +**Tests** (`tests\ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Tests\Runtime\`): + +- `EventPumpAlarmTests` — feed three synthetic MxEvents (raise / ack / + clear); assert each fires `OnAlarmEvent` on the driver with correct + payload. +- Severity-mapping table tests — every documented MxAccess severity + level → expected (`AlarmSeverity`, OPC UA numeric) tuple. + +### PR B.2 — GalaxyDriver re-implements IAlarmSource + +**Depends on:** A.3 (`AcknowledgeAlarm` RPC available), B.1 (event +dispatch). + +**Files:** + +- `src\ZB.MOM.WW.OtOpcUa.Driver.Galaxy\GalaxyDriver.cs:28` — extend the + class declaration: + ```csharp + public sealed class GalaxyDriver + : IDriver, ITagDiscovery, IReadable, IWritable, ISubscribable, + IRediscoverable, IHostConnectivityProbe, IAlarmSource, IDisposable + ``` +- Implement the four `IAlarmSource` members: + - `SubscribeAlarmsAsync` — no-op returning a sentinel handle. The + driver is already subscribed for data; alarm events arrive on the + same event stream once the gateway emits the new family. 
(Same + pattern AbCip uses today — see `Driver.AbCip\AbCipDriver.cs:208`.) + - `UnsubscribeAlarmsAsync` — no-op. + - `OnAlarmEvent` — wired to the EventPump branch added in B.1. + - `AcknowledgeAsync` — calls the new gateway RPC via the + `IGalaxyAlarmAcknowledger` abstraction (new file, mirrors the + `IGalaxyDataWriter` pattern), with `GatewayGalaxyAlarmAcknowledger` + as the production implementation in `Runtime\`. Resilience wrapping + via `AlarmSurfaceInvoker` per existing pattern. +- `DriverInstanceFactory` for Galaxy registers + `IGalaxyAlarmAcknowledger` alongside the existing data writer. + +**Tests:** + +- Subscribe-noop returns a non-null handle; unsubscribe accepts it. +- Acknowledge — fake `IGalaxyAlarmAcknowledger` records the call; assert + the request shape and resilience-pipeline routing. +- End-to-end test in `Driver.Galaxy.Tests` — fake gateway emits a + raise-then-ack event sequence; assert the driver fires `OnAlarmEvent` + twice with matching alarm-id correlation. + +### PR B.3 — DriverNodeManager: route to driver-native when present + +**Depends on:** B.2. + +**Files:** + +- `src\ZB.MOM.WW.OtOpcUa.Server\OpcUa\DriverNodeManager.cs` — when + registering an `AlarmConditionState` for a Galaxy variable, check + whether the driver is `IAlarmSource`. If yes, prefer the + `OnAlarmEvent`-driven path; the value-driven sub-attribute path + becomes the secondary path that handles transitions the driver-native + stream missed (network blip, gateway restart, gw missing the + `$Alarm*` extension on this template). +- `Server\Alarms\AlarmConditionService` — already accepts events from + multiple sources; only addition is a `DriverEventOrigin` enum on + internal transitions so the dedup logic prefers the richer + driver-native record over a stale sub-attribute synthesis. +- `IAlarmAcknowledger` resolution in `DriverNodeManager` — + prefer the driver's `IAlarmSource.AcknowledgeAsync` over + `DriverWritableAcknowledger` when both are available. 
Keep + `DriverWritableAcknowledger` as the fallback for templates without + `$Alarm*` extensions. + +**Tests:** + +- Two-source-fan-in test: same alarm condition receives both a + driver-native ack event and a sub-attribute value update for the same + transition; assert no duplicate Part 9 transition fires. +- Acknowledger routing — driver implements `IAlarmSource` → + ack-via-RPC; driver implements only `IWritable` → ack-via-write + (existing path). + +### PR B.4 — IAlarmHistorianWriter via gateway + +**Depends on:** A.5 (`WriteHistorianEvent` RPC available). + +**Files:** + +- New `src\ZB.MOM.WW.OtOpcUa.Driver.Galaxy\Runtime\GatewayAlarmHistorianWriter.cs` + implementing `IAlarmHistorianWriter`. Calls the gateway RPC from + Track A.5 with the same batch + per-row outcome semantics v1's + `GalaxyHistorianWriter` exposed. +- `GalaxyDriverFactory` registers it as a singleton tied to the + `DriverInstance`. +- `Server\Phase7\Phase7Composer.ResolveHistorianSink` — already scans + registered drivers for an `IAlarmHistorianWriter`. Once GalaxyDriver + exposes one, `SqliteStoreAndForwardSink` boots with a real writer + attached and the `NullAlarmHistorianSink` fallback no longer applies + on Galaxy installs. +- Delete `WriteAlarmEventsRequest` / `WriteAlarmEventsReply` / + `IAlarmEventWriter` from the Wonderware sidecar + (`src\ZB.MOM.WW.OtOpcUa.Driver.Historian.Wonderware\Ipc\Contracts.cs`, + `Ipc\HistorianFrameHandler.cs`, `Ipc\Framing.cs`). The historian + sidecar becomes read-only — matches the audit done earlier. + +**Tests:** + +- `GatewayAlarmHistorianWriter` against a fake gRPC server — single + record, batch, per-row failure modes (Ack / RetryPlease / + PermanentFail). +- `Phase7Composer` end-to-end — register a Galaxy driver, assert + `ResolveHistorianSink` picks `SqliteStoreAndForwardSink` with the + new writer attached. + +### PR B.5 — docs + memory housekeeping + +**Depends on:** B.1 / B.2 / B.3 / B.4 all green on the parity rig. 
+ +**Files:** + +- `docs\drivers\Galaxy.md` — current text says the driver implements + five capability interfaces; update to seven (`IAlarmSource`, + `IAlarmHistorianWriter`-via-companion). +- `docs\AlarmTracking.md` — promote a fresh top-level doc that + describes the v2-final architecture (driver-native primary path + + sub-attribute fallback + scripted-alarm aggregation). Cross-link from + `docs\README.md`. The v1 archive stays as historical record. +- `docs\v1\AlarmTracking.md` — extend the existing historical banner + with "Restored to functional parity in this epic — see + `docs\AlarmTracking.md` for current state." +- Memory entries (`C:\Users\dohertj2\.claude\projects\…\memory\`): + - Update `project_galaxy_via_mxgateway.md` — add the alarm path + restoration. + - Update `project_server_history_alarm_subsystems.md` — note that + `Phase7Composer.ResolveHistorianSink` now finds a writer on + Galaxy installs. +- `docs\plans\alarms-over-gateway.md` (this file) — banner the doc + `✅ Completed YYYY-MM-DD — historical record.` matching the existing + v2-mxgw plan retirement convention. + +## Sequencing matrix + +``` +Track A (mxaccessgw) Track B (lmxopcua) +───────────────────────── ───────────────────────── +A.1 proto (waits) + │ + ├──────────────────────────► B.1 EventPump branch +A.2 worker subscription │ uses proto types only + │ │ unit-testable without live gw + │ +A.3 gateway dispatch + ack RPC ──►B.2 GalaxyDriver : IAlarmSource + │ │ + │ ──►B.3 DriverNodeManager routing + │ +A.4 ConditionRefresh │ (B.3 closes the loop with A.4 + │ once ConditionRefresh wired) + │ +A.5 WriteHistorianEvent ─────────►B.4 GatewayAlarmHistorianWriter + │ + sidecar write-path deletion + ──►B.5 docs + memory +``` + +A.1 + B.1 can land in parallel (B.1's tests use proto types without +needing a running gateway). B.1 stays inert until A.3 ships the gateway +dispatch — which is fine; the dispatch branch is a no-op until events +arrive. 
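The sequencing above can also be checked mechanically. The sketch below is an illustration, not part of either repo: it encodes the Track A chain and each B-PR's "Depends on" line as a dependency map, then groups the PRs into waves in which every PR has all of its dependencies already landed. `PrSequencing` and `Waves` are hypothetical names. The map treats B.1's dependency on A.3 as a hard edge, which is stricter than the plan requires, since B.1 may merge early and stay inert until A.3 ships.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class PrSequencing
{
    // Each entry reads "PR -> PRs it depends on", taken from this plan.
    public static readonly Dictionary<string, string[]> DependsOn = new()
    {
        ["A.1"] = Array.Empty<string>(),
        ["A.2"] = new[] { "A.1" },
        ["A.3"] = new[] { "A.2" },
        ["A.4"] = new[] { "A.3" },
        ["A.5"] = new[] { "A.4" },
        ["B.1"] = new[] { "A.1", "A.3" },
        ["B.2"] = new[] { "A.3", "B.1" },
        ["B.3"] = new[] { "B.2" },
        ["B.4"] = new[] { "A.5" },
        ["B.5"] = new[] { "B.1", "B.2", "B.3", "B.4" },
    };

    // Layered topological sort: each wave holds the PRs whose dependencies
    // were all satisfied by earlier waves. Throws if no progress is possible.
    public static List<List<string>> Waves(IReadOnlyDictionary<string, string[]> dependsOn)
    {
        var done = new HashSet<string>();
        var waves = new List<List<string>>();
        while (done.Count < dependsOn.Count)
        {
            var ready = dependsOn.Keys
                .Where(pr => !done.Contains(pr) && dependsOn[pr].All(done.Contains))
                .OrderBy(pr => pr, StringComparer.Ordinal)
                .ToList();
            if (ready.Count == 0)
                throw new InvalidOperationException("Dependency cycle or unknown PR.");
            done.UnionWith(ready);
            waves.Add(ready);
        }
        return waves;
    }
}
```

Calling `PrSequencing.Waves(PrSequencing.DependsOn)` yields seven waves under these edges, with A.4 and B.1 sharing the fourth wave and B.5 alone in the last.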
+ +## Test gates + +Per PR: unit tests pass + build green + analyzer clean (Roslyn +OTOPCUA0001 still wraps every alarm-capability call through +`AlarmSurfaceInvoker`). + +End-of-epic gate: re-run the parity rig (`docs\v2\Galaxy.ParityRig.md`) +with these scenarios added: + +1. **Native alarm raise** — Galaxy `$Alarm*` raise with operator-time + metadata appears as an OPC UA Part 9 transition with full payload + (no longer reconstructed from sub-attribute writes). +2. **Native ack** — OPC UA client acks; assert the gateway records the + ack against MxAccess directly (not via sub-attribute write); operator + comment present in the resulting `Acknowledged` transition. +3. **ConditionRefresh after reconnect** — disconnect the GalaxyDriver, + raise three alarms in Galaxy, reconnect; assert all three appear in + the next ConditionRefresh. +4. **Historian write-back** — fire a scripted alarm; assert it arrives in + AVEVA Historian via the gateway path (use the existing Historian + sidecar's read API to query it back). +5. **Sub-attribute fallback still works** — disable `IAlarmSource` on + the GalaxyDriver via test seam, fire a sub-attribute value change; + assert Part 9 transition still raised. + +Soak target: 24h × 1k tags (light) — same parity-rig harness but +extended to also subscribe to alarms. Pass criterion: zero dropped +alarm transitions, zero state-machine inversions, zero unhandled +exceptions in the AlarmSurfaceInvoker pipeline. + +## Risks and mitigations + +| Risk | Mitigation | +|---|---| +| MxAccess Toolkit alarm subscription API differs across installed AVEVA versions | PR A.2 verifies against the worker-host's installed Toolkit version; documents the exact API used. Pin the worker DLL set per major MxAccess version if needed. | +| Worker-side alarm subscription leaks between sessions if cleanup is wrong | PR A.2 includes a session-recycle test that asserts no `IAlarmEventSink` instances remain registered after Close. 
| +| Gateway adds a new auth scope (`invoke:alarm-ack`); existing keys lack it | PR A.3 + A.5 ship with a one-time bootstrap migration: keys with `invoke:write` get the new scope auto-granted on the dev rig and parity rig. Production keys are reissued via `apikey rotate-key` (existing CLI). | +| Two simultaneous alarm sources (driver-native + sub-attribute) double-fire transitions | PR B.3 dedup is the load-bearing design. End-to-end test #1 covers it explicitly. | +| Historian write-back batch fails mid-batch — partial success | The existing `SqliteStoreAndForwardSink.HistorianWriteOutcome` per-row enum + dead-letter retention already handles this; PR A.5 just exposes the same outcome shape over gRPC. | +| Sidecar write-path deletion in B.4 leaves orphan IPC frames in old client builds | The frame-kind enum is forward-compatible (`MessageKind.WriteAlarmEventsRequest = 0x20`). Old clients sending the request to a new sidecar receive `Unsupported message kind`; new clients never send it. Acceptable — same-version deploy is the existing rollout convention. | + +## Roll-out + +Track A lands first onto `mxaccessgw/main`, deployed to the parity rig. +Track B lands onto `lmxopcua/master` once A.3 is live on the rig — earlier +Track B PRs can target a feature branch (`feat/alarms-over-gateway`) and +merge to master after the rig is fully green. + +## Back-out + +Each PR is individually revertable. The cleanest back-out point is at +the gateway-side enum extension: removing `MX_EVENT_FAMILY_ON_ALARM_TRANSITION` +from the proto means EventPump silently drops alarm events again and +GalaxyDriver's `OnAlarmEvent` never fires — but the sub-attribute fallback +path still produces functional alarms, so the OPC UA surface degrades to +v2-current behaviour without breaking. PR B.4 is the only one with a +non-trivial back-out (re-add the deleted sidecar IPC slot if revert +needed); land B.4 last and only after end-of-epic gate is green. 
+ +## Out of scope (explicit) + +- **Other alarm sources beyond Galaxy.** AbCip / FOCAS / OpcUaClient + drivers already implement `IAlarmSource`; they're untouched. +- **Modbus / S7 / AbLegacy / TwinCAT alarms.** None of those protocols + has a native alarm bus. Alarms on those drivers, if needed, ship via + the scripted-alarm path. +- **Multi-Galaxy ack routing.** Today's gateway model is one Galaxy per + session; if a deployment splits across galaxies, each gets its own + GalaxyDriver and they don't cross-talk. No change. +- **OPC UA Part 9 advanced features** beyond the current scope — + shelving, subscribed-to-events-only, branch-state for re-trigger + semantics. Future epic if a customer asks. +- **Insight / cloud Historian write-back path.** Track A.5 targets the + on-prem AVEVA Historian via aahClientManaged. The cloud variant + would mirror the same gateway RPC over the REST API discussed in + `docs/histsdk` — separate epic. + +## File inventory (touched) + +**mxaccessgw:** + +- `src\MxGateway.Contracts\Protos\mxaccess_gateway.proto` (A.1, A.5) +- `src\MxGateway.Contracts\Protos\mxaccess_worker.proto` (A.2, A.4, A.5) +- `src\MxGateway.Worker\…\Eventing\` (A.2, A.3, A.4) +- `src\MxGateway.Worker\…\Commands\` (A.3, A.4, A.5) +- `src\MxGateway.Server\Sessions\SessionEventStream.cs` (A.3) +- `src\MxGateway.Server\Rpc\` (A.3, A.4, A.5) +- `src\MxGateway.Server\Auth\Scopes.cs` (A.3, A.4, A.5) +- `MxGateway.Tests`, `MxGateway.Worker.Tests`, `MxGateway.IntegrationTests` + +**lmxopcua:** + +- `src\ZB.MOM.WW.OtOpcUa.Driver.Galaxy\Runtime\EventPump.cs` (B.1) +- `src\ZB.MOM.WW.OtOpcUa.Driver.Galaxy\Runtime\MxAccessSeverityMapper.cs` *(new — B.1)* +- `src\ZB.MOM.WW.OtOpcUa.Driver.Galaxy\Runtime\IGalaxyAlarmAcknowledger.cs` *(new — B.2)* +- `src\ZB.MOM.WW.OtOpcUa.Driver.Galaxy\Runtime\GatewayGalaxyAlarmAcknowledger.cs` *(new — B.2)* +- `src\ZB.MOM.WW.OtOpcUa.Driver.Galaxy\Runtime\GatewayAlarmHistorianWriter.cs` *(new — B.4)* +- 
`src\ZB.MOM.WW.OtOpcUa.Driver.Galaxy\GalaxyDriver.cs` (B.2) +- `src\ZB.MOM.WW.OtOpcUa.Driver.Galaxy\GalaxyDriverFactory.cs` (B.2, B.4) +- `src\ZB.MOM.WW.OtOpcUa.Server\OpcUa\DriverNodeManager.cs` (B.3) +- `src\ZB.MOM.WW.OtOpcUa.Server\Alarms\AlarmConditionService.cs` (B.3) +- `src\ZB.MOM.WW.OtOpcUa.Server\Phase7\Phase7Composer.cs` (B.4) +- `src\ZB.MOM.WW.OtOpcUa.Driver.Historian.Wonderware\Ipc\Contracts.cs` (B.4 — deletions) +- `src\ZB.MOM.WW.OtOpcUa.Driver.Historian.Wonderware\Ipc\HistorianFrameHandler.cs` (B.4 — deletions) +- `src\ZB.MOM.WW.OtOpcUa.Driver.Historian.Wonderware\Ipc\Framing.cs` (B.4 — deletions) +- `tests\ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Tests\Runtime\` (B.1, B.2) +- `tests\ZB.MOM.WW.OtOpcUa.Server.Tests\Alarms\` (B.3) +- `tests\ZB.MOM.WW.OtOpcUa.Driver.Historian.Wonderware.Tests\` (B.4 — drop deleted-contract tests) +- `docs\drivers\Galaxy.md` (B.5) +- `docs\AlarmTracking.md` *(new — B.5)* +- `docs\v1\AlarmTracking.md` (B.5 — banner update) +- `docs\plans\alarms-over-gateway.md` (B.5 — completion banner) + +Total: ~12 source files added/modified in mxaccessgw; ~17 in lmxopcua; +~10 test files. Should land in 4-6 weeks of focused work given the +parity-rig dependency for end-to-end validation.