# Plan — alarms over the mxaccessgw gateway

Coordinated epic across two repos:

- **`lmxopcua`** (this repo) — `c:\Users\dohertj2\Desktop\lmxopcua\`
- **`mxaccessgw`** — `c:\Users\dohertj2\Desktop\mxaccessgw\`

## Why

PR 7.2 (2026-04-30, commit `ae7106d`) retired the out-of-process v1 Galaxy stack (`Driver.Galaxy.Host` / `.Proxy` / `.Shared` + `OtOpcUaGalaxyHost` Windows service) and migrated Galaxy access to the in-process `GalaxyDriver` over mxaccessgw's gRPC. In doing so, three v1 capabilities regressed:

1. **Native MxAccess alarm-event metadata** — v1's `GalaxyAlarmTracker` surfaced rich alarm transitions (operator comment, original raise time, ack time, alarm category, native severity). The current architecture reconstructs Part 9 transitions by subscribing to four sub-attribute value updates (`InAlarm`, `Acked`, `Priority`, `Description`) — fine for raise/clear but loses everything else.
2. **Native MxAccess Acknowledge semantics** — v1 called the MxAccess ack API directly from `GalaxyAlarmTracker`. Today, OPC UA acks are written into the `AckMsgWriteRef` sub-attribute — semantically valid but a round-trip through the value path that loses operator-comment fidelity.
3. **Alarm-historian write-back path for non-Galaxy alarm sources.** v1's `GalaxyHistorianWriter` implemented `IAlarmHistorianWriter` and forwarded *scripted-alarm* transitions (and any future non-Galaxy alarm source — AB CIP ALMD, OpcUaClient A&E, etc.) back to AVEVA Historian via `aahClientManaged`. PR 7.2 deleted it. `Phase7Composer.ResolveHistorianSink` now finds no writer and falls back to `NullAlarmHistorianSink`, so **scripted-alarm transitions are queued locally and then silently discarded.** Galaxy-native alarms (with `$Alarm*` extensions) reach AVEVA Historian via System Platform's own `HistorizeToAveva` toggle on the Galaxy template — that path was never broken and is not in scope for this epic.

`gateway.md` (mxaccessgw, line 8) explicitly commits the gateway to "full MXAccess parity… preserve MXAccess behavior first… **native MXAccess event families**." Today's gateway proto exposes only data-change families. Closing the alarm regression and fulfilling that parity statement are the same task.

## Goals

- Restore all three regressed capabilities to feature parity with v1.
- Keep the v2 architectural split — gateway owns MxAccess transport; lmxopcua owns OPC UA Part 9 semantics, ACL/role enforcement, and multi-source aggregation (driver-native + scripted + sub-attribute).
- Preserve the value-driven sub-attribute path as a fallback for Galaxy templates that don't carry `$Alarm*` extensions.
- Land the work as a sequence of small, independently-reviewable PRs that alternate between repos in dependency order.

## Non-goals

- Reimplementing the Part 9 state machine inside mxaccessgw. The gateway stays UA-agnostic.
- Reworking the LDAP role-grant or OPC UA AlarmAck ACL surface — those already exist and route through `Server/Alarms/IAlarmAcknowledger`.
- Adding alarm support to non-Galaxy drivers (AbCip / FOCAS / OpcUaClient already have their own `IAlarmSource` implementations; Modbus / S7 / AbLegacy / TwinCAT don't have a native alarm bus and are out of scope).
- Altering Galaxy template conventions or `$Alarm*` extensions in the customer's Galaxy.
## Before → after

**Today (post-PR 7.2):**

```
MxAccess COM (gateway worker)
  │ data-change events only on the MxEvent stream
  ▼
GalaxyDriver (no IAlarmSource)
  │ IWritable / ISubscribable / ITagDiscovery only
  ▼
DriverNodeManager
  ├─ subscribes to four $Alarm* sub-attributes per condition
  ├─ AlarmConditionService rebuilds Part 9 transitions from value updates
  └─ DriverWritableAcknowledger writes AckMsgWriteRef on ack

Phase7Composer.ResolveHistorianSink → NullAlarmHistorianSink
  (scripted-alarm transitions queue → silently discarded)
```

**After this epic:**

```
MxAccess COM (gateway worker)
  │ data-change      ──┐
  │ alarm-transition   │
  │ write-complete     ├─► single MxEvent stream (new family added)
  ▼                    ▼
GalaxyDriver : ITagDiscovery, IReadable, IWritable, ISubscribable,
               IRediscoverable, IHostConnectivityProbe,
               IAlarmSource  ← restored
  ├─ EventPump dispatches OnAlarmTransition family → IAlarmSource.OnAlarmEvent
  ├─ AcknowledgeAsync → gateway RPC AcknowledgeAlarm
  └─ QueryActiveAlarmsAsync → gateway RPC QueryActiveAlarms (ConditionRefresh)

DriverNodeManager
  ├─ rich alarm events from IAlarmSource.OnAlarmEvent → AlarmConditionService
  ├─ value-driven sub-attribute path STILL WORKS for templates without $Alarm*
  ├─ DriverWritableAcknowledger preserved as fallback for the value path
  └─ ScriptedAlarmEngine output continues to feed AlarmConditionService

Phase7Composer.ResolveHistorianSink → SidecarAlarmHistorianWriter
  ├─ scripted-alarm transitions → SqliteStoreAndForwardSink
  └─ drain worker → sidecar IPC WriteAlarmEvents → AVEVA Historian
```

## Architecture decisions

**D1 — Where the Part 9 state machine runs.** Stays in lmxopcua's `AlarmConditionService`. Gateway is UA-agnostic. ScriptedAlarmEngine produces Part 9 transitions with no MxAccess origin; the aggregator must live where all sources converge.

**D2 — Where authz on Acknowledge runs.** Stays in lmxopcua.
The OPC UA `AlarmConditionState.OnAcknowledge` delegate already checks the session's roles for `AlarmAck` against the LDAP/role-grant ACL. The gateway should never be reachable in a way that bypasses that check.

**D3 — How rich alarm events reach OPC UA clients.** New `MxEventFamily` on the existing `StreamEvents` RPC (no second stream). This gives alarm events latency parity with data-change events and reuses the bounded-channel + worker-side delivery semantics already documented in `gateway.md`.

**D4 — Sub-attribute fallback path stays.** Some Galaxy templates won't have `$Alarm*` extensions yet; the existing value-driven path remains the only way to surface alarms for those templates. Both paths feed `AlarmConditionService`. Driver-native events take precedence when both are present (more authoritative, lower latency).

**D5 — Where the historian writer lives.** In the **Wonderware historian sidecar**, not in the gateway. The sidecar already owns `aahClientManaged`, already has a `WriteAlarmEvents` IPC slot defined in `Ipc/Contracts.cs`, and already dispatches to an `IAlarmEventWriter` interface — it's just unwired in `Program.cs:57`. The gateway is for MxAccess (live data + Galaxy hierarchy); the historian sidecar is for `aahClientManaged` (time-series + alarms historian). Two different SDKs, two different concerns; keep the split.

Bonus: completing the sidecar's write path also gives it a clearer long-term role — once the REST-API migration in `histsdk\instructions.md` takes over reads, write-back keeps the sidecar relevant rather than retiring it as a read-only relic.

**Galaxy-native alarms bypass this entirely** — System Platform's own `HistorizeToAveva` toggle on the Galaxy template publishes them directly. The sidecar write path is exclusively for non-Galaxy producers (today: scripted alarms; future: AB CIP ALMD or any other lmxopcua-side alarm source the customer wants unified into AVEVA Historian).
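The precedence rule in D4 can be illustrated with a small classifier. This is only a sketch under stated assumptions: `DriverEventOrigin` is named in PR B.3, but its members, the `AlarmTransition` record, the `TransitionDeduper` class, and the time-window correlation are all hypothetical and are not taken from `AlarmConditionService`.

```csharp
#nullable enable
using System;
using System.Collections.Generic;

// All names below are illustrative assumptions, not existing lmxopcua types.
public enum DriverEventOrigin { SubAttribute, DriverNative }
public enum TransitionKind { Raise, Acknowledge, Clear }
public enum DedupDecision { Forward, Enrich, Drop }

public sealed record AlarmTransition(
    string ConditionId,
    TransitionKind Kind,
    DriverEventOrigin Origin,
    DateTime TimestampUtc,
    string? OperatorComment = null);

public sealed class TransitionDeduper
{
    private readonly TimeSpan _window;
    private readonly Dictionary<(string ConditionId, TransitionKind Kind), AlarmTransition> _last = new();

    public TransitionDeduper(TimeSpan window) => _window = window;

    // Forward: new transition, fire it.
    // Enrich:  same transition already fired from the sub-attribute path;
    //          update its payload from the richer driver-native record.
    // Drop:    duplicate that adds nothing.
    public DedupDecision Classify(AlarmTransition t)
    {
        var key = (t.ConditionId, t.Kind);
        bool duplicate = _last.TryGetValue(key, out var prev)
            && (t.TimestampUtc - prev.TimestampUtc).Duration() <= _window;

        if (!duplicate)
        {
            _last[key] = t;
            return DedupDecision.Forward;
        }

        if (t.Origin == DriverEventOrigin.DriverNative
            && prev!.Origin == DriverEventOrigin.SubAttribute)
        {
            _last[key] = t;
            return DedupDecision.Enrich;
        }

        return DedupDecision.Drop;
    }
}
```

A time window is the weakest possible correlation key: a genuine re-raise inside the window would be dropped. A production version would more likely correlate on something the event itself carries, such as the original raise timestamp.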
## Track A — mxaccessgw changes

All four PRs land in `c:\Users\dohertj2\Desktop\mxaccessgw\`.

### PR A.1 — proto: add alarm-transition event family + ack/query RPCs

**Files** (`src\MxGateway.Contracts\Protos\mxaccess_gateway.proto`):

1. Extend `MxEventFamily` (line 403):

```
MX_EVENT_FAMILY_ON_ALARM_TRANSITION = 5;
```

2. Extend `MxEvent.body` oneof (line 395) with:

```
OnAlarmTransitionEvent on_alarm_transition = 24;
```

3. New message `OnAlarmTransitionEvent` after the existing event-family bodies (line 425+). Carry the full MxAccess alarm payload — alarm name, source object reference, alarm-type-name (e.g. "AnalogLimitAlarm.HiHi"), transition kind enum (`Raise` / `Acknowledge` / `Clear`), severity (raw numeric — keep MxAccess scale; mapping to OPC UA 0-1000 happens server-side in lmxopcua), `original_raise_timestamp`, `transition_timestamp`, optional `operator_user`, optional `operator_comment`, alarm `category` string, alarm `description`. Mirror the field set documented in v1's `GalaxyAlarmTracker`.

4. New RPCs on `MxAccessGateway` service (line 11):

```
rpc AcknowledgeAlarm(AcknowledgeAlarmRequest) returns (AcknowledgeAlarmReply);
rpc QueryActiveAlarms(QueryActiveAlarmsRequest) returns (stream ActiveAlarmSnapshot);
```

`AcknowledgeAlarmRequest` carries `session_id`, `alarm_full_reference`, `comment`, `user_principal`. Reply carries `MxStatusProxy`. `QueryActiveAlarmsRequest` carries `session_id`, optional `alarm_filter_prefix` (for ConditionRefresh on a sub-tree). `ActiveAlarmSnapshot` carries the same fields as `OnAlarmTransitionEvent` plus `current_state` enum (`Active` / `ActiveAcked` / `Inactive`).

**Tests** (`MxGateway.Tests` — proto/codegen sanity):

- Round-trip Serialize→Deserialize for the new messages with all-fields populated and empty-optional-fields cases.
- `MxEvent.body` oneof selection guard — supplying multiple bodies rejected.

**Out of scope:** worker-side wiring (PR A.2), gateway-side dispatch (PR A.3).
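The `severity` field is kept on the raw MxAccess scale, with the mapping to OPC UA done in lmxopcua (PR B.1's `MxAccessSeverityMapper`). A table-driven mapping might look like the sketch below. The `AlarmSeverity` member names, the `SeverityLadder` type, and every band boundary are assumptions made for illustration; only the 250 / 500 / 700 / 900 ladder comes from this plan, and the real bands are whatever v1's `AlarmTracking.md` documents.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical sketch. AlarmSeverity's members and all band boundaries are
// assumptions; only the 250 / 500 / 700 / 900 ladder comes from the plan.
public enum AlarmSeverity { Low, Medium, High, Critical }

public sealed class SeverityLadder
{
    // Rows ordered by ascending MaxRaw; the first row whose MaxRaw >= raw wins.
    private readonly IReadOnlyList<(int MaxRaw, AlarmSeverity Severity, ushort OpcUa)> _rows;

    public SeverityLadder(IReadOnlyList<(int MaxRaw, AlarmSeverity Severity, ushort OpcUa)> rows)
    {
        if (rows.Count == 0)
            throw new ArgumentException("At least one band is required.", nameof(rows));
        for (int i = 1; i < rows.Count; i++)
        {
            if (rows[i].MaxRaw <= rows[i - 1].MaxRaw)
                throw new ArgumentException("Bands must be strictly ascending by MaxRaw.", nameof(rows));
        }
        _rows = rows;
    }

    public (AlarmSeverity Severity, ushort OpcUa) Map(int raw)
    {
        foreach (var row in _rows)
        {
            if (raw <= row.MaxRaw) return (row.Severity, row.OpcUa);
        }
        // Above the last band: clamp to the last row rather than throw.
        var last = _rows[_rows.Count - 1];
        return (last.Severity, last.OpcUa);
    }
}

public static class SeverityLadderExample
{
    // Placeholder bands only: they assume a raw scale where lower numbers are
    // more urgent. Replace with the documented levels before use.
    public static SeverityLadder CreatePlaceholder() =>
        new SeverityLadder(new (int MaxRaw, AlarmSeverity Severity, ushort OpcUa)[]
        {
            (249, AlarmSeverity.Critical, 900),
            (499, AlarmSeverity.High, 700),
            (749, AlarmSeverity.Medium, 500),
            (999, AlarmSeverity.Low, 250),
        });
}
```

Keeping the bands as data rather than `if` chains lines up with the severity-mapping table tests planned for B.1: each documented level becomes one row and one test case.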
PR A.1 is a pure contract-surface change; nothing functional yet.

### PR A.2 — worker: subscribe to MxAccess alarm event source

**Files** (`src\MxGateway.Worker\` — net48/x86):

The MxAccess Toolkit exposes alarm subscription separately from data subscription. Per AVEVA's MXAccess C++ Toolkit reference (canonical doc referenced from `gateway.md`), alarm events arrive through the `IAlarmEventSink` interface registered against the MxAccess `Alarms` collection of an open session, OR via the MxAccess "alarm provider" subscription pattern (depends on Toolkit version on the worker host — verify against the version actually deployed in the worker bin during PR A.2).

1. Worker subscribes to MxAccess alarms once per session, with a single sink that fans out into the same bounded channel the data-change pump uses (`MxGateway.Worker\Eventing\EventChannel.cs` or whatever the worker currently calls its sink — verify name during the PR).
2. Sink translates each MxAccess alarm event into a `WorkerEvent` proto (defined in `mxaccess_worker.proto`) carrying the new `OnAlarmTransitionEvent` body. Reuses the existing `worker_sequence` counter so ordering is preserved across families.
3. Worker honours the same backpressure rules as data-change events — newest-dropped on full channel, single dropped-counter metric per family.

**Tests** (`MxGateway.Worker.Tests`):

- Fake `IAlarmEventSink` source emits canned transitions; assert the worker forwards each as the right `WorkerEvent` shape.
- Cancellation test — closing the session unsubscribes from MxAccess alarms cleanly (no leaked sinks if the worker is recycled mid-session).

**Out of scope:** any gateway-side dispatch, any RPC handler — PR A.2 is worker-internal.

### PR A.3 — gateway: dispatch OnAlarmTransition + implement AcknowledgeAlarm

**Files** (`src\MxGateway.Server\`):

1. The session-level event multiplexer (`Sessions\SessionEventStream.cs` or equivalent — verify name during PR) recognizes the new `WorkerEvent` body and forwards as an `MxEvent` with family `MX_EVENT_FAMILY_ON_ALARM_TRANSITION` to the gRPC `StreamEvents` consumer.
2. New RPC handler `AcknowledgeAlarm` builds an MxAccess `WorkerCommand` carrying an `AlarmAcknowledgeCommand` (new in `mxaccess_worker.proto` under PR A.1). Forwarded to the worker; reply mapped to `AcknowledgeAlarmReply` with the MxAccess `MxStatus` proxy populated.
3. AuthN — same API-key + scope check as existing RPCs. Add a new scope `invoke:alarm-ack` (mirrors `invoke:write` granularity); existing keys without it return `PERMISSION_DENIED`.

**Tests** (`MxGateway.Tests`, `MxGateway.IntegrationTests`):

- Unit: dispatch test — fake worker emits an `AlarmTransition` event; assert the gateway forwards it on the live `StreamEvents` channel of every subscribed session.
- Integration: end-to-end against the real worker (requires the parity rig setup — see `docs\v2\Galaxy.ParityRig.md` in lmxopcua for the MxAccess-installed dev box prerequisites). Trigger a real Galaxy alarm, assert the gateway emits `OnAlarmTransition`. Acknowledge via the new RPC, assert the alarm transitions to `ActiveAcked` and an `Acknowledge` transition event is emitted back.
- AuthN: existing key without `invoke:alarm-ack` scope rejected.

### PR A.4 — gateway: ConditionRefresh snapshot via QueryActiveAlarms

**Files** (`src\MxGateway.Server\`, `src\MxGateway.Worker\`):

1. Worker exposes a `QueryActiveAlarmsCommand` that walks the session's active-alarm collection and streams snapshots back through the existing command-reply channel. The MxAccess Toolkit's `Alarms.GetActive()` (verify exact API name during PR) is the underlying call.
2. Gateway RPC `QueryActiveAlarms` opens a server-streaming reply, batches snapshots through.
3. AuthN — new scope `invoke:alarm-query` (separate from ack so a read-only client can refresh without ack rights).

**Tests:**

- Worker-test: synthetic active set of 0 / 1 / 100 alarms; assert pagination respects worker channel capacity.
- Integration: against the parity rig, assert a ConditionRefresh after reconnect returns every alarm currently `Active` or `ActiveAcked` in the Galaxy.

**Sequencing within Track A:** A.1 → A.2 → A.3 → A.4. A.1 is mechanical; A.2 + A.3 are the load-bearing changes that unlock the lmxopcua side. A.4 can ship after lmxopcua starts consuming A.3 output.

The historian-write capability moved to **Track C** below — the gateway intentionally stays out of `aahClientManaged`.

## Track B — lmxopcua changes

All five PRs land in `c:\Users\dohertj2\Desktop\lmxopcua\`. Each B-PR depends on a specific A-PR or C-PR — see the sequencing matrix below.

### PR B.1 — EventPump: dispatch OnAlarmTransition family

**Depends on:** A.1 (proto), A.3 (gateway dispatching the new family).

**Files:**

- `src\ZB.MOM.WW.OtOpcUa.Driver.Galaxy\Runtime\EventPump.cs:160` — current `Dispatch(MxEvent ev)` returns early for any non-`OnDataChange` family. Add a branch:

```csharp
switch (ev.Family)
{
    case MxEventFamily.OnDataChange: DispatchDataChange(ev); break;
    case MxEventFamily.OnAlarmTransition: DispatchAlarmTransition(ev); break;
    default: return;
}
```

- New `DispatchAlarmTransition` translates the proto event into an `AlarmEventArgs` (existing type from `Core.Abstractions`) and raises an internal event the driver subscribes to.
- New `MxAccessSeverityMapper` in `Driver.Galaxy\Runtime\` — maps the MxAccess raw severity into the `AlarmSeverity` enum + the OPC UA numeric severity (250 / 500 / 700 / 900 ladder per v1's `AlarmTracking.md`).

**Tests** (`tests\ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Tests\Runtime\`):

- `EventPumpAlarmTests` — feed three synthetic MxEvents (raise / ack / clear); assert each fires `OnAlarmEvent` on the driver with correct payload.
- Severity-mapping table tests — every documented MxAccess severity level → expected (`AlarmSeverity`, OPC UA numeric) tuple.

### PR B.2 — GalaxyDriver re-implements IAlarmSource

**Depends on:** A.3 (`AcknowledgeAlarm` RPC available), B.1 (event dispatch).

**Files:**

- `src\ZB.MOM.WW.OtOpcUa.Driver.Galaxy\GalaxyDriver.cs:28` — extend the class declaration:

```csharp
public sealed class GalaxyDriver
    : IDriver, ITagDiscovery, IReadable, IWritable, ISubscribable,
      IRediscoverable, IHostConnectivityProbe, IAlarmSource, IDisposable
```

- Implement the four `IAlarmSource` members:
  - `SubscribeAlarmsAsync` — no-op returning a sentinel handle. The driver is already subscribed for data; alarm events arrive on the same event stream once the gateway emits the new family. (Same pattern AbCip uses today — see `Driver.AbCip\AbCipDriver.cs:208`.)
  - `UnsubscribeAlarmsAsync` — no-op.
  - `OnAlarmEvent` — wired to the EventPump branch added in B.1.
  - `AcknowledgeAsync` — calls the new gateway RPC via the `IGalaxyAlarmAcknowledger` abstraction (new file, mirrors the `IGalaxyDataWriter` pattern), with `GatewayGalaxyAlarmAcknowledger` as the production implementation in `Runtime\`. Resilience wrapping via `AlarmSurfaceInvoker` per existing pattern.
- `DriverInstanceFactory` for Galaxy registers `IGalaxyAlarmAcknowledger` alongside the existing data writer.

**Tests:**

- Subscribe-noop returns a non-null handle; unsubscribe accepts it.
- Acknowledge — fake `IGalaxyAlarmAcknowledger` records the call; assert the request shape and resilience-pipeline routing.
- End-to-end test in `Driver.Galaxy.Tests` — fake gateway emits a raise-then-ack event sequence; assert the driver fires `OnAlarmEvent` twice with matching alarm-id correlation.

### PR B.3 — DriverNodeManager: route to driver-native when present

**Depends on:** B.2.

**Files:**

- `src\ZB.MOM.WW.OtOpcUa.Server\OpcUa\DriverNodeManager.cs` — when registering an `AlarmConditionState` for a Galaxy variable, check whether the driver is `IAlarmSource`. If yes, prefer the `OnAlarmEvent`-driven path; the value-driven sub-attribute path becomes the secondary path that handles transitions the driver-native stream missed (network blip, gateway restart, `$Alarm*` extension missing on this template).
- `Server\Alarms\AlarmConditionService` — already accepts events from multiple sources; only addition is a `DriverEventOrigin` enum on internal transitions so the dedup logic prefers the richer driver-native record over a stale sub-attribute synthesis.
- `IAlarmAcknowledger` resolution in `DriverNodeManager` — prefer the driver's `IAlarmSource.AcknowledgeAsync` over `DriverWritableAcknowledger` when both are available. Keep `DriverWritableAcknowledger` as the fallback for templates without `$Alarm*` extensions.

**Tests:**

- Two-source-fan-in test: same alarm condition receives both a driver-native ack event and a sub-attribute value update for the same transition; assert no duplicate Part 9 transition fires.
- Acknowledger routing — driver implements `IAlarmSource` → ack-via-RPC; driver implements only `IWritable` → ack-via-write (existing path).

### PR B.4 — IAlarmHistorianWriter via the historian sidecar IPC

**Depends on:** C.2 (sidecar wires its `IAlarmEventWriter`). See Track C for the sidecar-side work; B.4 is the lmxopcua-side consumer.

**Files:**

- New `src\ZB.MOM.WW.OtOpcUa.Driver.Historian.Wonderware.Client\SidecarAlarmHistorianWriter.cs` implementing `IAlarmHistorianWriter`. Sends batches over the existing named-pipe IPC using the **already-defined** `WriteAlarmEventsRequest` / `WriteAlarmEventsReply` contracts at `Ipc\Contracts.cs:153`. No protocol changes — the slot is wired today on the contract side; only the production behaviour and the consumer on this side need to land.
- `Server\Phase7\Phase7Composer.ResolveHistorianSink` — already scans for registered `IAlarmHistorianWriter` instances. Register the new sidecar-backed writer at server bootstrap when the historian sidecar is enabled (`appsettings.json` `Historian:Wonderware:Enabled = true`). `SqliteStoreAndForwardSink` then boots with a real writer attached and the `NullAlarmHistorianSink` fallback no longer applies on installs that have the sidecar deployed.

**Tests:**

- `SidecarAlarmHistorianWriter` against a fake `PipeServer` — single record, batch, per-row failure modes (Ack / RetryPlease / PermanentFail) mapped from the sidecar's `PerEventOk[]` reply.
- `Phase7Composer` end-to-end — start the server with the historian sidecar enabled; assert `ResolveHistorianSink` picks `SqliteStoreAndForwardSink` with the new sidecar writer attached.

**Note on producer scope:** This path historizes **non-Galaxy alarms only.** Galaxy-native alarms (with `$Alarm*` extensions) reach AVEVA Historian directly via System Platform's `HistorizeToAveva` toggle on the alarm primitive, with no involvement from us. Today the only live producer feeding `SqliteStoreAndForwardSink` is `Phase7EngineComposer.RouteToHistorianAsync` for scripted alarms; future producers (AB CIP ALMD, FOCAS CNC alarms if a customer wants unified storage) plug into the same path.

### PR B.5 — docs + memory housekeeping

**Depends on:** B.1 / B.2 / B.3 / B.4 all green on the parity rig + D.1 (deployment refresh) verified on the dev rig.

**Files:**

- `docs\drivers\Galaxy.md` — current text says the driver implements five capability interfaces; update to seven (`IAlarmSource`, `IAlarmHistorianWriter`-via-companion).
- `docs\AlarmTracking.md` — promote a fresh top-level doc that describes the v2-final architecture (driver-native primary path + sub-attribute fallback + scripted-alarm aggregation). Cross-link from `docs\README.md`. The v1 archive stays as historical record.
- `docs\v1\AlarmTracking.md` — extend the existing historical banner with "Restored to functional parity in this epic — see `docs\AlarmTracking.md` for current state."
- Memory entries (`C:\Users\dohertj2\.claude\projects\…\memory\`):
  - Update `project_galaxy_via_mxgateway.md` — add the alarm path restoration.
  - Update `project_server_history_alarm_subsystems.md` — note that `Phase7Composer.ResolveHistorianSink` now finds a writer on Galaxy installs.
- `docs\plans\alarms-over-gateway.md` (this file) — banner the doc `✅ Completed YYYY-MM-DD — historical record.` matching the existing v2-mxgw plan retirement convention.

## Track C — historian sidecar wires the dormant write path

The Wonderware historian sidecar at `src\ZB.MOM.WW.OtOpcUa.Driver.Historian.Wonderware\` is a separately deployable Windows service (NSSM-wrapped) that already loads `aahClientManaged` x64 and serves a named-pipe IPC for read operations. The `WriteAlarmEvents` IPC slot is defined but unwired (`Program.cs:57` constructs `HistorianFrameHandler` without an `alarmWriter`). Track C completes that slot. Two PRs in the sidecar + one consumer-side PR (B.4) in lmxopcua finish the path.

### PR C.1 — sidecar: AahClientManagedAlarmEventWriter

**Files** (`src\ZB.MOM.WW.OtOpcUa.Driver.Historian.Wonderware\Backend\`):

1. New `AahClientManagedAlarmEventWriter.cs` implementing the existing `IAlarmEventWriter` interface (defined in `Ipc\HistorianFrameHandler.cs:242`).
2. Implementation calls `aahClientManaged`'s alarm-event write API — the same path v1's `GalaxyHistorianWriter` used. Use the existing `HistorianClusterEndpointPicker` for multi-node routing so write failures fail over the same way reads do.
3. Batch size + retry behaviour mirrors v1's `GalaxyHistorianWriter` per-row outcome reporting (`HistorianWriteOutcome` enum: Ack / PermanentFail / RetryPlease). Map MxStatus codes onto outcomes.
4. Reuses `HistorianDataSource`'s existing connection-pool / health gating — no new TCP work needed; the same session that serves reads can issue writes too.

**Tests** (`tests\ZB.MOM.WW.OtOpcUa.Driver.Historian.Wonderware.Tests\`):

- Outcome-mapping table: every documented MxStatus on alarm-write → expected `HistorianWriteOutcome`.
- Batching: 1 / 100 / 1000 events through a fake `aahClientManaged` writer; assert per-row outcome list parallel to input order.
- Cluster failover: primary node returns `BadCommunicationError`; picker rotates to secondary; assert eventual success.

### PR C.2 — sidecar: wire IAlarmEventWriter into Program.cs

**Files** (`src\ZB.MOM.WW.OtOpcUa.Driver.Historian.Wonderware\Program.cs`):

1. Build an `AahClientManagedAlarmEventWriter` next to the existing `BuildHistorian()` call.
2. Pass it to `HistorianFrameHandler` (currently constructed at line 57 without an `alarmWriter`). The dispatcher already routes `WriteAlarmEventsRequest` through `_alarmWriter` when non-null (`HistorianFrameHandler.cs:158-172`); supplying it makes the slot functional.
3. Gate behind a new env var `OTOPCUA_HISTORIAN_ALARM_WRITE_ENABLED` (default `true` when `OTOPCUA_HISTORIAN_ENABLED=true`). Lets a read-only deployment skip the writer registration if needed.
4. Update `Install-Services.ps1` install-time env block in lmxopcua's `scripts\install\` to include the new toggle.

**Tests:**

- `Program.cs` unit-test seam: assert handler is constructed with alarm writer when enabled and without when disabled.
- Live integration (parity rig): write a synthetic alarm event through the IPC; query it back via `ReadEvents`; assert round-trip fidelity.

### Sequencing within Track C:

C.1 → C.2. C.2's lmxopcua-side consumer is **PR B.4 in Track B**, which depends on C.2 being deployed.

## Track D — deployment refresh

The dev box at `DESKTOP-6JL3KKO` runs three live services from `C:\publish\` (installed in the session that produced commit `ea04547`'s install scripts).
Once Tracks A / B / C are merged, the deployed binaries need to be refreshed so the running services pick up the new alarm path. Track D is one PR — pure ops, no code change.

### PR D.1 — refresh C:\publish + restart services

**Depends on:** A.4 + B.4 + C.2 merged (every code-change PR landed).

**Order matters** — services must stop in reverse-dependency order (`OtOpcUa` → `OtOpcUaWonderwareHistorian` → `MxAccessGw`) and start in forward-dependency order (`MxAccessGw` → `OtOpcUaWonderwareHistorian` → `OtOpcUa`). Touching binaries while a dependent service holds them locked produces the publish-time `MSB3027` file-lock error caught during the original install (see commit `80104ca`).

**Steps (run as a single PowerShell session on the deploy host):**

1. **Stop in reverse order**:

```powershell
nssm stop OtOpcUa
nssm stop OtOpcUaWonderwareHistorian
nssm stop MxAccessGw
Start-Sleep -Seconds 3
Get-Process MxGateway.Server, MxGateway.Worker, OtOpcUa.Server, `
    OtOpcUa.Driver.Historian.Wonderware -ErrorAction SilentlyContinue |
    Stop-Process -Force
```

2. **Refresh mxaccessgw binaries** (Track A output):

```powershell
$gwSrc = "C:\Users\dohertj2\Desktop\mxaccessgw"
dotnet build "$gwSrc\src\MxGateway.Worker" -c Release
dotnet build "$gwSrc\src\MxGateway.Server" -c Release
Copy-Item -Recurse -Force `
    "$gwSrc\src\MxGateway.Server\bin\Release\net10.0\*" `
    "C:\publish\mxaccessgw\Server\"
Copy-Item -Recurse -Force `
    "$gwSrc\src\MxGateway.Worker\bin\x86\Release\net48\*" `
    "C:\publish\mxaccessgw\Worker\"
```

3. **Refresh OtOpcUa + historian sidecar binaries** (Tracks B + C output):

```powershell
$repo = "C:\Users\dohertj2\Desktop\lmxopcua"
dotnet publish "$repo\src\ZB.MOM.WW.OtOpcUa.Server" `
    -c Release -o "C:\publish\lmxopcua"
dotnet publish "$repo\src\ZB.MOM.WW.OtOpcUa.Driver.Historian.Wonderware" `
    -c Release -o "C:\publish\lmxopcua\WonderwareHistorian"
```

4. **Update service env block if Track C added the new toggle**:

```powershell
# Pull existing env, append OTOPCUA_HISTORIAN_ALARM_WRITE_ENABLED=true
# (default-on per C.2 design, but explicit assignment lets us flip false
# for read-only deployments without re-installing)
nssm set OtOpcUaWonderwareHistorian AppEnvironmentExtra `
    (((nssm get OtOpcUaWonderwareHistorian AppEnvironmentExtra) `
        + "`r`nOTOPCUA_HISTORIAN_ALARM_WRITE_ENABLED=true"))
```

5. **Start in forward order**:

```powershell
nssm start MxAccessGw
Start-Sleep -Seconds 4
nssm start OtOpcUaWonderwareHistorian
Start-Sleep -Seconds 4
nssm start OtOpcUa
Start-Sleep -Seconds 8
```

6. **Smoke verification:**

```powershell
foreach ($s in 'MxAccessGw','OtOpcUaWonderwareHistorian','OtOpcUa') {
    (Get-Service $s).Status
}
foreach ($p in 5120, 4840, 4841) {
    Get-NetTCPConnection -LocalPort $p -State Listen `
        -ErrorAction SilentlyContinue
}
Get-Content "C:\publish\lmxopcua\logs\otopcua-*.log" -Tail 20
Get-Content "C:\publish\mxaccessgw\stdout.log" -Tail 20
Get-Content "C:\ProgramData\OtOpcUa\historian-wonderware-*.log" -Tail 10
```

Pass criterion: all three services `Running`; ports 5120 + 4840 listening; sidecar log shows `Wonderware historian sidecar serving — pipe=OtOpcUaWonderwareHistorian`; OtOpcUa log shows `OPC UA server started — endpoint=opc.tcp://0.0.0.0:4840/OtOpcUa` and a new line `IAlarmHistorianWriter resolved: Sidecar` (added in B.4).

7. **Functional verification — fire one alarm of each kind and assert it propagates:**

- **Galaxy-native** — raise the `OtOpcUaParityTest_001.Counter` `$Alarm*` extension via Galaxy's alarm-fire mechanism; assert an OPC UA Part 9 transition reaches a connected `otopcua-cli alarms` subscriber with rich payload (operator-comment field non-null, original-raise-timestamp present). This validates Track A + B.1 + B.2 + B.3.
- **Scripted** — author a one-line scripted alarm in the Admin UI against any always-true predicate; assert the transition lands in AVEVA Historian via `aaHistClientTrend` query (or `Driver.Historian.Wonderware.IntegrationTests` with a query for the alarm event). Validates Track C + B.4.
- **Sub-attribute fallback** — disable `IAlarmSource` on the GalaxyDriver via the test seam (B.3 will introduce one); fire an alarm; assert Part 9 transition still raised by the value-driven path. Validates the fallback wasn't broken.

**Files:**

- `scripts\install\Refresh-Services.ps1` *(new — automates the above)*
- `docs\v2\dev-environment.md` — add the refresh script to the dev workflow section.

**Tests:** smoke run on the dev rig (`DESKTOP-6JL3KKO`) producing `docs\plans\artifacts\d1-rollout-YYYY-MM-DD.md` with the captured log tails + smoke-test assertions. Captured artifact lands as part of the PR.

**Rollback:** the refresh script keeps a timestamped backup of the existing `C:\publish\mxaccessgw\` and `C:\publish\lmxopcua\` trees before overwriting (mirrored to `C:\publish\.backup-YYYY-MM-DD\`). Rollback is a stop / restore-from-backup / start sequence; no service re-install needed since the NSSM service definitions don't change.

**Production deploy:** out of scope for D.1 — the dev rig is the only deployment in scope at this point. A separate PR-or-runbook lands the production refresh once the dev rig has soaked for the documented duration (parity-rig validation gate; see "Test gates" below).
## Sequencing matrix

```
Track A (mxaccessgw)              Track B (lmxopcua)                  Track C (sidecar)
─────────────────────────         ─────────────────────────           ─────────────────────────
A.1 proto                         (waits)                             C.1 AahClientManagedAlarmEventWriter
  │                                                                     │  no cross-repo dep
  ├──────────────────────────►    B.1 EventPump branch                  │
A.2 worker subscription             │  uses proto types only            │
  │                                 │  unit-testable                    │
  │                                                                   C.2 Program.cs wires writer
A.3 gateway dispatch + ack RPC ──►B.2 GalaxyDriver : IAlarmSource       │
  │                                 │                                   │
  │                              ──►B.3 DriverNodeManager routing       │
  │                                 │                                   │
A.4 ConditionRefresh                │                                   │
  │                               B.4 SidecarAlarmHistorianWriter (depends on C.2 deployed)
  ▼
Track D (deployment)
─────────────────────────
D.1 Refresh C:\publish + restart services (depends on A.4 + B.4 + C.2 merged)
  ▼
                               ──►B.5 docs + memory + completion banner
```

A.1 + B.1 + C.1 can all land in parallel — none have cross-repo runtime dependencies. B.1's tests use proto types without needing a running gateway. C.1 is purely sidecar-internal. The gateway-side dispatch (A.3) gates B.2; the sidecar-side wiring (C.2) gates B.4. D.1 (deployment refresh) gates B.5 (docs) — the docs sweep records the as-deployed state, so the deploy must be live first.

## Test gates

Per PR: unit tests pass + build green + analyzer clean (Roslyn OTOPCUA0001 still enforces that every alarm-capability call is wrapped through `AlarmSurfaceInvoker`).

End-of-epic gate: re-run the parity rig (`docs\v2\Galaxy.ParityRig.md`) with these scenarios added:

1. **Native alarm raise** — Galaxy `$Alarm*` raise with operator-time metadata appears as an OPC UA Part 9 transition with full payload (no longer reconstructed from sub-attribute writes).
2. **Native ack** — OPC UA client acks; assert the gateway records the ack against MxAccess directly (not via sub-attribute write); operator comment present in the resulting `Acknowledged` transition.
3. **ConditionRefresh after reconnect** — disconnect the GalaxyDriver, raise three alarms in Galaxy, reconnect; assert all three appear in the next ConditionRefresh.
4. **Historian write-back** — fire a scripted alarm; assert it arrives in AVEVA Historian via the historian sidecar write path (use the existing Historian sidecar's read API to query it back).
5. **Sub-attribute fallback still works** — disable `IAlarmSource` on the GalaxyDriver via test seam, fire a sub-attribute value change; assert Part 9 transition still raised.

Soak target: 24h × 1k tags (light) — same parity-rig harness but extended to also subscribe to alarms. Pass criterion: zero dropped alarm transitions, zero state-machine inversions, zero unhandled exceptions in the AlarmSurfaceInvoker pipeline.

## Risks and mitigations

| Risk | Mitigation |
|---|---|
| MxAccess Toolkit alarm subscription API differs across installed AVEVA versions | PR A.2 verifies against the worker-host's installed Toolkit version; documents the exact API used. Pin the worker DLL set per major MxAccess version if needed. |
| Worker-side alarm subscription leaks between sessions if cleanup is wrong | PR A.2 includes a session-recycle test that asserts no `IAlarmEventSink` instances remain registered after Close. |
| Gateway adds a new auth scope (`invoke:alarm-ack`); existing keys lack it | PR A.3 ships with a one-time bootstrap migration: keys with `invoke:write` get the new scope auto-granted on the dev rig and parity rig. Production keys are reissued via `apikey rotate-key` (existing CLI). |
| Two simultaneous alarm sources (driver-native + sub-attribute) double-fire transitions | PR B.3 dedup is the load-bearing design. End-to-end test #1 covers it explicitly. |
| Historian write-back batch fails mid-batch — partial success | The existing `SqliteStoreAndForwardSink.HistorianWriteOutcome` per-row enum + dead-letter retention already handles this; PRs C.1 and B.4 just carry the same per-row outcome shape over the sidecar IPC. |
| Sidecar starts honouring the `WriteAlarmEvents` slot — old lmxopcua-side consumers can now reach a previously inert path | The slot returns `Success=false, Error="not configured"` today; flipping to live writes means a build that *speculatively* sent the frame would suddenly start producing real historian rows. Inventory of any such caller is empty — `WriteAlarmEvents` was never invoked from the lmxopcua side; `Phase7EngineComposer.RouteToHistorianAsync` queues into `SqliteStoreAndForwardSink` and the drain worker is gated on `IAlarmHistorianWriter` registration which only the new B.4 path provides. So enabling C.2 without B.4 is safe. |

## Roll-out

Track A lands first onto `mxaccessgw/main`, deployed to the parity rig. Track B lands onto `lmxopcua/master` once A.3 is live on the rig — earlier Track B PRs can target a feature branch (`feat/alarms-over-gateway`) and merge to master after the rig is fully green.

## Back-out

Each PR is individually revertable. The cleanest back-out point is at the gateway-side enum extension: removing `MX_EVENT_FAMILY_ON_ALARM_TRANSITION` from the proto means EventPump silently drops alarm events again and GalaxyDriver's `OnAlarmEvent` never fires — but the sub-attribute fallback path still produces functional alarms, so the OPC UA surface degrades to v2-current behaviour without breaking. PR B.4 is the only one with a non-trivial back-out (re-add the deleted sidecar IPC slot if revert needed); land B.4 last and only after end-of-epic gate is green.

## Out of scope (explicit)

- **Other alarm sources beyond Galaxy.** AbCip / FOCAS / OpcUaClient drivers already implement `IAlarmSource`; they're untouched.
- **Modbus / S7 / AbLegacy / TwinCAT alarms.** None of those protocols has a native alarm bus. Alarms on those drivers, if needed, ship via the scripted-alarm path.
- **Multi-Galaxy ack routing.** Today's gateway model is one Galaxy per session; if a deployment splits across galaxies, each gets its own GalaxyDriver and they don't cross-talk. No change. - **OPC UA Part 9 advanced features** beyond the current scope — shelving, subscribed-to-events-only, branch-state for re-trigger semantics. Future epic if a customer asks. - **Insight / cloud Historian write-back path.** Track A.5 targets the on-prem AVEVA Historian via aahClientManaged. The cloud variant would mirror the same gateway RPC over the REST API discussed in `docs/histsdk` — separate epic. ## File inventory (touched) **mxaccessgw (Track A):** - `src\MxGateway.Contracts\Protos\mxaccess_gateway.proto` (A.1) - `src\MxGateway.Contracts\Protos\mxaccess_worker.proto` (A.2, A.4) - `src\MxGateway.Worker\…\Eventing\` (A.2, A.3, A.4) - `src\MxGateway.Worker\…\Commands\` (A.3, A.4) - `src\MxGateway.Server\Sessions\SessionEventStream.cs` (A.3) - `src\MxGateway.Server\Rpc\` (A.3, A.4) - `src\MxGateway.Server\Auth\Scopes.cs` (A.3, A.4) - `MxGateway.Tests`, `MxGateway.Worker.Tests`, `MxGateway.IntegrationTests` **lmxopcua — Galaxy driver + server (Track B):** - `src\ZB.MOM.WW.OtOpcUa.Driver.Galaxy\Runtime\EventPump.cs` (B.1) - `src\ZB.MOM.WW.OtOpcUa.Driver.Galaxy\Runtime\MxAccessSeverityMapper.cs` *(new — B.1)* - `src\ZB.MOM.WW.OtOpcUa.Driver.Galaxy\Runtime\IGalaxyAlarmAcknowledger.cs` *(new — B.2)* - `src\ZB.MOM.WW.OtOpcUa.Driver.Galaxy\Runtime\GatewayGalaxyAlarmAcknowledger.cs` *(new — B.2)* - `src\ZB.MOM.WW.OtOpcUa.Driver.Galaxy\GalaxyDriver.cs` (B.2) - `src\ZB.MOM.WW.OtOpcUa.Driver.Galaxy\GalaxyDriverFactory.cs` (B.2) - `src\ZB.MOM.WW.OtOpcUa.Server\OpcUa\DriverNodeManager.cs` (B.3) - `src\ZB.MOM.WW.OtOpcUa.Server\Alarms\AlarmConditionService.cs` (B.3) - `src\ZB.MOM.WW.OtOpcUa.Server\Phase7\Phase7Composer.cs` (B.4) - `src\ZB.MOM.WW.OtOpcUa.Driver.Historian.Wonderware.Client\SidecarAlarmHistorianWriter.cs` *(new — B.4)* - `tests\ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Tests\Runtime\` 
(B.1, B.2) - `tests\ZB.MOM.WW.OtOpcUa.Server.Tests\Alarms\` (B.3) - `tests\ZB.MOM.WW.OtOpcUa.Driver.Historian.Wonderware.Client.Tests\` (B.4 — new tests) - `docs\drivers\Galaxy.md` (B.5) - `docs\AlarmTracking.md` *(new — B.5)* - `docs\v1\AlarmTracking.md` (B.5 — banner update) - `docs\plans\alarms-over-gateway.md` (B.5 — completion banner) **lmxopcua — Wonderware historian sidecar (Track C):** - `src\ZB.MOM.WW.OtOpcUa.Driver.Historian.Wonderware\Backend\AahClientManagedAlarmEventWriter.cs` *(new — C.1)* - `src\ZB.MOM.WW.OtOpcUa.Driver.Historian.Wonderware\Program.cs` (C.2 — wire writer) - `scripts\install\Install-Services.ps1` (C.2 — env-var toggle for write-enable) - `tests\ZB.MOM.WW.OtOpcUa.Driver.Historian.Wonderware.Tests\` (C.1 — outcome mapping + batch + cluster failover) **lmxopcua — deployment refresh (Track D):** - `scripts\install\Refresh-Services.ps1` *(new — D.1)* - `docs\v2\dev-environment.md` (D.1 — document the refresh workflow) - `docs\plans\artifacts\d1-rollout-YYYY-MM-DD.md` *(new — D.1 captured smoke run)* Total: ~10 source files added/modified in mxaccessgw; ~14 in lmxopcua proper; ~3 in the historian sidecar; ~2 deployment scripts; ~12 test files across all repos. Should land in 4-6 weeks of focused work given the parity-rig dependency for end-to-end validation, plus a short final-week ops slot for D.1.
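
As a cross-check on the landing order, the gates stated under the sequencing matrix (A.3 gates B.2, C.2 gates B.4, D.1 waits on A.4 + B.4 + C.2, D.1 gates B.5) can be encoded as a small dependency graph and sorted into waves. This is a minimal planning sketch, not part of either repo: the `GATES` table and the `landing_waves` helper are hypothetical names, and the within-track ordering (A.1 before A.2, B.1 before B.2, and so on) is an assumption read off the vertical lines in the matrix rather than something the plan states outright.

```python
from graphlib import TopologicalSorter

# Each PR maps to the set of PRs that must land before it.
# Cross-track gates come from the plan text; the within-track
# chains (A.1 -> A.2 -> ..., B.1 -> B.2 -> ...) are an assumption.
GATES = {
    "A.1": set(),
    "A.2": {"A.1"},
    "A.3": {"A.2"},
    "A.4": {"A.3"},
    "B.1": set(),            # plan: lands in parallel with A.1 and C.1
    "B.2": {"B.1", "A.3"},   # plan: A.3 gates B.2
    "B.3": {"B.2"},
    "B.4": {"B.3", "C.2"},   # plan: C.2 gates B.4
    "C.1": set(),
    "C.2": {"C.1"},
    "D.1": {"A.4", "B.4", "C.2"},
    "B.5": {"D.1"},          # plan: docs sweep follows the deploy
}


def landing_waves(gates):
    """Group PRs into waves; each PR's gates are all in earlier waves."""
    sorter = TopologicalSorter(gates)
    sorter.prepare()  # raises graphlib.CycleError if the gates are circular
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)
    return waves


if __name__ == "__main__":
    for number, wave in enumerate(landing_waves(GATES), start=1):
        print(f"wave {number}: {', '.join(wave)}")
```

Under these assumptions the first wave is A.1, B.1 and C.1, which matches the statement that those three can land in parallel, and B.5 comes out alone in the final wave. Editing `GATES` when a gate changes gives a quick check that the plan still has no circular dependency.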