lmxopcua/docs/plans/alarms-over-gateway.md
Joseph Doherty 5ed26d2ec6 docs: alarms-over-gateway plan banner — record A.2 dev-rig finding
Replaces the "ships as a follow-up gated on dev-rig validation"
banner with the actual finding from the dev-rig inspection: the
MXAccess COM Toolkit on this AVEVA install does not expose any
alarm-event family, and the AVEVA alarm-subscription managed
assemblies (aaAlarmManagedClient, ArchestrAAlarmsAndEvents.SDK)
are x64-only and incompatible with the worker's x86 bitness.

Two operator-facing paths forward documented inline:

1. Stay on the value-driven sub-attribute path (current production
   behaviour). Operator-comment fidelity is the only v1 regression.

2. Add an x64 alarm-helper sub-process alongside the worker that
   loads aaAlarmManagedClient and forwards transitions over a
   named-pipe IPC. Recovers full v1 fidelity but adds operational
   complexity.

The full architectural notes live in the mxaccessgw repo at
src/MxGateway.Worker/MxAccess/MxAccessAlarmEventSink.cs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 21:29:16 -04:00

# Plan — alarms over the mxaccessgw gateway
> ✅ **All 19 PRs merged 2026-04-30 — historical record.**
> A.1 / A.2 / A.3 / A.4 (gateway proto + handlers + worker scaffold),
> B.1 / B.2 / B.3 / B.4 / B.5 (driver, server, docs), C.1 / C.2
> (sidecar alarm historian writer), D.1 (deploy script),
> E.1 / E.2 / E.3 / E.4 / E.5 / E.6 / E.7 (5 client SDKs + lmxopcua
> client surface). Public contract surface is live; client SDKs ship
> the new RPCs; the sub-attribute fallback path keeps Galaxy alarms
> functional today.
>
> ⚠️ **Worker-side native alarm subscription blocked on a dev-rig
> finding (2026-04-30):** the MXAccess COM Toolkit at
> `C:\Program Files (x86)\ArchestrA\Framework\Bin\ArchestrA.MXAccess.dll`
> exposes no alarm-event family — only `OnDataChange`,
> `OnWriteComplete`, `OperationComplete`, `OnBufferedDataChange`.
> AVEVA's `aaAlarmManagedClient` / `ArchestrAAlarmsAndEvents.SDK`
> assemblies are x64-only and incompatible with the worker's x86
> bitness. **Operator decision needed before
> `MX_EVENT_FAMILY_ON_ALARM_TRANSITION` carries any events:** either
> accept the value-driven sub-attribute path as the production
> architecture (operator-comment fidelity is the only v1 regression)
> or add an x64 alarm-helper sub-process alongside the worker. See
> `src/MxGateway.Worker/MxAccess/MxAccessAlarmEventSink.cs` in the
> mxaccessgw repo for the architectural notes. Live
> `aahClientManaged` alarm-event write call site
> (`SdkAlarmHistorianWriteBackend` placeholder from PR C.1) and the
> D.1 smoke artifact ship once those decisions resolve. The
> remainder of this document is preserved as the design record.

Coordinated epic across two repos:
- **`lmxopcua`** (this repo) — `c:\Users\dohertj2\Desktop\lmxopcua\`
- **`mxaccessgw`** — `c:\Users\dohertj2\Desktop\mxaccessgw\`
## Why
PR 7.2 (2026-04-30, commit `ae7106d`) retired the out-of-process v1 Galaxy stack
(`Driver.Galaxy.Host` / `.Proxy` / `.Shared` + `OtOpcUaGalaxyHost` Windows
service) and migrated Galaxy access to the in-process `GalaxyDriver` over
mxaccessgw's gRPC. In doing so, three v1 capabilities regressed:
1. **Native MxAccess alarm-event metadata** — v1's `GalaxyAlarmTracker`
surfaced rich alarm transitions (operator comment, original raise time,
ack time, alarm category, native severity). The current architecture
reconstructs Part 9 transitions by subscribing to four sub-attribute
value updates (`InAlarm`, `Acked`, `Priority`, `Description`) — fine for
raise/clear but loses everything else.
2. **Native MxAccess Acknowledge semantics** — v1 called the MxAccess ack
API directly from `GalaxyAlarmTracker`. Today, OPC UA acks are written
into the `AckMsgWriteRef` sub-attribute — semantically valid but a
round-trip through the value path that loses operator-comment fidelity.
3. **Alarm-historian write-back path for non-Galaxy alarm sources.**
v1's `GalaxyHistorianWriter` implemented `IAlarmHistorianWriter` and
forwarded *scripted-alarm* transitions (and any future non-Galaxy
alarm source — AB CIP ALMD, OpcUaClient A&E, etc.) back to AVEVA
Historian via `aahClientManaged`. PR 7.2 deleted it.
`Phase7Composer.ResolveHistorianSink` now finds no writer and falls
back to `NullAlarmHistorianSink`, so **scripted-alarm transitions
are queued locally and then silently discarded.** Galaxy-native alarms (with
`$Alarm*` extensions) reach AVEVA Historian via System Platform's
own `HistorizeToAveva` toggle on the Galaxy template — that path
was never broken and is not in scope for this epic.
`gateway.md` (mxaccessgw, line 8) explicitly commits the gateway to "full
MXAccess parity… preserve MXAccess behavior first… **native MXAccess event
families**." Today's gateway proto exposes only data-change families. Closing
the alarm regression and fulfilling that parity statement are the same task.
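To make the regression in item 1 concrete, here is a minimal sketch, in Python for brevity (the production code is C#), of how raise / acknowledge / clear can be derived from `InAlarm` and `Acked` value updates alone. `ConditionState` and `derive_transition` are hypothetical names, not the actual `AlarmConditionService` API, and the rule that a fresh raise starts unacknowledged is an assumption. Note that nothing in the inputs carries an operator comment or the original raise time, which is exactly the metadata the value path loses.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ConditionState:
    """Last-seen values of the two boolean sub-attributes for one condition."""
    in_alarm: bool = False
    acked: bool = True


def derive_transition(state: ConditionState, attribute: str, value: bool) -> Optional[str]:
    """Apply one sub-attribute update; return 'Raise', 'Acknowledge', 'Clear', or None."""
    if attribute == "InAlarm":
        previous, state.in_alarm = state.in_alarm, value
        if value and not previous:
            state.acked = False  # assumption: a fresh raise starts unacknowledged
            return "Raise"
        if previous and not value:
            return "Clear"
    elif attribute == "Acked":
        previous, state.acked = state.acked, value
        if value and not previous:
            return "Acknowledge"
    # Priority / Description updates (and repeated values) produce no transition.
    return None
```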
## Goals
- Restore all three regressed capabilities to feature parity with v1.
- Keep the v2 architectural split — gateway owns MxAccess transport;
lmxopcua owns OPC UA Part 9 semantics, ACL/role enforcement, and
multi-source aggregation (driver-native + scripted + sub-attribute).
- Preserve the value-driven sub-attribute path as a fallback for Galaxy
templates that don't carry `$Alarm*` extensions.
- Land the work as a sequence of small, independently-reviewable PRs that
alternate between repos in dependency order.
## Non-goals
- Reimplementing the Part 9 state machine inside mxaccessgw. The gateway
stays UA-agnostic.
- Reworking the LDAP role-grant or OPC UA AlarmAck ACL surface — those
already exist and route through `Server/Alarms/IAlarmAcknowledger`.
- Adding alarm support to non-Galaxy drivers (AbCip / FOCAS / OpcUaClient
already have their own `IAlarmSource` implementations; Modbus / S7 /
AbLegacy / TwinCAT don't have a native alarm bus and are out of scope).
- Altering Galaxy template conventions or `$Alarm*` extensions in the
customer's Galaxy.
## Before → after
**Today (post-PR 7.2):**
```
MxAccess COM (gateway worker)
│ data-change events only on the MxEvent stream
GalaxyDriver (no IAlarmSource)
│ IWritable / ISubscribable / ITagDiscovery only
DriverNodeManager
├─ subscribes to four $Alarm* sub-attributes per condition
├─ AlarmConditionService rebuilds Part 9 transitions from value updates
└─ DriverWritableAcknowledger writes AckMsgWriteRef on ack
Phase7Composer.ResolveHistorianSink → NullAlarmHistorianSink
(scripted-alarm transitions queue → silently discarded)
```
**After this epic:**
```
MxAccess COM (gateway worker)
│ data-change ──┐
│ alarm-transition │
│ write-complete ├─► single MxEvent stream (new family added)
▼ ▼
GalaxyDriver : ITagDiscovery, IReadable, IWritable, ISubscribable, IRediscoverable,
IHostConnectivityProbe, IAlarmSource ← restored
├─ EventPump dispatches OnAlarmTransition family → IAlarmSource.OnAlarmEvent
├─ AcknowledgeAsync → gateway RPC AcknowledgeAlarm
└─ QueryActiveAlarmsAsync → gateway RPC QueryActiveAlarms (ConditionRefresh)
DriverNodeManager
├─ rich alarm events from IAlarmSource.OnAlarmEvent → AlarmConditionService
├─ value-driven sub-attribute path STILL WORKS for templates without $Alarm
├─ DriverWritableAcknowledger preserved as fallback for the value path
└─ ScriptedAlarmEngine output continues to feed AlarmConditionService
Phase7Composer.ResolveHistorianSink → GatewayAlarmHistorianWriter
├─ scripted-alarm transitions → SqliteStoreAndForwardSink
└─ drain worker → gateway RPC WriteHistorianEvent → AVEVA Historian
```
## Architecture decisions
**D1 — Where the Part 9 state machine runs.** Stays in lmxopcua's
`AlarmConditionService`. Gateway is UA-agnostic. ScriptedAlarmEngine produces
Part 9 transitions with no MxAccess origin; the aggregator must live where all
sources converge.
**D2 — Where authz on Acknowledge runs.** Stays in lmxopcua. The OPC UA
`AlarmConditionState.OnAcknowledge` delegate already checks the session's
roles for `AlarmAck` against the LDAP/role-grant ACL. The gateway should
never be reachable in a way that bypasses that check.
**D3 — How rich alarm events reach OPC UA clients.** New `MxEventFamily`
on the existing `StreamEvents` RPC (no second stream). Adds latency
parity with data-change events, reuses the bounded-channel + worker-side
delivery semantics already documented in `gateway.md`.
**D4 — Sub-attribute fallback path stays.** Some Galaxy templates won't
have `$Alarm*` extensions yet; the existing value-driven path remains the
only way to surface alarms for those templates. Both paths feed
`AlarmConditionService`. Driver-native events take precedence when both
are present (more authoritative, lower latency).
**D5 — Where the historian writer lives.** In the **Wonderware historian
sidecar**, not in the gateway. The sidecar already owns `aahClientManaged`,
already has a `WriteAlarmEvents` IPC slot defined in `Ipc/Contracts.cs`, and
already dispatches to an `IAlarmEventWriter` interface — it's just unwired
in `Program.cs:57`. The gateway is for MxAccess (live data + Galaxy
hierarchy); the historian sidecar is for `aahClientManaged` (time-series +
alarms historian). Two different SDKs, two different concerns; keep the
split. Bonus: completing the sidecar's write path also gives it a clearer
long-term role — once the REST-API migration in `histsdk\instructions.md`
takes over reads, write-back keeps the sidecar relevant rather than
retiring it as a read-only relic. **Galaxy-native alarms bypass this
entirely** — System Platform's own `HistorizeToAveva` toggle on the
Galaxy template publishes them directly. The sidecar write path is
exclusively for non-Galaxy producers (today: scripted alarms; future: AB
CIP ALMD or any other lmxopcua-side alarm source the customer wants
unified into AVEVA Historian).
## Track A — mxaccessgw changes
All four PRs land in `c:\Users\dohertj2\Desktop\mxaccessgw\`.
### PR A.1 — proto: add alarm-transition event family + ack/query RPCs
**Files** (`src\MxGateway.Contracts\Protos\mxaccess_gateway.proto`):
1. Extend `MxEventFamily` (line 403):
```
MX_EVENT_FAMILY_ON_ALARM_TRANSITION = 5;
```
2. Extend `MxEvent.body` oneof (line 395) with:
```
OnAlarmTransitionEvent on_alarm_transition = 24;
```
3. New message `OnAlarmTransitionEvent` after the existing event-family
bodies (line 425+). Carry the full MxAccess alarm payload — alarm name,
source object reference, alarm-type-name (e.g. "AnalogLimitAlarm.HiHi"),
transition kind enum (`Raise` / `Acknowledge` / `Clear`), severity (raw
numeric — keep MxAccess scale; mapping to OPC UA 0-1000 happens
server-side in lmxopcua), `original_raise_timestamp`,
`transition_timestamp`, optional `operator_user`, optional
`operator_comment`, alarm `category` string, alarm `description`. Mirror
the field set documented in v1's `GalaxyAlarmTracker`.
4. New RPCs on the `MxAccessGateway` service (line 11):
```
rpc AcknowledgeAlarm(AcknowledgeAlarmRequest) returns (AcknowledgeAlarmReply);
rpc QueryActiveAlarms(QueryActiveAlarmsRequest) returns (stream ActiveAlarmSnapshot);
```
`AcknowledgeAlarmRequest` carries `session_id`, `alarm_full_reference`,
`comment`, `user_principal`. Reply carries `MxStatusProxy`.
`QueryActiveAlarmsRequest` carries `session_id`, optional
`alarm_filter_prefix` (for ConditionRefresh on a sub-tree).
`ActiveAlarmSnapshot` carries the same fields as
`OnAlarmTransitionEvent` plus `current_state` enum (`Active` /
`ActiveAcked` / `Inactive`).
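As a reading aid, the payload described in item 3 and the `current_state` enum from item 4 could be modelled roughly as follows. This is a Python sketch, not the proto itself: only `original_raise_timestamp`, `transition_timestamp`, `operator_user`, `operator_comment`, `category` and `description` are field names from the text; the remaining field names, the float timestamp type, and the `state_after` helper are assumptions made for illustration.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class TransitionKind(Enum):
    RAISE = "Raise"
    ACKNOWLEDGE = "Acknowledge"
    CLEAR = "Clear"


class AlarmState(Enum):
    ACTIVE = "Active"
    ACTIVE_ACKED = "ActiveAcked"
    INACTIVE = "Inactive"


@dataclass(frozen=True)
class AlarmTransitionPayload:
    alarm_name: str                     # hypothetical field name
    source_object_reference: str        # hypothetical field name
    alarm_type_name: str                # e.g. "AnalogLimitAlarm.HiHi"
    transition_kind: TransitionKind
    severity: int                       # raw MxAccess scale, not OPC UA 0-1000
    original_raise_timestamp: float     # assumption: seconds since epoch
    transition_timestamp: float
    category: str = ""
    description: str = ""
    operator_user: Optional[str] = None
    operator_comment: Optional[str] = None


def state_after(kind: TransitionKind, previous: AlarmState) -> AlarmState:
    """Hypothetical helper: the snapshot state implied by one transition."""
    if kind is TransitionKind.RAISE:
        return AlarmState.ACTIVE
    if kind is TransitionKind.CLEAR:
        return AlarmState.INACTIVE
    # Acknowledge only changes the visible state while the alarm is active.
    return AlarmState.ACTIVE_ACKED if previous is AlarmState.ACTIVE else previous
```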
**Tests** (`MxGateway.Tests` — proto/codegen sanity):
- Round-trip Serialize→Deserialize for the new messages with all-fields
populated and empty-optional-fields cases.
- `MxEvent.body` oneof selection guard — supplying multiple bodies
rejected.
**Out of scope:** worker-side wiring (PR A.2), gateway-side dispatch (PR A.3).
PR A.1 is a pure contract-surface change; nothing functional yet.
### PR A.2 — worker: subscribe to MxAccess alarm event source
**Files** (`src\MxGateway.Worker\` — net48/x86):
The MxAccess Toolkit exposes alarm subscription separately from data
subscription. Per AVEVA's MXAccess C++ Toolkit reference (canonical doc
referenced from `gateway.md`), alarm events arrive through the
`IAlarmEventSink` interface registered against the MxAccess `Alarms`
collection of an open session, OR via the MxAccess "alarm provider"
subscription pattern (depends on Toolkit version on the worker host —
verify against the version actually deployed in the worker bin during
PR A.2).
1. Worker subscribes to MxAccess alarms once per session, with a single
sink that fans out into the same bounded channel the data-change pump
uses (`MxGateway.Worker\Eventing\EventChannel.cs` or whatever the worker
currently calls its sink — verify name during the PR).
2. Sink translates each MxAccess alarm event into a `WorkerEvent` proto
(defined in `mxaccess_worker.proto`) carrying the new
`OnAlarmTransitionEvent` body. Reuses the existing `worker_sequence`
counter so ordering is preserved across families.
3. Worker honours the same backpressure rules as data-change events —
newest-dropped on full channel, single dropped-counter metric per
family.
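The three rules above (one shared channel, one shared sequence counter, newest-dropped with a per-family counter) can be sketched as below. This is an illustrative Python model, not the worker's net48 code: `BoundedEventChannel` is a hypothetical name, and assigning sequence numbers only to accepted events is an assumption the real worker may not share.

```python
from collections import Counter, deque


class BoundedEventChannel:
    """Single bounded queue shared by every event family."""

    def __init__(self, capacity: int) -> None:
        self._capacity = capacity
        self._queue = deque()
        self._next_sequence = 0
        self.dropped = Counter()  # one dropped-event counter per family

    def try_publish(self, family: str, body: object) -> bool:
        """Enqueue an event; on a full channel drop the incoming (newest) one."""
        if len(self._queue) >= self._capacity:
            self.dropped[family] += 1
            return False
        # Assumption: the sequence is shared across families and only
        # advances for events that were actually accepted.
        self._next_sequence += 1
        self._queue.append((self._next_sequence, family, body))
        return True

    def drain(self) -> list:
        """Remove and return all queued (sequence, family, body) tuples in order."""
        items = list(self._queue)
        self._queue.clear()
        return items
```

Because both families draw from the same counter, a consumer can interleave data-change and alarm events in worker order without a second stream.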
**Tests** (`MxGateway.Worker.Tests`):
- Fake `IAlarmEventSink` source emits canned transitions; assert the
worker forwards each as the right `WorkerEvent` shape.
- Cancellation test — closing the session unsubscribes from MxAccess
alarms cleanly (no leaked sinks if the worker is recycled mid-session).
**Out of scope:** any gateway-side dispatch, any RPC handler — PR A.2
is worker-internal.
### PR A.3 — gateway: dispatch OnAlarmTransition + implement AcknowledgeAlarm
**Files** (`src\MxGateway.Server\`):
1. The session-level event multiplexer (`Sessions\SessionEventStream.cs`
or equivalent — verify name during PR) recognizes the new
`WorkerEvent` body and forwards as an `MxEvent` with family
`MX_EVENT_FAMILY_ON_ALARM_TRANSITION` to the gRPC
`StreamEvents` consumer.
2. New RPC handler `AcknowledgeAlarm` builds an MxAccess `WorkerCommand`
carrying an `AlarmAcknowledgeCommand` (new in `mxaccess_worker.proto`
under PR A.1). Forwarded to the worker; reply mapped to
`AcknowledgeAlarmReply` with the MxAccess `MxStatus` proxy populated.
3. AuthN — same API-key + scope check as existing RPCs. Add a new scope
`invoke:alarm-ack` (mirrors `invoke:write` granularity); existing keys
without it return `PERMISSION_DENIED`.
**Tests** (`MxGateway.Tests`, `MxGateway.IntegrationTests`):
- Unit: dispatch test — fake worker emits an `AlarmTransition` event;
assert the gateway forwards it on the live `StreamEvents` channel of
every subscribed session.
- Integration: end-to-end against the real worker (requires the parity
rig setup — see `docs\v2\Galaxy.ParityRig.md` in lmxopcua for the
MxAccess-installed dev box prerequisites). Trigger a real Galaxy
alarm, assert the gateway emits `OnAlarmTransition`. Acknowledge via
the new RPC, assert the alarm transitions to `ActiveAcked` and an
`Acknowledge` transition event is emitted back.
- AuthN: existing key without `invoke:alarm-ack` scope rejected.
### PR A.4 — gateway: ConditionRefresh snapshot via QueryActiveAlarms
**Files** (`src\MxGateway.Server\`, `src\MxGateway.Worker\`):
1. Worker exposes a `QueryActiveAlarmsCommand` that walks the session's
active-alarm collection and streams snapshots back through the
existing command-reply channel. The MxAccess Toolkit's
`Alarms.GetActive()` (verify exact API name during PR) is the
underlying call.
2. Gateway RPC `QueryActiveAlarms` opens a server-streaming reply,
batches snapshots through.
3. AuthN — new scope `invoke:alarm-query` (separate from ack so a
read-only client can refresh without ack rights).
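The scope split between PR A.3 (`invoke:alarm-ack`) and PR A.4 (`invoke:alarm-query`) amounts to a per-RPC lookup. A minimal Python sketch, assuming a key's scopes are available as a set of strings; `REQUIRED_SCOPES` and `check_scope` are hypothetical names, and the gateway's real check lives in its existing API-key pipeline:

```python
from typing import Set

# Scope names are from the plan; the table itself is illustrative.
REQUIRED_SCOPES = {
    "AcknowledgeAlarm": "invoke:alarm-ack",
    "QueryActiveAlarms": "invoke:alarm-query",
}


def check_scope(rpc_name: str, key_scopes: Set[str]) -> str:
    """Return 'OK' or 'PERMISSION_DENIED' for one RPC against one API key."""
    required = REQUIRED_SCOPES.get(rpc_name)
    if required is not None and required not in key_scopes:
        return "PERMISSION_DENIED"
    return "OK"
```

With this shape a read-only key holding only `invoke:alarm-query` can run a ConditionRefresh but cannot acknowledge.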
**Tests:**
- Worker-test: synthetic active set of 0 / 1 / 100 alarms; assert
pagination respects worker channel capacity.
- Integration: against the parity rig, assert a ConditionRefresh after
reconnect returns every alarm currently `Active` or `ActiveAcked` in
the Galaxy.
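The 0 / 1 / 100 worker test above is essentially a check on chunking. A small Python sketch of that behaviour, assuming the snapshot set is materialised before streaming and that each batch may hold at most `capacity` items; `batch_snapshots` is a hypothetical helper, not a worker API:

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def batch_snapshots(snapshots: Sequence[T], capacity: int) -> Iterator[List[T]]:
    """Yield the snapshots in input order, never more than `capacity` per batch."""
    if capacity < 1:
        raise ValueError("capacity must be at least 1")
    for start in range(0, len(snapshots), capacity):
        yield list(snapshots[start:start + capacity])
```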
**Sequencing within Track A:** A.1 → A.2 → A.3 → A.4. A.1 is
mechanical; A.2 + A.3 are the load-bearing changes that unlock the
lmxopcua side. A.4 can ship after lmxopcua starts consuming A.3 output. The
historian-write capability moved to **Track C** below — the gateway
intentionally stays out of `aahClientManaged`.
## Track B — lmxopcua changes
All five PRs land in `c:\Users\dohertj2\Desktop\lmxopcua\`. Each B-PR
depends on a specific A-PR — see the sequencing matrix below.
### PR B.1 — EventPump: dispatch OnAlarmTransition family
**Depends on:** A.1 (proto), A.3 (gateway dispatching the new family).
**Files:**
- `src\ZB.MOM.WW.OtOpcUa.Driver.Galaxy\Runtime\EventPump.cs:160`: the
current `Dispatch(MxEvent ev)` returns early for any non-`OnDataChange`
family. Add a branch:
```csharp
switch (ev.Family) {
    case MxEventFamily.OnDataChange:      DispatchDataChange(ev); break;
    case MxEventFamily.OnAlarmTransition: DispatchAlarmTransition(ev); break;
    // Every other family is still ignored, as before.
    default: return;
}
```
- New `DispatchAlarmTransition` translates the proto event into an
`AlarmEventArgs` (existing type from `Core.Abstractions`) and raises an
internal event the driver subscribes to.
- New `MxAccessSeverityMapper` in `Driver.Galaxy\Runtime\` — maps the
MxAccess raw severity into the `AlarmSeverity` enum + the OPC UA
numeric severity (250 / 500 / 700 / 900 ladder per v1's
`AlarmTracking.md`).
**Tests** (`tests\ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Tests\Runtime\`):
- `EventPumpAlarmTests` — feed three synthetic MxEvents (raise / ack /
clear); assert each fires `OnAlarmEvent` on the driver with correct
payload.
- Severity-mapping table tests — every documented MxAccess severity
level → expected (`AlarmSeverity`, OPC UA numeric) tuple.
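A table-driven mapper keeps the severity tests above trivial to enumerate. The Python sketch below shows the idea only: the 250 / 500 / 700 / 900 ladder is from the text, but the enum member names, the band boundaries, and the direction (higher raw value means more severe) are all hypothetical and must be replaced with the values documented in v1's `AlarmTracking.md`.

```python
from bisect import bisect_right
from enum import Enum
from typing import Tuple


class AlarmSeverity(Enum):
    # Values are the OPC UA numeric severities named in the plan.
    LOW = 250
    MEDIUM = 500
    HIGH = 700
    CRITICAL = 900


# HYPOTHETICAL band lower bounds on the raw MxAccess scale.
HYPOTHETICAL_LOWER_BOUNDS = [0, 250, 500, 750]
LADDER = [AlarmSeverity.LOW, AlarmSeverity.MEDIUM,
          AlarmSeverity.HIGH, AlarmSeverity.CRITICAL]


def map_severity(raw: int) -> Tuple[AlarmSeverity, int]:
    """Return (AlarmSeverity, OPC UA numeric severity) for a raw value."""
    if raw < 0:
        raise ValueError("raw severity must be non-negative")
    # Index of the last band whose lower bound is <= raw.
    index = bisect_right(HYPOTHETICAL_LOWER_BOUNDS, raw) - 1
    severity = LADDER[index]
    return severity, severity.value
```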
### PR B.2 — GalaxyDriver re-implements IAlarmSource
**Depends on:** A.3 (`AcknowledgeAlarm` RPC available), B.1 (event
dispatch).
**Files:**
- `src\ZB.MOM.WW.OtOpcUa.Driver.Galaxy\GalaxyDriver.cs:28` — extend the
class declaration:
```csharp
public sealed class GalaxyDriver
: IDriver, ITagDiscovery, IReadable, IWritable, ISubscribable,
IRediscoverable, IHostConnectivityProbe, IAlarmSource, IDisposable
```
- Implement the four `IAlarmSource` members:
- `SubscribeAlarmsAsync` — no-op returning a sentinel handle. The
driver is already subscribed for data; alarm events arrive on the
same event stream once the gateway emits the new family. (Same
pattern AbCip uses today — see `Driver.AbCip\AbCipDriver.cs:208`.)
- `UnsubscribeAlarmsAsync` — no-op.
- `OnAlarmEvent` — wired to the EventPump branch added in B.1.
- `AcknowledgeAsync` — calls the new gateway RPC via the
`IGalaxyAlarmAcknowledger` abstraction (new file, mirrors the
`IGalaxyDataWriter` pattern), with `GatewayGalaxyAlarmAcknowledger`
as the production implementation in `Runtime\`. Resilience wrapping
via `AlarmSurfaceInvoker` per existing pattern.
- `DriverInstanceFactory` for Galaxy registers
`IGalaxyAlarmAcknowledger` alongside the existing data writer.
**Tests:**
- Subscribe-noop returns a non-null handle; unsubscribe accepts it.
- Acknowledge — fake `IGalaxyAlarmAcknowledger` records the call; assert
the request shape and resilience-pipeline routing.
- End-to-end test in `Driver.Galaxy.Tests` — fake gateway emits a
raise-then-ack event sequence; assert the driver fires `OnAlarmEvent`
twice with matching alarm-id correlation.
### PR B.3 — DriverNodeManager: route to driver-native when present
**Depends on:** B.2.
**Files:**
- `src\ZB.MOM.WW.OtOpcUa.Server\OpcUa\DriverNodeManager.cs` — when
registering an `AlarmConditionState` for a Galaxy variable, check
whether the driver is `IAlarmSource`. If yes, prefer the
`OnAlarmEvent`-driven path; the value-driven sub-attribute path
becomes the secondary path that handles transitions the driver-native
stream missed (network blip, gateway restart, or a template that lacks
the `$Alarm*` extension).
- `Server\Alarms\AlarmConditionService` — already accepts events from
multiple sources; only addition is a `DriverEventOrigin` enum on
internal transitions so the dedup logic prefers the richer
driver-native record over a stale sub-attribute synthesis.
- `IAlarmAcknowledger` resolution in `DriverNodeManager`: prefer the
driver's `IAlarmSource.AcknowledgeAsync` over
`DriverWritableAcknowledger` when both are available. Keep
`DriverWritableAcknowledger` as the fallback for templates without
`$Alarm*` extensions.
**Tests:**
- Two-source-fan-in test: same alarm condition receives both a
driver-native ack event and a sub-attribute value update for the same
transition; assert no duplicate Part 9 transition fires.
- Acknowledger routing: driver implements `IAlarmSource` →
ack-via-RPC; driver implements only `IWritable` → ack-via-write
(existing path).
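The two-source fan-in rule can be sketched as a small classifier keyed on the last transition seen per condition. This is a Python illustration under stated assumptions: `DriverEventOrigin` is the enum name from the plan, but its members, the `TransitionDeduper` class, and the fire / enrich / drop vocabulary are hypothetical, and the real `AlarmConditionService` may correlate transitions differently (for example by timestamp).

```python
from enum import IntEnum
from typing import Dict, Tuple


class DriverEventOrigin(IntEnum):
    # Higher value means a richer, more authoritative record (assumption).
    SUB_ATTRIBUTE = 1
    DRIVER_NATIVE = 2


class TransitionDeduper:
    """Decides whether an incoming transition is new, an upgrade, or a duplicate."""

    def __init__(self) -> None:
        self._last: Dict[str, Tuple[str, DriverEventOrigin]] = {}

    def classify(self, condition_id: str, kind: str, origin: DriverEventOrigin) -> str:
        last = self._last.get(condition_id)
        if last is None or last[0] != kind:
            # First report of this transition: fire the Part 9 event once.
            self._last[condition_id] = (kind, origin)
            return "fire"
        if origin > last[1]:
            # Same transition from the richer source: update metadata only.
            self._last[condition_id] = (kind, origin)
            return "enrich"
        return "drop"
```

Either arrival order yields exactly one `fire` per transition, which is what the fan-in test asserts.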
### PR B.4 — IAlarmHistorianWriter via the historian sidecar IPC
**Depends on:** C.2 (sidecar wires its `IAlarmEventWriter`). See Track C
for the sidecar-side work; B.4 is the lmxopcua-side consumer.
**Files:**
- New `src\ZB.MOM.WW.OtOpcUa.Driver.Historian.Wonderware.Client\SidecarAlarmHistorianWriter.cs`
implementing `IAlarmHistorianWriter`. Sends batches over the existing
named-pipe IPC using the **already-defined**
`WriteAlarmEventsRequest` / `WriteAlarmEventsReply` contracts at
`Ipc\Contracts.cs:153`. No protocol changes — the slot is wired today
on the contract side; only the production behaviour and the consumer
on this side need to land.
- `Server\Phase7\Phase7Composer.ResolveHistorianSink` — already scans
for registered `IAlarmHistorianWriter` instances. Register the new
sidecar-backed writer at server bootstrap when the historian sidecar
is enabled (`appsettings.json` `Historian:Wonderware:Enabled = true`).
`SqliteStoreAndForwardSink` then boots with a real writer attached
and the `NullAlarmHistorianSink` fallback no longer applies on
installs that have the sidecar deployed.
**Tests:**
- `SidecarAlarmHistorianWriter` against a fake `PipeServer`: single
record, batch, and per-row failure modes (Ack / RetryPlease /
PermanentFail) mapped from the sidecar's `PerEventOk[]` reply.
- `Phase7Composer` end-to-end — start the server with the historian
sidecar enabled; assert `ResolveHistorianSink` picks
`SqliteStoreAndForwardSink` with the new sidecar writer attached.
**Note on producer scope:** This path historizes **non-Galaxy alarms
only.** Galaxy-native alarms (with `$Alarm*` extensions) reach AVEVA
Historian directly via System Platform's `HistorizeToAveva` toggle on
the alarm primitive, with no involvement from us. Today the only live
producer feeding `SqliteStoreAndForwardSink` is
`Phase7EngineComposer.RouteToHistorianAsync` for scripted alarms; future
producers (AB CIP ALMD, FOCAS CNC alarms if a customer wants unified
storage) plug into the same path.
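For orientation, the per-row outcomes give the store-and-forward drain a simple contract. The Python sketch below assumes (this is not confirmed by the text) that acknowledged rows are removed from the queue, `RetryPlease` rows stay queued, and `PermanentFail` rows are set aside; `apply_outcomes` is a hypothetical helper, while the outcome names come from the `HistorianWriteOutcome` enum mentioned in Track C.

```python
from enum import Enum
from typing import List, Sequence, Tuple


class HistorianWriteOutcome(Enum):
    ACK = "Ack"
    RETRY_PLEASE = "RetryPlease"
    PERMANENT_FAIL = "PermanentFail"


def apply_outcomes(
    batch: Sequence[str],
    outcomes: Sequence[HistorianWriteOutcome],
) -> Tuple[List[str], List[str]]:
    """Split a drained batch into (retry, dead_letter); acked rows are dropped."""
    if len(batch) != len(outcomes):
        raise ValueError("outcome list must be parallel to the batch")
    retry: List[str] = []
    dead_letter: List[str] = []
    for event, outcome in zip(batch, outcomes):
        if outcome is HistorianWriteOutcome.RETRY_PLEASE:
            retry.append(event)
        elif outcome is HistorianWriteOutcome.PERMANENT_FAIL:
            dead_letter.append(event)
    return retry, dead_letter
```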
### PR B.5 — docs + memory housekeeping
**Depends on:** B.1 / B.2 / B.3 / B.4 all green on the parity rig + D.1
(deployment refresh) verified on the dev rig.
**Files:**
- `docs\drivers\Galaxy.md` — current text says the driver implements
five capability interfaces; update to seven (`IAlarmSource`,
`IAlarmHistorianWriter`-via-companion).
- `docs\AlarmTracking.md` — promote a fresh top-level doc that
describes the v2-final architecture (driver-native primary path +
sub-attribute fallback + scripted-alarm aggregation). Cross-link from
`docs\README.md`. The v1 archive stays as historical record.
- `docs\v1\AlarmTracking.md` — extend the existing historical banner
with "Restored to functional parity in this epic — see
`docs\AlarmTracking.md` for current state."
- Memory entries (`C:\Users\dohertj2\.claude\projects\…\memory\`):
- Update `project_galaxy_via_mxgateway.md` — add the alarm path
restoration.
- Update `project_server_history_alarm_subsystems.md` — note that
`Phase7Composer.ResolveHistorianSink` now finds a writer on
Galaxy installs.
- `docs\plans\alarms-over-gateway.md` (this file) — banner the doc
`✅ Completed YYYY-MM-DD — historical record.` matching the existing
v2-mxgw plan retirement convention.
## Track C — historian sidecar wires the dormant write path
The Wonderware historian sidecar at
`src\ZB.MOM.WW.OtOpcUa.Driver.Historian.Wonderware\` is a separately
deployable Windows service (NSSM-wrapped) that already loads
`aahClientManaged` x64 and serves a named-pipe IPC for read operations.
The `WriteAlarmEvents` IPC slot is defined but unwired (`Program.cs:57`
constructs `HistorianFrameHandler` without an `alarmWriter`). Track C
completes that slot. Two PRs in the sidecar plus one consumer-side PR
(B.4) in lmxopcua finish the path.
### PR C.1 — sidecar: AahClientManagedAlarmEventWriter
**Files** (`src\ZB.MOM.WW.OtOpcUa.Driver.Historian.Wonderware\Backend\`):
1. New `AahClientManagedAlarmEventWriter.cs` implementing the existing
`IAlarmEventWriter` interface (defined in `Ipc\HistorianFrameHandler.cs:242`).
2. Implementation calls `aahClientManaged`'s alarm-event write API —
the same path v1's `GalaxyHistorianWriter` used. Use the existing
`HistorianClusterEndpointPicker` for multi-node routing so write
failures fail over the same way reads do.
3. Batch size + retry behaviour mirrors v1's `GalaxyHistorianWriter`
per-row outcome reporting (`HistorianWriteOutcome` enum: Ack /
PermanentFail / RetryPlease). Map MxStatus codes onto outcomes.
4. Reuses `HistorianDataSource`'s existing connection-pool / health
gating — no new TCP work needed; the same session that serves
reads can issue writes too.
**Tests** (`tests\ZB.MOM.WW.OtOpcUa.Driver.Historian.Wonderware.Tests\`):
- Outcome-mapping table: every documented MxStatus on alarm-write →
expected `HistorianWriteOutcome`.
- Batching: 1 / 100 / 1000 events through a fake `aahClientManaged`
writer; assert per-row outcome list parallel to input order.
- Cluster failover: primary node returns `BadCommunicationError`;
picker rotates to secondary; assert eventual success.
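The cluster-failover test can be pictured with the following Python sketch. It assumes the picker yields nodes in preference order and that a communication failure surfaces as an exception; `write_with_failover` and `CommunicationError` are hypothetical stand-ins, not the `HistorianClusterEndpointPicker` API.

```python
from typing import Callable, List, Optional, Sequence, Tuple


class CommunicationError(Exception):
    """Stand-in for a node-level failure such as BadCommunicationError."""


def write_with_failover(
    nodes: Sequence[str],
    write_batch: Callable[[str, Sequence[str]], List[str]],
    batch: Sequence[str],
) -> Tuple[str, List[str]]:
    """Try each node in order; return (node_used, per_row_outcomes)."""
    last_error: Optional[CommunicationError] = None
    for node in nodes:
        try:
            return node, write_batch(node, batch)
        except CommunicationError as exc:
            last_error = exc  # rotate to the next node
    raise last_error or CommunicationError("no historian nodes configured")
```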
### PR C.2 — sidecar: wire IAlarmEventWriter into Program.cs
**Files** (`src\ZB.MOM.WW.OtOpcUa.Driver.Historian.Wonderware\Program.cs`):
1. Build an `AahClientManagedAlarmEventWriter` next to the existing
`BuildHistorian()` call.
2. Pass it to `HistorianFrameHandler` (currently constructed at line 57
without an `alarmWriter`). The dispatcher already routes
`WriteAlarmEventsRequest` through `_alarmWriter` when non-null
(`HistorianFrameHandler.cs:158-172`); supplying it makes the slot
functional.
3. Gate behind a new env var `OTOPCUA_HISTORIAN_ALARM_WRITE_ENABLED`
(default `true` when `OTOPCUA_HISTORIAN_ENABLED=true`). Lets a
read-only deployment skip the writer registration if needed.
4. Update `Install-Services.ps1` install-time env block in
lmxopcua's `scripts\install\` to include the new toggle.
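The gating rule in item 3 reduces to a two-variable check. A Python sketch, assuming the values are parsed case-insensitively and that an unset `OTOPCUA_HISTORIAN_ENABLED` means disabled (neither is stated in the text); `alarm_write_enabled` is a hypothetical helper:

```python
from typing import Mapping, Optional


def _is_true(value: Optional[str], default: bool) -> bool:
    """Parse a boolean env value; fall back to `default` when unset."""
    if value is None:
        return default
    return value.strip().lower() in {"1", "true", "yes"}


def alarm_write_enabled(env: Mapping[str, str]) -> bool:
    """Alarm write is on by default whenever the historian itself is enabled."""
    if not _is_true(env.get("OTOPCUA_HISTORIAN_ENABLED"), default=False):
        return False
    return _is_true(env.get("OTOPCUA_HISTORIAN_ALARM_WRITE_ENABLED"), default=True)
```

In practice this would be called with `os.environ`; a read-only deployment sets `OTOPCUA_HISTORIAN_ALARM_WRITE_ENABLED=false` to skip writer registration.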
**Tests:**
- `Program.cs` unit-test seam: assert handler is constructed with
alarm writer when enabled and without when disabled.
- Live integration (parity rig): write a synthetic alarm event
through the IPC; query it back via `ReadEvents`; assert
round-trip fidelity.
**Sequencing within Track C:** C.1 → C.2. C.2's lmxopcua-side consumer
is **PR B.4 in Track B**, which depends on C.2 being deployed.
## Track E — client surface refresh
Two surfaces become user-visible when the alarm path lights up: the
**mxaccessgw client SDKs** (5 languages, each with its own CLI) that
consume the new `OnAlarmTransition` event family + `AcknowledgeAlarm`
/ `QueryActiveAlarms` RPCs directly, and the **lmxopcua OPC UA-facing
clients** (Client.CLI, Client.UI) that consume the richer Part 9
condition payload through the OPC UA server. Both need updates so the
new fields actually reach end-users; without Track E, the data
arrives at the gateway / OPC UA server but the off-the-shelf clients
display the same five columns they did under v2-pre-this-epic.
Track E is split per-language so each PR stays small and reviewable.
PRs E.2 through E.6 are independent — they share only the proto
regen from E.1 — and can land in parallel by whoever owns each
language binding.
### PR E.1 — regenerate proto across all client SDKs
**Depends on:** A.1 merged (proto change live).
**Files** (`c:\Users\dohertj2\Desktop\mxaccessgw\clients\`):
1. **.NET** — codegen runs on csproj rebuild via `Grpc.Tools`; just
rebuild `MxGateway.Client.csproj` after pulling A.1.
2. **Python** — run `clients\python\generate-proto.ps1`; commit the
regenerated `_pb2.py` + `_pb2_grpc.py` files under
`clients\python\src\`.
3. **Go** — run `clients\go\generate-proto.ps1`; commit the
regenerated `*.pb.go` + `*_grpc.pb.go` files under
`clients\go\mxgateway\`.
4. **Java** — Gradle's `protobuf-gradle-plugin` regenerates on
`gradle build`; verify the new types appear in the build
output. Commit any pinned generated source under
`clients\java\mxgateway-client\src\main\java\` if that's the
convention (check `JavaClientDesign.md`).
5. **Rust**: `build.rs` runs `tonic-build` on the proto; just
`cargo build`. Generated code lives under
`clients\rust\target\` (gitignored) — nothing to commit;
verify the new types compile.
No hand-written code in this PR. Pure regen + commit of generated
artifacts. Per-language pre-existing proto-regen tests in each
client's test suite must stay green.
### PR E.2 — .NET client SDK + CLI
**Depends on:** E.1, A.3 (gateway alarm dispatch + ack RPC live).
**Files** (`clients\dotnet\MxGateway.Client\` + `MxGateway.Client.Cli\`):
1. `MxGatewayClient.cs` — new public methods:
```csharp
IAsyncEnumerable<AlarmTransition> SubscribeAlarmsAsync(
    MxGatewaySession session,
    AlarmFilter? filter = null,
    CancellationToken ct = default);

Task<MxStatus> AcknowledgeAlarmAsync(
    MxGatewaySession session,
    string alarmFullReference,
    string comment,
    string userPrincipal,
    CancellationToken ct = default);

IAsyncEnumerable<ActiveAlarmSnapshot> QueryActiveAlarmsAsync(
    MxGatewaySession session,
    string? filterPrefix = null,
    CancellationToken ct = default);
```
Existing `MxGatewayClientRetryPolicy` covers the new operations
without bespoke retry config.
2. `MxGateway.Client.Cli` — add `alarms` verb with subcommands:
`subscribe` (streams transitions until cancelled),
`acknowledge --ref <full-ref> --comment "<text>"`,
`query-active [--prefix <equipment>]`. Output formatting mirrors
the existing `events stream` verb (default human-readable +
`--json` flag for machine output).
3. AuthN — `MxGatewayClientOptions` validates new scopes
`invoke:alarm-ack` / `invoke:alarm-query` exist on the API key
when those operations are invoked; pre-flight check fails fast
with a clear error rather than letting the gateway return
`PERMISSION_DENIED` mid-stream.
**Tests** (`clients\dotnet\MxGateway.Client.Tests\`):
- `FakeGatewayTransport` extended to emit `OnAlarmTransition`
events; assert `SubscribeAlarmsAsync` yields each as the right
payload shape.
- Ack: assert request shape, retry policy, and error wrapping
(Unauthenticated → `MxGatewayAuthenticationException`,
PermissionDenied → `MxGatewayAuthorizationException`,
resource-exhausted → `MxGatewayException` with the right
message).
- CLI verb tests in `MxGatewayClientCliTests.cs` — argument
parsing, JSON output shape, exit codes.
### PR E.3 — Python client SDK + CLI
**Depends on:** E.1.
**Files** (`clients\python\src\mxgateway\` + the existing CLI entry
point — verify the exact name during PR; `PythonClientDesign.md`
documents it):
1. New module `alarms.py` exposing async helpers:
```python
async def subscribe_alarms(session, *, filter=None) -> AsyncIterator[AlarmTransition]: ...
async def acknowledge_alarm(session, *, alarm_ref, comment, user) -> MxStatus: ...
async def query_active_alarms(session, *, prefix=None) -> AsyncIterator[ActiveAlarmSnapshot]: ...
```
2. CLI: add `alarms subscribe / acknowledge / query-active` verbs.
Use the same JSON output schema as E.2's CLI so cross-language
tooling can parse either.
3. Type stubs (`*.pyi`) updated for the new types.
**Tests** (`clients\python\tests\`):
- pytest-asyncio fixtures using a stub gRPC server; assert each
helper's request/response shape.
- CLI smoke via `subprocess` + captured stdout JSON comparison.
### PR E.4 — Go client SDK + CLI
**Depends on:** E.1.
**Files** (`clients\go\mxgateway\` + `clients\go\cmd\`):
1. New `alarms.go` exposing:
```go
func (c *Client) SubscribeAlarms(ctx context.Context, opts ...SubscribeOption) (<-chan AlarmTransition, error)
func (c *Client) AcknowledgeAlarm(ctx context.Context, ref, comment, user string) (MxStatus, error)
func (c *Client) QueryActiveAlarms(ctx context.Context, prefix string) ([]ActiveAlarmSnapshot, error)
```
2. CLI: add `alarms` subcommand under `clients\go\cmd\mxgateway-cli\`
(verify the binary name in `GoClientDesign.md`). Same verb shape
as E.2 / E.3.
3. Errors wrap named sentinels (`ErrAuthFailed`,
`ErrPermissionDenied`, etc.) so callers can distinguish failure
modes programmatically with `errors.Is`.
**Tests:** standard Go table-driven tests against a stub gRPC server
under `clients\go\internal\testserver\`.
### PR E.5 — Java client SDK + CLI
**Depends on:** E.1.
**Files** (`clients\java\mxgateway-client\src\main\java\` +
`clients\java\mxgateway-cli\`):
1. New methods on the existing client class (verify in
`JavaClientDesign.md`):
```java
Flowable<AlarmTransition> subscribeAlarms(Session s, AlarmFilter filter);
Single<MxStatus> acknowledgeAlarm(Session s, String alarmRef, String comment, String user);
Flowable<ActiveAlarmSnapshot> queryActiveAlarms(Session s, String prefix);
```
(RxJava idiom matching the existing data-change subscription
API; if the existing API uses `CompletableFuture` instead, follow
that convention — verify during PR.)
2. CLI: same `alarms subscribe / acknowledge / query-active`
verbs.
**Tests:** JUnit 5 + a stub gRPC server. CLI tested via
`ProcessBuilder` exec + JSON output comparison.
### PR E.6 — Rust client SDK
**Depends on:** E.1.
**Files** (`clients\rust\crates\mxgateway-client\src\` +
likely a `mxgateway-cli` crate — verify in `RustClientDesign.md`):
1. New methods on the client struct:
```rust
pub fn subscribe_alarms(&self, filter: Option<AlarmFilter>) -> impl Stream<Item = Result<AlarmTransition>>;
pub async fn acknowledge_alarm(&self, alarm_ref: &str, comment: &str, user: &str) -> Result<MxStatus>;
pub fn query_active_alarms(&self, prefix: Option<&str>) -> impl Stream<Item = Result<ActiveAlarmSnapshot>>;
```
2. CLI: same verb shape.
3. `thiserror`-based error enum extended with `AlarmAckPermissionDenied`
etc. variants if the existing pattern uses one.
**Tests:** `#[tokio::test]` cases against an in-process stub gRPC
server that implements the service trait generated by `tonic-build`.
CLI tested via `assert_cmd`.
### PR E.7 — lmxopcua OPC UA-facing client refresh
**Depends on:** B.2 + B.3 (server-side payload final on the OPC UA
wire). Independent of E.2-E.6 — different consumer surface (OPC UA
Part 9, not gateway gRPC).
**Files** (`c:\Users\dohertj2\Desktop\lmxopcua\src\`):
1. `Core.Abstractions\AlarmEventArgs.cs` *(extend, not new)* — add
optional fields the new path surfaces:
- `OperatorComment` (nullable string — populated by the native
ack path; null on sub-attribute fallback path)
- `OriginalRaiseTimestampUtc` (nullable; null on fallback path)
- `AlarmCategory` (nullable string)
- `AlarmTypeName` (already exists per v1 docs — leave alone)
2. `Server\OpcUa\DriverNodeManager.cs` — populate the corresponding
OPC UA Part 9 condition fields when the new payload is non-null:
`Comment` (from OperatorComment), `Time` (from OriginalRaiseTimestampUtc
when present, else event arrival time), `ConditionClassName` (from
AlarmCategory if mapping is defined).
3. `Client.Shared\Models\AlarmEventArgs.cs` — mirror the new fields
on the client-side DTO.
4. `Client.CLI\Commands\AlarmsCommand.cs` — add columns under a new
`--verbose` flag, plus full payload under `--json`. Default output
stays five-column compatible.
5. `Client.UI\ViewModels\AlarmEventViewModel.cs` — bind the new
fields. Add columns to `Views\AlarmsView.axaml` (collapsible
under a "Show details" toggle so the default view stays compact).
Surface `OperatorComment` in `AckAlarmWindow.axaml` as a
prepopulated default when re-acknowledging an already-acked
alarm.
6. `docs\Client.CLI.md` — add the new `--verbose` and `--json`
flag examples to the alarms section.
7. `docs\Client.UI.md` — add a screenshot or description of the
"Show details" expansion behavior.
8. `docs\reqs\ClientRequirements.md` — lines 116 + 153 reference
the alarm subscription contract; extend the field list to cover
the new payload.
9. `docs\AlarmTracking.md` (new in B.5) — wire in client-side
examples.
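
The field mapping in item 2 (populate Part 9 fields only when the new payload is non-null, and fall back to arrival time for `Time`) can be sketched as below. This is a Python illustration of the rule, not the `DriverNodeManager` code; the snake_case field names mirror the C# properties, and the category-to-class table is a hypothetical placeholder because the plan only says "if mapping is defined".

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Dict, Optional


@dataclass
class AlarmEventArgs:
    # Only the optional fields named in the plan; all are null on the
    # sub-attribute fallback path.
    operator_comment: Optional[str] = None
    original_raise_timestamp_utc: Optional[datetime] = None
    alarm_category: Optional[str] = None


# Hypothetical mapping table; real entries are not specified in the plan.
CATEGORY_TO_CONDITION_CLASS: Dict[str, str] = {"Process": "ExampleProcessClass"}


def to_condition_fields(args: AlarmEventArgs, arrival_time: datetime) -> dict:
    """Build the Part 9 condition fields, applying the documented fallbacks."""
    raised = args.original_raise_timestamp_utc
    fields = {"Time": raised if raised is not None else arrival_time}
    if args.operator_comment is not None:
        fields["Comment"] = args.operator_comment
    if args.alarm_category in CATEGORY_TO_CONDITION_CLASS:
        fields["ConditionClassName"] = CATEGORY_TO_CONDITION_CLASS[args.alarm_category]
    return fields
```

On the fallback path every optional field is null, so the result carries only `Time` set to the arrival time, which keeps existing consumers unaffected.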
**Tests:**
- `Client.Shared.Tests` — DTO round-trip through the alarm event
pump with all fields populated and all-null cases.
- `Client.CLI.Tests` — `--verbose` column ordering, `--json`
schema validation, default output stays five-column.
- `Client.UI.Tests` — `AlarmEventViewModel` bindings exposed,
collapsible-detail toggle behavior.
### Sequencing within Track E:
E.1 first (mechanical). E.2-E.7 can land in parallel. E.7 has its own
dependency chain inside lmxopcua (B.2 + B.3) and doesn't gate any
other E PR. The .NET client (E.2) is the only language SDK
**lmxopcua** consumes today; if the gateway repo's release schedule
prefers landing E.2 first and shipping E.3-E.6 in a follow-up release,
that's a valid sequence — the customer-facing constraint is "at
least one language SDK ships at the same time as A.4 lights up the
gateway dispatch."
## Track D — deployment refresh
The dev box at `DESKTOP-6JL3KKO` runs three live services from
`C:\publish\` (installed in the session that produced commit
`ea04547`'s install scripts). Once Tracks A / B / C are merged, the
deployed binaries need to be refreshed so the running services pick
up the new alarm path. Track D is one PR — pure ops, no code change.
### PR D.1 — refresh C:\publish + restart services
**Depends on:** A.4 + B.4 + C.2 + E.2 merged (every code-change PR on the deployment path landed; E.3-E.6 ship on their own cadence).
**Order matters** — services must stop in reverse-dependency order
(`OtOpcUa` → `OtOpcUaWonderwareHistorian` → `MxAccessGw`) and start in
forward-dependency order (`MxAccessGw` → `OtOpcUaWonderwareHistorian` →
`OtOpcUa`). Touching binaries while a dependent service holds them
locked produces the publish-time `MSB3027` file-lock error caught
during the original install (see commit `80104ca`).
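
The ordering rule is simply "stop in reverse, refresh, start forward". A small Python sketch of that plan follows; the service names come from this section, while `refresh_plan` and the `refresh-binaries` marker are illustrative names, not part of `Refresh-Services.ps1`.

```python
# Forward-dependency (start) order as stated in the plan.
START_ORDER = ("MxAccessGw", "OtOpcUaWonderwareHistorian", "OtOpcUa")


def refresh_plan(start_order=START_ORDER):
    """Return (action, service) steps: stop in reverse, refresh, start forward."""
    stops = [("stop", name) for name in reversed(start_order)]
    starts = [("start", name) for name in start_order]
    # Binaries are only touched once every service has been stopped.
    return stops + [("refresh-binaries", "*")] + starts
```

Deriving the stop order from the start order, rather than maintaining two lists, means a newly added service cannot end up in one sequence and not the other.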
**Steps (run as a single PowerShell session on the deploy host):**
1. **Stop in reverse order**:
```powershell
nssm stop OtOpcUa
nssm stop OtOpcUaWonderwareHistorian
nssm stop MxAccessGw
Start-Sleep -Seconds 3
Get-Process MxGateway.Server, MxGateway.Worker, OtOpcUa.Server, `
OtOpcUa.Driver.Historian.Wonderware -ErrorAction SilentlyContinue |
Stop-Process -Force
```
2. **Refresh mxaccessgw binaries** (Track A output):
```powershell
$gwSrc = "C:\Users\dohertj2\Desktop\mxaccessgw"
dotnet build "$gwSrc\src\MxGateway.Worker" -c Release
dotnet build "$gwSrc\src\MxGateway.Server" -c Release
Copy-Item -Recurse -Force `
"$gwSrc\src\MxGateway.Server\bin\Release\net10.0\*" `
"C:\publish\mxaccessgw\Server\"
Copy-Item -Recurse -Force `
"$gwSrc\src\MxGateway.Worker\bin\x86\Release\net48\*" `
"C:\publish\mxaccessgw\Worker\"
```
3. **Refresh OtOpcUa + historian sidecar binaries** (Tracks B + C
output):
```powershell
$repo = "C:\Users\dohertj2\Desktop\lmxopcua"
dotnet publish "$repo\src\ZB.MOM.WW.OtOpcUa.Server" `
-c Release -o "C:\publish\lmxopcua"
dotnet publish "$repo\src\ZB.MOM.WW.OtOpcUa.Driver.Historian.Wonderware" `
-c Release -o "C:\publish\lmxopcua\WonderwareHistorian"
```
4. **Update service env block if Track C added the new toggle**:
```powershell
# Pull existing env, append OTOPCUA_HISTORIAN_ALARM_WRITE_ENABLED=true
# (default-on per C.2 design, but explicit assignment lets us flip false
# for read-only deployments without re-installing)
nssm set OtOpcUaWonderwareHistorian AppEnvironmentExtra `
(((nssm get OtOpcUaWonderwareHistorian AppEnvironmentExtra) `
+ "`r`nOTOPCUA_HISTORIAN_ALARM_WRITE_ENABLED=true"))
```
5. **Start in forward order**:
```powershell
nssm start MxAccessGw
Start-Sleep -Seconds 4
nssm start OtOpcUaWonderwareHistorian
Start-Sleep -Seconds 4
nssm start OtOpcUa
Start-Sleep -Seconds 8
```
6. **Smoke verification:**
```powershell
foreach ($s in 'MxAccessGw','OtOpcUaWonderwareHistorian','OtOpcUa') {
(Get-Service $s).Status
}
foreach ($p in 5120, 4840, 4841) {
Get-NetTCPConnection -LocalPort $p -State Listen `
-ErrorAction SilentlyContinue
}
Get-Content "C:\publish\lmxopcua\logs\otopcua-*.log" -Tail 20
Get-Content "C:\publish\mxaccessgw\stdout.log" -Tail 20
Get-Content "C:\ProgramData\OtOpcUa\historian-wonderware-*.log" -Tail 10
```
Pass criterion: all three services `Running`; ports 5120 + 4840
listening; sidecar log shows `Wonderware historian sidecar
serving — pipe=OtOpcUaWonderwareHistorian`; OtOpcUa log shows
`OPC UA server started — endpoint=opc.tcp://0.0.0.0:4840/OtOpcUa`
and a new line `IAlarmHistorianWriter resolved: Sidecar` (added
in B.4).
7. **Functional verification — fire one alarm of each kind and assert
it propagates:**
- **Galaxy-native** — raise the `OtOpcUaParityTest_001.Counter`
`$Alarm*` extension via Galaxy's alarm-fire mechanism; assert an
OPC UA Part 9 transition reaches a connected `otopcua-cli alarms`
subscriber with rich payload (operator-comment field non-null,
original-raise-timestamp present). This validates Track A + B.1
+ B.2 + B.3.
- **Scripted** — author a one-line scripted alarm in the Admin UI
against any always-true predicate; assert the transition lands in
AVEVA Historian via `aaHistClientTrend` query (or
`Driver.Historian.Wonderware.IntegrationTests` with a query for
the alarm event). Validates Track C + B.4.
- **Sub-attribute fallback** — disable `IAlarmSource` on the
GalaxyDriver via the test seam (B.3 will introduce one); fire an
alarm; assert Part 9 transition still raised by the value-driven
path. Validates the fallback wasn't broken.
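
The pass criterion from step 6 can be written as a pure check over captured output, which is handy when producing the rollout artifact. The sketch below is an assumption-laden illustration in Python: the service names, ports, and log fragments are copied from step 6, but `smoke_failures` and its parameters are hypothetical and the real script may gather the data differently.

```python
SERVICES = ("MxAccessGw", "OtOpcUaWonderwareHistorian", "OtOpcUa")
REQUIRED_PORTS = (5120, 4840)
SIDECAR_MARKER = "pipe=OtOpcUaWonderwareHistorian"
OTOPCUA_MARKERS = (
    "endpoint=opc.tcp://0.0.0.0:4840/OtOpcUa",
    "IAlarmHistorianWriter resolved: Sidecar",
)


def smoke_failures(statuses, listening_ports, sidecar_log, otopcua_log):
    """Return a list of human-readable failures; an empty list means pass."""
    failures = []
    for name in SERVICES:
        status = statuses.get(name, "missing")
        if status != "Running":
            failures.append(f"service {name} is {status}")
    for port in REQUIRED_PORTS:
        if port not in listening_ports:
            failures.append(f"port {port} not listening")
    if SIDECAR_MARKER not in sidecar_log:
        failures.append("sidecar log lacks the serving line")
    for marker in OTOPCUA_MARKERS:
        if marker not in otopcua_log:
            failures.append(f"OtOpcUa log lacks '{marker}'")
    return failures
```

Returning every failure instead of stopping at the first one gives the captured artifact a complete picture of a bad rollout.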
**Files:**
- `scripts\install\Refresh-Services.ps1` *(new — automates the above)*
- `docs\v2\dev-environment.md` — add the refresh script to the dev
workflow section.
**Tests:** smoke run on the dev rig (`DESKTOP-6JL3KKO`) producing
`docs\plans\artifacts\d1-rollout-YYYY-MM-DD.md` with the captured log
tails + smoke-test assertions. Captured artifact lands as part of the
PR.
**Rollback:** the refresh script keeps a timestamped backup of the
existing `C:\publish\mxaccessgw\` and `C:\publish\lmxopcua\` trees
before overwriting (mirrored to `C:\publish\.backup-YYYY-MM-DD\`).
Rollback is a stop / restore-from-backup / start sequence; no service
re-install needed since the NSSM service definitions don't change.
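
The backup-then-restore behaviour described for rollback could look roughly like the following. It is a Python sketch of the idea only: the tree names and the `.backup-YYYY-MM-DD` layout come from the paragraph above, while `backup_trees`, `restore_trees`, and their parameters are hypothetical and do not describe the PowerShell script's real interface.

```python
import shutil
from datetime import date
from pathlib import Path

TREES = ("mxaccessgw", "lmxopcua")


def backup_trees(publish_root: Path, names=TREES, today=None) -> Path:
    """Mirror each publish tree into <publish_root>/.backup-YYYY-MM-DD/."""
    stamp = (today or date.today()).isoformat()
    backup_root = publish_root / f".backup-{stamp}"
    for name in names:
        source = publish_root / name
        if source.is_dir():
            shutil.copytree(source, backup_root / name, dirs_exist_ok=True)
    return backup_root


def restore_trees(publish_root: Path, backup_root: Path, names=TREES) -> None:
    """Replace each live tree with its backed-up copy (services must be stopped)."""
    for name in names:
        saved = backup_root / name
        if saved.is_dir():
            target = publish_root / name
            shutil.rmtree(target, ignore_errors=True)
            shutil.copytree(saved, target)
```

Restore removes the live tree before copying so that files added by the failed refresh do not linger next to the restored binaries.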
**Production deploy:** out of scope for D.1 — the dev rig is the only
deployment in scope at this point. A separate PR-or-runbook lands the
production refresh once the dev rig has soaked for the documented
duration (parity-rig validation gate; see "Test gates" below).
## Sequencing matrix
```
Track A (mxaccessgw) Track B (lmxopcua) Track C (sidecar) Track E (clients)
───────────────────────── ───────────────────────── ───────────────────── ──────────────────────────
A.1 proto (waits) C.1 AahClientManagedWriter E.1 proto regen ×5 langs
│ │ │ (mechanical, after A.1)
├──────────────────────────► B.1 EventPump branch │ │
A.2 worker subscription │ uses proto types only │ │
│ │ unit-testable │ │
│ C.2 Program.cs wires │
A.3 gateway dispatch + ack RPC ──►B.2 GalaxyDriver : IAlarmSource │ ──►E.2 .NET SDK + CLI
│ │ │ ──►E.3 Python SDK + CLI
│ ──►B.3 DriverNodeManager routing │ ──►E.4 Go SDK + CLI
│ │ ──►E.5 Java SDK + CLI
│ │ ──►E.6 Rust SDK
A.4 ConditionRefresh │ │ │
│ │ │
B.4 SidecarAlarmHistorianWriter │
(depends on C.2 deployed) │ │
│ │
(B.2 + B.3 done) ────────────────────────────────────────────► E.7 lmxopcua client refresh
│ │
▼ │
Track D (deployment) │
───────────────────────── │
D.1 Refresh C:\publish + restart services │
(depends on A.4 + B.4 + C.2 + E.2 merged) │
▼ │
──►B.5 docs + memory + completion banner ◄─────────(E.7 done)──┘
```
A.1 + B.1 + C.1 + E.1 can all land in parallel — none have cross-repo
runtime dependencies. B.1's tests use proto types without needing a
running gateway. C.1 is purely sidecar-internal. E.1 is mechanical
codegen.
The gateway-side dispatch (A.3) gates B.2 and E.2-E.6. The
sidecar-side wiring (C.2) gates B.4. E.7 gates on B.2 + B.3 only —
it's the OPC UA client surface, not the gateway client surface.
D.1 (deployment refresh) requires E.2 to also be merged because the
deployed `MxGateway.Client.dll` consumed by GalaxyDriver needs the new
methods. E.3-E.6 (other-language SDKs) don't gate D.1 — they ship on
their own release cadence.
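
The gating rules in this section (including B.5 waiting on D.1 + E.7) can be encoded as a small dependency graph and checked mechanically. The sketch below uses Python's standard `graphlib`; the edges are transcribed from the prose here, and `GATES` / `merge_order` are illustrative names. Edges stated only in other sections (for example each E PR's dependency on E.1) are deliberately left out.

```python
from graphlib import TopologicalSorter

# Each PR maps to the set of PRs that must be merged before it.
GATES = {
    "B.2": {"A.3"},
    "E.2": {"A.3"},
    "E.3": {"A.3"},
    "E.4": {"A.3"},
    "E.5": {"A.3"},
    "E.6": {"A.3"},
    "B.4": {"C.2"},
    "E.7": {"B.2", "B.3"},
    "D.1": {"A.4", "B.4", "C.2", "E.2"},
    "B.5": {"D.1", "E.7"},
}


def merge_order(gates=GATES):
    """Return one valid merge order; raises graphlib.CycleError on a cycle."""
    return list(TopologicalSorter(gates).static_order())
```

Any order returned places D.1 after E.2 and B.5 after both D.1 and E.7, while E.3-E.6 remain free to land at any point after A.3.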
B.5 (docs sweep) gates on D.1 + E.7 both merged — it's the final
"snapshot the as-shipped state" pass.
## Test gates
Per PR: unit tests pass + build green + analyzer clean (Roslyn
analyzer OTOPCUA0001 still enforces that every alarm-capability call
goes through `AlarmSurfaceInvoker`).
End-of-epic gate: re-run the parity rig (`docs\v2\Galaxy.ParityRig.md`)
with these scenarios added:
1. **Native alarm raise** — Galaxy `$Alarm*` raise with operator-time
metadata appears as an OPC UA Part 9 transition with full payload
(no longer reconstructed from sub-attribute writes).
2. **Native ack** — OPC UA client acks; assert the gateway records the
ack against MxAccess directly (not via sub-attribute write); operator
comment present in the resulting `Acknowledged` transition.
3. **ConditionRefresh after reconnect** — disconnect the GalaxyDriver,
raise three alarms in Galaxy, reconnect; assert all three appear in
the next ConditionRefresh.
4. **Historian write-back** — fire a scripted alarm; assert it arrives in
AVEVA Historian via the gateway path (use the existing Historian
sidecar's read API to query it back).
5. **Sub-attribute fallback still works** — disable `IAlarmSource` on
the GalaxyDriver via test seam, fire a sub-attribute value change;
assert Part 9 transition still raised.
Soak target: 24h × 1k tags (light) — same parity-rig harness but
extended to also subscribe to alarms. Pass criterion: zero dropped
alarm transitions, zero state-machine inversions, zero unhandled
exceptions in the AlarmSurfaceInvoker pipeline.
## Risks and mitigations
| Risk | Mitigation |
|---|---|
| MxAccess Toolkit alarm subscription API differs across installed AVEVA versions | PR A.2 verifies against the worker-host's installed Toolkit version; documents the exact API used. Pin the worker DLL set per major MxAccess version if needed. |
| Worker-side alarm subscription leaks between sessions if cleanup is wrong | PR A.2 includes a session-recycle test that asserts no `IAlarmEventSink` instances remain registered after Close. |
| Gateway adds a new auth scope (`invoke:alarm-ack`); existing keys lack it | PR A.3 + A.5 ship with a one-time bootstrap migration: keys with `invoke:write` get the new scope auto-granted on the dev rig and parity rig. Production keys are reissued via `apikey rotate-key` (existing CLI). |
| Two simultaneous alarm sources (driver-native + sub-attribute) double-fire transitions | PR B.3 dedup is the load-bearing design. End-to-end test #1 covers it explicitly. |
| Historian write-back batch fails mid-batch — partial success | The existing `SqliteStoreAndForwardSink.HistorianWriteOutcome` per-row enum + dead-letter retention already handles this; PR A.5 just exposes the same outcome shape over gRPC. |
| Sidecar starts honouring the `WriteAlarmEvents` slot — old lmxopcua-side consumers can now reach a previously inert path | The slot returns `Success=false, Error="not configured"` today; flipping to live writes means a build that *speculatively* sent the frame would suddenly start producing real historian rows. Inventory of any such caller is empty — `WriteAlarmEvents` was never invoked from the lmxopcua side; `Phase7EngineComposer.RouteToHistorianAsync` queues into `SqliteStoreAndForwardSink` and the drain worker is gated on `IAlarmHistorianWriter` registration which only the new B.4 path provides. So enabling C.2 without B.4 is safe. |
## Roll-out
Track A lands first onto `mxaccessgw/main`, deployed to the parity rig.
Track B lands onto `lmxopcua/master` once A.3 is live on the rig — earlier
Track B PRs can target a feature branch (`feat/alarms-over-gateway`) and
merge to master after the rig is fully green.
## Back-out
Each PR is individually revertable. The cleanest back-out point is at
the gateway-side enum extension: removing `MX_EVENT_FAMILY_ON_ALARM_TRANSITION`
from the proto means EventPump silently drops alarm events again and
GalaxyDriver's `OnAlarmEvent` never fires — but the sub-attribute fallback
path still produces functional alarms, so the OPC UA surface degrades to
v2-current behaviour without breaking. PR B.4 is the only one with a
non-trivial back-out (re-add the deleted sidecar IPC slot if revert
needed); land B.4 last and only after end-of-epic gate is green.
## Out of scope (explicit)
- **Other alarm sources beyond Galaxy.** AbCip / FOCAS / OpcUaClient
drivers already implement `IAlarmSource`; they're untouched.
- **Modbus / S7 / AbLegacy / TwinCAT alarms.** None of those protocols
has a native alarm bus. Alarms on those drivers, if needed, ship via
the scripted-alarm path.
- **Multi-Galaxy ack routing.** Today's gateway model is one Galaxy per
session; if a deployment splits across galaxies, each gets its own
GalaxyDriver and they don't cross-talk. No change.
- **OPC UA Part 9 advanced features** beyond the current scope —
shelving, subscribed-to-events-only, branch-state for re-trigger
semantics. Future epic if a customer asks.
- **Insight / cloud Historian write-back path.** Track A.5 targets the
on-prem AVEVA Historian via aahClientManaged. The cloud variant
would mirror the same gateway RPC over the REST API discussed in
`docs/histsdk` — separate epic.
## File inventory (touched)
**mxaccessgw (Track A):**
- `src\MxGateway.Contracts\Protos\mxaccess_gateway.proto` (A.1)
- `src\MxGateway.Contracts\Protos\mxaccess_worker.proto` (A.2, A.4)
- `src\MxGateway.Worker\…\Eventing\` (A.2, A.3, A.4)
- `src\MxGateway.Worker\…\Commands\` (A.3, A.4)
- `src\MxGateway.Server\Sessions\SessionEventStream.cs` (A.3)
- `src\MxGateway.Server\Rpc\` (A.3, A.4)
- `src\MxGateway.Server\Auth\Scopes.cs` (A.3, A.4)
- `MxGateway.Tests`, `MxGateway.Worker.Tests`, `MxGateway.IntegrationTests`
**lmxopcua — Galaxy driver + server (Track B):**
- `src\ZB.MOM.WW.OtOpcUa.Driver.Galaxy\Runtime\EventPump.cs` (B.1)
- `src\ZB.MOM.WW.OtOpcUa.Driver.Galaxy\Runtime\MxAccessSeverityMapper.cs` *(new — B.1)*
- `src\ZB.MOM.WW.OtOpcUa.Driver.Galaxy\Runtime\IGalaxyAlarmAcknowledger.cs` *(new — B.2)*
- `src\ZB.MOM.WW.OtOpcUa.Driver.Galaxy\Runtime\GatewayGalaxyAlarmAcknowledger.cs` *(new — B.2)*
- `src\ZB.MOM.WW.OtOpcUa.Driver.Galaxy\GalaxyDriver.cs` (B.2)
- `src\ZB.MOM.WW.OtOpcUa.Driver.Galaxy\GalaxyDriverFactory.cs` (B.2)
- `src\ZB.MOM.WW.OtOpcUa.Server\OpcUa\DriverNodeManager.cs` (B.3)
- `src\ZB.MOM.WW.OtOpcUa.Server\Alarms\AlarmConditionService.cs` (B.3)
- `src\ZB.MOM.WW.OtOpcUa.Server\Phase7\Phase7Composer.cs` (B.4)
- `src\ZB.MOM.WW.OtOpcUa.Driver.Historian.Wonderware.Client\SidecarAlarmHistorianWriter.cs` *(new — B.4)*
- `tests\ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Tests\Runtime\` (B.1, B.2)
- `tests\ZB.MOM.WW.OtOpcUa.Server.Tests\Alarms\` (B.3)
- `tests\ZB.MOM.WW.OtOpcUa.Driver.Historian.Wonderware.Client.Tests\` (B.4 — new tests)
- `docs\drivers\Galaxy.md` (B.5)
- `docs\AlarmTracking.md` *(new — B.5)*
- `docs\v1\AlarmTracking.md` (B.5 — banner update)
- `docs\plans\alarms-over-gateway.md` (B.5 — completion banner)
**lmxopcua — Wonderware historian sidecar (Track C):**
- `src\ZB.MOM.WW.OtOpcUa.Driver.Historian.Wonderware\Backend\AahClientManagedAlarmEventWriter.cs` *(new — C.1)*
- `src\ZB.MOM.WW.OtOpcUa.Driver.Historian.Wonderware\Program.cs` (C.2 — wire writer)
- `scripts\install\Install-Services.ps1` (C.2 — env-var toggle for write-enable)
- `tests\ZB.MOM.WW.OtOpcUa.Driver.Historian.Wonderware.Tests\` (C.1 — outcome mapping + batch + cluster failover)
**lmxopcua — deployment refresh (Track D):**
- `scripts\install\Refresh-Services.ps1` *(new — D.1)*
- `docs\v2\dev-environment.md` (D.1 — document the refresh workflow)
- `docs\plans\artifacts\d1-rollout-YYYY-MM-DD.md` *(new — D.1 captured smoke run)*
**mxaccessgw — client SDKs (Track E):**
- `clients\proto\` — no source change; downstream codegen consumes A.1
- **.NET (E.2)**:
- `clients\dotnet\MxGateway.Client\MxGatewayClient.cs`
- `clients\dotnet\MxGateway.Client\Alarms\` *(new namespace)*
- `clients\dotnet\MxGateway.Client.Cli\Verbs\AlarmsVerb.cs` *(new)*
- `clients\dotnet\MxGateway.Client.Tests\AlarmsTests.cs` *(new)*
- **Python (E.3)**:
- `clients\python\src\mxgateway\alarms.py` *(new)*
- `clients\python\src\mxgateway\cli\alarms.py` *(new — verify CLI module path)*
- `clients\python\tests\test_alarms.py` *(new)*
- **Go (E.4)**:
- `clients\go\mxgateway\alarms.go` *(new)*
- `clients\go\cmd\mxgateway-cli\alarms.go` *(new — verify dir name)*
- `clients\go\internal\testserver\alarms_test.go` *(new)*
- **Java (E.5)**:
- `clients\java\mxgateway-client\src\main\java\…\AlarmsApi.java` *(new)*
- `clients\java\mxgateway-cli\src\main\java\…\AlarmsCommand.java` *(new)*
- `clients\java\mxgateway-client\src\test\java\…\AlarmsApiTest.java` *(new)*
- **Rust (E.6)**:
- `clients\rust\crates\mxgateway-client\src\alarms.rs` *(new)*
- `clients\rust\crates\mxgateway-cli\src\alarms.rs` *(new — verify crate name)*
- `clients\rust\tests\alarms.rs` *(new)*
**lmxopcua — OPC UA client refresh (Track E.7):**
- `src\ZB.MOM.WW.OtOpcUa.Core.Abstractions\AlarmEventArgs.cs` (extend)
- `src\ZB.MOM.WW.OtOpcUa.Server\OpcUa\DriverNodeManager.cs` (Part 9 field population)
- `src\ZB.MOM.WW.OtOpcUa.Client.Shared\Models\AlarmEventArgs.cs` (DTO mirror)
- `src\ZB.MOM.WW.OtOpcUa.Client.CLI\Commands\AlarmsCommand.cs` (verbose / json flags)
- `src\ZB.MOM.WW.OtOpcUa.Client.UI\ViewModels\AlarmEventViewModel.cs`
- `src\ZB.MOM.WW.OtOpcUa.Client.UI\ViewModels\AlarmsViewModel.cs`
- `src\ZB.MOM.WW.OtOpcUa.Client.UI\Views\AlarmsView.axaml` (+ `.cs`)
- `src\ZB.MOM.WW.OtOpcUa.Client.UI\Views\AckAlarmWindow.axaml` (+ `.cs`)
- `docs\Client.CLI.md` (alarms section examples)
- `docs\Client.UI.md` (Show-details toggle description)
- `docs\reqs\ClientRequirements.md` (extend AlarmEventArgs contract)
- `docs\AlarmTracking.md` (B.5 — cross-link client examples)
- `tests\ZB.MOM.WW.OtOpcUa.Client.Shared.Tests\` (DTO round-trip)
- `tests\ZB.MOM.WW.OtOpcUa.Client.CLI.Tests\` (flag behaviour)
- `tests\ZB.MOM.WW.OtOpcUa.Client.UI.Tests\` (view-model bindings)
Total: ~10 source files added/modified in mxaccessgw server/worker
side; ~14 in lmxopcua server/driver side; ~3 in the historian sidecar;
~2 deployment scripts; ~30 across the five gateway-client SDK
languages; ~12 in lmxopcua client surfaces; ~25 test files across
all repos. The gateway-client multi-language work is parallelizable
across maintainers, so wall-clock effort lands in 4-6 weeks of
coordinated work given the parity-rig dependency for end-to-end
validation. If only the .NET SDK ships at first (E.2 only) and
E.3-E.6 follow asynchronously, lmxopcua's critical path stays
unchanged.