docs: plan — alarms over the mxaccessgw gateway

Coordinated cross-repo epic to restore the three v1 alarm capabilities
that PR 7.2 regressed: rich MxAccess alarm-event metadata, native
Acknowledge semantics, and the IAlarmHistorianWriter write-back path.

Architectural split: gateway owns MxAccess transport (new
OnAlarmTransition event family + AcknowledgeAlarm / QueryActiveAlarms /
WriteHistorianEvent RPCs); lmxopcua keeps the OPC UA Part 9 state
machine, ACL/role enforcement, and multi-source aggregation. The
existing value-driven sub-attribute path stays as fallback.

10 PRs total — 5 in mxaccessgw, 5 in lmxopcua — sequenced so each
side's work is independently reviewable. End-of-epic gate is a parity
matrix run with five new alarm scenarios.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Joseph Doherty
2026-04-30 15:02:48 -04:00
parent 80104caf09
commit 65a5f64931


@@ -0,0 +1,617 @@
# Plan — alarms over the mxaccessgw gateway
Coordinated epic across two repos:
- **`lmxopcua`** (this repo) — `c:\Users\dohertj2\Desktop\lmxopcua\`
- **`mxaccessgw`** — `c:\Users\dohertj2\Desktop\mxaccessgw\`
## Why
PR 7.2 (2026-04-30, commit `ae7106d`) retired the out-of-process v1 Galaxy stack
(`Driver.Galaxy.Host` / `.Proxy` / `.Shared` + `OtOpcUaGalaxyHost` Windows
service) and migrated Galaxy access to the in-process `GalaxyDriver` over
mxaccessgw's gRPC. In doing so, three v1 capabilities regressed:
1. **Native MxAccess alarm-event metadata** — v1's `GalaxyAlarmTracker`
surfaced rich alarm transitions (operator comment, original raise time,
ack time, alarm category, native severity). The current architecture
reconstructs Part 9 transitions by subscribing to four sub-attribute
value updates (`InAlarm`, `Acked`, `Priority`, `Description`) — fine for
raise/clear but loses everything else.
2. **Native MxAccess Acknowledge semantics** — v1 called the MxAccess ack
API directly from `GalaxyAlarmTracker`. Today, OPC UA acks are written
into the `AckMsgWriteRef` sub-attribute — semantically valid but a
round-trip through the value path that loses operator-comment fidelity.
3. **Alarm-historian write-back path**: `GalaxyHistorianWriter`
implemented `IAlarmHistorianWriter` and forwarded scripted-alarm and
Galaxy-native alarm transitions back to AVEVA Historian via
`aahClientManaged`. PR 7.2 deleted it. `Phase7Composer.ResolveHistorianSink`
now finds no writer and falls back to `NullAlarmHistorianSink`, so
**scripted-alarm transitions queue locally and silently discard.**
(Galaxy-native alarms still reach AVEVA Historian via the Galaxy template's
own `HistorizeToAveva` toggle, independent of our sink — that path
wasn't broken.)
`gateway.md` (mxaccessgw, line 8) explicitly commits the gateway to "full
MXAccess parity… preserve MXAccess behavior first… **native MXAccess event
families**." Today's gateway proto exposes only data-change families. Closing
the alarm regression and fulfilling that parity statement are the same task.
## Goals
- Restore all three regressed capabilities to feature parity with v1.
- Keep the v2 architectural split — gateway owns MxAccess transport;
lmxopcua owns OPC UA Part 9 semantics, ACL/role enforcement, and
multi-source aggregation (driver-native + scripted + sub-attribute).
- Preserve the value-driven sub-attribute path as a fallback for Galaxy
templates that don't carry `$Alarm*` extensions.
- Land the work as a sequence of small, independently-reviewable PRs that
alternate between repos in dependency order.
## Non-goals
- Reimplementing the Part 9 state machine inside mxaccessgw. The gateway
stays UA-agnostic.
- Reworking the LDAP role-grant or OPC UA AlarmAck ACL surface — those
already exist and route through `Server/Alarms/IAlarmAcknowledger`.
- Adding alarm support to non-Galaxy drivers (AbCip / FOCAS / OpcUaClient
already have their own `IAlarmSource` implementations; Modbus / S7 /
AbLegacy / TwinCAT don't have a native alarm bus and are out of scope).
- Altering Galaxy template conventions or `$Alarm*` extensions in the
customer's Galaxy.
## Before → after
**Today (post-PR 7.2):**
```
MxAccess COM (gateway worker)
│ data-change events only on the MxEvent stream
GalaxyDriver (no IAlarmSource)
│ IWritable / ISubscribable / ITagDiscovery only
DriverNodeManager
├─ subscribes to four $Alarm* sub-attributes per condition
├─ AlarmConditionService rebuilds Part 9 transitions from value updates
└─ DriverWritableAcknowledger writes AckMsgWriteRef on ack
Phase7Composer.ResolveHistorianSink → NullAlarmHistorianSink
(scripted-alarm transitions queue → silently discarded)
```
**After this epic:**
```
MxAccess COM (gateway worker)
│ data-change ──┐
│ alarm-transition │
│ write-complete ├─► single MxEvent stream (new family added)
▼ ▼
GalaxyDriver : ITagDiscovery, IReadable, IWritable, ISubscribable, IRediscoverable,
IHostConnectivityProbe, IAlarmSource ← restored
├─ EventPump dispatches OnAlarmTransition family → IAlarmSource.OnAlarmEvent
├─ AcknowledgeAsync → gateway RPC AcknowledgeAlarm
└─ QueryActiveAlarmsAsync → gateway RPC QueryActiveAlarms (ConditionRefresh)
DriverNodeManager
├─ rich alarm events from IAlarmSource.OnAlarmEvent → AlarmConditionService
├─ value-driven sub-attribute path STILL WORKS for templates without $Alarm*
├─ DriverWritableAcknowledger preserved as fallback for the value path
└─ ScriptedAlarmEngine output continues to feed AlarmConditionService
Phase7Composer.ResolveHistorianSink → GatewayAlarmHistorianWriter
├─ scripted-alarm transitions → SqliteStoreAndForwardSink
└─ drain worker → gateway RPC WriteHistorianEvent → AVEVA Historian
```
## Architecture decisions
**D1 — Where the Part 9 state machine runs.** Stays in lmxopcua's
`AlarmConditionService`. Gateway is UA-agnostic. ScriptedAlarmEngine produces
Part 9 transitions with no MxAccess origin; the aggregator must live where all
sources converge.
**D2 — Where authz on Acknowledge runs.** Stays in lmxopcua. The OPC UA
`AlarmConditionState.OnAcknowledge` delegate already checks the session's
roles for `AlarmAck` against the LDAP/role-grant ACL. The gateway should
never be reachable in a way that bypasses that check.
**D3 — How rich alarm events reach OPC UA clients.** A new `MxEventFamily`
value on the existing `StreamEvents` RPC (no second stream). This gives
alarm events the same latency as data-change events and reuses the
bounded-channel + worker-side delivery semantics already documented in
`gateway.md`.
**D4 — Sub-attribute fallback path stays.** Some Galaxy templates won't
have `$Alarm*` extensions yet; the existing value-driven path remains the
only way to surface alarms for those templates. Both paths feed
`AlarmConditionService`. Driver-native events take precedence when both
are present (more authoritative, lower latency).
**D5 — Where the historian writer lives.** As a new RPC on the gateway
(`WriteHistorianEvent`). The Wonderware sidecar's existing
`WriteAlarmEvents` IPC slot stays unwired and is deleted as part of this
epic — the gateway is the canonical place for "write to AVEVA Historian"
since the gateway already owns AVEVA-COM access. This also means the
sidecar (long term) only does *reads* and could potentially retire entirely
if the historian-client REST migration (`docs/plans/...`) lands.
## Track A — mxaccessgw changes
All five PRs land in `c:\Users\dohertj2\Desktop\mxaccessgw\`.
### PR A.1 — proto: add alarm-transition event family + ack/query RPCs
**Files** (`src\MxGateway.Contracts\Protos\mxaccess_gateway.proto`):
1. Extend `MxEventFamily` (line 403):
```
MX_EVENT_FAMILY_ON_ALARM_TRANSITION = 5;
```
2. Extend `MxEvent.body` oneof (line 395) with:
```
OnAlarmTransitionEvent on_alarm_transition = 24;
```
3. New message `OnAlarmTransitionEvent` after the existing event-family
bodies (line 425+). Carry the full MxAccess alarm payload — alarm name,
source object reference, alarm-type-name (e.g. "AnalogLimitAlarm.HiHi"),
transition kind enum (`Raise` / `Acknowledge` / `Clear`), severity (raw
numeric — keep MxAccess scale; mapping to OPC UA 0-1000 happens
server-side in lmxopcua), `original_raise_timestamp`,
`transition_timestamp`, optional `operator_user`, optional
`operator_comment`, alarm `category` string, alarm `description`. Mirror
the field set documented in v1's `GalaxyAlarmTracker`.
4. New RPCs on the `MxAccessGateway` service (line 11):
```
rpc AcknowledgeAlarm(AcknowledgeAlarmRequest) returns (AcknowledgeAlarmReply);
rpc QueryActiveAlarms(QueryActiveAlarmsRequest) returns (stream ActiveAlarmSnapshot);
```
`AcknowledgeAlarmRequest` carries `session_id`, `alarm_full_reference`,
`comment`, `user_principal`. Reply carries `MxStatusProxy`.
`QueryActiveAlarmsRequest` carries `session_id`, optional
`alarm_filter_prefix` (for ConditionRefresh on a sub-tree).
`ActiveAlarmSnapshot` carries the same fields as
`OnAlarmTransitionEvent` plus `current_state` enum (`Active` /
`ActiveAcked` / `Inactive`).
**Tests** (`MxGateway.Tests` — proto/codegen sanity):
- Round-trip Serialize→Deserialize for the new messages with all-fields
populated and empty-optional-fields cases.
- `MxEvent.body` oneof selection guard — supplying multiple bodies
rejected.
**Out of scope:** worker-side wiring (PR A.2), gateway-side dispatch (PR A.3).
PR A.1 is a pure contract-surface change; nothing functional yet.
### PR A.2 — worker: subscribe to MxAccess alarm event source
**Files** (`src\MxGateway.Worker\` — net48/x86):
The MxAccess Toolkit exposes alarm subscription separately from data
subscription. Per AVEVA's MXAccess C++ Toolkit reference (canonical doc
referenced from `gateway.md`), alarm events arrive through the
`IAlarmEventSink` interface registered against the MxAccess `Alarms`
collection of an open session, OR via the MxAccess "alarm provider"
subscription pattern (depends on Toolkit version on the worker host —
verify against the version actually deployed in the worker bin during
PR A.2).
1. Worker subscribes to MxAccess alarms once per session, with a single
sink that fans out into the same bounded channel the data-change pump
uses (`MxGateway.Worker\Eventing\EventChannel.cs` or whatever the worker
currently calls its sink — verify name during the PR).
2. Sink translates each MxAccess alarm event into a `WorkerEvent` proto
(defined in `mxaccess_worker.proto`) carrying the new
`OnAlarmTransitionEvent` body. Reuses the existing `worker_sequence`
counter so ordering is preserved across families.
3. Worker honours the same backpressure rules as data-change events —
newest-dropped on full channel, single dropped-counter metric per
family.
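
The backpressure rule in step 3 can be sketched as a small bounded buffer. This is only an illustration under stated assumptions: `EventFamily`, `BoundedEventBuffer<T>`, and all member names are hypothetical, and the worker's real channel type may look quite different. The sketch shows the two behaviours the plan names: the incoming (newest) event is discarded when the buffer is full, and drops are counted per family.

```csharp
using System.Collections.Generic;

// Hypothetical names throughout; not the worker's actual types.
public enum EventFamily { DataChange, AlarmTransition }

public sealed class BoundedEventBuffer<T>
{
    private readonly object _gate = new object();
    private readonly Queue<T> _items = new Queue<T>();
    private readonly Dictionary<EventFamily, long> _dropped =
        new Dictionary<EventFamily, long>();
    private readonly int _capacity;

    public BoundedEventBuffer(int capacity) { _capacity = capacity; }

    // Returns false when the buffer is full. The incoming (newest) event is
    // discarded and the dropped counter for its family is incremented.
    public bool TryEnqueue(EventFamily family, T item)
    {
        lock (_gate)
        {
            if (_items.Count >= _capacity)
            {
                _dropped.TryGetValue(family, out long count);
                _dropped[family] = count + 1;
                return false;
            }
            _items.Enqueue(item);
            return true;
        }
    }

    // Removes the oldest buffered event, if any.
    public bool TryDequeue(out T item)
    {
        lock (_gate)
        {
            if (_items.Count == 0) { item = default(T); return false; }
            item = _items.Dequeue();
            return true;
        }
    }

    // Number of events of the given family dropped so far.
    public long DroppedCount(EventFamily family)
    {
        lock (_gate)
        {
            return _dropped.TryGetValue(family, out long count) ? count : 0;
        }
    }
}
```

Because both families share one buffer, a burst of data-change events can cause alarm transitions to be dropped; the per-family counter at least makes that visible.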
**Tests** (`MxGateway.Worker.Tests`):
- Fake `IAlarmEventSink` source emits canned transitions; assert the
worker forwards each as the right `WorkerEvent` shape.
- Cancellation test — closing the session unsubscribes from MxAccess
alarms cleanly (no leaked sinks if the worker is recycled mid-session).
**Out of scope:** any gateway-side dispatch, any RPC handler — PR A.2
is worker-internal.
### PR A.3 — gateway: dispatch OnAlarmTransition + implement AcknowledgeAlarm
**Files** (`src\MxGateway.Server\`):
1. The session-level event multiplexer (`Sessions\SessionEventStream.cs`
or equivalent — verify name during PR) recognizes the new
`WorkerEvent` body and forwards as an `MxEvent` with family
`MX_EVENT_FAMILY_ON_ALARM_TRANSITION` to the gRPC
`StreamEvents` consumer.
2. New RPC handler `AcknowledgeAlarm` builds an MxAccess `WorkerCommand`
carrying an `AlarmAcknowledgeCommand` (new in `mxaccess_worker.proto`
under PR A.1). Forwarded to the worker; reply mapped to
`AcknowledgeAlarmReply` with the MxAccess `MxStatus` proxy populated.
3. AuthN — same API-key + scope check as existing RPCs. Add a new scope
`invoke:alarm-ack` (mirrors `invoke:write` granularity); existing keys
without it return `PERMISSION_DENIED`.
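
A minimal sketch of the scope rule in step 3, assuming scopes are plain strings attached to an API key. `ScopeGuard` and `ScopeCheckResult` are hypothetical names; only the scope strings come from this plan. The point it illustrates is that `invoke:write` does not imply `invoke:alarm-ack`.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper; the gateway's real auth types are not shown in this plan.
public enum ScopeCheckResult { Ok, PermissionDenied }

public static class ScopeGuard
{
    public const string Write = "invoke:write";
    public const string AlarmAck = "invoke:alarm-ack";

    // A key passes only if it holds the exact required scope.
    // Holding invoke:write alone is not enough for AcknowledgeAlarm.
    public static ScopeCheckResult Check(
        IEnumerable<string> grantedScopes, string requiredScope)
    {
        if (grantedScopes == null) throw new ArgumentNullException(nameof(grantedScopes));
        foreach (var scope in grantedScopes)
        {
            if (string.Equals(scope, requiredScope, StringComparison.Ordinal))
                return ScopeCheckResult.Ok;
        }
        return ScopeCheckResult.PermissionDenied;
    }
}
```

In a real handler the `PermissionDenied` result would presumably be translated into the gRPC `PERMISSION_DENIED` status named above.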
**Tests** (`MxGateway.Tests`, `MxGateway.IntegrationTests`):
- Unit: dispatch test — fake worker emits an `OnAlarmTransition` event;
assert the gateway forwards it on the live `StreamEvents` channel of
every subscribed session.
- Integration: end-to-end against the real worker (requires the parity
rig setup — see `docs\v2\Galaxy.ParityRig.md` in lmxopcua for the
MxAccess-installed dev box prerequisites). Trigger a real Galaxy
alarm, assert the gateway emits `OnAlarmTransition`. Acknowledge via
the new RPC, assert the alarm transitions to `ActiveAcked` and an
`Acknowledge` transition event is emitted back.
- AuthN: existing key without `invoke:alarm-ack` scope rejected.
### PR A.4 — gateway: ConditionRefresh snapshot via QueryActiveAlarms
**Files** (`src\MxGateway.Server\`, `src\MxGateway.Worker\`):
1. Worker exposes a `QueryActiveAlarmsCommand` that walks the session's
active-alarm collection and streams snapshots back through the
existing command-reply channel. The MxAccess Toolkit's
`Alarms.GetActive()` (verify exact API name during PR) is the
underlying call.
2. Gateway RPC `QueryActiveAlarms` opens a server-streaming reply,
batches snapshots through.
3. AuthN — new scope `invoke:alarm-query` (separate from ack so a
read-only client can refresh without ack rights).
**Tests:**
- Worker-test: synthetic active set of 0 / 1 / 100 alarms; assert
pagination respects worker channel capacity.
- Integration: against the parity rig, assert a ConditionRefresh after
reconnect returns every alarm currently `Active` or `ActiveAcked` in
the Galaxy.
### PR A.5 — gateway: WriteHistorianEvent RPC for sink write-back
**Files** (`src\MxGateway.Server\`, `src\MxGateway.Worker\`,
`src\MxGateway.Contracts\Protos\mxaccess_gateway.proto`).
1. New RPC `WriteHistorianEvent(WriteHistorianEventRequest) →
WriteHistorianEventReply`. Request carries an
`AlarmHistorianRecord` mirroring the existing
`Core.AlarmHistorian.AlarmHistorianEvent` payload (alarm id,
equipment path, alarm name, alarm-type-name, severity, event kind,
message, user, comment, timestamp).
2. Worker maps the record onto `aahClientManaged`'s alarm-event
write API (the same path v1's `GalaxyHistorianWriter` used). Worker
batches up to N records per write to amortize the COM round-trip.
3. AuthN — new scope `invoke:historian-write`. Cross-cutting with
`invoke:write` — keys for OPC UA servers that publish historian
data must hold both.
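
The "up to N records per write" batching in step 2 could look roughly like the helper below. `RecordBatcher` is a hypothetical name and the batch size is left to the caller, since the plan does not fix N. .NET 6 and later ship `Enumerable.Chunk` for this, but the worker targets net48, so the sketch spells the loop out.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper; splits a record sequence into batches of at most
// maxBatchSize items, preserving order.
public static class RecordBatcher
{
    public static IEnumerable<IReadOnlyList<T>> Batch<T>(
        IEnumerable<T> records, int maxBatchSize)
    {
        if (records == null) throw new ArgumentNullException(nameof(records));
        if (maxBatchSize <= 0) throw new ArgumentOutOfRangeException(nameof(maxBatchSize));
        return Iterate(records, maxBatchSize);
    }

    private static IEnumerable<IReadOnlyList<T>> Iterate<T>(
        IEnumerable<T> records, int maxBatchSize)
    {
        var current = new List<T>(maxBatchSize);
        foreach (var record in records)
        {
            current.Add(record);
            if (current.Count == maxBatchSize)
            {
                yield return current;
                current = new List<T>(maxBatchSize);
            }
        }
        // Flush the final partial batch, if any.
        if (current.Count > 0) yield return current;
    }
}
```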
**Tests:**
- Worker test: fake `aahClientManaged` writer; assert batching
semantics + retry-on-Bad-status-code behaviour matches v1's
`GalaxyHistorianWriter` (per-row outcome reporting).
- Integration: write a record, query it back via existing Historian
read APIs, assert round-trip fidelity.
**Sequencing within Track A:** A.1 → A.2 → A.3 → A.4 → A.5. A.1 is
mechanical; A.2 + A.3 are the load-bearing changes that unlock lmxopcua
side. A.4 + A.5 can ship after lmxopcua starts consuming A.3 output.
## Track B — lmxopcua changes
All five PRs land in `c:\Users\dohertj2\Desktop\lmxopcua\`. Each B-PR
depends on a specific A-PR — see the sequencing matrix below.
### PR B.1 — EventPump: dispatch OnAlarmTransition family
**Depends on:** A.1 (proto), A.3 (gateway dispatching the new family).
**Files:**
- `src\ZB.MOM.WW.OtOpcUa.Driver.Galaxy\Runtime\EventPump.cs:160`: the
current `Dispatch(MxEvent ev)` returns early for any non-`OnDataChange`
family. Add a branch:
```csharp
switch (ev.Family) {
case MxEventFamily.OnDataChange: DispatchDataChange(ev); break;
case MxEventFamily.OnAlarmTransition: DispatchAlarmTransition(ev); break;
default: return;
}
```
- New `DispatchAlarmTransition` translates the proto event into an
`AlarmEventArgs` (existing type from `Core.Abstractions`) and raises an
internal event the driver subscribes to.
- New `MxAccessSeverityMapper` in `Driver.Galaxy\Runtime\` — maps the
MxAccess raw severity into the `AlarmSeverity` enum + the OPC UA
numeric severity (250 / 500 / 700 / 900 ladder per v1's
`AlarmTracking.md`).
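
One possible shape for the severity mapper is a table-driven ladder. Everything here is an assumption except the 250 / 500 / 700 / 900 output values: the enum members, the `SeverityRung` and `SeverityLadder` types, and above all the raw breakpoints are placeholders, because this plan does not state how MxAccess severities divide into rungs. The real table should come from v1's `AlarmTracking.md`.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for the real AlarmSeverity enum.
public enum SketchAlarmSeverity { Low, Medium, High, Critical }

public sealed class SeverityRung
{
    public SeverityRung(int maxRawInclusive, SketchAlarmSeverity severity, ushort opcUaSeverity)
    {
        MaxRawInclusive = maxRawInclusive;
        Severity = severity;
        OpcUaSeverity = opcUaSeverity;
    }

    public int MaxRawInclusive { get; }
    public SketchAlarmSeverity Severity { get; }
    public ushort OpcUaSeverity { get; }
}

public sealed class SeverityLadder
{
    private readonly SeverityRung[] _rungs; // sorted ascending by MaxRawInclusive

    public SeverityLadder(IEnumerable<SeverityRung> rungs)
    {
        if (rungs == null) throw new ArgumentNullException(nameof(rungs));
        _rungs = rungs.OrderBy(r => r.MaxRawInclusive).ToArray();
        if (_rungs.Length == 0) throw new ArgumentException("At least one rung is required.", nameof(rungs));
    }

    // Returns the first rung whose upper bound covers the raw value;
    // values above the top rung are clamped to it.
    public SeverityRung Map(int rawSeverity)
    {
        foreach (var rung in _rungs)
        {
            if (rawSeverity <= rung.MaxRawInclusive) return rung;
        }
        return _rungs[_rungs.Length - 1];
    }
}
```

Keeping the breakpoints in data rather than in a `switch` makes the table tests in this PR a direct comparison against the documented mapping.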
**Tests** (`tests\ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Tests\Runtime\`):
- `EventPumpAlarmTests` — feed three synthetic MxEvents (raise / ack /
clear); assert each fires `OnAlarmEvent` on the driver with correct
payload.
- Severity-mapping table tests — every documented MxAccess severity
level → expected (`AlarmSeverity`, OPC UA numeric) tuple.
### PR B.2 — GalaxyDriver re-implements IAlarmSource
**Depends on:** A.3 (`AcknowledgeAlarm` RPC available), B.1 (event
dispatch).
**Files:**
- `src\ZB.MOM.WW.OtOpcUa.Driver.Galaxy\GalaxyDriver.cs:28` — extend the
class declaration:
```csharp
public sealed class GalaxyDriver
: IDriver, ITagDiscovery, IReadable, IWritable, ISubscribable,
IRediscoverable, IHostConnectivityProbe, IAlarmSource, IDisposable
```
- Implement the four `IAlarmSource` members:
- `SubscribeAlarmsAsync` — no-op returning a sentinel handle. The
driver is already subscribed for data; alarm events arrive on the
same event stream once the gateway emits the new family. (Same
pattern AbCip uses today — see `Driver.AbCip\AbCipDriver.cs:208`.)
- `UnsubscribeAlarmsAsync` — no-op.
- `OnAlarmEvent` — wired to the EventPump branch added in B.1.
- `AcknowledgeAsync` — calls the new gateway RPC via the
`IGalaxyAlarmAcknowledger` abstraction (new file, mirrors the
`IGalaxyDataWriter` pattern), with `GatewayGalaxyAlarmAcknowledger`
as the production implementation in `Runtime\`. Resilience wrapping
via `AlarmSurfaceInvoker` per existing pattern.
- `DriverInstanceFactory` for Galaxy registers
`IGalaxyAlarmAcknowledger` alongside the existing data writer.
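
As a rough sketch of the acknowledger seam, assuming it mirrors the request fields listed for `AcknowledgeAlarmRequest` in PR A.1: the interface name comes from this plan, but its signature, the `AlarmAckRequest` type, and the recording fake are all hypothetical. The fake is the kind of test double the Acknowledge test below would use.

```csharp
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical request shape; field names follow AcknowledgeAlarmRequest.
public sealed class AlarmAckRequest
{
    public AlarmAckRequest(string alarmFullReference, string comment, string userPrincipal)
    {
        AlarmFullReference = alarmFullReference;
        Comment = comment;
        UserPrincipal = userPrincipal;
    }

    public string AlarmFullReference { get; }
    public string Comment { get; }
    public string UserPrincipal { get; }
}

// Assumed signature; the real interface may differ.
public interface IGalaxyAlarmAcknowledger
{
    Task<bool> AcknowledgeAsync(AlarmAckRequest request, CancellationToken ct);
}

// Test double that records every call instead of reaching the gateway.
public sealed class RecordingAlarmAcknowledger : IGalaxyAlarmAcknowledger
{
    private readonly List<AlarmAckRequest> _calls = new List<AlarmAckRequest>();

    public IReadOnlyList<AlarmAckRequest> Calls => _calls;

    public Task<bool> AcknowledgeAsync(AlarmAckRequest request, CancellationToken ct)
    {
        ct.ThrowIfCancellationRequested();
        _calls.Add(request);
        return Task.FromResult(true);
    }
}
```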
**Tests:**
- Subscribe-noop returns a non-null handle; unsubscribe accepts it.
- Acknowledge — fake `IGalaxyAlarmAcknowledger` records the call; assert
the request shape and resilience-pipeline routing.
- End-to-end test in `Driver.Galaxy.Tests` — fake gateway emits a
raise-then-ack event sequence; assert the driver fires `OnAlarmEvent`
twice with matching alarm-id correlation.
### PR B.3 — DriverNodeManager: route to driver-native when present
**Depends on:** B.2.
**Files:**
- `src\ZB.MOM.WW.OtOpcUa.Server\OpcUa\DriverNodeManager.cs` — when
registering an `AlarmConditionState` for a Galaxy variable, check
whether the driver is `IAlarmSource`. If yes, prefer the
`OnAlarmEvent`-driven path; the value-driven sub-attribute path
becomes the secondary path that handles transitions the driver-native
stream missed (network blip, gateway restart, or a template that lacks
the `$Alarm*` extension).
- `Server\Alarms\AlarmConditionService` — already accepts events from
multiple sources; only addition is a `DriverEventOrigin` enum on
internal transitions so the dedup logic prefers the richer
driver-native record over a stale sub-attribute synthesis.
- `IAlarmAcknowledger` resolution in `DriverNodeManager`: prefer the
driver's `IAlarmSource.AcknowledgeAsync` over
`DriverWritableAcknowledger` when both are available. Keep
`DriverWritableAcknowledger` as the fallback for templates without
`$Alarm*` extensions.
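
The dedup rule can be illustrated with a state-based sketch. It is an assumption-heavy illustration, not the design of `AlarmConditionService`: `TransitionKind`, `DedupDecision`, and `TransitionDeduplicator` are hypothetical names, and only `DriverEventOrigin` is taken from this plan. A transition that repeats the last applied kind for the same alarm is treated as a duplicate; if the repeat is driver-native and the applied one was synthesized from sub-attributes, the payload is enriched without firing a second Part 9 transition.

```csharp
using System.Collections.Generic;

public enum DriverEventOrigin { SubAttribute, DriverNative }
public enum TransitionKind { Raise, Acknowledge, Clear }
public enum DedupDecision { Fire, Enrich, Drop }

// Not thread-safe; a real implementation would need synchronization.
public sealed class TransitionDeduplicator
{
    private readonly Dictionary<string, (TransitionKind Kind, DriverEventOrigin Origin)> _last =
        new Dictionary<string, (TransitionKind Kind, DriverEventOrigin Origin)>();

    public DedupDecision Decide(string alarmId, TransitionKind kind, DriverEventOrigin origin)
    {
        if (_last.TryGetValue(alarmId, out var last) && last.Kind == kind)
        {
            // Same transition seen again. Upgrade the stored record only when
            // the richer driver-native event follows a sub-attribute synthesis.
            if (origin == DriverEventOrigin.DriverNative &&
                last.Origin == DriverEventOrigin.SubAttribute)
            {
                _last[alarmId] = (kind, origin);
                return DedupDecision.Enrich;
            }
            return DedupDecision.Drop;
        }

        // First time this kind follows a different one: a genuine transition.
        _last[alarmId] = (kind, origin);
        return DedupDecision.Fire;
    }
}
```

Keying on the last applied kind rather than on a set of seen events means a later re-raise after a clear is still treated as new.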
**Tests:**
- Two-source-fan-in test: same alarm condition receives both a
driver-native ack event and a sub-attribute value update for the same
transition; assert no duplicate Part 9 transition fires.
- Acknowledger routing — driver implements `IAlarmSource` →
ack-via-RPC; driver implements only `IWritable` → ack-via-write
(existing path).
### PR B.4 — IAlarmHistorianWriter via gateway
**Depends on:** A.5 (`WriteHistorianEvent` RPC available).
**Files:**
- New `src\ZB.MOM.WW.OtOpcUa.Driver.Galaxy\Runtime\GatewayAlarmHistorianWriter.cs`
implementing `IAlarmHistorianWriter`. Calls the gateway RPC from
Track A.5 with the same batch + per-row outcome semantics v1's
`GalaxyHistorianWriter` exposed.
- `GalaxyDriverFactory` registers it as a singleton tied to the
`DriverInstance`.
- `Server\Phase7\Phase7Composer.ResolveHistorianSink` — already scans
registered drivers for an `IAlarmHistorianWriter`. Once GalaxyDriver
exposes one, `SqliteStoreAndForwardSink` boots with a real writer
attached and the `NullAlarmHistorianSink` fallback no longer applies
on Galaxy installs.
- Delete `WriteAlarmEventsRequest` / `WriteAlarmEventsReply` /
`IAlarmEventWriter` from the Wonderware sidecar
(`src\ZB.MOM.WW.OtOpcUa.Driver.Historian.Wonderware\Ipc\Contracts.cs`,
`Ipc\HistorianFrameHandler.cs`, `Ipc\Framing.cs`). The historian
sidecar becomes read-only — matches the audit done earlier.
**Tests:**
- `GatewayAlarmHistorianWriter` against a fake gRPC server — single
record, batch, per-row failure modes (Ack / RetryPlease /
PermanentFail).
- `Phase7Composer` end-to-end — register a Galaxy driver, assert
`ResolveHistorianSink` picks `SqliteStoreAndForwardSink` with the
new writer attached.
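
The per-row failure modes named in the first test (Ack / RetryPlease / PermanentFail) suggest a partitioning step on the drain side. The sketch below assumes the writer returns one outcome per submitted record, in order. `SketchWriteOutcome`, `DrainResult<T>`, and `OutcomePartitioner` are hypothetical names, and the actual store-and-forward sink may handle these outcomes differently.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical mirror of the per-row outcome enum.
public enum SketchWriteOutcome { Ack, RetryPlease, PermanentFail }

public sealed class DrainResult<T>
{
    public List<T> Acked { get; } = new List<T>();
    public List<T> Retry { get; } = new List<T>();
    public List<T> DeadLetter { get; } = new List<T>();
}

public static class OutcomePartitioner
{
    // Splits a written batch by per-row outcome: acked rows can be removed
    // from the queue, retry rows stay queued, permanent failures go to
    // dead-letter retention.
    public static DrainResult<T> Partition<T>(
        IReadOnlyList<T> batch, IReadOnlyList<SketchWriteOutcome> outcomes)
    {
        if (batch == null) throw new ArgumentNullException(nameof(batch));
        if (outcomes == null) throw new ArgumentNullException(nameof(outcomes));
        if (batch.Count != outcomes.Count)
            throw new ArgumentException("Expected one outcome per record.", nameof(outcomes));

        var result = new DrainResult<T>();
        for (int i = 0; i < batch.Count; i++)
        {
            switch (outcomes[i])
            {
                case SketchWriteOutcome.Ack: result.Acked.Add(batch[i]); break;
                case SketchWriteOutcome.RetryPlease: result.Retry.Add(batch[i]); break;
                default: result.DeadLetter.Add(batch[i]); break;
            }
        }
        return result;
    }
}
```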
### PR B.5 — docs + memory housekeeping
**Depends on:** B.1 / B.2 / B.3 / B.4 all green on the parity rig.
**Files:**
- `docs\drivers\Galaxy.md` — current text says the driver implements
five capability interfaces; update to seven (`IAlarmSource`,
`IAlarmHistorianWriter`-via-companion).
- `docs\AlarmTracking.md` — promote a fresh top-level doc that
describes the v2-final architecture (driver-native primary path +
sub-attribute fallback + scripted-alarm aggregation). Cross-link from
`docs\README.md`. The v1 archive stays as historical record.
- `docs\v1\AlarmTracking.md` — extend the existing historical banner
with "Restored to functional parity in this epic — see
`docs\AlarmTracking.md` for current state."
- Memory entries (`C:\Users\dohertj2\.claude\projects\…\memory\`):
- Update `project_galaxy_via_mxgateway.md` — add the alarm path
restoration.
- Update `project_server_history_alarm_subsystems.md` — note that
`Phase7Composer.ResolveHistorianSink` now finds a writer on
Galaxy installs.
- `docs\plans\alarms-over-gateway.md` (this file) — banner the doc
`✅ Completed YYYY-MM-DD — historical record.` matching the existing
v2-mxgw plan retirement convention.
## Sequencing matrix
```
Track A (mxaccessgw)              Track B (lmxopcua)
─────────────────────────         ─────────────────────────
A.1 proto                         (waits)
├──────────────────────────►      B.1 EventPump branch
A.2 worker subscription           │ uses proto types only
│                                 │ unit-testable without live gw
A.3 gateway dispatch + ack RPC ──►B.2 GalaxyDriver : IAlarmSource
│                                 │
│                              ──►B.3 DriverNodeManager routing
A.4 ConditionRefresh              │ (B.3 closes the loop with A.4
│                                    once ConditionRefresh wired)
A.5 WriteHistorianEvent ─────────►B.4 GatewayAlarmHistorianWriter
                                  │ + sidecar write-path deletion
                               ──►B.5 docs + memory
```
A.1 + B.1 can land in parallel (B.1's tests use proto types without
needing a running gateway). B.1 stays inert until A.3 ships the gateway
dispatch — which is fine; the dispatch branch is a no-op until events
arrive.
## Test gates
Per PR: unit tests pass + build green + analyzer clean (Roslyn
OTOPCUA0001 still wraps every alarm-capability call through
`AlarmSurfaceInvoker`).
End-of-epic gate: re-run the parity rig (`docs\v2\Galaxy.ParityRig.md`)
with these scenarios added:
1. **Native alarm raise** — Galaxy `$Alarm*` raise with operator-time
metadata appears as an OPC UA Part 9 transition with full payload
(no longer reconstructed from sub-attribute writes).
2. **Native ack** — OPC UA client acks; assert the gateway records the
ack against MxAccess directly (not via sub-attribute write); operator
comment present in the resulting `Acknowledged` transition.
3. **ConditionRefresh after reconnect** — disconnect the GalaxyDriver,
raise three alarms in Galaxy, reconnect; assert all three appear in
the next ConditionRefresh.
4. **Historian write-back** — fire a scripted alarm; assert it arrives in
AVEVA Historian via the gateway path (use the existing Historian
sidecar's read API to query it back).
5. **Sub-attribute fallback still works** — disable `IAlarmSource` on
the GalaxyDriver via test seam, fire a sub-attribute value change;
assert Part 9 transition still raised.
Soak target: 24h × 1k tags (light) — same parity-rig harness but
extended to also subscribe to alarms. Pass criterion: zero dropped
alarm transitions, zero state-machine inversions, zero unhandled
exceptions in the AlarmSurfaceInvoker pipeline.
## Risks and mitigations
| Risk | Mitigation |
|---|---|
| MxAccess Toolkit alarm subscription API differs across installed AVEVA versions | PR A.2 verifies against the worker-host's installed Toolkit version; documents the exact API used. Pin the worker DLL set per major MxAccess version if needed. |
| Worker-side alarm subscription leaks between sessions if cleanup is wrong | PR A.2 includes a session-recycle test that asserts no `IAlarmEventSink` instances remain registered after Close. |
| Gateway adds a new auth scope (`invoke:alarm-ack`); existing keys lack it | PR A.3 + A.5 ship with a one-time bootstrap migration: keys with `invoke:write` get the new scope auto-granted on the dev rig and parity rig. Production keys are reissued via `apikey rotate-key` (existing CLI). |
| Two simultaneous alarm sources (driver-native + sub-attribute) double-fire transitions | PR B.3 dedup is the load-bearing design. End-to-end test #1 covers it explicitly. |
| Historian write-back batch fails mid-batch — partial success | The existing `SqliteStoreAndForwardSink.HistorianWriteOutcome` per-row enum + dead-letter retention already handles this; PR A.5 just exposes the same outcome shape over gRPC. |
| Sidecar write-path deletion in B.4 leaves orphan IPC frames in old client builds | The frame-kind enum is forward-compatible (`MessageKind.WriteAlarmEventsRequest = 0x20`). Old clients sending the request to a new sidecar receive `Unsupported message kind`; new clients never send it. Acceptable — same-version deploy is the existing rollout convention. |
## Roll-out
Track A lands first onto `mxaccessgw/main`, deployed to the parity rig.
Track B lands onto `lmxopcua/master` once A.3 is live on the rig — earlier
Track B PRs can target a feature branch (`feat/alarms-over-gateway`) and
merge to master after the rig is fully green.
## Back-out
Each PR is individually revertable. The cleanest back-out point is at
the gateway-side enum extension: removing `MX_EVENT_FAMILY_ON_ALARM_TRANSITION`
from the proto means EventPump silently drops alarm events again and
GalaxyDriver's `OnAlarmEvent` never fires — but the sub-attribute fallback
path still produces functional alarms, so the OPC UA surface degrades to
v2-current behaviour without breaking. PR B.4 is the only one with a
non-trivial back-out (re-add the deleted sidecar IPC slot if revert
needed); land B.4 last and only after end-of-epic gate is green.
## Out of scope (explicit)
- **Other alarm sources beyond Galaxy.** AbCip / FOCAS / OpcUaClient
drivers already implement `IAlarmSource`; they're untouched.
- **Modbus / S7 / AbLegacy / TwinCAT alarms.** None of those protocols
has a native alarm bus. Alarms on those drivers, if needed, ship via
the scripted-alarm path.
- **Multi-Galaxy ack routing.** Today's gateway model is one Galaxy per
session; if a deployment splits across galaxies, each gets its own
GalaxyDriver and they don't cross-talk. No change.
- **OPC UA Part 9 advanced features** beyond the current scope —
shelving, subscribed-to-events-only, branch-state for re-trigger
semantics. Future epic if a customer asks.
- **Insight / cloud Historian write-back path.** Track A.5 targets the
on-prem AVEVA Historian via aahClientManaged. The cloud variant
would mirror the same gateway RPC over the REST API discussed in
`docs/histsdk` — separate epic.
## File inventory (touched)
**mxaccessgw:**
- `src\MxGateway.Contracts\Protos\mxaccess_gateway.proto` (A.1, A.5)
- `src\MxGateway.Contracts\Protos\mxaccess_worker.proto` (A.2, A.4, A.5)
- `src\MxGateway.Worker\…\Eventing\` (A.2, A.3, A.4)
- `src\MxGateway.Worker\…\Commands\` (A.3, A.4, A.5)
- `src\MxGateway.Server\Sessions\SessionEventStream.cs` (A.3)
- `src\MxGateway.Server\Rpc\` (A.3, A.4, A.5)
- `src\MxGateway.Server\Auth\Scopes.cs` (A.3, A.4, A.5)
- `MxGateway.Tests`, `MxGateway.Worker.Tests`, `MxGateway.IntegrationTests`
**lmxopcua:**
- `src\ZB.MOM.WW.OtOpcUa.Driver.Galaxy\Runtime\EventPump.cs` (B.1)
- `src\ZB.MOM.WW.OtOpcUa.Driver.Galaxy\Runtime\MxAccessSeverityMapper.cs` *(new — B.1)*
- `src\ZB.MOM.WW.OtOpcUa.Driver.Galaxy\Runtime\IGalaxyAlarmAcknowledger.cs` *(new — B.2)*
- `src\ZB.MOM.WW.OtOpcUa.Driver.Galaxy\Runtime\GatewayGalaxyAlarmAcknowledger.cs` *(new — B.2)*
- `src\ZB.MOM.WW.OtOpcUa.Driver.Galaxy\Runtime\GatewayAlarmHistorianWriter.cs` *(new — B.4)*
- `src\ZB.MOM.WW.OtOpcUa.Driver.Galaxy\GalaxyDriver.cs` (B.2)
- `src\ZB.MOM.WW.OtOpcUa.Driver.Galaxy\GalaxyDriverFactory.cs` (B.2, B.4)
- `src\ZB.MOM.WW.OtOpcUa.Server\OpcUa\DriverNodeManager.cs` (B.3)
- `src\ZB.MOM.WW.OtOpcUa.Server\Alarms\AlarmConditionService.cs` (B.3)
- `src\ZB.MOM.WW.OtOpcUa.Server\Phase7\Phase7Composer.cs` (B.4)
- `src\ZB.MOM.WW.OtOpcUa.Driver.Historian.Wonderware\Ipc\Contracts.cs` (B.4 — deletions)
- `src\ZB.MOM.WW.OtOpcUa.Driver.Historian.Wonderware\Ipc\HistorianFrameHandler.cs` (B.4 — deletions)
- `src\ZB.MOM.WW.OtOpcUa.Driver.Historian.Wonderware\Ipc\Framing.cs` (B.4 — deletions)
- `tests\ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Tests\Runtime\` (B.1, B.2)
- `tests\ZB.MOM.WW.OtOpcUa.Server.Tests\Alarms\` (B.3)
- `tests\ZB.MOM.WW.OtOpcUa.Driver.Historian.Wonderware.Tests\` (B.4 — drop deleted-contract tests)
- `docs\drivers\Galaxy.md` (B.5)
- `docs\AlarmTracking.md` *(new — B.5)*
- `docs\v1\AlarmTracking.md` (B.5 — banner update)
- `docs\plans\alarms-over-gateway.md` (B.5 — completion banner)
Total: ~12 source files added/modified in mxaccessgw; ~17 in lmxopcua;
~10 test files. Should land in 4-6 weeks of focused work given the
parity-rig dependency for end-to-end validation.