# Alarms Worker Wiring Plan

> **Context**: The alarms-over-gateway epic shipped 19 PRs across the
> `lmxopcua` and `mxaccessgw` repos (merged 2026-04-30). Contracts are live;
> the sub-attribute fallback path keeps Galaxy alarms functional today. Four
> items remain as inert scaffolds gated on a dev-rig finding. This document is
> the focused implementation plan for those four items only.
>
> **Do not duplicate `docs/plans/alarms-over-gateway.md`** — that document is
> the full historical record of all 19 PRs. This document covers only what is
> still to be done and exactly what blocks each item.
>
> **This work lives in the mxaccessgw sibling repo** at
> `C:\Users\dohertj2\Desktop\mxaccessgw\` — not in this (lmxopcua) repo,
> except where lmxopcua changes are noted explicitly.

---

## Dev-rig finding that blocks everything (2026-04-30)

During PR A.2 work the following was discovered on the dev box:

> The MXAccess COM Toolkit at
> `C:\Program Files (x86)\ArchestrA\Framework\Bin\ArchestrA.MXAccess.dll`
> exposes **no alarm-event family** — only `OnDataChange`, `OnWriteComplete`,
> `OperationComplete`, `OnBufferedDataChange`.
>
> AVEVA's `aaAlarmManagedClient` / `ArchestrAAlarmsAndEvents.SDK` assemblies
> are **x64-only** and incompatible with the worker's x86 net48 bitness.

The architectural decision required before A.2 or A.3/A.4 can ship (C.1 turned
out to be independent of it; see Item C.1):

> **Either** accept the value-driven sub-attribute path as the production
> architecture (operator-comment fidelity is the only v1 regression), **or**
> add an x64 alarm-helper sub-process alongside the x86 worker.

Resolution drives the implementation shape of every item below. The plan
presented here assumes the x64 alarm-helper sub-process route (the higher
parity option), but notes the sub-attribute-only exit at each step.

---

## Discovered AVEVA API surface

Before implementing, verify the following against the AVEVA SDK actually
installed on the dev box and in the mxaccessgw worker's deployment folder:

| Assembly | Bitness | Likely location | Key types |
|----------|---------|-----------------|-----------|
| `ArchestrA.MXAccess.dll` | x86 | `C:\Program Files (x86)\ArchestrA\Framework\Bin\` | `IMxAlarmEventSink`, `MxAlarmEventArgs` were expected, but the 2026-04-30 dev-rig finding shows no alarm-event family; treat as absent unless another installed version proves otherwise |
| `aaAlarmManagedClient.dll` | x64 | `C:\Program Files\ArchestrA\Framework\Bin\` | `AlarmClient`, `IAlarmConsumer`, `AlarmEventArgs` |
| `ArchestrAAlarmsAndEvents.SDK.dll` | x64 | Same or Historian SDK folder | No `AlarmHistorianWriter` type: the only installed assembly is `ArchestrAAlarmsAndEvents.SDK.Common.dll`, a WCF query-proxy base (see Item C.1). `GetAlarmExtendedRec` is still to be confirmed |

The AVEVA MXAccess Toolkit reference in the mxaccessgw repo (`gateway.md`) is
the canonical API doc for the gateway worker's side. The alarm-client API is
documented separately; verify the following call shapes during PR A.2:

| Operation | Likely API | Notes |
|-----------|-----------|-------|
| Subscribe to alarm events | `AlarmClient.RegisterConsumer(IAlarmConsumer)` + `AlarmClient.Subscribe(filterSpec)` | Confirm exact method signatures against the SDK version on the dev box |
| Receive alarm event | `IAlarmConsumer.OnAlarmEvent(AlarmEventArgs)` callback | Field set: alarm name, source, type, transition kind, severity, timestamps, operator fields |
| Acknowledge alarm | `AlarmClient.AcknowledgeAlarm(alarmRef, comment, userPrincipal)` or equivalent | Confirm whether this is synchronous or returns a status |
| Query active alarms | `AlarmClient.GetAlarmExtendedRec(filter)` or `GetActiveAlarms()` | Returns current active set for ConditionRefresh |
| Get statistics | `AlarmClient.GetStatistics()` | Optional — useful for worker health checks |

Record the exact method signatures against the installed SDK before starting
A.2 — the proto field set in `OnAlarmTransitionEvent` must match the SDK's
actual payload.

---

## Dependency order

```
A.2 (worker: AlarmClient subscription)
  └─► A.3 (gateway: dispatch OnAlarmTransition + AcknowledgeAlarm RPC handler)
        └─► A.4 (gateway: QueryActiveAlarms RPC handler)
              └─► lmxopcua B.2 (GalaxyDriver IAlarmSource live)
                    └─► C.1 (sidecar: AahClientManagedAlarmEventWriter live)
                          └─► D.1 (smoke artifact captured)
```

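A.2 is the single blocking item for the gateway chain: A.3, A.4, and lmxopcua
B.2 unblock serially once A.2 delivers alarm events through the channel. C.1
is the exception. It does not need A.2 (see Item C.1), and D.1 needs both the
gateway chain and C.1.

---
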
## Item A.2 — Worker: subscribe to MxAccess alarm event source

**Repo**: `mxaccessgw` — `src\MxGateway.Worker\` (net48, x86)

**What it needs**:

The worker must subscribe to AVEVA's alarm events and fan them into the same
bounded channel the data-change pump uses, translating each MxAccess alarm
event into a `WorkerEvent` proto with family `MX_EVENT_FAMILY_ON_ALARM_TRANSITION`
(defined in PR A.1, already merged).

**Architectural choice determines the implementation path**:

**Option X1 — aaAlarmManagedClient in a new x64 alarm-helper process**

Add a second worker-mode sub-process (`MxGateway.AlarmWorker`, net8.0 x64)
alongside the existing x86 worker. The AlarmWorker:

1. Loads `aaAlarmManagedClient.dll` (x64) on startup.
2. Calls `AlarmClient.RegisterConsumer` with a `WorkerAlarmConsumer` sink.
3. Calls `AlarmClient.Subscribe` with a session-level filter (all alarms for
   the session's Galaxy scope).
4. Translates each `IAlarmConsumer.OnAlarmEvent` callback into a protobuf
   `WorkerEvent` (family `ON_ALARM_TRANSITION`) and writes it to an IPC
   channel readable by the gateway server-side multiplexer.
5. Handles session lifecycle: re-subscribes after reconnect; unsubscribes on
   session close.

IPC from AlarmWorker to gateway: the simplest option is a named pipe, or an
in-process queue if the AlarmWorker is instead hosted inside the gateway
process as a separate `IHostedService`.

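Steps 2 to 4 can be sketched roughly as follows. This is a minimal sketch rather than the worker's actual code: `AlarmTransitionSketch` and `WorkerEventSketch` are hypothetical stand-ins for the SDK's `AlarmEventArgs` payload and the generated `WorkerEvent` proto, and counting a drop when the bounded channel is full is an assumed back-pressure policy that the data-change pump may not share.

```csharp
using System;
using System.Threading;
using System.Threading.Channels;

// Hypothetical stand-ins, for illustration only. The real payload comes from the
// SDK callback and the real envelope is the WorkerEvent proto from PR A.1.
public sealed record AlarmTransitionSketch(
    string AlarmRef, string Source, string TransitionKind, int Severity, DateTimeOffset TimestampUtc);

public sealed record WorkerEventSketch(string Family, AlarmTransitionSketch Body);

public sealed class WorkerAlarmConsumerSketch
{
    public const string Family = "MX_EVENT_FAMILY_ON_ALARM_TRANSITION";

    private readonly ChannelWriter<WorkerEventSketch> _writer;
    private long _dropped;

    public WorkerAlarmConsumerSketch(ChannelWriter<WorkerEventSketch> writer) => _writer = writer;

    // Number of events rejected because the bounded channel was full.
    public long Dropped => Interlocked.Read(ref _dropped);

    // Intended to run on the alarm callback thread, so it never blocks:
    // TryWrite returns false immediately when the channel is at capacity.
    public void OnAlarmEvent(AlarmTransitionSketch transition)
    {
        var workerEvent = new WorkerEventSketch(Family, transition);
        if (!_writer.TryWrite(workerEvent))
        {
            Interlocked.Increment(ref _dropped);
        }
    }
}
```

The sketch expects a channel created with `Channel.CreateBounded<WorkerEventSketch>(...)` and `BoundedChannelFullMode.Wait`, so that a full channel surfaces as a failed `TryWrite` instead of a silently discarded item.
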
**Option X2 — Accept sub-attribute fallback as production (no A.2 work)**

If the architectural decision is to accept the sub-attribute path as permanent:

- `MxAccessAlarmEventSink.Attach()` in the worker remains a no-op (as
  currently coded with the architectural comment).
- The `MX_EVENT_FAMILY_ON_ALARM_TRANSITION` proto family stays defined but
  the gateway never emits events on it.
- lmxopcua's `GalaxyDriver` does not implement `IAlarmSource` for the
  native path; the value-driven sub-attribute path remains the production
  path.
- The only regression vs. v1 is operator-comment fidelity on Galaxy alarms.
- C.1 is still needed if scripted-alarm historian write-back is required.

**What blocks it**: the architectural decision above. Once made, A.2 becomes
a 2–3 day implementation task (sub-process plumbing + proto translation +
unit tests for the consumer sink cancellation behaviour).

**Tests to write (when A.2 proceeds)**:

- `WorkerAlarmConsumerTests` — fake `IAlarmConsumer` source emits canned
  transitions; assert each produces the correct `WorkerEvent` body shape.
- Cancellation/session-close test — closing the session unsubscribes from
  the AlarmClient cleanly (no leaked `IAlarmConsumer` reference if the
  worker is recycled mid-session).
- Re-subscribe-after-reconnect test — `ReconnectSupervisor` triggers a
  reconnect; assert the alarm consumer re-attaches to the new session.

---

## Item A.3 / A.4 — Gateway: dispatch and RPC handlers

**Repo**: `mxaccessgw` — `src\MxGateway.Server\`

**Depends on**: A.2 delivering `WorkerEvent` bodies with family
`MX_EVENT_FAMILY_ON_ALARM_TRANSITION`.

**What it needs**:

### A.3 — Dispatch + AcknowledgeAlarm

1. The session-level event multiplexer (`Sessions\SessionEventStream.cs` or
   equivalent — verify name in the mxaccessgw repo) must recognise the new
   `WorkerEvent` body and forward it as an `MxEvent` with family
   `MX_EVENT_FAMILY_ON_ALARM_TRANSITION` to every `StreamEvents` subscriber
   for that session.

2. New RPC handler `AcknowledgeAlarm` builds an `AlarmAcknowledgeCommand`
   worker command and forwards it to the alarm-helper process (Option X1) or
   the worker's MxAccess session (Option X2 if MxAccess exposes ack). Maps
   the reply status to `AcknowledgeAlarmReply.MxStatusProxy`.

3. Authorization: new API scope `invoke:alarm-ack` on the API key. Keys
   without it receive `PERMISSION_DENIED`. Follow the existing scope-check
   pattern used by `invoke:write`.

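The scope rule in step 3 can be sketched as a pure check. This is a hypothetical illustration: the gateway's existing scope-check helper behind `invoke:write` is not shown in this document, so `ScopeCheckResult` and `AlarmScopeSketch` are names invented here, and exact ordinal matching is an assumption. The second constant is the `invoke:alarm-query` scope that A.4 introduces.

```csharp
using System;
using System.Collections.Generic;

public enum ScopeCheckResult { Allowed, PermissionDenied }

public static class AlarmScopeSketch
{
    public const string AlarmAck = "invoke:alarm-ack";
    public const string AlarmQuery = "invoke:alarm-query";

    // Scopes are compared exactly (ordinal), so holding "invoke:write" or
    // "invoke:alarm-query" does not imply "invoke:alarm-ack".
    public static ScopeCheckResult Check(IEnumerable<string> keyScopes, string requiredScope)
    {
        foreach (var scope in keyScopes)
        {
            if (string.Equals(scope, requiredScope, StringComparison.Ordinal))
            {
                return ScopeCheckResult.Allowed;
            }
        }
        return ScopeCheckResult.PermissionDenied;
    }
}
```

A handler would map `PermissionDenied` to the gRPC `PERMISSION_DENIED` status before any worker command is built, so a denied key never reaches the alarm-helper process.
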
### A.4 — QueryActiveAlarms

1. New RPC handler `QueryActiveAlarms` calls `AlarmClient.GetAlarmExtendedRec`
   (or `GetActiveAlarms` — confirm the method name during implementation)
   on the alarm-helper process, batches results into `ActiveAlarmSnapshot`
   proto messages, and streams them back to the caller.

2. New API scope `invoke:alarm-query` (separate from ack so read-only clients
   can refresh without ack rights).

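The streaming shape of step 1 might look like the sketch below, assuming one `ActiveAlarmSnapshot` message per active alarm. The record types are hypothetical stand-ins for the proto messages, and `queryHelper` is a placeholder for whatever call into the alarm-helper process the confirmed SDK method ends up behind.

```csharp
using System;
using System.Collections.Generic;
using System.Runtime.CompilerServices;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical shapes for illustration; the real message is the ActiveAlarmSnapshot proto.
public sealed record ActiveAlarmRecordSketch(string AlarmRef, string State, int Severity);
public sealed record ActiveAlarmSnapshotSketch(string AlarmRef, string State, int Severity);

public static class QueryActiveAlarmsSketch
{
    // Fetches the current active set once, then yields one snapshot per alarm
    // so the RPC layer can write each to the response stream as it arrives.
    public static async IAsyncEnumerable<ActiveAlarmSnapshotSketch> StreamAsync(
        Func<CancellationToken, Task<IReadOnlyList<ActiveAlarmRecordSketch>>> queryHelper,
        [EnumeratorCancellation] CancellationToken cancellationToken = default)
    {
        var records = await queryHelper(cancellationToken).ConfigureAwait(false);
        foreach (var record in records)
        {
            // Stop promptly if the caller disconnects mid-stream.
            cancellationToken.ThrowIfCancellationRequested();
            yield return new ActiveAlarmSnapshotSketch(record.AlarmRef, record.State, record.Severity);
        }
    }
}
```

An empty active set simply completes the stream with no messages, which is the 0-entry case a ConditionRefresh caller has to tolerate.
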
**What blocks A.3/A.4**: A.2 must deliver `WorkerEvent` bodies on the channel.
A.3/A.4 are pure dispatch wiring once the events arrive.

**Tests to write**:

- A.3 dispatch test — fake worker emits an `AlarmTransition` event; assert
  the gateway forwards it on the `StreamEvents` channel of every subscribed
  session (mirrors existing `OnDataChange` dispatch tests).
- A.3 AcknowledgeAlarm auth test — existing key without `invoke:alarm-ack`
  scope returns `PERMISSION_DENIED`.
- A.4 pagination test — synthetic active-alarm set of 0 / 1 / 100 entries;
  assert each streams back as separate `ActiveAlarmSnapshot` messages.
- Integration (parity rig — requires dev box with AVEVA platform):
  trigger a real Galaxy alarm, call `QueryActiveAlarms`, assert the alarm
  appears in the stream; call `AcknowledgeAlarm`, assert the alarm transitions
  to `ActiveAcked` and an `Acknowledge` transition event appears on
  `StreamEvents`.

---

## Item C.1 — Historian sidecar: AahClientManagedAlarmEventWriter

**Repo**: `lmxopcua` — `src\Drivers\ZB.MOM.WW.OtOpcUa.Driver.Historian.Wonderware\`

**Depends on**: Architectural decision (the sidecar uses `aahClientManaged`
x64, which is not bitness-constrained like the worker). C.1 can be unblocked
independently of A.2 when the goal is to wire up the scripted-alarm historian
path.

**Current state (DONE — code)**:

C.1 shipped. `SdkAlarmHistorianWriteBackend.WriteBatchAsync` writes through the
real SDK entry point — **`HistorianAccess.AddStreamedValue(HistorianEvent, out
HistorianAccessError)`** in `aahClientManaged` — pinned 2026-05-18 by
decompiling the installed SDK. `Program.cs` and `Install-Services.ps1` were
already wired in the PR C.1 scaffolding. Two corrections to the assumptions
this doc was written under:

- **There is no `ArchestrAAlarmsAndEvents.SDK` writer.** That assembly
  (`ArchestrAAlarmsAndEvents.SDK.Common.dll`, the only one installed) is a WCF
  query-proxy base — no `AlarmHistorianWriter` type. The write path is the
  `aahClientManaged` `HistorianAccess` surface.
- **The write path needs its own connection.** The query-side
  `HistorianDataSource` opens `ReadOnly` sessions; `AddStreamedValue` on a
  read-only session fails with `WriteToReadOnlyFile`.
  `SdkAlarmHistorianWriteBackend` opens a dedicated `ReadOnly=false` connection
  and shares only `HistorianClusterEndpointPicker` (not the connection object).

**What it needed** (all done):

1. `SdkAlarmHistorianWriteBackend` builds a `HistorianEvent` per
   `AlarmHistorianEventDto`, calls `AddStreamedValue`, and maps
   `HistorianAccessError.ErrorValue` codes through
   `AahClientManagedAlarmEventWriter.MapOutcome` (Ack / PermanentFail /
   RetryPlease). `HistorianClusterEndpointPicker` drives multi-node failover.
2. `Program.cs` — `BuildAlarmWriter()` constructs the backend gated behind
   `OTOPCUA_HISTORIAN_ALARM_WRITE_ENABLED`.
3. `Install-Services.ps1` — env var present in the install-time block.

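The outcome mapping in item 1 can be sketched as below. Only the three outcome names come from the text; the rest is assumed for illustration. `WriteErrorClassSketch` is a hypothetical grouping, since the real `HistorianAccessError.ErrorValue` codes are not listed in this document, and the rule that a connection-class error defers the remaining rows as `RetryPlease` and resets the connection is the behaviour assumed here rather than a copy of the backend.

```csharp
using System;
using System.Collections.Generic;

public enum WriteOutcomeSketch { Ack, RetryPlease, PermanentFail }

// Hypothetical error classes; the real SDK reports HistorianAccessError.ErrorValue codes.
public enum WriteErrorClassSketch { None, Connection, MalformedInput }

public static class AlarmBatchSketch
{
    public static WriteOutcomeSketch Classify(WriteErrorClassSketch error) => error switch
    {
        WriteErrorClassSketch.None => WriteOutcomeSketch.Ack,
        WriteErrorClassSketch.Connection => WriteOutcomeSketch.RetryPlease,
        WriteErrorClassSketch.MalformedInput => WriteOutcomeSketch.PermanentFail,
        _ => throw new ArgumentOutOfRangeException(nameof(error)),
    };

    // Returns one outcome per input event, in input order.
    public static IReadOnlyList<WriteOutcomeSketch> WriteBatch<TEvent>(
        IReadOnlyList<TEvent> events,
        Func<TEvent, WriteErrorClassSketch> writeOne,
        Action resetConnection)
    {
        var outcomes = new WriteOutcomeSketch[events.Count];
        for (var i = 0; i < events.Count; i++)
        {
            var error = writeOne(events[i]);
            outcomes[i] = Classify(error);
            if (error != WriteErrorClassSketch.Connection) continue;

            // Connection-class failure: defer every remaining row and drop the connection.
            for (var j = i + 1; j < events.Count; j++) outcomes[j] = WriteOutcomeSketch.RetryPlease;
            resetConnection();
            break;
        }
        return outcomes;
    }
}
```

Keeping `Classify` separate from the batch loop lets the error-code table be tested without any connection, while the loop is tested with a fake `writeOne`.
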
**What remains for C.1**: only the live-rig write smoke — the `Live_*` tests
in `SdkAlarmHistorianWriteBackendTests` stay `Skip`-gated until D.1 confirms a
round-trip against a real AVEVA Historian, including the exact mandatory
`HistorianEvent` field set.

**Tests to write**:

- Outcome-mapping table: every `HistorianAccessError.ErrorValue` code seen on
  alarm-write → expected `HistorianWriteOutcome`.
- Batch test: 1 / 100 / 1000 events through a fake `aahClientManaged`
  writer; assert per-row outcome list parallel to input order.
- Cluster failover: primary Historian node returns `BadCommunicationError`;
  picker rotates to secondary; eventual success.
- `Program.cs` seam: assert handler constructed with alarm writer when env
  var enabled; without it when disabled.
- Live integration (parity rig): write a synthetic alarm event through the
  IPC; query it back via `ReadEvents`; assert round-trip fidelity.

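The `Program.cs` seam test in the list above is easier to write if the environment lookup is injected. The sketch below is hypothetical throughout: the real `BuildAlarmWriter()` signature is not shown in this document, the `...Sketch` types are placeholders, and the accepted values (`1`, or `true` in any casing) are an assumption about how `OTOPCUA_HISTORIAN_ALARM_WRITE_ENABLED` is parsed. Production code could pass `Environment.GetEnvironmentVariable` as the lookup.

```csharp
#nullable enable
using System;

// Placeholder types; the real seam returns the sidecar's alarm writer.
public interface IAlarmEventWriterSketch { }
public sealed class SdkBackedWriterSketch : IAlarmEventWriterSketch { }

public static class AlarmWriterGateSketch
{
    public const string EnableVariable = "OTOPCUA_HISTORIAN_ALARM_WRITE_ENABLED";

    // Assumption: "1" or "true" (any casing) enables the writer;
    // anything else, including an unset variable, disables it.
    public static bool IsEnabled(string? rawValue)
    {
        if (rawValue is null) return false;
        var value = rawValue.Trim();
        return value == "1" || string.Equals(value, "true", StringComparison.OrdinalIgnoreCase);
    }

    // readVariable is injected so a test never has to mutate the process environment.
    public static IAlarmEventWriterSketch? BuildAlarmWriter(Func<string, string?> readVariable) =>
        IsEnabled(readVariable(EnableVariable)) ? new SdkBackedWriterSketch() : null;
}
```
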
---

## Item D.1 — Smoke artifact

**Repo**: `lmxopcua` (deployment refresh) + `mxaccessgw` (rig verification)

**Depends on**: A.2, A.3, A.4, and C.1 all passing on the dev rig with a live
Galaxy and live Historian.

**Current state**: The deployment script `Refresh-Services.ps1` (task D.1) has
shipped as PR #417 (merged 2026-04-30). What was NOT captured at that time was
a smoke artifact — a log snippet or test output confirming that:

1. An alarm transition event from a live Galaxy alarm reaches lmxopcua's
   `AlarmConditionService` via the new `IAlarmSource` path (not the fallback).
2. A scripted-alarm historian write-back reaches AVEVA Historian via the
   sidecar `IAlarmEventWriter`.

**What it needs**:

Once A.2, A.3, C.1 are wired on the parity rig:

1. Deploy the updated mxaccessgw (with A.2 / A.3 / A.4 changes).
2. Deploy the updated sidecar (with C.1 changes).
3. Run `Refresh-Services.ps1` to confirm clean service restarts.
4. Trigger a Galaxy alarm (e.g. set an AnalogLimitAlarm attribute out of
   range in Galaxy IDE).
5. Observe the lmxopcua OPC UA alarm surface via the Client CLI:

   ```powershell
   dotnet run --project src/Client/ZB.MOM.WW.OtOpcUa.Client.CLI -- `
     alarms -u opc.tcp://localhost:4840 --subscribe
   ```

   Pass: the alarm condition appears on the OPC UA A&E surface within
   2 × publishing interval.

6. Trigger a scripted alarm via the lmxopcua `ScriptedAlarmEngine`
   (or an OPC UA method call if one is wired).
7. Confirm in the AVEVA Historian that the scripted alarm event is stored
   (query via the Historian client or HistorianWatch tool).

8. Capture log snippets:
   - mxaccessgw log: `[INF] AlarmTransition dispatched sessionId=<> alarmRef=<>`
   - lmxopcua log: `[INF] AlarmConditionService: IAlarmSource event alarmRef=<> origin=Driver`
   - Sidecar log: `[INF] AahClientManagedAlarmEventWriter: Wrote <n> alarm events`

9. Commit the log snippets as `docs/plans/alarms-d1-smoke-artifact.md`
   (a new doc, not this one).

**What blocks D.1**: all of A.2, A.3, C.1, plus the operator decision on the
x64 alarm-helper architecture (or explicit acceptance of the sub-attribute
fallback as production).

---

## Summary of blocks

| Item | Blocked by | Estimated effort once unblocked |
|------|-----------|--------------------------------|
| A.2 | Architectural decision (x64 alarm-helper vs. sub-attribute fallback as production) | 2–3 days implementation; 1 day tests |
| A.3 | A.2 delivering WorkerEvent bodies | 1–2 days |
| A.4 | A.2 (active-alarm query needs AlarmClient session) | 1 day |
| C.1 | Code shipped (was never blocked by A.2); only the `Live_*` write smoke remains, gated on D.1 | Covered by the D.1 smoke |
| D.1 | A.2 + A.3 + C.1 all passing on parity rig | 0.5 day (smoke + artifact capture) |

C.1 can proceed in parallel with A.2 / A.3 since the sidecar's `aahClientManaged`
is x64 and does not share the worker bitness constraint.

---

## What this plan does NOT cover

- The value-driven sub-attribute fallback path — already shipped and
  functional (not being changed).
- Track B (lmxopcua EventPump, GalaxyDriver IAlarmSource re-implementation)
  and Track E (client SDK surface refresh) from the alarms-over-gateway plan —
  those are in `lmxopcua` and depend on A.3 being live; they follow naturally
  once A.3 ships.
- Galaxy-native alarm historian path — System Platform's own `HistorizeToAveva`
  toggle on the Galaxy template; not in scope.
- Alarm ACL / role-grant surface — already shipped in Phase 6.2.