lmxopcua/docs/plans/alarms-worker-wiring-plan.md
Joseph Doherty cd2306db66 feat(historian-sidecar): live aahClientManaged alarm-event write path (C.1)
SdkAlarmHistorianWriteBackend.WriteBatchAsync replaces the RetryPlease
placeholder with the real entry point — HistorianAccess.AddStreamedValue
(HistorianEvent, out HistorianAccessError) in aahClientManaged, pinned by
decompiling the installed SDK.

The write path opens its own ReadOnly=false connection: the query-side
HistorianDataSource opens ReadOnly sessions and AddStreamedValue fails on
those with WriteToReadOnlyFile. IHistorianConnectionFactory gains a readOnly
parameter (default true, query path unchanged); BuildConnectionArgs is
extracted as a pure helper. HistorianClusterEndpointPicker is shared for
node failover; connection-class errors abort the batch as RetryPlease and
reset the connection, malformed-input codes map to PermanentFail.

Tests: connection-unavailable batch deferral, ClassifyOutcome error-code
table, BuildConnectionArgs read-vs-write shaping (80 pass, 2 rig-skipped).
Live_* round-trip tests stay Skip-gated for the D.1 rollout smoke.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 16:08:32 -04:00

# Alarms Worker Wiring Plan
> **Context**: The alarms-over-gateway epic shipped 19 PRs across the
> `lmxopcua` and `mxaccessgw` repos (merged 2026-04-30). Contracts are live;
> the sub-attribute fallback path keeps Galaxy alarms functional today. Four
> items remain as inert scaffolds gated on a dev-rig finding. This document is
> the focused implementation plan for those four items only.
>
> **Do not duplicate `docs/plans/alarms-over-gateway.md`** — that document is
> the full historical record of all 19 PRs. This document covers only what is
> still to be done and exactly what blocks each item.
>
> **This work lives in the mxaccessgw sibling repo** at
> `C:\Users\dohertj2\Desktop\mxaccessgw\` — not in this (lmxopcua) repo,
> except where lmxopcua changes are noted explicitly.
---
## Dev-rig finding that blocks everything (2026-04-30)
During PR A.2 work the following was discovered on the dev box:
> The MXAccess COM Toolkit at
> `C:\Program Files (x86)\ArchestrA\Framework\Bin\ArchestrA.MXAccess.dll`
> exposes **no alarm-event family** — only `OnDataChange`, `OnWriteComplete`,
> `OperationComplete`, `OnBufferedDataChange`.
>
> AVEVA's `aaAlarmManagedClient` / `ArchestrAAlarmsAndEvents.SDK` assemblies
> are **x64-only** and incompatible with the worker's x86 net48 bitness.
The architectural decision required before any of A.2, A.3/A.4, C.1 can ship:
> **Either** accept the value-driven sub-attribute path as the production
> architecture (operator-comment fidelity is the only v1 regression), **or**
> add an x64 alarm-helper sub-process alongside the x86 worker.
Resolution drives the implementation shape of every item below. The plan
presented here assumes the x64 alarm-helper sub-process route (the higher
parity option), but notes the sub-attribute-only exit at each step.
---
## Discovered AVEVA API surface
Before implementing, verify the following against the AVEVA SDK actually
installed on the dev box and in the mxaccessgw worker's deployment folder:
| Assembly | Bitness | Likely location | Key types |
|----------|---------|-----------------|-----------|
| `ArchestrA.MXAccess.dll` | x86 | `C:\Program Files (x86)\ArchestrA\Framework\Bin\` | `IMxAlarmEventSink`, `MxAlarmEventArgs` (**confirm exists at actual version**) |
| `aaAlarmManagedClient.dll` | x64 | `C:\Program Files\ArchestrA\Framework\Bin\` | `AlarmClient`, `IAlarmConsumer`, `AlarmEventArgs` |
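| `ArchestrAAlarmsAndEvents.SDK.dll` | x64 | Same or Historian SDK folder | `AlarmHistorianWriter`, `GetAlarmExtendedRec` (C.1 found no `AlarmHistorianWriter` type in the installed `ArchestrAAlarmsAndEvents.SDK.Common.dll`; see Item C.1) |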
The AVEVA MXAccess Toolkit reference in the mxaccessgw repo (`gateway.md`) is
the canonical API doc for the gateway worker's side. The alarm-client API is
documented separately; verify the following call shapes during PR A.2:
| Operation | Likely API | Notes |
|-----------|-----------|-------|
| Subscribe to alarm events | `AlarmClient.RegisterConsumer(IAlarmConsumer)` + `AlarmClient.Subscribe(filterSpec)` | Confirm exact method signatures against the SDK version on the dev box |
| Receive alarm event | `IAlarmConsumer.OnAlarmEvent(AlarmEventArgs)` callback | Field set: alarm name, source, type, transition kind, severity, timestamps, operator fields |
| Acknowledge alarm | `AlarmClient.AcknowledgeAlarm(alarmRef, comment, userPrincipal)` or equivalent | Confirm whether this is synchronous or returns a status |
| Query active alarms | `AlarmClient.GetAlarmExtendedRec(filter)` or `GetActiveAlarms()` | Returns current active set for ConditionRefresh |
| Get statistics | `AlarmClient.GetStatistics()` | Optional — useful for worker health checks |
Record the exact method signatures against the installed SDK before starting
A.2 — the proto field set in `OnAlarmTransitionEvent` must match the SDK's
actual payload.
---
## Dependency order
```
A.2 (worker: AlarmClient subscription)
└─► A.3 (gateway: dispatch OnAlarmTransition + AcknowledgeAlarm RPC handler)
└─► A.4 (gateway: QueryActiveAlarms RPC handler)
└─► lmxopcua B.2 (GalaxyDriver IAlarmSource live)
└─► C.1 (sidecar: AahClientManagedAlarmEventWriter live)
└─► D.1 (smoke artifact captured)
```
A.2 is the single blocking item for the gateway track: A.3, A.4, and B.2
unblock serially once A.2 delivers alarm events through the channel. C.1 is
the exception; the sidecar is x64 and can proceed independently of A.2.
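The ordering can be checked mechanically. The following is a minimal sketch in Python (chosen only for brevity; the components themselves are .NET), assuming the dependency edges stated in the item sections of this plan: A.3 and A.4 wait on A.2, B.2 waits on A.3, C.1 waits on nothing, and D.1 waits on A.2, A.3, A.4, and C.1. The `DEPENDS_ON` table and `ready_waves` helper are illustrative names, not part of either repo.

```python
from graphlib import TopologicalSorter

# Item -> set of items it waits on, as read from this plan (an assumption
# wherever the plan is not explicit, e.g. D.1 not waiting on B.2).
DEPENDS_ON = {
    "A.2": set(),
    "A.3": {"A.2"},
    "A.4": {"A.2"},
    "B.2": {"A.3"},
    "C.1": set(),  # x64 sidecar, independent of the worker bitness issue
    "D.1": {"A.2", "A.3", "A.4", "C.1"},
}


def ready_waves(depends_on):
    """Yield groups of items whose dependencies are all complete."""
    sorter = TopologicalSorter(depends_on)
    sorter.prepare()
    while sorter.is_active():
        wave = sorted(sorter.get_ready())
        yield wave
        sorter.done(*wave)


waves = list(ready_waves(DEPENDS_ON))
for number, wave in enumerate(waves, start=1):
    print(f"wave {number}: {', '.join(wave)}")
```

Under these assumed edges, A.2 and C.1 land in the first wave, which matches the note that C.1 does not share the worker bitness constraint.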
---
## Item A.2 — Worker: subscribe to MxAccess alarm event source
**Repo**: `mxaccessgw`, `src\MxGateway.Worker\` (net48, x86)
**What it needs**:
The worker must subscribe to AVEVA's alarm events and fan them into the same
bounded channel the data-change pump uses, translating each MxAccess alarm
event into a `WorkerEvent` proto with family `MX_EVENT_FAMILY_ON_ALARM_TRANSITION`
(defined in PR A.1, already merged).
**Architectural choice determines the implementation path**:
**Option X1 — aaAlarmManagedClient in a new x64 alarm-helper process**
Add a second worker-mode sub-process (`MxGateway.AlarmWorker`, net8.0 x64)
alongside the existing x86 worker. The AlarmWorker:
1. Loads `aaAlarmManagedClient.dll` (x64) on startup.
2. Calls `AlarmClient.RegisterConsumer` with a `WorkerAlarmConsumer` sink.
3. Calls `AlarmClient.Subscribe` with a session-level filter (all alarms for
the session's Galaxy scope).
4. Translates each `IAlarmConsumer.OnAlarmEvent` callback into a protobuf
`WorkerEvent` (family `ON_ALARM_TRANSITION`) and writes it to an IPC
channel readable by the gateway server-side multiplexer.
5. Handles session lifecycle: re-subscribes after reconnect; unsubscribes on
session close.
IPC from AlarmWorker to gateway: simplest option is a named pipe or an
in-process queue if the AlarmWorker is hosted in the same gateway process
space as a separate `IHostedService`.
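The translation step in Option X1 can be sketched as follows. This is a language-neutral illustration in Python, not the .NET implementation: `AlarmCallback` and `WorkerEvent` are hypothetical stand-ins for the SDK callback payload and the proto message, the field names are guesses pending the signature check above, and dropping events when the bounded channel is full is an assumed policy that the plan does not specify.

```python
import queue
from dataclasses import dataclass

FAMILY_ON_ALARM_TRANSITION = "MX_EVENT_FAMILY_ON_ALARM_TRANSITION"


@dataclass(frozen=True)
class AlarmCallback:
    """Hypothetical stand-in for the SDK's alarm callback payload."""
    alarm_name: str
    source: str
    transition: str
    severity: int


@dataclass(frozen=True)
class WorkerEvent:
    """Hypothetical stand-in for the WorkerEvent proto message."""
    family: str
    session_id: str
    body: dict


class WorkerAlarmConsumer:
    """Translates alarm callbacks into WorkerEvents on a bounded channel."""

    def __init__(self, session_id, channel):
        self._session_id = session_id
        self._channel = channel
        self._attached = True
        self.dropped = 0

    def on_alarm_event(self, callback):
        """Return True if the event was queued, False if ignored or dropped."""
        if not self._attached:
            return False  # session closed: ignore late callbacks
        event = WorkerEvent(
            family=FAMILY_ON_ALARM_TRANSITION,
            session_id=self._session_id,
            body={
                "alarm_name": callback.alarm_name,
                "source": callback.source,
                "transition": callback.transition,
                "severity": callback.severity,
            },
        )
        try:
            self._channel.put_nowait(event)
            return True
        except queue.Full:
            self.dropped += 1  # assumed policy: count and drop when full
            return False

    def detach(self):
        """Called on session close so no further events are queued."""
        self._attached = False


channel = queue.Queue(maxsize=2)
consumer = WorkerAlarmConsumer("session-1", channel)
results = [
    consumer.on_alarm_event(AlarmCallback("Tank.Hi", "Tank", "Raise", 500))
    for _ in range(3)
]
consumer.detach()
after_detach = consumer.on_alarm_event(AlarmCallback("Tank.Hi", "Tank", "Clear", 500))
```

The `detach` flag is the piece the cancellation test would exercise: once the session closes, late callbacks must not reach the channel.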
**Option X2 — Accept sub-attribute fallback as production (no A.2 work)**
If the architectural decision is to accept the sub-attribute path as permanent:
- `MxAccessAlarmEventSink.Attach()` in the worker remains a no-op (as
currently coded with the architectural comment).
- The `MX_EVENT_FAMILY_ON_ALARM_TRANSITION` proto family stays defined but
the gateway never emits events on it.
- lmxopcua's `GalaxyDriver` does not implement `IAlarmSource` for the
native path; the value-driven sub-attribute path remains the production
path.
- The only regression vs. v1 is operator-comment fidelity on Galaxy alarms.
- C.1 is still needed if scripted-alarm historian write-back is required.
**What blocks it**: the architectural decision above. Once made, A.2 becomes
a 2-3 day implementation task (sub-process plumbing + proto translation +
unit tests for the consumer sink cancellation behaviour).
**Tests to write (when A.2 proceeds)**:
- `WorkerAlarmConsumerTests` — fake `IAlarmConsumer` source emits canned
transitions; assert each produces the correct `WorkerEvent` body shape.
- Cancellation/session-close test — closing the session unsubscribes from
the AlarmClient cleanly (no leaked `IAlarmConsumer` reference if the
worker is recycled mid-session).
- Re-subscribe-after-reconnect test — `ReconnectSupervisor` triggers a
reconnect; assert the alarm consumer re-attaches to the new session.
---
## Item A.3 / A.4 — Gateway: dispatch and RPC handlers
**Repo**: `mxaccessgw`, `src\MxGateway.Server\`
**Depends on**: A.2 delivering `WorkerEvent` bodies with family
`MX_EVENT_FAMILY_ON_ALARM_TRANSITION`.
**What it needs**:
### A.3 — Dispatch + AcknowledgeAlarm
1. The session-level event multiplexer (`Sessions\SessionEventStream.cs` or
equivalent — verify name in the mxaccessgw repo) must recognise the new
`WorkerEvent` body and forward it as an `MxEvent` with family
`MX_EVENT_FAMILY_ON_ALARM_TRANSITION` to every `StreamEvents` subscriber
for that session.
2. New RPC handler `AcknowledgeAlarm` builds an `AlarmAcknowledgeCommand`
worker command and forwards it to the alarm-helper process (Option X1) or
the worker's MxAccess session (Option X2 if MxAccess exposes ack). Maps
the reply status to `AcknowledgeAlarmReply.MxStatusProxy`.
3. Authorization: new API scope `invoke:alarm-ack` on the API key. Keys
without it receive `PERMISSION_DENIED`. Follow the existing scope-check
pattern used by `invoke:write`.
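A minimal sketch of the A.3 dispatch and scope check, written in Python for brevity rather than in the gateway's own .NET code. The function names, the dictionary-shaped event and command, and the in-memory subscriber lists are all hypothetical; only the scope string `invoke:alarm-ack` and the `PERMISSION_DENIED` result come from the plan.

```python
from enum import Enum


class Status(Enum):
    OK = "OK"
    PERMISSION_DENIED = "PERMISSION_DENIED"


SCOPE_ALARM_ACK = "invoke:alarm-ack"


def dispatch_transition(event, subscribers_by_session):
    """Forward one alarm transition to every subscriber of its session.

    Returns the number of subscribers that received the event.
    """
    delivered = 0
    for subscriber in subscribers_by_session.get(event["session_id"], []):
        subscriber.append(event)
        delivered += 1
    return delivered


def acknowledge_alarm(key_scopes, alarm_ref, comment, forward):
    """Check the ack scope first, then forward the command downstream.

    `forward` stands in for whatever sends the command to the alarm-helper
    process (or the worker session) and returns its reply status.
    """
    if SCOPE_ALARM_ACK not in key_scopes:
        return Status.PERMISSION_DENIED, None
    reply = forward({"alarm_ref": alarm_ref, "comment": comment})
    return Status.OK, reply


# Usage: two subscribers on session s1, one on s2.
sub_a, sub_b, sub_c = [], [], []
subscribers = {"s1": [sub_a, sub_b], "s2": [sub_c]}
count = dispatch_transition({"session_id": "s1", "alarm_ref": "Tank.Hi"}, subscribers)

forwarded = []
denied = acknowledge_alarm({"invoke:write"}, "Tank.Hi", "seen", forwarded.append)
allowed = acknowledge_alarm(
    {"invoke:write", SCOPE_ALARM_ACK}, "Tank.Hi", "seen",
    lambda command: "acked:" + command["alarm_ref"],
)
```

Checking the scope before building the command keeps a denied key from ever reaching the worker, which is the behaviour the auth test below asserts.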
### A.4 — QueryActiveAlarms
1. New RPC handler `QueryActiveAlarms` calls `AlarmClient.GetAlarmExtendedRec`
(or `GetActiveAlarms` — confirm the method name during implementation)
on the alarm-helper process, batches results into `ActiveAlarmSnapshot`
proto messages, and streams them back to the caller.
2. New API scope `invoke:alarm-query` (separate from ack so read-only clients
can refresh without ack rights).
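The A.4 handler shape can be sketched the same way. This Python illustration assumes one snapshot message per active alarm, as the pagination test below implies; `fetch_active` is a hypothetical callable standing in for the alarm-helper query whose real method name is still to be confirmed, and the record fields are placeholders.

```python
SCOPE_ALARM_QUERY = "invoke:alarm-query"


def _to_snapshot(record):
    """Map one active-alarm record to a snapshot message (fields assumed)."""
    return {
        "alarm_ref": record["ref"],
        "state": record["state"],
        "severity": record["severity"],
    }


def query_active_alarms(key_scopes, fetch_active):
    """Check the query scope eagerly, then stream one snapshot per alarm."""
    if SCOPE_ALARM_QUERY not in key_scopes:
        raise PermissionError("PERMISSION_DENIED: missing " + SCOPE_ALARM_QUERY)
    return (_to_snapshot(record) for record in fetch_active())


def fake_active_set(count):
    """Build a synthetic active-alarm source with `count` entries."""
    def fetch():
        return [
            {"ref": f"Alarm{i}", "state": "ActiveUnacked", "severity": 500}
            for i in range(count)
        ]
    return fetch


sizes = [
    len(list(query_active_alarms({SCOPE_ALARM_QUERY}, fake_active_set(n))))
    for n in (0, 1, 100)
]
```

Because the scope check runs before the generator is created, a key without `invoke:alarm-query` fails at call time rather than on first read of the stream.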
**What blocks A.3/A.4**: A.2 must deliver `WorkerEvent` bodies on the channel.
A.3/A.4 are pure dispatch wiring once the events arrive.
**Tests to write**:
- A.3 dispatch test — fake worker emits an `AlarmTransition` event; assert
the gateway forwards it on the `StreamEvents` channel of every subscribed
session (mirrors existing `OnDataChange` dispatch tests).
- A.3 AcknowledgeAlarm auth test — existing key without `invoke:alarm-ack`
scope returns `PERMISSION_DENIED`.
- A.4 pagination test — synthetic active-alarm set of 0 / 1 / 100 entries;
assert each streams back as separate `ActiveAlarmSnapshot` messages.
- Integration (parity rig — requires dev box with AVEVA platform):
trigger a real Galaxy alarm, call `QueryActiveAlarms`, assert the alarm
appears in the stream; call `AcknowledgeAlarm`, assert the alarm transitions
to `ActiveAcked` and an `Acknowledge` transition event appears on
`StreamEvents`.
---
## Item C.1 — Historian sidecar: AahClientManagedAlarmEventWriter
**Repo**: `lmxopcua`, `src\Drivers\ZB.MOM.WW.OtOpcUa.Driver.Historian.Wonderware\`
**Depends on**: Architectural decision (the sidecar uses `aahClientManaged`
x64, which is not bitness-constrained like the worker). C.1 is independently
unblockable from A.2 if the goal is to wire up the scripted-alarm historian
path.
**Current state (DONE — code)**:
C.1 shipped. `SdkAlarmHistorianWriteBackend.WriteBatchAsync` writes through the
real SDK entry point — **`HistorianAccess.AddStreamedValue(HistorianEvent, out
HistorianAccessError)`** in `aahClientManaged` — pinned 2026-05-18 by
decompiling the installed SDK. `Program.cs` and `Install-Services.ps1` were
already wired in the PR C.1 scaffolding. Two corrections to the assumptions
this doc was written under:
- **There is no `ArchestrAAlarmsAndEvents.SDK` writer.** That assembly
(`ArchestrAAlarmsAndEvents.SDK.Common.dll`, the only one installed) is a WCF
query-proxy base — no `AlarmHistorianWriter` type. The write path is the
`aahClientManaged` `HistorianAccess` surface.
- **The write path needs its own connection.** The query-side
`HistorianDataSource` opens `ReadOnly` sessions; `AddStreamedValue` on a
read-only session fails with `WriteToReadOnlyFile`.
`SdkAlarmHistorianWriteBackend` opens a dedicated `ReadOnly=false` connection
and shares only `HistorianClusterEndpointPicker` (not the connection object).
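The read-versus-write split can be illustrated with a pure argument builder. This is a Python sketch of the idea only, not the actual `BuildConnectionArgs` helper: the `ConnectionArgs` fields and the `server_name` parameter are hypothetical, and the one detail taken from the source is that the read-only flag defaults to true so the query path stays unchanged.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConnectionArgs:
    """Hypothetical shape of the arguments handed to the SDK connection."""
    server_name: str
    read_only: bool


def build_connection_args(server_name, read_only=True):
    """Pure helper: same inputs always give the same arguments.

    read_only defaults to True so existing query-side callers are unaffected;
    only the alarm write backend passes read_only=False.
    """
    return ConnectionArgs(server_name=server_name, read_only=read_only)


query_args = build_connection_args("historian-node-1")
write_args = build_connection_args("historian-node-1", read_only=False)
```

Keeping the builder free of side effects is what makes the read-versus-write shaping testable without a live Historian.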
**What it needed** (all done):
1. `SdkAlarmHistorianWriteBackend` builds a `HistorianEvent` per
`AlarmHistorianEventDto`, calls `AddStreamedValue`, and maps
`HistorianAccessError.ErrorValue` codes through
`AahClientManagedAlarmEventWriter.MapOutcome` (Ack / PermanentFail /
RetryPlease). `HistorianClusterEndpointPicker` drives multi-node failover.
2. `Program.cs`: `BuildAlarmWriter()` constructs the backend gated behind
`OTOPCUA_HISTORIAN_ALARM_WRITE_ENABLED`.
3. `Install-Services.ps1` — env var present in the install-time block.
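The batch behaviour described in the commit message (connection-class errors abort the batch as RetryPlease and reset the connection; malformed-input codes map to PermanentFail) can be sketched as follows. This is a Python illustration, not the shipped code: the error-code strings are placeholders rather than real `HistorianAccessError.ErrorValue` members, and treating unrecognised codes as RetryPlease is an assumption.

```python
from enum import Enum


class Outcome(Enum):
    ACK = "Ack"
    PERMANENT_FAIL = "PermanentFail"
    RETRY_PLEASE = "RetryPlease"


# Placeholder code names; the real values come from the SDK's error enum.
SUCCESS = "EXAMPLE_SUCCESS"
CONNECTION_CLASS = {"EXAMPLE_CONNECTION_ERROR"}
MALFORMED_INPUT = {"EXAMPLE_MALFORMED_INPUT"}


def classify(code):
    """Map one error code to a per-row outcome."""
    if code == SUCCESS:
        return Outcome.ACK
    if code in MALFORMED_INPUT:
        return Outcome.PERMANENT_FAIL
    return Outcome.RETRY_PLEASE  # assumption: unknown codes are retried


def write_batch(events, add_streamed_value, reset_connection):
    """Return one outcome per event, in input order.

    A connection-class error resets the connection and defers the failing
    event plus everything after it in the batch.
    """
    outcomes = []
    for index, event in enumerate(events):
        code = add_streamed_value(event)
        if code in CONNECTION_CLASS:
            reset_connection()
            outcomes.extend([Outcome.RETRY_PLEASE] * (len(events) - index))
            break
        outcomes.append(classify(code))
    return outcomes


# Usage: the third write hits a connection error, so the fourth is never tried.
codes = {
    "e1": SUCCESS,
    "e2": "EXAMPLE_MALFORMED_INPUT",
    "e3": "EXAMPLE_CONNECTION_ERROR",
    "e4": SUCCESS,
}
attempted, resets = [], []


def fake_write(event):
    attempted.append(event)
    return codes[event]


result = write_batch(["e1", "e2", "e3", "e4"], fake_write, lambda: resets.append(1))
```

Returning a list parallel to the input keeps the caller able to retry only the deferred rows.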
**What remains for C.1**: only the live-rig write smoke — the `Live_*` tests
in `SdkAlarmHistorianWriteBackendTests` stay `Skip`-gated until D.1 confirms a
round-trip against a real AVEVA Historian, including the exact mandatory
`HistorianEvent` field set.
**Tests to write**:
- Outcome-mapping table: every `MxStatus` on alarm-write → expected
`HistorianWriteOutcome`.
- Batch test: 1 / 100 / 1000 events through a fake `aahClientManaged`
writer; assert per-row outcome list parallel to input order.
- Cluster failover: primary Historian node returns `BadCommunicationError`;
picker rotates to secondary; eventual success.
- `Program.cs` seam: assert handler constructed with alarm writer when env
var enabled; without it when disabled.
- Live integration (parity rig): write a synthetic alarm event through the
IPC; query it back via `ReadEvents`; assert round-trip fidelity.
---
## Item D.1 — Smoke artifact
**Repo**: `lmxopcua` (deployment refresh) + `mxaccessgw` (rig verification)
**Depends on**: A.2, A.3, A.4, and C.1 all passing on the dev rig with a live
Galaxy and live Historian.
**Current state**: The deployment script `Refresh-Services.ps1` (task D.1) has
shipped as PR #417 (merged 2026-04-30). What was NOT captured at that time was
a smoke artifact — a log snippet or test output confirming that:
1. An alarm transition event from a live Galaxy alarm reaches lmxopcua's
`AlarmConditionService` via the new `IAlarmSource` path (not the fallback).
2. A scripted-alarm historian write-back reaches AVEVA Historian via the
sidecar `IAlarmEventWriter`.
**What it needs**:
Once A.2, A.3, A.4, and C.1 are wired on the parity rig:
1. Deploy the updated mxaccessgw (with A.2 / A.3 / A.4 changes).
2. Deploy the updated sidecar (with C.1 changes).
3. Run `Refresh-Services.ps1` to confirm clean service restarts.
4. Trigger a Galaxy alarm (e.g. set an AnalogLimitAlarm attribute out of
range in Galaxy IDE).
5. Observe the lmxopcua OPC UA alarm surface via the Client CLI:
```powershell
dotnet run --project src/Client/ZB.MOM.WW.OtOpcUa.Client.CLI -- `
alarms -u opc.tcp://localhost:4840 --subscribe
```
Pass: the alarm condition appears on the OPC UA A&E surface within
2 × publishing interval.
6. Trigger a scripted alarm via the lmxopcua `ScriptedAlarmEngine`
(or an OPC UA method call if one is wired).
7. Confirm in the AVEVA Historian that the scripted alarm event is stored
(query via the Historian client or HistorianWatch tool).
8. Capture log snippets:
- mxaccessgw log: `[INF] AlarmTransition dispatched sessionId=<> alarmRef=<>`
- lmxopcua log: `[INF] AlarmConditionService: IAlarmSource event alarmRef=<> origin=Driver`
- Sidecar log: `[INF] AahClientManagedAlarmEventWriter: Wrote <n> alarm events`
9. Commit the log snippets as `docs/plans/alarms-d1-smoke-artifact.md`
(a new doc, not this one).
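Before committing the artifact, the captured logs can be checked for the three expected lines. The following is a small Python sketch, assuming the log formats listed in step 8 with arbitrary non-space values in the placeholder positions; `missing_markers` and the component keys are illustrative names, and the sample lines in the usage are made up for the check only.

```python
import re

# One pattern per component, built from the log lines listed in step 8.
EXPECTED_MARKERS = {
    "mxaccessgw": re.compile(
        r"\[INF\] AlarmTransition dispatched sessionId=\S+ alarmRef=\S+"
    ),
    "lmxopcua": re.compile(
        r"\[INF\] AlarmConditionService: IAlarmSource event alarmRef=\S+ origin=Driver"
    ),
    "sidecar": re.compile(
        r"\[INF\] AahClientManagedAlarmEventWriter: Wrote \d+ alarm events"
    ),
}


def missing_markers(logs):
    """Return the components whose captured log lacks the expected line.

    `logs` maps a component name to the captured log text for that component.
    """
    return [
        name
        for name, pattern in EXPECTED_MARKERS.items()
        if not pattern.search(logs.get(name, ""))
    ]


# Usage with made-up sample lines.
sample = {
    "mxaccessgw": "[INF] AlarmTransition dispatched sessionId=s1 alarmRef=Tank.Hi",
    "lmxopcua": "[INF] AlarmConditionService: IAlarmSource event alarmRef=Tank.Hi origin=Driver",
}
print(missing_markers(sample))
```

An empty result means all three markers were captured; anything else names the component whose log still needs collecting.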
**What blocks D.1**: all of A.2, A.3, C.1, plus the operator decision on the
x64 alarm-helper architecture (or explicit acceptance of the sub-attribute
fallback as production).
---
## Summary of blocks
| Item | Blocked by | Estimated effort once unblocked |
|------|-----------|--------------------------------|
| A.2 | Architectural decision (x64 alarm-helper vs. sub-attribute fallback as production) | 2-3 days implementation; 1 day tests |
| A.3 | A.2 delivering WorkerEvent bodies | 1-2 days |
| A.4 | A.2 (active-alarm query needs AlarmClient session) | 1 day |
| C.1 | aahClientManaged SDK access (available on dev box); NOT blocked by A.2 | 1-2 days |
| D.1 | A.2 + A.3 + C.1 all passing on parity rig | 0.5 day (smoke + artifact capture) |
C.1 can proceed in parallel with A.2 / A.3 since the sidecar's `aahClientManaged`
is x64 and does not share the worker bitness constraint.
---
## What this plan does NOT cover
- The value-driven sub-attribute fallback path — already shipped and
functional (not being changed).
- Track B (lmxopcua EventPump, GalaxyDriver IAlarmSource re-implementation)
and Track E (client SDK surface refresh) from the alarms-over-gateway plan —
those are in `lmxopcua` and depend on A.3 being live; they follow naturally
once A.3 ships.
- Galaxy-native alarm historian path — System Platform's own `HistorizeToAveva`
toggle on the Galaxy template; not in scope.
- Alarm ACL / role-grant surface — already shipped in Phase 6.2.