docs: alarm-historian write moves from gateway to historian sidecar

Revise the alarms-over-gateway plan based on review feedback:

The gateway is for MxAccess (live data + Galaxy hierarchy); the
Wonderware historian sidecar is for aahClientManaged (time-series +
alarms historian). Two SDKs, two concerns. Routing alarm-historian
write-back through the gateway would force coupling that doesn't need
to exist — the sidecar already has a dormant WriteAlarmEvents IPC slot
ready to wire.

Drop A.5 (gateway WriteHistorianEvent RPC). Add Track C — two PRs in
the historian sidecar that complete the dormant slot:
  C.1 AahClientManagedAlarmEventWriter implementation
  C.2 Program.cs wires the writer into HistorianFrameHandler

B.4 reverses from "delete the IPC slot" to "consume the IPC slot" via
a new SidecarAlarmHistorianWriter on the lmxopcua side.

Also tightens Why-section #3 + D5 to make explicit that the path is
exclusively for non-Galaxy alarm producers (scripted alarms today, AB
CIP ALMD or others future). Galaxy-native alarms reach AVEVA Historian
via System Platform's own HistorizeToAveva toggle, independent of
anything in our stack.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Joseph Doherty
2026-04-30 15:08:58 -04:00
parent 65a5f64931
commit 7367b3e23f

@@ -22,15 +22,17 @@ mxaccessgw's gRPC. In doing so, three v1 capabilities regressed:
API directly from `GalaxyAlarmTracker`. Today, OPC UA acks are written
into the `AckMsgWriteRef` sub-attribute — semantically valid but a
round-trip through the value path that loses operator-comment fidelity.
3. **Alarm-historian write-back path.** `GalaxyHistorianWriter`
implemented `IAlarmHistorianWriter` and forwarded scripted-alarm and
Galaxy-native alarm transitions back to AVEVA Historian via
`aahClientManaged`. PR 7.2 deleted it. `Phase7Composer.ResolveHistorianSink`
now finds no writer and falls back to `NullAlarmHistorianSink`, so
**scripted-alarm transitions queue locally and silently discard.**
(Galaxy-native alarms still reach AVEVA Historian via the Galaxy template's
own `HistorizeToAveva` toggle, independent of our sink — that path
wasn't broken.)
3. **Alarm-historian write-back path for non-Galaxy alarm sources.**
v1's `GalaxyHistorianWriter` implemented `IAlarmHistorianWriter` and
forwarded *scripted-alarm* transitions (and any future non-Galaxy
alarm source — AB CIP ALMD, OpcUaClient A&E, etc.) back to AVEVA
Historian via `aahClientManaged`. PR 7.2 deleted it.
`Phase7Composer.ResolveHistorianSink` now finds no writer and falls
back to `NullAlarmHistorianSink`, so **scripted-alarm transitions
queue locally and silently discard.** Galaxy-native alarms (with
`$Alarm*` extensions) reach AVEVA Historian via System Platform's
own `HistorizeToAveva` toggle on the Galaxy template — that path
was never broken and is not in scope for this epic.
`gateway.md` (mxaccessgw, line 8) explicitly commits the gateway to "full
MXAccess parity… preserve MXAccess behavior first… **native MXAccess event
@@ -128,13 +130,22 @@ only way to surface alarms for those templates. Both paths feed
`AlarmConditionService`. Driver-native events take precedence when both
are present (more authoritative, lower latency).
**D5 — Where the historian writer lives.** As a new RPC on the gateway
(`WriteHistorianEvent`). The Wonderware sidecar's existing
`WriteAlarmEvents` IPC slot stays unwired and is deleted as part of this
epic — the gateway is the canonical place for "write to AVEVA Historian"
since the gateway already owns AVEVA-COM access. This also means the
sidecar (long term) only does *reads* and could potentially retire entirely
if the historian-client REST migration (`docs/plans/...`) lands.
**D5 — Where the historian writer lives.** In the **Wonderware historian
sidecar**, not in the gateway. The sidecar already owns `aahClientManaged`,
already has a `WriteAlarmEvents` IPC slot defined in `Ipc\Contracts.cs`, and
already dispatches to an `IAlarmEventWriter` interface — it's just unwired
in `Program.cs:57`. The gateway is for MxAccess (live data + Galaxy
hierarchy); the historian sidecar is for `aahClientManaged` (time-series +
alarms historian). Two different SDKs, two different concerns; keep the
split. Bonus: completing the sidecar's write path also gives it a clearer
long-term role — once the REST-API migration in `histsdk\instructions.md`
takes over reads, write-back keeps the sidecar relevant rather than
retiring it as a read-only relic. **Galaxy-native alarms bypass this
entirely** — System Platform's own `HistorizeToAveva` toggle on the
Galaxy template publishes them directly. The sidecar write path is
exclusively for non-Galaxy producers (today: scripted alarms; future: AB
CIP ALMD or any other lmxopcua-side alarm source the customer wants
unified into AVEVA Historian).
## Track A — mxaccessgw changes
@@ -276,35 +287,11 @@ is worker-internal.
reconnect returns every alarm currently `Active` or `ActiveAcked` in
the Galaxy.
### PR A.5 — gateway: WriteHistorianEvent RPC for sink write-back
**Files** (`src\MxGateway.Server\`, `src\MxGateway.Worker\`,
`src\MxGateway.Contracts\Protos\mxaccess_gateway.proto`).
1. New RPC `WriteHistorianEvent(WriteHistorianEventRequest) →
WriteHistorianEventReply`. Request carries an
`AlarmHistorianRecord` mirroring the existing
`Core.AlarmHistorian.AlarmHistorianEvent` payload (alarm id,
equipment path, alarm name, alarm-type-name, severity, event kind,
message, user, comment, timestamp).
2. Worker maps the record onto `aahClientManaged`'s alarm-event
write API (the same path v1's `GalaxyHistorianWriter` used). Worker
batches up to N records per write to amortize the COM round-trip.
3. AuthN — new scope `invoke:historian-write`. Cross-cutting with
`invoke:write` — keys for OPC UA servers that publish historian
data must hold both.
**Tests:**
- Worker test: fake `aahClientManaged` writer; assert batching
semantics + retry-on-Bad-status-code behaviour matches v1's
`GalaxyHistorianWriter` (per-row outcome reporting).
- Integration: write a record, query it back via existing Historian
read APIs, assert round-trip fidelity.
**Sequencing within Track A:** A.1 → A.2 → A.3 → A.4 → A.5. A.1 is
**Sequencing within Track A:** A.1 → A.2 → A.3 → A.4. A.1 is
mechanical; A.2 + A.3 are the load-bearing changes that unlock lmxopcua
side. A.4 + A.5 can ship after lmxopcua starts consuming A.3 output.
side. A.4 can ship after lmxopcua starts consuming A.3 output. The
historian-write capability moved to **Track C** below — the gateway
intentionally stays out of `aahClientManaged`.
## Track B — lmxopcua changes
@@ -413,37 +400,45 @@ dispatch).
ack-via-RPC; driver implements only `IWritable` → ack-via-write
(existing path).
### PR B.4 — IAlarmHistorianWriter via gateway
### PR B.4 — IAlarmHistorianWriter via the historian sidecar IPC
**Depends on:** A.5 (`WriteHistorianEvent` RPC available).
**Depends on:** C.2 (sidecar wires its `IAlarmEventWriter`). See Track C
for the sidecar-side work; B.4 is the lmxopcua-side consumer.
**Files:**
- New `src\ZB.MOM.WW.OtOpcUa.Driver.Galaxy\Runtime\GatewayAlarmHistorianWriter.cs`
implementing `IAlarmHistorianWriter`. Calls the gateway RPC from
Track A.5 with the same batch + per-row outcome semantics v1's
`GalaxyHistorianWriter` exposed.
- `GalaxyDriverFactory` registers it as a singleton tied to the
`DriverInstance`.
- New `src\ZB.MOM.WW.OtOpcUa.Driver.Historian.Wonderware.Client\SidecarAlarmHistorianWriter.cs`
implementing `IAlarmHistorianWriter`. Sends batches over the existing
named-pipe IPC using the **already-defined**
`WriteAlarmEventsRequest` / `WriteAlarmEventsReply` contracts at
`Ipc\Contracts.cs:153`. No protocol changes — the slot is wired today
on the contract side; only the production behaviour and the consumer
on this side need to land.
- `Server\Phase7\Phase7Composer.ResolveHistorianSink` — already scans
registered drivers for an `IAlarmHistorianWriter`. Once GalaxyDriver
exposes one, `SqliteStoreAndForwardSink` boots with a real writer
attached and the `NullAlarmHistorianSink` fallback no longer applies
on Galaxy installs.
- Delete `WriteAlarmEventsRequest` / `WriteAlarmEventsReply` /
`IAlarmEventWriter` from the Wonderware sidecar
(`src\ZB.MOM.WW.OtOpcUa.Driver.Historian.Wonderware\Ipc\Contracts.cs`,
`Ipc\HistorianFrameHandler.cs`, `Ipc\Framing.cs`). The historian
sidecar becomes read-only — matches the audit done earlier.
for registered `IAlarmHistorianWriter` instances. Register the new
sidecar-backed writer at server bootstrap when the historian sidecar
is enabled (`appsettings.json` `Historian:Wonderware:Enabled = true`).
`SqliteStoreAndForwardSink` then boots with a real writer attached
and the `NullAlarmHistorianSink` fallback no longer applies on
installs that have the sidecar deployed.
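
The resolution rule in this bullet can be modelled in a few lines. The following is a minimal Python sketch, not the C# in `Phase7Composer`: the class shapes, `registered_writers`, `resolve_historian_sink`, and the `write_batch` method name are stand-ins invented for illustration, named after the types in this plan.

```python
from dataclasses import dataclass
from typing import List, Sequence


class FakeSidecarWriter:
    """Stand-in for SidecarAlarmHistorianWriter; the method name is assumed."""

    def write_batch(self, events: Sequence[dict]) -> List[str]:
        return ["Ack" for _ in events]


@dataclass
class NullAlarmHistorianSink:
    """Fallback sink used when no writer is registered."""


@dataclass
class SqliteStoreAndForwardSink:
    """Store-and-forward sink that drains to a registered writer."""
    writer: object


def registered_writers(settings: dict) -> list:
    """Register the sidecar-backed writer only when the sidecar is enabled."""
    wonderware = settings.get("Historian", {}).get("Wonderware", {})
    return [FakeSidecarWriter()] if wonderware.get("Enabled", False) else []


def resolve_historian_sink(writers: list):
    """Pick the real sink when any writer is registered, else the null sink."""
    if writers:
        return SqliteStoreAndForwardSink(writer=writers[0])
    return NullAlarmHistorianSink()
```

With `Historian:Wonderware:Enabled` unset the model resolves to the null sink. With it set to `true` the store-and-forward sink carries the sidecar writer.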
**Tests:**
- `GatewayAlarmHistorianWriter` against a fake gRPC server — single
record, batch, per-row failure modes (Ack / RetryPlease /
PermanentFail).
- `Phase7Composer` end-to-end — register a Galaxy driver, assert
`ResolveHistorianSink` picks `SqliteStoreAndForwardSink` with the
new writer attached.
- `SidecarAlarmHistorianWriter` against a fake `PipeServer`:
single record, batch, per-row failure modes (Ack / RetryPlease /
PermanentFail) mapped from the sidecar's `PerEventOk[]` reply.
- `Phase7Composer` end-to-end — start the server with the historian
sidecar enabled; assert `ResolveHistorianSink` picks
`SqliteStoreAndForwardSink` with the new sidecar writer attached.
**Note on producer scope:** This path historizes **non-Galaxy alarms
only.** Galaxy-native alarms (with `$Alarm*` extensions) reach AVEVA
Historian directly via System Platform's `HistorizeToAveva` toggle on
the alarm primitive, with no involvement from us. Today the only live
producer feeding `SqliteStoreAndForwardSink` is
`Phase7EngineComposer.RouteToHistorianAsync` for scripted alarms; future
producers (AB CIP ALMD, FOCAS CNC alarms if a customer wants unified
storage) plug into the same path.
### PR B.5 — docs + memory housekeeping
@@ -471,33 +466,99 @@ dispatch).
`✅ Completed YYYY-MM-DD — historical record.` matching the existing
v2-mxgw plan retirement convention.
## Track C — historian sidecar wires the dormant write path
The Wonderware historian sidecar at
`src\ZB.MOM.WW.OtOpcUa.Driver.Historian.Wonderware\` is a separately
deployable Windows service (NSSM-wrapped) that already loads
`aahClientManaged` x64 and serves a named-pipe IPC for read operations.
The `WriteAlarmEvents` IPC slot is defined but unwired (`Program.cs:57`
constructs `HistorianFrameHandler` without an `alarmWriter`). Track C
completes that slot. Two PRs in the sidecar plus one consumer-side PR
(B.4) in lmxopcua finish the path.
### PR C.1 — sidecar: AahClientManagedAlarmEventWriter
**Files** (`src\ZB.MOM.WW.OtOpcUa.Driver.Historian.Wonderware\Backend\`):
1. New `AahClientManagedAlarmEventWriter.cs` implementing the existing
`IAlarmEventWriter` interface (defined in `Ipc\HistorianFrameHandler.cs:242`).
2. Implementation calls `aahClientManaged`'s alarm-event write API —
the same path v1's `GalaxyHistorianWriter` used. Use the existing
`HistorianClusterEndpointPicker` for multi-node routing so write
failures fail over the same way reads do.
3. Batch size + retry behaviour mirrors v1's `GalaxyHistorianWriter`
per-row outcome reporting (`HistorianWriteOutcome` enum: Ack /
PermanentFail / RetryPlease). Map MxStatus codes onto outcomes.
4. Reuses `HistorianDataSource`'s existing connection-pool / health
gating — no new TCP work needed; the same session that serves
reads can issue writes too.
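
The per-row outcome contract in item 3 could look roughly like the Python sketch below. The three outcome names come from this plan. The status strings in the table (other than `BadCommunicationError`, which the tests mention) and the retry default for unknown codes are assumptions made only for illustration; the real table has to be built from the documented status codes.

```python
from enum import Enum
from typing import Dict, List, Sequence


class HistorianWriteOutcome(Enum):
    ACK = "Ack"
    RETRY_PLEASE = "RetryPlease"
    PERMANENT_FAIL = "PermanentFail"


# Hypothetical status names; replace with the documented codes.
STATUS_TO_OUTCOME: Dict[str, HistorianWriteOutcome] = {
    "Good": HistorianWriteOutcome.ACK,
    "BadCommunicationError": HistorianWriteOutcome.RETRY_PLEASE,
    "BadInvalidArgument": HistorianWriteOutcome.PERMANENT_FAIL,
}


def map_outcomes(statuses: Sequence[str]) -> List[HistorianWriteOutcome]:
    """Return one outcome per input row, in input order.

    Unknown statuses are treated as retryable; that default is an assumption.
    """
    return [
        STATUS_TO_OUTCOME.get(status, HistorianWriteOutcome.RETRY_PLEASE)
        for status in statuses
    ]
```

Keeping the result list parallel to the input is what lets the store-and-forward side dead-letter only the rows that failed permanently.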
**Tests** (`tests\ZB.MOM.WW.OtOpcUa.Driver.Historian.Wonderware.Tests\`):
- Outcome-mapping table: every documented MxStatus on alarm-write →
expected `HistorianWriteOutcome`.
- Batching: 1 / 100 / 1000 events through a fake `aahClientManaged`
writer; assert per-row outcome list parallel to input order.
- Cluster failover: primary node returns `BadCommunicationError`;
picker rotates to secondary; assert eventual success.
### PR C.2 — sidecar: wire IAlarmEventWriter into Program.cs
**Files** (`src\ZB.MOM.WW.OtOpcUa.Driver.Historian.Wonderware\Program.cs`):
1. Build an `AahClientManagedAlarmEventWriter` next to the existing
`BuildHistorian()` call.
2. Pass it to `HistorianFrameHandler` (currently constructed at line 57
without an `alarmWriter`). The dispatcher already routes
`WriteAlarmEventsRequest` through `_alarmWriter` when non-null
(`HistorianFrameHandler.cs:158-172`); supplying it makes the slot
functional.
3. Gate behind a new env var `OTOPCUA_HISTORIAN_ALARM_WRITE_ENABLED`
(default `true` when `OTOPCUA_HISTORIAN_ENABLED=true`). Lets a
read-only deployment skip the writer registration if needed.
4. Update `Install-Services.ps1` install-time env block in
lmxopcua's `scripts\install\` to include the new toggle.
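
One way to read the gating rule in item 3 is sketched below in Python. Only the two variable names come from this plan. The accepted truthy spellings, the `_flag` helper, and the choice to force the writer off whenever the historian itself is disabled are assumptions.

```python
from typing import Mapping, Optional

# Accepted truthy spellings are an assumption, not taken from the sidecar.
_TRUE_VALUES = {"1", "true", "yes", "on"}


def _flag(raw: Optional[str], default: bool) -> bool:
    """Parse an env-var string, falling back to default when unset or blank."""
    if raw is None or raw.strip() == "":
        return default
    return raw.strip().lower() in _TRUE_VALUES


def alarm_write_enabled(env: Mapping[str, str]) -> bool:
    """Alarm writes default to on, but only when the historian is enabled."""
    if not _flag(env.get("OTOPCUA_HISTORIAN_ENABLED"), default=False):
        return False
    return _flag(env.get("OTOPCUA_HISTORIAN_ALARM_WRITE_ENABLED"), default=True)
```

Under this reading, a read-only deployment sets `OTOPCUA_HISTORIAN_ALARM_WRITE_ENABLED=false` and leaves everything else untouched.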
**Tests:**
- `Program.cs` unit-test seam: assert handler is constructed with
alarm writer when enabled and without when disabled.
- Live integration (parity rig): write a synthetic alarm event
through the IPC; query it back via `ReadEvents`; assert
round-trip fidelity.
### Sequencing within Track C: C.1 → C.2.
C.2's lmxopcua-side consumer is **PR B.4 in Track B**, which depends
on C.2 being deployed.
## Sequencing matrix
```
Track A (mxaccessgw) Track B (lmxopcua)
───────────────────────── ─────────────────────────
A.1 proto (waits)
├──────────────────────────► B.1 EventPump branch
A.2 worker subscription │ uses proto types only
│ │ unit-testable without live gw
A.3 gateway dispatch + ack RPC ──►B.2 GalaxyDriver : IAlarmSource
│ │
│ ──►B.3 DriverNodeManager routing
A.4 ConditionRefresh │ (B.3 closes the loop with A.4
once ConditionRefresh wired)
A.5 WriteHistorianEvent ─────────►B.4 GatewayAlarmHistorianWriter
│ + sidecar write-path deletion
Track A (mxaccessgw) Track B (lmxopcua) Track C (sidecar)
───────────────────────── ───────────────────────── ─────────────────────────
A.1 proto (waits) C.1 AahClientManagedAlarmEventWriter
│ no cross-repo dep
├──────────────────────────► B.1 EventPump branch
A.2 worker subscription │ uses proto types only
│ │ unit-testable
C.2 Program.cs wires writer
A.3 gateway dispatch + ack RPC ──►B.2 GalaxyDriver : IAlarmSource
│ │
│ ──►B.3 DriverNodeManager routing
A.4 ConditionRefresh │
│ │
B.4 SidecarAlarmHistorianWriter
(depends on C.2 deployed)
──►B.5 docs + memory
```
A.1 + B.1 can land in parallel (B.1's tests use proto types without
needing a running gateway). B.1 stays inert until A.3 ships the gateway
dispatch — which is fine; the dispatch branch is a no-op until events
arrive.
A.1 + B.1 + C.1 can all land in parallel — none have cross-repo runtime
dependencies. B.1's tests use proto types without needing a running
gateway. C.1 is purely sidecar-internal. The gateway-side dispatch (A.3)
gates B.2; the sidecar-side wiring (C.2) gates B.4.
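
The gating rules in this paragraph can be checked mechanically. The Python sketch below encodes only the edges the text names (A.1 → A.2 → A.3 → A.4, C.1 → C.2, A.3 gates B.2, C.2 gates B.4) and groups the PRs into waves that could land together. B.3 and B.5 are left out because their gates are not spelled out here, and `landing_waves` is an illustrative helper, not part of either repo.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set

# Each PR maps to the PRs that must land before it.
GATES: Dict[str, Set[str]] = {
    "A.1": set(),
    "B.1": set(),
    "C.1": set(),
    "A.2": {"A.1"},
    "A.3": {"A.2"},
    "A.4": {"A.3"},
    "B.2": {"A.3"},
    "C.2": {"C.1"},
    "B.4": {"C.2"},
}


def landing_waves(gates: Dict[str, Set[str]]) -> List[List[str]]:
    """Group PRs into waves; every PR in a wave has all of its gates landed."""
    sorter = TopologicalSorter(gates)
    sorter.prepare()
    waves: List[List[str]] = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)
    return waves
```

The first wave is A.1, B.1, and C.1, which matches the claim that those three can land in parallel.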
## Test gates
@@ -538,7 +599,7 @@ exceptions in the AlarmSurfaceInvoker pipeline.
| Gateway adds a new auth scope (`invoke:alarm-ack`); existing keys lack it | PR A.3 + A.5 ship with a one-time bootstrap migration: keys with `invoke:write` get the new scope auto-granted on the dev rig and parity rig. Production keys are reissued via `apikey rotate-key` (existing CLI). |
| Two simultaneous alarm sources (driver-native + sub-attribute) double-fire transitions | PR B.3 dedup is the load-bearing design. End-to-end test #1 covers it explicitly. |
| Historian write-back batch fails mid-batch — partial success | The existing `SqliteStoreAndForwardSink.HistorianWriteOutcome` per-row enum + dead-letter retention already handles this; PR A.5 just exposes the same outcome shape over gRPC. |
| Sidecar write-path deletion in B.4 leaves orphan IPC frames in old client builds | The frame-kind enum is forward-compatible (`MessageKind.WriteAlarmEventsRequest = 0x20`). Old clients sending the request to a new sidecar receive `Unsupported message kind`; new clients never send it. Acceptable — same-version deploy is the existing rollout convention. |
| Sidecar starts honouring the `WriteAlarmEvents` slot — old lmxopcua-side consumers can now reach a previously inert path | The slot returns `Success=false, Error="not configured"` today; flipping to live writes means a build that *speculatively* sent the frame would suddenly start producing real historian rows. Inventory of any such caller is empty — `WriteAlarmEvents` was never invoked from the lmxopcua side; `Phase7EngineComposer.RouteToHistorianAsync` queues into `SqliteStoreAndForwardSink` and the drain worker is gated on `IAlarmHistorianWriter` registration which only the new B.4 path provides. So enabling C.2 without B.4 is safe. |
## Roll-out
@@ -578,40 +639,45 @@ needed); land B.4 last and only after end-of-epic gate is green.
## File inventory (touched)
**mxaccessgw:**
**mxaccessgw (Track A):**
- `src\MxGateway.Contracts\Protos\mxaccess_gateway.proto` (A.1, A.5)
- `src\MxGateway.Contracts\Protos\mxaccess_worker.proto` (A.2, A.4, A.5)
- `src\MxGateway.Contracts\Protos\mxaccess_gateway.proto` (A.1)
- `src\MxGateway.Contracts\Protos\mxaccess_worker.proto` (A.2, A.4)
- `src\MxGateway.Worker\…\Eventing\` (A.2, A.3, A.4)
- `src\MxGateway.Worker\…\Commands\` (A.3, A.4, A.5)
- `src\MxGateway.Worker\…\Commands\` (A.3, A.4)
- `src\MxGateway.Server\Sessions\SessionEventStream.cs` (A.3)
- `src\MxGateway.Server\Rpc\` (A.3, A.4, A.5)
- `src\MxGateway.Server\Auth\Scopes.cs` (A.3, A.4, A.5)
- `src\MxGateway.Server\Rpc\` (A.3, A.4)
- `src\MxGateway.Server\Auth\Scopes.cs` (A.3, A.4)
- `MxGateway.Tests`, `MxGateway.Worker.Tests`, `MxGateway.IntegrationTests`
**lmxopcua:**
**lmxopcua — Galaxy driver + server (Track B):**
- `src\ZB.MOM.WW.OtOpcUa.Driver.Galaxy\Runtime\EventPump.cs` (B.1)
- `src\ZB.MOM.WW.OtOpcUa.Driver.Galaxy\Runtime\MxAccessSeverityMapper.cs` *(new — B.1)*
- `src\ZB.MOM.WW.OtOpcUa.Driver.Galaxy\Runtime\IGalaxyAlarmAcknowledger.cs` *(new — B.2)*
- `src\ZB.MOM.WW.OtOpcUa.Driver.Galaxy\Runtime\GatewayGalaxyAlarmAcknowledger.cs` *(new — B.2)*
- `src\ZB.MOM.WW.OtOpcUa.Driver.Galaxy\Runtime\GatewayAlarmHistorianWriter.cs` *(new — B.4)*
- `src\ZB.MOM.WW.OtOpcUa.Driver.Galaxy\GalaxyDriver.cs` (B.2)
- `src\ZB.MOM.WW.OtOpcUa.Driver.Galaxy\GalaxyDriverFactory.cs` (B.2, B.4)
- `src\ZB.MOM.WW.OtOpcUa.Driver.Galaxy\GalaxyDriverFactory.cs` (B.2)
- `src\ZB.MOM.WW.OtOpcUa.Server\OpcUa\DriverNodeManager.cs` (B.3)
- `src\ZB.MOM.WW.OtOpcUa.Server\Alarms\AlarmConditionService.cs` (B.3)
- `src\ZB.MOM.WW.OtOpcUa.Server\Phase7\Phase7Composer.cs` (B.4)
- `src\ZB.MOM.WW.OtOpcUa.Driver.Historian.Wonderware\Ipc\Contracts.cs` (B.4 — deletions)
- `src\ZB.MOM.WW.OtOpcUa.Driver.Historian.Wonderware\Ipc\HistorianFrameHandler.cs` (B.4 — deletions)
- `src\ZB.MOM.WW.OtOpcUa.Driver.Historian.Wonderware\Ipc\Framing.cs` (B.4 — deletions)
- `src\ZB.MOM.WW.OtOpcUa.Driver.Historian.Wonderware.Client\SidecarAlarmHistorianWriter.cs` *(new — B.4)*
- `tests\ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Tests\Runtime\` (B.1, B.2)
- `tests\ZB.MOM.WW.OtOpcUa.Server.Tests\Alarms\` (B.3)
- `tests\ZB.MOM.WW.OtOpcUa.Driver.Historian.Wonderware.Tests\` (B.4 — drop deleted-contract tests)
- `tests\ZB.MOM.WW.OtOpcUa.Driver.Historian.Wonderware.Client.Tests\` (B.4 — new tests)
- `docs\drivers\Galaxy.md` (B.5)
- `docs\AlarmTracking.md` *(new — B.5)*
- `docs\v1\AlarmTracking.md` (B.5 — banner update)
- `docs\plans\alarms-over-gateway.md` (B.5 — completion banner)
Total: ~12 source files added/modified in mxaccessgw; ~17 in lmxopcua;
~10 test files. Should land in 4-6 weeks of focused work given the
parity-rig dependency for end-to-end validation.
**lmxopcua — Wonderware historian sidecar (Track C):**
- `src\ZB.MOM.WW.OtOpcUa.Driver.Historian.Wonderware\Backend\AahClientManagedAlarmEventWriter.cs` *(new — C.1)*
- `src\ZB.MOM.WW.OtOpcUa.Driver.Historian.Wonderware\Program.cs` (C.2 — wire writer)
- `scripts\install\Install-Services.ps1` (C.2 — env-var toggle for write-enable)
- `tests\ZB.MOM.WW.OtOpcUa.Driver.Historian.Wonderware.Tests\` (C.1 — outcome mapping + batch + cluster failover)
Total: ~10 source files added/modified in mxaccessgw; ~14 in lmxopcua
proper; ~3 in the historian sidecar; ~12 test files across all repos.
Should land in 4-6 weeks of focused work given the parity-rig dependency
for end-to-end validation.