From 7367b3e23f32f2c736e3cfd2b4a77610bf530525 Mon Sep 17 00:00:00 2001 From: Joseph Doherty Date: Thu, 30 Apr 2026 15:08:58 -0400 Subject: [PATCH] docs: alarm-historian write moves from gateway to historian sidecar MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Revise the alarms-over-gateway plan based on review feedback: The gateway is for MxAccess (live data + Galaxy hierarchy); the Wonderware historian sidecar is for aahClientManaged (time-series + alarms historian). Two SDKs, two concerns. Routing alarm-historian write-back through the gateway would force coupling that doesn't need to exist — the sidecar already has a dormant WriteAlarmEvents IPC slot ready to wire. Drop A.5 (gateway WriteHistorianEvent RPC). Add Track C — two PRs in the historian sidecar that complete the dormant slot: C.1 AahClientManagedAlarmEventWriter implementation C.2 Program.cs wires the writer into HistorianFrameHandler B.4 reverses from "delete the IPC slot" to "consume the IPC slot" via a new SidecarAlarmHistorianWriter on the lmxopcua side. Also tightens Why-section #3 + D5 to make explicit that the path is exclusively for non-Galaxy alarm producers (scripted alarms today, AB CIP ALMD or others future). Galaxy-native alarms reach AVEVA Historian via System Platform's own HistorizeToAveva toggle, independent of anything in our stack. Co-Authored-By: Claude Opus 4.7 (1M context) --- docs/plans/alarms-over-gateway.md | 276 ++++++++++++++++++------------ 1 file changed, 171 insertions(+), 105 deletions(-) diff --git a/docs/plans/alarms-over-gateway.md b/docs/plans/alarms-over-gateway.md index dfb870e..ce16f2b 100644 --- a/docs/plans/alarms-over-gateway.md +++ b/docs/plans/alarms-over-gateway.md @@ -22,15 +22,17 @@ mxaccessgw's gRPC. In doing so, three v1 capabilities regressed: API directly from `GalaxyAlarmTracker`. 
Today, OPC UA acks are written into the `AckMsgWriteRef` sub-attribute — semantically valid but a round-trip through the value path that loses operator-comment fidelity. -3. **Alarm-historian write-back path** — `GalaxyHistorianWriter` - implemented `IAlarmHistorianWriter` and forwarded scripted-alarm and - Galaxy-native alarm transitions back to AVEVA Historian via - `aahClientManaged`. PR 7.2 deleted it. `Phase7Composer.ResolveHistorianSink` - now finds no writer and falls back to `NullAlarmHistorianSink`, so - **scripted-alarm transitions queue locally and silently discard.** - (Galaxy-native alarms still reach AVEVA Historian via the Galaxy template's - own `HistorizeToAveva` toggle, independent of our sink — that path - wasn't broken.) +3. **Alarm-historian write-back path for non-Galaxy alarm sources.** + v1's `GalaxyHistorianWriter` implemented `IAlarmHistorianWriter` and + forwarded *scripted-alarm* transitions (and any future non-Galaxy + alarm source — AB CIP ALMD, OpcUaClient A&E, etc.) back to AVEVA + Historian via `aahClientManaged`. PR 7.2 deleted it. + `Phase7Composer.ResolveHistorianSink` now finds no writer and falls + back to `NullAlarmHistorianSink`, so **scripted-alarm transitions + queue locally and silently discard.** Galaxy-native alarms (with + `$Alarm*` extensions) reach AVEVA Historian via System Platform's + own `HistorizeToAveva` toggle on the Galaxy template — that path + was never broken and is not in scope for this epic. `gateway.md` (mxaccessgw, line 8) explicitly commits the gateway to "full MXAccess parity… preserve MXAccess behavior first… **native MXAccess event @@ -128,13 +130,22 @@ only way to surface alarms for those templates. Both paths feed `AlarmConditionService`. Driver-native events take precedence when both are present (more authoritative, lower latency). -**D5 — Where the historian writer lives.** As a new RPC on the gateway -(`WriteHistorianEvent`). 
The Wonderware sidecar's existing -`WriteAlarmEvents` IPC slot stays unwired and is deleted as part of this -epic — the gateway is the canonical place for "write to AVEVA Historian" -since the gateway already owns AVEVA-COM access. This also means the -sidecar (long term) only does *reads* and could potentially retire entirely -if the historian-client REST migration (`docs/plans/...`) lands. +**D5 — Where the historian writer lives.** In the **Wonderware historian +sidecar**, not in the gateway. The sidecar already owns `aahClientManaged`, +already has a `WriteAlarmEvents` IPC slot defined in `Ipc/Contracts.cs`, and +already dispatches to an `IAlarmEventWriter` interface — it's just unwired +in `Program.cs:57`. The gateway is for MxAccess (live data + Galaxy +hierarchy); the historian sidecar is for `aahClientManaged` (time-series + +alarms historian). Two different SDKs, two different concerns; keep the +split. Bonus: completing the sidecar's write path also gives it a clearer +long-term role — once the REST-API migration in `histsdk\instructions.md` +takes over reads, write-back keeps the sidecar relevant rather than +retiring it as a read-only relic. **Galaxy-native alarms bypass this +entirely** — System Platform's own `HistorizeToAveva` toggle on the +Galaxy template publishes them directly. The sidecar write path is +exclusively for non-Galaxy producers (today: scripted alarms; future: AB +CIP ALMD or any other lmxopcua-side alarm source the customer wants +unified into AVEVA Historian). ## Track A — mxaccessgw changes @@ -276,35 +287,11 @@ is worker-internal. reconnect returns every alarm currently `Active` or `ActiveAcked` in the Galaxy. -### PR A.5 — gateway: WriteHistorianEvent RPC for sink write-back - -**Files** (`src\MxGateway.Server\`, `src\MxGateway.Worker\`, -`src\MxGateway.Contracts\Protos\mxaccess_gateway.proto`). - -1. New RPC `WriteHistorianEvent(WriteHistorianEventRequest) → - WriteHistorianEventReply`. 
Request carries an - `AlarmHistorianRecord` mirroring the existing - `Core.AlarmHistorian.AlarmHistorianEvent` payload (alarm id, - equipment path, alarm name, alarm-type-name, severity, event kind, - message, user, comment, timestamp). -2. Worker maps the record onto `aahClientManaged`'s alarm-event - write API (the same path v1's `GalaxyHistorianWriter` used). Worker - batches up to N records per write to amortize the COM round-trip. -3. AuthN — new scope `invoke:historian-write`. Cross-cutting with - `invoke:write` — keys for OPC UA servers that publish historian - data must hold both. - -**Tests:** - -- Worker test: fake `aahClientManaged` writer; assert batching - semantics + retry-on-Bad-status-code behaviour matches v1's - `GalaxyHistorianWriter` (per-row outcome reporting). -- Integration: write a record, query it back via existing Historian - read APIs, assert round-trip fidelity. - -**Sequencing within Track A:** A.1 → A.2 → A.3 → A.4 → A.5. A.1 is +**Sequencing within Track A:** A.1 → A.2 → A.3 → A.4. A.1 is mechanical; A.2 + A.3 are the load-bearing changes that unlock lmxopcua -side. A.4 + A.5 can ship after lmxopcua starts consuming A.3 output. +side. A.4 can ship after lmxopcua starts consuming A.3 output. The +historian-write capability moved to **Track C** below — the gateway +intentionally stays out of `aahClientManaged`. ## Track B — lmxopcua changes @@ -413,37 +400,45 @@ dispatch). ack-via-RPC; driver implements only `IWritable` → ack-via-write (existing path). -### PR B.4 — IAlarmHistorianWriter via gateway +### PR B.4 — IAlarmHistorianWriter via the historian sidecar IPC -**Depends on:** A.5 (`WriteHistorianEvent` RPC available). +**Depends on:** C.2 (sidecar wires its `IAlarmEventWriter`). See Track C +for the sidecar-side work; B.4 is the lmxopcua-side consumer. **Files:** -- New `src\ZB.MOM.WW.OtOpcUa.Driver.Galaxy\Runtime\GatewayAlarmHistorianWriter.cs` - implementing `IAlarmHistorianWriter`. 
Calls the gateway RPC from - Track A.5 with the same batch + per-row outcome semantics v1's - `GalaxyHistorianWriter` exposed. -- `GalaxyDriverFactory` registers it as a singleton tied to the - `DriverInstance`. +- New `src\ZB.MOM.WW.OtOpcUa.Driver.Historian.Wonderware.Client\SidecarAlarmHistorianWriter.cs` + implementing `IAlarmHistorianWriter`. Sends batches over the existing + named-pipe IPC using the **already-defined** + `WriteAlarmEventsRequest` / `WriteAlarmEventsReply` contracts at + `Ipc\Contracts.cs:153`. No protocol changes — the slot is wired today + on the contract side; only the production behaviour and the consumer + on this side need to land. - `Server\Phase7\Phase7Composer.ResolveHistorianSink` — already scans - registered drivers for an `IAlarmHistorianWriter`. Once GalaxyDriver - exposes one, `SqliteStoreAndForwardSink` boots with a real writer - attached and the `NullAlarmHistorianSink` fallback no longer applies - on Galaxy installs. -- Delete `WriteAlarmEventsRequest` / `WriteAlarmEventsReply` / - `IAlarmEventWriter` from the Wonderware sidecar - (`src\ZB.MOM.WW.OtOpcUa.Driver.Historian.Wonderware\Ipc\Contracts.cs`, - `Ipc\HistorianFrameHandler.cs`, `Ipc\Framing.cs`). The historian - sidecar becomes read-only — matches the audit done earlier. + for registered `IAlarmHistorianWriter` instances. Register the new + sidecar-backed writer at server bootstrap when the historian sidecar + is enabled (`appsettings.json` `Historian:Wonderware:Enabled = true`). + `SqliteStoreAndForwardSink` then boots with a real writer attached + and the `NullAlarmHistorianSink` fallback no longer applies on + installs that have the sidecar deployed. **Tests:** -- `GatewayAlarmHistorianWriter` against a fake gRPC server — single - record, batch, per-row failure modes (Ack / RetryPlease / - PermanentFail). -- `Phase7Composer` end-to-end — register a Galaxy driver, assert - `ResolveHistorianSink` picks `SqliteStoreAndForwardSink` with the - new writer attached. 
+- `SidecarAlarmHistorianWriter` against a fake `PipeServer` — + single record, batch, per-row failure modes (Ack / RetryPlease / + PermanentFail) mapped from the sidecar's `PerEventOk[]` reply. +- `Phase7Composer` end-to-end — start the server with the historian + sidecar enabled; assert `ResolveHistorianSink` picks + `SqliteStoreAndForwardSink` with the new sidecar writer attached. + +**Note on producer scope:** This path historizes **non-Galaxy alarms +only.** Galaxy-native alarms (with `$Alarm*` extensions) reach AVEVA +Historian directly via System Platform's `HistorizeToAveva` toggle on +the alarm primitive, with no involvement from us. Today the only live +producer feeding `SqliteStoreAndForwardSink` is +`Phase7EngineComposer.RouteToHistorianAsync` for scripted alarms; future +producers (AB CIP ALMD, FOCAS CNC alarms if a customer wants unified +storage) plug into the same path. ### PR B.5 — docs + memory housekeeping @@ -471,33 +466,99 @@ dispatch). `✅ Completed YYYY-MM-DD — historical record.` matching the existing v2-mxgw plan retirement convention. +## Track C — historian sidecar wires the dormant write path + +The Wonderware historian sidecar at +`src\ZB.MOM.WW.OtOpcUa.Driver.Historian.Wonderware\` is a separately +deployable Windows service (NSSM-wrapped) that already loads +`aahClientManaged` x64 and serves a named-pipe IPC for read operations. +The `WriteAlarmEvents` IPC slot is defined but unwired (`Program.cs:57` +constructs `HistorianFrameHandler` without an `alarmWriter`). Track C +completes that slot. Two PRs in the sidecar + one consumer-side PR +(B.4) in lmxopcua finish the path. + +### PR C.1 — sidecar: AahClientManagedAlarmEventWriter + +**Files** (`src\ZB.MOM.WW.OtOpcUa.Driver.Historian.Wonderware\Backend\`): + +1. New `AahClientManagedAlarmEventWriter.cs` implementing the existing + `IAlarmEventWriter` interface (defined in `Ipc\HistorianFrameHandler.cs:242`). +2.
Implementation calls `aahClientManaged`'s alarm-event write API — + the same path v1's `GalaxyHistorianWriter` used. Use the existing + `HistorianClusterEndpointPicker` for multi-node routing so write + failures fail over the same way reads do. +3. Batch size + retry behaviour mirrors v1's `GalaxyHistorianWriter` + per-row outcome reporting (`HistorianWriteOutcome` enum: Ack / + PermanentFail / RetryPlease). Map MxStatus codes onto outcomes. +4. Reuses `HistorianDataSource`'s existing connection-pool / health + gating — no new TCP work needed; the same session that serves + reads can issue writes too. + +**Tests** (`tests\ZB.MOM.WW.OtOpcUa.Driver.Historian.Wonderware.Tests\`): + +- Outcome-mapping table: every documented MxStatus on alarm-write → + expected `HistorianWriteOutcome`. +- Batching: 1 / 100 / 1000 events through a fake `aahClientManaged` + writer; assert per-row outcome list parallel to input order. +- Cluster failover: primary node returns `BadCommunicationError`; + picker rotates to secondary; assert eventual success. + +### PR C.2 — sidecar: wire IAlarmEventWriter into Program.cs + +**Files** (`src\ZB.MOM.WW.OtOpcUa.Driver.Historian.Wonderware\Program.cs`): + +1. Build an `AahClientManagedAlarmEventWriter` next to the existing + `BuildHistorian()` call. +2. Pass it to `HistorianFrameHandler` (currently constructed at line 57 + without an `alarmWriter`). The dispatcher already routes + `WriteAlarmEventsRequest` through `_alarmWriter` when non-null + (`HistorianFrameHandler.cs:158-172`); supplying it makes the slot + functional. +3. Gate behind a new env var `OTOPCUA_HISTORIAN_ALARM_WRITE_ENABLED` + (default `true` when `OTOPCUA_HISTORIAN_ENABLED=true`). Lets a + read-only deployment skip the writer registration if needed. +4. Update `Install-Services.ps1` install-time env block in + lmxopcua's `scripts\install\` to include the new toggle. 
+ +**Tests:** + +- `Program.cs` unit-test seam: assert handler is constructed with + alarm writer when enabled and without when disabled. +- Live integration (parity rig): write a synthetic alarm event + through the IPC; query it back via `ReadEvents`; assert + round-trip fidelity. + +### Sequencing within Track C: C.1 → C.2. + +C.2's lmxopcua-side consumer is **PR B.4 in Track B**, which depends +on C.2 being deployed. + ## Sequencing matrix ``` -Track A (mxaccessgw) Track B (lmxopcua) -───────────────────────── ───────────────────────── -A.1 proto (waits) - │ - ├──────────────────────────► B.1 EventPump branch -A.2 worker subscription │ uses proto types only - │ │ unit-testable without live gw - │ -A.3 gateway dispatch + ack RPC ──►B.2 GalaxyDriver : IAlarmSource - │ │ - │ ──►B.3 DriverNodeManager routing - │ -A.4 ConditionRefresh │ (B.3 closes the loop with A.4 - │ once ConditionRefresh wired) - │ -A.5 WriteHistorianEvent ─────────►B.4 GatewayAlarmHistorianWriter - │ + sidecar write-path deletion +Track A (mxaccessgw) Track B (lmxopcua) Track C (sidecar) +───────────────────────── ───────────────────────── ───────────────────────── +A.1 proto (waits) C.1 AahClientManagedAlarmEventWriter + │ │ no cross-repo dep + ├──────────────────────────► B.1 EventPump branch │ +A.2 worker subscription │ uses proto types only │ + │ │ unit-testable │ + │ C.2 Program.cs wires writer +A.3 gateway dispatch + ack RPC ──►B.2 GalaxyDriver : IAlarmSource │ + │ │ │ + │ ──►B.3 DriverNodeManager routing │ + │ │ +A.4 ConditionRefresh │ │ + │ │ + B.4 SidecarAlarmHistorianWriter + (depends on C.2 deployed) ──►B.5 docs + memory ``` -A.1 + B.1 can land in parallel (B.1's tests use proto types without -needing a running gateway). B.1 stays inert until A.3 ships the gateway -dispatch — which is fine; the dispatch branch is a no-op until events -arrive. +A.1 + B.1 + C.1 can all land in parallel — none have cross-repo runtime +dependencies. 
B.1's tests use proto types without needing a running +gateway. C.1 is purely sidecar-internal. The gateway-side dispatch (A.3) +gates B.2; the sidecar-side wiring (C.2) gates B.4. ## Test gates @@ -538,7 +599,7 @@ exceptions in the AlarmSurfaceInvoker pipeline. | Gateway adds a new auth scope (`invoke:alarm-ack`); existing keys lack it | PR A.3 + A.5 ship with a one-time bootstrap migration: keys with `invoke:write` get the new scope auto-granted on the dev rig and parity rig. Production keys are reissued via `apikey rotate-key` (existing CLI). | | Two simultaneous alarm sources (driver-native + sub-attribute) double-fire transitions | PR B.3 dedup is the load-bearing design. End-to-end test #1 covers it explicitly. | | Historian write-back batch fails mid-batch — partial success | The existing `SqliteStoreAndForwardSink.HistorianWriteOutcome` per-row enum + dead-letter retention already handles this; PR A.5 just exposes the same outcome shape over gRPC. | -| Sidecar write-path deletion in B.4 leaves orphan IPC frames in old client builds | The frame-kind enum is forward-compatible (`MessageKind.WriteAlarmEventsRequest = 0x20`). Old clients sending the request to a new sidecar receive `Unsupported message kind`; new clients never send it. Acceptable — same-version deploy is the existing rollout convention. | +| Sidecar starts honouring the `WriteAlarmEvents` slot — old lmxopcua-side consumers can now reach a previously inert path | The slot returns `Success=false, Error="not configured"` today; flipping to live writes means a build that *speculatively* sent the frame would suddenly start producing real historian rows. Inventory of any such caller is empty — `WriteAlarmEvents` was never invoked from the lmxopcua side; `Phase7EngineComposer.RouteToHistorianAsync` queues into `SqliteStoreAndForwardSink` and the drain worker is gated on `IAlarmHistorianWriter` registration which only the new B.4 path provides. So enabling C.2 without B.4 is safe. 
| ## Roll-out @@ -578,40 +639,45 @@ needed); land B.4 last and only after end-of-epic gate is green. ## File inventory (touched) -**mxaccessgw:** +**mxaccessgw (Track A):** -- `src\MxGateway.Contracts\Protos\mxaccess_gateway.proto` (A.1, A.5) -- `src\MxGateway.Contracts\Protos\mxaccess_worker.proto` (A.2, A.4, A.5) +- `src\MxGateway.Contracts\Protos\mxaccess_gateway.proto` (A.1) +- `src\MxGateway.Contracts\Protos\mxaccess_worker.proto` (A.2, A.4) - `src\MxGateway.Worker\…\Eventing\` (A.2, A.3, A.4) -- `src\MxGateway.Worker\…\Commands\` (A.3, A.4, A.5) +- `src\MxGateway.Worker\…\Commands\` (A.3, A.4) - `src\MxGateway.Server\Sessions\SessionEventStream.cs` (A.3) -- `src\MxGateway.Server\Rpc\` (A.3, A.4, A.5) -- `src\MxGateway.Server\Auth\Scopes.cs` (A.3, A.4, A.5) +- `src\MxGateway.Server\Rpc\` (A.3, A.4) +- `src\MxGateway.Server\Auth\Scopes.cs` (A.3, A.4) - `MxGateway.Tests`, `MxGateway.Worker.Tests`, `MxGateway.IntegrationTests` -**lmxopcua:** +**lmxopcua — Galaxy driver + server (Track B):** - `src\ZB.MOM.WW.OtOpcUa.Driver.Galaxy\Runtime\EventPump.cs` (B.1) - `src\ZB.MOM.WW.OtOpcUa.Driver.Galaxy\Runtime\MxAccessSeverityMapper.cs` *(new — B.1)* - `src\ZB.MOM.WW.OtOpcUa.Driver.Galaxy\Runtime\IGalaxyAlarmAcknowledger.cs` *(new — B.2)* - `src\ZB.MOM.WW.OtOpcUa.Driver.Galaxy\Runtime\GatewayGalaxyAlarmAcknowledger.cs` *(new — B.2)* -- `src\ZB.MOM.WW.OtOpcUa.Driver.Galaxy\Runtime\GatewayAlarmHistorianWriter.cs` *(new — B.4)* - `src\ZB.MOM.WW.OtOpcUa.Driver.Galaxy\GalaxyDriver.cs` (B.2) -- `src\ZB.MOM.WW.OtOpcUa.Driver.Galaxy\GalaxyDriverFactory.cs` (B.2, B.4) +- `src\ZB.MOM.WW.OtOpcUa.Driver.Galaxy\GalaxyDriverFactory.cs` (B.2) - `src\ZB.MOM.WW.OtOpcUa.Server\OpcUa\DriverNodeManager.cs` (B.3) - `src\ZB.MOM.WW.OtOpcUa.Server\Alarms\AlarmConditionService.cs` (B.3) - `src\ZB.MOM.WW.OtOpcUa.Server\Phase7\Phase7Composer.cs` (B.4) -- `src\ZB.MOM.WW.OtOpcUa.Driver.Historian.Wonderware\Ipc\Contracts.cs` (B.4 — deletions) -- 
`src\ZB.MOM.WW.OtOpcUa.Driver.Historian.Wonderware\Ipc\HistorianFrameHandler.cs` (B.4 — deletions) -- `src\ZB.MOM.WW.OtOpcUa.Driver.Historian.Wonderware\Ipc\Framing.cs` (B.4 — deletions) +- `src\ZB.MOM.WW.OtOpcUa.Driver.Historian.Wonderware.Client\SidecarAlarmHistorianWriter.cs` *(new — B.4)* - `tests\ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Tests\Runtime\` (B.1, B.2) - `tests\ZB.MOM.WW.OtOpcUa.Server.Tests\Alarms\` (B.3) -- `tests\ZB.MOM.WW.OtOpcUa.Driver.Historian.Wonderware.Tests\` (B.4 — drop deleted-contract tests) +- `tests\ZB.MOM.WW.OtOpcUa.Driver.Historian.Wonderware.Client.Tests\` (B.4 — new tests) - `docs\drivers\Galaxy.md` (B.5) - `docs\AlarmTracking.md` *(new — B.5)* - `docs\v1\AlarmTracking.md` (B.5 — banner update) - `docs\plans\alarms-over-gateway.md` (B.5 — completion banner) -Total: ~12 source files added/modified in mxaccessgw; ~17 in lmxopcua; -~10 test files. Should land in 4-6 weeks of focused work given the -parity-rig dependency for end-to-end validation. +**lmxopcua — Wonderware historian sidecar (Track C):** + +- `src\ZB.MOM.WW.OtOpcUa.Driver.Historian.Wonderware\Backend\AahClientManagedAlarmEventWriter.cs` *(new — C.1)* +- `src\ZB.MOM.WW.OtOpcUa.Driver.Historian.Wonderware\Program.cs` (C.2 — wire writer) +- `scripts\install\Install-Services.ps1` (C.2 — env-var toggle for write-enable) +- `tests\ZB.MOM.WW.OtOpcUa.Driver.Historian.Wonderware.Tests\` (C.1 — outcome mapping + batch + cluster failover) + +Total: ~10 source files added/modified in mxaccessgw; ~14 in lmxopcua +proper; ~3 in the historian sidecar; ~12 test files across all repos. +Should land in 4-6 weeks of focused work given the parity-rig dependency +for end-to-end validation.