docs(alarms): Phase 3 design — OPC UA standards completeness (H4 Enable/Disable + H2 HistoryUpdate bit + H6 native-ack→AVEVA)

This commit is contained in:
Joseph Doherty
2026-06-15 13:59:28 -04:00
parent 4af8e65af1
commit 40b883effe
@@ -0,0 +1,173 @@
# Phase 3 — OPC UA standards completeness (H4 + H2-bit + H6) — design
> **Status:** approved 2026-06-15. Parent roadmap: `docs/plans/2026-06-15-stillpending-backlog-design.md`
> (Phase 3). Backlog items **H4 + H2 + H6** in `stillpending.md` §1.
> **Branch:** `feat/stillpending-phase-3-opcua-standards` off master `4af8e65a`.
> **Classification:** high-risk (touches the OPC UA node-manager method seam, the alarm command path,
> the driver inbound-ack route, and the authorization permission model).
## Problem
Three OPC UA standards gaps remain in the alarm + authorization surface:
- **H4 — alarm `Enable`/`Disable` not callable over OPC UA.** The node manager wires
`OnAcknowledge`/`OnConfirm`/`OnAddComment`/`OnShelve`/`OnTimedUnshelve` but **not** `OnEnableDisable`
(`AlarmCommand.cs` itself flags it as "a future task"). The engine side is **already done**
(`ScriptedAlarmEngine.EnableAsync`/`DisableAsync`, dispatched at `ScriptedAlarmHostActor.cs:408-412`) and
the `AlarmCommand` vocabulary already includes `Enable`/`Disable` — only the node-manager delegate is missing.
- **H2 (permission bit only) — `NodePermissions.HistoryUpdate` missing.** The flags enum has no
`HistoryUpdate` member, so `TriePermissionEvaluator.cs:86` maps `OpcUaOperation.HistoryUpdate → the
HistoryRead bit`. A read-only HistoryRead grant would therefore also authorize a future HistoryUpdate. The
full HistoryUpdate **service** stays **deferred** (infra-gated — no backend insert/replace/delete RPC).
- **H6 — inbound ack of a NATIVE (Galaxy) alarm never reaches the driver / AVEVA.**
`MaterialiseAlarmCondition` is the **single shared** materializer; both native and scripted conditions get
`OnAcknowledge` wired to the **scripted** `AlarmCommandRouter``ScriptedAlarmHostActor` → engine. For a
native alarm the engine doesn't own it, so the ack updates only the local `AlarmConditionState` and never
flows through `IAlarmSource.AcknowledgeAsync` → AVEVA. There is currently **no** inbound path from an OPC UA
ack to a driver. (The authenticated principal is *already* carried on the scripted path —
`AlarmCommand.User = identity.DisplayName`; the generic `"opcua-client"` lives only in the raw
`IAlarmSource` fallback in `ScriptedAlarmSource.cs:75`.)
## Goal
Make an OPC UA client able to Enable/Disable a Part 9 condition (H4); give HistoryUpdate its own permission
bit so a HistoryRead grant can't authorize it (H2-bit); and route a native-condition Acknowledge to the
owning driver (→ AVEVA) carrying the authenticated principal (H6). No EF migration; the only contract
touches are additive (a sink-seam bool, a Core.Abstractions record field).
## Locked decisions (brainstorming 2026-06-15)
1. **H6 native-ack routing = a new `NativeAlarmAckRouter` seam** on the node manager (a second `Action`,
mirroring `AlarmCommandRouter`). Native conditions wire `OnAcknowledge` to it; scripted conditions keep the
existing router. Rejected: a single router disambiguated downstream (couples the scripted host to native
routing).
2. **H4 scoped to scripted alarms only.** Enable/Disable has clear engine semantics there. Native-alarm
Enable/Disable is deferred — the driver has no enable/disable surface; a native condition's
`OnEnableDisable` returns `BadNotSupported` (honest) rather than a silent no-op.
3. **H6 live-provability accepted as code+unit-proven locally; the AVEVA round-trip is operator-gated.** The
route is driver-agnostic via `IAlarmSource`, verified with **Galaxy** (the only real native-alarm source,
with `GatewayGalaxyAlarmAcknowledger`). The gateway ack→AVEVA commit needs the real gateway
(`10.100.0.48`), the same infra-gating class as H5's durable sink / the Galaxy live-gw smokes.
4. **Sequencing:** all three this phase, increasing risk — **H2-bit → H4 → H6.**
## Architecture
### H2-bit (small, independent)
- `NodePermissions`: add `HistoryUpdate = 1 << 12` (next free bit after `MethodCall = 1 << 11`). Optionally
fold into the `Admin` composite (write-side history op); leave `Engineer`/`Operator` without it.
- `TriePermissionEvaluator.MapOperationToPermission`: `OpcUaOperation.HistoryUpdate => NodePermissions.HistoryUpdate`
(was `HistoryRead`). Drop the stale TODO comment.
- Pure mapping change — fully unit-testable; the HistoryUpdate **service** remains unimplemented (a client
calling HistoryUpdate still gets the SDK's default reject — unchanged).
### H4 (small-medium) — wire `OnEnableDisable`
- In `MaterialiseAlarmCondition`, for **scripted** conditions wire
`alarm.OnEnableDisable = (context, condition, enabling) => HandleAlarmCommand(context, condition,
enabling ? "Enable" : "Disable", comment: null, unshelveAt: null);` — reusing the existing AlarmAck-gated
`HandleAlarmCommand` (which builds `AlarmCommand(Enable|Disable)` and routes via `AlarmCommandRouter`).
`ScriptedAlarmHostActor:408-412` already dispatches those to `ScriptedAlarmEngine.Enable/DisableAsync`.
- For **native** conditions, `OnEnableDisable` returns `BadNotSupported` (decision #2).
- Verify the SDK `OnEnableDisable` delegate's exact signature against the decompiled
`AlarmConditionState`/`ConditionState` (the `OnShelve` wiring was verified the same way) and that the
enable-state round-trips to the node (the engine re-projects via `WriteAlarmCondition`, whose delta-gate
suppresses the double-event — same mechanism the ack path documents at `:591`).
- **Authz:** AlarmAck (consistent with the other Part 9 alarm commands).
### H6 (the meaty one) — native ack → driver → AVEVA + principal
Four moving parts:
1. **Mark native conditions at materialization.** Add `bool isNative` to
`IOpcUaAddressSpaceSink.MaterialiseAlarmCondition` (+ `DeferredAddressSpaceSink`, `SdkAddressSpaceSink`,
`NullOpcUaAddressSpaceSink`, `OtOpcUaNodeManager`). `Phase7Applier` already knows which is which: pass
`isNative: true` at the native equipment-tag-alarm site (`:204`) and `false` at the scripted site (`:295`).
The node manager tracks native node ids (e.g. a `_nativeAlarmNodeIds` set) and wires their `OnAcknowledge`
(and `OnEnableDisable`) accordingly.
2. **`NativeAlarmAckRouter` seam.** A new `Action<NativeAlarmAck>?` on the node manager (mirrors
`AlarmCommandRouter`). A native condition's `OnAcknowledge` → resolve the AlarmAck-gated principal the same
way `HandleAlarmCommand` does (fail-closed) → invoke the router with
`NativeAlarmAck(ConditionNodeId, Comment, OperatorUser = identity.DisplayName)` → return `Good`.
3. **Reverse map in `DriverHostActor`.** `DriverHostActor` already builds the forward map
`_alarmNodeIdByDriverRef: (DriverInstanceId, FullName) → condition NodeId(s)` (`:127`, rebuilt at `:902`).
Build the **inverse** in the same clear-and-rebuild pass: `conditionNodeId → (DriverInstanceId, FullName)`.
The host wires the `NativeAlarmAckRouter` to a fire-and-forget `Tell` into `DriverHostActor`; on receipt it
looks up the inverse map → `(driverId, alarmRef)` → routes to the `DriverInstanceActor`
`driver.AcknowledgeAsync([ new AlarmAcknowledgeRequest(SourceNodeId, ConditionId: alarmRef, Comment,
OperatorUser) ], ct)`.
4. **Carry the principal to the driver.** `AlarmAcknowledgeRequest(SourceNodeId, ConditionId, Comment?)` gains
`string? OperatorUser` (the field the base-interface comment already anticipated — "Stream G provides the
authenticated principal"). `GalaxyDriver.AcknowledgeAsync` passes it to
`IGalaxyAlarmAcknowledger.AcknowledgeAsync(alarmFullReference, comment, operatorUser, ct)` → gateway → AVEVA.
Also tidy `ScriptedAlarmSource.cs:75`'s `"opcua-client"` default to honor a supplied principal.
Native-alarm-source drivers that don't implement a real acknowledger (AbCip/FOCAS projections) accept the
request as today (no behavioral regression); only Galaxy commits to its upstream.
```
OPC UA client Acknowledge on a NATIVE condition
→ OtOpcUaNodeManager.OnAcknowledge (native) [AlarmAck gate, fail-closed]
→ NativeAlarmAckRouter(NativeAlarmAck{node, comment, operatorUser})
→ DriverHostActor [inverse map conditionNodeId → (driverId, alarmRef)]
→ DriverInstanceActor → driver.AcknowledgeAsync(AlarmAcknowledgeRequest{…, OperatorUser})
→ GalaxyDriver → GatewayGalaxyAlarmAcknowledger → gateway → AVEVA
```
## Error handling / edge cases
- **Anonymous / missing AlarmAck role** on any of H4/H6 → `BadUserAccessDenied` (fail-closed, same as the
existing alarm methods).
- **Native ack with no driver match** (condition node not in the inverse map — e.g. a scripted alarm wrongly
routed) → log + drop; never throw under the node-manager Lock. The router `Tell` is fire-and-forget (safe
under Lock, like `AlarmCommandRouter`).
- **Driver faulted / reconnecting** → `AcknowledgeAsync` fails or no-ops per the driver; the local condition
state still reflects the SDK-applied ack (the OPC UA client sees Good). Surfacing a *failed* upstream ack
back to the client is out of scope (mirrors the Galaxy write-outcome fire-and-forget limitation).
- **H4 native Enable/Disable** → `BadNotSupported` (decision #2).
- **H2-bit:** no runtime behavior change for existing grants except closing the HistoryRead⇒HistoryUpdate
authorization hole; the service itself stays rejected.
## Testing strategy (xUnit + Shouldly, TDD; NO bUnit)
- **H2-bit:** `TriePermissionEvaluator` test — a node granting only `HistoryRead` does **not** authorize
`OpcUaOperation.HistoryUpdate` (now `NotGranted`); a grant including the new `HistoryUpdate` bit does. Enum
bit-value/round-trip test (no collision with `MethodCall`).
- **H4:** node-manager test — `OnEnableDisable` with an AlarmAck identity routes an `Enable`/`Disable`
`AlarmCommand` through a captured `AlarmCommandRouter`; anonymous → `BadUserAccessDenied`, no route; a native
condition → `BadNotSupported`. Engine path (`ScriptedAlarmHostActor` Enable/Disable → `ScriptedAlarmEngine`)
is already covered.
- **H6:** node-manager test — a **native** condition's `OnAcknowledge` (AlarmAck identity) invokes the
`NativeAlarmAckRouter` with the right `(node, comment, operatorUser=DisplayName)` and **not** the scripted
`AlarmCommandRouter`; a **scripted** condition still uses `AlarmCommandRouter`. `DriverHostActor` test — the
inverse map resolves a condition node → `(driverId, alarmRef)` and a `NativeAlarmAck` routes to a stub
driver's `AcknowledgeAsync` with `OperatorUser` populated. `AlarmAcknowledgeRequest` byte/field test for the
new `OperatorUser`. Galaxy: a unit test that `GalaxyDriver.AcknowledgeAsync` forwards `OperatorUser` to the
acknowledger; the live AVEVA commit is the existing skip-gated `GatewayGalaxy…LiveTests` pattern.
- **Live `/run`:** H4 Enable/Disable proven end-to-end on docker-dev (author a scripted alarm, Disable via
Client.CLI, confirm the condition disables); H6's local half proven (native ack routes to the driver — log
evidence), the **AVEVA round-trip operator-gated** against `10.100.0.48` (decision #3).
## Verification per item
- `dotnet build` clean (TreatWarningsAsErrors on production); full `dotnet test` green.
- High-risk review chain (serial spec → code) per task + a final integration review.
- Live `/run`: H4 driven by the agent on docker-dev (login disabled); H6 local route agent-driven, AVEVA
commit recorded as operator-gated (or driven against the reachable gateway if feasible this session).
## Alternatives considered
- **Single command router, disambiguate native vs scripted in the host** — rejected (decision #1): pushes
native routing knowledge into the scripted host; the dual-router split keeps each path's concern local.
- **Native Enable/Disable now** — rejected (decision #2): no driver enable/disable surface; murky semantics
(the driver keeps producing events).
- **Add the HistoryUpdate *service*** — out of scope / deferred (infra-gated; on the roadmap defer list).
- **Pass the principal out-of-band instead of on `AlarmAcknowledgeRequest`** — rejected: the record is the
natural carrier and the base-interface comment already anticipated it; an additive field is the least
surprising.
## Hard constraints (carried from the roadmap)
NO Configuration entity / EF migration. The only contract touches are additive: a `bool isNative` on the
internal `IOpcUaAddressSpaceSink` sink seam, and `string? OperatorUser` on the `AlarmAcknowledgeRequest`
Core.Abstractions record. Stage by path — never `git add .`; never stage `sql_login.txt`,
`src/Server/.../Host/pki/`, `pending.md`, `current.md`, `docker-dev/docker-compose.yml`, `stillpending.md`.
Never echo or commit secrets (gateway API key). No force-push, no `--no-verify`. Razor/runtime behavior proven
only by live `/run`, never bUnit.