diff --git a/docs/plans/2026-06-15-stillpending-phase-3-opcua-standards-design.md b/docs/plans/2026-06-15-stillpending-phase-3-opcua-standards-design.md new file mode 100644 index 00000000..696a8558 --- /dev/null +++ b/docs/plans/2026-06-15-stillpending-phase-3-opcua-standards-design.md @@ -0,0 +1,173 @@ +# Phase 3 — OPC UA standards completeness (H4 + H2-bit + H6) — design + +> **Status:** approved 2026-06-15. Parent roadmap: `docs/plans/2026-06-15-stillpending-backlog-design.md` +> (Phase 3). Backlog items **H4 + H2 + H6** in `stillpending.md` §1. +> **Branch:** `feat/stillpending-phase-3-opcua-standards` off master `4af8e65a`. +> **Classification:** high-risk (touches the OPC UA node-manager method seam, the alarm command path, +> the driver inbound-ack route, and the authorization permission model). + +## Problem + +Three OPC UA standards gaps remain in the alarm + authorization surface: + +- **H4 — alarm `Enable`/`Disable` not callable over OPC UA.** The node manager wires + `OnAcknowledge`/`OnConfirm`/`OnAddComment`/`OnShelve`/`OnTimedUnshelve` but **not** `OnEnableDisable` + (`AlarmCommand.cs` itself flags it as "a future task"). The engine side is **already done** + (`ScriptedAlarmEngine.EnableAsync`/`DisableAsync`, dispatched at `ScriptedAlarmHostActor.cs:408-412`) and + the `AlarmCommand` vocabulary already includes `Enable`/`Disable` — only the node-manager delegate is missing. +- **H2 (permission bit only) — `NodePermissions.HistoryUpdate` missing.** The flags enum has no + `HistoryUpdate` member, so `TriePermissionEvaluator.cs:86` maps `OpcUaOperation.HistoryUpdate → the + HistoryRead bit`. A read-only HistoryRead grant would therefore also authorize a future HistoryUpdate. The + full HistoryUpdate **service** stays **deferred** (infra-gated — no backend insert/replace/delete RPC). +- **H6 — inbound ack of a NATIVE (Galaxy) alarm never reaches the driver / AVEVA.** + `MaterialiseAlarmCondition` is the **single shared** materializer; both native and scripted conditions get + `OnAcknowledge` wired to the **scripted** `AlarmCommandRouter` → `ScriptedAlarmHostActor` → engine. For a + native alarm the engine doesn't own it, so the ack updates only the local `AlarmConditionState` and never + flows through `IAlarmSource.AcknowledgeAsync` → AVEVA. There is currently **no** inbound path from an OPC UA + ack to a driver. (The authenticated principal is *already* carried on the scripted path — + `AlarmCommand.User = identity.DisplayName`; the generic `"opcua-client"` lives only in the raw + `IAlarmSource` fallback in `ScriptedAlarmSource.cs:75`.) + +## Goal + +Make an OPC UA client able to Enable/Disable a Part 9 condition (H4); give HistoryUpdate its own permission +bit so a HistoryRead grant can't authorize it (H2-bit); and route a native-condition Acknowledge to the +owning driver (→ AVEVA) carrying the authenticated principal (H6). No EF migration; the only contract +touches are additive (a sink-seam bool, a Core.Abstractions record field). + +## Locked decisions (brainstorming 2026-06-15) + +1. **H6 native-ack routing = a new `NativeAlarmAckRouter` seam** on the node manager (a second `Action`, + mirroring `AlarmCommandRouter`). Native conditions wire `OnAcknowledge` to it; scripted conditions keep the + existing router. Rejected: a single router disambiguated downstream (couples the scripted host to native + routing). +2. **H4 scoped to scripted alarms only.** Enable/Disable has clear engine semantics there. Native-alarm + Enable/Disable is deferred — the driver has no enable/disable surface; a native condition's + `OnEnableDisable` returns `BadNotSupported` (honest) rather than a silent no-op. +3. **H6 live-provability accepted as code+unit-proven locally; the AVEVA round-trip is operator-gated.** The + route is driver-agnostic via `IAlarmSource`, verified with **Galaxy** (the only real native-alarm source, + with `GatewayGalaxyAlarmAcknowledger`). The gateway ack→AVEVA commit needs the real gateway + (`10.100.0.48`), the same infra-gating class as H5's durable sink / the Galaxy live-gw smokes. +4. **Sequencing:** all three this phase, increasing risk — **H2-bit → H4 → H6.** + +## Architecture + +### H2-bit (small, independent) +- `NodePermissions`: add `HistoryUpdate = 1 << 12` (next free bit after `MethodCall = 1 << 11`). Optionally + fold into the `Admin` composite (write-side history op); leave `Engineer`/`Operator` without it. +- `TriePermissionEvaluator.MapOperationToPermission`: `OpcUaOperation.HistoryUpdate => NodePermissions.HistoryUpdate` + (was `HistoryRead`). Drop the stale TODO comment. +- Pure mapping change — fully unit-testable; the HistoryUpdate **service** remains unimplemented (a client + calling HistoryUpdate still gets the SDK's default reject — unchanged). + +### H4 (small-medium) — wire `OnEnableDisable` +- In `MaterialiseAlarmCondition`, for **scripted** conditions wire + `alarm.OnEnableDisable = (context, condition, enabling) => HandleAlarmCommand(context, condition, + enabling ? "Enable" : "Disable", comment: null, unshelveAt: null);` — reusing the existing AlarmAck-gated + `HandleAlarmCommand` (which builds `AlarmCommand(Enable|Disable)` and routes via `AlarmCommandRouter`). + `ScriptedAlarmHostActor:408-412` already dispatches those to `ScriptedAlarmEngine.Enable/DisableAsync`. +- For **native** conditions, `OnEnableDisable` returns `BadNotSupported` (decision #2). +- Verify the SDK `OnEnableDisable` delegate's exact signature against the decompiled + `AlarmConditionState`/`ConditionState` (the `OnShelve` wiring was verified the same way) and that the + enable-state round-trips to the node (the engine re-projects via `WriteAlarmCondition`, whose delta-gate + suppresses the double-event — same mechanism the ack path documents at `:591`). +- **Authz:** AlarmAck (consistent with the other Part 9 alarm commands). + +### H6 (the meaty one) — native ack → driver → AVEVA + principal +Four moving parts: + +1. **Mark native conditions at materialization.** Add `bool isNative` to + `IOpcUaAddressSpaceSink.MaterialiseAlarmCondition` (+ `DeferredAddressSpaceSink`, `SdkAddressSpaceSink`, + `NullOpcUaAddressSpaceSink`, `OtOpcUaNodeManager`). `Phase7Applier` already knows which is which: pass + `isNative: true` at the native equipment-tag-alarm site (`:204`) and `false` at the scripted site (`:295`). + The node manager tracks native node ids (e.g. a `_nativeAlarmNodeIds` set) and wires their `OnAcknowledge` + (and `OnEnableDisable`) accordingly. +2. **`NativeAlarmAckRouter` seam.** A new `Action?` on the node manager (mirrors + `AlarmCommandRouter`). A native condition's `OnAcknowledge` → resolve the AlarmAck-gated principal the same + way `HandleAlarmCommand` does (fail-closed) → invoke the router with + `NativeAlarmAck(ConditionNodeId, Comment, OperatorUser = identity.DisplayName)` → return `Good`. +3. **Reverse map in `DriverHostActor`.** `DriverHostActor` already builds the forward map + `_alarmNodeIdByDriverRef: (DriverInstanceId, FullName) → condition NodeId(s)` (`:127`, rebuilt at `:902`). + Build the **inverse** in the same clear-and-rebuild pass: `conditionNodeId → (DriverInstanceId, FullName)`. + The host wires the `NativeAlarmAckRouter` to a fire-and-forget `Tell` into `DriverHostActor`; on receipt it + looks up the inverse map → `(driverId, alarmRef)` → routes to the `DriverInstanceActor` → + `driver.AcknowledgeAsync([ new AlarmAcknowledgeRequest(SourceNodeId, ConditionId: alarmRef, Comment, + OperatorUser) ], ct)`. +4. **Carry the principal to the driver.** `AlarmAcknowledgeRequest(SourceNodeId, ConditionId, Comment?)` gains + `string? OperatorUser` (the field the base-interface comment already anticipated — "Stream G provides the + authenticated principal"). `GalaxyDriver.AcknowledgeAsync` passes it to + `IGalaxyAlarmAcknowledger.AcknowledgeAsync(alarmFullReference, comment, operatorUser, ct)` → gateway → AVEVA. + Also tidy `ScriptedAlarmSource.cs:75`'s `"opcua-client"` default to honor a supplied principal. + +Native-alarm-source drivers that don't implement a real acknowledger (AbCip/FOCAS projections) accept the +request as today (no behavioral regression); only Galaxy commits to its upstream. + +``` +OPC UA client Acknowledge on a NATIVE condition + → OtOpcUaNodeManager.OnAcknowledge (native) [AlarmAck gate, fail-closed] + → NativeAlarmAckRouter(NativeAlarmAck{node, comment, operatorUser}) + → DriverHostActor [inverse map conditionNodeId → (driverId, alarmRef)] + → DriverInstanceActor → driver.AcknowledgeAsync(AlarmAcknowledgeRequest{…, OperatorUser}) + → GalaxyDriver → GatewayGalaxyAlarmAcknowledger → gateway → AVEVA +``` + +## Error handling / edge cases + +- **Anonymous / missing AlarmAck role** on any of H4/H6 → `BadUserAccessDenied` (fail-closed, same as the + existing alarm methods). +- **Native ack with no driver match** (condition node not in the inverse map — e.g. a scripted alarm wrongly + routed) → log + drop; never throw under the node-manager Lock. The router `Tell` is fire-and-forget (safe + under Lock, like `AlarmCommandRouter`). +- **Driver faulted / reconnecting** → `AcknowledgeAsync` fails or no-ops per the driver; the local condition + state still reflects the SDK-applied ack (the OPC UA client sees Good). Surfacing a *failed* upstream ack + back to the client is out of scope (mirrors the Galaxy write-outcome fire-and-forget limitation). +- **H4 native Enable/Disable** → `BadNotSupported` (decision #2). +- **H2-bit:** no runtime behavior change for existing grants except closing the HistoryRead⇒HistoryUpdate + authorization hole; the service itself stays rejected. + +## Testing strategy (xUnit + Shouldly, TDD; NO bUnit) + +- **H2-bit:** `TriePermissionEvaluator` test — a node granting only `HistoryRead` does **not** authorize + `OpcUaOperation.HistoryUpdate` (now `NotGranted`); a grant including the new `HistoryUpdate` bit does. Enum + bit-value/round-trip test (no collision with `MethodCall`). +- **H4:** node-manager test — `OnEnableDisable` with an AlarmAck identity routes an `Enable`/`Disable` + `AlarmCommand` through a captured `AlarmCommandRouter`; anonymous → `BadUserAccessDenied`, no route; a native + condition → `BadNotSupported`. Engine path (`ScriptedAlarmHostActor` Enable/Disable → `ScriptedAlarmEngine`) + is already covered. +- **H6:** node-manager test — a **native** condition's `OnAcknowledge` (AlarmAck identity) invokes the + `NativeAlarmAckRouter` with the right `(node, comment, operatorUser=DisplayName)` and **not** the scripted + `AlarmCommandRouter`; a **scripted** condition still uses `AlarmCommandRouter`. `DriverHostActor` test — the + inverse map resolves a condition node → `(driverId, alarmRef)` and a `NativeAlarmAck` routes to a stub + driver's `AcknowledgeAsync` with `OperatorUser` populated. `AlarmAcknowledgeRequest` byte/field test for the + new `OperatorUser`. Galaxy: a unit test that `GalaxyDriver.AcknowledgeAsync` forwards `OperatorUser` to the + acknowledger; the live AVEVA commit is the existing skip-gated `GatewayGalaxy…LiveTests` pattern. +- **Live `/run`:** H4 Enable/Disable proven end-to-end on docker-dev (author a scripted alarm, Disable via + Client.CLI, confirm the condition disables); H6's local half proven (native ack routes to the driver — log + evidence), the **AVEVA round-trip operator-gated** against `10.100.0.48` (decision #3). + +## Verification per item + +- `dotnet build` clean (TreatWarningsAsErrors on production); full `dotnet test` green. +- High-risk review chain (serial spec → code) per task + a final integration review. +- Live `/run`: H4 driven by the agent on docker-dev (login disabled); H6 local route agent-driven, AVEVA + commit recorded as operator-gated (or driven against the reachable gateway if feasible this session). + +## Alternatives considered + +- **Single command router, disambiguate native vs scripted in the host** — rejected (decision #1): pushes + native routing knowledge into the scripted host; the dual-router split keeps each path's concern local. +- **Native Enable/Disable now** — rejected (decision #2): no driver enable/disable surface; murky semantics + (the driver keeps producing events). +- **Add the HistoryUpdate *service*** — out of scope / deferred (infra-gated; on the roadmap defer list). +- **Pass the principal out-of-band instead of on `AlarmAcknowledgeRequest`** — rejected: the record is the + natural carrier and the base-interface comment already anticipated it; an additive field is the least + surprising. + +## Hard constraints (carried from the roadmap) + +NO Configuration entity / EF migration. The only contract touches are additive: a `bool isNative` on the +internal `IOpcUaAddressSpaceSink` sink seam, and `string? OperatorUser` on the `AlarmAcknowledgeRequest` +Core.Abstractions record. Stage by path — never `git add .`; never stage `sql_login.txt`, +`src/Server/.../Host/pki/`, `pending.md`, `current.md`, `docker-dev/docker-compose.yml`, `stillpending.md`. +Never echo or commit secrets (gateway API key). No force-push, no `--no-verify`. Razor/runtime behavior proven +only by live `/run`, never bUnit.