docs(plan): design + implementation plan + tasklist for non-arch follow-ups batch (A/B/C)
v2-ci / build (push) Failing after 40s
v2-ci / unit-tests (tests/Core/ZB.MOM.WW.OtOpcUa.Cluster.Tests) (push) Has been skipped
v2-ci / unit-tests (tests/Server/ZB.MOM.WW.OtOpcUa.ControlPlane.Tests) (push) Has been skipped
v2-ci / unit-tests (tests/Server/ZB.MOM.WW.OtOpcUa.OpcUaServer.Tests) (push) Has been skipped
v2-ci / unit-tests (tests/Server/ZB.MOM.WW.OtOpcUa.Runtime.Tests) (push) Has been skipped
v2-ci / unit-tests (tests/Server/ZB.MOM.WW.OtOpcUa.Security.Tests) (push) Has been skipped
v2-ci / integration (tests/Server/ZB.MOM.WW.OtOpcUa.Host.IntegrationTests) (push) Has been skipped
v2-ci / integration (tests/Server/ZB.MOM.WW.OtOpcUa.OpcUaServer.IntegrationTests) (push) Has been skipped

This commit is contained in:
Joseph Doherty
2026-06-19 01:19:37 -04:00
parent f57aa8facd
commit ad359c5cd3
3 changed files with 241 additions and 0 deletions
@@ -0,0 +1,128 @@
# Non-architectural follow-ups batch — Design
**Date:** 2026-06-19
**Status:** Approved (planning) — for later subagent-driven execution
**Base:** master `f57aa8fa`
This batches every **non-architectural** open follow-up (the A/B/C survey) into one design +
plan so it can be executed task-by-task later. Items are grouped by how actionable they are:
- **A — actionable code now** (buildable, no infra/contract change). Do these.
- **B — design-deferred** (decided not-worth-it, or out-of-scope by a prior mechanism choice).
Included per request; two carry a *reconsider* flag (we previously rejected them).
- **C — operational / live-verify** (code already shipped; needs a rig/fixtures/devices up).
Captured as verify tasks, not code; the device-gated ones are blocked on hardware.
**Excluded (architectural / infra-gated, by request):** H2 HistoryUpdate service, H5b durable
`IHistoryWriter`, per-Akka-member `/hosts` nesting (needs a Commons field on `DriverHealthChanged`),
driver `IHistoryProvider`→server HistoryRead bridge, modified-value history, array writes /
S7 wide-type arrays, the Wonderware SDK-semantic permanent boundary, full-stack WebSocket+JWT
DriverStatusHub test.
**Standing guardrails (all tasks):** no EF migration, no Commons/proto/wire change, no bUnit;
stage by explicit path; never stage `sql_login.txt`/`Host/pki/`/`docker-dev/docker-compose.yml`/
`pending.md`/`current.md`/`stillpending.md`; no `--no-verify`/force-push; `dangerouslyDisableSandbox`
for build/test/rig. Each code task = its own commit; finish a batch = merge to master + push.
---
## A — Actionable code
### A1. OpcUaClient history session-capture-before-gate race *(strongest — real latent bug)*
`OpcUaClientDriver.cs` captures `var session = RequireSession()` **before** acquiring `_gate` at
lines **1134, 1299, 1413, 1618 (`ExecuteHistoryReadAsync`, the funnel for ReadRaw/Processed/AtTime),
and 1788** — whereas the read/write paths deliberately re-resolve *inside* the gate
(`_ = RequireSession()` guard at 624/714/829, then re-read after `await _gate.WaitAsync`; see the
`:622-628` comment "a reconnect can swap Session while we wait on _gate"). So a reconnect that swaps
the session while a caller waits on `_gate` leaves these methods using a **stale session**.
**Approach:** for every `var session = RequireSession()` that precedes `await _gate.WaitAsync`, move
the authoritative read to *inside* the gate (keep an outside `_ = RequireSession()` fast-fail guard),
matching the existing read/write idiom. Add a regression test that swaps the session across the gate
wait and asserts the post-gate call uses the new session. Driver-internal; no contract change.
**Classification:** standard.
### A2. Client.CLI `enable` / `disable` command (H4 client-driven path)
Phase 3 shipped the node-manager `OnEnableDisable` → engine Enable/DisableAsync, but there's no
client verb to invoke the OPC UA ConditionType Enable/Disable methods, so H4 was never live-driven.
**Approach:** mirror the existing `ack`/`shelve`/`confirm` command + service-method pattern — add
`EnableAsync`/`DisableAsync` to the client-side `IOpcUaClientService` (calls the OPC UA Enable/Disable
condition methods; **client app interface, NOT a Commons/wire change**) + the CLI `enable`/`disable`
commands. Unit-test the VM/service call; live-drive against the rig's scripted condition.
**Classification:** standard. (Unblocks the deferred H4 live `/run`.)
### A3. Cert-audit minor review nits *(from this session's final integration review)*
Two cleanups in the just-shipped cert-audit code:
(a) `Certificates.razor` `ConfirmAction` has two **unreachable** fallthrough arms
(`"cannot delete from {Kind}"`, `"unknown action"`) that build a `CertActionResult.Fail` *inside*
the razor, bypassing `CertificateStoreManager` → those (unreachable) failures aren't audited. Either
route them through the manager or add an explicit "unreachable defensive guard" comment.
(b) `OpcUa:PkiStoreRoot` is read in BOTH `Certificates.razor:130` and the `CertificateStoreManager`
ctor — centralize on the manager (expose a `PkiRoot` property the razor reads).
**Classification:** trivial (combined into one task; same two files).
---
## B — Design-deferred (included per request)
### B1. Write-outcome residuals
Extend the shipped compare-and-revert write path (write-outcome self-correction, master `1d797c1c`):
on a failed inbound device write, additionally (i) emit a **Bad-quality blip** on the node, (ii) raise
an OPC UA **`AuditWriteUpdateEvent`**, and (iii) add **synchronous structural fail-fast** for writes
that can be rejected before dispatch. These were out-of-scope by the chosen revert-only mechanism.
**Approach:** locate the revert continuation (node-manager `OnWriteValue`/the `IOpcUaNodeWriteGateway`
outcome handler), add the three behaviours behind the existing failure branch. Galaxy can't surface a
write failure (fire-and-forget gateway), so this only exercises on protocol drivers.
**Classification:** standard. *(Each of i/ii/iii could split if it balloons.)*
### B2. AdminUI — Galaxy re-pick preserves prior alarm-field edits
On the equipment Tag modal, re-picking a Galaxy address currently resets manually-edited alarm
fields. **Approach:** in the Galaxy-address-picked handler, merge the picked defaults *without*
clobbering already-edited `alarm` fields (same preserve-unknown idiom used elsewhere). No bUnit;
live-verify on docker-dev. **Classification:** small.
### B3. AdminUI — inline-create-script dropdown label drift
After "New script" inline-creates + binds a script from the VirtualTagModal, the script dropdown
label can drift from the selection. **Approach:** refresh the bound-script label from the created
`SC-…` id after creation. **Classification:** trivial/small.
### B4. F10b surgical DataType / IsArray in-place writes *(RECONSIDER — previously rejected)*
Previously **decided against**: dirty (brief value-type mismatch on the live node, no
ModelChangeEvents) for rare edits → kept full rebuild. Included here only so the decision is a task,
not a silent gap. To actually build: extend `ISurgicalAddressSpaceSink.UpdateTagAttributes` to also
swap DataType/ValueRank in place + emit ModelChangeEvents, and widen `TagDeltaIsSurgicalEligible`.
**Recommendation:** keep deferred unless a concrete need appears. **Classification:** standard.
### B5. Alarm-severity `SetSeverity` surgical update *(RECONSIDER — previously rejected)*
Previously **decided against**: operationally invisible — the live alarm engine
(`ScriptedAlarmHostActor``AlarmStateUpdate`) overwrites the authored severity on first eval, so an
in-place node severity change is shadowed immediately. Included for completeness.
**Recommendation:** keep deferred. **Classification:** small.
---
## C — Operational / live-verify (not code; needs a rig)
### C1. Modbus-Int64 full live authoring
Code shipped (Phase 4 T1, `bd8fee61`); never live-authored because docker-dev has no seeded Modbus
driver. **Verify steps:** seed a Modbus driver on docker-dev pointed at sim `10.100.0.35:5020`, author
an Int64 equipment tag, deploy, confirm the OPC UA node advertises `DataTypeIds.Int64` and reads.
**No code** (unless seeding surfaces a gap). **Classification:** verify-only.
### C2. S7 + AbCip Test-Connect probe happy-path live-verify
Probes shipped (Phase 5); green-path skipped because the fixtures aren't on this Mac. **Verify steps:**
bring up `lmxopcua-fix up s7 s7_1500` / `up abcip controllogix` from the Windows VM, run the skip-gated
probe E2E green path. **Classification:** verify-only (needs the Windows-VM fixtures).
### C3. Device-gated proofs *(BLOCKED on hardware — capture only)*
H6 native-ack→AVEVA round-trip, Galaxy Phase C historian T7 live HistoryRead, Phase B native-alarm T9,
and AbLegacy/TwinCAT/FOCAS probe happy-paths all need real devices (Wonderware sidecar+AVEVA on
`10.100.0.48`, a Galaxy native alarm, PLC5/SLC sim, ADS target, CNC+FWLIB). Recorded so they're not
lost; **not executable** without the hardware. **Classification:** blocked.
---
## Execution shape
Independent code tasks (A1/A2/A3, B1/B2/B3) touch disjoint projects → dispatchable concurrently
in a subagent-driven run, each its own commit, per-task review by classification, final integration
review, merge+push. B4/B5 are *reconsider* gates (don't build without a fresh go-ahead). C1/C2 are
operator/rig verify; C3 is blocked. See `2026-06-19-followups-batch.md` for the executable task list.
+92
View File
@@ -0,0 +1,92 @@
# Non-architectural follow-ups batch — Implementation Plan
> **For Claude:** REQUIRED SUB-SKILL: Use superpowers-extended-cc:subagent-driven-development to execute this plan task-by-task. Each task is self-contained; honor its Classification for the review chain.
**Goal:** Close the actionable non-architectural follow-ups (A/B groups), and capture the
operational/verify and blocked items (C) so nothing is lost.
**Design:** `docs/plans/2026-06-19-followups-batch-design.md`
**Base:** master `f57aa8fa`. **Branch (at execution):** `feat/followups-batch` (off master).
**Standing guardrails:** no EF migration, no Commons/proto/wire change, no bUnit; stage by explicit
path; never stage `sql_login.txt`/`Host/pki/`/`docker-dev/docker-compose.yml`/`pending.md`/
`current.md`/`stillpending.md`; no `--no-verify`/force-push; `dangerouslyDisableSandbox` for
build/test/rig. Finish a batch = ff-merge to master + push.
**Recommended execution order / waves** (disjoint files → concurrent):
- **Wave 1 (code, concurrent):** T1 (OpcUaClient) ∥ T2 (Client.CLI) ∥ T3 (cert-audit AdminUI) ∥ T4 (Galaxy modal) ∥ T5 (vtag modal) — all disjoint projects/files.
- **Wave 2 (code):** T6 (write-outcome, OpcUaServer/Runtime) — its own.
- **Gates (do NOT build without explicit go-ahead):** T7, T8 (reconsider).
- **Operator/rig:** T9, T10 (verify). **Blocked:** T11.
- Each wave: per-task review by classification + a final integration review, then merge+push.
---
### Task 1 (A1): OpcUaClient history session-capture-before-gate race
**Classification:** standard · **Parallelizable with:** T2,T3,T4,T5,T6
**Files:** `src/Drivers/ZB.MOM.WW.OtOpcUa.Driver.OpcUaClient/OpcUaClientDriver.cs` · tests in `tests/Drivers/.../OpcUaClient*Tests`
**Steps:**
1. Audit every `var session = RequireSession()` that precedes `await _gate.WaitAsync` (known sites: 1134, 1299, 1413, **1618 `ExecuteHistoryReadAsync`**, 1788). Compare to the correct idiom at `:622-628` (outside `_ = RequireSession()` fast-fail guard, then re-read the session *inside* the gate).
2. Write a failing regression test: acquire `_gate`, swap `Session` (simulate `OnReconnectComplete`), release; assert the method under test uses the NEW session, not the captured one. (Use the existing OpcUaClient test harness; if session-swap isn't fakeable, assert via the `Gate` internal + a seam.)
3. Refactor each site to re-resolve inside the gate (keep the outside guard). Run the driver unit suite green; `dotnet build` the driver.
4. Commit `fix(opcuaclient): re-resolve session inside _gate in history/read paths (stale-session race)`.
### Task 2 (A2): Client.CLI `enable`/`disable` command (H4 client path)
**Classification:** standard · **Parallelizable with:** T1,T3,T4,T5,T6
**Files:** `src/Client/ZB.MOM.WW.OtOpcUa.Client.CLI/` (Program.cs + the ack/shelve/confirm command template — explore for the actual command structure) · the client-side `IOpcUaClientService` (+ impl) · CLI/Client.Shared tests
**Steps:**
1. Find the existing `ack`/`shelve`/`confirm` CLI command + the `IOpcUaClientService.{Acknowledge,Shelve,Confirm}AlarmAsync` they call (template). Confirm whether `Enable`/`Disable` already exist on the service (grep) — if not, add `EnableAsync(nodeId)`/`DisableAsync(nodeId)` that call the OPC UA ConditionType Enable/Disable methods (mirror the ack call shape). **Client app interface only — NOT Commons/wire.**
2. Add CLI `enable`/`disable` commands mirroring `ack` (node-id arg, connect, call, print status).
3. Unit-test the service/VM call + the command wiring. Build + driver/client tests green.
4. Live (later): drive `enable`/`disable` against the rig's scripted condition node → AlarmAck-gated → engine Enable/DisableAsync (closes the deferred H4 live `/run`).
5. Commit `feat(cli): add enable/disable condition commands (H4 client path)`.
### Task 3 (A3): Cert-audit minor review nits
**Classification:** trivial · **Parallelizable with:** T1,T2,T4,T5,T6
**Files:** `src/Server/ZB.MOM.WW.OtOpcUa.AdminUI/Components/Pages/Certificates.razor` · `src/Server/ZB.MOM.WW.OtOpcUa.AdminUI/Certificates/CertificateStoreManager.cs`
**Steps:**
1. (a) The two unreachable `ConfirmAction` fallthrough arms (`"cannot delete from {Kind}"`, `"unknown action"`): add an explicit `// unreachable defensive guard — buttons only render for Trusted/Rejected + 3 literal verbs` comment (simplest), OR route through the manager so they audit. Pick the comment unless trivial to route.
2. (b) Expose a `PkiRoot` property on `CertificateStoreManager`; have `Certificates.razor:130` read it instead of re-reading `OpcUa:PkiStoreRoot` independently.
3. Build AdminUI (0 errors); existing AdminUI.Tests green.
4. Commit `refactor(adminui): tidy cert-audit review nits (fallthrough comment + single PkiStoreRoot read)`.
### Task 4 (B2): AdminUI — Galaxy re-pick preserves prior alarm-field edits
**Classification:** small · **Parallelizable with:** T1,T2,T3,T5,T6
**Files:** the Galaxy-address-picked handler on the equipment Tag modal (explore: `Components/Shared/Uns/TagModal.razor` + the Galaxy picker callback `OnGalaxyAddressPicked`/similar) · a pure merge helper + its unit test
**Steps:**
1. Reproduce: re-picking a Galaxy address resets manually-edited `alarm` fields. Find the picked-handler that overwrites the config.
2. Extract/extend a pure merge that applies picked defaults WITHOUT clobbering already-edited alarm fields (preserve-existing idiom); unit-test the merge.
3. Wire it into the handler. Build; AdminUI.Tests green. Live-verify on docker-dev (re-pick keeps edits).
4. Commit `fix(adminui): preserve edited alarm fields on Galaxy address re-pick`.
### Task 5 (B3): AdminUI — inline-create-script dropdown label drift
**Classification:** small · **Parallelizable with:** T1,T2,T3,T4,T6
**Files:** `VirtualTagModal` + its inline create-script handler (explore) · test if a pure binding helper exists
**Steps:**
1. Reproduce the label drift after "New script" inline-creates + binds (`SC-…`).
2. Refresh the bound-script label/selection from the created id after creation. Build; tests green; live-verify.
3. Commit `fix(adminui): refresh script dropdown label after inline create`.
### Task 6 (B1): Write-outcome residuals (Bad-quality blip + AuditWriteUpdateEvent + sync fail-fast)
**Classification:** standard · **Parallelizable with:** T1T5
**Files:** node-manager write path (`OtOpcUaNodeManager` `OnWriteValue` / the `IOpcUaNodeWriteGateway` outcome continuation — the write-outcome self-correction site, master `1d797c1c`) · Runtime gateway · tests
**Steps:**
1. Locate the failed-write revert continuation. Add behind the existing failure branch: (i) a brief Bad-quality status blip on the node before/with the revert; (ii) raise an OPC UA `AuditWriteUpdateEvent`; (iii) synchronous structural fail-fast for pre-dispatch-rejectable writes.
2. TDD each sub-behaviour (protocol-driver path only — Galaxy is fire-and-forget). Use the modbus exception-injector recipe for live proof (FC06 reject).
3. If any sub-part balloons >~300 LOC, split it out. Build; OpcUaServer + Runtime tests green.
4. Commit `feat(opcua): emit Bad blip + AuditWriteUpdateEvent + sync fail-fast on failed device write`.
### Task 7 (B4): F10b surgical DataType/IsArray in-place writes — **RECONSIDER GATE**
**Classification:** standard · **Do NOT build without an explicit fresh go-ahead** (previously decided against as dirty — brief value-type mismatch, no ModelChangeEvents, rare edits). If approved: extend `ISurgicalAddressSpaceSink.UpdateTagAttributes` to swap DataType/ValueRank in place + emit ModelChangeEvents; widen `Phase7Applier.TagDeltaIsSurgicalEligible`; **live-`/run` the rebuild=False path** (the prod-inertness trap, see the F10b deferred-wrapper lesson). Until approved this stays a deferred record.
### Task 8 (B5): Alarm-severity `SetSeverity` surgical update — **RECONSIDER GATE**
**Classification:** small · **Do NOT build without an explicit fresh go-ahead** (operationally invisible — the alarm engine overwrites authored severity on first eval). Recorded so the decision isn't a silent gap.
### Task 9 (C1): Modbus-Int64 full live authoring — **VERIFY-ONLY (operator/rig)**
**Classification:** verify · Seed a Modbus driver on docker-dev → sim `10.100.0.35:5020`, author an Int64 equipment tag, deploy, confirm the OPC UA node advertises `DataTypeIds.Int64` + reads changing. No code unless a gap surfaces.
### Task 10 (C2): S7 + AbCip Test-Connect probe happy-path — **VERIFY-ONLY (needs Windows-VM fixtures)**
**Classification:** verify · `lmxopcua-fix up s7 s7_1500` / `up abcip controllogix` from the Windows VM, then run the skip-gated probe E2E green path (`DriverProbeHandshakeE2eTests`).
### Task 11 (C3): Device-gated proofs — **BLOCKED (hardware)**
**Classification:** blocked · H6 native-ack→AVEVA, Galaxy Phase C historian T7, Phase B T9, AbLegacy/TwinCAT/FOCAS probe happy-paths — need Wonderware+AVEVA (`10.100.0.48`), a Galaxy native alarm, PLC5/SLC sim, ADS target, CNC+FWLIB. Captured; not executable here.
@@ -0,0 +1,21 @@
{
"planPath": "docs/plans/2026-06-19-followups-batch.md",
"designPath": "docs/plans/2026-06-19-followups-batch-design.md",
"executionState": "PENDING",
"base": "master f57aa8fa",
"branchAtExecution": "feat/followups-batch",
"tasks": [
{"id": 1, "ref": "A1", "subject": "OpcUaClient history session-capture-before-gate race fix", "classification": "standard", "status": "pending", "parallelizableWith": [2,3,4,5,6], "group": "A-actionable"},
{"id": 2, "ref": "A2", "subject": "Client.CLI enable/disable command (H4 client path)", "classification": "standard", "status": "pending", "parallelizableWith": [1,3,4,5,6], "group": "A-actionable"},
{"id": 3, "ref": "A3", "subject": "Cert-audit minor review nits (unreachable fallthrough comment + single PkiStoreRoot read)", "classification": "trivial", "status": "pending", "parallelizableWith": [1,2,4,5,6], "group": "A-actionable"},
{"id": 4, "ref": "B2", "subject": "AdminUI: preserve edited alarm fields on Galaxy address re-pick", "classification": "small", "status": "pending", "parallelizableWith": [1,2,3,5,6], "group": "B-deferred"},
{"id": 5, "ref": "B3", "subject": "AdminUI: refresh script dropdown label after inline create", "classification": "small", "status": "pending", "parallelizableWith": [1,2,3,4,6], "group": "B-deferred"},
{"id": 6, "ref": "B1", "subject": "Write-outcome residuals: Bad-quality blip + AuditWriteUpdateEvent + sync fail-fast", "classification": "standard", "status": "pending", "parallelizableWith": [1,2,3,4,5], "group": "B-deferred"},
{"id": 7, "ref": "B4", "subject": "F10b surgical DataType/IsArray in-place writes", "classification": "standard", "status": "pending", "gate": "RECONSIDER — previously decided against (dirty); do NOT build without explicit go-ahead", "group": "B-reconsider"},
{"id": 8, "ref": "B5", "subject": "Alarm-severity SetSeverity surgical update", "classification": "small", "status": "pending", "gate": "RECONSIDER — operationally invisible (alarm engine overwrites); do NOT build without explicit go-ahead", "group": "B-reconsider"},
{"id": 9, "ref": "C1", "subject": "Modbus-Int64 full live authoring on docker-dev", "classification": "verify", "status": "pending", "note": "operator/rig verify-only; seed a Modbus driver", "group": "C-verify"},
{"id": 10, "ref": "C2", "subject": "S7 + AbCip Test-Connect probe happy-path live-verify", "classification": "verify", "status": "pending", "note": "needs lmxopcua-fix fixtures from the Windows VM", "group": "C-verify"},
{"id": 11, "ref": "C3", "subject": "Device-gated proofs (H6 ack→AVEVA, Phase C T7, Phase B T9, AbLegacy/TwinCAT/FOCAS probe)", "classification": "blocked", "status": "pending", "note": "BLOCKED on real hardware; captured only", "group": "C-blocked"}
],
"lastUpdated": "2026-06-19"
}