lmxopcua/docs/plans/2026-06-19-followups-batch-design.md
Joseph Doherty ad359c5cd3
docs(plan): design + implementation plan + tasklist for non-arch follow-ups batch (A/B/C)
2026-06-19 01:19:37 -04:00

Non-architectural follow-ups batch — Design

Date: 2026-06-19
Status: Approved (planning), for later subagent-driven execution
Base: master f57aa8fa

This batches every non-architectural open follow-up (the A/B/C survey) into one design + plan so it can be executed task-by-task later. Items are grouped by how actionable they are:

  • A — actionable code now (buildable, no infra/contract change). Do these.
  • B — design-deferred (decided not-worth-it, or out-of-scope by a prior mechanism choice). Included per request; two carry a reconsider flag (we previously rejected them).
  • C — operational / live-verify (code already shipped; needs a rig/fixtures/devices up). Captured as verify tasks, not code; the device-gated ones are blocked on hardware.

Excluded (architectural / infra-gated, by request): H2 HistoryUpdate service, H5b durable IHistoryWriter, per-Akka-member /hosts nesting (needs a Commons field on DriverHealthChanged), driver IHistoryProvider→server HistoryRead bridge, modified-value history, array writes / S7 wide-type arrays, the Wonderware SDK-semantic permanent boundary, full-stack WebSocket+JWT DriverStatusHub test.

Standing guardrails (all tasks): no EF migration, no Commons/proto/wire change, no bUnit; stage by explicit path; never stage sql_login.txt, Host/pki/, docker-dev/docker-compose.yml, pending.md, current.md, or stillpending.md; no --no-verify or force-push; use dangerouslyDisableSandbox for build/test/rig. Each code task gets its own commit; finishing a batch means merge to master + push.


A — Actionable code

A1. OpcUaClient history session-capture-before-gate race (strongest — real latent bug)

OpcUaClientDriver.cs captures var session = RequireSession() before acquiring _gate at lines 1134, 1299, 1413, 1618 (ExecuteHistoryReadAsync, the funnel for ReadRaw/Processed/AtTime), and 1788 — whereas the read/write paths deliberately re-resolve inside the gate (_ = RequireSession() guard at 624/714/829, then re-read after await _gate.WaitAsync; see the :622-628 comment "a reconnect can swap Session while we wait on _gate"). So a reconnect that swaps the session while a caller waits on _gate leaves these methods using a stale session. Approach: for every var session = RequireSession() that precedes await _gate.WaitAsync, move the authoritative read to inside the gate (keep an outside _ = RequireSession() fast-fail guard), matching the existing read/write idiom. Add a regression test that swaps the session across the gate wait and asserts the post-gate call uses the new session. Driver-internal; no contract change. Classification: standard.
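
A minimal sketch of the race and the fix, in Python with asyncio.Lock standing in for _gate. The Driver and Session classes and their method names are hypothetical stand-ins, not the real OpcUaClientDriver code; the only thing being illustrated is where the session read sits relative to the gate wait. The demo also shows the shape the regression test could take: hold the gate, park a caller on it, swap the session, release.

```python
import asyncio


class Session:
    """Hypothetical stand-in for an OPC UA client session."""

    def __init__(self, name):
        self.name = name


class Driver:
    """Hypothetical stand-in for the driver; only the gate/session ordering matters."""

    def __init__(self):
        self._gate = asyncio.Lock()        # plays the role of _gate
        self.session = Session("old")

    def require_session(self):
        if self.session is None:
            raise RuntimeError("not connected")
        return self.session

    async def history_read_stale(self):
        session = self.require_session()   # captured BEFORE the gate: can go stale
        async with self._gate:
            return session.name

    async def history_read_fixed(self):
        _ = self.require_session()         # outside the gate: fast-fail guard only
        async with self._gate:
            session = self.require_session()   # authoritative read inside the gate
            return session.name


async def demo():
    driver = Driver()
    await driver._gate.acquire()           # another operation holds the gate
    stale = asyncio.create_task(driver.history_read_stale())
    fixed = asyncio.create_task(driver.history_read_fixed())
    await asyncio.sleep(0.01)              # let both callers park on the gate
    driver.session = Session("new")        # a reconnect swaps the session
    driver._gate.release()
    return await stale, await fixed


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

In this sketch the pre-gate capture returns the old session's name and the in-gate read returns the new one, which is the assertion the regression test would make against the real driver.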

A2. Client.CLI enable / disable command (H4 client-driven path)

Phase 3 shipped the node-manager OnEnableDisable → engine Enable/DisableAsync, but there's no client verb to invoke the OPC UA ConditionType Enable/Disable methods, so H4 was never live-driven. Approach: mirror the existing ack/shelve/confirm command + service-method pattern — add EnableAsync/DisableAsync to the client-side IOpcUaClientService (calls the OPC UA Enable/Disable condition methods; client app interface, NOT a Commons/wire change) + the CLI enable/disable commands. Unit-test the VM/service call; live-drive against the rig's scripted condition. Classification: standard. (Unblocks the deferred H4 live /run.)

A3. Cert-audit minor review nits (from this session's final integration review)

Two cleanups in the just-shipped cert-audit code: (a) Certificates.razor ConfirmAction has two unreachable fallthrough arms ("cannot delete from {Kind}", "unknown action") that build a CertActionResult.Fail inside the razor, bypassing CertificateStoreManager → those (unreachable) failures aren't audited. Either route them through the manager or add an explicit "unreachable defensive guard" comment. (b) OpcUa:PkiStoreRoot is read in BOTH Certificates.razor:130 and the CertificateStoreManager ctor — centralize on the manager (expose a PkiRoot property the razor reads). Classification: trivial (combined into one task; same two files).


B — Design-deferred (included per request)

B1. Write-outcome residuals

Extend the shipped compare-and-revert write path (write-outcome self-correction, master 1d797c1c): on a failed inbound device write, additionally (i) emit a Bad-quality blip on the node, (ii) raise an OPC UA AuditWriteUpdateEvent, and (iii) add synchronous structural fail-fast for writes that can be rejected before dispatch. These were out-of-scope by the chosen revert-only mechanism. Approach: locate the revert continuation (node-manager OnWriteValue/the IOpcUaNodeWriteGateway outcome handler), add the three behaviours behind the existing failure branch. Galaxy can't surface a write failure (fire-and-forget gateway), so this is only exercised on protocol drivers. Classification: standard. (Each of i/ii/iii could be split out if it balloons.)
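
A rough sketch of how the three behaviours could hang off the failure branch, in Python. Everything here is hypothetical: Node, handle_write and device_write are illustration-only names, the optimistic-update-then-revert flow is an assumed simplification of the shipped compare-and-revert path, and the audit event is modelled as a plain tuple rather than a real AuditWriteUpdateEvent.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class Node:
    """Hypothetical stand-in for a live variable node."""
    value: int
    quality_log: List[str] = field(default_factory=list)
    audit_events: List[Tuple[str, int, int]] = field(default_factory=list)


def handle_write(node: Node, requested: object,
                 device_write: Callable[[int], bool]) -> str:
    # (iii) structural fail-fast: reject before dispatch, device is never called.
    if not isinstance(requested, int) or isinstance(requested, bool):
        return "rejected"

    prior = node.value
    node.value = requested                 # assumed optimistic local update
    if device_write(requested):
        return "written"

    # Existing failure branch: revert to the prior value.
    node.value = prior
    # (i) Bad-quality blip: quality drops to Bad, then returns to Good.
    node.quality_log += ["Bad", "Good"]
    # (ii) audit record of the failed write (old value, attempted value).
    node.audit_events.append(("AuditWriteUpdate", prior, requested))
    return "reverted"
```

Keeping all three behind the one failure branch means a successful write is untouched, and (iii) sits in front of dispatch so a structurally invalid write never reaches the device.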

B2. AdminUI — Galaxy re-pick preserves prior alarm-field edits

On the equipment Tag modal, re-picking a Galaxy address currently resets manually-edited alarm fields. Approach: in the Galaxy-address-picked handler, merge the picked defaults without clobbering already-edited alarm fields (same preserve-unknown idiom used elsewhere). No bUnit; live-verify on docker-dev. Classification: small.
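
The merge could look roughly like this, sketched in Python. The function name, the field names and the idea of tracking edits as a set of field names are all assumptions for illustration; the real modal may track edits differently.

```python
def merge_picked_defaults(current, picked_defaults, edited_fields):
    """Apply picked defaults, but keep any field the user has already edited.

    current         -- dict of the modal's current field values
    picked_defaults -- dict of defaults derived from the newly picked address
    edited_fields   -- set of field names the user changed by hand
    """
    merged = dict(current)                 # never mutate the caller's dict
    for name, value in picked_defaults.items():
        if name not in edited_fields:      # unedited fields take the new default
            merged[name] = value
    return merged


current = {"address": "Tank1.Level", "alarm_severity": 700, "alarm_message": "High level"}
picked = {"address": "Tank2.Level", "alarm_severity": 500, "alarm_message": "Level alarm"}
merged = merge_picked_defaults(current, picked, edited_fields={"alarm_severity"})
# merged keeps alarm_severity=700 and takes the new address and alarm_message.
```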

B3. AdminUI — inline-create-script dropdown label drift

After "New script" inline-creates + binds a script from the VirtualTagModal, the script dropdown label can drift from the selection. Approach: refresh the bound-script label from the created SC-… id after creation. Classification: trivial/small.

B4. F10b surgical DataType / IsArray in-place writes (RECONSIDER — previously rejected)

Previously decided against: the in-place swap is dirty (brief value-type mismatch on the live node, no ModelChangeEvents) and only benefits rare edits, so the full rebuild was kept. Included here only so the decision is a task, not a silent gap. To actually build: extend ISurgicalAddressSpaceSink.UpdateTagAttributes to also swap DataType/ValueRank in place + emit ModelChangeEvents, and widen TagDeltaIsSurgicalEligible. Recommendation: keep deferred unless a concrete need appears. Classification: standard.

B5. Alarm-severity SetSeverity surgical update (RECONSIDER — previously rejected)

Previously decided against: operationally invisible — the live alarm engine (ScriptedAlarmHostActorAlarmStateUpdate) overwrites the authored severity on first eval, so an in-place node severity change is shadowed immediately. Included for completeness. Recommendation: keep deferred. Classification: small.


C — Operational / live-verify (not code; needs a rig)

C1. Modbus-Int64 full live authoring

Code shipped (Phase 4 T1, bd8fee61); never live-authored because docker-dev has no seeded Modbus driver. Verify steps: seed a Modbus driver on docker-dev pointed at sim 10.100.0.35:5020, author an Int64 equipment tag, deploy, confirm the OPC UA node advertises DataTypeIds.Int64 and reads. No code (unless seeding surfaces a gap). Classification: verify-only.

C2. S7 + AbCip Test-Connect probe happy-path live-verify

Probes shipped (Phase 5); green-path skipped because the fixtures aren't on this Mac. Verify steps: bring up the fixtures from the Windows VM (lmxopcua-fix up s7 s7_1500 / up abcip controllogix), then run the skip-gated probe E2E green path. Classification: verify-only (needs the Windows-VM fixtures).

C3. Device-gated proofs (BLOCKED on hardware — capture only)

H6 native-ack→AVEVA round-trip, Galaxy Phase C historian T7 live HistoryRead, Phase B native-alarm T9, and AbLegacy/TwinCAT/FOCAS probe happy-paths all need real devices (Wonderware sidecar+AVEVA on 10.100.0.48, a Galaxy native alarm, PLC5/SLC sim, ADS target, CNC+FWLIB). Recorded so they're not lost; not executable without the hardware. Classification: blocked.


Execution shape

Independent code tasks (A1/A2/A3, B1/B2/B3) touch disjoint projects → dispatchable concurrently in a subagent-driven run, each its own commit, per-task review by classification, final integration review, merge+push. B4/B5 are reconsider gates (don't build without a fresh go-ahead). C1/C2 are operator/rig verify; C3 is blocked. See 2026-06-19-followups-batch.md for the executable task list.