Files
lmxopcua/docs/plans/2026-06-18-f10b-vtag-skip-rebuild-galaxy-comment-design.md
T

9.9 KiB

F10b vtag-node-irrelevant rebuild skip + Client.UI Galaxy operator-comment limitation — Design

Date: 2026-06-18 Branch: feat/f10b-vtag-skip-rebuild-galaxy-comment (off master be703a30) Backlog items: stillpending.md §A #11 (F10b in-place vs full rebuild) + #12 (Client.UI Galaxy operator-comment fallback)

Two independent, low-risk backlog items. F10b is a bounded address-space-rebuild optimization (the narrow, provably-safe slice — NOT the broad surgical-write rewrite). The Client.UI item is an accepted-limitation close (doc/comment + a pinning test). They touch disjoint projects (OpcUaServer / Client.Shared).


Standing constraints (in force)

  • NO Commons/proto change · NO Core.Abstractions / breaking-interface change · NO EF migration · NO bUnit (no Razor touched this phase).
  • Stage by explicit path (never git add .); never stage sql_login.txt / pki/ / pending.md / current.md / stillpending.md / docker-dev/docker-compose.yml. No force-push / no --no-verify.
  • Finish = merge to master + push. dangerouslyDisableSandbox: true for all build/test commands.

Component A — F10b: skip the address-space rebuild for vtag-node-irrelevant edits (backlog #11)

What "full rebuild" costs, and why H1a made it unconditional

OpcUaPublishActor.HandleRebuild runs: outcome = _applier.Apply(plan) (which calls _sink.RebuildAddressSpace() — a full tear-down of every folder/variable/alarm — iff needsRebuild), then unconditionally re-runs the idempotent Materialise* passes. H1a/H1b (1e95856b/ada01e1a) made needsRebuild fire on any Changed* (not just Added/Removed) to kill a silent stale-node bug — but that means editing a single VirtualTag's expression tears down and rebuilds the entire server address space, dropping every client's live values + monitored-item subscriptions server-wide. That blast radius is the cost F10b targets.

The provably-safe slice (the only zero-node-mutation case)

A VirtualTag's materialised OPC UA node (Phase7Applier.MaterialiseEquipmentVirtualTags, lines 264-272) is derived only from {EquipmentId, FolderPath, Name, DataType}EnsureVariable(parent/Name, Name, DataType, writable:false). Expression, DependencyRefs, and Historize do not touch the node — Historize is explicitly write-side only (the code comment at lines 260-263 says so; vtags are not materialised as SDK Historizing/HistoryRead variables).

The VirtualTag engine picks up Expression/DependencyRefs/Historize changes through a completely independent path: DriverHostActor (:1062) tells VirtualTagHostActor ApplyVirtualTags(...), and VirtualTagHostActor.OnApply (lines 97-109) stop+respawns any child "whose plan changed in place (edited Expression/DependencyRefs, or a toggled flag)". This is not routed through Phase7Applier.

Therefore, when a deploy's only changes are vtag deltas that differ solely in Expression/DependencyRefs/Historize:

  • the materialised node is byte-identical → RebuildAddressSpace would tear it down and rebuild an identical node, dropping subscriptions for nothing;
  • the idempotent Materialise* passes already short-circuit on an existing node (OtOpcUaNodeManager.EnsureVariable:1324 early-returns if (_variables.ContainsKey(...))), so they no-op and preserve the live value + subscriptions;
  • the engine respawns regardless via VirtualTagHostActor.

So skipping RebuildAddressSpace for this case is correct and preserves the H1a invariant (no stale node — because the node never needed to change).

Approach (Phase7Applier-only; NO node-manager / SDK / interface change)

In Phase7Applier.Apply, refine needsRebuild. A vtag delta is node-irrelevant iff overriding the three engine/write-side fields to the new values makes it equal to the new plan (a whitelist-of-may-differ that defaults to rebuild for any other/unknown field, because EquipmentVirtualTagPlan has a custom Equals comparing all 8 logical fields):

static bool VtagDeltaIsNodeIrrelevant(Phase7Plan.EquipmentVirtualTagDelta d) =>
    (d.Previous with
     {
         Expression = d.Current.Expression,
         DependencyRefs = d.Current.DependencyRefs,
         Historize = d.Current.Historize,
     }).Equals(d.Current);

needsRebuild becomes true iff there is any structural change OR any vtag delta that is not node-irrelevant:

var needsRebuild =
    plan.AddedEquipment.Count > 0 || plan.RemovedEquipment.Count > 0 || plan.ChangedEquipment.Count > 0 ||
    plan.AddedAlarms.Count > 0 || plan.RemovedAlarms.Count > 0 || plan.ChangedAlarms.Count > 0 ||
    plan.AddedEquipmentTags.Count > 0 || plan.RemovedEquipmentTags.Count > 0 || plan.ChangedEquipmentTags.Count > 0 ||
    plan.AddedEquipmentVirtualTags.Count > 0 || plan.RemovedEquipmentVirtualTags.Count > 0 ||
    plan.ChangedEquipmentVirtualTags.Any(d => !VtagDeltaIsNodeIrrelevant(d));

i.e. needsRebuild is false only when every effective change is a node-irrelevant vtag edit and nothing else changed. ChangedNodes (the outcome count) stays as-is (it still counts the vtag change as a change — it just doesn't force a rebuild). Update the lines 84-91 comment to document the new skip + the safety reasoning. The downstream Materialise* passes and the VirtualTagHostActor respawn are unchanged.

Out of scope (deferred, documented)

Surgical in-place attribute writes for equipment-tag property changes (DataType, Writable, IsHistorized) and alarm-severity-only changes (SetSeverity) genuinely mutate node attributes and carry the H1a stale-node risk + SDK handler/historian-lifecycle complexity. They are NOT in this slice — F10b here is strictly the zero-node-mutation vtag skip. Record them as follow-ups.

Testing (offline, no SDK)

Phase7ApplierTests already uses a RecordingSink that captures RebuildAddressSpace calls. Add:

  • vtag delta differing only in Expressionoutcome.RebuildCalled == false, RecordingSink saw no RebuildAddressSpace, outcome.ChangedNodes == 1.
  • same for a DependencyRefs-only and a Historize-only delta ⇒ no rebuild.
  • vtag delta where DataType also changedRebuildCalled == true (the safety boundary).
  • vtag delta where Name (or FolderPath) changedRebuildCalled == true.
  • a node-irrelevant vtag delta mixed with any other change (e.g. a ChangedEquipmentTag, or an added equipment) ⇒ RebuildCalled == true (the skip requires the vtag edit to be the sole change).

Live /run (optional, confirmatory)

On docker-dev (rebuild BOTH central nodes per the :9200 round-robin fact): subscribe a client to a vtag, edit only its expression in /uns, Deploy → server log shows rebuild=False for that deploy and the client subscription is NOT dropped (vs rebuild=True when a tag's DataType changes). This is confirmatory; the unit tests pin the logic.


Component B — Client.UI Galaxy operator-comment fallback (backlog #12)

Verdict: accepted limitation, not a code-fixable bug

The operator comment is genuinely unrecoverable in the Galaxy sub-attribute fallback path (OpcUaClientService.cs:769-820):

  • the standard OPC UA event SelectClause (CreateAlarmEventFilter, ~13 fields) does not request an AckComment/Comment field — Part 9 conditions don't carry the operator comment in the event payload here;
  • Galaxy's sub-attribute convention (AlarmRefBuilder.cs:15-20) exposes AckMsg as write-only — there is no readable attribute to recover the last operator comment;
  • the comment is available via the gateway's native alarm feed (GatewayGalaxyAlarmFeedGalaxyAlarmTransition.OperatorComment), which the Galaxy driver already uses — a different path from this OPC UA client fallback.

So a real fix would require a server-side change (out of scope); the backlog already labels this "accepted limitation." AlarmEventArgs.cs:82-87 partly documents it, but the client-side extraction omission is unexplained, and the behavior is untested.

Approach (doc/comment + a pinning test; NO behavioral change)

  • Inline comment at the fallback AlarmEvent?.Invoke(...) site in OpcUaClientService.cs (~:817): explain that OperatorComment is intentionally left null here — Galaxy AckMsg is write-only and the OPC UA event SelectClause carries no comment field, so the comment is only available via the gateway's native alarm feed (the driver path), not this client fallback.
  • Tighten the AlarmEventArgs.cs:82-87 doc to name the two concrete reasons (write-only AckMsg + no event-field), so it reads as an intentional limitation, not a TODO.
  • Pinning test in OpcUaClientServiceTests (mirror the existing OnAlarmEvent_MissingAckedActiveButHasConditionNode_FallbackReadsAndRaisesEvent fixture): drive the fallback path and assert the raised AlarmEventArgs.OperatorComment is null — converting the gap into a tested, intentional limitation.

Testing

Fully offline — the existing OpcUaClientServiceTests fakes (FakeSessionAdapter/FakeSubscriptionAdapter) already drive the fallback path; the new test asserts the documented behavior.


Task slicing (independent → parallelizable)

Task Component Project Class Parallel with
T1 F10b vtag-node-irrelevant rebuild skip + tests OpcUaServer (+ .Tests) small T2
T2 Client.UI Galaxy comment: inline comment + doc + pinning test Client.Shared (+ .Tests) trivial→small T1
T3 Reconcile stillpending #11/#12 + memory + finish (build, tests, merge+push) docs (never-staged) small none

T1/T2 touch disjoint projects → dispatchable concurrently (worktree isolation or serial-on-shared-tree per the git-index-race lesson). T3 runs last.

Done =

Build clean + dotnet test green (OpcUaServer.Tests + Client.Shared.Tests) + backlog/memory reconciled + merged to master + pushed.