Files
lmxopcua/docs/plans/2026-06-26-otopcua-fixedtree-followups-design.md
T

14 KiB
Raw Blame History

OtOpcUa — FixedTree-injection follow-ups (design)

Date: 2026-06-26 Status: Implemented (2026-06-26) — all five follow-ups (AE) built via subagent-driven development (16 commits c2c368dc..0074f37a on feat/focas-fixedtree-equipment-injection; every task spec+code reviewed, high-risk tasks with serial Opus reviews). Offline suites green: Runtime.Tests 331, OpcUaServer.Tests 319, FOCAS 248 + AbLegacy/composer additions; dotnet build 0 errors, production src/ 0 warnings (TreatWarningsAsErrors). No DB migration and no deployment-artifact wire-format change were needed (E is projection-only — the columns + Devices array were already serialized). Live wonder re-validation of the single-device FOCAS path is optional/user-gated (the base feature's live path is unchanged by these follow-ups). Companion to: 2026-06-26-otopcua-fixedtree-equipment-injection-design.md (the base feature — built + live-validated on wonder-app-vd03). This design works through the follow-ups that feature's review chain surfaced. Branch: continue on feat/focas-fixedtree-equipment-injection (stacked on fix/focas-poll-io-serialization, local/unpushed — standing rule is "commit/push only when asked").


Scope (locked with the user 2026-06-26)

The user selected all five items below. The base feature's explicit non-goals — discovered-alarm injection and writable discovered nodes — remain out of scope (locked design decisions, untouched).

# Follow-up Size Notes
A Hardcoded 30 s discovery timeout → injectable trivial behavior-preserving
B Re-discovery opt-in/policy gate per driver moderate back-compat default
C Config-unchanged driver→equipment rebind re-triggers discovery moderate reverses a deliberate won't-fix
D De-dup the double SetDesiredSubscriptions during redeploy small one extra unsub/resub blip today
E Lift the ≥1-authored-tag requirement + multi-device-per-driver largest projection-only, no DB migration

Key discovery that shapes E

The Config-DB already models the equipment→driver(→device) association as first-class data — no schema change is needed for E:

  • Equipment.DriverInstanceId (string?, made nullable by migration 20260608104706_NullableEquipmentDriverInstanceId)
  • Equipment.DeviceId (string?) — FK to a multi-device driver's device
  • Device is a first-class entity (DeviceId, DriverInstanceId, schemaless DeviceConfig JSON with host)
  • The AdminUI equipment editor already exposes an optional driver pick (EquipmentInput.DriverInstanceId)

The only gap is the runtime projection: EquipmentNode is (EquipmentId, DisplayName, UnsLineId) and drops DriverInstanceId/DeviceId, so the injector (DriverHostActor.HandleDiscoveredNodes) can only resolve the equipment by inferring it from authored EquipmentTags — hence the ≥1-tag requirement. E closes that gap in the projection + resolver, not the schema.


A. Discovery timeout → injectable

DriverInstanceActor.HandleRediscoverAsync hardcodes new CancellationTokenSource(TimeSpan.FromSeconds(30)) while the rediscover interval + attempt-cap are already constructor parameters. Add a rediscoverDiscoverTimeout (TimeSpan, default TimeSpan.FromSeconds(30)) to the ctor and the Props factory; use the field instead of the literal. Pure consistency fix; default preserves behavior.

B. Re-discovery opt-in / policy gate

Problem: StartDiscovery() runs the bounded retry loop for every ITagDiscovery driver on every (re)connect. FOCAS needs it (its FixedTreeCache fills ~02 s after connect, so a single early pass would capture an empty/partial tree). A driver that browses its full shape synchronously inside DiscoverAsync (OpcUaClient, TwinCAT, AB) needs at most one pass — the 15×2 s retry is wasted (potentially heavy) network I/O.

Decision: a per-driver policy, declared in code (driver "heaviness" is a property of the driver type, needs no DB/AdminUI plumbing, lowest risk). Add a default-implemented member to ITagDiscovery:

public enum DiscoveryRediscoverPolicy { UntilStable, Once, Never }

public interface ITagDiscovery
{
    DiscoveryRediscoverPolicy RediscoverPolicy => DiscoveryRediscoverPolicy.UntilStable; // default = today's behavior
    Task DiscoverAsync(IAddressSpaceBuilder builder, CancellationToken cancellationToken);
}
  • UntilStable (default, unchanged) — today's loop: retry every _rediscoverInterval up to _rediscoverMaxAttempts or until the captured signature is non-empty and stable.
  • Once — kick exactly one discovery pass on connect, emit one DiscoveredNodesReady, then stop.
  • Never — no post-connect discovery kick at all.

Driver assignments:

  • FOCASUntilStable (explicit; it genuinely needs the retry).
  • OpcUaClient, TwinCAT, AbCip, AbLegacyOnce (they discover synchronously in DiscoverAsync; one pass on connect injects their tree, the retry loop only added cost). Any driver not overriding the default keeps UntilStable, so this is a strict no-regression change.

Mechanism: DriverInstanceActor.StartDiscovery reads ((ITagDiscovery)_driver).RediscoverPolicy. Never → return without scheduling. Once/UntilStable → schedule the first RediscoverTick; HandleRediscoverAsync stops after the first pass when the policy is Once (instead of evaluating stop-on-stable).

Alternative considered (rejected): a per-instance JSON flag parsed by the host + AdminUI. More flexible but adds artifact/AdminUI plumbing for a knob whose correct value is type-uniform.

C. Config-unchanged rebind re-triggers discovery

Problem: when a redeploy rebinds a driver to a new equipment without a DriverConfig change, PushDesiredSubscriptions' re-inject tail correctly drops the stale cached plan (a stale EQ-1-scoped graft under EQ-2 would be worse), but ReconcileDrivers only restarts a child on a DriverConfig change — so a config-unchanged child is never reconnected and the FixedTree stays absent under the new equipment until the driver's next natural reconnect/restart.

The base feature deliberately did not add a re-trigger here, to avoid coupling the subscription pass to driver-lifecycle control. This follow-up reverses that — but cleanly, because the trigger is a discovery action, not lifecycle control (no stop/restart), and it is idempotent.

Decision: add a DriverInstanceActor.TriggerRediscovery message. In the re-inject tail, the two branches that Remove a cached plan because of a rebind/loss also Tell that driver's child TriggerRediscovery. The child kicks a fresh RediscoverTick (current _initGeneration) iff it is in the Connected state; otherwise it no-ops (its eventual reconnect re-discovers anyway). The discovery pass re-emits DiscoveredNodesReady, which resolves against the new composition (_lastComposition) and grafts under the new equipment. The re-trigger honors B's policy (Never drivers do not re-discover; Once/UntilStable run their normal pass(es)).

Update the inline comment at the drop site and the follow-up note in the base design doc to record the new behavior.

D. De-dup the double SetDesiredSubscriptions

Problem: during an in-process redeploy, a cached driver receives two SetDesiredSubscriptions: the bulk authored-only send in PushDesiredSubscriptions, then the authoreddiscovered union from ApplyDiscoveredPlan (the re-inject tail). The first send forces the child to unsubscribe the whole handle (authored tags included) then the second re-subscribes — one extra blip per cached driver per redeploy.

Decision: in the bulk loop, skip the send for any driver that has a _discoveredByDriver entry — the re-inject tail sends their complete union. Critical fallback: the re-inject tail can still drop a cached plan (rebind/loss, see C); when it does, it must send the authored-only set for that driver so its authored subscriptions are not lost. Net invariant: every driver receives exactly one SetDesiredSubscriptions per redeploy.

E. Lift the ≥1-authored-tag requirement + multi-device

No DB migration — projection + resolver only.

E1 — projection (AddressSpaceComposer). Extend EquipmentNode:

public sealed record EquipmentNode(
    string EquipmentId,
    string DisplayName,
    string UnsLineId,
    string? DriverInstanceId = null,   // from Equipment.DriverInstanceId
    string? DeviceId = null,           // from Equipment.DeviceId
    string? DeviceHost = null);        // resolved at projection time: Equipment.DeviceId -> Device.DeviceConfig host

DriverInstanceId/DeviceId are copied straight off the Equipment row. DeviceHost is resolved by joining Equipment.DeviceId → Device and parsing the host out of that Device's schemaless DeviceConfig JSON, so the resolver can match it against a discovered device-host folder without re-reading the DB. All three are nullable; existing single-equipment behavior is unaffected when they're null.

E2 — resolver (DriverHostActor.HandleDiscoveredNodes, and the redeploy re-inject tail). Replace tag-only equipment resolution with:

candidates = { equipments where EquipmentNode.DriverInstanceId == driverId } { equipments inferred from authored EquipmentTags for driverId } (keeps today's path working)

  • 0 candidates → log Info, skip (unchanged).
  • 1 candidate → graft all discovered nodes under it (today's single-device behavior, now also works with zero authored tags because the equipment-level DriverInstanceId resolves it). The device-host folder is still collapsed (single device).
  • >1 candidates (multi-device) → partition discovered nodes by their device-host folder segment (DiscoveredNode.FolderPathSegments[1]) and graft each device's subtree under the equipment whose DeviceHost matches that segment. A device-host with no matching equipment is warn-skipped (its subtree is not grafted) rather than mis-grafted. The mapper's existing device-host collapse already disables itself when ≥2 distinct device-host segments are present, so multi-device paths retain the device-host level and don't collide.

⚠️ Implementation risk (E2 multi-device only): the partition join is a host string match — the driver's emitted device-host folder segment (FOCAS uses device.HostAddress) must equal the equipment's projected DeviceHost (parsed from DeviceConfig). Both ultimately derive from the same device configuration, but the string forms must be normalized to match (e.g. host:port). The warn-skip fallback makes a mismatch safe (no mis-graft, authored tags + single-device paths unaffected); a normalization helper + a unit test pin the formats together. Single-device deployments (the validated FOCAS z-34184 case) take the "1 candidate" path and are not exposed to this risk.


Data flow (unchanged)

E changes only which equipment a discovered node is grafted under and whether an equipment with no authored tags participates. Once the NodeId is assigned, the materialize → subscribe → poll → push value path is exactly the base feature's path; B/C/D change when/how often discovery runs and how many subscription pushes occur, not the value path.

Error handling

  • B Never driver → no discovery, authored tags unaffected.
  • C re-trigger on a non-Connected child → no-op (safe; reconnect re-discovers).
  • D dropped-plan fallback → authored-only send, so a rebind/loss never strands a driver's authored subscriptions.
  • E multi-device unmatched device-host → Warning + skip that device's subtree; other devices + authored tags unaffected. >1 candidate with no DeviceHost data anywhere → falls back to the base feature's warn+skip (no regression).

Testing

  • A: ctor/Props wires the timeout; default is 30 s (assert via a short injected timeout in an existing rediscover test).
  • B: Never → no DiscoveredNodesReady; Once → exactly one even when the captured set would keep growing; UntilStable → today's loop (regression). FOCAS reports UntilStable; the four network drivers report Once.
  • C: rebind drop branch Tells TriggerRediscovery; Connected child re-discovers and re-emits; non-Connected child no-ops; re-trigger respects a Never policy.
  • D: single-send invariant — a cached driver gets exactly one SetDesiredSubscriptions on redeploy (union when applied; authored-only when the plan is dropped).
  • E1: EquipmentNode projection carries DriverInstanceId/DeviceId/DeviceHost; DeviceHost resolves via the Device join + DeviceConfig host parse; nulls when unset.
  • E2: tag-less graft (driver-level link, 0 authored tags); single-candidate unchanged (collapse retained); multi-device partition maps each device-host to the right equipment; unmatched device-host → warn-skip; host-string normalization.
  • Regression: Runtime.Tests, OpcUaServer.Tests, and the FOCAS suite stay green; the validated single-device FOCAS injection path is unchanged.
  • Live wonder re-validation of the single-device FOCAS path is optional and user-gated (the base feature is already live-validated; these follow-ups don't alter that path's runtime behavior).

Scope / non-goals

  • In: AE above.
  • Out (still locked): discovered-alarm injection; writable discovered nodes.

Touched code (anticipated)

  • src/Core/ZB.MOM.WW.OtOpcUa.Core.Abstractions/ITagDiscovery.csDiscoveryRediscoverPolicy enum + default member.
  • Driver classes (FocasDriver, OpcUaClientDriver, TwinCATDriver, AbCipDriver, AbLegacyDriver) — override RediscoverPolicy.
  • src/Server/ZB.MOM.WW.OtOpcUa.Runtime/Drivers/DriverInstanceActor.cs — injectable timeout (A); policy-gated StartDiscovery/HandleRediscoverAsync (B); TriggerRediscovery message (C).
  • src/Server/ZB.MOM.WW.OtOpcUa.Runtime/Drivers/DriverHostActor.cs — resolver union + multi-device partition (E2); re-trigger on rebind drop (C); bulk-send skip + dropped-plan fallback (D).
  • src/Server/ZB.MOM.WW.OtOpcUa.OpcUaServer/AddressSpaceComposer.csEquipmentNode projection (E1).
  • src/Server/ZB.MOM.WW.OtOpcUa.Runtime/Drivers/DiscoveredNodeMapper.cs — multi-device partition support (E2).
  • Tests under tests/.../Runtime.Tests and tests/.../OpcUaServer.Tests.

Task tracking

Implementation tasks to be generated by writing-plans from this design.