Files
lmxopcua/docs/plans/2026-06-26-otopcua-fixedtree-followups-design.md
T

235 lines
14 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# OtOpcUa — FixedTree-injection follow-ups (design)
**Date:** 2026-06-26
**Status:** ✅ Implemented (2026-06-26) — all five follow-ups (AE) built via subagent-driven development
(16 commits `c2c368dc`..`0074f37a` on `feat/focas-fixedtree-equipment-injection`; every task spec+code
reviewed, high-risk tasks with serial Opus reviews). Offline suites green: Runtime.Tests 331, OpcUaServer.Tests 319,
FOCAS 248 + AbLegacy/composer additions; `dotnet build` 0 errors, production `src/` 0 warnings (TreatWarningsAsErrors).
No DB migration and no deployment-artifact wire-format change were needed (E is projection-only — the columns +
`Devices` array were already serialized). Live wonder re-validation of the single-device FOCAS path is optional/user-gated
(the base feature's live path is unchanged by these follow-ups).
**Companion to:** [`2026-06-26-otopcua-fixedtree-equipment-injection-design.md`](2026-06-26-otopcua-fixedtree-equipment-injection-design.md)
(the base feature — ✅ built + live-validated on `wonder-app-vd03`). This design works through the
follow-ups that feature's review chain surfaced.
**Branch:** continue on `feat/focas-fixedtree-equipment-injection` (stacked on `fix/focas-poll-io-serialization`,
local/unpushed — standing rule is "commit/push only when asked").
---
## Scope (locked with the user 2026-06-26)
The user selected **all five** items below. The base feature's explicit non-goals — discovered-**alarm**
injection and **writable** discovered nodes — remain out of scope (locked design decisions, untouched).
| # | Follow-up | Size | Notes |
|---|---|---|---|
| A | Hardcoded 30 s discovery timeout → injectable | trivial | behavior-preserving |
| B | Re-discovery opt-in/policy gate per driver | moderate | back-compat default |
| C | Config-unchanged driver→equipment rebind re-triggers discovery | moderate | reverses a deliberate `won't-fix` |
| D | De-dup the double `SetDesiredSubscriptions` during redeploy | small | one extra unsub/resub blip today |
| E | Lift the ≥1-authored-tag requirement + multi-device-per-driver | largest | **projection-only, no DB migration** |
## Key discovery that shapes E
The Config-DB **already** models the equipment→driver(→device) association as first-class data — no schema
change is needed for E:
- `Equipment.DriverInstanceId` (`string?`, made nullable by migration `20260608104706_NullableEquipmentDriverInstanceId`)
- `Equipment.DeviceId` (`string?`) — FK to a multi-device driver's device
- `Device` is a first-class entity (`DeviceId`, `DriverInstanceId`, schemaless `DeviceConfig` JSON with host)
- The AdminUI equipment editor already exposes an optional driver pick (`EquipmentInput.DriverInstanceId`)
The **only** gap is the runtime projection: `EquipmentNode` is `(EquipmentId, DisplayName, UnsLineId)` and drops
`DriverInstanceId`/`DeviceId`, so the injector (`DriverHostActor.HandleDiscoveredNodes`) can only resolve the
equipment by inferring it from authored `EquipmentTags` — hence the ≥1-tag requirement. E closes that gap in the
projection + resolver, not the schema.
---
## A. Discovery timeout → injectable
`DriverInstanceActor.HandleRediscoverAsync` hardcodes `new CancellationTokenSource(TimeSpan.FromSeconds(30))`
while the rediscover interval + attempt-cap are already constructor parameters. Add a
`rediscoverDiscoverTimeout` (`TimeSpan`, default `TimeSpan.FromSeconds(30)`) to the ctor and the `Props`
factory; use the field instead of the literal. Pure consistency fix; default preserves behavior.
## B. Re-discovery opt-in / policy gate
**Problem:** `StartDiscovery()` runs the bounded retry loop for **every** `ITagDiscovery` driver on every
(re)connect. FOCAS needs it (its `FixedTreeCache` fills ~02 s *after* connect, so a single early pass would
capture an empty/partial tree). A driver that browses its full shape **synchronously** inside `DiscoverAsync`
(OpcUaClient, TwinCAT, AB) needs at most **one** pass — the 15×2 s retry is wasted (potentially heavy) network I/O.
**Decision:** a per-driver **policy**, declared in code (driver "heaviness" is a property of the driver *type*,
needs no DB/AdminUI plumbing, lowest risk). Add a default-implemented member to `ITagDiscovery`:
```csharp
public enum DiscoveryRediscoverPolicy { UntilStable, Once, Never }
public interface ITagDiscovery
{
DiscoveryRediscoverPolicy RediscoverPolicy => DiscoveryRediscoverPolicy.UntilStable; // default = today's behavior
Task DiscoverAsync(IAddressSpaceBuilder builder, CancellationToken cancellationToken);
}
```
- **`UntilStable`** (default, unchanged) — today's loop: retry every `_rediscoverInterval` up to
`_rediscoverMaxAttempts` or until the captured signature is non-empty and stable.
- **`Once`** — kick exactly one discovery pass on connect, emit one `DiscoveredNodesReady`, then stop.
- **`Never`** — no post-connect discovery kick at all.
**Driver assignments:**
- **FOCAS** → `UntilStable` (explicit; it genuinely needs the retry).
- **OpcUaClient, TwinCAT, AbCip, AbLegacy** → `Once` (they discover synchronously in `DiscoverAsync`; one pass
on connect injects their tree, the retry loop only added cost). Any driver not overriding the default keeps
`UntilStable`, so this is a strict no-regression change.
**Mechanism:** `DriverInstanceActor.StartDiscovery` reads `((ITagDiscovery)_driver).RediscoverPolicy`. `Never`
→ return without scheduling. `Once`/`UntilStable` → schedule the first `RediscoverTick`; `HandleRediscoverAsync`
stops after the first pass when the policy is `Once` (instead of evaluating stop-on-stable).
**Alternative considered (rejected):** a per-instance JSON flag parsed by the host + AdminUI. More flexible but
adds artifact/AdminUI plumbing for a knob whose correct value is type-uniform.
## C. Config-unchanged rebind re-triggers discovery
**Problem:** when a redeploy rebinds a driver to a new equipment **without** a `DriverConfig` change,
`PushDesiredSubscriptions`' re-inject tail correctly **drops** the stale cached plan (a stale `EQ-1`-scoped graft
under `EQ-2` would be worse), but `ReconcileDrivers` only restarts a child on a `DriverConfig` change — so a
config-unchanged child is never reconnected and the FixedTree stays absent under the new equipment until the
driver's next natural reconnect/restart.
The base feature deliberately did **not** add a re-trigger here, to avoid coupling the subscription pass to
driver-lifecycle control. This follow-up reverses that — but cleanly, because the trigger is a **discovery**
action, not lifecycle control (no stop/restart), and it is idempotent.
**Decision:** add a `DriverInstanceActor.TriggerRediscovery` message. In the re-inject tail, the two branches
that `Remove` a cached plan because of a rebind/loss also `Tell` that driver's child `TriggerRediscovery`. The
child kicks a fresh `RediscoverTick` (current `_initGeneration`) **iff it is in the `Connected` state**;
otherwise it no-ops (its eventual reconnect re-discovers anyway). The discovery pass re-emits
`DiscoveredNodesReady`, which resolves against the **new** composition (`_lastComposition`) and grafts under the
new equipment. The re-trigger honors B's policy (`Never` drivers do not re-discover; `Once`/`UntilStable` run
their normal pass(es)).
Update the inline comment at the drop site and the follow-up note in the base design doc to record the new
behavior.
## D. De-dup the double `SetDesiredSubscriptions`
**Problem:** during an in-process redeploy, a cached driver receives two `SetDesiredSubscriptions`:
the bulk authored-only send in `PushDesiredSubscriptions`, then the authoreddiscovered union from
`ApplyDiscoveredPlan` (the re-inject tail). The first send forces the child to unsubscribe the whole handle
(authored tags included) then the second re-subscribes — one extra blip per cached driver per redeploy.
**Decision:** in the bulk loop, **skip** the send for any driver that has a `_discoveredByDriver` entry — the
re-inject tail sends their complete union. **Critical fallback:** the re-inject tail can still *drop* a cached
plan (rebind/loss, see C); when it does, it must send the **authored-only** set for that driver so its authored
subscriptions are not lost. Net invariant: every driver receives exactly **one** `SetDesiredSubscriptions` per
redeploy.
## E. Lift the ≥1-authored-tag requirement + multi-device
No DB migration — projection + resolver only.
**E1 — projection (`AddressSpaceComposer`).** Extend `EquipmentNode`:
```csharp
public sealed record EquipmentNode(
string EquipmentId,
string DisplayName,
string UnsLineId,
string? DriverInstanceId = null, // from Equipment.DriverInstanceId
string? DeviceId = null, // from Equipment.DeviceId
string? DeviceHost = null); // resolved at projection time: Equipment.DeviceId -> Device.DeviceConfig host
```
`DriverInstanceId`/`DeviceId` are copied straight off the `Equipment` row. `DeviceHost` is resolved by joining
`Equipment.DeviceId → Device` and parsing the host out of that `Device`'s schemaless `DeviceConfig` JSON, so the
resolver can match it against a discovered device-host folder without re-reading the DB. All three are nullable;
existing single-equipment behavior is unaffected when they're null.
**E2 — resolver (`DriverHostActor.HandleDiscoveredNodes`, and the redeploy re-inject tail).** Replace tag-only
equipment resolution with:
> **candidates** = { equipments where `EquipmentNode.DriverInstanceId == driverId` } ****
> { equipments inferred from authored `EquipmentTags` for `driverId` } (keeps today's path working)
- **0 candidates** → log Info, skip (unchanged).
- **1 candidate** → graft all discovered nodes under it (today's single-device behavior, now also works with
**zero** authored tags because the equipment-level `DriverInstanceId` resolves it). The device-host folder is
still collapsed (single device).
- **>1 candidates (multi-device)** → partition discovered nodes by their device-host folder segment
(`DiscoveredNode.FolderPathSegments[1]`) and graft each device's subtree under the equipment whose
`DeviceHost` matches that segment. A device-host with no matching equipment is **warn-skipped** (its subtree is
not grafted) rather than mis-grafted. The mapper's existing device-host collapse already disables itself when
≥2 distinct device-host segments are present, so multi-device paths retain the device-host level and don't
collide.
**⚠️ Implementation risk (E2 multi-device only):** the partition join is a **host string** match — the driver's
emitted device-host folder segment (FOCAS uses `device.HostAddress`) must equal the equipment's projected
`DeviceHost` (parsed from `DeviceConfig`). Both ultimately derive from the same device configuration, but the
string forms must be normalized to match (e.g. `host:port`). The warn-skip fallback makes a mismatch *safe* (no
mis-graft, authored tags + single-device paths unaffected); a normalization helper + a unit test pin the formats
together. Single-device deployments (the validated FOCAS `z-34184` case) take the "1 candidate" path and are
**not** exposed to this risk.
---
## Data flow (unchanged)
E changes only *which equipment* a discovered node is grafted under and *whether* an equipment with no authored
tags participates. Once the NodeId is assigned, the materialize → subscribe → poll → push value path is exactly
the base feature's path; B/C/D change *when/how often* discovery runs and *how many* subscription pushes occur,
not the value path.
## Error handling
- B `Never` driver → no discovery, authored tags unaffected.
- C re-trigger on a non-`Connected` child → no-op (safe; reconnect re-discovers).
- D dropped-plan fallback → authored-only send, so a rebind/loss never strands a driver's authored subscriptions.
- E multi-device unmatched device-host → Warning + skip that device's subtree; other devices + authored tags
unaffected. >1 candidate with no `DeviceHost` data anywhere → falls back to the base feature's warn+skip
(no regression).
## Testing
- **A:** ctor/`Props` wires the timeout; default is 30 s (assert via a short injected timeout in an existing
rediscover test).
- **B:** `Never` → no `DiscoveredNodesReady`; `Once` → exactly one even when the captured set would keep growing;
`UntilStable` → today's loop (regression). FOCAS reports `UntilStable`; the four network drivers report `Once`.
- **C:** rebind drop branch `Tell`s `TriggerRediscovery`; `Connected` child re-discovers and re-emits;
non-`Connected` child no-ops; re-trigger respects a `Never` policy.
- **D:** single-send invariant — a cached driver gets exactly one `SetDesiredSubscriptions` on redeploy
(union when applied; authored-only when the plan is dropped).
- **E1:** `EquipmentNode` projection carries `DriverInstanceId`/`DeviceId`/`DeviceHost`; `DeviceHost` resolves via
the `Device` join + `DeviceConfig` host parse; nulls when unset.
- **E2:** tag-less graft (driver-level link, 0 authored tags); single-candidate unchanged (collapse retained);
multi-device partition maps each device-host to the right equipment; unmatched device-host → warn-skip;
host-string normalization.
- **Regression:** Runtime.Tests, OpcUaServer.Tests, and the FOCAS suite stay green; the validated single-device
FOCAS injection path is unchanged.
- Live wonder re-validation of the single-device FOCAS path is **optional** and user-gated (the base feature is
already live-validated; these follow-ups don't alter that path's runtime behavior).
## Scope / non-goals
- **In:** AE above.
- **Out (still locked):** discovered-**alarm** injection; **writable** discovered nodes.
## Touched code (anticipated)
- `src/Core/ZB.MOM.WW.OtOpcUa.Core.Abstractions/ITagDiscovery.cs``DiscoveryRediscoverPolicy` enum + default member.
- Driver classes (`FocasDriver`, `OpcUaClientDriver`, `TwinCATDriver`, `AbCipDriver`, `AbLegacyDriver`) — override
`RediscoverPolicy`.
- `src/Server/ZB.MOM.WW.OtOpcUa.Runtime/Drivers/DriverInstanceActor.cs` — injectable timeout (A); policy-gated
`StartDiscovery`/`HandleRediscoverAsync` (B); `TriggerRediscovery` message (C).
- `src/Server/ZB.MOM.WW.OtOpcUa.Runtime/Drivers/DriverHostActor.cs` — resolver union + multi-device partition (E2);
re-trigger on rebind drop (C); bulk-send skip + dropped-plan fallback (D).
- `src/Server/ZB.MOM.WW.OtOpcUa.OpcUaServer/AddressSpaceComposer.cs``EquipmentNode` projection (E1).
- `src/Server/ZB.MOM.WW.OtOpcUa.Runtime/Drivers/DiscoveredNodeMapper.cs` — multi-device partition support (E2).
- Tests under `tests/.../Runtime.Tests` and `tests/.../OpcUaServer.Tests`.
## Task tracking
Implementation tasks to be generated by writing-plans from this design.