docs(otopcua): record FixedTree follow-ups A-E as implemented (design, plan, RESUME)

This commit is contained in:
Joseph Doherty
2026-06-26 15:19:46 -04:00
parent 0074f37a64
commit 1058542d80
5 changed files with 734 additions and 6 deletions
@@ -0,0 +1,234 @@
# OtOpcUa — FixedTree-injection follow-ups (design)
**Date:** 2026-06-26
**Status:** ✅ Implemented (2026-06-26) — all five follow-ups (AE) built via subagent-driven development
(16 commits `c2c368dc`..`0074f37a` on `feat/focas-fixedtree-equipment-injection`; every task spec+code
reviewed, high-risk tasks with serial Opus reviews). Offline suites green: Runtime.Tests 331, OpcUaServer.Tests 319,
FOCAS 248 + AbLegacy/composer additions; `dotnet build` 0 errors, production `src/` 0 warnings (TreatWarningsAsErrors).
No DB migration and no deployment-artifact wire-format change were needed (E is projection-only — the columns +
`Devices` array were already serialized). Live wonder re-validation of the single-device FOCAS path is optional/user-gated
(the base feature's live path is unchanged by these follow-ups).
**Companion to:** [`2026-06-26-otopcua-fixedtree-equipment-injection-design.md`](2026-06-26-otopcua-fixedtree-equipment-injection-design.md)
(the base feature — ✅ built + live-validated on `wonder-app-vd03`). This design works through the
follow-ups that feature's review chain surfaced.
**Branch:** continue on `feat/focas-fixedtree-equipment-injection` (stacked on `fix/focas-poll-io-serialization`,
local/unpushed — standing rule is "commit/push only when asked").
---
## Scope (locked with the user 2026-06-26)
The user selected **all five** items below. The base feature's explicit non-goals — discovered-**alarm**
injection and **writable** discovered nodes — remain out of scope (locked design decisions, untouched).
| # | Follow-up | Size | Notes |
|---|---|---|---|
| A | Hardcoded 30 s discovery timeout → injectable | trivial | behavior-preserving |
| B | Re-discovery opt-in/policy gate per driver | moderate | back-compat default |
| C | Config-unchanged driver→equipment rebind re-triggers discovery | moderate | reverses a deliberate `won't-fix` |
| D | De-dup the double `SetDesiredSubscriptions` during redeploy | small | one extra unsub/resub blip today |
| E | Lift the ≥1-authored-tag requirement + multi-device-per-driver | largest | **projection-only, no DB migration** |
## Key discovery that shapes E
The Config-DB **already** models the equipment→driver(→device) association as first-class data — no schema
change is needed for E:
- `Equipment.DriverInstanceId` (`string?`, made nullable by migration `20260608104706_NullableEquipmentDriverInstanceId`)
- `Equipment.DeviceId` (`string?`) — FK to a multi-device driver's device
- `Device` is a first-class entity (`DeviceId`, `DriverInstanceId`, schemaless `DeviceConfig` JSON with host)
- The AdminUI equipment editor already exposes an optional driver pick (`EquipmentInput.DriverInstanceId`)
The **only** gap is the runtime projection: `EquipmentNode` is `(EquipmentId, DisplayName, UnsLineId)` and drops
`DriverInstanceId`/`DeviceId`, so the injector (`DriverHostActor.HandleDiscoveredNodes`) can only resolve the
equipment by inferring it from authored `EquipmentTags` — hence the ≥1-tag requirement. E closes that gap in the
projection + resolver, not the schema.
---
## A. Discovery timeout → injectable
`DriverInstanceActor.HandleRediscoverAsync` hardcodes `new CancellationTokenSource(TimeSpan.FromSeconds(30))`
while the rediscover interval + attempt-cap are already constructor parameters. Add a
`rediscoverDiscoverTimeout` (`TimeSpan`, default `TimeSpan.FromSeconds(30)`) to the ctor and the `Props`
factory; use the field instead of the literal. Pure consistency fix; default preserves behavior.
## B. Re-discovery opt-in / policy gate
**Problem:** `StartDiscovery()` runs the bounded retry loop for **every** `ITagDiscovery` driver on every
(re)connect. FOCAS needs it (its `FixedTreeCache` fills ~02 s *after* connect, so a single early pass would
capture an empty/partial tree). A driver that browses its full shape **synchronously** inside `DiscoverAsync`
(OpcUaClient, TwinCAT, AB) needs at most **one** pass — the 15×2 s retry is wasted (potentially heavy) network I/O.
**Decision:** a per-driver **policy**, declared in code (driver "heaviness" is a property of the driver *type*,
needs no DB/AdminUI plumbing, lowest risk). Add a default-implemented member to `ITagDiscovery`:
```csharp
public enum DiscoveryRediscoverPolicy { UntilStable, Once, Never }
public interface ITagDiscovery
{
DiscoveryRediscoverPolicy RediscoverPolicy => DiscoveryRediscoverPolicy.UntilStable; // default = today's behavior
Task DiscoverAsync(IAddressSpaceBuilder builder, CancellationToken cancellationToken);
}
```
- **`UntilStable`** (default, unchanged) — today's loop: retry every `_rediscoverInterval` up to
`_rediscoverMaxAttempts` or until the captured signature is non-empty and stable.
- **`Once`** — kick exactly one discovery pass on connect, emit one `DiscoveredNodesReady`, then stop.
- **`Never`** — no post-connect discovery kick at all.
**Driver assignments:**
- **FOCAS** → `UntilStable` (explicit; it genuinely needs the retry).
- **OpcUaClient, TwinCAT, AbCip, AbLegacy** → `Once` (they discover synchronously in `DiscoverAsync`; one pass
on connect injects their tree, the retry loop only added cost). Any driver not overriding the default keeps
`UntilStable`, so this is a strict no-regression change.
**Mechanism:** `DriverInstanceActor.StartDiscovery` reads `((ITagDiscovery)_driver).RediscoverPolicy`. `Never`
→ return without scheduling. `Once`/`UntilStable` → schedule the first `RediscoverTick`; `HandleRediscoverAsync`
stops after the first pass when the policy is `Once` (instead of evaluating stop-on-stable).
**Alternative considered (rejected):** a per-instance JSON flag parsed by the host + AdminUI. More flexible but
adds artifact/AdminUI plumbing for a knob whose correct value is type-uniform.
## C. Config-unchanged rebind re-triggers discovery
**Problem:** when a redeploy rebinds a driver to a new equipment **without** a `DriverConfig` change,
`PushDesiredSubscriptions`' re-inject tail correctly **drops** the stale cached plan (a stale `EQ-1`-scoped graft
under `EQ-2` would be worse), but `ReconcileDrivers` only restarts a child on a `DriverConfig` change — so a
config-unchanged child is never reconnected and the FixedTree stays absent under the new equipment until the
driver's next natural reconnect/restart.
The base feature deliberately did **not** add a re-trigger here, to avoid coupling the subscription pass to
driver-lifecycle control. This follow-up reverses that — but cleanly, because the trigger is a **discovery**
action, not lifecycle control (no stop/restart), and it is idempotent.
**Decision:** add a `DriverInstanceActor.TriggerRediscovery` message. In the re-inject tail, the two branches
that `Remove` a cached plan because of a rebind/loss also `Tell` that driver's child `TriggerRediscovery`. The
child kicks a fresh `RediscoverTick` (current `_initGeneration`) **iff it is in the `Connected` state**;
otherwise it no-ops (its eventual reconnect re-discovers anyway). The discovery pass re-emits
`DiscoveredNodesReady`, which resolves against the **new** composition (`_lastComposition`) and grafts under the
new equipment. The re-trigger honors B's policy (`Never` drivers do not re-discover; `Once`/`UntilStable` run
their normal pass(es)).
Update the inline comment at the drop site and the follow-up note in the base design doc to record the new
behavior.
## D. De-dup the double `SetDesiredSubscriptions`
**Problem:** during an in-process redeploy, a cached driver receives two `SetDesiredSubscriptions`:
the bulk authored-only send in `PushDesiredSubscriptions`, then the authoreddiscovered union from
`ApplyDiscoveredPlan` (the re-inject tail). The first send forces the child to unsubscribe the whole handle
(authored tags included) then the second re-subscribes — one extra blip per cached driver per redeploy.
**Decision:** in the bulk loop, **skip** the send for any driver that has a `_discoveredByDriver` entry — the
re-inject tail sends their complete union. **Critical fallback:** the re-inject tail can still *drop* a cached
plan (rebind/loss, see C); when it does, it must send the **authored-only** set for that driver so its authored
subscriptions are not lost. Net invariant: every driver receives exactly **one** `SetDesiredSubscriptions` per
redeploy.
## E. Lift the ≥1-authored-tag requirement + multi-device
No DB migration — projection + resolver only.
**E1 — projection (`AddressSpaceComposer`).** Extend `EquipmentNode`:
```csharp
public sealed record EquipmentNode(
string EquipmentId,
string DisplayName,
string UnsLineId,
string? DriverInstanceId = null, // from Equipment.DriverInstanceId
string? DeviceId = null, // from Equipment.DeviceId
string? DeviceHost = null); // resolved at projection time: Equipment.DeviceId -> Device.DeviceConfig host
```
`DriverInstanceId`/`DeviceId` are copied straight off the `Equipment` row. `DeviceHost` is resolved by joining
`Equipment.DeviceId → Device` and parsing the host out of that `Device`'s schemaless `DeviceConfig` JSON, so the
resolver can match it against a discovered device-host folder without re-reading the DB. All three are nullable;
existing single-equipment behavior is unaffected when they're null.
**E2 — resolver (`DriverHostActor.HandleDiscoveredNodes`, and the redeploy re-inject tail).** Replace tag-only
equipment resolution with:
> **candidates** = { equipments where `EquipmentNode.DriverInstanceId == driverId` } ****
> { equipments inferred from authored `EquipmentTags` for `driverId` } (keeps today's path working)
- **0 candidates** → log Info, skip (unchanged).
- **1 candidate** → graft all discovered nodes under it (today's single-device behavior, now also works with
**zero** authored tags because the equipment-level `DriverInstanceId` resolves it). The device-host folder is
still collapsed (single device).
- **>1 candidates (multi-device)** → partition discovered nodes by their device-host folder segment
(`DiscoveredNode.FolderPathSegments[1]`) and graft each device's subtree under the equipment whose
`DeviceHost` matches that segment. A device-host with no matching equipment is **warn-skipped** (its subtree is
not grafted) rather than mis-grafted. The mapper's existing device-host collapse already disables itself when
≥2 distinct device-host segments are present, so multi-device paths retain the device-host level and don't
collide.
**⚠️ Implementation risk (E2 multi-device only):** the partition join is a **host string** match — the driver's
emitted device-host folder segment (FOCAS uses `device.HostAddress`) must equal the equipment's projected
`DeviceHost` (parsed from `DeviceConfig`). Both ultimately derive from the same device configuration, but the
string forms must be normalized to match (e.g. `host:port`). The warn-skip fallback makes a mismatch *safe* (no
mis-graft, authored tags + single-device paths unaffected); a normalization helper + a unit test pin the formats
together. Single-device deployments (the validated FOCAS `z-34184` case) take the "1 candidate" path and are
**not** exposed to this risk.
---
## Data flow (unchanged)
E changes only *which equipment* a discovered node is grafted under and *whether* an equipment with no authored
tags participates. Once the NodeId is assigned, the materialize → subscribe → poll → push value path is exactly
the base feature's path; B/C/D change *when/how often* discovery runs and *how many* subscription pushes occur,
not the value path.
## Error handling
- B `Never` driver → no discovery, authored tags unaffected.
- C re-trigger on a non-`Connected` child → no-op (safe; reconnect re-discovers).
- D dropped-plan fallback → authored-only send, so a rebind/loss never strands a driver's authored subscriptions.
- E multi-device unmatched device-host → Warning + skip that device's subtree; other devices + authored tags
unaffected. >1 candidate with no `DeviceHost` data anywhere → falls back to the base feature's warn+skip
(no regression).
## Testing
- **A:** ctor/`Props` wires the timeout; default is 30 s (assert via a short injected timeout in an existing
rediscover test).
- **B:** `Never` → no `DiscoveredNodesReady`; `Once` → exactly one even when the captured set would keep growing;
`UntilStable` → today's loop (regression). FOCAS reports `UntilStable`; the four network drivers report `Once`.
- **C:** rebind drop branch `Tell`s `TriggerRediscovery`; `Connected` child re-discovers and re-emits;
non-`Connected` child no-ops; re-trigger respects a `Never` policy.
- **D:** single-send invariant — a cached driver gets exactly one `SetDesiredSubscriptions` on redeploy
(union when applied; authored-only when the plan is dropped).
- **E1:** `EquipmentNode` projection carries `DriverInstanceId`/`DeviceId`/`DeviceHost`; `DeviceHost` resolves via
the `Device` join + `DeviceConfig` host parse; nulls when unset.
- **E2:** tag-less graft (driver-level link, 0 authored tags); single-candidate unchanged (collapse retained);
multi-device partition maps each device-host to the right equipment; unmatched device-host → warn-skip;
host-string normalization.
- **Regression:** Runtime.Tests, OpcUaServer.Tests, and the FOCAS suite stay green; the validated single-device
FOCAS injection path is unchanged.
- Live wonder re-validation of the single-device FOCAS path is **optional** and user-gated (the base feature is
already live-validated; these follow-ups don't alter that path's runtime behavior).
## Scope / non-goals
- **In:** AE above.
- **Out (still locked):** discovered-**alarm** injection; **writable** discovered nodes.
## Touched code (anticipated)
- `src/Core/ZB.MOM.WW.OtOpcUa.Core.Abstractions/ITagDiscovery.cs``DiscoveryRediscoverPolicy` enum + default member.
- Driver classes (`FocasDriver`, `OpcUaClientDriver`, `TwinCATDriver`, `AbCipDriver`, `AbLegacyDriver`) — override
`RediscoverPolicy`.
- `src/Server/ZB.MOM.WW.OtOpcUa.Runtime/Drivers/DriverInstanceActor.cs` — injectable timeout (A); policy-gated
`StartDiscovery`/`HandleRediscoverAsync` (B); `TriggerRediscovery` message (C).
- `src/Server/ZB.MOM.WW.OtOpcUa.Runtime/Drivers/DriverHostActor.cs` — resolver union + multi-device partition (E2);
re-trigger on rebind drop (C); bulk-send skip + dropped-plan fallback (D).
- `src/Server/ZB.MOM.WW.OtOpcUa.OpcUaServer/AddressSpaceComposer.cs``EquipmentNode` projection (E1).
- `src/Server/ZB.MOM.WW.OtOpcUa.Runtime/Drivers/DiscoveredNodeMapper.cs` — multi-device partition support (E2).
- Tests under `tests/.../Runtime.Tests` and `tests/.../OpcUaServer.Tests`.
## Task tracking
Implementation tasks to be generated by writing-plans from this design.