# OtOpcUa follow-ons design (2026-06-07) Three independent hardening items surfaced during the Equipment-namespace live-values milestone (`docs/plans/2026-06-07-equipment-namespace-live-values.md`). All live in OtOpcUa. They are independent and ship as separate commits/tasks. --- ## 1. Fix the `DriverInstanceActor` subscribe race (Runtime — real bug) **File:** `src/Server/ZB.MOM.WW.OtOpcUa.Runtime/Drivers/DriverInstanceActor.cs` **Root cause.** `HandleSubscribeAsync` (dispatched via `ReceiveAsync`) captures `Sender`/`Self` on lines 362-363 — *after* the conditional `await UnsubscribeAsync().ConfigureAwait(false)` on line 359. `ConfigureAwait(false)` opts out of Akka's `ActorTaskScheduler`, so in the re-subscribe path (`_subscriptionHandle is not null`, which fires on every deploy re-apply and on bootstrap restore) the post-await `Sender` read (`Sender` = `Context.Sender`) runs on a thread-pool thread with no active actor context → `System.NotSupportedException: There is no active ActorContext`. It is provoked further when the actor is stopping mid-await (a Subscribe lands during a deploy/restore reconcile while the driver is being torn down) — observed live as `GalaxyDriver … shutting down` immediately followed by the error. Recovers on a clean node restart, which is why a non-raced bootstrap subscribed cleanly. **Fix (two parts).** 1. **Capture before await.** Move `var replyTo = Sender; var self = Self;` to the very top of `HandleSubscribeAsync`, before the `_subscriptionHandle is not null` check + its await. Use `replyTo`/`self` everywhere after; never read raw `Sender`/`Self`/`Context` past any await. This removes the failing line. 2. **Drop `.ConfigureAwait(false)` inside the `ReceiveAsync` handlers** (`HandleSubscribeAsync`, `UnsubscribeAsync`, `HandleApplyDeltaAsync`, `HandleWriteAsync`) so continuations resume on the actor's `ActorTaskScheduler` (idiomatic Akka — preserves actor context, handles the stopping actor gracefully). Audit each handler for any other post-await `Context`/`Sender`/`Self` access while doing so. (Note: `HandleApplyDeltaAsync` and `HandleWriteAsync` already capture `replyTo = Sender` before their awaits; only `HandleSubscribeAsync` has the after-await capture — confirm during implementation.) **Out of scope.** The `ReceiveAsync` mailbox-blocking semantics (one message at a time during a long SubscribeAsync) are unchanged — not part of this bug. **Testing.** A TestKit test that subscribes, then subscribes again (drives the line-359 unsubscribe-then-resubscribe path) and asserts `SubscriptionEstablished` is replied with no exception (today this throws). Use a fake `ISubscribable` driver. The stopping-mid-await race is hard to make deterministic; the capture-before-await fix removes the failing access regardless. --- ## 2. Validate Tag↔VirtualTag name collisions at deploy (Configuration) **Files:** `src/Core/ZB.MOM.WW.OtOpcUa.Configuration/Validation/DraftValidator.cs` + the `DraftSnapshot` type + wherever `DraftSnapshot` is built (locate during planning). **Problem.** `Phase7Applier` materialises both Equipment `Tag` variables and `VirtualTag` variables at the folder-scoped NodeId `{EquipmentId}[/{FolderPath}]/{Name}`. `UX_Tag_EquipmentPath` and `UX_VirtualTag_EquipmentPath` each enforce `(EquipmentId, Name)` uniqueness *within their own table*, but there is **no cross-table constraint** — a `Tag` and a `VirtualTag` with the same `(EquipmentId, Name)` (and same/empty FolderPath) would materialise to the same NodeId, and the sink (keyed on NodeId) would keep only one. Not hit today (the company overlay is VirtualTag-only), but latent for mixed equipment. **Change.** - Extend `DraftSnapshot` to carry equipment-bound signal identities: the `(EquipmentId, FolderPath, Name)` of every `Tag` with a non-null `EquipmentId` and every `VirtualTag` (FolderPath = "" for VirtualTags). Populate them where `DraftSnapshot` is constructed. - New rule `DraftValidator.ValidateNoEquipmentSignalNameCollision`: group those signals by the **folder-scoped key** `(EquipmentId, normalize(FolderPath), Name)` — exactly the materialiser's NodeId key — and emit a `ValidationError("EquipmentSignalNameCollision", …)` for any group with count > 1 (carry the EquipmentId + Name in the error). Wire it into `DraftValidator.Validate`. **Why the folder-scoped key.** A Tag under a sub-folder (`{Eq}/sub/Name`) does not collide with a root VirtualTag (`{Eq}/Name`) of the same Name. Keying on the full NodeId path avoids false positives. Galaxy / SystemPlatform Tags (`EquipmentId` null) live in a different node space and are excluded. **Testing.** Validator unit tests: a Tag + a VirtualTag with the same `(EquipmentId, Name)` (empty FolderPath) → `EquipmentSignalNameCollision`; same Name but different FolderPath → OK; distinct names → OK; a SystemPlatform Tag sharing a name with an equipment VirtualTag → OK (excluded). --- ## 3. docker-dev resource hardening (compose/config only — no app code) **File:** `docker-dev/docker-compose.yml` (+ `docker-dev/README.md`). **Problem.** The full single-mesh stack (central-1/2 + 4 site nodes) OOM-killed central-1 on a loaded host (Docker Desktop VM memory cap; `OOMKilled=true`, host load ~12). Two contributors: (a) verbose EF Core `Development` logging — every Deployment-poll `SELECT` + `Executed DbCommand` logged, flooding the dispatcher and starving the cluster heartbeat thread (observed `heartbeat … delayed 4169ms … thread starvation`); (b) no per-container memory limits, so total footprint is unbounded against the VM cap. **Change.** - **Quiet logging:** add `Logging__LogLevel__Microsoft.EntityFrameworkCore=Warning` (and `Logging__LogLevel__Microsoft.AspNetCore=Warning`) to the host services' `environment`. Keep application logs (`Default`/`ZB.MOM.WW`) at Info. This removes the SQL flood and the starvation cascade. - **Per-service memory bounds:** add `mem_limit` (and `mem_reservation`) to each host service (central-1/2, site-a/b), sized by measuring a steady-state node (`docker stats`) and adding headroom. Document in `docker-dev/README.md` the approximate total Docker Desktop VM memory the full stack needs, so a constrained host knows to raise the VM or run fewer nodes. **Coordinate with active docker-dev work.** `master` has recent in-flight docker-dev commits (single-mesh topology, fresh-volume bootstrap). Rebase/merge carefully so these changes don't clobber that work; keep the edits additive (env keys + `mem_limit` lines) where possible. **Testing.** Operational only: bring the full stack up with the changes, confirm no OOM and all nodes serve (`:9200` healthy, `:4840` listening on central-1, galaxy mirror Good). Not unit-testable. --- ## Sequencing & risk | # | Area | Risk | Ships as | |---|---|---|---| | 1 | Runtime actor (`DriverInstanceActor`) | Medium (actor semantics) — small, well-understood | own commit + TestKit test | | 2 | Configuration (`DraftSnapshot` + `DraftValidator`) | Low–medium (new snapshot field + rule) | own commit + validator tests | | 3 | docker-dev compose/README | Low (config only) | own commit; coordinate with active docker-dev work | All three are independent and can be reviewed/merged separately. No cross-dependencies.