docs(design): OtOpcUa follow-ons — subscribe race, signal-name collision validation, docker-dev resources

2026-06-07 10:06:08 -04:00
parent 4c221ce2b3
commit 4cb488c53e
1 changed files with 112 additions and 0 deletions
@@ -0,0 +1,112 @@
+# OtOpcUa follow-ons design (2026-06-07)
+
+Three independent hardening items surfaced during the Equipment-namespace live-values milestone
+(`docs/plans/2026-06-07-equipment-namespace-live-values.md`). All live in OtOpcUa. They are
+independent and ship as separate commits/tasks.
+
+---
+
+## 1. Fix the `DriverInstanceActor` subscribe race (Runtime — real bug)
+
+**File:** `src/Server/ZB.MOM.WW.OtOpcUa.Runtime/Drivers/DriverInstanceActor.cs`
+
+**Root cause.** `HandleSubscribeAsync` (dispatched via `ReceiveAsync<Subscribe>`) captures `Sender`/`Self`
+on lines 362-363 — *after* the conditional `await UnsubscribeAsync().ConfigureAwait(false)` on line 359.
+`ConfigureAwait(false)` opts out of Akka's `ActorTaskScheduler`, so in the re-subscribe path
+(`_subscriptionHandle is not null`, which fires on every deploy re-apply and on bootstrap restore) the
+post-await `Sender` read (`Sender` = `Context.Sender`) runs on a thread-pool thread with no active actor
+context → `System.NotSupportedException: There is no active ActorContext`. It is provoked further when the
+actor is stopping mid-await (a Subscribe lands during a deploy/restore reconcile while the driver is being
+torn down) — observed live as `GalaxyDriver … shutting down` immediately followed by the error. Recovers on
+a clean node restart, which is why a non-raced bootstrap subscribed cleanly.
+
+**Fix (two parts).**
+1. **Capture before await.** Move `var replyTo = Sender; var self = Self;` to the very top of
+   `HandleSubscribeAsync`, before the `_subscriptionHandle is not null` check + its await. Use
+   `replyTo`/`self` everywhere after; never read raw `Sender`/`Self`/`Context` past any await. This removes
+   the failing line.
+2. **Drop `.ConfigureAwait(false)` inside the `ReceiveAsync` handlers** (`HandleSubscribeAsync`,
+   `UnsubscribeAsync`, `HandleApplyDeltaAsync`, `HandleWriteAsync`) so continuations resume on the actor's
+   `ActorTaskScheduler` (idiomatic Akka — preserves actor context, handles the stopping actor gracefully).
+   Audit each handler for any other post-await `Context`/`Sender`/`Self` access while doing so. (Note:
+   `HandleApplyDeltaAsync` and `HandleWriteAsync` already capture `replyTo = Sender` before their awaits;
+   only `HandleSubscribeAsync` has the after-await capture — confirm during implementation.)
+
+**Out of scope.** The `ReceiveAsync` mailbox-blocking semantics (one message at a time during a long
+SubscribeAsync) are unchanged — not part of this bug.
+
+**Testing.** A TestKit test that subscribes, then subscribes again (drives the line-359
+unsubscribe-then-resubscribe path) and asserts `SubscriptionEstablished` is replied with no exception
+(today this throws). Use a fake `ISubscribable` driver. The stopping-mid-await race is hard to make
+deterministic; the capture-before-await fix removes the failing access regardless.
+
+---
+
+## 2. Validate Tag↔VirtualTag name collisions at deploy (Configuration)
+
+**Files:** `src/Core/ZB.MOM.WW.OtOpcUa.Configuration/Validation/DraftValidator.cs` + the `DraftSnapshot`
+type + wherever `DraftSnapshot` is built (locate during planning).
+
+**Problem.** `Phase7Applier` materialises both Equipment `Tag` variables and `VirtualTag` variables at the
+folder-scoped NodeId `{EquipmentId}[/{FolderPath}]/{Name}`. `UX_Tag_EquipmentPath` and
+`UX_VirtualTag_EquipmentPath` each enforce `(EquipmentId, Name)` uniqueness *within their own table*, but
+there is **no cross-table constraint** — a `Tag` and a `VirtualTag` with the same `(EquipmentId, Name)` (and
+same/empty FolderPath) would materialise to the same NodeId, and the sink (keyed on NodeId) would keep only
+one. Not hit today (the company overlay is VirtualTag-only), but latent for mixed equipment.
+
+**Change.**
+- Extend `DraftSnapshot` to carry equipment-bound signal identities: the `(EquipmentId, FolderPath, Name)`
+  of every `Tag` with a non-null `EquipmentId` and every `VirtualTag` (FolderPath = "" for VirtualTags).
+  Populate them where `DraftSnapshot` is constructed.
+- New rule `DraftValidator.ValidateNoEquipmentSignalNameCollision`: group those signals by the
+  **folder-scoped key** `(EquipmentId, normalize(FolderPath), Name)` — exactly the materialiser's NodeId key
+  — and emit a `ValidationError("EquipmentSignalNameCollision", …)` for any group with count > 1 (carry the
+  EquipmentId + Name in the error). Wire it into `DraftValidator.Validate`.
+
+**Why the folder-scoped key.** A Tag under a sub-folder (`{Eq}/sub/Name`) does not collide with a root
+VirtualTag (`{Eq}/Name`) of the same Name. Keying on the full NodeId path avoids false positives. Galaxy /
+SystemPlatform Tags (`EquipmentId` null) live in a different node space and are excluded.
+
+**Testing.** Validator unit tests: a Tag + a VirtualTag with the same `(EquipmentId, Name)` (empty
+FolderPath) → `EquipmentSignalNameCollision`; same Name but different FolderPath → OK; distinct names → OK;
+a SystemPlatform Tag sharing a name with an equipment VirtualTag → OK (excluded).
+
+---
+
+## 3. docker-dev resource hardening (compose/config only — no app code)
+
+**File:** `docker-dev/docker-compose.yml` (+ `docker-dev/README.md`).
+
+**Problem.** The full single-mesh stack (central-1/2 + 4 site nodes) OOM-killed central-1 on a loaded host
+(Docker Desktop VM memory cap; `OOMKilled=true`, host load ~12). Two contributors: (a) verbose EF Core
+`Development` logging — every Deployment-poll `SELECT` + `Executed DbCommand` logged, flooding the dispatcher
+and starving the cluster heartbeat thread (observed `heartbeat … delayed 4169ms … thread starvation`);
+(b) no per-container memory limits, so total footprint is unbounded against the VM cap.
+
+**Change.**
+- **Quiet logging:** add `Logging__LogLevel__Microsoft.EntityFrameworkCore=Warning` (and
+  `Logging__LogLevel__Microsoft.AspNetCore=Warning`) to the host services' `environment`. Keep application
+  logs (`Default`/`ZB.MOM.WW`) at Info. This removes the SQL flood and the starvation cascade.
+- **Per-service memory bounds:** add `mem_limit` (and `mem_reservation`) to each host service (central-1/2,
+  site-a/b), sized by measuring a steady-state node (`docker stats`) and adding headroom. Document in
+  `docker-dev/README.md` the approximate total Docker Desktop VM memory the full stack needs, so a
+  constrained host knows to raise the VM or run fewer nodes.
+
+**Coordinate with active docker-dev work.** `master` has recent in-flight docker-dev commits
+(single-mesh topology, fresh-volume bootstrap). Rebase/merge carefully so these changes don't clobber that
+work; keep the edits additive (env keys + `mem_limit` lines) where possible.
+
+**Testing.** Operational only: bring the full stack up with the changes, confirm no OOM and all nodes serve
+(`:9200` healthy, `:4840` listening on central-1, galaxy mirror Good). Not unit-testable.
+
+---
+
+## Sequencing & risk
+
+| # | Area | Risk | Ships as |
+|---|---|---|---|
+| 1 | Runtime actor (`DriverInstanceActor`) | Medium (actor semantics) — small, well-understood | own commit + TestKit test |
+| 2 | Configuration (`DraftSnapshot` + `DraftValidator`) | Low–medium (new snapshot field + rule) | own commit + validator tests |
+| 3 | docker-dev compose/README | Low (config only) | own commit; coordinate with active docker-dev work |
+
+All three are independent and can be reviewed/merged separately. No cross-dependencies.