From 4cb488c53e953b763b4c59a231e213e2d7497fea Mon Sep 17 00:00:00 2001 From: Joseph Doherty Date: Sun, 7 Jun 2026 10:06:08 -0400 Subject: [PATCH] =?UTF-8?q?docs(design):=20OtOpcUa=20follow-ons=20?= =?UTF-8?q?=E2=80=94=20subscribe=20race,=20signal-name=20collision=20valid?= =?UTF-8?q?ation,=20docker-dev=20resources?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit --- .../2026-06-07-otopcua-followons-design.md | 112 ++++++++++++++++++ 1 file changed, 112 insertions(+) create mode 100644 docs/plans/2026-06-07-otopcua-followons-design.md diff --git a/docs/plans/2026-06-07-otopcua-followons-design.md b/docs/plans/2026-06-07-otopcua-followons-design.md new file mode 100644 index 00000000..d8e5a3eb --- /dev/null +++ b/docs/plans/2026-06-07-otopcua-followons-design.md @@ -0,0 +1,112 @@ +# OtOpcUa follow-ons design (2026-06-07) + +Three independent hardening items surfaced during the Equipment-namespace live-values milestone +(`docs/plans/2026-06-07-equipment-namespace-live-values.md`). All live in OtOpcUa. They are +independent and ship as separate commits/tasks. + +--- + +## 1. Fix the `DriverInstanceActor` subscribe race (Runtime — real bug) + +**File:** `src/Server/ZB.MOM.WW.OtOpcUa.Runtime/Drivers/DriverInstanceActor.cs` + +**Root cause.** `HandleSubscribeAsync` (dispatched via `ReceiveAsync`) captures `Sender`/`Self` +on lines 362-363 — *after* the conditional `await UnsubscribeAsync().ConfigureAwait(false)` on line 359. +`ConfigureAwait(false)` opts out of Akka's `ActorTaskScheduler`, so in the re-subscribe path +(`_subscriptionHandle is not null`, which fires on every deploy re-apply and on bootstrap restore) the +post-await `Sender` read (`Sender` = `Context.Sender`) runs on a thread-pool thread with no active actor +context → `System.NotSupportedException: There is no active ActorContext`. It is provoked further when the +actor is stopping mid-await (a Subscribe lands during a deploy/restore reconcile while the driver is being +torn down) — observed live as `GalaxyDriver … shutting down` immediately followed by the error. Recovers on +a clean node restart, which is why a non-raced bootstrap subscribed cleanly. + +**Fix (two parts).** +1. **Capture before await.** Move `var replyTo = Sender; var self = Self;` to the very top of + `HandleSubscribeAsync`, before the `_subscriptionHandle is not null` check + its await. Use + `replyTo`/`self` everywhere after; never read raw `Sender`/`Self`/`Context` past any await. This removes + the failing line. +2. **Drop `.ConfigureAwait(false)` inside the `ReceiveAsync` handlers** (`HandleSubscribeAsync`, + `UnsubscribeAsync`, `HandleApplyDeltaAsync`, `HandleWriteAsync`) so continuations resume on the actor's + `ActorTaskScheduler` (idiomatic Akka — preserves actor context, handles the stopping actor gracefully). + Audit each handler for any other post-await `Context`/`Sender`/`Self` access while doing so. (Note: + `HandleApplyDeltaAsync` and `HandleWriteAsync` already capture `replyTo = Sender` before their awaits; + only `HandleSubscribeAsync` has the after-await capture — confirm during implementation.) + +**Out of scope.** The `ReceiveAsync` mailbox-blocking semantics (one message at a time during a long +SubscribeAsync) are unchanged — not part of this bug. + +**Testing.** A TestKit test that subscribes, then subscribes again (drives the line-359 +unsubscribe-then-resubscribe path) and asserts `SubscriptionEstablished` is replied with no exception +(today this throws). Use a fake `ISubscribable` driver. The stopping-mid-await race is hard to make +deterministic; the capture-before-await fix removes the failing access regardless. + +--- + +## 2. Validate Tag↔VirtualTag name collisions at deploy (Configuration) + +**Files:** `src/Core/ZB.MOM.WW.OtOpcUa.Configuration/Validation/DraftValidator.cs` + the `DraftSnapshot` +type + wherever `DraftSnapshot` is built (locate during planning). + +**Problem.** `Phase7Applier` materialises both Equipment `Tag` variables and `VirtualTag` variables at the +folder-scoped NodeId `{EquipmentId}[/{FolderPath}]/{Name}`. `UX_Tag_EquipmentPath` and +`UX_VirtualTag_EquipmentPath` each enforce `(EquipmentId, Name)` uniqueness *within their own table*, but +there is **no cross-table constraint** — a `Tag` and a `VirtualTag` with the same `(EquipmentId, Name)` (and +same/empty FolderPath) would materialise to the same NodeId, and the sink (keyed on NodeId) would keep only +one. Not hit today (the company overlay is VirtualTag-only), but latent for mixed equipment. + +**Change.** +- Extend `DraftSnapshot` to carry equipment-bound signal identities: the `(EquipmentId, FolderPath, Name)` + of every `Tag` with a non-null `EquipmentId` and every `VirtualTag` (FolderPath = "" for VirtualTags). + Populate them where `DraftSnapshot` is constructed. +- New rule `DraftValidator.ValidateNoEquipmentSignalNameCollision`: group those signals by the + **folder-scoped key** `(EquipmentId, normalize(FolderPath), Name)` — exactly the materialiser's NodeId key + — and emit a `ValidationError("EquipmentSignalNameCollision", …)` for any group with count > 1 (carry the + EquipmentId + Name in the error). Wire it into `DraftValidator.Validate`. + +**Why the folder-scoped key.** A Tag under a sub-folder (`{Eq}/sub/Name`) does not collide with a root +VirtualTag (`{Eq}/Name`) of the same Name. Keying on the full NodeId path avoids false positives. Galaxy / +SystemPlatform Tags (`EquipmentId` null) live in a different node space and are excluded. + +**Testing.** Validator unit tests: a Tag + a VirtualTag with the same `(EquipmentId, Name)` (empty +FolderPath) → `EquipmentSignalNameCollision`; same Name but different FolderPath → OK; distinct names → OK; +a SystemPlatform Tag sharing a name with an equipment VirtualTag → OK (excluded). + +--- + +## 3. docker-dev resource hardening (compose/config only — no app code) + +**File:** `docker-dev/docker-compose.yml` (+ `docker-dev/README.md`). + +**Problem.** The full single-mesh stack (central-1/2 + 4 site nodes) OOM-killed central-1 on a loaded host +(Docker Desktop VM memory cap; `OOMKilled=true`, host load ~12). Two contributors: (a) verbose EF Core +`Development` logging — every Deployment-poll `SELECT` + `Executed DbCommand` logged, flooding the dispatcher +and starving the cluster heartbeat thread (observed `heartbeat … delayed 4169ms … thread starvation`); +(b) no per-container memory limits, so total footprint is unbounded against the VM cap. + +**Change.** +- **Quiet logging:** add `Logging__LogLevel__Microsoft.EntityFrameworkCore=Warning` (and + `Logging__LogLevel__Microsoft.AspNetCore=Warning`) to the host services' `environment`. Keep application + logs (`Default`/`ZB.MOM.WW`) at Info. This removes the SQL flood and the starvation cascade. +- **Per-service memory bounds:** add `mem_limit` (and `mem_reservation`) to each host service (central-1/2, + site-a/b), sized by measuring a steady-state node (`docker stats`) and adding headroom. Document in + `docker-dev/README.md` the approximate total Docker Desktop VM memory the full stack needs, so a + constrained host knows to raise the VM or run fewer nodes. + +**Coordinate with active docker-dev work.** `master` has recent in-flight docker-dev commits +(single-mesh topology, fresh-volume bootstrap). Rebase/merge carefully so these changes don't clobber that +work; keep the edits additive (env keys + `mem_limit` lines) where possible. + +**Testing.** Operational only: bring the full stack up with the changes, confirm no OOM and all nodes serve +(`:9200` healthy, `:4840` listening on central-1, galaxy mirror Good). Not unit-testable. + +--- + +## Sequencing & risk + +| # | Area | Risk | Ships as | +|---|---|---|---| +| 1 | Runtime actor (`DriverInstanceActor`) | Medium (actor semantics) — small, well-understood | own commit + TestKit test | +| 2 | Configuration (`DraftSnapshot` + `DraftValidator`) | Low–medium (new snapshot field + rule) | own commit + validator tests | +| 3 | docker-dev compose/README | Low (config only) | own commit; coordinate with active docker-dev work | + +All three are independent and can be reviewed/merged separately. No cross-dependencies.