docs(design): OtOpcUa follow-ons — subscribe race, signal-name collision validation, docker-dev resources

This commit is contained in:
Joseph Doherty
2026-06-07 10:06:08 -04:00
parent 4c221ce2b3
commit 4cb488c53e
@@ -0,0 +1,112 @@
# OtOpcUa follow-ons design (2026-06-07)
Three independent hardening items surfaced during the Equipment-namespace live-values milestone
(`docs/plans/2026-06-07-equipment-namespace-live-values.md`). All live in OtOpcUa. They are
independent and ship as separate commits/tasks.
---
## 1. Fix the `DriverInstanceActor` subscribe race (Runtime — real bug)
**File:** `src/Server/ZB.MOM.WW.OtOpcUa.Runtime/Drivers/DriverInstanceActor.cs`
**Root cause.** `HandleSubscribeAsync` (dispatched via `ReceiveAsync<Subscribe>`) captures `Sender`/`Self`
on lines 362-363 — *after* the conditional `await UnsubscribeAsync().ConfigureAwait(false)` on line 359.
`ConfigureAwait(false)` opts out of Akka's `ActorTaskScheduler`, so in the re-subscribe path
(`_subscriptionHandle is not null`, which fires on every deploy re-apply and on bootstrap restore) the
post-await `Sender` read (`Sender` = `Context.Sender`) runs on a thread-pool thread with no active actor
context → `System.NotSupportedException: There is no active ActorContext`. It is provoked further when the
actor is stopping mid-await (a Subscribe lands during a deploy/restore reconcile while the driver is being
torn down) — observed live as `GalaxyDriver … shutting down` immediately followed by the error. Recovers on
a clean node restart, which is why a non-raced bootstrap subscribed cleanly.
**Fix (two parts).**
1. **Capture before await.** Move `var replyTo = Sender; var self = Self;` to the very top of
`HandleSubscribeAsync`, before the `_subscriptionHandle is not null` check + its await. Use
`replyTo`/`self` everywhere after; never read raw `Sender`/`Self`/`Context` past any await. This removes
the failing line.
2. **Drop `.ConfigureAwait(false)` inside the `ReceiveAsync` handlers** (`HandleSubscribeAsync`,
`UnsubscribeAsync`, `HandleApplyDeltaAsync`, `HandleWriteAsync`) so continuations resume on the actor's
`ActorTaskScheduler` (idiomatic Akka — preserves actor context, handles the stopping actor gracefully).
Audit each handler for any other post-await `Context`/`Sender`/`Self` access while doing so. (Note:
`HandleApplyDeltaAsync` and `HandleWriteAsync` already capture `replyTo = Sender` before their awaits;
only `HandleSubscribeAsync` has the after-await capture — confirm during implementation.)
**Out of scope.** The `ReceiveAsync` mailbox-blocking semantics (one message at a time during a long
SubscribeAsync) are unchanged — not part of this bug.
**Testing.** A TestKit test that subscribes, then subscribes again (drives the line-359
unsubscribe-then-resubscribe path) and asserts `SubscriptionEstablished` is replied with no exception
(today this throws). Use a fake `ISubscribable` driver. The stopping-mid-await race is hard to make
deterministic; the capture-before-await fix removes the failing access regardless.
---
## 2. Validate Tag↔VirtualTag name collisions at deploy (Configuration)
**Files:** `src/Core/ZB.MOM.WW.OtOpcUa.Configuration/Validation/DraftValidator.cs` + the `DraftSnapshot`
type + wherever `DraftSnapshot` is built (locate during planning).
**Problem.** `Phase7Applier` materialises both Equipment `Tag` variables and `VirtualTag` variables at the
folder-scoped NodeId `{EquipmentId}[/{FolderPath}]/{Name}`. `UX_Tag_EquipmentPath` and
`UX_VirtualTag_EquipmentPath` each enforce `(EquipmentId, Name)` uniqueness *within their own table*, but
there is **no cross-table constraint** — a `Tag` and a `VirtualTag` with the same `(EquipmentId, Name)` (and
same/empty FolderPath) would materialise to the same NodeId, and the sink (keyed on NodeId) would keep only
one. Not hit today (the company overlay is VirtualTag-only), but latent for mixed equipment.
**Change.**
- Extend `DraftSnapshot` to carry equipment-bound signal identities: the `(EquipmentId, FolderPath, Name)`
of every `Tag` with a non-null `EquipmentId` and every `VirtualTag` (FolderPath = "" for VirtualTags).
Populate them where `DraftSnapshot` is constructed.
- New rule `DraftValidator.ValidateNoEquipmentSignalNameCollision`: group those signals by the
**folder-scoped key** `(EquipmentId, normalize(FolderPath), Name)` — exactly the materialiser's NodeId key
— and emit a `ValidationError("EquipmentSignalNameCollision", …)` for any group with count > 1 (carry the
EquipmentId + Name in the error). Wire it into `DraftValidator.Validate`.
**Why the folder-scoped key.** A Tag under a sub-folder (`{Eq}/sub/Name`) does not collide with a root
VirtualTag (`{Eq}/Name`) of the same Name. Keying on the full NodeId path avoids false positives. Galaxy /
SystemPlatform Tags (`EquipmentId` null) live in a different node space and are excluded.
**Testing.** Validator unit tests: a Tag + a VirtualTag with the same `(EquipmentId, Name)` (empty
FolderPath) → `EquipmentSignalNameCollision`; same Name but different FolderPath → OK; distinct names → OK;
a SystemPlatform Tag sharing a name with an equipment VirtualTag → OK (excluded).
---
## 3. docker-dev resource hardening (compose/config only — no app code)
**File:** `docker-dev/docker-compose.yml` (+ `docker-dev/README.md`).
**Problem.** The full single-mesh stack (central-1/2 + 4 site nodes) OOM-killed central-1 on a loaded host
(Docker Desktop VM memory cap; `OOMKilled=true`, host load ~12). Two contributors: (a) verbose EF Core
`Development` logging — every Deployment-poll `SELECT` + `Executed DbCommand` logged, flooding the dispatcher
and starving the cluster heartbeat thread (observed `heartbeat … delayed 4169ms … thread starvation`);
(b) no per-container memory limits, so total footprint is unbounded against the VM cap.
**Change.**
- **Quiet logging:** add `Logging__LogLevel__Microsoft.EntityFrameworkCore=Warning` (and
`Logging__LogLevel__Microsoft.AspNetCore=Warning`) to the host services' `environment`. Keep application
logs (`Default`/`ZB.MOM.WW`) at Info. This removes the SQL flood and the starvation cascade.
- **Per-service memory bounds:** add `mem_limit` (and `mem_reservation`) to each host service (central-1/2,
site-a/b), sized by measuring a steady-state node (`docker stats`) and adding headroom. Document in
`docker-dev/README.md` the approximate total Docker Desktop VM memory the full stack needs, so a
constrained host knows to raise the VM or run fewer nodes.
**Coordinate with active docker-dev work.** `master` has recent in-flight docker-dev commits
(single-mesh topology, fresh-volume bootstrap). Rebase/merge carefully so these changes don't clobber that
work; keep the edits additive (env keys + `mem_limit` lines) where possible.
**Testing.** Operational only: bring the full stack up with the changes, confirm no OOM and all nodes serve
(`:9200` healthy, `:4840` listening on central-1, galaxy mirror Good). Not unit-testable.
---
## Sequencing & risk
| # | Area | Risk | Ships as |
|---|---|---|---|
| 1 | Runtime actor (`DriverInstanceActor`) | Medium (actor semantics) — small, well-understood | own commit + TestKit test |
| 2 | Configuration (`DraftSnapshot` + `DraftValidator`) | Lowmedium (new snapshot field + rule) | own commit + validator tests |
| 3 | docker-dev compose/README | Low (config only) | own commit; coordinate with active docker-dev work |
All three are independent and can be reviewed/merged separately. No cross-dependencies.