docs(design): OtOpcUa follow-ons — subscribe race, signal-name collision validation, docker-dev resources
This commit is contained in:
@@ -0,0 +1,112 @@
|
||||
# OtOpcUa follow-ons design (2026-06-07)
|
||||
|
||||
Three independent hardening items surfaced during the Equipment-namespace live-values milestone
|
||||
(`docs/plans/2026-06-07-equipment-namespace-live-values.md`). All live in OtOpcUa. They are
|
||||
independent and ship as separate commits/tasks.
|
||||
|
||||
---
|
||||
|
||||
## 1. Fix the `DriverInstanceActor` subscribe race (Runtime — real bug)
|
||||
|
||||
**File:** `src/Server/ZB.MOM.WW.OtOpcUa.Runtime/Drivers/DriverInstanceActor.cs`
|
||||
|
||||
**Root cause.** `HandleSubscribeAsync` (dispatched via `ReceiveAsync<Subscribe>`) captures `Sender`/`Self`
|
||||
on lines 362-363 — *after* the conditional `await UnsubscribeAsync().ConfigureAwait(false)` on line 359.
|
||||
`ConfigureAwait(false)` opts out of Akka's `ActorTaskScheduler`, so in the re-subscribe path
|
||||
(`_subscriptionHandle is not null`, which fires on every deploy re-apply and on bootstrap restore) the
|
||||
post-await `Sender` read (`Sender` = `Context.Sender`) runs on a thread-pool thread with no active actor
|
||||
context → `System.NotSupportedException: There is no active ActorContext`. It is provoked further when the
|
||||
actor is stopping mid-await (a Subscribe lands during a deploy/restore reconcile while the driver is being
|
||||
torn down) — observed live as `GalaxyDriver … shutting down` immediately followed by the error. Recovers on
|
||||
a clean node restart, which is why a non-raced bootstrap subscribed cleanly.
|
||||
|
||||
**Fix (two parts).**
|
||||
1. **Capture before await.** Move `var replyTo = Sender; var self = Self;` to the very top of
|
||||
`HandleSubscribeAsync`, before the `_subscriptionHandle is not null` check + its await. Use
|
||||
`replyTo`/`self` everywhere after; never read raw `Sender`/`Self`/`Context` past any await. This removes
|
||||
the failing line.
|
||||
2. **Drop `.ConfigureAwait(false)` inside the `ReceiveAsync` handlers** (`HandleSubscribeAsync`,
|
||||
`UnsubscribeAsync`, `HandleApplyDeltaAsync`, `HandleWriteAsync`) so continuations resume on the actor's
|
||||
`ActorTaskScheduler` (idiomatic Akka — preserves actor context, handles the stopping actor gracefully).
|
||||
Audit each handler for any other post-await `Context`/`Sender`/`Self` access while doing so. (Note:
|
||||
`HandleApplyDeltaAsync` and `HandleWriteAsync` already capture `replyTo = Sender` before their awaits;
|
||||
only `HandleSubscribeAsync` has the after-await capture — confirm during implementation.)
|
||||
|
||||
**Out of scope.** The `ReceiveAsync` mailbox-blocking semantics (one message at a time during a long
|
||||
SubscribeAsync) are unchanged — not part of this bug.
|
||||
|
||||
**Testing.** A TestKit test that subscribes, then subscribes again (drives the line-359
|
||||
unsubscribe-then-resubscribe path) and asserts `SubscriptionEstablished` is replied with no exception
|
||||
(today this throws). Use a fake `ISubscribable` driver. The stopping-mid-await race is hard to make
|
||||
deterministic; the capture-before-await fix removes the failing access regardless.
|
||||
|
||||
---
|
||||
|
||||
## 2. Validate Tag↔VirtualTag name collisions at deploy (Configuration)
|
||||
|
||||
**Files:** `src/Core/ZB.MOM.WW.OtOpcUa.Configuration/Validation/DraftValidator.cs` + the `DraftSnapshot`
|
||||
type + wherever `DraftSnapshot` is built (locate during planning).
|
||||
|
||||
**Problem.** `Phase7Applier` materialises both Equipment `Tag` variables and `VirtualTag` variables at the
|
||||
folder-scoped NodeId `{EquipmentId}[/{FolderPath}]/{Name}`. `UX_Tag_EquipmentPath` and
|
||||
`UX_VirtualTag_EquipmentPath` each enforce `(EquipmentId, Name)` uniqueness *within their own table*, but
|
||||
there is **no cross-table constraint** — a `Tag` and a `VirtualTag` with the same `(EquipmentId, Name)` (and
|
||||
same/empty FolderPath) would materialise to the same NodeId, and the sink (keyed on NodeId) would keep only
|
||||
one. Not hit today (the company overlay is VirtualTag-only), but latent for mixed equipment.
|
||||
|
||||
**Change.**
|
||||
- Extend `DraftSnapshot` to carry equipment-bound signal identities: the `(EquipmentId, FolderPath, Name)`
|
||||
of every `Tag` with a non-null `EquipmentId` and every `VirtualTag` (FolderPath = "" for VirtualTags).
|
||||
Populate them where `DraftSnapshot` is constructed.
|
||||
- New rule `DraftValidator.ValidateNoEquipmentSignalNameCollision`: group those signals by the
|
||||
**folder-scoped key** `(EquipmentId, normalize(FolderPath), Name)` — exactly the materialiser's NodeId key
|
||||
— and emit a `ValidationError("EquipmentSignalNameCollision", …)` for any group with count > 1 (carry the
|
||||
EquipmentId + Name in the error). Wire it into `DraftValidator.Validate`.
|
||||
|
||||
**Why the folder-scoped key.** A Tag under a sub-folder (`{Eq}/sub/Name`) does not collide with a root
|
||||
VirtualTag (`{Eq}/Name`) of the same Name. Keying on the full NodeId path avoids false positives. Galaxy /
|
||||
SystemPlatform Tags (`EquipmentId` null) live in a different node space and are excluded.
|
||||
|
||||
**Testing.** Validator unit tests: a Tag + a VirtualTag with the same `(EquipmentId, Name)` (empty
|
||||
FolderPath) → `EquipmentSignalNameCollision`; same Name but different FolderPath → OK; distinct names → OK;
|
||||
a SystemPlatform Tag sharing a name with an equipment VirtualTag → OK (excluded).
|
||||
|
||||
---
|
||||
|
||||
## 3. docker-dev resource hardening (compose/config only — no app code)
|
||||
|
||||
**File:** `docker-dev/docker-compose.yml` (+ `docker-dev/README.md`).
|
||||
|
||||
**Problem.** The full single-mesh stack (central-1/2 + 4 site nodes) OOM-killed central-1 on a loaded host
|
||||
(Docker Desktop VM memory cap; `OOMKilled=true`, host load ~12). Two contributors: (a) verbose EF Core
|
||||
`Development` logging — every Deployment-poll `SELECT` + `Executed DbCommand` logged, flooding the dispatcher
|
||||
and starving the cluster heartbeat thread (observed `heartbeat … delayed 4169ms … thread starvation`);
|
||||
(b) no per-container memory limits, so total footprint is unbounded against the VM cap.
|
||||
|
||||
**Change.**
|
||||
- **Quiet logging:** add `Logging__LogLevel__Microsoft.EntityFrameworkCore=Warning` (and
|
||||
`Logging__LogLevel__Microsoft.AspNetCore=Warning`) to the host services' `environment`. Keep application
|
||||
logs (`Default`/`ZB.MOM.WW`) at Info. This removes the SQL flood and the starvation cascade.
|
||||
- **Per-service memory bounds:** add `mem_limit` (and `mem_reservation`) to each host service (central-1/2,
|
||||
site-a/b), sized by measuring a steady-state node (`docker stats`) and adding headroom. Document in
|
||||
`docker-dev/README.md` the approximate total Docker Desktop VM memory the full stack needs, so a
|
||||
constrained host knows to raise the VM or run fewer nodes.
|
||||
|
||||
**Coordinate with active docker-dev work.** `master` has recent in-flight docker-dev commits
|
||||
(single-mesh topology, fresh-volume bootstrap). Rebase/merge carefully so these changes don't clobber that
|
||||
work; keep the edits additive (env keys + `mem_limit` lines) where possible.
|
||||
|
||||
**Testing.** Operational only: bring the full stack up with the changes, confirm no OOM and all nodes serve
|
||||
(`:9200` healthy, `:4840` listening on central-1, galaxy mirror Good). Not unit-testable.
|
||||
|
||||
---
|
||||
|
||||
## Sequencing & risk
|
||||
|
||||
| # | Area | Risk | Ships as |
|
||||
|---|---|---|---|
|
||||
| 1 | Runtime actor (`DriverInstanceActor`) | Medium (actor semantics) — small, well-understood | own commit + TestKit test |
|
||||
| 2 | Configuration (`DraftSnapshot` + `DraftValidator`) | Low–medium (new snapshot field + rule) | own commit + validator tests |
|
||||
| 3 | docker-dev compose/README | Low (config only) | own commit; coordinate with active docker-dev work |
|
||||
|
||||
All three are independent and can be reviewed/merged separately. No cross-dependencies.
|
||||
Reference in New Issue
Block a user