Files
lmxopcua/docs/plans/2026-06-07-otopcua-followons-design.md
T

113 lines
7.2 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# OtOpcUa follow-ons design (2026-06-07)
Three independent hardening items surfaced during the Equipment-namespace live-values milestone
(`docs/plans/2026-06-07-equipment-namespace-live-values.md`). All live in OtOpcUa. They are
independent and ship as separate commits/tasks.
---
## 1. Fix the `DriverInstanceActor` subscribe race (Runtime — real bug)
**File:** `src/Server/ZB.MOM.WW.OtOpcUa.Runtime/Drivers/DriverInstanceActor.cs`
**Root cause.** `HandleSubscribeAsync` (dispatched via `ReceiveAsync<Subscribe>`) captures `Sender`/`Self`
on lines 362-363 — *after* the conditional `await UnsubscribeAsync().ConfigureAwait(false)` on line 359.
`ConfigureAwait(false)` opts out of Akka's `ActorTaskScheduler`, so in the re-subscribe path
(`_subscriptionHandle is not null`, which fires on every deploy re-apply and on bootstrap restore) the
post-await `Sender` read (`Sender` = `Context.Sender`) runs on a thread-pool thread with no active actor
context → `System.NotSupportedException: There is no active ActorContext`. It is provoked further when the
actor is stopping mid-await (a Subscribe lands during a deploy/restore reconcile while the driver is being
torn down) — observed live as `GalaxyDriver … shutting down` immediately followed by the error. Recovers on
a clean node restart, which is why a non-raced bootstrap subscribed cleanly.
**Fix (two parts).**
1. **Capture before await.** Move `var replyTo = Sender; var self = Self;` to the very top of
`HandleSubscribeAsync`, before the `_subscriptionHandle is not null` check + its await. Use
`replyTo`/`self` everywhere after; never read raw `Sender`/`Self`/`Context` past any await. This removes
the failing line.
2. **Drop `.ConfigureAwait(false)` inside the `ReceiveAsync` handlers** (`HandleSubscribeAsync`,
`UnsubscribeAsync`, `HandleApplyDeltaAsync`, `HandleWriteAsync`) so continuations resume on the actor's
`ActorTaskScheduler` (idiomatic Akka — preserves actor context, handles the stopping actor gracefully).
Audit each handler for any other post-await `Context`/`Sender`/`Self` access while doing so. (Note:
`HandleApplyDeltaAsync` and `HandleWriteAsync` already capture `replyTo = Sender` before their awaits;
only `HandleSubscribeAsync` has the after-await capture — confirm during implementation.)
**Out of scope.** The `ReceiveAsync` mailbox-blocking semantics (one message at a time during a long
SubscribeAsync) are unchanged — not part of this bug.
**Testing.** A TestKit test that subscribes, then subscribes again (drives the line-359
unsubscribe-then-resubscribe path) and asserts `SubscriptionEstablished` is replied with no exception
(today this throws). Use a fake `ISubscribable` driver. The stopping-mid-await race is hard to make
deterministic; the capture-before-await fix removes the failing access regardless.
---
## 2. Validate Tag↔VirtualTag name collisions at deploy (Configuration)
**Files:** `src/Core/ZB.MOM.WW.OtOpcUa.Configuration/Validation/DraftValidator.cs` + the `DraftSnapshot`
type + wherever `DraftSnapshot` is built (locate during planning).
**Problem.** `Phase7Applier` materialises both Equipment `Tag` variables and `VirtualTag` variables at the
folder-scoped NodeId `{EquipmentId}[/{FolderPath}]/{Name}`. `UX_Tag_EquipmentPath` and
`UX_VirtualTag_EquipmentPath` each enforce `(EquipmentId, Name)` uniqueness *within their own table*, but
there is **no cross-table constraint** — a `Tag` and a `VirtualTag` with the same `(EquipmentId, Name)` (and
same/empty FolderPath) would materialise to the same NodeId, and the sink (keyed on NodeId) would keep only
one. Not hit today (the company overlay is VirtualTag-only), but latent for mixed equipment.
**Change.**
- Extend `DraftSnapshot` to carry equipment-bound signal identities: the `(EquipmentId, FolderPath, Name)`
of every `Tag` with a non-null `EquipmentId` and every `VirtualTag` (FolderPath = "" for VirtualTags).
Populate them where `DraftSnapshot` is constructed.
- New rule `DraftValidator.ValidateNoEquipmentSignalNameCollision`: group those signals by the
**folder-scoped key** `(EquipmentId, normalize(FolderPath), Name)` — exactly the materialiser's NodeId key
— and emit a `ValidationError("EquipmentSignalNameCollision", …)` for any group with count > 1 (carry the
EquipmentId + Name in the error). Wire it into `DraftValidator.Validate`.
**Why the folder-scoped key.** A Tag under a sub-folder (`{Eq}/sub/Name`) does not collide with a root
VirtualTag (`{Eq}/Name`) of the same Name. Keying on the full NodeId path avoids false positives. Galaxy /
SystemPlatform Tags (`EquipmentId` null) live in a different node space and are excluded.
**Testing.** Validator unit tests: a Tag + a VirtualTag with the same `(EquipmentId, Name)` (empty
FolderPath) → `EquipmentSignalNameCollision`; same Name but different FolderPath → OK; distinct names → OK;
a SystemPlatform Tag sharing a name with an equipment VirtualTag → OK (excluded).
---
## 3. docker-dev resource hardening (compose/config only — no app code)
**File:** `docker-dev/docker-compose.yml` (+ `docker-dev/README.md`).
**Problem.** The full single-mesh stack (central-1/2 + 4 site nodes) OOM-killed central-1 on a loaded host
(Docker Desktop VM memory cap; `OOMKilled=true`, host load ~12). Two contributors: (a) verbose EF Core
`Development` logging — every Deployment-poll `SELECT` + `Executed DbCommand` logged, flooding the dispatcher
and starving the cluster heartbeat thread (observed `heartbeat … delayed 4169ms … thread starvation`);
(b) no per-container memory limits, so total footprint is unbounded against the VM cap.
**Change.**
- **Quiet logging:** add `Logging__LogLevel__Microsoft.EntityFrameworkCore=Warning` (and
`Logging__LogLevel__Microsoft.AspNetCore=Warning`) to the host services' `environment`. Keep application
logs (`Default`/`ZB.MOM.WW`) at Info. This removes the SQL flood and the starvation cascade.
- **Per-service memory bounds:** add `mem_limit` (and `mem_reservation`) to each host service (central-1/2,
site-a/b), sized by measuring a steady-state node (`docker stats`) and adding headroom. Document in
`docker-dev/README.md` the approximate total Docker Desktop VM memory the full stack needs, so a
constrained host knows to raise the VM or run fewer nodes.
**Coordinate with active docker-dev work.** `master` has recent in-flight docker-dev commits
(single-mesh topology, fresh-volume bootstrap). Rebase/merge carefully so these changes don't clobber that
work; keep the edits additive (env keys + `mem_limit` lines) where possible.
**Testing.** Operational only: bring the full stack up with the changes, confirm no OOM and all nodes serve
(`:9200` healthy, `:4840` listening on central-1, galaxy mirror Good). Not unit-testable.
---
## Sequencing & risk
| # | Area | Risk | Ships as |
|---|---|---|---|
| 1 | Runtime actor (`DriverInstanceActor`) | Medium (actor semantics) — small, well-understood | own commit + TestKit test |
| 2 | Configuration (`DraftSnapshot` + `DraftValidator`) | Lowmedium (new snapshot field + rule) | own commit + validator tests |
| 3 | docker-dev compose/README | Low (config only) | own commit; coordinate with active docker-dev work |
All three are independent and can be reviewed/merged separately. No cross-dependencies.