7.2 KiB
OtOpcUa follow-ons design (2026-06-07)
Three independent hardening items surfaced during the Equipment-namespace live-values milestone
(docs/plans/2026-06-07-equipment-namespace-live-values.md). All live in OtOpcUa. They are
independent and ship as separate commits/tasks.
1. Fix the DriverInstanceActor subscribe race (Runtime — real bug)
File: src/Server/ZB.MOM.WW.OtOpcUa.Runtime/Drivers/DriverInstanceActor.cs
Root cause. HandleSubscribeAsync (dispatched via ReceiveAsync<Subscribe>) captures Sender/Self
on lines 362-363 — after the conditional await UnsubscribeAsync().ConfigureAwait(false) on line 359.
ConfigureAwait(false) opts out of Akka's ActorTaskScheduler, so in the re-subscribe path
(_subscriptionHandle is not null, which fires on every deploy re-apply and on bootstrap restore) the
post-await Sender read (Sender = Context.Sender) runs on a thread-pool thread with no active actor
context → System.NotSupportedException: There is no active ActorContext. It is provoked further when the
actor is stopping mid-await (a Subscribe lands during a deploy/restore reconcile while the driver is being
torn down) — observed live as GalaxyDriver … shutting down immediately followed by the error. Recovers on
a clean node restart, which is why a non-raced bootstrap subscribed cleanly.
Fix (two parts).
- Capture before await. Move
var replyTo = Sender; var self = Self;to the very top ofHandleSubscribeAsync, before the_subscriptionHandle is not nullcheck + its await. UsereplyTo/selfeverywhere after; never read rawSender/Self/Contextpast any await. This removes the failing line. - Drop
.ConfigureAwait(false)inside theReceiveAsynchandlers (HandleSubscribeAsync,UnsubscribeAsync,HandleApplyDeltaAsync,HandleWriteAsync) so continuations resume on the actor'sActorTaskScheduler(idiomatic Akka — preserves actor context, handles the stopping actor gracefully). Audit each handler for any other post-awaitContext/Sender/Selfaccess while doing so. (Note:HandleApplyDeltaAsyncandHandleWriteAsyncalready capturereplyTo = Senderbefore their awaits; onlyHandleSubscribeAsynchas the after-await capture — confirm during implementation.)
Out of scope. The ReceiveAsync mailbox-blocking semantics (one message at a time during a long
SubscribeAsync) are unchanged — not part of this bug.
Testing. A TestKit test that subscribes, then subscribes again (drives the line-359
unsubscribe-then-resubscribe path) and asserts SubscriptionEstablished is replied with no exception
(today this throws). Use a fake ISubscribable driver. The stopping-mid-await race is hard to make
deterministic; the capture-before-await fix removes the failing access regardless.
2. Validate Tag↔VirtualTag name collisions at deploy (Configuration)
Files: src/Core/ZB.MOM.WW.OtOpcUa.Configuration/Validation/DraftValidator.cs + the DraftSnapshot
type + wherever DraftSnapshot is built (locate during planning).
Problem. Phase7Applier materialises both Equipment Tag variables and VirtualTag variables at the
folder-scoped NodeId {EquipmentId}[/{FolderPath}]/{Name}. UX_Tag_EquipmentPath and
UX_VirtualTag_EquipmentPath each enforce (EquipmentId, Name) uniqueness within their own table, but
there is no cross-table constraint — a Tag and a VirtualTag with the same (EquipmentId, Name) (and
same/empty FolderPath) would materialise to the same NodeId, and the sink (keyed on NodeId) would keep only
one. Not hit today (the company overlay is VirtualTag-only), but latent for mixed equipment.
Change.
- Extend
DraftSnapshotto carry equipment-bound signal identities: the(EquipmentId, FolderPath, Name)of everyTagwith a non-nullEquipmentIdand everyVirtualTag(FolderPath = "" for VirtualTags). Populate them whereDraftSnapshotis constructed. - New rule
DraftValidator.ValidateNoEquipmentSignalNameCollision: group those signals by the folder-scoped key(EquipmentId, normalize(FolderPath), Name)— exactly the materialiser's NodeId key — and emit aValidationError("EquipmentSignalNameCollision", …)for any group with count > 1 (carry the EquipmentId + Name in the error). Wire it intoDraftValidator.Validate.
Why the folder-scoped key. A Tag under a sub-folder ({Eq}/sub/Name) does not collide with a root
VirtualTag ({Eq}/Name) of the same Name. Keying on the full NodeId path avoids false positives. Galaxy /
SystemPlatform Tags (EquipmentId null) live in a different node space and are excluded.
Testing. Validator unit tests: a Tag + a VirtualTag with the same (EquipmentId, Name) (empty
FolderPath) → EquipmentSignalNameCollision; same Name but different FolderPath → OK; distinct names → OK;
a SystemPlatform Tag sharing a name with an equipment VirtualTag → OK (excluded).
3. docker-dev resource hardening (compose/config only — no app code)
File: docker-dev/docker-compose.yml (+ docker-dev/README.md).
Problem. The full single-mesh stack (central-1/2 + 4 site nodes) OOM-killed central-1 on a loaded host
(Docker Desktop VM memory cap; OOMKilled=true, host load ~12). Two contributors: (a) verbose EF Core
Development logging — every Deployment-poll SELECT + Executed DbCommand logged, flooding the dispatcher
and starving the cluster heartbeat thread (observed heartbeat … delayed 4169ms … thread starvation);
(b) no per-container memory limits, so total footprint is unbounded against the VM cap.
Change.
- Quiet logging: add
Logging__LogLevel__Microsoft.EntityFrameworkCore=Warning(andLogging__LogLevel__Microsoft.AspNetCore=Warning) to the host services'environment. Keep application logs (Default/ZB.MOM.WW) at Info. This removes the SQL flood and the starvation cascade. - Per-service memory bounds: add
mem_limit(andmem_reservation) to each host service (central-1/2, site-a/b), sized by measuring a steady-state node (docker stats) and adding headroom. Document indocker-dev/README.mdthe approximate total Docker Desktop VM memory the full stack needs, so a constrained host knows to raise the VM or run fewer nodes.
Coordinate with active docker-dev work. master has recent in-flight docker-dev commits
(single-mesh topology, fresh-volume bootstrap). Rebase/merge carefully so these changes don't clobber that
work; keep the edits additive (env keys + mem_limit lines) where possible.
Testing. Operational only: bring the full stack up with the changes, confirm no OOM and all nodes serve
(:9200 healthy, :4840 listening on central-1, galaxy mirror Good). Not unit-testable.
Sequencing & risk
| # | Area | Risk | Ships as |
|---|---|---|---|
| 1 | Runtime actor (DriverInstanceActor) |
Medium (actor semantics) — small, well-understood | own commit + TestKit test |
| 2 | Configuration (DraftSnapshot + DraftValidator) |
Low–medium (new snapshot field + rule) | own commit + validator tests |
| 3 | docker-dev compose/README | Low (config only) | own commit; coordinate with active docker-dev work |
All three are independent and can be reviewed/merged separately. No cross-dependencies.