Files
lmxopcua/docs/plans/2026-06-07-otopcua-followons-design.md
T

7.2 KiB
Raw Blame History

OtOpcUa follow-ons design (2026-06-07)

Three independent hardening items surfaced during the Equipment-namespace live-values milestone (docs/plans/2026-06-07-equipment-namespace-live-values.md). All live in OtOpcUa. They are independent and ship as separate commits/tasks.


1. Fix the DriverInstanceActor subscribe race (Runtime — real bug)

File: src/Server/ZB.MOM.WW.OtOpcUa.Runtime/Drivers/DriverInstanceActor.cs

Root cause. HandleSubscribeAsync (dispatched via ReceiveAsync<Subscribe>) captures Sender/Self on lines 362-363 — after the conditional await UnsubscribeAsync().ConfigureAwait(false) on line 359. ConfigureAwait(false) opts out of Akka's ActorTaskScheduler, so in the re-subscribe path (_subscriptionHandle is not null, which fires on every deploy re-apply and on bootstrap restore) the post-await Sender read (Sender = Context.Sender) runs on a thread-pool thread with no active actor context → System.NotSupportedException: There is no active ActorContext. It is provoked further when the actor is stopping mid-await (a Subscribe lands during a deploy/restore reconcile while the driver is being torn down) — observed live as GalaxyDriver … shutting down immediately followed by the error. Recovers on a clean node restart, which is why a non-raced bootstrap subscribed cleanly.

Fix (two parts).

  1. Capture before await. Move var replyTo = Sender; var self = Self; to the very top of HandleSubscribeAsync, before the _subscriptionHandle is not null check + its await. Use replyTo/self everywhere after; never read raw Sender/Self/Context past any await. This removes the failing line.
  2. Drop .ConfigureAwait(false) inside the ReceiveAsync handlers (HandleSubscribeAsync, UnsubscribeAsync, HandleApplyDeltaAsync, HandleWriteAsync) so continuations resume on the actor's ActorTaskScheduler (idiomatic Akka — preserves actor context, handles the stopping actor gracefully). Audit each handler for any other post-await Context/Sender/Self access while doing so. (Note: HandleApplyDeltaAsync and HandleWriteAsync already capture replyTo = Sender before their awaits; only HandleSubscribeAsync has the after-await capture — confirm during implementation.)

Out of scope. The ReceiveAsync mailbox-blocking semantics (one message at a time during a long SubscribeAsync) are unchanged — not part of this bug.

Testing. A TestKit test that subscribes, then subscribes again (drives the line-359 unsubscribe-then-resubscribe path) and asserts SubscriptionEstablished is replied with no exception (today this throws). Use a fake ISubscribable driver. The stopping-mid-await race is hard to make deterministic; the capture-before-await fix removes the failing access regardless.


2. Validate Tag↔VirtualTag name collisions at deploy (Configuration)

Files: src/Core/ZB.MOM.WW.OtOpcUa.Configuration/Validation/DraftValidator.cs + the DraftSnapshot type + wherever DraftSnapshot is built (locate during planning).

Problem. Phase7Applier materialises both Equipment Tag variables and VirtualTag variables at the folder-scoped NodeId {EquipmentId}[/{FolderPath}]/{Name}. UX_Tag_EquipmentPath and UX_VirtualTag_EquipmentPath each enforce (EquipmentId, Name) uniqueness within their own table, but there is no cross-table constraint — a Tag and a VirtualTag with the same (EquipmentId, Name) (and same/empty FolderPath) would materialise to the same NodeId, and the sink (keyed on NodeId) would keep only one. Not hit today (the company overlay is VirtualTag-only), but latent for mixed equipment.

Change.

  • Extend DraftSnapshot to carry equipment-bound signal identities: the (EquipmentId, FolderPath, Name) of every Tag with a non-null EquipmentId and every VirtualTag (FolderPath = "" for VirtualTags). Populate them where DraftSnapshot is constructed.
  • New rule DraftValidator.ValidateNoEquipmentSignalNameCollision: group those signals by the folder-scoped key (EquipmentId, normalize(FolderPath), Name) — exactly the materialiser's NodeId key — and emit a ValidationError("EquipmentSignalNameCollision", …) for any group with count > 1 (carry the EquipmentId + Name in the error). Wire it into DraftValidator.Validate.

Why the folder-scoped key. A Tag under a sub-folder ({Eq}/sub/Name) does not collide with a root VirtualTag ({Eq}/Name) of the same Name. Keying on the full NodeId path avoids false positives. Galaxy / SystemPlatform Tags (EquipmentId null) live in a different node space and are excluded.

Testing. Validator unit tests: a Tag + a VirtualTag with the same (EquipmentId, Name) (empty FolderPath) → EquipmentSignalNameCollision; same Name but different FolderPath → OK; distinct names → OK; a SystemPlatform Tag sharing a name with an equipment VirtualTag → OK (excluded).


3. docker-dev resource hardening (compose/config only — no app code)

File: docker-dev/docker-compose.yml (+ docker-dev/README.md).

Problem. The full single-mesh stack (central-1/2 + 4 site nodes) OOM-killed central-1 on a loaded host (Docker Desktop VM memory cap; OOMKilled=true, host load ~12). Two contributors: (a) verbose EF Core Development logging — every Deployment-poll SELECT + Executed DbCommand logged, flooding the dispatcher and starving the cluster heartbeat thread (observed heartbeat … delayed 4169ms … thread starvation); (b) no per-container memory limits, so total footprint is unbounded against the VM cap.

Change.

  • Quiet logging: add Logging__LogLevel__Microsoft.EntityFrameworkCore=Warning (and Logging__LogLevel__Microsoft.AspNetCore=Warning) to the host services' environment. Keep application logs (Default/ZB.MOM.WW) at Info. This removes the SQL flood and the starvation cascade.
  • Per-service memory bounds: add mem_limit (and mem_reservation) to each host service (central-1/2, site-a/b), sized by measuring a steady-state node (docker stats) and adding headroom. Document in docker-dev/README.md the approximate total Docker Desktop VM memory the full stack needs, so a constrained host knows to raise the VM or run fewer nodes.

Coordinate with active docker-dev work. master has recent in-flight docker-dev commits (single-mesh topology, fresh-volume bootstrap). Rebase/merge carefully so these changes don't clobber that work; keep the edits additive (env keys + mem_limit lines) where possible.

Testing. Operational only: bring the full stack up with the changes, confirm no OOM and all nodes serve (:9200 healthy, :4840 listening on central-1, galaxy mirror Good). Not unit-testable.


Sequencing & risk

# Area Risk Ships as
1 Runtime actor (DriverInstanceActor) Medium (actor semantics) — small, well-understood own commit + TestKit test
2 Configuration (DraftSnapshot + DraftValidator) Lowmedium (new snapshot field + rule) own commit + validator tests
3 docker-dev compose/README Low (config only) own commit; coordinate with active docker-dev work

All three are independent and can be reviewed/merged separately. No cross-dependencies.