Commit Graph

1721 Commits

Author SHA1 Message Date
Joseph Doherty 87dd65b97a test(alarms): native ack wrong-role deny + tidy NativeAlarmAck doc (code-review) 2026-06-15 14:39:26 -04:00
Joseph Doherty a6d9de091b feat(alarms): native condition Acknowledge routes to NativeAlarmAckRouter with principal [H6c] 2026-06-15 14:33:58 -04:00
Joseph Doherty be6858baa1 fix(alarms): OnEnableDisable native-check via lock-guarded IsNativeAlarmNode + unstale AlarmCommand doc (code-review) 2026-06-15 14:30:17 -04:00
Joseph Doherty 328bd1b9ee feat(alarms): wire OnEnableDisable over OPC UA (AlarmAck-gated; native→BadNotSupported) [H4] 2026-06-15 14:24:19 -04:00
Joseph Doherty 226587d817 test(alarms): cover isNative rebuild/kind-flip lifecycle + Phase7Applier call-site (code-review) 2026-06-15 14:20:20 -04:00
Joseph Doherty 2423edf232 test(alarms): assert Galaxy ack null-OperatorUser falls back to empty (code-review) 2026-06-15 14:18:57 -04:00
Joseph Doherty 418663b359 feat(alarms): thread isNative through MaterialiseAlarmCondition; node manager tracks native conditions [H6a] 2026-06-15 14:13:30 -04:00
Joseph Doherty ed941c51da feat(alarms): AlarmAcknowledgeRequest carries OperatorUser; Galaxy/ScriptedAlarmSource honor it [H6b] 2026-06-15 14:11:40 -04:00
Joseph Doherty c236263e8d fix(authz): give HistoryUpdate its own NodePermissions bit (was aliased to HistoryRead) [H2] 2026-06-15 14:09:35 -04:00
Joseph Doherty 6ab3d8630b docs(alarms): Phase 3 implementation plan + tasks (H4 + H2-bit + H6) 2026-06-15 14:05:00 -04:00
Joseph Doherty 40b883effe docs(alarms): Phase 3 design — OPC UA standards completeness (H4 Enable/Disable + H2 HistoryUpdate bit + H6 native-ack→AVEVA) 2026-06-15 13:59:28 -04:00
Joseph Doherty 4af8e65af1 fix(redundancy): PeerProbeSupervisor explicitly ignores co-mingled OpcUaProbeResult (integration review)
v2-ci / build (push) Failing after 34s
v2-ci / unit-tests (tests/Core/ZB.MOM.WW.OtOpcUa.Cluster.Tests) (push) Has been skipped
v2-ci / unit-tests (tests/Server/ZB.MOM.WW.OtOpcUa.ControlPlane.Tests) (push) Has been skipped
v2-ci / unit-tests (tests/Server/ZB.MOM.WW.OtOpcUa.OpcUaServer.Tests) (push) Has been skipped
v2-ci / unit-tests (tests/Server/ZB.MOM.WW.OtOpcUa.Runtime.Tests) (push) Has been skipped
v2-ci / unit-tests (tests/Server/ZB.MOM.WW.OtOpcUa.Security.Tests) (push) Has been skipped
v2-ci / integration (tests/Server/ZB.MOM.WW.OtOpcUa.Host.IntegrationTests) (push) Has been skipped
v2-ci / integration (tests/Server/ZB.MOM.WW.OtOpcUa.OpcUaServer.IntegrationTests) (push) Has been skipped
2026-06-15 13:40:16 -04:00
Joseph Doherty 4c78dcd358 feat(redundancy): wire dbHealth into OpcUaPublishActor + spawn PeerProbeSupervisor per node 2026-06-15 13:33:34 -04:00
Joseph Doherty 393b746d9b docs(redundancy): sync component table with the wired calculator + PeerProbeSupervisor 2026-06-15 13:30:55 -04:00
Joseph Doherty 5a064e086d test(redundancy): lock in stale-Terminated guard + clarify OnTerminated (code-review) 2026-06-15 13:29:58 -04:00
Joseph Doherty 70e6d3d2c0 docs(redundancy): ServiceLevelCalculator is wired into the live publish path 2026-06-15 13:26:34 -04:00
Joseph Doherty f41e957e07 feat(redundancy): PeerProbeSupervisor maintains one peer OPC UA probe per driver peer 2026-06-15 13:22:38 -04:00
Joseph Doherty 37b32a5623 feat(redundancy): periodic HealthTick refreshes DB reachability via Ask/PipeTo 2026-06-15 13:15:26 -04:00
Joseph Doherty 5382eea9b5 test(redundancy): cover stale-probe-not-demoted branch + make _probeFreshnessWindow readonly (code-review) 2026-06-15 13:11:01 -04:00
Joseph Doherty cf278035d2 feat(redundancy): OpcUaProbeOk from peer-probes-me with freshness debounce 2026-06-15 13:04:41 -04:00
Joseph Doherty a9ff1a64b2 fix(redundancy): always publish first ServiceLevel (even 0) + log SafeSelfStatus failures (code-review) 2026-06-15 13:00:25 -04:00
Joseph Doherty 3e609a2b19 feat(redundancy): OpcUaPublishActor computes ServiceLevel via calculator (DB+stale+leader; legacy seam) 2026-06-15 12:51:32 -04:00
Joseph Doherty ff0f62db38 refactor(redundancy): move ServiceLevelCalculator to Core.Cluster (shared, Runtime-reachable) 2026-06-15 12:45:17 -04:00
Joseph Doherty 7605f4d8fd docs(redundancy): Phase 2 implementation plan + tasks (H3 ServiceLevel wiring) 2026-06-15 12:41:51 -04:00
Joseph Doherty 0528353315 docs(redundancy): Phase 2 design — health-aware ServiceLevel (H3) 2026-06-15 12:33:09 -04:00
Joseph Doherty 4bd7180e7f fix(docker-dev): stop seeding retired SystemPlatform namespace
v2-ci / build (push) Failing after 36s
v2-ci / unit-tests (tests/Core/ZB.MOM.WW.OtOpcUa.Cluster.Tests) (push) Has been skipped
v2-ci / unit-tests (tests/Server/ZB.MOM.WW.OtOpcUa.ControlPlane.Tests) (push) Has been skipped
v2-ci / unit-tests (tests/Server/ZB.MOM.WW.OtOpcUa.OpcUaServer.Tests) (push) Has been skipped
v2-ci / unit-tests (tests/Server/ZB.MOM.WW.OtOpcUa.Runtime.Tests) (push) Has been skipped
v2-ci / unit-tests (tests/Server/ZB.MOM.WW.OtOpcUa.Security.Tests) (push) Has been skipped
v2-ci / integration (tests/Server/ZB.MOM.WW.OtOpcUa.Host.IntegrationTests) (push) Has been skipped
v2-ci / integration (tests/Server/ZB.MOM.WW.OtOpcUa.OpcUaServer.IntegrationTests) (push) Has been skipped
The seed re-inserted a Namespace with Kind='SystemPlatform' (+ a
GalaxyMxGateway driver + 3 mirror tags), but that NamespaceKind member was
removed when Galaxy became Equipment-kind (migration CleanupSystemPlatformNamespaces).
cluster-seed runs after the migrator, so a fresh down -v/up re-introduced a Kind
the current code can't EF-materialize — 500ing /deployments and failing every
publish (ConfigComposer reads db.Namespaces). Remove the obsolete inserts;
author an Equipment-kind Galaxy driver via the UI if a fixture is needed.
2026-06-15 12:17:02 -04:00
Joseph Doherty 907005d2d2 docs(claude): note local docker-dev rig has login disabled — run live /run verification directly, don't wait for sign-in
v2-ci / build (push) Failing after 48s
v2-ci / unit-tests (tests/Core/ZB.MOM.WW.OtOpcUa.Cluster.Tests) (push) Has been skipped
v2-ci / unit-tests (tests/Server/ZB.MOM.WW.OtOpcUa.ControlPlane.Tests) (push) Has been skipped
v2-ci / unit-tests (tests/Server/ZB.MOM.WW.OtOpcUa.OpcUaServer.Tests) (push) Has been skipped
v2-ci / unit-tests (tests/Server/ZB.MOM.WW.OtOpcUa.Runtime.Tests) (push) Has been skipped
v2-ci / unit-tests (tests/Server/ZB.MOM.WW.OtOpcUa.Security.Tests) (push) Has been skipped
v2-ci / integration (tests/Server/ZB.MOM.WW.OtOpcUa.Host.IntegrationTests) (push) Has been skipped
v2-ci / integration (tests/Server/ZB.MOM.WW.OtOpcUa.OpcUaServer.IntegrationTests) (push) Has been skipped
2026-06-15 11:50:55 -04:00
Joseph Doherty c6a543d1b6 docs(vtags): note rename-respawn transient + write-side-only historize (integration review)
v2-ci / build (push) Failing after 44s
v2-ci / unit-tests (tests/Core/ZB.MOM.WW.OtOpcUa.Cluster.Tests) (push) Has been skipped
v2-ci / unit-tests (tests/Server/ZB.MOM.WW.OtOpcUa.ControlPlane.Tests) (push) Has been skipped
v2-ci / unit-tests (tests/Server/ZB.MOM.WW.OtOpcUa.OpcUaServer.Tests) (push) Has been skipped
v2-ci / unit-tests (tests/Server/ZB.MOM.WW.OtOpcUa.Runtime.Tests) (push) Has been skipped
v2-ci / unit-tests (tests/Server/ZB.MOM.WW.OtOpcUa.Security.Tests) (push) Has been skipped
v2-ci / integration (tests/Server/ZB.MOM.WW.OtOpcUa.Host.IntegrationTests) (push) Has been skipped
v2-ci / integration (tests/Server/ZB.MOM.WW.OtOpcUa.OpcUaServer.IntegrationTests) (push) Has been skipped
2026-06-15 10:50:08 -04:00
Joseph Doherty aaa5d8b851 docs(vtags): document runtime Historize honoring + infra-gated durable sink (Phase 1 H5) 2026-06-15 10:43:29 -04:00
Joseph Doherty 4501f12669 feat(vtags): wire IHistoryWriter through DriverHostActor (Null default; durable sink infra-gated) (H5d, stillpending §1) 2026-06-15 10:38:49 -04:00
Joseph Doherty 2f30c54dc1 test(vtags): thread-safe CapturingHistoryWriter + drop redundant wait (H5c review follow-up) 2026-06-15 10:33:14 -04:00
Joseph Doherty 0c6d4c5491 feat(vtags): forward historized vtag results to IHistoryWriter (H5c, stillpending §1) 2026-06-15 10:26:25 -04:00
Joseph Doherty 83d3b9f7be test(vtags): planner detects Historize-only toggle as a change + doc nit (H5a review follow-up) 2026-06-15 10:21:31 -04:00
Joseph Doherty 9c5a091395 feat(vtags): decode VirtualTag Historize from artifact, byte-parity with composer (H5b, stillpending §1) 2026-06-15 10:17:08 -04:00
Joseph Doherty fc8121cbf3 feat(vtags): carry VirtualTag.Historize onto EquipmentVirtualTagPlan (H5a, stillpending §1) 2026-06-15 10:17:05 -04:00
Joseph Doherty ebf2f1dd7a fix(vtags): prune _planByVtag on child termination + crash-then-change test (H1b review follow-up) 2026-06-15 10:12:11 -04:00
Joseph Doherty ada01e1af8 fix(vtags): respawn equipment virtualtag child on in-place plan change (H1b, stillpending §1) 2026-06-15 10:05:29 -04:00
Joseph Doherty 1dc713693a fix(deploy): count removed equipment tags/vtags in RemovedNodes (H1a review follow-up) 2026-06-15 10:01:37 -04:00
Joseph Doherty 1e95856b00 fix(deploy): rebuild address space on changed-only deploys (H1a, stillpending §1) 2026-06-15 09:57:40 -04:00
Joseph Doherty 50a2fdf32d chore(plans): mark confirmed-shipped .tasks.json completed so audits don't re-flag (stillpending §7) 2026-06-15 09:52:51 -04:00
Joseph Doherty a9d267c91a docs(security,core): correct stale write-outcome doc + note benign DraftSnapshot/LeaderChanged residue (stillpending §9/§3) 2026-06-15 09:48:14 -04:00
Joseph Doherty b4af9e7f37 docs(comments): correct 7 stale 'later task/milestone' comments (stillpending §9) 2026-06-15 09:47:08 -04:00
Joseph Doherty 68a0f759f0 docs(plans): Phase 0+1 implementation plan for the still-pending backlog
12 tasks (0 branch; 1-3 Phase 0 hygiene; 4-5 H1 changed-only-deploy fix;
6-9 H5 vtag Historize threading + IHistoryWriter seam; 10 docs; 11 verify).
Conservative rebuild-on-change; no EF migration (Historize column + artifact
already carry it); durable AVEVA sink flagged infra-gated.
2026-06-15 09:40:03 -04:00
Joseph Doherty f64be52796 docs(plans): phased completion design for the still-pending backlog
Roadmap for closing stillpending.md §1-§5 + §7/§9 cleanup in 9 phases
(0 hygiene -> 1 silent-deploy bugs H1/H5 -> 2 ServiceLevel H3 ->
3 OPC UA standards H4/H2-bit/H6 -> 4 driver coverage -> 5 probes ->
6 AdminUI -> 7 Client.UI -> 8 per-cluster scoping). Conservative
rebuild-on-change for H1; plan-and-execute phase-by-phase; no EF
migration; defer-list flagged (Denied/Simulated/Language/InlayHints/
HistoryUpdate-service/Galaxy-gateway-write).
2026-06-15 09:27:06 -04:00
Joseph Doherty 151b7165af docs(abcip,focas): document RetireAsync one-tick overlap residual + guard Dispose
v2-ci / build (push) Failing after 2m47s
v2-ci / unit-tests (tests/Core/ZB.MOM.WW.OtOpcUa.Cluster.Tests) (push) Has been skipped
v2-ci / unit-tests (tests/Server/ZB.MOM.WW.OtOpcUa.ControlPlane.Tests) (push) Has been skipped
v2-ci / unit-tests (tests/Server/ZB.MOM.WW.OtOpcUa.OpcUaServer.Tests) (push) Has been skipped
v2-ci / unit-tests (tests/Server/ZB.MOM.WW.OtOpcUa.Runtime.Tests) (push) Has been skipped
v2-ci / unit-tests (tests/Server/ZB.MOM.WW.OtOpcUa.Security.Tests) (push) Has been skipped
v2-ci / integration (tests/Server/ZB.MOM.WW.OtOpcUa.Host.IntegrationTests) (push) Has been skipped
v2-ci / integration (tests/Server/ZB.MOM.WW.OtOpcUa.OpcUaServer.IntegrationTests) (push) Has been skipped
Code-review follow-ups on the poll-loop collapse: (1) RetireAsync is fire-and-
forget and does NOT guarantee zero overlap — the retired loop runs until its
in-flight read+tick finish and it observes cancellation, so a device transition
landing in that one-tick window can fire once on both loops (at most ONE
duplicate raise/clear per reconnect, transient + self-correcting; upstream Part
9 conditions dedupe on ConditionId). Documented in both RetireAsync XML docs so
it isn't mistaken for a zero-overlap guarantee. (2) wrap Cts.Dispose() so the
fire-and-forget task has no theoretical unobserved-exception path.
2026-06-15 06:14:44 -04:00
Joseph Doherty 6ba59f9d4d fix(abcip,focas): collapse alarm projection to a single poll loop (no reconnect leak)
The owning DriverInstanceActor re-subscribes alarms on every Connected
entry (DetachAlarmSource nulls its cached handle on Connected->Reconnecting
without calling UnsubscribeAlarmsAsync), and the driver object + its alarm
projection are reused across every in-place reconnect. Each SubscribeAsync
started a fresh, never-cancelled Task.Run poll loop and added it to _subs,
so N reconnects leaked N concurrent loops all polling the device and all
firing the same raise/clear transitions => duplicate alarm events + CPU/mem
growth.

Mirrors the Galaxy #399 fix (Clear-before-Add) but for live poll loops the
collapse must also CANCEL the superseded loops, not just drop references.
SubscribeAsync now snapshots existing subs under _subsLock, clears _subs,
adds the new sub, starts its loop, then retires each stale sub out-of-band
(RetireAsync: Cancel + await loop + Dispose CTS, fire-and-forget so the new
subscription's return isn't blocked on a poll interval). Snapshot+clear under
the same lock DisposeAsync uses guarantees no double-own / double-dispose.

There is exactly one consumer per driver instance (factory-per-actor), so
retiring all prior subscriptions before starting the new one is faithful.

Regression tests (TDD, fail->pass): subscribe twice then drive one device
raise; assert OnAlarmEvent fires exactly once (was twice with two leaked
loops).
2026-06-15 06:09:38 -04:00
Joseph Doherty 43b3769a1d docs(plans): add write-outcome self-correction implementation plan
v2-ci / build (push) Failing after 32s
v2-ci / unit-tests (tests/Core/ZB.MOM.WW.OtOpcUa.Cluster.Tests) (push) Has been skipped
v2-ci / unit-tests (tests/Server/ZB.MOM.WW.OtOpcUa.ControlPlane.Tests) (push) Has been skipped
v2-ci / unit-tests (tests/Server/ZB.MOM.WW.OtOpcUa.OpcUaServer.Tests) (push) Has been skipped
v2-ci / unit-tests (tests/Server/ZB.MOM.WW.OtOpcUa.Runtime.Tests) (push) Has been skipped
v2-ci / unit-tests (tests/Server/ZB.MOM.WW.OtOpcUa.Security.Tests) (push) Has been skipped
v2-ci / integration (tests/Server/ZB.MOM.WW.OtOpcUa.Host.IntegrationTests) (push) Has been skipped
v2-ci / integration (tests/Server/ZB.MOM.WW.OtOpcUa.OpcUaServer.IntegrationTests) (push) Has been skipped
The plan + task list for the write-outcome self-correction work (B1, already
shipped via master 1d797c1c). Its design-doc counterpart is already committed;
this adds the matching plan artifacts, consistent with the other docs/plans/.
2026-06-15 05:57:15 -04:00
Joseph Doherty 5a70cd7910 chore(deps): bump vendored MxGateway.Client/Contracts 0.1.0 -> 0.1.1
User-published 0.1.1 of the MxGateway client + contracts packages into the
local-mxgw vendored source (nuget-packages/). Bumps Directory.Packages.props to
match and adds the 0.1.1 .nupkg artifacts alongside the existing 0.1.0 ones.
Full solution builds clean against 0.1.1.
2026-06-15 05:57:09 -04:00
Joseph Doherty 013882262a fix(galaxy): bound alarm-subscription handles to one (no reconnect leak)
v2-ci / build (push) Failing after 44s
v2-ci / unit-tests (tests/Core/ZB.MOM.WW.OtOpcUa.Cluster.Tests) (push) Has been skipped
v2-ci / unit-tests (tests/Server/ZB.MOM.WW.OtOpcUa.ControlPlane.Tests) (push) Has been skipped
v2-ci / unit-tests (tests/Server/ZB.MOM.WW.OtOpcUa.OpcUaServer.Tests) (push) Has been skipped
v2-ci / unit-tests (tests/Server/ZB.MOM.WW.OtOpcUa.Runtime.Tests) (push) Has been skipped
v2-ci / unit-tests (tests/Server/ZB.MOM.WW.OtOpcUa.Security.Tests) (push) Has been skipped
v2-ci / integration (tests/Server/ZB.MOM.WW.OtOpcUa.Host.IntegrationTests) (push) Has been skipped
v2-ci / integration (tests/Server/ZB.MOM.WW.OtOpcUa.OpcUaServer.IntegrationTests) (push) Has been skipped
GalaxyDriver's StreamAlarms feed is session-less and survives an in-place
reconnect, so DriverInstanceActor re-subscribed on every Connected re-entry
(after dropping its own cached handle without an Unsubscribe — sync teardown).
The re-subscribe was additive: _alarmSubscriptions.Add grew the list by one
untracked handle per reconnect cycle — a slow unbounded leak. Functionally
harmless (the gate is Count>0 and OnAlarmFeedTransition only reads [0], firing
once regardless), but it accumulated forever.

Fix: SubscribeAlarmsAsync clears the set before adding, collapsing to a single
live handle (under the existing _alarmHandlersLock, atomic w.r.t. the fan-out
reader). There is exactly one consumer per driver instance (factory-per-actor
lifecycle), so replacing the set with the latest handle is faithful. Chosen
over making the actor's sync DetachAlarmSource call UnsubscribeAlarmsAsync
async/fire-and-forget — disproportionate for a minor leak.

Regression test Re_subscribe_collapses_to_a_single_handle_no_accumulation
(TDD-verified: FAILS without the Clear — releasing the latest handle leaves
the feed open because stale handles remain; PASSES with the fix). Galaxy tests
263 pass / 3 skip; Runtime native-alarm 24 pass. Code-reviewed (approved).
2026-06-15 05:49:07 -04:00
Joseph Doherty c9643f68ba fix(runtime): restart driver no longer throws 'actor name is not unique'
v2-ci / build (push) Failing after 42s
v2-ci / unit-tests (tests/Core/ZB.MOM.WW.OtOpcUa.Cluster.Tests) (push) Has been skipped
v2-ci / unit-tests (tests/Server/ZB.MOM.WW.OtOpcUa.ControlPlane.Tests) (push) Has been skipped
v2-ci / unit-tests (tests/Server/ZB.MOM.WW.OtOpcUa.OpcUaServer.Tests) (push) Has been skipped
v2-ci / unit-tests (tests/Server/ZB.MOM.WW.OtOpcUa.Runtime.Tests) (push) Has been skipped
v2-ci / unit-tests (tests/Server/ZB.MOM.WW.OtOpcUa.Security.Tests) (push) Has been skipped
v2-ci / integration (tests/Server/ZB.MOM.WW.OtOpcUa.Host.IntegrationTests) (push) Has been skipped
v2-ci / integration (tests/Server/ZB.MOM.WW.OtOpcUa.OpcUaServer.IntegrationTests) (push) Has been skipped
HandleRestartDriver stopped + respawned the child within one synchronous
message handler, reusing the base actor name drv-<id>. Context.Stop is async
(the child processes its own stop on its own mailbox), so the old child was
ALWAYS still registered when the respawn ran — Context.ActorOf threw
InvalidActorNameException deterministically on every AdminUI Restart press,
crashing + restarting the host.

Fix: a monotonic _childSpawnGeneration counter (single-threaded actor) feeds a
-g<gen> suffix on every spawned child name, so a respawn can never collide with
the still-terminating predecessor. Children are tracked by the _children dict
(by IActorRef), never by actor path, so the suffix is invisible to callers.
This also closes the same-shaped latent race in the reconcile path (a removed-
then-readded instance, and a driver-type-change ToStop+ToSpawn in one plan).

Regression test RestartDriver_respawns_the_child_without_an_actor_name_collision
(verified: FAILS on the old code with the exact InvalidActorNameException,
PASSES with the fix). Runtime.Tests 238/238 green. Code-reviewed (approved).
2026-06-15 05:41:18 -04:00