Commit Graph

828 Commits

Author SHA1 Message Date
Joseph Doherty 526eebb3bb feat(adminui): UNS-tree delete for Cluster + Enterprise (refuse-if-children, no rowversion) 2026-06-16 16:35:07 -04:00
Joseph Doherty 6a8020e7e7 feat(adminui): native-alarm HistorizeToAveva opt-out 2026-06-16 16:27:31 -04:00
Joseph Doherty 72d414ada7 feat(adminui): typed TagConfig editors for OpcUaClient + Historian 2026-06-16 16:25:19 -04:00
Joseph Doherty 1164d423b6 fix(probe): Galaxy gRPC ping — drop invalid Retry, treat MxGatewayAuth exceptions as reachable (live /run)
v2-ci / build (push) Failing after 44s
v2-ci / unit-tests (tests/Core/ZB.MOM.WW.OtOpcUa.Cluster.Tests) (push) Has been skipped
v2-ci / unit-tests (tests/Server/ZB.MOM.WW.OtOpcUa.ControlPlane.Tests) (push) Has been skipped
v2-ci / unit-tests (tests/Server/ZB.MOM.WW.OtOpcUa.OpcUaServer.Tests) (push) Has been skipped
v2-ci / unit-tests (tests/Server/ZB.MOM.WW.OtOpcUa.Runtime.Tests) (push) Has been skipped
v2-ci / unit-tests (tests/Server/ZB.MOM.WW.OtOpcUa.Security.Tests) (push) Has been skipped
v2-ci / integration (tests/Server/ZB.MOM.WW.OtOpcUa.Host.IntegrationTests) (push) Has been skipped
v2-ci / integration (tests/Server/ZB.MOM.WW.OtOpcUa.OpcUaServer.IntegrationTests) (push) Has been skipped
Two bugs caught by live verification against the mxaccessgw at 10.100.0.48:5120:
- MaxAttempts=1 produced an invalid Polly RetryStrategyOptions -> the probe failed
  on every real gateway. Removed the Retry override (matches GalaxyDriver); fail-fast
  is already guaranteed by the TCP preflight + the per-call deadline.
- A rejected key surfaces as a typed MxGatewayAuthenticationException, not a raw
  RpcException, so 'auth-rejection = reachable' was bypassed. Catch the typed auth/
  authorization exceptions -> Ok=true.
Adds DriverProbeHandshakeE2eTests: direct-probe, skip-gated cross-protocol green/red
discrimination (Modbus, OpcUaClient, Galaxy + a local real OPC UA server).
2026-06-16 07:32:59 -04:00
Joseph Doherty 2d688c2a6d feat(probe): Galaxy Test-Connect does a gRPC ping (auth-rejection counts as reachable) 2026-06-16 06:48:40 -04:00
Joseph Doherty b663ae6eff feat(probe): TwinCAT Test-Connect does an ADS ReadState (degrade-guarded) 2026-06-16 06:48:22 -04:00
Joseph Doherty 5ed0276ffb feat(probe): FOCAS Test-Connect attempts a cnc_allclibhndl3 handshake (degrade-guarded) 2026-06-16 06:45:19 -04:00
Joseph Doherty 21f3e8feab feat(probe): AbLegacy Test-Connect opens a real PCCC session (libplctag init)
Replaces the bare-TCP AbLegacyDriverProbe with a two-phase probe:
Phase 1 is the existing TCP preflight; Phase 2 initialises a
LibplctagLegacyTagRuntime (Protocol.ab_eip + per-family PlcType) to
open a real PCCC-over-EIP session, using AbLegacyProbeOptions.ProbeAddress
("S:0") as the probe tag. Status-code discrimination mirrors the AbCip
probe: ErrorNotFound/ErrorNoMatch/ErrorBadDevice → Ok=true "controller
reachable"; transport errors → Ok=false "handshake failed".
Adds AbLegacyDriverProbeTests (5 unit tests, all green, 168 total).
2026-06-16 06:44:15 -04:00
Joseph Doherty 0c08b152c2 feat(probe): AbCip Test-Connect opens a real CIP session (libplctag init)
Replace the bare-TCP-only AbCipDriverProbe with a two-phase check:
Phase 1 keeps the existing TCP preflight; Phase 2 initialises a
LibplctagTagRuntime against the first device to open a real EIP session
and CIP Forward Open, so a live-but-rejecting CIP endpoint reads red
instead of a false-positive green.

Status mapping: ErrorNotFound / ErrorNoMatch / ErrorBadDevice → reachable
(controller answered CIP, probe tag absent); ErrorTimeout / ErrorBadConnection
/ ErrorBadGateway / ErrorWinsock / ErrorOpen / ErrorClose / ErrorRead /
ErrorWrite / ErrorBadReply / ErrorRemoteErr / ErrorPartial / ErrorAbort →
handshake failed. LibPlcTagException message text is used as a secondary
signal for the reachable-exception path. All other statuses default to
handshake-failed (conservative).

Add AbCipDriverProbeTests: invalid JSON, no devices, malformed host address,
closed-port TCP rejection, and black-hole timeout — all offline-determinable.
Happy path + CIP-error path covered live against the CIP sim.
2026-06-16 06:39:46 -04:00
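The status mapping in the commit above can be sketched as a lookup with a conservative default. This is an illustrative Python sketch, not the project's C# code: the status names are copied from the commit message, while classify_status and the returned message strings are assumptions.

```python
# Statuses that mean "the controller answered CIP, the probe tag is just absent".
REACHABLE = {"ErrorNotFound", "ErrorNoMatch", "ErrorBadDevice"}

# Statuses that mean the transport or session handshake itself failed.
HANDSHAKE_FAILED = {
    "ErrorTimeout", "ErrorBadConnection", "ErrorBadGateway", "ErrorWinsock",
    "ErrorOpen", "ErrorClose", "ErrorRead", "ErrorWrite", "ErrorBadReply",
    "ErrorRemoteErr", "ErrorPartial", "ErrorAbort",
}


def classify_status(status: str):
    """Map a libplctag-style status name to (ok, message)."""
    if status in REACHABLE:
        return True, "controller reachable"
    if status in HANDSHAKE_FAILED:
        return False, f"handshake failed: {status}"
    # Anything unlisted defaults to handshake-failed (conservative).
    return False, f"handshake failed: unrecognised status {status}"
```

Defaulting unknown statuses to red keeps a newly introduced status code from turning into a false-positive green.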
Joseph Doherty 957a63cfdb feat(probe): OpcUaClient Test-Connect does a GetEndpoints discovery handshake
Replace the bare TCP-connect return in OpcUaClientDriverProbe with a real
OPC UA GetEndpoints discovery handshake (mirroring SelectMatchingEndpointAsync
in the driver). TCP preflight still fast-fails closed ports; the handshake
confirms the remote is actually an OPC UA server, so a live-but-rejecting
non-OPC-UA process now reads RED instead of a false-healthy green.
2026-06-16 06:39:27 -04:00
Joseph Doherty 9a8336ff6e feat(probe): S7 Test-Connect does a real ISO-on-TCP + S7 setup handshake
Replace bare TCP-connect with a two-phase probe: Phase 1 keeps the
existing SocketException / timeout / generic preflight paths unchanged;
Phase 2 runs Plc.OpenAsync (COTP CR/CC + S7 setup-communication) so a
device that accepts TCP but is not an S7 PLC reads red instead of green.
A linked CTS distinguishes caller cancellation ("timed out") from the
S7netplus internal read-timeout OCE ("handshake failed: timed out").
2026-06-16 06:38:51 -04:00
Joseph Doherty 9b909002be feat(probe): Modbus Test-Connect does a real FC03 handshake
Replace the bare TCP-connect probe in ModbusDriverProbe with a two-phase
check: TCP connect via ModbusTcpTransport (keeps the same SocketException /
timeout / generic error paths and messages), then a one-shot FC03 Read
Holding Registers (qty 1 @ addr 0). A normal response → Ok=true "Modbus
FC03 OK"; a Modbus exception PDU → Ok=true "Modbus FC03 OK (device
returned exception PDU)"; any other failure after TCP succeeds → Ok=false
"Reachable at host:port but Modbus FC03 handshake failed: …".

Add ModbusDriverProbeTests (6 tests) covering invalid JSON, missing
host/port, closed port, TCP-accept-then-close, canned MBAP happy path,
and Modbus exception PDU path. All 277 Modbus tests green.
2026-06-16 06:36:48 -04:00
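For reference, the FC03 handshake in the commit above comes down to one 12-byte request and a check of the function byte in the reply. The sketch below is in Python rather than the project's C#; the function names are hypothetical, while the frame layout (MBAP header followed by the PDU, exception replies setting the high bit of the function code) is standard Modbus TCP.

```python
import struct


def build_fc03_request(transaction_id=1, unit_id=1, address=0, quantity=1):
    """Build a Modbus TCP 'Read Holding Registers' request frame."""
    pdu = struct.pack(">BHH", 0x03, address, quantity)
    # MBAP header: transaction id, protocol id (always 0),
    # length of what follows (unit id + PDU), unit id.
    mbap = struct.pack(">HHHB", transaction_id, 0, len(pdu) + 1, unit_id)
    return mbap + pdu


def classify_fc03_response(frame: bytes):
    """Classify a reply frame as (ok, message), mirroring the probe outcomes."""
    if len(frame) < 9:  # 7-byte MBAP + function code + at least one data byte
        return False, "Modbus FC03 handshake failed: short frame"
    function = frame[7]
    if function == 0x03:
        return True, "Modbus FC03 OK"
    if function == 0x83:  # 0x03 | 0x80 marks an exception PDU
        return True, "Modbus FC03 OK (device returned exception PDU)"
    return False, f"Modbus FC03 handshake failed: unexpected function 0x{function:02X}"
```

An exception PDU still counts as success because only a Modbus device would produce a well-formed exception reply; the probe is testing protocol reachability, not whether register 0 exists.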
Joseph Doherty 4973075291 feat(focas): scale axis positions by 10^PositionDecimalPlaces (config-supplied) 2026-06-16 05:32:36 -04:00
Joseph Doherty fcb3801415 fix(historian): dead-letter poison events after maxAttempts (finding 002) 2026-06-16 05:25:43 -04:00
Joseph Doherty 5e27b5f708 feat(historian): support Total aggregate (client-side Average x interval-seconds) 2026-06-16 05:24:56 -04:00
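As a worked illustration of the client-side Total aggregate named in the commit above: the total over an interval is the interval's average multiplied by its length in seconds. The Python function below is a hypothetical sketch of that arithmetic only, not the project's code.

```python
def total_from_average(average: float, interval_seconds: float) -> float:
    """Derive a Total aggregate from an Average over one interval."""
    return average * interval_seconds


# An average of 2.5 units/s over a 60 s interval gives a total of 150 units.
example_total = total_from_average(2.5, 60)
```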
Joseph Doherty 5c5aaef609 fix(focas): fail-fast at init on unimplemented backend (operator footgun)
Add IFocasClientFactory.EnsureUsable() — a config-time probe called by
FocasDriver.InitializeAsync before any background loops start. The
UnimplementedFocasClientFactory throws NotSupportedException immediately
(faulting the driver at init), eliminating the footgun where a driver on
the 'unimplemented' backend appeared Healthy then failed every read/write/
subscribe silently. WireFocasClientFactory and FakeFocasClientFactory are
no-ops. Backstop Create() throw remains in place.
2026-06-16 05:24:41 -04:00
Joseph Doherty bd8fee610b fix(modbus): surface Int64/UInt64 node DataType (Driver.Modbus-007)
Make MapDataType internal, split the combined Int64/UInt64 arm to return
DriverDataType.Int64 and DriverDataType.UInt64 respectively, and remove
the now-stale Driver.Modbus-007 caveat doc block and inline comment.
Add a Theory covering both cases; full suite 271/271 green.
2026-06-16 05:23:47 -04:00
Joseph Doherty 8899d6e091 test(galaxy): assert current Write/Subscribe guard text (PR 4.4/4.W refs removed in Phase 0 b4af9e7f) 2026-06-15 15:03:38 -04:00
Joseph Doherty 93d9160dae feat(alarms): DriverHostActor routes native-condition acks to the owning driver [H6d] 2026-06-15 14:46:00 -04:00
Joseph Doherty 87dd65b97a test(alarms): native ack wrong-role deny + tidy NativeAlarmAck doc (code-review) 2026-06-15 14:39:26 -04:00
Joseph Doherty a6d9de091b feat(alarms): native condition Acknowledge routes to NativeAlarmAckRouter with principal [H6c] 2026-06-15 14:33:58 -04:00
Joseph Doherty 328bd1b9ee feat(alarms): wire OnEnableDisable over OPC UA (AlarmAck-gated; native→BadNotSupported) [H4] 2026-06-15 14:24:19 -04:00
Joseph Doherty 226587d817 test(alarms): cover isNative rebuild/kind-flip lifecycle + Phase7Applier call-site (code-review) 2026-06-15 14:20:20 -04:00
Joseph Doherty 2423edf232 test(alarms): assert Galaxy ack null-OperatorUser falls back to empty (code-review) 2026-06-15 14:18:57 -04:00
Joseph Doherty 418663b359 feat(alarms): thread isNative through MaterialiseAlarmCondition; node manager tracks native conditions [H6a] 2026-06-15 14:13:30 -04:00
Joseph Doherty ed941c51da feat(alarms): AlarmAcknowledgeRequest carries OperatorUser; Galaxy/ScriptedAlarmSource honor it [H6b] 2026-06-15 14:11:40 -04:00
Joseph Doherty c236263e8d fix(authz): give HistoryUpdate its own NodePermissions bit (was aliased to HistoryRead) [H2] 2026-06-15 14:09:35 -04:00
Joseph Doherty 4c78dcd358 feat(redundancy): wire dbHealth into OpcUaPublishActor + spawn PeerProbeSupervisor per node 2026-06-15 13:33:34 -04:00
Joseph Doherty 5a064e086d test(redundancy): lock in stale-Terminated guard + clarify OnTerminated (code-review) 2026-06-15 13:29:58 -04:00
Joseph Doherty f41e957e07 feat(redundancy): PeerProbeSupervisor maintains one peer OPC UA probe per driver peer 2026-06-15 13:22:38 -04:00
Joseph Doherty 37b32a5623 feat(redundancy): periodic HealthTick refreshes DB reachability via Ask/PipeTo 2026-06-15 13:15:26 -04:00
Joseph Doherty 5382eea9b5 test(redundancy): cover stale-probe-not-demoted branch + make _probeFreshnessWindow readonly (code-review) 2026-06-15 13:11:01 -04:00
Joseph Doherty cf278035d2 feat(redundancy): OpcUaProbeOk from peer-probes-me with freshness debounce 2026-06-15 13:04:41 -04:00
Joseph Doherty a9ff1a64b2 fix(redundancy): always publish first ServiceLevel (even 0) + log SafeSelfStatus failures (code-review) 2026-06-15 13:00:25 -04:00
Joseph Doherty 3e609a2b19 feat(redundancy): OpcUaPublishActor computes ServiceLevel via calculator (DB+stale+leader; legacy seam) 2026-06-15 12:51:32 -04:00
Joseph Doherty ff0f62db38 refactor(redundancy): move ServiceLevelCalculator to Core.Cluster (shared, Runtime-reachable) 2026-06-15 12:45:17 -04:00
Joseph Doherty 4501f12669 feat(vtags): wire IHistoryWriter through DriverHostActor (Null default; durable sink infra-gated) (H5d, stillpending §1) 2026-06-15 10:38:49 -04:00
Joseph Doherty 2f30c54dc1 test(vtags): thread-safe CapturingHistoryWriter + drop redundant wait (H5c review follow-up) 2026-06-15 10:33:14 -04:00
Joseph Doherty 0c6d4c5491 feat(vtags): forward historized vtag results to IHistoryWriter (H5c, stillpending §1) 2026-06-15 10:26:25 -04:00
Joseph Doherty 83d3b9f7be test(vtags): planner detects Historize-only toggle as a change + doc nit (H5a review follow-up) 2026-06-15 10:21:31 -04:00
Joseph Doherty 9c5a091395 feat(vtags): decode VirtualTag Historize from artifact, byte-parity with composer (H5b, stillpending §1) 2026-06-15 10:17:08 -04:00
Joseph Doherty fc8121cbf3 feat(vtags): carry VirtualTag.Historize onto EquipmentVirtualTagPlan (H5a, stillpending §1) 2026-06-15 10:17:05 -04:00
Joseph Doherty ebf2f1dd7a fix(vtags): prune _planByVtag on child termination + crash-then-change test (H1b review follow-up) 2026-06-15 10:12:11 -04:00
Joseph Doherty ada01e1af8 fix(vtags): respawn equipment virtualtag child on in-place plan change (H1b, stillpending §1) 2026-06-15 10:05:29 -04:00
Joseph Doherty 1dc713693a fix(deploy): count removed equipment tags/vtags in RemovedNodes (H1a review follow-up) 2026-06-15 10:01:37 -04:00
Joseph Doherty 1e95856b00 fix(deploy): rebuild address space on changed-only deploys (H1a, stillpending §1) 2026-06-15 09:57:40 -04:00
Joseph Doherty 6ba59f9d4d fix(abcip,focas): collapse alarm projection to a single poll loop (no reconnect leak)
The owning DriverInstanceActor re-subscribes alarms on every Connected
entry (DetachAlarmSource nulls its cached handle on Connected->Reconnecting
without calling UnsubscribeAlarmsAsync), and the driver object + its alarm
projection are reused across every in-place reconnect. Each SubscribeAsync
started a fresh, never-cancelled Task.Run poll loop and added it to _subs,
so N reconnects leaked N concurrent loops all polling the device and all
firing the same raise/clear transitions => duplicate alarm events + CPU/mem
growth.

Mirrors the Galaxy #399 fix (Clear-before-Add) but for live poll loops the
collapse must also CANCEL the superseded loops, not just drop references.
SubscribeAsync now snapshots existing subs under _subsLock, clears _subs,
adds the new sub, starts its loop, then retires each stale sub out-of-band
(RetireAsync: Cancel + await loop + Dispose CTS, fire-and-forget so the new
subscription's return isn't blocked on a poll interval). Snapshot+clear under
the same lock DisposeAsync uses guarantees no double-own / double-dispose.

There is exactly one consumer per driver instance (factory-per-actor), so
retiring all prior subscriptions before starting the new one is faithful.

Regression tests (TDD, fail->pass): subscribe twice then drive one device
raise; assert OnAlarmEvent fires exactly once (was twice with two leaked
loops).
2026-06-15 06:09:38 -04:00
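The collapse described in the commit above (snapshot and clear under the lock, start the new loop, then cancel the superseded loops out-of-band) can be sketched with asyncio. This is a Python analogue under stated assumptions, not the project's C#: AlarmProjection, subscribe, and _retire are hypothetical names, and asyncio tasks stand in for the Task.Run poll loops and their cancellation tokens.

```python
import asyncio


class AlarmProjection:
    def __init__(self):
        self._subs = []            # live poll-loop tasks (at most one after collapse)
        self._lock = asyncio.Lock()
        self._retiring = set()     # keeps fire-and-forget retire tasks referenced

    async def _poll_loop(self):
        while True:
            await asyncio.sleep(0.01)  # stand-in for one device poll

    async def subscribe(self):
        async with self._lock:
            # Snapshot and clear under the same lock, then own the new loop.
            stale, self._subs = self._subs, []
            current = asyncio.create_task(self._poll_loop())
            self._subs.append(current)
        # Retire superseded loops without blocking the caller on a poll interval.
        for old in stale:
            retire = asyncio.create_task(self._retire(old))
            self._retiring.add(retire)
            retire.add_done_callback(self._retiring.discard)
        return current

    @staticmethod
    async def _retire(task):
        task.cancel()              # dropping the reference alone would leak the loop
        try:
            await task
        except asyncio.CancelledError:
            pass


async def demo(reconnects: int) -> int:
    """Subscribe once per simulated reconnect; return how many loops stay alive."""
    projection = AlarmProjection()
    tasks = [await projection.subscribe() for _ in range(reconnects)]
    await asyncio.sleep(0.05)
    return sum(1 for t in tasks if not t.done())
```

Running demo(3) leaves a single live loop; without the cancel step, all three loops would keep polling, which is the duplicate-event behaviour the regression test guards against.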
Joseph Doherty 013882262a fix(galaxy): bound alarm-subscription handles to one (no reconnect leak)
GalaxyDriver's StreamAlarms feed is session-less and survives an in-place
reconnect, so DriverInstanceActor re-subscribed on every Connected re-entry
(after dropping its own cached handle without an Unsubscribe — sync teardown).
The re-subscribe was additive: _alarmSubscriptions.Add grew the list by one
untracked handle per reconnect cycle — a slow unbounded leak. Functionally
harmless (the gate is Count>0 and OnAlarmFeedTransition only reads [0], firing
once regardless), but it accumulated forever.

Fix: SubscribeAlarmsAsync clears the set before adding, collapsing to a single
live handle (under the existing _alarmHandlersLock, atomic w.r.t. the fan-out
reader). There is exactly one consumer per driver instance (factory-per-actor
lifecycle), so replacing the set with the latest handle is faithful. Chosen
over making the actor's sync DetachAlarmSource call UnsubscribeAlarmsAsync
async/fire-and-forget — disproportionate for a minor leak.

Regression test Re_subscribe_collapses_to_a_single_handle_no_accumulation
(TDD-verified: FAILS without the Clear — releasing the latest handle leaves
the feed open because stale handles remain; PASSES with the fix). Galaxy tests
263 pass / 3 skip; Runtime native-alarm 24 pass. Code-reviewed (approved).
2026-06-15 05:49:07 -04:00
Joseph Doherty c9643f68ba fix(runtime): restart driver no longer throws 'actor name is not unique'
HandleRestartDriver stopped + respawned the child within one synchronous
message handler, reusing the base actor name drv-<id>. Context.Stop is async
(the child processes its own stop on its own mailbox), so the old child was
ALWAYS still registered when the respawn ran — Context.ActorOf threw
InvalidActorNameException deterministically on every AdminUI Restart press,
crashing + restarting the host.

Fix: a monotonic _childSpawnGeneration counter (single-threaded actor) feeds a
-g<gen> suffix on every spawned child name, so a respawn can never collide with
the still-terminating predecessor. Children are tracked by the _children dict
(by IActorRef), never by actor path, so the suffix is invisible to callers.
This also closes the same-shaped latent race in the reconcile path (a removed-
then-readded instance, and a driver-type-change ToStop+ToSpawn in one plan).

Regression test RestartDriver_respawns_the_child_without_an_actor_name_collision
(verified: FAILS on the old code with the exact InvalidActorNameException,
PASSES with the fix). Runtime.Tests 238/238 green. Code-reviewed (approved).
2026-06-15 05:41:18 -04:00
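A minimal sketch of the generation-suffix naming from the commit above, in Python rather than the project's C#/Akka.NET. DriverHost and its members are hypothetical; the registry set models the fact that a stopped child's name stays registered until its asynchronous termination completes. For brevity the sketch tracks children by driver id, whereas the commit tracks them by IActorRef.

```python
class DriverHost:
    def __init__(self):
        self._generation = 0      # monotonic; safe because an actor is single-threaded
        self._registered = set()  # names still held, including terminating children
        self._children = {}       # driver id -> current child name

    def spawn(self, driver_id: str) -> str:
        self._generation += 1
        name = f"drv-{driver_id}-g{self._generation}"
        if name in self._registered:
            raise ValueError(f"actor name is not unique: {name}")
        self._registered.add(name)
        self._children[driver_id] = name
        return name

    def restart(self, driver_id: str) -> str:
        # Stop is asynchronous: the old name is NOT released here, which is
        # exactly why reusing the bare 'drv-<id>' name collided.
        self._children.pop(driver_id)
        return self.spawn(driver_id)


host = DriverHost()
first = host.spawn("7")
second = host.restart("7")
```

Because the counter only ever increases, a respawned child can never share a name with a predecessor that is still terminating, whatever order stop and spawn are issued in.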
Joseph Doherty bea0b482d4 fix(historian): address code review on Raw HistoryRead paging
C1 (critical): a boundary tie cluster larger than NumValuesPerNode could
silently truncate a resumed read to GoodNoData, permanently dropping the
un-emitted ties — the (timestamp, skip) cursor cannot advance past a single
timestamp the fixed-(start,end,cap) backend keeps re-returning. Now detected
and failed LOUDLY per node with BadHistoryOperationUnsupported + a log naming
the tag/timestamp/cap; documented in Historian.md with the larger-cap remedy.
Regression test Raw_tie_cluster_larger_than_page_fails_loudly_not_silently.

I3: build HistoryData before Save() so a projection failure can never orphan a
stored continuation cursor.

N1 (YAGNI): drop the never-produced HistoryReadKind enum + Processed-only
Aggregate/IntervalTicks fields from HistoryContinuationState — only Raw pages.

N3: ComputeResumeCursor guards its documented non-empty precondition.

I1: document InMemoryHistoryContinuationStore's eventual-consistency (test double).

Build clean, 182/182 OpcUaServer tests pass.
2026-06-15 05:15:07 -04:00
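One way to picture the C1 failure mode in the commit above: on a resumed read, the (timestamp, skip) cursor skips rows already emitted at the boundary timestamp. If the backend returns a full page and every row is skipped, the cursor cannot advance, so the read has to fail instead of reporting "no data". The Python sketch below illustrates that check under assumptions; resume_page, TieClusterTooLarge, and the (timestamp, value) row shape are hypothetical and not the project's implementation.

```python
class TieClusterTooLarge(Exception):
    """Raised when a boundary tie cluster exceeds the page cap."""


def resume_page(rows, cursor_ts, skip, cap):
    """Return the not-yet-emitted rows of a resumed page.

    rows: (timestamp, value) tuples sorted by timestamp, at most `cap` long,
          as re-returned by a backend queried from cursor_ts.
    skip: how many rows at cursor_ts were already emitted.
    """
    fresh = rows[skip:] if rows and rows[0][0] == cursor_ts else rows
    if not fresh and len(rows) >= cap:
        # Full page, nothing new: more ties may exist beyond the cap, so
        # returning an empty result would silently drop them.
        raise TieClusterTooLarge(
            f"tie cluster at timestamp {cursor_ts} exceeds cap {cap}"
        )
    return fresh
```

A short page with nothing fresh is a genuine end of data and still returns an empty list; only the full-page case is ambiguous, and a larger cap is the remedy for it.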