Two bugs caught by live verification against the mxaccessgw at 10.100.0.48:5120:
- MaxAttempts=1 produced an invalid Polly RetryStrategyOptions -> the probe failed
on every real gateway. Removed the Retry override (matches GalaxyDriver); fail-fast
is already guaranteed by the TCP preflight + the per-call deadline.
- A rejected key surfaces as a typed MxGatewayAuthenticationException, not a raw
RpcException, so 'auth-rejection = reachable' was bypassed. Catch the typed auth/
authorization exceptions -> Ok=true.
Adds DriverProbeHandshakeE2eTests: direct-probe, skip-gated cross-protocol green/red
discrimination (Modbus, OpcUaClient, Galaxy + a local real OPC UA server).
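
A minimal sketch of the "auth-rejection = reachable" rule described above. The two exception types are hypothetical stand-ins for the gateway client's typed auth/authorization exceptions, and ProbeResult plus the message strings are illustrative, not the real probe contract.

```csharp
using System;
using System.Threading.Tasks;

// Hypothetical stand-ins for the gateway client's typed exceptions.
public sealed class GatewayAuthenticationException : Exception { }
public sealed class GatewayAuthorizationException : Exception { }

// Illustrative result shape; the real probe result type may differ.
public readonly record struct ProbeResult(bool Ok, string Message);

public static class GatewayProbe
{
    // 'call' stands for any single gateway RPC issued by the probe (assumed shape).
    public static async Task<ProbeResult> RunAsync(Func<Task> call)
    {
        try
        {
            await call();
            return new ProbeResult(true, "gateway reachable");
        }
        catch (Exception ex) when (ex is GatewayAuthenticationException
                                      or GatewayAuthorizationException)
        {
            // The gateway answered and rejected the key, so it is reachable.
            return new ProbeResult(true, "gateway reachable (credentials rejected)");
        }
        catch (Exception ex)
        {
            // Anything else means the gateway could not be talked to.
            return new ProbeResult(false, $"gateway call failed: {ex.Message}");
        }
    }
}
```

Catching the typed exceptions ahead of the generic handler is what keeps a rejected key from being reported as an unreachable gateway.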
Replaces the bare-TCP AbLegacyDriverProbe with a two-phase probe:
Phase 1 is the existing TCP preflight; Phase 2 initialises a
LibplctagLegacyTagRuntime (Protocol.ab_eip + per-family PlcType) to
open a real PCCC-over-EIP session, using AbLegacyProbeOptions.ProbeAddress
("S:0") as the probe tag. Status-code discrimination mirrors the AbCip
probe: ErrorNotFound/ErrorNoMatch/ErrorBadDevice → Ok=true "controller
reachable"; transport errors → Ok=false "handshake failed".
Adds AbLegacyDriverProbeTests (5 unit tests, all green, 168 total).
Replace the bare-TCP-only AbCipDriverProbe with a two-phase check:
Phase 1 keeps the existing TCP preflight; Phase 2 initialises a
LibplctagTagRuntime against the first device to open a real EIP session
and CIP Forward Open, so a live-but-rejecting CIP endpoint reads red
instead of a false-positive green.
Status mapping: ErrorNotFound / ErrorNoMatch / ErrorBadDevice → reachable
(controller answered CIP, probe tag absent); ErrorTimeout / ErrorBadConnection
/ ErrorBadGateway / ErrorWinsock / ErrorOpen / ErrorClose / ErrorRead /
ErrorWrite / ErrorBadReply / ErrorRemoteErr / ErrorPartial / ErrorAbort →
handshake failed. LibPlcTagException message text is used as a secondary
signal for the reachable-exception path. All other statuses default to
handshake-failed (conservative).
Add AbCipDriverProbeTests: invalid JSON, no devices, malformed host address,
closed-port TCP rejection, and black-hole timeout — all offline-determinable.
Happy path + CIP-error path covered live against the CIP sim.
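
The status mapping above can be sketched as a single switch. TagStatus here is a hypothetical, partial mirror of the status names listed in this message (the real enum comes from the libplctag wrapper), and appending the status name to the failure message is an assumption.

```csharp
using System;

// Hypothetical, partial mirror of the status names listed above.
public enum TagStatus
{
    Ok,
    ErrorNotFound, ErrorNoMatch, ErrorBadDevice,      // controller answered CIP
    ErrorTimeout, ErrorBadConnection, ErrorBadGateway, // transport-level failures
    ErrorUnsupported                                   // example of an unlisted status
}

public static class CipStatusClassifier
{
    public static (bool Ok, string Message) Classify(TagStatus status) => status switch
    {
        TagStatus.Ok => (true, "controller reachable"),

        // The controller replied over CIP; only the probe tag is absent.
        TagStatus.ErrorNotFound or TagStatus.ErrorNoMatch or TagStatus.ErrorBadDevice
            => (true, "controller reachable"),

        // Listed transport errors and every unlisted status fall through
        // to the conservative default.
        _ => (false, $"handshake failed: {status}"),
    };
}
```

Because the default arm is the failure case, a status added to the library later reads red until someone classifies it explicitly.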
Replace the bare TCP-connect return in OpcUaClientDriverProbe with a real
OPC UA GetEndpoints discovery handshake (mirroring SelectMatchingEndpointAsync
in the driver). TCP preflight still fast-fails closed ports; the handshake
confirms the remote is actually an OPC UA server, so a live-but-rejecting
non-OPC-UA process now reads RED instead of a false-healthy green.
Replace bare TCP-connect with a two-phase probe: Phase 1 keeps the
existing SocketException / timeout / generic preflight paths unchanged;
Phase 2 runs Plc.OpenAsync (COTP CR/CC + S7 setup-communication) so a
device that accepts TCP but is not an S7 PLC reads red instead of green.
A linked CTS distinguishes caller cancellation ("timed out") from the
S7netplus internal read-timeout OCE ("handshake failed: timed out").
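
One way to tell the two cancellation sources apart, as a sketch: the handshake delegate stands in for Plc.OpenAsync, and folding the probe deadline into the linked source is an assumption about how the probe is wired.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public static class HandshakeProbe
{
    // 'handshake' stands in for the protocol handshake call (assumed shape).
    public static async Task<(bool Ok, string Message)> RunAsync(
        Func<CancellationToken, Task> handshake, TimeSpan timeout, CancellationToken caller)
    {
        // The linked source fires when the caller cancels or the deadline elapses.
        using var linked = CancellationTokenSource.CreateLinkedTokenSource(caller);
        linked.CancelAfter(timeout);
        try
        {
            await handshake(linked.Token);
            return (true, "handshake OK");
        }
        catch (OperationCanceledException) when (linked.IsCancellationRequested)
        {
            // Our own token fired: caller cancellation or probe deadline.
            return (false, "timed out");
        }
        catch (OperationCanceledException)
        {
            // Our token is still live, so the library raised this itself
            // (for example an internal read timeout).
            return (false, "handshake failed: timed out");
        }
    }
}
```

The exception filter checks the linked source rather than the exception's own token, since a library may throw an OperationCanceledException that carries no token at all.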
Replace the bare TCP-connect probe in ModbusDriverProbe with a two-phase
check: TCP connect via ModbusTcpTransport (keeps the same SocketException /
timeout / generic error paths and messages), then a one-shot FC03 Read
Holding Registers (qty 1 @ addr 0). A normal response → Ok=true "Modbus
FC03 OK"; a Modbus exception PDU → Ok=true "Modbus FC03 OK (device
returned exception PDU)"; any other failure after TCP succeeds → Ok=false
"Reachable at host:port but Modbus FC03 handshake failed: …".
Add ModbusDriverProbeTests (6 tests) covering invalid JSON, missing
host/port, closed port, TCP-accept-then-close, canned MBAP happy path,
and Modbus exception PDU path. All 277 Modbus tests green.
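
For reference, the FC03 request (qty 1 @ addr 0) and the response classification can be sketched as below. The frame layout follows the Modbus TCP framing (7-byte MBAP header plus PDU); the class name and the failure message wording are illustrative, not the driver's actual code.

```csharp
using System;

public static class ModbusFc03
{
    // MBAP header (7 bytes) + PDU (5 bytes): Read Holding Registers, qty 1 @ addr 0.
    public static byte[] BuildRequest(ushort transactionId, byte unitId) => new byte[]
    {
        (byte)(transactionId >> 8), (byte)(transactionId & 0xFF), // transaction id
        0x00, 0x00,                                               // protocol id: 0 = Modbus
        0x00, 0x06,                                               // length: unit id + 5-byte PDU
        unitId,
        0x03,                                                     // function code FC03
        0x00, 0x00,                                               // starting address 0
        0x00, 0x01,                                               // quantity 1
    };

    // Classifies a full response frame (MBAP header included).
    public static (bool Ok, string Message) Classify(byte[] response)
    {
        // Shortest valid reply is an exception frame: 7 MBAP + function + exception code.
        if (response.Length < 9)
            return (false, "Modbus FC03 handshake failed: short response");

        byte function = response[7];
        if (function == 0x03)
            return (true, "Modbus FC03 OK");
        if (function == 0x83) // 0x03 with the high bit set marks an exception PDU
            return (true, "Modbus FC03 OK (device returned exception PDU)");

        return (false, $"Modbus FC03 handshake failed: unexpected function 0x{function:X2}");
    }
}
```

An exception PDU still counts as Ok because only a Modbus-speaking device can produce one; the probe tests for the protocol, not for register 0 being readable.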
Add IFocasClientFactory.EnsureUsable() — a config-time probe called by
FocasDriver.InitializeAsync before any background loops start. The
UnimplementedFocasClientFactory throws NotSupportedException immediately
(faulting the driver at init), eliminating the footgun where a driver on
the 'unimplemented' backend appeared Healthy then failed every read/write/
subscribe silently. WireFocasClientFactory and FakeFocasClientFactory are
no-ops. Backstop Create() throw remains in place.
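
The fail-at-init pattern, reduced to a sketch. All type names here are simplified, hypothetical versions of the FOCAS types (synchronous, with object in place of the real client type).

```csharp
using System;

// Simplified, hypothetical version of the factory contract.
public interface IClientFactory
{
    void EnsureUsable();   // config-time check, called once during init
    object Create();
}

public sealed class UnimplementedClientFactory : IClientFactory
{
    public void EnsureUsable() =>
        throw new NotSupportedException("backend 'unimplemented' cannot create clients");

    // Backstop: still throws if someone skips EnsureUsable.
    public object Create() =>
        throw new NotSupportedException("backend 'unimplemented' cannot create clients");
}

public sealed class FakeClientFactory : IClientFactory
{
    public void EnsureUsable() { }          // usable backends are no-ops
    public object Create() => new object();
}

public sealed class Driver
{
    private readonly IClientFactory _factory;
    public bool LoopsStarted { get; private set; }

    public Driver(IClientFactory factory) => _factory = factory;

    public void Initialize()
    {
        _factory.EnsureUsable(); // faults here, before any background loop starts
        LoopsStarted = true;     // placeholder for starting the real loops
    }
}
```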
Make MapDataType internal, split the combined Int64/UInt64 arm to return
DriverDataType.Int64 and DriverDataType.UInt64 respectively, and remove
the now-stale Driver.Modbus-007 caveat doc block and inline comment.
Add a Theory covering both cases; full suite 271/271 green.
Code-review follow-ups on the poll-loop collapse: (1) RetireAsync is fire-and-
forget and does NOT guarantee zero overlap — the retired loop runs until its
in-flight read+tick finish and it observes cancellation, so a device transition
landing in that one-tick window can fire once on both loops (at most ONE
duplicate raise/clear per reconnect, transient + self-correcting; upstream Part
9 conditions dedupe on ConditionId). Documented in both RetireAsync XML docs so
it isn't mistaken for a zero-overlap guarantee. (2) wrap Cts.Dispose() so the
fire-and-forget task has no theoretical unobserved-exception path.
The owning DriverInstanceActor re-subscribes alarms on every Connected
entry (DetachAlarmSource nulls its cached handle on Connected->Reconnecting
without calling UnsubscribeAlarmsAsync), and the driver object + its alarm
projection are reused across every in-place reconnect. Each SubscribeAsync
started a fresh, never-cancelled Task.Run poll loop and added it to _subs,
so N reconnects leaked N concurrent loops all polling the device and all
firing the same raise/clear transitions => duplicate alarm events + CPU/mem
growth.
Mirrors the Galaxy #399 fix (Clear-before-Add), but for live poll loops the
collapse must also CANCEL the superseded loops, not just drop references.
SubscribeAsync now snapshots existing subs under _subsLock, clears _subs,
adds the new sub, starts its loop, then retires each stale sub out-of-band
(RetireAsync: Cancel + await loop + Dispose CTS, fire-and-forget so the new
subscription's return isn't blocked on a poll interval). Snapshot+clear under
the same lock DisposeAsync uses guarantees no double-own / double-dispose.
There is exactly one consumer per driver instance (factory-per-actor), so
retiring all prior subscriptions before starting the new one is faithful.
Regression tests (TDD, fail->pass): subscribe twice then drive one device
raise; assert OnAlarmEvent fires exactly once (was twice with two leaked
loops).
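
The snapshot, clear, add, start, retire sequence described above, as a sketch. AlarmPoller and PollSubscription are hypothetical names, the poll delegate is an assumption, and the loop is started inside the lock here so a racing Subscribe never sees a sub without its loop.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

public sealed class PollSubscription
{
    public CancellationTokenSource Cts { get; } = new();
    public Task Loop { get; set; } = Task.CompletedTask;
}

public sealed class AlarmPoller
{
    private readonly object _subsLock = new();
    private readonly List<PollSubscription> _subs = new();
    private readonly Func<CancellationToken, Task> _pollOnce;
    private readonly TimeSpan _interval;

    public AlarmPoller(Func<CancellationToken, Task> pollOnce, TimeSpan interval)
    {
        _pollOnce = pollOnce;
        _interval = interval;
    }

    public int LiveCount { get { lock (_subsLock) return _subs.Count; } }

    public PollSubscription Subscribe()
    {
        var sub = new PollSubscription();
        PollSubscription[] stale;
        lock (_subsLock)
        {
            stale = _subs.ToArray();   // snapshot the superseded subs
            _subs.Clear();
            _subs.Add(sub);
            var token = sub.Cts.Token; // capture before anyone can dispose the CTS
            sub.Loop = Task.Run(() => RunLoopAsync(token));
        }
        // Retire out-of-band so the caller is not blocked for a poll interval.
        foreach (var old in stale) _ = RetireAsync(old);
        return sub;
    }

    private async Task RunLoopAsync(CancellationToken ct)
    {
        try
        {
            while (!ct.IsCancellationRequested)
            {
                await _pollOnce(ct);
                await Task.Delay(_interval, ct);
            }
        }
        catch (OperationCanceledException) { /* normal retirement */ }
    }

    // Fire-and-forget: the old loop may finish one in-flight tick before stopping.
    private static async Task RetireAsync(PollSubscription old)
    {
        old.Cts.Cancel();
        try { await old.Loop; } catch { /* loop faults are not the retirer's problem */ }
        try { old.Cts.Dispose(); } catch { /* keep the discarded task exception-free */ }
    }
}
```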
GalaxyDriver's StreamAlarms feed is session-less and survives an in-place
reconnect, so DriverInstanceActor re-subscribed on every Connected re-entry
(after dropping its own cached handle without an Unsubscribe — sync teardown).
The re-subscribe was additive: _alarmSubscriptions.Add grew the list by one
untracked handle per reconnect cycle — a slow unbounded leak. Functionally
harmless (the gate is Count>0 and OnAlarmFeedTransition only reads [0], firing
once regardless), but it accumulated forever.
Fix: SubscribeAlarmsAsync clears the set before adding, collapsing to a single
live handle (under the existing _alarmHandlersLock, atomic w.r.t. the fan-out
reader). There is exactly one consumer per driver instance (factory-per-actor
lifecycle), so replacing the set with the latest handle is faithful. Chosen
over making the actor's sync DetachAlarmSource call UnsubscribeAlarmsAsync
async/fire-and-forget — disproportionate for a minor leak.
Regression test Re_subscribe_collapses_to_a_single_handle_no_accumulation
(TDD-verified: FAILS without the Clear — releasing the latest handle leaves
the feed open because stale handles remain; PASSES with the fix). Galaxy tests
263 pass / 3 skip; Runtime native-alarm 24 pass. Code-reviewed (approved).
Native-alarm delivery through OnAlarmFeedTransition was a black box — there was no way
to answer 'is the gateway feed delivering / is a subscription un-gating it', which is
partly why the missing-SubscribeAlarmsAsync wiring shipped undetected. Add a single
per-transition Debug line (kind, ref, live subscription count, fanout flag). Debug so a
flapping galaxy doesn't flood prod, but available on demand.
Add IGalaxyDataWriter.InvalidateHandleCaches() and call it in
GalaxyDriver.ReopenAsync after RecreateAsync succeeds. Prior to this
fix, GatewayGalaxyDataWriter's _itemHandles and _supervisedHandles
dictionaries survived across reconnects, causing the next write to
skip AddItem and AdviseSupervisory against already-dead handles.
Equipment tags resolved at runtime via FocasEquipmentTagParser were not
seeded in _parsedAddressesByTagName, so both ReadAsync and WriteAsync
re-parsed the raw TagConfig JSON address string on every hot-path call.
Promoted the field to ConcurrentDictionary (read + write thread safety)
and introduced ResolveParsedAddress(GetOrAdd) so the first call stores
the parse result and all subsequent calls are a cache hit. Authored tags
seeded at InitializeAsync compile and work unchanged.
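
The GetOrAdd-based cache can be sketched like this. ParsedAddress, AddressCache and the injected parse delegate are hypothetical; only the field name and ResolveParsedAddress come from this message.

```csharp
using System;
using System.Collections.Concurrent;

// Hypothetical parse result; the real type carries the decoded FOCAS address.
public sealed record ParsedAddress(string Raw);

public sealed class AddressCache
{
    private readonly ConcurrentDictionary<string, ParsedAddress> _parsedAddressesByTagName =
        new(StringComparer.Ordinal);
    private readonly Func<string, ParsedAddress> _parse;

    public AddressCache(Func<string, ParsedAddress> parse) => _parse = parse;

    // First call per tag name parses and stores; later calls are cache hits.
    // Under contention GetOrAdd may run the factory more than once, but only
    // one result is stored, so the parse must be side-effect free.
    public ParsedAddress ResolveParsedAddress(string tagName, string rawAddress) =>
        _parsedAddressesByTagName.GetOrAdd(tagName, _ => _parse(rawAddress));
}
```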
A plain MXAccess Write runs with no user login (WriteUserId is typically 0),
and MXAccess only COMMITS such a write when the item is advised in supervisory
mode. Without it the gateway's Write call doesn't throw (the reply looks OK) but
the value never reaches the galaxy. GatewayGalaxyDataWriter now issues
AdviseSupervisory (once per item handle) before each raw Write; SecuredWrite/
VerifiedWrite tags keep their own user-identity path. Live-verified end-to-end:
an authorized write to a Galaxy equipment tag commits and PERSISTS across a
fresh re-subscribe; an anonymous write is denied.
(The sister ScadaBridge driver commits writes the other way — a configured
non-zero WriteUserId + regular Advise; we have no galaxy login, so we use the
supervisory context.)
The net48 sidecar's TcpFrameServer.RunOneConnectionAsync registered a
cancellation callback that only Stop()s the listener (to unblock a parked
AcceptTcpClientAsync); it never closed the active client. On net48
NetworkStream.ReadAsync ignores the CancellationToken, so while the frame
loop is parked reading an idle connected client, cancelling the token cannot
unblock it; only closing the socket can. RunAsync therefore never returned
on Ctrl-C/service-stop while a connection was open (Program.Main's
RunAsync().GetAwaiter().GetResult() hung until NSSM force-killed the process).
Register the cancel to Close() the active client, and convert the resulting
cancel-time read/handshake exception to OperationCanceledException so RunAsync
unwinds cleanly without logging it as a connection failure or counting it
toward MaxConsecutiveFailures.
Caught by the first-ever net48 execution of TcpRoundTripTests on the Windows
VM (on macOS these tests are only compiled, never run):
SingleActive_SecondClientHelloCompletesOnlyAfterFirstCloses deadlocked in
teardown. Full net48 historian suite now green (122 passed, 0 failed,
2 skipped); all 6 TcpRoundTrip tests pass.
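
A sketch of the close-on-cancel pattern, assuming a helper of this shape (CancellableRead is a hypothetical name; the real change lives inside RunOneConnectionAsync). The read deliberately takes no token, matching net48 where the token would be ignored anyway.

```csharp
using System;
using System.Net.Sockets;
using System.Threading;
using System.Threading.Tasks;

public static class CancellableRead
{
    public static async Task<int> ReadAsync(TcpClient client, byte[] buffer, CancellationToken ct)
    {
        // Closing the socket is what actually unblocks a parked read.
        using (ct.Register(() => client.Close()))
        {
            try
            {
                return await client.GetStream().ReadAsync(buffer, 0, buffer.Length);
            }
            catch (Exception ex) when (ct.IsCancellationRequested)
            {
                // The failure was caused by our own Close(): report it as
                // cancellation, not as a connection failure.
                throw new OperationCanceledException("read cancelled by shutdown", ex, ct);
            }
        }
    }
}
```

A caller that counts consecutive connection failures can then skip OperationCanceledException, so a clean shutdown never moves that counter.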
Live verification on a Windows VM surfaced a crash loop: TcpFrameServer.EnsureListening
assigned _listener = new TcpListener(...) BEFORE calling Start(). When Start() throws —
e.g. the port is in a Windows excluded/reserved range (WSAEACCES) or already in use — the
field was left non-null-but-unstarted, so the `if (_listener is not null) return` guard
permanently skipped re-Start() and every subsequent AcceptTcpClientAsync() threw the
misleading InvalidOperationException "Not listening" → 20 failures → exit 2 → NSSM restart
→ loop. Now _listener is assigned only after Start() succeeds, so a transient bind failure
is retried and a permanent one surfaces the real bind error each iteration. Adds a
regression test that forces a bind conflict and asserts the SocketException persists.
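
The assign-after-Start ordering, as a sketch. FrameListener is a hypothetical, trimmed-down stand-in for TcpFrameServer, bound to loopback for illustration.

```csharp
using System.Net;
using System.Net.Sockets;

public sealed class FrameListener
{
    private readonly int _port;
    private TcpListener? _listener;

    public FrameListener(int port) => _port = port;

    public TcpListener EnsureListening()
    {
        if (_listener is not null) return _listener;

        var listener = new TcpListener(IPAddress.Loopback, _port);
        listener.Start();      // throws SocketException on a bind failure
        _listener = listener;  // published only once it is really listening
        return listener;
    }
}
```

Because the field stays null after a failed Start(), every retry reaches the bind again and reports the real SocketException instead of a later "Not listening" error.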