Test-Connect Probes — Protocol Handshakes
Each driver's Test-Connect button in the AdminUI runs a probe against the
form's current config (never the persisted row, never the live driver actor).
Before Phase 5 (shipped 2026-06-16) every probe was a bare TCP ConnectAsync
— a live-but-rejecting device showed a healthy green tick, and the operator
only discovered the truth when the driver faulted at deploy. Phase 5 replaced
each TCP-only probe with a real protocol handshake so a reachable-but-wrong
or actively-rejecting endpoint now reads RED.
The IDriverProbe / DriverProbeResult contract and DI registration are
unchanged. Probes run in a transient actor with a timeout clamp of 1–60 s
and must not mutate any state.
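The timeout clamp can be pictured with a minimal sketch. It is written in Python purely for illustration (the real dispatch lives in the C# AdminOperationsActor), and the function and constant names are hypothetical; only the 1–60 s bounds come from the text above.

```python
# Hypothetical names; only the 1-60 s window is taken from this document.
MIN_TIMEOUT_S = 1.0   # lower bound of the clamp
MAX_TIMEOUT_S = 60.0  # upper bound of the clamp


def clamp_probe_timeout(requested_s: float) -> float:
    """Force a requested probe timeout into the 1-60 s window."""
    return max(MIN_TIMEOUT_S, min(MAX_TIMEOUT_S, requested_s))
```

Under this sketch a request for 0.2 s would run with 1 s, and a request for 300 s would run with 60 s.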
For the AdminUI probe flow (button → AdminOperationsActor → transient probe
actor), see docs/plans/2026-05-28-adminui-driver-pages-design.md §4.
Result contract
All probes return a consistent DriverProbeResult(bool Ok, string? Message, TimeSpan? Latency).
The message templates below are uniform across all 8 drivers:
| Outcome | Ok | Message template |
|---|---|---|
| TCP connect fails | false | "Connect failed: {SocketErrorCode}" |
| TCP ok + handshake ok | true | driver-specific descriptive string (see table below) |
| TCP ok but handshake rejected | false | "Reachable at {host}:{port} but {proto} handshake failed: {detail}" |
| Timeout | false | "Probe timed out after {n}s." |
The third row is the key new behavior: a reachable device that answers on the
port but rejects the protocol-level handshake now surfaces a false result
with a human-readable explanation rather than a false-green TCP-open tick.
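The four outcomes can be sketched as small constructor helpers. This is an illustrative Python model, not the C# record itself: the field names mirror DriverProbeResult(Ok, Message, Latency), the message templates are copied from the table, and the helper function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DriverProbeResult:
    """Python stand-in for the C# DriverProbeResult record."""
    ok: bool
    message: Optional[str] = None
    latency_s: Optional[float] = None


# Helper names below are hypothetical; the templates come from the table.
def connect_failed(socket_error_code: str) -> DriverProbeResult:
    return DriverProbeResult(False, f"Connect failed: {socket_error_code}")


def handshake_ok(message: str, latency_s: float) -> DriverProbeResult:
    return DriverProbeResult(True, message, latency_s)


def handshake_rejected(host: str, port: int, proto: str, detail: str) -> DriverProbeResult:
    return DriverProbeResult(
        False, f"Reachable at {host}:{port} but {proto} handshake failed: {detail}"
    )


def timed_out(seconds: int) -> DriverProbeResult:
    return DriverProbeResult(False, f"Probe timed out after {seconds}s.")
```

For instance, `handshake_rejected("10.100.0.35", 5020, "Modbus", "non-MBAP reply")` yields the RED result the third row describes.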
Per-driver handshake
| Driver | Handshake | Ok message | Dev-rig target |
|---|---|---|---|
| Modbus | FC03 (Read Holding Registers, qty 1 @ addr 0) via ModbusTcpTransport. A Modbus exception PDU still proves a real Modbus device → Ok. A non-MBAP reply → handshake fail. | "Modbus FC03 OK" | 10.100.0.35:5020 (Modbus sim) |
| OpcUaClient | DiscoveryClient.GetEndpointsAsync: no session, no app-cert, no auth. ≥ 1 endpoint → Ok. A non-OPC-UA TCP server throws or times out → handshake fail. | "OPC UA: N endpoint(s)" | opc.tcp://10.100.0.35:50000 (opc-plc) |
| S7 | Plc.OpenAsync (COTP CR/CC + S7 setup-communication), check IsConnected, then Close. Wrong rack/slot or a non-S7 server causes OpenAsync to throw → handshake fail. | "S7 connected (CPU …)" | 10.100.0.35:1102 (python-snap7 sim) |
| AbCip | libplctag Tag InitializeAsync (EIP session + CIP Forward Open). A CIP-level error such as tag-not-found still proves the controller answered CIP → Ok. A session/ForwardOpen/connect error → handshake fail. | "CIP session OK" | 10.100.0.35:44818 (CIP sim) |
| AbLegacy | Same libplctag InitializeAsync handshake as AbCip, PCCC protocol family. | "CIP session OK" (PCCC family) | Deferred: no PLC5/SLC sim |
| TwinCAT | AdsClient.Connect + ReadStateAsync. See degrade semantics below. | "ADS state: {state}" | Deferred: no ADS target |
| FOCAS | cnc_allclibhndl3 via a direct DllImport("fwlib32") in the probe. See degrade semantics below. | "FOCAS handle OK" | Deferred: no CNC + FWLIB |
| Galaxy | gRPC unary call to GalaxyRepository.TestConnection on the configured mxaccessgw endpoint. See auth-rejection rule below. | "gateway gRPC OK" | http://10.100.0.48:5120 (mxaccessgw) |
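To make the Modbus row concrete, here is a minimal Python sketch of the FC03 request frame and the reply classification it describes. It is not the ModbusTcpTransport code: the function names, the default transaction and unit ids, and the wording of the exception-PDU message are assumptions. The frame layout (7-byte MBAP header followed by the PDU) follows the standard Modbus TCP framing.

```python
import struct
from typing import Tuple

FC_READ_HOLDING = 0x03  # Read Holding Registers


def build_fc03_request(transaction_id: int = 1, unit_id: int = 1) -> bytes:
    """MBAP header + PDU asking for 1 holding register at address 0."""
    # MBAP: transaction id, protocol id (always 0), byte count of what follows (6), unit id
    # PDU:  function code, start address, quantity
    return struct.pack(">HHHBBHH", transaction_id, 0, 6, unit_id, FC_READ_HOLDING, 0, 1)


def classify_fc03_reply(reply: bytes, transaction_id: int = 1) -> Tuple[bool, str]:
    """Return (ok, detail) following the rule in the Modbus row above."""
    if len(reply) < 8:  # 7-byte MBAP header + 1-byte function code
        return False, "reply too short for an MBAP frame"
    tid, proto, _length, _unit, func = struct.unpack(">HHHBB", reply[:8])
    if proto != 0 or tid != transaction_id:
        return False, "non-MBAP reply"
    if func == FC_READ_HOLDING:
        return True, "Modbus FC03 OK"
    if func == (FC_READ_HOLDING | 0x80):
        # Exception PDU: still proves a real Modbus device answered.
        # The message wording here is an assumption, not the probe's actual text.
        return True, "Modbus device answered with an exception PDU"
    return False, "unexpected function code"
```

A plain HTTP server listening on the probed port would fail the protocol-id check and read RED, which is the discrimination the TCP-only probe could not make.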
Historian.Wonderware already performed a real handshake (Hello → HelloAck)
before Phase 5 and was not changed by this work. See
Historian.Wonderware.md for details.
Degrade semantics
Three drivers have environmental constraints that can prevent the handshake from running on certain hosts. The degradation principle is: the probe must never produce a result worse than the pre-Phase 5 TCP-only probe. A genuine protocol rejection from a reachable device is a correct RED; an inability to run the handshake at all (no FWLIB, no managed router) degrades to the existing TCP-reachability message, which is still a green tick but annotated.
TwinCAT degrade
Where the handshake is available:
- AdsClient.Connect(netId, port) + ReadStateAsync → Ok=true, "ADS state: {state}" (Run / Config / Stop).
- An ADS route-table rejection from a reachable ADS router is a true RED: "Reachable at {host}:{port} but ADS handshake failed: {detail} — check the target's ADS route table authorizes this host". This is the correct result: the driver would also be unable to function without an authorized route.
Where the handshake is unavailable (headless server, no TwinCAT runtime, the managed AMS router cannot start):
- Probe degrades to TCP-reachability: Ok=true, "(ADS handshake unavailable on this host — TCP reachability only)".
FOCAS degrade
On a Windows host with the FANUC FWLIB shared library present:
- cnc_allclibhndl3 is called via a direct DllImport("fwlib32") declared in the probe (the production Wire.WireFocasClient is a pure-managed FOCAS/2 TCP client, not an FWLIB P/Invoke, so the probe carries its own native binding). A successful handle allocation → Ok=true, "FOCAS handle OK".
- A CNC-level rejection → handshake fail.
On dev, Linux, or macOS (no native FWLIB — UnimplementedFocasClientFactory
gates the driver):
- DllNotFoundException / NotSupportedException is caught and the probe degrades to TCP-reachability: Ok=true, "(FOCAS handshake unavailable on this host — FWLIB absent, TCP reachability only)".
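The degrade rule shared by the TwinCAT and FOCAS probes can be sketched as one guard, shown here in Python for illustration. The two exception classes are hypothetical stand-ins (for example, HandshakeUnavailable plays the role of DllNotFoundException or a managed AMS router that cannot start), and the sketch assumes the TCP preflight has already succeeded. The annotated degrade message is passed in by the caller, since each driver has its own wording as quoted above.

```python
from typing import Callable, Tuple


class HandshakeUnavailable(Exception):
    """Hypothetical: the host cannot run the handshake at all."""


class HandshakeRejected(Exception):
    """Hypothetical: a reachable device refused the protocol handshake."""


def probe_with_degrade(
    host: str,
    port: int,
    proto: str,
    handshake: Callable[[], str],
    degraded_message: str,
) -> Tuple[bool, str]:
    """Run the handshake; degrade to TCP-reachability only if it cannot run."""
    try:
        ok_message = handshake()
    except HandshakeUnavailable:
        # Never worse than the TCP-only probe: still green, but annotated.
        return True, degraded_message
    except HandshakeRejected as exc:
        # A genuine rejection from a reachable device is a correct RED.
        return False, f"Reachable at {host}:{port} but {proto} handshake failed: {exc}"
    return True, ok_message
```

Keeping the two failure kinds as distinct exception types is what prevents a missing native library from being reported as a device fault.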
Galaxy auth-rejection rule
The probe builds the gRPC channel from the form's config and issues one
lightweight unary call. It does not resolve secretref: secrets — the
key string in the transient config (possibly empty or unresolved) is used as-is.
- Unavailable / transport failure → Ok=false (gateway is down or unreachable).
- Unauthenticated / PermissionDenied → Ok=true, "gateway reachable & speaking gRPC (auth not checked)". An auth rejection proves a live mxaccessgw gRPC server. This is the correct result: the driver's own session-layer will handle auth; the probe is testing reachability only.
The mxaccessgw client surfaces a rejected key as a typed
MxGatewayAuthenticationException / MxGatewayAuthorizationException, not a
raw RpcException — the probe catches both and maps them to the reachable result
above. (Live verification on 10.100.0.48:5120 with no key returns
MxGatewayAuthenticationException("Missing or invalid API key.") → Ok=true.)
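The mapping above can be sketched in Python as follows. The three exception classes are hypothetical stand-ins for the typed mxaccessgw client exceptions and for a transport failure; the two Ok messages are taken from this document, while the wording of the failure message is an assumption.

```python
from typing import Callable, Tuple


class GatewayTransportError(Exception):
    """Stand-in for an Unavailable / transport failure."""


class GatewayAuthenticationError(Exception):
    """Stand-in for MxGatewayAuthenticationException."""


class GatewayAuthorizationError(Exception):
    """Stand-in for MxGatewayAuthorizationException."""


AUTH_REJECTED_OK = "gateway reachable & speaking gRPC (auth not checked)"


def probe_gateway(test_connection: Callable[[], None]) -> Tuple[bool, str]:
    """Issue one lightweight call and apply the auth-rejection rule."""
    try:
        test_connection()
    except (GatewayAuthenticationError, GatewayAuthorizationError):
        # A rejected key still proves a live gateway speaking gRPC.
        return True, AUTH_REJECTED_OK
    except GatewayTransportError as exc:
        # Failure wording is an assumption; the document only states Ok=false.
        return False, f"gateway unavailable: {exc}"
    return True, "gateway gRPC OK"
```

Catching the typed exceptions explicitly matters because a handler written only for the raw RPC error type would let them fall through and turn a reachable gateway RED.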
Config note: UseTls must match the endpoint scheme: UseTls:false for an http:// (h2c) gateway, UseTls:true for https://. A mismatch fails the client's own validation (the same constraint the Galaxy driver enforces).
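The scheme constraint amounts to a simple consistency check, sketched below in Python. The function name is hypothetical and this is not the client's actual validation code; treating any scheme other than http or https as a mismatch is an assumption.

```python
from urllib.parse import urlsplit


def use_tls_matches_scheme(endpoint: str, use_tls: bool) -> bool:
    """True when UseTls agrees with the endpoint's URL scheme."""
    scheme = urlsplit(endpoint).scheme.lower()
    if scheme == "https":
        return use_tls
    if scheme == "http":  # plaintext HTTP/2 (h2c)
        return not use_tls
    return False  # assumption: any other scheme counts as a mismatch
```

With the dev-rig gateway, `use_tls_matches_scheme("http://10.100.0.48:5120", False)` passes, while the same endpoint with UseTls:true would be rejected before any call is made.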
Live-verify scope
| Driver | Live-verify status | Notes |
|---|---|---|
| Modbus | Verified | Dev-rig sim 10.100.0.35:5020; green vs sim, RED vs wrong port / non-Modbus server, timeout vs black-hole IP |
| OpcUaClient | Verified | opc-plc 10.100.0.35:50000; same three-scenario matrix |
| S7 | Verified | python-snap7 10.100.0.35:1102 |
| AbCip | Verified | CIP sim 10.100.0.35:44818 |
| Galaxy | Verified | mxaccessgw 10.100.0.48:5120; Unauthenticated reply counts as Ok |
| AbLegacy | Deferred | No PLC5/SLC sim; unit-proven + code path identical to AbCip |
| TwinCAT | Deferred | No ADS target; unit-proven + degrade guard tested |
| FOCAS | Deferred | No CNC + FWLIB on dev host; degrade guard is the CI-observable path |
Implementation references
- Phase 5 design: docs/plans/2026-06-16-stillpending-phase-5-probes-design.md
- Parent roadmap: docs/plans/2026-06-15-stillpending-backlog-design.md §Phase 5
- AdminUI probe flow: docs/plans/2026-05-28-adminui-driver-pages-design.md §4
- Per-driver probe implementations: src/Drivers/ZB.MOM.WW.OtOpcUa.Driver.<Type>/<Type>DriverProbe.cs
- IDriverProbe contract: src/Core/ZB.MOM.WW.OtOpcUa.Core.Abstractions/IDriverProbe.cs
- Probe dispatch + timeout clamp: src/Server/ZB.MOM.WW.OtOpcUa.Host/Actors/AdminOperationsActor.cs (around line 284)