lmxopcua/docs/drivers/TestConnectProbes.md
Joseph Doherty 961b2b558d docs(phase5): real Test-Connect handshakes per driver + degrade semantics
Create docs/drivers/TestConnectProbes.md: full reference for the Phase 5
protocol-handshake probes — result contract, per-driver handshake table,
TwinCAT/FOCAS/Galaxy degrade semantics, live-verify scope, and the
Historian.Wonderware already-done note. Annotate the Phase 7 step in
docs/plans/2026-05-28-adminui-driver-pages-design.md with a shipped note
pointing at the phase-5 design doc and TestConnectProbes.md.
2026-06-16 07:06:47 -04:00


Test-Connect Probes — Protocol Handshakes

Each driver's Test-Connect button in the AdminUI runs a probe against the form's current config (never the persisted row, never the live driver actor). Before Phase 5 (shipped 2026-06-16) every probe was a bare TCP ConnectAsync — a live-but-rejecting device showed a healthy green tick, and the operator only discovered the truth when the driver faulted at deploy. Phase 5 replaced each TCP-only probe with a real protocol handshake so a reachable-but-wrong or actively-rejecting endpoint now reads RED.

The IDriverProbe / DriverProbeResult contract and DI registration are unchanged. Probes run in a transient actor with a timeout clamp of 160 s and must not mutate any state.
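As a rough illustration of the clamp-then-run idea, the sketch below clamps a requested timeout and maps expiry to the timeout message template. `ProbeRunner`, `ClampTimeout`, and the caller-supplied bounds are hypothetical names for this example; the real dispatch and clamp live in AdminOperationsActor and may look quite different.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Record shape as given in this document's result contract.
public sealed record DriverProbeResult(bool Ok, string? Message, TimeSpan? Latency);

// Hypothetical helper for illustration only; not the repository's actor code.
public static class ProbeRunner
{
    // Clamp a requested timeout into [min, max]. The bounds are assumptions
    // supplied by the caller, not values taken from the code base.
    public static TimeSpan ClampTimeout(TimeSpan requested, TimeSpan min, TimeSpan max) =>
        requested < min ? min : requested > max ? max : requested;

    // Run a probe delegate under a timeout. The delegate must honor the token;
    // cancellation is mapped to the "Probe timed out" template.
    public static async Task<DriverProbeResult> RunAsync(
        Func<CancellationToken, Task<DriverProbeResult>> probe, TimeSpan timeout)
    {
        using var cts = new CancellationTokenSource(timeout);
        try
        {
            return await probe(cts.Token);
        }
        catch (OperationCanceledException) when (cts.IsCancellationRequested)
        {
            return new DriverProbeResult(
                false, $"Probe timed out after {timeout.TotalSeconds:0}s.", null);
        }
    }
}
```

The probe delegate receives only a token and returns a result, which keeps it free of side effects on shared state in this sketch.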

For the AdminUI probe flow (button → AdminOperationsActor → transient probe actor), see docs/plans/2026-05-28-adminui-driver-pages-design.md §4.


Result contract

All probes return a consistent DriverProbeResult(bool Ok, string? Message, TimeSpan? Latency). The message templates below are uniform across all 8 drivers:

| Outcome | Ok | Message template |
|---|---|---|
| TCP connect fails | false | "Connect failed: {SocketErrorCode}" |
| TCP ok + handshake ok | true | driver-specific descriptive string (see table below) |
| TCP ok but handshake rejected | false | "Reachable at {host}:{port} but {proto} handshake failed: {detail}" |
| Timeout | false | "Probe timed out after {n}s." |

The third row is the key new behavior: a reachable device that answers on the port but rejects the protocol-level handshake now surfaces a false result with a human-readable explanation rather than a false-green TCP-open tick.
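The four outcomes can be pictured as four small factory methods over the result record. This is a sketch under assumptions: the `ProbeResults` class and its method names are hypothetical, and leaving `Latency` null on failures is a guess rather than documented behavior.

```csharp
using System;
using System.Net.Sockets;

// Record shape as given in this document's result contract.
public sealed record DriverProbeResult(bool Ok, string? Message, TimeSpan? Latency);

// Hypothetical helper names; the message templates are the ones listed in the table.
public static class ProbeResults
{
    // TCP connect failed: report the socket error code by name.
    public static DriverProbeResult ConnectFailed(SocketError code) =>
        new(false, $"Connect failed: {code}", null);

    // TCP ok and handshake ok: the description is driver-specific.
    public static DriverProbeResult HandshakeOk(string description, TimeSpan latency) =>
        new(true, description, latency);

    // TCP ok but the protocol handshake was rejected: RED with an explanation.
    public static DriverProbeResult HandshakeRejected(
        string host, int port, string proto, string detail) =>
        new(false, $"Reachable at {host}:{port} but {proto} handshake failed: {detail}", null);

    // The probe exceeded its timeout.
    public static DriverProbeResult TimedOut(int seconds) =>
        new(false, $"Probe timed out after {seconds}s.", null);
}
```

For example, `ProbeResults.ConnectFailed(SocketError.ConnectionRefused).Message` yields "Connect failed: ConnectionRefused".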


Per-driver handshake

| Driver | Handshake | Ok message | Dev-rig target |
|---|---|---|---|
| Modbus | FC03 (Read Holding Registers, qty 1 @ addr 0) via ModbusTcpTransport. A Modbus exception PDU still proves a real Modbus device → Ok. A non-MBAP reply → handshake fail. | "Modbus FC03 OK" | 10.100.0.35:5020 (Modbus sim) |
| OpcUaClient | DiscoveryClient.GetEndpointsAsync: no session, no app-cert, no auth. ≥ 1 endpoint → Ok. A non-OPC-UA TCP server throws or times out → handshake fail. | "OPC UA: N endpoint(s)" | opc.tcp://10.100.0.35:50000 (opc-plc) |
| S7 | Plc.OpenAsync (COTP CR/CC + S7 setup-communication), check IsConnected, then Close. Wrong rack/slot or a non-S7 server causes OpenAsync to throw → handshake fail. | "S7 connected (CPU …)" | 10.100.0.35:1102 (python-snap7 sim) |
| AbCip | libplctag Tag InitializeAsync (EIP session + CIP Forward Open). A CIP-level error such as tag-not-found still proves the controller answered CIP → Ok. A session/ForwardOpen/connect error → handshake fail. | "CIP session OK" | 10.100.0.35:44818 (CIP sim) |
| AbLegacy | Same libplctag InitializeAsync handshake as AbCip, PCCC protocol family. | "CIP session OK" (PCCC family) | Deferred (no PLC5/SLC sim) |
| TwinCAT | AdsClient.Connect + ReadStateAsync. See degrade semantics below. | "ADS state: {state}" | Deferred (no ADS target) |
| FOCAS | cnc_allclibhndl3 via FWLIB P/Invoke (Wire.WireFocasClient). See degrade semantics below. | "FOCAS handle OK" | Deferred (no CNC + FWLIB) |
| Galaxy | gRPC unary call to GalaxyRepository.TestConnection on the configured mxaccessgw endpoint. See auth-rejection rule below. | "gateway gRPC OK" | http://10.100.0.48:5120 (mxaccessgw) |

Historian.Wonderware already performed a real handshake (Hello / HelloAck) before Phase 5 and was not changed by this work. See Historian.Wonderware.md for details.
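To make the Modbus row concrete, here is a byte-level sketch of an FC03 request (quantity 1 at address 0) and of the "exception PDU still counts, non-MBAP does not" check. It works on raw Modbus TCP frames rather than through ModbusTcpTransport, and the class name, method names, and the unit id parameter are assumptions made for illustration.

```csharp
using System;

// Hypothetical frame helpers; the real probe goes through ModbusTcpTransport.
public static class ModbusProbeFrames
{
    // Build a Modbus TCP request: 7-byte MBAP header followed by a 5-byte FC03 PDU.
    public static byte[] BuildFc03Request(ushort transactionId, byte unitId) => new byte[]
    {
        (byte)(transactionId >> 8), (byte)(transactionId & 0xFF), // transaction id
        0x00, 0x00,                                               // protocol id 0 = Modbus
        0x00, 0x06,                                               // length: unit id + PDU
        unitId,
        0x03,                                                     // FC03 Read Holding Registers
        0x00, 0x00,                                               // start address 0
        0x00, 0x01,                                               // quantity 1
    };

    // True when the reply is a plausible answer to our FC03 request.
    // A normal reply (0x03) and an exception reply (0x83) both count,
    // because either one proves a Modbus device answered.
    public static bool LooksLikeModbusReply(ReadOnlySpan<byte> reply, ushort transactionId)
    {
        if (reply.Length < 9) return false; // MBAP (7) + function code + 1 byte minimum
        bool tidMatches = reply[0] == (byte)(transactionId >> 8)
                       && reply[1] == (byte)(transactionId & 0xFF);
        bool protocolIsModbus = reply[2] == 0 && reply[3] == 0;
        byte functionCode = reply[7];
        bool functionMatches = functionCode == 0x03 || functionCode == 0x83;
        return tidMatches && protocolIsModbus && functionMatches;
    }
}
```

A reply that fails this check (for example, an HTTP banner on the same port) would map to the "handshake failed" row of the result contract.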


Degrade semantics

Three drivers have environmental constraints that can prevent the handshake from running on certain hosts. The degradation principle is that the probe must never produce a result worse than the pre-Phase 5 TCP-only probe. A genuine protocol rejection from a reachable device is a correct RED; an inability to run the handshake at all (no FWLIB, no managed router) degrades to the existing TCP-reachability message, which is still a green tick but annotated.

TwinCAT degrade

Where the handshake is available:

  • AdsClient.Connect(netId, port) + ReadStateAsync → Ok=true, "ADS state: {state}" (Run / Config / Stop).
  • An ADS route-table rejection from a reachable ADS router is a true RED: "Reachable at {host}:{port} but ADS handshake failed: {detail} — check the target's ADS route table authorizes this host". This is the correct result: the driver would also be unable to function without an authorized route.

Where the handshake is unavailable (headless server, no TwinCAT runtime, the managed AMS router cannot start):

  • Probe degrades to TCP-reachability: Ok=true, "(ADS handshake unavailable on this host — TCP reachability only)".

FOCAS degrade

On a Windows host with the FANUC FWLIB shared library present:

  • cnc_allclibhndl3 is called via the existing Wire.WireFocasClient P/Invoke. A successful handle allocation → Ok=true, "FOCAS handle OK".
  • A CNC-level rejection → handshake fail.

On dev, Linux, or macOS (no native FWLIB — UnimplementedFocasClientFactory gates the driver):

  • DllNotFoundException / NotSupportedException is caught and the probe degrades to TCP-reachability: Ok=true, "(FOCAS handshake unavailable on this host — FWLIB absent, TCP reachability only)".
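The degrade path described above can be sketched as a small wrapper: try the handshake, and if the host cannot run it at all, fall back to TCP reachability and annotate a green result. `DegradeGuard` and its parameters are hypothetical, and appending the note to the existing TCP message is an assumption about how the annotation is composed.

```csharp
using System;
using System.Threading.Tasks;

// Record shape as given in this document's result contract.
public sealed record DriverProbeResult(bool Ok, string? Message, TimeSpan? Latency);

// Hypothetical wrapper for illustration; not the repository's probe code.
public static class DegradeGuard
{
    public static async Task<DriverProbeResult> RunAsync(
        Func<Task<DriverProbeResult>> handshake,
        Func<Task<DriverProbeResult>> tcpReachability,
        string degradeNote)
    {
        try
        {
            // A genuine rejection is returned by the handshake as Ok=false and stays RED.
            return await handshake();
        }
        catch (Exception ex) when (ex is DllNotFoundException or NotSupportedException)
        {
            // The handshake cannot run on this host: fall back to TCP reachability.
            var tcp = await tcpReachability();
            if (!tcp.Ok) return tcp; // a failed TCP connect is reported unchanged

            // Assumption: the annotation is appended to the TCP message.
            var message = string.IsNullOrEmpty(tcp.Message)
                ? degradeNote
                : $"{tcp.Message} {degradeNote}";
            return tcp with { Message = message };
        }
    }
}
```

Only the two "cannot run here" exception types trigger the fallback; any other failure surfaces as-is, so the guard never turns a real rejection into a green tick.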

Galaxy auth-rejection rule

The probe builds the gRPC channel from the form's config and issues one lightweight unary call. It does not resolve secretref: secrets — the key string in the transient config (possibly empty or unresolved) is used as-is.

  • Unavailable / transport failure → Ok=false (gateway is down or unreachable).
  • Unauthenticated / PermissionDenied → Ok=true, "gateway reachable & speaking gRPC; auth not checked". An auth rejection proves a live mxaccessgw gRPC server. This is the correct result: the driver's own session layer will handle auth; the probe is testing reachability only.
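The rule above amounts to a mapping from the call's status to a probe result. The sketch below uses a local `GatewayStatus` enum as a stand-in for the gRPC status codes named in this section, so that it stays self-contained; the class name, the enum, and the failure messages (which simply echo a caller-supplied detail string) are assumptions, since this document does not specify them.

```csharp
using System;

// Record shape as given in this document's result contract.
public sealed record DriverProbeResult(bool Ok, string? Message, TimeSpan? Latency);

// Stand-in for the gRPC status codes mentioned in this section. A real probe
// would read the status from the gRPC client library instead.
public enum GatewayStatus { Ok, Unavailable, Unauthenticated, PermissionDenied, Other }

// Hypothetical classifier for illustration only.
public static class GalaxyProbeRule
{
    public static DriverProbeResult Classify(GatewayStatus status, string detail, TimeSpan latency) =>
        status switch
        {
            // The unary call succeeded.
            GatewayStatus.Ok =>
                new DriverProbeResult(true, "gateway gRPC OK", latency),

            // An auth rejection still proves a live gRPC server answered.
            GatewayStatus.Unauthenticated or GatewayStatus.PermissionDenied =>
                new DriverProbeResult(
                    true, "gateway reachable & speaking gRPC; auth not checked", latency),

            // Gateway down or unreachable. Message wording is an assumption.
            GatewayStatus.Unavailable =>
                new DriverProbeResult(false, detail, null),

            // Other statuses are not covered by this document; treated as failure here.
            _ => new DriverProbeResult(false, detail, null),
        };
}
```

Keeping the auth cases green matches the stated intent that the probe tests reachability only and leaves authentication to the driver's own session layer.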

Live-verify scope

| Driver | Live-verify status | Notes |
|---|---|---|
| Modbus | Verified | Dev-rig sim 10.100.0.35:5020; green vs sim, RED vs wrong port / non-Modbus server, timeout vs black-hole IP |
| OpcUaClient | Verified | opc-plc 10.100.0.35:50000; same three-scenario matrix |
| S7 | Verified | python-snap7 10.100.0.35:1102 |
| AbCip | Verified | CIP sim 10.100.0.35:44818 |
| Galaxy | Verified | mxaccessgw 10.100.0.48:5120; Unauthenticated reply counts as Ok |
| AbLegacy | Deferred | No PLC5/SLC sim; unit-proven + code path identical to AbCip |
| TwinCAT | Deferred | No ADS target; unit-proven + degrade guard tested |
| FOCAS | Deferred | No CNC + FWLIB on dev host; degrade guard is the CI-observable path |

Implementation references

  • Phase 5 design: docs/plans/2026-06-16-stillpending-phase-5-probes-design.md
  • Parent roadmap: docs/plans/2026-06-15-stillpending-backlog-design.md §Phase 5
  • AdminUI probe flow: docs/plans/2026-05-28-adminui-driver-pages-design.md §4
  • Per-driver probe implementations: src/Drivers/ZB.MOM.WW.OtOpcUa.Driver.<Type>/<Type>DriverProbe.cs
  • IDriverProbe contract: src/Core/ZB.MOM.WW.OtOpcUa.Core.Abstractions/IDriverProbe.cs
  • Probe dispatch + timeout clamp: src/Server/ZB.MOM.WW.OtOpcUa.Host/Actors/AdminOperationsActor.cs (around line 284)