lmxopcua/docs/drivers/TestConnectProbes.md
Joseph Doherty 1164d423b6
fix(probe): Galaxy gRPC ping — drop invalid Retry, treat MxGatewayAuth exceptions as reachable (live /run)
Two bugs caught by live verification against the mxaccessgw at 10.100.0.48:5120:
- MaxAttempts=1 produced an invalid Polly RetryStrategyOptions -> the probe failed
  on every real gateway. Removed the Retry override (matches GalaxyDriver); fail-fast
  is already guaranteed by the TCP preflight + the per-call deadline.
- A rejected key surfaces as a typed MxGatewayAuthenticationException, not a raw
  RpcException, so 'auth-rejection = reachable' was bypassed. Catch the typed auth/
  authorization exceptions -> Ok=true.
Adds DriverProbeHandshakeE2eTests: direct-probe, skip-gated cross-protocol green/red
discrimination (Modbus, OpcUaClient, Galaxy + a local real OPC UA server).
2026-06-16 07:32:59 -04:00


Test-Connect Probes — Protocol Handshakes

Each driver's Test-Connect button in the AdminUI runs a probe against the form's current config (never the persisted row, never the live driver actor). Before Phase 5 (shipped 2026-06-16) every probe was a bare TCP ConnectAsync — a live-but-rejecting device showed a healthy green tick, and the operator only discovered the truth when the driver faulted at deploy. Phase 5 replaced each TCP-only probe with a real protocol handshake so a reachable-but-wrong or actively-rejecting endpoint now reads RED.

The IDriverProbe / DriverProbeResult contract and DI registration are unchanged. Probes run in a transient actor with a timeout clamp of 160 s and must not mutate any state.

For the AdminUI probe flow (button → AdminOperationsActor → transient probe actor), see docs/plans/2026-05-28-adminui-driver-pages-design.md §4.


Result contract

All probes return a consistent DriverProbeResult(bool Ok, string? Message, TimeSpan? Latency). The message templates below are uniform across all 8 drivers:

| Outcome | Ok | Message template |
|---|---|---|
| TCP connect fails | false | "Connect failed: {SocketErrorCode}" |
| TCP ok + handshake ok | true | driver-specific descriptive string (see table below) |
| TCP ok but handshake rejected | false | "Reachable at {host}:{port} but {proto} handshake failed: {detail}" |
| Timeout | false | "Probe timed out after {n}s." |

The third row is the key new behavior: a reachable device that answers on the port but rejects the protocol-level handshake now surfaces a false result with a human-readable explanation rather than a false-green TCP-open tick.
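
The four outcomes above can be sketched as follows. This is a minimal, language-neutral illustration in Python, not the project's C# code: the dataclass mirrors the DriverProbeResult shape quoted above, while the helper function names are hypothetical, and leaving the latency unset on failures is an assumption.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import Optional


@dataclass(frozen=True)
class DriverProbeResult:
    """Python stand-in for DriverProbeResult(bool Ok, string? Message, TimeSpan? Latency)."""
    ok: bool
    message: Optional[str] = None
    latency: Optional[timedelta] = None  # assumption: left unset on failures


# Hypothetical helpers, one per row of the outcome table.
def connect_failed(socket_error_code: str) -> DriverProbeResult:
    return DriverProbeResult(False, f"Connect failed: {socket_error_code}")


def handshake_ok(message: str, latency: timedelta) -> DriverProbeResult:
    # The message is the driver-specific string, e.g. "Modbus FC03 OK".
    return DriverProbeResult(True, message, latency)


def handshake_rejected(host: str, port: int, proto: str, detail: str) -> DriverProbeResult:
    return DriverProbeResult(
        False, f"Reachable at {host}:{port} but {proto} handshake failed: {detail}"
    )


def timed_out(seconds: int) -> DriverProbeResult:
    return DriverProbeResult(False, f"Probe timed out after {seconds}s.")
```

Keeping the templates in one place like this is one way to make the messages uniform across drivers; the actual implementation may format them per probe.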


Per-driver handshake

| Driver | Handshake | Ok message | Dev-rig target |
|---|---|---|---|
| Modbus | FC03 (Read Holding Registers, qty 1 @ addr 0) via ModbusTcpTransport. A Modbus exception PDU still proves a real Modbus device → Ok. A non-MBAP reply → handshake fail. | "Modbus FC03 OK" | 10.100.0.35:5020 (Modbus sim) |
| OpcUaClient | DiscoveryClient.GetEndpointsAsync (no session, no app-cert, no auth). ≥ 1 endpoint → Ok. A non-OPC-UA TCP server throws or times out → handshake fail. | "OPC UA: N endpoint(s)" | opc.tcp://10.100.0.35:50000 (opc-plc) |
| S7 | Plc.OpenAsync (COTP CR/CC + S7 setup-communication), check IsConnected, then Close. Wrong rack/slot or a non-S7 server causes OpenAsync to throw → handshake fail. | "S7 connected (CPU …)" | 10.100.0.35:1102 (python-snap7 sim) |
| AbCip | libplctag Tag InitializeAsync (EIP session + CIP Forward Open). A CIP-level error such as tag-not-found still proves the controller answered CIP → Ok. A session/ForwardOpen/connect error → handshake fail. | "CIP session OK" | 10.100.0.35:44818 (CIP sim) |
| AbLegacy | Same libplctag InitializeAsync handshake as AbCip, PCCC protocol family. | "CIP session OK" (PCCC family) | Deferred (no PLC5/SLC sim) |
| TwinCAT | AdsClient.Connect + ReadStateAsync. See degrade semantics below. | "ADS state: {state}" | Deferred (no ADS target) |
| FOCAS | cnc_allclibhndl3 via a direct DllImport("fwlib32") in the probe. See degrade semantics below. | "FOCAS handle OK" | Deferred (no CNC + FWLIB) |
| Galaxy | gRPC unary call to GalaxyRepository.TestConnection on the configured mxaccessgw endpoint. See auth-rejection rule below. | "gateway gRPC OK" | http://10.100.0.48:5120 (mxaccessgw) |
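
To make the Modbus row concrete, here is a rough sketch of the FC03 frame and the reply classification it describes, written in Python rather than the project's C#. The function names, the default transaction and unit IDs, and the detail strings other than "Modbus FC03 OK" are assumptions; the frame layout follows the standard Modbus TCP MBAP header.

```python
import struct
from typing import Tuple


def build_fc03_request(transaction_id: int = 1, unit_id: int = 1) -> bytes:
    """MBAP header + FC03 PDU reading 1 holding register at address 0."""
    pdu = struct.pack(">BHH", 0x03, 0, 1)  # function code, start address, quantity
    # MBAP: transaction id, protocol id (always 0), length (unit id + PDU), unit id
    mbap = struct.pack(">HHHB", transaction_id, 0, len(pdu) + 1, unit_id)
    return mbap + pdu


def classify_fc03_reply(reply: bytes, transaction_id: int = 1) -> Tuple[bool, str]:
    """Return (ok, detail) following the rules in the Modbus row."""
    if len(reply) < 9:  # 7-byte MBAP + at least function code and one more byte
        return False, "reply too short for MBAP + PDU"
    tid, proto, length, _unit = struct.unpack(">HHHB", reply[:7])
    if proto != 0 or tid != transaction_id or length != len(reply) - 6:
        return False, "non-MBAP reply"
    function = reply[7]
    if function == 0x03:
        return True, "Modbus FC03 OK"
    if function == 0x83:
        # Exception PDU: the device spoke Modbus, so it still counts as Ok.
        # The wording of this detail string is hypothetical.
        return True, f"Modbus exception 0x{reply[8]:02X}"
    return False, f"unexpected function code 0x{function:02X}"
```

The point of the second branch is that a register the device refuses to serve is a configuration detail, not a reachability problem, so only a reply that fails MBAP framing is treated as a handshake failure.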

Historian.Wonderware already performed a real handshake (Hello → HelloAck) before Phase 5 and was not changed by this work. See Historian.Wonderware.md for details.


Degrade semantics

Three drivers need special-case result handling. TwinCAT and FOCAS have environmental constraints that can prevent the handshake from running on certain hosts, and Galaxy treats an auth rejection as proof of reachability. The degradation principle is that the probe must never produce a result worse than the pre-Phase 5 TCP-only probe. A genuine protocol rejection from a reachable device is a correct RED. An inability to run the handshake at all (no FWLIB, no managed router) degrades to the existing TCP-reachability message: still a green tick, but annotated.

TwinCAT degrade

Where the handshake is available:

  • AdsClient.Connect(netId, port) + ReadStateAsync → Ok=true, "ADS state: {state}" (Run / Config / Stop).
  • An ADS route-table rejection from a reachable ADS router is a true RED: "Reachable at {host}:{port} but ADS handshake failed: {detail} — check the target's ADS route table authorizes this host". This is the correct result: the driver would also be unable to function without an authorized route.

Where the handshake is unavailable (headless server, no TwinCAT runtime, the managed AMS router cannot start):

  • Probe degrades to TCP-reachability: Ok=true, "(ADS handshake unavailable on this host — TCP reachability only)".

FOCAS degrade

On a Windows host with the FANUC FWLIB shared library present:

  • cnc_allclibhndl3 is called via a direct DllImport("fwlib32") declared in the probe (the production Wire.WireFocasClient is a pure-managed FOCAS/2 TCP client, not an FWLIB P/Invoke, so the probe carries its own native binding). A successful handle allocation → Ok=true, "FOCAS handle OK".
  • A CNC-level rejection → handshake fail.

On dev, Linux, or macOS (no native FWLIB — UnimplementedFocasClientFactory gates the driver):

  • DllNotFoundException / NotSupportedException is caught and the probe degrades to TCP-reachability: Ok=true, "(FOCAS handshake unavailable on this host — FWLIB absent, TCP reachability only)".
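
The TwinCAT and FOCAS rules share one shape, which can be sketched as below. This is an illustrative Python outline under stated assumptions, not the C# probe code: the two exception classes and the callable parameters are hypothetical stand-ins for "the host cannot run the handshake" (for example DllNotFoundException) and "a reachable device refused it", and the connect-failure message is simplified.

```python
from typing import Callable, Tuple


class HandshakeUnavailable(Exception):
    """The host cannot run the handshake at all (FWLIB absent, AMS router cannot start)."""


class HandshakeRejected(Exception):
    """A reachable device answered but refused the protocol handshake."""


def probe_with_degrade(
    host: str,
    port: int,
    proto: str,
    tcp_reachable: Callable[[], bool],
    handshake: Callable[[], str],
) -> Tuple[bool, str]:
    if not tcp_reachable():
        # Simplified: the real template carries the SocketErrorCode.
        return False, "Connect failed"
    try:
        return True, handshake()  # e.g. "ADS state: Run" or "FOCAS handle OK"
    except HandshakeUnavailable:
        # Never worse than the TCP-only probe: stay green, but annotate.
        return True, f"({proto} handshake unavailable on this host — TCP reachability only)"
    except HandshakeRejected as exc:
        # A genuine protocol rejection is a correct RED.
        return False, f"Reachable at {host}:{port} but {proto} handshake failed: {exc}"
```

The ordering matters: TCP reachability is established first, so the degraded green result is only ever reported for an endpoint that at least accepted the connection.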

Galaxy auth-rejection rule

The probe builds the gRPC channel from the form's config and issues one lightweight unary call. It does not resolve secretref: secrets — the key string in the transient config (possibly empty or unresolved) is used as-is.

  • Unavailable / transport failure → Ok=false (gateway is down or unreachable).
  • Unauthenticated / PermissionDenied → Ok=true, "gateway reachable & speaking gRPC (auth not checked)". An auth rejection proves a live mxaccessgw gRPC server, which is the correct result: the driver's own session layer handles auth, and the probe tests reachability only.

The mxaccessgw client surfaces a rejected key as a typed MxGatewayAuthenticationException / MxGatewayAuthorizationException, not a raw RpcException, so the probe catches both typed exceptions and maps them to the reachable result above. (Live verification on 10.100.0.48:5120 with no key returns MxGatewayAuthenticationException("Missing or invalid API key.") → Ok=true.)
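
The auth-rejection rule amounts to a small mapping from call outcome to probe result. The sketch below expresses it in Python with gRPC status code names as plain strings; the function name is hypothetical, and treating every other status as Ok=false (along with the wording of that failure message) is an assumption, since the text above only names Unavailable and transport failures.

```python
from typing import Optional, Tuple

# Status codes that prove a live gRPC server answered, per the rule above.
REACHABLE_AUTH_CODES = {"Unauthenticated", "PermissionDenied"}


def classify_gateway_call(status_code: Optional[str]) -> Tuple[bool, str]:
    """status_code is None when the unary call succeeded, else the gRPC status name."""
    if status_code is None:
        return True, "gateway gRPC OK"
    if status_code in REACHABLE_AUTH_CODES:
        return True, "gateway reachable & speaking gRPC (auth not checked)"
    # Assumption: any other status, including Unavailable, is a failure.
    return False, f"gateway call failed: {status_code}"
```

In the real probe the typed MxGatewayAuthenticationException and MxGatewayAuthorizationException would feed the second branch, because the client wraps the status before the probe sees it.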

Config note: UseTls must match the endpoint scheme — UseTls:false for an http:// (h2c) gateway, UseTls:true for https://. A mismatch fails the client's own validation (the same constraint the Galaxy driver enforces).
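
A check equivalent to the UseTls constraint might look like the following Python sketch. The function name is hypothetical, and rejecting schemes other than http and https is an assumption; the client's actual validation may behave differently.

```python
from urllib.parse import urlparse


def use_tls_matches_scheme(endpoint: str, use_tls: bool) -> bool:
    """True when UseTls agrees with the endpoint scheme (http = h2c, https = TLS)."""
    scheme = urlparse(endpoint).scheme.lower()
    if scheme == "https":
        return use_tls
    if scheme == "http":
        return not use_tls
    return False  # assumption: any other scheme is rejected
```

For the dev-rig gateway, `use_tls_matches_scheme("http://10.100.0.48:5120", False)` passes, while the same endpoint with `True` would fail validation before any call is made.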


Live-verify scope

| Driver | Live-verify status | Notes |
|---|---|---|
| Modbus | Verified | Dev-rig sim 10.100.0.35:5020; green vs sim, RED vs wrong port / non-Modbus server, timeout vs black-hole IP |
| OpcUaClient | Verified | opc-plc 10.100.0.35:50000; same three-scenario matrix |
| S7 | Verified | python-snap7 10.100.0.35:1102 |
| AbCip | Verified | CIP sim 10.100.0.35:44818 |
| Galaxy | Verified | mxaccessgw 10.100.0.48:5120; Unauthenticated reply counts as Ok |
| AbLegacy | Deferred | No PLC5/SLC sim; unit-proven + code path identical to AbCip |
| TwinCAT | Deferred | No ADS target; unit-proven + degrade guard tested |
| FOCAS | Deferred | No CNC + FWLIB on dev host; degrade guard is the CI-observable path |

Implementation references

  • Phase 5 design: docs/plans/2026-06-16-stillpending-phase-5-probes-design.md
  • Parent roadmap: docs/plans/2026-06-15-stillpending-backlog-design.md §Phase 5
  • AdminUI probe flow: docs/plans/2026-05-28-adminui-driver-pages-design.md §4
  • Per-driver probe implementations: src/Drivers/ZB.MOM.WW.OtOpcUa.Driver.<Type>/<Type>DriverProbe.cs
  • IDriverProbe contract: src/Core/ZB.MOM.WW.OtOpcUa.Core.Abstractions/IDriverProbe.cs
  • Probe dispatch + timeout clamp: src/Server/ZB.MOM.WW.OtOpcUa.Host/Actors/AdminOperationsActor.cs (around line 284)