Still-Pending Phase 5 — Test-Connect protocol handshakes — design

Status: approved 2026-06-16. Parent roadmap: docs/plans/2026-06-15-stillpending-backlog-design.md (Phase 5). Source backlog: stillpending.md §2 ("Test-Connect probes are TCP-only") + plan 2026-05-28-adminui-driver-pages Phase 7 + 2026-06-12-historian-tcp-transport task 9. Branch feat/stillpending-phase-5-probes off master 050f5c4b. Phases 0-4 already shipped.

Goal

Replace the bare-TCP socket.ConnectAsync Test-Connect probes with real protocol handshakes so a live-but-rejecting device reads RED, not green. Today a firewalled port, a non-Modbus TCP server, a PLC at the wrong rack/slot, or a down-but-port-forwarded OPC UA endpoint all surface a healthy tick — the operator gets a false "connection OK" and only discovers the truth when the driver faults at deploy.

Grounding (verified this session)

  • All 8 probes are byte-identical TCP-only boilerplate (deserialize → ExtractTarget → Socket.ConnectAsync → close): Modbus, S7, AbCip, AbLegacy, TwinCAT, OpcUaClient, Galaxy, FOCAS.
  • Historian.Wonderware is ALREADY a real handshake — it sends a Hello and confirms HelloAck (WonderwareHistorianDriverProbe.cs:54-71). So §2's "task 9 / historian probe" is already done; this phase does not touch it (only documents it).
  • Every driver already owns the handshake primitive in its client code — no new package references:
    • Modbus: ModbusTcpTransport.ConnectAsync + IModbusTransport.SendAsync(unitId, pdu, ct).
    • S7: new Plc(cpu, host, port, rack, slot).OpenAsync(ct) (COTP CR/CC + S7 setup-communication).
    • AbCip / AbLegacy: libplctag Tag InitializeAsync (opens the EIP session + first CIP op).
    • TwinCAT: AdsClient.Connect(netId, port) + ReadStateAsync (AdsTwinCATClient.cs:90,194).
    • OpcUaClient: DiscoveryClient.GetEndpointsAsync (no session / cert / auth — OpcUaClientDriver.cs:422).
    • Galaxy: the MxGateway.Client gRPC channel + one lightweight unary call.
    • FOCAS: cnc_allclibhndl3 via the existing wire P/Invoke (Wire.WireFocasClient).
  • Probe dispatch clamps the timeout to 1-60 s and passes a cancelled-on-timeout ct (AdminOperationsActor.cs:284-291). Probes MUST honour it and MUST NOT mutate state (read-only handshakes).
  • A proven skip-gated E2E harness exists (DriverTestConnectE2eTests) targeting the live Modbus sim with happy / wrong-port / black-hole scenarios, auto-skipping when the fixture is unreachable. The dev-rig sims (Modbus :5020, AbCip :44818, S7 :1102, opc-plc :50000) and the mxaccessgw (10.100.0.48:5120) are reachable from the dev Mac, so 5 of the 8 are live-verifiable agent-side.
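The timeout rule in the grounding notes (clamp, then hand the probe a token that is cancelled on timeout) could look roughly like the sketch below. It is not the code in AdminOperationsActor.cs: ProbeTimeout, Clamp, RunAsync, and the caller-supplied bounds are hypothetical names used only for illustration.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical helper; the real dispatch lives in AdminOperationsActor.cs and may differ.
public static class ProbeTimeout
{
    // Clamp a requested timeout into [min, max]. The bounds are supplied by the caller.
    public static TimeSpan Clamp(TimeSpan requested, TimeSpan min, TimeSpan max)
    {
        if (requested < min) return min;
        if (requested > max) return max;
        return requested;
    }

    // Run a probe with a token that is cancelled once the timeout elapses.
    // Completed == false means the probe was cut off by the timeout.
    public static async Task<(bool Completed, T Value)> RunAsync<T>(
        Func<CancellationToken, Task<T>> probe,
        TimeSpan timeout,
        CancellationToken outer = default)
    {
        using var cts = CancellationTokenSource.CreateLinkedTokenSource(outer);
        cts.CancelAfter(timeout);
        try
        {
            return (true, await probe(cts.Token).ConfigureAwait(false));
        }
        catch (OperationCanceledException)
            when (cts.IsCancellationRequested && !outer.IsCancellationRequested)
        {
            // Timeout, not a caller-initiated cancel: report it instead of throwing.
            return (false, default!);
        }
    }
}
```

A probe that honours the token only has to pass it to every awaited call; the wrapper then turns the cancellation into the "timed out" outcome.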

Architecture

Per-probe, no shared scaffold. Each probe stays self-contained in its own driver project (matches the existing intentional copy-paste style, keeps all 8 projects disjoint → parallelizable like Phase 4, and avoids adding socket/handshake logic to the Core.Abstractions contracts project). Each in-scope probe keeps its TCP preflight and adds one handshake step, reusing the driver's own client primitive.

New three-way result contract (the operator value — message templates kept consistent across all 8):

| Outcome | Result |
| --- | --- |
| TCP connect fails | Ok=false · "Connect failed: {SocketError}" (unchanged) |
| TCP ok + handshake ok | Ok=true · latency · descriptive msg (e.g. "Modbus FC03 OK", "OPC UA: N endpoint(s)", "S7 connected (CPU …)", "CIP session OK", "ADS state: Run", "gateway gRPC OK") |
| TCP ok but handshake rejected | Ok=false · "Reachable at {host}:{port} but {proto} handshake failed: {detail}" (the new behavior) |
| timeout | Ok=false · "Probe timed out after {n}s." (unchanged) |

IDriverProbe / DriverProbeResult are unchanged — no Commons / Core contract touch, no DI change (the 8 probes are already registered in DriverFactoryBootstrap.AddOtOpcUaDriverProbes).
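One way to keep the four message templates from the table above identical across probes is a small set of factory helpers. The sketch below is an assumption-laden illustration: ProbeOutcome stands in for the real DriverProbeResult (whose exact shape is not shown here), and ProbeMessages is a hypothetical name. Because the architecture is per-probe with no shared scaffold, each probe would carry its own copy of such helpers rather than share them.

```csharp
using System;
using System.Net.Sockets;

// Hypothetical stand-in for DriverProbeResult; the real type may have different members.
public sealed record ProbeOutcome(bool Ok, string Message, TimeSpan? Latency = null);

public static class ProbeMessages
{
    // TCP connect fails (unchanged from today's behaviour).
    public static ProbeOutcome ConnectFailed(SocketError error) =>
        new(false, $"Connect failed: {error}");

    // TCP ok + handshake ok: the description is protocol-specific, e.g. "Modbus FC03 OK".
    public static ProbeOutcome HandshakeOk(string description, TimeSpan latency) =>
        new(true, description, latency);

    // TCP ok but handshake rejected: the new RED path.
    public static ProbeOutcome HandshakeRejected(string host, int port, string proto, string detail) =>
        new(false, $"Reachable at {host}:{port} but {proto} handshake failed: {detail}");

    // Timeout (unchanged).
    public static ProbeOutcome TimedOut(int seconds) =>
        new(false, $"Probe timed out after {seconds}s.");
}
```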

Per-driver handshake + degradation

Tier A — real handshake, live-verifiable on the rig (agent-driven)

  1. Modbus: ConnectAsync, then SendAsync one FC03 (Read Holding Registers, qty 1 @ addr 0, unit from config/default 1). Any well-formed MBAP reply that echoes the TxId with protocol-id 0, including a Modbus exception PDU (0x83…), proves a real Modbus device ⇒ Ok. A malformed/non-MBAP reply or silence ⇒ handshake-fail. Sim :5020.
  2. OpcUaClient: DiscoveryClient.GetEndpointsAsync (no session, no app-cert, no auth). ≥1 endpoint ⇒ Ok ("OPC UA: N endpoint(s)"); a non-OPC-UA TCP server throws/times out ⇒ handshake-fail. opc-plc :50000.
  3. S7: new Plc(...).OpenAsync(ct) with ReadTimeout set first (mirror S7Driver.cs:164), check IsConnected, Close. Wrong rack/slot or non-S7 server ⇒ OpenAsync throws ⇒ handshake-fail. python-snap7 :1102.
  4. AbCip — create a libplctag Tag for the first configured tag path (else a benign probe name) and InitializeAsync. Session opens ⇒ Ok; a CIP-level error (tag-not-found / bad-path) still counts as reachable (the controller answered CIP); a session/ForwardOpen/connect error ⇒ handshake-fail. CIP sim :44818.
  5. Galaxy — build the MxGateway.Client gRPC channel (honour the config's cleartext/TLS) and issue one lightweight unary call. The probe does NOT resolve secretref: secrets — it sends whatever key string is in the transient config (possibly empty/unresolved). An OK reply ⇒ Ok; an Unauthenticated / PermissionDenied reply also ⇒ Ok ("gateway reachable & speaking gRPC; auth not checked") because it proves a live mxgw server; Unavailable / transport error ⇒ handshake-fail. Gateway 10.100.0.48:5120.
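To make the Modbus acceptance rule in item 1 concrete, the sketch below builds the FC03 request frame and checks a reply against it at the byte level. It is illustrative only: the real probe reuses ModbusTcpTransport and IModbusTransport.SendAsync, which handle framing themselves, and ModbusProbeFrames is a hypothetical name. Treating a mismatched MBAP length field as malformed is an assumption about what "well-formed" means here.

```csharp
using System;
using System.Buffers.Binary;

// Illustrative only: the real probe sends the PDU through IModbusTransport.SendAsync.
public static class ModbusProbeFrames
{
    public const byte ReadHoldingRegisters = 0x03;

    // Build an MBAP-framed FC03 request: read `quantity` registers starting at `address`.
    public static byte[] BuildFc03Request(ushort txId, byte unitId, ushort address = 0, ushort quantity = 1)
    {
        var frame = new byte[12];
        BinaryPrimitives.WriteUInt16BigEndian(frame.AsSpan(0, 2), txId); // transaction id
        BinaryPrimitives.WriteUInt16BigEndian(frame.AsSpan(2, 2), 0);    // protocol id 0 = Modbus
        BinaryPrimitives.WriteUInt16BigEndian(frame.AsSpan(4, 2), 6);    // bytes that follow: unit + 5-byte PDU
        frame[6] = unitId;
        frame[7] = ReadHoldingRegisters;
        BinaryPrimitives.WriteUInt16BigEndian(frame.AsSpan(8, 2), address);
        BinaryPrimitives.WriteUInt16BigEndian(frame.AsSpan(10, 2), quantity);
        return frame;
    }

    // True when the reply is a well-formed MBAP frame answering our request,
    // whether it carries data (0x03) or a Modbus exception (0x83).
    public static bool IsModbusReply(ReadOnlySpan<byte> reply, ushort expectedTxId)
    {
        if (reply.Length < 9) return false; // MBAP header (7) + function code + at least 1 byte
        if (BinaryPrimitives.ReadUInt16BigEndian(reply.Slice(0, 2)) != expectedTxId) return false;
        if (BinaryPrimitives.ReadUInt16BigEndian(reply.Slice(2, 2)) != 0) return false;
        int declared = BinaryPrimitives.ReadUInt16BigEndian(reply.Slice(4, 2));
        if (declared != reply.Length - 6) return false; // assumption: length field must match
        byte function = reply[7];
        return function == ReadHoldingRegisters || function == (ReadHoldingRegisters | 0x80);
    }
}
```

An exception reply such as "illegal data address" passes the check because the device demonstrably parsed and answered a Modbus request, whereas an HTTP banner or an echo server fails on the transaction id or protocol id.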

Tier B — real handshake, unit-proven only (no rig target → live-verify deferred, honestly recorded)

  1. AbLegacy — same libplctag Init handshake as AbCip but the PCCC protocol family (verified-by-proxy via AbCip's identical code path). No PLC5/SLC sim on the rig.
  2. TwinCAT: AdsClient.Connect + ReadStateAsync ⇒ Ok with the ADS state (Run/Config/Stop). An ADS route-table rejection is a true RED (the driver also cannot function without an authorized route; the message says so: "check the target's ADS route table authorizes this host"). Degrade guard: if AdsClient cannot construct/connect headless (managed AMS router unavailable), catch and fall back to the TCP-preflight result with a "ADS handshake unavailable on this host — TCP reachability only" note, so the result is never worse than today. No TwinCAT target on the rig.
  3. FOCAS — attempt cnc_allclibhndl3 via the existing wire P/Invoke. Degrade guard: the FWLIB native lib is absent on the dev box / Linux containers (UnimplementedFocasClientFactory gates the driver), so the call throws DllNotFoundException / NotSupportedException ⇒ catch and fall back to the TCP-preflight result with a "FOCAS handshake unavailable on this host (FWLIB absent) — TCP reachability only" note. A real handshake runs only on a Windows host with FWLIB + a reachable CNC. No CNC on the rig.

Degradation principle: the three Tier-B handshakes must NEVER produce a result worse than today's TCP-only probe. A genuine protocol rejection from a reachable device is a correct RED; an environmental inability to run the handshake at all (no FWLIB, no managed router) degrades to the existing TCP-reachability message.
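The degradation principle can be pictured as a wrapper that separates "this host cannot run the handshake" from "the device rejected the handshake". The sketch below assumes a simplified (Ok, Message) tuple in place of DriverProbeResult; DegradeGuard, RunAsync, and IsFocasUnavailable are hypothetical names, and the two exception types are the ones named for FOCAS above.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical helper; each Tier-B probe would inline its own equivalent.
public static class DegradeGuard
{
    // Run `handshake`. If the host cannot run it at all, return the TCP-preflight
    // result plus a note. Any other exception propagates so the caller can report
    // a genuine handshake rejection (a correct RED).
    public static async Task<(bool Ok, string Message)> RunAsync(
        (bool Ok, string Message) tcpPreflight,
        Func<CancellationToken, Task<(bool Ok, string Message)>> handshake,
        Func<Exception, bool> isEnvironmental,
        string unavailableNote,
        CancellationToken ct = default)
    {
        try
        {
            return await handshake(ct).ConfigureAwait(false);
        }
        catch (Exception ex) when (isEnvironmental(ex))
        {
            // Never worse than today: keep the preflight verdict, add the note.
            return (tcpPreflight.Ok, $"{tcpPreflight.Message} ({unavailableNote})");
        }
    }

    // FOCAS flavour: a missing native FWLIB surfaces as one of these exceptions.
    public static bool IsFocasUnavailable(Exception ex) =>
        ex is DllNotFoundException || ex is NotSupportedException;
}
```

The exception filter is the important design point: widening it to catch everything would silently turn real protocol rejections back into green ticks.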

Testing & verification

  • Unit (per probe, TDD red→green, xUnit + Shouldly): in-process TcpListener drives — (a) invalid/empty config, (b) unreachable → Connect failed, (c) TCP-accepts-then-closes / garbage → handshake-fail (the key new path — cleanly assertable for Modbus via a canned-MBAP server; loopback-accept for the rest), plus Modbus's canned-MBAP happy path and FOCAS's degrade path (DllNotFound on the dev box is the actual CI behavior, so it is directly testable). Galaxy's Unauthenticated⇒Ok is testable against a tiny in-process gRPC server or via the live gateway.
  • Live /run (agent-driven, extends DriverTestConnectE2eTests, skip-gated): Modbus, OpcUaClient, S7, AbCip against the rig sims; Galaxy against 10.100.0.48:5120. Each: green vs the live sim, RED vs wrong port / non-protocol server, timeout vs a black-hole IP. AbLegacy / TwinCAT / FOCAS live-verify is honestly deferred (no PLC5/SLC sim, no ADS target, no CNC+FWLIB) — unit-proven + degrade-guarded.
  • dotnet build clean (production projects are TreatWarningsAsErrors) + full dotnet test green before merge.
  • Final integration review (the three degrade guards + the consistent message contract + no-regression-vs-TCP).
  • No bUnit — no Razor change (the DriverTestConnectButton already renders whatever the probe returns).
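The in-process TcpListener scenarios from the unit-test bullet (accept then close, garbage reply, canned MBAP reply) could share a fixture along these lines. This is a sketch under assumptions: CannedTcpServer is a hypothetical helper, not an existing class in the test projects, and it serves a single connection only.

```csharp
using System;
using System.Net;
using System.Net.Sockets;
using System.Threading.Tasks;

// Hypothetical test helper: a one-shot loopback server that sends canned bytes, then closes.
public sealed class CannedTcpServer : IDisposable
{
    private readonly TcpListener _listener = new(IPAddress.Loopback, 0);

    // Ephemeral port chosen by the OS, so parallel tests do not collide.
    public int Port { get; }

    // Completes once the single connection has been served (or the listener stopped).
    public Task Completion { get; }

    public CannedTcpServer(byte[] reply)
    {
        _listener.Start();
        Port = ((IPEndPoint)_listener.LocalEndpoint).Port;
        Completion = ServeOnceAsync(reply);
    }

    private async Task ServeOnceAsync(byte[] reply)
    {
        try
        {
            using var client = await _listener.AcceptTcpClientAsync().ConfigureAwait(false);
            if (reply.Length > 0)
                await client.GetStream().WriteAsync(reply, 0, reply.Length).ConfigureAwait(false);
            // Leaving this scope disposes the client: the "TCP accepts then closes" case.
        }
        catch (Exception ex) when (ex is SocketException || ex is ObjectDisposedException)
        {
            // Listener was stopped before any client connected; nothing to serve.
        }
    }

    public void Dispose() => _listener.Stop();
}
```

Passing an empty array gives the accept-then-close case, arbitrary bytes give the garbage case, and a canned MBAP frame gives the Modbus happy path; a probe under test would be pointed at 127.0.0.1 and the Port property.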

Out of scope / not touched

  • Historian.Wonderware probe (already a real Hello/HelloAck handshake).
  • IDriverProbe / DriverProbeResult contract, DI registration, the AdminUI button/Razor, the persisted DriverInstance row (probes run against transient form config only).
  • Plan 2026-05-28-adminui-driver-pages Phase 9 typed address pickers and Phase 10 driver-page E2E — those are Phase 6 (AdminUI), not this phase.

Hard constraints (carried from the parent roadmap)

  • NO Configuration entity / EF migration. No contract change (IDriverProbe/DriverProbeResult frozen).
  • Stage by path; never run "git add .". Never stage sql_login.txt, src/Server/.../Host/pki/, pending.md, current.md, docker-dev/docker-compose.yml, stillpending.md. Never echo or commit the gateway API key (the Galaxy live-verify sources it without echoing, per the established recipe). No force-push, no --no-verify.
  • Finish = merge to master + push.