Still-Pending Phase 5 — Test-Connect protocol handshakes — design
Status: approved 2026-06-16. Parent roadmap:
docs/plans/2026-06-15-stillpending-backlog-design.md (Phase 5). Source backlog: stillpending.md §2 ("Test-Connect probes are TCP-only") + plan 2026-05-28-adminui-driver-pages Phase 7 + 2026-06-12-historian-tcp-transport task 9. Branch feat/stillpending-phase-5-probes off master 050f5c4b. Phases 0–4 already shipped.
Goal
Replace the bare-TCP socket.ConnectAsync Test-Connect probes with real protocol handshakes so a
live-but-rejecting device reads RED, not green. Today a firewalled port, a non-Modbus TCP server, a
PLC at the wrong rack/slot, or a down-but-port-forwarded OPC UA endpoint all surface a healthy tick — the
operator gets a false "connection OK" and only discovers the truth when the driver faults at deploy.
Grounding (verified this session)
- All 8 probes are byte-identical TCP-only boilerplate (deserialize → ExtractTarget → Socket.ConnectAsync → close): Modbus, S7, AbCip, AbLegacy, TwinCAT, OpcUaClient, Galaxy, FOCAS.
- Historian.Wonderware is ALREADY a real handshake: it sends a Hello and confirms HelloAck (WonderwareHistorianDriverProbe.cs:54-71). So §2's "task 9 / historian probe" is already done; this phase does not touch it (only documents it).
- Every driver already owns the handshake primitive in its client code, so no new package references are needed:
  - Modbus: ModbusTcpTransport.ConnectAsync + IModbusTransport.SendAsync(unitId, pdu, ct).
  - S7: new Plc(cpu, host, port, rack, slot).OpenAsync(ct) (COTP CR/CC + S7 setup-communication).
  - AbCip / AbLegacy: libplctag Tag InitializeAsync (opens the EIP session + first CIP op).
  - TwinCAT: AdsClient.Connect(netId, port) + ReadStateAsync (AdsTwinCATClient.cs:90,194).
  - OpcUaClient: DiscoveryClient.GetEndpointsAsync (no session / cert / auth; OpcUaClientDriver.cs:422).
  - Galaxy: the MxGateway.Client gRPC channel + one lightweight unary call.
  - FOCAS: cnc_allclibhndl3 via the existing wire P/Invoke (Wire.WireFocasClient).
- Probe dispatch clamps the timeout to 1–60 s and passes a cancelled-on-timeout ct (AdminOperationsActor.cs:284-291). Probes MUST honour it and MUST NOT mutate state (read-only handshakes).
- A proven skip-gated E2E harness exists (DriverTestConnectE2eTests) targeting the live Modbus sim with happy / wrong-port / black-hole scenarios, auto-skipping when the fixture is unreachable. The dev-rig sims (Modbus:5020, AbCip:44818, S7:1102, opc-plc:50000) and the mxaccessgw (10.100.0.48:5120) are reachable from the dev Mac, so 5 of the 8 are live-verifiable agent-side.
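The timeout rule in the list above (clamp to 1–60 s, cancel the probe when the deadline passes) can be illustrated with a small language-neutral sketch. It is written in Python rather than the project's C#, and it is not the code in AdminOperationsActor.cs: the names clamp_timeout and run_probe, and the dict-shaped result, are hypothetical stand-ins.

```python
import asyncio

MIN_TIMEOUT_S = 1
MAX_TIMEOUT_S = 60


def clamp_timeout(requested_s: float) -> float:
    """Clamp a requested probe timeout into the 1-60 s window."""
    return max(MIN_TIMEOUT_S, min(MAX_TIMEOUT_S, requested_s))


async def run_probe(probe, requested_s: float) -> dict:
    """Run a probe under the clamped deadline.

    `probe` is a hypothetical zero-argument coroutine function returning a
    result dict. If the deadline passes, wait_for cancels it and the
    unchanged timeout message is returned instead.
    """
    timeout_s = clamp_timeout(requested_s)
    try:
        return await asyncio.wait_for(probe(), timeout=timeout_s)
    except asyncio.TimeoutError:
        return {"ok": False, "message": f"Probe timed out after {timeout_s:g}s."}
```

A probe that honours cancellation must not swallow the cancellation signal; in this sketch that means letting asyncio.CancelledError propagate out of the probe coroutine.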
Architecture
Per-probe, no shared scaffold. Each probe stays self-contained in its own driver project (matches the
existing intentional copy-paste style, keeps all 8 projects disjoint → parallelizable like Phase 4, and avoids
adding socket/handshake logic to the Core.Abstractions contracts project). Each in-scope probe keeps its TCP
preflight and adds one handshake step, reusing the driver's own client primitive.
New three-way result contract, with the existing timeout result unchanged (the operator value: message templates kept consistent across all 8):
| Outcome | Result |
|---|---|
| TCP connect fails | Ok=false · "Connect failed: {SocketError}" (unchanged) |
| TCP ok + handshake ok | Ok=true · latency · descriptive msg (e.g. "Modbus FC03 OK", "OPC UA: N endpoint(s)", "S7 connected (CPU …)", "CIP session OK", "ADS state: Run", "gateway gRPC OK") |
| TCP ok but handshake rejected | Ok=false · "Reachable at {host}:{port} but {proto} handshake failed: {detail}" ← the new behavior |
| timeout | Ok=false · "Probe timed out after {n}s." (unchanged) |
IDriverProbe / DriverProbeResult are unchanged — no Commons / Core contract touch, no DI change
(the 8 probes are already registered in DriverFactoryBootstrap.AddOtOpcUaDriverProbes).
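The message templates in the table can be captured as four small constructors, which makes the "consistent across all 8" requirement easy to check in one place. This is a sketch in Python, not the project's C#; ProbeResult is a hypothetical stand-in for DriverProbeResult, whose real shape is not shown in this document, and the function names are illustrative only.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ProbeResult:
    """Hypothetical stand-in for DriverProbeResult (assumed fields)."""
    ok: bool
    message: str
    latency_ms: Optional[float] = None


def connect_failed(socket_error: str) -> ProbeResult:
    # Row 1: TCP connect fails (unchanged behaviour).
    return ProbeResult(False, f"Connect failed: {socket_error}")


def handshake_ok(detail: str, latency_ms: float) -> ProbeResult:
    # Row 2: TCP ok + handshake ok, with a descriptive per-protocol message.
    return ProbeResult(True, detail, latency_ms)


def handshake_failed(host: str, port: int, proto: str, detail: str) -> ProbeResult:
    # Row 3: the new behaviour, reachable but the protocol handshake was rejected.
    return ProbeResult(
        False, f"Reachable at {host}:{port} but {proto} handshake failed: {detail}"
    )


def timed_out(seconds: int) -> ProbeResult:
    # Row 4: timeout (unchanged behaviour).
    return ProbeResult(False, f"Probe timed out after {seconds}s.")
```

For example, handshake_failed("10.0.0.5", 502, "Modbus", "non-MBAP reply") yields the message "Reachable at 10.0.0.5:502 but Modbus handshake failed: non-MBAP reply" (the host and port here are made-up values).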
Per-driver handshake + degradation
Tier A — real handshake, live-verifiable on the rig (agent-driven)
- Modbus: ConnectAsync then SendAsync one FC03 (Read Holding Registers, qty 1 @ addr 0, unit from config/default 1). Any well-formed MBAP reply that echoes the TxId with protocol-id 0, including a Modbus exception PDU (0x83…), proves a real Modbus device ⇒ Ok. A malformed/non-MBAP reply or silence ⇒ handshake-fail. Sim :5020.
- OpcUaClient: DiscoveryClient.GetEndpointsAsync (no session, no app-cert, no auth). ≥1 endpoint ⇒ Ok ("OPC UA: N endpoint(s)"); a non-OPC-UA TCP server throws/times out ⇒ handshake-fail. opc-plc :50000.
- S7: new Plc(...).OpenAsync(ct) with ReadTimeout set first (mirror S7Driver.cs:164), check IsConnected, Close. Wrong rack/slot or non-S7 server ⇒ OpenAsync throws ⇒ handshake-fail. python-snap7 :1102.
- AbCip: create a libplctag Tag for the first configured tag path (else a benign probe name) and InitializeAsync. Session opens ⇒ Ok; a CIP-level error (tag-not-found / bad-path) still counts as reachable (the controller answered CIP); a session/ForwardOpen/connect error ⇒ handshake-fail. CIP sim :44818.
- Galaxy: build the MxGateway.Client gRPC channel (honour the config's cleartext/TLS) and issue one lightweight unary call. The probe does NOT resolve secretref: secrets; it sends whatever key string is in the transient config (possibly empty/unresolved). An OK reply ⇒ Ok; an Unauthenticated / PermissionDenied reply also ⇒ Ok ("gateway reachable & speaking gRPC; auth not checked") because it proves a live mxgw server; Unavailable / transport error ⇒ handshake-fail. Gateway 10.100.0.48:5120.
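To make the Modbus acceptance rule above concrete, here is a wire-level sketch of the FC03 request and the reply check. It is written in Python for brevity and is not the ModbusTcpTransport code; the function names are hypothetical. The length-field consistency check is an assumption about what "well-formed" means, added on top of the TxId and protocol-id checks the text specifies.

```python
import struct


def build_fc03_request(tx_id: int, unit_id: int = 1,
                       address: int = 0, quantity: int = 1) -> bytes:
    """Build a Modbus TCP Read Holding Registers (FC03) request."""
    pdu = struct.pack(">BHH", 0x03, address, quantity)
    # MBAP header: transaction id, protocol id (always 0),
    # length (unit id byte + PDU), unit id.
    mbap = struct.pack(">HHHB", tx_id, 0, len(pdu) + 1, unit_id)
    return mbap + pdu


def is_modbus_reply(reply: bytes, tx_id: int) -> bool:
    """Return True if the bytes look like a Modbus TCP reply to tx_id.

    An exception PDU (function code 0x83) passes too: it still proves
    that a real Modbus device answered.
    """
    if len(reply) < 8:  # 7-byte MBAP header plus at least a function code
        return False
    reply_tx, protocol_id, length, _unit = struct.unpack(">HHHB", reply[:7])
    if reply_tx != tx_id or protocol_id != 0:
        return False
    # Assumption: the MBAP length field must match the bytes that follow it.
    return length == len(reply) - 6
```

With tx_id 1 and the defaults, the request is the 12 bytes 00 01 00 00 00 06 01 03 00 00 00 01. An HTTP server answering on the same port fails the check because its first two bytes do not echo the transaction id.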
Tier B — real handshake, unit-proven only (no rig target → live-verify deferred, honestly recorded)
- AbLegacy: same libplctag Init handshake as AbCip but the PCCC protocol family (verified-by-proxy via AbCip's identical code path). No PLC5/SLC sim on the rig.
- TwinCAT: AdsClient.Connect + ReadStateAsync ⇒ Ok with the ADS state (Run/Config/Stop). An ADS route-table rejection is a true RED (the driver also cannot function without an authorized route, and the message says so: "check the target's ADS route table authorizes this host"). Degrade guard: if AdsClient cannot construct/connect headless (managed AMS router unavailable), catch and fall back to the TCP-preflight result with an "ADS handshake unavailable on this host — TCP reachability only" note, so the result is never worse than today. No TwinCAT target on the rig.
- FOCAS: attempt cnc_allclibhndl3 via the existing wire P/Invoke. Degrade guard: the FWLIB native lib is absent on the dev box / Linux containers (UnimplementedFocasClientFactory gates the driver), so the call throws DllNotFoundException / NotSupportedException ⇒ catch and fall back to the TCP-preflight result with a "FOCAS handshake unavailable on this host (FWLIB absent) — TCP reachability only" note. A real handshake runs only on a Windows host with FWLIB + a reachable CNC. No CNC on the rig.
Degradation principle: the three Tier-B handshakes must NEVER produce a result worse than today's TCP-only probe. A genuine protocol rejection from a reachable device is a correct RED; an environmental inability to run the handshake at all (no FWLIB, no managed router) degrades to the existing TCP-reachability message.
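The degradation principle separates two failure kinds: the host cannot run the handshake at all, versus a reachable device rejected it. A minimal Python sketch of that control flow follows. The exception classes, the function name, and the dict-shaped results are hypothetical; in the real probes the "unavailable" case corresponds to exceptions such as DllNotFoundException.

```python
class HandshakeUnavailable(Exception):
    """The host cannot run the handshake at all (e.g. native library absent)."""


class HandshakeRejected(Exception):
    """A reachable device answered but refused the protocol handshake."""


def probe_with_degrade(tcp_result: dict, handshake, host: str, port: int,
                       proto: str, unavailable_note: str) -> dict:
    """Layer a handshake on top of a TCP preflight result.

    `handshake` is a hypothetical zero-argument callable that returns a
    descriptive success message or raises one of the exceptions above.
    """
    if not tcp_result["ok"]:
        return tcp_result  # TCP failure is reported exactly as before
    try:
        detail = handshake()
    except HandshakeUnavailable:
        # Environmental inability: keep the TCP-only verdict, add a note.
        return {"ok": True,
                "message": f"{tcp_result['message']} ({unavailable_note})"}
    except HandshakeRejected as exc:
        # Genuine protocol rejection from a reachable device: a correct RED.
        return {"ok": False,
                "message": f"Reachable at {host}:{port} but {proto} "
                           f"handshake failed: {exc}"}
    return {"ok": True, "message": detail}
```

The key property is that the HandshakeUnavailable branch never flips ok to False, so the result cannot be worse than the TCP-only probe's.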
Testing & verification
- Unit (per probe, TDD red→green, xUnit + Shouldly): an in-process TcpListener drives (a) invalid/empty config, (b) unreachable → Connect failed, (c) TCP-accepts-then-closes / garbage → handshake-fail (the key new path; cleanly assertable for Modbus via a canned-MBAP server; loopback-accept for the rest), plus Modbus's canned-MBAP happy path and FOCAS's degrade path (DllNotFound on the dev box is the actual CI behavior, so it is directly testable). Galaxy's Unauthenticated ⇒ Ok is testable against a tiny in-process gRPC server or via the live gateway.
- Live /run (agent-driven, extends DriverTestConnectE2eTests, skip-gated): Modbus, OpcUaClient, S7, AbCip against the rig sims; Galaxy against 10.100.0.48:5120. Each: green vs the live sim, RED vs wrong port / non-protocol server, timeout vs a black-hole IP. AbLegacy / TwinCAT / FOCAS live-verify is honestly deferred (no PLC5/SLC sim, no ADS target, no CNC+FWLIB); these are unit-proven + degrade-guarded.
- dotnet build clean (production projects are TreatWarningsAsErrors) + full dotnet test green before merge.
- Final integration review (the three degrade guards + the consistent message contract + no-regression-vs-TCP).
- No bUnit: no Razor change (the DriverTestConnectButton already renders whatever the probe returns).
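The "TCP-accepts-then-closes" fixture from the unit-test bullet can be sketched as follows. The real tests use an in-process TcpListener with xUnit; this Python version only illustrates the shape of the fixture and the expected classification. Both function names and the string outcomes are hypothetical, and treating post-connect silence as a handshake failure is an assumption of this sketch.

```python
import socket
import threading
from typing import Tuple


def start_accept_then_close_server() -> Tuple[socket.socket, int]:
    """Listen on an ephemeral loopback port; accept one client, then hang up."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # port 0 lets the OS pick a free port
    listener.listen(1)

    def serve() -> None:
        conn, _ = listener.accept()
        conn.close()  # no protocol reply at all

    threading.Thread(target=serve, daemon=True).start()
    return listener, listener.getsockname()[1]


def classify(host: str, port: int, request: bytes,
             timeout_s: float = 2.0) -> str:
    """Separate a failed connect from a failed handshake."""
    try:
        sock = socket.create_connection((host, port), timeout=timeout_s)
    except OSError:
        return "connect-failed"
    with sock:
        try:
            sock.sendall(request)
            reply = sock.recv(260)
        except OSError:
            # Reset or silence after a successful connect.
            return "handshake-fail"
    # An empty read means the peer closed without answering.
    return "reply" if reply else "handshake-fail"
```

Pointing classify at the fixture's port returns "handshake-fail" whether the close surfaces as an empty read or as a connection reset, which is the path the old TCP-only probes reported as healthy.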
Out of scope / not touched
- Historian.Wonderware probe (already a real Hello/HelloAck handshake).
- IDriverProbe / DriverProbeResult contract, DI registration, the AdminUI button/Razor, the persisted DriverInstance row (probes run against transient form config only).
- Plan 2026-05-28-adminui-driver-pages Phase 9 typed address pickers and Phase 10 driver-page E2E: those are Phase 6 (AdminUI), not this phase.
Hard constraints (carried from the parent roadmap)
- NO Configuration entity / EF migration. No contract change (IDriverProbe / DriverProbeResult frozen).
- Stage by path; never run "git add .". Never stage sql_login.txt, src/Server/.../Host/pki/, pending.md, current.md, docker-dev/docker-compose.yml, stillpending.md. Never echo or commit the gateway API key (the Galaxy live-verify sources it without echoing, per the established recipe). No force-push, no --no-verify.
- Finish = merge to master + push.