From 1f2d32ac1e4fe8fbb1a8d3fb15a101ad9435f3de Mon Sep 17 00:00:00 2001
From: Joseph Doherty
Date: Tue, 16 Jun 2026 06:28:21 -0400
Subject: [PATCH] =?UTF-8?q?docs(phase5):=20design=20=E2=80=94=20Test-Conne?=
 =?UTF-8?q?ct=20protocol=20handshakes=20(all=208=20probes,=20best-effort)?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

---
 ...6-16-stillpending-phase-5-probes-design.md | 126 ++++++++++++++++++
 1 file changed, 126 insertions(+)
 create mode 100644 docs/plans/2026-06-16-stillpending-phase-5-probes-design.md

diff --git a/docs/plans/2026-06-16-stillpending-phase-5-probes-design.md b/docs/plans/2026-06-16-stillpending-phase-5-probes-design.md
new file mode 100644
index 00000000..42940277
--- /dev/null
+++ b/docs/plans/2026-06-16-stillpending-phase-5-probes-design.md
@@ -0,0 +1,126 @@
+# Still-Pending Phase 5 — Test-Connect protocol handshakes — design
+
+> **Status:** approved 2026-06-16. Parent roadmap: `docs/plans/2026-06-15-stillpending-backlog-design.md` (Phase 5).
+> Source backlog: `stillpending.md` §2 ("Test-Connect probes are TCP-only") + plan
+> `2026-05-28-adminui-driver-pages` Phase 7 + `2026-06-12-historian-tcp-transport` task 9.
+> Branch `feat/stillpending-phase-5-probes` off master `050f5c4b`. Phases 0–4 already shipped.
+
+## Goal
+
+Replace the bare-TCP `socket.ConnectAsync` Test-Connect probes with **real protocol handshakes** so a
+*live-but-rejecting* device reads **RED**, not green. Today a firewalled port, a non-Modbus TCP server, a
+PLC at the wrong rack/slot, or a down-but-port-forwarded OPC UA endpoint all surface a healthy tick — the
+operator gets a false "connection OK" and only discovers the truth when the driver faults at deploy.
+
+## Grounding (verified this session)
+
+- **All 8 probes are byte-identical TCP-only** boilerplate (`deserialize → ExtractTarget → Socket.ConnectAsync
+  → close`): Modbus, S7, AbCip, AbLegacy, TwinCAT, OpcUaClient, Galaxy, FOCAS.
+- **Historian.Wonderware is ALREADY a real handshake** — it sends a `Hello` and confirms `HelloAck`
+  (`WonderwareHistorianDriverProbe.cs:54-71`). So §2's "task 9 / historian probe" is **already done**; this
+  phase does not touch it (only documents it).
+- **Every driver already owns the handshake primitive** in its client code — no new package references:
+  - Modbus: `ModbusTcpTransport.ConnectAsync` + `IModbusTransport.SendAsync(unitId, pdu, ct)`.
+  - S7: `new Plc(cpu, host, port, rack, slot).OpenAsync(ct)` (COTP CR/CC + S7 setup-communication).
+  - AbCip / AbLegacy: `libplctag` Tag `InitializeAsync` (opens the EIP session + first CIP op).
+  - TwinCAT: `AdsClient.Connect(netId, port)` + `ReadStateAsync` (`AdsTwinCATClient.cs:90,194`).
+  - OpcUaClient: `DiscoveryClient.GetEndpointsAsync` (no session / cert / auth — `OpcUaClientDriver.cs:422`).
+  - Galaxy: the `MxGateway.Client` gRPC channel + one lightweight unary call.
+  - FOCAS: `cnc_allclibhndl3` via the existing wire P/Invoke (`Wire.WireFocasClient`).
+- **Probe dispatch** clamps the timeout to 1–60 s and passes a cancelled-on-timeout `ct`
+  (`AdminOperationsActor.cs:284-291`). Probes MUST honour it and MUST NOT mutate state (read-only handshakes).
+- **A proven skip-gated E2E harness exists** (`DriverTestConnectE2eTests`) targeting the live Modbus sim with
+  happy / wrong-port / black-hole scenarios, auto-skipping when the fixture is unreachable. The dev-rig sims
+  (Modbus `:5020`, AbCip `:44818`, S7 `:1102`, opc-plc `:50000`) and the mxaccessgw (`10.100.0.48:5120`) are
+  reachable from the dev Mac, so 5 of the 8 are live-verifiable agent-side.
+
+## Architecture
+
+**Per-probe, no shared scaffold.** Each probe stays self-contained in its own driver project (matches the
+existing intentional copy-paste style, keeps all 8 projects disjoint → parallelizable like Phase 4, and avoids
+adding socket/handshake logic to the `Core.Abstractions` contracts project). Each in-scope probe keeps its TCP
+preflight and **adds one handshake step**, reusing the driver's own client primitive.
+
+**New three-way result contract** (the operator value — message templates kept consistent across all 8):
+
+| Outcome | Result |
+|---|---|
+| TCP connect fails | `Ok=false` · `"Connect failed: {SocketError}"` *(unchanged)* |
+| TCP ok **+ handshake ok** | `Ok=true` · latency · descriptive msg (e.g. `"Modbus FC03 OK"`, `"OPC UA: N endpoint(s)"`, `"S7 connected (CPU …)"`, `"CIP session OK"`, `"ADS state: Run"`, `"gateway gRPC OK"`) |
+| TCP ok **but handshake rejected** | `Ok=false` · `"Reachable at {host}:{port} but {proto} handshake failed: {detail}"` ← **the new behavior** |
+| timeout | `Ok=false` · `"Probe timed out after {n}s."` *(unchanged)* |
+
+`IDriverProbe` / `DriverProbeResult` are **unchanged** — no Commons / Core contract touch, no DI change
+(the 8 probes are already registered in `DriverFactoryBootstrap.AddOtOpcUaDriverProbes`).
+
+## Per-driver handshake + degradation
+
+### Tier A — real handshake, live-verifiable on the rig (agent-driven)
+1. **Modbus** — `ConnectAsync` then `SendAsync` one **FC03** (Read Holding Registers, qty 1 @ addr 0, unit
+   from config/default 1). Any well-formed MBAP reply that echoes the TxId with protocol-id 0 — **including a
+   Modbus exception PDU (0x83…)** — proves a real Modbus device ⇒ `Ok`. A malformed/non-MBAP reply or silence
+   ⇒ handshake-fail. Sim `:5020`.
+2. **OpcUaClient** — `DiscoveryClient.GetEndpointsAsync` (no session, no app-cert, no auth). ≥1 endpoint ⇒
+   `Ok` (`"OPC UA: N endpoint(s)"`); a non-OPC-UA TCP server throws/times out ⇒ handshake-fail. opc-plc `:50000`.
+3. **S7** — `new Plc(...).OpenAsync(ct)` with `ReadTimeout` set first (mirror `S7Driver.cs:164`), check
+   `IsConnected`, `Close`. Wrong rack/slot or non-S7 server ⇒ `OpenAsync` throws ⇒ handshake-fail. python-snap7
+   `:1102`.
+4. **AbCip** — create a `libplctag` Tag for the first configured tag path (else a benign probe name) and
+   `InitializeAsync`. Session opens ⇒ `Ok`; a **CIP-level** error (tag-not-found / bad-path) **still counts as
+   reachable** (the controller answered CIP); a session/ForwardOpen/connect error ⇒ handshake-fail. CIP sim `:44818`.
+5. **Galaxy** — build the `MxGateway.Client` gRPC channel (honour the config's cleartext/TLS) and issue one
+   lightweight unary call. **The probe does NOT resolve `secretref:` secrets** — it sends whatever key string is
+   in the transient config (possibly empty/unresolved). An `OK` reply ⇒ `Ok`; an **`Unauthenticated` /
+   `PermissionDenied`** reply **also ⇒ `Ok`** ("gateway reachable & speaking gRPC; auth not checked") because it
+   proves a live mxgw server; `Unavailable` / transport error ⇒ handshake-fail. Gateway `10.100.0.48:5120`.
+
+### Tier B — real handshake, unit-proven only (no rig target → live-verify deferred, honestly recorded)
+6. **AbLegacy** — same `libplctag` Init handshake as AbCip but the PCCC protocol family (verified-by-proxy via
+   AbCip's identical code path). No PLC5/SLC sim on the rig.
+7. **TwinCAT** — `AdsClient.Connect` + `ReadStateAsync` ⇒ `Ok` with the ADS state (Run/Config/Stop). An ADS
+   **route-table** rejection is a *true* RED (the driver also cannot function without an authorized route — the
+   message says so: *"check the target's ADS route table authorizes this host"*). **Degrade guard:** if
+   `AdsClient` cannot construct/connect headless (managed AMS router unavailable), catch and fall back to the
+   TCP-preflight result with a *"ADS handshake unavailable on this host — TCP reachability only"* note — never
+   worse than today. No TwinCAT target on the rig.
+8. **FOCAS** — attempt `cnc_allclibhndl3` via the existing wire P/Invoke. **Degrade guard:** the FWLIB native
+   lib is absent on the dev box / Linux containers (`UnimplementedFocasClientFactory` gates the driver), so the
+   call throws `DllNotFoundException` / `NotSupportedException` ⇒ catch and fall back to the TCP-preflight result
+   with a *"FOCAS handshake unavailable on this host (FWLIB absent) — TCP reachability only"* note. A real
+   handshake runs only on a Windows host with FWLIB + a reachable CNC. No CNC on the rig.
+
+**Degradation principle:** the three Tier-B handshakes must NEVER produce a result worse than today's TCP-only
+probe. A genuine protocol rejection from a reachable device is a correct RED; an *environmental inability to run
+the handshake at all* (no FWLIB, no managed router) degrades to the existing TCP-reachability message.
+
+## Testing & verification
+
+- **Unit (per probe, TDD red→green, xUnit + Shouldly):** in-process `TcpListener` drives — (a) invalid/empty
+  config, (b) unreachable → `Connect failed`, (c) **TCP-accepts-then-closes / garbage → handshake-fail** (the
+  key new path — cleanly assertable for Modbus via a canned-MBAP server; loopback-accept for the rest), plus
+  Modbus's canned-MBAP **happy** path and FOCAS's **degrade** path (DllNotFound on the dev box is the actual CI
+  behavior, so it is directly testable). Galaxy's `Unauthenticated⇒Ok` is testable against a tiny in-process
+  gRPC server or via the live gateway.
+- **Live `/run` (agent-driven, extends `DriverTestConnectE2eTests`, skip-gated):** Modbus, OpcUaClient, S7,
+  AbCip against the rig sims; Galaxy against `10.100.0.48:5120`. Each: green vs the live sim, RED vs wrong port
+  / non-protocol server, timeout vs a black-hole IP. AbLegacy / TwinCAT / FOCAS live-verify is **honestly
+  deferred** (no PLC5/SLC sim, no ADS target, no CNC+FWLIB) — unit-proven + degrade-guarded.
+- `dotnet build` clean (production projects are `TreatWarningsAsErrors`) + full `dotnet test` green before merge.
+- Final integration review (the three degrade guards + the consistent message contract + no-regression-vs-TCP).
+- **No bUnit** — no Razor change (the `DriverTestConnectButton` already renders whatever the probe returns).
+
+## Out of scope / not touched
+
+- Historian.Wonderware probe (already a real Hello/HelloAck handshake).
+- `IDriverProbe` / `DriverProbeResult` contract, DI registration, the AdminUI button/Razor, the persisted
+  `DriverInstance` row (probes run against transient form config only).
+- Plan `2026-05-28-adminui-driver-pages` Phase 9 typed **address pickers** and Phase 10 driver-page E2E — those
+  are Phase 6 (AdminUI), not this phase.
+
+## Hard constraints (carried from the parent roadmap)
+
+- **NO Configuration entity / EF migration.** No contract change (`IDriverProbe`/`DriverProbeResult` frozen).
+- Stage by path — never `git add .`. Never stage `sql_login.txt`, `src/Server/.../Host/pki/`, `pending.md`,
+  `current.md`, `docker-dev/docker-compose.yml`, `stillpending.md`. **Never echo or commit the gateway API key**
+  (the Galaxy live-verify sources it without echoing, per the established recipe). No force-push, no `--no-verify`.
+- Finish = merge to master + push.
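
The Modbus acceptance rule from Tier A (any well-formed MBAP reply that echoes the TxId with protocol-id 0 counts as a handshake success, including an exception PDU) can be sketched independently of the driver code. The sketch below is Python rather than the project's C#, purely as an illustration. `build_fc03_request` and `classify_reply` are hypothetical names, the length check is one assumed reading of "well-formed", and the exception-case message wording is invented for the example. It also assumes the caller has already read one complete frame from the socket.

```python
import struct

FC_READ_HOLDING = 0x03  # Modbus function code 03: Read Holding Registers


def build_fc03_request(tx_id: int, unit_id: int = 1, addr: int = 0, qty: int = 1) -> bytes:
    """Build an MBAP header followed by an FC03 PDU (qty registers at addr)."""
    pdu = struct.pack(">BHH", FC_READ_HOLDING, addr, qty)
    # MBAP fields: transaction id, protocol id (always 0), length, unit id.
    # The length field counts the unit-id byte plus the PDU.
    return struct.pack(">HHHB", tx_id, 0, len(pdu) + 1, unit_id) + pdu


def classify_reply(reply: bytes, tx_id: int) -> tuple[bool, str]:
    """Return (ok, message) for one complete reply frame."""
    if len(reply) < 8:  # 7-byte MBAP header plus at least a function code
        return False, "short or empty reply"
    rx_tx, proto, length, _unit = struct.unpack(">HHHB", reply[:7])
    if proto != 0:
        return False, f"protocol-id {proto}, expected 0"
    if rx_tx != tx_id:
        return False, "transaction id not echoed"
    # Assumption: "well-formed" includes a length field that matches the frame.
    if length != len(reply) - 6:
        return False, "MBAP length does not match frame size"
    fc = reply[7]
    if fc == FC_READ_HOLDING:
        return True, "Modbus FC03 OK"
    if fc == (FC_READ_HOLDING | 0x80):
        # Exception PDU (0x83): the device refused the read but speaks Modbus.
        code = reply[8] if len(reply) > 8 else None
        return True, f"Modbus FC03 OK (exception code {code})"  # wording is illustrative
    return False, f"unexpected function code 0x{fc:02X}"
```

A non-Modbus server that answers with, say, an HTTP error line fails the protocol-id check, which corresponds to the "Reachable ... but handshake failed" row of the result table; a device that answers with exception code 2 (illegal data address) still classifies as reachable.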