docs(phase5): design — Test-Connect protocol handshakes (all 8 probes, best-effort)

Joseph Doherty
2026-06-16 06:28:21 -04:00
parent 050f5c4b60
commit 1f2d32ac1e
@@ -0,0 +1,126 @@
# Still-Pending Phase 5 — Test-Connect protocol handshakes — design
> **Status:** approved 2026-06-16. Parent roadmap: `docs/plans/2026-06-15-stillpending-backlog-design.md` (Phase 5).
> Source backlog: `stillpending.md` §2 ("Test-Connect probes are TCP-only") + plan
> `2026-05-28-adminui-driver-pages` Phase 7 + `2026-06-12-historian-tcp-transport` task 9.
> Branch `feat/stillpending-phase-5-probes` off master `050f5c4b`. Phases 0–4 already shipped.
## Goal
Replace the bare-TCP `socket.ConnectAsync` Test-Connect probes with **real protocol handshakes** so a
*live-but-rejecting* device reads **RED**, not green. Today a firewalled port, a non-Modbus TCP server, a
PLC at the wrong rack/slot, or a down-but-port-forwarded OPC UA endpoint all surface a healthy tick — the
operator gets a false "connection OK" and only discovers the truth when the driver faults at deploy.
## Grounding (verified this session)
- **All 8 probes are byte-identical TCP-only** boilerplate (`deserialize → ExtractTarget → Socket.ConnectAsync
→ close`): Modbus, S7, AbCip, AbLegacy, TwinCAT, OpcUaClient, Galaxy, FOCAS.
- **Historian.Wonderware is ALREADY a real handshake** — it sends a `Hello` and confirms `HelloAck`
(`WonderwareHistorianDriverProbe.cs:54-71`). So §2's "task 9 / historian probe" is **already done**; this
phase does not touch it (only documents it).
- **Every driver already owns the handshake primitive** in its client code — no new package references:
- Modbus: `ModbusTcpTransport.ConnectAsync` + `IModbusTransport.SendAsync(unitId, pdu, ct)`.
- S7: `new Plc(cpu, host, port, rack, slot).OpenAsync(ct)` (COTP CR/CC + S7 setup-communication).
- AbCip / AbLegacy: `libplctag` Tag `InitializeAsync` (opens the EIP session + first CIP op).
- TwinCAT: `AdsClient.Connect(netId, port)` + `ReadStateAsync` (`AdsTwinCATClient.cs:90,194`).
- OpcUaClient: `DiscoveryClient.GetEndpointsAsync` (no session / cert / auth — `OpcUaClientDriver.cs:422`).
- Galaxy: the `MxGateway.Client` gRPC channel + one lightweight unary call.
- FOCAS: `cnc_allclibhndl3` via the existing wire P/Invoke (`Wire.WireFocasClient`).
- **Probe dispatch** clamps the timeout to 1–60 s and passes a cancelled-on-timeout `ct`
(`AdminOperationsActor.cs:284-291`). Probes MUST honour it and MUST NOT mutate state (read-only handshakes).
- **A proven skip-gated E2E harness exists** (`DriverTestConnectE2eTests`) targeting the live Modbus sim with
happy / wrong-port / black-hole scenarios, auto-skipping when the fixture is unreachable. The dev-rig sims
(Modbus `:5020`, AbCip `:44818`, S7 `:1102`, opc-plc `:50000`) and the mxaccessgw (`10.100.0.48:5120`) are
reachable from the dev Mac, so 5 of the 8 are live-verifiable agent-side.
## Architecture
**Per-probe, no shared scaffold.** Each probe stays self-contained in its own driver project (matches the
existing intentional copy-paste style, keeps all 8 projects disjoint → parallelizable like Phase 4, and avoids
adding socket/handshake logic to the `Core.Abstractions` contracts project). Each in-scope probe keeps its TCP
preflight and **adds one handshake step**, reusing the driver's own client primitive.
**New three-way result contract**, plus the unchanged timeout row (the operator value; message templates kept consistent across all 8):
| Outcome | Result |
|---|---|
| TCP connect fails | `Ok=false` · `"Connect failed: {SocketError}"` *(unchanged)* |
| TCP ok **+ handshake ok** | `Ok=true` · latency · descriptive msg (e.g. `"Modbus FC03 OK"`, `"OPC UA: N endpoint(s)"`, `"S7 connected (CPU …)"`, `"CIP session OK"`, `"ADS state: Run"`, `"gateway gRPC OK"`) |
| TCP ok **but handshake rejected** | `Ok=false` · `"Reachable at {host}:{port} but {proto} handshake failed: {detail}"` ← **the new behavior** |
| timeout | `Ok=false` · `"Probe timed out after {n}s."` *(unchanged)* |
`IDriverProbe` / `DriverProbeResult` are **unchanged** — no Commons / Core contract touch, no DI change
(the 8 probes are already registered in `DriverFactoryBootstrap.AddOtOpcUaDriverProbes`).
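
The four rows of the table can be read as four small constructors over one result shape. The sketch below is a language-neutral illustration in Python, not the C# probe code: `ProbeResult` and the helper names are hypothetical stand-ins for `DriverProbeResult`, whose actual members are not shown in this document. Only the message templates are taken from the table.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ProbeResult:
    # Hypothetical stand-in for DriverProbeResult (field names are assumptions).
    ok: bool
    message: str
    latency_ms: Optional[float] = None


def connect_failed(socket_error: str) -> ProbeResult:
    # Row 1: the TCP preflight itself failed (unchanged behavior).
    return ProbeResult(False, f"Connect failed: {socket_error}")


def handshake_ok(detail: str, latency_ms: float) -> ProbeResult:
    # Row 2: TCP connected and the protocol handshake succeeded.
    return ProbeResult(True, detail, latency_ms)


def handshake_rejected(host: str, port: int, proto: str, detail: str) -> ProbeResult:
    # Row 3: the new RED case, port reachable but the protocol handshake failed.
    return ProbeResult(
        False, f"Reachable at {host}:{port} but {proto} handshake failed: {detail}"
    )


def timed_out(seconds: int) -> ProbeResult:
    # Row 4: the dispatch deadline expired (unchanged behavior).
    return ProbeResult(False, f"Probe timed out after {seconds}s.")
```

Keeping the templates in one place per probe is what makes the "consistent across all 8" requirement easy to check in review, even though each probe stays self-contained.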
## Per-driver handshake + degradation
### Tier A — real handshake, live-verifiable on the rig (agent-driven)
1. **Modbus** — `ConnectAsync` then `SendAsync` one **FC03** (Read Holding Registers, qty 1 @ addr 0, unit
from config/default 1). Any well-formed MBAP reply that echoes the TxId with protocol-id 0 — **including a
Modbus exception PDU (0x83…)** — proves a real Modbus device ⇒ `Ok`. A malformed/non-MBAP reply or silence
⇒ handshake-fail. Sim `:5020`.
2. **OpcUaClient** — `DiscoveryClient.GetEndpointsAsync` (no session, no app-cert, no auth). ≥1 endpoint ⇒
`Ok` (`"OPC UA: N endpoint(s)"`); a non-OPC-UA TCP server throws/times out ⇒ handshake-fail. opc-plc `:50000`.
3. **S7** — `new Plc(...).OpenAsync(ct)` with `ReadTimeout` set first (mirror `S7Driver.cs:164`), check
`IsConnected`, `Close`. Wrong rack/slot or non-S7 server ⇒ `OpenAsync` throws ⇒ handshake-fail. python-snap7
`:1102`.
4. **AbCip** — create a `libplctag` Tag for the first configured tag path (else a benign probe name) and
`InitializeAsync`. Session opens ⇒ `Ok`; a **CIP-level** error (tag-not-found / bad-path) **still counts as
reachable** (the controller answered CIP); a session/ForwardOpen/connect error ⇒ handshake-fail. CIP sim `:44818`.
5. **Galaxy** — build the `MxGateway.Client` gRPC channel (honour the config's cleartext/TLS) and issue one
lightweight unary call. **The probe does NOT resolve `secretref:` secrets** — it sends whatever key string is
in the transient config (possibly empty/unresolved). An `OK` reply ⇒ `Ok`; an **`Unauthenticated` /
`PermissionDenied`** reply **also ⇒ `Ok`** ("gateway reachable & speaking gRPC; auth not checked") because it
proves a live mxgw server; `Unavailable` / transport error ⇒ handshake-fail. Gateway `10.100.0.48:5120`.
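
To make the Modbus rule in item 1 concrete, here is a minimal wire-level sketch in Python. It is an illustration of the MBAP framing only, not the `ModbusTcpTransport` code; the function names are hypothetical. It follows the rule stated above: a well-formed MBAP reply that echoes the TxId with protocol-id 0 counts as `Ok`, including an exception PDU. Accepting function codes other than 0x03 and its exception form is this sketch's reading of "any well-formed MBAP reply".

```python
import struct
from typing import Tuple


def build_fc03_request(tx_id: int, unit_id: int = 1,
                       address: int = 0, quantity: int = 1) -> bytes:
    """Build an MBAP-framed FC03 (Read Holding Registers) request."""
    pdu = struct.pack(">BHH", 0x03, address, quantity)
    # MBAP header: TxId, protocol-id (always 0), length of unit-id + PDU, unit-id.
    mbap = struct.pack(">HHHB", tx_id, 0, len(pdu) + 1, unit_id)
    return mbap + pdu


def classify_reply(reply: bytes, tx_id: int) -> Tuple[bool, str]:
    """Return (ok, detail) for the bytes received after sending the request."""
    if len(reply) < 8:
        return False, "short or empty reply"
    rx_tx, proto, length, _unit = struct.unpack(">HHHB", reply[:7])
    if proto != 0:
        return False, "protocol-id is not 0 (not Modbus/TCP)"
    if rx_tx != tx_id:
        return False, "transaction id not echoed"
    if length != len(reply) - 6:
        return False, "MBAP length field does not match frame size"
    function = reply[7]
    if function & 0x80:
        # Exception PDU: the device answered Modbus, so it is reachable.
        code = reply[8] if len(reply) > 8 else 0
        return True, f"Modbus exception 0x{code:02X} (device answered)"
    if function == 0x03:
        return True, "Modbus FC03 OK"
    return True, f"Modbus reply (function 0x{function:02X})"
```

For example, an HTTP server listening on the probed port fails the protocol-id check, which is exactly the non-Modbus TCP server case the Goal section calls out.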
### Tier B — real handshake, unit-proven only (no rig target → live-verify deferred, honestly recorded)
6. **AbLegacy** — same `libplctag` Init handshake as AbCip but the PCCC protocol family (verified-by-proxy via
AbCip's identical code path). No PLC5/SLC sim on the rig.
7. **TwinCAT** — `AdsClient.Connect` + `ReadStateAsync` ⇒ `Ok` with the ADS state (Run/Config/Stop). An ADS
**route-table** rejection is a *true* RED (the driver also cannot function without an authorized route — the
message says so: *"check the target's ADS route table authorizes this host"*). **Degrade guard:** if
`AdsClient` cannot construct/connect headless (managed AMS router unavailable), catch and fall back to the
TCP-preflight result with a *"ADS handshake unavailable on this host — TCP reachability only"* note — never
worse than today. No TwinCAT target on the rig.
8. **FOCAS** — attempt `cnc_allclibhndl3` via the existing wire P/Invoke. **Degrade guard:** the FWLIB native
lib is absent on the dev box / Linux containers (`UnimplementedFocasClientFactory` gates the driver), so the
call throws `DllNotFoundException` / `NotSupportedException` ⇒ catch and fall back to the TCP-preflight result
with a *"FOCAS handshake unavailable on this host (FWLIB absent) — TCP reachability only"* note. A real
handshake runs only on a Windows host with FWLIB + a reachable CNC. No CNC on the rig.
**Degradation principle:** the three Tier-B handshakes must NEVER produce a result worse than today's TCP-only
probe. A genuine protocol rejection from a reachable device is a correct RED; an *environmental inability to run
the handshake at all* (no FWLIB, no managed router) degrades to the existing TCP-reachability message.
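
The distinction between a genuine rejection and an environmental inability can be sketched as two exception classes with different outcomes. This is a hedged Python illustration, not the C# implementation: `HandshakeUnavailable` stands in for conditions such as `DllNotFoundException` or a missing managed router, `HandshakeRejected` for a protocol-level refusal, and both names plus `probe_with_degrade` are hypothetical. The function assumes the TCP preflight has already succeeded, and the message wording is paraphrased.

```python
from typing import Callable, Tuple


class HandshakeUnavailable(Exception):
    """The host cannot run the handshake at all (hypothetical stand-in)."""


class HandshakeRejected(Exception):
    """A reachable device refused the protocol handshake (hypothetical stand-in)."""


def probe_with_degrade(endpoint: str, proto: str,
                       handshake: Callable[[], str]) -> Tuple[bool, str]:
    """Run the handshake after a successful TCP preflight; degrade if it cannot run."""
    try:
        detail = handshake()
    except HandshakeUnavailable as exc:
        # Environmental inability: keep today's TCP-only green, with a note.
        return True, (f"Reachable at {endpoint} ({proto} handshake unavailable "
                      f"on this host: {exc}; TCP reachability only)")
    except HandshakeRejected as exc:
        # Genuine protocol rejection from a reachable device: a correct RED.
        return False, f"Reachable at {endpoint} but {proto} handshake failed: {exc}"
    return True, detail
```

The design point is that only the first `except` branch preserves the pre-existing result; any other failure is allowed to turn the probe RED.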
## Testing & verification
- **Unit (per probe, TDD red→green, xUnit + Shouldly):** in-process `TcpListener` drives — (a) invalid/empty
config, (b) unreachable → `Connect failed`, (c) **TCP-accepts-then-closes / garbage → handshake-fail** (the
key new path — cleanly assertable for Modbus via a canned-MBAP server; loopback-accept for the rest), plus
Modbus's canned-MBAP **happy** path and FOCAS's **degrade** path (DllNotFound on the dev box is the actual CI
behavior, so it is directly testable). Galaxy's `Unauthenticated⇒Ok` is testable against a tiny in-process
gRPC server or via the live gateway.
- **Live `/run` (agent-driven, extends `DriverTestConnectE2eTests`, skip-gated):** Modbus, OpcUaClient, S7,
AbCip against the rig sims; Galaxy against `10.100.0.48:5120`. Each: green vs the live sim, RED vs wrong port
/ non-protocol server, timeout vs a black-hole IP. AbLegacy / TwinCAT / FOCAS live-verify is **honestly
deferred** (no PLC5/SLC sim, no ADS target, no CNC+FWLIB) — unit-proven + degrade-guarded.
- `dotnet build` clean (production projects are `TreatWarningsAsErrors`) + full `dotnet test` green before merge.
- Final integration review (the three degrade guards + the consistent message contract + no-regression-vs-TCP).
- **No bUnit** — no Razor change (the `DriverTestConnectButton` already renders whatever the probe returns).
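
The unit scenarios (b) and (c) and the canned-MBAP happy path can be pictured with a loopback listener. The sketch below uses Python sockets as an analogue of the in-process `TcpListener` approach; it is not the xUnit code, and `start_fake_server` and `probe_modbus` are hypothetical names. It simplifies in two ways: a connect timeout is reported as a connect failure rather than with the separate timeout message, and a single `recv` is assumed to return the whole small reply on loopback.

```python
import socket
import struct
import threading
from typing import Optional, Tuple


def start_fake_server(reply_pdu: Optional[bytes]) -> int:
    """Serve one loopback connection and return the ephemeral port.

    reply_pdu=None  -> accept, then close without answering.
    reply_pdu=bytes -> read the request, answer with an MBAP frame echoing its TxId.
    """
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))
    listener.listen(1)
    port = listener.getsockname()[1]

    def serve() -> None:
        conn, _ = listener.accept()
        with conn:
            if reply_pdu is not None:
                request = conn.recv(260)
                # Echo TxId, protocol-id 0, length = unit-id + PDU, then unit-id.
                header = request[0:2] + b"\x00\x00" + struct.pack(">H", len(reply_pdu) + 1)
                conn.sendall(header + request[6:7] + reply_pdu)
        listener.close()

    threading.Thread(target=serve, daemon=True).start()
    return port


def probe_modbus(host: str, port: int, timeout: float = 2.0) -> Tuple[bool, str]:
    """TCP preflight plus one FC03 exchange, returning (ok, message)."""
    tx_id = 1
    # MBAP header (TxId, proto 0, length 6, unit 1) + FC03 PDU (addr 0, qty 1).
    request = struct.pack(">HHHBBHH", tx_id, 0, 6, 1, 0x03, 0, 1)
    try:
        sock = socket.create_connection((host, port), timeout=timeout)
    except OSError as exc:
        return False, f"Connect failed: {exc}"
    rejected = f"Reachable at {host}:{port} but Modbus handshake failed: "
    with sock:
        try:
            sock.sendall(request)
            reply = sock.recv(260)
        except OSError as exc:
            return False, rejected + str(exc)
    if len(reply) < 8:
        return False, rejected + "connection closed or short reply"
    rx_tx, proto = struct.unpack(">HH", reply[:4])
    if rx_tx != tx_id or proto != 0:
        return False, rejected + "not an MBAP reply"
    if reply[7] & 0x80:
        return True, "Modbus exception reply (device answered)"
    return True, "Modbus FC03 OK"
```

Calling `probe_modbus("127.0.0.1", start_fake_server(None))` exercises the accept-then-close path, while passing a canned PDU such as `bytes([0x03, 0x02, 0x00, 0x2A])` exercises the happy path.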
## Out of scope / not touched
- Historian.Wonderware probe (already a real Hello/HelloAck handshake).
- `IDriverProbe` / `DriverProbeResult` contract, DI registration, the AdminUI button/Razor, the persisted
`DriverInstance` row (probes run against transient form config only).
- Plan `2026-05-28-adminui-driver-pages` Phase 9 typed **address pickers** and Phase 10 driver-page E2E — those
are Phase 6 (AdminUI), not this phase.
## Hard constraints (carried from the parent roadmap)
- **NO Configuration entity / EF migration.** No contract change (`IDriverProbe`/`DriverProbeResult` frozen).
- Stage by path — never `git add .`. Never stage `sql_login.txt`, `src/Server/.../Host/pki/`, `pending.md`,
`current.md`, `docker-dev/docker-compose.yml`, `stillpending.md`. **Never echo or commit the gateway API key**
(the Galaxy live-verify sources it without echoing, per the established recipe). No force-push, no `--no-verify`.
- Finish = merge to master + push.