docs(phase5): design — Test-Connect protocol handshakes (all 8 probes, best-effort)
@@ -0,0 +1,126 @@
# Still-Pending Phase 5: Test-Connect protocol handshakes (design)

> **Status:** approved 2026-06-16. Parent roadmap: `docs/plans/2026-06-15-stillpending-backlog-design.md` (Phase 5).
> Source backlog: `stillpending.md` §2 ("Test-Connect probes are TCP-only") + plan
> `2026-05-28-adminui-driver-pages` Phase 7 + `2026-06-12-historian-tcp-transport` task 9.
> Branch `feat/stillpending-phase-5-probes` off master `050f5c4b`. Phases 0–4 already shipped.

## Goal

Replace the bare-TCP `socket.ConnectAsync` Test-Connect probes with **real protocol handshakes** so a
*live-but-rejecting* device reads **RED**, not green. Today a firewalled port, a non-Modbus TCP server, a
PLC at the wrong rack/slot, or a down-but-port-forwarded OPC UA endpoint all surface a healthy tick: the
operator gets a false "connection OK" and only discovers the truth when the driver faults at deploy.

## Grounding (verified this session)

- **All 8 probes are byte-identical TCP-only** boilerplate (`deserialize → ExtractTarget → Socket.ConnectAsync
  → close`): Modbus, S7, AbCip, AbLegacy, TwinCAT, OpcUaClient, Galaxy, FOCAS.
- **Historian.Wonderware is ALREADY a real handshake**: it sends a `Hello` and confirms `HelloAck`
  (`WonderwareHistorianDriverProbe.cs:54-71`). So §2's "task 9 / historian probe" is **already done**; this
  phase does not touch it (only documents it).
- **Every driver already owns the handshake primitive** in its client code, so no new package references:
  - Modbus: `ModbusTcpTransport.ConnectAsync` + `IModbusTransport.SendAsync(unitId, pdu, ct)`.
  - S7: `new Plc(cpu, host, port, rack, slot).OpenAsync(ct)` (COTP CR/CC + S7 setup-communication).
  - AbCip / AbLegacy: `libplctag` Tag `InitializeAsync` (opens the EIP session + first CIP op).
  - TwinCAT: `AdsClient.Connect(netId, port)` + `ReadStateAsync` (`AdsTwinCATClient.cs:90,194`).
  - OpcUaClient: `DiscoveryClient.GetEndpointsAsync` (no session / cert / auth; see `OpcUaClientDriver.cs:422`).
  - Galaxy: the `MxGateway.Client` gRPC channel + one lightweight unary call.
  - FOCAS: `cnc_allclibhndl3` via the existing wire P/Invoke (`Wire.WireFocasClient`).
- **Probe dispatch** clamps the timeout to 1–60 s and passes a cancelled-on-timeout `ct`
  (`AdminOperationsActor.cs:284-291`). Probes MUST honour it and MUST NOT mutate state (read-only handshakes).
- **A proven skip-gated E2E harness exists** (`DriverTestConnectE2eTests`) targeting the live Modbus sim with
  happy / wrong-port / black-hole scenarios, auto-skipping when the fixture is unreachable. The dev-rig sims
  (Modbus `:5020`, AbCip `:44818`, S7 `:1102`, opc-plc `:50000`) and the mxaccessgw (`10.100.0.48:5120`) are
  reachable from the dev Mac, so 5 of the 8 are live-verifiable agent-side.
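
The dispatch rule in the list above (clamp to 1–60 s, cancel on timeout) can be sketched in a language-neutral way. This is only an illustration in Python, not the code in `AdminOperationsActor.cs`; `clamp_timeout` and `run_probe` are hypothetical names, and `asyncio.wait_for` stands in for the cancelled-on-timeout `ct`.

```python
import asyncio

MIN_TIMEOUT_S = 1
MAX_TIMEOUT_S = 60


def clamp_timeout(requested_s: float) -> float:
    """Clamp the operator-supplied timeout into the 1-60 s window."""
    return max(MIN_TIMEOUT_S, min(MAX_TIMEOUT_S, requested_s))


async def run_probe(probe, requested_s: float):
    """Run an async probe under the clamped timeout.

    `probe` is a hypothetical zero-argument coroutine function returning
    (ok, message). wait_for cancels it when the deadline passes, which is the
    Python analogue of a cancellation token firing on timeout.
    """
    timeout_s = clamp_timeout(requested_s)
    try:
        return await asyncio.wait_for(probe(), timeout=timeout_s)
    except asyncio.TimeoutError:
        return (False, f"Probe timed out after {timeout_s:g}s.")
```

A probe that ignores cancellation would defeat this wrapper, which is why the probes are required to honour the token themselves.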

## Architecture

**Per-probe, no shared scaffold.** Each probe stays self-contained in its own driver project (matches the
existing intentional copy-paste style, keeps all 8 projects disjoint → parallelizable like Phase 4, and avoids
adding socket/handshake logic to the `Core.Abstractions` contracts project). Each in-scope probe keeps its TCP
preflight and **adds one handshake step**, reusing the driver's own client primitive.

**New three-way result contract** (the operator value; message templates kept consistent across all 8):

| Outcome | Result |
|---|---|
| TCP connect fails | `Ok=false` · `"Connect failed: {SocketError}"` *(unchanged)* |
| TCP ok **+ handshake ok** | `Ok=true` · latency · descriptive msg (e.g. `"Modbus FC03 OK"`, `"OPC UA: N endpoint(s)"`, `"S7 connected (CPU …)"`, `"CIP session OK"`, `"ADS state: Run"`, `"gateway gRPC OK"`) |
| TCP ok **but handshake rejected** | `Ok=false` · `"Reachable at {host}:{port} but {proto} handshake failed: {detail}"` ← **the new behavior** |
| timeout | `Ok=false` · `"Probe timed out after {n}s."` *(unchanged)* |

`IDriverProbe` / `DriverProbeResult` are **unchanged**: no Commons / Core contract touch, no DI change
(the 8 probes are already registered in `DriverFactoryBootstrap.AddOtOpcUaDriverProbes`).
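
To make the message contract concrete, here is a minimal sketch of the four table rows as constructor helpers. It is written in Python purely for illustration; `ProbeResult` is a hypothetical stand-in whose fields are assumed from the table (`Ok`, latency, message) and is not the real `DriverProbeResult` type.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ProbeResult:
    # Hypothetical stand-in; field names are assumptions based on the table.
    ok: bool
    message: str
    latency_ms: Optional[float] = None


def connect_failed(socket_error: str) -> ProbeResult:
    """TCP connect fails (unchanged behaviour)."""
    return ProbeResult(False, f"Connect failed: {socket_error}")


def handshake_ok(message: str, latency_ms: float) -> ProbeResult:
    """TCP ok and the protocol handshake succeeded."""
    return ProbeResult(True, message, latency_ms)


def handshake_rejected(host: str, port: int, proto: str, detail: str) -> ProbeResult:
    """TCP ok but the peer did not complete the protocol handshake (the new RED)."""
    return ProbeResult(
        False, f"Reachable at {host}:{port} but {proto} handshake failed: {detail}"
    )


def timed_out(seconds: int) -> ProbeResult:
    """The probe exceeded its deadline (unchanged behaviour)."""
    return ProbeResult(False, f"Probe timed out after {seconds}s.")
```

Because the design keeps the probes free of a shared scaffold, each probe would repeat these templates rather than call common helpers; the sketch only shows the strings they are expected to agree on.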

## Per-driver handshake + degradation

### Tier A: real handshake, live-verifiable on the rig (agent-driven)

1. **Modbus**: `ConnectAsync` then `SendAsync` one **FC03** (Read Holding Registers, qty 1 @ addr 0, unit
   from config/default 1). Any well-formed MBAP reply that echoes the TxId with protocol-id 0, **including a
   Modbus exception PDU (0x83…)**, proves a real Modbus device ⇒ `Ok`. A malformed/non-MBAP reply or silence
   ⇒ handshake-fail. Sim `:5020`.
2. **OpcUaClient**: `DiscoveryClient.GetEndpointsAsync` (no session, no app-cert, no auth). ≥1 endpoint ⇒
   `Ok` (`"OPC UA: N endpoint(s)"`); a non-OPC-UA TCP server throws/times out ⇒ handshake-fail. opc-plc `:50000`.
3. **S7**: `new Plc(...).OpenAsync(ct)` with `ReadTimeout` set first (mirror `S7Driver.cs:164`), check
   `IsConnected`, `Close`. Wrong rack/slot or non-S7 server ⇒ `OpenAsync` throws ⇒ handshake-fail. python-snap7
   `:1102`.
4. **AbCip**: create a `libplctag` Tag for the first configured tag path (else a benign probe name) and
   `InitializeAsync`. Session opens ⇒ `Ok`; a **CIP-level** error (tag-not-found / bad-path) **still counts as
   reachable** (the controller answered CIP); a session/ForwardOpen/connect error ⇒ handshake-fail. CIP sim `:44818`.
5. **Galaxy**: build the `MxGateway.Client` gRPC channel (honour the config's cleartext/TLS) and issue one
   lightweight unary call. **The probe does NOT resolve `secretref:` secrets**: it sends whatever key string is
   in the transient config (possibly empty/unresolved). An `OK` reply ⇒ `Ok`; an **`Unauthenticated` /
   `PermissionDenied`** reply **also ⇒ `Ok`** ("gateway reachable & speaking gRPC; auth not checked") because it
   proves a live mxgw server; `Unavailable` / transport error ⇒ handshake-fail. Gateway `10.100.0.48:5120`.
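
The Modbus rule in item 1 is the easiest to show at the byte level. The sketch below, in Python for illustration only, builds the FC03 request and applies the acceptance test described there (TxId echoed, protocol-id 0, exception PDU accepted). The function names are hypothetical, and the length-field and function-code checks are assumptions about what "well-formed" means here; the real probe goes through `IModbusTransport.SendAsync`.

```python
import struct

FC_READ_HOLDING = 0x03
FC_READ_HOLDING_EXCEPTION = 0x83  # function code with the error bit set


def build_fc03_request(tx_id: int, unit_id: int = 1, addr: int = 0, qty: int = 1) -> bytes:
    """Build an MBAP-framed Read Holding Registers request."""
    pdu = struct.pack(">BHH", FC_READ_HOLDING, addr, qty)
    # MBAP header: transaction id, protocol id (always 0), length, unit id.
    # The length field counts the unit id byte plus the PDU.
    mbap = struct.pack(">HHHB", tx_id, 0, len(pdu) + 1, unit_id)
    return mbap + pdu


def is_modbus_reply(reply: bytes, tx_id: int) -> bool:
    """Return True when `reply` looks like an MBAP answer to our request.

    Assumes `reply` holds exactly one complete frame.
    """
    if len(reply) < 8:  # 7-byte MBAP header plus at least a function code
        return False
    r_tx, r_proto, r_len, _unit = struct.unpack(">HHHB", reply[:7])
    if r_tx != tx_id or r_proto != 0:
        return False
    if r_len != len(reply) - 6:  # bytes after the length field
        return False
    # A normal response or a Modbus exception both prove a Modbus peer.
    return reply[7] in (FC_READ_HOLDING, FC_READ_HOLDING_EXCEPTION)
```

An HTTP server answering on the same port fails the TxId check on its first two bytes, which is the *live-but-rejecting* case the phase is meant to turn RED.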

### Tier B: real handshake, unit-proven only (no rig target → live-verify deferred, honestly recorded)

6. **AbLegacy**: same `libplctag` Init handshake as AbCip but the PCCC protocol family (verified-by-proxy via
   AbCip's identical code path). No PLC5/SLC sim on the rig.
7. **TwinCAT**: `AdsClient.Connect` + `ReadStateAsync` ⇒ `Ok` with the ADS state (Run/Config/Stop). An ADS
   **route-table** rejection is a *true* RED (the driver also cannot function without an authorized route, and the
   message says so: *"check the target's ADS route table authorizes this host"*). **Degrade guard:** if
   `AdsClient` cannot construct/connect headless (managed AMS router unavailable), catch and fall back to the
   TCP-preflight result with a *"ADS handshake unavailable on this host — TCP reachability only"* note, so the
   result is never worse than today. No TwinCAT target on the rig.
8. **FOCAS**: attempt `cnc_allclibhndl3` via the existing wire P/Invoke. **Degrade guard:** the FWLIB native
   lib is absent on the dev box / Linux containers (`UnimplementedFocasClientFactory` gates the driver), so the
   call throws `DllNotFoundException` / `NotSupportedException` ⇒ catch and fall back to the TCP-preflight result
   with a *"FOCAS handshake unavailable on this host (FWLIB absent) — TCP reachability only"* note. A real
   handshake runs only on a Windows host with FWLIB + a reachable CNC. No CNC on the rig.

**Degradation principle:** the three Tier-B handshakes must NEVER produce a result worse than today's TCP-only
probe. A genuine protocol rejection from a reachable device is a correct RED; an *environmental inability to run
the handshake at all* (no FWLIB, no managed router) degrades to the existing TCP-reachability message.
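
The degradation principle separates two failure kinds: the device said no, versus this host cannot ask. A rough Python sketch of that control flow follows. `HandshakeUnavailable` is a hypothetical marker standing in for the environmental exceptions named above (such as `DllNotFoundException`), and appending the note to the TCP message in parentheses is an assumption about formatting, not something the design specifies.

```python
from typing import Callable, Tuple

Result = Tuple[bool, str]


class HandshakeUnavailable(Exception):
    """Hypothetical marker: the handshake cannot run on this host at all."""


def probe_with_degrade(
    tcp_preflight: Callable[[], Result],
    handshake: Callable[[], Result],
    unavailable_note: str,
) -> Result:
    """Run the TCP preflight, then the handshake, degrading only on environment errors."""
    tcp_ok, tcp_message = tcp_preflight()
    if not tcp_ok:
        return (False, tcp_message)  # unchanged "Connect failed" path
    try:
        # A genuine rejection is returned by handshake() as (False, ...) and stays RED.
        return handshake()
    except HandshakeUnavailable:
        # Same verdict as the TCP-only probe, plus a note explaining the limit.
        return (True, f"{tcp_message} ({unavailable_note})")
```

Only the marker exception is caught; any protocol-level failure has to come back as an ordinary RED result, otherwise the guard would hide real rejections.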

## Testing & verification

- **Unit (per probe, TDD red→green, xUnit + Shouldly):** in-process `TcpListener` drives: (a) invalid/empty
  config, (b) unreachable → `Connect failed`, (c) **TCP-accepts-then-closes / garbage → handshake-fail** (the
  key new path, cleanly assertable for Modbus via a canned-MBAP server; loopback-accept for the rest), plus
  Modbus's canned-MBAP **happy** path and FOCAS's **degrade** path (DllNotFound on the dev box is the actual CI
  behavior, so it is directly testable). Galaxy's `Unauthenticated⇒Ok` is testable against a tiny in-process
  gRPC server or via the live gateway.
- **Live `/run` (agent-driven, extends `DriverTestConnectE2eTests`, skip-gated):** Modbus, OpcUaClient, S7,
  AbCip against the rig sims; Galaxy against `10.100.0.48:5120`. Each: green vs the live sim, RED vs wrong port
  / non-protocol server, timeout vs a black-hole IP. AbLegacy / TwinCAT / FOCAS live-verify is **honestly
  deferred** (no PLC5/SLC sim, no ADS target, no CNC+FWLIB): unit-proven + degrade-guarded.
- `dotnet build` clean (production projects are `TreatWarningsAsErrors`) + full `dotnet test` green before merge.
- Final integration review (the three degrade guards + the consistent message contract + no-regression-vs-TCP).
- **No bUnit**: no Razor change (the `DriverTestConnectButton` already renders whatever the probe returns).
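
The "TCP-accepts-then-closes" unit scenario can be pictured with a loopback listener that accepts one connection and drops it. The sketch below uses Python sockets only to show the shape of the test; the real tests use an in-process `TcpListener` with xUnit, and `start_accept_then_close_server` and `probe_once` are hypothetical names. Frame validation of a non-empty reply is deliberately left out.

```python
import socket
import threading
from typing import Tuple


def start_accept_then_close_server() -> Tuple[socket.socket, int]:
    """Listen on an ephemeral loopback port; accept one client and close it at once."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))
    listener.listen(1)

    def serve_once() -> None:
        conn, _ = listener.accept()
        conn.close()  # no protocol reply at all

    threading.Thread(target=serve_once, daemon=True).start()
    return listener, listener.getsockname()[1]


def probe_once(host: str, port: int, request: bytes, timeout_s: float = 2.0) -> Tuple[bool, str]:
    """TCP preflight followed by one request/reply exchange."""
    try:
        sock = socket.create_connection((host, port), timeout=timeout_s)
    except OSError as exc:
        return (False, f"Connect failed: {exc}")
    with sock:
        try:
            sock.sendall(request)
            reply = sock.recv(260)
            detail = "connection closed without a reply"
        except OSError as exc:  # reset, broken pipe, or read timeout
            reply = b""
            detail = str(exc) or type(exc).__name__
    if not reply:
        return (False, f"Reachable at {host}:{port} but Modbus handshake failed: {detail}")
    return (True, "reply received (frame validation not shown)")
```

Depending on timing the client sees either an orderly close or a reset; both land on the handshake-fail message, which is what scenario (c) asserts.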

## Out of scope / not touched

- Historian.Wonderware probe (already a real Hello/HelloAck handshake).
- `IDriverProbe` / `DriverProbeResult` contract, DI registration, the AdminUI button/Razor, the persisted
  `DriverInstance` row (probes run against transient form config only).
- Plan `2026-05-28-adminui-driver-pages` Phase 9 typed **address pickers** and Phase 10 driver-page E2E; those
  are Phase 6 (AdminUI), not this phase.

## Hard constraints (carried from the parent roadmap)

- **NO Configuration entity / EF migration.** No contract change (`IDriverProbe`/`DriverProbeResult` frozen).
- Stage by path; never `git add .`. Never stage `sql_login.txt`, `src/Server/.../Host/pki/`, `pending.md`,
  `current.md`, `docker-dev/docker-compose.yml`, `stillpending.md`. **Never echo or commit the gateway API key**
  (the Galaxy live-verify sources it without echoing, per the established recipe). No force-push, no `--no-verify`.
- Finish = merge to master + push.