docs(phase5): real Test-Connect handshakes per driver + degrade semantics

Create docs/drivers/TestConnectProbes.md: full reference for the Phase 5
protocol-handshake probes — result contract, per-driver handshake table,
TwinCAT/FOCAS/Galaxy degrade semantics, live-verify scope, and the
Historian.Wonderware already-done note. Annotate the Phase 7 step in
docs/plans/2026-05-28-adminui-driver-pages-design.md with a shipped note
pointing at the phase-5 design doc and TestConnectProbes.md.
Joseph Doherty
2026-06-16 07:06:47 -04:00
parent 5df3c73204
commit 961b2b558d
2 changed files with 141 additions and 0 deletions
@@ -0,0 +1,136 @@
# Test-Connect Probes — Protocol Handshakes
Each driver's **Test-Connect** button in the AdminUI runs a probe against the
form's current config (never the persisted row, never the live driver actor).
Before Phase 5 (shipped 2026-06-16) every probe was a bare TCP `ConnectAsync`
— a live-but-rejecting device showed a healthy green tick, and the operator
only discovered the truth when the driver faulted at deploy. Phase 5 replaced
each TCP-only probe with a **real protocol handshake** so a reachable-but-wrong
or actively-rejecting endpoint now reads RED.
The `IDriverProbe` / `DriverProbeResult` contract and DI registration are
unchanged. Probes run in a transient actor with a timeout clamp of 1–60 s
and must not mutate any state.
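A minimal sketch of running a probe under a clamped timeout, for illustration only. It is not the `AdminOperationsActor` code: `ProbeTimeoutSketch`, `Clamp`, and `RunAsync` are hypothetical names, the clamp bounds are passed in as parameters rather than assumed, and returning a `null` latency on timeout is an assumption.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Mirrors the record signature quoted in this document.
public sealed record DriverProbeResult(bool Ok, string? Message, TimeSpan? Latency);

public static class ProbeTimeoutSketch
{
    // Hypothetical helper: forces the requested timeout into [min, max].
    public static TimeSpan Clamp(TimeSpan requested, TimeSpan min, TimeSpan max) =>
        requested < min ? min : requested > max ? max : requested;

    // Runs the probe delegate and maps a timeout to the shared template.
    public static async Task<DriverProbeResult> RunAsync(
        Func<CancellationToken, Task<DriverProbeResult>> probe,
        TimeSpan timeout)
    {
        // The token is cancelled automatically once the timeout elapses.
        using var cts = new CancellationTokenSource(timeout);
        try
        {
            return await probe(cts.Token);
        }
        catch (OperationCanceledException) when (cts.IsCancellationRequested)
        {
            return new DriverProbeResult(
                false, $"Probe timed out after {timeout.TotalSeconds:0}s.", null);
        }
    }
}
```

The probe delegate has to honor the cancellation token for the timeout to take effect; a probe that ignores it would need a `Task.WhenAny` race instead.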
For the AdminUI probe flow (button → `AdminOperationsActor` → transient probe
actor), see
[`docs/plans/2026-05-28-adminui-driver-pages-design.md`](../plans/2026-05-28-adminui-driver-pages-design.md)
§4.
---
## Result contract
All probes return a consistent `DriverProbeResult(bool Ok, string? Message, TimeSpan? Latency)`.
The message templates below are uniform across all 8 drivers:
| Outcome | `Ok` | Message template |
|---------|------|-----------------|
| TCP connect fails | `false` | `"Connect failed: {SocketErrorCode}"` |
| TCP ok + handshake ok | `true` | driver-specific descriptive string (see table below) |
| TCP ok but handshake rejected | `false` | `"Reachable at {host}:{port} but {proto} handshake failed: {detail}"` |
| Timeout | `false` | `"Probe timed out after {n}s."` |
The third row is the key new behavior: a reachable device that answers on the
port but rejects the protocol-level handshake now surfaces a `false` result
with a human-readable explanation rather than a false-green TCP-open tick.
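The four outcomes in the table can be sketched as small factory helpers. This is an illustrative sketch rather than the shipped code: the `ProbeResults` class and its method names are hypothetical, and leaving `Latency` as `null` on failures is an assumption the table does not confirm.

```csharp
using System;
using System.Net.Sockets;

// Mirrors the record signature quoted in this document.
public sealed record DriverProbeResult(bool Ok, string? Message, TimeSpan? Latency);

// Hypothetical helper class; one factory per row of the outcome table.
public static class ProbeResults
{
    // Row 1: the TCP connect itself failed.
    public static DriverProbeResult ConnectFailed(SocketError code) =>
        new DriverProbeResult(false, $"Connect failed: {code}", null);

    // Row 2: TCP and handshake both succeeded; the message is driver-specific.
    public static DriverProbeResult HandshakeOk(string message, TimeSpan latency) =>
        new DriverProbeResult(true, message, latency);

    // Row 3: the port answered but the protocol handshake was rejected.
    public static DriverProbeResult HandshakeRejected(
        string host, int port, string proto, string detail) =>
        new DriverProbeResult(
            false,
            $"Reachable at {host}:{port} but {proto} handshake failed: {detail}",
            null);

    // Row 4: the probe exceeded its timeout.
    public static DriverProbeResult TimedOut(int seconds) =>
        new DriverProbeResult(false, $"Probe timed out after {seconds}s.", null);
}
```

Centralizing the templates in one place is one way to keep the wording uniform across drivers; whether the real probes share such a helper is not stated here.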
---
## Per-driver handshake
| Driver | Handshake | Ok message | Dev-rig target |
|--------|-----------|------------|----------------|
| **Modbus** | FC03 (Read Holding Registers, qty 1 @ addr 0) via `ModbusTcpTransport`. A Modbus exception PDU still proves a real Modbus device → `Ok`. A non-MBAP reply → handshake fail. | `"Modbus FC03 OK"` | `10.100.0.35:5020` (Modbus sim) |
| **OpcUaClient** | `DiscoveryClient.GetEndpointsAsync` — no session, no app-cert, no auth. ≥ 1 endpoint → `Ok`. A non-OPC-UA TCP server throws or times out → handshake fail. | `"OPC UA: N endpoint(s)"` | `opc.tcp://10.100.0.35:50000` (opc-plc) |
| **S7** | `Plc.OpenAsync` (COTP CR/CC + S7 setup-communication), check `IsConnected`, then `Close`. Wrong rack/slot or a non-S7 server causes `OpenAsync` to throw → handshake fail. | `"S7 connected (CPU …)"` | `10.100.0.35:1102` (python-snap7 sim) |
| **AbCip** | `libplctag` Tag `InitializeAsync` (EIP session + CIP Forward Open). A CIP-level error such as tag-not-found still proves the controller answered CIP → `Ok`. A session/ForwardOpen/connect error → handshake fail. | `"CIP session OK"` | `10.100.0.35:44818` (CIP sim) |
| **AbLegacy** | Same `libplctag` `InitializeAsync` handshake as AbCip, PCCC protocol family. | `"CIP session OK"` (PCCC family) | Deferred — no PLC5/SLC sim |
| **TwinCAT** | `AdsClient.Connect` + `ReadStateAsync`. See [degrade semantics](#twincat-degrade) below. | `"ADS state: {state}"` | Deferred — no ADS target |
| **FOCAS** | `cnc_allclibhndl3` via FWLIB P/Invoke (`Wire.WireFocasClient`). See [degrade semantics](#focas-degrade) below. | `"FOCAS handle OK"` | Deferred — no CNC + FWLIB |
| **Galaxy** | gRPC unary call to `GalaxyRepository.TestConnection` on the configured mxaccessgw endpoint. See [auth-rejection rule](#galaxy-auth-rejection-rule) below. | `"gateway gRPC OK"` | `http://10.100.0.48:5120` (mxaccessgw) |
**Historian.Wonderware** already performed a real handshake (`Hello` → `HelloAck`)
before Phase 5 and was not changed by this work. See
[`Historian.Wonderware.md`](Historian.Wonderware.md) for details.
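To make the Modbus row concrete, the sketch below builds an FC03 request (quantity 1 at address 0) as a Modbus TCP frame and classifies a reply. It follows the public Modbus TCP framing (7-byte MBAP header plus PDU) and is not the `ModbusTcpTransport` implementation; `ModbusFc03Sketch`, the transaction id, and the unit id are assumptions chosen for illustration.

```csharp
using System;

public static class ModbusFc03Sketch
{
    // Builds a 12-byte Modbus TCP ADU: 7-byte MBAP header + 5-byte FC03 PDU.
    public static byte[] BuildRequest(ushort transactionId, byte unitId)
    {
        return new byte[]
        {
            (byte)(transactionId >> 8), (byte)transactionId, // transaction id
            0x00, 0x00,                                       // protocol id, 0 = Modbus
            0x00, 0x06,                                       // length: unit id + PDU
            unitId,                                           // unit id
            0x03,                                             // FC03 Read Holding Registers
            0x00, 0x00,                                       // start address 0
            0x00, 0x01,                                       // quantity 1
        };
    }

    // True when the reply carries a valid MBAP header and either a normal
    // FC03 response or the FC03 exception (0x83). Both prove a Modbus device.
    public static bool LooksLikeModbus(ReadOnlySpan<byte> reply, ushort transactionId)
    {
        if (reply.Length < 9) return false; // MBAP (7) + function + 1 more byte
        ushort tid = (ushort)((reply[0] << 8) | reply[1]);
        ushort protocolId = (ushort)((reply[2] << 8) | reply[3]);
        byte function = reply[7];
        return tid == transactionId
            && protocolId == 0
            && (function == 0x03 || function == 0x83);
    }
}
```

A reply that fails `LooksLikeModbus` corresponds to the "non-MBAP reply" case in the table, while an exception PDU still counts as a real Modbus device.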
---
## Degrade semantics
Three drivers need special-case handling. TwinCAT and FOCAS have environmental
constraints that can prevent the handshake from running on certain hosts;
Galaxy has an auth-rejection rule, described in its own subsection. The
**degradation principle** is: the probe must never produce a result *worse*
than the pre-Phase-5 TCP-only probe. A genuine protocol rejection from a
reachable device is a correct RED; an inability to *run* the handshake at all
(no FWLIB, no managed router) degrades to the existing TCP-reachability
message, which is still a green tick but annotated.
### TwinCAT degrade
Where the handshake is available:
- `AdsClient.Connect(netId, port)` + `ReadStateAsync` → `Ok=true`,
`"ADS state: {state}"` (Run / Config / Stop).
- An ADS **route-table rejection** from a reachable ADS router is a **true RED**:
`"Reachable at {host}:{port} but ADS handshake failed: {detail} — check the
target's ADS route table authorizes this host"`. This is the correct result:
the driver would also be unable to function without an authorized route.
Where the handshake is unavailable (headless server, no TwinCAT runtime, the
managed AMS router cannot start):
- Probe degrades to TCP-reachability: `Ok=true`,
`"(ADS handshake unavailable on this host — TCP reachability only)"`.
### FOCAS degrade
On a Windows host with the FANUC FWLIB shared library present:
- `cnc_allclibhndl3` is called via the existing `Wire.WireFocasClient` P/Invoke.
A successful handle allocation → `Ok=true`, `"FOCAS handle OK"`.
- A CNC-level rejection → handshake fail.
On dev, Linux, or macOS (no native FWLIB — `UnimplementedFocasClientFactory`
gates the driver):
- `DllNotFoundException` / `NotSupportedException` is caught and the probe
degrades to TCP-reachability: `Ok=true`,
`"(FOCAS handshake unavailable on this host — FWLIB absent, TCP reachability only)"`.
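The degrade guard described for TwinCAT and FOCAS can be sketched generically as follows. This is a hypothetical illustration, not the driver code: `DegradeGuardSketch` and the `handshake` delegate are invented names, the sketch assumes the TCP connect has already succeeded, and treating every other exception as a handshake rejection is an assumption.

```csharp
using System;

// Mirrors the record signature quoted in this document.
public sealed record DriverProbeResult(bool Ok, string? Message, TimeSpan? Latency);

public static class DegradeGuardSketch
{
    // handshake: hypothetical delegate that returns the Ok message
    // and throws when the handshake fails or cannot run.
    public static DriverProbeResult Run(
        Func<string> handshake,
        string host, int port, string proto,
        string degradedMessage)
    {
        try
        {
            return new DriverProbeResult(true, handshake(), null);
        }
        catch (Exception ex) when (ex is DllNotFoundException or NotSupportedException)
        {
            // The handshake could not run on this host at all:
            // degrade to the annotated TCP-reachability result.
            return new DriverProbeResult(true, degradedMessage, null);
        }
        catch (Exception ex)
        {
            // The handshake ran and the device rejected it: a true RED.
            return new DriverProbeResult(
                false,
                $"Reachable at {host}:{port} but {proto} handshake failed: {ex.Message}",
                null);
        }
    }
}
```

The exception filter keeps the two cases apart: only "cannot run here" exceptions degrade to green, so a real rejection from a reachable device is never masked.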
### Galaxy auth-rejection rule
The probe builds the gRPC channel from the form's config and issues one
lightweight unary call. It does **not** resolve `secretref:` secrets — the
key string in the transient config (possibly empty or unresolved) is used as-is.
- `Unavailable` / transport failure → `Ok=false` (gateway is down or unreachable).
- `Unauthenticated` / `PermissionDenied` → **`Ok=true`**,
`"gateway reachable & speaking gRPC; auth not checked"`. An auth rejection
proves a live mxaccessgw gRPC server is answering. This is the correct result:
the driver's own session layer handles auth; the probe tests reachability only.
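The status mapping above can be sketched as a single switch. To stay self-contained, the sketch uses a hypothetical local `ProbeStatus` enum standing in for the gRPC status codes named in the bullets; it is not the real gRPC client type. The failure message texts and the handling of any other status are assumptions, since this section only specifies `Ok=false` for `Unavailable`.

```csharp
using System;

// Mirrors the record signature quoted in this document.
public sealed record DriverProbeResult(bool Ok, string? Message, TimeSpan? Latency);

// Hypothetical stand-in for the gRPC status codes discussed above.
public enum ProbeStatus { Ok, Unavailable, Unauthenticated, PermissionDenied, Other }

public static class GalaxyProbeSketch
{
    public static DriverProbeResult Map(ProbeStatus status, string detail) => status switch
    {
        ProbeStatus.Ok =>
            new DriverProbeResult(true, "gateway gRPC OK", null),

        // An auth rejection still proves a live gRPC server answered.
        ProbeStatus.Unauthenticated or ProbeStatus.PermissionDenied =>
            new DriverProbeResult(
                true, "gateway reachable & speaking gRPC; auth not checked", null),

        // Assumed wording: the document only specifies Ok=false here.
        ProbeStatus.Unavailable =>
            new DriverProbeResult(false, $"gateway unavailable: {detail}", null),

        // Assumption: any other status is treated as a failed handshake.
        _ => new DriverProbeResult(false, $"gRPC handshake failed: {detail}", null),
    };
}
```

For example, `GalaxyProbeSketch.Map(ProbeStatus.PermissionDenied, "")` yields a green result with the "auth not checked" annotation, which matches the rule that the probe tests reachability only.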
---
## Live-verify scope
| Driver | Live-verify status | Notes |
|--------|-------------------|-------|
| Modbus | Verified | Dev-rig sim `10.100.0.35:5020`; green vs sim, RED vs wrong port / non-Modbus server, timeout vs black-hole IP |
| OpcUaClient | Verified | opc-plc `10.100.0.35:50000`; same three-scenario matrix |
| S7 | Verified | python-snap7 `10.100.0.35:1102` |
| AbCip | Verified | CIP sim `10.100.0.35:44818` |
| Galaxy | Verified | mxaccessgw `10.100.0.48:5120`; `Unauthenticated` reply counts as Ok |
| AbLegacy | Deferred | No PLC5/SLC sim; unit-proven + code path identical to AbCip |
| TwinCAT | Deferred | No ADS target; unit-proven + degrade guard tested |
| FOCAS | Deferred | No CNC + FWLIB on dev host; degrade guard is the CI-observable path |
---
## Implementation references
- Phase 5 design: `docs/plans/2026-06-16-stillpending-phase-5-probes-design.md`
- Parent roadmap: `docs/plans/2026-06-15-stillpending-backlog-design.md` §Phase 5
- AdminUI probe flow: `docs/plans/2026-05-28-adminui-driver-pages-design.md` §4
- Per-driver probe implementations: `src/Drivers/ZB.MOM.WW.OtOpcUa.Driver.<Type>/<Type>DriverProbe.cs`
- `IDriverProbe` contract: `src/Core/ZB.MOM.WW.OtOpcUa.Core.Abstractions/IDriverProbe.cs`
- Probe dispatch + timeout clamp: `src/Server/ZB.MOM.WW.OtOpcUa.Host/Actors/AdminOperationsActor.cs` (around line 284)