Joseph Doherty 961b2b558d docs(phase5): real Test-Connect handshakes per driver + degrade semantics
Create docs/drivers/TestConnectProbes.md: full reference for the Phase 5
protocol-handshake probes — result contract, per-driver handshake table,
TwinCAT/FOCAS/Galaxy degrade semantics, live-verify scope, and the
Historian.Wonderware already-done note. Annotate the Phase 7 step in
docs/plans/2026-05-28-adminui-driver-pages-design.md with a shipped note
pointing at the phase-5 design doc and TestConnectProbes.md.
2026-06-16 07:06:47 -04:00

# Test-Connect Probes — Protocol Handshakes
Each driver's **Test-Connect** button in the AdminUI runs a probe against the
form's current config (never the persisted row, never the live driver actor).
Before Phase 5 (shipped 2026-06-16) every probe was a bare TCP `ConnectAsync`
— a live-but-rejecting device showed a healthy green tick, and the operator
only discovered the truth when the driver faulted at deploy. Phase 5 replaced
each TCP-only probe with a **real protocol handshake** so a reachable-but-wrong
or actively-rejecting endpoint now reads RED.
The `IDriverProbe` / `DriverProbeResult` contract and DI registration are
unchanged. Probes run in a transient actor, with the timeout clamped to 1–60 s,
and must not mutate any state.
For the AdminUI probe flow (button → `AdminOperationsActor` → transient probe
actor), see
[`docs/plans/2026-05-28-adminui-driver-pages-design.md`](../plans/2026-05-28-adminui-driver-pages-design.md)
§4.
---
## Result contract
All probes return a consistent `DriverProbeResult(bool Ok, string? Message, TimeSpan? Latency)`.
The message templates below are uniform across all 8 drivers:
| Outcome | `Ok` | Message template |
|---------|------|-----------------|
| TCP connect fails | `false` | `"Connect failed: {SocketErrorCode}"` |
| TCP ok + handshake ok | `true` | driver-specific descriptive string (see table below) |
| TCP ok but handshake rejected | `false` | `"Reachable at {host}:{port} but {proto} handshake failed: {detail}"` |
| Timeout | `false` | `"Probe timed out after {n}s."` |
The third row is the key new behavior: a reachable device that answers on the
port but rejects the protocol-level handshake now surfaces a `false` result
with a human-readable explanation rather than a false-green TCP-open tick.
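
A minimal Python sketch of the contract may help when reading the table. It is an illustration under stated assumptions, not the C# implementation. `DriverProbeResult` is mirrored as a frozen dataclass. `connect_failed`, `handshake_failed` and `run_with_timeout` are hypothetical helpers. The sketch assumes the caller has already clamped the timeout. It also records latency only on a successful probe, which is an assumption the contract above does not state.

```python
import asyncio
import time
from dataclasses import dataclass, replace
from datetime import timedelta
from typing import Awaitable, Callable, Optional


@dataclass(frozen=True)
class DriverProbeResult:
    # Mirrors the C# DriverProbeResult(bool Ok, string? Message, TimeSpan? Latency).
    ok: bool
    message: Optional[str]
    latency: Optional[timedelta]


def connect_failed(socket_error_code: str) -> DriverProbeResult:
    # Row 1: the TCP connect itself failed.
    return DriverProbeResult(False, f"Connect failed: {socket_error_code}", None)


def handshake_failed(host: str, port: int, proto: str, detail: str) -> DriverProbeResult:
    # Row 3: TCP is open but the protocol-level handshake was rejected.
    return DriverProbeResult(
        False, f"Reachable at {host}:{port} but {proto} handshake failed: {detail}", None
    )


async def run_with_timeout(
    probe: Callable[[], Awaitable[DriverProbeResult]], timeout_s: float
) -> DriverProbeResult:
    # Row 4: a probe that overruns its (already clamped) budget is a timeout.
    start = time.monotonic()
    try:
        result = await asyncio.wait_for(probe(), timeout=timeout_s)
    except asyncio.TimeoutError:
        return DriverProbeResult(False, f"Probe timed out after {timeout_s:g}s.", None)
    if not result.ok:
        # Failures from rows 1 and 3 pass through unchanged.
        return result
    # Row 2 (success): stamp the measured round-trip time on the driver's OK result.
    return replace(result, latency=timedelta(seconds=time.monotonic() - start))
```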
---
## Per-driver handshake
| Driver | Handshake | Ok message | Dev-rig target |
|--------|-----------|------------|----------------|
| **Modbus** | FC03 (Read Holding Registers, qty 1 @ addr 0) via `ModbusTcpTransport`. A Modbus exception PDU still proves a real Modbus device → `Ok`. A non-MBAP reply → handshake fail. | `"Modbus FC03 OK"` | `10.100.0.35:5020` (Modbus sim) |
| **OpcUaClient** | `DiscoveryClient.GetEndpointsAsync` — no session, no app-cert, no auth. ≥ 1 endpoint → `Ok`. A non-OPC-UA TCP server throws or times out → handshake fail. | `"OPC UA: N endpoint(s)"` | `opc.tcp://10.100.0.35:50000` (opc-plc) |
| **S7** | `Plc.OpenAsync` (COTP CR/CC + S7 setup-communication), check `IsConnected`, then `Close`. Wrong rack/slot or a non-S7 server causes `OpenAsync` to throw → handshake fail. | `"S7 connected (CPU …)"` | `10.100.0.35:1102` (python-snap7 sim) |
| **AbCip** | `libplctag` Tag `InitializeAsync` (EIP session + CIP Forward Open). A CIP-level error such as tag-not-found still proves the controller answered CIP → `Ok`. A session/ForwardOpen/connect error → handshake fail. | `"CIP session OK"` | `10.100.0.35:44818` (CIP sim) |
| **AbLegacy** | Same `libplctag` `InitializeAsync` handshake as AbCip, PCCC protocol family. | `"CIP session OK"` (PCCC family) | Deferred — no PLC5/SLC sim |
| **TwinCAT** | `AdsClient.Connect` + `ReadStateAsync`. See [degrade semantics](#twincat-degrade) below. | `"ADS state: {state}"` | Deferred — no ADS target |
| **FOCAS** | `cnc_allclibhndl3` via FWLIB P/Invoke (`Wire.WireFocasClient`). See [degrade semantics](#focas-degrade) below. | `"FOCAS handle OK"` | Deferred — no CNC + FWLIB |
| **Galaxy** | gRPC unary call to `GalaxyRepository.TestConnection` on the configured mxaccessgw endpoint. See [auth-rejection rule](#galaxy-auth-rejection-rule) below. | `"gateway gRPC OK"` | `http://10.100.0.48:5120` (mxaccessgw) |
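
To make the Modbus row concrete, here is a byte-level sketch of the FC03 request and the reply check. It is written in plain Python rather than against `ModbusTcpTransport`. The default unit id of 1, the transaction id and the function names are illustrative assumptions. Both a normal FC03 reply and an exception reply (function code `0x83`) pass the check. Anything that does not parse as a matching MBAP frame fails. Register values and exception codes are ignored, because the probe only needs proof that a Modbus stack produced the reply.

```python
import struct
from typing import Optional

FC_READ_HOLDING_REGISTERS = 0x03
EXCEPTION_BIT = 0x80  # Modbus sets the high bit of the function code in an exception reply


def build_fc03_request(transaction_id: int, unit_id: int = 1,
                       address: int = 0, quantity: int = 1) -> bytes:
    # PDU: function code, starting address, register count (big-endian).
    pdu = struct.pack(">BHH", FC_READ_HOLDING_REGISTERS, address, quantity)
    # MBAP header: transaction id, protocol id (0 = Modbus),
    # byte count of unit id + PDU, unit id.
    mbap = struct.pack(">HHHB", transaction_id, 0, len(pdu) + 1, unit_id)
    return mbap + pdu


def check_fc03_reply(reply: bytes, transaction_id: int) -> Optional[str]:
    """Return None if the reply proves a Modbus device, else a failure detail."""
    if len(reply) < 9:  # 7-byte MBAP header + the shortest (exception) PDU
        return "reply too short for an MBAP frame"
    tid, protocol_id, length, _unit = struct.unpack(">HHHB", reply[:7])
    if tid != transaction_id or protocol_id != 0 or length != len(reply) - 6:
        return "non-MBAP reply"
    function_code = reply[7]
    if function_code == FC_READ_HOLDING_REGISTERS:
        return None  # normal register data
    if function_code == FC_READ_HOLDING_REGISTERS | EXCEPTION_BIT:
        return None  # exception PDU: still a real Modbus device answering
    return f"unexpected function code 0x{function_code:02X}"
```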
**Historian.Wonderware** already performed a real handshake (`Hello` → `HelloAck`)
before Phase 5 and was not changed by this work. See
[`Historian.Wonderware.md`](Historian.Wonderware.md) for details.
---
## Degrade semantics
Three drivers need special handling. TwinCAT and FOCAS have environmental
constraints that can prevent the handshake from running on certain hosts, and
Galaxy applies a dedicated rule to auth rejections. The **degradation principle**
is that the probe must never produce a result *worse* than the pre-Phase 5
TCP-only probe. A genuine protocol rejection from a reachable device is a correct
RED. An inability to *run* the handshake at all (no FWLIB, no managed AMS router)
degrades to the existing TCP-reachability message, which is still a green tick
but annotated.
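
The principle amounts to a three-way split on how the handshake ends. The minimal Python sketch below makes several assumptions. Two hypothetical exception types, `HandshakeUnavailable` ("cannot run here") and `HandshakeRejected` ("device said no"), stand in for the driver-specific errors. In the C# probes, the unavailable case is, for example, FOCAS catching `DllNotFoundException` / `NotSupportedException`. The TCP connect is injected as a callable so the sketch runs without a network. The exception class name stands in for `SocketErrorCode`.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class ProbeOutcome:
    ok: bool
    message: str


class HandshakeUnavailable(Exception):
    """Hypothetical: this host cannot run the handshake at all (e.g. FWLIB absent)."""


class HandshakeRejected(Exception):
    """Hypothetical: a reachable device actively rejected the handshake."""


def probe_with_degrade(
    host: str,
    port: int,
    proto: str,
    tcp_connect: Callable[[], None],
    handshake: Callable[[], str],
    unavailable_note: str,
) -> ProbeOutcome:
    try:
        tcp_connect()
    except OSError as exc:
        # Nothing listening: RED, exactly as the old TCP-only probe reported.
        return ProbeOutcome(False, f"Connect failed: {type(exc).__name__}")
    try:
        # Handshake succeeded: return the driver-specific OK message.
        return ProbeOutcome(True, handshake())
    except HandshakeRejected as exc:
        # A genuine protocol rejection from a reachable device is a correct RED.
        return ProbeOutcome(
            False, f"Reachable at {host}:{port} but {proto} handshake failed: {exc}"
        )
    except HandshakeUnavailable:
        # Cannot run the handshake here: never worse than TCP-only, so green but annotated.
        return ProbeOutcome(True, unavailable_note)
```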
### TwinCAT degrade
Where the handshake is available:
- `AdsClient.Connect(netId, port)` + `ReadStateAsync` → `Ok=true`,
`"ADS state: {state}"` (Run / Config / Stop).
- An ADS **route-table rejection** from a reachable ADS router is a **true RED**:
`"Reachable at {host}:{port} but ADS handshake failed: {detail} — check the
target's ADS route table authorizes this host"`. This is the correct result:
the driver would also be unable to function without an authorized route.
Where the handshake is unavailable (headless server, no TwinCAT runtime, the
managed AMS router cannot start):
- Probe degrades to TCP-reachability: `Ok=true`,
`"(ADS handshake unavailable on this host — TCP reachability only)"`.
### FOCAS degrade
On a Windows host with the FANUC FWLIB shared library present:
- `cnc_allclibhndl3` is called via the existing `Wire.WireFocasClient` P/Invoke.
A successful handle allocation → `Ok=true`, `"FOCAS handle OK"`.
- A CNC-level rejection → handshake fail.
On dev hosts, Linux, or macOS (no native FWLIB; `UnimplementedFocasClientFactory`
gates the driver):
- `DllNotFoundException` / `NotSupportedException` is caught and the probe
degrades to TCP-reachability: `Ok=true`,
`"(FOCAS handshake unavailable on this host — FWLIB absent, TCP reachability only)"`.
### Galaxy auth-rejection rule
The probe builds the gRPC channel from the form's config and issues one
lightweight unary call. It does **not** resolve `secretref:` secrets — the
key string in the transient config (possibly empty or unresolved) is used as-is.
- `Unavailable` / transport failure → `Ok=false` (gateway is down or unreachable).
- `Unauthenticated` / `PermissionDenied` → **`Ok=true`**,
`"gateway reachable & speaking gRPC; auth not checked"`. An auth rejection
proves a live mxaccessgw gRPC server, which is the correct result: the driver's
own session layer handles auth, and the probe tests reachability only.
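
The rule reduces to a small mapping from the gRPC status of the single call to a probe result. The Python sketch below uses a local enum that mirrors only the canonical gRPC codes it needs, with the numeric values defined by gRPC, rather than importing a gRPC package. It assumes transport failures have already been surfaced as `UNAVAILABLE`, as gRPC clients typically do. The function name and the failure message text are hypothetical. The two OK messages are quoted from this page.

```python
from enum import IntEnum
from typing import Tuple


class StatusCode(IntEnum):
    # Subset of the canonical gRPC status codes (numeric values as defined by gRPC).
    OK = 0
    PERMISSION_DENIED = 7
    UNAVAILABLE = 14
    UNAUTHENTICATED = 16


# OK messages quoted from the handshake table and the auth-rejection rule.
GATEWAY_OK = "gateway gRPC OK"
AUTH_NOT_CHECKED = "gateway reachable & speaking gRPC; auth not checked"


def classify_galaxy_status(code: StatusCode) -> Tuple[bool, str]:
    """Map the status of the single TestConnection call to (ok, message)."""
    if code == StatusCode.OK:
        return True, GATEWAY_OK
    if code in (StatusCode.UNAUTHENTICATED, StatusCode.PERMISSION_DENIED):
        # Only a live gRPC server can reject credentials, so reachability is proven.
        # The key may be an unresolved secretref: string, so this reply is expected.
        return True, AUTH_NOT_CHECKED
    # UNAVAILABLE means the gateway is down or unreachable. Treating any other
    # code as a failure is an assumption of this sketch; the rule does not cover them.
    return False, f"gRPC call failed: {code.name}"
```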
---
## Live-verify scope
| Driver | Live-verify status | Notes |
|--------|-------------------|-------|
| Modbus | Verified | Dev-rig sim `10.100.0.35:5020`; green vs sim, RED vs wrong port / non-Modbus server, timeout vs black-hole IP |
| OpcUaClient | Verified | opc-plc `10.100.0.35:50000`; same three-scenario matrix |
| S7 | Verified | python-snap7 `10.100.0.35:1102` |
| AbCip | Verified | CIP sim `10.100.0.35:44818` |
| Galaxy | Verified | mxaccessgw `10.100.0.48:5120`; `Unauthenticated` reply counts as Ok |
| AbLegacy | Deferred | No PLC5/SLC sim; unit-proven + code path identical to AbCip |
| TwinCAT | Deferred | No ADS target; unit-proven + degrade guard tested |
| FOCAS | Deferred | No CNC + FWLIB on dev host; degrade guard is the CI-observable path |
---
## Implementation references
- Phase 5 design: `docs/plans/2026-06-16-stillpending-phase-5-probes-design.md`
- Parent roadmap: `docs/plans/2026-06-15-stillpending-backlog-design.md` §Phase 5
- AdminUI probe flow: `docs/plans/2026-05-28-adminui-driver-pages-design.md` §4
- Per-driver probe implementations: `src/Drivers/ZB.MOM.WW.OtOpcUa.Driver.<Type>/<Type>DriverProbe.cs`
- `IDriverProbe` contract: `src/Core/ZB.MOM.WW.OtOpcUa.Core.Abstractions/IDriverProbe.cs`
- Probe dispatch + timeout clamp: `src/Server/ZB.MOM.WW.OtOpcUa.Host/Actors/AdminOperationsActor.cs` (around line 284)