From 961b2b558d64302c9eeb6048833aaec6eb0f42f6 Mon Sep 17 00:00:00 2001
From: Joseph Doherty
Date: Tue, 16 Jun 2026 07:06:47 -0400
Subject: [PATCH] docs(phase5): real Test-Connect handshakes per driver + degrade semantics
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Create docs/drivers/TestConnectProbes.md: full reference for the Phase 5
protocol-handshake probes — result contract, per-driver handshake table,
TwinCAT/FOCAS/Galaxy degrade semantics, live-verify scope, and the
Historian.Wonderware already-done note. Annotate the Phase 7 step in
docs/plans/2026-05-28-adminui-driver-pages-design.md with a shipped note
pointing at the phase-5 design doc and TestConnectProbes.md.
---
 docs/drivers/TestConnectProbes.md             | 136 ++++++++++++++++++
 .../2026-05-28-adminui-driver-pages-design.md |   5 +
 2 files changed, 141 insertions(+)
 create mode 100644 docs/drivers/TestConnectProbes.md

diff --git a/docs/drivers/TestConnectProbes.md b/docs/drivers/TestConnectProbes.md
new file mode 100644
index 00000000..9d224a4b
--- /dev/null
+++ b/docs/drivers/TestConnectProbes.md
@@ -0,0 +1,136 @@
+# Test-Connect Probes — Protocol Handshakes
+
+Each driver's **Test-Connect** button in the AdminUI runs a probe against the
+form's current config (never the persisted row, never the live driver actor).
+Before Phase 5 (shipped 2026-06-16) every probe was a bare TCP `ConnectAsync`
+— a live-but-rejecting device showed a healthy green tick, and the operator
+only discovered the truth when the driver faulted at deploy. Phase 5 replaced
+each TCP-only probe with a **real protocol handshake** so a reachable-but-wrong
+or actively-rejecting endpoint now reads RED.
+
+The `IDriverProbe` / `DriverProbeResult` contract and DI registration are
+unchanged. Probes run in a transient actor with a timeout clamp of 1–60 s
+and must not mutate any state.
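As a rough illustration of the 1–60 s timeout clamp described above, the sketch below bounds a requested timeout and converts an overrun into the documented `"Probe timed out after {n}s."` result. It is written in Python for brevity rather than the project's C#. `ProbeResult`, `clamp_probe_timeout`, and `run_probe` are hypothetical names chosen for this example, not identifiers from the codebase.

```python
import asyncio
from dataclasses import dataclass
from typing import Awaitable, Callable, Optional

# Bounds taken from the doc text ("timeout clamp of 1-60 s").
MIN_TIMEOUT_S = 1.0
MAX_TIMEOUT_S = 60.0


@dataclass(frozen=True)
class ProbeResult:
    """Hypothetical stand-in for DriverProbeResult(Ok, Message, Latency)."""
    ok: bool
    message: Optional[str] = None
    latency_s: Optional[float] = None


def clamp_probe_timeout(requested_s: float) -> float:
    """Limit a requested timeout to the [1, 60] second window."""
    return max(MIN_TIMEOUT_S, min(MAX_TIMEOUT_S, requested_s))


async def run_probe(
    probe: Callable[[], Awaitable[ProbeResult]],
    requested_s: float,
) -> ProbeResult:
    """Run a probe coroutine under the clamped timeout.

    A timeout is reported as a failed result, never raised to the caller.
    """
    timeout_s = clamp_probe_timeout(requested_s)
    try:
        return await asyncio.wait_for(probe(), timeout=timeout_s)
    except asyncio.TimeoutError:
        return ProbeResult(ok=False, message=f"Probe timed out after {timeout_s:g}s.")
```

Clamping before the wait means a form value of `0` or `600` still yields a bounded probe, which is one plausible reading of why the clamp exists.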
+
+For the AdminUI probe flow (button → `AdminOperationsActor` → transient probe
+actor), see
+[`docs/plans/2026-05-28-adminui-driver-pages-design.md`](../plans/2026-05-28-adminui-driver-pages-design.md)
+§4.
+
+---
+
+## Result contract
+
+All probes return a consistent `DriverProbeResult(bool Ok, string? Message, TimeSpan? Latency)`.
+The message templates below are uniform across all 8 drivers:
+
+| Outcome | `Ok` | Message template |
+|---------|------|-----------------|
+| TCP connect fails | `false` | `"Connect failed: {SocketErrorCode}"` |
+| TCP ok + handshake ok | `true` | driver-specific descriptive string (see table below) |
+| TCP ok but handshake rejected | `false` | `"Reachable at {host}:{port} but {proto} handshake failed: {detail}"` |
+| Timeout | `false` | `"Probe timed out after {n}s."` |
+
+The third row is the key new behavior: a reachable device that answers on the
+port but rejects the protocol-level handshake now surfaces a `false` result
+with a human-readable explanation rather than a false-green TCP-open tick.
+
+---
+
+## Per-driver handshake
+
+| Driver | Handshake | Ok message | Dev-rig target |
+|--------|-----------|------------|----------------|
+| **Modbus** | FC03 (Read Holding Registers, qty 1 @ addr 0) via `ModbusTcpTransport`. A Modbus exception PDU still proves a real Modbus device → `Ok`. A non-MBAP reply → handshake fail. | `"Modbus FC03 OK"` | `10.100.0.35:5020` (Modbus sim) |
+| **OpcUaClient** | `DiscoveryClient.GetEndpointsAsync` — no session, no app-cert, no auth. ≥ 1 endpoint → `Ok`. A non-OPC-UA TCP server throws or times out → handshake fail. | `"OPC UA: N endpoint(s)"` | `opc.tcp://10.100.0.35:50000` (opc-plc) |
+| **S7** | `Plc.OpenAsync` (COTP CR/CC + S7 setup-communication), check `IsConnected`, then `Close`. Wrong rack/slot or a non-S7 server causes `OpenAsync` to throw → handshake fail. | `"S7 connected (CPU …)"` | `10.100.0.35:1102` (python-snap7 sim) |
+| **AbCip** | `libplctag` Tag `InitializeAsync` (EIP session + CIP Forward Open). A CIP-level error such as tag-not-found still proves the controller answered CIP → `Ok`. A session/ForwardOpen/connect error → handshake fail. | `"CIP session OK"` | `10.100.0.35:44818` (CIP sim) |
+| **AbLegacy** | Same `libplctag` `InitializeAsync` handshake as AbCip, PCCC protocol family. | `"CIP session OK"` (PCCC family) | Deferred — no PLC5/SLC sim |
+| **TwinCAT** | `AdsClient.Connect` + `ReadStateAsync`. See [degrade semantics](#twincat-degrade) below. | `"ADS state: {state}"` | Deferred — no ADS target |
+| **FOCAS** | `cnc_allclibhndl3` via FWLIB P/Invoke (`Wire.WireFocasClient`). See [degrade semantics](#focas-degrade) below. | `"FOCAS handle OK"` | Deferred — no CNC + FWLIB |
+| **Galaxy** | gRPC unary call to `GalaxyRepository.TestConnection` on the configured mxaccessgw endpoint. See [auth-rejection rule](#galaxy-auth-rejection) below. | `"gateway gRPC OK"` | `http://10.100.0.48:5120` (mxaccessgw) |
+
+**Historian.Wonderware** already performed a real handshake (`Hello` → `HelloAck`)
+before Phase 5 and was not changed by this work. See
+[`Historian.Wonderware.md`](Historian.Wonderware.md) for details.
+
+---
+
+## Degrade semantics
+
+Three drivers have environmental constraints that can prevent the handshake
+from running on certain hosts. The **degradation principle** is: the probe
+must never produce a result *worse* than today's TCP-only probe. A genuine
+protocol rejection from a reachable device is a correct RED; an inability to
+*run* the handshake at all (no FWLIB, no managed router) degrades to the
+existing TCP-reachability message — still a green tick but annotated.
+
+### TwinCAT degrade
+
+Where the handshake is available:
+
+- `AdsClient.Connect(netId, port)` + `ReadStateAsync` → `Ok=true`,
+  `"ADS state: {state}"` (Run / Config / Stop).
+- An ADS **route-table rejection** from a reachable ADS router is a **true RED**:
+  `"Reachable at {host}:{port} but ADS handshake failed: {detail} — check the
+  target's ADS route table authorizes this host"`. This is the correct result:
+  the driver would also be unable to function without an authorized route.
+
+Where the handshake is unavailable (headless server, no TwinCAT runtime, the
+managed AMS router cannot start):
+
+- Probe degrades to TCP-reachability: `Ok=true`,
+  `"(ADS handshake unavailable on this host — TCP reachability only)"`.
+
+### FOCAS degrade
+
+On a Windows host with the FANUC FWLIB shared library present:
+
+- `cnc_allclibhndl3` is called via the existing `Wire.WireFocasClient` P/Invoke.
+  A successful handle allocation → `Ok=true`, `"FOCAS handle OK"`.
+- A CNC-level rejection → handshake fail.
+
+On dev, Linux, or macOS (no native FWLIB — `UnimplementedFocasClientFactory`
+gates the driver):
+
+- `DllNotFoundException` / `NotSupportedException` is caught and the probe
+  degrades to TCP-reachability: `Ok=true`,
+  `"(FOCAS handshake unavailable on this host — FWLIB absent, TCP reachability only)"`.
+
+### Galaxy auth-rejection rule
+
+The probe builds the gRPC channel from the form's config and issues one
+lightweight unary call. It does **not** resolve `secretref:` secrets — the
+key string in the transient config (possibly empty or unresolved) is used as-is.
+
+- `Unavailable` / transport failure → `Ok=false` (gateway is down or unreachable).
+- `Unauthenticated` / `PermissionDenied` → **`Ok=true`**,
+  `"gateway reachable & speaking gRPC; auth not checked"` — an auth rejection
+  proves a live mxaccessgw gRPC server. This is the correct result: the driver's
+  own session-layer will handle auth; the probe is testing reachability only.
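The Galaxy rule above amounts to a small mapping from gRPC status to probe outcome. The following is a minimal sketch of that mapping in Python, not the project's C# implementation. `GrpcStatus`, `ProbeResult`, and `classify_gateway_status` are hypothetical names. The numeric values are the standard gRPC status codes. The failure message for `UNAVAILABLE` and the handling of codes the text does not mention are assumptions made for illustration.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class GrpcStatus(Enum):
    """Subset of the standard gRPC status codes used by this sketch."""
    OK = 0
    PERMISSION_DENIED = 7
    INTERNAL = 13
    UNAVAILABLE = 14
    UNAUTHENTICATED = 16


@dataclass(frozen=True)
class ProbeResult:
    """Hypothetical stand-in for DriverProbeResult (Ok, Message)."""
    ok: bool
    message: Optional[str] = None


def classify_gateway_status(status: GrpcStatus, endpoint: str) -> ProbeResult:
    """Map the status of the single unary call to a probe result."""
    if status is GrpcStatus.OK:
        return ProbeResult(True, "gateway gRPC OK")
    if status in (GrpcStatus.UNAUTHENTICATED, GrpcStatus.PERMISSION_DENIED):
        # An auth rejection still proves a live gRPC server answered.
        return ProbeResult(True, "gateway reachable & speaking gRPC; auth not checked")
    if status is GrpcStatus.UNAVAILABLE:
        # Assumed wording: the doc only states Ok=false for this case.
        return ProbeResult(False, f"Gateway at {endpoint} is unavailable")
    # Assumption: any other status is treated as a rejected handshake,
    # reusing the doc's "Reachable at ... but ... handshake failed" shape.
    return ProbeResult(
        False,
        f"Reachable at {endpoint} but gRPC handshake failed: {status.name}",
    )
```

Keeping the classification a pure function of the status code makes the `Ok=true` on auth rejection easy to unit-test without a live gateway.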
+
+---
+
+## Live-verify scope
+
+| Driver | Live-verify status | Notes |
+|--------|-------------------|-------|
+| Modbus | Verified | Dev-rig sim `10.100.0.35:5020`; green vs sim, RED vs wrong port / non-Modbus server, timeout vs black-hole IP |
+| OpcUaClient | Verified | opc-plc `10.100.0.35:50000`; same three-scenario matrix |
+| S7 | Verified | python-snap7 `10.100.0.35:1102` |
+| AbCip | Verified | CIP sim `10.100.0.35:44818` |
+| Galaxy | Verified | mxaccessgw `10.100.0.48:5120`; `Unauthenticated` reply counts as Ok |
+| AbLegacy | Deferred | No PLC5/SLC sim; unit-proven + code path identical to AbCip |
+| TwinCAT | Deferred | No ADS target; unit-proven + degrade guard tested |
+| FOCAS | Deferred | No CNC + FWLIB on dev host; degrade guard is the CI-observable path |
+
+---
+
+## Implementation references
+
+- Phase 5 design: `docs/plans/2026-06-16-stillpending-phase-5-probes-design.md`
+- Parent roadmap: `docs/plans/2026-06-15-stillpending-backlog-design.md` §Phase 5
+- AdminUI probe flow: `docs/plans/2026-05-28-adminui-driver-pages-design.md` §4
+- Per-driver probe implementations: `src/Drivers/ZB.MOM.WW.OtOpcUa.Driver./DriverProbe.cs`
+- `IDriverProbe` contract: `src/Core/ZB.MOM.WW.OtOpcUa.Core.Abstractions/IDriverProbe.cs`
+- Probe dispatch + timeout clamp: `src/Server/ZB.MOM.WW.OtOpcUa.Host/Actors/AdminOperationsActor.cs` (around line 284)
diff --git a/docs/plans/2026-05-28-adminui-driver-pages-design.md b/docs/plans/2026-05-28-adminui-driver-pages-design.md
index e7302938..455f474e 100644
--- a/docs/plans/2026-05-28-adminui-driver-pages-design.md
+++ b/docs/plans/2026-05-28-adminui-driver-pages-design.md
@@ -296,6 +296,11 @@ Incremental — driver-by-driver swap-over. Each step compile-clean and shippabl
 5. Delete the generic `DriverEdit.razor` + its route once all 9 typed pages exist.
 6. Land `DriverStatusHub` + bridge + `` (read-only first).
 7. Land `` + `IDriverProbe` impls + AdminOperationsActor handler.
+   > **Shipped (TCP-only, 2026-05-28).** Real protocol handshakes (Modbus FC03, OPC UA GetEndpoints,
+   > S7 COTP+setup-communication, CIP ForwardOpen, Galaxy gRPC ping) shipped 2026-06-16 via Phase 5
+   > of the still-pending backlog (`docs/plans/2026-06-16-stillpending-phase-5-probes-design.md`).
+   > TwinCAT ADS and FOCAS FWLIB handshakes degrade to TCP-reachability-only on hosts where the native
+   > runtime is absent. See `docs/drivers/TestConnectProbes.md` for the full contract + degrade semantics.
 8. Land Reconnect/Restart on the status panel with `DriverOperator` policy.
 9. Land 9 static address builders inside ``.