# Test-Connect Probes: Protocol Handshakes

Each driver's **Test-Connect** button in the AdminUI runs a probe against the
form's current config (never the persisted row, never the live driver actor).
Before Phase 5 (shipped 2026-06-16) every probe was a bare TCP `ConnectAsync`,
so a live-but-rejecting device showed a healthy green tick and the operator
only discovered the truth when the driver faulted at deploy. Phase 5 replaced
each TCP-only probe with a **real protocol handshake** so a reachable-but-wrong
or actively-rejecting endpoint now reads RED.

The `IDriverProbe` / `DriverProbeResult` contract and DI registration are
unchanged. Probes run in a transient actor with a timeout clamp of 1–60 s
and must not mutate any state.
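
The 1–60 s clamp mentioned above could look roughly like the following. This is a minimal sketch, assuming the requested timeout arrives as a `TimeSpan`; the `ProbeTimeout` helper is a hypothetical name for illustration and is not taken from `AdminOperationsActor`.

```csharp
using System;

// Hypothetical helper for illustration; the real clamp lives in the probe dispatch code.
public static class ProbeTimeout
{
    public static readonly TimeSpan Min = TimeSpan.FromSeconds(1);
    public static readonly TimeSpan Max = TimeSpan.FromSeconds(60);

    // Pins the requested timeout into the 1-60 s window.
    public static TimeSpan Clamp(TimeSpan requested) =>
        requested < Min ? Min : requested > Max ? Max : requested;
}
```

A request of 0 s is raised to 1 s and a request of 5 minutes is lowered to 60 s; anything already inside the window passes through unchanged.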

For the AdminUI probe flow (button → `AdminOperationsActor` → transient probe
actor), see
[`docs/plans/2026-05-28-adminui-driver-pages-design.md`](../plans/2026-05-28-adminui-driver-pages-design.md)
§4.

---

## Result contract

Every probe returns a `DriverProbeResult(bool Ok, string? Message, TimeSpan? Latency)`.
The message templates below are uniform across all 8 drivers:

| Outcome | `Ok` | Message template |
|---------|------|-----------------|
| TCP connect fails | `false` | `"Connect failed: {SocketErrorCode}"` |
| TCP ok + handshake ok | `true` | driver-specific descriptive string (see the per-driver table below) |
| TCP ok but handshake rejected | `false` | `"Reachable at {host}:{port} but {proto} handshake failed: {detail}"` |
| Timeout | `false` | `"Probe timed out after {n}s."` |

The third row is the key new behavior: a reachable device that answers on the
port but rejects the protocol-level handshake now surfaces a `false` result
with a human-readable explanation rather than a false-green TCP-open tick.
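
The four outcomes in the table can be sketched as small factory helpers. The record shape is the one quoted above; the `ProbeResults` class, its method names, and the choice to leave `Latency` null on failures are assumptions made for illustration only.

```csharp
#nullable enable
using System;
using System.Net.Sockets;

// Record shape as quoted in the result contract.
public sealed record DriverProbeResult(bool Ok, string? Message, TimeSpan? Latency);

// Hypothetical helpers, one per row of the outcome table.
public static class ProbeResults
{
    public static DriverProbeResult ConnectFailed(SocketError code) =>
        new(false, $"Connect failed: {code}", null);

    public static DriverProbeResult HandshakeOk(string message, TimeSpan latency) =>
        new(true, message, latency);

    public static DriverProbeResult HandshakeFailed(string host, int port, string proto, string detail) =>
        new(false, $"Reachable at {host}:{port} but {proto} handshake failed: {detail}", null);

    // Assumes whole-second timeouts, as produced by a 1-60 s clamp.
    public static DriverProbeResult TimedOut(TimeSpan timeout) =>
        new(false, $"Probe timed out after {(int)timeout.TotalSeconds}s.", null);
}
```

Keeping the templates in one place is one way to guarantee the wording stays uniform across drivers; whether the real probes share such a helper is not stated here.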

---

## Per-driver handshake

| Driver | Handshake | Ok message | Dev-rig target |
|--------|-----------|------------|----------------|
| **Modbus** | FC03 (Read Holding Registers, qty 1 @ addr 0) via `ModbusTcpTransport`. A Modbus exception PDU still proves a real Modbus device → `Ok`. A non-MBAP reply → handshake fail. | `"Modbus FC03 OK"` | `10.100.0.35:5020` (Modbus sim) |
| **OpcUaClient** | `DiscoveryClient.GetEndpointsAsync` with no session, no app-cert, no auth. ≥ 1 endpoint → `Ok`. A non-OPC-UA TCP server throws or times out → handshake fail. | `"OPC UA: N endpoint(s)"` | `opc.tcp://10.100.0.35:50000` (opc-plc) |
| **S7** | `Plc.OpenAsync` (COTP CR/CC + S7 setup-communication), check `IsConnected`, then `Close`. Wrong rack/slot or a non-S7 server causes `OpenAsync` to throw → handshake fail. | `"S7 connected (CPU …)"` | `10.100.0.35:1102` (python-snap7 sim) |
| **AbCip** | `libplctag` Tag `InitializeAsync` (EIP session + CIP Forward Open). A CIP-level error such as tag-not-found still proves the controller answered CIP → `Ok`. A session/ForwardOpen/connect error → handshake fail. | `"CIP session OK"` | `10.100.0.35:44818` (CIP sim) |
| **AbLegacy** | Same `libplctag` `InitializeAsync` handshake as AbCip, PCCC protocol family. | `"CIP session OK"` (PCCC family) | Deferred (no PLC5/SLC sim) |
| **TwinCAT** | `AdsClient.Connect` + `ReadStateAsync`. See [degrade semantics](#twincat-degrade) below. | `"ADS state: {state}"` | Deferred (no ADS target) |
| **FOCAS** | `cnc_allclibhndl3` via a direct `DllImport("fwlib32")` in the probe. See [degrade semantics](#focas-degrade) below. | `"FOCAS handle OK"` | Deferred (no CNC + FWLIB) |
| **Galaxy** | gRPC unary call to `GalaxyRepository.TestConnection` on the configured mxaccessgw endpoint. See [auth-rejection rule](#galaxy-auth-rejection-rule) below. | `"gateway gRPC OK"` | `http://10.100.0.48:5120` (mxaccessgw) |
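
To make the Modbus row concrete, here is a minimal sketch of the FC03 request frame and the three-way reply classification it describes. The frame layout follows the standard Modbus TCP framing (7-byte MBAP header plus PDU); the `ModbusFc03` class and `Fc03Reply` enum are hypothetical names, and the real probe goes through `ModbusTcpTransport` rather than hand-built bytes.

```csharp
using System;
using System.Buffers.Binary;

public enum Fc03Reply { Ok, ModbusException, NotModbus }

// Hypothetical helper for illustration only.
public static class ModbusFc03
{
    // Builds the 12-byte ADU: MBAP header (7 bytes) + FC03 PDU (5 bytes).
    public static byte[] BuildRequest(ushort transactionId, byte unitId)
    {
        var adu = new byte[12];
        BinaryPrimitives.WriteUInt16BigEndian(adu.AsSpan(0), transactionId);
        BinaryPrimitives.WriteUInt16BigEndian(adu.AsSpan(2), 0);  // protocol id is always 0 for Modbus
        BinaryPrimitives.WriteUInt16BigEndian(adu.AsSpan(4), 6);  // bytes that follow: unit id + PDU
        adu[6] = unitId;
        adu[7] = 0x03;                                            // Read Holding Registers
        BinaryPrimitives.WriteUInt16BigEndian(adu.AsSpan(8), 0);  // start address 0
        BinaryPrimitives.WriteUInt16BigEndian(adu.AsSpan(10), 1); // quantity 1
        return adu;
    }

    public static Fc03Reply Classify(ReadOnlySpan<byte> reply, ushort transactionId)
    {
        // Shortest valid reply is an exception response: MBAP (7) + function (1) + code (1).
        if (reply.Length < 9) return Fc03Reply.NotModbus;
        if (BinaryPrimitives.ReadUInt16BigEndian(reply) != transactionId) return Fc03Reply.NotModbus;
        if (BinaryPrimitives.ReadUInt16BigEndian(reply.Slice(2)) != 0) return Fc03Reply.NotModbus;
        return reply[7] switch
        {
            0x03 => Fc03Reply.Ok,              // normal response
            0x83 => Fc03Reply.ModbusException, // FC03 with the error bit set
            _ => Fc03Reply.NotModbus,
        };
    }
}
```

With transaction id 1 and unit id 1 the request bytes are `00 01 00 00 00 06 01 03 00 00 00 01`. Under the rule in the table, both `Ok` and `ModbusException` map to a green probe result, because either one proves a real Modbus device answered; only `NotModbus` is a handshake failure.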

**Historian.Wonderware** already performed a real handshake (`Hello` → `HelloAck`)
before Phase 5 and was not changed by this work. See
[`Historian.Wonderware.md`](Historian.Wonderware.md) for details.

---

## Degrade semantics

Three drivers have environmental constraints that can prevent the handshake
from running on certain hosts. The **degradation principle** is: the probe
must never produce a result *worse* than the pre-Phase 5 TCP-only probe. A
genuine protocol rejection from a reachable device is a correct RED; an
inability to *run* the handshake at all (no FWLIB, no managed router) degrades
to the existing TCP-reachability message, which is still a green tick but
annotated.

### TwinCAT degrade

Where the handshake is available:

- `AdsClient.Connect(netId, port)` + `ReadStateAsync` → `Ok=true`,
  `"ADS state: {state}"` (Run / Config / Stop).
- An ADS **route-table rejection** from a reachable ADS router is a **true RED**:
  `"Reachable at {host}:{port} but ADS handshake failed: {detail} — check the
  target's ADS route table authorizes this host"`. This is the correct result:
  the driver would also be unable to function without an authorized route.

Where the handshake is unavailable (headless server, no TwinCAT runtime, the
managed AMS router cannot start):

- Probe degrades to TCP-reachability: `Ok=true`,
  `"(ADS handshake unavailable on this host — TCP reachability only)"`.

### FOCAS degrade

On a Windows host with the FANUC FWLIB shared library present:

- `cnc_allclibhndl3` is called via a direct `DllImport("fwlib32")` declared in
  the probe (the production `Wire.WireFocasClient` is a pure-managed FOCAS/2 TCP
  client, not an FWLIB P/Invoke, so the probe carries its own native binding).
  A successful handle allocation → `Ok=true`, `"FOCAS handle OK"`.
- A CNC-level rejection → handshake fail.

On dev, Linux, or macOS (no native FWLIB; `UnimplementedFocasClientFactory`
gates the driver):

- `DllNotFoundException` / `NotSupportedException` is caught and the probe
  degrades to TCP-reachability: `Ok=true`,
  `"(FOCAS handshake unavailable on this host — FWLIB absent, TCP reachability only)"`.
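
The two FOCAS paths above can be sketched as a single try/catch around the native call. This is an illustration under stated assumptions, not the shipped probe: the `FocasProbeSketch` class and the `EW code` detail text are hypothetical, the P/Invoke signatures follow the FANUC C header (IP, port, timeout in seconds, out handle) and should be checked against the FWLIB version in use, and the sketch assumes the TCP preflight has already succeeded.

```csharp
using System;
using System.Runtime.InteropServices;

// Hypothetical sketch of the degrade guard.
public static class FocasProbeSketch
{
    [DllImport("fwlib32", CharSet = CharSet.Ansi)]
    private static extern short cnc_allclibhndl3(string ip, ushort port, int timeoutSeconds, out ushort handle);

    [DllImport("fwlib32")]
    private static extern short cnc_freelibhndl(ushort handle);

    public static (bool Ok, string Message) Probe(string host, ushort port, int timeoutSeconds)
    {
        try
        {
            short rc = cnc_allclibhndl3(host, port, timeoutSeconds, out ushort handle);
            if (rc != 0) // 0 is EW_OK; anything else is a CNC-level rejection
                return (false, $"Reachable at {host}:{port} but FOCAS handshake failed: EW code {rc}");

            cnc_freelibhndl(handle); // release the handle; the probe must not hold state
            return (true, "FOCAS handle OK");
        }
        catch (Exception ex) when (ex is DllNotFoundException or NotSupportedException)
        {
            // The handshake cannot run on this host, so fall back to the TCP-only verdict.
            return (true, "(FOCAS handshake unavailable on this host — FWLIB absent, TCP reachability only)");
        }
    }
}
```

On a host without `fwlib32` the runtime throws `DllNotFoundException` at the first native call, so the method returns the annotated green result, which is the path CI can observe.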

### Galaxy auth-rejection rule

The probe builds the gRPC channel from the form's config and issues one
lightweight unary call. It does **not** resolve `secretref:` secrets; the
key string in the transient config (possibly empty or unresolved) is used as-is.

- `Unavailable` / transport failure → `Ok=false` (gateway is down or unreachable).
- `Unauthenticated` / `PermissionDenied` → **`Ok=true`**,
  `"gateway reachable & speaking gRPC (auth not checked)"`. An auth rejection
  proves a live mxaccessgw gRPC server. This is the correct result: the driver's
  own session layer handles auth; the probe tests reachability only.

The mxaccessgw client surfaces a rejected key as a typed
`MxGatewayAuthenticationException` / `MxGatewayAuthorizationException`, **not** a
raw `RpcException`. The probe catches both typed exceptions and maps them to the
reachable result above. (Live verification on `10.100.0.48:5120` with no key returns
`MxGatewayAuthenticationException("Missing or invalid API key.")` → `Ok=true`.)
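
The mapping above can be sketched without a gRPC dependency as follows. The two exception classes here are local stand-ins for the real mxaccessgw client types, the `call` delegate is a hypothetical wrapper around the single unary `TestConnection` call, and the failure wording (with `gRPC` as the protocol name) is an assumption based on the shared template. A real implementation would presumably also inspect `RpcException.StatusCode` for `Unavailable`; that part is omitted to keep the sketch self-contained.

```csharp
using System;

// Local stand-ins for the client's typed exceptions, for illustration only.
public sealed class MxGatewayAuthenticationException : Exception
{
    public MxGatewayAuthenticationException(string message) : base(message) { }
}

public sealed class MxGatewayAuthorizationException : Exception
{
    public MxGatewayAuthorizationException(string message) : base(message) { }
}

public static class GalaxyProbeSketch
{
    public static (bool Ok, string Message) Probe(Action call, string host, int port)
    {
        try
        {
            call();
            return (true, "gateway gRPC OK");
        }
        catch (Exception ex) when (ex is MxGatewayAuthenticationException or MxGatewayAuthorizationException)
        {
            // A rejected key still proves a live gateway answered over gRPC.
            return (true, "gateway reachable & speaking gRPC (auth not checked)");
        }
        catch (Exception ex)
        {
            // Everything else (transport failure, gateway down) is a RED result.
            return (false, $"Reachable at {host}:{port} but gRPC handshake failed: {ex.Message}");
        }
    }
}
```

Ordering matters in this shape: the typed auth filter has to come before the general catch, otherwise a rejected key would be reported as a failure.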

> **Config note:** `UseTls` must match the endpoint scheme: `UseTls:false` for an
> `http://` (h2c) gateway, `UseTls:true` for `https://`. A mismatch fails the
> client's own validation (the same constraint the Galaxy driver enforces).
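
The constraint in the config note amounts to a scheme check like the one below. This is a minimal sketch; `GatewayEndpointCheck` is a hypothetical name, and the client's actual validation may differ in detail, for example in how it reports the mismatch.

```csharp
using System;

// Hypothetical check for illustration only.
public static class GatewayEndpointCheck
{
    // Returns true when the UseTls flag agrees with the endpoint scheme.
    public static bool UseTlsMatchesScheme(string endpoint, bool useTls)
    {
        var uri = new Uri(endpoint, UriKind.Absolute);
        if (uri.Scheme == Uri.UriSchemeHttps) return useTls;
        if (uri.Scheme == Uri.UriSchemeHttp) return !useTls;
        return false; // any other scheme is treated as a mismatch in this sketch
    }
}
```

For the dev-rig gateway, `UseTlsMatchesScheme("http://10.100.0.48:5120", false)` is true, while the same endpoint with `useTls: true` is rejected.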

---

## Live-verify scope

| Driver | Live-verify status | Notes |
|--------|-------------------|-------|
| Modbus | Verified | Dev-rig sim `10.100.0.35:5020`; green vs sim, RED vs wrong port / non-Modbus server, timeout vs black-hole IP |
| OpcUaClient | Verified | opc-plc `10.100.0.35:50000`; same three-scenario matrix |
| S7 | Verified | python-snap7 `10.100.0.35:1102` |
| AbCip | Verified | CIP sim `10.100.0.35:44818` |
| Galaxy | Verified | mxaccessgw `10.100.0.48:5120`; `Unauthenticated` reply counts as Ok |
| AbLegacy | Deferred | No PLC5/SLC sim; unit-proven + code path identical to AbCip |
| TwinCAT | Deferred | No ADS target; unit-proven + degrade guard tested |
| FOCAS | Deferred | No CNC + FWLIB on dev host; degrade guard is the CI-observable path |

---

## Implementation references

- Phase 5 design: `docs/plans/2026-06-16-stillpending-phase-5-probes-design.md`
- Parent roadmap: `docs/plans/2026-06-15-stillpending-backlog-design.md` §Phase 5
- AdminUI probe flow: `docs/plans/2026-05-28-adminui-driver-pages-design.md` §4
- Per-driver probe implementations: `src/Drivers/ZB.MOM.WW.OtOpcUa.Driver.<Type>/<Type>DriverProbe.cs`
- `IDriverProbe` contract: `src/Core/ZB.MOM.WW.OtOpcUa.Core.Abstractions/IDriverProbe.cs`
- Probe dispatch + timeout clamp: `src/Server/ZB.MOM.WW.OtOpcUa.Host/Actors/AdminOperationsActor.cs` (around line 284)