# Phase 5 — Test-Connect Protocol Handshakes — Implementation Plan
> **For Claude:** REQUIRED SUB-SKILL: Use superpowers-extended-cc:subagent-driven-development to implement this plan task-by-task.
**Goal:** Replace the 8 bare-TCP Test-Connect probes with real protocol handshakes so a live-but-rejecting device reads RED, not green — reusing each driver's own client primitive, with graceful degradation for the three (TwinCAT/FOCAS/Galaxy) that can't run a real handshake on the dev rig.
**Architecture:** Per-probe, no shared scaffold (matches the existing self-contained probe style; keeps all 8 driver projects disjoint → parallelizable). Each probe keeps its TCP preflight and adds one handshake step. New four-outcome result contract (TCP-fail / handshake-ok / TCP-ok-but-handshake-rejected / timeout), plus the degrade outcome where the handshake cannot be attempted. `IDriverProbe`/`DriverProbeResult` and DI are UNCHANGED. Design: `docs/plans/2026-06-16-stillpending-phase-5-probes-design.md`.
**Tech Stack:** C# / .NET 10, xUnit + Shouldly, in-process `TcpListener` for unit tests, skip-gated `DriverTestConnectE2eTests` for live verification. Per-driver client libs already referenced (S7netplus, libplctag, Beckhoff.TwinCAT.Ads, OPCFoundation.Opc.Ua.Client, MxGateway.Client gRPC, FOCAS wire P/Invoke).
**Consistent result-message templates (apply in EVERY probe):**
- TCP connect fails → `Ok=false`, `"Connect failed: {SocketErrorCode}"` *(keep as-is)*
- Handshake OK → `Ok=true`, `Latency`, e.g. `"Modbus FC03 OK"`, `"OPC UA: {n} endpoint(s)"`, `"S7 connected (CPU {cpu})"`, `"CIP session OK"`, `"ADS state: {state}"`, `"gateway gRPC OK"`
- TCP OK but handshake rejected → `Ok=false`, `"Reachable at {host}:{port} but {proto} handshake failed: {detail}"`
- Timeout (`OperationCanceledException`) → `Ok=false`, `"Probe timed out after {timeout.TotalSeconds:F0}s."` *(keep as-is)*
- Degrade (TwinCAT/FOCAS only, env can't run handshake) → `Ok=true`, `"Reachable at {host}:{port} ({proto} handshake unavailable on this host — TCP reachability only)"`
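
As a minimal sketch of how the two "keep as-is" templates above might fall out of the shared TCP preflight: `ProbeResultSketch` and `TcpPreflightSketch` are hypothetical stand-ins (the real `DriverProbeResult` and each probe's existing preflight are not reproduced in this plan); only the message strings come from the list above.

```csharp
using System;
using System.Net.Sockets;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical stand-in for DriverProbeResult; the real type is not shown in this plan.
public sealed record ProbeResultSketch(bool Ok, string Message, TimeSpan? Latency);

public static class TcpPreflightSketch
{
    // Returns null when the TCP connect succeeds (the caller then runs its handshake),
    // otherwise a failure result built from the message templates.
    public static async Task<ProbeResultSketch?> RunAsync(
        string host, int port, TimeSpan timeout, CancellationToken ct)
    {
        // Link the caller's token with the probe timeout so either one stops the connect.
        using var cts = CancellationTokenSource.CreateLinkedTokenSource(ct);
        cts.CancelAfter(timeout);
        try
        {
            using var tcp = new TcpClient();
            await tcp.ConnectAsync(host, port, cts.Token);
            return null;
        }
        catch (SocketException ex)
        {
            return new ProbeResultSketch(false, $"Connect failed: {ex.SocketErrorCode}", null);
        }
        catch (OperationCanceledException)
        {
            // Fires for the timeout and for caller cancellation alike.
            return new ProbeResultSketch(false, $"Probe timed out after {timeout.TotalSeconds:F0}s.", null);
        }
    }
}
```

A handshake step would run only when this returns `null`, which keeps the offline-device messages identical to the current TCP-only behavior.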
**Global rules (every task):** TDD red→green. Probes MUST honour `ct` and MUST NOT mutate state. Stage by path — never `git add .`; never stage `sql_login.txt`, `src/Server/.../Host/pki/`, `pending.md`, `current.md`, `docker-dev/docker-compose.yml`, `stillpending.md`. Never echo/commit the gateway API key. No `--no-verify`, no force-push. No `IDriverProbe`/`DriverProbeResult`/DI change. No bUnit.
---
### Task 0: Feature branch *(done)*
Branch `feat/stillpending-phase-5-probes` off master `050f5c4b` already created; design committed `1f2d32ac`. No action.
---
### Task 1: Modbus handshake — FC03 read
**Classification:** small
**Estimated implement time:** ~4 min
**Parallelizable with:** Tasks 2, 3, 4, 5, 6, 7, 8
**Files:**
- Modify: `src/Drivers/ZB.MOM.WW.OtOpcUa.Driver.Modbus/ModbusDriverProbe.cs`
- Test: `tests/Drivers/ZB.MOM.WW.OtOpcUa.Driver.Modbus.Tests/ModbusDriverProbeTests.cs` (create)
**Approach:** After the existing TCP preflight succeeds, do the real handshake with the in-project transport:
```csharp
// proto label for messages
const string Proto = "Modbus";
// ... existing deserialize + ExtractTarget + TCP preflight unchanged ...
// On TCP success, run a one-shot FC03 (Read Holding Registers, qty 1 @ addr 0):
await using var transport = new ModbusTcpTransport(host, port, /* keep-alive */ default /* , timeouts from opts/defaults */);
await transport.ConnectAsync(ct);
var pdu = new byte[] { 0x03, 0x00, 0x00, 0x00, 0x01 }; // FC03, addr 0, qty 1
try
{
    _ = await transport.SendAsync(opts.UnitId /* or default 1 */, pdu, ct);
    return new(true, "Modbus FC03 OK", sw.Elapsed);
}
catch (ModbusException) // exception PDU (e.g. illegal data address) STILL proves a real Modbus device
{
    return new(true, "Modbus FC03 OK (device returned exception PDU)", sw.Elapsed);
}
```
- Inspect `ModbusTcpTransport`'s real ctor signature (`ModbusTcpTransport.cs:27-66`) and `ModbusDriverOptions` for the unit-id field; mirror how `ModbusDriver` constructs the transport. Keep the `SocketException`/`OperationCanceledException`/`Exception` catches; a non-`ModbusException` failure after TCP success → `Ok=false`, `"Reachable at {host}:{port} but Modbus FC03 handshake failed: {ex.Message}"`.
- Update the class XML-doc: it now performs a real FC03 handshake (drop the "Does NOT exchange any protocol bytes" sentence).
**Steps:** (1) Write failing tests. (2) Run → fail. (3) Implement handshake. (4) Run → pass. (5) `dotnet build` the Modbus project clean. (6) Commit.
**Tests (`ModbusDriverProbeTests`, in-process `TcpListener`):**
- `ProbeAsync_invalid_json → Ok=false` ("invalid").
- `ProbeAsync_no_host → Ok=false` ("no host/port").
- `ProbeAsync_unreachable_port → Ok=false` (Connect failed) — target a closed loopback port.
- `ProbeAsync_tcp_accepts_then_closes → Ok=false` with "handshake failed" — a `TcpListener` that accepts and immediately closes (no MBAP reply).
- `ProbeAsync_canned_MBAP_response → Ok=true` "Modbus FC03 OK" — a `TcpListener` that reads the request frame and writes a valid MBAP FC03 response echoing the TxId.
- (optional) `ProbeAsync_exception_PDU → Ok=true` — listener replies 0x83 + exception code.
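
The canned replies used by the last two tests could be built roughly as follows. This is a sketch under the standard Modbus TCP framing (7-byte MBAP header followed by the PDU); `MbapSketch` and its method names are hypothetical, and whether `ModbusTcpTransport` validates anything beyond the transaction id is an assumption to check against the transport source.

```csharp
using System;

public static class MbapSketch
{
    // Builds a normal FC03 response to an MBAP request frame, echoing the
    // transaction id and unit id and returning a single register.
    public static byte[] BuildFc03Response(ReadOnlySpan<byte> request, ushort registerValue)
    {
        if (request.Length < 8)
            throw new ArgumentException("Frame too short for MBAP header + function code.", nameof(request));
        return new byte[]
        {
            request[0], request[1],            // transaction id (echoed)
            0x00, 0x00,                        // protocol id = 0 (Modbus)
            0x00, 0x05,                        // length = unit id (1) + PDU (4)
            request[6],                        // unit id (echoed)
            0x03,                              // function code FC03
            0x02,                              // byte count: one 16-bit register
            (byte)(registerValue >> 8),        // register value, high byte
            (byte)(registerValue & 0xFF),      // register value, low byte
        };
    }

    // Builds an exception response: request function code with bit 7 set,
    // followed by the exception code (e.g. 0x02 = illegal data address).
    public static byte[] BuildExceptionResponse(ReadOnlySpan<byte> request, byte exceptionCode)
    {
        if (request.Length < 8)
            throw new ArgumentException("Frame too short for MBAP header + function code.", nameof(request));
        return new byte[]
        {
            request[0], request[1],            // transaction id (echoed)
            0x00, 0x00,                        // protocol id = 0 (Modbus)
            0x00, 0x03,                        // length = unit id (1) + PDU (2)
            request[6],                        // unit id (echoed)
            (byte)(request[7] | 0x80),         // 0x03 becomes 0x83
            exceptionCode,
        };
    }
}
```

In a test, the in-process `TcpListener` would read the 12-byte request, pass it to one of these builders, and write the returned bytes back on the same stream.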
**Commit:** `feat(probe): Modbus Test-Connect does a real FC03 handshake`
---
### Task 2: OpcUaClient handshake — GetEndpoints
**Classification:** small
**Estimated implement time:** ~4 min
**Parallelizable with:** Tasks 1, 3, 4, 5, 6, 7, 8
**Files:**
- Modify: `src/Drivers/ZB.MOM.WW.OtOpcUa.Driver.OpcUaClient/OpcUaClientDriverProbe.cs`
- Test: `tests/Drivers/ZB.MOM.WW.OtOpcUa.Driver.OpcUaClient.Tests/OpcUaClientDriverProbeTests.cs` (create)
**Approach:** After TCP preflight, do an unsecured discovery handshake (no session, no app-cert, no auth) — mirror `OpcUaClientDriver.cs:417-424`:
```csharp
using var client = await DiscoveryClient.CreateAsync(new Uri(endpointUrl) /* + the SDK's default config as the driver does */);
var endpoints = await client.GetEndpointsAsync(null, ct);
return endpoints is { Count: > 0 }
    ? new(true, $"OPC UA: {endpoints.Count} endpoint(s)", sw.Elapsed)
    : new(false, $"Reachable at {host}:{port} but OPC UA handshake failed: server published 0 endpoints", null);
```
- Reuse the EXACT `DiscoveryClient.CreateAsync(...)` overload the driver uses (Read `OpcUaClientDriver.cs:405-424` for the arg shape — it may pass an `ApplicationConfiguration`/`EndpointConfiguration`). Honour `ct`. A non-OPC-UA TCP server makes `GetEndpointsAsync` throw/timeout → catch → `Ok=false` "handshake failed: {ex.Message}". Keep the timeout/Connect-failed catches.
- Update the class XML-doc (drop "Does NOT open an OPC UA session" → now does a GetEndpoints discovery handshake).
**Tests:** invalid-json / no-endpoint / unreachable / tcp-accepts-then-closes→handshake-fail. The happy path (real endpoints) is covered live in Task 11 (a faithful in-process OPC UA server is heavy; the accept-then-close negative path is the unit-testable new behavior).
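
The accept-then-close negative path (the same shape the Modbus and S7 tests call for) needs only a few lines of in-process plumbing. The helper below is a hypothetical sketch, not an existing test utility in the repo: it listens on an ephemeral loopback port, accepts one connection, and closes it without writing any protocol bytes.

```csharp
using System;
using System.Net;
using System.Net.Sockets;
using System.Threading.Tasks;

// Hypothetical test helper: a TCP endpoint that is reachable but never speaks a protocol.
public sealed class AcceptThenCloseListener : IDisposable
{
    private readonly TcpListener _listener = new(IPAddress.Loopback, 0);

    public AcceptThenCloseListener()
    {
        _listener.Start();
        Port = ((IPEndPoint)_listener.LocalEndpoint).Port;
        _ = Task.Run(async () =>
        {
            try
            {
                // Disposing the accepted client closes the socket straight away.
                using var client = await _listener.AcceptTcpClientAsync();
            }
            catch (Exception ex) when (ex is SocketException or ObjectDisposedException)
            {
                // The listener was stopped before any client connected.
            }
        });
    }

    // Ephemeral port chosen by the OS; point the probe's endpoint at it.
    public int Port { get; }

    public void Dispose() => _listener.Stop();
}
```

A probe test would build its config against `127.0.0.1:{Port}` and assert `Ok=false` with a message containing "handshake failed".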
**Commit:** `feat(probe): OpcUaClient Test-Connect does a GetEndpoints discovery handshake`
---
### Task 3: S7 handshake — Plc.OpenAsync
**Classification:** small
**Estimated implement time:** ~4 min
**Parallelizable with:** Tasks 1, 2, 4, 5, 6, 7, 8
**Files:**
- Modify: `src/Drivers/ZB.MOM.WW.OtOpcUa.Driver.S7/S7DriverProbe.cs`
- Test: `tests/Drivers/ZB.MOM.WW.OtOpcUa.Driver.S7.Tests/S7DriverProbeTests.cs` (create)
**Approach:** After TCP preflight, do the COTP+S7 setup handshake — mirror `S7Driver.cs:162-171`:
```csharp
var plc = new Plc(S7CpuTypeMap.ToS7Net(opts.CpuType), host, port, opts.Rack, opts.Slot);
plc.ReadTimeout = (int)timeout.TotalMilliseconds; // set BEFORE OpenAsync (handshake honours it)
try
{
    await plc.OpenAsync(ct);
    if (plc.IsConnected) return new(true, $"S7 connected (CPU {opts.CpuType})", sw.Elapsed);
    return new(false, $"Reachable at {host}:{port} but S7 handshake failed: not connected", null);
}
finally { plc.Close(); }
```
- Reuse `S7CpuTypeMap.ToS7Net` (`S7CpuTypeMap.cs`). Read `S7DriverOptions` for Rack/Slot/CpuType field names. Wrong rack/slot or non-S7 server → `OpenAsync` throws → catch → `Ok=false` "handshake failed: {ex.Message}". Keep Connect-failed / timeout catches.
- Update the class XML-doc.
**Tests:** invalid-json / no-host / unreachable / tcp-accepts-then-closes→handshake-fail (a listener that accepts then closes makes `OpenAsync` throw). Happy path is live (Task 11, python-snap7 sim).
**Commit:** `feat(probe): S7 Test-Connect does a real ISO-on-TCP + S7 setup handshake`
---
### Task 4: AbCip handshake — libplctag init
**Classification:** small
**Estimated implement time:** ~5 min
**Parallelizable with:** Tasks 1, 2, 3, 5, 6, 7, 8
**Files:**
- Modify: `src/Drivers/ZB.MOM.WW.OtOpcUa.Driver.AbCip/AbCipDriverProbe.cs`
- Test: `tests/Drivers/ZB.MOM.WW.OtOpcUa.Driver.AbCip.Tests/AbCipDriverProbeTests.cs` (create)
**Approach:** After TCP preflight, open a CIP session via `libplctag` by initializing one Tag against the first device:
- Build an `AbCipTagCreateParams` from the first device's options (Gateway + CIP path + PlcType/libplctag-attr + a tag name) and `new LibplctagTagRuntime(p).InitializeAsync(ct)`. Read `LibplctagTagRuntime.cs` + `AbCipTagCreateParams` + how `AbCipDriver` builds these (`AbCipDriver.cs` device-init path, ~`:824`/`:856`) for the exact param shape and where the family/PlcType comes from.
- For the tag name: prefer the first configured tag path if `opts` carries tags; else a benign placeholder. Interpret the outcome:
- `InitializeAsync` succeeds → `Ok` `"CIP session OK"`.
- A **CIP-level** error (tag-not-found / bad-path — inspect `GetStatus()` / the libplctag `Status` enum) → STILL `Ok` `"CIP session OK (controller reachable; probe tag not found)"` — the controller answered CIP.
- A session/ForwardOpen/connect/timeout error → `Ok=false` `"Reachable at {host}:{port} but CIP handshake failed: {detail}"`.
- Dispose the runtime/tag. Update the class XML-doc.
**Tests:** invalid-json / no-host / unreachable. The CIP-status interpretation happy/CIP-error paths are covered live (Task 11, CIP sim). Keep unit tests to the offline-determinable paths; do NOT spin a fake CIP server.
**Commit:** `feat(probe): AbCip Test-Connect opens a real CIP session (libplctag init)`
---
### Task 5: AbLegacy handshake — libplctag init (PCCC)
**Classification:** small
**Estimated implement time:** ~4 min
**Parallelizable with:** Tasks 1, 2, 3, 4, 6, 7, 8
**Files:**
- Modify: `src/Drivers/ZB.MOM.WW.OtOpcUa.Driver.AbLegacy/AbLegacyDriverProbe.cs`
- Test: `tests/Drivers/ZB.MOM.WW.OtOpcUa.Driver.AbLegacy.Tests/AbLegacyDriverProbeTests.cs` (create)
**Approach:** Same libplctag-init pattern as Task 4 but the AbLegacy project's own runtime/params types (PCCC protocol family). Read the AbLegacy driver's device-init/tag-runtime code for the exact param shape (it mirrors AbCip). Same outcome interpretation (session-open or CIP/PCCC-level error → `Ok`; connect/timeout → handshake-fail). Message: `"PCCC session OK"` / `"Reachable … but PCCC handshake failed: {detail}"`. Update the class XML-doc.
**Tests:** invalid-json / no-host / unreachable. Happy path is **deferred** (no PLC5/SLC sim on the rig) — note this in the test file header; the handshake code path is the same library as AbCip (verified-by-proxy).
**Commit:** `feat(probe): AbLegacy Test-Connect opens a real PCCC session (libplctag init)`
---
### Task 6: TwinCAT handshake — ADS ReadState (degrade-guarded)
**Classification:** standard
**Estimated implement time:** ~5 min
**Parallelizable with:** Tasks 1, 2, 3, 4, 5, 7, 8
**Files:**
- Modify: `src/Drivers/ZB.MOM.WW.OtOpcUa.Driver.TwinCAT/TwinCATDriverProbe.cs`
- Test: `tests/Drivers/ZB.MOM.WW.OtOpcUa.Driver.TwinCAT.Tests/TwinCATDriverProbeTests.cs` (create)
**Approach:** After TCP preflight, attempt an ADS state read — mirror `AdsTwinCATClient.cs:87-90,194`:
```csharp
using var client = new AdsClient();
client.Connect(parsed.NetId, parsed.Port); // AmsNetId + ADS port from the parsed address
var state = await client.ReadStateAsync(ct); // AdsState
return new(true, $"ADS state: {state.AdsState}", sw.Elapsed);
```
- **Degrade guard:** wrap construction/connect in try/catch. Distinguish:
- ADS connected + `ReadState` OK → `Ok` `"ADS state: {state}"`.
- ADS **route/auth rejection** from a reachable router (the AdsErrorCode indicates target-port/route) → `Ok=false` `"Reachable at {host}:{port} but ADS handshake failed: {AdsErrorCode} — check the target's ADS route table authorizes this host"` (a true RED — the driver also needs the route).
- The managed AMS router can't construct/run headless (any other exception that means the handshake could not be ATTEMPTED, not that the device rejected it) → **degrade**: `Ok=true` `"Reachable at {host}:{port} (ADS handshake unavailable on this host — TCP reachability only)"`.
- Use the existing `TwinCATAmsAddress.TryParse` for NetId+port (already in `ExtractTarget`). Honour `ct`/timeout. Read the `Beckhoff.TwinCAT.Ads` `AdsClient` API (`Connect`, `ReadStateAsync`, `AdsErrorException`/`AdsErrorCode`) to classify route-rejection vs construction-failure. Update the class XML-doc.
**Tests:** invalid-json / no-host / unreachable (black-hole → timeout or degrade). Assert the degrade path returns `Ok=true` with the "TCP reachability only" note when `AdsClient` cannot attempt the handshake. Happy/route-reject paths are **deferred** (no ADS target on the rig) — note in the test header.
**Commit:** `feat(probe): TwinCAT Test-Connect does an ADS ReadState (degrade-guarded)`
---
### Task 7: FOCAS handshake — cnc_allclibhndl3 (degrade-guarded)
**Classification:** standard
**Estimated implement time:** ~5 min
**Parallelizable with:** Tasks 1, 2, 3, 4, 5, 6, 8
**Files:**
- Modify: `src/Drivers/ZB.MOM.WW.OtOpcUa.Driver.FOCAS/FocasDriverProbe.cs`
- Test: `tests/Drivers/ZB.MOM.WW.OtOpcUa.Driver.FOCAS.Tests/FocasDriverProbeTests.cs` (create)
**Approach:** After TCP preflight, attempt the FOCAS library handshake via the existing wire P/Invoke (`Wire.WireFocasClient` / `FocasWireClient` — read those for the `cnc_allclibhndl3`/`cnc_freelibhndl` entry points). Build the wire client directly (do NOT route through `UnimplementedFocasClientFactory`, which throws by design). Allocate a handle to the first device's host/port and free it.
- **Degrade guard:** the FWLIB native lib is absent on the dev box / Linux containers → the P/Invoke throws `DllNotFoundException` / `NotSupportedException` / `TypeInitializationException`. Catch those specifically and **degrade**: `Ok=true` `"Reachable at {host}:{port} (FOCAS handshake unavailable on this host — FWLIB absent, TCP reachability only)"`.
- Handle allocated OK → `Ok` `"FOCAS handle OK"`.
- FWLIB present but `cnc_allclibhndl3` returns an error code (e.g. EW_SOCKET) from a reachable-but-non-CNC host → `Ok=false` `"Reachable at {host}:{port} but FOCAS handshake failed: {focasRc}"`.
- Honour `ct`/timeout (FWLIB connect can block — run it on a worker/`Task.Run` bounded by the linked timeout CTS so the probe still returns within budget). Update the class XML-doc.
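
The guard described in the bullets above could be shaped roughly like this. It is a sketch only: `FocasGuardSketch`, `FocasOutcome`, and the `nativeConnect` delegate are hypothetical names, and the delegate stands in for the blocking allocate-then-free P/Invoke pair, returning the FOCAS return code (0 meaning success). The caller would map each outcome to the message templates.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public enum FocasOutcome { HandleOk, Rejected, Degraded, TimedOut }

public static class FocasGuardSketch
{
    public static async Task<(FocasOutcome Outcome, short? ReturnCode)> RunAsync(
        Func<short> nativeConnect, TimeSpan timeout, CancellationToken ct)
    {
        try
        {
            // Run the blocking native call on a worker and stop waiting at the deadline.
            // The worker itself is not aborted; only the probe stops waiting for it.
            short rc = await Task.Run(nativeConnect).WaitAsync(timeout, ct);
            return rc == 0
                ? (FocasOutcome.HandleOk, rc)
                : (FocasOutcome.Rejected, rc);
        }
        catch (Exception ex) when (ex is DllNotFoundException
                                      or NotSupportedException
                                      or TypeInitializationException)
        {
            // The handshake could not be attempted on this host: degrade to TCP-only.
            return (FocasOutcome.Degraded, null);
        }
        catch (Exception ex) when (ex is TimeoutException or OperationCanceledException)
        {
            return (FocasOutcome.TimedOut, null);
        }
    }
}
```

Keeping the exception filter narrow is the point of the design: anything outside the three "cannot attempt" exception types still surfaces as a failure, so the degrade path is never a catch-all.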
**Tests:** invalid-json / no-host / unreachable. **Assert the degrade path** — on the CI/dev box (no FWLIB) the probe against a reachable TCP listener returns `Ok=true` with the "FWLIB absent" note (this IS the dev-box behavior, so it's directly testable). Happy/CNC-error paths are **deferred** (no CNC + no FWLIB) — note in the test header.
**Commit:** `feat(probe): FOCAS Test-Connect attempts a cnc_allclibhndl3 handshake (degrade-guarded)`
---
### Task 8: Galaxy handshake — gRPC ping (auth-rejection = reachable)
**Classification:** standard
**Estimated implement time:** ~5 min
**Parallelizable with:** Tasks 1, 2, 3, 4, 5, 6, 7
**Files:**
- Modify: `src/Drivers/ZB.MOM.WW.OtOpcUa.Driver.Galaxy/GalaxyDriverProbe.cs`
- Test: `tests/Drivers/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Tests/GalaxyDriverProbeTests.cs` (create)
**Approach:** After TCP preflight, build a `Grpc.Net.Client` channel to `Gateway.Endpoint` honouring the cleartext/TLS setting (read how `GalaxyDriver` / `GatewayGalaxy*` construct the channel — there's an http2-cleartext path for the dev gw), and issue ONE lightweight unary call from `MxGateway.Client`/`MxGateway.Contracts` (pick the cheapest — e.g. a status/echo/health, else the smallest query). **Do NOT resolve `secretref:` secrets** — send whatever key string is in the transient config.
- Interpret the gRPC `StatusCode`:
- `OK` → `Ok` `"gateway gRPC OK"`.
- `Unauthenticated` / `PermissionDenied` → **also `Ok`** `"gateway reachable & speaking gRPC (auth not checked)"`, because an auth rejection still proves a live mxgw server.
- `Unavailable` / transport error / deadline → `Ok=false` `"Reachable at {host}:{port} but gateway gRPC handshake failed: {StatusCode}"`.
- Honour `ct`/timeout (set the gRPC deadline from `timeout`). Dispose the channel. Update the class XML-doc.
**Tests:** invalid-json / no-endpoint / unreachable (black-hole → `Unavailable`/deadline → `Ok=false`). The `Unauthenticated⇒Ok` rule: if a tiny in-process gRPC server is disproportionate, cover it live (Task 11, gateway `10.100.0.48:5120`) and unit-test the StatusCode→result classification by factoring the mapping into a small pure helper (e.g. `static (bool ok, string msg) ClassifyRpc(StatusCode, host, port)`) and testing that directly.
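
One possible shape for that pure helper is sketched below. To stay compilable without the gRPC package it uses a local enum, `RpcStatusSketch`, whose members mirror the numeric values of the corresponding gRPC status codes; the real helper would presumably take `Grpc.Core.StatusCode` directly. Treating every code other than the three named ones as a failure is an assumption, since the plan only names `Unavailable`, transport errors, and deadline.

```csharp
using System;

// Local stand-in for Grpc.Core.StatusCode (subset, same numeric values).
public enum RpcStatusSketch
{
    OK = 0,
    DeadlineExceeded = 4,
    PermissionDenied = 7,
    Unavailable = 14,
    Unauthenticated = 16,
}

public static class GalaxyProbeClassifierSketch
{
    // Maps an RPC status to the probe result pair described in the bullets above.
    public static (bool Ok, string Message) ClassifyRpc(RpcStatusSketch status, string host, int port) =>
        status switch
        {
            RpcStatusSketch.OK
                => (true, "gateway gRPC OK"),
            RpcStatusSketch.Unauthenticated or RpcStatusSketch.PermissionDenied
                => (true, "gateway reachable & speaking gRPC (auth not checked)"),
            // Assumption: everything else counts as a rejected handshake.
            _ => (false, $"Reachable at {host}:{port} but gateway gRPC handshake failed: {status}"),
        };
}
```

Because the helper takes no channel or client, the `Unauthenticated⇒Ok` rule can be unit-tested without any in-process gRPC server.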
**Commit:** `feat(probe): Galaxy Test-Connect does a gRPC ping (auth-rejection counts as reachable)`
---
### Task 9: Docs + bookkeeping
**Classification:** small
**Estimated implement time:** ~4 min
**Parallelizable with:** none (blocked by 1-8)
**Files:**
- Modify: `docs/plans/2026-05-28-adminui-driver-pages-design.md` (mark Phase 7 real-probes done) — OR add a "Test-Connect probes" section to the most appropriate driver doc.
- Modify: `docs/Historian.md` or a probes note — record that the Historian probe was already a real handshake.
- Create/Modify: a short `docs/drivers/TestConnectProbes.md` (or a section in an existing driver overview) documenting the per-driver handshake + the three degrade behaviors + the consistent message contract.
**Steps:** Document the 8 handshakes, the degrade semantics (TwinCAT route table; FOCAS FWLIB-absent), and the auth-rejection=reachable Galaxy rule. Note AbLegacy/TwinCAT/FOCAS live-verify deferred (no sim/target/FWLIB). Commit: `docs(phase5): real Test-Connect handshakes per driver + degrade semantics`.
---
### Task 10: Full build + test + final integration review
**Classification:** high-risk (final integration gate — degrade guards + no-regression-vs-TCP across 8 disjoint probes)
**Estimated implement time:** ~6 min
**Parallelizable with:** none (blocked by 1-9)
**Steps:**
1. `dotnet build ZB.MOM.WW.OtOpcUa.slnx` → 0 errors (production projects are `TreatWarningsAsErrors`).
2. `dotnet test` for the 8 driver `.Tests` projects → all green.
3. Final integration review focus: (a) every probe still returns the *unchanged* `"Connect failed"` / `"timed out"` messages on those paths (no regression for offline devices); (b) the TwinCAT + FOCAS **degrade guards** truly catch "cannot-attempt" vs "device-rejected" and never emit a worse-than-TCP result; (c) Galaxy's `Unauthenticated⇒Ok`; (d) no probe mutates state; (e) no `IDriverProbe`/`DriverProbeResult`/DI change leaked.
4. Commit any review fixes.
---
### Task 11: Live `/run` — extend E2E + run the 5 verifiable probes
**Classification:** high-risk (acceptance gate, agent-driven)
**Estimated implement time:** ~8 min
**Parallelizable with:** none (blocked by 10)
**Files:**
- Modify: `tests/Server/ZB.MOM.WW.OtOpcUa.Host.IntegrationTests/DriverTestConnectE2eTests.cs` (add OpcUaClient/S7/AbCip/Galaxy happy + wrong-port cases, skip-gated like the Modbus ones).
**Steps:**
1. Extend `DriverTestConnectE2eTests` with skip-gated happy-path + wrong-port cases for OpcUaClient (`opc.tcp://10.100.0.35:50000`), S7 (`10.100.0.35:1102`), AbCip (`10.100.0.35:44818`), Galaxy (`10.100.0.48:5120`), mirroring the existing Modbus pattern (`DockerFixtureAvailability.IsReachable` skip).
2. Run the integration suite from the dev Mac (sims reachable): assert each verifiable probe is GREEN vs its live sim and RED vs a wrong port. For Galaxy, source the key WITHOUT echoing if a call needs it (per the established recipe) — but the probe should report reachable even without it.
3. Record results. AbLegacy / TwinCAT / FOCAS happy-path live-verify is **honestly deferred** (no PLC5/SLC sim, no ADS target, no CNC+FWLIB) — unit-proven + degrade-guarded.
4. Commit the E2E additions: `test(phase5): live Test-Connect E2E for OpcUaClient/S7/AbCip/Galaxy (skip-gated)`.
---
## Done =
Build clean + all driver `.Tests` green + final integration review SHIP + the 5 verifiable probes live-proven GREEN/RED on the rig + docs updated. Then `finishing-a-development-branch` → merge to master + push.