Phase 5 — Test-Connect Protocol Handshakes — Implementation Plan

For Claude: REQUIRED SUB-SKILL: Use superpowers-extended-cc:subagent-driven-development to implement this plan task-by-task.

Goal: Replace the 8 bare-TCP Test-Connect probes with real protocol handshakes so a live-but-rejecting device reads RED, not green. Each probe reuses its driver's own client primitive. Three drivers get special handling: TwinCAT and FOCAS degrade gracefully to TCP reachability when the host can't run a real handshake, and Galaxy counts an auth rejection as reachable.

Architecture: Per-probe, no shared scaffold (matches the existing self-contained probe style; keeps all 8 driver projects disjoint → parallelizable). Each probe keeps its TCP preflight and adds one handshake step. New result contract with four outcomes (TCP-fail / handshake-ok / TCP-ok-but-handshake-rejected / timeout), plus a degrade outcome for TwinCAT/FOCAS. IDriverProbe/DriverProbeResult and DI are UNCHANGED. Design: docs/plans/2026-06-16-stillpending-phase-5-probes-design.md.

Tech Stack: C# / .NET 10, xUnit + Shouldly, in-process TcpListener for unit tests, skip-gated DriverTestConnectE2eTests for live verification. Per-driver client libs already referenced (S7netplus, libplctag, Beckhoff.TwinCAT.Ads, OPCFoundation.Opc.Ua.Client, MxGateway.Client gRPC, FOCAS wire P/Invoke).

Consistent result-message templates (apply in EVERY probe):

  • TCP connect fails → Ok=false, "Connect failed: {SocketErrorCode}" (keep as-is)
  • Handshake OK → Ok=true, Latency, e.g. "Modbus FC03 OK", "OPC UA: {n} endpoint(s)", "S7 connected (CPU {cpu})", "CIP session OK", "ADS state: {state}", "gateway gRPC OK"
  • TCP OK but handshake rejected → Ok=false, "Reachable at {host}:{port} but {proto} handshake failed: {detail}"
  • Timeout (OperationCanceledException) → Ok=false, "Probe timed out after {timeout.TotalSeconds:F0}s." (keep as-is)
  • Degrade (TwinCAT/FOCAS only, env can't run handshake) → Ok=true, "Reachable at {host}:{port} ({proto} handshake unavailable on this host — TCP reachability only)"
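As a minimal sketch of the templates above, the formatting could look like the following. `ProbeOutcome` and `ProbeMessages` are hypothetical names; `ProbeOutcome` stands in for the unchanged DriverProbeResult, whose three-argument shape is assumed from the snippets in this plan. Because the architecture keeps probes self-contained, each probe would carry these strings inline rather than share a helper. The degrade template follows the same pattern and is omitted for brevity.

```csharp
using System;
using System.Net.Sockets;

// Hypothetical stand-in for DriverProbeResult (assumed shape: Ok, Message, Latency).
public sealed record ProbeOutcome(bool Ok, string Message, TimeSpan? Latency);

// Hypothetical illustration of the message templates; not a shared scaffold.
public static class ProbeMessages
{
    // TCP connect failed: message is kept as-is from the existing probes.
    public static ProbeOutcome ConnectFailed(SocketError code) =>
        new(false, $"Connect failed: {code}", null);

    // TCP connected, but the protocol handshake was rejected or failed.
    public static ProbeOutcome HandshakeRejected(string host, int port, string proto, string detail) =>
        new(false, $"Reachable at {host}:{port} but {proto} handshake failed: {detail}", null);

    // The probe ran out of budget (OperationCanceledException path).
    public static ProbeOutcome TimedOut(TimeSpan timeout) =>
        new(false, $"Probe timed out after {timeout.TotalSeconds:F0}s.", null);
}
```

Keeping the strings literal in one spot per probe makes the Task 10 review item (a), unchanged "Connect failed" and "timed out" messages, easy to check by eye.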

Global rules (every task): TDD red→green. Probes MUST honour ct and MUST NOT mutate state. Stage by path — never git add .; never stage sql_login.txt, src/Server/.../Host/pki/, pending.md, current.md, docker-dev/docker-compose.yml, stillpending.md. Never echo/commit the gateway API key. No --no-verify, no force-push. No IDriverProbe/DriverProbeResult/DI change. No bUnit.


Task 0: Feature branch (done)

Branch feat/stillpending-phase-5-probes off master 050f5c4b already created; design committed 1f2d32ac. No action.


Task 1: Modbus handshake — FC03 read

Classification: small Estimated implement time: ~4 min Parallelizable with: Tasks 2, 3, 4, 5, 6, 7, 8

Files:

  • Modify: src/Drivers/ZB.MOM.WW.OtOpcUa.Driver.Modbus/ModbusDriverProbe.cs
  • Test: tests/Drivers/ZB.MOM.WW.OtOpcUa.Driver.Modbus.Tests/ModbusDriverProbeTests.cs (create)

Approach: After the existing TCP preflight succeeds, do the real handshake with the in-project transport:

// proto label for messages
const string Proto = "Modbus";
// ... existing deserialize + ExtractTarget + TCP preflight unchanged ...
// On TCP success, run a one-shot FC03 (Read Holding Registers, qty 1 @ addr 0):
await using var transport = new ModbusTcpTransport(host, port, /* keep-alive */ default, /* timeouts from opts/defaults */);
await transport.ConnectAsync(ct);
var pdu = new byte[] { 0x03, 0x00, 0x00, 0x00, 0x01 }; // FC03, addr 0, qty 1
try
{
    _ = await transport.SendAsync(opts.UnitId /* or default 1 */, pdu, ct);
    return new(true, "Modbus FC03 OK", sw.Elapsed);
}
catch (ModbusException) // exception PDU (e.g. illegal data address) STILL proves a real Modbus device
{
    return new(true, "Modbus FC03 OK (device returned exception PDU)", sw.Elapsed);
}
  • Inspect ModbusTcpTransport's real ctor signature (ModbusTcpTransport.cs:27-66) and ModbusDriverOptions for the unit-id field; mirror how ModbusDriver constructs the transport. Keep the SocketException/OperationCanceledException/Exception catches; a non-ModbusException failure after TCP success → Ok=false, "Reachable at {host}:{port} but Modbus FC03 handshake failed: {ex.Message}".
  • Update the class XML-doc: it now performs a real FC03 handshake (drop the "Does NOT exchange any protocol bytes" sentence).

Steps: (1) Write failing tests. (2) Run → fail. (3) Implement handshake. (4) Run → pass. (5) dotnet build the Modbus project clean. (6) Commit.

Tests (ModbusDriverProbeTests, in-process TcpListener):

  • ProbeAsync_invalid_json → Ok=false ("invalid").
  • ProbeAsync_no_host → Ok=false ("no host/port").
  • ProbeAsync_unreachable_port → Ok=false (Connect failed) — target a closed loopback port.
  • ProbeAsync_tcp_accepts_then_closes → Ok=false with "handshake failed" — a TcpListener that accepts and immediately closes (no MBAP reply).
  • ProbeAsync_canned_MBAP_response → Ok=true "Modbus FC03 OK" — a TcpListener that reads the request frame and writes a valid MBAP FC03 response echoing the TxId.
  • (optional) ProbeAsync_exception_PDU → Ok=true — listener replies 0x83 + exception code.
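
For the canned-response tests, the reply frames can be built by a small pure helper that the in-process TcpListener writes back after reading the request. The sketch below assumes the standard Modbus TCP framing (7-byte MBAP header followed by the PDU); `CannedModbus` and its method names are hypothetical and not part of the existing test project.

```csharp
using System;

// Hypothetical test helper: builds canned Modbus TCP replies for a fake listener.
public static class CannedModbus
{
    // Builds an MBAP-framed FC03 response for a one-register read, echoing the
    // transaction id and unit id of `request` (the full ADU the probe sent).
    public static byte[] Fc03Response(ReadOnlySpan<byte> request, ushort registerValue)
    {
        if (request.Length < 8)
            throw new ArgumentException("Request is shorter than MBAP header + function code.", nameof(request));

        return new byte[]
        {
            request[0], request[1],        // transaction id (echoed)
            0x00, 0x00,                    // protocol id: 0 = Modbus
            0x00, 0x05,                    // length: unit id + FC + byte count + 2 data bytes
            request[6],                    // unit id (echoed)
            0x03,                          // function code: Read Holding Registers
            0x02,                          // byte count for one register
            (byte)(registerValue >> 8),    // register value, high byte first
            (byte)(registerValue & 0xFF),
        };
    }

    // Builds an exception response: function code with the high bit set, then the exception code.
    public static byte[] ExceptionResponse(ReadOnlySpan<byte> request, byte exceptionCode)
    {
        if (request.Length < 8)
            throw new ArgumentException("Request is shorter than MBAP header + function code.", nameof(request));

        return new byte[]
        {
            request[0], request[1],        // transaction id (echoed)
            0x00, 0x00,                    // protocol id
            0x00, 0x03,                    // length: unit id + FC + exception code
            request[6],                    // unit id (echoed)
            (byte)(request[7] | 0x80),     // e.g. 0x03 becomes 0x83
            exceptionCode,
        };
    }
}
```

Echoing the transaction id matters because a transport that matches replies to requests by TxId would otherwise discard the canned frame.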

Commit: feat(probe): Modbus Test-Connect does a real FC03 handshake


Task 2: OpcUaClient handshake — GetEndpoints

Classification: small Estimated implement time: ~4 min Parallelizable with: Tasks 1, 3, 4, 5, 6, 7, 8

Files:

  • Modify: src/Drivers/ZB.MOM.WW.OtOpcUa.Driver.OpcUaClient/OpcUaClientDriverProbe.cs
  • Test: tests/Drivers/ZB.MOM.WW.OtOpcUa.Driver.OpcUaClient.Tests/OpcUaClientDriverProbeTests.cs (create)

Approach: After TCP preflight, do an unsecured discovery handshake (no session, no app-cert, no auth) — mirror OpcUaClientDriver.cs:417-424:

using var client = await DiscoveryClient.CreateAsync(new Uri(endpointUrl) /* + the SDK's default config as the driver does */);
var endpoints = await client.GetEndpointsAsync(null, ct);
return endpoints is { Count: > 0 }
    ? new(true, $"OPC UA: {endpoints.Count} endpoint(s)", sw.Elapsed)
    : new(false, $"Reachable at {host}:{port} but OPC UA handshake failed: server published 0 endpoints", null);
  • Reuse the EXACT DiscoveryClient.CreateAsync(...) overload the driver uses (Read OpcUaClientDriver.cs:405-424 for the arg shape — it may pass an ApplicationConfiguration/EndpointConfiguration). Honour ct. A non-OPC-UA TCP server makes GetEndpointsAsync throw/timeout → catch → Ok=false "handshake failed: {ex.Message}". Keep the timeout/Connect-failed catches.
  • Update the class XML-doc (drop "Does NOT open an OPC UA session" → now does a GetEndpoints discovery handshake).

Tests: invalid-json / no-endpoint / unreachable / tcp-accepts-then-closes→handshake-fail. The happy path (real endpoints) is covered live in Task 11 (a faithful in-process OPC UA server is heavy; the accept-then-close negative path is the unit-testable new behavior).
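
The accept-then-close negative path used here (and in Tasks 1 and 3) needs only a loopback listener that accepts one connection and drops it without writing anything. A minimal sketch, with `AcceptThenCloseListener` as a hypothetical fixture name:

```csharp
using System;
using System.Net;
using System.Net.Sockets;
using System.Threading.Tasks;

// Hypothetical test fixture: accepts one TCP connection and closes it immediately,
// so the TCP preflight succeeds but no protocol reply ever arrives.
public sealed class AcceptThenCloseListener : IDisposable
{
    private readonly TcpListener _listener = new(IPAddress.Loopback, 0); // port 0 = OS picks a free port

    public int Port { get; }
    public Task Accepted { get; }

    public AcceptThenCloseListener()
    {
        _listener.Start();
        Port = ((IPEndPoint)_listener.LocalEndpoint).Port;
        Accepted = AcceptAndCloseAsync();
    }

    private async Task AcceptAndCloseAsync()
    {
        try
        {
            // Disposing the accepted client closes the connection with no bytes written.
            using TcpClient peer = await _listener.AcceptTcpClientAsync();
        }
        catch (SocketException) { /* listener stopped before anyone connected */ }
        catch (ObjectDisposedException) { /* listener stopped before anyone connected */ }
    }

    public void Dispose() => _listener.Stop();
}
```

A probe pointed at `Port` should then report the "handshake failed" message rather than "Connect failed". Depending on whether the probe has already sent bytes, the client side may see either a clean end-of-stream or a connection reset, so the test is safer asserting on the message prefix than on the exact detail text.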

Commit: feat(probe): OpcUaClient Test-Connect does a GetEndpoints discovery handshake


Task 3: S7 handshake — Plc.OpenAsync

Classification: small Estimated implement time: ~4 min Parallelizable with: Tasks 1, 2, 4, 5, 6, 7, 8

Files:

  • Modify: src/Drivers/ZB.MOM.WW.OtOpcUa.Driver.S7/S7DriverProbe.cs
  • Test: tests/Drivers/ZB.MOM.WW.OtOpcUa.Driver.S7.Tests/S7DriverProbeTests.cs (create)

Approach: After TCP preflight, do the COTP+S7 setup handshake — mirror S7Driver.cs:162-171:

var plc = new Plc(S7CpuTypeMap.ToS7Net(opts.CpuType), host, port, opts.Rack, opts.Slot);
plc.ReadTimeout = (int)timeout.TotalMilliseconds; // set BEFORE OpenAsync (handshake honours it)
try
{
    await plc.OpenAsync(ct);
    if (plc.IsConnected) return new(true, $"S7 connected (CPU {opts.CpuType})", sw.Elapsed);
    return new(false, $"Reachable at {host}:{port} but S7 handshake failed: not connected", null);
}
finally { plc.Close(); }
  • Reuse S7CpuTypeMap.ToS7Net (S7CpuTypeMap.cs). Read S7DriverOptions for Rack/Slot/CpuType field names. Wrong rack/slot or non-S7 server → OpenAsync throws → catch → Ok=false "handshake failed: {ex.Message}". Keep Connect-failed / timeout catches.
  • Update the class XML-doc.

Tests: invalid-json / no-host / unreachable / tcp-accepts-then-closes→handshake-fail (a listener that accepts then closes makes OpenAsync throw). Happy path is live (Task 11, python-snap7 sim).

Commit: feat(probe): S7 Test-Connect does a real ISO-on-TCP + S7 setup handshake


Task 4: AbCip handshake — libplctag init

Classification: small Estimated implement time: ~5 min Parallelizable with: Tasks 1, 2, 3, 5, 6, 7, 8

Files:

  • Modify: src/Drivers/ZB.MOM.WW.OtOpcUa.Driver.AbCip/AbCipDriverProbe.cs
  • Test: tests/Drivers/ZB.MOM.WW.OtOpcUa.Driver.AbCip.Tests/AbCipDriverProbeTests.cs (create)

Approach: After TCP preflight, open a CIP session via libplctag by initializing one Tag against the first device:

  • Build an AbCipTagCreateParams from the first device's options (Gateway + CIP path + PlcType/libplctag-attr + a tag name), then call new LibplctagTagRuntime(p).InitializeAsync(ct). Read LibplctagTagRuntime.cs + AbCipTagCreateParams + how AbCipDriver builds these (AbCipDriver.cs device-init path, ~:824/:856) for the exact param shape and where the family/PlcType comes from.
  • For the tag name: prefer the first configured tag path if opts carries tags; else a benign placeholder. Interpret the outcome:
    • InitializeAsync succeeds → Ok "CIP session OK".
    • A CIP-level error (tag-not-found / bad-path — inspect GetStatus() / the libplctag Status enum) → STILL Ok "CIP session OK (controller reachable; probe tag not found)" — the controller answered CIP.
    • A session/ForwardOpen/connect/timeout error → Ok=false "Reachable at {host}:{port} but CIP handshake failed: {detail}".
  • Dispose the runtime/tag. Update the class XML-doc.

Tests: invalid-json / no-host / unreachable. The CIP-status interpretation happy/CIP-error paths are covered live (Task 11, CIP sim). Keep unit tests to the offline-determinable paths; do NOT spin a fake CIP server.

Commit: feat(probe): AbCip Test-Connect opens a real CIP session (libplctag init)


Task 5: AbLegacy handshake — libplctag init (PCCC)

Classification: small Estimated implement time: ~4 min Parallelizable with: Tasks 1, 2, 3, 4, 6, 7, 8

Files:

  • Modify: src/Drivers/ZB.MOM.WW.OtOpcUa.Driver.AbLegacy/AbLegacyDriverProbe.cs
  • Test: tests/Drivers/ZB.MOM.WW.OtOpcUa.Driver.AbLegacy.Tests/AbLegacyDriverProbeTests.cs (create)

Approach: Same libplctag-init pattern as Task 4, but using the AbLegacy project's own runtime/params types (PCCC protocol family). Read the AbLegacy driver's device-init/tag-runtime code for the exact param shape (it mirrors AbCip). Same outcome interpretation (session-open or CIP/PCCC-level error → Ok; connect/timeout → handshake-fail). Message: "PCCC session OK" / "Reachable … but PCCC handshake failed: {detail}". Update the class XML-doc.

Tests: invalid-json / no-host / unreachable. Happy path is deferred (no PLC5/SLC sim on the rig) — note this in the test file header; the handshake code path is the same library as AbCip (verified-by-proxy).

Commit: feat(probe): AbLegacy Test-Connect opens a real PCCC session (libplctag init)


Task 6: TwinCAT handshake — ADS ReadState (degrade-guarded)

Classification: standard Estimated implement time: ~5 min Parallelizable with: Tasks 1, 2, 3, 4, 5, 7, 8

Files:

  • Modify: src/Drivers/ZB.MOM.WW.OtOpcUa.Driver.TwinCAT/TwinCATDriverProbe.cs
  • Test: tests/Drivers/ZB.MOM.WW.OtOpcUa.Driver.TwinCAT.Tests/TwinCATDriverProbeTests.cs (create)

Approach: After TCP preflight, attempt an ADS state read — mirror AdsTwinCATClient.cs:87-90,194:

using var client = new AdsClient();
client.Connect(parsed.NetId, parsed.Port);            // AmsNetId + ADS port from the parsed address
var state = await client.ReadStateAsync(ct);          // AdsState
return new(true, $"ADS state: {state.AdsState}", sw.Elapsed);
  • Degrade guard: wrap construction/connect in try/catch. Distinguish:
    • ADS connected + ReadState OK → Ok "ADS state: {state}".
    • ADS route/auth rejection from a reachable router (the AdsErrorCode indicates target-port/route) → Ok=false "Reachable at {host}:{port} but ADS handshake failed: {AdsErrorCode} — check the target's ADS route table authorizes this host" (a true RED — the driver also needs the route).
    • The managed AMS router can't construct/run headless (any other exception that means the handshake could not be ATTEMPTED, not that the device rejected it) → degrade: Ok=true "Reachable at {host}:{port} (ADS handshake unavailable on this host — TCP reachability only)".
  • Use the existing TwinCATAmsAddress.TryParse for NetId+port (already in ExtractTarget). Honour ct/timeout. Read the Beckhoff.TwinCAT.Ads AdsClient API (Connect, ReadStateAsync, AdsErrorException/AdsErrorCode) to classify route-rejection vs construction-failure. Update the class XML-doc.

Tests: invalid-json / no-host / unreachable (black-hole → timeout or degrade). Assert the degrade path returns Ok=true with the "TCP reachability only" note when AdsClient cannot attempt the handshake. Happy/route-reject paths are deferred (no ADS target on the rig) — note in the test header.

Commit: feat(probe): TwinCAT Test-Connect does an ADS ReadState (degrade-guarded)


Task 7: FOCAS handshake — cnc_allclibhndl3 (degrade-guarded)

Classification: standard Estimated implement time: ~5 min Parallelizable with: Tasks 1, 2, 3, 4, 5, 6, 8

Files:

  • Modify: src/Drivers/ZB.MOM.WW.OtOpcUa.Driver.FOCAS/FocasDriverProbe.cs
  • Test: tests/Drivers/ZB.MOM.WW.OtOpcUa.Driver.FOCAS.Tests/FocasDriverProbeTests.cs (create)

Approach: After TCP preflight, attempt the FOCAS library handshake via the existing wire P/Invoke (Wire.WireFocasClient / FocasWireClient — read those for the cnc_allclibhndl3/cnc_freelibhndl entry points). Build the wire client directly (do NOT route through UnimplementedFocasClientFactory, which throws by design). Allocate a handle to the first device's host/port and free it.

  • Degrade guard: the FWLIB native lib is absent on the dev box / Linux containers → the P/Invoke throws DllNotFoundException / NotSupportedException / TypeInitializationException. Catch those specifically and degrade: Ok=true "Reachable at {host}:{port} (FOCAS handshake unavailable on this host — FWLIB absent, TCP reachability only)".
    • Handle allocated OK → Ok "FOCAS handle OK".
    • FWLIB present but cnc_allclibhndl3 returns an error code (e.g. EW_SOCKET) from a reachable-but-non-CNC host → Ok=false "Reachable at {host}:{port} but FOCAS handshake failed: {focasRc}".
  • Honour ct/timeout (FWLIB connect can block — run it on a worker/Task.Run bounded by the linked timeout CTS so the probe still returns within budget). Update the class XML-doc.
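
The degrade guard and the bounded worker can be sketched together as below. This is an illustration under stated assumptions, not the probe's actual code: `allocate` is a hypothetical delegate standing in for the blocking FWLIB handle allocation, a return code of 0 is assumed to mean success, and `HandshakeOutcome` / `BoundedHandshake` are made-up names.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public enum HandshakeOutcome { Ok, Rejected, Unavailable, TimedOut }

public static class BoundedHandshake
{
    // Runs a blocking native-style call on a worker and bounds the wait by `timeout`.
    // `allocate` is a hypothetical stand-in that returns the library's return code.
    public static async Task<(HandshakeOutcome Outcome, short ReturnCode)> RunAsync(
        Func<short> allocate, TimeSpan timeout, CancellationToken ct)
    {
        using var cts = CancellationTokenSource.CreateLinkedTokenSource(ct);
        cts.CancelAfter(timeout);
        try
        {
            // Task.Run keeps the blocking call off the caller; WaitAsync stops waiting
            // when the linked token fires (the worker itself is not interrupted).
            short rc = await Task.Run(allocate).WaitAsync(cts.Token);
            return (rc == 0 ? HandshakeOutcome.Ok : HandshakeOutcome.Rejected, rc);
        }
        catch (OperationCanceledException)
        {
            // Covers both the timeout and caller cancellation.
            return (HandshakeOutcome.TimedOut, 0);
        }
        catch (Exception ex) when (ex is DllNotFoundException
                                      or NotSupportedException
                                      or TypeInitializationException)
        {
            // The handshake could not be attempted at all: degrade, do not report RED.
            return (HandshakeOutcome.Unavailable, 0);
        }
    }
}
```

The probe would map `Ok` to "FOCAS handle OK", `Rejected` to the handshake-failed message with the return code, `Unavailable` to the degrade message, and `TimedOut` to the existing timeout message. One design caveat: when the wait is abandoned on timeout, the worker may still be blocked inside the native call, so any handle it eventually obtains would need to be freed by the worker itself.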

Tests: invalid-json / no-host / unreachable. Assert the degrade path — on the CI/dev box (no FWLIB) the probe against a reachable TCP listener returns Ok=true with the "FWLIB absent" note (this IS the dev-box behavior, so it's directly testable). Happy/CNC-error paths are deferred (no CNC + no FWLIB) — note in the test header.

Commit: feat(probe): FOCAS Test-Connect attempts a cnc_allclibhndl3 handshake (degrade-guarded)


Task 8: Galaxy handshake — gRPC ping (auth-rejection = reachable)

Classification: standard Estimated implement time: ~5 min Parallelizable with: Tasks 1, 2, 3, 4, 5, 6, 7

Files:

  • Modify: src/Drivers/ZB.MOM.WW.OtOpcUa.Driver.Galaxy/GalaxyDriverProbe.cs
  • Test: tests/Drivers/ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Tests/GalaxyDriverProbeTests.cs (create)

Approach: After TCP preflight, build a Grpc.Net.Client channel to Gateway.Endpoint honouring the cleartext/TLS setting (read how GalaxyDriver / GatewayGalaxy* construct the channel — there's an http2-cleartext path for the dev gw), and issue ONE lightweight unary call from MxGateway.Client/MxGateway.Contracts (pick the cheapest — e.g. a status/echo/health, else the smallest query). Do NOT resolve secretref: secrets — send whatever key string is in the transient config.

  • Interpret the gRPC StatusCode:
    • OK → Ok "gateway gRPC OK".
    • Unauthenticated / PermissionDenied → also Ok "gateway reachable & speaking gRPC (auth not checked)"; this proves a live mxgw server.
    • Unavailable / transport error / deadline → Ok=false "Reachable at {host}:{port} but gateway gRPC handshake failed: {StatusCode}".
  • Honour ct/timeout (set the gRPC deadline from timeout). Dispose the channel. Update the class XML-doc.

Tests: invalid-json / no-endpoint / unreachable (black-hole → Unavailable/deadline → Ok=false). The Unauthenticated⇒Ok rule: if a tiny in-process gRPC server is disproportionate, cover it live (Task 11, gateway 10.100.0.48:5120) and unit-test the StatusCode→result classification by factoring the mapping into a small pure helper (e.g. static (bool ok, string msg) ClassifyRpc(StatusCode, host, port)) and testing that directly.
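
The pure classification helper suggested above might look like this. To keep the sketch compilable on its own, it declares a local `StatusCode` enum that mirrors the relevant members of Grpc.Core.StatusCode; the real probe would use the library enum instead. `GalaxyProbeClassifier` is a hypothetical class name.

```csharp
using System;

// Local stand-in for Grpc.Core.StatusCode (subset only), an assumption made so the
// sketch is self-contained. Numeric values follow the gRPC status code table.
public enum StatusCode
{
    OK = 0,
    DeadlineExceeded = 4,
    PermissionDenied = 7,
    Unavailable = 14,
    Unauthenticated = 16,
}

public static class GalaxyProbeClassifier
{
    // Maps a gRPC status to the probe result: an auth rejection still proves a live gateway.
    public static (bool Ok, string Message) ClassifyRpc(StatusCode code, string host, int port) => code switch
    {
        StatusCode.OK =>
            (true, "gateway gRPC OK"),
        StatusCode.Unauthenticated or StatusCode.PermissionDenied =>
            (true, "gateway reachable & speaking gRPC (auth not checked)"),
        _ =>
            (false, $"Reachable at {host}:{port} but gateway gRPC handshake failed: {code}"),
    };
}
```

Because the helper takes no channel or client, the Unauthenticated⇒Ok rule can be unit-tested without an in-process gRPC server, leaving only the transport wiring to the live run.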

Commit: feat(probe): Galaxy Test-Connect does a gRPC ping (auth-rejection counts as reachable)


Task 9: Docs + bookkeeping

Classification: small Estimated implement time: ~4 min Parallelizable with: none (blocked by 1-8)

Files:

  • Modify: docs/plans/2026-05-28-adminui-driver-pages-design.md (mark Phase 7 real-probes done) — OR add a "Test-Connect probes" section to the most appropriate driver doc.
  • Modify: docs/Historian.md or a probes note — record that the Historian probe was already a real handshake.
  • Create/Modify: a short docs/drivers/TestConnectProbes.md (or a section in an existing driver overview) documenting the per-driver handshake + the three degrade behaviors + the consistent message contract.

Steps: Document the 8 handshakes, the degrade semantics (TwinCAT route table; FOCAS FWLIB-absent), and the auth-rejection=reachable Galaxy rule. Note AbLegacy/TwinCAT/FOCAS live-verify deferred (no sim/target/FWLIB). Commit: docs(phase5): real Test-Connect handshakes per driver + degrade semantics.


Task 10: Full build + test + final integration review

Classification: high-risk (final integration gate: degrade guards + no-regression-vs-TCP across 8 disjoint probes) Estimated implement time: ~6 min Parallelizable with: none (blocked by 1-9)

Steps:

  1. dotnet build ZB.MOM.WW.OtOpcUa.slnx → 0 errors (production projects are TreatWarningsAsErrors).
  2. dotnet test for the 8 driver .Tests projects → all green.
  3. Final integration review focus: (a) every probe still returns the unchanged "Connect failed" / "timed out" messages on those paths (no regression for offline devices); (b) the TwinCAT + FOCAS degrade guards truly catch "cannot-attempt" vs "device-rejected" and never emit a worse-than-TCP result; (c) Galaxy's Unauthenticated⇒Ok; (d) no probe mutates state; (e) no IDriverProbe/DriverProbeResult/DI change leaked.
  4. Commit any review fixes.

Task 11: Live /run — extend E2E + run the 5 verifiable probes

Classification: high-risk (acceptance gate, agent-driven) Estimated implement time: ~8 min Parallelizable with: none (blocked by 10)

Files:

  • Modify: tests/Server/ZB.MOM.WW.OtOpcUa.Host.IntegrationTests/DriverTestConnectE2eTests.cs (add OpcUaClient/S7/AbCip/Galaxy happy + wrong-port cases, skip-gated like the Modbus ones).

Steps:

  1. Extend DriverTestConnectE2eTests with skip-gated happy-path + wrong-port cases for OpcUaClient (opc.tcp://10.100.0.35:50000), S7 (10.100.0.35:1102), AbCip (10.100.0.35:44818), Galaxy (10.100.0.48:5120), mirroring the existing Modbus pattern (DockerFixtureAvailability.IsReachable skip).
  2. Run the integration suite from the dev Mac (sims reachable): assert each verifiable probe is GREEN vs its live sim and RED vs a wrong port. For Galaxy, source the key WITHOUT echoing if a call needs it (per the established recipe) — but the probe should report reachable even without it.
  3. Record results. AbLegacy / TwinCAT / FOCAS happy-path live-verify is honestly deferred (no PLC5/SLC sim, no ADS target, no CNC+FWLIB) — unit-proven + degrade-guarded.
  4. Commit the E2E additions: test(phase5): live Test-Connect E2E for OpcUaClient/S7/AbCip/Galaxy (skip-gated).

Done =

Build clean + all driver .Tests green + final integration review SHIP + the 5 verifiable probes live-proven GREEN/RED on the rig + docs updated. Then finishing-a-development-branch → merge to master + push.