# Still-Pending Phase 5 — Test-Connect protocol handshakes — design
> **Status:** approved 2026-06-16. Parent roadmap: `docs/plans/2026-06-15-stillpending-backlog-design.md` (Phase 5).
> Source backlog: `stillpending.md` §2 ("Test-Connect probes are TCP-only") + plan
> `2026-05-28-adminui-driver-pages` Phase 7 + `2026-06-12-historian-tcp-transport` task 9.
> Branch `feat/stillpending-phase-5-probes` off master `050f5c4b`. Phases 0–4 already shipped.
## Goal
Replace the bare-TCP `socket.ConnectAsync` Test-Connect probes with **real protocol handshakes** so a
*live-but-rejecting* device reads **RED**, not green. Today a firewalled port, a non-Modbus TCP server, a
PLC at the wrong rack/slot, or a down-but-port-forwarded OPC UA endpoint all surface a healthy tick — the
operator gets a false "connection OK" and only discovers the truth when the driver faults at deploy.
## Grounding (verified this session)
- **All 8 probes are byte-identical TCP-only** boilerplate (`deserialize → ExtractTarget → Socket.ConnectAsync
→ close`): Modbus, S7, AbCip, AbLegacy, TwinCAT, OpcUaClient, Galaxy, FOCAS.
- **Historian.Wonderware is ALREADY a real handshake** — it sends a `Hello` and confirms `HelloAck`
(`WonderwareHistorianDriverProbe.cs:54-71`). So §2's "task 9 / historian probe" is **already done**; this
phase does not touch it (only documents it).
- **Every driver already owns the handshake primitive** in its client code — no new package references:
  - Modbus: `ModbusTcpTransport.ConnectAsync` + `IModbusTransport.SendAsync(unitId, pdu, ct)`.
  - S7: `new Plc(cpu, host, port, rack, slot).OpenAsync(ct)` (COTP CR/CC + S7 setup-communication).
  - AbCip / AbLegacy: `libplctag` Tag `InitializeAsync` (opens the EIP session + first CIP op).
  - TwinCAT: `AdsClient.Connect(netId, port)` + `ReadStateAsync` (`AdsTwinCATClient.cs:90,194`).
  - OpcUaClient: `DiscoveryClient.GetEndpointsAsync` (no session / cert / auth; `OpcUaClientDriver.cs:422`).
  - Galaxy: the `MxGateway.Client` gRPC channel + one lightweight unary call.
  - FOCAS: `cnc_allclibhndl3` via the existing wire P/Invoke (`Wire.WireFocasClient`).
- **Probe dispatch** clamps the timeout to 1–60 s and passes a cancelled-on-timeout `ct`
(`AdminOperationsActor.cs:284-291`). Probes MUST honour it and MUST NOT mutate state (read-only handshakes).
- **A proven skip-gated E2E harness exists** (`DriverTestConnectE2eTests`) targeting the live Modbus sim with
happy / wrong-port / black-hole scenarios, auto-skipping when the fixture is unreachable. The dev-rig sims
(Modbus `:5020`, AbCip `:44818`, S7 `:1102`, opc-plc `:50000`) and the mxaccessgw (`10.100.0.48:5120`) are
reachable from the dev Mac, so 5 of the 8 are live-verifiable agent-side.
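
The probe-dispatch bullet above describes a clamped timeout and a token that is cancelled when it expires. A minimal sketch of that pattern follows, using only the BCL. `ProbeTimeout`, `Clamp`, and `RunAsync` are hypothetical names for illustration, not the code in `AdminOperationsActor.cs`, and the clamp bounds are passed in rather than assumed.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical helper for illustration only; not the actual dispatch code.
public static class ProbeTimeout
{
    // Forces the requested timeout into [min, max]; the bounds are supplied by the caller.
    public static TimeSpan Clamp(TimeSpan requested, TimeSpan min, TimeSpan max) =>
        requested < min ? min : requested > max ? max : requested;

    // Runs the probe with a token that is cancelled once `timeout` elapses.
    // A probe that honours the token surfaces as the timeout message.
    public static async Task<string> RunAsync(
        Func<CancellationToken, Task<string>> probe,
        TimeSpan timeout,
        CancellationToken outer = default)
    {
        using var cts = CancellationTokenSource.CreateLinkedTokenSource(outer);
        cts.CancelAfter(timeout);
        try
        {
            return await probe(cts.Token).ConfigureAwait(false);
        }
        catch (OperationCanceledException) when (cts.IsCancellationRequested && !outer.IsCancellationRequested)
        {
            // Only the timeout (not a caller-initiated cancel) maps to this message.
            return $"Probe timed out after {timeout.TotalSeconds:0}s.";
        }
    }
}
```

A probe that ignores the token would defeat this wrapper, which is why the bullet requires every probe to honour `ct`.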
## Architecture
**Per-probe, no shared scaffold.** Each probe stays self-contained in its own driver project (matches the
existing intentional copy-paste style, keeps all 8 projects disjoint → parallelizable like Phase 4, and avoids
adding socket/handshake logic to the `Core.Abstractions` contracts project). Each in-scope probe keeps its TCP
preflight and **adds one handshake step**, reusing the driver's own client primitive.

**New three-way result contract** (the operator-facing value; message templates are kept consistent across all 8). The timeout row is unchanged and sits alongside the three connect/handshake outcomes:

| Outcome | Result |
|---|---|
| TCP connect fails | `Ok=false` · `"Connect failed: {SocketError}"` *(unchanged)* |
| TCP ok **+ handshake ok** | `Ok=true` · latency · descriptive msg (e.g. `"Modbus FC03 OK"`, `"OPC UA: N endpoint(s)"`, `"S7 connected (CPU …)"`, `"CIP session OK"`, `"ADS state: Run"`, `"gateway gRPC OK"`) |
| TCP ok **but handshake rejected** | `Ok=false` · `"Reachable at {host}:{port} but {proto} handshake failed: {detail}"` ← **the new behavior** |
| timeout | `Ok=false` · `"Probe timed out after {n}s."` *(unchanged)* |

`IDriverProbe` / `DriverProbeResult` are **unchanged**: no Commons / Core contract touch, no DI change
(the 8 probes are already registered in `DriverFactoryBootstrap.AddOtOpcUaDriverProbes`).
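
To keep the message templates in the table above identical across probes, each probe could format them through the same small set of helpers. The sketch below is illustrative only: `ProbeOutcomeSketch` is a hypothetical stand-in (the real `DriverProbeResult` shape is not shown in this document), and because the design keeps probes self-contained, such helpers would presumably be copied per project rather than shared.

```csharp
using System;
using System.Net.Sockets;

// Hypothetical stand-in; NOT the real DriverProbeResult.
public sealed record ProbeOutcomeSketch(bool Ok, string Message, TimeSpan? Latency = null);

public static class ProbeMessages
{
    // TCP connect failed (unchanged behavior).
    public static ProbeOutcomeSketch ConnectFailed(SocketError error) =>
        new(false, $"Connect failed: {error}");

    // TCP ok and handshake ok: descriptive message plus latency.
    public static ProbeOutcomeSketch HandshakeOk(string description, TimeSpan latency) =>
        new(true, description, latency);

    // TCP ok but the protocol handshake was rejected (the new behavior).
    public static ProbeOutcomeSketch HandshakeRejected(string host, int port, string proto, string detail) =>
        new(false, $"Reachable at {host}:{port} but {proto} handshake failed: {detail}");

    // Probe exceeded its timeout (unchanged behavior).
    public static ProbeOutcomeSketch TimedOut(int seconds) =>
        new(false, $"Probe timed out after {seconds}s.");
}
```

For example, `ProbeMessages.HandshakeRejected("10.0.0.5", 502, "Modbus", "no MBAP reply")` yields `Ok=false` with the message `Reachable at 10.0.0.5:502 but Modbus handshake failed: no MBAP reply`.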
## Per-driver handshake + degradation
### Tier A — real handshake, live-verifiable on the rig (agent-driven)
1. **Modbus** — `ConnectAsync` then `SendAsync` one **FC03** (Read Holding Registers, qty 1 @ addr 0, unit
from config/default 1). Any well-formed MBAP reply that echoes the TxId with protocol-id 0 — **including a
Modbus exception PDU (0x83…)** — proves a real Modbus device ⇒ `Ok`. A malformed/non-MBAP reply or silence
⇒ handshake-fail. Sim `:5020`.
2. **OpcUaClient** — `DiscoveryClient.GetEndpointsAsync` (no session, no app-cert, no auth). ≥1 endpoint ⇒
`Ok` (`"OPC UA: N endpoint(s)"`); a non-OPC-UA TCP server throws/times out ⇒ handshake-fail. opc-plc `:50000`.
3. **S7** — `new Plc(...).OpenAsync(ct)` with `ReadTimeout` set first (mirror `S7Driver.cs:164`), check
`IsConnected`, `Close`. Wrong rack/slot or non-S7 server ⇒ `OpenAsync` throws ⇒ handshake-fail. python-snap7
`:1102`.
4. **AbCip** — create a `libplctag` Tag for the first configured tag path (else a benign probe name) and
`InitializeAsync`. Session opens ⇒ `Ok`; a **CIP-level** error (tag-not-found / bad-path) **still counts as
reachable** (the controller answered CIP); a session/ForwardOpen/connect error ⇒ handshake-fail. CIP sim `:44818`.
5. **Galaxy** — build the `MxGateway.Client` gRPC channel (honour the config's cleartext/TLS) and issue one
lightweight unary call. **The probe does NOT resolve `secretref:` secrets** — it sends whatever key string is
in the transient config (possibly empty/unresolved). An `OK` reply ⇒ `Ok`; an **`Unauthenticated` /
`PermissionDenied`** reply **also ⇒ `Ok`** ("gateway reachable & speaking gRPC; auth not checked") because it
proves a live mxgw server; `Unavailable` / transport error ⇒ handshake-fail. Gateway `10.100.0.48:5120`.
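
The Modbus rule in item 1 can be sketched at the byte level. The helper below builds the FC03 request as a raw MBAP frame and applies the acceptance test described above (echoed TxId, protocol-id 0, consistent length), so an exception PDU such as `0x83` still counts as a real Modbus reply. `ModbusProbeFrames` and its methods are hypothetical names; the actual probe would go through `IModbusTransport.SendAsync` rather than hand-built frames, and the length-field check is this sketch's assumption about what "well-formed" means.

```csharp
using System;
using System.Buffers.Binary;

// Hypothetical helper for illustration; the real probe reuses the driver's transport.
public static class ModbusProbeFrames
{
    // MBAP header (7 bytes) + FC03 PDU (5 bytes): read `quantity` holding registers at `address`.
    public static byte[] BuildFc03Request(ushort txId, byte unitId, ushort address = 0, ushort quantity = 1)
    {
        var frame = new byte[12];
        BinaryPrimitives.WriteUInt16BigEndian(frame.AsSpan(0), txId);     // transaction id
        BinaryPrimitives.WriteUInt16BigEndian(frame.AsSpan(2), 0);        // protocol id 0 = Modbus
        BinaryPrimitives.WriteUInt16BigEndian(frame.AsSpan(4), 6);        // bytes that follow: unit id + 5-byte PDU
        frame[6] = unitId;
        frame[7] = 0x03;                                                  // FC03 Read Holding Registers
        BinaryPrimitives.WriteUInt16BigEndian(frame.AsSpan(8), address);
        BinaryPrimitives.WriteUInt16BigEndian(frame.AsSpan(10), quantity);
        return frame;
    }

    // True when the reply is an MBAP frame that echoes our TxId with protocol id 0
    // and a length field matching the bytes received. The function code is not
    // inspected, so an exception PDU (0x83 ...) is accepted as proof of a Modbus device.
    public static bool IsModbusReply(ReadOnlySpan<byte> reply, ushort expectedTxId)
    {
        if (reply.Length < 8) return false;                               // MBAP (7) + function code (1)
        if (BinaryPrimitives.ReadUInt16BigEndian(reply) != expectedTxId) return false;
        if (BinaryPrimitives.ReadUInt16BigEndian(reply.Slice(2)) != 0) return false;
        int declared = BinaryPrimitives.ReadUInt16BigEndian(reply.Slice(4));
        return declared == reply.Length - 6;                              // length counts unit id + PDU
    }
}
```

A reply that fails this check (for instance an HTTP banner from a non-Modbus server on the same port) would map to the handshake-fail outcome rather than a connect failure.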
### Tier B — real handshake, unit-proven only (no rig target → live-verify deferred, honestly recorded)
6. **AbLegacy** — same `libplctag` Init handshake as AbCip but the PCCC protocol family (verified-by-proxy via
AbCip's identical code path). No PLC5/SLC sim on the rig.
7. **TwinCAT** — `AdsClient.Connect` + `ReadStateAsync` ⇒ `Ok` with the ADS state (Run/Config/Stop). An ADS
**route-table** rejection is a *true* RED (the driver also cannot function without an authorized route — the
message says so: *"check the target's ADS route table authorizes this host"*). **Degrade guard:** if
`AdsClient` cannot construct/connect headless (managed AMS router unavailable), catch and fall back to the
TCP-preflight result with a *"ADS handshake unavailable on this host — TCP reachability only"* note — never
worse than today. No TwinCAT target on the rig.
8. **FOCAS** — attempt `cnc_allclibhndl3` via the existing wire P/Invoke. **Degrade guard:** the FWLIB native
lib is absent on the dev box / Linux containers (`UnimplementedFocasClientFactory` gates the driver), so the
call throws `DllNotFoundException` / `NotSupportedException` ⇒ catch and fall back to the TCP-preflight result
with a *"FOCAS handshake unavailable on this host (FWLIB absent) — TCP reachability only"* note. A real
handshake runs only on a Windows host with FWLIB + a reachable CNC. No CNC on the rig.

**Degradation principle:** the three Tier-B handshakes must NEVER produce a result worse than today's TCP-only
probe. A genuine protocol rejection from a reachable device is a correct RED; an *environmental inability to run
the handshake at all* (no FWLIB, no managed router) degrades to the existing TCP-reachability message.
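
The degradation principle can be sketched as a small guard around the handshake call. In this sketch only the environmental exception types named for FOCAS (`DllNotFoundException`, `NotSupportedException`) fall back to the TCP-preflight result; any other exception propagates so the caller can report a genuine handshake failure. `GuardedOutcome` and `DegradeGuard` are hypothetical names, and which exceptions count as environmental for TwinCAT is left open here.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical result type for illustration; NOT the real DriverProbeResult.
public sealed record GuardedOutcome(bool Ok, string Message);

public static class DegradeGuard
{
    // Runs the handshake. If the host cannot run it at all, returns the
    // TCP-preflight result with a note appended, so the outcome is never
    // worse than the TCP-only probe.
    public static async Task<GuardedOutcome> RunAsync(
        GuardedOutcome tcpPreflight,
        Func<CancellationToken, Task<GuardedOutcome>> handshake,
        string unavailableNote,
        CancellationToken ct)
    {
        try
        {
            return await handshake(ct).ConfigureAwait(false);
        }
        catch (Exception ex) when (ex is DllNotFoundException or NotSupportedException)
        {
            // Environmental inability: degrade rather than report RED.
            return tcpPreflight with { Message = $"{tcpPreflight.Message} ({unavailableNote})" };
        }
        // Other exceptions (a real protocol rejection) are deliberately not caught here.
    }
}
```

Keeping the exception filter narrow is the design point: a broad `catch` would turn genuine protocol rejections back into green ticks, which is the behavior this phase removes.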
## Testing & verification
- **Unit (per probe, TDD red→green, xUnit + Shouldly):** in-process `TcpListener` drives — (a) invalid/empty
config, (b) unreachable → `Connect failed`, (c) **TCP-accepts-then-closes / garbage → handshake-fail** (the
key new path — cleanly assertable for Modbus via a canned-MBAP server; loopback-accept for the rest), plus
Modbus's canned-MBAP **happy** path and FOCAS's **degrade** path (DllNotFound on the dev box is the actual CI
behavior, so it is directly testable). Galaxy's `Unauthenticated⇒Ok` is testable against a tiny in-process
gRPC server or via the live gateway.
- **Live `/run` (agent-driven, extends `DriverTestConnectE2eTests`, skip-gated):** Modbus, OpcUaClient, S7,
AbCip against the rig sims; Galaxy against `10.100.0.48:5120`. Each: green vs the live sim, RED vs wrong port
/ non-protocol server, timeout vs a black-hole IP. AbLegacy / TwinCAT / FOCAS live-verify is **honestly
deferred** (no PLC5/SLC sim, no ADS target, no CNC+FWLIB) — unit-proven + degrade-guarded.
- `dotnet build` clean (production projects are `TreatWarningsAsErrors`) + full `dotnet test` green before merge.
- Final integration review (the three degrade guards + the consistent message contract + no-regression-vs-TCP).
- **No bUnit** — no Razor change (the `DriverTestConnectButton` already renders whatever the probe returns).
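
The "TCP-accepts-then-closes" unit scenario in the first bullet can be reproduced with nothing but the BCL. The sketch below starts a loopback `TcpListener` on an ephemeral port, accepts one client, and closes it without sending any protocol bytes; the client's read then reports end-of-stream, which is the condition a probe would translate into the handshake-fail message. `AcceptThenCloseFixture` is a hypothetical name, and the real tests would call the probe under test instead of a raw `TcpClient`.

```csharp
using System;
using System.Net;
using System.Net.Sockets;
using System.Threading.Tasks;

// Hypothetical fixture for illustration; real tests drive the probe, not a raw TcpClient.
public static class AcceptThenCloseFixture
{
    // Connects to a server that accepts and immediately closes the connection.
    // Returns the number of bytes the client could read afterwards.
    public static async Task<int> ConnectAndReadAsync()
    {
        var listener = new TcpListener(IPAddress.Loopback, 0); // port 0 = OS picks a free port
        listener.Start();
        try
        {
            int port = ((IPEndPoint)listener.LocalEndpoint).Port;

            Task serverSide = Task.Run(async () =>
            {
                // Accept, then dispose at once: TCP connect succeeds, no handshake bytes follow.
                using TcpClient accepted = await listener.AcceptTcpClientAsync();
            });

            using var client = new TcpClient();
            await client.ConnectAsync(IPAddress.Loopback, port);
            await serverSide;

            var buffer = new byte[16];
            return await client.GetStream().ReadAsync(buffer, 0, buffer.Length);
        }
        finally
        {
            listener.Stop();
        }
    }
}
```

Binding to port 0 avoids collisions when the eight probe test projects run in parallel.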
## Out of scope / not touched
- Historian.Wonderware probe (already a real Hello/HelloAck handshake).
- `IDriverProbe` / `DriverProbeResult` contract, DI registration, the AdminUI button/Razor, the persisted
`DriverInstance` row (probes run against transient form config only).
- Plan `2026-05-28-adminui-driver-pages` Phase 9 typed **address pickers** and Phase 10 driver-page E2E — those
are Phase 6 (AdminUI), not this phase.
## Hard constraints (carried from the parent roadmap)
- **NO Configuration entity / EF migration.** No contract change (`IDriverProbe`/`DriverProbeResult` frozen).
- Stage by path — never `git add .`. Never stage `sql_login.txt`, `src/Server/.../Host/pki/`, `pending.md`,
`current.md`, `docker-dev/docker-compose.yml`, `stillpending.md`. **Never echo or commit the gateway API key**
(the Galaxy live-verify sources it without echoing, per the established recipe). No force-push, no `--no-verify`.
- Finish = merge to master + push.