Joseph Doherty 2124f21ab6
2026-06-26 19:46:27 -04:00

# Test-Connect Probes — Protocol Handshakes

Each driver's **Test-Connect** button in the AdminUI runs a probe against the
form's current config (never the persisted row, never the live driver actor).
Before Phase 5 (shipped 2026-06-16) every probe was a bare TCP `ConnectAsync`:
a live-but-rejecting device showed a healthy green tick, and the operator
only discovered the truth when the driver faulted at deploy. Phase 5 replaced
each TCP-only probe with a **real protocol handshake** so a reachable-but-wrong
or actively-rejecting endpoint now reads RED.

The `IDriverProbe` / `DriverProbeResult` contract and DI registration are
unchanged. Probes run in a transient actor with a timeout clamped to 1–60 s
and must not mutate any state.
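
A minimal sketch of how a clamped timeout might wrap a probe call. `ProbeTimeoutSketch`, `Clamp`, and `RunAsync` are hypothetical names, and the bounds are passed in as parameters rather than copied from the real dispatch code in `AdminOperationsActor`. Only the record shape and the timeout message template are taken from this document.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Record shape as stated in this document.
public sealed record DriverProbeResult(bool Ok, string? Message, TimeSpan? Latency);

// Hypothetical helper; not the production dispatch code.
public static class ProbeTimeoutSketch
{
    // Keep a requested timeout inside the [min, max] window.
    public static TimeSpan Clamp(TimeSpan requested, TimeSpan min, TimeSpan max) =>
        requested < min ? min : requested > max ? max : requested;

    // Run the probe and convert a timeout-driven cancellation into the
    // uniform "Probe timed out" result instead of letting it escape.
    public static async Task<DriverProbeResult> RunAsync(
        Func<CancellationToken, Task<DriverProbeResult>> probe,
        TimeSpan timeout)
    {
        using var cts = new CancellationTokenSource(timeout);
        try
        {
            return await probe(cts.Token);
        }
        catch (OperationCanceledException) when (cts.IsCancellationRequested)
        {
            return new DriverProbeResult(
                false, $"Probe timed out after {(int)timeout.TotalSeconds}s.", null);
        }
    }
}
```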

For the AdminUI probe flow (button → `AdminOperationsActor` → transient probe
actor), see
[`docs/plans/2026-05-28-adminui-driver-pages-design.md`](../plans/2026-05-28-adminui-driver-pages-design.md)
§4.

---

## Result contract

All probes return a consistent `DriverProbeResult(bool Ok, string? Message, TimeSpan? Latency)`.
The message templates below are uniform across all 8 drivers:

| Outcome | `Ok` | Message template |
|---------|------|-----------------|
| TCP connect fails | `false` | `"Connect failed: {SocketErrorCode}"` |
| TCP ok + handshake ok | `true` | driver-specific descriptive string (see table below) |
| TCP ok but handshake rejected | `false` | `"Reachable at {host}:{port} but {proto} handshake failed: {detail}"` |
| Timeout | `false` | `"Probe timed out after {n}s."` |

The third row is the key new behavior: a reachable device that answers on the
port but rejects the protocol-level handshake now surfaces a `false` result
with a human-readable explanation rather than a false-green TCP-open tick.
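
The four outcomes in the table can be sketched as small factory methods. The record shape and the message templates come from this document; the `ProbeResults` class and its method names are hypothetical and only illustrate how the templates might be kept in one place.

```csharp
using System;
using System.Net.Sockets;

// Record shape as stated in this document.
public sealed record DriverProbeResult(bool Ok, string? Message, TimeSpan? Latency);

// Hypothetical helper; class and method names are illustrative only.
public static class ProbeResults
{
    // Row 1: the TCP connect itself failed.
    public static DriverProbeResult ConnectFailed(SocketError code, TimeSpan? latency = null) =>
        new(false, $"Connect failed: {code}", latency);

    // Row 2: TCP and protocol handshake both succeeded.
    public static DriverProbeResult HandshakeOk(string driverMessage, TimeSpan latency) =>
        new(true, driverMessage, latency);

    // Row 3: the port answered, but the protocol handshake was rejected.
    public static DriverProbeResult HandshakeRejected(
        string host, int port, string proto, string detail, TimeSpan? latency = null) =>
        new(false, $"Reachable at {host}:{port} but {proto} handshake failed: {detail}", latency);

    // Row 4: the probe ran out of time.
    public static DriverProbeResult TimedOut(TimeSpan timeout) =>
        new(false, $"Probe timed out after {(int)timeout.TotalSeconds}s.", null);
}
```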

---

## Per-driver handshake
| Driver | Handshake | Ok message | Dev-rig target |
|--------|-----------|------------|----------------|
| **Modbus** | FC03 (Read Holding Registers, qty 1 @ addr 0) via `ModbusTcpTransport`. A Modbus exception PDU still proves a real Modbus device → `Ok`. A non-MBAP reply → handshake fail. | `"Modbus FC03 OK"` | `10.100.0.35:5020` (Modbus sim) |
| **OpcUaClient** | `DiscoveryClient.GetEndpointsAsync` — no session, no app-cert, no auth. ≥ 1 endpoint → `Ok`. A non-OPC-UA TCP server throws or times out → handshake fail. | `"OPC UA: N endpoint(s)"` | `opc.tcp://10.100.0.35:50000` (opc-plc) |
| **S7** | `Plc.OpenAsync` (COTP CR/CC + S7 setup-communication), check `IsConnected`, then `Close`. Wrong rack/slot or a non-S7 server causes `OpenAsync` to throw → handshake fail. | `"S7 connected (CPU …)"` | `10.100.0.35:1102` (python-snap7 sim) |
| **AbCip** | `libplctag` Tag `InitializeAsync` (EIP session + CIP Forward Open). A CIP-level error such as tag-not-found still proves the controller answered CIP → `Ok`. A session/ForwardOpen/connect error → handshake fail. | `"CIP session OK"` | `10.100.0.35:44818` (CIP sim) |
| **AbLegacy** | Same `libplctag` `InitializeAsync` handshake as AbCip, PCCC protocol family. | `"CIP session OK"` (PCCC family) | Deferred — no PLC5/SLC sim |
| **TwinCAT** | `AdsClient.Connect` + `ReadStateAsync`. See [degrade semantics](#twincat-degrade) below. | `"ADS state: {state}"` | Deferred — no ADS target |
| **FOCAS** | `cnc_allclibhndl3` via a direct `DllImport("fwlib32")` in the probe. See [degrade semantics](#focas-degrade) below. | `"FOCAS handle OK"` | Deferred — no CNC + FWLIB |
| **Galaxy** | gRPC unary call to `GalaxyRepository.TestConnection` on the configured mxaccessgw endpoint. See [auth-rejection rule](#galaxy-auth-rejection) below. | `"gateway gRPC OK"` | `http://10.100.0.48:5120` (mxaccessgw) |

**Historian.Wonderware** had a TCP `Hello` → `HelloAck` handshake probe before Phase 5, but the
Wonderware historian backend (and its driver-type / probe) has since been **retired**. The historian
backend is now the external HistorianGateway (a gRPC client package, not a probed `IDriver`). See
[`Historian.Wonderware.md`](Historian.Wonderware.md) (retired stub) and [`../Historian.md`](../Historian.md).
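
To make the Modbus row in the table above concrete, here is a wire-level sketch of the FC03 request (qty 1 @ addr 0) and of the rule that an exception PDU still counts as a real Modbus device. It is not the code in `ModbusTcpTransport`; the class name, the method names, and the choice of unit id are assumptions made for illustration.

```csharp
using System;
using System.Buffers.Binary;

// Hypothetical wire-level sketch; the production probe goes through ModbusTcpTransport.
public static class ModbusFc03Sketch
{
    // MBAP header (7 bytes) + PDU (function 0x03, start address 0, quantity 1).
    public static byte[] BuildRequest(ushort transactionId, byte unitId)
    {
        var frame = new byte[12];
        BinaryPrimitives.WriteUInt16BigEndian(frame.AsSpan(0), transactionId);
        BinaryPrimitives.WriteUInt16BigEndian(frame.AsSpan(2), 0);  // protocol id, always 0 for Modbus
        BinaryPrimitives.WriteUInt16BigEndian(frame.AsSpan(4), 6);  // bytes that follow: unit id + 5 PDU bytes
        frame[6] = unitId;
        frame[7] = 0x03;                                            // Read Holding Registers
        BinaryPrimitives.WriteUInt16BigEndian(frame.AsSpan(8), 0);  // start address 0
        BinaryPrimitives.WriteUInt16BigEndian(frame.AsSpan(10), 1); // quantity 1
        return frame;
    }

    // True when the reply carries a valid MBAP header and either a normal FC03
    // response or the FC03 exception response (0x83). Anything else is treated
    // as a non-Modbus server.
    public static bool LooksLikeModbusReply(ReadOnlySpan<byte> reply, ushort transactionId)
    {
        if (reply.Length < 9) return false; // shortest valid reply: MBAP + function + exception code
        if (BinaryPrimitives.ReadUInt16BigEndian(reply) != transactionId) return false;
        if (BinaryPrimitives.ReadUInt16BigEndian(reply.Slice(2)) != 0) return false;
        byte function = reply[7];
        return function == 0x03 || function == 0x83;
    }
}
```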

---

## Degrade semantics
Three drivers have environmental constraints that can prevent the handshake
from running on certain hosts. The **degradation principle** is: the probe
must never produce a result *worse* than the pre-Phase 5 TCP-only probe. A genuine
protocol rejection from a reachable device is a correct RED; an inability to
*run* the handshake at all (no FWLIB, no managed router) degrades to the
existing TCP-reachability message, which is still a green tick but annotated.
### TwinCAT degrade
Where the handshake is available:
- `AdsClient.Connect(netId, port)` + `ReadStateAsync` → `Ok=true`,
  `"ADS state: {state}"` (Run / Config / Stop).
- An ADS **route-table rejection** from a reachable ADS router is a **true RED**:
`"Reachable at {host}:{port} but ADS handshake failed: {detail} — check the
target's ADS route table authorizes this host"`. This is the correct result:
the driver would also be unable to function without an authorized route.

Where the handshake is unavailable (headless server, no TwinCAT runtime, the
managed AMS router cannot start):
- Probe degrades to TCP-reachability: `Ok=true`,
`"(ADS handshake unavailable on this host — TCP reachability only)"`.
### FOCAS degrade
On a Windows host with the FANUC FWLIB shared library present:
- `cnc_allclibhndl3` is called via a direct `DllImport("fwlib32")` declared in
the probe (the production `Wire.WireFocasClient` is a pure-managed FOCAS/2 TCP
client, not an FWLIB P/Invoke, so the probe carries its own native binding).
A successful handle allocation → `Ok=true`, `"FOCAS handle OK"`.
- A CNC-level rejection → handshake fail.

On dev, Linux, or macOS (no native FWLIB; `UnimplementedFocasClientFactory`
gates the driver):
- `DllNotFoundException` / `NotSupportedException` is caught and the probe
degrades to TCP-reachability: `Ok=true`,
`"(FOCAS handshake unavailable on this host — FWLIB absent, TCP reachability only)"`.
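
The degrade rule for FOCAS and TwinCAT can be sketched as a wrapper that tries the handshake first and falls back to the TCP-only result only when the handshake cannot run at all. `DegradeSketch` and its parameters are hypothetical; the exception types shown are the FOCAS ones named above, and a real probe would pass the annotated message listed for its driver.

```csharp
using System;
using System.Threading.Tasks;

// Record shape as stated in this document.
public sealed record DriverProbeResult(bool Ok, string? Message, TimeSpan? Latency);

// Hypothetical wrapper illustrating the degradation principle.
public static class DegradeSketch
{
    public static async Task<DriverProbeResult> RunWithDegradeAsync(
        Func<Task<DriverProbeResult>> handshake,    // protocol handshake; may return a genuine RED
        Func<Task<DriverProbeResult>> tcpOnlyProbe, // bare TCP reachability check
        string degradedMessage)                     // annotation shown when the handshake cannot run
    {
        try
        {
            // A rejection from a reachable device is returned as-is (a true RED).
            return await handshake();
        }
        catch (Exception ex) when (ex is DllNotFoundException or NotSupportedException)
        {
            // The handshake could not run on this host, so fall back to TCP
            // reachability and annotate a successful result. A TCP failure
            // is passed through unchanged.
            var tcp = await tcpOnlyProbe();
            return tcp.Ok ? tcp with { Message = degradedMessage } : tcp;
        }
    }
}
```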
### Galaxy auth-rejection rule
The probe builds the gRPC channel from the form's config and issues one
lightweight unary call. It does **not** resolve `secretref:` secrets — the
key string in the transient config (possibly empty or unresolved) is used as-is.
- `Unavailable` / transport failure → `Ok=false` (gateway is down or unreachable).
- `Unauthenticated` / `PermissionDenied` → **`Ok=true`**,
  `"gateway reachable & speaking gRPC (auth not checked)"`. An auth rejection
  proves a live mxaccessgw gRPC server. This is the correct result: the driver's
  own session layer will handle auth; the probe is testing reachability only.

The mxaccessgw client surfaces a rejected key as a typed
`MxGatewayAuthenticationException` / `MxGatewayAuthorizationException`, **not** a
raw `RpcException`. The probe catches both and maps them to the reachable result
above. (Live verification on `10.100.0.48:5120` with no key returns
`MxGatewayAuthenticationException("Missing or invalid API key.")` → `Ok=true`.)
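
The mapping described above might look roughly like the sketch below. The two exception classes are local stand-ins declared only so the example compiles on its own; the real types come from the mxaccessgw client package. `GalaxyProbeSketch.Classify` is a hypothetical name, and reusing the exception text as the failure message is an assumption, since this document only specifies `Ok=false` for that case.

```csharp
using System;

// Record shape as stated in this document.
public sealed record DriverProbeResult(bool Ok, string? Message, TimeSpan? Latency);

// Stand-ins for the mxaccessgw client's typed exceptions (illustration only).
public sealed class MxGatewayAuthenticationException : Exception
{
    public MxGatewayAuthenticationException(string message) : base(message) { }
}

public sealed class MxGatewayAuthorizationException : Exception
{
    public MxGatewayAuthorizationException(string message) : base(message) { }
}

// Hypothetical classifier for the outcome of the single unary call.
public static class GalaxyProbeSketch
{
    // failure is null when the call succeeded.
    public static DriverProbeResult Classify(Exception? failure, TimeSpan latency) => failure switch
    {
        null => new DriverProbeResult(true, "gateway gRPC OK", latency),

        // An auth rejection proves a live gateway, so it still reads green.
        MxGatewayAuthenticationException or MxGatewayAuthorizationException =>
            new DriverProbeResult(true, "gateway reachable & speaking gRPC (auth not checked)", latency),

        // Transport failures and anything else: gateway down or unreachable.
        // The message text here is an assumption.
        _ => new DriverProbeResult(false, failure.Message, latency),
    };
}
```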
> **Config note:** `UseTls` must match the endpoint scheme — `UseTls:false` for an
> `http://` (h2c) gateway, `UseTls:true` for `https://`. A mismatch fails the
> client's own validation (the same constraint the Galaxy driver enforces).
---
## Live-verify scope
| Driver | Live-verify status | Notes |
|--------|-------------------|-------|
| Modbus | Verified | Dev-rig sim `10.100.0.35:5020`; green vs sim, RED vs wrong port / non-Modbus server, timeout vs black-hole IP |
| OpcUaClient | Verified | opc-plc `10.100.0.35:50000`; same three-scenario matrix |
| S7 | Verified | python-snap7 `10.100.0.35:1102` |
| AbCip | Verified | CIP sim `10.100.0.35:44818` |
| Galaxy | Verified | mxaccessgw `10.100.0.48:5120`; `Unauthenticated` reply counts as Ok |
| AbLegacy | Deferred | No PLC5/SLC sim; unit-proven + code path identical to AbCip |
| TwinCAT | Deferred | No ADS target; unit-proven + degrade guard tested |
| FOCAS | Deferred | No CNC + FWLIB on dev host; degrade guard is the CI-observable path |
---
## Implementation references
- Phase 5 design: `docs/plans/2026-06-16-stillpending-phase-5-probes-design.md`
- Parent roadmap: `docs/plans/2026-06-15-stillpending-backlog-design.md` §Phase 5
- AdminUI probe flow: `docs/plans/2026-05-28-adminui-driver-pages-design.md` §4
- Per-driver probe implementations: `src/Drivers/ZB.MOM.WW.OtOpcUa.Driver.<Type>/<Type>DriverProbe.cs`
- `IDriverProbe` contract: `src/Core/ZB.MOM.WW.OtOpcUa.Core.Abstractions/IDriverProbe.cs`
- Probe dispatch + timeout clamp: `src/Server/ZB.MOM.WW.OtOpcUa.Host/Actors/AdminOperationsActor.cs` (around line 284)