lmxopcua/docs/drivers/TestConnectProbes.md
Joseph Doherty 2124f21ab6
docs(historian-gateway): document gateway backend, config keys, EnsureTags hook, known gates; retire Wonderware from docs
HistorianGateway is now the sole historian backend (read + alarm SendEvent +
continuous WriteLiveValues). Document the final state and retire the Wonderware
sidecar from the docs/config/labels:

- CLAUDE.md: rewrite the Historian section — ServerHistorian /
  ContinuousHistorization / AlarmHistorian config keys, the IHistorianProvisioning
  EnsureTags hook, the GatewayAlarmHistorianWriter SendEvent path + ReadEvents
  dependency on gateway RuntimeDb:EventReadsEnabled=true, gateway-side
  prerequisites (RuntimeDb flags + historian:read/write/tags:write scopes),
  migration note, and two KNOWN-LIMITATION callouts (live-validation gate +
  empty historized-ref-set recorder follow-on).
- appsettings.json: fix the stale ServerHistorian block (Host/Port/SharedSecret/
  ServerCertThumbprint -> Endpoint/ApiKey/UseTls/AllowUntrustedServerCertificate/
  CaCertificatePath/CallTimeout, keep MaxTieClusterOverfetch); add a disabled
  ContinuousHistorization block; prune the orphaned Wonderware keys from
  AlarmHistorian (keep the SQLite knobs). ApiKey env-supplied via
  ServerHistorian__ApiKey (commented; valid strict JSON via _comment keys).
- README.md + docs (Historian.md, AlarmHistorian.md, Configuration.md,
  ServiceHosting.md, DriverLifecycle.md, drivers/README.md, Uns.md, VirtualTags.md,
  AlarmTracking.md, Client.UI.md, README.md, TestConnectProbes.md): retire the
  Wonderware historian backend from current-backend descriptions; fix the stale
  ServerHistorian/AlarmHistorian config tables (now gateway shape); convert
  drivers/Historian.Wonderware.md to a retired stub pointing at the gateway.
- Source/UI labels (descriptive text only, no behavior change):
  OtOpcUaServerHostedService.cs, HistoryPaging.cs, OtOpcUaSdkServer.cs,
  HistorianAdapterActor.cs, VirtualTagModal.razor, ScriptedAlarmModal.razor,
  AlarmsHistorian.razor now name the HistorianGateway backend.

Build clean (0 errors); AdminUI.Tests green (514 passed).

Claude-Session: https://claude.ai/code/session_012SDSQ3AcaXqPcBtDESBRii
2026-06-26 19:46:27 -04:00


Test-Connect Probes — Protocol Handshakes

Each driver's Test-Connect button in the AdminUI runs a probe against the form's current config (never the persisted row, never the live driver actor). Before Phase 5 (shipped 2026-06-16) every probe was a bare TCP ConnectAsync — a live-but-rejecting device showed a healthy green tick, and the operator only discovered the truth when the driver faulted at deploy. Phase 5 replaced each TCP-only probe with a real protocol handshake so a reachable-but-wrong or actively-rejecting endpoint now reads RED.

The IDriverProbe / DriverProbeResult contract and DI registration are unchanged. Probes run in a transient actor with a timeout clamp of 160 s and must not mutate any state.
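The timeout clamp can be pictured with a minimal sketch. It is written in Python purely as a language-neutral illustration (the real probes are C#), and `clamp_timeout`, `run_probe`, and the bounds passed in are hypothetical names and values, not taken from AdminOperationsActor.

```python
import asyncio
from typing import Awaitable, Callable, Tuple


def clamp_timeout(requested_s: float, min_s: float, max_s: float) -> float:
    """Force the requested timeout into the range [min_s, max_s]."""
    return max(min_s, min(requested_s, max_s))


async def run_probe(
    probe: Callable[[], Awaitable[Tuple[bool, str]]],
    requested_s: float,
    min_s: float,
    max_s: float,
) -> Tuple[bool, str]:
    """Run a probe coroutine under a clamped timeout.

    The bounds are caller-supplied placeholders; the real clamp lives in the
    server and is not reproduced here.
    """
    timeout_s = clamp_timeout(requested_s, min_s, max_s)
    try:
        return await asyncio.wait_for(probe(), timeout=timeout_s)
    except asyncio.TimeoutError:
        # Same wording as the Timeout row of the result contract.
        return False, f"Probe timed out after {timeout_s:g}s."
```

A probe that overruns its clamped budget is cancelled and reported as a timeout instead of blocking the operator's button press.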

For the AdminUI probe flow (button → AdminOperationsActor → transient probe actor), see docs/plans/2026-05-28-adminui-driver-pages-design.md §4.


Result contract

All probes return a consistent DriverProbeResult(bool Ok, string? Message, TimeSpan? Latency). The message templates below are uniform across all 8 drivers:

| Outcome | Ok | Message template |
| --- | --- | --- |
| TCP connect fails | false | "Connect failed: {SocketErrorCode}" |
| TCP ok + handshake ok | true | driver-specific descriptive string (see table below) |
| TCP ok but handshake rejected | false | "Reachable at {host}:{port} but {proto} handshake failed: {detail}" |
| Timeout | false | "Probe timed out after {n}s." |

The third row is the key new behavior: a reachable device that answers on the port but rejects the protocol-level handshake now surfaces a false result with a human-readable explanation rather than a false-green TCP-open tick.
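As an illustration of the four outcomes, here is a minimal Python sketch of the result shape and its message templates. The record mirrors DriverProbeResult(bool Ok, string? Message, TimeSpan? Latency); the helper function names are hypothetical and the real type is the C# contract in IDriverProbe.cs.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DriverProbeResult:
    ok: bool
    message: Optional[str] = None
    latency_s: Optional[float] = None  # stands in for TimeSpan? Latency


def connect_failed(socket_error_code: str) -> DriverProbeResult:
    # TCP connect fails: the handshake is never attempted.
    return DriverProbeResult(False, f"Connect failed: {socket_error_code}")


def handshake_ok(message: str, latency_s: float) -> DriverProbeResult:
    # TCP ok + handshake ok: the message is driver-specific.
    return DriverProbeResult(True, message, latency_s)


def handshake_rejected(host: str, port: int, proto: str, detail: str) -> DriverProbeResult:
    # TCP ok but handshake rejected: reachable, yet reported as a failure.
    return DriverProbeResult(
        False, f"Reachable at {host}:{port} but {proto} handshake failed: {detail}"
    )


def timed_out(seconds: int) -> DriverProbeResult:
    return DriverProbeResult(False, f"Probe timed out after {seconds}s.")
```

Keeping the templates in one place is what lets the AdminUI render every driver's outcome the same way.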


Per-driver handshake

| Driver | Handshake | Ok message | Dev-rig target |
| --- | --- | --- | --- |
| Modbus | FC03 (Read Holding Registers, qty 1 @ addr 0) via ModbusTcpTransport. A Modbus exception PDU still proves a real Modbus device → Ok. A non-MBAP reply → handshake fail. | "Modbus FC03 OK" | 10.100.0.35:5020 (Modbus sim) |
| OpcUaClient | DiscoveryClient.GetEndpointsAsync (no session, no app-cert, no auth). ≥ 1 endpoint → Ok. A non-OPC-UA TCP server throws or times out → handshake fail. | "OPC UA: N endpoint(s)" | opc.tcp://10.100.0.35:50000 (opc-plc) |
| S7 | Plc.OpenAsync (COTP CR/CC + S7 setup-communication), check IsConnected, then Close. Wrong rack/slot or a non-S7 server causes OpenAsync to throw → handshake fail. | "S7 connected (CPU …)" | 10.100.0.35:1102 (python-snap7 sim) |
| AbCip | libplctag Tag InitializeAsync (EIP session + CIP Forward Open). A CIP-level error such as tag-not-found still proves the controller answered CIP → Ok. A session/ForwardOpen/connect error → handshake fail. | "CIP session OK" | 10.100.0.35:44818 (CIP sim) |
| AbLegacy | Same libplctag InitializeAsync handshake as AbCip, PCCC protocol family. | "CIP session OK" (PCCC family) | Deferred (no PLC5/SLC sim) |
| TwinCAT | AdsClient.Connect + ReadStateAsync. See degrade semantics below. | "ADS state: {state}" | Deferred (no ADS target) |
| FOCAS | cnc_allclibhndl3 via a direct DllImport("fwlib32") in the probe. See degrade semantics below. | "FOCAS handle OK" | Deferred (no CNC + FWLIB) |
| Galaxy | gRPC unary call to GalaxyRepository.TestConnection on the configured mxaccessgw endpoint. See auth-rejection rule below. | "gateway gRPC OK" | http://10.100.0.48:5120 (mxaccessgw) |
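To make the Modbus row concrete, the sketch below builds the FC03 request (qty 1 @ addr 0) and classifies a reply. It is Python for illustration only and is not the ModbusTcpTransport code; the function names, the default transaction and unit ids, and the failure detail strings are assumptions. Reusing "Modbus FC03 OK" for the exception-PDU case is also an assumption, since the source only says that case counts as Ok.

```python
import struct
from typing import Tuple

FC_READ_HOLDING = 0x03
FC_EXCEPTION = FC_READ_HOLDING | 0x80  # 0x83: exception reply to FC03


def build_fc03_request(transaction_id: int = 1, unit_id: int = 1,
                       address: int = 0, quantity: int = 1) -> bytes:
    """MBAP header (7 bytes) followed by the FC03 PDU (5 bytes)."""
    pdu = struct.pack(">BHH", FC_READ_HOLDING, address, quantity)
    # MBAP length counts the unit id plus the PDU; protocol id is always 0.
    mbap = struct.pack(">HHHB", transaction_id, 0, len(pdu) + 1, unit_id)
    return mbap + pdu


def classify_fc03_reply(reply: bytes, transaction_id: int = 1) -> Tuple[bool, str]:
    """Return (ok, detail) for the bytes read back from the socket."""
    if len(reply) < 8:
        return False, "reply too short for MBAP header + function code"
    tid, proto, length, _unit = struct.unpack(">HHHB", reply[:7])
    if proto != 0 or tid != transaction_id or length != len(reply) - 6:
        return False, "non-MBAP reply"
    function = reply[7]
    if function in (FC_READ_HOLDING, FC_EXCEPTION):
        # An exception PDU still proves a real Modbus device answered.
        return True, "Modbus FC03 OK"
    return False, f"unexpected function code 0x{function:02X}"
```

The point of the classification is that "the device said no in valid Modbus" and "the device is not Modbus at all" land on opposite sides of the Ok flag.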

Historian.Wonderware had a TCP Hello/HelloAck handshake probe before Phase 5, but the Wonderware historian backend (and its driver type and probe) has since been retired. The historian backend is now the external HistorianGateway (a gRPC client package, not a probed IDriver). See Historian.Wonderware.md (retired stub) and ../Historian.md.


Degrade semantics

Three drivers need special handling. TwinCAT and FOCAS have environmental constraints that can prevent the handshake from running on certain hosts, and Galaxy treats an auth rejection as proof of reachability (see the auth-rejection rule below). The degradation principle is: the probe must never produce a result worse than the pre-Phase 5 TCP-only probe. A genuine protocol rejection from a reachable device is a correct RED; an inability to run the handshake at all (no FWLIB, no managed router) degrades to the TCP-reachability message, which is still a green tick but annotated.

TwinCAT degrade

Where the handshake is available:

  • AdsClient.Connect(netId, port) + ReadStateAsync → Ok=true, "ADS state: {state}" (Run / Config / Stop).
  • An ADS route-table rejection from a reachable ADS router is a true RED: "Reachable at {host}:{port} but ADS handshake failed: {detail} — check the target's ADS route table authorizes this host". This is the correct result: the driver would also be unable to function without an authorized route.

Where the handshake is unavailable (headless server, no TwinCAT runtime, the managed AMS router cannot start):

  • Probe degrades to TCP-reachability: Ok=true, "(ADS handshake unavailable on this host — TCP reachability only)".

FOCAS degrade

On a Windows host with the FANUC FWLIB shared library present:

  • cnc_allclibhndl3 is called via a direct DllImport("fwlib32") declared in the probe (the production Wire.WireFocasClient is a pure-managed FOCAS/2 TCP client, not an FWLIB P/Invoke, so the probe carries its own native binding). A successful handle allocation → Ok=true, "FOCAS handle OK".
  • A CNC-level rejection → handshake fail.

On dev, Linux, or macOS (no native FWLIB — UnimplementedFocasClientFactory gates the driver):

  • DllNotFoundException / NotSupportedException is caught and the probe degrades to TCP-reachability: Ok=true, "(FOCAS handshake unavailable on this host — FWLIB absent, TCP reachability only)".
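The TwinCAT and FOCAS rules share one shape, sketched here in Python as an illustration rather than the actual probe code. The two exception classes are hypothetical stand-ins for "the handshake cannot run on this host" (for example DllNotFoundException) and "a reachable device refused the handshake"; the degraded message is passed in by the caller because each driver has its own wording.

```python
from typing import Callable, Optional, Tuple


class HandshakeUnavailable(Exception):
    """Hypothetical: the handshake cannot run on this host at all."""


class HandshakeRejected(Exception):
    """Hypothetical: a reachable device refused the protocol handshake."""


def probe_with_degrade(
    tcp_connect: Callable[[], Optional[str]],  # returns an error code, or None on success
    handshake: Callable[[], str],              # returns the driver-specific Ok message
    host: str,
    port: int,
    proto: str,
    unavailable_message: str,
) -> Tuple[bool, str]:
    error_code = tcp_connect()
    if error_code is not None:
        return False, f"Connect failed: {error_code}"
    try:
        return True, handshake()
    except HandshakeUnavailable:
        # Never worse than the TCP-only probe: green tick, but annotated.
        return True, unavailable_message
    except HandshakeRejected as exc:
        # A genuine protocol rejection from a reachable device is a correct RED.
        return False, f"Reachable at {host}:{port} but {proto} handshake failed: {exc}"
```

Separating "cannot run" from "ran and was refused" is what keeps a missing native library from turning into a false RED.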

Galaxy auth-rejection rule

The probe builds the gRPC channel from the form's config and issues one lightweight unary call. It does not resolve secretref: secrets — the key string in the transient config (possibly empty or unresolved) is used as-is.

  • Unavailable / transport failure → Ok=false (gateway is down or unreachable).
  • Unauthenticated / PermissionDenied → Ok=true, "gateway reachable & speaking gRPC (auth not checked)". An auth rejection proves a live mxaccessgw gRPC server. This is the correct result: the driver's own session layer handles auth, and the probe tests reachability only.

The mxaccessgw client surfaces a rejected key as a typed MxGatewayAuthenticationException / MxGatewayAuthorizationException, not a raw RpcException; the probe catches both and maps them to the reachable result above. (Live verification on 10.100.0.48:5120 with no key returns MxGatewayAuthenticationException("Missing or invalid API key.") → Ok=true.)
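The mapping from call outcome to probe result can be summarized in a small Python sketch. It works on gRPC status names as plain strings so it stays self-contained; the real probe catches the typed mxaccessgw exceptions instead. The function name, the failure message wording, and the treatment of any other status as a failure are assumptions, since the source only specifies the Ok flag for the cases listed above.

```python
from typing import Tuple

AUTH_REJECTED_MESSAGE = "gateway reachable & speaking gRPC (auth not checked)"


def map_gateway_outcome(status: str, detail: str = "") -> Tuple[bool, str]:
    """Map a gRPC status name to (ok, message) for the Galaxy probe."""
    if status == "OK":
        return True, "gateway gRPC OK"
    if status in ("UNAUTHENTICATED", "PERMISSION_DENIED"):
        # An auth rejection proves a live gRPC server is answering.
        return True, AUTH_REJECTED_MESSAGE
    # UNAVAILABLE and transport failures mean the gateway is down or unreachable.
    # Assumption: every other status is also reported as a failure.
    return False, f"gateway call failed: {status} {detail}".rstrip()
```

For example, `map_gateway_outcome("UNAUTHENTICATED", "Missing or invalid API key.")` yields a green result with the "auth not checked" annotation, matching the live-verification note.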

Config note: UseTls must match the endpoint scheme — UseTls:false for an http:// (h2c) gateway, UseTls:true for https://. A mismatch fails the client's own validation (the same constraint the Galaxy driver enforces).
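A hedged sketch of the scheme check, in Python for illustration; `validate_use_tls` and its error text are hypothetical and do not reproduce the mxaccessgw client's own validation.

```python
from urllib.parse import urlparse


def validate_use_tls(endpoint: str, use_tls: bool) -> None:
    """Raise ValueError when UseTls disagrees with the endpoint scheme."""
    scheme = urlparse(endpoint).scheme.lower()
    if scheme not in ("http", "https"):
        raise ValueError(f"unsupported endpoint scheme: {scheme!r}")
    expects_tls = scheme == "https"
    if use_tls != expects_tls:
        raise ValueError(
            f"UseTls:{str(use_tls).lower()} does not match the {scheme}:// endpoint"
        )
```

With the dev-rig gateway, `validate_use_tls("http://10.100.0.48:5120", False)` passes, while the same endpoint with `True` is rejected before any call is made.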


Live-verify scope

| Driver | Live-verify status | Notes |
| --- | --- | --- |
| Modbus | Verified | Dev-rig sim 10.100.0.35:5020; green vs sim, RED vs wrong port / non-Modbus server, timeout vs black-hole IP |
| OpcUaClient | Verified | opc-plc 10.100.0.35:50000; same three-scenario matrix |
| S7 | Verified | python-snap7 10.100.0.35:1102 |
| AbCip | Verified | CIP sim 10.100.0.35:44818 |
| Galaxy | Verified | mxaccessgw 10.100.0.48:5120; Unauthenticated reply counts as Ok |
| AbLegacy | Deferred | No PLC5/SLC sim; unit-proven + code path identical to AbCip |
| TwinCAT | Deferred | No ADS target; unit-proven + degrade guard tested |
| FOCAS | Deferred | No CNC + FWLIB on dev host; degrade guard is the CI-observable path |

Implementation references

  • Phase 5 design: docs/plans/2026-06-16-stillpending-phase-5-probes-design.md
  • Parent roadmap: docs/plans/2026-06-15-stillpending-backlog-design.md §Phase 5
  • AdminUI probe flow: docs/plans/2026-05-28-adminui-driver-pages-design.md §4
  • Per-driver probe implementations: src/Drivers/ZB.MOM.WW.OtOpcUa.Driver.<Type>/<Type>DriverProbe.cs
  • IDriverProbe contract: src/Core/ZB.MOM.WW.OtOpcUa.Core.Abstractions/IDriverProbe.cs
  • Probe dispatch + timeout clamp: src/Server/ZB.MOM.WW.OtOpcUa.Host/Actors/AdminOperationsActor.cs (around line 284)