Plan TCP connection validation (live verification of the existing remote-TCP plumbing)

docs/plans/tcp-connection-validation.md (308 lines):
  Plan to live-verify the RemoteTcpIntegrated and RemoteTcpCertificate
  transports against an actual remote AVEVA Historian. The SDK's
  HistorianWcfBindingFactory already builds all three bindings
  (CreateMdasNetTcpBinding, CreateMdasNetTcpWindowsBinding,
  CreateMdasNetTcpCertificateBinding) but only LocalPipe has been
  exercised end-to-end. Wire format is identical across transports;
  only WCF binding shape and credential negotiation differ.

  Discovery workstreams A/B/C run in parallel (SPN discovery via static
  IL + WCF probe; cert binding requirements via wcf-cert-probe; operator
  preconditions checklist). D blocks on A. Verification tracks V2-V5
  parallelize once V1 (ProbeAsync) confirms the transport is reachable.
  Includes risks (SPN mismatch, cert chain validation, idle disconnect,
  Open2 response delta, compression negotiation, time skew, false-positive
  empty reads), success criteria, five open questions, and explicit
  out-of-scope items filed under the existing write-commands and
  store-forward plans.

  No code changes; no preconditions assumed met. Implementer must satisfy
  §2 preconditions (reachable remote Historian, port 32568 open, test
  account, SPN registered, etc.) before §4 discovery starts.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
dohertj2
2026-05-04 07:20:39 -04:00
parent 6f01b83313
commit 1b31c24c8d
+308
@@ -0,0 +1,308 @@
# TCP Connection Validation Plan
Status: PLAN ONLY (no implementation yet). Scope is **live verification of
the existing remote-TCP transport plumbing**, not new wire-protocol
reverse-engineering — the wire format itself is the same MDAS-encoded SOAP
already verified end-to-end over `LocalPipe`.
Read together with:
- [`docs/reverse-engineering/handoff.md`](../reverse-engineering/handoff.md) — protocol decode state for the reads/events/status helpers
- [`src/AVEVA.Historian.Client/Wcf/HistorianWcfBindingFactory.cs`](../../src/AVEVA.Historian.Client/Wcf/HistorianWcfBindingFactory.cs) — the three already-built bindings
- [`tests/AVEVA.Historian.Client.Tests/HistorianClientIntegrationTests.cs`](../../tests/AVEVA.Historian.Client.Tests/HistorianClientIntegrationTests.cs) — the existing live test pattern (env-var gated, currently `LocalPipe`-only)
## 1. Goal
"TCP transport works" means the production SDK at `src/AVEVA.Historian.Client/`
performs every operation in the CLAUDE.md required surface end-to-end against
a **remote** AVEVA Historian over Net.TCP on port `32568`, with parsed
responses, gated live integration tests, and explicit expectations about
when `RemoteTcpIntegrated` vs `RemoteTcpCertificate` is the right transport
choice.
In scope:
1. **`RemoteTcpIntegrated`** — Net.TCP + SSPI Windows transport credentials,
binding `CreateMdasNetTcpWindowsBinding` (`HistorianWcfBindingFactory.cs:36`),
endpoint `/Hist-Integrated`. The SDK already wires this through
`HistorianClientOptions.Transport = HistorianTransport.RemoteTcpIntegrated`.
2. **`RemoteTcpCertificate`** — Net.TCP + transport security with no client
credential (server cert only), binding `CreateMdasNetTcpCertificateBinding`
(`HistorianWcfBindingFactory.cs:66`), endpoint `/HistCert`. SDK plumbing
exists.
3. **Plain `RemoteTcp`** (no transport security) — `CreateMdasNetTcpBinding`
(`HistorianWcfBindingFactory.cs:13`) is what the `/Retr` endpoint uses
for both transports above. Verify it works end-to-end as the read-side channel.
4. **Verification of every public op** over each transport: `ProbeAsync`,
`ReadRawAsync`, `ReadAggregateAsync`, `ReadAtTimeAsync`, `ReadEventsAsync`,
`BrowseTagNamesAsync`, `GetTagMetadataAsync`, `GetConnectionStatusAsync`,
`GetStoreForwardStatusAsync`, `GetSystemParameterAsync`.
Out of scope:
- New wire-protocol reverse engineering — the binary protocol is identical
across transports; only the WCF binding shape and credential negotiation
differ.
- Discovering or installing remote Historian instances — operator task,
not SDK work.
- Cert generation / CA bootstrap — operator task; the SDK consumes a cert,
not provisions one.
- `RemoteTcp*` for the explicit-credentials path
(`IntegratedSecurity = false` with username + password) — that's a
separate gap (`HistorianSspiClient` currently only handles current-user
credentials).
- Connection pooling, reconnection on idle disconnect, or load balancing
across redundant Historians.
## 2. Preconditions
The work cannot start until **all** of these are true:
| Precondition | Why | Who |
|---|---|---|
| A reachable remote AVEVA Historian (not `localhost`) | TCP transport behavior cannot be exercised against a same-host install (the LocalPipe binding short-circuits the SSPI negotiation that `Net.TCP + Windows transport` actually exercises) | Operator |
| Network reachability on port 32568 (TCP) from the dev workstation | The Historian listens on this port for the `/Hist-Integrated`, `/HistCert`, `/Retr` endpoints | Operator + IT |
| A test account with at least read access | Used for `RemoteTcpIntegrated` SSPI negotiation | Operator |
| The Historian's SPN registered on the host account | SSPI auth fails without a valid SPN. The native client defaults to `NT SERVICE\aahClientAccessPoint` for LocalPipe; the remote path uses something else (likely `MSSQLSvc/host:port` or a custom historian SPN). DISCOVER FIRST | Discovery (§4.A) |
| For `RemoteTcpCertificate`: a server cert exposed at `/HistCert`, with the cert's CA chain trusted by the dev workstation OR an explicit thumbprint pinning hook | TLS handshake aborts otherwise | Operator + Discovery |
| At least one tag with non-zero history rows on the remote Historian | Otherwise `ReadRawAsync` returns empty and we can't distinguish "transport works, no data" from "transport silently broken" | Operator |
| Time skew between dev workstation and remote Historian < 5 minutes | SSPI negotiation rejects out-of-skew tickets | Operator |
If any precondition is missing, the plan **stops** at §4.A discovery and
reports back; don't try to "guess" a workaround.
## 3. Current state
**Already wired and compiling, never live-verified:**
| Path | Status |
|---|---|
| `HistorianWcfBindingFactory.CreateBindingPair` (`:126`) — dispatch on `HistorianTransport` enum | ✅ all three transport branches exist |
| `RemoteTcpIntegrated` branch (`:138`) — uses `MdasNetTcpWindows` for Hist + plain `MdasNetTcp` for Retr, both at `Host:Port` | ✅ wired |
| `RemoteTcpCertificate` branch (`:143`) — uses `MdasNetTcpCertificate` for Hist + plain `MdasNetTcp` for Retr | ✅ wired |
| `MdasMessageEncodingBindingElement` shared across transports | ✅ same encoder used everywhere; not transport-specific |
| `HistorianSspiClient` (P/Invoke `InitializeSecurityContextW` with native flags `0x2081C` / `0x81C`) | ⚠️ only exercised over LocalPipe; need to verify SPN logic works for TCP host SPN |
**Hard-coded LocalPipe in tests** (must not be left in place once TCP is verified):
```text
EventChainDiagnosticTests.cs:30
HistorianClientIntegrationTests.cs:79, :114, :154, :184, :215, :237, :262, :320
```
The list above gives nine line references for
`Transport = HistorianTransport.LocalPipe` (one in
`EventChainDiagnosticTests.cs`, eight in
`HistorianClientIntegrationTests.cs`); re-run a search to confirm the count
before editing. The existing tests skip cleanly when
`HISTORIAN_HOST != "localhost"`; they do NOT need to change to validate
TCP. Instead, add **new** parallel tests gated by a separate env var
(e.g., `HISTORIAN_REMOTE_TCP_HOST`) so both test families run independently.
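
The gating rule can be isolated in a small helper so that every new remote-TCP test skips or fails the same way. The following is a minimal sketch, not SDK code: `RemoteTcpTestGate` and `RemoteTcpTestSettings` are hypothetical names, and the pairing of `HISTORIAN_REMOTE_TCP_HOST` with `HISTORIAN_REMOTE_TCP_SPN` follows the env vars proposed in this plan.

```csharp
using System;

// Hypothetical types for illustration; neither exists in the SDK today.
public sealed record RemoteTcpTestSettings(string Host, string TargetSpn);

public static class RemoteTcpTestGate
{
    public const string HostVar = "HISTORIAN_REMOTE_TCP_HOST";
    public const string SpnVar = "HISTORIAN_REMOTE_TCP_SPN";

    // getEnv is injected so the gate can be exercised without touching the
    // real process environment. Returns null when the test should be skipped.
    public static RemoteTcpTestSettings? TryRead(Func<string, string?> getEnv)
    {
        string? host = getEnv(HostVar);
        if (string.IsNullOrWhiteSpace(host))
        {
            return null; // Remote host not configured: caller skips the test.
        }

        string? spn = getEnv(SpnVar);
        if (string.IsNullOrWhiteSpace(spn))
        {
            // Host set but SPN missing is a setup error, not a skip.
            throw new InvalidOperationException($"Set {SpnVar} per §4.A");
        }

        return new RemoteTcpTestSettings(host.Trim(), spn.Trim());
    }
}
```

In a test, `RemoteTcpTestGate.TryRead(Environment.GetEnvironmentVariable)` would return `null` on machines without a remote Historian, so the LocalPipe family is unaffected.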
## 4. Discovery workstreams
A, B, and C can run in any order and **are parallelizable**, since each
hits a different surface (binary inspection vs probe vs operator I/O). D
is sequential: it needs the SPN that A produces.
### A. SPN discovery (parallel-safe — read-only binary + WCF probe)
The native client's TCP SPN is currently unknown. Find it via:
1. **Static IL** — search `current/aahClientManaged.dll` for the strings
`"NT SERVICE"`, `"aahClient"`, and `"MSSQLSvc"` using
`tools/AVEVA.Historian.ReverseEngineering` (the `methods` and
`dnlib-method --instructions` commands handled the SSPI flag discovery
the same way). The `HistorianClientOptions.TargetSpn` default
(`NT SERVICE\aahClientAccessPoint`) is the LocalPipe SPN — TCP almost
certainly differs.
2. **WCF probe** — `tools/AVEVA.Historian.ReverseEngineering -- wcf-probe
<remote-host> 32568` against the remote Historian. Capture the SOAP
fault on failure: it usually echoes the expected SPN in the
`wsa:FaultDetail`.
3. **Cross-reference** with `setspn -L <historian-svc-account>` on the
remote Historian (requires operator access).
Output: a documented `TargetSpn` value for `RemoteTcpIntegrated` use, plus
how the SPN is computed (likely host-derived).
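
As a cruder cross-check on step 1, the candidate strings can also be located with a raw byte scan. The sketch below is an assumption-laden stand-in for the repo tool, not a replacement for it: `BinaryStringScanner` is a hypothetical name, and it only looks for each needle encoded as UTF-16LE (how .NET user-string literals are stored) or as ASCII.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical helper for illustration; not part of the repo tooling.
public static class BinaryStringScanner
{
    // Returns every byte offset where `needle` occurs in `data`, together
    // with the encoding under which it matched ("utf-16le" or "ascii").
    public static List<(int Offset, string Kind)> Find(byte[] data, string needle)
    {
        if (string.IsNullOrEmpty(needle))
        {
            throw new ArgumentException("Needle must be non-empty.", nameof(needle));
        }

        var hits = new List<(int Offset, string Kind)>();
        var encodings = new[] { ("utf-16le", Encoding.Unicode), ("ascii", Encoding.ASCII) };
        foreach (var (kind, encoding) in encodings)
        {
            byte[] pattern = encoding.GetBytes(needle);
            // Naive scan: fine for a one-off search over a single DLL.
            for (int i = 0; i + pattern.Length <= data.Length; i++)
            {
                int j = 0;
                while (j < pattern.Length && data[i + j] == pattern[j]) j++;
                if (j == pattern.Length) hits.Add((i, kind));
            }
        }
        return hits;
    }
}
```

A typical call would be `BinaryStringScanner.Find(File.ReadAllBytes(path), "MSSQLSvc")`; a hit only shows the literal exists in the binary, not that it is used to build the TCP SPN, so the IL-level inspection is still needed.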
### B. Cert binding discovery (parallel-safe — read-only WCF probe)
For `RemoteTcpCertificate`:
1. **WCF cert probe** — `tools/AVEVA.Historian.ReverseEngineering --
wcf-cert-probe <remote-host> 32568 <expected-cn>`. The probe captures
the cert chain and reports CN/SAN.
2. **Cert validation policy** — current binding (`HistorianWcfBindingFactory.cs:74`)
sets `ClientCredentialType = None`; the server cert is validated by the
default WCF chain check. Document what's required: trusted CA, or
thumbprint pinning, or `X509ServiceCertificateAuthentication.
CertificateValidationMode = PeerOrChainTrust`.
3. **Verify endpoint identity** — `/HistCert` may require an `EndpointIdentity`
(DNS or RSA) on the `EndpointAddress`. Current code (`HistorianWcfBindingFactory.cs:152`)
does not set one. Test whether identity verification fails without it.
Output: documented cert validation requirements + whether
`EndpointAddress(uri, identity)` overload is needed.
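
If discovery lands on thumbprint pinning, the comparison itself is small. The following is a sketch under stated assumptions: `ThumbprintPin` is a hypothetical helper, the expected value is a SHA-1 thumbprint (40 hex characters, as `X509Certificate2.Thumbprint` reports), and wiring it into the WCF certificate validation path is deliberately left out because that depends on what §4.B finds.

```csharp
using System;
using System.Linq;
using System.Security.Cryptography.X509Certificates;

// Hypothetical helper for illustration; not part of the SDK.
public static class ThumbprintPin
{
    // Drops spaces, colons, and invisible characters that tend to come along
    // when a thumbprint is copied from a certificate dialog, then upper-cases.
    public static string Normalize(string thumbprint) =>
        new string(thumbprint
            .Where(c => Uri.IsHexDigit(c))
            .Select(c => char.ToUpperInvariant(c))
            .ToArray());

    public static bool Matches(X509Certificate2 serverCert, string expectedThumbprint) =>
        Matches(serverCert.Thumbprint, expectedThumbprint);

    public static bool Matches(string actualThumbprint, string expectedThumbprint)
    {
        string expected = Normalize(expectedThumbprint);
        // Reject anything that is not a full SHA-1 thumbprint so a truncated
        // or empty pin can never match.
        return expected.Length == 40
            && string.Equals(Normalize(actualThumbprint), expected, StringComparison.Ordinal);
    }
}
```

Pinning trades CA trust management for a value that must be updated whenever the server cert is renewed, which is worth recording in the discovery output either way.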
### C. Operator setup checklist (parallel-safe — operator-side)
Produce a one-page checklist the operator runs against the remote Historian
to confirm preconditions. Includes:
- `Test-NetConnection -ComputerName <host> -Port 32568` from dev workstation
- `Get-Service aahHistorian*` on the remote (verify running)
- `setspn -L <svc-account>` to capture registered SPNs
- `sqlcmd -E -S <host> -d Runtime -Q "SELECT TOP 1 TagName FROM Tag"` to
prove the Runtime DB is reachable with the operator's credentials
- `w32tm /query /status` to confirm time sync vs the Historian
Output: `docs/plans/tcp-validation-operator-checklist.md` (or appendix in
this doc) the operator can hand back filled in.
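
For environments where the reachability check needs to run from test code rather than PowerShell, a rough equivalent of the `Test-NetConnection` step can be sketched as below. `TcpReachability` is a hypothetical helper, and the sketch assumes .NET 5 or later for the cancellable `TcpClient.ConnectAsync` overload. It only proves that something accepts TCP connections on the port, not that it is a Historian.

```csharp
using System;
using System.Net.Sockets;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical helper for illustration; not part of the SDK.
public static class TcpReachability
{
    // Returns true if a TCP connection to host:port is accepted within
    // `timeout`; false on refusal, resolution failure, or timeout.
    public static async Task<bool> CanConnectAsync(string host, int port, TimeSpan timeout)
    {
        using var client = new TcpClient();
        using var cts = new CancellationTokenSource(timeout);
        try
        {
            await client.ConnectAsync(host, port, cts.Token);
            return true;
        }
        catch (Exception ex) when (ex is SocketException or OperationCanceledException)
        {
            return false;
        }
    }
}
```

A call such as `await TcpReachability.CanConnectAsync(host, 32568, TimeSpan.FromSeconds(5))` could run before V1 to separate firewall problems from auth problems.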
### D. Auth-chain delta vs LocalPipe (sequential — needs A's SPN)
Once A returns an SPN, run `tools/AVEVA.Historian.ReverseEngineering --
wcf-probe` against `/Hist-Integrated` over TCP and confirm:
1. The `Hist.GetV → Hist.ValCl × N → Hist.Open2` chain runs the same number
of ValCl rounds (LocalPipe was 2; TCP may be 2 or 3 depending on whether
the underlying transport already negotiated something).
2. The `OpenConnection2` request bytes are identical (the body is
transport-agnostic — only the WCF wrapper differs).
3. The Open2 response carries the same `outParameters` shape (42 bytes
with version, session GUID, FILETIMEs, status). If TCP returns a
different shape, the parser at `HistorianWcfAuthChainHelper.cs` needs a
transport-aware path.
Output: byte-for-byte diff between LocalPipe and TCP capture, with any
deltas noted.
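
The byte-for-byte comparison can be sketched as a small diff that reports differing ranges and lets the caller mask session-specific fields. This is an illustrative helper, not existing tooling: `CaptureDiff` is a hypothetical name, and the offsets of the session GUID and FILETIMEs are not known here, so the ignored ranges are supplied by the caller.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper for illustration; not part of the repo tooling.
public static class CaptureDiff
{
    // Returns contiguous [Start, End) ranges where the two captures differ.
    // Offsets inside an ignored (Start, Length) range are treated as equal.
    // A length mismatch is reported as its own trailing range.
    public static List<(int Start, int End)> Diff(
        byte[] a, byte[] b, params (int Start, int Length)[] ignored)
    {
        var ranges = new List<(int Start, int End)>();
        int common = Math.Min(a.Length, b.Length);
        int runStart = -1;
        for (int i = 0; i < common; i++)
        {
            bool differs = a[i] != b[i] && !IsIgnored(i, ignored);
            if (differs && runStart < 0)
            {
                runStart = i; // A new differing run begins here.
            }
            else if (!differs && runStart >= 0)
            {
                ranges.Add((runStart, i)); // The run ended just before i.
                runStart = -1;
            }
        }
        if (runStart >= 0) ranges.Add((runStart, common));
        if (a.Length != b.Length) ranges.Add((common, Math.Max(a.Length, b.Length)));
        return ranges;
    }

    private static bool IsIgnored(int offset, (int Start, int Length)[] ignored)
    {
        foreach (var (start, length) in ignored)
        {
            if (offset >= start && offset < start + length) return true;
        }
        return false;
    }
}
```

An empty result with the session fields masked means the two captures agree; any remaining range is a delta to note in the output.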
## 5. Verification workstreams
Once §4.A and §4.B return actionable answers, every operation gets its own
verification track. **Tracks V2-V5 are parallelizable** once V1 passes,
because each exercises a different SDK method and they don't share state.
| Track | Op | Live test to author | Parallel-safe? |
|---|---|---|---|
| V1 | `ProbeAsync` | `ProbeAsync_RemoteTcpIntegrated_ReturnsTrue` + `ProbeAsync_RemoteTcpCertificate_ReturnsTrue` | ✅ |
| V2 | Reads (`ReadRawAsync`, `ReadAggregateAsync`, `ReadAtTimeAsync`) | mirror of the existing 3 LocalPipe tests, env-var gated by `HISTORIAN_REMOTE_TCP_HOST` | ✅ |
| V3 | `ReadEventsAsync` | mirror of `ReadEventsAsync_AgainstLocalHistorian_DoesNotThrow` | ✅ |
| V4 | Tag ops (`BrowseTagNamesAsync`, `GetTagMetadataAsync`) | mirror of the two LocalPipe tests | ✅ |
| V5 | Status helpers (`GetConnectionStatusAsync`, `GetStoreForwardStatusAsync`, `GetSystemParameterAsync`) | mirror of the three LocalPipe tests | ✅ |
The only sequential dependency: V1 must pass before V2-V5 are meaningful
(if `ProbeAsync` returns false, the other tracks will fail too, for the same
transport reasons).
For each track, the test pattern is:
```csharp
// Skip unless a remote host is configured and the test runs on Windows.
string? host = Environment.GetEnvironmentVariable("HISTORIAN_REMOTE_TCP_HOST");
if (string.IsNullOrWhiteSpace(host) || !OperatingSystem.IsWindows()) return;
HistorianClient client = new(new HistorianClientOptions
{
    Host = host,
    Port = 32568,
    IntegratedSecurity = true,
    Transport = HistorianTransport.RemoteTcpIntegrated,
    // A configured host without an SPN is a setup error: fail loudly, don't skip.
    TargetSpn = Environment.GetEnvironmentVariable("HISTORIAN_REMOTE_TCP_SPN")
        ?? throw new InvalidOperationException("Set HISTORIAN_REMOTE_TCP_SPN per §4.A"),
});
// ... existing test body, unchanged ...
```
Add a parallel set for `RemoteTcpCertificate` gated by
`HISTORIAN_REMOTE_TCPCERT_HOST` + `HISTORIAN_REMOTE_TCPCERT_THUMBPRINT`.
## 6. Risks and mitigations
| Risk | Likelihood | Impact | Mitigation |
|---|---|---|---|
| **SPN mismatch** — TCP SSPI negotiation rejects with `SEC_E_TARGET_UNKNOWN` | High | All TCP ops fail | §4.A discovery first; expose `TargetSpn` as already done in `HistorianClientOptions` |
| **Cert chain validation rejects** — server cert not trusted by dev workstation | High for `Certificate` transport | Cert transport unusable | §4.B: document required CA / pinning hook; consider a `ServerCertificateValidator` callback option |
| **Endpoint identity required** — `/HistCert` rejects without DNS identity in `EndpointAddress` | Medium | Cert transport unusable | §4.B step 3; if confirmed, add overload to `CreateEndpointAddress` |
| **Wire-level idle disconnect** — TCP connection dropped after N seconds idle, mid-test | Medium | Flaky tests | Set `RequestTimeout` low enough to fail fast; add reconnect logic if seen repeatedly |
| **Open2 response differs over TCP** — extra bytes in `outParameters` for TCP-specific session state | Low (reads/events use the same ConnectionMode 0x402 regardless of transport) | Auth chain breaks | §4.D byte-diff captures it; if found, transport-aware parser branch |
| **Compression negotiation** — `HistorianClientOptions.Compression` unset on LocalPipe; over TCP, the server might enable gzip and our `MdasMessageEncoder` doesn't unwrap it | Medium-Low | Requests succeed, responses garbled | Confirm compression off in initial probe; add gzip handling later if needed |
| **Time skew** — Kerberos ticket clock skew > 5min rejects auth | Low | Total auth failure | Operator checklist (§4.C) catches this |
| **`Probe` succeeds but reads silently empty** — common when tag-permissions don't grant the test account read access | Medium | False positive in V1, V2 fails | V2 asserts `samples.Count > 0` for a tag known to have data |
## 7. Success criteria
For `RemoteTcpIntegrated`:
- [ ] `ProbeAsync` returns `true` against the remote host
- [ ] All five Verification tracks (V1-V5) pass against the remote host
- [ ] Captured wire bytes for `Open2` and `StartQuery2` match the LocalPipe
captures (modulo session-specific GUIDs / FILETIMEs)
- [ ] Test count goes from 114 to 124 (10 new live tests) when both env vars are set
- [ ] All existing `Transport = HistorianTransport.LocalPipe` tests still
pass when only the LocalPipe env var is set (no regression)
For `RemoteTcpCertificate`: same as above, gated by the cert-specific env
vars. The ReadEvents track may be skipped if the cert account doesn't have
AnE permission.
For documentation:
- [ ] `README.md` operation status table updated: `RemoteTcpIntegrated` and
`RemoteTcpCertificate` transports change from "wired but only
`LocalPipe` has live verification" to "live-verified"
- [ ] `docs/reverse-engineering/handoff.md` gets a new section documenting
any LocalPipe vs TCP wire-byte deltas found in §4.D
## 8. Open questions
1. Is there even a remote Historian available to test against? If not, this
plan stalls at §2 preconditions until one is provisioned. (Note:
handoff.md mentions a `10.100.0.x` remote Historian and a Debian relay
used in earlier sessions — verify whether that infrastructure is still
live and reachable.)
2. Does `RemoteTcpCertificate` use mutual-TLS or just server-cert? Current
binding (`HistorianWcfBindingFactory.cs:74`) sets
`ClientCredentialType = None` (server-cert only). Confirm against the
actual `/HistCert` endpoint behavior.
3. Does the `/Retr` channel need its own auth, or does it inherit from the
`/Hist-Integrated` session? Current code uses plain `MdasNetTcp` (no
transport security) for Retr in both `RemoteTcpIntegrated` and
`RemoteTcpCertificate` configurations — is that actually how the native
client does it, or does the native push security on Retr too?
4. What happens if the cert presented by `/HistCert` has a SAN that doesn't
match the host the SDK connected to? Decide pinning vs DNS validation.
5. The `HistorianClientOptions.Compression` flag exists but is not consumed
anywhere in the WCF layer. Is compression a transport concern or an
application-payload concern? Need to know before TCP — the bandwidth
savings only matter over WAN.
## 9. Parallelization summary
Within the discovery phase: A, B, C run in parallel. D blocks on A.
Within the verification phase: V1 must pass first, then V2-V5 parallel.
Both `RemoteTcpIntegrated` and `RemoteTcpCertificate` verification tracks
can run independently from each other.
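
The ordering rule for the verification phase (V1 gates, then V2-V5 fan out) can be expressed as a short driver. This is a sketch only: `VerificationPlan` is a hypothetical name, each track is modeled as a `Func<Task<bool>>`, and in practice the test runner may already provide this ordering.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

// Hypothetical driver for illustration; not part of the SDK or test suite.
public static class VerificationPlan
{
    // Runs the V1 gate first. Only if it returns true are the remaining
    // tracks started, all concurrently. Returns track name -> passed.
    public static async Task<IReadOnlyDictionary<string, bool>> RunAsync(
        Func<Task<bool>> gate,
        IReadOnlyDictionary<string, Func<Task<bool>>> tracks)
    {
        var results = new Dictionary<string, bool>();
        results["V1"] = await gate();
        if (!results["V1"])
        {
            return results; // Transport unreachable: V2-V5 are not meaningful.
        }

        // Invoking each delegate starts its track; WhenAll waits for all of them.
        var running = tracks.ToDictionary(t => t.Key, t => t.Value());
        await Task.WhenAll(running.Values);
        foreach (var (name, task) in running)
        {
            results[name] = await task; // Already completed; no blocking here.
        }
        return results;
    }
}
```

If any track throws instead of returning `false`, `Task.WhenAll` rethrows and the whole run fails, which is the stricter of the two possible behaviours.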
End-to-end estimated wall-clock if all preconditions are met:
- §4 discovery: half a day if SPN is straightforward, longer if cert chain
surprises bite.
- §5 verification: 2-4 hours given the test scaffolding is largely a copy
of the existing LocalPipe tests.
If only one developer works the plan: ~1 day. With two developers
parallelizing across `RemoteTcpIntegrated` and `RemoteTcpCertificate`:
~half a day.
## 10. Out of scope (filed under separate plans)
- **Write commands over TCP** — `docs/plans/write-commands-reverse-engineering.md`
covers writes; once that lands, this doc adds a §11 "TCP write
verification" track.
- **Store/Forward sidecar over TCP** — covered by
`docs/plans/store-forward-cache-reverse-engineering.md`. SF probably
uses a separate IPC anyway, not Net.TCP.
- **Explicit-credentials TCP** — `IntegratedSecurity = false` with
username + password requires `HistorianSspiClient` to support explicit
credentials, which is its own task. Net.TCP with Windows transport
security can authenticate either the current user or explicitly supplied
credentials, but the SDK's SSPI client only handles the current user
today.