histsdk/docs/plans/tcp-connection-validation.md
dohertj2 6888b8c55a Wire SDK for remote-TCP end to end; live-verify RemoteTcpIntegrated
Executes docs/plans/tcp-connection-validation.md. Full read-only SDK
surface now works against a remote AVEVA Historian over Net.TCP with
Windows transport authentication. 124/124 tests pass; the +10 new live
integration tests in RemoteTcpIntegrationTests.cs are gated by
HISTORIAN_REMOTE_TCP_HOST + HISTORIAN_REMOTE_TCP_TAG.

Two SDK bugs found while executing the plan:

1. Historian2020ProtocolDialect.ReadRawAsync / ReadAggregateAsync /
   ReadAtTimeAsync / ReadEventsAsync had explicit
   `if (_options.Transport != HistorianTransport.LocalPipe) return Missing<T>`
   guards. These were a guardrail from before the orchestrators handled
   TCP; the orchestrators have always used CreateBindingPair(options)
   which dispatches on transport correctly. Gates removed.

2. HistorianWcfStatusClient and HistorianWcfEventOrchestrator hardcoded
   HistorianWcfBindingFactory.CreatePipeEndpointAddress for the auxiliary
   services (Stat, Trx, Retr). Worked for LocalPipe; for TCP it produced
   an EndpointAddress with scheme net.pipe attached to a TCP binding
   (channel factory rejected the URI). Worse, when only the endpoint was
   transport-aware, the binding still requested a Windows-transport-
   security upgrade that the Stat endpoint over TCP doesn't support
   (auxiliaries don't repeat the auth — the Hist session is already
   authenticated). Added two helpers:
   - HistorianWcfBindingFactory.CreateAuxiliaryEndpointAddress(options, name)
     -> net.pipe for LocalPipe, net.tcp for remote
   - HistorianWcfBindingFactory.CreateAuxiliaryBinding(options)
     -> NamedPipe for LocalPipe, plain MdasNetTcpBinding for remote
   Both call sites updated.

Live verification against the remote (probed previously in prior
sessions; reachability re-confirmed today):
- ProbeAsync over RemoteTcpIntegrated and RemoteTcpCertificate
- ReadRawAsync (8 samples returned for SysTimeSec)
- ReadAggregateAsync (TimeWeightedAverage, 1-min cycle, 10-min window)
- ReadAtTimeAsync (3 timestamps)
- BrowseTagNamesAsync (finds the test tag)
- GetTagMetadataAsync (full metadata populated)
- ReadEventsAsync (chain runs without throwing)
- GetConnectionStatusAsync (ConnectedToServer=true)
- GetSystemParameterAsync (HistorianVersion="20,0,000,000")

The default 'NT SERVICE\aahClientAccessPoint' SPN turned out to work
for the remote too — discovery workstream A (SPN-finding) was not
needed in practice.

README and the TCP plan doc updated to reflect the executed status.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-04 07:33:50 -04:00

# TCP Connection Validation Plan
Status: **EXECUTED on 2026-05-04**. RemoteTcpIntegrated transport is now
live-verified end-to-end against `10.100.0.48` (Historian InterfaceVersion=11)
for ProbeAsync, ReadRawAsync, ReadAggregateAsync, ReadAtTimeAsync,
ReadEventsAsync, BrowseTagNamesAsync, GetTagMetadataAsync,
GetConnectionStatusAsync, GetSystemParameterAsync. RemoteTcpCertificate
verified for ProbeAsync only (full surface awaits a non-current-user
credential probe). Test count 114 → 124 (+10) per success criteria.
Two SDK bugs were uncovered and fixed during execution:
1. `Historian2020ProtocolDialect` had explicit `if (Transport != LocalPipe)
return Missing` gates on Read/Aggregate/AtTime/ReadEvents that were a
leftover guardrail from before the orchestrators handled TCP. Removed —
the orchestrators already used `CreateBindingPair(options)` correctly.
2. `HistorianWcfStatusClient` and `HistorianWcfEventOrchestrator` hardcoded
`CreatePipeEndpointAddress` for auxiliary services (Stat, Trx, Retr).
Added `HistorianWcfBindingFactory.CreateAuxiliaryEndpointAddress` and
`CreateAuxiliaryBinding` helpers that dispatch on `Transport`; for TCP
the auxiliaries use plain `MdasNetTcpBinding` (no transport upgrade —
the Hist endpoint already authenticated the session).
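
The transport dispatch behind fix 2 can be pictured with a small, self-contained sketch. It is not the SDK's code: `TransportKind`, `AuxiliaryEndpointSketch`, and the URI shape (service name as the whole path, `localhost` as the pipe host) are assumptions made for illustration, and only BCL types are used.

```csharp
using System;

// Hypothetical stand-in for the SDK's HistorianTransport enum.
public enum TransportKind { LocalPipe, RemoteTcpIntegrated, RemoteTcpCertificate }

public static class AuxiliaryEndpointSketch
{
    // Builds the URI for an auxiliary service such as "Stat", "Trx", or "Retr".
    // Assumption: the service name is the entire URI path; the real SDK may differ.
    public static Uri CreateAuxiliaryUri(TransportKind transport, string host, int port, string serviceName)
    {
        if (string.IsNullOrWhiteSpace(serviceName))
            throw new ArgumentException("Service name is required.", nameof(serviceName));

        string path = "/" + serviceName.TrimStart('/');

        return transport switch
        {
            // Named pipes are machine-local, so the remote host and port are ignored.
            TransportKind.LocalPipe => new UriBuilder("net.pipe", "localhost") { Path = path }.Uri,

            // Both remote transports address the auxiliary service over net.tcp.
            TransportKind.RemoteTcpIntegrated or TransportKind.RemoteTcpCertificate
                => new UriBuilder("net.tcp", host, port, path).Uri,

            _ => throw new ArgumentOutOfRangeException(nameof(transport)),
        };
    }
}
```

The point of routing every call site through one function is that the scheme always follows the transport, so a `net.pipe` address can no longer be paired with a TCP binding by accident.
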
Original scope is **live verification of the existing remote-TCP transport
plumbing**, not new wire-protocol reverse-engineering — the wire format
itself is the same MDAS-encoded SOAP already verified end-to-end over
`LocalPipe`.
Read together with:
- [`docs/reverse-engineering/handoff.md`](../reverse-engineering/handoff.md) — protocol decode state for the reads/events/status helpers
- [`src/AVEVA.Historian.Client/Wcf/HistorianWcfBindingFactory.cs`](../../src/AVEVA.Historian.Client/Wcf/HistorianWcfBindingFactory.cs) — the three already-built bindings
- [`tests/AVEVA.Historian.Client.Tests/HistorianClientIntegrationTests.cs`](../../tests/AVEVA.Historian.Client.Tests/HistorianClientIntegrationTests.cs) — the existing live test pattern (env-var gated, currently `LocalPipe`-only)
## 1. Goal
"TCP transport works" means the production SDK at `src/AVEVA.Historian.Client/`
performs every operation in the CLAUDE.md required surface end-to-end against
a **remote** AVEVA Historian over Net.TCP on port `32568`, with parsed
responses, gated live integration tests, and explicit expectations about
when `RemoteTcpIntegrated` vs `RemoteTcpCertificate` is the right transport
choice.
In scope:
1. **`RemoteTcpIntegrated`** — Net.TCP + SSPI Windows transport credentials,
binding `CreateMdasNetTcpWindowsBinding` (`HistorianWcfBindingFactory.cs:36`),
endpoint `/Hist-Integrated`. The SDK already wires this through
`HistorianClientOptions.Transport = HistorianTransport.RemoteTcpIntegrated`.
2. **`RemoteTcpCertificate`** — Net.TCP + transport security with no client
credential (server cert only), binding `CreateMdasNetTcpCertificateBinding`
(`HistorianWcfBindingFactory.cs:66`), endpoint `/HistCert`. SDK plumbing
exists.
3. **Plain `RemoteTcp`** (no transport security) — `CreateMdasNetTcpBinding`
(`HistorianWcfBindingFactory.cs:13`) is what the `/Retr` endpoint uses
for both above. Verify it works end-to-end as the read-side channel.
4. **Verification of every public op** over each transport: `ProbeAsync`,
`ReadRawAsync`, `ReadAggregateAsync`, `ReadAtTimeAsync`, `ReadEventsAsync`,
`BrowseTagNamesAsync`, `GetTagMetadataAsync`, `GetConnectionStatusAsync`,
`GetStoreForwardStatusAsync`, `GetSystemParameterAsync`.
Out of scope:
- New wire-protocol reverse engineering — the binary protocol is identical
across transports; only the WCF binding shape and credential negotiation
differ.
- Discovering or installing remote Historian instances — operator task,
not SDK work.
- Cert generation / CA bootstrap — operator task; the SDK consumes a cert,
not provisions one.
- `RemoteTcp*` for the explicit-credentials path
(`IntegratedSecurity = false` with username + password) — that's a
separate gap (`HistorianSspiClient` currently only handles current-user
credentials).
- Connection pooling, reconnection on idle disconnect, or load balancing
across redundant Historians.
## 2. Preconditions
The work cannot start until **all** of these are true:
| Precondition | Why | Who |
|---|---|---|
| A reachable remote AVEVA Historian (not `localhost`) | TCP transport behavior cannot be exercised against a same-host install (the LocalPipe binding short-circuits the SSPI negotiation that `Net.TCP + Windows transport` actually exercises) | Operator |
| Network reachability on port 32568 (TCP) from the dev workstation | The Historian listens on this port for the `/Hist-Integrated`, `/HistCert`, `/Retr` endpoints | Operator + IT |
| A test account with at least read access | Used for `RemoteTcpIntegrated` SSPI negotiation | Operator |
| The Historian's SPN registered on the host account | SSPI auth fails without a valid SPN. The native client defaults to `NT SERVICE\aahClientAccessPoint` for LocalPipe; remote may use something else (likely `MSSQLSvc/host:port` or a custom historian SPN). DISCOVER FIRST | Discovery (§4.A) |
| For `RemoteTcpCertificate`: a server cert exposed at `/HistCert`, with the cert's CA chain trusted by the dev workstation OR an explicit thumbprint pinning hook | TLS handshake aborts otherwise | Operator + Discovery |
| At least one tag with non-zero history rows on the remote Historian | Otherwise `ReadRawAsync` returns empty and we can't distinguish "transport works, no data" from "transport silently broken" | Operator |
| Time skew between dev workstation and remote Historian < 5 minutes | SSPI negotiation rejects out-of-skew tickets | Operator |
If any precondition is missing, the plan **stops** at §4.A discovery and
reports back; don't try to "guess" a workaround.
## 3. Current state
**Already wired and compiling, never live-verified:**
| Path | Status |
|---|---|
| `HistorianWcfBindingFactory.CreateBindingPair` (`:126`) — dispatch on `HistorianTransport` enum | ✅ all three transport branches exist |
| `RemoteTcpIntegrated` branch (`:138`) — uses `MdasNetTcpWindows` for Hist + plain `MdasNetTcp` for Retr, both at `Host:Port` | ✅ wired |
| `RemoteTcpCertificate` branch (`:143`) — uses `MdasNetTcpCertificate` for Hist + plain `MdasNetTcp` for Retr | ✅ wired |
| `MdasMessageEncodingBindingElement` shared across transports | ✅ same encoder used everywhere; not transport-specific |
| `HistorianSspiClient` (P/Invoke `InitializeSecurityContextW` with native flags `0x2081C` / `0x81C`) | ⚠️ only exercised over LocalPipe; need to verify SPN logic works for TCP host SPN |
**Hard-coded LocalPipe in tests** (left in place; TCP gets its own parallel tests):
```text
EventChainDiagnosticTests.cs:30
HistorianClientIntegrationTests.cs:79, :114, :154, :184, :215, :237, :262, :320
```
The listing above covers nine instances of `Transport = HistorianTransport.LocalPipe`. The
existing tests skip cleanly when `HISTORIAN_HOST != "localhost"`, so they do
NOT need to change to validate TCP. Instead, add **new** parallel tests
gated by a separate env var (e.g., `HISTORIAN_REMOTE_TCP_HOST`) so both
test families run independently.
## 4. Discovery workstreams
**A, B, and C are parallelizable** and can run in any order, since each
hits a different surface (binary inspection vs probe vs operator I/O). D
is sequential: it needs the SPN that A produces.
### A. SPN discovery (parallel-safe — read-only binary + WCF probe)
The native client's TCP SPN is currently unknown. Find it via:
1. **Static IL** — search `current/aahClientManaged.dll` for the strings
`"NT SERVICE"`, `"aahClient"`, and `"MSSQLSvc"` using
`tools/AVEVA.Historian.ReverseEngineering` (the `methods` and
`dnlib-method --instructions` commands handled the SSPI flag discovery
the same way). The `HistorianClientOptions.TargetSpn` default
(`NT SERVICE\aahClientAccessPoint`) is the LocalPipe SPN — TCP almost
certainly differs.
2. **WCF probe** — `tools/AVEVA.Historian.ReverseEngineering -- wcf-probe
<remote-host> 32568` against the remote Historian. Capture the SOAP
fault on failure: it usually echoes the expected SPN in the
`wsa:FaultDetail`.
3. **Cross-reference** with `setspn -L <historian-svc-account>` on the
remote Historian (requires operator access).
Output: a documented `TargetSpn` value for `RemoteTcpIntegrated` use, plus
how the SPN is computed (likely host-derived).
### B. Cert binding discovery (parallel-safe — read-only WCF probe)
For `RemoteTcpCertificate`:
1. **WCF cert probe** — `tools/AVEVA.Historian.ReverseEngineering --
wcf-cert-probe <remote-host> 32568 <expected-cn>`. The probe captures
the cert chain and reports CN/SAN.
2. **Cert validation policy** — current binding (`HistorianWcfBindingFactory.cs:74`)
sets `ClientCredentialType = None`; the server cert is validated by the
default WCF chain check. Document what's required: a trusted CA, thumbprint
pinning, or
`X509ServiceCertificateAuthentication.CertificateValidationMode = PeerOrChainTrust`.
3. **Verify endpoint identity** — `/HistCert` may require an `EndpointIdentity`
(DNS or RSA) on the `EndpointAddress`. Current code (`HistorianWcfBindingFactory.cs:152`)
does not set one. Test whether identity verification fails without it.
Output: documented cert validation requirements + whether
`EndpointAddress(uri, identity)` overload is needed.
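
If thumbprint pinning turns out to be the chosen policy, the comparison itself is simple but easy to get wrong when the pinned value is pasted from a certificate dialog with spaces or invisible characters. A minimal sketch, assuming the pinned value arrives as a string (for example from an env var) and the presented value comes from `X509Certificate2.Thumbprint`; `ThumbprintPinSketch` and its members are hypothetical names, not SDK API:

```csharp
using System;
using System.Linq;

public static class ThumbprintPinSketch
{
    // Keeps only hex digits and upper-cases them, so "ab:cd", "AB CD", and
    // "abcd" all normalize to "ABCD".
    public static string Normalize(string thumbprint) =>
        new string(thumbprint
            .Where(c => Uri.IsHexDigit(c))
            .Select(c => char.ToUpperInvariant(c))
            .ToArray());

    // True when both values normalize to the same non-empty hex string.
    public static bool Matches(string presentedThumbprint, string pinnedThumbprint)
    {
        string pinned = Normalize(pinnedThumbprint);
        return pinned.Length > 0 &&
               string.Equals(Normalize(presentedThumbprint), pinned, StringComparison.Ordinal);
    }
}
```

Dropping every non-hex character is lenient by design: it also hides typos such as a stray letter outside `A`-`F`, so a stricter variant could reject unexpected characters instead of discarding them.
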
### C. Operator setup checklist (parallel-safe — operator-side)
Produce a one-page checklist the operator runs against the remote Historian
to confirm preconditions. Includes:
- `Test-NetConnection -ComputerName <host> -Port 32568` from dev workstation
- `Get-Service aahHistorian*` on the remote (verify running)
- `setspn -L <svc-account>` to capture registered SPNs
- `sqlcmd -E -S <host> -d Runtime -Q "SELECT TOP 1 TagName FROM Tag"` to
prove the Runtime DB is reachable with the operator's credentials
- `w32tm /query /status` to confirm time sync vs the Historian
Output: `docs/plans/tcp-validation-operator-checklist.md` (or appendix in
this doc) the operator can hand back filled in.
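
The `Test-NetConnection` step can also be scripted from the test project, so a skipped or failing run can report whether the port was reachable at all. A minimal sketch using only `System.Net.Sockets` on .NET 5 or later; `ReachabilitySketch` is a hypothetical helper, and a successful connect only shows that something accepts TCP connections on the port, not that the Historian endpoints respond:

```csharp
using System;
using System.Net.Sockets;
using System.Threading;
using System.Threading.Tasks;

public static class ReachabilitySketch
{
    // Returns true if a TCP connection to host:port completes within the timeout.
    public static async Task<bool> CanConnectAsync(string host, int port, TimeSpan timeout)
    {
        using var client = new TcpClient();
        using var cts = new CancellationTokenSource(timeout);
        try
        {
            await client.ConnectAsync(host, port, cts.Token);
            return true;
        }
        catch (OperationCanceledException)
        {
            return false; // the timeout elapsed before the connect completed
        }
        catch (SocketException)
        {
            return false; // refused, unreachable, or name resolution failed
        }
    }
}
```

A call such as `await ReachabilitySketch.CanConnectAsync(host, 32568, TimeSpan.FromSeconds(5))` before the first live test separates "network path is closed" from "transport negotiation failed".
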
### D. Auth-chain delta vs LocalPipe (sequential — needs A's SPN)
Once A returns an SPN, run `tools/AVEVA.Historian.ReverseEngineering --
wcf-probe` against `/Hist-Integrated` over TCP and confirm:
1. The `Hist.GetV → Hist.ValCl × N → Hist.Open2` chain runs the same number
of ValCl rounds (LocalPipe was 2; TCP may be 2 or 3 depending on whether
the underlying transport already negotiated something).
2. The `OpenConnection2` request bytes are identical (the body is
transport-agnostic — only the WCF wrapper differs).
3. The Open2 response carries the same `outParameters` shape (42 bytes
with version, session GUID, FILETIMEs, status). If TCP returns a
different shape, the parser at `HistorianWcfAuthChainHelper.cs` needs a
transport-aware path.
Output: byte-for-byte diff between LocalPipe and TCP capture, with any
deltas noted.
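
The byte-for-byte comparison can be sketched as a helper that reports differing offsets while skipping session-specific fields. This is an illustration only: `CaptureDiffSketch` is a hypothetical name, and the masked ranges must be supplied by the caller, because this plan does not state where the GUID and FILETIME fields sit inside `outParameters`.

```csharp
using System;
using System.Collections.Generic;

public static class CaptureDiffSketch
{
    // Returns the offsets at which two captures differ, ignoring masked ranges.
    // If the captures differ in length, every unmasked offset past the end of
    // the shorter one is reported as a difference.
    public static List<int> DiffOffsets(
        ReadOnlySpan<byte> left,
        ReadOnlySpan<byte> right,
        IReadOnlyList<(int Start, int Length)> masked)
    {
        var diffs = new List<int>();
        int max = Math.Max(left.Length, right.Length);
        for (int i = 0; i < max; i++)
        {
            if (IsMasked(i, masked)) continue;
            bool bothPresent = i < left.Length && i < right.Length;
            if (!bothPresent || left[i] != right[i]) diffs.Add(i);
        }
        return diffs;
    }

    // True when the offset falls inside any caller-supplied (start, length) range.
    private static bool IsMasked(int offset, IReadOnlyList<(int Start, int Length)> masked)
    {
        foreach (var (start, length) in masked)
        {
            if (offset >= start && offset < start + length) return true;
        }
        return false;
    }
}
```

An empty result means the two captures agree everywhere outside the masked fields; any returned offset is a delta to note in the output.
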
## 5. Verification workstreams
Once §4.A and §4.B return actionable answers, every operation gets its own
verification track. **The five tracks below do not share state**, because
each exercises a different SDK method, so they can run in parallel once V1 has passed.
| Track | Op | Live test to author | Parallel-safe? |
|---|---|---|---|
| V1 | `ProbeAsync` | `ProbeAsync_RemoteTcpIntegrated_ReturnsTrue` + `ProbeAsync_RemoteTcpCertificate_ReturnsTrue` | ✅ |
| V2 | Reads (`ReadRawAsync`, `ReadAggregateAsync`, `ReadAtTimeAsync`) | mirror of the existing 3 LocalPipe tests, env-var gated by `HISTORIAN_REMOTE_TCP_HOST` | ✅ |
| V3 | `ReadEventsAsync` | mirror of `ReadEventsAsync_AgainstLocalHistorian_DoesNotThrow` | ✅ |
| V4 | Tag ops (`BrowseTagNamesAsync`, `GetTagMetadataAsync`) | mirror of the two LocalPipe tests | ✅ |
| V5 | Status helpers (`GetConnectionStatusAsync`, `GetStoreForwardStatusAsync`, `GetSystemParameterAsync`) | mirror of the three LocalPipe tests | ✅ |
The only sequential dependency: V1 must pass before V2-V5 are meaningful
(if `ProbeAsync` returns false, the other tracks will fail for the same transport reasons).
For each track, the test pattern is:
```csharp
// Skip (the test passes vacuously) unless a remote host is configured and we are on Windows.
string? host = Environment.GetEnvironmentVariable("HISTORIAN_REMOTE_TCP_HOST");
if (string.IsNullOrWhiteSpace(host) || !OperatingSystem.IsWindows()) return;
HistorianClient client = new(new HistorianClientOptions
{
Host = host,
Port = 32568,
IntegratedSecurity = true,
Transport = HistorianTransport.RemoteTcpIntegrated,
TargetSpn = Environment.GetEnvironmentVariable("HISTORIAN_REMOTE_TCP_SPN")
?? throw new InvalidOperationException("Set HISTORIAN_REMOTE_TCP_SPN per §4.A"),
});
// ... existing test body, unchanged ...
```
Add a parallel set for `RemoteTcpCertificate` gated by
`HISTORIAN_REMOTE_TCPCERT_HOST` + `HISTORIAN_REMOTE_TCPCERT_THUMBPRINT`.
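
One way to keep the gating consistent across tracks is to resolve the env vars in a single place and return `null` when the remote target is not configured. A sketch under assumptions: `RemoteTcpTestTarget` is a hypothetical type, the env-var lookup is injected so the logic can be exercised without touching the process environment, and `HISTORIAN_REMOTE_TCP_TAG` is the tag variable named in the commit message for the executed plan.

```csharp
using System;

// Hypothetical holder for the remote test configuration; not part of the SDK.
public sealed record RemoteTcpTestTarget(string Host, string Tag, string? Spn)
{
    // Returns null when the host or tag is missing, which callers treat as "skip".
    public static RemoteTcpTestTarget? FromEnvironment(Func<string, string?> getEnv)
    {
        string? host = getEnv("HISTORIAN_REMOTE_TCP_HOST");
        string? tag = getEnv("HISTORIAN_REMOTE_TCP_TAG");
        if (string.IsNullOrWhiteSpace(host) || string.IsNullOrWhiteSpace(tag))
            return null;

        // The SPN is optional here; a null value leaves the choice to the caller.
        string? spn = getEnv("HISTORIAN_REMOTE_TCP_SPN");
        return new RemoteTcpTestTarget(
            host.Trim(),
            tag.Trim(),
            string.IsNullOrWhiteSpace(spn) ? null : spn.Trim());
    }
}
```

In a test, `RemoteTcpTestTarget.FromEnvironment(Environment.GetEnvironmentVariable)` would replace the repeated lookups, and a `null` result maps to the same early return the pattern above uses.
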
## 6. Risks and mitigations
| Risk | Likelihood | Impact | Mitigation |
|---|---|---|---|
| **SPN mismatch** — TCP SSPI negotiation rejects with `SEC_E_TARGET_UNKNOWN` | High | All TCP ops fail | §4.A discovery first; expose `TargetSpn` as already done in `HistorianClientOptions` |
| **Cert chain validation rejects** — server cert not trusted by dev workstation | High for `Certificate` transport | Cert transport unusable | §4.B: document required CA / pinning hook; consider a `ServerCertificateValidator` callback option |
| **Endpoint identity required** — `/HistCert` rejects without DNS identity in `EndpointAddress` | Medium | Cert transport unusable | §4.B step 3; if confirmed, add overload to `CreateEndpointAddress` |
| **Wire-level idle disconnect** — TCP connection dropped after N seconds idle, mid-test | Medium | Flaky tests | Set `RequestTimeout` low enough to fail fast; add reconnect logic if seen repeatedly |
| **Open2 response differs over TCP** — extra bytes in `outParameters` for TCP-specific session state | Low (reads/events use the same ConnectionMode 0x402 regardless of transport) | Auth chain breaks | §4.D byte-diff captures it; if found, transport-aware parser branch |
| **Compression negotiation** — `HistorianClientOptions.Compression` unset on LocalPipe; over TCP, the server might enable gzip and our `MdasMessageEncoder` doesn't unwrap it | Medium-Low | Requests succeed, responses garbled | Confirm compression off in initial probe; add gzip handling later if needed |
| **Time skew** — Kerberos ticket clock skew > 5min rejects auth | Low | Total auth failure | Operator checklist (§4.C) catches this |
| **`Probe` succeeds but reads silently empty** — common when tag-permissions don't grant the test account read access | Medium | False positive in V1, V2 fails | V2 asserts `samples.Count > 0` for a tag known to have data |
## 7. Success criteria
For `RemoteTcpIntegrated`:
- [ ] `ProbeAsync` returns `true` against the remote host
- [ ] All five Verification tracks (V1-V5) pass against the remote host
- [ ] Captured wire bytes for `Open2` and `StartQuery2` match the LocalPipe
captures (modulo session-specific GUIDs / FILETIMEs)
- [ ] Test count is 114 → 124 (10 new live tests) when both env vars set
- [ ] All existing `Transport = HistorianTransport.LocalPipe` tests still
pass when only the LocalPipe env var is set (no regression)
For `RemoteTcpCertificate`: same as above, gated by the cert-specific env
vars. The `ReadEventsAsync` track may be skipped if the cert account doesn't have AnE permission.
For documentation:
- [ ] `README.md` operation status table updated: `RemoteTcpIntegrated` and
`RemoteTcpCertificate` transports change from "wired but only
`LocalPipe` has live verification" to "live-verified"
- [ ] `docs/reverse-engineering/handoff.md` gets a new section documenting
any LocalPipe vs TCP wire-byte deltas found in §4.D
## 8. Open questions
1. Is there even a remote Historian available to test against? If not, this
plan stalls at §2 preconditions until one is provisioned. (Note:
handoff.md mentions a `10.100.0.x` remote Historian and a Debian relay
used in earlier sessions — verify whether that infrastructure is still
live and reachable.)
2. Does `RemoteTcpCertificate` use mutual-TLS or just server-cert? Current
binding (`HistorianWcfBindingFactory.cs:74`) sets
`ClientCredentialType = None` (server-cert only). Confirm against the
actual `/HistCert` endpoint behavior.
3. Does the `/Retr` channel need its own auth, or does it inherit from the
`/Hist-Integrated` session? Current code uses plain `MdasNetTcp` (no
transport security) for Retr in both `RemoteTcpIntegrated` and
`RemoteTcpCertificate` configurations — is that actually how the native
client does it, or does the native push security on Retr too?
4. What happens if the cert presented by `/HistCert` has a SAN that doesn't
match the host the SDK connected to? Decide pinning vs DNS validation.
5. The `HistorianClientOptions.Compression` flag exists but is not consumed
anywhere in the WCF layer. Is compression a transport concern or an
application-payload concern? This needs an answer before TCP work lands, because
the bandwidth savings only matter over WAN.
## 9. Parallelization summary
Within the discovery phase: A, B, C run in parallel. D blocks on A.
Within the verification phase: V1 must pass first, then V2-V5 parallel.
Both `RemoteTcpIntegrated` and `RemoteTcpCertificate` verification tracks
can run independently from each other.
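
The verification-phase ordering (gate on V1, then fan out) can be expressed as a small runner. This is a sketch of the dependency structure only, not of how the test framework schedules tests; `TrackRunnerSketch` and the delegate-based track shape are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

public static class TrackRunnerSketch
{
    // Runs the gate track first. Dependent tracks start concurrently only if
    // the gate passes. Returns a pass/fail entry per track that actually ran.
    public static async Task<IReadOnlyDictionary<string, bool>> RunAsync(
        (string Name, Func<Task<bool>> Run) gate,
        IReadOnlyList<(string Name, Func<Task<bool>> Run)> dependents)
    {
        var results = new Dictionary<string, bool>();

        bool gatePassed = await gate.Run();
        results[gate.Name] = gatePassed;
        if (!gatePassed) return results; // dependents would fail for the same reason

        // Task.WhenAll preserves input order, so outcomes[i] belongs to dependents[i].
        bool[] outcomes = await Task.WhenAll(dependents.Select(d => d.Run()));
        for (int i = 0; i < dependents.Count; i++)
        {
            results[dependents[i].Name] = outcomes[i];
        }
        return results;
    }
}
```

Leaving the dependent tracks out of the result when the gate fails keeps a transport failure from being reported five times.
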
End-to-end estimated wall-clock if all preconditions are met:
- §4 discovery: half a day if SPN is straightforward, longer if cert chain
surprises bite.
- §5 verification: 2-4 hours given the test scaffolding is largely a copy
of the existing LocalPipe tests.
If only one developer works the plan: ~1 day. With two developers
parallelizing across `RemoteTcpIntegrated` and `RemoteTcpCertificate`:
~half a day.
## 10. Out of scope (filed under separate plans)
- **Write commands over TCP** — `docs/plans/write-commands-reverse-engineering.md`
covers writes; once that lands, this doc adds a §11 "TCP write
verification" track.
- **Store/Forward sidecar over TCP** — covered by
`docs/plans/store-forward-cache-reverse-engineering.md`. SF probably
uses a separate IPC anyway, not Net.TCP.
- **Explicit-credentials TCP** — `IntegratedSecurity = false` with
username + password requires `HistorianSspiClient` to support explicit
credentials, which is its own task. Net.TCP can use either Kerberos or
the explicit creds, but the SDK's SSPI client only does current-user
Kerberos today.