docs: design for gateway TLS auto-cert and lenient client trust
# Gateway TLS Auto-Certificate and Lenient Client Trust: Design

Date: 2026-06-01
Status: Approved (brainstorming), pending implementation plan
## Problem

The gateway can serve gRPC and the dashboard over TLS, but only if an operator
supplies a certificate via the Kestrel `https://` endpoint config. With no cert,
an `https` endpoint fails at startup with Kestrel's opaque "No server certificate
was specified" error. Both current deployments therefore run plaintext (`h2c`),
exposing the API key and request payloads on the wire.

`mxaccessgw` is an internal tool. The goal is for TLS to "just work" with zero PKI
management: the gateway fabricates its own long-lived certificate when an HTTPS
endpoint is configured without one, and clients accept whatever certificate is
presented unless an operator explicitly opts into pinning.
## Decisions

1. **Gateway = fill-missing-cert-only.** No new "enable TLS" switch. TLS is still
   driven by configuring a Kestrel `https://` endpoint. New behavior: when an
   HTTPS endpoint has no `Certificate` section, the gateway generates/loads a
   persisted self-signed cert instead of failing. Plaintext-only hosts are
   untouched: no certificate or key material is ever written for them.
2. **Persist & reuse.** The self-signed cert is saved as a PFX under
   `C:\ProgramData\MxGateway\certs`, reused across restarts, regenerated only if
   missing, expired, or unreadable. Stable thumbprint; survives restarts; any
   CA-pinning client keeps working.
3. **Clients = lenient TLS, plaintext default.** When a client connects over TLS
   without a pinned CA, it skips verification (accepts any cert). Pinning a CA file
   restores full verification. The per-client connection default (mostly
   plaintext/`http`) does not change: TLS is still opt-in via the endpoint scheme.

**Scope boundary:** the gateway↔worker named-pipe IPC is unchanged (local,
OS-secured by the pipe ACL). This work touches only the public gRPC/dashboard
transport and the five language clients.
## Gateway component

New type `SelfSignedCertificateProvider` in
`src/ZB.MOM.WW.MxGateway.Server/Security/Tls/`.

1. **Detect need.** Inspect `Kestrel:Endpoints:*` configuration at startup. If any
   endpoint has an `https://` URL and no `Certificate` subsection, a default cert
   is needed. If none do, the provider is a no-op (no file written).
2. **Load-or-create.** Look for the persisted PFX. If present, valid, and
   unexpired, load it. Otherwise generate and persist.
3. **Generate.** `CertificateRequest` with **ECDSA P-256**, `notBefore = now - 1
   day` (clock-skew slack), `notAfter = now + ValidityYears`. SANs: `DNS=localhost`,
   `DNS=<MachineName>`, `DNS=<MachineName.FQDN>` when resolvable, plus
   `IP=127.0.0.1` and `IP=::1`. Server-auth EKU.
4. **Persist securely.** Write the PFX with a random in-memory-only export password;
   restrictive ACL (SYSTEM + Administrators + service account) on the `certs`
   directory and file; atomic write (temp + rename).
5. **Wire into Kestrel.** In `GatewayApplication.CreateBuilder`, add
   `builder.WebHost.ConfigureKestrel(o => o.ConfigureHttpsDefaults(h =>
   h.ServerCertificate = cert))`. `ConfigureHttpsDefaults` supplies the cert only
   for HTTPS endpoints that did not specify their own, so an operator-configured
   `Kestrel:Endpoints:*:Certificate` transparently overrides it. One hook covers
   both the gRPC and dashboard ports.
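The detection rule (step 1) and the certificate shape (step 3) can be sketched in a few lines. The gateway itself is .NET, so this Python version is only a language-neutral illustration of the logic, not the planned implementation. The function names, the `CertPlan` type, and the plain-dictionary view of `Kestrel:Endpoints` are all assumptions made for the example; real .NET configuration keys are case-insensitive, which this sketch does not model.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


def endpoints_needing_default_cert(endpoints: dict) -> list[str]:
    """Return the names of HTTPS endpoints that have no Certificate subsection."""
    needing = []
    for name, section in endpoints.items():
        url = str(section.get("Url", "")).lower()
        if url.startswith("https://") and not section.get("Certificate"):
            needing.append(name)
    return needing


@dataclass(frozen=True)
class CertPlan:
    """Hypothetical value object describing the certificate to generate."""
    not_before: datetime
    not_after: datetime
    dns_names: tuple[str, ...]
    ip_addresses: tuple[str, ...]


def _add_years(moment: datetime, years: int) -> datetime:
    try:
        return moment.replace(year=moment.year + years)
    except ValueError:
        # 29 February mapped onto a non-leap target year.
        return moment.replace(year=moment.year + years, day=28)


def plan_certificate(machine_name, fqdn, validity_years, extra_dns=(), now=None):
    """Compute validity window and SAN list as described in step 3."""
    now = now or datetime.now(timezone.utc)
    candidates = ["localhost", machine_name]
    if fqdn:  # only added when the FQDN was resolvable
        candidates.append(fqdn)
    candidates.extend(extra_dns)

    # De-duplicate case-insensitively while preserving order.
    seen, dns_names = set(), []
    for candidate in candidates:
        if candidate.lower() not in seen:
            seen.add(candidate.lower())
            dns_names.append(candidate)

    return CertPlan(
        not_before=now - timedelta(days=1),  # clock-skew slack
        not_after=_add_years(now, validity_years),
        dns_names=tuple(dns_names),
        ip_addresses=("127.0.0.1", "::1"),
    )
```

If `endpoints_needing_default_cert` returns an empty list, the provider would do nothing and write no file, which matches the no-op case in step 1.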
### New config block `MxGateway:Tls`

All optional; the zero-config path needs none of them.

| Option | Default | Purpose |
|---|---|---|
| `Tls:SelfSignedCertPath` | `C:\ProgramData\MxGateway\certs\gateway-selfsigned.pfx` | Where the generated cert lives |
| `Tls:ValidityYears` | `10` | Lifetime of the generated cert |
| `Tls:AdditionalDnsNames` | `[]` | Extra SANs (e.g. a load-balancer name) |
| `Tls:RegenerateIfExpired` | `true` | Auto-replace an expired persisted cert |

Validated by `GatewayOptionsValidator`: path non-empty when TLS is active,
`ValidityYears` in 1–100.

**Logging:** on generate/load, log thumbprint + SAN list + `notAfter` at
Information. Never log the PFX password or private key.
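The two validation rules stated above are small enough to show directly. This is a hedged Python sketch of the checks only; the real validator is the .NET `GatewayOptionsValidator`, and the function name, parameter names, and message wording here are invented for illustration.

```python
def validate_tls_options(tls_active: bool, cert_path, validity_years: int) -> list[str]:
    """Return a list of validation errors; an empty list means the options pass."""
    errors = []
    # Rule 1: the path only matters when an HTTPS endpoint needs a generated cert.
    if tls_active and not (cert_path or "").strip():
        errors.append("MxGateway:Tls:SelfSignedCertPath must be non-empty when TLS is active.")
    # Rule 2: ValidityYears must fall in the inclusive range 1..100.
    if not 1 <= validity_years <= 100:
        errors.append("MxGateway:Tls:ValidityYears must be between 1 and 100.")
    return errors
```

Collecting every failure instead of stopping at the first lets a startup diagnostic report all misconfigured options at once.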
## Client lenient-TLS behavior

Uniform rule: **TLS on + no CA pinned ⇒ skip verification; CA pinned ⇒ full
verification.** No transport default changes. Each client also exposes an explicit
switch to force-disable leniency (strict-without-pinning) for future use.

| Client | Mechanism | Effort |
|---|---|---|
| .NET | In `CreateHttpHandler`, when `UseTls` and `CaCertificatePath` empty, set `SslOptions.RemoteCertificateValidationCallback = (_,_,_,_) => true`. CA path keeps existing custom-root validation. | trivial |
| Go | In `buildCredentials`, when TLS and no `CACertFile`/`TLSConfig`, use `tls.Config{InsecureSkipVerify: true, ServerName: override}`. | trivial |
| Java | grpc-netty-shaded 1.76.0 ships `InsecureTrustManagerFactory`. When TLS and no CA, build `GrpcSslContexts.forClient().trustManager(InsecureTrustManagerFactory.INSTANCE)`. | easy |
| Python | grpc-python has no per-channel skip-verify. Fetch the server leaf cert at connect via `ssl.get_server_certificate((host, port))`, pass it as `root_certificates` to `ssl_channel_credentials`, plus `grpc.ssl_target_name_override`. Effectively trusts what is presented (TOFU). | moderate, special-cased |
| Rust | tonic 0.13.1 + rustls (`tls-ring`). Implement a custom `rustls::client::danger::ServerCertVerifier` that accepts everything, build a `rustls::ClientConfig` via `.dangerous().with_custom_certificate_verifier(...)`, feed it to the channel. May require a custom hyper-rustls connector if `ClientTlsConfig` will not take a raw rustls config. **Needs an API spike.** | highest |
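The uniform rule reduces to a three-way choice. The sketch below expresses it with Python's standard `ssl` module purely to show the decision. It is not the grpc-python mechanism from the table, which has no per-channel skip-verify. The function name and the `ca_file` and `strict` parameters are hypothetical stand-ins for each client's own option names.

```python
import ssl


def build_ssl_context(ca_file=None, strict=False) -> ssl.SSLContext:
    """Pick a trust posture: pinned CA, strict without pinning, or lenient."""
    if ca_file:
        # CA pinned: full verification against only the supplied roots.
        return ssl.create_default_context(cafile=ca_file)
    if strict:
        # Leniency force-disabled: verify against the system trust store.
        return ssl.create_default_context()
    # TLS on, no CA pinned: accept whatever certificate is presented.
    context = ssl.create_default_context()
    context.check_hostname = False  # must be cleared before verify_mode can be CERT_NONE
    context.verify_mode = ssl.CERT_NONE
    return context
```

Keeping the pinned-CA branch first means that supplying a CA file always restores verification, regardless of how the leniency switch is set.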
### Honesty caveats

- **Python** is not literally "ignore the cert"; it pins whatever the server
  presents on first contact via a separate unverified TLS probe. For a self-signed
  internal cert this is the intended outcome. Documented as a difference.
- **Rust** leniency depends on the tonic 0.13 TLS surface. If a custom verifier is
  disproportionately invasive, the fallback is to require a CA file for Rust TLS
  (pin-only) and document Rust as the exception.
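A minimal sketch of the Python probe-then-pin path follows, under several assumptions: the helper names are invented, the built-in `ConnectionError` stands in for the client's existing connect-error type, `"localhost"` is used as the name override because it is always among the generated SANs, and the `timeout` argument of `ssl.get_server_certificate` requires Python 3.10 or later. The `fetch` parameter exists only so the probe can be exercised without a network.

```python
import ssl


def probe_leaf_certificate(host, port, fetch=ssl.get_server_certificate, timeout=5.0) -> bytes:
    """Fetch the server's leaf certificate (PEM bytes) without verifying it."""
    try:
        pem = fetch((host, port), timeout=timeout)
    except OSError as exc:  # ssl.SSLError is a subclass of OSError
        raise ConnectionError(f"TLS probe of {host}:{port} failed: {exc}") from exc
    return pem.encode("ascii")


def open_lenient_channel(host, port, server_name="localhost"):
    """Open a channel that trusts exactly the certificate the server just presented."""
    import grpc  # third-party (grpcio); imported lazily so the probe has no dependency

    credentials = grpc.ssl_channel_credentials(
        root_certificates=probe_leaf_certificate(host, port)
    )
    # The override must match a SAN in the probed certificate.
    options = [("grpc.ssl_target_name_override", server_name)]
    return grpc.secure_channel(f"{host}:{port}", credentials, options=options)
```

Because the probe and the real connection are separate handshakes, the channel fails if the server presents a different certificate on the second one, which is the sense in which this path pins and does not ignore. The target string here assumes a hostname or IPv4 address; an IPv6 literal would need bracketing.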
## Error handling

Gateway:

- Cert dir not writable / ACL fails ⇒ fail fast at startup with a diagnostic naming
  the path and required permission. No silent in-memory fallback.
- Persisted PFX corrupt/unreadable ⇒ warn, regenerate, overwrite.
- Persisted cert expired ⇒ regenerate if `RegenerateIfExpired` (default), else fail
  fast instructing the operator to delete it or enable regeneration.
- HTTPS endpoint configured but generation disabled / path empty ⇒ validator
  rejects at startup rather than letting Kestrel throw its opaque error.

Clients: the error surface is unchanged. Skip-verify cannot itself raise an error.
Python's pre-fetch wraps connect failure into the existing connect-error type with
the endpoint in the message. Rust pin-only fallback surfaces the existing CA-file
error.
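The gateway rules for a persisted certificate form a small decision table, sketched here in Python as an illustration only; the gateway is .NET. The enum, its member names, and the boolean inputs are hypothetical, and the not-writable and validator cases are left out because they are checked before this decision is reached.

```python
from enum import Enum


class CertAction(Enum):
    LOAD = "load existing PFX"
    GENERATE = "generate and persist"
    REGENERATE_WITH_WARNING = "warn, regenerate, overwrite"
    FAIL_FAST = "fail at startup"


def decide_cert_action(exists: bool, readable: bool, expired: bool,
                       regenerate_if_expired: bool = True) -> CertAction:
    """Map the state of the persisted PFX to the action described above."""
    if not exists:
        return CertAction.GENERATE
    if not readable:
        # Corrupt or unreadable file: never fatal, but always logged.
        return CertAction.REGENERATE_WITH_WARNING
    if expired:
        return CertAction.GENERATE if regenerate_if_expired else CertAction.FAIL_FAST
    return CertAction.LOAD
```

Only the expired case with regeneration turned off stops startup; every other state either loads the existing certificate or replaces it.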
## Documentation (same commit as source, per CLAUDE.md)

- `docs/GatewayConfiguration.md`: extend the TLS section with auto-generation, the
  `MxGateway:Tls:*` block, persistence location/ACL, thumbprint logging, operator
  override via `Kestrel:Endpoints:*:Certificate`.
- Each client README + `*ClientDesign.md`: "TLS is lenient by default; pin a CA to
  verify," with Python TOFU and any Rust caveat noted.
- `docs/DesignDecisions.md`: record both posture choices and the why (internal
  tool, no PKI) so they are not mistaken for an oversight.
## Testing

Gateway (`MxGateway.Tests`, no MXAccess):

- `SelfSignedCertificateProvider`: SANs, server-auth EKU, `notAfter ≈ now +
  ValidityYears`, ECDSA P-256.
- Load-or-create: valid persisted PFX reused (same thumbprint); expired regenerates
  when enabled; corrupt regenerates with a warning.
- Detection: HTTPS-without-cert engages; all-plaintext no-ops and writes no file;
  endpoint with its own cert is not overridden.
- `GatewayOptionsValidator`: new `Tls:*` rules.
- Host integration: `Kestrel:Endpoints:Http:Url=https://127.0.0.1:0` builds and
  binds (today it throws Kestrel's "No server certificate was specified" error).

Clients: each test project gets a lenient-TLS test against a throwaway self-signed
cert. Connecting with no CA succeeds; pinning a wrong CA fails (proves pinning still
verifies). Python exercises the pre-fetch path; mark opt-in if loopback timing is
flaky. Standard (non-live) tests; no MXAccess or external services.

Cross-language: add a TLS variant note to `docs/CrossLanguageSmokeMatrix.md`;
running the matrix over TLS stays manual/opt-in, consistent with the existing gate.

Per-component verification follows CLAUDE.md's source-update table (build + test
each touched component independently).