diff --git a/docs/plans/2026-06-01-gateway-cert-autogen-design.md b/docs/plans/2026-06-01-gateway-cert-autogen-design.md
new file mode 100644
index 0000000..e0dabc7
--- /dev/null
+++ b/docs/plans/2026-06-01-gateway-cert-autogen-design.md
@@ -0,0 +1,150 @@
+# Gateway TLS Auto-Certificate and Lenient Client Trust: Design
+
+Date: 2026-06-01
+Status: Approved (brainstorming), pending implementation plan
+
+## Problem
+
+The gateway can serve gRPC and the dashboard over TLS, but only if an operator
+supplies a certificate via the Kestrel `https://` endpoint config. With no cert,
+an `https` endpoint fails at startup with Kestrel's opaque "No server certificate
+was specified" error. Both current deployments therefore run plaintext (`h2c`),
+exposing the API key and request payloads on the wire.
+
+`mxaccessgw` is an internal tool. The goal is for TLS to "just work" with zero PKI
+management: the gateway fabricates its own long-lived certificate when an HTTPS
+endpoint is configured without one, and clients accept whatever certificate is
+presented unless an operator explicitly opts into pinning.
+
+## Decisions
+
+1. **Gateway = fill-missing-cert-only.** No new "enable TLS" switch. TLS is still
+   driven by configuring a Kestrel `https://` endpoint. New behavior: when an
+   HTTPS endpoint has no `Certificate` section, the gateway generates/loads a
+   persisted self-signed cert instead of failing. Plaintext-only hosts are
+   untouched: no certificate or key material is ever written for them.
+2. **Persist & reuse.** The self-signed cert is saved as a PFX under
+   `C:\ProgramData\MxGateway\certs`, reused across restarts, regenerated only if
+   missing, expired, or unreadable. Stable thumbprint; survives restarts; any
+   CA-pinning client keeps working.
+3. **Clients = lenient TLS, plaintext default.** When a client connects over TLS
+   without a pinned CA, it skips verification (accepts any cert). Pinning a CA file
+   restores full verification. The per-client connection default (mostly
+   plaintext/`http`) does not change; TLS is still opt-in via the endpoint scheme.
+
+**Scope boundary:** the gateway↔worker named-pipe IPC is unchanged (local,
+OS-secured by the pipe ACL). This work touches only the public gRPC/dashboard
+transport and the five language clients.
+
+## Gateway component
+
+New type `SelfSignedCertificateProvider` in
+`src/ZB.MOM.WW.MxGateway.Server/Security/Tls/`.
+
+1. **Detect need.** Inspect `Kestrel:Endpoints:*` configuration at startup. If any
+   endpoint has an `https://` URL and no `Certificate` subsection, a default cert
+   is needed. If none do, the provider is a no-op (no file written).
+2. **Load-or-create.** Look for the persisted PFX. If present, valid, and
+   unexpired, load it. Otherwise generate and persist.
+3. **Generate.** `CertificateRequest` with **ECDSA P-256**,
+   `notBefore = now - 1 day` (clock-skew slack), `notAfter = now + ValidityYears`.
+   SANs: `DNS=localhost`, `DNS=`, `DNS=` when resolvable, plus
+   `IP=127.0.0.1` and `IP=::1`. Server-auth EKU.
+4. **Persist securely.** Write the PFX with a random in-memory-only export password;
+   restrictive ACL (SYSTEM + Administrators + service account) on the `certs`
+   directory and file; atomic write (temp + rename).
+5. **Wire into Kestrel.** In `GatewayApplication.CreateBuilder`, add
+   `builder.WebHost.ConfigureKestrel(o => o.ConfigureHttpsDefaults(h =>
+   h.ServerCertificate = cert))`. `ConfigureHttpsDefaults` supplies the cert only
+   for HTTPS endpoints that did not specify their own, so an operator-configured
+   `Kestrel:Endpoints:*:Certificate` transparently overrides it. One hook covers
+   both the gRPC and dashboard ports.
+
+### New config block `MxGateway:Tls`
+
+All optional; the zero-config path needs none of them.
+
+| Option | Default | Purpose |
+|---|---|---|
+| `Tls:SelfSignedCertPath` | `C:\ProgramData\MxGateway\certs\gateway-selfsigned.pfx` | Where the generated cert lives |
+| `Tls:ValidityYears` | `10` | Lifetime of the generated cert |
+| `Tls:AdditionalDnsNames` | `[]` | Extra SANs (e.g. a load-balancer name) |
+| `Tls:RegenerateIfExpired` | `true` | Auto-replace an expired persisted cert |
+
+Validated by `GatewayOptionsValidator`: path non-empty when TLS is active,
+`ValidityYears` in 1–100.
+
+**Logging:** on generate/load, log thumbprint + SAN list + `notAfter` at
+Information. Never log the PFX password or private key.
+
+## Client lenient-TLS behavior
+
+Uniform rule: **TLS on + no CA pinned ⇒ skip verification; CA pinned ⇒ full
+verification.** No transport default changes. Each client also exposes an explicit
+switch to force-disable leniency (strict-without-pinning) for the future.
+
+| Client | Mechanism | Effort |
+|---|---|---|
+| .NET | In `CreateHttpHandler`, when `UseTls` and `CaCertificatePath` empty, set `SslOptions.RemoteCertificateValidationCallback = (_,_,_,_) => true`. CA path keeps existing custom-root validation. | trivial |
+| Go | In `buildCredentials`, when TLS and no `CACertFile`/`TLSConfig`, use `tls.Config{InsecureSkipVerify: true, ServerName: override}`. | trivial |
+| Java | grpc-netty-shaded 1.76.0 ships `InsecureTrustManagerFactory`. When TLS and no CA, build `GrpcSslContexts.forClient().trustManager(InsecureTrustManagerFactory.INSTANCE)`. | easy |
+| Python | grpc-python has no per-channel skip-verify. Fetch the server leaf cert at connect via `ssl.get_server_certificate((host, port))`, pass it as `root_certificates` to `ssl_channel_credentials`, plus `grpc.ssl_target_name_override`. Effectively trusts what is presented (TOFU). | moderate, special-cased |
+| Rust | tonic 0.13.1 + rustls (`tls-ring`). Implement a custom `rustls::client::danger::ServerCertVerifier` that accepts everything, build a `rustls::ClientConfig` via `.dangerous().with_custom_certificate_verifier(...)`, feed it to the channel. May require a custom hyper-rustls connector if `ClientTlsConfig` will not take a raw rustls config. **Needs an API spike.** | highest |
+
+### Honesty caveats
+
+- **Python** is not literally "ignore the cert"; it pins whatever the server
+  presents on first contact via a separate unverified TLS probe. For a self-signed
+  internal cert this is the intended outcome. Documented as a difference.
+- **Rust** leniency depends on the tonic 0.13 TLS surface. If a custom verifier is
+  disproportionately invasive, the fallback is to require a CA file for Rust TLS
+  (pin-only) and document Rust as the exception.
+
+## Error handling
+
+Gateway:
+- Cert dir not writable / ACL fails ⇒ fail fast at startup with a diagnostic naming
+  the path and required permission. No silent in-memory fallback.
+- Persisted PFX corrupt/unreadable ⇒ warn, regenerate, overwrite.
+- Persisted cert expired ⇒ regenerate if `RegenerateIfExpired` (default), else fail
+  fast instructing the operator to delete it or enable regeneration.
+- HTTPS endpoint configured but generation disabled / path empty ⇒ validator
+  rejects at startup rather than letting Kestrel throw its opaque error.
+
+Clients: surface unchanged. Skip-verify cannot itself raise. Python's pre-fetch
+wraps connect failure into the existing connect-error type with the endpoint in the
+message. Rust pin-only fallback surfaces the existing CA-file error.
+
+## Documentation (same commit as source, per CLAUDE.md)
+
+- `docs/GatewayConfiguration.md`: extend the TLS section with auto-generation, the
+  `MxGateway:Tls:*` block, persistence location/ACL, thumbprint logging, operator
+  override via `Kestrel:Endpoints:*:Certificate`.
+- Each client README + `*ClientDesign.md`: "TLS is lenient by default; pin a CA to
+  verify," with Python TOFU and any Rust caveat noted.
+- `docs/DesignDecisions.md`: record both posture choices and the why (internal
+  tool, no PKI) so they are not mistaken for an oversight.
+
+## Testing
+
+Gateway (`MxGateway.Tests`, no MXAccess):
+- `SelfSignedCertificateProvider`: SANs, server-auth EKU,
+  `notAfter ≈ now + ValidityYears`, ECDSA P-256.
+- Load-or-create: valid persisted PFX reused (same thumbprint); expired regenerates
+  when enabled; corrupt regenerates with a warning.
+- Detection: HTTPS-without-cert engages; all-plaintext no-ops and writes no file;
+  endpoint with its own cert is not overridden.
+- `GatewayOptionsValidator`: new `Tls:*` rules.
+- Host integration: `Kestrel:Endpoints:Http:Url=https://127.0.0.1:0` builds and
+  binds (today it throws "no certificate specified").
+
+Clients: each test project gets a lenient-TLS test against a throwaway self-signed
+cert: connect with no CA succeeds; pinning a wrong CA fails (proves pinning still
+verifies). Python exercises the pre-fetch path; mark opt-in if loopback timing is
+flaky. Standard (non-live) tests; no MXAccess or external services.
+
+Cross-language: add a TLS variant note to `docs/CrossLanguageSmokeMatrix.md`;
+running the matrix over TLS stays manual/opt-in, consistent with the existing gate.
+
+Per-component verification follows CLAUDE.md's source-update table (build + test
+each touched component independently).
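
As a rough illustration of the uniform client rule from "Client lenient-TLS behavior" and of the Python pre-fetch probe, the decision logic might be sketched as follows. This is a minimal sketch rather than the client implementation: `TrustMode`, `resolve_trust_mode`, `fetch_presented_certificate`, and the `strict` flag are hypothetical names chosen for this example, and the built-in `ConnectionError` stands in for the client's existing connect-error type. Only the standard library is used; the gRPC channel wiring is left out.

```python
import ssl
from enum import Enum
from typing import Optional
from urllib.parse import urlsplit


class TrustMode(Enum):
    """Hypothetical names for the four client postures described in this design."""
    PLAINTEXT = "plaintext"  # http endpoint: TLS not in use
    LENIENT = "lenient"      # TLS on, no CA pinned: accept the presented cert
    PINNED = "pinned"        # TLS on, CA file pinned: full verification
    STRICT = "strict"        # TLS on, leniency force-disabled, no CA pinned


def resolve_trust_mode(endpoint: str,
                       ca_path: Optional[str] = None,
                       strict: bool = False) -> TrustMode:
    """Apply the uniform rule: TLS + no CA => lenient; CA pinned => verify."""
    # TLS stays opt-in via the endpoint scheme; anything else is plaintext.
    if urlsplit(endpoint).scheme.lower() != "https":
        return TrustMode.PLAINTEXT
    # A pinned CA always restores full verification.
    if ca_path:
        return TrustMode.PINNED
    # The explicit switch disables leniency even without a pinned CA.
    return TrustMode.STRICT if strict else TrustMode.LENIENT


def fetch_presented_certificate(host: str, port: int,
                                timeout: float = 5.0) -> bytes:
    """Unverified TLS probe returning the server's leaf certificate as PEM bytes.

    The timeout argument of ssl.get_server_certificate needs Python 3.10+.
    Failures are wrapped so the endpoint appears in the error message.
    """
    try:
        pem = ssl.get_server_certificate((host, port), timeout=timeout)
    except OSError as exc:  # covers refused connections, timeouts, ssl.SSLError
        raise ConnectionError(f"TLS probe of {host}:{port} failed: {exc}") from exc
    return pem.encode("ascii")
```

In the Python client, the bytes returned by the probe would then be passed as `root_certificates` to `ssl_channel_credentials` together with `grpc.ssl_target_name_override`, as the mechanism table describes; the other clients would map `TrustMode.LENIENT` onto their own skip-verify hooks.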