# Gateway TLS Auto-Certificate and Lenient Client Trust: Design

Date: 2026-06-01

Status: Approved (brainstorming), pending implementation plan

## Problem

The gateway can serve gRPC and the dashboard over TLS, but only if an operator supplies a certificate via the Kestrel `https://` endpoint config. With no cert, an `https` endpoint fails at startup with Kestrel's opaque "No server certificate was specified" error. Both current deployments therefore run plaintext (`h2c`), exposing the API key and request payloads on the wire.

`mxaccessgw` is an internal tool. The goal is for TLS to "just work" with zero PKI management: the gateway fabricates its own long-lived certificate when an HTTPS endpoint is configured without one, and clients accept whatever certificate is presented unless an operator explicitly opts into pinning.

## Decisions

1. **Gateway = fill-missing-cert-only.** No new "enable TLS" switch. TLS is still driven by configuring a Kestrel `https://` endpoint. New behavior: when an HTTPS endpoint has no `Certificate` section, the gateway generates/loads a persisted self-signed cert instead of failing. Plaintext-only hosts are untouched: no certificate or key material is ever written for them.
2. **Persist & reuse.** The self-signed cert is saved as a PFX under `C:\ProgramData\MxGateway\certs`, reused across restarts, regenerated only if missing, expired, or unreadable. Stable thumbprint; survives restarts; any CA-pinning client keeps working.
3. **Clients = lenient TLS, plaintext default.** When a client connects over TLS without a pinned CA, it skips verification (accepts any cert). Pinning a CA file restores full verification. The per-client connection default (mostly plaintext/`http`) does not change; TLS is still opt-in via the endpoint scheme.

**Scope boundary:** the gateway↔worker named-pipe IPC is unchanged (local, OS-secured by the pipe ACL). This work touches only the public gRPC/dashboard transport and the five language clients.
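Decision 3 boils down to a small, language-independent decision that every client makes before it opens a channel. The following is a minimal Python sketch of that decision only, not of any client's actual code: `TrustMode`, `resolve_trust_mode`, and the `strict` parameter are hypothetical names chosen for illustration, and `strict` stands in for the explicit "force-disable leniency" switch mentioned later in this document.

```python
from enum import Enum
from typing import Optional
from urllib.parse import urlsplit


class TrustMode(Enum):
    """Hypothetical labels for the trust postures described in Decision 3."""
    PLAINTEXT = "plaintext"  # http:// endpoint: no TLS, nothing to verify
    LENIENT = "lenient"      # TLS, no CA pinned: accept whatever cert is presented
    PINNED = "pinned"        # TLS with a CA file: full verification against it
    STRICT = "strict"        # TLS, leniency force-disabled, no CA pinned


def resolve_trust_mode(endpoint: str,
                       ca_file: Optional[str] = None,
                       strict: bool = False) -> TrustMode:
    """Pick the trust posture from the endpoint scheme and pinning options.

    Assumption: TLS is selected purely by an https:// scheme, matching the
    "opt-in via the endpoint scheme" wording above.
    """
    scheme = urlsplit(endpoint).scheme.lower()
    if scheme != "https":
        # The transport default is unchanged: no TLS means no trust decision.
        return TrustMode.PLAINTEXT
    if ca_file:
        # Pinning always wins and restores full verification.
        return TrustMode.PINNED
    if strict:
        return TrustMode.STRICT
    return TrustMode.LENIENT
```

For example, `resolve_trust_mode("https://gw:5001")` yields `TrustMode.LENIENT`, while passing `ca_file="ca.pem"` for the same endpoint yields `TrustMode.PINNED`. Checking the CA file before the strict flag reflects the rule that a pinned CA always means full verification.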
## Gateway component

New type `SelfSignedCertificateProvider` in `src/ZB.MOM.WW.MxGateway.Server/Security/Tls/`.

1. **Detect need.** Inspect `Kestrel:Endpoints:*` configuration at startup. If any endpoint has an `https://` URL and no `Certificate` subsection, a default cert is needed. If none do, the provider is a no-op (no file written).
2. **Load-or-create.** Look for the persisted PFX. If present, valid, and unexpired, load it. Otherwise generate and persist.
3. **Generate.** `CertificateRequest` with **ECDSA P-256**, `notBefore = now - 1 day` (clock-skew slack), `notAfter = now + ValidityYears`. SANs: `DNS=localhost`, `DNS=`, `DNS=` when resolvable, plus `IP=127.0.0.1` and `IP=::1`. Server-auth EKU.
4. **Persist securely.** Write the PFX with an **empty** export password (a random in-memory password cannot be reused across restarts, which the persist-and-reuse decision requires); protect the private key with a restrictive ACL (SYSTEM + Administrators + service account) on the `certs` directory and file on Windows, and `0600` on non-Windows; atomic write (temp + rename). After generating, the cert is reloaded from the persisted PFX so Kestrel always serves the on-disk key.
5. **Wire into Kestrel.** In `GatewayApplication.CreateBuilder`, add `builder.WebHost.ConfigureKestrel(o => o.ConfigureHttpsDefaults(h => h.ServerCertificate = cert))`. `ConfigureHttpsDefaults` supplies the cert only for HTTPS endpoints that did not specify their own, so an operator-configured `Kestrel:Endpoints:*:Certificate` transparently overrides it. One hook covers both the gRPC and dashboard ports.

### New config block `MxGateway:Tls`

All optional; the zero-config path needs none of them.

| Option | Default | Purpose |
|---|---|---|
| `Tls:SelfSignedCertPath` | `C:\ProgramData\MxGateway\certs\gateway-selfsigned.pfx` | Where the generated cert lives |
| `Tls:ValidityYears` | `10` | Lifetime of the generated cert |
| `Tls:AdditionalDnsNames` | `[]` | Extra SANs (e.g. a load-balancer name) |
| `Tls:RegenerateIfExpired` | `true` | Auto-replace an expired persisted cert |

Validated by `GatewayOptionsValidator`: `ValidityYears` in 1–100, `SelfSignedCertPath` is a valid path shape when non-blank, and `AdditionalDnsNames` entries are non-blank. (The "https endpoint exists but cert path is blank" fail-fast lives in the bootstrap/provider, not the validator, because the validator only sees the `MxGateway` section, not `Kestrel:Endpoints`.)

**Logging:** on generate/load, log thumbprint + SAN list + `notAfter` at Information. Never log the PFX password or private key.

## Client lenient-TLS behavior

Uniform rule: **TLS on + no CA pinned ⇒ skip verification; CA pinned ⇒ full verification.** No transport default changes. Each client also exposes an explicit switch to force-disable leniency (strict-without-pinning) for the future.

| Client | Mechanism | Effort |
|---|---|---|
| .NET | In `CreateHttpHandler`, when `UseTls` and `CaCertificatePath` empty, set `SslOptions.RemoteCertificateValidationCallback = (_,_,_,_) => true`. CA path keeps existing custom-root validation. | trivial |
| Go | In `buildCredentials`, when TLS and no `CACertFile`/`TLSConfig`, use `tls.Config{InsecureSkipVerify: true, ServerName: override}`. | trivial |
| Java | grpc-netty-shaded 1.76.0 ships `InsecureTrustManagerFactory`. When TLS and no CA, build `GrpcSslContexts.forClient().trustManager(InsecureTrustManagerFactory.INSTANCE)`. | easy |
| Python | grpc-python has no per-channel skip-verify. Fetch the server leaf cert at connect via `ssl.get_server_certificate((host, port))`, pass it as `root_certificates` to `ssl_channel_credentials`, plus `grpc.ssl_target_name_override`. Effectively trusts what is presented (TOFU). | moderate, special-cased |
| Rust | tonic 0.13.1 + rustls (`tls-ring`). Implement a custom `rustls::client::danger::ServerCertVerifier` that accepts everything, build a `rustls::ClientConfig` via `.dangerous().with_custom_certificate_verifier(...)`, feed it to the channel. May require a custom hyper-rustls connector if `ClientTlsConfig` will not take a raw rustls config. **Needs an API spike.** | highest |

### Honesty caveats

- **Python** is not literally "ignore the cert"; it pins whatever the server presents on first contact via a separate unverified TLS probe. For a self-signed internal cert this is the intended outcome. Documented as a difference.
- **Rust** leniency depends on the tonic 0.13 TLS surface. If a custom verifier is disproportionately invasive, the fallback is to require a CA file for Rust TLS (pin-only) and document Rust as the exception.

## Error handling

Gateway:

- Cert dir not writable / ACL fails ⇒ fail fast at startup with a diagnostic naming the path and required permission. No silent in-memory fallback.
- Persisted PFX corrupt/unreadable ⇒ warn, regenerate, overwrite.
- Persisted cert expired ⇒ regenerate if `RegenerateIfExpired` (default), else fail fast instructing the operator to delete it or enable regeneration.
- HTTPS endpoint configured but generation disabled / path empty ⇒ the bootstrap/provider fails fast at startup rather than letting Kestrel throw its opaque error. (This check cannot live in `GatewayOptionsValidator`, which does not see `Kestrel:Endpoints`.)

Clients: surface unchanged. Skip-verify cannot itself raise. Python's pre-fetch wraps connect failure into the existing connect-error type with the endpoint in the message. Rust pin-only fallback surfaces the existing CA-file error.

## Documentation (same commit as source, per CLAUDE.md)

- `docs/GatewayConfiguration.md`: extend the TLS section with auto-generation, the `MxGateway:Tls:*` block, persistence location/ACL, thumbprint logging, and operator override via `Kestrel:Endpoints:*:Certificate`.
- Each client README + `*ClientDesign.md`: "TLS is lenient by default; pin a CA to verify," with Python TOFU and any Rust caveat noted.
- `docs/DesignDecisions.md`: record both posture choices and the why (internal tool, no PKI) so they are not mistaken for an oversight.

## Testing

Gateway (`MxGateway.Tests`, no MXAccess):

- `SelfSignedCertificateProvider`: SANs, server-auth EKU, `notAfter ≈ now + ValidityYears`, ECDSA P-256.
- Load-or-create: valid persisted PFX reused (same thumbprint); expired regenerates when enabled; corrupt regenerates with a warning.
- Detection: HTTPS-without-cert engages; all-plaintext no-ops and writes no file; endpoint with its own cert is not overridden.
- `GatewayOptionsValidator`: new `Tls:*` rules.
- Host integration: `Kestrel:Endpoints:Http:Url=https://127.0.0.1:0` builds and binds (today it throws Kestrel's "No server certificate was specified" error).

Clients: each test project gets a lenient-TLS test against a throwaway self-signed cert: connect with no CA succeeds; pinning a wrong CA fails (proves pinning still verifies). Python exercises the pre-fetch path; mark opt-in if loopback timing is flaky. Standard (non-live) tests; no MXAccess or external services.

Cross-language: add a TLS variant note to `docs/CrossLanguageSmokeMatrix.md`; running the matrix over TLS stays manual/opt-in, consistent with the existing gate.

Per-component verification follows CLAUDE.md's source-update table (build + test each touched component independently).
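The load-or-create cases in the gateway test list (reuse, regenerate on expiry, regenerate on corruption, fail fast when regeneration is disabled) amount to a small decision table. The sketch below expresses that table in Python purely as an illustration; the real provider is a .NET type, and `CertAction`, `PersistedCert`, and `decide_cert_action` are hypothetical names that do not appear in the design. It assumes a cert counts as expired once `not_after` is at or before the current time.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from enum import Enum
from typing import Optional


class CertAction(Enum):
    """Hypothetical outcomes of the load-or-create step."""
    REUSE = "reuse"          # load the persisted PFX, thumbprint unchanged
    GENERATE = "generate"    # create a new cert and overwrite the file
    FAIL_FAST = "fail-fast"  # stop startup with an operator-facing diagnostic


@dataclass
class PersistedCert:
    """What was found on disk; hypothetical stand-in for the loaded PFX."""
    readable: bool                        # False when the PFX is corrupt
    not_after: Optional[datetime] = None  # expiry, known only when readable


def decide_cert_action(persisted: Optional[PersistedCert],
                       now: datetime,
                       regenerate_if_expired: bool = True) -> CertAction:
    """Map the on-disk state to reuse, regenerate, or fail fast."""
    if persisted is None:
        return CertAction.GENERATE  # nothing persisted yet
    if not persisted.readable or persisted.not_after is None:
        return CertAction.GENERATE  # corrupt: warn, regenerate, overwrite
    if persisted.not_after <= now:
        # Expired: regenerate only when the operator has left this enabled.
        if regenerate_if_expired:
            return CertAction.GENERATE
        return CertAction.FAIL_FAST
    return CertAction.REUSE


# Usage: a cert expiring in a year is reused, an expired one is replaced.
now = datetime(2026, 6, 1, tzinfo=timezone.utc)
valid = PersistedCert(readable=True, not_after=now + timedelta(days=365))
expired = PersistedCert(readable=True, not_after=now - timedelta(days=1))
print(decide_cert_action(valid, now).value)
print(decide_cert_action(expired, now).value)
```

Keeping the decision separate from file I/O in this way is one option for making the three load-or-create test cases cheap to cover without touching the filesystem.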