Gateway TLS Auto-Certificate and Lenient Client Trust — Design
Date: 2026-06-01 Status: Approved (brainstorming), pending implementation plan
Problem
The gateway can serve gRPC and the dashboard over TLS, but only if an operator
supplies a certificate via the Kestrel https:// endpoint config. With no cert,
an https endpoint fails at startup with Kestrel's opaque "No server certificate
was specified" error. Both current deployments therefore run plaintext (h2c),
exposing the API key and request payloads on the wire.
mxaccessgw is an internal tool. The goal is for TLS to "just work" with zero PKI
management: the gateway fabricates its own long-lived certificate when an HTTPS
endpoint is configured without one, and clients accept whatever certificate is
presented unless an operator explicitly opts into pinning.
Decisions
- Gateway = fill-missing-cert-only. No new "enable TLS" switch. TLS is still driven by configuring a Kestrel https:// endpoint. New behavior: when an HTTPS endpoint has no Certificate section, the gateway generates/loads a persisted self-signed cert instead of failing. Plaintext-only hosts are untouched: no certificate or key material is ever written for them.
- Persist & reuse. The self-signed cert is saved as a PFX under C:\ProgramData\MxGateway\certs, reused across restarts, regenerated only if missing, expired, or unreadable. Stable thumbprint; survives restarts; any CA-pinning client keeps working.
- Clients = lenient TLS, plaintext default. When a client connects over TLS without a pinned CA, it skips verification (accepts any cert). Pinning a CA file restores full verification. The per-client connection default (mostly plaintext/http) does not change; TLS is still opt-in via the endpoint scheme.
Scope boundary: the gateway↔worker named-pipe IPC is unchanged (local, OS-secured by the pipe ACL). This work touches only the public gRPC/dashboard transport and the five language clients.
Gateway component
New type SelfSignedCertificateProvider in
src/ZB.MOM.WW.MxGateway.Server/Security/Tls/.
- Detect need. Inspect Kestrel:Endpoints:* configuration at startup. If any endpoint has an https:// URL and no Certificate subsection, a default cert is needed. If none do, the provider is a no-op (no file written).
- Load-or-create. Look for the persisted PFX. If present, valid, and unexpired, load it. Otherwise generate and persist.
- Generate. CertificateRequest with ECDSA P-256, notBefore = now - 1 day (clock-skew slack), notAfter = now + ValidityYears. SANs: DNS=localhost, DNS=<MachineName>, DNS=<MachineName.FQDN> when resolvable, plus IP=127.0.0.1 and IP=::1. Server-auth EKU.
- Persist securely. Write the PFX with an empty export password (a random in-memory password cannot be reused across restarts, which the persist-and-reuse decision requires); protect the private key with a restrictive ACL (SYSTEM + Administrators + service account) on the certs directory and file on Windows, and 0600 on non-Windows; atomic write (temp + rename). After generating, the cert is reloaded from the persisted PFX so Kestrel always serves the on-disk key.
- Wire into Kestrel. In GatewayApplication.CreateBuilder, add builder.WebHost.ConfigureKestrel(o => o.ConfigureHttpsDefaults(h => h.ServerCertificate = cert)). ConfigureHttpsDefaults supplies the cert only for HTTPS endpoints that did not specify their own, so an operator-configured Kestrel:Endpoints:*:Certificate transparently overrides it. One hook covers both the gRPC and dashboard ports.
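The detect-need rule can be sketched in a few lines. The gateway itself is .NET; the sketch below uses Python purely for illustration and treats the Kestrel section as a plain dictionary. The function name, endpoint names, ports, and certificate path are hypothetical and not taken from the codebase.

```python
from typing import Any, List, Mapping


def endpoints_needing_default_cert(kestrel: Mapping[str, Any]) -> List[str]:
    """Return the names of endpoints that have an https:// URL but no
    Certificate subsection. An empty result means the provider is a no-op."""
    needing: List[str] = []
    for name, endpoint in kestrel.get("Endpoints", {}).items():
        url = str(endpoint.get("Url", ""))
        has_own_cert = bool(endpoint.get("Certificate"))
        if url.lower().startswith("https://") and not has_own_cert:
            needing.append(name)
    return needing


# Illustrative configuration (hypothetical names and ports).
config = {
    "Endpoints": {
        "Grpc": {"Url": "https://0.0.0.0:5001"},
        "Dashboard": {
            "Url": "https://0.0.0.0:5002",
            "Certificate": {"Path": "operator-supplied.pfx"},
        },
        "Plain": {"Url": "http://0.0.0.0:5000"},
    }
}

print(endpoints_needing_default_cert(config))  # ['Grpc']
```

Only the Grpc endpoint qualifies here: Dashboard carries its own certificate and Plain is not HTTPS, so neither would trigger generation on its own.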
New config block MxGateway:Tls
All optional; the zero-config path needs none of them.
| Option | Default | Purpose |
|---|---|---|
| Tls:SelfSignedCertPath | C:\ProgramData\MxGateway\certs\gateway-selfsigned.pfx | Where the generated cert lives |
| Tls:ValidityYears | 10 | Lifetime of the generated cert |
| Tls:AdditionalDnsNames | [] | Extra SANs (e.g. a load-balancer name) |
| Tls:RegenerateIfExpired | true | Auto-replace an expired persisted cert |
Validated by GatewayOptionsValidator: ValidityYears in 1–100,
SelfSignedCertPath is a valid path shape when non-blank, and
AdditionalDnsNames entries are non-blank. (The "https endpoint exists but cert
path is blank" fail-fast lives in the bootstrap/provider, not the validator,
because the validator only sees the MxGateway section, not Kestrel:Endpoints.)
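As a rough illustration of these rules, the following Python sketch models the options block and the three checks. The real validator is the .NET GatewayOptionsValidator; the class, function, and error messages here are hypothetical, and the "valid path shape" test is an assumed approximation, since the design does not spell out the exact rule.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class TlsOptions:
    # Defaults mirror the documented MxGateway:Tls defaults.
    self_signed_cert_path: str = r"C:\ProgramData\MxGateway\certs\gateway-selfsigned.pfx"
    validity_years: int = 10
    additional_dns_names: List[str] = field(default_factory=list)
    regenerate_if_expired: bool = True


def validate_tls_options(opts: TlsOptions) -> List[str]:
    """Return validation errors; an empty list means the block is valid."""
    errors: List[str] = []
    if not 1 <= opts.validity_years <= 100:
        errors.append("Tls:ValidityYears must be in the range 1-100")
    path = opts.self_signed_cert_path
    # Assumption: "valid path shape" is approximated as "no NUL character and
    # not a bare directory". A blank path is deliberately allowed here; that
    # case is handled by the bootstrap/provider, not the validator.
    if path.strip() and ("\0" in path or path.rstrip().endswith(("/", "\\"))):
        errors.append("Tls:SelfSignedCertPath is not a valid file path")
    if any(not name.strip() for name in opts.additional_dns_names):
        errors.append("Tls:AdditionalDnsNames entries must be non-blank")
    return errors
```

Returning a list rather than raising on the first problem lets startup report every misconfigured option in one pass.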
Logging: on generate/load, log thumbprint + SAN list + notAfter at
Information. Never log the PFX password or private key.
Client lenient-TLS behavior
Uniform rule: TLS on + no CA pinned ⇒ skip verification; CA pinned ⇒ full verification. No transport default changes. Each client also exposes an explicit switch to force-disable leniency (strict-without-pinning) for the future.
| Client | Mechanism | Effort |
|---|---|---|
| .NET | In CreateHttpHandler, when UseTls and CaCertificatePath empty, set SslOptions.RemoteCertificateValidationCallback = (_,_,_,_) => true. CA path keeps existing custom-root validation. | trivial |
| Go | In buildCredentials, when TLS and no CACertFile/TLSConfig, use tls.Config{InsecureSkipVerify: true, ServerName: override}. | trivial |
| Java | grpc-netty-shaded 1.76.0 ships InsecureTrustManagerFactory. When TLS and no CA, build GrpcSslContexts.forClient().trustManager(InsecureTrustManagerFactory.INSTANCE). | easy |
| Python | grpc-python has no per-channel skip-verify. Fetch the server leaf cert at connect via ssl.get_server_certificate((host, port)), pass it as root_certificates to ssl_channel_credentials, plus grpc.ssl_target_name_override. Effectively trusts what is presented (TOFU). | moderate, special-cased |
| Rust | tonic 0.13.1 + rustls (tls-ring). Implement a custom rustls::client::danger::ServerCertVerifier that accepts everything, build a rustls::ClientConfig via .dangerous().with_custom_certificate_verifier(...), feed it to the channel. May require a custom hyper-rustls connector if ClientTlsConfig will not take a raw rustls config. Needs an API spike. | highest |
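Independent of the per-language mechanism, the uniform rule reduces to one small decision. A minimal sketch follows, with hypothetical names throughout. The design does not say what the strict-without-pinning switch verifies against, so the STRICT mode here only records that leniency was disabled.

```python
from enum import Enum
from typing import Optional


class TrustMode(Enum):
    PLAINTEXT = "plaintext"  # no TLS; the unchanged per-client default
    LENIENT = "lenient"      # TLS on, no CA pinned: skip verification
    PINNED = "pinned"        # TLS on, CA pinned: full verification
    STRICT = "strict"        # TLS on, no CA, leniency force-disabled


def select_trust_mode(
    use_tls: bool,
    ca_cert_path: Optional[str],
    force_strict: bool = False,
) -> TrustMode:
    """Map the client's TLS settings onto the posture described in the design."""
    if not use_tls:
        return TrustMode.PLAINTEXT
    if ca_cert_path and ca_cert_path.strip():
        # A pinned CA always wins and restores full verification.
        return TrustMode.PINNED
    return TrustMode.STRICT if force_strict else TrustMode.LENIENT
```

Keeping the decision in one function per client makes the "pinning a wrong CA fails" test straightforward: the pinned branch never falls back to leniency.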
Honesty caveats
- Python is not literally "ignore the cert"; it pins whatever the server presents on first contact via a separate unverified TLS probe. For a self-signed internal cert this is the intended outcome. Documented as a difference.
- Rust leniency depends on the tonic 0.13 TLS surface. If a custom verifier is disproportionately invasive, the fallback is to require a CA file for Rust TLS (pin-only) and document Rust as the exception.
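The Python pre-fetch could look roughly like the sketch below. It uses only the standard-library ssl.get_server_certificate call named in the table; the function name, the TofuTrust type, and the use of the built-in ConnectionError as a stand-in for the client's existing connect-error type are assumptions. The fetch parameter exists only so the probe can be exercised without a live server.

```python
import ssl
from typing import Callable, NamedTuple, Tuple


class TofuTrust(NamedTuple):
    root_certificates: bytes   # PEM of the leaf cert the server presented
    target_name_override: str  # name checked against the cert's SANs


def probe_tofu_trust(
    host: str,
    port: int,
    override_name: str,
    fetch: Callable[[Tuple[str, int]], str] = ssl.get_server_certificate,
) -> TofuTrust:
    """Fetch the server's leaf certificate without verifying it and package
    it as the trust root for the subsequent gRPC channel."""
    try:
        # With no ca_certs argument, get_server_certificate performs no
        # validation; it returns whatever certificate is presented, as PEM.
        pem = fetch((host, port))
    except OSError as exc:  # ssl.SSLError is a subclass of OSError
        raise ConnectionError(f"TLS probe to {host}:{port} failed: {exc}") from exc
    return TofuTrust(pem.encode("ascii"), override_name)
```

The result would then be handed to grpc.ssl_channel_credentials(root_certificates=...) with the grpc.ssl_target_name_override channel option set to the override name. Which SAN to use as the override is not specified in this design, so it is left as a parameter.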
Error handling
Gateway:
- Cert dir not writable / ACL fails ⇒ fail fast at startup with a diagnostic naming the path and required permission. No silent in-memory fallback.
- Persisted PFX corrupt/unreadable ⇒ warn, regenerate, overwrite.
- Persisted cert expired ⇒ regenerate if RegenerateIfExpired (default), else fail fast instructing the operator to delete it or enable regeneration.
- HTTPS endpoint configured but generation disabled / path empty ⇒ the bootstrap/provider fails fast at startup (the validator cannot see Kestrel:Endpoints) rather than letting Kestrel throw its opaque error.
Clients: surface unchanged. Skip-verify cannot itself raise. Python's pre-fetch wraps connect failure into the existing connect-error type with the endpoint in the message. Rust pin-only fallback surfaces the existing CA-file error.
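The gateway-side rules for the persisted PFX amount to a small decision function. The sketch below is illustrative only: the names are hypothetical, and it covers the missing, corrupt, and expired cases but not the directory-permission check, which fails fast before any of this applies.

```python
from datetime import datetime, timedelta, timezone
from enum import Enum
from typing import Optional


class CertAction(Enum):
    LOAD = "load"            # reuse the persisted PFX (same thumbprint)
    GENERATE = "generate"    # create or overwrite, then persist
    FAIL_FAST = "fail_fast"  # abort startup with a diagnostic


def decide_cert_action(
    exists: bool,
    readable: bool,
    not_after: Optional[datetime],
    now: datetime,
    regenerate_if_expired: bool = True,
) -> CertAction:
    """Decide what to do with the persisted certificate at startup."""
    if not exists:
        return CertAction.GENERATE
    if not readable or not_after is None:
        # Corrupt or unreadable: warn, regenerate, overwrite.
        return CertAction.GENERATE
    if not_after <= now:
        # Expired: regenerate only when the operator allows it.
        return CertAction.GENERATE if regenerate_if_expired else CertAction.FAIL_FAST
    return CertAction.LOAD


now = datetime(2026, 6, 1, tzinfo=timezone.utc)
still_valid = now + timedelta(days=365)
print(decide_cert_action(True, True, still_valid, now).value)  # load
```

Separating the decision from the file I/O keeps the load-or-create tests free of real certificates: each case in the list above becomes one call with plain values.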
Documentation (same commit as source, per CLAUDE.md)
- docs/GatewayConfiguration.md: extend the TLS section to cover auto-generation, the MxGateway:Tls:* block, persistence location/ACL, thumbprint logging, and operator override via Kestrel:Endpoints:*:Certificate.
- Each client README + *ClientDesign.md: "TLS is lenient by default; pin a CA to verify," with Python TOFU and any Rust caveat noted.
- docs/DesignDecisions.md: record both posture choices and the why (internal tool, no PKI) so they are not mistaken for an oversight.
Testing
Gateway (MxGateway.Tests, no MXAccess):
- SelfSignedCertificateProvider: SANs, server-auth EKU, notAfter ≈ now + ValidityYears, ECDSA P-256.
- Load-or-create: valid persisted PFX reused (same thumbprint); expired regenerates when enabled; corrupt regenerates with a warning.
- Detection: HTTPS-without-cert engages; all-plaintext no-ops and writes no file; endpoint with its own cert is not overridden.
- GatewayOptionsValidator: new Tls:* rules.
- Host integration: Kestrel:Endpoints:Http:Url=https://127.0.0.1:0 builds and binds (today it throws "no certificate specified").
Clients: each test project gets a lenient-TLS test against a throwaway self-signed cert — connect with no CA succeeds; pinning a wrong CA fails (proves pinning still verifies). Python exercises the pre-fetch path; mark opt-in if loopback timing is flaky. Standard (non-live) tests; no MXAccess or external services.
Cross-language: add a TLS variant note to docs/CrossLanguageSmokeMatrix.md;
running the matrix over TLS stays manual/opt-in, consistent with the existing gate.
Per-component verification follows CLAUDE.md's source-update table (build + test each touched component independently).