Gateway TLS Auto-Certificate and Lenient Client Trust — Design

Date: 2026-06-01
Status: Approved (brainstorming), pending implementation plan

Problem

The gateway can serve gRPC and the dashboard over TLS, but only if an operator supplies a certificate via the Kestrel https:// endpoint config. With no cert, an https endpoint fails at startup with Kestrel's opaque "No server certificate was specified" error. Both current deployments therefore run plaintext (h2c), exposing the API key and request payloads on the wire.

mxaccessgw is an internal tool. The goal is for TLS to "just work" with zero PKI management: the gateway fabricates its own long-lived certificate when an HTTPS endpoint is configured without one, and clients accept whatever certificate is presented unless an operator explicitly opts into pinning.

Decisions

  1. Gateway = fill-missing-cert-only. No new "enable TLS" switch. TLS is still driven by configuring a Kestrel https:// endpoint. New behavior: when an HTTPS endpoint has no Certificate section, the gateway generates/loads a persisted self-signed cert instead of failing. Plaintext-only hosts are untouched — no certificate or key material is ever written for them.
  2. Persist & reuse. The self-signed cert is saved as a PFX under C:\ProgramData\MxGateway\certs, reused across restarts, and regenerated only if missing, expired, or unreadable. This keeps the thumbprint stable across restarts, so any client that pins the cert as its CA keeps working.
  3. Clients = lenient TLS, plaintext default. When a client connects over TLS without a pinned CA, it skips verification (accepts any cert). Pinning a CA file restores full verification. The per-client connection default (mostly plaintext/http) does not change — TLS is still opt-in via the endpoint scheme.

Scope boundary: the gateway↔worker named-pipe IPC is unchanged (local, OS-secured by the pipe ACL). This work touches only the public gRPC/dashboard transport and the five language clients.

Gateway component

New type SelfSignedCertificateProvider in src/ZB.MOM.WW.MxGateway.Server/Security/Tls/.

  1. Detect need. Inspect Kestrel:Endpoints:* configuration at startup. If any endpoint has an https:// URL and no Certificate subsection, a default cert is needed. If none do, the provider is a no-op (no file written).
  2. Load-or-create. Look for the persisted PFX. If present, valid, and unexpired, load it. Otherwise generate and persist.
  3. Generate. CertificateRequest with ECDSA P-256, notBefore = now - 1 day (clock-skew slack), notAfter = now + ValidityYears. SANs: DNS=localhost, DNS=<MachineName>, DNS=<MachineName.FQDN> when resolvable, plus IP=127.0.0.1 and IP=::1. Server-auth EKU.
  4. Persist securely. Write the PFX with an empty export password (a random in-memory password cannot be reused across restarts, which the persist-and-reuse decision requires); protect the private key with a restrictive ACL (SYSTEM + Administrators + service account) on the certs directory and file on Windows, and 0600 on non-Windows; atomic write (temp + rename). After generating, the cert is reloaded from the persisted PFX so Kestrel always serves the on-disk key.
  5. Wire into Kestrel. In GatewayApplication.CreateBuilder, add builder.WebHost.ConfigureKestrel(o => o.ConfigureHttpsDefaults(h => h.ServerCertificate = cert)). ConfigureHttpsDefaults supplies the cert only for HTTPS endpoints that did not specify their own, so an operator-configured Kestrel:Endpoints:*:Certificate transparently overrides it. One hook covers both the gRPC and dashboard ports.
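
Steps 1 and 4 above can be sketched in a few lines. The gateway itself is .NET, so this Python is only an illustration of the logic, not the provider's code. The function names `needs_default_cert` and `persist_atomically` are hypothetical, the endpoint dictionary is assumed to mirror the Kestrel:Endpoints section as parsed JSON, and the Windows ACL step is not modelled.

```python
import os
import tempfile


def needs_default_cert(kestrel_endpoints: dict) -> bool:
    """Return True if any endpoint has an https:// URL and no Certificate subsection."""
    for endpoint in kestrel_endpoints.values():
        url = endpoint.get("Url", "")
        if url.lower().startswith("https://") and not endpoint.get("Certificate"):
            return True
    return False


def persist_atomically(path: str, pfx_bytes: bytes) -> None:
    """Write to a temp file in the target directory, then rename it over the final path."""
    directory = os.path.dirname(path) or "."
    os.makedirs(directory, exist_ok=True)
    # The temp file lives in the same directory so the rename stays on one volume.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as handle:
            handle.write(pfx_bytes)
            handle.flush()
            os.fsync(handle.fileno())
        if os.name != "nt":
            # Non-Windows hosts: owner read/write only (0600).
            os.chmod(tmp_path, 0o600)
        os.replace(tmp_path, path)
    except BaseException:
        # Do not leave a half-written temp file behind on failure.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

An all-plaintext configuration makes `needs_default_cert` return False, which corresponds to the no-op path where no file is written.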

New config block MxGateway:Tls

All optional; the zero-config path needs none of them.

| Option | Default | Purpose |
| --- | --- | --- |
| Tls:SelfSignedCertPath | C:\ProgramData\MxGateway\certs\gateway-selfsigned.pfx | Where the generated cert lives |
| Tls:ValidityYears | 10 | Lifetime of the generated cert |
| Tls:AdditionalDnsNames | [] | Extra SANs (e.g. a load-balancer name) |
| Tls:RegenerateIfExpired | true | Auto-replace an expired persisted cert |

Validated by GatewayOptionsValidator: ValidityYears is in the range 1 to 100, SelfSignedCertPath has a valid path shape when non-blank, and AdditionalDnsNames entries are non-blank. (The "https endpoint exists but cert path is blank" fail-fast lives in the bootstrap/provider, not the validator, because the validator only sees the MxGateway section, not Kestrel:Endpoints.)
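
The validation rules can be illustrated as follows. This is a Python sketch of the checks, not GatewayOptionsValidator itself. It assumes the ValidityYears range is 1 to 100, and it approximates "valid path shape" as "contains no NUL and does not end in a directory separator", which is a guess at the intent. `TlsOptions` and `validate_tls_options` are hypothetical names.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class TlsOptions:
    # Defaults taken from the option table above.
    self_signed_cert_path: str = r"C:\ProgramData\MxGateway\certs\gateway-selfsigned.pfx"
    validity_years: int = 10
    additional_dns_names: List[str] = field(default_factory=list)
    regenerate_if_expired: bool = True


def validate_tls_options(options: TlsOptions) -> List[str]:
    """Return a list of human-readable errors; an empty list means valid."""
    errors = []
    # Assumed range: 1 to 100 years inclusive.
    if not 1 <= options.validity_years <= 100:
        errors.append("Tls:ValidityYears must be between 1 and 100.")
    path = options.self_signed_cert_path
    # A blank path is left to the bootstrap/provider; only non-blank paths are checked here.
    if path.strip() and ("\0" in path or path.endswith(("\\", "/"))):
        errors.append("Tls:SelfSignedCertPath must name a file.")
    if any(not name.strip() for name in options.additional_dns_names):
        errors.append("Tls:AdditionalDnsNames entries must be non-blank.")
    return errors
```

Returning a list of messages rather than raising on the first failure lets the startup diagnostic report every bad option at once.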

Logging: on generate/load, log thumbprint + SAN list + notAfter at Information. Never log the PFX password or private key.
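
An operator who wants to compare the logged thumbprint with the certificate a client actually received could compute it as below. This sketch assumes the logged value is the usual .NET thumbprint, that is, the SHA-1 digest of the DER-encoded certificate in uppercase hex. The design does not state which digest is logged, and `thumbprint` is a hypothetical helper name.

```python
import hashlib
import ssl


def thumbprint(der_bytes: bytes) -> str:
    """SHA-1 of the DER-encoded certificate, uppercase hex (assumed to match the log)."""
    return hashlib.sha1(der_bytes).hexdigest().upper()


def thumbprint_from_pem(pem_text: str) -> str:
    """Convert a PEM certificate to DER first, then hash it."""
    return thumbprint(ssl.PEM_cert_to_DER_cert(pem_text))
```

The PEM text can come from `ssl.get_server_certificate((host, port))`, so the comparison needs no extra tooling on the client machine.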

Client lenient-TLS behavior

Uniform rule: TLS on + no CA pinned ⇒ skip verification; CA pinned ⇒ full verification. No transport default changes. Each client also exposes an explicit switch to force-disable leniency (strict-without-pinning) for the future.
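
The uniform rule is a three-way decision, shown here with Python's standard `ssl` module. This illustrates the rule only. It is not the mechanism of any of the five clients, which configure their own gRPC stacks. `build_client_context` and its `strict` parameter are hypothetical names for the pinning input and the force-disable switch.

```python
import ssl
from typing import Optional


def build_client_context(ca_file: Optional[str] = None, strict: bool = False) -> ssl.SSLContext:
    """Pinned CA => full verification; no CA => lenient unless strict is set."""
    if ca_file:
        # Pinned: trust only the given CA file; host-name checking stays on.
        context = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
        context.load_verify_locations(cafile=ca_file)
        return context
    # No pin: start from the platform defaults (system roots, verification on).
    context = ssl.create_default_context()
    if not strict:
        # Lenient: accept any certificate. check_hostname must be cleared
        # before verify_mode can be set to CERT_NONE.
        context.check_hostname = False
        context.verify_mode = ssl.CERT_NONE
    return context
```

With `strict=True` and no CA file the context verifies against the system roots, so a self-signed gateway cert would be rejected. That is the strict-without-pinning posture.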

| Client | Mechanism | Effort |
| --- | --- | --- |
| .NET | In CreateHttpHandler, when UseTls and CaCertificatePath empty, set SslOptions.RemoteCertificateValidationCallback = (_,_,_,_) => true. CA path keeps existing custom-root validation. | trivial |
| Go | In buildCredentials, when TLS and no CACertFile/TLSConfig, use tls.Config{InsecureSkipVerify: true, ServerName: override}. | trivial |
| Java | grpc-netty-shaded 1.76.0 ships InsecureTrustManagerFactory. When TLS and no CA, build GrpcSslContexts.forClient().trustManager(InsecureTrustManagerFactory.INSTANCE). | easy |
| Python | grpc-python has no per-channel skip-verify. Fetch the server leaf cert at connect via ssl.get_server_certificate((host, port)), pass it as root_certificates to ssl_channel_credentials, plus grpc.ssl_target_name_override. Effectively trusts what is presented (TOFU). | moderate, special-cased |
| Rust | tonic 0.13.1 + rustls (tls-ring). Implement a custom rustls::client::danger::ServerCertVerifier that accepts everything, build a rustls::ClientConfig via .dangerous().with_custom_certificate_verifier(...), feed it to the channel. May require a custom hyper-rustls connector if ClientTlsConfig will not take a raw rustls config. Needs an API spike. | highest |
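
The Python row is the least obvious, so here is a minimal sketch of the pre-fetch approach. It assumes the target is given as `host:port` and that the override name (`localhost` here) matches a SAN on the presented cert. `split_target` and `open_lenient_channel` are hypothetical names, and the real client's connect-error wrapping is not shown. `grpc` is imported lazily so the helper stays usable without it.

```python
import ssl
from typing import Tuple


def split_target(target: str) -> Tuple[str, int]:
    """Split 'host:port' (or '[v6]:port') into (host, port)."""
    host, sep, port = target.rpartition(":")
    if not sep or not port.isdigit():
        raise ValueError(f"expected host:port, got {target!r}")
    return host.strip("[]"), int(port)


def open_lenient_channel(target: str, name_override: str = "localhost"):
    """Trust whatever leaf cert the server presents on this connect (TOFU)."""
    import grpc  # third-party dependency (grpcio)

    host, port = split_target(target)
    # Unverified probe: returns the server's leaf certificate as PEM text.
    pem = ssl.get_server_certificate((host, port))
    # Use the fetched leaf as the only trusted root for this channel.
    credentials = grpc.ssl_channel_credentials(root_certificates=pem.encode("ascii"))
    options = [("grpc.ssl_target_name_override", name_override)]
    return grpc.secure_channel(target, credentials, options=options)
```

Because the probe and the channel are separate connections, the channel still performs a real handshake against the fetched cert. This is why the behavior is trust-on-first-use rather than a literal skip-verify.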

Honesty caveats

  • Python is not literally "ignore the cert"; it pins whatever the server presents on first contact via a separate unverified TLS probe. For a self-signed internal cert this is the intended outcome. Documented as a difference.
  • Rust leniency depends on the tonic 0.13 TLS surface. If a custom verifier is disproportionately invasive, the fallback is to require a CA file for Rust TLS (pin-only) and document Rust as the exception.

Error handling

Gateway:

  • Cert dir not writable / ACL fails ⇒ fail fast at startup with a diagnostic naming the path and required permission. No silent in-memory fallback.
  • Persisted PFX corrupt/unreadable ⇒ warn, regenerate, overwrite.
  • Persisted cert expired ⇒ regenerate if RegenerateIfExpired (default), else fail fast instructing the operator to delete it or enable regeneration.
  • HTTPS endpoint configured but generation disabled / path empty ⇒ the bootstrap/provider fails fast at startup (GatewayOptionsValidator sees only the MxGateway section, not Kestrel:Endpoints) rather than letting Kestrel throw its opaque error.
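
The gateway cases above reduce to a small decision table, sketched here in Python for illustration. `PfxState`, `Action`, and `decide` are hypothetical names. The sketch also assumes that a valid persisted cert is loaded without needing write access to the directory, which the design does not state explicitly.

```python
from enum import Enum


class PfxState(Enum):
    MISSING = "missing"
    VALID = "valid"
    EXPIRED = "expired"
    UNREADABLE = "unreadable"


class Action(Enum):
    LOAD = "load"
    GENERATE = "generate"
    WARN_AND_REGENERATE = "warn-and-regenerate"
    FAIL_FAST = "fail-fast"


def decide(state: PfxState, regenerate_if_expired: bool = True,
           directory_writable: bool = True) -> Action:
    """Map the persisted-PFX state to the startup action."""
    if state is PfxState.VALID:
        return Action.LOAD
    # Every remaining path has to write a new PFX; there is no in-memory fallback.
    if not directory_writable:
        return Action.FAIL_FAST
    if state is PfxState.UNREADABLE:
        return Action.WARN_AND_REGENERATE
    if state is PfxState.EXPIRED and not regenerate_if_expired:
        return Action.FAIL_FAST
    return Action.GENERATE
```

Keeping the decision a pure function of its inputs makes each load-or-create case a one-line unit test.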

Clients: surface unchanged. Skip-verify cannot itself raise. Python's pre-fetch wraps connect failure into the existing connect-error type with the endpoint in the message. Rust pin-only fallback surfaces the existing CA-file error.

Documentation (same commit as source, per CLAUDE.md)

  • docs/GatewayConfiguration.md — extend the TLS section: auto-generation, the MxGateway:Tls:* block, persistence location/ACL, thumbprint logging, operator override via Kestrel:Endpoints:*:Certificate.
  • Each client README + *ClientDesign.md — "TLS is lenient by default; pin a CA to verify," with Python TOFU and any Rust caveat noted.
  • docs/DesignDecisions.md — record both posture choices and the why (internal tool, no PKI) so they are not mistaken for an oversight.

Testing

Gateway (MxGateway.Tests, no MXAccess):

  • SelfSignedCertificateProvider: SANs, server-auth EKU, notAfter ≈ now + ValidityYears, ECDSA P-256.
  • Load-or-create: valid persisted PFX reused (same thumbprint); expired regenerates when enabled; corrupt regenerates with a warning.
  • Detection: HTTPS-without-cert engages; all-plaintext no-ops and writes no file; endpoint with its own cert is not overridden.
  • GatewayOptionsValidator: new Tls:* rules.
  • Host integration: Kestrel:Endpoints:Http:Url=https://127.0.0.1:0 builds and binds (today it throws "no certificate specified").

Clients: each test project gets a lenient-TLS test against a throwaway self-signed cert — connect with no CA succeeds; pinning a wrong CA fails (proves pinning still verifies). Python exercises the pre-fetch path; mark opt-in if loopback timing is flaky. Standard (non-live) tests; no MXAccess or external services.

Cross-language: add a TLS variant note to docs/CrossLanguageSmokeMatrix.md; running the matrix over TLS stays manual/opt-in, consistent with the existing gate.

Per-component verification follows CLAUDE.md's source-update table (build + test each touched component independently).