Files
mxaccessgw/docs/plans/2026-06-01-gateway-cert-autogen-design.md
T

157 lines
9.1 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# Gateway TLS Auto-Certificate and Lenient Client Trust — Design
Date: 2026-06-01
Status: Approved (brainstorming), pending implementation plan
## Problem
The gateway can serve gRPC and the dashboard over TLS, but only if an operator
supplies a certificate via the Kestrel `https://` endpoint config. With no cert,
an `https` endpoint fails at startup with Kestrel's opaque "No server certificate
was specified" error. Both current deployments therefore run plaintext (`h2c`),
exposing the API key and request payloads on the wire.
`mxaccessgw` is an internal tool. The goal is for TLS to "just work" with zero PKI
management: the gateway fabricates its own long-lived certificate when an HTTPS
endpoint is configured without one, and clients accept whatever certificate is
presented unless an operator explicitly opts into pinning.
## Decisions
1. **Gateway = fill-missing-cert-only.** No new "enable TLS" switch. TLS is still
driven by configuring a Kestrel `https://` endpoint. New behavior: when an
HTTPS endpoint has no `Certificate` section, the gateway generates/loads a
persisted self-signed cert instead of failing. Plaintext-only hosts are
untouched — no certificate or key material is ever written for them.
2. **Persist & reuse.** The self-signed cert is saved as a PFX under
`C:\ProgramData\MxGateway\certs`, reused across restarts, regenerated only if
missing, expired, or unreadable. Stable thumbprint; survives restarts; any
CA-pinning client keeps working.
3. **Clients = lenient TLS, plaintext default.** When a client connects over TLS
without a pinned CA, it skips verification (accepts any cert). Pinning a CA file
restores full verification. The per-client connection default (mostly
plaintext/`http`) does not change — TLS is still opt-in via the endpoint scheme.
**Scope boundary:** the gateway↔worker named-pipe IPC is unchanged (local,
OS-secured by the pipe ACL). This work touches only the public gRPC/dashboard
transport and the five language clients.
## Gateway component
New type `SelfSignedCertificateProvider` in
`src/ZB.MOM.WW.MxGateway.Server/Security/Tls/`.
1. **Detect need.** Inspect `Kestrel:Endpoints:*` configuration at startup. If any
endpoint has an `https://` URL and no `Certificate` subsection, a default cert
is needed. If none do, the provider is a no-op (no file written).
2. **Load-or-create.** Look for the persisted PFX. If present, valid, and
unexpired, load it. Otherwise generate and persist.
3. **Generate.** `CertificateRequest` with **ECDSA P-256**, `notBefore = now - 1
day` (clock-skew slack), `notAfter = now + ValidityYears`. SANs: `DNS=localhost`,
`DNS=<MachineName>`, `DNS=<MachineName.FQDN>` when resolvable, plus
`IP=127.0.0.1` and `IP=::1`. Server-auth EKU.
4. **Persist securely.** Write the PFX with an **empty** export password (a random
in-memory password cannot be reused across restarts, which the persist-and-reuse
decision requires); protect the private key with a restrictive ACL (SYSTEM +
Administrators + service account) on the `certs` directory and file on Windows,
and `0600` on non-Windows; atomic write (temp + rename). After generating, the
cert is reloaded from the persisted PFX so Kestrel always serves the on-disk key.
5. **Wire into Kestrel.** In `GatewayApplication.CreateBuilder`, add
`builder.WebHost.ConfigureKestrel(o => o.ConfigureHttpsDefaults(h =>
h.ServerCertificate = cert))`. `ConfigureHttpsDefaults` supplies the cert only
for HTTPS endpoints that did not specify their own, so an operator-configured
`Kestrel:Endpoints:*:Certificate` transparently overrides it. One hook covers
both the gRPC and dashboard ports.
### New config block `MxGateway:Tls`
All optional; the zero-config path needs none of them.
| Option | Default | Purpose |
|---|---|---|
| `Tls:SelfSignedCertPath` | `C:\ProgramData\MxGateway\certs\gateway-selfsigned.pfx` | Where the generated cert lives |
| `Tls:ValidityYears` | `10` | Lifetime of the generated cert |
| `Tls:AdditionalDnsNames` | `[]` | Extra SANs (e.g. a load-balancer name) |
| `Tls:RegenerateIfExpired` | `true` | Auto-replace an expired persisted cert |
Validated by `GatewayOptionsValidator`: `ValidityYears` in 1100,
`SelfSignedCertPath` is a valid path shape when non-blank, and
`AdditionalDnsNames` entries are non-blank. (The "https endpoint exists but cert
path is blank" fail-fast lives in the bootstrap/provider, not the validator,
because the validator only sees the `MxGateway` section, not `Kestrel:Endpoints`.)
**Logging:** on generate/load, log thumbprint + SAN list + `notAfter` at
Information. Never log the PFX password or private key.
## Client lenient-TLS behavior
Uniform rule: **TLS on + no CA pinned ⇒ skip verification; CA pinned ⇒ full
verification.** No transport default changes. Each client also exposes an explicit
switch to force-disable leniency (strict-without-pinning) for the future.
| Client | Mechanism | Effort |
|---|---|---|
| .NET | In `CreateHttpHandler`, when `UseTls` and `CaCertificatePath` empty, set `SslOptions.RemoteCertificateValidationCallback = (_,_,_,_) => true`. CA path keeps existing custom-root validation. | trivial |
| Go | In `buildCredentials`, when TLS and no `CACertFile`/`TLSConfig`, use `tls.Config{InsecureSkipVerify: true, ServerName: override}`. | trivial |
| Java | grpc-netty-shaded 1.76.0 ships `InsecureTrustManagerFactory`. When TLS and no CA, build `GrpcSslContexts.forClient().trustManager(InsecureTrustManagerFactory.INSTANCE)`. | easy |
| Python | grpc-python has no per-channel skip-verify. Fetch the server leaf cert at connect via `ssl.get_server_certificate((host, port))`, pass it as `root_certificates` to `ssl_channel_credentials`, plus `grpc.ssl_target_name_override`. Effectively trusts what is presented (TOFU). | moderate, special-cased |
| Rust | tonic 0.13.1 + rustls (`tls-ring`). Implement a custom `rustls::client::danger::ServerCertVerifier` that accepts everything, build a `rustls::ClientConfig` via `.dangerous().with_custom_certificate_verifier(...)`, feed it to the channel. May require a custom hyper-rustls connector if `ClientTlsConfig` will not take a raw rustls config. **Needs an API spike.** | highest |
### Honesty caveats
- **Python** is not literally "ignore the cert"; it pins whatever the server
presents on first contact via a separate unverified TLS probe. For a self-signed
internal cert this is the intended outcome. Documented as a difference.
- **Rust** leniency depends on the tonic 0.13 TLS surface. If a custom verifier is
disproportionately invasive, the fallback is to require a CA file for Rust TLS
(pin-only) and document Rust as the exception.
## Error handling
Gateway:
- Cert dir not writable / ACL fails ⇒ fail fast at startup with a diagnostic naming
the path and required permission. No silent in-memory fallback.
- Persisted PFX corrupt/unreadable ⇒ warn, regenerate, overwrite.
- Persisted cert expired ⇒ regenerate if `RegenerateIfExpired` (default), else fail
fast instructing the operator to delete it or enable regeneration.
- HTTPS endpoint configured but generation disabled / path empty ⇒ validator
rejects at startup rather than letting Kestrel throw its opaque error.
Clients: surface unchanged. Skip-verify cannot itself raise. Python's pre-fetch
wraps connect failure into the existing connect-error type with the endpoint in the
message. Rust pin-only fallback surfaces the existing CA-file error.
## Documentation (same commit as source, per CLAUDE.md)
- `docs/GatewayConfiguration.md` — extend the TLS section: auto-generation, the
`MxGateway:Tls:*` block, persistence location/ACL, thumbprint logging, operator
override via `Kestrel:Endpoints:*:Certificate`.
- Each client README + `*ClientDesign.md` — "TLS is lenient by default; pin a CA to
verify," with Python TOFU and any Rust caveat noted.
- `docs/DesignDecisions.md` — record both posture choices and the why (internal
tool, no PKI) so they are not mistaken for an oversight.
## Testing
Gateway (`MxGateway.Tests`, no MXAccess):
- `SelfSignedCertificateProvider`: SANs, server-auth EKU, `notAfter ≈ now +
ValidityYears`, ECDSA P-256.
- Load-or-create: valid persisted PFX reused (same thumbprint); expired regenerates
when enabled; corrupt regenerates with a warning.
- Detection: HTTPS-without-cert engages; all-plaintext no-ops and writes no file;
endpoint with its own cert is not overridden.
- `GatewayOptionsValidator`: new `Tls:*` rules.
- Host integration: `Kestrel:Endpoints:Http:Url=https://127.0.0.1:0` builds and
binds (today it throws "no certificate specified").
Clients: each test project gets a lenient-TLS test against a throwaway self-signed
cert — connect with no CA succeeds; pinning a wrong CA fails (proves pinning still
verifies). Python exercises the pre-fetch path; mark opt-in if loopback timing is
flaky. Standard (non-live) tests; no MXAccess or external services.
Cross-language: add a TLS variant note to `docs/CrossLanguageSmokeMatrix.md`;
running the matrix over TLS stays manual/opt-in, consistent with the existing gate.
Per-component verification follows CLAUDE.md's source-update table (build + test
each touched component independently).