# Gateway TLS Auto-Certificate and Lenient Client Trust — Design

Date: 2026-06-01
Status: Approved (brainstorming), pending implementation plan

## Problem

The gateway can serve gRPC and the dashboard over TLS, but only if an operator
supplies a certificate via the Kestrel `https://` endpoint config. With no cert,
an `https` endpoint fails at startup with Kestrel's opaque "No server certificate
was specified" error. Both current deployments therefore run plaintext (`h2c`),
exposing the API key and request payloads on the wire.

`mxaccessgw` is an internal tool. The goal is for TLS to "just work" with zero PKI
management: the gateway fabricates its own long-lived certificate when an HTTPS
endpoint is configured without one, and clients accept whatever certificate is
presented unless an operator explicitly opts into pinning.

## Decisions

1. **Gateway = fill-missing-cert-only.** No new "enable TLS" switch. TLS is still
   driven by configuring a Kestrel `https://` endpoint. New behavior: when an
   HTTPS endpoint has no `Certificate` section, the gateway generates/loads a
   persisted self-signed cert instead of failing. Plaintext-only hosts are
   untouched: no certificate or key material is ever written for them.
2. **Persist & reuse.** The self-signed cert is saved as a PFX under
   `C:\ProgramData\MxGateway\certs`, reused across restarts, and regenerated only
   if missing, expired, or unreadable. The thumbprint therefore stays stable
   across restarts, so any CA-pinning client keeps working.
3. **Clients = lenient TLS, plaintext default.** When a client connects over TLS
   without a pinned CA, it skips verification (accepts any cert). Pinning a CA file
   restores full verification. The per-client connection default (mostly
   plaintext/`http`) does not change; TLS is still opt-in via the endpoint scheme.

**Scope boundary:** the gateway↔worker named-pipe IPC is unchanged (local,
OS-secured by the pipe ACL). This work touches only the public gRPC/dashboard
transport and the five language clients.

## Gateway component

New type `SelfSignedCertificateProvider` in
`src/ZB.MOM.WW.MxGateway.Server/Security/Tls/`.

1. **Detect need.** Inspect `Kestrel:Endpoints:*` configuration at startup. If any
   endpoint has an `https://` URL and no `Certificate` subsection, a default cert
   is needed. If none do, the provider is a no-op (no file written).
2. **Load-or-create.** Look for the persisted PFX. If present, valid, and
   unexpired, load it. Otherwise generate and persist.
3. **Generate.** `CertificateRequest` with **ECDSA P-256**,
   `notBefore = now - 1 day` (clock-skew slack), `notAfter = now + ValidityYears`.
   SANs: `DNS=localhost`, `DNS=<MachineName>`, `DNS=<MachineName.FQDN>` when
   resolvable, `IP=127.0.0.1` and `IP=::1`, plus any `Tls:AdditionalDnsNames`
   entries. Server-auth EKU.
4. **Persist securely.** Write the PFX with an **empty** export password, because
   a random in-memory password could not be reused across restarts, which the
   persist-and-reuse decision requires. Protect the private key instead with a
   restrictive ACL (SYSTEM + Administrators + service account) on the `certs`
   directory and file on Windows, and mode `0600` on non-Windows. Write atomically
   (temp + rename). After generating, reload the cert from the persisted PFX so
   Kestrel always serves the on-disk key.
5. **Wire into Kestrel.** In `GatewayApplication.CreateBuilder`, add
   `builder.WebHost.ConfigureKestrel(o => o.ConfigureHttpsDefaults(h => h.ServerCertificate = cert))`.
   `ConfigureHttpsDefaults` supplies the cert only for HTTPS endpoints that did not
   specify their own, so an operator-configured `Kestrel:Endpoints:*:Certificate`
   transparently overrides it. One hook covers both the gRPC and dashboard ports.
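
Steps 1 and 4 above can be sketched as follows. The gateway itself is .NET, so
this Python sketch only illustrates the logic; `needs_default_certificate`,
`write_private_file_atomically`, and the dictionary standing in for
`Kestrel:Endpoints` are hypothetical names, not the provider's API. It covers only
the non-Windows `0600` path; the Windows ACL step is omitted.

```python
import os
import tempfile


def needs_default_certificate(endpoints: dict) -> bool:
    """Return True if any endpoint uses https:// and has no Certificate subsection.

    `endpoints` is an assumed stand-in for the Kestrel:Endpoints config section,
    e.g. {"Grpc": {"Url": "https://0.0.0.0:5001"}}.
    """
    for endpoint in endpoints.values():
        url = str(endpoint.get("Url", "")).strip().lower()
        if url.startswith("https://") and not endpoint.get("Certificate"):
            return True
    return False


def write_private_file_atomically(path: str, data: bytes) -> None:
    """Write `data` to `path` via temp file + rename, owner-only on POSIX."""
    directory = os.path.dirname(os.path.abspath(path))
    os.makedirs(directory, mode=0o700, exist_ok=True)
    # mkstemp creates the temp file with mode 0600 inside the target directory,
    # so the rename stays on one filesystem and the key is never world-readable.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".pfx-", suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as handle:
            handle.write(data)
            handle.flush()
            os.fsync(handle.fileno())
        os.replace(tmp_path, path)  # atomic swap: readers see old or new, never partial
    except BaseException:
        # Do not leave a half-written temp file behind.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

With this shape, an all-plaintext host never reaches the write call, which matches
the "no file written" guarantee in step 1.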

### New config block `MxGateway:Tls`

All optional; the zero-config path needs none of them.

| Option | Default | Purpose |
|---|---|---|
| `Tls:SelfSignedCertPath` | `C:\ProgramData\MxGateway\certs\gateway-selfsigned.pfx` | Where the generated cert lives |
| `Tls:ValidityYears` | `10` | Lifetime of the generated cert |
| `Tls:AdditionalDnsNames` | `[]` | Extra SANs (e.g. a load-balancer name) |
| `Tls:RegenerateIfExpired` | `true` | Auto-replace an expired persisted cert |

Validated by `GatewayOptionsValidator`: `ValidityYears` in 1–100,
`SelfSignedCertPath` is a valid path shape when non-blank, and
`AdditionalDnsNames` entries are non-blank. (The "https endpoint exists but cert
path is blank" fail-fast lives in the bootstrap/provider, not the validator,
because the validator only sees the `MxGateway` section, not `Kestrel:Endpoints`.)
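
The three validator rules can be illustrated with a minimal sketch. The real
`GatewayOptionsValidator` is .NET; the `TlsOptions` class, the
`validate_tls_options` function, and the error strings here are hypothetical, and
the "valid path shape" check is only an approximation (no NUL character, no
trailing separator), since the design does not define it further.

```python
from dataclasses import dataclass, field


@dataclass
class TlsOptions:
    # Defaults mirror the option table; the class itself is hypothetical.
    self_signed_cert_path: str = r"C:\ProgramData\MxGateway\certs\gateway-selfsigned.pfx"
    validity_years: int = 10
    additional_dns_names: list = field(default_factory=list)
    regenerate_if_expired: bool = True


def validate_tls_options(options: TlsOptions) -> list:
    """Return human-readable validation errors (empty list when valid)."""
    errors = []
    if not 1 <= options.validity_years <= 100:
        errors.append("Tls:ValidityYears must be between 1 and 100.")
    path = options.self_signed_cert_path
    # A blank path is allowed here; the blank-path fail-fast belongs to the
    # bootstrap/provider, which can also see Kestrel:Endpoints.
    if path.strip() and ("\0" in path or path.rstrip().endswith(("/", "\\"))):
        errors.append("Tls:SelfSignedCertPath must name a file, not a directory.")
    for index, name in enumerate(options.additional_dns_names):
        if not str(name).strip():
            errors.append(f"Tls:AdditionalDnsNames[{index}] must not be blank.")
    return errors
```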

**Logging:** on generate/load, log thumbprint + SAN list + `notAfter` at
Information. Never log the PFX password or private key.
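
As an illustration of what that log line carries, here is a small sketch. It
assumes the thumbprint follows the usual .NET convention (SHA-1 over the
DER-encoded certificate, upper-case hex); the function names and message layout
are hypothetical. The formatter accepts only public certificate data, so key
material cannot reach the log through it.

```python
import hashlib
from datetime import datetime


def thumbprint(der_certificate: bytes) -> str:
    """SHA-1 over the DER-encoded certificate, rendered as upper-case hex."""
    return hashlib.sha1(der_certificate).hexdigest().upper()


def describe_certificate(action: str, der_certificate: bytes,
                         sans: list, not_after: datetime) -> str:
    """Build the Information-level message from public certificate data only."""
    return (
        f"TLS certificate {action}: thumbprint={thumbprint(der_certificate)} "
        f"sans={','.join(sans)} notAfter={not_after.isoformat()}"
    )
```

An operator who later wants to pin can compare this thumbprint against the one
their client reports.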

## Client lenient-TLS behavior

Uniform rule: **TLS on + no CA pinned ⇒ skip verification; CA pinned ⇒ full
verification.** No transport default changes. Each client also exposes an explicit
switch to force-disable leniency (strict-without-pinning) for the future.

| Client | Mechanism | Effort |
|---|---|---|
| .NET | In `CreateHttpHandler`, when `UseTls` and `CaCertificatePath` empty, set `SslOptions.RemoteCertificateValidationCallback = (_,_,_,_) => true`. CA path keeps existing custom-root validation. | trivial |
| Go | In `buildCredentials`, when TLS and no `CACertFile`/`TLSConfig`, use `tls.Config{InsecureSkipVerify: true, ServerName: override}`. | trivial |
| Java | grpc-netty-shaded 1.76.0 ships `InsecureTrustManagerFactory`. When TLS and no CA, build `GrpcSslContexts.forClient().trustManager(InsecureTrustManagerFactory.INSTANCE)`. | easy |
| Python | grpc-python has no per-channel skip-verify. Fetch the server leaf cert at connect via `ssl.get_server_certificate((host, port))`, pass it as `root_certificates` to `ssl_channel_credentials`, plus `grpc.ssl_target_name_override`. Effectively trusts what is presented (TOFU). | moderate, special-cased |
| Rust | tonic 0.13.1 + rustls (`tls-ring`). Implement a custom `rustls::client::danger::ServerCertVerifier` that accepts everything, build a `rustls::ClientConfig` via `.dangerous().with_custom_certificate_verifier(...)`, feed it to the channel. May require a custom hyper-rustls connector if `ClientTlsConfig` will not take a raw rustls config. **Needs an API spike.** | highest |
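
Whatever the per-language mechanism, every client makes the same decision first.
A minimal sketch of the uniform rule follows; `TrustMode`, `select_trust_mode`,
and the `strict` parameter name are hypothetical. It also assumes that
"strict-without-pinning" means verifying against the platform trust store, which
the design does not spell out.

```python
from enum import Enum
from typing import Optional


class TrustMode(Enum):
    PLAINTEXT = "plaintext"          # no TLS at all (the unchanged default)
    SKIP_VERIFY = "skip-verify"      # TLS, accept any server certificate
    VERIFY_PINNED_CA = "verify-ca"   # TLS, verify against the pinned CA file
    VERIFY_SYSTEM = "verify-system"  # TLS, strict without pinning (assumed meaning)


def select_trust_mode(use_tls: bool, ca_file: Optional[str] = None,
                      strict: bool = False) -> TrustMode:
    """Apply the uniform rule: TLS + no CA => skip verification; CA => verify."""
    if not use_tls:
        return TrustMode.PLAINTEXT
    if ca_file:
        # Pinning always wins and restores full verification.
        return TrustMode.VERIFY_PINNED_CA
    if strict:
        # Explicit opt-out of leniency without supplying a CA.
        return TrustMode.VERIFY_SYSTEM
    return TrustMode.SKIP_VERIFY
```

Keeping the decision separate from the transport setup would let all five clients
share one set of test cases, even though only the last step differs per language.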

### Honesty caveats

- **Python** is not literally "ignore the cert"; it pins whatever the server
  presents on first contact via a separate unverified TLS probe. For a self-signed
  internal cert this is the intended outcome. Documented as a difference.
- **Rust** leniency depends on the tonic 0.13 TLS surface. If a custom verifier is
  disproportionately invasive, the fallback is to require a CA file for Rust TLS
  (pin-only) and document Rust as the exception.

## Error handling

Gateway:

- Cert dir not writable / ACL fails ⇒ fail fast at startup with a diagnostic naming
  the path and required permission. No silent in-memory fallback.
- Persisted PFX corrupt/unreadable ⇒ warn, regenerate, overwrite.
- Persisted cert expired ⇒ regenerate if `RegenerateIfExpired` (default), else fail
  fast instructing the operator to delete it or enable regeneration.
- HTTPS endpoint configured without a cert, but generation is disabled or
  `SelfSignedCertPath` is blank ⇒ the bootstrap/provider rejects at startup rather
  than letting Kestrel throw its opaque error. (The validator cannot do this; it
  only sees the `MxGateway` section.)

Clients: surface unchanged. Skip-verify cannot itself raise. Python's pre-fetch
wraps connect failure into the existing connect-error type with the endpoint in the
message. Rust pin-only fallback surfaces the existing CA-file error.
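
The gateway-side rules for the persisted PFX in this section reduce to a small
decision function. The sketch below is illustrative only: `CertAction` and
`decide_certificate_action` are hypothetical names, and the inputs are assumed to
come from an earlier attempt to open and parse the file.

```python
from datetime import datetime
from enum import Enum
from typing import Optional


class CertAction(Enum):
    LOAD = "load"                                    # reuse the persisted cert
    GENERATE = "generate"                            # create and persist a new cert
    GENERATE_WITH_WARNING = "generate-with-warning"  # corrupt file: warn, overwrite
    FAIL = "fail"                                    # fail fast with operator guidance


def decide_certificate_action(exists: bool, readable: bool,
                              not_after: Optional[datetime],
                              regenerate_if_expired: bool,
                              now: datetime) -> CertAction:
    """Map the state of the persisted PFX to the action described in the design."""
    if not exists:
        return CertAction.GENERATE
    if not readable or not_after is None:
        # Corrupt or unreadable PFX: warn, regenerate, overwrite.
        return CertAction.GENERATE_WITH_WARNING
    if not_after <= now:
        # Expired: regenerate only when RegenerateIfExpired allows it.
        return CertAction.GENERATE if regenerate_if_expired else CertAction.FAIL
    return CertAction.LOAD
```

Directory and ACL failures are deliberately absent from this function: they occur
while writing, and the design calls for failing fast there with no in-memory
fallback.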

## Documentation (same commit as source, per CLAUDE.md)

- `docs/GatewayConfiguration.md`: extend the TLS section with auto-generation, the
  `MxGateway:Tls:*` block, persistence location/ACL, thumbprint logging, and
  operator override via `Kestrel:Endpoints:*:Certificate`.
- Each client README + `*ClientDesign.md`: "TLS is lenient by default; pin a CA to
  verify," with Python TOFU and any Rust caveat noted.
- `docs/DesignDecisions.md`: record both posture choices and the why (internal
  tool, no PKI) so they are not mistaken for an oversight.

## Testing

Gateway (`MxGateway.Tests`, no MXAccess):

- `SelfSignedCertificateProvider`: SANs, server-auth EKU,
  `notAfter ≈ now + ValidityYears`, ECDSA P-256.
- Load-or-create: valid persisted PFX reused (same thumbprint); expired regenerates
  when enabled; corrupt regenerates with a warning.
- Detection: HTTPS-without-cert engages; all-plaintext no-ops and writes no file;
  endpoint with its own cert is not overridden.
- `GatewayOptionsValidator`: new `Tls:*` rules.
- Host integration: `Kestrel:Endpoints:Http:Url=https://127.0.0.1:0` builds and
  binds (today it throws "No server certificate was specified").

Clients: each test project gets a lenient-TLS test against a throwaway self-signed
cert: connecting with no CA succeeds; pinning a wrong CA fails (proves pinning
still verifies). Python exercises the pre-fetch path; mark opt-in if loopback
timing is flaky. Standard (non-live) tests; no MXAccess or external services.

Cross-language: add a TLS variant note to `docs/CrossLanguageSmokeMatrix.md`;
running the matrix over TLS stays manual/opt-in, consistent with the existing gate.

Per-component verification follows CLAUDE.md's source-update table (build + test
each touched component independently).