Add Polly resilience policies

Joseph Doherty
2026-04-27 15:37:56 -04:00
parent d431ff9660
commit bd4a09a35e
22 changed files with 611 additions and 21 deletions
@@ -37,6 +37,19 @@ The default probe only verifies that the worker did not exit immediately. The
worker client replaces this probe when pipe connection, hello, and
`WorkerReady` handling are implemented.

Startup probing uses a bounded Polly retry policy. The gateway starts the worker
process once, then retries only transient startup-probe failures while the
process remains alive. The policy is configured by
`WorkerOptions.StartupProbeRetryAttempts` and
`WorkerOptions.StartupProbeRetryDelayMilliseconds`; retries are counted in the
`mxgateway.retries.attempted` metric with `area=worker_startup`.
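
The retry behavior described above can be sketched in a language-neutral way. This is a minimal Python illustration, not the gateway's Polly configuration: `TransientProbeError`, `RetryMetrics`, `probe_with_retry`, and the `is_alive` callback are hypothetical names, and the sketch assumes that `StartupProbeRetryAttempts` counts retries after the first probe.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, Dict


class TransientProbeError(Exception):
    """Hypothetical marker for a startup-probe failure that is worth retrying."""


@dataclass
class RetryMetrics:
    """Stand-in for the mxgateway.retries.attempted counter, keyed by area."""
    attempted: Dict[str, int] = field(default_factory=dict)

    def record(self, area: str) -> None:
        self.attempted[area] = self.attempted.get(area, 0) + 1


def probe_with_retry(
    probe: Callable[[], None],
    is_alive: Callable[[], bool],
    retry_attempts: int,
    retry_delay_ms: int,
    metrics: RetryMetrics,
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Run probe once, then retry up to retry_attempts times on transient failures."""
    retries_used = 0
    while True:
        try:
            probe()
            return
        except TransientProbeError:
            # Give up when the budget is spent or the worker has already exited.
            if retries_used >= retry_attempts or not is_alive():
                raise
            retries_used += 1
            metrics.record("worker_startup")
            sleep(retry_delay_ms / 1000.0)
```

The process is started outside this function, so a retry never relaunches the worker. Any exception other than the transient marker propagates immediately without consuming a retry.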

The launcher also passes
`MXGATEWAY_WORKER_PIPE_CONNECT_ATTEMPT_TIMEOUT_MS` to the worker process from
`WorkerOptions.PipeConnectAttemptTimeoutMilliseconds`. The worker uses that
value as the per-attempt named-pipe connect timeout inside its own bounded
Polly retry loop.
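
The handoff of the per-attempt timeout can be illustrated the same way. In this hedged Python sketch, only the environment variable name comes from the text above. The helper names, the `max_attempts` parameter, and the use of `TimeoutError` to signal a failed attempt are assumptions made for illustration.

```python
from typing import Callable, Dict, Mapping, TypeVar

T = TypeVar("T")
ENV_NAME = "MXGATEWAY_WORKER_PIPE_CONNECT_ATTEMPT_TIMEOUT_MS"


def build_worker_environment(base_env: Mapping[str, str], attempt_timeout_ms: int) -> Dict[str, str]:
    """Launcher side: copy the environment and add the per-attempt timeout."""
    env = dict(base_env)
    env[ENV_NAME] = str(attempt_timeout_ms)
    return env


def read_attempt_timeout_seconds(env: Mapping[str, str]) -> float:
    """Worker side: parse the timeout (milliseconds) into seconds."""
    return int(env[ENV_NAME]) / 1000.0


def connect_with_retry(connect: Callable[[float], T], attempt_timeout_s: float, max_attempts: int) -> T:
    """Call connect with the same per-attempt timeout until it succeeds or attempts run out."""
    last_error = None
    for _ in range(max_attempts):
        try:
            return connect(attempt_timeout_s)
        except TimeoutError as error:
            last_error = error
    raise TimeoutError(f"pipe connect failed after {max_attempts} attempts") from last_error
```

The timeout bounds each individual connect attempt, while the loop bounds the number of attempts. The worst-case total wait is therefore roughly the product of the two, plus any delay between attempts.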

If startup fails or exceeds `WorkerOptions.StartupTimeoutSeconds`, the launcher
kills the worker process tree, disposes the process handle, disposes the
optional pipe reservation, records a worker kill metric, and reports a