Implement worker heartbeat watchdog

Joseph Doherty
2026-04-26 19:12:06 -04:00
parent a3ccd5c80b
commit 4a3560c7ee
15 changed files with 1048 additions and 20 deletions
+20 -8
@@ -576,13 +576,19 @@ Do not drop or coalesce events in v1.
## Heartbeat And Watchdog
The worker heartbeat should prove that:
`WorkerPipeSession` starts the heartbeat loop after the gateway validates
`WorkerHello` and receives `WorkerReady`. Heartbeats continue until
`WorkerShutdown`, cancellation, or a pipe/protocol failure stops the session.
The loop uses `WorkerPipeSessionOptions.HeartbeatInterval`; the default matches
the gateway worker heartbeat interval.
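The loop lifecycle described above can be sketched roughly as follows. This is a minimal, language-neutral illustration in Python, not the actual `WorkerPipeSession` code: `run_heartbeat_loop`, `send_heartbeat`, `stop`, and `interval_seconds` are hypothetical names, the stop event stands in for `WorkerShutdown` or cancellation, and an `OSError` stands in for a pipe/protocol failure.

```python
import threading
from typing import Callable


def run_heartbeat_loop(
    send_heartbeat: Callable[[], None],
    stop: threading.Event,
    interval_seconds: float,
) -> int:
    """Send heartbeats until `stop` is set or sending fails.

    Returns the number of heartbeats that were sent successfully.
    All names here are hypothetical; they only mirror the prose above.
    """
    sent = 0
    while not stop.is_set():
        try:
            send_heartbeat()
        except OSError:
            # A pipe failure ends the session, and with it the loop.
            break
        sent += 1
        # wait() serves as both the interval sleep and the stop check,
        # so shutdown does not have to wait out a full interval.
        if stop.wait(interval_seconds):
            break
    return sent
```

Waiting on the stop event instead of sleeping is one way to keep shutdown prompt; the real session may use a different cancellation mechanism.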
The worker heartbeat proves that:
- pipe writer is alive,
- worker host is alive,
- STA has recently pumped or completed work.
Heartbeat payload should include:
Heartbeat payload includes:
- worker process id,
- session id,
@@ -593,13 +599,19 @@ Heartbeat payload should include:
- event sequence,
- current command correlation id if any.
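As a rough illustration of the payload shape, the sketch below models only the fields this section names explicitly. It is an assumption-laden Python stand-in, not the real message contract: `Heartbeat`, `capture_heartbeat`, and the snake_case field names are hypothetical, and the fields hidden by the diff hunk boundary are deliberately left out. The event counters are hardcoded to zero to mirror the note that they are reported as zero until the event queue owns them.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class Heartbeat:
    """Partial, hypothetical heartbeat payload (named fields only)."""

    worker_process_id: int
    session_id: str
    last_sta_activity_utc: datetime
    event_queue_depth: int
    event_sequence: int
    current_command_correlation_id: Optional[str]


def capture_heartbeat(
    worker_process_id: int,
    session_id: str,
    last_sta_activity_utc: datetime,
    current_command_correlation_id: Optional[str] = None,
) -> Heartbeat:
    """Build a heartbeat from already-captured state.

    The inputs are plain values, so nothing here would need to touch
    an STA-bound object from another thread.
    """
    return Heartbeat(
        worker_process_id=worker_process_id,
        session_id=session_id,
        last_sta_activity_utc=last_sta_activity_utc,
        # Placeholder zeros until an event queue owns these counters.
        event_queue_depth=0,
        event_sequence=0,
        current_command_correlation_id=current_command_correlation_id,
    )
```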
The STA watchdog should warn when:
`MxAccessStaSession.CaptureHeartbeat()` reads `StaRuntime.LastActivityUtc` and
`StaCommandDispatcher` queue state without touching the raw MXAccess COM object
outside the STA. Event queue depth and event sequence are reported as zero until
the event queue implementation owns those counters.
- one command exceeds its expected duration,
- the STA has not pumped messages within the heartbeat grace period,
- event queue depth remains high.
The worker can report the problem, but the gateway owns the final kill decision.
The STA watchdog currently emits a `WorkerFault` with
`WorkerFaultCategory.StaHung` when `LastStaActivityUtc` is older than
`WorkerPipeSessionOptions.HeartbeatGrace`. The fault includes the current
command correlation id when a command is active. Command duration and high event
queue depth remain observable through heartbeat fields until dedicated
thresholds own those warnings. The worker reports stale STA activity, but the
gateway owns the final kill decision through its existing heartbeat and worker
lifecycle policy.
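The staleness rule in the paragraph above reduces to a single comparison. The following Python sketch shows that rule under stated assumptions: `check_sta`, the `WorkerFault` dataclass, and its field names are hypothetical stand-ins for the real types, and "older than" is read as a strict comparison against the grace period.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass(frozen=True)
class WorkerFault:
    """Hypothetical stand-in for the worker fault message."""

    category: str
    correlation_id: Optional[str]


def check_sta(
    last_sta_activity_utc: datetime,
    now_utc: datetime,
    heartbeat_grace: timedelta,
    current_command_correlation_id: Optional[str] = None,
) -> Optional[WorkerFault]:
    """Return a StaHung fault if STA activity is older than the grace period."""
    if now_utc - last_sta_activity_utc > heartbeat_grace:
        # Carry the correlation id only when a command is active.
        return WorkerFault("StaHung", current_command_correlation_id)
    return None
```

The function only reports; it takes no action against the worker, which matches the split where the gateway owns the kill decision.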
## Shutdown