Implement worker heartbeat watchdog
@@ -18,6 +18,7 @@ starting `MxGateway.Worker.exe` or loading MXAccess COM. The harness scripts:
 - `WorkerHello` and `WorkerReady` startup,
 - command replies with matching correlation ids,
 - ordered `WorkerEvent` frames,
+- `WorkerHeartbeat` frames,
 - `WorkerFault` frames,
 - shutdown acknowledgements,
 - malformed protobuf payloads and oversized frame headers,
@@ -43,6 +44,8 @@ event streaming behavior:
 dotnet test src/MxGateway.Tests/MxGateway.Tests.csproj --filter FullyQualifiedName~FakeWorkerHarnessTests
 dotnet test src/MxGateway.Tests/MxGateway.Tests.csproj --filter FullyQualifiedName~SessionWorkerClientFactoryFakeWorkerTests
 dotnet test src/MxGateway.Tests/MxGateway.Tests.csproj --filter FullyQualifiedName~GatewayEndToEndFakeWorkerSmokeTests
+dotnet test src/MxGateway.Tests/MxGateway.Tests.csproj --filter FullyQualifiedName~WorkerClientTests
+dotnet test src/MxGateway.Worker.Tests/MxGateway.Worker.Tests.csproj -p:Platform=x86 --filter FullyQualifiedName~WorkerPipeSessionTests
 ```

 Run the gateway test project after shared gateway test infrastructure changes:
@@ -576,13 +576,19 @@ Do not drop or coalesce events in v1.

 ## Heartbeat And Watchdog

-The worker heartbeat should prove that:
+`WorkerPipeSession` starts the heartbeat loop after the gateway validates
+`WorkerHello` and receives `WorkerReady`. Heartbeats continue until
+`WorkerShutdown`, cancellation, or a pipe/protocol failure stops the session.
+The loop uses `WorkerPipeSessionOptions.HeartbeatInterval`; the default matches
+the gateway worker heartbeat interval.
+
+The worker heartbeat proves that:

 - pipe writer is alive,
 - worker host is alive,
 - STA has recently pumped or completed work.

-Heartbeat payload should include:
+Heartbeat payload includes:

 - worker process id,
 - session id,
@@ -593,13 +599,19 @@ Heartbeat payload should include:
 - event sequence,
 - current command correlation id if any.

-The STA watchdog should warn when:
+`MxAccessStaSession.CaptureHeartbeat()` reads `StaRuntime.LastActivityUtc` and
+`StaCommandDispatcher` queue state without touching the raw MXAccess COM object
+outside the STA. Event queue depth and event sequence are reported as zero until
+the event queue implementation owns those counters.

-- one command exceeds its expected duration,
-- the STA has not pumped messages within the heartbeat grace period,
-- event queue depth remains high.
-
-The worker can report the problem, but the gateway owns the final kill decision.
+The STA watchdog currently emits a `WorkerFault` with
+`WorkerFaultCategory.StaHung` when `LastStaActivityUtc` is older than
+`WorkerPipeSessionOptions.HeartbeatGrace`. The fault includes the current
+command correlation id when a command is active. Command duration and high event
+queue depth remain observable through heartbeat fields until dedicated
+thresholds own those warnings. The worker reports stale STA activity, but the
+gateway owns the final kill decision through its existing heartbeat and worker
+lifecycle policy.

 ## Shutdown

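
The stale-STA rule described in the last hunk can be read as a single time comparison. The following is a minimal sketch under stated assumptions, not the code from this commit: `SketchFaultCategory`, `SketchFault`, and `StaWatchdogSketch.Check` are hypothetical stand-ins, and only the names `StaHung`, `LastStaActivityUtc`, and `HeartbeatGrace` come from the documentation text above. It assumes "older than" means strictly greater than the grace period.

```csharp
#nullable enable
using System;

// Hypothetical stand-in for WorkerFaultCategory; the real enum is not shown in this diff.
public enum SketchFaultCategory { StaHung }

// Hypothetical stand-in for the WorkerFault payload.
public sealed record SketchFault(
    SketchFaultCategory Category,
    string? CorrelationId,
    TimeSpan StaleFor);

public static class StaWatchdogSketch
{
    // Returns a StaHung fault when the last STA activity is older than the
    // grace period, otherwise null. The correlation id is passed through
    // only when a command is active (null means no active command).
    public static SketchFault? Check(
        DateTime lastStaActivityUtc,
        DateTime nowUtc,
        TimeSpan heartbeatGrace,
        string? activeCorrelationId)
    {
        TimeSpan staleFor = nowUtc - lastStaActivityUtc;
        if (staleFor <= heartbeatGrace)
        {
            return null; // STA activity is recent enough; nothing to report.
        }

        return new SketchFault(SketchFaultCategory.StaHung, activeCorrelationId, staleFor);
    }
}
```

In this sketch the check only reports; it never terminates anything. That mirrors the documented split in which the worker reports stale STA activity and the gateway owns the final kill decision.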