fix(driver-galaxy): wire event-stream faults to the reconnect supervisor (Driver.Galaxy-001)

The ReconnectSupervisor was constructed but its trigger, ReportTransportFailure, was never called. When the gateway StreamEvents stream faulted, EventPump just logged and exited; the supervisor was never notified, so a transient gateway drop permanently stopped data-change notifications while GetHealth() still reported Healthy.

EventPump gains an optional onStreamFault callback invoked from its stream-fault catch block (not on clean shutdown). GalaxyDriver wires it to ReconnectSupervisor.ReportTransportFailure so a transport drop drives reopen → replay. This is the minimal fix for -001; the pump-restart-on-reopen gap remains tracked as Driver.Galaxy-008.

Regression tests cover the callback being invoked on fault, the end-to-end supervisor reopen/replay, and that a clean shutdown does not fire it. Driver.Galaxy suite: 206/206 pass.

Resolves code-review finding Driver.Galaxy-001 (Critical).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@@ -7,7 +7,7 @@
| Review date | 2026-05-22 |
| Commit reviewed | `76d35d1` |
| Status | Reviewed |
-| Open findings | 14 |
+| Open findings | 13 |

## Checklist coverage
@@ -33,13 +33,13 @@
| Severity | Critical |
| Category | Error handling & resilience |
| Location | `Runtime/EventPump.cs:128`, `GalaxyDriver.cs:222` |
-| Status | Open |
+| Status | Resolved |

**Description:** The `ReconnectSupervisor` is constructed in `BuildProductionRuntimeAsync` and exposes `ReportTransportFailure(Exception)` as the only entry point that starts the reopen -> replay recovery loop. Nothing in the driver ever calls `ReportTransportFailure` (a repo-wide search finds only the declaration). When the gateway `StreamEvents` stream faults, `EventPump.RunAsync` catches the exception, logs "reconnect supervisor (PR 4.5) handles restart", completes the channel, and exits, but the supervisor is never told. The result: a transient gateway transport drop permanently kills the event stream. Data-change notifications stop, no reconnect/replay runs, and `GetHealth()` keeps reporting `Healthy` because `_supervisor.IsDegraded` stays false. This is a production outage with no self-recovery.

**Recommendation:** Wire the `EventPump` (and any gateway RPC that observes a transport fault) to call `_supervisor.ReportTransportFailure(ex)`. The simplest path: give `EventPump` a fault callback (or expose a `StreamFaulted` event) that `GalaxyDriver` subscribes to and forwards to the supervisor. The supervisor's `ReopenAsync`/`ReplayAsync` must also restart the `EventPump` itself (see Driver.Galaxy-008).

-**Resolution:** _(open)_
+**Resolution:** Resolved 2026-05-22: added an optional `onStreamFault` callback to `EventPump`; `RunAsync`'s stream-fault catch block now invokes it, and `GalaxyDriver.EnsureEventPumpStarted` wires it to `OnEventPumpStreamFault`, which forwards the cause to `ReconnectSupervisor.ReportTransportFailure`, so a transient gateway transport drop now drives reopen → replay. Regression coverage in `EventPumpStreamFaultTests`. Note: the `EventPump` itself is still not restarted on reconnect; that pump-restart gap remains tracked under Driver.Galaxy-008.
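
The wiring described in the resolution can be illustrated with a minimal, self-contained sketch. It is not the driver's actual code: `SketchSupervisor`, `SketchEventPump`, and `SketchStreams` are hypothetical stand-ins, and the `string` event type is an assumption. Only the names `onStreamFault`, `RunAsync`, and `ReportTransportFailure(Exception)` come from the finding above.

```csharp
#nullable enable
using System;
using System.Collections.Generic;
using System.Runtime.CompilerServices;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical stand-in for ReconnectSupervisor; it only records the report.
public sealed class SketchSupervisor
{
    private int _failures;
    public int FailureCount => Volatile.Read(ref _failures);
    public Exception? LastCause { get; private set; }

    public void ReportTransportFailure(Exception cause)
    {
        LastCause = cause;
        Interlocked.Increment(ref _failures);
        // A real supervisor would start its reopen -> replay loop here.
    }
}

// Hypothetical stand-in for EventPump with an optional fault callback.
public sealed class SketchEventPump
{
    private readonly Func<CancellationToken, IAsyncEnumerable<string>> _openStream;
    private readonly Action<Exception>? _onStreamFault;

    public SketchEventPump(
        Func<CancellationToken, IAsyncEnumerable<string>> openStream,
        Action<Exception>? onStreamFault = null)
    {
        _openStream = openStream;
        _onStreamFault = onStreamFault;
    }

    public async Task RunAsync(CancellationToken ct)
    {
        try
        {
            await foreach (var evt in _openStream(ct))
            {
                // Dispatch evt to subscribers (omitted in this sketch).
            }
        }
        catch (OperationCanceledException) when (ct.IsCancellationRequested)
        {
            // Clean shutdown: not a transport fault, so the callback stays silent.
        }
        catch (Exception ex)
        {
            // Stream fault: hand the cause to whoever wired the callback.
            _onStreamFault?.Invoke(ex);
        }
    }
}

// Hypothetical streams used to exercise both paths.
public static class SketchStreams
{
    // Yields one event, then fails the way a dropped transport might.
    public static async IAsyncEnumerable<string> Faulting(
        [EnumeratorCancellation] CancellationToken ct)
    {
        yield return "event-1";
        await Task.Yield();
        throw new InvalidOperationException("gateway stream dropped");
    }

    // Yields one event, then waits until the token is cancelled.
    public static async IAsyncEnumerable<string> UntilCancelled(
        [EnumeratorCancellation] CancellationToken ct)
    {
        yield return "event-1";
        await Task.Delay(Timeout.Infinite, ct);
    }
}
```

Constructing `new SketchEventPump(SketchStreams.Faulting, supervisor.ReportTransportFailure)` and awaiting `RunAsync` exercises the fault path, while running the pump over `SketchStreams.UntilCancelled` and cancelling the token exercises the clean-shutdown path, where the callback should not fire. Keeping the callback optional means callers that do not pass one are unaffected.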
### Driver.Galaxy-002