fix(driver-galaxy): wire event-stream faults to the reconnect supervisor (Driver.Galaxy-001)

The ReconnectSupervisor was constructed but its trigger
ReportTransportFailure was never called. When the gateway StreamEvents
stream faulted, EventPump just logged and exited — the supervisor was
never notified, so a transient gateway drop permanently stopped
data-change notifications while GetHealth() still reported Healthy.

EventPump gains an optional onStreamFault callback invoked from its
stream-fault catch block (not on clean shutdown). GalaxyDriver wires it
to ReconnectSupervisor.ReportTransportFailure so a transport drop drives
reopen → replay.

This is the minimal fix for -001; the pump-restart-on-reopen gap remains
tracked as Driver.Galaxy-008. Regression tests cover the callback being
invoked on fault, the end-to-end supervisor reopen/replay, and that a
clean shutdown does not fire it. Driver.Galaxy suite: 206/206 pass.

Resolves code-review finding Driver.Galaxy-001 (Critical).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
Joseph Doherty
2026-05-22 05:54:33 -04:00
parent 796871c210
commit 4df8737c86
4 changed files with 175 additions and 9 deletions

View File

@@ -7,7 +7,7 @@
| Review date | 2026-05-22 | | Review date | 2026-05-22 |
| Commit reviewed | `76d35d1` | | Commit reviewed | `76d35d1` |
| Status | Reviewed | | Status | Reviewed |
| Open findings | 14 | | Open findings | 13 |
## Checklist coverage ## Checklist coverage
@@ -33,13 +33,13 @@
| Severity | Critical | | Severity | Critical |
| Category | Error handling & resilience | | Category | Error handling & resilience |
| Location | `Runtime/EventPump.cs:128`, `GalaxyDriver.cs:222` | | Location | `Runtime/EventPump.cs:128`, `GalaxyDriver.cs:222` |
| Status | Open | | Status | Resolved |
**Description:** The `ReconnectSupervisor` is constructed in `BuildProductionRuntimeAsync` and exposes `ReportTransportFailure(Exception)` as the only entry point that starts the reopen -> replay recovery loop. Nothing in the driver ever calls `ReportTransportFailure` (a repo-wide search finds only the declaration). When the gateway `StreamEvents` stream faults, `EventPump.RunAsync` catches the exception, logs "reconnect supervisor (PR 4.5) handles restart", completes the channel, and exits — but the supervisor is never told. The result: a transient gateway transport drop permanently kills the event stream. Data-change notifications stop, no reconnect/replay runs, and `GetHealth()` keeps reporting `Healthy` because `_supervisor.IsDegraded` stays false. This is a production outage with no self-recovery. **Description:** The `ReconnectSupervisor` is constructed in `BuildProductionRuntimeAsync` and exposes `ReportTransportFailure(Exception)` as the only entry point that starts the reopen -> replay recovery loop. Nothing in the driver ever calls `ReportTransportFailure` (a repo-wide search finds only the declaration). When the gateway `StreamEvents` stream faults, `EventPump.RunAsync` catches the exception, logs "reconnect supervisor (PR 4.5) handles restart", completes the channel, and exits — but the supervisor is never told. The result: a transient gateway transport drop permanently kills the event stream. Data-change notifications stop, no reconnect/replay runs, and `GetHealth()` keeps reporting `Healthy` because `_supervisor.IsDegraded` stays false. This is a production outage with no self-recovery.
**Recommendation:** Wire the EventPump (and any gw RPC that observes a transport fault) to call `_supervisor.ReportTransportFailure(ex)`. The simplest path: give `EventPump` a fault callback (or expose a `StreamFaulted` event) that `GalaxyDriver` subscribes to and forwards to the supervisor. The supervisor's `ReopenAsync`/`ReplayAsync` must also restart the EventPump itself (see Driver.Galaxy-008). **Recommendation:** Wire the EventPump (and any gw RPC that observes a transport fault) to call `_supervisor.ReportTransportFailure(ex)`. The simplest path: give `EventPump` a fault callback (or expose a `StreamFaulted` event) that `GalaxyDriver` subscribes to and forwards to the supervisor. The supervisor's `ReopenAsync`/`ReplayAsync` must also restart the EventPump itself (see Driver.Galaxy-008).
**Resolution:** _(open)_ **Resolution:** Resolved 2026-05-22 — added an optional `onStreamFault` callback to `EventPump`; `RunAsync`'s stream-fault catch block now invokes it, and `GalaxyDriver.EnsureEventPumpStarted` wires it to `OnEventPumpStreamFault` which forwards the cause to `ReconnectSupervisor.ReportTransportFailure`, so a transient gw transport drop now drives reopen → replay. Regression coverage in `EventPumpStreamFaultTests`. Note: the EventPump itself is still not restarted on reconnect — that pump-restart gap remains tracked under Driver.Galaxy-008.
### Driver.Galaxy-002 ### Driver.Galaxy-002

View File

@@ -732,13 +732,43 @@ public sealed class GalaxyDriver
_eventPump = new EventPump( _eventPump = new EventPump(
_subscriber!, _subscriptions, _logger, _subscriber!, _subscriptions, _logger,
channelCapacity: _options.MxAccess.EventPumpChannelCapacity, channelCapacity: _options.MxAccess.EventPumpChannelCapacity,
clientName: _options.MxAccess.ClientName); clientName: _options.MxAccess.ClientName,
onStreamFault: OnEventPumpStreamFault);
_eventPump.OnDataChange += OnPumpDataChange; _eventPump.OnDataChange += OnPumpDataChange;
_eventPump.Start(); _eventPump.Start();
return _eventPump; return _eventPump;
} }
} }
/// <summary>
/// Stream-fault callback for the <see cref="EventPump"/>. The gw StreamEvents
/// stream faulted (transient gateway drop, network blip, gw restart). Forward
/// the cause to the <see cref="ReconnectSupervisor"/> so it drives reopen →
/// replay; without this hand-off a transient transport drop permanently kills
/// the event stream and <c>GetHealth()</c> keeps reporting Healthy.
/// </summary>
private void OnEventPumpStreamFault(Exception cause)
{
var supervisor = _supervisor;
if (supervisor is null)
{
// No production runtime (skeleton / injected-seam path) — nothing to drive.
_logger.LogWarning(cause,
"GalaxyDriver {InstanceId} event stream faulted but no reconnect supervisor is wired.",
_driverInstanceId);
return;
}
try
{
supervisor.ReportTransportFailure(cause);
}
catch (ObjectDisposedException)
{
// Driver is being disposed — the stream fault is just shutdown noise.
}
}
// ===== IAlarmSource ===== // ===== IAlarmSource =====
/// <summary> /// <summary>

View File

@@ -16,9 +16,11 @@ namespace ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Runtime;
/// </summary> /// </summary>
/// <remarks> /// <remarks>
/// <para> /// <para>
/// One pump per connected <see cref="GalaxyMxSession"/>. Reconnect lives in PR 4.5's /// One pump per connected <see cref="GalaxyMxSession"/>. Reconnect lives in the
/// supervisor; on transport failure here we log + propagate so the supervisor can /// <see cref="ReconnectSupervisor"/>; on transport failure here we log and invoke
/// decide whether to restart. /// the optional <c>onStreamFault</c> callback so the owner (GalaxyDriver) can
/// forward the fault to <see cref="ReconnectSupervisor.ReportTransportFailure"/>
/// and the supervisor can drive reopen → replay.
/// </para> /// </para>
/// <para> /// <para>
/// PR 6.2 — the network-read loop and the listener-fanout loop are decoupled by a /// PR 6.2 — the network-read loop and the listener-fanout loop are decoupled by a
@@ -50,6 +52,7 @@ internal sealed class EventPump : IAsyncDisposable
private readonly SubscriptionRegistry _registry; private readonly SubscriptionRegistry _registry;
private readonly ILogger _logger; private readonly ILogger _logger;
private readonly Func<long, ISubscriptionHandle> _handleFactory; private readonly Func<long, ISubscriptionHandle> _handleFactory;
private readonly Action<Exception>? _onStreamFault;
private readonly Channel<MxEvent> _channel; private readonly Channel<MxEvent> _channel;
private readonly KeyValuePair<string, object?> _clientTag; private readonly KeyValuePair<string, object?> _clientTag;
private readonly CancellationTokenSource _cts = new(); private readonly CancellationTokenSource _cts = new();
@@ -66,12 +69,14 @@ internal sealed class EventPump : IAsyncDisposable
ILogger? logger = null, ILogger? logger = null,
Func<long, ISubscriptionHandle>? handleFactory = null, Func<long, ISubscriptionHandle>? handleFactory = null,
int channelCapacity = DefaultChannelCapacity, int channelCapacity = DefaultChannelCapacity,
string? clientName = null) string? clientName = null,
Action<Exception>? onStreamFault = null)
{ {
_subscriber = subscriber ?? throw new ArgumentNullException(nameof(subscriber)); _subscriber = subscriber ?? throw new ArgumentNullException(nameof(subscriber));
_registry = registry ?? throw new ArgumentNullException(nameof(registry)); _registry = registry ?? throw new ArgumentNullException(nameof(registry));
_logger = logger ?? NullLogger.Instance; _logger = logger ?? NullLogger.Instance;
_handleFactory = handleFactory ?? (id => new GalaxySubscriptionHandle(id)); _handleFactory = handleFactory ?? (id => new GalaxySubscriptionHandle(id));
_onStreamFault = onStreamFault;
if (channelCapacity < 1) if (channelCapacity < 1)
{ {
@@ -127,7 +132,20 @@ internal sealed class EventPump : IAsyncDisposable
catch (Exception ex) catch (Exception ex)
{ {
_logger.LogWarning(ex, _logger.LogWarning(ex,
"Galaxy EventPump loop ended with an exception — reconnect supervisor (PR 4.5) handles restart."); "Galaxy EventPump loop ended with an exception — notifying reconnect supervisor.");
// The gw StreamEvents stream faulted. Signal the reconnect supervisor so it
// drives reopen → replay. Without this the stream silently dies and a
// transient gateway drop permanently stops data-change notifications.
if (_onStreamFault is not null)
{
try { _onStreamFault(ex); }
catch (Exception cbEx)
{
_logger.LogWarning(cbEx,
"Galaxy EventPump stream-fault callback threw — supervisor may not have been notified.");
}
}
} }
finally finally
{ {

View File

@@ -0,0 +1,118 @@
using System.Threading.Channels;
using MxGateway.Contracts.Proto;
using Shouldly;
using Xunit;
using ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Runtime;
namespace ZB.MOM.WW.OtOpcUa.Driver.Galaxy.Tests.Runtime;
/// <summary>
/// Regression coverage for Driver.Galaxy-001 (Critical): when the gw StreamEvents
/// stream faults, the <see cref="EventPump"/> must notify the reconnect supervisor
/// rather than silently logging and exiting. Without the <c>onStreamFault</c>
/// hand-off a transient gateway transport drop permanently kills the event stream.
/// </summary>
public sealed class EventPumpStreamFaultTests
{
private const int WaitMs = 2_000;
[Fact]
public async Task StreamFault_InvokesOnStreamFaultCallback_WithTheCause()
{
var subscriber = new FaultingSubscriber();
var registry = new SubscriptionRegistry();
var faultObserved = new TaskCompletionSource<Exception>(
TaskCreationOptions.RunContinuationsAsynchronously);
await using var pump = new EventPump(
subscriber, registry, channelCapacity: 4, clientName: "FaultTest",
onStreamFault: ex => faultObserved.TrySetResult(ex));
pump.Start();
// Drop the gw stream — RunAsync's await foreach throws.
subscriber.FaultStream(new IOException("simulated gateway transport drop"));
var completed = await Task.WhenAny(faultObserved.Task, Task.Delay(WaitMs));
completed.ShouldBe(faultObserved.Task,
"EventPump must invoke onStreamFault when the gw StreamEvents stream faults");
(await faultObserved.Task).ShouldBeOfType<IOException>();
}
[Fact]
public async Task StreamFault_DrivesReconnectSupervisorReopenReplay()
{
// End-to-end: a faulting EventPump wired to a real ReconnectSupervisor must
// drive the supervisor through its reopen → replay recovery loop.
var subscriber = new FaultingSubscriber();
var registry = new SubscriptionRegistry();
var reopenRan = new TaskCompletionSource(TaskCreationOptions.RunContinuationsAsynchronously);
var replayRan = new TaskCompletionSource(TaskCreationOptions.RunContinuationsAsynchronously);
using var supervisor = new ReconnectSupervisor(
reopen: _ => { reopenRan.TrySetResult(); return Task.CompletedTask; },
replay: _ => { replayRan.TrySetResult(); return Task.CompletedTask; },
options: new ReconnectOptions(
InitialBackoffOverride: TimeSpan.FromMilliseconds(5),
MaxBackoffOverride: TimeSpan.FromMilliseconds(20)));
await using var pump = new EventPump(
subscriber, registry, channelCapacity: 4, clientName: "FaultTest",
onStreamFault: supervisor.ReportTransportFailure);
pump.Start();
supervisor.CurrentState.ShouldBe(ReconnectSupervisor.State.Healthy);
subscriber.FaultStream(new IOException("simulated gateway transport drop"));
(await Task.WhenAny(reopenRan.Task, Task.Delay(WaitMs))).ShouldBe(reopenRan.Task,
"stream fault must trigger the supervisor's reopen path");
(await Task.WhenAny(replayRan.Task, Task.Delay(WaitMs))).ShouldBe(replayRan.Task,
"stream fault must trigger the supervisor's replay path");
await supervisor.WaitForHealthyAsync(new CancellationTokenSource(WaitMs).Token);
supervisor.IsDegraded.ShouldBeFalse();
}
[Fact]
public async Task CleanShutdown_DoesNotInvokeOnStreamFault()
{
var subscriber = new FaultingSubscriber();
var registry = new SubscriptionRegistry();
var faulted = false;
var pump = new EventPump(
subscriber, registry, channelCapacity: 4, clientName: "FaultTest",
onStreamFault: _ => faulted = true);
pump.Start();
// Graceful disposal cancels the loop — that is OperationCanceledException,
// not a transport fault, and must NOT trip the supervisor.
await pump.DisposeAsync();
faulted.ShouldBeFalse("clean shutdown must not be reported as a transport fault");
}
/// <summary>
/// <see cref="IGalaxySubscriber"/> fake whose StreamEvents stream can be faulted
/// on demand so the EventPump's RunAsync catch path is exercised.
/// </summary>
private sealed class FaultingSubscriber : IGalaxySubscriber
{
private readonly Channel<MxEvent> _stream =
Channel.CreateUnbounded<MxEvent>(new UnboundedChannelOptions { SingleReader = true });
public Task<IReadOnlyList<SubscribeResult>> SubscribeBulkAsync(
IReadOnlyList<string> fullReferences, int bufferedUpdateIntervalMs, CancellationToken cancellationToken)
=> Task.FromResult<IReadOnlyList<SubscribeResult>>([]);
public Task UnsubscribeBulkAsync(IReadOnlyList<int> itemHandles, CancellationToken cancellationToken)
=> Task.CompletedTask;
public IAsyncEnumerable<MxEvent> StreamEventsAsync(CancellationToken cancellationToken)
=> _stream.Reader.ReadAllAsync(cancellationToken);
/// <summary>Fault the stream so the pump's <c>await foreach</c> throws.</summary>
public void FaultStream(Exception cause) => _stream.Writer.TryComplete(cause);
}
}