fix(error-handling): close Theme 4 — 18 cancellation / fire-and-forget findings
Async cancellation hygiene, fire-and-forget observability, retry/shutdown
semantics, and audit-row coverage across 9 modules. Highlights:

Cancellation & lifecycle:
- AuditLog-006: SqliteAuditWriter.Dispose hops to thread pool, escaping the
  captured SyncContext that risked sync-over-async deadlock.
- AuditLog-010: SiteAuditTelemetryActor owns a private lifecycle CTS,
  threaded through drain paths instead of CancellationToken.None.
- Comm-019: CentralCommunicationActor adds lifecycle CTS for repo calls.
- Host-019: Migration StartupRetry forwards ApplicationStopping so SIGTERM
  during the bounded-retry window aborts cleanly.

Cursor / retry / counter correctness:
- AuditLog-004: SiteAuditReconciliationActor's cursor now holds at `since`
  when any row's idempotent insert is still being retried (per-EventId retry
  counter, MaxPermanentInsertAttempts=5 escape valve with LogCritical
  abandon). No more silent abandonment of permanently-failing rows.
- ConfigDB-019: Dropped the catch-and-continue on EnsureLookaheadAsync's
  SPLIT loop; by class-doc construction the catch could only mask real
  failures and let the next iteration create permanent partition holes.
- HM-017/018: HealthReportSender + CentralHealthReportLoop snapshot
  per-interval counters before sending, restore via new
  ISiteHealthCollector.AddIntervalCounters on transport failure so counts
  aren't silently lost.

Fire-and-forget / shutdown waits:
- InboundAPI-018: AuditWriteMiddleware observes faulted audit-write tasks via
  OnlyOnFaulted continuation (Warning log; response unchanged).
- SnF-024: StoreAndForwardService.StopAsync awaits in-flight retry sweep with
  a bounded SweepShutdownWaitTimeout (10s).

Leak / refactor:
- Comm-021: SiteStreamGrpcServer.SubscribeInstance wraps Subscribe in its own
  try/catch so a throw doesn't leak the relay actor or _activeStreams entry.
- Comm-022: VERIFIED already-closed by Comm-016's dead-code purge.
- CLI-017: BundleCommands' three subcommands delegate to ExecuteCommandAsync
  (auth-failure exit-code contract unified).

Defensive / validation:
- CLI-021: CliConfig.Load wraps file-read/JSON parse so malformed config
  prints a warning and returns defaults instead of crashing the CLI.
- Host-022: ParseLevel emits stderr one-shot warning for unrecognised
  MinimumLevel instead of silently coercing to Information.
- ESG-019: ExternalSystemClient sets HttpClient.Timeout=Infinite so the
  per-call CTS is the sole timeout source (was clipped to 100s by .NET).
- Security-020: New SecurityOptionsValidator (IValidateOptions) rejects empty
  LdapServer/LdapSearchBase with ValidateOnStart.
- DM-019: Lifecycle command timeouts now emit DisableTimedOut/EnableTimedOut/
  DeleteTimedOut audit entries (mirrors DeployFailed pattern).

Plus reconciled stale per-module Open-findings counters that had drifted from
prior sessions.

20+ new regression tests across 11 test projects; build clean; affected
suites all green. README regenerated: 75 open (was 93).
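
For illustration, the fire-and-forget observation described under InboundAPI-018 can be sketched roughly as follows. This is a minimal standalone sketch, not the middleware's actual code: `FireAndForgetSketch`, `WriteAuditRowAsync`, `Observe`, and the `log` delegate are hypothetical names introduced here, and the real change logs at Warning through the module's logger.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public static class FireAndForgetSketch
{
    // Hypothetical stand-in for an audit write that fails asynchronously.
    private static async Task WriteAuditRowAsync()
    {
        await Task.Yield();
        throw new InvalidOperationException("audit store unavailable");
    }

    // Attaches a continuation that runs only when the antecedent faults, so the
    // exception is observed and reported instead of vanishing with a discarded task.
    public static Task Observe(Task task, Action<Exception> log) =>
        task.ContinueWith(
            t => log(t.Exception!.GetBaseException()),
            CancellationToken.None,
            TaskContinuationOptions.OnlyOnFaulted | TaskContinuationOptions.ExecuteSynchronously,
            TaskScheduler.Default);

    public static async Task Main()
    {
        string? logged = null;

        // The caller does not await the write itself; only the observer sees the fault.
        Task observer = Observe(WriteAuditRowAsync(), ex => logged = ex.Message);

        // Awaited here only so the demo can print the result before exiting.
        await observer;
        Console.WriteLine($"warning: audit write failed: {logged}");
    }
}
```

One caveat with this shape: when the antecedent succeeds, an OnlyOnFaulted continuation task ends up cancelled, so production code would normally discard the continuation rather than await it.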
@@ -51,6 +51,30 @@ public class StoreAndForwardService
     private Timer? _retryTimer;
     private int _retryInProgress;
 
+    /// <summary>
+    /// StoreAndForward-024: the in-flight retry sweep <see cref="Task"/>, or
+    /// <c>null</c> when no sweep is currently running. Captured when the timer
+    /// callback starts a sweep so <see cref="StopAsync"/> can wait for it to
+    /// finish before the host disposes downstream dependencies
+    /// (<see cref="_storage"/>, <see cref="_replication"/>) that the sweep is
+    /// still touching. Written from the timer thread and from
+    /// <see cref="StopAsync"/>, so reads are synchronised via the
+    /// <see cref="Volatile"/> APIs.
+    /// </summary>
+    private Task? _sweepTask;
+
+    /// <summary>
+    /// StoreAndForward-024: how long <see cref="StopAsync"/> waits for an
+    /// in-flight retry sweep to finish before returning. The default — 10 s —
+    /// is generous enough to let a typical sweep over the buffered queue drain,
+    /// but bounded so a hung downstream call (a stuck SQLite write, a
+    /// long-running delivery handler) cannot block host shutdown indefinitely.
+    /// On timeout the wait is abandoned and the timer is still disposed; the
+    /// sweep keeps running but will throw on the next call into a disposed
+    /// dependency — preferred to blocking shutdown forever.
+    /// </summary>
+    private static readonly TimeSpan SweepShutdownWaitTimeout = TimeSpan.FromSeconds(10);
+
     /// <summary>
     /// WP-10: Delivery handler delegate. The return value / exception is interpreted
     /// the same way on both the immediate-delivery path (<see cref="EnqueueAsync"/>)
@@ -120,7 +144,14 @@ public class StoreAndForwardService
     {
         await _storage.InitializeAsync();
         _retryTimer = new Timer(
-            _ => _ = RetryPendingMessagesAsync(),
+            // StoreAndForward-024: capture the sweep Task on each tick so
+            // StopAsync can await any in-flight invocation before the host
+            // disposes _storage/_replication underneath it. The RetryPending
+            // path is self-guarded against overlapping sweeps via the
+            // _retryInProgress Interlocked flag, so unconditionally re-assigning
+            // the field here cannot lose a still-running task (the new tick
+            // will short-circuit if one is already running).
+            _ => Volatile.Write(ref _sweepTask, RetryPendingMessagesAsync()),
             null,
             _options.RetryTimerInterval,
             _options.RetryTimerInterval);
@@ -131,15 +162,58 @@ public class StoreAndForwardService
     }
 
     /// <summary>
-    /// Stops the background retry timer.
+    /// Stops the background retry timer and waits (bounded) for any in-flight
+    /// retry sweep to finish before returning.
+    ///
+    /// StoreAndForward-024: prior to this fix, <see cref="StopAsync"/> only
+    /// disposed the timer — a sweep already inside
+    /// <see cref="RetryPendingMessagesAsync"/> continued running against
+    /// <see cref="_storage"/> and <see cref="_replication"/> after this method
+    /// returned, and could then NRE / throw on a disposed dependency once the
+    /// DI container ran its own shutdown. We now await the captured sweep task
+    /// (with a bounded <see cref="SweepShutdownWaitTimeout"/> so a hung
+    /// dependency cannot block host shutdown indefinitely) before returning.
     /// </summary>
     public async Task StopAsync()
     {
         if (_retryTimer != null)
         {
+            // Stop the periodic callback first so no new sweep starts while we
+            // are waiting for the in-flight one to drain.
             await _retryTimer.DisposeAsync();
             _retryTimer = null;
         }
+
+        var inflight = Volatile.Read(ref _sweepTask);
+        if (inflight is null || inflight.IsCompleted)
+        {
+            return;
+        }
+
+        try
+        {
+            // WaitAsync with a finite timeout: a hung delivery handler /
+            // storage call cannot block host shutdown indefinitely. On timeout
+            // the sweep keeps running but the host is free to proceed with
+            // disposal — preferred to never returning.
+            await inflight.WaitAsync(SweepShutdownWaitTimeout).ConfigureAwait(false);
+        }
+        catch (TimeoutException)
+        {
+            _logger.LogWarning(
+                "Store-and-forward retry sweep did not finish within {Timeout}; " +
+                "shutdown is proceeding while the sweep is still in-flight",
+                SweepShutdownWaitTimeout);
+        }
+        catch (Exception ex)
+        {
+            // The sweep itself already logs at Error on failure (see
+            // RetryPendingMessagesAsync's catch); we only log here so a
+            // surprise fault during shutdown is still visible. Swallow so the
+            // host's shutdown sequence can continue regardless.
+            _logger.LogWarning(ex,
+                "Store-and-forward retry sweep faulted during shutdown wait");
+        }
     }
 
     /// <summary>
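
The bounded shutdown wait in the StopAsync hunk above can be exercised in isolation with a small console program. The sketch below is an assumption-laden reduction, not the service's code: `BoundedShutdownWaitSketch` and `WaitBoundedAsync` are hypothetical names, logging is replaced by a boolean result, and it assumes .NET 6 or later for `Task.WaitAsync(TimeSpan)`.

```csharp
using System;
using System.Threading.Tasks;

public static class BoundedShutdownWaitSketch
{
    // Returns true when the in-flight task completed (successfully or with a
    // fault) inside the window, and false when the wait was abandoned on timeout.
    public static async Task<bool> WaitBoundedAsync(Task? inflight, TimeSpan timeout)
    {
        if (inflight is null || inflight.IsCompleted)
        {
            return true; // nothing to wait for
        }

        try
        {
            // WaitAsync stops waiting after the timeout; it does not cancel
            // or stop the underlying task.
            await inflight.WaitAsync(timeout).ConfigureAwait(false);
            return true;
        }
        catch (TimeoutException)
        {
            return false; // the task is still running; the caller proceeds anyway
        }
        catch (Exception)
        {
            return true; // the task faulted within the window; swallowed in this sketch
        }
    }

    public static async Task Main()
    {
        // A sweep that finishes well inside the window.
        bool quick = await WaitBoundedAsync(Task.Delay(50), TimeSpan.FromSeconds(1));

        // A sweep that never completes, standing in for a hung dependency.
        bool hung = await WaitBoundedAsync(
            new TaskCompletionSource().Task, TimeSpan.FromMilliseconds(100));

        Console.WriteLine($"quick sweep drained: {quick}, hung sweep drained: {hung}");
    }
}
```

A subtlety of this shape is that a task which itself faults with a TimeoutException lands in the same catch as an expired wait, so the two cases cannot be told apart without also inspecting `inflight.IsCompleted`.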