fix(error-handling): close Theme 4 — 18 cancellation / fire-and-forget findings

Async cancellation hygiene, fire-and-forget observability, retry/shutdown
semantics, and audit-row coverage across 9 modules. Highlights:

Cancellation & lifecycle:
- AuditLog-006: SqliteAuditWriter.Dispose hops to the thread pool, escaping
  the captured SyncContext that risked a sync-over-async deadlock.
- AuditLog-010: SiteAuditTelemetryActor owns a private lifecycle CTS,
  threaded through drain paths instead of CancellationToken.None.
- Comm-019: CentralCommunicationActor adds lifecycle CTS for repo calls.
- Host-019: Migration StartupRetry forwards ApplicationStopping so SIGTERM
  during the bounded-retry window aborts cleanly.

Cursor / retry / counter correctness:
- AuditLog-004: SiteAuditReconciliationActor's cursor now holds at `since`
  when any row's idempotent insert is still being retried (per-EventId
  retry counter, MaxPermanentInsertAttempts=5 escape valve with LogCritical
  abandon). No more silent abandonment of permanently-failing rows.
- ConfigDB-019: Dropped the catch-and-continue on EnsureLookaheadAsync's
  SPLIT loop. Given the construction documented on the class, the catch
  could only mask real failures and let the next iteration create
  permanent partition holes.
- HM-017/018: HealthReportSender + CentralHealthReportLoop snapshot
  per-interval counters before sending, restore via new
  ISiteHealthCollector.AddIntervalCounters on transport failure so counts
  aren't silently lost.

Fire-and-forget / shutdown waits:
- InboundAPI-018: AuditWriteMiddleware observes faulted audit-write tasks
  via OnlyOnFaulted continuation (Warning log; response unchanged).
- SnF-024: StoreAndForwardService.StopAsync awaits in-flight retry sweep
  with a bounded SweepShutdownWaitTimeout (10s).

Leak / refactor:
- Comm-021: SiteStreamGrpcServer.SubscribeInstance wraps Subscribe in its
  own try/catch so a throw doesn't leak the relay actor or _activeStreams
  entry.
- Comm-022: VERIFIED already-closed by Comm-016's dead-code purge.
- CLI-017: BundleCommands' three subcommands delegate to ExecuteCommandAsync
  (auth-failure exit-code contract unified).

Defensive / validation:
- CLI-021: CliConfig.Load wraps file-read/JSON parse so malformed config
  prints a warning and returns defaults instead of crashing the CLI.
- Host-022: ParseLevel emits a one-shot stderr warning for an unrecognised
  MinimumLevel instead of silently coercing to Information.
- ESG-019: ExternalSystemClient sets HttpClient.Timeout=Infinite so the
  per-call CTS is the sole timeout source (was clipped to HttpClient's
  default 100s timeout).
- Security-020: New SecurityOptionsValidator (IValidateOptions) rejects
  empty LdapServer/LdapSearchBase with ValidateOnStart.
- DM-019: Lifecycle command timeouts now emit DisableTimedOut/EnableTimedOut/
  DeleteTimedOut audit entries (mirrors DeployFailed pattern).

Plus reconciled stale per-module Open-findings counters that had drifted
from prior sessions.

20+ new regression tests across 11 test projects; build clean; affected
suites all green. README regenerated: 75 open (was 93).
Joseph Doherty
2026-05-28 07:13:28 -04:00
parent 819f1b4665
commit 6ae0fea558
44 changed files with 1708 additions and 200 deletions
@@ -51,6 +51,30 @@ public class StoreAndForwardService
     private Timer? _retryTimer;
     private int _retryInProgress;
+    /// <summary>
+    /// StoreAndForward-024: the in-flight retry sweep <see cref="Task"/>, or
+    /// <c>null</c> when no sweep is currently running. Captured when the timer
+    /// callback starts a sweep so <see cref="StopAsync"/> can wait for it to
+    /// finish before the host disposes downstream dependencies
+    /// (<see cref="_storage"/>, <see cref="_replication"/>) that the sweep is
+    /// still touching. Written from the timer thread and from
+    /// <see cref="StopAsync"/>, so reads are synchronised via the
+    /// <see cref="Volatile"/> APIs.
+    /// </summary>
+    private Task? _sweepTask;
+    /// <summary>
+    /// StoreAndForward-024: how long <see cref="StopAsync"/> waits for an
+    /// in-flight retry sweep to finish before returning. The default (10 s)
+    /// is generous enough to let a typical sweep over the buffered queue drain,
+    /// but bounded so a hung downstream call (a stuck SQLite write, a
+    /// long-running delivery handler) cannot block host shutdown indefinitely.
+    /// On timeout the wait is abandoned and the timer is still disposed; the
+    /// sweep keeps running but will throw on the next call into a disposed
+    /// dependency, which is preferred to blocking shutdown forever.
+    /// </summary>
+    private static readonly TimeSpan SweepShutdownWaitTimeout = TimeSpan.FromSeconds(10);
     /// <summary>
     /// WP-10: Delivery handler delegate. The return value / exception is interpreted
     /// the same way on both the immediate-delivery path (<see cref="EnqueueAsync"/>)
@@ -120,7 +144,14 @@ public class StoreAndForwardService
     {
         await _storage.InitializeAsync();
         _retryTimer = new Timer(
-            _ => _ = RetryPendingMessagesAsync(),
+            // StoreAndForward-024: capture the sweep Task on each tick so
+            // StopAsync can await any in-flight invocation before the host
+            // disposes _storage/_replication underneath it. The RetryPending
+            // path is self-guarded against overlapping sweeps via the
+            // _retryInProgress Interlocked flag, so unconditionally re-assigning
+            // the field here cannot lose a still-running task (the new tick
+            // will short-circuit if one is already running).
+            _ => Volatile.Write(ref _sweepTask, RetryPendingMessagesAsync()),
             null,
             _options.RetryTimerInterval,
             _options.RetryTimerInterval);
@@ -131,15 +162,58 @@ public class StoreAndForwardService
     }
     /// <summary>
-    /// Stops the background retry timer.
+    /// Stops the background retry timer and waits (bounded) for any in-flight
+    /// retry sweep to finish before returning.
+    ///
+    /// StoreAndForward-024: prior to this fix, <see cref="StopAsync"/> only
+    /// disposed the timer; a sweep already inside
+    /// <see cref="RetryPendingMessagesAsync"/> continued running against
+    /// <see cref="_storage"/> and <see cref="_replication"/> after this method
+    /// returned, and could then NRE / throw on a disposed dependency once the
+    /// DI container ran its own shutdown. We now await the captured sweep task
+    /// (with a bounded <see cref="SweepShutdownWaitTimeout"/> so a hung
+    /// dependency cannot block host shutdown indefinitely) before returning.
     /// </summary>
     public async Task StopAsync()
     {
         if (_retryTimer != null)
         {
+            // Stop the periodic callback first so no new sweep starts while we
+            // are waiting for the in-flight one to drain.
             await _retryTimer.DisposeAsync();
             _retryTimer = null;
         }
+        var inflight = Volatile.Read(ref _sweepTask);
+        if (inflight is null || inflight.IsCompleted)
+        {
+            return;
+        }
+        try
+        {
+            // WaitAsync with a finite timeout: a hung delivery handler /
+            // storage call cannot block host shutdown indefinitely. On timeout
+            // the sweep keeps running but the host is free to proceed with
+            // disposal, which is preferred to never returning.
+            await inflight.WaitAsync(SweepShutdownWaitTimeout).ConfigureAwait(false);
+        }
+        catch (TimeoutException)
+        {
+            _logger.LogWarning(
+                "Store-and-forward retry sweep did not finish within {Timeout}; " +
+                "shutdown is proceeding while the sweep is still in-flight",
+                SweepShutdownWaitTimeout);
+        }
+        catch (Exception ex)
+        {
+            // The sweep itself already logs at Error on failure (see
+            // RetryPendingMessagesAsync's catch); we only log here so a
+            // surprise fault during shutdown is still visible. Swallow so the
+            // host's shutdown sequence can continue regardless.
+            _logger.LogWarning(ex,
+                "Store-and-forward retry sweep faulted during shutdown wait");
+        }
     }
     /// <summary>