fix(error-handling): close Theme 4 — 18 cancellation / fire-and-forget findings

Async cancellation hygiene, fire-and-forget observability, retry/shutdown
semantics, and audit-row coverage across 9 modules. Highlights:

Cancellation & lifecycle:
- AuditLog-006: SqliteAuditWriter.Dispose hops to the thread pool, escaping
  the captured SyncContext that risked a sync-over-async deadlock.
- AuditLog-010: SiteAuditTelemetryActor owns a private lifecycle CTS,
  threaded through drain paths instead of CancellationToken.None.
- Comm-019: CentralCommunicationActor adds a lifecycle CTS for repo calls.
- Host-019: Migration StartupRetry forwards ApplicationStopping so SIGTERM
  during the bounded-retry window aborts cleanly.
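
A minimal sketch of the lifecycle-CTS pattern behind AuditLog-010 and
Comm-019, assuming a plain class rather than an Akka actor. LifecycleOwner,
TryStartWork, and Stop are hypothetical names used only for illustration.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical stand-in for an actor that owns a private lifecycle CTS.
public sealed class LifecycleOwner
{
    private readonly CancellationTokenSource _lifecycleCts = new();

    // Starts background work bound to the owner's lifetime instead of
    // CancellationToken.None. Returns null if the owner has already stopped.
    public Task? TryStartWork(Func<CancellationToken, Task> work)
    {
        CancellationToken ct;
        try
        {
            ct = _lifecycleCts.Token; // throws once the CTS is disposed
        }
        catch (ObjectDisposedException)
        {
            return null;
        }
        return Task.Run(() => work(ct), ct);
    }

    // Plays the role of PostStop: cancel in-flight work, then release the CTS.
    public void Stop()
    {
        try
        {
            _lifecycleCts.Cancel();
        }
        catch (ObjectDisposedException)
        {
            // Double-stop is benign.
        }
        _lifecycleCts.Dispose();
    }
}
```

Work that honours the token ends as cancelled once Stop runs, and a late
TryStartWork after Stop returns null instead of throwing.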

Cursor / retry / counter correctness:
- AuditLog-004: SiteAuditReconciliationActor's cursor now holds at `since`
  when any row's idempotent insert is still being retried (per-EventId
  retry counter, MaxPermanentInsertAttempts=5 escape valve with LogCritical
  abandon). No more silent abandonment of permanently-failing rows.
- ConfigDB-019: Dropped the catch-and-continue on EnsureLookaheadAsync's
  SPLIT loop — by class-doc construction the catch could only mask real
  failures and let the next iteration create permanent partition holes.
- HM-017/018: HealthReportSender + CentralHealthReportLoop snapshot
  per-interval counters before sending, restore via new
  ISiteHealthCollector.AddIntervalCounters on transport failure so counts
  aren't silently lost.
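
The snapshot-and-restore idea in HM-017/018 can be sketched as follows. This
is an illustration under assumptions: CounterCollector and HealthReportSketch
are hypothetical, and the real ISiteHealthCollector.AddIntervalCounters
signature is not shown in this message.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical collector holding a single per-interval counter.
public sealed class CounterCollector
{
    private long _errors;

    public void RecordError() => Interlocked.Increment(ref _errors);

    // Atomically reads the interval count and resets it to zero.
    public long SnapshotAndReset() => Interlocked.Exchange(ref _errors, 0);

    // Adds a snapshot back on top of whatever accumulated while the send
    // was in flight, so neither set of counts is dropped.
    public void AddIntervalCounters(long errors) => Interlocked.Add(ref _errors, errors);

    public long Current => Interlocked.Read(ref _errors);
}

public static class HealthReportSketch
{
    // Sends one report. On transport failure the snapshot is restored so the
    // counts roll into the next interval instead of being lost.
    public static async Task<bool> SendOnceAsync(
        CounterCollector collector, Func<long, Task> send)
    {
        long snapshot = collector.SnapshotAndReset();
        try
        {
            await send(snapshot).ConfigureAwait(false);
            return true;
        }
        catch (Exception)
        {
            collector.AddIntervalCounters(snapshot);
            return false;
        }
    }
}
```

Restoring by addition rather than assignment matters here: increments
recorded during the failed send stay in the total.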

Fire-and-forget / shutdown waits:
- InboundAPI-018: AuditWriteMiddleware observes faulted audit-write tasks
  via OnlyOnFaulted continuation (Warning log; response unchanged).
- SnF-024: StoreAndForwardService.StopAsync awaits in-flight retry sweep
  with a bounded SweepShutdownWaitTimeout (10s).
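
A hedged sketch of the two techniques named above: observing a
fire-and-forget task only when it faults, and waiting on in-flight work with
an upper bound. FireAndForget, Observe, and WaitBoundedAsync are hypothetical
helpers, not the middleware or service code itself.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public static class FireAndForget
{
    // Attaches a continuation that runs only if the task faults, so the
    // exception is observed and reported without awaiting the task.
    public static Task Observe(Task task, Action<Exception> onFault)
    {
        return task.ContinueWith(
            t => onFault(t.Exception!.GetBaseException()),
            CancellationToken.None,
            TaskContinuationOptions.OnlyOnFaulted | TaskContinuationOptions.ExecuteSynchronously,
            TaskScheduler.Default);
    }

    // Returns true if the task finished (in any state) before the timeout,
    // false if the timeout elapsed first. The task itself is not cancelled.
    public static async Task<bool> WaitBoundedAsync(Task inFlight, TimeSpan timeout)
    {
        Task finished = await Task.WhenAny(inFlight, Task.Delay(timeout)).ConfigureAwait(false);
        return finished == inFlight;
    }
}
```

The task returned by Observe ends as cancelled when the antecedent does not
fault, so callers would normally discard it rather than await it.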

Leak / refactor:
- Comm-021: SiteStreamGrpcServer.SubscribeInstance wraps Subscribe in its
  own try/catch so a throw doesn't leak the relay actor or _activeStreams
  entry.
- Comm-022: VERIFIED already-closed by Comm-016's dead-code purge.
- CLI-017: BundleCommands' three subcommands delegate to ExecuteCommandAsync
  (auth-failure exit-code contract unified).

Defensive / validation:
- CLI-021: CliConfig.Load wraps file-read/JSON parse so malformed config
  prints a warning and returns defaults instead of crashing the CLI.
- Host-022: ParseLevel emits a one-shot stderr warning for an unrecognised
  MinimumLevel instead of silently coercing to Information.
- ESG-019: ExternalSystemClient sets HttpClient.Timeout to infinite so the
  per-call CTS is the sole timeout source (calls were previously clipped by
  HttpClient's default 100s timeout).
- Security-020: New SecurityOptionsValidator (IValidateOptions) rejects
  empty LdapServer/LdapSearchBase with ValidateOnStart.
- DM-019: Lifecycle command timeouts now emit DisableTimedOut/EnableTimedOut/
  DeleteTimedOut audit entries (mirrors DeployFailed pattern).
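
For ESG-019, the per-call timeout arrangement might look like the sketch
below. PerCallTimeoutSketch and its methods are hypothetical; only
HttpClient.Timeout, Timeout.InfiniteTimeSpan, and the linked-CTS calls are
standard .NET APIs.

```csharp
using System;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;

public static class PerCallTimeoutSketch
{
    // HttpClient.Timeout defaults to 100 seconds. Setting it to infinite
    // leaves the per-call token as the only deadline.
    public static HttpClient CreateClient(HttpMessageHandler handler) =>
        new HttpClient(handler) { Timeout = Timeout.InfiniteTimeSpan };

    // Links the caller's token with a per-call deadline so either one can
    // cancel the request.
    public static async Task<HttpResponseMessage> GetWithTimeoutAsync(
        HttpClient client, Uri uri, TimeSpan perCallTimeout, CancellationToken caller)
    {
        using var cts = CancellationTokenSource.CreateLinkedTokenSource(caller);
        cts.CancelAfter(perCallTimeout);
        return await client.GetAsync(uri, cts.Token).ConfigureAwait(false);
    }
}
```

With this shape a call configured for more than 100 seconds is no longer cut
short by the client-wide default.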

Plus reconciled stale per-module Open-findings counters that had drifted
from prior sessions.

20+ new regression tests across 11 test projects; build clean; affected
suites all green. README regenerated: 75 open (was 93).
Joseph Doherty
2026-05-28 07:13:28 -04:00
parent 819f1b4665
commit 6ae0fea558
44 changed files with 1708 additions and 200 deletions
@@ -75,6 +75,14 @@ public class CentralCommunicationActor : ReceiveActor
    private ICancelable? _refreshSchedule;

    /// <summary>
    /// Communication-019: per-actor lifecycle CTS threaded into the periodic
    /// <see cref="LoadSiteAddressesFromDb"/> repository call so a hung MS SQL
    /// connection is bounded by actor shutdown rather than holding piped tasks
    /// open indefinitely. Cancelled in <see cref="PostStop"/>; never reset.
    /// </summary>
    private readonly CancellationTokenSource _lifecycleCts = new();

    /// <summary>
    /// Proxy <see cref="IActorRef"/> for the central NotificationOutboxActor cluster singleton.
    /// Set via <see cref="RegisterNotificationOutbox"/> — the Host creates the singleton proxy
@@ -358,11 +366,26 @@ public class CentralCommunicationActor : ReceiveActor
    private void LoadSiteAddressesFromDb()
    {
        var self = Self;
        // Communication-019: pass the actor's lifecycle CT into the repository
        // call so a hung database query is cancelled when the actor stops
        // rather than leaving the piped task to accumulate. Captured locally
        // because the lifecycle CTS may have been disposed by PostStop on a
        // racing late tick; treat that as "actor gone, give up".
        CancellationToken ct;
        try
        {
            ct = _lifecycleCts.Token;
        }
        catch (ObjectDisposedException)
        {
            return;
        }
        Task.Run(async () =>
        {
            using var scope = _serviceProvider.CreateScope();
            var repo = scope.ServiceProvider.GetRequiredService<ISiteRepository>();
            var sites = await repo.GetAllSitesAsync(ct).ConfigureAwait(false);
            var contacts = new Dictionary<string, List<string>>();
            foreach (var site in sites)
@@ -495,6 +518,17 @@ public class CentralCommunicationActor : ReceiveActor
    {
        _log.Info("CentralCommunicationActor stopped");
        _refreshSchedule?.Cancel();
        // Communication-019: cancel any in-flight LoadSiteAddressesFromDb so a
        // hung MS SQL query does not outlive the actor.
        try
        {
            _lifecycleCts.Cancel();
        }
        catch (ObjectDisposedException)
        {
            // Double-stop is benign.
        }
        _lifecycleCts.Dispose();
    }
}