fix(error-handling): close Theme 4 — 18 cancellation / fire-and-forget findings

Async cancellation hygiene, fire-and-forget observability, retry/shutdown
semantics, and audit-row coverage across 9 modules.

Highlights:

Cancellation & lifecycle:
- AuditLog-006: SqliteAuditWriter.Dispose hops to thread pool, escaping the
  captured SyncContext that risked sync-over-async deadlock.
- AuditLog-010: SiteAuditTelemetryActor owns a private lifecycle CTS,
  threaded through drain paths instead of CancellationToken.None.
- Comm-019: CentralCommunicationActor adds lifecycle CTS for repo calls.
- Host-019: Migration StartupRetry forwards ApplicationStopping so SIGTERM
  during the bounded-retry window aborts cleanly.

Cursor / retry / counter correctness:
- AuditLog-004: SiteAuditReconciliationActor's cursor now holds at `since`
  when any row's idempotent insert is still being retried (per-EventId retry
  counter, MaxPermanentInsertAttempts=5 escape valve with LogCritical
  abandon). No more silent abandonment of permanently-failing rows.
- ConfigDB-019: Dropped the catch-and-continue on EnsureLookaheadAsync's
  SPLIT loop: by class-doc construction the catch could only mask real
  failures and let the next iteration create permanent partition holes.
- HM-017/018: HealthReportSender + CentralHealthReportLoop snapshot
  per-interval counters before sending, restore via new
  ISiteHealthCollector.AddIntervalCounters on transport failure so counts
  aren't silently lost.

Fire-and-forget / shutdown waits:
- InboundAPI-018: AuditWriteMiddleware observes faulted audit-write tasks via
  OnlyOnFaulted continuation (Warning log; response unchanged).
- SnF-024: StoreAndForwardService.StopAsync awaits in-flight retry sweep with
  a bounded SweepShutdownWaitTimeout (10s).

Leak / refactor:
- Comm-021: SiteStreamGrpcServer.SubscribeInstance wraps Subscribe in its own
  try/catch so a throw doesn't leak the relay actor or _activeStreams entry.
- Comm-022: VERIFIED already-closed by Comm-016's dead-code purge.
- CLI-017: BundleCommands' three subcommands delegate to ExecuteCommandAsync
  (auth-failure exit-code contract unified).

Defensive / validation:
- CLI-021: CliConfig.Load wraps file-read/JSON parse so malformed config
  prints a warning and returns defaults instead of crashing the CLI.
- Host-022: ParseLevel emits stderr one-shot warning for unrecognised
  MinimumLevel instead of silently coercing to Information.
- ESG-019: ExternalSystemClient sets HttpClient.Timeout=Infinite so the
  per-call CTS is the sole timeout source (was clipped to 100s by .NET).
- Security-020: New SecurityOptionsValidator (IValidateOptions) rejects empty
  LdapServer/LdapSearchBase with ValidateOnStart.
- DM-019: Lifecycle command timeouts now emit DisableTimedOut/EnableTimedOut/
  DeleteTimedOut audit entries (mirrors DeployFailed pattern).

Plus reconciled stale per-module Open-findings counters that had drifted from
prior sessions.

20+ new regression tests across 11 test projects; build clean; affected
suites all green. README regenerated: 75 open (was 93).
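
The bounded shutdown wait described for SnF-024 can be illustrated with a minimal sketch. This is not the actual StoreAndForwardService code: `BoundedShutdownDemo` and `WaitBoundedAsync` are hypothetical names, and the timeouts are shortened for the demo. The idea is to race the in-flight work against a delay so that shutdown never blocks longer than the configured timeout.

```csharp
using System;
using System.Threading.Tasks;

public static class BoundedShutdownDemo
{
    // Hypothetical helper: waits for inFlight, but never longer than timeout.
    // Returns true if the work finished in time, false if the wait gave up.
    public static async Task<bool> WaitBoundedAsync(Task inFlight, TimeSpan timeout)
    {
        // Task.WhenAny completes as soon as either task completes.
        var completed = await Task.WhenAny(inFlight, Task.Delay(timeout));
        return ReferenceEquals(completed, inFlight);
    }

    public static async Task Main()
    {
        // A sweep that finishes well inside the shutdown budget.
        var quickSweep = Task.Delay(TimeSpan.FromMilliseconds(50));
        var finished = await WaitBoundedAsync(quickSweep, TimeSpan.FromSeconds(1));
        Console.WriteLine($"quick sweep finished in time: {finished}");

        // A sweep that outlives the budget; the wait returns without it.
        var slowSweep = Task.Delay(TimeSpan.FromSeconds(5));
        finished = await WaitBoundedAsync(slowSweep, TimeSpan.FromMilliseconds(100));
        Console.WriteLine($"slow sweep finished in time: {finished}");
    }
}
```

A caller would typically log when the helper returns false, since the in-flight work keeps running in the background and is abandoned rather than cancelled by this pattern.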
@@ -251,10 +251,16 @@ public sealed class AuditWriteMiddleware
                 ForwardState = null,
             };
 
-            // Fire-and-forget — the writer itself swallows; the additional
-            // try/catch around the fire still protects us if WriteAsync throws
-            // synchronously before returning a task.
-            _ = _auditWriter.WriteAsync(evt);
+            // InboundAPI-018: fire-and-forget the writer so the user-facing
+            // response stays non-blocking (alog.md §13 — audit emission must
+            // NEVER abort or delay the user request), but observe the returned
+            // Task so an asynchronous fault is logged instead of vanishing into
+            // TaskScheduler.UnobservedTaskException. The outer try/catch still
+            // catches a synchronous throw before WriteAsync returns a task; the
+            // ContinueWith only fires on a faulted task and runs off-thread, so
+            // it does not block the response.
+            var writeTask = _auditWriter.WriteAsync(evt);
+            ObserveAuditWriteFault(writeTask, ctx);
         }
         catch (Exception ex)
         {
@@ -265,6 +271,36 @@ public sealed class AuditWriteMiddleware
         }
     }
 
+    /// <summary>
+    /// InboundAPI-018: observe the audit writer's returned <see cref="Task"/>
+    /// so a fault that surfaces ASYNCHRONOUSLY (e.g. a DB timeout deep in the
+    /// central audit pipeline) is logged at Warning rather than dropped into
+    /// <see cref="TaskScheduler.UnobservedTaskException"/>. Stays
+    /// fire-and-forget so the user-facing response is not delayed — the
+    /// continuation runs only on a faulted task and writes a single log line
+    /// off the request thread. A completed-successfully task takes the fast
+    /// path with no continuation scheduled.
+    /// </summary>
+    private void ObserveAuditWriteFault(Task writeTask, HttpContext ctx)
+    {
+        if (writeTask.IsCompletedSuccessfully)
+        {
+            return;
+        }
+
+        var method = ctx.Request.Method;
+        var path = ctx.Request.Path;
+        var status = ctx.Response.StatusCode;
+        _ = writeTask.ContinueWith(
+            t => _logger.LogWarning(
+                t.Exception,
+                "AuditWriteMiddleware async audit write faulted for {Method} {Path} (status {Status})",
+                method, path, status),
+            CancellationToken.None,
+            TaskContinuationOptions.OnlyOnFaulted | TaskContinuationOptions.ExecuteSynchronously,
+            TaskScheduler.Default);
+    }
+
     /// <summary>
     /// Reads the buffered request body up to <paramref name="capBytes"/> bytes
     /// into a string for the audit copy and rewinds the stream so the
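
The fault-observing pattern in the hunk above can be tried in isolation with the following sketch. It is a simplified stand-in, not the middleware itself: `FaultObserverDemo`, `Observe`, and `FailLaterAsync` are hypothetical names, a plain callback replaces the logger, and `ExecuteSynchronously` is deliberately left out. With that flag, an antecedent that has already faulted when `ContinueWith` is called runs its continuation inline on the calling thread, which may matter if the goal is to keep the work off the request thread.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public static class FaultObserverDemo
{
    // Hypothetical helper: attaches a continuation that runs only when the
    // task faults. The continuation task is returned so the demo can wait
    // for it; a real fire-and-forget caller would discard it.
    public static Task Observe(Task task, Action<string> log)
    {
        if (task.IsCompletedSuccessfully)
        {
            return Task.CompletedTask; // fast path: nothing to observe
        }

        return task.ContinueWith(
            // Reading t.Exception marks the fault as observed.
            t => log(t.Exception?.GetBaseException().Message ?? "unknown fault"),
            CancellationToken.None,
            TaskContinuationOptions.OnlyOnFaulted,
            TaskScheduler.Default);
    }

    // Simulates a write that fails asynchronously, after the caller moved on.
    private static async Task FailLaterAsync()
    {
        await Task.Delay(50);
        throw new TimeoutException("simulated DB timeout");
    }

    public static async Task Main()
    {
        var observer = Observe(
            FailLaterAsync(),
            message => Console.WriteLine($"audit write faulted: {message}"));

        // Awaiting is safe here only because the antecedent is known to
        // fault; had it succeeded, an OnlyOnFaulted continuation would be
        // cancelled and awaiting it would throw.
        await observer;
    }
}
```

Under these assumptions the program prints `audit write faulted: simulated DB timeout`, and the exception never reaches `TaskScheduler.UnobservedTaskException` because the continuation has read it.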