fix(error-handling): close Theme 4 — 18 cancellation / fire-and-forget findings

Async cancellation hygiene, fire-and-forget observability, retry/shutdown
semantics, and audit-row coverage across 9 modules. Highlights:

Cancellation & lifecycle:
- AuditLog-006: SqliteAuditWriter.Dispose hops to thread pool, escaping
  the captured SyncContext that risked sync-over-async deadlock.
- AuditLog-010: SiteAuditTelemetryActor owns a private lifecycle CTS,
  threaded through drain paths instead of CancellationToken.None.
- Comm-019: CentralCommunicationActor adds lifecycle CTS for repo calls.
- Host-019: Migration StartupRetry forwards ApplicationStopping so SIGTERM
  during the bounded-retry window aborts cleanly.

Cursor / retry / counter correctness:
- AuditLog-004: SiteAuditReconciliationActor's cursor now holds at `since`
  when any row's idempotent insert is still being retried (per-EventId
  retry counter, MaxPermanentInsertAttempts=5 escape valve with
  LogCritical abandon). No more silent abandonment of permanently-failing
  rows.
- ConfigDB-019: Dropped the catch-and-continue on EnsureLookaheadAsync's
  SPLIT loop: by class-doc construction the catch could only mask real
  failures and let the next iteration create permanent partition holes.
- HM-017/018: HealthReportSender + CentralHealthReportLoop snapshot
  per-interval counters before sending, restore via new
  ISiteHealthCollector.AddIntervalCounters on transport failure so counts
  aren't silently lost.

Fire-and-forget / shutdown waits:
- InboundAPI-018: AuditWriteMiddleware observes faulted audit-write tasks
  via OnlyOnFaulted continuation (Warning log; response unchanged).
- SnF-024: StoreAndForwardService.StopAsync awaits in-flight retry sweep
  with a bounded SweepShutdownWaitTimeout (10s).

Leak / refactor:
- Comm-021: SiteStreamGrpcServer.SubscribeInstance wraps Subscribe in its
  own try/catch so a throw doesn't leak the relay actor or _activeStreams
  entry.
- Comm-022: VERIFIED already-closed by Comm-016's dead-code purge.
- CLI-017: BundleCommands' three subcommands delegate to
  ExecuteCommandAsync (auth-failure exit-code contract unified).

Defensive / validation:
- CLI-021: CliConfig.Load wraps file-read/JSON parse so malformed config
  prints a warning and returns defaults instead of crashing the CLI.
- Host-022: ParseLevel emits stderr one-shot warning for unrecognised
  MinimumLevel instead of silently coercing to Information.
- ESG-019: ExternalSystemClient sets HttpClient.Timeout=Infinite so the
  per-call CTS is the sole timeout source (was clipped to 100s by .NET).
- Security-020: New SecurityOptionsValidator (IValidateOptions) rejects
  empty LdapServer/LdapSearchBase with ValidateOnStart.
- DM-019: Lifecycle command timeouts now emit DisableTimedOut/
  EnableTimedOut/DeleteTimedOut audit entries (mirrors DeployFailed
  pattern).

Plus reconciled stale per-module Open-findings counters that had drifted
from prior sessions.

20+ new regression tests across 11 test projects; build clean; affected
suites all green. README regenerated: 75 open (was 93).
@@ -1,5 +1,4 @@
 using System.Globalization;
-using Microsoft.Data.SqlClient;
 using Microsoft.EntityFrameworkCore;
 using Microsoft.Extensions.Logging;
 using Microsoft.Extensions.Logging.Abstractions;
@@ -178,22 +177,20 @@ WHERE pf.name = 'pf_AuditLog_Month';";
 ALTER PARTITION SCHEME {PartitionSchemeName} NEXT USED [{TargetFileGroup}];
 ALTER PARTITION FUNCTION {PartitionFunctionName}() SPLIT RANGE ('{literal}');";
 
-            try
-            {
-                await _context.Database.ExecuteSqlRawAsync(sql, ct).ConfigureAwait(false);
-                added.Add(next);
-            }
-            catch (SqlException ex)
-            {
-                // Belt-and-braces: even though we read max-boundary first, an
-                // ALTER from another process could have raced us. Logging at
-                // Warning rather than Error because the desired end state
-                // (boundary present) is satisfied by either path.
-                _logger.LogWarning(
-                    ex,
-                    "EnsureLookaheadAsync: SPLIT RANGE for boundary {Boundary:o} failed; continuing.",
-                    next);
-            }
+            // ConfigDB-019: the loop pre-reads max-boundary and only issues
+            // SPLITs for strictly-greater months, so msg 7708/7711 ("boundary
+            // already exists") cannot happen by construction. Any OTHER
+            // SqlException (permission revoked on the role, deadlock victim,
+            // log full, filegroup full, transient connection drop) means the
+            // boundary genuinely failed to create. The previous catch-and-
+            // continue silently moved on to the next month, splitting month
+            // N+1 successfully and leaving a permanent partition hole for
+            // month N that blocks partition-switch purge until an operator
+            // notices and rebuilds. Let SqlException propagate so the daily
+            // hosted-service tick logs an Error and the next tick retries
+            // from the same boundary (at-least-once, no holes).
+            await _context.Database.ExecuteSqlRawAsync(sql, ct).ConfigureAwait(false);
+            added.Add(next);
 
             next = NextMonthBoundary(next);
         }
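
The hole described in the ConfigDB-019 comment can be reproduced without a database. The following is only a minimal sketch under stated assumptions: `FakePartitionFunction`, `LookaheadSketch.EnsureLookahead`, and the `swallowFailures` switch are hypothetical names invented for illustration, and `InvalidOperationException` stands in for `SqlException`. It is not the project's `EnsureLookaheadAsync`; it only mirrors the loop shape visible in the hunk (pre-read the max boundary, split strictly-greater months).

```csharp
using System;
using System.Collections.Generic;

// Hypothetical in-memory stand-in for the partition function's boundary list.
public sealed class FakePartitionFunction
{
    private readonly SortedSet<DateTime> _boundaries = new();

    public FakePartitionFunction(DateTime initialBoundary) => _boundaries.Add(initialBoundary);

    // Boundary whose next SPLIT fails exactly once (simulated transient failure).
    public DateTime? FailOnce { get; set; }

    public DateTime Max => _boundaries.Max;
    public int Count => _boundaries.Count;
    public bool Has(DateTime boundary) => _boundaries.Contains(boundary);

    public void Split(DateTime boundary)
    {
        if (FailOnce == boundary)
        {
            FailOnce = null;
            throw new InvalidOperationException("simulated SPLIT failure");
        }
        _boundaries.Add(boundary);
    }
}

public static class LookaheadSketch
{
    public static DateTime NextMonthBoundary(DateTime boundary) => boundary.AddMonths(1);

    // swallowFailures = true models the old catch-and-continue;
    // swallowFailures = false models the new behaviour (exception propagates).
    public static List<DateTime> EnsureLookahead(
        FakePartitionFunction pf, DateTime horizon, bool swallowFailures)
    {
        var added = new List<DateTime>();
        var next = NextMonthBoundary(pf.Max); // only strictly-greater months are split

        while (next <= horizon)
        {
            if (swallowFailures)
            {
                try
                {
                    pf.Split(next);
                    added.Add(next);
                }
                catch (InvalidOperationException)
                {
                    // Old behaviour: log a warning and move on to the next month.
                }
            }
            else
            {
                pf.Split(next); // a failure aborts this run; the caller retries later
                added.Add(next);
            }

            next = NextMonthBoundary(next);
        }

        return added;
    }
}
```

In this model, a single failed split with `swallowFailures: true` leaves the failed month permanently missing, because every later run starts from the new, higher max boundary. With `swallowFailures: false` the run stops at the failed month, and a second call resumes from that same month and fills the range without gaps.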