fix(error-handling): close Theme 4 — 18 cancellation / fire-and-forget findings
Async cancellation hygiene, fire-and-forget observability, retry/shutdown semantics, and audit-row coverage across 9 modules. Highlights:

Cancellation & lifecycle:
- AuditLog-006: SqliteAuditWriter.Dispose hops to thread pool, escaping the captured SyncContext that risked sync-over-async deadlock.
- AuditLog-010: SiteAuditTelemetryActor owns a private lifecycle CTS, threaded through drain paths instead of CancellationToken.None.
- Comm-019: CentralCommunicationActor adds lifecycle CTS for repo calls.
- Host-019: Migration StartupRetry forwards ApplicationStopping so SIGTERM during the bounded-retry window aborts cleanly.

Cursor / retry / counter correctness:
- AuditLog-004: SiteAuditReconciliationActor's cursor now holds at `since` when any row's idempotent insert is still being retried (per-EventId retry counter, MaxPermanentInsertAttempts=5 escape valve with LogCritical abandon). No more silent abandonment of permanently-failing rows.
- ConfigDB-019: Dropped the catch-and-continue on EnsureLookaheadAsync's SPLIT loop — by class-doc construction the catch could only mask real failures and let the next iteration create permanent partition holes.
- HM-017/018: HealthReportSender + CentralHealthReportLoop snapshot per-interval counters before sending, restore via new ISiteHealthCollector.AddIntervalCounters on transport failure so counts aren't silently lost.

Fire-and-forget / shutdown waits:
- InboundAPI-018: AuditWriteMiddleware observes faulted audit-write tasks via OnlyOnFaulted continuation (Warning log; response unchanged).
- SnF-024: StoreAndForwardService.StopAsync awaits in-flight retry sweep with a bounded SweepShutdownWaitTimeout (10s).

Leak / refactor:
- Comm-021: SiteStreamGrpcServer.SubscribeInstance wraps Subscribe in its own try/catch so a throw doesn't leak the relay actor or _activeStreams entry.
- Comm-022: VERIFIED already-closed by Comm-016's dead-code purge.
- CLI-017: BundleCommands' three subcommands delegate to ExecuteCommandAsync (auth-failure exit-code contract unified).

Defensive / validation:
- CLI-021: CliConfig.Load wraps file-read/JSON parse so malformed config prints a warning and returns defaults instead of crashing the CLI.
- Host-022: ParseLevel emits stderr one-shot warning for unrecognised MinimumLevel instead of silently coercing to Information.
- ESG-019: ExternalSystemClient sets HttpClient.Timeout=Infinite so the per-call CTS is the sole timeout source (was clipped to 100s by .NET).
- Security-020: New SecurityOptionsValidator (IValidateOptions) rejects empty LdapServer/LdapSearchBase with ValidateOnStart.
- DM-019: Lifecycle command timeouts now emit DisableTimedOut/EnableTimedOut/DeleteTimedOut audit entries (mirrors DeployFailed pattern).

Plus reconciled stale per-module Open-findings counters that had drifted from prior sessions.

20+ new regression tests across 11 test projects; build clean; affected suites all green. README regenerated: 75 open (was 93).
@@ -337,6 +337,49 @@ public class ExternalSystemClientTests
            $"Call took {sw.Elapsed}, expected to time out near the configured 200ms window");
    }

    /// <summary>
    /// ExternalSystemGateway-019 regression: <see cref="HttpClient.Timeout"/> defaults
    /// to 100 seconds and is enforced internally by <c>SendAsync</c> via its own
    /// private CTS — a <see cref="TaskCanceledException"/> raised by that internal
    /// CTS does not trip either the caller's token or the gateway's timeout CTS,
    /// so any operator-configured <see cref="ExternalSystemGatewayOptions.DefaultHttpTimeout"/>
    /// greater than 100 s would be silently clipped. The fix sets
    /// <see cref="System.Threading.Timeout.InfiniteTimeSpan"/> on the rented client
    /// so the per-call <c>CancellationTokenSource(DefaultHttpTimeout)</c> in
    /// <c>InvokeHttpAsync</c> is the sole timeout source. This test verifies the
    /// property is in fact set before any request is dispatched.
    /// </summary>
    [Fact]
    public async Task Call_DisablesHttpClientFrameworkTimeoutSoLongTimeoutsArentClipped()
    {
        var system = new ExternalSystemDefinition("TestAPI", "https://api.example.com", "none") { Id = 1 };
        var method = new ExternalSystemMethod("getData", "GET", "/data") { Id = 1, ExternalSystemDefinitionId = 1 };
        StubResolution(system, method);

        var httpClient = new HttpClient(new MockHttpMessageHandler(HttpStatusCode.OK, "{}"));
        // Sanity check: the factory-supplied default is the framework's 100 s — exactly
        // the value the fix must override so an operator-configured timeout > 100 s is
        // honoured verbatim.
        Assert.Equal(TimeSpan.FromSeconds(100), httpClient.Timeout);

        _httpClientFactory.CreateClient(Arg.Any<string>()).Returns(httpClient);

        var options = new ExternalSystemGatewayOptions { DefaultHttpTimeout = TimeSpan.FromMinutes(5) };
        var client = new ExternalSystemClient(
            _httpClientFactory, _repository,
            NullLogger<ExternalSystemClient>.Instance,
            options: Microsoft.Extensions.Options.Options.Create(options));

        var result = await client.CallAsync("TestAPI", "getData");

        Assert.True(result.Success);
        // After InvokeHttpAsync runs, the rented client's Timeout must have been
        // set to InfiniteTimeSpan — proving the framework-default 100 s clip is
        // disabled and the per-call CTS built from DefaultHttpTimeout is the
        // sole timeout source.
        Assert.Equal(Timeout.InfiniteTimeSpan, httpClient.Timeout);
    }

    [Fact]
    public async Task Call_CallerCancellation_IsNotMisreportedAsTimeout()
    {