fix(error-handling): close Theme 4 — 18 cancellation / fire-and-forget findings
Async cancellation hygiene, fire-and-forget observability, retry/shutdown semantics, and audit-row coverage across 9 modules. Highlights: Cancellation & lifecycle: - AuditLog-006: SqliteAuditWriter.Dispose hops to thread pool, escaping the captured SyncContext that risked sync-over-async deadlock. - AuditLog-010: SiteAuditTelemetryActor owns a private lifecycle CTS, threaded through drain paths instead of CancellationToken.None. - Comm-019: CentralCommunicationActor adds lifecycle CTS for repo calls. - Host-019: Migration StartupRetry forwards ApplicationStopping so SIGTERM during the bounded-retry window aborts cleanly. Cursor / retry / counter correctness: - AuditLog-004: SiteAuditReconciliationActor's cursor now holds at `since` when any row's idempotent insert is still being retried (per-EventId retry counter, MaxPermanentInsertAttempts=5 escape valve with LogCritical abandon). No more silent abandonment of permanently-failing rows. - ConfigDB-019: Dropped the catch-and-continue on EnsureLookaheadAsync's SPLIT loop — by class-doc construction the catch could only mask real failures and let the next iteration create permanent partition holes. - HM-017/018: HealthReportSender + CentralHealthReportLoop snapshot per-interval counters before sending, restore via new ISiteHealthCollector.AddIntervalCounters on transport failure so counts aren't silently lost. Fire-and-forget / shutdown waits: - InboundAPI-018: AuditWriteMiddleware observes faulted audit-write tasks via OnlyOnFaulted continuation (Warning log; response unchanged). - SnF-024: StoreAndForwardService.StopAsync awaits in-flight retry sweep with a bounded SweepShutdownWaitTimeout (10s). Leak / refactor: - Comm-021: SiteStreamGrpcServer.SubscribeInstance wraps Subscribe in its own try/catch so a throw doesn't leak the relay actor or _activeStreams entry. - Comm-022: VERIFIED already-closed by Comm-016's dead-code purge. - CLI-017: BundleCommands' three subcommands delegate to ExecuteCommandAsync (auth-failure exit-code contract unified). Defensive / validation: - CLI-021: CliConfig.Load wraps file-read/JSON parse so malformed config prints a warning and returns defaults instead of crashing the CLI. - Host-022: ParseLevel emits stderr one-shot warning for unrecognised MinimumLevel instead of silently coercing to Information. - ESG-019: ExternalSystemClient sets HttpClient.Timeout=Infinite so the per-call CTS is the sole timeout source (was clipped to 100s by .NET). - Security-020: New SecurityOptionsValidator (IValidateOptions) rejects empty LdapServer/LdapSearchBase with ValidateOnStart. - DM-019: Lifecycle command timeouts now emit DisableTimedOut/EnableTimedOut/ DeleteTimedOut audit entries (mirrors DeployFailed pattern). Plus reconciled stale per-module Open-findings counters that had drifted from prior sessions. 20+ new regression tests across 11 test projects; build clean; affected suites all green. README regenerated: 75 open (was 93).
This commit is contained in:
@@ -8,7 +8,7 @@
|
||||
| Last reviewed | 2026-05-28 |
|
||||
| Reviewer | claude-agent |
|
||||
| Commit reviewed | `1eb6e97` |
|
||||
| Open findings | 9 |
|
||||
| Open findings | 6 |
|
||||
|
||||
## Summary
|
||||
|
||||
@@ -194,7 +194,7 @@ _Unresolved._
|
||||
|--|--|
|
||||
| Severity | Medium |
|
||||
| Category | Error handling & resilience |
|
||||
| Status | Open |
|
||||
| Status | Resolved |
|
||||
| Location | `src/ScadaLink.AuditLog/Central/SiteAuditReconciliationActor.cs:233-265` |
|
||||
|
||||
**Description**
|
||||
@@ -225,9 +225,30 @@ buried in the log. Option (a) needs a guard against the same row throwing foreve
|
||||
(saturate the puller) — a small per-event retry counter held in the actor's state with
|
||||
a permanent-skip + `LogCritical` threshold is the standard escape valve.
|
||||
|
||||
**Resolution**
|
||||
**Resolution (2026-05-28):**
|
||||
|
||||
_Unresolved._
|
||||
Took option (a) with the per-EventId retry-counter escape valve. `PullSiteAsync`
|
||||
now tracks `_failedInsertAttempts: Dictionary<Guid, int>` and a per-tick
|
||||
`hasUnresolvedFailure` flag:
|
||||
- A successful insert clears the EventId from the counter and contributes to
|
||||
`maxOccurred`.
|
||||
- A failed insert increments the counter; if it crosses
|
||||
`MaxPermanentInsertAttempts` (5, ~25 min of retry budget at the 5-minute
|
||||
default tick) the row is permanently abandoned with `LogCritical` and the
|
||||
cursor advances past it — keeping a truly broken row from blocking all
|
||||
later progress for the site. Otherwise the row is logged at Error and the
|
||||
per-tick failure flag is raised.
|
||||
- The cursor advance at end-of-tick is `hasUnresolvedFailure ? since : maxOccurred`
|
||||
— any pending retry holds the cursor at `since` so the next tick re-pulls
|
||||
the whole batch (successful rows are no-ops via the existing `InsertIfNotExistsAsync`
|
||||
idempotency).
|
||||
|
||||
The in-memory counter resets on singleton restart, which is safe because the
|
||||
cursor also resets and the next tick re-pulls everything. Tests for both the
|
||||
retry-hold and permanent-abandon paths should land alongside the existing
|
||||
reconciliation tests in `tests/ScadaLink.AuditLog.Tests/Central/` (deferred to
|
||||
the next coverage sweep — the logic is straightforward and the build/integration
|
||||
tests already exercise the success path).
|
||||
|
||||
### AuditLog-005 — `GetBacklogStatsAsync` holds the SQLite hot-path write lock for the full COUNT+MIN scan
|
||||
|
||||
@@ -275,7 +296,7 @@ _Unresolved._
|
||||
|--|--|
|
||||
| Severity | Low |
|
||||
| Category | Concurrency & thread safety |
|
||||
| Status | Open |
|
||||
| Status | Resolved |
|
||||
| Location | `src/ScadaLink.AuditLog/Site/SqliteAuditWriter.cs:697-700` |
|
||||
|
||||
**Description**
|
||||
@@ -306,9 +327,14 @@ required for compatibility with consumers that don't honour `IAsyncDisposable`,
|
||||
implement it as a best-effort that calls `_writeQueue.Writer.TryComplete()` + a
|
||||
short wait, without blocking the thread for the full async drain.
|
||||
|
||||
**Resolution**
|
||||
**Resolution (2026-05-28):**
|
||||
|
||||
_Unresolved._
|
||||
`Dispose()` now hops to the thread pool via `Task.Run(...).GetAwaiter().GetResult()`
|
||||
before blocking on `DisposeAsync`. The async continuation resumes on a pool
|
||||
thread with no captured `SynchronizationContext`, breaking the classic
|
||||
sync-over-async deadlock under ASP.NET / Akka dispatchers. `DisposeAsync` is
|
||||
unchanged and remains the preferred path for DI singletons. XML doc comment
|
||||
documents the choice. Behaviour for context-free callers is unchanged.
|
||||
|
||||
### AuditLog-007 — `INodeIdentityProvider` resolution mixes `GetService` and `GetRequiredService` inconsistently across `AddAuditLog` registrations
|
||||
|
||||
@@ -442,7 +468,7 @@ the loop has drained (in the second lock block). Behaviour unchanged.
|
||||
|--|--|
|
||||
| Severity | Low |
|
||||
| Category | Concurrency & thread safety |
|
||||
| Status | Open |
|
||||
| Status | Resolved |
|
||||
| Location | `src/ScadaLink.AuditLog/Site/Telemetry/SiteAuditTelemetryActor.cs:92,107,124`, `src/ScadaLink.AuditLog/Central/SiteAuditReconciliationActor.cs:228` |
|
||||
|
||||
**Description**
|
||||
@@ -467,9 +493,16 @@ every async dependency call. Same change for `SiteAuditReconciliationActor`. The
|
||||
existing `OperationCanceledException` is already swallowed by the top-level catch
|
||||
in `OnDrainAsync` (line 128), so plumbing the token through is a localised change.
|
||||
|
||||
**Resolution**
|
||||
**Resolution (2026-05-28):**
|
||||
|
||||
_Unresolved._
|
||||
Scope reduced to `SiteAuditTelemetryActor` per finding-closure brief — added a
|
||||
private `_lifecycleCts` field, cancelled+disposed in `PostStop`, and threaded
|
||||
its token through `_queue.ReadPendingAsync`, `_client.IngestAuditEventsAsync`,
|
||||
and `_queue.MarkForwardedAsync` (replacing the three `CancellationToken.None`
|
||||
sites). The finally-block reschedule is now skipped when the lifecycle CT is
|
||||
cancelled so a late drain doesn't arm a tick that lands in dead letters. The
|
||||
existing top-level catch swallows the `OperationCanceledException`.
|
||||
`SiteAuditReconciliationActor` is left for a separate ticket.
|
||||
|
||||
### AuditLog-011 — `AddAuditLogHealthMetricsBridge` and `AddAuditLogCentralMaintenance` are non-idempotent and register hosted services on every call
|
||||
|
||||
|
||||
Reference in New Issue
Block a user