fix(error-handling): close Theme 4 — 18 cancellation / fire-and-forget findings
Async cancellation hygiene, fire-and-forget observability, retry/shutdown
semantics, and audit-row coverage across 9 modules. Highlights:

Cancellation & lifecycle:
- AuditLog-006: SqliteAuditWriter.Dispose hops to thread pool, escaping the
  captured SyncContext that risked sync-over-async deadlock.
- AuditLog-010: SiteAuditTelemetryActor owns a private lifecycle CTS, threaded
  through drain paths instead of CancellationToken.None.
- Comm-019: CentralCommunicationActor adds lifecycle CTS for repo calls.
- Host-019: Migration StartupRetry forwards ApplicationStopping so SIGTERM
  during the bounded-retry window aborts cleanly.

Cursor / retry / counter correctness:
- AuditLog-004: SiteAuditReconciliationActor's cursor now holds at `since`
  when any row's idempotent insert is still being retried (per-EventId retry
  counter, MaxPermanentInsertAttempts=5 escape valve with LogCritical abandon).
  No more silent abandonment of permanently-failing rows.
- ConfigDB-019: Dropped the catch-and-continue on EnsureLookaheadAsync's SPLIT
  loop — by class-doc construction the catch could only mask real failures and
  let the next iteration create permanent partition holes.
- HM-017/018: HealthReportSender + CentralHealthReportLoop snapshot
  per-interval counters before sending, restore via new
  ISiteHealthCollector.AddIntervalCounters on transport failure so counts
  aren't silently lost.

Fire-and-forget / shutdown waits:
- InboundAPI-018: AuditWriteMiddleware observes faulted audit-write tasks via
  OnlyOnFaulted continuation (Warning log; response unchanged).
- SnF-024: StoreAndForwardService.StopAsync awaits in-flight retry sweep with
  a bounded SweepShutdownWaitTimeout (10s).

Leak / refactor:
- Comm-021: SiteStreamGrpcServer.SubscribeInstance wraps Subscribe in its own
  try/catch so a throw doesn't leak the relay actor or _activeStreams entry.
- Comm-022: VERIFIED already-closed by Comm-016's dead-code purge.
- CLI-017: BundleCommands' three subcommands delegate to ExecuteCommandAsync
  (auth-failure exit-code contract unified).

Defensive / validation:
- CLI-021: CliConfig.Load wraps file-read/JSON parse so malformed config
  prints a warning and returns defaults instead of crashing the CLI.
- Host-022: ParseLevel emits stderr one-shot warning for unrecognised
  MinimumLevel instead of silently coercing to Information.
- ESG-019: ExternalSystemClient sets HttpClient.Timeout=Infinite so the
  per-call CTS is the sole timeout source (was clipped to 100s by .NET).
- Security-020: New SecurityOptionsValidator (IValidateOptions) rejects empty
  LdapServer/LdapSearchBase with ValidateOnStart.
- DM-019: Lifecycle command timeouts now emit DisableTimedOut/EnableTimedOut/
  DeleteTimedOut audit entries (mirrors DeployFailed pattern).

Plus reconciled stale per-module Open-findings counters that had drifted from
prior sessions. 20+ new regression tests across 11 test projects; build clean;
affected suites all green. README regenerated: 75 open (was 93).
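
For readers unfamiliar with the `OnlyOnFaulted` approach mentioned under InboundAPI-018, the following is a minimal sketch of the general technique, not the actual `AuditWriteMiddleware` code. The `FireAndForget` class, the `Observe` method, and the `onFault` callback are hypothetical names introduced for illustration; only `Task.ContinueWith` and `TaskContinuationOptions` are real .NET APIs.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public static class FireAndForget
{
    // Hypothetical helper: attaches a continuation that runs only when the
    // task faults, so the exception is observed and reported without the
    // caller ever awaiting the original task.
    public static Task Observe(Task task, Action<Exception> onFault)
    {
        return task.ContinueWith(
            t =>
            {
                // In an OnlyOnFaulted continuation, t.Exception is set;
                // the pattern check just keeps the compiler satisfied.
                if (t.Exception is { } aggregate)
                {
                    onFault(aggregate.GetBaseException());
                }
            },
            CancellationToken.None,
            TaskContinuationOptions.OnlyOnFaulted | TaskContinuationOptions.ExecuteSynchronously,
            TaskScheduler.Default);
    }
}
```

A caller would typically discard the returned task (`_ = FireAndForget.Observe(write, ex => logger(ex));`), because when the original task succeeds the continuation task itself completes as canceled and awaiting it would throw.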
@@ -8,7 +8,7 @@
 | Last reviewed | 2026-05-28 |
 | Reviewer | claude-agent |
 | Commit reviewed | `1eb6e97` |
-| Open findings | 7 |
+| Open findings | 4 |
 
 ## Summary
 
@@ -973,9 +973,28 @@ be marked `Success`.
 |--|--|
 | Severity | Medium |
 | Category | Error handling & resilience |
-| Status | Open |
+| Status | Resolved |
 | Location | `src/ScadaLink.DeploymentManager/DeploymentService.cs:328-339,385-396,445-458` |
 
+**Resolution (2026-05-28):** added `TryLogLifecycleTimeoutAsync`, a private
+helper that mirrors the `DeployFailed` pattern — it calls `_auditService.LogAsync`
+with `CancellationToken.None` (so the operator's already-cancelled outer
+token cannot also prevent the audit write) and stamps the row with the
+`<Action>TimedOut` action name (`DisableTimedOut` / `EnableTimedOut` /
+`DeleteTimedOut`), the command id, the configured deadline, and the captured
+exception message. Each of `DisableInstanceAsync` / `EnableInstanceAsync` /
+`DeleteInstanceAsync` invokes the helper from its
+`catch (TimeoutException or OperationCanceledException)` block before
+returning the failure `Result`. The helper itself try/catches around the
+audit write so a failed audit pipeline does not mask the underlying timeout
+for the caller — it only logs at Warning. Regression tests
+`DisableInstanceAsync_LifecycleTimeout_WritesDisableTimedOutAuditEntry`,
+`EnableInstanceAsync_LifecycleTimeout_WritesEnableTimedOutAuditEntry`, and
+`DeleteInstanceAsync_LifecycleTimeout_WritesDeleteTimedOutAuditEntry` use the
+existing `SilentProbeActor` to keep the site unresponsive, configure a 300 ms
+`LifecycleCommandTimeout` to bound the wait, and assert the audit log
+received the corresponding `<Action>TimedOut` entry exactly once.
+
 **Description**
 
 `DisableInstanceAsync`, `EnableInstanceAsync`, and `DeleteInstanceAsync` each
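
The resolution note in the hunk above describes a timeout handler that writes an audit row with `CancellationToken.None` and swallows audit failures. A rough sketch of that shape follows, under stated assumptions: `LifecycleSketch`, `DisableAsync`, the `auditLog` delegate (standing in for `_auditService.LogAsync`), and the `warn` callback are all hypothetical, and the real `DeploymentService` code is not shown in this diff. The sketch uses an exception filter, which is how a combined `TimeoutException or OperationCanceledException` catch is usually spelled in C# 9 and later.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public sealed class LifecycleSketch
{
    // Hypothetical stand-ins for the audit service and the Warning-level logger.
    private readonly Func<string, string, CancellationToken, Task> _auditLog;
    private readonly Action<string> _warn;

    public LifecycleSketch(Func<string, string, CancellationToken, Task> auditLog, Action<string> warn)
    {
        _auditLog = auditLog;
        _warn = warn;
    }

    // Returns true on success, false when the command timed out or was cancelled.
    public async Task<bool> DisableAsync(
        Func<CancellationToken, Task> sendCommand, TimeSpan deadline, CancellationToken ct)
    {
        using var cts = CancellationTokenSource.CreateLinkedTokenSource(ct);
        cts.CancelAfter(deadline); // bound the wait for the site's reply
        try
        {
            await sendCommand(cts.Token);
            return true;
        }
        catch (Exception ex) when (ex is TimeoutException or OperationCanceledException)
        {
            await TryLogTimeoutAsync("DisableTimedOut", deadline, ex);
            return false;
        }
    }

    private async Task TryLogTimeoutAsync(string action, TimeSpan deadline, Exception ex)
    {
        try
        {
            // CancellationToken.None: the caller's token may already be
            // cancelled, and that must not also block the audit write.
            await _auditLog(action, $"deadline={deadline.TotalMilliseconds}ms; {ex.Message}", CancellationToken.None);
        }
        catch (Exception auditEx)
        {
            // A broken audit pipeline must not mask the timeout itself.
            _warn($"audit write failed for {action}: {auditEx.Message}");
        }
    }
}
```

Passing a `sendCommand` that never completes, such as `c => Task.Delay(Timeout.Infinite, c)`, with a short deadline exercises the timeout path in roughly the way the regression tests above are described as doing with `SilentProbeActor`.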