fix(error-handling): close Theme 4 — 18 cancellation / fire-and-forget findings

Async cancellation hygiene, fire-and-forget observability, retry/shutdown
semantics, and audit-row coverage across 9 modules. Highlights:

Cancellation & lifecycle:
- AuditLog-006: SqliteAuditWriter.Dispose hops to thread pool, escaping the
  captured SyncContext that risked sync-over-async deadlock.
- AuditLog-010: SiteAuditTelemetryActor owns a private lifecycle CTS,
  threaded through drain paths instead of CancellationToken.None.
- Comm-019: CentralCommunicationActor adds lifecycle CTS for repo calls.
- Host-019: Migration StartupRetry forwards ApplicationStopping so SIGTERM
  during the bounded-retry window aborts cleanly.

Cursor / retry / counter correctness:
- AuditLog-004: SiteAuditReconciliationActor's cursor now holds at `since`
  when any row's idempotent insert is still being retried (per-EventId
  retry counter, MaxPermanentInsertAttempts=5 escape valve with LogCritical
  abandon). No more silent abandonment of permanently-failing rows.
- ConfigDB-019: Dropped the catch-and-continue on EnsureLookaheadAsync's
  SPLIT loop — by class-doc construction the catch could only mask real
  failures and let the next iteration create permanent partition holes.
- HM-017/018: HealthReportSender + CentralHealthReportLoop snapshot
  per-interval counters before sending, restore via new
  ISiteHealthCollector.AddIntervalCounters on transport failure so counts
  aren't silently lost.

Fire-and-forget / shutdown waits:
- InboundAPI-018: AuditWriteMiddleware observes faulted audit-write tasks
  via OnlyOnFaulted continuation (Warning log; response unchanged).
- SnF-024: StoreAndForwardService.StopAsync awaits in-flight retry sweep
  with a bounded SweepShutdownWaitTimeout (10s).

Leak / refactor:
- Comm-021: SiteStreamGrpcServer.SubscribeInstance wraps Subscribe in its
  own try/catch so a throw doesn't leak the relay actor or _activeStreams
  entry.
- Comm-022: VERIFIED already-closed by Comm-016's dead-code purge.
- CLI-017: BundleCommands' three subcommands delegate to ExecuteCommandAsync
  (auth-failure exit-code contract unified).

Defensive / validation:
- CLI-021: CliConfig.Load wraps file-read/JSON parse so malformed config
  prints a warning and returns defaults instead of crashing the CLI.
- Host-022: ParseLevel emits a one-shot stderr warning for unrecognised
  MinimumLevel instead of silently coercing to Information.
- ESG-019: ExternalSystemClient sets HttpClient.Timeout=Infinite so the
  per-call CTS is the sole timeout source (was clipped to 100s by .NET).
- Security-020: New SecurityOptionsValidator (IValidateOptions) rejects
  empty LdapServer/LdapSearchBase with ValidateOnStart.
- DM-019: Lifecycle command timeouts now emit DisableTimedOut/EnableTimedOut/
  DeleteTimedOut audit entries (mirrors DeployFailed pattern).

Plus reconciled stale per-module Open-findings counters that had drifted
from prior sessions.

20+ new regression tests across 11 test projects; build clean; affected
suites all green. README regenerated: 75 open (was 93).
Joseph Doherty
2026-05-28 07:13:28 -04:00
parent 819f1b4665
commit 6ae0fea558
44 changed files with 1708 additions and 200 deletions
+30 -4
@@ -8,7 +8,7 @@
| Last reviewed | 2026-05-28 |
| Reviewer | claude-agent |
| Commit reviewed | `1eb6e97` |
| Open findings | 7 |
| Open findings | 4 |
## Summary
@@ -957,7 +957,7 @@ Communication.Tests).
|--|--|
| Severity | Low |
| Category | Error handling & resilience |
| Status | Open |
| Status | Resolved |
| Location | `src/ScadaLink.Communication/Actors/CentralCommunicationActor.cs:397-431` |
**Description**
@@ -980,6 +980,15 @@ Maintain a per-load `CancellationTokenSource` with a deadline (e.g. the same
Pass its `Token` to `GetAllSitesAsync`. Cancel the prior token before spinning
a new load to avoid task accumulation.
**Resolution (2026-05-28):** Added a per-actor lifecycle `CancellationTokenSource`
to `CentralCommunicationActor`, cancelled and disposed in `PostStop`. Its `Token`
is now passed into `repo.GetAllSitesAsync(ct)`, so a hung MS SQL query is
bounded by actor shutdown rather than holding piped tasks open. The existing
60-second refresh cadence and `Status.Failure` handler (Comm-006) are unchanged.
A per-load deadline was scoped out as a future enhancement; this fix
addresses the immediate "no upper bound on actor stop" concern called out in
the finding.
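
The lifecycle-token pattern described in this resolution can be sketched roughly as follows. This is a minimal illustration only, not the actor's actual code: `LifecycleOwner`, `HungQueryAsync`, and `StartLoad` are hypothetical names, Akka is left out entirely, and `Dispose` stands in for `PostStop`.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical stand-in for the actor: owns one CancellationTokenSource for
// its whole lifetime and cancels it when it stops.
public sealed class LifecycleOwner : IDisposable
{
    private readonly CancellationTokenSource _lifecycleCts = new();

    // Simulates a repository call that never completes on its own.
    private static async Task<int> HungQueryAsync(CancellationToken ct)
    {
        await Task.Delay(Timeout.InfiniteTimeSpan, ct);
        return 0;
    }

    // Every load observes the lifecycle token instead of CancellationToken.None.
    public Task<int> StartLoad() => HungQueryAsync(_lifecycleCts.Token);

    // Plays the role of PostStop: cancel first so in-flight work unwinds,
    // then dispose the source.
    public void Dispose()
    {
        _lifecycleCts.Cancel();
        _lifecycleCts.Dispose();
    }
}

public static class Program
{
    public static async Task Main()
    {
        var owner = new LifecycleOwner();
        Task<int> load = owner.StartLoad();

        owner.Dispose(); // shutdown while the query is still pending

        try
        {
            await load;
        }
        catch (OperationCanceledException)
        {
            Console.WriteLine("load cancelled by shutdown");
        }
    }
}
```

In this sketch the token bounds the call only by shutdown. A per-load deadline, which the resolution scopes out, would need an additional linked source or a `CancelAfter` call.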
---
### Communication-020 — `SiteAddressCacheLoaded` carries mutable `Dictionary`/`List` types
@@ -1017,7 +1026,7 @@ once per refresh tick.
|--|--|
| Severity | Low |
| Category | Error handling & resilience |
| Status | Open |
| Status | Resolved |
| Location | `src/ScadaLink.Communication/Grpc/SiteStreamGrpcServer.cs:188-200` |
**Description**
@@ -1047,6 +1056,17 @@ creation *inside* the existing `try` block (the `finally` will then handle
cleanup uniformly). Option (b) is the simplest — just move lines 189-194 down
past the `try {` brace.
**Resolution (2026-05-28):** Took option (a). `_streamSubscriber.Subscribe(...)`
is now wrapped in its own try/catch — on throw, the freshly-created relay actor
is stopped via `_actorSystem.Stop`, the bounded channel is completed via
`channel.Writer.TryComplete()`, and the `_activeStreams` entry is removed via
the ownership-preserving `TryRemove(KeyValuePair)` overload before the
exception is re-thrown to the caller. Added regression test
`SiteStreamGrpcServerTests.Comm021_SubscribeThrows_StopsRelayActorAndRemovesActiveStreamEntry`
using an NSubstitute `ISiteStreamSubscriber` that throws on Subscribe;
asserts `ActiveStreamCount == 0` and that `RemoveSubscriber` was NOT called
(confirming the catch path, not the finally path).
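
The cleanup-on-throw shape described above might look roughly like the sketch below. It is an assumption-laden illustration, not the server's code: `StreamRegistry`, `StreamEntry`, and the `Action subscribe` parameter are hypothetical, and the relay-actor stop is omitted because it needs an actor system. The `TryRemove(KeyValuePair)` overload of `ConcurrentDictionary` (available from .NET 5) removes the entry only if the key still maps to the same value, so a newer stream registered under the same key would be left alone.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading.Channels;

// Hypothetical per-stream state; compared by reference.
public sealed class StreamEntry { }

public sealed class StreamRegistry
{
    private readonly ConcurrentDictionary<string, StreamEntry> _activeStreams = new();

    public int ActiveStreamCount => _activeStreams.Count;

    // 'subscribe' stands in for the subscriber call that may throw.
    public void SubscribeInstance(string correlationId, Action subscribe)
    {
        var entry = new StreamEntry();
        _activeStreams[correlationId] = entry;
        var channel = Channel.CreateBounded<string>(16);

        try
        {
            subscribe();
        }
        catch
        {
            // Undo everything created above before surfacing the failure.
            channel.Writer.TryComplete();
            // Ownership-preserving removal: succeeds only if the key still
            // maps to *our* entry.
            _activeStreams.TryRemove(
                new KeyValuePair<string, StreamEntry>(correlationId, entry));
            throw;
        }
    }
}

public static class Program
{
    public static void Main()
    {
        var registry = new StreamRegistry();
        try
        {
            registry.SubscribeInstance("corr-1",
                () => throw new InvalidOperationException("subscribe failed"));
        }
        catch (InvalidOperationException)
        {
            Console.WriteLine($"active streams after failure: {registry.ActiveStreamCount}");
        }
    }
}
```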
---
### Communication-022 — `_debugSubscriptions` keyed by caller-supplied correlation ID; reuse silently orphans the prior subscriber
@@ -1055,7 +1075,7 @@ past the `try {` brace.
|--|--|
| Severity | Low |
| Category | Correctness & logic bugs |
| Status | Open |
| Status | Resolved |
| Location | `src/ScadaLink.Communication/Actors/CentralCommunicationActor.cs:67`, `src/ScadaLink.Communication/Actors/CentralCommunicationActor.cs:493` |
**Description**
@@ -1084,3 +1104,9 @@ subscription with an error response or evict the prior subscriber via
`DebugStreamTerminated` before installing the new one. Mirrors the
`SiteStreamGrpcServer` defensive behaviour where a duplicate `correlation_id`
cancels the existing stream (line 167).
**Resolution (2026-05-28):** Closed by Comm-016 — field removed in commit ac96b83.
The `_debugSubscriptions` dictionary, `TrackMessageForCleanup` helper, and the
`HandleConnectionStateChanged` handler that consumed them were all deleted as
part of Comm-016's dead-code purge. No map keyed by a caller-supplied
correlation ID remains to be overwritten, so the orphan-on-reuse hazard is gone.