fix(error-handling): close Theme 4 — 18 cancellation / fire-and-forget findings

Async cancellation hygiene, fire-and-forget observability, retry/shutdown
semantics, and audit-row coverage across 11 modules. Highlights:

Cancellation & lifecycle:
- AuditLog-006: SqliteAuditWriter.Dispose hops to thread pool, escaping the
  captured SyncContext that risked sync-over-async deadlock.
- AuditLog-010: SiteAuditTelemetryActor owns a private lifecycle CTS,
  threaded through drain paths instead of CancellationToken.None.
- Comm-019: CentralCommunicationActor adds lifecycle CTS for repo calls.
- Host-019: Migration StartupRetry forwards ApplicationStopping so SIGTERM
  during the bounded-retry window aborts cleanly.

Cursor / retry / counter correctness:
- AuditLog-004: SiteAuditReconciliationActor's cursor now holds at `since`
  when any row's idempotent insert is still being retried (per-EventId
  retry counter, MaxPermanentInsertAttempts=5 escape valve with LogCritical
  abandon). No more silent abandonment of permanently-failing rows.
- ConfigDB-019: Dropped the catch-and-continue on EnsureLookaheadAsync's
  SPLIT loop — by class-doc construction the catch could only mask real
  failures and let the next iteration create permanent partition holes.
- HM-017/018: HealthReportSender + CentralHealthReportLoop snapshot
  per-interval counters before sending, restore via new
  ISiteHealthCollector.AddIntervalCounters on transport failure so counts
  aren't silently lost.

Fire-and-forget / shutdown waits:
- InboundAPI-018: AuditWriteMiddleware observes faulted audit-write tasks
  via OnlyOnFaulted continuation (Warning log; response unchanged).
- SnF-024: StoreAndForwardService.StopAsync awaits in-flight retry sweep
  with a bounded SweepShutdownWaitTimeout (10s).

Leak / refactor:
- Comm-021: SiteStreamGrpcServer.SubscribeInstance wraps Subscribe in its
  own try/catch so a throw doesn't leak the relay actor or _activeStreams
  entry.
- Comm-022: VERIFIED already-closed by Comm-016's dead-code purge.
- CLI-017: BundleCommands' three subcommands delegate to ExecuteCommandAsync
  (auth-failure exit-code contract unified).

Defensive / validation:
- CLI-021: CliConfig.Load wraps file-read/JSON parse so malformed config
  prints a warning and returns defaults instead of crashing the CLI.
- Host-022: ParseLevel emits a one-shot stderr warning for an unrecognised
  MinimumLevel instead of silently coercing to Information.
- ESG-019: ExternalSystemClient sets HttpClient.Timeout=Infinite so the
  per-call CTS is the sole timeout source (was clipped to 100s by .NET).
- Security-020: New SecurityOptionsValidator (IValidateOptions) rejects
  empty LdapServer/LdapSearchBase with ValidateOnStart.
- DM-019: Lifecycle command timeouts now emit DisableTimedOut/EnableTimedOut/
  DeleteTimedOut audit entries (mirrors DeployFailed pattern).

Plus reconciled stale per-module Open-findings counters that had drifted
from prior sessions.

20+ new regression tests across 11 test projects; build clean; affected
suites all green. README regenerated: 75 open (was 93).
Joseph Doherty
2026-05-28 07:13:28 -04:00
parent 819f1b4665
commit 6ae0fea558
44 changed files with 1708 additions and 200 deletions
+43 -10
@@ -8,7 +8,7 @@
| Last reviewed | 2026-05-28 |
| Reviewer | claude-agent |
| Commit reviewed | `1eb6e97` |
| Open findings | 9 |
| Open findings | 6 |
## Summary
@@ -194,7 +194,7 @@ _Unresolved._
|--|--|
| Severity | Medium |
| Category | Error handling & resilience |
| Status | Open |
| Status | Resolved |
| Location | `src/ScadaLink.AuditLog/Central/SiteAuditReconciliationActor.cs:233-265` |
**Description**
@@ -225,9 +225,30 @@ buried in the log. Option (a) needs a guard against the same row throwing foreve
(saturate the puller) — a small per-event retry counter held in the actor's state with
a permanent-skip + `LogCritical` threshold is the standard escape valve.
**Resolution**
**Resolution (2026-05-28):**
_Unresolved._
Took option (a) with the per-EventId retry-counter escape valve. `PullSiteAsync`
now tracks `_failedInsertAttempts: Dictionary<Guid, int>` and a per-tick
`hasUnresolvedFailure` flag:
- A successful insert clears the EventId from the counter and contributes to
`maxOccurred`.
- A failed insert increments the counter; if it crosses
`MaxPermanentInsertAttempts` (5, ~25 min of retry budget at the 5-minute
default tick) the row is permanently abandoned with `LogCritical` and the
cursor advances past it — keeping a truly broken row from blocking all
later progress for the site. Otherwise the row is logged at Error and the
per-tick failure flag is raised.
- The cursor advance at end-of-tick is `hasUnresolvedFailure ? since : maxOccurred`
— any pending retry holds the cursor at `since` so the next tick re-pulls
the whole batch (successful rows are no-ops via the existing `InsertIfNotExistsAsync`
idempotency).
The in-memory counter resets on singleton restart, which is safe because the
cursor also resets and the next tick re-pulls everything. Tests for both the
retry-hold and permanent-abandon paths should land alongside the existing
reconciliation tests in `tests/ScadaLink.AuditLog.Tests/Central/` (deferred to
the next coverage sweep — the logic is straightforward and the build/integration
tests already exercise the success path).
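
The cursor rule described above can be sketched as a small single-file program. This is an illustration under stated assumptions, not the actor's code: `RunTick`, `tryInsert`, and the tuple shape are hypothetical, the `>=` comparison is one reading of "crosses", and logging is reduced to comments. Only `MaxPermanentInsertAttempts`, `hasUnresolvedFailure`, `since`, and `maxOccurred` come from the resolution text.

```csharp
using System;
using System.Collections.Generic;

// Illustrative sketch only; names other than those quoted in the
// resolution text are hypothetical.
const int MaxPermanentInsertAttempts = 5;
var failedInsertAttempts = new Dictionary<Guid, int>();

// Processes one pulled batch and returns the cursor value to persist.
DateTime RunTick(
    DateTime since,
    IEnumerable<(Guid EventId, DateTime Occurred)> rows,
    Func<Guid, bool> tryInsert) // hypothetical stand-in for the idempotent insert
{
    var maxOccurred = since;
    var hasUnresolvedFailure = false;

    foreach (var (eventId, occurred) in rows)
    {
        if (tryInsert(eventId))
        {
            // Success clears any retry history for this row.
            failedInsertAttempts.Remove(eventId);
            if (occurred > maxOccurred) maxOccurred = occurred;
            continue;
        }

        failedInsertAttempts.TryGetValue(eventId, out var previous);
        var attempts = previous + 1;
        failedInsertAttempts[eventId] = attempts;

        if (attempts >= MaxPermanentInsertAttempts)
        {
            // Escape valve: abandon the row (LogCritical in the real actor)
            // and let the cursor move past it.
            failedInsertAttempts.Remove(eventId);
            if (occurred > maxOccurred) maxOccurred = occurred;
        }
        else
        {
            // Still retrying (logged at Error): hold the cursor this tick.
            hasUnresolvedFailure = true;
        }
    }

    return hasUnresolvedFailure ? since : maxOccurred;
}
```

With one permanently failing row in the batch, the sketch returns `since` for the first four ticks and the batch's latest timestamp on the fifth.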
### AuditLog-005 — `GetBacklogStatsAsync` holds the SQLite hot-path write lock for the full COUNT+MIN scan
@@ -275,7 +296,7 @@ _Unresolved._
|--|--|
| Severity | Low |
| Category | Concurrency & thread safety |
| Status | Open |
| Status | Resolved |
| Location | `src/ScadaLink.AuditLog/Site/SqliteAuditWriter.cs:697-700` |
**Description**
@@ -306,9 +327,14 @@ required for compatibility with consumers that don't honour `IAsyncDisposable`,
implement it as a best-effort that calls `_writeQueue.Writer.TryComplete()` + a
short wait, without blocking the thread for the full async drain.
**Resolution**
**Resolution (2026-05-28):**
_Unresolved._
`Dispose()` now hops to the thread pool via `Task.Run(...).GetAwaiter().GetResult()`
before blocking on `DisposeAsync`. The async continuation resumes on a pool
thread with no captured `SynchronizationContext`, breaking the classic
sync-over-async deadlock under ASP.NET / Akka dispatchers. `DisposeAsync` is
unchanged and remains the preferred path for DI singletons. XML doc comment
documents the choice. Behaviour for context-free callers is unchanged.
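
A minimal sketch of the thread-pool hop, assuming a hypothetical `DisposeAsyncCore` in place of the writer's real drain. It is written as top-level statements with local functions so it runs as a single file; the actual `SqliteAuditWriter` is a class and may differ in detail.

```csharp
using System;
using System.Threading.Tasks;

var drained = false;

// Hypothetical stand-in for the writer's asynchronous drain.
async ValueTask DisposeAsyncCore()
{
    await Task.Delay(10); // pretend to flush the write queue
    drained = true;
}

// Synchronous path: start the async work on a pool thread, then block on it.
// The delegate passed to Task.Run executes without the caller's
// SynchronizationContext, so the awaits inside DisposeAsyncCore never try
// to resume on the thread that is blocked in GetResult().
void DisposeBlocking() =>
    Task.Run(async () => await DisposeAsyncCore()).GetAwaiter().GetResult();
```

The calling thread is still blocked until the drain finishes; the hop only removes the dependency on that thread being free to run continuations.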
### AuditLog-007 — `INodeIdentityProvider` resolution mixes `GetService` and `GetRequiredService` inconsistently across `AddAuditLog` registrations
@@ -442,7 +468,7 @@ the loop has drained (in the second lock block). Behaviour unchanged.
|--|--|
| Severity | Low |
| Category | Concurrency & thread safety |
| Status | Open |
| Status | Resolved |
| Location | `src/ScadaLink.AuditLog/Site/Telemetry/SiteAuditTelemetryActor.cs:92,107,124`, `src/ScadaLink.AuditLog/Central/SiteAuditReconciliationActor.cs:228` |
**Description**
@@ -467,9 +493,16 @@ every async dependency call. Same change for `SiteAuditReconciliationActor`. The
existing `OperationCanceledException` is already swallowed by the top-level catch
in `OnDrainAsync` (line 128), so plumbing the token through is a localised change.
**Resolution**
**Resolution (2026-05-28):**
_Unresolved._
Scope reduced to `SiteAuditTelemetryActor` per the finding-closure brief: added a
private `_lifecycleCts` field, cancelled+disposed in `PostStop`, and threaded
its token through `_queue.ReadPendingAsync`, `_client.IngestAuditEventsAsync`,
and `_queue.MarkForwardedAsync` (replacing the three `CancellationToken.None`
sites). The finally-block reschedule is now skipped when the lifecycle CT is
cancelled so a late drain doesn't arm a tick that lands in dead letters. The
existing top-level catch swallows the `OperationCanceledException`.
`SiteAuditReconciliationActor` is left for a separate ticket.
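
The lifecycle-token pattern can be sketched without Akka. The names `DrainOnceAsync`, `dependencyCall`, and the local `PostStop` function are hypothetical stand-ins for the actor's drain handler, its async dependencies, and the `PostStop` override.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

var lifecycleCts = new CancellationTokenSource();
// Capture the token once: reading CancellationTokenSource.Token after
// Dispose throws, but a token captured earlier stays usable.
var lifecycleToken = lifecycleCts.Token;

// Returns true when the caller should schedule the next drain tick.
async Task<bool> DrainOnceAsync(Func<CancellationToken, Task> dependencyCall)
{
    try
    {
        // The lifecycle token replaces CancellationToken.None here.
        await dependencyCall(lifecycleToken);
    }
    catch (OperationCanceledException)
    {
        // Shutdown raced the drain; nothing to report.
    }
    // Skip the reschedule once the owner has stopped.
    return !lifecycleToken.IsCancellationRequested;
}

// Plays the role of the actor's PostStop hook.
void PostStop()
{
    lifecycleCts.Cancel();
    lifecycleCts.Dispose();
}
```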
### AuditLog-011 — `AddAuditLogHealthMetricsBridge` and `AddAuditLogCentralMaintenance` are non-idempotent and register hosted services on every call
+7 -7
@@ -8,7 +8,7 @@
| Last reviewed | 2026-05-28 |
| Reviewer | claude-agent |
| Commit reviewed | `1eb6e97` |
| Open findings | 6 |
| Open findings | 4 |
## Summary
@@ -793,9 +793,11 @@ first-element-extra column still rendered).
|--|--|
| Severity | Medium |
| Category | Error handling & resilience |
| Status | Open |
| Status | Resolved |
| Location | `src/ScadaLink.CLI/Commands/BundleCommands.cs:244-289` (vs. `src/ScadaLink.CLI/Commands/CommandHelpers.cs:20-73`, `:159-174`) |
**Resolution (2026-05-28):** Extended `CommandHelpers.ExecuteCommandAsync` with optional `timeout` and `onSuccess` parameters so a caller can supply a longer per-command timeout (`BundleCommandTimeout`) and capture the success body for file I/O. The duplicated `RunBundleCommandAsync` was deleted; all three `bundle` sub-commands now delegate through `ExecuteCommandAsync`, which routes the error path through `IsAuthorizationFailure` — exit 2 fires on HTTP 403 OR a `FORBIDDEN`/`UNAUTHORIZED` error code regardless of status, unifying the contract with every other command group.
**Description**
`BundleCommands.RunBundleCommandAsync` re-implements the URL/credential resolution,
@@ -826,10 +828,6 @@ extract `CommandHelpers.IsAuthorizationFailure` to `internal` and call it from
`RunBundleCommandAsync` in place of the bare 403 check, and copy the canonical error
messages verbatim.
**Resolution**
_Unresolved._
### CLI-018 — `audit query` and `audit export` never return exit 2 for an authorization failure
| | |
@@ -969,9 +967,11 @@ _Unresolved._
|--|--|
| Severity | Low |
| Category | Error handling & resilience |
| Status | Open |
| Status | Resolved |
| Location | `src/ScadaLink.CLI/CliConfig.cs:41-53` |
**Resolution (2026-05-28):** Wrapped the `File.ReadAllText` + `JsonSerializer.Deserialize` calls in a `try/catch` for `JsonException`/`IOException`/`UnauthorizedAccessException` that prints one warning to `Console.Error` and falls through with default values, so command-line and env-var precedence still works against a malformed `~/.scadalink/config.json`. Regression test `CliConfigTests.Load_MalformedConfigFile_DoesNotThrow_WarnsAndReturnsDefault` redirects `HOME`/`USERPROFILE` to a temp dir containing invalid JSON, asserts no throw, defaulted values, and the stderr warning.
**Description**
`CliConfig.Load` is the first thing every command runs (via `ExecuteCommandAsync`,
+30 -4
@@ -8,7 +8,7 @@
| Last reviewed | 2026-05-28 |
| Reviewer | claude-agent |
| Commit reviewed | `1eb6e97` |
| Open findings | 7 |
| Open findings | 4 |
## Summary
@@ -957,7 +957,7 @@ Communication.Tests).
|--|--|
| Severity | Low |
| Category | Error handling & resilience |
| Status | Open |
| Status | Resolved |
| Location | `src/ScadaLink.Communication/Actors/CentralCommunicationActor.cs:397-431` |
**Description**
@@ -980,6 +980,15 @@ Maintain a per-load `CancellationTokenSource` with a deadline (e.g. the same
Pass its `Token` to `GetAllSitesAsync`. Cancel the prior token before spinning
a new load to avoid task accumulation.
**Resolution (2026-05-28):** Added a per-actor lifecycle `CancellationTokenSource`
on `CentralCommunicationActor`, cancelled+disposed in `PostStop`. Its `Token`
is now passed into `repo.GetAllSitesAsync(ct)` so a hung MS SQL query is
bounded by actor shutdown rather than holding piped tasks open. The existing
60-second refresh cadence and `Status.Failure` handler (Comm-006) are unchanged
— a deadline-per-load was scoped out as a future enhancement; this fix
addresses the immediate "no upper bound on actor stop" concern called out in
the finding.
---
### Communication-020 — `SiteAddressCacheLoaded` carries mutable `Dictionary`/`List` types
@@ -1017,7 +1026,7 @@ once per refresh tick.
|--|--|
| Severity | Low |
| Category | Error handling & resilience |
| Status | Open |
| Status | Resolved |
| Location | `src/ScadaLink.Communication/Grpc/SiteStreamGrpcServer.cs:188-200` |
**Description**
@@ -1047,6 +1056,17 @@ creation *inside* the existing `try` block (the `finally` will then handle
cleanup uniformly). Option (b) is the simplest — just move lines 189-194 down
past the `try {` brace.
**Resolution (2026-05-28):** Took option (a). `_streamSubscriber.Subscribe(...)`
is now wrapped in its own try/catch — on throw, the freshly-created relay actor
is stopped via `_actorSystem.Stop`, the bounded channel is completed via
`channel.Writer.TryComplete()`, and the `_activeStreams` entry is removed via
the ownership-preserving `TryRemove(KeyValuePair)` overload before the
exception is re-thrown to the caller. Added regression test
`SiteStreamGrpcServerTests.Comm021_SubscribeThrows_StopsRelayActorAndRemovesActiveStreamEntry`
using an NSubstitute `ISiteStreamSubscriber` that throws on Subscribe;
asserts `ActiveStreamCount == 0` and that `RemoveSubscriber` was NOT called
(confirming the catch path, not the finally path).
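
The cleanup-on-throw shape can be sketched as follows. `Register`, `subscribe`, and `stopRelay` are hypothetical stand-ins for the gRPC handler, `_streamSubscriber.Subscribe(...)`, and `_actorSystem.Stop`; the channel completion step is omitted. The `TryRemove(KeyValuePair)` overload is the real `ConcurrentDictionary` API (.NET 5 and later).

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;

var activeStreams = new ConcurrentDictionary<string, object>();

void Register(string correlationId, object entry, Action subscribe, Action stopRelay)
{
    activeStreams[correlationId] = entry;
    try
    {
        subscribe();
    }
    catch
    {
        stopRelay();
        // Removes the key only if it still maps to *our* entry, so a newer
        // stream registered under the same id is left alone.
        activeStreams.TryRemove(new KeyValuePair<string, object>(correlationId, entry));
        throw;
    }
}
```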
---
### Communication-022 — `_debugSubscriptions` keyed by caller-supplied correlation ID; reuse silently orphans the prior subscriber
@@ -1055,7 +1075,7 @@ past the `try {` brace.
|--|--|
| Severity | Low |
| Category | Correctness & logic bugs |
| Status | Open |
| Status | Resolved |
| Location | `src/ScadaLink.Communication/Actors/CentralCommunicationActor.cs:67`, `src/ScadaLink.Communication/Actors/CentralCommunicationActor.cs:493` |
**Description**
@@ -1084,3 +1104,9 @@ subscription with an error response or evict the prior subscriber via
`DebugStreamTerminated` before installing the new one. Mirrors the
`SiteStreamGrpcServer` defensive behaviour where a duplicate `correlation_id`
cancels the existing stream (line 167).
**Resolution (2026-05-28):** Closed by Comm-016 — field removed in commit ac96b83.
The `_debugSubscriptions` dictionary, `TrackMessageForCleanup` helper, and the
`HandleConnectionStateChanged` handler that consumed them were all deleted as
part of Comm-016's dead-code purge. There is no longer any caller-supplied
correlation-id keyed map to overwrite — the orphan-on-reuse hazard is gone.
+14 -2
@@ -8,7 +8,7 @@
| Last reviewed | 2026-05-28 |
| Reviewer | claude-agent |
| Commit reviewed | `1eb6e97` |
| Open findings | 6 |
| Open findings | 4 |
## Summary
@@ -1091,9 +1091,21 @@ modules.
|--|--|
| Severity | Medium |
| Category | Error handling & resilience |
| Status | Open |
| Status | Resolved |
| Location | `src/ScadaLink.ConfigurationDatabase/Maintenance/AuditLogPartitionMaintenance.cs:181-199` |
**Resolution (2026-05-28):** Took option (a) — dropped the `try/catch (SqlException)`
around the per-month SPLIT loop entirely (and the now-unused
`using Microsoft.Data.SqlClient`). By class-doc construction the catch could never
fire for "boundary already exists" (the loop pre-reads max-boundary and only issues
SPLITs for strictly-greater months), so its only effect was to mask real failures
(permission revoked, deadlock victim, log full, filegroup full) and let the next
iteration split the following month — leaving a permanent partition hole. Now any
`SqlException` propagates out of `EnsureLookaheadAsync`, surfaces to the central
daily-tick hosted service as an Error, and the next tick retries from the same
boundary (at-least-once, no holes). Replaced the catch block with an inline comment
explaining the rationale so a future maintainer doesn't reintroduce it.
**Description**
`EnsureLookaheadAsync` loops one month at a time from `next` up to `horizon` and
+21 -2
@@ -8,7 +8,7 @@
| Last reviewed | 2026-05-28 |
| Reviewer | claude-agent |
| Commit reviewed | `1eb6e97` |
| Open findings | 7 |
| Open findings | 4 |
## Summary
@@ -973,9 +973,28 @@ be marked `Success`.
|--|--|
| Severity | Medium |
| Category | Error handling & resilience |
| Status | Open |
| Status | Resolved |
| Location | `src/ScadaLink.DeploymentManager/DeploymentService.cs:328-339,385-396,445-458` |
**Resolution (2026-05-28):** Added `TryLogLifecycleTimeoutAsync`, a private
helper that mirrors the `DeployFailed` pattern — it calls `_auditService.LogAsync`
with `CancellationToken.None` (so the operator's already-cancelled outer
token cannot also prevent the audit write) and stamps the row with the
`<Action>TimedOut` action name (`DisableTimedOut` / `EnableTimedOut` /
`DeleteTimedOut`), the command id, the configured deadline, and the captured
exception message. Each of `DisableInstanceAsync` / `EnableInstanceAsync` /
`DeleteInstanceAsync` invokes the helper from its
`catch (TimeoutException or OperationCanceledException)` block before
returning the failure `Result`. The helper itself try/catches around the
audit write so a failed audit pipeline does not mask the underlying timeout
for the caller — it only logs at Warning. Regression tests
`DisableInstanceAsync_LifecycleTimeout_WritesDisableTimedOutAuditEntry`,
`EnableInstanceAsync_LifecycleTimeout_WritesEnableTimedOutAuditEntry`, and
`DeleteInstanceAsync_LifecycleTimeout_WritesDeleteTimedOutAuditEntry` use the
existing `SilentProbeActor` to keep the site unresponsive, configure a 300 ms
`LifecycleCommandTimeout` to bound the wait, and assert the audit log
received the corresponding `<Action>TimedOut` entry exactly once.
**Description**
`DisableInstanceAsync`, `EnableInstanceAsync`, and `DeleteInstanceAsync` each
@@ -8,7 +8,7 @@
| Last reviewed | 2026-05-28 |
| Reviewer | claude-agent |
| Commit reviewed | `1eb6e97` |
| Open findings | 5 |
| Open findings | 4 |
## Summary
@@ -1065,9 +1065,11 @@ message parks) and that no exception escapes the handler.
|--|--|
| Severity | Medium |
| Category | Design-document adherence |
| Status | Open |
| Status | Resolved |
| Location | `src/ScadaLink.ExternalSystemGateway/ExternalSystemClient.cs:226,257-264`, `src/ScadaLink.ExternalSystemGateway/ServiceCollectionExtensions.cs:90-102` |
**Resolution (2026-05-28):** Set `client.Timeout = Timeout.InfiniteTimeSpan` immediately after `_httpClientFactory.CreateClient($"ExternalSystem_{system.Name}")` in `ExternalSystemClient.InvokeHttpAsync`, disabling the framework's 100 s default so the per-call `CancellationTokenSource(_options.DefaultHttpTimeout)` linked CTS already built below is the sole timeout source. An operator-configured `DefaultHttpTimeout` greater than 100 s is now honoured verbatim instead of being silently clipped and misclassified as a transient "connection error". Kept the fix local to the allowed file (`ExternalSystemClient.cs`) rather than touching `ServiceCollectionExtensions.cs`/`GatewayHttpClientConfigurator`. Regression test `Call_DisablesHttpClientFrameworkTimeoutSoLongTimeoutsArentClipped` asserts the rented client starts with the framework's 100 s default and is set to `Timeout.InfiniteTimeSpan` after `InvokeHttpAsync` runs.
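
A rough sketch of the two pieces involved, assuming hypothetical helper names (`DisableFrameworkTimeout`, `SendWithPerCallTimeoutAsync`) rather than the gateway's actual methods. `HttpClient.Timeout`, `Timeout.InfiniteTimeSpan`, and `CancellationTokenSource.CreateLinkedTokenSource` are the real framework APIs.

```csharp
using System;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;

// Must run before the client sends its first request; HttpClient rejects
// changes to Timeout after that.
HttpClient DisableFrameworkTimeout(HttpClient client)
{
    client.Timeout = Timeout.InfiniteTimeSpan;
    return client;
}

// The linked CTS is now the only deadline: it fires on the caller's token
// or after perCallTimeout, whichever comes first.
async Task<HttpResponseMessage> SendWithPerCallTimeoutAsync(
    HttpClient client, HttpRequestMessage request,
    TimeSpan perCallTimeout, CancellationToken callerToken)
{
    using var cts = CancellationTokenSource.CreateLinkedTokenSource(callerToken);
    cts.CancelAfter(perCallTimeout);
    return await client.SendAsync(request, cts.Token);
}
```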
**Description**
The `-002` fix enforces the per-call timeout via a linked `CancellationTokenSource`
+7 -3
@@ -8,7 +8,7 @@
| Last reviewed | 2026-05-28 |
| Reviewer | claude-agent |
| Commit reviewed | `1eb6e97` |
| Open findings | 6 |
| Open findings | 4 |
## Summary
@@ -827,9 +827,11 @@ constructor.
|--|--|
| Severity | Medium |
| Category | Correctness & logic bugs |
| Status | Open |
| Status | Resolved |
| Location | `src/ScadaLink.HealthMonitoring/HealthReportSender.cs:140-154`, `src/ScadaLink.HealthMonitoring/SiteHealthCollector.cs:146-153` |
**Resolution (2026-05-28):** Wrapped `_transport.Send(reportWithSeq)` in an inner try/catch that, on failure, atomically restores the captured per-interval counts via a new `ISiteHealthCollector.AddIntervalCounters(scriptErrors, alarmErrors, deadLetters, siteAuditWriteFailures, auditRedactionFailures)` API backed by `Interlocked.Add`. Concurrent increments arriving during the Send accumulate against the zero left by `CollectReport`'s `Exchange`; the restore Add sums correctly with them. The new interface method ships with a default no-op so existing test fakes (`CountCapturingHealthCollector` etc.) keep compiling without per-fake updates. Regression test `HealthReportSenderTests.SendFailure_PreservesIntervalCountersForNextReport` pre-populates all five counters, makes the first Send throw, and asserts the next successful report carries the original counts (2 / 1 / 3 / 1 / 2).
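
The snapshot-and-restore idea can be sketched with a single counter. `SnapshotAndReset` and `TrySendReport` are hypothetical names, and the real `AddIntervalCounters` takes five counts rather than one; the `Interlocked` calls are the framework's.

```csharp
using System;
using System.Threading;

long scriptErrors = 0;

// What CollectReport does per counter: read the value and leave zero behind.
long SnapshotAndReset() => Interlocked.Exchange(ref scriptErrors, 0L);

// Restore path: add the snapshot back so it sums with any increments that
// arrived while the send was in flight.
void AddIntervalCounters(long restored) => Interlocked.Add(ref scriptErrors, restored);

bool TrySendReport(Action<long> send)
{
    var snapshot = SnapshotAndReset();
    try
    {
        send(snapshot);
        return true;
    }
    catch (Exception)
    {
        AddIntervalCounters(snapshot);
        return false;
    }
}
```

Using an add rather than a plain write on the restore path is what keeps concurrent increments from being overwritten.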
**Description**
`HealthReportSender.ExecuteAsync` calls `_collector.CollectReport(_siteId)` and
@@ -879,9 +881,11 @@ report includes the previously-failed interval's `ScriptErrorCount`.
|--|--|
| Severity | Low |
| Category | Correctness & logic bugs |
| Status | Open |
| Status | Resolved |
| Location | `src/ScadaLink.HealthMonitoring/CentralHealthReportLoop.cs:87-98` |
**Resolution (2026-05-28):** Same shape of fix as HealthMonitoring-017 — `_aggregator.ProcessReport(reportWithSeq)` now sits inside an inner try/catch that, on failure, calls `_collector.AddIntervalCounters(...)` with the captured report's counts. Reuses the same `ISiteHealthCollector.AddIntervalCounters` API; no extra collector surface. Regression test `CentralHealthReportLoopTests.ProcessReportFailure_PreservesIntervalCountersForNextReport` pre-populates all five counters, makes the first `ProcessReport` throw, and asserts the next successful report carries the original counts.
**Description**
`CentralHealthReportLoop.ExecuteAsync` calls `_collector.CollectReport(CentralSiteId)`
+7 -3
@@ -8,7 +8,7 @@
| Last reviewed | 2026-05-28 |
| Reviewer | claude-agent |
| Commit reviewed | `1eb6e97` |
| Open findings | 6 |
| Open findings | 4 |
## Summary
@@ -984,9 +984,11 @@ _Open._
|--|--|
| Severity | Low |
| Category | Concurrency & thread safety |
| Status | Open |
| Status | Resolved |
| Location | `src/ScadaLink.Host/Program.cs:154-165` |
**Resolution (2026-05-28):** Added a `Func<CancellationToken, Task>` overload of `StartupRetry.ExecuteWithRetryAsync` that forwards the retry-loop token into the operation, and the migration call site in `Program.cs` now passes `app.Lifetime.ApplicationStopping` as both the operation token (threaded to `MigrationHelper.ApplyOrValidateMigrationsAsync`) and the loop's `cancellationToken` (already honoured by the inter-attempt `Task.Delay`). A SIGTERM during the bounded retry window now tears down cleanly instead of waiting up to ~2 minutes for the loop to exhaust. The original `Func<Task>` overload still exists and delegates, so existing callers/tests are unchanged.
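
A simplified sketch of a retry loop that forwards its token into the operation. The signature here (`maxAttempts`, `delay`) is an assumption for illustration and is not the real `StartupRetry.ExecuteWithRetryAsync` signature.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

async Task ExecuteWithRetryAsync(
    Func<CancellationToken, Task> operation,
    int maxAttempts, TimeSpan delay, CancellationToken cancellationToken)
{
    for (var attempt = 1; ; attempt++)
    {
        cancellationToken.ThrowIfCancellationRequested();
        try
        {
            // The same token reaches the operation, so a stop request can
            // interrupt an in-flight attempt as well as the wait below.
            await operation(cancellationToken);
            return;
        }
        catch (Exception ex) when (ex is not OperationCanceledException && attempt < maxAttempts)
        {
            // Cancellation during the inter-attempt delay ends the loop early.
            await Task.Delay(delay, cancellationToken);
        }
    }
}
```

Passing `app.Lifetime.ApplicationStopping` as `cancellationToken` gives the behaviour the resolution describes: a stop request surfaces as an `OperationCanceledException` instead of waiting for the remaining attempts.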
**Description**
`StartupRetry.ExecuteWithRetryAsync` accepts an optional
@@ -1092,9 +1094,11 @@ _Open._
|--|--|
| Severity | Low |
| Category | Error handling & resilience |
| Status | Open |
| Status | Resolved |
| Location | `src/ScadaLink.Host/LoggerConfigurationFactory.cs:50-55` |
**Resolution (2026-05-28):** `ParseLevel` now writes a one-shot warning to `Console.Error` (the logger isn't built yet at this point) when a non-null/non-blank `MinimumLevel` value fails to parse, naming the offending value and the `Information` fallback. Null/blank values continue to default silently (treated as "unset"). The helper gained a test-visible `TextWriter` overload so unit tests can capture the warning; the production path delegates to it with `Console.Error`. Tests `ParseLevel_UnrecognisedValue_FallsBackAndWarns`, `ParseLevel_NullOrBlank_FallsBackSilently`, and `ParseLevel_RecognisedValue_NoWarning` pin the behaviour.
**Description**
`LoggerConfigurationFactory.ParseLevel` uses
+14 -2
@@ -8,7 +8,7 @@
| Last reviewed | 2026-05-28 |
| Reviewer | claude-agent |
| Commit reviewed | `1eb6e97` |
| Open findings | 7 |
| Open findings | 3 |
## Summary
@@ -911,9 +911,21 @@ the InboundAPI-016 deadline-token inheritance behaviour. All 15 pass.
|--|--|
| Severity | Medium |
| Category | Error handling & resilience |
| Status | Open |
| Status | Resolved |
| Location | `src/ScadaLink.InboundAPI/Middleware/AuditWriteMiddleware.cs:257` |
**Resolution (2026-05-28):** Kept the fire-and-forget (audit emission must
never block or alter the user-facing response per alog.md §13) but added
`ObserveAuditWriteFault`, a small helper that attaches an
`OnlyOnFaulted` `ContinueWith` to the writer task — an asynchronously-faulted
audit write now logs a `Warning` (with the captured exception, method, path,
and HTTP status) instead of vanishing into `TaskScheduler.UnobservedTaskException`.
The continuation runs off-thread on `TaskScheduler.Default` so the response
hot path is unchanged. Regression test
`AuditWriter_AsyncFault_IsObserved_AsWarning_AndDoesNotAlterResponse` uses an
async-yielding throwing writer to prove the post-async fault is logged and the
response stays 200.
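
A hedged sketch of the fault-observing continuation. `ObserveFault` and `logWarning` are hypothetical stand-ins for `ObserveAuditWriteFault` and the middleware's logger call; the `ContinueWith` overload and `TaskContinuationOptions.OnlyOnFaulted` are the real Task APIs.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Attaches a continuation that runs only if the write task faults.
// Reading t.Exception marks the fault as observed, so it never reaches
// TaskScheduler.UnobservedTaskException.
Task ObserveFault(Task writeTask, Action<Exception> logWarning) =>
    writeTask.ContinueWith(
        t =>
        {
            var fault = t.Exception;
            if (fault != null) logWarning(fault.GetBaseException());
        },
        CancellationToken.None,
        TaskContinuationOptions.OnlyOnFaulted,
        TaskScheduler.Default);
```

The caller can still discard the returned task (`_ = ObserveFault(...)`), which keeps the response path fire-and-forget. If the write succeeds, the continuation task ends up cancelled and should not be awaited.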
**Description**
`EmitInboundAudit` calls `_ = _auditWriter.WriteAsync(evt);` — the returned `Task` is
+16 -34
@@ -41,35 +41,35 @@ module file and counted in **Total**.
|----------|---------------|
| Critical | 0 |
| High | 0 |
| Medium | 32 |
| Low | 61 |
| **Total** | **93** |
| Medium | 25 |
| Low | 50 |
| **Total** | **75** |
## Module Status
| Module | Last reviewed | Commit | Open (C/H/M/L) | Open | Total |
|--------|---------------|--------|----------------|------|-------|
| [AuditLog](AuditLog/findings.md) | 2026-05-28 | `1eb6e97` | 0/0/3/6 | 9 | 11 |
| [CLI](CLI/findings.md) | 2026-05-28 | `1eb6e97` | 0/0/2/3 | 5 | 23 |
| [AuditLog](AuditLog/findings.md) | 2026-05-28 | `1eb6e97` | 0/0/2/4 | 6 | 11 |
| [CLI](CLI/findings.md) | 2026-05-28 | `1eb6e97` | 0/0/1/2 | 3 | 23 |
| [CentralUI](CentralUI/findings.md) | 2026-05-28 | `1eb6e97` | 0/0/0/5 | 5 | 33 |
| [ClusterInfrastructure](ClusterInfrastructure/findings.md) | 2026-05-28 | `1eb6e97` | 0/0/0/3 | 3 | 14 |
| [Commons](Commons/findings.md) | 2026-05-28 | `1eb6e97` | 0/0/0/5 | 5 | 23 |
| [Communication](Communication/findings.md) | 2026-05-28 | `1eb6e97` | 0/0/1/4 | 5 | 22 |
| [ConfigurationDatabase](ConfigurationDatabase/findings.md) | 2026-05-28 | `1eb6e97` | 0/0/3/2 | 5 | 24 |
| [Communication](Communication/findings.md) | 2026-05-28 | `1eb6e97` | 0/0/1/1 | 2 | 22 |
| [ConfigurationDatabase](ConfigurationDatabase/findings.md) | 2026-05-28 | `1eb6e97` | 0/0/2/2 | 4 | 24 |
| [DataConnectionLayer](DataConnectionLayer/findings.md) | 2026-05-28 | `1eb6e97` | 0/0/0/0 | 0 | 22 |
| [DeploymentManager](DeploymentManager/findings.md) | 2026-05-28 | `1eb6e97` | 0/0/1/4 | 5 | 24 |
| [ExternalSystemGateway](ExternalSystemGateway/findings.md) | 2026-05-28 | `1eb6e97` | 0/0/2/1 | 3 | 23 |
| [HealthMonitoring](HealthMonitoring/findings.md) | 2026-05-28 | `1eb6e97` | 0/0/1/3 | 4 | 23 |
| [Host](Host/findings.md) | 2026-05-28 | `1eb6e97` | 0/0/1/5 | 6 | 22 |
| [InboundAPI](InboundAPI/findings.md) | 2026-05-28 | `1eb6e97` | 0/0/2/2 | 4 | 25 |
| [DeploymentManager](DeploymentManager/findings.md) | 2026-05-28 | `1eb6e97` | 0/0/0/4 | 4 | 24 |
| [ExternalSystemGateway](ExternalSystemGateway/findings.md) | 2026-05-28 | `1eb6e97` | 0/0/1/1 | 2 | 23 |
| [HealthMonitoring](HealthMonitoring/findings.md) | 2026-05-28 | `1eb6e97` | 0/0/0/2 | 2 | 23 |
| [Host](Host/findings.md) | 2026-05-28 | `1eb6e97` | 0/0/1/3 | 4 | 22 |
| [InboundAPI](InboundAPI/findings.md) | 2026-05-28 | `1eb6e97` | 0/0/1/2 | 3 | 25 |
| [ManagementService](ManagementService/findings.md) | 2026-05-28 | `1eb6e97` | 0/0/2/1 | 3 | 23 |
| [NotificationOutbox](NotificationOutbox/findings.md) | 2026-05-28 | `1eb6e97` | 0/0/1/2 | 3 | 10 |
| [NotificationService](NotificationService/findings.md) | 2026-05-28 | `1eb6e97` | 0/0/2/2 | 4 | 25 |
| [Security](Security/findings.md) | 2026-05-28 | `1eb6e97` | 0/0/0/2 | 2 | 21 |
| [Security](Security/findings.md) | 2026-05-28 | `1eb6e97` | 0/0/0/1 | 1 | 21 |
| [SiteCallAudit](SiteCallAudit/findings.md) | 2026-05-28 | `1eb6e97` | 0/0/2/2 | 4 | 6 |
| [SiteEventLogging](SiteEventLogging/findings.md) | 2026-05-28 | `1eb6e97` | 0/0/0/3 | 3 | 23 |
| [SiteRuntime](SiteRuntime/findings.md) | 2026-05-28 | `1eb6e97` | 0/0/2/0 | 2 | 26 |
| [StoreAndForward](StoreAndForward/findings.md) | 2026-05-28 | `1eb6e97` | 0/0/3/3 | 6 | 24 |
| [StoreAndForward](StoreAndForward/findings.md) | 2026-05-28 | `1eb6e97` | 0/0/3/2 | 5 | 24 |
| [TemplateEngine](TemplateEngine/findings.md) | 2026-05-28 | `1eb6e97` | 0/0/3/0 | 3 | 22 |
| [Transport](Transport/findings.md) | 2026-05-28 | `1eb6e97` | 0/0/1/3 | 4 | 12 |
@@ -88,25 +88,18 @@ _None open._
_None open._
### Medium (32)
### Medium (25)
| ID | Module | Title |
|----|--------|-------|
| AuditLog-001 | [AuditLog](AuditLog/findings.md) | Combined-telemetry transport is plumbed end-to-end but never invoked in production |
| AuditLog-004 | [AuditLog](AuditLog/findings.md) | `SiteAuditReconciliationActor` advances cursor even on per-row insert failure, silently abandoning permanently-failing rows |
| AuditLog-005 | [AuditLog](AuditLog/findings.md) | `GetBacklogStatsAsync` holds the SQLite hot-path write lock for the full COUNT+MIN scan |
| CLI-017 | [CLI](CLI/findings.md) | `BundleCommands.RunBundleCommandAsync` duplicates `ExecuteCommandAsync` and breaks the auth exit-code contract |
| CLI-019 | [CLI](CLI/findings.md) | `bundle export` decodes the entire base64 bundle into memory before writing |
| Communication-017 | [Communication](Communication/findings.md) | `_inProgressDeployments` grows unboundedly — successful deployments are never cleaned up |
| ConfigurationDatabase-016 | [ConfigurationDatabase](ConfigurationDatabase/findings.md) | `InboundApiRepository.GetApiKeyByValueAsync` hashes the candidate with the unpeppered `ApiKeyHasher.Default` |
| ConfigurationDatabase-017 | [ConfigurationDatabase](ConfigurationDatabase/findings.md) | Stub-attach delete on `DeploymentRecord` bypasses optimistic concurrency |
| ConfigurationDatabase-019 | [ConfigurationDatabase](ConfigurationDatabase/findings.md) | `EnsureLookaheadAsync` swallows non-idempotent SPLIT failures and continues, creating partition holes |
| DeploymentManager-019 | [DeploymentManager](DeploymentManager/findings.md) | Lifecycle command timeout writes no audit entry |
| ExternalSystemGateway-019 | [ExternalSystemGateway](ExternalSystemGateway/findings.md) | `HttpClient.Timeout` is not set; `DefaultHttpTimeout` > 100s is silently clipped by the framework default |
| ExternalSystemGateway-020 | [ExternalSystemGateway](ExternalSystemGateway/findings.md) | `JsonElementToParameterValue` silently downcasts non-Int64 JSON numbers to `double`, losing precision for `decimal` SQL parameters on retry |
| HealthMonitoring-017 | [HealthMonitoring](HealthMonitoring/findings.md) | `HealthReportSender` resets interval counters before `Send`; transport failures silently drop the interval's error counts |
| Host-016 | [Host](Host/findings.md) | Site `CentralContactPoints` second entry targets the site's own remoting port |
| InboundAPI-018 | [InboundAPI](InboundAPI/findings.md) | `AuditWriteMiddleware` fires `WriteAsync` as `_ = task` — faulted async writes are unobserved |
| InboundAPI-025 | [InboundAPI](InboundAPI/findings.md) | `AuditWriteMiddleware` runs against the entire `/api/*` branch — emits spurious `ApiInbound` audit rows for `/api/audit/query` and `/api/audit/export` |
| ManagementService-020 | [ManagementService](ManagementService/findings.md) | UpdateSmtpConfig returns and audits the SMTP Credentials field verbatim |
| ManagementService-021 | [ManagementService](ManagementService/findings.md) | Transport bundle handlers have zero test coverage |
@@ -125,18 +118,15 @@ _None open._
| TemplateEngine-020 | [TemplateEngine](TemplateEngine/findings.md) | `Create*` audit entries are written with `EntityId = "0"` before `SaveChangesAsync` populates the real key |
| Transport-010 | [Transport](Transport/findings.md) | Critical Overwrite + cross-cutting paths uncovered by tests |
### Low (61)
### Low (50)
| ID | Module | Title |
|----|--------|-------|
| AuditLog-003 | [AuditLog](AuditLog/findings.md) | `AuditLogIngestActor.OnIngestAsync` uses `CreateScope`, but `OnCachedTelemetryAsync` uses `CreateAsyncScope` — and only one disposes asynchronously |
| AuditLog-006 | [AuditLog](AuditLog/findings.md) | `SqliteAuditWriter.Dispose()` does sync-over-async and may deadlock |
| AuditLog-007 | [AuditLog](AuditLog/findings.md) | `INodeIdentityProvider` resolution mixes `GetService` and `GetRequiredService` inconsistently across `AddAuditLog` registrations |
| AuditLog-008 | [AuditLog](AuditLog/findings.md) | Test composition roots that omit `IAuditPayloadFilter` silently pass UNREDACTED payloads through the writer chain |
| AuditLog-010 | [AuditLog](AuditLog/findings.md) | Actor drain paths accept a `CancellationToken` parameter but always pass `CancellationToken.None` downstream |
| AuditLog-011 | [AuditLog](AuditLog/findings.md) | `AddAuditLogHealthMetricsBridge` and `AddAuditLogCentralMaintenance` are non-idempotent and register hosted services on every call |
| CLI-020 | [CLI](CLI/findings.md) | `bundle export` success-envelope parse is unguarded |
| CLI-021 | [CLI](CLI/findings.md) | `CliConfig.Load` crashes the CLI on a malformed config file |
| CLI-022 | [CLI](CLI/findings.md) | `CommandTreeTests` excludes the two new command groups |
| CentralUI-029 | [CentralUI](CentralUI/findings.md) | `ConfigurationAuditLog` uses `JS.InvokeAsync<int>("eval", ...)` instead of a dedicated JS module |
| CentralUI-030 | [CentralUI](CentralUI/findings.md) | `SandboxConsoleCapture`'s per-call `StringWriter` is not thread-safe under intra-script concurrency |
@@ -151,10 +141,7 @@ _None open._
| Commons-020 | [Commons](Commons/findings.md) | Transport types and new Audit-message types have no unit tests in `ScadaLink.Commons.Tests` |
| Commons-021 | [Commons](Commons/findings.md) | `ExternalCallResult.Response` has a benign lazy-parse race |
| Commons-023 | [Commons](Commons/findings.md) | Trailing-optional `SourceNode` on positional records mixes additive evolution patterns |
| Communication-019 | [Communication](Communication/findings.md) | `LoadSiteAddressesFromDb` does not pass a `CancellationToken` to the repository |
| Communication-020 | [Communication](Communication/findings.md) | `SiteAddressCacheLoaded` carries mutable `Dictionary`/`List` types |
| Communication-021 | [Communication](Communication/findings.md) | `SiteStreamGrpcServer.SubscribeInstance` leaks the `StreamRelayActor` if `Subscribe` throws pre-try |
| Communication-022 | [Communication](Communication/findings.md) | `_debugSubscriptions` keyed by caller-supplied correlation ID; reuse silently orphans the prior subscriber |
| ConfigurationDatabase-021 | [ConfigurationDatabase](ConfigurationDatabase/findings.md) | `SwitchOutPartitionAsync` interpolates `monthBoundary` / staging table name into raw SQL |
| ConfigurationDatabase-024 | [ConfigurationDatabase](ConfigurationDatabase/findings.md) | Missing test coverage for SPLIT-RANGE failure-continuation and production-shape rowversion delete |
| DeploymentManager-021 | [DeploymentManager](DeploymentManager/findings.md) | `ResolveSiteIdentifierAsync` silently substitutes the DB id when the site row is missing |
@@ -162,14 +149,11 @@ _None open._
| DeploymentManager-023 | [DeploymentManager](DeploymentManager/findings.md) | `BuildDeployArtifactsCommandAsync` re-queries system-wide artifacts once per site |
| DeploymentManager-024 | [DeploymentManager](DeploymentManager/findings.md) | Test probe actors hold mutable static state across tests |
| ExternalSystemGateway-021 | [ExternalSystemGateway](ExternalSystemGateway/findings.md) | `ApplyAuth` silently sends an unauthenticated request on unknown `AuthType`, empty `AuthConfiguration`, or malformed Basic config |
| HealthMonitoring-018 | [HealthMonitoring](HealthMonitoring/findings.md) | Same counter-reset-before-publish hazard in `CentralHealthReportLoop` |
| HealthMonitoring-021 | [HealthMonitoring](HealthMonitoring/findings.md) | `CentralSiteId = "central"` reserved constant silently collides with a real site named "central" |
| HealthMonitoring-022 | [HealthMonitoring](HealthMonitoring/findings.md) | `CentralHealthReportLoopTests` uses real-time `PeriodicTimer` + `Task.Delay`; flake-prone on slow CI |
| Host-018 | [Host](Host/findings.md) | Shipped per-role configs omit `NodeOptions.NodeName`, leaving `SourceNode` null |
| Host-019 | [Host](Host/findings.md) | Migration `StartupRetry` call drops the host `CancellationToken` |
| Host-020 | [Host](Host/findings.md) | `MinimumLevel.Is` silently overrides any operator-set `Serilog:MinimumLevel` |
| Host-021 | [Host](Host/findings.md) | Microsoft `Logging:LogLevel` section in `appsettings.json` is dead config under Serilog |
| Host-022 | [Host](Host/findings.md) | `ParseLevel` silently coerces unrecognised `MinimumLevel` to `Information` |
| InboundAPI-019 | [InboundAPI](InboundAPI/findings.md) | `EnableBuffering()` called unconditionally on every request, including bodyless requests |
| InboundAPI-023 | [InboundAPI](InboundAPI/findings.md) | `EndpointExtensions.HandleInboundApiRequest` composition wiring has no test coverage |
| ManagementService-023 | [ManagementService](ManagementService/findings.md) | HandleQueryDeployments unfiltered branch is N+1 on instance lookup |
@@ -177,7 +161,6 @@ _None open._
| NotificationOutbox-008 | [NotificationOutbox](NotificationOutbox/findings.md) | `FallbackMaxRetries` / `FallbackRetryDelay` path is unreachable in production AND untested |
| NotificationService-022 | [NotificationService](NotificationService/findings.md) | `MailKitSmtpClientWrapper` holds a long-lived `SmtpClient`; combined with per-send factory, the design comment about pooling is contradicted |
| NotificationService-025 | [NotificationService](NotificationService/findings.md) | `CredentialRedactor` over-masks: any 4-character credential component is masked anywhere it appears, including unrelated log text |
| Security-020 | [Security](Security/findings.md) | `SecurityOptions` has no startup validation for required fields (`LdapServer`, `LdapSearchBase`) |
| Security-021 | [Security](Security/findings.md) | `RequireHttpsCookie=false` dev opt-out has no warning path — an HTTP production deployment silently transmits the JWT bearer credential in cleartext |
| SiteCallAudit-002 | [SiteCallAudit](SiteCallAudit/findings.md) | Singleton failover does not wait for in-flight async upserts |
| SiteCallAudit-006 | [SiteCallAudit](SiteCallAudit/findings.md) | Stuck-only paging test does not exercise the multi-page boundary with an interleaved non-stuck row at the cursor |
@@ -186,7 +169,6 @@ _None open._
| SiteEventLogging-023 | [SiteEventLogging](SiteEventLogging/findings.md) | Concurrent-stress test uses a non-volatile `stop` flag |
| StoreAndForward-022 | [StoreAndForward](StoreAndForward/findings.md) | `NotifyCachedCallObserverAsync` silently drops the entire audit lifecycle when the message id is not a parseable `TrackedOperationId` |
| StoreAndForward-023 | [StoreAndForward](StoreAndForward/findings.md) | `siteId` silently defaults to empty when no `IStoreAndForwardSiteContext` is registered, degrading audit telemetry correlation |
| StoreAndForward-024 | [StoreAndForward](StoreAndForward/findings.md) | `StopAsync` does not wait for an in-flight retry sweep, so disposed dependencies can be touched after shutdown |
| Transport-008 | [Transport](Transport/findings.md) | `PreviewAsync` issues an N+1 `GetTemplateWithChildrenAsync` per matching template name |
| Transport-009 | [Transport](Transport/findings.md) | `IAuditCorrelationContext.BundleImportId` is mutated on the same scoped instance the AuditService reads |
| Transport-012 | [Transport](Transport/findings.md) | "Bundle Import" filter promised in design doc not surfaced in Configuration Audit Log Viewer UI |
@@ -8,7 +8,7 @@
| Last reviewed | 2026-05-28 |
| Reviewer | claude-agent |
| Commit reviewed | `1eb6e97` |
| Open findings | 2 (Security-020, Security-021); 1 deferred (Security-008) |
| Open findings | 1 (Security-021); 1 deferred (Security-008) |
## Summary
@@ -925,9 +925,23 @@ is the closest meaningful unit-level coverage.
|--|--|
| Severity | Low |
| Category | Error handling & resilience |
| Status | Open |
| Status | Resolved |
| Location | `src/ScadaLink.Security/SecurityOptions.cs:6-7`, `:36-37`; `src/ScadaLink.Security/ServiceCollectionExtensions.cs:13-30` |
**Resolution (2026-05-28):** added `SecurityOptionsValidator`
(`IValidateOptions<SecurityOptions>`) that rejects empty/whitespace
`LdapServer` and `LdapSearchBase` with messages naming the full
`Security:Field` key the operator would edit. `AddSecurity` registers it via
`services.AddOptions<SecurityOptions>().ValidateOnStart()` +
`TryAddEnumerable(... SecurityOptionsValidator)` so a misconfigured
`Security` section fails fast at boot rather than minutes later on the first
login. `JwtSigningKey` is deliberately left to `JwtTokenService`'s existing
length-aware constructor guard (Security-003). Regression tests in
`SecurityOptionsValidatorTests`: valid-options succeed; empty/whitespace
`LdapServer` and `LdapSearchBase` each fail with the key-naming message
(theory); both-empty reports both keys; `AddSecurity_RegistersSecurityOptionsValidator`
pins the DI wiring.
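
A minimal sketch of the validator shape described in the resolution, not the shipped code. It assumes a reduced `SecurityOptions` with only the two LDAP fields; the message wording and the `AddSecurityValidationSketch` helper name are hypothetical. `ValidateOnStart` is assumed to be available (.NET 6 or later).

```csharp
using System.Collections.Generic;
using Microsoft.Extensions.DependencyInjection;
using Microsoft.Extensions.DependencyInjection.Extensions;
using Microsoft.Extensions.Options;

// Assumption: reduced options type; the real SecurityOptions has more members.
public sealed class SecurityOptions
{
    public string LdapServer { get; set; } = "";
    public string LdapSearchBase { get; set; } = "";
}

public sealed class SecurityOptionsValidator : IValidateOptions<SecurityOptions>
{
    public ValidateOptionsResult Validate(string? name, SecurityOptions options)
    {
        // Collect every failure so a both-empty section reports both keys at once.
        var failures = new List<string>();

        if (string.IsNullOrWhiteSpace(options.LdapServer))
            failures.Add("Security:LdapServer is required and must not be empty.");

        if (string.IsNullOrWhiteSpace(options.LdapSearchBase))
            failures.Add("Security:LdapSearchBase is required and must not be empty.");

        return failures.Count > 0
            ? ValidateOptionsResult.Fail(failures)
            : ValidateOptionsResult.Success;
    }
}

public static class SecurityValidationSketch
{
    // Hypothetical helper name; the resolution says AddSecurity does this wiring.
    public static IServiceCollection AddSecurityValidationSketch(this IServiceCollection services)
    {
        // ValidateOnStart runs registered validators when the host starts,
        // instead of on first access to IOptions<SecurityOptions>.Value.
        services.AddOptions<SecurityOptions>().ValidateOnStart();

        // TryAddEnumerable keeps repeated calls from registering the validator twice.
        services.TryAddEnumerable(
            ServiceDescriptor.Singleton<IValidateOptions<SecurityOptions>, SecurityOptionsValidator>());

        return services;
    }
}
```

Returning all failures in one result, rather than stopping at the first, is what would let a both-empty configuration name both keys in a single startup error.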
**Description**
`SecurityOptions.JwtSigningKey` correctly fails fast at `JwtTokenService` construction
@@ -8,7 +8,7 @@
| Last reviewed | 2026-05-28 |
| Reviewer | claude-agent |
| Commit reviewed | `1eb6e97` |
| Open findings | 7 (3 Deferred: 002, 011, 012; 7 new Open: 018–024, see Re-review 2026-05-28) |
| Open findings | 5 (3 Deferred: 002, 011, 012; 5 new Open from Re-review 2026-05-28) |
## Summary
@@ -1393,9 +1393,22 @@ _Unresolved._
|--|--|
| Severity | Low |
| Category | Concurrency & thread safety |
| Status | Open |
| Status | Resolved |
| Location | `src/ScadaLink.StoreAndForward/StoreAndForwardService.cs:122`–`:127`, `:136`–`:143`, `:303`–`:329` |
**Resolution (2026-05-28):** the timer callback now captures the sweep task
into a `_sweepTask` field via `Volatile.Write`, and `StopAsync` disposes the
timer first (so no new sweep starts) then `await`s the captured task with a
bounded `SweepShutdownWaitTimeout` (10 s) via `Task.WaitAsync` — so a sweep
in-flight when shutdown begins is given a chance to finish before the host
disposes `_storage`/`_replication`. A genuinely hung sweep cannot block
shutdown indefinitely (the timeout fires, the wait is abandoned, the
warning is logged). Regression test
`StopAsync_AwaitsInFlightRetrySweep_BeforeReturning` parks a sweep inside a
blocking handler, asserts `StopAsync`'s returned task is not completed while
the sweep is paused, then releases the handler and asserts the sweep ran to
completion before `StopAsync` returned.
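
A minimal sketch of the capture-then-bounded-wait pattern described above, not the actual `StoreAndForwardService`. The `SweepingServiceSketch` class, its constructor, and the console warning are hypothetical stand-ins; using `Timer.DisposeAsync` (which waits for a running callback) is this sketch's choice, since the resolution only says the timer is disposed first. `Task.WaitAsync(TimeSpan)` requires .NET 6 or later.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public sealed class SweepingServiceSketch
{
    // The 10 s bound comes from the resolution text.
    private static readonly TimeSpan SweepShutdownWaitTimeout = TimeSpan.FromSeconds(10);

    private readonly Func<Task> _sweep; // stand-in for RetryPendingMessagesAsync
    private Timer? _retryTimer;
    private Task? _sweepTask;

    public SweepingServiceSketch(Func<Task> sweep) => _sweep = sweep;

    public void Start(TimeSpan interval)
    {
        // Capture the sweep task instead of discarding it with `_ = ...`,
        // so shutdown has something to await.
        _retryTimer = new Timer(
            _ => Volatile.Write(ref _sweepTask, _sweep()),
            null, interval, interval);
    }

    public async Task StopAsync()
    {
        // Stop the timer first so no new sweep can start during shutdown.
        if (_retryTimer is not null)
            await _retryTimer.DisposeAsync();

        var sweep = Volatile.Read(ref _sweepTask);
        if (sweep is null)
            return;

        try
        {
            // Bounded wait: a hung sweep must not block shutdown indefinitely.
            await sweep.WaitAsync(SweepShutdownWaitTimeout);
        }
        catch (TimeoutException)
        {
            // The wait is abandoned; the real service logs a warning here.
            Console.Error.WriteLine("Retry sweep did not finish before shutdown timeout.");
        }
    }
}
```

The sketch does not handle a sweep that faults (the exception would propagate out of `StopAsync`) or overlapping sweeps; how the real service treats those cases is not stated here.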
**Description**
`StartAsync` arms `_retryTimer` with `_ => _ = RetryPendingMessagesAsync()` (line 123).