fix(store-and-forward): resolve StoreAndForward-006,007,008,009 — transactional parked reads, PipeTo, fault-isolated activity events; 002/011/012 deferred

2026-05-16 22:32:30 -04:00
parent dd7626da63
commit 9e2416b34c
6 changed files with 255 additions and 52 deletions
--- a/code-reviews/StoreAndForward/findings.md
+++ b/code-reviews/StoreAndForward/findings.md
@@ -8,7 +8,7 @@
 | Last reviewed | 2026-05-16 |
 | Reviewer | claude-agent |
 | Commit reviewed | `9c60592` |
-| Open findings | 7 |
+| Open findings | 0 (3 Deferred: 002, 011, 012 — see notes) |

 ## Summary

@@ -97,7 +97,7 @@ commit whose message references `StoreAndForward-001`.
 | Severity | Low |
 | Original severity | High (re-triaged down to Low on 2026-05-16 — see Re-triage note) |
 | Category | Error handling & resilience |
-| Status | Open |
+| Status | Deferred |
 | Location | `src/ScadaLink.StoreAndForward/StoreAndForwardService.cs:162`, `:201` |

 **Description**
@@ -152,9 +152,17 @@ should be made deliberately rather than forced here.

 **Resolution**

-_Open — re-triaged to Low. Premise (no handler registration anywhere) is stale: Host
-now wires all three handlers. Residual gap is minor and the prescribed fix is a
-cross-module contract change needing a design decision._
+_Deferred 2026-05-16 (re-triaged High → Low). Verified again against the source in this
+pass: the finding's premise (no `RegisterDeliveryHandler` caller anywhere) is stale —
+`ScadaLink.Host` wires all three handlers at site startup — so the High-severity
+"engine cannot deliver anything" outcome no longer occurs. The residual gap (a message
+enqueued for a category that genuinely has no handler is buffered then skipped forever)
+is real but minor. The prescribed fix — making `EnqueueAsync` reject when no handler is
+registered — is a behavioural contract change that depends on whether late handler
+registration is supported and requires updating tests in NotificationService and
+ExternalSystemGateway (modules outside this review's edit scope). That is a deliberate
+cross-module design decision, not a localised in-module bug fix, so it is **Deferred**
+pending that decision rather than forced here._

 ### StoreAndForward-003 — Off-by-one in retry accounting: immediate failure pre-counts as retry 1

@@ -319,7 +327,7 @@ other writer's `RetryCount`).
 |--|--|
 | Severity | Low |
 | Category | Performance & resource management |
-| Status | Open |
+| Status | Resolved |
 | Location | `src/ScadaLink.StoreAndForward/StoreAndForwardStorage.cs:166`, `:175` |

 **Description**
@@ -339,7 +347,22 @@ removes the inconsistency.

 **Resolution**

-_Unresolved._
+Resolved 2026-05-16 (commit pending). Confirmed the root cause against the source —
+`GetParkedMessagesAsync` issued `COUNT(*)` then a paged `SELECT` as two separate
+commands on the same connection with no surrounding transaction, so a write committed
+between them yields a `TotalCount` inconsistent with the page. Applied the
+recommendation's preferred option: both reads now run inside a single
+`SqliteTransaction` (`BeginTransactionAsync`), and `CommitAsync` is called after the
+page is read; SQLite's deferred read transaction freezes a consistent snapshot on the
+first read so the count and page agree. Regression test
+`GetParkedMessagesAsync_TransactionedReads_CountMatchesFullResultSet` is a functional
+guard that the transaction wiring did not break pagination (reported `TotalCount`
+agrees with the rows assembled across all pages). Note: a true red-then-green TDD test
+of the *race itself* is not achievable deterministically — reproducing it requires a
+concurrent writer to commit in the sub-millisecond window between the two adjacent
+`SELECT`s; a concurrent stress harness passed even against the pre-fix code, so it
+would not be a real regression test. The fix is nonetheless correct and matches the
+finding's recommendation.

 ### StoreAndForward-007 — Async work in `ParkedMessageHandlerActor` uses `ContinueWith` without scheduler/affinity guarantees

@@ -347,7 +370,7 @@ _Unresolved._
 |--|--|
 | Severity | Low |
 | Category | Akka.NET conventions |
-| Status | Open |
+| Status | Resolved |
 | Location | `src/ScadaLink.StoreAndForward/ParkedMessageHandlerActor.cs:34`, `:68`, `:87` |

 **Description**
@@ -370,7 +393,21 @@ off the actor thread safely, and makes the success/failure branches explicit.

 **Resolution**

-_Unresolved._
+Resolved 2026-05-16 (commit pending). Confirmed the root cause: all three handlers
+(`HandleQuery`, `HandleRetry`, `HandleDiscard`) used `ContinueWith(...).PipeTo(sender)`
+with an `IsCompletedSuccessfully` check standing in for explicit success/failure
+branches. Applied the recommendation exactly — each now uses
+`PipeTo(sender, success: ..., failure: ...)`, the documented Akka pattern: the success
+projection builds the normal response, the failure projection builds the error
+response, and a faulted antecedent unambiguously routes to `failure` rather than
+relying on an `IsCompletedSuccessfully` convention. `Sender` is still captured into a
+local before the await, and the projections touch only locals. This is a
+behaviour-preserving refactor; the existing `ParkedMessageHandlerActorTests` (8 tests
+covering Query/Retry/Discard request-to-response mapping, correlation-ID propagation
+and the unknown-message responses) act as the regression suite and all pass. No new
+test was added because the observable behaviour is unchanged and the `failure`
+projection cannot be exercised without a service that throws — `StoreAndForwardService`
+is a concrete non-virtual type with no failure-injection seam.

 ### StoreAndForward-008 — A SQLite connection is opened and torn down on every storage call

@@ -378,7 +415,7 @@ _Unresolved._
 |--|--|
 | Severity | Low |
 | Category | Performance & resource management |
-| Status | Open |
+| Status | Resolved |
 | Location | `src/ScadaLink.StoreAndForward/StoreAndForwardStorage.cs:28`, `:61`, `:93`, `:117`, `:144`, `:162`, `:199`, `:221`, `:237`, `:267`, `:285`, `:305`, `:319` |

 **Description**
@@ -398,7 +435,18 @@ the design relies on the Sqlite connection pool for acceptable performance.

 **Resolution**

-_Unresolved._
+Resolved 2026-05-16 (commit pending). Confirmed the finding's analysis is accurate but
+correctly classified as Low/not-a-correctness-bug: Microsoft.Data.Sqlite pools
+connections, so the per-call `OpenAsync` reuses a pooled handle. Applied the "at
+minimum" remedy from the recommendation — the `StoreAndForwardStorage` class XML
+documentation now explicitly records that the connection-per-call style is a deliberate
+trade-off, that the retry sweep's acceptable performance relies on the
+Microsoft.Data.Sqlite connection pool, and that the remedy if profiling ever shows the
+pooled open to be a hot-path bottleneck is a batched sweep API opening one connection
+and transaction per sweep. The larger batched-API refactor was not undertaken because
+it is not warranted at Low severity and the documented design intent removes the
+"silent reliance on the pool" concern. Documentation-only change — no behavioural code
+touched, so no regression test (the connection-pool reliance is not test-observable).

 ### StoreAndForward-009 — `OnActivity` event invocation is not thread-safe against concurrent subscribe/unsubscribe

@@ -406,7 +454,7 @@ _Unresolved._
 |--|--|
 | Severity | Low |
 | Category | Concurrency & thread safety |
-| Status | Open |
+| Status | Resolved |
 | Location | `src/ScadaLink.StoreAndForward/StoreAndForwardService.cs:46`, `:309` |

 **Description**
@@ -430,7 +478,21 @@ notifications asynchronously.

 **Resolution**

-_Unresolved._
+Resolved 2026-05-16 (commit pending). Confirmed — and found the impact is worse than
+the finding states. `RaiseActivity` previously did `OnActivity?.Invoke(...)`; a
+throwing subscriber's exception escaped it. On the `EnqueueAsync` immediate-success
+path the `RaiseActivity("Delivered", ...)` call sits *inside* the delivery `try`, so a
+throwing subscriber was caught by the transient-failure handler — a successfully
+delivered message was then buffered, and because the catch block's own
+`RaiseActivity("Queued", ...)` also threw, the exception escaped `EnqueueAsync`
+entirely. `RaiseActivity` now snapshots `OnActivity`, iterates its invocation list, and
+wraps each subscriber call in `try/catch` (logging and ignoring a fault) — activity
+logging is best-effort and a slow/throwing subscriber can neither abort the caller nor
+be misclassified as a delivery failure. Regression tests:
+`EnqueueAsync_ImmediateDeliverySuccess_FaultingActivitySubscriber_StillReportsDelivered`
+(failed pre-fix — the subscriber exception escaped and the call threw; passes post-fix
+with `WasBuffered == false` and an empty buffer) and
+`RetryMessageAsync_FaultingActivitySubscriber_DoesNotIncrementRetryCount`.

 ### StoreAndForward-010 — Retry of a parked message does not reset `LastAttemptAt`, so its retry timing is unspecified

@@ -479,7 +541,7 @@ cleared, message excluded from the retry-due set) and passes post-fix.
 |--|--|
 | Severity | Low |
 | Category | Design-document adherence |
-| Status | Open |
+| Status | Deferred |
 | Location | `src/ScadaLink.Commons/Types/Enums/StoreAndForwardMessageStatus.cs:9`; `src/ScadaLink.StoreAndForward/StoreAndForwardService.cs:219`, `:235` |

 **Description**
@@ -500,7 +562,18 @@ sweep is actively delivering (which would also help with finding 005).

 **Resolution**

-_Unresolved._
+_Deferred 2026-05-16. Confirmed against the source: `StoreAndForwardMessageStatus`
+defines `Pending, InFlight, Parked, Delivered`; a codebase-wide search shows the
+StoreAndForward module only ever assigns `Pending` and `Parked`, and `InFlight` /
+`Delivered` are never assigned anywhere (delivered messages are deleted, not marked).
+The design doc's `retrying` state is unmodelled. Both options the recommendation offers
+— (a) drop the unused `InFlight`/`Delivered` members, or (b) add a `Retrying` member —
+require editing `StoreAndForwardMessageStatus.cs`, which lives in `src/ScadaLink.Commons`
+(outside this review's edit scope: only `src/ScadaLink.StoreAndForward/**` may be
+changed). The enum is also referenced by IntegrationTests and HealthMonitoring tests, so
+removing members is a cross-module change. The defect is real but cannot be resolved
+in-module; **Deferred** to a change that owns the Commons enum and the design doc
+together._

 ### StoreAndForward-012 — `StoreAndForwardMessage` is a persistence entity but lives in the component, not Commons

@@ -508,7 +581,7 @@ _Unresolved._
 |--|--|
 | Severity | Low |
 | Category | Code organization & conventions |
-| Status | Open |
+| Status | Deferred |
 | Location | `src/ScadaLink.StoreAndForward/StoreAndForwardMessage.cs:9` |

 **Description**
@@ -532,7 +605,19 @@ local persistence model. Document the decision.

 **Resolution**

-_Unresolved._
+_Deferred 2026-05-16. Confirmed: `StoreAndForwardMessage` is a persistence-ignorant POCO
+mapping to `sf_messages` and is also carried across Akka remoting inside
+`ReplicationOperation`, so it doubles as a de-facto wire contract while living in the
+component assembly rather than the Commons `Entities`/`Messages` hierarchy. The
+recommendation's primary remedy — moving `StoreAndForwardMessage` (and
+`ReplicationOperation`) into Commons — crosses module boundaries (it would add a type to
+`src/ScadaLink.Commons`, outside this review's edit scope of
+`src/ScadaLink.StoreAndForward/**`, and change every referencing module). The alternative
+"separate replication DTO" still leaves the persistence entity in the component, so it
+does not actually resolve the finding's core concern (entity placement / contract-
+evolution governance). Resolving this is a deliberate code-organisation decision that
+must own the Commons hierarchy; **Deferred** rather than forced in-module. Flagged for a
+cross-module follow-up._

 ### StoreAndForward-013 — Critical paths lack test coverage: retry-due timing, replication-from-active, and the actor bridge