fix(store-and-forward): resolve StoreAndForward-004,005,010,013 — accurate handler-contract doc, conditional sweep writes, reset LastAttemptAt on parked retry, test coverage

2026-05-16 21:44:10 -04:00
parent a88bec9376
commit 5672502d83
7 changed files with 447 additions and 18 deletions
--- a/code-reviews/StoreAndForward/findings.md
+++ b/code-reviews/StoreAndForward/findings.md
@@ -8,7 +8,7 @@
 | Last reviewed | 2026-05-16 |
 | Reviewer | claude-agent |
 | Commit reviewed | `9c60592` |
-| Open findings | 11 |
+| Open findings | 7 |

 ## Summary

@@ -212,7 +212,7 @@ corrected semantics.
 |--|--|
 | Severity | Medium |
 | Category | Documentation & comments |
-| Status | Open |
+| Status | Resolved |
 | Location | `src/ScadaLink.StoreAndForward/StoreAndForwardService.cs:38`, `:60` |

 **Description**
@@ -233,15 +233,25 @@ retry); exception = transient failure (buffer / increment retry).

 **Resolution**

-_Unresolved._
+Resolved 2026-05-16 (commit pending). Confirmed the root cause against the source: the
+old XML comment's "Permanent failures should return false (message will NOT be
+buffered)" describes only the `EnqueueAsync` immediate path; on the retry path a
+`false` return parks the already-buffered message. Reworded the `_deliveryHandlers`
+field doc into an explicit three-way contract (`true` = delivered/removed-or-never-
+buffered; `false` = permanent failure = not-buffered-on-immediate / parked-on-retry;
+throws = transient = buffered-for-retry / retry-count-incremented), noting it applies
+identically on both paths, and pointed `RegisterDeliveryHandler`'s summary at it.
+Documentation-only change — no behavioural code touched, so no regression test
+(an XML comment is not test-observable).

 ### StoreAndForward-005 — Parked-message retry/discard can race with the in-progress retry sweep

 | | |
 |--|--|
-| Severity | Medium |
+| Severity | Low — re-triaged down from Medium on 2026-05-16; the described data-loss race is not actually reachable, see Re-triage note |
+| Original severity | Medium |
 | Category | Concurrency & thread safety |
-| Status | Open |
+| Status | Resolved |
 | Location | `src/ScadaLink.StoreAndForward/StoreAndForwardService.cs:184`, `:266`, `:280` |

 **Description**
@@ -266,9 +276,42 @@ so the actor mailbox serialises them, or make status transitions conditional in
 (e.g. `UPDATE ... WHERE id = @id AND status = @expected`) and re-check the affected
 row count.

+**Re-triage note (2026-05-16)**
+
+Verified against the source: the *specific* race the finding describes — *"an operator
+`RetryParkedMessageAsync` resets a row to Pending while the sweep simultaneously parks
+the same in-flight message"* — **cannot occur** at the reviewed code. The two operator
+operations (`StoreAndForwardStorage.RetryParkedMessageAsync` /
+`DiscardParkedMessageAsync`) are already SQL-conditional on `status = Parked`, and the
+retry sweep only ever loads and processes rows that are `Pending`. For the operator and
+the sweep to touch the *same* message the row would have to be simultaneously `Parked`
+(operator-actionable) and in the sweep's in-flight `Pending` snapshot — mutually
+exclusive states. Overlapping sweeps are additionally prevented by the `Interlocked`
+guard, so there is only ever one sweep writer. The described data-loss outcome is
+therefore not reachable, which makes the original Medium ("risky behaviour") severity
+an over-statement — **re-triaged to Low**.
+
+What *is* real is a latent fragility: the sweep's own state-changing writes used the
+unconditional `UpdateMessageAsync` (`WHERE id = @id`), so the no-clobber guarantee
+rested entirely on the `Interlocked` invariant with zero defence-in-depth if that
+invariant were ever broken (e.g. `RetryMessageAsync` called outside the sweep, or the
+guard removed). That residual Low-severity weakness is worth closing, so the
+recommendation's conditional-SQL option was applied.
+
 **Resolution**

-_Unresolved._
+Resolved 2026-05-16 (commit pending). Re-triaged Medium → Low (the described race is
+unreachable; see Re-triage note) and the residual latent fragility fixed: added
+`StoreAndForwardStorage.UpdateMessageIfStatusAsync`, a conditional update
+(`UPDATE ... WHERE id = @id AND status = @expectedStatus`) returning whether a row was
+written. `RetryMessageAsync`'s three state-changing writes (park-on-permanent-failure,
+park-on-max-retries, retry-count increment) now use it with `expectedStatus = Pending`,
+so the sweep can never overwrite a row whose status changed underneath it; a skipped
+write is logged and the message left for the next sweep. Regression test
+`RetryMessageAsync_StatusChangedDuringDelivery_SweepParkWriteIsSkipped` drives a writer
+that moves the row out of `Pending` mid-delivery and asserts the sweep's stale park
+write is skipped (it failed against the pre-fix unconditional update, clobbering the
+other writer's `RetryCount`).

 ### StoreAndForward-006 — `GetParkedMessagesAsync` count and page run without a transaction

@@ -395,7 +438,7 @@ _Unresolved._
 |--|--|
 | Severity | Medium |
 | Category | Correctness & logic bugs |
-| Status | Open |
+| Status | Resolved |
 | Location | `src/ScadaLink.StoreAndForward/StoreAndForwardStorage.cs:203`, `:101` |

 **Description**
@@ -419,7 +462,16 @@ interval. Document the chosen behaviour in the method's XML comment.

 **Resolution**

-_Unresolved._
+Resolved 2026-05-16 (commit pending). Confirmed against the source: `RetryParkedMessage
+Async` reset `status`, `retry_count` and `last_error` but left `last_attempt_at` stale,
+so the operator-retried message's retry timing depended on the elapsed time since the
+original pre-park attempt. Encoded the intent explicitly — an operator retry means
+"attempt this again now" — by also setting `last_attempt_at = NULL` in the UPDATE, so
+the re-queued message is unambiguously due on the next sweep regardless of the
+configured `retry_interval_ms`. The method's XML comment now documents this. Regression
+test `RetryParkedMessageAsync_ClearsLastAttemptAt_SoMessageIsImmediatelyDue` uses a
+1-hour retry interval and a recent `last_attempt_at`; it failed pre-fix (timestamp not
+cleared, message excluded from the retry-due set) and passes post-fix.

 ### StoreAndForward-011 — `StoreAndForwardMessageStatus.InFlight` is unused and the doc's "retrying" status is unmodelled

@@ -488,7 +540,7 @@ _Unresolved._
 |--|--|
 | Severity | Medium |
 | Category | Testing coverage |
-| Status | Open |
+| Status | Resolved |
 | Location | `tests/ScadaLink.StoreAndForward.Tests/` (whole directory); `src/ScadaLink.StoreAndForward/StoreAndForwardStorage.cs:101`; `src/ScadaLink.StoreAndForward/ParkedMessageHandlerActor.cs` |

 **Description**
@@ -514,7 +566,20 @@ invalid-JSON payloads.

 **Resolution**

-_Unresolved._
+Resolved 2026-05-16 (commit pending). All three gaps closed.
+(1) Retry-due timing: `GetMessagesForRetryAsync_NonZeroInterval_ExcludesNotYetDue
+IncludesDue` exercises the `julianday` elapsed-time SQL with non-zero intervals — a
+just-attempted message (1-hour interval) is excluded, an old one (attempted 2h ago,
+5-min interval) and a never-attempted one are included.
+(2) Active-side replication: this sub-claim was **already stale** at the reviewed code —
+`StoreAndForwardReplicationTests` (added with finding 001's fix) asserts the Add, Remove
+and Park replication operations; no new test was needed.
+(3) `ParkedMessageHandlerActor`: new `ParkedMessageHandlerActorTests` using
+`Akka.TestKit.Xunit2` (package reference added to the test project) covers Query/Retry/
+Discard request-to-response mapping, correlation-ID propagation, the unknown-message
+failure responses, and `ExtractMethodName` for `MethodName`, `Subject`, missing-property
+and malformed-JSON payloads (all falling back to the category name without throwing).
+These are coverage-gap tests over already-correct code, so they pass on first run.

 ### StoreAndForward-014 — Storage does not create its SQLite database directory