# Code Review — StoreAndForward | Field | Value | |-------|-------| | Module | `src/ScadaLink.StoreAndForward` | | Design doc | `docs/requirements/Component-StoreAndForward.md` | | Status | Reviewed | | Last reviewed | 2026-05-17 | | Reviewer | claude-agent | | Commit reviewed | `39d737e` | | Open findings | 3 (3 Deferred: 002, 011, 012 — see notes) | ## Summary The Store-and-Forward module is small and readable, with a clean SQLite persistence layer, a sensible service API, and reasonable test coverage of the storage and service happy paths. However the review surfaced two issues that undermine the module's core purpose. First, the active delivery path never invokes the `ReplicationService` — `ReplicateEnqueue/Remove/Park` have no callers anywhere in the codebase, so buffered messages are not replicated to the standby node and the design's failover-durability guarantee (Component doc "Persistence", CLAUDE.md "Store-and-Forward") is not met. Second, there is an off-by-one in retry accounting: the immediate-failure path stores a buffered message with `RetryCount = 1`, so a message configured with `MaxRetries = N` is actually attempted `N` times in total rather than one immediate attempt plus `N` retries, and a per-source `MaxRetries` of 1 produces zero retry attempts. Additional themes: SQLite connection-per-call with no transactional grouping of multi-statement operations, no concurrency guard against a parked message being retried while the sweep is mid-flight, an unused enum member (`InFlight`) that drifts from the documented status set, and untested critical paths (retry-due timing, replication-from-active, the actor bridge). None of the findings are blockers for compilation, but the replication and retry-count issues are functional defects against the design. #### Re-review 2026-05-17 (commit `39d737e`) Re-reviewed at commit `39d737e` after the batch-3 fixes. All of findings 001 and 003–010, plus 013–014, are confirmed `Resolved` against the current source: the replication wiring (`BufferAsync`/`ReplicateRemove`/`ReplicatePark`), the corrected retry-count semantics, the conditional `UpdateMessageIfStatusAsync` writes, the transactioned parked-message reads, the `PipeTo` refactor, the `RaiseActivity` hardening, the `RetryParkedMessageAsync` `last_attempt_at` reset and the database directory creation are all present as described. Findings 002, 011 and 012 were re-verified and remain validly `Deferred` — their preconditions are unchanged (002's residual no-handler gap, 011's Commons-owned enum, 012's Commons-owned entity placement). This pass surfaced **three new findings**. StoreAndForward-015 records the StoreAndForward side of the cross-module `MaxRetries == 0` ambiguity flagged by ExternalSystemGateway-015: `EnqueueAsync`'s public contract documents `maxRetries` only as "parked once `MaxRetries` is reached" and never states the `0 = no limit / retry forever` special case that `RetryMessageAsync` actually enforces, so an ESG caller passing `0` to mean "never retry" gets the opposite behaviour with no warning from the S&F API surface. StoreAndForward-016 records that operator-initiated parked-message retry and discard are not replicated to the standby — only the add/remove/park sweep paths are — so a failover diverges the standby buffer from the active one. StoreAndForward-017 records that the Retry/Discard activity-log entries hard-code the `ExternalSystem` category, mislabelling notification and cached-DB-write messages in the site event log. ## Checklist coverage | # | Category | Examined | Notes | |---|----------|----------|-------| | 1 | Correctness & logic bugs | ☑ | Off-by-one in retry counting (003); parked-message retry timing (010); Retry/Discard activity log hard-codes the category (017). | | 2 | Akka.NET conventions | ☑ | `ContinueWith` used instead of `PipeTo`-friendly continuations; default supervision; see 007. | | 3 | Concurrency & thread safety | ☑ | Sweep guarded by `Interlocked`, but no guard against retry-vs-manage races (005); `OnActivity` event not thread-safe (009). | | 4 | Error handling & resilience | ☑ | Replication never invoked from active path (001); no-handler messages buffered then stuck (002); operator retry/discard not replicated to standby (016). | | 5 | Security | ☑ | No issues found — parameterised SQL throughout; no secrets handled directly; payload JSON treated opaquely. | | 6 | Performance & resource management | ☑ | New SQLite connection per call; multi-statement operations not wrapped in a transaction (006, 008). | | 7 | Design-document adherence | ☑ | Replication gap (001); `InFlight` status undocumented/unused (011); "retrying" status from design doc not modelled. | | 8 | Code organization & conventions | ☑ | `StoreAndForwardMessage` is an entity-like POCO living in the component, not Commons (012). | | 9 | Testing coverage | ☑ | Retry-due timing, replication-from-active, and `ParkedMessageHandlerActor` are untested (013). | | 10 | Documentation & comments | ☑ | XML doc on `RegisterDeliveryHandler` contract is inconsistent with code (004); `EnqueueAsync` never documents the `maxRetries == 0` = "retry forever" special case (015). | ## Findings ### StoreAndForward-001 — Replication to standby is never triggered by the active node | | | |--|--| | Severity | Critical | | Category | Error handling & resilience | | Status | Resolved | | Location | `src/ScadaLink.StoreAndForward/ReplicationService.cs:40`, `:53`, `:66`; `src/ScadaLink.StoreAndForward/StoreAndForwardService.cs:155`, `:212`, `:222`, `:236` | **Description** `ReplicationService` exposes `ReplicateEnqueue`, `ReplicateRemove` and `ReplicatePark` to forward buffer operations to the standby node, but a codebase-wide search shows these methods have no callers. `StoreAndForwardService` — which performs every add (`EnqueueAsync` line 155 / 163), remove (`RemoveMessageAsync` call at line 212) and park (`UpdateMessageAsync` calls at lines 222/236) — holds no reference to `ReplicationService` and never invokes it. Only the receiving half is wired (`SetReplicationHandler` and `ApplyReplicatedOperationAsync` are used by `SiteReplicationActor`). The Component design doc ("Persistence") and CLAUDE.md ("Store-and-Forward") require the active node to forward every buffer operation to the standby so that, on failover, the new active node "has a near-complete copy of the buffer." As written, the standby's S&F SQLite database stays empty and a failover loses the entire buffer — a data-loss defect against a core requirement. **Recommendation** Inject `ReplicationService` into `StoreAndForwardService` and call `ReplicateEnqueue` after a successful `_storage.EnqueueAsync`, `ReplicateRemove` after `RemoveMessageAsync`, and `ReplicatePark` after a park-causing `UpdateMessageAsync`. Update `ServiceCollectionExtensions.AddStoreAndForward` to pass the dependency. Add a test that asserts the replication handler observes each operation type. **Resolution** Resolved 2026-05-16. `ReplicationService` is now injected into `StoreAndForwardService` (wired in `AddStoreAndForward`), and every buffer operation is forwarded to the standby: a new `BufferAsync` helper calls `ReplicateEnqueue` after each persist, `ReplicateRemove` runs after a successful retry removes a message, and `ReplicatePark` runs on both park paths. Replication stays fire-and-forget and is a no-op when `ReplicationEnabled` is false or no handler is wired. Regression tests `StoreAndForwardReplicationTests` assert the replication handler observes the Add, Remove and Park operations. Fixed by the commit whose message references `StoreAndForward-001`. ### StoreAndForward-002 — Messages enqueued with no registered handler are buffered but never deliverable | | | |--|--| | Severity | Low | | Original severity | High (re-triaged down to Low on 2026-05-16 — see Re-triage note) | | Category | Error handling & resilience | | Status | Deferred | | Location | `src/ScadaLink.StoreAndForward/StoreAndForwardService.cs:162`, `:201` | **Description** `EnqueueAsync` falls through to "No handler registered — buffer for later" (line 162) when no delivery handler is registered for the category. The retry sweep (`RetryMessageAsync`, line 201) then logs "No delivery handler for category" and `return`s without touching the message. No caller in the codebase ever calls `RegisterDeliveryHandler` (the External System Gateway, Notification Service and Database Gateway only call `EnqueueAsync`), so in the current wiring **every** buffered message lands in this dead state: it is persisted, counts toward buffer depth, but can never be retried, delivered or parked. It will sit Pending forever. Either the handler registration is missing from Host/gateway startup, or the "buffer for later" path is a silent trap. Either way the engine cannot deliver anything. **Recommendation** Decide the intended contract. If handlers are expected to be registered before `EnqueueAsync` is reachable, make `EnqueueAsync` reject (or log an error) when no handler exists rather than silently buffering an undeliverable message, and wire `RegisterDeliveryHandler` calls in Host startup for all three categories. If late registration is intended, the retry sweep should treat a still-missing handler as a transient condition with bounded logging rather than a permanent no-op. **Re-triage note (2026-05-16)** The finding's central factual claim — *"No caller in the codebase ever calls `RegisterDeliveryHandler`"* and therefore *"every buffered message lands in this dead state"* — is **no longer true at the reviewed code**. `ScadaLink.Host` (`AkkaHostedService.RegisterSiteActors`, `AkkaHostedService.cs:353-379`) registers all three delivery handlers (`ExternalSystem`, `CachedDbWrite`, `Notification`) at site startup, immediately after `StoreAndForwardService.StartAsync()`. The finding was written against commit `9c60592` before that wiring existed; the High-severity "engine cannot deliver anything" outcome no longer occurs. The remaining residual risk is narrow: a message enqueued for a category that genuinely has no handler (e.g. an enqueue racing ahead of `RegisterDeliveryHandler`, or a future category added without a handler) is still buffered and then skipped by the sweep forever. That is a real but minor robustness gap, hence the **downgrade to Low**. It is left **Open** rather than fixed in this pass because the finding's recommended fix — making `EnqueueAsync` reject when no handler is registered — is a behavioural contract change, not a localised bug fix: the "buffer with no handler yet" path is exercised by `StoreAndForwardReplicationTests` and by three NotificationService and ExternalSystemGateway tests (`Send_TransientError_WithStoreAndForward_BuffersMessage`, `Send_Smtp4xxCommandException_ClassifiedTransientAndBuffered`, `Send_SmtpProtocolException_ClassifiedTransient`) which construct a real `StoreAndForwardService` without registering a handler and assert `WasBuffered`. Changing the contract requires deciding whether late handler registration is supported and updating tests in modules outside this review's edit scope — a design decision that should be made deliberately rather than forced here. **Resolution** _Deferred 2026-05-16 (re-triaged High → Low). Verified again against the source in this pass: the finding's premise (no `RegisterDeliveryHandler` caller anywhere) is stale — `ScadaLink.Host` wires all three handlers at site startup — so the High-severity "engine cannot deliver anything" outcome no longer occurs. The residual gap (a message enqueued for a category that genuinely has no handler is buffered then skipped forever) is real but minor. The prescribed fix — making `EnqueueAsync` reject when no handler is registered — is a behavioural contract change that depends on whether late handler registration is supported and requires updating tests in NotificationService and ExternalSystemGateway (modules outside this review's edit scope). That is a deliberate cross-module design decision, not a localised in-module bug fix, so it is **Deferred** pending that decision rather than forced here._ ### StoreAndForward-003 — Off-by-one in retry accounting: immediate failure pre-counts as retry 1 | | | |--|--| | Severity | High | | Category | Correctness & logic bugs | | Status | Resolved | | Location | `src/ScadaLink.StoreAndForward/StoreAndForwardService.cs:153`, `:229`, `:233` | **Description** On a transient immediate-delivery failure, `EnqueueAsync` buffers the message with `message.RetryCount = 1` (line 153). The retry sweep then increments `RetryCount` before the max check (`RetryCount++` at line 229; `RetryCount >= MaxRetries` at line 233). Consequences: (1) a message configured with `MaxRetries = 1` is parked on the *first* retry sweep without ever being retried, because after the immediate attempt `RetryCount` is already 1 and the first sweep makes it 2 ≥ 1 — zero actual retries occur, contradicting the design intent that the immediate attempt and the retry budget are distinct; (2) the design doc's `Retry Count` field is "Number of attempts so far," but here it is seeded to 1 before any *retry* has happened, making the parked-message `AttemptCount` shown to operators off by one relative to configured `MaxRetries`. The `EnqueueAsync_TransientFailure_BuffersForRetry` test even asserts `RetryCount == 1`, locking in the ambiguity. **Recommendation** Choose one consistent meaning for `RetryCount` (recommended: total delivery attempts, including the immediate one) and apply it uniformly. If `MaxRetries` is meant to bound *retries* after the immediate attempt, buffer with `RetryCount = 0` and treat the immediate failure as attempt 0; if it bounds *total attempts*, document that and adjust the comparison. Update the affected test to match the chosen semantics. **Resolution** Resolved 2026-05-16 (commit ``). `RetryCount` now consistently means "number of background retry-sweep attempts so far"; the initial immediate (or caller-made) delivery attempt is attempt 0 and is not counted, and `MaxRetries` bounds retry-sweep attempts after that initial attempt. `EnqueueAsync` no longer seeds `RetryCount = 1` on either the transient-immediate-failure path or the `attemptImmediateDelivery: false` path — a freshly buffered message has `RetryCount = 0`. `RetryMessageAsync` already increments before the `>= MaxRetries` check, which is now correct, so a message with `MaxRetries = 1` gets exactly one real retry before parking (previously zero). The `StoreAndForwardMessage.RetryCount` XML doc was corrected to match. Regression test `RetryPendingMessagesAsync_MaxRetriesOne_PerformsExactlyOneRetryBeforeParking` asserts the immediate attempt plus exactly one retry occur before parking; the affected existing tests (`EnqueueAsync_TransientFailure_BuffersForRetry`, `EnqueueAsync_AttemptImmediateDeliveryFalse_BuffersWithoutInvokingHandler`, `RetryPendingMessagesAsync_MaxRetriesReached_ParksMessage`) were updated to the corrected semantics. ### StoreAndForward-004 — `RegisterDeliveryHandler` XML doc contradicts the implemented contract | | | |--|--| | Severity | Medium | | Category | Documentation & comments | | Status | Resolved | | Location | `src/ScadaLink.StoreAndForward/StoreAndForwardService.cs:38`, `:60` | **Description** The XML comment on the handler delegate (lines 37–40) says "Returns true on success, throws on transient failure. Permanent failures should return false (message will NOT be buffered)." That last clause is wrong for the retry path: in `RetryMessageAsync`, a handler returning `false` does not "not buffer" — the message is already buffered, and the code *parks* it immediately (lines 218–224). The comment describes only the `EnqueueAsync` immediate path and misleads anyone implementing a handler about what `false` means once a message is in the retry loop. **Recommendation** Reword the contract to cover both paths explicitly: `true` = delivered (remove from buffer); `false` = permanent failure (not buffered on immediate attempt, parked on a retry); exception = transient failure (buffer / increment retry). **Resolution** Resolved 2026-05-16 (commit pending). Confirmed the root cause against the source: the old XML comment's "Permanent failures should return false (message will NOT be buffered)" describes only the `EnqueueAsync` immediate path; on the retry path a `false` return parks the already-buffered message. Reworded the `_deliveryHandlers` field doc into an explicit three-way contract (`true` = delivered/removed-or-never- buffered; `false` = permanent failure = not-buffered-on-immediate / parked-on-retry; throws = transient = buffered-for-retry / retry-count-incremented), noting it applies identically on both paths, and pointed `RegisterDeliveryHandler`'s summary at it. Documentation-only change — no behavioural code touched, so no regression test (an XML comment is not test-observable). ### StoreAndForward-005 — Parked-message retry/discard can race with the in-progress retry sweep | | | |--|--| | Severity | Low — re-triaged down from Medium on 2026-05-16; the described data-loss race is not actually reachable, see Re-triage note | | Original severity | Medium | | Category | Concurrency & thread safety | | Status | Resolved | | Location | `src/ScadaLink.StoreAndForward/StoreAndForwardService.cs:184`, `:266`, `:280` | **Description** `RetryPendingMessagesAsync` loads a snapshot of due messages (line 179) and then processes them one by one (line 184), `await`-ing delivery for each. Meanwhile `RetryParkedMessageAsync` / `DiscardParkedMessageAsync` (operator actions arriving via `ParkedMessageHandlerActor`) run on unrelated threads and mutate the same rows. Because each operation opens its own SQLite connection and there is no row-level coordination, an operator can `DiscardParkedMessageAsync` a message that the sweep is concurrently delivering: the sweep's later `RemoveMessageAsync`/`UpdateMessageAsync` operates on a now-deleted row (harmless) — but if an operator `RetryParkedMessageAsync` resets a row to Pending while the sweep simultaneously parks the same in-flight message, the operator intent is silently overwritten. The `Interlocked` guard only prevents *overlapping sweeps*, not sweep-vs-management races. **Recommendation** Funnel all message-state mutations through a single serialization point — e.g. perform all S&F state changes inside the `ParkedMessageHandlerActor` (or a dedicated S&F actor) so the actor mailbox serialises them, or make status transitions conditional in SQL (e.g. `UPDATE ... WHERE id = @id AND status = @expected`) and re-check the affected row count. **Re-triage note (2026-05-16)** Verified against the source: the *specific* race the finding describes — *"an operator `RetryParkedMessageAsync` resets a row to Pending while the sweep simultaneously parks the same in-flight message"* — **cannot occur** at the reviewed code. The two operator operations (`StoreAndForwardStorage.RetryParkedMessageAsync` / `DiscardParkedMessageAsync`) are already SQL-conditional on `status = Parked`, and the retry sweep only ever loads and processes rows that are `Pending`. For the operator and the sweep to touch the *same* message the row would have to be simultaneously `Parked` (operator-actionable) and in the sweep's in-flight `Pending` snapshot — mutually exclusive states. Overlapping sweeps are additionally prevented by the `Interlocked` guard, so there is only ever one sweep writer. The described data-loss outcome is therefore not reachable, which makes the original Medium ("risky behaviour") severity an over-statement — **re-triaged to Low**. What *is* real is a latent fragility: the sweep's own state-changing writes used the unconditional `UpdateMessageAsync` (`WHERE id = @id`), so the no-clobber guarantee rested entirely on the `Interlocked` invariant with zero defence-in-depth if that invariant were ever broken (e.g. `RetryMessageAsync` called outside the sweep, or the guard removed). That residual Low-severity weakness is worth closing, so the recommendation's conditional-SQL option was applied. **Resolution** Resolved 2026-05-16 (commit pending). Re-triaged Medium → Low (the described race is unreachable; see Re-triage note) and the residual latent fragility fixed: added `StoreAndForwardStorage.UpdateMessageIfStatusAsync`, a conditional update (`UPDATE ... WHERE id = @id AND status = @expectedStatus`) returning whether a row was written. `RetryMessageAsync`'s three state-changing writes (park-on-permanent-failure, park-on-max-retries, retry-count increment) now use it with `expectedStatus = Pending`, so the sweep can never overwrite a row whose status changed underneath it; a skipped write is logged and the message left for the next sweep. Regression test `RetryMessageAsync_StatusChangedDuringDelivery_SweepParkWriteIsSkipped` drives a writer that moves the row out of `Pending` mid-delivery and asserts the sweep's stale park write is skipped (it failed against the pre-fix unconditional update, clobbering the other writer's `RetryCount`). ### StoreAndForward-006 — `GetParkedMessagesAsync` count and page run without a transaction | | | |--|--| | Severity | Low | | Category | Performance & resource management | | Status | Resolved | | Location | `src/ScadaLink.StoreAndForward/StoreAndForwardStorage.cs:166`, `:175` | **Description** `GetParkedMessagesAsync` issues a `COUNT(*)` and then a separate paged `SELECT` on two commands on the same connection with no surrounding transaction. A concurrent enqueue/park/discard between the two statements yields a `TotalCount` inconsistent with the returned page (e.g. total reported as 51 while only 50 distinct parked rows now exist, or a row visible in the page but excluded from the count). For a paginated UI this produces flickering totals and occasional off-by-one page math. **Recommendation** Wrap both reads in a single transaction (`BeginTransaction`) so they see a consistent snapshot, or accept the staleness and document it. A transaction is cheap here and removes the inconsistency. **Resolution** Resolved 2026-05-16 (commit pending). Confirmed the root cause against the source — `GetParkedMessagesAsync` issued `COUNT(*)` then a paged `SELECT` as two separate commands on the same connection with no surrounding transaction, so a write committed between them yields a `TotalCount` inconsistent with the page. Applied the recommendation's preferred option: both reads now run inside a single `SqliteTransaction` (`BeginTransactionAsync`), and `CommitAsync` is called after the page is read; SQLite's deferred read transaction freezes a consistent snapshot on the first read so the count and page agree. Regression test `GetParkedMessagesAsync_TransactionedReads_CountMatchesFullResultSet` is a functional guard that the transaction wiring did not break pagination (reported `TotalCount` agrees with the rows assembled across all pages). Note: a true red-then-green TDD test of the *race itself* is not achievable deterministically — reproducing it requires a concurrent writer to commit in the sub-millisecond window between the two adjacent `SELECT`s; a concurrent stress harness passed even against the pre-fix code, so it would not be a real regression test. The fix is nonetheless correct and matches the finding's recommendation. ### StoreAndForward-007 — Async work in `ParkedMessageHandlerActor` uses `ContinueWith` without scheduler/affinity guarantees | | | |--|--| | Severity | Low | | Category | Akka.NET conventions | | Status | Resolved | | Location | `src/ScadaLink.StoreAndForward/ParkedMessageHandlerActor.cs:34`, `:68`, `:87` | **Description** The three handlers call a `Task`-returning service method and chain `.ContinueWith(...) .PipeTo(sender)`. `Sender` is correctly captured into a local first, so the closure is safe. However `ContinueWith` without an explicit `TaskScheduler` runs the continuation on a thread-pool thread and the captured continuation builds the response objects there — acceptable since it only touches locals, but it bypasses the idiomatic `PipeTo`-with-success/failure-projection pattern and is fragile if someone later adds a line touching actor state inside the continuation. There is also no `TaskContinuationOptions`, so a faulted antecedent still runs the continuation (handled here via `IsCompletedSuccessfully`, but only by convention). **Recommendation** Replace `ContinueWith(...).PipeTo(sender)` with `PipeTo(sender, success: result => ..., failure: ex => ...)`, which is the documented Akka pattern, keeps response construction off the actor thread safely, and makes the success/failure branches explicit. **Resolution** Resolved 2026-05-16 (commit pending). Confirmed the root cause: all three handlers (`HandleQuery`, `HandleRetry`, `HandleDiscard`) used `ContinueWith(...).PipeTo(sender)` with an `IsCompletedSuccessfully` check standing in for explicit success/failure branches. Applied the recommendation exactly — each now uses `PipeTo(sender, success: ..., failure: ...)`, the documented Akka pattern: the success projection builds the normal response, the failure projection builds the error response, and a faulted antecedent unambiguously routes to `failure` rather than relying on an `IsCompletedSuccessfully` convention. `Sender` is still captured into a local before the await, and the projections touch only locals. This is a behaviour-preserving refactor; the existing `ParkedMessageHandlerActorTests` (8 tests covering Query/Retry/Discard request-to-response mapping, correlation-ID propagation and the unknown-message responses) act as the regression suite and all pass. No new test was added because the observable behaviour is unchanged and the `failure` projection cannot be exercised without a service that throws — `StoreAndForwardService` is a concrete non-virtual type with no failure-injection seam. ### StoreAndForward-008 — A SQLite connection is opened and torn down on every storage call | | | |--|--| | Severity | Low | | Category | Performance & resource management | | Status | Resolved | | Location | `src/ScadaLink.StoreAndForward/StoreAndForwardStorage.cs:28`, `:61`, `:93`, `:117`, `:144`, `:162`, `:199`, `:221`, `:237`, `:267`, `:285`, `:305`, `:319` | **Description** Every method in `StoreAndForwardStorage` constructs a fresh `SqliteConnection` and calls `OpenAsync`. Microsoft.Data.Sqlite pools connections, so this is not a correctness bug, but a retry sweep over a large buffer performs one open per `UpdateMessageAsync`/ `RemoveMessageAsync` call inside the loop (`RetryMessageAsync`), multiplying connection churn under load. With no max buffer size (by design) the buffer can grow large, so the per-message connection acquisition is a measurable overhead on the hot retry path. **Recommendation** Consider a batched retry API that opens one connection (and one transaction) per sweep, or pass an open connection into the per-message update calls. At minimum, document that the design relies on the Sqlite connection pool for acceptable performance. **Resolution** Resolved 2026-05-16 (commit pending). Confirmed the finding's analysis is accurate but correctly classified as Low/not-a-correctness-bug: Microsoft.Data.Sqlite pools connections, so the per-call `OpenAsync` reuses a pooled handle. Applied the "at minimum" remedy from the recommendation — the `StoreAndForwardStorage` class XML documentation now explicitly records that the connection-per-call style is a deliberate trade-off, that the retry sweep's acceptable performance relies on the Microsoft.Data.Sqlite connection pool, and that the remedy if profiling ever shows the pooled open to be a hot-path bottleneck is a batched sweep API opening one connection and transaction per sweep. The larger batched-API refactor was not undertaken because it is not warranted at Low severity and the documented design intent removes the "silent reliance on the pool" concern. Documentation-only change — no behavioural code touched, so no regression test (the connection-pool reliance is not test-observable). ### StoreAndForward-009 — `OnActivity` event invocation is not thread-safe against concurrent subscribe/unsubscribe | | | |--|--| | Severity | Low | | Category | Concurrency & thread safety | | Status | Resolved | | Location | `src/ScadaLink.StoreAndForward/StoreAndForwardService.cs:46`, `:309` | **Description** `OnActivity` is a public `event Action<...>` raised via `OnActivity?.Invoke(...)` in `RaiseActivity` (line 309). `RaiseActivity` is called from both `EnqueueAsync` (caller thread) and `RetryMessageAsync` (timer thread). The `?.Invoke` null-conditional captures the delegate once so it will not NRE, but there is no synchronisation around the event field itself; a subscriber added/removed concurrently with a raise has no defined ordering. More importantly, subscriber callbacks run synchronously on the timer thread, so a slow or throwing subscriber stalls or aborts the retry sweep (an exception in a subscriber propagates out of `RaiseActivity` into `RetryMessageAsync`'s `try` and is swallowed as a "transient failure," wrongly incrementing the message's retry count). **Recommendation** Snapshot the delegate (already done) and additionally wrap subscriber invocation in a `try/catch` so a faulting logging subscriber cannot be misclassified as a delivery failure. Document that handlers must be fast and non-throwing, or dispatch activity notifications asynchronously. **Resolution** Resolved 2026-05-16 (commit pending). Confirmed — and found the impact is worse than the finding states. `RaiseActivity` previously did `OnActivity?.Invoke(...)`; a throwing subscriber's exception escaped it. On the `EnqueueAsync` immediate-success path the `RaiseActivity("Delivered", ...)` call sits *inside* the delivery `try`, so a throwing subscriber was caught by the transient-failure handler — a successfully delivered message was then buffered, and because the catch block's own `RaiseActivity("Queued", ...)` also threw, the exception escaped `EnqueueAsync` entirely. `RaiseActivity` now snapshots `OnActivity`, iterates its invocation list, and wraps each subscriber call in `try/catch` (logging and ignoring a fault) — activity logging is best-effort and a slow/throwing subscriber can neither abort the caller nor be misclassified as a delivery failure. Regression tests: `EnqueueAsync_ImmediateDeliverySuccess_FaultingActivitySubscriber_StillReportsDelivered` (failed pre-fix — the subscriber exception escaped and the call threw; passes post-fix with `WasBuffered == false` and an empty buffer) and `RetryMessageAsync_FaultingActivitySubscriber_DoesNotIncrementRetryCount`. ### StoreAndForward-010 — Retry of a parked message does not reset `LastAttemptAt`, so its retry timing is unspecified | | | |--|--| | Severity | Medium | | Category | Correctness & logic bugs | | Status | Resolved | | Location | `src/ScadaLink.StoreAndForward/StoreAndForwardStorage.cs:203`, `:101` | **Description** `RetryParkedMessageAsync` sets `status = Pending, retry_count = 0, last_error = NULL` but leaves `last_attempt_at` unchanged (line 203–206). The retry-due query (`GetMessagesForRetryAsync`, line 101–105) selects Pending rows where `last_attempt_at IS NULL OR ... elapsed >= retry_interval_ms`. A message parked after exhausting retries has an old `last_attempt_at`; once re-queued, the elapsed time since that stale timestamp is almost certainly already greater than the retry interval, so the operator-retried message is attempted on the very next sweep regardless of the configured interval. That is probably the desired behaviour (operator wants it tried now), but it is unspecified and inconsistent — if `retry_interval_ms` were very large the behaviour would instead be "try immediately" by accident rather than by design. **Recommendation** Explicitly decide and encode the intent: either set `last_attempt_at = NULL` on re-queue so the message is unambiguously due now, or set it to "now" so it waits one interval. Document the chosen behaviour in the method's XML comment. **Resolution** Resolved 2026-05-16 (commit pending). Confirmed against the source: `RetryParkedMessage Async` reset `status`, `retry_count` and `last_error` but left `last_attempt_at` stale, so the operator-retried message's retry timing depended on the elapsed time since the original pre-park attempt. Encoded the intent explicitly — an operator retry means "attempt this again now" — by also setting `last_attempt_at = NULL` in the UPDATE, so the re-queued message is unambiguously due on the next sweep regardless of the configured `retry_interval_ms`. The method's XML comment now documents this. Regression test `RetryParkedMessageAsync_ClearsLastAttemptAt_SoMessageIsImmediatelyDue` uses a 1-hour retry interval and a recent `last_attempt_at`; it failed pre-fix (timestamp not cleared, message excluded from the retry-due set) and passes post-fix. ### StoreAndForward-011 — `StoreAndForwardMessageStatus.InFlight` is unused and the doc's "retrying" status is unmodelled | | | |--|--| | Severity | Low | | Category | Design-document adherence | | Status | Deferred | | Location | `src/ScadaLink.Commons/Types/Enums/StoreAndForwardMessageStatus.cs:9`; `src/ScadaLink.StoreAndForward/StoreAndForwardService.cs:219`, `:235` | **Description** The enum defines `Pending, InFlight, Parked, Delivered`. The module only ever uses `Pending` and `Parked` — `InFlight` and `Delivered` are never assigned (delivered messages are deleted, not marked `Delivered`). Meanwhile the Component design doc ("Message Format" -> Status) specifies the set "Pending, retrying, or parked." So the code's enum drifts from the doc in two directions: it carries dead members the doc does not mention (`InFlight`, `Delivered`) and omits the doc's `retrying` state. A message mid-retry is indistinguishable from one that has never been attempted. **Recommendation** Reconcile the enum with the design. Either drop the unused members and update the doc, or implement the documented `retrying` state and use `InFlight` to mark a message the sweep is actively delivering (which would also help with finding 005). **Resolution** _Deferred 2026-05-16. Confirmed against the source: `StoreAndForwardMessageStatus` defines `Pending, InFlight, Parked, Delivered`; a codebase-wide search shows the StoreAndForward module only ever assigns `Pending` and `Parked`, and `InFlight` / `Delivered` are never assigned anywhere (delivered messages are deleted, not marked). The design doc's `retrying` state is unmodelled. Both options the recommendation offers — (a) drop the unused `InFlight`/`Delivered` members, or (b) add a `Retrying` member — require editing `StoreAndForwardMessageStatus.cs`, which lives in `src/ScadaLink.Commons` (outside this review's edit scope: only `src/ScadaLink.StoreAndForward/**` may be changed). The enum is also referenced by IntegrationTests and HealthMonitoring tests, so removing members is a cross-module change. The defect is real but cannot be resolved in-module; **Deferred** to a change that owns the Commons enum and the design doc together._ ### StoreAndForward-012 — `StoreAndForwardMessage` is a persistence entity but lives in the component, not Commons | | | |--|--| | Severity | Low | | Category | Code organization & conventions | | Status | Deferred | | Location | `src/ScadaLink.StoreAndForward/StoreAndForwardMessage.cs:9` | **Description** `StoreAndForwardMessage` is a persistence-ignorant POCO that maps directly to the `sf_messages` table and is also carried across the network inside `ReplicationOperation` (replicated to the standby node over Akka remoting). CLAUDE.md "Code Organization" states that entity classes are persistence-ignorant POCOs in Commons and that message contracts follow additive-only evolution. Because this type doubles as a replication wire contract but lives in the component assembly, it is not co-located with the other Commons entities and its evolution is not governed by the additive-only message-contract rule. This is a borderline case (the type is site-local), but the cross-node use via `ReplicationOperation` makes it a de-facto message contract. **Recommendation** Either move `StoreAndForwardMessage` (and `ReplicationOperation`) into the Commons `Entities`/`Messages` hierarchy so they are governed by the contract-evolution rules, or introduce a separate DTO for replication and keep `StoreAndForwardMessage` purely as the local persistence model. Document the decision. **Resolution** _Deferred 2026-05-16. Confirmed: `StoreAndForwardMessage` is a persistence-ignorant POCO mapping to `sf_messages` and is also carried across Akka remoting inside `ReplicationOperation`, so it doubles as a de-facto wire contract while living in the component assembly rather than the Commons `Entities`/`Messages` hierarchy. The recommendation's primary remedy — moving `StoreAndForwardMessage` (and `ReplicationOperation`) into Commons — crosses module boundaries (it would add a type to `src/ScadaLink.Commons`, outside this review's edit scope of `src/ScadaLink.StoreAndForward/**`, and change every referencing module). The alternative "separate replication DTO" still leaves the persistence entity in the component, so it does not actually resolve the finding's core concern (entity placement / contract- evolution governance). Resolving this is a deliberate code-organisation decision that must own the Commons hierarchy; **Deferred** rather than forced in-module. Flagged for a cross-module follow-up._ ### StoreAndForward-013 — Critical paths lack test coverage: retry-due timing, replication-from-active, and the actor bridge | | | |--|--| | Severity | Medium | | Category | Testing coverage | | Status | Resolved | | Location | `tests/ScadaLink.StoreAndForward.Tests/` (whole directory); `src/ScadaLink.StoreAndForward/StoreAndForwardStorage.cs:101`; `src/ScadaLink.StoreAndForward/ParkedMessageHandlerActor.cs` | **Description** The existing tests cover storage CRUD and the service happy/failure paths well, but three important behaviours are untested: (1) the retry-due time filter in `GetMessagesForRetryAsync` — every service test sets `DefaultRetryInterval = TimeSpan.Zero`, so the `julianday` elapsed-time comparison (the most error-prone SQL in the module) is never exercised with a non-zero interval; a message that is *not yet due* should be skipped, and that is never verified. (2) Replication from the active side — no test asserts that an enqueue/remove/park causes a `Replicate*` call (this is exactly the gap behind finding 001; a test would have caught it). (3) `ParkedMessageHandlerActor` has no test at all — the Query/Retry/Discard request-to-response mapping and the `ExtractMethodName` JSON parsing are unverified, including the malformed-JSON branch. **Recommendation** Add tests for: a non-zero retry interval where a recently-attempted message is excluded and an older one is included; active-side replication invocation per operation type (once finding 001 is fixed); and `ParkedMessageHandlerActor` using `Akka.TestKit`, including `ExtractMethodName` for `MethodName`, `Subject`, missing-property and invalid-JSON payloads. **Resolution** Resolved 2026-05-16 (commit pending). All three gaps closed. (1) Retry-due timing: `GetMessagesForRetryAsync_NonZeroInterval_ExcludesNotYetDue IncludesDue` exercises the `julianday` elapsed-time SQL with non-zero intervals — a just-attempted message (1-hour interval) is excluded, an old one (attempted 2h ago, 5-min interval) and a never-attempted one are included. (2) Active-side replication: this sub-claim was **already stale** at the reviewed code — `StoreAndForwardReplicationTests` (added with finding 001's fix) asserts the Add, Remove and Park replication operations; no new test was needed. (3) `ParkedMessageHandlerActor`: new `ParkedMessageHandlerActorTests` using `Akka.TestKit.Xunit2` (package reference added to the test project) covers Query/Retry/ Discard request-to-response mapping, correlation-ID propagation, the unknown-message failure responses, and `ExtractMethodName` for `MethodName`, `Subject`, missing-property and malformed-JSON payloads (all falling back to the category name without throwing). These are coverage-gap tests over already-correct code, so they pass on first run. ### StoreAndForward-014 — Storage does not create its SQLite database directory | | | |--|--| | Severity | Medium | | Category | Error handling & resilience | | Status | Resolved | | Location | `src/ScadaLink.StoreAndForward/StoreAndForwardStorage.cs:26` | **Found 2026-05-16** while verifying the store-and-forward fixes — this defect was not part of the original baseline review. **Description** `StoreAndForwardStorage.InitializeAsync` opened a `SqliteConnection` against the configured `SqliteDbPath` (default `./data/store-and-forward.db`) without ensuring the parent directory exists. SQLite creates the database *file* on demand but not its *directory*, so when `data/` does not already exist the connection fails to open with `SQLite Error 14: 'unable to open database file'`. Every site-host boot therefore failed in any environment whose working directory has no `data/` folder — the cause of the six failing `SiteActorPathTests` (the host's `RegisterSiteActors` aborts at `StoreAndForwardService.StartAsync`). Production masked it because `data/` is created by the Docker image / deployment. **Recommendation** Create the parent directory of a file-backed SQLite database before opening it. **Resolution** Resolved 2026-05-16. `InitializeAsync` now calls a new `EnsureDatabaseDirectoryExists` helper that parses the connection string with `SqliteConnectionStringBuilder` and, for a file-backed database, creates the parent directory if it is missing (in-memory databases and bare filenames are skipped). Regression test `InitializeAsync_FileInMissingDirectory_CreatesDirectory` fails against the pre-fix code; all six `SiteActorPathTests` now pass. Fixed by the commit whose message references `StoreAndForward-014`. ### StoreAndForward-015 — `EnqueueAsync`'s public contract never documents that `maxRetries == 0` means "retry forever" | | | |--|--| | Severity | Medium | | Category | Documentation & comments | | Status | Open | | Location | `src/ScadaLink.StoreAndForward/StoreAndForwardService.cs:114`–`:130`, `:285` | **Description** The re-review brief asks for the StoreAndForward side of the cross-module ambiguity recorded as `ExternalSystemGateway-015`. The semantics are split across this module and its callers, and the StoreAndForward side carries a genuine documentation/API-contract fault: - `RetryMessageAsync` parks a message only when `message.MaxRetries > 0 && message.RetryCount >= message.MaxRetries` (`StoreAndForwardService.cs:285`). When `MaxRetries == 0` the guard is false on every sweep, so a `0` value means **"no limit — retry forever"**. The `StoreAndForwardMessage.MaxRetries` XML doc (`StoreAndForwardMessage.cs:31`) does state `"0 = no limit"`, so the persistence model is internally consistent. - But `EnqueueAsync` — the *only* public entry point into the engine — exposes a `maxRetries` parameter (`StoreAndForwardService.cs:128`) with **no parameter documentation at all**, and its method summary (lines 114–122) describes the lifecycle only as "On max retries → park" / "parked once `MaxRetries` is reached" (see also the `_deliveryHandlers` field doc, line 50–51). Nothing on the public surface tells a caller that passing `0` flips the meaning from "park immediately / never retry" to "retry forever". A caller reading only `EnqueueAsync` would reasonably assume `0` retries means zero retries. - This is exactly the trap ESG fell into: `ExternalSystemClient.CachedCallAsync` / `DatabaseGateway.CachedWriteAsync` pass the source entity's `MaxRetries` verbatim, intending `0` to mean "never retry", and instead get unbounded retry — the duplicate-delivery / unbounded-buffer-growth hazard the design doc's idempotency note warns against. The fault is not solely ESG's: the S&F public API silently overloads `0` with the opposite of its natural reading and does not document it. The defect is in this module's API contract and documentation, so it is recorded here in addition to `ExternalSystemGateway-015`. (Whether `0` *should* mean "no limit" or "no retry" is the cross-module design decision tracked by ESG-015; this finding is specifically that the StoreAndForward public surface fails to document whichever meaning is chosen.) **Recommendation** Document the `maxRetries` parameter on `EnqueueAsync` explicitly with a `` tag that states the `0` special case in the same words as `StoreAndForwardMessage.MaxRetries` (`"0 = no limit — retried on every sweep until delivered, never parked"`), and add the `0` case to the method summary's lifecycle description. Better still — and consistent with the resolution of ESG-015 — make the engine reject the ambiguity at the API: accept a nullable/`enum` retry policy, or treat `0` as an explicit "no retry" (do not buffer, or park on the first sweep) so the natural reading and the behaviour agree. Either way the public `EnqueueAsync` contract must state the chosen meaning; today it states nothing. **Resolution** _Unresolved._ ### StoreAndForward-016 — Operator-initiated parked-message retry and discard are not replicated to the standby | | | |--|--| | Severity | Medium | | Category | Error handling & resilience | | Status | Open | | Location | `src/ScadaLink.StoreAndForward/StoreAndForwardService.cs:339`–`:362`; `src/ScadaLink.StoreAndForward/ReplicationService.cs:131`–`:136` | **Description** `StoreAndForward-001`'s fix wired replication into the active *delivery* paths: `BufferAsync` replicates an `Add`, a successful retry replicates a `Remove`, and a park replicates a `Park`. But the two *operator* paths — `RetryParkedMessageAsync` (line 339) and `DiscardParkedMessageAsync` (line 353) — change buffer state and never touch `_replication`: - `RetryParkedMessageAsync` flips a row from `Parked` back to `Pending` (and clears `retry_count` / `last_attempt_at`) in the local SQLite only. The standby's copy stays `Parked`. - `DiscardParkedMessageAsync` `DELETE`s the row from the local SQLite only. The standby's copy is left in place, still `Parked`. The Component design doc ("Persistence") requires the active node to forward "each buffer operation (add, remove, park)" so that on failover "the new active node has a near-complete copy of the buffer." An operator retrying a parked message is a buffer state change; an operator discarding one is a removal. After a failover that follows an operator action: 1. A **discarded** message reappears on the new active node — it is still `Parked` there, so it resurfaces in the central UI's parked-message list and an operator must discard it a second time. For a message deliberately removed (e.g. a known-bad payload) this is a correctness regression of the operator's intent. 2. A **retried** message is still `Parked` on the new active node, so the operator's "move it back to the queue" action is silently lost across the failover and the message is not re-attempted. `ReplicationOperationType` only models `Add`/`Remove`/`Park` — there is no operation for "un-park / move back to pending", so even a minimal fix needs either a new operation type or a re-use of `Add` to overwrite the standby row. This is the same class of defect as the now-resolved `StoreAndForward-001`, for the operator paths rather than the sweep paths. **Recommendation** Replicate both operator actions. `DiscardParkedMessageAsync` should call `_replication?.ReplicateRemove(messageId)` after a successful local delete (the existing `Remove` op already deletes on the standby). For `RetryParkedMessageAsync`, add a `Requeue`/`Unpark` `ReplicationOperationType` whose `ApplyReplicatedOperationAsync` case resets the standby row to `Pending` with `retry_count = 0`, or have the method re-load the updated message and replicate it as an `Add`-style upsert. Add replication tests for both operator paths (the existing `StoreAndForwardReplicationTests` only cover the sweep paths). **Resolution** _Unresolved._ ### StoreAndForward-017 — Retry/Discard activity-log entries hard-code the `ExternalSystem` category | | | |--|--| | Severity | Low | | Category | Correctness & logic bugs | | Status | Open | | Location | `src/ScadaLink.StoreAndForward/StoreAndForwardService.cs:344`, `:358` | **Description** `RetryParkedMessageAsync` and `DiscardParkedMessageAsync` raise an S&F activity notification (consumed by Site Event Logging — WP-14) but pass a hard-coded `StoreAndForwardCategory.ExternalSystem` as the category argument: ```csharp RaiseActivity("Retry", StoreAndForwardCategory.ExternalSystem, $"Parked message {messageId} moved back to queue"); RaiseActivity("Discard", StoreAndForwardCategory.ExternalSystem, $"Parked message {messageId} discarded"); ``` Both methods take only a `messageId` and never load the message, so they have no access to its real category. When an operator retries or discards a parked **Notification** or **CachedDbWrite** message, the site event log records the activity under `ExternalSystem`. Every other `RaiseActivity` call in the service passes the message's true `Category` (`EnqueueAsync`, `RetryMessageAsync`), so the operator paths are inconsistent and produce mislabelled audit entries — misleading when an operator later filters or reviews S&F activity by category. **Recommendation** Load the message (or have `StoreAndForwardStorage.RetryParkedMessageAsync` / `DiscardParkedMessageAsync` return the affected row's category) and pass the real `Category` to `RaiseActivity`. If loading the row is considered too costly on these infrequent operator paths, change the `OnActivity` event / `RaiseActivity` signature to allow a nullable category for management actions rather than asserting a false one. **Resolution** _Unresolved._