Files

Joseph Doherty 61253e3269 fix(store-and-forward): resolve S&F delivery + replication wiring (3 Critical findings)

Resolves StoreAndForward-001, ExternalSystemGateway-001, NotificationService-001
— one systemic gap where buffered messages were persisted but never delivered,
and the active node never replicated its buffer to the standby.

Delivery handlers (ExternalSystemGateway-001 / NotificationService-001):
- AkkaHostedService registers delivery handlers for the ExternalSystem,
  CachedDbWrite and Notification categories after StoreAndForwardService starts;
  each resolves its scoped consumer in a fresh DI scope.
- ExternalSystemClient, DatabaseGateway and NotificationDeliveryService each
  gain a DeliverBufferedAsync method: re-resolve the target and re-attempt
  delivery, returning true/false/throwing per the transient-vs-permanent contract.
- EnqueueAsync gains an attemptImmediateDelivery flag; CachedCallAsync and
  NotificationDeliveryService.SendAsync pass false (they already attempted
  delivery themselves) so registering a handler does not dispatch twice.

Replication (StoreAndForward-001):
- ReplicationService is injected into StoreAndForwardService; a new BufferAsync
  helper replicates every enqueue, and successful-retry removes and parks are
  replicated too. Fire-and-forget, no-op when replication is disabled.

Tests: StoreAndForwardReplicationTests (Add/Remove/Park observed),
attemptImmediateDelivery behaviour, and DeliverBufferedAsync paths for each
consumer. Full solution builds; StoreAndForward/ExternalSystemGateway/
NotificationService suites green.

2026-05-16 18:58:11 -04:00

22 KiB

Raw Blame History

Code Review — StoreAndForward

Field	Value
Module	`src/ScadaLink.StoreAndForward`
Design doc	`docs/requirements/Component-StoreAndForward.md`
Status	Reviewed
Last reviewed	2026-05-16
Reviewer	claude-agent
Commit reviewed	`9c60592`
Open findings	12

Summary

The Store-and-Forward module is small and readable, with a clean SQLite persistence layer, a sensible service API, and reasonable test coverage of the storage and service happy paths. However the review surfaced two issues that undermine the module's core purpose. First, the active delivery path never invokes the ReplicationService — ReplicateEnqueue/Remove/Park have no callers anywhere in the codebase, so buffered messages are not replicated to the standby node and the design's failover-durability guarantee (Component doc "Persistence", CLAUDE.md "Store-and-Forward") is not met. Second, there is an off-by-one in retry accounting: the immediate-failure path stores a buffered message with RetryCount = 1, so a message configured with MaxRetries = N is actually attempted N times in total rather than one immediate attempt plus N retries, and a per-source MaxRetries of 1 produces zero retry attempts. Additional themes: SQLite connection-per-call with no transactional grouping of multi-statement operations, no concurrency guard against a parked message being retried while the sweep is mid-flight, an unused enum member (InFlight) that drifts from the documented status set, and untested critical paths (retry-due timing, replication-from-active, the actor bridge). None of the findings are blockers for compilation, but the replication and retry-count issues are functional defects against the design.

Checklist coverage

#	Category	Examined	Notes
1	Correctness & logic bugs	☑	Off-by-one in retry counting (003); parked-message retry timing (010).
2	Akka.NET conventions	☑	`ContinueWith` used instead of `PipeTo`-friendly continuations; default supervision; see 007.
3	Concurrency & thread safety	☑	Sweep guarded by `Interlocked`, but no guard against retry-vs-manage races (005); `OnActivity` event not thread-safe (009).
4	Error handling & resilience	☑	Replication never invoked from active path (001); no-handler messages buffered then stuck (002).
5	Security	☑	No issues found — parameterised SQL throughout; no secrets handled directly; payload JSON treated opaquely.
6	Performance & resource management	☑	New SQLite connection per call; multi-statement operations not wrapped in a transaction (006, 008).
7	Design-document adherence	☑	Replication gap (001); `InFlight` status undocumented/unused (011); "retrying" status from design doc not modelled.
8	Code organization & conventions	☑	`StoreAndForwardMessage` is an entity-like POCO living in the component, not Commons (012).
9	Testing coverage	☑	Retry-due timing, replication-from-active, and `ParkedMessageHandlerActor` are untested (013).
10	Documentation & comments	☑	XML doc on `RegisterDeliveryHandler` contract is inconsistent with code (004).

Findings

StoreAndForward-001 — Replication to standby is never triggered by the active node


Severity	Critical
Category	Error handling & resilience
Status	Resolved
Location	`src/ScadaLink.StoreAndForward/ReplicationService.cs:40`, `:53`, `:66`; `src/ScadaLink.StoreAndForward/StoreAndForwardService.cs:155`, `:212`, `:222`, `:236`

Description

ReplicationService exposes ReplicateEnqueue, ReplicateRemove and ReplicatePark to forward buffer operations to the standby node, but a codebase-wide search shows these methods have no callers. StoreAndForwardService — which performs every add (EnqueueAsync line 155 / 163), remove (RemoveMessageAsync call at line 212) and park (UpdateMessageAsync calls at lines 222/236) — holds no reference to ReplicationService and never invokes it. Only the receiving half is wired (SetReplicationHandler and ApplyReplicatedOperationAsync are used by SiteReplicationActor). The Component design doc ("Persistence") and CLAUDE.md ("Store-and-Forward") require the active node to forward every buffer operation to the standby so that, on failover, the new active node "has a near-complete copy of the buffer." As written, the standby's S&F SQLite database stays empty and a failover loses the entire buffer — a data-loss defect against a core requirement.

Recommendation

Inject ReplicationService into StoreAndForwardService and call ReplicateEnqueue after a successful _storage.EnqueueAsync, ReplicateRemove after RemoveMessageAsync, and ReplicatePark after a park-causing UpdateMessageAsync. Update ServiceCollectionExtensions.AddStoreAndForward to pass the dependency. Add a test that asserts the replication handler observes each operation type.

Resolution

Resolved 2026-05-16. ReplicationService is now injected into StoreAndForwardService (wired in AddStoreAndForward), and every buffer operation is forwarded to the standby: a new BufferAsync helper calls ReplicateEnqueue after each persist, ReplicateRemove runs after a successful retry removes a message, and ReplicatePark runs on both park paths. Replication stays fire-and-forget and is a no-op when ReplicationEnabled is false or no handler is wired. Regression tests StoreAndForwardReplicationTests assert the replication handler observes the Add, Remove and Park operations. Fixed by the commit whose message references StoreAndForward-001.

StoreAndForward-002 — Messages enqueued with no registered handler are buffered but never deliverable


Severity	High
Category	Error handling & resilience
Status	Open
Location	`src/ScadaLink.StoreAndForward/StoreAndForwardService.cs:162`, `:201`

Description

EnqueueAsync falls through to "No handler registered — buffer for later" (line 162) when no delivery handler is registered for the category. The retry sweep (RetryMessageAsync, line 201) then logs "No delivery handler for category" and returns without touching the message. No caller in the codebase ever calls RegisterDeliveryHandler (the External System Gateway, Notification Service and Database Gateway only call EnqueueAsync), so in the current wiring every buffered message lands in this dead state: it is persisted, counts toward buffer depth, but can never be retried, delivered or parked. It will sit Pending forever. Either the handler registration is missing from Host/gateway startup, or the "buffer for later" path is a silent trap. Either way the engine cannot deliver anything.

Recommendation

Decide the intended contract. If handlers are expected to be registered before EnqueueAsync is reachable, make EnqueueAsync reject (or log an error) when no handler exists rather than silently buffering an undeliverable message, and wire RegisterDeliveryHandler calls in Host startup for all three categories. If late registration is intended, the retry sweep should treat a still-missing handler as a transient condition with bounded logging rather than a permanent no-op.

Resolution

Unresolved.

StoreAndForward-003 — Off-by-one in retry accounting: immediate failure pre-counts as retry 1


Severity	High
Category	Correctness & logic bugs
Status	Open
Location	`src/ScadaLink.StoreAndForward/StoreAndForwardService.cs:153`, `:229`, `:233`

Description

On a transient immediate-delivery failure, EnqueueAsync buffers the message with message.RetryCount = 1 (line 153). The retry sweep then increments RetryCount before the max check (RetryCount++ at line 229; RetryCount >= MaxRetries at line 233). Consequences: (1) a message configured with MaxRetries = 1 is parked on the first retry sweep without ever being retried, because after the immediate attempt RetryCount is already 1 and the first sweep makes it 2 ≥ 1 — zero actual retries occur, contradicting the design intent that the immediate attempt and the retry budget are distinct; (2) the design doc's Retry Count field is "Number of attempts so far," but here it is seeded to 1 before any retry has happened, making the parked-message AttemptCount shown to operators off by one relative to configured MaxRetries. The EnqueueAsync_TransientFailure_BuffersForRetry test even asserts RetryCount == 1, locking in the ambiguity.

Recommendation

Choose one consistent meaning for RetryCount (recommended: total delivery attempts, including the immediate one) and apply it uniformly. If MaxRetries is meant to bound retries after the immediate attempt, buffer with RetryCount = 0 and treat the immediate failure as attempt 0; if it bounds total attempts, document that and adjust the comparison. Update the affected test to match the chosen semantics.

Resolution

Unresolved.

StoreAndForward-004 — `RegisterDeliveryHandler` XML doc contradicts the implemented contract


Severity	Medium
Category	Documentation & comments
Status	Open
Location	`src/ScadaLink.StoreAndForward/StoreAndForwardService.cs:38`, `:60`

Description

The XML comment on the handler delegate (lines 37–40) says "Returns true on success, throws on transient failure. Permanent failures should return false (message will NOT be buffered)." That last clause is wrong for the retry path: in RetryMessageAsync, a handler returning false does not "not buffer" — the message is already buffered, and the code parks it immediately (lines 218–224). The comment describes only the EnqueueAsync immediate path and misleads anyone implementing a handler about what false means once a message is in the retry loop.

Recommendation

Reword the contract to cover both paths explicitly: true = delivered (remove from buffer); false = permanent failure (not buffered on immediate attempt, parked on a retry); exception = transient failure (buffer / increment retry).

Resolution

Unresolved.

StoreAndForward-005 — Parked-message retry/discard can race with the in-progress retry sweep


Severity	Medium
Category	Concurrency & thread safety
Status	Open
Location	`src/ScadaLink.StoreAndForward/StoreAndForwardService.cs:184`, `:266`, `:280`

Description

RetryPendingMessagesAsync loads a snapshot of due messages (line 179) and then processes them one by one (line 184), await-ing delivery for each. Meanwhile RetryParkedMessageAsync / DiscardParkedMessageAsync (operator actions arriving via ParkedMessageHandlerActor) run on unrelated threads and mutate the same rows. Because each operation opens its own SQLite connection and there is no row-level coordination, an operator can DiscardParkedMessageAsync a message that the sweep is concurrently delivering: the sweep's later RemoveMessageAsync/UpdateMessageAsync operates on a now-deleted row (harmless) — but if an operator RetryParkedMessageAsync resets a row to Pending while the sweep simultaneously parks the same in-flight message, the operator intent is silently overwritten. The Interlocked guard only prevents overlapping sweeps, not sweep-vs-management races.

Recommendation

Funnel all message-state mutations through a single serialization point — e.g. perform all S&F state changes inside the ParkedMessageHandlerActor (or a dedicated S&F actor) so the actor mailbox serialises them, or make status transitions conditional in SQL (e.g. UPDATE ... WHERE id = @id AND status = @expected) and re-check the affected row count.

Resolution

Unresolved.

StoreAndForward-006 — `GetParkedMessagesAsync` count and page run without a transaction


Severity	Low
Category	Performance & resource management
Status	Open
Location	`src/ScadaLink.StoreAndForward/StoreAndForwardStorage.cs:166`, `:175`

Description

GetParkedMessagesAsync issues a COUNT(*) and then a separate paged SELECT on two commands on the same connection with no surrounding transaction. A concurrent enqueue/park/discard between the two statements yields a TotalCount inconsistent with the returned page (e.g. total reported as 51 while only 50 distinct parked rows now exist, or a row visible in the page but excluded from the count). For a paginated UI this produces flickering totals and occasional off-by-one page math.

Recommendation

Wrap both reads in a single transaction (BeginTransaction) so they see a consistent snapshot, or accept the staleness and document it. A transaction is cheap here and removes the inconsistency.

Resolution

Unresolved.

StoreAndForward-007 — Async work in `ParkedMessageHandlerActor` uses `ContinueWith` without scheduler/affinity guarantees


Severity	Low
Category	Akka.NET conventions
Status	Open
Location	`src/ScadaLink.StoreAndForward/ParkedMessageHandlerActor.cs:34`, `:68`, `:87`

Description

The three handlers call a Task-returning service method and chain .ContinueWith(...) .PipeTo(sender). Sender is correctly captured into a local first, so the closure is safe. However ContinueWith without an explicit TaskScheduler runs the continuation on a thread-pool thread and the captured continuation builds the response objects there — acceptable since it only touches locals, but it bypasses the idiomatic PipeTo-with-success/failure-projection pattern and is fragile if someone later adds a line touching actor state inside the continuation. There is also no TaskContinuationOptions, so a faulted antecedent still runs the continuation (handled here via IsCompletedSuccessfully, but only by convention).

Recommendation

Replace ContinueWith(...).PipeTo(sender) with PipeTo(sender, success: result => ..., failure: ex => ...), which is the documented Akka pattern, keeps response construction off the actor thread safely, and makes the success/failure branches explicit.

Resolution

Unresolved.

StoreAndForward-008 — A SQLite connection is opened and torn down on every storage call


Severity	Low
Category	Performance & resource management
Status	Open
Location	`src/ScadaLink.StoreAndForward/StoreAndForwardStorage.cs:28`, `:61`, `:93`, `:117`, `:144`, `:162`, `:199`, `:221`, `:237`, `:267`, `:285`, `:305`, `:319`

Description

Every method in StoreAndForwardStorage constructs a fresh SqliteConnection and calls OpenAsync. Microsoft.Data.Sqlite pools connections, so this is not a correctness bug, but a retry sweep over a large buffer performs one open per UpdateMessageAsync/ RemoveMessageAsync call inside the loop (RetryMessageAsync), multiplying connection churn under load. With no max buffer size (by design) the buffer can grow large, so the per-message connection acquisition is a measurable overhead on the hot retry path.

Recommendation

Consider a batched retry API that opens one connection (and one transaction) per sweep, or pass an open connection into the per-message update calls. At minimum, document that the design relies on the Sqlite connection pool for acceptable performance.

Resolution

Unresolved.

StoreAndForward-009 — `OnActivity` event invocation is not thread-safe against concurrent subscribe/unsubscribe


Severity	Low
Category	Concurrency & thread safety
Status	Open
Location	`src/ScadaLink.StoreAndForward/StoreAndForwardService.cs:46`, `:309`

Description

OnActivity is a public event Action<...> raised via OnActivity?.Invoke(...) in RaiseActivity (line 309). RaiseActivity is called from both EnqueueAsync (caller thread) and RetryMessageAsync (timer thread). The ?.Invoke null-conditional captures the delegate once so it will not NRE, but there is no synchronisation around the event field itself; a subscriber added/removed concurrently with a raise has no defined ordering. More importantly, subscriber callbacks run synchronously on the timer thread, so a slow or throwing subscriber stalls or aborts the retry sweep (an exception in a subscriber propagates out of RaiseActivity into RetryMessageAsync's try and is swallowed as a "transient failure," wrongly incrementing the message's retry count).

Recommendation

Snapshot the delegate (already done) and additionally wrap subscriber invocation in a try/catch so a faulting logging subscriber cannot be misclassified as a delivery failure. Document that handlers must be fast and non-throwing, or dispatch activity notifications asynchronously.

Resolution

Unresolved.

StoreAndForward-010 — Retry of a parked message does not reset `LastAttemptAt`, so its retry timing is unspecified


Severity	Medium
Category	Correctness & logic bugs
Status	Open
Location	`src/ScadaLink.StoreAndForward/StoreAndForwardStorage.cs:203`, `:101`

Description

RetryParkedMessageAsync sets status = Pending, retry_count = 0, last_error = NULL but leaves last_attempt_at unchanged (line 203–206). The retry-due query (GetMessagesForRetryAsync, line 101–105) selects Pending rows where last_attempt_at IS NULL OR ... elapsed >= retry_interval_ms. A message parked after exhausting retries has an old last_attempt_at; once re-queued, the elapsed time since that stale timestamp is almost certainly already greater than the retry interval, so the operator-retried message is attempted on the very next sweep regardless of the configured interval. That is probably the desired behaviour (operator wants it tried now), but it is unspecified and inconsistent — if retry_interval_ms were very large the behaviour would instead be "try immediately" by accident rather than by design.

Recommendation

Explicitly decide and encode the intent: either set last_attempt_at = NULL on re-queue so the message is unambiguously due now, or set it to "now" so it waits one interval. Document the chosen behaviour in the method's XML comment.

Resolution

Unresolved.

StoreAndForward-011 — `StoreAndForwardMessageStatus.InFlight` is unused and the doc's "retrying" status is unmodelled


Severity	Low
Category	Design-document adherence
Status	Open
Location	`src/ScadaLink.Commons/Types/Enums/StoreAndForwardMessageStatus.cs:9`; `src/ScadaLink.StoreAndForward/StoreAndForwardService.cs:219`, `:235`

Description

The enum defines Pending, InFlight, Parked, Delivered. The module only ever uses Pending and Parked — InFlight and Delivered are never assigned (delivered messages are deleted, not marked Delivered). Meanwhile the Component design doc ("Message Format" -> Status) specifies the set "Pending, retrying, or parked." So the code's enum drifts from the doc in two directions: it carries dead members the doc does not mention (InFlight, Delivered) and omits the doc's retrying state. A message mid-retry is indistinguishable from one that has never been attempted.

Recommendation

Reconcile the enum with the design. Either drop the unused members and update the doc, or implement the documented retrying state and use InFlight to mark a message the sweep is actively delivering (which would also help with finding 005).

Resolution

Unresolved.

StoreAndForward-012 — `StoreAndForwardMessage` is a persistence entity but lives in the component, not Commons


Severity	Low
Category	Code organization & conventions
Status	Open
Location	`src/ScadaLink.StoreAndForward/StoreAndForwardMessage.cs:9`

Description

StoreAndForwardMessage is a persistence-ignorant POCO that maps directly to the sf_messages table and is also carried across the network inside ReplicationOperation (replicated to the standby node over Akka remoting). CLAUDE.md "Code Organization" states that entity classes are persistence-ignorant POCOs in Commons and that message contracts follow additive-only evolution. Because this type doubles as a replication wire contract but lives in the component assembly, it is not co-located with the other Commons entities and its evolution is not governed by the additive-only message-contract rule. This is a borderline case (the type is site-local), but the cross-node use via ReplicationOperation makes it a de-facto message contract.

Recommendation

Either move StoreAndForwardMessage (and ReplicationOperation) into the Commons Entities/Messages hierarchy so they are governed by the contract-evolution rules, or introduce a separate DTO for replication and keep StoreAndForwardMessage purely as the local persistence model. Document the decision.

Resolution

Unresolved.

StoreAndForward-013 — Critical paths lack test coverage: retry-due timing, replication-from-active, and the actor bridge


Severity	Medium
Category	Testing coverage
Status	Open
Location	`tests/ScadaLink.StoreAndForward.Tests/` (whole directory); `src/ScadaLink.StoreAndForward/StoreAndForwardStorage.cs:101`; `src/ScadaLink.StoreAndForward/ParkedMessageHandlerActor.cs`

Description

The existing tests cover storage CRUD and the service happy/failure paths well, but three important behaviours are untested: (1) the retry-due time filter in GetMessagesForRetryAsync — every service test sets DefaultRetryInterval = TimeSpan.Zero, so the julianday elapsed-time comparison (the most error-prone SQL in the module) is never exercised with a non-zero interval; a message that is not yet due should be skipped, and that is never verified. (2) Replication from the active side — no test asserts that an enqueue/remove/park causes a Replicate* call (this is exactly the gap behind finding 001; a test would have caught it). (3) ParkedMessageHandlerActor has no test at all — the Query/Retry/Discard request-to-response mapping and the ExtractMethodName JSON parsing are unverified, including the malformed-JSON branch.

Recommendation

Add tests for: a non-zero retry interval where a recently-attempted message is excluded and an older one is included; active-side replication invocation per operation type (once finding 001 is fixed); and ParkedMessageHandlerActor using Akka.TestKit, including ExtractMethodName for MethodName, Subject, missing-property and invalid-JSON payloads.

Resolution

Unresolved.

22 KiB Raw Blame History Unescape Escape

Code Review — StoreAndForward

Summary

Checklist coverage

Findings

StoreAndForward-001 — Replication to standby is never triggered by the active node

StoreAndForward-002 — Messages enqueued with no registered handler are buffered but never deliverable

StoreAndForward-003 — Off-by-one in retry accounting: immediate failure pre-counts as retry 1

StoreAndForward-004 — RegisterDeliveryHandler XML doc contradicts the implemented contract

StoreAndForward-005 — Parked-message retry/discard can race with the in-progress retry sweep

StoreAndForward-006 — GetParkedMessagesAsync count and page run without a transaction

StoreAndForward-007 — Async work in ParkedMessageHandlerActor uses ContinueWith without scheduler/affinity guarantees

StoreAndForward-008 — A SQLite connection is opened and torn down on every storage call

StoreAndForward-009 — OnActivity event invocation is not thread-safe against concurrent subscribe/unsubscribe

StoreAndForward-010 — Retry of a parked message does not reset LastAttemptAt, so its retry timing is unspecified

StoreAndForward-011 — StoreAndForwardMessageStatus.InFlight is unused and the doc's "retrying" status is unmodelled

StoreAndForward-012 — StoreAndForwardMessage is a persistence entity but lives in the component, not Commons

StoreAndForward-013 — Critical paths lack test coverage: retry-due timing, replication-from-active, and the actor bridge

22 KiB

Raw Blame History

StoreAndForward-004 — `RegisterDeliveryHandler` XML doc contradicts the implemented contract

StoreAndForward-006 — `GetParkedMessagesAsync` count and page run without a transaction

StoreAndForward-007 — Async work in `ParkedMessageHandlerActor` uses `ContinueWith` without scheduler/affinity guarantees

StoreAndForward-009 — `OnActivity` event invocation is not thread-safe against concurrent subscribe/unsubscribe

StoreAndForward-010 — Retry of a parked message does not reset `LastAttemptAt`, so its retry timing is unspecified

StoreAndForward-011 — `StoreAndForwardMessageStatus.InFlight` is unused and the doc's "retrying" status is unmodelled

StoreAndForward-012 — `StoreAndForwardMessage` is a persistence entity but lives in the component, not Commons