39 KiB
Code Review — StoreAndForward
| Field | Value |
|---|---|
| Module | src/ScadaLink.StoreAndForward |
| Design doc | docs/requirements/Component-StoreAndForward.md |
| Status | Reviewed |
| Last reviewed | 2026-05-16 |
| Reviewer | claude-agent |
| Commit reviewed | 9c60592 |
| Open findings | 0 (3 Deferred: 002, 011, 012 — see notes) |
Summary
The Store-and-Forward module is small and readable, with a clean SQLite persistence
layer, a sensible service API, and reasonable test coverage of the storage and service
happy paths. However the review surfaced two issues that undermine the module's core
purpose. First, the active delivery path never invokes the ReplicationService —
ReplicateEnqueue/Remove/Park have no callers anywhere in the codebase, so buffered
messages are not replicated to the standby node and the design's failover-durability
guarantee (Component doc "Persistence", CLAUDE.md "Store-and-Forward") is not met.
Second, there is an off-by-one in retry accounting: the immediate-failure path stores a
buffered message with RetryCount = 1, so a message configured with MaxRetries = N
is actually attempted N times in total rather than one immediate attempt plus N
retries, and a per-source MaxRetries of 1 produces zero retry attempts. Additional
themes: SQLite connection-per-call with no transactional grouping of multi-statement
operations, no concurrency guard against a parked message being retried while the
sweep is mid-flight, an unused enum member (InFlight) that drifts from the documented
status set, and untested critical paths (retry-due timing, replication-from-active,
the actor bridge). None of the findings are blockers for compilation, but the
replication and retry-count issues are functional defects against the design.
Checklist coverage
| # | Category | Examined | Notes |
|---|---|---|---|
| 1 | Correctness & logic bugs | ☑ | Off-by-one in retry counting (003); parked-message retry timing (010). |
| 2 | Akka.NET conventions | ☑ | ContinueWith used instead of PipeTo-friendly continuations; default supervision; see 007. |
| 3 | Concurrency & thread safety | ☑ | Sweep guarded by Interlocked, but no guard against retry-vs-manage races (005); OnActivity event not thread-safe (009). |
| 4 | Error handling & resilience | ☑ | Replication never invoked from active path (001); no-handler messages buffered then stuck (002). |
| 5 | Security | ☑ | No issues found — parameterised SQL throughout; no secrets handled directly; payload JSON treated opaquely. |
| 6 | Performance & resource management | ☑ | New SQLite connection per call; multi-statement operations not wrapped in a transaction (006, 008). |
| 7 | Design-document adherence | ☑ | Replication gap (001); InFlight status undocumented/unused (011); "retrying" status from design doc not modelled. |
| 8 | Code organization & conventions | ☑ | StoreAndForwardMessage is an entity-like POCO living in the component, not Commons (012). |
| 9 | Testing coverage | ☑ | Retry-due timing, replication-from-active, and ParkedMessageHandlerActor are untested (013). |
| 10 | Documentation & comments | ☑ | XML doc on RegisterDeliveryHandler contract is inconsistent with code (004). |
Findings
StoreAndForward-001 — Replication to standby is never triggered by the active node
| Severity | Critical |
| Category | Error handling & resilience |
| Status | Resolved |
| Location | src/ScadaLink.StoreAndForward/ReplicationService.cs:40, :53, :66; src/ScadaLink.StoreAndForward/StoreAndForwardService.cs:155, :212, :222, :236 |
Description
ReplicationService exposes ReplicateEnqueue, ReplicateRemove and ReplicatePark
to forward buffer operations to the standby node, but a codebase-wide search shows these
methods have no callers. StoreAndForwardService — which performs every add (EnqueueAsync
line 155 / 163), remove (RemoveMessageAsync call at line 212) and park
(UpdateMessageAsync calls at lines 222/236) — holds no reference to ReplicationService
and never invokes it. Only the receiving half is wired (SetReplicationHandler and
ApplyReplicatedOperationAsync are used by SiteReplicationActor). The Component design
doc ("Persistence") and CLAUDE.md ("Store-and-Forward") require the active node to
forward every buffer operation to the standby so that, on failover, the new active node
"has a near-complete copy of the buffer." As written, the standby's S&F SQLite database
stays empty and a failover loses the entire buffer — a data-loss defect against a core
requirement.
Recommendation
Inject ReplicationService into StoreAndForwardService and call ReplicateEnqueue
after a successful _storage.EnqueueAsync, ReplicateRemove after RemoveMessageAsync,
and ReplicatePark after a park-causing UpdateMessageAsync. Update
ServiceCollectionExtensions.AddStoreAndForward to pass the dependency. Add a test that
asserts the replication handler observes each operation type.
Resolution
Resolved 2026-05-16. ReplicationService is now injected into StoreAndForwardService
(wired in AddStoreAndForward), and every buffer operation is forwarded to the standby:
a new BufferAsync helper calls ReplicateEnqueue after each persist, ReplicateRemove
runs after a successful retry removes a message, and ReplicatePark runs on both park
paths. Replication stays fire-and-forget and is a no-op when ReplicationEnabled is
false or no handler is wired. Regression tests StoreAndForwardReplicationTests assert
the replication handler observes the Add, Remove and Park operations. Fixed by the
commit whose message references StoreAndForward-001.
StoreAndForward-002 — Messages enqueued with no registered handler are buffered but never deliverable
| Severity | Low |
| Original severity | High (re-triaged down to Low on 2026-05-16 — see Re-triage note) |
| Category | Error handling & resilience |
| Status | Deferred |
| Location | src/ScadaLink.StoreAndForward/StoreAndForwardService.cs:162, :201 |
Description
EnqueueAsync falls through to "No handler registered — buffer for later" (line 162)
when no delivery handler is registered for the category. The retry sweep
(RetryMessageAsync, line 201) then logs "No delivery handler for category" and
returns without touching the message. No caller in the codebase ever calls
RegisterDeliveryHandler (the External System Gateway, Notification Service and
Database Gateway only call EnqueueAsync), so in the current wiring every buffered
message lands in this dead state: it is persisted, counts toward buffer depth, but can
never be retried, delivered or parked. It will sit Pending forever. Either the handler
registration is missing from Host/gateway startup, or the "buffer for later" path is a
silent trap. Either way the engine cannot deliver anything.
Recommendation
Decide the intended contract. If handlers are expected to be registered before
EnqueueAsync is reachable, make EnqueueAsync reject (or log an error) when no
handler exists rather than silently buffering an undeliverable message, and wire
RegisterDeliveryHandler calls in Host startup for all three categories. If late
registration is intended, the retry sweep should treat a still-missing handler as a
transient condition with bounded logging rather than a permanent no-op.
Re-triage note (2026-05-16)
The finding's central factual claim — "No caller in the codebase ever calls
RegisterDeliveryHandler" and therefore "every buffered message lands in this dead
state" — is no longer true at the reviewed code. ScadaLink.Host
(AkkaHostedService.RegisterSiteActors, AkkaHostedService.cs:353-379) registers all
three delivery handlers (ExternalSystem, CachedDbWrite, Notification) at site
startup, immediately after StoreAndForwardService.StartAsync(). The finding was
written against commit 9c60592 before that wiring existed; the High-severity
"engine cannot deliver anything" outcome no longer occurs.
The remaining residual risk is narrow: a message enqueued for a category that genuinely
has no handler (e.g. an enqueue racing ahead of RegisterDeliveryHandler, or a future
category added without a handler) is still buffered and then skipped by the sweep
forever. That is a real but minor robustness gap, hence the downgrade to Low.
It is left Open rather than fixed in this pass because the finding's recommended
fix — making EnqueueAsync reject when no handler is registered — is a behavioural
contract change, not a localised bug fix: the "buffer with no handler yet" path is
exercised by StoreAndForwardReplicationTests and by three NotificationService and
ExternalSystemGateway tests (Send_TransientError_WithStoreAndForward_BuffersMessage,
Send_Smtp4xxCommandException_ClassifiedTransientAndBuffered,
Send_SmtpProtocolException_ClassifiedTransient) which construct a real
StoreAndForwardService without registering a handler and assert WasBuffered.
Changing the contract requires deciding whether late handler registration is supported
and updating tests in modules outside this review's edit scope — a design decision that
should be made deliberately rather than forced here.
Resolution
Deferred 2026-05-16 (re-triaged High → Low). Verified again against the source in this
pass: the finding's premise (no RegisterDeliveryHandler caller anywhere) is stale —
ScadaLink.Host wires all three handlers at site startup — so the High-severity
"engine cannot deliver anything" outcome no longer occurs. The residual gap (a message
enqueued for a category that genuinely has no handler is buffered then skipped forever)
is real but minor. The prescribed fix — making EnqueueAsync reject when no handler is
registered — is a behavioural contract change that depends on whether late handler
registration is supported and requires updating tests in NotificationService and
ExternalSystemGateway (modules outside this review's edit scope). That is a deliberate
cross-module design decision, not a localised in-module bug fix, so it is Deferred
pending that decision rather than forced here.
StoreAndForward-003 — Off-by-one in retry accounting: immediate failure pre-counts as retry 1
| Severity | High |
| Category | Correctness & logic bugs |
| Status | Resolved |
| Location | src/ScadaLink.StoreAndForward/StoreAndForwardService.cs:153, :229, :233 |
Description
On a transient immediate-delivery failure, EnqueueAsync buffers the message with
message.RetryCount = 1 (line 153). The retry sweep then increments RetryCount before
the max check (RetryCount++ at line 229; RetryCount >= MaxRetries at line 233).
Consequences: (1) a message configured with MaxRetries = 1 is parked on the first
retry sweep without ever being retried, because after the immediate attempt RetryCount
is already 1 and the first sweep makes it 2 ≥ 1 — zero actual retries occur, contradicting
the design intent that the immediate attempt and the retry budget are distinct;
(2) the design doc's Retry Count field is "Number of attempts so far," but here it is
seeded to 1 before any retry has happened, making the parked-message AttemptCount
shown to operators off by one relative to configured MaxRetries. The
EnqueueAsync_TransientFailure_BuffersForRetry test even asserts RetryCount == 1,
locking in the ambiguity.
Recommendation
Choose one consistent meaning for RetryCount (recommended: total delivery attempts,
including the immediate one) and apply it uniformly. If MaxRetries is meant to bound
retries after the immediate attempt, buffer with RetryCount = 0 and treat the
immediate failure as attempt 0; if it bounds total attempts, document that and adjust
the comparison. Update the affected test to match the chosen semantics.
Resolution
Resolved 2026-05-16 (commit <pending>). RetryCount now consistently means "number
of background retry-sweep attempts so far"; the initial immediate (or caller-made)
delivery attempt is attempt 0 and is not counted, and MaxRetries bounds retry-sweep
attempts after that initial attempt. EnqueueAsync no longer seeds RetryCount = 1 on
either the transient-immediate-failure path or the attemptImmediateDelivery: false
path — a freshly buffered message has RetryCount = 0. RetryMessageAsync already
increments before the >= MaxRetries check, which is now correct, so a message with
MaxRetries = 1 gets exactly one real retry before parking (previously zero). The
StoreAndForwardMessage.RetryCount XML doc was corrected to match. Regression test
RetryPendingMessagesAsync_MaxRetriesOne_PerformsExactlyOneRetryBeforeParking asserts
the immediate attempt plus exactly one retry occur before parking; the affected
existing tests (EnqueueAsync_TransientFailure_BuffersForRetry,
EnqueueAsync_AttemptImmediateDeliveryFalse_BuffersWithoutInvokingHandler,
RetryPendingMessagesAsync_MaxRetriesReached_ParksMessage) were updated to the
corrected semantics.
StoreAndForward-004 — RegisterDeliveryHandler XML doc contradicts the implemented contract
| Severity | Medium |
| Category | Documentation & comments |
| Status | Resolved |
| Location | src/ScadaLink.StoreAndForward/StoreAndForwardService.cs:38, :60 |
Description
The XML comment on the handler delegate (lines 37–40) says "Returns true on success,
throws on transient failure. Permanent failures should return false (message will NOT
be buffered)." That last clause is wrong for the retry path: in RetryMessageAsync,
a handler returning false does not "not buffer" — the message is already buffered, and
the code parks it immediately (lines 218–224). The comment describes only the
EnqueueAsync immediate path and misleads anyone implementing a handler about what
false means once a message is in the retry loop.
Recommendation
Reword the contract to cover both paths explicitly: true = delivered (remove from
buffer); false = permanent failure (not buffered on immediate attempt, parked on a
retry); exception = transient failure (buffer / increment retry).
Resolution
Resolved 2026-05-16 (commit pending). Confirmed the root cause against the source: the
old XML comment's "Permanent failures should return false (message will NOT be
buffered)" describes only the EnqueueAsync immediate path; on the retry path a
false return parks the already-buffered message. Reworded the _deliveryHandlers
field doc into an explicit three-way contract (true = delivered/removed-or-never-
buffered; false = permanent failure = not-buffered-on-immediate / parked-on-retry;
throws = transient = buffered-for-retry / retry-count-incremented), noting it applies
identically on both paths, and pointed RegisterDeliveryHandler's summary at it.
Documentation-only change — no behavioural code touched, so no regression test
(an XML comment is not test-observable).
StoreAndForward-005 — Parked-message retry/discard can race with the in-progress retry sweep
| Severity | Low — re-triaged down from Medium on 2026-05-16; the described data-loss race is not actually reachable, see Re-triage note |
| Original severity | Medium |
| Category | Concurrency & thread safety |
| Status | Resolved |
| Location | src/ScadaLink.StoreAndForward/StoreAndForwardService.cs:184, :266, :280 |
Description
RetryPendingMessagesAsync loads a snapshot of due messages (line 179) and then
processes them one by one (line 184), await-ing delivery for each. Meanwhile
RetryParkedMessageAsync / DiscardParkedMessageAsync (operator actions arriving via
ParkedMessageHandlerActor) run on unrelated threads and mutate the same rows. Because
each operation opens its own SQLite connection and there is no row-level coordination,
an operator can DiscardParkedMessageAsync a message that the sweep is concurrently
delivering: the sweep's later RemoveMessageAsync/UpdateMessageAsync operates on a
now-deleted row (harmless) — but if an operator RetryParkedMessageAsync resets a row
to Pending while the sweep simultaneously parks the same in-flight message, the operator
intent is silently overwritten. The Interlocked guard only prevents overlapping
sweeps, not sweep-vs-management races.
Recommendation
Funnel all message-state mutations through a single serialization point — e.g. perform
all S&F state changes inside the ParkedMessageHandlerActor (or a dedicated S&F actor)
so the actor mailbox serialises them, or make status transitions conditional in SQL
(e.g. UPDATE ... WHERE id = @id AND status = @expected) and re-check the affected
row count.
Re-triage note (2026-05-16)
Verified against the source: the specific race the finding describes — "an operator
RetryParkedMessageAsync resets a row to Pending while the sweep simultaneously parks
the same in-flight message" — cannot occur at the reviewed code. The two operator
operations (StoreAndForwardStorage.RetryParkedMessageAsync /
DiscardParkedMessageAsync) are already SQL-conditional on status = Parked, and the
retry sweep only ever loads and processes rows that are Pending. For the operator and
the sweep to touch the same message the row would have to be simultaneously Parked
(operator-actionable) and in the sweep's in-flight Pending snapshot — mutually
exclusive states. Overlapping sweeps are additionally prevented by the Interlocked
guard, so there is only ever one sweep writer. The described data-loss outcome is
therefore not reachable, which makes the original Medium ("risky behaviour") severity
an over-statement — re-triaged to Low.
What is real is a latent fragility: the sweep's own state-changing writes used the
unconditional UpdateMessageAsync (WHERE id = @id), so the no-clobber guarantee
rested entirely on the Interlocked invariant with zero defence-in-depth if that
invariant were ever broken (e.g. RetryMessageAsync called outside the sweep, or the
guard removed). That residual Low-severity weakness is worth closing, so the
recommendation's conditional-SQL option was applied.
Resolution
Resolved 2026-05-16 (commit pending). Re-triaged Medium → Low (the described race is
unreachable; see Re-triage note) and the residual latent fragility fixed: added
StoreAndForwardStorage.UpdateMessageIfStatusAsync, a conditional update
(UPDATE ... WHERE id = @id AND status = @expectedStatus) returning whether a row was
written. RetryMessageAsync's three state-changing writes (park-on-permanent-failure,
park-on-max-retries, retry-count increment) now use it with expectedStatus = Pending,
so the sweep can never overwrite a row whose status changed underneath it; a skipped
write is logged and the message left for the next sweep. Regression test
RetryMessageAsync_StatusChangedDuringDelivery_SweepParkWriteIsSkipped drives a writer
that moves the row out of Pending mid-delivery and asserts the sweep's stale park
write is skipped (it failed against the pre-fix unconditional update, clobbering the
other writer's RetryCount).
StoreAndForward-006 — GetParkedMessagesAsync count and page run without a transaction
| Severity | Low |
| Category | Performance & resource management |
| Status | Resolved |
| Location | src/ScadaLink.StoreAndForward/StoreAndForwardStorage.cs:166, :175 |
Description
GetParkedMessagesAsync issues a COUNT(*) and then a separate paged SELECT on two
commands on the same connection with no surrounding transaction. A concurrent
enqueue/park/discard between the two statements yields a TotalCount inconsistent with
the returned page (e.g. total reported as 51 while only 50 distinct parked rows now
exist, or a row visible in the page but excluded from the count). For a paginated UI
this produces flickering totals and occasional off-by-one page math.
Recommendation
Wrap both reads in a single transaction (BeginTransaction) so they see a consistent
snapshot, or accept the staleness and document it. A transaction is cheap here and
removes the inconsistency.
Resolution
Resolved 2026-05-16 (commit pending). Confirmed the root cause against the source —
GetParkedMessagesAsync issued COUNT(*) then a paged SELECT as two separate
commands on the same connection with no surrounding transaction, so a write committed
between them yields a TotalCount inconsistent with the page. Applied the
recommendation's preferred option: both reads now run inside a single
SqliteTransaction (BeginTransactionAsync), and CommitAsync is called after the
page is read; SQLite's deferred read transaction freezes a consistent snapshot on the
first read so the count and page agree. Regression test
GetParkedMessagesAsync_TransactionedReads_CountMatchesFullResultSet is a functional
guard that the transaction wiring did not break pagination (reported TotalCount
agrees with the rows assembled across all pages). Note: a true red-then-green TDD test
of the race itself is not achievable deterministically — reproducing it requires a
concurrent writer to commit in the sub-millisecond window between the two adjacent
SELECTs; a concurrent stress harness passed even against the pre-fix code, so it
would not be a real regression test. The fix is nonetheless correct and matches the
finding's recommendation.
StoreAndForward-007 — Async work in ParkedMessageHandlerActor uses ContinueWith without scheduler/affinity guarantees
| Severity | Low |
| Category | Akka.NET conventions |
| Status | Resolved |
| Location | src/ScadaLink.StoreAndForward/ParkedMessageHandlerActor.cs:34, :68, :87 |
Description
The three handlers call a Task-returning service method and chain .ContinueWith(...) .PipeTo(sender). Sender is correctly captured into a local first, so the closure is
safe. However ContinueWith without an explicit TaskScheduler runs the continuation
on a thread-pool thread and the captured continuation builds the response objects there
— acceptable since it only touches locals, but it bypasses the idiomatic
PipeTo-with-success/failure-projection pattern and is fragile if someone later adds a
line touching actor state inside the continuation. There is also no TaskContinuationOptions,
so a faulted antecedent still runs the continuation (handled here via IsCompletedSuccessfully,
but only by convention).
Recommendation
Replace ContinueWith(...).PipeTo(sender) with PipeTo(sender, success: result => ..., failure: ex => ...), which is the documented Akka pattern, keeps response construction
off the actor thread safely, and makes the success/failure branches explicit.
Resolution
Resolved 2026-05-16 (commit pending). Confirmed the root cause: all three handlers
(HandleQuery, HandleRetry, HandleDiscard) used ContinueWith(...).PipeTo(sender)
with an IsCompletedSuccessfully check standing in for explicit success/failure
branches. Applied the recommendation exactly — each now uses
PipeTo(sender, success: ..., failure: ...), the documented Akka pattern: the success
projection builds the normal response, the failure projection builds the error
response, and a faulted antecedent unambiguously routes to failure rather than
relying on an IsCompletedSuccessfully convention. Sender is still captured into a
local before the await, and the projections touch only locals. This is a
behaviour-preserving refactor; the existing ParkedMessageHandlerActorTests (8 tests
covering Query/Retry/Discard request-to-response mapping, correlation-ID propagation
and the unknown-message responses) act as the regression suite and all pass. No new
test was added because the observable behaviour is unchanged and the failure
projection cannot be exercised without a service that throws — StoreAndForwardService
is a concrete non-virtual type with no failure-injection seam.
StoreAndForward-008 — A SQLite connection is opened and torn down on every storage call
| Severity | Low |
| Category | Performance & resource management |
| Status | Resolved |
| Location | src/ScadaLink.StoreAndForward/StoreAndForwardStorage.cs:28, :61, :93, :117, :144, :162, :199, :221, :237, :267, :285, :305, :319 |
Description
Every method in StoreAndForwardStorage constructs a fresh SqliteConnection and calls
OpenAsync. Microsoft.Data.Sqlite pools connections, so this is not a correctness bug,
but a retry sweep over a large buffer performs one open per UpdateMessageAsync/
RemoveMessageAsync call inside the loop (RetryMessageAsync), multiplying connection
churn under load. With no max buffer size (by design) the buffer can grow large, so the
per-message connection acquisition is a measurable overhead on the hot retry path.
Recommendation
Consider a batched retry API that opens one connection (and one transaction) per sweep, or pass an open connection into the per-message update calls. At minimum, document that the design relies on the Sqlite connection pool for acceptable performance.
Resolution
Resolved 2026-05-16 (commit pending). Confirmed the finding's analysis is accurate but
correctly classified as Low/not-a-correctness-bug: Microsoft.Data.Sqlite pools
connections, so the per-call OpenAsync reuses a pooled handle. Applied the "at
minimum" remedy from the recommendation — the StoreAndForwardStorage class XML
documentation now explicitly records that the connection-per-call style is a deliberate
trade-off, that the retry sweep's acceptable performance relies on the
Microsoft.Data.Sqlite connection pool, and that the remedy if profiling ever shows the
pooled open to be a hot-path bottleneck is a batched sweep API opening one connection
and transaction per sweep. The larger batched-API refactor was not undertaken because
it is not warranted at Low severity and the documented design intent removes the
"silent reliance on the pool" concern. Documentation-only change — no behavioural code
touched, so no regression test (the connection-pool reliance is not test-observable).
StoreAndForward-009 — OnActivity event invocation is not thread-safe against concurrent subscribe/unsubscribe
| Severity | Low |
| Category | Concurrency & thread safety |
| Status | Resolved |
| Location | src/ScadaLink.StoreAndForward/StoreAndForwardService.cs:46, :309 |
Description
OnActivity is a public event Action<...> raised via OnActivity?.Invoke(...) in
RaiseActivity (line 309). RaiseActivity is called from both EnqueueAsync (caller
thread) and RetryMessageAsync (timer thread). The ?.Invoke null-conditional captures
the delegate once so it will not NRE, but there is no synchronisation around the event
field itself; a subscriber added/removed concurrently with a raise has no defined
ordering. More importantly, subscriber callbacks run synchronously on the timer thread,
so a slow or throwing subscriber stalls or aborts the retry sweep (an exception in a
subscriber propagates out of RaiseActivity into RetryMessageAsync's try and is
swallowed as a "transient failure," wrongly incrementing the message's retry count).
Recommendation
Snapshot the delegate (already done) and additionally wrap subscriber invocation in a
try/catch so a faulting logging subscriber cannot be misclassified as a delivery
failure. Document that handlers must be fast and non-throwing, or dispatch activity
notifications asynchronously.
Resolution
Resolved 2026-05-16 (commit pending). Confirmed — and found the impact is worse than
the finding states. RaiseActivity previously did OnActivity?.Invoke(...); a
throwing subscriber's exception escaped it. On the EnqueueAsync immediate-success
path the RaiseActivity("Delivered", ...) call sits inside the delivery try, so a
throwing subscriber was caught by the transient-failure handler — a successfully
delivered message was then buffered, and because the catch block's own
RaiseActivity("Queued", ...) also threw, the exception escaped EnqueueAsync
entirely. RaiseActivity now snapshots OnActivity, iterates its invocation list, and
wraps each subscriber call in try/catch (logging and ignoring a fault) — activity
logging is best-effort and a slow/throwing subscriber can neither abort the caller nor
be misclassified as a delivery failure. Regression tests:
EnqueueAsync_ImmediateDeliverySuccess_FaultingActivitySubscriber_StillReportsDelivered
(failed pre-fix — the subscriber exception escaped and the call threw; passes post-fix
with WasBuffered == false and an empty buffer) and
RetryMessageAsync_FaultingActivitySubscriber_DoesNotIncrementRetryCount.
StoreAndForward-010 — Retry of a parked message does not reset LastAttemptAt, so its retry timing is unspecified
| Severity | Medium |
| Category | Correctness & logic bugs |
| Status | Resolved |
| Location | src/ScadaLink.StoreAndForward/StoreAndForwardStorage.cs:203, :101 |
Description
RetryParkedMessageAsync sets status = Pending, retry_count = 0, last_error = NULL
but leaves last_attempt_at unchanged (line 203–206). The retry-due query
(GetMessagesForRetryAsync, line 101–105) selects Pending rows where
last_attempt_at IS NULL OR ... elapsed >= retry_interval_ms. A message parked after
exhausting retries has an old last_attempt_at; once re-queued, the elapsed time since
that stale timestamp is almost certainly already greater than the retry interval, so the
operator-retried message is attempted on the very next sweep regardless of the
configured interval. That is probably the desired behaviour (operator wants it tried
now), but it is unspecified and inconsistent — if retry_interval_ms were very large the
behaviour would instead be "try immediately" by accident rather than by design.
Recommendation
Explicitly decide and encode the intent: either set last_attempt_at = NULL on
re-queue so the message is unambiguously due now, or set it to "now" so it waits one
interval. Document the chosen behaviour in the method's XML comment.
Resolution
Resolved 2026-05-16 (commit pending). Confirmed against the source: RetryParkedMessage Async reset status, retry_count and last_error but left last_attempt_at stale,
so the operator-retried message's retry timing depended on the elapsed time since the
original pre-park attempt. Encoded the intent explicitly — an operator retry means
"attempt this again now" — by also setting last_attempt_at = NULL in the UPDATE, so
the re-queued message is unambiguously due on the next sweep regardless of the
configured retry_interval_ms. The method's XML comment now documents this. Regression
test RetryParkedMessageAsync_ClearsLastAttemptAt_SoMessageIsImmediatelyDue uses a
1-hour retry interval and a recent last_attempt_at; it failed pre-fix (timestamp not
cleared, message excluded from the retry-due set) and passes post-fix.
StoreAndForward-011 — StoreAndForwardMessageStatus.InFlight is unused and the doc's "retrying" status is unmodelled
| Severity | Low |
| Category | Design-document adherence |
| Status | Deferred |
| Location | src/ScadaLink.Commons/Types/Enums/StoreAndForwardMessageStatus.cs:9; src/ScadaLink.StoreAndForward/StoreAndForwardService.cs:219, :235 |
Description
The enum defines Pending, InFlight, Parked, Delivered. The module only ever uses
Pending and Parked — InFlight and Delivered are never assigned (delivered
messages are deleted, not marked Delivered). Meanwhile the Component design doc
("Message Format" -> Status) specifies the set "Pending, retrying, or parked." So the
code's enum drifts from the doc in two directions: it carries dead members the doc does
not mention (InFlight, Delivered) and omits the doc's retrying state. A message
mid-retry is indistinguishable from one that has never been attempted.
Recommendation
Reconcile the enum with the design. Either drop the unused members and update the doc,
or implement the documented retrying state and use InFlight to mark a message the
sweep is actively delivering (which would also help with finding 005).
Resolution
Deferred 2026-05-16. Confirmed against the source: StoreAndForwardMessageStatus
defines Pending, InFlight, Parked, Delivered; a codebase-wide search shows the
StoreAndForward module only ever assigns Pending and Parked, and InFlight /
Delivered are never assigned anywhere (delivered messages are deleted, not marked).
The design doc's retrying state is unmodelled. Both options the recommendation offers
— (a) drop the unused InFlight/Delivered members, or (b) add a Retrying member —
require editing StoreAndForwardMessageStatus.cs, which lives in src/ScadaLink.Commons
(outside this review's edit scope: only src/ScadaLink.StoreAndForward/** may be
changed). The enum is also referenced by IntegrationTests and HealthMonitoring tests, so
removing members is a cross-module change. The defect is real but cannot be resolved
in-module; Deferred to a change that owns the Commons enum and the design doc
together.
StoreAndForward-012 — StoreAndForwardMessage is a persistence entity but lives in the component, not Commons
| Severity | Low |
| Category | Code organization & conventions |
| Status | Deferred |
| Location | src/ScadaLink.StoreAndForward/StoreAndForwardMessage.cs:9 |
Description
StoreAndForwardMessage is a persistence-ignorant POCO that maps directly to the
sf_messages table and is also carried across the network inside ReplicationOperation
(replicated to the standby node over Akka remoting). CLAUDE.md "Code Organization" states
that entity classes are persistence-ignorant POCOs in Commons and that message contracts
follow additive-only evolution. Because this type doubles as a replication wire contract
but lives in the component assembly, it is not co-located with the other Commons
entities and its evolution is not governed by the additive-only message-contract rule.
This is a borderline case (the type is site-local), but the cross-node use via
ReplicationOperation makes it a de-facto message contract.
Recommendation
Either move StoreAndForwardMessage (and ReplicationOperation) into the Commons
Entities/Messages hierarchy so they are governed by the contract-evolution rules, or
introduce a separate DTO for replication and keep StoreAndForwardMessage purely as the
local persistence model. Document the decision.
Resolution
Deferred 2026-05-16. Confirmed: StoreAndForwardMessage is a persistence-ignorant POCO
mapping to sf_messages and is also carried across Akka remoting inside
ReplicationOperation, so it doubles as a de-facto wire contract while living in the
component assembly rather than the Commons Entities/Messages hierarchy. The
recommendation's primary remedy — moving StoreAndForwardMessage (and
ReplicationOperation) into Commons — crosses module boundaries (it would add a type to
src/ScadaLink.Commons, outside this review's edit scope of
src/ScadaLink.StoreAndForward/**, and change every referencing module). The alternative
"separate replication DTO" still leaves the persistence entity in the component, so it
does not actually resolve the finding's core concern (entity placement / contract-
evolution governance). Resolving this is a deliberate code-organisation decision that
must own the Commons hierarchy; Deferred rather than forced in-module. Flagged for a
cross-module follow-up.
StoreAndForward-013 — Critical paths lack test coverage: retry-due timing, replication-from-active, and the actor bridge
| Severity | Medium |
| Category | Testing coverage |
| Status | Resolved |
| Location | tests/ScadaLink.StoreAndForward.Tests/ (whole directory); src/ScadaLink.StoreAndForward/StoreAndForwardStorage.cs:101; src/ScadaLink.StoreAndForward/ParkedMessageHandlerActor.cs |
Description
The existing tests cover storage CRUD and the service happy/failure paths well, but
three important behaviours are untested: (1) the retry-due time filter in
GetMessagesForRetryAsync — every service test sets DefaultRetryInterval = TimeSpan.Zero,
so the julianday elapsed-time comparison (the most error-prone SQL in the module) is
never exercised with a non-zero interval; a message that is not yet due should be
skipped, and that is never verified. (2) Replication from the active side — no test
asserts that an enqueue/remove/park causes a Replicate* call (this is exactly the gap
behind finding 001; a test would have caught it). (3) ParkedMessageHandlerActor has no
test at all — the Query/Retry/Discard request-to-response mapping and the
ExtractMethodName JSON parsing are unverified, including the malformed-JSON branch.
Recommendation
Add tests for: a non-zero retry interval where a recently-attempted message is excluded
and an older one is included; active-side replication invocation per operation type
(once finding 001 is fixed); and ParkedMessageHandlerActor using Akka.TestKit,
including ExtractMethodName for MethodName, Subject, missing-property and
invalid-JSON payloads.
Resolution
Resolved 2026-05-16 (commit pending). All three gaps closed.
(1) Retry-due timing: GetMessagesForRetryAsync_NonZeroInterval_ExcludesNotYetDue IncludesDue exercises the julianday elapsed-time SQL with non-zero intervals — a
just-attempted message (1-hour interval) is excluded, an old one (attempted 2h ago,
5-min interval) and a never-attempted one are included.
(2) Active-side replication: this sub-claim was already stale at the reviewed code —
StoreAndForwardReplicationTests (added with finding 001's fix) asserts the Add, Remove
and Park replication operations; no new test was needed.
(3) ParkedMessageHandlerActor: new ParkedMessageHandlerActorTests using
Akka.TestKit.Xunit2 (package reference added to the test project) covers Query/Retry/
Discard request-to-response mapping, correlation-ID propagation, the unknown-message
failure responses, and ExtractMethodName for MethodName, Subject, missing-property
and malformed-JSON payloads (all falling back to the category name without throwing).
These are coverage-gap tests over already-correct code, so they pass on first run.
StoreAndForward-014 — Storage does not create its SQLite database directory
| Severity | Medium |
| Category | Error handling & resilience |
| Status | Resolved |
| Location | src/ScadaLink.StoreAndForward/StoreAndForwardStorage.cs:26 |
Found 2026-05-16 while verifying the store-and-forward fixes — this defect was not part of the original baseline review.
Description
StoreAndForwardStorage.InitializeAsync opened a SqliteConnection against the
configured SqliteDbPath (default ./data/store-and-forward.db) without ensuring the
parent directory exists. SQLite creates the database file on demand but not its
directory, so when data/ does not already exist the connection fails to open with
SQLite Error 14: 'unable to open database file'. Every site-host boot therefore failed
in any environment whose working directory has no data/ folder — the cause of the six
failing SiteActorPathTests (the host's RegisterSiteActors aborts at
StoreAndForwardService.StartAsync). Production masked it because data/ is created by
the Docker image / deployment.
Recommendation
Create the parent directory of a file-backed SQLite database before opening it.
Resolution
Resolved 2026-05-16. InitializeAsync now calls a new EnsureDatabaseDirectoryExists
helper that parses the connection string with SqliteConnectionStringBuilder and, for a
file-backed database, creates the parent directory if it is missing (in-memory databases
and bare filenames are skipped). Regression test
InitializeAsync_FileInMissingDirectory_CreatesDirectory fails against the pre-fix code;
all six SiteActorPathTests now pass. Fixed by the commit whose message references
StoreAndForward-014.