Files
ScadaBridge/code-reviews/StoreAndForward/findings.md
T
Joseph Doherty ac96b83b08 fix(high-severity): close 9 of 10 open High findings across 8 modules
Comm-016: delete dead HandleConnectionStateChanged + _debugSubscriptions /
_inProgressDeployments tracking + ConnectionStateChanged message record.
Disconnect detection is owned by the transport layers (gRPC keepalive PING
~25s; Ask-timeout at CommunicationService). Updates the
Component-Communication.md design doc to make that explicit.

SnF-018: NotificationForwarder.DeliverAsync now discards a corrupt buffered
payload (Warning log + return true) instead of returning false and parking
the row — honoring the design's "notifications do not park" invariant.

DM-018: reconciliation no longer force-sets Enabled, preserving an
intentional Disabled state after central failover.

ESG-018: DeliverBufferedAsync (both ExternalSystemClient + DatabaseGateway)
catches JsonException and returns false, turning a corrupt buffered row
into a parked operation instead of a retry-forever poison message.

InboundAPI-022: register ActiveNodeGate as IActiveNodeGate in the Central
DI branch so standby-node gating is actually wired up in production.

NS-019: remove orphaned NotificationDeliveryService /
INotificationDeliveryService / NotificationResult; central notification
delivery now lives entirely in NotificationOutbox.

SEL-016: normalise From/To filters to UTC before ISO-string compare so
non-UTC DateTimeOffset clients no longer get spuriously excluded events.

TE-017: include Description on attributes/alarms and a HashableConnections
projection (protocol, endpoint JSON, failover count) in the revision hash
and DiffService; staleness detection now catches description-only and
connection-endpoint edits.

Transport-001 and Transport-002 (also High) remain Open — they're being
handled in a follow-up batch because both touch BundleImporter.cs and
must serialise.
2026-05-28 05:40:15 -04:00

82 KiB
Raw Blame History

Code Review — StoreAndForward

Field Value
Module src/ScadaLink.StoreAndForward
Design doc docs/requirements/Component-StoreAndForward.md
Status Reviewed
Last reviewed 2026-05-28
Reviewer claude-agent
Commit reviewed 1eb6e97
Open findings 7 (3 Deferred: 002, 011, 012; 7 new Open: 018024 — see Re-review 2026-05-28)

Summary

The Store-and-Forward module is small and readable, with a clean SQLite persistence layer, a sensible service API, and reasonable test coverage of the storage and service happy paths. However the review surfaced two issues that undermine the module's core purpose. First, the active delivery path never invokes the ReplicationServiceReplicateEnqueue/Remove/Park have no callers anywhere in the codebase, so buffered messages are not replicated to the standby node and the design's failover-durability guarantee (Component doc "Persistence", CLAUDE.md "Store-and-Forward") is not met. Second, there is an off-by-one in retry accounting: the immediate-failure path stores a buffered message with RetryCount = 1, so a message configured with MaxRetries = N is actually attempted N times in total rather than one immediate attempt plus N retries, and a per-source MaxRetries of 1 produces zero retry attempts. Additional themes: SQLite connection-per-call with no transactional grouping of multi-statement operations, no concurrency guard against a parked message being retried while the sweep is mid-flight, an unused enum member (InFlight) that drifts from the documented status set, and untested critical paths (retry-due timing, replication-from-active, the actor bridge). None of the findings are blockers for compilation, but the replication and retry-count issues are functional defects against the design.

Re-review 2026-05-17 (commit 39d737e)

Re-reviewed at commit 39d737e after the batch-3 fixes. All of findings 001 and 003010, plus 013014, are confirmed Resolved against the current source: the replication wiring (BufferAsync/ReplicateRemove/ReplicatePark), the corrected retry-count semantics, the conditional UpdateMessageIfStatusAsync writes, the transactioned parked-message reads, the PipeTo refactor, the RaiseActivity hardening, the RetryParkedMessageAsync last_attempt_at reset and the database directory creation are all present as described. Findings 002, 011 and 012 were re-verified and remain validly Deferred — their preconditions are unchanged (002's residual no-handler gap, 011's Commons-owned enum, 012's Commons-owned entity placement).

This pass surfaced three new findings. StoreAndForward-015 records the StoreAndForward side of the cross-module MaxRetries == 0 ambiguity flagged by ExternalSystemGateway-015: EnqueueAsync's public contract documents maxRetries only as "parked once MaxRetries is reached" and never states the 0 = no limit / retry forever special case that RetryMessageAsync actually enforces, so an ESG caller passing 0 to mean "never retry" gets the opposite behaviour with no warning from the S&F API surface. StoreAndForward-016 records that operator-initiated parked-message retry and discard are not replicated to the standby — only the add/remove/park sweep paths are — so a failover diverges the standby buffer from the active one. StoreAndForward-017 records that the Retry/Discard activity-log entries hard-code the ExternalSystem category, mislabelling notification and cached-DB-write messages in the site event log.

Re-review 2026-05-28 (commit 1eb6e97)

Full re-review against commit 1eb6e97 with the same 10-category checklist. The batch-3 / batch-4 resolutions (001, 003010, 013017) are still present and intact; no regressions detected on prior fixes. Findings 002, 011 and 012 remain validly Deferred (their preconditions are unchanged) and findings 005, 006, 010, 013, 014, 016, 017 are confirmed Resolved against the current source.

This pass surfaced seven new findings clustered around two themes:

The first theme is design-doc drift on the notification path, which has acquired two now-real defects since the engine became central-targeted. StoreAndForward-018 (High) records that a corrupt notification payload — handled in NotificationForwarder. DeliverAsync by returning false — parks a notification on its first retry-sweep encounter, despite the design doc stating "Notifications do not park" (line 47, "Parking applies only to the external-system-call and cached-database-write categories"). The same path becomes a poison-payload retry-forever trap on the active node if the engine ever softened the false semantics. StoreAndForward-019 (Medium) records the sibling defect: notifications are enqueued with MaxRetries defaulting to StoreAndForwardOptions.DefaultMaxRetries (50), and the legacy SMTP path (NotificationDeliveryService.SendAsync) passes a positive bounded smtpConfig. MaxRetries — so an unreachable central will silently park notifications after a finite retry budget rather than "retry at the fixed forward interval until central acks" as the design requires. The contract 0 = no limit is not enforced for the notification category.

The second theme is subtle correctness and contract gaps around the operator paths that survived the StoreAndForward-016/017 batch. StoreAndForward-020 (Medium) records that RetryParkedMessageAsync skips replication entirely if GetMessageByIdAsync returns null after a successful local requeue (a narrow but real race window with a concurrent discard / sweep delete), re-introducing the StoreAndForward-016 standby divergence in that corner. StoreAndForward-021 (Medium) is a design-doc-vs-code drift that should be reconciled in the doc: the operation tracking table is documented inside Component-StoreAndForward.md as a S&F responsibility (lines 21, 49, 7787, 108, 114), but the actual OperationTrackingStore lives in src/ScadaLink.SiteRuntime/ Tracking/ and is not consumed by S&F at all — the brief's own note flags this. The design doc should be updated to point at SiteRuntime, or the store moved to StoreAndForward.

StoreAndForward-022 (Low) records that _cachedCallObserver silently drops audit telemetry when a buffered cached-call's Id is not a parseable TrackedOperationId GUID — the engine returns from NotifyCachedCallObserverAsync before emitting anything, so a legacy enqueue path that buffered a non-GUID id (the engine's own default minting produces "N"-formatted GUIDs, which TrackedOperationId.TryParse accepts, but any caller passing a custom non-GUID id silently bypasses the entire Submitted/Forwarded/ Attempted/Delivered/Parked/Discarded audit lifecycle). StoreAndForward-023 (Low) records that siteId is silently defaulted to string.Empty when no IStoreAndForwardSiteContext is registered, so a misconfigured host produces audit telemetry with SourceSite = "" and the central audit-log's (SourceSite, TrackedOperationId) correlation degrades to a per-id-only index. StoreAndForward-024 (Low) is a stop-time ordering defect: StopAsync disposes the timer but a mid-flight RetryPendingMessagesAsync invocation continues using _storage and _replication after StopAsync returns; downstream resources disposed by the host shutdown sequence (the DI container) can then NRE through the still-running sweep.

Checklist coverage — Re-review 2026-05-28

# Category Examined Notes
1 Correctness & logic bugs Notification corrupt-payload parks contrary to design (018); RetryParkedMessageAsync skips replication when message reload races a deletion (020).
2 Akka.NET conventions ParkedMessageHandlerActor uses PipeTo correctly with success/failure projections (007 resolution preserved). No new findings.
3 Concurrency & thread safety Sweep-vs-stop race: a timer callback running while StopAsync returns can touch disposed dependencies (024).
4 Error handling & resilience Notifications park after DefaultMaxRetries exhaustion (019) — contradicts the design doc's "retried until central acks".
5 Security No issues found — parameterised SQL throughout, payload JSON opaque, no secret material handled.
6 Performance & resource management No new findings — the connection-per-call documented trade-off and pooled OpenAsync remain acceptable.
7 Design-document adherence Operation Tracking Table documented in StoreAndForward but actually lives in SiteRuntime (021); notification non-parking guarantee broken by 018 + 019.
8 Code organization & conventions IStoreAndForwardSiteContext silently defaults SiteId to empty (023) — a configuration hole rather than an entity placement issue.
9 Testing coverage The seven new findings have no regression tests in tests/ScadaLink.StoreAndForward.Tests/ — particularly the notification-doesn't-park invariant (018, 019), the requeue-after-reload-null replication gap (020), and the stop-during-sweep behaviour (024).
10 Documentation & comments CachedCallAttemptOutcome.ParkedMaxRetries XML doc says "S&F semantics" but the code applies it to notifications too if 018/019 fire — minor drift, captured under 018. The TrackedOperationId.TryParse silent-skip behaviour in NotifyCachedCallObserverAsync is documented in the source but not on the public observer contract (022).

Checklist coverage

# Category Examined Notes
1 Correctness & logic bugs Off-by-one in retry counting (003); parked-message retry timing (010); Retry/Discard activity log hard-codes the category (017).
2 Akka.NET conventions ContinueWith used instead of PipeTo-friendly continuations; default supervision; see 007.
3 Concurrency & thread safety Sweep guarded by Interlocked, but no guard against retry-vs-manage races (005); OnActivity event not thread-safe (009).
4 Error handling & resilience Replication never invoked from active path (001); no-handler messages buffered then stuck (002); operator retry/discard not replicated to standby (016).
5 Security No issues found — parameterised SQL throughout; no secrets handled directly; payload JSON treated opaquely.
6 Performance & resource management New SQLite connection per call; multi-statement operations not wrapped in a transaction (006, 008).
7 Design-document adherence Replication gap (001); InFlight status undocumented/unused (011); "retrying" status from design doc not modelled.
8 Code organization & conventions StoreAndForwardMessage is an entity-like POCO living in the component, not Commons (012).
9 Testing coverage Retry-due timing, replication-from-active, and ParkedMessageHandlerActor are untested (013).
10 Documentation & comments XML doc on RegisterDeliveryHandler contract is inconsistent with code (004); EnqueueAsync never documents the maxRetries == 0 = "retry forever" special case (015).

Findings

StoreAndForward-001 — Replication to standby is never triggered by the active node

Severity Critical
Category Error handling & resilience
Status Resolved
Location src/ScadaLink.StoreAndForward/ReplicationService.cs:40, :53, :66; src/ScadaLink.StoreAndForward/StoreAndForwardService.cs:155, :212, :222, :236

Description

ReplicationService exposes ReplicateEnqueue, ReplicateRemove and ReplicatePark to forward buffer operations to the standby node, but a codebase-wide search shows these methods have no callers. StoreAndForwardService — which performs every add (EnqueueAsync line 155 / 163), remove (RemoveMessageAsync call at line 212) and park (UpdateMessageAsync calls at lines 222/236) — holds no reference to ReplicationService and never invokes it. Only the receiving half is wired (SetReplicationHandler and ApplyReplicatedOperationAsync are used by SiteReplicationActor). The Component design doc ("Persistence") and CLAUDE.md ("Store-and-Forward") require the active node to forward every buffer operation to the standby so that, on failover, the new active node "has a near-complete copy of the buffer." As written, the standby's S&F SQLite database stays empty and a failover loses the entire buffer — a data-loss defect against a core requirement.

Recommendation

Inject ReplicationService into StoreAndForwardService and call ReplicateEnqueue after a successful _storage.EnqueueAsync, ReplicateRemove after RemoveMessageAsync, and ReplicatePark after a park-causing UpdateMessageAsync. Update ServiceCollectionExtensions.AddStoreAndForward to pass the dependency. Add a test that asserts the replication handler observes each operation type.

Resolution

Resolved 2026-05-16. ReplicationService is now injected into StoreAndForwardService (wired in AddStoreAndForward), and every buffer operation is forwarded to the standby: a new BufferAsync helper calls ReplicateEnqueue after each persist, ReplicateRemove runs after a successful retry removes a message, and ReplicatePark runs on both park paths. Replication stays fire-and-forget and is a no-op when ReplicationEnabled is false or no handler is wired. Regression tests StoreAndForwardReplicationTests assert the replication handler observes the Add, Remove and Park operations. Fixed by the commit whose message references StoreAndForward-001.

StoreAndForward-002 — Messages enqueued with no registered handler are buffered but never deliverable

Severity Low
Original severity High (re-triaged down to Low on 2026-05-16 — see Re-triage note)
Category Error handling & resilience
Status Deferred
Location src/ScadaLink.StoreAndForward/StoreAndForwardService.cs:162, :201

Description

EnqueueAsync falls through to "No handler registered — buffer for later" (line 162) when no delivery handler is registered for the category. The retry sweep (RetryMessageAsync, line 201) then logs "No delivery handler for category" and returns without touching the message. No caller in the codebase ever calls RegisterDeliveryHandler (the External System Gateway, Notification Service and Database Gateway only call EnqueueAsync), so in the current wiring every buffered message lands in this dead state: it is persisted, counts toward buffer depth, but can never be retried, delivered or parked. It will sit Pending forever. Either the handler registration is missing from Host/gateway startup, or the "buffer for later" path is a silent trap. Either way the engine cannot deliver anything.

Recommendation

Decide the intended contract. If handlers are expected to be registered before EnqueueAsync is reachable, make EnqueueAsync reject (or log an error) when no handler exists rather than silently buffering an undeliverable message, and wire RegisterDeliveryHandler calls in Host startup for all three categories. If late registration is intended, the retry sweep should treat a still-missing handler as a transient condition with bounded logging rather than a permanent no-op.

Re-triage note (2026-05-16)

The finding's central factual claim — "No caller in the codebase ever calls RegisterDeliveryHandler" and therefore "every buffered message lands in this dead state" — is no longer true at the reviewed code. ScadaLink.Host (AkkaHostedService.RegisterSiteActors, AkkaHostedService.cs:353-379) registers all three delivery handlers (ExternalSystem, CachedDbWrite, Notification) at site startup, immediately after StoreAndForwardService.StartAsync(). The finding was written against commit 9c60592 before that wiring existed; the High-severity "engine cannot deliver anything" outcome no longer occurs.

The remaining residual risk is narrow: a message enqueued for a category that genuinely has no handler (e.g. an enqueue racing ahead of RegisterDeliveryHandler, or a future category added without a handler) is still buffered and then skipped by the sweep forever. That is a real but minor robustness gap, hence the downgrade to Low.

It is left Open rather than fixed in this pass because the finding's recommended fix — making EnqueueAsync reject when no handler is registered — is a behavioural contract change, not a localised bug fix: the "buffer with no handler yet" path is exercised by StoreAndForwardReplicationTests and by three NotificationService and ExternalSystemGateway tests (Send_TransientError_WithStoreAndForward_BuffersMessage, Send_Smtp4xxCommandException_ClassifiedTransientAndBuffered, Send_SmtpProtocolException_ClassifiedTransient) which construct a real StoreAndForwardService without registering a handler and assert WasBuffered. Changing the contract requires deciding whether late handler registration is supported and updating tests in modules outside this review's edit scope — a design decision that should be made deliberately rather than forced here.

Resolution

Deferred 2026-05-16 (re-triaged High → Low). Verified again against the source in this pass: the finding's premise (no RegisterDeliveryHandler caller anywhere) is stale — ScadaLink.Host wires all three handlers at site startup — so the High-severity "engine cannot deliver anything" outcome no longer occurs. The residual gap (a message enqueued for a category that genuinely has no handler is buffered then skipped forever) is real but minor. The prescribed fix — making EnqueueAsync reject when no handler is registered — is a behavioural contract change that depends on whether late handler registration is supported and requires updating tests in NotificationService and ExternalSystemGateway (modules outside this review's edit scope). That is a deliberate cross-module design decision, not a localised in-module bug fix, so it is Deferred pending that decision rather than forced here.

StoreAndForward-003 — Off-by-one in retry accounting: immediate failure pre-counts as retry 1

Severity High
Category Correctness & logic bugs
Status Resolved
Location src/ScadaLink.StoreAndForward/StoreAndForwardService.cs:153, :229, :233

Description

On a transient immediate-delivery failure, EnqueueAsync buffers the message with message.RetryCount = 1 (line 153). The retry sweep then increments RetryCount before the max check (RetryCount++ at line 229; RetryCount >= MaxRetries at line 233). Consequences: (1) a message configured with MaxRetries = 1 is parked on the first retry sweep without ever being retried, because after the immediate attempt RetryCount is already 1 and the first sweep makes it 2 ≥ 1 — zero actual retries occur, contradicting the design intent that the immediate attempt and the retry budget are distinct; (2) the design doc's Retry Count field is "Number of attempts so far," but here it is seeded to 1 before any retry has happened, making the parked-message AttemptCount shown to operators off by one relative to configured MaxRetries. The EnqueueAsync_TransientFailure_BuffersForRetry test even asserts RetryCount == 1, locking in the ambiguity.

Recommendation

Choose one consistent meaning for RetryCount (recommended: total delivery attempts, including the immediate one) and apply it uniformly. If MaxRetries is meant to bound retries after the immediate attempt, buffer with RetryCount = 0 and treat the immediate failure as attempt 0; if it bounds total attempts, document that and adjust the comparison. Update the affected test to match the chosen semantics.

Resolution

Resolved 2026-05-16 (commit <pending>). RetryCount now consistently means "number of background retry-sweep attempts so far"; the initial immediate (or caller-made) delivery attempt is attempt 0 and is not counted, and MaxRetries bounds retry-sweep attempts after that initial attempt. EnqueueAsync no longer seeds RetryCount = 1 on either the transient-immediate-failure path or the attemptImmediateDelivery: false path — a freshly buffered message has RetryCount = 0. RetryMessageAsync already increments before the >= MaxRetries check, which is now correct, so a message with MaxRetries = 1 gets exactly one real retry before parking (previously zero). The StoreAndForwardMessage.RetryCount XML doc was corrected to match. Regression test RetryPendingMessagesAsync_MaxRetriesOne_PerformsExactlyOneRetryBeforeParking asserts the immediate attempt plus exactly one retry occur before parking; the affected existing tests (EnqueueAsync_TransientFailure_BuffersForRetry, EnqueueAsync_AttemptImmediateDeliveryFalse_BuffersWithoutInvokingHandler, RetryPendingMessagesAsync_MaxRetriesReached_ParksMessage) were updated to the corrected semantics.

StoreAndForward-004 — RegisterDeliveryHandler XML doc contradicts the implemented contract

Severity Medium
Category Documentation & comments
Status Resolved
Location src/ScadaLink.StoreAndForward/StoreAndForwardService.cs:38, :60

Description

The XML comment on the handler delegate (lines 3740) says "Returns true on success, throws on transient failure. Permanent failures should return false (message will NOT be buffered)." That last clause is wrong for the retry path: in RetryMessageAsync, a handler returning false does not "not buffer" — the message is already buffered, and the code parks it immediately (lines 218224). The comment describes only the EnqueueAsync immediate path and misleads anyone implementing a handler about what false means once a message is in the retry loop.

Recommendation

Reword the contract to cover both paths explicitly: true = delivered (remove from buffer); false = permanent failure (not buffered on immediate attempt, parked on a retry); exception = transient failure (buffer / increment retry).

Resolution

Resolved 2026-05-16 (commit pending). Confirmed the root cause against the source: the old XML comment's "Permanent failures should return false (message will NOT be buffered)" describes only the EnqueueAsync immediate path; on the retry path a false return parks the already-buffered message. Reworded the _deliveryHandlers field doc into an explicit three-way contract (true = delivered/removed-or-never- buffered; false = permanent failure = not-buffered-on-immediate / parked-on-retry; throws = transient = buffered-for-retry / retry-count-incremented), noting it applies identically on both paths, and pointed RegisterDeliveryHandler's summary at it. Documentation-only change — no behavioural code touched, so no regression test (an XML comment is not test-observable).

StoreAndForward-005 — Parked-message retry/discard can race with the in-progress retry sweep

Severity Low — re-triaged down from Medium on 2026-05-16; the described data-loss race is not actually reachable, see Re-triage note
Original severity Medium
Category Concurrency & thread safety
Status Resolved
Location src/ScadaLink.StoreAndForward/StoreAndForwardService.cs:184, :266, :280

Description

RetryPendingMessagesAsync loads a snapshot of due messages (line 179) and then processes them one by one (line 184), await-ing delivery for each. Meanwhile RetryParkedMessageAsync / DiscardParkedMessageAsync (operator actions arriving via ParkedMessageHandlerActor) run on unrelated threads and mutate the same rows. Because each operation opens its own SQLite connection and there is no row-level coordination, an operator can DiscardParkedMessageAsync a message that the sweep is concurrently delivering: the sweep's later RemoveMessageAsync/UpdateMessageAsync operates on a now-deleted row (harmless) — but if an operator RetryParkedMessageAsync resets a row to Pending while the sweep simultaneously parks the same in-flight message, the operator intent is silently overwritten. The Interlocked guard only prevents overlapping sweeps, not sweep-vs-management races.

Recommendation

Funnel all message-state mutations through a single serialization point — e.g. perform all S&F state changes inside the ParkedMessageHandlerActor (or a dedicated S&F actor) so the actor mailbox serialises them, or make status transitions conditional in SQL (e.g. UPDATE ... WHERE id = @id AND status = @expected) and re-check the affected row count.

Re-triage note (2026-05-16)

Verified against the source: the specific race the finding describes — "an operator RetryParkedMessageAsync resets a row to Pending while the sweep simultaneously parks the same in-flight message"cannot occur at the reviewed code. The two operator operations (StoreAndForwardStorage.RetryParkedMessageAsync / DiscardParkedMessageAsync) are already SQL-conditional on status = Parked, and the retry sweep only ever loads and processes rows that are Pending. For the operator and the sweep to touch the same message the row would have to be simultaneously Parked (operator-actionable) and in the sweep's in-flight Pending snapshot — mutually exclusive states. Overlapping sweeps are additionally prevented by the Interlocked guard, so there is only ever one sweep writer. The described data-loss outcome is therefore not reachable, which makes the original Medium ("risky behaviour") severity an over-statement — re-triaged to Low.

What is real is a latent fragility: the sweep's own state-changing writes used the unconditional UpdateMessageAsync (WHERE id = @id), so the no-clobber guarantee rested entirely on the Interlocked invariant with zero defence-in-depth if that invariant were ever broken (e.g. RetryMessageAsync called outside the sweep, or the guard removed). That residual Low-severity weakness is worth closing, so the recommendation's conditional-SQL option was applied.

Resolution

Resolved 2026-05-16 (commit pending). Re-triaged Medium → Low (the described race is unreachable; see Re-triage note) and the residual latent fragility fixed: added StoreAndForwardStorage.UpdateMessageIfStatusAsync, a conditional update (UPDATE ... WHERE id = @id AND status = @expectedStatus) returning whether a row was written. RetryMessageAsync's three state-changing writes (park-on-permanent-failure, park-on-max-retries, retry-count increment) now use it with expectedStatus = Pending, so the sweep can never overwrite a row whose status changed underneath it; a skipped write is logged and the message left for the next sweep. Regression test RetryMessageAsync_StatusChangedDuringDelivery_SweepParkWriteIsSkipped drives a writer that moves the row out of Pending mid-delivery and asserts the sweep's stale park write is skipped (it failed against the pre-fix unconditional update, clobbering the other writer's RetryCount).

StoreAndForward-006 — GetParkedMessagesAsync count and page run without a transaction

Severity Low
Category Performance & resource management
Status Resolved
Location src/ScadaLink.StoreAndForward/StoreAndForwardStorage.cs:166, :175

Description

GetParkedMessagesAsync issues a COUNT(*) and then a separate paged SELECT on two commands on the same connection with no surrounding transaction. A concurrent enqueue/park/discard between the two statements yields a TotalCount inconsistent with the returned page (e.g. total reported as 51 while only 50 distinct parked rows now exist, or a row visible in the page but excluded from the count). For a paginated UI this produces flickering totals and occasional off-by-one page math.

Recommendation

Wrap both reads in a single transaction (BeginTransaction) so they see a consistent snapshot, or accept the staleness and document it. A transaction is cheap here and removes the inconsistency.

Resolution

Resolved 2026-05-16 (commit pending). Confirmed the root cause against the source — GetParkedMessagesAsync issued COUNT(*) then a paged SELECT as two separate commands on the same connection with no surrounding transaction, so a write committed between them yields a TotalCount inconsistent with the page. Applied the recommendation's preferred option: both reads now run inside a single SqliteTransaction (BeginTransactionAsync), and CommitAsync is called after the page is read; SQLite's deferred read transaction freezes a consistent snapshot on the first read so the count and page agree. Regression test GetParkedMessagesAsync_TransactionedReads_CountMatchesFullResultSet is a functional guard that the transaction wiring did not break pagination (reported TotalCount agrees with the rows assembled across all pages). Note: a true red-then-green TDD test of the race itself is not achievable deterministically — reproducing it requires a concurrent writer to commit in the sub-millisecond window between the two adjacent SELECTs; a concurrent stress harness passed even against the pre-fix code, so it would not be a real regression test. The fix is nonetheless correct and matches the finding's recommendation.

StoreAndForward-007 — Async work in ParkedMessageHandlerActor uses ContinueWith without scheduler/affinity guarantees

Severity Low
Category Akka.NET conventions
Status Resolved
Location src/ScadaLink.StoreAndForward/ParkedMessageHandlerActor.cs:34, :68, :87

Description

The three handlers call a Task-returning service method and chain .ContinueWith(...) .PipeTo(sender). Sender is correctly captured into a local first, so the closure is safe. However ContinueWith without an explicit TaskScheduler runs the continuation on a thread-pool thread and the captured continuation builds the response objects there — acceptable since it only touches locals, but it bypasses the idiomatic PipeTo-with-success/failure-projection pattern and is fragile if someone later adds a line touching actor state inside the continuation. There is also no TaskContinuationOptions, so a faulted antecedent still runs the continuation (handled here via IsCompletedSuccessfully, but only by convention).

Recommendation

Replace ContinueWith(...).PipeTo(sender) with PipeTo(sender, success: result => ..., failure: ex => ...), which is the documented Akka pattern, keeps response construction off the actor thread safely, and makes the success/failure branches explicit.

Resolution

Resolved 2026-05-16 (commit pending). Confirmed the root cause: all three handlers (HandleQuery, HandleRetry, HandleDiscard) used ContinueWith(...).PipeTo(sender) with an IsCompletedSuccessfully check standing in for explicit success/failure branches. Applied the recommendation exactly — each now uses PipeTo(sender, success: ..., failure: ...), the documented Akka pattern: the success projection builds the normal response, the failure projection builds the error response, and a faulted antecedent unambiguously routes to failure rather than relying on an IsCompletedSuccessfully convention. Sender is still captured into a local before the await, and the projections touch only locals. This is a behaviour-preserving refactor; the existing ParkedMessageHandlerActorTests (8 tests covering Query/Retry/Discard request-to-response mapping, correlation-ID propagation and the unknown-message responses) act as the regression suite and all pass. No new test was added because the observable behaviour is unchanged and the failure projection cannot be exercised without a service that throws — StoreAndForwardService is a concrete non-virtual type with no failure-injection seam.

StoreAndForward-008 — A SQLite connection is opened and torn down on every storage call

Severity Low
Category Performance & resource management
Status Resolved
Location src/ScadaLink.StoreAndForward/StoreAndForwardStorage.cs:28, :61, :93, :117, :144, :162, :199, :221, :237, :267, :285, :305, :319

Description

Every method in StoreAndForwardStorage constructs a fresh SqliteConnection and calls OpenAsync. Microsoft.Data.Sqlite pools connections, so this is not a correctness bug, but a retry sweep over a large buffer performs one open per UpdateMessageAsync/ RemoveMessageAsync call inside the loop (RetryMessageAsync), multiplying connection churn under load. With no max buffer size (by design) the buffer can grow large, so the per-message connection acquisition is a measurable overhead on the hot retry path.

Recommendation

Consider a batched retry API that opens one connection (and one transaction) per sweep, or pass an open connection into the per-message update calls. At minimum, document that the design relies on the Sqlite connection pool for acceptable performance.

Resolution

Resolved 2026-05-16 (commit pending). Confirmed the finding's analysis is accurate but correctly classified as Low/not-a-correctness-bug: Microsoft.Data.Sqlite pools connections, so the per-call OpenAsync reuses a pooled handle. Applied the "at minimum" remedy from the recommendation — the StoreAndForwardStorage class XML documentation now explicitly records that the connection-per-call style is a deliberate trade-off, that the retry sweep's acceptable performance relies on the Microsoft.Data.Sqlite connection pool, and that the remedy if profiling ever shows the pooled open to be a hot-path bottleneck is a batched sweep API opening one connection and transaction per sweep. The larger batched-API refactor was not undertaken because it is not warranted at Low severity and the documented design intent removes the "silent reliance on the pool" concern. Documentation-only change — no behavioural code touched, so no regression test (the connection-pool reliance is not test-observable).

StoreAndForward-009 — OnActivity event invocation is not thread-safe against concurrent subscribe/unsubscribe

Severity Low
Category Concurrency & thread safety
Status Resolved
Location src/ScadaLink.StoreAndForward/StoreAndForwardService.cs:46, :309

Description

OnActivity is a public event Action<...> raised via OnActivity?.Invoke(...) in RaiseActivity (line 309). RaiseActivity is called from both EnqueueAsync (caller thread) and RetryMessageAsync (timer thread). The ?.Invoke null-conditional captures the delegate once so it will not NRE, but there is no synchronisation around the event field itself; a subscriber added/removed concurrently with a raise has no defined ordering. More importantly, subscriber callbacks run synchronously on the timer thread, so a slow or throwing subscriber stalls or aborts the retry sweep (an exception in a subscriber propagates out of RaiseActivity into RetryMessageAsync's try and is swallowed as a "transient failure," wrongly incrementing the message's retry count).

Recommendation

Snapshot the delegate (already done) and additionally wrap subscriber invocation in a try/catch so a faulting logging subscriber cannot be misclassified as a delivery failure. Document that handlers must be fast and non-throwing, or dispatch activity notifications asynchronously.

Resolution

Resolved 2026-05-16 (commit pending). Confirmed — and found the impact is worse than the finding states. RaiseActivity previously did OnActivity?.Invoke(...); a throwing subscriber's exception escaped it. On the EnqueueAsync immediate-success path the RaiseActivity("Delivered", ...) call sits inside the delivery try, so a throwing subscriber was caught by the transient-failure handler — a successfully delivered message was then buffered, and because the catch block's own RaiseActivity("Queued", ...) also threw, the exception escaped EnqueueAsync entirely. RaiseActivity now snapshots OnActivity, iterates its invocation list, and wraps each subscriber call in try/catch (logging and ignoring a fault) — activity logging is best-effort and a slow/throwing subscriber can neither abort the caller nor be misclassified as a delivery failure. Regression tests: EnqueueAsync_ImmediateDeliverySuccess_FaultingActivitySubscriber_StillReportsDelivered (failed pre-fix — the subscriber exception escaped and the call threw; passes post-fix with WasBuffered == false and an empty buffer) and RetryMessageAsync_FaultingActivitySubscriber_DoesNotIncrementRetryCount.

StoreAndForward-010 — Retry of a parked message does not reset LastAttemptAt, so its retry timing is unspecified

Severity Medium
Category Correctness & logic bugs
Status Resolved
Location src/ScadaLink.StoreAndForward/StoreAndForwardStorage.cs:203, :101

Description

RetryParkedMessageAsync sets status = Pending, retry_count = 0, last_error = NULL but leaves last_attempt_at unchanged (line 203206). The retry-due query (GetMessagesForRetryAsync, line 101105) selects Pending rows where last_attempt_at IS NULL OR ... elapsed >= retry_interval_ms. A message parked after exhausting retries has an old last_attempt_at; once re-queued, the elapsed time since that stale timestamp is almost certainly already greater than the retry interval, so the operator-retried message is attempted on the very next sweep regardless of the configured interval. That is probably the desired behaviour (operator wants it tried now), but it is unspecified and inconsistent — if retry_interval_ms were very large the behaviour would instead be "try immediately" by accident rather than by design.

Recommendation

Explicitly decide and encode the intent: either set last_attempt_at = NULL on re-queue so the message is unambiguously due now, or set it to "now" so it waits one interval. Document the chosen behaviour in the method's XML comment.

Resolution

Resolved 2026-05-16 (commit pending). Confirmed against the source: RetryParkedMessage Async reset status, retry_count and last_error but left last_attempt_at stale, so the operator-retried message's retry timing depended on the elapsed time since the original pre-park attempt. Encoded the intent explicitly — an operator retry means "attempt this again now" — by also setting last_attempt_at = NULL in the UPDATE, so the re-queued message is unambiguously due on the next sweep regardless of the configured retry_interval_ms. The method's XML comment now documents this. Regression test RetryParkedMessageAsync_ClearsLastAttemptAt_SoMessageIsImmediatelyDue uses a 1-hour retry interval and a recent last_attempt_at; it failed pre-fix (timestamp not cleared, message excluded from the retry-due set) and passes post-fix.

StoreAndForward-011 — StoreAndForwardMessageStatus.InFlight is unused and the doc's "retrying" status is unmodelled

Severity Low
Category Design-document adherence
Status Deferred
Location src/ScadaLink.Commons/Types/Enums/StoreAndForwardMessageStatus.cs:9; src/ScadaLink.StoreAndForward/StoreAndForwardService.cs:219, :235

Description

The enum defines Pending, InFlight, Parked, Delivered. The module only ever uses Pending and ParkedInFlight and Delivered are never assigned (delivered messages are deleted, not marked Delivered). Meanwhile the Component design doc ("Message Format" -> Status) specifies the set "Pending, retrying, or parked." So the code's enum drifts from the doc in two directions: it carries dead members the doc does not mention (InFlight, Delivered) and omits the doc's retrying state. A message mid-retry is indistinguishable from one that has never been attempted.

Recommendation

Reconcile the enum with the design. Either drop the unused members and update the doc, or implement the documented retrying state and use InFlight to mark a message the sweep is actively delivering (which would also help with finding 005).

Resolution

Deferred 2026-05-16. Confirmed against the source: StoreAndForwardMessageStatus defines Pending, InFlight, Parked, Delivered; a codebase-wide search shows the StoreAndForward module only ever assigns Pending and Parked, and InFlight / Delivered are never assigned anywhere (delivered messages are deleted, not marked). The design doc's retrying state is unmodelled. Both options the recommendation offers — (a) drop the unused InFlight/Delivered members, or (b) add a Retrying member — require editing StoreAndForwardMessageStatus.cs, which lives in src/ScadaLink.Commons (outside this review's edit scope: only src/ScadaLink.StoreAndForward/** may be changed). The enum is also referenced by IntegrationTests and HealthMonitoring tests, so removing members is a cross-module change. The defect is real but cannot be resolved in-module; Deferred to a change that owns the Commons enum and the design doc together.

StoreAndForward-012 — StoreAndForwardMessage is a persistence entity but lives in the component, not Commons

Severity Low
Category Code organization & conventions
Status Deferred
Location src/ScadaLink.StoreAndForward/StoreAndForwardMessage.cs:9

Description

StoreAndForwardMessage is a persistence-ignorant POCO that maps directly to the sf_messages table and is also carried across the network inside ReplicationOperation (replicated to the standby node over Akka remoting). CLAUDE.md "Code Organization" states that entity classes are persistence-ignorant POCOs in Commons and that message contracts follow additive-only evolution. Because this type doubles as a replication wire contract but lives in the component assembly, it is not co-located with the other Commons entities and its evolution is not governed by the additive-only message-contract rule. This is a borderline case (the type is site-local), but the cross-node use via ReplicationOperation makes it a de-facto message contract.

Recommendation

Either move StoreAndForwardMessage (and ReplicationOperation) into the Commons Entities/Messages hierarchy so they are governed by the contract-evolution rules, or introduce a separate DTO for replication and keep StoreAndForwardMessage purely as the local persistence model. Document the decision.

Resolution

Deferred 2026-05-16. Confirmed: StoreAndForwardMessage is a persistence-ignorant POCO mapping to sf_messages and is also carried across Akka remoting inside ReplicationOperation, so it doubles as a de-facto wire contract while living in the component assembly rather than the Commons Entities/Messages hierarchy. The recommendation's primary remedy — moving StoreAndForwardMessage (and ReplicationOperation) into Commons — crosses module boundaries (it would add a type to src/ScadaLink.Commons, outside this review's edit scope of src/ScadaLink.StoreAndForward/**, and change every referencing module). The alternative "separate replication DTO" still leaves the persistence entity in the component, so it does not actually resolve the finding's core concern (entity placement / contract- evolution governance). Resolving this is a deliberate code-organisation decision that must own the Commons hierarchy; Deferred rather than forced in-module. Flagged for a cross-module follow-up.

StoreAndForward-013 — Critical paths lack test coverage: retry-due timing, replication-from-active, and the actor bridge

Severity Medium
Category Testing coverage
Status Resolved
Location tests/ScadaLink.StoreAndForward.Tests/ (whole directory); src/ScadaLink.StoreAndForward/StoreAndForwardStorage.cs:101; src/ScadaLink.StoreAndForward/ParkedMessageHandlerActor.cs

Description

The existing tests cover storage CRUD and the service happy/failure paths well, but three important behaviours are untested: (1) the retry-due time filter in GetMessagesForRetryAsync — every service test sets DefaultRetryInterval = TimeSpan.Zero, so the julianday elapsed-time comparison (the most error-prone SQL in the module) is never exercised with a non-zero interval; a message that is not yet due should be skipped, and that is never verified. (2) Replication from the active side — no test asserts that an enqueue/remove/park causes a Replicate* call (this is exactly the gap behind finding 001; a test would have caught it). (3) ParkedMessageHandlerActor has no test at all — the Query/Retry/Discard request-to-response mapping and the ExtractMethodName JSON parsing are unverified, including the malformed-JSON branch.

Recommendation

Add tests for: a non-zero retry interval where a recently-attempted message is excluded and an older one is included; active-side replication invocation per operation type (once finding 001 is fixed); and ParkedMessageHandlerActor using Akka.TestKit, including ExtractMethodName for MethodName, Subject, missing-property and invalid-JSON payloads.

Resolution

Resolved 2026-05-16 (commit pending). All three gaps closed. (1) Retry-due timing: GetMessagesForRetryAsync_NonZeroInterval_ExcludesNotYetDue IncludesDue exercises the julianday elapsed-time SQL with non-zero intervals — a just-attempted message (1-hour interval) is excluded, an old one (attempted 2h ago, 5-min interval) and a never-attempted one are included. (2) Active-side replication: this sub-claim was already stale at the reviewed code — StoreAndForwardReplicationTests (added with finding 001's fix) asserts the Add, Remove and Park replication operations; no new test was needed. (3) ParkedMessageHandlerActor: new ParkedMessageHandlerActorTests using Akka.TestKit.Xunit2 (package reference added to the test project) covers Query/Retry/ Discard request-to-response mapping, correlation-ID propagation, the unknown-message failure responses, and ExtractMethodName for MethodName, Subject, missing-property and malformed-JSON payloads (all falling back to the category name without throwing). These are coverage-gap tests over already-correct code, so they pass on first run.

StoreAndForward-014 — Storage does not create its SQLite database directory

Severity Medium
Category Error handling & resilience
Status Resolved
Location src/ScadaLink.StoreAndForward/StoreAndForwardStorage.cs:26

Found 2026-05-16 while verifying the store-and-forward fixes — this defect was not part of the original baseline review.

Description

StoreAndForwardStorage.InitializeAsync opened a SqliteConnection against the configured SqliteDbPath (default ./data/store-and-forward.db) without ensuring the parent directory exists. SQLite creates the database file on demand but not its directory, so when data/ does not already exist the connection fails to open with SQLite Error 14: 'unable to open database file'. Every site-host boot therefore failed in any environment whose working directory has no data/ folder — the cause of the six failing SiteActorPathTests (the host's RegisterSiteActors aborts at StoreAndForwardService.StartAsync). Production masked it because data/ is created by the Docker image / deployment.

Recommendation

Create the parent directory of a file-backed SQLite database before opening it.

Resolution

Resolved 2026-05-16. InitializeAsync now calls a new EnsureDatabaseDirectoryExists helper that parses the connection string with SqliteConnectionStringBuilder and, for a file-backed database, creates the parent directory if it is missing (in-memory databases and bare filenames are skipped). Regression test InitializeAsync_FileInMissingDirectory_CreatesDirectory fails against the pre-fix code; all six SiteActorPathTests now pass. Fixed by the commit whose message references StoreAndForward-014.

StoreAndForward-015 — EnqueueAsync's public contract never documents that maxRetries == 0 means "retry forever"

Severity Medium
Category Documentation & comments
Status Resolved
Location src/ScadaLink.StoreAndForward/StoreAndForwardService.cs:114:130, :285

Description

The re-review brief asks for the StoreAndForward side of the cross-module ambiguity recorded as ExternalSystemGateway-015. The semantics are split across this module and its callers, and the StoreAndForward side carries a genuine documentation/API-contract fault:

  • RetryMessageAsync parks a message only when message.MaxRetries > 0 && message.RetryCount >= message.MaxRetries (StoreAndForwardService.cs:285). When MaxRetries == 0 the guard is false on every sweep, so a 0 value means "no limit — retry forever". The StoreAndForwardMessage.MaxRetries XML doc (StoreAndForwardMessage.cs:31) does state "0 = no limit", so the persistence model is internally consistent.
  • But EnqueueAsync — the only public entry point into the engine — exposes a maxRetries parameter (StoreAndForwardService.cs:128) with no parameter documentation at all, and its method summary (lines 114122) describes the lifecycle only as "On max retries → park" / "parked once MaxRetries is reached" (see also the _deliveryHandlers field doc, line 5051). Nothing on the public surface tells a caller that passing 0 flips the meaning from "park immediately / never retry" to "retry forever". A caller reading only EnqueueAsync would reasonably assume 0 retries means zero retries.
  • This is exactly the trap ESG fell into: ExternalSystemClient.CachedCallAsync / DatabaseGateway.CachedWriteAsync pass the source entity's MaxRetries verbatim, intending 0 to mean "never retry", and instead get unbounded retry — the duplicate-delivery / unbounded-buffer-growth hazard the design doc's idempotency note warns against. The fault is not solely ESG's: the S&F public API silently overloads 0 with the opposite of its natural reading and does not document it.

The defect is in this module's API contract and documentation, so it is recorded here in addition to ExternalSystemGateway-015. (Whether 0 should mean "no limit" or "no retry" is the cross-module design decision tracked by ESG-015; this finding is specifically that the StoreAndForward public surface fails to document whichever meaning is chosen.)

Recommendation

Document the maxRetries parameter on EnqueueAsync explicitly with a <param> tag that states the 0 special case in the same words as StoreAndForwardMessage.MaxRetries ("0 = no limit — retried on every sweep until delivered, never parked"), and add the 0 case to the method summary's lifecycle description. Better still — and consistent with the resolution of ESG-015 — make the engine reject the ambiguity at the API: accept a nullable/enum retry policy, or treat 0 as an explicit "no retry" (do not buffer, or park on the first sweep) so the natural reading and the behaviour agree. Either way the public EnqueueAsync contract must state the chosen meaning; today it states nothing.

Resolution

Resolved 2026-05-17. Documentation-only fix — retry semantics confirmed correct and left unchanged. Root cause verified against the source: EnqueueAsync's maxRetries parameter had no <param> documentation and the method/class summaries described only the "park on max retries" path, never the 0 = no limit / retry forever special case that RetryMessageAsync's MaxRetries > 0 guard actually enforces. Added an explicit <param> tag for every EnqueueAsync parameter — maxRetries now states in bold that 0 means "no limit, never parked for retry exhaustion" and is not a "never retry" value — extended the method summary with a retry-count lifecycle paragraph, updated the class-level lifecycle bullet, and tightened the StoreAndForwardMessage.MaxRetries field doc to the same wording. No behavioural code touched; an XML comment is not test-observable so no regression test was added.

StoreAndForward-016 — Operator-initiated parked-message retry and discard are not replicated to the standby

Severity Medium
Category Error handling & resilience
Status Resolved
Location src/ScadaLink.StoreAndForward/StoreAndForwardService.cs:339:362; src/ScadaLink.StoreAndForward/ReplicationService.cs:131:136

Description

StoreAndForward-001's fix wired replication into the active delivery paths: BufferAsync replicates an Add, a successful retry replicates a Remove, and a park replicates a Park. But the two operator paths — RetryParkedMessageAsync (line 339) and DiscardParkedMessageAsync (line 353) — change buffer state and never touch _replication:

  • RetryParkedMessageAsync flips a row from Parked back to Pending (and clears retry_count / last_attempt_at) in the local SQLite only. The standby's copy stays Parked.
  • DiscardParkedMessageAsync DELETEs the row from the local SQLite only. The standby's copy is left in place, still Parked.

The Component design doc ("Persistence") requires the active node to forward "each buffer operation (add, remove, park)" so that on failover "the new active node has a near-complete copy of the buffer." An operator retrying a parked message is a buffer state change; an operator discarding one is a removal. After a failover that follows an operator action:

  1. A discarded message reappears on the new active node — it is still Parked there, so it resurfaces in the central UI's parked-message list and an operator must discard it a second time. For a message deliberately removed (e.g. a known-bad payload) this is a correctness regression of the operator's intent.
  2. A retried message is still Parked on the new active node, so the operator's "move it back to the queue" action is silently lost across the failover and the message is not re-attempted.

ReplicationOperationType only models Add/Remove/Park — there is no operation for "un-park / move back to pending", so even a minimal fix needs either a new operation type or a re-use of Add to overwrite the standby row. This is the same class of defect as the now-resolved StoreAndForward-001, for the operator paths rather than the sweep paths.

Recommendation

Replicate both operator actions. DiscardParkedMessageAsync should call _replication?.ReplicateRemove(messageId) after a successful local delete (the existing Remove op already deletes on the standby). For RetryParkedMessageAsync, add a Requeue/Unpark ReplicationOperationType whose ApplyReplicatedOperationAsync case resets the standby row to Pending with retry_count = 0, or have the method re-load the updated message and replicate it as an Add-style upsert. Add replication tests for both operator paths (the existing StoreAndForwardReplicationTests only cover the sweep paths).

Resolution

Resolved 2026-05-17. Root cause confirmed: the two operator paths (RetryParkedMessageAsync, DiscardParkedMessageAsync) changed local SQLite state but never touched _replication, so a failover after an operator action diverged the standby buffer. DiscardParkedMessageAsync now calls _replication?.ReplicateRemove after a successful local delete (the existing Remove op deletes on the standby). A new ReplicationOperationType.Requeue was added; RetryParkedMessageAsync re-loads the requeued row and calls _replication?.ReplicateRequeue, and the standby's ApplyReplicatedOperationAsync Requeue case resets its matching row to Pending with retry_count = 0. Regression tests DiscardingAParkedMessage_ReplicatesARemoveOperation, RetryingAParkedMessage_ReplicatesARequeueOperation and ApplyReplicatedOperation_Requeue_MovesStandbyRowBackToPending (in StoreAndForwardReplicationTests) all pass; the first two fail against the pre-fix code.

StoreAndForward-017 — Retry/Discard activity-log entries hard-code the ExternalSystem category

Severity Low
Category Correctness & logic bugs
Status Resolved
Location src/ScadaLink.StoreAndForward/StoreAndForwardService.cs:344, :358

Description

RetryParkedMessageAsync and DiscardParkedMessageAsync raise an S&F activity notification (consumed by Site Event Logging — WP-14) but pass a hard-coded StoreAndForwardCategory.ExternalSystem as the category argument:

RaiseActivity("Retry",   StoreAndForwardCategory.ExternalSystem, $"Parked message {messageId} moved back to queue");
RaiseActivity("Discard", StoreAndForwardCategory.ExternalSystem, $"Parked message {messageId} discarded");

Both methods take only a messageId and never load the message, so they have no access to its real category. When an operator retries or discards a parked Notification or CachedDbWrite message, the site event log records the activity under ExternalSystem. Every other RaiseActivity call in the service passes the message's true Category (EnqueueAsync, RetryMessageAsync), so the operator paths are inconsistent and produce mislabelled audit entries — misleading when an operator later filters or reviews S&F activity by category.

Recommendation

Load the message (or have StoreAndForwardStorage.RetryParkedMessageAsync / DiscardParkedMessageAsync return the affected row's category) and pass the real Category to RaiseActivity. If loading the row is considered too costly on these infrequent operator paths, change the OnActivity event / RaiseActivity signature to allow a nullable category for management actions rather than asserting a false one.

Resolution

Resolved 2026-05-17. Root cause confirmed: RetryParkedMessageAsync and DiscardParkedMessageAsync raised activity notifications with a hard-coded StoreAndForwardCategory.ExternalSystem, mislabelling parked Notification and CachedDbWrite messages in the site event log. Both methods now obtain the message's real category — DiscardParkedMessageAsync loads the row via GetMessageByIdAsync before the delete, RetryParkedMessageAsync re-loads the requeued row (also used for the StoreAndForward-016 replication) — and pass it to RaiseActivity (falling back to ExternalSystem only if the row is unexpectedly absent). Regression tests RetryParkedMessageAsync_ActivityUsesMessageRealCategory and DiscardParkedMessageAsync_ActivityUsesMessageRealCategory assert the activity carries Notification / CachedDbWrite respectively; both fail against the pre-fix code.

StoreAndForward-018 — Notification corrupt-payload parks the buffered message, contradicting the "notifications do not park" design invariant

Severity High
Category Design-document adherence
Status Resolved
Location src/ScadaLink.StoreAndForward/NotificationForwarder.cs:62:69, :105:122; src/ScadaLink.StoreAndForward/StoreAndForwardService.cs:369:397

ResolutionNotificationForwarder.DeliverAsync now discards a corrupt buffered payload instead of returning false. The corrupt path logs a Warning with the buffered row id + length-capped payload preview via an injected ILogger<NotificationForwarder> (NullLogger by default for back-compat), then returns true so the S&F engine clears the row via its standard success-path cleanup — honoring the "notifications do not park" design invariant. Two regression tests in NotificationForwarderTests cover the two corrupt shapes (invalid JSON, null deserialisation) and pin that nothing is forwarded to central in either case.

Description

The Component design doc explicitly carves out notifications from the parking lifecycle:

"Notifications do not park — they are retried at the fixed forward interval until central acks." (docs/requirements/Component-StoreAndForward.md:47) "Parking applies only to the external-system-call and cached-database-write categories." (same line)

NotificationForwarder.DeliverAsync violates this. When TryBuildSubmit fails to deserialize the buffered payload — either because JsonSerializer.Deserialize throws a JsonException (line 114) or because it returns null (line 119) — DeliverAsync returns false (line 68). On the retry path the S&F engine treats handler false as a permanent failure and parks the message immediately via the conditional UpdateMessageIfStatusAsync(... Parked) write at StoreAndForwardService.cs:373:385. Result: a notification with a corrupt buffered payload — a row that the engine itself treats as opaque ("Payload: Serialized message content…"; Component-StoreAndForward.md: 110) — enters the parked state and surfaces in the central UI's parked-message list under the Notification category, contradicting the doc's invariant and the resolved StoreAndForward-017's "Notification / CachedDbWrite" Retry/Discard category mapping.

The defect is real today: the inline comment on NotificationForwarder.cs:64 even documents the violation ("An unreadable payload cannot be fixed by retrying — park it (return false)") as the intended behaviour, but that behaviour is what the design doc forbids. Either the doc needs to acknowledge a poison-payload parking exception for notifications, or the forwarder needs a different escape hatch (discard? log + drop? permanent-failure-as-true to clear the buffer?). Today there is no consistent answer between code and design.

Additionally, on the immediate-delivery path (a fresh enqueue followed by a DeliverAsync returning false), the engine returns WasBuffered: false and the row is never persisted — so the corrupt-payload "park" only occurs on the retry path, where the message has already been buffered (and replicated to the standby). The inconsistency between the two paths ("not buffered" vs "parked") for the same permanent-failure outcome is itself a contract surprise; the resolved StoreAndForward-004 documents the immediate vs retry asymmetry, but does not anticipate that the retry asymmetry will violate a per-category invariant.

Recommendation

Choose one consistent reconciliation. Preferred option: change NotificationForwarder. DeliverAsync to discard a corrupt payload rather than park it — delete the buffered row directly, log a Site Event Log entry under Discard, and return true so the engine clears the buffer. This preserves the design's "notifications do not park" invariant. Alternatives: (a) update the design doc to acknowledge a poison-payload parking exception specifically for notifications, and revise the resolved StoreAndForward-017 wording; (b) treat JsonException as transient (would retry-forever on a corrupt payload — bad); (c) introduce a per-category park-allowed flag on the engine and gate the retry-path park behind it for the Notification category. Add a regression test in tests/ScadaLink.StoreAndForward.Tests/NotificationForwarderTests. cs asserting that a corrupt-payload notification reaches a terminal non-Parked state — today the corrupt-payload behaviour is uncovered.

Resolution

Unresolved.

StoreAndForward-019 — Notifications park after DefaultMaxRetries exhaustion, contradicting "retried until central acks"

Severity Medium
Category Error handling & resilience
Status Open
Location src/ScadaLink.StoreAndForward/StoreAndForwardService.cs:229, :407:437; src/ScadaLink.StoreAndForward/StoreAndForwardOptions.cs:18; src/ScadaLink.SiteRuntime/Scripts/ScriptRuntimeContext.cs:1773:1778; src/ScadaLink.NotificationService/NotificationDeliveryService.cs:149:156

Description

The design doc requires a buffered notification to be retried indefinitely until central acks:

"The notification category retries differently: it has no source-entity setting. The site→central forward uses a single fixed retry interval configured in the host appsettings.json. … A buffered notification is retried until central acks it; it is not parked on a retry limit (central, once reachable, owns delivery, retry, and parking from that point on)." (Component-StoreAndForward.md:55:59)

The current engine cannot honour that. RetryMessageAsync enforces parking at message.MaxRetries > 0 && message.RetryCount >= message.MaxRetries (StoreAndForwardService.cs:407); a MaxRetries == 0 is the documented "no limit" escape hatch (now correctly explained by the resolved StoreAndForward-015). But the two notification enqueue paths both supply a positive bounded MaxRetries:

  • ScriptRuntimeContext.cs:1773:1778 (the Notify.Send site script path) calls EnqueueAsync without supplying the maxRetries argument, so the engine defaults to StoreAndForwardOptions.DefaultMaxRetries = 50 (StoreAndForwardOptions. cs:18). After 50 retry sweeps with central unreachable, the notification is parked.
  • NotificationDeliveryService.cs:149:156 (the legacy SMTP-style path retained for the central-side INotificationDeliveryService callers) passes smtpConfig.MaxRetries > 0 ? smtpConfig.MaxRetries : nullnull falls back to the same 50-retry default, and any positive smtpConfig.MaxRetries still bounds the retry budget. Either way, a long central outage parks the notification.

A parked notification cannot be cleared by a central recovery: it stays parked until an operator clicks Retry in the parked-message UI. The design's invariant — that notification delivery converges automatically as soon as central is reachable — is broken: an extended central outage requires manual intervention to clear the backlog, which is exactly the behaviour the central-only outbox redesign was meant to remove from the site.

This is closely related to (but distinct from) StoreAndForward-018: 018 is the permanent-failure-path parking violation; 019 is the transient-failure-path parking violation under the engine's normal max-retries policy.

Recommendation

Make the notification enqueue paths pass maxRetries: 0 so the documented "no limit / never parked" semantics apply, and guard against regression by adding an integration test in tests/ScadaLink.StoreAndForward.Tests/NotificationForwarderTests.cs that runs a sweep many more times than DefaultMaxRetries against an always-failing handler and asserts the buffered notification's status stays Pending (not Parked). A cleaner alternative is to special-case the Notification category inside RetryMessageAsync's max-retries guard (treat it as MaxRetries == 0 regardless of the field value) so the invariant is enforced at the single chokepoint rather than relying on every caller to pass the right value — this also fixes the legacy NotificationDeliveryService path without editing the consumer.

Resolution

Unresolved.

StoreAndForward-020 — RetryParkedMessageAsync skips standby replication when the message is deleted between local update and re-load

Severity Medium
Category Concurrency & thread safety
Status Open
Location src/ScadaLink.StoreAndForward/StoreAndForwardService.cs:599:616

Description

The StoreAndForward-016 resolution wired Requeue replication into operator-initiated retry. The fix uses a two-step pattern:

public async Task<bool> RetryParkedMessageAsync(string messageId)
{
    var success = await _storage.RetryParkedMessageAsync(messageId);   // step 1
    if (success)
    {
        var message = await _storage.GetMessageByIdAsync(messageId);    // step 2 (no txn)
        var category = message?.Category ?? StoreAndForwardCategory.ExternalSystem;
        if (message != null)
        {
            _replication?.ReplicateRequeue(message);                    // step 3
        }
        RaiseActivity("Retry", category, ...);
    }
    return success;
}

The two storage calls are on separate connections with no surrounding transaction. A concurrent writer between step 1 (which moved the row from Parked → Pending) and step 2 (which re-reads the row) can delete or mutate the row:

  • An operator who issues DiscardParkedMessageAsync immediately after retry — the DiscardParkedMessageAsync storage call is conditional on status = Parked, so it will be a no-op (correct), but a sweep that succeeds in delivering the just-requeued row will then call _storage.RemoveMessageAsync (unconditional), which deletes it. In a single retry-sweep cycle this race is real because DefaultRetryInterval = Zero is the standard test default and the operator action and a sweep tick can overlap.
  • A RemoveMessageAsync runs in step 1's wake; GetMessageByIdAsync returns null; step 3 (_replication?.ReplicateRequeue) is skipped entirely, but step 1 already requeued the row locally. The standby is now left in Parked state while the active node has Pending-then-Deleted, exactly the standby-divergence StoreAndForward-016 was supposed to fix. (On the active node a subsequent failover lands on a Parked standby copy of a discarded message — the same regression 016 already documented.)

The category-fallback path (StoreAndForwardCategory.ExternalSystem when message is null) silently mislabels the activity log entry too — the same defect that StoreAndForward-017 fixed for the non-racy path, except this branch handed back a hard-coded fallback rather than re-loading. The activity log entry is a minor side effect; the missing replication is the real defect.

Recommendation

Capture the message once, before the local Parked → Pending storage update, so the replication path has the row in hand even if a concurrent writer deletes it afterwards:

var message = await _storage.GetMessageByIdAsync(messageId);  // before the update
if (message == null || message.Status != StoreAndForwardMessageStatus.Parked)
    return false;

var success = await _storage.RetryParkedMessageAsync(messageId);
if (!success) return false;

// `message` was the parked row; the active node just wrote it back to Pending with
// retry_count = 0 — construct the replicated state from those known mutations.
message.Status = StoreAndForwardMessageStatus.Pending;
message.RetryCount = 0;
message.LastError = null;
message.LastAttemptAt = null;
_replication?.ReplicateRequeue(message);
RaiseActivity("Retry", message.Category, $"Parked message {messageId} moved back to queue");
return true;

Add a regression test in StoreAndForwardReplicationTests that simulates the delete-between-update-and-reload race and asserts the Requeue replication operation is still emitted with the correct category.

Resolution

Unresolved.

StoreAndForward-021 — Design doc claims the Operation Tracking Table lives in StoreAndForward but the implementation is in SiteRuntime

Severity Medium
Category Design-document adherence
Status Open
Location docs/requirements/Component-StoreAndForward.md:21, :49:51, :77:87, :108, :114; src/ScadaLink.SiteRuntime/Tracking/OperationTrackingStore.cs:37; src/ScadaLink.StoreAndForward/ (whole module)

Description

Component-StoreAndForward.md repeatedly assigns the Operation Tracking Table to this component:

  • Responsibilities (line 21): "Maintain a site-local operation tracking table holding one row per TrackedOperationId for cached calls … the authoritative status record consulted by Tracking.Status(id)."
  • Message Lifecycle (lines 4951): "the operation tracking table is the status record and the S&F buffer is purely the retry mechanism. A cached call that succeeds on its first immediate attempt is written directly as a terminal Delivered tracking row and never enters the S&F buffer."
  • Operation Tracking Table section (lines 7787): "Alongside the S&F buffer DB, each site node holds a site-local operation tracking table in SQLite. … Each row records the operation kind (TrackedOperationKind) …"

The actual implementation lives outside this module: src/ScadaLink.SiteRuntime/ Tracking/OperationTrackingStore.cs (and IOperationTrackingStore, OperationTrackingOptions). The StoreAndForward project contains no references to the tracking store, owns no operation_tracking table, and StoreAndForwardService.NotifyCachedCallObserverAsync is only a hook handing telemetry context to an ICachedCallLifecycleObserver — the audit bridge wired in ScadaLink.AuditLog. The S&F module is not the table's owner; SiteRuntime is.

This is a real design-doc drift, not a code defect, and is flagged explicitly in the brief's "Module-specific notes". The drift matters because the design doc's discussion of the lifecycle — "immediate success writes a terminal Delivered tracking row directly here", "operator discard sets terminal Discarded", "central never mutates the mirror row directly" — places coordination responsibilities on the wrong component. A reader looking for the source of truth for Tracking.Status(id) would read Component-StoreAndForward.md and search src/ScadaLink.StoreAndForward/ in vain. The doc also lists Site Call Audit / Audit Log telemetry-emission as a S&F responsibility (line 22), but the emission actually happens via the AuditLog site component subscribing to ICachedCallLifecycleObserver.

Recommendation

Reconcile the doc with the code. The simplest fix is doc-side: update Component-StoreAndForward.md to scope its responsibilities back to the retry mechanism + replication + parked-message management, and add a cross-reference to a new (or existing) component doc for Operation Tracking (Component-SiteRuntime.md, or a new Component-OperationTracking.md). The code is internally consistent — the audit bridge subscribes to the observer hook, the SiteRuntime store writes the rows, the S&F engine emits attempt telemetry on the cached-call hot path — but the design doc is several refactors out of date. The hierarchical map should be:

  • Component-StoreAndForward.md → S&F buffer + Replication + Parked-message management + Notification forwarding to central + cached-call telemetry hook.
  • New doc / SiteRuntime doc → Operation Tracking Table semantics and lifecycle.
  • Component-SiteCallAudit.md / Component-AuditLog.md → telemetry emission + central-side mirror.

Resolution

Unresolved.

StoreAndForward-022 — NotifyCachedCallObserverAsync silently drops the entire audit lifecycle when the message id is not a parseable TrackedOperationId

Severity Low
Category Documentation & comments
Status Open
Location src/ScadaLink.StoreAndForward/StoreAndForwardService.cs:484:515

Description

NotifyCachedCallObserverAsync (the per-attempt observer notifier wired by the M3 Bundle E rollout) bails out with no audit emission when TrackedOperationId.TryParse(message.Id, out var trackedId) returns false (StoreAndForwardService.cs:510:515). The inline comment justifies the behaviour as back-compat for "pre-M3 message (random GUID-N id from S&F itself, no TrackedOperationId threaded in)", but the documented contract is broken in two ways:

  1. Silent dropping of every audit row, not just the first one. The skip means no Attempted row, no CachedResolve terminal row, no audit trail at all for that operation's S&F lifecycle — yet the rest of the system (script trust boundary, parked-message UI, etc.) still treats the operation as audit-tracked. The drop is not surfaced via a metric, log warning (the path is a silent return), or counter, so a misconfigured caller bypasses the audit hot path with zero feedback.

  2. The contract is hidden in field-level XML. The ICachedCallLifecycleObserver public interface contract (defined in ScadaLink.Commons) does not document that the observer will be silently skipped when the underlying S&F message id is not a GUID. A consumer reading the interface contract reasonably expects every cached-call attempt to surface — the audit pipeline depends on it. The silent-drop is an implementation detail of the S&F bridge that should be either lifted onto the contract or removed.

The engine itself mints GUID-N ids via Guid.NewGuid().ToString("N") (line 224), which TrackedOperationId.TryParse accepts, so the skip path is unreachable for engine-minted ids. It is reachable only for callers that supply their own messageId argument with a non-GUID format. The current callers (NotificationOutbox enqueue path with NotificationId, cached-call enqueue path with TrackedOperationId.ToString()) all supply GUID-shaped ids. The defect is latent — a future caller passing a non-GUID id would silently bypass audit.

Recommendation

Two options. The cheap fix: change the skip to a _logger.LogWarning with the offending id so a misconfigured caller is observable, and update the ICachedCallLifecycleObserver XML doc to mention the "non-GUID id → no telemetry" contract explicitly. The more correct fix: emit a still-audited row for the non-GUID case (e.g. synthesise a TrackedOperationId from the underlying id, or emit a distinguished "tracking-id-missing" audit row) so the audit pipeline never has silent holes. Add a regression test in CachedCallAttemptEmissionTests capturing the chosen contract — the existing Attempt_MessageIdNotAGuid_NoObserverNotification test pins today's silent-skip; if the fix is "log + skip", that test should be updated to also assert the log emission; if the fix is "emit anyway", the test should be replaced.

Resolution

Unresolved.

StoreAndForward-023 — siteId silently defaults to empty when no IStoreAndForwardSiteContext is registered, degrading audit telemetry correlation

Severity Low
Category Code organization & conventions
Status Open
Location src/ScadaLink.StoreAndForward/ServiceCollectionExtensions.cs:43:53; src/ScadaLink.StoreAndForward/StoreAndForwardService.cs:99, :524

Description

AddStoreAndForward's service-collection factory resolves the optional IStoreAndForwardSiteContext and falls back to string.Empty when not registered:

var siteContext = sp.GetService<IStoreAndForwardSiteContext>();
var siteId = siteContext?.SiteId ?? string.Empty;
return new StoreAndForwardService(storage, options, logger, replication,
    cachedCallObserver, siteId);

The constructor's parameter is even defaulted to "". The empty-string siteId flows straight into every emitted CachedCallAttemptContext.SourceSite, which the central audit pipeline uses as part of the (SourceSite, TrackedOperationId) correlation key. A host that registers an ICachedCallLifecycleObserver (the audit observer wired by AddAuditLog) but forgets to register IStoreAndForwardSiteContext will produce a stream of telemetry rows with SourceSite = "" — the central audit mirror cannot distinguish them by site, and the central-site routing of RetryParkedOperation/DiscardParkedOperation commands keyed on SourceSite will fail to find the owning site.

The Host's IStoreAndForwardSiteContext adapter and the audit observer registration are wired in lock-step, so the current configuration is correct, but the silent empty-string fallback is a contract hazard for future hosts (CLI test harness, second site cluster topology, etc.) and for tests that wire one without the other.

Recommendation

Make the contract explicit: when cachedCallObserver is non-null, require IStoreAndForwardSiteContext to be registered — throw an InvalidOperationException with a clear "Audit observer registered without a site context — register IStoreAndForwardSiteContext" message at construction time. When the audit observer is absent (no AddAuditLog), keep the empty-string default since _siteId is unused. Alternatively, change siteId from a parameter to a Func<string> resolved lazily from the service provider so a late-registered context still takes effect.

Resolution

Unresolved.

StoreAndForward-024 — StopAsync does not wait for an in-flight retry sweep, so disposed dependencies can be touched after shutdown

Severity Low
Category Concurrency & thread safety
Status Open
Location src/ScadaLink.StoreAndForward/StoreAndForwardService.cs:122:127, :136:143, :303:329

Description

StartAsync arms _retryTimer with _ => _ = RetryPendingMessagesAsync() (line 123). The _ = discards the returned Task, so when the timer fires the sweep runs fire and forget on a thread-pool thread. StopAsync (lines 136143) disposes the timer:

if (_retryTimer != null)
{
    await _retryTimer.DisposeAsync();
    _retryTimer = null;
}

Timer.DisposeAsync() returns once any in-flight timer callback has completed — but the timer callback in this service is a one-line _ = RetryPendingMessagesAsync() that synchronously returns immediately and leaves the actual sweep running on the thread pool. So Timer.DisposeAsync does not wait for the sweep; only for the synchronous _ = ... discarding step. StopAsync returns while a sweep is potentially still running, touching _storage (which the host will dispose), _replication (which the host will tear down), and _cachedCallObserver (whose downstream gRPC channel the host will shut down).

The host shutdown sequence (AkkaHostedService) tears down the actor system and the DI container after this service's StopAsync completes — meaning a sweep that runs past StopAsync can call into disposed SqliteConnections (yielding ObjectDisposedException, caught by the sweep's outer try/catch as a log) or, more seriously, push a replication operation into a half-disposed Akka actor pipeline and trigger noisy dead-letter warnings during a clean shutdown.

The race window is small (the sweep typically finishes in <100 ms in tests) but it is real, particularly when shutting down a site under load with a non-empty buffer.

Recommendation

Track in-flight sweep tasks and await them in StopAsync:

private Task? _currentSweep;

public async Task StopAsync()
{
    if (_retryTimer != null)
    {
        await _retryTimer.DisposeAsync();
        _retryTimer = null;
    }
    if (_currentSweep is { } sweep)
    {
        try { await sweep; } catch { /* logged inside RetryPendingMessagesAsync */ }
    }
}

Change the timer callback to:

_retryTimer = new Timer(_ => _currentSweep = RetryPendingMessagesAsync(), ...);

Add a CancellationTokenSource so a long sweep can be cooperatively aborted on stop; plumb the cancellation token into _storage / _replication / _cachedCallObserver calls. Add a regression test in StoreAndForwardServiceTests that calls StopAsync mid-sweep and asserts no further storage activity occurs after StopAsync returns.

Resolution

Unresolved.