555 lines
27 KiB
Markdown
555 lines
27 KiB
Markdown
# Code Review — StoreAndForward
|
||
|
||
| Field | Value |
|
||
|-------|-------|
|
||
| Module | `src/ScadaLink.StoreAndForward` |
|
||
| Design doc | `docs/requirements/Component-StoreAndForward.md` |
|
||
| Status | Reviewed |
|
||
| Last reviewed | 2026-05-16 |
|
||
| Reviewer | claude-agent |
|
||
| Commit reviewed | `9c60592` |
|
||
| Open findings | 11 |
|
||
|
||
## Summary
|
||
|
||
The Store-and-Forward module is small and readable, with a clean SQLite persistence
|
||
layer, a sensible service API, and reasonable test coverage of the storage and service
|
||
happy paths. However the review surfaced two issues that undermine the module's core
|
||
purpose. First, the active delivery path never invokes the `ReplicationService` —
|
||
`ReplicateEnqueue/Remove/Park` have no callers anywhere in the codebase, so buffered
|
||
messages are not replicated to the standby node and the design's failover-durability
|
||
guarantee (Component doc "Persistence", CLAUDE.md "Store-and-Forward") is not met.
|
||
Second, there is an off-by-one in retry accounting: the immediate-failure path stores a
|
||
buffered message with `RetryCount = 1`, so a message configured with `MaxRetries = N`
|
||
is actually attempted `N` times in total rather than one immediate attempt plus `N`
|
||
retries, and a per-source `MaxRetries` of 1 produces zero retry attempts. Additional
|
||
themes: SQLite connection-per-call with no transactional grouping of multi-statement
|
||
operations, no concurrency guard against a parked message being retried while the
|
||
sweep is mid-flight, an unused enum member (`InFlight`) that drifts from the documented
|
||
status set, and untested critical paths (retry-due timing, replication-from-active,
|
||
the actor bridge). None of the findings are blockers for compilation, but the
|
||
replication and retry-count issues are functional defects against the design.
|
||
|
||
## Checklist coverage
|
||
|
||
| # | Category | Examined | Notes |
|
||
|---|----------|----------|-------|
|
||
| 1 | Correctness & logic bugs | ☑ | Off-by-one in retry counting (003); parked-message retry timing (010). |
|
||
| 2 | Akka.NET conventions | ☑ | `ContinueWith` used instead of `PipeTo`-friendly continuations; default supervision; see 007. |
|
||
| 3 | Concurrency & thread safety | ☑ | Sweep guarded by `Interlocked`, but no guard against retry-vs-manage races (005); `OnActivity` event not thread-safe (009). |
|
||
| 4 | Error handling & resilience | ☑ | Replication never invoked from active path (001); no-handler messages buffered then stuck (002). |
|
||
| 5 | Security | ☑ | No issues found — parameterised SQL throughout; no secrets handled directly; payload JSON treated opaquely. |
|
||
| 6 | Performance & resource management | ☑ | New SQLite connection per call; multi-statement operations not wrapped in a transaction (006, 008). |
|
||
| 7 | Design-document adherence | ☑ | Replication gap (001); `InFlight` status undocumented/unused (011); "retrying" status from design doc not modelled. |
|
||
| 8 | Code organization & conventions | ☑ | `StoreAndForwardMessage` is an entity-like POCO living in the component, not Commons (012). |
|
||
| 9 | Testing coverage | ☑ | Retry-due timing, replication-from-active, and `ParkedMessageHandlerActor` are untested (013). |
|
||
| 10 | Documentation & comments | ☑ | XML doc on `RegisterDeliveryHandler` contract is inconsistent with code (004). |
|
||
|
||
## Findings
|
||
|
||
### StoreAndForward-001 — Replication to standby is never triggered by the active node
|
||
|
||
| | |
|
||
|--|--|
|
||
| Severity | Critical |
|
||
| Category | Error handling & resilience |
|
||
| Status | Resolved |
|
||
| Location | `src/ScadaLink.StoreAndForward/ReplicationService.cs:40`, `:53`, `:66`; `src/ScadaLink.StoreAndForward/StoreAndForwardService.cs:155`, `:212`, `:222`, `:236` |
|
||
|
||
**Description**
|
||
|
||
`ReplicationService` exposes `ReplicateEnqueue`, `ReplicateRemove` and `ReplicatePark`
|
||
to forward buffer operations to the standby node, but a codebase-wide search shows these
|
||
methods have no callers. `StoreAndForwardService` — which performs every add (`EnqueueAsync`
|
||
line 155 / 163), remove (`RemoveMessageAsync` call at line 212) and park
|
||
(`UpdateMessageAsync` calls at lines 222/236) — holds no reference to `ReplicationService`
|
||
and never invokes it. Only the receiving half is wired (`SetReplicationHandler` and
|
||
`ApplyReplicatedOperationAsync` are used by `SiteReplicationActor`). The Component design
|
||
doc ("Persistence") and CLAUDE.md ("Store-and-Forward") require the active node to
|
||
forward every buffer operation to the standby so that, on failover, the new active node
|
||
"has a near-complete copy of the buffer." As written, the standby's S&F SQLite database
|
||
stays empty and a failover loses the entire buffer — a data-loss defect against a core
|
||
requirement.
|
||
|
||
**Recommendation**
|
||
|
||
Inject `ReplicationService` into `StoreAndForwardService` and call `ReplicateEnqueue`
|
||
after a successful `_storage.EnqueueAsync`, `ReplicateRemove` after `RemoveMessageAsync`,
|
||
and `ReplicatePark` after a park-causing `UpdateMessageAsync`. Update
|
||
`ServiceCollectionExtensions.AddStoreAndForward` to pass the dependency. Add a test that
|
||
asserts the replication handler observes each operation type.
|
||
|
||
**Resolution**
|
||
|
||
Resolved 2026-05-16. `ReplicationService` is now injected into `StoreAndForwardService`
|
||
(wired in `AddStoreAndForward`), and every buffer operation is forwarded to the standby:
|
||
a new `BufferAsync` helper calls `ReplicateEnqueue` after each persist, `ReplicateRemove`
|
||
runs after a successful retry removes a message, and `ReplicatePark` runs on both park
|
||
paths. Replication stays fire-and-forget and is a no-op when `ReplicationEnabled` is
|
||
false or no handler is wired. Regression tests `StoreAndForwardReplicationTests` assert
|
||
the replication handler observes the Add, Remove and Park operations. Fixed by the
|
||
commit whose message references `StoreAndForward-001`.
|
||
|
||
### StoreAndForward-002 — Messages enqueued with no registered handler are buffered but never deliverable
|
||
|
||
| | |
|
||
|--|--|
|
||
| Severity | ~~High~~ → Low (re-triaged) |
|
||
| Category | Error handling & resilience |
|
||
| Status | Open |
|
||
| Location | `src/ScadaLink.StoreAndForward/StoreAndForwardService.cs:162`, `:201` |
|
||
|
||
**Description**
|
||
|
||
`EnqueueAsync` falls through to "No handler registered — buffer for later" (line 162)
|
||
when no delivery handler is registered for the category. The retry sweep
|
||
(`RetryMessageAsync`, line 201) then logs "No delivery handler for category" and
|
||
`return`s without touching the message. No caller in the codebase ever calls
|
||
`RegisterDeliveryHandler` (the External System Gateway, Notification Service and
|
||
Database Gateway only call `EnqueueAsync`), so in the current wiring **every** buffered
|
||
message lands in this dead state: it is persisted, counts toward buffer depth, but can
|
||
never be retried, delivered or parked. It will sit Pending forever. Either the handler
|
||
registration is missing from Host/gateway startup, or the "buffer for later" path is a
|
||
silent trap. Either way the engine cannot deliver anything.
|
||
|
||
**Recommendation**
|
||
|
||
Decide the intended contract. If handlers are expected to be registered before
|
||
`EnqueueAsync` is reachable, make `EnqueueAsync` reject (or log an error) when no
|
||
handler exists rather than silently buffering an undeliverable message, and wire
|
||
`RegisterDeliveryHandler` calls in Host startup for all three categories. If late
|
||
registration is intended, the retry sweep should treat a still-missing handler as a
|
||
transient condition with bounded logging rather than a permanent no-op.
|
||
|
||
**Re-triage note (2026-05-16)**
|
||
|
||
The finding's central factual claim — *"No caller in the codebase ever calls
|
||
`RegisterDeliveryHandler`"* and therefore *"every buffered message lands in this dead
|
||
state"* — is **no longer true at the reviewed code**. `ScadaLink.Host`
|
||
(`AkkaHostedService.RegisterSiteActors`, `AkkaHostedService.cs:353-379`) registers all
|
||
three delivery handlers (`ExternalSystem`, `CachedDbWrite`, `Notification`) at site
|
||
startup, immediately after `StoreAndForwardService.StartAsync()`. The finding was
|
||
written against commit `9c60592` before that wiring existed; the High-severity
|
||
"engine cannot deliver anything" outcome no longer occurs.
|
||
|
||
The remaining residual risk is narrow: a message enqueued for a category that genuinely
|
||
has no handler (e.g. an enqueue racing ahead of `RegisterDeliveryHandler`, or a future
|
||
category added without a handler) is still buffered and then skipped by the sweep
|
||
forever. That is a real but minor robustness gap, hence the **downgrade to Low**.
|
||
|
||
It is left **Open** rather than fixed in this pass because the finding's recommended
|
||
fix — making `EnqueueAsync` reject when no handler is registered — is a behavioural
|
||
contract change, not a localised bug fix: the "buffer with no handler yet" path is
|
||
exercised by `StoreAndForwardReplicationTests` and by three NotificationService and
|
||
ExternalSystemGateway tests (`Send_TransientError_WithStoreAndForward_BuffersMessage`,
|
||
`Send_Smtp4xxCommandException_ClassifiedTransientAndBuffered`,
|
||
`Send_SmtpProtocolException_ClassifiedTransient`) which construct a real
|
||
`StoreAndForwardService` without registering a handler and assert `WasBuffered`.
|
||
Changing the contract requires deciding whether late handler registration is supported
|
||
and updating tests in modules outside this review's edit scope — a design decision that
|
||
should be made deliberately rather than forced here.
|
||
|
||
**Resolution**
|
||
|
||
_Open — re-triaged to Low. Premise (no handler registration anywhere) is stale: Host
|
||
now wires all three handlers. Residual gap is minor and the prescribed fix is a
|
||
cross-module contract change needing a design decision._
|
||
|
||
### StoreAndForward-003 — Off-by-one in retry accounting: immediate failure pre-counts as retry 1
|
||
|
||
| | |
|
||
|--|--|
|
||
| Severity | High |
|
||
| Category | Correctness & logic bugs |
|
||
| Status | Resolved |
|
||
| Location | `src/ScadaLink.StoreAndForward/StoreAndForwardService.cs:153`, `:229`, `:233` |
|
||
|
||
**Description**
|
||
|
||
On a transient immediate-delivery failure, `EnqueueAsync` buffers the message with
|
||
`message.RetryCount = 1` (line 153). The retry sweep then increments `RetryCount` before
|
||
the max check (`RetryCount++` at line 229; `RetryCount >= MaxRetries` at line 233).
|
||
Consequences: (1) a message configured with `MaxRetries = 1` is parked on the *first*
|
||
retry sweep without ever being retried, because after the immediate attempt `RetryCount`
|
||
is already 1 and the first sweep makes it 2 ≥ 1 — zero actual retries occur, contradicting
|
||
the design intent that the immediate attempt and the retry budget are distinct;
|
||
(2) the design doc's `Retry Count` field is "Number of attempts so far," but here it is
|
||
seeded to 1 before any *retry* has happened, making the parked-message `AttemptCount`
|
||
shown to operators off by one relative to configured `MaxRetries`. The
|
||
`EnqueueAsync_TransientFailure_BuffersForRetry` test even asserts `RetryCount == 1`,
|
||
locking in the ambiguity.
|
||
|
||
**Recommendation**
|
||
|
||
Choose one consistent meaning for `RetryCount` (recommended: total delivery attempts,
|
||
including the immediate one) and apply it uniformly. If `MaxRetries` is meant to bound
|
||
*retries* after the immediate attempt, buffer with `RetryCount = 0` and treat the
|
||
immediate failure as attempt 0; if it bounds *total attempts*, document that and adjust
|
||
the comparison. Update the affected test to match the chosen semantics.
|
||
|
||
**Resolution**
|
||
|
||
Resolved 2026-05-16 (commit `<pending>`). `RetryCount` now consistently means "number
|
||
of background retry-sweep attempts so far"; the initial immediate (or caller-made)
|
||
delivery attempt is attempt 0 and is not counted, and `MaxRetries` bounds retry-sweep
|
||
attempts after that initial attempt. `EnqueueAsync` no longer seeds `RetryCount = 1` on
|
||
either the transient-immediate-failure path or the `attemptImmediateDelivery: false`
|
||
path — a freshly buffered message has `RetryCount = 0`. `RetryMessageAsync` already
|
||
increments before the `>= MaxRetries` check, which is now correct, so a message with
|
||
`MaxRetries = 1` gets exactly one real retry before parking (previously zero). The
|
||
`StoreAndForwardMessage.RetryCount` XML doc was corrected to match. Regression test
|
||
`RetryPendingMessagesAsync_MaxRetriesOne_PerformsExactlyOneRetryBeforeParking` asserts
|
||
the immediate attempt plus exactly one retry occur before parking; the affected
|
||
existing tests (`EnqueueAsync_TransientFailure_BuffersForRetry`,
|
||
`EnqueueAsync_AttemptImmediateDeliveryFalse_BuffersWithoutInvokingHandler`,
|
||
`RetryPendingMessagesAsync_MaxRetriesReached_ParksMessage`) were updated to the
|
||
corrected semantics.
|
||
|
||
### StoreAndForward-004 — `RegisterDeliveryHandler` XML doc contradicts the implemented contract
|
||
|
||
| | |
|
||
|--|--|
|
||
| Severity | Medium |
|
||
| Category | Documentation & comments |
|
||
| Status | Open |
|
||
| Location | `src/ScadaLink.StoreAndForward/StoreAndForwardService.cs:38`, `:60` |
|
||
|
||
**Description**
|
||
|
||
The XML comment on the handler delegate (lines 37–40) says "Returns true on success,
|
||
throws on transient failure. Permanent failures should return false (message will NOT
|
||
be buffered)." That last clause is wrong for the retry path: in `RetryMessageAsync`,
|
||
a handler returning `false` does not "not buffer" — the message is already buffered, and
|
||
the code *parks* it immediately (lines 218–224). The comment describes only the
|
||
`EnqueueAsync` immediate path and misleads anyone implementing a handler about what
|
||
`false` means once a message is in the retry loop.
|
||
|
||
**Recommendation**
|
||
|
||
Reword the contract to cover both paths explicitly: `true` = delivered (remove from
|
||
buffer); `false` = permanent failure (not buffered on immediate attempt, parked on a
|
||
retry); exception = transient failure (buffer / increment retry).
|
||
|
||
**Resolution**
|
||
|
||
_Unresolved._
|
||
|
||
### StoreAndForward-005 — Parked-message retry/discard can race with the in-progress retry sweep
|
||
|
||
| | |
|
||
|--|--|
|
||
| Severity | Medium |
|
||
| Category | Concurrency & thread safety |
|
||
| Status | Open |
|
||
| Location | `src/ScadaLink.StoreAndForward/StoreAndForwardService.cs:184`, `:266`, `:280` |
|
||
|
||
**Description**
|
||
|
||
`RetryPendingMessagesAsync` loads a snapshot of due messages (line 179) and then
|
||
processes them one by one (line 184), `await`-ing delivery for each. Meanwhile
|
||
`RetryParkedMessageAsync` / `DiscardParkedMessageAsync` (operator actions arriving via
|
||
`ParkedMessageHandlerActor`) run on unrelated threads and mutate the same rows. Because
|
||
each operation opens its own SQLite connection and there is no row-level coordination,
|
||
an operator can `DiscardParkedMessageAsync` a message that the sweep is concurrently
|
||
delivering: the sweep's later `RemoveMessageAsync`/`UpdateMessageAsync` operates on a
|
||
now-deleted row (harmless) — but if an operator `RetryParkedMessageAsync` resets a row
|
||
to Pending while the sweep simultaneously parks the same in-flight message, the operator
|
||
intent is silently overwritten. The `Interlocked` guard only prevents *overlapping
|
||
sweeps*, not sweep-vs-management races.
|
||
|
||
**Recommendation**
|
||
|
||
Funnel all message-state mutations through a single serialization point — e.g. perform
|
||
all S&F state changes inside the `ParkedMessageHandlerActor` (or a dedicated S&F actor)
|
||
so the actor mailbox serialises them, or make status transitions conditional in SQL
|
||
(e.g. `UPDATE ... WHERE id = @id AND status = @expected`) and re-check the affected
|
||
row count.
|
||
|
||
**Resolution**
|
||
|
||
_Unresolved._
|
||
|
||
### StoreAndForward-006 — `GetParkedMessagesAsync` count and page run without a transaction
|
||
|
||
| | |
|
||
|--|--|
|
||
| Severity | Low |
|
||
| Category | Performance & resource management |
|
||
| Status | Open |
|
||
| Location | `src/ScadaLink.StoreAndForward/StoreAndForwardStorage.cs:166`, `:175` |
|
||
|
||
**Description**
|
||
|
||
`GetParkedMessagesAsync` issues a `COUNT(*)` and then a separate paged `SELECT` on two
|
||
commands on the same connection with no surrounding transaction. A concurrent
|
||
enqueue/park/discard between the two statements yields a `TotalCount` inconsistent with
|
||
the returned page (e.g. total reported as 51 while only 50 distinct parked rows now
|
||
exist, or a row visible in the page but excluded from the count). For a paginated UI
|
||
this produces flickering totals and occasional off-by-one page math.
|
||
|
||
**Recommendation**
|
||
|
||
Wrap both reads in a single transaction (`BeginTransaction`) so they see a consistent
|
||
snapshot, or accept the staleness and document it. A transaction is cheap here and
|
||
removes the inconsistency.
|
||
|
||
**Resolution**
|
||
|
||
_Unresolved._
|
||
|
||
### StoreAndForward-007 — Async work in `ParkedMessageHandlerActor` uses `ContinueWith` without scheduler/affinity guarantees
|
||
|
||
| | |
|
||
|--|--|
|
||
| Severity | Low |
|
||
| Category | Akka.NET conventions |
|
||
| Status | Open |
|
||
| Location | `src/ScadaLink.StoreAndForward/ParkedMessageHandlerActor.cs:34`, `:68`, `:87` |
|
||
|
||
**Description**
|
||
|
||
The three handlers call a `Task`-returning service method and chain `.ContinueWith(...)
|
||
.PipeTo(sender)`. `Sender` is correctly captured into a local first, so the closure is
|
||
safe. However `ContinueWith` without an explicit `TaskScheduler` runs the continuation
|
||
on a thread-pool thread and the captured continuation builds the response objects there
|
||
— acceptable since it only touches locals, but it bypasses the idiomatic
|
||
`PipeTo`-with-success/failure-projection pattern and is fragile if someone later adds a
|
||
line touching actor state inside the continuation. There is also no `TaskContinuationOptions`,
|
||
so a faulted antecedent still runs the continuation (handled here via `IsCompletedSuccessfully`,
|
||
but only by convention).
|
||
|
||
**Recommendation**
|
||
|
||
Replace `ContinueWith(...).PipeTo(sender)` with `PipeTo(sender, success: result => ...,
|
||
failure: ex => ...)`, which is the documented Akka pattern, keeps response construction
|
||
off the actor thread safely, and makes the success/failure branches explicit.
|
||
|
||
**Resolution**
|
||
|
||
_Unresolved._
|
||
|
||
### StoreAndForward-008 — A SQLite connection is opened and torn down on every storage call
|
||
|
||
| | |
|
||
|--|--|
|
||
| Severity | Low |
|
||
| Category | Performance & resource management |
|
||
| Status | Open |
|
||
| Location | `src/ScadaLink.StoreAndForward/StoreAndForwardStorage.cs:28`, `:61`, `:93`, `:117`, `:144`, `:162`, `:199`, `:221`, `:237`, `:267`, `:285`, `:305`, `:319` |
|
||
|
||
**Description**
|
||
|
||
Every method in `StoreAndForwardStorage` constructs a fresh `SqliteConnection` and calls
|
||
`OpenAsync`. Microsoft.Data.Sqlite pools connections, so this is not a correctness bug,
|
||
but a retry sweep over a large buffer performs one open per `UpdateMessageAsync`/
|
||
`RemoveMessageAsync` call inside the loop (`RetryMessageAsync`), multiplying connection
|
||
churn under load. With no max buffer size (by design) the buffer can grow large, so the
|
||
per-message connection acquisition is a measurable overhead on the hot retry path.
|
||
|
||
**Recommendation**
|
||
|
||
Consider a batched retry API that opens one connection (and one transaction) per sweep,
|
||
or pass an open connection into the per-message update calls. At minimum, document that
|
||
the design relies on the Sqlite connection pool for acceptable performance.
|
||
|
||
**Resolution**
|
||
|
||
_Unresolved._
|
||
|
||
### StoreAndForward-009 — `OnActivity` event invocation is not thread-safe against concurrent subscribe/unsubscribe
|
||
|
||
| | |
|
||
|--|--|
|
||
| Severity | Low |
|
||
| Category | Concurrency & thread safety |
|
||
| Status | Open |
|
||
| Location | `src/ScadaLink.StoreAndForward/StoreAndForwardService.cs:46`, `:309` |
|
||
|
||
**Description**
|
||
|
||
`OnActivity` is a public `event Action<...>` raised via `OnActivity?.Invoke(...)` in
|
||
`RaiseActivity` (line 309). `RaiseActivity` is called from both `EnqueueAsync` (caller
|
||
thread) and `RetryMessageAsync` (timer thread). The `?.Invoke` null-conditional captures
|
||
the delegate once so it will not NRE, but there is no synchronisation around the event
|
||
field itself; a subscriber added/removed concurrently with a raise has no defined
|
||
ordering. More importantly, subscriber callbacks run synchronously on the timer thread,
|
||
so a slow or throwing subscriber stalls or aborts the retry sweep (an exception in a
|
||
subscriber propagates out of `RaiseActivity` into `RetryMessageAsync`'s `try` and is
|
||
swallowed as a "transient failure," wrongly incrementing the message's retry count).
|
||
|
||
**Recommendation**
|
||
|
||
Snapshot the delegate (already done) and additionally wrap subscriber invocation in a
|
||
`try/catch` so a faulting logging subscriber cannot be misclassified as a delivery
|
||
failure. Document that handlers must be fast and non-throwing, or dispatch activity
|
||
notifications asynchronously.
|
||
|
||
**Resolution**
|
||
|
||
_Unresolved._
|
||
|
||
### StoreAndForward-010 — Retry of a parked message does not reset `LastAttemptAt`, so its retry timing is unspecified
|
||
|
||
| | |
|
||
|--|--|
|
||
| Severity | Medium |
|
||
| Category | Correctness & logic bugs |
|
||
| Status | Open |
|
||
| Location | `src/ScadaLink.StoreAndForward/StoreAndForwardStorage.cs:203`, `:101` |
|
||
|
||
**Description**
|
||
|
||
`RetryParkedMessageAsync` sets `status = Pending, retry_count = 0, last_error = NULL`
|
||
but leaves `last_attempt_at` unchanged (line 203–206). The retry-due query
|
||
(`GetMessagesForRetryAsync`, line 101–105) selects Pending rows where
|
||
`last_attempt_at IS NULL OR ... elapsed >= retry_interval_ms`. A message parked after
|
||
exhausting retries has an old `last_attempt_at`; once re-queued, the elapsed time since
|
||
that stale timestamp is almost certainly already greater than the retry interval, so the
|
||
operator-retried message is attempted on the very next sweep regardless of the
|
||
configured interval. That is probably the desired behaviour (operator wants it tried
|
||
now), but it is unspecified and inconsistent — if `retry_interval_ms` were very large the
|
||
behaviour would instead be "try immediately" by accident rather than by design.
|
||
|
||
**Recommendation**
|
||
|
||
Explicitly decide and encode the intent: either set `last_attempt_at = NULL` on
|
||
re-queue so the message is unambiguously due now, or set it to "now" so it waits one
|
||
interval. Document the chosen behaviour in the method's XML comment.
|
||
|
||
**Resolution**
|
||
|
||
_Unresolved._
|
||
|
||
### StoreAndForward-011 — `StoreAndForwardMessageStatus.InFlight` is unused and the doc's "retrying" status is unmodelled
|
||
|
||
| | |
|
||
|--|--|
|
||
| Severity | Low |
|
||
| Category | Design-document adherence |
|
||
| Status | Open |
|
||
| Location | `src/ScadaLink.Commons/Types/Enums/StoreAndForwardMessageStatus.cs:9`; `src/ScadaLink.StoreAndForward/StoreAndForwardService.cs:219`, `:235` |
|
||
|
||
**Description**
|
||
|
||
The enum defines `Pending, InFlight, Parked, Delivered`. The module only ever uses
|
||
`Pending` and `Parked` — `InFlight` and `Delivered` are never assigned (delivered
|
||
messages are deleted, not marked `Delivered`). Meanwhile the Component design doc
|
||
("Message Format" -> Status) specifies the set "Pending, retrying, or parked." So the
|
||
code's enum drifts from the doc in two directions: it carries dead members the doc does
|
||
not mention (`InFlight`, `Delivered`) and omits the doc's `retrying` state. A message
|
||
mid-retry is indistinguishable from one that has never been attempted.
|
||
|
||
**Recommendation**
|
||
|
||
Reconcile the enum with the design. Either drop the unused members and update the doc,
|
||
or implement the documented `retrying` state and use `InFlight` to mark a message the
|
||
sweep is actively delivering (which would also help with finding 005).
|
||
|
||
**Resolution**
|
||
|
||
_Unresolved._
|
||
|
||
### StoreAndForward-012 — `StoreAndForwardMessage` is a persistence entity but lives in the component, not Commons
|
||
|
||
| | |
|
||
|--|--|
|
||
| Severity | Low |
|
||
| Category | Code organization & conventions |
|
||
| Status | Open |
|
||
| Location | `src/ScadaLink.StoreAndForward/StoreAndForwardMessage.cs:9` |
|
||
|
||
**Description**
|
||
|
||
`StoreAndForwardMessage` is a persistence-ignorant POCO that maps directly to the
|
||
`sf_messages` table and is also carried across the network inside `ReplicationOperation`
|
||
(replicated to the standby node over Akka remoting). CLAUDE.md "Code Organization" states
|
||
that entity classes are persistence-ignorant POCOs in Commons and that message contracts
|
||
follow additive-only evolution. Because this type doubles as a replication wire contract
|
||
but lives in the component assembly, it is not co-located with the other Commons
|
||
entities and its evolution is not governed by the additive-only message-contract rule.
|
||
This is a borderline case (the type is site-local), but the cross-node use via
|
||
`ReplicationOperation` makes it a de-facto message contract.
|
||
|
||
**Recommendation**
|
||
|
||
Either move `StoreAndForwardMessage` (and `ReplicationOperation`) into the Commons
|
||
`Entities`/`Messages` hierarchy so they are governed by the contract-evolution rules, or
|
||
introduce a separate DTO for replication and keep `StoreAndForwardMessage` purely as the
|
||
local persistence model. Document the decision.
|
||
|
||
**Resolution**
|
||
|
||
_Unresolved._
|
||
|
||
### StoreAndForward-013 — Critical paths lack test coverage: retry-due timing, replication-from-active, and the actor bridge
|
||
|
||
| | |
|
||
|--|--|
|
||
| Severity | Medium |
|
||
| Category | Testing coverage |
|
||
| Status | Open |
|
||
| Location | `tests/ScadaLink.StoreAndForward.Tests/` (whole directory); `src/ScadaLink.StoreAndForward/StoreAndForwardStorage.cs:101`; `src/ScadaLink.StoreAndForward/ParkedMessageHandlerActor.cs` |
|
||
|
||
**Description**
|
||
|
||
The existing tests cover storage CRUD and the service happy/failure paths well, but
|
||
three important behaviours are untested: (1) the retry-due time filter in
|
||
`GetMessagesForRetryAsync` — every service test sets `DefaultRetryInterval = TimeSpan.Zero`,
|
||
so the `julianday` elapsed-time comparison (the most error-prone SQL in the module) is
|
||
never exercised with a non-zero interval; a message that is *not yet due* should be
|
||
skipped, and that is never verified. (2) Replication from the active side — no test
|
||
asserts that an enqueue/remove/park causes a `Replicate*` call (this is exactly the gap
|
||
behind finding 001; a test would have caught it). (3) `ParkedMessageHandlerActor` has no
|
||
test at all — the Query/Retry/Discard request-to-response mapping and the
|
||
`ExtractMethodName` JSON parsing are unverified, including the malformed-JSON branch.
|
||
|
||
**Recommendation**
|
||
|
||
Add tests for: a non-zero retry interval where a recently-attempted message is excluded
|
||
and an older one is included; active-side replication invocation per operation type
|
||
(once finding 001 is fixed); and `ParkedMessageHandlerActor` using `Akka.TestKit`,
|
||
including `ExtractMethodName` for `MethodName`, `Subject`, missing-property and
|
||
invalid-JSON payloads.
|
||
|
||
**Resolution**
|
||
|
||
_Unresolved._
|
||
|
||
### StoreAndForward-014 — Storage does not create its SQLite database directory
|
||
|
||
| | |
|
||
|--|--|
|
||
| Severity | Medium |
|
||
| Category | Error handling & resilience |
|
||
| Status | Resolved |
|
||
| Location | `src/ScadaLink.StoreAndForward/StoreAndForwardStorage.cs:26` |
|
||
|
||
**Found 2026-05-16** while verifying the store-and-forward fixes — this defect was
|
||
not part of the original baseline review.
|
||
|
||
**Description**
|
||
|
||
`StoreAndForwardStorage.InitializeAsync` opened a `SqliteConnection` against the
|
||
configured `SqliteDbPath` (default `./data/store-and-forward.db`) without ensuring the
|
||
parent directory exists. SQLite creates the database *file* on demand but not its
|
||
*directory*, so when `data/` does not already exist the connection fails to open with
|
||
`SQLite Error 14: 'unable to open database file'`. Every site-host boot therefore failed
|
||
in any environment whose working directory has no `data/` folder — the cause of the six
|
||
failing `SiteActorPathTests` (the host's `RegisterSiteActors` aborts at
|
||
`StoreAndForwardService.StartAsync`). Production masked it because `data/` is created by
|
||
the Docker image / deployment.
|
||
|
||
**Recommendation**
|
||
|
||
Create the parent directory of a file-backed SQLite database before opening it.
|
||
|
||
**Resolution**
|
||
|
||
Resolved 2026-05-16. `InitializeAsync` now calls a new `EnsureDatabaseDirectoryExists`
|
||
helper that parses the connection string with `SqliteConnectionStringBuilder` and, for a
|
||
file-backed database, creates the parent directory if it is missing (in-memory databases
|
||
and bare filenames are skipped). Regression test
|
||
`InitializeAsync_FileInMissingDirectory_CreatesDirectory` fails against the pre-fix code;
|
||
all six `SiteActorPathTests` now pass. Fixed by the commit whose message references
|
||
`StoreAndForward-014`.
|