docs: add code review process and baseline review of all 19 modules

Establishes a per-module code review workflow under code-reviews/ and records the 2026-05-16 baseline review (commit 9c60592): 241 findings across all src/ modules (6 Critical, 46 High, 100 Medium, 89 Low). This is the clean starting point for remediation work.
2026-05-16 18:09:09 -04:00
parent 9c60592632
commit 977d7369a7
23 changed files with 8899 additions and 0 deletions
--- a/code-reviews/StoreAndForward/findings.md
+++ b/code-reviews/StoreAndForward/findings.md
@@ -0,0 +1,465 @@
+# Code Review — StoreAndForward
+
+| Field | Value |
+|-------|-------|
+| Module | `src/ScadaLink.StoreAndForward` |
+| Design doc | `docs/requirements/Component-StoreAndForward.md` |
+| Status | Reviewed |
+| Last reviewed | 2026-05-16 |
+| Reviewer | claude-agent |
+| Commit reviewed | `9c60592` |
+| Open findings | 13 |
+
+## Summary
+
+The Store-and-Forward module is small and readable, with a clean SQLite persistence
+layer, a sensible service API, and reasonable test coverage of the storage and service
+happy paths. However the review surfaced two issues that undermine the module's core
+purpose. First, the active delivery path never invokes the `ReplicationService` —
+`ReplicateEnqueue/Remove/Park` have no callers anywhere in the codebase, so buffered
+messages are not replicated to the standby node and the design's failover-durability
+guarantee (Component doc "Persistence", CLAUDE.md "Store-and-Forward") is not met.
+Second, there is an off-by-one in retry accounting: the immediate-failure path stores a
+buffered message with `RetryCount = 1`, so a message configured with `MaxRetries = N`
+is actually attempted `N` times in total rather than one immediate attempt plus `N`
+retries, and a per-source `MaxRetries` of 1 produces zero retry attempts. Additional
+themes: SQLite connection-per-call with no transactional grouping of multi-statement
+operations, no concurrency guard against a parked message being retried while the
+sweep is mid-flight, an unused enum member (`InFlight`) that drifts from the documented
+status set, and untested critical paths (retry-due timing, replication-from-active,
+the actor bridge). None of the findings are blockers for compilation, but the
+replication and retry-count issues are functional defects against the design.
+
+## Checklist coverage
+
+| # | Category | Examined | Notes |
+|---|----------|----------|-------|
+| 1 | Correctness & logic bugs | ☑ | Off-by-one in retry counting (003); parked-message retry timing (010). |
+| 2 | Akka.NET conventions | ☑ | `ContinueWith` used instead of `PipeTo`-friendly continuations; default supervision; see 007. |
+| 3 | Concurrency & thread safety | ☑ | Sweep guarded by `Interlocked`, but no guard against retry-vs-manage races (005); `OnActivity` event not thread-safe (009). |
+| 4 | Error handling & resilience | ☑ | Replication never invoked from active path (001); no-handler messages buffered then stuck (002). |
+| 5 | Security | ☑ | No issues found — parameterised SQL throughout; no secrets handled directly; payload JSON treated opaquely. |
+| 6 | Performance & resource management | ☑ | New SQLite connection per call; multi-statement operations not wrapped in a transaction (006, 008). |
+| 7 | Design-document adherence | ☑ | Replication gap (001); `InFlight` status undocumented/unused (011); "retrying" status from design doc not modelled. |
+| 8 | Code organization & conventions | ☑ | `StoreAndForwardMessage` is an entity-like POCO living in the component, not Commons (012). |
+| 9 | Testing coverage | ☑ | Retry-due timing, replication-from-active, and `ParkedMessageHandlerActor` are untested (013). |
+| 10 | Documentation & comments | ☑ | XML doc on `RegisterDeliveryHandler` contract is inconsistent with code (004). |
+
+## Findings
+
+### StoreAndForward-001 — Replication to standby is never triggered by the active node
+
+| | |
+|--|--|
+| Severity | Critical |
+| Category | Error handling & resilience |
+| Status | Open |
+| Location | `src/ScadaLink.StoreAndForward/ReplicationService.cs:40`, `:53`, `:66`; `src/ScadaLink.StoreAndForward/StoreAndForwardService.cs:155`, `:212`, `:222`, `:236` |
+
+**Description**
+
+`ReplicationService` exposes `ReplicateEnqueue`, `ReplicateRemove` and `ReplicatePark`
+to forward buffer operations to the standby node, but a codebase-wide search shows these
+methods have no callers. `StoreAndForwardService` — which performs every add (`EnqueueAsync`
+line 155 / 163), remove (`RemoveMessageAsync` call at line 212) and park
+(`UpdateMessageAsync` calls at lines 222/236) — holds no reference to `ReplicationService`
+and never invokes it. Only the receiving half is wired (`SetReplicationHandler` and
+`ApplyReplicatedOperationAsync` are used by `SiteReplicationActor`). The Component design
+doc ("Persistence") and CLAUDE.md ("Store-and-Forward") require the active node to
+forward every buffer operation to the standby so that, on failover, the new active node
+"has a near-complete copy of the buffer." As written, the standby's S&F SQLite database
+stays empty and a failover loses the entire buffer — a data-loss defect against a core
+requirement.
+
+**Recommendation**
+
+Inject `ReplicationService` into `StoreAndForwardService` and call `ReplicateEnqueue`
+after a successful `_storage.EnqueueAsync`, `ReplicateRemove` after `RemoveMessageAsync`,
+and `ReplicatePark` after a park-causing `UpdateMessageAsync`. Update
+`ServiceCollectionExtensions.AddStoreAndForward` to pass the dependency. Add a test that
+asserts the replication handler observes each operation type.
+
+**Resolution**
+
+_Unresolved._
+
+### StoreAndForward-002 — Messages enqueued with no registered handler are buffered but never deliverable
+
+| | |
+|--|--|
+| Severity | High |
+| Category | Error handling & resilience |
+| Status | Open |
+| Location | `src/ScadaLink.StoreAndForward/StoreAndForwardService.cs:162`, `:201` |
+
+**Description**
+
+`EnqueueAsync` falls through to "No handler registered — buffer for later" (line 162)
+when no delivery handler is registered for the category. The retry sweep
+(`RetryMessageAsync`, line 201) then logs "No delivery handler for category" and
+`return`s without touching the message. No caller in the codebase ever calls
+`RegisterDeliveryHandler` (the External System Gateway, Notification Service and
+Database Gateway only call `EnqueueAsync`), so in the current wiring **every** buffered
+message lands in this dead state: it is persisted, counts toward buffer depth, but can
+never be retried, delivered or parked. It will sit Pending forever. Either the handler
+registration is missing from Host/gateway startup, or the "buffer for later" path is a
+silent trap. Either way the engine cannot deliver anything.
+
+**Recommendation**
+
+Decide the intended contract. If handlers are expected to be registered before
+`EnqueueAsync` is reachable, make `EnqueueAsync` reject (or log an error) when no
+handler exists rather than silently buffering an undeliverable message, and wire
+`RegisterDeliveryHandler` calls in Host startup for all three categories. If late
+registration is intended, the retry sweep should treat a still-missing handler as a
+transient condition with bounded logging rather than a permanent no-op.
+
+**Resolution**
+
+_Unresolved._
+
+### StoreAndForward-003 — Off-by-one in retry accounting: immediate failure pre-counts as retry 1
+
+| | |
+|--|--|
+| Severity | High |
+| Category | Correctness & logic bugs |
+| Status | Open |
+| Location | `src/ScadaLink.StoreAndForward/StoreAndForwardService.cs:153`, `:229`, `:233` |
+
+**Description**
+
+On a transient immediate-delivery failure, `EnqueueAsync` buffers the message with
+`message.RetryCount = 1` (line 153). The retry sweep then increments `RetryCount` before
+the max check (`RetryCount++` at line 229; `RetryCount >= MaxRetries` at line 233).
+Consequences: (1) a message configured with `MaxRetries = 1` is parked on the *first*
+retry sweep without ever being retried, because after the immediate attempt `RetryCount`
+is already 1 and the first sweep makes it 2 ≥ 1 — zero actual retries occur, contradicting
+the design intent that the immediate attempt and the retry budget are distinct;
+(2) the design doc's `Retry Count` field is "Number of attempts so far," but here it is
+seeded to 1 before any *retry* has happened, making the parked-message `AttemptCount`
+shown to operators off by one relative to configured `MaxRetries`. The
+`EnqueueAsync_TransientFailure_BuffersForRetry` test even asserts `RetryCount == 1`,
+locking in the ambiguity.
+
+**Recommendation**
+
+Choose one consistent meaning for `RetryCount` (recommended: total delivery attempts,
+including the immediate one) and apply it uniformly. If `MaxRetries` is meant to bound
+*retries* after the immediate attempt, buffer with `RetryCount = 0` and treat the
+immediate failure as attempt 0; if it bounds *total attempts*, document that and adjust
+the comparison. Update the affected test to match the chosen semantics.
+
+**Resolution**
+
+_Unresolved._
+
+### StoreAndForward-004 — `RegisterDeliveryHandler` XML doc contradicts the implemented contract
+
+| | |
+|--|--|
+| Severity | Medium |
+| Category | Documentation & comments |
+| Status | Open |
+| Location | `src/ScadaLink.StoreAndForward/StoreAndForwardService.cs:38`, `:60` |
+
+**Description**
+
+The XML comment on the handler delegate (lines 37–40) says "Returns true on success,
+throws on transient failure. Permanent failures should return false (message will NOT
+be buffered)." That last clause is wrong for the retry path: in `RetryMessageAsync`,
+a handler returning `false` does not "not buffer" — the message is already buffered, and
+the code *parks* it immediately (lines 218–224). The comment describes only the
+`EnqueueAsync` immediate path and misleads anyone implementing a handler about what
+`false` means once a message is in the retry loop.
+
+**Recommendation**
+
+Reword the contract to cover both paths explicitly: `true` = delivered (remove from
+buffer); `false` = permanent failure (not buffered on immediate attempt, parked on a
+retry); exception = transient failure (buffer / increment retry).
+
+**Resolution**
+
+_Unresolved._
+
+### StoreAndForward-005 — Parked-message retry/discard can race with the in-progress retry sweep
+
+| | |
+|--|--|
+| Severity | Medium |
+| Category | Concurrency & thread safety |
+| Status | Open |
+| Location | `src/ScadaLink.StoreAndForward/StoreAndForwardService.cs:184`, `:266`, `:280` |
+
+**Description**
+
+`RetryPendingMessagesAsync` loads a snapshot of due messages (line 179) and then
+processes them one by one (line 184), `await`-ing delivery for each. Meanwhile
+`RetryParkedMessageAsync` / `DiscardParkedMessageAsync` (operator actions arriving via
+`ParkedMessageHandlerActor`) run on unrelated threads and mutate the same rows. Because
+each operation opens its own SQLite connection and there is no row-level coordination,
+an operator can `DiscardParkedMessageAsync` a message that the sweep is concurrently
+delivering: the sweep's later `RemoveMessageAsync`/`UpdateMessageAsync` operates on a
+now-deleted row (harmless) — but if an operator `RetryParkedMessageAsync` resets a row
+to Pending while the sweep simultaneously parks the same in-flight message, the operator
+intent is silently overwritten. The `Interlocked` guard only prevents *overlapping
+sweeps*, not sweep-vs-management races.
+
+**Recommendation**
+
+Funnel all message-state mutations through a single serialization point — e.g. perform
+all S&F state changes inside the `ParkedMessageHandlerActor` (or a dedicated S&F actor)
+so the actor mailbox serialises them, or make status transitions conditional in SQL
+(e.g. `UPDATE ... WHERE id = @id AND status = @expected`) and re-check the affected
+row count.
+
+**Resolution**
+
+_Unresolved._
+
+### StoreAndForward-006 — `GetParkedMessagesAsync` count and page run without a transaction
+
+| | |
+|--|--|
+| Severity | Low |
+| Category | Performance & resource management |
+| Status | Open |
+| Location | `src/ScadaLink.StoreAndForward/StoreAndForwardStorage.cs:166`, `:175` |
+
+**Description**
+
+`GetParkedMessagesAsync` issues a `COUNT(*)` and then a separate paged `SELECT` on two
+commands on the same connection with no surrounding transaction. A concurrent
+enqueue/park/discard between the two statements yields a `TotalCount` inconsistent with
+the returned page (e.g. total reported as 51 while only 50 distinct parked rows now
+exist, or a row visible in the page but excluded from the count). For a paginated UI
+this produces flickering totals and occasional off-by-one page math.
+
+**Recommendation**
+
+Wrap both reads in a single transaction (`BeginTransaction`) so they see a consistent
+snapshot, or accept the staleness and document it. A transaction is cheap here and
+removes the inconsistency.
+
+**Resolution**
+
+_Unresolved._
+
+### StoreAndForward-007 — Async work in `ParkedMessageHandlerActor` uses `ContinueWith` without scheduler/affinity guarantees
+
+| | |
+|--|--|
+| Severity | Low |
+| Category | Akka.NET conventions |
+| Status | Open |
+| Location | `src/ScadaLink.StoreAndForward/ParkedMessageHandlerActor.cs:34`, `:68`, `:87` |
+
+**Description**
+
+The three handlers call a `Task`-returning service method and chain `.ContinueWith(...)
+.PipeTo(sender)`. `Sender` is correctly captured into a local first, so the closure is
+safe. However `ContinueWith` without an explicit `TaskScheduler` runs the continuation
+on a thread-pool thread and the captured continuation builds the response objects there
+— acceptable since it only touches locals, but it bypasses the idiomatic
+`PipeTo`-with-success/failure-projection pattern and is fragile if someone later adds a
+line touching actor state inside the continuation. There is also no `TaskContinuationOptions`,
+so a faulted antecedent still runs the continuation (handled here via `IsCompletedSuccessfully`,
+but only by convention).
+
+**Recommendation**
+
+Replace `ContinueWith(...).PipeTo(sender)` with `PipeTo(sender, success: result => ...,
+failure: ex => ...)`, which is the documented Akka pattern, keeps response construction
+off the actor thread safely, and makes the success/failure branches explicit.
+
+**Resolution**
+
+_Unresolved._
+
+### StoreAndForward-008 — A SQLite connection is opened and torn down on every storage call
+
+| | |
+|--|--|
+| Severity | Low |
+| Category | Performance & resource management |
+| Status | Open |
+| Location | `src/ScadaLink.StoreAndForward/StoreAndForwardStorage.cs:28`, `:61`, `:93`, `:117`, `:144`, `:162`, `:199`, `:221`, `:237`, `:267`, `:285`, `:305`, `:319` |
+
+**Description**
+
+Every method in `StoreAndForwardStorage` constructs a fresh `SqliteConnection` and calls
+`OpenAsync`. Microsoft.Data.Sqlite pools connections, so this is not a correctness bug,
+but a retry sweep over a large buffer performs one open per `UpdateMessageAsync`/
+`RemoveMessageAsync` call inside the loop (`RetryMessageAsync`), multiplying connection
+churn under load. With no max buffer size (by design) the buffer can grow large, so the
+per-message connection acquisition is a measurable overhead on the hot retry path.
+
+**Recommendation**
+
+Consider a batched retry API that opens one connection (and one transaction) per sweep,
+or pass an open connection into the per-message update calls. At minimum, document that
+the design relies on the Sqlite connection pool for acceptable performance.
+
+**Resolution**
+
+_Unresolved._
+
+### StoreAndForward-009 — `OnActivity` event invocation is not thread-safe against concurrent subscribe/unsubscribe
+
+| | |
+|--|--|
+| Severity | Low |
+| Category | Concurrency & thread safety |
+| Status | Open |
+| Location | `src/ScadaLink.StoreAndForward/StoreAndForwardService.cs:46`, `:309` |
+
+**Description**
+
+`OnActivity` is a public `event Action<...>` raised via `OnActivity?.Invoke(...)` in
+`RaiseActivity` (line 309). `RaiseActivity` is called from both `EnqueueAsync` (caller
+thread) and `RetryMessageAsync` (timer thread). The `?.Invoke` null-conditional captures
+the delegate once so it will not NRE, but there is no synchronisation around the event
+field itself; a subscriber added/removed concurrently with a raise has no defined
+ordering. More importantly, subscriber callbacks run synchronously on the timer thread,
+so a slow or throwing subscriber stalls or aborts the retry sweep (an exception in a
+subscriber propagates out of `RaiseActivity` into `RetryMessageAsync`'s `try` and is
+swallowed as a "transient failure," wrongly incrementing the message's retry count).
+
+**Recommendation**
+
+Snapshot the delegate (already done) and additionally wrap subscriber invocation in a
+`try/catch` so a faulting logging subscriber cannot be misclassified as a delivery
+failure. Document that handlers must be fast and non-throwing, or dispatch activity
+notifications asynchronously.
+
+**Resolution**
+
+_Unresolved._
+
+### StoreAndForward-010 — Retry of a parked message does not reset `LastAttemptAt`, so its retry timing is unspecified
+
+| | |
+|--|--|
+| Severity | Medium |
+| Category | Correctness & logic bugs |
+| Status | Open |
+| Location | `src/ScadaLink.StoreAndForward/StoreAndForwardStorage.cs:203`, `:101` |
+
+**Description**
+
+`RetryParkedMessageAsync` sets `status = Pending, retry_count = 0, last_error = NULL`
+but leaves `last_attempt_at` unchanged (line 203–206). The retry-due query
+(`GetMessagesForRetryAsync`, line 101–105) selects Pending rows where
+`last_attempt_at IS NULL OR ... elapsed >= retry_interval_ms`. A message parked after
+exhausting retries has an old `last_attempt_at`; once re-queued, the elapsed time since
+that stale timestamp is almost certainly already greater than the retry interval, so the
+operator-retried message is attempted on the very next sweep regardless of the
+configured interval. That is probably the desired behaviour (operator wants it tried
+now), but it is unspecified and inconsistent — if `retry_interval_ms` were very large the
+behaviour would instead be "try immediately" by accident rather than by design.
+
+**Recommendation**
+
+Explicitly decide and encode the intent: either set `last_attempt_at = NULL` on
+re-queue so the message is unambiguously due now, or set it to "now" so it waits one
+interval. Document the chosen behaviour in the method's XML comment.
+
+**Resolution**
+
+_Unresolved._
+
+### StoreAndForward-011 — `StoreAndForwardMessageStatus.InFlight` is unused and the doc's "retrying" status is unmodelled
+
+| | |
+|--|--|
+| Severity | Low |
+| Category | Design-document adherence |
+| Status | Open |
+| Location | `src/ScadaLink.Commons/Types/Enums/StoreAndForwardMessageStatus.cs:9`; `src/ScadaLink.StoreAndForward/StoreAndForwardService.cs:219`, `:235` |
+
+**Description**
+
+The enum defines `Pending, InFlight, Parked, Delivered`. The module only ever uses
+`Pending` and `Parked` — `InFlight` and `Delivered` are never assigned (delivered
+messages are deleted, not marked `Delivered`). Meanwhile the Component design doc
+("Message Format" -> Status) specifies the set "Pending, retrying, or parked." So the
+code's enum drifts from the doc in two directions: it carries dead members the doc does
+not mention (`InFlight`, `Delivered`) and omits the doc's `retrying` state. A message
+mid-retry is indistinguishable from one that has never been attempted.
+
+**Recommendation**
+
+Reconcile the enum with the design. Either drop the unused members and update the doc,
+or implement the documented `retrying` state and use `InFlight` to mark a message the
+sweep is actively delivering (which would also help with finding 005).
+
+**Resolution**
+
+_Unresolved._
+
+### StoreAndForward-012 — `StoreAndForwardMessage` is a persistence entity but lives in the component, not Commons
+
+| | |
+|--|--|
+| Severity | Low |
+| Category | Code organization & conventions |
+| Status | Open |
+| Location | `src/ScadaLink.StoreAndForward/StoreAndForwardMessage.cs:9` |
+
+**Description**
+
+`StoreAndForwardMessage` is a persistence-ignorant POCO that maps directly to the
+`sf_messages` table and is also carried across the network inside `ReplicationOperation`
+(replicated to the standby node over Akka remoting). CLAUDE.md "Code Organization" states
+that entity classes are persistence-ignorant POCOs in Commons and that message contracts
+follow additive-only evolution. Because this type doubles as a replication wire contract
+but lives in the component assembly, it is not co-located with the other Commons
+entities and its evolution is not governed by the additive-only message-contract rule.
+This is a borderline case (the type is site-local), but the cross-node use via
+`ReplicationOperation` makes it a de-facto message contract.
+
+**Recommendation**
+
+Either move `StoreAndForwardMessage` (and `ReplicationOperation`) into the Commons
+`Entities`/`Messages` hierarchy so they are governed by the contract-evolution rules, or
+introduce a separate DTO for replication and keep `StoreAndForwardMessage` purely as the
+local persistence model. Document the decision.
+
+**Resolution**
+
+_Unresolved._
+
+### StoreAndForward-013 — Critical paths lack test coverage: retry-due timing, replication-from-active, and the actor bridge
+
+| | |
+|--|--|
+| Severity | Medium |
+| Category | Testing coverage |
+| Status | Open |
+| Location | `tests/ScadaLink.StoreAndForward.Tests/` (whole directory); `src/ScadaLink.StoreAndForward/StoreAndForwardStorage.cs:101`; `src/ScadaLink.StoreAndForward/ParkedMessageHandlerActor.cs` |
+
+**Description**
+
+The existing tests cover storage CRUD and the service happy/failure paths well, but
+three important behaviours are untested: (1) the retry-due time filter in
+`GetMessagesForRetryAsync` — every service test sets `DefaultRetryInterval = TimeSpan.Zero`,
+so the `julianday` elapsed-time comparison (the most error-prone SQL in the module) is
+never exercised with a non-zero interval; a message that is *not yet due* should be
+skipped, and that is never verified. (2) Replication from the active side — no test
+asserts that an enqueue/remove/park causes a `Replicate*` call (this is exactly the gap
+behind finding 001; a test would have caught it). (3) `ParkedMessageHandlerActor` has no
+test at all — the Query/Retry/Discard request-to-response mapping and the
+`ExtractMethodName` JSON parsing are unverified, including the malformed-JSON branch.
+
+**Recommendation**
+
+Add tests for: a non-zero retry interval where a recently-attempted message is excluded
+and an older one is included; active-side replication invocation per operation type
+(once finding 001 is fixed); and `ParkedMessageHandlerActor` using `Akka.TestKit`,
+including `ExtractMethodName` for `MethodName`, `Subject`, missing-property and
+invalid-JSON payloads.
+
+**Resolution**
+
+_Unresolved._