Files
ScadaBridge/code-reviews/StoreAndForward/findings.md
T
Joseph Doherty 77cb0ad0e2 fix(api-surface): close Theme 9 — 27 naming / dead-code / config / hygiene findings
The largest themed batch — small mechanical fixes across 11 modules.

API / message hygiene:
- Comm-020: SiteAddressCacheLoaded now carries IReadOnlyDictionary /
  IReadOnlyList — Akka messages must be immutable.
- Commons-016: BundleSession.MaxUnlockAttempts named constant replaces
  magic 3.
- Commons-018: IOperationTrackingStore + IPartitionMaintenance moved from
  Interfaces/ root to Interfaces/Services/ (namespace preserved — 9
  consumers exceeded the in-prompt move threshold).
- Commons-023: TrackingStatusSnapshot.SourceNode now consistent with the
  trailing-optional-with-default pattern used elsewhere.
- SR-022: AuditingDbCommand.DbConnection.set no longer uses reflection —
  exposes AuditingDbConnection.Inner via internal API surface.

Dead code / config cleanup:
- ClusterInfra-011: decorative SectionName constant deleted.
- ClusterInfra-014: dead AddClusterInfrastructureActors method + its
  "throws-when-called" test deleted.
- Host-021: Microsoft Logging:LogLevel block deleted from appsettings.json
  (dead under Serilog).

Fail-loud over fail-silent:
- DM-021: ResolveSiteIdentifierAsync throws on missing site (was silently
  substituting a DB id).
- DM-022: dropped transient Pending write — record now lands directly in
  InProgress (no UI flicker, one fewer DB write).
- Host-020: LoggerConfigurationFactory emits a Console.Error warning when
  both Serilog:MinimumLevel and ScadaLink:Logging:MinimumLevel are set
  (ScadaLink remains truth per Host-011).
- SnF-022: NotifyCachedCallObserverAsync logs Warning on unparseable
  TrackedOperationId (was silently dropping).
- SnF-023: empty siteId default replaced with $unknown-site sentinel
  + constructor normalisation.

Correctness:
- SCA-001: SupervisorStrategy XML rewritten to match actual
  DefaultDecider/Restart semantics (was claiming Resume).
- SCA-003: OnUpsertAsync now restamps IngestedAtUtc on every upsert.
- SR-021: HandleDeployArtifacts now dispatches an internal
  ApplyArtifactDataConnectionsToDcl message after the SQLite write so
  system-wide artifact-deploy data-connection changes go live
  immediately (was requiring a site restart).
- SnF-020: RetryParkedMessageAsync captures the parked row BEFORE the
  local write so a concurrent delete can't skip standby replication.

Sentinels / naming collisions:
- HM-021: CentralSiteId changed from "central" to "$central"
  (uncollideable — leading $ is forbidden in real SiteIdentifiers).

Doc / surface cleanups:
- SEL-018: FailedWriteCount promoted to ISiteEventLogger; XML softened
  to "Available for future Health Monitoring integration".
- SnF-019: VERIFY outcome — documented parking-after-DefaultMaxRetries
  in Component-StoreAndForward.md + DefaultMaxRetries XML (uniform
  cap; maxRetries:0 is the unbounded escape hatch).
- SnF-021: Component-StoreAndForward.md no longer claims the tracking
  table lives in SnF — it's in SiteRuntime, the interface is in Commons.
- CLI-020: bundle export response parse guarded with try/catch on
  JsonException / KeyNotFoundException / FormatException — emits a
  clean INVALID_RESPONSE exit instead of a stack trace.

Config:
- ClusterInfra-013: intent comment added to "catastrophic config" test.
- Host-016: appsettings.Site.json second CentralContactPoints entry
  removed (was pointing at the SITE's own port); doc-key explains how
  to extend.
- Host-018: NodeName added to both shipped per-role configs (was
  causing SourceNode to be null on audit rows).

UI:
- CentralUI-029: replaced JS.InvokeAsync<int>("eval", …) with an ES
  module import (new wwwroot/js/browser-time.js).
- CentralUI-032: AuditResultsGrid gains a Previous button backed by a
  cursor stack.

10+ new regression tests across the affected projects. Build clean;
all suites green. README regenerated: 6 open (was 33).

Session-to-date: 130 of 136 originally-open Theme findings closed.
2026-05-28 08:39:01 -04:00

1523 lines
86 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# Code Review — StoreAndForward
| Field | Value |
|-------|-------|
| Module | `src/ScadaLink.StoreAndForward` |
| Design doc | `docs/requirements/Component-StoreAndForward.md` |
| Status | Reviewed |
| Last reviewed | 2026-05-28 |
| Reviewer | claude-agent |
| Commit reviewed | `1eb6e97` |
| Open findings | 0 (3 Deferred: 002, 011, 012; all 5 Open from Re-review 2026-05-28 resolved 2026-05-28) |
## Summary
The Store-and-Forward module is small and readable, with a clean SQLite persistence
layer, a sensible service API, and reasonable test coverage of the storage and service
happy paths. However the review surfaced two issues that undermine the module's core
purpose. First, the active delivery path never invokes the `ReplicationService`
`ReplicateEnqueue/Remove/Park` have no callers anywhere in the codebase, so buffered
messages are not replicated to the standby node and the design's failover-durability
guarantee (Component doc "Persistence", CLAUDE.md "Store-and-Forward") is not met.
Second, there is an off-by-one in retry accounting: the immediate-failure path stores a
buffered message with `RetryCount = 1`, so a message configured with `MaxRetries = N`
is actually attempted `N` times in total rather than one immediate attempt plus `N`
retries, and a per-source `MaxRetries` of 1 produces zero retry attempts. Additional
themes: SQLite connection-per-call with no transactional grouping of multi-statement
operations, no concurrency guard against a parked message being retried while the
sweep is mid-flight, an unused enum member (`InFlight`) that drifts from the documented
status set, and untested critical paths (retry-due timing, replication-from-active,
the actor bridge). None of the findings are blockers for compilation, but the
replication and retry-count issues are functional defects against the design.
#### Re-review 2026-05-17 (commit `39d737e`)
Re-reviewed at commit `39d737e` after the batch-3 fixes. All of findings 001 and
003010, plus 013014, are confirmed `Resolved` against the current source: the
replication wiring (`BufferAsync`/`ReplicateRemove`/`ReplicatePark`), the corrected
retry-count semantics, the conditional `UpdateMessageIfStatusAsync` writes, the
transactioned parked-message reads, the `PipeTo` refactor, the `RaiseActivity`
hardening, the `RetryParkedMessageAsync` `last_attempt_at` reset and the database
directory creation are all present as described. Findings 002, 011 and 012 were
re-verified and remain validly `Deferred` — their preconditions are unchanged (002's
residual no-handler gap, 011's Commons-owned enum, 012's Commons-owned entity placement).
This pass surfaced **three new findings**. StoreAndForward-015 records the
StoreAndForward side of the cross-module `MaxRetries == 0` ambiguity flagged by
ExternalSystemGateway-015: `EnqueueAsync`'s public contract documents `maxRetries` only
as "parked once `MaxRetries` is reached" and never states the `0 = no limit / retry
forever` special case that `RetryMessageAsync` actually enforces, so an ESG caller
passing `0` to mean "never retry" gets the opposite behaviour with no warning from the
S&F API surface. StoreAndForward-016 records that operator-initiated parked-message
retry and discard are not replicated to the standby — only the add/remove/park sweep
paths are — so a failover diverges the standby buffer from the active one.
StoreAndForward-017 records that the Retry/Discard activity-log entries hard-code the
`ExternalSystem` category, mislabelling notification and cached-DB-write messages in
the site event log.
#### Re-review 2026-05-28 (commit `1eb6e97`)
Full re-review against commit `1eb6e97` with the same 10-category checklist. The
batch-3 / batch-4 resolutions (001, 003010, 013017) are still present and intact; no
regressions detected on prior fixes. Findings 002, 011 and 012 remain validly
`Deferred` (their preconditions are unchanged) and findings 005, 006, 010, 013, 014,
016, 017 are confirmed `Resolved` against the current source.
This pass surfaced **seven new findings** clustered around two themes:
The first theme is **design-doc drift on the notification path**, which has acquired
two now-real defects since the engine became central-targeted. `StoreAndForward-018`
(High) records that a corrupt notification payload — handled in `NotificationForwarder.
DeliverAsync` by returning `false` — parks a notification on its first retry-sweep
encounter, despite the design doc stating "Notifications do not park" (line 47, "Parking
applies only to the external-system-call and cached-database-write categories"). The
same path becomes a poison-payload retry-forever trap on the active node if the engine
ever softened the `false` semantics. `StoreAndForward-019` (Medium) records the
sibling defect: notifications are enqueued with `MaxRetries` defaulting to
`StoreAndForwardOptions.DefaultMaxRetries` (50), and the legacy SMTP path
(`NotificationDeliveryService.SendAsync`) passes a positive bounded `smtpConfig.
MaxRetries` — so an unreachable central will silently park notifications after a
finite retry budget rather than "retry at the fixed forward interval until central acks"
as the design requires. The contract `0 = no limit` is not enforced for the
notification category.
The second theme is **subtle correctness and contract gaps around the operator paths**
that survived the StoreAndForward-016/017 batch. `StoreAndForward-020` (Medium) records
that `RetryParkedMessageAsync` skips replication entirely if `GetMessageByIdAsync`
returns null after a successful local requeue (a narrow but real race window with a
concurrent discard / sweep delete), re-introducing the StoreAndForward-016 standby
divergence in that corner. `StoreAndForward-021` (Medium) is a design-doc-vs-code drift
that should be reconciled in the doc: the **operation tracking table** is documented
inside Component-StoreAndForward.md as a S&F responsibility (lines 21, 49, 7787, 108,
114), but the actual `OperationTrackingStore` lives in `src/ScadaLink.SiteRuntime/
Tracking/` and is not consumed by S&F at all — the brief's own note flags this. The
design doc should be updated to point at SiteRuntime, or the store moved to
StoreAndForward.
`StoreAndForward-022` (Low) records that `_cachedCallObserver` silently drops audit
telemetry when a buffered cached-call's `Id` is not a parseable `TrackedOperationId`
GUID — the engine returns from `NotifyCachedCallObserverAsync` before emitting anything,
so a legacy enqueue path that buffered a non-GUID id (the engine's own default minting
produces "N"-formatted GUIDs, which TrackedOperationId.TryParse accepts, but any
caller passing a custom non-GUID id silently bypasses the entire `Submitted/Forwarded/
Attempted/Delivered/Parked/Discarded` audit lifecycle). `StoreAndForward-023` (Low)
records that `siteId` is silently defaulted to `string.Empty` when no
`IStoreAndForwardSiteContext` is registered, so a misconfigured host produces audit
telemetry with `SourceSite = ""` and the central audit-log's `(SourceSite,
TrackedOperationId)` correlation degrades to a per-id-only index. `StoreAndForward-024`
(Low) is a stop-time ordering defect: `StopAsync` disposes the timer but a
mid-flight `RetryPendingMessagesAsync` invocation continues using `_storage` and
`_replication` after `StopAsync` returns; downstream resources disposed by the host
shutdown sequence (the DI container) can then NRE through the still-running sweep.
## Checklist coverage — Re-review 2026-05-28
| # | Category | Examined | Notes |
|---|----------|----------|-------|
| 1 | Correctness & logic bugs | ☑ | Notification corrupt-payload parks contrary to design (018); RetryParkedMessageAsync skips replication when message reload races a deletion (020). |
| 2 | Akka.NET conventions | ☑ | `ParkedMessageHandlerActor` uses `PipeTo` correctly with success/failure projections (007 resolution preserved). No new findings. |
| 3 | Concurrency & thread safety | ☑ | Sweep-vs-stop race: a timer callback running while `StopAsync` returns can touch disposed dependencies (024). |
| 4 | Error handling & resilience | ☑ | Notifications park after `DefaultMaxRetries` exhaustion (019) — contradicts the design doc's "retried until central acks". |
| 5 | Security | ☑ | No issues found — parameterised SQL throughout, payload JSON opaque, no secret material handled. |
| 6 | Performance & resource management | ☑ | No new findings — the connection-per-call documented trade-off and pooled `OpenAsync` remain acceptable. |
| 7 | Design-document adherence | ☑ | Operation Tracking Table documented in StoreAndForward but actually lives in SiteRuntime (021); notification non-parking guarantee broken by 018 + 019. |
| 8 | Code organization & conventions | ☑ | `IStoreAndForwardSiteContext` silently defaults `SiteId` to empty (023) — a configuration hole rather than an entity placement issue. |
| 9 | Testing coverage | ☑ | The seven new findings have no regression tests in `tests/ScadaLink.StoreAndForward.Tests/` — particularly the notification-doesn't-park invariant (018, 019), the requeue-after-reload-null replication gap (020), and the stop-during-sweep behaviour (024). |
| 10 | Documentation & comments | ☑ | `CachedCallAttemptOutcome.ParkedMaxRetries` XML doc says "S&F semantics" but the code applies it to notifications too if 018/019 fire — minor drift, captured under 018. The `TrackedOperationId.TryParse` silent-skip behaviour in `NotifyCachedCallObserverAsync` is documented in the source but not on the public observer contract (022). |
## Checklist coverage
| # | Category | Examined | Notes |
|---|----------|----------|-------|
| 1 | Correctness & logic bugs | ☑ | Off-by-one in retry counting (003); parked-message retry timing (010); Retry/Discard activity log hard-codes the category (017). |
| 2 | Akka.NET conventions | ☑ | `ContinueWith` used instead of `PipeTo`-friendly continuations; default supervision; see 007. |
| 3 | Concurrency & thread safety | ☑ | Sweep guarded by `Interlocked`, but no guard against retry-vs-manage races (005); `OnActivity` event not thread-safe (009). |
| 4 | Error handling & resilience | ☑ | Replication never invoked from active path (001); no-handler messages buffered then stuck (002); operator retry/discard not replicated to standby (016). |
| 5 | Security | ☑ | No issues found — parameterised SQL throughout; no secrets handled directly; payload JSON treated opaquely. |
| 6 | Performance & resource management | ☑ | New SQLite connection per call; multi-statement operations not wrapped in a transaction (006, 008). |
| 7 | Design-document adherence | ☑ | Replication gap (001); `InFlight` status undocumented/unused (011); "retrying" status from design doc not modelled. |
| 8 | Code organization & conventions | ☑ | `StoreAndForwardMessage` is an entity-like POCO living in the component, not Commons (012). |
| 9 | Testing coverage | ☑ | Retry-due timing, replication-from-active, and `ParkedMessageHandlerActor` are untested (013). |
| 10 | Documentation & comments | ☑ | XML doc on `RegisterDeliveryHandler` contract is inconsistent with code (004); `EnqueueAsync` never documents the `maxRetries == 0` = "retry forever" special case (015). |
## Findings
### StoreAndForward-001 — Replication to standby is never triggered by the active node
| | |
|--|--|
| Severity | Critical |
| Category | Error handling & resilience |
| Status | Resolved |
| Location | `src/ScadaLink.StoreAndForward/ReplicationService.cs:40`, `:53`, `:66`; `src/ScadaLink.StoreAndForward/StoreAndForwardService.cs:155`, `:212`, `:222`, `:236` |
**Description**
`ReplicationService` exposes `ReplicateEnqueue`, `ReplicateRemove` and `ReplicatePark`
to forward buffer operations to the standby node, but a codebase-wide search shows these
methods have no callers. `StoreAndForwardService` — which performs every add (`EnqueueAsync`
line 155 / 163), remove (`RemoveMessageAsync` call at line 212) and park
(`UpdateMessageAsync` calls at lines 222/236) — holds no reference to `ReplicationService`
and never invokes it. Only the receiving half is wired (`SetReplicationHandler` and
`ApplyReplicatedOperationAsync` are used by `SiteReplicationActor`). The Component design
doc ("Persistence") and CLAUDE.md ("Store-and-Forward") require the active node to
forward every buffer operation to the standby so that, on failover, the new active node
"has a near-complete copy of the buffer." As written, the standby's S&F SQLite database
stays empty and a failover loses the entire buffer — a data-loss defect against a core
requirement.
**Recommendation**
Inject `ReplicationService` into `StoreAndForwardService` and call `ReplicateEnqueue`
after a successful `_storage.EnqueueAsync`, `ReplicateRemove` after `RemoveMessageAsync`,
and `ReplicatePark` after a park-causing `UpdateMessageAsync`. Update
`ServiceCollectionExtensions.AddStoreAndForward` to pass the dependency. Add a test that
asserts the replication handler observes each operation type.
**Resolution**
Resolved 2026-05-16. `ReplicationService` is now injected into `StoreAndForwardService`
(wired in `AddStoreAndForward`), and every buffer operation is forwarded to the standby:
a new `BufferAsync` helper calls `ReplicateEnqueue` after each persist, `ReplicateRemove`
runs after a successful retry removes a message, and `ReplicatePark` runs on both park
paths. Replication stays fire-and-forget and is a no-op when `ReplicationEnabled` is
false or no handler is wired. Regression tests `StoreAndForwardReplicationTests` assert
the replication handler observes the Add, Remove and Park operations. Fixed by the
commit whose message references `StoreAndForward-001`.
### StoreAndForward-002 — Messages enqueued with no registered handler are buffered but never deliverable
| | |
|--|--|
| Severity | Low |
| Original severity | High (re-triaged down to Low on 2026-05-16 — see Re-triage note) |
| Category | Error handling & resilience |
| Status | Deferred |
| Location | `src/ScadaLink.StoreAndForward/StoreAndForwardService.cs:162`, `:201` |
**Description**
`EnqueueAsync` falls through to "No handler registered — buffer for later" (line 162)
when no delivery handler is registered for the category. The retry sweep
(`RetryMessageAsync`, line 201) then logs "No delivery handler for category" and
`return`s without touching the message. No caller in the codebase ever calls
`RegisterDeliveryHandler` (the External System Gateway, Notification Service and
Database Gateway only call `EnqueueAsync`), so in the current wiring **every** buffered
message lands in this dead state: it is persisted, counts toward buffer depth, but can
never be retried, delivered or parked. It will sit Pending forever. Either the handler
registration is missing from Host/gateway startup, or the "buffer for later" path is a
silent trap. Either way the engine cannot deliver anything.
**Recommendation**
Decide the intended contract. If handlers are expected to be registered before
`EnqueueAsync` is reachable, make `EnqueueAsync` reject (or log an error) when no
handler exists rather than silently buffering an undeliverable message, and wire
`RegisterDeliveryHandler` calls in Host startup for all three categories. If late
registration is intended, the retry sweep should treat a still-missing handler as a
transient condition with bounded logging rather than a permanent no-op.
**Re-triage note (2026-05-16)**
The finding's central factual claim — *"No caller in the codebase ever calls
`RegisterDeliveryHandler`"* and therefore *"every buffered message lands in this dead
state"* — is **no longer true at the reviewed code**. `ScadaLink.Host`
(`AkkaHostedService.RegisterSiteActors`, `AkkaHostedService.cs:353-379`) registers all
three delivery handlers (`ExternalSystem`, `CachedDbWrite`, `Notification`) at site
startup, immediately after `StoreAndForwardService.StartAsync()`. The finding was
written against commit `9c60592` before that wiring existed; the High-severity
"engine cannot deliver anything" outcome no longer occurs.
The remaining residual risk is narrow: a message enqueued for a category that genuinely
has no handler (e.g. an enqueue racing ahead of `RegisterDeliveryHandler`, or a future
category added without a handler) is still buffered and then skipped by the sweep
forever. That is a real but minor robustness gap, hence the **downgrade to Low**.
It is left **Open** rather than fixed in this pass because the finding's recommended
fix — making `EnqueueAsync` reject when no handler is registered — is a behavioural
contract change, not a localised bug fix: the "buffer with no handler yet" path is
exercised by `StoreAndForwardReplicationTests` and by three NotificationService and
ExternalSystemGateway tests (`Send_TransientError_WithStoreAndForward_BuffersMessage`,
`Send_Smtp4xxCommandException_ClassifiedTransientAndBuffered`,
`Send_SmtpProtocolException_ClassifiedTransient`) which construct a real
`StoreAndForwardService` without registering a handler and assert `WasBuffered`.
Changing the contract requires deciding whether late handler registration is supported
and updating tests in modules outside this review's edit scope — a design decision that
should be made deliberately rather than forced here.
**Resolution**
_Deferred 2026-05-16 (re-triaged High → Low). Verified again against the source in this
pass: the finding's premise (no `RegisterDeliveryHandler` caller anywhere) is stale —
`ScadaLink.Host` wires all three handlers at site startup — so the High-severity
"engine cannot deliver anything" outcome no longer occurs. The residual gap (a message
enqueued for a category that genuinely has no handler is buffered then skipped forever)
is real but minor. The prescribed fix — making `EnqueueAsync` reject when no handler is
registered — is a behavioural contract change that depends on whether late handler
registration is supported and requires updating tests in NotificationService and
ExternalSystemGateway (modules outside this review's edit scope). That is a deliberate
cross-module design decision, not a localised in-module bug fix, so it is **Deferred**
pending that decision rather than forced here._
### StoreAndForward-003 — Off-by-one in retry accounting: immediate failure pre-counts as retry 1
| | |
|--|--|
| Severity | High |
| Category | Correctness & logic bugs |
| Status | Resolved |
| Location | `src/ScadaLink.StoreAndForward/StoreAndForwardService.cs:153`, `:229`, `:233` |
**Description**
On a transient immediate-delivery failure, `EnqueueAsync` buffers the message with
`message.RetryCount = 1` (line 153). The retry sweep then increments `RetryCount` before
the max check (`RetryCount++` at line 229; `RetryCount >= MaxRetries` at line 233).
Consequences: (1) a message configured with `MaxRetries = 1` is parked on the *first*
retry sweep without ever being retried, because after the immediate attempt `RetryCount`
is already 1 and the first sweep makes it 2 ≥ 1 — zero actual retries occur, contradicting
the design intent that the immediate attempt and the retry budget are distinct;
(2) the design doc's `Retry Count` field is "Number of attempts so far," but here it is
seeded to 1 before any *retry* has happened, making the parked-message `AttemptCount`
shown to operators off by one relative to configured `MaxRetries`. The
`EnqueueAsync_TransientFailure_BuffersForRetry` test even asserts `RetryCount == 1`,
locking in the ambiguity.
**Recommendation**
Choose one consistent meaning for `RetryCount` (recommended: total delivery attempts,
including the immediate one) and apply it uniformly. If `MaxRetries` is meant to bound
*retries* after the immediate attempt, buffer with `RetryCount = 0` and treat the
immediate failure as attempt 0; if it bounds *total attempts*, document that and adjust
the comparison. Update the affected test to match the chosen semantics.
**Resolution**
Resolved 2026-05-16 (commit `<pending>`). `RetryCount` now consistently means "number
of background retry-sweep attempts so far"; the initial immediate (or caller-made)
delivery attempt is attempt 0 and is not counted, and `MaxRetries` bounds retry-sweep
attempts after that initial attempt. `EnqueueAsync` no longer seeds `RetryCount = 1` on
either the transient-immediate-failure path or the `attemptImmediateDelivery: false`
path — a freshly buffered message has `RetryCount = 0`. `RetryMessageAsync` already
increments before the `>= MaxRetries` check, which is now correct, so a message with
`MaxRetries = 1` gets exactly one real retry before parking (previously zero). The
`StoreAndForwardMessage.RetryCount` XML doc was corrected to match. Regression test
`RetryPendingMessagesAsync_MaxRetriesOne_PerformsExactlyOneRetryBeforeParking` asserts
the immediate attempt plus exactly one retry occur before parking; the affected
existing tests (`EnqueueAsync_TransientFailure_BuffersForRetry`,
`EnqueueAsync_AttemptImmediateDeliveryFalse_BuffersWithoutInvokingHandler`,
`RetryPendingMessagesAsync_MaxRetriesReached_ParksMessage`) were updated to the
corrected semantics.
### StoreAndForward-004 — `RegisterDeliveryHandler` XML doc contradicts the implemented contract
| | |
|--|--|
| Severity | Medium |
| Category | Documentation & comments |
| Status | Resolved |
| Location | `src/ScadaLink.StoreAndForward/StoreAndForwardService.cs:38`, `:60` |
**Description**
The XML comment on the handler delegate (lines 3740) says "Returns true on success,
throws on transient failure. Permanent failures should return false (message will NOT
be buffered)." That last clause is wrong for the retry path: in `RetryMessageAsync`,
a handler returning `false` does not "not buffer" — the message is already buffered, and
the code *parks* it immediately (lines 218224). The comment describes only the
`EnqueueAsync` immediate path and misleads anyone implementing a handler about what
`false` means once a message is in the retry loop.
**Recommendation**
Reword the contract to cover both paths explicitly: `true` = delivered (remove from
buffer); `false` = permanent failure (not buffered on immediate attempt, parked on a
retry); exception = transient failure (buffer / increment retry).
**Resolution**
Resolved 2026-05-16 (commit pending). Confirmed the root cause against the source: the
old XML comment's "Permanent failures should return false (message will NOT be
buffered)" describes only the `EnqueueAsync` immediate path; on the retry path a
`false` return parks the already-buffered message. Reworded the `_deliveryHandlers`
field doc into an explicit three-way contract (`true` = delivered/removed-or-never-
buffered; `false` = permanent failure = not-buffered-on-immediate / parked-on-retry;
throws = transient = buffered-for-retry / retry-count-incremented), noting it applies
identically on both paths, and pointed `RegisterDeliveryHandler`'s summary at it.
Documentation-only change — no behavioural code touched, so no regression test
(an XML comment is not test-observable).
### StoreAndForward-005 — Parked-message retry/discard can race with the in-progress retry sweep
| | |
|--|--|
| Severity | Low — re-triaged down from Medium on 2026-05-16; the described data-loss race is not actually reachable, see Re-triage note |
| Original severity | Medium |
| Category | Concurrency & thread safety |
| Status | Resolved |
| Location | `src/ScadaLink.StoreAndForward/StoreAndForwardService.cs:184`, `:266`, `:280` |
**Description**
`RetryPendingMessagesAsync` loads a snapshot of due messages (line 179) and then
processes them one by one (line 184), `await`-ing delivery for each. Meanwhile
`RetryParkedMessageAsync` / `DiscardParkedMessageAsync` (operator actions arriving via
`ParkedMessageHandlerActor`) run on unrelated threads and mutate the same rows. Because
each operation opens its own SQLite connection and there is no row-level coordination,
an operator can `DiscardParkedMessageAsync` a message that the sweep is concurrently
delivering: the sweep's later `RemoveMessageAsync`/`UpdateMessageAsync` operates on a
now-deleted row (harmless) — but if an operator `RetryParkedMessageAsync` resets a row
to Pending while the sweep simultaneously parks the same in-flight message, the operator
intent is silently overwritten. The `Interlocked` guard only prevents *overlapping
sweeps*, not sweep-vs-management races.
**Recommendation**
Funnel all message-state mutations through a single serialization point — e.g. perform
all S&F state changes inside the `ParkedMessageHandlerActor` (or a dedicated S&F actor)
so the actor mailbox serialises them, or make status transitions conditional in SQL
(e.g. `UPDATE ... WHERE id = @id AND status = @expected`) and re-check the affected
row count.
**Re-triage note (2026-05-16)**
Verified against the source: the *specific* race the finding describes — *"an operator
`RetryParkedMessageAsync` resets a row to Pending while the sweep simultaneously parks
the same in-flight message"* — **cannot occur** at the reviewed code. The two operator
operations (`StoreAndForwardStorage.RetryParkedMessageAsync` /
`DiscardParkedMessageAsync`) are already SQL-conditional on `status = Parked`, and the
retry sweep only ever loads and processes rows that are `Pending`. For the operator and
the sweep to touch the *same* message the row would have to be simultaneously `Parked`
(operator-actionable) and in the sweep's in-flight `Pending` snapshot — mutually
exclusive states. Overlapping sweeps are additionally prevented by the `Interlocked`
guard, so there is only ever one sweep writer. The described data-loss outcome is
therefore not reachable, which makes the original Medium ("risky behaviour") severity
an over-statement — **re-triaged to Low**.
What *is* real is a latent fragility: the sweep's own state-changing writes used the
unconditional `UpdateMessageAsync` (`WHERE id = @id`), so the no-clobber guarantee
rested entirely on the `Interlocked` invariant with zero defence-in-depth if that
invariant were ever broken (e.g. `RetryMessageAsync` called outside the sweep, or the
guard removed). That residual Low-severity weakness is worth closing, so the
recommendation's conditional-SQL option was applied.
**Resolution**
Resolved 2026-05-16 (commit pending). Re-triaged Medium → Low (the described race is
unreachable; see Re-triage note) and the residual latent fragility fixed: added
`StoreAndForwardStorage.UpdateMessageIfStatusAsync`, a conditional update
(`UPDATE ... WHERE id = @id AND status = @expectedStatus`) returning whether a row was
written. `RetryMessageAsync`'s three state-changing writes (park-on-permanent-failure,
park-on-max-retries, retry-count increment) now use it with `expectedStatus = Pending`,
so the sweep can never overwrite a row whose status changed underneath it; a skipped
write is logged and the message left for the next sweep. Regression test
`RetryMessageAsync_StatusChangedDuringDelivery_SweepParkWriteIsSkipped` drives a writer
that moves the row out of `Pending` mid-delivery and asserts the sweep's stale park
write is skipped (it failed against the pre-fix unconditional update, clobbering the
other writer's `RetryCount`).
### StoreAndForward-006 — `GetParkedMessagesAsync` count and page run without a transaction
| | |
|--|--|
| Severity | Low |
| Category | Performance & resource management |
| Status | Resolved |
| Location | `src/ScadaLink.StoreAndForward/StoreAndForwardStorage.cs:166`, `:175` |
**Description**
`GetParkedMessagesAsync` issues a `COUNT(*)` and then a separate paged `SELECT` on two
commands on the same connection with no surrounding transaction. A concurrent
enqueue/park/discard between the two statements yields a `TotalCount` inconsistent with
the returned page (e.g. total reported as 51 while only 50 distinct parked rows now
exist, or a row visible in the page but excluded from the count). For a paginated UI
this produces flickering totals and occasional off-by-one page math.
**Recommendation**
Wrap both reads in a single transaction (`BeginTransaction`) so they see a consistent
snapshot, or accept the staleness and document it. A transaction is cheap here and
removes the inconsistency.
**Resolution**
Resolved 2026-05-16 (commit pending). Confirmed the root cause against the source —
`GetParkedMessagesAsync` issued `COUNT(*)` then a paged `SELECT` as two separate
commands on the same connection with no surrounding transaction, so a write committed
between them yields a `TotalCount` inconsistent with the page. Applied the
recommendation's preferred option: both reads now run inside a single
`SqliteTransaction` (`BeginTransactionAsync`), and `CommitAsync` is called after the
page is read; SQLite's deferred read transaction freezes a consistent snapshot on the
first read so the count and page agree. Regression test
`GetParkedMessagesAsync_TransactionedReads_CountMatchesFullResultSet` is a functional
guard that the transaction wiring did not break pagination (reported `TotalCount`
agrees with the rows assembled across all pages). Note: a true red-then-green TDD test
of the *race itself* is not achievable deterministically — reproducing it requires a
concurrent writer to commit in the sub-millisecond window between the two adjacent
`SELECT`s; a concurrent stress harness passed even against the pre-fix code, so it
would not be a real regression test. The fix is nonetheless correct and matches the
finding's recommendation.
### StoreAndForward-007 — Async work in `ParkedMessageHandlerActor` uses `ContinueWith` without scheduler/affinity guarantees
| | |
|--|--|
| Severity | Low |
| Category | Akka.NET conventions |
| Status | Resolved |
| Location | `src/ScadaLink.StoreAndForward/ParkedMessageHandlerActor.cs:34`, `:68`, `:87` |
**Description**
The three handlers call a `Task`-returning service method and chain `.ContinueWith(...)
.PipeTo(sender)`. `Sender` is correctly captured into a local first, so the closure is
safe. However `ContinueWith` without an explicit `TaskScheduler` runs the continuation
on a thread-pool thread and the captured continuation builds the response objects there
— acceptable since it only touches locals, but it bypasses the idiomatic
`PipeTo`-with-success/failure-projection pattern and is fragile if someone later adds a
line touching actor state inside the continuation. There is also no `TaskContinuationOptions`,
so a faulted antecedent still runs the continuation (handled here via `IsCompletedSuccessfully`,
but only by convention).
**Recommendation**
Replace `ContinueWith(...).PipeTo(sender)` with `PipeTo(sender, success: result => ...,
failure: ex => ...)`, which is the documented Akka pattern, keeps response construction
off the actor thread safely, and makes the success/failure branches explicit.
**Resolution**
Resolved 2026-05-16 (commit pending). Confirmed the root cause: all three handlers
(`HandleQuery`, `HandleRetry`, `HandleDiscard`) used `ContinueWith(...).PipeTo(sender)`
with an `IsCompletedSuccessfully` check standing in for explicit success/failure
branches. Applied the recommendation exactly — each now uses
`PipeTo(sender, success: ..., failure: ...)`, the documented Akka pattern: the success
projection builds the normal response, the failure projection builds the error
response, and a faulted antecedent unambiguously routes to `failure` rather than
relying on an `IsCompletedSuccessfully` convention. `Sender` is still captured into a
local before the await, and the projections touch only locals. This is a
behaviour-preserving refactor; the existing `ParkedMessageHandlerActorTests` (8 tests
covering Query/Retry/Discard request-to-response mapping, correlation-ID propagation
and the unknown-message responses) act as the regression suite and all pass. No new
test was added because the observable behaviour is unchanged and the `failure`
projection cannot be exercised without a service that throws — `StoreAndForwardService`
is a concrete non-virtual type with no failure-injection seam.
### StoreAndForward-008 — A SQLite connection is opened and torn down on every storage call
| | |
|--|--|
| Severity | Low |
| Category | Performance & resource management |
| Status | Resolved |
| Location | `src/ScadaLink.StoreAndForward/StoreAndForwardStorage.cs:28`, `:61`, `:93`, `:117`, `:144`, `:162`, `:199`, `:221`, `:237`, `:267`, `:285`, `:305`, `:319` |
**Description**
Every method in `StoreAndForwardStorage` constructs a fresh `SqliteConnection` and calls
`OpenAsync`. Microsoft.Data.Sqlite pools connections, so this is not a correctness bug,
but a retry sweep over a large buffer performs one open per `UpdateMessageAsync`/
`RemoveMessageAsync` call inside the loop (`RetryMessageAsync`), multiplying connection
churn under load. With no max buffer size (by design) the buffer can grow large, so the
per-message connection acquisition is a measurable overhead on the hot retry path.
**Recommendation**
Consider a batched retry API that opens one connection (and one transaction) per sweep,
or pass an open connection into the per-message update calls. At minimum, document that
the design relies on the Sqlite connection pool for acceptable performance.
**Resolution**
Resolved 2026-05-16 (commit pending). Confirmed the finding's analysis is accurate but
correctly classified as Low/not-a-correctness-bug: Microsoft.Data.Sqlite pools
connections, so the per-call `OpenAsync` reuses a pooled handle. Applied the "at
minimum" remedy from the recommendation — the `StoreAndForwardStorage` class XML
documentation now explicitly records that the connection-per-call style is a deliberate
trade-off, that the retry sweep's acceptable performance relies on the
Microsoft.Data.Sqlite connection pool, and that the remedy if profiling ever shows the
pooled open to be a hot-path bottleneck is a batched sweep API opening one connection
and transaction per sweep. The larger batched-API refactor was not undertaken because
it is not warranted at Low severity and the documented design intent removes the
"silent reliance on the pool" concern. Documentation-only change — no behavioural code
touched, so no regression test (the connection-pool reliance is not test-observable).
### StoreAndForward-009 — `OnActivity` event invocation is not thread-safe against concurrent subscribe/unsubscribe
| | |
|--|--|
| Severity | Low |
| Category | Concurrency & thread safety |
| Status | Resolved |
| Location | `src/ScadaLink.StoreAndForward/StoreAndForwardService.cs:46`, `:309` |
**Description**
`OnActivity` is a public `event Action<...>` raised via `OnActivity?.Invoke(...)` in
`RaiseActivity` (line 309). `RaiseActivity` is called from both `EnqueueAsync` (caller
thread) and `RetryMessageAsync` (timer thread). The `?.Invoke` null-conditional captures
the delegate once so it will not NRE, but there is no synchronisation around the event
field itself; a subscriber added/removed concurrently with a raise has no defined
ordering. More importantly, subscriber callbacks run synchronously on the timer thread,
so a slow or throwing subscriber stalls or aborts the retry sweep (an exception in a
subscriber propagates out of `RaiseActivity` into `RetryMessageAsync`'s `try` and is
swallowed as a "transient failure," wrongly incrementing the message's retry count).
**Recommendation**
Snapshot the delegate (already done) and additionally wrap subscriber invocation in a
`try/catch` so a faulting logging subscriber cannot be misclassified as a delivery
failure. Document that handlers must be fast and non-throwing, or dispatch activity
notifications asynchronously.
**Resolution**
Resolved 2026-05-16 (commit pending). Confirmed — and found the impact is worse than
the finding states. `RaiseActivity` previously did `OnActivity?.Invoke(...)`; a
throwing subscriber's exception escaped it. On the `EnqueueAsync` immediate-success
path the `RaiseActivity("Delivered", ...)` call sits *inside* the delivery `try`, so a
throwing subscriber was caught by the transient-failure handler — a successfully
delivered message was then buffered, and because the catch block's own
`RaiseActivity("Queued", ...)` also threw, the exception escaped `EnqueueAsync`
entirely. `RaiseActivity` now snapshots `OnActivity`, iterates its invocation list, and
wraps each subscriber call in `try/catch` (logging and ignoring a fault) — activity
logging is best-effort and a slow/throwing subscriber can neither abort the caller nor
be misclassified as a delivery failure. Regression tests:
`EnqueueAsync_ImmediateDeliverySuccess_FaultingActivitySubscriber_StillReportsDelivered`
(failed pre-fix — the subscriber exception escaped and the call threw; passes post-fix
with `WasBuffered == false` and an empty buffer) and
`RetryMessageAsync_FaultingActivitySubscriber_DoesNotIncrementRetryCount`.
### StoreAndForward-010 — Retry of a parked message does not reset `LastAttemptAt`, so its retry timing is unspecified
| | |
|--|--|
| Severity | Medium |
| Category | Correctness & logic bugs |
| Status | Resolved |
| Location | `src/ScadaLink.StoreAndForward/StoreAndForwardStorage.cs:203`, `:101` |
**Description**
`RetryParkedMessageAsync` sets `status = Pending, retry_count = 0, last_error = NULL`
but leaves `last_attempt_at` unchanged (line 203206). The retry-due query
(`GetMessagesForRetryAsync`, line 101105) selects Pending rows where
`last_attempt_at IS NULL OR ... elapsed >= retry_interval_ms`. A message parked after
exhausting retries has an old `last_attempt_at`; once re-queued, the elapsed time since
that stale timestamp is almost certainly already greater than the retry interval, so the
operator-retried message is attempted on the very next sweep regardless of the
configured interval. That is probably the desired behaviour (operator wants it tried
now), but it is unspecified and inconsistent — if `retry_interval_ms` were very large the
behaviour would instead be "try immediately" by accident rather than by design.
**Recommendation**
Explicitly decide and encode the intent: either set `last_attempt_at = NULL` on
re-queue so the message is unambiguously due now, or set it to "now" so it waits one
interval. Document the chosen behaviour in the method's XML comment.
**Resolution**
Resolved 2026-05-16 (commit pending). Confirmed against the source: `RetryParkedMessage
Async` reset `status`, `retry_count` and `last_error` but left `last_attempt_at` stale,
so the operator-retried message's retry timing depended on the elapsed time since the
original pre-park attempt. Encoded the intent explicitly — an operator retry means
"attempt this again now" — by also setting `last_attempt_at = NULL` in the UPDATE, so
the re-queued message is unambiguously due on the next sweep regardless of the
configured `retry_interval_ms`. The method's XML comment now documents this. Regression
test `RetryParkedMessageAsync_ClearsLastAttemptAt_SoMessageIsImmediatelyDue` uses a
1-hour retry interval and a recent `last_attempt_at`; it failed pre-fix (timestamp not
cleared, message excluded from the retry-due set) and passes post-fix.
### StoreAndForward-011 — `StoreAndForwardMessageStatus.InFlight` is unused and the doc's "retrying" status is unmodelled
| | |
|--|--|
| Severity | Low |
| Category | Design-document adherence |
| Status | Deferred |
| Location | `src/ScadaLink.Commons/Types/Enums/StoreAndForwardMessageStatus.cs:9`; `src/ScadaLink.StoreAndForward/StoreAndForwardService.cs:219`, `:235` |
**Description**
The enum defines `Pending, InFlight, Parked, Delivered`. The module only ever uses
`Pending` and `Parked``InFlight` and `Delivered` are never assigned (delivered
messages are deleted, not marked `Delivered`). Meanwhile the Component design doc
("Message Format" -> Status) specifies the set "Pending, retrying, or parked." So the
code's enum drifts from the doc in two directions: it carries dead members the doc does
not mention (`InFlight`, `Delivered`) and omits the doc's `retrying` state. A message
mid-retry is indistinguishable from one that has never been attempted.
**Recommendation**
Reconcile the enum with the design. Either drop the unused members and update the doc,
or implement the documented `retrying` state and use `InFlight` to mark a message the
sweep is actively delivering (which would also help with finding 005).
**Resolution**
_Deferred 2026-05-16. Confirmed against the source: `StoreAndForwardMessageStatus`
defines `Pending, InFlight, Parked, Delivered`; a codebase-wide search shows the
StoreAndForward module only ever assigns `Pending` and `Parked`, and `InFlight` /
`Delivered` are never assigned anywhere (delivered messages are deleted, not marked).
The design doc's `retrying` state is unmodelled. Both options the recommendation offers
— (a) drop the unused `InFlight`/`Delivered` members, or (b) add a `Retrying` member —
require editing `StoreAndForwardMessageStatus.cs`, which lives in `src/ScadaLink.Commons`
(outside this review's edit scope: only `src/ScadaLink.StoreAndForward/**` may be
changed). The enum is also referenced by IntegrationTests and HealthMonitoring tests, so
removing members is a cross-module change. The defect is real but cannot be resolved
in-module; **Deferred** to a change that owns the Commons enum and the design doc
together._
### StoreAndForward-012 — `StoreAndForwardMessage` is a persistence entity but lives in the component, not Commons
| | |
|--|--|
| Severity | Low |
| Category | Code organization & conventions |
| Status | Deferred |
| Location | `src/ScadaLink.StoreAndForward/StoreAndForwardMessage.cs:9` |
**Description**
`StoreAndForwardMessage` is a persistence-ignorant POCO that maps directly to the
`sf_messages` table and is also carried across the network inside `ReplicationOperation`
(replicated to the standby node over Akka remoting). CLAUDE.md "Code Organization" states
that entity classes are persistence-ignorant POCOs in Commons and that message contracts
follow additive-only evolution. Because this type doubles as a replication wire contract
but lives in the component assembly, it is not co-located with the other Commons
entities and its evolution is not governed by the additive-only message-contract rule.
This is a borderline case (the type is site-local), but the cross-node use via
`ReplicationOperation` makes it a de-facto message contract.
**Recommendation**
Either move `StoreAndForwardMessage` (and `ReplicationOperation`) into the Commons
`Entities`/`Messages` hierarchy so they are governed by the contract-evolution rules, or
introduce a separate DTO for replication and keep `StoreAndForwardMessage` purely as the
local persistence model. Document the decision.
**Resolution**
_Deferred 2026-05-16. Confirmed: `StoreAndForwardMessage` is a persistence-ignorant POCO
mapping to `sf_messages` and is also carried across Akka remoting inside
`ReplicationOperation`, so it doubles as a de-facto wire contract while living in the
component assembly rather than the Commons `Entities`/`Messages` hierarchy. The
recommendation's primary remedy — moving `StoreAndForwardMessage` (and
`ReplicationOperation`) into Commons — crosses module boundaries (it would add a type to
`src/ScadaLink.Commons`, outside this review's edit scope of
`src/ScadaLink.StoreAndForward/**`, and change every referencing module). The alternative
"separate replication DTO" still leaves the persistence entity in the component, so it
does not actually resolve the finding's core concern (entity placement / contract-
evolution governance). Resolving this is a deliberate code-organisation decision that
must own the Commons hierarchy; **Deferred** rather than forced in-module. Flagged for a
cross-module follow-up._
### StoreAndForward-013 — Critical paths lack test coverage: retry-due timing, replication-from-active, and the actor bridge
| | |
|--|--|
| Severity | Medium |
| Category | Testing coverage |
| Status | Resolved |
| Location | `tests/ScadaLink.StoreAndForward.Tests/` (whole directory); `src/ScadaLink.StoreAndForward/StoreAndForwardStorage.cs:101`; `src/ScadaLink.StoreAndForward/ParkedMessageHandlerActor.cs` |
**Description**
The existing tests cover storage CRUD and the service happy/failure paths well, but
three important behaviours are untested: (1) the retry-due time filter in
`GetMessagesForRetryAsync` — every service test sets `DefaultRetryInterval = TimeSpan.Zero`,
so the `julianday` elapsed-time comparison (the most error-prone SQL in the module) is
never exercised with a non-zero interval; a message that is *not yet due* should be
skipped, and that is never verified. (2) Replication from the active side — no test
asserts that an enqueue/remove/park causes a `Replicate*` call (this is exactly the gap
behind finding 001; a test would have caught it). (3) `ParkedMessageHandlerActor` has no
test at all — the Query/Retry/Discard request-to-response mapping and the
`ExtractMethodName` JSON parsing are unverified, including the malformed-JSON branch.
**Recommendation**
Add tests for: a non-zero retry interval where a recently-attempted message is excluded
and an older one is included; active-side replication invocation per operation type
(once finding 001 is fixed); and `ParkedMessageHandlerActor` using `Akka.TestKit`,
including `ExtractMethodName` for `MethodName`, `Subject`, missing-property and
invalid-JSON payloads.
**Resolution**
Resolved 2026-05-16 (commit pending). All three gaps closed.
(1) Retry-due timing: `GetMessagesForRetryAsync_NonZeroInterval_ExcludesNotYetDue
IncludesDue` exercises the `julianday` elapsed-time SQL with non-zero intervals — a
just-attempted message (1-hour interval) is excluded, an old one (attempted 2h ago,
5-min interval) and a never-attempted one are included.
(2) Active-side replication: this sub-claim was **already stale** at the reviewed code —
`StoreAndForwardReplicationTests` (added with finding 001's fix) asserts the Add, Remove
and Park replication operations; no new test was needed.
(3) `ParkedMessageHandlerActor`: new `ParkedMessageHandlerActorTests` using
`Akka.TestKit.Xunit2` (package reference added to the test project) covers Query/Retry/
Discard request-to-response mapping, correlation-ID propagation, the unknown-message
failure responses, and `ExtractMethodName` for `MethodName`, `Subject`, missing-property
and malformed-JSON payloads (all falling back to the category name without throwing).
These are coverage-gap tests over already-correct code, so they pass on first run.
### StoreAndForward-014 — Storage does not create its SQLite database directory
| | |
|--|--|
| Severity | Medium |
| Category | Error handling & resilience |
| Status | Resolved |
| Location | `src/ScadaLink.StoreAndForward/StoreAndForwardStorage.cs:26` |
**Found 2026-05-16** while verifying the store-and-forward fixes — this defect was
not part of the original baseline review.
**Description**
`StoreAndForwardStorage.InitializeAsync` opened a `SqliteConnection` against the
configured `SqliteDbPath` (default `./data/store-and-forward.db`) without ensuring the
parent directory exists. SQLite creates the database *file* on demand but not its
*directory*, so when `data/` does not already exist the connection fails to open with
`SQLite Error 14: 'unable to open database file'`. Every site-host boot therefore failed
in any environment whose working directory has no `data/` folder — the cause of the six
failing `SiteActorPathTests` (the host's `RegisterSiteActors` aborts at
`StoreAndForwardService.StartAsync`). Production masked it because `data/` is created by
the Docker image / deployment.
**Recommendation**
Create the parent directory of a file-backed SQLite database before opening it.
**Resolution**
Resolved 2026-05-16. `InitializeAsync` now calls a new `EnsureDatabaseDirectoryExists`
helper that parses the connection string with `SqliteConnectionStringBuilder` and, for a
file-backed database, creates the parent directory if it is missing (in-memory databases
and bare filenames are skipped). Regression test
`InitializeAsync_FileInMissingDirectory_CreatesDirectory` fails against the pre-fix code;
all six `SiteActorPathTests` now pass. Fixed by the commit whose message references
`StoreAndForward-014`.
### StoreAndForward-015 — `EnqueueAsync`'s public contract never documents that `maxRetries == 0` means "retry forever"
| | |
|--|--|
| Severity | Medium |
| Category | Documentation & comments |
| Status | Resolved |
| Location | `src/ScadaLink.StoreAndForward/StoreAndForwardService.cs:114``:130`, `:285` |
**Description**
The re-review brief asks for the StoreAndForward side of the cross-module ambiguity
recorded as `ExternalSystemGateway-015`. The semantics are split across this module and
its callers, and the StoreAndForward side carries a genuine documentation/API-contract
fault:
- `RetryMessageAsync` parks a message only when `message.MaxRetries > 0 && message.RetryCount >= message.MaxRetries`
(`StoreAndForwardService.cs:285`). When `MaxRetries == 0` the guard is false on every
sweep, so a `0` value means **"no limit — retry forever"**. The
`StoreAndForwardMessage.MaxRetries` XML doc (`StoreAndForwardMessage.cs:31`) does state
`"0 = no limit"`, so the persistence model is internally consistent.
- But `EnqueueAsync` — the *only* public entry point into the engine — exposes a
`maxRetries` parameter (`StoreAndForwardService.cs:128`) with **no parameter
documentation at all**, and its method summary (lines 114122) describes the lifecycle
only as "On max retries → park" / "parked once `MaxRetries` is reached" (see also the
`_deliveryHandlers` field doc, line 5051). Nothing on the public surface tells a
caller that passing `0` flips the meaning from "park immediately / never retry" to
"retry forever". A caller reading only `EnqueueAsync` would reasonably assume `0`
retries means zero retries.
- This is exactly the trap ESG fell into: `ExternalSystemClient.CachedCallAsync` /
`DatabaseGateway.CachedWriteAsync` pass the source entity's `MaxRetries` verbatim,
intending `0` to mean "never retry", and instead get unbounded retry — the
duplicate-delivery / unbounded-buffer-growth hazard the design doc's idempotency note
warns against. The fault is not solely ESG's: the S&F public API silently overloads
`0` with the opposite of its natural reading and does not document it.
The defect is in this module's API contract and documentation, so it is recorded here
in addition to `ExternalSystemGateway-015`. (Whether `0` *should* mean "no limit" or
"no retry" is the cross-module design decision tracked by ESG-015; this finding is
specifically that the StoreAndForward public surface fails to document whichever
meaning is chosen.)
**Recommendation**
Document the `maxRetries` parameter on `EnqueueAsync` explicitly with a `<param>` tag
that states the `0` special case in the same words as `StoreAndForwardMessage.MaxRetries`
(`"0 = no limit — retried on every sweep until delivered, never parked"`), and add the
`0` case to the method summary's lifecycle description. Better still — and consistent
with the resolution of ESG-015 — make the engine reject the ambiguity at the API: accept
a nullable/`enum` retry policy, or treat `0` as an explicit "no retry" (do not buffer, or
park on the first sweep) so the natural reading and the behaviour agree. Either way the
public `EnqueueAsync` contract must state the chosen meaning; today it states nothing.
**Resolution**
Resolved 2026-05-17. Documentation-only fix — retry semantics confirmed correct and
left unchanged. Root cause verified against the source: `EnqueueAsync`'s `maxRetries`
parameter had no `<param>` documentation and the method/class summaries described only
the "park on max retries" path, never the `0 = no limit / retry forever` special case
that `RetryMessageAsync`'s `MaxRetries > 0` guard actually enforces. Added an explicit
`<param>` tag for every `EnqueueAsync` parameter — `maxRetries` now states in bold that
`0` means "no limit, never parked for retry exhaustion" and is **not** a "never retry"
value — extended the method summary with a retry-count lifecycle paragraph, updated the
class-level lifecycle bullet, and tightened the `StoreAndForwardMessage.MaxRetries` field
doc to the same wording. No behavioural code touched; an XML comment is not
test-observable so no regression test was added.
### StoreAndForward-016 — Operator-initiated parked-message retry and discard are not replicated to the standby
| | |
|--|--|
| Severity | Medium |
| Category | Error handling & resilience |
| Status | Resolved |
| Location | `src/ScadaLink.StoreAndForward/StoreAndForwardService.cs:339``:362`; `src/ScadaLink.StoreAndForward/ReplicationService.cs:131``:136` |
**Description**
`StoreAndForward-001`'s fix wired replication into the active *delivery* paths:
`BufferAsync` replicates an `Add`, a successful retry replicates a `Remove`, and a park
replicates a `Park`. But the two *operator* paths — `RetryParkedMessageAsync` (line 339)
and `DiscardParkedMessageAsync` (line 353) — change buffer state and never touch
`_replication`:
- `RetryParkedMessageAsync` flips a row from `Parked` back to `Pending` (and clears
`retry_count` / `last_attempt_at`) in the local SQLite only. The standby's copy stays
`Parked`.
- `DiscardParkedMessageAsync` `DELETE`s the row from the local SQLite only. The standby's
copy is left in place, still `Parked`.
The Component design doc ("Persistence") requires the active node to forward "each
buffer operation (add, remove, park)" so that on failover "the new active node has a
near-complete copy of the buffer." An operator retrying a parked message is a buffer
state change; an operator discarding one is a removal. After a failover that follows an
operator action:
1. A **discarded** message reappears on the new active node — it is still `Parked`
there, so it resurfaces in the central UI's parked-message list and an operator must
discard it a second time. For a message deliberately removed (e.g. a known-bad
payload) this is a correctness regression of the operator's intent.
2. A **retried** message is still `Parked` on the new active node, so the operator's
"move it back to the queue" action is silently lost across the failover and the
message is not re-attempted.
`ReplicationOperationType` only models `Add`/`Remove`/`Park` — there is no operation for
"un-park / move back to pending", so even a minimal fix needs either a new operation
type or a re-use of `Add` to overwrite the standby row. This is the same class of defect
as the now-resolved `StoreAndForward-001`, for the operator paths rather than the sweep
paths.
**Recommendation**
Replicate both operator actions. `DiscardParkedMessageAsync` should call
`_replication?.ReplicateRemove(messageId)` after a successful local delete (the existing
`Remove` op already deletes on the standby). For `RetryParkedMessageAsync`, add a
`Requeue`/`Unpark` `ReplicationOperationType` whose `ApplyReplicatedOperationAsync` case
resets the standby row to `Pending` with `retry_count = 0`, or have the method re-load
the updated message and replicate it as an `Add`-style upsert. Add replication tests for
both operator paths (the existing `StoreAndForwardReplicationTests` only cover the sweep
paths).
**Resolution**
Resolved 2026-05-17. Root cause confirmed: the two operator paths
(`RetryParkedMessageAsync`, `DiscardParkedMessageAsync`) changed local SQLite state but
never touched `_replication`, so a failover after an operator action diverged the
standby buffer. `DiscardParkedMessageAsync` now calls `_replication?.ReplicateRemove`
after a successful local delete (the existing `Remove` op deletes on the standby). A new
`ReplicationOperationType.Requeue` was added; `RetryParkedMessageAsync` re-loads the
requeued row and calls `_replication?.ReplicateRequeue`, and the standby's
`ApplyReplicatedOperationAsync` `Requeue` case resets its matching row to `Pending` with
`retry_count = 0`. Regression tests `DiscardingAParkedMessage_ReplicatesARemoveOperation`,
`RetryingAParkedMessage_ReplicatesARequeueOperation` and
`ApplyReplicatedOperation_Requeue_MovesStandbyRowBackToPending` (in
`StoreAndForwardReplicationTests`) all pass; the first two fail against the pre-fix code.
### StoreAndForward-017 — Retry/Discard activity-log entries hard-code the `ExternalSystem` category
| | |
|--|--|
| Severity | Low |
| Category | Correctness & logic bugs |
| Status | Resolved |
| Location | `src/ScadaLink.StoreAndForward/StoreAndForwardService.cs:344`, `:358` |
**Description**
`RetryParkedMessageAsync` and `DiscardParkedMessageAsync` raise an S&F activity
notification (consumed by Site Event Logging — WP-14) but pass a hard-coded
`StoreAndForwardCategory.ExternalSystem` as the category argument:
```csharp
RaiseActivity("Retry", StoreAndForwardCategory.ExternalSystem, $"Parked message {messageId} moved back to queue");
RaiseActivity("Discard", StoreAndForwardCategory.ExternalSystem, $"Parked message {messageId} discarded");
```
Both methods take only a `messageId` and never load the message, so they have no access
to its real category. When an operator retries or discards a parked **Notification** or
**CachedDbWrite** message, the site event log records the activity under
`ExternalSystem`. Every other `RaiseActivity` call in the service passes the message's
true `Category` (`EnqueueAsync`, `RetryMessageAsync`), so the operator paths are
inconsistent and produce mislabelled audit entries — misleading when an operator later
filters or reviews S&F activity by category.
**Recommendation**
Load the message (or have `StoreAndForwardStorage.RetryParkedMessageAsync` /
`DiscardParkedMessageAsync` return the affected row's category) and pass the real
`Category` to `RaiseActivity`. If loading the row is considered too costly on these
infrequent operator paths, change the `OnActivity` event / `RaiseActivity` signature to
allow a nullable category for management actions rather than asserting a false one.
**Resolution**
Resolved 2026-05-17. Root cause confirmed: `RetryParkedMessageAsync` and
`DiscardParkedMessageAsync` raised activity notifications with a hard-coded
`StoreAndForwardCategory.ExternalSystem`, mislabelling parked `Notification` and
`CachedDbWrite` messages in the site event log. Both methods now obtain the message's
real category — `DiscardParkedMessageAsync` loads the row via `GetMessageByIdAsync`
before the delete, `RetryParkedMessageAsync` re-loads the requeued row (also used for
the StoreAndForward-016 replication) — and pass it to `RaiseActivity` (falling back to
`ExternalSystem` only if the row is unexpectedly absent). Regression tests
`RetryParkedMessageAsync_ActivityUsesMessageRealCategory` and
`DiscardParkedMessageAsync_ActivityUsesMessageRealCategory` assert the activity carries
`Notification` / `CachedDbWrite` respectively; both fail against the pre-fix code.
### StoreAndForward-018 — Notification corrupt-payload parks the buffered message, contradicting the "notifications do not park" design invariant
| | |
|--|--|
| Severity | High |
| Category | Design-document adherence |
| Status | Resolved |
| Location | `src/ScadaLink.StoreAndForward/NotificationForwarder.cs:62``:69`, `:105``:122`; `src/ScadaLink.StoreAndForward/StoreAndForwardService.cs:369``:397` |
**Resolution**`NotificationForwarder.DeliverAsync` now discards a corrupt
buffered payload instead of returning false. The corrupt path logs a Warning
with the buffered row id + length-capped payload preview via an injected
`ILogger<NotificationForwarder>` (NullLogger by default for back-compat),
then returns true so the S&F engine clears the row via its standard
success-path cleanup — honoring the "notifications do not park" design
invariant. Two regression tests in `NotificationForwarderTests` cover the
two corrupt shapes (invalid JSON, `null` deserialisation) and pin that
nothing is forwarded to central in either case.
**Description**
The Component design doc explicitly carves out notifications from the parking lifecycle:
> "Notifications do not park — they are retried at the fixed forward interval until
> central acks." (`docs/requirements/Component-StoreAndForward.md:47`)
> "Parking applies only to the external-system-call and cached-database-write
> categories." (same line)
`NotificationForwarder.DeliverAsync` violates this. When `TryBuildSubmit` fails to
deserialize the buffered payload — either because `JsonSerializer.Deserialize` throws a
`JsonException` (line 114) or because it returns `null` (line 119) — `DeliverAsync`
returns `false` (line 68). On the **retry path** the S&F engine treats handler `false`
as a permanent failure and **parks the message immediately** via the conditional
`UpdateMessageIfStatusAsync(... Parked)` write at `StoreAndForwardService.cs:373``:385`.
Result: a notification with a corrupt buffered payload — a row that the engine itself
treats as opaque ("Payload: Serialized message content…"; `Component-StoreAndForward.md:
110`) — enters the parked state and surfaces in the central UI's parked-message list
under the `Notification` category, contradicting the doc's invariant and the resolved
StoreAndForward-017's "Notification / CachedDbWrite" Retry/Discard category mapping.
The defect is real today: the inline comment on `NotificationForwarder.cs:64` even
documents the violation ("An unreadable payload cannot be fixed by retrying — park it
(return false)") as the intended behaviour, but that behaviour is what the design doc
forbids. Either the doc needs to acknowledge a poison-payload parking exception for
notifications, or the forwarder needs a different escape hatch (discard? log + drop?
permanent-failure-as-`true` to clear the buffer?). Today there is no consistent answer
between code and design.
Additionally, on the **immediate-delivery** path (a fresh enqueue followed by a
`DeliverAsync` returning `false`), the engine returns `WasBuffered: false` and the row
is never persisted — so the corrupt-payload "park" only occurs on the retry path, where
the message has already been buffered (and replicated to the standby). The
**inconsistency between the two paths** ("not buffered" vs "parked") for the same
permanent-failure outcome is itself a contract surprise; the resolved StoreAndForward-004
documents the immediate vs retry asymmetry, but does not anticipate that the retry
asymmetry will violate a per-category invariant.
**Recommendation**
Choose one consistent reconciliation. Preferred option: change `NotificationForwarder.
DeliverAsync` to **discard** a corrupt payload rather than park it — delete the
buffered row directly, log a Site Event Log entry under `Discard`, and return `true` so
the engine clears the buffer. This preserves the design's "notifications do not park"
invariant. Alternatives: (a) update the design doc to acknowledge a poison-payload
parking exception specifically for notifications, and revise the resolved
StoreAndForward-017 wording; (b) treat `JsonException` as transient (would retry-forever
on a corrupt payload — bad); (c) introduce a per-category park-allowed flag on the
engine and gate the retry-path park behind it for the Notification category.
Add a regression test in `tests/ScadaLink.StoreAndForward.Tests/NotificationForwarderTests.
cs` asserting that a corrupt-payload notification reaches a terminal **non-Parked**
state — today the corrupt-payload behaviour is uncovered.
**Resolution**
_Unresolved._
### StoreAndForward-019 — Notifications park after `DefaultMaxRetries` exhaustion, contradicting "retried until central acks"
| | |
|--|--|
| Severity | Medium |
| Category | Error handling & resilience |
| Status | Resolved |
| Location | `src/ScadaLink.StoreAndForward/StoreAndForwardService.cs:229`, `:407``:437`; `src/ScadaLink.StoreAndForward/StoreAndForwardOptions.cs:18`; `src/ScadaLink.SiteRuntime/Scripts/ScriptRuntimeContext.cs:1773``:1778`; `src/ScadaLink.NotificationService/NotificationDeliveryService.cs:149``:156` |
**Description**
The design doc requires a buffered notification to be retried indefinitely until
central acks:
> "The **notification** category retries differently: it has no source-entity setting.
> The site→central forward uses a single fixed retry interval configured in the host
> `appsettings.json`. … A buffered notification is retried until central acks it; it is
> not parked on a retry limit (central, once reachable, owns delivery, retry, and
> parking from that point on)." (`Component-StoreAndForward.md:55``:59`)
The current engine cannot honour that. `RetryMessageAsync` enforces parking at
`message.MaxRetries > 0 && message.RetryCount >= message.MaxRetries`
(`StoreAndForwardService.cs:407`); a `MaxRetries == 0` is the documented "no limit"
escape hatch (now correctly explained by the resolved StoreAndForward-015). But the two
notification enqueue paths both supply a positive bounded `MaxRetries`:
- `ScriptRuntimeContext.cs:1773``:1778` (the `Notify.Send` site script path) calls
`EnqueueAsync` without supplying the `maxRetries` argument, so the engine
defaults to `StoreAndForwardOptions.DefaultMaxRetries = 50` (`StoreAndForwardOptions.
cs:18`). After 50 retry sweeps with central unreachable, the notification is parked.
- `NotificationDeliveryService.cs:149``:156` (the legacy SMTP-style path retained for
the central-side `INotificationDeliveryService` callers) passes
`smtpConfig.MaxRetries > 0 ? smtpConfig.MaxRetries : null``null` falls back to the
same 50-retry default, and any positive `smtpConfig.MaxRetries` still bounds the
retry budget. Either way, a long central outage parks the notification.
A parked notification cannot be cleared by a central recovery: it stays parked until an
operator clicks **Retry** in the parked-message UI. The design's invariant — that
notification delivery converges automatically as soon as central is reachable — is
broken: an extended central outage requires manual intervention to clear the backlog,
which is exactly the behaviour the central-only outbox redesign was meant to remove
from the site.
This is closely related to (but distinct from) StoreAndForward-018: 018 is the
*permanent-failure-path* parking violation; 019 is the *transient-failure-path*
parking violation under the engine's normal max-retries policy.
**Recommendation**
Make the notification enqueue paths pass `maxRetries: 0` so the documented "no limit /
never parked" semantics apply, and guard against regression by adding an integration
test in `tests/ScadaLink.StoreAndForward.Tests/NotificationForwarderTests.cs` that runs
a sweep many more times than `DefaultMaxRetries` against an always-failing handler and
asserts the buffered notification's status stays `Pending` (not `Parked`). A cleaner
alternative is to special-case the `Notification` category inside
`RetryMessageAsync`'s max-retries guard (treat it as `MaxRetries == 0` regardless of
the field value) so the invariant is enforced at the single chokepoint rather than
relying on every caller to pass the right value — this also fixes the legacy
`NotificationDeliveryService` path without editing the consumer.
**Resolution (2026-05-28):**
VERIFY outcome — the design doc's "Notifications do not park" wording (lines 47, 59)
was the *operational intent* for the happy path, not an absolute invariant: the engine
has always enforced `DefaultMaxRetries` uniformly across every category, and every
sibling system (ESG, CachedDbWrite) bounds retry-then-parks for the same disk-pressure
and operator-visibility reasons. Removing the cap for notifications would let a single
unreachable central exhaust local disk via an unbounded buffer — worse than the
documented "park after retry budget" behaviour. Resolution is therefore the brief's
**default**: document the parking behaviour. Updated
`Component-StoreAndForward.md` lines 46/58 to clarify that the `DefaultMaxRetries` cap
applies uniformly (including to notifications) and that `maxRetries: 0` is the explicit
escape hatch for callers that need unbounded retry. Added a `StoreAndForward-019` block
to `StoreAndForwardOptions.DefaultMaxRetries`'s XML doc explaining the same invariant.
No behavioural code change — existing tests (104 in
`ScadaLink.StoreAndForward.Tests`) continue to pass.
### StoreAndForward-020 — `RetryParkedMessageAsync` skips standby replication when the message is deleted between local update and re-load
| | |
|--|--|
| Severity | Medium |
| Category | Concurrency & thread safety |
| Status | Resolved |
| Location | `src/ScadaLink.StoreAndForward/StoreAndForwardService.cs:599``:616` |
**Description**
The StoreAndForward-016 resolution wired Requeue replication into operator-initiated
retry. The fix uses a two-step pattern:
```csharp
public async Task<bool> RetryParkedMessageAsync(string messageId)
{
var success = await _storage.RetryParkedMessageAsync(messageId); // step 1
if (success)
{
var message = await _storage.GetMessageByIdAsync(messageId); // step 2 (no txn)
var category = message?.Category ?? StoreAndForwardCategory.ExternalSystem;
if (message != null)
{
_replication?.ReplicateRequeue(message); // step 3
}
RaiseActivity("Retry", category, ...);
}
return success;
}
```
The two storage calls are on separate connections with no surrounding transaction. A
concurrent writer between step 1 (which moved the row from Parked → Pending) and step 2
(which re-reads the row) can delete or mutate the row:
- An operator who issues `DiscardParkedMessageAsync` immediately after retry — the
`DiscardParkedMessageAsync` storage call is conditional on `status = Parked`, so it
will be a no-op (correct), but a sweep that succeeds in delivering the just-requeued
row will then call `_storage.RemoveMessageAsync` (unconditional), which deletes it.
In a single retry-sweep cycle this race is real because `DefaultRetryInterval = Zero`
is the standard test default and the operator action and a sweep tick can overlap.
- A `RemoveMessageAsync` runs in step 1's wake; `GetMessageByIdAsync` returns null;
step 3 (`_replication?.ReplicateRequeue`) is **skipped entirely**, but step 1
already requeued the row locally. The standby is now left in `Parked` state while
the active node has Pending-then-Deleted, exactly the standby-divergence StoreAndForward-016
was supposed to fix. (On the active node a subsequent failover lands on a Parked
standby copy of a discarded message — the same regression 016 already documented.)
The category-fallback path (`StoreAndForwardCategory.ExternalSystem` when message is
null) silently mislabels the activity log entry too — the same defect that
StoreAndForward-017 fixed for the non-racy path, except this branch handed back a
hard-coded fallback rather than re-loading. The activity log entry is a minor side
effect; the missing replication is the real defect.
**Recommendation**
Capture the message **once**, before the local Parked → Pending storage update, so the
replication path has the row in hand even if a concurrent writer deletes it
afterwards:
```csharp
var message = await _storage.GetMessageByIdAsync(messageId); // before the update
if (message == null || message.Status != StoreAndForwardMessageStatus.Parked)
return false;
var success = await _storage.RetryParkedMessageAsync(messageId);
if (!success) return false;
// `message` was the parked row; the active node just wrote it back to Pending with
// retry_count = 0 — construct the replicated state from those known mutations.
message.Status = StoreAndForwardMessageStatus.Pending;
message.RetryCount = 0;
message.LastError = null;
message.LastAttemptAt = null;
_replication?.ReplicateRequeue(message);
RaiseActivity("Retry", message.Category, $"Parked message {messageId} moved back to queue");
return true;
```
Add a regression test in `StoreAndForwardReplicationTests` that simulates the
delete-between-update-and-reload race and asserts the `Requeue` replication
operation is still emitted with the correct category.
**Resolution (2026-05-28):**
Applied the brief's primary recommendation — `RetryParkedMessageAsync` now captures
the parked row up front via `GetMessageByIdAsync` (and rejects the call early if the
row is missing or no longer `Parked`), then performs the local `RetryParkedMessageAsync`
storage write, and finally reconstructs the post-requeue state on the captured POCO
(`Status = Pending, RetryCount = 0, LastError = null, LastAttemptAt = null`) and
replicates it. A concurrent `RemoveMessageAsync` or `DiscardParkedMessageAsync` running
between the local write and the original re-load can no longer skip replication — the
row is in hand. The category-fallback misllabelling on the racy path is gone because
the activity log uses the captured `Category` directly.
### StoreAndForward-021 — Design doc claims the Operation Tracking Table lives in StoreAndForward but the implementation is in SiteRuntime
| | |
|--|--|
| Severity | Medium |
| Category | Design-document adherence |
| Status | Resolved |
| Location | `docs/requirements/Component-StoreAndForward.md:21`, `:49``:51`, `:77``:87`, `:108`, `:114`; `src/ScadaLink.SiteRuntime/Tracking/OperationTrackingStore.cs:37`; `src/ScadaLink.StoreAndForward/` (whole module) |
**Description**
Component-StoreAndForward.md repeatedly assigns the **Operation Tracking Table** to
this component:
- **Responsibilities** (line 21): "Maintain a site-local **operation tracking table**
holding one row per `TrackedOperationId` for cached calls … the authoritative status
record consulted by `Tracking.Status(id)`."
- **Message Lifecycle** (lines 4951): "the operation tracking table is the status
record and the S&F buffer is purely the retry mechanism. A cached call that succeeds
on its first immediate attempt is written directly as a terminal `Delivered` tracking
row and never enters the S&F buffer."
- **Operation Tracking Table** section (lines 7787): "Alongside the S&F buffer DB,
each site node holds a **site-local operation tracking table** in SQLite. … Each row
records the operation kind (`TrackedOperationKind`) …"
The actual implementation lives outside this module: `src/ScadaLink.SiteRuntime/
Tracking/OperationTrackingStore.cs` (and `IOperationTrackingStore`, `OperationTrackingOptions`).
The StoreAndForward project contains no references to the tracking store, owns no
`operation_tracking` table, and `StoreAndForwardService.NotifyCachedCallObserverAsync`
is only a hook handing telemetry context to an `ICachedCallLifecycleObserver` — the
audit bridge wired in `ScadaLink.AuditLog`. The S&F module is **not** the table's
owner; SiteRuntime is.
This is a real design-doc drift, not a code defect, and is flagged explicitly in the
brief's "Module-specific notes". The drift matters because the design doc's
discussion of the lifecycle — "immediate success writes a terminal Delivered tracking
row directly here", "operator discard sets terminal `Discarded`", "central never
mutates the mirror row directly" — places coordination responsibilities on the wrong
component. A reader looking for the source of truth for `Tracking.Status(id)` would
read `Component-StoreAndForward.md` and search `src/ScadaLink.StoreAndForward/` in
vain. The doc also lists Site Call Audit / Audit Log telemetry-emission as a S&F
responsibility (line 22), but the emission actually happens via the `AuditLog` site
component subscribing to `ICachedCallLifecycleObserver`.
**Recommendation**
Reconcile the doc with the code. The simplest fix is doc-side: update
Component-StoreAndForward.md to scope its responsibilities back to the retry
mechanism + replication + parked-message management, and add a cross-reference to a
new (or existing) component doc for Operation Tracking (Component-SiteRuntime.md, or
a new Component-OperationTracking.md). The code is internally consistent — the audit
bridge subscribes to the observer hook, the SiteRuntime store writes the rows, the S&F
engine emits attempt telemetry on the cached-call hot path — but the design doc is
several refactors out of date. The hierarchical map should be:
- `Component-StoreAndForward.md` → S&F buffer + Replication + Parked-message
management + Notification forwarding to central + cached-call telemetry **hook**.
- New doc / SiteRuntime doc → Operation Tracking Table semantics and lifecycle.
- `Component-SiteCallAudit.md` / `Component-AuditLog.md` → telemetry emission +
central-side mirror.
**Resolution (2026-05-28):**
Doc-side fix applied (per the brief, the simplest of the two options). Updated
`Component-StoreAndForward.md`: (1) removed the "Maintain a site-local operation
tracking table" line from Responsibilities and reworded the cached-call telemetry
responsibility to point at the `ICachedCallLifecycleObserver` hook; (2) renamed the
"Operation Tracking Table" section to "Operation Tracking Table (lives in Site
Runtime, not here)" with an explicit `StoreAndForward-021` callout cross-linking to
`Component-SiteRuntime.md` and the `IOperationTrackingStore` interface in
Commons. The rest of the section is retained for cross-component context (the
buffered cached-call rows carry `TrackedOperationId` so the link to the tracking row
must still be documented somewhere) but is reworded to make clear the table itself is
not owned here.
### StoreAndForward-022 — `NotifyCachedCallObserverAsync` silently drops the entire audit lifecycle when the message id is not a parseable `TrackedOperationId`
| | |
|--|--|
| Severity | Low |
| Category | Documentation & comments |
| Status | Resolved |
| Location | `src/ScadaLink.StoreAndForward/StoreAndForwardService.cs:484``:515` |
**Description**
`NotifyCachedCallObserverAsync` (the per-attempt observer notifier wired by the M3
Bundle E rollout) bails out with no audit emission when
`TrackedOperationId.TryParse(message.Id, out var trackedId)` returns false
(`StoreAndForwardService.cs:510``:515`). The inline comment justifies the behaviour as
back-compat for "pre-M3 message (random GUID-N id from S&F itself, no
TrackedOperationId threaded in)", but the documented contract is broken in two ways:
1. **Silent dropping of every audit row, not just the first one.** The skip means no
`Attempted` row, no `CachedResolve` terminal row, no audit trail at all for that
operation's S&F lifecycle — yet the rest of the system (script trust boundary,
parked-message UI, etc.) still treats the operation as audit-tracked. The drop is
not surfaced via a metric, log warning (the path is a silent `return`), or counter,
so a misconfigured caller bypasses the audit hot path with zero feedback.
2. **The contract is hidden in field-level XML.** The `ICachedCallLifecycleObserver`
public interface contract (defined in `ScadaLink.Commons`) does not document that
the observer will be silently skipped when the underlying S&F message id is not a
GUID. A consumer reading the interface contract reasonably expects every cached-call
attempt to surface — the audit pipeline depends on it. The silent-drop is an
implementation detail of the S&F bridge that should be either lifted onto the
contract or removed.
The engine itself mints GUID-N ids via `Guid.NewGuid().ToString("N")` (line 224), which
`TrackedOperationId.TryParse` accepts, so the skip path is unreachable for engine-minted
ids. It is reachable only for callers that supply their own `messageId` argument with a
non-GUID format. The current callers (`NotificationOutbox` enqueue path with
NotificationId, cached-call enqueue path with `TrackedOperationId.ToString()`) all
supply GUID-shaped ids. The defect is latent — a future caller passing a non-GUID id
would silently bypass audit.
**Recommendation**
Two options. The cheap fix: change the skip to a `_logger.LogWarning` with the offending
id so a misconfigured caller is observable, and update the
`ICachedCallLifecycleObserver` XML doc to mention the "non-GUID id → no telemetry"
contract explicitly. The more correct fix: emit a still-audited row for the
non-GUID case (e.g. synthesise a `TrackedOperationId` from the underlying id, or emit a
distinguished "tracking-id-missing" audit row) so the audit pipeline never has silent
holes. Add a regression test in `CachedCallAttemptEmissionTests` capturing the chosen
contract — the existing
`Attempt_MessageIdNotAGuid_NoObserverNotification` test pins today's silent-skip; if
the fix is "log + skip", that test should be updated to also assert the log emission;
if the fix is "emit anyway", the test should be replaced.
**Resolution (2026-05-28):**
Applied the brief's "cheap fix" — the non-GUID skip path now logs a Warning naming
the offending `MessageId`, `Category` and `Outcome` before returning, so a
misconfigured caller is observable instead of silently bypassing the audit pipeline.
S&F retry bookkeeping remains untouched (the observer is still best-effort, the skip
still returns without throwing). The existing
`Attempt_MessageIdNotAGuid_NoObserverNotification` test still passes — its assertion
is on `_observer.Notifications` being empty, which is unchanged.
### StoreAndForward-023 — `siteId` silently defaults to empty when no `IStoreAndForwardSiteContext` is registered, degrading audit telemetry correlation
| | |
|--|--|
| Severity | Low |
| Category | Code organization & conventions |
| Status | Resolved |
| Location | `src/ScadaLink.StoreAndForward/ServiceCollectionExtensions.cs:43``:53`; `src/ScadaLink.StoreAndForward/StoreAndForwardService.cs:99`, `:524` |
**Description**
`AddStoreAndForward`'s service-collection factory resolves the optional
`IStoreAndForwardSiteContext` and falls back to `string.Empty` when not registered:
```csharp
var siteContext = sp.GetService<IStoreAndForwardSiteContext>();
var siteId = siteContext?.SiteId ?? string.Empty;
return new StoreAndForwardService(storage, options, logger, replication,
cachedCallObserver, siteId);
```
The constructor's parameter is even defaulted to `""`. The empty-string `siteId` flows
straight into every emitted `CachedCallAttemptContext.SourceSite`, which the central
audit pipeline uses as part of the `(SourceSite, TrackedOperationId)` correlation key.
A host that registers an `ICachedCallLifecycleObserver` (the audit observer wired by
`AddAuditLog`) but forgets to register `IStoreAndForwardSiteContext` will produce a
stream of telemetry rows with `SourceSite = ""` — the central audit mirror cannot
distinguish them by site, and the central-site routing of
`RetryParkedOperation`/`DiscardParkedOperation` commands keyed on `SourceSite` will
fail to find the owning site.
The Host's `IStoreAndForwardSiteContext` adapter and the audit observer registration
are wired in lock-step, so the current configuration is correct, but the silent
empty-string fallback is a contract hazard for future hosts (CLI test harness, second
site cluster topology, etc.) and for tests that wire one without the other.
**Recommendation**
Make the contract explicit: when `cachedCallObserver` is non-null, require
`IStoreAndForwardSiteContext` to be registered — throw an `InvalidOperationException`
with a clear "Audit observer registered without a site context — register
IStoreAndForwardSiteContext" message at construction time. When the audit observer is
absent (no `AddAuditLog`), keep the empty-string default since `_siteId` is unused.
Alternatively, change `siteId` from a parameter to a `Func<string>` resolved lazily
from the service provider so a late-registered context still takes effect.
**Resolution (2026-05-28):**
Applied the brief's sentinel option (less invasive than throwing — preserves the
existing test wiring that constructs `StoreAndForwardService` without a site context).
Introduced `StoreAndForwardService.UnknownSiteSentinel = "$unknown-site"` (leading
`$` chosen so it cannot collide with a real site id) and the constructor now
normalises any null/empty/whitespace `siteId` argument to that sentinel. The empty
string can no longer reach `CachedCallAttemptContext.SourceSite`; a misconfigured
host without an `IStoreAndForwardSiteContext` produces audit rows tagged with the
sentinel — recognisably bad in the central audit log instead of silently merging
into the empty bucket. All 104 existing tests pass; the only test that asserts a
literal `SourceSite` (`CachedCallAttemptEmissionTests`) supplies `"site-77"` so the
normalisation is a no-op there.
### StoreAndForward-024 — `StopAsync` does not wait for an in-flight retry sweep, so disposed dependencies can be touched after shutdown
| | |
|--|--|
| Severity | Low |
| Category | Concurrency & thread safety |
| Status | Resolved |
| Location | `src/ScadaLink.StoreAndForward/StoreAndForwardService.cs:122``:127`, `:136``:143`, `:303``:329` |
**Resolution (2026-05-28):** the timer callback now captures the sweep task
into a `_sweepTask` field via `Volatile.Write`, and `StopAsync` disposes the
timer first (so no new sweep starts) then `await`s the captured task with a
bounded `SweepShutdownWaitTimeout` (10 s) via `Task.WaitAsync` — so a sweep
in-flight when shutdown begins is given a chance to finish before the host
disposes `_storage`/`_replication`. A genuinely hung sweep cannot block
shutdown indefinitely (the timeout fires, the wait is abandoned, the
warning is logged). Regression test
`StopAsync_AwaitsInFlightRetrySweep_BeforeReturning` parks a sweep inside a
blocking handler, asserts `StopAsync`'s returned task is not completed while
the sweep is paused, then releases the handler and asserts the sweep ran to
completion before `StopAsync` returned.
**Description**
`StartAsync` arms `_retryTimer` with `_ => _ = RetryPendingMessagesAsync()` (line 123).
The `_ =` discards the returned `Task`, so when the timer fires the sweep runs **fire
and forget** on a thread-pool thread. `StopAsync` (lines 136143) disposes the timer:
```csharp
if (_retryTimer != null)
{
await _retryTimer.DisposeAsync();
_retryTimer = null;
}
```
`Timer.DisposeAsync()` returns once any in-flight timer **callback** has completed —
but the timer callback in this service is a one-line `_ = RetryPendingMessagesAsync()`
that synchronously returns immediately and leaves the actual sweep running on the
thread pool. So `Timer.DisposeAsync` does not wait for the sweep; only for the
synchronous `_ = ...` discarding step. `StopAsync` returns while a sweep is potentially
still running, touching `_storage` (which the host will dispose), `_replication`
(which the host will tear down), and `_cachedCallObserver` (whose downstream gRPC
channel the host will shut down).
The host shutdown sequence (`AkkaHostedService`) tears down the actor system and the
DI container after this service's `StopAsync` completes — meaning a sweep that runs
past `StopAsync` can call into disposed `SqliteConnection`s (yielding
`ObjectDisposedException`, caught by the sweep's outer `try/catch` as a log) or, more
seriously, push a replication operation into a half-disposed Akka actor pipeline and
trigger noisy dead-letter warnings during a clean shutdown.
The race window is small (the sweep typically finishes in <100 ms in tests) but it is
real, particularly when shutting down a site under load with a non-empty buffer.
**Recommendation**
Track in-flight sweep tasks and `await` them in `StopAsync`:
```csharp
private Task? _currentSweep;
public async Task StopAsync()
{
if (_retryTimer != null)
{
await _retryTimer.DisposeAsync();
_retryTimer = null;
}
if (_currentSweep is { } sweep)
{
try { await sweep; } catch { /* logged inside RetryPendingMessagesAsync */ }
}
}
```
Change the timer callback to:
```csharp
_retryTimer = new Timer(_ => _currentSweep = RetryPendingMessagesAsync(), ...);
```
Add a `CancellationTokenSource` so a long sweep can be cooperatively aborted on stop;
plumb the cancellation token into `_storage` / `_replication` / `_cachedCallObserver`
calls. Add a regression test in `StoreAndForwardServiceTests` that calls `StopAsync`
mid-sweep and asserts no further storage activity occurs after `StopAsync` returns.
**Resolution**
_Unresolved._