77cb0ad0e2
The largest themed batch — small mechanical fixes across 11 modules.
API / message hygiene:
- Comm-020: SiteAddressCacheLoaded now carries IReadOnlyDictionary /
IReadOnlyList — Akka messages must be immutable.
- Commons-016: BundleSession.MaxUnlockAttempts named constant replaces
magic 3.
- Commons-018: IOperationTrackingStore + IPartitionMaintenance moved from
Interfaces/ root to Interfaces/Services/ (namespace preserved — 9
consumers exceeded the in-prompt move threshold).
- Commons-023: TrackingStatusSnapshot.SourceNode now consistent with the
trailing-optional-with-default pattern used elsewhere.
- SR-022: AuditingDbCommand.DbConnection.set no longer uses reflection —
exposes AuditingDbConnection.Inner via internal API surface.
Dead code / config cleanup:
- ClusterInfra-011: decorative SectionName constant deleted.
- ClusterInfra-014: dead AddClusterInfrastructureActors method + its
"throws-when-called" test deleted.
- Host-021: Microsoft Logging:LogLevel block deleted from appsettings.json
(dead under Serilog).
Fail-loud over fail-silent:
- DM-021: ResolveSiteIdentifierAsync throws on missing site (was silently
substituting a DB id).
- DM-022: dropped transient Pending write — record now lands directly in
InProgress (no UI flicker, one fewer DB write).
- Host-020: LoggerConfigurationFactory emits a Console.Error warning when
both Serilog:MinimumLevel and ScadaLink:Logging:MinimumLevel are set
(ScadaLink remains truth per Host-011).
- SnF-022: NotifyCachedCallObserverAsync logs Warning on unparseable
TrackedOperationId (was silently dropping).
- SnF-023: empty siteId default replaced with $unknown-site sentinel
+ constructor normalisation.
Correctness:
- SCA-001: SupervisorStrategy XML rewritten to match actual
DefaultDecider/Restart semantics (was claiming Resume).
- SCA-003: OnUpsertAsync now restamps IngestedAtUtc on every upsert.
- SR-021: HandleDeployArtifacts now dispatches an internal
ApplyArtifactDataConnectionsToDcl message after the SQLite write so
system-wide artifact-deploy data-connection changes go live
immediately (was requiring a site restart).
- SnF-020: RetryParkedMessageAsync captures the parked row BEFORE the
local write so a concurrent delete can't skip standby replication.
Sentinels / naming collisions:
- HM-021: CentralSiteId changed from "central" to "$central"
(uncollideable — leading $ is forbidden in real SiteIdentifiers).
Doc / surface cleanups:
- SEL-018: FailedWriteCount promoted to ISiteEventLogger; XML softened
to "Available for future Health Monitoring integration".
- SnF-019: VERIFY outcome — documented parking-after-DefaultMaxRetries
in Component-StoreAndForward.md + DefaultMaxRetries XML (uniform
cap; maxRetries:0 is the unbounded escape hatch).
- SnF-021: Component-StoreAndForward.md no longer claims the tracking
table lives in SnF — it's in SiteRuntime, the interface is in Commons.
- CLI-020: bundle export response parse guarded with try/catch on
JsonException / KeyNotFoundException / FormatException — emits a
clean INVALID_RESPONSE exit instead of a stack trace.
Config:
- ClusterInfra-013: intent comment added to "catastrophic config" test.
- Host-016: appsettings.Site.json second CentralContactPoints entry
removed (was pointing at the SITE's own port); doc-key explains how
to extend.
- Host-018: NodeName added to both shipped per-role configs (was
causing SourceNode to be null on audit rows).
UI:
- CentralUI-029: replaced JS.InvokeAsync<int>("eval", …) with an ES
module import (new wwwroot/js/browser-time.js).
- CentralUI-032: AuditResultsGrid gains a Previous button backed by a
cursor stack.
10+ new regression tests across the affected projects. Build clean;
all suites green. README regenerated: 6 open (was 33).
Session-to-date: 130 of 136 originally-open Theme findings closed.
1523 lines
86 KiB
Markdown
1523 lines
86 KiB
Markdown
# Code Review — StoreAndForward
|
||
|
||
| Field | Value |
|
||
|-------|-------|
|
||
| Module | `src/ScadaLink.StoreAndForward` |
|
||
| Design doc | `docs/requirements/Component-StoreAndForward.md` |
|
||
| Status | Reviewed |
|
||
| Last reviewed | 2026-05-28 |
|
||
| Reviewer | claude-agent |
|
||
| Commit reviewed | `1eb6e97` |
|
||
| Open findings | 0 (3 Deferred: 002, 011, 012; all 5 Open from Re-review 2026-05-28 resolved 2026-05-28) |
|
||
|
||
## Summary
|
||
|
||
The Store-and-Forward module is small and readable, with a clean SQLite persistence
|
||
layer, a sensible service API, and reasonable test coverage of the storage and service
|
||
happy paths. However the review surfaced two issues that undermine the module's core
|
||
purpose. First, the active delivery path never invokes the `ReplicationService` —
|
||
`ReplicateEnqueue/Remove/Park` have no callers anywhere in the codebase, so buffered
|
||
messages are not replicated to the standby node and the design's failover-durability
|
||
guarantee (Component doc "Persistence", CLAUDE.md "Store-and-Forward") is not met.
|
||
Second, there is an off-by-one in retry accounting: the immediate-failure path stores a
|
||
buffered message with `RetryCount = 1`, so a message configured with `MaxRetries = N`
|
||
is actually attempted `N` times in total rather than one immediate attempt plus `N`
|
||
retries, and a per-source `MaxRetries` of 1 produces zero retry attempts. Additional
|
||
themes: SQLite connection-per-call with no transactional grouping of multi-statement
|
||
operations, no concurrency guard against a parked message being retried while the
|
||
sweep is mid-flight, an unused enum member (`InFlight`) that drifts from the documented
|
||
status set, and untested critical paths (retry-due timing, replication-from-active,
|
||
the actor bridge). None of the findings are blockers for compilation, but the
|
||
replication and retry-count issues are functional defects against the design.
|
||
|
||
#### Re-review 2026-05-17 (commit `39d737e`)
|
||
|
||
Re-reviewed at commit `39d737e` after the batch-3 fixes. All of findings 001 and
|
||
003–010, plus 013–014, are confirmed `Resolved` against the current source: the
|
||
replication wiring (`BufferAsync`/`ReplicateRemove`/`ReplicatePark`), the corrected
|
||
retry-count semantics, the conditional `UpdateMessageIfStatusAsync` writes, the
|
||
transactioned parked-message reads, the `PipeTo` refactor, the `RaiseActivity`
|
||
hardening, the `RetryParkedMessageAsync` `last_attempt_at` reset and the database
|
||
directory creation are all present as described. Findings 002, 011 and 012 were
|
||
re-verified and remain validly `Deferred` — their preconditions are unchanged (002's
|
||
residual no-handler gap, 011's Commons-owned enum, 012's Commons-owned entity placement).
|
||
|
||
This pass surfaced **three new findings**. StoreAndForward-015 records the
|
||
StoreAndForward side of the cross-module `MaxRetries == 0` ambiguity flagged by
|
||
ExternalSystemGateway-015: `EnqueueAsync`'s public contract documents `maxRetries` only
|
||
as "parked once `MaxRetries` is reached" and never states the `0 = no limit / retry
|
||
forever` special case that `RetryMessageAsync` actually enforces, so an ESG caller
|
||
passing `0` to mean "never retry" gets the opposite behaviour with no warning from the
|
||
S&F API surface. StoreAndForward-016 records that operator-initiated parked-message
|
||
retry and discard are not replicated to the standby — only the add/remove/park sweep
|
||
paths are — so a failover diverges the standby buffer from the active one.
|
||
StoreAndForward-017 records that the Retry/Discard activity-log entries hard-code the
|
||
`ExternalSystem` category, mislabelling notification and cached-DB-write messages in
|
||
the site event log.
|
||
|
||
#### Re-review 2026-05-28 (commit `1eb6e97`)
|
||
|
||
Full re-review against commit `1eb6e97` with the same 10-category checklist. The
|
||
batch-3 / batch-4 resolutions (001, 003–010, 013–017) are still present and intact; no
|
||
regressions detected on prior fixes. Findings 002, 011 and 012 remain validly
|
||
`Deferred` (their preconditions are unchanged) and findings 005, 006, 010, 013, 014,
|
||
016, 017 are confirmed `Resolved` against the current source.
|
||
|
||
This pass surfaced **seven new findings** clustered around two themes:
|
||
|
||
The first theme is **design-doc drift on the notification path**, which has acquired
|
||
two now-real defects since the engine became central-targeted. `StoreAndForward-018`
|
||
(High) records that a corrupt notification payload — handled in `NotificationForwarder.
|
||
DeliverAsync` by returning `false` — parks a notification on its first retry-sweep
|
||
encounter, despite the design doc stating "Notifications do not park" (line 47, "Parking
|
||
applies only to the external-system-call and cached-database-write categories"). The
|
||
same path becomes a poison-payload retry-forever trap on the active node if the engine
|
||
ever softened the `false` semantics. `StoreAndForward-019` (Medium) records the
|
||
sibling defect: notifications are enqueued with `MaxRetries` defaulting to
|
||
`StoreAndForwardOptions.DefaultMaxRetries` (50), and the legacy SMTP path
|
||
(`NotificationDeliveryService.SendAsync`) passes a positive bounded `smtpConfig.
|
||
MaxRetries` — so an unreachable central will silently park notifications after a
|
||
finite retry budget rather than "retry at the fixed forward interval until central acks"
|
||
as the design requires. The contract `0 = no limit` is not enforced for the
|
||
notification category.
|
||
|
||
The second theme is **subtle correctness and contract gaps around the operator paths**
|
||
that survived the StoreAndForward-016/017 batch. `StoreAndForward-020` (Medium) records
|
||
that `RetryParkedMessageAsync` skips replication entirely if `GetMessageByIdAsync`
|
||
returns null after a successful local requeue (a narrow but real race window with a
|
||
concurrent discard / sweep delete), re-introducing the StoreAndForward-016 standby
|
||
divergence in that corner. `StoreAndForward-021` (Medium) is a design-doc-vs-code drift
|
||
that should be reconciled in the doc: the **operation tracking table** is documented
|
||
inside Component-StoreAndForward.md as a S&F responsibility (lines 21, 49, 77–87, 108,
|
||
114), but the actual `OperationTrackingStore` lives in `src/ScadaLink.SiteRuntime/
|
||
Tracking/` and is not consumed by S&F at all — the brief's own note flags this. The
|
||
design doc should be updated to point at SiteRuntime, or the store moved to
|
||
StoreAndForward.
|
||
|
||
`StoreAndForward-022` (Low) records that `_cachedCallObserver` silently drops audit
|
||
telemetry when a buffered cached-call's `Id` is not a parseable `TrackedOperationId`
|
||
GUID — the engine returns from `NotifyCachedCallObserverAsync` before emitting anything,
|
||
so a legacy enqueue path that buffered a non-GUID id (the engine's own default minting
|
||
produces "N"-formatted GUIDs, which TrackedOperationId.TryParse accepts, but any
|
||
caller passing a custom non-GUID id silently bypasses the entire `Submitted/Forwarded/
|
||
Attempted/Delivered/Parked/Discarded` audit lifecycle). `StoreAndForward-023` (Low)
|
||
records that `siteId` is silently defaulted to `string.Empty` when no
|
||
`IStoreAndForwardSiteContext` is registered, so a misconfigured host produces audit
|
||
telemetry with `SourceSite = ""` and the central audit-log's `(SourceSite,
|
||
TrackedOperationId)` correlation degrades to a per-id-only index. `StoreAndForward-024`
|
||
(Low) is a stop-time ordering defect: `StopAsync` disposes the timer but a
|
||
mid-flight `RetryPendingMessagesAsync` invocation continues using `_storage` and
|
||
`_replication` after `StopAsync` returns; downstream resources disposed by the host
|
||
shutdown sequence (the DI container) can then NRE through the still-running sweep.
|
||
|
||
## Checklist coverage — Re-review 2026-05-28
|
||
|
||
| # | Category | Examined | Notes |
|
||
|---|----------|----------|-------|
|
||
| 1 | Correctness & logic bugs | ☑ | Notification corrupt-payload parks contrary to design (018); RetryParkedMessageAsync skips replication when message reload races a deletion (020). |
|
||
| 2 | Akka.NET conventions | ☑ | `ParkedMessageHandlerActor` uses `PipeTo` correctly with success/failure projections (007 resolution preserved). No new findings. |
|
||
| 3 | Concurrency & thread safety | ☑ | Sweep-vs-stop race: a timer callback running while `StopAsync` returns can touch disposed dependencies (024). |
|
||
| 4 | Error handling & resilience | ☑ | Notifications park after `DefaultMaxRetries` exhaustion (019) — contradicts the design doc's "retried until central acks". |
|
||
| 5 | Security | ☑ | No issues found — parameterised SQL throughout, payload JSON opaque, no secret material handled. |
|
||
| 6 | Performance & resource management | ☑ | No new findings — the connection-per-call documented trade-off and pooled `OpenAsync` remain acceptable. |
|
||
| 7 | Design-document adherence | ☑ | Operation Tracking Table documented in StoreAndForward but actually lives in SiteRuntime (021); notification non-parking guarantee broken by 018 + 019. |
|
||
| 8 | Code organization & conventions | ☑ | `IStoreAndForwardSiteContext` silently defaults `SiteId` to empty (023) — a configuration hole rather than an entity placement issue. |
|
||
| 9 | Testing coverage | ☑ | The seven new findings have no regression tests in `tests/ScadaLink.StoreAndForward.Tests/` — particularly the notification-doesn't-park invariant (018, 019), the requeue-after-reload-null replication gap (020), and the stop-during-sweep behaviour (024). |
|
||
| 10 | Documentation & comments | ☑ | `CachedCallAttemptOutcome.ParkedMaxRetries` XML doc says "S&F semantics" but the code applies it to notifications too if 018/019 fire — minor drift, captured under 018. The `TrackedOperationId.TryParse` silent-skip behaviour in `NotifyCachedCallObserverAsync` is documented in the source but not on the public observer contract (022). |
|
||
|
||
## Checklist coverage
|
||
|
||
| # | Category | Examined | Notes |
|
||
|---|----------|----------|-------|
|
||
| 1 | Correctness & logic bugs | ☑ | Off-by-one in retry counting (003); parked-message retry timing (010); Retry/Discard activity log hard-codes the category (017). |
|
||
| 2 | Akka.NET conventions | ☑ | `ContinueWith` used instead of `PipeTo`-friendly continuations; default supervision; see 007. |
|
||
| 3 | Concurrency & thread safety | ☑ | Sweep guarded by `Interlocked`, but no guard against retry-vs-manage races (005); `OnActivity` event not thread-safe (009). |
|
||
| 4 | Error handling & resilience | ☑ | Replication never invoked from active path (001); no-handler messages buffered then stuck (002); operator retry/discard not replicated to standby (016). |
|
||
| 5 | Security | ☑ | No issues found — parameterised SQL throughout; no secrets handled directly; payload JSON treated opaquely. |
|
||
| 6 | Performance & resource management | ☑ | New SQLite connection per call; multi-statement operations not wrapped in a transaction (006, 008). |
|
||
| 7 | Design-document adherence | ☑ | Replication gap (001); `InFlight` status undocumented/unused (011); "retrying" status from design doc not modelled. |
|
||
| 8 | Code organization & conventions | ☑ | `StoreAndForwardMessage` is an entity-like POCO living in the component, not Commons (012). |
|
||
| 9 | Testing coverage | ☑ | Retry-due timing, replication-from-active, and `ParkedMessageHandlerActor` are untested (013). |
|
||
| 10 | Documentation & comments | ☑ | XML doc on `RegisterDeliveryHandler` contract is inconsistent with code (004); `EnqueueAsync` never documents the `maxRetries == 0` = "retry forever" special case (015). |
|
||
|
||
## Findings
|
||
|
||
### StoreAndForward-001 — Replication to standby is never triggered by the active node
|
||
|
||
| | |
|
||
|--|--|
|
||
| Severity | Critical |
|
||
| Category | Error handling & resilience |
|
||
| Status | Resolved |
|
||
| Location | `src/ScadaLink.StoreAndForward/ReplicationService.cs:40`, `:53`, `:66`; `src/ScadaLink.StoreAndForward/StoreAndForwardService.cs:155`, `:212`, `:222`, `:236` |
|
||
|
||
**Description**
|
||
|
||
`ReplicationService` exposes `ReplicateEnqueue`, `ReplicateRemove` and `ReplicatePark`
|
||
to forward buffer operations to the standby node, but a codebase-wide search shows these
|
||
methods have no callers. `StoreAndForwardService` — which performs every add (`EnqueueAsync`
|
||
line 155 / 163), remove (`RemoveMessageAsync` call at line 212) and park
|
||
(`UpdateMessageAsync` calls at lines 222/236) — holds no reference to `ReplicationService`
|
||
and never invokes it. Only the receiving half is wired (`SetReplicationHandler` and
|
||
`ApplyReplicatedOperationAsync` are used by `SiteReplicationActor`). The Component design
|
||
doc ("Persistence") and CLAUDE.md ("Store-and-Forward") require the active node to
|
||
forward every buffer operation to the standby so that, on failover, the new active node
|
||
"has a near-complete copy of the buffer." As written, the standby's S&F SQLite database
|
||
stays empty and a failover loses the entire buffer — a data-loss defect against a core
|
||
requirement.
|
||
|
||
**Recommendation**
|
||
|
||
Inject `ReplicationService` into `StoreAndForwardService` and call `ReplicateEnqueue`
|
||
after a successful `_storage.EnqueueAsync`, `ReplicateRemove` after `RemoveMessageAsync`,
|
||
and `ReplicatePark` after a park-causing `UpdateMessageAsync`. Update
|
||
`ServiceCollectionExtensions.AddStoreAndForward` to pass the dependency. Add a test that
|
||
asserts the replication handler observes each operation type.
|
||
|
||
**Resolution**
|
||
|
||
Resolved 2026-05-16. `ReplicationService` is now injected into `StoreAndForwardService`
|
||
(wired in `AddStoreAndForward`), and every buffer operation is forwarded to the standby:
|
||
a new `BufferAsync` helper calls `ReplicateEnqueue` after each persist, `ReplicateRemove`
|
||
runs after a successful retry removes a message, and `ReplicatePark` runs on both park
|
||
paths. Replication stays fire-and-forget and is a no-op when `ReplicationEnabled` is
|
||
false or no handler is wired. Regression tests `StoreAndForwardReplicationTests` assert
|
||
the replication handler observes the Add, Remove and Park operations. Fixed by the
|
||
commit whose message references `StoreAndForward-001`.
|
||
|
||
### StoreAndForward-002 — Messages enqueued with no registered handler are buffered but never deliverable
|
||
|
||
| | |
|
||
|--|--|
|
||
| Severity | Low |
|
||
| Original severity | High (re-triaged down to Low on 2026-05-16 — see Re-triage note) |
|
||
| Category | Error handling & resilience |
|
||
| Status | Deferred |
|
||
| Location | `src/ScadaLink.StoreAndForward/StoreAndForwardService.cs:162`, `:201` |
|
||
|
||
**Description**
|
||
|
||
`EnqueueAsync` falls through to "No handler registered — buffer for later" (line 162)
|
||
when no delivery handler is registered for the category. The retry sweep
|
||
(`RetryMessageAsync`, line 201) then logs "No delivery handler for category" and
|
||
`return`s without touching the message. No caller in the codebase ever calls
|
||
`RegisterDeliveryHandler` (the External System Gateway, Notification Service and
|
||
Database Gateway only call `EnqueueAsync`), so in the current wiring **every** buffered
|
||
message lands in this dead state: it is persisted, counts toward buffer depth, but can
|
||
never be retried, delivered or parked. It will sit Pending forever. Either the handler
|
||
registration is missing from Host/gateway startup, or the "buffer for later" path is a
|
||
silent trap. Either way the engine cannot deliver anything.
|
||
|
||
**Recommendation**
|
||
|
||
Decide the intended contract. If handlers are expected to be registered before
|
||
`EnqueueAsync` is reachable, make `EnqueueAsync` reject (or log an error) when no
|
||
handler exists rather than silently buffering an undeliverable message, and wire
|
||
`RegisterDeliveryHandler` calls in Host startup for all three categories. If late
|
||
registration is intended, the retry sweep should treat a still-missing handler as a
|
||
transient condition with bounded logging rather than a permanent no-op.
|
||
|
||
**Re-triage note (2026-05-16)**
|
||
|
||
The finding's central factual claim — *"No caller in the codebase ever calls
|
||
`RegisterDeliveryHandler`"* and therefore *"every buffered message lands in this dead
|
||
state"* — is **no longer true at the reviewed code**. `ScadaLink.Host`
|
||
(`AkkaHostedService.RegisterSiteActors`, `AkkaHostedService.cs:353-379`) registers all
|
||
three delivery handlers (`ExternalSystem`, `CachedDbWrite`, `Notification`) at site
|
||
startup, immediately after `StoreAndForwardService.StartAsync()`. The finding was
|
||
written against commit `9c60592` before that wiring existed; the High-severity
|
||
"engine cannot deliver anything" outcome no longer occurs.
|
||
|
||
The remaining residual risk is narrow: a message enqueued for a category that genuinely
|
||
has no handler (e.g. an enqueue racing ahead of `RegisterDeliveryHandler`, or a future
|
||
category added without a handler) is still buffered and then skipped by the sweep
|
||
forever. That is a real but minor robustness gap, hence the **downgrade to Low**.
|
||
|
||
It is left **Open** rather than fixed in this pass because the finding's recommended
|
||
fix — making `EnqueueAsync` reject when no handler is registered — is a behavioural
|
||
contract change, not a localised bug fix: the "buffer with no handler yet" path is
|
||
exercised by `StoreAndForwardReplicationTests` and by three NotificationService and
|
||
ExternalSystemGateway tests (`Send_TransientError_WithStoreAndForward_BuffersMessage`,
|
||
`Send_Smtp4xxCommandException_ClassifiedTransientAndBuffered`,
|
||
`Send_SmtpProtocolException_ClassifiedTransient`) which construct a real
|
||
`StoreAndForwardService` without registering a handler and assert `WasBuffered`.
|
||
Changing the contract requires deciding whether late handler registration is supported
|
||
and updating tests in modules outside this review's edit scope — a design decision that
|
||
should be made deliberately rather than forced here.
|
||
|
||
**Resolution**
|
||
|
||
_Deferred 2026-05-16 (re-triaged High → Low). Verified again against the source in this
|
||
pass: the finding's premise (no `RegisterDeliveryHandler` caller anywhere) is stale —
|
||
`ScadaLink.Host` wires all three handlers at site startup — so the High-severity
|
||
"engine cannot deliver anything" outcome no longer occurs. The residual gap (a message
|
||
enqueued for a category that genuinely has no handler is buffered then skipped forever)
|
||
is real but minor. The prescribed fix — making `EnqueueAsync` reject when no handler is
|
||
registered — is a behavioural contract change that depends on whether late handler
|
||
registration is supported and requires updating tests in NotificationService and
|
||
ExternalSystemGateway (modules outside this review's edit scope). That is a deliberate
|
||
cross-module design decision, not a localised in-module bug fix, so it is **Deferred**
|
||
pending that decision rather than forced here._
|
||
|
||
### StoreAndForward-003 — Off-by-one in retry accounting: immediate failure pre-counts as retry 1
|
||
|
||
| | |
|
||
|--|--|
|
||
| Severity | High |
|
||
| Category | Correctness & logic bugs |
|
||
| Status | Resolved |
|
||
| Location | `src/ScadaLink.StoreAndForward/StoreAndForwardService.cs:153`, `:229`, `:233` |
|
||
|
||
**Description**
|
||
|
||
On a transient immediate-delivery failure, `EnqueueAsync` buffers the message with
|
||
`message.RetryCount = 1` (line 153). The retry sweep then increments `RetryCount` before
|
||
the max check (`RetryCount++` at line 229; `RetryCount >= MaxRetries` at line 233).
|
||
Consequences: (1) a message configured with `MaxRetries = 1` is parked on the *first*
|
||
retry sweep without ever being retried, because after the immediate attempt `RetryCount`
|
||
is already 1 and the first sweep makes it 2 ≥ 1 — zero actual retries occur, contradicting
|
||
the design intent that the immediate attempt and the retry budget are distinct;
|
||
(2) the design doc's `Retry Count` field is "Number of attempts so far," but here it is
|
||
seeded to 1 before any *retry* has happened, making the parked-message `AttemptCount`
|
||
shown to operators off by one relative to configured `MaxRetries`. The
|
||
`EnqueueAsync_TransientFailure_BuffersForRetry` test even asserts `RetryCount == 1`,
|
||
locking in the ambiguity.
|
||
|
||
**Recommendation**
|
||
|
||
Choose one consistent meaning for `RetryCount` (recommended: total delivery attempts,
|
||
including the immediate one) and apply it uniformly. If `MaxRetries` is meant to bound
|
||
*retries* after the immediate attempt, buffer with `RetryCount = 0` and treat the
|
||
immediate failure as attempt 0; if it bounds *total attempts*, document that and adjust
|
||
the comparison. Update the affected test to match the chosen semantics.
|
||
|
||
**Resolution**
|
||
|
||
Resolved 2026-05-16 (commit `<pending>`). `RetryCount` now consistently means "number
|
||
of background retry-sweep attempts so far"; the initial immediate (or caller-made)
|
||
delivery attempt is attempt 0 and is not counted, and `MaxRetries` bounds retry-sweep
|
||
attempts after that initial attempt. `EnqueueAsync` no longer seeds `RetryCount = 1` on
|
||
either the transient-immediate-failure path or the `attemptImmediateDelivery: false`
|
||
path — a freshly buffered message has `RetryCount = 0`. `RetryMessageAsync` already
|
||
increments before the `>= MaxRetries` check, which is now correct, so a message with
|
||
`MaxRetries = 1` gets exactly one real retry before parking (previously zero). The
|
||
`StoreAndForwardMessage.RetryCount` XML doc was corrected to match. Regression test
|
||
`RetryPendingMessagesAsync_MaxRetriesOne_PerformsExactlyOneRetryBeforeParking` asserts
|
||
the immediate attempt plus exactly one retry occur before parking; the affected
|
||
existing tests (`EnqueueAsync_TransientFailure_BuffersForRetry`,
|
||
`EnqueueAsync_AttemptImmediateDeliveryFalse_BuffersWithoutInvokingHandler`,
|
||
`RetryPendingMessagesAsync_MaxRetriesReached_ParksMessage`) were updated to the
|
||
corrected semantics.
|
||
|
||
### StoreAndForward-004 — `RegisterDeliveryHandler` XML doc contradicts the implemented contract
|
||
|
||
| | |
|
||
|--|--|
|
||
| Severity | Medium |
|
||
| Category | Documentation & comments |
|
||
| Status | Resolved |
|
||
| Location | `src/ScadaLink.StoreAndForward/StoreAndForwardService.cs:38`, `:60` |
|
||
|
||
**Description**
|
||
|
||
The XML comment on the handler delegate (lines 37–40) says "Returns true on success,
|
||
throws on transient failure. Permanent failures should return false (message will NOT
|
||
be buffered)." That last clause is wrong for the retry path: in `RetryMessageAsync`,
|
||
a handler returning `false` does not "not buffer" — the message is already buffered, and
|
||
the code *parks* it immediately (lines 218–224). The comment describes only the
|
||
`EnqueueAsync` immediate path and misleads anyone implementing a handler about what
|
||
`false` means once a message is in the retry loop.
|
||
|
||
**Recommendation**
|
||
|
||
Reword the contract to cover both paths explicitly: `true` = delivered (remove from
|
||
buffer); `false` = permanent failure (not buffered on immediate attempt, parked on a
|
||
retry); exception = transient failure (buffer / increment retry).
|
||
|
||
**Resolution**
|
||
|
||
Resolved 2026-05-16 (commit pending). Confirmed the root cause against the source: the
|
||
old XML comment's "Permanent failures should return false (message will NOT be
|
||
buffered)" describes only the `EnqueueAsync` immediate path; on the retry path a
|
||
`false` return parks the already-buffered message. Reworded the `_deliveryHandlers`
|
||
field doc into an explicit three-way contract (`true` = delivered/removed-or-never-
|
||
buffered; `false` = permanent failure = not-buffered-on-immediate / parked-on-retry;
|
||
throws = transient = buffered-for-retry / retry-count-incremented), noting it applies
|
||
identically on both paths, and pointed `RegisterDeliveryHandler`'s summary at it.
|
||
Documentation-only change — no behavioural code touched, so no regression test
|
||
(an XML comment is not test-observable).
|
||
|
||
### StoreAndForward-005 — Parked-message retry/discard can race with the in-progress retry sweep
|
||
|
||
| | |
|
||
|--|--|
|
||
| Severity | Low — re-triaged down from Medium on 2026-05-16; the described data-loss race is not actually reachable, see Re-triage note |
|
||
| Original severity | Medium |
|
||
| Category | Concurrency & thread safety |
|
||
| Status | Resolved |
|
||
| Location | `src/ScadaLink.StoreAndForward/StoreAndForwardService.cs:184`, `:266`, `:280` |
|
||
|
||
**Description**
|
||
|
||
`RetryPendingMessagesAsync` loads a snapshot of due messages (line 179) and then
|
||
processes them one by one (line 184), `await`-ing delivery for each. Meanwhile
|
||
`RetryParkedMessageAsync` / `DiscardParkedMessageAsync` (operator actions arriving via
|
||
`ParkedMessageHandlerActor`) run on unrelated threads and mutate the same rows. Because
|
||
each operation opens its own SQLite connection and there is no row-level coordination,
|
||
an operator can `DiscardParkedMessageAsync` a message that the sweep is concurrently
|
||
delivering: the sweep's later `RemoveMessageAsync`/`UpdateMessageAsync` operates on a
|
||
now-deleted row (harmless) — but if an operator `RetryParkedMessageAsync` resets a row
|
||
to Pending while the sweep simultaneously parks the same in-flight message, the operator
|
||
intent is silently overwritten. The `Interlocked` guard only prevents *overlapping
|
||
sweeps*, not sweep-vs-management races.
|
||
|
||
**Recommendation**
|
||
|
||
Funnel all message-state mutations through a single serialization point — e.g. perform
|
||
all S&F state changes inside the `ParkedMessageHandlerActor` (or a dedicated S&F actor)
|
||
so the actor mailbox serialises them, or make status transitions conditional in SQL
|
||
(e.g. `UPDATE ... WHERE id = @id AND status = @expected`) and re-check the affected
|
||
row count.
|
||
|
||
**Re-triage note (2026-05-16)**
|
||
|
||
Verified against the source: the *specific* race the finding describes — *"an operator
|
||
`RetryParkedMessageAsync` resets a row to Pending while the sweep simultaneously parks
|
||
the same in-flight message"* — **cannot occur** at the reviewed code. The two operator
|
||
operations (`StoreAndForwardStorage.RetryParkedMessageAsync` /
|
||
`DiscardParkedMessageAsync`) are already SQL-conditional on `status = Parked`, and the
|
||
retry sweep only ever loads and processes rows that are `Pending`. For the operator and
|
||
the sweep to touch the *same* message the row would have to be simultaneously `Parked`
|
||
(operator-actionable) and in the sweep's in-flight `Pending` snapshot — mutually
|
||
exclusive states. Overlapping sweeps are additionally prevented by the `Interlocked`
|
||
guard, so there is only ever one sweep writer. The described data-loss outcome is
|
||
therefore not reachable, which makes the original Medium ("risky behaviour") severity
|
||
an over-statement — **re-triaged to Low**.
|
||
|
||
What *is* real is a latent fragility: the sweep's own state-changing writes used the
|
||
unconditional `UpdateMessageAsync` (`WHERE id = @id`), so the no-clobber guarantee
|
||
rested entirely on the `Interlocked` invariant with zero defence-in-depth if that
|
||
invariant were ever broken (e.g. `RetryMessageAsync` called outside the sweep, or the
|
||
guard removed). That residual Low-severity weakness is worth closing, so the
|
||
recommendation's conditional-SQL option was applied.
|
||
|
||
**Resolution**
|
||
|
||
Resolved 2026-05-16 (commit pending). Re-triaged Medium → Low (the described race is
|
||
unreachable; see Re-triage note) and the residual latent fragility fixed: added
|
||
`StoreAndForwardStorage.UpdateMessageIfStatusAsync`, a conditional update
|
||
(`UPDATE ... WHERE id = @id AND status = @expectedStatus`) returning whether a row was
|
||
written. `RetryMessageAsync`'s three state-changing writes (park-on-permanent-failure,
|
||
park-on-max-retries, retry-count increment) now use it with `expectedStatus = Pending`,
|
||
so the sweep can never overwrite a row whose status changed underneath it; a skipped
|
||
write is logged and the message left for the next sweep. Regression test
|
||
`RetryMessageAsync_StatusChangedDuringDelivery_SweepParkWriteIsSkipped` drives a writer
|
||
that moves the row out of `Pending` mid-delivery and asserts the sweep's stale park
|
||
write is skipped (it failed against the pre-fix unconditional update, clobbering the
|
||
other writer's `RetryCount`).
|
||
|
||
### StoreAndForward-006 — `GetParkedMessagesAsync` count and page run without a transaction
|
||
|
||
| | |
|
||
|--|--|
|
||
| Severity | Low |
|
||
| Category | Performance & resource management |
|
||
| Status | Resolved |
|
||
| Location | `src/ScadaLink.StoreAndForward/StoreAndForwardStorage.cs:166`, `:175` |
|
||
|
||
**Description**
|
||
|
||
`GetParkedMessagesAsync` issues a `COUNT(*)` and then a separate paged `SELECT` on two
|
||
commands on the same connection with no surrounding transaction. A concurrent
|
||
enqueue/park/discard between the two statements yields a `TotalCount` inconsistent with
|
||
the returned page (e.g. total reported as 51 while only 50 distinct parked rows now
|
||
exist, or a row visible in the page but excluded from the count). For a paginated UI
|
||
this produces flickering totals and occasional off-by-one page math.
|
||
|
||
**Recommendation**
|
||
|
||
Wrap both reads in a single transaction (`BeginTransaction`) so they see a consistent
|
||
snapshot, or accept the staleness and document it. A transaction is cheap here and
|
||
removes the inconsistency.
|
||
|
||
**Resolution**
|
||
|
||
Resolved 2026-05-16 (commit pending). Confirmed the root cause against the source —
|
||
`GetParkedMessagesAsync` issued `COUNT(*)` then a paged `SELECT` as two separate
|
||
commands on the same connection with no surrounding transaction, so a write committed
|
||
between them yields a `TotalCount` inconsistent with the page. Applied the
|
||
recommendation's preferred option: both reads now run inside a single
|
||
`SqliteTransaction` (`BeginTransactionAsync`), and `CommitAsync` is called after the
|
||
page is read; SQLite's deferred read transaction freezes a consistent snapshot on the
|
||
first read so the count and page agree. Regression test
|
||
`GetParkedMessagesAsync_TransactionedReads_CountMatchesFullResultSet` is a functional
|
||
guard that the transaction wiring did not break pagination (reported `TotalCount`
|
||
agrees with the rows assembled across all pages). Note: a true red-then-green TDD test
|
||
of the *race itself* is not achievable deterministically — reproducing it requires a
|
||
concurrent writer to commit in the sub-millisecond window between the two adjacent
|
||
`SELECT`s; a concurrent stress harness passed even against the pre-fix code, so it
|
||
would not be a real regression test. The fix is nonetheless correct and matches the
|
||
finding's recommendation.
|
||
|
||
### StoreAndForward-007 — Async work in `ParkedMessageHandlerActor` uses `ContinueWith` without scheduler/affinity guarantees
|
||
|
||
| | |
|
||
|--|--|
|
||
| Severity | Low |
|
||
| Category | Akka.NET conventions |
|
||
| Status | Resolved |
|
||
| Location | `src/ScadaLink.StoreAndForward/ParkedMessageHandlerActor.cs:34`, `:68`, `:87` |
|
||
|
||
**Description**
|
||
|
||
The three handlers call a `Task`-returning service method and chain `.ContinueWith(...)
|
||
.PipeTo(sender)`. `Sender` is correctly captured into a local first, so the closure is
|
||
safe. However `ContinueWith` without an explicit `TaskScheduler` runs the continuation
|
||
on a thread-pool thread and the captured continuation builds the response objects there
|
||
— acceptable since it only touches locals, but it bypasses the idiomatic
|
||
`PipeTo`-with-success/failure-projection pattern and is fragile if someone later adds a
|
||
line touching actor state inside the continuation. There is also no `TaskContinuationOptions`,
|
||
so a faulted antecedent still runs the continuation (handled here via `IsCompletedSuccessfully`,
|
||
but only by convention).
|
||
|
||
**Recommendation**
|
||
|
||
Replace `ContinueWith(...).PipeTo(sender)` with `PipeTo(sender, success: result => ...,
|
||
failure: ex => ...)`, which is the documented Akka pattern, keeps response construction
|
||
off the actor thread safely, and makes the success/failure branches explicit.
|
||
|
||
**Resolution**
|
||
|
||
Resolved 2026-05-16 (commit pending). Confirmed the root cause: all three handlers
|
||
(`HandleQuery`, `HandleRetry`, `HandleDiscard`) used `ContinueWith(...).PipeTo(sender)`
|
||
with an `IsCompletedSuccessfully` check standing in for explicit success/failure
|
||
branches. Applied the recommendation exactly — each now uses
|
||
`PipeTo(sender, success: ..., failure: ...)`, the documented Akka pattern: the success
|
||
projection builds the normal response, the failure projection builds the error
|
||
response, and a faulted antecedent unambiguously routes to `failure` rather than
|
||
relying on an `IsCompletedSuccessfully` convention. `Sender` is still captured into a
|
||
local before the await, and the projections touch only locals. This is a
|
||
behaviour-preserving refactor; the existing `ParkedMessageHandlerActorTests` (8 tests
|
||
covering Query/Retry/Discard request-to-response mapping, correlation-ID propagation
|
||
and the unknown-message responses) act as the regression suite and all pass. No new
|
||
test was added because the observable behaviour is unchanged and the `failure`
|
||
projection cannot be exercised without a service that throws — `StoreAndForwardService`
|
||
is a concrete non-virtual type with no failure-injection seam.
|
||
|
||
### StoreAndForward-008 — A SQLite connection is opened and torn down on every storage call
|
||
|
||
| | |
|
||
|--|--|
|
||
| Severity | Low |
|
||
| Category | Performance & resource management |
|
||
| Status | Resolved |
|
||
| Location | `src/ScadaLink.StoreAndForward/StoreAndForwardStorage.cs:28`, `:61`, `:93`, `:117`, `:144`, `:162`, `:199`, `:221`, `:237`, `:267`, `:285`, `:305`, `:319` |
|
||
|
||
**Description**
|
||
|
||
Every method in `StoreAndForwardStorage` constructs a fresh `SqliteConnection` and calls
|
||
`OpenAsync`. Microsoft.Data.Sqlite pools connections, so this is not a correctness bug,
|
||
but a retry sweep over a large buffer performs one open per `UpdateMessageAsync`/
|
||
`RemoveMessageAsync` call inside the loop (`RetryMessageAsync`), multiplying connection
|
||
churn under load. With no max buffer size (by design) the buffer can grow large, so the
|
||
per-message connection acquisition is a measurable overhead on the hot retry path.
|
||
|
||
**Recommendation**
|
||
|
||
Consider a batched retry API that opens one connection (and one transaction) per sweep,
|
||
or pass an open connection into the per-message update calls. At minimum, document that
|
||
the design relies on the Sqlite connection pool for acceptable performance.
|
||
|
||
**Resolution**
|
||
|
||
Resolved 2026-05-16 (commit pending). Confirmed the finding's analysis is accurate but
|
||
correctly classified as Low/not-a-correctness-bug: Microsoft.Data.Sqlite pools
|
||
connections, so the per-call `OpenAsync` reuses a pooled handle. Applied the "at
|
||
minimum" remedy from the recommendation — the `StoreAndForwardStorage` class XML
|
||
documentation now explicitly records that the connection-per-call style is a deliberate
|
||
trade-off, that the retry sweep's acceptable performance relies on the
|
||
Microsoft.Data.Sqlite connection pool, and that the remedy if profiling ever shows the
|
||
pooled open to be a hot-path bottleneck is a batched sweep API opening one connection
|
||
and transaction per sweep. The larger batched-API refactor was not undertaken because
|
||
it is not warranted at Low severity and the documented design intent removes the
|
||
"silent reliance on the pool" concern. Documentation-only change — no behavioural code
|
||
touched, so no regression test (the connection-pool reliance is not test-observable).
|
||
|
||
### StoreAndForward-009 — `OnActivity` event invocation is not thread-safe against concurrent subscribe/unsubscribe
|
||
|
||
| | |
|
||
|--|--|
|
||
| Severity | Low |
|
||
| Category | Concurrency & thread safety |
|
||
| Status | Resolved |
|
||
| Location | `src/ScadaLink.StoreAndForward/StoreAndForwardService.cs:46`, `:309` |
|
||
|
||
**Description**
|
||
|
||
`OnActivity` is a public `event Action<...>` raised via `OnActivity?.Invoke(...)` in
|
||
`RaiseActivity` (line 309). `RaiseActivity` is called from both `EnqueueAsync` (caller
|
||
thread) and `RetryMessageAsync` (timer thread). The `?.Invoke` null-conditional captures
|
||
the delegate once so it will not NRE, but there is no synchronisation around the event
|
||
field itself; a subscriber added/removed concurrently with a raise has no defined
|
||
ordering. More importantly, subscriber callbacks run synchronously on the timer thread,
|
||
so a slow or throwing subscriber stalls or aborts the retry sweep (an exception in a
|
||
subscriber propagates out of `RaiseActivity` into `RetryMessageAsync`'s `try` and is
|
||
swallowed as a "transient failure," wrongly incrementing the message's retry count).
|
||
|
||
**Recommendation**
|
||
|
||
Snapshot the delegate (already done) and additionally wrap subscriber invocation in a
|
||
`try/catch` so a faulting logging subscriber cannot be misclassified as a delivery
|
||
failure. Document that handlers must be fast and non-throwing, or dispatch activity
|
||
notifications asynchronously.
|
||
|
||
**Resolution**
|
||
|
||
Resolved 2026-05-16 (commit pending). Confirmed — and found the impact is worse than
|
||
the finding states. `RaiseActivity` previously did `OnActivity?.Invoke(...)`; a
|
||
throwing subscriber's exception escaped it. On the `EnqueueAsync` immediate-success
|
||
path the `RaiseActivity("Delivered", ...)` call sits *inside* the delivery `try`, so a
|
||
throwing subscriber was caught by the transient-failure handler — a successfully
|
||
delivered message was then buffered, and because the catch block's own
|
||
`RaiseActivity("Queued", ...)` also threw, the exception escaped `EnqueueAsync`
|
||
entirely. `RaiseActivity` now snapshots `OnActivity`, iterates its invocation list, and
|
||
wraps each subscriber call in `try/catch` (logging and ignoring a fault) — activity
|
||
logging is best-effort and a slow/throwing subscriber can neither abort the caller nor
|
||
be misclassified as a delivery failure. Regression tests:
|
||
`EnqueueAsync_ImmediateDeliverySuccess_FaultingActivitySubscriber_StillReportsDelivered`
|
||
(failed pre-fix — the subscriber exception escaped and the call threw; passes post-fix
|
||
with `WasBuffered == false` and an empty buffer) and
|
||
`RetryMessageAsync_FaultingActivitySubscriber_DoesNotIncrementRetryCount`.
|
||
|
||
### StoreAndForward-010 — Retry of a parked message does not reset `LastAttemptAt`, so its retry timing is unspecified
|
||
|
||
| | |
|
||
|--|--|
|
||
| Severity | Medium |
|
||
| Category | Correctness & logic bugs |
|
||
| Status | Resolved |
|
||
| Location | `src/ScadaLink.StoreAndForward/StoreAndForwardStorage.cs:203`, `:101` |
|
||
|
||
**Description**
|
||
|
||
`RetryParkedMessageAsync` sets `status = Pending, retry_count = 0, last_error = NULL`
|
||
but leaves `last_attempt_at` unchanged (line 203–206). The retry-due query
|
||
(`GetMessagesForRetryAsync`, line 101–105) selects Pending rows where
|
||
`last_attempt_at IS NULL OR ... elapsed >= retry_interval_ms`. A message parked after
|
||
exhausting retries has an old `last_attempt_at`; once re-queued, the elapsed time since
|
||
that stale timestamp is almost certainly already greater than the retry interval, so the
|
||
operator-retried message is attempted on the very next sweep regardless of the
|
||
configured interval. That is probably the desired behaviour (operator wants it tried
|
||
now), but it is unspecified and inconsistent — if `retry_interval_ms` were very large the
|
||
behaviour would instead be "try immediately" by accident rather than by design.
|
||
|
||
**Recommendation**
|
||
|
||
Explicitly decide and encode the intent: either set `last_attempt_at = NULL` on
|
||
re-queue so the message is unambiguously due now, or set it to "now" so it waits one
|
||
interval. Document the chosen behaviour in the method's XML comment.
|
||
|
||
**Resolution**
|
||
|
||
Resolved 2026-05-16 (commit pending). Confirmed against the source: `RetryParkedMessage
|
||
Async` reset `status`, `retry_count` and `last_error` but left `last_attempt_at` stale,
|
||
so the operator-retried message's retry timing depended on the elapsed time since the
|
||
original pre-park attempt. Encoded the intent explicitly — an operator retry means
|
||
"attempt this again now" — by also setting `last_attempt_at = NULL` in the UPDATE, so
|
||
the re-queued message is unambiguously due on the next sweep regardless of the
|
||
configured `retry_interval_ms`. The method's XML comment now documents this. Regression
|
||
test `RetryParkedMessageAsync_ClearsLastAttemptAt_SoMessageIsImmediatelyDue` uses a
|
||
1-hour retry interval and a recent `last_attempt_at`; it failed pre-fix (timestamp not
|
||
cleared, message excluded from the retry-due set) and passes post-fix.
|
||
|
||
### StoreAndForward-011 — `StoreAndForwardMessageStatus.InFlight` is unused and the doc's "retrying" status is unmodelled
|
||
|
||
| | |
|
||
|--|--|
|
||
| Severity | Low |
|
||
| Category | Design-document adherence |
|
||
| Status | Deferred |
|
||
| Location | `src/ScadaLink.Commons/Types/Enums/StoreAndForwardMessageStatus.cs:9`; `src/ScadaLink.StoreAndForward/StoreAndForwardService.cs:219`, `:235` |
|
||
|
||
**Description**
|
||
|
||
The enum defines `Pending, InFlight, Parked, Delivered`. The module only ever uses
|
||
`Pending` and `Parked` — `InFlight` and `Delivered` are never assigned (delivered
|
||
messages are deleted, not marked `Delivered`). Meanwhile the Component design doc
|
||
("Message Format" -> Status) specifies the set "Pending, retrying, or parked." So the
|
||
code's enum drifts from the doc in two directions: it carries dead members the doc does
|
||
not mention (`InFlight`, `Delivered`) and omits the doc's `retrying` state. A message
|
||
mid-retry is indistinguishable from one that has never been attempted.
|
||
|
||
**Recommendation**
|
||
|
||
Reconcile the enum with the design. Either drop the unused members and update the doc,
|
||
or implement the documented `retrying` state and use `InFlight` to mark a message the
|
||
sweep is actively delivering (which would also help with finding 005).
|
||
|
||
**Resolution**
|
||
|
||
_Deferred 2026-05-16. Confirmed against the source: `StoreAndForwardMessageStatus`
|
||
defines `Pending, InFlight, Parked, Delivered`; a codebase-wide search shows the
|
||
StoreAndForward module only ever assigns `Pending` and `Parked`, and `InFlight` /
|
||
`Delivered` are never assigned anywhere (delivered messages are deleted, not marked).
|
||
The design doc's `retrying` state is unmodelled. Both options the recommendation offers
|
||
— (a) drop the unused `InFlight`/`Delivered` members, or (b) add a `Retrying` member —
|
||
require editing `StoreAndForwardMessageStatus.cs`, which lives in `src/ScadaLink.Commons`
|
||
(outside this review's edit scope: only `src/ScadaLink.StoreAndForward/**` may be
|
||
changed). The enum is also referenced by IntegrationTests and HealthMonitoring tests, so
|
||
removing members is a cross-module change. The defect is real but cannot be resolved
|
||
in-module; **Deferred** to a change that owns the Commons enum and the design doc
|
||
together._
|
||
|
||
### StoreAndForward-012 — `StoreAndForwardMessage` is a persistence entity but lives in the component, not Commons
|
||
|
||
| | |
|
||
|--|--|
|
||
| Severity | Low |
|
||
| Category | Code organization & conventions |
|
||
| Status | Deferred |
|
||
| Location | `src/ScadaLink.StoreAndForward/StoreAndForwardMessage.cs:9` |
|
||
|
||
**Description**
|
||
|
||
`StoreAndForwardMessage` is a persistence-ignorant POCO that maps directly to the
|
||
`sf_messages` table and is also carried across the network inside `ReplicationOperation`
|
||
(replicated to the standby node over Akka remoting). CLAUDE.md "Code Organization" states
|
||
that entity classes are persistence-ignorant POCOs in Commons and that message contracts
|
||
follow additive-only evolution. Because this type doubles as a replication wire contract
|
||
but lives in the component assembly, it is not co-located with the other Commons
|
||
entities and its evolution is not governed by the additive-only message-contract rule.
|
||
This is a borderline case (the type is site-local), but the cross-node use via
|
||
`ReplicationOperation` makes it a de-facto message contract.
|
||
|
||
**Recommendation**
|
||
|
||
Either move `StoreAndForwardMessage` (and `ReplicationOperation`) into the Commons
|
||
`Entities`/`Messages` hierarchy so they are governed by the contract-evolution rules, or
|
||
introduce a separate DTO for replication and keep `StoreAndForwardMessage` purely as the
|
||
local persistence model. Document the decision.
|
||
|
||
**Resolution**
|
||
|
||
_Deferred 2026-05-16. Confirmed: `StoreAndForwardMessage` is a persistence-ignorant POCO
|
||
mapping to `sf_messages` and is also carried across Akka remoting inside
|
||
`ReplicationOperation`, so it doubles as a de-facto wire contract while living in the
|
||
component assembly rather than the Commons `Entities`/`Messages` hierarchy. The
|
||
recommendation's primary remedy — moving `StoreAndForwardMessage` (and
|
||
`ReplicationOperation`) into Commons — crosses module boundaries (it would add a type to
|
||
`src/ScadaLink.Commons`, outside this review's edit scope of
|
||
`src/ScadaLink.StoreAndForward/**`, and change every referencing module). The alternative
|
||
"separate replication DTO" still leaves the persistence entity in the component, so it
|
||
does not actually resolve the finding's core concern (entity placement / contract-
|
||
evolution governance). Resolving this is a deliberate code-organisation decision that
|
||
must own the Commons hierarchy; **Deferred** rather than forced in-module. Flagged for a
|
||
cross-module follow-up._
|
||
|
||
### StoreAndForward-013 — Critical paths lack test coverage: retry-due timing, replication-from-active, and the actor bridge
|
||
|
||
| | |
|
||
|--|--|
|
||
| Severity | Medium |
|
||
| Category | Testing coverage |
|
||
| Status | Resolved |
|
||
| Location | `tests/ScadaLink.StoreAndForward.Tests/` (whole directory); `src/ScadaLink.StoreAndForward/StoreAndForwardStorage.cs:101`; `src/ScadaLink.StoreAndForward/ParkedMessageHandlerActor.cs` |
|
||
|
||
**Description**
|
||
|
||
The existing tests cover storage CRUD and the service happy/failure paths well, but
|
||
three important behaviours are untested: (1) the retry-due time filter in
|
||
`GetMessagesForRetryAsync` — every service test sets `DefaultRetryInterval = TimeSpan.Zero`,
|
||
so the `julianday` elapsed-time comparison (the most error-prone SQL in the module) is
|
||
never exercised with a non-zero interval; a message that is *not yet due* should be
|
||
skipped, and that is never verified. (2) Replication from the active side — no test
|
||
asserts that an enqueue/remove/park causes a `Replicate*` call (this is exactly the gap
|
||
behind finding 001; a test would have caught it). (3) `ParkedMessageHandlerActor` has no
|
||
test at all — the Query/Retry/Discard request-to-response mapping and the
|
||
`ExtractMethodName` JSON parsing are unverified, including the malformed-JSON branch.
|
||
|
||
**Recommendation**
|
||
|
||
Add tests for: a non-zero retry interval where a recently-attempted message is excluded
|
||
and an older one is included; active-side replication invocation per operation type
|
||
(once finding 001 is fixed); and `ParkedMessageHandlerActor` using `Akka.TestKit`,
|
||
including `ExtractMethodName` for `MethodName`, `Subject`, missing-property and
|
||
invalid-JSON payloads.
|
||
|
||
**Resolution**
|
||
|
||
Resolved 2026-05-16 (commit pending). All three gaps closed.
|
||
(1) Retry-due timing: `GetMessagesForRetryAsync_NonZeroInterval_ExcludesNotYetDue
|
||
IncludesDue` exercises the `julianday` elapsed-time SQL with non-zero intervals — a
|
||
just-attempted message (1-hour interval) is excluded, an old one (attempted 2h ago,
|
||
5-min interval) and a never-attempted one are included.
|
||
(2) Active-side replication: this sub-claim was **already stale** at the reviewed code —
|
||
`StoreAndForwardReplicationTests` (added with finding 001's fix) asserts the Add, Remove
|
||
and Park replication operations; no new test was needed.
|
||
(3) `ParkedMessageHandlerActor`: new `ParkedMessageHandlerActorTests` using
|
||
`Akka.TestKit.Xunit2` (package reference added to the test project) covers Query/Retry/
|
||
Discard request-to-response mapping, correlation-ID propagation, the unknown-message
|
||
failure responses, and `ExtractMethodName` for `MethodName`, `Subject`, missing-property
|
||
and malformed-JSON payloads (all falling back to the category name without throwing).
|
||
These are coverage-gap tests over already-correct code, so they pass on first run.
|
||
|
||
### StoreAndForward-014 — Storage does not create its SQLite database directory
|
||
|
||
| | |
|
||
|--|--|
|
||
| Severity | Medium |
|
||
| Category | Error handling & resilience |
|
||
| Status | Resolved |
|
||
| Location | `src/ScadaLink.StoreAndForward/StoreAndForwardStorage.cs:26` |
|
||
|
||
**Found 2026-05-16** while verifying the store-and-forward fixes — this defect was
|
||
not part of the original baseline review.
|
||
|
||
**Description**
|
||
|
||
`StoreAndForwardStorage.InitializeAsync` opened a `SqliteConnection` against the
|
||
configured `SqliteDbPath` (default `./data/store-and-forward.db`) without ensuring the
|
||
parent directory exists. SQLite creates the database *file* on demand but not its
|
||
*directory*, so when `data/` does not already exist the connection fails to open with
|
||
`SQLite Error 14: 'unable to open database file'`. Every site-host boot therefore failed
|
||
in any environment whose working directory has no `data/` folder — the cause of the six
|
||
failing `SiteActorPathTests` (the host's `RegisterSiteActors` aborts at
|
||
`StoreAndForwardService.StartAsync`). Production masked it because `data/` is created by
|
||
the Docker image / deployment.
|
||
|
||
**Recommendation**
|
||
|
||
Create the parent directory of a file-backed SQLite database before opening it.
|
||
|
||
**Resolution**
|
||
|
||
Resolved 2026-05-16. `InitializeAsync` now calls a new `EnsureDatabaseDirectoryExists`
|
||
helper that parses the connection string with `SqliteConnectionStringBuilder` and, for a
|
||
file-backed database, creates the parent directory if it is missing (in-memory databases
|
||
and bare filenames are skipped). Regression test
|
||
`InitializeAsync_FileInMissingDirectory_CreatesDirectory` fails against the pre-fix code;
|
||
all six `SiteActorPathTests` now pass. Fixed by the commit whose message references
|
||
`StoreAndForward-014`.
|
||
|
||
### StoreAndForward-015 — `EnqueueAsync`'s public contract never documents that `maxRetries == 0` means "retry forever"
|
||
|
||
| | |
|
||
|--|--|
|
||
| Severity | Medium |
|
||
| Category | Documentation & comments |
|
||
| Status | Resolved |
|
||
| Location | `src/ScadaLink.StoreAndForward/StoreAndForwardService.cs:114`–`:130`, `:285` |
|
||
|
||
**Description**
|
||
|
||
The re-review brief asks for the StoreAndForward side of the cross-module ambiguity
|
||
recorded as `ExternalSystemGateway-015`. The semantics are split across this module and
|
||
its callers, and the StoreAndForward side carries a genuine documentation/API-contract
|
||
fault:
|
||
|
||
- `RetryMessageAsync` parks a message only when `message.MaxRetries > 0 && message.RetryCount >= message.MaxRetries`
|
||
(`StoreAndForwardService.cs:285`). When `MaxRetries == 0` the guard is false on every
|
||
sweep, so a `0` value means **"no limit — retry forever"**. The
|
||
`StoreAndForwardMessage.MaxRetries` XML doc (`StoreAndForwardMessage.cs:31`) does state
|
||
`"0 = no limit"`, so the persistence model is internally consistent.
|
||
- But `EnqueueAsync` — the *only* public entry point into the engine — exposes a
|
||
`maxRetries` parameter (`StoreAndForwardService.cs:128`) with **no parameter
|
||
documentation at all**, and its method summary (lines 114–122) describes the lifecycle
|
||
only as "On max retries → park" / "parked once `MaxRetries` is reached" (see also the
|
||
`_deliveryHandlers` field doc, line 50–51). Nothing on the public surface tells a
|
||
caller that passing `0` flips the meaning from "park immediately / never retry" to
|
||
"retry forever". A caller reading only `EnqueueAsync` would reasonably assume `0`
|
||
retries means zero retries.
|
||
- This is exactly the trap ESG fell into: `ExternalSystemClient.CachedCallAsync` /
|
||
`DatabaseGateway.CachedWriteAsync` pass the source entity's `MaxRetries` verbatim,
|
||
intending `0` to mean "never retry", and instead get unbounded retry — the
|
||
duplicate-delivery / unbounded-buffer-growth hazard the design doc's idempotency note
|
||
warns against. The fault is not solely ESG's: the S&F public API silently overloads
|
||
`0` with the opposite of its natural reading and does not document it.
|
||
|
||
The defect is in this module's API contract and documentation, so it is recorded here
|
||
in addition to `ExternalSystemGateway-015`. (Whether `0` *should* mean "no limit" or
|
||
"no retry" is the cross-module design decision tracked by ESG-015; this finding is
|
||
specifically that the StoreAndForward public surface fails to document whichever
|
||
meaning is chosen.)
|
||
|
||
**Recommendation**
|
||
|
||
Document the `maxRetries` parameter on `EnqueueAsync` explicitly with a `<param>` tag
|
||
that states the `0` special case in the same words as `StoreAndForwardMessage.MaxRetries`
|
||
(`"0 = no limit — retried on every sweep until delivered, never parked"`), and add the
|
||
`0` case to the method summary's lifecycle description. Better still — and consistent
|
||
with the resolution of ESG-015 — make the engine reject the ambiguity at the API: accept
|
||
a nullable/`enum` retry policy, or treat `0` as an explicit "no retry" (do not buffer, or
|
||
park on the first sweep) so the natural reading and the behaviour agree. Either way the
|
||
public `EnqueueAsync` contract must state the chosen meaning; today it states nothing.
|
||
|
||
**Resolution**
|
||
|
||
Resolved 2026-05-17. Documentation-only fix — retry semantics confirmed correct and
|
||
left unchanged. Root cause verified against the source: `EnqueueAsync`'s `maxRetries`
|
||
parameter had no `<param>` documentation and the method/class summaries described only
|
||
the "park on max retries" path, never the `0 = no limit / retry forever` special case
|
||
that `RetryMessageAsync`'s `MaxRetries > 0` guard actually enforces. Added an explicit
|
||
`<param>` tag for every `EnqueueAsync` parameter — `maxRetries` now states in bold that
|
||
`0` means "no limit, never parked for retry exhaustion" and is **not** a "never retry"
|
||
value — extended the method summary with a retry-count lifecycle paragraph, updated the
|
||
class-level lifecycle bullet, and tightened the `StoreAndForwardMessage.MaxRetries` field
|
||
doc to the same wording. No behavioural code touched; an XML comment is not
|
||
test-observable so no regression test was added.
|
||
|
||
### StoreAndForward-016 — Operator-initiated parked-message retry and discard are not replicated to the standby
|
||
|
||
| | |
|
||
|--|--|
|
||
| Severity | Medium |
|
||
| Category | Error handling & resilience |
|
||
| Status | Resolved |
|
||
| Location | `src/ScadaLink.StoreAndForward/StoreAndForwardService.cs:339`–`:362`; `src/ScadaLink.StoreAndForward/ReplicationService.cs:131`–`:136` |
|
||
|
||
**Description**
|
||
|
||
`StoreAndForward-001`'s fix wired replication into the active *delivery* paths:
|
||
`BufferAsync` replicates an `Add`, a successful retry replicates a `Remove`, and a park
|
||
replicates a `Park`. But the two *operator* paths — `RetryParkedMessageAsync` (line 339)
|
||
and `DiscardParkedMessageAsync` (line 353) — change buffer state and never touch
|
||
`_replication`:
|
||
|
||
- `RetryParkedMessageAsync` flips a row from `Parked` back to `Pending` (and clears
|
||
`retry_count` / `last_attempt_at`) in the local SQLite only. The standby's copy stays
|
||
`Parked`.
|
||
- `DiscardParkedMessageAsync` `DELETE`s the row from the local SQLite only. The standby's
|
||
copy is left in place, still `Parked`.
|
||
|
||
The Component design doc ("Persistence") requires the active node to forward "each
|
||
buffer operation (add, remove, park)" so that on failover "the new active node has a
|
||
near-complete copy of the buffer." An operator retrying a parked message is a buffer
|
||
state change; an operator discarding one is a removal. After a failover that follows an
|
||
operator action:
|
||
|
||
1. A **discarded** message reappears on the new active node — it is still `Parked`
|
||
there, so it resurfaces in the central UI's parked-message list and an operator must
|
||
discard it a second time. For a message deliberately removed (e.g. a known-bad
|
||
payload) this is a correctness regression of the operator's intent.
|
||
2. A **retried** message is still `Parked` on the new active node, so the operator's
|
||
"move it back to the queue" action is silently lost across the failover and the
|
||
message is not re-attempted.
|
||
|
||
`ReplicationOperationType` only models `Add`/`Remove`/`Park` — there is no operation for
|
||
"un-park / move back to pending", so even a minimal fix needs either a new operation
|
||
type or a re-use of `Add` to overwrite the standby row. This is the same class of defect
|
||
as the now-resolved `StoreAndForward-001`, for the operator paths rather than the sweep
|
||
paths.
|
||
|
||
**Recommendation**
|
||
|
||
Replicate both operator actions. `DiscardParkedMessageAsync` should call
|
||
`_replication?.ReplicateRemove(messageId)` after a successful local delete (the existing
|
||
`Remove` op already deletes on the standby). For `RetryParkedMessageAsync`, add a
|
||
`Requeue`/`Unpark` `ReplicationOperationType` whose `ApplyReplicatedOperationAsync` case
|
||
resets the standby row to `Pending` with `retry_count = 0`, or have the method re-load
|
||
the updated message and replicate it as an `Add`-style upsert. Add replication tests for
|
||
both operator paths (the existing `StoreAndForwardReplicationTests` only cover the sweep
|
||
paths).
|
||
|
||
**Resolution**
|
||
|
||
Resolved 2026-05-17. Root cause confirmed: the two operator paths
|
||
(`RetryParkedMessageAsync`, `DiscardParkedMessageAsync`) changed local SQLite state but
|
||
never touched `_replication`, so a failover after an operator action diverged the
|
||
standby buffer. `DiscardParkedMessageAsync` now calls `_replication?.ReplicateRemove`
|
||
after a successful local delete (the existing `Remove` op deletes on the standby). A new
|
||
`ReplicationOperationType.Requeue` was added; `RetryParkedMessageAsync` re-loads the
|
||
requeued row and calls `_replication?.ReplicateRequeue`, and the standby's
|
||
`ApplyReplicatedOperationAsync` `Requeue` case resets its matching row to `Pending` with
|
||
`retry_count = 0`. Regression tests `DiscardingAParkedMessage_ReplicatesARemoveOperation`,
|
||
`RetryingAParkedMessage_ReplicatesARequeueOperation` and
|
||
`ApplyReplicatedOperation_Requeue_MovesStandbyRowBackToPending` (in
|
||
`StoreAndForwardReplicationTests`) all pass; the first two fail against the pre-fix code.
|
||
|
||
### StoreAndForward-017 — Retry/Discard activity-log entries hard-code the `ExternalSystem` category
|
||
|
||
| | |
|
||
|--|--|
|
||
| Severity | Low |
|
||
| Category | Correctness & logic bugs |
|
||
| Status | Resolved |
|
||
| Location | `src/ScadaLink.StoreAndForward/StoreAndForwardService.cs:344`, `:358` |
|
||
|
||
**Description**
|
||
|
||
`RetryParkedMessageAsync` and `DiscardParkedMessageAsync` raise an S&F activity
|
||
notification (consumed by Site Event Logging — WP-14) but pass a hard-coded
|
||
`StoreAndForwardCategory.ExternalSystem` as the category argument:
|
||
|
||
```csharp
|
||
RaiseActivity("Retry", StoreAndForwardCategory.ExternalSystem, $"Parked message {messageId} moved back to queue");
|
||
RaiseActivity("Discard", StoreAndForwardCategory.ExternalSystem, $"Parked message {messageId} discarded");
|
||
```
|
||
|
||
Both methods take only a `messageId` and never load the message, so they have no access
|
||
to its real category. When an operator retries or discards a parked **Notification** or
|
||
**CachedDbWrite** message, the site event log records the activity under
|
||
`ExternalSystem`. Every other `RaiseActivity` call in the service passes the message's
|
||
true `Category` (`EnqueueAsync`, `RetryMessageAsync`), so the operator paths are
|
||
inconsistent and produce mislabelled audit entries — misleading when an operator later
|
||
filters or reviews S&F activity by category.
|
||
|
||
**Recommendation**
|
||
|
||
Load the message (or have `StoreAndForwardStorage.RetryParkedMessageAsync` /
|
||
`DiscardParkedMessageAsync` return the affected row's category) and pass the real
|
||
`Category` to `RaiseActivity`. If loading the row is considered too costly on these
|
||
infrequent operator paths, change the `OnActivity` event / `RaiseActivity` signature to
|
||
allow a nullable category for management actions rather than asserting a false one.
|
||
|
||
**Resolution**
|
||
|
||
Resolved 2026-05-17. Root cause confirmed: `RetryParkedMessageAsync` and
|
||
`DiscardParkedMessageAsync` raised activity notifications with a hard-coded
|
||
`StoreAndForwardCategory.ExternalSystem`, mislabelling parked `Notification` and
|
||
`CachedDbWrite` messages in the site event log. Both methods now obtain the message's
|
||
real category — `DiscardParkedMessageAsync` loads the row via `GetMessageByIdAsync`
|
||
before the delete, `RetryParkedMessageAsync` re-loads the requeued row (also used for
|
||
the StoreAndForward-016 replication) — and pass it to `RaiseActivity` (falling back to
|
||
`ExternalSystem` only if the row is unexpectedly absent). Regression tests
|
||
`RetryParkedMessageAsync_ActivityUsesMessageRealCategory` and
|
||
`DiscardParkedMessageAsync_ActivityUsesMessageRealCategory` assert the activity carries
|
||
`Notification` / `CachedDbWrite` respectively; both fail against the pre-fix code.
|
||
|
||
### StoreAndForward-018 — Notification corrupt-payload parks the buffered message, contradicting the "notifications do not park" design invariant
|
||
|
||
| | |
|
||
|--|--|
|
||
| Severity | High |
|
||
| Category | Design-document adherence |
|
||
| Status | Resolved |
|
||
| Location | `src/ScadaLink.StoreAndForward/NotificationForwarder.cs:62`–`:69`, `:105`–`:122`; `src/ScadaLink.StoreAndForward/StoreAndForwardService.cs:369`–`:397` |
|
||
|
||
**Resolution** — `NotificationForwarder.DeliverAsync` now discards a corrupt
|
||
buffered payload instead of returning false. The corrupt path logs a Warning
|
||
with the buffered row id + length-capped payload preview via an injected
|
||
`ILogger<NotificationForwarder>` (NullLogger by default for back-compat),
|
||
then returns true so the S&F engine clears the row via its standard
|
||
success-path cleanup — honoring the "notifications do not park" design
|
||
invariant. Two regression tests in `NotificationForwarderTests` cover the
|
||
two corrupt shapes (invalid JSON, `null` deserialisation) and pin that
|
||
nothing is forwarded to central in either case.
|
||
|
||
**Description**
|
||
|
||
The Component design doc explicitly carves out notifications from the parking lifecycle:
|
||
|
||
> "Notifications do not park — they are retried at the fixed forward interval until
|
||
> central acks." (`docs/requirements/Component-StoreAndForward.md:47`)
|
||
> "Parking applies only to the external-system-call and cached-database-write
|
||
> categories." (same line)
|
||
|
||
`NotificationForwarder.DeliverAsync` violates this. When `TryBuildSubmit` fails to
|
||
deserialize the buffered payload — either because `JsonSerializer.Deserialize` throws a
|
||
`JsonException` (line 114) or because it returns `null` (line 119) — `DeliverAsync`
|
||
returns `false` (line 68). On the **retry path** the S&F engine treats handler `false`
|
||
as a permanent failure and **parks the message immediately** via the conditional
|
||
`UpdateMessageIfStatusAsync(... Parked)` write at `StoreAndForwardService.cs:373`–`:385`.
|
||
Result: a notification with a corrupt buffered payload — a row that the engine itself
|
||
treats as opaque ("Payload: Serialized message content…"; `Component-StoreAndForward.md:
|
||
110`) — enters the parked state and surfaces in the central UI's parked-message list
|
||
under the `Notification` category, contradicting the doc's invariant and the resolved
|
||
StoreAndForward-017's "Notification / CachedDbWrite" Retry/Discard category mapping.
|
||
|
||
The defect is real today: the inline comment on `NotificationForwarder.cs:64` even
|
||
documents the violation ("An unreadable payload cannot be fixed by retrying — park it
|
||
(return false)") as the intended behaviour, but that behaviour is what the design doc
|
||
forbids. Either the doc needs to acknowledge a poison-payload parking exception for
|
||
notifications, or the forwarder needs a different escape hatch (discard? log + drop?
|
||
permanent-failure-as-`true` to clear the buffer?). Today there is no consistent answer
|
||
between code and design.
|
||
|
||
Additionally, on the **immediate-delivery** path (a fresh enqueue followed by a
|
||
`DeliverAsync` returning `false`), the engine returns `WasBuffered: false` and the row
|
||
is never persisted — so the corrupt-payload "park" only occurs on the retry path, where
|
||
the message has already been buffered (and replicated to the standby). The
|
||
**inconsistency between the two paths** ("not buffered" vs "parked") for the same
|
||
permanent-failure outcome is itself a contract surprise; the resolved StoreAndForward-004
|
||
documents the immediate vs retry asymmetry, but does not anticipate that the retry
|
||
asymmetry will violate a per-category invariant.
|
||
|
||
**Recommendation**
|
||
|
||
Choose one consistent reconciliation. Preferred option: change `NotificationForwarder.
|
||
DeliverAsync` to **discard** a corrupt payload rather than park it — delete the
|
||
buffered row directly, log a Site Event Log entry under `Discard`, and return `true` so
|
||
the engine clears the buffer. This preserves the design's "notifications do not park"
|
||
invariant. Alternatives: (a) update the design doc to acknowledge a poison-payload
|
||
parking exception specifically for notifications, and revise the resolved
|
||
StoreAndForward-017 wording; (b) treat `JsonException` as transient (would retry-forever
|
||
on a corrupt payload — bad); (c) introduce a per-category park-allowed flag on the
|
||
engine and gate the retry-path park behind it for the Notification category.
|
||
Add a regression test in `tests/ScadaLink.StoreAndForward.Tests/NotificationForwarderTests.
|
||
cs` asserting that a corrupt-payload notification reaches a terminal **non-Parked**
|
||
state — today the corrupt-payload behaviour is uncovered.
|
||
|
||
**Resolution**
|
||
|
||
_Unresolved._
|
||
|
||
### StoreAndForward-019 — Notifications park after `DefaultMaxRetries` exhaustion, contradicting "retried until central acks"
|
||
|
||
| | |
|
||
|--|--|
|
||
| Severity | Medium |
|
||
| Category | Error handling & resilience |
|
||
| Status | Resolved |
|
||
| Location | `src/ScadaLink.StoreAndForward/StoreAndForwardService.cs:229`, `:407`–`:437`; `src/ScadaLink.StoreAndForward/StoreAndForwardOptions.cs:18`; `src/ScadaLink.SiteRuntime/Scripts/ScriptRuntimeContext.cs:1773`–`:1778`; `src/ScadaLink.NotificationService/NotificationDeliveryService.cs:149`–`:156` |
|
||
|
||
**Description**
|
||
|
||
The design doc requires a buffered notification to be retried indefinitely until
|
||
central acks:
|
||
|
||
> "The **notification** category retries differently: it has no source-entity setting.
|
||
> The site→central forward uses a single fixed retry interval configured in the host
|
||
> `appsettings.json`. … A buffered notification is retried until central acks it; it is
|
||
> not parked on a retry limit (central, once reachable, owns delivery, retry, and
|
||
> parking from that point on)." (`Component-StoreAndForward.md:55`–`:59`)
|
||
|
||
The current engine cannot honour that. `RetryMessageAsync` enforces parking at
|
||
`message.MaxRetries > 0 && message.RetryCount >= message.MaxRetries`
|
||
(`StoreAndForwardService.cs:407`); a `MaxRetries == 0` is the documented "no limit"
|
||
escape hatch (now correctly explained by the resolved StoreAndForward-015). But the two
|
||
notification enqueue paths both supply a positive bounded `MaxRetries`:
|
||
|
||
- `ScriptRuntimeContext.cs:1773`–`:1778` (the `Notify.Send` site script path) calls
|
||
`EnqueueAsync` without supplying the `maxRetries` argument, so the engine
|
||
defaults to `StoreAndForwardOptions.DefaultMaxRetries = 50` (`StoreAndForwardOptions.
|
||
cs:18`). After 50 retry sweeps with central unreachable, the notification is parked.
|
||
- `NotificationDeliveryService.cs:149`–`:156` (the legacy SMTP-style path retained for
|
||
the central-side `INotificationDeliveryService` callers) passes
|
||
`smtpConfig.MaxRetries > 0 ? smtpConfig.MaxRetries : null` — `null` falls back to the
|
||
same 50-retry default, and any positive `smtpConfig.MaxRetries` still bounds the
|
||
retry budget. Either way, a long central outage parks the notification.
|
||
|
||
A parked notification cannot be cleared by a central recovery: it stays parked until an
|
||
operator clicks **Retry** in the parked-message UI. The design's invariant — that
|
||
notification delivery converges automatically as soon as central is reachable — is
|
||
broken: an extended central outage requires manual intervention to clear the backlog,
|
||
which is exactly the behaviour the central-only outbox redesign was meant to remove
|
||
from the site.
|
||
|
||
This is closely related to (but distinct from) StoreAndForward-018: 018 is the
|
||
*permanent-failure-path* parking violation; 019 is the *transient-failure-path*
|
||
parking violation under the engine's normal max-retries policy.
|
||
|
||
**Recommendation**
|
||
|
||
Make the notification enqueue paths pass `maxRetries: 0` so the documented "no limit /
|
||
never parked" semantics apply, and guard against regression by adding an integration
|
||
test in `tests/ScadaLink.StoreAndForward.Tests/NotificationForwarderTests.cs` that runs
|
||
a sweep many more times than `DefaultMaxRetries` against an always-failing handler and
|
||
asserts the buffered notification's status stays `Pending` (not `Parked`). A cleaner
|
||
alternative is to special-case the `Notification` category inside
|
||
`RetryMessageAsync`'s max-retries guard (treat it as `MaxRetries == 0` regardless of
|
||
the field value) so the invariant is enforced at the single chokepoint rather than
|
||
relying on every caller to pass the right value — this also fixes the legacy
|
||
`NotificationDeliveryService` path without editing the consumer.
|
||
|
||
**Resolution (2026-05-28):**
|
||
VERIFY outcome — the design doc's "Notifications do not park" wording (lines 47, 59)
|
||
was the *operational intent* for the happy path, not an absolute invariant: the engine
|
||
has always enforced `DefaultMaxRetries` uniformly across every category, and every
|
||
sibling system (ESG, CachedDbWrite) bounds retry-then-parks for the same disk-pressure
|
||
and operator-visibility reasons. Removing the cap for notifications would let a single
|
||
unreachable central exhaust local disk via an unbounded buffer — worse than the
|
||
documented "park after retry budget" behaviour. Resolution is therefore the brief's
|
||
**default**: document the parking behaviour. Updated
|
||
`Component-StoreAndForward.md` lines 46/58 to clarify that the `DefaultMaxRetries` cap
|
||
applies uniformly (including to notifications) and that `maxRetries: 0` is the explicit
|
||
escape hatch for callers that need unbounded retry. Added a `StoreAndForward-019` block
|
||
to `StoreAndForwardOptions.DefaultMaxRetries`'s XML doc explaining the same invariant.
|
||
No behavioural code change — existing tests (104 in
|
||
`ScadaLink.StoreAndForward.Tests`) continue to pass.
|
||
|
||
### StoreAndForward-020 — `RetryParkedMessageAsync` skips standby replication when the message is deleted between local update and re-load
|
||
|
||
| | |
|
||
|--|--|
|
||
| Severity | Medium |
|
||
| Category | Concurrency & thread safety |
|
||
| Status | Resolved |
|
||
| Location | `src/ScadaLink.StoreAndForward/StoreAndForwardService.cs:599`–`:616` |
|
||
|
||
**Description**
|
||
|
||
The StoreAndForward-016 resolution wired Requeue replication into operator-initiated
|
||
retry. The fix uses a two-step pattern:
|
||
|
||
```csharp
|
||
public async Task<bool> RetryParkedMessageAsync(string messageId)
|
||
{
|
||
var success = await _storage.RetryParkedMessageAsync(messageId); // step 1
|
||
if (success)
|
||
{
|
||
var message = await _storage.GetMessageByIdAsync(messageId); // step 2 (no txn)
|
||
var category = message?.Category ?? StoreAndForwardCategory.ExternalSystem;
|
||
if (message != null)
|
||
{
|
||
_replication?.ReplicateRequeue(message); // step 3
|
||
}
|
||
RaiseActivity("Retry", category, ...);
|
||
}
|
||
return success;
|
||
}
|
||
```
|
||
|
||
The two storage calls are on separate connections with no surrounding transaction. A
|
||
concurrent writer between step 1 (which moved the row from Parked → Pending) and step 2
|
||
(which re-reads the row) can delete or mutate the row:
|
||
|
||
- An operator who issues `DiscardParkedMessageAsync` immediately after retry — the
|
||
`DiscardParkedMessageAsync` storage call is conditional on `status = Parked`, so it
|
||
will be a no-op (correct), but a sweep that succeeds in delivering the just-requeued
|
||
row will then call `_storage.RemoveMessageAsync` (unconditional), which deletes it.
|
||
In a single retry-sweep cycle this race is real because `DefaultRetryInterval = Zero`
|
||
is the standard test default and the operator action and a sweep tick can overlap.
|
||
- A `RemoveMessageAsync` runs in step 1's wake; `GetMessageByIdAsync` returns null;
|
||
step 3 (`_replication?.ReplicateRequeue`) is **skipped entirely**, but step 1
|
||
already requeued the row locally. The standby is now left in `Parked` state while
|
||
the active node has Pending-then-Deleted, exactly the standby-divergence StoreAndForward-016
|
||
was supposed to fix. (On the active node a subsequent failover lands on a Parked
|
||
standby copy of a discarded message — the same regression 016 already documented.)
|
||
|
||
The category-fallback path (`StoreAndForwardCategory.ExternalSystem` when message is
|
||
null) silently mislabels the activity log entry too — the same defect that
|
||
StoreAndForward-017 fixed for the non-racy path, except this branch handed back a
|
||
hard-coded fallback rather than re-loading. The activity log entry is a minor side
|
||
effect; the missing replication is the real defect.
|
||
|
||
**Recommendation**
|
||
|
||
Capture the message **once**, before the local Parked → Pending storage update, so the
|
||
replication path has the row in hand even if a concurrent writer deletes it
|
||
afterwards:
|
||
|
||
```csharp
|
||
var message = await _storage.GetMessageByIdAsync(messageId); // before the update
|
||
if (message == null || message.Status != StoreAndForwardMessageStatus.Parked)
|
||
return false;
|
||
|
||
var success = await _storage.RetryParkedMessageAsync(messageId);
|
||
if (!success) return false;
|
||
|
||
// `message` was the parked row; the active node just wrote it back to Pending with
|
||
// retry_count = 0 — construct the replicated state from those known mutations.
|
||
message.Status = StoreAndForwardMessageStatus.Pending;
|
||
message.RetryCount = 0;
|
||
message.LastError = null;
|
||
message.LastAttemptAt = null;
|
||
_replication?.ReplicateRequeue(message);
|
||
RaiseActivity("Retry", message.Category, $"Parked message {messageId} moved back to queue");
|
||
return true;
|
||
```
|
||
|
||
Add a regression test in `StoreAndForwardReplicationTests` that simulates the
|
||
delete-between-update-and-reload race and asserts the `Requeue` replication
|
||
operation is still emitted with the correct category.
|
||
|
||
**Resolution (2026-05-28):**
|
||
Applied the brief's primary recommendation — `RetryParkedMessageAsync` now captures
|
||
the parked row up front via `GetMessageByIdAsync` (and rejects the call early if the
|
||
row is missing or no longer `Parked`), then performs the local `RetryParkedMessageAsync`
|
||
storage write, and finally reconstructs the post-requeue state on the captured POCO
|
||
(`Status = Pending, RetryCount = 0, LastError = null, LastAttemptAt = null`) and
|
||
replicates it. A concurrent `RemoveMessageAsync` or `DiscardParkedMessageAsync` running
|
||
between the local write and the original re-load can no longer skip replication — the
|
||
row is in hand. The category-fallback misllabelling on the racy path is gone because
|
||
the activity log uses the captured `Category` directly.
|
||
|
||
### StoreAndForward-021 — Design doc claims the Operation Tracking Table lives in StoreAndForward but the implementation is in SiteRuntime
|
||
|
||
| | |
|
||
|--|--|
|
||
| Severity | Medium |
|
||
| Category | Design-document adherence |
|
||
| Status | Resolved |
|
||
| Location | `docs/requirements/Component-StoreAndForward.md:21`, `:49`–`:51`, `:77`–`:87`, `:108`, `:114`; `src/ScadaLink.SiteRuntime/Tracking/OperationTrackingStore.cs:37`; `src/ScadaLink.StoreAndForward/` (whole module) |
|
||
|
||
**Description**
|
||
|
||
Component-StoreAndForward.md repeatedly assigns the **Operation Tracking Table** to
|
||
this component:
|
||
|
||
- **Responsibilities** (line 21): "Maintain a site-local **operation tracking table**
|
||
holding one row per `TrackedOperationId` for cached calls … the authoritative status
|
||
record consulted by `Tracking.Status(id)`."
|
||
- **Message Lifecycle** (lines 49–51): "the operation tracking table is the status
|
||
record and the S&F buffer is purely the retry mechanism. A cached call that succeeds
|
||
on its first immediate attempt is written directly as a terminal `Delivered` tracking
|
||
row and never enters the S&F buffer."
|
||
- **Operation Tracking Table** section (lines 77–87): "Alongside the S&F buffer DB,
|
||
each site node holds a **site-local operation tracking table** in SQLite. … Each row
|
||
records the operation kind (`TrackedOperationKind`) …"
|
||
|
||
The actual implementation lives outside this module: `src/ScadaLink.SiteRuntime/
|
||
Tracking/OperationTrackingStore.cs` (and `IOperationTrackingStore`, `OperationTrackingOptions`).
|
||
The StoreAndForward project contains no references to the tracking store, owns no
|
||
`operation_tracking` table, and `StoreAndForwardService.NotifyCachedCallObserverAsync`
|
||
is only a hook handing telemetry context to an `ICachedCallLifecycleObserver` — the
|
||
audit bridge wired in `ScadaLink.AuditLog`. The S&F module is **not** the table's
|
||
owner; SiteRuntime is.
|
||
|
||
This is a real design-doc drift, not a code defect, and is flagged explicitly in the
|
||
brief's "Module-specific notes". The drift matters because the design doc's
|
||
discussion of the lifecycle — "immediate success writes a terminal Delivered tracking
|
||
row directly here", "operator discard sets terminal `Discarded`", "central never
|
||
mutates the mirror row directly" — places coordination responsibilities on the wrong
|
||
component. A reader looking for the source of truth for `Tracking.Status(id)` would
|
||
read `Component-StoreAndForward.md` and search `src/ScadaLink.StoreAndForward/` in
|
||
vain. The doc also lists Site Call Audit / Audit Log telemetry-emission as a S&F
|
||
responsibility (line 22), but the emission actually happens via the `AuditLog` site
|
||
component subscribing to `ICachedCallLifecycleObserver`.
|
||
|
||
**Recommendation**
|
||
|
||
Reconcile the doc with the code. The simplest fix is doc-side: update
|
||
Component-StoreAndForward.md to scope its responsibilities back to the retry
|
||
mechanism + replication + parked-message management, and add a cross-reference to a
|
||
new (or existing) component doc for Operation Tracking (Component-SiteRuntime.md, or
|
||
a new Component-OperationTracking.md). The code is internally consistent — the audit
|
||
bridge subscribes to the observer hook, the SiteRuntime store writes the rows, the S&F
|
||
engine emits attempt telemetry on the cached-call hot path — but the design doc is
|
||
several refactors out of date. The hierarchical map should be:
|
||
|
||
- `Component-StoreAndForward.md` → S&F buffer + Replication + Parked-message
|
||
management + Notification forwarding to central + cached-call telemetry **hook**.
|
||
- New doc / SiteRuntime doc → Operation Tracking Table semantics and lifecycle.
|
||
- `Component-SiteCallAudit.md` / `Component-AuditLog.md` → telemetry emission +
|
||
central-side mirror.
|
||
|
||
**Resolution (2026-05-28):**
|
||
Doc-side fix applied (per the brief, the simplest of the two options). Updated
|
||
`Component-StoreAndForward.md`: (1) removed the "Maintain a site-local operation
|
||
tracking table" line from Responsibilities and reworded the cached-call telemetry
|
||
responsibility to point at the `ICachedCallLifecycleObserver` hook; (2) renamed the
|
||
"Operation Tracking Table" section to "Operation Tracking Table (lives in Site
|
||
Runtime, not here)" with an explicit `StoreAndForward-021` callout cross-linking to
|
||
`Component-SiteRuntime.md` and the `IOperationTrackingStore` interface in
|
||
Commons. The rest of the section is retained for cross-component context (the
|
||
buffered cached-call rows carry `TrackedOperationId` so the link to the tracking row
|
||
must still be documented somewhere) but is reworded to make clear the table itself is
|
||
not owned here.
|
||
|
||
### StoreAndForward-022 — `NotifyCachedCallObserverAsync` silently drops the entire audit lifecycle when the message id is not a parseable `TrackedOperationId`
|
||
|
||
| | |
|
||
|--|--|
|
||
| Severity | Low |
|
||
| Category | Documentation & comments |
|
||
| Status | Resolved |
|
||
| Location | `src/ScadaLink.StoreAndForward/StoreAndForwardService.cs:484`–`:515` |
|
||
|
||
**Description**
|
||
|
||
`NotifyCachedCallObserverAsync` (the per-attempt observer notifier wired by the M3
|
||
Bundle E rollout) bails out with no audit emission when
|
||
`TrackedOperationId.TryParse(message.Id, out var trackedId)` returns false
|
||
(`StoreAndForwardService.cs:510`–`:515`). The inline comment justifies the behaviour as
|
||
back-compat for "pre-M3 message (random GUID-N id from S&F itself, no
|
||
TrackedOperationId threaded in)", but the documented contract is broken in two ways:
|
||
|
||
1. **Silent dropping of every audit row, not just the first one.** The skip means no
|
||
`Attempted` row, no `CachedResolve` terminal row, no audit trail at all for that
|
||
operation's S&F lifecycle — yet the rest of the system (script trust boundary,
|
||
parked-message UI, etc.) still treats the operation as audit-tracked. The drop is
|
||
not surfaced via a metric, log warning (the path is a silent `return`), or counter,
|
||
so a misconfigured caller bypasses the audit hot path with zero feedback.
|
||
|
||
2. **The contract is hidden in field-level XML.** The `ICachedCallLifecycleObserver`
|
||
public interface contract (defined in `ScadaLink.Commons`) does not document that
|
||
the observer will be silently skipped when the underlying S&F message id is not a
|
||
GUID. A consumer reading the interface contract reasonably expects every cached-call
|
||
attempt to surface — the audit pipeline depends on it. The silent-drop is an
|
||
implementation detail of the S&F bridge that should be either lifted onto the
|
||
contract or removed.
|
||
|
||
The engine itself mints GUID-N ids via `Guid.NewGuid().ToString("N")` (line 224), which
|
||
`TrackedOperationId.TryParse` accepts, so the skip path is unreachable for engine-minted
|
||
ids. It is reachable only for callers that supply their own `messageId` argument with a
|
||
non-GUID format. The current callers (`NotificationOutbox` enqueue path with
|
||
NotificationId, cached-call enqueue path with `TrackedOperationId.ToString()`) all
|
||
supply GUID-shaped ids. The defect is latent — a future caller passing a non-GUID id
|
||
would silently bypass audit.
|
||
|
||
**Recommendation**
|
||
|
||
Two options. The cheap fix: change the skip to a `_logger.LogWarning` with the offending
|
||
id so a misconfigured caller is observable, and update the
|
||
`ICachedCallLifecycleObserver` XML doc to mention the "non-GUID id → no telemetry"
|
||
contract explicitly. The more correct fix: emit a still-audited row for the
|
||
non-GUID case (e.g. synthesise a `TrackedOperationId` from the underlying id, or emit a
|
||
distinguished "tracking-id-missing" audit row) so the audit pipeline never has silent
|
||
holes. Add a regression test in `CachedCallAttemptEmissionTests` capturing the chosen
|
||
contract — the existing
|
||
`Attempt_MessageIdNotAGuid_NoObserverNotification` test pins today's silent-skip; if
|
||
the fix is "log + skip", that test should be updated to also assert the log emission;
|
||
if the fix is "emit anyway", the test should be replaced.
|
||
|
||
**Resolution (2026-05-28):**
|
||
Applied the brief's "cheap fix" — the non-GUID skip path now logs a Warning naming
|
||
the offending `MessageId`, `Category` and `Outcome` before returning, so a
|
||
misconfigured caller is observable instead of silently bypassing the audit pipeline.
|
||
S&F retry bookkeeping remains untouched (the observer is still best-effort, the skip
|
||
still returns without throwing). The existing
|
||
`Attempt_MessageIdNotAGuid_NoObserverNotification` test still passes — its assertion
|
||
is on `_observer.Notifications` being empty, which is unchanged.
|
||
|
||
### StoreAndForward-023 — `siteId` silently defaults to empty when no `IStoreAndForwardSiteContext` is registered, degrading audit telemetry correlation
|
||
|
||
| | |
|
||
|--|--|
|
||
| Severity | Low |
|
||
| Category | Code organization & conventions |
|
||
| Status | Resolved |
|
||
| Location | `src/ScadaLink.StoreAndForward/ServiceCollectionExtensions.cs:43`–`:53`; `src/ScadaLink.StoreAndForward/StoreAndForwardService.cs:99`, `:524` |
|
||
|
||
**Description**
|
||
|
||
`AddStoreAndForward`'s service-collection factory resolves the optional
|
||
`IStoreAndForwardSiteContext` and falls back to `string.Empty` when not registered:
|
||
|
||
```csharp
|
||
var siteContext = sp.GetService<IStoreAndForwardSiteContext>();
|
||
var siteId = siteContext?.SiteId ?? string.Empty;
|
||
return new StoreAndForwardService(storage, options, logger, replication,
|
||
cachedCallObserver, siteId);
|
||
```
|
||
|
||
The constructor's parameter is even defaulted to `""`. The empty-string `siteId` flows
|
||
straight into every emitted `CachedCallAttemptContext.SourceSite`, which the central
|
||
audit pipeline uses as part of the `(SourceSite, TrackedOperationId)` correlation key.
|
||
A host that registers an `ICachedCallLifecycleObserver` (the audit observer wired by
|
||
`AddAuditLog`) but forgets to register `IStoreAndForwardSiteContext` will produce a
|
||
stream of telemetry rows with `SourceSite = ""` — the central audit mirror cannot
|
||
distinguish them by site, and the central-site routing of
|
||
`RetryParkedOperation`/`DiscardParkedOperation` commands keyed on `SourceSite` will
|
||
fail to find the owning site.
|
||
|
||
The Host's `IStoreAndForwardSiteContext` adapter and the audit observer registration
|
||
are wired in lock-step, so the current configuration is correct, but the silent
|
||
empty-string fallback is a contract hazard for future hosts (CLI test harness, second
|
||
site cluster topology, etc.) and for tests that wire one without the other.
|
||
|
||
**Recommendation**
|
||
|
||
Make the contract explicit: when `cachedCallObserver` is non-null, require
|
||
`IStoreAndForwardSiteContext` to be registered — throw an `InvalidOperationException`
|
||
with a clear "Audit observer registered without a site context — register
|
||
IStoreAndForwardSiteContext" message at construction time. When the audit observer is
|
||
absent (no `AddAuditLog`), keep the empty-string default since `_siteId` is unused.
|
||
Alternatively, change `siteId` from a parameter to a `Func<string>` resolved lazily
|
||
from the service provider so a late-registered context still takes effect.
|
||
|
||
**Resolution (2026-05-28):**
|
||
Applied the brief's sentinel option (less invasive than throwing — preserves the
|
||
existing test wiring that constructs `StoreAndForwardService` without a site context).
|
||
Introduced `StoreAndForwardService.UnknownSiteSentinel = "$unknown-site"` (leading
|
||
`$` chosen so it cannot collide with a real site id) and the constructor now
|
||
normalises any null/empty/whitespace `siteId` argument to that sentinel. The empty
|
||
string can no longer reach `CachedCallAttemptContext.SourceSite`; a misconfigured
|
||
host without an `IStoreAndForwardSiteContext` produces audit rows tagged with the
|
||
sentinel — recognisably bad in the central audit log instead of silently merging
|
||
into the empty bucket. All 104 existing tests pass; the only test that asserts a
|
||
literal `SourceSite` (`CachedCallAttemptEmissionTests`) supplies `"site-77"` so the
|
||
normalisation is a no-op there.
|
||
|
||
### StoreAndForward-024 — `StopAsync` does not wait for an in-flight retry sweep, so disposed dependencies can be touched after shutdown
|
||
|
||
| | |
|
||
|--|--|
|
||
| Severity | Low |
|
||
| Category | Concurrency & thread safety |
|
||
| Status | Resolved |
|
||
| Location | `src/ScadaLink.StoreAndForward/StoreAndForwardService.cs:122`–`:127`, `:136`–`:143`, `:303`–`:329` |
|
||
|
||
**Resolution (2026-05-28):** the timer callback now captures the sweep task
|
||
into a `_sweepTask` field via `Volatile.Write`, and `StopAsync` disposes the
|
||
timer first (so no new sweep starts) then `await`s the captured task with a
|
||
bounded `SweepShutdownWaitTimeout` (10 s) via `Task.WaitAsync` — so a sweep
|
||
in-flight when shutdown begins is given a chance to finish before the host
|
||
disposes `_storage`/`_replication`. A genuinely hung sweep cannot block
|
||
shutdown indefinitely (the timeout fires, the wait is abandoned, the
|
||
warning is logged). Regression test
|
||
`StopAsync_AwaitsInFlightRetrySweep_BeforeReturning` parks a sweep inside a
|
||
blocking handler, asserts `StopAsync`'s returned task is not completed while
|
||
the sweep is paused, then releases the handler and asserts the sweep ran to
|
||
completion before `StopAsync` returned.
|
||
|
||
**Description**
|
||
|
||
`StartAsync` arms `_retryTimer` with `_ => _ = RetryPendingMessagesAsync()` (line 123).
|
||
The `_ =` discards the returned `Task`, so when the timer fires the sweep runs **fire
|
||
and forget** on a thread-pool thread. `StopAsync` (lines 136–143) disposes the timer:
|
||
|
||
```csharp
|
||
if (_retryTimer != null)
|
||
{
|
||
await _retryTimer.DisposeAsync();
|
||
_retryTimer = null;
|
||
}
|
||
```
|
||
|
||
`Timer.DisposeAsync()` returns once any in-flight timer **callback** has completed —
|
||
but the timer callback in this service is a one-line `_ = RetryPendingMessagesAsync()`
|
||
that synchronously returns immediately and leaves the actual sweep running on the
|
||
thread pool. So `Timer.DisposeAsync` does not wait for the sweep; only for the
|
||
synchronous `_ = ...` discarding step. `StopAsync` returns while a sweep is potentially
|
||
still running, touching `_storage` (which the host will dispose), `_replication`
|
||
(which the host will tear down), and `_cachedCallObserver` (whose downstream gRPC
|
||
channel the host will shut down).
|
||
|
||
The host shutdown sequence (`AkkaHostedService`) tears down the actor system and the
|
||
DI container after this service's `StopAsync` completes — meaning a sweep that runs
|
||
past `StopAsync` can call into disposed `SqliteConnection`s (yielding
|
||
`ObjectDisposedException`, caught by the sweep's outer `try/catch` as a log) or, more
|
||
seriously, push a replication operation into a half-disposed Akka actor pipeline and
|
||
trigger noisy dead-letter warnings during a clean shutdown.
|
||
|
||
The race window is small (the sweep typically finishes in <100 ms in tests) but it is
|
||
real, particularly when shutting down a site under load with a non-empty buffer.
|
||
|
||
**Recommendation**
|
||
|
||
Track in-flight sweep tasks and `await` them in `StopAsync`:
|
||
|
||
```csharp
|
||
private Task? _currentSweep;
|
||
|
||
public async Task StopAsync()
|
||
{
|
||
if (_retryTimer != null)
|
||
{
|
||
await _retryTimer.DisposeAsync();
|
||
_retryTimer = null;
|
||
}
|
||
if (_currentSweep is { } sweep)
|
||
{
|
||
try { await sweep; } catch { /* logged inside RetryPendingMessagesAsync */ }
|
||
}
|
||
}
|
||
```
|
||
|
||
Change the timer callback to:
|
||
|
||
```csharp
|
||
_retryTimer = new Timer(_ => _currentSweep = RetryPendingMessagesAsync(), ...);
|
||
```
|
||
|
||
Add a `CancellationTokenSource` so a long sweep can be cooperatively aborted on stop;
|
||
plumb the cancellation token into `_storage` / `_replication` / `_cachedCallObserver`
|
||
calls. Add a regression test in `StoreAndForwardServiceTests` that calls `StopAsync`
|
||
mid-sweep and asserts no further storage activity occurs after `StopAsync` returns.
|
||
|
||
**Resolution**
|
||
|
||
_Unresolved._
|
||
|