code-review: 2026-05-28 baseline re-review of all 23 modules at 1eb6e97

Re-applies the full 10-category checklist to every src/ project — including
first-time reviews of the four newer components (AuditLog, NotificationOutbox,
SiteCallAudit, Transport) — so the code-reviews/ index reflects today's
codebase rather than the 2026-05-16 baseline. 172 new Open findings (0
Critical, 18 High, 62 Medium, 92 Low); 481 findings total across 23 modules.

regen-readme.py now derives each module's Last reviewed + Commit from its
findings.md header instead of hard-coding 2026-05-16 / 9c60592, so future
single-module re-reviews show their own date in the Module Status table.
Joseph Doherty
2026-05-28 02:55:47 -04:00
parent 1eb6e972b0
commit f93b7b99bb
25 changed files with 8793 additions and 115 deletions
+544 -3
@@ -5,10 +5,10 @@
| Module | `src/ScadaLink.StoreAndForward` |
| Design doc | `docs/requirements/Component-StoreAndForward.md` |
| Status | Reviewed |
| Last reviewed | 2026-05-17 |
| Last reviewed | 2026-05-28 |
| Reviewer | claude-agent |
| Commit reviewed | `39d737e` |
| Open findings | 0 (3 Deferred: 002, 011, 012 — see notes) |
| Commit reviewed | `1eb6e97` |
| Open findings | 7 (3 Deferred: 002, 011, 012; 7 new Open: 018–024; see Re-review 2026-05-28) |
## Summary
@@ -55,6 +55,76 @@ StoreAndForward-017 records that the Retry/Discard activity-log entries hard-cod
`ExternalSystem` category, mislabelling notification and cached-DB-write messages in
the site event log.
#### Re-review 2026-05-28 (commit `1eb6e97`)
Full re-review against commit `1eb6e97` with the same 10-category checklist. The
batch-3 / batch-4 resolutions (001, 003–010, 013–017) are still present and intact; no
regressions detected on prior fixes. Findings 002, 011 and 012 remain validly
`Deferred` (their preconditions are unchanged) and findings 005, 006, 010, 013, 014,
016, 017 are confirmed `Resolved` against the current source.
This pass surfaced **seven new findings** clustered around two themes:
The first theme is **design-doc drift on the notification path**, which has acquired
two now-real defects since the engine became central-targeted. `StoreAndForward-018`
(High) records that a corrupt notification payload, handled in
`NotificationForwarder.DeliverAsync` by returning `false`, parks a notification on its
first retry-sweep encounter, despite the design doc stating "Notifications do not park"
(line 47, "Parking applies only to the external-system-call and cached-database-write
categories"). The same path would become a poison-payload retry-forever trap on the
active node if the engine ever softened the `false` semantics. `StoreAndForward-019`
(Medium) records the sibling defect: notifications are enqueued with `MaxRetries`
defaulting to `StoreAndForwardOptions.DefaultMaxRetries` (50), and the legacy SMTP path
(`NotificationDeliveryService.SendAsync`) passes a positive bounded
`smtpConfig.MaxRetries`, so an unreachable central will silently park notifications
after a finite retry budget rather than "retry at the fixed forward interval until
central acks" as the design requires. The contract `0 = no limit` is not enforced for
the notification category.
The second theme is **subtle correctness and contract gaps around the operator paths**
that survived the StoreAndForward-016/017 batch. `StoreAndForward-020` (Medium) records
that `RetryParkedMessageAsync` skips replication entirely if `GetMessageByIdAsync`
returns null after a successful local requeue (a narrow but real race window with a
concurrent discard / sweep delete), re-introducing the StoreAndForward-016 standby
divergence in that corner. `StoreAndForward-021` (Medium) is a design-doc-vs-code drift
that should be reconciled in the doc: the **operation tracking table** is documented
inside Component-StoreAndForward.md as a S&F responsibility (lines 21, 49, 77–87, 108,
114), but the actual `OperationTrackingStore` lives in
`src/ScadaLink.SiteRuntime/Tracking/` and is not consumed by S&F at all, as the brief's
own note flags. The design doc should be updated to point at SiteRuntime, or the store
moved to StoreAndForward.
`StoreAndForward-022` (Low) records that `_cachedCallObserver` silently drops audit
telemetry when a buffered cached-call's `Id` is not a parseable `TrackedOperationId`
GUID: the engine returns from `NotifyCachedCallObserverAsync` before emitting anything.
The engine's own default minting produces "N"-formatted GUIDs, which
`TrackedOperationId.TryParse` accepts, but any caller passing a custom non-GUID id
silently bypasses the entire `Submitted/Forwarded/Attempted/Delivered/Parked/Discarded`
audit lifecycle. `StoreAndForward-023` (Low)
records that `siteId` is silently defaulted to `string.Empty` when no
`IStoreAndForwardSiteContext` is registered, so a misconfigured host produces audit
telemetry with `SourceSite = ""` and the central audit-log's `(SourceSite,
TrackedOperationId)` correlation degrades to a per-id-only index. `StoreAndForward-024`
(Low) is a stop-time ordering defect: `StopAsync` disposes the timer but a
mid-flight `RetryPendingMessagesAsync` invocation continues using `_storage` and
`_replication` after `StopAsync` returns; downstream resources disposed by the host
shutdown sequence (the DI container) can then fail inside the still-running sweep, for
example with an `ObjectDisposedException`.
## Checklist coverage — Re-review 2026-05-28
| # | Category | Examined | Notes |
|---|----------|----------|-------|
| 1 | Correctness & logic bugs | ☑ | Notification corrupt-payload parks contrary to design (018); RetryParkedMessageAsync skips replication when message reload races a deletion (020). |
| 2 | Akka.NET conventions | ☑ | `ParkedMessageHandlerActor` uses `PipeTo` correctly with success/failure projections (007 resolution preserved). No new findings. |
| 3 | Concurrency & thread safety | ☑ | Sweep-vs-stop race: a timer callback running while `StopAsync` returns can touch disposed dependencies (024). |
| 4 | Error handling & resilience | ☑ | Notifications park after `DefaultMaxRetries` exhaustion (019) — contradicts the design doc's "retried until central acks". |
| 5 | Security | ☑ | No issues found — parameterised SQL throughout, payload JSON opaque, no secret material handled. |
| 6 | Performance & resource management | ☑ | No new findings — the connection-per-call documented trade-off and pooled `OpenAsync` remain acceptable. |
| 7 | Design-document adherence | ☑ | Operation Tracking Table documented in StoreAndForward but actually lives in SiteRuntime (021); notification non-parking guarantee broken by 018 + 019. |
| 8 | Code organization & conventions | ☑ | `IStoreAndForwardSiteContext` silently defaults `SiteId` to empty (023) — a configuration hole rather than an entity placement issue. |
| 9 | Testing coverage | ☑ | The seven new findings have no regression tests in `tests/ScadaLink.StoreAndForward.Tests/` — particularly the notification-doesn't-park invariant (018, 019), the requeue-after-reload-null replication gap (020), and the stop-during-sweep behaviour (024). |
| 10 | Documentation & comments | ☑ | `CachedCallAttemptOutcome.ParkedMaxRetries` XML doc says "S&F semantics" but the code applies it to notifications too if 018/019 fire — minor drift, captured under 018. The `TrackedOperationId.TryParse` silent-skip behaviour in `NotifyCachedCallObserverAsync` is documented in the source but not on the public observer contract (022). |
## Checklist coverage
| # | Category | Examined | Notes |
@@ -914,3 +984,474 @@ the StoreAndForward-016 replication) — and pass it to `RaiseActivity` (falling
`RetryParkedMessageAsync_ActivityUsesMessageRealCategory` and
`DiscardParkedMessageAsync_ActivityUsesMessageRealCategory` assert the activity carries
`Notification` / `CachedDbWrite` respectively; both fail against the pre-fix code.
### StoreAndForward-018 — Notification corrupt-payload parks the buffered message, contradicting the "notifications do not park" design invariant
| | |
|--|--|
| Severity | High |
| Category | Design-document adherence |
| Status | Open |
| Location | `src/ScadaLink.StoreAndForward/NotificationForwarder.cs:62`–`:69`, `:105`–`:122`; `src/ScadaLink.StoreAndForward/StoreAndForwardService.cs:369`–`:397` |
**Description**
The Component design doc explicitly carves out notifications from the parking lifecycle:
> "Notifications do not park — they are retried at the fixed forward interval until
> central acks." (`docs/requirements/Component-StoreAndForward.md:47`)
> "Parking applies only to the external-system-call and cached-database-write
> categories." (same line)
`NotificationForwarder.DeliverAsync` violates this. When `TryBuildSubmit` fails to
deserialize the buffered payload — either because `JsonSerializer.Deserialize` throws a
`JsonException` (line 114) or because it returns `null` (line 119) — `DeliverAsync`
returns `false` (line 68). On the **retry path** the S&F engine treats handler `false`
as a permanent failure and **parks the message immediately** via the conditional
`UpdateMessageIfStatusAsync(... Parked)` write at `StoreAndForwardService.cs:373`–`:385`.
Result: a notification with a corrupt buffered payload (a row that the engine itself
treats as opaque: "Payload: Serialized message content…";
`Component-StoreAndForward.md:110`) enters the parked state and surfaces in the central
UI's parked-message list under the `Notification` category, contradicting the doc's
invariant and the resolved StoreAndForward-017's "Notification / CachedDbWrite"
Retry/Discard category mapping.
The defect is real today: the inline comment on `NotificationForwarder.cs:64` even
documents the violation ("An unreadable payload cannot be fixed by retrying — park it
(return false)") as the intended behaviour, but that behaviour is what the design doc
forbids. Either the doc needs to acknowledge a poison-payload parking exception for
notifications, or the forwarder needs a different escape hatch (discard? log + drop?
permanent-failure-as-`true` to clear the buffer?). Today there is no consistent answer
between code and design.
Additionally, on the **immediate-delivery** path (a fresh enqueue followed by a
`DeliverAsync` returning `false`), the engine returns `WasBuffered: false` and the row
is never persisted — so the corrupt-payload "park" only occurs on the retry path, where
the message has already been buffered (and replicated to the standby). The
**inconsistency between the two paths** ("not buffered" vs "parked") for the same
permanent-failure outcome is itself a contract surprise; the resolved StoreAndForward-004
documents the immediate vs retry asymmetry, but does not anticipate that the retry
asymmetry will violate a per-category invariant.
**Recommendation**
Choose one consistent reconciliation. Preferred option: change
`NotificationForwarder.DeliverAsync` to **discard** a corrupt payload rather than park
it: delete the buffered row directly, log a Site Event Log entry under `Discard`, and
return `true` so the engine clears the buffer. This preserves the design's
"notifications do not park"
invariant. Alternatives: (a) update the design doc to acknowledge a poison-payload
parking exception specifically for notifications, and revise the resolved
StoreAndForward-017 wording; (b) treat `JsonException` as transient (would retry-forever
on a corrupt payload — bad); (c) introduce a per-category park-allowed flag on the
engine and gate the retry-path park behind it for the Notification category.
Add a regression test in
`tests/ScadaLink.StoreAndForward.Tests/NotificationForwarderTests.cs` asserting that a
corrupt-payload notification reaches a terminal **non-Parked** state; today the
corrupt-payload behaviour is uncovered.
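A minimal sketch of the preferred "discard, then report success" option follows. Every name in it is hypothetical: `SketchNotificationForwarder`, `NotificationPayload`, and the three delegates stand in for the real forwarder, payload type, storage delete, Site Event Log write, and central submit, whose actual signatures are not shown in this review. It also assumes that transient delivery failures surface as exceptions rather than as a `false` return.

```csharp
#nullable enable
using System;
using System.Text.Json;
using System.Threading.Tasks;

// Hypothetical payload shape; the real buffered notification type may differ.
public sealed record NotificationPayload(string Subject, string Body);

public sealed class SketchNotificationForwarder
{
    private readonly Func<string, Task> _removeMessage;               // assumed storage delete
    private readonly Action<string> _logDiscard;                      // assumed Site Event Log write
    private readonly Func<NotificationPayload, Task> _sendToCentral;  // assumed submit; throws on transient failure

    public SketchNotificationForwarder(
        Func<string, Task> removeMessage,
        Action<string> logDiscard,
        Func<NotificationPayload, Task> sendToCentral)
    {
        _removeMessage = removeMessage;
        _logDiscard = logDiscard;
        _sendToCentral = sendToCentral;
    }

    // Returns true when the engine may clear the buffered row.
    public async Task<bool> DeliverAsync(string messageId, string payloadJson)
    {
        if (!TryParse(payloadJson, out var payload))
        {
            // Poison payload: discard it instead of returning false (which would park).
            await _removeMessage(messageId);
            _logDiscard($"Dropped unreadable notification payload {messageId}");
            return true;
        }

        // An exception from the submit propagates to the caller (assumed transient).
        await _sendToCentral(payload!);
        return true;
    }

    private static bool TryParse(string json, out NotificationPayload? payload)
    {
        try
        {
            payload = JsonSerializer.Deserialize<NotificationPayload>(json);
            return payload is not null; // the JSON literal "null" deserializes to null
        }
        catch (JsonException)
        {
            payload = null;
            return false;
        }
    }
}
```

With this shape the forwarder never returns `false`, so the retry path has no permanent-failure signal to turn into a park for the `Notification` category.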
**Resolution**
_Unresolved._
### StoreAndForward-019 — Notifications park after `DefaultMaxRetries` exhaustion, contradicting "retried until central acks"
| | |
|--|--|
| Severity | Medium |
| Category | Error handling & resilience |
| Status | Open |
| Location | `src/ScadaLink.StoreAndForward/StoreAndForwardService.cs:229`, `:407`–`:437`; `src/ScadaLink.StoreAndForward/StoreAndForwardOptions.cs:18`; `src/ScadaLink.SiteRuntime/Scripts/ScriptRuntimeContext.cs:1773`–`:1778`; `src/ScadaLink.NotificationService/NotificationDeliveryService.cs:149`–`:156` |
**Description**
The design doc requires a buffered notification to be retried indefinitely until
central acks:
> "The **notification** category retries differently: it has no source-entity setting.
> The site→central forward uses a single fixed retry interval configured in the host
> `appsettings.json`. … A buffered notification is retried until central acks it; it is
> not parked on a retry limit (central, once reachable, owns delivery, retry, and
> parking from that point on)." (`Component-StoreAndForward.md:55`–`:59`)
The current engine cannot honour that. `RetryMessageAsync` enforces parking at
`message.MaxRetries > 0 && message.RetryCount >= message.MaxRetries`
(`StoreAndForwardService.cs:407`); a `MaxRetries == 0` is the documented "no limit"
escape hatch (now correctly explained by the resolved StoreAndForward-015). But the two
notification enqueue paths both supply a positive bounded `MaxRetries`:
- `ScriptRuntimeContext.cs:1773`–`:1778` (the `Notify.Send` site script path) calls
  `EnqueueAsync` without supplying the `maxRetries` argument, so the engine
  defaults to `StoreAndForwardOptions.DefaultMaxRetries = 50`
  (`StoreAndForwardOptions.cs:18`). After 50 retry sweeps with central unreachable, the
  notification is parked.
- `NotificationDeliveryService.cs:149`–`:156` (the legacy SMTP-style path retained for
  the central-side `INotificationDeliveryService` callers) passes
  `smtpConfig.MaxRetries > 0 ? smtpConfig.MaxRetries : null`; `null` falls back to the
  same 50-retry default, and any positive `smtpConfig.MaxRetries` still bounds the
  retry budget. Either way, a long central outage parks the notification.
A parked notification cannot be cleared by a central recovery: it stays parked until an
operator clicks **Retry** in the parked-message UI. The design's invariant — that
notification delivery converges automatically as soon as central is reachable — is
broken: an extended central outage requires manual intervention to clear the backlog,
which is exactly the behaviour the central-only outbox redesign was meant to remove
from the site.
This is closely related to (but distinct from) StoreAndForward-018: 018 is the
*permanent-failure-path* parking violation; 019 is the *transient-failure-path*
parking violation under the engine's normal max-retries policy.
**Recommendation**
Make the notification enqueue paths pass `maxRetries: 0` so the documented "no limit /
never parked" semantics apply, and guard against regression by adding an integration
test in `tests/ScadaLink.StoreAndForward.Tests/NotificationForwarderTests.cs` that runs
a sweep many more times than `DefaultMaxRetries` against an always-failing handler and
asserts the buffered notification's status stays `Pending` (not `Parked`). A cleaner
alternative is to special-case the `Notification` category inside
`RetryMessageAsync`'s max-retries guard (treat it as `MaxRetries == 0` regardless of
the field value) so the invariant is enforced at the single chokepoint rather than
relying on every caller to pass the right value — this also fixes the legacy
`NotificationDeliveryService` path without editing the consumer.
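The single-chokepoint alternative could look roughly like the guard below. This is a sketch under assumptions: `SketchCategory` and `RetryGuard.ShouldPark` are illustrative names, not the engine's actual types, and the real guard lives inline in `RetryMessageAsync`.

```csharp
// Hypothetical, simplified mirror of the engine's category enum.
public enum SketchCategory { ExternalSystem, CachedDbWrite, Notification }

public static class RetryGuard
{
    // Decides whether a message that just failed a retry should be parked.
    public static bool ShouldPark(SketchCategory category, int maxRetries, int retryCount)
    {
        // Notifications never park on a retry limit, whatever MaxRetries the caller supplied.
        if (category == SketchCategory.Notification)
            return false;

        // MaxRetries == 0 keeps its documented "no limit" meaning for the other categories.
        return maxRetries > 0 && retryCount >= maxRetries;
    }
}
```

Putting the category check ahead of the numeric comparison means a caller that still passes a bounded `MaxRetries` for a notification cannot reintroduce the parking behaviour.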
**Resolution**
_Unresolved._
### StoreAndForward-020 — `RetryParkedMessageAsync` skips standby replication when the message is deleted between local update and re-load
| | |
|--|--|
| Severity | Medium |
| Category | Concurrency & thread safety |
| Status | Open |
| Location | `src/ScadaLink.StoreAndForward/StoreAndForwardService.cs:599`–`:616` |
**Description**
The StoreAndForward-016 resolution wired Requeue replication into operator-initiated
retry. The fix uses a two-step pattern:
```csharp
public async Task<bool> RetryParkedMessageAsync(string messageId)
{
    var success = await _storage.RetryParkedMessageAsync(messageId); // step 1
    if (success)
    {
        var message = await _storage.GetMessageByIdAsync(messageId); // step 2 (no txn)
        var category = message?.Category ?? StoreAndForwardCategory.ExternalSystem;
        if (message != null)
        {
            _replication?.ReplicateRequeue(message); // step 3
        }
        RaiseActivity("Retry", category, ...);
    }
    return success;
}
```
The two storage calls are on separate connections with no surrounding transaction. A
concurrent writer between step 1 (which moved the row from Parked → Pending) and step 2
(which re-reads the row) can delete or mutate the row:
- An operator who issues `DiscardParkedMessageAsync` immediately after retry — the
`DiscardParkedMessageAsync` storage call is conditional on `status = Parked`, so it
will be a no-op (correct), but a sweep that succeeds in delivering the just-requeued
row will then call `_storage.RemoveMessageAsync` (unconditional), which deletes it.
In a single retry-sweep cycle this race is real because `DefaultRetryInterval = Zero`
is the standard test default and the operator action and a sweep tick can overlap.
- A `RemoveMessageAsync` runs in step 1's wake; `GetMessageByIdAsync` returns null;
step 3 (`_replication?.ReplicateRequeue`) is **skipped entirely**, but step 1
already requeued the row locally. The standby is now left in `Parked` state while
the active node has Pending-then-Deleted, exactly the standby-divergence StoreAndForward-016
was supposed to fix. (On the active node a subsequent failover lands on a Parked
standby copy of a discarded message — the same regression 016 already documented.)
The category-fallback path (`StoreAndForwardCategory.ExternalSystem` when message is
null) silently mislabels the activity log entry too — the same defect that
StoreAndForward-017 fixed for the non-racy path, except that this branch hands back a
hard-coded fallback rather than re-loading. The activity log entry is a minor side
effect; the missing replication is the real defect.
**Recommendation**
Capture the message **once**, before the local Parked → Pending storage update, so the
replication path has the row in hand even if a concurrent writer deletes it
afterwards:
```csharp
var message = await _storage.GetMessageByIdAsync(messageId); // before the update
if (message == null || message.Status != StoreAndForwardMessageStatus.Parked)
return false;
var success = await _storage.RetryParkedMessageAsync(messageId);
if (!success) return false;
// `message` was the parked row; the active node just wrote it back to Pending with
// retry_count = 0 — construct the replicated state from those known mutations.
message.Status = StoreAndForwardMessageStatus.Pending;
message.RetryCount = 0;
message.LastError = null;
message.LastAttemptAt = null;
_replication?.ReplicateRequeue(message);
RaiseActivity("Retry", message.Category, $"Parked message {messageId} moved back to queue");
return true;
```
Add a regression test in `StoreAndForwardReplicationTests` that simulates the
delete-between-update-and-reload race and asserts the `Requeue` replication
operation is still emitted with the correct category.
**Resolution**
_Unresolved._
### StoreAndForward-021 — Design doc claims the Operation Tracking Table lives in StoreAndForward but the implementation is in SiteRuntime
| | |
|--|--|
| Severity | Medium |
| Category | Design-document adherence |
| Status | Open |
| Location | `docs/requirements/Component-StoreAndForward.md:21`, `:49`–`:51`, `:77`–`:87`, `:108`, `:114`; `src/ScadaLink.SiteRuntime/Tracking/OperationTrackingStore.cs:37`; `src/ScadaLink.StoreAndForward/` (whole module) |
**Description**
Component-StoreAndForward.md repeatedly assigns the **Operation Tracking Table** to
this component:
- **Responsibilities** (line 21): "Maintain a site-local **operation tracking table**
holding one row per `TrackedOperationId` for cached calls … the authoritative status
record consulted by `Tracking.Status(id)`."
- **Message Lifecycle** (lines 49–51): "the operation tracking table is the status
record and the S&F buffer is purely the retry mechanism. A cached call that succeeds
on its first immediate attempt is written directly as a terminal `Delivered` tracking
row and never enters the S&F buffer."
- **Operation Tracking Table** section (lines 77–87): "Alongside the S&F buffer DB,
each site node holds a **site-local operation tracking table** in SQLite. … Each row
records the operation kind (`TrackedOperationKind`) …"
The actual implementation lives outside this module:
`src/ScadaLink.SiteRuntime/Tracking/OperationTrackingStore.cs` (and
`IOperationTrackingStore`, `OperationTrackingOptions`).
The StoreAndForward project contains no references to the tracking store, owns no
`operation_tracking` table, and `StoreAndForwardService.NotifyCachedCallObserverAsync`
is only a hook handing telemetry context to an `ICachedCallLifecycleObserver` — the
audit bridge wired in `ScadaLink.AuditLog`. The S&F module is **not** the table's
owner; SiteRuntime is.
This is a real design-doc drift, not a code defect, and is flagged explicitly in the
brief's "Module-specific notes". The drift matters because the design doc's
discussion of the lifecycle — "immediate success writes a terminal Delivered tracking
row directly here", "operator discard sets terminal `Discarded`", "central never
mutates the mirror row directly" — places coordination responsibilities on the wrong
component. A reader looking for the source of truth for `Tracking.Status(id)` would
read `Component-StoreAndForward.md` and search `src/ScadaLink.StoreAndForward/` in
vain. The doc also lists Site Call Audit / Audit Log telemetry-emission as a S&F
responsibility (line 22), but the emission actually happens via the `AuditLog` site
component subscribing to `ICachedCallLifecycleObserver`.
**Recommendation**
Reconcile the doc with the code. The simplest fix is doc-side: update
Component-StoreAndForward.md to scope its responsibilities back to the retry
mechanism + replication + parked-message management, and add a cross-reference to a
new (or existing) component doc for Operation Tracking (Component-SiteRuntime.md, or
a new Component-OperationTracking.md). The code is internally consistent — the audit
bridge subscribes to the observer hook, the SiteRuntime store writes the rows, the S&F
engine emits attempt telemetry on the cached-call hot path — but the design doc is
several refactors out of date. The hierarchical map should be:
- `Component-StoreAndForward.md` → S&F buffer + Replication + Parked-message
management + Notification forwarding to central + cached-call telemetry **hook**.
- New doc / SiteRuntime doc → Operation Tracking Table semantics and lifecycle.
- `Component-SiteCallAudit.md` / `Component-AuditLog.md` → telemetry emission +
central-side mirror.
**Resolution**
_Unresolved._
### StoreAndForward-022 — `NotifyCachedCallObserverAsync` silently drops the entire audit lifecycle when the message id is not a parseable `TrackedOperationId`
| | |
|--|--|
| Severity | Low |
| Category | Documentation & comments |
| Status | Open |
| Location | `src/ScadaLink.StoreAndForward/StoreAndForwardService.cs:484`–`:515` |
**Description**
`NotifyCachedCallObserverAsync` (the per-attempt observer notifier wired by the M3
Bundle E rollout) bails out with no audit emission when
`TrackedOperationId.TryParse(message.Id, out var trackedId)` returns false
(`StoreAndForwardService.cs:510`–`:515`). The inline comment justifies the behaviour as
back-compat for "pre-M3 message (random GUID-N id from S&F itself, no
TrackedOperationId threaded in)", but the documented contract is broken in two ways:
1. **Silent dropping of every audit row, not just the first one.** The skip means no
`Attempted` row, no `CachedResolve` terminal row, no audit trail at all for that
operation's S&F lifecycle — yet the rest of the system (script trust boundary,
parked-message UI, etc.) still treats the operation as audit-tracked. The drop is
not surfaced via a metric, log warning (the path is a silent `return`), or counter,
so a misconfigured caller bypasses the audit hot path with zero feedback.
2. **The contract is hidden in field-level XML.** The `ICachedCallLifecycleObserver`
public interface contract (defined in `ScadaLink.Commons`) does not document that
the observer will be silently skipped when the underlying S&F message id is not a
GUID. A consumer reading the interface contract reasonably expects every cached-call
attempt to surface — the audit pipeline depends on it. The silent-drop is an
implementation detail of the S&F bridge that should be either lifted onto the
contract or removed.
The engine itself mints GUID-N ids via `Guid.NewGuid().ToString("N")` (line 224), which
`TrackedOperationId.TryParse` accepts, so the skip path is unreachable for engine-minted
ids. It is reachable only for callers that supply their own `messageId` argument with a
non-GUID format. The current callers (`NotificationOutbox` enqueue path with
NotificationId, cached-call enqueue path with `TrackedOperationId.ToString()`) all
supply GUID-shaped ids. The defect is latent — a future caller passing a non-GUID id
would silently bypass audit.
**Recommendation**
Two options. The cheap fix: change the skip to a `_logger.LogWarning` with the offending
id so a misconfigured caller is observable, and update the
`ICachedCallLifecycleObserver` XML doc to mention the "non-GUID id → no telemetry"
contract explicitly. The more correct fix: emit a still-audited row for the
non-GUID case (e.g. synthesise a `TrackedOperationId` from the underlying id, or emit a
distinguished "tracking-id-missing" audit row) so the audit pipeline never has silent
holes. Add a regression test in `CachedCallAttemptEmissionTests` capturing the chosen
contract — the existing
`Attempt_MessageIdNotAGuid_NoObserverNotification` test pins today's silent-skip; if
the fix is "log + skip", that test should be updated to also assert the log emission;
if the fix is "emit anyway", the test should be replaced.
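The cheap "log + skip" option might be shaped like the helper below. It is only a sketch: `ObserverGate` is a hypothetical name, `Guid.TryParse` stands in for `TrackedOperationId.TryParse` (assumed here to accept the same GUID formats), and the `warn` delegate stands in for a `_logger.LogWarning` call.

```csharp
using System;

public static class ObserverGate
{
    // Returns true when the message id can be tracked. Otherwise it emits a warning
    // so the skipped audit telemetry is observable instead of silent.
    public static bool TryGetTrackedId(string messageId, Action<string> warn, out Guid trackedId)
    {
        // Guid.TryParse is a stand-in for TrackedOperationId.TryParse (assumption).
        if (Guid.TryParse(messageId, out trackedId))
            return true;

        warn($"S&F message id '{messageId}' is not a TrackedOperationId; " +
             "cached-call audit telemetry skipped");
        return false;
    }
}
```

A caller would return early when the helper yields `false`, exactly as the current skip does, but the warning gives a misconfigured enqueue path a visible trace.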
**Resolution**
_Unresolved._
### StoreAndForward-023 — `siteId` silently defaults to empty when no `IStoreAndForwardSiteContext` is registered, degrading audit telemetry correlation
| | |
|--|--|
| Severity | Low |
| Category | Code organization & conventions |
| Status | Open |
| Location | `src/ScadaLink.StoreAndForward/ServiceCollectionExtensions.cs:43`–`:53`; `src/ScadaLink.StoreAndForward/StoreAndForwardService.cs:99`, `:524` |
**Description**
`AddStoreAndForward`'s service-collection factory resolves the optional
`IStoreAndForwardSiteContext` and falls back to `string.Empty` when not registered:
```csharp
var siteContext = sp.GetService<IStoreAndForwardSiteContext>();
var siteId = siteContext?.SiteId ?? string.Empty;
return new StoreAndForwardService(storage, options, logger, replication,
cachedCallObserver, siteId);
```
The constructor's parameter is even defaulted to `""`. The empty-string `siteId` flows
straight into every emitted `CachedCallAttemptContext.SourceSite`, which the central
audit pipeline uses as part of the `(SourceSite, TrackedOperationId)` correlation key.
A host that registers an `ICachedCallLifecycleObserver` (the audit observer wired by
`AddAuditLog`) but forgets to register `IStoreAndForwardSiteContext` will produce a
stream of telemetry rows with `SourceSite = ""` — the central audit mirror cannot
distinguish them by site, and the central-site routing of
`RetryParkedOperation`/`DiscardParkedOperation` commands keyed on `SourceSite` will
fail to find the owning site.
The Host's `IStoreAndForwardSiteContext` adapter and the audit observer registration
are wired in lock-step, so the current configuration is correct, but the silent
empty-string fallback is a contract hazard for future hosts (CLI test harness, second
site cluster topology, etc.) and for tests that wire one without the other.
**Recommendation**
Make the contract explicit: when `cachedCallObserver` is non-null, require
`IStoreAndForwardSiteContext` to be registered — throw an `InvalidOperationException`
with a clear "Audit observer registered without a site context — register
IStoreAndForwardSiteContext" message at construction time. When the audit observer is
absent (no `AddAuditLog`), keep the empty-string default since `_siteId` is unused.
Alternatively, change `siteId` from a parameter to a `Func<string>` resolved lazily
from the service provider so a late-registered context still takes effect.
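The fail-fast variant could be sketched as a small resolution helper called from the service-collection factory. `SiteContextGuard` is a hypothetical name; `observerRegistered` and `siteId` stand in for "an `ICachedCallLifecycleObserver` was resolved" and `IStoreAndForwardSiteContext.SiteId`. Treating a blank `SiteId` the same as a missing registration is an extra assumption beyond what the recommendation states.

```csharp
#nullable enable
using System;

public static class SiteContextGuard
{
    // Resolves the site id handed to the service constructor.
    public static string ResolveSiteId(bool observerRegistered, string? siteId)
    {
        // Without an audit observer the site id is never emitted, so the
        // empty-string default remains harmless.
        if (!observerRegistered)
            return siteId ?? string.Empty;

        // With an observer, a missing (or blank, by assumption) site id would
        // corrupt the (SourceSite, TrackedOperationId) correlation key.
        if (string.IsNullOrWhiteSpace(siteId))
            throw new InvalidOperationException(
                "Audit observer registered without a site context. " +
                "Register IStoreAndForwardSiteContext.");

        return siteId;
    }
}
```

Failing at construction time surfaces the misconfiguration during host start-up rather than as uncorrelatable rows in the central audit log.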
**Resolution**
_Unresolved._
### StoreAndForward-024 — `StopAsync` does not wait for an in-flight retry sweep, so disposed dependencies can be touched after shutdown
| | |
|--|--|
| Severity | Low |
| Category | Concurrency & thread safety |
| Status | Open |
| Location | `src/ScadaLink.StoreAndForward/StoreAndForwardService.cs:122`–`:127`, `:136`–`:143`, `:303`–`:329` |
**Description**
`StartAsync` arms `_retryTimer` with `_ => _ = RetryPendingMessagesAsync()` (line 123).
The `_ =` discards the returned `Task`, so when the timer fires the sweep runs **fire
and forget** on a thread-pool thread. `StopAsync` (lines 136–143) disposes the timer:
```csharp
if (_retryTimer != null)
{
    await _retryTimer.DisposeAsync();
    _retryTimer = null;
}
```
`Timer.DisposeAsync()` returns once any in-flight timer **callback** has completed —
but the timer callback in this service is a one-line `_ = RetryPendingMessagesAsync()`
that synchronously returns immediately and leaves the actual sweep running on the
thread pool. So `Timer.DisposeAsync` does not wait for the sweep; only for the
synchronous `_ = ...` discarding step. `StopAsync` returns while a sweep is potentially
still running, touching `_storage` (which the host will dispose), `_replication`
(which the host will tear down), and `_cachedCallObserver` (whose downstream gRPC
channel the host will shut down).
The host shutdown sequence (`AkkaHostedService`) tears down the actor system and the
DI container after this service's `StopAsync` completes — meaning a sweep that runs
past `StopAsync` can call into disposed `SqliteConnection`s (yielding
`ObjectDisposedException`, which the sweep's outer `try/catch` catches and logs) or, more
seriously, push a replication operation into a half-disposed Akka actor pipeline and
trigger noisy dead-letter warnings during a clean shutdown.
The race window is small (the sweep typically finishes in <100 ms in tests) but it is
real, particularly when shutting down a site under load with a non-empty buffer.
**Recommendation**
Track in-flight sweep tasks and `await` them in `StopAsync`:
```csharp
private Task? _currentSweep;

public async Task StopAsync()
{
    if (_retryTimer != null)
    {
        await _retryTimer.DisposeAsync();
        _retryTimer = null;
    }
    if (_currentSweep is { } sweep)
    {
        try { await sweep; } catch { /* logged inside RetryPendingMessagesAsync */ }
    }
}
```
Change the timer callback to:
```csharp
_retryTimer = new Timer(_ => _currentSweep = RetryPendingMessagesAsync(), ...);
```
Add a `CancellationTokenSource` so a long sweep can be cooperatively aborted on stop;
plumb the cancellation token into `_storage` / `_replication` / `_cachedCallObserver`
calls. Add a regression test in `StoreAndForwardServiceTests` that calls `StopAsync`
mid-sweep and asserts no further storage activity occurs after `StopAsync` returns.
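The cooperative-cancellation part might be sketched as below. `SweepRunner` is a hypothetical, stand-alone class rather than the service itself, and the `Func<CancellationToken, Task>` delegate stands in for a `RetryPendingMessagesAsync` overload that accepts a token (which this review does not show to exist today).

```csharp
#nullable enable
using System;
using System.Threading;
using System.Threading.Tasks;

public sealed class SweepRunner
{
    private readonly Func<CancellationToken, Task> _sweep; // assumed token-aware sweep
    private readonly CancellationTokenSource _cts = new();
    private readonly object _gate = new();
    private Timer? _timer;
    private Task _current = Task.CompletedTask;

    public SweepRunner(Func<CancellationToken, Task> sweep) => _sweep = sweep;

    public void Start(TimeSpan interval) =>
        _timer = new Timer(_ => Tick(), null, interval, interval);

    private void Tick()
    {
        lock (_gate)
        {
            // Skip ticks after stop was requested and ticks that would overlap a running sweep.
            if (_cts.IsCancellationRequested || !_current.IsCompleted) return;
            // Task.Run keeps the timer callback short and captures synchronous throws in the task.
            _current = Task.Run(() => _sweep(_cts.Token));
        }
    }

    public async Task StopAsync()
    {
        _cts.Cancel(); // ask a long-running sweep to abort cooperatively

        if (_timer is { } timer)
        {
            await timer.DisposeAsync(); // waits for timer callbacks, not for the sweep task
            _timer = null;
        }

        Task inFlight;
        lock (_gate) { inFlight = _current; }

        try { await inFlight; }               // now wait for the sweep itself
        catch (OperationCanceledException) { } // expected when the sweep honours the token
    }
}
```

Because `StopAsync` returns only after the tracked task completes, the host can dispose storage and replication afterwards without a sweep still holding references to them.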
**Resolution**
_Unresolved._