code-review: 2026-05-28 baseline re-review of all 23 modules at 1eb6e97

Re-applies the full 10-category checklist to every src/ project — including first-time reviews of the four newer components (AuditLog, NotificationOutbox, SiteCallAudit, Transport) — so the code-reviews/ index reflects today's codebase rather than the 2026-05-16 baseline. 172 new Open findings (0 Critical, 18 High, 62 Medium, 92 Low); 481 findings total across 23 modules. regen-readme.py now derives each module's Last reviewed + Commit from its findings.md header instead of hard-coding 2026-05-16 / 9c60592, so future single-module re-reviews show their own date in the Module Status table.
2026-05-28 02:55:47 -04:00
parent 1eb6e972b0
commit f93b7b99bb
25 changed files with 8793 additions and 115 deletions
@@ -5,10 +5,10 @@
 | Module | `src/ScadaLink.StoreAndForward` |
 | Design doc | `docs/requirements/Component-StoreAndForward.md` |
 | Status | Reviewed |
-| Last reviewed | 2026-05-17 |
+| Last reviewed | 2026-05-28 |
 | Reviewer | claude-agent |
-| Commit reviewed | `39d737e` |
-| Open findings | 0 (3 Deferred: 002, 011, 012 — see notes) |
+| Commit reviewed | `1eb6e97` |
+| Open findings | 7 (3 Deferred: 002, 011, 012; 7 new Open: 018–024 — see Re-review 2026-05-28) |

 ## Summary

@@ -55,6 +55,76 @@ StoreAndForward-017 records that the Retry/Discard activity-log entries hard-cod
 `ExternalSystem` category, mislabelling notification and cached-DB-write messages in
 the site event log.

+#### Re-review 2026-05-28 (commit `1eb6e97`)
+
+Full re-review against commit `1eb6e97` with the same 10-category checklist. The
+batch-3 / batch-4 resolutions (001, 003–010, 013–017) are still present and intact; no
+regressions detected on prior fixes. Findings 002, 011 and 012 remain validly
+`Deferred` (their preconditions are unchanged) and findings 005, 006, 010, 013, 014,
+016, 017 are confirmed `Resolved` against the current source.
+
+This pass surfaced **seven new findings**. Four cluster around two themes; the other
+three (022–024) are independent Low-severity items.
+
+The first theme is **design-doc drift on the notification path**, which has acquired
+two now-real defects since the engine became central-targeted. `StoreAndForward-018`
+(High) records that a corrupt notification payload, which
+`NotificationForwarder.DeliverAsync` handles by returning `false`, parks the
+notification on its first retry-sweep encounter, despite the design doc stating
+"Notifications do not park" (line 47, "Parking applies only to the
+external-system-call and cached-database-write categories"). The same path would
+become a poison-payload retry-forever trap on the active node if the engine ever
+softened the `false` semantics. `StoreAndForward-019` (Medium) records the
+sibling defect: notifications are enqueued with `MaxRetries` defaulting to
+`StoreAndForwardOptions.DefaultMaxRetries` (50), and the legacy SMTP path
+(`NotificationDeliveryService.SendAsync`) passes a positive bounded
+`smtpConfig.MaxRetries`, so an unreachable central will silently park notifications after a
+finite retry budget rather than "retry at the fixed forward interval until central acks"
+as the design requires. The contract `0 = no limit` is not enforced for the
+notification category.
+
+The second theme is **subtle correctness and contract gaps around the operator paths**
+that survived the StoreAndForward-016/017 batch. `StoreAndForward-020` (Medium) records
+that `RetryParkedMessageAsync` skips replication entirely if `GetMessageByIdAsync`
+returns null after a successful local requeue (a narrow but real race window with a
+concurrent discard / sweep delete), re-introducing the StoreAndForward-016 standby
+divergence in that corner. `StoreAndForward-021` (Medium) is a design-doc-vs-code drift
+that should be reconciled in the doc: the **operation tracking table** is documented
+inside Component-StoreAndForward.md as a S&F responsibility (lines 21, 49, 77–87, 108,
+114), but the actual `OperationTrackingStore` lives in
+`src/ScadaLink.SiteRuntime/Tracking/` and is not consumed by S&F at all; the brief's
+own note flags this. The
+design doc should be updated to point at SiteRuntime, or the store moved to
+StoreAndForward.
+
+`StoreAndForward-022` (Low) records that the engine silently drops audit telemetry
+when a buffered cached-call's `Id` is not a parseable `TrackedOperationId` GUID:
+`NotifyCachedCallObserverAsync` returns before notifying `_cachedCallObserver`. The
+engine's own default minting produces "N"-formatted GUIDs, which
+`TrackedOperationId.TryParse` accepts, but any caller passing a custom non-GUID id
+silently bypasses the entire
+`Submitted/Forwarded/Attempted/Delivered/Parked/Discarded` audit lifecycle.
+`StoreAndForward-023` (Low)
+records that `siteId` is silently defaulted to `string.Empty` when no
+`IStoreAndForwardSiteContext` is registered, so a misconfigured host produces audit
+telemetry with `SourceSite = ""` and the central audit-log's `(SourceSite,
+TrackedOperationId)` correlation degrades to a per-id-only index. `StoreAndForward-024`
+(Low) is a stop-time ordering defect: `StopAsync` disposes the timer, but a
+mid-flight `RetryPendingMessagesAsync` invocation continues using `_storage` and
+`_replication` after `StopAsync` returns, so the still-running sweep can touch
+resources that the host shutdown sequence (the DI container) has already disposed.
+
+## Checklist coverage — Re-review 2026-05-28
+
+| # | Category | Examined | Notes |
+|---|----------|----------|-------|
+| 1 | Correctness & logic bugs | ☑ | Notification corrupt-payload parks contrary to design (018); RetryParkedMessageAsync skips replication when message reload races a deletion (020). |
+| 2 | Akka.NET conventions | ☑ | `ParkedMessageHandlerActor` uses `PipeTo` correctly with success/failure projections (007 resolution preserved). No new findings. |
+| 3 | Concurrency & thread safety | ☑ | Sweep-vs-stop race: a timer callback running while `StopAsync` returns can touch disposed dependencies (024). |
+| 4 | Error handling & resilience | ☑ | Notifications park after `DefaultMaxRetries` exhaustion (019) — contradicts the design doc's "retried until central acks". |
+| 5 | Security | ☑ | No issues found — parameterised SQL throughout, payload JSON opaque, no secret material handled. |
+| 6 | Performance & resource management | ☑ | No new findings — the connection-per-call documented trade-off and pooled `OpenAsync` remain acceptable. |
+| 7 | Design-document adherence | ☑ | Operation Tracking Table documented in StoreAndForward but actually lives in SiteRuntime (021); notification non-parking guarantee broken by 018 + 019. |
+| 8 | Code organization & conventions | ☑ | `IStoreAndForwardSiteContext` silently defaults `SiteId` to empty (023) — a configuration hole rather than an entity placement issue. |
+| 9 | Testing coverage | ☑ | The seven new findings have no regression tests in `tests/ScadaLink.StoreAndForward.Tests/` — particularly the notification-doesn't-park invariant (018, 019), the requeue-after-reload-null replication gap (020), and the stop-during-sweep behaviour (024). |
+| 10 | Documentation & comments | ☑ | `CachedCallAttemptOutcome.ParkedMaxRetries` XML doc says "S&F semantics" but the code applies it to notifications too if 018/019 fire — minor drift, captured under 018. The `TrackedOperationId.TryParse` silent-skip behaviour in `NotifyCachedCallObserverAsync` is documented in the source but not on the public observer contract (022). |
+
 ## Checklist coverage

 | # | Category | Examined | Notes |
@@ -914,3 +984,474 @@ the StoreAndForward-016 replication) — and pass it to `RaiseActivity` (falling
 `RetryParkedMessageAsync_ActivityUsesMessageRealCategory` and
 `DiscardParkedMessageAsync_ActivityUsesMessageRealCategory` assert the activity carries
 `Notification` / `CachedDbWrite` respectively; both fail against the pre-fix code.
+
+### StoreAndForward-018 — Notification corrupt-payload parks the buffered message, contradicting the "notifications do not park" design invariant
+
+| | |
+|--|--|
+| Severity | High |
+| Category | Design-document adherence |
+| Status | Open |
+| Location | `src/ScadaLink.StoreAndForward/NotificationForwarder.cs:62`–`:69`, `:105`–`:122`; `src/ScadaLink.StoreAndForward/StoreAndForwardService.cs:369`–`:397` |
+
+**Description**
+
+The Component design doc explicitly carves out notifications from the parking lifecycle:
+
+> "Notifications do not park — they are retried at the fixed forward interval until
+> central acks." (`docs/requirements/Component-StoreAndForward.md:47`)
+> "Parking applies only to the external-system-call and cached-database-write
+> categories." (same line)
+
+`NotificationForwarder.DeliverAsync` violates this. When `TryBuildSubmit` fails to
+deserialize the buffered payload — either because `JsonSerializer.Deserialize` throws a
+`JsonException` (line 114) or because it returns `null` (line 119) — `DeliverAsync`
+returns `false` (line 68). On the **retry path** the S&F engine treats handler `false`
+as a permanent failure and **parks the message immediately** via the conditional
+`UpdateMessageIfStatusAsync(... Parked)` write at `StoreAndForwardService.cs:373`–`:385`.
+Result: a notification with a corrupt buffered payload (a row whose payload the engine
+itself treats as opaque: "Payload: Serialized message content…",
+`Component-StoreAndForward.md:110`) enters the parked state and surfaces in the central
+UI's parked-message list
+under the `Notification` category, contradicting the doc's invariant and the resolved
+StoreAndForward-017's "Notification / CachedDbWrite" Retry/Discard category mapping.
+
+The defect is real today: the inline comment on `NotificationForwarder.cs:64` even
+documents the violation ("An unreadable payload cannot be fixed by retrying — park it
+(return false)") as the intended behaviour, but that behaviour is what the design doc
+forbids. Either the doc needs to acknowledge a poison-payload parking exception for
+notifications, or the forwarder needs a different escape hatch (discard? log + drop?
+permanent-failure-as-`true` to clear the buffer?). Today there is no consistent answer
+between code and design.
+
+Additionally, on the **immediate-delivery** path (a fresh enqueue followed by a
+`DeliverAsync` returning `false`), the engine returns `WasBuffered: false` and the row
+is never persisted — so the corrupt-payload "park" only occurs on the retry path, where
+the message has already been buffered (and replicated to the standby). The
+**inconsistency between the two paths** ("not buffered" vs "parked") for the same
+permanent-failure outcome is itself a contract surprise; the resolved StoreAndForward-004
+documents the immediate vs retry asymmetry, but does not anticipate that the retry
+asymmetry will violate a per-category invariant.
+
+**Recommendation**
+
+Choose one consistent reconciliation. Preferred option: change
+`NotificationForwarder.DeliverAsync` to **discard** a corrupt payload rather than park
+it: delete the buffered row directly, log a Site Event Log entry under `Discard`, and
+return `true` so
+the engine clears the buffer. This preserves the design's "notifications do not park"
+invariant. Alternatives: (a) update the design doc to acknowledge a poison-payload
+parking exception specifically for notifications, and revise the resolved
+StoreAndForward-017 wording; (b) treat `JsonException` as transient (would retry-forever
+on a corrupt payload — bad); (c) introduce a per-category park-allowed flag on the
+engine and gate the retry-path park behind it for the Notification category.
+Add a regression test in
+`tests/ScadaLink.StoreAndForward.Tests/NotificationForwarderTests.cs` asserting that a
+corrupt-payload notification reaches a terminal **non-Parked** state; today the
+corrupt-payload behaviour is uncovered.
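+
+A minimal sketch of the preferred option, assuming a hypothetical `DeliveryOutcome`
+enum, `NotificationPayload` record and `CorruptPayloadPolicy.Classify` helper (none of
+these exist in the reviewed source, where the forwarder returns `bool`). It shows an
+unreadable payload mapped to a terminal discard outcome instead of the `false` that
+the retry path turns into a park:
+
+```csharp
+using System;
+using System.Text.Json;
+
+// Hypothetical stand-in for the buffered notification payload type.
+public sealed record NotificationPayload(string Subject, string Body);
+
+// Hypothetical outcome type; the reviewed forwarder returns bool instead.
+public enum DeliveryOutcome { Delivered, TransientFailure, DiscardedCorrupt }
+
+public static class CorruptPayloadPolicy
+{
+    // Classifies one delivery attempt. A payload that fails to deserialize
+    // (JsonException or a JSON null) is a terminal discard, never a park.
+    public static DeliveryOutcome Classify(
+        string payloadJson, Func<NotificationPayload, bool> forward)
+    {
+        NotificationPayload? payload;
+        try
+        {
+            payload = JsonSerializer.Deserialize<NotificationPayload>(payloadJson);
+        }
+        catch (JsonException)
+        {
+            return DeliveryOutcome.DiscardedCorrupt;
+        }
+
+        if (payload is null)
+        {
+            return DeliveryOutcome.DiscardedCorrupt;
+        }
+
+        // `forward` stands in for the site-to-central submit call.
+        return forward(payload)
+            ? DeliveryOutcome.Delivered
+            : DeliveryOutcome.TransientFailure;
+    }
+}
+```
+
+A caller would delete the buffered row and log the `Discard` entry on
+`DiscardedCorrupt`, and leave the row `Pending` on `TransientFailure`.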
+
+**Resolution**
+
+_Unresolved._
+
+### StoreAndForward-019 — Notifications park after `DefaultMaxRetries` exhaustion, contradicting "retried until central acks"
+
+| | |
+|--|--|
+| Severity | Medium |
+| Category | Error handling & resilience |
+| Status | Open |
+| Location | `src/ScadaLink.StoreAndForward/StoreAndForwardService.cs:229`, `:407`–`:437`; `src/ScadaLink.StoreAndForward/StoreAndForwardOptions.cs:18`; `src/ScadaLink.SiteRuntime/Scripts/ScriptRuntimeContext.cs:1773`–`:1778`; `src/ScadaLink.NotificationService/NotificationDeliveryService.cs:149`–`:156` |
+
+**Description**
+
+The design doc requires a buffered notification to be retried indefinitely until
+central acks:
+
+> "The **notification** category retries differently: it has no source-entity setting.
+> The site→central forward uses a single fixed retry interval configured in the host
+> `appsettings.json`. … A buffered notification is retried until central acks it; it is
+> not parked on a retry limit (central, once reachable, owns delivery, retry, and
+> parking from that point on)." (`Component-StoreAndForward.md:55`–`:59`)
+
+The current engine cannot honour that. `RetryMessageAsync` enforces parking at
+`message.MaxRetries > 0 && message.RetryCount >= message.MaxRetries`
+(`StoreAndForwardService.cs:407`); a `MaxRetries == 0` is the documented "no limit"
+escape hatch (now correctly explained by the resolved StoreAndForward-015). But the two
+notification enqueue paths both supply a positive bounded `MaxRetries`:
+
+- `ScriptRuntimeContext.cs:1773`–`:1778` (the `Notify.Send` site script path) calls
+  `EnqueueAsync` without supplying the `maxRetries` argument, so the engine
+  defaults to `StoreAndForwardOptions.DefaultMaxRetries = 50`
+  (`StoreAndForwardOptions.cs:18`). After 50 retry sweeps with central unreachable, the
+  notification is parked.
+- `NotificationDeliveryService.cs:149`–`:156` (the legacy SMTP-style path retained for
+  the central-side `INotificationDeliveryService` callers) passes
+  `smtpConfig.MaxRetries > 0 ? smtpConfig.MaxRetries : null` — `null` falls back to the
+  same 50-retry default, and any positive `smtpConfig.MaxRetries` still bounds the
+  retry budget. Either way, a long central outage parks the notification.
+
+A parked notification cannot be cleared by a central recovery: it stays parked until an
+operator clicks **Retry** in the parked-message UI. The design's invariant — that
+notification delivery converges automatically as soon as central is reachable — is
+broken: an extended central outage requires manual intervention to clear the backlog,
+which is exactly the behaviour the central-only outbox redesign was meant to remove
+from the site.
+
+This is closely related to (but distinct from) StoreAndForward-018: 018 is the
+*permanent-failure-path* parking violation; 019 is the *transient-failure-path*
+parking violation under the engine's normal max-retries policy.
+
+**Recommendation**
+
+Make the notification enqueue paths pass `maxRetries: 0` so the documented "no limit /
+never parked" semantics apply, and guard against regression by adding an integration
+test in `tests/ScadaLink.StoreAndForward.Tests/NotificationForwarderTests.cs` that runs
+a sweep many more times than `DefaultMaxRetries` against an always-failing handler and
+asserts the buffered notification's status stays `Pending` (not `Parked`). A cleaner
+alternative is to special-case the `Notification` category inside
+`RetryMessageAsync`'s max-retries guard (treat it as `MaxRetries == 0` regardless of
+the field value) so the invariant is enforced at the single chokepoint rather than
+relying on every caller to pass the right value — this also fixes the legacy
+`NotificationDeliveryService` path without editing the consumer.
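+
+The chokepoint alternative can be sketched as a pure predicate. `Category` and
+`RetryPolicy.ShouldPark` are hypothetical names used for illustration; only the
+`MaxRetries > 0 && RetryCount >= MaxRetries` condition is taken from the reviewed
+source:
+
+```csharp
+// Hypothetical mirror of the S&F category enum, for illustration only.
+public enum Category { ExternalSystem, CachedDbWrite, Notification }
+
+public static class RetryPolicy
+{
+    // Decides whether a failed retry should park the message.
+    // Notifications never park, whatever MaxRetries value the caller supplied.
+    public static bool ShouldPark(Category category, int maxRetries, int retryCount)
+    {
+        if (category == Category.Notification)
+        {
+            return false;
+        }
+
+        // Existing guard: 0 means "no limit", a positive value is a retry budget.
+        return maxRetries > 0 && retryCount >= maxRetries;
+    }
+}
+```
+
+With this shape, a notification enqueued with the default budget of 50 still reports
+`ShouldPark(...) == false` on its thousandth failed attempt, while the other two
+categories keep their current behaviour.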
+
+**Resolution**
+
+_Unresolved._
+
+### StoreAndForward-020 — `RetryParkedMessageAsync` skips standby replication when the message is deleted between local update and re-load
+
+| | |
+|--|--|
+| Severity | Medium |
+| Category | Concurrency & thread safety |
+| Status | Open |
+| Location | `src/ScadaLink.StoreAndForward/StoreAndForwardService.cs:599`–`:616` |
+
+**Description**
+
+The StoreAndForward-016 resolution wired Requeue replication into operator-initiated
+retry. The fix uses a two-step pattern:
+
+```csharp
+public async Task<bool> RetryParkedMessageAsync(string messageId)
+{
+    var success = await _storage.RetryParkedMessageAsync(messageId);   // step 1
+    if (success)
+    {
+        var message = await _storage.GetMessageByIdAsync(messageId);    // step 2 (no txn)
+        var category = message?.Category ?? StoreAndForwardCategory.ExternalSystem;
+        if (message != null)
+        {
+            _replication?.ReplicateRequeue(message);                    // step 3
+        }
+        RaiseActivity("Retry", category, ...);
+    }
+    return success;
+}
+```
+
+The two storage calls are on separate connections with no surrounding transaction. A
+concurrent writer between step 1 (which moved the row from Parked → Pending) and step 2
+(which re-reads the row) can delete or mutate the row:
+
+- An operator who issues `DiscardParkedMessageAsync` immediately after the retry is
+  harmless: the discard's storage call is conditional on `status = Parked`, so it is a
+  no-op (correct). A sweep that succeeds in delivering the just-requeued row, however,
+  then calls `_storage.RemoveMessageAsync` (unconditional), which deletes it. The race
+  is real within a single retry-sweep cycle because `DefaultRetryInterval = Zero` is the
+  standard test default, and the operator action and a sweep tick can overlap.
+- When such a `RemoveMessageAsync` runs in step 1's wake, `GetMessageByIdAsync` returns
+  null and step 3 (`_replication?.ReplicateRequeue`) is **skipped entirely**, even
+  though step 1 already requeued the row locally. The standby is left in `Parked` state
+  while the active node has gone Pending-then-Deleted, exactly the standby divergence
+  StoreAndForward-016 was supposed to fix. (A subsequent failover then lands on a
+  Parked standby copy of a message the active node no longer holds, the same
+  regression 016 already documented.)
+
+The category-fallback path (`StoreAndForwardCategory.ExternalSystem` when `message` is
+null) also silently mislabels the activity log entry. This is the same defect that
+StoreAndForward-017 fixed for the non-racy path, except that this branch hands back a
+hard-coded fallback because the re-load returned nothing. The activity log entry is a minor side
+effect; the missing replication is the real defect.
+
+**Recommendation**
+
+Capture the message **once**, before the local Parked → Pending storage update, so the
+replication path has the row in hand even if a concurrent writer deletes it
+afterwards:
+
+```csharp
+var message = await _storage.GetMessageByIdAsync(messageId);  // before the update
+if (message == null || message.Status != StoreAndForwardMessageStatus.Parked)
+    return false;
+
+var success = await _storage.RetryParkedMessageAsync(messageId);
+if (!success) return false;
+
+// `message` was the parked row; the active node just wrote it back to Pending with
+// retry_count = 0 — construct the replicated state from those known mutations.
+message.Status = StoreAndForwardMessageStatus.Pending;
+message.RetryCount = 0;
+message.LastError = null;
+message.LastAttemptAt = null;
+_replication?.ReplicateRequeue(message);
+RaiseActivity("Retry", message.Category, $"Parked message {messageId} moved back to queue");
+return true;
+```
+
+Add a regression test in `StoreAndForwardReplicationTests` that simulates the
+delete-between-update-and-reload race and asserts the `Requeue` replication
+operation is still emitted with the correct category.
+
+**Resolution**
+
+_Unresolved._
+
+### StoreAndForward-021 — Design doc claims the Operation Tracking Table lives in StoreAndForward but the implementation is in SiteRuntime
+
+| | |
+|--|--|
+| Severity | Medium |
+| Category | Design-document adherence |
+| Status | Open |
+| Location | `docs/requirements/Component-StoreAndForward.md:21`, `:49`–`:51`, `:77`–`:87`, `:108`, `:114`; `src/ScadaLink.SiteRuntime/Tracking/OperationTrackingStore.cs:37`; `src/ScadaLink.StoreAndForward/` (whole module) |
+
+**Description**
+
+Component-StoreAndForward.md repeatedly assigns the **Operation Tracking Table** to
+this component:
+
+- **Responsibilities** (line 21): "Maintain a site-local **operation tracking table**
+  holding one row per `TrackedOperationId` for cached calls … the authoritative status
+  record consulted by `Tracking.Status(id)`."
+- **Message Lifecycle** (lines 49–51): "the operation tracking table is the status
+  record and the S&F buffer is purely the retry mechanism. A cached call that succeeds
+  on its first immediate attempt is written directly as a terminal `Delivered` tracking
+  row and never enters the S&F buffer."
+- **Operation Tracking Table** section (lines 77–87): "Alongside the S&F buffer DB,
+  each site node holds a **site-local operation tracking table** in SQLite. … Each row
+  records the operation kind (`TrackedOperationKind`) …"
+
+The actual implementation lives outside this module:
+`src/ScadaLink.SiteRuntime/Tracking/OperationTrackingStore.cs` (and
+`IOperationTrackingStore`, `OperationTrackingOptions`).
+The StoreAndForward project contains no references to the tracking store, owns no
+`operation_tracking` table, and `StoreAndForwardService.NotifyCachedCallObserverAsync`
+is only a hook handing telemetry context to an `ICachedCallLifecycleObserver` — the
+audit bridge wired in `ScadaLink.AuditLog`. The S&F module is **not** the table's
+owner; SiteRuntime is.
+
+This is a real design-doc drift, not a code defect, and is flagged explicitly in the
+brief's "Module-specific notes". The drift matters because the design doc's
+discussion of the lifecycle — "immediate success writes a terminal Delivered tracking
+row directly here", "operator discard sets terminal `Discarded`", "central never
+mutates the mirror row directly" — places coordination responsibilities on the wrong
+component. A reader looking for the source of truth for `Tracking.Status(id)` would
+read `Component-StoreAndForward.md` and search `src/ScadaLink.StoreAndForward/` in
+vain. The doc also lists Site Call Audit / Audit Log telemetry-emission as a S&F
+responsibility (line 22), but the emission actually happens via the `AuditLog` site
+component subscribing to `ICachedCallLifecycleObserver`.
+
+**Recommendation**
+
+Reconcile the doc with the code. The simplest fix is doc-side: update
+Component-StoreAndForward.md to scope its responsibilities back to the retry
+mechanism + replication + parked-message management, and add a cross-reference to a
+new (or existing) component doc for Operation Tracking (Component-SiteRuntime.md, or
+a new Component-OperationTracking.md). The code is internally consistent — the audit
+bridge subscribes to the observer hook, the SiteRuntime store writes the rows, the S&F
+engine emits attempt telemetry on the cached-call hot path — but the design doc is
+several refactors out of date. The hierarchical map should be:
+
+- `Component-StoreAndForward.md` → S&F buffer + Replication + Parked-message
+  management + Notification forwarding to central + cached-call telemetry **hook**.
+- New doc / SiteRuntime doc → Operation Tracking Table semantics and lifecycle.
+- `Component-SiteCallAudit.md` / `Component-AuditLog.md` → telemetry emission +
+  central-side mirror.
+
+**Resolution**
+
+_Unresolved._
+
+### StoreAndForward-022 — `NotifyCachedCallObserverAsync` silently drops the entire audit lifecycle when the message id is not a parseable `TrackedOperationId`
+
+| | |
+|--|--|
+| Severity | Low |
+| Category | Documentation & comments |
+| Status | Open |
+| Location | `src/ScadaLink.StoreAndForward/StoreAndForwardService.cs:484`–`:515` |
+
+**Description**
+
+`NotifyCachedCallObserverAsync` (the per-attempt observer notifier wired by the M3
+Bundle E rollout) bails out with no audit emission when
+`TrackedOperationId.TryParse(message.Id, out var trackedId)` returns false
+(`StoreAndForwardService.cs:510`–`:515`). The inline comment justifies the behaviour as
+back-compat for "pre-M3 message (random GUID-N id from S&F itself, no
+TrackedOperationId threaded in)", but the documented contract is broken in two ways:
+
+1. **Silent dropping of every audit row, not just the first one.** The skip means no
+   `Attempted` row, no `CachedResolve` terminal row, no audit trail at all for that
+   operation's S&F lifecycle — yet the rest of the system (script trust boundary,
+   parked-message UI, etc.) still treats the operation as audit-tracked. The drop is
+   not surfaced via a metric, log warning (the path is a silent `return`), or counter,
+   so a misconfigured caller bypasses the audit hot path with zero feedback.
+
+2. **The contract is hidden in field-level XML.** The `ICachedCallLifecycleObserver`
+   public interface contract (defined in `ScadaLink.Commons`) does not document that
+   the observer will be silently skipped when the underlying S&F message id is not a
+   GUID. A consumer reading the interface contract reasonably expects every cached-call
+   attempt to surface — the audit pipeline depends on it. The silent-drop is an
+   implementation detail of the S&F bridge that should be either lifted onto the
+   contract or removed.
+
+The engine itself mints GUID-N ids via `Guid.NewGuid().ToString("N")` (line 224), which
+`TrackedOperationId.TryParse` accepts, so the skip path is unreachable for engine-minted
+ids. It is reachable only for callers that supply their own `messageId` argument with a
+non-GUID format. The current callers (`NotificationOutbox` enqueue path with
+NotificationId, cached-call enqueue path with `TrackedOperationId.ToString()`) all
+supply GUID-shaped ids. The defect is latent — a future caller passing a non-GUID id
+would silently bypass audit.
+
+**Recommendation**
+
+Two options. The cheap fix: change the skip to a `_logger.LogWarning` with the offending
+id so a misconfigured caller is observable, and update the
+`ICachedCallLifecycleObserver` XML doc to mention the "non-GUID id → no telemetry"
+contract explicitly. The more correct fix: emit a still-audited row for the
+non-GUID case (e.g. synthesise a `TrackedOperationId` from the underlying id, or emit a
+distinguished "tracking-id-missing" audit row) so the audit pipeline never has silent
+holes. Add a regression test in `CachedCallAttemptEmissionTests` capturing the chosen
+contract — the existing
+`Attempt_MessageIdNotAGuid_NoObserverNotification` test pins today's silent-skip; if
+the fix is "log + skip", that test should be updated to also assert the log emission;
+if the fix is "emit anyway", the test should be replaced.
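+
+A minimal sketch of the "log + skip" option. It assumes `Guid.TryParse` as a stand-in
+for `TrackedOperationId.TryParse` and an `Action<string>` in place of the service's
+logger; `ObserverGuard` and `TryGetTrackedId` are hypothetical names:
+
+```csharp
+using System;
+
+public static class ObserverGuard
+{
+    // Returns true when the message id parses as a GUID and the observer should
+    // be notified. Otherwise reports the skip through `warn` and returns false,
+    // so the dropped telemetry is at least visible in the log.
+    public static bool TryGetTrackedId(
+        string messageId, Action<string> warn, out Guid trackedId)
+    {
+        if (Guid.TryParse(messageId, out trackedId))
+        {
+            return true;
+        }
+
+        warn($"Cached-call telemetry skipped: message id '{messageId}' " +
+             "is not a parseable tracked operation id.");
+        return false;
+    }
+}
+```
+
+`Guid.TryParse` accepts the "N" format that the engine mints, so engine-generated ids
+take the `true` branch and only caller-supplied non-GUID ids produce the warning.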
+
+**Resolution**
+
+_Unresolved._
+
+### StoreAndForward-023 — `siteId` silently defaults to empty when no `IStoreAndForwardSiteContext` is registered, degrading audit telemetry correlation
+
+| | |
+|--|--|
+| Severity | Low |
+| Category | Code organization & conventions |
+| Status | Open |
+| Location | `src/ScadaLink.StoreAndForward/ServiceCollectionExtensions.cs:43`–`:53`; `src/ScadaLink.StoreAndForward/StoreAndForwardService.cs:99`, `:524` |
+
+**Description**
+
+`AddStoreAndForward`'s service-collection factory resolves the optional
+`IStoreAndForwardSiteContext` and falls back to `string.Empty` when not registered:
+
+```csharp
+var siteContext = sp.GetService<IStoreAndForwardSiteContext>();
+var siteId = siteContext?.SiteId ?? string.Empty;
+return new StoreAndForwardService(storage, options, logger, replication,
+    cachedCallObserver, siteId);
+```
+
+The constructor's parameter is even defaulted to `""`. The empty-string `siteId` flows
+straight into every emitted `CachedCallAttemptContext.SourceSite`, which the central
+audit pipeline uses as part of the `(SourceSite, TrackedOperationId)` correlation key.
+A host that registers an `ICachedCallLifecycleObserver` (the audit observer wired by
+`AddAuditLog`) but forgets to register `IStoreAndForwardSiteContext` will produce a
+stream of telemetry rows with `SourceSite = ""` — the central audit mirror cannot
+distinguish them by site, and the central-site routing of
+`RetryParkedOperation`/`DiscardParkedOperation` commands keyed on `SourceSite` will
+fail to find the owning site.
+
+The Host's `IStoreAndForwardSiteContext` adapter and the audit observer registration
+are wired in lock-step, so the current configuration is correct, but the silent
+empty-string fallback is a contract hazard for future hosts (CLI test harness, second
+site cluster topology, etc.) and for tests that wire one without the other.
+
+**Recommendation**
+
+Make the contract explicit: when `cachedCallObserver` is non-null, require
+`IStoreAndForwardSiteContext` to be registered — throw an `InvalidOperationException`
+with a clear "Audit observer registered without a site context — register
+IStoreAndForwardSiteContext" message at construction time. When the audit observer is
+absent (no `AddAuditLog`), keep the empty-string default since `_siteId` is unused.
+Alternatively, change `siteId` from a parameter to a `Func<string>` resolved lazily
+from the service provider so a late-registered context still takes effect.
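+
+The construction-time guard could look roughly like this. `SiteContextGuard` and
+`ResolveSiteId` are hypothetical names, and a plain `object?` stands in for the
+registered `ICachedCallLifecycleObserver`:
+
+```csharp
+using System;
+
+public static class SiteContextGuard
+{
+    // Resolves the site id used for audit telemetry.
+    // Throws when an observer is wired but no usable site id was supplied;
+    // falls back to the empty string only when telemetry is not emitted at all.
+    public static string ResolveSiteId(object? observer, string? siteId)
+    {
+        if (observer is not null && string.IsNullOrWhiteSpace(siteId))
+        {
+            throw new InvalidOperationException(
+                "Audit observer registered without a site context; " +
+                "register IStoreAndForwardSiteContext.");
+        }
+
+        return siteId ?? string.Empty;
+    }
+}
+```
+
+Failing at construction keeps the misconfiguration from surfacing later as audit rows
+with `SourceSite = ""`.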
+
+**Resolution**
+
+_Unresolved._
+
+### StoreAndForward-024 — `StopAsync` does not wait for an in-flight retry sweep, so disposed dependencies can be touched after shutdown
+
+| | |
+|--|--|
+| Severity | Low |
+| Category | Concurrency & thread safety |
+| Status | Open |
+| Location | `src/ScadaLink.StoreAndForward/StoreAndForwardService.cs:122`–`:127`, `:136`–`:143`, `:303`–`:329` |
+
+**Description**
+
+`StartAsync` arms `_retryTimer` with `_ => _ = RetryPendingMessagesAsync()` (line 123).
+The `_ =` discards the returned `Task`, so when the timer fires the sweep runs **fire
+and forget** on a thread-pool thread. `StopAsync` (lines 136–143) disposes the timer:
+
+```csharp
+if (_retryTimer != null)
+{
+    await _retryTimer.DisposeAsync();
+    _retryTimer = null;
+}
+```
+
+`Timer.DisposeAsync()` returns once any in-flight timer **callback** has completed —
+but the timer callback in this service is a one-line `_ = RetryPendingMessagesAsync()`
+that synchronously returns immediately and leaves the actual sweep running on the
+thread pool. So `Timer.DisposeAsync` does not wait for the sweep; only for the
+synchronous `_ = ...` discarding step. `StopAsync` returns while a sweep is potentially
+still running, touching `_storage` (which the host will dispose), `_replication`
+(which the host will tear down), and `_cachedCallObserver` (whose downstream gRPC
+channel the host will shut down).
+
+The host shutdown sequence (`AkkaHostedService`) tears down the actor system and the
+DI container after this service's `StopAsync` completes — meaning a sweep that runs
+past `StopAsync` can call into disposed `SqliteConnection`s (yielding
+`ObjectDisposedException`, caught by the sweep's outer `try/catch` as a log) or, more
+seriously, push a replication operation into a half-disposed Akka actor pipeline and
+trigger noisy dead-letter warnings during a clean shutdown.
+
+The race window is small (the sweep typically finishes in <100 ms in tests) but it is
+real, particularly when shutting down a site under load with a non-empty buffer.
+
+**Recommendation**
+
+Track in-flight sweep tasks and `await` them in `StopAsync`:
+
+```csharp
+private Task? _currentSweep;
+
+public async Task StopAsync()
+{
+    if (_retryTimer != null)
+    {
+        await _retryTimer.DisposeAsync();
+        _retryTimer = null;
+    }
+    if (_currentSweep is { } sweep)
+    {
+        try { await sweep; } catch { /* logged inside RetryPendingMessagesAsync */ }
+    }
+}
+```
+
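+Change the timer callback to record the task it starts (a single field only remembers
+the most recent sweep, so this assumes sweeps do not overlap):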
+
+```csharp
+_retryTimer = new Timer(_ => _currentSweep = RetryPendingMessagesAsync(), ...);
+```
+
+Add a `CancellationTokenSource` so a long sweep can be cooperatively aborted on stop;
+plumb the cancellation token into `_storage` / `_replication` / `_cachedCallObserver`
+calls. Add a regression test in `StoreAndForwardServiceTests` that calls `StopAsync`
+mid-sweep and asserts no further storage activity occurs after `StopAsync` returns.
+
+**Resolution**
+
+_Unresolved._
+