code-review: 2026-05-28 baseline re-review of all 23 modules at 1eb6e97

Re-applies the full 10-category checklist to every src/ project — including
first-time reviews of the four newer components (AuditLog, NotificationOutbox,
SiteCallAudit, Transport) — so the code-reviews/ index reflects today's
codebase rather than the 2026-05-16 baseline. 172 new Open findings (0
Critical, 18 High, 62 Medium, 92 Low); 481 findings total across 23 modules.

regen-readme.py now derives each module's Last reviewed + Commit from its
findings.md header instead of hard-coding 2026-05-16 / 9c60592, so future
single-module re-reviews show their own date in the Module Status table.
@@ -5,10 +5,10 @@
 | Module | `src/ScadaLink.StoreAndForward` |
 | Design doc | `docs/requirements/Component-StoreAndForward.md` |
 | Status | Reviewed |
-| Last reviewed | 2026-05-17 |
+| Last reviewed | 2026-05-28 |
 | Reviewer | claude-agent |
-| Commit reviewed | `39d737e` |
-| Open findings | 0 (3 Deferred: 002, 011, 012 — see notes) |
+| Commit reviewed | `1eb6e97` |
+| Open findings | 7 (3 Deferred: 002, 011, 012; 7 new Open: 018–024 — see Re-review 2026-05-28) |

 ## Summary

@@ -55,6 +55,76 @@ StoreAndForward-017 records that the Retry/Discard activity-log entries hard-cod
`ExternalSystem` category, mislabelling notification and cached-DB-write messages in
the site event log.

#### Re-review 2026-05-28 (commit `1eb6e97`)

Full re-review against commit `1eb6e97` with the same 10-category checklist. The
batch-3 / batch-4 resolutions (001, 003–010, 013–017) are still present and intact; no
regressions detected on prior fixes. Findings 002, 011 and 012 remain validly
`Deferred` (their preconditions are unchanged) and findings 005, 006, 010, 013, 014,
016, 017 are confirmed `Resolved` against the current source.

This pass surfaced **seven new findings**: four clustered around two themes, plus
three Low-severity items.

The first theme is **design-doc drift on the notification path**, which has acquired
two concrete defects since the engine became central-targeted. `StoreAndForward-018`
(High) records that a corrupt notification payload — handled in
`NotificationForwarder.DeliverAsync` by returning `false` — parks a notification on its
first retry-sweep encounter, despite the design doc stating "Notifications do not park"
(line 47, "Parking applies only to the external-system-call and cached-database-write
categories"). The same path would become a poison-payload retry-forever trap on the
active node if the engine ever softened the `false` semantics. `StoreAndForward-019`
(Medium) records the sibling defect: notifications are enqueued with `MaxRetries`
defaulting to `StoreAndForwardOptions.DefaultMaxRetries` (50), and the legacy SMTP path
(`NotificationDeliveryService.SendAsync`) passes a positive bounded
`smtpConfig.MaxRetries` — so an unreachable central will silently park notifications
after a finite retry budget rather than "retry at the fixed forward interval until
central acks" as the design requires. The contract `0 = no limit` is not enforced for
the notification category.

The second theme is **subtle correctness and contract gaps around the operator paths**
that survived the StoreAndForward-016/017 batch. `StoreAndForward-020` (Medium) records
that `RetryParkedMessageAsync` skips replication entirely if `GetMessageByIdAsync`
returns null after a successful local requeue (a narrow but real race window with a
concurrent discard / sweep delete), re-introducing the StoreAndForward-016 standby
divergence in that corner. `StoreAndForward-021` (Medium) is a design-doc-vs-code drift
that should be reconciled in the doc: the **operation tracking table** is documented
inside Component-StoreAndForward.md as a S&F responsibility (lines 21, 49, 77–87, 108,
114), but the actual `OperationTrackingStore` lives in
`src/ScadaLink.SiteRuntime/Tracking/` and is not consumed by S&F at all — the brief's
own note flags this. The design doc should be updated to point at SiteRuntime, or the
store moved to StoreAndForward.

`StoreAndForward-022` (Low) records that `_cachedCallObserver` silently drops audit
telemetry when a buffered cached-call's `Id` is not a parseable `TrackedOperationId`
GUID: the engine returns from `NotifyCachedCallObserverAsync` before emitting anything.
The engine's own default minting produces "N"-formatted GUIDs, which
`TrackedOperationId.TryParse` accepts, but any caller passing a custom non-GUID id
silently bypasses the entire
`Submitted/Forwarded/Attempted/Delivered/Parked/Discarded` audit lifecycle.
`StoreAndForward-023` (Low) records that `siteId` is silently defaulted to
`string.Empty` when no `IStoreAndForwardSiteContext` is registered, so a misconfigured
host produces audit telemetry with `SourceSite = ""` and the central audit-log's
`(SourceSite, TrackedOperationId)` correlation degrades to a per-id-only index.
`StoreAndForward-024` (Low) is a stop-time ordering defect: `StopAsync` disposes the
timer but a mid-flight `RetryPendingMessagesAsync` invocation continues using
`_storage` and `_replication` after `StopAsync` returns; downstream resources disposed
by the host shutdown sequence (the DI container) can then fail inside the still-running
sweep, for example with `ObjectDisposedException`.

## Checklist coverage — Re-review 2026-05-28

| # | Category | Examined | Notes |
|---|----------|----------|-------|
| 1 | Correctness & logic bugs | ☑ | Notification corrupt-payload parks contrary to design (018); RetryParkedMessageAsync skips replication when message reload races a deletion (020). |
| 2 | Akka.NET conventions | ☑ | `ParkedMessageHandlerActor` uses `PipeTo` correctly with success/failure projections (007 resolution preserved). No new findings. |
| 3 | Concurrency & thread safety | ☑ | Sweep-vs-stop race: a timer callback running while `StopAsync` returns can touch disposed dependencies (024). |
| 4 | Error handling & resilience | ☑ | Notifications park after `DefaultMaxRetries` exhaustion (019) — contradicts the design doc's "retried until central acks". |
| 5 | Security | ☑ | No issues found — parameterised SQL throughout, payload JSON opaque, no secret material handled. |
| 6 | Performance & resource management | ☑ | No new findings — the connection-per-call documented trade-off and pooled `OpenAsync` remain acceptable. |
| 7 | Design-document adherence | ☑ | Operation Tracking Table documented in StoreAndForward but actually lives in SiteRuntime (021); notification non-parking guarantee broken by 018 + 019. |
| 8 | Code organization & conventions | ☑ | `IStoreAndForwardSiteContext` silently defaults `SiteId` to empty (023) — a configuration hole rather than an entity placement issue. |
| 9 | Testing coverage | ☑ | The seven new findings have no regression tests in `tests/ScadaLink.StoreAndForward.Tests/` — particularly the notification-doesn't-park invariant (018, 019), the requeue-after-reload-null replication gap (020), and the stop-during-sweep behaviour (024). |
| 10 | Documentation & comments | ☑ | `CachedCallAttemptOutcome.ParkedMaxRetries` XML doc says "S&F semantics" but the code applies it to notifications too if 018/019 fire — minor drift, captured under 018. The `TrackedOperationId.TryParse` silent-skip behaviour in `NotifyCachedCallObserverAsync` is documented in the source but not on the public observer contract (022). |

## Checklist coverage

| # | Category | Examined | Notes |
@@ -914,3 +984,474 @@ the StoreAndForward-016 replication) — and pass it to `RaiseActivity` (falling
`RetryParkedMessageAsync_ActivityUsesMessageRealCategory` and
`DiscardParkedMessageAsync_ActivityUsesMessageRealCategory` assert the activity carries
`Notification` / `CachedDbWrite` respectively; both fail against the pre-fix code.

### StoreAndForward-018 — Notification corrupt-payload parks the buffered message, contradicting the "notifications do not park" design invariant

| | |
|--|--|
| Severity | High |
| Category | Design-document adherence |
| Status | Open |
| Location | `src/ScadaLink.StoreAndForward/NotificationForwarder.cs:62`–`:69`, `:105`–`:122`; `src/ScadaLink.StoreAndForward/StoreAndForwardService.cs:369`–`:397` |

**Description**

The Component design doc explicitly carves out notifications from the parking lifecycle:

> "Notifications do not park — they are retried at the fixed forward interval until
> central acks." (`docs/requirements/Component-StoreAndForward.md:47`)
>
> "Parking applies only to the external-system-call and cached-database-write
> categories." (same line)

`NotificationForwarder.DeliverAsync` violates this. When `TryBuildSubmit` fails to
deserialize the buffered payload — either because `JsonSerializer.Deserialize` throws a
`JsonException` (line 114) or because it returns `null` (line 119) — `DeliverAsync`
returns `false` (line 68). On the **retry path** the S&F engine treats handler `false`
as a permanent failure and **parks the message immediately** via the conditional
`UpdateMessageIfStatusAsync(... Parked)` write at `StoreAndForwardService.cs:373`–`:385`.
Result: a notification with a corrupt buffered payload — a row that the engine itself
treats as opaque ("Payload: Serialized message content…";
`Component-StoreAndForward.md:110`) — enters the parked state and surfaces in the
central UI's parked-message list under the `Notification` category, contradicting the
doc's invariant and the resolved StoreAndForward-017's "Notification / CachedDbWrite"
Retry/Discard category mapping.

The defect is real today: the inline comment on `NotificationForwarder.cs:64` even
documents the violation ("An unreadable payload cannot be fixed by retrying — park it
(return false)") as the intended behaviour, but that behaviour is what the design doc
forbids. Either the doc needs to acknowledge a poison-payload parking exception for
notifications, or the forwarder needs a different escape hatch (discard? log + drop?
permanent-failure-as-`true` to clear the buffer?). Today there is no consistent answer
between code and design.

Additionally, on the **immediate-delivery** path (a fresh enqueue followed by a
`DeliverAsync` returning `false`), the engine returns `WasBuffered: false` and the row
is never persisted — so the corrupt-payload "park" only occurs on the retry path, where
the message has already been buffered (and replicated to the standby). The
**inconsistency between the two paths** ("not buffered" vs "parked") for the same
permanent-failure outcome is itself a contract surprise; the resolved StoreAndForward-004
documents the immediate vs retry asymmetry, but does not anticipate that the retry
asymmetry will violate a per-category invariant.

**Recommendation**

Choose one consistent reconciliation. Preferred option: change
`NotificationForwarder.DeliverAsync` to **discard** a corrupt payload rather than park
it — delete the buffered row directly, log a Site Event Log entry under `Discard`, and
return `true` so the engine clears the buffer. This preserves the design's
"notifications do not park" invariant. Alternatives: (a) update the design doc to
acknowledge a poison-payload parking exception specifically for notifications, and
revise the resolved StoreAndForward-017 wording; (b) treat `JsonException` as transient
(would retry-forever on a corrupt payload — bad); (c) introduce a per-category
park-allowed flag on the engine and gate the retry-path park behind it for the
Notification category.
Add a regression test in
`tests/ScadaLink.StoreAndForward.Tests/NotificationForwarderTests.cs` asserting that a
corrupt-payload notification reaches a terminal **non-Parked** state — today the
corrupt-payload behaviour is uncovered.

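
A minimal sketch of the preferred option, under stated assumptions. `NotificationPayload`, `DiscardingForwarder` and the two injected delegates are hypothetical stand-ins (the real forwarder, payload type and Site Event Log API are not shown in this review), and the handler contract is assumed to be "`true` clears the buffer, an exception means retry later". It only illustrates mapping an unreadable payload to a logged discard instead of returning `false`.

```csharp
using System;
using System.Text.Json;
using System.Threading.Tasks;

// Hypothetical payload shape, for illustration only.
public sealed record NotificationPayload(string ListName, string Subject, string Body);

public sealed class DiscardingForwarder
{
    private readonly Func<NotificationPayload, Task> _submitToCentral; // assumed: throws on transient failure
    private readonly Action<string> _logDiscard;                       // assumed: site event log hook

    public DiscardingForwarder(Func<NotificationPayload, Task> submitToCentral, Action<string> logDiscard)
    {
        _submitToCentral = submitToCentral;
        _logDiscard = logDiscard;
    }

    // Returns true when the engine may clear the buffered row: either central
    // accepted the notification, or the payload was unreadable and was discarded.
    // Never returns false, so the retry path has nothing to park.
    public async Task<bool> DeliverAsync(string messageId, string payloadJson)
    {
        NotificationPayload? payload;
        try
        {
            payload = JsonSerializer.Deserialize<NotificationPayload>(payloadJson);
        }
        catch (JsonException)
        {
            payload = null; // corrupt JSON: same outcome as a literal null payload
        }

        if (payload is null)
        {
            _logDiscard($"Discard: notification {messageId} has an unreadable payload");
            return true;
        }

        await _submitToCentral(payload); // a transient failure propagates as an exception
        return true;
    }
}
```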

**Resolution**

_Unresolved._

### StoreAndForward-019 — Notifications park after `DefaultMaxRetries` exhaustion, contradicting "retried until central acks"

| | |
|--|--|
| Severity | Medium |
| Category | Error handling & resilience |
| Status | Open |
| Location | `src/ScadaLink.StoreAndForward/StoreAndForwardService.cs:229`, `:407`–`:437`; `src/ScadaLink.StoreAndForward/StoreAndForwardOptions.cs:18`; `src/ScadaLink.SiteRuntime/Scripts/ScriptRuntimeContext.cs:1773`–`:1778`; `src/ScadaLink.NotificationService/NotificationDeliveryService.cs:149`–`:156` |

**Description**

The design doc requires a buffered notification to be retried indefinitely until
central acks:

> "The **notification** category retries differently: it has no source-entity setting.
> The site→central forward uses a single fixed retry interval configured in the host
> `appsettings.json`. … A buffered notification is retried until central acks it; it is
> not parked on a retry limit (central, once reachable, owns delivery, retry, and
> parking from that point on)." (`Component-StoreAndForward.md:55`–`:59`)

The current engine cannot honour that. `RetryMessageAsync` enforces parking at
`message.MaxRetries > 0 && message.RetryCount >= message.MaxRetries`
(`StoreAndForwardService.cs:407`); a `MaxRetries == 0` is the documented "no limit"
escape hatch (now correctly explained by the resolved StoreAndForward-015). But the two
notification enqueue paths both supply a positive bounded `MaxRetries`:

- `ScriptRuntimeContext.cs:1773`–`:1778` (the `Notify.Send` site script path) calls
  `EnqueueAsync` without supplying the `maxRetries` argument, so the engine
  defaults to `StoreAndForwardOptions.DefaultMaxRetries = 50`
  (`StoreAndForwardOptions.cs:18`). After 50 retry sweeps with central unreachable, the
  notification is parked.
- `NotificationDeliveryService.cs:149`–`:156` (the legacy SMTP-style path retained for
  the central-side `INotificationDeliveryService` callers) passes
  `smtpConfig.MaxRetries > 0 ? smtpConfig.MaxRetries : null` — `null` falls back to the
  same 50-retry default, and any positive `smtpConfig.MaxRetries` still bounds the
  retry budget. Either way, a long central outage parks the notification.

A parked notification cannot be cleared by a central recovery: it stays parked until an
operator clicks **Retry** in the parked-message UI. The design's invariant — that
notification delivery converges automatically as soon as central is reachable — is
broken: an extended central outage requires manual intervention to clear the backlog,
which is exactly the behaviour the central-only outbox redesign was meant to remove
from the site.

This is closely related to (but distinct from) StoreAndForward-018: 018 is the
*permanent-failure-path* parking violation; 019 is the *transient-failure-path*
parking violation under the engine's normal max-retries policy.

**Recommendation**

Make the notification enqueue paths pass `maxRetries: 0` so the documented "no limit /
never parked" semantics apply, and guard against regression by adding an integration
test in `tests/ScadaLink.StoreAndForward.Tests/NotificationForwarderTests.cs` that runs
a sweep many more times than `DefaultMaxRetries` against an always-failing handler and
asserts the buffered notification's status stays `Pending` (not `Parked`). A cleaner
alternative is to special-case the `Notification` category inside
`RetryMessageAsync`'s max-retries guard (treat it as `MaxRetries == 0` regardless of
the field value) so the invariant is enforced at the single chokepoint rather than
relying on every caller to pass the right value — this also fixes the legacy
`NotificationDeliveryService` path without editing the consumer.

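
The single-chokepoint alternative could look roughly like the sketch below. `BufferedMessage` and `RetryPolicy` are hypothetical names introduced for illustration, and the enum lists only the three categories this review mentions; the guard expression itself is the one quoted from `StoreAndForwardService.cs:407`.

```csharp
using System;

// Hypothetical stand-ins carrying only the fields the guard needs.
public enum StoreAndForwardCategory { ExternalSystem, CachedDbWrite, Notification }

public sealed record BufferedMessage(StoreAndForwardCategory Category, int MaxRetries, int RetryCount);

public static class RetryPolicy
{
    // True when the retry budget is spent and the message should be parked.
    public static bool ShouldPark(BufferedMessage message)
    {
        // Notifications behave as MaxRetries == 0 (no limit) whatever the caller stored.
        if (message.Category == StoreAndForwardCategory.Notification)
            return false;

        return message.MaxRetries > 0 && message.RetryCount >= message.MaxRetries;
    }
}
```

Keeping the category check inside the guard means a caller that forgets `maxRetries: 0` cannot reintroduce parking for notifications.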

**Resolution**

_Unresolved._

### StoreAndForward-020 — `RetryParkedMessageAsync` skips standby replication when the message is deleted between local update and re-load

| | |
|--|--|
| Severity | Medium |
| Category | Concurrency & thread safety |
| Status | Open |
| Location | `src/ScadaLink.StoreAndForward/StoreAndForwardService.cs:599`–`:616` |

**Description**

The StoreAndForward-016 resolution wired Requeue replication into operator-initiated
retry. The fix uses a two-step pattern:

```csharp
public async Task<bool> RetryParkedMessageAsync(string messageId)
{
    var success = await _storage.RetryParkedMessageAsync(messageId); // step 1
    if (success)
    {
        var message = await _storage.GetMessageByIdAsync(messageId); // step 2 (no txn)
        var category = message?.Category ?? StoreAndForwardCategory.ExternalSystem;
        if (message != null)
        {
            _replication?.ReplicateRequeue(message); // step 3
        }
        RaiseActivity("Retry", category, ...);
    }
    return success;
}
```

The two storage calls are on separate connections with no surrounding transaction. A
concurrent writer between step 1 (which moved the row from Parked → Pending) and step 2
(which re-reads the row) can delete or mutate the row:

- An operator `DiscardParkedMessageAsync` issued immediately after the retry is
  harmless: the storage call is conditional on `status = Parked`, so it is a no-op
  (correct). A retry sweep is not: if it delivers the just-requeued row it then calls
  `_storage.RemoveMessageAsync` (unconditional), which deletes the row. The window is
  reachable within a single retry-sweep cycle because `DefaultRetryInterval = Zero` is
  the standard test default, so the operator action and a sweep tick can overlap.
- A `RemoveMessageAsync` runs in step 1's wake; `GetMessageByIdAsync` returns null;
  step 3 (`_replication?.ReplicateRequeue`) is **skipped entirely**, but step 1
  already requeued the row locally. The standby is now left in `Parked` state while
  the active node has Pending-then-Deleted, exactly the standby-divergence StoreAndForward-016
  was supposed to fix. (After a subsequent failover the new active node holds a
  `Parked` copy of a message the old active node had already deleted, the same
  regression 016 documented.)

The category-fallback path (`StoreAndForwardCategory.ExternalSystem` when message is
null) silently mislabels the activity log entry too. This is the same defect that
StoreAndForward-017 fixed for the non-racy path, except that this branch falls back to
a hard-coded category when the reload returns null. The activity log entry is a minor
side effect; the missing replication is the real defect.

**Recommendation**

Capture the message **once**, before the local Parked → Pending storage update, so the
replication path has the row in hand even if a concurrent writer deletes it
afterwards:

```csharp
var message = await _storage.GetMessageByIdAsync(messageId); // before the update
if (message == null || message.Status != StoreAndForwardMessageStatus.Parked)
    return false;

var success = await _storage.RetryParkedMessageAsync(messageId);
if (!success) return false;

// `message` was the parked row; the active node just wrote it back to Pending with
// retry_count = 0 — construct the replicated state from those known mutations.
message.Status = StoreAndForwardMessageStatus.Pending;
message.RetryCount = 0;
message.LastError = null;
message.LastAttemptAt = null;
_replication?.ReplicateRequeue(message);
RaiseActivity("Retry", message.Category, $"Parked message {messageId} moved back to queue");
return true;
```

Add a regression test in `StoreAndForwardReplicationTests` that simulates the
delete-between-update-and-reload race and asserts the `Requeue` replication
operation is still emitted with the correct category.

**Resolution**

_Unresolved._

### StoreAndForward-021 — Design doc claims the Operation Tracking Table lives in StoreAndForward but the implementation is in SiteRuntime

| | |
|--|--|
| Severity | Medium |
| Category | Design-document adherence |
| Status | Open |
| Location | `docs/requirements/Component-StoreAndForward.md:21`, `:49`–`:51`, `:77`–`:87`, `:108`, `:114`; `src/ScadaLink.SiteRuntime/Tracking/OperationTrackingStore.cs:37`; `src/ScadaLink.StoreAndForward/` (whole module) |

**Description**

Component-StoreAndForward.md repeatedly assigns the **Operation Tracking Table** to
this component:

- **Responsibilities** (line 21): "Maintain a site-local **operation tracking table**
  holding one row per `TrackedOperationId` for cached calls … the authoritative status
  record consulted by `Tracking.Status(id)`."
- **Message Lifecycle** (lines 49–51): "the operation tracking table is the status
  record and the S&F buffer is purely the retry mechanism. A cached call that succeeds
  on its first immediate attempt is written directly as a terminal `Delivered` tracking
  row and never enters the S&F buffer."
- **Operation Tracking Table** section (lines 77–87): "Alongside the S&F buffer DB,
  each site node holds a **site-local operation tracking table** in SQLite. … Each row
  records the operation kind (`TrackedOperationKind`) …"

The actual implementation lives outside this module:
`src/ScadaLink.SiteRuntime/Tracking/OperationTrackingStore.cs` (and
`IOperationTrackingStore`, `OperationTrackingOptions`).
The StoreAndForward project contains no references to the tracking store, owns no
`operation_tracking` table, and `StoreAndForwardService.NotifyCachedCallObserverAsync`
is only a hook handing telemetry context to an `ICachedCallLifecycleObserver` — the
audit bridge wired in `ScadaLink.AuditLog`. The S&F module is **not** the table's
owner; SiteRuntime is.

This is a real design-doc drift, not a code defect, and is flagged explicitly in the
brief's "Module-specific notes". The drift matters because the design doc's
discussion of the lifecycle — "immediate success writes a terminal Delivered tracking
row directly here", "operator discard sets terminal `Discarded`", "central never
mutates the mirror row directly" — places coordination responsibilities on the wrong
component. A reader looking for the source of truth for `Tracking.Status(id)` would
read `Component-StoreAndForward.md` and search `src/ScadaLink.StoreAndForward/` in
vain. The doc also lists Site Call Audit / Audit Log telemetry-emission as a S&F
responsibility (line 22), but the emission actually happens via the `AuditLog` site
component subscribing to `ICachedCallLifecycleObserver`.

**Recommendation**

Reconcile the doc with the code. The simplest fix is doc-side: update
Component-StoreAndForward.md to scope its responsibilities back to the retry
mechanism + replication + parked-message management, and add a cross-reference to a
new (or existing) component doc for Operation Tracking (Component-SiteRuntime.md, or
a new Component-OperationTracking.md). The code is internally consistent — the audit
bridge subscribes to the observer hook, the SiteRuntime store writes the rows, the S&F
engine emits attempt telemetry on the cached-call hot path — but the design doc is
several refactors out of date. The hierarchical map should be:

- `Component-StoreAndForward.md` → S&F buffer + Replication + Parked-message
  management + Notification forwarding to central + cached-call telemetry **hook**.
- New doc / SiteRuntime doc → Operation Tracking Table semantics and lifecycle.
- `Component-SiteCallAudit.md` / `Component-AuditLog.md` → telemetry emission +
  central-side mirror.

**Resolution**

_Unresolved._

### StoreAndForward-022 — `NotifyCachedCallObserverAsync` silently drops the entire audit lifecycle when the message id is not a parseable `TrackedOperationId`

| | |
|--|--|
| Severity | Low |
| Category | Documentation & comments |
| Status | Open |
| Location | `src/ScadaLink.StoreAndForward/StoreAndForwardService.cs:484`–`:515` |

**Description**

`NotifyCachedCallObserverAsync` (the per-attempt observer notifier wired by the M3
Bundle E rollout) bails out with no audit emission when
`TrackedOperationId.TryParse(message.Id, out var trackedId)` returns false
(`StoreAndForwardService.cs:510`–`:515`). The inline comment justifies the behaviour as
back-compat for "pre-M3 message (random GUID-N id from S&F itself, no
TrackedOperationId threaded in)", but the documented contract is broken in two ways:

1. **Silent dropping of every audit row, not just the first one.** The skip means no
   `Attempted` row, no `CachedResolve` terminal row, no audit trail at all for that
   operation's S&F lifecycle — yet the rest of the system (script trust boundary,
   parked-message UI, etc.) still treats the operation as audit-tracked. The drop is
   not surfaced via a metric, log warning (the path is a silent `return`), or counter,
   so a misconfigured caller bypasses the audit hot path with zero feedback.

2. **The contract is hidden in field-level XML.** The `ICachedCallLifecycleObserver`
   public interface contract (defined in `ScadaLink.Commons`) does not document that
   the observer will be silently skipped when the underlying S&F message id is not a
   GUID. A consumer reading the interface contract reasonably expects every cached-call
   attempt to surface — the audit pipeline depends on it. The silent-drop is an
   implementation detail of the S&F bridge that should be either lifted onto the
   contract or removed.

The engine itself mints GUID-N ids via `Guid.NewGuid().ToString("N")` (line 224), which
`TrackedOperationId.TryParse` accepts, so the skip path is unreachable for engine-minted
ids. It is reachable only for callers that supply their own `messageId` argument with a
non-GUID format. The current callers (`NotificationOutbox` enqueue path with
NotificationId, cached-call enqueue path with `TrackedOperationId.ToString()`) all
supply GUID-shaped ids. The defect is latent — a future caller passing a non-GUID id
would silently bypass audit.

**Recommendation**

Two options. The cheap fix: change the skip to a `_logger.LogWarning` with the offending
id so a misconfigured caller is observable, and update the
`ICachedCallLifecycleObserver` XML doc to mention the "non-GUID id → no telemetry"
contract explicitly. The more correct fix: emit a still-audited row for the
non-GUID case (e.g. synthesise a `TrackedOperationId` from the underlying id, or emit a
distinguished "tracking-id-missing" audit row) so the audit pipeline never has silent
holes. Add a regression test in `CachedCallAttemptEmissionTests` capturing the chosen
contract — the existing
`Attempt_MessageIdNotAGuid_NoObserverNotification` test pins today's silent-skip; if
the fix is "log + skip", that test should be updated to also assert the log emission;
if the fix is "emit anyway", the test should be replaced.

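
The cheap "log + skip" option might be shaped like the sketch below. `CachedCallObserverGate` and the `warn` delegate are hypothetical; `Guid.TryParse` stands in for `TrackedOperationId.TryParse`, whose exact signature is known here only from the review text, and a real implementation would call `_logger.LogWarning` rather than an `Action<string>`.

```csharp
using System;

public static class CachedCallObserverGate
{
    // Returns true (with the parsed id) when telemetry should be emitted.
    // Otherwise reports the offending id through `warn` and returns false,
    // so the skip is observable instead of silent.
    public static bool TryGetTrackedId(string messageId, Action<string> warn, out Guid trackedId)
    {
        if (Guid.TryParse(messageId, out trackedId))
            return true;

        warn($"Cached-call message id '{messageId}' is not a TrackedOperationId; audit telemetry skipped");
        return false;
    }
}
```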

**Resolution**

_Unresolved._

### StoreAndForward-023 — `siteId` silently defaults to empty when no `IStoreAndForwardSiteContext` is registered, degrading audit telemetry correlation

| | |
|--|--|
| Severity | Low |
| Category | Code organization & conventions |
| Status | Open |
| Location | `src/ScadaLink.StoreAndForward/ServiceCollectionExtensions.cs:43`–`:53`; `src/ScadaLink.StoreAndForward/StoreAndForwardService.cs:99`, `:524` |

**Description**

`AddStoreAndForward`'s service-collection factory resolves the optional
`IStoreAndForwardSiteContext` and falls back to `string.Empty` when not registered:

```csharp
var siteContext = sp.GetService<IStoreAndForwardSiteContext>();
var siteId = siteContext?.SiteId ?? string.Empty;
return new StoreAndForwardService(storage, options, logger, replication,
    cachedCallObserver, siteId);
```

The constructor's parameter is even defaulted to `""`. The empty-string `siteId` flows
straight into every emitted `CachedCallAttemptContext.SourceSite`, which the central
audit pipeline uses as part of the `(SourceSite, TrackedOperationId)` correlation key.
A host that registers an `ICachedCallLifecycleObserver` (the audit observer wired by
`AddAuditLog`) but forgets to register `IStoreAndForwardSiteContext` will produce a
stream of telemetry rows with `SourceSite = ""` — the central audit mirror cannot
distinguish them by site, and the central-site routing of
`RetryParkedOperation`/`DiscardParkedOperation` commands keyed on `SourceSite` will
fail to find the owning site.

The Host's `IStoreAndForwardSiteContext` adapter and the audit observer registration
are wired in lock-step, so the current configuration is correct, but the silent
empty-string fallback is a contract hazard for future hosts (CLI test harness, second
site cluster topology, etc.) and for tests that wire one without the other.

**Recommendation**

Make the contract explicit: when `cachedCallObserver` is non-null, require
`IStoreAndForwardSiteContext` to be registered — throw an `InvalidOperationException`
with a clear "Audit observer registered without a site context — register
IStoreAndForwardSiteContext" message at construction time. When the audit observer is
absent (no `AddAuditLog`), keep the empty-string default since `_siteId` is unused.
Alternatively, change `siteId` from a parameter to a `Func<string>` resolved lazily
from the service provider so a late-registered context still takes effect.

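
A sketch of the fail-fast variant, with assumptions called out: the two interfaces are redeclared here as minimal stand-ins (only `SiteId` is taken from the snippet above; the observer's real members are not shown in this review), and `SiteIdResolver` is a hypothetical helper that the factory or constructor could call.

```csharp
using System;

// Minimal stand-ins so the sketch compiles on its own.
public interface IStoreAndForwardSiteContext { string SiteId { get; } }
public interface ICachedCallLifecycleObserver { }

public static class SiteIdResolver
{
    // An audit observer without a usable site id is a configuration error;
    // without an observer the site id is unused, so empty is acceptable.
    public static string Resolve(
        ICachedCallLifecycleObserver? cachedCallObserver,
        IStoreAndForwardSiteContext? siteContext)
    {
        if (cachedCallObserver is not null && string.IsNullOrEmpty(siteContext?.SiteId))
        {
            throw new InvalidOperationException(
                "Audit observer registered without a site context; register IStoreAndForwardSiteContext");
        }

        return siteContext?.SiteId ?? string.Empty;
    }
}

// Test doubles used only to exercise the helper.
public sealed class FixedSiteContext : IStoreAndForwardSiteContext
{
    public FixedSiteContext(string siteId) => SiteId = siteId;
    public string SiteId { get; }
}

public sealed class NoopObserver : ICachedCallLifecycleObserver { }
```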

**Resolution**

_Unresolved._

### StoreAndForward-024 — `StopAsync` does not wait for an in-flight retry sweep, so disposed dependencies can be touched after shutdown

| | |
|--|--|
| Severity | Low |
| Category | Concurrency & thread safety |
| Status | Open |
| Location | `src/ScadaLink.StoreAndForward/StoreAndForwardService.cs:122`–`:127`, `:136`–`:143`, `:303`–`:329` |

**Description**

`StartAsync` arms `_retryTimer` with `_ => _ = RetryPendingMessagesAsync()` (line 123).
The `_ =` discards the returned `Task`, so when the timer fires the sweep runs **fire
and forget** on a thread-pool thread. `StopAsync` (lines 136–143) disposes the timer:

```csharp
if (_retryTimer != null)
{
    await _retryTimer.DisposeAsync();
    _retryTimer = null;
}
```

`Timer.DisposeAsync()` returns once any in-flight timer **callback** has completed —
but the timer callback in this service is a one-line `_ = RetryPendingMessagesAsync()`
that synchronously returns immediately and leaves the actual sweep running on the
thread pool. So `Timer.DisposeAsync` does not wait for the sweep; only for the
synchronous `_ = ...` discarding step. `StopAsync` returns while a sweep is potentially
still running, touching `_storage` (which the host will dispose), `_replication`
(which the host will tear down), and `_cachedCallObserver` (whose downstream gRPC
channel the host will shut down).

The host shutdown sequence (`AkkaHostedService`) tears down the actor system and the
DI container after this service's `StopAsync` completes — meaning a sweep that runs
past `StopAsync` can call into disposed `SqliteConnection`s (yielding
`ObjectDisposedException`, which the sweep's outer `try/catch` catches and logs) or,
more seriously, push a replication operation into a half-disposed Akka actor pipeline
and trigger noisy dead-letter warnings during a clean shutdown.

The race window is small (the sweep typically finishes in <100 ms in tests) but it is
real, particularly when shutting down a site under load with a non-empty buffer.

**Recommendation**

Track in-flight sweep tasks and `await` them in `StopAsync`:

```csharp
private Task? _currentSweep;

public async Task StopAsync()
{
    if (_retryTimer != null)
    {
        await _retryTimer.DisposeAsync();
        _retryTimer = null;
    }
    if (_currentSweep is { } sweep)
    {
        try { await sweep; } catch { /* logged inside RetryPendingMessagesAsync */ }
    }
}
```

Change the timer callback to:

```csharp
_retryTimer = new Timer(_ => _currentSweep = RetryPendingMessagesAsync(), ...);
```

Add a `CancellationTokenSource` so a long sweep can be cooperatively aborted on stop;
plumb the cancellation token into `_storage` / `_replication` / `_cachedCallObserver`
calls. Add a regression test in `StoreAndForwardServiceTests` that calls `StopAsync`
mid-sweep and asserts no further storage activity occurs after `StopAsync` returns.

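
The cancellation part of this recommendation could be sketched as follows. `SweepRunner` is a hypothetical wrapper, not the service's actual shape: the injected delegate stands in for `RetryPendingMessagesAsync` taking a token, and the lock that refuses overlapping sweeps is an added assumption rather than something the current code is known to do.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical wrapper; the real service owns more state than this.
public sealed class SweepRunner
{
    private readonly Func<CancellationToken, Task> _sweep; // stand-in for the retry sweep
    private readonly CancellationTokenSource _stopping = new();
    private readonly object _gate = new();
    private Task _currentSweep = Task.CompletedTask;
    private Timer? _timer;

    public SweepRunner(Func<CancellationToken, Task> sweep) => _sweep = sweep;

    public void Start(TimeSpan interval) =>
        _timer = new Timer(_ => TriggerSweep(), null, interval, interval);

    // Starts a sweep unless one is still running or a stop was requested.
    public void TriggerSweep()
    {
        lock (_gate)
        {
            if (_stopping.IsCancellationRequested || !_currentSweep.IsCompleted) return;
            var token = _stopping.Token;
            _currentSweep = Task.Run(() => _sweep(token));
        }
    }

    public async Task StopAsync()
    {
        if (_timer is not null)
        {
            await _timer.DisposeAsync(); // no new callbacks after this point
            _timer = null;
        }

        _stopping.Cancel();              // ask an in-flight sweep to stop
        Task sweep;
        lock (_gate) { sweep = _currentSweep; }
        try { await sweep; }
        catch (OperationCanceledException) { /* expected on cooperative abort */ }
    }
}
```

Unlike the snippet above, this version lets non-cancellation exceptions escape `StopAsync`; whether to swallow them depends on how much the sweep already logs internally.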

**Resolution**

_Unresolved._