docs(code-reviews): re-review batch 4 at 39d737e — SiteEventLogging, SiteRuntime, StoreAndForward, TemplateEngine

11 new findings: SiteEventLogging-012..014, SiteRuntime-017..019, StoreAndForward-015..017, TemplateEngine-015..016.
This commit is contained in:
Joseph Doherty
2026-05-17 00:51:58 -04:00
parent 3b3760f026
commit 0ba4e49e11
5 changed files with 613 additions and 27 deletions

View File

@@ -5,10 +5,10 @@
| Module | `src/ScadaLink.StoreAndForward` |
| Design doc | `docs/requirements/Component-StoreAndForward.md` |
| Status | Reviewed |
| Last reviewed | 2026-05-16 |
| Last reviewed | 2026-05-17 |
| Reviewer | claude-agent |
| Commit reviewed | `9c60592` |
| Open findings | 0 (3 Deferred: 002, 011, 012 — see notes) |
| Commit reviewed | `39d737e` |
| Open findings | 3 (3 Deferred: 002, 011, 012 — see notes) |
## Summary
@@ -30,20 +30,45 @@ status set, and untested critical paths (retry-due timing, replication-from-acti
the actor bridge). None of the findings are blockers for compilation, but the
replication and retry-count issues are functional defects against the design.
#### Re-review 2026-05-17 (commit `39d737e`)
Re-reviewed at commit `39d737e` after the batch-3 fixes. All of findings 001 and
003010, plus 013014, are confirmed `Resolved` against the current source: the
replication wiring (`BufferAsync`/`ReplicateRemove`/`ReplicatePark`), the corrected
retry-count semantics, the conditional `UpdateMessageIfStatusAsync` writes, the
transactioned parked-message reads, the `PipeTo` refactor, the `RaiseActivity`
hardening, the `RetryParkedMessageAsync` `last_attempt_at` reset and the database
directory creation are all present as described. Findings 002, 011 and 012 were
re-verified and remain validly `Deferred` — their preconditions are unchanged (002's
residual no-handler gap, 011's Commons-owned enum, 012's Commons-owned entity placement).
This pass surfaced **three new findings**. StoreAndForward-015 records the
StoreAndForward side of the cross-module `MaxRetries == 0` ambiguity flagged by
ExternalSystemGateway-015: `EnqueueAsync`'s public contract documents `maxRetries` only
as "parked once `MaxRetries` is reached" and never states the `0 = no limit / retry
forever` special case that `RetryMessageAsync` actually enforces, so an ESG caller
passing `0` to mean "never retry" gets the opposite behaviour with no warning from the
S&F API surface. StoreAndForward-016 records that operator-initiated parked-message
retry and discard are not replicated to the standby — only the add/remove/park sweep
paths are — so a failover diverges the standby buffer from the active one.
StoreAndForward-017 records that the Retry/Discard activity-log entries hard-code the
`ExternalSystem` category, mislabelling notification and cached-DB-write messages in
the site event log.
## Checklist coverage
| # | Category | Examined | Notes |
|---|----------|----------|-------|
| 1 | Correctness & logic bugs | ☑ | Off-by-one in retry counting (003); parked-message retry timing (010). |
| 1 | Correctness & logic bugs | ☑ | Off-by-one in retry counting (003); parked-message retry timing (010); Retry/Discard activity log hard-codes the category (017). |
| 2 | Akka.NET conventions | ☑ | `ContinueWith` used instead of `PipeTo`-friendly continuations; default supervision; see 007. |
| 3 | Concurrency & thread safety | ☑ | Sweep guarded by `Interlocked`, but no guard against retry-vs-manage races (005); `OnActivity` event not thread-safe (009). |
| 4 | Error handling & resilience | ☑ | Replication never invoked from active path (001); no-handler messages buffered then stuck (002). |
| 4 | Error handling & resilience | ☑ | Replication never invoked from active path (001); no-handler messages buffered then stuck (002); operator retry/discard not replicated to standby (016). |
| 5 | Security | ☑ | No issues found — parameterised SQL throughout; no secrets handled directly; payload JSON treated opaquely. |
| 6 | Performance & resource management | ☑ | New SQLite connection per call; multi-statement operations not wrapped in a transaction (006, 008). |
| 7 | Design-document adherence | ☑ | Replication gap (001); `InFlight` status undocumented/unused (011); "retrying" status from design doc not modelled. |
| 8 | Code organization & conventions | ☑ | `StoreAndForwardMessage` is an entity-like POCO living in the component, not Commons (012). |
| 9 | Testing coverage | ☑ | Retry-due timing, replication-from-active, and `ParkedMessageHandlerActor` are untested (013). |
| 10 | Documentation & comments | ☑ | XML doc on `RegisterDeliveryHandler` contract is inconsistent with code (004). |
| 10 | Documentation & comments | ☑ | XML doc on `RegisterDeliveryHandler` contract is inconsistent with code (004); `EnqueueAsync` never documents the `maxRetries == 0` = "retry forever" special case (015). |
## Findings
@@ -703,3 +728,158 @@ and bare filenames are skipped). Regression test
`InitializeAsync_FileInMissingDirectory_CreatesDirectory` fails against the pre-fix code;
all six `SiteActorPathTests` now pass. Fixed by the commit whose message references
`StoreAndForward-014`.
### StoreAndForward-015 — `EnqueueAsync`'s public contract never documents that `maxRetries == 0` means "retry forever"
| | |
|--|--|
| Severity | Medium |
| Category | Documentation & comments |
| Status | Open |
| Location | `src/ScadaLink.StoreAndForward/StoreAndForwardService.cs:114``:130`, `:285` |
**Description**
The re-review brief asks for the StoreAndForward side of the cross-module ambiguity
recorded as `ExternalSystemGateway-015`. The semantics are split across this module and
its callers, and the StoreAndForward side carries a genuine documentation/API-contract
fault:
- `RetryMessageAsync` parks a message only when `message.MaxRetries > 0 && message.RetryCount >= message.MaxRetries`
(`StoreAndForwardService.cs:285`). When `MaxRetries == 0` the guard is false on every
sweep, so a `0` value means **"no limit — retry forever"**. The
`StoreAndForwardMessage.MaxRetries` XML doc (`StoreAndForwardMessage.cs:31`) does state
`"0 = no limit"`, so the persistence model is internally consistent.
- But `EnqueueAsync` — the *only* public entry point into the engine — exposes a
`maxRetries` parameter (`StoreAndForwardService.cs:128`) with **no parameter
documentation at all**, and its method summary (lines 114122) describes the lifecycle
only as "On max retries → park" / "parked once `MaxRetries` is reached" (see also the
`_deliveryHandlers` field doc, line 5051). Nothing on the public surface tells a
caller that passing `0` flips the meaning from "park immediately / never retry" to
"retry forever". A caller reading only `EnqueueAsync` would reasonably assume `0`
retries means zero retries.
- This is exactly the trap ESG fell into: `ExternalSystemClient.CachedCallAsync` /
`DatabaseGateway.CachedWriteAsync` pass the source entity's `MaxRetries` verbatim,
intending `0` to mean "never retry", and instead get unbounded retry — the
duplicate-delivery / unbounded-buffer-growth hazard the design doc's idempotency note
warns against. The fault is not solely ESG's: the S&F public API silently overloads
`0` with the opposite of its natural reading and does not document it.
The defect is in this module's API contract and documentation, so it is recorded here
in addition to `ExternalSystemGateway-015`. (Whether `0` *should* mean "no limit" or
"no retry" is the cross-module design decision tracked by ESG-015; this finding is
specifically that the StoreAndForward public surface fails to document whichever
meaning is chosen.)
**Recommendation**
Document the `maxRetries` parameter on `EnqueueAsync` explicitly with a `<param>` tag
that states the `0` special case in the same words as `StoreAndForwardMessage.MaxRetries`
(`"0 = no limit — retried on every sweep until delivered, never parked"`), and add the
`0` case to the method summary's lifecycle description. Better still — and consistent
with the resolution of ESG-015 — make the engine reject the ambiguity at the API: accept
a nullable/`enum` retry policy, or treat `0` as an explicit "no retry" (do not buffer, or
park on the first sweep) so the natural reading and the behaviour agree. Either way the
public `EnqueueAsync` contract must state the chosen meaning; today it states nothing.
**Resolution**
_Unresolved._
### StoreAndForward-016 — Operator-initiated parked-message retry and discard are not replicated to the standby
| | |
|--|--|
| Severity | Medium |
| Category | Error handling & resilience |
| Status | Open |
| Location | `src/ScadaLink.StoreAndForward/StoreAndForwardService.cs:339``:362`; `src/ScadaLink.StoreAndForward/ReplicationService.cs:131``:136` |
**Description**
`StoreAndForward-001`'s fix wired replication into the active *delivery* paths:
`BufferAsync` replicates an `Add`, a successful retry replicates a `Remove`, and a park
replicates a `Park`. But the two *operator* paths — `RetryParkedMessageAsync` (line 339)
and `DiscardParkedMessageAsync` (line 353) — change buffer state and never touch
`_replication`:
- `RetryParkedMessageAsync` flips a row from `Parked` back to `Pending` (and clears
`retry_count` / `last_attempt_at`) in the local SQLite only. The standby's copy stays
`Parked`.
- `DiscardParkedMessageAsync` `DELETE`s the row from the local SQLite only. The standby's
copy is left in place, still `Parked`.
The Component design doc ("Persistence") requires the active node to forward "each
buffer operation (add, remove, park)" so that on failover "the new active node has a
near-complete copy of the buffer." An operator retrying a parked message is a buffer
state change; an operator discarding one is a removal. After a failover that follows an
operator action:
1. A **discarded** message reappears on the new active node — it is still `Parked`
there, so it resurfaces in the central UI's parked-message list and an operator must
discard it a second time. For a message deliberately removed (e.g. a known-bad
payload) this is a correctness regression of the operator's intent.
2. A **retried** message is still `Parked` on the new active node, so the operator's
"move it back to the queue" action is silently lost across the failover and the
message is not re-attempted.
`ReplicationOperationType` only models `Add`/`Remove`/`Park` — there is no operation for
"un-park / move back to pending", so even a minimal fix needs either a new operation
type or a re-use of `Add` to overwrite the standby row. This is the same class of defect
as the now-resolved `StoreAndForward-001`, for the operator paths rather than the sweep
paths.
**Recommendation**
Replicate both operator actions. `DiscardParkedMessageAsync` should call
`_replication?.ReplicateRemove(messageId)` after a successful local delete (the existing
`Remove` op already deletes on the standby). For `RetryParkedMessageAsync`, add a
`Requeue`/`Unpark` `ReplicationOperationType` whose `ApplyReplicatedOperationAsync` case
resets the standby row to `Pending` with `retry_count = 0`, or have the method re-load
the updated message and replicate it as an `Add`-style upsert. Add replication tests for
both operator paths (the existing `StoreAndForwardReplicationTests` only cover the sweep
paths).
**Resolution**
_Unresolved._
### StoreAndForward-017 — Retry/Discard activity-log entries hard-code the `ExternalSystem` category
| | |
|--|--|
| Severity | Low |
| Category | Correctness & logic bugs |
| Status | Open |
| Location | `src/ScadaLink.StoreAndForward/StoreAndForwardService.cs:344`, `:358` |
**Description**
`RetryParkedMessageAsync` and `DiscardParkedMessageAsync` raise an S&F activity
notification (consumed by Site Event Logging — WP-14) but pass a hard-coded
`StoreAndForwardCategory.ExternalSystem` as the category argument:
```csharp
RaiseActivity("Retry", StoreAndForwardCategory.ExternalSystem, $"Parked message {messageId} moved back to queue");
RaiseActivity("Discard", StoreAndForwardCategory.ExternalSystem, $"Parked message {messageId} discarded");
```
Both methods take only a `messageId` and never load the message, so they have no access
to its real category. When an operator retries or discards a parked **Notification** or
**CachedDbWrite** message, the site event log records the activity under
`ExternalSystem`. Every other `RaiseActivity` call in the service passes the message's
true `Category` (`EnqueueAsync`, `RetryMessageAsync`), so the operator paths are
inconsistent and produce mislabelled audit entries — misleading when an operator later
filters or reviews S&F activity by category.
**Recommendation**
Load the message (or have `StoreAndForwardStorage.RetryParkedMessageAsync` /
`DiscardParkedMessageAsync` return the affected row's category) and pass the real
`Category` to `RaiseActivity`. If loading the row is considered too costly on these
infrequent operator paths, change the `OnActivity` event / `RaiseActivity` signature to
allow a nullable category for management actions rather than asserting a false one.
**Resolution**
_Unresolved._