docs(code-review): full review at 4307c381 — 18 modules, 67 findings recorded + remediation tracked

Full per-module re-review of the 16 stale modules (last seen 1eb6e97 / 2026-05-28) plus first-ever reviews of KpiHistory (#26) and ScriptAnalysis (#25), at HEAD 4307c381. 67 new findings (0 Critical, 6 High, 27 Medium, 34 Low). Remediation in commit fd618cf1 closed 5 of the 6 Highs and ~33 Medium/Low; the rest are Deferred/Won't Fix with rationale. Remaining pending (4) are all InboundAPI's Database-helper findings (IA-026 High .. IA-029), left to the active feat/ipsen-movein effort per owner decision. Highlights: caught a central-only-delivery security drift (SMTP creds broadcast to sites — DM-025/SR-031), a never-committed 'Resolved' fix (SiteEventLogging-016 → -024), an unguarded KPI recorder tick (KH-001), a trust-analyzer fallback weakening (SA-001), and a native-alarm subscribe-path leak (DCL-023). ScriptAnalysis verdict: trust boundary is semantically sound (symbol-based) in the production cluster config. README regenerated; regen-readme.py --check passes (4 pending / 567 total).
2026-06-20 18:02:32 -04:00
parent fd618cf1dc
commit d39089f4ed
19 changed files with 4031 additions and 69 deletions
@@ -5,10 +5,10 @@
 | Module | `src/ZB.MOM.WW.ScadaBridge.StoreAndForward` |
 | Design doc | `docs/requirements/Component-StoreAndForward.md` |
 | Status | Reviewed |
-| Last reviewed | 2026-05-28 |
+| Last reviewed | 2026-06-20 |
 | Reviewer | claude-agent |
-| Commit reviewed | `1eb6e97` |
-| Open findings | 0 (3 Deferred: 002, 011, 012; all 5 Open from Re-review 2026-05-28 resolved 2026-05-28) |
+| Commit reviewed | `4307c381` |
+| Open findings | 0 (025 Resolved; 026, 027 Deferred; 5 Deferred total: 002, 011, 012, 026, 027; all prior Open findings resolved) |

 ## Summary

@@ -110,6 +110,39 @@ mid-flight `RetryPendingMessagesAsync` invocation continues using `_storage` and
 `_replication` after `StopAsync` returns; downstream resources disposed by the host
 shutdown sequence (the DI container) can then NRE through the still-running sweep.

+#### Re-review 2026-06-20 (commit `4307c381`) — full review
+
+Full re-review against commit `4307c381` with the same 10-category checklist. All prior
+fixes are intact: the StoreAndForward-024 stop-wait (the captured `_sweepTask` +
+bounded `StopAsync` wait), the StoreAndForward-020 capture-before-requeue, the
+StoreAndForward-005 conditional `UpdateMessageIfStatusAsync` writes, and the
+StoreAndForward-023 site-id sentinel all remain present and correct, and the new WP-14
+in-process `_bufferedCount` queue-depth counter shows **no drift** — every Pending-population
+transition (`BufferAsync` +1, retry-remove −1, conditional Pending→Parked −1, operator
+requeue +1) adjusts it exactly once and only when the underlying conditional storage write
+wins. The three findings this pass surfaces are not core-delivery defects: a **gauge-stop
+resource leak** (025 — the process-global queue-depth provider slot is never cleared on
+`StopAsync`, so a stopped service's frozen depth is reported indefinitely and the dead
+instance is pinned via the closure), plus **two test/doc items** — the gauge stop/lifecycle
+and the `EmitSiteEvent` retry / retry-delivered branches are untested (026), and the
+per-retry-per-sweep `"Retried"` site event is a spec-conformant design-vs-scale tension
+worth flagging (027).
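+
+The "adjust exactly once, only when the conditional write wins" rule can be illustrated
+with a minimal sketch. It assumes an in-memory dictionary as a stand-in for the real
+storage; `BufferedCountSketch` and `StatusSketch` are hypothetical names, and `TryUpdate`
+plays the role of `UpdateMessageIfStatusAsync`, whose actual signature is not shown in this
+review.
+
+```csharp
+using System.Collections.Concurrent;
+using System.Threading;
+
+public enum StatusSketch { Pending, Parked }
+
+// Hypothetical stand-in: models only the counter discipline, not the real service.
+public sealed class BufferedCountSketch
+{
+    private readonly ConcurrentDictionary<string, StatusSketch> _rows = new();
+    private long _bufferedCount;
+
+    public long BufferedCount => Interlocked.Read(ref _bufferedCount);
+
+    // New Pending row: +1, only if the insert actually happened.
+    public bool Buffer(string id)
+    {
+        if (!_rows.TryAdd(id, StatusSketch.Pending)) return false;
+        Interlocked.Increment(ref _bufferedCount);
+        return true;
+    }
+
+    // Pending -> Parked: -1, only if this caller won the conditional write.
+    public bool Park(string id)
+    {
+        if (!_rows.TryUpdate(id, StatusSketch.Parked, StatusSketch.Pending)) return false;
+        Interlocked.Decrement(ref _bufferedCount);
+        return true;
+    }
+
+    // Parked -> Pending (operator requeue): +1, again gated on the winning write.
+    public bool Requeue(string id)
+    {
+        if (!_rows.TryUpdate(id, StatusSketch.Pending, StatusSketch.Parked)) return false;
+        Interlocked.Increment(ref _bufferedCount);
+        return true;
+    }
+}
+```
+
+If a sweep and an operator both try to park the same row, only one `TryUpdate` succeeds,
+so the counter moves once; the loser returns `false` and leaves the count untouched.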
+
+## Checklist coverage — Re-review 2026-06-20
+
+| # | Category | Examined | Notes |
+|---|----------|----------|-------|
+| 1 | Correctness & logic bugs | ☑ | No new defects — `_bufferedCount` adjusts once per Pending transition, gated on the winning conditional write; retry-count/park semantics (003/005) intact. |
+| 2 | Akka.NET conventions | ☑ | No issues found — `ParkedMessageHandlerActor` `PipeTo` success/failure projections preserved (007). |
+| 3 | Concurrency & thread safety | ☑ | No new findings — StopAsync now awaits the captured `_sweepTask` (024); requeue captures the row up front (020); conditional writes guard sweep-vs-operator (005). |
+| 4 | Error handling & resilience | ☑ | No new findings — replication wired on all delivery + operator paths (001/016); notification parking documented (019). |
+| 5 | Security | ☑ | No issues found — parameterised SQL throughout; payload JSON opaque; no secret material handled. |
+| 6 | Performance & resource management | ☑ | Queue-depth gauge provider registered into a process-global slot is never cleared on `StopAsync` — frozen reading + dead-instance pin for the process lifetime (025). |
+| 7 | Design-document adherence | ☑ | The `"Retried"` site event fires once per row per sweep. This is spec-conformant (`Component-StoreAndForward.md:139`) but a flood risk given the unbounded buffer and the capped site log; flagged as a judgment call (027). |
+| 8 | Code organization & conventions | ☑ | No new findings — options/POCO placement unchanged from prior passes (011/012 still Deferred to Commons-owning changes). |
+| 9 | Testing coverage | ☑ | Gauge stop/lifecycle and the `EmitSiteEvent` retry / retry-delivered branches are untested (026). |
+| 10 | Documentation & comments | ☑ | No new findings — the `_bufferedCount` and StopAsync XML docs accurately describe the implemented behaviour. |
+
 ## Checklist coverage — Re-review 2026-05-28

 | # | Category | Examined | Notes |
@@ -1520,3 +1553,186 @@ mid-sweep and asserts no further storage activity occurs after `StopAsync` retur

 _Unresolved._

+### StoreAndForward-025 — Queue-depth gauge provider is never cleared on `StopAsync`, leaking a frozen reading and pinning the dead service
+
+| | |
+|--|--|
+| Severity | Medium |
+| Category | Performance & resource management |
+| Status | Resolved |
+| Location | `src/ZB.MOM.WW.ScadaBridge.StoreAndForward/StoreAndForwardService.cs:347`–`:351`, `:385`–`:425`; `src/ZB.MOM.WW.ScadaBridge.Commons/Observability/ScadaBridgeTelemetry.cs:55`–`:60`, `:85`–`:93` |
+
+**Description**
+
+The WP-14 queue-depth gauge is backed by a **process-global** provider slot.
+`StoreAndForwardService.StartAsync` seeds the in-process `_bufferedCount`, sets a
+one-time instance guard (`_queueDepthProviderRegistered`, line 347), and registers a
+closure that reads this instance's counter into the global slot
+(`ScadaBridgeTelemetry.SetQueueDepthProvider(() => Interlocked.Read(ref _bufferedCount))`,
+line 350). The `scadabridge.store_and_forward.queue.depth` `ObservableGauge`
+(`ScadaBridgeTelemetry.cs:56`–`:60`) invokes that provider on every collector scrape:
+`Volatile.Read(ref _queueDepthProvider) is { } p ? p() : 0L`.
+
+`SetQueueDepthProvider` (`ScadaBridgeTelemetry.cs:85`–`:93`) is the only mutator of the
+slot — there is **no** clear/reset method anywhere in the codebase (a repository-wide
+search for `ClearQueueDepthProvider` / writes to `_queueDepthProvider` finds only the
+registration). `StopAsync` (`StoreAndForwardService.cs:385`–`:425`) disposes the retry
+timer and (StoreAndForward-024) awaits the in-flight sweep, but it never touches the
+gauge: the global slot still points at the stopped instance's closure, and
+`_queueDepthProviderRegistered` is left at 1.
+
+Two consequences after a graceful stop:
+
+1. **Frozen reading.** The gauge keeps reporting the dead service's last `_bufferedCount`
+   value forever — a stopped (or failed-over-away) site node reports a stale, non-zero
+   queue depth that no longer corresponds to any live buffer, which is misleading on a
+   dashboard / alert built on the metric.
+2. **Instance pin (resource leak).** The captured closure holds a reference to the dead
+   `StoreAndForwardService`, so the stopped instance (and everything it transitively
+   roots) cannot be garbage-collected for the process lifetime. This is the classic
+   "static event / static delegate slot holding a finished object" leak. It also means a
+   later instance's `StartAsync` silently stomps the global slot (last-writer-wins), so
+   the gauge's identity is unmanaged.
+
+The instance guard's own XML doc (lines 156–164) already acknowledges the slot is
+process-global and "the last `StartAsync` wins the global slot" — but nothing on the stop
+path releases it, so a clean stop without an immediately-following start leaves a dangling
+provider.
+
+**Recommendation**
+
+Give `ScadaBridgeTelemetry` an identity-checked clear and call it from `StopAsync`. Add a
+compare-and-clear `ClearQueueDepthProvider(Func<long> expected)` that only nulls the slot
+when the current provider is reference-equal to `expected` (so a later instance that
+already re-registered is not stomped):
+
+```csharp
+public static void ClearQueueDepthProvider(Func<long> expected)
+{
+    Interlocked.CompareExchange(ref _queueDepthProvider, null, expected);
+}
+```
+
+In `StoreAndForwardService`, hold the registered provider delegate in a field at
+`StartAsync` time, and in `StopAsync` (only when this instance actually registered —
+`_queueDepthProviderRegistered == 1`) call
+`ScadaBridgeTelemetry.ClearQueueDepthProvider(_registeredProvider)` and reset
+`_queueDepthProviderRegistered` to 0 via `Interlocked.Exchange` so a subsequent restart
+re-seeds and re-registers cleanly. Add a regression test (extend `QueueDepthGaugeTests`)
+asserting that after `StopAsync` the gauge reports 0 (provider cleared) and that a second
+instance's registration is not clobbered by the first instance's stop.
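+
+A minimal sketch of how the service side of this recommendation could fit together with
+the compare-and-clear. `TelemetrySketch` and `ServiceSketch` are hypothetical stand-ins
+for `ScadaBridgeTelemetry` and `StoreAndForwardService`, the synchronous `Start`/`Stop`
+signatures are simplified, and `ReadQueueDepth` only imitates what the gauge callback
+reads on a scrape; none of this is the committed implementation.
+
+```csharp
+using System;
+using System.Threading;
+
+// Hypothetical stand-in: models only the process-global slot and its mutators.
+public static class TelemetrySketch
+{
+    private static Func<long>? _queueDepthProvider;
+
+    public static void SetQueueDepthProvider(Func<long> provider) =>
+        Volatile.Write(ref _queueDepthProvider, provider);
+
+    // Nulls the slot only if it still holds `expected` (reference equality).
+    public static void ClearQueueDepthProvider(Func<long> expected) =>
+        Interlocked.CompareExchange(ref _queueDepthProvider, null, expected);
+
+    // Imitates the gauge callback: 0 when no provider is registered.
+    public static long ReadQueueDepth() =>
+        Volatile.Read(ref _queueDepthProvider) is { } p ? p() : 0L;
+}
+
+public sealed class ServiceSketch
+{
+    private long _bufferedCount;
+    private int _queueDepthProviderRegistered;
+    private Func<long>? _registeredProvider; // hypothetical field holding the delegate
+
+    public void Start(long seed)
+    {
+        Interlocked.Add(ref _bufferedCount, seed);
+        if (Interlocked.CompareExchange(ref _queueDepthProviderRegistered, 1, 0) == 0)
+        {
+            // Keep the exact delegate instance; the clear compares by reference.
+            _registeredProvider = () => Interlocked.Read(ref _bufferedCount);
+            TelemetrySketch.SetQueueDepthProvider(_registeredProvider);
+        }
+    }
+
+    public void Stop()
+    {
+        // Only an instance that registered may clear, and only its own delegate.
+        if (Interlocked.Exchange(ref _queueDepthProviderRegistered, 0) == 1
+            && _registeredProvider is { } provider)
+        {
+            TelemetrySketch.ClearQueueDepthProvider(provider);
+            _registeredProvider = null;
+        }
+    }
+}
+```
+
+In this sketch, starting instance A with seed 3 and then instance B with seed 5 leaves
+the slot owned by B (last writer wins). Stopping A afterwards is a no-op on the slot,
+because A's delegate no longer matches, so `ReadQueueDepth()` still returns 5; stopping B
+clears the slot and the reading falls back to 0. Those are the two assertions the
+regression test would need.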
+
+**Resolution**
+
+Resolved 2026-06-20 (commit `fd618cf1`): added an identity-checked `ScadaBridgeTelemetry.ClearQueueDepthProvider(provider)` (compare-and-clear via `Interlocked.CompareExchange`, so a newer instance's provider is never stomped) and now call it in `StoreAndForwardService.StopAsync` (with `_queueDepthProviderRegistered` reset). The gauge no longer reports a frozen depth or pins the dead service after a graceful stop. Tests added (incl. the late-stop-doesn't-clobber-successor case).
+
+### StoreAndForward-026 — Gauge stop/lifecycle and the `EmitSiteEvent` retry / retry-delivered branches are untested
+
+| | |
+|--|--|
+| Severity | Low |
+| Category | Testing coverage |
+| Status | Deferred |
+| Location | `tests/ZB.MOM.WW.ScadaBridge.StoreAndForward.Tests/QueueDepthGaugeTests.cs`; `tests/ZB.MOM.WW.ScadaBridge.StoreAndForward.Tests/StoreAndForwardSiteEventTests.cs`; `src/ZB.MOM.WW.ScadaBridge.StoreAndForward/StoreAndForwardService.cs:295`–`:308`, `:747` |
+
+**Description**
+
+Two behaviours exercised by the current code have no test coverage:
+
+1. **Gauge stop / lifecycle.** `QueueDepthGaugeTests` covers the live counter across
+   enqueue / drain / park / requeue (`Gauge_TracksBufferedDepth_AcrossEnqueueDrainAndPark`),
+   the start-up seed from existing Pending rows (`Gauge_SeedsFromExistingPendingRows_OnStart`),
+   and the additive-seed race (`Gauge_SeedAddsToConcurrentPreSeedIncrement_NotClobber`).
+   It calls `StopAsync` only as test teardown (`DisposeAsync`) with no assertion — so the
+   gauge's stop behaviour is unverified. This is the same coverage gap that would have
+   caught StoreAndForward-025: no test asserts what the gauge reports after a service stops
+   (today: a stale frozen value; after the 025 fix: 0). The gauge's behaviour across a
+   stop, and across a second instance starting after a first stops, is entirely untested.
+
+2. **`EmitSiteEvent` retry / retry-delivered branches.** `StoreAndForwardSiteEventTests`
+   covers buffer-for-retry (`"Queued"`), park (`"Parked"` for both notification and
+   cached-call categories), routine-enqueue-no-event, and immediate-delivered-no-event.
+   But two `EmitSiteEvent` branches that the retry sweep drives are uncovered:
+   - the per-retry `"Retried"` action (`StoreAndForwardService.cs:747`), which maps to a
+     `store_and_forward` site event at `Warning` severity (the `_ => "Warning"` arm of the
+     severity switch, lines 298–303);
+   - the retry-loop `"Delivered"` action whose detail is `"Delivered to … after N retries"`
+     (lines 644–645), which — because the detail does **not** start with `"Immediate"` —
+     is **not** short-circuited at line 268 and instead emits a `store_and_forward` event
+     at `Info` severity (the `"Delivered" => "Info"` arm, line 301). This "recovery
+     recorded" path is the only one that produces an `Info`-severity S&F site event, and
+     nothing asserts it fires (or that the immediate-delivered path is correctly suppressed
+     by contrast).
+
+**Recommendation**
+
+Add a gauge-stop test to `QueueDepthGaugeTests`: enqueue to a known non-zero depth, call
+`StopAsync`, and assert the gauge reports the cleared value (0 after the StoreAndForward-025
+fix) — and, ideally, that a second `StoreAndForwardService` started afterward owns the slot
+without being clobbered by the first instance's stop. Add two `StoreAndForwardSiteEventTests`
+cases: (a) a transient handler with `MaxRetries > 1` driven through one sweep, asserting a
+`store_and_forward` / `Warning` event whose message contains `"retried"`; and (b) a handler
+that fails on the immediate attempt then succeeds on the sweep, asserting a `store_and_forward`
+/ `Info` event whose message contains `"delivered"` (recording the recovery), distinct from
+the suppressed immediate-delivery path.
+
+**Resolution**
+
+Deferred 2026-06-20: additional gauge-lifecycle and `EmitSiteEvent` retry-branch tests are low-value coverage gaps; recorded for a follow-up.
+
+### StoreAndForward-027 — Per-message `"Retried"` site event fires per-row-per-sweep, risking a flood of the capped site event log
+
+| | |
+|--|--|
+| Severity | Low |
+| Category | Design-document adherence |
+| Status | Deferred |
+| Location | `src/ZB.MOM.WW.ScadaBridge.StoreAndForward/StoreAndForwardService.cs:734`–`:759`, `:295`–`:308`; `docs/requirements/Component-StoreAndForward.md:72`, `:139` |
+
+**Description**
+
+When `_siteEventLogger` is wired, `EmitSiteEvent` is subscribed to `OnActivity`, so every
+`RaiseActivity("Retried", …)` call in `RetryMessageAsync`'s transient-not-yet-max branch
+(`StoreAndForwardService.cs:747`–`:748`) produces one `store_and_forward` site event at
+`Warning` severity (lines 295–308). That branch fires **once per still-pending row per
+retry sweep**. Combined with the design's deliberate "**no maximum buffer size**"
+(`Component-StoreAndForward.md:72`), a sustained outage can buffer a large Pending
+population, and each fixed-interval sweep then emits a `"Retried"` event for every one of
+those rows — `(pending rows) × (sweeps)` events. The site event log is **capped** (default
+1 GB, 30-day retention, with oldest-first purge when the cap is hit before the retention
+window — `Component-SiteEventLogging.md:45`–`:46`), so a large, long-lived backlog can
+churn the log and evict genuinely useful operational history (deployments, connection
+events, alarms) well before the 30-day window.
+
+**This is a judgment call, not a clear bug.** The behaviour is **spec-conformant**:
+`Component-StoreAndForward.md:139` explicitly lists "retried" as a logged store-and-forward
+activity ("Logs store-and-forward activity (queued, delivered, retried, parked)"), so the
+code is doing exactly what the design doc says. The tension is between that
+per-retry-logging contract and the combination of an unbounded buffer + a capped site log:
+at scale the "retried" stream can dominate the log. The site-event-log cap bounds the
+absolute disk risk (it cannot exhaust disk), so the residual concern is **eviction of
+other history**, not unbounded growth — hence Low and flagged as a design-vs-scale
+decision rather than a defect.
+
+**Recommendation**
+
+Reconcile the design-vs-scale tension deliberately; this is a judgment call for the owners,
+not a forced fix. Options, in rough order of preference:
+
+1. **Drop or coarsen the per-retry event.** Stop emitting a `store_and_forward` event for
+   the routine `"Retried"` action (keep `"Queued"`, `"Parked"`, and the retry-recovery
+   `"Delivered"`), since the per-attempt detail is already available via the
+   `OnActivity`/health/telemetry surfaces and the central audit log. The first-buffer and
+   the terminal park/deliver events are the operationally interesting ones.
+2. **Sample / threshold it.** Emit a `"Retried"` event only on the first retry, on a count
+   boundary (e.g. every Nth retry), or once per sweep as a rolled-up summary
+   ("retried N messages this sweep") rather than one per row.
+3. **Document the volume risk.** If per-retry logging is intentionally retained, add a note
+   to `Component-StoreAndForward.md` / `Component-SiteEventLogging.md` that a large buffered
+   backlog can dominate the capped site event log and evict other history, so operators
+   size the cap / retention accordingly.
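+
+Option 2 can be sketched as two small pure functions. This is an illustration under
+assumptions, not a proposed patch: `RetryEventPolicySketch`, both method names, and the
+`everyNth` parameter are hypothetical, and how the result would be wired into
+`EmitSiteEvent` is left open.
+
+```csharp
+using System;
+
+// Hypothetical policy helpers for thinning the "Retried" site-event stream.
+public static class RetryEventPolicySketch
+{
+    // Per-row threshold: emit on the first retry and on every Nth retry after that.
+    public static bool ShouldEmitRetried(int retryCount, int everyNth)
+    {
+        if (retryCount < 1) throw new ArgumentOutOfRangeException(nameof(retryCount));
+        if (everyNth < 1) throw new ArgumentOutOfRangeException(nameof(everyNth));
+        return retryCount == 1 || retryCount % everyNth == 0;
+    }
+
+    // Per-sweep roll-up: one summary message per sweep, or null if nothing was retried.
+    public static string? SummariseSweep(int retriedThisSweep) =>
+        retriedThisSweep > 0
+            ? $"Retried {retriedThisSweep} messages this sweep"
+            : null;
+}
+```
+
+With `everyNth = 10`, a row that is retried 30 times would emit on retries 1, 10, 20 and
+30 (4 events rather than 30). The roll-up variant bounds the stream differently: at most
+one event per sweep regardless of backlog size, at the cost of losing per-message detail
+in the site log.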
+
+**Resolution**
+
+Deferred 2026-06-20: the per-retry `"Retried"` site-event volume is spec-conformant (Component-StoreAndForward.md lists 'retried' as a logged activity) and the site log is capped, so the residual concern is history eviction, not disk exhaustion. Whether to drop/sample/threshold the event vs. document the volume is a design-vs-scale decision for the owner.
+