docs(code-review): full review at 4307c381 — 18 modules, 67 findings recorded + remediation tracked

Full per-module re-review of the 16 stale modules (last seen 1eb6e97 / 2026-05-28)
plus first-ever reviews of KpiHistory (#26) and ScriptAnalysis (#25), at HEAD 4307c381.

67 new findings (0 Critical, 6 High, 27 Medium, 34 Low). Remediation in commit
fd618cf1 closed 5 of the 6 Highs and ~33 Medium/Low; the rest are Deferred/Won't Fix
with rationale. Remaining pending (4) are all InboundAPI's Database-helper findings
(IA-026 High .. IA-029), left to the active feat/ipsen-movein effort per owner decision.

Highlights: caught a central-only-delivery security drift (SMTP creds broadcast to
sites — DM-025/SR-031), a never-committed 'Resolved' fix (SiteEventLogging-016 → -024),
an unguarded KPI recorder tick (KH-001), a trust-analyzer fallback weakening (SA-001),
and a native-alarm subscribe-path leak (DCL-023). ScriptAnalysis verdict: trust boundary
is semantically sound (symbol-based) in the production cluster config.

README regenerated; regen-readme.py --check passes (4 pending / 567 total).
Joseph Doherty
2026-06-20 18:02:32 -04:00
parent fd618cf1dc
commit d39089f4ed
19 changed files with 4031 additions and 69 deletions
+219 -3
@@ -5,10 +5,10 @@
| Module | `src/ZB.MOM.WW.ScadaBridge.StoreAndForward` |
| Design doc | `docs/requirements/Component-StoreAndForward.md` |
| Status | Reviewed |
| Last reviewed | 2026-05-28 |
| Last reviewed | 2026-06-20 |
| Reviewer | claude-agent |
| Commit reviewed | `1eb6e97` |
| Open findings | 0 (3 Deferred: 002, 011, 012; all 5 Open from Re-review 2026-05-28 resolved 2026-05-28) |
| Commit reviewed | `4307c381` |
| Open findings | 0 (025 Resolved; 026, 027 Deferred; 5 Deferred total: 002, 011, 012, 026, 027; all prior Open findings resolved) |
## Summary
@@ -110,6 +110,39 @@ mid-flight `RetryPendingMessagesAsync` invocation continues using `_storage` and
`_replication` after `StopAsync` returns; downstream resources disposed by the host
shutdown sequence (the DI container) can then NRE through the still-running sweep.
#### Re-review 2026-06-20 (commit `4307c381`) — full review
Full re-review against commit `4307c381` with the same 10-category checklist. All prior
fixes are intact: the StoreAndForward-024 stop-wait (the captured `_sweepTask` +
bounded `StopAsync` wait), the StoreAndForward-020 capture-before-requeue, the
StoreAndForward-005 conditional `UpdateMessageIfStatusAsync` writes, and the
StoreAndForward-023 site-id sentinel all remain present and correct, and the new WP-14
in-process `_bufferedCount` queue-depth counter shows **no drift** — every Pending-population
transition (`BufferAsync` +1, retry-remove −1, conditional Pending→Parked −1, operator
requeue +1) adjusts it exactly once and only when the underlying conditional storage write
wins. The three findings this pass surfaces are not core-delivery defects: a **gauge-stop
resource leak** (025 — the process-global queue-depth provider slot is never cleared on
`StopAsync`, so a stopped service's frozen depth is reported indefinitely and the dead
instance is pinned via the closure), plus **two test/doc items** — the gauge stop/lifecycle
and the `EmitSiteEvent` retry / retry-delivered branches are untested (026), and the
per-row-per-sweep `"Retried"` site event is a spec-conformant design-vs-scale tension
worth flagging (027).
## Checklist coverage — Re-review 2026-06-20
| # | Category | Examined | Notes |
|---|----------|----------|-------|
| 1 | Correctness & logic bugs | ☑ | No new defects — `_bufferedCount` adjusts once per Pending transition, gated on the winning conditional write; retry-count/park semantics (003/005) intact. |
| 2 | Akka.NET conventions | ☑ | No issues found — `ParkedMessageHandlerActor` `PipeTo` success/failure projections preserved (007). |
| 3 | Concurrency & thread safety | ☑ | No new findings — StopAsync now awaits the captured `_sweepTask` (024); requeue captures the row up front (020); conditional writes guard sweep-vs-operator (005). |
| 4 | Error handling & resilience | ☑ | No new findings — replication wired on all delivery + operator paths (001/016); notification parking documented (019). |
| 5 | Security | ☑ | No issues found — parameterised SQL throughout; payload JSON opaque; no secret material handled. |
| 6 | Performance & resource management | ☑ | Queue-depth gauge provider registered into a process-global slot is never cleared on `StopAsync` — frozen reading + dead-instance pin for the process lifetime (025). |
| 7 | Design-document adherence | ☑ | Per-retry `"Retried"` site event per-row-per-sweep is spec-conformant (`Component-StoreAndForward.md:139`) but a flood risk against the unbounded buffer + capped site log — flagged as a judgment call (027). |
| 8 | Code organization & conventions | ☑ | No new findings — options/POCO placement unchanged from prior passes (011/012 still Deferred to Commons-owning changes). |
| 9 | Testing coverage | ☑ | Gauge stop/lifecycle and the `EmitSiteEvent` retry / retry-delivered branches are untested (026). |
| 10 | Documentation & comments | ☑ | No new findings — the `_bufferedCount` and StopAsync XML docs accurately describe the implemented behaviour. |
## Checklist coverage — Re-review 2026-05-28
| # | Category | Examined | Notes |
@@ -1520,3 +1553,186 @@ mid-sweep and asserts no further storage activity occurs after `StopAsync` retur
_Unresolved._
### StoreAndForward-025 — Queue-depth gauge provider is never cleared on `StopAsync`, leaking a frozen reading and pinning the dead service
| | |
|--|--|
| Severity | Medium |
| Category | Performance & resource management |
| Status | Resolved |
| Location | `src/ZB.MOM.WW.ScadaBridge.StoreAndForward/StoreAndForwardService.cs:347`–`:351`, `:385`–`:425`; `src/ZB.MOM.WW.ScadaBridge.Commons/Observability/ScadaBridgeTelemetry.cs:55`–`:60`, `:85`–`:93` |
**Description**
The WP-14 queue-depth gauge is backed by a **process-global** provider slot.
`StoreAndForwardService.StartAsync` seeds the in-process `_bufferedCount`, sets a
one-time instance guard (`_queueDepthProviderRegistered`, line 347), and registers a
closure that reads this instance's counter into the global slot
(`ScadaBridgeTelemetry.SetQueueDepthProvider(() => Interlocked.Read(ref _bufferedCount))`,
line 350). The `scadabridge.store_and_forward.queue.depth` `ObservableGauge`
(`ScadaBridgeTelemetry.cs:56`–`:60`) invokes that provider on every collector scrape:
`Volatile.Read(ref _queueDepthProvider) is { } p ? p() : 0L`.
`SetQueueDepthProvider` (`ScadaBridgeTelemetry.cs:85`–`:93`) is the only mutator of the
slot — there is **no** clear/reset method anywhere in the codebase (a repository-wide
search for `ClearQueueDepthProvider` / writes to `_queueDepthProvider` finds only the
registration). `StopAsync` (`StoreAndForwardService.cs:385`–`:425`) disposes the retry
timer and (StoreAndForward-024) awaits the in-flight sweep, but it never touches the
gauge: the global slot still points at the stopped instance's closure, and
`_queueDepthProviderRegistered` is left at 1.
Two consequences after a graceful stop:
1. **Frozen reading.** The gauge keeps reporting the dead service's last `_bufferedCount`
value forever — a stopped (or failed-over-away) site node reports a stale, non-zero
queue depth that no longer corresponds to any live buffer, which is misleading on a
dashboard / alert built on the metric.
2. **Instance pin (resource leak).** The captured closure holds a reference to the dead
`StoreAndForwardService`, so the stopped instance (and everything it transitively
roots) cannot be garbage-collected for the process lifetime. This is the classic
"static event / static delegate slot holding a finished object" leak. It also means a
later instance's `StartAsync` silently stomps the global slot (last-writer-wins), so
the gauge's identity is unmanaged.
The instance guard's own XML doc (lines 156–164) already acknowledges the slot is
process-global and "the last `StartAsync` wins the global slot" — but nothing on the stop
path releases it, so a clean stop without an immediately-following start leaves a dangling
provider.
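
The frozen-reading consequence can be reproduced in isolation. The following is a minimal sketch, not the module's code: `FrozenGaugeTelemetry` and `FrozenGaugeService` are hypothetical stand-ins that mirror only the slot, the registration closure, and the quoted gauge callback, and the seeding is simplified to a plain overwrite.

```csharp
using System;
using System.Threading;

// Hypothetical stand-in for the process-global provider slot.
public static class FrozenGaugeTelemetry
{
    private static Func<long>? _queueDepthProvider;

    // The only mutator, as in the pre-fix code described above.
    public static void SetQueueDepthProvider(Func<long> provider) =>
        Volatile.Write(ref _queueDepthProvider, provider);

    // Mirrors the quoted gauge callback: an empty slot reads as 0.
    public static long ReadQueueDepth() =>
        Volatile.Read(ref _queueDepthProvider) is { } p ? p() : 0L;
}

// Hypothetical stand-in for the service; only the counter and start/stop matter here.
public sealed class FrozenGaugeService
{
    private long _bufferedCount;

    public void Start(long seed)
    {
        // Simplified seeding (assumption): overwrite instead of an additive seed.
        Interlocked.Exchange(ref _bufferedCount, seed);

        // The closure captures this instance, so the static slot roots it.
        FrozenGaugeTelemetry.SetQueueDepthProvider(() => Interlocked.Read(ref _bufferedCount));
    }

    // Pre-fix stop: the global slot is left untouched.
    public void Stop()
    {
    }
}
```

In this sketch, starting a `FrozenGaugeService` with a seed of 7 and then stopping it leaves `FrozenGaugeTelemetry.ReadQueueDepth()` at 7, because nothing on the stop path releases the slot.
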
**Recommendation**
Give `ScadaBridgeTelemetry` an identity-checked clear and call it from `StopAsync`. Add a
compare-and-clear `ClearQueueDepthProvider(Func<long> expected)` that only nulls the slot
when the current provider is reference-equal to `expected` (so a later instance that
already re-registered is not stomped):
```csharp
public static void ClearQueueDepthProvider(Func<long> expected)
{
Interlocked.CompareExchange(ref _queueDepthProvider, null, expected);
}
```
In `StoreAndForwardService`, hold the registered provider delegate in a field at
`StartAsync` time, and in `StopAsync` (only when this instance actually registered —
`_queueDepthProviderRegistered == 1`) call
`ScadaBridgeTelemetry.ClearQueueDepthProvider(_registeredProvider)` and reset
`_queueDepthProviderRegistered` to 0 via `Interlocked.Exchange` so a subsequent restart
re-seeds and re-registers cleanly. Add a regression test (extend `QueueDepthGaugeTests`)
asserting that after `StopAsync` the gauge reports 0 (provider cleared) and that a second
instance's registration is not clobbered by the first instance's stop.
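
The service side of this recommendation might look roughly like the sketch below. It is an illustration under stated assumptions, not the committed fix: `GaugeTelemetrySketch` and `GaugeOwnerSketch` are hypothetical types, `_registeredProvider` is the field name suggested above, and seeding is simplified to an overwrite.

```csharp
using System;
using System.Threading;

// Hypothetical stand-in for the telemetry slot with the compare-and-clear added.
public static class GaugeTelemetrySketch
{
    private static Func<long>? _queueDepthProvider;

    public static void SetQueueDepthProvider(Func<long> provider) =>
        Volatile.Write(ref _queueDepthProvider, provider);

    // Nulls the slot only if it still holds 'expected' (reference equality).
    public static void ClearQueueDepthProvider(Func<long> expected) =>
        Interlocked.CompareExchange(ref _queueDepthProvider, null, expected);

    public static long ReadQueueDepth() =>
        Volatile.Read(ref _queueDepthProvider) is { } p ? p() : 0L;
}

// Hypothetical stand-in for the service's start/stop handling of the gauge.
public sealed class GaugeOwnerSketch
{
    private long _bufferedCount;
    private int _queueDepthProviderRegistered;
    private Func<long>? _registeredProvider;

    public void Start(long seed)
    {
        // Simplified seeding (assumption): overwrite instead of an additive seed.
        Interlocked.Exchange(ref _bufferedCount, seed);

        // Register once per start/stop cycle and keep the exact delegate instance.
        if (Interlocked.Exchange(ref _queueDepthProviderRegistered, 1) == 0)
        {
            _registeredProvider = () => Interlocked.Read(ref _bufferedCount);
            GaugeTelemetrySketch.SetQueueDepthProvider(_registeredProvider);
        }
    }

    public void Stop()
    {
        // Clear only if this instance registered, then reset the guard for a restart.
        if (Interlocked.Exchange(ref _queueDepthProviderRegistered, 0) == 1
            && _registeredProvider is { } provider)
        {
            GaugeTelemetrySketch.ClearQueueDepthProvider(provider);
            _registeredProvider = null;
        }
    }
}
```

Holding the same delegate instance that was registered is what makes the compare-and-clear work: if a second instance has already taken the slot, the first instance's late `Stop` compares against a different reference and leaves the successor's provider in place.
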
**Resolution**
Resolved 2026-06-20 (commit `fd618cf1`): added an identity-checked `ScadaBridgeTelemetry.ClearQueueDepthProvider(provider)` (compare-and-clear via `Interlocked.CompareExchange`, so a newer instance's provider is never stomped) and call it in `StoreAndForwardService.StopAsync` (with `_queueDepthProviderRegistered` reset). The gauge no longer reports a frozen depth or pins the dead service after a graceful stop. Tests added (incl. the late-stop-doesn't-clobber-successor case).
### StoreAndForward-026 — Gauge stop/lifecycle and the `EmitSiteEvent` retry / retry-delivered branches are untested
| | |
|--|--|
| Severity | Low |
| Category | Testing coverage |
| Status | Deferred |
| Location | `tests/ZB.MOM.WW.ScadaBridge.StoreAndForward.Tests/QueueDepthGaugeTests.cs`; `tests/ZB.MOM.WW.ScadaBridge.StoreAndForward.Tests/StoreAndForwardSiteEventTests.cs`; `src/ZB.MOM.WW.ScadaBridge.StoreAndForward/StoreAndForwardService.cs:295`–`:308`, `:747` |
**Description**
Two behaviours exercised by the current code have no test coverage:
1. **Gauge stop / lifecycle.** `QueueDepthGaugeTests` covers the live counter across
enqueue / drain / park / requeue (`Gauge_TracksBufferedDepth_AcrossEnqueueDrainAndPark`),
the start-up seed from existing Pending rows (`Gauge_SeedsFromExistingPendingRows_OnStart`),
and the additive-seed race (`Gauge_SeedAddsToConcurrentPreSeedIncrement_NotClobber`).
It calls `StopAsync` only as test teardown (`DisposeAsync`) with no assertion — so the
gauge's stop behaviour is unverified. This is the same coverage gap that would have
caught StoreAndForward-025: no test asserts what the gauge reports after a service stops
(today: a stale frozen value; after the 025 fix: 0). The gauge's behaviour across a
stop, and across a second instance starting after a first stops, is entirely untested.
2. **`EmitSiteEvent` retry / retry-delivered branches.** `StoreAndForwardSiteEventTests`
covers buffer-for-retry (`"Queued"`), park (`"Parked"` for both notification and
cached-call categories), routine-enqueue-no-event, and immediate-delivered-no-event.
But two `EmitSiteEvent` branches that the retry sweep drives are uncovered:
- the per-retry `"Retried"` action (`StoreAndForwardService.cs:747`), which maps to a
`store_and_forward` site event at `Warning` severity (the `_ => "Warning"` arm of the
     severity switch, lines 298–303);
   - the retry-loop `"Delivered"` action whose detail is `"Delivered to … after N retries"`
     (lines 644–645), which, because the detail does **not** start with `"Immediate"`,
     is **not** short-circuited at line 268 and instead emits a `store_and_forward` event
at `Info` severity (the `"Delivered" => "Info"` arm, line 301). This "recovery
recorded" path is the only one that produces an `Info`-severity S&F site event, and
nothing asserts it fires (or that the immediate-delivered path is correctly suppressed
by contrast).
**Recommendation**
Add a gauge-stop test to `QueueDepthGaugeTests`: enqueue to a known non-zero depth, call
`StopAsync`, and assert the gauge reports the cleared value (0 after the StoreAndForward-025
fix) — and, ideally, that a second `StoreAndForwardService` started afterward owns the slot
without being clobbered by the first instance's stop. Add two `StoreAndForwardSiteEventTests`
cases: (a) a transient handler with `MaxRetries > 1` driven through one sweep, asserting a
`store_and_forward` / `Warning` event whose message contains `"retried"`; and (b) a handler
that fails on the immediate attempt then succeeds on the sweep, asserting a `store_and_forward`
/ `Info` event whose message contains `"delivered"` (recording the recovery), distinct from
the suppressed immediate-delivery path.
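
One way a gauge-stop test could observe the value is through the standard `MeterListener` API. The helper below is a sketch only: `GaugeProbe` is a hypothetical name, the meter and instrument names are supplied by the caller, and the existing `QueueDepthGaugeTests` may read the gauge differently.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics.Metrics;

// Hypothetical test helper: samples an observable long instrument once.
public static class GaugeProbe
{
    // Returns the observed value, or null if no matching instrument reported.
    public static long? ReadGauge(string meterName, string instrumentName)
    {
        long? observed = null;

        using var listener = new MeterListener();

        // Subscribe only to the one instrument the test cares about.
        listener.InstrumentPublished = (instrument, l) =>
        {
            if (instrument.Meter.Name == meterName && instrument.Name == instrumentName)
            {
                l.EnableMeasurementEvents(instrument);
            }
        };

        listener.SetMeasurementEventCallback<long>(
            (instrument, measurement, tags, state) => observed = measurement);

        listener.Start();

        // Forces observable instruments to invoke their callbacks now.
        listener.RecordObservableInstruments();

        return observed;
    }
}
```

A stop test along the lines recommended above would enqueue to a known depth, call `StopAsync`, and then assert that a helper like this returns 0 for the queue-depth instrument.
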
**Resolution**
Deferred 2026-06-20: additional gauge-lifecycle and `EmitSiteEvent` retry-branch tests are low-value coverage gaps; recorded for a follow-up.
### StoreAndForward-027 — Per-message `"Retried"` site event fires per-row-per-sweep, risking a flood of the capped site event log
| | |
|--|--|
| Severity | Low |
| Category | Design-document adherence |
| Status | Deferred |
| Location | `src/ZB.MOM.WW.ScadaBridge.StoreAndForward/StoreAndForwardService.cs:734`–`:759`, `:295`–`:308`; `docs/requirements/Component-StoreAndForward.md:72`, `:139` |
**Description**
When `_siteEventLogger` is wired, `EmitSiteEvent` is subscribed to `OnActivity`, so every
`RaiseActivity("Retried", …)` call in `RetryMessageAsync`'s transient-not-yet-max branch
(`StoreAndForwardService.cs:747`–`:748`) produces one `store_and_forward` site event at
`Warning` severity (lines 295–308). That branch fires **once per still-pending row per
retry sweep**. Combined with the design's deliberate "**no maximum buffer size**"
(`Component-StoreAndForward.md:72`), a sustained outage can buffer a large Pending
population, and each fixed-interval sweep then emits a `"Retried"` event for every one of
those rows — `(pending rows) × (sweeps)` events. The site event log is **capped** (default
1 GB, 30-day retention, with oldest-first purge when the cap is hit before the retention
window — `Component-SiteEventLogging.md:45`–`:46`), so a large, long-lived backlog can
churn the log and evict genuinely useful operational history (deployments, connection
events, alarms) well before the 30-day window.
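
To get a feel for the `(pending rows) × (sweeps)` volume, the estimate can be written as a small helper. This is a back-of-the-envelope sketch: `RetriedEventEstimate` is a hypothetical name, and it assumes a constant Pending population in which every row takes the transient-not-yet-max branch on every sweep.

```csharp
using System;

// Hypothetical helper for sizing the "Retried" event stream.
public static class RetriedEventEstimate
{
    // Events = pending rows x number of whole sweeps that fit in the outage.
    public static long Estimate(long pendingRows, TimeSpan outage, TimeSpan sweepInterval)
    {
        if (sweepInterval <= TimeSpan.Zero)
        {
            throw new ArgumentOutOfRangeException(nameof(sweepInterval));
        }

        long sweeps = outage.Ticks / sweepInterval.Ticks;
        return checked(pendingRows * sweeps);
    }
}
```

With purely illustrative inputs (10,000 pending rows, a one-hour outage, and a 30-second sweep interval, none of which come from the design doc), this gives 120 sweeps and 1,200,000 `"Retried"` events.
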
**This is a judgment call, not a clear bug.** The behaviour is **spec-conformant**:
`Component-StoreAndForward.md:139` explicitly lists "retried" as a logged store-and-forward
activity ("Logs store-and-forward activity (queued, delivered, retried, parked)"), so the
code is doing exactly what the design doc says. The tension is between that
per-retry-logging contract and the combination of an unbounded buffer + a capped site log:
at scale the "retried" stream can dominate the log. The site-event-log cap bounds the
absolute disk risk (it cannot exhaust disk), so the residual concern is **eviction of
other history**, not unbounded growth — hence Low and flagged as a design-vs-scale
decision rather than a defect.
**Recommendation**
Reconcile the design-vs-scale tension deliberately; this is a judgment call for the owners,
not a forced fix. Options, in rough order of preference:
1. **Drop or coarsen the per-retry event.** Stop emitting a `store_and_forward` event for
the routine `"Retried"` action (keep `"Queued"`, `"Parked"`, and the retry-recovery
`"Delivered"`), since the per-attempt detail is already available via the
`OnActivity`/health/telemetry surfaces and the central audit log. The first-buffer and
the terminal park/deliver events are the operationally interesting ones.
2. **Sample / threshold it.** Emit a `"Retried"` event only on the first retry, on a count
boundary (e.g. every Nth retry), or once per sweep as a rolled-up summary
("retried N messages this sweep") rather than one per row.
3. **Document the volume risk.** If per-retry logging is intentionally retained, add a note
to `Component-StoreAndForward.md` / `Component-SiteEventLogging.md` that a large buffered
backlog can dominate the capped site event log and evict other history, so operators
size the cap / retention accordingly.
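
Option 2's rolled-up summary could be as small as a per-sweep counter that is flushed once when the sweep ends. The sketch below is illustrative only: `RetrySweepSummary` is a hypothetical type, and wiring it into `RetryMessageAsync` and `EmitSiteEvent` is left out.

```csharp
using System;
using System.Threading;

// Hypothetical accumulator: one summary per sweep instead of one event per row.
public sealed class RetrySweepSummary
{
    private int _retried;

    // Called where the per-row "Retried" activity is raised today.
    public void RecordRetry() => Interlocked.Increment(ref _retried);

    // Called once at the end of a sweep; returns null when nothing was retried.
    public string? Flush()
    {
        int count = Interlocked.Exchange(ref _retried, 0);
        return count == 0 ? null : $"retried {count} messages this sweep";
    }
}
```

Under this shape the site event log would receive at most one `"Retried"` entry per sweep regardless of backlog size, at the cost of losing per-message detail in the site log.
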
**Resolution**
Deferred 2026-06-20: the per-retry `"Retried"` site-event volume is spec-conformant (Component-StoreAndForward.md lists 'retried' as a logged activity) and the site log is capped, so the residual concern is history eviction, not disk exhaustion. Whether to drop/sample/threshold the event vs. document the volume is a design-vs-scale decision for the owner.