docs(code-review): full review at 4307c381 — 18 modules, 67 findings recorded + remediation tracked
Full per-module re-review of the 16 stale modules (last seen 1eb6e97 / 2026-05-28) plus first-ever reviews of KpiHistory (#26) and ScriptAnalysis (#25), at HEAD 4307c381. 67 new findings (0 Critical, 6 High, 27 Medium, 34 Low). Remediation in commit fd618cf1 closed 5 of the 6 Highs and ~33 Medium/Low; the rest are Deferred/Won't Fix with rationale. Remaining pending (4) are all InboundAPI's Database-helper findings (IA-026 High .. IA-029), left to the active feat/ipsen-movein effort per owner decision. Highlights: caught a central-only-delivery security drift (SMTP creds broadcast to sites — DM-025/SR-031), a never-committed 'Resolved' fix (SiteEventLogging-016 → -024), an unguarded KPI recorder tick (KH-001), a trust-analyzer fallback weakening (SA-001), and a native-alarm subscribe-path leak (DCL-023). ScriptAnalysis verdict: trust boundary is semantically sound (symbol-based) in the production cluster config. README regenerated; regen-readme.py --check passes (4 pending / 567 total).
@@ -5,10 +5,10 @@
| Module | `src/ZB.MOM.WW.ScadaBridge.StoreAndForward` |
| Design doc | `docs/requirements/Component-StoreAndForward.md` |
| Status | Reviewed |
| Last reviewed | 2026-05-28 |
| Last reviewed | 2026-06-20 |
| Reviewer | claude-agent |
| Commit reviewed | `1eb6e97` |
| Open findings | 0 (3 Deferred: 002, 011, 012; all 5 Open from Re-review 2026-05-28 resolved 2026-05-28) |
| Commit reviewed | `4307c381` |
| Open findings | 0 (025 Resolved; 026, 027 Deferred; 5 Deferred total: 002, 011, 012, 026, 027; all prior Open findings resolved) |

## Summary

@@ -110,6 +110,39 @@ mid-flight `RetryPendingMessagesAsync` invocation continues using `_storage` and
`_replication` after `StopAsync` returns; downstream resources disposed by the host
shutdown sequence (the DI container) can then NRE through the still-running sweep.

#### Re-review 2026-06-20 (commit `4307c381`) — full review

Full re-review against commit `4307c381` with the same 10-category checklist. All prior
fixes are intact: the StoreAndForward-024 stop-wait (the captured `_sweepTask` +
bounded `StopAsync` wait), the StoreAndForward-020 capture-before-requeue, the
StoreAndForward-005 conditional `UpdateMessageIfStatusAsync` writes, and the
StoreAndForward-023 site-id sentinel all remain present and correct, and the new WP-14
in-process `_bufferedCount` queue-depth counter shows **no drift** — every Pending-population
transition (`BufferAsync` +1, retry-remove −1, conditional Pending→Parked −1, operator
requeue +1) adjusts it exactly once and only when the underlying conditional storage write
wins. The three findings this pass surfaces are not core-delivery defects: a **gauge-stop
resource leak** (025 — the process-global queue-depth provider slot is never cleared on
`StopAsync`, so a stopped service's frozen depth is reported indefinitely and the dead
instance is pinned via the closure), plus **two test/doc items** — the gauge stop/lifecycle
and the `EmitSiteEvent` retry / retry-delivered branches are untested (026), and the
per-retry-per-sweep `"Retried"` site event is a spec-conformant design-vs-scale tension
worth flagging (027).
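
The "exactly once, and only when the conditional write wins" rule can be illustrated with a minimal sketch. `FakeRow` and `DepthCounter` are hypothetical stand-ins invented for illustration; they are not the service's real storage layer or counter code, and only the Pending→Parked transition is modelled.

```csharp
using System.Threading;

// Hypothetical stand-in for one stored message row. The real service uses a
// conditional storage write; here an atomic status flag plays that role.
public sealed class FakeRow
{
    private int _status; // 0 = Pending, 1 = Parked

    // Returns true only for the single caller that wins Pending -> Parked.
    public bool TryPark() => Interlocked.CompareExchange(ref _status, 1, 0) == 0;
}

// Hypothetical stand-in for the in-process queue-depth counter.
public sealed class DepthCounter
{
    private long _bufferedCount;

    public long Value => Interlocked.Read(ref _bufferedCount);

    public void OnBuffered() => Interlocked.Increment(ref _bufferedCount);

    // Decrement only when the conditional write won, so a sweep and an
    // operator racing on the same row cannot both adjust the counter.
    public void OnParkAttempt(FakeRow row)
    {
        if (row.TryPark())
        {
            Interlocked.Decrement(ref _bufferedCount);
        }
    }
}
```

In this sketch a second park attempt on the same row loses the conditional write and leaves the counter untouched, which is the property the review checked for each transition.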

## Checklist coverage — Re-review 2026-06-20

| # | Category | Examined | Notes |
|---|----------|----------|-------|
| 1 | Correctness & logic bugs | ☑ | No new defects — `_bufferedCount` adjusts once per Pending transition, gated on the winning conditional write; retry-count/park semantics (003/005) intact. |
| 2 | Akka.NET conventions | ☑ | No issues found — `ParkedMessageHandlerActor` `PipeTo` success/failure projections preserved (007). |
| 3 | Concurrency & thread safety | ☑ | No new findings — StopAsync now awaits the captured `_sweepTask` (024); requeue captures the row up front (020); conditional writes guard sweep-vs-operator (005). |
| 4 | Error handling & resilience | ☑ | No new findings — replication wired on all delivery + operator paths (001/016); notification parking documented (019). |
| 5 | Security | ☑ | No issues found — parameterised SQL throughout; payload JSON opaque; no secret material handled. |
| 6 | Performance & resource management | ☑ | Queue-depth gauge provider registered into a process-global slot is never cleared on `StopAsync` — frozen reading + dead-instance pin for the process lifetime (025). |
| 7 | Design-document adherence | ☑ | Per-retry `"Retried"` site event per-row-per-sweep is spec-conformant (`Component-StoreAndForward.md:139`) but a flood risk against the unbounded buffer + capped site log — flagged as a judgment call (027). |
| 8 | Code organization & conventions | ☑ | No new findings — options/POCO placement unchanged from prior passes (011/012 still Deferred to Commons-owning changes). |
| 9 | Testing coverage | ☑ | Gauge stop/lifecycle and the `EmitSiteEvent` retry / retry-delivered branches are untested (026). |
| 10 | Documentation & comments | ☑ | No new findings — the `_bufferedCount` and StopAsync XML docs accurately describe the implemented behaviour. |

## Checklist coverage — Re-review 2026-05-28

| # | Category | Examined | Notes |
@@ -1520,3 +1553,186 @@ mid-sweep and asserts no further storage activity occurs after `StopAsync` retur

_Unresolved._

### StoreAndForward-025 — Queue-depth gauge provider is never cleared on `StopAsync`, leaking a frozen reading and pinning the dead service

| | |
|--|--|
| Severity | Medium |
| Category | Performance & resource management |
| Status | Resolved |
| Location | `src/ZB.MOM.WW.ScadaBridge.StoreAndForward/StoreAndForwardService.cs:347`–`:351`, `:385`–`:425`; `src/ZB.MOM.WW.ScadaBridge.Commons/Observability/ScadaBridgeTelemetry.cs:55`–`:60`, `:85`–`:93` |

**Description**

The WP-14 queue-depth gauge is backed by a **process-global** provider slot.
`StoreAndForwardService.StartAsync` seeds the in-process `_bufferedCount`, sets a
one-time instance guard (`_queueDepthProviderRegistered`, line 347), and registers a
closure that reads this instance's counter into the global slot
(`ScadaBridgeTelemetry.SetQueueDepthProvider(() => Interlocked.Read(ref _bufferedCount))`,
line 350). The `scadabridge.store_and_forward.queue.depth` `ObservableGauge`
(`ScadaBridgeTelemetry.cs:56`–`:60`) invokes that provider on every collector scrape:
`Volatile.Read(ref _queueDepthProvider) is { } p ? p() : 0L`.

`SetQueueDepthProvider` (`ScadaBridgeTelemetry.cs:85`–`:93`) is the only mutator of the
slot — there is **no** clear/reset method anywhere in the codebase (a repository-wide
search for `ClearQueueDepthProvider` / writes to `_queueDepthProvider` finds only the
registration). `StopAsync` (`StoreAndForwardService.cs:385`–`:425`) disposes the retry
timer and (StoreAndForward-024) awaits the in-flight sweep, but it never touches the
gauge: the global slot still points at the stopped instance's closure, and
`_queueDepthProviderRegistered` is left at 1.

Two consequences after a graceful stop:

1. **Frozen reading.** The gauge keeps reporting the dead service's last `_bufferedCount`
   value forever — a stopped (or failed-over-away) site node reports a stale, non-zero
   queue depth that no longer corresponds to any live buffer, which is misleading on a
   dashboard / alert built on the metric.
2. **Instance pin (resource leak).** The captured closure holds a reference to the dead
   `StoreAndForwardService`, so the stopped instance (and everything it transitively
   roots) cannot be garbage-collected for the process lifetime. This is the classic
   "static event / static delegate slot holding a finished object" leak. It also means a
   later instance's `StartAsync` silently stomps the global slot (last-writer-wins), so
   the gauge's identity is unmanaged.

The instance guard's own XML doc (lines 156–164) already acknowledges the slot is
process-global and "the last `StartAsync` wins the global slot" — but nothing on the stop
path releases it, so a clean stop without an immediately-following start leaves a dangling
provider.

**Recommendation**

Give `ScadaBridgeTelemetry` an identity-checked clear and call it from `StopAsync`. Add a
compare-and-clear `ClearQueueDepthProvider(Func<long> expected)` that only nulls the slot
when the current provider is reference-equal to `expected` (so a later instance that
already re-registered is not stomped):

```csharp
public static void ClearQueueDepthProvider(Func<long> expected)
{
    Interlocked.CompareExchange(ref _queueDepthProvider, null, expected);
}
```

In `StoreAndForwardService`, hold the registered provider delegate in a field at
`StartAsync` time, and in `StopAsync` (only when this instance actually registered —
`_queueDepthProviderRegistered == 1`) call
`ScadaBridgeTelemetry.ClearQueueDepthProvider(_registeredProvider)` and reset
`_queueDepthProviderRegistered` to 0 via `Interlocked.Exchange` so a subsequent restart
re-seeds and re-registers cleanly. Add a regression test (extend `QueueDepthGaugeTests`)
asserting that after `StopAsync` the gauge reports 0 (provider cleared) and that a second
instance's registration is not clobbered by the first instance's stop.
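
The start/stop wiring described above can be sketched end to end as follows. This is a minimal illustration under stated assumptions, not the committed fix: `TelemetrySketch` and `ServiceSketch` are hypothetical stand-ins for `ScadaBridgeTelemetry` and `StoreAndForwardService`, the synchronous `Start`/`Stop` methods replace the real async lifecycle, and `ReadQueueDepth` stands in for what the gauge callback would observe on a scrape.

```csharp
using System;
using System.Threading;

// Hypothetical stand-in for the process-global provider slot.
public static class TelemetrySketch
{
    private static Func<long>? _queueDepthProvider;

    // Last writer wins, as in the behaviour the finding describes.
    public static void SetQueueDepthProvider(Func<long> provider) =>
        Volatile.Write(ref _queueDepthProvider, provider);

    // Compare-and-clear: nulls the slot only if it still holds `expected`.
    public static void ClearQueueDepthProvider(Func<long> expected) =>
        Interlocked.CompareExchange(ref _queueDepthProvider, null, expected);

    // What a gauge scrape would read: the provider's value, or 0 when cleared.
    public static long ReadQueueDepth() =>
        Volatile.Read(ref _queueDepthProvider) is { } p ? p() : 0L;
}

// Hypothetical stand-in for the service's gauge registration lifecycle.
public sealed class ServiceSketch
{
    private long _bufferedCount;
    private int _queueDepthProviderRegistered;
    private Func<long>? _registeredProvider;

    public void Start(long seed)
    {
        Interlocked.Add(ref _bufferedCount, seed);

        // One-time instance guard: register only on the 0 -> 1 transition.
        if (Interlocked.CompareExchange(ref _queueDepthProviderRegistered, 1, 0) == 0)
        {
            // Keep the exact delegate instance so Stop can identity-check it.
            _registeredProvider = () => Interlocked.Read(ref _bufferedCount);
            TelemetrySketch.SetQueueDepthProvider(_registeredProvider);
        }
    }

    public void Stop()
    {
        // Release the slot only if this instance registered, and reset the
        // guard so a later restart re-registers cleanly.
        if (Interlocked.Exchange(ref _queueDepthProviderRegistered, 0) == 1
            && _registeredProvider is { } provider)
        {
            TelemetrySketch.ClearQueueDepthProvider(provider);
            _registeredProvider = null;
        }
    }
}
```

Holding the delegate in a field matters because each lambda evaluation creates a new delegate instance; passing a freshly built lambda to the clear method would never be reference-equal to the one in the slot. With this shape, a first instance stopping after a second has started leaves the second instance's provider in place, and the slot reads 0 once the last registered instance stops.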

**Resolution**

Resolved 2026-06-20 (commit `fd618cf1`): added an identity-checked `ScadaBridgeTelemetry.ClearQueueDepthProvider(provider)` (compare-and-clear via `Interlocked.CompareExchange`, so a newer instance's provider is never stomped) and call it in `StoreAndForwardService.StopAsync` (with `_queueDepthProviderRegistered` reset). The gauge no longer reports a frozen depth or pins the dead service after a graceful stop. Tests added (incl. the late-stop-doesn't-clobber-successor case).

### StoreAndForward-026 — Gauge stop/lifecycle and the `EmitSiteEvent` retry / retry-delivered branches are untested

| | |
|--|--|
| Severity | Low |
| Category | Testing coverage |
| Status | Deferred |
| Location | `tests/ZB.MOM.WW.ScadaBridge.StoreAndForward.Tests/QueueDepthGaugeTests.cs`; `tests/ZB.MOM.WW.ScadaBridge.StoreAndForward.Tests/StoreAndForwardSiteEventTests.cs`; `src/ZB.MOM.WW.ScadaBridge.StoreAndForward/StoreAndForwardService.cs:295`–`:308`, `:747` |

**Description**

Two behaviours exercised by the current code have no test coverage:

1. **Gauge stop / lifecycle.** `QueueDepthGaugeTests` covers the live counter across
   enqueue / drain / park / requeue (`Gauge_TracksBufferedDepth_AcrossEnqueueDrainAndPark`),
   the start-up seed from existing Pending rows (`Gauge_SeedsFromExistingPendingRows_OnStart`),
   and the additive-seed race (`Gauge_SeedAddsToConcurrentPreSeedIncrement_NotClobber`).
   It calls `StopAsync` only as test teardown (`DisposeAsync`) with no assertion — so the
   gauge's stop behaviour is unverified. This is the same coverage gap that would have
   caught StoreAndForward-025: no test asserts what the gauge reports after a service stops
   (today: a stale frozen value; after the 025 fix: 0). The gauge's behaviour across a
   stop, and across a second instance starting after a first stops, is entirely untested.

2. **`EmitSiteEvent` retry / retry-delivered branches.** `StoreAndForwardSiteEventTests`
   covers buffer-for-retry (`"Queued"`), park (`"Parked"` for both notification and
   cached-call categories), routine-enqueue-no-event, and immediate-delivered-no-event.
   But two `EmitSiteEvent` branches that the retry sweep drives are uncovered:
   - the per-retry `"Retried"` action (`StoreAndForwardService.cs:747`), which maps to a
     `store_and_forward` site event at `Warning` severity (the `_ => "Warning"` arm of the
     severity switch, lines 298–303);
   - the retry-loop `"Delivered"` action whose detail is `"Delivered to … after N retries"`
     (lines 644–645), which — because the detail does **not** start with `"Immediate"` —
     is **not** short-circuited at line 268 and instead emits a `store_and_forward` event
     at `Info` severity (the `"Delivered" => "Info"` arm, line 301). This "recovery
     recorded" path is the only one that produces an `Info`-severity S&F site event, and
     nothing asserts it fires (or that the immediate-delivered path is correctly suppressed
     by contrast).
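
The two uncovered branches can be pictured with a minimal sketch of the mapping as the finding describes it. `SiteEventSketch.SeverityFor` is a hypothetical helper, not the real `EmitSiteEvent`. Only the `"Delivered" => "Info"` and `_ => "Warning"` arms and the `"Immediate"` prefix check come from the text above; any other arms the real switch may have are omitted, and tying the short-circuit to the `"Delivered"` action is an assumption.

```csharp
using System;

public static class SiteEventSketch
{
    // Returns the severity of the site event to emit, or null when the
    // activity is suppressed and no event is written.
    public static string? SeverityFor(string action, string detail)
    {
        // Assumed shape of the short-circuit: immediate deliveries are routine
        // and produce no site event.
        if (action == "Delivered" && detail.StartsWith("Immediate", StringComparison.Ordinal))
        {
            return null;
        }

        return action switch
        {
            "Delivered" => "Info", // retry-loop recovery is recorded
            _ => "Warning",        // includes the per-retry "Retried" action
        };
    }
}
```

Under these assumptions, `SeverityFor("Retried", ...)` yields `"Warning"`, a detail of the form `"Delivered to X after 2 retries"` yields `"Info"`, and a detail beginning with `"Immediate"` yields `null`. Those are the three outcomes the missing tests would pin down.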

**Recommendation**

Add a gauge-stop test to `QueueDepthGaugeTests`: enqueue to a known non-zero depth, call
`StopAsync`, and assert the gauge reports the cleared value (0 after the StoreAndForward-025
fix) — and, ideally, that a second `StoreAndForwardService` started afterward owns the slot
without being clobbered by the first instance's stop. Add two `StoreAndForwardSiteEventTests`
cases: (a) a transient handler with `MaxRetries > 1` driven through one sweep, asserting a
`store_and_forward` / `Warning` event whose message contains `"retried"`; and (b) a handler
that fails on the immediate attempt then succeeds on the sweep, asserting a `store_and_forward`
/ `Info` event whose message contains `"delivered"` (recording the recovery), distinct from
the suppressed immediate-delivery path.

**Resolution**

Deferred 2026-06-20: additional gauge-lifecycle and `EmitSiteEvent` retry-branch tests are low-value coverage gaps; recorded for a follow-up.

### StoreAndForward-027 — Per-message `"Retried"` site event fires per-row-per-sweep, risking a flood of the capped site event log

| | |
|--|--|
| Severity | Low |
| Category | Design-document adherence |
| Status | Deferred |
| Location | `src/ZB.MOM.WW.ScadaBridge.StoreAndForward/StoreAndForwardService.cs:734`–`:759`, `:295`–`:308`; `docs/requirements/Component-StoreAndForward.md:72`, `:139` |

**Description**

When `_siteEventLogger` is wired, `EmitSiteEvent` is subscribed to `OnActivity`, so every
`RaiseActivity("Retried", …)` call in `RetryMessageAsync`'s transient-not-yet-max branch
(`StoreAndForwardService.cs:747`–`:748`) produces one `store_and_forward` site event at
`Warning` severity (lines 295–308). That branch fires **once per still-pending row per
retry sweep**. Combined with the design's deliberate "**no maximum buffer size**"
(`Component-StoreAndForward.md:72`), a sustained outage can buffer a large Pending
population, and each fixed-interval sweep then emits a `"Retried"` event for every one of
those rows — `(pending rows) × (sweeps)` events. The site event log is **capped** (default
1 GB, 30-day retention, with oldest-first purge when the cap is hit before the retention
window — `Component-SiteEventLogging.md:45`–`:46`), so a large, long-lived backlog can
churn the log and evict genuinely useful operational history (deployments, connection
events, alarms) well before the 30-day window.
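
To make the `(pending rows) × (sweeps)` product concrete, here is a small worked estimate. The backlog size, outage length, and sweep interval below are hypothetical numbers chosen for illustration, not values taken from the service's configuration, and the sketch assumes the pending population stays constant for the whole outage.

```csharp
using System;

public static class RetryEventVolume
{
    // One "Retried" event per still-pending row per sweep, assuming the
    // pending population does not change during the outage.
    public static long Estimate(long pendingRows, TimeSpan outage, TimeSpan sweepInterval)
    {
        long sweeps = outage.Ticks / sweepInterval.Ticks; // whole sweeps only
        return pendingRows * sweeps;
    }
}
```

For example, `RetryEventVolume.Estimate(10_000, TimeSpan.FromHours(1), TimeSpan.FromSeconds(30))` gives 120 sweeps × 10,000 rows = 1,200,000 events for a single hour of outage.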

**This is a judgment call, not a clear bug.** The behaviour is **spec-conformant**:
`Component-StoreAndForward.md:139` explicitly lists "retried" as a logged store-and-forward
activity ("Logs store-and-forward activity (queued, delivered, retried, parked)"), so the
code is doing exactly what the design doc says. The tension is between that
per-retry-logging contract and the combination of an unbounded buffer + a capped site log:
at scale the "retried" stream can dominate the log. The site-event-log cap bounds the
absolute disk risk (it cannot exhaust disk), so the residual concern is **eviction of
other history**, not unbounded growth — hence Low and flagged as a design-vs-scale
decision rather than a defect.

**Recommendation**

Reconcile the design-vs-scale tension deliberately; this is a judgment call for the owners,
not a forced fix. Options, in rough order of preference:

1. **Drop or coarsen the per-retry event.** Stop emitting a `store_and_forward` event for
   the routine `"Retried"` action (keep `"Queued"`, `"Parked"`, and the retry-recovery
   `"Delivered"`), since the per-attempt detail is already available via the
   `OnActivity`/health/telemetry surfaces and the central audit log. The first-buffer and
   the terminal park/deliver events are the operationally interesting ones.
2. **Sample / threshold it.** Emit a `"Retried"` event only on the first retry, on a count
   boundary (e.g. every Nth retry), or once per sweep as a rolled-up summary
   ("retried N messages this sweep") rather than one per row.
3. **Document the volume risk.** If per-retry logging is intentionally retained, add a note
   to `Component-StoreAndForward.md` / `Component-SiteEventLogging.md` that a large buffered
   backlog can dominate the capped site event log and evict other history, so operators
   size the cap / retention accordingly.
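
As one possible shape for the rolled-up variant of option 2, the sketch below counts retries during a sweep and emits a single summary at the end. `RetrySummarySketch` and its `emit` callback are hypothetical names invented for illustration; how such a collector would hook into the real sweep and site event logger is left open.

```csharp
using System;
using System.Threading;

// Hypothetical per-sweep roll-up: one summary event instead of one per row.
public sealed class RetrySummarySketch
{
    private readonly Action<string> _emit;
    private int _retriedThisSweep;

    public RetrySummarySketch(Action<string> emit) => _emit = emit;

    // Called where the per-row "Retried" activity is raised today.
    public void OnRetried() => Interlocked.Increment(ref _retriedThisSweep);

    // Called once when a sweep finishes; resets the count for the next sweep.
    public void OnSweepCompleted()
    {
        int retried = Interlocked.Exchange(ref _retriedThisSweep, 0);
        if (retried > 0)
        {
            _emit($"retried {retried} messages this sweep");
        }
    }
}
```

This bounds the event rate at one per sweep regardless of backlog size, at the cost of losing per-message detail in the site log. That trade-off is the one the owners would need to weigh.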

**Resolution**

Deferred 2026-06-20: the per-retry `"Retried"` site-event volume is spec-conformant (Component-StoreAndForward.md lists 'retried' as a logged activity) and the site log is capped, so the residual concern is history eviction, not disk exhaustion. Whether to drop/sample/threshold the event vs. document the volume is a design-vs-scale decision for the owner.