docs(code-review): re-review 17 changed modules at 1f9de8a2 — 8 new findings

Re-reviewed the modules whose source changed since the last review baseline
(full-review remediation fd618cf1 + InboundAPI Database-helper fixes b3c90143),
focused on whether the fixes are sound and regression-free. 9 of 17 modules
clean; 8 new findings (0 Critical, 0 High, 4 Medium, 4 Low), all code-verified
by the orchestrator before recording:

- DataConnectionLayer-029 (Med): DCL-023's unsubscribe-clears-in-flight reopens a
  double-subscribe window that leaks an orphaned alarm feed; the alarm completion
  handler overwrites the subscription id without the tag-path guard at line 908.
- InboundAPI-031 (Med): WaitForAttribute's 5s grace backstop is tighter than the
  CommunicationService Ask's timeout+IntegrationTimeout (30s) round-trip slack, so
  a slow-but-valid timed-out 'false' arriving in the 5-30s window is cancelled into
  an unhandled OperationCanceledException/500 (contradicts spec 6 + its own comment).
- SiteRuntime-032 (Med): SiteRuntime-029's wasPresent guard skips the deployed-count
  decrement when deleting a DISABLED instance (absent from both maps), drifting the
  health-dashboard tally; self-heals on singleton restart (observational, hence Med).
- StoreAndForward-028 (Med): StoreAndForward-025 resets the register-guard but not
  _bufferedCount, so a same-instance Stop->Start re-seeds the depth gauge to ~2N.
- AuditLog-017, CentralUI-037, ScriptAnalysis-009, SiteRuntime-033 (Low): a
  test-coverage gap plus stale doc-comments/spec following the remediation.

Header commit/date bumped to 1f9de8a2 / 2026-06-24 on all 17 modules; README
regenerated (8 pending / 576 total).
This commit is contained in:
Joseph Doherty
2026-06-24 09:20:03 -04:00
parent 1f9de8a2b5
commit c42bb48585
18 changed files with 635 additions and 66 deletions
+46 -3
View File
@@ -5,10 +5,10 @@
| Module | `src/ZB.MOM.WW.ScadaBridge.StoreAndForward` |
| Design doc | `docs/requirements/Component-StoreAndForward.md` |
| Status | Reviewed |
| Last reviewed | 2026-06-20 |
| Last reviewed | 2026-06-24 |
| Reviewer | claude-agent |
| Commit reviewed | `4307c381` |
| Open findings | 0 (025 Resolved; 026, 027 Deferred; 5 Deferred total: 002, 011, 012, 026, 027; all prior Open findings resolved) |
| Commit reviewed | `1f9de8a2` |
| Open findings | 1 (025 Resolved; 026, 027 Deferred; 5 Deferred total: 002, 011, 012, 026, 027; all prior Open findings resolved) |
## Summary
@@ -1736,3 +1736,46 @@ not a forced fix. Options, in rough order of preference:
Deferred 2026-06-20: the per-retry `"Retried"` site-event volume is spec-conformant (Component-StoreAndForward.md lists 'retried' as a logged activity) and the site log is capped, so the residual concern is history eviction, not disk exhaustion. Whether to drop/sample/threshold the event vs. document the volume is a design-vs-scale decision for the owner.
## Re-review — 2026-06-24 (commit `1f9de8a2`)
Focused re-review of the changes since the prior review — verifying the code-review remediation + feature fixes are sound and regression-free. Reviewed by a per-module workflow agent; findings code-verified by the orchestrator.
**Changes reviewed:** The diff implements the StoreAndForward-025 remediation in StoreAndForwardService.cs: it adds a `_queueDepthProvider` field that retains the exact delegate registered with the process-global `ScadaBridgeTelemetry` queue-depth gauge in `StartAsync`, and reworks `StopAsync` so that after the in-flight retry sweep drains it deregisters that delegate by identity via the new `ScadaBridgeTelemetry.ClearQueueDepthProvider` (compare-and-clear), nulls `_queueDepthProvider`, and resets the `_queueDepthProviderRegistered` one-time guard to 0. The sweep-wait block was also refactored from an early-return guard into an `if (inflight is not null && !inflight.IsCompleted)` block (behaviour-preserving). A companion `ClearQueueDepthProvider` was added to ScadaBridgeTelemetry and three regression tests were added to QueueDepthGaugeTests.
**Verdict:** The StoreAndForward-025 fix is largely sound: the identity-checked compare-and-clear correctly prevents a late stop from stomping a successor instance's provider (proven by the passing late-stop test), the reference-type `Interlocked.CompareExchange` overload binds correctly, the project builds clean, and all five QueueDepthGaugeTests pass. The StopAsync refactor preserves the StoreAndForward-024 sweep-wait semantics. However, the change introduces one regression in the restart path it newly enables: resetting `_queueDepthProviderRegistered` to 0 without resetting `_bufferedCount` means a Stop→Start cycle on a non-empty buffer re-runs the additive seed and double-counts still-buffered rows into the gauge — contradicting the fix's own comment that a later StartAsync "re-registers cleanly." Impact is gauge-accuracy only (the metric is documented as approximate), so it is bounded, but it is a real behavioural drift versus the prior permanent-latch guard.
| # | Category | Examined | Notes |
|---|----------|----------|-------|
| 1 | Correctness & logic bugs | ☑ | StopAsync resets _queueDepthProviderRegistered to 0 but never resets _bufferedCount, so a same-instance Stop→Start over a non-empty buffer re-runs Interlocked.Add(pending) and double-counts the gauge. See finding. |
| 2 | Akka.NET conventions | ☑ | No issues found. StoreAndForwardService is a plain service, not an actor; no actor-state, Tell/Ask, or supervision concerns touched by the diff. |
| 3 | Concurrency & thread safety | ☑ | Telemetry clear uses Interlocked.CompareExchange (reference overload) and the gauge reads via Volatile — sync-safe. _queueDepthProvider is read/written only on the same instance during the documented single-threaded Start/Stop lifecycle. No data race in the changed code. |
| 4 | Error handling & resilience | ☑ | StopAsync's bounded WaitAsync with TimeoutException + catch-all (swallow + log Warning) is preserved through the refactor; the provider clear runs unconditionally afterward so a timed-out sweep still releases the gauge slot. No issues found. |
| 5 | Security | ☑ | No issues found. No secrets, no SQL/LDAP, no script-trust surface, no logging of sensitive data in the changed code. |
| 6 | Performance & resource management | ☑ | The fix correctly addresses the original StoreAndForward-025 leak: the global delegate slot no longer pins the dead service after a graceful StopAsync, and the gauge stops reporting a frozen depth. |
| 7 | Design-document adherence | ☑ | Component-StoreAndForward.md references buffer-depth metrics only generically; the gauge provider lifecycle is an implementation detail. No spec drift. Note: Host AkkaHostedService.StopAsync never calls StoreAndForwardService.StopAsync, so the clear path is not exercised on real host shutdown — pre-existing wiring gap, out of scope of this file. |
| 8 | Code organization & conventions | ☑ | No issues found. New field, telemetry method, and XML docs follow existing patterns; the explicit (Func<long>) cast on the seed delegate is needed for overload disambiguation. |
| 9 | Testing coverage | ☑ | Three new tests cover clear-on-stop, late-stop-doesn't-clobber-successor, and the additive seed. No test covers a Stop→Start restart on the same instance with a non-empty buffer, which is exactly the double-count path the finding describes. |
| 10 | Documentation & comments | ☑ | The StopAsync XML doc and the inline comment at lines 405-406/451-452 claim a later StartAsync 're-registers cleanly', which is inaccurate for _bufferedCount — the comment overstates the cleanliness of the restart path. Tied to the finding. |
**New findings from this re-review (1):**
### StoreAndForward-028 — StopAsync resets register-guard but not _bufferedCount, double-counting gauge on restart
| | |
|--|--|
| Severity | Medium |
| Category | Correctness & logic bugs |
| Status | Open |
| Location | `src/ZB.MOM.WW.ScadaBridge.StoreAndForward/StoreAndForwardService.cs:459` |
**Description**
The StoreAndForward-025 fix added `Interlocked.Exchange(ref _queueDepthProviderRegistered, 0)` in StopAsync (line 459) so a later StartAsync re-registers — and the new XML doc/comment (lines 405-406, 451-452) explicitly claims StartAsync then 're-registers cleanly.' But StopAsync never resets `_bufferedCount` (verified: it is only ever Increment/Decrement/Add, never reset — lines 154/359/619/678/712/751/973). On a same-instance Stop→Start cycle the guard CompareExchange at line 357 now succeeds again, so `Interlocked.Add(ref _bufferedCount, pending)` (line 359) re-adds the freshly-read durable Pending count on top of the leftover in-memory value from the prior lifecycle. When the service stops with a non-empty buffer (the normal case — S&F stops with queued rows), `_bufferedCount` is ~N at stop and stays N; the restart adds another N, so the gauge reports ~2N. Before this change the guard was a permanent latch (never reset), so a second StartAsync short-circuited and the seed never re-ran — no double-count. The service is a DI singleton and the Host comment notes the test harness triggers second start/stop interleavings, so this restart path is reachable. Impact is confined to gauge accuracy (the depth metric is documented as approximate, dashboard-only — no data loss), but it is a real regression in the path this change newly enables and mis-documents.
**Recommendation**
Reset the cached counter on stop alongside the guard, e.g. `Interlocked.Exchange(ref _bufferedCount, 0);` immediately before/after the `Interlocked.Exchange(ref _queueDepthProviderRegistered, 0)` at line 459, so a subsequent StartAsync seeds from a clean zero base. Then add a regression test to QueueDepthGaugeTests that buffers N rows, calls StopAsync, calls StartAsync again on the same instance, and asserts the gauge reports N (not 2N). Alternatively, correct the 're-registers cleanly' comment to document that restart is unsupported and keep the guard a permanent latch.
**Resolution**
_Unresolved._