docs(code-review): full review at 4307c381 — 18 modules, 67 findings recorded + remediation tracked

Full per-module re-review of the 16 stale modules (last seen 1eb6e97 / 2026-05-28) plus first-ever reviews of KpiHistory (#26) and ScriptAnalysis (#25), at HEAD 4307c381. 67 new findings (0 Critical, 6 High, 27 Medium, 34 Low). Remediation in commit fd618cf1 closed 5 of the 6 Highs and ~33 Medium/Low; the rest are Deferred/Won't Fix with rationale. Remaining pending (4) are all InboundAPI's Database-helper findings (IA-026 High .. IA-029), left to the active feat/ipsen-movein effort per owner decision. Highlights: caught a central-only-delivery security drift (SMTP creds broadcast to sites — DM-025/SR-031), a never-committed 'Resolved' fix (SiteEventLogging-016 → -024), an unguarded KPI recorder tick (KH-001), a trust-analyzer fallback weakening (SA-001), and a native-alarm subscribe-path leak (DCL-023). ScriptAnalysis verdict: trust boundary is semantically sound (symbol-based) in the production cluster config. README regenerated; regen-readme.py --check passes (4 pending / 567 total).
2026-06-20 18:02:32 -04:00
parent fd618cf1dc
commit d39089f4ed
19 changed files with 4031 additions and 69 deletions
@@ -5,10 +5,10 @@
 | Module | `src/ZB.MOM.WW.ScadaBridge.SiteCallAudit` |
 | Design doc | `docs/requirements/Component-SiteCallAudit.md` |
 | Status | Reviewed |
-| Last reviewed | 2026-05-28 |
+| Last reviewed | 2026-06-20 |
 | Reviewer | claude-agent |
-| Commit reviewed | `1eb6e97` |
-| Open findings | 2 |
+| Commit reviewed | `4307c381` |
+| Open findings | 0 |

 ## Summary

@@ -42,6 +42,23 @@ tests using a shared `MsSqlMigrationFixture`.
 | 9 | Testing coverage | Yes | Relay path well covered (6 unit tests). Ingest/KPI well covered by MSSQL fixture. Stuck-only paging boundary edge not directly exercised (Finding 006). |
 | 10 | Documentation & comments | Yes | XML docstring claims `SupervisorStrategy` uses Resume — incorrect (Finding 001). `AckErrorMessage` switch arm for `SiteUnreachable` falls through instead of throwing (Finding 005). |

+#### Re-review 2026-06-20 (commit `4307c381`) — full review
+
+Since the `1eb6e97` baseline the module grew the two pieces that were "still deferred" at the prior pass: the periodic per-site reconciliation puller (Piece A — `OnReconciliationTickAsync` → `ReconcileSiteAsync`, in-memory per-site cursor + idempotent monotonic upsert + per-site failure isolation) and the daily terminal-row purge scheduler (Piece B — `OnPurgeTickAsync` → `PurgeTerminalAsync`, continue-on-error). The new code is clean, idiomatic, and well-tested (dedicated `SiteCallAuditReconciliationTests`, `SiteCallAuditPurgeTests`, `SiteCallRelayTests`, KPI-sample and options suites); the M6 `SiteCallAuditKpiSampleSource` correctly anchors both cutoffs on a single capture instant and emits Global + per-Site + per-Node snapshots, and the per-node KPI handler is a clean additive peer of the per-site one. The three new findings are all forward-compat / cosmetic: a latent purge-gating gap (the purge timer is armed only when the reconciliation collaborators are present — currently masked because `Program.cs` co-registers them) plus two doc/edge items (the design doc over-claims six charted trend series when only three are charted, and the reconciliation cursor ignores `MoreAvailable` so a single-timestamp batch saturation re-pulls without progress). No data-corrupting, concurrency, or security defects in the new code; the inclusive-cursor + idempotent-upsert pairing keeps the reconciliation edge case safe-by-construction (wasted work and a lagging backlog tail, never corruption).
+
+| # | Category | Examined | Notes |
+|---|----------|----------|-------|
+| 1 | Correctness & logic bugs | Yes | Reconciliation cursor ignores `MoreAvailable`; a single-`UpdatedAtUtc` batch saturation re-pulls without progress (Finding 009 — wasted work only, idempotent upsert prevents corruption). Finding 003 (`IngestedAtUtc`) resolved — now stamped in `OnUpsertAsync` and `ReconcileSiteAsync`. |
+| 2 | Akka.NET conventions | Yes | `Sender` captured before first await on every async handler; `PipeTo` used for all read/relay replies; self-ticks scheduled via `ScheduleTellRepeatedlyCancelable` with `Self` sender, cancelled in `PostStop`. No issues found. |
+| 3 | Concurrency & thread safety | Yes | `_reconciliationCursors`, `_centralCommunication`, timers all mutated only on the actor thread. Per-tick/per-message DI scope (`CreateAsyncScope` for the async tick paths, `CreateScope` for sync read paths) disposed in `finally` / `await using`. No issues found. |
+| 4 | Error handling & resilience | Yes | Reconciliation has per-site try/catch isolation; purge is continue-on-error; ingest catches all and replies `Accepted=false`. Deliberate coarse-retry divergence from `SiteAuditReconciliationActor` (no per-row abandon) is documented and safe given idempotent upsert. No issues found. |
+| 5 | Security | Yes | All SQL parameterised at the repository. Relay carries no user-controlled strings beyond `SourceSite` (a site id). No issues found. |
+| 6 | Performance & resource management | Yes | One DI scope per tick reused across all sites; `MaxPageSize=200` clamp; async DbContext disposal off the dispatcher. `MoreAvailable`-ignoring re-pull is bounded wasted work (Finding 009). |
+| 7 | Design-document adherence | Yes | Reconciliation puller + purge scheduler now implemented (prior Finding 004 resolved). Design doc over-claims six charted KPI trend series; only three are charted (Finding 008). |
+| 8 | Code organization & conventions | Yes | Purge timer gated on reconciliation collaborators rather than just the repository — a host registering SiteCallAudit without the reconciliation client silently never purges (Finding 007, latent). `KpiSampleSource` registered via `TryAddEnumerable`; options owned by the component. |
+| 9 | Testing coverage | Yes | Reconciliation/purge/relay/KPI/options all have dedicated suites. No test exercises a saturated reconciliation batch (`MoreAvailable: true`) or the single-timestamp no-progress edge — every reconciliation test uses `MoreAvailable: false` (relates to Finding 009). |
+| 10 | Documentation & comments | Yes | Prior Findings 001/005 doc fixes still in place. The `ReconcileSiteAsync` XML doc claims parity with `SiteAuditReconciliationActor` while diverging on `MoreAvailable` (the sibling reads it for stalled detection; this actor ignores it) — Finding 009 recommendation includes a doc correction. |
+
 ## Findings

 ### SiteCallAudit-001 — SupervisorStrategy override is dead code; XML claims Resume that is not enforced
@@ -323,3 +340,221 @@ Add a test that (a) inserts 6 rows in interleaved order: stuck, not-stuck,
 stuck, not-stuck, stuck, not-stuck (oldest first); (b) issues a `StuckOnly`
 page-size-1 query; (c) asserts each page returns exactly the stuck row, with
 no overlap and all 3 stuck rows visited.
+
+### SiteCallAudit-007 — Daily purge timer is armed only when the reconciliation collaborators are present; a host without the reconciliation client silently never purges
+
+| | |
+|--|--|
+| Severity | Medium |
+| Category | Code organization & conventions |
+| Status | Resolved |
+| Location | `src/ZB.MOM.WW.ScadaBridge.SiteCallAudit/SiteCallAuditActor.cs:307-321`, `src/ZB.MOM.WW.ScadaBridge.SiteCallAudit/SiteCallAuditActor.cs:202-227` |
+
+**Description**
+
+`StartPurgeTimer` (lines 307-321) gates the daily terminal-row purge tick on
+the **reconciliation** collaborators being present:
+
+```csharp
+private void StartPurgeTimer()
+{
+    if (_pullClient is null || _siteEnumerator is null)
+    {
+        return;
+    }
+    // ... schedule PurgeTick ...
+}
+```
+
+But the purge pass (`OnPurgeTickAsync` → `PurgeWithRepositoryAsync` →
+`ISiteCallAuditRepository.PurgeTerminalAsync`) needs only the repository — it
+has no dependency on `IPullSiteCallsClient` / `ISiteEnumerator`. The two
+collaborators are resolved by the production constructor (lines 202-227) via
+`serviceProvider.GetService<IPullSiteCallsClient>()` /
+`GetService<ISiteEnumerator>()` — both registered by
+`AddAuditLogCentralReconciliationClient`
+(`AuditLog/ServiceCollectionExtensions.cs:473,523`, registered as `ISiteEnumerator`
+and `IPullSiteCallsClient`). `GetService` (not `GetRequiredService`) returns
+`null` if that helper was never called, so a host that registers
+`AddSiteCallAudit()` **without** also calling
+`AddAuditLogCentralReconciliationClient(...)` constructs the actor with both
+collaborators null. In `PreStart` both `StartReconciliationTimer` and
+`StartPurgeTimer` early-return, so the actor runs forever with **no purge timer
+at all** → unbounded growth of the central `SiteCalls` table, with no log line
+to say the purge was skipped.
+
+This is **currently latent, not live**: `Program.cs` co-registers
+`AddAuditLogCentralReconciliationClient(builder.Configuration)` (line 107)
+immediately before `AddSiteCallAudit()` (line 113), so on the real central node
+both collaborators resolve and both timers run today. The risk is a future host
+(or a refactor that splits the reconciliation client out of the central
+composition root, or a test/embedded host that wants only ingest + purge)
+silently losing the purge with no diagnostic. The gate exists for a legitimate
+reason — keeping the repo-only test ctor free of *both* background timers so the
+MSSQL read/upsert tests see no scheduled side effects — but it conflates "no
+reconciliation route" with "no purge", and the actor's own XML
+(lines 36-38) documents the coupling as deliberate rather than flagging it as a
+hazard.
+
+**Recommendation**
+
+Decouple `StartPurgeTimer` from the reconciliation collaborators — purge needs
+only a repository, which both the production and the repo-only test ctors
+always have. Two viable shapes:
+
+- Preferred: gate the purge timer on its own real precondition (a repository is
+  always available, so arm it unconditionally in the production + reconciliation
+  ctors; keep it off only in the repo-only test ctor via an explicit
+  "background timers off" flag rather than by proxy of the reconciliation
+  collaborators). This keeps the MSSQL test isolation while removing the
+  accidental coupling.
+- At minimum: log a `Warning` in `StartPurgeTimer` (and `StartReconciliationTimer`)
+  when the timer is **not** armed on the production path, so a misconfigured host
+  surfaces "SiteCallAudit purge timer not started — `PurgeTerminalAsync` will
+  never run" instead of growing the table silently.
+
+Severity is a judgment call: Medium because the consequence (unbounded central
+table growth) is real and silent, but it is latent today (the only production
+composition root co-registers the reconciliation client), so an argument for Low
+is defensible. Flagged Medium to weight the silent-failure + data-growth aspect.
+
+**Resolution**
+
+Resolved 2026-06-20 (commit `fd618cf1`): decoupled the daily purge timer from the reconciliation collaborators (a new `_backgroundTimersEnabled` flag) so a central node that omits the reconciliation client still purges — no more silent unbounded `SiteCalls` growth; a Warning is logged if the reconciliation collaborators are absent. Test added (purge arms without the reconciliation client).
+
+### SiteCallAudit-008 — Design doc claims six charted KPI trend series; only three are actually charted
+
+| | |
+|--|--|
+| Severity | Low |
+| Category | Design-document adherence |
+| Status | Resolved |
+| Location | `docs/requirements/Component-SiteCallAudit.md:149-154`, `src/ZB.MOM.WW.ScadaBridge.SiteCallAudit/Kpi/SiteCallAuditKpiSampleSource.cs:42-47`, `src/ZB.MOM.WW.ScadaBridge.Commons/Types/Kpi/KpiMetrics.cs:52-62` |
+
+**Description**
+
+The design doc (`Component-SiteCallAudit.md` lines 149-154, the KPI History
+interaction) states:
+
+> "… the resulting `buffered` / `parked` / `failedLastInterval` /
+> `deliveredLastInterval` / `stuck` / `oldestPendingAgeSeconds` series render as
+> trends on the Site Calls page via `KpiTrendChart`."
+
+That lists **six** series as charted. In the code only **three** are charted:
+
+- The public charted catalog `KpiMetrics.SiteCallAudit`
+  (`KpiMetrics.cs:52-62`) exposes exactly `Buffered`, `Parked`, and
+  `FailedLastInterval` — and its own XML says "Charted Site Call Audit (#22)
+  metrics … Rendered by the Central UI Site Calls report trend panel."
+- `SiteCallAuditKpiSampleSource` (lines 42-47) keys those three off the public
+  Commons catalog (`KpiMetrics.SiteCallAudit.*`) and keeps the other three —
+  `deliveredLastInterval`, `stuck`, `oldestPendingAgeSeconds` — as **private**
+  string literals, with the comment "Charted metrics share the public Commons
+  catalog … the uncharted internal metrics stay private here (#178)."
+- The UI confirms it: `SiteCalls/SiteCallsReport.razor.cs` calls
+  `LoadSeriesAsync` for exactly `KpiMetrics.SiteCallAudit.Buffered`, `.Parked`,
+  and `.FailedLastInterval` — three series, no more.
+
+So all six metrics are *sampled* into the `KpiSample` history store, but only
+three are *charted*. The doc reads as if all six are rendered, which is a small
+design-doc drift (over-claiming the UI surface). The code is internally
+consistent and correctly commented; the doc is the stale party.
+
+**Recommendation**
+
+Small doc edit to `Component-SiteCallAudit.md` lines 149-154: state that the
+three series `buffered` / `parked` / `failedLastInterval` render as trends on
+the Site Calls page via `KpiTrendChart`, and that `deliveredLastInterval` /
+`stuck` / `oldestPendingAgeSeconds` are **sampled into the KPI-history store but
+not charted** (available for future trend panels / ad-hoc query). No code change
+— the code is already the intended state.
+
+**Resolution**
+
+Resolved 2026-06-20 (commit `fd618cf1`): the design doc now lists the three actually-charted KPI metrics (buffered/parked/failedLastInterval) and marks the other three as sampled-but-not-yet-charted, matching the `KpiMetrics` catalog. Doc-only.
+
+### SiteCallAudit-009 — Reconciliation cursor advances inclusively and ignores `MoreAvailable`; a single-timestamp batch saturation re-pulls without progress
+
+| | |
+|--|--|
+| Severity | Low |
+| Category | Correctness & logic bugs |
+| Status | Resolved |
+| Location | `src/ZB.MOM.WW.ScadaBridge.SiteCallAudit/SiteCallAuditActor.cs:505-535` |
+
+**Description**
+
+`ReconcileSiteAsync` (lines 505-535) pulls rows since the per-site cursor,
+upserts each, and advances the cursor to the maximum `UpdatedAtUtc` observed:
+
+```csharp
+var since = _reconciliationCursors.TryGetValue(site.SiteId, out var c) ? c : DateTime.MinValue;
+var response = await client.PullAsync(site.SiteId, since, _options.ReconciliationBatchSize, ...);
+
+var maxUpdated = since;
+foreach (var row in response.SiteCalls)
+{
+    await repository.UpsertAsync(row with { IngestedAtUtc = nowUtc });
+    if (row.UpdatedAtUtc > maxUpdated) maxUpdated = row.UpdatedAtUtc;
+}
+_reconciliationCursors[site.SiteId] = maxUpdated;
+```
+
+Two interacting properties:
+
+1. The pull is **inclusive** — `PullAsync(site, since, …)` asks for rows with
+   `UpdatedAtUtc >= since` (documented in the method's XML, lines 497-503), and
+   the cursor advances to the **max** `UpdatedAtUtc` seen, which is itself one
+   of the rows just pulled. So the boundary row is re-pulled every tick and
+   deduped by the idempotent monotonic upsert — intended, harmless.
+2. `response.MoreAvailable` is **never read** at all. The `PullSiteCallsResponse`
+   carries `MoreAvailable` (`PullSiteCallsResponse.cs:14-17`: "True when the
+   site saturated the requested batch size — the caller should advance the
+   cursor and pull again"), but `ReconcileSiteAsync` ignores it and relies on
+   the natural tick cadence to drain the backlog over successive ticks.
+
+The edge case: if a site has **more rows than the batch size all sharing one
+exact `UpdatedAtUtc`** (e.g. a burst written in the same tick / same clock
+value), then once the cursor reaches that timestamp every saturated batch
+returns rows whose max `UpdatedAtUtc` equals `since`. `maxUpdated` therefore
+stays at `since`, the cursor does **not** advance, and because the pull is
+inclusive each later tick re-pulls the identical window, making no forward
+progress on that site's backlog. Because the upsert is idempotent and the
+`SiteCalls` table is an eventually-consistent mirror (not the source of
+truth), this is **wasted work, never corruption**. It is still an unbounded
+per-tick re-pull on a pathological input, and any rows in that backlog beyond
+the batch ceiling never get reconciled.
+
+The sibling `SiteAuditReconciliationActor` shares the inclusive-cursor /
+max-timestamp shape, so the same single-timestamp-saturation no-progress edge
+applies to it — but that sibling **does read `MoreAvailable`** (it feeds its
+stalled-detection state machine, `SiteAuditReconciliationActor.cs:324-325`,
+publishing `SiteAuditTelemetryStalledChanged` so a non-draining site surfaces a
+health signal). `ReconcileSiteAsync`'s XML (lines 530-534) claims the
+no-immediate-re-pull behaviour "match[es] `SiteAuditReconciliationActor`", which
+is only half true: it matches the cursor cadence but diverges by dropping
+`MoreAvailable` entirely, so this actor has neither a continuation pull nor a
+stalled signal — a saturated site lags silently with no observability.
+
+**Recommendation**
+
+- Consume `MoreAvailable`: either continue draining within the same tick while
+  `MoreAvailable` is true (bounded by a max-iterations guard), or — matching the
+  sibling — surface a stalled/non-draining signal when a batch comes back
+  saturated so a stuck site is observable rather than silent.
+- Defend the single-timestamp no-progress edge with a tiebreaker beyond the
+  timestamp (e.g. advance on `(UpdatedAtUtc, TrackedOperationId)` as a composite
+  keyset cursor, and ask the pull for rows strictly after that composite), so a
+  burst sharing one `UpdatedAtUtc` cannot pin the cursor.
+- Correct the `ReconcileSiteAsync` XML (lines 530-534): it claims parity with
+  `SiteAuditReconciliationActor` while diverging on `MoreAvailable` (the sibling
+  reads it for stalled detection; this actor ignores it).
+
+Severity Low: idempotent upsert + mirror-not-source-of-truth mean no data
+corruption, and the saturated-single-timestamp input is pathological; the cost
+is wasted re-pulls and an un-drained backlog tail on that one input, plus the
+missing observability.
+
+**Resolution**
+
+Resolved 2026-06-20 (commit `fd618cf1`): `ReconcileSiteAsync` now consumes `response.MoreAvailable` via a within-tick continuation drain bounded by a max-pages guard, with explicit no-progress (single-timestamp-saturation) detection that breaks and logs a Warning instead of re-pulling forever. The XML claim of parity with the sibling reconciler was corrected. Idempotent upsert retained. Tests added.