docs(code-review): full review at 4307c381 — 18 modules, 67 findings recorded + remediation tracked

Full per-module re-review of the 16 stale modules (last seen 1eb6e97 / 2026-05-28)
plus first-ever reviews of KpiHistory (#26) and ScriptAnalysis (#25), at HEAD 4307c381.

67 new findings (0 Critical, 6 High, 27 Medium, 34 Low). Remediation in commit
fd618cf1 closed 5 of the 6 Highs and ~33 Medium/Low; the rest are Deferred/Won't Fix
with rationale. Remaining pending (4) are all InboundAPI's Database-helper findings
(IA-026 High .. IA-029), left to the active feat/ipsen-movein effort per owner decision.

Highlights: caught a central-only-delivery security drift (SMTP creds broadcast to
sites — DM-025/SR-031), a never-committed 'Resolved' fix (SiteEventLogging-016 → -024),
an unguarded KPI recorder tick (KH-001), a trust-analyzer fallback weakening (SA-001),
and a native-alarm subscribe-path leak (DCL-023). ScriptAnalysis verdict: trust boundary
is semantically sound (symbol-based) in the production cluster config.

README regenerated; regen-readme.py --check passes (4 pending / 567 total).
Joseph Doherty
2026-06-20 18:02:32 -04:00
parent fd618cf1dc
commit d39089f4ed
19 changed files with 4031 additions and 69 deletions
+238 -3
@@ -5,10 +5,10 @@
| Module | `src/ZB.MOM.WW.ScadaBridge.SiteCallAudit` |
| Design doc | `docs/requirements/Component-SiteCallAudit.md` |
| Status | Reviewed |
| Last reviewed | 2026-05-28 |
| Last reviewed | 2026-06-20 |
| Reviewer | claude-agent |
| Commit reviewed | `1eb6e97` |
| Open findings | 2 |
| Commit reviewed | `4307c381` |
| Open findings | 0 |
## Summary
@@ -42,6 +42,23 @@ tests using a shared `MsSqlMigrationFixture`.
| 9 | Testing coverage | Yes | Relay path well covered (6 unit tests). Ingest/KPI well covered by MSSQL fixture. Stuck-only paging boundary edge not directly exercised (Finding 006). |
| 10 | Documentation & comments | Yes | XML docstring claims `SupervisorStrategy` uses Resume — incorrect (Finding 001). `AckErrorMessage` switch arm for `SiteUnreachable` falls through instead of throwing (Finding 005). |
#### Re-review 2026-06-20 (commit `4307c381`) — full review
Since the `1eb6e97` baseline the module grew the two pieces that were "still deferred" at the prior pass: the periodic per-site reconciliation puller (Piece A: `OnReconciliationTickAsync` → `ReconcileSiteAsync`, in-memory per-site cursor + idempotent monotonic upsert + per-site failure isolation) and the daily terminal-row purge scheduler (Piece B: `OnPurgeTickAsync` → `PurgeTerminalAsync`, continue-on-error). All of the new code is clean, idiomatic, and well-tested (dedicated `SiteCallAuditReconciliationTests`, `SiteCallAuditPurgeTests`, `SiteCallRelayTests`, KPI-sample and options suites); the M6 `SiteCallAuditKpiSampleSource` correctly anchors both cutoffs on a single capture instant and emits Global + per-Site + per-Node snapshots, and the per-node KPI handler is a clean additive peer of the per-site one. The three new findings are all forward-compat / cosmetic: a latent purge-gating gap (the purge timer is armed only when the reconciliation collaborators are present; currently masked because `Program.cs` co-registers them) plus two doc/edge items (the design doc over-claims six charted trend series when only three are charted, and the reconciliation cursor ignores `MoreAvailable` so a single-timestamp batch saturation re-pulls without progress). No correctness, concurrency, security, or resilience defects in the new code; the inclusive-cursor + idempotent-upsert pairing keeps every edge case safe-by-construction (wasted work, never corruption).
| # | Category | Examined | Notes |
|---|----------|----------|-------|
| 1 | Correctness & logic bugs | Yes | Reconciliation cursor ignores `MoreAvailable`; a single-`UpdatedAtUtc` batch saturation re-pulls without progress (Finding 009 — wasted work only, idempotent upsert prevents corruption). Finding 003 (`IngestedAtUtc`) resolved — now stamped in `OnUpsertAsync` and `ReconcileSiteAsync`. |
| 2 | Akka.NET conventions | Yes | `Sender` captured before first await on every async handler; `PipeTo` used for all read/relay replies; self-ticks scheduled via `ScheduleTellRepeatedlyCancelable` with `Self` sender, cancelled in `PostStop`. No issues found. |
| 3 | Concurrency & thread safety | Yes | `_reconciliationCursors`, `_centralCommunication`, timers all mutated only on the actor thread. Per-tick/per-message DI scope (`CreateAsyncScope` for the async tick paths, `CreateScope` for sync read paths) disposed in `finally` / `await using`. No issues found. |
| 4 | Error handling & resilience | Yes | Reconciliation has per-site try/catch isolation; purge is continue-on-error; ingest catches all and replies `Accepted=false`. Deliberate coarse-retry divergence from `SiteAuditReconciliationActor` (no per-row abandon) is documented and safe given idempotent upsert. No issues found. |
| 5 | Security | Yes | All SQL parameterised at the repository. Relay carries no user-controlled strings beyond `SourceSite` (a site id). No issues found. |
| 6 | Performance & resource management | Yes | One DI scope per tick reused across all sites; `MaxPageSize=200` clamp; async DbContext disposal off the dispatcher. `MoreAvailable`-ignoring re-pull is bounded wasted work (Finding 009). |
| 7 | Design-document adherence | Yes | Reconciliation puller + purge scheduler now implemented (prior Finding 004 resolved). Design doc over-claims six charted KPI trend series; only three are charted (Finding 008). |
| 8 | Code organization & conventions | Yes | Purge timer gated on reconciliation collaborators rather than just the repository — a host registering SiteCallAudit without the reconciliation client silently never purges (Finding 007, latent). `KpiSampleSource` registered via `TryAddEnumerable`; options owned by the component. |
| 9 | Testing coverage | Yes | Reconciliation/purge/relay/KPI/options all have dedicated suites. No test exercises a saturated reconciliation batch (`MoreAvailable: true`) or the single-timestamp no-progress edge — every reconciliation test uses `MoreAvailable: false` (relates to Finding 009). |
| 10 | Documentation & comments | Yes | Prior Findings 001/005 doc fixes still in place. Reconciliation `ReconcileSiteAsync` XML claims parity with `SiteAuditReconciliationActor` while diverging on `MoreAvailable` (the sibling reads it for stalled detection; this actor ignores it) — Finding 009 recommendation includes a doc correction. |
## Findings
### SiteCallAudit-001 — SupervisorStrategy override is dead code; XML claims Resume that is not enforced
@@ -323,3 +340,221 @@ Add a test that (a) inserts 6 rows in interleaved order: stuck, not-stuck,
stuck, not-stuck, stuck, not-stuck (oldest first); (b) issues a `StuckOnly`
page-size-1 query; (c) asserts each page returns exactly the stuck row, with
no overlap and all 3 stuck rows visited.
### SiteCallAudit-007 — Daily purge timer is armed only when the reconciliation collaborators are present; a host without the reconciliation client silently never purges
| | |
|--|--|
| Severity | Medium |
| Category | Code organization & conventions |
| Status | Resolved |
| Location | `src/ZB.MOM.WW.ScadaBridge.SiteCallAudit/SiteCallAuditActor.cs:307-321`, `src/ZB.MOM.WW.ScadaBridge.SiteCallAudit/SiteCallAuditActor.cs:202-227` |
**Description**
`StartPurgeTimer` (lines 307-321) gates the daily terminal-row purge tick on
the **reconciliation** collaborators being present:
```csharp
private void StartPurgeTimer()
{
    if (_pullClient is null || _siteEnumerator is null)
    {
        return;
    }
    // ... schedule PurgeTick ...
}
```
But the purge pass (`OnPurgeTickAsync` → `PurgeWithRepositoryAsync` →
`ISiteCallAuditRepository.PurgeTerminalAsync`) needs only the repository; it
has no dependency on `IPullSiteCallsClient` / `ISiteEnumerator`. The two
collaborators are resolved by the production constructor (lines 202-227) via
`serviceProvider.GetService<IPullSiteCallsClient>()` /
`GetService<ISiteEnumerator>()` — both registered by
`AddAuditLogCentralReconciliationClient`
(`AuditLog/ServiceCollectionExtensions.cs:473,523`, registered as `ISiteEnumerator`
and `IPullSiteCallsClient`). `GetService` (not `GetRequiredService`) returns
`null` if that helper was never called, so a host that registers
`AddSiteCallAudit()` **without** also calling
`AddAuditLogCentralReconciliationClient(...)` constructs the actor with both
collaborators null. In `PreStart` both `StartReconciliationTimer` and
`StartPurgeTimer` early-return, so the actor runs forever with **no purge timer
at all** → unbounded growth of the central `SiteCalls` table, with no log line
to say the purge was skipped.
This is **currently latent, not live**: `Program.cs` co-registers
`AddAuditLogCentralReconciliationClient(builder.Configuration)` (line 107)
immediately before `AddSiteCallAudit()` (line 113), so on the real central node
both collaborators resolve and both timers run today. The risk is a future host
(or a refactor that splits the reconciliation client out of the central
composition root, or a test/embedded host that wants only ingest + purge)
silently losing the purge with no diagnostic. The gate exists for a legitimate
reason — keeping the repo-only test ctor free of *both* background timers so the
MSSQL read/upsert tests see no scheduled side effects — but it conflates "no
reconciliation route" with "no purge", and the actor's own XML
(lines 36-38) documents the coupling as deliberate rather than flagging it as a
hazard.
**Recommendation**
Decouple `StartPurgeTimer` from the reconciliation collaborators — purge needs
only a repository, which both the production and the repo-only test ctors
always have. Two viable shapes:
- Preferred: gate the purge timer on its own real precondition (a repository is
always available, so arm it unconditionally in the production + reconciliation
ctors; keep it off only in the repo-only test ctor via an explicit
"background timers off" flag rather than by proxy of the reconciliation
collaborators). This keeps the MSSQL test isolation while removing the
accidental coupling.
- At minimum: log a `Warning` in `StartPurgeTimer` (and `StartReconciliationTimer`)
when the timer is **not** armed on the production path, so a misconfigured host
surfaces "SiteCallAudit purge timer not started — `PurgeTerminalAsync` will
never run" instead of growing the table silently.
Severity is a judgment call: Medium because the consequence (unbounded central
table growth) is real and silent, but it is latent today (the only production
composition root co-registers the reconciliation client), so an argument for Low
is defensible. Flagged Medium to weight the silent-failure + data-growth aspect.
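
The preferred shape above can be sketched without any Akka.NET dependency. This is a minimal illustration only, not the actor's real code: `TimerGateSketch`, its constructor parameters, and the `ReconciliationArmed` / `PurgeArmed` properties are hypothetical stand-ins, and "arming" a timer is reduced to setting a flag. It shows the purge gate depending only on an explicit background-timers switch, while the reconciliation gate still requires both collaborators and warns when they are missing.

```csharp
#nullable enable
using System;

// Hypothetical stand-in for the actor's timer wiring. The real
// SiteCallAuditActor schedules Akka.NET ticks; this sketch only records
// which timers would have been armed.
public sealed class TimerGateSketch
{
    private readonly bool _backgroundTimersEnabled; // false only for a repo-only test ctor
    private readonly object? _pullClient;           // stands in for IPullSiteCallsClient
    private readonly object? _siteEnumerator;       // stands in for ISiteEnumerator
    private readonly Action<string> _logWarning;

    public bool ReconciliationArmed { get; private set; }
    public bool PurgeArmed { get; private set; }

    public TimerGateSketch(bool backgroundTimersEnabled, object? pullClient,
                           object? siteEnumerator, Action<string> logWarning)
    {
        _backgroundTimersEnabled = backgroundTimersEnabled;
        _pullClient = pullClient;
        _siteEnumerator = siteEnumerator;
        _logWarning = logWarning;
    }

    public void PreStart()
    {
        // Explicit opt-out replaces the old "no collaborators means no timers" proxy.
        if (!_backgroundTimersEnabled) return;
        StartReconciliationTimer();
        StartPurgeTimer();
    }

    private void StartReconciliationTimer()
    {
        if (_pullClient is null || _siteEnumerator is null)
        {
            // Surface the skipped timer instead of staying silent.
            _logWarning("Reconciliation timer not started: reconciliation collaborators are not registered.");
            return;
        }
        ReconciliationArmed = true;
    }

    private void StartPurgeTimer()
    {
        // Purge needs only the repository, so no collaborator check gates it.
        PurgeArmed = true;
    }
}
```

With this gating, a host that omits the reconciliation client still arms the purge and gets one warning about the missing reconciliation route, and a test constructor that passes `backgroundTimersEnabled: false` sees neither timer.
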
**Resolution**
Resolved 2026-06-20 (commit `fd618cf1`): decoupled the daily purge timer from the reconciliation collaborators (a new `_backgroundTimersEnabled` flag) so a central node that omits the reconciliation client still purges — no more silent unbounded `SiteCalls` growth; a Warning is logged if the reconciliation collaborators are absent. Test added (purge arms without the reconciliation client).
### SiteCallAudit-008 — Design doc claims six charted KPI trend series; only three are actually charted
| | |
|--|--|
| Severity | Low |
| Category | Design-document adherence |
| Status | Resolved |
| Location | `docs/requirements/Component-SiteCallAudit.md:149-154`, `src/ZB.MOM.WW.ScadaBridge.SiteCallAudit/Kpi/SiteCallAuditKpiSampleSource.cs:42-47`, `src/ZB.MOM.WW.ScadaBridge.Commons/Types/Kpi/KpiMetrics.cs:52-62` |
**Description**
The design doc (`Component-SiteCallAudit.md` lines 149-154, the KPI History
interaction) states:
> "… the resulting `buffered` / `parked` / `failedLastInterval` /
> `deliveredLastInterval` / `stuck` / `oldestPendingAgeSeconds` series render as
> trends on the Site Calls page via `KpiTrendChart`."
That lists **six** series as charted. In the code only **three** are charted:
- The public charted catalog `KpiMetrics.SiteCallAudit`
(`KpiMetrics.cs:52-62`) exposes exactly `Buffered`, `Parked`, and
`FailedLastInterval` — and its own XML says "Charted Site Call Audit (#22)
metrics … Rendered by the Central UI Site Calls report trend panel."
- `SiteCallAuditKpiSampleSource` (lines 42-47) keys those three off the public
Commons catalog (`KpiMetrics.SiteCallAudit.*`) and keeps the other three —
`deliveredLastInterval`, `stuck`, `oldestPendingAgeSeconds` — as **private**
string literals, with the comment "Charted metrics share the public Commons
catalog … the uncharted internal metrics stay private here (#178)."
- The UI confirms it: `SiteCalls/SiteCallsReport.razor.cs` calls
`LoadSeriesAsync` for exactly `KpiMetrics.SiteCallAudit.Buffered`, `.Parked`,
and `.FailedLastInterval` — three series, no more.
So all six metrics are *sampled* into the `KpiSample` history store, but only
three are *charted*. The doc reads as if all six are rendered, which is a small
design-doc drift (over-claiming the UI surface). The code is internally
consistent and correctly commented; the doc is the stale party.
**Recommendation**
One-line doc edit on `Component-SiteCallAudit.md` lines 149-154: state that the
three series `buffered` / `parked` / `failedLastInterval` render as trends on
the Site Calls page via `KpiTrendChart`, and that `deliveredLastInterval` /
`stuck` / `oldestPendingAgeSeconds` are **sampled into the KPI-history store but
not charted** (available for future trend panels / ad-hoc query). No code change
— the code is already the intended state.
**Resolution**
Resolved 2026-06-20 (commit `fd618cf1`): the design doc now lists the three actually-charted KPI metrics (buffered/parked/failedLastInterval) and marks the other three as sampled-but-not-yet-charted, matching the `KpiMetrics` catalog. Doc-only.
### SiteCallAudit-009 — Reconciliation cursor advances inclusively and ignores `MoreAvailable`; a single-timestamp batch saturation re-pulls without progress
| | |
|--|--|
| Severity | Low |
| Category | Correctness & logic bugs |
| Status | Resolved |
| Location | `src/ZB.MOM.WW.ScadaBridge.SiteCallAudit/SiteCallAuditActor.cs:505-535` |
**Description**
`ReconcileSiteAsync` (lines 505-535) pulls rows since the per-site cursor,
upserts each, and advances the cursor to the maximum `UpdatedAtUtc` observed:
```csharp
var since = _reconciliationCursors.TryGetValue(site.SiteId, out var c) ? c : DateTime.MinValue;
var response = await client.PullAsync(site.SiteId, since, _options.ReconciliationBatchSize, ...);
var maxUpdated = since;
foreach (var row in response.SiteCalls)
{
    await repository.UpsertAsync(row with { IngestedAtUtc = nowUtc });
    if (row.UpdatedAtUtc > maxUpdated) maxUpdated = row.UpdatedAtUtc;
}
_reconciliationCursors[site.SiteId] = maxUpdated;
```
Two interacting properties:
1. The pull is **inclusive**: `PullAsync(site, since, …)` asks for rows with
`UpdatedAtUtc >= since` (documented in the method's XML, lines 497-503), and
the cursor advances to the **max** `UpdatedAtUtc` seen, which is itself one
of the rows just pulled. So the boundary row is re-pulled every tick and
deduped by the idempotent monotonic upsert — intended, harmless.
2. `response.MoreAvailable` is **never read** at all. The `PullSiteCallsResponse`
carries `MoreAvailable` (`PullSiteCallsResponse.cs:14-17`: "True when the
site saturated the requested batch size — the caller should advance the
cursor and pull again"), but `ReconcileSiteAsync` ignores it and relies on
the natural tick cadence to drain the backlog over successive ticks.
The edge case: if a site has **more rows than the batch size all sharing one
exact `UpdatedAtUtc`** (e.g. a burst written in the same tick / same clock
value), the saturated batch returns rows whose max `UpdatedAtUtc` equals
`since`. `maxUpdated` therefore stays at `since`, the cursor does **not**
advance, and because the pull is inclusive the next tick re-pulls the identical
window — and again, and again — making no forward progress on that site's
backlog. Because the upsert is idempotent and the `SiteCalls` table is an
eventually-consistent mirror (not the source of truth), this is **wasted work,
never corruption** — but it is an unbounded re-pull loop on a pathological
input, and any rows in that backlog beyond the batch ceiling never get
reconciled.
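
The no-progress edge can be reproduced in isolation. The sketch below is an in-memory model under stated assumptions, not the real client or actor: `Row`, `Pull`, and `Tick` are hypothetical names, the pull is modelled as "rows with `UpdatedAtUtc >= since`, ordered, capped at the batch size", and the site-side ordering is assumed to be deterministic.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class InclusiveCursorStallSketch
{
    // Hypothetical row shape; only the timestamp matters for the cursor.
    public sealed record Row(Guid Id, DateTime UpdatedAtUtc);

    // Assumed site-side behaviour: inclusive on `since`, capped at batchSize.
    public static List<Row> Pull(IEnumerable<Row> siteRows, DateTime since, int batchSize) =>
        siteRows.Where(r => r.UpdatedAtUtc >= since)
                .OrderBy(r => r.UpdatedAtUtc)
                .ThenBy(r => r.Id)
                .Take(batchSize)
                .ToList();

    // One tick of the cursor rule described above: the new cursor is the
    // maximum UpdatedAtUtc observed in the pulled batch (or `since` if none is later).
    public static DateTime Tick(IEnumerable<Row> siteRows, DateTime since, int batchSize)
    {
        var maxUpdated = since;
        foreach (var row in Pull(siteRows, since, batchSize))
        {
            if (row.UpdatedAtUtc > maxUpdated) maxUpdated = row.UpdatedAtUtc;
        }
        return maxUpdated;
    }
}
```

Feeding this model five rows that share one `UpdatedAtUtc` with a batch size of three, the first tick moves the cursor to that timestamp and every later tick returns the same cursor and the same three rows, so the remaining two rows are never pulled.
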
The sibling `SiteAuditReconciliationActor` shares the inclusive-cursor /
max-timestamp shape, so the same single-timestamp-saturation no-progress edge
applies to it — but that sibling **does read `MoreAvailable`** (it feeds its
stalled-detection state machine, `SiteAuditReconciliationActor.cs:324-325`,
publishing `SiteAuditTelemetryStalledChanged` so a non-draining site surfaces a
health signal). `ReconcileSiteAsync`'s XML (lines 530-534) claims the
no-immediate-re-pull behaviour "match[es] `SiteAuditReconciliationActor`", which
is only half true: it matches the cursor cadence but diverges by dropping
`MoreAvailable` entirely, so this actor has neither a continuation pull nor a
stalled signal — a saturated site lags silently with no observability.
**Recommendation**
- Consume `MoreAvailable`: either continue draining within the same tick while
`MoreAvailable` is true (bounded by a max-iterations guard), or — matching the
sibling — surface a stalled/non-draining signal when a batch comes back
saturated so a stuck site is observable rather than silent.
- Defend the single-timestamp no-progress edge with a tiebreaker beyond the
timestamp (e.g. advance on `(UpdatedAtUtc, TrackedOperationId)` as a composite
keyset cursor, and ask the pull for rows strictly after that composite), so a
burst sharing one `UpdatedAtUtc` cannot pin the cursor.
- Correct the `ReconcileSiteAsync` XML (lines 530-534): it claims parity with
`SiteAuditReconciliationActor` while diverging on `MoreAvailable` (the sibling
reads it for stalled detection; this actor ignores it).
Severity Low: idempotent upsert + mirror-not-source-of-truth mean no data
corruption, and the saturated-single-timestamp input is pathological; the cost
is wasted re-pulls and an un-drained backlog tail on that one input, plus the
missing observability.
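
As an illustration of the composite-cursor recommendation combined with a bounded within-tick drain, here is a minimal in-memory sketch. It is not the code that landed in the resolution below, and every name in it is an assumption: `Row`, `Cursor`, `Pull`, and `Drain` are hypothetical, a `Guid` `Id` stands in for `TrackedOperationId`, and `MoreAvailable` is modelled as "the page came back full".

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class KeysetDrainSketch
{
    public sealed record Row(Guid Id, DateTime UpdatedAtUtc);

    // Composite keyset cursor: timestamp plus a tiebreaker id.
    public sealed record Cursor(DateTime UpdatedAtUtc, Guid Id);

    // Assumed pull contract: rows strictly after the composite cursor,
    // ordered by (UpdatedAtUtc, Id), capped at batchSize.
    public static (List<Row> Rows, bool MoreAvailable) Pull(
        IReadOnlyList<Row> siteRows, Cursor after, int batchSize)
    {
        var page = siteRows
            .Where(r => r.UpdatedAtUtc > after.UpdatedAtUtc
                     || (r.UpdatedAtUtc == after.UpdatedAtUtc && r.Id.CompareTo(after.Id) > 0))
            .OrderBy(r => r.UpdatedAtUtc)
            .ThenBy(r => r.Id)
            .Take(batchSize)
            .ToList();
        return (page, page.Count == batchSize);
    }

    // Drains while the site reports more rows, bounded by maxPages so one
    // busy site cannot monopolise a tick.
    public static (Cursor Cursor, int Pulled) Drain(
        IReadOnlyList<Row> siteRows, Cursor start, int batchSize, int maxPages)
    {
        var cursor = start;
        var pulled = 0;
        for (var page = 0; page < maxPages; page++)
        {
            var (rows, moreAvailable) = Pull(siteRows, cursor, batchSize);
            if (rows.Count == 0) break;

            pulled += rows.Count; // the real actor would upsert each row here
            var last = rows[rows.Count - 1];
            cursor = new Cursor(last.UpdatedAtUtc, last.Id); // always moves forward

            if (!moreAvailable) break;
        }
        return (cursor, pulled);
    }
}
```

Because the cursor is strict on the pair rather than inclusive on the timestamp alone, a burst of rows sharing one `UpdatedAtUtc` still advances page by page. The trade-off is that the site-side query must support the composite ordering, which is why a no-progress check on a timestamp-only cursor remains a reasonable alternative.
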
**Resolution**
Resolved 2026-06-20 (commit `fd618cf1`): `ReconcileSiteAsync` now consumes `response.MoreAvailable` via a within-tick continuation drain bounded by a max-pages guard, with explicit no-progress (single-timestamp-saturation) detection that breaks and logs a Warning instead of re-pulling forever. The XML claim of parity with the sibling reconciler was corrected. Idempotent upsert retained. Tests added.