docs(code-review): full review at 4307c381 — 18 modules, 67 findings recorded + remediation tracked
Full per-module re-review of the 16 stale modules (last seen 1eb6e97 / 2026-05-28) plus first-ever reviews of KpiHistory (#26) and ScriptAnalysis (#25), at HEAD 4307c381. 67 new findings (0 Critical, 6 High, 27 Medium, 34 Low).

Remediation in commit fd618cf1 closed 5 of the 6 Highs and ~33 Medium/Low; the rest are Deferred/Won't Fix with rationale. Remaining pending (4) are all InboundAPI's Database-helper findings (IA-026 High .. IA-029), left to the active feat/ipsen-movein effort per owner decision.

Highlights: caught a central-only-delivery security drift (SMTP creds broadcast to sites — DM-025/SR-031), a never-committed 'Resolved' fix (SiteEventLogging-016 → -024), an unguarded KPI recorder tick (KH-001), a trust-analyzer fallback weakening (SA-001), and a native-alarm subscribe-path leak (DCL-023). ScriptAnalysis verdict: trust boundary is semantically sound (symbol-based) in the production cluster config.

README regenerated; regen-readme.py --check passes (4 pending / 567 total).
@@ -5,10 +5,10 @@
 | Module | `src/ZB.MOM.WW.ScadaBridge.SiteCallAudit` |
 | Design doc | `docs/requirements/Component-SiteCallAudit.md` |
 | Status | Reviewed |
-| Last reviewed | 2026-05-28 |
+| Last reviewed | 2026-06-20 |
 | Reviewer | claude-agent |
-| Commit reviewed | `1eb6e97` |
-| Open findings | 2 |
+| Commit reviewed | `4307c381` |
+| Open findings | 0 |

 ## Summary

@@ -42,6 +42,23 @@ tests using a shared `MsSqlMigrationFixture`.
| 9 | Testing coverage | Yes | Relay path well covered (6 unit tests). Ingest/KPI well covered by MSSQL fixture. Stuck-only paging boundary edge not directly exercised (Finding 006). |
| 10 | Documentation & comments | Yes | XML docstring claims `SupervisorStrategy` uses Resume — incorrect (Finding 001). `AckErrorMessage` switch arm for `SiteUnreachable` falls through instead of throwing (Finding 005). |

#### Re-review 2026-06-20 (commit `4307c381`) — full review

Since the `1eb6e97` baseline the module grew the two pieces that were "still deferred" at the prior pass: the periodic per-site reconciliation puller (Piece A — `OnReconciliationTickAsync` → `ReconcileSiteAsync`, in-memory per-site cursor + idempotent monotonic upsert + per-site failure isolation) and the daily terminal-row purge scheduler (Piece B — `OnPurgeTickAsync` → `PurgeTerminalAsync`, continue-on-error). All of the new code is clean, idiomatic, and well-tested (dedicated `SiteCallAuditReconciliationTests`, `SiteCallAuditPurgeTests`, `SiteCallRelayTests`, KPI-sample and options suites); the M6 `SiteCallAuditKpiSampleSource` correctly anchors both cutoffs on a single capture instant and emits Global + per-Site + per-Node snapshots, and the per-node KPI handler is a clean additive peer of the per-site one. The three new findings are all forward-compat / cosmetic: a latent purge-gating gap (the purge timer is armed only when the reconciliation collaborators are present — currently masked because `Program.cs` co-registers them) plus two doc/edge items (the design doc over-claims six charted trend series when only three are charted, and the reconciliation cursor ignores `MoreAvailable` so a single-timestamp batch saturation re-pulls without progress). No correctness, concurrency, security, or resilience defects in the new code; the inclusive-cursor + idempotent-upsert pairing keeps every edge case safe-by-construction (wasted work, never corruption).

| # | Category | Examined | Notes |
|---|----------|----------|-------|
| 1 | Correctness & logic bugs | Yes | Reconciliation cursor ignores `MoreAvailable`; a single-`UpdatedAtUtc` batch saturation re-pulls without progress (Finding 009 — wasted work only, idempotent upsert prevents corruption). Finding 003 (`IngestedAtUtc`) resolved — now stamped in `OnUpsertAsync` and `ReconcileSiteAsync`. |
| 2 | Akka.NET conventions | Yes | `Sender` captured before first await on every async handler; `PipeTo` used for all read/relay replies; self-ticks scheduled via `ScheduleTellRepeatedlyCancelable` with `Self` sender, cancelled in `PostStop`. No issues found. |
| 3 | Concurrency & thread safety | Yes | `_reconciliationCursors`, `_centralCommunication`, timers all mutated only on the actor thread. Per-tick/per-message DI scope (`CreateAsyncScope` for the async tick paths, `CreateScope` for sync read paths) disposed in `finally` / `await using`. No issues found. |
| 4 | Error handling & resilience | Yes | Reconciliation has per-site try/catch isolation; purge is continue-on-error; ingest catches all and replies `Accepted=false`. Deliberate coarse-retry divergence from `SiteAuditReconciliationActor` (no per-row abandon) is documented and safe given idempotent upsert. No issues found. |
| 5 | Security | Yes | All SQL parameterised at the repository. Relay carries no user-controlled strings beyond `SourceSite` (a site id). No issues found. |
| 6 | Performance & resource management | Yes | One DI scope per tick reused across all sites; `MaxPageSize=200` clamp; async DbContext disposal off the dispatcher. `MoreAvailable`-ignoring re-pull is bounded wasted work (Finding 009). |
| 7 | Design-document adherence | Yes | Reconciliation puller + purge scheduler now implemented (prior Finding 004 resolved). Design doc over-claims six charted KPI trend series; only three are charted (Finding 008). |
| 8 | Code organization & conventions | Yes | Purge timer gated on reconciliation collaborators rather than just the repository — a host registering SiteCallAudit without the reconciliation client silently never purges (Finding 007, latent). `KpiSampleSource` registered via `TryAddEnumerable`; options owned by the component. |
| 9 | Testing coverage | Yes | Reconciliation/purge/relay/KPI/options all have dedicated suites. No test exercises a saturated reconciliation batch (`MoreAvailable: true`) or the single-timestamp no-progress edge — every reconciliation test uses `MoreAvailable: false` (relates to Finding 009). |
| 10 | Documentation & comments | Yes | Prior Findings 001/005 doc fixes still in place. Reconciliation `ReconcileSiteAsync` XML claims parity with `SiteAuditReconciliationActor` while diverging on `MoreAvailable` (the sibling reads it for stalled detection; this actor ignores it) — Finding 009 recommendation includes a doc correction. |

## Findings

### SiteCallAudit-001 — SupervisorStrategy override is dead code; XML claims Resume that is not enforced
@@ -323,3 +340,221 @@ Add a test that (a) inserts 6 rows in interleaved order: stuck, not-stuck,
stuck, not-stuck, stuck, not-stuck (oldest first); (b) issues a `StuckOnly`
page-size-1 query; (c) asserts each page returns exactly the stuck row, with
no overlap and all 3 stuck rows visited.

### SiteCallAudit-007 — Daily purge timer is armed only when the reconciliation collaborators are present; a host without the reconciliation client silently never purges

| | |
|--|--|
| Severity | Medium |
| Category | Code organization & conventions |
| Status | Resolved |
| Location | `src/ZB.MOM.WW.ScadaBridge.SiteCallAudit/SiteCallAuditActor.cs:307-321`, `src/ZB.MOM.WW.ScadaBridge.SiteCallAudit/SiteCallAuditActor.cs:202-227` |

**Description**

`StartPurgeTimer` (lines 307-321) gates the daily terminal-row purge tick on
the **reconciliation** collaborators being present:

```csharp
private void StartPurgeTimer()
{
    if (_pullClient is null || _siteEnumerator is null)
    {
        return;
    }
    // ... schedule PurgeTick ...
}
```

But the purge pass (`OnPurgeTickAsync` → `PurgeWithRepositoryAsync` →
`ISiteCallAuditRepository.PurgeTerminalAsync`) needs only the repository — it
has no dependency on `IPullSiteCallsClient` / `ISiteEnumerator`. The two
collaborators are resolved by the production constructor (lines 202-227) via
`serviceProvider.GetService<IPullSiteCallsClient>()` /
`GetService<ISiteEnumerator>()` — both registered by
`AddAuditLogCentralReconciliationClient`
(`AuditLog/ServiceCollectionExtensions.cs:473,523`, registered as `ISiteEnumerator`
and `IPullSiteCallsClient`). `GetService` (not `GetRequiredService`) returns
`null` if that helper was never called, so a host that registers
`AddSiteCallAudit()` **without** also calling
`AddAuditLogCentralReconciliationClient(...)` constructs the actor with both
collaborators null. In `PreStart` both `StartReconciliationTimer` and
`StartPurgeTimer` early-return, so the actor runs forever with **no purge timer
at all** → unbounded growth of the central `SiteCalls` table, with no log line
to say the purge was skipped.

This is **currently latent, not live**: `Program.cs` co-registers
`AddAuditLogCentralReconciliationClient(builder.Configuration)` (line 107)
immediately before `AddSiteCallAudit()` (line 113), so on the real central node
both collaborators resolve and both timers run today. The risk is a future host
(or a refactor that splits the reconciliation client out of the central
composition root, or a test/embedded host that wants only ingest + purge)
silently losing the purge with no diagnostic. The gate exists for a legitimate
reason — keeping the repo-only test ctor free of *both* background timers so the
MSSQL read/upsert tests see no scheduled side effects — but it conflates "no
reconciliation route" with "no purge", and the actor's own XML
(lines 36-38) documents the coupling as deliberate rather than flagging it as a
hazard.

**Recommendation**

Decouple `StartPurgeTimer` from the reconciliation collaborators — purge needs
only a repository, which both the production and the repo-only test ctors
always have. Two viable shapes:

- Preferred: gate the purge timer on its own real precondition (a repository is
  always available, so arm it unconditionally in the production + reconciliation
  ctors; keep it off only in the repo-only test ctor via an explicit
  "background timers off" flag rather than by proxy of the reconciliation
  collaborators). This keeps the MSSQL test isolation while removing the
  accidental coupling.
- At minimum: log a `Warning` in `StartPurgeTimer` (and `StartReconciliationTimer`)
  when the timer is **not** armed on the production path, so a misconfigured host
  surfaces "SiteCallAudit purge timer not started — `PurgeTerminalAsync` will
  never run" instead of growing the table silently.

Severity is a judgment call: Medium because the consequence (unbounded central
table growth) is real and silent, but it is latent today (the only production
composition root co-registers the reconciliation client), so an argument for Low
is defensible. Flagged Medium to weight the silent-failure + data-growth aspect.

**Resolution**

Resolved 2026-06-20 (commit `fd618cf1`): decoupled the daily purge timer from the reconciliation collaborators (a new `_backgroundTimersEnabled` flag) so a central node that omits the reconciliation client still purges — no more silent unbounded `SiteCalls` growth; a Warning is logged if the reconciliation collaborators are absent. Test added (purge arms without the reconciliation client).
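
The decoupled gating described in this resolution could look roughly like the sketch below. It is an illustration only, not the code from commit `fd618cf1`: the `TimerGate` class, its constructor parameters, and the warning text are hypothetical, and only the `_backgroundTimersEnabled` flag name comes from the resolution note. The point it shows is that the purge decision reads the flag alone, while the reconciliation decision still needs its collaborators and warns when they are missing.

```csharp
using System;

// Hypothetical stand-in for the actor's timer-arming decisions.
// This is NOT the real SiteCallAuditActor; it only models the gating logic.
public sealed class TimerGate
{
    // Off only for the repo-only test constructor (flag name taken from the resolution note).
    private readonly bool _backgroundTimersEnabled;

    // Assumption: stands in for "_pullClient and _siteEnumerator are both non-null".
    private readonly bool _hasReconciliationCollaborators;

    public TimerGate(bool backgroundTimersEnabled, bool hasReconciliationCollaborators)
    {
        _backgroundTimersEnabled = backgroundTimersEnabled;
        _hasReconciliationCollaborators = hasReconciliationCollaborators;
    }

    // Purge needs only the repository, so it depends on the flag alone.
    public bool ShouldArmPurge() => _backgroundTimersEnabled;

    // Reconciliation still needs its collaborators; warn instead of skipping silently.
    public bool ShouldArmReconciliation(Action<string> warn)
    {
        if (!_backgroundTimersEnabled)
        {
            return false; // test constructor: no background timers, no warning
        }

        if (!_hasReconciliationCollaborators)
        {
            warn("SiteCallAudit reconciliation timer not started: reconciliation collaborators are absent");
            return false;
        }

        return true;
    }
}
```

With `new TimerGate(backgroundTimersEnabled: true, hasReconciliationCollaborators: false)`, the purge decision is `true` and the reconciliation decision is `false` with one warning, which is the misconfigured-host case the finding describes.
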

### SiteCallAudit-008 — Design doc claims six charted KPI trend series; only three are actually charted

| | |
|--|--|
| Severity | Low |
| Category | Design-document adherence |
| Status | Resolved |
| Location | `docs/requirements/Component-SiteCallAudit.md:149-154`, `src/ZB.MOM.WW.ScadaBridge.SiteCallAudit/Kpi/SiteCallAuditKpiSampleSource.cs:42-47`, `src/ZB.MOM.WW.ScadaBridge.Commons/Types/Kpi/KpiMetrics.cs:52-62` |

**Description**

The design doc (`Component-SiteCallAudit.md` lines 149-154, the KPI History
interaction) states:

> "… the resulting `buffered` / `parked` / `failedLastInterval` /
> `deliveredLastInterval` / `stuck` / `oldestPendingAgeSeconds` series render as
> trends on the Site Calls page via `KpiTrendChart`."

That lists **six** series as charted. In the code only **three** are charted:

- The public charted catalog `KpiMetrics.SiteCallAudit`
  (`KpiMetrics.cs:52-62`) exposes exactly `Buffered`, `Parked`, and
  `FailedLastInterval` — and its own XML says "Charted Site Call Audit (#22)
  metrics … Rendered by the Central UI Site Calls report trend panel."
- `SiteCallAuditKpiSampleSource` (lines 42-47) keys those three off the public
  Commons catalog (`KpiMetrics.SiteCallAudit.*`) and keeps the other three —
  `deliveredLastInterval`, `stuck`, `oldestPendingAgeSeconds` — as **private**
  string literals, with the comment "Charted metrics share the public Commons
  catalog … the uncharted internal metrics stay private here (#178)."
- The UI confirms it: `SiteCalls/SiteCallsReport.razor.cs` calls
  `LoadSeriesAsync` for exactly `KpiMetrics.SiteCallAudit.Buffered`, `.Parked`,
  and `.FailedLastInterval` — three series, no more.

So all six metrics are *sampled* into the `KpiSample` history store, but only
three are *charted*. The doc reads as if all six are rendered, which is a small
design-doc drift (over-claiming the UI surface). The code is internally
consistent and correctly commented; the doc is the stale party.

**Recommendation**

One-line doc edit on `Component-SiteCallAudit.md` lines 149-154: state that the
three series `buffered` / `parked` / `failedLastInterval` render as trends on
the Site Calls page via `KpiTrendChart`, and that `deliveredLastInterval` /
`stuck` / `oldestPendingAgeSeconds` are **sampled into the KPI-history store but
not charted** (available for future trend panels / ad-hoc query). No code change
— the code is already the intended state.

**Resolution**

Resolved 2026-06-20 (commit `fd618cf1`): the design doc now lists the three actually-charted KPI metrics (buffered/parked/failedLastInterval) and marks the other three as sampled-but-not-yet-charted, matching the `KpiMetrics` catalog. Doc-only.

### SiteCallAudit-009 — Reconciliation cursor advances inclusively and ignores `MoreAvailable`; a single-timestamp batch saturation re-pulls without progress

| | |
|--|--|
| Severity | Low |
| Category | Correctness & logic bugs |
| Status | Resolved |
| Location | `src/ZB.MOM.WW.ScadaBridge.SiteCallAudit/SiteCallAuditActor.cs:505-535` |

**Description**

`ReconcileSiteAsync` (lines 505-535) pulls rows since the per-site cursor,
upserts each, and advances the cursor to the maximum `UpdatedAtUtc` observed:

```csharp
var since = _reconciliationCursors.TryGetValue(site.SiteId, out var c) ? c : DateTime.MinValue;
var response = await client.PullAsync(site.SiteId, since, _options.ReconciliationBatchSize, ...);

var maxUpdated = since;
foreach (var row in response.SiteCalls)
{
    await repository.UpsertAsync(row with { IngestedAtUtc = nowUtc });
    if (row.UpdatedAtUtc > maxUpdated) maxUpdated = row.UpdatedAtUtc;
}
_reconciliationCursors[site.SiteId] = maxUpdated;
```

Two interacting properties:

1. The pull is **inclusive** — `PullAsync(site, since, …)` asks for rows with
   `UpdatedAtUtc >= since` (documented in the method's XML, lines 497-503), and
   the cursor advances to the **max** `UpdatedAtUtc` seen, which is itself one
   of the rows just pulled. So the boundary row is re-pulled every tick and
   deduped by the idempotent monotonic upsert — intended, harmless.
2. `response.MoreAvailable` is **never read** at all. The `PullSiteCallsResponse`
   carries `MoreAvailable` (`PullSiteCallsResponse.cs:14-17`: "True when the
   site saturated the requested batch size — the caller should advance the
   cursor and pull again"), but `ReconcileSiteAsync` ignores it and relies on
   the natural tick cadence to drain the backlog over successive ticks.

The edge case: if a site has **more rows than the batch size all sharing one
exact `UpdatedAtUtc`** (e.g. a burst written in the same tick / same clock
value), the saturated batch returns rows whose max `UpdatedAtUtc` equals
`since`. `maxUpdated` therefore stays at `since`, the cursor does **not**
advance, and because the pull is inclusive the next tick re-pulls the identical
window — and again, and again — making no forward progress on that site's
backlog. Because the upsert is idempotent and the `SiteCalls` table is an
eventually-consistent mirror (not the source of truth), this is **wasted work,
never corruption** — but it is an unbounded re-pull loop on a pathological
input, and any rows in that backlog beyond the batch ceiling never get
reconciled.
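
The no-progress edge can be reproduced with a small in-memory model. The sketch below is an illustration under stated assumptions, not the actor code: `CursorStallDemo`, `Pull`, and `RunTicks` are hypothetical names, rows are reduced to their `UpdatedAtUtc` value, and the pull is modelled as "rows at or after `since`, oldest first, capped at the batch size".

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class CursorStallDemo
{
    // Hypothetical stand-in for an inclusive pull: rows with UpdatedAtUtc >= since,
    // oldest first, capped at batchSize.
    public static List<DateTime> Pull(IEnumerable<DateTime> siteRows, DateTime since, int batchSize) =>
        siteRows.Where(t => t >= since).OrderBy(t => t).Take(batchSize).ToList();

    // Runs `ticks` reconciliation passes, advancing the cursor to the max timestamp
    // seen in each batch (MoreAvailable is deliberately not modelled, as in the finding).
    public static DateTime RunTicks(IReadOnlyList<DateTime> siteRows, int batchSize, int ticks)
    {
        var cursor = DateTime.MinValue;
        for (var i = 0; i < ticks; i++)
        {
            foreach (var updatedAtUtc in Pull(siteRows, cursor, batchSize))
            {
                if (updatedAtUtc > cursor) cursor = updatedAtUtc;
            }
        }
        return cursor;
    }
}
```

With five rows sharing one timestamp `t0`, one later row at `t0 + 1s`, and a batch size of 3, every tick after the first pulls the same three `t0` rows, so the cursor stays at `t0` however many ticks run and the later row is never reached. Raising the batch size to 6 lets a single tick see all rows and the cursor reaches `t0 + 1s`.
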

The sibling `SiteAuditReconciliationActor` shares the inclusive-cursor /
max-timestamp shape, so the same single-timestamp-saturation no-progress edge
applies to it — but that sibling **does read `MoreAvailable`** (it feeds its
stalled-detection state machine, `SiteAuditReconciliationActor.cs:324-325`,
publishing `SiteAuditTelemetryStalledChanged` so a non-draining site surfaces a
health signal). `ReconcileSiteAsync`'s XML (lines 530-534) claims the
no-immediate-re-pull behaviour "match[es] `SiteAuditReconciliationActor`", which
is only half true: it matches the cursor cadence but diverges by dropping
`MoreAvailable` entirely, so this actor has neither a continuation pull nor a
stalled signal — a saturated site lags silently with no observability.

**Recommendation**

- Consume `MoreAvailable`: either continue draining within the same tick while
  `MoreAvailable` is true (bounded by a max-iterations guard), or — matching the
  sibling — surface a stalled/non-draining signal when a batch comes back
  saturated so a stuck site is observable rather than silent.
- Defend the single-timestamp no-progress edge with a tiebreaker beyond the
  timestamp (e.g. advance on `(UpdatedAtUtc, TrackedOperationId)` as a composite
  keyset cursor, and ask the pull for rows strictly after that composite), so a
  burst sharing one `UpdatedAtUtc` cannot pin the cursor.
- Correct the `ReconcileSiteAsync` XML (lines 530-534): it claims parity with
  `SiteAuditReconciliationActor` while diverging on `MoreAvailable` (the sibling
  reads it for stalled detection; this actor ignores it).
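
The composite keyset cursor suggested in the second bullet might be sketched as follows. Everything here is hypothetical: the `Row` and `KeysetCursor` shapes, the `PullAfter` helper, and the choice of `Guid` for `TrackedOperationId` are assumptions made for illustration, and the filter runs over an in-memory sequence rather than the real site query. The sketch only demonstrates the "strictly after `(UpdatedAtUtc, TrackedOperationId)`" comparison.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical shapes; the real row and cursor types are not shown in this review.
public readonly record struct Row(DateTime UpdatedAtUtc, Guid TrackedOperationId);
public readonly record struct KeysetCursor(DateTime UpdatedAtUtc, Guid TrackedOperationId);

public static class KeysetPull
{
    // Returns rows strictly after the cursor in (UpdatedAtUtc, TrackedOperationId) order,
    // capped at batchSize. A null cursor means "from the beginning".
    public static List<Row> PullAfter(IEnumerable<Row> rows, KeysetCursor? cursor, int batchSize) =>
        rows.Where(r => cursor is null || IsAfter(r, cursor.Value))
            .OrderBy(r => r.UpdatedAtUtc)
            .ThenBy(r => r.TrackedOperationId)
            .Take(batchSize)
            .ToList();

    // Lexicographic comparison: later timestamp wins; equal timestamps fall back to the id.
    private static bool IsAfter(Row r, KeysetCursor c) =>
        r.UpdatedAtUtc > c.UpdatedAtUtc ||
        (r.UpdatedAtUtc == c.UpdatedAtUtc && r.TrackedOperationId.CompareTo(c.TrackedOperationId) > 0);
}
```

Because the cursor is exclusive and the id breaks ties, a burst of rows sharing one `UpdatedAtUtc` is paged through batch by batch instead of pinning the cursor. One caveat if the comparison is pushed into SQL Server: `uniqueidentifier` ordering there differs from .NET `Guid.CompareTo`, so the ordering and the "strictly after" predicate would both need to be evaluated on the same side.
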

Severity Low: idempotent upsert + mirror-not-source-of-truth mean no data
corruption, and the saturated-single-timestamp input is pathological; the cost
is wasted re-pulls and an un-drained backlog tail on that one input, plus the
missing observability.

**Resolution**

Resolved 2026-06-20 (commit `fd618cf1`): `ReconcileSiteAsync` now consumes `response.MoreAvailable` via a within-tick continuation drain bounded by a max-pages guard, with explicit no-progress (single-timestamp-saturation) detection that breaks and logs a Warning instead of re-pulling forever. The XML claim of parity with the sibling reconciler was corrected. Idempotent upsert retained. Tests added.
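
A bounded continuation drain with no-progress detection, as this resolution describes it, could be shaped like the sketch below. It is not the code from commit `fd618cf1`: `PullResult`, `BoundedDrain`, the synchronous `pull` delegate (standing in for the real asynchronous `PullAsync` call), and the `stalled` out-parameter are all hypothetical, and rows are again reduced to their timestamps.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical response shape: row timestamps plus the saturation flag.
public sealed record PullResult(IReadOnlyList<DateTime> Timestamps, bool MoreAvailable);

public static class BoundedDrain
{
    // pull: hypothetical delegate standing in for an inclusive pull (UpdatedAtUtc >= since).
    // Returns the advanced cursor; `stalled` is set when a saturated page left the cursor unmoved.
    public static DateTime Drain(Func<DateTime, PullResult> pull, DateTime since, int maxPages, out bool stalled)
    {
        stalled = false;
        var cursor = since;

        for (var page = 0; page < maxPages; page++)
        {
            var result = pull(cursor);

            var maxUpdated = cursor;
            foreach (var updatedAtUtc in result.Timestamps)
            {
                // The idempotent upsert of the row would happen here.
                if (updatedAtUtc > maxUpdated) maxUpdated = updatedAtUtc;
            }

            var advanced = maxUpdated > cursor;
            cursor = maxUpdated;

            if (!result.MoreAvailable) break; // backlog drained for this tick

            if (!advanced)
            {
                // Saturated page but the cursor is pinned: stop and let the caller log a Warning.
                stalled = true;
                break;
            }
        }

        return cursor;
    }
}
```

The `maxPages` guard bounds the work done in one tick even when a site keeps reporting `MoreAvailable`, and the `advanced` check turns the single-timestamp saturation case into one extra pull plus a signal rather than a silent repeat on every tick.
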