docs(code-review): full review at 4307c381 — 18 modules, 67 findings recorded + remediation tracked

Full per-module re-review of the 16 stale modules (last seen 1eb6e97 / 2026-05-28)
plus first-ever reviews of KpiHistory (#26) and ScriptAnalysis (#25), at HEAD 4307c381.

67 new findings (0 Critical, 6 High, 27 Medium, 34 Low). Remediation in commit
fd618cf1 closed 5 of the 6 Highs and ~33 Medium/Low; the rest are Deferred/Won't Fix
with rationale. Remaining pending (4) are all InboundAPI's Database-helper findings
(IA-026 High .. IA-029), left to the active feat/ipsen-movein effort per owner decision.

Highlights: caught a central-only-delivery security drift (SMTP creds broadcast to
sites — DM-025/SR-031), a never-committed 'Resolved' fix (SiteEventLogging-016 → -024),
an unguarded KPI recorder tick (KH-001), a trust-analyzer fallback weakening (SA-001),
and a native-alarm subscribe-path leak (DCL-023). ScriptAnalysis verdict: trust boundary
is semantically sound (symbol-based) in the production cluster config.

README regenerated; regen-readme.py --check passes (4 pending / 567 total).
Joseph Doherty
2026-06-20 18:02:32 -04:00
parent fd618cf1dc
commit d39089f4ed
19 changed files with 4031 additions and 69 deletions
+227 -2
@@ -5,9 +5,9 @@
| Module | `src/ZB.MOM.WW.ScadaBridge.AuditLog` |
| Design doc | `docs/requirements/Component-AuditLog.md` |
| Status | Reviewed |
| Last reviewed | 2026-05-28 |
| Last reviewed | 2026-06-20 |
| Reviewer | claude-agent |
| Commit reviewed | `1eb6e97` |
| Commit reviewed | `4307c381` |
| Open findings | 0 |
## Summary
@@ -59,6 +59,231 @@ chain doesn't reject a central composition root that mistakenly calls the site b
## Findings
#### Re-review 2026-06-20 (commit `4307c381`) — full review
Since `1eb6e97` the module was renamed (`ScadaLink → ZB.MOM.WW.ScadaBridge`) and substantially
re-architected: the site SQLite store is now a two-table canonical design (`audit_event`
append-only + `audit_forward_state` sidecar with a precomputed `IsCachedKind`); the payload
filter became the `IAuditRedactor` seam (`ScadaBridgeAuditRedactor` + `SafeDefaultAuditRedactor`
+ stateless `AuditRedactionPrimitives`); and milestone work landed (M5.3 T7 inbound response
capture, M5.5 T3 per-channel retention, K8 KPI source, central health snapshot). The module
remains one of the best-engineered in the tree: the AuditLog-001..011 fixes from the last pass
are all present and intact (dual read connection, async scopes, per-EventId reconciliation
retry escape valve, thread-pool-hop `Dispose`, lifecycle CTS), the best-effort "never abort the
action" contract is honoured throughout, and redaction over-redacts on fault. This pass found
**no Critical/High** issues. It did find two design-adherence drifts where documented
per-target/purge config knobs are silently inert, one error-handling asymmetry that lets a
transient DI fault restart the ingest singleton, and two Low items. That makes five new findings
(AuditLog-012..016), of which three are now Resolved and two Deferred.
| # | Category | Examined | Notes |
|---|----------|----------|-------|
| 1 | Correctness & logic bugs | Yes | `IsErrorRow` case-sensitive parse is correct (Status is `enum.ToString()` PascalCase). `IsCachedKind` parse falls back to `InboundRequest` (non-cached) on bad JSON — safe. `OnIngestAsync` scope-resolution throw is uncaught (AuditLog-014). |
| 2 | Akka.NET conventions | Yes | Sender/EventStream captured before first await across all three central actors; Tell for hot path, Ask only at the gRPC/ClusterClient boundary; supervisor-strategy comments now correctly credit per-row catch (AuditLog-002 fix intact). No findings. |
| 3 | Concurrency & thread safety | Yes | WAL dual-connection + `_readLock`/`_writeLock` split intact; `Interlocked` counters; `_drainGate` serialises ring drain. `WriteAsync` fast path ignores `ct` when `TryWrite` succeeds (AuditLog-015). |
| 4 | Error handling & resilience | Yes | Best-effort contract honoured everywhere; gRPC pull collapses tolerable faults to empty; reconciliation retry escape valve (AuditLog-004) intact. `OnIngestAsync` lacks the scope-creation try/catch its sibling `OnCachedTelemetryAsync` has (AuditLog-014). |
| 5 | Security | Yes | Header/body/SQL-param redaction + over-redact-on-fault safety net solid; `SafeDefaultAuditRedactor` fallback wired at all three writer sites (AuditLog-008 fix intact); SQL built with bound parameters only. No findings. |
| 6 | Performance & resource management | Yes | Hot path batched + back-pressured; backlog scan off the write lock; partition switch metadata-only; per-channel DELETE batched + clamped; gRPC channel cache race-safe. No findings. |
| 7 | Design-document adherence | Yes | Per-target `CapBytes` override is documented + has a config property but is read NOWHERE (AuditLog-012); purge config section/key drift from the doc's `AuditLogPurge`/`ChannelPurgeBatchSize` (AuditLog-013). Combined-telemetry transport (AuditLog-001) now wired. `SkipBodyCapture` honoured in InboundAPI middleware (out of module). |
| 8 | Code organization & conventions | Yes | Composition root well-segmented with "safe from any root" invariant; `INodeIdentityProvider` standardised on `GetRequiredService` (AuditLog-007 fix intact); idempotency sentinel on the health bridge (AuditLog-011 fix intact). No new findings. |
| 9 | Testing coverage | Yes | Broad (~13k lines). Gaps: no test asserts per-target `CapBytes` is applied (because it isn't — AuditLog-012), and no test binds `AuditLogPurgeOptions` from an `IConfiguration` shaped like the doc (would catch AuditLog-013). |
| 10 | Documentation & comments | Yes | Mostly excellent and accurate. `AuditLogIngestActor` class XML has a duplicated "the central-side" phrase and several stale "Bundle X" milestone tags that mislead (AuditLog-016). |
### AuditLog-012 — Per-target `CapBytes` override is documented and bindable but never read by the redactor
| | |
|--|--|
| Severity | Medium |
| Category | Design-document adherence |
| Status | Deferred |
| Location | `src/ZB.MOM.WW.ScadaBridge.AuditLog/Redaction/ScadaBridgeAuditRedactor.cs:185-196`, `src/ZB.MOM.WW.ScadaBridge.AuditLog/Configuration/PerTargetRedactionOverride.cs:13` |
**Description**
The design (Component-AuditLog.md §"Payload Capture Policy" / §"Configuration") documents a
per-target payload-cap override and gives it as a worked example:
`"PerTargetOverrides": { "Weather/GetForecast": { "CapBytes": 4096 } }`. The
`PerTargetRedactionOverride.CapBytes` property exists and its XML doc promises "Optional payload
cap override (bytes); null inherits the global cap." But `ScadaBridgeAuditRedactor.SelectCap`
chooses among `InboundMaxBytes` / `ErrorCapBytes` / `DefaultCapBytes` only — it never consults
`opts.PerTargetOverrides[target].CapBytes`. A repo-wide grep confirms `CapBytes` (the per-target
property) is read in zero production code paths. An operator who sets a per-target cap to bound a
chatty target's payloads gets the global `DefaultCapBytes`/`ErrorCapBytes` instead, with no error
and no log — the knob is silently inert. This is a code-vs-spec drift: either the feature was
dropped in the redactor re-architecture (C2/C3) without updating the doc + override XML, or it
was never implemented. The `AdditionalBodyRedactors` and `RedactSqlParamsMatching` per-target
fields ARE honoured, which makes the silent `CapBytes` omission especially easy to miss.
**Recommendation**
Either (a) implement it: in `SelectCap`, when `category != ApiInbound` and the resolved
`PerTargetOverrides` entry for `rawEvent.Target` has a non-null `CapBytes`, use
`Math.Min`/`Max(over.CapBytes.Value, …)` per the intended semantics (the doc says it *overrides*
the global cap — clarify whether error rows still get the larger of the two), and add a
`ScadaBridgeAuditRedactorTests` case asserting a per-target cap shortens a payload the global cap
would not; or (b) if the feature is intentionally dropped, delete `PerTargetRedactionOverride.CapBytes`,
remove the `CapBytes` example from the Component-AuditLog.md Configuration block, and note the
removal in the design doc so the override surface stops advertising a no-op.
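A minimal sketch of option (a), under one possible reading of the intended semantics: the per-target value replaces the global cap outright for non-inbound rows. All type and member shapes here (`AuditCategory`, `RedactionOptions`, the `isErrorRow` flag, the `CapSelector` class) are hypothetical stand-ins, not the module's real signatures, and whether error rows should instead take the larger of the two caps remains the open design question noted above.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-ins for the module's real types.
public enum AuditCategory { ApiInbound, ApiOutbound }

public sealed record PerTargetOverride(int? CapBytes);

public sealed record RedactionOptions(
    int InboundMaxBytes,
    int ErrorCapBytes,
    int DefaultCapBytes,
    IReadOnlyDictionary<string, PerTargetOverride> PerTargetOverrides);

public static class CapSelector
{
    public static int SelectCap(
        RedactionOptions opts, AuditCategory category, string target, bool isErrorRow)
    {
        // Inbound rows keep their dedicated cap; per-target overrides do not apply.
        if (category == AuditCategory.ApiInbound)
            return opts.InboundMaxBytes;

        // Assumption: a non-null per-target CapBytes replaces the global cap outright.
        if (target != null
            && opts.PerTargetOverrides.TryGetValue(target, out var over)
            && over.CapBytes is int cap)
        {
            return cap;
        }

        // No override configured: fall back to the global caps.
        return isErrorRow ? opts.ErrorCapBytes : opts.DefaultCapBytes;
    }
}
```

With the doc's worked example (`"Weather/GetForecast": { "CapBytes": 4096 }`), this sketch would return 4096 for that target and the global cap for every other target.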
**Resolution**
Deferred 2026-06-20: whether to implement per-target `CapBytes` in `SelectCap` or delete the unused property + doc example is a design-owner decision (was it intentionally dropped in the redactor re-architecture, or lost?). Recorded; no change this pass.
### AuditLog-013 — Purge config section + key drift: documented `AuditLogPurge`/`ChannelPurgeBatchSize` does not bind
| | |
|--|--|
| Severity | Medium |
| Category | Design-document adherence |
| Status | Resolved |
| Location | `src/ZB.MOM.WW.ScadaBridge.AuditLog/ServiceCollectionExtensions.cs:56,370-371`, `src/ZB.MOM.WW.ScadaBridge.AuditLog/Central/AuditLogPurgeOptions.cs:40,47` |
**Description**
Component-AuditLog.md §"Retention & Purge" / §"Configuration" documents the purge tuning as a
top-level section named `AuditLogPurge` with keys `IntervalHours` and `ChannelPurgeBatchSize`:
```jsonc
"AuditLogPurge": { "IntervalHours": 24, "ChannelPurgeBatchSize": 5000 }
```
The code does not match on either axis. (1) `ServiceCollectionExtensions.PurgeSectionName` is
`"AuditLog:Purge"` — a nested subsection, not the documented top-level `AuditLogPurge`. (2) The
batch-size property is `AuditLogPurgeOptions.ChannelPurgeBatchSizeConfigured`, so its bind key is
`ChannelPurgeBatchSizeConfigured`, not the documented `ChannelPurgeBatchSize` (which is a
computed read-only property the binder ignores). There is no `[ConfigurationKeyName]` attribute
to reconcile the names. The net effect: an operator who follows the design doc and writes
`"AuditLogPurge": { "ChannelPurgeBatchSize": 1000 }` has the whole section silently ignored: the
wrong section path leaves `IntervalHours` at the 24 h default, and the property-name mismatch
would leave the channel batch size at 5000 even under the correct section path. The actor still functions on defaults,
so nothing crashes, but the documented operator-facing tuning surface is inert. No test binds
`AuditLogPurgeOptions` from an `IConfiguration` shaped like the doc, so the drift is unguarded.
(The §407 doc text says `AuditLogPurge:ChannelPurgeBatchSize`, agreeing with the JSON block and
disagreeing with the code on both section and key.)
**Recommendation**
Pick one canonical shape and make code, doc, and a binding test agree. Recommended: keep the
nested `AuditLog:Purge` section (consistent with `AuditLog:SiteWriter` / `AuditLog:SiteTelemetry`
/ `AuditLog:Reconciliation`), but then fix the doc's JSON block + §407 to read
`"AuditLog": { "Purge": { "IntervalHours": …, "ChannelPurgeBatchSize": … } }`, AND add
`[ConfigurationKeyName("ChannelPurgeBatchSize")]` to `ChannelPurgeBatchSizeConfigured` (or rename
it and move the clamp into the `Interval`-style computed getter pattern) so the documented key
binds. Add an `AuditLogOptionsBindingTests`-style test that binds the purge section from an
in-memory config and asserts both `IntervalHours` and the channel batch size take effect.
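For reference, the recommended nested shape would look like the fragment below once the placeholders are filled in. The values are the defaults cited above (24 h interval, 5000-row channel batch), not tuning advice. The fragment assumes the `[ConfigurationKeyName("ChannelPurgeBatchSize")]` attribute has been applied, so that the documented key name binds.

```jsonc
"AuditLog": {
  "Purge": { "IntervalHours": 24, "ChannelPurgeBatchSize": 5000 }
}
```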
**Resolution**
Resolved 2026-06-20 (commit `fd618cf1`): added `[ConfigurationKeyName("ChannelPurgeBatchSize")]` so the documented key binds, reconciled the design doc to the real nested `AuditLog:Purge` section, and added a binding test proving the purge options bind from the documented keys.
### AuditLog-014 — `AuditLogIngestActor.OnIngestAsync` does not guard scope/repository resolution, so a transient DI fault restarts the singleton and drops the reply
| | |
|--|--|
| Severity | Medium |
| Category | Error handling & resilience |
| Status | Resolved |
| Location | `src/ZB.MOM.WW.ScadaBridge.AuditLog/Central/AuditLogIngestActor.cs:135-148` |
**Description**
`OnIngestAsync` opens `await using var scope = _serviceProvider!.CreateAsyncScope();` and
`scope.ServiceProvider.GetRequiredService<IAuditLogRepository>()` with NO surrounding try/catch.
The per-row try/catch lives inside `IngestWithRepositoryAsync`, so it only protects the insert
loop — a throw from scope creation or from `GetRequiredService` (a `DbContext` factory throw on
transient SQL-connection exhaustion, a pooled-context init fault, a DI resolution race during
host churn) propagates straight out of the message handler. When a `ReceiveActor`'s async handler
faults, Akka applies the parent's supervision and restarts the singleton; the captured
`replyTo.Tell(...)` at line 147 never runs, so the site's `Ask<IngestAuditEventsReply>` times out
and (correctly) retries — but the singleton has been bounced over a transient fault. The class
XML even claims "that per-row catch is what keeps this actor alive across handler throws" — which
is true for insert throws but false for the scope-resolution throw this handler leaves uncaught.
The sibling `OnCachedTelemetryAsync` (line 216) wraps its entire scope-creation-plus-loop body in
a try/catch and logs + replies-with-partial precisely to avoid this; `OnIngestAsync` is the
asymmetric one. `AuditLogPurgeActor.OnTickAsync` and `SiteAuditReconciliationActor.OnTickAsync`
both guard their `GetRequiredService` with try/catch too — `OnIngestAsync` is the outlier.
**Recommendation**
Wrap the scope-creation + repository-resolution in `OnIngestAsync` in a try/catch that mirrors
`OnCachedTelemetryAsync`: on failure, log the resolution fault and `replyTo.Tell(new
IngestAuditEventsReply(accepted))` (accepted will be whatever was processed before the throw,
typically empty) so the site keeps its rows `Pending` and retries on the next drain without
bouncing the singleton. Optionally bump the `ICentralAuditWriteFailureCounter` on that path too,
so a sustained DI/connection fault is visible on the dashboard.
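The shape of the recommended guard can be sketched without Akka.NET. In this sketch the `resolveInsert` delegate stands in for scope creation plus `GetRequiredService<IAuditLogRepository>()`, and the returned `IngestReply` stands in for `replyTo.Tell(...)`. Every name here is hypothetical and chosen for illustration only.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

// Hypothetical stand-in for IngestAuditEventsReply.
public sealed record IngestReply(IReadOnlyList<Guid> Accepted);

public static class GuardedIngest
{
    public static async Task<IngestReply> HandleAsync(
        Func<Task<Func<Guid, Task>>> resolveInsert, // models scope creation + repository resolution
        IReadOnlyList<Guid> eventIds,
        Action<Exception> logFault)
    {
        var accepted = new List<Guid>();
        try
        {
            // A throw here is the case the unguarded handler let escape.
            var insert = await resolveInsert();

            foreach (var id in eventIds)
            {
                try
                {
                    await insert(id);
                    accepted.Add(id);
                }
                catch (Exception ex)
                {
                    // Per-row catch: one bad row does not abort the batch.
                    logFault(ex);
                }
            }
        }
        catch (Exception ex)
        {
            // Resolution fault: log it and fall through to the reply below.
            logFault(ex);
        }

        // Always reply with whatever was accepted (typically empty after a
        // resolution fault), so the caller retries instead of timing out.
        return new IngestReply(accepted);
    }
}
```

The key design point is that the reply sits outside both try blocks, so it is sent on every path.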
**Resolution**
Resolved 2026-06-20 (commit `fd618cf1`): `OnIngestAsync` now wraps the scope-creation + repo resolution + ingest in a try/catch that logs, best-effort bumps the write-failure counter, and still replies via `replyTo.Tell(new IngestAuditEventsReply(accepted))`, so a transient DI/DbContext fault no longer bounces the central ingest singleton or drops the reply (mirrors `OnCachedTelemetryAsync`).
### AuditLog-015 — `SqliteAuditWriter.WriteAsync` ignores the cancellation token on the fast (`TryWrite`) path
| | |
|--|--|
| Severity | Low |
| Category | Concurrency & thread safety |
| Status | Deferred |
| Location | `src/ZB.MOM.WW.ScadaBridge.AuditLog/Site/SqliteAuditWriter.cs:254-275` |
**Description**
`WriteAsync(AuditEvent, CancellationToken ct)` accepts a `ct` but, on the common fast path where
`_writeQueue.Writer.TryWrite(pending)` succeeds, it returns `pending.Completion.Task` without ever
observing `ct`. The token is only honoured on the slow path (`WriteSlowPathAsync`, which passes it
to `WriteAsync`). So a caller that passes an already-cancelled (or soon-cancelled) token still
enqueues the row and then awaits an uncancellable completion. In practice the hot-path callers
(`FallbackAuditWriter`, `CachedCallTelemetryForwarder`) thread a real `ct` through, and the audit
contract is best-effort durable-in-microseconds, so the impact is minor — the row simply gets
written instead of cancelled. But the signature advertises cancellation it does not deliver on the
dominant path, which is a latent surprise for any future caller that relies on it (e.g. to bound a
shutdown drain). The completion `Task` returned is not linked to `ct`, so an awaiter cannot abandon
the wait either.
**Recommendation**
Either (a) honour the token on the fast path — `ct.ThrowIfCancellationRequested()` before
`TryWrite`, and return a `ct`-linked wait (e.g. `pending.Completion.Task.WaitAsync(ct)`) so an
awaiter can abandon a stuck completion; or (b) if cancellation is intentionally a no-op for this
best-effort hot path, drop the `ct` parameter (or document on the method that it is observed only
on the back-pressured slow path) so the surface stops advertising behaviour it does not provide.
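Option (a) might look roughly like the sketch below. `PendingWrite` and `SketchWriter` are hypothetical stand-ins for the real writer's types, and the bounded capacity of 1024 is arbitrary. The sketch relies on `Task.WaitAsync(CancellationToken)`, which requires .NET 6 or later.

```csharp
using System;
using System.Threading;
using System.Threading.Channels;
using System.Threading.Tasks;

// Hypothetical stand-in for the queued write item.
public sealed class PendingWrite
{
    public TaskCompletionSource Completion { get; } =
        new(TaskCreationOptions.RunContinuationsAsynchronously);
}

public sealed class SketchWriter
{
    private readonly Channel<PendingWrite> _writeQueue =
        Channel.CreateBounded<PendingWrite>(1024);

    public Task WriteAsync(PendingWrite pending, CancellationToken ct)
    {
        // Observe the token before enqueueing. Because this method is not
        // async, the exception is thrown synchronously to the caller.
        ct.ThrowIfCancellationRequested();

        if (_writeQueue.Writer.TryWrite(pending))
        {
            // Fast path: link the returned wait to ct so an awaiter can
            // abandon it. The row itself stays queued and is still written.
            return pending.Completion.Task.WaitAsync(ct);
        }

        return WriteSlowPathAsync(pending, ct);
    }

    private async Task WriteSlowPathAsync(PendingWrite pending, CancellationToken ct)
    {
        // Back-pressured path: wait for queue space, honouring ct.
        await _writeQueue.Writer.WriteAsync(pending, ct);
        await pending.Completion.Task.WaitAsync(ct);
    }
}
```

Note that `WaitAsync(ct)` only cancels the caller's wait, not the enqueued write, which is consistent with the best-effort contract described above.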
**Resolution**
Deferred 2026-06-20: whether cancellation should be honoured on the best-effort `TryWrite` hot path (vs. intentionally a no-op) is a design decision. Recorded; no change this pass.
### AuditLog-016 — `AuditLogIngestActor` class XML has a duplicated phrase and stale "Bundle"/`IngestedAtUtc`-column wording
| | |
|--|--|
| Severity | Low |
| Category | Documentation & comments |
| Status | Resolved |
| Location | `src/ZB.MOM.WW.ScadaBridge.AuditLog/Central/AuditLogIngestActor.cs:13-46` |
**Description**
The `AuditLogIngestActor` class summary reads "Each row is stamped with the central-side **the
central-side IngestedAtUtc**" — the phrase "the central-side" is duplicated, a copy/paste slip
that survived the rename. The same summary describes `IngestedAtUtc` as if it were a promoted
column, but post-C3 it is a `DetailsJson` field stamped via `AuditRowProjection.WithIngestedAtUtc`
(as the handler bodies and their own inline comments correctly note) — the class-level prose is
now stale relative to the canonical-record shim. The remarks also still narrate the design in
"Bundle A / Bundle D / Bundle E" milestone terms (and the two-constructor rationale references
"Bundle D's tests" / "Bundle E's host wiring"), which no longer map to any current artifact and
read as archaeology to a new maintainer. None of this is wrong behaviourally, but the duplicated
phrase plus the column-vs-DetailsJson mismatch is exactly the kind of doc rot that misleads.
**Recommendation**
Collapse the duplicated "the central-side the central-side" to a single phrase; reword the
`IngestedAtUtc` sentence to say it is stamped into `DetailsJson` (matching the handler comments);
and either drop the "Bundle X" tags in favour of the feature/milestone names used elsewhere in
the module, or add a one-line legend. A quick sweep of the other central actors' class XML for
the same "Bundle" archaeology would be worthwhile while in there.
**Resolution**
Resolved 2026-06-20 (commit `fd618cf1`): class XML doc deduped ('the central-side' duplicate removed), `IngestedAtUtc` reworded to 'stamped into DetailsJson (no promoted column)', and the stale Bundle milestone tags replaced with role-neutral wording.
### AuditLog-001 — Combined-telemetry transport is plumbed end-to-end but never invoked in production
| | |