docs(code-review): full review at 4307c381 — 18 modules, 67 findings recorded + remediation tracked
Full per-module re-review of the 16 stale modules (last seen 1eb6e97 / 2026-05-28) plus first-ever reviews of KpiHistory (#26) and ScriptAnalysis (#25), at HEAD 4307c381. 67 new findings (0 Critical, 6 High, 27 Medium, 34 Low). Remediation in commit fd618cf1 closed 5 of the 6 Highs and ~33 Medium/Low; the rest are Deferred/Won't Fix with rationale. Remaining pending (4) are all InboundAPI's Database-helper findings (IA-026 High .. IA-029), left to the active feat/ipsen-movein effort per owner decision. Highlights: caught a central-only-delivery security drift (SMTP creds broadcast to sites — DM-025/SR-031), a never-committed 'Resolved' fix (SiteEventLogging-016 → -024), an unguarded KPI recorder tick (KH-001), a trust-analyzer fallback weakening (SA-001), and a native-alarm subscribe-path leak (DCL-023). ScriptAnalysis verdict: trust boundary is semantically sound (symbol-based) in the production cluster config. README regenerated; regen-readme.py --check passes (4 pending / 567 total).
@@ -5,9 +5,9 @@
| Module | `src/ZB.MOM.WW.ScadaBridge.AuditLog` |
| Design doc | `docs/requirements/Component-AuditLog.md` |
| Status | Reviewed |
| Last reviewed | 2026-05-28 |
| Last reviewed | 2026-06-20 |
| Reviewer | claude-agent |
| Commit reviewed | `1eb6e97` |
| Commit reviewed | `4307c381` |
| Open findings | 0 |

## Summary
@@ -59,6 +59,231 @@ chain doesn't reject a central composition root that mistakenly calls the site b

## Findings

#### Re-review 2026-06-20 (commit `4307c381`) — full review

Since `1eb6e97` the module was renamed (`ScadaLink → ZB.MOM.WW.ScadaBridge`) and substantially
re-architected: the site SQLite store is now a two-table canonical design (`audit_event`
append-only + `audit_forward_state` sidecar with a precomputed `IsCachedKind`); the payload
filter became the `IAuditRedactor` seam (`ScadaBridgeAuditRedactor` + `SafeDefaultAuditRedactor`
+ stateless `AuditRedactionPrimitives`); and milestone work landed (M5.3 T7 inbound response
capture, M5.5 T3 per-channel retention, K8 KPI source, central health snapshot). The module
remains one of the best-engineered in the tree: the AuditLog-001..011 fixes from the last pass
are all present and intact (dual read connection, async scopes, per-EventId reconciliation
retry escape valve, thread-pool-hop `Dispose`, lifecycle CTS), the best-effort "never abort the
action" contract is honoured throughout, and redaction over-redacts on fault. This pass found
**no Critical/High** issues. It did find two design-adherence drifts where documented
per-target/purge config knobs are silently inert, one error-handling asymmetry that lets a
transient DI fault restart the ingest singleton, and two Low items. Five new Open findings
(AuditLog-012..016).

| # | Category | Examined | Notes |
|---|----------|----------|-------|
| 1 | Correctness & logic bugs | Yes | `IsErrorRow` case-sensitive parse is correct (Status is `enum.ToString()` PascalCase). `IsCachedKind` parse falls back to `InboundRequest` (non-cached) on bad JSON — safe. `OnIngestAsync` scope-resolution throw is uncaught (AuditLog-014). |
| 2 | Akka.NET conventions | Yes | Sender/EventStream captured before first await across all three central actors; Tell for hot path, Ask only at the gRPC/ClusterClient boundary; supervisor-strategy comments now correctly credit per-row catch (AuditLog-002 fix intact). No findings. |
| 3 | Concurrency & thread safety | Yes | WAL dual-connection + `_readLock`/`_writeLock` split intact; `Interlocked` counters; `_drainGate` serialises ring drain. `WriteAsync` fast path ignores `ct` when `TryWrite` succeeds (AuditLog-015). |
| 4 | Error handling & resilience | Yes | Best-effort contract honoured everywhere; gRPC pull collapses tolerable faults to empty; reconciliation retry escape valve (AuditLog-004) intact. `OnIngestAsync` lacks the scope-creation try/catch its sibling `OnCachedTelemetryAsync` has (AuditLog-014). |
| 5 | Security | Yes | Header/body/SQL-param redaction + over-redact-on-fault safety net solid; `SafeDefaultAuditRedactor` fallback wired at all three writer sites (AuditLog-008 fix intact); SQL built with bound parameters only. No findings. |
| 6 | Performance & resource management | Yes | Hot path batched + back-pressured; backlog scan off the write lock; partition switch metadata-only; per-channel DELETE batched + clamped; gRPC channel cache race-safe. No findings. |
| 7 | Design-document adherence | Yes | Per-target `CapBytes` override is documented + has a config property but is read NOWHERE (AuditLog-012); purge config section/key drift from the doc's `AuditLogPurge`/`ChannelPurgeBatchSize` (AuditLog-013). Combined-telemetry transport (AuditLog-001) now wired. `SkipBodyCapture` honoured in InboundAPI middleware (out of module). |
| 8 | Code organization & conventions | Yes | Composition root well-segmented with "safe from any root" invariant; `INodeIdentityProvider` standardised on `GetRequiredService` (AuditLog-007 fix intact); idempotency sentinel on the health bridge (AuditLog-011 fix intact). No new findings. |
| 9 | Testing coverage | Yes | Broad (~13k lines). Gaps: no test asserts per-target `CapBytes` is applied (because it isn't — AuditLog-012), and no test binds `AuditLogPurgeOptions` from an `IConfiguration` shaped like the doc (would catch AuditLog-013). |
| 10 | Documentation & comments | Yes | Mostly excellent and accurate. `AuditLogIngestActor` class XML has a duplicated "the central-side" phrase and several stale "Bundle X" milestone tags that mislead (AuditLog-016). |

### AuditLog-012 — Per-target `CapBytes` override is documented and bindable but never read by the redactor

| | |
|--|--|
| Severity | Medium |
| Category | Design-document adherence |
| Status | Deferred |
| Location | `src/ZB.MOM.WW.ScadaBridge.AuditLog/Redaction/ScadaBridgeAuditRedactor.cs:185-196`, `src/ZB.MOM.WW.ScadaBridge.AuditLog/Configuration/PerTargetRedactionOverride.cs:13` |

**Description**

The design (Component-AuditLog.md §"Payload Capture Policy" / §"Configuration") documents a
per-target payload-cap override and gives it as a worked example:
`"PerTargetOverrides": { "Weather/GetForecast": { "CapBytes": 4096 } }`. The
`PerTargetRedactionOverride.CapBytes` property exists and its XML doc promises "Optional payload
cap override (bytes); null inherits the global cap." But `ScadaBridgeAuditRedactor.SelectCap`
chooses among `InboundMaxBytes` / `ErrorCapBytes` / `DefaultCapBytes` only — it never consults
`opts.PerTargetOverrides[target].CapBytes`. A repo-wide grep confirms `CapBytes` (the per-target
property) is read in zero production code paths. An operator who sets a per-target cap to bound a
chatty target's payloads gets the global `DefaultCapBytes`/`ErrorCapBytes` instead, with no error
and no log — the knob is silently inert. This is a code-vs-spec drift: either the feature was
dropped in the redactor re-architecture (C2/C3) without updating the doc + override XML, or it
was never implemented. The `AdditionalBodyRedactors` and `RedactSqlParamsMatching` per-target
fields ARE honoured, which makes the silent `CapBytes` omission especially easy to miss.

**Recommendation**

Either (a) implement it: in `SelectCap`, when `category != ApiInbound` and the resolved
`PerTargetOverrides` entry for `rawEvent.Target` has a non-null `CapBytes`, use
`Math.Min`/`Max(over.CapBytes.Value, …)` per the intended semantics (the doc says it *overrides*
the global cap — clarify whether error rows still get the larger of the two), and add a
`ScadaBridgeAuditRedactorTests` case asserting a per-target cap shortens a payload the global cap
would not; or (b) if the feature is intentionally dropped, delete `PerTargetRedactionOverride.CapBytes`,
remove the `CapBytes` example from the Component-AuditLog.md Configuration block, and note the
removal in the design doc so the override surface stops advertising a no-op.
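
Option (a) could look roughly like the sketch below. It is an illustration only, not the module's code: the `SelectCap` signature, the three global cap values, the flat `perTargetCaps` dictionary, and the `Billing/PostInvoice` target are hypothetical stand-ins. It also assumes the simplest reading of the design doc, where a non-null per-target cap plainly replaces the global cap for non-inbound rows; the error-row question raised in the recommendation is left open.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical global caps; the real values live in the module's options.
const int InboundMaxBytes = 65536;
const int ErrorCapBytes = 16384;
const int DefaultCapBytes = 8192;

// Per-target overrides keyed by target name; null means "inherit the global cap".
var perTargetCaps = new Dictionary<string, int?>
{
    ["Weather/GetForecast"] = 4096, // the worked example from the design doc
    ["Billing/PostInvoice"] = null, // hypothetical target without a cap override
};

// Assumed semantics: inbound rows keep their own cap; otherwise a non-null
// per-target cap replaces the global error/default cap outright.
int SelectCap(bool isInbound, bool isError, string target)
{
    if (isInbound) return InboundMaxBytes;
    int globalCap = isError ? ErrorCapBytes : DefaultCapBytes;
    return perTargetCaps.TryGetValue(target, out int? cap) && cap.HasValue
        ? cap.Value
        : globalCap;
}

Console.WriteLine(SelectCap(false, false, "Weather/GetForecast")); // 4096
Console.WriteLine(SelectCap(false, false, "Billing/PostInvoice")); // 8192
Console.WriteLine(SelectCap(false, true, "Unknown/Target"));       // 16384
```

If error rows should instead keep the larger of the two caps, only the final `return` expression would change, which is why the semantics are worth settling before implementing.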

**Resolution**

Deferred 2026-06-20: whether to implement per-target `CapBytes` in `SelectCap` or delete the unused property + doc example is a design-owner decision (was it intentionally dropped in the redactor re-architecture, or lost?). Recorded; no change this pass.

### AuditLog-013 — Purge config section + key drift: documented `AuditLogPurge`/`ChannelPurgeBatchSize` does not bind

| | |
|--|--|
| Severity | Medium |
| Category | Design-document adherence |
| Status | Resolved |
| Location | `src/ZB.MOM.WW.ScadaBridge.AuditLog/ServiceCollectionExtensions.cs:56,370-371`, `src/ZB.MOM.WW.ScadaBridge.AuditLog/Central/AuditLogPurgeOptions.cs:40,47` |

**Description**

Component-AuditLog.md §"Retention & Purge" / §"Configuration" documents the purge tuning as a
top-level section named `AuditLogPurge` with keys `IntervalHours` and `ChannelPurgeBatchSize`:

```jsonc
"AuditLogPurge": { "IntervalHours": 24, "ChannelPurgeBatchSize": 5000 }
```

The code does not match on either axis. (1) `ServiceCollectionExtensions.PurgeSectionName` is
`"AuditLog:Purge"` — a nested subsection, not the documented top-level `AuditLogPurge`. (2) The
batch-size property is `AuditLogPurgeOptions.ChannelPurgeBatchSizeConfigured`, so its bind key is
`ChannelPurgeBatchSizeConfigured`, not the documented `ChannelPurgeBatchSize` (which is a
computed read-only property the binder ignores). There is no `[ConfigurationKeyName]` attribute
to reconcile the names. The net effect: an operator who follows the design doc and writes
`"AuditLogPurge": { "ChannelPurgeBatchSize": 1000 }` has BOTH values silently ignored — the
wrong section path means `IntervalHours` falls back to the 24 h default, and the property-name
mismatch means the channel batch size falls back to 5000. The actor still functions on defaults,
so nothing crashes, but the documented operator-facing tuning surface is inert. No test binds
`AuditLogPurgeOptions` from an `IConfiguration` shaped like the doc, so the drift is unguarded.
(The §407 doc text says `AuditLogPurge:ChannelPurgeBatchSize`, agreeing with the JSON block and
disagreeing with the code on both section and key.)

**Recommendation**

Pick one canonical shape and make code, doc, and a binding test agree. Recommended: keep the
nested `AuditLog:Purge` section (consistent with `AuditLog:SiteWriter` / `AuditLog:SiteTelemetry`
/ `AuditLog:Reconciliation`), but then fix the doc's JSON block + §407 to read
`"AuditLog": { "Purge": { "IntervalHours": …, "ChannelPurgeBatchSize": … } }`, AND add
`[ConfigurationKeyName("ChannelPurgeBatchSize")]` to `ChannelPurgeBatchSizeConfigured` (or rename
it and move the clamp into the `Interval`-style computed getter pattern) so the documented key
binds. Add an `AuditLogOptionsBindingTests`-style test that binds the purge section from an
in-memory config and asserts both `IntervalHours` and the channel batch size take effect.
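
The binding behaviour the recommendation relies on can be sketched in isolation. The snippet below is a hedged illustration rather than the module's test: `PurgeOptionsSketch` is a hypothetical stand-in for `AuditLogPurgeOptions` reduced to the two keys under discussion, the values `12` and `1000` are arbitrary, and it assumes the `Microsoft.Extensions.Configuration` and `Microsoft.Extensions.Configuration.Binder` packages are referenced (.NET 6 or later for `ConfigurationKeyNameAttribute`).

```csharp
using System;
using System.Collections.Generic;
using Microsoft.Extensions.Configuration;

// In-memory config shaped like the nested section the recommendation proposes.
var config = new ConfigurationBuilder()
    .AddInMemoryCollection(new Dictionary<string, string?>
    {
        ["AuditLog:Purge:IntervalHours"] = "12",
        ["AuditLog:Purge:ChannelPurgeBatchSize"] = "1000",
    })
    .Build();

var options = config.GetSection("AuditLog:Purge").Get<PurgeOptionsSketch>()
              ?? new PurgeOptionsSketch();

Console.WriteLine($"{options.IntervalHours} {options.ChannelPurgeBatchSizeConfigured}");

// Hypothetical stand-in for AuditLogPurgeOptions; defaults follow the finding (24 h, 5000).
public sealed class PurgeOptionsSketch
{
    public int IntervalHours { get; set; } = 24;

    // Binds from the documented key instead of the property's own name.
    [ConfigurationKeyName("ChannelPurgeBatchSize")]
    public int ChannelPurgeBatchSizeConfigured { get; set; } = 5000;
}
```

A test along these lines fails as soon as the section path or the key name drifts from the documented shape, which is the guard the finding says was missing.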

**Resolution**

Resolved 2026-06-20 (commit `fd618cf1`): added `[ConfigurationKeyName("ChannelPurgeBatchSize")]` so the documented key binds, reconciled the design doc to the real nested `AuditLog:Purge` section, and added a binding test proving the purge options bind from the documented keys.

### AuditLog-014 — `AuditLogIngestActor.OnIngestAsync` does not guard scope/repository resolution, so a transient DI fault restarts the singleton and drops the reply

| | |
|--|--|
| Severity | Medium |
| Category | Error handling & resilience |
| Status | Resolved |
| Location | `src/ZB.MOM.WW.ScadaBridge.AuditLog/Central/AuditLogIngestActor.cs:135-148` |

**Description**

`OnIngestAsync` opens `await using var scope = _serviceProvider!.CreateAsyncScope();` and
`scope.ServiceProvider.GetRequiredService<IAuditLogRepository>()` with NO surrounding try/catch.
The per-row try/catch lives inside `IngestWithRepositoryAsync`, so it only protects the insert
loop — a throw from scope creation or from `GetRequiredService` (a `DbContext` factory throw on
transient SQL-connection exhaustion, a pooled-context init fault, a DI resolution race during
host churn) propagates straight out of the message handler. When a `ReceiveActor`'s async handler
faults, Akka applies the parent's supervision and restarts the singleton; the captured
`replyTo.Tell(...)` at line 147 never runs, so the site's `Ask<IngestAuditEventsReply>` times out
and (correctly) retries — but the singleton has been bounced over a transient fault. The class
XML even claims "that per-row catch is what keeps this actor alive across handler throws" — which
is true for insert throws but false for the scope-resolution throw this handler leaves uncaught.
The sibling `OnCachedTelemetryAsync` (line 216) wraps its entire scope-creation-plus-loop body in
a try/catch and logs + replies-with-partial precisely to avoid this; `OnIngestAsync` is the
asymmetric one. `AuditLogPurgeActor.OnTickAsync` and `SiteAuditReconciliationActor.OnTickAsync`
both guard their `GetRequiredService` with try/catch too — `OnIngestAsync` is the outlier.

**Recommendation**

Wrap the scope-creation + repository-resolution in `OnIngestAsync` in a try/catch that mirrors
`OnCachedTelemetryAsync`: on failure, log the resolution fault and `replyTo.Tell(new
IngestAuditEventsReply(accepted))` (accepted will be whatever was processed before the throw,
typically empty) so the site keeps its rows `Pending` and retries on the next drain without
bouncing the singleton. Optionally bump the `ICentralAuditWriteFailureCounter` on that path too,
so a sustained DI/connection fault is visible on the dashboard.
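
The shape of the recommended guard can be shown without Akka or a DI container. In this sketch every name is a hypothetical stand-in: `resolveInsert` models `CreateAsyncScope` plus `GetRequiredService` (and may throw), the delegate it returns models the per-row repository insert, `reply` models `replyTo.Tell(new IngestAuditEventsReply(accepted))`, and `logFault` models logging plus the optional failure counter.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

async Task HandleIngestAsync(
    IReadOnlyList<Guid> eventIds,
    Func<Func<Guid, Task>> resolveInsert,
    Action<IReadOnlyList<Guid>> reply,
    Action<Exception> logFault)
{
    var accepted = new List<Guid>();
    try
    {
        Func<Guid, Task> insert = resolveInsert(); // a transient DI/DbContext fault surfaces here
        foreach (Guid id in eventIds)
        {
            try
            {
                await insert(id);
                accepted.Add(id);
            }
            catch (Exception ex)
            {
                logFault(ex); // per-row catch: one bad row does not abort the batch
            }
        }
    }
    catch (Exception ex)
    {
        logFault(ex); // resolution fault: log it instead of letting the handler throw
    }

    reply(accepted); // always reply, so the caller is not left waiting for a timeout
}

// Usage: a resolver that always throws still produces an (empty) reply.
IReadOnlyList<Guid>? replied = null;
int faults = 0;
await HandleIngestAsync(
    new[] { Guid.NewGuid(), Guid.NewGuid() },
    () => throw new InvalidOperationException("simulated pool exhaustion"),
    ids => replied = ids,
    _ => faults++);

Console.WriteLine($"replied={replied?.Count}, faults={faults}"); // replied=0, faults=1
```

Because the reply carries only the rows actually accepted, an empty reply leaves the caller's rows pending for the next attempt, which matches the retry behaviour the recommendation describes.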

**Resolution**

Resolved 2026-06-20 (commit `fd618cf1`): `OnIngestAsync` now wraps the scope-creation + repo resolution + ingest in a try/catch that logs, best-effort bumps the write-failure counter, and still `replyTo.Tell(IngestAuditEventsReply(accepted))` — a transient DI/DbContext fault no longer bounces the central ingest singleton or drops the reply (mirrors `OnCachedTelemetryAsync`).

### AuditLog-015 — `SqliteAuditWriter.WriteAsync` ignores the cancellation token on the fast (`TryWrite`) path

| | |
|--|--|
| Severity | Low |
| Category | Concurrency & thread safety |
| Status | Deferred |
| Location | `src/ZB.MOM.WW.ScadaBridge.AuditLog/Site/SqliteAuditWriter.cs:254-275` |

**Description**

`WriteAsync(AuditEvent, CancellationToken ct)` accepts a `ct` but, on the common fast path where
`_writeQueue.Writer.TryWrite(pending)` succeeds, it returns `pending.Completion.Task` without ever
observing `ct`. The token is only honoured on the slow path (`WriteSlowPathAsync`, which passes it
to `WriteAsync`). So a caller that passes an already-cancelled (or soon-cancelled) token still
enqueues the row and then awaits an uncancellable completion. In practice the hot-path callers
(`FallbackAuditWriter`, `CachedCallTelemetryForwarder`) thread a real `ct` through, and the audit
contract is best-effort durable-in-microseconds, so the impact is minor — the row simply gets
written instead of cancelled. But the signature advertises cancellation it does not deliver on the
dominant path, which is a latent surprise for any future caller that relies on it (e.g. to bound a
shutdown drain). The completion `Task` returned is not linked to `ct`, so an awaiter cannot abandon
the wait either.

**Recommendation**

Either (a) honour the token on the fast path — `ct.ThrowIfCancellationRequested()` before
`TryWrite`, and return a `ct`-linked wait (e.g. `pending.Completion.Task.WaitAsync(ct)`) so an
awaiter can abandon a stuck completion; or (b) if cancellation is intentionally a no-op for this
best-effort hot path, drop the `ct` parameter (or document on the method that it is observed only
on the back-pressured slow path) so the surface stops advertising behaviour it does not provide.
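
Option (a) can be sketched with a plain bounded channel. This is an assumption-laden illustration, not `SqliteAuditWriter` itself: the queue item is reduced to a bare `TaskCompletionSource`, there is no drain loop (so the completion never fires), and the local `WriteAsync` is a hypothetical stand-in. `Task.WaitAsync(CancellationToken)` requires .NET 6 or later.

```csharp
using System;
using System.Threading;
using System.Threading.Channels;
using System.Threading.Tasks;

// Stand-in for the writer's queue: a background drain loop would normally
// complete each source once the row is durable.
var queue = Channel.CreateBounded<TaskCompletionSource>(capacity: 16);

async Task WriteAsync(ChannelWriter<TaskCompletionSource> writer, CancellationToken ct)
{
    ct.ThrowIfCancellationRequested(); // do not enqueue for an already-cancelled caller

    var pending = new TaskCompletionSource(TaskCreationOptions.RunContinuationsAsynchronously);
    if (!writer.TryWrite(pending))
    {
        await writer.WriteAsync(pending, ct); // back-pressured slow path
    }

    // The row stays queued either way; WaitAsync only lets the awaiter abandon the wait.
    await pending.Task.WaitAsync(ct);
}

// Usage 1: an already-cancelled token never reaches the queue.
using var cancelled = new CancellationTokenSource();
cancelled.Cancel();
try { await WriteAsync(queue.Writer, cancelled.Token); }
catch (OperationCanceledException) { Console.WriteLine("rejected before enqueue"); }
Console.WriteLine(queue.Reader.Count); // 0

// Usage 2: the row is enqueued, but the caller stops waiting when the token fires.
using var slow = new CancellationTokenSource(TimeSpan.FromMilliseconds(50));
try { await WriteAsync(queue.Writer, slow.Token); }
catch (OperationCanceledException) { Console.WriteLine("wait abandoned, row still queued"); }
Console.WriteLine(queue.Reader.Count); // 1
```

The second case shows the trade-off behind the deferral: cancelling the wait does not retract the enqueued row, so the token bounds how long a caller blocks rather than whether the write happens.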

**Resolution**

Deferred 2026-06-20: whether cancellation should be honoured on the best-effort `TryWrite` hot path (vs. intentionally a no-op) is a design decision. Recorded; no change this pass.

### AuditLog-016 — `AuditLogIngestActor` class XML has a duplicated phrase and stale "Bundle"/`IngestedAtUtc`-column wording

| | |
|--|--|
| Severity | Low |
| Category | Documentation & comments |
| Status | Resolved |
| Location | `src/ZB.MOM.WW.ScadaBridge.AuditLog/Central/AuditLogIngestActor.cs:13-46` |

**Description**

The `AuditLogIngestActor` class summary reads "Each row is stamped with the central-side **the
central-side IngestedAtUtc**" — the phrase "the central-side" is duplicated, a copy/paste slip
that survived the rename. The same summary describes `IngestedAtUtc` as if it were a promoted
column, but post-C3 it is a `DetailsJson` field stamped via `AuditRowProjection.WithIngestedAtUtc`
(as the handler bodies and their own inline comments correctly note) — the class-level prose is
now stale relative to the canonical-record shim. The remarks also still narrate the design in
"Bundle A / Bundle D / Bundle E" milestone terms (and the two-constructor rationale references
"Bundle D's tests" / "Bundle E's host wiring"), which no longer map to any current artifact and
read as archaeology to a new maintainer. None of this is wrong behaviourally, but the duplicated
phrase plus the column-vs-DetailsJson mismatch is exactly the kind of doc rot that misleads.

**Recommendation**

Fix the duplicated "the central-side the central-side" to a single phrase; reword the
`IngestedAtUtc` sentence to say it is stamped into `DetailsJson` (matching the handler comments);
and either drop the "Bundle X" tags in favour of the feature/milestone names used elsewhere in
the module, or add a one-line legend. A quick sweep of the other central actors' class XML for
the same "Bundle" archaeology would be worthwhile while in there.

**Resolution**

Resolved 2026-06-20 (commit `fd618cf1`): class XML doc deduped ('the central-side' duplicate removed), `IngestedAtUtc` reworded to 'stamped into DetailsJson (no promoted column)', and the stale Bundle milestone tags replaced with role-neutral wording.

### AuditLog-001 — Combined-telemetry transport is plumbed end-to-end but never invoked in production

| | |