docs(code-review): full review at 4307c381 — 18 modules, 67 findings recorded + remediation tracked

Full per-module re-review of the 16 stale modules (last seen 1eb6e97 / 2026-05-28)
plus first-ever reviews of KpiHistory (#26) and ScriptAnalysis (#25), at HEAD 4307c381.

67 new findings (0 Critical, 6 High, 27 Medium, 34 Low). Remediation in commit
fd618cf1 closed 5 of the 6 Highs and ~33 Medium/Low; the rest are Deferred/Won't Fix
with rationale. Remaining pending (4) are all InboundAPI's Database-helper findings
(IA-026 High .. IA-029), left to the active feat/ipsen-movein effort per owner decision.

Highlights: caught a central-only-delivery security drift (SMTP creds broadcast to
sites — DM-025/SR-031), a never-committed 'Resolved' fix (SiteEventLogging-016 → -024),
an unguarded KPI recorder tick (KH-001), a trust-analyzer fallback weakening (SA-001),
and a native-alarm subscribe-path leak (DCL-023). ScriptAnalysis verdict: trust boundary
is semantically sound (symbol-based) in the production cluster config.

README regenerated; regen-readme.py --check passes (4 pending / 567 total).
Joseph Doherty
2026-06-20 18:02:32 -04:00
parent fd618cf1dc
commit d39089f4ed
19 changed files with 4031 additions and 69 deletions
+295 -3
@@ -5,10 +5,10 @@
| Module | `src/ZB.MOM.WW.ScadaBridge.SiteEventLogging` |
| Design doc | `docs/requirements/Component-SiteEventLogging.md` |
| Status | Reviewed |
| Last reviewed | 2026-05-28 |
| Last reviewed | 2026-06-20 |
| Reviewer | claude-agent |
| Commit reviewed | `1eb6e97` |
| Open findings | 2 |
| Commit reviewed | `4307c381` |
| Open findings | 0 |
## Summary
@@ -101,6 +101,36 @@ _Re-review (2026-05-28, `1eb6e97`):_
| 9 | Testing coverage | ☑ | Non-volatile `stop` flag in `PurgeByStorageCap_ConcurrentWritesDoNotCorruptConnection` (-023). No tests for `PageSize` bounds, `From`/`To` timezone handling, or unobserved `FailedWriteCount`. |
| 10 | Documentation & comments | ☑ | `FailedWriteCount` XML doc claims "Health Monitoring can poll" but nothing does (-018). Severity / event-type docs enumerate values that are not enforced (-020). |
#### Re-review 2026-06-20 (commit `4307c381`) — full review
Re-reviewed the module at commit `4307c381`. All twenty-three prior findings are still
recorded; twenty-two of their resolutions hold up under inspection — **but one does not**.
SiteEventLogging-016 (**High**, the `From`/`To` UTC-normalisation defect) is marked
`Resolved` and its Resolution text claims `EventLogQueryService.ExecuteQuery` now calls
`.ToUniversalTime()` before `ToString("o")` and that a regression test
`Query_FiltersByTimeRange_HandlesNonUtcOffset` was added — **neither is true at this
commit**. `EventLogQueryService.cs:77,83` still stringifies `request.From`/`request.To`
verbatim with `.ToString("o")` and no UTC normalisation, and a repo-wide search finds no
test by that name. The claimed -016 fix was never committed; the High-severity defect is
live. This is re-opened as SiteEventLogging-024 (cross-referencing -016) so the audit trail
shows the resolution was asserted but never landed. Four new findings were recorded: -024
(the never-landed -016 fix, High), -025 (synchronous severity validation faults
fire-and-forget callers, Medium), -026 (purge active-node gate diverges from the query
singleton's placement, Medium), and -027 (the time-range test masks -024, Low).
| # | Category | Examined | Notes |
|---|----------|----------|-------|
| 1 | Correctness & logic bugs | ☑ | **-024**: `From`/`To` filters still compare non-UTC-normalised ISO 8601 strings against UTC-stored timestamps — the SiteEventLogging-016 fix was never committed (code at `EventLogQueryService.cs:77,83` is unchanged; the claimed regression test does not exist). |
| 2 | Akka.NET conventions | ☑ | `EventLogHandlerActor` remains a simple `Receive`/`Tell` bridge; no new findings. |
| 3 | Concurrency & thread safety | ☑ | Lock-guarded `WithConnection` pattern is correct; no new findings. |
| 4 | Error handling & resilience | ☑ | **-025**: `LogEventAsync` validates severity/args by throwing synchronously, bypassing the documented "faults the returned Task" contract for fire-and-forget callers. |
| 5 | Security | ☑ | Queries fully parameterised; `PageSize` now clamped (-017). No new findings. |
| 6 | Performance & resource management | ☑ | Bounded write queue (-015) and `PageSize` clamp (-017) in place; no new findings. |
| 7 | Design-document adherence | ☑ | **-026**: the purge active-node gate (`SelfIsPrimary` = cluster leader) can diverge from the query singleton's placement (oldest member of role), leaving the queried node's DB unpurged / over-cap. |
| 8 | Code organization & conventions | ☑ | Concrete-recorder DI wiring correct; no new findings. |
| 9 | Testing coverage | ☑ | **-027**: `Query_FiltersByTimeRange` asserts count only with UTC-only inputs, masking -024; no non-UTC-offset regression test exists. |
| 10 | Documentation & comments | ☑ | No new documentation findings beyond those above. |
## Findings
### SiteEventLogging-001 — `PRAGMA incremental_vacuum` is a no-op; storage cap cannot reclaim space
@@ -1144,3 +1174,265 @@ Use a `CancellationTokenSource` (`while (!cts.IsCancellationRequested)`), or cha
`stop` to a `volatile bool`, or use `Interlocked.Exchange` / `Volatile.Read`.
`CancellationTokenSource` is the canonical .NET pattern and also lets the test
cooperate with xUnit's `Task.WhenAll` timeout.
### SiteEventLogging-024 — `From`/`To` filters still compare non-normalised ISO 8601 strings against UTC-stored timestamps (the SiteEventLogging-016 fix was never committed)
| | |
|--|--|
| Severity | High |
| Category | Correctness & logic bugs |
| Status | Resolved |
| Location | `src/ZB.MOM.WW.ScadaBridge.SiteEventLogging/EventLogQueryService.cs:77,83` |
**Description**
This is a re-open of SiteEventLogging-016. That finding was marked **Resolved**
(2026-05-28) with a Resolution stating that `EventLogQueryService.ExecuteQuery` now calls
`.ToUniversalTime()` on `request.From`/`request.To` before `ToString("o")`, that
`EventLogPurgeService.PurgeByRetention` was made defensive with an explicit
`.ToUniversalTime()`, and that a regression test
`Query_FiltersByTimeRange_HandlesNonUtcOffset` was added. **None of that is present at
commit `4307c381`.** The code at `EventLogQueryService.cs` is byte-for-byte the original
defective form:
```
if (request.From.HasValue)
{
    whereClauses.Add("timestamp >= $from");
    parameters.Add(new SqliteParameter("$from", request.From.Value.ToString("o"))); // line 77
}
if (request.To.HasValue)
{
    whereClauses.Add("timestamp <= $to");
    parameters.Add(new SqliteParameter("$to", request.To.Value.ToString("o"))); // line 83
}
```
There is no `.ToUniversalTime()` (or `.UtcDateTime`) on either bound, and a repo-wide
search for `Query_FiltersByTimeRange_HandlesNonUtcOffset` returns zero matches — the
claimed regression test does not exist. The -016 resolution was asserted but never landed
in source, so the original High-severity defect is **live**.
The mechanism is exactly as -016 described. Event rows are persisted with
`timestamp = DateTimeOffset.UtcNow.ToString("o")` (`SiteEventLogger.cs:202`), which always
emits the round-trip ISO 8601 form ending in the literal offset `+00:00`
(e.g. `2026-06-20T12:34:56.7890123+00:00`). `request.From`/`request.To` are
`DateTimeOffset?`, and `ToString("o")` preserves whatever offset the caller passed. A
central client in, say, `UTC+05:00` that filters with `DateTimeOffset.Now` produces
`"2026-06-20T17:34:56.0000000+05:00"`, which under SQLite's default `BINARY` collation
sorts lexicographically *greater* than the equivalent UTC instant string
`"2026-06-20T12:34:56.0000000+00:00"`. The `timestamp >= $from` / `timestamp <= $to`
comparison is then a byte-by-byte string compare, so the query silently includes events
from the wrong hour or excludes events that genuinely fall in the window. The design
states "All timestamps are UTC throughout the system", but the central-`DateTimeOffset`
→ SQLite boundary does not enforce it. A central UI rendered in a non-UTC timezone is the
most likely trigger, and it corrupts precisely the "show me what happened around the
failover" time-range query that operators most often run. The retention purge
(`DateTimeOffset.UtcNow.AddDays(-N).ToString("o")`) is UTC by construction and remains
safe; only the central query path is vulnerable.
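The ordering hazard described above can be reproduced without a database. The following is a minimal sketch, assuming that an ordinal string comparison is an acceptable stand-in for SQLite's `BINARY` collation on these ASCII-only strings; `OffsetCompareDemo` and its members are hypothetical names used only for illustration and do not exist in the module.

```csharp
using System;

public static class OffsetCompareDemo
{
    // One instant, written with a +05:00 offset (17:34:56 local is 12:34:56 UTC).
    public static readonly DateTimeOffset Local =
        new DateTimeOffset(2026, 6, 20, 17, 34, 56, TimeSpan.FromHours(5));

    // Sign of an ordinal comparison of the two round-trip ("o") strings.
    // Assumption: ordinal comparison stands in for SQLite's BINARY collation here.
    public static int CompareAsText(DateTimeOffset bound, DateTimeOffset stored) =>
        Math.Sign(string.CompareOrdinal(bound.ToString("o"), stored.ToString("o")));
}
```

With `var utc = OffsetCompareDemo.Local.ToUniversalTime();`, the two values are equal as instants (`OffsetCompareDemo.Local == utc` is true), yet `CompareAsText(OffsetCompareDemo.Local, utc)` is `1` because `"...T17:34:56..."` sorts after `"...T12:34:56..."`. Normalising the bound first, `CompareAsText(utc, utc)`, gives `0`.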
**Recommendation**
Apply the one-line fix on each bound that -016 specified but that was never committed:
```
parameters.Add(new SqliteParameter("$from", request.From.Value.ToUniversalTime().ToString("o")));
parameters.Add(new SqliteParameter("$to", request.To.Value.ToUniversalTime().ToString("o")));
```
so the produced offset is always `+00:00` and the comparison is lexicographically sound
against the UTC-stored strings. Add the regression test that -016 claimed but never
delivered (see SiteEventLogging-027): construct a `From`/`To` carrying a non-zero offset
(e.g. `+05:00`) and assert the matching UTC-stored events are returned and out-of-range
ones excluded, asserting on returned row identities rather than just count. Optionally
store `timestamp` as Unix-epoch `INTEGER` to eliminate the lexicographic-comparison hazard
structurally. When fixing, also reconcile SiteEventLogging-016's Resolution text with
reality (its commit reference was `<pending>` and the fix never merged).
**Resolution**
Resolved 2026-06-20 (commit `fd618cf1`): `EventLogQueryService` now calls `.ToUniversalTime()` on the `From`/`To` `DateTimeOffset` bounds before `.ToString("o")`, so the comparison strings are UTC and match the UTC-stored values for any non-UTC client offset. This is the fix that the closed -016 claimed but never committed. Regression test added (verified failing pre-fix).
### SiteEventLogging-025 — `LogEventAsync` severity/argument validation throws synchronously, bypassing the "faults the returned Task" contract for fire-and-forget callers
| | |
|--|--|
| Severity | Medium |
| Category | Error handling & resilience |
| Status | Deferred |
| Location | `src/ZB.MOM.WW.ScadaBridge.SiteEventLogging/SiteEventLogger.cs:187-199` |
**Description**
`LogEventAsync` is documented as a fire-and-forget enqueue whose returned `Task`
"completes once the event is durably persisted and faults if the write fails"
(`SiteEventLogger.cs:24-29`), and the disposed-drop paths were deliberately reworked under
SiteEventLogging-012 to *fault the Task* rather than throw. But the argument and severity
validation at the top of the method throws **synchronously** on the caller's thread before
any `Task` is returned:
```
ArgumentException.ThrowIfNullOrWhiteSpace(eventType);
ArgumentException.ThrowIfNullOrWhiteSpace(severity);
ArgumentException.ThrowIfNullOrWhiteSpace(source);
ArgumentException.ThrowIfNullOrWhiteSpace(message);
if (!AllowedSeverities.Contains(severity)) // SiteEventLogging-020 closed set
{
    throw new ArgumentException(
        $"Severity '{severity}' is not one of the allowed values: Info, Warning, Error.",
        nameof(severity));
}
```
Every recording caller invokes the method fire-and-forget, as `_ = _eventLogger.LogEventAsync(...)`
(no `await`, return value discarded), so a thrown `ArgumentException` does **not** flow into
the discarded `Task`; it propagates straight up the calling actor's message-handling stack.
For the recorder hot-path callers (the computed `AlarmActor`, the `DataConnectionActor`,
the `ScriptActor`/`ScriptExecutionActor`) an unhandled synchronous throw would fault the
actor and trip its supervision strategy, when the documented and intended behaviour for a
best-effort audit write is that a bad event is dropped/faulted, never that it crashes the
subsystem doing the logging.
This is **currently latent**: every production call site passes a hard-coded canonical
`severity` literal (`"Info"`/`"Warning"`/`"Error"`) and non-empty `eventType`/`source`/
`message`, so the throw path is never reached today. The exposure is a future caller that
passes a dynamic, computed, or typo'd severity (e.g. `"error"`, `"ERR"`, a value derived
from an alarm payload) — the SiteEventLogging-020 closed-set check, added defensively, then
becomes a crash vector for the very actor it is meant to protect. The mismatch is that
-012 chose Task-faulting for one drop reason (disposal) while validation kept synchronous
throwing for another (bad input), so the method's failure contract is now inconsistent.
**Recommendation**
Make the failure mode uniform with the rest of the method: return a faulted `Task` for
validation failures instead of throwing synchronously — e.g.
`return Task.FromException(new ArgumentException(...))` for the severity check (and,
consistently, for the null/whitespace guards) — so a fire-and-forget caller is never
crashed by a bad audit-event payload and an `await`-ing caller still observes the failure.
Alternatively, accept a `severity` enum on `ISiteEventLogger.LogEventAsync` and convert to
the canonical string at the boundary, eliminating the runtime-validation throw entirely.
The exact fix shape is a judgment call — faulted-Task is the minimal-diff change that
matches the documented contract; the enum is the stronger design but a wider
interface/caller change. Whichever is chosen, update the XML doc to state the failure
semantics for invalid input.
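The faulted-Task option might look roughly like the sketch below. It is not the module's code: `SeverityValidationSketch` is a hypothetical stand-in with a reduced parameter list, the case-sensitive allowed set is an assumption inferred from `"error"` being listed as an invalid value, and the real enqueue step is replaced by a completed `Task`.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

public static class SeverityValidationSketch
{
    // Assumption: the closed set is matched case-sensitively.
    private static readonly HashSet<string> AllowedSeverities =
        new(StringComparer.Ordinal) { "Info", "Warning", "Error" };

    // Hypothetical stand-in for LogEventAsync. The method is not declared async,
    // so validation failures are converted into a faulted Task explicitly.
    public static Task LogEventAsync(string severity, string message)
    {
        if (string.IsNullOrWhiteSpace(severity) || !AllowedSeverities.Contains(severity))
        {
            return Task.FromException(new ArgumentException(
                $"Severity '{severity}' is not one of the allowed values: Info, Warning, Error.",
                nameof(severity)));
        }

        if (string.IsNullOrWhiteSpace(message))
        {
            return Task.FromException(
                new ArgumentException("Message must be non-empty.", nameof(message)));
        }

        // The real method would enqueue the event here; the sketch just completes.
        return Task.CompletedTask;
    }
}
```

A caller that writes `_ = SeverityValidationSketch.LogEventAsync("error", "m");` gets a faulted `Task` back rather than an exception on its own stack, while an `await`-ing caller still sees the `ArgumentException`. In current .NET an unobserved faulted `Task` does not terminate the process by default, which is what makes this shape safe for discarded calls.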
**Resolution**
Deferred 2026-06-20: the fix shape (return a faulted Task vs. validate against a severity enum) is a design decision, and the defect is latent (every current caller passes a canonical severity literal). Recorded; no change this pass.
### SiteEventLogging-026 — Purge active-node gate (cluster leader) can diverge from the query singleton's placement (oldest member of role)
| | |
|--|--|
| Severity | Medium |
| Category | Design-document adherence |
| Status | Deferred |
| Location | `src/ZB.MOM.WW.ScadaBridge.Host/SiteServiceRegistration.cs:116-120`, `src/ZB.MOM.WW.ScadaBridge.Host/Health/AkkaClusterNodeProvider.cs:29-39`, `src/ZB.MOM.WW.ScadaBridge.Host/Actors/AkkaHostedService.cs:889-895` |
**Description**
SiteEventLogging-019 closed the "purge runs on every node" finding by gating
`EventLogPurgeService` on an optional `SiteEventLogActiveNodeCheck`. The Host wires that
check to `IClusterNodeProvider.SelfIsPrimary` (`SiteServiceRegistration.cs:116-120`), which
returns true only when "this node is `Up` AND is the **cluster leader**"
(`AkkaClusterNodeProvider.cs:29-39`: `cluster.State.Leader.Equals(cluster.SelfAddress)`).
The query path, however, is served by the `event-log-handler` **cluster singleton**
(`AkkaHostedService.cs:889-895`), which Akka.NET pins to the **oldest member of the role**,
not the cluster leader. These are two different election concepts:
- The Akka **cluster leader** is the lowest-address member in the `Up`/`Leaving` state set
for the *whole cluster* (role-agnostic) — it is what `SelfIsPrimary` tests.
- A **cluster singleton** is hosted on the *oldest* member of its configured role
(`siteRole`) — it is what owns the event-log query DB.
In a steady-state 2-node site cluster these usually coincide, so the gate works in
practice. But they are not guaranteed equal — most plausibly during membership churn /
after a failover, when the oldest-member computation and the leader election can settle on
different nodes for a window. When they diverge, the query singleton reads node X's
SQLite DB while the purge runs on node Y. The result is that the node whose DB is actually
queried (the populated, active one) is left **unpurged**: its 30-day retention sweep and
1 GB cap-purge do not run, so the queried database can drift over the documented 1 GB cap
and retain rows past the 30-day window, violating exactly the invariant that the storage
section of Component-SiteEventLogging requires. The SiteEventLogging-004 re-triage note already
established the correct co-location principle: the *query* singleton and the
*deployment-manager* singleton share `siteRole` and so co-locate on the oldest member; the
purge gate keyed on a *different* signal (`SelfIsPrimary`/leader) silently breaks that
co-location for the purge.
**Recommendation**
Re-gate the purge on the **same** ownership signal as the query singleton so the two never
diverge — i.e. gate on "this node hosts the `event-log-handler` singleton" (host the purge
inside that same cluster singleton, or test oldest-member-of-`siteRole` rather than cluster
leader). Alternatively, drop the purge gate entirely and let every node purge its own local
DB: the SiteEventLogging-019 and -004 analysis already showed standby purge is harmless
(the standby receives no writes), so running it on both nodes guarantees the queried DB is
always purged at the cost of a no-op sweep on the standby. The direction is a **judgment
call** — re-gating on singleton ownership keeps the "active node only" intent but adds
coupling to the singleton's lifecycle; dropping the gate is simpler and provably correct
for the cap/retention invariant but reverts the -019 "active node only" optimisation.
Whichever is chosen, the gate (`SelfIsPrimary` = leader) and the singleton (oldest member)
must not be left keyed on two different elections.
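The difference between the two elections can be illustrated with a toy model. The types below are deliberately not Akka.NET types: `Node`, `ElectionModel`, and the `UpNumber` field are hypothetical, "leader" is simplified to the lowest address among the supplied members, and "singleton host" is simplified to the earliest joiner within a role.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Toy model only: these are not Akka.NET types.
// UpNumber is a hypothetical join-order counter (lower means joined earlier).
public sealed record Node(string Address, int UpNumber, string Role);

public static class ElectionModel
{
    // "Leader" modelled as the lowest address among all supplied members (role-agnostic).
    public static Node Leader(IEnumerable<Node> upMembers) =>
        upMembers.OrderBy(n => n.Address, StringComparer.Ordinal).First();

    // "Singleton host" modelled as the earliest joiner among members carrying the role.
    public static Node OldestOfRole(IEnumerable<Node> upMembers, string role) =>
        upMembers.Where(n => n.Role == role).OrderBy(n => n.UpNumber).First();
}
```

Given `new Node("node-a", 2, "site")` and `new Node("node-b", 1, "site")`, `Leader` picks `node-a` while `OldestOfRole(..., "site")` picks `node-b`. In this model the two picks differ whenever the lowest-address node is not also the earliest joiner, for example after it restarts and rejoins; a gate keyed on the first signal would then purge a different node from the one hosting the singleton.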
**Resolution**
Deferred 2026-06-20: the direction (re-gate purge on singleton ownership vs. drop the gate per the -019 analysis) is a design-owner decision. Divergence between the Akka leader and the oldest-member singleton is a churn-window condition. Recorded; no change this pass.
### SiteEventLogging-027 — Time-range test asserts count only with UTC-only inputs, masking SiteEventLogging-024
| | |
|--|--|
| Severity | Low |
| Category | Testing coverage |
| Status | Resolved |
| Location | `tests/ZB.MOM.WW.ScadaBridge.SiteEventLogging.Tests/EventLogQueryServiceTests.cs:197-214` |
**Description**
`Query_FiltersByTimeRange` is the only test covering the `From`/`To` time-range filter, and
it cannot catch SiteEventLogging-024 by construction:
```
var now = DateTimeOffset.UtcNow;                 // UTC-only input
InsertEventAt(now.AddHours(-2),    "script", "Info", null, "S1", "Old event");
InsertEventAt(now.AddMinutes(-30), "script", "Info", null, "S2", "Recent event");
InsertEventAt(now,                 "script", "Info", null, "S3", "Now event");
var response = _queryService.ExecuteQuery(MakeRequest(
    from: now.AddHours(-1),
    to: now.AddMinutes(1)));
Assert.True(response.Success);
Assert.Equal(2, response.Entries.Count);         // count only
```
Both shortcomings hide the live -024 defect:
1. **UTC-only inputs.** `now` is `DateTimeOffset.UtcNow`, so `from`/`to` already carry a
`+00:00` offset. `ToString("o")` then produces a correctly-comparable string and the
missing `.ToUniversalTime()` is never exercised. The defect only manifests for a
`DateTimeOffset` with a *non-zero* offset, which this test never constructs.
2. **Count-only assertion.** It asserts `response.Entries.Count == 2` but never checks
*which* rows came back. A boundary off-by-an-hour bug that swapped one in-range row for
an out-of-range row (the -024 symptom) could still yield a count of 2 and pass.
The net effect is a green test suite that gives false confidence the time-range filter is
correct — and almost certainly contributed to SiteEventLogging-016 being marked Resolved
when its fix was never committed.
**Recommendation**
Add the non-UTC-offset regression test that SiteEventLogging-016 claimed but never
delivered (`Query_FiltersByTimeRange_HandlesNonUtcOffset`): construct `From`/`To` as a
`DateTimeOffset` with a non-zero offset (e.g. `+05:00`) bracketing UTC-stored events, and
assert on the **identities** of the returned rows (e.g. by `Source`/`Message` or `Id`), not
just `Count`, so a window shifted by the offset is detected. Strengthen the existing
`Query_FiltersByTimeRange` likewise to assert which rows are returned. The new
non-UTC-offset test should fail against the current code and pass once
SiteEventLogging-024 is fixed; the strengthened UTC-only test should pass in both cases.
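Why a count-only assertion is not enough can be shown with an in-memory model of the filter. This is a sketch under stated assumptions, not the repository's test: `TimeRangeFilterModel` is a hypothetical stand-in for the SQLite query, using ordinal string comparison of the round-trip strings, and the `normalise` flag toggles the `.ToUniversalTime()` call that SiteEventLogging-024 adds.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class TimeRangeFilterModel
{
    // In-memory stand-in for "timestamp >= $from AND timestamp <= $to" over text columns.
    // Rows are assumed to carry UTC timestamps, as the stored events do.
    public static List<string> Query(
        IEnumerable<(string Source, DateTimeOffset At)> rows,
        DateTimeOffset lower, DateTimeOffset upper, bool normalise)
    {
        string lo = (normalise ? lower.ToUniversalTime() : lower).ToString("o");
        string hi = (normalise ? upper.ToUniversalTime() : upper).ToString("o");

        return rows
            .Where(r => string.CompareOrdinal(r.At.ToString("o"), lo) >= 0
                     && string.CompareOrdinal(r.At.ToString("o"), hi) <= 0)
            .Select(r => r.Source)
            .ToList();
    }
}
```

Seed four UTC rows on 2026-06-20 at 11:30 (`S1`), 12:00 (`S2`), 16:30 (`S3`) and 17:00 (`S4`), then query the window 16:00+05:00 to 17:01+05:00, which is 11:00 to 12:01 UTC. Without normalisation the model returns `S3` and `S4`; with normalisation it returns `S1` and `S2`. Both results have a count of 2, so only an assertion on row identities distinguishes the correct answer from the shifted one.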
**Resolution**
Resolved 2026-06-20 (commit `fd618cf1`): added `Query_FiltersByTimeRange_HandlesNonUtcOffset` — seeds known UTC instants, queries with a +05:00 offset, and asserts the returned row identities (not just count). Verified failing pre-fix, passing post-fix.