docs(code-review): full review at 4307c381 — 18 modules, 67 findings recorded + remediation tracked

Full per-module re-review of the 16 stale modules (last seen 1eb6e97 / 2026-05-28) plus first-ever reviews of KpiHistory (#26) and ScriptAnalysis (#25), at HEAD 4307c381. 67 new findings (0 Critical, 6 High, 27 Medium, 34 Low). Remediation in commit fd618cf1 closed 5 of the 6 Highs and ~33 Medium/Low; the rest are Deferred/Won't Fix with rationale. Remaining pending (4) are all InboundAPI's Database-helper findings (IA-026 High .. IA-029), left to the active feat/ipsen-movein effort per owner decision. Highlights: caught a central-only-delivery security drift (SMTP creds broadcast to sites — DM-025/SR-031), a never-committed 'Resolved' fix (SiteEventLogging-016 → -024), an unguarded KPI recorder tick (KH-001), a trust-analyzer fallback weakening (SA-001), and a native-alarm subscribe-path leak (DCL-023). ScriptAnalysis verdict: trust boundary is semantically sound (symbol-based) in the production cluster config. README regenerated; regen-readme.py --check passes (4 pending / 567 total).
2026-06-20 18:02:32 -04:00
parent fd618cf1dc
commit d39089f4ed
19 changed files with 4031 additions and 69 deletions
@@ -5,10 +5,10 @@
 | Module | `src/ZB.MOM.WW.ScadaBridge.SiteEventLogging` |
 | Design doc | `docs/requirements/Component-SiteEventLogging.md` |
 | Status | Reviewed |
-| Last reviewed | 2026-05-28 |
+| Last reviewed | 2026-06-20 |
 | Reviewer | claude-agent |
-| Commit reviewed | `1eb6e97` |
-| Open findings | 2 |
+| Commit reviewed | `4307c381` |
+| Open findings | 0 |

 ## Summary

@@ -101,6 +101,36 @@ _Re-review (2026-05-28, `1eb6e97`):_
 | 9 | Testing coverage | ☑ | Non-volatile `stop` flag in `PurgeByStorageCap_ConcurrentWritesDoNotCorruptConnection` (-023). No tests for `PageSize` bounds, `From`/`To` timezone handling, or unobserved `FailedWriteCount`. |
 | 10 | Documentation & comments | ☑ | `FailedWriteCount` XML doc claims "Health Monitoring can poll" but nothing does (-018). Severity / event-type docs enumerate values that are not enforced (-020). |

+#### Re-review 2026-06-20 (commit `4307c381`) — full review
+
+Re-reviewed the module at commit `4307c381`. All twenty-three prior findings are still
+recorded; twenty-two of their resolutions hold up under inspection — **but one does not**.
+SiteEventLogging-016 (**High**, the `From`/`To` UTC-normalisation defect) is marked
+`Resolved` and its Resolution text claims `EventLogQueryService.ExecuteQuery` now calls
+`.ToUniversalTime()` before `ToString("o")` and that a regression test
+`Query_FiltersByTimeRange_HandlesNonUtcOffset` was added — **neither is true at this
+commit**. `EventLogQueryService.cs:77,83` still stringifies `request.From`/`request.To`
+verbatim with `.ToString("o")` and no UTC normalisation, and a repo-wide search finds no
+test by that name. The claimed -016 fix was never committed; the High-severity defect is
+live. This is re-opened as SiteEventLogging-024 (cross-referencing -016) so the audit trail
+shows the resolution was asserted but never landed. Four new findings were recorded: -024
+(the never-landed -016 fix, High), -025 (synchronous severity validation can crash
+fire-and-forget callers, Medium), -026 (purge active-node gate can diverge from the query
+singleton's placement, Medium), and -027 (the time-range test masks -024, Low).
+
+| # | Category | Examined | Notes |
+|---|----------|----------|-------|
+| 1 | Correctness & logic bugs | ☑ | **-024**: `From`/`To` filters still compare non-UTC-normalised ISO 8601 strings against UTC-stored timestamps — the SiteEventLogging-016 fix was never committed (code at `EventLogQueryService.cs:77,83` is unchanged; the claimed regression test does not exist). |
+| 2 | Akka.NET conventions | ☑ | `EventLogHandlerActor` remains a simple `Receive`/`Tell` bridge; no new findings. |
+| 3 | Concurrency & thread safety | ☑ | Lock-guarded `WithConnection` pattern is correct; no new findings. |
+| 4 | Error handling & resilience | ☑ | **-025**: `LogEventAsync` validates severity/args by throwing synchronously, bypassing the documented "faults the returned Task" contract for fire-and-forget callers. |
+| 5 | Security | ☑ | Queries fully parameterised; `PageSize` now clamped (-017). No new findings. |
+| 6 | Performance & resource management | ☑ | Bounded write queue (-015) and `PageSize` clamp (-017) in place; no new findings. |
+| 7 | Design-document adherence | ☑ | **-026**: the purge active-node gate (`SelfIsPrimary` = cluster leader) can diverge from the query singleton's placement (oldest member of role), leaving the queried node's DB unpurged / over-cap. |
+| 8 | Code organization & conventions | ☑ | Concrete-recorder DI wiring correct; no new findings. |
+| 9 | Testing coverage | ☑ | **-027**: `Query_FiltersByTimeRange` asserts count only with UTC-only inputs, masking -024; no non-UTC-offset regression test exists. |
+| 10 | Documentation & comments | ☑ | No new documentation findings beyond those above. |
+
 ## Findings

 ### SiteEventLogging-001 — `PRAGMA incremental_vacuum` is a no-op; storage cap cannot reclaim space
@@ -1144,3 +1174,265 @@ Use a `CancellationTokenSource` (`while (!cts.IsCancellationRequested)`), or cha
 `stop` to a `volatile bool`, or use `Interlocked.Exchange` / `Volatile.Read`.
 `CancellationTokenSource` is the canonical .NET pattern and also lets the test
 cooperate with xUnit's `Task.WhenAll` timeout.
+
+### SiteEventLogging-024 — `From`/`To` filters still compare non-normalised ISO 8601 strings against UTC-stored timestamps (the SiteEventLogging-016 fix was never committed)
+
+| | |
+|--|--|
+| Severity | High |
+| Category | Correctness & logic bugs |
+| Status | Resolved |
+| Location | `src/ZB.MOM.WW.ScadaBridge.SiteEventLogging/EventLogQueryService.cs:77,83` |
+
+**Description**
+
+This is a re-open of SiteEventLogging-016. That finding was marked **Resolved**
+(2026-05-28) with a Resolution stating that `EventLogQueryService.ExecuteQuery` now calls
+`.ToUniversalTime()` on `request.From`/`request.To` before `ToString("o")`, that
+`EventLogPurgeService.PurgeByRetention` was made defensive with an explicit
+`.ToUniversalTime()`, and that a regression test
+`Query_FiltersByTimeRange_HandlesNonUtcOffset` was added. **None of that is present at
+commit `4307c381`.** The code at `EventLogQueryService.cs` is byte-for-byte the original
+defective form:
+
+```
+if (request.From.HasValue)
+{
+    whereClauses.Add("timestamp >= $from");
+    parameters.Add(new SqliteParameter("$from", request.From.Value.ToString("o")));   // line 77
+}
+
+if (request.To.HasValue)
+{
+    whereClauses.Add("timestamp <= $to");
+    parameters.Add(new SqliteParameter("$to", request.To.Value.ToString("o")));       // line 83
+}
+```
+
+There is no `.ToUniversalTime()` (or `.UtcDateTime`) on either bound, and a repo-wide
+search for `Query_FiltersByTimeRange_HandlesNonUtcOffset` returns zero matches — the
+claimed regression test does not exist. The -016 resolution was asserted but never landed
+in source, so the original High-severity defect is **live**.
+
+The mechanism is exactly as -016 described. Event rows are persisted with
+`timestamp = DateTimeOffset.UtcNow.ToString("o")` (`SiteEventLogger.cs:202`), which always
+emits the round-trip ISO 8601 form ending in the literal offset `+00:00`
+(e.g. `2026-06-20T12:34:56.7890123+00:00`). `request.From`/`request.To` are
+`DateTimeOffset?`, and `ToString("o")` preserves whatever offset the caller passed. A
+central client in, say, `UTC+05:00` that filters with `DateTimeOffset.Now` produces
+`"2026-06-20T17:34:56.0000000+05:00"`, which under SQLite's default `BINARY` collation
+sorts lexicographically *greater* than the equivalent UTC instant string
+`"2026-06-20T12:34:56.0000000+00:00"`. The `timestamp >= $from` / `timestamp <= $to`
+comparison is then a byte-by-byte string compare, so the query silently includes events
+from outside the window or excludes events that genuinely fall inside it. The design
+states "All timestamps are UTC throughout the system", but the central-`DateTimeOffset`
+→ SQLite boundary does not enforce it. A central UI rendered in a non-UTC timezone is the
+most likely trigger, and it corrupts precisely the "show me what happened around the
+failover" time-range query that operators most often run. The retention purge
+(`DateTimeOffset.UtcNow.AddDays(-N).ToString("o")`) is UTC by construction and remains
+safe; only the central query path is vulnerable.
+
+**Recommendation**
+
+Apply the one-line fix on each bound that -016 specified but that was never committed:
+
+```
+parameters.Add(new SqliteParameter("$from", request.From.Value.ToUniversalTime().ToString("o")));
+parameters.Add(new SqliteParameter("$to",   request.To.Value.ToUniversalTime().ToString("o")));
+```
+
+so the produced offset is always `+00:00` and the comparison is lexicographically sound
+against the UTC-stored strings. Add the regression test that -016 claimed but never
+delivered (see SiteEventLogging-027): construct a `From`/`To` carrying a non-zero offset
+(e.g. `+05:00`) and assert the matching UTC-stored events are returned and out-of-range
+ones excluded, asserting on returned row identities rather than just count. Optionally
+store `timestamp` as Unix-epoch `INTEGER` to eliminate the lexicographic-comparison hazard
+structurally. When fixing, also reconcile SiteEventLogging-016's Resolution text with
+reality (its commit reference was `<pending>` and the fix never merged).
+
+**Resolution**
+
+Resolved 2026-06-20 (commit `fd618cf1`): `EventLogQueryService` now calls `.ToUniversalTime()` on the `From`/`To` `DateTimeOffset` bounds before `.ToString("o")`, so the comparison strings are UTC and match the UTC-stored values for any non-UTC client offset. This is the fix that the closed -016 claimed but never committed. Regression test `Query_FiltersByTimeRange_HandlesNonUtcOffset` added (see SiteEventLogging-027; verified failing pre-fix).
+
+### SiteEventLogging-025 — `LogEventAsync` severity/argument validation throws synchronously, bypassing the "faults the returned Task" contract for fire-and-forget callers
+
+| | |
+|--|--|
+| Severity | Medium |
+| Category | Error handling & resilience |
+| Status | Deferred |
+| Location | `src/ZB.MOM.WW.ScadaBridge.SiteEventLogging/SiteEventLogger.cs:187-199` |
+
+**Description**
+
+`LogEventAsync` is documented as a fire-and-forget enqueue whose returned `Task`
+"completes once the event is durably persisted and faults if the write fails"
+(`SiteEventLogger.cs:24-29`), and the disposed-drop paths were deliberately reworked under
+SiteEventLogging-012 to *fault the Task* rather than throw. But the argument and severity
+validation at the top of the method throws **synchronously** on the caller's thread before
+any `Task` is returned:
+
+```
+ArgumentException.ThrowIfNullOrWhiteSpace(eventType);
+ArgumentException.ThrowIfNullOrWhiteSpace(severity);
+ArgumentException.ThrowIfNullOrWhiteSpace(source);
+ArgumentException.ThrowIfNullOrWhiteSpace(message);
+
+if (!AllowedSeverities.Contains(severity))      // SiteEventLogging-020 closed set
+{
+    throw new ArgumentException(
+        $"Severity '{severity}' is not one of the allowed values: Info, Warning, Error.",
+        nameof(severity));
+}
+```
+
+Every recording caller invokes it fire-and-forget, as `_ = _eventLogger.LogEventAsync(...)`
+(no `await`, return discarded), so a thrown `ArgumentException` does **not** flow into the
+discarded `Task` — it propagates straight up the calling actor's message-handling stack.
+For the recorder hot-path callers (the computed `AlarmActor`, the `DataConnectionActor`,
+the `ScriptActor`/`ScriptExecutionActor`) an unhandled synchronous throw would fault the
+actor and trip its supervision strategy, when the documented and intended behaviour for a
+best-effort audit write is that a bad event is dropped/faulted, never that it crashes the
+subsystem doing the logging.
+
+This is **currently latent**: every production call site passes a hard-coded canonical
+`severity` literal (`"Info"`/`"Warning"`/`"Error"`) and non-empty `eventType`/`source`/
+`message`, so the throw path is never reached today. The exposure is a future caller that
+passes a dynamic, computed, or mistyped severity (e.g. `"error"`, `"ERR"`, a value derived
+from an alarm payload) — the SiteEventLogging-020 closed-set check, added defensively, then
+becomes a crash vector for the very actor it is meant to protect. The mismatch is that
+-012 chose Task-faulting for one drop reason (disposal) while validation kept synchronous
+throwing for another (bad input), so the method's failure contract is now inconsistent.
+
+**Recommendation**
+
+Make the failure mode uniform with the rest of the method: return a faulted `Task` for
+validation failures instead of throwing synchronously — e.g.
+`return Task.FromException(new ArgumentException(...))` for the severity check (and,
+consistently, for the null/whitespace guards) — so a fire-and-forget caller is never
+crashed by a bad audit-event payload and an `await`-ing caller still observes the failure.
+Alternatively, accept a `severity` enum on `ISiteEventLogger.LogEventAsync` and convert to
+the canonical string at the boundary, eliminating the runtime-validation throw entirely.
+The exact fix shape is a judgment call — faulted-Task is the minimal-diff change that
+matches the documented contract; the enum is the stronger design but a wider
+interface/caller change. Whichever is chosen, update the XML doc to state the failure
+semantics for invalid input.
+
+**Resolution**
+
+Deferred 2026-06-20: the fix shape (return a faulted Task vs. validate against a severity enum) is a design decision, and the defect is latent (every current caller passes a canonical severity literal). Recorded; no change this pass.
+
+### SiteEventLogging-026 — Purge active-node gate (cluster leader) can diverge from the query singleton's placement (oldest member of role)
+
+| | |
+|--|--|
+| Severity | Medium |
+| Category | Design-document adherence |
+| Status | Deferred |
+| Location | `src/ZB.MOM.WW.ScadaBridge.Host/SiteServiceRegistration.cs:116-120`, `src/ZB.MOM.WW.ScadaBridge.Host/Health/AkkaClusterNodeProvider.cs:29-39`, `src/ZB.MOM.WW.ScadaBridge.Host/Actors/AkkaHostedService.cs:889-895` |
+
+**Description**
+
+SiteEventLogging-019 closed the "purge runs on every node" finding by gating
+`EventLogPurgeService` on an optional `SiteEventLogActiveNodeCheck`. The Host wires that
+check to `IClusterNodeProvider.SelfIsPrimary` (`SiteServiceRegistration.cs:116-120`), which
+returns true only when "this node is `Up` AND is the **cluster leader**"
+(`AkkaClusterNodeProvider.cs:29-39`: `cluster.State.Leader.Equals(cluster.SelfAddress)`).
+
+The query path, however, is served by the `event-log-handler` **cluster singleton**
+(`AkkaHostedService.cs:889-895`), which Akka.NET pins to the **oldest member of the role**,
+not the cluster leader. These are two different election concepts:
+
+- The Akka **cluster leader** is the lowest-address member in the `Up`/`Leaving` state set
+  for the *whole cluster* (role-agnostic) — it is what `SelfIsPrimary` tests.
+- A **cluster singleton** is hosted on the *oldest* member of its configured role
+  (`siteRole`) — it is what owns the event-log query DB.
+
+In a steady-state 2-node site cluster these usually coincide, so the gate works in
+practice. But they are not guaranteed equal — most plausibly during membership churn /
+after a failover, when the oldest-member computation and the leader election can settle on
+different nodes for a window. When they diverge, the query singleton reads node X's
+SQLite DB while the purge runs on node Y. The result is that the node whose DB is actually
+queried (the populated, active one) is left **unpurged** — its 30-day retention sweep and
+1 GB cap-purge do not run — so the queried database can drift over the documented 1 GB cap
+and retain rows past the 30-day window, violating the invariant that the storage section
+of Component-SiteEventLogging requires. The SiteEventLogging-004 re-triage note already
+established the correct co-location principle: the *query* singleton and the
+*deployment-manager* singleton share `siteRole` and so co-locate on the oldest member; the
+purge gate keyed on a *different* signal (`SelfIsPrimary`/leader) silently breaks that
+co-location for the purge.
+
+**Recommendation**
+
+Re-gate the purge on the **same** ownership signal as the query singleton so the two never
+diverge — i.e. gate on "this node hosts the `event-log-handler` singleton" (host the purge
+inside that same cluster singleton, or test oldest-member-of-`siteRole` rather than cluster
+leader). Alternatively, drop the purge gate entirely and let every node purge its own local
+DB: the SiteEventLogging-019 and -004 analysis already showed standby purge is harmless
+(the standby receives no writes), so running it on both nodes guarantees the queried DB is
+always purged at the cost of a no-op sweep on the standby. The direction is a **judgment
+call** — re-gating on singleton ownership keeps the "active node only" intent but adds
+coupling to the singleton's lifecycle; dropping the gate is simpler and provably correct
+for the cap/retention invariant but reverts the -019 "active node only" optimisation.
+Whichever is chosen, the gate (`SelfIsPrimary` = leader) and the singleton (oldest member)
+must not be left keyed on two different elections.
+
+**Resolution**
+
+Deferred 2026-06-20: the direction (re-gate purge on singleton ownership vs. drop the gate per the -019 analysis) is a design-owner decision. Divergence between the Akka leader and the oldest-member singleton is a churn-window condition. Recorded; no change this pass.
+
+### SiteEventLogging-027 — Time-range test asserts count only with UTC-only inputs, masking SiteEventLogging-024
+
+| | |
+|--|--|
+| Severity | Low |
+| Category | Testing coverage |
+| Status | Resolved |
+| Location | `tests/ZB.MOM.WW.ScadaBridge.SiteEventLogging.Tests/EventLogQueryServiceTests.cs:197-214` |
+
+**Description**
+
+`Query_FiltersByTimeRange` is the only test covering the `From`/`To` time-range filter, and
+it cannot catch SiteEventLogging-024 by construction:
+
+```
+var now = DateTimeOffset.UtcNow;                                  // UTC-only input
+InsertEventAt(now.AddHours(-2),   "script", "Info", null, "S1", "Old event");
+InsertEventAt(now.AddMinutes(-30),"script", "Info", null, "S2", "Recent event");
+InsertEventAt(now,                "script", "Info", null, "S3", "Now event");
+
+var response = _queryService.ExecuteQuery(MakeRequest(
+    from: now.AddHours(-1),
+    to:   now.AddMinutes(1)));
+
+Assert.True(response.Success);
+Assert.Equal(2, response.Entries.Count);                          // count only
+```
+
+Both shortcomings hide the live -024 defect:
+
+1. **UTC-only inputs.** `now` is `DateTimeOffset.UtcNow`, so `from`/`to` already carry a
+   `+00:00` offset. `ToString("o")` then produces a correctly-comparable string and the
+   missing `.ToUniversalTime()` is never exercised. The defect only manifests for a
+   `DateTimeOffset` with a *non-zero* offset, which this test never constructs.
+2. **Count-only assertion.** It asserts `response.Entries.Count == 2` but never checks
+   *which* rows came back. An offset-shifted window that swapped one in-range row for
+   an out-of-range row (the -024 symptom) could still yield a count of 2 and pass.
+
+The net effect is a green test suite that gives false confidence the time-range filter is
+correct, and it likely contributed to SiteEventLogging-016 being marked Resolved
+when its fix was never committed.
+
+**Recommendation**
+
+Add the non-UTC-offset regression test that SiteEventLogging-016 claimed but never
+delivered (`Query_FiltersByTimeRange_HandlesNonUtcOffset`): construct `From`/`To` as a
+`DateTimeOffset` with a non-zero offset (e.g. `+05:00`) bracketing UTC-stored events, and
+assert on the **identities** of the returned rows (e.g. by `Source`/`Message` or `Id`), not
+just `Count`, so a window shifted by the offset is detected. Strengthen the existing
+`Query_FiltersByTimeRange` likewise to assert which rows are returned. These tests should
+fail against the current code and pass once SiteEventLogging-024 is fixed.
+
+**Resolution**
+
+Resolved 2026-06-20 (commit `fd618cf1`): added `Query_FiltersByTimeRange_HandlesNonUtcOffset` — seeds known UTC instants, queries with a +05:00 offset, and asserts the returned row identities (not just count). Verified failing pre-fix, passing post-fix.