docs(code-review): full review at 4307c381 — 18 modules, 67 findings recorded + remediation tracked
Full per-module re-review of the 16 stale modules (last seen 1eb6e97 / 2026-05-28) plus first-ever reviews of KpiHistory (#26) and ScriptAnalysis (#25), at HEAD 4307c381. 67 new findings (0 Critical, 6 High, 27 Medium, 34 Low). Remediation in commit fd618cf1 closed 5 of the 6 Highs and ~33 Medium/Low; the rest are Deferred/Won't Fix with rationale. Remaining pending (4) are all InboundAPI's Database-helper findings (IA-026 High .. IA-029), left to the active feat/ipsen-movein effort per owner decision. Highlights: caught a central-only-delivery security drift (SMTP creds broadcast to sites — DM-025/SR-031), a never-committed 'Resolved' fix (SiteEventLogging-016 → -024), an unguarded KPI recorder tick (KH-001), a trust-analyzer fallback weakening (SA-001), and a native-alarm subscribe-path leak (DCL-023). ScriptAnalysis verdict: trust boundary is semantically sound (symbol-based) in the production cluster config. README regenerated; regen-readme.py --check passes (4 pending / 567 total).
@@ -5,10 +5,10 @@
| Module | `src/ZB.MOM.WW.ScadaBridge.SiteEventLogging` |
| Design doc | `docs/requirements/Component-SiteEventLogging.md` |
| Status | Reviewed |
| Last reviewed | 2026-05-28 |
| Last reviewed | 2026-06-20 |
| Reviewer | claude-agent |
| Commit reviewed | `1eb6e97` |
| Open findings | 2 |
| Commit reviewed | `4307c381` |
| Open findings | 0 |

## Summary

@@ -101,6 +101,36 @@ _Re-review (2026-05-28, `1eb6e97`):_
| 9 | Testing coverage | ☑ | Non-volatile `stop` flag in `PurgeByStorageCap_ConcurrentWritesDoNotCorruptConnection` (-023). No tests for `PageSize` bounds, `From`/`To` timezone handling, or unobserved `FailedWriteCount`. |
| 10 | Documentation & comments | ☑ | `FailedWriteCount` XML doc claims "Health Monitoring can poll" but nothing does (-018). Severity / event-type docs enumerate values that are not enforced (-020). |

#### Re-review 2026-06-20 (commit `4307c381`) — full review

Re-reviewed the module at commit `4307c381`. All twenty-three prior findings are still
recorded; twenty-two of their resolutions hold up under inspection — **but one does not**.
SiteEventLogging-016 (**High**, the `From`/`To` UTC-normalisation defect) is marked
`Resolved` and its Resolution text claims `EventLogQueryService.ExecuteQuery` now calls
`.ToUniversalTime()` before `ToString("o")` and that a regression test
`Query_FiltersByTimeRange_HandlesNonUtcOffset` was added — **neither is true at this
commit**. `EventLogQueryService.cs:77,83` still stringifies `request.From`/`request.To`
verbatim with `.ToString("o")` and no UTC normalisation, and a repo-wide search finds no
test by that name. The claimed -016 fix was never committed; the High-severity defect is
live. This is re-opened as SiteEventLogging-024 (cross-referencing -016) so the audit trail
shows the resolution was asserted but never landed. Four new findings were recorded: -024
(the never-landed -016 fix, High), -025 (synchronous severity validation faults
fire-and-forget callers, Medium), -026 (purge active-node gate diverges from the query
singleton's placement, Medium), and -027 (the time-range test masks -024, Low).

| # | Category | Examined | Notes |
|---|----------|----------|-------|
| 1 | Correctness & logic bugs | ☑ | **-024**: `From`/`To` filters still compare non-UTC-normalised ISO 8601 strings against UTC-stored timestamps — the SiteEventLogging-016 fix was never committed (code at `EventLogQueryService.cs:77,83` is unchanged; the claimed regression test does not exist). |
| 2 | Akka.NET conventions | ☑ | `EventLogHandlerActor` remains a simple `Receive`/`Tell` bridge; no new findings. |
| 3 | Concurrency & thread safety | ☑ | Lock-guarded `WithConnection` pattern is correct; no new findings. |
| 4 | Error handling & resilience | ☑ | **-025**: `LogEventAsync` validates severity/args by throwing synchronously, bypassing the documented "faults the returned Task" contract for fire-and-forget callers. |
| 5 | Security | ☑ | Queries fully parameterised; `PageSize` now clamped (-017). No new findings. |
| 6 | Performance & resource management | ☑ | Bounded write queue (-015) and `PageSize` clamp (-017) in place; no new findings. |
| 7 | Design-document adherence | ☑ | **-026**: the purge active-node gate (`SelfIsPrimary` = cluster leader) can diverge from the query singleton's placement (oldest member of role), leaving the queried node's DB unpurged / over-cap. |
| 8 | Code organization & conventions | ☑ | Concrete-recorder DI wiring correct; no new findings. |
| 9 | Testing coverage | ☑ | **-027**: `Query_FiltersByTimeRange` asserts count only with UTC-only inputs, masking -024; no non-UTC-offset regression test exists. |
| 10 | Documentation & comments | ☑ | No new documentation findings beyond those above. |

## Findings

### SiteEventLogging-001 — `PRAGMA incremental_vacuum` is a no-op; storage cap cannot reclaim space
@@ -1144,3 +1174,265 @@ Use a `CancellationTokenSource` (`while (!cts.IsCancellationRequested)`), or cha
`stop` to a `volatile bool`, or use `Interlocked.Exchange` / `Volatile.Read`.
`CancellationTokenSource` is the canonical .NET pattern and also lets the test
cooperate with xUnit's `Task.WhenAll` timeout.
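
A minimal sketch of the `CancellationTokenSource` shape described above, assuming a single background writer loop. The `writer` task and `iterations` counter are hypothetical stand-ins, not the code in `PurgeByStorageCap_ConcurrentWritesDoNotCorruptConnection`:

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical stand-in for the test's background writer loop.
using var cts = new CancellationTokenSource();
long iterations = 0;

Task writer = Task.Run(() =>
{
    // The token replaces the plain bool `stop` flag; its state is safe to read across threads.
    while (!cts.IsCancellationRequested)
    {
        Interlocked.Increment(ref iterations);
    }
});

await Task.Delay(50);   // let the loop spin briefly
cts.Cancel();           // signal the loop to exit
await writer;           // completes once the loop observes the cancellation

Console.WriteLine($"writer completed: {writer.IsCompletedSuccessfully}");
```

The same token could also be passed to awaited calls inside the loop so a test timeout can cancel them cooperatively.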

### SiteEventLogging-024 — `From`/`To` filters still compare non-normalised ISO 8601 strings against UTC-stored timestamps (the SiteEventLogging-016 fix was never committed)

| | |
|--|--|
| Severity | High |
| Category | Correctness & logic bugs |
| Status | Resolved |
| Location | `src/ZB.MOM.WW.ScadaBridge.SiteEventLogging/EventLogQueryService.cs:77,83` |

**Description**

This is a re-open of SiteEventLogging-016. That finding was marked **Resolved**
(2026-05-28) with a Resolution stating that `EventLogQueryService.ExecuteQuery` now calls
`.ToUniversalTime()` on `request.From`/`request.To` before `ToString("o")`, that
`EventLogPurgeService.PurgeByRetention` was made defensive with an explicit
`.ToUniversalTime()`, and that a regression test
`Query_FiltersByTimeRange_HandlesNonUtcOffset` was added. **None of that is present at
commit `4307c381`.** The code at `EventLogQueryService.cs` is byte-for-byte the original
defective form:

```
if (request.From.HasValue)
{
    whereClauses.Add("timestamp >= $from");
    parameters.Add(new SqliteParameter("$from", request.From.Value.ToString("o"))); // line 77
}

if (request.To.HasValue)
{
    whereClauses.Add("timestamp <= $to");
    parameters.Add(new SqliteParameter("$to", request.To.Value.ToString("o"))); // line 83
}
```

There is no `.ToUniversalTime()` (or `.UtcDateTime`) on either bound, and a repo-wide
search for `Query_FiltersByTimeRange_HandlesNonUtcOffset` returns zero matches — the
claimed regression test does not exist. The -016 resolution was asserted but never landed
in source, so the original High-severity defect is **live**.

The mechanism is exactly as -016 described. Event rows are persisted with
`timestamp = DateTimeOffset.UtcNow.ToString("o")` (`SiteEventLogger.cs:202`), which always
emits the round-trip ISO 8601 form ending in the literal offset `+00:00`
(e.g. `2026-06-20T12:34:56.7890123+00:00`). `request.From`/`request.To` are
`DateTimeOffset?`, and `ToString("o")` preserves whatever offset the caller passed. A
central client in, say, `UTC+05:00` that filters with `DateTimeOffset.Now` produces
`"2026-06-20T17:34:56.0000000+05:00"`, which under SQLite's default `BINARY` collation
sorts lexicographically *greater* than the equivalent UTC instant string
`"2026-06-20T12:34:56.0000000+00:00"`. The `timestamp >= $from` / `timestamp <= $to`
comparison is then a byte-by-byte string compare, so the query silently includes events
from the wrong hour or excludes events that genuinely fall in the window. The design
states "All timestamps are UTC throughout the system", but the central-`DateTimeOffset`
→ SQLite boundary does not enforce it. A central UI rendered in a non-UTC timezone is the
most likely trigger, and it corrupts precisely the "show me what happened around the
failover" time-range query that operators most often run. The retention purge
(`DateTimeOffset.UtcNow.AddDays(-N).ToString("o")`) is UTC by construction and remains
safe; only the central query path is vulnerable.
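
The comparison hazard can be reproduced without SQLite. The sketch below is illustrative only: it assumes that `string.CompareOrdinal` is a fair stand-in for `BINARY` collation on these ASCII-only strings, and the variable names are hypothetical rather than taken from `EventLogQueryService`.

```csharp
using System;

// A UTC-stored event, formatted the way the logger persists it.
string stored = new DateTimeOffset(2026, 6, 20, 12, 34, 56, TimeSpan.Zero).ToString("o");

// Lower bound of 11:00 UTC, supplied by a caller at +05:00 (16:00 local).
var lowerBound = new DateTimeOffset(2026, 6, 20, 16, 0, 0, TimeSpan.FromHours(5));

string rawBound = lowerBound.ToString("o");                    // keeps the +05:00 offset
string utcBound = lowerBound.ToUniversalTime().ToString("o");  // normalised to +00:00

// `timestamp >= $from` evaluated as an ordinal string comparison.
bool includedRaw = string.CompareOrdinal(stored, rawBound) >= 0;
bool includedUtc = string.CompareOrdinal(stored, utcBound) >= 0;

Console.WriteLine($"raw: {includedRaw}, normalised: {includedUtc}");
// prints "raw: False, normalised: True"
```

The event at 12:34:56 UTC is later than the 11:00 UTC bound, yet the un-normalised comparison excludes it because `"T12"` sorts before `"T16"`.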

**Recommendation**

Apply the one-line fix on each bound that -016 specified but that was never committed:

```
parameters.Add(new SqliteParameter("$from", request.From.Value.ToUniversalTime().ToString("o")));
parameters.Add(new SqliteParameter("$to", request.To.Value.ToUniversalTime().ToString("o")));
```

so the produced offset is always `+00:00` and the comparison is lexicographically sound
against the UTC-stored strings. Add the regression test that -016 claimed but never
delivered (see SiteEventLogging-027): construct a `From`/`To` carrying a non-zero offset
(e.g. `+05:00`) and assert the matching UTC-stored events are returned and out-of-range
ones excluded, asserting on returned row identities rather than just count. Optionally
store `timestamp` as Unix-epoch `INTEGER` to eliminate the lexicographic-comparison hazard
structurally. When fixing, also reconcile SiteEventLogging-016's Resolution text with
reality (its commit reference was `<pending>` and the fix never merged).
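
For the optional epoch-`INTEGER` alternative, a small sketch of why the offset stops mattering. Millisecond resolution and the variable names are assumptions made for illustration; the finding does not specify a resolution.

```csharp
using System;

var utc = new DateTimeOffset(2026, 6, 20, 12, 34, 56, TimeSpan.Zero);
var local = utc.ToOffset(TimeSpan.FromHours(5));   // same instant, rendered at +05:00

// The string forms differ even though the instant is identical.
bool sameText = utc.ToString("o") == local.ToString("o");

// The epoch values are offset-independent, so an INTEGER comparison is always sound.
long storedValue = utc.ToUnixTimeMilliseconds();
long queryBound = local.ToUnixTimeMilliseconds();

Console.WriteLine($"same text: {sameText}, same epoch: {storedValue == queryBound}");
// prints "same text: False, same epoch: True"
```

This would be a schema change rather than a one-line fix, which is presumably why the finding lists it as optional.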

**Resolution**

Resolved 2026-06-20 (commit `fd618cf1`): `EventLogQueryService` now calls `.ToUniversalTime()` on the `From`/`To` `DateTimeOffset` bounds before `.ToString("o")`, so the comparison strings are UTC and match the UTC-stored values for any non-UTC client offset. This is the fix that the closed -016 claimed but never committed. Regression test added (verified failing pre-fix).

### SiteEventLogging-025 — `LogEventAsync` severity/argument validation throws synchronously, bypassing the "faults the returned Task" contract for fire-and-forget callers

| | |
|--|--|
| Severity | Medium |
| Category | Error handling & resilience |
| Status | Deferred |
| Location | `src/ZB.MOM.WW.ScadaBridge.SiteEventLogging/SiteEventLogger.cs:187-199` |

**Description**

`LogEventAsync` is documented as a fire-and-forget enqueue whose returned `Task`
"completes once the event is durably persisted and faults if the write fails"
(`SiteEventLogger.cs:24-29`), and the disposed-drop paths were deliberately reworked under
SiteEventLogging-012 to *fault the Task* rather than throw. But the argument and severity
validation at the top of the method throws **synchronously** on the caller's thread before
any `Task` is returned:

```
ArgumentException.ThrowIfNullOrWhiteSpace(eventType);
ArgumentException.ThrowIfNullOrWhiteSpace(severity);
ArgumentException.ThrowIfNullOrWhiteSpace(source);
ArgumentException.ThrowIfNullOrWhiteSpace(message);

if (!AllowedSeverities.Contains(severity)) // SiteEventLogging-020 closed set
{
    throw new ArgumentException(
        $"Severity '{severity}' is not one of the allowed values: Info, Warning, Error.",
        nameof(severity));
}
```

Every recording caller invokes this fire-and-forget as `_ = _eventLogger.LogEventAsync(...)`
(no `await`, return discarded), so a thrown `ArgumentException` does **not** flow into the
discarded `Task` — it propagates straight up the calling actor's message-handling stack.
For the recorder hot-path callers (the computed `AlarmActor`, the `DataConnectionActor`,
the `ScriptActor`/`ScriptExecutionActor`) an unhandled synchronous throw would fault the
actor and trip its supervision strategy, when the documented and intended behaviour for a
best-effort audit write is that a bad event is dropped/faulted, never that it crashes the
subsystem doing the logging.

This is **currently latent**: every production call site passes a hard-coded canonical
`severity` literal (`"Info"`/`"Warning"`/`"Error"`) and non-empty `eventType`/`source`/
`message`, so the throw path is never reached today. The exposure is a future caller that
passes a dynamic, computed, or typo'd severity (e.g. `"error"`, `"ERR"`, a value derived
from an alarm payload) — the SiteEventLogging-020 closed-set check, added defensively, then
becomes a crash vector for the very actor it is meant to protect. The mismatch is that
-012 chose Task-faulting for one drop reason (disposal) while validation kept synchronous
throwing for another (bad input), so the method's failure contract is now inconsistent.

**Recommendation**

Make the failure mode uniform with the rest of the method: return a faulted `Task` for
validation failures instead of throwing synchronously — e.g.
`return Task.FromException(new ArgumentException(...))` for the severity check (and,
consistently, for the null/whitespace guards) — so a fire-and-forget caller is never
crashed by a bad audit-event payload and an `await`-ing caller still observes the failure.
Alternatively, accept a `severity` enum on `ISiteEventLogger.LogEventAsync` and convert to
the canonical string at the boundary, eliminating the runtime-validation throw entirely.
The exact fix shape is a judgment call — faulted-Task is the minimal-diff change that
matches the documented contract; the enum is the stronger design but a wider
interface/caller change. Whichever is chosen, update the XML doc to state the failure
semantics for invalid input.
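
To make the difference between the two failure modes concrete, here is a hedged sketch. `LogThrowing` and `LogFaulting` are hypothetical stand-ins for the current and the recommended validation shape, and the `allowed` set assumes a case-sensitive comparison; none of this is the actual `SiteEventLogger` code.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

var allowed = new HashSet<string>(StringComparer.Ordinal) { "Info", "Warning", "Error" };

// Current shape (hypothetical stand-in): validation throws before any Task exists.
Task LogThrowing(string severity)
{
    if (!allowed.Contains(severity))
        throw new ArgumentException($"Severity '{severity}' is not allowed.", nameof(severity));
    return Task.CompletedTask;
}

// Recommended shape (hypothetical stand-in): the same check surfaces through the Task.
Task LogFaulting(string severity)
{
    if (!allowed.Contains(severity))
        return Task.FromException(
            new ArgumentException($"Severity '{severity}' is not allowed.", nameof(severity)));
    return Task.CompletedTask;
}

// Fire-and-forget call: the synchronous throw reaches the caller's stack.
bool callerSawException = false;
try { _ = LogThrowing("error"); }
catch (ArgumentException) { callerSawException = true; }

// Fire-and-forget call: nothing is thrown; the failure is carried by the Task.
Task discarded = LogFaulting("error");
bool taskFaulted = discarded.IsFaulted;
_ = discarded.Exception; // observe the exception so it is not reported as unobserved

Console.WriteLine($"caller saw exception: {callerSawException}, task faulted: {taskFaulted}");
```

An `await`-ing caller of `LogFaulting` would still see the `ArgumentException` rethrown at the `await`, so the failure stays observable for callers that want it.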

**Resolution**

Deferred 2026-06-20: the fix shape (return a faulted Task vs. validate against a severity enum) is a design decision, and the defect is latent (every current caller passes a canonical severity literal). Recorded; no change this pass.

### SiteEventLogging-026 — Purge active-node gate (cluster leader) can diverge from the query singleton's placement (oldest member of role)

| | |
|--|--|
| Severity | Medium |
| Category | Design-document adherence |
| Status | Deferred |
| Location | `src/ZB.MOM.WW.ScadaBridge.Host/SiteServiceRegistration.cs:116-120`, `src/ZB.MOM.WW.ScadaBridge.Host/Health/AkkaClusterNodeProvider.cs:29-39`, `src/ZB.MOM.WW.ScadaBridge.Host/Actors/AkkaHostedService.cs:889-895` |

**Description**

SiteEventLogging-019 closed the "purge runs on every node" finding by gating
`EventLogPurgeService` on an optional `SiteEventLogActiveNodeCheck`. The Host wires that
check to `IClusterNodeProvider.SelfIsPrimary` (`SiteServiceRegistration.cs:116-120`), which
returns true only when "this node is `Up` AND is the **cluster leader**"
(`AkkaClusterNodeProvider.cs:29-39`: `cluster.State.Leader.Equals(cluster.SelfAddress)`).

The query path, however, is served by the `event-log-handler` **cluster singleton**
(`AkkaHostedService.cs:889-895`), which Akka.NET pins to the **oldest member of the role**,
not the cluster leader. These are two different election concepts:

- The Akka **cluster leader** is the lowest-address member in the `Up`/`Leaving` state set
  for the *whole cluster* (role-agnostic) — it is what `SelfIsPrimary` tests.
- A **cluster singleton** is hosted on the *oldest* member of its configured role
  (`siteRole`) — it is what owns the event-log query DB.

In a steady-state 2-node site cluster these usually coincide, so the gate works in
practice. But they are not guaranteed equal — most plausibly during membership churn /
after a failover, when the oldest-member computation and the leader election can settle on
different nodes for a window. When they diverge, the query singleton reads node X's
SQLite DB while the purge runs on node Y. The result is that the node whose DB is actually
queried (the populated, active one) is left **unpurged** — its 30-day retention sweep and
1 GB cap-purge do not run — so the queried database can drift over the documented 1 GB cap
and retain rows past the 30-day window, which is exactly the invariant the storage section
of Component-SiteEventLogging requires. The SiteEventLogging-004 re-triage note already
established the correct co-location principle: the *query* singleton and the
*deployment-manager* singleton share `siteRole` and so co-locate on the oldest member; the
purge gate keyed on a *different* signal (`SelfIsPrimary`/leader) silently breaks that
co-location for the purge.
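
The two elections can be contrasted with a deliberately simplified model. This is not Akka.NET code: the `Address` and `UpNumber` fields are hypothetical, the leader is modelled as the lowest address by plain string ordering, and the oldest member as the one with the lowest join counter.

```csharp
using System;
using System.Linq;

// Hypothetical two-node snapshot after node-a restarted and rejoined later than node-b.
var members = new[]
{
    (Address: "akka.tcp://site@node-a:4053", UpNumber: 2),
    (Address: "akka.tcp://site@node-b:4053", UpNumber: 1),
};

// Simplified leader rule: lowest address among the members.
string leader = members.OrderBy(m => m.Address, StringComparer.Ordinal).First().Address;

// Simplified singleton rule: the member that has been up the longest.
string oldest = members.OrderBy(m => m.UpNumber).First().Address;

Console.WriteLine($"purge gate passes on: {leader}");
Console.WriteLine($"query singleton runs on: {oldest}");
Console.WriteLine($"diverged: {leader != oldest}");
```

In this snapshot the purge would run on node-a while queries are served from node-b's database, which is the divergence described above.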

**Recommendation**

Re-gate the purge on the **same** ownership signal as the query singleton so the two never
diverge — i.e. gate on "this node hosts the `event-log-handler` singleton" (host the purge
inside that same cluster singleton, or test oldest-member-of-`siteRole` rather than cluster
leader). Alternatively, drop the purge gate entirely and let every node purge its own local
DB: the SiteEventLogging-019 and -004 analysis already showed standby purge is harmless
(the standby receives no writes), so running it on both nodes guarantees the queried DB is
always purged at the cost of a no-op sweep on the standby. The direction is a **judgment
call** — re-gating on singleton ownership keeps the "active node only" intent but adds
coupling to the singleton's lifecycle; dropping the gate is simpler and provably correct
for the cap/retention invariant but reverts the -019 "active node only" optimisation.
Whichever is chosen, the gate (`SelfIsPrimary` = leader) and the singleton (oldest member)
must not be left keyed on two different elections.

**Resolution**

Deferred 2026-06-20: the direction (re-gate purge on singleton ownership vs. drop the gate per the -019 analysis) is a design-owner decision. Divergence between the Akka leader and the oldest-member singleton is a churn-window condition. Recorded; no change this pass.

### SiteEventLogging-027 — Time-range test asserts count only with UTC-only inputs, masking SiteEventLogging-024

| | |
|--|--|
| Severity | Low |
| Category | Testing coverage |
| Status | Resolved |
| Location | `tests/ZB.MOM.WW.ScadaBridge.SiteEventLogging.Tests/EventLogQueryServiceTests.cs:197-214` |

**Description**

`Query_FiltersByTimeRange` is the only test covering the `From`/`To` time-range filter, and
it cannot catch SiteEventLogging-024 by construction:

```
var now = DateTimeOffset.UtcNow; // UTC-only input
InsertEventAt(now.AddHours(-2), "script", "Info", null, "S1", "Old event");
InsertEventAt(now.AddMinutes(-30), "script", "Info", null, "S2", "Recent event");
InsertEventAt(now, "script", "Info", null, "S3", "Now event");

var response = _queryService.ExecuteQuery(MakeRequest(
    from: now.AddHours(-1),
    to: now.AddMinutes(1)));

Assert.True(response.Success);
Assert.Equal(2, response.Entries.Count); // count only
```

Both shortcomings hide the live -024 defect:

1. **UTC-only inputs.** `now` is `DateTimeOffset.UtcNow`, so `from`/`to` already carry a
   `+00:00` offset. `ToString("o")` then produces a correctly-comparable string and the
   missing `.ToUniversalTime()` is never exercised. The defect only manifests for a
   `DateTimeOffset` with a *non-zero* offset, which this test never constructs.
2. **Count-only assertion.** It asserts `response.Entries.Count == 2` but never checks
   *which* rows came back. A boundary off-by-an-hour bug that swapped one in-range row for
   an out-of-range row (the -024 symptom) could still yield a count of 2 and pass.

The net effect is a green test suite that gives false confidence the time-range filter is
correct — and almost certainly contributed to SiteEventLogging-016 being marked Resolved
when its fix was never committed.
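
The count-only weakness can be shown with an in-memory model. The rows, the `Filter` helper, and the chosen window are all hypothetical and do not use the repository's test fixtures; the ordinal comparison stands in for the SQLite string compare.

```csharp
using System;
using System.Linq;

var day = new DateTimeOffset(2026, 6, 20, 0, 0, 0, TimeSpan.Zero);

// Hypothetical UTC-stored rows, formatted the way the logger persists them.
var rows = new[]
{
    (Source: "S1", Timestamp: day.AddHours(12).ToString("o")),
    (Source: "S2", Timestamp: day.AddHours(14).ToString("o")),
    (Source: "S3", Timestamp: day.AddHours(17).ToString("o")),
    (Source: "S4", Timestamp: day.AddHours(19).ToString("o")),
};

// Mimics `timestamp >= $from AND timestamp <= $to` as an ordinal string comparison.
string[] Filter(string lower, string upper) => rows
    .Where(r => string.CompareOrdinal(r.Timestamp, lower) >= 0
             && string.CompareOrdinal(r.Timestamp, upper) <= 0)
    .Select(r => r.Source)
    .ToArray();

// Window 11:00 to 15:00 UTC, expressed by a caller at +05:00.
var offset = TimeSpan.FromHours(5);
var windowFrom = new DateTimeOffset(2026, 6, 20, 16, 0, 0, offset);
var windowTo = new DateTimeOffset(2026, 6, 20, 20, 0, 0, offset);

string[] unnormalised = Filter(windowFrom.ToString("o"), windowTo.ToString("o"));
string[] normalised = Filter(windowFrom.ToUniversalTime().ToString("o"),
                             windowTo.ToUniversalTime().ToString("o"));

Console.WriteLine(string.Join(",", unnormalised)); // prints "S3,S4"
Console.WriteLine(string.Join(",", normalised));   // prints "S1,S2"
```

Both results contain two rows, so a `Count == 2` assertion passes in either case; only an assertion on `Source` values distinguishes the correct window from the shifted one.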

**Recommendation**

Add the non-UTC-offset regression test that SiteEventLogging-016 claimed but never
delivered (`Query_FiltersByTimeRange_HandlesNonUtcOffset`): construct `From`/`To` as a
`DateTimeOffset` with a non-zero offset (e.g. `+05:00`) bracketing UTC-stored events, and
assert on the **identities** of the returned rows (e.g. by `Source`/`Message` or `Id`), not
just `Count`, so a window shifted by the offset is detected. Strengthen the existing
`Query_FiltersByTimeRange` likewise to assert which rows are returned. These tests should
fail against the current code and pass once SiteEventLogging-024 is fixed.

**Resolution**

Resolved 2026-06-20 (commit `fd618cf1`): added `Query_FiltersByTimeRange_HandlesNonUtcOffset` — seeds known UTC instants, queries with a +05:00 offset, and asserts the returned row identities (not just count). Verified failing pre-fix, passing post-fix.
